---
language:
- kk
license: mit
tags:
- llama
- kazakh
- causal-lm
- pretrained
datasets:
- stukenov/sozkz-corpus-tokenized-kk-llama50k-v3
pipeline_tag: text-generation
model-index:
- name: sozkz-core-llama-600m-kk-base-v1
  results:
  - task:
      type: text-generation
      name: Language Modeling
    metrics:
    - name: Validation BPB
      type: bpb
      value: 0.756
    - name: Training Loss
      type: loss
      value: 2.713
arxiv: 2603.20854
---
# SozKZ Core Llama 600M — Kazakh Base v1

A 587M-parameter Llama model pretrained from scratch on 9 billion Kazakh tokens. Part of the SozKZ family of Kazakh language models.
## Model Family
| Model | Params | val_bpb | Train Loss | Status |
|---|---|---|---|---|
| sozkz-core-llama-150m-kk-base-v1 | 152M | — | — | Released |
| sozkz-core-llama-300m-kk-base-v1 | 325M | 0.781 | 2.848 | Released |
| sozkz-core-llama-600m-kk-base-v1 | 587M | 0.756 | 2.713 | This model |
## Model Details
| Parameter | Value |
|---|---|
| Architecture | Llama (RMSNorm, RoPE, SwiGLU) |
| Parameters | 587M |
| Hidden size | 1280 |
| Layers | 22 |
| Attention heads | 20 |
| KV heads | 20 (MHA) |
| Intermediate size | 4480 |
| Context length | 1024 |
| Vocab size | 50,257 (GPT-2 BPE, Kazakh) |
| Precision | bfloat16 |
| Tied embeddings | Yes |
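The 587M figure can be reproduced from the shapes in the table. The sketch below is an estimate assuming standard Llama layout (no biases, two RMSNorms per layer, embeddings tied with the LM head), not an exact inventory of the checkpoint:

```python
# Estimate parameter count from the config values in the table above.
hidden, layers, inter, vocab = 1280, 22, 4480, 50257

embed = vocab * hidden        # token embeddings (tied with lm_head, counted once)
attn = 4 * hidden * hidden    # q, k, v, o projections (MHA, no biases)
mlp = 3 * hidden * inter      # SwiGLU: gate, up, down projections
norms = 2 * hidden            # two RMSNorms per layer
per_layer = attn + mlp + norms

total = embed + layers * per_layer + hidden  # + final RMSNorm
print(f"{total / 1e6:.0f}M")  # → 587M
```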
## Training
| Detail | Value |
|---|---|
| Dataset | sozkz-corpus-tokenized-kk-llama50k-v3 |
| Tokens | 9B |
| Hardware | 4x NVIDIA H100 80GB HBM3 |
| Training time | 5.9 hours |
| Throughput | 423K tok/s |
| Optimizer | AdamW (lr=4e-4, betas=0.9/0.95, wd=0.1) |
| Schedule | Cosine with 500-step warmup, min LR = 0.1× peak |
| Batch size | 32 per GPU x 4 GPUs = 128 |
| Gradient clipping | 1.0 |
| Framework | PyTorch 2.4 + torch.compile + DDP |
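The figures above are mutually consistent, which can be checked with a few lines of arithmetic. The sketch assumes one micro-batch per optimizer step (no gradient accumulation), which the card does not state explicitly:

```python
# Sanity-check the training table: tokens per step, total steps,
# and wall-clock time implied by the reported throughput.
tokens_total = 9e9        # total training tokens
batch = 32 * 4            # per-GPU batch x 4 GPUs
ctx = 1024                # context length
throughput = 423_000      # tokens/second

tokens_per_step = batch * ctx            # 131,072 tokens per optimizer step
steps = tokens_total / tokens_per_step   # ~68.7K steps
hours = tokens_total / throughput / 3600
print(f"{tokens_per_step} tok/step, {steps:,.0f} steps, {hours:.1f} h")  # ≈ 5.9 h
```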
## Results
| Metric | Value |
|---|---|
| Validation BPB | 0.756 |
| Training loss | 2.713 |
| Peak VRAM | 64.0 GB/GPU |
| Tokens-to-params ratio | 15.3:1 |
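For context, cross-entropy loss in nats converts to bits per token by dividing by ln 2, and dividing bits per token by BPB gives the average bytes per token of the evaluation text. The sketch below mixes the training loss with the validation BPB, so the implied bytes-per-token is only a rough estimate under the assumption that validation loss is close to training loss:

```python
import math

train_loss_nats = 2.713   # from the table above
val_bpb = 0.756           # from the table above

bits_per_token = train_loss_nats / math.log(2)   # nats -> bits
# If validation loss ~ training loss, the tokenizer packs roughly:
bytes_per_token = bits_per_token / val_bpb
print(f"{bits_per_token:.2f} bits/token, ~{bytes_per_token:.1f} bytes/token")
# → 3.91 bits/token, ~5.2 bytes/token
```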
## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stukenov/sozkz-core-llama-600m-kk-base-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Қазақстан — "
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.8,
        top_p=0.9,
        repetition_penalty=1.1,
        do_sample=True,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Tokenizer

Uses sozkz-core-gpt2-50k-kk-base-v1, a 50K-vocab ByteLevel BPE tokenizer trained on Kazakh text.
## Limitations

- This is a base model (not instruction-tuned): it completes text rather than answering questions or following instructions
- Training data is web-scraped Kazakh text (educational sites, Wikipedia, news)
- Context length is 1024 tokens
- May generate repetitive or factually incorrect text
## Citation

```bibtex
@misc{sozkz-llama-600m-kk-2026,
  title={SozKZ Core Llama 600M: Kazakh Language Model},
  author={Tukenov, Saken},
  year={2026},
  url={https://huggingface.co/stukenov/sozkz-core-llama-600m-kk-base-v1}
}
```
## License

MIT (gated access; manual approval required)
## Paper

*Small Language Models for Kazakh: Training, Evaluation, and Scaling* (arXiv:2603.20854)

```bibtex
@article{tukenov2026slm,
  title={Small Language Models for Kazakh: Training, Evaluation, and Scaling},
  author={Tukenov, Saken},
  journal={arXiv preprint arXiv:2603.20854},
  year={2026}
}
```