# kazakh-gec-50m
A Kazakh grammatical error correction (GEC) model fine-tuned from kazakh-llama-50m-v2 on synthetic GEC data (~390K training examples + 20% identity examples).
The model corrects morphological errors (vowel harmony, suffixes), word order issues, and other common grammatical mistakes in Kazakh text.
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "stukenov/kazakh-gec-50m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()

text = "Ол кітапті оқыды"
prompt = f"<TASK_FIX><SRC>{text}<SEP>"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

# Decode only the newly generated tokens (everything after the prompt)
result = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(f"Input: {text}")
print(f"Output: {result}")
# Input: Ол кітапті оқыды
# Output: Ол кітапты оқыды
```
## Format

The model uses a decoder-only seq2seq-style format with special tokens:

```
<TASK_FIX><SRC>{noisy text}<SEP>{corrected text}<EOS>
```

During training, the loss is computed only on tokens after `<SEP>`.
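The masking rule above can be sketched as follows; the token IDs are made up purely for illustration, and the actual preprocessing code is not published here:

```python
def mask_labels(input_ids, sep_token_id):
    """Copy input_ids into labels, replacing everything up to and
    including <SEP> with -100 so the loss ignores the prompt part."""
    sep_pos = input_ids.index(sep_token_id)
    return [-100] * (sep_pos + 1) + input_ids[sep_pos + 1:]

# Hypothetical IDs for <TASK_FIX> <SRC> t1 t2 <SEP> c1 c2 <EOS>
ids = [3, 4, 10, 11, 5, 20, 21, 2]
print(mask_labels(ids, sep_token_id=5))
# [-100, -100, -100, -100, -100, 20, 21, 2]
```

The `-100` value is the ignore index of PyTorch's cross-entropy loss, so only the corrected-text tokens contribute to the gradient.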
## Evaluation

Evaluated on the test split (200 examples) of `kazakh-synthetic-gec-datasets`:
| Metric | Value |
|---|---|
| Exact Match | 62.0% |
| Character Error Rate (CER) | 0.0802 |
| Word Precision | 0.494 |
| Word Recall | 0.661 |
| Word F0.5 | 0.520 |
| Identity Preservation | 100% (26/26) |
Strengths:
- Excellent identity preservation — never corrupts already correct text
- Handles morphological errors well (vowel harmony, suffix agreement)
- Good at word order corrections
Limitations:
- Struggles with complex multi-word rearrangements
- May hallucinate alternative words instead of making minimal corrections on rare vocabulary
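For reference, the Character Error Rate reported above can be computed with a minimal sketch like this (plain character-level edit distance normalized by reference length; the actual evaluation script is an assumption, not published here):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(hyp: str, ref: str) -> float:
    """Character Error Rate: edit distance over reference length."""
    return levenshtein(hyp, ref) / max(len(ref), 1)

# One character substitution (ті -> ты) in a 16-character reference
print(cer("Ол кітапті оқыды", "Ол кітапты оқыды"))  # 0.0625
```

Exact Match is then just the fraction of hypotheses that equal their reference string exactly.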
## Examples
| Input | Output | Fix |
|---|---|---|
| Ол кітапті оқыды | Ол кітапты оқыды | Vowel harmony (ті→ты) |
| Ол кеше базарга барды | Ол кеше базарға барды | Vowel harmony (га→ға) |
| Ол маған жазыды хат | Ол маған хат жазды | Word order + morphology |
| Мен сенің кітабыңды алдым | Мен сенің кітабыңды алдым | No change (correct input) |
## Architecture
| Parameter | Value |
|---|---|
| Base model | kazakh-llama-50m-v2 (Llama) |
| Parameters | ~50M |
| Hidden size | 576 |
| Layers | 8 |
| Attention heads | 8 |
| Vocab size | 50,263 (+3 special tokens) |
| Max sequence length | 512 |
## Training
- Dataset: kazakh-synthetic-gec-datasets — 10 subdirectories, ~390K train / 16K val / 16K test examples
- Identity examples: 20% of training data (input == target) to prevent over-correction
- Epochs: 1
- Batch size: 8 per GPU × 4 GPUs × 4 gradient accumulation = effective 128
- Learning rate: 2e-5, cosine schedule, 5% warmup
- Hardware: 4× RTX 4090 (vast.ai)
- Training time: ~55 minutes
- Final eval loss: 0.377
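The 20% identity mix above can be sketched as follows; this is a hypothetical helper under the assumption that identity pairs are sampled from the clean targets, not the actual data-preparation script:

```python
import random

def mix_identity_examples(pairs, ratio=0.2, seed=0):
    """Append (clean, clean) pairs so that roughly `ratio` of the
    resulting dataset maps an already-correct input to itself."""
    rng = random.Random(seed)
    # Solve n / (len(pairs) + n) = ratio for the number of identity pairs n
    n_identity = int(len(pairs) * ratio / (1 - ratio))
    targets = [tgt for _, tgt in pairs]
    identity = [(t, t) for t in rng.sample(targets, n_identity)]
    return pairs + identity

data = [("кітапті", "кітапты")] * 8
mixed = mix_identity_examples(data)
print(sum(src == tgt for src, tgt in mixed), "of", len(mixed))  # 2 of 10
```

Training on such self-mapping pairs is what teaches the model to leave already-correct input untouched instead of over-correcting it.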
## Special Tokens

| Token | Purpose |
|---|---|
| `<TASK_FIX>` | Task prefix |
| `<SRC>` | Source text delimiter |
| `<SEP>` | Separator between input and target |
## License
Apache 2.0
## Benchmark Results

Evaluated on a 100-example custom GEC test set (pure model inference, no pre- or post-processing pipeline).
| Category | Score |
|---|---|
| Spelling (orthography) | 0/30 (0%) |
| Grammar | 0/20 (0%) |
| Punctuation | 0/15 (0%) |
| Mixed | 0/20 (0%) |
| Identity preservation | 0/15 (0%) |
| Total | 0/100 (0%) |
## Leaderboard (100-example custom benchmark)
| Model | Total | Spelling/30 | Grammar/20 | Punct./15 | Mixed/20 | Ident./15 |
|---|---|---|---|---|---|---|
| sozkz-core-llama-600m-kk-gec-v1 | 47% | 15 | 12 | 3 | 2 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v3 | 38% | 0 | 16 | 9 | 0 | 13/15 |
| sozkz-core-llama-300m-kk-gec-v4 | 37% | 9 | 6 | 4 | 3 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v1 | 35% | 0 | 12 | 8 | 0 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v2 | 30% | 0 | 11 | 7 | 0 | 12/15 |
| sozkz-core-llama-1b-kk-gec-v1 | 16% | 2 | 6 | 1 | 0 | 7/15 |
| sozkz-fix-qwen-500m-kk-gec-v4 | 5% | 0 | 1 | 4 | 0 | 0/15 |
| sozkz-fix-mt5b-kk-gec-run13-v1 | 5% | 0 | 2 | 0 | 0 | 3/15 |
| sozkz-nllb-1b-kk-gec-v1 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-nllb-1b-kk-pretrain-v1 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v3 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v1/v2a/v2b | 0–1% | 0 | 0 | 0 | 0 | 0–1 |
| sozkz-fix-mt5-50m-kk-gec-v1 | 0% | 0 | 0 | 0 | 0 | 0/15 |