---
language: kk
license: apache-2.0
tags:
- kazakh
- grammatical-error-correction
- gec
- causal-lm
- llama
- fine-tuned
base_model: stukenov/kazakh-llama-50m-v2
datasets:
- stukenov/kazakh-synthetic-gec-datasets
pipeline_tag: text-generation
---

# kazakh-gec-50m

A Kazakh grammatical error correction (GEC) model fine-tuned from [kazakh-llama-50m-v2](https://huggingface.co/stukenov/kazakh-llama-50m-v2) on synthetic GEC data (~390K training examples, 20% of which are identity examples). The model corrects morphological errors (vowel harmony, suffixes), word-order issues, and other common grammatical mistakes in Kazakh text.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "stukenov/kazakh-gec-50m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, dtype=torch.bfloat16).eval()

text = "Ол кітапті оқыды"
prompt = f"{text}"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

result = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(f"Input: {text}")
print(f"Output: {result}")
# Input: Ол кітапті оқыды
# Output: Ол кітапты оқыды
```

## Format

The model uses a decoder-only seq2seq format with special tokens:

```
{noisy text}{corrected text}
```

During training, loss is computed only on tokens after ``.
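The loss masking described above can be sketched as follows. This is a minimal illustration, not the model's training code: the real special-token strings are defined in the model's tokenizer config (they do not render in this card), and the token ids here are arbitrary placeholders.

```python
# Sketch of the decoder-only GEC training format: source and target ids are
# concatenated into one sequence, and the source positions are masked with
# -100 so cross-entropy loss covers only the corrected-text tokens.
def build_example(noisy_ids, corrected_ids, eos_id):
    input_ids = noisy_ids + corrected_ids + [eos_id]
    labels = [-100] * len(noisy_ids) + corrected_ids + [eos_id]
    return input_ids, labels

# Placeholder ids for "{noisy text}" and "{corrected text}":
inp, lab = build_example([5, 6, 7], [8, 9], eos_id=2)
# inp == [5, 6, 7, 8, 9, 2]
# lab == [-100, -100, -100, 8, 9, 2]
```

At inference time the model simply continues generating after the separator, which is why the usage example above decodes only the tokens produced past the prompt length.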
## Evaluation

Evaluated on the test split (200 examples) of [kazakh-synthetic-gec-datasets](https://huggingface.co/datasets/stukenov/kazakh-synthetic-gec-datasets):

| Metric | Value |
|--------|-------|
| Exact Match | 62.0% |
| Character Error Rate (CER) | 0.0802 |
| Word Precision | 0.494 |
| Word Recall | 0.661 |
| Word F0.5 | 0.520 |
| Identity Preservation | 100% (26/26) |

**Strengths:**

- Excellent identity preservation: never corrupts already-correct text
- Handles morphological errors well (vowel harmony, suffix agreement)
- Good at word-order corrections

**Limitations:**

- Struggles with complex multi-word rearrangements
- May hallucinate alternative words instead of making minimal corrections on rare vocabulary

## Examples

| Input | Output | Fix |
|-------|--------|-----|
| Ол кітапті оқыды | Ол кітапты оқыды | Vowel harmony (ті→ты) |
| Ол кеше базарга барды | Ол кеше базарға барды | Vowel harmony (га→ға) |
| Ол маған жазыды хат | Ол маған хат жазды | Word order + morphology |
| Мен сенің кітабыңды алдым | Мен сенің кітабыңды алдым | No change (correct input) |

## Architecture

| Parameter | Value |
|-----------|-------|
| Base model | kazakh-llama-50m-v2 (Llama) |
| Parameters | ~50M |
| Hidden size | 576 |
| Layers | 8 |
| Attention heads | 8 |
| Vocab size | 50,263 (+3 special tokens) |
| Max sequence length | 512 |

## Training

- **Dataset**: [kazakh-synthetic-gec-datasets](https://huggingface.co/datasets/stukenov/kazakh-synthetic-gec-datasets) (10 subdirectories; ~390K train / 16K val / 16K test examples)
- **Identity examples**: 20% of training data (input == target), to prevent over-correction
- **Epochs**: 1
- **Batch size**: 8 per GPU × 4 GPUs × 4 gradient-accumulation steps = effective batch size 128
- **Learning rate**: 2e-5, cosine schedule, 5% warmup
- **Hardware**: 4× RTX 4090 (vast.ai)
- **Training time**: ~55 minutes
- **Final eval loss**: 0.377

## Special Tokens

| Token | Purpose |
|-------|---------|
| `` | Task prefix |
| `` | Source text delimiter |
| `` | Separator between input and target |

## License

Apache 2.0

## Benchmark Results

Evaluated on a **100-example custom GEC test** (pure model inference, no pre-/post-processing pipeline).

| Category | Score |
|----------|-------|
| Spelling (orthography) | 0/30 (0%) |
| Grammar | 0/20 (0%) |
| Punctuation | 0/15 (0%) |
| Mixed | 0/20 (0%) |
| Identity preservation | 0/15 (0%) |
| **Total** | **0/100 (0%)** |

## Leaderboard (100-example custom benchmark)

| Model | Total | Spelling/30 | Grammar/20 | Punct/15 | Mixed/20 | Ident/15 |
|-------|-------|-------------|------------|----------|----------|----------|
| **[sozkz-core-llama-600m-kk-gec-v1](https://huggingface.co/stukenov/sozkz-core-llama-600m-kk-gec-v1)** | **47%** | 15 | 12 | 3 | 2 | 15/15 |
| [sozkz-fix-qwen-500m-kk-gec-v3](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v3) | 38% | 0 | 16 | 9 | 0 | 13/15 |
| [sozkz-core-llama-300m-kk-gec-v4](https://huggingface.co/stukenov/sozkz-core-llama-300m-kk-gec-v4) | 37% | 9 | 6 | 4 | 3 | 15/15 |
| [sozkz-fix-qwen-500m-kk-gec-v1](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v1) | 35% | 0 | 12 | 8 | 0 | 15/15 |
| [sozkz-fix-qwen-500m-kk-gec-v2](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v2) | 30% | 0 | 11 | 7 | 0 | 12/15 |
| [sozkz-core-llama-1b-kk-gec-v1](https://huggingface.co/stukenov/sozkz-core-llama-1b-kk-gec-v1) | 16% | 2 | 6 | 1 | 0 | 7/15 |
| [sozkz-fix-qwen-500m-kk-gec-v4](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v4) | 5% | 0 | 1 | 4 | 0 | 0/15 |
| [sozkz-fix-mt5b-kk-gec-run13-v1](https://huggingface.co/stukenov/sozkz-fix-mt5b-kk-gec-run13-v1) | 5% | 0 | 2 | 0 | 0 | 3/15 |
| [sozkz-nllb-1b-kk-gec-v1](https://huggingface.co/stukenov/sozkz-nllb-1b-kk-gec-v1) | 1% | 0 | 1 | 0 | 0 | 0/15 |
| [sozkz-nllb-1b-kk-pretrain-v1](https://huggingface.co/stukenov/sozkz-nllb-1b-kk-pretrain-v1) | 1% | 0 | 1 | 0 | 0 | 0/15 |
| [sozkz-core-llama-300m-kk-gec-v3](https://huggingface.co/stukenov/sozkz-core-llama-300m-kk-gec-v3) | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v1/v2a/v2b | 0–1% | 0 | 0 | 0 | 0 | 0–1 |
| sozkz-fix-mt5-50m-kk-gec-v1 | 0% | 0 | 0 | 0 | 0 | 0/15 |
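For reference, the Word F0.5 figure in the Evaluation section follows from the reported word precision and recall: F0.5 weights precision twice as heavily as recall, which suits GEC since a wrong "correction" is worse than a missed one. A minimal sketch of the standard F-beta formula:

```python
# Standard F-beta score: beta < 1 shifts the weight toward precision.
def f_beta(precision, recall, beta=0.5):
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Plugging in the Word Precision / Word Recall values from the table above
# reproduces the reported Word F0.5 of ~0.520:
score = f_beta(0.494, 0.661)
```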