---
language: kk
license: apache-2.0
tags:
- kazakh
- grammatical-error-correction
- gec
- causal-lm
- llama
- fine-tuned
base_model: stukenov/kazakh-llama-50m-v2
datasets:
- stukenov/kazakh-synthetic-gec-datasets
pipeline_tag: text-generation
---

# kazakh-gec-50m

A Kazakh grammatical error correction (GEC) model fine-tuned from [kazakh-llama-50m-v2](https://huggingface.co/stukenov/kazakh-llama-50m-v2) on synthetic GEC data (~390K training examples, 20% of which are identity examples). The model corrects morphological errors (vowel harmony, suffixes), word-order issues, and other common grammatical mistakes in Kazakh text.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "stukenov/kazakh-gec-50m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, dtype=torch.bfloat16).eval()

text = "Ол кітапті оқыды"
prompt = f"{text}"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

result = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(f"Input: {text}")
print(f"Output: {result}")
# Input: Ол кітапті оқыды
# Output: Ол кітапты оқыды
```

## Format

The model uses a decoder-only seq2seq format with special tokens:

```
{noisy text}{corrected text}
```

During training, loss is computed only on tokens after ``.
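The loss masking described above can be sketched as follows. This is a minimal illustration, not the model's training code: the real special-token strings are defined in the model's tokenizer config (they do not render in this card), and the token ids here are arbitrary placeholders.

```python
# Sketch of the decoder-only GEC training format: source and target ids are
# concatenated into one sequence, and the source positions are masked with
# -100 so cross-entropy loss covers only the corrected-text tokens.
def build_example(noisy_ids, corrected_ids, eos_id):
    input_ids = noisy_ids + corrected_ids + [eos_id]
    labels = [-100] * len(noisy_ids) + corrected_ids + [eos_id]
    return input_ids, labels

# Placeholder ids for "{noisy text}" and "{corrected text}":
inp, lab = build_example([5, 6, 7], [8, 9], eos_id=2)
# inp == [5, 6, 7, 8, 9, 2]
# lab == [-100, -100, -100, 8, 9, 2]
```

At inference time the model simply continues generating after the separator, which is why the usage example above decodes only the tokens produced past the prompt length.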
## Evaluation

Evaluated on the test split (200 examples) of [kazakh-synthetic-gec-datasets](https://huggingface.co/datasets/stukenov/kazakh-synthetic-gec-datasets):

| Metric | Value |
|--------|-------|
| Exact Match | 62.0% |
| Character Error Rate (CER) | 0.0802 |
| Word Precision | 0.494 |
| Word Recall | 0.661 |
| Word F0.5 | 0.520 |
| Identity Preservation | 100% (26/26) |

**Strengths:**

- Excellent identity preservation: never corrupts already-correct text
- Handles morphological errors well (vowel harmony, suffix agreement)
- Good at word-order corrections

**Limitations:**

- Struggles with complex multi-word rearrangements
- May hallucinate alternative words instead of making minimal corrections on rare vocabulary

## Examples

| Input | Output | Fix |
|-------|--------|-----|
| Ол кітапті оқыды | Ол кітапты оқыды | Vowel harmony (ті→ты) |
| Ол кеше базарга барды | Ол кеше базарға барды | Vowel harmony (га→ға) |
| Ол маған жазыды хат | Ол маған хат жазды | Word order + morphology |
| Мен сенің кітабыңды алдым | Мен сенің кітабыңды алдым | No change (correct input) |

## Architecture

| Parameter | Value |
|-----------|-------|
| Base model | kazakh-llama-50m-v2 (Llama) |
| Parameters | ~50M |
| Hidden size | 576 |
| Layers | 8 |
| Attention heads | 8 |
| Vocab size | 50,263 (+3 special tokens) |
| Max sequence length | 512 |

## Training

- **Dataset**: [kazakh-synthetic-gec-datasets](https://huggingface.co/datasets/stukenov/kazakh-synthetic-gec-datasets) (10 subdirectories; ~390K train / 16K val / 16K test examples)
- **Identity examples**: 20% of training data (input == target), to prevent over-correction
- **Epochs**: 1
- **Batch size**: 8 per GPU × 4 GPUs × 4 gradient-accumulation steps = effective batch size 128
- **Learning rate**: 2e-5, cosine schedule, 5% warmup
- **Hardware**: 4× RTX 4090 (vast.ai)
- **Training time**: ~55 minutes
- **Final eval loss**: 0.377

## Special Tokens

| Token | Purpose |
|-------|---------|
| `` | Task prefix |
| `` | Source text delimiter |
| `` | Separator between input and target |

## License

Apache 2.0

## Benchmark Results

Evaluated on a **100-example custom GEC test** (pure model inference, no pre-/post-processing pipeline).

| Category | Score |
|----------|-------|
| Spelling (orthography) | 0/30 (0%) |
| Grammar | 0/20 (0%) |
| Punctuation | 0/15 (0%) |
| Mixed | 0/20 (0%) |
| Identity preservation | 0/15 (0%) |
| **Total** | **0/100 (0%)** |

## Leaderboard (100-example custom benchmark)

| Model | Total | Spelling/30 | Grammar/20 | Punct/15 | Mixed/20 | Ident/15 |
|-------|-------|-------------|------------|----------|----------|----------|
| **[sozkz-core-llama-600m-kk-gec-v1](https://huggingface.co/stukenov/sozkz-core-llama-600m-kk-gec-v1)** | **47%** | 15 | 12 | 3 | 2 | 15/15 |
| [sozkz-fix-qwen-500m-kk-gec-v3](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v3) | 38% | 0 | 16 | 9 | 0 | 13/15 |
| [sozkz-core-llama-300m-kk-gec-v4](https://huggingface.co/stukenov/sozkz-core-llama-300m-kk-gec-v4) | 37% | 9 | 6 | 4 | 3 | 15/15 |
| [sozkz-fix-qwen-500m-kk-gec-v1](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v1) | 35% | 0 | 12 | 8 | 0 | 15/15 |
| [sozkz-fix-qwen-500m-kk-gec-v2](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v2) | 30% | 0 | 11 | 7 | 0 | 12/15 |
| [sozkz-core-llama-1b-kk-gec-v1](https://huggingface.co/stukenov/sozkz-core-llama-1b-kk-gec-v1) | 16% | 2 | 6 | 1 | 0 | 7/15 |
| [sozkz-fix-qwen-500m-kk-gec-v4](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v4) | 5% | 0 | 1 | 4 | 0 | 0/15 |
| [sozkz-fix-mt5b-kk-gec-run13-v1](https://huggingface.co/stukenov/sozkz-fix-mt5b-kk-gec-run13-v1) | 5% | 0 | 2 | 0 | 0 | 3/15 |
| [sozkz-nllb-1b-kk-gec-v1](https://huggingface.co/stukenov/sozkz-nllb-1b-kk-gec-v1) | 1% | 0 | 1 | 0 | 0 | 0/15 |
| [sozkz-nllb-1b-kk-pretrain-v1](https://huggingface.co/stukenov/sozkz-nllb-1b-kk-pretrain-v1) | 1% | 0 | 1 | 0 | 0 | 0/15 |
| [sozkz-core-llama-300m-kk-gec-v3](https://huggingface.co/stukenov/sozkz-core-llama-300m-kk-gec-v3) | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v1/v2a/v2b | 0–1% | 0 | 0 | 0 | 0 | 0–1 |
| sozkz-fix-mt5-50m-kk-gec-v1 | 0% | 0 | 0 | 0 | 0 | 0/15 |
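For reference, the Word F0.5 figure in the Evaluation section follows from the reported word precision and recall: F0.5 weights precision twice as heavily as recall, which suits GEC since a wrong "correction" is worse than a missed one. A minimal sketch of the standard F-beta formula:

```python
# Standard F-beta score: beta < 1 shifts the weight toward precision.
def f_beta(precision, recall, beta=0.5):
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Plugging in the Word Precision / Word Recall values from the table above
# reproduces the reported Word F0.5 of ~0.520:
score = f_beta(0.494, 0.661)
```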