# kazakh-gec-50m
A Kazakh grammatical error correction (GEC) model fine-tuned from kazakh-llama-50m-v2 on synthetic GEC data (~390K training examples + 20% identity examples).
The model corrects morphological errors (vowel harmony, suffixes), word order issues, and other common grammatical mistakes in Kazakh text.
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "stukenov/kazakh-gec-50m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()

text = "Ол кітапті оқыды"
prompt = f"<TASK_FIX><SRC>{text}<SEP>"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

# Decode only the newly generated tokens (everything after the prompt)
result = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(f"Input: {text}")
print(f"Output: {result}")
# Input: Ол кітапті оқыды
# Output: Ол кітапты оқыды
```
## Format

The model uses a decoder-only seq2seq-style format with special tokens:

```
<TASK_FIX><SRC>{noisy text}<SEP>{corrected text}<EOS>
```

During training, the loss is computed only on tokens after `<SEP>`.
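The masking rule above can be sketched as follows; the token IDs are made up purely for illustration, and the actual preprocessing code is not published here:

```python
def mask_labels(input_ids, sep_token_id):
    """Copy input_ids into labels, replacing everything up to and
    including <SEP> with -100 so the loss ignores the prompt part."""
    sep_pos = input_ids.index(sep_token_id)
    return [-100] * (sep_pos + 1) + input_ids[sep_pos + 1:]

# Hypothetical IDs for <TASK_FIX> <SRC> t1 t2 <SEP> c1 c2 <EOS>
ids = [3, 4, 10, 11, 5, 20, 21, 2]
print(mask_labels(ids, sep_token_id=5))
# [-100, -100, -100, -100, -100, 20, 21, 2]
```

The `-100` value is the ignore index of PyTorch's cross-entropy loss, so only the corrected-text tokens contribute to the gradient.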
## Evaluation

Evaluated on the test split (200 examples) of `kazakh-synthetic-gec-datasets`:
| Metric | Value |
|---|---|
| Exact Match | 62.0% |
| Character Error Rate (CER) | 0.0802 |
| Word Precision | 0.494 |
| Word Recall | 0.661 |
| Word F0.5 | 0.520 |
| Identity Preservation | 100% (26/26) |
Strengths:
- Excellent identity preservation — never corrupts already correct text
- Handles morphological errors well (vowel harmony, suffix agreement)
- Good at word order corrections
Limitations:
- Struggles with complex multi-word rearrangements
- May hallucinate alternative words instead of making minimal corrections on rare vocabulary
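For reference, the Character Error Rate reported above can be computed with a minimal sketch like this (plain character-level edit distance normalized by reference length; the actual evaluation script is an assumption, not published here):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(hyp: str, ref: str) -> float:
    """Character Error Rate: edit distance over reference length."""
    return levenshtein(hyp, ref) / max(len(ref), 1)

# One character substitution (ті -> ты) in a 16-character reference
print(cer("Ол кітапті оқыды", "Ол кітапты оқыды"))  # 0.0625
```

Exact Match is then just the fraction of hypotheses that equal their reference string exactly.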
## Examples
| Input | Output | Fix |
|---|---|---|
| Ол кітапті оқыды | Ол кітапты оқыды | Vowel harmony (ті→ты) |
| Ол кеше базарга барды | Ол кеше базарға барды | Vowel harmony (га→ға) |
| Ол маған жазыды хат | Ол маған хат жазды | Word order + morphology |
| Мен сенің кітабыңды алдым | Мен сенің кітабыңды алдым | No change (correct input) |
## Architecture
| Parameter | Value |
|---|---|
| Base model | kazakh-llama-50m-v2 (Llama) |
| Parameters | ~50M |
| Hidden size | 576 |
| Layers | 8 |
| Attention heads | 8 |
| Vocab size | 50,263 (+3 special tokens) |
| Max sequence length | 512 |
## Training
- Dataset: kazakh-synthetic-gec-datasets — 10 subdirectories, ~390K train / 16K val / 16K test examples
- Identity examples: 20% of training data (input == target) to prevent over-correction
- Epochs: 1
- Batch size: 8 per GPU × 4 GPUs × 4 gradient accumulation = effective 128
- Learning rate: 2e-5, cosine schedule, 5% warmup
- Hardware: 4× RTX 4090 (vast.ai)
- Training time: ~55 minutes
- Final eval loss: 0.377
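The 20% identity mix above can be sketched as follows; this is a hypothetical helper under the assumption that identity pairs are sampled from the clean targets, not the actual data-preparation script:

```python
import random

def mix_identity_examples(pairs, ratio=0.2, seed=0):
    """Append (clean, clean) pairs so that roughly `ratio` of the
    resulting dataset maps an already-correct input to itself."""
    rng = random.Random(seed)
    # Solve n / (len(pairs) + n) = ratio for the number of identity pairs n
    n_identity = int(len(pairs) * ratio / (1 - ratio))
    targets = [tgt for _, tgt in pairs]
    identity = [(t, t) for t in rng.sample(targets, n_identity)]
    return pairs + identity

data = [("кітапті", "кітапты")] * 8
mixed = mix_identity_examples(data)
print(sum(src == tgt for src, tgt in mixed), "of", len(mixed))  # 2 of 10
```

Training on such self-mapping pairs is what teaches the model to leave already-correct input untouched instead of over-correcting it.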
## Special Tokens

| Token | Purpose |
|---|---|
| `<TASK_FIX>` | Task prefix |
| `<SRC>` | Source text delimiter |
| `<SEP>` | Separator between input and target |
## License
Apache 2.0
## Benchmark Results

Evaluated on a 100-example custom GEC test set (pure model inference, no pre- or post-processing pipeline).
| Category | Score |
|---|---|
| Spelling (orthography) | 0/30 (0%) |
| Grammar | 0/20 (0%) |
| Punctuation | 0/15 (0%) |
| Mixed | 0/20 (0%) |
| Identity preservation | 0/15 (0%) |
| Total | 0/100 (0%) |
## Leaderboard (100-example custom benchmark)
| Model | Total | Spelling/30 | Grammar/20 | Punct./15 | Mixed/20 | Ident./15 |
|---|---|---|---|---|---|---|
| sozkz-core-llama-600m-kk-gec-v1 | 47% | 15 | 12 | 3 | 2 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v3 | 38% | 0 | 16 | 9 | 0 | 13/15 |
| sozkz-core-llama-300m-kk-gec-v4 | 37% | 9 | 6 | 4 | 3 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v1 | 35% | 0 | 12 | 8 | 0 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v2 | 30% | 0 | 11 | 7 | 0 | 12/15 |
| sozkz-core-llama-1b-kk-gec-v1 | 16% | 2 | 6 | 1 | 0 | 7/15 |
| sozkz-fix-qwen-500m-kk-gec-v4 | 5% | 0 | 1 | 4 | 0 | 0/15 |
| sozkz-fix-mt5b-kk-gec-run13-v1 | 5% | 0 | 2 | 0 | 0 | 3/15 |
| sozkz-nllb-1b-kk-gec-v1 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-nllb-1b-kk-pretrain-v1 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v3 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v1/v2a/v2b | 0–1% | 0 | 0 | 0 | 0 | 0–1 |
| sozkz-fix-mt5-50m-kk-gec-v1 | 0% | 0 | 0 | 0 | 0 | 0/15 |