# SozKZ Fix Qwen 500M — Kazakh GEC v4
Kazakh grammatical error correction model, fine-tuned with KTO (Kahneman-Tversky Optimization) to improve punctuation handling. Fixes spelling (емле), grammar, punctuation, and word usage errors in Kazakh text.
## Model Details

| | |
|---|---|
| Base model | stukenov/sozkz-fix-qwen-500m-kk-gec-v3 |
| Parameters | 447M |
| Method | KTO preference optimization on SFT v3 (LoRA r=32, alpha=64), merged |
| Training data | 26,404 preference pairs (13,202 positive + 13,202 negative) |
| Eval loss | 0.314 |
| Training time | 29.0 min on RTX 3090 |
| Smoke test | 5/10 standalone |
## What's New in v4
- KTO preference optimization: Trained on chosen/rejected pairs to learn output preferences beyond supervised fine-tuning
- Punctuation-focused training data: 3,000 comma-insertion pairs extracted from correct outputs, 1,000 period pairs, 2,000 template-generated compound sentences
- Improved comma insertion: Model now correctly adds commas before conjunctions ("ал", "бірақ", "себебі"), after introductory words ("Иә", "Алайда"), and between compound clauses
- Trade-off: Standalone емле (character substitution) accuracy decreased — use with the емле pipeline for best results
## KTO Training Data Composition

| Source | Count | Type |
|---|---|---|
| Original GEC dataset | ~9,600 × 2 | Correct output → positive, input → negative |
| Comma-insertion pairs | 3,000 × 2 | Removed comma → negative, original → positive |
| Period pairs | 1,000 × 2 | Removed period → negative, original → positive |
| Template compound sentences | 2,000 × 2 | Without comma/period → negative, with → positive |
| Total | 26,404 | 13,202 positive + 13,202 negative |
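The punctuation-focused pairs are built by deleting punctuation from a correct sentence: the original is the chosen completion and the degraded copy is the rejected one. A minimal sketch of how a comma-insertion pair might be generated (the function name and random sampling are illustrative, not taken from the training script):

```python
import random

def make_comma_pair(corrected: str, seed: int = 0):
    """Return (chosen, rejected): the corrected sentence vs. the same
    sentence with one comma removed. None if there is no comma."""
    if "," not in corrected:
        return None
    rng = random.Random(seed)
    # Positions of every comma in the corrected sentence.
    positions = [i for i, ch in enumerate(corrected) if ch == ","]
    drop = rng.choice(positions)
    return corrected, corrected[:drop] + corrected[drop + 1:]

chosen, rejected = make_comma_pair("Мен жұмысқа бардым, ол үйде қалды.")
# chosen   → "Мен жұмысқа бардым, ол үйде қалды."
# rejected → "Мен жұмысқа бардым ол үйде қалды."
```

Period pairs would follow the same pattern with the trailing period removed instead.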
## Optimal Inference Settings

```python
model.generate(
    ids,
    max_new_tokens=512,
    num_beams=4,
    num_return_sequences=4,
    do_sample=False,
    repetition_penalty=1.0,
    pad_token_id=1,
)
```
**Important:** `repetition_penalty` must stay at 1.0. Higher values discourage the model from repeating correct words, degrading output quality.
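With `num_return_sequences=4`, generation yields four beam candidates, and the pipeline reranks them by edit distance to the input, preferring the correction that changes the source the least. A minimal sketch of such a reranker using `difflib` (the exact distance measure and tie-breaking in the real pipeline may differ):

```python
import difflib

def rerank_by_edit_distance(source: str, candidates: list[str]) -> str:
    """Return the candidate most similar to the source text,
    i.e. the correction that changes the input the least."""
    return max(
        candidates,
        key=lambda c: difflib.SequenceMatcher(None, source, c).ratio(),
    )

best = rerank_by_edit_distance(
    "Иә мен келемін",
    ["Иә, мен келемін.", "Иә, мен ертең бір келемін."],
)
# best is the minimally edited candidate: "Иә, мен келемін."
```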
## Pipeline Architecture
For best results, use with the емле (spelling) pipeline. The емле fixer handles character substitution errors (у→ү/ұ, о→ө, к→қ, etc.) via dictionary lookup, while the model focuses on grammar and punctuation.
```
     Input text
          │
          ▼
┌─────────────────────┐
│   Емле Pre-fixer    │  Dictionary-based: russified chars → Kazakh
│ (kz_full_dict.json  │  e.g. "бугін" → "бүгін"
│  kz_word_freq.json) │  Frequency-ratio threshold > 5
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│    GEC Model v4     │  KTO-optimized for grammar/punct
│  Beam search (k=4)  │  Edit-distance reranking
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│   Емле Post-fixer   │  Catches any remaining char errors
└─────────┬───────────┘
          │
          ▼
     Output text
```
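The pre-fixer's frequency-ratio rule can be sketched as follows. Here `full_dict` and `word_freq` stand in for `kz_full_dict.json` and `kz_word_freq.json`, and the exact lookup logic is an assumption, not the published implementation:

```python
def emle_fix(word: str, full_dict: dict, word_freq: dict, ratio: float = 5.0) -> str:
    """Replace `word` with its dictionary form only when that form is
    more than `ratio` times as frequent in the corpus."""
    candidate = full_dict.get(word)
    if candidate is None:
        return word
    if word_freq.get(candidate, 0) > ratio * word_freq.get(word, 1):
        return candidate
    return word

full_dict = {"бугін": "бүгін"}   # russified spelling → Kazakh form
word_freq = {"бугін": 3, "бүгін": 500}
fixed = " ".join(emle_fix(w, full_dict, word_freq) for w in "Мен бугін келдім".split())
# fixed → "Мен бүгін келдім"
```

The threshold keeps the fixer conservative: a substitution only fires when the corpus strongly prefers the Kazakh form.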
## Inference Examples

### Punctuation Corrections (v4 improvement)

| Input | Output | Fix |
|---|---|---|
| Иә мен келемін | Иә, мен келемін. | comma after introductory word |
| Мен жұмысқа бардым ол үйде қалды | Мен жұмысқа бардым, ол үйде қалды. | comma between clauses |
| Жаңбыр жауды біз үйде отырдық | Жаңбыр жауды, біз үйде отырдық. | comma between clauses |
| Менің досым келді. | Менің досым келді. | identity preserved |
### Емле Corrections (with pipeline)

| Input | Output | Type |
|---|---|---|
| Мен бугін мектепке бардым | Мен бүгін мектепке бардым. | у→ү (pipeline) |
| Казакстан Орталык Азиядагы ен ірі мемлекет | Қазақстан Орталық Азиядағы ең ірі мемлекет. | multiple (pipeline + model) |
## Version Comparison

| | v1 | v2 | v3 | v4 |
|---|---|---|---|---|
| Method | SFT | SFT | SFT (LoRA r=64) | KTO on v3 (LoRA r=32) |
| Dataset | ~3,740 | 9,599 | 14,597 | 26,404 preference pairs |
| Емле fixer | No | No | Yes (pre+post) | Yes (pre+post) |
| Beam search | No | No | Yes (k=4) | Yes (k=4) |
| Eval loss | ~1.2 | ~0.85 | 0.791 (SFT) | 0.314 (KTO) |
| Focus | General | General | Емле + grammar | Punctuation |
| Real accuracy | ~40% | ~60% | ~93% (with pipeline) | TBD (with pipeline) |
## KTO Metrics

| Metric | Value |
|---|---|
| Eval loss | 0.314 |
| Reward margin (chosen − rejected) | 3.86 |
| Chosen reward | 0.016 |
| Rejected reward | −3.846 |
| KL divergence | 0.712 |
## Usage

### Standalone Inference

```python
from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast
from huggingface_hub import hf_hub_download
import torch

model_id = "stukenov/sozkz-fix-qwen-500m-kk-gec-v4"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tok_file = hf_hub_download(model_id, "tokenizer.json")
tokenizer = PreTrainedTokenizerFast(tokenizer_file=tok_file)
tokenizer.pad_token_id = 1

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

instruction = (
    "Мәтіндегі грамматикалық, орфографиялық, пунктуациялық және сөз қолданысындағы "
    "қателерді түзет. Мағынаны өзгертпе. Егер мәтін дұрыс болса, оны өзгеріссіз қайтар. "
    "Тек түзетілген мәтінді қайтар."
)
text = "Иә мен келемін"
prompt = f"### Нұсқау:\n{instruction}\n\n### Мәтін:\n{text}\n\n### Түзетілген:\n"

ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    out = model.generate(
        ids, max_new_tokens=512,
        num_beams=4, num_return_sequences=1,
        do_sample=False, repetition_penalty=1.0, pad_token_id=1,
    )
result = tokenizer.decode(out[0], skip_special_tokens=True)
corrected = result.split("### Түзетілген:\n")[-1].split("###")[0].strip()
print(corrected)
```
### API Server

See `serve_gec_qwen_500m.py` for a FastAPI server with an OpenAI-compatible `/v1/chat/completions` endpoint and a built-in емле pipeline.
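For illustration, a client call to such an endpoint might look like the sketch below; the host, port, and model name are assumptions, not taken from the server script:

```python
import json
import urllib.request

def build_payload(text: str) -> dict:
    """Chat-completions request body for the GEC server."""
    return {
        "model": "sozkz-fix-qwen-500m-kk-gec-v4",  # illustrative model name
        "messages": [{"role": "user", "content": text}],
    }

def correct_text(text: str, base_url: str = "http://localhost:8000") -> str:
    """POST the text to the OpenAI-compatible endpoint and return the correction."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# correct_text("Иә мен келемін")  # requires the server to be running
```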
## Training

- Script: `autoresearch/exp041_gec_kto_v4.py`
- Hardware: NVIDIA RTX 3090 (RunPod)
- Base model: stukenov/sozkz-fix-qwen-500m-kk-gec-v3 (SFT)
- Method: KTO preference optimization (LoRA r=32, alpha=64, all linear layers)
- Epochs: 1
- Learning rate: 5e-5 (cosine schedule, 10% warmup)
- KTO beta: 0.1
- Batch size: 4 × 8 gradient accumulation = 32 effective
- Precision: bf16
- Framework: transformers 4.47 + peft 0.14 + trl 0.13
## Citation

```bibtex
@misc{sozkz-gec-qwen-500m-v4,
  author    = {Saken Tukenov},
  title     = {SozKZ Fix Qwen 500M — Kazakh GEC v4},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v4}
}
```
## Benchmark Results

Evaluated on a 100-example custom GEC test set (pure model inference, no pre/post pipeline).

| Category | Score |
|---|---|
| Орфография (емле) | 0/30 (0%) |
| Грамматика | 1/20 (5%) |
| Пунктуация | 4/15 (27%) |
| Смешанный (mixed) | 0/20 (0%) |
| Identity preservation | 0/15 (0%) |
| Total | 5/100 (5%) |
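The percentages above are consistent with exact-match scoring rounded to whole percent (e.g. 4/15 → 27%). A sketch under that assumption (the real harness may normalize whitespace or punctuation before comparing):

```python
def score(examples: list[tuple[str, str]]) -> str:
    """examples: (model_output, gold) pairs; returns 'hits/total (pct%)'."""
    hits = sum(1 for out, gold in examples if out.strip() == gold.strip())
    total = len(examples)
    return f"{hits}/{total} ({round(100 * hits / total)}%)"

print(score([("Иә, мен келемін.", "Иә, мен келемін."), ("қате", "дұрыс")]))
# → 1/2 (50%)
```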
## Leaderboard (100-example custom benchmark)