# SozKZ Fix Qwen 500M — Kazakh GEC v4
Kazakh grammatical error correction model, fine-tuned with KTO (Kahneman-Tversky Optimization) to improve punctuation handling. Fixes spelling (емле), grammar, punctuation, and word usage errors in Kazakh text.
## Model Details

| | |
|---|---|
| Base model | stukenov/sozkz-fix-qwen-500m-kk-gec-v3 |
| Parameters | 447M |
| Method | KTO preference optimization on SFT v3 (LoRA r=32, alpha=64), merged |
| Training data | 26,404 preference pairs (13,202 positive + 13,202 negative) |
| Eval loss | 0.314 |
| Training time | 29.0 min on RTX 3090 |
| Smoke test | 5/10 standalone |
## What's New in v4
- KTO preference optimization: Trained on chosen/rejected pairs to learn output preferences beyond supervised fine-tuning
- Punctuation-focused training data: 3,000 comma-insertion pairs extracted from correct outputs, 1,000 period pairs, 2,000 template-generated compound sentences
- Improved comma insertion: Model now correctly adds commas before conjunctions ("ал", "бірақ", "себебі"), after introductory words ("Иә", "Алайда"), and between compound clauses
- Trade-off: Standalone емле (character substitution) accuracy decreased — use with the емле pipeline for best results
## KTO Training Data Composition

| Source | Count | Type |
|---|---|---|
| Original GEC dataset | ~9,600 × 2 | Correct output → positive, input → negative |
| Comma-insertion pairs | 3,000 × 2 | Removed comma → negative, original → positive |
| Period pairs | 1,000 × 2 | Removed period → negative, original → positive |
| Template compound sentences | 2,000 × 2 | Without comma/period → negative, with → positive |
| Total | 26,404 | 13,202 positive + 13,202 negative |
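The punctuation-focused pairs are built by deleting punctuation from a correct sentence: the original is the chosen completion and the degraded copy is the rejected one. A minimal sketch of how a comma-insertion pair might be generated (the function name and random sampling are illustrative, not taken from the training script):

```python
import random

def make_comma_pair(corrected: str, seed: int = 0):
    """Return (chosen, rejected): the corrected sentence vs. the same
    sentence with one comma removed. None if there is no comma."""
    if "," not in corrected:
        return None
    rng = random.Random(seed)
    # Positions of every comma in the corrected sentence.
    positions = [i for i, ch in enumerate(corrected) if ch == ","]
    drop = rng.choice(positions)
    return corrected, corrected[:drop] + corrected[drop + 1:]

chosen, rejected = make_comma_pair("Мен жұмысқа бардым, ол үйде қалды.")
# chosen   → "Мен жұмысқа бардым, ол үйде қалды."
# rejected → "Мен жұмысқа бардым ол үйде қалды."
```

Period pairs would follow the same pattern with the trailing period removed instead.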
## Optimal Inference Settings

```python
model.generate(
    ids,
    max_new_tokens=512,
    num_beams=4,
    num_return_sequences=4,
    do_sample=False,
    repetition_penalty=1.0,
    pad_token_id=1,
)
```
**Important:** `repetition_penalty` must stay at 1.0. Higher values discourage the model from repeating correct words, degrading output quality.
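With `num_return_sequences=4`, generation yields four beam candidates, and the pipeline reranks them by edit distance to the input, preferring the correction that changes the source the least. A minimal sketch of such a reranker using `difflib` (the exact distance measure and tie-breaking in the real pipeline may differ):

```python
import difflib

def rerank_by_edit_distance(source: str, candidates: list[str]) -> str:
    """Return the candidate most similar to the source text,
    i.e. the correction that changes the input the least."""
    return max(
        candidates,
        key=lambda c: difflib.SequenceMatcher(None, source, c).ratio(),
    )

best = rerank_by_edit_distance(
    "Иә мен келемін",
    ["Иә, мен келемін.", "Иә, мен ертең бір келемін."],
)
# best is the minimally edited candidate: "Иә, мен келемін."
```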
## Pipeline Architecture
For best results, use with the емле (spelling) pipeline. The емле fixer handles character substitution errors (у→ү/ұ, о→ө, к→қ, etc.) via dictionary lookup, while the model focuses on grammar and punctuation.
```
     Input text
          │
          ▼
┌─────────────────────┐
│   Емле Pre-fixer    │  Dictionary-based: russified chars → Kazakh
│ (kz_full_dict.json  │  e.g. "бугін" → "бүгін"
│  kz_word_freq.json) │  Frequency-ratio threshold > 5
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│    GEC Model v4     │  KTO-optimized for grammar/punct
│  Beam search (k=4)  │  Edit-distance reranking
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│   Емле Post-fixer   │  Catches any remaining char errors
└─────────┬───────────┘
          │
          ▼
     Output text
```
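The pre-fixer's frequency-ratio rule can be sketched as follows. Here `full_dict` and `word_freq` stand in for `kz_full_dict.json` and `kz_word_freq.json`, and the exact lookup logic is an assumption, not the published implementation:

```python
def emle_fix(word: str, full_dict: dict, word_freq: dict, ratio: float = 5.0) -> str:
    """Replace `word` with its dictionary form only when that form is
    more than `ratio` times as frequent in the corpus."""
    candidate = full_dict.get(word)
    if candidate is None:
        return word
    if word_freq.get(candidate, 0) > ratio * word_freq.get(word, 1):
        return candidate
    return word

full_dict = {"бугін": "бүгін"}   # russified spelling → Kazakh form
word_freq = {"бугін": 3, "бүгін": 500}
fixed = " ".join(emle_fix(w, full_dict, word_freq) for w in "Мен бугін келдім".split())
# fixed → "Мен бүгін келдім"
```

The threshold keeps the fixer conservative: a substitution only fires when the corpus strongly prefers the Kazakh form.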
## Inference Examples

### Punctuation Corrections (v4 improvement)

| Input | Output | Fix |
|---|---|---|
| Иә мен келемін | Иә, мен келемін. | comma after introductory word |
| Мен жұмысқа бардым ол үйде қалды | Мен жұмысқа бардым, ол үйде қалды. | comma between clauses |
| Жаңбыр жауды біз үйде отырдық | Жаңбыр жауды, біз үйде отырдық. | comma between clauses |
| Менің досым келді. | Менің досым келді. | identity preserved |
### Емле Corrections (with pipeline)

| Input | Output | Type |
|---|---|---|
| Мен бугін мектепке бардым | Мен бүгін мектепке бардым. | у→ү (pipeline) |
| Казакстан Орталык Азиядагы ен ірі мемлекет | Қазақстан Орталық Азиядағы ең ірі мемлекет. | multiple (pipeline + model) |
## Version Comparison

| | v1 | v2 | v3 | v4 |
|---|---|---|---|---|
| Method | SFT | SFT | SFT (LoRA r=64) | KTO on v3 (LoRA r=32) |
| Dataset | ~3,740 | 9,599 | 14,597 | 26,404 preference pairs |
| Емле fixer | No | No | Yes (pre+post) | Yes (pre+post) |
| Beam search | No | No | Yes (k=4) | Yes (k=4) |
| Eval loss | ~1.2 | ~0.85 | 0.791 (SFT) | 0.314 (KTO) |
| Focus | General | General | Емле + grammar | Punctuation |
| Real accuracy | ~40% | ~60% | ~93% (with pipeline) | TBD (with pipeline) |
## KTO Metrics

| Metric | Value |
|---|---|
| Eval loss | 0.314 |
| Reward margin (chosen − rejected) | 3.86 |
| Chosen reward | 0.016 |
| Rejected reward | −3.846 |
| KL divergence | 0.712 |
## Usage

### Standalone Inference

```python
from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast
from huggingface_hub import hf_hub_download
import torch

model_id = "stukenov/sozkz-fix-qwen-500m-kk-gec-v4"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tok_file = hf_hub_download(model_id, "tokenizer.json")
tokenizer = PreTrainedTokenizerFast(tokenizer_file=tok_file)
tokenizer.pad_token_id = 1

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

instruction = (
    "Мәтіндегі грамматикалық, орфографиялық, пунктуациялық және сөз қолданысындағы "
    "қателерді түзет. Мағынаны өзгертпе. Егер мәтін дұрыс болса, оны өзгеріссіз қайтар. "
    "Тек түзетілген мәтінді қайтар."
)
text = "Иә мен келемін"
prompt = f"### Нұсқау:\n{instruction}\n\n### Мәтін:\n{text}\n\n### Түзетілген:\n"

ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    out = model.generate(
        ids, max_new_tokens=512,
        num_beams=4, num_return_sequences=1,
        do_sample=False, repetition_penalty=1.0, pad_token_id=1,
    )
result = tokenizer.decode(out[0], skip_special_tokens=True)
corrected = result.split("### Түзетілген:\n")[-1].split("###")[0].strip()
print(corrected)
```
### API Server

See `serve_gec_qwen_500m.py` for a FastAPI server with an OpenAI-compatible `/v1/chat/completions` endpoint and a built-in емле pipeline.
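For illustration, a client call to such an endpoint might look like the sketch below; the host, port, and model name are assumptions, not taken from the server script:

```python
import json
import urllib.request

def build_payload(text: str) -> dict:
    """Chat-completions request body for the GEC server."""
    return {
        "model": "sozkz-fix-qwen-500m-kk-gec-v4",  # illustrative model name
        "messages": [{"role": "user", "content": text}],
    }

def correct_text(text: str, base_url: str = "http://localhost:8000") -> str:
    """POST the text to the OpenAI-compatible endpoint and return the correction."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# correct_text("Иә мен келемін")  # requires the server to be running
```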
## Training

- Script: `autoresearch/exp041_gec_kto_v4.py`
- Hardware: NVIDIA RTX 3090 (RunPod)
- Base model: stukenov/sozkz-fix-qwen-500m-kk-gec-v3 (SFT)
- Method: KTO preference optimization (LoRA r=32, alpha=64, all linear layers)
- Epochs: 1
- Learning rate: 5e-5 (cosine schedule, 10% warmup)
- KTO beta: 0.1
- Batch size: 4 × 8 gradient accumulation = 32 effective
- Precision: bf16
- Framework: transformers 4.47 + peft 0.14 + trl 0.13
## Citation

```bibtex
@misc{sozkz-gec-qwen-500m-v4,
  author    = {Saken Tukenov},
  title     = {SozKZ Fix Qwen 500M — Kazakh GEC v4},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v4}
}
```
## Benchmark Results

Evaluated on a 100-example custom GEC test set (pure model inference, no pre/post pipeline).

| Category | Score |
|---|---|
| Орфография (емле) | 0/30 (0%) |
| Грамматика | 1/20 (5%) |
| Пунктуация | 4/15 (27%) |
| Смешанный (mixed) | 0/20 (0%) |
| Identity preservation | 0/15 (0%) |
| Total | 5/100 (5%) |
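The percentages above are consistent with exact-match scoring rounded to whole percent (e.g. 4/15 → 27%). A sketch under that assumption (the real harness may normalize whitespace or punctuation before comparing):

```python
def score(examples: list[tuple[str, str]]) -> str:
    """examples: (model_output, gold) pairs; returns 'hits/total (pct%)'."""
    hits = sum(1 for out, gold in examples if out.strip() == gold.strip())
    total = len(examples)
    return f"{hits}/{total} ({round(100 * hits / total)}%)"

print(score([("Иә, мен келемін.", "Иә, мен келемін."), ("қате", "дұрыс")]))
# → 1/2 (50%)
```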
## Leaderboard (100-example custom benchmark)