
SozKZ Fix Qwen 500M — Kazakh GEC v4

A Kazakh grammatical error correction (GEC) model, fine-tuned with KTO (Kahneman-Tversky Optimization) to improve punctuation handling. It fixes spelling (емле), grammar, punctuation, and word-usage errors in Kazakh text.

Model Details

| | |
|---|---|
| Base model | stukenov/sozkz-fix-qwen-500m-kk-gec-v3 |
| Parameters | 447M |
| Method | KTO preference optimization on SFT v3 (LoRA r=32, alpha=64), merged |
| Training data | 26,404 preference pairs (13,202 positive + 13,202 negative) |
| Eval loss | 0.314 |
| Training time | 29.0 min on RTX 3090 |
| Smoke test | 5/10 standalone |

What's New in v4

  • KTO preference optimization: Trained on chosen/rejected pairs to learn output preferences beyond supervised fine-tuning
  • Punctuation-focused training data: 3,000 comma-insertion pairs extracted from correct outputs, 1,000 period pairs, 2,000 template-generated compound sentences
  • Improved comma insertion: Model now correctly adds commas before conjunctions ("ал", "бірақ", "себебі"), after introductory words ("Иә", "Алайда"), and between compound clauses
  • Trade-off: Standalone емле (character substitution) accuracy decreased — use with the емле pipeline for best results

KTO Training Data Composition

| Source | Count | Type |
|---|---|---|
| Original GEC dataset | ~9,600 × 2 | Correct output → positive, input → negative |
| Comma-insertion pairs | 3,000 × 2 | Removed comma → negative, original → positive |
| Period pairs | 1,000 × 2 | Removed period → negative, original → positive |
| Template compound sentences | 2,000 × 2 | Without comma/period → negative, with → positive |
| Total | 26,404 | 13,202 positive + 13,202 negative |
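
The comma and period pairs above can be generated mechanically from correct sentences. A minimal sketch, assuming the unpaired completion/label format that KTO training uses (the helper name and the random choice of which comma to drop are assumptions; the actual logic lives in autoresearch/exp041_gec_kto_v4.py):

import random

def make_comma_pairs(correct_sentences):
    """For each correct sentence containing a comma, emit the original as a
    positive example and a copy with one comma removed as a negative example."""
    pairs = []
    for sent in correct_sentences:
        comma_positions = [i for i, ch in enumerate(sent) if ch == ","]
        if not comma_positions:
            continue
        i = random.choice(comma_positions)
        broken = sent[:i] + sent[i + 1:]  # drop one comma, keep the space
        pairs.append({"completion": sent, "label": True})     # positive
        pairs.append({"completion": broken, "label": False})  # negative
    return pairs

print(make_comma_pairs(["Иә, мен келемін."]))
# [{'completion': 'Иә, мен келемін.', 'label': True},
#  {'completion': 'Иә мен келемін.', 'label': False}]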

Optimal Inference Settings

model.generate(
    ids,
    max_new_tokens=512,
    num_beams=4,
    num_return_sequences=4,
    do_sample=False,
    repetition_penalty=1.0,  # CRITICAL: any value > 1.0 degrades quality
    pad_token_id=1,
)

Important: repetition_penalty must stay at 1.0. Because a correction mostly copies the input, any repetition penalty pushes the model away from reproducing correct words and degrades output quality.
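
With num_return_sequences=4, the surrounding pipeline reranks the returned beams by edit distance to the input, preferring the most conservative correction (see Pipeline Architecture below). A minimal sketch of that reranking step, using difflib's similarity ratio as a stand-in for the exact distance metric:

import difflib

def rerank_by_edit_distance(source, candidates):
    """Return the candidate closest to the source text, i.e. the most
    conservative correction among the beam search outputs."""
    def distance(cand):
        # 1 - ratio() grows with dissimilarity; a dependency-free proxy
        # for edit distance.
        return 1.0 - difflib.SequenceMatcher(None, source, cand).ratio()
    return min(candidates, key=distance)

# candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in out]
# best = rerank_by_edit_distance(text, candidates)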

Pipeline Architecture

For best results, use with the емле (spelling) pipeline. The емле fixer handles character-substitution errors (у→ү/ұ, о→ө, к→қ, etc.) via dictionary lookup, while the model focuses on grammar and punctuation. A minimal sketch of the pre-fixer follows the diagram.

Input text
    │
    ▼
┌─────────────────────┐
│  Емле Pre-fixer     │  Dictionary-based: russified chars → Kazakh
│  (kz_full_dict.json │  e.g. "бугін" → "бүгін"
│   kz_word_freq.json)│  Frequency-ratio threshold > 5
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  GEC Model v4       │  KTO-optimized for grammar/punct
│  Beam search (k=4)  │  Edit-distance reranking
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Емле Post-fixer    │  Catches any remaining char errors
└─────────┬───────────┘
          │
          ▼
Output text
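
A sketch of the Емле Pre-fixer stage above, assuming kz_full_dict.json maps russified spellings to their Kazakh forms and kz_word_freq.json maps words to corpus counts (both file formats, and reading the threshold as "candidate must be more than 5× as frequent", are assumptions):

import json

with open("kz_full_dict.json", encoding="utf-8") as f:
    fixes = json.load(f)   # e.g. {"бугін": "бүгін", ...}
with open("kz_word_freq.json", encoding="utf-8") as f:
    freq = json.load(f)    # e.g. {"бүгін": 15200, "бугін": 40, ...}

def emle_prefix(text, ratio=5.0):
    """Replace a word only when the dictionary correction is more than
    `ratio` times as frequent as the original spelling."""
    out = []
    for word in text.split():
        cand = fixes.get(word)
        if cand and freq.get(cand, 0) > ratio * freq.get(word, 1):
            word = cand
        out.append(word)
    return " ".join(out)

print(emle_prefix("Мен бугін мектепке бардым"))
# Мен бүгін мектепке бардым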

Inference Examples

Punctuation Corrections (v4 improvement)

| Input | Output | Fix |
|---|---|---|
| Иә мен келемін | Иә, мен келемін. | Comma after introductory word |
| Мен жұмысқа бардым ол үйде қалды | Мен жұмысқа бардым, ол үйде қалды. | Comma between clauses |
| Жаңбыр жауды біз үйде отырдық | Жаңбыр жауды, біз үйде отырдық. | Comma between clauses |
| Менің досым келді. | Менің досым келді. | Identity preserved |

Емле Corrections (with pipeline)

| Input | Output | Type |
|---|---|---|
| Мен бугін мектепке бардым | Мен бүгін мектепке бардым. | у→ү (pipeline) |
| Казакстан Орталык Азиядагы ен ірі мемлекет | Қазақстан Орталық Азиядағы ең ірі мемлекет. | Multiple (pipeline + model) |

Version Comparison

| | v1 | v2 | v3 | v4 |
|---|---|---|---|---|
| Method | SFT | SFT | SFT (LoRA r=64) | KTO on v3 (LoRA r=32) |
| Dataset | ~3,740 | 9,599 | 14,597 | 26,404 preference pairs |
| Емле fixer | No | No | Yes (pre+post) | Yes (pre+post) |
| Beam search | No | No | Yes (k=4) | Yes (k=4) |
| Eval loss | ~1.2 | ~0.85 | 0.791 (SFT) | 0.314 (KTO) |
| Focus | General | General | Емле + grammar | Punctuation |
| Real accuracy | ~40% | ~60% | ~93% (with pipeline) | TBD (with pipeline) |

KTO Metrics

| Metric | Value |
|---|---|
| Eval loss | 0.314 |
| Reward margin (chosen − rejected) | 3.86 |
| Chosen reward | 0.016 |
| Rejected reward | −3.846 |
| KL divergence | 0.712 |
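
These metrics follow the KTO objective of Ethayarajh et al. (2024): the per-example reward is the policy/reference log-probability ratio, and the reward margin above is chosen minus rejected reward (0.016 − (−3.846) ≈ 3.86). For reference, a sketch of the objective with the β = 0.1 used here:

$$r_\theta(x,y) = \log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}, \qquad z_0 = \mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)$$

$$v(x,y) = \begin{cases}\lambda_D\,\sigma\big(\beta\,(r_\theta(x,y) - z_0)\big) & \text{if } y \text{ is desirable (chosen)}\\ \lambda_U\,\sigma\big(\beta\,(z_0 - r_\theta(x,y))\big) & \text{if } y \text{ is undesirable (rejected)}\end{cases}$$

$$\mathcal{L}_{\mathrm{KTO}} = \mathbb{E}_{(x,y)}\big[\lambda_y - v(x,y)\big]$$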

Usage

Standalone Inference

from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast
from huggingface_hub import hf_hub_download
import torch

model_id = "stukenov/sozkz-fix-qwen-500m-kk-gec-v4"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tok_file = hf_hub_download(model_id, "tokenizer.json")
tokenizer = PreTrainedTokenizerFast(tokenizer_file=tok_file)
tokenizer.pad_token_id = 1

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# The Kazakh instruction says, roughly: "Fix the grammatical, spelling,
# punctuation, and word-usage errors in the text. Do not change the meaning.
# If the text is correct, return it unchanged. Return only the corrected text."
instruction = (
    "Мәтіндегі грамматикалық, орфографиялық, пунктуациялық және сөз қолданысындағы "
    "қателерді түзет. Мағынаны өзгертпе. Егер мәтін дұрыс болса, оны өзгеріссіз қайтар. "
    "Тек түзетілген мәтінді қайтар."
)

text = "Иә мен келемін"
prompt = f"### Нұсқау:\n{instruction}\n\n### Мәтін:\n{text}\n\n### Түзетілген:\n"

ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    out = model.generate(
        ids, max_new_tokens=512,
        num_beams=4, num_return_sequences=1,
        do_sample=False, repetition_penalty=1.0, pad_token_id=1,
    )
result = tokenizer.decode(out[0], skip_special_tokens=True)
corrected = result.split("### Түзетілген:\n")[-1].split("###")[0].strip()
print(corrected)  # Иә, мен келемін.

API Server

See serve_gec_qwen_500m.py for a FastAPI server with an OpenAI-compatible /v1/chat/completions endpoint and a built-in емле pipeline.
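
Once the server is running, it accepts standard OpenAI-style chat requests. A minimal sketch (the host, port, and model name in the payload are assumptions):

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # host/port are assumptions
    json={
        "model": "sozkz-fix-qwen-500m-kk-gec-v4",  # name is an assumption
        "messages": [{"role": "user", "content": "Иә мен келемін"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
# expected: Иә, мен келемін.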

Training

  • Script: autoresearch/exp041_gec_kto_v4.py
  • Hardware: NVIDIA RTX 3090 (RunPod)
  • Base model: stukenov/sozkz-fix-qwen-500m-kk-gec-v3 (SFT)
  • Method: KTO preference optimization (LoRA r=32, alpha=64, all linear layers); a configuration sketch follows this list
  • Epochs: 1
  • Learning rate: 5e-5 (cosine schedule, 10% warmup)
  • KTO beta: 0.1
  • Batch size: 4 × 8 gradient accumulation = 32 effective
  • Precision: bf16
  • Framework: transformers 4.47 + peft 0.14 + trl 0.13
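
A minimal sketch of this setup with trl's KTOTrainer (prompts are abbreviated; the real data uses the full instruction template from the Usage section, and the actual script is autoresearch/exp041_gec_kto_v4.py):

from datasets import Dataset
from huggingface_hub import hf_hub_download
from peft import LoraConfig
from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast
from trl import KTOConfig, KTOTrainer

base_id = "stukenov/sozkz-fix-qwen-500m-kk-gec-v3"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file=hf_hub_download(base_id, "tokenizer.json"))
tokenizer.pad_token_id = 1

# KTO takes unpaired examples: prompt, completion, and a boolean label
# (True = desirable/positive, False = undesirable/negative).
train_dataset = Dataset.from_list([
    {"prompt": "Иә мен келемін", "completion": "Иә, мен келемін.", "label": True},
    {"prompt": "Иә мен келемін", "completion": "Иә мен келемін", "label": False},
])

peft_config = LoraConfig(r=32, lora_alpha=64, target_modules="all-linear")
args = KTOConfig(
    output_dir="gec-kto-v4",
    beta=0.1,
    num_train_epochs=1,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # 4 × 8 = 32 effective batch
    bf16=True,
)
trainer = KTOTrainer(model=model, args=args, train_dataset=train_dataset,
                     processing_class=tokenizer, peft_config=peft_config)
trainer.train()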

Citation

@misc{sozkz-gec-qwen-500m-v4,
  author = {Saken Tukenov},
  title = {SozKZ Fix Qwen 500M — Kazakh GEC v4},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v4}
}

Benchmark Results

Evaluated on a 100-example custom GEC test (pure model inference, no pre/post pipeline).

| Category | Score |
|---|---|
| Spelling (емле) | 0/30 (0%) |
| Grammar | 1/20 (5%) |
| Punctuation | 4/15 (27%) |
| Mixed | 0/20 (0%) |
| Identity preservation | 0/15 (0%) |
| Total | 5/100 (5%) |

Leaderboard (100-example custom benchmark)

| Model | Total | Емле/30 | Grammar/20 | Punct/15 | Mixed/20 | Ident/15 |
|---|---|---|---|---|---|---|
| sozkz-core-llama-600m-kk-gec-v1 | 47% | 15 | 12 | 3 | 2 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v3 | 38% | 0 | 16 | 9 | 0 | 13/15 |
| sozkz-core-llama-300m-kk-gec-v4 | 37% | 9 | 6 | 4 | 3 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v1 | 35% | 0 | 12 | 8 | 0 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v2 | 30% | 0 | 11 | 7 | 0 | 12/15 |
| sozkz-core-llama-1b-kk-gec-v1 | 16% | 2 | 6 | 1 | 0 | 7/15 |
| sozkz-fix-qwen-500m-kk-gec-v4 | 5% | 0 | 1 | 4 | 0 | 0/15 |
| sozkz-fix-mt5b-kk-gec-run13-v1 | 5% | 0 | 2 | 0 | 0 | 3/15 |
| sozkz-nllb-1b-kk-gec-v1 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-nllb-1b-kk-pretrain-v1 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v3 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v1/v2a/v2b | 0–1% | 0 | 0 | 0 | 0 | 0–1 |
| sozkz-fix-mt5-50m-kk-gec-v1 | 0% | 0 | 0 | 0 | 0 | 0/15 |