# SozKZ OmniAudio 70M — Kazakh ASR v2

A 69.6M-parameter encoder-decoder ASR model for Kazakh, trained from scratch in two stages: CTC pretraining followed by end-to-end fine-tuning.
## Model Details

| Parameter | Value |
|---|---|
| Parameters | 69.58M |
| Architecture | Custom encoder-decoder with RoPE, RMSNorm, SwiGLU |
| Encoder | 256d / 4 heads / 6 layers / 2 conv layers |
| Decoder | 512d / 8 heads / 8 layers |
| Vocabulary | 50,257 (BPE, kazakh-gpt2-50k) |
| Audio input | 80 mel bins, 16 kHz, max 10 s |
| Training | 2-stage: CTC pretrain → E2E (CE + 0.3×CTC) |
| Precision | bf16 |
| Framework | PyTorch (custom, no HF Transformers dependency) |
## Training

### Stage 1: CTC Pretraining

- Encoder-only training with CTC loss
- Dataset: stukenov/sozkz-asr-mels-kk-v1 (2,100 h of Kazakh speech, pre-computed mels)
- Result: val loss 1.01
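Stage 1 trains only the encoder plus a linear CTC head. A minimal sketch of that objective with `torch.nn.CTCLoss`, using toy shapes — the blank id and head details are assumptions, not published with this card:

```python
import torch
import torch.nn as nn

# Toy dimensions; the real encoder uses d_model=256 and vocab_size=50257.
batch, t_enc, t_tgt, d_model, vocab = 2, 50, 20, 32, 100

# CTC head: linear projection from encoder states to vocab logits.
# Blank id 0 is an assumption here.
ctc_head = nn.Linear(d_model, vocab)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

encoder_out = torch.randn(batch, t_enc, d_model)   # (B, T, D) encoder states
log_probs = ctc_head(encoder_out).log_softmax(-1)  # (B, T, V)
log_probs = log_probs.transpose(0, 1)              # CTCLoss expects (T, B, V)

targets = torch.randint(1, vocab, (batch, t_tgt))  # dummy non-blank token ids
loss = ctc(log_probs, targets,
           torch.full((batch,), t_enc), torch.full((batch,), t_tgt))
```

Because the encoder output is 4× shorter than the mel input, `t_enc` must stay at least as long as the target sequence for CTC to be defined.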
### Stage 2: End-to-End Fine-tuning

- Full encoder-decoder training with CE + 0.3×CTC auxiliary loss
- Label smoothing 0.1; LR 1e-4 with cosine decay
- 10 epochs, batch size 32
- Hardware: single NVIDIA A10 GPU
- Final train loss 2.70, val loss 2.75
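The stage-2 objective combines label-smoothed cross-entropy on the decoder logits with a 0.3-weighted auxiliary CTC term on the encoder logits. A sketch with random placeholder tensors (shapes and blank id are illustrative, not from the repo):

```python
import torch
import torch.nn as nn

batch, t_dec, t_enc, t_tgt, vocab = 2, 12, 40, 10, 100

ce = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing 0.1, as stated
ctc = nn.CTCLoss(blank=0, zero_infinity=True)  # blank id 0 is an assumption

dec_logits = torch.randn(batch, t_dec, vocab)  # decoder output logits
dec_targets = torch.randint(0, vocab, (batch, t_dec))

enc_log_probs = torch.randn(t_enc, batch, vocab).log_softmax(-1)  # (T, B, V)
ctc_targets = torch.randint(1, vocab, (batch, t_tgt))

ce_loss = ce(dec_logits.reshape(-1, vocab), dec_targets.reshape(-1))
ctc_loss = ctc(enc_log_probs, ctc_targets,
               torch.full((batch,), t_enc), torch.full((batch,), t_tgt))

total = ce_loss + 0.3 * ctc_loss  # CE + 0.3 * CTC, per the training recipe
```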
## Evaluation

| Dataset | Samples | WER | CER |
|---|---|---|---|
| FLEURS kk (test) | 100 | 57.22% | 43.17% |
### WER progression during training

| Stage | WER | CER |
|---|---|---|
| E2E from scratch (no CTC pretrain) | 113% | 92.6% |
| CTC pretrain → E2E, epoch 1 | 73.6% | 55.4% |
| Epoch 3 | 59.0% | 42.8% |
| Epoch 10 (final) | 57.2% | 43.2% |
## Usage

```python
import torch

from omniaudio.model_v2 import OmniAudioScratchModel

encoder_config = {"n_mels": 80, "d_model": 256, "n_heads": 4, "n_layers": 6, "n_conv": 2}
decoder_config = {"d_model": 512, "n_heads": 8, "n_layers": 8}

model = OmniAudioScratchModel(
    encoder_config=encoder_config,
    decoder_config=decoder_config,
    vocab_size=50257,
)
state = torch.load("model.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state)
model.eval()

# `mel`: an 80-bin log-mel spectrogram tensor for up to 10 s of 16 kHz audio
tokens = model.generate(mel, max_new_tokens=256, eos_token_id=0)
```
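The card does not publish the exact feature frontend, so here is a plain-PyTorch log-mel sketch matching the stated specs (80 mel bins, 16 kHz, 10 s cap); window, FFT, and hop parameters (`n_fft=400`, `hop=160`) are assumptions:

```python
import math
import torch

def log_mel_spectrogram(wav, sr=16000, n_fft=400, hop=160, n_mels=80):
    # Hann-window power spectrogram: (n_fft // 2 + 1, frames)
    spec = torch.stft(wav, n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft),
                      return_complex=True).abs() ** 2
    # Triangular mel filterbank from 0 Hz to Nyquist.
    hz2mel = lambda f: 2595.0 * math.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = [mel2hz(i * hz2mel(sr / 2) / (n_mels + 1)) for i in range(n_mels + 2)]
    bins = [int((n_fft + 1) * f / sr) for f in pts]
    fb = torch.zeros(n_mels, n_fft // 2 + 1)
    for i in range(n_mels):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for j in range(lo, c):
            fb[i, j] = (j - lo) / max(c - lo, 1)
        for j in range(c, hi):
            fb[i, j] = (hi - j) / max(hi - c, 1)
    return (fb @ spec).clamp(min=1e-10).log()  # (n_mels, frames)

wav = torch.randn(16000 * 3)    # placeholder 3 s clip at 16 kHz
wav = wav[: 16000 * 10]         # enforce the 10 s input limit
mel = log_mel_spectrogram(wav)  # (80, frames)
```

In practice `torchaudio.transforms.MelSpectrogram` does the same thing in one call; the training dataset ships pre-computed mels, so match that pipeline if reproducing results.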
## Architecture

Custom encoder-decoder with no HuggingFace Transformers dependency:

- Encoder: Conv1d stack (stride 2, ×2 = 4× downsampling) → 6 Transformer layers with RoPE, RMSNorm, SwiGLU
- Decoder: 8 Transformer layers with causal self-attention plus cross-attention to the encoder; RoPE, RMSNorm, SwiGLU
- CTC head: linear projection from the encoder, used for the auxiliary CTC loss during training
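The 4× temporal downsampling from the two stride-2 convolutions can be sketched as follows; kernel size, padding, and activation are assumptions, since the card only states "2 conv layers" with stride 2:

```python
import torch
import torch.nn as nn

# Two stride-2 Conv1d layers: each halves the frame count, so 4x overall.
frontend = nn.Sequential(
    nn.Conv1d(80, 256, kernel_size=3, stride=2, padding=1), nn.GELU(),
    nn.Conv1d(256, 256, kernel_size=3, stride=2, padding=1), nn.GELU(),
)

mel = torch.randn(1, 80, 1000)  # (B, n_mels, frames)
feats = frontend(mel)           # (B, d_model, frames / 4)
```

This is why a 10 s clip (roughly 1,000 mel frames at a 10 ms hop) reaches the Transformer layers as only ~250 positions, keeping attention cheap.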
## Limitations

- Max audio length is 10 seconds; longer utterances are truncated
- Struggles with foreign names and rare words
- Numbers are transcribed as words ("2011" → "екі мың он бірінші")
- Long sentences may be truncated by the decoder
- No punctuation or capitalization
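To stay inside the 10-second limit, longer inputs should be cut before feature extraction. A hypothetical helper (not part of the repo) that truncates or zero-pads a waveform to exactly 10 s:

```python
import torch
import torch.nn.functional as F

def pad_or_trim(wav, sr=16000, max_sec=10):
    # Truncate audio past the 10 s limit; zero-pad shorter clips to a fixed length.
    n = sr * max_sec
    if wav.shape[-1] >= n:
        return wav[..., :n]
    return F.pad(wav, (0, n - wav.shape[-1]))
```

For utterances longer than 10 s, splitting on silence and transcribing chunk by chunk avoids silently losing the tail.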
## Citation

```bibtex
@misc{sozkz-omniaudio-70m-v2,
  title={SozKZ OmniAudio 70M: Kazakh ASR v2},
  author={Saken Tukenov},
  year={2026},
  url={https://huggingface.co/stukenov/sozkz-core-omniaudio-70m-kk-asr-v2}
}
```