# SozKZ OmniAudio 70M — Kazakh ASR v2

A 69.6M-parameter encoder-decoder ASR model for Kazakh, trained from scratch in two stages: CTC pretraining followed by end-to-end fine-tuning.
## Model Details

| Parameter | Value |
|---|---|
| Parameters | 69.58M |
| Architecture | Custom encoder-decoder with RoPE, RMSNorm, SwiGLU |
| Encoder | 256d / 4 heads / 6 layers / 2 conv layers |
| Decoder | 512d / 8 heads / 8 layers |
| Vocabulary | 50,257 (BPE, kazakh-gpt2-50k) |
| Audio input | 80 mel bins, 16 kHz, max 10 s |
| Training | 2-stage: CTC pretrain → E2E (CE + 0.3×CTC) |
| Precision | bf16 |
| Framework | PyTorch (custom, no HF Transformers dependency) |
## Training

### Stage 1: CTC Pretraining

- Encoder-only training with CTC loss
- Dataset: stukenov/sozkz-asr-mels-kk-v1 (2,100 h of Kazakh speech, pre-computed mels)
- Result: val loss 1.01
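Stage 1 trains only the encoder plus a linear CTC head. A minimal sketch of that objective with `torch.nn.CTCLoss`, using toy shapes — the blank id and head details are assumptions, not published with this card:

```python
import torch
import torch.nn as nn

# Toy dimensions; the real encoder uses d_model=256 and vocab_size=50257.
batch, t_enc, t_tgt, d_model, vocab = 2, 50, 20, 32, 100

# CTC head: linear projection from encoder states to vocab logits.
# Blank id 0 is an assumption here.
ctc_head = nn.Linear(d_model, vocab)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

encoder_out = torch.randn(batch, t_enc, d_model)   # (B, T, D) encoder states
log_probs = ctc_head(encoder_out).log_softmax(-1)  # (B, T, V)
log_probs = log_probs.transpose(0, 1)              # CTCLoss expects (T, B, V)

targets = torch.randint(1, vocab, (batch, t_tgt))  # dummy non-blank token ids
loss = ctc(log_probs, targets,
           torch.full((batch,), t_enc), torch.full((batch,), t_tgt))
```

Because the encoder output is 4× shorter than the mel input, `t_enc` must stay at least as long as the target sequence for CTC to be defined.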
### Stage 2: End-to-End Fine-tuning

- Full encoder-decoder training with CE + 0.3×CTC auxiliary loss
- Label smoothing 0.1; LR 1e-4 with cosine decay
- 10 epochs, batch size 32
- Hardware: single NVIDIA A10 GPU
- Final train loss 2.70, val loss 2.75
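The stage-2 objective combines label-smoothed cross-entropy on the decoder logits with a 0.3-weighted auxiliary CTC term on the encoder logits. A sketch with random placeholder tensors (shapes and blank id are illustrative, not from the repo):

```python
import torch
import torch.nn as nn

batch, t_dec, t_enc, t_tgt, vocab = 2, 12, 40, 10, 100

ce = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing 0.1, as stated
ctc = nn.CTCLoss(blank=0, zero_infinity=True)  # blank id 0 is an assumption

dec_logits = torch.randn(batch, t_dec, vocab)  # decoder output logits
dec_targets = torch.randint(0, vocab, (batch, t_dec))

enc_log_probs = torch.randn(t_enc, batch, vocab).log_softmax(-1)  # (T, B, V)
ctc_targets = torch.randint(1, vocab, (batch, t_tgt))

ce_loss = ce(dec_logits.reshape(-1, vocab), dec_targets.reshape(-1))
ctc_loss = ctc(enc_log_probs, ctc_targets,
               torch.full((batch,), t_enc), torch.full((batch,), t_tgt))

total = ce_loss + 0.3 * ctc_loss  # CE + 0.3 * CTC, per the training recipe
```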
## Evaluation

| Dataset | Samples | WER | CER |
|---|---|---|---|
| FLEURS kk (test) | 100 | 57.22% | 43.17% |
### WER progression during training

| Stage | WER | CER |
|---|---|---|
| E2E from scratch (no CTC pretrain) | 113% | 92.6% |
| CTC pretrain → E2E, epoch 1 | 73.6% | 55.4% |
| Epoch 3 | 59.0% | 42.8% |
| Epoch 10 (final) | 57.2% | 43.2% |
## Usage

```python
import torch

from omniaudio.model_v2 import OmniAudioScratchModel

encoder_config = {"n_mels": 80, "d_model": 256, "n_heads": 4, "n_layers": 6, "n_conv": 2}
decoder_config = {"d_model": 512, "n_heads": 8, "n_layers": 8}

model = OmniAudioScratchModel(
    encoder_config=encoder_config,
    decoder_config=decoder_config,
    vocab_size=50257,
)
state = torch.load("model.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state)
model.eval()

# `mel`: an 80-bin log-mel spectrogram tensor for up to 10 s of 16 kHz audio
tokens = model.generate(mel, max_new_tokens=256, eos_token_id=0)
```
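The card does not publish the exact feature frontend, so here is a plain-PyTorch log-mel sketch matching the stated specs (80 mel bins, 16 kHz, 10 s cap); window, FFT, and hop parameters (`n_fft=400`, `hop=160`) are assumptions:

```python
import math
import torch

def log_mel_spectrogram(wav, sr=16000, n_fft=400, hop=160, n_mels=80):
    # Hann-window power spectrogram: (n_fft // 2 + 1, frames)
    spec = torch.stft(wav, n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft),
                      return_complex=True).abs() ** 2
    # Triangular mel filterbank from 0 Hz to Nyquist.
    hz2mel = lambda f: 2595.0 * math.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = [mel2hz(i * hz2mel(sr / 2) / (n_mels + 1)) for i in range(n_mels + 2)]
    bins = [int((n_fft + 1) * f / sr) for f in pts]
    fb = torch.zeros(n_mels, n_fft // 2 + 1)
    for i in range(n_mels):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for j in range(lo, c):
            fb[i, j] = (j - lo) / max(c - lo, 1)
        for j in range(c, hi):
            fb[i, j] = (hi - j) / max(hi - c, 1)
    return (fb @ spec).clamp(min=1e-10).log()  # (n_mels, frames)

wav = torch.randn(16000 * 3)    # placeholder 3 s clip at 16 kHz
wav = wav[: 16000 * 10]         # enforce the 10 s input limit
mel = log_mel_spectrogram(wav)  # (80, frames)
```

In practice `torchaudio.transforms.MelSpectrogram` does the same thing in one call; the training dataset ships pre-computed mels, so match that pipeline if reproducing results.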
## Architecture

Custom encoder-decoder with no HuggingFace Transformers dependency:

- Encoder: Conv1d stack (stride 2, ×2 = 4× downsampling) → 6 Transformer layers with RoPE, RMSNorm, SwiGLU
- Decoder: 8 Transformer layers with causal self-attention plus cross-attention to the encoder; RoPE, RMSNorm, SwiGLU
- CTC head: linear projection from the encoder, used for the auxiliary CTC loss during training
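The 4× temporal downsampling from the two stride-2 convolutions can be sketched as follows; kernel size, padding, and activation are assumptions, since the card only states "2 conv layers" with stride 2:

```python
import torch
import torch.nn as nn

# Two stride-2 Conv1d layers: each halves the frame count, so 4x overall.
frontend = nn.Sequential(
    nn.Conv1d(80, 256, kernel_size=3, stride=2, padding=1), nn.GELU(),
    nn.Conv1d(256, 256, kernel_size=3, stride=2, padding=1), nn.GELU(),
)

mel = torch.randn(1, 80, 1000)  # (B, n_mels, frames)
feats = frontend(mel)           # (B, d_model, frames / 4)
```

This is why a 10 s clip (roughly 1,000 mel frames at a 10 ms hop) reaches the Transformer layers as only ~250 positions, keeping attention cheap.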
## Limitations

- Max audio length is 10 seconds; longer utterances are truncated
- Struggles with foreign names and rare words
- Numbers are transcribed as words ("2011" → "екі мың он бірінші")
- Long sentences may be truncated by the decoder
- No punctuation or capitalization
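To stay inside the 10-second limit, longer inputs should be cut before feature extraction. A hypothetical helper (not part of the repo) that truncates or zero-pads a waveform to exactly 10 s:

```python
import torch
import torch.nn.functional as F

def pad_or_trim(wav, sr=16000, max_sec=10):
    # Truncate audio past the 10 s limit; zero-pad shorter clips to a fixed length.
    n = sr * max_sec
    if wav.shape[-1] >= n:
        return wav[..., :n]
    return F.pad(wav, (0, n - wav.shape[-1]))
```

For utterances longer than 10 s, splitting on silence and transcribing chunk by chunk avoids silently losing the tail.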
## Citation

```bibtex
@misc{sozkz-omniaudio-70m-v2,
  title={SozKZ OmniAudio 70M: Kazakh ASR v2},
  author={Saken Tukenov},
  year={2026},
  url={https://huggingface.co/stukenov/sozkz-core-omniaudio-70m-kk-asr-v2}
}
```