
SozKZ OmniAudio 70M — Kazakh ASR v2

A 69.58M-parameter encoder-decoder ASR model for Kazakh, trained from scratch using a two-stage recipe: CTC pretraining of the encoder followed by end-to-end fine-tuning.

Model Details

| Parameter | Value |
|---|---|
| Parameters | 69.58M |
| Architecture | Custom encoder-decoder with RoPE, RMSNorm, SwiGLU |
| Encoder | 256d / 4 heads / 6 layers / 2 conv layers |
| Decoder | 512d / 8 heads / 8 layers |
| Vocabulary | 50,257 (BPE, kazakh-gpt2-50k) |
| Audio input | 80 mel bins, 16 kHz, max 10 s |
| Training | 2-stage: CTC pretrain → E2E (CE + 0.3×CTC) |
| Precision | bf16 |
| Framework | PyTorch (custom, no HF Transformers dependency) |

Training

Stage 1: CTC Pretraining

  • Encoder-only training with CTC loss
  • Dataset: stukenov/sozkz-asr-mels-kk-v1 (2,100h Kazakh speech, pre-computed mels)
  • Result: val loss 1.01
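The encoder-only CTC objective can be sketched with PyTorch's built-in loss. The shapes, the toy vocabulary size, and the blank-token index below are illustrative assumptions, not taken from the training code:

```python
import torch
import torch.nn as nn

vocab = 100                       # toy size for the sketch; the real BPE vocab is 50,257
blank_id = vocab                  # assumption: blank class appended after the vocab
T, B = 250, 2                     # ~10 s of audio -> 1000 mel frames / 4x downsampling

# Encoder outputs projected to (time, batch, vocab+1) log-probabilities
log_probs = torch.randn(T, B, vocab + 1).log_softmax(-1)
targets = torch.randint(0, vocab, (B, 30))          # token targets, blank excluded

ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
loss = ctc(log_probs, targets,
           torch.full((B,), T), torch.full((B,), 30))
```

`zero_infinity=True` is a common safeguard when a target sequence is longer than the downsampled frame sequence; whether the project uses it is an assumption.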

Stage 2: End-to-End Fine-tuning

  • Full encoder-decoder training with CE + 0.3×CTC auxiliary loss
  • Label smoothing: 0.1, LR: 1e-4 with cosine decay
  • 10 epochs, batch size 32
  • Hardware: NVIDIA A10 (single GPU)
  • Final train loss: 2.70, val loss: 2.75
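The Stage-2 objective (cross-entropy with label smoothing 0.1 plus a 0.3-weighted auxiliary CTC term) can be sketched as below; the function name, tensor shapes, and blank index are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def e2e_loss(decoder_logits, decoder_targets, ctc_log_probs,
             ctc_targets, input_lengths, target_lengths, blank_id):
    # CE over decoder outputs: (B, T_dec, V) -> (B, V, T_dec) for cross_entropy
    ce = F.cross_entropy(decoder_logits.transpose(1, 2), decoder_targets,
                         label_smoothing=0.1)
    # Auxiliary CTC over encoder outputs: (T_enc, B, V+1) log-probs
    ctc = F.ctc_loss(ctc_log_probs, ctc_targets, input_lengths, target_lengths,
                     blank=blank_id, zero_infinity=True)
    return ce + 0.3 * ctc

# Dummy tensors just to exercise the function (toy vocab of 1000 classes)
B, T_dec, T_enc, V = 2, 20, 100, 1000
loss = e2e_loss(
    torch.randn(B, T_dec, V),
    torch.randint(0, V, (B, T_dec)),
    torch.randn(T_enc, B, V + 1).log_softmax(-1),
    torch.randint(0, V, (B, 15)),
    torch.full((B,), T_enc), torch.full((B,), 15),
    blank_id=V,
)
```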

Evaluation

| Dataset | Samples | WER | CER |
|---|---|---|---|
| FLEURS kk (test) | 100 | 57.22% | 43.17% |

WER progression during training

| Stage | WER | CER |
|---|---|---|
| E2E from scratch (no CTC pretrain) | 113% | 92.6% |
| CTC pretrain → E2E, epoch 1 | 73.6% | 55.4% |
| Epoch 3 | 59.0% | 42.8% |
| Epoch 10 (final) | 57.2% | 43.2% |
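For reference, the WER figures in these tables are word-level edit distances; CER is the same computation over characters. This helper is a generic sketch, not the project's evaluation script:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))          # edit distance from empty reference
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i             # prev holds the diagonal cell
        for j, hw in enumerate(h, 1):
            prev, d[j] = d[j], min(d[j] + 1,            # deletion
                                   d[j - 1] + 1,        # insertion
                                   prev + (rw != hw))   # substitution
    return d[len(h)] / max(len(r), 1)

print(wer("екі мың он бірінші", "екі мың бірінші"))  # one deletion over 4 words
```

Note that WER can exceed 100% (as in the from-scratch row) when the hypothesis contains more insertions than the reference has words.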

Usage

```python
import torch
from omniaudio.model_v2 import OmniAudioScratchModel

# Load model
encoder_config = {"n_mels": 80, "d_model": 256, "n_heads": 4, "n_layers": 6, "n_conv": 2}
decoder_config = {"d_model": 512, "n_heads": 8, "n_layers": 8}
model = OmniAudioScratchModel(encoder_config=encoder_config, decoder_config=decoder_config, vocab_size=50257)

state = torch.load("model.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state)
model.eval()

# Inference on a log-mel spectrogram of shape (1, 80, time_frames)
mel = torch.randn(1, 80, 1000)  # placeholder; use real 80-bin log-mel features
tokens = model.generate(mel, max_new_tokens=256, eos_token_id=0)
```

Architecture

Custom encoder-decoder, no HuggingFace Transformers dependency:

  • Encoder: two Conv1d layers with stride 2 (4× total temporal downsampling) → 6 Transformer layers with RoPE, RMSNorm, SwiGLU
  • Decoder: 8 Transformer layers with causal self-attention + cross-attention to encoder, RoPE, RMSNorm, SwiGLU
  • CTC head: Linear projection from encoder for auxiliary CTC loss during training
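The convolutional front-end described above can be sketched as two stride-2 Conv1d layers mapping 80 mel bins to the 256-dimensional encoder width; the kernel size, padding, and activation are assumptions the card does not state:

```python
import torch
import torch.nn as nn

# Sketch of the encoder front-end: 2 × stride-2 Conv1d = 4x downsampling
frontend = nn.Sequential(
    nn.Conv1d(80, 256, kernel_size=3, stride=2, padding=1),
    nn.GELU(),
    nn.Conv1d(256, 256, kernel_size=3, stride=2, padding=1),
    nn.GELU(),
)

mel = torch.randn(1, 80, 1000)   # ~10 s of audio at a 10 ms hop
feats = frontend(mel)            # (1, 256, 250): 4x fewer frames
```

The 4× downsampling matters for the auxiliary CTC head: it shortens the frame sequence the CTC loss must align, which is why the input lengths passed to the loss are the post-convolution frame counts.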

Limitations

  • Max audio length: 10 seconds (longer utterances get truncated)
  • Struggles with foreign names and rare words
  • Numbers are transcribed as words ("2011" → "екі мың он бірінші", "two thousand eleventh")
  • Long sentences may get truncated by the decoder
  • No punctuation or capitalization

Citation

@misc{sozkz-omniaudio-70m-v2,
  title={SozKZ OmniAudio 70M: Kazakh ASR v2},
  author={Saken Tukenov},
  year={2026},
  url={https://huggingface.co/stukenov/sozkz-core-omniaudio-70m-kk-asr-v2}
}