Whisper-Podlodka-Turbo - CoreML / WhisperKit

A CoreML conversion of bond005/whisper-podlodka-turbo, packaged for the WhisperKit runtime. The model runs end-to-end on the Apple Neural Engine (ANE) on Apple Silicon Macs, iPhone, and iPad.

The upstream model is Ivan Bondarenko's Russian-focused fine-tune of openai/whisper-large-v3-turbo, with improved noise robustness and reduced non-speech hallucinations. This repository contains only the converted weights - no architectural or training changes.

Files

| File | Size | Purpose |
|---|---|---|
| MelSpectrogram.mlmodelc | ~400 KB | Audio preprocessing (log-mel filterbank) |
| AudioEncoder.mlmodelc | ~1.2 GB | 32-layer encoder, FP16 |
| TextDecoder.mlmodelc | ~330 MB | 4-layer turbo decoder, FP16 |
| config.json | - | Hugging Face Whisper config, inherited from the base model |
| generation_config.json | - | Generation defaults, inherited from the base model |

All three .mlmodelc directories are pre-compiled MLProgram assets ready for direct use by WhisperKit. No additional compile step is required.
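Before handing a downloaded folder to WhisperKit, it can help to confirm that all three compiled model directories are present. The following is a minimal sketch using only Foundation; the helper name `missingModels(in:)` is hypothetical and not part of WhisperKit.

```swift
import Foundation

// The three compiled Core ML directories listed in the Files table.
let requiredModels = [
    "MelSpectrogram.mlmodelc",
    "AudioEncoder.mlmodelc",
    "TextDecoder.mlmodelc",
]

/// Returns the names of required model directories that are absent from `folder`.
/// (Hypothetical helper for illustration; not a WhisperKit API.)
func missingModels(in folder: URL) -> [String] {
    requiredModels.filter { name in
        var isDirectory: ObjCBool = false
        let path = folder.appendingPathComponent(name).path
        let exists = FileManager.default.fileExists(atPath: path, isDirectory: &isDirectory)
        // A .mlmodelc asset is a directory, so a plain file with that name does not count.
        return !(exists && isDirectory.boolValue)
    }
}
```

An empty result means the folder contains everything the table lists as a compiled model; the two JSON files are not checked here.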

Architecture

Inherited from the base model (Whisper Large v3 Turbo):

| Hyperparameter | Value |
|---|---|
| Encoder layers | 32 |
| Decoder layers | 4 |
| Hidden size (d_model) | 1280 |
| Attention heads (enc / dec) | 20 / 20 |
| Mel bins | 128 |
| Vocabulary | 51866 |
| Max source positions | 1500 (30 s @ 16 kHz) |
| Max target positions | 448 |
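The hyperparameters above can be cross-checked with a little arithmetic. The sketch below derives the 1500 source positions from a 30-second window and gives a rough FP16 size estimate for the encoder and decoder. It assumes the standard Whisper front end (160-sample hop, stride-2 convolution) and a feed-forward width of 4 × d_model, none of which are stated in this card. It also ignores biases, layer norms, the convolutional stem, and positional embeddings, so treat the result as an order-of-magnitude check only.

```swift
// Hyperparameters from the architecture table.
let dModel = 1280
let encoderLayers = 32
let decoderLayers = 4
let vocabulary = 51_866

// Assumptions (not stated in this card): standard Whisper front end and block layout.
let sampleRate = 16_000   // Hz
let hopLength = 160       // samples between mel frames
let convStride = 2        // stride of the encoder's second convolution
let ffnMultiplier = 4     // feed-forward width = 4 * d_model

/// Encoder positions produced by `seconds` of audio.
func sourcePositions(seconds: Int) -> Int {
    let melFrames = seconds * sampleRate / hopLength
    return melFrames / convStride
}

/// Rough weight count: each attention block has 4 d^2 weights (Q, K, V, output),
/// and the feed-forward block has 2 * ffnMultiplier * d^2.
func approximateParameters(layers: Int, attentionBlocksPerLayer: Int, embeddingRows: Int = 0) -> Int {
    let dSquared = dModel * dModel
    let perLayer = (4 * attentionBlocksPerLayer + 2 * ffnMultiplier) * dSquared
    return layers * perLayer + embeddingRows * dModel
}

// Encoder layers have self-attention only; decoder layers add cross-attention
// and the decoder also holds the token embedding matrix.
let encoderParams = approximateParameters(layers: encoderLayers, attentionBlocksPerLayer: 1)
let decoderParams = approximateParameters(layers: decoderLayers, attentionBlocksPerLayer: 2, embeddingRows: vocabulary)

// FP16 stores each weight in 2 bytes.
let bytesPerWeight = 2
print("source positions for 30 s:", sourcePositions(seconds: 30))
print("encoder ≈", Double(encoderParams * bytesPerWeight) / 1e9, "GB")
print("decoder ≈", Double(decoderParams * bytesPerWeight) / 1e9, "GB")
```

Under these assumptions the estimate comes to about 1.26 GB for the encoder and 0.34 GB for the decoder, in line with the approximate file sizes listed under Files.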

Conversion

Converted with argmaxinc/whisperkittools 0.4.2:

whisperkit-generate-model \
  --model-version bond005/whisper-podlodka-turbo \
  --output-dir ./out \
  --generate-decoder-context-prefill-data

Conversion settings:

  • Conversion environment: Python 3.12, torch==2.5.0, coremltools==9.0, transformers==4.53
  • Compute precision: FP16 across all three components
  • Decoder SDPA implementation: Cat (default)
  • Audio encoder SDPA implementation: SplitHeadsQ (default)
  • Decoder context prefill data: enabled (pre-computes the KV cache for the first 3 forced tokens to reduce time-to-first-token)
  • Minimum deployment target: macOS 14 / iOS 17

Usage with WhisperKit (Swift)

import WhisperKit

let folder = URL(fileURLWithPath: "/path/to/whisper-podlodka-turbo-coreml")
let pipe = try await WhisperKit(modelFolder: folder.path)
let result = try await pipe.transcribe(audioPath: "/path/to/audio.wav")
print(result?.text ?? "")

The tokenizer is the standard Whisper Large v3 tokenizer (vocabulary size 51866). If it is not present locally, WhisperKit fetches it from openai/whisper-large-v3. Russian is the recommended decoding language for this model.
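Since Russian is the recommended decoding language, it may be useful to pin it explicitly instead of relying on defaults. The sketch below passes a `DecodingOptions` value with `language: "ru"` to `transcribe`. It assumes a WhisperKit version in which `transcribe(audioPath:decodeOptions:)` returns an array of results; older releases returned a single optional result, in which case the last line would need adjusting. The wrapper function `transcribeRussian` is a hypothetical name used for illustration.

```swift
import WhisperKit

/// Transcribes one audio file with the decoding language pinned to Russian.
/// (Hypothetical wrapper for illustration; not part of WhisperKit.)
func transcribeRussian(audioPath: String, modelFolder: String) async throws -> String {
    let pipe = try await WhisperKit(modelFolder: modelFolder)

    // "ru" is the Whisper language code for Russian;
    // `.transcribe` keeps the output in the source language.
    let options = DecodingOptions(task: .transcribe, language: "ru")

    // Assumption: this WhisperKit version returns an array of results.
    let results = try await pipe.transcribe(audioPath: audioPath, decodeOptions: options)
    return results.map(\.text).joined(separator: " ")
}
```

Called from an async context, for example `let text = try await transcribeRussian(audioPath: "/path/to/audio.wav", modelFolder: folder.path)`.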

Performance

End-to-end ANE execution on an M-series Mac yields realtime factors well above 1.0x, meaning audio is transcribed faster than it plays back. The first inference compiles ANE-specific kernels and can take noticeably longer; subsequent inferences reuse the cached compilation and are fast.
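To make the realtime factor concrete, the sketch below treats it as seconds of audio processed per second of wall-clock time, which is the reading implied by "above 1.0x" being fast. The function name and the sample numbers are illustrative only and are not measurements of this model.

```swift
import Foundation

/// Realtime factor as a speed multiple: seconds of audio per second of processing.
/// Values above 1.0 mean transcription finishes faster than the audio plays back.
/// (Hypothetical helper for illustration.)
func realtimeFactor(audioSeconds: Double, processingSeconds: Double) -> Double {
    precondition(processingSeconds > 0, "processing time must be positive")
    return audioSeconds / processingSeconds
}

// Illustrative numbers only: a 60 s clip processed in 4 s.
let example = realtimeFactor(audioSeconds: 60, processingSeconds: 4)
print("realtime factor:", example)
```

When benchmarking, it is reasonable to time the first run separately from later runs, since the first one includes the one-time ANE compilation described above.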

Languages

Primary: Russian. Secondary: English. The base fine-tune preserves the multilingual capability of Whisper Large v3 Turbo but is optimized for Russian ASR and Russian/English speech translation.

Limitations

  • WhisperKit's pipeline currently uses the same scaffolding as standard Whisper Large v3 Turbo, so any quality differences between this fine-tune and the base Turbo model come from the upstream weights alone.
  • Translation behavior is inherited from the upstream fine-tune. Refer to the upstream model card for translation quality notes.
  • For evaluation numbers (WER on Common Voice, RuLibriSpeech, Golos, SOVA RuDevices, Podlodka Speech, plus noise-robust and long-form benchmarks) see the upstream model card.

License

Apache 2.0, inherited from the base model.

Credits

Citation

For the base fine-tune, cite the upstream model:

@misc{whisper-podlodka-turbo,
  author = {Ivan Bondarenko},
  title = {Whisper-Podlodka-Turbo: Enhanced Whisper Model for Russian ASR},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/bond005/whisper-podlodka-turbo}}
}