# 🎙️ Qwen3-ASR-1.7B-NL — Dutch Speech Recognition
A Dutch-specialised automatic speech recognition (ASR) model, fine-tuned from Qwen/Qwen3-ASR-1.7B. It outputs cased, punctuated Dutch text and works as a drop-in replacement for the base model.
On Common Voice 22 (test) it reaches 5.28% WER.
## 📊 Results
WER and CER on held-out Common Voice test sets — same samples, same protocol, no test-time tricks. "Zero-shot" is the base Qwen/Qwen3-ASR-1.7B called with `language="Dutch"`. The fine-tuned numbers are bold.
| Test set | Samples | Zero-shot WER | Fine-tuned WER | Δ WER | CER (zero-shot → fine-tuned) |
|---|---|---|---|---|---|
| Common Voice 17 (test) | 11,266 | — | **5.37** | — | — → **1.69** |
| Common Voice 22 (test) | 12,033 | — | **5.28** | — | — → **1.68** |
Lower is better. Both held-out test sets see roughly a one-third relative reduction in word error rate versus the already-strong base model.
🔬 Reproducibility note. Both the zero-shot baseline and the fine-tuned numbers above were measured with the same evaluation function (`train_qwen3_asr.evaluate_model`), the same greedy decoding settings, and the same reference normalisation (see next section). This is an apples-to-apples comparison.
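For context, WER and CER are Levenshtein edit distances over words and characters respectively, divided by the reference length. A minimal self-contained sketch of the metrics (illustrative only; the repo's `evaluate_model` is the authoritative implementation):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (tokens or characters)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance(ref[:0], hyp[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over whitespace-split tokens."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over characters."""
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, one substituted word in a three-word reference gives a WER of 1/3.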
## 🧹 Reference / target normalisation
Common Voice transcripts are crowd-sourced and inconsistent in casing and trailing punctuation. To give the model a clean, predictable target distribution we apply a small, deterministic written-form normalisation to every reference at load time, both during training and during evaluation:
- Capitalise the first letter if it is lowercase.
- Collapse trailing dots: any sequence of `.`, `..`, `...`, or `…` at the end is replaced with a single `.`.
- Append a terminal period if the sentence does not already end in terminal punctuation (`.`, `!`, `?`, `…`) or a closing bracket / quote (`)`, `]`, `}`, `"`, `'`, etc.).
The exact function lives in `src/evaluation/score_written_form.py` of the project repository. Concretely:
| Raw reference | Normalised |
|---|---|
| `bom dia` | `Bom dia.` |
| `o gato dorme...` | `O gato dorme.` |
| `como estás?` | `Como estás?` (unchanged) |
| `"oi"` | `"Oi"` (closing quote → no `.`) |
Because the same normalisation is applied to references used for the zero-shot baseline above, the gain reported in the results table reflects the fine-tune itself — not a metric quirk caused by mismatched references.
## 🚀 How to use
Install the official qwen-asr package, then load this model exactly the
same way you would load the base Qwen3-ASR:
```shell
pip install qwen-asr
```

```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "yuriyvnv/Qwen3-ASR-1.7B-NL",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

result = model.transcribe(audio="audio.wav", language="Dutch")
print(result[0].text)
```
Batch inference, automatic language detection, streaming, and vLLM serving all work identically to the base model — see the upstream Qwen3-ASR documentation for details.
## 🛠️ Training
Dataset: yuriyvnv/synthetic_transcript_nl + fsicoli/common_voice_22_0 (nl) — the full synthetic OpenAI-TTS Dutch corpus (~34.9k clips) concatenated with the Common Voice 22 Dutch train split, shuffled with the run seed. After duration filtering and transcript-length filtering: 74,898 training samples and 12,032 validation samples.
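A minimal sketch of the kind of per-example filter involved (the thresholds here are illustrative assumptions, not the run's actual limits, which live in the training script):

```python
def keep_example(example, min_s=0.5, max_s=30.0, max_chars=400):
    """Duration and transcript-length filter for one dataset row.

    Thresholds are hypothetical; the real limits are set in the
    training configuration.
    """
    duration = len(example["audio"]["array"]) / example["audio"]["sampling_rate"]
    return (min_s <= duration <= max_s
            and 0 < len(example["sentence"].strip()) <= max_chars)
```

With 🤗 `datasets`, a predicate like this would be passed to `dataset.filter(keep_example)`.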
Recipe: follows the official QwenLM SFT recipe with our local hyperparameters:
| Parameter | Value |
|---|---|
| Learning rate | 2e-05 |
| Scheduler | linear |
| Warmup ratio | 0.02 |
| Per-device batch size | 92 |
| Gradient accumulation | 2 |
| Effective batch size | 184 |
| Epochs | 5.0 |
| Precision | bf16 mixed |
| Gradient checkpointing | enabled |
| Optimizer | AdamW (fused) |
Trained on a single H100. The best checkpoint was selected by validation loss.
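As a rough sketch, the table maps onto Hugging Face `TrainingArguments` as below (illustrative only; the actual run uses the upstream QwenLM SFT scripts, and `output_dir` is a placeholder):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="qwen3-asr-1.7b-nl",        # placeholder path
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.02,
    per_device_train_batch_size=92,
    gradient_accumulation_steps=2,         # effective batch size: 92 * 2 = 184
    num_train_epochs=5.0,
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
)
```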
All training scripts and evaluation code are available on GitHub.
## 🙏 Acknowledgements
This model would not exist without the work of others. Thank you to:
- The Qwen team at Alibaba Cloud for releasing Qwen3-ASR-1.7B — the backbone of this fine-tune — together with a clean, reproducible SFT recipe and the Qwen3-ASR Technical Report.
- The Mozilla Common Voice community for collecting and releasing the Dutch speech corpus used for training and evaluation (Common Voice 22, Common Voice 17 mirror).
- Every contributor who recorded, validated, or transcribed a clip in Common Voice. This model is, very literally, your voices.
## 📚 Citation
If this model is useful in your work, please cite the base Qwen3-ASR report:
```bibtex
@article{qwen3asr2025,
  title  = {Qwen3-ASR Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://arxiv.org/abs/2601.21337}
}
```