🎙️ Qwen3-ASR-1.7B-NL — Dutch Speech Recognition

1.7B parameters · Speech-to-Text · Dutch · Automatic Speech Recognition · bf16 · Apache-2.0

A Dutch-specialised automatic speech recognition (ASR) model, fine-tuned from Qwen/Qwen3-ASR-1.7B. It outputs cased, punctuated Dutch text and works as a drop-in replacement for the base model.

On Common Voice 22 (test) it reaches 5.28% WER.


📊 Results

WER and CER on held-out Common Voice test sets — same samples, same protocol, no test-time tricks. "Zero-shot" is the base Qwen/Qwen3-ASR-1.7B called with language="Dutch". The fine-tuned numbers are bold.

| Test set | Samples | Zero-shot WER | Fine-tuned WER | Δ WER | CER (zero-shot → fine-tuned) |
|---|---|---|---|---|---|
| Common Voice 17 (test) | 11,266 | — | **5.37** | — | → 1.69 |
| Common Voice 22 (test) | 12,033 | — | **5.28** | — | → 1.68 |

Lower is better. Both held-out test sets see roughly a one-third relative reduction in word error rate versus the already-strong base model.

🔬 Reproducibility note. Both the zero-shot baseline and the fine-tuned numbers above were measured with the same evaluation function (train_qwen3_asr.evaluate_model), the same greedy decoding settings, and the same reference normalisation (see next section). This is an apples-to-apples comparison.
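For reference, the WER reported above is word-level edit distance divided by the number of reference words. The project uses its own scorer (`train_qwen3_asr.evaluate_model`); the sketch below is only a minimal self-contained illustration of the metric itself:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # min over deletion, insertion, and (mis)match.
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / len(ref)

# One substitution ("zit" → "zat") in a six-word reference.
print(wer("de kat zit op de mat", "de kat zat op de mat"))
```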

🧹 Reference / target normalisation

Common Voice transcripts are crowd-sourced and inconsistent in casing and trailing punctuation. To give the model a clean, predictable target distribution we apply a small, deterministic written-form normalisation to every reference at load time, both during training and during evaluation:

  1. Capitalise the first letter if it is lowercase.
  2. Collapse trailing dots: any trailing run of periods (`.`, `..`, `...`) is replaced with a single `.`.
  3. Append a terminal period if the sentence does not already end in terminal punctuation (`.` `!` `?` `…`) or a closing bracket / quote (`)` `]` `}` `"` `'` etc.).
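A minimal Python sketch of those three rules. The authoritative implementation lives in the project repository; this reconstruction is for illustration only:

```python
import re

TERMINAL = ".!?…"      # terminal punctuation
CLOSERS = ")]}\"'"     # closing brackets / quotes

def normalize_reference(text: str) -> str:
    """Written-form normalisation of a reference transcript (sketch)."""
    text = text.strip()
    if not text:
        return text
    # 1. Capitalise the first letter if it is lowercase
    #    (skipping any opening quote or bracket).
    for i, ch in enumerate(text):
        if ch.isalpha():
            if ch.islower():
                text = text[:i] + ch.upper() + text[i + 1:]
            break
    # 2. Collapse any run of trailing dots into a single period.
    text = re.sub(r"\.+$", ".", text)
    # 3. Append a period unless the text already ends in terminal
    #    punctuation or a closing bracket / quote.
    if text[-1] not in TERMINAL and text[-1] not in CLOSERS:
        text += "."
    return text
```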

The exact function lives in src/evaluation/score_written_form.py of the project repository. Concretely:

| Raw reference | Normalised |
|---|---|
| `bom dia` | `Bom dia.` |
| `o gato dorme...` | `O gato dorme.` |
| `como estás?` | `Como estás?` (unchanged) |
| `"oi"` | `"Oi"` (closing quote → no `.`) |

Because the same normalisation is applied to references used for the zero-shot baseline above, the gain reported in the results table reflects the fine-tune itself — not a metric quirk caused by mismatched references.

🚀 How to use

Install the official qwen-asr package, then load this model exactly the same way you would load the base Qwen3-ASR:

```bash
pip install qwen-asr
```

```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "yuriyvnv/Qwen3-ASR-1.7B-NL",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

result = model.transcribe(audio="audio.wav", language="Dutch")
print(result[0].text)
```

Batch inference, automatic language detection, streaming, and vLLM serving all work identically to the base model — see the upstream Qwen3-ASR documentation for details.

🛠️ Training

Dataset: yuriyvnv/synthetic_transcript_nl + fsicoli/common_voice_22_0 (nl) — the full synthetic OpenAI-TTS Dutch corpus (~34.9k clips) concatenated with the Common Voice 22 Dutch train split, shuffled with the run seed. After duration filtering and transcript-length filtering: 74,898 training samples and 12,032 validation samples.
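The duration and transcript-length filtering can be sketched as a simple per-sample predicate. Note that the thresholds below are illustrative assumptions, not the values used in the actual run:

```python
def keep_sample(duration_s: float, transcript: str,
                min_dur: float = 0.5, max_dur: float = 30.0,
                max_chars: int = 400) -> bool:
    """Keep a (clip, transcript) pair only if both pass basic sanity bounds.

    All thresholds here are placeholders; the real values live in the
    training configuration of the project repository.
    """
    text = transcript.strip()
    return bool(text) and min_dur <= duration_s <= max_dur and len(text) <= max_chars

print(keep_sample(4.2, "Dit is een testzin."))  # within all bounds
print(keep_sample(95.0, "Te lange opname."))    # exceeds max duration
```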

Recipe: follows the official QwenLM SFT recipe with our local hyperparameters:

| Parameter | Value |
|---|---|
| Learning rate | 2e-05 |
| Scheduler | linear |
| Warmup ratio | 0.02 |
| Per-device batch size | 92 |
| Gradient accumulation | 2 |
| Effective batch size | 184 |
| Epochs | 5.0 |
| Precision | bf16 mixed |
| Gradient checkpointing | enabled |
| Optimizer | AdamW (fused) |
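The effective batch size is simply the product of the per-device batch size and the gradient-accumulation steps (single GPU, so there is no data-parallel factor):

```python
per_device_batch = 92
grad_accum_steps = 2
num_gpus = 1  # single H100

effective_batch = per_device_batch * grad_accum_steps * num_gpus
print(effective_batch)  # 184, matching the table above
```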

Trained on a single H100. The best checkpoint was selected by validation loss.

All training scripts and evaluation code are available at: github

🙏 Acknowledgements

This model would not exist without the work of others. Thank you to:

  • The Qwen team at Alibaba Cloud for releasing Qwen3-ASR-1.7B — the backbone of this fine-tune — together with a clean, reproducible SFT recipe and the Qwen3-ASR Technical Report.
  • The Mozilla Common Voice community for collecting and releasing the Dutch speech corpus used for training and evaluation (Common Voice 22, Common Voice 17 mirror).
  • Every contributor who recorded, validated, or transcribed a clip in Common Voice. This model is, very literally, your voices.

📚 Citation

If this model is useful in your work, please cite the base Qwen3-ASR report:

```bibtex
@article{qwen3asr2025,
  title  = {Qwen3-ASR Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://arxiv.org/abs/2601.21337}
}
```