🎙️ Qwen3-ASR-1.7B-NL — Dutch Speech Recognition

1.7B parameters · Speech-to-Text · Dutch · Automatic Speech Recognition · bf16 · Apache-2.0

A Dutch-specialised automatic speech recognition (ASR) model, fine-tuned from Qwen/Qwen3-ASR-1.7B. It outputs cased, punctuated Dutch text and works as a drop-in replacement for the base model.

On Common Voice 22 (test) it reaches 5.28% WER.


📊 Results

WER and CER on held-out Common Voice test sets — same samples, same protocol, no test-time tricks. "Zero-shot" is the base Qwen/Qwen3-ASR-1.7B called with language="Dutch". The fine-tuned numbers are bold.

| Test set | Samples | Zero-shot WER | Fine-tuned WER | Δ WER | CER (zero-shot → fine-tuned) |
|---|---|---|---|---|---|
| Common Voice 17 (test) | 11,266 | — | **5.37** | — | → 1.69 |
| Common Voice 22 (test) | 12,033 | — | **5.28** | — | → 1.68 |

Lower is better. Both held-out test sets see roughly a one-third relative reduction in word error rate versus the already-strong base model.

🔬 Reproducibility note. Both the zero-shot baseline and the fine-tuned numbers above were measured with the same evaluation function (train_qwen3_asr.evaluate_model), the same greedy decoding settings, and the same reference normalisation (see next section). This is an apples-to-apples comparison.
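For reference, the WER reported above is word-level edit distance divided by the number of reference words. The project uses its own scorer (`train_qwen3_asr.evaluate_model`); the sketch below is only a minimal self-contained illustration of the metric itself:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # min over deletion, insertion, and (mis)match.
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / len(ref)

# One substitution ("zit" → "zat") in a six-word reference.
print(wer("de kat zit op de mat", "de kat zat op de mat"))
```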

🧹 Reference / target normalisation

Common Voice transcripts are crowd-sourced and inconsistent in casing and trailing punctuation. To give the model a clean, predictable target distribution we apply a small, deterministic written-form normalisation to every reference at load time, both during training and during evaluation:

  1. Capitalise the first letter if it is lowercase.
  2. Collapse trailing dots: any trailing run of periods (`.`, `..`, `...`) is replaced with a single `.`.
  3. Append a terminal period if the sentence does not already end in terminal punctuation (`.` `!` `?` `…`) or a closing bracket / quote (`)` `]` `}` `"` `'` etc.).
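A minimal Python sketch of those three rules. The authoritative implementation lives in the project repository; this reconstruction is for illustration only:

```python
import re

TERMINAL = ".!?…"      # terminal punctuation
CLOSERS = ")]}\"'"     # closing brackets / quotes

def normalize_reference(text: str) -> str:
    """Written-form normalisation of a reference transcript (sketch)."""
    text = text.strip()
    if not text:
        return text
    # 1. Capitalise the first letter if it is lowercase
    #    (skipping any opening quote or bracket).
    for i, ch in enumerate(text):
        if ch.isalpha():
            if ch.islower():
                text = text[:i] + ch.upper() + text[i + 1:]
            break
    # 2. Collapse any run of trailing dots into a single period.
    text = re.sub(r"\.+$", ".", text)
    # 3. Append a period unless the text already ends in terminal
    #    punctuation or a closing bracket / quote.
    if text[-1] not in TERMINAL and text[-1] not in CLOSERS:
        text += "."
    return text
```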

The exact function lives in src/evaluation/score_written_form.py of the project repository. Concretely:

| Raw reference | Normalised |
|---|---|
| `bom dia` | `Bom dia.` |
| `o gato dorme...` | `O gato dorme.` |
| `como estás?` | `Como estás?` (unchanged) |
| `"oi"` | `"Oi"` (closing quote → no `.`) |

Because the same normalisation is applied to references used for the zero-shot baseline above, the gain reported in the results table reflects the fine-tune itself — not a metric quirk caused by mismatched references.

🚀 How to use

Install the official qwen-asr package, then load this model exactly the same way you would load the base Qwen3-ASR:

```bash
pip install qwen-asr
```

```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "yuriyvnv/Qwen3-ASR-1.7B-NL",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

result = model.transcribe(audio="audio.wav", language="Dutch")
print(result[0].text)
```

Batch inference, automatic language detection, streaming, and vLLM serving all work identically to the base model — see the upstream Qwen3-ASR documentation for details.

🛠️ Training

Dataset: yuriyvnv/synthetic_transcript_nl + fsicoli/common_voice_22_0 (nl) — the full synthetic OpenAI-TTS Dutch corpus (~34.9k clips) concatenated with the Common Voice 22 Dutch train split, shuffled with the run seed. After duration filtering and transcript-length filtering: 74,898 training samples and 12,032 validation samples.
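The duration and transcript-length filtering can be sketched as a simple per-sample predicate. Note that the thresholds below are illustrative assumptions, not the values used in the actual run:

```python
def keep_sample(duration_s: float, transcript: str,
                min_dur: float = 0.5, max_dur: float = 30.0,
                max_chars: int = 400) -> bool:
    """Keep a (clip, transcript) pair only if both pass basic sanity bounds.

    All thresholds here are placeholders; the real values live in the
    training configuration of the project repository.
    """
    text = transcript.strip()
    return bool(text) and min_dur <= duration_s <= max_dur and len(text) <= max_chars

print(keep_sample(4.2, "Dit is een testzin."))  # within all bounds
print(keep_sample(95.0, "Te lange opname."))    # exceeds max duration
```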

Recipe: follows the official QwenLM SFT recipe with our local hyperparameters:

| Parameter | Value |
|---|---|
| Learning rate | 2e-05 |
| Scheduler | linear |
| Warmup ratio | 0.02 |
| Per-device batch size | 92 |
| Gradient accumulation | 2 |
| Effective batch size | 184 |
| Epochs | 5.0 |
| Precision | bf16 mixed |
| Gradient checkpointing | enabled |
| Optimizer | AdamW (fused) |
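The effective batch size is simply the product of the per-device batch size and the gradient-accumulation steps (single GPU, so there is no data-parallel factor):

```python
per_device_batch = 92
grad_accum_steps = 2
num_gpus = 1  # single H100

effective_batch = per_device_batch * grad_accum_steps * num_gpus
print(effective_batch)  # 184, matching the table above
```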

Trained on a single H100. The best checkpoint was selected by validation loss.

All training scripts and evaluation code are available at: github

🙏 Acknowledgements

This model would not exist without the work of others. Thank you to:

  • The Qwen team at Alibaba Cloud for releasing Qwen3-ASR-1.7B — the backbone of this fine-tune — together with a clean, reproducible SFT recipe and the Qwen3-ASR Technical Report.
  • The Mozilla Common Voice community for collecting and releasing the Dutch speech corpus used for training and evaluation (Common Voice 22, Common Voice 17 mirror).
  • Every contributor who recorded, validated, or transcribed a clip in Common Voice. This model is, very literally, your voices.

📚 Citation

If this model is useful in your work, please cite the base Qwen3-ASR report:

```bibtex
@article{qwen3asr2025,
  title  = {Qwen3-ASR Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://arxiv.org/abs/2601.21337}
}
```