🎙️ Qwen3-ASR-1.7B-PT — Portuguese Speech Recognition
A Portuguese-specialised automatic speech recognition (ASR) model, fine-tuned from Qwen/Qwen3-ASR-1.7B. It outputs cased, punctuated Portuguese text and works as a drop-in replacement for the base model.
Update. v2 release. The v1 run already gave us a clear picture of the learning curve on Portuguese (eval loss and WER plateaued around epoch 4 of 6, with -33% relative WER vs zero-shot), so for this final release we are no longer using the validation split to pick hyperparameters. Instead we fold CV22-pt train + validation together with the WAVe-filtered synthetic_transcript_pt corpus into one training set to maximise the data the model sees, train for just 4 epochs (the point at which v1 converged), and validate only on the held-out CV17-pt and CV22-pt test sets. The goal is the strongest possible Portuguese model for production use, not another methodological ablation.
On Common Voice 17 (test) it reaches 8.11% WER (down from 12.63% zero-shot, -35.8% relative).
📊 Results
WER and CER on held-out Common Voice test sets — same samples, same protocol, no test-time tricks. "Zero-shot" is the base Qwen/Qwen3-ASR-1.7B called with `language="Portuguese"`. The fine-tuned numbers are bold.
| Test set | Samples | Zero-shot WER | Fine-tuned WER | Δ WER | CER (zero-shot → fine-tuned) |
|---|---|---|---|---|---|
| Common Voice 17 (test) | 9,467 | 12.63 | **8.11** | -4.52 (-35.8%) | 3.51 → **2.41** |
| Common Voice 22 (test) | 9,641 | 12.91 | **8.50** | -4.41 (-34.2%) | 3.60 → **2.55** |
Lower is better. Both held-out test sets see roughly a one-third relative reduction in word error rate versus the already-strong base model.
🔬 Reproducibility note. Both the zero-shot baseline and the fine-tuned numbers above were measured with the same evaluation function (`train_qwen3_asr.evaluate_model`), the same greedy decoding settings, and the same reference normalisation (see next section), making this an apples-to-apples comparison.
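For readers who want to sanity-check the metric itself: WER is the word-level Levenshtein edit distance between hypothesis and reference, divided by the number of reference words, and CER is the same quantity computed over characters. A minimal stdlib sketch (illustrative only, not the project's `evaluate_model`):

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance using a single-row dynamic programme."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(
                d[j] + 1,          # deletion
                d[j - 1] + 1,      # insertion
                prev + (r != h),   # substitution (free on match)
            )
    return d[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `wer("o gato dorme", "o gato come")` is 1/3: one substitution over three reference words.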
🧹 Reference / target normalisation
Common Voice transcripts are crowd-sourced and inconsistent in casing and trailing punctuation. To give the model a clean, predictable target distribution we apply a small, deterministic written-form normalisation to every reference at load time, both during training and during evaluation:
- Capitalise the first letter if it is lowercase.
- Collapse trailing dots: any trailing sequence of `.`, `..`, `...`, or `…` is replaced with a single `.`.
- Append a terminal period if the sentence does not already end in terminal punctuation (`.` `!` `?` `…`) or a closing bracket / quote (`)` `]` `}` `"` `'` etc.).
The exact function lives in `src/evaluation/score_written_form.py` of the project repository. Concretely:
| Raw reference | Normalised |
|---|---|
| `bom dia` | `Bom dia.` |
| `o gato dorme...` | `O gato dorme.` |
| `como estás?` | `Como estás?` (unchanged) |
| `"oi"` | `"Oi"` (closing quote → no `.`) |
Because the same normalisation is applied to references used for the zero-shot baseline above, the gain reported in the results table reflects the fine-tune itself — not a metric quirk caused by mismatched references.
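The rules above can be sketched in a few lines of Python. This is an illustrative re-implementation, not the authoritative `src/evaluation/score_written_form.py`; in particular, capitalising the first *letter* rather than the first character (so leading quotes are skipped) is inferred from the `"oi"` example:

```python
import re

TERMINAL = tuple(".!?…")        # terminal punctuation
CLOSERS = tuple(')]}"\'')       # closing brackets / quotes

def normalize_reference(text: str) -> str:
    """Deterministic written-form normalisation of a reference transcript."""
    text = text.strip()
    # Capitalise the first letter if it is lowercase (skip leading quotes/brackets).
    for i, ch in enumerate(text):
        if ch.isalpha():
            if ch.islower():
                text = text[:i] + ch.upper() + text[i + 1:]
            break
    # Collapse any trailing run of dots / ellipses into a single period.
    text = re.sub(r"[.…]+$", ".", text)
    # Append a terminal period unless the sentence already ends in
    # terminal punctuation or a closing bracket / quote.
    if text and not text.endswith(TERMINAL + CLOSERS):
        text += "."
    return text
```

Running it over the table rows above reproduces the normalised column, e.g. `normalize_reference("o gato dorme...")` yields `"O gato dorme."`.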
🚀 How to use
Install the official qwen-asr package, then load this model exactly the same way you would load the base Qwen3-ASR:

```bash
pip install qwen-asr
```

```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "yuriyvnv/Qwen3-ASR-1.7B-PT",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

result = model.transcribe(audio="audio.wav", language="Portuguese")
print(result[0].text)
```
Batch inference, automatic language detection, streaming, and vLLM serving all work identically to the base model — see the upstream Qwen3-ASR documentation for details.
🛠️ Training
Dataset: yuriyvnv/synthetic_transcript_pt + fsicoli/common_voice_22_0 (pt) — Common Voice 22 Portuguese train + validation combined with synthetic_transcript_pt (cv_high_quality, WAVe-filtered CV17-Portuguese), shuffled with the run seed. After duration filtering and transcript-length filtering: 61,579 training samples and 9,641 validation samples.
Recipe: follows the official QwenLM SFT recipe with our local hyperparameters:
| Parameter | Value |
|---|---|
| Learning rate | 2e-05 |
| Scheduler | linear |
| Warmup ratio | 0.02 |
| Per-device batch size | 92 |
| Gradient accumulation | 2 |
| Effective batch size | 184 |
| Epochs | 4.0 |
| Precision | bf16 mixed |
| Gradient checkpointing | enabled |
| Optimizer | AdamW (fused) |
Trained on a single H100. The best checkpoint was selected by validation loss.
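As an illustration only, the table maps onto Hugging Face `transformers` `TrainingArguments` roughly as follows. This is a hypothetical sketch (the actual run uses the official QwenLM SFT scripts, and `output_dir` is a placeholder):

```python
from transformers import TrainingArguments

# Hypothetical mapping of the hyperparameter table; not the project's script.
args = TrainingArguments(
    output_dir="qwen3-asr-pt",         # placeholder path
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.02,
    per_device_train_batch_size=92,
    gradient_accumulation_steps=2,     # 92 x 2 = 184 effective batch size
    num_train_epochs=4.0,
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
)
```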
The repository with all the scripts used to train the Qwen models is available at: link
🙏 Acknowledgements
This model would not exist without the work of others. Thank you to:
- The Qwen team at Alibaba Cloud for releasing Qwen3-ASR-1.7B — the backbone of this fine-tune — together with a clean, reproducible SFT recipe and the Qwen3-ASR Technical Report.
- The Mozilla Common Voice community for collecting and releasing the Portuguese speech corpus used for training and evaluation (Common Voice 22, Common Voice 17 mirror).
- Every contributor who recorded, validated, or transcribed a clip in Common Voice. This model is, very literally, your voices.
📚 Citation
If this model is useful in your work, please cite the base Qwen3-ASR report:
@article{qwen3asr2025,
title = {Qwen3-ASR Technical Report},
author = {Qwen Team},
year = {2025},
url = {https://arxiv.org/abs/2601.21337}
}