🎙️ Qwen3-ASR-1.7B-PT — Portuguese Speech Recognition

**1.7B parameters** · **Speech to Text** · **Portuguese** · **Automatic Speech Recognition** · **bf16** · **Apache-2.0**

A Portuguese-specialised automatic speech recognition (ASR) model, fine-tuned from Qwen/Qwen3-ASR-1.7B. It outputs cased, punctuated Portuguese text and works as a drop-in replacement for the base model.

Update. v2 release. The v1 run already gave us a clear picture of the learning curve on Portuguese (eval loss and WER plateaued around epoch 4 of 6, with -33% relative WER vs zero-shot), so for this final release we are no longer using the validation split to pick hyperparameters. Instead we fold CV22-pt train + validation together with the WAVe-filtered synthetic_transcript_pt corpus into one training set to maximise the data the model sees, train for just 4 epochs (the point at which v1 converged), and validate only on the held-out CV17-pt and CV22-pt test sets. The goal is the strongest possible Portuguese model for production use, not another methodological ablation.

On Common Voice 17 (test) it reaches 8.11% WER (down from 12.63% zero-shot, -35.8% relative).


📊 Results

WER and CER on held-out Common Voice test sets — same samples, same protocol, no test-time tricks. "Zero-shot" is the base Qwen/Qwen3-ASR-1.7B called with language="Portuguese". The fine-tuned numbers are bold.

| Test set | Samples | Zero-shot WER | Fine-tuned WER | Δ WER | CER (zero-shot → fine-tuned) |
|---|---|---|---|---|---|
| Common Voice 17 (test) | 9,467 | 12.63 | **8.11** | -4.52 (-35.8%) | 3.51 → 2.41 |
| Common Voice 22 (test) | 9,641 | 12.91 | **8.50** | -4.41 (-34.2%) | 3.60 → 2.55 |

Lower is better. Both held-out test sets see roughly a one-third relative reduction in word error rate versus the already-strong base model.
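The relative deltas in the table follow directly from the absolute WERs; a quick sanity check in plain Python (no project code assumed):

```python
def relative_delta_pct(zero_shot_wer: float, fine_tuned_wer: float) -> float:
    """Relative WER change in percent (negative = improvement)."""
    return (fine_tuned_wer - zero_shot_wer) / zero_shot_wer * 100

print(round(relative_delta_pct(12.63, 8.11), 1))  # CV17: -35.8
print(round(relative_delta_pct(12.91, 8.50), 1))  # CV22: -34.2
```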

🔬 Reproducibility note. Both the zero-shot baseline and the fine-tuned numbers above were measured with the same evaluation function (train_qwen3_asr.evaluate_model), the same greedy decoding settings, and the same reference normalisation (see next section). This is an apples-to-apples comparison.
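WER itself is just the word-level edit distance divided by the reference length. A minimal self-contained implementation for sanity-checking numbers (a sketch — not the project's `train_qwen3_asr.evaluate_model`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[-1][-1] / len(ref)
```

For example, one substituted word out of three gives `wer("o gato dorme", "o gato come") ≈ 0.333`.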

🧹 Reference / target normalisation

Common Voice transcripts are crowd-sourced and inconsistent in casing and trailing punctuation. To give the model a clean, predictable target distribution we apply a small, deterministic written-form normalisation to every reference at load time, both during training and during evaluation:

  1. Capitalise the first letter if it is lowercase.
  2. Collapse trailing dots: any run of dots (`..`, `...`, …) at the end is replaced with a single `.`.
  3. Append a terminal period if the sentence does not already end in terminal punctuation (`.` `!` `?` `…`) or a closing bracket / quote (`)` `]` `}` `"` `'` etc.).

The exact function lives in src/evaluation/score_written_form.py of the project repository. Concretely:

| Raw reference | Normalised |
|---|---|
| `bom dia` | `Bom dia.` |
| `o gato dorme...` | `O gato dorme.` |
| `como estás?` | `Como estás?` (unchanged) |
| `"oi"` | `"Oi"` (closing quote → no `.`) |

Because the same normalisation is applied to references used for the zero-shot baseline above, the gain reported in the results table reflects the fine-tune itself — not a metric quirk caused by mismatched references.

🚀 How to use

Install the official qwen-asr package, then load this model exactly the same way you would load the base Qwen3-ASR:

```bash
pip install qwen-asr
```

```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "yuriyvnv/Qwen3-ASR-1.7B-PT",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

result = model.transcribe(audio="audio.wav", language="Portuguese")
print(result[0].text)
```

Batch inference, automatic language detection, streaming, and vLLM serving all work identically to the base model — see the upstream Qwen3-ASR documentation for details.

🛠️ Training

Dataset: yuriyvnv/synthetic_transcript_pt + fsicoli/common_voice_22_0 (pt) — Common Voice 22 Portuguese train + validation combined with synthetic_transcript_pt (cv_high_quality, WAVe-filtered CV17-Portuguese), shuffled with the run seed. After duration filtering and transcript-length filtering: 61,579 training samples and 9,641 validation samples.
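The duration and transcript-length filtering mentioned above can be sketched as a simple per-sample predicate. The thresholds below are hypothetical placeholders, since the exact cutoffs used in the run are not stated in this card:

```python
def keep_sample(duration_s: float, transcript: str,
                max_duration_s: float = 30.0,   # assumed cutoff
                min_chars: int = 1,             # assumed cutoff
                max_chars: int = 400) -> bool:  # assumed cutoff
    """Return True if a (duration, transcript) pair passes both filters."""
    text = transcript.strip()
    return (0.0 < duration_s <= max_duration_s
            and min_chars <= len(text) <= max_chars)

print(keep_sample(5.0, "bom dia"))   # True
print(keep_sample(45.0, "bom dia"))  # False: too long
```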

Recipe: follows the official QwenLM SFT recipe with our local hyperparameters:

| Parameter | Value |
|---|---|
| Learning rate | 2e-05 |
| Scheduler | linear |
| Warmup ratio | 0.02 |
| Per-device batch size | 92 |
| Gradient accumulation | 2 |
| Effective batch size | 184 |
| Epochs | 4.0 |
| Precision | bf16 mixed |
| Gradient checkpointing | enabled |
| Optimizer | AdamW (fused) |
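The hyperparameters above can be collected into a flat config for reference; key names here are illustrative assumptions, not the QwenLM recipe's exact schema. Note that on a single GPU the effective batch size is simply the per-device batch times the gradient-accumulation steps:

```python
# Illustrative config mirroring the table above (key names assumed).
train_config = {
    "learning_rate": 2e-05,
    "lr_scheduler_type": "linear",
    "warmup_ratio": 0.02,
    "per_device_train_batch_size": 92,
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 4.0,
    "bf16": True,
    "gradient_checkpointing": True,
    "optimizer": "adamw_fused",
}

# Effective batch size = per-device batch * grad accumulation (single H100).
effective_batch = (train_config["per_device_train_batch_size"]
                   * train_config["gradient_accumulation_steps"])
print(effective_batch)  # 184
```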

Trained on a single H100. The best checkpoint was selected by validation loss.

The repository with all the scripts used to train the Qwen models is available at: link

🙏 Acknowledgements

This model would not exist without the work of others. Thank you to:

  • The Qwen team at Alibaba Cloud for releasing Qwen3-ASR-1.7B — the backbone of this fine-tune — together with a clean, reproducible SFT recipe and the Qwen3-ASR Technical Report.
  • The Mozilla Common Voice community for collecting and releasing the Portuguese speech corpus used for training and evaluation (Common Voice 22, Common Voice 17 mirror).
  • Every contributor who recorded, validated, or transcribed a clip in Common Voice. This model is, very literally, your voices.

📚 Citation

If this model is useful in your work, please cite the base Qwen3-ASR report:

@article{qwen3asr2025,
  title  = {Qwen3-ASR Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://arxiv.org/abs/2601.21337}
}