🎙️ Qwen3-ASR-1.7B-PT — Portuguese Speech Recognition

**1.7B parameters** · **Speech to Text** · **Portuguese** · **Automatic Speech Recognition** · **bf16** · **Apache-2.0**

A Portuguese-specialised automatic speech recognition (ASR) model, fine-tuned from Qwen/Qwen3-ASR-1.7B. It outputs cased, punctuated Portuguese text and works as a drop-in replacement for the base model.

Update. v2 release. The v1 run already gave us a clear picture of the learning curve on Portuguese (eval loss and WER plateaued around epoch 4 of 6, with -33% relative WER vs zero-shot), so for this final release we are no longer using the validation split to pick hyperparameters. Instead we fold CV22-pt train + validation together with the WAVe-filtered synthetic_transcript_pt corpus into one training set to maximise the data the model sees, train for just 4 epochs (the point at which v1 converged), and validate only on the held-out CV17-pt and CV22-pt test sets. The goal is the strongest possible Portuguese model for production use, not another methodological ablation.

On Common Voice 17 (test) it reaches 8.11% WER (down from 12.63% zero-shot, -35.8% relative).


📊 Results

WER and CER on held-out Common Voice test sets — same samples, same protocol, no test-time tricks. "Zero-shot" is the base Qwen/Qwen3-ASR-1.7B called with language="Portuguese". The fine-tuned numbers are bold.

| Test set | Samples | Zero-shot WER | Fine-tuned WER | Δ WER | CER (zero-shot → fine-tuned) |
|---|---|---|---|---|---|
| Common Voice 17 (test) | 9,467 | 12.63 | **8.11** | -4.52 (-35.8%) | 3.51 → 2.41 |
| Common Voice 22 (test) | 9,641 | 12.91 | **8.50** | -4.41 (-34.2%) | 3.60 → 2.55 |

Lower is better. Both held-out test sets see roughly a one-third relative reduction in word error rate versus the already-strong base model.
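The relative deltas in the table follow directly from the absolute WERs; a quick sanity check in plain Python (no project code assumed):

```python
def relative_delta_pct(zero_shot_wer: float, fine_tuned_wer: float) -> float:
    """Relative WER change in percent (negative = improvement)."""
    return (fine_tuned_wer - zero_shot_wer) / zero_shot_wer * 100

print(round(relative_delta_pct(12.63, 8.11), 1))  # CV17: -35.8
print(round(relative_delta_pct(12.91, 8.50), 1))  # CV22: -34.2
```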

🔬 Reproducibility note. Both the zero-shot baseline and the fine-tuned numbers above were measured with the same evaluation function (train_qwen3_asr.evaluate_model), the same greedy decoding settings, and the same reference normalisation (see next section). This is an apples-to-apples comparison.
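WER itself is just the word-level edit distance divided by the reference length. A minimal self-contained implementation for sanity-checking numbers (a sketch — not the project's `train_qwen3_asr.evaluate_model`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[-1][-1] / len(ref)
```

For example, one substituted word out of three gives `wer("o gato dorme", "o gato come") ≈ 0.333`.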

🧹 Reference / target normalisation

Common Voice transcripts are crowd-sourced and inconsistent in casing and trailing punctuation. To give the model a clean, predictable target distribution we apply a small, deterministic written-form normalisation to every reference at load time, both during training and during evaluation:

  1. Capitalise the first letter if it is lowercase.
  2. Collapse trailing dots: any run of dots (`..`, `...`, …) at the end is replaced with a single `.`.
  3. Append a terminal period if the sentence does not already end in terminal punctuation (`.` `!` `?` `…`) or a closing bracket / quote (`)` `]` `}` `"` `'` etc.).

The exact function lives in src/evaluation/score_written_form.py of the project repository. Concretely:

| Raw reference | Normalised |
|---|---|
| `bom dia` | `Bom dia.` |
| `o gato dorme...` | `O gato dorme.` |
| `como estás?` | `Como estás?` (unchanged) |
| `"oi"` | `"Oi"` (closing quote → no `.`) |

Because the same normalisation is applied to references used for the zero-shot baseline above, the gain reported in the results table reflects the fine-tune itself — not a metric quirk caused by mismatched references.

🚀 How to use

Install the official qwen-asr package, then load this model exactly the same way you would load the base Qwen3-ASR:

```bash
pip install qwen-asr
```

```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "yuriyvnv/Qwen3-ASR-1.7B-PT",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

result = model.transcribe(audio="audio.wav", language="Portuguese")
print(result[0].text)
```

Batch inference, automatic language detection, streaming, and vLLM serving all work identically to the base model — see the upstream Qwen3-ASR documentation for details.

🛠️ Training

Dataset: yuriyvnv/synthetic_transcript_pt + fsicoli/common_voice_22_0 (pt) — Common Voice 22 Portuguese train + validation combined with synthetic_transcript_pt (cv_high_quality, WAVe-filtered CV17-Portuguese), shuffled with the run seed. After duration filtering and transcript-length filtering: 61,579 training samples and 9,641 validation samples.
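The duration and transcript-length filtering mentioned above can be sketched as a simple per-sample predicate. The thresholds below are hypothetical placeholders, since the exact cutoffs used in the run are not stated in this card:

```python
def keep_sample(duration_s: float, transcript: str,
                max_duration_s: float = 30.0,   # assumed cutoff
                min_chars: int = 1,             # assumed cutoff
                max_chars: int = 400) -> bool:  # assumed cutoff
    """Return True if a (duration, transcript) pair passes both filters."""
    text = transcript.strip()
    return (0.0 < duration_s <= max_duration_s
            and min_chars <= len(text) <= max_chars)

print(keep_sample(5.0, "bom dia"))   # True
print(keep_sample(45.0, "bom dia"))  # False: too long
```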

Recipe: follows the official QwenLM SFT recipe with our local hyperparameters:

| Parameter | Value |
|---|---|
| Learning rate | 2e-05 |
| Scheduler | linear |
| Warmup ratio | 0.02 |
| Per-device batch size | 92 |
| Gradient accumulation | 2 |
| Effective batch size | 184 |
| Epochs | 4.0 |
| Precision | bf16 mixed |
| Gradient checkpointing | enabled |
| Optimizer | AdamW (fused) |
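The hyperparameters above can be collected into a flat config for reference; key names here are illustrative assumptions, not the QwenLM recipe's exact schema. Note that on a single GPU the effective batch size is simply the per-device batch times the gradient-accumulation steps:

```python
# Illustrative config mirroring the table above (key names assumed).
train_config = {
    "learning_rate": 2e-05,
    "lr_scheduler_type": "linear",
    "warmup_ratio": 0.02,
    "per_device_train_batch_size": 92,
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 4.0,
    "bf16": True,
    "gradient_checkpointing": True,
    "optimizer": "adamw_fused",
}

# Effective batch size = per-device batch * grad accumulation (single H100).
effective_batch = (train_config["per_device_train_batch_size"]
                   * train_config["gradient_accumulation_steps"])
print(effective_batch)  # 184
```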

Trained on a single H100. The best checkpoint was selected by validation loss.

The repository with all the scripts used to train the Qwen models is available at: link

🙏 Acknowledgements

This model would not exist without the work of others. Thank you to:

  • The Qwen team at Alibaba Cloud for releasing Qwen3-ASR-1.7B — the backbone of this fine-tune — together with a clean, reproducible SFT recipe and the Qwen3-ASR Technical Report.
  • The Mozilla Common Voice community for collecting and releasing the Portuguese speech corpus used for training and evaluation (Common Voice 22, Common Voice 17 mirror).
  • Every contributor who recorded, validated, or transcribed a clip in Common Voice. This model is, very literally, your voices.

📚 Citation

If this model is useful in your work, please cite the base Qwen3-ASR report:

@article{qwen3asr2025,
  title  = {Qwen3-ASR Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://arxiv.org/abs/2601.21337}
}