Qwen3-ASR-0.6B-P25: Speech Recognition for P25 Public Safety Radio
A fine-tuned Qwen3-ASR-0.6B model specialized for transcribing P25 trunked radio dispatch audio. It substantially outperforms stock Qwen3-ASR, Whisper Large-v3, and other general-purpose ASR models on P25 dispatch vocabulary.
Results
Informal evaluation on a small held-out set of P25 calls with human-verified transcriptions shows substantial improvement over stock models. The fine-tuned 0.6B model consistently outperforms stock Qwen3-ASR (0.6B and 1.7B), Whisper Large-v3 (stock and LoRA-tuned), and other general-purpose ASR systems on P25 dispatch audio, particularly on domain-specific vocabulary like unit numbers, street names, and radio codes.
What is P25 Radio Audio?
Project 25 (P25) is the digital radio standard used by public safety agencies across the US. P25 audio has characteristics that challenge general-purpose ASR:
- IMBE/AMBE vocoder compression: lossy digital encoding that distorts speech
- PTT-gated transmissions: push-to-talk creates abrupt starts/stops
- Domain vocabulary: unit numbers ("Medic 22", "Engine 111"), radio codes ("10-4"), street addresses, dispatch terminology
- Variable signal quality: portable radios, in-vehicle use, building penetration
- Overlapping transmissions: multiple units on shared talkgroups
- Background noise: sirens, wind, engine noise, crowd noise
Stock ASR models frequently misrecognize domain terms: "Medic 23" becomes "Madag 23", "cul-de-sacs" becomes "collar tax", "blow-ins" becomes "loans".
Usage
```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "AuggieActual/qwen3-asr-p25-0.6B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_new_tokens=256,
)

results = model.transcribe(audio="path/to/p25_call.wav", language="English")
print(results[0].text)
```
With Word-Level Timestamps
```python
model = Qwen3ASRModel.from_pretrained(
    "AuggieActual/qwen3-asr-p25-0.6B",
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    forced_aligner_kwargs=dict(dtype=torch.bfloat16, device_map="cuda:0"),
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_new_tokens=256,
)

results = model.transcribe(
    audio="path/to/p25_call.wav",
    language="English",
    return_time_stamps=True,
)

for word in results[0].time_stamps:
    print(f"  {word.start_time:.2f}-{word.end_time:.2f}  {word.text}")
```
Training
Overview
Full supervised fine-tuning (SFT) of Qwen3-ASR-0.6B on ~24,800 P25 radio calls with pseudo-labels. The audio encoder is frozen; only the text decoder is trained, as the encoder already handles vocoder-compressed audio adequately. The challenge is vocabulary, not acoustics.
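The frozen-encoder setup can be sketched with a small helper. Note that `audio_encoder` is an assumed attribute name for illustration, not the model's confirmed API:

```python
import torch.nn as nn

def freeze_encoder(model: nn.Module, encoder_attr: str = "audio_encoder") -> int:
    """Freeze the audio encoder so only the text decoder trains.

    `encoder_attr` is a hypothetical attribute name; adjust it to match
    the real model class. Returns the number of frozen parameters.
    """
    encoder = getattr(model, encoder_attr)
    frozen = 0
    for p in encoder.parameters():
        p.requires_grad = False  # excluded from optimizer updates
        frozen += p.numel()
    return frozen
```

Only parameters with `requires_grad=True` are passed to the optimizer, so the encoder's weights stay at their pretrained values throughout training.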
Data Collection
Training data comes from a live trunk-recorder installation monitoring multiple county-level P25 public safety systems, capturing individual call recordings from the trunked networks as they occur.
Sample selection criteria:
- Duration: 2-60 seconds (optimal for single-utterance training)
- Talkgroup diversity: dispatch, tactical, fire, EMS, law enforcement channels
- Signal quality: filtered by call duration and metadata completeness
- Call type diversity: routine dispatch, emergency, multi-unit, tactical
Two collection phases:
- Phase 1: 6,968 calls selected from a catalog of 25,566 candidates via stratified sampling across talkgroup tags, sites, duration buckets, and emergency status
- Phase 2: 20,000 additional calls selected from a filesystem catalog of 2M+ recordings, broadening coverage across systems and time periods
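The stratified selection used in both phases can be illustrated with a minimal sketch. Field names here are hypothetical; the actual pipeline stratified over talkgroup tags, sites, duration buckets, and emergency status:

```python
import random
from collections import defaultdict

def stratified_sample(catalog, keys, n_total, seed=0):
    """Pick ~n_total calls spread evenly across strata defined by `keys`.

    `catalog` is a list of per-call metadata dicts; each stratum is the
    tuple of values for `keys` (e.g. talkgroup tag + duration bucket).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for call in catalog:
        strata[tuple(call[k] for k in keys)].append(call)
    per_stratum = max(1, n_total // len(strata))
    picked = []
    for calls in strata.values():
        rng.shuffle(calls)           # random picks within each stratum
        picked.extend(calls[:per_stratum])
    return picked[:n_total]
```

Even sampling per stratum prevents high-traffic dispatch talkgroups from dominating the training mix at the expense of rarer tactical and EMS traffic.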
Audio Preprocessing
All audio preprocessed with sox to standardize format and remove P25-specific interference:
```sh
sox input.wav output.wav \
  rate 16000 \
  channels 1 \
  sinc 300-3400 \
  bandreject 1200 5q \
  bandreject 1800 5q \
  bandreject 1000 8q \
  norm -3
```
| Filter | Purpose |
|---|---|
| `rate 16000` | Resample to 16 kHz |
| `channels 1` | Convert to mono |
| `sinc 300-3400` | Bandpass for speech intelligibility range |
| `bandreject 1200 5q` | Notch filter for MDC-1200 data signaling |
| `bandreject 1800 5q` | Notch filter for MDC-1200 status tones |
| `bandreject 1000 8q` | Notch filter for alert/page tones |
| `norm -3` | Normalize peak to -3 dB |
The bandpass preserves dispatch speech while attenuating vocoder artifacts outside the speech band. MDC-1200 notch filters remove in-band data signaling that confuses ASR decoders. Narrow Q factors minimize impact on speech content.
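To apply the chain across a whole corpus, a thin wrapper over the sox CLI works; this sketch assumes `sox` is on the PATH:

```python
import subprocess
from pathlib import Path

# The exact filter chain from the command above, as an argument list.
SOX_ARGS = [
    "rate", "16000",
    "channels", "1",
    "sinc", "300-3400",
    "bandreject", "1200", "5q",
    "bandreject", "1800", "5q",
    "bandreject", "1000", "8q",
    "norm", "-3",
]

def preprocess(src: Path, dst: Path) -> None:
    """Run the sox filter chain on one call recording."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(["sox", str(src), str(dst), *SOX_ARGS], check=True)
```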
Note on VAD: VAD is intentionally not used for segmentation. P25 audio is already PTT-gated, so every transmission is active speech, and applying segmentation VAD (e.g., Silero) degrades accuracy on names and numbers at transmission boundaries.
However, a simple RMS energy gate is recommended before inference to reject blank or encrypted audio. P25 systems sometimes send encrypted channel audio or empty carrier bursts that contain no intelligible speech. Without a gate, the model will hallucinate plausible-sounding dispatch text for silent inputs. A threshold of RMS < 0.01 cleanly separates blank audio (typically < 0.003) from real speech (typically > 0.03). The included server implements this check automatically.
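The RMS gate takes only a few lines; this sketch assumes 16-bit PCM WAV input:

```python
import wave
import numpy as np

RMS_THRESHOLD = 0.01  # below this, treat the clip as blank/encrypted

def rms_energy(samples: np.ndarray) -> float:
    """Root-mean-square energy of float samples scaled to [-1, 1]."""
    return float(np.sqrt(np.mean(np.square(samples))))

def is_speech(path: str) -> bool:
    """Gate blank or encrypted P25 audio before it reaches the model."""
    with wave.open(path, "rb") as wf:
        raw = wf.readframes(wf.getnframes())
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
    return rms_energy(samples) >= RMS_THRESHOLD
```

Clips failing the gate are skipped rather than transcribed, which eliminates the hallucinated dispatch text described above.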
Pseudo-Labeling
All training labels are machine-generated pseudo-labels, quality-filtered for accuracy.
Quality filtering rejected ~7% of labels:
| Filter | Criterion | Rejects |
|---|---|---|
| Empty | No transcription returned | ~3.6% |
| Hallucination | >8 words/second (implausible for speech) | ~2% |
| Too short | <3 words | ~1% |
| Non-ASCII | <90% ASCII characters | <0.1% |
Final labeled dataset: 24,808 P25 calls passing quality filters.
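The four filters combine into a single predicate; the thresholds below mirror the table:

```python
def passes_quality_filters(text: str, duration_s: float) -> bool:
    """Accept a pseudo-label only if it clears all four quality filters."""
    text = text.strip()
    if not text:                                         # empty transcription
        return False
    words = text.split()
    if len(words) < 3:                                   # too short
        return False
    if duration_s > 0 and len(words) / duration_s > 8:   # hallucination rate
        return False
    ascii_ratio = sum(c.isascii() for c in text) / len(text)
    if ascii_ratio < 0.9:                                # non-ASCII noise
        return False
    return True
```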
Text Normalization
The raw pseudo-labels contain written-form numbers ("twenty-two", "seventy-six eighty-five"). Dispatch radio uses numeric form ("22", "7685"). A normalization pass converts ~40% of labels to match radio conventions:
| Pattern | Example | Normalized |
|---|---|---|
| Compound numbers | "twenty-two" | "22" |
| Sequential digits | "seven five three one" | "7531" |
| X-oh-Y radio IDs | "one-oh-one" | "101" |
| Common ASR errors | "Medicaid 22" | "Medic 22" |
| Common ASR errors | "Tax 46" | "Tac 46" |
| Common ASR errors | "Italian 16" | "Battalion 16" |
This normalization is critical: without it, the model would learn to output "twenty-two" instead of "22", which is wrong for the dispatch domain.
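The table above can be sketched as a small set of regex rules. This is an illustrative subset only; the production pass covers more patterns:

```python
import re

NUMBER_WORDS = {
    "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "zero": "0", "oh": "0",
}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
ASR_FIXES = {"Medicaid": "Medic", "Tax": "Tac", "Italian": "Battalion"}

def normalize(text: str) -> str:
    # Known ASR confusions when followed by a number ("Tax 46" -> "Tac 46").
    for wrong, right in ASR_FIXES.items():
        text = re.sub(rf"\b{wrong}(?=\s+\d)", right, text)
    # Compound numbers: "twenty-two" -> "22".
    tens = "|".join(TENS)
    ones = "|".join(w for w in NUMBER_WORDS if w not in ("zero", "oh"))
    text = re.sub(
        rf"\b({tens})-({ones})\b",
        lambda m: str(TENS[m.group(1).lower()] + int(NUMBER_WORDS[m.group(2).lower()])),
        text, flags=re.I,
    )
    # Spoken digit runs: "seven five three one" -> "7531", "one-oh-one" -> "101".
    digit = "|".join(NUMBER_WORDS)
    text = re.sub(
        rf"\b(?:(?:{digit})[ -]){{2,}}(?:{digit})\b",
        lambda m: "".join(NUMBER_WORDS[w.lower()] for w in re.split(r"[ -]", m.group(0))),
        text, flags=re.I,
    )
    return text
```

The ASR-confusion fixes run first and only fire when followed by a number, so ordinary uses of words like "tax" in running speech are left alone.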
Training Configuration
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-ASR-0.6B |
| Training type | Full SFT (not LoRA) |
| Training samples | 23,519 |
| Eval samples | 1,237 |
| Frozen components | Audio encoder (317M params) |
| Trainable params | ~310M (text decoder) |
| Effective batch size | 64 |
| Learning rate | 2e-5 (cosine schedule) |
| Warmup | 5% of steps |
| Epochs | 3 |
| Total steps | 2,205 |
| Optimizer | AdamW |
| Precision | bfloat16 |
| Hardware | 2x NVIDIA RTX 3090 Ti (24 GB) |
| Training time | ~3.4 hours |
Loss masking: Only the transcription target tokens contribute to the loss. The system prompt and audio tokens are masked with -100 so the model only learns to generate transcriptions, not to reproduce the prompt.
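The masking can be sketched as follows, where `prompt_len` stands for the number of prompt and audio tokens preceding the transcription target:

```python
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Mask the non-target prefix so only transcription tokens are supervised.

    `input_ids` is a 1-D sequence; the first `prompt_len` positions cover
    the system prompt and audio tokens, the rest is the transcription.
    """
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX  # no gradient from prompt/audio tokens
    return labels
```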
Training Dynamics
| Step | Train Loss | Eval Loss |
|---|---|---|
| 25 | 128.1 | – |
| 100 | 29.5 | – |
| 500 | 3.7 | 0.249 |
| 1,000 | 2.6 | 0.229 |
| 1,500 | 2.1 | 0.225 |
| 2,000 | 2.1 | 0.225 |
| 2,205 | 2.0 | – |
Eval loss plateaus at ~0.225 from step 1,500 onward with no overfitting.
Qualitative Performance by Call Type
- Fire/Law Dispatch: strong performance on structured dispatch formats (unit assignments, addresses, incident types)
- Law/Fire Tactical: handles conversational officer-to-officer and fireground communications well
- Security: accurate on facility announcements (e.g., Code Blue pages)
- Multi-Talk / Overlapping: degrades when multiple speakers transmit simultaneously
- Noisy Tactical: still struggles on very noisy channels (EMS tactical with heavy background noise), though all models tested perform poorly here
Limitations
- English only: trained exclusively on English-language P25 dispatch
- Pseudo-labels: training data is machine-labeled, not human-verified; some label noise remains
- Regional vocabulary: trained on a specific set of county systems; may not generalize perfectly to all P25 deployments (different unit numbering, street names, dispatch protocols)
- Short utterances: optimized for typical P25 call lengths (2-60 seconds); untested on longer recordings
- Noisy tactical channels: performance degrades on very noisy tactical channels
- Hallucination on blank audio: like all autoregressive ASR models, it will generate plausible-sounding text for silent/encrypted input; use the RMS energy gate described above
Intended Use
- Transcription of P25 public safety radio for dispatch logging, search, and review
- Starting point for fine-tuning on specific agency vocabularies
- Research into domain-adapted ASR for vocoder-compressed radio audio
How to Reproduce
The complete training pipeline is documented in this repository:
- Audio collection: trunk-recorder captures P25 call recordings
- Sample selection: stratified sampling for diversity across talkgroups, duration, call types
- Preprocessing: sox bandpass + notch filters → 16 kHz mono WAV
- Pseudo-labeling: machine-generated transcriptions with quality filtering
- Text normalization: number conversion and dispatch term correction
- Training: full SFT with frozen encoder, 3 epochs, cosine LR schedule
Citation
If you use this model, please cite:
```bibtex
@misc{qwen3-asr-p25,
  title={Qwen3-ASR-0.6B-P25: Fine-Tuned Speech Recognition for P25 Public Safety Radio},
  author={AuggieActual},
  year={2026},
  url={https://huggingface.co/AuggieActual/qwen3-asr-p25-0.6B}
}
```
Acknowledgments
- Qwen3-ASR by the Alibaba Qwen team: excellent base model
- trunk-recorder: P25 radio capture
- Inspired by Police Radio ASR (2024), a directly analogous domain study