Qwen3-ASR-0.6B-P25: Speech Recognition for P25 Public Safety Radio
A fine-tuned Qwen3-ASR-0.6B model specialized for transcribing P25 trunked radio dispatch audio. It substantially outperforms stock Qwen3-ASR, Whisper Large-v3, and other general-purpose ASR models on P25 dispatch vocabulary.
Results
Informal evaluation on a small held-out set of P25 calls with human-verified transcriptions shows substantial improvement over stock models. The fine-tuned 0.6B model consistently outperforms stock Qwen3-ASR (0.6B and 1.7B), Whisper Large-v3 (stock and LoRA-tuned), and other general-purpose ASR systems on P25 dispatch audio, particularly on domain-specific vocabulary like unit numbers, street names, and radio codes.
What is P25 Radio Audio?
Project 25 (P25) is the digital radio standard used by public safety agencies across the US. P25 audio has characteristics that challenge general-purpose ASR:
- IMBE/AMBE vocoder compression: lossy digital encoding that distorts speech
- PTT-gated transmissions: push-to-talk creates abrupt starts/stops
- Domain vocabulary: unit numbers ("Medic 22", "Engine 111"), radio codes ("10-4"), street addresses, dispatch terminology
- Variable signal quality: portable radios, in-vehicle use, building penetration
- Overlapping transmissions: multiple units on shared talkgroups
- Background noise: sirens, wind, engine noise, crowd noise
Stock ASR models frequently misrecognize domain terms: "Medic 23" becomes "Madag 23", "cul-de-sacs" becomes "collar tax", "blow-ins" becomes "loans".
Usage
```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "AuggieActual/qwen3-asr-p25-0.6B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_new_tokens=256,
)

results = model.transcribe(audio="path/to/p25_call.wav", language="English")
print(results[0].text)
```
With Word-Level Timestamps
```python
model = Qwen3ASRModel.from_pretrained(
    "AuggieActual/qwen3-asr-p25-0.6B",
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    forced_aligner_kwargs=dict(dtype=torch.bfloat16, device_map="cuda:0"),
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_new_tokens=256,
)

results = model.transcribe(
    audio="path/to/p25_call.wav",
    language="English",
    return_time_stamps=True,
)

for word in results[0].time_stamps:
    print(f"  {word.start_time:.2f}-{word.end_time:.2f}  {word.text}")
```
Training
Overview
Full supervised fine-tuning (SFT) of Qwen3-ASR-0.6B on ~24,800 P25 radio calls with pseudo-labels. The audio encoder is frozen; only the text decoder is trained, as the encoder already handles vocoder-compressed audio adequately. The challenge is vocabulary, not acoustics.
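The frozen-encoder setup can be sketched with a small helper. Note that `audio_encoder` is an assumed attribute name for illustration, not the model's confirmed API:

```python
import torch.nn as nn

def freeze_encoder(model: nn.Module, encoder_attr: str = "audio_encoder") -> int:
    """Freeze the audio encoder so only the text decoder trains.

    `encoder_attr` is a hypothetical attribute name; adjust it to match
    the real model class. Returns the number of frozen parameters.
    """
    encoder = getattr(model, encoder_attr)
    frozen = 0
    for p in encoder.parameters():
        p.requires_grad = False  # excluded from optimizer updates
        frozen += p.numel()
    return frozen
```

Only parameters with `requires_grad=True` are passed to the optimizer, so the encoder's weights stay at their pretrained values throughout training.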
Data Collection
Training data comes from a live trunk-recorder installation monitoring multiple county-level P25 public safety systems, capturing individual call recordings from the trunked networks as they occur.
Sample selection criteria:
- Duration: 2-60 seconds (optimal for single-utterance training)
- Talkgroup diversity: dispatch, tactical, fire, EMS, law enforcement channels
- Signal quality: filtered by call duration and metadata completeness
- Call type diversity: routine dispatch, emergency, multi-unit, tactical
Two collection phases:
- Phase 1: 6,968 calls selected from a catalog of 25,566 candidates via stratified sampling across talkgroup tags, sites, duration buckets, and emergency status
- Phase 2: 20,000 additional calls selected from a filesystem catalog of 2M+ recordings, broadening coverage across systems and time periods
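The stratified selection used in both phases can be illustrated with a minimal sketch. Field names here are hypothetical; the actual pipeline stratified over talkgroup tags, sites, duration buckets, and emergency status:

```python
import random
from collections import defaultdict

def stratified_sample(catalog, keys, n_total, seed=0):
    """Pick ~n_total calls spread evenly across strata defined by `keys`.

    `catalog` is a list of per-call metadata dicts; each stratum is the
    tuple of values for `keys` (e.g. talkgroup tag + duration bucket).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for call in catalog:
        strata[tuple(call[k] for k in keys)].append(call)
    per_stratum = max(1, n_total // len(strata))
    picked = []
    for calls in strata.values():
        rng.shuffle(calls)           # random picks within each stratum
        picked.extend(calls[:per_stratum])
    return picked[:n_total]
```

Even sampling per stratum prevents high-traffic dispatch talkgroups from dominating the training mix at the expense of rarer tactical and EMS traffic.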
Audio Preprocessing
All audio preprocessed with sox to standardize format and remove P25-specific interference:
```sh
sox input.wav output.wav \
  rate 16000 \
  channels 1 \
  sinc 300-3400 \
  bandreject 1200 5q \
  bandreject 1800 5q \
  bandreject 1000 8q \
  norm -3
```
| Filter | Purpose |
|---|---|
| `rate 16000` | Resample to 16 kHz |
| `channels 1` | Convert to mono |
| `sinc 300-3400` | Bandpass for speech intelligibility range |
| `bandreject 1200 5q` | Notch filter for MDC-1200 data signaling |
| `bandreject 1800 5q` | Notch filter for MDC-1200 status tones |
| `bandreject 1000 8q` | Notch filter for alert/page tones |
| `norm -3` | Normalize peak to -3 dB |
The bandpass preserves dispatch speech while attenuating vocoder artifacts outside the speech band. MDC-1200 notch filters remove in-band data signaling that confuses ASR decoders. Narrow Q factors minimize impact on speech content.
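To apply the chain across a whole corpus, a thin wrapper over the sox CLI works; this sketch assumes `sox` is on the PATH:

```python
import subprocess
from pathlib import Path

# The exact filter chain from the command above, as an argument list.
SOX_ARGS = [
    "rate", "16000",
    "channels", "1",
    "sinc", "300-3400",
    "bandreject", "1200", "5q",
    "bandreject", "1800", "5q",
    "bandreject", "1000", "8q",
    "norm", "-3",
]

def preprocess(src: Path, dst: Path) -> None:
    """Run the sox filter chain on one call recording."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(["sox", str(src), str(dst), *SOX_ARGS], check=True)
```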
Note on VAD: VAD is intentionally not used for segmentation. P25 audio is already PTT-gated, so every transmission is active speech, and applying segmentation VAD (e.g., Silero) degrades accuracy on names and numbers at transmission boundaries.
However, a simple RMS energy gate is recommended before inference to reject blank or encrypted audio. P25 systems sometimes send encrypted channel audio or empty carrier bursts that contain no intelligible speech. Without a gate, the model will hallucinate plausible-sounding dispatch text for silent inputs. A threshold of RMS < 0.01 cleanly separates blank audio (typically < 0.003) from real speech (typically > 0.03). The included server implements this check automatically.
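The RMS gate takes only a few lines; this sketch assumes 16-bit PCM WAV input:

```python
import wave
import numpy as np

RMS_THRESHOLD = 0.01  # below this, treat the clip as blank/encrypted

def rms_energy(samples: np.ndarray) -> float:
    """Root-mean-square energy of float samples scaled to [-1, 1]."""
    return float(np.sqrt(np.mean(np.square(samples))))

def is_speech(path: str) -> bool:
    """Gate blank or encrypted P25 audio before it reaches the model."""
    with wave.open(path, "rb") as wf:
        raw = wf.readframes(wf.getnframes())
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
    return rms_energy(samples) >= RMS_THRESHOLD
```

Clips failing the gate are skipped rather than transcribed, which eliminates the hallucinated dispatch text described above.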
Pseudo-Labeling
All training labels are machine-generated pseudo-labels, quality-filtered for accuracy.
Quality filtering rejected ~7% of labels:
| Filter | Criterion | Rejects |
|---|---|---|
| Empty | No transcription returned | ~3.6% |
| Hallucination | >8 words/second (implausible for speech) | ~2% |
| Too short | <3 words | ~1% |
| Non-ASCII | <90% ASCII characters | <0.1% |
Final labeled dataset: 24,808 P25 calls passing quality filters.
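The four filters combine into a single predicate; the thresholds below mirror the table:

```python
def passes_quality_filters(text: str, duration_s: float) -> bool:
    """Accept a pseudo-label only if it clears all four quality filters."""
    text = text.strip()
    if not text:                                         # empty transcription
        return False
    words = text.split()
    if len(words) < 3:                                   # too short
        return False
    if duration_s > 0 and len(words) / duration_s > 8:   # hallucination rate
        return False
    ascii_ratio = sum(c.isascii() for c in text) / len(text)
    if ascii_ratio < 0.9:                                # non-ASCII noise
        return False
    return True
```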
Text Normalization
The raw pseudo-labels contain written-form numbers ("twenty-two", "seventy-six eighty-five"). Dispatch radio uses numeric form ("22", "7685"). A normalization pass converts ~40% of labels to match radio conventions:
| Pattern | Example | Normalized |
|---|---|---|
| Compound numbers | "twenty-two" | "22" |
| Sequential digits | "seven five three one" | "7531" |
| X-oh-Y radio IDs | "one-oh-one" | "101" |
| Common ASR errors | "Medicaid 22" | "Medic 22" |
| Common ASR errors | "Tax 46" | "Tac 46" |
| Common ASR errors | "Italian 16" | "Battalion 16" |
This normalization is critical: without it, the model would learn to output "twenty-two" instead of "22", which is wrong for the dispatch domain.
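The table above can be sketched as a small set of regex rules. This is an illustrative subset only; the production pass covers more patterns:

```python
import re

NUMBER_WORDS = {
    "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "zero": "0", "oh": "0",
}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
ASR_FIXES = {"Medicaid": "Medic", "Tax": "Tac", "Italian": "Battalion"}

def normalize(text: str) -> str:
    # Known ASR confusions when followed by a number ("Tax 46" -> "Tac 46").
    for wrong, right in ASR_FIXES.items():
        text = re.sub(rf"\b{wrong}(?=\s+\d)", right, text)
    # Compound numbers: "twenty-two" -> "22".
    tens = "|".join(TENS)
    ones = "|".join(w for w in NUMBER_WORDS if w not in ("zero", "oh"))
    text = re.sub(
        rf"\b({tens})-({ones})\b",
        lambda m: str(TENS[m.group(1).lower()] + int(NUMBER_WORDS[m.group(2).lower()])),
        text, flags=re.I,
    )
    # Spoken digit runs: "seven five three one" -> "7531", "one-oh-one" -> "101".
    digit = "|".join(NUMBER_WORDS)
    text = re.sub(
        rf"\b(?:(?:{digit})[ -]){{2,}}(?:{digit})\b",
        lambda m: "".join(NUMBER_WORDS[w.lower()] for w in re.split(r"[ -]", m.group(0))),
        text, flags=re.I,
    )
    return text
```

The ASR-confusion fixes run first and only fire when followed by a number, so ordinary uses of words like "tax" in running speech are left alone.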
Training Configuration
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-ASR-0.6B |
| Training type | Full SFT (not LoRA) |
| Training samples | 23,519 |
| Eval samples | 1,237 |
| Frozen components | Audio encoder (317M params) |
| Trainable params | ~310M (text decoder) |
| Effective batch size | 64 |
| Learning rate | 2e-5 (cosine schedule) |
| Warmup | 5% of steps |
| Epochs | 3 |
| Total steps | 2,205 |
| Optimizer | AdamW |
| Precision | bfloat16 |
| Hardware | 2x NVIDIA RTX 3090 Ti (24 GB) |
| Training time | ~3.4 hours |
Loss masking: Only the transcription target tokens contribute to the loss. The system prompt and audio tokens are masked with -100 so the model only learns to generate transcriptions, not to reproduce the prompt.
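The masking can be sketched as follows, where `prompt_len` stands for the number of prompt and audio tokens preceding the transcription target:

```python
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Mask the non-target prefix so only transcription tokens are supervised.

    `input_ids` is a 1-D sequence; the first `prompt_len` positions cover
    the system prompt and audio tokens, the rest is the transcription.
    """
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX  # no gradient from prompt/audio tokens
    return labels
```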
Training Dynamics
| Step | Train Loss | Eval Loss |
|---|---|---|
| 25 | 128.1 | – |
| 100 | 29.5 | – |
| 500 | 3.7 | 0.249 |
| 1,000 | 2.6 | 0.229 |
| 1,500 | 2.1 | 0.225 |
| 2,000 | 2.1 | 0.225 |
| 2,205 | 2.0 | – |
Eval loss plateaus at ~0.225 from step 1,500 onward with no overfitting.
Qualitative Performance by Call Type
- Fire/Law Dispatch: strong performance on structured dispatch formats (unit assignments, addresses, incident types)
- Law/Fire Tactical: handles conversational officer-to-officer and fireground communications well
- Security: accurate on facility announcements (e.g., Code Blue pages)
- Multi-Talk / Overlapping: degrades when multiple speakers transmit simultaneously
- Noisy Tactical: still struggles on very noisy channels (EMS tactical with heavy background noise), though all models tested perform poorly here
Limitations
- English only: trained exclusively on English-language P25 dispatch
- Pseudo-labels: training data is machine-labeled, not human-verified; some label noise remains
- Regional vocabulary: trained on a specific set of county systems; may not generalize perfectly to all P25 deployments (different unit numbering, street names, dispatch protocols)
- Short utterances: optimized for typical P25 call lengths (2-60 seconds); untested on longer recordings
- Noisy tactical channels: performance degrades on very noisy tactical channels
- Hallucination on blank audio: like all autoregressive ASR models, it will generate plausible-sounding text for silent/encrypted input; use the RMS energy gate described above
Intended Use
- Transcription of P25 public safety radio for dispatch logging, search, and review
- Starting point for fine-tuning on specific agency vocabularies
- Research into domain-adapted ASR for vocoder-compressed radio audio
How to Reproduce
The complete training pipeline is documented in this repository:
- Audio collection: trunk-recorder captures P25 call recordings
- Sample selection: stratified sampling for diversity across talkgroups, duration, call types
- Preprocessing: sox bandpass + notch filters → 16 kHz mono WAV
- Pseudo-labeling: machine-generated transcriptions with quality filtering
- Text normalization: number conversion and dispatch term correction
- Training: full SFT with frozen encoder, 3 epochs, cosine LR schedule
Citation
If you use this model, please cite:
```bibtex
@misc{qwen3-asr-p25,
  title={Qwen3-ASR-0.6B-P25: Fine-Tuned Speech Recognition for P25 Public Safety Radio},
  author={AuggieActual},
  year={2026},
  url={https://huggingface.co/AuggieActual/qwen3-asr-p25-0.6B}
}
```
Acknowledgments
- Qwen3-ASR by the Alibaba Qwen team: excellent base model
- trunk-recorder: P25 radio capture
- Inspired by Police Radio ASR (2024), a directly analogous domain study