Qwen3-ASR-0.6B-P25: Speech Recognition for P25 Public Safety Radio

A fine-tuned Qwen3-ASR-0.6B model specialized for transcribing P25 trunked radio dispatch audio. Significantly outperforms stock Qwen3-ASR, Whisper Large-v3, and other general-purpose ASR models on P25 dispatch vocabulary.

Results

Informal evaluation on a small held-out set of P25 calls with human-verified transcriptions shows substantial improvement over stock models. The fine-tuned 0.6B model consistently outperforms stock Qwen3-ASR (0.6B and 1.7B), Whisper Large-v3 (stock and LoRA-tuned), and other general-purpose ASR on P25 dispatch audio, particularly on domain-specific vocabulary such as unit numbers, street names, and radio codes.

What is P25 Radio Audio?

Project 25 (P25) is the digital radio standard used by public safety agencies across the US. P25 audio has characteristics that challenge general-purpose ASR:

  • IMBE/AMBE vocoder compression: lossy digital encoding that distorts speech
  • PTT-gated transmissions: push-to-talk creates abrupt starts and stops
  • Domain vocabulary: unit numbers ("Medic 22", "Engine 111"), radio codes ("10-4"), street addresses, dispatch terminology
  • Variable signal quality: portable radios, in-vehicle use, building penetration
  • Overlapping transmissions: multiple units on shared talkgroups
  • Background noise: sirens, wind, engine noise, crowd noise

Stock ASR models frequently misrecognize domain terms: "Medic 23" becomes "Madag 23", "cul-de-sacs" becomes "collar tax", "blow-ins" becomes "loans".

Usage

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "AuggieActual/qwen3-asr-p25-0.6B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_new_tokens=256,
)

results = model.transcribe(audio="path/to/p25_call.wav", language="English")
print(results[0].text)

With Word-Level Timestamps

model = Qwen3ASRModel.from_pretrained(
    "AuggieActual/qwen3-asr-p25-0.6B",
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    forced_aligner_kwargs=dict(dtype=torch.bfloat16, device_map="cuda:0"),
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_new_tokens=256,
)

results = model.transcribe(
    audio="path/to/p25_call.wav",
    language="English",
    return_time_stamps=True,
)

for word in results[0].time_stamps:
    print(f"  {word.start_time:.2f}-{word.end_time:.2f}  {word.text}")

Training

Overview

Full supervised fine-tuning (SFT) of Qwen3-ASR-0.6B on ~24,800 P25 radio calls with pseudo-labels. The audio encoder is frozen; only the text decoder is trained, since the encoder already handles vocoder-compressed audio adequately. The challenge is vocabulary, not acoustics.
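As an illustrative sketch, freezing the encoder in PyTorch amounts to disabling gradients on its parameters before building the optimizer. The submodule names below (`encoder`, `decoder`) are hypothetical stand-ins; the real attribute layout depends on the Qwen3-ASR implementation.

```python
import torch
from torch import nn

def freeze_encoder(model: nn.Module, prefix: str = "encoder.") -> None:
    """Disable gradients for every parameter under the audio encoder."""
    for name, param in model.named_parameters():
        if name.startswith(prefix):
            param.requires_grad = False

# Toy stand-in for a model with an "encoder" and a "decoder" submodule.
toy = nn.Module()
toy.encoder = nn.Linear(8, 8)
toy.decoder = nn.Linear(8, 8)

freeze_encoder(toy)
# Only decoder parameters remain trainable.
trainable = [n for n, p in toy.named_parameters() if p.requires_grad]
```

The optimizer would then be constructed over `(p for p in model.parameters() if p.requires_grad)` so frozen weights consume no optimizer state.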

Data Collection

Training data was sourced from a live trunk-recorder installation monitoring multiple county-level public safety systems; trunk-recorder captures each transmission on the trunked systems as an individual call recording.

Sample selection criteria:

  • Duration: 2-60 seconds (optimal for single-utterance training)
  • Talkgroup diversity: dispatch, tactical, fire, EMS, law enforcement channels
  • Signal quality: filtered by call duration and metadata completeness
  • Call type diversity: routine dispatch, emergency, multi-unit, tactical

Two collection phases:

  1. Phase 1: 6,968 calls selected from a catalog of 25,566 candidates via stratified sampling across talkgroup tags, sites, duration buckets, and emergency status
  2. Phase 2: 20,000 additional calls selected from a filesystem catalog of 2M+ recordings, broadening coverage across systems and time periods
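The stratified selection can be sketched as grouping the catalog by stratum and drawing evenly from each group. The field names (`tag`, `bucket`) are illustrative, not the actual catalog schema.

```python
import random
from collections import defaultdict

def stratified_sample(catalog, target, key, seed=0):
    """Spread the selection evenly across strata instead of sampling at random."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for call in catalog:
        strata[key(call)].append(call)
    per_stratum = max(1, target // len(strata))
    picked = []
    for calls in strata.values():
        picked.extend(rng.sample(calls, min(per_stratum, len(calls))))
    return picked[:target]

# Toy catalog: 3 talkgroup tags x 2 duration buckets, 50 calls each.
catalog = [{"tag": t, "bucket": b, "id": f"{t}-{b}-{i}"}
           for t in ("fire", "ems", "law")
           for b in ("short", "long")
           for i in range(50)]
sample = stratified_sample(catalog, target=30,
                           key=lambda c: (c["tag"], c["bucket"]))
```

Each (tag, bucket) stratum contributes equally, so rare talkgroups are not drowned out by high-traffic dispatch channels.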

Audio Preprocessing

All audio preprocessed with sox to standardize format and remove P25-specific interference:

sox input.wav output.wav \
    rate 16000 \
    channels 1 \
    sinc 300-3400 \
    bandreject 1200 5q \
    bandreject 1800 5q \
    bandreject 1000 8q \
    norm -3
| Filter | Purpose |
|---|---|
| rate 16000 | Resample to 16 kHz |
| channels 1 | Convert to mono |
| sinc 300-3400 | Bandpass for speech intelligibility range |
| bandreject 1200 5q | Notch filter for MDC-1200 data signaling |
| bandreject 1800 5q | Notch filter for MDC-1200 status tones |
| bandreject 1000 8q | Notch filter for alert/page tones |
| norm -3 | Normalize peak to -3 dB |

The bandpass preserves dispatch speech while attenuating vocoder artifacts outside the speech band. MDC-1200 notch filters remove in-band data signaling that confuses ASR decoders. Narrow Q factors minimize impact on speech content.
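For batch preprocessing, the same filter chain can be driven from Python via subprocess. The directory names below are illustrative.

```python
import subprocess
from pathlib import Path

# Same chain as the sox command above, as an argv list.
SOX_FILTERS = [
    "rate", "16000",             # resample to 16 kHz
    "channels", "1",             # mono
    "sinc", "300-3400",          # speech-band bandpass
    "bandreject", "1200", "5q",  # MDC-1200 data signaling
    "bandreject", "1800", "5q",  # MDC-1200 status tones
    "bandreject", "1000", "8q",  # alert/page tones
    "norm", "-3",                # peak-normalize to -3 dB
]

def sox_command(src: Path, dst: Path) -> list[str]:
    """Build the sox invocation for one call recording."""
    return ["sox", str(src), str(dst), *SOX_FILTERS]

# Example batch run (requires sox on PATH; paths are illustrative):
# for wav in Path("raw_calls").glob("*.wav"):
#     subprocess.run(sox_command(wav, Path("clean") / wav.name), check=True)
```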

Note on VAD: VAD is intentionally not used for segmentation, because P25 audio is already PTT-gated: every transmission is active speech. Applying segmentation VAD (e.g., Silero) degrades accuracy on names and numbers at transmission boundaries.

However, a simple RMS energy gate is recommended before inference to reject blank or encrypted audio. P25 systems sometimes carry encrypted-channel audio or empty carrier bursts that contain no intelligible speech, and without a gate the model will hallucinate plausible-sounding dispatch text for these silent inputs. An RMS threshold of 0.01 cleanly separates blank audio (typically < 0.003) from real speech (typically > 0.03). The included server implements this check automatically.
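A minimal version of that gate, assuming float audio in [-1, 1] (e.g., as returned by soundfile):

```python
import numpy as np

RMS_THRESHOLD = 0.01  # blank/encrypted audio is typically < 0.003, speech > 0.03

def passes_energy_gate(samples: np.ndarray,
                       threshold: float = RMS_THRESHOLD) -> bool:
    """Return True if the clip has enough energy to plausibly contain speech."""
    rms = float(np.sqrt(np.mean(np.square(samples, dtype=np.float64))))
    return rms >= threshold

# Synthetic check: near-silence is rejected, a speech-level tone passes.
silence = np.full(16000, 1e-4)
tone = 0.1 * np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
```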

Pseudo-Labeling

All training labels are machine-generated pseudo-labels, quality-filtered for accuracy.

Quality filtering rejected ~7% of labels:

| Filter | Criterion | Rejects |
|---|---|---|
| Empty | No transcription returned | ~3.6% |
| Hallucination | >8 words/second (implausible for speech) | ~2% |
| Too short | <3 words | ~1% |
| Non-ASCII | <90% ASCII characters | <0.1% |

Final labeled dataset: 24,808 P25 calls passing quality filters.
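The filters above can be sketched as a single predicate. Thresholds come from the table; the real pipeline may differ in details.

```python
def label_ok(text: str, duration_s: float) -> bool:
    """Apply the pseudo-label quality filters to one transcription."""
    words = text.split()
    if not words:                     # empty transcription
        return False
    if len(words) < 3:                # too short to be a useful target
        return False
    if duration_s > 0 and len(words) / duration_s > 8:
        return False                  # implausible word rate: likely hallucination
    ascii_chars = sum(c.isascii() for c in text)
    if ascii_chars / max(len(text), 1) < 0.90:
        return False                  # mostly non-ASCII output
    return True
```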

Text Normalization

The raw pseudo-labels contain written-form numbers ("twenty-two", "seventy-six eighty-five"). Dispatch radio uses numeric form ("22", "7685"). A normalization pass converts ~40% of labels to match radio conventions:

| Pattern | Example | Normalized |
|---|---|---|
| Compound numbers | "twenty-two" | "22" |
| Sequential digits | "seven five three one" | "7531" |
| X-oh-Y radio IDs | "one-oh-one" | "101" |
| Common ASR errors | "Medicaid 22" | "Medic 22" |
| Common ASR errors | "Tax 46" | "Tac 46" |
| Common ASR errors | "Italian 16" | "Battalion 16" |

This normalization is critical: without it, the model would learn to output "twenty-two" instead of "22", which is wrong for the dispatch domain.
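Two of these passes, compound-number conversion and term correction, can be sketched as regex rewrites. The correction map here is a hypothetical excerpt, not the full list used in training.

```python
import re

ONES = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
        "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
# Hypothetical excerpt of the ASR-error correction map.
TERM_FIXES = {"Medicaid": "Medic", "Tax": "Tac", "Italian": "Battalion"}

_COMPOUND = re.compile(
    r"\b(" + "|".join(TENS) + r")-(" + "|".join(ONES) + r")\b", re.IGNORECASE)

def normalize(text: str) -> str:
    # Compound numbers: "twenty-two" -> "22"
    text = _COMPOUND.sub(
        lambda m: str(TENS[m.group(1).lower()] + ONES[m.group(2).lower()]), text)
    # Domain term fixes, applied only when followed by a unit number
    for wrong, right in TERM_FIXES.items():
        text = re.sub(rf"\b{wrong}\b(?=\s+\d)", right, text)
    return text
```

The lookahead `(?=\s+\d)` keeps the corrections conservative: "Tax" is only rewritten to "Tac" when a unit number follows.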

Training Configuration

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-ASR-0.6B |
| Training type | Full SFT (not LoRA) |
| Training samples | 23,519 |
| Eval samples | 1,237 |
| Frozen components | Audio encoder (317M params) |
| Trainable params | ~310M (text decoder) |
| Effective batch size | 64 |
| Learning rate | 2e-5 (cosine schedule) |
| Warmup | 5% of steps |
| Epochs | 3 |
| Total steps | 2,205 |
| Optimizer | AdamW |
| Precision | bfloat16 |
| Hardware | 2x NVIDIA RTX 3090 Ti (24 GB) |
| Training time | ~3.4 hours |

Loss masking: Only the transcription target tokens contribute to the loss. The system prompt and audio tokens are masked with -100 so the model only learns to generate transcriptions, not to reproduce the prompt.
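A simplified version of that masking (the real collator also handles padding and batching):

```python
import torch

IGNORE_INDEX = -100  # tokens with this label are skipped by cross-entropy

def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Mask everything before the transcription so only target tokens train."""
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX
    return labels

# Toy sequence: 4 prompt/audio tokens followed by 3 transcription tokens.
ids = torch.tensor([101, 7, 8, 9, 42, 43, 44])
labels = build_labels(ids, prompt_len=4)
```

PyTorch's `CrossEntropyLoss` ignores positions labeled -100 by default, so the gradient flows only through the transcription tokens.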

Training Dynamics

| Step | Train Loss | Eval Loss |
|---|---|---|
| 25 | 128.1 | – |
| 100 | 29.5 | – |
| 500 | 3.7 | 0.249 |
| 1,000 | 2.6 | 0.229 |
| 1,500 | 2.1 | 0.225 |
| 2,000 | 2.1 | 0.225 |
| 2,205 | 2.0 | – |

Eval loss plateaus at ~0.225 from step 1,500 onward with no overfitting.

Qualitative Performance by Call Type

  • Fire/Law Dispatch: strong performance on structured dispatch formats (unit assignments, addresses, incident types)
  • Law/Fire Tactical: handles conversational officer-to-officer and fireground communications well
  • Security: accurate on facility announcements (e.g., Code Blue pages)
  • Multi-Talk / Overlapping: degrades when multiple speakers transmit simultaneously
  • Noisy Tactical: still struggles on very noisy channels (EMS tactical with heavy background noise), though all models tested perform poorly here

Limitations

  • English only: trained exclusively on English-language P25 dispatch
  • Pseudo-labels: training data is machine-labeled, not human-verified; some label noise remains
  • Regional vocabulary: trained on a specific set of county systems; may not generalize perfectly to all P25 deployments (different unit numbering, street names, dispatch protocols)
  • Short utterances: optimized for typical P25 call lengths (2-60 seconds); untested on longer recordings
  • Noisy tactical channels: performance degrades on very noisy tactical channels
  • Hallucination on blank audio: like all autoregressive ASR models, will generate plausible-sounding text for silent/encrypted input; use the RMS energy gate described above

Intended Use

  • Transcription of P25 public safety radio for dispatch logging, search, and review
  • Starting point for fine-tuning on specific agency vocabularies
  • Research into domain-adapted ASR for vocoder-compressed radio audio

How to Reproduce

The complete training pipeline is documented in this repository:

  1. Audio collection: trunk-recorder captures P25 call recordings
  2. Sample selection: stratified sampling for diversity across talkgroups, duration, call types
  3. Preprocessing: sox bandpass + notch filters → 16 kHz mono WAV
  4. Pseudo-labeling: machine-generated transcriptions with quality filtering
  5. Text normalization: number conversion and dispatch term correction
  6. Training: full SFT with frozen encoder, 3 epochs, cosine LR schedule

Citation

If you use this model, please cite:

@misc{qwen3-asr-p25,
  title={Qwen3-ASR-0.6B-P25: Fine-Tuned Speech Recognition for P25 Public Safety Radio},
  author={AuggieActual},
  year={2026},
  url={https://huggingface.co/AuggieActual/qwen3-asr-p25-0.6B}
}
