hey-virgil: wake-word detector
A small ONNX wake-word classifier for the phrase "hey virgil", trained from scratch on top of openWakeWord's shared feature extractors. Built for always-listening voice assistants that need low CPU, no network round-trip, and minimal false-fires.
| Property | Value |
|---|---|
| Phrase | hey virgil |
| Architecture | openWakeWord DNN (1 hidden block, 128-unit) on top of Google Speech Embeddings |
| Parameters | 213,889 |
| Sample rate | 16 kHz mono int16 |
| Window | 2.0 s (16 stacked 96-dim embeddings) |
| Format | ONNX with external-data weights (15 KB graph + 836 KB .onnx.data) |
| Trained on | 40,592 clips (11,800 positives + 28,792 negatives): synthetic speech (OmniVoice + Piper TTS) plus LibriSpeech, with RIR/MUSAN noise overlays |
| License | Apache-2.0 |
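As a sanity check, the parameter count and weight-file size in the table can be reproduced with a few lines of arithmetic. The layer layout below (input projection with LayerNorm, one hidden block, single-unit output) is an assumption about how the openWakeWord DNN is wired, not something read from the checkpoint; it is used here because it reproduces the 213,889 figure exactly.

```python
# Assumed layer layout (not read from the checkpoint):
# Flatten -> Linear + LayerNorm -> one hidden block (Linear + LayerNorm) -> Linear(1)
FRAMES, EMBED_DIM, HIDDEN = 16, 96, 128


def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Scale plus shift parameters of a LayerNorm."""
    return 2 * n


input_dim = FRAMES * EMBED_DIM  # 16 stacked 96-dim embeddings, flattened
total_params = (
    linear(input_dim, HIDDEN) + layer_norm(HIDDEN)  # input projection
    + linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN)   # the single hidden block
    + linear(HIDDEN, 1)                             # output unit (sigmoid)
)
weights_kib = total_params * 4 / 1024  # float32 = 4 bytes per parameter

print(total_params)        # 213889
print(round(weights_kib))  # 836
```

Under the same assumption, the float32 size also lines up with the 836 KB .onnx.data sidecar (taking 1 KB as 1,024 bytes).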
Performance
Held-out test set (1,011 clips, 200 positives + 811 negatives, fully synthetic + adversarial confounders):
| Metric | Value |
|---|---|
| Recall (TPR) | 68.5% |
| False positives / hr | 0.0 |
| Validation accuracy | 84.2% |
Manual spot-check via the wake-ort-probe Rust runner over 10 known-good "hey virgil" positives + 10 hand-curated confounders (virtual, virginia, vergil, the verge, virtuous, plus LibriSpeech speech + MUSAN noise):
| Threshold | TPR | FPR |
|---|---|---|
| 0.50 (default) | 6/10 | 0/10 |
| 0.30 (recommended for live use with smoothing) | 6/10 | 0/10 |
The 4 missed positives in the spot-check were OmniVoice synthesis failures (noise-only WAVs), not real model misses, so the model fired on all 6 clips that actually contained the phrase. The 60% TPR on this stress set therefore understates real-world performance; the hand-curated WAVs were also chosen to be adversarial.
The score margin behind the 0% FPR is very wide: confounders top out at ~0.002 confidence, while real wake hits land at ~0.99. There's room to drop the threshold significantly if you want softer wake responsiveness; see "Tuning" below.
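To see why both rows of the spot-check table are identical, here is a small sketch that counts fires at several thresholds. The score lists are illustrative placeholders built from the approximate figures above (~0.99 for real hits, ~0.002 for confounders); the 0.0 assigned to the four noise-only clips is an assumption, and none of these are the actual per-clip scores.

```python
# Illustrative scores only, not measured values.
positives = [0.99] * 6 + [0.0] * 4  # 6 real hits + 4 noise-only clips (assumed 0.0)
confounders = [0.002] * 10


def count_fires(scores, threshold):
    """Number of clips whose score reaches the threshold."""
    return sum(score >= threshold for score in scores)


for threshold in (0.50, 0.30, 0.10):
    tp = count_fires(positives, threshold)
    fp = count_fires(confounders, threshold)
    print(f"threshold={threshold:.2f}  TPR={tp}/{len(positives)}  FPR={fp}/{len(confounders)}")
```

Under these assumptions, any threshold above ~0.002 and no higher than ~0.99 gives the same 6/10 and 0/10 counts.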
Quick start (Python)
import soundfile as sf
from huggingface_hub import hf_hub_download
from openwakeword.model import Model

model_path = hf_hub_download(
    "littlebearlabs/hey-virgil-wake-word",
    "hey-virgil-v1.onnx",
)
# IMPORTANT: also pull the external-data sidecar so ONNX Runtime can load the weights.
# Both files land in the same cache directory, so the .onnx can find its .onnx.data.
hf_hub_download(
    "littlebearlabs/hey-virgil-wake-word",
    "hey-virgil-v1.onnx.data",
)
wake = Model(wakeword_models=[model_path], inference_framework="onnx")

# Score a buffer of 16 kHz mono int16 PCM samples
samples, sr = sf.read("hey_virgil.wav", dtype="int16")
assert sr == 16000, "the model expects 16 kHz audio"
scores = wake.predict(samples)
print(scores)  # e.g. {"hey_virgil_v1": 0.998}
Streaming use is identical to any other openWakeWord model: feed audio in 80 ms chunks (1,280 samples at 16 kHz) via Model.predict() and threshold on the returned score.
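As a rough sketch of the chunking step, the helper below splits a sample buffer into 80 ms frames. The name iter_chunks and the choice to zero-pad the final partial frame are assumptions for illustration, not part of openWakeWord.

```python
SAMPLE_RATE = 16_000
CHUNK_MS = 80
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000  # 1280 samples per chunk


def iter_chunks(samples, chunk_samples=CHUNK_SAMPLES):
    """Yield fixed-size chunks; the last one is zero-padded (an assumption)."""
    for start in range(0, len(samples), chunk_samples):
        chunk = list(samples[start:start + chunk_samples])
        chunk.extend([0] * (chunk_samples - len(chunk)))
        yield chunk


two_seconds = [0] * (2 * SAMPLE_RATE)  # silent placeholder audio
chunks = list(iter_chunks(two_seconds))
print(len(chunks), len(chunks[0]))  # 25 1280
```

In a real loop each chunk would be converted to an int16 NumPy array before being passed to Model.predict().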
Quick start (Rust, ort)
The reference Rust integration uses ort directly with openWakeWord's bundled featurization graphs. See wake-ort-probe for a complete CLI that loads the 3-stage chain (melspec → embedding → wake DNN) and exposes both batch and live-mic modes.
use ort::session::{builder::GraphOptimizationLevel, Session};
use ort::value::Tensor;
use ndarray::{Array2, Array3, Array4, Axis};
// `mut` because recent ort releases take `&mut self` in `run()`.
let mut wake = Session::builder()?
    .with_optimization_level(GraphOptimizationLevel::Level3)?
    .commit_from_file("hey-virgil-v1.onnx")?; // .onnx.data must be in the same dir
// ... feed via the openwakeword feature pipeline (mel + embed sessions also needed) ...
// `stacked_embeddings` holds the 16 stacked 96-dim embeddings produced by that pipeline.
let out = wake.run(ort::inputs!["x" => Tensor::from_array(stacked_embeddings)?])?;
let confidence: f32 = *out[0].try_extract_array::<f32>()?.iter().next().unwrap();
Dependencies
This is only the wake-word DNN (15 KB graph + 836 KB external weights). To run end-to-end you also need the openWakeWord shared featurization graphs:
- melspectrogram.onnx (~1.1 MB): converts 16 kHz audio → 32-bin mel frames
- embedding_model.onnx (~1.3 MB): Google Speech Embedding model, mel frames → 96-dim embeddings
Both are bundled in the openwakeword pip package (pip install openwakeword) and downloaded lazily on first use. They're also Apache-2.0.
If you can't pull from pip, the same two ONNX files can be downloaded directly from the openWakeWord repo.
Tuning
Defaults from in-room dogfooding:
| Setting | Default | Notes |
|---|---|---|
| threshold | 0.30 | Per-frame cutoff after smoothing. The 0% FPR floor leaves room down to ~0.10. |
| smooth_frames | 3 | Running average over last 3 raw scores (~240 ms). Damps single-spike false fires. |
| gain | 1.0 | Pre-multiply input samples. Bump to 1.5 to 2.5 if your mic / AEC quiets you. |
| refractory_ms | 1500 | Lockout after a hit to prevent re-firing on the same utterance. |
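The gain setting amounts to multiplying each sample before featurization. A minimal sketch follows; the function name apply_gain and the hard clipping to the int16 range are assumptions, since the table does not say how overflow is handled.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def apply_gain(samples, gain=1.0):
    """Scale int16 samples and clip to the int16 range (clipping is an assumption)."""
    return [max(INT16_MIN, min(INT16_MAX, round(s * gain))) for s in samples]


print(apply_gain([1000, -20000, 30000], gain=2.0))  # [2000, -32768, 32767]
```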
If hits feel laggy: shrink smooth_frames to 1 or 2. If false fires sneak in: raise threshold to 0.5 or grow smooth_frames to 5.
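The smoothing, threshold, and refractory settings can be combined into a small post-processing stage. The sketch below is one possible reading of them, not the code used by wake-ort-probe: the class name WakeGate, the plain moving average (with missing warm-up frames counted as 0), and measuring the lockout in 80 ms frames are all assumptions.

```python
from collections import deque

FRAME_MS = 80  # one raw score per 80 ms chunk


class WakeGate:
    """Moving-average smoothing + threshold + refractory lockout (hypothetical helper)."""

    def __init__(self, threshold=0.30, smooth_frames=3, refractory_ms=1500):
        self.threshold = threshold
        self.recent = deque(maxlen=smooth_frames)
        # Round up so the lockout is never shorter than refractory_ms.
        self.lockout_frames = -(-refractory_ms // FRAME_MS)
        self.cooldown = 0

    def update(self, raw_score):
        """Feed one raw score; return True when a wake event should fire."""
        self.recent.append(raw_score)
        if self.cooldown > 0:
            self.cooldown -= 1
            return False
        # Divide by the full window so warm-up frames count as 0.
        smoothed = sum(self.recent) / self.recent.maxlen
        if smoothed >= self.threshold:
            self.cooldown = self.lockout_frames
            self.recent.clear()
            return True
        return False


gate = WakeGate()
scores = [0.0, 0.6, 0.0, 0.0, 0.99, 0.99, 0.99, 0.0]
fired = [i for i, s in enumerate(scores) if gate.update(s)]
print(fired)  # [4]
```

The isolated 0.6 spike averages to 0.2 and is suppressed, while the sustained hit fires once and the lockout swallows the following frames. Note that in this sketch a single frame near 0.99 already averages to 0.33, which clears a 0.30 threshold; raising threshold or smooth_frames tightens that.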
Training
- Architecture: openWakeWord DNN, 1 hidden block of 128 units, sigmoid output (binary classifier).
- Featurization: 16-frame stack of 96-dim Google Speech Embeddings (Google's speech_embedding/1), unchanged from upstream.
- Dataset: lightsofapollo/virgil-wake-word (currently private, may be made public; reproducibility recipes in planning folder). 40,592 clips total, all synthetic positives (OmniVoice + Piper TTS across diverse voices/accents/styles) plus a curated negative set covering 5 confounder classes (Class A vir- onset, Class B proper-noun neighbors, Class C confounder+verb bigrams, Class D embedding noise, Class E actually-Virgil-but-not-the-wake) and broad LibriSpeech background.
- Augmentation: RIR + MUSAN noise overlays applied during training via openwakeword's standard pipeline.
- Training run: ~13.5 min on a single RTX PRO 6000 S, batch size 64, 31,750 steps, AdamW.
The metrics.csv file in this repo contains per-step val_recall / val_accuracy / val_fp from the training run.
Files
| File | Size | Purpose |
|---|---|---|
| hey-virgil-v1.onnx | 15 KB | The wake DNN graph (external-data format) |
| hey-virgil-v1.onnx.data | 836 KB | External weights; must live alongside the .onnx |
| hey-virgil-v1.pt | 840 KB | Original PyTorch checkpoint (for retraining / fine-tuning) |
| metrics.csv | 5 KB | Per-step training metrics |
| training-config.json | 1 KB | Hparams, dataset revision, commit sha |
Citation
If you use this in research, please cite openWakeWord:
@software{openwakeword,
  author = {David Scripka},
  title = {openWakeWord: A library for training open-source wake word models},
  year = {2024},
  url = {https://github.com/dscripka/openWakeWord}
}
License
Apache-2.0. See LICENSE in this repo. Featurization graphs (melspec + Google speech-embedding) inherit their original Apache-2.0 license from openWakeWord / Google.