hey-virgil — wake-word detector

A small ONNX wake-word classifier for the phrase "hey virgil", trained from scratch on top of openWakeWord's shared feature extractors. Built for always-listening voice assistants that need low CPU, no network round-trip, and minimal false-fires.


Phrase	`hey virgil`
Architecture	openWakeWord DNN (1 hidden block, 128-unit) on top of Google Speech Embeddings
Parameters	213,889
Sample rate	16 kHz mono int16
Window	2.0 s (16 stacked 96-dim embeddings)
Format	ONNX with external-data weights (15 KB graph + 836 KB `.onnx.data`)
Trained on	40,592 clips (11,800 positives + 28,792 negatives): synthetic speech (OmniVoice + Piper TTS) plus LibriSpeech recordings, with RIR/MUSAN noise overlays
License	Apache-2.0

Performance

Held-out test set (1,011 clips, 200 positives + 811 negatives, fully synthetic + adversarial confounders):

Metric	Value
Recall (TPR)	68.5%
False positives / hr	0.0
Validation accuracy	84.2%

Manual spot-check via the wake-ort-probe Rust runner over 10 known-good "hey virgil" positives + 10 hand-curated confounders (virtual, virginia, vergil, the verge, virtuous, plus LibriSpeech speech + MUSAN noise):

Threshold	TPR	FPR
0.50 (default)	6/10	0/10
0.30 (recommended for live use with smoothing)	6/10	0/10

The 4 missed positives in the spot-check were OmniVoice synthesis failures (noise-only WAVs), not real model misses: all 6 clips that actually contained the phrase were detected. The 60% TPR on this stress set therefore understates real-world performance, and the hand-curated confounders were chosen to be adversarial.

The margin behind the 0% FPR is very wide: confounders top out at ~0.002 confidence, while real wake hits land at ~0.99. There is room to lower the threshold significantly if you want more sensitive wake detection; see "Tuning" below.

Quick start (Python)

from huggingface_hub import hf_hub_download
from openwakeword.model import Model

model_path = hf_hub_download(
    "littlebearlabs/hey-virgil-wake-word",
    "hey-virgil-v1.onnx",
)
# IMPORTANT: also pull the external-data sidecar so ort can load weights
hf_hub_download(
    "littlebearlabs/hey-virgil-wake-word",
    "hey-virgil-v1.onnx.data",
)

wake = Model(wakeword_models=[model_path], inference_framework="onnx")

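# Score a 16 kHz mono int16 clip. predict() is meant for streaming 80 ms chunks,
# and openWakeWord zeroes the scores of the first few frames after start-up, so
# use predict_clip() for a whole file: it streams the clip chunk by chunk and
# returns one score dict per chunk.
import soundfile as sf

samples, sr = sf.read("hey_virgil.wav", dtype="int16")
assert sr == 16000, "model expects 16 kHz audio"
frame_scores = wake.predict_clip(samples)
print(max(max(frame.values()) for frame in frame_scores))  # peak score, e.g. 0.998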

Streaming use is identical to any other openWakeWord model — feed audio in 80 ms chunks via Model.predict() and threshold on the returned score.
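Microphone callbacks rarely deliver exactly 80 ms at a time, so a small re-blocking buffer is usually needed in front of Model.predict(). The following is a minimal sketch, not part of openWakeWord: the ChunkBuffer class and its push() method are hypothetical names, and a plain Python list is used only to keep the example dependency-free.

```python
CHUNK_SAMPLES = 1280  # 80 ms at 16 kHz


class ChunkBuffer:
    """Hypothetical helper: re-block arbitrary-length sample blocks into 80 ms chunks."""

    def __init__(self, chunk_samples=CHUNK_SAMPLES):
        self.chunk_samples = chunk_samples
        self._pending = []  # samples waiting for a full chunk

    def push(self, samples):
        """Add new samples and return every complete chunk now available."""
        self._pending.extend(samples)
        chunks = []
        while len(self._pending) >= self.chunk_samples:
            chunks.append(self._pending[:self.chunk_samples])
            del self._pending[:self.chunk_samples]
        return chunks


buf = ChunkBuffer()
print(len(buf.push([0] * 1000)))  # 0 (not enough for one chunk yet)
print(len(buf.push([0] * 2000)))  # 2 (3000 samples total, 440 left pending)
```

Each returned chunk can then be converted with np.asarray(chunk, dtype=np.int16) and passed to Model.predict(); leftover samples stay in the buffer until the next callback.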

Quick start (Rust, `ort`)

The reference Rust integration uses ort directly with openWakeWord's bundled featurization graphs. See wake-ort-probe for a complete CLI that loads the 3-stage chain (melspec → embedding → wake DNN) and exposes both batch and live-mic modes.

use ort::session::{builder::GraphOptimizationLevel, Session};
use ort::value::Tensor;
use ndarray::{Array2, Array3, Array4, Axis};

let mut wake = Session::builder()?  // `mut` because Session::run takes `&mut self` in recent ort 2.x releases
    .with_optimization_level(GraphOptimizationLevel::Level3)?
    .commit_from_file("hey-virgil-v1.onnx")?;  // .onnx.data must be in the same dir

// ... feed via the openwakeword feature pipeline (mel + embed sessions also needed) ...
let out = wake.run(ort::inputs!["x" => Tensor::from_array(stacked_embeddings)?])?;
let confidence: f32 = *out[0].try_extract_array::<f32>()?.iter().next().unwrap();

Dependencies

This is only the wake-word DNN (15 KB graph + 836 KB external weights). To run end-to-end you also need the openWakeWord shared featurization graphs:

melspectrogram.onnx (~1.1 MB) — converts 16 kHz audio → 32-bin mel frames
embedding_model.onnx (~1.3 MB) — Google Speech Embedding model, mel frames → 96-dim embeddings

Both are bundled in the openwakeword pip package (pip install openwakeword) and downloaded lazily on first use. They're also Apache-2.0.

If you can't pull from pip, the same two ONNX files can be downloaded directly from the openWakeWord repo.

Tuning

Defaults from in-room dogfooding:

Setting	Default	Notes
`threshold`	0.30	Per-frame cutoff after smoothing. The 0% FPR floor leaves room down to ~0.10.
`smooth_frames`	3	Running average over last 3 raw scores (~240 ms). Damps single-spike false fires.
`gain`	1.0	Pre-multiply input samples. Bump to 1.5–2.5 if your mic / AEC quiets you.
`refractory_ms`	1500	Lockout after a hit to prevent re-firing on the same utterance.

If hits feel laggy: shrink smooth_frames to 1–2. If false fires sneak in: raise threshold to 0.5 or grow smooth_frames to 5.
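The threshold, smooth_frames, and refractory_ms settings above can be combined into one small post-processing step. This is a sketch under stated assumptions rather than the reference implementation: the WakeGate class, its update() method, and the frame_ms parameter are hypothetical names, one score is assumed to arrive every 80 ms, and missing history at start-up is treated as zero. Input gain is not covered here.

```python
from collections import deque


class WakeGate:
    """Hypothetical helper: turn raw per-chunk scores into discrete wake events."""

    def __init__(self, threshold=0.30, smooth_frames=3, refractory_ms=1500, frame_ms=80):
        self.threshold = threshold
        self.recent = deque(maxlen=smooth_frames)
        # Number of whole frames to stay locked out after a hit (rounded up).
        self.refractory_frames = -(-refractory_ms // frame_ms)
        self.cooldown = 0

    def update(self, raw_score):
        """Feed one raw score; return True only on the frame where a wake fires."""
        self.recent.append(raw_score)
        if self.cooldown > 0:
            self.cooldown -= 1
            return False
        # Running average over the last `smooth_frames` scores (missing ones count as 0).
        smoothed = sum(self.recent) / self.recent.maxlen
        if smoothed >= self.threshold:
            self.cooldown = self.refractory_frames
            self.recent.clear()
            return True
        return False


gate = WakeGate(threshold=0.5)
print([gate.update(s) for s in [0.0, 0.0, 0.99, 0.0, 0.0]])
# [False, False, False, False, False]: one isolated spike averages to 0.33
```

Note that with the default threshold of 0.30 and smooth_frames of 3, a single spike near 0.99 still averages to about 0.33 and fires, so raising either setting is what suppresses isolated spikes.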

Training

Architecture: openWakeWord DNN, 1 hidden block of 128 units, sigmoid output (binary classifier).
Featurization: 16-frame stack of 96-dim Google Speech Embeddings (Google's speech_embedding/1), unchanged from upstream.
Dataset: lightsofapollo/virgil-wake-word (currently private; may be made public; reproducibility recipes in planning folder). 40,592 clips total. Positives are all synthetic (OmniVoice + Piper TTS across diverse voices/accents/styles). Negatives are a curated set covering 5 confounder classes (Class A: vir- onset; Class B: proper-noun neighbors; Class C: confounder+verb bigrams; Class D: embedding noise; Class E: actually-Virgil-but-not-the-wake) plus broad LibriSpeech background speech.
Augmentation: RIR + MUSAN noise overlays applied during training via openwakeword's standard pipeline.
Training run: ~13.5 min on a single RTX PRO 6000 S, batch size 64, 31,750 steps, AdamW.
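As a sanity check, the reported parameter count can be reproduced from the architecture description. The sketch below assumes a layer layout that is not spelled out in this card: a flattened 16 x 96 input, an input Linear + LayerNorm, one hidden block of Linear + LayerNorm, and a final Linear to a single sigmoid unit. The function names are illustrative.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layernorm_params(n):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


def dnn_param_count(n_frames=16, embed_dim=96, hidden=128, n_blocks=1):
    """Parameter count under the assumed layout (see lead-in)."""
    total = linear_params(n_frames * embed_dim, hidden) + layernorm_params(hidden)
    total += n_blocks * (linear_params(hidden, hidden) + layernorm_params(hidden))
    total += linear_params(hidden, 1)  # single output unit before the sigmoid
    return total


count = dnn_param_count()
print(count)                          # 213889
print(f"{count * 4 / 1024:.1f} KiB")  # 835.5 KiB as float32
```

Under these assumptions the count matches the 213,889 parameters listed above, and the float32 size is consistent with the 836 KB external-data file.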

The metrics.csv file in this repo contains per-step val_recall / val_accuracy / val_fp from the training run.

Files

File	Size	Purpose
`hey-virgil-v1.onnx`	15 KB	The wake DNN graph (external-data format)
`hey-virgil-v1.onnx.data`	836 KB	External weights — must live alongside the .onnx
`hey-virgil-v1.pt`	840 KB	Original PyTorch checkpoint (for retraining / fine-tuning)
`metrics.csv`	5 KB	Per-step training metrics
`training-config.json`	1 KB	Hparams, dataset revision, commit sha
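Because the graph and its weights are separate files, a missing sidecar is a common cause of load failures. A minimal pre-flight check might look like the following; sidecar_path and check_model_files are hypothetical helper names, and the only assumption is the naming rule from the table above (the weights file is the graph file name plus ".data", in the same directory).

```python
from pathlib import Path


def sidecar_path(onnx_path):
    """Return the expected external-data file next to the .onnx graph."""
    p = Path(onnx_path)
    return p.with_name(p.name + ".data")


def check_model_files(onnx_path):
    """Raise FileNotFoundError if the graph or its weights sidecar is missing."""
    p = Path(onnx_path)
    missing = [str(f) for f in (p, sidecar_path(p)) if not f.is_file()]
    if missing:
        raise FileNotFoundError("missing model file(s): " + ", ".join(missing))
    return p


print(sidecar_path("models/hey-virgil-v1.onnx").name)  # hey-virgil-v1.onnx.data
```

Calling check_model_files() before creating the inference session turns an opaque runtime error into a message that names the missing file.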

Citation

If you use this in research, please cite openWakeWord:

@software{openwakeword,
  author = {David Scripka},
  title = {openWakeWord: A library for training open-source wake word models},
  year = {2024},
  url = {https://github.com/dscripka/openWakeWord}
}

License

Apache-2.0. See LICENSE in this repo. Featurization graphs (melspec + Google speech-embedding) inherit their original Apache-2.0 license from openWakeWord / Google.
