hey-virgil: wake-word detector

A small ONNX wake-word classifier for the phrase "hey virgil", trained from scratch on top of openWakeWord's shared feature extractors. Built for always-listening voice assistants that need low CPU use, no network round-trip, and few false fires.

Property | Value
--- | ---
Phrase | hey virgil
Architecture | openWakeWord DNN (1 hidden block, 128-unit) on top of Google Speech Embeddings
Parameters | 213,889
Sample rate | 16 kHz mono int16
Window | 2.0 s (16 stacked 96-dim embeddings)
Format | ONNX with external-data weights (15 KB graph + 836 KB .onnx.data)
Trained on | 40,592 clips (11,800 positives + 28,792 negatives): all-synthetic positives (OmniVoice + Piper TTS), negatives including LibriSpeech speech, with RIR/MUSAN noise overlays
License | Apache-2.0

Performance

Held-out test set (1,011 clips, 200 positives + 811 negatives, fully synthetic + adversarial confounders):

Metric | Value
--- | ---
Recall (TPR) | 68.5%
False positives / hr | 0.0
Validation accuracy | 84.2%

Manual spot-check via the wake-ort-probe Rust runner over 10 known-good "hey virgil" positives + 10 hand-curated confounders (virtual, virginia, vergil, the verge, virtuous, plus LibriSpeech speech + MUSAN noise):

Threshold | TPR | FPR
--- | --- | ---
0.50 (default) | 6/10 | 0/10
0.30 (recommended for live use with smoothing) | 6/10 | 0/10

The 4 missed positives in the spot-check were OmniVoice synthesis failures (noise-only WAVs), not real model misses. The 60% TPR on this stress set therefore understates real-world performance; the hand-curated WAVs were also chosen to be adversarial.

The 0% FPR margin is very wide: confounders top out at ~0.002 confidence, while real wake hits land at ~0.99. There's room to drop the threshold significantly if you want softer wake responsiveness (see "Tuning" below).
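To see why both threshold rows in the spot-check come out identical, the tally can be sketched as below. The scores are illustrative placeholders, not measured values: they only follow the rough levels quoted above (hits near 0.99, confounders at or below about 0.002), and the four noise-only positives are assumed to score near zero.

```python
def tally(pos_scores, neg_scores, threshold):
    """Return (true positives, false positives) at a given score cutoff."""
    tp = sum(s >= threshold for s in pos_scores)
    fp = sum(s >= threshold for s in neg_scores)
    return tp, fp

# Illustrative placeholder scores (assumed, not measured).
positives = [0.99] * 6 + [0.001] * 4   # 6 clean hits, 4 noise-only WAVs
confounders = [0.002] * 10             # "virtual", "virginia", ...

for threshold in (0.50, 0.30, 0.10):
    tp, fp = tally(positives, confounders, threshold)
    print(f"threshold={threshold:.2f}  "
          f"TPR={tp}/{len(positives)}  FPR={fp}/{len(confounders)}")
```

With a gap this wide between the two score clusters, any cutoff between them yields the same counts, which is why lowering the threshold mostly affects borderline real-world audio rather than this spot-check.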

Quick start (Python)

from huggingface_hub import hf_hub_download
from openwakeword.model import Model

model_path = hf_hub_download(
    "littlebearlabs/hey-virgil-wake-word",
    "hey-virgil-v1.onnx",
)
# IMPORTANT: also pull the external-data sidecar so ort can load weights
hf_hub_download(
    "littlebearlabs/hey-virgil-wake-word",
    "hey-virgil-v1.onnx.data",
)

wake = Model(wakeword_models=[model_path], inference_framework="onnx")

# Score a buffer of 16 kHz int16 PCM samples (mono)
import soundfile as sf

samples, sr = sf.read("hey_virgil.wav", dtype="int16")
assert sr == 16000, "model expects 16 kHz audio"
assert samples.ndim == 1, "model expects mono audio"
scores = wake.predict(samples)
print(scores)  # e.g. {"hey_virgil_v1": 0.998}

Streaming use is identical to any other openWakeWord model: feed audio in 80 ms chunks (1,280 samples at 16 kHz) via Model.predict() and threshold on the returned score.
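A minimal sketch of that chunked loop follows. The helper names (iter_chunks, stream_scores) are hypothetical and not part of openWakeWord. The predictor is passed in as a plain callable so the chunking logic can be tried without a model, and the dictionary key must match whatever Model.predict() actually returns for this file.

```python
CHUNK = 1280  # 80 ms at 16 kHz

def iter_chunks(samples, chunk=CHUNK):
    """Yield consecutive fixed-size chunks, dropping a short trailing remainder."""
    for start in range(0, len(samples) - chunk + 1, chunk):
        yield samples[start:start + chunk]

def stream_scores(predict, samples, key):
    """Call predict() once per chunk and collect the score stored under key."""
    return [predict(chunk)[key] for chunk in iter_chunks(samples)]

# Usage with the model from the quick start (key taken from its example output):
#   scores = stream_scores(wake.predict, samples, "hey_virgil_v1")
#   fired = any(score >= 0.5 for score in scores)
```

In a live setting the chunks would come from a microphone callback rather than a pre-loaded buffer, but the per-chunk call pattern is the same.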

Quick start (Rust, ort)

The reference Rust integration uses ort directly with openWakeWord's bundled featurization graphs. See wake-ort-probe for a complete CLI that loads the 3-stage chain (melspec → embedding → wake DNN) and exposes both batch and live-mic modes.

use ort::session::{builder::GraphOptimizationLevel, Session};
use ort::value::Tensor;
use ndarray::{Array2, Array3, Array4, Axis};

// `mut` is needed on recent ort 2.0 release candidates, where Session::run takes &mut self
let mut wake = Session::builder()?
    .with_optimization_level(GraphOptimizationLevel::Level3)?
    .commit_from_file("hey-virgil-v1.onnx")?;  // .onnx.data must be in the same dir

// ... feed via the openwakeword feature pipeline (mel + embed sessions also needed) ...
let out = wake.run(ort::inputs!["x" => Tensor::from_array(stacked_embeddings)?])?;
let confidence: f32 = *out[0].try_extract_array::<f32>()?.iter().next().unwrap();

Dependencies

This is only the wake-word DNN (15 KB graph + 836 KB external weights). To run end-to-end you also need the openWakeWord shared featurization graphs:

  • melspectrogram.onnx (~1.1 MB): converts 16 kHz audio → 32-bin mel frames
  • embedding_model.onnx (~1.3 MB): Google Speech Embedding model, mel frames → 96-dim embeddings

Both are bundled in the openwakeword pip package (pip install openwakeword) and downloaded lazily on first use. They're also Apache-2.0.

If you can't pull from pip, the same two ONNX files can be downloaded directly from the openWakeWord repo.

Tuning

Defaults from in-room dogfooding:

Setting | Default | Notes
--- | --- | ---
threshold | 0.30 | Per-frame cutoff after smoothing. The 0% FPR floor leaves room down to ~0.10.
smooth_frames | 3 | Running average over last 3 raw scores (~240 ms). Damps single-spike false fires.
gain | 1.0 | Pre-multiply input samples. Bump to 1.5–2.5 if your mic / AEC quiets you.
refractory_ms | 1500 | Lockout after a hit to prevent re-firing on the same utterance.

If hits feel laggy: shrink smooth_frames to 1–2. If false fires sneak in: raise threshold to 0.5 or grow smooth_frames to 5.
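How these four settings interact can be sketched as a small post-processor. This is an illustrative reading of the table, not the code used by wake-ort-probe. The names WakeGate and apply_gain are hypothetical, one raw score per 80 ms frame is assumed, and the running average during warm-up is taken over however many scores have arrived so far.

```python
from collections import deque

FRAME_MS = 80  # assumed: one raw score per 80 ms chunk

def apply_gain(samples, gain=1.0):
    """Pre-multiply int16 samples, clipping to the int16 range."""
    return [max(-32768, min(32767, round(s * gain))) for s in samples]

class WakeGate:
    """Smooth raw scores, apply a threshold, and lock out after each hit."""

    def __init__(self, threshold=0.30, smooth_frames=3, refractory_ms=1500):
        self.threshold = threshold
        self.recent = deque(maxlen=smooth_frames)
        self.refractory_ms = refractory_ms
        self.lockout_ms = 0

    def update(self, raw_score):
        """Feed one raw score; return True only on the frame a wake fires."""
        self.recent.append(raw_score)
        if self.lockout_ms > 0:
            # Still inside the refractory window: count it down, never fire.
            self.lockout_ms = max(0, self.lockout_ms - FRAME_MS)
            return False
        smoothed = sum(self.recent) / len(self.recent)
        if smoothed >= self.threshold:
            self.lockout_ms = self.refractory_ms
            return True
        return False

gate = WakeGate()
print([gate.update(s) for s in [0.0, 0.0, 0.99, 0.99, 0.99]])
```

Because the lockout is counted in whole frames, a 1500 ms refractory period suppresses 19 frames (1520 ms) in this sketch. A lone raw score of 0.6 between silent frames averages to 0.2 and does not fire at the default threshold.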

Training

  • Architecture: openWakeWord DNN, 1 hidden block of 128 units, sigmoid output (binary classifier).
  • Featurization: 16-frame stack of 96-dim Google Speech Embeddings (Google's speech_embedding/1), unchanged from upstream.
  • Dataset: lightsofapollo/virgil-wake-word (currently private, may be made public; reproducibility recipes in planning folder). 40,592 clips total: all-synthetic positives (OmniVoice + Piper TTS across diverse voices/accents/styles) plus a curated negative set covering 5 confounder classes (Class A vir- onset, Class B proper-noun neighbors, Class C confounder+verb bigrams, Class D embedding noise, Class E actually-Virgil-but-not-the-wake) and broad LibriSpeech background.
  • Augmentation: RIR + MUSAN noise overlays applied during training via openwakeword's standard pipeline.
  • Training run: ~13.5 min on a single RTX PRO 6000 S, batch size 64, 31,750 steps, AdamW.
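As a sanity check on the reported size, the parameter count can be reproduced with a few lines of arithmetic. The layer layout below is an assumption, since the card only states one 128-unit hidden block and a sigmoid output: a flattened 16×96 input, a Linear + LayerNorm input layer, one Linear + LayerNorm hidden block, and a single-output Linear layer. That this layout lands on the stated total is suggestive, not a confirmation of the exact graph.

```python
def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def layer_norm(n):
    """Scale and shift vectors of a LayerNorm over n features."""
    return 2 * n

FRAMES, EMB, HIDDEN = 16, 96, 128

params = (
    linear(FRAMES * EMB, HIDDEN) + layer_norm(HIDDEN)  # assumed input layer
    + linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN)      # assumed hidden block
    + linear(HIDDEN, 1)                                # single sigmoid output
)
size_kib = params * 4 / 1024  # assuming float32 weights

print(params)              # 213889
print(round(size_kib, 1))  # 835.5, close to the 836 KB .onnx.data sidecar
```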

The metrics.csv file in this repo contains per-step val_recall / val_accuracy / val_fp from the training run.
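One way to use that file is to pick the step with the best recall among steps that stayed at zero false positives. The sketch below assumes the CSV has a header row with val_recall and val_fp columns as named above; the step column name and the sample rows are hypothetical stand-ins, not values from the actual training run.

```python
import csv
import io

def best_step(csv_text, max_fp=0.0):
    """Return the row with the highest val_recall among rows with val_fp <= max_fp."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    ok = [r for r in rows if float(r["val_fp"]) <= max_fp]
    return max(ok, key=lambda r: float(r["val_recall"])) if ok else None

# Hypothetical rows for illustration only.
sample = (
    "step,val_recall,val_accuracy,val_fp\n"
    "100,0.50,0.80,2\n"
    "200,0.60,0.82,0\n"
    "300,0.70,0.85,1\n"
)
print(best_step(sample))

# With the real file:
#   best_step(open("metrics.csv", newline="").read())
```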

Files

File | Size | Purpose
--- | --- | ---
hey-virgil-v1.onnx | 15 KB | The wake DNN graph (external-data format)
hey-virgil-v1.onnx.data | 836 KB | External weights, must live alongside the .onnx
hey-virgil-v1.pt | 840 KB | Original PyTorch checkpoint (for retraining / fine-tuning)
metrics.csv | 5 KB | Per-step training metrics
training-config.json | 1 KB | Hparams, dataset revision, commit sha
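Since the graph cannot load without its sidecar, a quick pre-flight check can save a confusing runtime error. The helper below is a hypothetical convenience, not part of openWakeWord or onnxruntime; it only assumes the sidecar is named by appending .data to the graph's file name, as in the table above.

```python
from pathlib import Path

def missing_files(model_path):
    """Return whichever of the graph and its .data sidecar do not exist."""
    graph = Path(model_path)
    sidecar = graph.with_name(graph.name + ".data")
    return [p for p in (graph, sidecar) if not p.is_file()]

# Usage:
#   missing = missing_files("hey-virgil-v1.onnx")
#   if missing:
#       raise FileNotFoundError(f"missing model files: {missing}")
```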

Citation

If you use this in research, please cite openWakeWord:

@software{openwakeword,
  author = {David Scripka},
  title = {openWakeWord: A library for training open-source wake word models},
  year = {2024},
  url = {https://github.com/dscripka/openWakeWord}
}

License

Apache-2.0. See LICENSE in this repo. Featurization graphs (melspec + Google speech-embedding) inherit their original Apache-2.0 license from openWakeWord / Google.
