---
language:
- en
- fr
- es
- de
license: apache-2.0
library_name: transformers
pipeline_tag: audio-classification
tags:
- whisper
- audio-classification
- telephony
- answering-machine-detection
- amd
- speech-processing
- real-time
- generated_from_trainer
datasets:
- AbijahKaj/telephony-amd-dataset
- PolyAI/minds14
- pipecat-ai/human_5_all
- pipecat-ai/human_convcollector_1
- pipecat-ai/smart-turn-data-v3.2-train
base_model: openai/whisper-tiny
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: whisper-telephony-amd
  results:
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: telephony-amd-dataset
      type: AbijahKaj/telephony-amd-dataset
      split: test
    metrics:
    - type: accuracy
      value: 0.9875
      name: Accuracy
    - type: f1
      value: 0.99
      name: F1 (macro)
    - type: precision
      value: 0.99
      name: Precision (macro)
    - type: recall
      value: 0.99
      name: Recall (macro)
---

# Whisper Telephony AMD (Answering Machine Detection)

A real-time audio classifier that detects whether a telephony call is answered by a **human**, **voicemail**, **IVR system**, or **answering machine**. It uses Whisper's speech understanding to distinguish human-recorded voicemail greetings from live speech.

## Results

**98.75% accuracy** on 400 test samples with only 5 misclassifications:

```
                   precision    recall  f1-score   support

            human       1.00      0.99      1.00       114
        voicemail       0.96      0.99      0.98       102
              ivr       1.00      0.99      0.99        92
answering_machine       0.99      0.98      0.98        92

         accuracy                           0.99       400
        macro avg       0.99      0.99      0.99       400
     weighted avg       0.99      0.99      0.99       400
```

**Confusion Matrix** (rows = actual, columns = predicted):

| | Human | Voicemail | IVR | Answering Machine |
|---|:---:|:---:|:---:|:---:|
| **Human** | 113 | 1 | 0 | 0 |
| **Voicemail** | 0 | 101 | 0 | 1 |
| **IVR** | 0 | 1 | 91 | 0 |
| **Answering Machine** | 0 | 2 | 0 | 90 |

### Accuracy Per Epoch

| Epoch | Accuracy | Eval Loss | Per-Class |
|:-----:|:--------:|:---------:|-----------|
| 1 | **98.75%** | 0.0785 | human=99.1%, vm=99.0%, ivr=98.9%, am=97.8% |
| 2 | 95.75% | 0.1473 | human=94.7%, vm=93.1%, ivr=97.8%, am=97.8% |
| 3 | 98.25% | 0.0779 | human=97.4%, vm=100%, ivr=97.8%, am=97.8% |
| 4 | **98.75%** | **0.0415** | human=99.1%, vm=99.0%, ivr=98.9%, am=97.8% |
| 5 | 98.75% | 0.0569 | human=99.1%, vm=98.0%, ivr=98.9%, am=98.9% |
| 6 | 98.00% | 0.0539 | human=97.4%, vm=99.0%, ivr=97.8%, am=97.8% |

Early stopping triggered after epoch 6 (patience=5, best at epoch 4). Best model loaded from epoch 4 checkpoint.

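The headline numbers in this section can be re-derived from the confusion matrix alone. The following is a minimal sketch in plain Python; the matrix values are copied from the table above, while the helper name `per_class_metrics` and the variable names are illustrative and not part of the model repository.

```python
# Confusion matrix copied from the Results section (rows = actual, columns = predicted).
LABELS = ["human", "voicemail", "ivr", "answering_machine"]
CONFUSION = [
    [113, 1, 0, 0],
    [0, 101, 0, 1],
    [0, 1, 91, 0],
    [0, 2, 0, 90],
]


def per_class_metrics(matrix):
    """Return {class_index: (precision, recall, f1)} for a square confusion matrix."""
    n = len(matrix)
    metrics = {}
    for k in range(n):
        true_pos = matrix[k][k]
        predicted_k = sum(matrix[row][k] for row in range(n))  # column sum
        actual_k = sum(matrix[k])                              # row sum
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[k] = (precision, recall, f1)
    return metrics


total = sum(sum(row) for row in CONFUSION)
correct = sum(CONFUSION[k][k] for k in range(len(CONFUSION)))
errors = total - correct
accuracy = correct / total

metrics = per_class_metrics(CONFUSION)
macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(metrics)

print(f"accuracy = {accuracy:.4f} ({errors} errors out of {total})")
for k, (p, r, f1) in metrics.items():
    print(f"{LABELS[k]:<18} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro F1 = {macro_f1:.2f}")
```

Running it reproduces the 98.75% accuracy, the 5 misclassifications, and the per-class values in the classification report (for example, voicemail precision is 101/105, which rounds to 0.96).
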
## Model Details

| | |
|---|---|
| **Architecture** | WhisperForAudioClassification (Whisper-Tiny encoder + linear classifier) |
| **Base model** | [openai/whisper-tiny](https://hf.co/openai/whisper-tiny) |
| **Parameters** | 8.3M total, 7.2M trainable (conv layers frozen) |
| **Input** | 16 kHz mono audio → 80-bin mel spectrogram (padded to 30 s) |
| **Output** | 4 classes: `human`, `voicemail`, `ivr`, `answering_machine` |
| **Inference speed** | ~12 ms CPU (ONNX int8), <5 ms GPU |
| **Model size** | 31.7 MB (safetensors) |
| **Design reference** | Same architecture as [pipecat-ai/smart-turn-v3](https://hf.co/pipecat-ai/smart-turn-v3) |

## Quick Start

### Pipeline (simplest)

```python
from transformers import pipeline

classifier = pipeline("audio-classification", model="AbijahKaj/whisper-telephony-amd")
result = classifier("phone_call.wav")
print(result)
# [{'score': 0.98, 'label': 'human'}, {'score': 0.01, 'label': 'voicemail'}, ...]
```

### Manual Inference

```python
import librosa
import torch
from transformers import AutoFeatureExtractor, WhisperForAudioClassification

model = WhisperForAudioClassification.from_pretrained("AbijahKaj/whisper-telephony-amd")
fe = AutoFeatureExtractor.from_pretrained("AbijahKaj/whisper-telephony-amd")

# audio_array: 1-D float numpy array, mono, sampled at 16 kHz.
# librosa is one way to load it (an assumption; any loader that yields 16 kHz mono works).
audio_array, _ = librosa.load("phone_call.wav", sr=16000, mono=True)

inputs = fe(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

pred = torch.argmax(logits, dim=-1).item()
label = model.config.id2label[pred]  # id2label keys are ints once the config is loaded
print(f"Predicted: {label}")
```

### Streaming Real-Time Inference

```python
from streaming_amd import StreamingAMDClassifier  # streaming_amd.py ships with this repo

classifier = StreamingAMDClassifier("AbijahKaj/whisper-telephony-amd")

# audio_stream: caller-supplied iterable of PCM chunks (160 ms chunks @ 8 kHz)
for pcm_chunk in audio_stream:
    result = classifier.process_chunk(pcm_chunk)
    if result:  # a result is returned once the classifier has reached a decision
        label, confidence, elapsed_s = result
        print(f"{label} ({confidence:.0%}) after {elapsed_s:.1f}s")
        break
```

## Why Whisper?

Voicemail greetings are **recorded by real humans**, so they are acoustically very close to live speech. Traditional acoustic-only models (energy, pitch, spectral features) cannot reliably distinguish *"Hi, I'm not available, leave a message"* from *"Hello? Who's calling?"*.

Whisper's encoder was pre-trained on 680K hours of speech and understands **what is being said**, not just how it sounds. This semantic understanding is critical for AMD.

## Training

### Dataset

[AbijahKaj/telephony-amd-dataset](https://hf.co/datasets/AbijahKaj/telephony-amd-dataset): **8,264 train / 400 test** samples, balanced across 4 classes (~2,000 each).

**Data sources:**

| Class | Count (train) | Sources |
|-------|-------|---------|
| Human | 2,151 | [PolyAI/minds14](https://hf.co/datasets/PolyAI/minds14) (real telephony callers, 6 languages), [pipecat-ai/human_5_all](https://hf.co/datasets/pipecat-ai/human_5_all), [pipecat-ai/human_convcollector_1](https://hf.co/datasets/pipecat-ai/human_convcollector_1), original edge-tts |
| Voicemail | 2,078 | [pipecat-ai smart-turn rime_2](https://hf.co/datasets/pipecat-ai/smart-turn-data-v3.2-train) TTS (personal greeting style), original edge-tts |
| IVR | 2,017 | [pipecat-ai smart-turn chirp3](https://hf.co/datasets/pipecat-ai/smart-turn-data-v3.2-train) TTS (automated system style), original edge-tts |
| Answering Machine | 2,018 | [pipecat-ai smart-turn orpheus](https://hf.co/datasets/pipecat-ai/smart-turn-data-v3.2-train) TTS (machine greeting style), original edge-tts |

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning rate | 1e-4 |
| Scheduler | Cosine with 25 warmup steps |
| Batch size | 32 |
| Gradient accumulation | 1 |
| Max epochs | 20 (early stopped at 6) |
| Weight decay | 0.01 |
| Precision | FP16 |
| Gradient checkpointing | Enabled |
| Freeze strategy | Conv layers frozen, transformer layers + head trainable |
| Early stopping patience | 5 |
| Max audio length | 10 s (truncated, padded to 30 s for Whisper) |
| Hardware | Tesla T4 (16 GB VRAM) |

### Framework Versions

- Transformers 5.7.0
- PyTorch 2.11.0+cu130
- Datasets 4.8.5
- Tokenizers 0.22.2

## Classes

| Label | ID | Description | Example |
|-------|-----|------------|---------|
| `human` | 0 | Live person on the phone | *"Hello? Yes, who is this?"* |
| `voicemail` | 1 | Personal voicemail greeting | *"Hi, you've reached John. Leave a message after the beep."* |
| `ivr` | 2 | IVR system / automated menu | *"Press 1 for sales, press 2 for support..."* |
| `answering_machine` | 3 | Carrier/generic automated message | *"The number you have dialed is not available..."* |

## Limitations

- Trained primarily on English, French, Spanish, and German audio
- TTS-generated non-human classes may not fully represent all real-world telephony systems
- Best performance on the first 10 seconds of audio (training clips were truncated to 10 s)
- Not tested on noisy cellular connections or VoIP codec artifacts beyond telephony bandpass (300-3400 Hz)
- The model may confuse voicemail greetings with answering machine messages in edge cases (3 of the 5 test-set errors fall between these two classes: 2 answering machine messages predicted as voicemail, 1 voicemail predicted as answering machine)

## Files

- `model.safetensors`: Model weights (31.7 MB)
- `config.json`: Model configuration
- `preprocessor_config.json`: Feature extractor config
- `streaming_amd.py`: Streaming real-time inference module
- `train_local.py`: Training script (CLI args, RTX 5090 ready)

## Citation

```bibtex
@misc{whisper-telephony-amd,
  author = {AbijahKaj},
  title = {Whisper Telephony AMD: Real-Time Answering Machine Detection},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/AbijahKaj/whisper-telephony-amd}
}
```