---
language:
  - en
  - fr
  - es
  - de
license: apache-2.0
library_name: transformers
pipeline_tag: audio-classification
tags:
  - whisper
  - audio-classification
  - telephony
  - answering-machine-detection
  - amd
  - speech-processing
  - real-time
  - generated_from_trainer
datasets:
  - AbijahKaj/telephony-amd-dataset
  - PolyAI/minds14
  - pipecat-ai/human_5_all
  - pipecat-ai/human_convcollector_1
  - pipecat-ai/smart-turn-data-v3.2-train
base_model: openai/whisper-tiny
metrics:
  - accuracy
  - f1
  - precision
  - recall
model-index:
  - name: whisper-telephony-amd
    results:
      - task:
          type: audio-classification
          name: Audio Classification
        dataset:
          name: telephony-amd-dataset
          type: AbijahKaj/telephony-amd-dataset
          split: test
        metrics:
          - type: accuracy
            value: 0.9875
            name: Accuracy
          - type: f1
            value: 0.99
            name: F1 (macro)
          - type: precision
            value: 0.99
            name: Precision (macro)
          - type: recall
            value: 0.99
            name: Recall (macro)
---

# Whisper Telephony AMD (Answering Machine Detection)

A real-time audio classifier that detects whether a telephony call is answered by a **human**, **voicemail**, **IVR system**, or **answering machine** — using Whisper's speech understanding to distinguish human-recorded voicemail greetings from live speech.

## Results

**98.75% accuracy** on 400 test samples with only 5 misclassifications:

```
                   precision    recall  f1-score   support

            human       1.00      0.99      1.00       114
        voicemail       0.96      0.99      0.98       102
              ivr       1.00      0.99      0.99        92
answering_machine       0.99      0.98      0.98        92

         accuracy                           0.99       400
        macro avg       0.99      0.99      0.99       400
     weighted avg       0.99      0.99      0.99       400
```

**Confusion Matrix** (rows = actual, columns = predicted):

|  | Human | Voicemail | IVR | Answering Machine |
|---|:---:|:---:|:---:|:---:|
| **Human** | 113 | 1 | 0 | 0 |
| **Voicemail** | 0 | 101 | 0 | 1 |
| **IVR** | 0 | 1 | 91 | 0 |
| **Answering Machine** | 0 | 2 | 0 | 90 |
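
The per-class figures in the classification report can be recomputed from the confusion matrix. A minimal sketch in plain Python, with the matrix copied from the table above (`per_class_metrics` is an illustrative helper, not part of this repository):

```python
LABELS = ["human", "voicemail", "ivr", "answering_machine"]

# Rows = actual class, columns = predicted class (copied from the table above).
CM = [
    [113, 1, 0, 0],
    [0, 101, 0, 1],
    [0, 1, 91, 0],
    [0, 2, 0, 90],
]


def per_class_metrics(cm):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    n = len(cm)
    metrics = []
    for i in range(n):
        true_pos = cm[i][i]
        actual = sum(cm[i])                          # row sum
        predicted = sum(cm[r][i] for r in range(n))  # column sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


total = sum(sum(row) for row in CM)
correct = sum(CM[i][i] for i in range(len(CM)))
accuracy = correct / total
metrics = per_class_metrics(CM)

print(f"accuracy: {accuracy:.4f} ({total - correct} errors out of {total})")
for name, (p, r, f1) in zip(LABELS, metrics):
    print(f"{name:>18}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")
```

Reading the columns rather than the rows shows where precision is lost: four of the five errors are samples predicted as `voicemail`.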

### Accuracy Per Epoch

| Epoch | Accuracy | Eval Loss | Per-Class |
|:-----:|:--------:|:---------:|-----------|
| 1 | **98.75%** | 0.0785 | human=99.1%, vm=99.0%, ivr=98.9%, am=97.8% |
| 2 | 95.75% | 0.1473 | human=94.7%, vm=93.1%, ivr=97.8%, am=97.8% |
| 3 | 98.25% | 0.0779 | human=97.4%, vm=100%, ivr=97.8%, am=97.8% |
| 4 | **98.75%** | **0.0415** | human=99.1%, vm=99.0%, ivr=98.9%, am=97.8% |
| 5 | 98.75% | 0.0569 | human=99.1%, vm=98.0%, ivr=98.9%, am=98.9% |
| 6 | 98.00% | 0.0539 | human=97.4%, vm=99.0%, ivr=97.8%, am=97.8% |

Early stopping (patience=5) triggered after epoch 6. The epoch 4 checkpoint, which has the lowest eval loss and ties for the best accuracy, was loaded as the final model.
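
The card does not state which metric early stopping monitored. As a sketch, the replay below applies a patience counter to the per-epoch table under two assumptions: an epoch counts as an improvement only when it strictly beats the best value so far, and the counter is checked once per epoch. `first_stop_epoch` is an illustrative helper, not code from `train_local.py`.

```python
# (epoch, accuracy, eval_loss) copied from the per-epoch table.
HISTORY = [
    (1, 0.9875, 0.0785),
    (2, 0.9575, 0.1473),
    (3, 0.9825, 0.0779),
    (4, 0.9875, 0.0415),
    (5, 0.9875, 0.0569),
    (6, 0.9800, 0.0539),
]


def first_stop_epoch(values, patience, higher_is_better):
    """Return the epoch at which patience runs out, or None if it never does."""
    best = None
    stale = 0  # consecutive evaluations without a strict improvement
    for epoch, value in values:
        if best is None:
            improved = True
        elif higher_is_better:
            improved = value > best
        else:
            improved = value < best
        if improved:
            best, stale = value, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


stop_on_accuracy = first_stop_epoch(
    [(e, acc) for e, acc, _ in HISTORY], patience=5, higher_is_better=True
)
stop_on_loss = first_stop_epoch(
    [(e, loss) for e, _, loss in HISTORY], patience=5, higher_is_better=False
)
lowest_loss_epoch = min(HISTORY, key=lambda row: row[2])[0]

print("stop epoch when monitoring accuracy:", stop_on_accuracy)
print("stop epoch when monitoring eval loss:", stop_on_loss)
print("epoch with lowest eval loss:", lowest_loss_epoch)
```

Under these assumptions, monitoring accuracy stops after epoch 6 because epoch 1's 98.75% is matched but never strictly exceeded, while monitoring eval loss would not have stopped within six epochs. Epoch 4 has the lowest eval loss.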

## Model Details

| | |
|---|---|
| **Architecture** | WhisperForAudioClassification (Whisper-Tiny encoder + linear classifier) |
| **Base model** | [openai/whisper-tiny](https://hf.co/openai/whisper-tiny) |
| **Parameters** | 8.3M total, 7.2M trainable (conv layers frozen) |
| **Input** | 16kHz mono audio → 80-bin mel spectrogram (30s padded) |
| **Output** | 4 classes: `human`, `voicemail`, `ivr`, `answering_machine` |
| **Inference speed** | ~12ms CPU (ONNX int8), <5ms GPU |
| **Model size** | 31.7 MB (safetensors) |
| **Design reference** | Same architecture as [pipecat-ai/smart-turn-v3](https://hf.co/pipecat-ai/smart-turn-v3) |
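
The gap between total and trainable parameters can be sanity-checked with simple arithmetic. The sketch below assumes the standard Whisper-Tiny encoder dimensions (80 mel bins, hidden size 384, kernel size 3, 1,500 encoder positions) and that the encoder's positional embeddings are non-trainable, as in the Transformers implementation. These are assumptions, not values read from this repository's `config.json`.

```python
# Assumed Whisper-Tiny encoder dimensions (not read from config.json).
N_MELS = 80
D_MODEL = 384
KERNEL_SIZE = 3
MAX_SOURCE_POSITIONS = 1500


def conv1d_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output channel."""
    return in_channels * out_channels * kernel_size + out_channels


conv1 = conv1d_params(N_MELS, D_MODEL, KERNEL_SIZE)
conv2 = conv1d_params(D_MODEL, D_MODEL, KERNEL_SIZE)
positions = MAX_SOURCE_POSITIONS * D_MODEL  # assumed non-trainable
non_trainable = conv1 + conv2 + positions

print(f"conv layers:           {conv1 + conv2:,}")
print(f"positional embeddings: {positions:,}")
print(f"non-trainable total:   {non_trainable / 1e6:.2f}M")
```

The result is about 1.1M, which matches the difference between the 8.3M total and the 7.2M trainable parameters listed in the table.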

## Quick Start

### Pipeline (simplest)
```python
from transformers import pipeline

classifier = pipeline("audio-classification", model="AbijahKaj/whisper-telephony-amd")
result = classifier("phone_call.wav")
print(result)
# [{'score': 0.98, 'label': 'human'}, {'score': 0.01, 'label': 'voicemail'}, ...]
```

### Manual Inference
```python
from transformers import WhisperForAudioClassification, AutoFeatureExtractor
import torch

model = WhisperForAudioClassification.from_pretrained("AbijahKaj/whisper-telephony-amd")
fe = AutoFeatureExtractor.from_pretrained("AbijahKaj/whisper-telephony-amd")

# audio_array: numpy array at 16kHz
inputs = fe(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
    pred = torch.argmax(logits, dim=-1).item()
    label = model.config.id2label[pred]  # id2label keys are ints once the config is loaded
    print(f"Predicted: {label}")
```

### Streaming Real-Time Inference
```python
from streaming_amd import StreamingAMDClassifier

classifier = StreamingAMDClassifier("AbijahKaj/whisper-telephony-amd")

for pcm_chunk in audio_stream:  # 160ms chunks @ 8kHz
    result = classifier.process_chunk(pcm_chunk)
    if result:
        label, confidence, elapsed_s = result
        print(f"{label} ({confidence:.0%}) after {elapsed_s:.1f}s")
        break
```
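
The streaming example consumes 8 kHz chunks, while the model input is 16 kHz. Whether `StreamingAMDClassifier` converts chunks internally is not shown here, so the following is only a sketch of that preprocessing step, assuming 16-bit little-endian mono PCM. `pcm16_to_float` and `upsample_2x` are hypothetical helpers, and linear interpolation stands in for a proper resampling filter.

```python
import struct

CHUNK_MS = 160
SOURCE_RATE = 8000    # telephony rate used in the streaming example
TARGET_RATE = 16000   # rate expected by the feature extractor


def pcm16_to_float(pcm_bytes):
    """Decode 16-bit little-endian PCM bytes into floats in [-1.0, 1.0)."""
    count = len(pcm_bytes) // 2
    ints = struct.unpack(f"<{count}h", pcm_bytes[: count * 2])
    return [value / 32768.0 for value in ints]


def upsample_2x(samples):
    """Double the sample rate by inserting the midpoint between neighbours."""
    if not samples:
        return []
    out = []
    for current, following in zip(samples, samples[1:]):
        out.append(current)
        out.append((current + following) / 2.0)
    # The last sample has no neighbour, so it is repeated.
    out.extend([samples[-1], samples[-1]])
    return out


samples_per_chunk = SOURCE_RATE * CHUNK_MS // 1000
silence = struct.pack(f"<{samples_per_chunk}h", *([0] * samples_per_chunk))
chunk_16k = upsample_2x(pcm16_to_float(silence))

print(f"{samples_per_chunk} samples in, {len(chunk_16k)} samples out")
```

In production a band-limited resampler is the safer choice, since plain interpolation leaves imaging artifacts above 4 kHz.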

## Why Whisper?

Voicemail greetings are **recorded by real humans**, so they are acoustically almost indistinguishable from live speech. Traditional acoustic-only models (energy, pitch, spectral features) cannot reliably distinguish *"Hi, I'm not available, leave a message"* from *"Hello? Who's calling?"*.

Whisper's encoder was pre-trained on 680K hours of speech, so its representations capture **what is being said**, not just how it sounds. This linguistic information is critical for AMD.

## Training

### Dataset

[AbijahKaj/telephony-amd-dataset](https://hf.co/datasets/AbijahKaj/telephony-amd-dataset): **8,264 train / 400 test** samples, roughly balanced across 4 classes (about 2,000 training samples each).

**Data sources:**

| Class | Train samples | Sources |
|-------|-------|---------|
| Human | 2,151 | [PolyAI/minds14](https://hf.co/datasets/PolyAI/minds14) (real telephony callers, 6 languages), [pipecat-ai/human_5_all](https://hf.co/datasets/pipecat-ai/human_5_all), [pipecat-ai/human_convcollector_1](https://hf.co/datasets/pipecat-ai/human_convcollector_1), original edge-tts |
| Voicemail | 2,078 | [pipecat-ai smart-turn rime_2](https://hf.co/datasets/pipecat-ai/smart-turn-data-v3.2-train) TTS (personal greeting style), original edge-tts |
| IVR | 2,017 | [pipecat-ai smart-turn chirp3](https://hf.co/datasets/pipecat-ai/smart-turn-data-v3.2-train) TTS (automated system style), original edge-tts |
| Answering Machine | 2,018 | [pipecat-ai smart-turn orpheus](https://hf.co/datasets/pipecat-ai/smart-turn-data-v3.2-train) TTS (machine greeting style), original edge-tts |

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning rate | 1e-4 |
| Scheduler | Cosine with 25 warmup steps |
| Batch size | 32 |
| Gradient accumulation | 1 |
| Max epochs | 20 (early stopped at 6) |
| Weight decay | 0.01 |
| Precision | FP16 |
| Gradient checkpointing | Enabled |
| Freeze strategy | Conv layers frozen, transformer layers + head trainable |
| Early stopping patience | 5 |
| Max audio length | 10s (truncated, padded to 30s for Whisper) |
| Hardware | Tesla T4 (16GB VRAM) |
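
For readers reproducing the run with the Transformers `Trainer`, the table maps onto `TrainingArguments` keyword names roughly as follows. This is a sketch, not the contents of `train_local.py`: the per-epoch evaluation and checkpointing settings are assumptions inferred from the per-epoch results, the `output_dir` in the usage comment is a placeholder, and the 10 s truncation and layer freezing are handled outside these arguments.

```python
# Keyword arguments for transformers.TrainingArguments, taken from the table above.
training_kwargs = {
    "learning_rate": 1e-4,
    "lr_scheduler_type": "cosine",
    "warmup_steps": 25,
    "per_device_train_batch_size": 32,
    "gradient_accumulation_steps": 1,
    "num_train_epochs": 20,
    "weight_decay": 0.01,
    "fp16": True,
    "gradient_checkpointing": True,
    # Assumptions (not listed in the table): evaluate and checkpoint every epoch,
    # then reload the best checkpoint when training ends.
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
    "load_best_model_at_end": True,
}

EARLY_STOPPING_PATIENCE = 5
MAX_AUDIO_SECONDS = 10
SAMPLING_RATE = 16000
max_samples = MAX_AUDIO_SECONDS * SAMPLING_RATE  # clips are cut to this length

# Usage (needs transformers installed and a CUDA device for fp16):
# from transformers import TrainingArguments, EarlyStoppingCallback
# args = TrainingArguments(output_dir="amd-run", **training_kwargs)
# callbacks = [EarlyStoppingCallback(early_stopping_patience=EARLY_STOPPING_PATIENCE)]
```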

### Framework Versions

- Transformers 5.7.0
- PyTorch 2.11.0+cu130
- Datasets 4.8.5
- Tokenizers 0.22.2

## Classes

| Label | ID | Description | Example |
|-------|-----|------------|---------|
| `human` | 0 | Live person on the phone | *"Hello? Yes, who is this?"* |
| `voicemail` | 1 | Personal voicemail greeting | *"Hi, you've reached John. Leave a message after the beep."* |
| `ivr` | 2 | IVR system / automated menu | *"Press 1 for sales, press 2 for support..."* |
| `answering_machine` | 3 | Carrier/generic automated message | *"The number you have dialed is not available..."* |
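
The ID column is what the classifier head emits, so turning raw logits into a label and a confidence is a softmax followed by a lookup. A minimal sketch in plain Python, with the mapping copied from the table above. `decode` is an illustrative helper and the logits are made up.

```python
import math

# Mapping copied from the Classes table.
ID2LABEL = {0: "human", 1: "voicemail", 2: "ivr", 3: "answering_machine"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


label, confidence = decode([4.0, 0.1, 0.1, 0.1])  # made-up logits
print(f"{label} ({confidence:.0%})")
```

A caller can compare the returned probability against a threshold before acting on the label, which is useful when hanging up on a live human is costly.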

## Limitations

- Trained primarily on English, French, Spanish, and German audio
- TTS-generated non-human classes may not fully represent all real-world telephony systems
- Audio is truncated to 10 seconds during training, so predictions are most reliable on the first 10 seconds of a call
- Not tested on noisy cellular connections or VoIP codec artifacts beyond telephony bandpass (300-3400 Hz)
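- The model may confuse voicemail greetings with answering machine messages in edge cases (3 of the 5 test-set errors: 2 answering machine messages predicted as voicemail, 1 voicemail greeting predicted as answering machine)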

## Files

- `model.safetensors` — Model weights (31.7MB)
- `config.json` — Model configuration
- `preprocessor_config.json` — Feature extractor config
- `streaming_amd.py` — Streaming real-time inference module
- `train_local.py` — Training script (CLI args, RTX 5090 ready)

## Citation

```bibtex
@misc{whisper-telephony-amd,
  author = {AbijahKaj},
  title = {Whisper Telephony AMD: Real-Time Answering Machine Detection},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/AbijahKaj/whisper-telephony-amd}
}
```