# DF-Arena-1B – ONNX (Int8 Quantized)
This is the ONNX Int8 quantized version of DF-Arena-1B-V1, a 1-billion parameter deepfake audio detection model.
| Property | Value |
|---|---|
| Task | Audio deepfake detection (binary classification) |
| Labels | `bonafide` (genuine), `spoof` (AI-generated / replayed) |
| Base model | Wav2Vec2-XLS-R-1B + Conformer |
| Quantization | Dynamic Int8 (MatMul/Gemm layers) |
| Size | ~1.3 GB (vs. 4.3 GB FP32) |
| Input | Raw PCM waveform, 16 kHz, mono, float32 |
## Quickstart

### Install

```bash
pip install onnxruntime librosa numpy
```
### Inference

```python
import onnxruntime as ort
import librosa
import numpy as np

MAX_LEN = 64600  # ~4 seconds at 16 kHz

def preprocess(audio_path: str) -> np.ndarray:
    wav, _ = librosa.load(audio_path, sr=16000, mono=True)
    # Tile-repeat if shorter, truncate if longer
    if len(wav) < MAX_LEN:
        wav = np.tile(wav, MAX_LEN // len(wav) + 1)
    return wav[:MAX_LEN].astype(np.float32)[np.newaxis, :]  # (1, T)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

sess = ort.InferenceSession("df_arena_1b_quantized.onnx")
audio = preprocess("your_audio.mp3")

logits = sess.run(None, {"input_values": audio})[0]  # (1, 2)
probs = softmax(logits)[0]

label = "bonafide" if probs[1] > probs[0] else "spoof"
print(f"{label} – bonafide: {probs[1]:.2%}, spoof: {probs[0]:.2%}")
```
## Input / Output

### Input tensor – `input_values`

- Shape: `(batch, samples)` – dynamic batch and sequence length
- dtype: `float32`
- Sample rate: 16,000 Hz, mono
- Recommended length: 64,600 samples (≈ 4 s) – the training length
- No normalization required; feed raw PCM values directly

**Padding:** for clips shorter than 64,600 samples, tile-repeat the audio instead of zero-padding. This matches training behavior and preserves audio statistics.
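As a minimal sketch, the tile-repeat padding described above can be factored into a standalone helper (the function name is illustrative, not part of this package):

```python
import numpy as np

MAX_LEN = 64600  # 64,600 samples ≈ 4 s at 16 kHz

def tile_pad(wav: np.ndarray, target_len: int = MAX_LEN) -> np.ndarray:
    """Repeat the clip until it covers target_len, then truncate."""
    wav = np.asarray(wav, dtype=np.float32)
    if len(wav) < target_len:
        # Repeat enough whole copies to reach at least target_len
        wav = np.tile(wav, target_len // len(wav) + 1)
    return wav[:target_len]
```

Unlike zero-padding, this keeps the amplitude statistics of the padded region identical to those of the original clip.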
### Output tensor – `logits`

- Shape: `(batch, 2)` – `[spoof_logit, bonafide_logit]`
- Apply `softmax` to get probabilities
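Putting the two points above together, a small sketch of decoding a batch of logits (the index order follows the output spec; the helper name is illustrative):

```python
import numpy as np

LABELS = ("spoof", "bonafide")  # index 0 = spoof, index 1 = bonafide

def decode(logits) -> list:
    """Map (batch, 2) logits to (label, confidence) pairs via softmax."""
    logits = np.asarray(logits, dtype=np.float64)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    idx = probs.argmax(axis=-1)
    return [(LABELS[i], float(row[i])) for i, row in zip(idx, probs)]
```

For a single clip, `decode(sess.run(None, {"input_values": audio})[0])` would return a one-element list such as `[("bonafide", 0.97)]`.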
## Performance

Benchmarked on CPU (Apple M-series), single 4-second clip:

| Model | Size | Latency | vs PyTorch |
|---|---|---|---|
| PyTorch FP32 | 4.3 GB | ~9,000 ms | baseline |
| ONNX FP32 | 4.3 GB | ~1,400 ms | 6.4× faster |
| ONNX Int8 (this) | 1.3 GB | ~600 ms | 15× faster |
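Numbers like those above can be reproduced with a small timing harness along these lines (a sketch, not the exact benchmark script used; `run_once` would wrap the `sess.run(...)` call from the Quickstart):

```python
import time
import statistics

def benchmark(run_once, n_warmup: int = 3, n_runs: int = 10) -> float:
    """Median wall-clock latency of a zero-argument callable, in ms."""
    for _ in range(n_warmup):
        run_once()  # warm up caches and lazy initialization
    times_ms = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        run_once()
        times_ms.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(times_ms)
```

Usage would look like `benchmark(lambda: sess.run(None, {"input_values": audio}))`; taking the median over several runs after warm-up reduces noise from one-off system activity.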
Accuracy vs FP32 ONNX baseline:
- Cosine similarity: 0.9995
- Classification agreement: 100% on all tested samples
## Quantization Details
Dynamic Int8 quantization was applied to MatMul and Gemm operators (all Linear / Attention projection layers), which account for >95% of the model's weights. Convolution layers in the Wav2Vec2 feature encoder are kept in FP32 to preserve audio feature extraction quality.
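For reference, a quantization of this shape can be produced with ONNX Runtime's dynamic quantizer. This is a sketch under the description above, not the exact command used to build this artifact; the file paths are placeholders:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize only MatMul/Gemm weights to Int8. Conv layers in the
# Wav2Vec2 feature encoder are not listed, so they stay in FP32.
quantize_dynamic(
    model_input="df_arena_1b_fp32.onnx",       # placeholder path
    model_output="df_arena_1b_quantized.onnx",  # placeholder path
    weight_type=QuantType.QInt8,
    op_types_to_quantize=["MatMul", "Gemm"],
)
```

Restricting `op_types_to_quantize` to MatMul/Gemm is what keeps the convolutional feature extractor at full precision while still shrinking the vast majority of the weights.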
## Limitations
- The model was trained for binary spoof/bonafide detection and does not identify specific attack types or TTS systems.
- Performance may vary on very short clips (< 1 s); tiling is used to compensate but edge cases exist.
- Intended for research and evaluation purposes.
License: MIT