DF-Arena-1B β€” ONNX (Int8 Quantized)

This is the ONNX Int8 quantized version of DF-Arena-1B-V1, a 1-billion parameter deepfake audio detection model.

Task:          Audio Deepfake Detection (Binary Classification)
Labels:        bonafide (genuine), spoof (AI-generated / replayed)
Base model:    Wav2Vec2-XLS-R-1B + Conformer
Quantization:  Dynamic Int8 (MatMul/Gemm layers)
Size:          ~1.3 GB (vs 4.3 GB FP32)
Input:         Raw PCM waveform, 16 kHz, mono, float32

Quickstart

Install

pip install onnxruntime librosa numpy

Inference

import onnxruntime as ort
import librosa
import numpy as np

MAX_LEN = 64600  # ~4 seconds at 16 kHz

def preprocess(audio_path: str) -> np.ndarray:
    wav, _ = librosa.load(audio_path, sr=16000, mono=True)
    # Tile-repeat if shorter, truncate if longer
    if len(wav) < MAX_LEN:
        wav = np.tile(wav, MAX_LEN // len(wav) + 1)
    return wav[:MAX_LEN].astype(np.float32)[np.newaxis, :]  # (1, T)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

sess   = ort.InferenceSession("df_arena_1b_quantized.onnx")
audio  = preprocess("your_audio.mp3")

logits = sess.run(None, {"input_values": audio})[0]   # (1, 2)
probs  = softmax(logits)[0]

label  = "bonafide" if probs[1] > probs[0] else "spoof"
print(f"{label}  β€”  bonafide: {probs[1]:.2%}, spoof: {probs[0]:.2%}")

Input / Output

Input tensor β€” input_values

  • Shape: (batch, samples) β€” dynamic batch and sequence length
  • dtype: float32
  • Sample rate: 16,000 Hz, mono
  • Recommended length: 64,600 samples (β‰ˆ 4 s) β€” the training length
  • No normalization required; feed raw PCM values directly

Padding: for clips shorter than 64,600 samples, tile-repeat the audio instead of zero-padding. This matches training behavior and preserves audio statistics.
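The tile-repeat behavior can be sanity-checked in isolation, without the model. A minimal sketch using the same logic as `preprocess` above (pure NumPy, hypothetical helper name):

```python
import numpy as np

MAX_LEN = 64600  # training length: ~4 s at 16 kHz

def tile_to_length(wav: np.ndarray, max_len: int = MAX_LEN) -> np.ndarray:
    """Tile-repeat a short clip (or truncate a long one) to exactly max_len samples."""
    if len(wav) < max_len:
        wav = np.tile(wav, max_len // len(wav) + 1)
    return wav[:max_len]

# A 1-second clip at 16 kHz is repeated ~4x to reach 64,600 samples;
# the original signal appears unchanged at the start of each repeat.
short = np.arange(16000, dtype=np.float32)
padded = tile_to_length(short)
print(padded.shape)
```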

Output tensor β€” logits

  • Shape: (batch, 2) β€” [spoof_logit, bonafide_logit]
  • Apply softmax to get probabilities
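Because the output ordering is [spoof_logit, bonafide_logit], post-processing is just a softmax followed by an index comparison. A model-free illustration with made-up logits:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits for one clip: index 0 = spoof, index 1 = bonafide
logits = np.array([[2.0, -1.0]], dtype=np.float32)
probs = softmax(logits)[0]

# Index 0 dominates here, so this clip would be classified as spoof
label = "bonafide" if probs[1] > probs[0] else "spoof"
print(label, probs)
```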

Performance

Benchmarked on CPU (Apple M-series), single 4-second clip:

Model              Size     Latency     Speedup vs PyTorch
PyTorch FP32       4.3 GB   ~9,000 ms   baseline
ONNX FP32          4.3 GB   ~1,400 ms   6.4× faster
ONNX Int8 (this)   1.3 GB   ~600 ms     15× faster

Accuracy vs FP32 ONNX baseline:

  • Cosine similarity: 0.9995
  • Classification agreement: 100% on all tested samples
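You can run the same comparison on your own clips by collecting logits from both the FP32 and Int8 models. A sketch of the comparison itself, with made-up logit vectors standing in for the two models' outputs:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in logits: an FP32 output vs a slightly perturbed Int8 output
fp32_logits = np.array([3.1, -2.4])
int8_logits = np.array([3.08, -2.43])

sim = cosine_similarity(fp32_logits, int8_logits)
agree = bool(np.argmax(fp32_logits) == np.argmax(int8_logits))
print(f"cosine: {sim:.4f}, same label: {agree}")
```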

Quantization Details

Dynamic Int8 quantization was applied to MatMul and Gemm operators (all Linear / Attention projection layers), which account for >95% of the model's weights. Convolution layers in the Wav2Vec2 feature encoder are kept in FP32 to preserve audio feature extraction quality.
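To make the size/accuracy trade-off concrete, here is the arithmetic behind Int8 weight quantization, sketched in NumPy on a random stand-in weight matrix. This uses simple symmetric per-tensor scaling; the actual ONNX Runtime quantizer chooses scales per operator and may use finer-grained schemes:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)  # stand-in Linear weight

# Symmetric quantization: map [-max|w|, +max|w|] onto the int8 range [-127, 127]
scale = np.abs(w).max() / 127.0
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # stored: 1 byte/weight
w_dequant = w_int8.astype(np.float32) * scale                     # used at inference time

# Rounding error is bounded by scale/2, i.e. ~1/254 of the largest weight
rel_err = np.abs(w - w_dequant).max() / np.abs(w).max()
print(f"4x smaller storage, max relative error ~{rel_err:.4f}")
```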


Limitations

  • The model was trained for binary spoof/bonafide detection and does not identify specific attack types or TTS systems.
  • Performance may vary on very short clips (< 1 s); tile-repeating to 4 s compensates, but edge cases remain.
  • Intended for research and evaluation purposes.

License: MIT
