DF-Arena-1B β€” ONNX (Int8 Quantized)

This is the ONNX Int8 quantized version of DF-Arena-1B-V1, a 1-billion parameter deepfake audio detection model.

Task:          Audio Deepfake Detection (Binary Classification)
Labels:        bonafide (genuine), spoof (AI-generated / replayed)
Base model:    Wav2Vec2-XLS-R-1B + Conformer
Quantization:  Dynamic Int8 (MatMul/Gemm layers)
Size:          ~1.3 GB (vs 4.3 GB FP32)
Input:         Raw PCM waveform, 16 kHz, mono, float32

Quickstart

Install

pip install onnxruntime librosa numpy

Inference

import onnxruntime as ort
import librosa
import numpy as np

MAX_LEN = 64600  # ~4 seconds at 16 kHz

def preprocess(audio_path: str) -> np.ndarray:
    wav, _ = librosa.load(audio_path, sr=16000, mono=True)
    # Tile-repeat if shorter, truncate if longer
    if len(wav) < MAX_LEN:
        wav = np.tile(wav, MAX_LEN // len(wav) + 1)
    return wav[:MAX_LEN].astype(np.float32)[np.newaxis, :]  # (1, T)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

sess   = ort.InferenceSession("df_arena_1b_quantized.onnx")
audio  = preprocess("your_audio.mp3")

logits = sess.run(None, {"input_values": audio})[0]   # (1, 2)
probs  = softmax(logits)[0]

label  = "bonafide" if probs[1] > probs[0] else "spoof"
print(f"{label}  β€”  bonafide: {probs[1]:.2%}, spoof: {probs[0]:.2%}")

Input / Output

Input tensor β€” input_values

  • Shape: (batch, samples) β€” dynamic batch and sequence length
  • dtype: float32
  • Sample rate: 16,000 Hz, mono
  • Recommended length: 64,600 samples (β‰ˆ 4 s) β€” the training length
  • No normalization required; feed raw PCM values directly

Padding: for clips shorter than 64,600 samples, tile-repeat the audio instead of zero-padding. This matches training behavior and preserves audio statistics.
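The tile-repeat behavior can be sanity-checked in isolation, without the model. A minimal sketch using the same logic as `preprocess` above (pure NumPy, hypothetical helper name):

```python
import numpy as np

MAX_LEN = 64600  # training length: ~4 s at 16 kHz

def tile_to_length(wav: np.ndarray, max_len: int = MAX_LEN) -> np.ndarray:
    """Tile-repeat a short clip (or truncate a long one) to exactly max_len samples."""
    if len(wav) < max_len:
        wav = np.tile(wav, max_len // len(wav) + 1)
    return wav[:max_len]

# A 1-second clip at 16 kHz is repeated ~4x to reach 64,600 samples;
# the original signal appears unchanged at the start of each repeat.
short = np.arange(16000, dtype=np.float32)
padded = tile_to_length(short)
print(padded.shape)
```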

Output tensor β€” logits

  • Shape: (batch, 2) β€” [spoof_logit, bonafide_logit]
  • Apply softmax to get probabilities
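Because the output ordering is [spoof_logit, bonafide_logit], post-processing is just a softmax followed by an index comparison. A model-free illustration with made-up logits:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits for one clip: index 0 = spoof, index 1 = bonafide
logits = np.array([[2.0, -1.0]], dtype=np.float32)
probs = softmax(logits)[0]

# Index 0 dominates here, so this clip would be classified as spoof
label = "bonafide" if probs[1] > probs[0] else "spoof"
print(label, probs)
```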

Performance

Benchmarked on CPU (Apple M-series), single 4-second clip:

Model              Size     Latency     Speedup vs PyTorch
PyTorch FP32       4.3 GB   ~9,000 ms   baseline
ONNX FP32          4.3 GB   ~1,400 ms   6.4× faster
ONNX Int8 (this)   1.3 GB   ~600 ms     15× faster

Accuracy vs FP32 ONNX baseline:

  • Cosine similarity: 0.9995
  • Classification agreement: 100% on all tested samples
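You can run the same comparison on your own clips by collecting logits from both the FP32 and Int8 models. A sketch of the comparison itself, with made-up logit vectors standing in for the two models' outputs:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in logits: an FP32 output vs a slightly perturbed Int8 output
fp32_logits = np.array([3.1, -2.4])
int8_logits = np.array([3.08, -2.43])

sim = cosine_similarity(fp32_logits, int8_logits)
agree = bool(np.argmax(fp32_logits) == np.argmax(int8_logits))
print(f"cosine: {sim:.4f}, same label: {agree}")
```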

Quantization Details

Dynamic Int8 quantization was applied to MatMul and Gemm operators (all Linear / Attention projection layers), which account for >95% of the model's weights. Convolution layers in the Wav2Vec2 feature encoder are kept in FP32 to preserve audio feature extraction quality.
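To make the size/accuracy trade-off concrete, here is the arithmetic behind Int8 weight quantization, sketched in NumPy on a random stand-in weight matrix. This uses simple symmetric per-tensor scaling; the actual ONNX Runtime quantizer chooses scales per operator and may use finer-grained schemes:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)  # stand-in Linear weight

# Symmetric quantization: map [-max|w|, +max|w|] onto the int8 range [-127, 127]
scale = np.abs(w).max() / 127.0
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # stored: 1 byte/weight
w_dequant = w_int8.astype(np.float32) * scale                     # used at inference time

# Rounding error is bounded by scale/2, i.e. ~1/254 of the largest weight
rel_err = np.abs(w - w_dequant).max() / np.abs(w).max()
print(f"4x smaller storage, max relative error ~{rel_err:.4f}")
```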


Limitations

  • The model was trained for binary spoof/bonafide detection and does not identify specific attack types or TTS systems.
  • Performance may vary on very short clips (< 1 s); tile-repeating to 4 s compensates, but edge cases remain.
  • Intended for research and evaluation purposes.

License: MIT
