AST Fine-tuned for Fake Audio Detection

This model is a fine-tuned version of MIT/ast-finetuned-audioset-10-10-0.4593 for detecting fake (synthetic) audio. The original AST (Audio Spectrogram Transformer) classification head was replaced with a binary classification layer trained for fake audio detection.

Model Description

Base Model: MIT/ast-finetuned-audioset-10-10-0.4593 (AST pretrained on AudioSet)
Task: Binary classification (fake/real audio detection)
Input: Audio converted to Mel spectrogram (128 mel bins, 1024 time frames)
Output: Two logits; applying softmax gives probabilities [fake_prob, real_prob]
Training Hardware: 2x NVIDIA T4 GPUs
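
As a rough sketch of how clip length relates to the 1024-frame input, the helper below estimates how many spectrogram frames a clip produces. The 25 ms window and 10 ms hop are assumptions (common Kaldi-style filterbank defaults), and the function name is hypothetical; check the feature extractor's configuration for the actual values.

```python
def fbank_frame_count(num_samples: int,
                      sample_rate: int = 16000,
                      frame_length_ms: float = 25.0,  # assumed window length
                      frame_shift_ms: float = 10.0) -> int:  # assumed hop
    """Estimate the number of filterbank frames for a clip (no edge padding)."""
    window = int(sample_rate * frame_length_ms / 1000)
    shift = int(sample_rate * frame_shift_ms / 1000)
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


for seconds in (4, 10):
    frames = fbank_frame_count(seconds * 16000)
    print(f"{seconds} s clip: {frames} frames (model input length: 1024)")
```

Under these assumptions a 10-second clip produces slightly fewer than 1024 frames, so it fits in the model input once padded.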

Usage Guide

Model Usage

import torch
import torchaudio
import soundfile as sf
import numpy as np
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# Load model and move to available device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "WpythonW/ast-fakeaudio-detector"

extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name).to(device)
model.eval()

# Process multiple audio files
audio_files = ["audio1.wav", "audio2.mp3", "audio3.ogg"]
processed_batch = []

for audio_path in audio_files:
    # Load audio file (MP3 decoding via soundfile requires libsndfile >= 1.1.0)
    audio_data, sr = sf.read(audio_path)
    
    # Convert stereo to mono if needed
    if len(audio_data.shape) > 1 and audio_data.shape[1] > 1:
        audio_data = np.mean(audio_data, axis=1)
    
    # Resample to 16kHz if needed
    if sr != 16000:
        waveform = torch.from_numpy(audio_data).float()
        if len(waveform.shape) == 1:
            waveform = waveform.unsqueeze(0)
        
        resample = torchaudio.transforms.Resample(
            orig_freq=sr, 
            new_freq=16000
        )
        waveform = resample(waveform)
        audio_data = waveform.squeeze().numpy()
    
    processed_batch.append(audio_data)

# Prepare batch input
inputs = extractor(
    processed_batch,
    sampling_rate=16000,
    padding=True,
    return_tensors="pt"
)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Get predictions
with torch.no_grad():
    logits = model(**inputs).logits
    probabilities = torch.nn.functional.softmax(logits, dim=-1)

# Process results
for filename, probs in zip(audio_files, probabilities):
    fake_prob = float(probs[0].cpu())
    real_prob = float(probs[1].cpu())
    prediction = "FAKE" if fake_prob > real_prob else "REAL"
    
    print(f"\nFile: {filename}")
    print(f"Fake probability: {fake_prob:.2%}")
    print(f"Real probability: {real_prob:.2%}")
    print(f"Verdict: {prediction}")
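
The verdict above picks whichever class has the higher probability. As a minimal sketch of an alternative, the pure-Python helper below applies softmax to a pair of logits and compares the fake probability against an adjustable threshold. The `fake_threshold` parameter and both function names are hypothetical and not part of the model's API; the class order [fake, real] follows the output description above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def verdict(logits, fake_threshold=0.5):
    """Return (label, fake_prob) for logits ordered as [fake, real]."""
    fake_prob, _real_prob = softmax(logits)
    label = "FAKE" if fake_prob > fake_threshold else "REAL"
    return label, fake_prob


label, fake_prob = verdict([2.0, 0.0])
print(f"Verdict: {label} (fake probability {fake_prob:.2%})")
```

Raising the threshold trades missed fakes for fewer false alarms; a suitable value would need to be chosen on validation data.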

Limitations

Important considerations when using this model:

The model expects 16 kHz audio; resample other sampling rates before inference
Performance may degrade on types of audio manipulation that were not present in the training data
The model was trained on audio samples 4 to 10 seconds long
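
One possible way to handle recordings longer than the training range is to split them into windows of at most 10 seconds and score each window separately. The sketch below only computes the sample ranges; the function name, the non-overlapping layout, and the choice to drop a trailing piece shorter than 4 seconds are all assumptions, not the author's method.

```python
def split_into_windows(num_samples, sample_rate=16000,
                       window_s=10.0, min_s=4.0):
    """Return (start, end) sample ranges of at most window_s seconds.

    Pieces shorter than min_s seconds are skipped (assumed policy).
    """
    window = int(window_s * sample_rate)
    minimum = int(min_s * sample_rate)
    spans = []
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        if end - start >= minimum:
            spans.append((start, end))
        start = end
    return spans


# A 25-second recording at 16 kHz: two 10 s windows plus a 5 s remainder
for start, end in split_into_windows(25 * 16000):
    print(f"samples {start}..{end} ({(end - start) / 16000:.1f} s)")
```

Each range can then be sliced from the waveform and passed through the batch pipeline shown in the usage guide. Note that a clip shorter than 4 seconds yields no windows under this policy.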

Model size: 86.2M params
Tensor type: F32 (Safetensors)

Evaluation results

Self-reported metrics on real-fake-voices-dataset2:

Accuracy: 0.966
F1: 0.971
Precision: 0.969
Recall: 0.973
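
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported values can be cross-checked with a few lines. The helper name is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported precision and recall from the evaluation results
print(round(f1_score(0.969, 0.973), 3))  # 0.971
```

The result matches the reported F1 to three decimal places.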
