ConvNeXt-Tiny Audio Deepfake Classifier

A deep learning model fine-tuned for detecting audio deepfakes using mel-spectrograms of 2-second audio clips from the Release in the Wild dataset.

model_id: audio-deepfake-classifier-convnext-tiny
model_summary: This model classifies audio clips as real or fake based on visual patterns in mel-spectrograms, using the facebook/convnext-tiny-224 architecture.

Model Details
Uses
Bias, Risks, and Limitations
Training Details
Evaluation
Societal Impact Assessment
Environmental Impact
Technical Specifications

Model Details

Model Description

This model is a fine-tuned ConvNeXt-Tiny image classifier applied to mel-spectrograms derived from 2-second audio clips to detect audio deepfakes. By transforming audio into visual representations (spectrograms), it leverages image classification techniques for audio-based forensics.

Architecture: ConvNeXt-Tiny
Base Model: facebook/convnext-tiny-224
Task: Binary classification (real vs fake)
Input Modality: Mel-spectrogram (224x224 image)
Training Framework: PyTorch + Transformers

Developed by

Jakub Polnis

Shared by

Release in the Wild Dataset (2s mel-spectrograms)

Model Type

Image Classification
Audio Deepfake Detection via Spectrograms

Language(s)

English

License

Apache 2.0

Finetuned From

facebook/convnext-tiny-224

Uses

Direct Use

This model can be used to classify 2-second audio clips as real or fake by first converting them into mel-spectrogram images and passing them through the classifier.

from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification
import torch

# Load the image processor and the fine-tuned classifier from the Hub.
processor = AutoImageProcessor.from_pretrained("kubinooo/convnext-tiny-224-audio-deepfake-classification")
model = AutoModelForImageClassification.from_pretrained("kubinooo/convnext-tiny-224-audio-deepfake-classification")

# The input is a mel-spectrogram image of a 2-second clip, not raw audio.
image = Image.open("/path/to/your/image").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Run inference without tracking gradients.
with torch.no_grad():
    outputs = model(pixel_values)
    logits = outputs.logits

# Map the highest-scoring class index to its label.
predicted_class_idx = logits.argmax(-1).item()
print(model.config.id2label[predicted_class_idx])

Downstream Use

Integration into voice content moderation tools
Forensic applications for detecting synthetic speech
Fine-tuning for speaker-specific deepfake detection

Out-of-Scope Use

Not suitable for raw audio input without spectrogram conversion
May be unreliable on clips much shorter or longer than 2 seconds
Not intended for speech recognition or speaker identification
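
Since the model expects clips of about 2 seconds, longer recordings would need to be cut into windows before spectrogram conversion. The following is a minimal sketch of one way to do that, not the author's pipeline. The function name split_into_windows is hypothetical, and zero-padding the last window is an assumption; how short clips were handled during training is not documented here.

```python
import numpy as np


def split_into_windows(samples, sample_rate, window_s=2.0):
    """Split a mono waveform into consecutive fixed-length windows.

    The last window is zero-padded to full length (an assumption; the
    training pipeline may have dropped or handled short clips differently).
    """
    window_len = int(round(window_s * sample_rate))
    if window_len <= 0:
        raise ValueError("window_s * sample_rate must be positive")

    windows = []
    for start in range(0, len(samples), window_len):
        chunk = samples[start:start + window_len]
        if len(chunk) < window_len:
            # Pad the trailing chunk with silence on the right.
            chunk = np.pad(chunk, (0, window_len - len(chunk)))
        windows.append(chunk)
    return windows
```

Each window can then be converted to a spectrogram and classified separately. How to combine per-window predictions (majority vote, averaging, flagging if any window is fake) is a design choice that should be validated on in-domain data.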

Bias, Risks, and Limitations

Demographic Bias: Dataset may be skewed toward specific accents, recording qualities, or environments.
Risk of Misclassification: In critical use cases (e.g., journalism, legal settings), false positives could falsely flag real voices, and false negatives may miss deepfakes.
Generalization Risk: Model performance may degrade on domains very different from the training data.

Recommendations

Evaluate on diverse, in-domain datasets before production use
Combine with metadata or alternative models for high-stakes decisions
Include human-in-the-loop review for sensitive classifications
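
One way to apply the human-in-the-loop recommendation is to route low-confidence predictions to a reviewer instead of accepting the top label. The sketch below assumes the two raw logits from the classifier are available. The triage function, the 0.9 threshold, and the label order are all hypothetical; in practice the label order should be read from model.config.id2label and the threshold tuned on in-domain data.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def triage(logits, labels=("fake", "real"), min_confidence=0.9):
    """Return (decision, confidence) for one prediction.

    `labels` is an assumed label order; use model.config.id2label in practice.
    Predictions below `min_confidence` are marked for human review.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    decision = labels[best] if confidence >= min_confidence else "needs_review"
    return decision, confidence
```

Softmax confidence is not a calibrated probability, so a threshold like this only reduces, and does not remove, the risk of confident misclassifications.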

Training Details

Training Data

The model was trained on mel-spectrogram images of 2-second clips generated from Release in the Wild, a labeled dataset of real and fake voice clips.

Preprocessing

Audio resampled and clipped to 2 seconds
Converted into mel-spectrograms
Images normalized and resized to 224x224
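
The steps above could be sketched roughly as follows with Librosa. This is an illustration under stated assumptions, not the exact training pipeline: the 16 kHz sample rate, 128 mel bands, grayscale rendering, and all function names are hypothetical, since the card does not specify them. For meaningful predictions, the rendering has to match whatever was used to build the training images.

```python
import numpy as np


def fit_to_duration(samples, sample_rate, duration_s=2.0):
    """Clip or zero-pad a mono waveform to exactly `duration_s` seconds."""
    target = int(round(duration_s * sample_rate))
    if len(samples) >= target:
        return samples[:target]
    return np.pad(samples, (0, target - len(samples)))


def db_to_uint8(spec_db):
    """Min-max scale a dB spectrogram to the 0..255 range as uint8."""
    lo, hi = float(spec_db.min()), float(spec_db.max())
    if hi == lo:
        return np.zeros(spec_db.shape, dtype=np.uint8)
    scaled = (spec_db - lo) / (hi - lo)
    return (scaled * 255).round().astype(np.uint8)


def audio_to_mel_image(path, sample_rate=16000, n_mels=128, size=224):
    """Load audio and return a size x size RGB mel-spectrogram image.

    sample_rate and n_mels are assumed values, not taken from the model card.
    """
    # Imported here so the helpers above work without these packages.
    import librosa
    from PIL import Image

    samples, sr = librosa.load(path, sr=sample_rate, mono=True)
    samples = fit_to_duration(samples, sr)
    mel = librosa.feature.melspectrogram(y=samples, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Flip vertically so low frequencies sit at the bottom of the image.
    pixels = np.ascontiguousarray(np.flipud(db_to_uint8(mel_db)))
    return Image.fromarray(pixels).convert("RGB").resize((size, size))
```

The returned image can be passed to the image processor in the same way as an image loaded from disk. Normalization is left to the processor.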

Speeds, Sizes, Times

Metric	Value
Best Epoch	1
Final Validation Accuracy	99.605%
Training Accuracy	99.829%
Validation Loss	0.01301
Final Step Accuracy	99.837%
Final Step Loss	0.00009

Evaluation

Testing Data

Tested on a held-out portion of the Release in the Wild dataset (8,126 samples: 4,071 real and 4,055 fake).

Testing Factors

Real vs Fake audio
Consistency across balanced classes

Testing Metrics

Accuracy
Class-wise accuracy
Confusion matrix

Results

Metric	Value
Overall Accuracy	96.74% (7861/8126)
Real Class Accuracy	97.25% (3959/4071)
Fake Class Accuracy	96.23% (3902/4055)

Confusion Matrix

True Label	Predicted Label	Count
fake	fake	3902
fake	real	153
real	real	3959
real	fake	112
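
The accuracy figures in the results table follow directly from the confusion matrix counts. The short sketch below recomputes them. The counts are copied from the table above; the helper names are illustrative only, and precision is an extra metric that this card does not report.

```python
# Keys are (true_label, predicted_label); counts come from the table above.
confusion = {
    ("fake", "fake"): 3902,
    ("fake", "real"): 153,
    ("real", "real"): 3959,
    ("real", "fake"): 112,
}


def overall_accuracy(cm):
    """Fraction of all samples whose prediction matches the true label."""
    correct = sum(n for (true, pred), n in cm.items() if true == pred)
    return correct / sum(cm.values())


def class_accuracy(cm, label):
    """Fraction of samples with true label `label` that were predicted correctly."""
    total = sum(n for (true, _), n in cm.items() if true == label)
    return cm[(label, label)] / total


def precision(cm, label):
    """Fraction of samples predicted as `label` that truly are `label`."""
    predicted = sum(n for (_, pred), n in cm.items() if pred == label)
    return cm[(label, label)] / predicted


print(f"Overall accuracy: {overall_accuracy(confusion):.2%}")      # 96.74%
print(f"Real class accuracy: {class_accuracy(confusion, 'real'):.2%}")  # 97.25%
print(f"Fake class accuracy: {class_accuracy(confusion, 'fake'):.2%}")  # 96.23%
print(f"Fake precision: {precision(confusion, 'fake'):.2%}")
```

Class-wise accuracy here is the same quantity as per-class recall. Precision matters in addition when the cost of falsely flagging a real voice is high.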

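Results Summary

The model reaches about 97% overall accuracy on the held-out test set, and the real and fake class accuracies differ by roughly one percentage point (97.25% vs. 96.23%). Training converged early (best epoch: 1) with a validation accuracy of 99.605%, which suggests limited overfitting on the validation split. Test accuracy is about three percentage points lower than validation accuracy.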

Societal Impact Assessment

This model has not yet undergone formal testing for:

Child safety or CSEM risks
NCII (non-consensual intimate imagery)
Violence or abusive content detection

Caution is advised when using this model in applications that could impact people’s safety, dignity, or reputation.

Environmental Impact

Hardware Type

Hours Used

Cloud Provider

Google Cloud Platform

Cloud Region

europe-west1

CO2 Emitted

0.02

Technical Specifications

Model Specifications

ConvNeXt-Tiny, a convolutional network loaded through Hugging Face Transformers, adapted to classify mel-spectrogram images as real or fake audio. It uses a standard image classification head.

Compute Infrastructure

Standard GPU training with PyTorch and Hugging Face Transformers.

Hardware Requirements

Minimum: NVIDIA GPU with 8 GB VRAM
CPU inference possible but slower

Software

PyTorch
Hugging Face Transformers
Librosa (for spectrogram generation)
Weights & Biases (wandb)