ConvNeXt-Tiny Audio Deepfake Classifier
A deep learning model fine-tuned for detecting audio deepfakes using mel-spectrograms of 2-second audio clips from the Release in the Wild dataset.
model_id: audio-deepfake-classifier-convnext-tiny
model_summary: This model classifies audio clips as real or fake based on visual patterns in mel-spectrograms, using the facebook/convnext-tiny-224 architecture.
Table of Contents
- Model Details
- Uses
- Bias, Risks, and Limitations
- Training Details
- Evaluation
- Societal Impact Assessment
- Environmental Impact
- Technical Specifications
Model Details
Model Description
This model is a fine-tuned ConvNeXt-Tiny image classifier applied to mel-spectrograms derived from 2-second audio clips to detect audio deepfakes. By transforming audio into visual representations (spectrograms), it leverages image classification techniques for audio-based forensics.
- Architecture: ConvNeXt-Tiny
- Base Model: facebook/convnext-tiny-224
- Task: Binary classification (real vs fake)
- Input Modality: Mel-spectrogram (224x224 image)
- Training Framework: PyTorch + Transformers
Developed by
- kubinooo
Shared by
- kubinooo
Training Dataset
- Release in the Wild (2-second mel-spectrograms)
Model Type
- Image Classification
- Audio Deepfake Detection via Spectrograms
Language(s)
- English
License
Finetuned From
- facebook/convnext-tiny-224
Uses
Direct Use
This model can be used to classify 2-second audio clips as real or fake by first converting them into mel-spectrogram images and passing them through the classifier.
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification
import torch
processor = AutoImageProcessor.from_pretrained("kubinooo/convnext-tiny-224-audio-deepfake-classification")
model = AutoModelForImageClassification.from_pretrained("kubinooo/convnext-tiny-224-audio-deepfake-classification")
image = Image.open("/path/to/your/image").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values
with torch.no_grad():
    outputs = model(pixel_values)
    logits = outputs.logits
predicted_class_idx = logits.argmax(-1).item()
print(model.config.id2label[predicted_class_idx])
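The snippet above expects a pre-rendered mel-spectrogram image. A minimal sketch of generating one from a 2-second clip is shown below, assuming librosa with default mel parameters and a simple grayscale-to-RGB rendering; the exact spectrogram settings used during training are not documented here, so treat these values as placeholders.

import numpy as np
import librosa
from PIL import Image

def mel_image(y, sr, n_mels=128, size=(224, 224)):
    # Log-scaled mel-spectrogram rendered as a 224x224 RGB image.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    mel_norm = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
    return Image.fromarray((mel_norm * 255).astype(np.uint8)).convert("RGB").resize(size)

def audio_to_mel_image(wav_path, sr=16000, duration=2.0):
    # Load, resample, and pad the clip to exactly `duration` seconds.
    y, sr = librosa.load(wav_path, sr=sr, duration=duration)
    target_len = int(sr * duration)
    if len(y) < target_len:
        y = np.pad(y, (0, target_len - len(y)))
    return mel_image(y, sr)

The resulting image can then be fed to the processor exactly as in the example above, e.g. image = audio_to_mel_image("clip.wav").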
Downstream Use
- Integration into voice content moderation tools
- Forensic applications for detecting synthetic speech
- Fine-tuning for speaker-specific deepfake detection (see the sketch after this list)
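For the fine-tuning use case, a minimal sketch of continuing training on speaker-specific spectrograms is given below. The folder layout, label mapping, and hyperparameters are illustrative assumptions, not the original training setup.

from pathlib import Path
import torch
from torch.utils.data import DataLoader
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

model_name = "kubinooo/convnext-tiny-224-audio-deepfake-classification"
processor = AutoImageProcessor.from_pretrained(model_name)
model = AutoModelForImageClassification.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical layout: speaker_data/<label>/*.png, where <label> matches the
# model's label names (see model.config.label2id).
label2id = model.config.label2id
clips = [(Image.open(p).convert("RGB"), label2id[p.parent.name])
         for p in Path("speaker_data").glob("*/*.png")]

def collate(batch):
    images, labels = zip(*batch)
    inputs = processor(list(images), return_tensors="pt")
    inputs["labels"] = torch.tensor(labels)
    return inputs

loader = DataLoader(clips, batch_size=16, shuffle=True, collate_fn=collate)

model.train()
for epoch in range(3):
    for batch in loader:
        loss = model(**batch).loss   # the model returns a loss when labels are provided
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()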
Out-of-Scope Use
- Not suitable for raw audio input without spectrogram conversion
- May fail on clips that are not close to 2 seconds in length (see the windowing sketch after this list)
- Not intended for speech recognition or speaker identification
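For recordings longer than 2 seconds, one workaround (not part of the released model) is to split the audio into 2-second windows, classify each window, and average the probabilities. A sketch is below, reusing the hypothetical mel_image helper from the Direct Use section.

import numpy as np
import torch
import librosa

def classify_long_audio(wav_path, model, processor, sr=16000, window_s=2.0):
    # Split into non-overlapping 2-second windows and average window probabilities.
    y, _ = librosa.load(wav_path, sr=sr)
    win = int(sr * window_s)
    if len(y) % win:
        y = np.pad(y, (0, win - len(y) % win))   # keep the last partial window
    window_probs = []
    for start in range(0, len(y), win):
        image = mel_image(y[start:start + win], sr)   # helper sketched under Direct Use
        pixel_values = processor(image, return_tensors="pt").pixel_values
        with torch.no_grad():
            window_probs.append(model(pixel_values).logits.softmax(-1))
    mean_probs = torch.cat(window_probs).mean(dim=0)
    return model.config.id2label[int(mean_probs.argmax())]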
Bias, Risks, and Limitations
- Demographic Bias: Dataset may be skewed toward specific accents, recording qualities, or environments.
- Risk of Misclassification: In critical use cases (e.g., journalism, legal settings), false positives could falsely flag real voices, and false negatives may miss deepfakes.
- Generalization Risk: Model performance may degrade on domains very different from the training data.
Recommendations
- Evaluate on diverse, in-domain datasets before production use
- Combine with metadata or alternative models for high-stakes decisions
- Include human-in-the-loop review for sensitive classifications
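A minimal sketch of the human-in-the-loop recommendation above: return the predicted label together with its softmax confidence and flag low-confidence cases for manual review. The 0.9 threshold is an assumption and should be tuned on in-domain validation data.

import torch

def predict_with_review(image, model, processor, review_threshold=0.9):
    # Flag predictions below the confidence threshold for human review
    # instead of treating them as automatic verdicts.
    pixel_values = processor(image, return_tensors="pt").pixel_values
    with torch.no_grad():
        probs = model(pixel_values).logits.softmax(-1)[0]
    confidence, idx = probs.max(dim=-1)
    return {
        "label": model.config.id2label[int(idx)],
        "confidence": float(confidence),
        "needs_human_review": bool(confidence < review_threshold),
    }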
Training Details
Training Data
The model was trained on 2-second mel-spectrogram images generated from Release in the Wild, a labeled dataset of real and fake voice clips.
Preprocessing
- Audio resampled and clipped to 2 seconds
- Converted into mel-spectrograms
- Images normalized and resized to 224x224
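As a rough illustration of this pipeline, the sketch below bulk-converts a folder of clips into spectrogram PNGs using the hypothetical audio_to_mel_image helper from the Direct Use section; the directory layout and parameters are assumptions, not the original preprocessing script.

from pathlib import Path

for wav in Path("release_in_the_wild").glob("**/*.wav"):
    img = audio_to_mel_image(str(wav))   # resample, clip to 2 s, mel-spectrogram, 224x224
    out = Path("spectrograms") / wav.parent.name / (wav.stem + ".png")
    out.parent.mkdir(parents=True, exist_ok=True)
    img.save(out)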
Training Results
Metric | Value
---|---
Best Epoch | 1
Final Validation Accuracy | 99.605%
Training Accuracy | 99.829%
Validation Loss | 0.01301
Final Step Accuracy | 99.837%
Final Step Loss | 0.00009
Evaluation
Testing Data
Tested on a held-out portion of the Release in the Wild dataset.
Factors
- Real vs Fake audio
- Consistency across balanced classes
Metrics
- Accuracy
- Class-wise accuracy
- Confusion matrix
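A sketch of how these metrics could be computed from model predictions on the held-out split, using scikit-learn (an assumption; it is not part of the listed software stack):

from sklearn.metrics import accuracy_score, confusion_matrix

def report(y_true, y_pred, id2label):
    # y_true / y_pred are lists of integer class ids collected by running
    # the model over the held-out spectrograms.
    cm = confusion_matrix(y_true, y_pred)
    per_class = cm.diagonal() / cm.sum(axis=1)
    print(f"overall accuracy: {accuracy_score(y_true, y_pred):.4f}")
    for idx, acc in enumerate(per_class):
        print(f"{id2label[idx]} accuracy: {acc:.4f}")
    print("confusion matrix:\n", cm)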
Results
Metric | Value
---|---
Overall Accuracy | 96.74% (7861/8126) |
Real Class Accuracy | 97.25% (3959/4071) |
Fake Class Accuracy | 96.23% (3902/4055) |
Confusion Matrix
True Label | Predicted Label | Count
---|---|---
fake | fake | 3902 |
fake | real | 153 |
real | real | 3959 |
real | fake | 112 |
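The per-class and overall accuracies reported above follow directly from these counts:

real_correct, real_total = 3959, 3959 + 112   # 3959 / 4071 = 97.25%
fake_correct, fake_total = 3902, 3902 + 153   # 3902 / 4055 = 96.23%
overall = (real_correct + fake_correct) / (real_total + fake_total)   # 7861 / 8126 = 96.74%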
Results Summary
The model achieves strong overall accuracy (~97%) on the held-out set, with well-balanced performance across the real and fake classes. Early convergence (best epoch: 1) and stable validation metrics suggest minimal overfitting.
Societal Impact Assessment
This model has not yet undergone formal testing for:
- Child safety or CSEM risks
- NCII (non-consensual intimate imagery)
- Violence or abusive content detection
Caution is advised when using this model in applications that could impact people’s safety, dignity, or reputation.
Environmental Impact
- Hardware Type: NVIDIA T4 GPU
- Hours Used: ~1 hour
- Cloud Provider: Google Cloud Platform
- Cloud Region: europe-west1
- Carbon Emitted: 0.02 kg CO2eq
Technical Specifications
Model Specifications
A ConvNeXt-Tiny convolutional network adapted to classify mel-spectrogram images as real or fake audio, using the standard image classification head.
Compute Infrastructure
Standard GPU training with PyTorch and Hugging Face Transformers.
Hardware Requirements
- Minimum: NVIDIA GPU with 8 GB VRAM
- CPU inference possible but slower
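A short sketch of the device fallback mentioned above (GPU when available, otherwise CPU):

import torch
from transformers import AutoModelForImageClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForImageClassification.from_pretrained(
    "kubinooo/convnext-tiny-224-audio-deepfake-classification"
).to(device).eval()
# Inputs from the processor must be moved to the same device, e.g.
# pixel_values = pixel_values.to(device)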
Software
- PyTorch
- Hugging Face Transformers
- Librosa (for spectrogram generation)
- Wandb