|
---
license: apache-2.0
datasets:
- kubinooo/release-in-the-wild-2s-mel-spectrograms
language:
- en
base_model:
- facebook/convnext-tiny-224
pipeline_tag: image-classification
tags:
- audio-classification
- deepfake-classification
results:
  overall_accuracy: 0.9852
  overall_accuracy_fraction: 8006/8126
  overall_accuracy_percentage: 98.5%
  confusion_matrix:
  - true_label: fake
    predicted_label: fake
    count: 3958
  - true_label: fake
    predicted_label: real
    count: 97
  - true_label: real
    predicted_label: real
    count: 4048
  - true_label: real
    predicted_label: fake
    count: 23
  class_accuracy:
    real:
      accuracy: 0.9944
      fraction: 4048/4071
    fake:
      accuracy: 0.9761
      fraction: 3958/4055
---
|
|
|
# ConvNeXt-Tiny Audio Deepfake Classifier |
|
|
|
A deep learning model fine-tuned for detecting audio deepfakes using mel-spectrograms of 2-second audio clips from the *Release in the Wild* dataset. |
|
|
|
**Model ID**: `audio-deepfake-classifier-convnext-tiny`

**Summary**: This model classifies audio clips as *real* or *fake* based on visual patterns in their mel-spectrograms, using the `facebook/convnext-tiny-224` architecture.
|
|
|
## Table of Contents |
|
|
|
- [Model Details](#model-details) |
|
- [Uses](#uses) |
|
- [Bias, Risks, and Limitations](#bias-risks-and-limitations) |
|
- [Training Details](#training-details) |
|
- [Evaluation](#evaluation) |
|
- [Societal Impact Assessment](#societal-impact-assessment) |
|
- [Environmental Impact](#environmental-impact) |
|
- [Technical Specifications](#technical-specifications) |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This model is a fine-tuned ConvNeXt-Tiny image classifier applied to **mel-spectrograms** derived from 2-second audio clips to detect *audio deepfakes*. By transforming audio into visual representations (spectrograms), it leverages image classification techniques for audio-based forensics. |
|
|
|
- **Architecture**: ConvNeXt-Tiny |
|
- **Base Model**: [facebook/convnext-tiny-224](https://huggingface.co/facebook/convnext-tiny-224) |
|
- **Task**: Binary classification (real vs fake) |
|
- **Input Modality**: Mel-spectrogram (224x224 image) |
|
- **Training Framework**: PyTorch + Transformers |
|
|
|
#### Developed by |
|
[Jakub Polnis](https://huggingface.co/kubinooo) |
|
|
|
|
|
#### Dataset
|
[Release in the Wild Dataset (2s mel-spectrograms)](https://huggingface.co/datasets/kubinooo/release-in-the-wild-2s-mel-spectrograms) |
|
|
|
#### Model Type |
|
- Image Classification |
|
- Audio Deepfake Detection via Spectrograms |
|
|
|
#### Language(s) |
|
- English |
|
|
|
#### License |
|
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
|
|
|
#### Finetuned From |
|
[facebook/convnext-tiny-224](https://huggingface.co/facebook/convnext-tiny-224) |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
|
|
This model can be used to classify 2-second audio clips as **real** or **fake** by first converting them into mel-spectrogram images and passing them through the classifier. |
|
|
|
```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Load the preprocessing pipeline and the fine-tuned classifier
processor = AutoImageProcessor.from_pretrained("kubinooo/convnext-tiny-224-audio-deepfake-classification")
model = AutoModelForImageClassification.from_pretrained("kubinooo/convnext-tiny-224-audio-deepfake-classification")
model.eval()

# Open a mel-spectrogram image of a 2-second clip and preprocess it
image = Image.open("/path/to/your/spectrogram.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Classify and map the winning logit back to its label ("real" or "fake")
with torch.no_grad():
    logits = model(pixel_values).logits
predicted_class_idx = logits.argmax(-1).item()
print(model.config.id2label[predicted_class_idx])
```
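
To report a confidence score alongside the label, the logits can be turned into probabilities with a softmax. A minimal sketch extending the snippet above (the label order comes from `model.config.id2label`, whatever mapping the checkpoint defines):

```python
import torch.nn.functional as F

# Per-class probabilities for the single input above
probs = F.softmax(logits, dim=-1)[0]
for idx, p in enumerate(probs.tolist()):
    print(f"{model.config.id2label[idx]}: {p:.4f}")
```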
|
|
|
### Downstream Use |
|
|
|
- Integration into voice content moderation tools |
|
- Forensic applications for detecting synthetic speech |
|
- Fine-tuning for speaker-specific deepfake detection (see the sketch below)
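
For the fine-tuning route, a minimal sketch in plain PyTorch, assuming a hypothetical folder of spectrogram PNGs laid out for `torchvision.datasets.ImageFolder`; the path, hyperparameters, and transforms are illustrative assumptions, not the recipe used to train this model:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from transformers import AutoModelForImageClassification

# Hypothetical layout: spectrograms/train/{fake,real}/*.png
# ImageFolder assigns labels alphabetically (fake=0, real=1); confirm this
# matches model.config.label2id before training.
tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet stats used by ConvNeXt
                         std=[0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("spectrograms/train", transform=tfm)
loader = DataLoader(train_set, batch_size=16, shuffle=True)

model = AutoModelForImageClassification.from_pretrained(
    "kubinooo/convnext-tiny-224-audio-deepfake-classification")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for pixel_values, labels in loader:
    # Passing labels makes the model return a cross-entropy loss
    loss = model(pixel_values=pixel_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```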
|
|
|
### Out-of-Scope Use |
|
|
|
- Not suitable for raw audio input without spectrogram conversion |
|
- May fail on clips not close to 2 seconds in length |
|
- Not intended for speech recognition or speaker identification |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
- **Demographic Bias**: Dataset may be skewed toward specific accents, recording qualities, or environments. |
|
- **Risk of Misclassification**: In critical use cases (e.g., journalism, legal settings), false positives could falsely flag real voices, and false negatives may miss deepfakes. |
|
- **Generalization Risk**: Model performance may degrade on domains very different from the training data. |
|
|
|
### Recommendations
|
|
|
- Evaluate on diverse, in-domain datasets before production use |
|
- Combine with metadata or alternative models for high-stakes decisions |
|
- Include human-in-the-loop review for sensitive classifications |
|
|
|
## Training Details |
|
|
|
### Training Data
|
|
|
The model was trained on 2-second mel-spectrogram images generated from [Release in the Wild](https://huggingface.co/datasets/kubinooo/release-in-the-wild-2s-mel-spectrograms), a labeled dataset of real and fake voice clips.
|
|
|
### Preprocessing
|
|
|
- Audio resampled and clipped to 2 seconds |
|
- Converted into mel-spectrograms |
|
- Images normalized and resized to 224x224 (see the sketch below)
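
A minimal sketch of this pipeline using Librosa (listed under Software below) and Matplotlib; the sample rate, mel parameters, and rendering settings here are illustrative assumptions, not the exact configuration used to build the dataset:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load audio and force it to exactly 2 seconds (sample rate is an assumption)
sr = 16000
y, _ = librosa.load("clip.wav", sr=sr)
y = librosa.util.fix_length(y, size=2 * sr)

# Mel-spectrogram on a decibel scale (n_mels is an assumption)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Render without axes and save; the saved PNG is what the classifier consumes
fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)  # roughly 224x224 px
librosa.display.specshow(mel_db, sr=sr, ax=ax)
ax.set_axis_off()
fig.savefig("clip_mel.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)
```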
|
|
|
### Speeds, Sizes, Times
|
|
|
| Metric | Value | |
|
|--------|-------| |
|
| Best Epoch | 1 | |
|
| Final Validation Accuracy | 99.605% | |
|
| Training Accuracy | 99.829% | |
|
| Validation Loss | 0.01301 | |
|
| Final Step Accuracy | 99.837% | |
|
| Final Step Loss | 0.00009 | |
|
|
|
## Evaluation |
|
|
|
### Testing Data
|
|
|
Tested on a held-out portion of the *Release in the Wild* dataset.
|
|
|
### Factors
|
|
|
- Real vs Fake audio |
|
- Consistency across balanced classes |
|
|
|
### Metrics
|
|
|
- Accuracy |
|
- Class-wise accuracy |
|
- Confusion matrix (see the computation sketch below)
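
A minimal sketch of how these metrics can be derived from paired label lists; `y_true` and `y_pred` are hypothetical placeholders for the test-set ground truth and the model's predictions:

```python
from collections import Counter

def evaluate(y_true, y_pred, labels=("real", "fake")):
    # Overall accuracy
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    print(f"Overall accuracy: {correct / len(y_true):.4f} ({correct}/{len(y_true)})")

    # Confusion matrix: (true, predicted) -> count
    confusion = Counter(zip(y_true, y_pred))
    for t in labels:
        for p in labels:
            print(f"true={t:4s} predicted={p:4s}: {confusion[(t, p)]}")

    # Class-wise accuracy
    for t in labels:
        total = sum(1 for y in y_true if y == t)
        print(f"{t} class accuracy: {confusion[(t, t)] / total:.4f}")
```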
|
|
|
### Results
|
|
|
| Metric | Value | |
|
|--------|-------| |
|
| Overall Accuracy | 96.74% (7861/8126) | |
|
| Real Class Accuracy | 97.25% (3959/4071) | |
|
| Fake Class Accuracy | 96.23% (3902/4055) | |
|
|
|
#### Confusion Matrix |
|
|
|
| True Label | Predicted Label | Count | |
|
|------------|----------------|-------| |
|
| fake | fake | 3902 | |
|
| fake | real | 153 | |
|
| real | real | 3959 | |
|
| real | fake | 112 | |
|
|
|
### Summary
|
|
|
The model achieves strong overall accuracy (~97%) with a good balance between real and fake class performance. Overfitting appears minimal, given early convergence and robust validation metrics.
|
|
|
## Societal Impact Assessment |
|
|
|
This model has not yet undergone formal testing for: |
|
|
|
- Child safety or CSEM risks |
|
- NCII (non-consensual intimate imagery) |
|
- Violence or abusive content detection |
|
|
|
Caution is advised when using this model in applications that could impact people’s safety, dignity, or reputation. |
|
|
|
## Environmental Impact |
|
|
|
### Hardware Type

NVIDIA T4 GPU

### Hours Used

~1 hour

### Cloud Provider

Google Cloud Platform

### Cloud Region

europe-west1

### Carbon Emitted

0.02 kg CO₂eq (estimated)
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture
|
|
|
ConvNeXt-Tiny is a convolutional network with a transformer-inspired design, adapted here to classify mel-spectrogram images as real or fake audio via the standard image-classification head.
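
A quick way to inspect the head and label mapping from the checkpoint (the exact id-to-label mapping is whatever the model config defines):

```python
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained(
    "kubinooo/convnext-tiny-224-audio-deepfake-classification")
print(model.config.id2label)                              # binary real/fake label mapping
print(f"{model.num_parameters() / 1e6:.1f}M parameters")  # ConvNeXt-Tiny is ~28M
```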
|
|
|
### Compute Infrastructure
|
|
|
Standard GPU training with PyTorch and Hugging Face Transformers. |
|
|
|
### Hardware Requirements
|
|
|
- Minimum: NVIDIA GPU with 8 GB VRAM |
|
- CPU inference possible but slower (see the device-selection sketch below)
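
A minimal device-selection sketch for inference; it mirrors the Direct Use snippet and contains nothing specific to this model:

```python
import torch
from transformers import AutoModelForImageClassification

# Prefer a GPU when available, otherwise fall back to (slower) CPU inference
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForImageClassification.from_pretrained(
    "kubinooo/convnext-tiny-224-audio-deepfake-classification").to(device)
# Remember to move inputs too: pixel_values = pixel_values.to(device)
```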
|
|
|
### Software
|
|
|
- PyTorch |
|
- Hugging Face Transformers |
|
- Librosa (for spectrogram generation) |
|
- Weights & Biases (experiment tracking)