---
license: apache-2.0
datasets:
- kubinooo/release-in-the-wild-2s-mel-spectrograms
language:
- en
base_model:
- facebook/convnext-tiny-224
pipeline_tag: image-classification
tags:
- audio-classification
- deepfake-classification
results:
  overall_accuracy: 0.9852
  overall_accuracy_fraction: 8006/8126
  overall_accuracy_percentage: 98.5%
  confusion_matrix:
    - true_label: fake
      predicted_label: fake
      count: 3958
    - true_label: fake
      predicted_label: real
      count: 97
    - true_label: real
      predicted_label: real
      count: 4048
    - true_label: real
      predicted_label: fake
      count: 23
  class_accuracy:
    real_images:
      accuracy: 0.9944
      fraction: 4048/4071
    fake_images:
      accuracy: 0.9761
      fraction: 3958/4055
---
# ConvNeXt-Tiny Audio Deepfake Classifier
A deep learning model fine-tuned for detecting audio deepfakes using mel-spectrograms of 2-second audio clips from the *Release in the Wild* dataset.
**Model ID**: `audio-deepfake-classifier-convnext-tiny`
**Summary**: This model classifies audio clips as *real* or *fake* based on visual patterns in mel-spectrograms, using the `facebook/convnext-tiny-224` architecture.
## Table of Contents
- [Model Details](#model-details)
- [Uses](#uses)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Training Details](#training-details)
- [Evaluation](#evaluation)
- [Societal Impact Assessment](#societal-impact-assessment)
- [Environmental Impact](#environmental-impact)
- [Technical Specifications](#technical-specifications)
## Model Details
### Model Description
This model is a fine-tuned ConvNeXt-Tiny image classifier applied to **mel-spectrograms** derived from 2-second audio clips to detect *audio deepfakes*. By transforming audio into visual representations (spectrograms), it leverages image classification techniques for audio-based forensics.
- **Architecture**: ConvNeXt-Tiny
- **Base Model**: [facebook/convnext-tiny-224](https://huggingface.co/facebook/convnext-tiny-224)
- **Task**: Binary classification (real vs fake)
- **Input Modality**: Mel-spectrogram (224x224 image)
- **Training Framework**: PyTorch + Transformers
#### Developed by
[Jakub Polnis](https://huggingface.co/kubinooo)
#### Shared by
[Release in the Wild Dataset (2s mel-spectrograms)](https://huggingface.co/datasets/kubinooo/release-in-the-wild-2s-mel-spectrograms)
#### Model Type
- Image Classification
- Audio Deepfake Detection via Spectrograms
#### Language(s)
- English
#### License
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
#### Finetuned From
[facebook/convnext-tiny-224](https://huggingface.co/facebook/convnext-tiny-224)
## Uses
### Direct Use
This model can be used to classify 2-second audio clips as **real** or **fake** by first converting them into mel-spectrogram images and passing them through the classifier.
```python
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification
import torch

processor = AutoImageProcessor.from_pretrained("kubinooo/convnext-tiny-224-audio-deepfake-classification")
model = AutoModelForImageClassification.from_pretrained("kubinooo/convnext-tiny-224-audio-deepfake-classification")

# Load a mel-spectrogram image of a 2-second clip (see Training Details for preprocessing)
image = Image.open("/path/to/your/image").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = model(pixel_values)
    logits = outputs.logits

predicted_class_idx = logits.argmax(-1).item()
print(model.config.id2label[predicted_class_idx])
```
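When a confidence score is more useful than a hard label, the logits can be converted to per-class probabilities. A minimal sketch, reusing `logits` and `model` from the snippet above:

```python
import torch.nn.functional as F

# Softmax over the two classes yields a probability per label
probs = F.softmax(logits, dim=-1)[0]
for idx, p in enumerate(probs):
    print(f"{model.config.id2label[idx]}: {p.item():.4f}")
```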
### Downstream Use
- Integration into voice content moderation tools
- Forensic applications for detecting synthetic speech
- Fine-tuning for speaker-specific deepfake detection
### Out-of-Scope Use
- Not suitable for raw audio input without spectrogram conversion
- May fail on clips that are not close to 2 seconds long; longer recordings should be split into 2-second windows first (see the sketch after this list)
- Not intended for speech recognition or speaker identification
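For longer recordings, one workaround consistent with the 2-second input constraint is to window the audio, classify each window, and aggregate the per-window labels. The sketch below is a hypothetical approach, not a documented feature of this model: the sample rate is an assumption, `audio_to_melspec_image` is a hypothetical helper (one possible implementation is sketched under [Training Details](#training-details)), and `processor` and `model` are loaded as in the Direct Use example.

```python
import librosa
import numpy as np
import torch

SAMPLE_RATE = 16000           # assumption: the training sample rate is not published
WINDOW = 2 * SAMPLE_RATE      # 2-second windows to match the model's expected input

def classify_long_clip(path):
    """Split a longer clip into 2-second windows and majority-vote the labels."""
    audio, _ = librosa.load(path, sr=SAMPLE_RATE)
    votes = []
    for start in range(0, len(audio) - WINDOW + 1, WINDOW):
        # audio_to_melspec_image is a hypothetical preprocessing helper
        image = audio_to_melspec_image(audio[start:start + WINDOW])
        pixel_values = processor(image, return_tensors="pt").pixel_values
        with torch.no_grad():
            idx = model(pixel_values).logits.argmax(-1).item()
        votes.append(model.config.id2label[idx])
    labels, counts = np.unique(votes, return_counts=True)
    return labels[counts.argmax()]   # majority vote; ties resolve arbitrarily
```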
## Bias, Risks, and Limitations
- **Demographic Bias**: Dataset may be skewed toward specific accents, recording qualities, or environments.
- **Risk of Misclassification**: In critical use cases (e.g., journalism, legal settings), false positives could falsely flag real voices, and false negatives may miss deepfakes.
- **Generalization Risk**: Model performance may degrade on domains very different from the training data.
### Recommendations
- Evaluate on diverse, in-domain datasets before production use
- Combine with metadata or alternative models for high-stakes decisions
- Include human-in-the-loop review for sensitive classifications
## Training Details
### Training Data
The model was trained on 2-second mel-spectrogram images from [Release in the Wild](https://huggingface.co/datasets/kubinooo/release-in-the-wild-2s-mel-spectrograms), a labeled dataset of real and fake voice clips.
### Preprocessing
- Audio resampled and clipped to 2 seconds
- Converted into mel-spectrograms
- Images normalized and resized to 224x224 (one plausible implementation is sketched below)
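The exact spectrogram settings are not published with this card, so the following is only one plausible reconstruction of the pipeline using librosa; the sample rate, FFT size, hop length, and mel-band count are all assumptions:

```python
import librosa
import numpy as np
from PIL import Image

def audio_to_melspec_image(samples, sr=16000, n_mels=128, n_fft=2048, hop_length=512):
    """Render a 2-second waveform as an RGB mel-spectrogram image.

    All parameter values here are assumptions; the card does not publish
    the exact settings used to build the training set.
    """
    mel = librosa.feature.melspectrogram(
        y=samples, sr=sr, n_mels=n_mels, n_fft=n_fft, hop_length=hop_length
    )
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Scale decibels to 0-255; the image processor later resizes to 224x224
    scaled = 255 * (mel_db - mel_db.min()) / (np.ptp(mel_db) + 1e-8)
    return Image.fromarray(scaled.astype(np.uint8)).convert("RGB")
```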
### Speeds, Sizes, Times
| Metric | Value |
|--------|-------|
| Best Epoch | 1 |
| Final Validation Accuracy | 99.605% |
| Training Accuracy | 99.829% |
| Validation Loss | 0.01301 |
| Final Step Accuracy | 99.837% |
| Final Step Loss | 0.00009 |
## Evaluation
### Testing Data
Tested on a held-out portion of the *Release in the Wild* dataset.
### Factors
- Real vs Fake audio
- Consistency across balanced classes
### Metrics
- Accuracy
- Class-wise accuracy
- Confusion matrix (a computation sketch follows this list)
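For reference, all three metrics can be computed from paired ground-truth labels and predictions. A minimal sketch using scikit-learn, which is an assumed dependency not listed under [Technical Specifications](#technical-specifications):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels for illustration; in practice these come from the held-out split
y_true = ["fake", "fake", "real", "real"]
y_pred = ["fake", "real", "real", "real"]

print("overall accuracy:", accuracy_score(y_true, y_pred))
cm = confusion_matrix(y_true, y_pred, labels=["fake", "real"])
print(cm)                              # rows = true label, columns = predicted label
print(cm.diagonal() / cm.sum(axis=1))  # class-wise accuracy in order (fake, real)
```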
### Results
| Metric | Value |
|--------|-------|
| Overall Accuracy | 96.74% (7861/8126) |
| Real Class Accuracy | 97.25% (3959/4071) |
| Fake Class Accuracy | 96.23% (3902/4055) |
#### Confusion Matrix
| True Label | Predicted Label | Count |
|------------|----------------|-------|
| fake | fake | 3902 |
| fake | real | 153 |
| real | real | 3959 |
| real | fake | 112 |
### Results Summary
The model achieves strong overall accuracy (~97%) with balanced performance across the real and fake classes. Early convergence and stable validation metrics suggest minimal overfitting.
## Societal Impact Assessment
This model has not yet undergone formal testing for:
- Child safety or CSEM risks
- NCII (non-consensual intimate imagery)
- Violence or abusive content detection
Caution is advised when using this model in applications that could impact people’s safety, dignity, or reputation.
## Environmental Impact
- **Hardware Type**: NVIDIA T4
- **Hours Used**: 1
- **Cloud Provider**: Google Cloud Platform
- **Cloud Region**: europe-west1
- **Carbon Emitted**: 0.02 (presumably kg CO₂eq)
## Technical Specifications
### Model Architecture
ConvNeXt-Tiny, a convolutional network with Transformer-inspired design choices, adapted to classify mel-spectrogram images as real or fake audio. Uses a standard image classification head.
### Compute Infrastructure
Standard GPU training with PyTorch and Hugging Face Transformers.
### Hardware Requirements
- Minimum: NVIDIA GPU with 8 GB VRAM
- CPU inference possible but slower
### Software
- PyTorch
- Hugging Face Transformers
- Librosa (for spectrogram generation)
- Weights & Biases (wandb)