---
license: apache-2.0
datasets:
- kubinooo/release-in-the-wild-2s-mel-spectrograms
language:
- en
base_model:
- facebook/convnext-tiny-224
pipeline_tag: image-classification
tags:
- audio-classification
- deepfake-classification
results:
  overall_accuracy: 0.9852
  overall_accuracy_fraction: 8006/8126
  overall_accuracy_percentage: 98.5%
  confusion_matrix:
  - true_label: fake
    predicted_label: fake
    count: 3958
  - true_label: fake
    predicted_label: real
    count: 97
  - true_label: real
    predicted_label: real
    count: 4048
  - true_label: real
    predicted_label: fake
    count: 23
  class_accuracy:
    real_images:
      accuracy: 0.9944
      fraction: 4048/4071
    fake_images:
      accuracy: 0.9761
      fraction: 3958/4055
---

# ConvNeXt-Tiny Audio Deepfake Classifier

A deep learning model fine-tuned for detecting audio deepfakes using mel-spectrograms of 2-second audio clips from the *Release in the Wild* dataset.

**Model ID**: `audio-deepfake-classifier-convnext-tiny`  
**Model Summary**: This model classifies audio clips as *real* or *fake* based on visual patterns in mel-spectrograms, using the `facebook/convnext-tiny-224` architecture.

## Table of Contents

- [Model Details](#model-details)
- [Uses](#uses)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Training Details](#training-details)
- [Evaluation](#evaluation)
- [Societal Impact Assessment](#societal-impact-assessment)
- [Environmental Impact](#environmental-impact)
- [Technical Specifications](#technical-specifications)

## Model Details

### Model Description

This model is a fine-tuned ConvNeXt-Tiny image classifier applied to **mel-spectrograms** derived from 2-second audio clips to detect *audio deepfakes*. By transforming audio into visual representations (spectrograms), it leverages image classification techniques for audio-based forensics.

- **Architecture**: ConvNeXt-Tiny
- **Base Model**: [facebook/convnext-tiny-224](https://huggingface.co/facebook/convnext-tiny-224)
- **Task**: Binary classification (real vs fake)
- **Input Modality**: Mel-spectrogram (224x224 image)
- **Training Framework**: PyTorch + Transformers

#### Developed by
[Jakub Polnis](https://huggingface.co/kubinooo)


#### Shared by
[Release in the Wild Dataset (2s mel-spectrograms)](https://huggingface.co/datasets/kubinooo/release-in-the-wild-2s-mel-spectrograms)

#### Model Type
- Image Classification
- Audio Deepfake Detection via Spectrograms

#### Language(s)
- English

#### License
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

#### Finetuned From
[facebook/convnext-tiny-224](https://huggingface.co/facebook/convnext-tiny-224)

## Uses

### Direct Use

This model can be used to classify 2-second audio clips as **real** or **fake** by first converting them into mel-spectrogram images and passing them through the classifier.

```python
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification
import torch

model_id = "kubinooo/convnext-tiny-224-audio-deepfake-classification"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForImageClassification.from_pretrained(model_id)

# Load a mel-spectrogram image of a 2-second clip.
# The processor resizes and normalizes it for the model.
image = Image.open("/path/to/your/image").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(pixel_values).logits

# Map the highest-scoring class index to its label.
predicted_class_idx = logits.argmax(-1).item()
print(model.config.id2label[predicted_class_idx])
```

### Downstream Use

- Integration into voice content moderation tools
- Forensic applications for detecting synthetic speech
- Fine-tuning for speaker-specific deepfake detection

### Out-of-Scope Use

- Not suitable for raw audio input without spectrogram conversion
- May fail on clips not close to 2 seconds in length
- Not intended for speech recognition or speaker identification
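
For recordings longer than 2 seconds, one possible workaround is to split the waveform into 2-second windows and classify each window separately. The sketch below assumes a mono waveform held in a NumPy array. The helper name `split_into_clips`, the 16 kHz sample rate in the example, and the zero-padding of the final window are illustrative assumptions, not part of the published preprocessing pipeline.

```python
import numpy as np


def split_into_clips(waveform, sample_rate, clip_seconds=2.0):
    """Split a mono waveform into fixed-length clips (hypothetical helper).

    The last clip is zero-padded to full length; this padding choice is an
    assumption and may differ from how the training data was prepared.
    """
    clip_len = int(round(sample_rate * clip_seconds))
    if clip_len <= 0:
        raise ValueError("clip length must be positive")
    waveform = np.asarray(waveform, dtype=np.float32)
    n_clips = max(1, int(np.ceil(len(waveform) / clip_len)))
    padded = np.zeros(n_clips * clip_len, dtype=np.float32)
    padded[: len(waveform)] = waveform
    return padded.reshape(n_clips, clip_len)


# Example: 5 seconds of silence at an assumed 16 kHz sample rate.
clips = split_into_clips(np.zeros(5 * 16000), sample_rate=16000)
print(clips.shape)  # (3, 32000)
```

Each row can then be converted to a mel-spectrogram and classified on its own. How to combine per-clip predictions (for example, a majority vote) is a design choice that should be validated on in-domain data.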

## Bias, Risks, and Limitations

### Risks and Limitations

- **Demographic Bias**: Dataset may be skewed toward specific accents, recording qualities, or environments.
- **Risk of Misclassification**: In critical use cases (e.g., journalism, legal settings), a false positive flags a real voice as fake, and a false negative lets a deepfake pass as real.
- **Generalization Risk**: Model performance may degrade on domains very different from the training data.

### Recommendations

- Evaluate on diverse, in-domain datasets before production use
- Combine with metadata or alternative models for high-stakes decisions
- Include human-in-the-loop review for sensitive classifications
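
One way to approach the human-in-the-loop recommendation is to convert the classifier's logits into probabilities and route low-confidence predictions to a reviewer. This is a minimal sketch: the `triage` function, the 0.9 threshold, and the label order are assumptions for illustration, and a real threshold should be tuned on in-domain validation data.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def triage(logits, labels=("fake", "real"), threshold=0.9):
    """Return (label, confidence, needs_review) for one prediction.

    `labels` and `threshold` are illustrative assumptions; check
    `model.config.id2label` for the actual label order.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    return labels[best], confidence, confidence < threshold


print(triage([4.0, -1.0]))  # confident prediction, no review needed
print(triage([0.3, 0.1]))   # close to 50/50, flagged for review
```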

## Training Details

### Training Data

The model was trained on mel-spectrogram images of 2-second audio clips from [Release in the Wild](https://huggingface.co/datasets/kubinooo/release-in-the-wild-2s-mel-spectrograms), a labeled dataset of real and fake voice clips.

### Preprocessing

- Audio resampled and clipped to 2 seconds
- Converted into mel-spectrograms
- Images normalized and resized to 224x224
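
The first two steps above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the pipeline used to build the dataset: the 16 kHz sample rate, FFT size, hop length, and 64 mel bands are all hypothetical, since the card does not list the actual parameters. Resizing and normalization to 224x224 are left to the image processor.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(sample_rate, n_fft, n_mels):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def log_mel_spectrogram(waveform, sample_rate=16000, clip_seconds=2.0,
                        n_fft=1024, hop=256, n_mels=64):
    """Clip or zero-pad to a fixed length, then return a log-mel spectrogram in dB.

    All default parameter values are assumptions made for this sketch.
    """
    target = int(sample_rate * clip_seconds)
    y = np.zeros(target, dtype=np.float64)
    n = min(target, len(waveform))
    y[:n] = np.asarray(waveform[:n], dtype=np.float64)
    # Slice the signal into overlapping frames and apply a Hann window.
    n_frames = 1 + (target - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2             # (n_frames, bins)
    mel = mel_filterbank(sample_rate, n_fft, n_mels) @ power.T   # (n_mels, n_frames)
    return 10.0 * np.log10(np.maximum(mel, 1e-10))


# A 1-second 440 Hz tone is zero-padded to 2 seconds before analysis.
tone = np.sin(2 * np.pi * 440.0 * np.arange(16000) / 16000.0)
print(log_mel_spectrogram(tone).shape)  # (64, 122)
```

In practice, Librosa's `librosa.feature.melspectrogram` covers the same computation; the from-scratch version is shown only to make the individual steps visible.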

### Training Metrics

| Metric | Value |
|--------|-------|
| Best Epoch | 1 |
| Final Validation Accuracy | 99.605% |
| Training Accuracy | 99.829% |
| Validation Loss | 0.01301 |
| Final Step Accuracy | 99.837% |
| Final Step Loss | 0.00009 |

## Evaluation

### Testing Data

The model was tested on a held-out portion of the *Release in the Wild* dataset.

### Factors

- Real vs. fake audio
- Consistency across balanced classes

### Metrics

- Accuracy
- Class-wise accuracy
- Confusion matrix

### Results

| Metric | Value |
|--------|-------|
| Overall Accuracy | 96.74% (7861/8126) |
| Real Class Accuracy | 97.25% (3959/4071) |
| Fake Class Accuracy | 96.23% (3902/4055) |

#### Confusion Matrix

| True Label | Predicted Label | Count |
|------------|----------------|-------|
| fake | fake | 3902 |
| fake | real | 153 |
| real | real | 3959 |
| real | fake | 112 |
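
The accuracy figures in the results table follow directly from these counts. The short script below recomputes them, and also derives per-class precision, which the card does not report but which follows from the same matrix. The per-class "accuracy" in the table corresponds to recall here.

```python
# Counts copied from the confusion matrix table above, keyed by (true, predicted).
confusion = {
    ("fake", "fake"): 3902,
    ("fake", "real"): 153,
    ("real", "real"): 3959,
    ("real", "fake"): 112,
}

total = sum(confusion.values())
correct = sum(n for (true, pred), n in confusion.items() if true == pred)
print(f"overall accuracy: {correct}/{total} = {correct / total:.4f}")

for label in ("real", "fake"):
    # Support: all samples whose true label is `label`.
    support = sum(n for (true, _), n in confusion.items() if true == label)
    # Predicted: all samples the model assigned to `label`.
    predicted = sum(n for (_, pred), n in confusion.items() if pred == label)
    hits = confusion[(label, label)]
    print(f"{label}: recall={hits / support:.4f} precision={hits / predicted:.4f}")
```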

### Summary

The model reaches about 97% overall accuracy on the test set, with similar accuracy on the real (97.25%) and fake (96.23%) classes. Training converged early (best epoch: 1), and training and validation accuracy stayed close (99.829% vs. 99.605%), which suggests little overfitting.

## Societal Impact Assessment

This model has not yet undergone formal testing for:

- Child safety or CSEM risks
- NCII (non-consensual intimate imagery)
- Violence or abusive content detection

Caution is advised when using this model in applications that could impact people’s safety, dignity, or reputation.

## Environmental Impact

### Hardware Type

NVIDIA T4 GPU

### Hours Used

1 hour

### Cloud Provider

Google Cloud Platform

### Cloud Region

europe-west1

### Carbon Emitted

0.02

## Technical Specifications

### Model Specifications

ConvNeXt-Tiny, a convolutional network, adapted to classify mel-spectrogram images as real or fake audio. It uses a standard image classification head.

### Compute Infrastructure

Standard GPU training with PyTorch and Hugging Face Transformers.

### Hardware Requirements

- Minimum: NVIDIA GPU with 8 GB VRAM
- CPU inference possible but slower

### Software

- PyTorch
- Hugging Face Transformers
- Librosa (for spectrogram generation)
- Weights & Biases (`wandb`)