---
license: apache-2.0
datasets:
- kubinooo/release-in-the-wild-2s-mel-spectrograms
language:
- en
base_model:
- facebook/convnext-tiny-224
pipeline_tag: image-classification
tags:
- audio-classification
- deepfake-classification
results:
  overall_accuracy: 0.9852
  overall_accuracy_fraction: 8006/8126
  overall_accuracy_percentage: 98.5%
  confusion_matrix:
    - true_label: fake
      predicted_label: fake
      count: 3958
    - true_label: fake
      predicted_label: real
      count: 97
    - true_label: real
      predicted_label: real
      count: 4048
    - true_label: real
      predicted_label: fake
      count: 23
  class_accuracy:
    real_images:
      accuracy: 0.9944
      fraction: 4048/4071
    fake_images:
      accuracy: 0.9761
      fraction: 3958/4055
---
# ConvNeXt-Tiny Audio Deepfake Classifier
A deep learning model fine-tuned for detecting audio deepfakes using mel-spectrograms of 2-second audio clips from the *Release in the Wild* dataset.
**Model ID**: `audio-deepfake-classifier-convnext-tiny`
**Summary**: This model classifies audio clips as *real* or *fake* based on visual patterns in mel-spectrograms, using the `facebook/convnext-tiny-224` architecture.
## Table of Contents
- [Model Details](#model-details)
- [Uses](#uses)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Training Details](#training-details)
- [Evaluation](#evaluation)
- [Societal Impact Assessment](#societal-impact-assessment)
- [Environmental Impact](#environmental-impact)
- [Technical Specifications](#technical-specifications)
## Model Details
### Model Description
This model is a fine-tuned ConvNeXt-Tiny image classifier applied to **mel-spectrograms** derived from 2-second audio clips to detect *audio deepfakes*. By transforming audio into visual representations (spectrograms), it leverages image classification techniques for audio-based forensics.
- **Architecture**: ConvNeXt-Tiny
- **Base Model**: [facebook/convnext-tiny-224](https://huggingface.co/facebook/convnext-tiny-224)
- **Task**: Binary classification (real vs fake)
- **Input Modality**: Mel-spectrogram (224x224 image)
- **Training Framework**: PyTorch + Transformers
#### Developed by
[Jakub Polnis](https://huggingface.co/kubinooo)
#### Dataset
[Release in the Wild Dataset (2s mel-spectrograms)](https://huggingface.co/datasets/kubinooo/release-in-the-wild-2s-mel-spectrograms)
#### Model Type
- Image Classification
- Audio Deepfake Detection via Spectrograms
#### Language(s)
- English
#### License
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
#### Finetuned From
[facebook/convnext-tiny-224](https://huggingface.co/facebook/convnext-tiny-224)
## Uses
### Direct Use
This model can be used to classify 2-second audio clips as **real** or **fake** by first converting them into mel-spectrogram images and passing them through the classifier.
```python
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification
import torch

# Load the image processor and the fine-tuned classifier
processor = AutoImageProcessor.from_pretrained("kubinooo/convnext-tiny-224-audio-deepfake-classification")
model = AutoModelForImageClassification.from_pretrained("kubinooo/convnext-tiny-224-audio-deepfake-classification")

# Load a mel-spectrogram image and preprocess it into the expected tensor format
image = Image.open("/path/to/your/image").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Run inference and map the predicted index to its label ("real" or "fake")
with torch.no_grad():
    outputs = model(pixel_values)
    logits = outputs.logits
predicted_class_idx = logits.argmax(-1).item()
print(model.config.id2label[predicted_class_idx])
```
### Downstream Use
- Integration into voice content moderation tools (see the thresholding sketch below)
- Forensic applications for detecting synthetic speech
- Fine-tuning for speaker-specific deepfake detection
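For moderation-style integrations, thresholding the softmax probabilities is often more practical than taking a hard argmax. The helper below is a minimal, self-contained sketch; the function name and the `fake_threshold` default are illustrative choices, not part of the released model.
```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

MODEL_ID = "kubinooo/convnext-tiny-224-audio-deepfake-classification"
processor = AutoImageProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageClassification.from_pretrained(MODEL_ID)

def classify_spectrogram(image_path: str, fake_threshold: float = 0.9) -> dict:
    """Return per-class probabilities for one mel-spectrogram image and flag it
    when the 'fake' probability exceeds a chosen (illustrative) threshold."""
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(image, return_tensors="pt").pixel_values
    with torch.no_grad():
        logits = model(pixel_values).logits
    probs = logits.softmax(-1).squeeze(0)
    # Assumes the checkpoint's id2label uses the "fake"/"real" names reported in this card
    scores = {model.config.id2label[i]: probs[i].item() for i in range(probs.numel())}
    return {"scores": scores, "flagged": scores.get("fake", 0.0) >= fake_threshold}
```
In high-stakes pipelines this score should feed a human review step rather than an automatic decision, as recommended under *Bias, Risks, and Limitations*.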
### Out-of-Scope Use
- Not suitable for raw audio input without spectrogram conversion
- May fail on clips not close to 2 seconds in length
- Not intended for speech recognition or speaker identification
## Bias, Risks, and Limitations
### Risks and Limitations
- **Demographic Bias**: Dataset may be skewed toward specific accents, recording qualities, or environments.
- **Risk of Misclassification**: In critical use cases (e.g., journalism, legal settings), false positives could falsely flag real voices, and false negatives may miss deepfakes.
- **Generalization Risk**: Model performance may degrade on domains very different from the training data.
### Recommendations
- Evaluate on diverse, in-domain datasets before production use
- Combine with metadata or alternative models for high-stakes decisions
- Include human-in-the-loop review for sensitive classifications
## Training Details
### Training Data
The model was trained on 2-second mel-spectrogram images generated from [Release in the Wild](https://huggingface.co/datasets/kubinooo/release-in-the-wild-2s-mel-spectrograms), a labeled dataset of real and fake voice clips.
### Preprocessing
- Audio resampled and clipped to 2 seconds
- Converted into mel-spectrograms
- Images normalized and resized to 224x224 (an illustrative sketch of this pipeline follows)
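The exact spectrogram settings (sample rate, FFT size, number of mel bands, colormap) are not documented in this card, so the following is only a hedged sketch of the audio-to-image step using librosa; align the parameters with the dataset's actual generation script before relying on the output.
```python
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

def audio_to_mel_image(audio_path: str, out_path: str,
                       sr: int = 16000, clip_seconds: float = 2.0,
                       n_mels: int = 128) -> None:
    """Pad/trim an audio file to 2 seconds, compute a mel-spectrogram, and save
    it as an image. All parameter values here are assumptions, not the card's."""
    y, _ = librosa.load(audio_path, sr=sr)
    target_len = int(sr * clip_seconds)
    y = np.pad(y, (0, max(0, target_len - len(y))))[:target_len]  # pad or truncate to 2 s
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    fig = plt.figure(figsize=(2.24, 2.24), dpi=100)  # 224x224 pixel canvas
    ax = fig.add_axes([0, 0, 1, 1])                  # no margins around the plot
    ax.axis("off")
    librosa.display.specshow(mel_db, sr=sr, ax=ax)
    fig.savefig(out_path, dpi=100)
    plt.close(fig)
```
The `AutoImageProcessor` used at inference resizes and normalizes the image again, so the exact canvas size matters less than keeping the spectrogram parameters consistent with how the training images were generated.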
### Speeds, Sizes, Times
| Metric | Value |
|--------|-------|
| Best Epoch | 1 |
| Final Validation Accuracy | 99.605% |
| Training Accuracy | 99.829% |
| Validation Loss | 0.01301 |
| Final Step Accuracy | 99.837% |
| Final Step Loss | 0.00009 |
## Evaluation
### Testing Data
Tested on a held-out portion of the *Release in the Wild* dataset.
### Factors
- Real vs Fake audio
- Consistency across balanced classes
### Metrics
- Accuracy
- Class-wise accuracy
- Confusion matrix
### Results
| Metric | Value |
|--------|-------|
| Overall Accuracy | 96.74% (7861/8126) |
| Real Class Accuracy | 97.25% (3959/4071) |
| Fake Class Accuracy | 96.23% (3902/4055) |
#### Confusion Matrix
| True Label | Predicted Label | Count |
|------------|----------------|-------|
| fake | fake | 3902 |
| fake | real | 153 |
| real | real | 3959 |
| real | fake | 112 |
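The per-class and overall accuracies in the table above follow directly from these confusion-matrix counts; a minimal arithmetic check:
```python
# Confusion-matrix counts from the table above
fake_fake, fake_real = 3902, 153   # true fake -> predicted fake / real
real_real, real_fake = 3959, 112   # true real -> predicted real / fake

real_acc = real_real / (real_real + real_fake)   # 3959 / 4071 ≈ 0.9725
fake_acc = fake_fake / (fake_fake + fake_real)   # 3902 / 4055 ≈ 0.9623
overall = (real_real + fake_fake) / (real_real + real_fake + fake_fake + fake_real)  # 7861 / 8126 ≈ 0.9674
print(f"real: {real_acc:.4f}  fake: {fake_acc:.4f}  overall: {overall:.4f}")
```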
### Results Summary
The model achieves strong overall accuracy (~97%) with a good balance between real-class and fake-class performance. Early convergence and stable validation metrics suggest minimal overfitting.
## Societal Impact Assessment
This model has not yet undergone formal testing for:
- Child safety or CSEM risks
- NCII (non-consensual intimate imagery)
- Violence or abusive content detection
Caution is advised when using this model in applications that could impact people’s safety, dignity, or reputation.
## Environmental Impact
### Hardware Type
NVIDIA T4 GPU
### Hours Used
1 hour
### Cloud Provider
Google Cloud Platform
### Cloud Region
europe-west1
### CO₂ Emitted
Approximately 0.02 kg CO₂eq
## Technical Specifications
### Model Architecture
A ConvNeXt-Tiny convolutional network adapted to classify mel-spectrogram images as real or fake audio, using a standard image classification head.
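The training script is not included in this card. The sketch below only shows how a two-label head can be attached to the base checkpoint with the Transformers API; the label-to-index mapping and everything else about the actual fine-tuning setup are assumptions.
```python
from transformers import AutoModelForImageClassification

# Attach a fresh 2-way classification head to the ImageNet-pretrained backbone.
# The id2label mapping mirrors the classes reported in this card; the index
# order is an assumption, not taken from the released checkpoint.
model = AutoModelForImageClassification.from_pretrained(
    "facebook/convnext-tiny-224",
    num_labels=2,
    id2label={0: "fake", 1: "real"},
    label2id={"fake": 0, "real": 1},
    ignore_mismatched_sizes=True,  # the base head has 1000 ImageNet classes
)
```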
### Compute Infrastructure
Standard GPU training with PyTorch and Hugging Face Transformers.
### Hardware Requirements
- Minimum: NVIDIA GPU with 8 GB VRAM
- CPU inference possible but slower
### Software
- PyTorch
- Hugging Face Transformers
- Librosa (for spectrogram generation)
- Wandb