---
license: apache-2.0
datasets:
- kubinooo/release-in-the-wild-2s-mel-spectrograms
language:
- en
base_model:
- facebook/convnext-tiny-224
pipeline_tag: image-classification
tags:
- audio-classification
- deepfake-classification
results:
  overall_accuracy: 0.9852
  overall_accuracy_fraction: 8006/8126
  overall_accuracy_percentage: 98.5%
  confusion_matrix:
  - true_label: fake
    predicted_label: fake
    count: 3958
  - true_label: fake
    predicted_label: real
    count: 97
  - true_label: real
    predicted_label: real
    count: 4048
  - true_label: real
    predicted_label: fake
    count: 23
  class_accuracy:
    real_images:
      accuracy: 0.9944
      fraction: 4048/4071
    fake_images:
      accuracy: 0.9761
      fraction: 3958/4055
---

# ConvNeXt-Tiny Audio Deepfake Classifier

A deep learning model fine-tuned for detecting audio deepfakes using mel-spectrograms of 2-second audio clips from the *Release in the Wild* dataset.

**Model ID**: `audio-deepfake-classifier-convnext-tiny`  
**Model Summary**: This model classifies audio clips as *real* or *fake* based on visual patterns in mel-spectrograms, using the `facebook/convnext-tiny-224` architecture.

## Table of Contents

- [Model Details](#model-details)
- [Uses](#uses)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Training Details](#training-details)
- [Evaluation](#evaluation)
- [Societal Impact Assessment](#societal-impact-assessment)
- [Environmental Impact](#environmental-impact)
- [Technical Specifications](#technical-specifications)

## Model Details

### Model Description

This model is a fine-tuned ConvNeXt-Tiny image classifier applied to **mel-spectrograms** derived from 2-second audio clips to detect *audio deepfakes*. By transforming audio into visual representations (spectrograms), it leverages image classification techniques for audio-based forensics.

- **Architecture**: ConvNeXt-Tiny
- **Base Model**: [facebook/convnext-tiny-224](https://huggingface.co/facebook/convnext-tiny-224)
- **Task**: Binary classification (real vs fake)
- **Input Modality**: Mel-spectrogram (224x224 image)
- **Training Framework**: PyTorch + Transformers

#### Developed by
[Jakub Polnis](https://huggingface.co/kubinooo)


#### Shared by
[Release in the Wild Dataset (2s mel-spectrograms)](https://huggingface.co/datasets/kubinooo/release-in-the-wild-2s-mel-spectrograms)

#### Model Type
- Image Classification
- Audio Deepfake Detection via Spectrograms

#### Language(s)
- English

#### License
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

#### Finetuned From
[facebook/convnext-tiny-224](https://huggingface.co/facebook/convnext-tiny-224)

## Uses

### Direct Use

This model can be used to classify 2-second audio clips as **real** or **fake** by first converting them into mel-spectrogram images and passing them through the classifier.

```python
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification
import torch

model_id = "kubinooo/convnext-tiny-224-audio-deepfake-classification"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForImageClassification.from_pretrained(model_id)

# Load a mel-spectrogram image of a 2-second clip (not raw audio).
image = Image.open("/path/to/your/image").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(pixel_values=pixel_values).logits

# Map the highest-scoring class index to its label.
predicted_class_idx = logits.argmax(-1).item()
print(model.config.id2label[predicted_class_idx])
```

### Downstream Use

- Integration into voice content moderation tools
- Forensic applications for detecting synthetic speech
- Fine-tuning for speaker-specific deepfake detection

### Out-of-Scope Use

- Not suitable for raw audio input without spectrogram conversion
- May fail on clips not close to 2 seconds in length
- Not intended for speech recognition or speaker identification
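
Because the model expects clips of about 2 seconds, longer recordings would need to be split before spectrogram conversion. The sketch below shows one possible way to compute window boundaries. The function name `segment_bounds`, the 16 kHz sample rate in the usage note, and the choice to drop a short trailing segment are illustrative assumptions, not part of the original training pipeline.

```python
def segment_bounds(num_samples, sample_rate, clip_seconds=2.0, drop_short_tail=True):
    """Return (start, end) sample indices for consecutive fixed-length windows.

    Hypothetical helper: the windowing policy is an assumption, not the
    procedure used to build the training dataset.
    """
    clip_len = int(round(clip_seconds * sample_rate))
    if clip_len <= 0:
        raise ValueError("clip length must be positive")

    bounds = []
    for start in range(0, num_samples, clip_len):
        end = min(start + clip_len, num_samples)
        # Optionally skip a final window shorter than the target length.
        if end - start < clip_len and drop_short_tail:
            break
        bounds.append((start, end))
    return bounds
```

For example, a 5-second recording at an assumed 16 kHz yields two full 2-second windows, and the remaining second is dropped unless `drop_short_tail=False`. Per-window predictions could then be aggregated, for instance by majority vote.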

## Bias, Risks, and Limitations

### Risks and Limitations

- **Demographic Bias**: Dataset may be skewed toward specific accents, recording qualities, or environments.
- **Risk of Misclassification**: In critical use cases (e.g., journalism, legal settings), false positives could falsely flag real voices, and false negatives may miss deepfakes.
- **Generalization Risk**: Model performance may degrade on domains very different from the training data.

### Recommendations

- Evaluate on diverse, in-domain datasets before production use
- Combine with metadata or alternative models for high-stakes decisions
- Include human-in-the-loop review for sensitive classifications
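
One way to apply the human-in-the-loop recommendation is to act only on confident predictions and route the rest to a reviewer. The following is a minimal sketch under stated assumptions: the `triage` helper, the `"needs_review"` label, and the 0.9 threshold are hypothetical, and the label order must be taken from `model.config.id2label` rather than from the example list used here.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def triage(logits, labels, min_confidence=0.9):
    """Return (decision, confidence).

    The decision is the predicted label when the top probability reaches
    min_confidence, otherwise "needs_review". The threshold is an
    illustrative assumption and should be tuned on in-domain data.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    decision = labels[best] if probs[best] >= min_confidence else "needs_review"
    return decision, probs[best]


# Example label order is assumed; read the real mapping from model.config.id2label.
print(triage([4.0, -4.0], ["fake", "real"]))
print(triage([0.2, 0.0], ["fake", "real"]))
```

A softmax probability is not a calibrated confidence, so a threshold like this is only a starting point for deciding which clips a person should check.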

## Training Details

### Training Data

The model was trained on mel-spectrogram images of 2-second clips from [Release in the Wild](https://huggingface.co/datasets/kubinooo/release-in-the-wild-2s-mel-spectrograms), a labeled dataset of real and fake voice clips.

### Preprocessing

- Audio resampled and clipped to 2 seconds
- Converted into mel-spectrograms
- Images normalized and resized to 224x224
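
The steps above can be sketched roughly as follows. This is not the exact pipeline used to build the dataset: the sample rate (16 kHz), number of mel bands (128), grayscale rendering, and all function names are assumptions chosen for illustration. If the dataset images were rendered differently (for example with a color map), inputs should be rendered the same way before relying on predictions.

```python
import numpy as np


def pad_or_trim(waveform, sample_rate, clip_seconds=2.0):
    """Trim or zero-pad a mono waveform to exactly clip_seconds."""
    target = int(round(clip_seconds * sample_rate))
    waveform = np.asarray(waveform, dtype=np.float32)[:target]
    return np.pad(waveform, (0, target - len(waveform)))


def to_uint8(spectrogram_db):
    """Min-max scale a dB spectrogram to the 0..255 range."""
    lo, hi = float(spectrogram_db.min()), float(spectrogram_db.max())
    scaled = (spectrogram_db - lo) / max(hi - lo, 1e-8)
    return (scaled * 255).astype(np.uint8)


def audio_to_spectrogram_image(path, sample_rate=16000, n_mels=128, size=224):
    """Load audio and return a size x size RGB mel-spectrogram image.

    All parameter defaults are assumptions, not the dataset's settings.
    """
    import librosa  # imported lazily so the helpers above work without it
    from PIL import Image

    waveform, _ = librosa.load(path, sr=sample_rate, mono=True)
    waveform = pad_or_trim(waveform, sample_rate)
    mel = librosa.feature.melspectrogram(y=waveform, sr=sample_rate, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Flip vertically so low frequencies sit at the bottom of the image.
    pixels = np.ascontiguousarray(to_uint8(mel_db)[::-1])
    return Image.fromarray(pixels).convert("RGB").resize((size, size))
```

The resulting image can be passed to the image processor in the same way as an image loaded from disk.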

### Speeds, Sizes, Times

| Metric | Value |
|--------|-------|
| Best Epoch | 1 |
| Final Validation Accuracy | 99.605% |
| Training Accuracy | 99.829% |
| Validation Loss | 0.01301 |
| Final Step Accuracy | 99.837% |
| Final Step Loss | 0.00009 |

## Evaluation

### Testing Data

The model was tested on a held-out portion of the *Release in the Wild* dataset.

### Factors

- Real vs Fake audio
- Consistency across balanced classes

### Metrics

- Accuracy
- Class-wise accuracy
- Confusion matrix

### Results

| Metric | Value |
|--------|-------|
| Overall Accuracy | 96.74% (7861/8126) |
| Real Class Accuracy | 97.25% (3959/4071) |
| Fake Class Accuracy | 96.23% (3902/4055) |

#### Confusion Matrix

| True Label | Predicted Label | Count |
|------------|----------------|-------|
| fake | fake | 3902 |
| fake | real | 153 |
| real | real | 3959 |
| real | fake | 112 |
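
The accuracy figures in the results table follow directly from these counts. As a small worked check, the snippet below recomputes them from the confusion matrix; the helper names are illustrative, and the per-class precision is an extra derived quantity not reported elsewhere in this card.

```python
# Counts copied from the confusion matrix table: (true label, predicted label) -> count
counts = {
    ("fake", "fake"): 3902,
    ("fake", "real"): 153,
    ("real", "real"): 3959,
    ("real", "fake"): 112,
}


def overall_accuracy(counts):
    """Fraction of all samples whose predicted label matches the true label."""
    correct = sum(n for (true, pred), n in counts.items() if true == pred)
    return correct / sum(counts.values())


def class_accuracy(counts, label):
    """Fraction of samples with this true label that were predicted correctly."""
    total = sum(n for (true, _), n in counts.items() if true == label)
    return counts[(label, label)] / total


def precision(counts, label):
    """Fraction of samples predicted as this label that truly have it."""
    predicted = sum(n for (_, pred), n in counts.items() if pred == label)
    return counts[(label, label)] / predicted


print(f"overall: {overall_accuracy(counts):.2%}")      # 96.74%
print(f"real:    {class_accuracy(counts, 'real'):.2%}")  # 97.25%
print(f"fake:    {class_accuracy(counts, 'fake'):.2%}")  # 96.23%
print(f"fake precision: {precision(counts, 'fake'):.2%}")
```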

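### Summary

The model reaches about 97% overall accuracy on the held-out test set, with similar accuracy on the real (97.25%) and fake (96.23%) classes. The best checkpoint came at epoch 1, and validation accuracy stayed close to training accuracy, which suggests limited overfitting on the training distribution.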

## Societal Impact Assessment

This model has not yet undergone formal testing for:

- Child safety or CSEM risks
- NCII (non-consensual intimate imagery)
- Violence or abusive content detection

Caution is advised when using this model in applications that could impact people’s safety, dignity, or reputation.

## Environmental Impact

- **Hardware Type**: T4
- **Hours Used**: 1h
- **Cloud Provider**: Google Cloud Platform
- **Cloud Region**: europe-west1
- **CO2 Emitted**: 0.02

## Technical Specifications

### Model Specs

ConvNeXt-Tiny, a convolutional network, adapted to classify mel-spectrogram images as real or fake audio. It uses a standard image classification head.

### Compute Infrastructure

Standard GPU training with PyTorch and Hugging Face Transformers.

### Hardware Requirements

- Minimum: NVIDIA GPU with 8 GB VRAM
- CPU inference possible but slower

### Software

- PyTorch
- Hugging Face Transformers
- Librosa (for spectrogram generation)
- Wandb