ConvNeXt-Tiny Audio Deepfake Classifier

A deep learning model fine-tuned for detecting audio deepfakes using mel-spectrograms of 2-second audio clips from the Release in the Wild dataset.

Model ID: audio-deepfake-classifier-convnext-tiny
Summary: This model classifies audio clips as real or fake based on visual patterns in mel-spectrograms, using the facebook/convnext-tiny-224 architecture.

Model Details

Model Description

This model is a fine-tuned ConvNeXt-Tiny image classifier applied to mel-spectrograms derived from 2-second audio clips to detect audio deepfakes. By transforming audio into visual representations (spectrograms), it leverages image classification techniques for audio-based forensics.

  • Architecture: ConvNeXt-Tiny
  • Base Model: facebook/convnext-tiny-224
  • Task: Binary classification (real vs fake)
  • Input Modality: Mel-spectrogram (224x224 image)
  • Training Framework: PyTorch + Transformers

Developed by

Jakub Polnis

Shared by

Release in the Wild Dataset (2s mel-spectrograms)

Model Type

  • Image Classification
  • Audio Deepfake Detection via Spectrograms

Language(s)

  • English

License

Apache 2.0

Finetuned From

facebook/convnext-tiny-224

Uses

Direct Use

This model can be used to classify 2-second audio clips as real or fake by first converting them into mel-spectrogram images and passing them through the classifier.

from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification
import torch

# Load the image processor and the fine-tuned classifier from the Hub.
model_id = "kubinooo/convnext-tiny-224-audio-deepfake-classification"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForImageClassification.from_pretrained(model_id)

# The input is a mel-spectrogram image of a 2-second clip, not raw audio.
image = Image.open("/path/to/your/image").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(pixel_values).logits

# Map the highest-scoring class index to its label.
predicted_class_idx = logits.argmax(-1).item()
print(model.config.id2label[predicted_class_idx])
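
The snippet above prints only the predicted label. When a confidence score is also useful, the logits can be passed through a softmax. The following is a minimal pure-Python sketch; the example logits and the `{0: "fake", 1: "real"}` mapping are hypothetical placeholders, so check `model.config.id2label` for the real label order.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits, id2label):
    """Return the most likely label and its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical logits and label mapping, for illustration only.
example_logits = [2.0, -1.0]
example_id2label = {0: "fake", 1: "real"}

label, confidence = label_with_confidence(example_logits, example_id2label)
print(f"{label} (confidence {confidence:.3f})")
```

With the tensors from the inference code, `logits[0].tolist()` gives the plain list this helper expects, and `torch.softmax(logits, dim=-1)` is the tensor equivalent.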

Downstream Use

  • Integration into voice content moderation tools
  • Forensic applications for detecting synthetic speech
  • Fine-tuning for speaker-specific deepfake detection

Out-of-Scope Use

  • Not suitable for raw audio input without spectrogram conversion
  • May fail on clips not close to 2 seconds in length
  • Not intended for speech recognition or speaker identification
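
Because the model expects clips of about 2 seconds, longer recordings are usually split into fixed-length windows before spectrogram conversion. The sketch below computes the window boundaries in samples. The 16 kHz sample rate and the 1-second hop are assumptions, not values documented for this model, and `window_offsets` is a hypothetical helper name.

```python
def window_offsets(num_samples, sample_rate=16_000,
                   window_seconds=2.0, hop_seconds=1.0):
    """Return (start, end) sample indices of full windows in a recording.

    sample_rate and hop_seconds are illustrative assumptions.
    Recordings shorter than one window yield an empty list.
    """
    window = int(round(sample_rate * window_seconds))
    hop = int(round(sample_rate * hop_seconds))
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    if num_samples < window:
        return []
    return [(start, start + window)
            for start in range(0, num_samples - window + 1, hop)]


# A 5-second recording at the assumed 16 kHz rate.
for start, end in window_offsets(5 * 16_000):
    print(start, end)
```

Each window can then be classified separately. How the per-window predictions are combined (for example, averaging probabilities or flagging the recording if any window looks fake) is a design choice that this card does not specify.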

Bias, Risks, and Limitations

  • Demographic Bias: Dataset may be skewed toward specific accents, recording qualities, or environments.
  • Risk of Misclassification: In critical use cases (e.g., journalism, legal settings), false positives could flag real voices as fake, and false negatives could miss deepfakes.
  • Generalization Risk: Model performance may degrade on domains very different from the training data.

Recommendations

  • Evaluate on diverse, in-domain datasets before production use
  • Combine with metadata or alternative models for high-stakes decisions
  • Include human-in-the-loop review for sensitive classifications
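
One way to wire in human review is to act automatically only on confident predictions and route the rest to a reviewer. The sketch below illustrates the idea. The 0.2 and 0.8 thresholds and the `triage` name are hypothetical, and any real thresholds should be calibrated on in-domain data.

```python
def triage(fake_probability, low=0.2, high=0.8):
    """Route a prediction based on the model's fake-class probability.

    low and high are illustrative thresholds, not tuned values.
    """
    if not 0.0 <= fake_probability <= 1.0:
        raise ValueError("fake_probability must be between 0 and 1")
    if fake_probability >= high:
        return "fake"
    if fake_probability <= low:
        return "real"
    return "needs_review"  # uncertain band goes to a human reviewer


for p in (0.05, 0.5, 0.93):
    print(p, triage(p))
```

The fake-class probability would come from applying a softmax to the classifier's logits.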

Training Details

Training Data

The model was trained on 2-second mel-spectrogram images generated from Release in the Wild, a labeled dataset of real and fake voice clips.

Preprocessing

  • Audio resampled and clipped to 2 seconds
  • Converted into mel-spectrograms
  • Images normalized and resized to 224x224
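
The steps above could be sketched with Librosa, NumPy, and Pillow roughly as follows. This card does not state the sample rate, number of mel bands, scaling, or color mapping used in training, so the 16 kHz rate, 128 mel bands, decibel scaling, min-max normalization, and vertical flip below are all assumptions, and both function names are hypothetical. The libraries are imported inside the function so the length helper works without them.

```python
CLIP_SECONDS = 2.0
SAMPLE_RATE = 16_000  # assumption: not documented in this card
N_MELS = 128          # assumption: not documented in this card


def clip_num_samples(sample_rate=SAMPLE_RATE, seconds=CLIP_SECONDS):
    """Number of samples in one clip at the given sample rate."""
    return int(round(sample_rate * seconds))


def audio_to_mel_image(path):
    """Load audio, fix it to 2 seconds, and render a 224x224 mel image."""
    import librosa
    import numpy as np
    from PIL import Image

    # Resample to the assumed rate and mix down to mono.
    y, sr = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    # Trim or zero-pad to exactly 2 seconds.
    y = librosa.util.fix_length(y, size=clip_num_samples(sr))

    # Mel power spectrogram, converted to decibels.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=N_MELS)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # Min-max scale to 0..255; guard against a constant (silent) clip.
    span = float(mel_db.max() - mel_db.min())
    scaled = (mel_db - mel_db.min()) / (span if span > 0 else 1.0)
    pixels = (scaled * 255).astype(np.uint8)

    # Flip so low frequencies sit at the bottom (an assumption).
    pixels = np.ascontiguousarray(pixels[::-1])
    return Image.fromarray(pixels).convert("RGB").resize((224, 224))


print(clip_num_samples())
```

If the training spectrograms were rendered differently (for example, with a color map), inference images should match that rendering, since the classifier depends on the visual patterns it saw during training.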

Speeds, Sizes, Times

| Metric | Value |
|---|---|
| Best Epoch | 1 |
| Final Validation Accuracy | 99.605% |
| Training Accuracy | 99.829% |
| Validation Loss | 0.01301 |
| Final Step Accuracy | 99.837% |
| Final Step Loss | 0.00009 |

Evaluation

Testing Data

Tested on a held-out portion of the Release in the Wild dataset.

Factors

  • Real vs Fake audio
  • Consistency across balanced classes

Metrics

  • Accuracy
  • Class-wise accuracy
  • Confusion matrix
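
As an illustration of how these three metrics relate, the sketch below tallies a confusion matrix from paired true and predicted labels and derives overall and class-wise accuracy from it. The function names and the five-item toy label lists are hypothetical and are not the evaluation code used for this model.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) label pairs."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    return Counter(zip(true_labels, predicted_labels))


def overall_accuracy(counts):
    """Fraction of all samples whose prediction matches the true label."""
    correct = sum(n for (true, pred), n in counts.items() if true == pred)
    return correct / sum(counts.values())


def classwise_accuracy(counts):
    """Per-class fraction of samples predicted correctly."""
    totals = Counter()
    for (true, _pred), n in counts.items():
        totals[true] += n
    return {label: counts[(label, label)] / total
            for label, total in totals.items()}


# Toy labels, for illustration only.
y_true = ["real", "real", "fake", "fake", "fake"]
y_pred = ["real", "fake", "fake", "fake", "real"]

counts = confusion_counts(y_true, y_pred)
print(dict(counts))
print(overall_accuracy(counts))
print(classwise_accuracy(counts))
```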

Results

| Metric | Value |
|---|---|
| Overall Accuracy | 96.74% (7861/8126) |
| Real Class Accuracy | 97.25% (3959/4071) |
| Fake Class Accuracy | 96.23% (3902/4055) |

Confusion Matrix

| True Label | Predicted Label | Count |
|---|---|---|
| fake | fake | 3902 |
| fake | real | 153 |
| real | real | 3959 |
| real | fake | 112 |
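
The confusion matrix counts are enough to derive further metrics that this card does not report directly. The sketch below computes accuracy, precision, recall, and F1 from the four counts, treating "fake" as the positive class. That choice is an assumption made for illustration, and the function name is hypothetical.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F1 for the positive class."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts from the confusion matrix, with "fake" as the positive class.
metrics = binary_metrics(
    tp=3902,  # fake predicted as fake
    fn=153,   # fake predicted as real
    fp=112,   # real predicted as fake
    tn=3959,  # real predicted as real
)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Recall for the fake class equals the reported fake class accuracy, while precision additionally accounts for real clips that were wrongly flagged as fake.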

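Summary

The model reaches 96.74% overall accuracy on the held-out test set, with similar accuracy on the real (97.25%) and fake (96.23%) classes. Training converged early (best epoch: 1), and validation accuracy (99.605%) stayed close to training accuracy (99.829%), which suggests limited overfitting.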

Societal Impact Assessment

This model has not yet undergone formal testing for:

  • Child safety or CSEM risks
  • NCII (non-consensual intimate imagery)
  • Violence or abusive content detection

Caution is advised when using this model in applications that could impact people’s safety, dignity, or reputation.

Environmental Impact

  • Hardware Type: T4
  • Hours Used: 1h
  • Cloud Provider: Google Cloud Platform
  • Cloud Region: europe-west1
  • Carbon Emitted: 0.02

Technical Specifications

Model Specifications

ConvNeXt-Tiny, a convolutional network, adapted to classify mel-spectrogram images as real or fake audio. It uses a standard image classification head.

Compute Infrastructure

Standard GPU training with PyTorch and Hugging Face Transformers.

Hardware Requirements

  • Minimum: NVIDIA GPU with 8 GB VRAM
  • CPU inference possible but slower

Software

  • PyTorch
  • Hugging Face Transformers
  • Librosa (for spectrogram generation)
  • Wandb