|
---
license: apache-2.0
datasets:
- kubinooo/release-in-the-wild-2s-mel-spectrograms
language:
- en
base_model:
- facebook/convnext-tiny-224
pipeline_tag: image-classification
tags:
- audio-classification
- deepfake-classification
results:
  overall_accuracy: 0.9852
  overall_accuracy_fraction: 8006/8126
  overall_accuracy_percentage: 98.5%
  confusion_matrix:
  - true_label: fake
    predicted_label: fake
    count: 3958
  - true_label: fake
    predicted_label: real
    count: 97
  - true_label: real
    predicted_label: real
    count: 4048
  - true_label: real
    predicted_label: fake
    count: 23
  class_accuracy:
    real:
      accuracy: 0.9944
      fraction: 4048/4071
    fake:
      accuracy: 0.9761
      fraction: 3958/4055
---
|
|
|
# ConvNeXt-Tiny Audio Deepfake Classifier |
|
|
|
A deep learning model fine-tuned for detecting audio deepfakes using mel-spectrograms of 2-second audio clips from the *Release in the Wild* dataset. |
|
|
|
**Model ID**: `audio-deepfake-classifier-convnext-tiny`

**Summary**: This model classifies audio clips as *real* or *fake* based on visual patterns in their mel-spectrograms, using the `facebook/convnext-tiny-224` architecture.
|
|
|
## Table of Contents |
|
|
|
- [Model Details](#model-details) |
|
- [Uses](#uses) |
|
- [Bias, Risks, and Limitations](#bias-risks-and-limitations) |
|
- [Training Details](#training-details) |
|
- [Evaluation](#evaluation) |
|
- [Societal Impact Assessment](#societal-impact-assessment) |
|
- [Environmental Impact](#environmental-impact) |
|
- [Technical Specifications](#technical-specifications) |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This model is a fine-tuned ConvNeXt-Tiny image classifier applied to **mel-spectrograms** derived from 2-second audio clips to detect *audio deepfakes*. By transforming audio into visual representations (spectrograms), it leverages image classification techniques for audio-based forensics. |
|
|
|
- **Architecture**: ConvNeXt-Tiny |
|
- **Base Model**: [facebook/convnext-tiny-224](https://huggingface.co/facebook/convnext-tiny-224) |
|
- **Task**: Binary classification (real vs fake) |
|
- **Input Modality**: Mel-spectrogram (224x224 image) |
|
- **Training Framework**: PyTorch + Transformers |
|
|
|
#### Developed by |
|
[Jakub Polnis](https://huggingface.co/kubinooo) |
|
|
|
|
|
#### Dataset
|
[Release in the Wild Dataset (2s mel-spectrograms)](https://huggingface.co/datasets/kubinooo/release-in-the-wild-2s-mel-spectrograms) |
|
|
|
#### Model Type |
|
- Image Classification |
|
- Audio Deepfake Detection via Spectrograms |
|
|
|
#### Language(s) |
|
- English |
|
|
|
#### License |
|
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
|
|
|
#### Finetuned From |
|
[facebook/convnext-tiny-224](https://huggingface.co/facebook/convnext-tiny-224) |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
|
|
This model can be used to classify 2-second audio clips as **real** or **fake** by first converting them into mel-spectrogram images and passing them through the classifier. |
|
|
|
```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Load the preprocessing pipeline and the fine-tuned classifier
processor = AutoImageProcessor.from_pretrained("kubinooo/convnext-tiny-224-audio-deepfake-classification")
model = AutoModelForImageClassification.from_pretrained("kubinooo/convnext-tiny-224-audio-deepfake-classification")
model.eval()

# Open a mel-spectrogram image of a 2-second clip and preprocess it
image = Image.open("/path/to/your/spectrogram.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Classify and map the winning logit back to its label ("real" or "fake")
with torch.no_grad():
    logits = model(pixel_values).logits
predicted_class_idx = logits.argmax(-1).item()
print(model.config.id2label[predicted_class_idx])
```
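
To report a confidence score alongside the label, the logits can be turned into probabilities with a softmax. A minimal sketch extending the snippet above (the label order comes from `model.config.id2label`, whatever mapping the checkpoint defines):

```python
import torch.nn.functional as F

# Per-class probabilities for the single input above
probs = F.softmax(logits, dim=-1)[0]
for idx, p in enumerate(probs.tolist()):
    print(f"{model.config.id2label[idx]}: {p:.4f}")
```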
|
|
|
### Downstream Use |
|
|
|
- Integration into voice content moderation tools |
|
- Forensic applications for detecting synthetic speech |
|
- Fine-tuning for speaker-specific deepfake detection (see the sketch below)
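
For the fine-tuning route, a minimal sketch in plain PyTorch, assuming a hypothetical folder of spectrogram PNGs laid out for `torchvision.datasets.ImageFolder`; the path, hyperparameters, and transforms are illustrative assumptions, not the recipe used to train this model:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from transformers import AutoModelForImageClassification

# Hypothetical layout: spectrograms/train/{fake,real}/*.png
# ImageFolder assigns labels alphabetically (fake=0, real=1); confirm this
# matches model.config.label2id before training.
tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet stats used by ConvNeXt
                         std=[0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("spectrograms/train", transform=tfm)
loader = DataLoader(train_set, batch_size=16, shuffle=True)

model = AutoModelForImageClassification.from_pretrained(
    "kubinooo/convnext-tiny-224-audio-deepfake-classification")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for pixel_values, labels in loader:
    # Passing labels makes the model return a cross-entropy loss
    loss = model(pixel_values=pixel_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```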
|
|
|
### Out-of-Scope Use |
|
|
|
- Not suitable for raw audio input without spectrogram conversion |
|
- May fail on clips not close to 2 seconds in length |
|
- Not intended for speech recognition or speaker identification |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
- **Demographic Bias**: Dataset may be skewed toward specific accents, recording qualities, or environments. |
|
- **Risk of Misclassification**: In critical use cases (e.g., journalism, legal settings), false positives could falsely flag real voices, and false negatives may miss deepfakes. |
|
- **Generalization Risk**: Model performance may degrade on domains very different from the training data. |
|
|
|
### Recommendations
|
|
|
- Evaluate on diverse, in-domain datasets before production use |
|
- Combine with metadata or alternative models for high-stakes decisions |
|
- Include human-in-the-loop review for sensitive classifications |
|
|
|
## Training Details |
|
|
|
### Training Data
|
|
|
The model was trained on 2-second mel-spectrogram images generated from [Release in the Wild](https://huggingface.co/datasets/kubinooo/release-in-the-wild-2s-mel-spectrograms), a labeled dataset of real and fake voice clips.
|
|
|
### Preprocessing
|
|
|
- Audio resampled and clipped to 2 seconds |
|
- Converted into mel-spectrograms |
|
- Images normalized and resized to 224x224 (see the sketch below)
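
A minimal sketch of this pipeline using Librosa (listed under Software below) and Matplotlib; the sample rate, mel parameters, and rendering settings here are illustrative assumptions, not the exact configuration used to build the dataset:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load audio and force it to exactly 2 seconds (sample rate is an assumption)
sr = 16000
y, _ = librosa.load("clip.wav", sr=sr)
y = librosa.util.fix_length(y, size=2 * sr)

# Mel-spectrogram on a decibel scale (n_mels is an assumption)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Render without axes and save; the saved PNG is what the classifier consumes
fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)  # roughly 224x224 px
librosa.display.specshow(mel_db, sr=sr, ax=ax)
ax.set_axis_off()
fig.savefig("clip_mel.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)
```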
|
|
|
### Speeds, Sizes, Times
|
|
|
| Metric | Value | |
|
|--------|-------| |
|
| Best Epoch | 1 | |
|
| Final Validation Accuracy | 99.605% | |
|
| Training Accuracy | 99.829% | |
|
| Validation Loss | 0.01301 | |
|
| Final Step Accuracy | 99.837% | |
|
| Final Step Loss | 0.00009 | |
|
|
|
## Evaluation |
|
|
|
### Testing Data
|
|
|
Tested on a held-out portion of the *Release in the Wild* dataset.
|
|
|
### Factors
|
|
|
- Real vs Fake audio |
|
- Consistency across balanced classes |
|
|
|
### Metrics
|
|
|
- Accuracy |
|
- Class-wise accuracy |
|
- Confusion matrix (see the computation sketch below)
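
A minimal sketch of how these metrics can be derived from paired label lists; `y_true` and `y_pred` are hypothetical placeholders for the test-set ground truth and the model's predictions:

```python
from collections import Counter

def evaluate(y_true, y_pred, labels=("real", "fake")):
    # Overall accuracy
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    print(f"Overall accuracy: {correct / len(y_true):.4f} ({correct}/{len(y_true)})")

    # Confusion matrix: (true, predicted) -> count
    confusion = Counter(zip(y_true, y_pred))
    for t in labels:
        for p in labels:
            print(f"true={t:4s} predicted={p:4s}: {confusion[(t, p)]}")

    # Class-wise accuracy
    for t in labels:
        total = sum(1 for y in y_true if y == t)
        print(f"{t} class accuracy: {confusion[(t, t)] / total:.4f}")
```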
|
|
|
### Results
|
|
|
| Metric | Value | |
|
|--------|-------| |
|
| Overall Accuracy | 96.74% (7861/8126) | |
|
| Real Class Accuracy | 97.25% (3959/4071) | |
|
| Fake Class Accuracy | 96.23% (3902/4055) | |
|
|
|
#### Confusion Matrix |
|
|
|
| True Label | Predicted Label | Count | |
|
|------------|----------------|-------| |
|
| fake | fake | 3902 | |
|
| fake | real | 153 | |
|
| real | real | 3959 | |
|
| real | fake | 112 | |
|
|
|
### Summary
|
|
|
The model achieves strong overall accuracy (~97%) with a good balance between real and fake class performance. Overfitting appears minimal, given early convergence and robust validation metrics.
|
|
|
## Societal Impact Assessment |
|
|
|
This model has not yet undergone formal testing for: |
|
|
|
- Child safety or CSEM risks |
|
- NCII (non-consensual intimate imagery) |
|
- Violence or abusive content detection |
|
|
|
Caution is advised when using this model in applications that could impact people’s safety, dignity, or reputation. |
|
|
|
## Environmental Impact |
|
|
|
### Hardware Type

NVIDIA T4 GPU

### Hours Used

~1 hour

### Cloud Provider

Google Cloud Platform

### Cloud Region

europe-west1

### Carbon Emitted

0.02 kg CO₂eq (estimated)
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture
|
|
|
ConvNeXt-Tiny is a convolutional network with a transformer-inspired design, adapted here to classify mel-spectrogram images as real or fake audio via the standard image-classification head.
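
A quick way to inspect the head and label mapping from the checkpoint (the exact id-to-label mapping is whatever the model config defines):

```python
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained(
    "kubinooo/convnext-tiny-224-audio-deepfake-classification")
print(model.config.id2label)                              # binary real/fake label mapping
print(f"{model.num_parameters() / 1e6:.1f}M parameters")  # ConvNeXt-Tiny is ~28M
```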
|
|
|
### Compute Infrastructure
|
|
|
Standard GPU training with PyTorch and Hugging Face Transformers. |
|
|
|
### Hardware Requirements
|
|
|
- Minimum: NVIDIA GPU with 8 GB VRAM |
|
- CPU inference possible but slower (see the device-selection sketch below)
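
A minimal device-selection sketch for inference; it mirrors the Direct Use snippet and contains nothing specific to this model:

```python
import torch
from transformers import AutoModelForImageClassification

# Prefer a GPU when available, otherwise fall back to (slower) CPU inference
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForImageClassification.from_pretrained(
    "kubinooo/convnext-tiny-224-audio-deepfake-classification").to(device)
# Remember to move inputs too: pixel_values = pixel_values.to(device)
```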
|
|
|
### Software
|
|
|
- PyTorch |
|
- Hugging Face Transformers |
|
- Librosa (for spectrogram generation) |
|
- Weights & Biases (experiment tracking)