---
license: apache-2.0
datasets:
- kubinooo/release-in-the-wild-2s-mel-spectrograms
language:
- en
base_model:
- facebook/convnext-tiny-224
pipeline_tag: image-classification
tags:
- audio-classification
- deepfake-classification
results:
  overall_accuracy: 0.9852
  overall_accuracy_fraction: 8006/8126
  overall_accuracy_percentage: 98.5%
  confusion_matrix:
  - true_label: fake
    predicted_label: fake
    count: 3958
  - true_label: fake
    predicted_label: real
    count: 97
  - true_label: real
    predicted_label: real
    count: 4048
  - true_label: real
    predicted_label: fake
    count: 23
  class_accuracy:
    real_images:
      accuracy: 0.9944
      fraction: 4048/4071
    fake_images:
      accuracy: 0.9761
      fraction: 3958/4055
---

# ConvNeXt-Tiny Audio Deepfake Classifier

A deep learning model fine-tuned for detecting audio deepfakes using mel-spectrograms of 2-second audio clips from the *Release in the Wild* dataset.

**Model ID**: `audio-deepfake-classifier-convnext-tiny`

**Model summary**: This model classifies audio clips as *real* or *fake* based on visual patterns in mel-spectrograms, using the `facebook/convnext-tiny-224` architecture.

## Table of Contents

- [Model Details](#model-details)
- [Uses](#uses)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Training Details](#training-details)
- [Evaluation](#evaluation)
- [Societal Impact Assessment](#societal-impact-assessment)
- [Environmental Impact](#environmental-impact)
- [Technical Specifications](#technical-specifications)

## Model Details

### Model Description

This model is a fine-tuned ConvNeXt-Tiny image classifier applied to **mel-spectrograms** derived from 2-second audio clips to detect *audio deepfakes*. By transforming audio into visual representations (spectrograms), it leverages image classification techniques for audio-based forensics.

- **Architecture**: ConvNeXt-Tiny
- **Base Model**: [facebook/convnext-tiny-224](https://huggingface.co/facebook/convnext-tiny-224)
- **Task**: Binary classification (real vs. fake)
- **Input Modality**: Mel-spectrogram (224x224 image)
- **Training Framework**: PyTorch + Transformers

#### Developed by

[Jakub Polnis](https://huggingface.co/kubinooo)

#### Shared by

[Release in the Wild Dataset (2s mel-spectrograms)](https://huggingface.co/datasets/kubinooo/release-in-the-wild-2s-mel-spectrograms)

#### Model Type

- Image Classification
- Audio Deepfake Detection via Spectrograms

#### Language(s)

- English

#### License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

#### Finetuned From

[facebook/convnext-tiny-224](https://huggingface.co/facebook/convnext-tiny-224)

## Uses

### Direct Use

This model can be used to classify 2-second audio clips as **real** or **fake** by first converting them into mel-spectrogram images and passing them through the classifier. A hedged conversion sketch is shown below, followed by an inference example.
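The card does not document the exact settings used to generate the training spectrograms, so the following is a minimal sketch of the audio-to-image conversion using librosa and matplotlib. The sampling rate, number of mel bands, and rendering choices are assumptions and should be matched to the dataset's settings before trusting the predictions.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Assumed parameters -- the exact settings used to build the training
# spectrograms are not documented in this card.
SR = 16000          # sampling rate (assumption)
CLIP_SECONDS = 2.0  # the model was trained on 2-second clips
N_MELS = 128        # number of mel bands (assumption)

def audio_to_mel_spectrogram_image(audio_path: str, out_path: str) -> None:
    """Load audio, trim/pad to 2 seconds, and save a mel-spectrogram image."""
    y, sr = librosa.load(audio_path, sr=SR)

    # Clip or zero-pad to exactly 2 seconds.
    y = librosa.util.fix_length(y, size=int(SR * CLIP_SECONDS))

    # Mel-spectrogram on a decibel scale.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=N_MELS)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # Render the spectrogram alone (no axes) on a ~224x224 px canvas;
    # the image processor resizes to 224x224 at inference time anyway.
    fig = plt.figure(figsize=(2.24, 2.24), dpi=100)
    ax = fig.add_axes([0.0, 0.0, 1.0, 1.0])
    ax.set_axis_off()
    librosa.display.specshow(mel_db, sr=sr, ax=ax)
    fig.savefig(out_path)
    plt.close(fig)

audio_to_mel_spectrogram_image("clip.wav", "clip.png")
```

With a spectrogram image on disk, inference follows the standard Transformers image-classification flow (the processor handles resizing and normalization):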
```python
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification
import torch

processor = AutoImageProcessor.from_pretrained("kubinooo/convnext-tiny-224-audio-deepfake-classification")
model = AutoModelForImageClassification.from_pretrained("kubinooo/convnext-tiny-224-audio-deepfake-classification")

image = Image.open("/path/to/your/image").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = model(pixel_values)

logits = outputs.logits
predicted_class_idx = logits.argmax(-1).item()
print(model.config.id2label[predicted_class_idx])
```

### Downstream Use

- Integration into voice content moderation tools
- Forensic applications for detecting synthetic speech
- Fine-tuning for speaker-specific deepfake detection

### Out-of-Scope Use

- Not suitable for raw audio input without spectrogram conversion
- May fail on clips that deviate substantially from 2 seconds in length
- Not intended for speech recognition or speaker identification

## Bias, Risks, and Limitations

- **Demographic Bias**: The dataset may be skewed toward specific accents, recording qualities, or environments.
- **Risk of Misclassification**: In critical use cases (e.g., journalism, legal settings), false positives could falsely flag real voices, and false negatives may miss deepfakes.
- **Generalization Risk**: Model performance may degrade on domains very different from the training data.

### Recommendations

- Evaluate on diverse, in-domain datasets before production use
- Combine with metadata or alternative models for high-stakes decisions
- Include human-in-the-loop review for sensitive classifications

## Training Details

### Training Data

The model was trained on 2-second mel-spectrogram images generated from [Release in the Wild](https://huggingface.co/datasets/kubinooo/release-in-the-wild-2s-mel-spectrograms), a labeled dataset of real and fake voice clips.

### Preprocessing

- Audio resampled and clipped to 2 seconds
- Converted into mel-spectrograms
- Images normalized and resized to 224x224

### Speeds, Sizes, Times

| Metric | Value |
|--------|-------|
| Best Epoch | 1 |
| Final Validation Accuracy | 99.605% |
| Training Accuracy | 99.829% |
| Validation Loss | 0.01301 |
| Final Step Accuracy | 99.837% |
| Final Step Loss | 0.00009 |

## Evaluation

### Testing Data

Tested on a held-out portion of the *Release in the Wild* dataset.

### Factors

- Real vs. fake audio
- Consistency across balanced classes

### Metrics

- Accuracy
- Class-wise accuracy
- Confusion matrix

### Results

| Metric | Value |
|--------|-------|
| Overall Accuracy | 98.52% (8006/8126) |
| Real Class Accuracy | 99.44% (4048/4071) |
| Fake Class Accuracy | 97.61% (3958/4055) |

#### Confusion Matrix

| True Label | Predicted Label | Count |
|------------|----------------|-------|
| fake | fake | 3958 |
| fake | real | 97 |
| real | real | 4048 |
| real | fake | 23 |

### Results Summary

The model achieves strong overall accuracy (~98.5%), with balanced performance across the real and fake classes. Early convergence (best epoch 1) and stable validation metrics suggest minimal overfitting.
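The card does not state which tooling produced the numbers above. As a minimal sketch (assuming scikit-learn, which is not part of the software stack listed below), the same metrics can be recomputed from per-clip labels and predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ["fake", "real"]

# y_true / y_pred would come from running the classifier over the
# held-out split; the four values here are placeholders.
y_true = np.array(["fake", "fake", "real", "real"])
y_pred = np.array(["fake", "real", "real", "real"])

overall = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Class-wise accuracy: diagonal counts divided by row (true-label) totals.
per_class = cm.diagonal() / cm.sum(axis=1)

print(f"overall accuracy: {overall:.4f}")
for label, acc in zip(labels, per_class):
    print(f"{label} accuracy: {acc:.4f}")
print("confusion matrix (rows = true, cols = predicted):")
print(cm)
```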
## Societal Impact Assessment

This model has not yet undergone formal testing for:

- Child safety or CSEM risks
- NCII (non-consensual intimate imagery)
- Violence or abusive content detection

Caution is advised when using this model in applications that could impact people's safety, dignity, or reputation.

## Environmental Impact

### Hardware Type

NVIDIA T4

### Hours Used

1 hour

### Cloud Provider

Google Cloud Platform

### Cloud Region

europe-west1

### CO₂ Emitted

Approximately 0.02 kg CO₂eq (the unit is not stated in the source; kg CO₂eq is assumed and is consistent with roughly one hour on a ~70 W T4).

## Technical Specifications

### Model Architecture

ConvNeXt-Tiny, a convolutional network, adapted to classify mel-spectrogram images as real or fake audio. Uses a standard image-classification head.

### Compute Infrastructure

Standard GPU training with PyTorch and Hugging Face Transformers.

### Hardware Requirements

- Minimum: NVIDIA GPU with 8 GB VRAM
- CPU inference is possible but slower

### Software

- PyTorch
- Hugging Face Transformers
- Librosa (for spectrogram generation)
- Weights & Biases (wandb, for experiment tracking)
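As a closing illustration, the earlier snippets can be combined into one helper that slices a recording of arbitrary length into 2-second windows and aggregates per-window predictions by majority vote. Everything beyond the model ID is an assumption (window step, in-memory grayscale rendering, voting); in particular, the grayscale rendering below may not match the colormap used to produce the training images, so treat the output as indicative only.

```python
from collections import Counter

import librosa
import numpy as np
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

MODEL_ID = "kubinooo/convnext-tiny-224-audio-deepfake-classification"
SR = 16000  # assumed sampling rate, as in the preprocessing sketch above

processor = AutoImageProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageClassification.from_pretrained(MODEL_ID)
model.eval()

def classify_recording(audio_path: str) -> str:
    """Majority vote over non-overlapping 2-second windows."""
    y, _ = librosa.load(audio_path, sr=SR)
    window = 2 * SR
    votes = []
    for start in range(0, max(len(y) - window + 1, 1), window):
        # Zero-pad the last (or only) window to exactly 2 seconds.
        chunk = librosa.util.fix_length(y[start:start + window], size=window)
        mel_db = librosa.power_to_db(
            librosa.feature.melspectrogram(y=chunk, sr=SR), ref=np.max
        )
        # Map decibels to an 8-bit grayscale image and expand to RGB --
        # a stand-in for the spectrogram PNGs used at training time.
        scaled = (mel_db - mel_db.min()) / (np.ptp(mel_db) + 1e-8)
        image = Image.fromarray((scaled * 255).astype(np.uint8)).convert("RGB")
        pixel_values = processor(image, return_tensors="pt").pixel_values
        with torch.no_grad():
            idx = model(pixel_values).logits.argmax(-1).item()
        votes.append(model.config.id2label[idx])
    return Counter(votes).most_common(1)[0][0]

print(classify_recording("recording.wav"))
```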