AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection
This repository contains the original implementation of AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge.
Paper
AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge
This is the original implementation of the paper. The model weights provided here are NOT the same weights used in the paper results.
Overview
AASIST3 is an enhanced version of the AASIST (Audio Anti-Spoofing using Integrated Spectro-Temporal graph attention networks) architecture that incorporates Kolmogorov-Arnold Networks (KAN) for improved speech deepfake detection. The model leverages:
- Self-Supervised Learning (SSL) Features: Uses Wav2Vec2 encoder for robust audio representation
- KAN Linear Layers: Kolmogorov-Arnold Networks for enhanced feature transformation
- Graph Attention Networks (GAT): For spatial and temporal feature modeling
- Multi-branch Inference: Multiple inference branches for robust decision making
Architecture
The AASIST3 model consists of several key components (a minimal data-flow sketch follows this list):
- Wav2Vec2 Encoder: Extracts SSL features from raw audio
- KAN Bridge: Transforms SSL features using Kolmogorov-Arnold Networks
- Residual Encoder: Processes features through multiple residual blocks
- Graph Attention Networks:
- GAT-S: Spatial attention mechanism
- GAT-T: Temporal attention mechanism
- Multi-branch Inference: Four parallel inference branches with master tokens
- KAN Output Layer: Final classification using KAN linear layers
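To make the data flow concrete, here is a minimal PyTorch sketch of how these components connect. The class, attribute, and dimension names are our assumptions, and the heavier submodules are replaced by stand-ins; it illustrates the pipeline described above, not the repository's actual implementation.
import torch
import torch.nn as nn

class AASIST3Sketch(nn.Module):
    """Data-flow illustration only; names and shapes are assumptions."""
    def __init__(self, ssl_dim=1024, hidden=128, num_classes=2):
        super().__init__()
        # Stand-ins for the components listed above (Wav2Vec2 runs upstream)
        self.kan_bridge = nn.Linear(ssl_dim, hidden)      # KAN linear layer in the real model
        self.residual_encoder = nn.Identity()             # stack of residual blocks
        self.gat_s = nn.Identity()                        # spatial graph attention
        self.gat_t = nn.Identity()                        # temporal graph attention
        self.out_layer = nn.Linear(hidden, num_classes)   # KAN output layer in the real model

    def forward(self, ssl_feats):                         # (batch, frames, ssl_dim)
        x = self.kan_bridge(ssl_feats)                    # project SSL features
        x = self.residual_encoder(x)
        s = self.gat_s(x)                                 # spatial branch
        t = self.gat_t(x)                                 # temporal branch
        fused = (s + t).mean(dim=1)                       # placeholder for 4-branch fusion
        return self.out_layer(fused)                      # (batch, num_classes) logits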
Key Innovations
- KAN Integration: Replaces traditional linear layers with KAN linear layers for better function approximation (a toy example follows this list)
- Enhanced Regularization: Additional dropout and regularization techniques
- Multi-dataset Training: Trained on multiple ASVspoof datasets for robustness
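For intuition about the KAN idea, the toy layer below gives each (input, output) edge its own learnable 1-D function, built here as a weighted sum of Gaussian basis functions on a fixed grid plus a linear base term. The repository's actual KAN layers may use a different basis (e.g., B-splines); this is only a simplified illustration.
import torch
import torch.nn as nn

class SimpleKANLinear(nn.Module):
    """Toy KAN-style layer: learnable per-edge 1-D functions, then linear mixing."""
    def __init__(self, in_features, out_features, num_basis=8, grid_range=(-2.0, 2.0)):
        super().__init__()
        self.register_buffer("grid", torch.linspace(*grid_range, num_basis))   # (num_basis,)
        # one coefficient vector per (output, input) edge
        self.coeff = nn.Parameter(torch.randn(out_features, in_features, num_basis) * 0.1)
        self.base = nn.Linear(in_features, out_features)   # residual base transform

    def forward(self, x):                                  # x: (batch, in_features)
        # Gaussian basis responses: (batch, in_features, num_basis)
        phi = torch.exp(-((x.unsqueeze(-1) - self.grid) ** 2))
        # sum over input features and basis functions for each output unit
        spline = torch.einsum("bik,oik->bo", phi, self.coeff)
        return self.base(x) + spline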
Quick Start
Installation
git clone https://github.com/your-username/AASIST3.git
cd AASIST3
pip install -r requirements.txt
Loading the Model
from model import aasist3
# Load the model from Hugging Face Hub
model = aasist3.from_pretrained("MTUCI/AASIST3")
model.eval()
Basic Usage
import torch
import torchaudio

# Load and preprocess audio
audio, sr = torchaudio.load("audio_file.wav")

# Ensure audio is 16 kHz and mono
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)
if audio.shape[0] > 1:
    audio = torch.mean(audio, dim=0, keepdim=True)

# Prepare input (model expects ~4 seconds of audio at 16 kHz)
# Pad or truncate to 64600 samples
if audio.shape[1] < 64600:
    audio = torch.nn.functional.pad(audio, (0, 64600 - audio.shape[1]))
else:
    audio = audio[:, :64600]

# Run inference
with torch.no_grad():
    output = model(audio)
    probabilities = torch.softmax(output, dim=1)
    prediction = torch.argmax(probabilities, dim=1)

# prediction: 0 = bonafide, 1 = spoof
print(f"Prediction: {'Bonafide' if prediction.item() == 0 else 'Spoof'}")
print(f"Confidence: {probabilities.max().item():.3f}")
Training Details
Datasets Used
The model was trained on a combination of multiple datasets:
- ASVspoof 2019 LA (Logical Access)
- ASVspoof 2024 (ASVspoof5)
- MLAAD (Multi-Language Audio Anti-Spoofing Dataset)
- M-AILABS (multi-language speech dataset)
Training Configuration
- Epochs: 20
- Batch Size: 12 (training), 24 (validation)
- Learning Rate: 1e-4
- Optimizer: AdamW
- Loss Function: CrossEntropyLoss
- Gradient Accumulation Steps: 2
Hardware
- GPUs: 2xA100 40GB
- Framework: PyTorch with Accelerate for distributed training (a minimal loop sketch follows)
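Below is a minimal sketch of how these settings could fit together with Accelerate-based gradient accumulation. The model and train_dataset objects are placeholders, and the repository's actual training script may differ:
import torch
from torch import nn
from torch.utils.data import DataLoader
from accelerate import Accelerator

# model and train_dataset are assumed to exist; they are not this repo's API.
accelerator = Accelerator(gradient_accumulation_steps=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
train_loader = DataLoader(train_dataset, batch_size=12, shuffle=True)

model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

for epoch in range(20):
    model.train()
    for audio, labels in train_loader:
        with accelerator.accumulate(model):   # handles the 2-step accumulation
            logits = model(audio)
            loss = criterion(logits, labels)
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()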
Advanced Usage
Custom Training
# Train the model
bash train.sh
Validation
# Run validation on test sets
bash validate.sh
Model Configuration
The model can be configured through the configs/train.yaml file:
# Key parameters
num_epochs: 20
train_batch_size: 12
val_batch_size: 24
learning_rate: 1e-4
gradient_accumulation_steps: 2
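The file is standard YAML, so it can be read with PyYAML, for example (a generic sketch; the repository's training script may load it differently):
import yaml

# Parse the training configuration shown above
with open("configs/train.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["num_epochs"], cfg["train_batch_size"])   # -> 20 12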
Citation
If you use this implementation in your research, please cite the original paper:
@inproceedings{borodin24_asvspoof,
  title     = {AASIST3: KAN-enhanced AASIST speech deepfake detection using SSL features and additional regularization for the ASVspoof 2024 Challenge},
  author    = {Kirill Borodin and Vasiliy Kudryavtsev and Dmitrii Korzh and Alexey Efimenko and Grach Mkrtchian and Mikhail Gorodnichev and Oleg Y. Rogov},
  year      = {2024},
  booktitle = {The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024)},
  pages     = {48--55},
  doi       = {10.21437/ASVspoof.2024-8},
}
License
This project is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0) - see the LICENSE file for details.
This license allows you to:
- Share: Copy and redistribute the material in any medium or format
Under the following terms:
- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made
It does NOT allow:
- Commercial use: You may not use the material for commercial purposes
- Derivatives: You may not distribute modified versions of the material
For more information, visit: https://creativecommons.org/licenses/by-nc-nd/4.0/
Disclaimer: This is a research implementation. The model weights provided are for demonstration purposes and may not match the exact performance reported in the paper.