---
license: mit
datasets:
- westbrook/English_Accent_DataSet
base_model:
- openai/whisper-small
pipeline_tag: audio-classification
tags:
- accent
- gender
---

# Whisper Audio Classification Model

A fine-tuned Whisper model for multi-task audio classification, trained to classify **English accents** (23 classes) and **speaker gender** (2 classes) from speech audio.

## 🎯 Model Overview

This model uses OpenAI's Whisper encoder as a feature extractor with custom classification heads for:

- **Accent Classification**: Identifies 23 different English accents
- **Gender Classification**: Classifies the speaker as male or female

### Model Architecture

- **Base Model**: `openai/whisper-small.en`
- **Encoder**: Frozen Whisper encoder (used for feature extraction)
- **Classification Heads**: Custom feed-forward networks with dropout for robust predictions
- **Multi-task Learning**: Jointly trained on both accent and gender classification

## 🚀 Quick Start

### Prerequisites

```bash
pip install torch transformers datasets numpy scikit-learn librosa safetensors
```

### Basic Usage

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import WhisperFeatureExtractor, WhisperModel
from safetensors.torch import load_file
import numpy as np

# Define the model class (same as training)
class WhisperClassifier(nn.Module):
    def __init__(self, model_name="openai/whisper-small.en", num_accent_classes=23,
                 num_gender_classes=2, freeze_encoder=True, dropout_rate=0.3):
        super().__init__()
        self.whisper = WhisperModel.from_pretrained(model_name)

        if freeze_encoder:
            for param in self.whisper.encoder.parameters():
                param.requires_grad = False

        self.hidden_size = self.whisper.config.d_model
        self.dropout = nn.Dropout(dropout_rate)

        # Accent classification head
        self.accent_classifier = nn.Sequential(
            nn.Linear(self.hidden_size, 512),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(256, num_accent_classes)
        )

        # Gender classification head
        self.gender_classifier = nn.Sequential(
            nn.Linear(self.hidden_size, 256),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(128, num_gender_classes)
        )

        self.num_accent_classes = num_accent_classes
        self.num_gender_classes = num_gender_classes

    def forward(self, input_features, accent_labels=None, gender_labels=None):
        # Encode log-mel features, then mean-pool over the time dimension
        encoder_outputs = self.whisper.encoder(input_features)
        hidden_states = encoder_outputs.last_hidden_state
        pooled_output = hidden_states.mean(dim=1)
        pooled_output = self.dropout(pooled_output)

        accent_logits = self.accent_classifier(pooled_output)
        gender_logits = self.gender_classifier(pooled_output)

        return {
            'accent_logits': accent_logits,
            'gender_logits': gender_logits,
        }

# Load the trained model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = WhisperClassifier()

# Load the trained weights (.safetensors files require safetensors, not torch.load)
model.load_state_dict(load_file("./model_step1000.safetensors"))
model.to(device)
model.eval()

# Initialize feature extractor
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small.en")
```
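As a quick sanity check that the weights loaded correctly, you can run one second of silence through the model and confirm the logit shapes match the two heads. This minimal sketch reuses only the objects defined above:

```python
# Sanity check: 1 second of silence at 16 kHz
silence = np.zeros(16000, dtype=np.float32)
inputs = feature_extractor(silence, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(input_features=inputs.input_features.to(device))

print(outputs["accent_logits"].shape)  # torch.Size([1, 23])
print(outputs["gender_logits"].shape)  # torch.Size([1, 2])
```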
### Making Predictions

```python
def predict_audio(audio_file_path, model, feature_extractor, device):
    """
    Predict accent and gender from an audio file.

    Args:
        audio_file_path: Path to an audio file (.wav, .mp3, etc.)
        model: Trained WhisperClassifier model
        feature_extractor: Whisper feature extractor
        device: torch device (cuda/cpu)

    Returns:
        Dictionary with predictions and confidence scores
    """
    import librosa

    # Load audio file at 16 kHz mono (Whisper's expected input)
    audio, sr = librosa.load(audio_file_path, sr=16000, mono=True)

    # Extract features
    inputs = feature_extractor(
        audio,
        sampling_rate=sr,
        return_tensors="pt"
    )

    # Move to device
    input_features = inputs.input_features.to(device)

    # Get predictions
    with torch.no_grad():
        outputs = model(input_features=input_features)

    # Convert logits to probabilities
    accent_probs = F.softmax(outputs["accent_logits"], dim=-1)
    gender_probs = F.softmax(outputs["gender_logits"], dim=-1)

    # Get predicted class indices
    accent_pred = torch.argmax(accent_probs, dim=-1).item()
    gender_pred = torch.argmax(gender_probs, dim=-1).item()

    # Get confidence scores
    accent_confidence = accent_probs[0, accent_pred].item()
    gender_confidence = gender_probs[0, gender_pred].item()

    # Map predictions to labels
    accent_names = [
        'african', 'australia', 'bermuda', 'canada', 'england', 'hongkong',
        'indian', 'ireland', 'malaysia', 'newzealand', 'philippines',
        'scotland', 'singapore', 'southafrica', 'us', 'wales'
        # Add all 23 accent names based on your dataset
    ]
    accent_name = accent_names[accent_pred] if accent_pred < len(accent_names) else f"accent_{accent_pred}"
    gender_name = "male" if gender_pred == 0 else "female"

    return {
        'accent': accent_name,
        'accent_confidence': accent_confidence,
        'gender': gender_name,
        'gender_confidence': gender_confidence
    }

# Example usage
result = predict_audio("path/to/your/audio.wav", model, feature_extractor, device)
print(f"Predicted Accent: {result['accent']} (confidence: {result['accent_confidence']:.3f})")
print(f"Predicted Gender: {result['gender']} (confidence: {result['gender_confidence']:.3f})")
```

### Batch Predictions

```python
def predict_batch(audio_files, model, feature_extractor, device, batch_size=8):
    """
    Predict accent and gender for multiple audio files.
    """
    import librosa
    from torch.utils.data import DataLoader, Dataset

    class AudioDataset(Dataset):
        def __init__(self, audio_files):
            self.audio_files = audio_files

        def __len__(self):
            return len(self.audio_files)

        def __getitem__(self, idx):
            audio, sr = librosa.load(self.audio_files[idx], sr=16000, mono=True)
            inputs = feature_extractor(audio, sampling_rate=sr, return_tensors="pt")
            return inputs.input_features.squeeze(0)

    dataset = AudioDataset(audio_files)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)

    results = []
    model.eval()

    with torch.no_grad():
        for batch in dataloader:
            batch = batch.to(device)
            outputs = model(input_features=batch)

            accent_probs = F.softmax(outputs["accent_logits"], dim=-1)
            gender_probs = F.softmax(outputs["gender_logits"], dim=-1)

            accent_preds = torch.argmax(accent_probs, dim=-1)
            gender_preds = torch.argmax(gender_probs, dim=-1)

            for i in range(len(batch)):
                results.append({
                    'accent_id': accent_preds[i].item(),
                    'accent_confidence': accent_probs[i, accent_preds[i]].item(),
                    'gender_id': gender_preds[i].item(),
                    'gender_confidence': gender_probs[i, gender_preds[i]].item(),
                })

    return results
```
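For example, to score a list of audio files (the file names below are placeholders for illustration):

```python
# Hypothetical file list -- replace with your own paths
files = ["clip_01.wav", "clip_02.wav", "clip_03.wav"]

for path, pred in zip(files, predict_batch(files, model, feature_extractor, device)):
    print(f"{path}: accent_id={pred['accent_id']} ({pred['accent_confidence']:.3f}), "
          f"gender_id={pred['gender_id']} ({pred['gender_confidence']:.3f})")
```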
## 📊 Model Performance

The model was trained on the English Accent Dataset:

- **Accent Classification**: Covers 23 English accent varieties
- **Gender Classification**: Robust binary classification of male/female voices
- **Multi-task Learning**: Benefits from joint training on both tasks

### Supported Accent Classes

The model can classify the following accent varieties (names follow the dataset's label conventions):

1. African
2. Australia
3. Bermuda
4. Canada
5. England
6. Hong Kong
7. Indian
8. Ireland
9. Malaysia
10. New Zealand
11. Philippines
12. Scotland
13. Singapore
14. South Africa
15. US
16. Wales

... (and more, totaling 23 classes)

## 🔧 Advanced Usage

### Custom Audio Processing

```python
def preprocess_custom_audio(audio_array, sample_rate, target_sr=16000):
    """
    Preprocess custom audio data for the model.
    """
    import librosa

    # Resample if needed
    if sample_rate != target_sr:
        audio_array = librosa.resample(audio_array, orig_sr=sample_rate, target_sr=target_sr)

    # Ensure mono
    if len(audio_array.shape) > 1:
        audio_array = librosa.to_mono(audio_array)

    # Peak-normalize (guard against all-zero audio)
    peak = np.max(np.abs(audio_array))
    if peak > 0:
        audio_array = audio_array / peak

    return audio_array
```

### Getting Top-K Predictions

```python
def get_top_k_predictions(audio_file, model, feature_extractor, device, k=3):
    """
    Get the top-k accent predictions with confidence scores.
    """
    import librosa

    # Load and featurize the audio (same preprocessing as predict_audio above)
    audio, sr = librosa.load(audio_file, sr=16000, mono=True)
    inputs = feature_extractor(audio, sampling_rate=sr, return_tensors="pt")
    input_features = inputs.input_features.to(device)

    with torch.no_grad():
        outputs = model(input_features=input_features)
        accent_probs = F.softmax(outputs["accent_logits"], dim=-1)

    # Get top-k predictions
    top_k_probs, top_k_indices = torch.topk(accent_probs, k, dim=-1)

    results = []
    for i in range(k):
        results.append({
            'accent_id': top_k_indices[0, i].item(),
            'confidence': top_k_probs[0, i].item()
        })

    return results
```

## 📋 Requirements

- Python 3.8+
- PyTorch 1.9+
- Transformers 4.20+
- librosa (for audio loading)
- safetensors (for loading the trained weights)
- numpy
- scikit-learn (for evaluation metrics)

## 📄 License

This model is based on OpenAI's Whisper and follows the same licensing terms. Please check the original Whisper repository for license details.

## 🙏 Acknowledgments

- OpenAI for the Whisper model
- The English Accent Dataset creators
- The Hugging Face Transformers library

---

**Note**: This model is trained for research and educational purposes. Performance may vary with audio quality, recording conditions, and accent varieties not represented in the training data.