Speech-Emotion-Classification is a fine-tuned version of facebook/wav2vec2-base-960h for multi-class audio classification, trained specifically to detect emotions in speech. The model uses the Wav2Vec2ForSequenceClassification architecture to classify speaker emotions from audio signals.
Paper: wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (https://arxiv.org/pdf/2006.11477)
Classification Report:

| Class        | Precision | Recall | F1-Score | Support (test) |
|--------------|-----------|--------|----------|----------------|
| Anger        | 0.8314    | 0.9346 | 0.8800   | 306            |
| Calm         | 0.7949    | 0.8857 | 0.8378   | 35             |
| Disgust      | 0.8261    | 0.8287 | 0.8274   | 321            |
| Fear         | 0.8303    | 0.7377 | 0.7812   | 305            |
| Happy        | 0.8929    | 0.7764 | 0.8306   | 322            |
| Neutral      | 0.8423    | 0.9303 | 0.8841   | 287            |
| Sad          | 0.7749    | 0.7825 | 0.7787   | 308            |
| Surprised    | 0.9478    | 0.9478 | 0.9478   | 115            |
| Accuracy     |           |        | 0.8379   | 1999           |
| Macro avg    | 0.8426    | 0.8530 | 0.8460   | 1999           |
| Weighted avg | 0.8392    | 0.8379 | 0.8367   | 1999           |
Class 0: Anger
Class 1: Calm
Class 2: Disgust
Class 3: Fear
Class 4: Happy
Class 5: Neutral
Class 6: Sad
Class 7: Surprised
pip install gradio transformers torch librosa hf_xet
import gradio as gr
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor
import torch
import librosa

# Load model and feature extractor
model_name = "prithivMLmods/Speech-Emotion-Classification"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
processor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)

# Label mapping (class index -> emotion name)
id2label = {
    "0": "Anger",
    "1": "Calm",
    "2": "Disgust",
    "3": "Fear",
    "4": "Happy",
    "5": "Neutral",
    "6": "Sad",
    "7": "Surprised"
}

def classify_audio(audio_path):
    # Load and resample audio to 16 kHz, the rate the model expects
    speech, sample_rate = librosa.load(audio_path, sr=16000)

    # Convert the waveform into model inputs
    inputs = processor(
        speech,
        sampling_rate=sample_rate,
        return_tensors="pt",
        padding=True
    )

    # Forward pass without gradient tracking
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.nn.functional.softmax(logits, dim=1).squeeze().tolist()

    # Map each class probability to its emotion label
    prediction = {
        id2label[str(i)]: round(probs[i], 3) for i in range(len(probs))
    }
    return prediction

# Gradio Interface
iface = gr.Interface(
    fn=classify_audio,
    inputs=gr.Audio(type="filepath", label="Upload Audio (WAV, MP3, etc.)"),
    outputs=gr.Label(num_top_classes=8, label="Emotion Classification"),
    title="Speech Emotion Classification",
    description="Upload an audio clip to classify the speaker's emotion from voice signals."
)

if __name__ == "__main__":
    iface.launch()
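Running the script launches a local Gradio app where audio clips can be uploaded for classification. For purely programmatic use, the checkpoint should also load through the generic transformers audio-classification pipeline; the snippet below is a minimal sketch (the audio path is a placeholder, and decoding local files requires ffmpeg to be available).

```python
from transformers import pipeline

# Minimal sketch: load the checkpoint through the generic audio-classification pipeline.
classifier = pipeline(
    "audio-classification",
    model="prithivMLmods/Speech-Emotion-Classification",
)

# "speech_sample.wav" is a placeholder path; the pipeline resamples the clip
# to the 16 kHz rate expected by the model before running inference.
results = classifier("speech_sample.wav", top_k=8)

# Each result is a dict with a label (as stored in the model config) and a score.
for r in results:
    print(f"{r['label']}: {r['score']:.3f}")
```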
"id2label": {
"0": "ANG",
"1": "CAL",
"2": "DIS",
"3": "FEA",
"4": "HAP",
"5": "NEU",
"6": "SAD",
"7": "SUR"
},
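Because the configuration stores abbreviated codes while the rest of this card uses full emotion names, a small helper can translate pipeline or config labels for display. The dictionary below is illustrative only and is not shipped with the model.

```python
# Illustrative helper (not part of the model): map the abbreviated codes from
# the config's id2label to the full emotion names used in this card.
SHORT_TO_FULL = {
    "ANG": "Anger",
    "CAL": "Calm",
    "DIS": "Disgust",
    "FEA": "Fear",
    "HAP": "Happy",
    "NEU": "Neutral",
    "SAD": "Sad",
    "SUR": "Surprised",
}

def readable_label(short_code: str) -> str:
    # Fall back to the original code if it is not in the mapping.
    return SHORT_TO_FULL.get(short_code, short_code)
```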
Speech-Emotion-Classification is designed for: