# ViT for Audio Emotion Recognition (Valence-Arousal)

This model is a Vision Transformer (ViT) fine-tuned for audio emotion recognition. It predicts continuous valence and arousal values, each in the range [-1, 1].
## Model Description
- Base Model: google/vit-base-patch16-224-in21k
- Task: Audio emotion recognition (regression)
- Output: Valence and Arousal predictions (2D continuous emotion space)
- Range: [-1, 1] for both dimensions
- Input: Mel spectrogram images (224x224 RGB)
## Architecture

```
ViT Base (86M parameters)
        ↓
CLS Token Output (768-dim)
        ↓
LayerNorm + Dropout
        ↓
Linear (768 → 512) + GELU + Dropout
        ↓
Linear (512 → 128) + GELU + Dropout
        ↓
Linear (128 → 2) + Tanh
        ↓
[Valence, Arousal] ∈ [-1, 1]²
```
## Usage

### Prerequisites

```bash
pip install torch torchvision transformers librosa numpy pillow
```
### Loading the Model

```python
import torch
import torch.nn as nn
from transformers import ViTModel

class ViTForEmotionRegression(nn.Module):
    def __init__(self, model_name='google/vit-base-patch16-224-in21k',
                 num_emotions=2, dropout=0.1):
        super().__init__()
        self.vit = ViTModel.from_pretrained(model_name)
        hidden_size = self.vit.config.hidden_size  # 768 for ViT-Base
        self.head = nn.Sequential(
            nn.LayerNorm(hidden_size),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 512),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(512, 128),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(128, num_emotions),
            nn.Tanh(),  # squashes both outputs into [-1, 1]
        )

    def forward(self, pixel_values):
        outputs = self.vit(pixel_values=pixel_values)
        cls_output = outputs.last_hidden_state[:, 0]  # [CLS] token embedding
        return self.head(cls_output)

# Load the fine-tuned weights
model = ViTForEmotionRegression()
model.load_state_dict(torch.load('best_model.pth', map_location='cpu'))
model.eval()
```
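A quick smoke test confirms the expected shapes (illustrative only; a random input is not a meaningful spectrogram):

```python
# Sanity check: (batch, 3, 224, 224) in -> (batch, 2) out
dummy = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    out = model(dummy)
print(out.shape)  # torch.Size([1, 2]); Tanh keeps values in [-1, 1]
```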
### Audio Preprocessing

```python
import librosa
import numpy as np
import torch
from PIL import Image
from torchvision import transforms

def preprocess_audio(audio_path):
    # Load up to the first 30 seconds at 22.05 kHz
    y, sr = librosa.load(audio_path, sr=22050, duration=30)

    # Compute a 128-band mel spectrogram and convert power to dB
    mel_spec = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=128, hop_length=512, n_fft=2048
    )
    mel_db = librosa.power_to_db(mel_spec, ref=np.max)

    # Min-max normalize to 0-255 (epsilon guards against silent clips)
    mel_range = mel_db.max() - mel_db.min()
    mel_normalized = ((mel_db - mel_db.min()) / (mel_range + 1e-8) * 255).astype(np.uint8)

    # Render as a 224x224 RGB image
    image = Image.fromarray(mel_normalized).convert('RGB')
    image = image.resize((224, 224))

    # Apply the ImageNet normalization expected by the ViT backbone
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    return transform(image).unsqueeze(0)  # shape: (1, 3, 224, 224)

# Process audio and predict emotions
audio_tensor = preprocess_audio('your_audio.mp3')
with torch.no_grad():
    predictions = model(audio_tensor)
valence, arousal = predictions[0].tolist()
print(f"Valence: {valence:.3f}, Arousal: {arousal:.3f}")
```
### Emotion Quadrant Mapping

```python
def classify_emotion(valence, arousal):
    """Map a (valence, arousal) point to a coarse quadrant label."""
    if valence >= 0 and arousal >= 0:
        return "HAPPY" if valence > arousal else "EXCITED"
    elif valence >= 0 and arousal < 0:
        return "CALM" if abs(arousal) > valence else "CONTENT"
    elif valence < 0 and arousal < 0:
        return "SAD" if abs(valence) > abs(arousal) else "BORED"
    else:  # valence < 0 and arousal >= 0
        return "TENSE" if arousal > abs(valence) else "ANGRY"
```
## Model Details

- Parameters: ~86.8M
- Model Size: ~331 MB
- Framework: PyTorch
- Base Architecture: ViT-Base (12 layers, 768 hidden dim, 12 attention heads)
- Custom Head: 3-layer MLP with GELU activations
- Training Data: Custom audio emotion dataset
- Training: Fine-tuned with MSE loss on valence-arousal targets (see the sketch below)
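The exact training script is not included with this checkpoint, but a minimal fine-tuning sketch consistent with the description above looks like this (the `train_loader`, optimizer choice, and learning rate are assumptions, not the values used for this checkpoint):

```python
import torch
import torch.nn as nn

# Hypothetical DataLoader yielding (pixel_values, targets),
# with targets of shape (batch, 2) in [-1, 1]
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
criterion = nn.MSELoss()

model.train()
for pixel_values, targets in train_loader:
    optimizer.zero_grad()
    preds = model(pixel_values)       # (batch, 2) valence-arousal
    loss = criterion(preds, targets)  # MSE loss, as described above
    loss.backward()
    optimizer.step()
```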
## Emotion Space

The model predicts emotions in the 2D circumplex model:

```
              High Arousal
                   |
    Angry   Tense  |  Excited
                   |
Sad -------------- + -------------- Happy
                   |
      Bored        |  Calm   Content
                   |
              Low Arousal
```

- Valence: Negative (unpleasant) → Positive (pleasant)
- Arousal: Low (calm) → High (energetic)
## Performance

The model outputs continuous predictions that can be:

- Used directly for emotion intensity analysis
- Mapped to discrete emotion categories
- Visualized on emotion quadrant plots (see the sketch below)
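As an illustration of the last point, a minimal quadrant plot (matplotlib is an extra dependency, not listed under Prerequisites):

```python
import matplotlib.pyplot as plt

def plot_quadrant(valence, arousal):
    """Scatter one prediction on the valence-arousal plane."""
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.axhline(0, color='gray', linewidth=0.8)  # valence axis
    ax.axvline(0, color='gray', linewidth=0.8)  # arousal axis
    ax.scatter([valence], [arousal], color='tab:red')
    ax.set_xlim(-1, 1)
    ax.set_ylim(-1, 1)
    ax.set_xlabel('Valence')
    ax.set_ylabel('Arousal')
    plt.show()
```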
## Limitations

- Trained on music/audio; performance may vary on speech
- Requires mel spectrogram preprocessing
- Fixed 30-second input window (longer audio is truncated to its first 30 s; see the windowing sketch below)
- Possible cultural bias inherited from the training data
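One possible workaround for the 30-second limit (an assumption, not part of the released pipeline) is to score fixed windows and average the results; `audio_to_tensor` below is a hypothetical helper that mirrors `preprocess_audio` but accepts a raw waveform:

```python
import librosa
import torch

def predict_windows(audio_path, window_s=30, sr=22050):
    # Score 30 s windows independently and average the predictions
    y, _ = librosa.load(audio_path, sr=sr)
    step = window_s * sr
    preds = []
    with torch.no_grad():
        for start in range(0, len(y), step):
            chunk = y[start:start + step]
            if len(chunk) < sr:  # skip sub-second tails
                continue
            x = audio_to_tensor(chunk, sr)  # hypothetical helper (see note above)
            preds.append(model(x)[0])
    return torch.stack(preds).mean(dim=0).tolist()
```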
## Citation

```bibtex
@misc{sentio-vit-emotion,
  title={Vision Transformer for Audio Emotion Recognition},
  author={SentioApp Team},
  year={2025},
  publisher={HuggingFace}
}
```
## License
MIT License