ViT for Audio Emotion Recognition (Valence-Arousal)

This model is a Vision Transformer (ViT) fine-tuned for audio emotion recognition. It predicts continuous valence and arousal values, each in the range [-1, 1].

Model Description

  • Base Model: google/vit-base-patch16-224-in21k
  • Task: Audio emotion recognition (regression)
  • Output: Valence and Arousal predictions (2D continuous emotion space)
  • Range: [-1, 1] for both dimensions
  • Input: Mel spectrogram images (224x224 RGB)

Architecture

ViT Base (86M parameters)
    ↓
CLS Token Output (768-dim)
    ↓
LayerNorm + Dropout
    ↓
Linear (768 → 512) + GELU + Dropout
    ↓
Linear (512 → 128) + GELU + Dropout
    ↓
Linear (128 → 2) + Tanh
    ↓
[Valence, Arousal] ∈ [-1, 1]²
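
As a quick sanity check, the parameter count can be tallied once the model class from the Usage section below is defined; backbone plus head should come to roughly 86.8M:

# Total parameter count: ~86.4M backbone + ~0.46M head ≈ 86.8M
model = ViTForEmotionRegression()  # class defined under "Loading the Model" below
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")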

Usage

Prerequisites

pip install torch torchvision transformers librosa numpy pillow

Loading the Model

import torch
from transformers import ViTModel
import torch.nn as nn

class ViTForEmotionRegression(nn.Module):
    def __init__(self, model_name='google/vit-base-patch16-224-in21k', num_emotions=2, dropout=0.1):
        super().__init__()
        # Pretrained ViT backbone; the CLS token embedding feeds the regression head
        self.vit = ViTModel.from_pretrained(model_name)
        hidden_size = self.vit.config.hidden_size  # 768 for ViT-Base

        # 3-layer MLP head: 768 -> 512 -> 128 -> 2; Tanh bounds outputs to [-1, 1]
        self.head = nn.Sequential(
            nn.LayerNorm(hidden_size),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 512),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(512, 128),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(128, num_emotions),
            nn.Tanh()
        )
    
    def forward(self, pixel_values):
        outputs = self.vit(pixel_values=pixel_values)
        cls_output = outputs.last_hidden_state[:, 0]  # CLS token, shape (batch, 768)
        return self.head(cls_output)                  # (batch, 2) -> [valence, arousal]

# Load the model
model = ViTForEmotionRegression()
model.load_state_dict(torch.load('best_model.pth', map_location='cpu'))
model.eval()
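
If the checkpoint is hosted on the Hugging Face Hub rather than stored locally, it can be fetched with hf_hub_download first (the repo id below is a placeholder, not the actual repository name):

from huggingface_hub import hf_hub_download

# Placeholder repo id -- replace with the actual model repository
checkpoint_path = hf_hub_download(repo_id='your-org/vit-audio-emotion', filename='best_model.pth')
model.load_state_dict(torch.load(checkpoint_path, map_location='cpu'))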

Audio Preprocessing

import librosa
import numpy as np
from PIL import Image
import torch
from torchvision import transforms

def preprocess_audio(audio_path):
    # Load audio
    y, sr = librosa.load(audio_path, sr=22050, duration=30)
    
    # Generate mel spectrogram: shape (128 mel bins, ~1290 frames for 30 s at hop 512)
    mel_spec = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=128, hop_length=512, n_fft=2048
    )
    mel_db = librosa.power_to_db(mel_spec, ref=np.max)
    
    # Normalize to 0-255 for RGB conversion (epsilon guards against silent, constant input)
    mel_normalized = ((mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8) * 255).astype(np.uint8)
    
    # Convert to RGB image
    image = Image.fromarray(mel_normalized).convert('RGB')
    image = image.resize((224, 224))
    
    # Apply ImageNet normalization
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
    
    return transform(image).unsqueeze(0)

# Process audio
audio_tensor = preprocess_audio('your_audio.mp3')

# Predict emotions
with torch.no_grad():
    predictions = model(audio_tensor)
    valence, arousal = predictions[0].tolist()

print(f"Valence: {valence:.3f}, Arousal: {arousal:.3f}")

Emotion Quadrant Mapping

def classify_emotion(valence, arousal):
    """Map a (valence, arousal) pair to its circumplex quadrant label."""
    if valence >= 0 and arousal >= 0:
        return "HAPPY" if valence > arousal else "EXCITED"
    elif valence >= 0 and arousal < 0:
        return "CALM" if abs(arousal) > valence else "CONTENT"
    elif valence < 0 and arousal < 0:
        return "SAD" if abs(valence) > abs(arousal) else "BORED"
    else:  # valence < 0 and arousal >= 0
        return "TENSE" if arousal > abs(valence) else "ANGRY"

Model Details

  • Parameters: ~86.8M
  • Model Size: ~331 MB
  • Framework: PyTorch
  • Base Architecture: ViT-Base (12 layers, 768 hidden, 12 heads)
  • Custom Head: 3-layer MLP with GELU activations
  • Training Data: Custom audio emotion dataset
  • Training: Fine-tuned with MSE loss on valence-arousal targets (sketched below)
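
The training code itself is not published with this card. As a minimal sketch of the stated objective, fine-tuning reduces to MSE regression on (valence, arousal) pairs; train_loader below is an assumed DataLoader yielding (pixel_values, targets) batches with targets in [-1, 1]²:

import torch
import torch.nn as nn

model = ViTForEmotionRegression()                           # class from the Usage section
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # learning rate is an assumption
criterion = nn.MSELoss()

model.train()
for pixel_values, targets in train_loader:                  # assumed DataLoader, not provided
    optimizer.zero_grad()
    predictions = model(pixel_values)                       # (batch, 2), bounded by Tanh
    loss = criterion(predictions, targets)
    loss.backward()
    optimizer.step()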

Emotion Space

The model predicts emotions in the 2D circumplex model:

        High Arousal
             |
    Angry  Tense  Excited
             |
Sad -------- + -------- Happy
             |
    Bored  Calm  Content
             |
         Low Arousal

  • Valence: Negative (unpleasant) ↔ Positive (pleasant)
  • Arousal: Low (calm) ↔ High (energetic)
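
Because the circumplex is naturally polar, it can also be useful to read a prediction as an intensity (distance from the neutral origin) plus an angle; a minimal helper, separate from the model code:

import math

def to_polar(valence, arousal):
    # Distance from neutral (0, 0); ranges from 0 up to sqrt(2) at a corner
    intensity = math.hypot(valence, arousal)
    # Angle in degrees: 0 = pure positive valence, 90 = pure high arousal
    angle = math.degrees(math.atan2(arousal, valence))
    return intensity, angle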

Performance

The model outputs continuous predictions that can be:

  • Used directly for emotion intensity analysis
  • Mapped to discrete emotion categories via classify_emotion above
  • Visualized on emotion quadrant plots (see the sketch below)
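
A minimal plotting sketch for the quadrant view (matplotlib is assumed here; it is not in the prerequisites above):

import matplotlib.pyplot as plt

def plot_emotion(valence, arousal):
    # Place a single prediction on the valence-arousal plane
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.axhline(0, color='gray', linewidth=0.5)
    ax.axvline(0, color='gray', linewidth=0.5)
    ax.scatter([valence], [arousal], color='red')
    ax.set_xlim(-1, 1)
    ax.set_ylim(-1, 1)
    ax.set_xlabel('Valence')
    ax.set_ylabel('Arousal')
    ax.set_title(classify_emotion(valence, arousal))
    plt.show()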

Limitations

  • Trained on music/audio; performance may vary on speech
  • Requires mel spectrogram preprocessing (no raw-waveform input)
  • Only the first 30 seconds of each file are used (see the chunking sketch below)
  • Possible cultural bias inherited from the training data
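
For files longer than 30 seconds, one workaround is to average predictions over consecutive 30-second windows. A sketch reusing the mel pipeline from preprocess_audio; waveform_to_tensor is a hypothetical helper introduced here, not part of the model code:

import numpy as np
import librosa
import torch
from PIL import Image
from torchvision import transforms

_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def waveform_to_tensor(y, sr):
    # Hypothetical helper: same mel pipeline as preprocess_audio, starting from a waveform
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=512, n_fft=2048)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    mel_u8 = ((mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8) * 255).astype(np.uint8)
    image = Image.fromarray(mel_u8).convert('RGB').resize((224, 224))
    return _transform(image).unsqueeze(0)

def predict_long_audio(model, audio_path, chunk_s=30, sr=22050):
    # Average valence/arousal over consecutive 30-second windows
    total = librosa.get_duration(path=audio_path)  # librosa >= 0.10 uses path=
    preds = []
    for offset in np.arange(0.0, total, chunk_s):
        y, _ = librosa.load(audio_path, sr=sr, offset=float(offset), duration=chunk_s)
        if len(y) < sr:  # skip trailing fragments shorter than 1 second
            continue
        with torch.no_grad():
            preds.append(model(waveform_to_tensor(y, sr))[0])
    return torch.stack(preds).mean(dim=0).tolist()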

Citation

@misc{sentio-vit-emotion,
  title={Vision Transformer for Audio Emotion Recognition},
  author={SentioApp Team},
  year={2025},
  publisher={HuggingFace}
}

License

MIT License
