voc2vec-hubert-ls-pt

voc2vec is a foundation model specifically designed for non-verbal human vocalizations.

We pre-trained a HuBERT-like model on a collection of 10 datasets covering around 125 hours of non-verbal audio.

Model description

voc2vec-hubert is built upon the HuBERT framework and follows its pre-training setup. The pre-training datasets include: AudioSet (vocalization), FreeSound (babies), HumanVoiceDataset, NNIME, NonSpeech7K, ReCANVo, SingingDatabase, TUT (babies), VocalSketch, VocalSound. This model continues pre-training from a model that was initially trained on the LibriSpeech dataset.
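
For illustration only, the sketch below shows how such a continued pre-training run could be initialized with the transformers library. The base checkpoint name (facebook/hubert-base-ls960) is an assumption, and HuBERT's masked-prediction training loop over clustered targets is not shown.

```python
from transformers import HubertModel

# Assumed starting point: a HuBERT base model pre-trained on LibriSpeech
# (the exact checkpoint used by the authors is not stated here)
base_model = HubertModel.from_pretrained("facebook/hubert-base-ls960")

# Continued pre-training would then optimize HuBERT's masked-prediction
# objective on the ~125 hours of non-verbal audio (Voc125); that training
# loop is outside the scope of this sketch.
print(base_model.config.hidden_size)  # 768 for the base architecture
```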

Task and datasets description

We evaluate voc2vec-hubert-ls-pt on six datasets: ASVP-ESD, ASVP-ESD (babies), CNVVE, NonVerbal Vocalization Dataset, Donate a Cry, VIVAE.

This is currently the best-performing released model in the voc2vec collection.

The following table reports the average performance in terms of Unweighted Average Recall (UAR) and F1 Macro across the six datasets described above.

| Model | Architecture | Pre-training DS | UAR | F1 Macro |
|---|---|---|---|---|
| voc2vec | wav2vec 2.0 | Voc125 | .612±.212 | .580±.230 |
| voc2vec-as-pt | wav2vec 2.0 | AudioSet + Voc125 | .603±.183 | .574±.194 |
| voc2vec-ls-pt | wav2vec 2.0 | LibriSpeech + Voc125 | .661±.206 | .636±.223 |
| voc2vec-hubert-ls-pt | HuBERT | LibriSpeech + Voc125 | .696±.189 | .678±.200 |
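
For clarity, UAR is the macro-averaged (unweighted) recall over classes. Below is a minimal sketch of how both metrics can be computed with scikit-learn, using placeholder labels and predictions rather than values from the paper.

```python
from sklearn.metrics import recall_score, f1_score

# Placeholder ground-truth labels and predictions for a 3-class task
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Unweighted Average Recall (UAR): per-class recall, averaged without class weighting
uar = recall_score(y_true, y_pred, average="macro")

# F1 Macro: per-class F1 scores, averaged without class weighting
f1_macro = f1_score(y_true, y_pred, average="macro")

print(f"UAR: {uar:.3f}, F1 Macro: {f1_macro:.3f}")
```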

Available Models

| Model | Description | Link |
|---|---|---|
| voc2vec | Pre-trained model on 125 hours of non-verbal audio. | 🔗 Model |
| voc2vec-as-pt | Continues pre-training from a wav2vec2-like model that was initially trained on the AudioSet dataset. | 🔗 Model |
| voc2vec-ls-pt | Continues pre-training from a wav2vec2-like model that was initially trained on the LibriSpeech dataset. | 🔗 Model |
| voc2vec-hubert-ls-pt | Continues pre-training from a HuBERT-like model that was initially trained on the LibriSpeech dataset. | 🔗 Model |

Usage examples

You can use the model directly in the following manner:

```python
import torch
import librosa
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

# Load an audio file, resampled to the 16 kHz rate expected by the model
audio_array, sr = librosa.load("path_to_audio.wav", sr=16000)

# Load model and feature extractor
model = AutoModelForAudioClassification.from_pretrained("alkiskoudounas/voc2vec-hubert-ls-pt")
feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec-hubert-ls-pt")
model.eval()

# Extract features
inputs = feature_extractor(audio_array.squeeze(), sampling_rate=feature_extractor.sampling_rate, padding=True, return_tensors="pt")

# Compute logits
with torch.no_grad():
    logits = model(**inputs).logits
```
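
Note that, since the released checkpoint is a pre-trained backbone, the classification head attached by `AutoModelForAudioClassification` is randomly initialized and must be fine-tuned on a downstream task before its logits are meaningful. Alternatively, the model can be used purely as a feature extractor; the following is a minimal sketch (not part of the original usage example) that mean-pools the backbone's hidden states into one embedding per clip.

```python
import torch
from transformers import AutoModel

# Load only the pre-trained backbone (no classification head)
backbone = AutoModel.from_pretrained("alkiskoudounas/voc2vec-hubert-ls-pt")
backbone.eval()

with torch.no_grad():
    # Reuse the `inputs` computed above; the output has shape (batch, frames, hidden_size)
    hidden_states = backbone(**inputs).last_hidden_state

# Mean-pool over the time axis to obtain a single embedding per clip,
# which can feed a lightweight downstream classifier
embeddings = hidden_states.mean(dim=1)
```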

BibTeX entry and citation info

```bibtex
@INPROCEEDINGS{koudounas2025icassp,
  author={Koudounas, Alkis and La Quatra, Moreno and Siniscalchi, Sabato Marco and Baralis, Elena},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={voc2vec: A Foundation Model for Non-Verbal Vocalization},
  year={2025},
  pages={1-5},
  keywords={Pediatrics;Accuracy;Foundation models;Benchmark testing;Signal processing;Data models;Acoustics;Speech processing;Nonverbal vocalization;Representation Learning;Self-Supervised Models;Pre-trained Models},
  doi={10.1109/ICASSP49660.2025.10890672}}
```