voc2vec-hubert-ls-pt
voc2vec is a foundation model specifically designed for non-verbal human vocalizations. We pre-trained a HuBERT-like model on a collection of 10 datasets covering roughly 125 hours of non-verbal audio.
Model description
voc2vec-hubert is built on the HuBERT framework and follows its pre-training setup. The pre-training corpus combines 10 datasets: AudioSet (vocalization), FreeSound (babies), HumanVoiceDataset, NNIME, NonSpeech7K, ReCANVo, SingingDatabase, TUT (babies), VocalSketch, and VocalSound. Pre-training continues from a HuBERT checkpoint initially trained on the LibriSpeech dataset.
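If you only need the self-supervised representations, e.g. as input to a downstream classifier, the backbone can be loaded without a classification head. Below is a minimal sketch, assuming the Transformers `AutoModel` API and the checkpoint id used in the usage section; `"clip.wav"` is a placeholder path:

```python
import torch
import librosa
from transformers import AutoModel, AutoFeatureExtractor

# Load the pre-trained backbone (no classification head).
model = AutoModel.from_pretrained("alkiskoudounas/voc2vec-hubert-ls-pt")
feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec-hubert-ls-pt")

# "clip.wav" is a placeholder; the model expects 16 kHz mono audio.
audio, sr = librosa.load("clip.wav", sr=16000)
inputs = feature_extractor(audio, sampling_rate=sr, return_tensors="pt")

# Frame-level representations: (batch, frames, hidden_size).
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state
```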
Task and datasets description
We evaluate voc2vec-hubert-ls-pt on six datasets: ASVP-ESD, ASVP-ESD (babies), CNVVE, NonVerbal Vocalization Dataset, Donate a Cry, and VIVAE.
This is currently the best-performing released model in the voc2vec collection.
The following table reports the average performance in terms of Unweighted Average Recall (UAR) and F1 Macro across the six datasets described above.
| Model | Architecture | Pre-training DS | UAR | F1 Macro |
|---|---|---|---|---|
| voc2vec | wav2vec 2.0 | Voc125 | .612±.212 | .580±.230 |
| voc2vec-as-pt | wav2vec 2.0 | AudioSet + Voc125 | .603±.183 | .574±.194 |
| voc2vec-ls-pt | wav2vec 2.0 | LibriSpeech + Voc125 | .661±.206 | .636±.223 |
| **voc2vec-hubert-ls-pt** | HuBERT | LibriSpeech + Voc125 | **.696±.189** | **.678±.200** |
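UAR is the unweighted mean of per-class recall (equivalent to balanced accuracy), and F1 Macro is the unweighted mean of per-class F1 scores. As a reference, here is a minimal sketch of how both metrics can be computed with scikit-learn; the label arrays are hypothetical:

```python
from sklearn.metrics import balanced_accuracy_score, f1_score

# Hypothetical ground-truth and predicted class labels.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

uar = balanced_accuracy_score(y_true, y_pred)         # unweighted average recall
f1_macro = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
print(f"UAR: {uar:.3f}, F1 Macro: {f1_macro:.3f}")
```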
Available Models
| Model | Description | Link |
|---|---|---|
| voc2vec | Pre-trained on 125 hours of non-verbal audio. | [🔗 Model](https://huggingface.co/alkiskoudounas/voc2vec) |
| voc2vec-as-pt | Continues pre-training from a wav2vec 2.0-like model initially trained on the AudioSet dataset. | [🔗 Model](https://huggingface.co/alkiskoudounas/voc2vec-as-pt) |
| voc2vec-ls-pt | Continues pre-training from a wav2vec 2.0-like model initially trained on the LibriSpeech dataset. | [🔗 Model](https://huggingface.co/alkiskoudounas/voc2vec-ls-pt) |
| voc2vec-hubert-ls-pt | Continues pre-training from a HuBERT-like model initially trained on the LibriSpeech dataset. | [🔗 Model](https://huggingface.co/alkiskoudounas/voc2vec-hubert-ls-pt) |
Usage examples
You can use the model directly in the following manner:
```python
import torch
import librosa
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

# Load an audio file, resampled to the 16 kHz rate the model expects
audio_array, sr = librosa.load("path_to_audio.wav", sr=16000)

# Load model and feature extractor
model = AutoModelForAudioClassification.from_pretrained("alkiskoudounas/voc2vec-hubert-ls-pt")
feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec-hubert-ls-pt")

# Extract features
inputs = feature_extractor(audio_array.squeeze(), sampling_rate=feature_extractor.sampling_rate, padding=True, return_tensors="pt")

# Compute logits (no gradients needed for inference)
with torch.no_grad():
    logits = model(**inputs).logits
```
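Note that `AutoModelForAudioClassification` attaches a randomly initialized classification head to this checkpoint, so the logits are only meaningful after fine-tuning on a labeled downstream task. Assuming a fine-tuned model, a prediction can be decoded as in this sketch (the label mapping depends on your fine-tuning setup):

```python
# Turn logits into probabilities and pick the top class.
# id2label is only meaningful after fine-tuning on labeled data.
probs = torch.softmax(logits, dim=-1)
predicted_id = probs.argmax(dim=-1).item()
print(model.config.id2label.get(predicted_id, predicted_id))
```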
BibTeX entry and citation info
```bibtex
@INPROCEEDINGS{koudounas2025icassp,
  author={Koudounas, Alkis and La Quatra, Moreno and Siniscalchi, Sabato Marco and Baralis, Elena},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={voc2vec: A Foundation Model for Non-Verbal Vocalization},
  year={2025},
  pages={1-5},
  doi={10.1109/ICASSP49660.2025.10890672}}
```