voc2vec-ls-pt
voc2vec is a foundation model specifically designed for non-verbal human vocalizations.
We pre-trained a wav2vec 2.0-style model on a collection of 10 datasets covering around 125 hours of non-verbal audio.
Model description
voc2vec is built upon the wav2vec 2.0 framework and follows its pre-training setup. The pre-training datasets include: AudioSet (vocalization), FreeSound (babies), HumanVoiceDataset, NNIME, NonSpeech7K, ReCANVo, SingingDatabase, TUT (babies), VocalSketch, and VocalSound. voc2vec-ls-pt continues pre-training from a checkpoint that was initially trained on the LibriSpeech dataset.
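Since the backbone is a standard wav2vec 2.0 encoder, the checkpoint can also be used as a generic feature extractor for downstream vocalization tasks. The snippet below is a minimal sketch of that use, assuming the repository id from the usage example further down; the audio path is a placeholder, and mean pooling is just one common way to obtain a clip-level embedding.

```python
import librosa
import torch
from transformers import AutoFeatureExtractor, AutoModel

# Load the pre-trained backbone as a plain wav2vec 2.0 encoder (no classification head)
model = AutoModel.from_pretrained("alkiskoudounas/voc2vec-ls-pt")
feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec-ls-pt")

# Load a 16 kHz mono recording of a non-verbal vocalization (placeholder path)
audio_array, sr = librosa.load("path_to_audio.wav", sr=16000)
inputs = feature_extractor(audio_array, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Frame-level representations of shape (batch, frames, hidden_size);
# averaging over time gives a single clip-level embedding
frame_embeddings = outputs.last_hidden_state
clip_embedding = frame_embeddings.mean(dim=1)
```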
Task and datasets description
We evaluate voc2vec-ls-pt on six datasets: ASVP-ESD, ASVP-ESD (babies), CNVVE, NonVerbal Vocalization Dataset, Donate a Cry, VIVAE.
The following table reports the average performance in terms of Unweighted Average Recall (UAR) and F1 Macro across the six datasets described above; a short sketch of how these two metrics can be computed follows the table.
Model | Architecture | Pre-training Data | UAR | F1 Macro |
---|---|---|---|---|
voc2vec | wav2vec 2.0 | Voc125 | .612±.212 | .580±.230 |
voc2vec-as-pt | wav2vec 2.0 | AudioSet + Voc125 | .603±.183 | .574±.194 |
voc2vec-ls-pt | wav2vec 2.0 | LibriSpeech + Voc125 | .661±.206 | .636±.223 |
voc2vec-hubert-ls-pt | HuBERT | LibriSpeech + Voc125 | .696±.189 | .678±.200 |
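For reference, both metrics can be computed with scikit-learn: UAR is the unweighted (macro) average of per-class recall, and F1 Macro is the unweighted average of per-class F1 scores. The sketch below uses toy label ids that are purely illustrative and not taken from the benchmark datasets.

```python
from sklearn.metrics import f1_score, recall_score

# Toy ground-truth and predicted class ids (illustrative only)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# UAR: macro-averaged recall over classes
uar = recall_score(y_true, y_pred, average="macro")

# F1 Macro: macro-averaged F1 over classes
f1_macro = f1_score(y_true, y_pred, average="macro")

print(f"UAR: {uar:.3f}, F1 Macro: {f1_macro:.3f}")
```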
Available Models
Model | Description | Link |
---|---|---|
voc2vec | Pre-trained model on 125 hours of non-verbal audio. | 🔗 Model |
voc2vec-as-pt | Continues pre-training from a wav2vec2-like model that was initially trained on the AudioSet dataset. | 🔗 Model |
voc2vec-ls-pt | Continues pre-training from a wav2vec2-like model that was initially trained on the LibriSpeech dataset. | 🔗 Model |
voc2vec-hubert-ls-pt | Continues pre-training from a HuBERT-like model that was initially trained on the LibriSpeech dataset. | 🔗 Model |
Usage examples
You can use the model directly in the following manner:
```python
import librosa
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# Load an audio file, resampled to the 16 kHz rate expected by the model
audio_array, sr = librosa.load("path_to_audio.wav", sr=16000)

# Load model and feature extractor
model = AutoModelForAudioClassification.from_pretrained("alkiskoudounas/voc2vec-ls-pt")
feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec-ls-pt")

# Extract features
inputs = feature_extractor(audio_array.squeeze(), sampling_rate=feature_extractor.sampling_rate, padding=True, return_tensors="pt")

# Compute logits (no gradients needed for inference)
with torch.no_grad():
    logits = model(**inputs).logits
```
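To turn the logits into a prediction, you can apply a softmax and look up the label mapping, as in the sketch below, which continues from the snippet above. Note that `model.config.id2label` only contains meaningful class names after the classification head has been fine-tuned on a labelled vocalization dataset; on the raw pre-trained checkpoint the head and its labels are placeholders.

```python
import torch

# Convert logits to class probabilities and pick the most likely class
probabilities = torch.softmax(logits, dim=-1)
predicted_id = int(probabilities.argmax(dim=-1))

# id2label is only meaningful after fine-tuning with task-specific labels
print(model.config.id2label[predicted_id], float(probabilities.max()))
```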
BibTeX entry and citation info
@INPROCEEDINGS{koudounas2025icassp,
  author={Koudounas, Alkis and La Quatra, Moreno and Siniscalchi, Sabato Marco and Baralis, Elena},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={voc2vec: A Foundation Model for Non-Verbal Vocalization},
  year={2025},
  pages={1-5},
  keywords={Pediatrics;Accuracy;Foundation models;Benchmark testing;Signal processing;Data models;Acoustics;Speech processing;Nonverbal vocalization;Representation Learning;Self-Supervised Models;Pre-trained Models},
  doi={10.1109/ICASSP49660.2025.10890672}}