Model adapted from: https://github.com/facebookresearch/speech-resynthesis

This repository contains a VQ-VAE model trained to generate high-quality joint vector embeddings of the F0 and energy features of speech,
introduced in the paper https://www.isca-archive.org/interspeech_2025/portes25_interspeech.html.
The preprocessing strategy used in this repository is Normalization + Voicedness mask.

For Interpolation, see https://huggingface.co/MU-NLPC/F0_Energy_joint_VQVAE_embeddings-interp
For Interpolation + Normalization, see https://huggingface.co/MU-NLPC/F0_Energy_joint_VQVAE_embeddings-norm_interp
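As a rough illustration of the "Normalization + Voicedness mask" strategy named above, the sketch below z-score-normalizes an F0 contour using voiced-frame statistics only and zeroes out unvoiced frames. This is a hypothetical reconstruction, not the repository's actual preprocessing code; the function name, the `f0 > 0` voicedness convention, and the exact normalization details are assumptions.

```python
import numpy as np

def preprocess_f0_energy(f0, energy, eps=1e-8):
    """Hypothetical sketch of Normalization + Voicedness mask.

    Assumes the F0 tracker emits 0 for unvoiced frames; statistics
    for F0 are computed over voiced frames only, and unvoiced
    frames are masked to zero.
    """
    f0 = np.asarray(f0, dtype=np.float32)
    energy = np.asarray(energy, dtype=np.float32)
    voiced = f0 > 0  # voicedness mask (assumed convention)

    # Z-score F0 using voiced-frame statistics; unvoiced stays 0.
    f0_norm = np.zeros_like(f0)
    if voiced.any():
        mu, sigma = f0[voiced].mean(), f0[voiced].std() + eps
        f0_norm[voiced] = (f0[voiced] - mu) / sigma

    # Z-score energy over all frames.
    energy_norm = (energy - energy.mean()) / (energy.std() + eps)
    return f0_norm, energy_norm, voiced
```

The two normalized contours (plus the mask) would then be stacked frame-wise as the joint input to the VQ-VAE.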

The script for running the model is included in the generate_embeddings.py file.

To use the model, clone this repository and create a virtual environment from the pyproject.toml file,
for example by running:

poetry install

Then, in the generate_embeddings.py script, select the dataset, uncomment the

#trust_remote_code=True

lines, and run the script:

poetry run python generate_embeddings.py

Note: While the model was trained on audio sampled at 16 kHz, performance appears to be consistent for 24 kHz audio as well. Use at your own discretion.
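If you do want to match the 16 kHz training rate exactly, audio can be resampled beforehand. A minimal sketch using SciPy's polyphase resampler (the helper name and the choice of `resample_poly` are illustrative, not part of this repository):

```python
import numpy as np
from scipy.signal import resample_poly

def to_16khz(audio, sr):
    """Resample a 1-D audio signal to 16 kHz (illustrative helper).

    Uses a polyphase filter with the rational up/down factors
    derived from the source rate, e.g. 24 kHz -> 16 kHz is 2/3.
    """
    target = 16000
    if sr == target:
        return np.asarray(audio)
    g = np.gcd(int(sr), target)
    return resample_poly(audio, target // g, int(sr) // g)
```

Any equivalent resampler (e.g. torchaudio or librosa) would work just as well.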

Model size: 200k parameters (F32, Safetensors)