Model adapted from: https://github.com/facebookresearch/speech-resynthesis

This repository contains a VQ-VAE model trained to generate high-quality joint vector embeddings of the F0 and energy features of speech,
published in the paper https://www.isca-archive.org/interspeech_2025/portes25_interspeech.html.
The preprocessing strategy used in this repository is Interpolation.
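As a rough illustration of what an Interpolation preprocessing step typically does, the sketch below linearly interpolates F0 across unvoiced frames (where F0 is zero), so the contour becomes continuous before embedding. This is a hypothetical sketch, not the repository's actual implementation; the function name and the zero-means-unvoiced convention are assumptions.

```python
import numpy as np

def interpolate_f0(f0):
    """Linearly interpolate F0 over unvoiced frames (f0 == 0).

    Hypothetical sketch of an 'Interpolation' preprocessing strategy;
    the repository's actual implementation may differ.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        return f0  # no voiced frames to interpolate from
    idx = np.arange(len(f0))
    # np.interp holds the edge values constant outside the voiced range
    return np.interp(idx, idx[voiced], f0[voiced])

print(interpolate_f0([0, 100, 0, 0, 200, 0]))
# unvoiced gaps are filled along a straight line between voiced frames
```

Energy contours usually need no such gap-filling, since energy is defined for every frame.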

For Normalization + Voicedness mask, see https://huggingface.co/MU-NLPC/F0_Energy_joint_VQVAE_embeddings
For Interpolation + Normalization, see https://huggingface.co/MU-NLPC/F0_Energy_joint_VQVAE_embeddings-norm_interp

The script for running the model is included in the generate_embeddings.py file.

To use the model, clone this repository and create a virtual environment from the pyproject.toml file,
for example by running:

poetry install

Then, in the generate_embeddings.py script, select the dataset, uncomment the

#trust_remote_code=True

lines where needed, and run the script:

poetry run python generate_embeddings.py

Note: While the model was trained on audio sampled at 16 kHz, performance appears to be consistent for 24 kHz audio as well. Use at your own discretion.
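If you prefer to match the training conditions exactly, audio at other rates can be resampled to 16 kHz before feature extraction. A minimal sketch using SciPy's polyphase resampler (the helper name and rates are illustrative, not part of this repository):

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def to_16khz(audio, sr):
    """Resample a 1-D waveform to 16 kHz, the model's training rate.

    Hypothetical helper; this repository may handle resampling differently.
    """
    if sr == 16000:
        return audio
    g = gcd(sr, 16000)
    # polyphase resampling by the rational factor 16000/sr
    return resample_poly(audio, 16000 // g, sr // g)

wav_24k = np.random.randn(24000)  # 1 second of 24 kHz audio
wav_16k = to_16khz(wav_24k, 24000)
print(len(wav_16k))  # -> 16000
```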
