FastPitch and HiFi-GAN v2.0

v2.0 of the phonemizer and tokenizer. The tokenizer supports pauses, emotion tokens, etc.
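A minimal sketch of how a tokenizer with pause and emotion tokens might map a phoneme sequence to IDs. The token names (`<pau_short>`, `<happy>`, etc.) and the `tokenize` helper are illustrative assumptions, not the actual v2.0 inventory or NeMo API:

```python
# Hypothetical special-token sets; the real v2.0 token inventory may differ.
PAUSE_TOKENS = {"<pau_short>", "<pau_long>"}
EMOTION_TOKENS = {"<neutral>", "<happy>", "<sad>"}

def tokenize(symbols, vocab):
    """Map phoneme and special-token strings to integer IDs,
    dropping anything not in the vocabulary."""
    return [vocab[s] for s in symbols if s in vocab]

# Toy vocabulary: a few phonemes plus the special tokens.
vocab = {t: i for i, t in enumerate(
    ["a", "b", "k"] + sorted(PAUSE_TOKENS) + sorted(EMOTION_TOKENS))}

ids = tokenize(["<happy>", "a", "b", "<pau_short>", "k"], vocab)
```

The key point is that pauses and emotion markers live in the same vocabulary as phonemes, so they flow through the model as ordinary input tokens.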

Install NeMo

apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
rm -rf /usr/lib/python3.10/site-packages/blinker*
rm -rf /usr/local/lib/python3.10/dist-packages/blinker*
pip install --ignore-installed blinker
pip install --upgrade --force-reinstall blinker

git clone https://github.com/SadeghKrmi/NeMo.git
cd NeMo
pip install -e '.[all]'

Deterministic split

Run deterministic-train-test-split.py to split the dataset into train and test sets.
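One common way to make such a split deterministic is to hash each utterance ID, so the same file always lands in the same split regardless of run order or machine. This is a sketch under that assumption, not the actual contents of deterministic-train-test-split.py:

```python
import hashlib

def assign_split(utt_id, test_fraction=0.1):
    """Deterministically assign an utterance to 'train' or 'test' by hashing
    its ID, so the split is reproducible across runs and machines."""
    h = int(hashlib.md5(utt_id.encode("utf-8")).hexdigest(), 16)
    return "test" if (h % 1000) < test_fraction * 1000 else "train"
```

Because the assignment depends only on the ID, re-running the script on the same manifest always reproduces the same train/test partition.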

Extract the supportive data

Using the following script, extract the pitch statistics and supplementary data:

tar -xzf dataset_splits.tar.gz

cd extract-supportive-data
HYDRA_FULL_ERROR=1 python3 ./scripts/extract_sup_data.py \
        --config-path ../config/fastpitch/ \
        --config-name ds_for_fastpitch_align.yaml \
        manifest_filepath=./dataset_splits/train/train.jsonl \
        sup_data_path=sup_data \
        phoneme_dict_path=./persian-dict/persian-v4.0.dict \
        ++dataloader_params.num_workers=8
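Conceptually, the pitch statistics reported below are simple aggregates over the extracted F0 values. A minimal sketch of that aggregation, assuming a flat list of voiced-frame F0 values in Hz (the real extract_sup_data.py works per-utterance inside NeMo):

```python
import statistics

def pitch_stats(f0_values):
    """Aggregate pitch statistics over voiced F0 frames (in Hz).
    Assumes unvoiced frames have already been removed."""
    return {
        "PITCH_MEAN": statistics.fmean(f0_values),
        "PITCH_STD": statistics.pstdev(f0_values),
        "PITCH_MIN": min(f0_values),
        "PITCH_MAX": max(f0_values),
    }
```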

Dataset supplementary pitch statistics:

PITCH_MEAN=98.72935485839844, PITCH_STD=29.40760040283203, PITCH_MIN=65.4063949584961, PITCH_MAX=2093.004638671875
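These mean and standard-deviation values are typically used to standardize pitch targets during FastPitch training. A minimal sketch of that normalization using the statistics above (the helper name is an assumption, not a NeMo function):

```python
# Pitch statistics extracted from this dataset's sup_data.
PITCH_MEAN = 98.72935485839844
PITCH_STD = 29.40760040283203

def normalize_pitch(f0_hz):
    """Standardize an F0 value (Hz) to zero mean and unit variance."""
    return (f0_hz - PITCH_MEAN) / PITCH_STD
```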

Archive and download

tar -czf sup_data.tar.gz sup_data

Training FastPitch

Trained for about 800 epochs with a CosineAnnealing scheduler and max_steps=200,000, so the learning rate decays over time.
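A minimal sketch of what a cosine-annealing schedule computes at a given step, ignoring warmup and using an assumed peak learning rate (lr_max here is illustrative, not the value used in training):

```python
import math

def cosine_annealing_lr(step, max_steps=200_000, lr_max=1e-3, lr_min=0.0):
    """Learning rate under cosine annealing: decays smoothly from
    lr_max at step 0 to lr_min at max_steps, then stays at lr_min."""
    step = min(step, max_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / max_steps))
```

The smooth decay is why max_steps matters: the schedule reaches its floor exactly at step 200,000, so total training length and decay horizon are tied together.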

val_loss did not decrease below about 0.77, where

val_loss = mel_loss + dur_loss + pitch_loss + energy_loss

Training HiFiGAN

Trained for about 40 epochs; training was stopped based on quality checks by listening to the generated audio.
