Hey, would it be possible for you to train a low and a mid model? My HA instance takes way too long to process the high model :D And you already have the workflow; I'd have to work my way in from scratch :D
I just did. As with the original piper-voices (the thorsten voice, to be specific), low is just "medium" with a 16000 Hz sampling rate instead of 22050 Hz. For x-low, I couldn't find a previous checkpoint from thorsten-voice, and I have too few voice samples to train one from scratch. Of course, one could use another base for an x-low version, but for the moment I can only offer high and medium (= low).
I also have not tested these smaller models: I expect them to sound like the high model, just with lower quality, since they use the same training data.
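To illustrate, the only dataset-side difference is the target sampling rate. Here is a minimal sketch of resampling a corpus to 16 kHz with librosa/soundfile; the paths are made up, and Piper's own preprocessing can resample for you as well:

```python
# Resample a 22050 Hz dataset to 16000 Hz for a "low" quality run.
# Paths are hypothetical; adjust to your dataset layout.
from pathlib import Path

import librosa
import soundfile as sf

SRC_RATE = 22050   # rate used by the medium/high models
DST_RATE = 16000   # rate used by the low models

src_dir = Path("dataset/wavs_22050")
dst_dir = Path("dataset/wavs_16000")
dst_dir.mkdir(parents=True, exist_ok=True)

for wav_path in sorted(src_dir.glob("*.wav")):
    # librosa loads and resamples in one step when sr is given
    audio, _ = librosa.load(wav_path, sr=DST_RATE)
    sf.write(dst_dir / wav_path.name, audio, DST_RATE)
```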
Thank you, it's perfect. The medium model seems to work just as well as the high model, but it's much faster. With the low model you can hear a difference, but that's expected, I think.
How many epochs do you usually use for fine-tuning with the given dataset, on which hardware, and how much time does it take?
It all depends. According to the Piper training manual, fine-tuning a pre-trained model takes roughly ~2000 epochs. I used about 2500-3000. A good figure of merit for when to stop training is when the loss barely decreases anymore.
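That stopping criterion can be made mechanical. A small sketch of a plateau check, assuming you log one loss value per epoch; the window size and threshold are arbitrary assumptions, not Piper defaults:

```python
def loss_plateaued(losses, window=100, min_rel_improvement=0.01):
    """Return True once the loss has stopped decreasing meaningfully.

    losses: one loss value per epoch, oldest first.
    Compares the mean of the last `window` epochs against the mean of
    the `window` epochs before that.
    """
    if len(losses) < 2 * window:
        return False  # not enough history yet
    prev = sum(losses[-2 * window:-window]) / window
    last = sum(losses[-window:]) / window
    # stop when the relative improvement drops below the threshold
    return (prev - last) / prev < min_rel_improvement

# Example: the loss flattens out after epoch 500, so training could stop.
history = [1.0 - 0.4 * min(i / 500, 1.0) for i in range(2000)]
print(loss_plateaued(history))  # True
```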
Regarding the hardware: it was an RTX 4000 (Ada). It has 20 GB of VRAM, but I only allocated around 8 GB so other GPU/AI tasks could run alongside.
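I won't claim this is how the cap was enforced here, but one way to limit a training process to ~8 of 20 GB is PyTorch's per-process memory fraction, set before the first CUDA allocation:

```python
import torch

# Cap this process at ~8 GB of the card's 20 GB (fraction of total VRAM).
# Must run before the first CUDA allocation to take effect.
if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(8 / 20, device=0)
```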
Fine-tuning with the ~1110 wav files for GLADOS took around 1 minute per epoch, so fine-tuning the base model (2500-3000 epochs) was done after ~2 days of continuous training.
Fine-tuning the GLADOS voice into the turret voice was much quicker, since I only had 47 good-quality wav files (the turret model suffered from this, though). That run was done after ~2-3 hours (again 3000 epochs).
Right now I am fairly happy with how well training worked. This was the first model I ever trained, and I learned a lot, especially about what I could have done better: I might have reached a faster descent by increasing the batch size (which requires more VRAM), or even better model quality by properly splitting the dataset into training and validation data.
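For the validation idea, a simple sketch of holding out part of an LJSpeech-style file list before training; the filenames and the 10% ratio are my own assumptions, and as far as I know Piper's trainer can also hold out validation data itself:

```python
# Hold out a validation set from an LJSpeech-style metadata.csv.
# Filenames and the 10% ratio are assumptions, not Piper requirements.
import random

random.seed(42)  # reproducible split

with open("metadata.csv", encoding="utf-8") as f:
    lines = f.read().splitlines()

random.shuffle(lines)
n_val = max(1, len(lines) // 10)  # hold out ~10% for validation

with open("metadata_val.csv", "w", encoding="utf-8") as f:
    f.write("\n".join(lines[:n_val]) + "\n")
with open("metadata_train.csv", "w", encoding="utf-8") as f:
    f.write("\n".join(lines[n_val:]) + "\n")
```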
We have a dataset of 19,000 utterances and train on 2x A6000 (non-Ada) with bs=64. One epoch here takes 3.5 minutes; our voice was trained for 768,000 steps, ~2400 epochs overall. Scaled to your GLADOS dataset, that would work out to ~8.5 h of fine-tuning time.
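For transparency, that estimate assumes the per-epoch time scales linearly with dataset size on the same hardware and batch size. A quick back-of-envelope check:

```python
# Back-of-envelope: scale our per-epoch time to the GLADOS dataset size.
# Assumes per-epoch time grows linearly with the number of utterances.
our_utterances = 19_000
our_minutes_per_epoch = 3.5
epochs = 2400

glados_utterances = 1110
glados_minutes = our_minutes_per_epoch * (glados_utterances / our_utterances) * epochs
print(f"{glados_minutes / 60:.1f} h")  # ~8.2 h, close to the ~8.5 h estimate
```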
Your training would probably be much faster if you gave it the full VRAM and increased the batch size. But these are interesting numbers. Thx for sharing!