Hey, would it be possible for you to train a low and a mid model? My HA instance takes way too long to process the high model :D And you already have the workflow; I'd have to work my way in from scratch :D
I just did. As with the original piper-voices (the thorsten voice, to be specific), low is just "medium" with a 16000 Hz sampling rate instead of 22050 Hz. For x-low, I couldn't find a previous checkpoint from thorsten-voice, and I have too few voice samples to train one from scratch. Of course, one could use another base for an x-low version, but for the moment I can only offer high and medium (= low).
I also have not tested these smaller models: I expect them to sound like the high model, just with lower quality, since they use the same training data.
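To illustrate, the only dataset-side difference is the target sampling rate. Here is a minimal sketch of resampling a corpus to 16 kHz with librosa/soundfile; the paths are made up, and Piper's own preprocessing can resample for you as well:

```python
# Resample a 22050 Hz dataset to 16000 Hz for a "low" quality run.
# Paths are hypothetical; adjust to your dataset layout.
from pathlib import Path

import librosa
import soundfile as sf

SRC_RATE = 22050   # rate used by the medium/high models
DST_RATE = 16000   # rate used by the low models

src_dir = Path("dataset/wavs_22050")
dst_dir = Path("dataset/wavs_16000")
dst_dir.mkdir(parents=True, exist_ok=True)

for wav_path in sorted(src_dir.glob("*.wav")):
    # librosa loads and resamples in one step when sr is given
    audio, _ = librosa.load(wav_path, sr=DST_RATE)
    sf.write(dst_dir / wav_path.name, audio, DST_RATE)
```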
Thank you, it's perfect. The medium model seems to work just as well as the high model, but it's much faster. With the low model you can hear a difference, but that's expected, I think.
How many epochs do you usually use for fine-tuning with the given dataset, on which hardware, and how much time does it take?
It all depends. According to the Piper training manual, fine-tuning a pre-trained model takes roughly ~2000 epochs. I used about 2500-3000. A good figure of merit for when to stop training is when the loss barely decreases anymore.
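That stopping criterion can be made mechanical. A small sketch of a plateau check, assuming you log one loss value per epoch; the window size and threshold are arbitrary assumptions, not Piper defaults:

```python
def loss_plateaued(losses, window=100, min_rel_improvement=0.01):
    """Return True once the loss has stopped decreasing meaningfully.

    losses: one loss value per epoch, oldest first.
    Compares the mean of the last `window` epochs against the mean of
    the `window` epochs before that.
    """
    if len(losses) < 2 * window:
        return False  # not enough history yet
    prev = sum(losses[-2 * window:-window]) / window
    last = sum(losses[-window:]) / window
    # stop when the relative improvement drops below the threshold
    return (prev - last) / prev < min_rel_improvement

# Example: the loss flattens out after epoch 500, so training could stop.
history = [1.0 - 0.4 * min(i / 500, 1.0) for i in range(2000)]
print(loss_plateaued(history))  # True
```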
Regarding the hardware: it was an RTX 4000 (Ada). It has 20 GB of VRAM, but I only allocated around 8 GB so other GPU/AI tasks could run alongside.
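I won't claim this is how the cap was enforced here, but one way to limit a training process to ~8 of 20 GB is PyTorch's per-process memory fraction, set before the first CUDA allocation:

```python
import torch

# Cap this process at ~8 GB of the card's 20 GB (fraction of total VRAM).
# Must run before the first CUDA allocation to take effect.
if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(8 / 20, device=0)
```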
Fine-tuning with the ~1110 wav files for GLADOS took around 1 minute per epoch, so fine-tuning the base model (2500-3000 epochs) was done after ~2 days of continuous training.
Fine-tuning the GLADOS voice into the turret voice was much quicker, since I only had 47 good-quality wav files (the turret model suffered from this, though). That run was done after ~2-3 hours (again 3000 epochs).
Right now I am fairly happy with how well training worked. This was the first model I ever trained, and I learned a lot, especially about what I could have done better: I might have reached a faster descent by increasing the batch size (which requires more VRAM), or even better model quality by properly splitting the dataset into training and validation data.
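For the validation idea, a simple sketch of holding out part of an LJSpeech-style file list before training; the filenames and the 10% ratio are my own assumptions, and as far as I know Piper's trainer can also hold out validation data itself:

```python
# Hold out a validation set from an LJSpeech-style metadata.csv.
# Filenames and the 10% ratio are assumptions, not Piper requirements.
import random

random.seed(42)  # reproducible split

with open("metadata.csv", encoding="utf-8") as f:
    lines = f.read().splitlines()

random.shuffle(lines)
n_val = max(1, len(lines) // 10)  # hold out ~10% for validation

with open("metadata_val.csv", "w", encoding="utf-8") as f:
    f.write("\n".join(lines[:n_val]) + "\n")
with open("metadata_train.csv", "w", encoding="utf-8") as f:
    f.write("\n".join(lines[n_val:]) + "\n")
```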
We have a dataset of 19,000 utterances and train on 2x A6000 (non-Ada) with bs=64. One epoch here takes 3.5 minutes; our voice was trained for 768,000 steps, ~2400 epochs overall. Scaled to your GLADOS dataset, that would work out to ~8.5 h of fine-tuning time.
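For transparency, that estimate assumes the per-epoch time scales linearly with dataset size on the same hardware and batch size. A quick back-of-envelope check:

```python
# Back-of-envelope: scale our per-epoch time to the GLADOS dataset size.
# Assumes per-epoch time grows linearly with the number of utterances.
our_utterances = 19_000
our_minutes_per_epoch = 3.5
epochs = 2400

glados_utterances = 1110
glados_minutes = our_minutes_per_epoch * (glados_utterances / our_utterances) * epochs
print(f"{glados_minutes / 60:.1f} h")  # ~8.2 h, close to the ~8.5 h estimate
```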
Your training would probably be much faster if you gave it the full VRAM and increased the batch size. But these are interesting numbers. Thx for sharing!