Pre-training dataset format?

#7
by asif00 - opened

Has anyone tried pre-training it for a new language?
I'm a bit confused about how the pre-training datasets should be structured.
text_QA_dataset: [I'm assuming this is for training the LLM]
TTS_dataset: [I'm assuming this is for training the TTS]
I'm unsure what format each of these should take. A sample or a dataset link for both types would be awesome! Thanks in advance!

Canopy Labs org

See the GitHub repo for more info: https://github.com/canopyai/Orpheus-TTS. There is also a similar issue there where I go over the format. Happy to go into more detail if anything is unclear (an issue posted on GitHub will probably be looked at sooner).

amuvarma changed discussion status to closed