Pre-training dataset format?
#7, opened by asif00
Did anyone try to pre-train it for a new language?
I'm a bit confused. What should the structure of the pre-training datasets be?
text_QA_dataset: [I'm assuming this is for training the LLM]
TTS_dataset: [This is for training the TTS]
I'm just unsure what their format should be. An example (sample) or dataset link for both types would be awesome! Thanks in advance!
See the GitHub repository for more info. There is also a similar issue there where I go over the format. Happy to go into more detail if anything is unclear (an issue posted on GitHub will probably be looked at sooner): https://github.com/canopyai/Orpheus-TTS
amuvarma changed discussion status to closed