Pre-training dataset format?

by asif00 - opened 6 days ago

Has anyone tried pre-training it for a new language?
I'm a bit confused about how the pre-training datasets should be structured:
text_QA_dataset: [I'm assuming this is for training the LLM]
TTS_dataset: [This is for training the TTS]
I'm just unsure what their format should be. An example (sample) or dataset link for both types would be awesome! Thanks in advance!

amuvarma

Canopy Labs org 6 days ago

See the GitHub repo for more info. There is also a similar issue there where I go over the format. Happy to go into more detail if anything is unclear (an issue posted on GitHub will probably be looked at sooner): https://github.com/canopyai/Orpheus-TTS

amuvarma changed discussion status to closed 6 days ago
