Data Formatting

#21
by moutasem - opened

Hi,

I noticed that the tokenizer does not define a chat template. How should I format my data?

Should it be something like this:

{'prompt': 'Remove all grammatical errors from this text: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.',
 'completion': 'For example, countries with a lot of deserts can transform their desert to increase their habitable land and use irrigation to provide clean water to the desert.'}

Or should I add special tokens like so?

{'prompt': '<|im_start|>user\nRemove all grammatical errors from this text: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.\n<|im_end|>\n<|im_start|>assistant\n',
 'completion': 'For example, countries with a lot of deserts can transform their desert to increase their habitable land and use irrigation to provide clean water to the desert.\n<|im_end|>'}
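
To make the second option concrete, here is a minimal sketch of how it could be derived from the plain prompt/completion form. It assumes the ChatML-style markers `<|im_start|>` and `<|im_end|>` are what the model expects, which is exactly the open question here. `wrap_example` is a hypothetical helper, not part of any library, and the sample sentence is made up for illustration.

```python
# Assumed ChatML-style markers; whether this tokenizer actually knows
# them as special tokens is not confirmed.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def wrap_example(example):
    """Wrap a plain prompt/completion pair in the markers (hypothetical helper)."""
    prompt = (
        f"{IM_START}user\n"
        f"{example['prompt']}\n"
        f"{IM_END}\n"
        f"{IM_START}assistant\n"
    )
    completion = f"{example['completion']}\n{IM_END}"
    return {"prompt": prompt, "completion": completion}


# Made-up sample record in the plain format.
plain = {
    "prompt": "Remove all grammatical errors from this text: I has a apple.",
    "completion": "I have an apple.",
}

wrapped = wrap_example(plain)
print(wrapped["prompt"] + wrapped["completion"])
```

If the markers are not in the tokenizer's vocabulary, they would be split into ordinary sub-tokens, so it may be worth checking how they tokenize before training on this form.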

Also, should I keep both the prompt and completion fields in the dataset and pass it to the SFTTrainer as is?

Thanks!!
