Data Formatting
#21, opened by moutasem
Hi,
I noticed that the tokenizer does not have a chat template. How should I format my data?
Should it be plain prompt/completion pairs, something like this:
{'prompt': 'Remove all grammatical errors from this text: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.',
'completion': 'For example, countries with a lot of deserts can transform their desert to increase their habitable land and use irrigation to provide clean water to the desert.'}
Or should I add special tokens like so?
{'prompt': '<|im_start|>user\nRemove all grammatical errors from this text: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.\n<|im_end|>\n<|im_start|>assistant\n',
'completion': 'For example, countries with a lot of deserts can transform their desert to increase their habitable land and use irrigation to provide clean water to the desert.\n<|im_end|>'}
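To make the two options concrete, here is a minimal sketch of how I would generate either variant from the same raw pair. The helper names (`format_plain`, `format_chatml`) and the short sample sentence are made up for illustration, and the `<|im_start|>` / `<|im_end|>` markers are simply copied from my example above; I have not verified that this model's tokenizer treats them as special tokens.

```python
# Markers copied from the example above. Assumption: it is not confirmed
# that the model's tokenizer actually has these as special tokens.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def format_plain(instruction: str, text: str, corrected: str) -> dict:
    """Option 1: plain prompt/completion pair, no special tokens."""
    return {"prompt": f"{instruction} {text}", "completion": corrected}


def format_chatml(instruction: str, text: str, corrected: str) -> dict:
    """Option 2: the same pair wrapped in ChatML-style markers."""
    prompt = (
        f"{IM_START}user\n"
        f"{instruction} {text}\n"
        f"{IM_END}\n"
        f"{IM_START}assistant\n"
    )
    completion = f"{corrected}\n{IM_END}"
    return {"prompt": prompt, "completion": completion}


# Illustrative sample only (not taken from the real dataset).
instruction = "Remove all grammatical errors from this text:"
plain = format_plain(instruction, "She go to school.", "She goes to school.")
chatml = format_chatml(instruction, "She go to school.", "She goes to school.")

print(plain)
print(chatml)
```

Either function returns a dict with the same two keys, so switching between the formats would only change the strings, not the dataset layout.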
Also, should I keep both the prompt and completion fields in the dataset and pass them to the SFTTrainer?
Thanks!!