|
Let's see an example: |
|
```python
from transformers import AutoTokenizer
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

chat1 = [
    {"role": "user", "content": "Which is bigger, the moon or the sun?"},
    {"role": "assistant", "content": "The sun."}
]
chat2 = [
    {"role": "user", "content": "Which is bigger, a virus or a bacterium?"},
    {"role": "assistant", "content": "A bacterium."}
]

dataset = Dataset.from_dict({"chat": [chat1, chat2]})
dataset = dataset.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["chat"], tokenize=False, add_generation_prompt=False)})
print(dataset['formatted_chat'][0])
```
|
And we get:

```text
|
<|user|>
Which is bigger, the moon or the sun?
<|assistant|>
The sun.
```
From here, just continue training like you would with a standard language modelling task, using the `formatted_chat` column.
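As a minimal sketch of that next step, you might tokenize the formatted chats so they can be fed to a standard language modelling trainer (the `max_length` value and the use of truncation here are illustrative assumptions, not requirements):

```python
# Tokenize the formatted chats for language modelling training.
# max_length=512 is an illustrative assumption; choose a value that
# suits your model's context window and your data.
tokenized_dataset = dataset.map(
    lambda x: tokenizer(x["formatted_chat"], truncation=True, max_length=512),
    remove_columns=["chat", "formatted_chat"],
)
```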
|
## Advanced: How do chat templates work?
|
The chat template for a model is stored on the `tokenizer.chat_template` attribute.
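You can inspect it by printing the attribute directly; the exact template string varies from model to model:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# The chat template is a string that describes how a list of messages
# is rendered into a single prompt for this model.
print(tokenizer.chat_template)
```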