Let's see an example:
```python
from transformers import AutoTokenizer
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

chat1 = [
    {"role": "user", "content": "Which is bigger, the moon or the sun?"},
    {"role": "assistant", "content": "The sun."}
]
chat2 = [
    {"role": "user", "content": "Which is bigger, a virus or a bacterium?"},
    {"role": "assistant", "content": "A bacterium."}
]

dataset = Dataset.from_dict({"chat": [chat1, chat2]})
# tokenize=False returns the formatted string rather than token IDs, and
# add_generation_prompt=False because each chat already ends with an assistant reply.
dataset = dataset.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["chat"], tokenize=False, add_generation_prompt=False)})
print(dataset['formatted_chat'][0])
```

And we get:

```text
<|user|>
Which is bigger, the moon or the sun?
<|assistant|>
The sun.
```

From here, just continue training like you would with a standard language modelling task, using the `formatted_chat` column.
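As a minimal sketch of that training step, assuming the `Trainer` API with a causal-LM collator (the `max_length`, batch size, and `output_dir` values below are placeholder assumptions, not recommendations):

```python
from transformers import (
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# The collator needs a pad token; fall back to EOS if the tokenizer has none.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize(example):
    # max_length=512 is a placeholder; pick what fits your data and memory.
    return tokenizer(example["formatted_chat"], truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="zephyr-chat-finetune", per_device_train_batch_size=1),
    train_dataset=tokenized_dataset,
    # mlm=False gives causal-LM labels (the model shifts them internally).
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```

Note that this computes loss over the entire formatted chat, including user turns; masking user turns so the model only learns from assistant replies is a common refinement, but out of scope here.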
## Advanced: How do chat templates work?
The chat template for a model is stored on the `tokenizer.chat_template` attribute.
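For instance, you can print the attribute directly to see the raw template the tokenizer will use (the exact string varies by model):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# A Jinja template string; may be None if the model repo ships no custom template.
print(tokenizer.chat_template)
```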