Let's see an example:
```python
from transformers import AutoTokenizer
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

chat1 = [
    {"role": "user", "content": "Which is bigger, the moon or the sun?"},
    {"role": "assistant", "content": "The sun."}
]
chat2 = [
    {"role": "user", "content": "Which is bigger, a virus or a bacterium?"},
    {"role": "assistant", "content": "A bacterium."}
]

dataset = Dataset.from_dict({"chat": [chat1, chat2]})
# tokenize=False returns the formatted string rather than token IDs, and
# add_generation_prompt=False because each chat already ends with an assistant reply.
dataset = dataset.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["chat"], tokenize=False, add_generation_prompt=False)})
print(dataset['formatted_chat'][0])
```

And we get:

```text
<|user|>
Which is bigger, the moon or the sun?
<|assistant|>
The sun.
```

From here, just continue training like you would with a standard language modelling task, using the `formatted_chat` column.
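As a minimal sketch of that training step, assuming the `Trainer` API with a causal-LM collator (the `max_length`, batch size, and `output_dir` values below are placeholder assumptions, not recommendations):

```python
from transformers import (
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# The collator needs a pad token; fall back to EOS if the tokenizer has none.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize(example):
    # max_length=512 is a placeholder; pick what fits your data and memory.
    return tokenizer(example["formatted_chat"], truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="zephyr-chat-finetune", per_device_train_batch_size=1),
    train_dataset=tokenized_dataset,
    # mlm=False gives causal-LM labels (the model shifts them internally).
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```

Note that this computes loss over the entire formatted chat, including user turns; masking user turns so the model only learns from assistant replies is a common refinement, but out of scope here.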
## Advanced: How do chat templates work?
The chat template for a model is stored on the `tokenizer.chat_template` attribute.
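For instance, you can print the attribute directly to see the raw template the tokenizer will use (the exact string varies by model):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# A Jinja template string; may be None if the model repo ships no custom template.
print(tokenizer.chat_template)
```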