Formatting Datasets for Chat Template Compatibility
When working with datasets for fine-tuning conversational models, it's essential to ensure that the data is formatted correctly to work seamlessly with any chat template. In this article, we'll explore a Python function that transforms the `nroggendorff/mayo` dataset from Hugging Face into a compatible format.
The `format_prompts` Function

Here's a breakdown of the `format_prompts` function:
```python
# Note: `tokenizer` must already be loaded, e.g. via AutoTokenizer.from_pretrained(...)
def format_prompts(examples):
    texts = []
    for text in examples['text']:
        conversation = []
        # Split the raw string into alternating user/bot segments
        parts = text.split('<|end|>')
        for i in range(0, len(parts) - 1, 2):
            prompt = parts[i].replace("<|user|>", "")
            response = parts[i + 1].replace("<|bot|>", "")
            conversation.append({"role": "user", "content": prompt})
            conversation.append({"role": "assistant", "content": response})
        formatted_conversation = tokenizer.apply_chat_template(conversation, tokenize=False)
        texts.append(formatted_conversation)
    return {"text": texts}
```
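Before walking through the steps, it helps to see what the split actually produces. The string below is a hypothetical example in the `<|user|>...<|end|><|bot|>...<|end|>` layout the function assumes, not an actual record from the dataset:

```python
# A hypothetical raw example in the layout the function expects
raw = "<|user|>What is mayo?<|end|><|bot|>A condiment.<|end|>"

parts = raw.split("<|end|>")
# parts -> ['<|user|>What is mayo?', '<|bot|>A condiment.', '']
# Pair even-indexed user segments with the odd-indexed bot replies,
# stripping the role tags, exactly as format_prompts does
pairs = [(parts[i].replace("<|user|>", ""),
          parts[i + 1].replace("<|bot|>", ""))
         for i in range(0, len(parts) - 1, 2)]
print(pairs)  # [('What is mayo?', 'A condiment.')]
```

Note the trailing empty string after the final `<|end|>`; iterating up to `len(parts) - 1` is what keeps it out of the conversation.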
The function takes an `examples` parameter, which is expected to be a dictionary containing a `'text'` key with a list of conversation strings.

- We initialize an empty list called `texts` to store the formatted conversations.
- We iterate over each `text` in `examples['text']`:
  - We split the `text` on the delimiter `'<|end|>'` to separate the conversation into parts.
  - We iterate over the `parts` in steps of 2, assuming that even indices represent user prompts and odd indices represent bot responses.
  - We extract the `prompt` and `response` by removing the `"<|user|>"` and `"<|bot|>"` tags, respectively.
  - We append the `prompt` and `response` to the `conversation` list as dictionaries with `"role"` and `"content"` keys.
  - After processing all the parts, we apply the chat template to the `conversation` using `tokenizer.apply_chat_template()`, with `tokenize` set to `False` to avoid tokenization at this stage.
  - We append the `formatted_conversation` to the `texts` list.
- Finally, we return a dictionary with a `'text'` key containing the list of formatted conversations.
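To see the end-to-end effect of these steps without downloading a real model, the sketch below swaps in a minimal stand-in tokenizer. `StubTokenizer` and its rendering format are assumptions for illustration only; a real Hugging Face tokenizer's `apply_chat_template` would render the model's actual template instead:

```python
class StubTokenizer:
    """Minimal stand-in for a Hugging Face tokenizer's chat templating (illustrative only)."""
    def apply_chat_template(self, conversation, tokenize=False):
        # Render each turn as "<role>: <content>" on its own line
        return "\n".join(f"{m['role']}: {m['content']}" for m in conversation)

tokenizer = StubTokenizer()

def format_prompts(examples):
    texts = []
    for text in examples["text"]:
        conversation = []
        parts = text.split("<|end|>")
        for i in range(0, len(parts) - 1, 2):
            conversation.append({"role": "user",
                                 "content": parts[i].replace("<|user|>", "")})
            conversation.append({"role": "assistant",
                                 "content": parts[i + 1].replace("<|bot|>", "")})
        texts.append(tokenizer.apply_chat_template(conversation, tokenize=False))
    return {"text": texts}

batch = {"text": ["<|user|>Hi<|end|><|bot|>Hello!<|end|>"]}
print(format_prompts(batch)["text"][0])
# user: Hi
# assistant: Hello!
```

Because `format_prompts` receives and returns a dictionary of column lists, the same function works unchanged when `datasets` hands it batches via `map(..., batched=True)`.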
Usage
To use the `format_prompts` function, you can pass your dataset examples to it:
```python
from datasets import load_dataset

dataset = load_dataset("nroggendorff/mayo", split="train")
dataset = dataset.map(format_prompts, batched=True)

dataset['text'][2]  # Check to see if the fields were formatted correctly
```
By applying this formatting step, you can ensure that your dataset is compatible with various chat templates, making it easier to fine-tune conversational models for different use cases.