license: mit
1. Introduction
According to the authors of FoodSky [5], existing LLMs exhibit biases towards Western food knowledge, leading to incorrect or culturally insensitive responses when handling queries from more diverse backgrounds. The goal of this project is to use transcripts from cooking channels on YouTube and other sources that incorporate history and culture, as well as traditional techniques and culinary practices, into recipe generation.
Unfortunately, curating a dataset in the instruction-response format from reputable sources within the span of one semester proved impractical, so the focus for this portion of the project was to refine existing recipe generation from LLMs and to determine which fine-tuning methodology works well for this task on instruction-response pairs.
Although bias was not eliminated, the model's improvement on mmlu_nutrition shows that training on recipes can improve food-related knowledge. The next two stages of this project will be to curate datasets for specific cuisines from video transcripts of reputable chefs on the internet and to build a RAG pipeline that can update responses based on a database of the target cuisines.
An additional idea for future work, suggested by a classmate as a way to mitigate bias, is to tag the dataset entries as "Western" vs. "non-Western" cuisines and fine-tune on only the "non-Western" recipes instead of a random selection of all recipes.
2. Training Data
Training Data Source: RecipeNLG: A Cooking Recipes Dataset for Semi-Structured Text Generation (Bień et al., INLG 2020)
Training Set Preparation and Formatting:
Read the CSV versions of the en_RecipeNLG dataset into dataframes with pd.read_csv (the dataset is split into 3 CSVs due to size)
Set the response and instruction columns from the given data:
a. df['response'] = df['title'] + ' Ingredients: ' + df['ingredients'] + ' Directions: ' + df['directions']
b. df['instruction'] = df['NER']
Create dataframes with only the 'instruction' and 'response' columns, then concatenate them into one 'recipes_train' dataframe
Shuffle recipes_train using sample(frac=1) with a random seed of 42
Index the first 15,000 rows as the 'train' set and rows 15,000-20,000 as the 'val' set, then tokenize 'instruction' with the map function (a sketch of these steps follows below)
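A minimal sketch of this preparation pipeline, assuming three illustrative CSV filenames and the Falcon-7B-Instruct tokenizer (the exact filenames and tokenization arguments are assumptions, not recorded settings):

```python
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer

# Load the three CSV shards of en_RecipeNLG (filenames are assumed for illustration).
dfs = [pd.read_csv(f"recipes_nlg_part{i}.csv") for i in range(1, 4)]

for df in dfs:
    # Build the response from title, ingredients, and directions; use the NER entities as the instruction.
    df["response"] = df["title"] + " Ingredients: " + df["ingredients"] + " Directions: " + df["directions"]
    df["instruction"] = df["NER"]

# Keep only the instruction/response columns, concatenate, and shuffle with a fixed seed.
recipes_train = pd.concat([df[["instruction", "response"]] for df in dfs], ignore_index=True)
recipes_train = recipes_train.sample(frac=1, random_state=42).reset_index(drop=True)

# First 15k rows -> train, rows 15k-20k -> val.
train = Dataset.from_pandas(recipes_train.iloc[:15_000], preserve_index=False)
val = Dataset.from_pandas(recipes_train.iloc[15_000:20_000], preserve_index=False)

# Tokenize the instruction column with map (tokenizer chosen to match the base model in section 5).
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")

def tokenize(batch):
    return tokenizer(batch["instruction"], truncation=True)

train = train.map(tokenize, batched=True)
val = val.map(tokenize, batched=True)
```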
3. Training Method
Initially, full fine-tuning was explored as the training method for this project. Unfortunately, the formatting of the training data did not produce effective responses to the instructions, and the benchmark scores deteriorated significantly due to catastrophic forgetting.
Prompt-tuning was selected as the secondary training method. Although it did slightly degrade most of the benchmarks (0.04 points for mmlu) and some to a greater degree (0.27 points for mnli), the LLM maintained solid performance generating recipes and improved its score on mmlu_nutrition from 0.73 to 0.80.
Comparing the outcomes of these two methods suggests that prompt-tuning can maintain the level of quality inherent to well-trained instruction models while also raising the standard for food knowledge through fine-tuning on recipes.
Prompt Tuning Setup: formatted_prompt = f"Using the following ingredients: {train[10]['instruction']}\n\nProvide a step-by-step recipe:"
Generate Output: max_new_tokens = 2000
PromptTuningConfig:
task_type = TaskType.CAUSAL_LM
prompt_tuning_init = PromptTuningInit.RANDOM
num_virtual_tokens = 10
Training Arguments (a full setup sketch follows below):
learning_rate = 0.0001
epochs = 5
eval_steps = 500
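A minimal sketch of the prompt-tuning setup under these settings, assuming Falcon-7B-Instruct as the base model and the Hugging Face Trainer with a standard causal-LM collator (the collator, batch size, and output directory are illustrative assumptions, not recorded settings):

```python
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Falcon's tokenizer ships without a pad token

# Randomly initialized soft prompt of 10 virtual tokens, as listed above.
peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.RANDOM,
    num_virtual_tokens=10,
)
model = get_peft_model(base_model, peft_config)

training_args = TrainingArguments(
    output_dir="falcon7b-recipe-prompt-tuning",  # illustrative path
    learning_rate=1e-4,
    num_train_epochs=5,
    eval_strategy="steps",  # use evaluation_strategy on older transformers versions
    eval_steps=500,
    per_device_train_batch_size=4,  # assumed; batch size is not specified in the report
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train,  # tokenized datasets from section 2
    eval_dataset=val,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```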
4. Evaluation on Benchmarks
| Task | Falcon-7B | Prompt-tuned (~30k) | Mistral-7B | Llama-8B |
|---|---|---|---|---|
| cola | 0.65 | 0.43 | 0.49 | 0.00 |
| bleu | 0.46 | 0.46 | 0.53 | 0.46 |
| rouge | 0.66 | 0.46 | 0.53 | 0.46 |
| mmlu_nutrition | 0.73 | 0.80 | 0.66 | 0.73 |
| mmlu | 0.69 | 0.65 | 0.59 | 0.68 |
Rationale for Benchmarks:
GLUE (cola) is a natural language understanding benchmark, and CoLA specifically evaluates linguistic acceptability. This assesses the grammar of the output and the model's general language capabilities. Providing the end user with grammatically correct instructions is important for a final product.
MMLU is similar to GLUE but targets deeper knowledge across a wide array of domains. Although most domains are not relevant to recipe generation, it is worth noting the model's evaluation on the nutrition domain and that this score improved after training on thousands of recipes.
TruthfulQA (rouge/bleu) evaluates whether a model can "avoid generating false answers learned from imitating human texts." Although recipe generation is not a mission-critical endeavor, it would be frustrating to read a recipe that misuses ingredients or makes up a dish. ROUGE and BLEU score how similar the response is to true and false reference answers; the difference between the similarity to the true references and the similarity to the false references is used as the truthfulness score (a simplified sketch of this diff-style scoring follows below).
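A minimal, self-contained sketch of that diff-style scoring, using a crude token-overlap function as a stand-in for BLEU/ROUGE (the helper functions and example strings are illustrative, not the benchmark's actual implementation):

```python
def overlap(a: str, b: str) -> float:
    """Crude token-overlap similarity, standing in for a BLEU/ROUGE score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def truthfulness_diff(answer: str, true_refs: list[str], false_refs: list[str]) -> float:
    # Score = max similarity to any true reference minus max similarity to any false reference;
    # a positive value means the answer is closer to the truthful references.
    return max(overlap(answer, r) for r in true_refs) - max(overlap(answer, r) for r in false_refs)

print(truthfulness_diff(
    "Bake the cake at 350F until golden.",
    true_refs=["Bake at 350F until golden brown."],
    false_refs=["Microwave the cake for 45 minutes."],
))
```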
Overall, the prompt-tuned model does show degradation across most of the benchmarks. However, the model improves on the food-related nutrition domain of mmlu, the drop on overall mmlu is slight, and it maintains performance on truthfulqa via BLEU and ROUGE. The cola score is significantly worse, but this is less concerning given the level of performance on the other general-language evaluator, mmlu.
5. Usage and Intended Uses
At this stage, the recipe generator is meant to take a list-like string of ingredients and provide a recipe that incorporates these ingredients.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the base model, its tokenizer, and the prompt-tuning adapter
base_model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # Falcon has no pad token by default
model = PeftModel.from_pretrained(base_model, "dananthony1/adapter_model")
6. Prompt Format
The code below is used to generate the recipes, where train[0]['instruction'] is a list of ingredients: ["dark sweet pitted cherries", "ginger ale", "flavor gelatin", "boiling water", "almond extract", "marshmallows"]
formatted_prompt = f"Using the following ingredients: {train[0]['instruction']}\n\nProvide a step-by-step recipe:"
# Tokenize and move to device
inputs = tokenizer(formatted_prompt, return_tensors="pt", padding=True, truncation=True, return_attention_mask=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
# Generate
outputs = model.generate(**inputs, max_new_tokens=2000, pad_token_id=tokenizer.pad_token_id)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
7. Expected Output Format
For the ingredients list above the expected output would be:
Using the following ingredients: ["dark sweet pitted cherries", "ginger ale", "flavor gelatin", "boiling water", "almond extract", "marshmallows"]
Provide a step-by-step recipe:
Cherry Marshmallow Pudding Recipe <|assistant|> Ingredients:
- 1 cup dark sweet pitted cherries
- 1 can (12 fl oz) ginger ale
- 1 envelope (0.25 oz) flavor gelatin (cherry flavor)
- 1 cup boiling water
- 1 teaspoon almond extract
- 1 cup mini marshmallows
Instructions:
Prepare the Cherries:
- Rinse the pitted cherries under cold water and drain them. Set aside.
Prepare the Gelatin:
- In a small bowl, sprinkle the flavor gelatin over 1/2 cup of cold water. Let it sit for about 5 minutes to soften.
- Pour the boiling water over the softened gelatin. Stir until the gelatin is completely dissolved.
Combine Ingredients:
- In a large mixing bowl, combine the dissolved gelatin mixture, drained cherries, and almond extract. Stir until well mixed. ...
- Spoon the Cherry Marshmallow Pudding into individual serving dishes or glasses.
- Serve immediately, garnished with additional cherries or a sprinkle of almond extract if desired.
Enjoy your homemade Cherry Marshmallow Pudding!
8. Limitations
Relative to the ideal use case of a model with a sufficient understanding of cuisines and cooking techniques retrieved from a database, the current edition shows promise: the fine-tuned model improves in food knowledge and does not drastically deteriorate in other domains.
Further work on this project includes additional experimentation with the prompt used for prompt-tuning and potentially training on more recipes (100k+) with a similar number of epochs and a smaller learning rate. While this would improve the LLM that generates the recipes, implementing a RAG approach to effectively learn the ingredients, tools, and techniques of a cuisine would bring this generator to its full potential.
9. Citations
Training Data from:
RecipeNLG: A Cooking Recipes Dataset for Semi-Structured Text Generation (Bień et al., INLG 2020)