dananthony1 committed (verified)
Commit 89758af · 1 Parent(s): c2a1da3

Update README.md

Files changed (1): README.md (+9 / -9)

README.md CHANGED
@@ -14,6 +14,8 @@ Unfortunately, curating a dataset in the instruction-response format from reputa
 
 Although bias was not eliminated, the model's improvement on mmlu_nutrition shows that training on recipes can improve food-related knowledge. The next two stages of this project will be to curate datasets for specific cuisines from video transcripts of reputable chefs and to build a RAG pipeline that can update responses based on the target-cuisine database.
 
+ *An additional note for future work, suggested by a classmate as a way to mitigate bias: tag the dataset entries as "Western" vs. "non-Western" cuisines and fine-tune on only the "non-Western" recipes instead of a random selection of all recipes (a sketch of this filtering follows the hunk).*
+
 **2. Training Data**
 
 Training Data Source:
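
A minimal sketch of the tagging-and-filtering idea added in the hunk above, assuming the Hugging Face `datasets` library; the dataset id, the `cuisine` field, and the label set are hypothetical placeholders rather than this repo's actual schema:

```python
# Hedged sketch of the bias-mitigation note above: tag each recipe with a coarse
# "Western" / "non-Western" label and keep only the non-Western subset for
# fine-tuning. Dataset id, the "cuisine" field, and the label set are placeholders.
from datasets import load_dataset

WESTERN_CUISINES = {"american", "british", "french", "german", "italian"}  # assumed label set

def tag_region(example):
    # In practice the cuisine label might come from a keyword heuristic or a
    # classifier; here a pre-existing "cuisine" string field is assumed.
    cuisine = example.get("cuisine", "").lower()
    example["region"] = "Western" if cuisine in WESTERN_CUISINES else "non-Western"
    return example

recipes = load_dataset("your-org/recipe-instructions", split="train")  # placeholder dataset id
recipes = recipes.map(tag_region)
non_western = recipes.filter(lambda ex: ex["region"] == "non-Western")
print(f"{len(non_western)} non-Western recipes selected for fine-tuning")
```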
@@ -70,15 +72,13 @@ Eval_steps = 500
 
 **4. Evaluation on Benchmarks**
 
- | Task | Falcon-7B | Prompt-tuned (30k) | Llama-7B | Another-7B |
+ | Task | Falcon-7B | Prompt-tuned (~30k) | Mistral-7B | Llama-8B |
 |---|---|---|---|---|
- | cola | 0.65 | 0.43 | | |
- | bleu | 0.46 | 0.46 | | |
- | rouge | 0.66 | 0.46 | | |
- | truthfulqa_mc1 | -- | 0.26 | | |
- | truthfulqa_mc2 | -- | 0.51 | | |
- | mmlu_nutrition | 0.73 | 0.8 | | |
- | mmlu | 0.69 | 0.65 | | |
+ | cola | 0.65 | 0.43 | 0.49 | 0.0 |
+ | bleu | 0.46 | 0.46 | 0.53 | 0.46 |
+ | rouge | 0.66 | 0.46 | 0.53 | 0.46 |
+ | mmlu_nutrition | 0.73 | 0.8 | 0.66 | 0.73 |
+ | mmlu | 0.69 | 0.65 | 0.59 | 0.68 |
 
 **Rationale for Benchmarks:**

@@ -86,7 +86,7 @@ GLUE (cola) is a natural language understanding evaluator and Cola specifically
 
 MMLU is like GLUE but targets deeper knowledge across a wide array of domains. Although most domains are not relevant for recipe generation, it is worth noting the model's evaluation on the nutrition domain, which improved after training on thousands of recipes.
 
- TruthfulQA evaluates whether a model can “avoid generating false answers learned from imitating human texts”, and although recipe generation is not a mission-critical endeavor, it would be frustrating to read a recipe that misuses ingredients or makes up a dish. Rouge and Bleu score how similar the response is to the true and false reference answers, and the difference between those two similarities is used as the truthfulness score.
+ TruthfulQA (rouge/bleu) evaluates whether a model can “avoid generating false answers learned from imitating human texts”, and although recipe generation is not a mission-critical endeavor, it would be frustrating to read a recipe that misuses ingredients or makes up a dish. Rouge and Bleu score how similar the response is to the true and false reference answers, and the difference between those two similarities is used as the truthfulness score.
 
 Overall, the prompt-tuned model shows some degradation across most of the benchmarks. However, it improves on the food-related nutrition domain of MMLU, the drop on overall MMLU is only slight, and it maintains performance on TruthfulQA via Bleu and Rouge. The cola score is significantly worse, but this is less concerning given the level of performance on the other general language evaluator, MMLU.
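
To make the Rouge/Bleu truthfulness scoring described above concrete, here is a minimal sketch of the idea (not this repo's evaluation code): score the model answer against the true and the false reference answers and take the difference of the best scores. `sacrebleu` is assumed to be available, and the example strings are illustrative only.

```python
# Minimal sketch of the BLEU-based truthfulness signal described above:
# best similarity to true references minus best similarity to false references.
# ROUGE can be substituted the same way. All names and strings are illustrative.
import sacrebleu

def truthfulness_bleu_diff(answer: str, true_refs: list[str], false_refs: list[str]) -> float:
    """Positive values mean the answer tracks the true references more closely."""
    best_true = max(sacrebleu.sentence_bleu(answer, [ref]).score for ref in true_refs)
    best_false = max(sacrebleu.sentence_bleu(answer, [ref]).score for ref in false_refs)
    return best_true - best_false

# Example usage with made-up references for a cooking myth.
diff = truthfulness_bleu_diff(
    "Searing meat builds flavor but does not seal in the juices.",
    true_refs=["Searing does not seal in juices; it browns the surface for flavor."],
    false_refs=["Searing seals in the juices and keeps the meat moist."],
)
print(f"BLEU truthfulness difference: {diff:.1f}")
```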
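
For reproducing numbers like those in the updated table, an evaluation run with EleutherAI's lm-evaluation-harness would look roughly like the sketch below. This is an assumption about tooling (the commit does not state which harness was used), and the model id is a placeholder for the prompt-tuned checkpoint.

```python
# Hedged sketch: running the benchmarks from the table with EleutherAI's
# lm-evaluation-harness Python API. Which harness this project actually used
# is not stated in the commit; the model id and task list are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                # Hugging Face transformers backend
    model_args="pretrained=tiiuae/falcon-7b",  # placeholder; swap in the prompt-tuned checkpoint
    tasks=["cola", "truthfulqa_gen", "mmlu_nutrition", "mmlu"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```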