dananthony1 committed (verified)
Commit 89758af · 1 Parent(s): c2a1da3

Update README.md

Files changed (1): README.md (+9 / -9)

README.md CHANGED
@@ -14,6 +14,8 @@ Unfortunately, curating a dataset in the instruction-response format from reputa
 
 Although bias was not eliminated, the model's improvement on mmlu_nutrition shows that training on recipes can improve food-related knowledge. The next two stages of this project will be to curate datasets for specific cuisines from video transcripts of reputable chefs and to build a RAG pipeline that can update responses based on the target-cuisine database.
 
+ *An additional note for future work, suggested by a classmate as a way to mitigate bias: tag the dataset entries as "Western" vs. "non-Western" cuisines and fine-tune on only the "non-Western" recipes instead of a random selection of all recipes (a sketch of this filtering follows the hunk).*
+
 **2. Training Data**
 
 Training Data Source:
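
A minimal sketch of the tagging-and-filtering idea added in the hunk above, assuming the Hugging Face `datasets` library; the dataset id, the `cuisine` field, and the label set are hypothetical placeholders rather than this repo's actual schema:

```python
# Hedged sketch of the bias-mitigation note above: tag each recipe with a coarse
# "Western" / "non-Western" label and keep only the non-Western subset for
# fine-tuning. Dataset id, the "cuisine" field, and the label set are placeholders.
from datasets import load_dataset

WESTERN_CUISINES = {"american", "british", "french", "german", "italian"}  # assumed label set

def tag_region(example):
    # In practice the cuisine label might come from a keyword heuristic or a
    # classifier; here a pre-existing "cuisine" string field is assumed.
    cuisine = example.get("cuisine", "").lower()
    example["region"] = "Western" if cuisine in WESTERN_CUISINES else "non-Western"
    return example

recipes = load_dataset("your-org/recipe-instructions", split="train")  # placeholder dataset id
recipes = recipes.map(tag_region)
non_western = recipes.filter(lambda ex: ex["region"] == "non-Western")
print(f"{len(non_western)} non-Western recipes selected for fine-tuning")
```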
@@ -70,15 +72,13 @@ Eval_steps = 500
 
 **4. Evaluation on Benchmarks**
 
- | Task | Falcon-7B | Prompt-tuned (30k) | Llama-7B | Another-7B |
+ | Task | Falcon-7B | Prompt-tuned (~30k) | Mistral-7B | Llama-8B |
 |---|---|---|---|---|
- | cola | 0.65 | 0.43 | | |
- | bleu | 0.46 | 0.46 | | |
- | rouge | 0.66 | 0.46 | | |
- | truthfulqa_mc1 | -- | 0.26 | | |
- | truthfulqa_mc2 | -- | 0.51 | | |
- | mmlu_nutrition | 0.73 | 0.8 | | |
- | mmlu | 0.69 | 0.65 | | |
+ | cola | 0.65 | 0.43 | 0.49 | 0.0 |
+ | bleu | 0.46 | 0.46 | 0.53 | 0.46 |
+ | rouge | 0.66 | 0.46 | 0.53 | 0.46 |
+ | mmlu_nutrition | 0.73 | 0.8 | 0.66 | 0.73 |
+ | mmlu | 0.69 | 0.65 | 0.59 | 0.68 |
 
 **Rationale for Benchmarks:**

@@ -86,7 +86,7 @@ GLUE (cola) is a natural language understanding evaluator and Cola specifically
 
 MMLU is like GLUE but targets deeper knowledge across a wide array of domains. Although most domains are not relevant for recipe generation, it is worth noting the model's evaluation on the nutrition domain, which improved after training on thousands of recipes.
 
- TruthfulQA evaluates whether a model can “avoid generating false answers learned from imitating human texts”, and although recipe generation is not a mission-critical endeavor, it would be frustrating to read a recipe that misuses ingredients or makes up a dish. Rouge and Bleu score how similar the response is to the true and false reference answers, and the difference between those two similarities is used as the truthfulness score.
+ TruthfulQA (rouge/bleu) evaluates whether a model can “avoid generating false answers learned from imitating human texts”, and although recipe generation is not a mission-critical endeavor, it would be frustrating to read a recipe that misuses ingredients or makes up a dish. Rouge and Bleu score how similar the response is to the true and false reference answers, and the difference between those two similarities is used as the truthfulness score.
 
 Overall, the prompt-tuned model shows some degradation across most of the benchmarks. However, it improves on the food-related nutrition domain of MMLU, the drop on overall MMLU is only slight, and it maintains performance on TruthfulQA via Bleu and Rouge. The cola score is significantly worse, but this is less concerning given the level of performance on the other general language evaluator, MMLU.
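
To make the Rouge/Bleu truthfulness scoring described above concrete, here is a minimal sketch of the idea (not this repo's evaluation code): score the model answer against the true and the false reference answers and take the difference of the best scores. `sacrebleu` is assumed to be available, and the example strings are illustrative only.

```python
# Minimal sketch of the BLEU-based truthfulness signal described above:
# best similarity to true references minus best similarity to false references.
# ROUGE can be substituted the same way. All names and strings are illustrative.
import sacrebleu

def truthfulness_bleu_diff(answer: str, true_refs: list[str], false_refs: list[str]) -> float:
    """Positive values mean the answer tracks the true references more closely."""
    best_true = max(sacrebleu.sentence_bleu(answer, [ref]).score for ref in true_refs)
    best_false = max(sacrebleu.sentence_bleu(answer, [ref]).score for ref in false_refs)
    return best_true - best_false

# Example usage with made-up references for a cooking myth.
diff = truthfulness_bleu_diff(
    "Searing meat builds flavor but does not seal in the juices.",
    true_refs=["Searing does not seal in juices; it browns the surface for flavor."],
    false_refs=["Searing seals in the juices and keeps the meat moist."],
)
print(f"BLEU truthfulness difference: {diff:.1f}")
```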
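
For reproducing numbers like those in the updated table, an evaluation run with EleutherAI's lm-evaluation-harness would look roughly like the sketch below. This is an assumption about tooling (the commit does not state which harness was used), and the model id is a placeholder for the prompt-tuned checkpoint.

```python
# Hedged sketch: running the benchmarks from the table with EleutherAI's
# lm-evaluation-harness Python API. Which harness this project actually used
# is not stated in the commit; the model id and task list are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                # Hugging Face transformers backend
    model_args="pretrained=tiiuae/falcon-7b",  # placeholder; swap in the prompt-tuned checkpoint
    tasks=["cola", "truthfulqa_gen", "mmlu_nutrition", "mmlu"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```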