Update README.md
README.md CHANGED
@@ -27,4 +27,14 @@ parameters:
 dtype: float16
 ```
 
-Models
+Models chosen to achieve a mix of performance on reasoning datasets like GSM8K and conversational tasks.
+
+Evaluation results:
+
+| Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
+| --- | --- | --- | --- | --- | --- | --- |
+| 73.1 | 69.62 | 87.09 | 64.81 | 62.82 | 81.45 | 72.78 |
+
+The model did achieve an improvement in TruthfulQA over `cookinai/CatMacaroni-Slerp` and in GSM8K over `mncai/mistral-7b-dpo-v5`,
+which was the goal of the merge, leading to an average score better than both. It is unclear why the TruthfulQA metric
+is still meaningfully lower than the base `mncai/mistral-7b-dpo-v5`.
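
The `dtype: float16` line in the diff context above is the tail of the mergekit configuration that produced this merge. For orientation, a minimal SLERP config combining the two models named in the README could look like the sketch below. This is an assumption, not the actual config (which is truncated in this diff): the `layer_range` and `t` values in particular are illustrative placeholders, and only the model names and the choice of `mncai/mistral-7b-dpo-v5` as the base are taken from the README text.

```yaml
# Hypothetical mergekit SLERP config sketch; layer_range and t are
# illustrative assumptions, since the real config is truncated above.
slices:
  - sources:
      - model: mncai/mistral-7b-dpo-v5
        layer_range: [0, 32]
      - model: cookinai/CatMacaroni-Slerp
        layer_range: [0, 32]
merge_method: slerp
base_model: mncai/mistral-7b-dpo-v5
parameters:
  t: 0.5  # assumed equal interpolation between the two parents
dtype: float16
```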