| Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|--------------------------------------------------------------|------:|------:|---------:|-------:|------:|
|[gemma-2b-orpo](https://huggingface.co/anakin87/gemma-2b-orpo)| 23.76| 58.25| 44.47| 31.32| 39.45|
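
These tables follow the output format of EleutherAI's lm-evaluation-harness (the Task/Version/Metric/Stderr layout below). As a rough sketch of how one suite could be re-run, assuming the legacy v0.3-style harness API (function names, arguments, and batch size here are assumptions, not the exact command used for this card):

```python
# Hypothetical reproduction sketch, assuming the legacy (v0.3-style)
# lm-evaluation-harness API; names and arguments differ across versions.
from lm_eval import evaluator

# The seven zero-shot tasks reported in the GPT4All section below.
GPT4ALL_TASKS = [
    "arc_challenge", "arc_easy", "boolq", "hellaswag",
    "openbookqa", "piqa", "winogrande",
]

results = evaluator.simple_evaluate(
    model="hf-causal",                               # Hugging Face causal-LM backend
    model_args="pretrained=anakin87/gemma-2b-orpo",  # model evaluated in this card
    tasks=GPT4ALL_TASKS,
    num_fewshot=0,
    batch_size=8,                                    # assumed; not stated in the card
)

# make_table renders the same Task/Version/Metric/Value/Stderr
# markdown tables shown in the sections below.
print(evaluator.make_table(results))
```
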
### AGIEval

| Task |Version| Metric |Value| |Stderr|
|------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat | 0|acc |15.35|± | 2.27|
| | |acc_norm|17.32|± | 2.38|
|agieval_logiqa_en | 0|acc |25.96|± | 1.72|
| | |acc_norm|29.34|± | 1.79|
|agieval_lsat_ar | 0|acc |19.57|± | 2.62|
| | |acc_norm|20.00|± | 2.64|
|agieval_lsat_lr | 0|acc |23.14|± | 1.87|
| | |acc_norm|21.96|± | 1.83|
|agieval_lsat_rc | 0|acc |24.16|± | 2.61|
| | |acc_norm|24.54|± | 2.63|
|agieval_sat_en | 0|acc |29.61|± | 3.19|
| | |acc_norm|27.18|± | 3.11|
|agieval_sat_en_without_passage| 0|acc |30.58|± | 3.22|
| | |acc_norm|24.76|± | 3.01|
|agieval_sat_math | 0|acc |23.64|± | 2.87|
| | |acc_norm|25.00|± | 2.93|

Average: 23.76%
### GPT4All

| Task |Version| Metric |Value| |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge| 0|acc |37.97|± | 1.42|
| | |acc_norm|40.61|± | 1.44|
|arc_easy | 0|acc |67.63|± | 0.96|
| | |acc_norm|65.82|± | 0.97|
|boolq | 1|acc |69.85|± | 0.80|
|hellaswag | 0|acc |52.39|± | 0.50|
| | |acc_norm|67.70|± | 0.47|
|openbookqa | 0|acc |25.40|± | 1.95|
| | |acc_norm|37.40|± | 2.17|
|piqa | 0|acc |71.71|± | 1.05|
| | |acc_norm|72.74|± | 1.04|
|winogrande | 0|acc |53.59|± | 1.40|

Average: 58.25%
### TruthfulQA

| Task |Version|Metric|Value| |Stderr|
|-------------|------:|------|----:|---|-----:|
|truthfulqa_mc| 1|mc1 |28.76|± | 1.58|
| | |mc2 |44.47|± | 1.61|

Average: 44.47%
### Bigbench

| Task |Version| Metric |Value| |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement | 0|multiple_choice_grade|51.58|± | 3.64|
|bigbench_date_understanding | 0|multiple_choice_grade|43.63|± | 2.59|
|bigbench_disambiguation_qa | 0|multiple_choice_grade|37.21|± | 3.02|
|bigbench_geometric_shapes | 0|multiple_choice_grade|10.03|± | 1.59|
| | |exact_str_match | 0.00|± | 0.00|
|bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|23.80|± | 1.91|
|bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|18.00|± | 1.45|
|bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|38.67|± | 2.82|
|bigbench_movie_recommendation | 0|multiple_choice_grade|22.60|± | 1.87|
|bigbench_navigate | 0|multiple_choice_grade|50.00|± | 1.58|
|bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|32.80|± | 1.05|
|bigbench_ruin_names | 0|multiple_choice_grade|25.67|± | 2.07|
|bigbench_salient_translation_error_detection | 0|multiple_choice_grade|19.24|± | 1.25|
|bigbench_snarks | 0|multiple_choice_grade|44.75|± | 3.71|
|bigbench_sports_understanding | 0|multiple_choice_grade|49.70|± | 1.59|
|bigbench_temporal_sequences | 0|multiple_choice_grade|24.60|± | 1.36|
|bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|19.20|± | 1.11|
|bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|13.60|± | 0.82|
|bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|38.67|± | 2.82|

Average: 31.32%
Average score: 39.45%
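
The suite averages can be re-derived from the tables above: each is the mean of one score per task (acc_norm where reported, otherwise acc; mc2 for TruthfulQA; multiple_choice_grade for Bigbench), and the final score is the mean of the four suite averages. A quick sanity check using the rounded values printed above (so the last digit can drift by 0.01, e.g. 58.24 vs. the reported 58.25):

```python
# Sanity-check of the reported averages, from the rounded table values above.
from statistics import mean

# One score per task: acc_norm where reported, otherwise acc.
agieval = [17.32, 29.34, 20.00, 21.96, 24.54, 27.18, 24.76, 25.00]
gpt4all = [40.61, 65.82, 69.85, 67.70, 37.40, 72.74, 53.59]  # acc for boolq/winogrande
truthfulqa = [44.47]  # mc2
bigbench = [51.58, 43.63, 37.21, 10.03, 23.80, 18.00, 38.67, 22.60, 50.00,
            32.80, 25.67, 19.24, 44.75, 49.70, 24.60, 19.20, 13.60, 38.67]

suite_means = [mean(s) for s in (agieval, gpt4all, truthfulqa, bigbench)]
print([round(m, 2) for m in suite_means])  # [23.76, 58.24, 44.47, 31.32]
print(round(mean(suite_means), 2))         # 39.45
```
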
Elapsed time: 02:46:40