tperes committed · 8c6c161 · verified · 1 Parent(s): 55d900f

Update README.md

Files changed (1): README.md (+62 −1)

# Model Card: palmyra-mini

## Model Details

**Model Name:** palmyra-mini
**Version:** 1.0
**Type:** Generative AI Language Model

## Model Description

The palmyra-mini model demonstrates exceptional capabilities in complex reasoning and mathematical problem-solving domains. Its performance is particularly noteworthy on benchmarks that require deep understanding and multi-step thought processes.

A key strength of the model is its proficiency in grade-school-level math problems, as evidenced by its impressive score of 0.818 on the gsm8k (strict-match) benchmark. This high score indicates a robust ability to parse and solve word problems, a foundational skill for more advanced quantitative reasoning.
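
This number should be reproducible with EleutherAI's lm-evaluation-harness, which defines the gsm8k task and its strict-match metric. The sketch below is only illustrative: the Hub repo ID `Writer/palmyra-mini`, the dtype, and the batch size are assumptions, not details confirmed by this card.

```python
# Hedged sketch: re-running gsm8k (strict-match) with lm-evaluation-harness.
# pip install lm-eval
# "Writer/palmyra-mini" is an assumed Hub repo ID, not confirmed by this card.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Writer/palmyra-mini,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=5,   # gsm8k is conventionally evaluated 5-shot
    batch_size=8,
)

# The task's metric dict includes a strict-match exact-match score; the exact
# key name (e.g. "exact_match,strict-match") varies with the harness version.
print(results["results"]["gsm8k"])
```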

This aptitude for mathematics is further confirmed by its outstanding performance on the MATH500 benchmark, where it also achieved a score of 0.818. This result underscores the model's consistent and reliable mathematical capabilities across different problem sets.

The model also shows strong performance on the AMC23 benchmark, with a solid score of 0.6. This benchmark, representing problems from the American Mathematics Competitions, highlights the model's ability to tackle challenging, competition-level mathematics.

Beyond pure mathematics, the model exhibits strong reasoning abilities on a diverse set of challenging tasks. Its score of 0.5259 on the BBH (get-answer)(exact_match) benchmark, part of the Big-Bench Hard suite, showcases its capacity for handling complex, multi-faceted reasoning problems designed to push the limits of language models. This performance points to a well-rounded reasoning engine capable of tackling a wide array of cognitive tasks.

## Benchmark Performance

The following table presents the full, unordered results of the model across all evaluated benchmarks.

| Benchmark | Score |
|:------------------------------------------------------------------|---------:|
| gsm8k (strict-match) | 0.818 |
| minerva_math (exact_match) | 0.4582 |
| mmlu_pro (exact_match) | 0.314 |
| hendrycks_math | 0.025 |
| ifeval (inst_level_loose_acc) | 0.4688 |
| mathqa (acc) | 0.4509 |
| humaneval (pass@1) | 0.5 |
| BBH (get-answer)(exact_match) | 0.5259 |
| mbpp | 0.47 |
| leaderboard_musr (acc_norm) | 0.3413 |
| gpqa diamond (lighteval, pass@1:8_samples) | 0.442 |
| AIME24 (pass@1)(avg-of-1) | 0.2 |
| AIME25 (pass@1)(avg-of-1) | 0.25 |
| LiveCodeBench codegen (livecodebench/code_generation_lite v4_v5) | 0.1519 |
| AMC23 | 0.6 |
| MATH500 | 0.818 |
| Minerva | 0.2794 |
| OlympiadBench (extractive_match) | 0.3822 |
| CodeContests (pass_rate) | 0.1034 |
| Codeforces (pass_rate) | 0.3199 |
| TACO (pass_rate) | 0.1744 |
| APPS (all_levels) | 0.0405 |
| HMMT23 (extractive_match) | 0.0333 |
| **Average** | **0.355091** |
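
The reported average is the unweighted mean of the 23 individual benchmark scores above, which checks out:

```python
# Sanity check: the "Average" row is the unweighted mean of the 23 scores.
scores = [
    0.818, 0.4582, 0.314, 0.025, 0.4688, 0.4509, 0.5, 0.5259,
    0.47, 0.3413, 0.442, 0.2, 0.25, 0.1519, 0.6, 0.818,
    0.2794, 0.3822, 0.1034, 0.3199, 0.1744, 0.0405, 0.0333,
]
print(sum(scores) / len(scores))  # ≈ 0.355091
```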

## Intended Use

This model is intended for research and development in the field of generative AI, particularly for tasks requiring mathematical and logical reasoning.
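
For reference, a minimal generation sketch with Hugging Face transformers is shown below; the Hub repo ID `Writer/palmyra-mini` and the example prompt are assumptions for illustration, not details confirmed by this card.

```python
# Hedged sketch: plain causal-LM generation with transformers.
# "Writer/palmyra-mini" is an assumed Hub repo ID.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Writer/palmyra-mini"  # assumption
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# A gsm8k-style word problem, matching the card's math focus.
prompt = (
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether "
    "in April and May?"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```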

## Limitations

The model has been evaluated on a specific set of benchmarks; its performance on other tasks or in real-world applications may vary.

## Ethical Considerations

As with any language model, there is a potential for generating biased or inaccurate information. Users should be aware of these limitations and use the model responsibly.