# Model Card: palmyra-mini

## Model Details

**Model Name:** palmyra-mini
**Version:** 1.0
**Type:** Generative AI Language Model

## Model Description

The palmyra-mini model demonstrates exceptional capability in complex reasoning and mathematical problem solving. Its performance is particularly strong on benchmarks that require deep understanding and multi-step reasoning.

A key strength of the model is its proficiency in grade-school math problems, as evidenced by its score of 0.818 on the gsm8k (strict-match) benchmark. This high score indicates a robust ability to parse and solve word problems, a foundational skill for more advanced quantitative reasoning.

This aptitude for mathematics is confirmed by its performance on the MATH500 benchmark, where it also achieved a score of 0.818. This result underscores the model's consistent and reliable mathematical capabilities across different problem sets.

The model also performs well on the AMC23 benchmark, with a solid score of 0.6. This benchmark, drawn from the American Mathematics Competitions, highlights the model's ability to tackle challenging, competition-level mathematics.

Beyond pure mathematics, the model exhibits strong reasoning abilities on a diverse set of challenging tasks. Its score of 0.5259 on the BBH (get-answer)(exact_match) benchmark, part of the Big-Bench Hard suite, showcases its capacity for the complex, multi-faceted reasoning problems that suite is designed to pose. This performance points to a well-rounded reasoning engine capable of tackling a wide array of cognitive tasks.

## Benchmark Performance

The following table presents the model's complete results across all evaluated benchmarks, listed in no particular order.

| Benchmark | Score |
|:-----------------------------------------------------------------|---------:|
| gsm8k (strict-match) | 0.818 |
| minerva_math (exact_match) | 0.4582 |
| mmlu_pro (exact_match) | 0.314 |
| hendrycks_math | 0.025 |
| ifeval (inst_level_loose_acc) | 0.4688 |
| mathqa (acc) | 0.4509 |
| humaneval (pass@1) | 0.5 |
| BBH (get-answer)(exact_match) | 0.5259 |
| mbpp | 0.47 |
| leaderboard_musr (acc_norm) | 0.3413 |
| gpqa lighteval gpqa diamond_pass@1:8_samples | 0.442 |
| AIME24 (pass@1)(avg-of-1) | 0.2 |
| AIME25 (pass@1)(avg-of-1) | 0.25 |
| Livecodebench-codegen (livecodebench/code_generation_lite v4_v5) | 0.1519 |
| AMC23 | 0.6 |
| MATH500 | 0.818 |
| Minerva | 0.2794 |
| Olympiadbench (extractive_match) | 0.3822 |
| Codecontests (pass_rate) | 0.1034 |
| Codeforces (pass_rate) | 0.3199 |
| Taco (pass_rate) | 0.1744 |
| APPS (all_levels) | 0.0405 |
| HMMT23 (extractive_match) | 0.0333 |
| Average | 0.355091 |
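
The Average row is the unweighted mean of the 23 individual benchmark scores. A minimal sketch to reproduce it (benchmark labels are abbreviated from the table for readability):

```python
# Recompute the "Average" row of the table above: the unweighted mean
# of the 23 individual benchmark scores.
scores = {
    "gsm8k (strict-match)": 0.818,
    "minerva_math (exact_match)": 0.4582,
    "mmlu_pro (exact_match)": 0.314,
    "hendrycks_math": 0.025,
    "ifeval (inst_level_loose_acc)": 0.4688,
    "mathqa (acc)": 0.4509,
    "humaneval (pass@1)": 0.5,
    "BBH (get-answer)(exact_match)": 0.5259,
    "mbpp": 0.47,
    "leaderboard_musr (acc_norm)": 0.3413,
    "gpqa diamond (pass@1, 8 samples)": 0.442,
    "AIME24 (pass@1)": 0.2,
    "AIME25 (pass@1)": 0.25,
    "Livecodebench-codegen": 0.1519,
    "AMC23": 0.6,
    "MATH500": 0.818,
    "Minerva": 0.2794,
    "Olympiadbench (extractive_match)": 0.3822,
    "Codecontests (pass_rate)": 0.1034,
    "Codeforces (pass_rate)": 0.3199,
    "Taco (pass_rate)": 0.1744,
    "APPS (all_levels)": 0.0405,
    "HMMT23 (extractive_match)": 0.0333,
}

# Unweighted mean across all 23 benchmarks.
average = sum(scores.values()) / len(scores)
print(round(average, 6))  # 0.355091
```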

## Intended Use

This model is intended for research and development in the field of generative AI, particularly for tasks requiring mathematical and logical reasoning.

## Limitations

The model's performance has been evaluated on a specific set of benchmarks. Its performance on other tasks or in real-world applications may vary.

## Ethical Considerations

As with any language model, there is a potential for generating biased or inaccurate information. Users should be aware of these limitations and use the model responsibly.