Arithmo2-Mistral-7B improves on the initially released Arithmo-Mistral-7B model on both the GSM8K and MATH benchmarks. Specifically, the absolute improvements are:

  • +1.7% on GSM8K
  • +3.0% on GSM8K PoT
  • +1.9% on MATH
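
The improvements listed above can be re-derived from the scores reported in the comparison table further down this page. The snippet below is only a small arithmetic check; the dictionary layout and the function name are illustrative and not part of the Arithmo codebase.

```python
# Scores copied from the comparison table in this README (Pass@1, in percent).
scores = {
    "GSM8K": {"Arithmo-Mistral-7B": 74.7, "Arithmo2-Mistral-7B": 76.4},
    "GSM8K PoT": {"Arithmo-Mistral-7B": 71.2, "Arithmo2-Mistral-7B": 74.2},
    "MATH": {"Arithmo-Mistral-7B": 25.3, "Arithmo2-Mistral-7B": 27.2},
}


def absolute_improvement(old: float, new: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(new - old, 1)


improvements = {
    benchmark: absolute_improvement(
        values["Arithmo-Mistral-7B"], values["Arithmo2-Mistral-7B"]
    )
    for benchmark, values in scores.items()
}

for benchmark, delta in improvements.items():
    print(f"{benchmark}: +{delta}%")
```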

This repo contains the final merged model. If you are interested in the LoRA adapter, use LoRA Adapter instead.

Model Description

Results

Arithmo2-Mistral-7B is an improved version of the Arithmo-Mistral-7B model and is competitive with fully fine-tuned state-of-the-art 7B mathematical reasoning models. Refer to the Comparing Arithmo models with other SFT LLM models section for more details.

| Prompt Approach | GSM8k | MATH |
|---|---|---|
| Zero-Shot CoT | 76.4 | 27.2 |
| Zero-Shot PoT | 74.2 | - |

  • Zero-Shot CoT: Given a question as the prompt, the model generates reasoning steps that solve the question, along with the answer. We check whether the answer matches the ground truth.
  • Zero-Shot PoT: We prompt the model to generate a Python program for the given question. During inference, we execute the generated program and check whether its output matches the ground-truth answer.
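
The two checks described above could be sketched roughly as follows. This is not the evaluation script used to produce the reported numbers. The `The answer is:` marker is taken from the sample output shown elsewhere in this README, while the function names, the numeric tolerance, and the timeout are assumptions made for illustration.

```python
import re
import subprocess
import sys


def extract_cot_answer(generation: str):
    """Return the text after the last 'The answer is:' marker, or None if absent."""
    matches = re.findall(r"The answer is:\s*(.+)", generation)
    return matches[-1].strip() if matches else None


def run_pot_program(program: str, timeout: float = 10.0):
    """Run a generated Python program in a subprocess and return its stdout.

    Returns None if the program fails or times out. Model-generated code is
    untrusted, so a real setup should execute it inside a sandbox.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()


def answers_match(predicted, ground_truth: str, tol: float = 1e-4) -> bool:
    """Compare numerically when both sides parse as numbers, else as strings."""
    if predicted is None:
        return False
    try:
        pred_value = float(predicted.replace(",", ""))
        true_value = float(ground_truth.replace(",", ""))
        return abs(pred_value - true_value) <= tol
    except ValueError:
        return predicted.strip() == ground_truth.strip()
```

For example, `answers_match(extract_cot_answer(cot_text), "55")` would score a CoT generation, and `answers_match(run_pot_program(program_text), "55")` would score a PoT generation, where `cot_text` and `program_text` stand for model outputs.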

Installation

pip install "transformers>=4.34.0"
pip install accelerate
pip install sentencepiece
pip install protobuf

# If you are GPU poor like me
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# If you have a GPU.
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu118
pip install scipy
pip install bitsandbytes
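
After installing, a quick way to confirm that the packages are present is to query their installed versions. The helper below is a minimal sketch using only the standard library. The function names are hypothetical, and the version parsing is deliberately simple (it only compares leading numeric components).

```python
import re
from importlib import metadata


def version_tuple(version: str, width: int = 3) -> tuple:
    """Leading numeric components of a version string, padded with zeros.

    For example '4.34.0.dev0' becomes (4, 34, 0) and '4.35' becomes (4, 35, 0).
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < width:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_package(name: str, minimum=None) -> str:
    """Return a one-line status string for an installed (or missing) package."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return f"{name}: not installed"
    if minimum is not None and not meets_minimum(installed, minimum):
        return f"{name}: {installed} is older than required {minimum}"
    return f"{name}: {installed} OK"


if __name__ == "__main__":
    # Package names and the transformers minimum come from the commands above.
    requirements = [
        ("transformers", "4.34.0"),
        ("accelerate", None),
        ("sentencepiece", None),
        ("protobuf", None),
    ]
    for name, minimum in requirements:
        print(check_package(name, minimum))
```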

How to query the model

# Set `run_model_on_gpu` to `False` if you are running on CPU. The model generates reasoning steps along with the answer to your question. To generate a Python program instead, uncomment line 69, which adds a Python prompt.

$ python query_model.py

Note: The script above formats the prompt automatically, so you only need to type the question (e.g., What is 2+2?) without a prefix such as Question:. See query_model.py for more details.

Sample Input:
Question: There are total 10 children. I have to give 1 apple to first child, 2 apples to second child, 3 apples to third child, and so on. How many apples do I need?
Model Output:
Answer: The total number of apples needed is the sum of the first 10 positive integers.
This can be calculated using the formula for the sum of an arithmetic series:
\[S = \frac{n}{2}(a_1 + a_n),\]
where $S$ is the sum, $n$ is the number of terms, $a_1$ is the first term, and $a_n$ is the last term.
In this case, $n = 10$, $a_1 = 1$, and $a_n = 10$.
Plugging these values into the formula, we get:
\[S = \frac{10}{2}(1 + 10) = 5(11) = \boxed{55}.\]
The answer is: 55
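
The arithmetic in the sample output can be verified independently. The short check below compares the closed-form arithmetic series sum used by the model with a brute-force sum; the function name is illustrative.

```python
def arithmetic_series_sum(n: int, first: int, last: int) -> int:
    """Sum of an arithmetic series with n terms: S = n / 2 * (a_1 + a_n)."""
    return n * (first + last) // 2


# 10 children receiving 1, 2, ..., 10 apples.
closed_form = arithmetic_series_sum(n=10, first=1, last=10)
brute_force = sum(range(1, 11))

assert closed_form == brute_force
print(closed_form)  # 55
```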

Arithmo2-Mistral-7B is trained with the same format as Arithmo-Mistral-7B:

CoT Format (generate reasoning steps with answer):

Question: <question>

Answer:

PoT Format (generate a Python program):

Question: <question> <python_prompt>

Answer:

It will perform best if queried in this way with your own script.
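
If you write your own script, the two formats above might be assembled as in the sketch below. Several things here are assumptions: the exact whitespace between `Question:` and `Answer:` is inferred from how the formats are displayed on this page, the text of `<python_prompt>` is not given here and must be supplied by the caller, the repository id `upaya07/Arithmo2-Mistral-7B` and the generation settings are illustrative defaults, and `query_model.py` may do things differently.

```python
from typing import Optional


def build_prompt(question: str, python_prompt: Optional[str] = None) -> str:
    """Format a question in the CoT format, or the PoT format if python_prompt is given."""
    question = question.strip()
    if python_prompt:
        # PoT format: the Python prompt is appended to the question text.
        question = f"{question} {python_prompt.strip()}"
    return f"Question: {question}\n\nAnswer:"


def generate(
    question: str,
    python_prompt: Optional[str] = None,
    model_id: str = "upaya07/Arithmo2-Mistral-7B",  # assumed repository id
    max_new_tokens: int = 512,  # assumed generation budget
) -> str:
    """Load the model with transformers and return only the newly generated text."""
    # Imported lazily so build_prompt can be used without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(
        build_prompt(question, python_prompt), return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens so only the model's continuation is decoded.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `build_prompt("What is 2+2?")` yields the CoT prompt; passing a `python_prompt` string yields the PoT variant.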

Comparing Arithmo models with other SFT LLM models

Results for all models except Arithmo2-Mistral-7B are taken from the MetaMath repository.

| Model | GSM8k Pass@1 | MATH Pass@1 | Fine-tuning |
|---|---|---|---|
| MPT-7B | 6.8 | 3.0 | |
| Falcon-7B | 6.8 | 2.3 | |
| LLaMA-1-7B | 11.0 | 2.9 | |
| LLaMA-2-7B | 14.6 | 2.5 | |
| MPT-30B | 15.2 | 3.1 | |
| LLaMA-1-13B | 17.8 | 3.9 | |
| GPT-Neo-2.7B | 19.5 | -- | |
| Falcon-40B | 19.6 | 2.5 | |
| Baichuan-chat-13B | 23.9 | -- | |
| Vicuna-v1.3-13B | 27.6 | -- | |
| LLaMA-2-13B | 28.7 | 3.9 | |
| InternLM-7B | 31.2 | -- | |
| ChatGLM-2-6B | 32.4 | -- | |
| GPT-J-6B | 34.9 | -- | |
| LLaMA-1-33B | 35.6 | 3.9 | |
| LLaMA-2-34B | 42.2 | 6.24 | |
| RFT-7B | 50.3 | -- | |
| LLaMA-1-65B | 50.9 | 10.6 | |
| Qwen-7B | 51.6 | -- | |
| WizardMath-7B | 54.9 | 10.7 | |
| LLaMA-2-70B | 56.8 | 13.5 | |
| WizardMath-13B | 63.9 | 14.0 | |
| MetaMath-7B | 66.5 | 19.8 | |
| MetaMath-13B | 72.3 | 22.4 | |
| Arithmo-Mistral-7B (PoT) | 71.2 | -- | SFT: 4-bit QLoRA |
| Arithmo2-Mistral-7B (PoT) | 74.2 | -- | SFT: 4-bit QLoRA |
| MetaMath-Mistral-7B | 77.7 | 28.2 | SFT: Full fine-tuned |
| Arithmo-Mistral-7B | 74.7 | 25.3 | SFT: 4-bit QLoRA |
| 🔥 Arithmo2-Mistral-7B | 76.4 | 27.2 | SFT: 4-bit QLoRA |

If you are interested in reproducing the results, see the Reproducing Results section at https://github.com/akjindal53244/Arithmo#reproducing-results.

Support My Work

Building LLMs takes time and resources; if you find my work interesting, your support would be epic! Buy Me A Coffee

Citation

To cite Arithmo models:

@misc{jindal_2023_arithmo,
  author = {Jindal, Ashvini},
  title = {Arithmo-Mistral-7B: Mathematical Reasoning Model},
  howpublished = {Hugging Face},
  month = {October},
  year = {2023},
  url = {https://huggingface.co/akjindal53244/Arithmo-Mistral-7B}
}

References

@article{yu2023metamath,
  title={MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models},
  author={Yu, Longhui and Jiang, Weisen and Shi, Han and Yu, Jincheng and Liu, Zhengying and Zhang, Yu and Kwok, James T and Li, Zhenguo and Weller, Adrian and Liu, Weiyang},
  journal={arXiv preprint arXiv:2309.12284},
  year={2023}
}

@article{Yue2023mammoth,
  title={MAmmoTH: Building math generalist models through hybrid instruction tuning},
  author={Yue, Xiang and Qu, Xingwei and Zhang, Ge and Fu, Yao and Huang, Wenhao and Sun, Huan and Su, Yu and Chen, Wenhu},
  journal={arXiv preprint arXiv:2309.05653},
  year={2023}
}

@article{mishra2022lila,
  title={Lila: A unified benchmark for mathematical reasoning},
  author={Mishra, Swaroop and Finlayson, Matthew and Lu, Pan and Tang, Leonard and Welleck, Sean and Baral, Chitta and Rajpurohit, Tanmay and Tafjord, Oyvind and Sabharwal, Ashish and Clark, Peter and Kalyan, Ashwin},
  journal={arXiv preprint arXiv:2210.17517},
  year={2022}
}