mpt-7b-gsm8k-pruned75-quant

Paper: Sparse Finetuning for Inference Acceleration of Large Language Models
Code: https://github.com/neuralmagic/deepsparse/tree/main/research/mpt

This model was produced from an MPT-7B base model finetuned on the GSM8k dataset, pruned using SparseGPT, and retrained for 4 epochs with L2 distillation. It was then exported for optimized inference with DeepSparse.

GSM8k zero-shot accuracy with lm-evaluation-harness: 26.61% (FP32 baseline is 28.2%)
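
To put the two accuracy figures side by side, the following is a minimal sketch that computes the absolute drop and the relative recovery against the FP32 baseline. The variable names and the "recovery" ratio are illustrative assumptions; only the two percentages come from this card.

```python
# Accuracy figures quoted in this card; variable names are hypothetical.
sparse_quant_acc = 26.61   # GSM8k zero-shot accuracy (%) of the pruned75-quant model
fp32_baseline_acc = 28.2   # GSM8k zero-shot accuracy (%) of the FP32 baseline

# Absolute difference in percentage points.
absolute_drop = fp32_baseline_acc - sparse_quant_acc

# Share of the baseline accuracy that the compressed model retains.
recovery = sparse_quant_acc / fp32_baseline_acc * 100

print(f"absolute drop: {absolute_drop:.2f} points")
print(f"recovery: {recovery:.1f}% of baseline")
```

Under these numbers the compressed model gives up about 1.59 points, which corresponds to roughly 94.4% of the baseline accuracy.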

Usage

from deepsparse import TextGeneration

# Load from the Hugging Face Hub, or use a SparseZoo stub instead:
# zoo:mpt-7b-gsm8k_mpt_pretrain-pruned75_quantized
model_path = "hf:neuralmagic/mpt-7b-gsm8k-pruned75-quant"
model = TextGeneration(model=model_path)
prompt = "There are twice as many boys as girls at Dr. Wertz's school. If there are 60 girls and 5 students to every teacher, how many teachers are there?"
model(prompt, max_new_tokens=50)
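
When checking a generation for the prompt above, it can help to have a reference answer on hand. The sketch below works through the word problem's arithmetic directly. It is our own calculation, not output from the model, and the variable names are illustrative.

```python
# Reference arithmetic for the example prompt (not model output).
girls = 60
boys = 2 * girls                  # twice as many boys as girls
students = girls + boys           # total students at the school
students_per_teacher = 5

teachers = students // students_per_teacher
print(teachers)
```

The expected answer is 36 teachers (180 students at 5 students per teacher), which can be compared against whatever the model generates.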

All MPT model weights are available on SparseZoo, and the CPU speedup for generative inference can be reproduced by following the instructions at DeepSparse.

Model Links                                Compression
neuralmagic/mpt-7b-gsm8k-quant             Quantization (W8A8)
neuralmagic/mpt-7b-gsm8k-pruned40-quant    Quantization (W8A8) & 40% Pruning
neuralmagic/mpt-7b-gsm8k-pruned50-quant    Quantization (W8A8) & 50% Pruning
neuralmagic/mpt-7b-gsm8k-pruned60-quant    Quantization (W8A8) & 60% Pruning
neuralmagic/mpt-7b-gsm8k-pruned70-quant    Quantization (W8A8) & 70% Pruning
neuralmagic/mpt-7b-gsm8k-pruned75-quant    Quantization (W8A8) & 75% Pruning
neuralmagic/mpt-7b-gsm8k-pruned80-quant    Quantization (W8A8) & 80% Pruning

For general questions on these models and sparsification methods, reach out to the engineering team on our community Slack.
