This model was obtained by quantizing the weights of Llama-3.1-8B-Instruct to the MXFP4 data type. This optimization reduces the number of bits per parameter from 16 to 4.25, cutting disk size and GPU memory requirements by approximately 73%.
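For reference, the headline figure follows from the bit widths alone: MXFP4 stores each weight in 4 bits with a shared 8-bit scale per 32-element block, giving 4 + 8/32 = 4.25 bits per parameter, and 1 − 4.25/16 ≈ 0.73, hence the roughly 73% reduction relative to 16-bit weights.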
MR-GPTQ quantized models with QuTLASS kernels are supported in the following integrations (usage sketches below):

- transformers: supported on main (Documentation).
- vLLM: via FP-Quant and the transformers integration.

This model was evaluated on a subset of the OpenLLM v1 benchmarks and on Platinum bench. Model outputs were generated with the vLLM engine.
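As a quick start, below is a minimal loading sketch for transformers. It is not taken from this card: it assumes that transformers (main) picks up the quantization config stored in the checkpoint automatically and that QuTLASS kernels are installed for fast MXFP4 inference; the repository id is the one listed in the results table below.

```python
# Minimal sketch: loading the quantized checkpoint with transformers (main).
# Assumes the quantization config embedded in the checkpoint is detected
# automatically and that QuTLASS kernels are available on the GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-mxfp"  # id as listed below

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain MXFP4 quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```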
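For serving, a generation sketch with vLLM along the same lines (not the exact evaluation harness used for the numbers below) could look like this, assuming a vLLM build that supports this quantized format:

```python
# Minimal sketch: offline generation with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-mxfp")  # id as listed below
params = SamplingParams(temperature=0.0, max_tokens=128)  # greedy decoding

outputs = llm.generate(["What is 17 * 23?"], params)
print(outputs[0].outputs[0].text)
```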
OpenLLM v1 results
| Model | MMLU-CoT | GSM8k | Hellaswag | Winogrande | Average | Recovery (%) |
|---|---|---|---|---|---|---|
| meta-llama/Llama-3.1-8B-Instruct | 0.7276 | 0.8506 | 0.8001 | 0.7790 | 0.7893 | – |
| ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-mxfp | 0.6754 | 0.7892 | 0.7737 | 0.7324 | 0.7427 | 94.09 |
Platinum bench results
Below we report recoveries on individual tasks as well as the average recovery.
Recovery by Task
| Task | Recovery (%) |
|---|---|
| SingleOp | 97.94 |
| SingleQ | 95.95 |
| MultiArith | 98.22 |
| SVAMP | 95.08 |
| GSM8K | 93.69 |
| MMLU-Math | 80.54 |
| BBH-LogicalDeduction-3Obj | 89.87 |
| BBH-ObjectCounting | 82.03 |
| BBH-Navigate | 90.66 |
| TabFact | 86.92 |
| HotpotQA | 96.81 |
| SQuAD | 98.46 |
| DROP | 94.33 |
| Winograd-WSC | 89.47 |
| Average | 92.14 |