Llama-3.1-8B-Instruct-MR-GPTQ-mxfp

Model Overview

This model was obtained by quantizing the weights of Llama-3.1-8B-Instruct to the MXFP4 data type. This optimization reduces the number of bits per parameter from 16 to 4.25, cutting disk size and GPU memory requirements by approximately 73%.
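
The 4.25 bits per parameter follow from the MXFP4 format, which stores 4-bit elements in blocks of 32 that share one 8-bit scale. Below is a minimal sanity check of those figures (plain arithmetic, not code from this repository):

```python
# Back-of-the-envelope MXFP4 footprint estimate.
# MXFP4 packs 4-bit elements into blocks of 32 that share one 8-bit scale.
element_bits = 4
scale_bits = 8
block_size = 32

bits_per_param = element_bits + scale_bits / block_size  # 4 + 0.25 = 4.25
reduction = 1 - bits_per_param / 16                      # relative to 16-bit weights

print(f"effective bits/param: {bits_per_param:.2f}")  # 4.25
print(f"weight memory saved:  {reduction:.1%}")       # ~73.4%
```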

Usage

MR-GPTQ quantized models with QuTLASS kernels are supported in the following integrations (minimal loading sketches follow the list):

  • transformers with these features:
    • Available in main (Documentation).
    • RTN on-the-fly quantization.
    • Pseudo-quantization QAT.
  • vLLM with these features:
    • Available in this PR.
    • Compatible with real-quantized models produced by FP-Quant and by the transformers integration.
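
To load the pre-quantized checkpoint with transformers, the standard `from_pretrained` path applies. This sketch assumes a recent transformers release that includes the FP-Quant/QuTLASS integration and a GPU supported by the QuTLASS kernels; it is not a guaranteed recipe from this card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-mxfp"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Quantization settings are read from the checkpoint's config;
# a transformers version with the FP-Quant integration is assumed.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```

Similarly, a minimal generation sketch with vLLM, assuming a vLLM build that includes the MR-GPTQ/QuTLASS support from the PR referenced above (only the standard `LLM`/`SamplingParams` API is used):

```python
from vllm import LLM, SamplingParams

# Requires a vLLM build with the MR-GPTQ / QuTLASS kernel support
# referenced in the PR above.
llm = LLM(model="ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-mxfp")

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Briefly explain MXFP4 quantization."], params)
print(outputs[0].outputs[0].text)
```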

Evaluation

This model was evaluated on a subset of the OpenLLM v1 benchmarks and on the Platinum bench suite. Model outputs were generated with the vLLM engine.

OpenLLM v1 results

| Model | MMLU-CoT | GSM8k | Hellaswag | Winogrande | Average | Recovery (%) |
|---|---|---|---|---|---|---|
| meta-llama/Llama-3.1-8B-Instruct | 0.7276 | 0.8506 | 0.8001 | 0.7790 | 0.7893 | |
| ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-mxfp | 0.6754 | 0.7892 | 0.7737 | 0.7324 | 0.7427 | 94.09 |
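
Recovery appears to be computed as the ratio of the quantized model's average score to the baseline's average. A quick check against the per-task numbers above, assuming that definition:

```python
# Recovery (%) = 100 * mean(quantized scores) / mean(baseline scores),
# using the per-task values from the table above.
baseline  = [0.7276, 0.8506, 0.8001, 0.7790]  # MMLU-CoT, GSM8k, Hellaswag, Winogrande
quantized = [0.6754, 0.7892, 0.7737, 0.7324]

recovery = 100 * (sum(quantized) / len(quantized)) / (sum(baseline) / len(baseline))
print(f"{recovery:.2f}%")  # -> 94.09%
```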

Platinum bench results

Below we report recoveries on individual tasks as well as the average recovery.

Recovery by Task

| Task | Recovery (%) |
|---|---|
| SingleOp | 97.94 |
| SingleQ | 95.95 |
| MultiArith | 98.22 |
| SVAMP | 95.08 |
| GSM8K | 93.69 |
| MMLU-Math | 80.54 |
| BBH-LogicalDeduction-3Obj | 89.87 |
| BBH-ObjectCounting | 82.03 |
| BBH-Navigate | 90.66 |
| TabFact | 86.92 |
| HotpotQA | 96.81 |
| SQuAD | 98.46 |
| DROP | 94.33 |
| Winograd-WSC | 89.47 |
| Average | 92.14 |