Llama-3.1-8B-Instruct-MR-GPTQ-mxfp

Model Overview

This model was obtained by quantizing the weights of Llama-3.1-8B-Instruct to the MXFP4 data type. This optimization reduces the number of bits per parameter from 16 to 4.25, cutting disk size and GPU memory requirements by approximately 73%.
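
The 4.25 bits per parameter follow from the MXFP4 format, which stores 4-bit elements in blocks of 32 that share one 8-bit scale. Below is a minimal sanity check of those figures (plain arithmetic, not code from this repository):

```python
# Back-of-the-envelope MXFP4 footprint estimate.
# MXFP4 packs 4-bit elements into blocks of 32 that share one 8-bit scale.
element_bits = 4
scale_bits = 8
block_size = 32

bits_per_param = element_bits + scale_bits / block_size  # 4 + 0.25 = 4.25
reduction = 1 - bits_per_param / 16                      # relative to 16-bit weights

print(f"effective bits/param: {bits_per_param:.2f}")  # 4.25
print(f"weight memory saved:  {reduction:.1%}")       # ~73.4%
```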

Usage

MR-GPTQ quantized models with QuTLASS kernels are supported in the following integrations (minimal loading sketches follow the list):

  • transformers with these features:
    • Available in main (Documentation).
    • RTN on-the-fly quantization.
    • Pseudo-quantization QAT.
  • vLLM with these features:
    • Available in this PR.
    • Compatible with real-quantized models produced by FP-Quant and by the transformers integration.
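
To load the pre-quantized checkpoint with transformers, the standard `from_pretrained` path applies. This sketch assumes a recent transformers release that includes the FP-Quant/QuTLASS integration and a GPU supported by the QuTLASS kernels; it is not a guaranteed recipe from this card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-mxfp"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Quantization settings are read from the checkpoint's config;
# a transformers version with the FP-Quant integration is assumed.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```

Similarly, a minimal generation sketch with vLLM, assuming a vLLM build that includes the MR-GPTQ/QuTLASS support from the PR referenced above (only the standard `LLM`/`SamplingParams` API is used):

```python
from vllm import LLM, SamplingParams

# Requires a vLLM build with the MR-GPTQ / QuTLASS kernel support
# referenced in the PR above.
llm = LLM(model="ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-mxfp")

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Briefly explain MXFP4 quantization."], params)
print(outputs[0].outputs[0].text)
```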

Evaluation

This model was evaluated on a subset of the OpenLLM v1 benchmarks and on the Platinum bench suite. Model outputs were generated with the vLLM engine.

OpenLLM v1 results

| Model | MMLU-CoT | GSM8k | Hellaswag | Winogrande | Average | Recovery (%) |
|---|---|---|---|---|---|---|
| meta-llama/Llama-3.1-8B-Instruct | 0.7276 | 0.8506 | 0.8001 | 0.7790 | 0.7893 | |
| ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-mxfp | 0.6754 | 0.7892 | 0.7737 | 0.7324 | 0.7427 | 94.09 |
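
Recovery appears to be computed as the ratio of the quantized model's average score to the baseline's average. A quick check against the per-task numbers above, assuming that definition:

```python
# Recovery (%) = 100 * mean(quantized scores) / mean(baseline scores),
# using the per-task values from the table above.
baseline  = [0.7276, 0.8506, 0.8001, 0.7790]  # MMLU-CoT, GSM8k, Hellaswag, Winogrande
quantized = [0.6754, 0.7892, 0.7737, 0.7324]

recovery = 100 * (sum(quantized) / len(quantized)) / (sum(baseline) / len(baseline))
print(f"{recovery:.2f}%")  # -> 94.09%
```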

Platinum bench results

Below we report recoveries on individual tasks as well as the average recovery.

Recovery by Task

| Task | Recovery (%) |
|---|---|
| SingleOp | 97.94 |
| SingleQ | 95.95 |
| MultiArith | 98.22 |
| SVAMP | 95.08 |
| GSM8K | 93.69 |
| MMLU-Math | 80.54 |
| BBH-LogicalDeduction-3Obj | 89.87 |
| BBH-ObjectCounting | 82.03 |
| BBH-Navigate | 90.66 |
| TabFact | 86.92 |
| HotpotQA | 96.81 |
| SQuAD | 98.46 |
| DROP | 94.33 |
| Winograd-WSC | 89.47 |
| Average | 92.14 |