File size: 2,497 Bytes
			
			| 5657196 269ad6e 5657196 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 | ---
license: llama3.1
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- int4
- vllm
- llmcompressor
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---
# Llama-3.1-8B-Instruct-MR-GPTQ-mxfp
## Model Overview
This model was obtained by quantizing the weights of [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) to MXFP4 data type. This optimization reduces the number of bits per parameter from 16 to 4.25, reducing the disk size and GPU memory requirements by approximately 73%.
## Usage 
*MR-GPTQ* quantized models with [QuTLASS](https://github.com/IST-DASLab/qutlass) kernels are supported in the following integrations:
 - `transformers` with these features:
     - Available in `main` ([Documentation](https://huggingface.co/docs/transformers/main/en/quantization/fp_quant#fp-quant)).
     - RTN on-the-fly quantization.
     - Pseudo-quantization QAT.
 - `vLLM` with these features:
     - Available in [this PR](https://github.com/vllm-project/vllm/pull/24440).
     - Compatible with real quantization models from `FP-Quant` and the `transformers` integration.
## Evaluation 
This model was evaluated on a subset of OpenLLM v1 benchmarks and Platinum bench. Model outputs were generated with the `vLLM` engine.
*OpenLLM v1 results*
| Model                                                                                           | MMLU‑CoT | GSM8k | Hellaswag | Winogrande | **Average** | **Recovery (%)** |
|--------------------------------------------------------------------------------------------------|--------:|------:|----------:|-----------:|------------:|-----------------:|
| `meta‑llama/Llama 3.1‑8B‑Instruct`                                                               | 0.7276 | 0.8506 | 0.8001 | 0.7790 | 0.7893 | – |
| `ISTA‑DASLab/Llama‑3.1‑8B‑Instruct‑MR‑GPTQ‑mxfp`                                                | 0.6754 | 0.7892 | 0.7737 | 0.7324 | 0.7427 | 94.09 |
*Platinum bench results*
Below we report recoveries on individual tasks as well as the average recovery.
**Recovery by Task**
| Task | Recovery (%) |
|------|--------------|
| SingleOp | 97.94 |
| SingleQ | 95.95 |
| MultiArith | 98.22 |
| SVAMP | 95.08 |
| GSM8K | 93.69 |
| MMLU-Math | 80.54 |
| BBH-LogicalDeduction-3Obj | 89.87 |
| BBH-ObjectCounting | 82.03 |
| BBH-Navigate | 90.66 |
| TabFact | 86.92 |
| HotpotQA | 96.81 |
| SQuAD | 98.46 |
| DROP | 94.33 |
| Winograd-WSC | 89.47 |
| Average | **92.14** |
 | 
