---
license: llama3.1
library_name: transformers
pipeline_tag: text-generation
tags:
- int4
- vllm
- llmcompressor
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

# Llama-3.1-8B-Instruct-MR-GPTQ-mxfp

## Model Overview

This model was obtained by quantizing the weights of [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) to the MXFP4 data type. This optimization reduces the number of bits per parameter from 16 to 4.25, cutting disk size and GPU memory requirements by approximately 73%.
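The 4.25 bits/parameter figure follows from the MXFP4 layout, where each block of 32 4-bit (E2M1) elements shares one 8-bit (E8M0) scale. A quick back-of-the-envelope check, shown as an illustrative sketch rather than the exact on-disk packing:

```python
# Illustrative check of the bits-per-parameter and memory-reduction figures quoted above
# (ignores layers that typically stay unquantized, such as embeddings and lm_head).
elem_bits = 4        # FP4 (E2M1) element
scale_bits = 8       # shared E8M0 scale
block_size = 32      # MXFP4 block size

bits_per_param = elem_bits + scale_bits / block_size   # 4.25
reduction = 1 - bits_per_param / 16                     # ~0.73

print(f"{bits_per_param} bits/param, ~{reduction:.0%} smaller than BF16")
```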

## Usage 

*MR-GPTQ* quantized models with [QuTLASS](https://github.com/IST-DASLab/qutlass) kernels are supported in the following integrations (a minimal loading sketch follows the list):
 - `transformers` with these features:
     - Available in `main` ([Documentation](https://huggingface.co/docs/transformers/main/en/quantization/fp_quant#fp-quant)).
     - RTN on-the-fly quantization.
     - Pseudo-quantization QAT.
 - `vLLM` with these features:
     - Available in [this PR](https://github.com/vllm-project/vllm/pull/24440).
     - Compatible with real quantization models from `FP-Quant` and the `transformers` integration.
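As a minimal illustration (a sketch under the assumption that a `transformers` build with the FP-Quant/QuTLASS integration and a compatible GPU are installed, not the official snippet from the integrations above), the pre-quantized checkpoint loads like any other `transformers` model; the quantization configuration is picked up from the checkpoint itself:

```python
# Minimal loading sketch: assumes a transformers build with the FP-Quant/QuTLASS
# integration installed; the quantization config is read from the checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-mxfp"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Give me a one-sentence summary of MXFP4."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

With `vLLM`, the same checkpoint would be served through `vllm serve` or the `LLM` Python API once the pull request linked above is part of the installed build.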

## Evaluation 

This model was evaluated on a subset of OpenLLM v1 benchmarks and Platinum bench. Model outputs were generated with the `vLLM` engine.
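The card does not state the exact evaluation harness. As a hypothetical reproduction sketch, the OpenLLM-style tasks could be run through lm-evaluation-harness with its vLLM backend; task names and generation settings here are assumptions, so treat the resulting numbers as indicative only:

```python
# Hypothetical reproduction sketch: assumes lm-evaluation-harness (>= 0.4) with the
# vLLM backend; the exact harness, task configs, and generation settings used for
# this card are not specified, so results may differ from the tables below.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-mxfp,dtype=auto",
    tasks=["gsm8k", "hellaswag", "winogrande"],
)
print(results["results"])
```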

*OpenLLM v1 results*

| Model                                             | MMLU-CoT | GSM8k | Hellaswag | Winogrande | **Average** | **Recovery (%)** |
|---------------------------------------------------|---------:|------:|----------:|-----------:|------------:|-----------------:|
| `meta-llama/Llama-3.1-8B-Instruct`                |   0.7276 | 0.8506 |    0.8001 |     0.7790 |      0.7893 |                – |
| `ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-mxfp`  |   0.6754 | 0.7892 |    0.7737 |     0.7324 |      0.7427 |            94.09 |

*Platinum bench results*

Below we report recoveries on individual tasks as well as the average recovery.

**Recovery by Task**

| Task | Recovery (%) |
|------|--------------|
| SingleOp | 97.94 |
| SingleQ | 95.95 |
| MultiArith | 98.22 |
| SVAMP | 95.08 |
| GSM8K | 93.69 |
| MMLU-Math | 80.54 |
| BBH-LogicalDeduction-3Obj | 89.87 |
| BBH-ObjectCounting | 82.03 |
| BBH-Navigate | 90.66 |
| TabFact | 86.92 |
| HotpotQA | 96.81 |
| SQuAD | 98.46 |
| DROP | 94.33 |
| Winograd-WSC | 89.47 |
| Average | **92.14** |