---
license: llama3.1
library_name: transformers
pipeline_tag: text-generation
tags:
- int4
- vllm
- llmcompressor
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

# Llama-3.1-8B-Instruct-MR-GPTQ-mxfp

## Model Overview

This model was obtained by quantizing the weights of [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) to the MXFP4 data type. This optimization reduces the number of bits per parameter from 16 to 4.25, cutting disk size and GPU memory requirements by approximately 73%.
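
The 4.25 bits/parameter figure follows from MXFP4's block structure. As a sketch of the arithmetic (the block size and scale width below are taken from the OCP Microscaling format, not stated on this card):

```python
# MXFP4 stores weights in blocks of 32 four-bit (E2M1) values, with each
# block sharing one 8-bit (E8M0) exponent scale -- assumed here per the
# OCP Microscaling spec.
BLOCK_SIZE = 32
bits_per_param = 4 + 8 / BLOCK_SIZE     # 4-bit value + amortized scale
reduction = 1 - bits_per_param / 16     # relative to 16-bit weights

print(bits_per_param)             # 4.25
print(round(reduction * 100, 1))  # 73.4
```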

## Usage

*MR-GPTQ* quantized models with [QuTLASS](https://github.com/IST-DASLab/qutlass) kernels are supported in the following integrations:

- `transformers`, with these features:
  - Available in `main` ([documentation](https://huggingface.co/docs/transformers/main/en/quantization/fp_quant#fp-quant)).
  - RTN on-the-fly quantization.
  - Pseudo-quantization QAT.
- `vLLM`, with these features:
  - Available in [this PR](https://github.com/vllm-project/vllm/pull/24440).
  - Compatible with real-quantization models from `FP-Quant` and the `transformers` integration.
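
A minimal offline-inference sketch using `vLLM`'s standard `LLM`/`SamplingParams` API (this assumes a vLLM build that includes the PR referenced above; the sampling settings are illustrative, not prescribed by this card):

```python
from vllm import LLM, SamplingParams

# Load the MXFP4-quantized checkpoint. Requires a vLLM build with the
# MR-GPTQ / QuTLASS support linked in the Usage section.
llm = LLM(model="ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-mxfp")

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Explain MXFP4 quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```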

## Evaluation

This model was evaluated on a subset of the OpenLLM v1 benchmarks and on PlatinumBench. Model outputs were generated with the `vLLM` engine.

*OpenLLM v1 results*

| Model | MMLU-CoT | GSM8k | Hellaswag | Winogrande | **Average** | **Recovery (%)** |
|--------------------------------------------------|--------:|------:|----------:|-----------:|------------:|-----------------:|
| `meta-llama/Llama-3.1-8B-Instruct` | 0.7276 | 0.8506 | 0.8001 | 0.7790 | 0.7893 | – |
| `ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-mxfp` | 0.6754 | 0.7892 | 0.7737 | 0.7324 | 0.7427 | 94.09 |
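
Reading the table above, the recovery figure is consistent with the quantized model's average score expressed as a percentage of the baseline average (an interpretation of the numbers, not a methodology stated on this card):

```python
# Per-task scores copied from the OpenLLM v1 table above.
baseline = [0.7276, 0.8506, 0.8001, 0.7790]   # Llama-3.1-8B-Instruct
quantized = [0.6754, 0.7892, 0.7737, 0.7324]  # MR-GPTQ-mxfp

baseline_avg = sum(baseline) / len(baseline)
quantized_avg = sum(quantized) / len(quantized)
recovery = 100 * quantized_avg / baseline_avg

print(round(baseline_avg, 4))   # 0.7893
print(round(quantized_avg, 4))  # 0.7427
print(round(recovery, 2))       # 94.09
```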

*PlatinumBench results*

Below we report recoveries on individual tasks as well as the average recovery.

**Recovery by Task**

| Task | Recovery (%) |
|------|-------------:|
| SingleOp | 97.94 |
| SingleQ | 95.95 |
| MultiArith | 98.22 |
| SVAMP | 95.08 |
| GSM8K | 93.69 |
| MMLU-Math | 80.54 |
| BBH-LogicalDeduction-3Obj | 89.87 |
| BBH-ObjectCounting | 82.03 |
| BBH-Navigate | 90.66 |
| TabFact | 86.92 |
| HotpotQA | 96.81 |
| SQuAD | 98.46 |
| DROP | 94.33 |
| Winograd-WSC | 89.47 |
| Average | **92.14** |