|
--- |
|
license: apache-2.0 |
|
library_name: vllm |
|
pipeline_tag: image-text-to-text |
|
tags: |
|
- int4 |
|
- vllm |
|
- llmcompressor |
|
base_model: mistralai/Mistral-Small-3.1-24B-Instruct-2503 |
|
--- |
|
|
|
# Mistral-Small-3.1-24B-Instruct-2503-GPTQ-4b-128g |
|
|
|
## Model Overview |
|
|
|
This model was obtained by quantizing the weights of [Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503) to the INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, cutting disk size and GPU memory requirements by approximately 75%.
|
|
|
Only the weights of the linear operators within the `language_model` transformer blocks are quantized; the vision model and multimodal projector are kept in their original precision. Weights are quantized using a symmetric per-group scheme with a group size of 128, and the GPTQ algorithm is applied for quantization.
|
|
|
The model checkpoint is saved in the [compressed_tensors](https://github.com/neuralmagic/compressed-tensors) format.
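
The exact quantization recipe is not published in this card. As an illustration only, the sketch below shows how a comparable W4A16 scheme (4-bit symmetric weights, group size 128) could be expressed with [llm-compressor](https://github.com/vllm-project/llm-compressor), excluding the vision tower, multimodal projector, and `lm_head` from quantization. The `ignore` patterns, calibration dataset, and sample counts are assumptions, not the settings used to produce this checkpoint.

```python
# Illustrative llm-compressor recipe (NOT the exact recipe used for this checkpoint).
# Assumes a recent llmcompressor release and that the module-name patterns below match
# the model; a multimodal model may also need a custom calibration data collator.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",                     # quantize only linear operators
    scheme="W4A16",                       # 4-bit symmetric per-group weights, group size 128
    ignore=[
        "lm_head",                        # keep the output head in original precision
        "re:vision_tower.*",              # keep the vision model in original precision
        "re:multi_modal_projector.*",     # keep the multimodal projector in original precision
    ],
)

oneshot(
    model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    dataset="open_platypus",              # placeholder calibration dataset
    recipe=recipe,
    output_dir="Mistral-Small-3.1-24B-Instruct-2503-GPTQ-4b-128g",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```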
|
|
|
## Evaluation |
|
|
|
This model was evaluated on the OpenLLM v1 benchmarks. Model outputs were generated with the `vLLM` engine. Recovery is reported as the ratio of the quantized model's average score to that of the original model.
|
|
|
| Model | ArcC | GSM8k | Hellaswag | MMLU | TruthfulQA-mc2 | Winogrande | Average | Recovery | |
|
|----------------------------|:------:|:------:|:---------:|:------:|:--------------:|:----------:|:-------:|:--------:| |
|
| Mistral-Small-3.1-24B-Instruct-2503 | 0.7125 | 0.8848 | 0.8576 | 0.8107 | 0.6409 | 0.8398 | 0.7910 | 1.0000 | |
|
| Mistral-Small-3.1-24B-Instruct-2503-INT4 (this model) | 0.7073 | 0.8711 | 0.8530 | 0.8062 | 0.6252 | 0.8256 | 0.7814 | 0.9878 |
|
|
|
## Reproduction |
|
|
|
The results were obtained using the following commands: |
|
|
|
```bash |
|
MODEL=ISTA-DASLab/Mistral-Small-3.1-24B-Instruct-2503-GPTQ-4b-128g |
|
MODEL_ARGS="pretrained=$MODEL,max_model_len=4096,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.80" |
|
|
|
lm_eval \ |
|
--model vllm \ |
|
--model_args $MODEL_ARGS \ |
|
--tasks openllm \ |
|
--batch_size auto |
|
``` |
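
The commands above assume `lm-evaluation-harness` is installed with its vLLM extra; the exact harness version used for the reported numbers is not specified. For example:

```bash
pip install "lm_eval[vllm]"
```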
|
|
|
## Usage |
|
|
|
* To use the model with `transformers`, update the package to a release that includes Mistral-3 support, for example:
|
|
|
`pip install git+https://github.com/huggingface/[email protected]` |
|
* To use the model with `vLLM`, update the package to version `vllm>=0.8.0`. A minimal vLLM sketch is provided at the end of this section.
|
|
|
An example of inference via `transformers` is provided below:
|
|
|
```python |
|
# pip install accelerate |
|
|
|
from transformers import AutoProcessor, AutoModelForImageTextToText

import torch
|
|
|
model_id = "ISTA-DASLab/Mistral-Small-3.1-24B-Instruct-2503-GPTQ-4b-128g" |
|
|
|
# load the quantized checkpoint in its stored dtype (weights are in compressed-tensors format)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
).eval()
|
|
|
processor = AutoProcessor.from_pretrained(model_id) |
|
|
|
messages = [ |
|
{ |
|
"role": "system", |
|
"content": [{"type": "text", "text": "You are a helpful assistant."}] |
|
}, |
|
{ |
|
"role": "user", |
|
"content": [ |
|
{"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"}, |
|
{"type": "text", "text": "Describe this image in detail."} |
|
] |
|
} |
|
] |
|
|
|
inputs = processor.apply_chat_template( |
|
messages, add_generation_prompt=True, tokenize=True, |
|
return_dict=True, return_tensors="pt" |
|
).to(model.device, dtype=torch.bfloat16) |
|
|
|
input_len = inputs["input_ids"].shape[-1] |
|
|
|
with torch.inference_mode(): |
|
generation = model.generate(**inputs, max_new_tokens=100, do_sample=False) |
|
generation = generation[0][input_len:] |
|
|
|
decoded = processor.decode(generation, skip_special_tokens=True) |
|
print(decoded) |
|
``` |
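
For completeness, a minimal text-only chat sketch with `vLLM` is shown below. The prompt, sampling settings, and context length are illustrative; image inputs follow vLLM's multimodal API.

```python
# Minimal vLLM sketch (text-only chat); adjust max_model_len and GPU settings to your hardware.
from vllm import LLM, SamplingParams

model_id = "ISTA-DASLab/Mistral-Small-3.1-24B-Instruct-2503-GPTQ-4b-128g"

llm = LLM(model=model_id, max_model_len=4096)
sampling_params = SamplingParams(temperature=0.0, max_tokens=100)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Describe the Eiffel Tower in one sentence."},
]

# llm.chat applies the model's chat template before generating
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```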