---
license: apache-2.0
base_model:
- mistralai/Devstral-Small-2505
datasets:
- nvidia/OpenCodeInstruct
pipeline_tag: text2text-generation
tags:
- gptq
- vllm
- llmcompressor
- text-generation-inference
---

# mistralai/Devstral-Small-2505 Quantized with GPTQ (4-bit weight-only, W4A16)

This repo contains mistralai/Devstral-Small-2505 quantized with asymmetric GPTQ to 4-bit to make it suitable for consumer hardware. The model was calibrated with 2048 samples of max sequence length 4096 from the dataset [`nvidia/OpenCodeInstruct`](https://huggingface.co/datasets/nvidia/OpenCodeInstruct).

This is my second quantized model; suggestions are welcome. In particular, the peculiarities of Mistral's Tekken tokenizer were tricky to figure out. The 2048 samples / 4096 sequence length were chosen over the llmcompressor defaults of 512 / 2048 to reduce overfitting risk and improve convergence.

Original model:

- [mistralai/Devstral-Small-2505](https://huggingface.co/mistralai/Devstral-Small-2505)

## 📥 Usage & Running Instructions

The model was tested with vLLM; the script below is sized for 32 GB GPUs. It reserves 31.2 GiB of GPU VRAM, so you should run your OS on an iGPU. An example client request against this server is shown at the end of this card.

```bash
export MODEL="mratsim/Devstral-Small-2505.w4a16-gptq"
vllm serve "${MODEL}" \
   --served-model-name devstral-32b \
   --gpu-memory-utilization 0.95 \
   --enable-prefix-caching \
   --enable-chunked-prefill \
   --max-model-len 94000 \
   --max-num-seqs 256 \
   --tokenizer-mode mistral \
   --generation-config "${MODEL}" \
   --enable-auto-tool-choice --tool-call-parser mistral
```

## 🔬 Quantization method

The llmcompressor library was used with the following recipe for asymmetric GPTQ:

```yaml
default_stage:
  default_modifiers:
    GPTQModifier:
      dampening_frac: 0.005
      config_groups:
        group_0:
          targets: [Linear]
          weights: {num_bits: 4, type: int, symmetric: false, group_size: 128,
            strategy: group, dynamic: false, observer: minmax}
      ignore: [lm_head]
```

and calibrated on 2048 samples of sequence length 4096 from [`nvidia/OpenCodeInstruct`](https://huggingface.co/datasets/nvidia/OpenCodeInstruct).
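
For reproducibility, below is a minimal, untested sketch of the calibration run. Exact `oneshot` import paths and argument names differ between llmcompressor versions, the `train` split and `input`/`output` column names assumed for OpenCodeInstruct should be checked against the actual dataset schema, and the Tekken tokenizer handling mentioned above may need adjustment.

```python
# Sketch of the GPTQ calibration run (assumptions: llmcompressor API as in its
# official examples, OpenCodeInstruct exposes a "train" split with
# "input"/"output" columns, and AutoTokenizer handles the Tekken tokenizer).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot  # older versions: from llmcompressor.transformers import oneshot

MODEL_ID = "mistralai/Devstral-Small-2505"
NUM_CALIBRATION_SAMPLES = 2048
MAX_SEQUENCE_LENGTH = 4096

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Build calibration prompts from OpenCodeInstruct (column names are assumptions).
ds = load_dataset("nvidia/OpenCodeInstruct", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(sample):
    # Render each sample through the chat template so calibration sees
    # the same token distribution as inference.
    messages = [
        {"role": "user", "content": sample["input"]},
        {"role": "assistant", "content": sample["output"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(preprocess)
ds = ds.map(tokenize, remove_columns=ds.column_names)

# recipe.yaml holds the GPTQ recipe shown in the section above.
oneshot(
    model=model,
    dataset=ds,
    recipe="recipe.yaml",
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained("Devstral-Small-2505.w4a16-gptq", save_compressed=True)
tokenizer.save_pretrained("Devstral-Small-2505.w4a16-gptq")
```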
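
## 💬 Example client request

For completeness, here is a minimal sketch of a request against the vLLM server started in the usage section above, using vLLM's OpenAI-compatible API. The port (8000) is vLLM's default and the prompt is only an illustration.

```python
# Query the server started with `vllm serve` above.
# Assumes the default port 8000 and the served model name "devstral-32b".
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="devstral-32b",
    messages=[
        {"role": "user",
         "content": "Write a Python function that checks whether a string is a palindrome."},
    ],
    temperature=0.2,
    max_tokens=512,
)
print(response.choices[0].message.content)
```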