# mistralai/Devstral-Small-2505 Quantized with GPTQ (4-bit weight-only, W4A16)
This repo contains mistralai/Devstral-Small-2505 quantized to 4-bit with asymmetric GPTQ, making it suitable for consumer hardware. The model was calibrated with 2048 samples of max sequence length 4096 from the dataset nvidia/OpenCodeInstruct.
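For a rough sense of the savings (assuming Devstral Small has about 23.6B parameters, a figure not stated above), the back-of-the-envelope arithmetic below shows why 4-bit weights fit on a single 32GB GPU while BF16 weights do not:

```python
# Back-of-the-envelope weight memory, assuming ~23.6B parameters.
params = 23.6e9
bf16_gb = params * 2 / 1e9  # 2 bytes per BF16 weight
# 4-bit weight plus, per group of 128 weights, a 16-bit scale and a
# 4-bit zero-point (asymmetric, group-wise quantization).
w4a16_gb = params * (0.5 + (2 + 0.5) / 128) / 1e9
print(f"BF16: {bf16_gb:.1f} GB, W4A16 g128: {w4a16_gb:.1f} GB")
# BF16: 47.2 GB, W4A16 g128: 12.3 GB
```

The remaining VRAM on a 32GB card is then available for the KV cache.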
This is my second model; I welcome suggestions. In particular, the peculiarities of Mistral's Tekken tokenizer were tricky to figure out.
The calibration settings (2048 samples at sequence length 4096) were chosen over the defaults of 512 samples at length 2048 to minimize overfitting risk and maximize convergence.
Original model: mistralai/Devstral-Small-2505
## 📥 Usage & Running Instructions
The model was tested with vLLM; the launch script below is sized for 32GB VRAM GPUs. It reserves 31.2GiB of GPU VRAM for vLLM, so your OS display should run on an iGPU.
```bash
export MODEL="mratsim/Devstral-Small-2505.w4a16-gptq"
vllm serve "${MODEL}" \
  --served-model-name devstral-32b \
  --gpu-memory-utilization 0.95 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-model-len 94000 \
  --max-num-seqs 256 \
  --tokenizer-mode mistral \
  --generation-config "${MODEL}" \
  --enable-auto-tool-choice --tool-call-parser mistral
```
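Once the server is up, it exposes an OpenAI-compatible API. Here is a minimal client sketch, assuming the default endpoint `http://localhost:8000/v1` (the `api_key` value is a placeholder, since vLLM requires none by default):

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="devstral-32b",  # must match --served-model-name above
    messages=[
        {
            "role": "user",
            "content": "Write a Python function that checks whether a string is a palindrome.",
        },
    ],
)
print(response.choices[0].message.content)
```

Sampling defaults come from the repo's generation config via `--generation-config "${MODEL}"`, so no explicit temperature is needed.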
## 🔬 Quantization method
The llmcompressor library was used with the following recipe for asymmetric GPTQ:
```yaml
default_stage:
  default_modifiers:
    GPTQModifier:
      dampening_frac: 0.005
      config_groups:
        group_0:
          targets: [Linear]
          weights: {num_bits: 4, type: int, symmetric: false, group_size: 128,
            strategy: group, dynamic: false, observer: minmax}
      ignore: [lm_head]
```
It was calibrated on 2048 samples of max sequence length 4096 from nvidia/OpenCodeInstruct.
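For reference, here is a minimal sketch of how such a recipe can be applied with llmcompressor's `oneshot` API. The dataset preprocessing is simplified, the `input`/`output` column names for nvidia/OpenCodeInstruct are an assumption, and import paths may vary between llmcompressor versions:

```python
from datasets import load_dataset
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "mistralai/Devstral-Small-2505"

# Same asymmetric W4A16 scheme as the recipe above.
recipe = GPTQModifier(
    dampening_frac=0.005,
    ignore=["lm_head"],
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": False,
                "group_size": 128,
                "strategy": "group",
                "dynamic": False,
                "observer": "minmax",
            },
        }
    },
)

# Calibration set: 2048 samples from nvidia/OpenCodeInstruct.
# The "input"/"output" column names are an assumption; adjust to the
# actual dataset schema.
ds = load_dataset("nvidia/OpenCodeInstruct", split="train").shuffle(seed=42)
ds = ds.select(range(2048))
ds = ds.map(lambda x: {"text": x["input"] + "\n" + x["output"]})

oneshot(
    model=MODEL_ID,
    dataset=ds,
    recipe=recipe,
    max_seq_length=4096,
    num_calibration_samples=2048,
    output_dir="Devstral-Small-2505.w4a16-gptq",
)
```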