---
license: apache-2.0
base_model:
- mistralai/Devstral-Small-2505
datasets:
- nvidia/OpenCodeInstruct
pipeline_tag: text2text-generation
tags:
- gptq
- vllm
- llmcompressor
- text-generation-inference
---

# mistralai/Devstral-Small-2505 Quantized with GPTQ (4-bit weight-only, W4A16)

This repo contains mistralai/Devstral-Small-2505 quantized with asymmetric GPTQ to 4-bit to make it suitable for consumer hardware. The model was calibrated with 2048 samples of max sequence length 4096 from the dataset [`nvidia/OpenCodeInstruct`](https://huggingface.co/datasets/nvidia/OpenCodeInstruct).

This is my second quantized model; suggestions are welcome. In particular, the peculiarities of Mistral's Tekken tokenizer were tricky to figure out. The 2048 samples / 4096 sequence length were chosen over the llmcompressor defaults of 512 / 2048 to reduce overfitting risk and improve convergence.

Original model:

- [mistralai/Devstral-Small-2505](https://huggingface.co/mistralai/Devstral-Small-2505)

## 📥 Usage & Running Instructions

The model was tested with vLLM; the script below is sized for 32 GB GPUs. It reserves 31.2 GiB of GPU VRAM, so you should run your OS on an iGPU. An example client request against this server is shown at the end of this card.

```bash
export MODEL="mratsim/Devstral-Small-2505.w4a16-gptq"
vllm serve "${MODEL}" \
   --served-model-name devstral-32b \
   --gpu-memory-utilization 0.95 \
   --enable-prefix-caching \
   --enable-chunked-prefill \
   --max-model-len 94000 \
   --max-num-seqs 256 \
   --tokenizer-mode mistral \
   --generation-config "${MODEL}" \
   --enable-auto-tool-choice --tool-call-parser mistral
```

## 🔬 Quantization method

The llmcompressor library was used with the following recipe for asymmetric GPTQ:

```yaml
default_stage:
  default_modifiers:
    GPTQModifier:
      dampening_frac: 0.005
      config_groups:
        group_0:
          targets: [Linear]
          weights: {num_bits: 4, type: int, symmetric: false, group_size: 128,
            strategy: group, dynamic: false, observer: minmax}
      ignore: [lm_head]
```

and calibrated on 2048 samples of sequence length 4096 from [`nvidia/OpenCodeInstruct`](https://huggingface.co/datasets/nvidia/OpenCodeInstruct).
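
For reproducibility, below is a minimal, untested sketch of the calibration run. Exact `oneshot` import paths and argument names differ between llmcompressor versions, the `train` split and `input`/`output` column names assumed for OpenCodeInstruct should be checked against the actual dataset schema, and the Tekken tokenizer handling mentioned above may need adjustment.

```python
# Sketch of the GPTQ calibration run (assumptions: llmcompressor API as in its
# official examples, OpenCodeInstruct exposes a "train" split with
# "input"/"output" columns, and AutoTokenizer handles the Tekken tokenizer).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot  # older versions: from llmcompressor.transformers import oneshot

MODEL_ID = "mistralai/Devstral-Small-2505"
NUM_CALIBRATION_SAMPLES = 2048
MAX_SEQUENCE_LENGTH = 4096

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Build calibration prompts from OpenCodeInstruct (column names are assumptions).
ds = load_dataset("nvidia/OpenCodeInstruct", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(sample):
    # Render each sample through the chat template so calibration sees
    # the same token distribution as inference.
    messages = [
        {"role": "user", "content": sample["input"]},
        {"role": "assistant", "content": sample["output"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(preprocess)
ds = ds.map(tokenize, remove_columns=ds.column_names)

# recipe.yaml holds the GPTQ recipe shown in the section above.
oneshot(
    model=model,
    dataset=ds,
    recipe="recipe.yaml",
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained("Devstral-Small-2505.w4a16-gptq", save_compressed=True)
tokenizer.save_pretrained("Devstral-Small-2505.w4a16-gptq")
```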
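
## 💬 Example client request

For completeness, here is a minimal sketch of a request against the vLLM server started in the usage section above, using vLLM's OpenAI-compatible API. The port (8000) is vLLM's default and the prompt is only an illustration.

```python
# Query the server started with `vllm serve` above.
# Assumes the default port 8000 and the served model name "devstral-32b".
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="devstral-32b",
    messages=[
        {"role": "user",
         "content": "Write a Python function that checks whether a string is a palindrome."},
    ],
    temperature=0.2,
    max_tokens=512,
)
print(response.choices[0].message.content)
```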