---
license: apache-2.0
tags:
- mistral
- devstral
- fp8
- quantization
- llm-compressor
- vllm
pipeline_tag: text-generation
base_model:
- mistralai/Devstral-Small-2505
---

# Devstral-Small-2505-FP8-Dynamic

This is a version of [mistralai/Devstral-Small-2505](https://huggingface.co/mistralai/Devstral-Small-2505) quantized to FP8 (weights and dynamic activations) using [llm-compressor](https://github.com/vllm-project/llm-compressor).

This model format is particularly useful for accelerated inference with [vLLM](https://github.com/vllm-project/vllm) on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper, Blackwell, or newer).

## Model Description

Devstral is a cutting-edge, versatile language model developed by Mistral AI, fine-tuned for development tasks. This version has been quantized to FP8 precision for weights (static, per-channel) and activations (dynamic, per-token), with the `lm_head` layer kept in its original precision.

## Quantization with llm-compressor

The model was quantized using the `oneshot` method from `llm-compressor` with the `FP8_DYNAMIC` scheme. No calibration dataset is required for this scheme. The following script was used for the conversion:

```python
import os
import shutil

import torch
from huggingface_hub import hf_hub_download
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoModelForCausalLM

MODEL_ID = "mistralai/Devstral-Small-2505"

# Load the model.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)

# Devstral ships its tokenizer as tekken.json; load it with mistral-common
# instead of AutoTokenizer.
tekken_file = hf_hub_download(repo_id=MODEL_ID, filename="tekken.json")
tokenizer = MistralTokenizer.from_file(tekken_file)

# Configure the quantization algorithm and scheme. In this case we:
#   * quantize the weights to FP8 (static, per-channel) via PTQ
#   * quantize the activations to FP8 (dynamic, per-token)
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

# Apply quantization.
oneshot(model=model, recipe=recipe, tokenizer=tokenizer)

# Confirm that generations from the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
prompt = "The capital of France is"

# Build a ChatCompletionRequest with a single UserMessage and encode it.
# The actual token IDs live in the '.tokens' attribute of the payload.
chat_request = ChatCompletionRequest(messages=[UserMessage(content=prompt)])
tokenized_payload = tokenizer.encode_chat_completion(chat_request)
encoded_prompt_ids = tokenized_payload.tokens

# Convert to a PyTorch tensor on the model's device and generate.
input_ids = torch.tensor([encoded_prompt_ids], device=model.device)
output = model.generate(input_ids, max_new_tokens=20)

# model.generate returns the prompt tokens followed by the new tokens.
# Decoding the whole sequence is fine for a quick sanity check; to keep only
# the generated part, slice it first:
#   generated_token_ids = output[0][len(encoded_prompt_ids):]
print(tokenizer.decode(output[0].tolist()))
print("==========================================")

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)  # saves the quantized model

# MistralTokenizer has no save_pretrained(); copy tekken.json into SAVE_DIR
# so the tokenizer travels with the quantized weights.
os.makedirs(SAVE_DIR, exist_ok=True)
destination_tekken_file = os.path.join(SAVE_DIR, "tekken.json")
shutil.copyfile(tekken_file, destination_tekken_file)

print(f"Model saved to {SAVE_DIR}")
print(f"Tokenizer file (tekken.json) copied to {destination_tekken_file}")
```
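The resulting folder contains the compressed-tensors weights, the updated `config.json` (with its `quantization_config`), and the copied `tekken.json`, so it can be uploaded to the Hugging Face Hub as-is. Below is a minimal sketch of that upload step; it is not part of the conversion script above, the target repo id is purely illustrative, and it assumes you are logged in via `huggingface-cli login` with write access.

```python
from huggingface_hub import HfApi

api = HfApi()

# Illustrative target repo id -- replace with your own namespace/name.
repo_id = "your-username/Devstral-Small-2505-FP8-Dynamic"

# Create the repo if it does not exist yet, then upload the folder written
# by the conversion script above (SAVE_DIR).
api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="Devstral-Small-2505-FP8-Dynamic",  # SAVE_DIR from the script
    repo_id=repo_id,
    repo_type="model",
)
```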
## Inference Example

This model can be loaded and run with `transformers` and `mistral-common`, or, for optimized FP8 inference, with vLLM.

### Using transformers and mistral-common (for functional checking, not FP8 optimized)

> [!NOTE]
> The following inference code block has been functionally verified.
> The example was successfully executed inside the following Docker container on a system with an NVIDIA RTX 5090 GPU:
>
> ```bash
> # 1. Set your Hugging Face token
> export HF_TOKEN="YOUR_HUGGINGFACE_ACCESS_TOKEN_HERE"
>
> # 2. Run the Triton Server container with GPU access and the necessary privileges
> sudo docker run --gpus all -it --rm \
>     --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 \
>     -e HF_TOKEN=$HF_TOKEN --net host \
>     nvcr.io/nvidia/tritonserver:25.04-trtllm-python-py3
> ```
>
> Inside the container, the Python script was run after installing the necessary packages:
>
> ```bash
> pip install torch transformers huggingface_hub mistral-common
> python your_inference_script.py
> ```

```python
import torch
from huggingface_hub import hf_hub_download
from mistral_common.protocol.instruct.messages import SystemMessage, UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoModelForCausalLM

MODEL_REPO_ID = "textgeflecht/Devstral-Small-2505-FP8-llmcompressor"

# Load the model.
# True FP8 execution requires a dedicated inference engine such as vLLM;
# transformers will load the weights but may not run them in actual FP8.
# device_map="auto" and torch_dtype="auto" are good starting points.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_REPO_ID, device_map="auto", torch_dtype="auto"
)

# Load the tokenizer from the tekken.json file in the repo.
tekken_file = hf_hub_download(repo_id=MODEL_REPO_ID, filename="tekken.json")
tokenizer = MistralTokenizer.from_file(tekken_file)

# (Optional) Load a system prompt if your model uses one and it is in the repo.
# try:
#     system_prompt_file = hf_hub_download(repo_id=MODEL_REPO_ID, filename="SYSTEM_PROMPT.txt")
#     with open(system_prompt_file, "r") as f:
#         SYSTEM_PROMPT = f.read()
# except Exception:  # pylint: disable=broad-except
#     SYSTEM_PROMPT = "You are a helpful coding assistant."

# Development-specific example:
# prompt = "Write a python function that calculates the factorial of a number."
# Quick example:
prompt = "What is the capital of France?"

messages = [
    # SystemMessage(content=SYSTEM_PROMPT),  # Uncomment if using a system prompt
    UserMessage(content=prompt),
]

# Build and encode the chat completion request.
chat_request = ChatCompletionRequest(messages=messages)
tokenized_payload = tokenizer.encode_chat_completion(chat_request)
input_ids = torch.tensor([tokenized_payload.tokens], device=model.device)
attention_mask = torch.ones_like(input_ids)

# Generate output.
# Setting pad_token_id avoids warnings during open-ended generation. For
# Mistral/Devstral models using this tokenizer, the EOS token id is 2, which
# is suitable as pad_token_id here.
output = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=200,
    pad_token_id=2,   # EOS token id used as pad token id
    do_sample=True,   # Sampling parameters for more diverse outputs
    top_p=0.9,
    temperature=0.7,
)

# Decode only the newly generated tokens.
generated_tokens = output[0][len(tokenized_payload.tokens):]
decoded_output = tokenizer.decode(generated_tokens.tolist())

print("Original Prompt:\n", prompt)
print("\nGenerated Output:\n", decoded_output)
```
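Because `transformers` will load these weights even on hardware or versions without FP8 kernels, it can be useful to confirm that the checkpoint actually carries the compressed-tensors quantization metadata. The snippet below is a small illustrative check, not part of the verified script above; the exact layout of `quantization_config` depends on the `llm-compressor`/`compressed-tensors` versions used at export time.

```python
from transformers import AutoConfig

MODEL_REPO_ID = "textgeflecht/Devstral-Small-2505-FP8-llmcompressor"

config = AutoConfig.from_pretrained(MODEL_REPO_ID)

# llm-compressor records its settings under `quantization_config` in config.json.
# Depending on the transformers version this may surface as a plain dict or as a
# config object, so handle both defensively.
quant_cfg = getattr(config, "quantization_config", None)
if quant_cfg is None:
    print("No quantization_config found - this does not look like a quantized checkpoint.")
elif isinstance(quant_cfg, dict):
    print("quant_method:", quant_cfg.get("quant_method"))
    print("ignored modules:", quant_cfg.get("ignore"))
else:
    print("quantization_config:", quant_cfg)
```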
### Using vLLM (for optimized FP8 inference)

This model, quantized to FP8 with `llm-compressor`, is designed for efficient inference with vLLM, especially on newer NVIDIA GPUs.

**Prerequisites:**
* A recent version of vLLM. The author's successful tests used a custom, very recent build of vLLM with specific patches for NVIDIA Blackwell FP8 support.
* A compatible NVIDIA GPU (Ada Lovelace, Hopper, Blackwell, or newer architectures are recommended for FP8).
* Docker and the NVIDIA Container Toolkit installed.

**Running with Docker (Recommended & Tested by Author):**

The following Docker command starts a vLLM OpenAI-compatible server with this quantized model. This setup has been verified by the author to load the model successfully and serve requests.

```bash
# 1. Set your Hugging Face token (optional, but recommended to avoid rate limits or for private models)
# export HUGGING_FACE_HUB_TOKEN="YOUR_HUGGINGFACE_ACCESS_TOKEN_HERE"

# 2. Run the vLLM Docker container.
#    Replace 'vllm/vllm-openai:latest' with your specific vLLM image if using a custom build.
#    The 'latest' tag should pull a recent official build of vLLM.
sudo docker run --gpus all \
    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
    -p 8000:8000 \
    -e HUGGING_FACE_HUB_TOKEN="$HUGGING_FACE_HUB_TOKEN" \
    vllm/vllm-openai:latest \
    --model textgeflecht/Devstral-Small-2505-FP8-llmcompressor \
    --tokenizer_mode mistral \
    --load_format auto \
    --max-model-len 2048   # Optional: limit context length to manage VRAM usage

# Optional: add Mistral-specific tool-usage flags if your application needs them:
#   --tool-call-parser mistral \
#   --enable-auto-tool-choice \
# Optional: explicitly set the tensor parallel size if you have multiple GPUs, e.g. on 2 GPUs:
#   --tensor-parallel-size 2
```

Key Command-Line Arguments Used:

- `--model textgeflecht/Devstral-Small-2505-FP8-llmcompressor`: Specifies the quantized model from the Hugging Face Hub.
- `--tokenizer_mode mistral`: Essential for vLLM to correctly use the MistralTokenizer with the `tekken.json` file from the repository.
- `--load_format auto`: Allows vLLM to auto-detect the Hugging Face sharded safetensors format for the weights. With this, vLLM reads `config.json` (which includes a `quantization_config` with `quant_method: "compressed-tensors"`) and auto-detects the FP8 quantization scheme.
- `--max-model-len 2048`: Limits the maximum sequence length (input + output tokens combined) to manage VRAM. Adjust this value based on your needs and available GPU memory.
- The flags `--tool-call-parser mistral` and `--enable-auto-tool-choice` can be added if you intend to use Devstral's tool-calling capabilities.

Note on FP8 Support (especially for newer architectures like Blackwell):

- vLLM's support for FP8, particularly on the newest GPU architectures such as NVIDIA Blackwell, is an area of active development.
- The successful tests of this model on Blackwell used a custom, very recent vLLM build with specific patches for Blackwell FP8 support.
- While the standard `vllm/vllm-openai:latest` images are updated regularly, cutting-edge hardware support and specific quantization schemes may take time to be fully integrated and stabilized in official releases.
- If you encounter FP8 performance or compatibility issues on very new hardware with official vLLM builds, check the vLLM GitHub repository issues and discussions for the latest status, potential workarounds, or information on required builds.

Interacting with the Server:

- Once the vLLM server is running, it exposes an OpenAI-compatible API.
- You can interact with it using any OpenAI client library (such as `openai` for Python, sketched below) or tools like `curl`.
- Endpoint for chat completions: `http://localhost:8000/v1/chat/completions`
- Model name in requests: `textgeflecht/Devstral-Small-2505-FP8-llmcompressor`
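As a concrete starting point, the request below uses the official `openai` Python package against the locally running server. This is a minimal sketch rather than a verified part of this card: install the client with `pip install openai`, the API key is a dummy value (vLLM only enforces one if the server was started with `--api-key`), and the prompt is just an example.

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the key is a placeholder unless the
# server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="textgeflecht/Devstral-Small-2505-FP8-llmcompressor",
    messages=[
        {"role": "user", "content": "Write a Python function that reverses a string."},
    ],
    max_tokens=200,
    temperature=0.7,
)

print(response.choices[0].message.content)
```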
Refer to the Python `requests` example in the original [mistralai/Devstral-Small-2505](https://huggingface.co/mistralai/Devstral-Small-2505) model card for further client-side interaction, adjusting the URL and model name as needed.

### Original Model Card (mistralai/Devstral-Small-2505)

For more details on the base model, please refer to the original model card: https://huggingface.co/mistralai/Devstral-Small-2505