Devstral-Small-2505-FP8-Dynamic

This is a version of mistralai/Devstral-Small-2505 quantized to FP8 (weights and dynamic activations) using llm-compressor.

This model format is particularly useful for accelerated inference with vLLM on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper, Blackwell, or newer).

Model Description

Devstral is a cutting-edge, versatile language model developed by Mistral AI, fine-tuned for development tasks. This version has been quantized to FP8 precision for weights (static, per-channel) and activations (dynamic, per-token), with the lm_head layer kept in its original precision.
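
The quantization parameters are recorded in the quantization_config section of the repository's config.json. The following minimal sketch, assuming the repository ID used later in this card, inspects them with transformers:

from transformers import AutoConfig

# Repository ID of this quantized model on the Hugging Face Hub.
MODEL_REPO_ID = "textgeflecht/Devstral-Small-2505-FP8-llmcompressor"

config = AutoConfig.from_pretrained(MODEL_REPO_ID)

# For compressed-tensors checkpoints, the quantization metadata (scheme,
# targeted modules, and ignored modules such as lm_head) should appear here.
print(getattr(config, "quantization_config", "no quantization_config found"))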

Quantization with llm-compressor

The model was quantized using the oneshot method from llm-compressor with the FP8_DYNAMIC scheme. No calibration dataset was required for this quantization scheme.

The following script was used for conversion:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from huggingface_hub import hf_hub_download
import shutil
import os

MODEL_ID = "mistralai/Devstral-Small-2505"

# Load model.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
#tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tekken_file = hf_hub_download(repo_id=MODEL_ID, filename="tekken.json")
tokenizer = MistralTokenizer.from_file(tekken_file)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to FP8 per channel via PTQ (static)
#   * quantize the activations to FP8 per token (dynamic)
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

# Apply quantization.
oneshot(model=model, recipe=recipe, tokenizer=tokenizer)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
prompt = "The capital of France is"

# Create a ChatCompletionRequest with a single UserMessage
chat_request = ChatCompletionRequest(
    messages=[
        UserMessage(content=prompt)
    ]
)

# Encode the request using the tokenizer's specific method
tokenized_payload = tokenizer.encode_chat_completion(chat_request)
# The token IDs are exposed via the .tokens attribute
encoded_prompt_ids = tokenized_payload.tokens

# Convert to a PyTorch tensor and move to the model's device.
input_ids = torch.tensor([encoded_prompt_ids], device=model.device)

# Generate output
output = model.generate(input_ids, max_new_tokens=20)

# Decode the output.
# model.generate returns the prompt tokens followed by the newly generated ones;
# for a quick sanity check, decoding the full sequence (prompt included) is fine.
# To decode only the new tokens, slice off the prompt first:
# generated_token_ids = output[0][len(encoded_prompt_ids):]
print(tokenizer.decode(output[0].tolist()))
print("==========================================")

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR) # This saves the quantized model

# --- Correct way to "save" the MistralTokenizer ---
# Ensure the save directory exists
if not os.path.exists(SAVE_DIR):
    os.makedirs(SAVE_DIR)

# Define the destination path for tekken.json within your SAVE_DIR
destination_tekken_file = os.path.join(SAVE_DIR, "tekken.json")

# Copy the tekken.json file from its original download location to your SAVE_DIR
shutil.copyfile(tekken_file, destination_tekken_file)
# --- End of tokenizer saving ---

print(f"Model saved to {SAVE_DIR}")
print(f"Tokenizer file (tekken.json) copied to {destination_tekken_file}")

Inference Example

This model can be loaded and run with transformers and mistral-common, or, for optimized FP8 inference, with vLLM.

Using transformers and mistral-common (for functional checking, not FP8 optimized)

The following inference code has been functionally verified: it was executed successfully inside the Docker container environment below, on a system with an NVIDIA RTX 5090 GPU:

# 1. Set your Hugging Face Token
export HF_TOKEN="YOUR_HUGGINGFACE_ACCESS_TOKEN_HERE"

# 2. Run the Triton Server container with GPU access and necessary privileges
sudo docker run --gpus all -it --rm \
  --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 \
  -e HF_TOKEN=$HF_TOKEN --net host \
  nvcr.io/nvidia/tritonserver:25.04-trtllm-python-py3

Inside the container, install the necessary packages and run the inference script:

pip install torch transformers huggingface_hub mistral-common
python your_inference_script.py

The contents of your_inference_script.py:

import torch
from transformers import AutoModelForCausalLM
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage, SystemMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from huggingface_hub import hf_hub_download

MODEL_REPO_ID = "textgeflecht/Devstral-Small-2505-FP8-llmcompressor"

# Load model
# For FP8 inference, specific inference engines like vLLM are needed.
# Transformers will load the weights but might not run them in true FP8.
# device_map="auto" and torch_dtype="auto" are good starting points.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_REPO_ID, 
    device_map="auto", 
    torch_dtype="auto"
)

# Load tokenizer from the tekken.json file within the repo
tekken_file = hf_hub_download(repo_id=MODEL_REPO_ID, filename="tekken.json")
tokenizer = MistralTokenizer.from_file(tekken_file)

# (Optional) Load System Prompt if your model uses one and it's in the repo
# try:
#     system_prompt_file = hf_hub_download(repo_id=MODEL_REPO_ID, filename="SYSTEM_PROMPT.txt")
#     with open(system_prompt_file, "r") as f:
#         SYSTEM_PROMPT = f.read()
# except Exception: # pylint: disable=broad-except
#     SYSTEM_PROMPT = "You are a helpful coding assistant."

# dev specific example:
# prompt = "Write a python function that calculates the factorial of a number."
# quick example:
prompt = "What is the capital of France?"

messages = [
    # SystemMessage(content=SYSTEM_PROMPT), # Uncomment if using a system prompt
    UserMessage(content=prompt)
]

# Create ChatCompletionRequest
chat_request = ChatCompletionRequest(messages=messages)

# Encode the request
tokenized_payload = tokenizer.encode_chat_completion(chat_request)
input_ids = torch.tensor([tokenized_payload.tokens], device=model.device)
attention_mask = torch.ones_like(input_ids)

# Generate output
# Note: Setting pad_token_id is common for open-ended generation to prevent warnings.
# For Mistral/Devstral models using this tokenizer, the End-Of-Sentence (EOS) token ID is 2,
# which is suitable for use as pad_token_id in this context.
output = model.generate(
    input_ids, 
    attention_mask=attention_mask,
    max_new_tokens=200, 
    pad_token_id=2, # EOS token ID used as PAD token ID
    do_sample=True, # Add sampling parameters for more diverse outputs
    top_p=0.9,
    temperature=0.7
)

# Decode only the generated tokens
generated_tokens = output[0][len(tokenized_payload.tokens):]
decoded_output = tokenizer.decode(generated_tokens.tolist())

print("Original Prompt:\n", prompt)
print("\nGenerated Output:\n", decoded_output)

Using vLLM (for optimized FP8 inference)

This model, quantized to FP8 with llm-compressor, is designed for efficient inference with vLLM, especially on newer NVIDIA GPUs.

Prerequisites:

  • A recent version of vLLM. The author's successful tests used a custom, very recent build of vLLM with specific patches for NVIDIA Blackwell FP8 support.
  • A compatible NVIDIA GPU (Ada Lovelace, Hopper, Blackwell, or newer architectures are recommended for FP8); a quick capability check is sketched after this list.
  • Docker and NVIDIA Container Toolkit installed.
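
As referenced in the prerequisites above, FP8 kernels need a GPU with compute capability 8.9 or higher. A minimal sketch to confirm what the visible GPU reports, assuming PyTorch with CUDA support is installed on the host:

import torch

# FP8 (E4M3 / E5M2) tensor-core support starts at compute capability 8.9
# (Ada Lovelace); Hopper reports 9.0 and Blackwell parts report 10.x or 12.x.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA device visible; FP8 inference will not work here.")

major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
print("FP8-capable:", (major, minor) >= (8, 9))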

Running with Docker (Recommended & Tested by Author):

The following Docker command starts a vLLM OpenAI-compatible server with this quantized model. This setup has been verified by the author to load the model successfully and serve requests; a quick verification sketch follows the argument list below.

# 1. Set your Hugging Face token (optional, but recommended to avoid rate limits or for private models)
# export HUGGING_FACE_HUB_TOKEN="YOUR_HUGGINGFACE_ACCESS_TOKEN_HERE"

# 2. Run the vLLM Docker container.
# Replace 'vllm/vllm-openai:latest' with your specific vLLM image if using a custom build.
# The 'latest' tag should pull a recent official build from vLLM.
sudo docker run --gpus all \
    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
    -p 8000:8000 \
    -e HUGGING_FACE_HUB_TOKEN="$HUGGING_FACE_HUB_TOKEN" \
    vllm/vllm-openai:latest \
    --model textgeflecht/Devstral-Small-2505-FP8-llmcompressor \
    --tokenizer_mode mistral \
    --load_format auto \
    --max-model-len 2048 # Optional: Limit VRAM usage

    # Optional: Add Mistral-specific tool usage flags if needed by your application.
    # --tool-call-parser mistral \
    # --enable-auto-tool-choice \
    # Optional: Explicitly set tensor parallel size if you have multiple GPUs e.g. on 2 GPUs:
    # --tensor-parallel-size 2

Key Command-Line Arguments Used:

  • --model textgeflecht/Devstral-Small-2505-FP8-llmcompressor: Specifies the quantized model from the Hugging Face Hub.
  • --tokenizer_mode mistral: Essential for vLLM to correctly use the MistralTokenizer with the tekken.json file from the repository.
  • --load_format auto: Allows vLLM to auto-detect the Hugging Face sharded safetensors format for weights. With this, vLLM successfully reads the config.json (which includes quantization_config with quant_method: "compressed-tensors") and auto-detects the FP8 quantization scheme.
  • --max-model-len 2048: Limits the maximum sequence length (input + output tokens combined) to manage VRAM. Adjust this value based on your needs and available GPU memory.
  • The flags --tool-call-parser mistral and --enable-auto-tool-choice can be added if you intend to use Devstral's tool-calling capabilities.

Note on FP8 Support (especially for newer architectures like Blackwell):

  • vLLM's support for FP8, particularly on the newest GPU architectures like NVIDIA Blackwell, is an area of active development.
  • The successful tests for this model on Blackwell used a custom, very recent vLLM build with specific patches for Blackwell FP8 support.
  • While standard vllm/vllm-openai:latest images are updated regularly, cutting-edge hardware support and specific quantization schemes might take time to be fully integrated and stabilized in official releases.
  • If you encounter issues related to FP8 performance or compatibility on very new hardware with official vLLM builds, it's recommended to check the vLLM GitHub repository issues and discussions for the latest status, potential workarounds, or information on required builds.

Interacting with the Server:

  • Once the vLLM server is running, it exposes an OpenAI-compatible API.
  • You can interact with it using any OpenAI client library (such as openai for Python) or tools like curl; a short client sketch follows this list.
  • Endpoint for chat completions: http://localhost:8000/v1/chat/completions
  • Model name in requests: Use textgeflecht/Devstral-Small-2505-FP8-llmcompressor
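
A short client sketch, as referenced above, using the openai Python package against the local server (the API key is a placeholder, since no key was configured when starting the server above):

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server; the API key is a
# placeholder, as the server above was started without one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="textgeflecht/Devstral-Small-2505-FP8-llmcompressor",
    messages=[
        {"role": "user", "content": "Write a Python function that reverses a string."},
    ],
    max_tokens=256,
    temperature=0.7,
)

print(completion.choices[0].message.content)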

Refer to the Python requests example in the original https://huggingface.co/mistralai/Devstral-Small-2505 model card for client-side interaction, adjusting the URL and model name as needed.

Original Model Card (mistralai/Devstral-Small-2505)

For more details on the base model, please refer to the original model card: https://huggingface.co/mistralai/Devstral-Small-2505
