Quantization NVFP4A16

Quantized from https://huggingface.co/unsloth/Devstral-Small-2507 (chosen because the tokenizer ships in-folder). Compressed with llm-compressor.

We recommend CUDA compute capability 12.0 hardware (NVIDIA Blackwell: RTX 5000-series GPUs, DGX Spark, B200, ...) due to native FP4 acceleration.
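To verify that your GPU reports compute capability 12.0 before downloading the weights, a quick check (a sketch, assuming a CUDA-enabled PyTorch install, which vLLM pulls in anyway):

python -c "import torch; print(torch.cuda.get_device_capability(0))"

On Blackwell hardware this should print (12, 0).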

Devstral Small 1.1

Devstral is an agentic LLM for software engineering tasks built under a collaboration between Mistral AI and All Hands AI 🙌. Devstral excels at using tools to explore codebases, editing multiple files, and powering software engineering agents. The model achieves remarkable performance on SWE-Bench, which positions it as the #1 open-source model on this benchmark.

It is fine-tuned from Mistral-Small-3.1, so it retains a long context window of up to 128k tokens. Devstral is a text-only coding agent: the vision encoder was removed before fine-tuning from Mistral-Small-3.1.

For enterprises requiring specialized capabilities (increased context, domain-specific knowledge, etc.), we will release commercial models beyond what Mistral AI contributes to the community.

Learn more about Devstral in our blog post.

Updates compared to Devstral Small 1.0:

  • Improved performance, please refer to the benchmark results.
  • Devstral Small 1.1 is still great when paired with OpenHands. This new version also generalizes better to other prompts and coding environments.
  • Supports Mistral's function calling format.

Key Features:

  • Agentic coding: Devstral is designed to excel at agentic coding tasks, making it a great choice for software engineering agents.
  • Lightweight: thanks to its compact quantized size, Devstral NVFP4A16 is light enough to run on a single RTX 5060 Ti 16GB, making it an appropriate model for local deployment and on-device use.
  • Apache 2.0 License: Open license allowing usage and modification for both commercial and non-commercial purposes.
  • Context Window: A 128k context window.
  • Tokenizer: Utilizes a Tekken tokenizer with a 131k vocabulary size (see the quick check below).
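
Because the tokenizer ships in-folder, you can load it with transformers and confirm the vocabulary size; a minimal sketch (the expectation of roughly 131k entries comes from the figure above):

from transformers import AutoTokenizer

# Load the in-folder Tekken tokenizer from this quantized repo.
tokenizer = AutoTokenizer.from_pretrained("apolloparty/Devstral-Small-2507-NVFP4A16")
print(len(tokenizer))  # expected on the order of 131k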

Benchmark Results (base model, unquantized)

SWE-Bench

Devstral Small 1.1 achieves a score of 53.6% on SWE-Bench Verified, outperforming Devstral Small 1.0 by +6.8% and the second-best state-of-the-art model by +11.4%.

| Model              | Agentic Scaffold   | SWE-Bench Verified (%) |
|--------------------|--------------------|------------------------|
| Devstral Small 1.1 | OpenHands Scaffold | 53.6                   |
| Devstral Small 1.0 | OpenHands Scaffold | 46.8                   |
| GPT-4.1-mini       | OpenAI Scaffold    | 23.6                   |
| Claude 3.5 Haiku   | Anthropic Scaffold | 40.6                   |
| SWE-smith-LM 32B   | SWE-agent Scaffold | 40.2                   |
| Skywork SWE        | OpenHands Scaffold | 38.0                   |
| DeepSWE            | R2E-Gym Scaffold   | 42.2                   |

When evaluated under the same test scaffold (OpenHands, provided by All Hands AI 🙌), Devstral exceeds far larger models such as DeepSeek-V3-0324 and Qwen3-235B-A22B.

Local Inference Usage

We recommend using Devstral NVFP4A16 with [vLLM >= 0.9.1](https://github.com/vllm-project/vllm/releases/tag/v0.9.1). Other methods are untested.

vLLM (recommended, other methods untested)

We recommend using this model with the vLLM library to implement production-ready inference pipelines.

Installation

Make sure you install vLLM >= 0.9.1:

pip install vllm --extra-index-url https://download.pytorch.org/whl/cu128
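
You can confirm the installed version (mirroring the mistral_common check below):

python -c "import vllm; print(vllm.__version__)"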

Also make sure you have mistral_common >= 1.7.0 installed:

pip install mistral-common --upgrade

To check:

python -c "import mistral_common; print(mistral_common.__version__)"

You can also make use of a ready-to-go Docker image or pull one from Docker Hub.
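
For example, with the vLLM project's vllm/vllm-openai image (a sketch; the tag, port, and cache mount are assumptions to adapt to your setup, and the serve flags mirror the command in the next section):

docker run --gpus all --ipc=host -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    --model apolloparty/Devstral-Small-2507-NVFP4A16 \
    --tool-call-parser mistral --enable-auto-tool-choice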

Launch server

We recommend that you use Devstral in a server/client setting.

  1. Spin up a server:
vllm serve apolloparty/Devstral-Small-2507-NVFP4A16 --tool-call-parser mistral --enable-auto-tool-choice
  2. To query the server, you can use a simple Python snippet:
import requests
import json
from huggingface_hub import hf_hub_download


url = "http://<your-server-url>:8000/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}

model = "apolloparty/Devstral-Small-2507-NVFP4A16"

def load_system_prompt(repo_id: str, filename: str) -> str:
    """Download Devstral's system prompt (SYSTEM_PROMPT.txt) from the model repo."""
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    return system_prompt

SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "<your-command>",
            },
        ],
    },
]

data = {"model": model, "messages": messages, "temperature": 0.15}

# Devstral Small 1.1 supports tool calling. If you want to use tools, follow this:
# tools = [ # Define tools for vLLM
#     {
#         "type": "function",
#         "function": {
#             "name": "git_clone",
#             "description": "Clone a git repository",
#             "parameters": {
#                 "type": "object",
#                 "properties": {
#                     "url": {
#                         "type": "string",
#                         "description": "The url of the git repository",
#                     },
#                 },
#                 "required": ["url"],
#             },
#         },
#     }
# ] 
# data = {"model": model, "messages": messages, "temperature": 0.15, "tools": tools} # Pass tools to payload.

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json()["choices"][0]["message"]["content"])
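
If you pass tools (see the commented-out block above), the assistant message may contain tool calls instead of plain content. A minimal sketch of handling that case, following the OpenAI-compatible response schema that vLLM serves:

message = response.json()["choices"][0]["message"]
if message.get("tool_calls"):
    # The model requested a tool invocation (e.g. git_clone): run the tool,
    # then send its result back as a {"role": "tool", ...} message.
    for call in message["tool_calls"]:
        print(call["function"]["name"], call["function"]["arguments"])
else:
    print(message["content"])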
Model size: 13.8B params (Safetensors)
Tensor types: BF16, F32, F8_E4M3, U8