Mungert
/

Mistral-Small-3.1-24B-Instruct-2503-GGUF

+---
+language:
+- en
+- fr
+- de
+- es
+- pt
+- it
+- ja
+- ko
+- ru
+- zh
+- ar
+- fa
+- id
+- ms
+- ne
+- pl
+- ro
+- sr
+- sv
+- tr
+- uk
+- vi
+- hi
+- bn
+license: apache-2.0
+library_name: vllm
+inference: false
+base_model:
+- mistralai/Mistral-Small-3.1-24B-Base-2503
+extra_gated_description: If you want to learn more about how we process your personal
+  data, please read our <a href="https://mistral.ai/terms/">Privacy Policy</a>.
+---
+# <span style="color: #7FFF7F;">Mistral-Small-3.1-24B-Instruct-2503 GGUF Models</span>
+## **Choosing the Right Model Format**
+Selecting the correct model format depends on your **hardware capabilities** and **memory constraints**.
+### **BF16 (Brain Float 16) – Use if BF16 acceleration is available**
+- A 16-bit floating-point format designed for **faster computation** while retaining good precision.
+- Provides a **dynamic range similar to FP32** but with **lower memory usage**.
+- Recommended if your hardware supports **BF16 acceleration** (check your device’s specs).
+- Ideal for **high-performance inference** with **reduced memory footprint** compared to FP32.
+📌 **Use BF16 if:**
+✔ Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs).
+✔ You want **higher precision** while saving memory.
+✔ You plan to **requantize** the model into another format.
+📌 **Avoid BF16 if:**
+❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower).
+❌ You need compatibility with older devices that lack BF16 optimization.
+---
+### **F16 (Float 16) – More widely supported than BF16**
+- A 16-bit floating-point format with **higher precision** than BF16 (10 mantissa bits versus 7) but a narrower range of values (5 exponent bits versus 8).
+- Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs).
+- The narrower dynamic range can cause overflow or underflow on extreme values, but it is generally sufficient for inference.
+📌 **Use F16 if:**
+✔ Your hardware supports **FP16** but **not BF16**.
+✔ You need a **balance between speed, memory usage, and accuracy**.
+✔ You are running on a **GPU** or another device optimized for FP16 computations.
+📌 **Avoid F16 if:**
+❌ Your device lacks **native FP16 support** (it may run slower than expected).
+❌ You have memory limitations.
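The range and precision trade-off between the two 16-bit formats follows directly from how each one splits its bits between exponent and mantissa. The sketch below derives the largest finite value and the spacing just above 1.0 from those bit counts; the helper name `float_format_stats` is made up for this illustration, and the formula assumes an IEEE-style layout with one sign bit.

```python
def float_format_stats(exponent_bits: int, mantissa_bits: int) -> dict:
    """Return the largest finite value and machine epsilon of an IEEE-style format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # Largest finite value: all-ones mantissa at the largest non-reserved exponent.
    max_finite = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    # Epsilon: gap between 1.0 and the next representable value.
    epsilon = 2.0 ** -mantissa_bits
    return {"max_finite": max_finite, "epsilon": epsilon}


# (exponent bits, mantissa bits) for each format
FORMATS = {
    "FP32": (8, 23),
    "BF16": (8, 7),
    "FP16": (5, 10),
}

for name, (exp_bits, man_bits) in FORMATS.items():
    stats = float_format_stats(exp_bits, man_bits)
    print(f"{name}: max ~ {stats['max_finite']:.4g}, epsilon = {stats['epsilon']:.3g}")
```

BF16 shares FP32's 8 exponent bits, so it covers roughly the same range, while FP16 tops out at 65504 but resolves values near 1.0 eight times more finely than BF16.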
+---
+### **Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference**
+Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
+- **Lower-bit models (Q4_K)** → **Best for minimal memory usage**, but may have lower precision.
+- **Higher-bit models (Q6_K, Q8_0)** → **Better accuracy**, but require more memory.
+📌 **Use Quantized Models if:**
+✔ You are running inference on a **CPU** and need an optimized model.
+✔ Your device has **low VRAM** and cannot load full-precision models.
+✔ You want to reduce **memory footprint** while keeping reasonable accuracy.
+📌 **Avoid Quantized Models if:**
+❌ You need **maximum accuracy** (full-precision models are better for this).
+❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).
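To make the idea of block-wise quantization concrete, here is a simplified sketch in the style of Q8_0: weights are grouped into blocks of 32, and each block stores one scale plus 32 signed 8-bit integers. This is an illustration only, not llama.cpp's implementation; the function names are hypothetical, and the scale is kept as a Python float rather than the 16-bit float a real file would store.

```python
import random

BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32


def quantize_block(weights):
    """Quantize one block to (scale, list of ints in [-127, 127])."""
    amax = max(abs(w) for w in weights)
    scale = amax / 127 if amax > 0 else 0.0
    quants = [round(w / scale) if scale else 0 for w in weights]
    return scale, quants


def dequantize_block(scale, quants):
    """Reconstruct approximate weights from a quantized block."""
    return [scale * q for q in quants]


random.seed(0)
block = [random.uniform(-1, 1) for _ in range(BLOCK_SIZE)]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.6f}, max round-trip error={max_error:.6f}")
```

Each weight is reconstructed to within half a quantization step (`scale / 2`). Lower-bit formats use the same idea with fewer integer levels per weight, which is where the extra accuracy loss comes from.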
+---
+### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)**
+These models are optimized for **extreme memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint.
+- **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **extreme memory efficiency**.
+  - **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large.
+  - **Trade-off**: Lower accuracy compared to higher-bit quantizations.
+- **IQ3_S**: Small block size for **maximum memory efficiency**.
+  - **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive.
+- **IQ3_M**: Medium block size for better accuracy than **IQ3_S**.
+  - **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting.
+- **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy.
+  - **Use case**: Best for **low-memory devices** where **Q6_K** is too large.
+- **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**.
+  - **Use case**: Best for **ARM-based devices** or **low-memory environments**.
+---
+### **Summary Table: Model Format Selection**
+| Model Format  | Precision  | Memory Usage  | Device Requirements  | Best Use Case  |
+|--------------|------------|---------------|----------------------|---------------|
+| **BF16**     | Highest    | High          | BF16-supported GPU/CPUs  | High-speed inference with reduced memory |
+| **F16**      | High       | High          | FP16-supported devices | GPU inference when BF16 isn’t available |
+| **Q4_K**     | Medium Low | Low           | CPU or Low-VRAM devices | Best for memory-constrained environments |
+| **Q6_K**     | Medium     | Moderate      | CPU with more memory | Better accuracy while still being quantized |
+| **Q8_0**     | High       | Moderate      | CPU or GPU with enough VRAM | Best accuracy among quantized models |
+| **IQ3_XS**   | Very Low   | Very Low      | Ultra-low-memory devices | Extreme memory efficiency and low accuracy |
+| **Q4_0**     | Low        | Low           | ARM or low-memory devices | llama.cpp can optimize for ARM devices |
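
As a rough guide to the "Memory Usage" column, weight storage can be estimated from the parameter count and the nominal bits per weight of each format. The sketch below assumes a round 24 billion parameters and uses the nominal block layouts of the ggml quantization types; the files in this repository mix types (for example Q8_0 embeddings in the Q4_K file), and runtime memory also includes the KV cache and buffers, so treat the numbers as approximate lower bounds.

```python
PARAMS = 24e9  # "24 billion parameters" per the model card; approximate

# Nominal bits per weight, derived from each format's block layout.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "F16": 16.0,
    "Q8_0": 8.5,     # 34 bytes per block of 32 weights
    "Q6_K": 6.5625,  # 210 bytes per super-block of 256 weights
    "Q4_K": 4.5,     # 144 bytes per super-block of 256 weights
    "Q4_0": 4.5,     # 18 bytes per block of 32 weights
}


def estimate_gib(params: float, bits_per_weight: float) -> float:
    """Estimate weight storage in GiB, ignoring metadata and mixed-type tensors."""
    return params * bits_per_weight / 8 / 2**30


for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name:>5}: ~{estimate_gib(PARAMS, bpw):.1f} GiB")
```

Comparing these estimates against your available RAM or VRAM, with some headroom for context, is a reasonable first filter before downloading a file.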
+---
+## **Included Files & Details**
+### `Mistral-Small-3.1-24B-Instruct-2503-bf16.gguf`
+- Model weights preserved in **BF16**.
+- Use this if you want to **requantize** the model into a different format.
+- Best if your device supports **BF16 acceleration**.
+### `Mistral-Small-3.1-24B-Instruct-2503-f16.gguf`
+- Model weights stored in **F16**.
+- Use if your device supports **FP16**, especially if BF16 is not available.
+### `Mistral-Small-3.1-24B-Instruct-2503-bf16-q8_0.gguf`
+- **Output & embeddings** remain in **BF16**.
+- All other layers quantized to **Q8_0**.
+- Use if your device supports **BF16** and you want a quantized version.
+### `Mistral-Small-3.1-24B-Instruct-2503-f16-q8_0.gguf`
+- **Output & embeddings** remain in **F16**.
+- All other layers quantized to **Q8_0**.
+### `Mistral-Small-3.1-24B-Instruct-2503-q4_k.gguf`
+- **Output & embeddings** quantized to **Q8_0**.
+- All other layers quantized to **Q4_K**.
+- Good for **CPU inference** with limited memory.
+### `Mistral-Small-3.1-24B-Instruct-2503-q4_k_s.gguf`
+- Smallest **Q4_K** variant, using less memory at the cost of accuracy.
+- Best for **very low-memory setups**.
+### `Mistral-Small-3.1-24B-Instruct-2503-q6_k.gguf`
+- **Output & embeddings** quantized to **Q8_0**.
+- All other layers quantized to **Q6_K**.
+### `Mistral-Small-3.1-24B-Instruct-2503-q8_0.gguf`
+- Fully **Q8_0** quantized model for better accuracy.
+- Requires **more memory** but offers higher precision.
+### `Mistral-Small-3.1-24B-Instruct-2503-iq3_xs.gguf`
+- **IQ3_XS** quantization, optimized for **extreme memory efficiency**.
+- Best for **ultra-low-memory devices**.
+### `Mistral-Small-3.1-24B-Instruct-2503-iq3_m.gguf`
+- **IQ3_M** quantization, offering a **medium block size** for better accuracy.
+- Suitable for **low-memory devices**.
+### `Mistral-Small-3.1-24B-Instruct-2503-q4_0.gguf`
+- Pure **Q4_0** quantization, optimized for **ARM devices**.
+- Best for **low-memory environments**.
+- Prefer IQ4_NL for better accuracy.
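The file names above follow a single pattern, so fetching one variant can be scripted. The sketch below builds the file name from a variant suffix and downloads it with `hf_hub_download` from the `huggingface_hub` package. The repository id is an assumption taken from this page's title, and the helper names are hypothetical.

```python
REPO_ID = "Mungert/Mistral-Small-3.1-24B-Instruct-2503-GGUF"  # assumed from the page title
MODEL_NAME = "Mistral-Small-3.1-24B-Instruct-2503"


def gguf_filename(variant: str) -> str:
    """Build a file name such as '<model>-q4_k.gguf' from a variant suffix."""
    return f"{MODEL_NAME}-{variant}.gguf"


def download(variant: str) -> str:
    """Download one GGUF file and return its local path (requires network access)."""
    # Third-party dependency, imported lazily: pip install huggingface_hub
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=gguf_filename(variant))


if __name__ == "__main__":
    # Only prints the name; call download("q4_k") to actually fetch the file.
    print(gguf_filename("q4_k"))
```

Valid variant suffixes are the ones listed in this section, for example `bf16`, `f16-q8_0`, `q4_k`, `q6_k`, or `iq3_xs`.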
+# <span id="testllm" style="color: #7F7FFF;">🚀 If you find these models useful</span>
+Please click like ❤. I'd also really appreciate it if you could test my Network Monitor Assistant at 👉 [Network Monitor Assistant](https://freenetworkmonitor.click/dashboard).
+💬 Click the **chat icon** (bottom right of the main and dashboard pages). Choose an LLM; toggle between the LLM types TurboLLM -> FreeLLM -> TestLLM.
+### What I'm Testing
+I'm experimenting with **function calling** against my network monitoring service using small open-source models. The question I'm exploring is: how small can a model go and still function?
+🟡 **TestLLM** – Runs the current testing model using llama.cpp on 6 threads of a CPU VM. It should take about 15s to load; inference is quite slow, and it processes only one user prompt at a time (still working on scaling!). If you're curious, I'd be happy to share how it works!
+### The other Available AI Assistants
+🟢 **TurboLLM** – Uses **gpt-4o-mini**. Fast! Note: tokens are limited since OpenAI models are pricey, but you can [Login](https://freenetworkmonitor.click) or [Download](https://freenetworkmonitor.click/download) the Free Network Monitor agent to get more tokens. Alternatively, use FreeLLM.
+🔵 **FreeLLM** – Runs **open-source Hugging Face models**. Medium speed (unlimited, subject to Hugging Face API availability).
+# Model Card for Mistral-Small-3.1-24B-Instruct-2503
+Building upon Mistral Small 3 (2501), Mistral Small 3.1 (2503) **adds state-of-the-art vision understanding** and enhances **long context capabilities up to 128k tokens** without compromising text performance.
+With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.
+This model is an instruction-finetuned version of: [Mistral-Small-3.1-24B-Base-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503).
+Mistral Small 3.1 can be deployed locally and is exceptionally "knowledge-dense," fitting within a single RTX 4090 or a 32GB RAM MacBook once quantized.
+It is ideal for:
+- Fast-response conversational agents.
+- Low-latency function calling.
+- Subject matter experts via fine-tuning.
+- Local inference for hobbyists and organizations handling sensitive data.
+- Programming and math reasoning.
+- Long document understanding.
+- Visual understanding.
+For enterprises requiring specialized capabilities (increased context, specific modalities, domain-specific knowledge, etc.), we will release commercial models beyond what Mistral AI contributes to the community.
+Learn more about Mistral Small 3.1 in our [blog post](https://mistral.ai/news/mistral-small-3-1/).
+## Key Features
+- **Vision:** Vision capabilities enable the model to analyze images and provide insights based on visual content in addition to text.
+- **Multilingual:** Supports dozens of languages, including English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, Farsi.
+- **Agent-Centric:** Offers best-in-class agentic capabilities with native function calling and JSON outputting.
+- **Advanced Reasoning:** State-of-the-art conversational and reasoning capabilities.
+- **Apache 2.0 License:** Open license allowing usage and modification for both commercial and non-commercial purposes.
+- **Context Window:** A 128k context window.
+- **System Prompt:** Maintains strong adherence and support for system prompts.
+- **Tokenizer:** Utilizes a Tekken tokenizer with a 131k vocabulary size.
+## Benchmark Results
+When available, we report numbers previously published by other model providers; otherwise, we re-evaluate them using our own evaluation harness.
+### Pretrain Evals
+| Model                          | MMLU (5-shot) | MMLU Pro (5-shot CoT) | TriviaQA   | GPQA Main (5-shot CoT)| MMMU      |
+|--------------------------------|---------------|-----------------------|------------|-----------------------|-----------|
+| **Small 3.1 24B Base**         | **81.01%**    | **56.03%**            | 80.50%     | **37.50%**            | **59.27%**|
+| Gemma 3 27B PT                 | 78.60%        | 52.20%                | **81.30%** | 24.30%                | 56.10%    |
+### Instruction Evals
+#### Text
+| Model                          | MMLU      | MMLU Pro (5-shot CoT) | MATH                   | GPQA Main (5-shot CoT) | GPQA Diamond (5-shot CoT) | MBPP      | HumanEval | SimpleQA (TotalAcc)|
+|--------------------------------|-----------|-----------------------|------------------------|------------------------|---------------------------|-----------|-----------|--------------------|
+| **Small 3.1 24B Instruct**     | 80.62%    | 66.76%                | 69.30%                 | **44.42%**             | **45.96%**                | 74.71%    | **88.41%**| **10.43%**         |
+| Gemma 3 27B IT                 | 76.90%    | **67.50%**            | **89.00%**             | 36.83%                 | 42.40%                    | 74.40%    | 87.80%    | 10.00%             |
+| GPT4o Mini                     | **82.00%**| 61.70%                | 70.20%                 | 40.20%                 | 39.39%                    | 84.82%    | 87.20%    | 9.50%              |
+| Claude 3.5 Haiku               | 77.60%    | 65.00%                | 69.20%                 | 37.05%                 | 41.60%                    | **85.60%**| 88.10%    | 8.02%              |
+| Cohere Aya-Vision 32B          | 72.14%    | 47.16%                | 41.98%                 | 34.38%                 | 33.84%                    | 70.43%    | 62.20%    | 7.65%              |
+#### Vision
+| Model                          | MMMU       | MMMU PRO  | Mathvista | ChartQA   | DocVQA    | AI2D        | MM MT Bench |
+|--------------------------------|------------|-----------|-----------|-----------|-----------|-------------|-------------|
+| **Small 3.1 24B Instruct**     | 64.00%     | **49.25%**| **68.91%**| 86.24%    | **94.08%**| **93.72%**  | **7.3**     |
+| Gemma 3 27B IT                 | **64.90%** | 48.38%    | 67.60%    | 76.00%    | 86.60%    | 84.50%      | 7           |
+| GPT4o Mini                     | 59.40%     | 37.60%    | 56.70%    | 76.80%    | 86.70%    | 88.10%      | 6.6         |
+| Claude 3.5 Haiku               | 60.50%     | 45.03%    | 61.60%    | **87.20%**| 90.00%    | 92.10%      | 6.5         |
+| Cohere Aya-Vision 32B          | 48.20%     | 31.50%    | 50.10%    | 63.04%    | 72.40%    | 82.57%      | 4.1         |
+### Multilingual Evals
+| Model                          | Average    | European   | East Asian | Middle Eastern |
+|--------------------------------|------------|------------|------------|----------------|
+| **Small 3.1 24B Instruct**     | **71.18%** | **75.30%** | **69.17%** | 69.08%         |
+| Gemma 3 27B IT                 | 70.19%     | 74.14%     | 65.65%     | 70.76%         |
+| GPT4o Mini                     | 70.36%     | 74.21%     | 65.96%     | **70.90%**     |
+| Claude 3.5 Haiku               | 70.16%     | 73.45%     | 67.05%     | 70.00%         |
+| Cohere Aya-Vision 32B          | 62.15%     | 64.70%     | 57.61%     | 64.12%         |
+### Long Context Evals
+| Model                          | LongBench v2    | RULER 32K   | RULER 128K |
+|--------------------------------|-----------------|-------------|------------|
+| **Small 3.1 24B Instruct**     | **37.18%**      | **93.96%**  | 81.20%     |
+| Gemma 3 27B IT                 | 34.59%          | 91.10%      | 66.00%     |
+| GPT4o Mini                     | 29.30%          | 90.20%      | 65.80%     |
+| Claude 3.5 Haiku               | 35.19%          | 92.60%      | **91.90%** |