---
language:
- en
- fr
- de
- es
- pt
- it
- ja
- ko
- ru
- zh
- ar
- fa
- id
- ms
- ne
- pl
- ro
- sr
- sv
- tr
- uk
- vi
- hi
- bn
license: apache-2.0
library_name: vllm
inference: false
base_model:
- mistralai/Mistral-Small-3.1-24B-Base-2503
extra_gated_description: If you want to learn more about how we process your personal data, please read our Privacy Policy.
---

# Mistral-Small-3.1-24B-Instruct-2503 GGUF Models

## **Choosing the Right Model Format**

Selecting the correct model format depends on your **hardware capabilities** and **memory constraints**.

### **BF16 (Brain Float 16) – Use if BF16 acceleration is available**
- A 16-bit floating-point format designed for **faster computation** while retaining good precision.
- Provides a **similar dynamic range** to FP32 but with **lower memory usage**.
- Recommended if your hardware supports **BF16 acceleration** (check your device’s specs).
- Ideal for **high-performance inference** with a **reduced memory footprint** compared to FP32.

πŸ“Œ **Use BF16 if:**
βœ” Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs).
βœ” You want **higher precision** while saving memory.
βœ” You plan to **requantize** the model into another format.

πŸ“Œ **Avoid BF16 if:**
❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.

---

### **F16 (Float 16) – More widely supported than BF16**
- A 16-bit floating-point format with **high precision** but a **smaller range of values** than BF16.
- Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs).
- Slightly lower numerical precision than BF16 but generally sufficient for inference.

πŸ“Œ **Use F16 if:**
βœ” Your hardware supports **FP16** but **not BF16**.
βœ” You need a **balance between speed, memory usage, and accuracy**.
βœ” You are running on a **GPU** or another device optimized for FP16 computation.

πŸ“Œ **Avoid F16 if:**
❌ Your device lacks **native FP16 support** (it may run slower than expected).
❌ You have tight memory limitations.

---

### **Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference**
Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
- **Lower-bit models (Q4_K)** β†’ **Best for minimal memory usage**, may have lower precision.
- **Higher-bit models (Q6_K, Q8_0)** β†’ **Better accuracy**, require more memory.

πŸ“Œ **Use Quantized Models if:**
βœ” You are running inference on a **CPU** and need an optimized model.
βœ” Your device has **low VRAM** and cannot load full-precision models.
βœ” You want to reduce the **memory footprint** while keeping reasonable accuracy.

πŸ“Œ **Avoid Quantized Models if:**
❌ You need **maximum accuracy** (full-precision models are better for this).
❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).

---

### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)**
These models are optimized for **extreme memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint.

- **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **extreme memory efficiency**.
  - **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large.
  - **Trade-off**: Lower accuracy compared to higher-bit quantizations.

- **IQ3_S**: Small block size for **maximum memory efficiency**.
  - **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive.

- **IQ3_M**: Medium block size for better accuracy than **IQ3_S**.
  - **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting.

- **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy.
  - **Use case**: Best for **low-memory devices** where **Q6_K** is too large.

- **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**.
  - **Use case**: Best for **ARM-based devices** or **low-memory environments**.

---

### **Summary Table: Model Format Selection**

| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|--------------|-----------|--------------|---------------------|---------------|
| **BF16** | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
| **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn’t available |
| **Q4_K** | Medium-Low | Low | CPU or low-VRAM devices | Best for memory-constrained environments |
| **Q6_K** | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
| **Q8_0** | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
| **IQ3_XS** | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency, lower accuracy |
| **Q4_0** | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |
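To make the trade-offs above concrete, here is a minimal sketch of loading one of the quantized GGUF files with the `llama-cpp-python` bindings. The file name, context size, and GPU-offload settings are assumptions to adjust for your hardware; any of the files listed in the next section can be used the same way.

```python
# Minimal sketch (assumptions: llama-cpp-python is installed and the chosen GGUF
# file has already been downloaded locally; tune the settings to your hardware).
from llama_cpp import Llama

llm = Llama(
    model_path="Mistral-Small-3.1-24B-Instruct-2503-q4_k.gguf",  # pick a file from the list in the next section
    n_ctx=8192,        # context window; raise it (at a memory cost) for long-document use
    n_gpu_layers=-1,   # offload all layers to GPU if VRAM allows; set 0 for CPU-only inference
    n_threads=8,       # CPU threads used for any layers that stay on the CPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the difference between Q4_K and Q8_0."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```

As a rule of thumb from the table above: pick the largest quant your RAM/VRAM holds comfortably, and drop to Q4_K or the IQ3 variants only when memory forces it.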
---

## **Included Files & Details**

### `Mistral-Small-3.1-24B-Instruct-2503-bf16.gguf`
- Model weights preserved in **BF16**.
- Use this if you want to **requantize** the model into a different format.
- Best if your device supports **BF16 acceleration**.

### `Mistral-Small-3.1-24B-Instruct-2503-f16.gguf`
- Model weights stored in **F16**.
- Use if your device supports **FP16**, especially if BF16 is not available.

### `Mistral-Small-3.1-24B-Instruct-2503-bf16-q8_0.gguf`
- **Output & embeddings** remain in **BF16**.
- All other layers quantized to **Q8_0**.
- Use if your device supports **BF16** and you want a quantized version.

### `Mistral-Small-3.1-24B-Instruct-2503-f16-q8_0.gguf`
- **Output & embeddings** remain in **F16**.
- All other layers quantized to **Q8_0**.

### `Mistral-Small-3.1-24B-Instruct-2503-q4_k.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q4_K**.
- Good for **CPU inference** with limited memory.

### `Mistral-Small-3.1-24B-Instruct-2503-q4_k_s.gguf`
- Smallest **Q4_K** variant, using less memory at the cost of accuracy.
- Best for **very low-memory setups**.

### `Mistral-Small-3.1-24B-Instruct-2503-q6_k.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q6_K**.

### `Mistral-Small-3.1-24B-Instruct-2503-q8_0.gguf`
- Fully **Q8_0** quantized model for better accuracy.
- Requires **more memory** but offers higher precision.

### `Mistral-Small-3.1-24B-Instruct-2503-iq3_xs.gguf`
- **IQ3_XS** quantization, optimized for **extreme memory efficiency**.
- Best for **ultra-low-memory devices**.

### `Mistral-Small-3.1-24B-Instruct-2503-iq3_m.gguf`
- **IQ3_M** quantization, offering a **medium block size** for better accuracy.
- Suitable for **low-memory devices**.

### `Mistral-Small-3.1-24B-Instruct-2503-q4_0.gguf`
- Pure **Q4_0** quantization, optimized for **ARM devices**.
- Best for **low-memory environments**.
- Prefer **IQ4_NL** for better accuracy.
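If you only need one of the files above, you can fetch it individually instead of cloning the whole repository. Below is a minimal sketch using `huggingface_hub`; the repo id is a placeholder (use this repository's id), and the filename can be any entry from the list above.

```python
# Minimal sketch (assumptions: huggingface_hub is installed, and REPO_ID is replaced
# with the id of the Hugging Face repository that hosts these GGUF files).
from huggingface_hub import hf_hub_download

REPO_ID = "<this-repo-id>"  # placeholder for the repository hosting these GGUF files
FILENAME = "Mistral-Small-3.1-24B-Instruct-2503-q4_k.gguf"  # any filename from the list above

local_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
print(f"Downloaded to: {local_path}")
```

The returned path can be passed directly as `model_path` to the loading sketch shown earlier.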
# πŸš€ If you find these models useful

Please click like ❀. I’d also really appreciate it if you could test my Network Monitor Assistant at πŸ‘‰ [Network Monitor Assistant](https://freenetworkmonitor.click/dashboard).

πŸ’¬ Click the **chat icon** (bottom right of the main and dashboard pages). Choose an LLM; toggle between the LLM types TurboLLM -> FreeLLM -> TestLLM.

### What I'm Testing

I'm experimenting with **function calling** against my network monitoring service, using small open-source models. The question I'm exploring is: "How small can a model go and still function?"

🟑 **TestLLM** – Runs the current testing model with llama.cpp on 6 threads of a CPU VM (it takes about 15 seconds to load, inference is quite slow, and it only processes one user prompt at a time – still working on scaling!). If you're curious, I'd be happy to share how it works.

### The Other Available AI Assistants

🟒 **TurboLLM** – Uses **gpt-4o-mini**. Fast! Note: tokens are limited since OpenAI models are pricey, but you can [Login](https://freenetworkmonitor.click) or [Download](https://freenetworkmonitor.click/download) the Free Network Monitor agent to get more tokens, or use the FreeLLM instead.

πŸ”΅ **FreeLLM** – Runs **open-source Hugging Face models**. Medium speed (unlimited, subject to Hugging Face API availability).

# Model Card for Mistral-Small-3.1-24B-Instruct-2503

Building upon Mistral Small 3 (2501), Mistral Small 3.1 (2503) **adds state-of-the-art vision understanding** and enhances **long-context capabilities up to 128k tokens** without compromising text performance.
With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.
This model is an instruction-finetuned version of [Mistral-Small-3.1-24B-Base-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503).

Mistral Small 3.1 can be deployed locally and is exceptionally "knowledge-dense," fitting within a single RTX 4090 or a 32GB RAM MacBook once quantized.

It is ideal for:
- Fast-response conversational agents.
- Low-latency function calling.
- Subject matter experts via fine-tuning.
- Local inference for hobbyists and organizations handling sensitive data.
- Programming and math reasoning.
- Long document understanding.
- Visual understanding.

For enterprises requiring specialized capabilities (increased context, specific modalities, domain-specific knowledge, etc.), we will release commercial models beyond what Mistral AI contributes to the community.

Learn more about Mistral Small 3.1 in our [blog post](https://mistral.ai/news/mistral-small-3-1/).

## Key Features
- **Vision:** Vision capabilities enable the model to analyze images and provide insights based on visual content in addition to text.
- **Multilingual:** Supports dozens of languages, including English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, and Farsi.
- **Agent-Centric:** Offers best-in-class agentic capabilities with native function calling and JSON output.
- **Advanced Reasoning:** State-of-the-art conversational and reasoning capabilities.
- **Apache 2.0 License:** Open license allowing usage and modification for both commercial and non-commercial purposes.
- **Context Window:** A 128k context window.
- **System Prompt:** Maintains strong adherence and support for system prompts.
- **Tokenizer:** Utilizes a Tekken tokenizer with a 131k vocabulary size.
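Since this model card targets vLLM (`library_name: vllm`) and highlights agentic use, here is a minimal, hedged sketch of querying a locally served instance through an OpenAI-compatible endpoint. The serve command in the comment, the port, and the model id are assumptions; check the vLLM documentation for the exact flags your version supports.

```python
# Minimal sketch (assumptions: a vLLM OpenAI-compatible server is already running
# locally, e.g. started with something along the lines of
#   vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 --tokenizer-mode mistral
# and the `openai` Python package is installed; flags, port, and model id are assumptions).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-for-local")

response = client.chat.completions.create(
    model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain what a 128k context window lets me do."},
    ],
    max_tokens=200,
)
print(response.choices[0].message.content)
```

Vision inputs and tool/function-call definitions go through the same OpenAI-style request format, provided the server was started with the corresponding multimodal and tool-calling options.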
## Benchmark Results

When available, we report numbers previously published by other model providers; otherwise we re-evaluate them using our own evaluation harness.

### Pretrain Evals

| Model | MMLU (5-shot) | MMLU Pro (5-shot CoT) | TriviaQA | GPQA Main (5-shot CoT) | MMMU |
|--------------------------------|---------------|-----------------------|------------|------------------------|-----------|
| **Small 3.1 24B Base** | **81.01%** | **56.03%** | 80.50% | **37.50%** | **59.27%** |
| Gemma 3 27B PT | 78.60% | 52.20% | **81.30%** | 24.30% | 56.10% |

### Instruction Evals

#### Text

| Model | MMLU | MMLU Pro (5-shot CoT) | MATH | GPQA Main (5-shot CoT) | GPQA Diamond (5-shot CoT) | MBPP | HumanEval | SimpleQA (TotalAcc) |
|--------------------------------|-----------|-----------------------|------------|------------------------|---------------------------|-----------|-----------|---------------------|
| **Small 3.1 24B Instruct** | 80.62% | 66.76% | 69.30% | **44.42%** | **45.96%** | 74.71% | **88.41%** | **10.43%** |
| Gemma 3 27B IT | 76.90% | **67.50%** | **89.00%** | 36.83% | 42.40% | 74.40% | 87.80% | 10.00% |
| GPT4o Mini | **82.00%** | 61.70% | 70.20% | 40.20% | 39.39% | 84.82% | 87.20% | 9.50% |
| Claude 3.5 Haiku | 77.60% | 65.00% | 69.20% | 37.05% | 41.60% | **85.60%** | 88.10% | 8.02% |
| Cohere Aya-Vision 32B | 72.14% | 47.16% | 41.98% | 34.38% | 33.84% | 70.43% | 62.20% | 7.65% |

#### Vision

| Model | MMMU | MMMU PRO | Mathvista | ChartQA | DocVQA | AI2D | MM MT Bench |
|--------------------------------|------------|------------|------------|-----------|-----------|------------|-------------|
| **Small 3.1 24B Instruct** | 64.00% | **49.25%** | **68.91%** | 86.24% | **94.08%** | **93.72%** | **7.3** |
| Gemma 3 27B IT | **64.90%** | 48.38% | 67.60% | 76.00% | 86.60% | 84.50% | 7 |
| GPT4o Mini | 59.40% | 37.60% | 56.70% | 76.80% | 86.70% | 88.10% | 6.6 |
| Claude 3.5 Haiku | 60.50% | 45.03% | 61.60% | **87.20%** | 90.00% | 92.10% | 6.5 |
| Cohere Aya-Vision 32B | 48.20% | 31.50% | 50.10% | 63.04% | 72.40% | 82.57% | 4.1 |

### Multilingual Evals

| Model | Average | European | East Asian | Middle Eastern |
|--------------------------------|------------|------------|------------|----------------|
| **Small 3.1 24B Instruct** | **71.18%** | **75.30%** | **69.17%** | 69.08% |
| Gemma 3 27B IT | 70.19% | 74.14% | 65.65% | 70.76% |
| GPT4o Mini | 70.36% | 74.21% | 65.96% | **70.90%** |
| Claude 3.5 Haiku | 70.16% | 73.45% | 67.05% | 70.00% |
| Cohere Aya-Vision 32B | 62.15% | 64.70% | 57.61% | 64.12% |

### Long Context Evals

| Model | LongBench v2 | RULER 32K | RULER 128K |
|--------------------------------|--------------|------------|------------|
| **Small 3.1 24B Instruct** | **37.18%** | **93.96%** | 81.20% |
| Gemma 3 27B IT | 34.59% | 91.10% | 66.00% |
| GPT4o Mini | 29.30% | 90.20% | 65.80% |
| Claude 3.5 Haiku | 35.19% | 92.60% | **91.90%** |