Upload README.md with huggingface_hub
---
language:
- en
- fr
- de
- es
- pt
- it
- ja
- ko
- ru
- zh
- ar
- fa
- id
- ms
- ne
- pl
- ro
- sr
- sv
- tr
- uk
- vi
- hi
- bn
license: apache-2.0
library_name: vllm
inference: false
base_model:
- mistralai/Mistral-Small-3.1-24B-Base-2503
extra_gated_description: If you want to learn more about how we process your personal
  data, please read our <a href="https://mistral.ai/terms/">Privacy Policy</a>.
---

# <span style="color: #7FFF7F;">Mistral-Small-3.1-24B-Instruct-2503 GGUF Models</span>

## **Choosing the Right Model Format**

Selecting the correct model format depends on your **hardware capabilities** and **memory constraints**.

### **BF16 (Brain Float 16) – Use if BF16 acceleration is available**
- A 16-bit floating-point format designed for **faster computation** while retaining good precision.
- Provides a **similar dynamic range** to FP32 but with **lower memory usage**.
- Recommended if your hardware supports **BF16 acceleration** (check your device’s specs; a quick check is sketched below).
- Ideal for **high-performance inference** with a **reduced memory footprint** compared to FP32.

📌 **Use BF16 if:**
✔ Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs).
✔ You want **higher precision** while saving memory.
✔ You plan to **requantize** the model into another format.

📌 **Avoid BF16 if:**
❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.

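If you are unsure whether your GPU accelerates BF16, a quick probe is sketched below. This is an illustrative check only: it assumes PyTorch is installed and covers CUDA GPUs only (CPU-side BF16 support such as AVX512-BF16 or AMX has to be checked against your CPU's feature flags instead), and GGUF runtimes such as llama.cpp do their own feature detection regardless.

```python
# Rough, illustrative probe for BF16 acceleration using PyTorch (assumes torch is installed).
# This is not part of the GGUF tooling; it is only a convenience check for choosing a file.
import torch

if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    print("CUDA GPU with native BF16 support detected; the BF16 GGUF is a good fit.")
else:
    print("No native BF16 acceleration detected; prefer the F16 or quantized files.")
```
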
---

### **F16 (Float 16) – More widely supported than BF16**
- A 16-bit floating-point format offering **high precision**, but with a smaller range of values than BF16.
- Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs).
- Slightly lower numerical precision than BF16, but generally sufficient for inference.

📌 **Use F16 if:**
✔ Your hardware supports **FP16** but **not BF16**.
✔ You need a **balance between speed, memory usage, and accuracy**.
✔ You are running on a **GPU** or another device optimized for FP16 computations.

📌 **Avoid F16 if:**
❌ Your device lacks **native FP16 support** (it may run slower than expected).
❌ You have tight memory constraints.

---

### **Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference**
Quantization reduces model size and memory usage while preserving as much accuracy as possible.
- **Lower-bit models (Q4_K)** → **Best for minimal memory usage**, may have lower precision.
- **Higher-bit models (Q6_K, Q8_0)** → **Better accuracy**, but require more memory.

📌 **Use Quantized Models if:**
✔ You are running inference on a **CPU** and need an optimized model.
✔ Your device has **low VRAM** and cannot load full-precision models.
✔ You want to reduce the **memory footprint** while keeping reasonable accuracy.

📌 **Avoid Quantized Models if:**
❌ You need **maximum accuracy** (full-precision models are better for this).
❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).

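As a concrete example of the CPU/low-VRAM path, the sketch below loads one of the Q4_K files from this repo with `llama-cpp-python`. It is a minimal sketch under a few assumptions: `llama-cpp-python` is installed, the file has already been downloaded locally, and the context size and thread count are placeholders to tune for your machine.

```python
# Minimal sketch: CPU inference on a quantized GGUF with llama-cpp-python.
# Assumes `pip install llama-cpp-python` and a locally downloaded Q4_K file from this repo;
# the path, context size, and thread count below are illustrative placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="Mistral-Small-3.1-24B-Instruct-2503-q4_k.gguf",  # local path to the quantized file
    n_ctx=8192,    # context window to allocate (the model supports up to 128k if you have the RAM)
    n_threads=8,   # CPU threads to use
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what GGUF quantization trades off."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```
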
---

### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)**
These models are optimized for **extreme memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint.

- **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **extreme memory efficiency**.
  - **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large.
  - **Trade-off**: Lower accuracy compared to higher-bit quantizations.

- **IQ3_S**: Small block size for **maximum memory efficiency**.
  - **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive.

- **IQ3_M**: Medium block size for better accuracy than **IQ3_S**.
  - **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting.

- **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy.
  - **Use case**: Best for **low-memory devices** where **Q6_K** is too large.

- **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**.
  - **Use case**: Best for **ARM-based devices** or **low-memory environments**.

---

### **Summary Table: Model Format Selection**

| Model Format | Precision  | Memory Usage | Device Requirements         | Best Use Case                                |
|--------------|------------|--------------|-----------------------------|----------------------------------------------|
| **BF16**     | Highest    | High         | BF16-supported GPU/CPUs     | High-speed inference with reduced memory     |
| **F16**      | High       | High         | FP16-supported devices      | GPU inference when BF16 isn’t available      |
| **Q4_K**     | Medium-Low | Low          | CPU or low-VRAM devices     | Best for memory-constrained environments     |
| **Q6_K**     | Medium     | Moderate     | CPU with more memory        | Better accuracy while still being quantized  |
| **Q8_0**     | High       | Moderate     | CPU or GPU with enough VRAM | Best accuracy among quantized models         |
| **IQ3_XS**   | Very Low   | Very Low     | Ultra-low-memory devices    | Extreme memory efficiency and low accuracy   |
| **Q4_0**     | Low        | Low          | ARM or low-memory devices   | llama.cpp can optimize for ARM devices       |

---

## **Included Files & Details**

### `Mistral-Small-3.1-24B-Instruct-2503-bf16.gguf`
- Model weights preserved in **BF16**.
- Use this if you want to **requantize** the model into a different format (see the sketch below).
- Best if your device supports **BF16 acceleration**.

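To produce your own quantization from the BF16 file, llama.cpp ships a quantization tool. The sketch below is illustrative only: it assumes a local llama.cpp build whose quantization binary is named `llama-quantize` (older builds call it `quantize`), and the Q4_K_M target type and output file name are example choices rather than instructions from this repo.

```python
# Sketch: requantize the BF16 GGUF with llama.cpp's quantization tool.
# Assumes a local llama.cpp build; the binary name, target type, and file names
# are assumptions to adapt to your setup (older builds name the tool `quantize`).
import subprocess

subprocess.run(
    [
        "./llama-quantize",
        "Mistral-Small-3.1-24B-Instruct-2503-bf16.gguf",    # source weights in BF16
        "Mistral-Small-3.1-24B-Instruct-2503-q4_k_m.gguf",  # output file to create
        "Q4_K_M",                                           # target quantization type
    ],
    check=True,
)
```
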
### `Mistral-Small-3.1-24B-Instruct-2503-f16.gguf`
- Model weights stored in **F16**.
- Use if your device supports **FP16**, especially if BF16 is not available.

### `Mistral-Small-3.1-24B-Instruct-2503-bf16-q8_0.gguf`
- **Output & embeddings** remain in **BF16**.
- All other layers quantized to **Q8_0**.
- Use if your device supports **BF16** and you want a quantized version.

### `Mistral-Small-3.1-24B-Instruct-2503-f16-q8_0.gguf`
- **Output & embeddings** remain in **F16**.
- All other layers quantized to **Q8_0**.

### `Mistral-Small-3.1-24B-Instruct-2503-q4_k.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q4_K**.
- Good for **CPU inference** with limited memory.

### `Mistral-Small-3.1-24B-Instruct-2503-q4_k_s.gguf`
- Smallest **Q4_K** variant, using less memory at the cost of accuracy.
- Best for **very low-memory setups**.

### `Mistral-Small-3.1-24B-Instruct-2503-q6_k.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q6_K**.

### `Mistral-Small-3.1-24B-Instruct-2503-q8_0.gguf`
- Fully **Q8_0** quantized model for better accuracy.
- Requires **more memory** but offers higher precision.

### `Mistral-Small-3.1-24B-Instruct-2503-iq3_xs.gguf`
- **IQ3_XS** quantization, optimized for **extreme memory efficiency**.
- Best for **ultra-low-memory devices**.

### `Mistral-Small-3.1-24B-Instruct-2503-iq3_m.gguf`
- **IQ3_M** quantization, offering a **medium block size** for better accuracy.
- Suitable for **low-memory devices**.

### `Mistral-Small-3.1-24B-Instruct-2503-q4_0.gguf`
- Pure **Q4_0** quantization, optimized for **ARM devices**.
- Best for **low-memory environments**.
- Prefer **IQ4_NL** for better accuracy.

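If you only want one of the files listed above rather than the whole repository, `huggingface_hub` can download a single file by name. A minimal sketch follows; the `repo_id` is a placeholder to replace with the repository id shown at the top of this model page.

```python
# Sketch: download one specific GGUF file with huggingface_hub.
# `repo_id` is a placeholder; substitute the id of this model repository.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="<this-repo-id>",  # placeholder, e.g. "<user>/Mistral-Small-3.1-24B-Instruct-2503-GGUF"
    filename="Mistral-Small-3.1-24B-Instruct-2503-q4_k.gguf",
)
print("Downloaded to:", path)
```
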
# <span id="testllm" style="color: #7F7FFF;">🚀 If you find these models useful</span>

Please click like ❤. I’d also really appreciate it if you could test my Network Monitor Assistant at 👉 [Network Monitor Assistant](https://freenetworkmonitor.click/dashboard).

💬 Click the **chat icon** (bottom right of the main and dashboard pages), choose an LLM, and toggle between the LLM types TurboLLM -> FreeLLM -> TestLLM.

### What I'm Testing

I'm experimenting with **function calling** against my network monitoring service, using small open-source models to explore the question: how small can a model be and still function?

🟡 **TestLLM** – Runs the current test model using llama.cpp on 6 threads of a CPU VM. It should take about 15 seconds to load; inference is quite slow and it only processes one user prompt at a time (still working on scaling!). If you're curious, I'd be happy to share how it works.

### The Other Available AI Assistants

🟢 **TurboLLM** – Uses **gpt-4o-mini**. Fast, but tokens are limited since OpenAI models are pricey. You can [Login](https://freenetworkmonitor.click) or [Download](https://freenetworkmonitor.click/download) the Free Network Monitor agent to get more tokens, or use the FreeLLM instead.

🔵 **FreeLLM** – Runs **open-source Hugging Face models** at medium speed (unlimited, subject to Hugging Face API availability).


# Model Card for Mistral-Small-3.1-24B-Instruct-2503

Building upon Mistral Small 3 (2501), Mistral Small 3.1 (2503) **adds state-of-the-art vision understanding** and enhances **long context capabilities up to 128k tokens** without compromising text performance.
With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.
This model is an instruction-finetuned version of [Mistral-Small-3.1-24B-Base-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503).

Mistral Small 3.1 can be deployed locally and is exceptionally "knowledge-dense," fitting within a single RTX 4090 or a 32GB RAM MacBook once quantized.

It is ideal for:
- Fast-response conversational agents.
- Low-latency function calling.
- Subject matter experts via fine-tuning.
- Local inference for hobbyists and organizations handling sensitive data.
- Programming and math reasoning.
- Long document understanding.
- Visual understanding.

For enterprises requiring specialized capabilities (increased context, specific modalities, domain-specific knowledge, etc.), we will release commercial models beyond what Mistral AI contributes to the community.

Learn more about Mistral Small 3.1 in our [blog post](https://mistral.ai/news/mistral-small-3-1/).

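The metadata above names `vllm` as the serving library for the original safetensors checkpoint (not the GGUF files in this repo). A minimal offline-chat sketch is shown below; it assumes a recent vLLM release with Mistral tokenizer support, enough GPU memory for the full-precision weights, and that the upstream `mistralai/Mistral-Small-3.1-24B-Instruct-2503` repository is the model being loaded. The exact engine flags this checkpoint needs may differ.

```python
# Sketch: offline chat with the original checkpoint via vLLM (not the GGUF files in this repo).
# Assumes a recent vLLM build with Mistral tokenizer support and sufficient GPU memory;
# treat the arguments below as a starting point rather than the definitive configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    tokenizer_mode="mistral",  # use the Tekken/Mistral tokenizer
)

params = SamplingParams(temperature=0.15, max_tokens=256)
messages = [{"role": "user", "content": "Give me three use cases for a 24B multimodal model."}]

outputs = llm.chat(messages, sampling_params=params)
print(outputs[0].outputs[0].text)
```
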
## Key Features
- **Vision:** Vision capabilities enable the model to analyze images and provide insights based on visual content in addition to text.
- **Multilingual:** Supports dozens of languages, including English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, and Farsi.
- **Agent-Centric:** Offers best-in-class agentic capabilities with native function calling and JSON outputting.
- **Advanced Reasoning:** State-of-the-art conversational and reasoning capabilities.
- **Apache 2.0 License:** Open license allowing usage and modification for both commercial and non-commercial purposes.
- **Context Window:** A 128k context window.
- **System Prompt:** Maintains strong adherence and support for system prompts.
- **Tokenizer:** Utilizes a Tekken tokenizer with a 131k vocabulary size.

## Benchmark Results

When available, we report numbers previously published by other model providers; otherwise, we re-evaluate them using our own evaluation harness.

### Pretrain Evals

| Model                  | MMLU (5-shot) | MMLU Pro (5-shot CoT) | TriviaQA   | GPQA Main (5-shot CoT) | MMMU       |
|------------------------|---------------|-----------------------|------------|------------------------|------------|
| **Small 3.1 24B Base** | **81.01%**    | **56.03%**            | 80.50%     | **37.50%**             | **59.27%** |
| Gemma 3 27B PT         | 78.60%        | 52.20%                | **81.30%** | 24.30%                 | 56.10%     |

### Instruction Evals

#### Text

| Model                      | MMLU       | MMLU Pro (5-shot CoT) | MATH       | GPQA Main (5-shot CoT) | GPQA Diamond (5-shot CoT) | MBPP       | HumanEval  | SimpleQA (TotalAcc) |
|----------------------------|------------|-----------------------|------------|------------------------|---------------------------|------------|------------|---------------------|
| **Small 3.1 24B Instruct** | 80.62%     | 66.76%                | 69.30%     | **44.42%**             | **45.96%**                | 74.71%     | **88.41%** | **10.43%**          |
| Gemma 3 27B IT             | 76.90%     | **67.50%**            | **89.00%** | 36.83%                 | 42.40%                    | 74.40%     | 87.80%     | 10.00%              |
| GPT4o Mini                 | **82.00%** | 61.70%                | 70.20%     | 40.20%                 | 39.39%                    | 84.82%     | 87.20%     | 9.50%               |
| Claude 3.5 Haiku           | 77.60%     | 65.00%                | 69.20%     | 37.05%                 | 41.60%                    | **85.60%** | 88.10%     | 8.02%               |
| Cohere Aya-Vision 32B      | 72.14%     | 47.16%                | 41.98%     | 34.38%                 | 33.84%                    | 70.43%     | 62.20%     | 7.65%               |

#### Vision

| Model                      | MMMU       | MMMU PRO   | Mathvista  | ChartQA    | DocVQA     | AI2D       | MM MT Bench |
|----------------------------|------------|------------|------------|------------|------------|------------|-------------|
| **Small 3.1 24B Instruct** | 64.00%     | **49.25%** | **68.91%** | 86.24%     | **94.08%** | **93.72%** | **7.3**     |
| Gemma 3 27B IT             | **64.90%** | 48.38%     | 67.60%     | 76.00%     | 86.60%     | 84.50%     | 7           |
| GPT4o Mini                 | 59.40%     | 37.60%     | 56.70%     | 76.80%     | 86.70%     | 88.10%     | 6.6         |
| Claude 3.5 Haiku           | 60.50%     | 45.03%     | 61.60%     | **87.20%** | 90.00%     | 92.10%     | 6.5         |
| Cohere Aya-Vision 32B      | 48.20%     | 31.50%     | 50.10%     | 63.04%     | 72.40%     | 82.57%     | 4.1         |

### Multilingual Evals

| Model                      | Average    | European   | East Asian | Middle Eastern |
|----------------------------|------------|------------|------------|----------------|
| **Small 3.1 24B Instruct** | **71.18%** | **75.30%** | **69.17%** | 69.08%         |
| Gemma 3 27B IT             | 70.19%     | 74.14%     | 65.65%     | 70.76%         |
| GPT4o Mini                 | 70.36%     | 74.21%     | 65.96%     | **70.90%**     |
| Claude 3.5 Haiku           | 70.16%     | 73.45%     | 67.05%     | 70.00%         |
| Cohere Aya-Vision 32B      | 62.15%     | 64.70%     | 57.61%     | 64.12%         |

### Long Context Evals

| Model                      | LongBench v2 | RULER 32K  | RULER 128K |
|----------------------------|--------------|------------|------------|
| **Small 3.1 24B Instruct** | **37.18%**   | **93.96%** | 81.20%     |
| Gemma 3 27B IT             | 34.59%       | 91.10%     | 66.00%     |
| GPT4o Mini                 | 29.30%       | 90.20%     | 65.80%     |
| Claude 3.5 Haiku           | 35.19%       | 92.60%     | **91.90%** |