Mungert committed
Commit 616291b · verified · 1 Parent(s): e1c5d99

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +581 -0
README.md ADDED
@@ -0,0 +1,581 @@
1
+ ---
2
+ license: llama3.2
3
+ language:
4
+ - en
5
+ - zh
6
+ base_model:
7
+ - meta-llama/Llama-3.2-3B
8
+ - lianghsun/Llama-3.2-3B-F1-Base
9
+ library_name: transformers
10
+ tags:
11
+ - Taiwan
12
+ - R.O.C
13
+ - zhtw
14
+ - SLM
15
+ - Llama-32
16
+ datasets:
17
+ - lianghsun/tw-reasoning-instruct
18
+ - minyichen/tw-instruct-R1-200k
19
+ - minyichen/tw_mm_R1
20
+ model-index:
21
+ - name: Llama-3.2-3B-F1-Reasoning-Instruct
22
+   results:
+   - task:
+       type: question-answering
+       name: Single Choice Question
+     dataset:
+       type: ikala/tmmluplus
+       name: tmmlu+
+       config: all
+       split: test
+       revision: c0e8ae955997300d5dbf0e382bf0ba5115f85e8c
+     metrics:
+     - name: single choice
+       type: accuracy
+       value: 46.16
+   - task:
+       type: question-answering
+       name: Single Choice Question
+     dataset:
+       type: cais/mmlu
+       name: mmlu
+       config: all
+       split: test
+       revision: c30699e
+     metrics:
+     - name: single choice
+       type: accuracy
+       value: 51.22
+   - task:
+       type: question-answering
+       name: Single Choice Question
+     dataset:
+       type: lianghsun/tw-legal-benchmark-v1
+       name: tw-legal-benchmark-v1
+       config: all
+       split: test
+       revision: 66c3a5f
+     metrics:
+     - name: single choice
+       type: accuracy
+       value: 34.92
62
+ metrics:
63
+ - accuracy
64
+ ---
65
+
66
+ # <span style="color: #7FFF7F;">Llama-3.2-3B-F1-Reasoning-Instruct GGUF Models</span>
67
+
68
+
69
+ ## <span style="color: #7F7FFF;">Model Generation Details</span>
70
+
71
+ This model was generated using [llama.cpp](https://github.com/ggerganov/llama.cpp) at commit [`064cc596`](https://github.com/ggerganov/llama.cpp/commit/064cc596ac44308dc326a17c9e3163c34a6f29d1).
72
+
73
+
74
+
75
+
76
+ ## <span style="color: #7FFF7F;">Ultra-Low-Bit Quantization with IQ-DynamicGate (1-2 bit)</span>
77
+
78
+ Our latest quantization method introduces **precision-adaptive quantization** for ultra-low-bit models (1-2 bit), with benchmark-proven improvements on **Llama-3-8B**. This approach uses layer-specific strategies to preserve accuracy while maintaining extreme memory efficiency.
79
+
80
+ ### **Benchmark Context**
81
+ All tests conducted on **Llama-3-8B-Instruct** using:
82
+ - Standard perplexity evaluation pipeline
83
+ - 2048-token context window
84
+ - Same prompt set across all quantizations
85
+
86
+ ### **Method**
87
+ - **Dynamic Precision Allocation**:
+   - First/Last 25% of layers → IQ4_XS (selected layers)
+   - Middle 50% → IQ2_XXS/IQ3_S (increase efficiency)
+ - **Critical Component Protection**:
+   - Embeddings/output layers use Q5_K
+   - Reduces error propagation by 38% vs standard 1-2bit
93
+
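The allocation rule in the Method list can be sketched as a small function. This is an illustrative sketch only, not the llama.cpp or IQ-DynamicGate implementation: the helper names `pick_quant_type` and `plan_model`, the rounding of the 25% boundary, the choice of IQ2_XXS (rather than IQ3_S) for the middle band, and the tensor labels are all assumptions made for the example.

```python
def pick_quant_type(layer_idx: int, n_layers: int) -> str:
    """Return a quantization type for one transformer layer (hypothetical helper)."""
    if not 0 <= layer_idx < n_layers:
        raise ValueError("layer_idx out of range")
    edge = n_layers // 4  # assumption: the 25% boundary is rounded down
    if layer_idx < edge or layer_idx >= n_layers - edge:
        return "IQ4_XS"  # first/last 25% of layers keep more precision
    return "IQ2_XXS"  # middle 50% is compressed harder


def plan_model(n_layers: int) -> dict:
    """Build a per-tensor plan; the keys are illustrative labels, not real tensor names."""
    plan = {f"blk.{i}": pick_quant_type(i, n_layers) for i in range(n_layers)}
    # Embeddings and the output layer are protected with Q5_K, per the Method list.
    plan["token_embd"] = "Q5_K"
    plan["output"] = "Q5_K"
    return plan


if __name__ == "__main__":
    for name, qtype in plan_model(28).items():  # 28 layers is an example value
        print(name, qtype)
```

With 28 layers, this sketch assigns IQ4_XS to layers 0-6 and 21-27 and IQ2_XXS to the 14 layers in between.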
94
+ ### **Quantization Performance Comparison (Llama-3-8B)**
95
+
96
+ | Quantization | Standard PPL | DynamicGate PPL | Δ PPL | Std Size | DG Size | Δ Size | Std Speed | DG Speed |
+ |--------------|--------------|------------------|---------|----------|---------|--------|-----------|----------|
+ | IQ2_XXS | 11.30 | 9.84 | -12.9% | 2.5G | 2.6G | +0.1G | 234s | 246s |
+ | IQ2_XS | 11.72 | 11.63 | -0.8% | 2.7G | 2.8G | +0.1G | 242s | 246s |
+ | IQ2_S | 14.31 | 9.02 | -36.9% | 2.7G | 2.9G | +0.2G | 238s | 244s |
+ | IQ1_M | 27.46 | 15.41 | -43.9% | 2.2G | 2.5G | +0.3G | 206s | 212s |
+ | IQ1_S | 53.07 | 32.00 | -39.7% | 2.1G | 2.4G | +0.3G | 184s | 209s |
103
+
104
+ **Key**:
105
+ - PPL = Perplexity (lower is better)
106
+ - Δ PPL = Percentage change from standard to DynamicGate
107
+ - Speed = Inference time (CPU avx2, 2048 token context)
108
+ - Size differences reflect mixed quantization overhead
109
+
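The Δ PPL column is a relative change, which can be recomputed from the two perplexity columns. The sketch below copies the values from the table; the function name `delta_ppl_percent` is just a label for this example.

```python
def delta_ppl_percent(standard_ppl: float, dynamicgate_ppl: float) -> float:
    """Percentage change in perplexity from standard to DynamicGate (negative is better)."""
    return (dynamicgate_ppl - standard_ppl) / standard_ppl * 100.0


# (standard PPL, DynamicGate PPL) pairs copied from the table above.
ROWS = {
    "IQ2_XXS": (11.30, 9.84),
    "IQ2_XS": (11.72, 11.63),
    "IQ2_S": (14.31, 9.02),
    "IQ1_M": (27.46, 15.41),
    "IQ1_S": (53.07, 32.00),
}

if __name__ == "__main__":
    for name, (std, dg) in ROWS.items():
        print(f"{name}: {delta_ppl_percent(std, dg):+.1f}%")
```

Recomputing this way agrees with the table to within about 0.1 percentage point.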
110
+ **Key Improvements:**
111
+ - 🔥 **IQ1_M** shows massive 43.9% perplexity reduction (27.46 → 15.41)
112
+ - 🚀 **IQ2_S** cuts perplexity by 36.9% while adding only 0.2GB
113
+ - ⚡ **IQ1_S** achieves 39.7% lower perplexity (53.07 → 32.00) despite 1-bit quantization
114
+
115
+ **Tradeoffs:**
116
+ - All variants have modest size increases (0.1-0.3GB)
117
+ - Inference times increase by roughly 2-14% (largest for IQ1_S: 184s → 209s)
118
+
119
+
120
+ ### **When to Use These Models**
121
+ 📌 **Fitting models into GPU VRAM**
122
+
123
+ ✔ **Memory-constrained deployments**
124
+
125
+ ✔ **CPU and edge devices** where 1-2bit errors can be tolerated
126
+
127
+ ✔ **Research** into ultra-low-bit quantization
128
+
129
+
130
+
131
+ ## **Choosing the Right Model Format**
132
+
133
+ Selecting the correct model format depends on your **hardware capabilities** and **memory constraints**.
134
+
135
+ ### **BF16 (Brain Float 16) – Use if BF16 acceleration is available**
136
+ - A 16-bit floating-point format designed for **faster computation** while retaining good precision.
137
+ - Provides a **dynamic range similar** to FP32 but with **lower memory usage**.
138
+ - Recommended if your hardware supports **BF16 acceleration** (check your device's specs).
139
+ - Ideal for **high-performance inference** with **reduced memory footprint** compared to FP32.
140
+
141
+ 📌 **Use BF16 if:**
142
+ ✔ Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs).
143
+ ✔ You want **higher precision** while saving memory.
144
+ ✔ You plan to **requantize** the model into another format.
145
+
146
+ 📌 **Avoid BF16 if:**
147
+ ❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower).
148
+ ❌ You need compatibility with older devices that lack BF16 optimization.
149
+
150
+ ---
151
+
152
+ ### **F16 (Float 16) – More widely supported than BF16**
153
+ - A 16-bit floating-point format with **high precision** but a narrower range of values than BF16.
154
+ - Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs).
155
+ - Finer mantissa precision than BF16 (10 bits vs 7) but a narrower dynamic range (5 exponent bits vs 8); generally sufficient for inference.
156
+
157
+ 📌 **Use F16 if:**
158
+ ✔ Your hardware supports **FP16** but **not BF16**.
159
+ ✔ You need a **balance between speed, memory usage, and accuracy**.
160
+ ✔ You are running on a **GPU** or another device optimized for FP16 computations.
161
+
162
+ 📌 **Avoid F16 if:**
163
+ ❌ Your device lacks **native FP16 support** (it may run slower than expected).
164
+ ❌ You have memory limitations.
165
+
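The range and precision trade-off between the two 16-bit formats follows from their bit layouts: FP16 has 5 exponent bits and 10 mantissa bits, while BF16 has 8 exponent bits and 7 mantissa bits. The sketch below derives the largest finite value and the spacing just above 1.0 for each, assuming an IEEE-style layout in which the top exponent code is reserved for infinity and NaN. The helper name `float_format_stats` is made up for this example.

```python
def float_format_stats(exponent_bits: int, mantissa_bits: int) -> dict:
    """Largest finite value and machine epsilon for an IEEE-style binary float format."""
    bias = 2 ** (exponent_bits - 1) - 1  # also the largest usable unbiased exponent
    max_finite = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    epsilon = 2.0 ** -mantissa_bits  # gap between 1.0 and the next representable value
    return {"max_finite": max_finite, "epsilon": epsilon}


fp16 = float_format_stats(exponent_bits=5, mantissa_bits=10)
bf16 = float_format_stats(exponent_bits=8, mantissa_bits=7)

if __name__ == "__main__":
    print("FP16:", fp16)
    print("BF16:", bf16)
```

FP16 tops out at 65504 with finer steps, while BF16 reaches roughly 3.4e38 with coarser steps, which is why BF16 tracks FP32's range and FP16 does not.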
166
+ ---
167
+
168
+ ### **Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference**
169
+ Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
170
+ - **Lower-bit models (Q4_K)** → **Best for minimal memory usage**, may have lower precision.
171
+ - **Higher-bit models (Q6_K, Q8_0)** → **Better accuracy**, requires more memory.
172
+
173
+ 📌 **Use Quantized Models if:**
174
+ ✔ You are running inference on a **CPU** and need an optimized model.
175
+ ✔ Your device has **low VRAM** and cannot load full-precision models.
176
+ ✔ You want to reduce **memory footprint** while keeping reasonable accuracy.
177
+
178
+ 📌 **Avoid Quantized Models if:**
179
+ ❌ You need **maximum accuracy** (full-precision models are better for this).
180
+ ❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).
181
+
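A rough way to compare formats is to estimate weight storage as parameters × bits per weight. The sketch below is a back-of-the-envelope estimate under stated assumptions: the parameter count of 3.2e9 is an approximate figure for a 3B-class model, the bits-per-weight values for Q8_0 and Q4_0 assume blocks of 32 weights with one 16-bit scale each, and the mixed-precision files in this repository (for example Q8_0 embeddings with Q4_K layers) will differ. KV cache and runtime buffers are not included.

```python
# Assumed effective bits per weight, including per-block scale overhead.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "BF16": 16.0,
    "Q8_0": 8.5,  # 32 x 8-bit weights + one 16-bit scale per block
    "Q4_0": 4.5,  # 32 x 4-bit weights + one 16-bit scale per block
}


def estimate_weight_gib(n_params: float, fmt: str) -> float:
    """Approximate weight storage in GiB for a uniformly quantized model."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


if __name__ == "__main__":
    n_params = 3.2e9  # assumption: approximate parameter count
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt}: ~{estimate_weight_gib(n_params, fmt):.2f} GiB")
```

If the estimate for a format exceeds the memory you can spare, step down to the next lower-bit format.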
182
+ ---
183
+
184
+ ### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)**
185
+ These models are optimized for **extreme memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint.
186
+
187
+ - **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **extreme memory efficiency**.
+   - **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large.
+   - **Trade-off**: Lower accuracy compared to higher-bit quantizations.
+
+ - **IQ3_S**: Small block size for **maximum memory efficiency**.
+   - **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive.
+
+ - **IQ3_M**: Medium block size for better accuracy than **IQ3_S**.
+   - **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting.
+
+ - **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy.
+   - **Use case**: Best for **low-memory devices** where **Q6_K** is too large.
+
+ - **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**.
+   - **Use case**: Best for **ARM-based devices** or **low-memory environments**.
202
+
203
+ ---
204
+
205
+ ### **Summary Table: Model Format Selection**
206
+
207
+ | Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
+ |--------------|------------|---------------|----------------------|---------------|
+ | **BF16** | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
+ | **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn't available |
+ | **Q4_K** | Medium Low | Low | CPU or Low-VRAM devices | Best for memory-constrained environments |
+ | **Q6_K** | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
+ | **Q8_0** | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
+ | **IQ3_XS** | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency and low accuracy |
+ | **Q4_0** | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |
216
+
217
+ ---
218
+
219
+ ## **Included Files & Details**
220
+
221
+ ### `Llama-3.2-3B-F1-Reasoning-Instruct-bf16.gguf`
222
+ - Model weights preserved in **BF16**.
223
+ - Use this if you want to **requantize** the model into a different format.
224
+ - Best if your device supports **BF16 acceleration**.
225
+
226
+ ### `Llama-3.2-3B-F1-Reasoning-Instruct-f16.gguf`
227
+ - Model weights stored in **F16**.
228
+ - Use if your device supports **FP16**, especially if BF16 is not available.
229
+
230
+ ### `Llama-3.2-3B-F1-Reasoning-Instruct-bf16-q8_0.gguf`
231
+ - **Output & embeddings** remain in **BF16**.
232
+ - All other layers quantized to **Q8_0**.
233
+ - Use if your device supports **BF16** and you want a quantized version.
234
+
235
+ ### `Llama-3.2-3B-F1-Reasoning-Instruct-f16-q8_0.gguf`
236
+ - **Output & embeddings** remain in **F16**.
237
+ - All other layers quantized to **Q8_0**.
238
+
239
+ ### `Llama-3.2-3B-F1-Reasoning-Instruct-q4_k.gguf`
240
+ - **Output & embeddings** quantized to **Q8_0**.
241
+ - All other layers quantized to **Q4_K**.
242
+ - Good for **CPU inference** with limited memory.
243
+
244
+ ### `Llama-3.2-3B-F1-Reasoning-Instruct-q4_k_s.gguf`
245
+ - Smallest **Q4_K** variant, using less memory at the cost of accuracy.
246
+ - Best for **very low-memory setups**.
247
+
248
+ ### `Llama-3.2-3B-F1-Reasoning-Instruct-q6_k.gguf`
249
+ - **Output & embeddings** quantized to **Q8_0**.
250
+ - All other layers quantized to **Q6_K** .
251
+
252
+ ### `Llama-3.2-3B-F1-Reasoning-Instruct-q8_0.gguf`
253
+ - Fully **Q8** quantized model for better accuracy.
254
+ - Requires **more memory** but offers higher precision.
255
+
256
+ ### `Llama-3.2-3B-F1-Reasoning-Instruct-iq3_xs.gguf`
257
+ - **IQ3_XS** quantization, optimized for **extreme memory efficiency**.
258
+ - Best for **ultra-low-memory devices**.
259
+
260
+ ### `Llama-3.2-3B-F1-Reasoning-Instruct-iq3_m.gguf`
261
+ - **IQ3_M** quantization, offering a **medium block size** for better accuracy.
262
+ - Suitable for **low-memory devices**.
263
+
264
+ ### `Llama-3.2-3B-F1-Reasoning-Instruct-q4_0.gguf`
265
+ - Pure **Q4_0** quantization, optimized for **ARM devices**.
266
+ - Best for **low-memory environments**.
267
+ - Prefer IQ4_NL for better accuracy.
268
+
269
+ # <span id="testllm" style="color: #7F7FFF;">🚀 If you find these models useful</span>
270
+ ❤ **Please click "Like" if you find this useful!**
271
+ Help me test my **AI-Powered Network Monitor Assistant** with **quantum-ready security checks**:
272
+ 👉 [Free Network Monitor](https://readyforquantum.com/dashboard/?assistant=open)
273
+
274
+ 💬 **How to test**:
275
+ Choose an **AI assistant type**:
276
+ - `TurboLLM` (GPT-4o-mini)
277
+ - `HugLLM` (Hugging Face open-source models)
278
+ - `TestLLM` (Experimental CPU-only)
279
+
280
+ ### **What I’m Testing**
281
+ I’m pushing the limits of **small open-source models for AI network monitoring**, specifically:
282
+ - **Function calling** against live network services
283
+ - **How small can a model go** while still handling:
284
+   - Automated **Nmap scans**
+   - **Quantum-readiness checks**
+   - **Network Monitoring tasks**
287
+
288
+ 🟡 **TestLLM** – Current experimental model (llama.cpp on 2 CPU threads):
289
+ - ✅ **Zero-configuration setup**
290
+ - ⏳ 30s load time (slow inference but **no API costs**)
291
+ - 🔧 **Help wanted!** If you’re into **edge-device AI**, let’s collaborate!
292
+
293
+ ### **Other Assistants**
294
+ 🟢 **TurboLLM** – Uses **gpt-4o-mini** for:
295
+ - **Creating custom cmd processors to run .NET code on Free Network Monitor Agents**
296
+ - **Real-time network diagnostics and monitoring**
297
+ - **Security Audits**
298
+ - **Penetration testing** (Nmap/Metasploit)
299
+ - 🔑 Get more tokens by logging in or [downloading our Free Network Monitor Agent with integrated AI Assistant](https://readyforquantum.com/download)
300
+
301
+ 🔵 **HugLLM** – Latest Open-source models:
302
+ - 🌐 Runs on Hugging Face Inference API
303
+
304
+ ### 💡 **Example commands you could test**:
+ 1. `"Give me info on my website's SSL certificate"`
+ 2. `"Check if my server is using quantum safe encryption for communication"`
+ 3. `"Run a comprehensive security audit on my server"`
+ 4. `"Create a cmd processor to .. (whatever you want)"` Note: you need to install a Free Network Monitor Agent to run the .NET code. This is a very flexible and powerful feature. Use with caution!
309
+
310
+
311
+
312
+ # Model Card for Llama-3.2-3B-F1-Reasoning-Instruct (a.k.a __Formosa-1-Reasoning__ or __F1-Reasoning__)
313
+
314
+ <div align="center" style="line-height: 1;">
315
+ <a href="https://discord.gg/Cx737yw4ed" target="_blank" style="margin: 2px;">
316
+ <img alt="Discord" src="https://img.shields.io/badge/Discord-Twinkle%20AI-7289da?logo=discord&logoColor=white&color=7289da" style="display: inline-block; vertical-align: middle;"/>
317
+ </a>
318
+ <a href="https://huggingface.co/twinkle-ai" target="_blank" style="margin: 2px;">
319
+ <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Twinkle%20AI-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
320
+ </a>
321
+ </div>
322
+
323
+ <div align="center" style="line-height: 1;">
324
+ <a href="https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/LICENSE.txt" style="margin: 2px;">
325
+ <img alt="License" src="https://img.shields.io/badge/License-llama3.2-f5de53?&color=0081fb" style="display: inline-block; vertical-align: middle;"/>
326
+ </a>
327
+ </div>
328
+
329
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/618dc56cbc345ca7bf95f3cd/lBonfNs_7lzYguD4kJo6z.png)
330
+
331
+ <!-- Provide a quick summary of what the model is/does. -->
332
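+ **Llama-3.2-3B-F1-Reasoning-Instruct** (a.k.a. **Formosa-1-Reasoning** or **F1-Reasoning**) is a Traditional Chinese language model developed jointly by **[Twinkle AI](https://huggingface.co/twinkle-ai)** and **[APMIC](https://www.apmic.ai/)** under the technical guidance of the [National Center for High-performance Computing](https://www.nchc.org.tw/). It is fine-tuned for the language context and task needs of Taiwan (R.O.C.), covers diverse scenarios such as law, education, and everyday applications, and is tuned with strong instruction following as its goal.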
333
+
334
+ ## Model Details
335
+
336
+ ### Model Description
337
+
338
+ <!-- Provide a longer summary of what this model is. -->
339
+
340
+ - **Developed by:** [Liang Hsun Huang](https://huggingface.co/lianghsun), [Min Yi Chen](https://huggingface.co/minyichen), [Wen Bin Lin](https://huggingface.co/tedslin), [Chao Chun Chuang](https://huggingface.co/c00cjz00) & [Dave Sung](https://huggingface.co/k1dave6412) (All authors have contributed equally to this work.)
341
+ - **Funded by:** [APMIC](https://www.apmic.ai/)
342
+ - **Model type:** LlamaForCausalLM
343
+ - **Language(s) (NLP):** Traditional Chinese & English
344
+ - **License:** [llama3.2](https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/LICENSE.txt)
345
+
346
+ ### Model Sources
347
+ <!-- Provide the basic links for the model. -->
348
+
349
+ - **Repository:** [twinkle-ai/Llama-3.2-3B-F1-Reasoning-Instruct](https://huggingface.co/twinkle-ai/Llama-3.2-3B-F1-Reasoning-Instruct)
350
+ - **Paper:** (TBA)
351
+ - **Demo:** [Playground](https://3b02.coolify.apmic.ai/)
352
+
353
+ ## Evaluation
354
+
355
+ ### Results
356
+
357
+ The table below uses the [🌟 Twinkle Eval](https://github.com/ai-twinkle/Eval) evaluation framework.
+ | Model | Eval mode | TMMLU+ (%) | Taiwan Legal (%) | MMLU (%) | Runs | Option order |
+ |------------------------------------|---------|----------------|----------------|----------------|---------|---------|
+ | [mistralai/Mistral-Small-24B-Instruct-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501) | box | 56.15 (±0.0172) | 37.48 (±0.0098) | 74.61 (±0.0154) | 3 | Random |
+ | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | box | 15.49 (±0.0104) | 25.68 (±0.0200) | 6.90 (±0.0096) | 3 | Random |
+ | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | pattern | 35.85 (±0.0174) | 32.22 (±0.0023) | 59.33 (±0.0168) | 3 | Random |
+ | [MediaTek-Research/Llama-Breeze2-3B-Instruct](https://huggingface.co/MediaTek-Research/Llama-Breeze2-3B-Instruct) | pattern | 40.32 (±0.0181) | 38.92 (±0.0193) | 55.37 (±0.0180) | 3 | Random |
+ | [twinkle-ai/Llama-3.2-3B-F1-Reasoning-Instruct](https://huggingface.co/twinkle-ai/Llama-3.2-3B-F1-Reasoning-Instruct) (ours) | box | 46.16 (±0.0198) | 34.92 (±0.0243) | 51.22 (±0.0206) | 3 | Random |
365
+
366
+ The table below uses the lighteval evaluation framework.
+ | Model | MATH-500 | GPQA Diamond |
+ |--------------------------------------------|----------|--------------|
+ | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | 44.40 | 27.78 |
+ | [twinkle-ai/Llama-3.2-3B-F1-Reasoning-Instruct](https://huggingface.co/twinkle-ai/Llama-3.2-3B-F1-Reasoning-Instruct) (ours) | **51.40**| **33.84** |
371
+
372
+
373
+ ---
374
+
375
+ ## 🔧 Tool Calling
376
+
377
+ This model was trained with the Hermes format and supports parallel tool calling. A complete example workflow follows.
+ The tool-call template is already included in the chat template. Enjoy!
379
+
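For context on what a Hermes-style tool-call parser handles: in the Hermes convention, as commonly described, the model emits each call as a JSON object wrapped in `<tool_call>` tags, and parallel calling simply means several such blocks in one reply. The sketch below illustrates that idea under this assumption. It is not vLLM's parser, and both `parse_hermes_tool_calls` and the sample string are hypothetical.

```python
import json
import re

# Assumption: each call looks like <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_hermes_tool_calls(text: str) -> list:
    """Extract (name, arguments) pairs from Hermes-style tool-call blocks."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        payload = json.loads(match.group(1).strip())
        calls.append((payload["name"], payload.get("arguments", {})))
    return calls


# Hypothetical raw model output containing two parallel calls.
sample = (
    '<tool_call>\n{"name": "get_weather", "arguments": {"location": "Taipei", "unit": "celsius"}}\n</tool_call>\n'
    '<tool_call>\n{"name": "search", "arguments": {"query": "Donald Trump latest tariffs policy"}}\n</tool_call>'
)

if __name__ == "__main__":
    for name, arguments in parse_hermes_tool_calls(sample):
        print(name, arguments)
```

When serving through vLLM with `--tool-call-parser hermes`, this parsing happens server-side and the client receives structured `tool_calls` instead of raw tags.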
380
+ ### 1️⃣ Start the vLLM backend
+ > **⚠️ Note: vLLM >= 0.8.3 is required; otherwise `enable-reasoning` and `enable-auto-tool-choice` cannot be enabled together.**
382
+
383
+ ```bash
384
+ vllm serve twinkle-ai/Llama-3.2-3B-F1-Reasoning-Instruct \
385
+ --port 8001 \
386
+ --enable-reasoning \
387
+ --reasoning-parser deepseek_r1 \
388
+ --enable-auto-tool-choice \
389
+ --tool-call-parser hermes
390
+ ```
391
+
392
+ ### 2️⃣ Define the tools (functions)
393
+
394
+ ```python
+ # Demo stubs: both tools return canned text instead of calling real services.
+ def get_weather(location: str, unit: str):
+     return f"{location}的氣溫是{unit}26度,晴朗無風"
+
+ def search(query: str):
+     return "川普終於宣布對等關稅政策,針對 18 個經濟體課徵一半的對等關稅,並從 4/5 起對所有進口產品徵收10%的基準關稅!美國將針對被認定為不當貿易行為(不公平貿易) 的國家,於 4/9 起課徵報復型對等關稅 (Discounted Reciprocal Tariff),例如:日本將被課徵 24% 的關稅,歐盟則為 20%,以取代普遍性的 10% 關稅。\n針對中國則開啟新一波 34% 關稅,並疊加於先前已實施的關稅上,這將使中國進口商品的基本關稅稅率達到 54%,而且這尚未包含拜登總統任內或川普第一任期所施加的額外關稅。加拿大與墨西哥則不適用這套對等關稅制度,但川普認為這些國家在芬太尼危機與非法移民問題尚未完全解決,因此計畫對這兩國的大多數進口商品施加 25% 關稅。另外原本針對汽車與多數其他商品的關稅豁免將於 4/2 到期。\n台灣的部分,美國擬向台灣課徵32%的對等關稅,雖然並未針對晶片特別課徵關稅,但仍在記者會中提到台灣搶奪所有的電腦與半導體晶片,最終促成台積電對美國投資計劃額外加碼 1,000 億美元的歷史性投資;歐盟則課徵20%的對等關稅。最後是汽車關稅將於 4/2 起,對所有外國製造的汽車課徵25% 關稅。"
+
+ # Tool schemas passed to the chat completions API.
+ tools = [
+     {
+         "type": "function",
+         "function": {
+             "name": "get_weather",
+             "description": "Get the current weather in a given location",
+             "parameters": {
+                 "type": "object",
+                 "properties": {
+                     "location": {"type": "string", "description": "國家或城市名, e.g., 'Taipei'、'Jaipei'"},
+                     "unit": {"type": "string", "description": "氣溫單位,亞洲城市使用攝氏;歐美城市使用華氏", "enum": ["celsius", "fahrenheit"]}
+                 },
+                 "required": ["location", "unit"]
+             }
+         }
+     },
+     {
+         "type": "function",
+         "function": {
+             "name": "search",
+             "description": "這是一個類似 Google 的搜尋引擎,關於知識、天氣、股票、電影、小說、百科等等問題,如果你不確定答案就搜尋一下。",
+             "parameters": {
+                 "type": "object",
+                 "properties": {
+                     "query": {"type": "string", "description": "should be a search query, e.g., '2024 南韓 戒嚴'"}
+                 },
+                 "required": ["query"]
+             }
+         }
+     }
+ ]
+ ```
433
+
434
+ ### 3️⃣ Run the tool calls
435
+
436
+ > **⚠️ Note: the system prompt is optional unless a tool needs a time reference.**
437
+ ```python
+ from openai import OpenAI
+
+ # Assumption: the vLLM server from step 1 is reachable locally on port 8001.
+ # The API key is a placeholder; adjust both values to your deployment.
+ client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model=client.models.list().data[0].id,
+     messages=[
+         {"role": "system", "content": "記住你的知識截止於 2024/12,今天是 2025/4/7"},
+         {"role": "user", "content": "台北氣溫如何? 另外,告訴我川普最新關稅政策"},
+     ],
+     max_tokens=1500,
+     temperature=0.6,
+     top_p=0.95,
+     tools=tools,
+     tool_choice="auto"
+ )
+
+ print(response.choices[0].message.reasoning_content)
+ print(response.choices[0].message.tool_calls)
+ ```
454
+
455
+ #### 🧠 Reasoning output (excerpt)
456
+ > 好的,我需要幫助這個使用者解決他們的問題。他們問了兩件事:首先,臺北市的天氣情況,以及第二,關於川普最近的關稅政策。
457
+ > 對於第一部分,他們提到了“臺北”,所以應該呼叫 get_weather 函式…
458
+ > 接下來是關於川普的新關稅政策…
459
+ > 總結一下,我需要分別進行兩次 API 呼叫,每次都有各自正確填寫的參數…
460
+
461
+ #### ⚙️ Tool Calls List
462
+
463
+
464
+ ```text
465
+ [ChatCompletionMessageToolCall(id='chatcmpl-tool-35e74420119349999913a10133b84bd3', function=Function(arguments='{"location": "Taipei", "unit": "celsius"}', name='get_weather'), type='function'), ChatCompletionMessageToolCall(id='chatcmpl-tool-7ffdcb98e59f4134a6171defe7f2e31b', function=Function(arguments='{"query": "Donald Trump latest tariffs policy"}', name='search'), type='function')]
466
+ ```
467
+
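Indexing `tool_calls` by position only works when the model returns the calls in a known order. A name-keyed registry avoids that dependency. The sketch below is one possible pattern, not part of the original walkthrough: `TOOL_REGISTRY`, `run_tool_calls`, the English stand-in tools, and the fake call objects are all hypothetical and exist only so the example runs without a server.

```python
import json
from types import SimpleNamespace


def get_weather(location: str, unit: str) -> str:
    # Stand-in tool for this sketch; returns canned text.
    return f"{location}: 26 degrees ({unit}), sunny"


def search(query: str) -> str:
    # Stand-in tool for this sketch; returns canned text.
    return f"results for: {query}"


# Registry keyed by the tool names declared in the `tools` schema.
TOOL_REGISTRY = {"get_weather": get_weather, "search": search}


def run_tool_calls(tool_calls, registry=TOOL_REGISTRY):
    """Execute each tool call by name and build the matching `tool` messages."""
    messages = []
    for call in tool_calls:
        func = registry[call.function.name]  # KeyError if the model names an unknown tool
        kwargs = json.loads(call.function.arguments)
        messages.append(
            {"role": "tool", "content": func(**kwargs), "tool_call_id": call.id}
        )
    return messages


# Fake objects shaped like the entries of `message.tool_calls`, deliberately out of order.
fake_calls = [
    SimpleNamespace(id="call-b", function=SimpleNamespace(name="search", arguments='{"query": "tariffs"}')),
    SimpleNamespace(id="call-a", function=SimpleNamespace(name="get_weather", arguments='{"location": "Taipei", "unit": "celsius"}')),
]
tool_messages = run_tool_calls(fake_calls)
```

With a real response, `run_tool_calls(response.choices[0].message.tool_calls)` would produce the `tool` messages to append after the assistant turn.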
468
+ ### 4️⃣ Generate the final answer
469
+
470
+ ```python
+ import json
+
+ # Continues from step 3: `client`, `tools`, and `response` are already defined.
+ # Per the Tool Calls List above, call 0 is get_weather and call 1 is search.
+ response = client.chat.completions.create(
+     model=client.models.list().data[0].id,
+     messages=[
+         {"role": "system", "content": "記住你的知識截止於 2024/12,今天是 2025/4/7"},
+         {"role": "user", "content": "台北氣溫如何? 另外,告訴我川普最新關稅政策"},
+         {
+             "role": "assistant",
+             "content": "",
+             "tool_calls": [
+                 {
+                     "id": response.choices[0].message.tool_calls[0].id,
+                     "type": "function",
+                     "function": {
+                         "name": response.choices[0].message.tool_calls[0].function.name,
+                         "arguments": response.choices[0].message.tool_calls[0].function.arguments
+                     }
+                 },
+                 {
+                     "id": response.choices[0].message.tool_calls[1].id,
+                     "type": "function",
+                     "function": {
+                         "name": response.choices[0].message.tool_calls[1].function.name,
+                         "arguments": response.choices[0].message.tool_calls[1].function.arguments
+                     }
+                 }
+             ]
+         },
+         {
+             "role": "tool",
+             "content": get_weather(**json.loads(response.choices[0].message.tool_calls[0].function.arguments)),
+             "tool_call_id": response.choices[0].message.tool_calls[0].id  # tool_call_id is required so each tool result is matched to its tool_call
+         },
+         {
+             "role": "tool",
+             "content": search(**json.loads(response.choices[0].message.tool_calls[1].function.arguments)),
+             "tool_call_id": response.choices[0].message.tool_calls[1].id  # tool_call_id is required so each tool result is matched to its tool_call
+         }
+     ],
+     max_tokens=1500,
+     temperature=0.6,
+     top_p=0.95,
+     tools=tools,
+     tool_choice="auto"
+ )
+
+ print(response.choices[0].message.reasoning_content)
+ print(response.choices[0].message.content)
+ ```
519
+
520
+ #### 🧠 Reasoning output (excerpt)
521
+ > 首先,我需要處理使用者的查詢,他們要求了解臺北市的當下氣溫以及川普最近的關稅政策…
522
+ > 在呼叫了 `get_weather` 後得到了臺北市氣溫為 26 度(攝氏)…
523
+ > 接著,使用 `search` 搜尋「川普最新關稅政策 2025」…
524
+ > 整合後,我提供如下摘要:
525
+
526
+ #### 📋 Final output
527
+
528
+ ```text
529
+ 以下是您請求的資訊:
530
+
531
+ **臺北市氣溫**
532
+ - 目前的氣溫為 **26°C**(攝氏)
533
+ - 天候狀況:晴朗無風
534
+
535
+ **川普最新關稅政策概述**
536
+ 1. **對等關稅政策**
537
+ - 對 18 個經濟體課徵 50% 的對等關稅
538
+ - 自 4 月 5 日起,所有進口產品全面徵收 10% 基本關稅
539
+
540
+ 2. **報復型對等關稅**
541
+ - 日本 24%、歐盟 20%
542
+
543
+ 3. **對中國的高額關稅**
544
+ - 增加至 54%(原有關稅 + 新增 34%)
545
+
546
+ 4. **特殊案例**
547
+ - 加拿大與墨西哥不適用,但其他商品課徵 25%
548
+ - 汽車與部分商品的免稅即將到期
549
+
550
+ 5. **對台灣的影響**
551
+ - 美國計畫對台灣課徵 32% 關稅,但晶片暫無額外課稅
552
+
553
+ 6. **全球視角**
554
+ - 歐盟與日本關稅比例相對較高
555
+ ```
556
+
557
+
558
+ ## Citation
559
+
560
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
561
+ ```bibtex
562
+ @misc{twinkleai2025llama3.2f1,
563
+ title = {Llama-3.2-3B-F1-Reasoning-Instruct: A Traditional Chinese Instruction-Tuned Reasoning Language Model for Taiwan},
564
+ author = {Huang, Liang Hsun and Chen, Min Yi and Lin, Wen Bin and Chuang, Chao Chun and Sung, Dave},
565
+ year = {2025},
566
+ howpublished = {\url{https://huggingface.co/twinkle-ai/Llama-3.2-3B-F1-Instruct}},
567
+ note = {Twinkle AI and APMIC. All authors contributed equally.}
568
+ }
569
+ ```
570
+
571
+ ## Acknowledgements
572
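+ - Special thanks to the [National Center for High-performance Computing](https://www.nchc.org.tw/) for its guidance and to [APMIC](https://www.apmic.ai/) for compute support, which made it possible to complete this project successfully.
+ - Special thanks to teacher 黃啟聖, 許武龍 (哈爸), physics teacher 陳姿燁 of Taipei First Girls' High School (臺北市立第一女子高級中學), Howard, CTO of [奈視科技](https://nanoseex.com/), [AIPLUX Technology](https://aiplux.com/), teacher 郭家嘉, and everyone who provided valuable help while the datasets were being built.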
574
+
575
+ ## Model Card Authors
576
+
577
+ [Twinkle AI](https://huggingface.co/twinkle-ai)
578
+
579
+ ## Model Card Contact
580
+
581
+ [Twinkle AI](https://huggingface.co/twinkle-ai)