---
license: llama3.2
language:
- en
- zh
base_model:
- meta-llama/Llama-3.2-3B
- lianghsun/Llama-3.2-3B-F1-Base
library_name: transformers
tags:
- Taiwan
- R.O.C
- zhtw
- SLM
- Llama-32
datasets:
- lianghsun/tw-reasoning-instruct
- minyichen/tw-instruct-R1-200k
- minyichen/tw_mm_R1
model-index:
- name: Llama-3.2-3B-F1-Reasoning-Instruct
results:
- task:
type: question-answering
name: Single Choice Question
dataset:
type: ikala/tmmluplus
name: tmmlu+
config: all
split: test
revision: c0e8ae955997300d5dbf0e382bf0ba5115f85e8c
metrics:
- name: single choice
type: accuracy
value: 46.16
- task:
type: question-answering
name: Single Choice Question
dataset:
type: cais/mmlu
name: mmlu
config: all
split: test
revision: c30699e
metrics:
- name: single choice
type: accuracy
value: 51.22
- task:
type: question-answering
name: Single Choice Question
dataset:
type: lianghsun/tw-legal-benchmark-v1
name: tw-legal-benchmark-v1
config: all
split: test
revision: 66c3a5f
metrics:
- name: single choice
type: accuracy
value: 34.92
metrics:
- accuracy
---
# <span style="color: #7FFF7F;">Llama-3.2-3B-F1-Reasoning-Instruct GGUF Models</span>
## <span style="color: #7F7FFF;">Model Generation Details</span>
This model was generated using [llama.cpp](https://github.com/ggerganov/llama.cpp) at commit [`064cc596`](https://github.com/ggerganov/llama.cpp/commit/064cc596ac44308dc326a17c9e3163c34a6f29d1).
## <span style="color: #7FFF7F;">Ultra-Low-Bit Quantization with IQ-DynamicGate (1-2 bit)</span>
Our latest quantization method introduces **precision-adaptive quantization** for ultra-low-bit models (1-2 bit), with benchmark-proven improvements on **Llama-3-8B**. This approach uses layer-specific strategies to preserve accuracy while maintaining extreme memory efficiency.
### **Benchmark Context**
All tests conducted on **Llama-3-8B-Instruct** using:
- Standard perplexity evaluation pipeline
- 2048-token context window
- Same prompt set across all quantizations
### **Method**
- **Dynamic Precision Allocation**:
  - First/last 25% of layers → IQ4_XS (selected layers)
  - Middle 50% → IQ2_XXS/IQ3_S (increases efficiency)
- **Critical Component Protection**:
  - Embeddings/output layers use Q5_K
  - Reduces error propagation by 38% vs. standard 1-2 bit quantization
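The allocation rule above can be sketched in a few lines of Python. This is purely illustrative (the `assign_quant_type` helper and its thresholds are assumptions for exposition, not the actual IQ-DynamicGate code):
```python
def assign_quant_type(layer_idx: int, n_layers: int) -> str:
    """Illustrative layer-to-quant-type mapping following the rule above."""
    # First/last 25% of layers keep a higher-precision type (IQ4_XS).
    if layer_idx < n_layers * 0.25 or layer_idx >= n_layers * 0.75:
        return "IQ4_XS"
    # Middle 50% of layers use the most aggressive ultra-low-bit types.
    return "IQ2_XXS"

# Example: quantization plan for a 32-layer model such as Llama-3-8B.
print({i: assign_quant_type(i, 32) for i in range(32)})
```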
### **Quantization Performance Comparison (Llama-3-8B)**
| Quantization | Standard PPL | DynamicGate PPL | Δ PPL | Std Size | DG Size | Δ Size | Std Speed | DG Speed |
|--------------|--------------|------------------|---------|----------|---------|--------|-----------|----------|
| IQ2_XXS | 11.30 | 9.84 | -12.9% | 2.5G | 2.6G | +0.1G | 234s | 246s |
| IQ2_XS | 11.72 | 11.63 | -0.8% | 2.7G | 2.8G | +0.1G | 242s | 246s |
| IQ2_S | 14.31 | 9.02 | -36.9% | 2.7G | 2.9G | +0.2G | 238s | 244s |
| IQ1_M | 27.46 | 15.41 | -43.9% | 2.2G | 2.5G | +0.3G | 206s | 212s |
| IQ1_S | 53.07 | 32.00 | -39.7% | 2.1G | 2.4G | +0.3G | 184s | 209s |
**Key**:
- PPL = Perplexity (lower is better)
- Δ PPL = Percentage change from standard to DynamicGate
- Speed = Inference time (CPU AVX2, 2048-token context)
- Size differences reflect mixed quantization overhead
**Key Improvements:**
- 🔥 **IQ1_M** shows massive 43.9% perplexity reduction (27.46 → 15.41)
- 🚀 **IQ2_S** cuts perplexity by 36.9% while adding only 0.2GB
- ⚡ **IQ1_S** maintains 39.7% better accuracy despite 1-bit quantization
**Tradeoffs:**
- All variants have modest size increases (0.1-0.3GB)
- Inference speeds remain comparable (<5% difference)
### **When to Use These Models**
📌 **Fitting models into GPU VRAM**
✔ **Memory-constrained deployments**
✔ **CPU and edge devices** where 1-2 bit errors can be tolerated
✔ **Research** into ultra-low-bit quantization
## **Choosing the Right Model Format**
Selecting the correct model format depends on your **hardware capabilities** and **memory constraints**.
### **BF16 (Brain Float 16) – Use if BF16 acceleration is available**
- A 16-bit floating-point format designed for **faster computation** while retaining good precision.
- Provides **similar dynamic range** as FP32 but with **lower memory usage**.
- Recommended if your hardware supports **BF16 acceleration** (check your device's specs; a quick check is sketched below).
- Ideal for **high-performance inference** with **reduced memory footprint** compared to FP32.
📌 **Use BF16 if:**
✔ Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs).
✔ You want **higher precision** while saving memory.
✔ You plan to **requantize** the model into another format.
📌 **Avoid BF16 if:**
❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.
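A quick way to check for native BF16 support on an NVIDIA GPU (a minimal sketch using PyTorch; adapt it to your own stack):
```python
import torch

# Reports whether the current CUDA device can run BF16 kernels natively.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    print("BF16 acceleration available - the -bf16 GGUF file is a good fit.")
else:
    print("No native BF16 - consider the F16 or quantized files instead.")
```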
---
### **F16 (Float 16) – More widely supported than BF16**
- A 16-bit floating-point format with **high precision**, but a narrower range of values than BF16.
- Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs).
- Slightly lower numerical precision than BF16 but generally sufficient for inference.
📌 **Use F16 if:**
✔ Your hardware supports **FP16** but **not BF16**.
✔ You need a **balance between speed, memory usage, and accuracy**.
✔ You are running on a **GPU** or another device optimized for FP16 computations.
📌 **Avoid F16 if:**
❌ Your device lacks **native FP16 support** (it may run slower than expected).
❌ Your setup is tightly memory-constrained (a quantized format may fit better).
---
### **Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference**
Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
- **Lower-bit models (Q4_K)** → Best for minimal memory usage, but may have lower precision.
- **Higher-bit models (Q6_K, Q8_0)** → Better accuracy, but require more memory.
📌 **Use Quantized Models if:**
✔ You are running inference on a **CPU** and need an optimized model.
✔ Your device has **low VRAM** and cannot load full-precision models.
✔ You want to reduce **memory footprint** while keeping reasonable accuracy.
📌 **Avoid Quantized Models if:**
❌ You need **maximum accuracy** (full-precision models are better for this).
❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).
---
### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)**
These models are optimized for **extreme memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint.
- **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **extreme memory efficiency**.
- **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large.
- **Trade-off**: Lower accuracy compared to higher-bit quantizations.
- **IQ3_S**: Small block size for **maximum memory efficiency**.
- **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive.
- **IQ3_M**: Medium block size for better accuracy than **IQ3_S**.
- **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting.
- **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy.
- **Use case**: Best for **low-memory devices** where **Q6_K** is too large.
- **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**.
- **Use case**: Best for **ARM-based devices** or **low-memory environments**.
---
### **Summary Table: Model Format Selection**
| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|--------------|------------|---------------|----------------------|---------------|
| **BF16** | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
| **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn't available |
| **Q4_K** | Medium Low | Low | CPU or Low-VRAM devices | Best for memory-constrained environments |
| **Q6_K** | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
| **Q8_0** | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
| **IQ3_XS** | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency and low accuracy |
| **Q4_0** | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |
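As a rough rule of thumb, you can estimate a file's size from the parameter count and the effective bits per weight. The helper below is an approximation only (the bits-per-weight figures are ballpark values, and real GGUF files add metadata and keep some layers at higher precision):
```python
def approx_gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size estimate: parameters * bits-per-weight / 8 bits-per-byte, in GB."""
    return params_billion * bits_per_weight / 8

# Ballpark sizes for this ~3.2B-parameter model at common quantization levels.
for name, bpw in [("Q8_0", 8.5), ("Q6_K", 6.6), ("Q4_K", 4.8), ("IQ3_XS", 3.3)]:
    print(f"{name}: ~{approx_gguf_size_gb(3.2, bpw):.1f} GB")
```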
---
## **Included Files & Details**
### `Llama-3.2-3B-F1-Reasoning-Instruct-bf16.gguf`
- Model weights preserved in **BF16**.
- Use this if you want to **requantize** the model into a different format.
- Best if your device supports **BF16 acceleration**.
### `Llama-3.2-3B-F1-Reasoning-Instruct-f16.gguf`
- Model weights stored in **F16**.
- Use if your device supports **FP16**, especially if BF16 is not available.
### `Llama-3.2-3B-F1-Reasoning-Instruct-bf16-q8_0.gguf`
- **Output & embeddings** remain in **BF16**.
- All other layers quantized to **Q8_0**.
- Use if your device supports **BF16** and you want a quantized version.
### `Llama-3.2-3B-F1-Reasoning-Instruct-f16-q8_0.gguf`
- **Output & embeddings** remain in **F16**.
- All other layers quantized to **Q8_0**.
### `Llama-3.2-3B-F1-Reasoning-Instruct-q4_k.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q4_K**.
- Good for **CPU inference** with limited memory.
### `Llama-3.2-3B-F1-Reasoning-Instruct-q4_k_s.gguf`
- Smallest **Q4_K** variant, using less memory at the cost of accuracy.
- Best for **very low-memory setups**.
### `Llama-3.2-3B-F1-Reasoning-Instruct-q6_k.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q6_K**.
### `Llama-3.2-3B-F1-Reasoning-Instruct-q8_0.gguf`
- Fully **Q8** quantized model for better accuracy.
- Requires **more memory** but offers higher precision.
### `Llama-3.2-3B-F1-Reasoning-Instruct-iq3_xs.gguf`
- **IQ3_XS** quantization, optimized for **extreme memory efficiency**.
- Best for **ultra-low-memory devices**.
### `Llama-3.2-3B-F1-Reasoning-Instruct-iq3_m.gguf`
- **IQ3_M** quantization, offering a **medium block size** for better accuracy.
- Suitable for **low-memory devices**.
### `Llama-3.2-3B-F1-Reasoning-Instruct-q4_0.gguf`
- Pure **Q4_0** quantization, optimized for **ARM devices**.
- Best for **low-memory environments**.
- Prefer IQ4_NL for better accuracy.
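For example, any of the files above can be run locally with the `llama-cpp-python` bindings (a minimal sketch; the file name, context size, and sampling settings are illustrative):
```python
from llama_cpp import Llama

# Load one of the GGUF files listed above; tune n_ctx / n_threads for your machine.
llm = Llama(
    model_path="Llama-3.2-3B-F1-Reasoning-Instruct-q4_k.gguf",
    n_ctx=4096,
    n_threads=8,
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "請用繁體中文簡短介紹台北。"}],
    max_tokens=256,
    temperature=0.6,
)
print(output["choices"][0]["message"]["content"])
```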
# <span id="testllm" style="color: #7F7FFF;">🚀 If you find these models useful</span>
❤ **Please click "Like" if you find this useful!**
Help me test my **AI-Powered Network Monitor Assistant** with **quantum-ready security checks**:
👉 [Free Network Monitor](https://readyforquantum.com/dashboard/?assistant=open)
💬 **How to test**:
Choose an **AI assistant type**:
- `TurboLLM` (GPT-4o-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)
### **What I’m Testing**
I’m pushing the limits of **small open-source models for AI network monitoring**, specifically:
- **Function calling** against live network services
- **How small can a model go** while still handling:
- Automated **Nmap scans**
- **Quantum-readiness checks**
- **Network Monitoring tasks**
🟡 **TestLLM** – Current experimental model (llama.cpp on 2 CPU threads):
- **Zero-configuration setup**
- ⏳ 30s load time (slow inference but **no API costs**)
- 🔧 **Help wanted!** If you’re into **edge-device AI**, let’s collaborate!
### **Other Assistants**
🟢 **TurboLLM** – Uses **gpt-4o-mini** for:
- **Create custom cmd processors to run .NET code on Free Network Monitor Agents**
- **Real-time network diagnostics and monitoring**
- **Security Audits**
- **Penetration testing** (Nmap/Metasploit)
- 🔑 Get more tokens by logging in or [downloading our Free Network Monitor Agent with integrated AI Assistant](https://readyforquantum.com/download)
🔵 **HugLLM** – Latest Open-source models:
- 🌐 Runs on Hugging Face Inference API
### 💡 **Example commands you could test**:
1. `"Give me info on my websites SSL certificate"`
2. `"Check if my server is using quantum safe encyption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. '"Create a cmd processor to .. (what ever you want)" Note you need to install a Free Network Monitor Agent to run the .net code from. This is a very flexible and powerful feature. Use with caution!
# Model Card for Llama-3.2-3B-F1-Reasoning-Instruct (a.k.a __Formosa-1-Reasoning__ or __F1-Reasoning__)
<div align="center" style="line-height: 1;">
<a href="https://discord.gg/Cx737yw4ed" target="_blank" style="margin: 2px;">
<img alt="Discord" src="https://img.shields.io/badge/Discord-Twinkle%20AI-7289da?logo=discord&logoColor=white&color=7289da" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://huggingface.co/twinkle-ai" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Twinkle%20AI-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>
<div align="center" style="line-height: 1;">
<a href="https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/LICENSE.txt" style="margin: 2px;">
<img alt="License" src="https://img.shields.io/badge/License-llama3.2-f5de53?&color=0081fb" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>
![image/png](https://cdn-uploads.huggingface.co/production/uploads/618dc56cbc345ca7bf95f3cd/lBonfNs_7lzYguD4kJo6z.png)
<!-- Provide a quick summary of what the model is/does. -->
**Llama-3.2-3B-F1-Reasoning-Instruct** (a.k.a. **Formosa-1-Reasoning** or **F1-Reasoning**) is a Traditional Chinese language model developed jointly by **[Twinkle AI](https://huggingface.co/twinkle-ai)** and **[APMIC](https://www.apmic.ai/)**, with technical guidance from the [National Center for High-performance Computing](https://www.nchc.org.tw/). It is fine-tuned for the linguistic context and task needs of Taiwan (R.O.C.), covering diverse scenarios such as law, education, and everyday applications, and is optimized for strong instruction following.
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
- **Developed by:** [Liang Hsun Huang](https://huggingface.co/lianghsun)、[Min Yi Chen](https://huggingface.co/minyichen)、[Wen Bin Lin](https://huggingface.co/tedslin)、[Chao Chun Chuang](https://huggingface.co/c00cjz00) & [Dave Sung](https://huggingface.co/k1dave6412) (All authors have contributed equally to this work.)
- **Funded by:** [APMIC](https://www.apmic.ai/)
- **Model type:** LlamaForCausalLM
- **Language(s) (NLP):** Traditional Chinese & English
- **License:** [llama3.2](https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/LICENSE.txt)
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** [twinkle-ai/Llama-3.2-3B-F1-Reasoning-Instruct](https://huggingface.co/twinkle-ai/Llama-3.2-3B-F1-Reasoning-Instruct)
- **Paper:** (TBA)
- **Demo:** [Playground](https://3b02.coolify.apmic.ai/)
## Evaluation
### Results
The table below was produced with the [🌟 Twinkle Eval](https://github.com/ai-twinkle/Eval) evaluation framework.
| Model | Eval Mode | TMMLU+ (%) | Taiwan Legal (%) | MMLU (%) | Runs | Option Order |
|------------------------------------|---------|----------------|----------------|----------------|---------|---------|
| [mistralai/Mistral-Small-24B-Instruct-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501) | box | 56.15 (±0.0172) | 37.48 (±0.0098) | 74.61 (±0.0154) | 3 | Random |
| [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | box | 15.49 (±0.0104) | 25.68 (±0.0200) | 6.90 (±0.0096) | 3 | Random |
| [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | pattern | 35.85 (±0.0174) | 32.22 (±0.0023) | 59.33 (±0.0168) | 3 | Random |
| [MediaTek-Research/Llama-Breeze2-3B-Instruct](https://huggingface.co/MediaTek-Research/Llama-Breeze2-3B-Instruct) | pattern | 40.32 (±0.0181) | 38.92 (±0.0193) | 55.37 (±0.0180) | 3 | Random |
| [twinkle-ai/Llama-3.2-3B-F1-Reasoning-Instruct](https://huggingface.co/twinkle-ai/Llama-3.2-3B-F1-Reasoning-Instruct) (ours) | box | 46.16 (±0.0198) | 34.92 (±0.0243) | 51.22 (±0.0206) | 3 | Random |
The table below was produced with the lighteval evaluation framework.
| Model | MATH-500 | GPQA Diamond |
|--------------------------------------------|----------|--------------|
| [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | 44.40 | 27.78 |
| [twinkle-ai/Llama-3.2-3B-F1-Reasoning-Instruct](https://huggingface.co/twinkle-ai/Llama-3.2-3B-F1-Reasoning-Instruct) (ours) | **51.40**| **33.84** |
---
## 🔧 Tool Calling
This model is trained with the Hermes tool-calling format and supports parallel calling; a complete example workflow follows.
The tool-call template is already built into the chat template, so you can use it directly. Enjoy!
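If you want to inspect how the built-in chat template renders tool definitions, you can render a prompt offline with `transformers` (a minimal sketch; it assumes a recent `transformers` release whose `apply_chat_template` accepts a `tools` argument, and the tiny tool schema here is illustrative):
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("twinkle-ai/Llama-3.2-3B-F1-Reasoning-Instruct")

# A minimal tool schema, just to see how the chat template serializes it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "台北氣溫如何?"}],
    tools=tools,
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)
```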
### 1️⃣ Start the vLLM backend
> **⚠️ Note: vLLM >= 0.8.3 is required; otherwise `enable-reasoning` and `enable-auto-tool-choice` cannot be enabled at the same time.**
```bash
vllm serve twinkle-ai/Llama-3.2-3B-F1-Reasoning-Instruct \
--port 8001 \
--enable-reasoning \
--reasoning-parser deepseek_r1 \
--enable-auto-tool-choice \
--tool-call-parser hermes
```
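The Python examples below talk to this server through its OpenAI-compatible API, and the later steps also use `json` to parse tool-call arguments. A minimal client setup might look like this (the port matches `--port 8001` above; the API key value is a placeholder since vLLM does not validate it by default):
```python
import json  # used later to parse tool-call arguments

from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint under /v1.
client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")
```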
### 2️⃣ Define the tools (functions)
```python
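# Mock tool implementations: they return canned Traditional Chinese strings so the
# example can run end-to-end without a real weather service or search engine.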
def get_weather(location: str, unit: str):
return f"{location}的氣溫是{unit}26度,晴朗無風"
def search(query: str):
return "川普終於宣布對等關稅政策,針對 18 個經濟體課徵一半的對等關稅,並從 4/5 起對所有進口產品徵收10%的基準關稅!美國將針對被認定為不當貿易行為(不公平貿易) 的國家,於 4/9 起課徵報復型對等關稅 (Discounted Reciprocal Tariff),例如:日本將被課徵 24% 的關稅,歐盟則為 20%,以取代普遍性的 10% 關稅。\n針對中國則開啟新一波 34% 關稅,並疊加於先前已實施的關稅上,這將使中國進口商品的基本關稅稅率達到 54%,而且這尚未包含拜登總統任內或川普第一任期所施加的額外關稅。加拿大與墨西哥則不適用這套對等關稅制度,但川普認為這些國家在芬太尼危機與非法移民問題尚未完全解決,因此計畫對這兩國的大多數進口商品施加 25% 關稅。另外原本針對汽車與多數其他商品的關稅豁免將於 4/2 到期。\n台灣的部分,美國擬向台灣課徵32%的對等關稅,雖然並未針對晶片特別課徵關稅,但仍在記者會中提到台灣搶奪所有的電腦與半導體晶片,最終促成台積電對美國投資計劃額外加碼 1,000 億美元的歷史性投資;歐盟則課徵20%的對等關稅。最後是汽車關稅將於 4/2 起,對所有外國製造的汽車課徵25% 關稅。"
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "國家或城市名, e.g., 'Taipei'、'Jaipei'"},
"unit": {"type": "string", "description": "氣溫單位,亞洲城市使用攝氏;歐美城市使用華氏", "enum": ["celsius", "fahrenheit"]}
},
"required": ["location", "unit"]
}
}
},
{
"type": "function",
"function": {
"name": "search",
"description": "這是一個類似 Google 的搜尋引擎,關於知識、天氣、股票、電影、小說、百科等等問題,如果你不確定答案就搜尋一下。",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "should be a search query, e.g., '2024 南韓 戒嚴'"}
},
"required": ["query"]
}
}
}
]
```
### 3️⃣ Run the tool calls
> **⚠️ Note: the system prompt can be omitted, unless a tool needs a time reference.**
```python
response = client.chat.completions.create(
model=client.models.list().data[0].id,
messages=[
{"role": "system", "content": "記住你的知識截止於 2024/12,今天是 2025/4/7"},
{"role": "user", "content": "台北氣溫如何? 另外,告訴我川普最新關稅政策"},
],
max_tokens=1500,
temperature=0.6,
top_p=0.95,
tools=tools,
tool_choice="auto"
)
print(response.choices[0].message.reasoning_content)
print(response.choices[0].message.tool_calls)
```
#### 🧠 Reasoning output (excerpt)
> OK, I need to help this user with their request. They asked two things: first, the weather in Taipei, and second, Trump's latest tariff policy.
> For the first part they mentioned "Taipei", so I should call the get_weather function…
> Next is Trump's new tariff policy…
> To sum up, I need to make two separate API calls, each with its own correctly filled-in parameters…
#### ⚙️ Tool Calls List
```python
[ChatCompletionMessageToolCall(id='chatcmpl-tool-35e74420119349999913a10133b84bd3', function=Function(arguments='{"location": "Taipei", "unit": "celsius"}', name='get_weather'), type='function'),
 ChatCompletionMessageToolCall(id='chatcmpl-tool-7ffdcb98e59f4134a6171defe7f2e31b', function=Function(arguments='{"query": "Donald Trump latest tariffs policy"}', name='search'), type='function')]
```
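The step-4 example below pairs each tool result with its call by hard-coding the order seen above. A more general pattern (a sketch, not part of the original example; it reuses `response`, `json`, and the step-2 functions) is to dispatch each call by its function name:
```python
# Map tool names to the local Python implementations defined in step 2.
available_tools = {"get_weather": get_weather, "search": search}

tool_messages = []
for call in response.choices[0].message.tool_calls:
    args = json.loads(call.function.arguments)
    result = available_tools[call.function.name](**args)
    tool_messages.append({
        "role": "tool",
        "content": result,
        "tool_call_id": call.id,  # must match the id of the originating tool_call
    })
```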
### 4️⃣ Generate the final answer
```python
response = client.chat.completions.create(
model=client.models.list().data[0].id,
messages=[
{"role": "system", "content": "記住你的知識截止於 2024/12,今天是 2025/4/7"},
{"role": "user", "content": "台北氣溫如何? 另外,告訴我川普最新關稅政策"},
{
"role": "assistant",
"content": "",
"tool_calls": [
{
"id": response.choices[0].message.tool_calls[0].id,
"type": "function",
"function": {
"name": response.choices[0].message.tool_calls[0].function.name,
"arguments": response.choices[0].message.tool_calls[0].function.arguments
}
},
{
"id": response.choices[0].message.tool_calls[1].id,
"type": "function",
"function": {
"name": response.choices[0].message.tool_calls[1].function.name,
"arguments": response.choices[0].message.tool_calls[1].function.arguments
}
}
]
},
        {
            "role": "tool",
            # tool_calls[0] was the get_weather call (see the list above), so its result goes here.
            "content": get_weather(**json.loads(response.choices[0].message.tool_calls[0].function.arguments)),
            "tool_call_id": response.choices[0].message.tool_calls[0].id  # required so the result is paired with the right tool_call
        },
        {
            "role": "tool",
            # tool_calls[1] was the search call.
            "content": search(**json.loads(response.choices[0].message.tool_calls[1].function.arguments)),
            "tool_call_id": response.choices[0].message.tool_calls[1].id  # required so the result is paired with the right tool_call
        }
],
max_tokens=1500,
temperature=0.6,
top_p=0.95,
tools=tools,
tool_choice="auto"
)
print(response.choices[0].message.reasoning_content)
print(response.choices[0].message.content)
```
#### 🧠 Reasoning output (excerpt)
> First, I need to handle the user's query: they want the current temperature in Taipei and Trump's latest tariff policy…
> Calling `get_weather` returned a temperature of 26°C (Celsius) for Taipei…
> Next, I used `search` to look up "Trump latest tariff policy 2025"…
> Putting it all together, I provide the following summary:
#### 📋 Final output
```text
Here is the information you requested:
**Taipei temperature**
- Current temperature: **26°C** (Celsius)
- Conditions: clear and calm
**Overview of Trump's latest tariff policy**
1. **Reciprocal tariff policy**
   - A 50% reciprocal tariff on 18 economies
   - From April 5, a blanket 10% baseline tariff on all imports
2. **Retaliatory reciprocal tariffs**
   - Japan 24%, EU 20%
3. **Steep tariffs on China**
   - Raised to 54% (existing tariffs + an additional 34%)
4. **Special cases**
   - Canada and Mexico are exempt, but other goods face a 25% tariff
   - Tariff exemptions for cars and some other goods are about to expire
5. **Impact on Taiwan**
   - The US plans a 32% tariff on Taiwan, with no extra tariff on chips for now
6. **Global view**
   - Tariff rates for the EU and Japan are comparatively high
```
## Citation
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
```bibtex
@misc{twinkleai2025llama3.2f1,
title = {Llama-3.2-3B-F1-Reasoning-Instruct: A Traditional Chinese Instruction-Tuned Reasoning Language Model for Taiwan},
author = {Huang, Liang Hsun and Chen, Min Yi and Lin, Wen Bin and Chuang, Chao Chun and Sung, Dave},
year = {2025},
howpublished = {\url{https://huggingface.co/twinkle-ai/Llama-3.2-3B-F1-Instruct}},
note = {Twinkle AI and APMIC. All authors contributed equally.}
}
```
## Acknowledgements
- We thank the [National Center for High-performance Computing](https://www.nchc.org.tw/) for its guidance and [APMIC](https://www.apmic.ai/) for the compute support that allowed this project to be completed smoothly.
- Special thanks to 黃啟聖老師, 許武龍 (哈爸), 陳姿燁老師 (physics teacher at Taipei First Girls' High School), Howard (CTO of [奈視科技](https://nanoseex.com/)), [AIPLUX Technology](https://aiplux.com/), 郭家嘉老師, and everyone who provided valuable assistance during dataset construction.
## Model Card Authors
[Twinkle AI](https://huggingface.co/twinkle-ai)
## Model Card Contact
[Twinkle AI](https://huggingface.co/twinkle-ai)