---
license: llama3.2
language:
  - en
  - zh
base_model:
  - meta-llama/Llama-3.2-3B
  - lianghsun/Llama-3.2-3B-F1-Base
library_name: transformers
tags:
  - Taiwan
  - R.O.C
  - zhtw
  - SLM
  - Llama-32
datasets:
  - lianghsun/tw-reasoning-instruct
  - minyichen/tw-instruct-R1-200k
  - minyichen/tw_mm_R1
model-index:
  - name: Llama-3.2-3B-F1-Reasoning-Instruct
    results:
      - task:
          type: question-answering
          name: Single Choice Question
        dataset:
          type: ikala/tmmluplus
          name: tmmlu+
          config: all
          split: test
          revision: c0e8ae955997300d5dbf0e382bf0ba5115f85e8c
        metrics:
          - name: single choice
            type: accuracy
            value: 46.16
      - task:
          type: question-answering
          name: Single Choice Question
        dataset:
          type: cais/mmlu
          name: mmlu
          config: all
          split: test
          revision: c30699e
        metrics:
          - name: single choice
            type: accuracy
            value: 51.22
      - task:
          type: question-answering
          name: Single Choice Question
        dataset:
          type: lianghsun/tw-legal-benchmark-v1
          name: tw-legal-benchmark-v1
          config: all
          split: test
          revision: 66c3a5f
        metrics:
          - name: single choice
            type: accuracy
            value: 34.92
metrics:
  - accuracy
---

Llama-3.2-3B-F1-Reasoning-Instruct GGUF Models

Model Generation Details

This model was generated using llama.cpp at commit 064cc596.

Ultra-Low-Bit Quantization with IQ-DynamicGate (1-2 bit)

Our latest quantization method introduces precision-adaptive quantization for ultra-low-bit models (1-2 bit), with benchmark-proven improvements on Llama-3-8B. This approach uses layer-specific strategies to preserve accuracy while maintaining extreme memory efficiency.

Benchmark Context

All tests conducted on Llama-3-8B-Instruct using:

  • Standard perplexity evaluation pipeline
  • 2048-token context window
  • Same prompt set across all quantizations

Method

  • Dynamic Precision Allocation (see the sketch below):
    • First/last 25% of layers → IQ4_XS (selected layers)
    • Middle 50% → IQ2_XXS/IQ3_S (increases efficiency)
  • Critical Component Protection:
    • Embeddings/output layers use Q5_K
    • Reduces error propagation by 38% vs standard 1-2 bit quantization
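
The allocation rule above can be pictured as a simple mapping from layer position to quantization type. The sketch below only illustrates that rule (the function name is mine, and this is not the actual IQ-DynamicGate implementation):

def choose_quant_type(layer_idx: int, n_layers: int) -> str:
    """Illustrative layer-to-quant mapping following the rule described above."""
    position = layer_idx / (n_layers - 1)   # 0.0 = first layer, 1.0 = last layer
    if position < 0.25 or position > 0.75:
        return "IQ4_XS"   # first/last 25% of layers keep higher precision
    return "IQ2_XXS"      # middle 50% uses IQ2_XXS/IQ3_S in the described scheme

# Embeddings and the output head are handled separately and kept at Q5_K.
print([choose_quant_type(i, 32) for i in range(32)])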

Quantization Performance Comparison (Llama-3-8B)

| Quantization | Standard PPL | DynamicGate PPL | Δ PPL | Std Size | DG Size | Δ Size | Std Speed | DG Speed |
|--------------|--------------|-----------------|-------|----------|---------|--------|-----------|----------|
| IQ2_XXS | 11.30 | 9.84 | -12.9% | 2.5G | 2.6G | +0.1G | 234s | 246s |
| IQ2_XS | 11.72 | 11.63 | -0.8% | 2.7G | 2.8G | +0.1G | 242s | 246s |
| IQ2_S | 14.31 | 9.02 | -36.9% | 2.7G | 2.9G | +0.2G | 238s | 244s |
| IQ1_M | 27.46 | 15.41 | -43.9% | 2.2G | 2.5G | +0.3G | 206s | 212s |
| IQ1_S | 53.07 | 32.00 | -39.7% | 2.1G | 2.4G | +0.3G | 184s | 209s |

Key:

  • PPL = Perplexity (lower is better)
  • Δ PPL = Percentage change from standard to DynamicGate (see the worked example below)
  • Speed = Inference time (CPU AVX2, 2048-token context)
  • Size differences reflect mixed quantization overhead
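
For instance, the Δ PPL in the IQ2_XXS row follows directly from the two perplexity columns; a quick check in Python:

# Worked example: Δ PPL for IQ2_XXS, taken from the table above
standard_ppl, dynamicgate_ppl = 11.30, 9.84
delta_ppl = (dynamicgate_ppl - standard_ppl) / standard_ppl * 100
print(f"{delta_ppl:.1f}%")  # -12.9%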

Key Improvements:

  • 🔥 IQ1_M shows massive 43.9% perplexity reduction (27.46 → 15.41)
  • 🚀 IQ2_S cuts perplexity by 36.9% while adding only 0.2GB
  • IQ1_S still delivers a 39.7% perplexity reduction despite being a 1-bit quantization

Tradeoffs:

  • All variants have modest size increases (0.1-0.3GB)
  • Inference speeds remain comparable (<5% difference)

When to Use These Models

📌 Use these models for:

  • Fitting models into GPU VRAM
  • Memory-constrained deployments
  • CPU and edge devices where 1-2 bit errors can be tolerated
  • Research into ultra-low-bit quantization

Choosing the Right Model Format

Selecting the correct model format depends on your hardware capabilities and memory constraints.

BF16 (Brain Float 16) – Use if BF16 acceleration is available

  • A 16-bit floating-point format designed for faster computation while retaining good precision.
  • Provides similar dynamic range as FP32 but with lower memory usage.
  • Recommended if your hardware supports BF16 acceleration (check your device's specs; a quick check is sketched below).
  • Ideal for high-performance inference with reduced memory footprint compared to FP32.

📌 Use BF16 if:
✔ Your hardware has native BF16 support (e.g., newer GPUs, TPUs).
✔ You want higher precision while saving memory.
✔ You plan to requantize the model into another format.

📌 Avoid BF16 if:
❌ Your hardware does not support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.
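
If you are unsure whether your hardware has native BF16 support, here is a quick capability check, assuming PyTorch with CUDA is installed (this is only a convenience sketch, not part of the GGUF workflow):

import torch

# Reports whether the active CUDA device exposes native BF16 support.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    print("Native BF16 available -> the -bf16 GGUF files are a good fit")
else:
    print("No native BF16 -> prefer the -f16 or quantized variants")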


F16 (Float 16) – More widely supported than BF16

  • A 16-bit floating-point format that offers high precision but a narrower range of values than BF16.
  • Works on most devices with FP16 acceleration support (including many GPUs and some CPUs).
  • Slightly lower numerical precision than BF16 but generally sufficient for inference.

📌 Use F16 if:
✔ Your hardware supports FP16 but not BF16.
✔ You need a balance between speed, memory usage, and accuracy.
✔ You are running on a GPU or another device optimized for FP16 computations.

📌 Avoid F16 if:
❌ Your device lacks native FP16 support (it may run slower than expected).
❌ You have memory limitations.


Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference

Quantization reduces model size and memory usage while maintaining as much accuracy as possible.

  • Lower-bit models (Q4_K) → Best for minimal memory usage, may have lower precision.
  • Higher-bit models (Q6_K, Q8_0) → Better accuracy, require more memory.

📌 Use Quantized Models if:
✔ You are running inference on a CPU and need an optimized model.
✔ Your device has low VRAM and cannot load full-precision models.
✔ You want to reduce memory footprint while keeping reasonable accuracy.

📌 Avoid Quantized Models if:
❌ You need maximum accuracy (full-precision models are better for this).
❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).


Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)

These models are optimized for extreme memory efficiency, making them ideal for low-power devices or large-scale deployments where memory is a critical constraint.

  • IQ3_XS: Ultra-low-bit quantization (3-bit) with extreme memory efficiency.

    • Use case: Best for ultra-low-memory devices where even Q4_K is too large.
    • Trade-off: Lower accuracy compared to higher-bit quantizations.
  • IQ3_S: Small block size for maximum memory efficiency.

    • Use case: Best for low-memory devices where IQ3_XS is too aggressive.
  • IQ3_M: Medium block size for better accuracy than IQ3_S.

    • Use case: Suitable for low-memory devices where IQ3_S is too limiting.
  • Q4_K: 4-bit quantization with block-wise optimization for better accuracy.

    • Use case: Best for low-memory devices where Q6_K is too large.
  • Q4_0: Pure 4-bit quantization, optimized for ARM devices.

    • Use case: Best for ARM-based devices or low-memory environments.

Summary Table: Model Format Selection

| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|--------------|-----------|--------------|---------------------|---------------|
| BF16 | Highest | High | BF16-supported GPU/CPU | High-speed inference with reduced memory |
| F16 | High | High | FP16-supported devices | GPU inference when BF16 isn't available |
| Q4_K | Medium-Low | Low | CPU or low-VRAM devices | Best for memory-constrained environments |
| Q6_K | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
| Q8_0 | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
| IQ3_XS | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency, low accuracy |
| Q4_0 | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |

Included Files & Details

Llama-3.2-3B-F1-Reasoning-Instruct-bf16.gguf

  • Model weights preserved in BF16.
  • Use this if you want to requantize the model into a different format.
  • Best if your device supports BF16 acceleration.

Llama-3.2-3B-F1-Reasoning-Instruct-f16.gguf

  • Model weights stored in F16.
  • Use if your device supports FP16, especially if BF16 is not available.

Llama-3.2-3B-F1-Reasoning-Instruct-bf16-q8_0.gguf

  • Output & embeddings remain in BF16.
  • All other layers quantized to Q8_0.
  • Use if your device supports BF16 and you want a quantized version.

Llama-3.2-3B-F1-Reasoning-Instruct-f16-q8_0.gguf

  • Output & embeddings remain in F16.
  • All other layers quantized to Q8_0.

Llama-3.2-3B-F1-Reasoning-Instruct-q4_k.gguf

  • Output & embeddings quantized to Q8_0.
  • All other layers quantized to Q4_K.
  • Good for CPU inference with limited memory.

Llama-3.2-3B-F1-Reasoning-Instruct-q4_k_s.gguf

  • Smallest Q4_K variant, using less memory at the cost of accuracy.
  • Best for very low-memory setups.

Llama-3.2-3B-F1-Reasoning-Instruct-q6_k.gguf

  • Output & embeddings quantized to Q8_0.
  • All other layers quantized to Q6_K.

Llama-3.2-3B-F1-Reasoning-Instruct-q8_0.gguf

  • Fully Q8 quantized model for better accuracy.
  • Requires more memory but offers higher precision.

Llama-3.2-3B-F1-Reasoning-Instruct-iq3_xs.gguf

  • IQ3_XS quantization, optimized for extreme memory efficiency.
  • Best for ultra-low-memory devices.

Llama-3.2-3B-F1-Reasoning-Instruct-iq3_m.gguf

  • IQ3_M quantization, offering a medium block size for better accuracy.
  • Suitable for low-memory devices.

Llama-3.2-3B-F1-Reasoning-Instruct-q4_0.gguf

  • Pure Q4_0 quantization, optimized for ARM devices.
  • Best for low-memory environments.
  • Prefer IQ4_NL for better accuracy.
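
To actually run one of the files listed above on CPU, here is a minimal sketch using the llama-cpp-python bindings; the file path, context size, thread count, and sampling values are illustrative, not recommendations from the model authors:

from llama_cpp import Llama  # pip install llama-cpp-python

# Load the Q4_K file from the list above; adjust the path to wherever you downloaded it.
llm = Llama(
    model_path="./Llama-3.2-3B-F1-Reasoning-Instruct-q4_k.gguf",
    n_ctx=2048,    # context window
    n_threads=4,   # CPU threads
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "請用繁體中文簡單介紹台北。"}],
    max_tokens=256,
    temperature=0.6,
)
print(out["choices"][0]["message"]["content"])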

🚀 If you find these models useful

Please click "Like" if you find this useful!
Help me test my AI-Powered Network Monitor Assistant with quantum-ready security checks:
👉 Free Network Monitor

💬 How to test:
Choose an AI assistant type:

  • TurboLLM (GPT-4o-mini)
  • HugLLM (Hugging Face open-source)
  • TestLLM (Experimental CPU-only)

What I’m Testing

I’m pushing the limits of small open-source models for AI network monitoring, specifically:

  • Function calling against live network services
  • How small can a model go while still handling:
    • Automated Nmap scans
    • Quantum-readiness checks
    • Network Monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads):

  • Zero-configuration setup
  • ⏳ 30s load time (slow inference but no API costs)
  • 🔧 Help wanted! If you’re into edge-device AI, let’s collaborate!

Other Assistants

🟢 TurboLLM – Uses gpt-4o-mini for:

🔵 HugLLM – Latest Open-source models:

  • 🌐 Runs on Hugging Face Inference API

💡 Example commands you could test:

  1. "Give me info on my websites SSL certificate"
  2. "Check if my server is using quantum safe encyption for communication"
  3. "Run a comprehensive security audit on my server"
  4. '"Create a cmd processor to .. (what ever you want)" Note you need to install a Free Network Monitor Agent to run the .net code from. This is a very flexible and powerful feature. Use with caution!

Model Card for Llama-3.2-3B-F1-Reasoning-Instruct (a.k.a. Formosa-1-Reasoning or F1-Reasoning)


Llama-3.2-3B-F1-Reasoning-Instruct (a.k.a. Formosa-1-Reasoning or F1-Reasoning) is a Traditional Chinese language model jointly developed by Twinkle AI and APMIC, under the technical guidance of the National Center for High-performance Computing, and fine-tuned for the linguistic context and task needs of Taiwan (R.O.C.). It covers diverse scenarios such as law, education, and everyday applications, and is tuned with a strong focus on instruction following.

Model Details

Model Description

Model Sources

Evaluation

Results

The table below was produced with the 🌟 Twinkle Eval evaluation framework.

| Model | Eval Mode | TMMLU+ (%) | Taiwan Legal (%) | MMLU (%) | Runs | Option Order |
|-------|-----------|------------|------------------|----------|------|--------------|
| mistralai/Mistral-Small-24B-Instruct-2501 | box | 56.15 (±0.0172) | 37.48 (±0.0098) | 74.61 (±0.0154) | 3 | random |
| meta-llama/Llama-3.2-3B-Instruct | box | 15.49 (±0.0104) | 25.68 (±0.0200) | 6.90 (±0.0096) | 3 | random |
| meta-llama/Llama-3.2-3B-Instruct | pattern | 35.85 (±0.0174) | 32.22 (±0.0023) | 59.33 (±0.0168) | 3 | random |
| MediaTek-Research/Llama-Breeze2-3B-Instruct | pattern | 40.32 (±0.0181) | 38.92 (±0.0193) | 55.37 (±0.0180) | 3 | random |
| twinkle-ai/Llama-3.2-3B-F1-Reasoning-Instruct (ours) | box | 46.16 (±0.0198) | 34.92 (±0.0243) | 51.22 (±0.0206) | 3 | random |

The table below was produced with the lighteval evaluation framework.

| Model | MATH-500 | GPQA Diamond |
|-------|----------|--------------|
| meta-llama/Llama-3.2-3B-Instruct | 44.40 | 27.78 |
| twinkle-ai/Llama-3.2-3B-F1-Reasoning-Instruct (ours) | 51.40 | 33.84 |

🔧 Tool Calling

This model is trained with the Hermes tool-call format and supports parallel calling. A complete example workflow follows. The tool-call template is already built into the chat template, so enjoy it!

1️⃣ Start the vLLM backend

⚠️ Note: vLLM >= 0.8.3 is required; otherwise enable-reasoning and enable-auto-tool-choice cannot be enabled at the same time.

vllm serve twinkle-ai/Llama-3.2-3B-F1-Reasoning-Instruct \
  --port 8001 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

2️⃣ Define the tools (functions)

def get_weather(location: str, unit: str):
    return f"{location}的氣溫是{unit}26度,晴朗無風"

def search(query: str):
    return "川普終於宣布對等關稅政策,針對 18 個經濟體課徵一半的對等關稅,並從 4/5 起對所有進口產品徵收10%的基準關稅!美國將針對被認定為不當貿易行為(不公平貿易) 的國家,於 4/9 起課徵報復型對等關稅 (Discounted Reciprocal Tariff),例如:日本將被課徵 24% 的關稅,歐盟則為 20%,以取代普遍性的 10% 關稅。\n針對中國則開啟新一波 34% 關稅,並疊加於先前已實施的關稅上,這將使中國進口商品的基本關稅稅率達到 54%,而且這尚未包含拜登總統任內或川普第一任期所施加的額外關稅。加拿大與墨西哥則不適用這套對等關稅制度,但川普認為這些國家在芬太尼危機與非法移民問題尚未完全解決,因此計畫對這兩國的大多數進口商品施加 25% 關稅。另外原本針對汽車與多數其他商品的關稅豁免將於 4/2 到期。\n台灣的部分,美國擬向台灣課徵32%的對等關稅,雖然並未針對晶片特別課徵關稅,但仍在記者會中提到台灣搶奪所有的電腦與半導體晶片,最終促成台積電對美國投資計劃額外加碼 1,000 億美元的歷史性投資;歐盟則課徵20%的對等關稅。最後是汽車關稅將於 4/2 起,對所有外國製造的汽車課徵25% 關稅。"

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "國家或城市名, e.g., 'Taipei'、'Jaipei'"},
                    "unit": {"type": "string", "description": "氣溫單位,亞洲城市使用攝氏;歐美城市使用華氏", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location", "unit"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search",
            "description": "這是一個類似 Google 的搜尋引擎,關於知識、天氣、股票、電影、小說、百科等等問題,如果你不確定答案就搜尋一下。",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "should be a search query, e.g., '2024 南韓 戒嚴'"}
                },
                "required": ["query"]
            }
        }
    }
]

3️⃣ Execute the tool calls

⚠️ Note: the system prompt can be omitted, unless a tool needs a time reference.
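
The code in steps 3 and 4 assumes an OpenAI-compatible client pointed at the vLLM server from step 1, plus the standard json module used later to parse tool-call arguments. A minimal setup sketch (the base URL matches the port from the vllm serve command above; the API key is just a placeholder, as vLLM does not check it by default):

import json
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API on the port chosen in step 1.
client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")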

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[
        {"role": "system", "content": "記住你的知識截止於 2024/12,今天是 2025/4/7"},
        {"role": "user", "content": "台北氣溫如何? 另外,告訴我川普最新關稅政策"},
    ],
    max_tokens=1500,
    temperature=0.6,
    top_p=0.95,
    tools=tools,
    tool_choice="auto"
)

print(response.choices[0].message.reasoning_content)
print(response.choices[0].message.tool_calls)

🧠 Reasoning content output (excerpt)

OK, I need to help the user with their request. They asked two things: first, the weather in Taipei, and second, Trump's latest tariff policy.
For the first part, they mentioned "Taipei", so the get_weather function should be called…
Next is Trump's new tariff policy…
To sum up, I need to make two separate API calls, each with its own correctly filled-in arguments…

⚙️ Tool Calls List

[ChatCompletionMessageToolCall(id='chatcmpl-tool-35e74420119349999913a10133b84bd3', function=Function(arguments='{"location": "Taipei", "unit": "celsius"}', name='get_weather'), type='function'), ChatCompletionMessageToolCall(id='chatcmpl-tool-7ffdcb98e59f4134a6171defe7f2e31b', function=Function(arguments='{"query": "Donald Trump latest tariffs policy"}', name='search'), type='function')]
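
Before building the follow-up request in step 4, note that the pairing between tool calls and Python functions depends on the order the model emitted them, so it is safer to dispatch on the function name instead of hard-coding indices. A small sketch (available_tools and tool_messages are illustrative names, not part of the original example):

# Route each tool call to the matching local function by name.
available_tools = {"get_weather": get_weather, "search": search}

tool_messages = []
for call in response.choices[0].message.tool_calls:
    result = available_tools[call.function.name](**json.loads(call.function.arguments))
    tool_messages.append({
        "role": "tool",
        "content": result,
        "tool_call_id": call.id,  # required so each result is matched to its tool call
    })
# tool_messages can then be appended after the assistant message in step 4.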

4️⃣ Generate the final answer

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[
        {"role": "system", "content": "記住你的知識截止於 2024/12,今天是 2025/4/7"},
        {"role": "user", "content": "台北氣溫如何? 另外,告訴我川普最新關稅政策"},
        {
            "role": "assistant",
            "content": "",
            "tool_calls": [
                {
                    "id": response.choices[0].message.tool_calls[0].id,
                    "type": "function",
                    "function": {
                        "name": response.choices[0].message.tool_calls[0].function.name,
                        "arguments": response.choices[0].message.tool_calls[0].function.arguments
                    }
                },
                {
                    "id": response.choices[0].message.tool_calls[1].id,
                    "type": "function",
                    "function": {
                        "name": response.choices[0].message.tool_calls[1].function.name,
                        "arguments": response.choices[0].message.tool_calls[1].function.arguments
                    }
                }
            ]
        },
        {
            "role": "tool",
            "content": get_weather(**json.loads(response.choices[0].message.tool_calls[0].function.arguments)),
            "tool_call_id": response.choices[0].message.tool_calls[0].id # tool_call_id is required so each result is matched to its tool call
        },
        {
            "role": "tool",
            "content": search(**json.loads(response.choices[0].message.tool_calls[1].function.arguments)),
            "tool_call_id": response.choices[0].message.tool_calls[1].id # tool_call_id is required so each result is matched to its tool call
        }
    ],
    max_tokens=1500,
    temperature=0.6,
    top_p=0.95,
    tools=tools,
    tool_choice="auto"
)

print(response.choices[0].message.reasoning_content)
print(response.choices[0].message.content)

🧠 Reasoning content output (excerpt)

First, I need to handle the user's query: they want the current temperature in Taipei and Trump's latest tariff policy…
After calling get_weather, I got a temperature of 26°C for Taipei…
Next, I used search with the query "Trump latest tariff policy 2025"…
Putting this together, I provide the following summary:

📋 Final output

Here is the information you requested:

**Taipei temperature**
- Current temperature: **26°C**
- Conditions: clear and calm

**Overview of Trump's latest tariff policy**
1. **Reciprocal tariff policy**
   - A 50% reciprocal tariff on 18 economies
   - From April 5, a 10% baseline tariff on all imported goods

2. **Discounted reciprocal tariffs**
   - Japan 24%, EU 20%

3. **Higher tariffs on China**
   - Raised to 54% (existing tariffs plus an additional 34%)

4. **Special cases**
   - Canada and Mexico are excluded from the reciprocal scheme, but most of their goods face a 25% tariff
   - Exemptions for cars and some other goods are about to expire

5. **Impact on Taiwan**
   - The US plans a 32% tariff on Taiwan, though chips are not subject to an extra tariff for now

6. **Global picture**
   - Tariff rates on the EU and Japan are comparatively high

Citation

@misc{twinkleai2025llama3.2f1,
  title        = {Llama-3.2-3B-F1-Reasoning-Instruct: A Traditional Chinese Instruction-Tuned Reasoning Language Model for Taiwan},
  author       = {Huang, Liang Hsun and Chen, Min Yi and Lin, Wen Bin and Chuang, Chao Chun and Sung, Dave},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/twinkle-ai/Llama-3.2-3B-F1-Instruct}},
  note         = {Twinkle AI and APMIC. All authors contributed equally.}
}

Acknowledgements

  • Special thanks to the National Center for High-performance Computing for its guidance and to APMIC for the compute support that made this project possible.
  • Special thanks also to teachers 黃啟聖 and 郭家嘉, 許武龍 (哈爸), 陳姿燁 (physics teacher at Taipei First Girls High School), Howard (CTO of 奈視科技), AIPLUX Technology, and everyone who provided valuable help during dataset creation.

Model Card Authors

Twinkle AI

Model Card Contact

Twinkle AI