granite-3.2-2b-instruct GGUF Models
Choosing the Right Model Format
Selecting the correct model format depends on your hardware capabilities and memory constraints.
BF16 (Brain Float 16) - Use if BF16 acceleration is available
- A 16-bit floating-point format designed for faster computation while retaining good precision.
- Provides a dynamic range similar to FP32 but with lower memory usage.
- Recommended if your hardware supports BF16 acceleration (check your device's specs).
- Ideal for high-performance inference with reduced memory footprint compared to FP32.
Use BF16 if:
- Your hardware has native BF16 support (e.g., newer GPUs, TPUs).
- You want higher precision while saving memory.
- You plan to requantize the model into another format.
Avoid BF16 if:
- Your hardware does not support BF16 (it may fall back to FP32 and run slower).
- You need compatibility with older devices that lack BF16 optimization.
F16 (Float 16) - More widely supported than BF16
- A 16-bit floating-point format with higher precision than BF16 but a narrower range of values.
- Works on most devices with FP16 acceleration support (including many GPUs and some CPUs).
- Narrower dynamic range than BF16 (values above about 65,504 overflow), but generally sufficient for inference.
Use F16 if:
- Your hardware supports FP16 but not BF16.
- You need a balance between speed, memory usage, and accuracy.
- You are running on a GPU or another device optimized for FP16 computations.
Avoid F16 if:
- Your device lacks native FP16 support (it may run slower than expected).
- You have memory limitations.
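The range and precision trade-off between the two 16-bit formats can be seen with a small standard-library sketch. The helper names `to_bf16` and `to_fp16` are illustrative, not part of any library, and the BF16 conversion here truncates the float32 bit pattern, whereas real hardware typically rounds to nearest.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round-trip x through bfloat16 by truncating its float32 bit pattern.

    Assumption: truncation instead of round-to-nearest, to keep the sketch short.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 is the upper 16 bits of float32: same exponent, 7 mantissa bits.
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision; overflow becomes infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the FP16 range.
        return math.copysign(float("inf"), x)


for value in (1.001, 1e5, 1e38):
    print(f"{value:>8g}  bf16={to_bf16(value):g}  fp16={to_fp16(value):g}")
```

Near 1.0, FP16 keeps more digits than BF16, while large magnitudes survive in BF16 and overflow in FP16.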
Quantized Models (Q4_K, Q6_K, Q8, etc.) - For CPU & Low-VRAM Inference
Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
- Lower-bit models (Q4_K): Best for minimal memory usage; may have lower precision.
- Higher-bit models (Q6_K, Q8_0): Better accuracy; require more memory.
Use Quantized Models if:
- You are running inference on a CPU and need an optimized model.
- Your device has low VRAM and cannot load full-precision models.
- You want to reduce memory footprint while keeping reasonable accuracy.
Avoid Quantized Models if:
- You need maximum accuracy (full-precision models are better for this).
- Your hardware has enough VRAM for higher-precision formats (BF16/F16).
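As a rough guide to how much memory each format needs, the weight storage can be estimated from the bits stored per weight. The sketch below assumes the nominal 2-billion parameter count and the usual llama.cpp block layouts (for example, Q8_0 stores 32 int8 values plus one 16-bit scale, giving 8.5 bits per weight). It ignores the KV cache, runtime buffers, and the fact that some files here keep output and embedding tensors in a different type, so treat the numbers as ballpark figures only.

```python
# Approximate storage cost per weight in bits, including per-block scales.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.5625,
    "Q4_K": 4.5,
    "Q4_0": 4.5,
}


def estimate_weights_gib(n_params: float, fmt: str) -> float:
    """Estimate weight storage in GiB for a model stored entirely in one format."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


# Assumption: nominal "2-billion-parameter" figure; the exact count will differ.
N_PARAMS = 2e9

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:>5}: ~{estimate_weights_gib(N_PARAMS, fmt):.2f} GiB")
```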
Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)
These models are optimized for extreme memory efficiency, making them ideal for low-power devices or large-scale deployments where memory is a critical constraint.
IQ3_XS: Ultra-low-bit quantization (3-bit) with extreme memory efficiency.
- Use case: Best for ultra-low-memory devices where even Q4_K is too large.
- Trade-off: Lower accuracy compared to higher-bit quantizations.
IQ3_S: Small block size for maximum memory efficiency.
- Use case: Best for low-memory devices where IQ3_XS is too aggressive.
IQ3_M: Medium block size for better accuracy than IQ3_S.
- Use case: Suitable for low-memory devices where IQ3_S is too limiting.
Q4_K: 4-bit quantization with block-wise optimization for better accuracy.
- Use case: Best for low-memory devices where Q6_K is too large.
Q4_0: Pure 4-bit quantization, optimized for ARM devices.
- Use case: Best for ARM-based devices or low-memory environments.
Summary Table: Model Format Selection
Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
---|---|---|---|---|
BF16 | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
F16 | High | High | FP16-supported devices | GPU inference when BF16 isn't available |
Q4_K | Medium-Low | Low | CPU or Low-VRAM devices | Best for memory-constrained environments |
Q6_K | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
Q8_0 | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
IQ3_XS | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency and low accuracy |
Q4_0 | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |
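The selection logic in the table can be sketched as a small helper. Everything here is hypothetical: the function name `pick_format`, its parameters, and the file sizes in the usage example are made up for illustration, so substitute the real file sizes and your own hardware flags.

```python
from typing import Dict, Optional


def pick_format(
    free_mem_gib: float,
    sizes_gib: Dict[str, float],
    has_bf16: bool = False,
    has_fp16: bool = False,
) -> Optional[str]:
    """Return the highest-precision format that fits in memory, or None."""
    preference = []
    # 16-bit formats are only candidates when the hardware accelerates them.
    if has_bf16:
        preference.append("BF16")
    if has_fp16:
        preference.append("F16")
    # Quantized formats, from most to least accurate.
    preference += ["Q8_0", "Q6_K", "Q4_K", "IQ3_XS"]
    for fmt in preference:
        if fmt in sizes_gib and sizes_gib[fmt] <= free_mem_gib:
            return fmt
    return None


# Placeholder sizes for illustration only, not measured from the actual files.
example_sizes = {
    "BF16": 5.0, "F16": 5.0, "Q8_0": 2.7,
    "Q6_K": 2.1, "Q4_K": 1.5, "IQ3_XS": 1.1,
}
print(pick_format(8.0, example_sizes, has_bf16=True))
print(pick_format(2.0, example_sizes))
```

In practice, leave headroom beyond the file size for the context cache and runtime buffers.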
Included Files & Details
granite-3.2-2b-instruct-bf16.gguf
- Model weights preserved in BF16.
- Use this if you want to requantize the model into a different format.
- Best if your device supports BF16 acceleration.
granite-3.2-2b-instruct-f16.gguf
- Model weights stored in F16.
- Use if your device supports FP16, especially if BF16 is not available.
granite-3.2-2b-instruct-bf16-q8_0.gguf
- Output & embeddings remain in BF16.
- All other layers quantized to Q8_0.
- Use if your device supports BF16 and you want a quantized version.
granite-3.2-2b-instruct-f16-q8_0.gguf
- Output & embeddings remain in F16.
- All other layers quantized to Q8_0.
granite-3.2-2b-instruct-q4_k.gguf
- Output & embeddings quantized to Q8_0.
- All other layers quantized to Q4_K.
- Good for CPU inference with limited memory.
granite-3.2-2b-instruct-q4_k_s.gguf
- Smallest Q4_K variant, using less memory at the cost of accuracy.
- Best for very low-memory setups.
granite-3.2-2b-instruct-q6_k.gguf
- Output & embeddings quantized to Q8_0.
- All other layers quantized to Q6_K.
granite-3.2-2b-instruct-q8_0.gguf
- Fully Q8 quantized model for better accuracy.
- Requires more memory but offers higher precision.
granite-3.2-2b-instruct-iq3_xs.gguf
- IQ3_XS quantization, optimized for extreme memory efficiency.
- Best for ultra-low-memory devices.
granite-3.2-2b-instruct-iq3_m.gguf
- IQ3_M quantization, offering a medium block size for better accuracy.
- Suitable for low-memory devices.
granite-3.2-2b-instruct-q4_0.gguf
- Pure Q4_0 quantization, optimized for ARM devices.
- Best for low-memory environments.
- Prefer IQ4_NL for better accuracy.
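After downloading one of the files above, a quick sanity check is to read the fixed GGUF header (magic bytes, version, tensor count, metadata entry count). The sketch below is a minimal standard-library reader, assuming the little-endian header layout used by GGUF version 2 and later. The function name `read_gguf_header` is illustrative, and the in-memory bytes in the example stand in for a real file opened with `open(path, "rb")`.

```python
import io
import struct
from typing import BinaryIO, Dict


def read_gguf_header(stream: BinaryIO) -> Dict[str, int]:
    """Parse the fixed 24-byte GGUF header from a binary stream."""
    raw = stream.read(24)
    if len(raw) < 24 or raw[:4] != b"GGUF":
        raise ValueError("not a GGUF file (bad magic or truncated header)")
    # uint32 version, uint64 tensor count, uint64 metadata key/value count.
    version, n_tensors, n_kv = struct.unpack("<IQQ", raw[4:24])
    return {"version": version, "tensors": n_tensors, "metadata_entries": n_kv}


# Synthetic header for demonstration; the counts are made up.
fake_file = io.BytesIO(b"GGUF" + struct.pack("<IQQ", 3, 10, 5))
print(read_gguf_header(fake_file))
```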
If you find these models useful
Please click like. I'd also really appreciate it if you could test my Network Monitor Assistant.
Click the chat icon (bottom right of the main and dashboard pages). Choose an LLM; toggle between the LLM types TurboLLM -> FreeLLM -> TestLLM.
What I'm Testing
I'm experimenting with function calling against my network monitoring service, using small open-source models. The question I'm exploring is: how small can a model go and still function?
TestLLM: Runs the current testing model using llama.cpp on 6 threads of a CPU VM. It should take about 15 s to load. Inference is quite slow and it only processes one user prompt at a time; I'm still working on scaling. If you're curious, I'd be happy to share how it works!
The Other Available AI Assistants
TurboLLM: Uses gpt-4o-mini. Fast! Note: tokens are limited since OpenAI models are pricey, but you can log in or download the free Network Monitor agent to get more tokens. Alternatively, use FreeLLM.
FreeLLM: Runs open-source Hugging Face models. Medium speed (unlimited, subject to Hugging Face API availability).
Granite-3.2-2B-Instruct
Model Summary: Granite-3.2-2B-Instruct is a 2-billion-parameter, long-context AI model fine-tuned for thinking capabilities. Built on top of Granite-3.1-2B-Instruct, it has been trained using a mix of permissively licensed open-source datasets and internally generated synthetic data designed for reasoning tasks. The model allows controllability of its thinking capability, ensuring it is applied only when required.
- Developers: Granite Team, IBM
- Website: Granite Docs
- Release Date: February 26th, 2025
- License: Apache 2.0
Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. However, users may finetune this Granite model for languages beyond these 12 languages.
Intended Use: This model is designed to handle general instruction-following tasks and can be integrated into AI assistants across various domains, including business applications.
Capabilities
- Thinking
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Long-context tasks including long document/meeting summarization, long document QA, etc.
Example Outputs
- thinking=True
Here is my thought process:
1. First, let's understand the composition of the initial 10 liters of 30% acid solution:
- Acid content = 30% of 10 liters = 0.30 * 10 = 3 liters
2. We want to end up with a 50% acid solution. Let's denote the volume of the 70% acid solution to be added as 'x' liters.
3. The amount of acid in the final mixture (50% solution) should equal the initial acid content plus the acid from the added 70% solution.
- Total acid in final mixture = 0.50 * (10 + x)
4. The equation representing the total acid content is:
3 (from initial solution) + 0.70 * x (from added solution) = 0.50 * (10 + x)
5. Now, we solve for 'x':
3 + 0.70x = 5 + 0.50x
0.70x - 0.50x = 5 - 3
0.20x = 2
x = 2 / 0.20
x = 10 liters
6. Therefore, 10 liters of a 70% acid solution must be added to the initial 10 liters of 30% acid solution to achieve a 50% acid mixture.
Here is my response:
To solve this problem, we need to determine how much of a 70% acid solution must be added to a 10-liter solution that is already 30% acid to create a new mixture that is 50% acid. Here's how to calculate it:
1. **Initial Acid Content**:
- The initial 10-liter solution contains 3 liters of acid (since 30% of 10 liters).
2. **Desired Final Acid Content**:
- We want the final mixture to be 50% acid.
3. **Equation Setup**:
- Let 'x' be the volume (in liters) of the 70% acid solution to be added.
- The equation representing the total acid content in the final mixture (which is 10 liters from the initial solution plus 'x' liters of the added solution) should equal 50% of the total volume (10 + x).
- This gives us: 3 (acid from initial solution) + 0.70x (acid from added solution) = 0.50(10 + x)
4. **Solving the Equation**:
- First, simplify the right side of the equation: 0.50(10 + x) = 5 + 0.50x
- Now, the equation is: 3 + 0.70x = 5 + 0.50x
- Subtract 0.50x from both sides: 0.20x = 2
- Divide both sides by 0.20: x = 2 / 0.20
- x = 10 liters
5. **Conclusion**:
- To achieve a 50% acid mixture from the initial 30% solution, you need to add 10 liters of a 70% acid solution.
So, the final answer is that 10 liters of a 70% acid solution must be added to the initial 10 liters of 30% acid solution.
- thinking=False
To solve this problem, we'll use the concept of volume and concentration. Let's denote the amount of 70% acid solution we need to add as "x" liters.
First, let's find out how much acid is in the initial 10-liter solution:
Acid in initial solution = 30% of 10 liters = 0.30 * 10 = 3 liters
Now, we want to end up with a 50% acid solution in a total volume of (10 + x) liters. Let's denote the final volume as V.
Final acid concentration = 50%
Final acid amount = 50% of V = 0.50 * V
We know the initial acid amount and the final acid amount, so we can set up an equation:
Initial acid amount + Acid from added solution = Final acid amount
3 liters + (70% of x) = 0.50 * (10 + x)
Now, let's solve for x:
0.70x + 3 = 0.50 * 10 + 0.50x
0.70x - 0.50x = 0.50 * 10 - 3
0.20x = 5 - 3
0.20x = 2
x = 2 / 0.20
x = 10 liters
So, you need to add 10 liters of a 70% acid solution to the initial 10-liter 30% acid solution to achieve a 50% acid mixture.
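Both outputs arrive at the same answer, which can be checked numerically. The sketch below solves the general mixture equation v0*c0 + x*c_add = c_target*(v0 + x) for x. The function name `volume_to_add` is illustrative, and the problem statement (10 liters at 30%, adding 70% solution, targeting 50%) is inferred from the outputs above.

```python
def volume_to_add(v0: float, c0: float, c_add: float, c_target: float) -> float:
    """Volume of added solution needed to reach the target concentration.

    Rearranged from: v0*c0 + x*c_add = c_target*(v0 + x)
    """
    if c_add == c_target:
        raise ValueError("added solution must differ from the target concentration")
    return v0 * (c_target - c0) / (c_add - c_target)


x = volume_to_add(10, 0.30, 0.70, 0.50)
# Plug x back in to confirm the resulting concentration.
final_concentration = (10 * 0.30 + x * 0.70) / (10 + x)
print(round(x, 6), round(final_concentration, 6))
```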
Evaluation Results:
Models | ArenaHard | Alpaca-Eval-2 | MMLU | PopQA | TruthfulQA | BigBenchHard | DROP | GSM8K | HumanEval | HumanEval+ | IFEval | AttaQ |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Llama-3.1-8B-Instruct | 36.43 | 27.22 | 69.15 | 28.79 | 52.79 | 72.66 | 61.48 | 83.24 | 85.32 | 80.15 | 79.10 | 83.43 |
DeepSeek-R1-Distill-Llama-8B | 17.17 | 21.85 | 45.80 | 13.25 | 47.43 | 65.71 | 44.46 | 72.18 | 67.54 | 62.91 | 66.50 | 42.87 |
Qwen-2.5-7B-Instruct | 25.44 | 30.34 | 74.30 | 18.12 | 63.06 | 70.40 | 54.71 | 84.46 | 93.35 | 89.91 | 74.90 | 81.90 |
DeepSeek-R1-Distill-Qwen-7B | 10.36 | 15.35 | 50.72 | 9.94 | 47.14 | 65.04 | 42.76 | 78.47 | 79.89 | 78.43 | 59.10 | 42.45 |
Granite-3.1-8B-Instruct | 37.58 | 30.34 | 66.77 | 28.7 | 65.84 | 68.55 | 50.78 | 79.15 | 89.63 | 85.79 | 73.20 | 85.73 |
Granite-3.1-2B-Instruct | 23.3 | 27.17 | 57.11 | 20.55 | 59.79 | 54.46 | 18.68 | 67.55 | 79.45 | 75.26 | 63.59 | 84.7 |
Granite-3.2-8B-Instruct | 55.25 | 61.19 | 66.79 | 28.04 | 66.92 | 64.77 | 50.95 | 81.65 | 89.35 | 85.72 | 74.31 | 85.42 |
Granite-3.2-2B-Instruct | 24.86 | 34.51 | 57.18 | 20.56 | 59.8 | 52.27 | 21.12 | 67.02 | 80.13 | 73.39 | 61.55 | 83.23 |
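To read the two 2B rows side by side, the per-benchmark change from Granite-3.1-2B-Instruct to Granite-3.2-2B-Instruct can be computed directly from the table. The scores below are copied from the table above; the variable names are illustrative.

```python
BENCHMARKS = [
    "ArenaHard", "Alpaca-Eval-2", "MMLU", "PopQA", "TruthfulQA", "BigBenchHard",
    "DROP", "GSM8K", "HumanEval", "HumanEval+", "IFEval", "AttaQ",
]
# Rows copied from the evaluation table.
GRANITE_31_2B = [23.3, 27.17, 57.11, 20.55, 59.79, 54.46,
                 18.68, 67.55, 79.45, 75.26, 63.59, 84.7]
GRANITE_32_2B = [24.86, 34.51, 57.18, 20.56, 59.8, 52.27,
                 21.12, 67.02, 80.13, 73.39, 61.55, 83.23]

# Positive delta means 3.2 scored higher than 3.1 on that benchmark.
deltas = {
    name: new - old
    for name, old, new in zip(BENCHMARKS, GRANITE_31_2B, GRANITE_32_2B)
}
improved = [name for name, delta in deltas.items() if delta > 0]
largest_gain = max(deltas, key=deltas.get)

for name, delta in deltas.items():
    print(f"{name:>14}: {delta:+.2f}")
print("largest gain:", largest_gain)
```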
Training Data: Overall, our training data comes largely from two key sources: (1) publicly available datasets with permissive licenses, and (2) internal synthetically generated data targeted at enhancing reasoning capabilities.
Infrastructure: We train Granite-3.2-2B-Instruct using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.
Ethical Considerations and Limitations: Granite-3.2-2B-Instruct builds upon Granite-3.1-2B-Instruct, leveraging both permissively licensed open-source and select proprietary data for enhanced performance. Since it inherits its foundation from the previous model, all ethical considerations and limitations applicable to Granite-3.1-2B-Instruct remain relevant.
Resources
- Learn about the latest updates with Granite: https://www.ibm.com/granite
- Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
Model tree for Mungert/granite-3.2-2b-instruct-GGUF
Base model
ibm-granite/granite-3.1-2b-base