Granite 3.2 8B Instruct - llamafile

Model creator: IBM
Original model: ibm-granite/granite-3.2-8b-instruct

Mozilla packaged the IBM Granite 3.2 models into executable weights that we call llamafiles. This gives you the easiest, fastest way to use the model on Linux, macOS, Windows, FreeBSD, OpenBSD, and NetBSD systems you control, on both AMD64 and ARM64.

Software Last Updated: 2025-03-31

Llamafile Version: 0.9.2

Quickstart

To get started, you need both the Granite 3.2 weights and the llamafile software. Both are included in a single file, which can be downloaded and run as follows:

wget https://huggingface.co/Mozilla/granite-3.2-8b-instruct-llamafile/resolve/main/granite-3.2-8b-instruct-Q6_K.llamafile
chmod +x granite-3.2-8b-instruct-Q6_K.llamafile
./granite-3.2-8b-instruct-Q6_K.llamafile

The default mode of operation for these llamafiles is our new command line chatbot interface.

Usage

You can use triple quotes to ask questions on multiple lines. You can pass commands like /stats and /context to see runtime status information. You can change the system prompt by passing the -p "new system prompt" flag. You can press CTRL-C to interrupt the model. Finally, CTRL-D may be used to exit.

If you prefer a web GUI, a --server mode is provided that opens a tab with a chatbot and completion interface in your browser. For additional help on how it may be used, pass the --help flag. The server also has an OpenAI API compatible completions endpoint that can be accessed from Python using the openai pip package.

./granite-3.2-8b-instruct-Q6_K.llamafile --server
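
As a rough illustration of the OpenAI API compatible endpoint mentioned above, the sketch below posts a chat request using only the Python standard library, so it does not need the openai package. The address http://localhost:8080/v1, the /chat/completions path, and the model name are assumptions about the server's defaults, not values stated in this document; adjust them to whatever your server reports at startup.

```python
import json
import urllib.request

# Assumed default address of a locally running --server instance.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt, system="You are a helpful assistant.",
                       model="granite-3.2-8b-instruct"):
    """Build (but do not send) an OpenAI-style chat completion request.

    The model name is a placeholder; a single-model local server
    typically serves whatever weights it was started with.
    """
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
    }
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, calling chat("Summarize the Gettysburg Address in one sentence.") should return the model's reply as a string.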

An advanced CLI mode is provided that's useful for shell scripting. You can use it by passing the --cli flag. For additional help on how it may be used, pass the --help flag.

./granite-3.2-8b-instruct-Q6_K.llamafile --cli -p 'four score and seven' --log-disable
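
The same CLI invocation can be driven from Python when shell scripting is not convenient. This is a minimal sketch that reuses only the flags shown above (--cli, -p, --log-disable); the helper names are hypothetical. It launches the file through sh, on the assumption that a POSIX shell is available, because Python's subprocess does not fall back to the shell when the kernel rejects an executable format.

```python
import subprocess


def build_cli_command(llamafile_path, prompt):
    """Return the argument list for a one-shot completion in --cli mode."""
    # Launching via "sh" is an assumption for systems where the APE
    # interpreter is not registered; drop it if direct execution works.
    return ["sh", llamafile_path, "--cli", "-p", prompt, "--log-disable"]


def complete(llamafile_path, prompt, timeout=600):
    """Run the llamafile once and return whatever it writes to stdout."""
    result = subprocess.run(
        build_cli_command(llamafile_path, prompt),
        capture_output=True,
        text=True,
        check=True,       # raise CalledProcessError on a non-zero exit code
        timeout=timeout,  # seconds; generation on CPU can be slow
    )
    return result.stdout
```

For example, complete("./granite-3.2-8b-instruct-Q6_K.llamafile", "four score and seven") mirrors the command above.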

Troubleshooting

Having trouble? See the "Gotchas" section of the README.

On Linux, the way to avoid run-detector errors is to install the APE interpreter.

sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
sudo chmod +x /usr/bin/ape
sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"

On Windows there's a 4GB limit on executable sizes.

Context Window

This model has a max context window size of 128k tokens. By default, a context window size of 8192 tokens is used. You can ask llamafile to use the maximum context size by passing the -c 0 flag. That's big enough for a small book. If you want to be able to have a conversation with your book, you can use the -f book.txt flag.
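
To decide whether a document needs the -c 0 flag, a rough size check can help. The sketch below is only an estimate: it assumes about four characters per token (a common rule of thumb for English text, not a property of this model's tokenizer) and reads "128k" as 128 * 1024 tokens. The function names and the reserve parameter are hypothetical.

```python
DEFAULT_CONTEXT_TOKENS = 8192       # default context size stated above
MAX_CONTEXT_TOKENS = 128 * 1024     # assumes "128k" means 131072 tokens
CHARS_PER_TOKEN = 4                 # rough heuristic, not the real tokenizer


def estimate_tokens(text):
    """Estimate the token count of text, rounding up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def context_flags(text, reserve=1024):
    """Suggest context flags for a prompt, leaving room for the reply.

    Returns [] if the default window should suffice, ["-c", "0"] if the
    maximum window is needed, and raises ValueError if even that is
    probably too small.
    """
    needed = estimate_tokens(text) + reserve
    if needed <= DEFAULT_CONTEXT_TOKENS:
        return []
    if needed <= MAX_CONTEXT_TOKENS:
        return ["-c", "0"]
    raise ValueError(f"about {needed} tokens needed, more than the maximum window")
```

Reading book.txt into a string and passing it to context_flags gives a quick hint before launching the llamafile with -f book.txt.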

GPU Acceleration

On GPUs with sufficient RAM, the -ngl 999 flag may be passed to use the system's NVIDIA or AMD GPU(s). On Windows, only the graphics card driver needs to be installed if you own an NVIDIA GPU. On Windows, if you have an AMD GPU, you should install the ROCm SDK v6.1 and then pass the flags --recompile --gpu amd the first time you run your llamafile.

On NVIDIA GPUs, by default, the prebuilt tinyBLAS library is used to perform matrix multiplications. This is open source software, but it doesn't go as fast as closed source cuBLAS. If you have the CUDA SDK installed on your system, then you can pass the --recompile flag to build a GGML CUDA library just for your system that uses cuBLAS. This ensures you get maximum performance.

For further information, please see the llamafile README.

About llamafile

llamafile is a new format introduced by Mozilla on Nov 20th 2023. It uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp binaries that run on the stock installs of six OSes for both ARM64 and AMD64.

Granite-3.2-8B-Instruct

Model Summary: Granite-3.2-8B-Instruct is an 8-billion-parameter, long-context AI model fine-tuned for thinking capabilities. Built on top of Granite-3.1-8B-Instruct, it has been trained using a mix of permissively licensed open-source datasets and internally generated synthetic data designed for reasoning tasks. The model allows controllability of its thinking capability, ensuring it is applied only when required.

Developers: Granite Team, IBM
Website: Granite Docs
Release Date: February 26th, 2025
License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may fine-tune this Granite model for languages beyond these 12.

Intended Use: This model is designed to handle general instruction-following tasks and can be integrated into AI assistants across various domains, including business applications.

Capabilities

Thinking
Summarization
Text classification
Text extraction
Question-answering
Retrieval Augmented Generation (RAG)
Code related tasks
Function-calling tasks
Multilingual dialog use cases
Long-context tasks including long document/meeting summarization, long document QA, etc.

Generation: This is a simple example of how to use the Granite-3.2-8B-Instruct model.

Install the following libraries:

pip install torch torchvision torchaudio
pip install accelerate
pip install transformers

Then, copy the snippet from the section that is relevant for your use case.

from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
import torch

model_path = "ibm-granite/granite-3.2-8b-instruct"
device = "cuda"

# Load the model in bfloat16 and place it on the GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map=device,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

conv = [{"role": "user", "content": "You have 10 liters of a 30% acid solution. How many liters of a 70% acid solution must be added to achieve a 50% acid mixture?"}]

# thinking=True turns on the model's reasoning mode; use thinking=False for a direct answer.
# With return_dict=True, the result is a dict of tensors, so it is unpacked into generate().
input_ids = tokenizer.apply_chat_template(conv, return_tensors="pt", thinking=True, return_dict=True, add_generation_prompt=True).to(device)

# Fix the seed so that sampling is reproducible.
set_seed(42)
output = model.generate(
    **input_ids,
    max_new_tokens=8192,
)

# Decode only the newly generated tokens, skipping the prompt.
prediction = tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:], skip_special_tokens=True)
print(prediction)

Example Outputs

thinking=True

Here is my thought process:

1. We know we start with 10 liters of a 30% acid solution, so there's 3 liters of pure acid in it (10 liters * 0.3).
2. We need to determine how much of the 70% acid solution to add such that the final mixture is 50% acid.
3. Let's say x liters of the 70% acid solution are added. This will contribute 0.7x liters of pure acid.
4. The total volume of the new mixture will be 10 + x liters, and the total amount of pure acid will be 3 + 0.7x liters.
5. We need to set up an equation where (3 + 0.7x) / (10 + x) = 0.5 (since we want a 50% acid mixture).
6. Solving this equation for x will give us the volume of 70% solution needed.

Here is my response:

To solve this, we need to set up an equation based on the concentration of acid before and after mixing. 

1. **Initial Acid Content**: 
   - We start with 10 liters of a 30% acid solution. 
   - Pure acid in this = 10 * 0.3 = 3 liters.

2. **Variable for Addition**:
   - Let x be the liters of 70% acid solution to be added.
   - Pure acid in x liters of 70% solution = 0.7x liters.

3. **Final Mixture Concentration**:
   - Total volume of mixture = 10 + x liters.
   - Total pure acid in the mixture = 3 (from initial solution) + 0.7x (from added solution).

4. **Setting up the equation for 50% acid concentration**:
   - We require that the ratio of pure acid to total volume is 0.5 (50%).
   - Therefore, the equation is:

   \[
   \frac{3 + 0.7x}{10 + x} = 0.5
   \]

5. **Solving the equation**:

   Multiply both sides by (10 + x) to clear the fraction:

   \[
   3 + 0.7x = 0.5(10 + x)
   \]

   Expanding the right side:

   \[
   3 + 0.7x = 5 + 0.5x
   \]

   Subtract 0.5x from both sides:

   \[
   3 + 0.2x = 5
   \]

   Subtract 3 from both sides:

   \[
   0.2x = 2
   \]

   Divide by 0.2:

   \[
   x = 10
   \]

So, 10 liters of the 70% acid solution must be added to the initial 10 liters of 30% solution to get a 50% acid mixture. 

Let's verify: 

- Initial acid: 3 liters
- Acid from added solution: 0.7 * 10 = 7 liters
- Total acid: 3 + 7 = 10 liters
- Total volume: 10 + 10 = 20 liters
- Acid percentage: \(\frac{10}{20} = 0.5 = 50\%\)

This confirms our solution. 

Final answer: 10 liters of 70% acid solution should be added.

thinking=False

Let's denote the amount of 70% acid solution to be added as x liters.

We start with 10 liters of a 30% acid solution, which means there are 0.3 * 10 = 3 liters of pure acid in it.

We want to end up with a 50% acid solution after adding x liters of 70% acid solution. The total volume of the mixture will be 10 + x liters, and we want half of this volume to be pure acid.

So, the amount of pure acid in the final mixture should be 0.5 * (10 + x) liters.

The amount of pure acid in the final mixture comes from two sources: the original 30% solution and the added 70% solution. So, we can set up the following equation:

3 (from the original solution) + 0.7x (from the added solution) = 0.5 * (10 + x)

Now, let's solve for x:

3 + 0.7x = 5 + 0.5x
0.7x - 0.5x = 5 - 3
0.2x = 2
x = 2 / 0.2
x = 10

So, you need to add 10 liters of a 70% acid solution to the 10 liters of a 30% acid solution to get a 50% acid mixture.
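
In the thinking=True output above, the reasoning and the answer are introduced by the lines "Here is my thought process:" and "Here is my response:". If only the final answer is wanted, the two parts can be separated with plain string handling. This is a sketch that assumes those two marker lines appear exactly as in the example; the model is not guaranteed to emit them every time, so the function falls back to returning the whole text as the response.

```python
THOUGHT_MARKER = "Here is my thought process:"
RESPONSE_MARKER = "Here is my response:"


def split_thinking(prediction):
    """Split a decoded generation into (thought, response).

    If either marker is missing or they appear out of order, the thought
    is returned as an empty string and the full text as the response.
    """
    thought_start = prediction.find(THOUGHT_MARKER)
    response_start = prediction.find(RESPONSE_MARKER)
    if thought_start == -1 or response_start == -1 or response_start < thought_start:
        return "", prediction.strip()
    thought = prediction[thought_start + len(THOUGHT_MARKER):response_start].strip()
    response = prediction[response_start + len(RESPONSE_MARKER):].strip()
    return thought, response
```

Applied to the prediction string from the generation snippet, split_thinking(prediction) returns the numbered reasoning steps and the final answer as two separate strings; on thinking=False output it returns an empty thought.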

Evaluation Results:

Models	ArenaHard	Alpaca-Eval-2	MMLU	PopQA	TruthfulQA	BigBenchHard	DROP	GSM8K	HumanEval	HumanEval+	IFEval	AttaQ
Llama-3.1-8B-Instruct	36.43	27.22	69.15	28.79	52.79	72.66	61.48	83.24	85.32	80.15	79.10	83.43
DeepSeek-R1-Distill-Llama-8B	17.17	21.85	45.80	13.25	47.43	65.71	44.46	72.18	67.54	62.91	66.50	42.87
Qwen-2.5-7B-Instruct	25.44	30.34	74.30	18.12	63.06	70.40	54.71	84.46	93.35	89.91	74.90	81.90
DeepSeek-R1-Distill-Qwen-7B	10.36	15.35	50.72	9.94	47.14	65.04	42.76	78.47	79.89	78.43	59.10	42.45
Granite-3.1-8B-Instruct	37.58	30.34	66.77	28.7	65.84	68.55	50.78	79.15	89.63	85.79	73.20	85.73
Granite-3.1-2B-Instruct	23.3	27.17	57.11	20.55	59.79	54.46	18.68	67.55	79.45	75.26	63.59	84.7
Granite-3.2-2B-Instruct	24.86	34.51	57.18	20.56	59.8	52.27	21.12	67.02	80.13	73.39	61.55	83.23
Granite-3.2-8B-Instruct	55.25	61.19	66.79	28.04	66.92	64.77	50.95	81.65	89.35	85.72	74.31	85.42

Training Data: Overall, our training data comes largely from two key sources: (1) publicly available datasets with permissive licenses, and (2) internal synthetically generated data targeted to enhance reasoning capabilities.

Infrastructure: We train Granite-3.2-8B-Instruct using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite-3.2-8B-Instruct builds upon Granite-3.1-8B-Instruct, leveraging both permissively licensed open-source and select proprietary data for enhanced performance. Since it inherits its foundation from the previous model, all ethical considerations and limitations applicable to Granite-3.1-8B-Instruct remain relevant.

Resources

⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
