matrixportal/txgemma-2b-predict-GGUF
This model was converted to GGUF format from google/txgemma-2b-predict using llama.cpp via ggml.ai's all-gguf-same-where space.
Refer to the original model card for more details on the model.
Quantized Models Download List
Recommended Quantizations
- General CPU Use: Q4_K_M (best balance of speed and quality)
- ARM Devices: Q4_0 (optimized for ARM CPUs)
- Maximum Quality: Q8_0 (near-original quality)
Full Quantization Options

| Download | Type | Notes |
|---|---|---|
| Download | Q2_K | Basic quantization |
| Download | Q3_K_S | Small size |
| Download | Q3_K_M | Balanced quality |
| Download | Q3_K_L | Better quality |
| Download | Q4_0 | Fast on ARM |
| Download | Q4_K_S | Fast, recommended |
| Download | Q4_K_M | Best balance |
| Download | Q5_0 | Good quality |
| Download | Q5_K_S | Balanced |
| Download | Q5_K_M | High quality |
| Download | Q6_K | Very good quality |
| Download | Q8_0 | Fast, best quality |
| Download | F16 | Maximum accuracy |
Tip: Use F16 for maximum precision when quality is critical.
GGUF Model Quantization & Usage Guide with llama.cpp
What is GGUF and Quantization?
GGUF (GPT-Generated Unified Format) is an efficient model file format developed by the llama.cpp team that:
- Supports multiple quantization levels
- Works cross-platform
- Enables fast loading and inference
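If you want to see what a GGUF file actually contains, the gguf Python helper package maintained in the llama.cpp repository can read the header. A minimal sketch, assuming the package is installed (pip install gguf) and the q4_k_m file from this repo is in the working directory:

```python
from gguf import GGUFReader  # pip install gguf

# Open the file; the data is memory-mapped, so the weights are not pulled into RAM
reader = GGUFReader("txgemma-2b-predict-q4_k_m.gguf")

# Metadata keys stored in the header: architecture, context length, tokenizer settings, ...
for key in reader.fields:
    print(key)

# Tensor inventory: name, quantization type, and shape of each tensor
print(len(reader.tensors), "tensors")
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.tensor_type.name, tensor.shape)
```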
Quantization converts model weights to lower precision data types (e.g., 4-bit integers instead of 32-bit floats) to:
- Reduce model size
- Decrease memory usage
- Speed up inference (with minor accuracy trade-offs; see the rough size estimate below)
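As a rough illustration of the size argument (plain per-weight arithmetic only; real GGUF files add metadata, and k-quants such as Q4_K_M use slightly more than 4 bits per weight because they store per-block scales):

```python
# Back-of-the-envelope file sizes for a 2B-parameter model at different precisions
params = 2_000_000_000

for name, bits_per_weight in [("F32", 32), ("F16", 16), ("Q8_0", 8), ("Q4_0", 4)]:
    size_gib = params * bits_per_weight / 8 / 1024**3
    print(f"{name}: ~{size_gib:.1f} GiB")
```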
Step-by-Step Guide
1. Prerequisites
```bash
# System updates
sudo apt update && sudo apt upgrade -y

# Dependencies
sudo apt install -y build-essential cmake python3-pip

# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j4
```
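If you would rather produce your own GGUF files than download the pre-quantized ones below, the flow is a two-step convert-then-quantize. A sketch, assuming you have a local copy of the original google/txgemma-2b-predict checkpoint at a path of your choosing (the converter script is convert_hf_to_gguf.py in current llama.cpp trees, convert-hf-to-gguf.py in older ones; the quantize binary is ./quantize in Makefile builds and ./llama-quantize in newer CMake builds):

```bash
# 1. Convert the Hugging Face checkpoint to an unquantized (F16) GGUF file
python3 convert_hf_to_gguf.py /path/to/txgemma-2b-predict \
    --outtype f16 --outfile txgemma-2b-predict-f16.gguf

# 2. Quantize the F16 GGUF down to Q4_K_M
./quantize txgemma-2b-predict-f16.gguf txgemma-2b-predict-q4_k_m.gguf Q4_K_M
```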
2. Using Quantized Models from Hugging Face
My automated quantization script produces models in this format:
https://huggingface.co/matrixportal/txgemma-2b-predict-GGUF/resolve/main/txgemma-2b-predict-q4_k_m.gguf
Download your quantized model directly:
```bash
wget https://huggingface.co/matrixportal/txgemma-2b-predict-GGUF/resolve/main/txgemma-2b-predict-q4_k_m.gguf
```
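Alternatively, the Hugging Face CLI (part of the huggingface_hub package) can fetch the same file; a sketch, assuming the package is installed:

```bash
pip install -U huggingface_hub
huggingface-cli download matrixportal/txgemma-2b-predict-GGUF \
    txgemma-2b-predict-q4_k_m.gguf --local-dir .
```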
3. Running the Quantized Model
Basic usage:
```bash
./main -m txgemma-2b-predict-q4_k_m.gguf -p "Your prompt here" -n 128
```
Example with a creative writing prompt:
```bash
./main -m txgemma-2b-predict-q4_k_m.gguf -p "[INST] Write a short poem about AI quantization in the style of Shakespeare [/INST]" -n 256 -c 2048 -t 8 --temp 0.7
```
Advanced parameters:
```bash
./main -m txgemma-2b-predict-q4_k_m.gguf -p "Question: What is the GGUF format?
Answer:" -n 256 -c 2048 -t 8 --temp 0.7 --top-k 40 --top-p 0.9
```
4. Python Integration
Install the Python package:
```bash
pip install llama-cpp-python
```
Example script:
```python
from llama_cpp import Llama

# Initialize the model
llm = Llama(
    model_path="txgemma-2b-predict-q4_k_m.gguf",
    n_ctx=2048,
    n_threads=8
)

# Run inference
response = llm(
    "[INST] Explain GGUF quantization to a beginner [/INST]",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9
)

print(response["choices"][0]["text"])
```
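llama-cpp-python can also stream tokens as they are generated, which is useful for interactive use. A short sketch reusing the llm object from the script above:

```python
# Stream the completion token-by-token instead of waiting for the full response
for chunk in llm(
    "[INST] Summarize GGUF quantization in one sentence [/INST]",
    max_tokens=64,
    stream=True,
):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()
```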
Performance Tips

Hardware Utilization:
- Set the thread count with -t (typically the CPU core count)
- Compile with CUDA/OpenCL for GPU support (see the offload sketch after this list)

Memory Optimization:
- Lower-bit quantizations (like q4_k_m) use less RAM
- Adjust the context size with the -c parameter

Speed/Accuracy Balance:
- Higher-bit quantizations are slower but more accurate
- Reduce randomness with --temp 0 for consistent results
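If llama.cpp was compiled with GPU support (see the FAQ below), layers can be offloaded with the -ngl / --n-gpu-layers flag; a sketch, where the layer count is only an illustration and a large value such as 99 simply offloads everything that fits:

```bash
# Offload up to 99 layers to the GPU; keep 8 CPU threads for the remainder
./main -m txgemma-2b-predict-q4_k_m.gguf -p "Your prompt here" -n 128 -t 8 -ngl 99
```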
FAQ
Q: What quantization levels are available?
A: Common options include q4_0, q4_k_m, q5_0, q5_k_m, q8_0
Q: How much performance loss occurs with q4_k_m?
A: Typically a 2-5% accuracy reduction, in exchange for a roughly 4x smaller file.
Q: How do I enable GPU support?
A: Build with make LLAMA_CUBLAS=1 for NVIDIA GPUs
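For reference, the exact GPU build flag has changed across llama.cpp versions; a sketch of both variants (older Makefile builds vs. newer CMake builds):

```bash
# Older Makefile-based builds
make clean && make LLAMA_CUBLAS=1 -j4

# Newer CMake-based builds
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j4
```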