matrixportal/txgemma-2b-predict-GGUF

This model was converted to GGUF format from google/txgemma-2b-predict using llama.cpp via ggml.ai's all-gguf-same-where space. Refer to the original model card for more details on the model.

✅ Quantized Models Download List

🔍 Recommended Quantizations

  • ✨ General CPU Use: Q4_K_M (Best balance of speed/quality)
  • 📱 ARM Devices: Q4_0 (Optimized for ARM CPUs)
  • 🏆 Maximum Quality: Q8_0 (Near-original quality)

📦 Full Quantization Options

🔢 Type     📝 Notes
Q2_K        Basic quantization
Q3_K_S      Small size
Q3_K_M      Balanced quality
Q3_K_L      Better quality
Q4_0        Fast on ARM
Q4_K_S      Fast, recommended
Q4_K_M      ⭐ Best balance
Q5_0        Good quality
Q5_K_S      Balanced
Q5_K_M      High quality
Q6_K        🏆 Very good quality
Q8_0        ⚡ Fast, best quality
F16         Maximum accuracy

Each quantization can be downloaded from this repository as txgemma-2b-predict-<type>.gguf (e.g. txgemma-2b-predict-q4_k_m.gguf).

💡 Tip: Use F16 for maximum precision when quality is critical

GGUF Model Quantization & Usage Guide with llama.cpp

What is GGUF and Quantization?

GGUF (GPT-Generated Unified Format) is an efficient model file format developed by the llama.cpp team that:

  • Supports multiple quantization levels
  • Works cross-platform
  • Enables fast loading and inference
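
If you want to see what a GGUF file actually contains, the gguf Python package (published from llama.cpp's gguf-py directory) can read the header without loading the weights. A minimal sketch, assuming pip install gguf and that the Q4_K_M file from this repository has already been downloaded:

from gguf import GGUFReader

reader = GGUFReader("txgemma-2b-predict-q4_k_m.gguf")

# Metadata keys stored in the header (architecture, context length, tokenizer, ...)
for key in reader.fields:
    print(key)

# Tensor count plus a few tensor names with their quantization types
print(f"{len(reader.tensors)} tensors")
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.tensor_type)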

Quantization converts model weights to lower precision data types (e.g., 4-bit integers instead of 32-bit floats) to:

  • Reduce model size
  • Decrease memory usage
  • Speed up inference
  • (With minor accuracy trade-offs; a rough size estimate follows below)
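
As a rough illustration of the size savings, here is a back-of-the-envelope estimate in Python for a 2.61B-parameter model. The bits-per-weight values are approximate averages (K-quants mix block formats, and real GGUF files also carry metadata and a few higher-precision tensors):

# Rough size estimate: parameters * bits-per-weight / 8 bytes
params = 2.61e9
approx_bits_per_weight = {"F16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q2_K": 2.6}

for name, bpw in approx_bits_per_weight.items():
    gigabytes = params * bpw / 8 / 1e9
    print(f"{name:7s} ~{gigabytes:.2f} GB")

For this model that works out to roughly 5.2 GB at F16 versus about 1.6 GB at Q4_K_M.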

Step-by-Step Guide

1. Prerequisites

# System updates
sudo apt update && sudo apt upgrade -y

# Dependencies
sudo apt install -y build-essential cmake python3-pip

# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j4

2. Using Quantized Models from Hugging Face

My automated quantization script publishes the quantized models at URLs of this form:

https://huggingface.co/matrixportal/txgemma-2b-predict-GGUF/resolve/main/txgemma-2b-predict-q4_k_m.gguf

Download your quantized model directly:

wget https://huggingface.co/matrixportal/txgemma-2b-predict-GGUF/resolve/main/txgemma-2b-predict-q4_k_m.gguf
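
If you prefer working from Python, the huggingface_hub package can fetch the same file into the local Hugging Face cache. A minimal sketch (assumes pip install huggingface_hub):

from huggingface_hub import hf_hub_download

# Downloads the file into the local cache and returns its path
model_path = hf_hub_download(
    repo_id="matrixportal/txgemma-2b-predict-GGUF",
    filename="txgemma-2b-predict-q4_k_m.gguf",
)
print(model_path)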

3. Running the Quantized Model

Basic usage:

./main -m txgemma-2b-predict-q4_k_m.gguf -p "Your prompt here" -n 128

Example with a creative writing prompt:

./main -m txgemma-2b-predict-q4_k_m.gguf \
       -p "[INST] Write a short poem about AI quantization in the style of Shakespeare [/INST]" \
       -n 256 -c 2048 -t 8 --temp 0.7

Advanced parameters:

./main -m txgemma-2b-predict-q4_k_m.gguf \
       -p "Question: What is the GGUF format?
Answer:" \
       -n 256 -c 2048 -t 8 --temp 0.7 --top-k 40 --top-p 0.9

4. Python Integration

Install the Python package:

pip install llama-cpp-python

Example script:

from llama_cpp import Llama

# Initialize the model
llm = Llama(
    model_path="txgemma-2b-predict-q4_k_m.gguf",
    n_ctx=2048,    # context window size (matches the -c flag)
    n_threads=8    # CPU threads to use (matches the -t flag)
)

# Run inference
response = llm(
    "[INST] Explain GGUF quantization to a beginner [/INST]",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9
)

print(response["choices"][0]["text"])
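
llama-cpp-python can also stream tokens as they are generated, which is handy for interactive use. A minimal self-contained sketch with the same model file:

from llama_cpp import Llama

llm = Llama(
    model_path="txgemma-2b-predict-q4_k_m.gguf",
    n_ctx=2048,
    n_threads=8
)

# stream=True returns an iterator of partial completions instead of a single dict
for chunk in llm(
    "[INST] Explain GGUF quantization to a beginner [/INST]",
    max_tokens=256,
    temperature=0.7,
    stream=True,
):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()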

Performance Tips

  1. Hardware Utilization:

    • Set thread count with -t (typically the number of physical CPU cores)
    • Compile with CUDA/OpenCL support to run layers on the GPU (see the sketch after this list)
  2. Memory Optimization:

    • Lower quantization (like q4_k_m) uses less RAM
    • Adjust context size with -c parameter
  3. Speed/Accuracy Balance:

    • Higher bit quantization is slower but more accurate
    • Reduce randomness with --temp 0 for consistent results
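
A minimal sketch tying these knobs together with llama-cpp-python; n_gpu_layers only takes effect if the package was built with GPU (e.g. CUDA) support:

from llama_cpp import Llama

llm = Llama(
    model_path="txgemma-2b-predict-q4_k_m.gguf",
    n_ctx=2048,       # context size (matches the -c flag)
    n_threads=8,      # CPU threads (matches the -t flag)
    n_gpu_layers=-1,  # offload all layers to the GPU when available; 0 = CPU only
)

# temperature=0 makes generation greedy and repeatable
response = llm("Question: What is the GGUF format?\nAnswer:", max_tokens=64, temperature=0.0)
print(response["choices"][0]["text"])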

FAQ

Q: What quantization levels are available?
A: Common options include q4_0, q4_k_m, q5_0, q5_k_m, q8_0

Q: How much performance loss occurs with q4_k_m?
A: Typically a 2-5% accuracy reduction, while the file is roughly 4x smaller than the FP16 original

Q: How to enable GPU support?
A: Build with make LLAMA_CUBLAS=1 for NVIDIA GPUs (newer llama.cpp releases use the CMake option -DGGML_CUDA=ON instead)

Useful Resources

  1. llama.cpp GitHub
  2. GGUF Format Specs
  3. Hugging Face Model Hub
Model details

  • Format: GGUF
  • Model size: 2.61B params
  • Architecture: gemma2