---
base_model: google/txgemma-2b-predict
language:
  - en
library_name: transformers
license: other
license_name: health-ai-developer-foundations
license_link: https://developers.google.com/health-ai-developer-foundations/terms
pipeline_tag: text-generation
tags:
  - therapeutics
  - drug-development
  - llama-cpp
  - matrixportal
extra_gated_heading: Access TxGemma on Hugging Face
extra_gated_prompt: >-
  To access TxGemma on Hugging Face, you're required to review and agree to
  [Health AI Developer Foundation's terms of
  use](https://developers.google.com/health-ai-developer-foundations/terms). To
  do this, please ensure you're logged in to Hugging Face and click below.
  Requests are processed immediately.
extra_gated_button_content: Acknowledge license
---

# matrixportal/txgemma-2b-predict-GGUF

This model was converted to GGUF format from [google/txgemma-2b-predict](https://huggingface.co/google/txgemma-2b-predict) using llama.cpp via ggml.ai's all-gguf-same-where space. Refer to the [original model card](https://huggingface.co/google/txgemma-2b-predict) for more details on the model.

## ✅ Quantized Models Download List

πŸ” Recommended Quantizations

- ✨ **General CPU use:** Q4_K_M (best balance of speed and quality)
- 📱 **ARM devices:** Q4_0 (optimized for ARM CPUs)
- 🏆 **Maximum quality:** Q8_0 (near-original quality)

### 📦 Full Quantization Options

| 🚀 Download | 🔒 Type | 📝 Notes |
|-------------|---------|----------|
| Download | Q2_K | Basic quantization |
| Download | Q3_K_S | Small size |
| Download | Q3_K_M | Balanced quality |
| Download | Q3_K_L | Better quality |
| Download | Q4_0 | Fast on ARM |
| Download | Q4_K_S | Fast, recommended |
| Download | Q4_K_M | ⭐ Best balance |
| Download | Q5_0 | Good quality |
| Download | Q5_K_S | Balanced |
| Download | Q5_K_M | High quality |
| Download | Q6_K | 🏆 Very good quality |
| Download | Q8_0 | ⚡ Fast, best quality |
| Download | F16 | Maximum accuracy |

💡 **Tip:** Use F16 for maximum precision when quality is critical.

## GGUF Model Quantization & Usage Guide with llama.cpp

### What is GGUF and Quantization?

GGUF (GPT-Generated Unified Format) is an efficient model file format developed by the llama.cpp team that:

- Supports multiple quantization levels
- Works cross-platform
- Enables fast loading and inference
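
Because GGUF files are self-describing, you can inspect a file's metadata without loading the model. Here is a minimal sketch using the `gguf` Python package maintained in the llama.cpp repo (`pip install gguf`); the file name assumes the Q4_K_M download from this repo:

```python
from gguf import GGUFReader

# Open the GGUF file and read its header without loading any weights
reader = GGUFReader("txgemma-2b-predict-q4_k_m.gguf")

# List the key/value metadata fields stored in the header
for key in reader.fields:
    print(key)

# The tensor count is a quick sanity check that the file is intact
print(f"{len(reader.tensors)} tensors")
```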

Quantization converts model weights to lower-precision data types (e.g., 4-bit integers instead of 32-bit floats) to:

- Reduce model size
- Decrease memory usage
- Speed up inference
- (with minor accuracy trade-offs)
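
To make the trade-off concrete, here is a rough back-of-the-envelope size estimate for a ~2B-parameter model. The bits-per-weight figures are approximations (real GGUF files keep some tensors at higher precision and add metadata), so treat the output as ballpark only:

```python
# Approximate file sizes for ~2B parameters at different precisions.
# Bits-per-weight values are rough averages, not exact GGUF figures.
PARAMS = 2e9

for name, bits_per_weight in [("F32", 32), ("F16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
    size_gb = PARAMS * bits_per_weight / 8 / 1e9
    print(f"{name:>7}: ~{size_gb:.1f} GB")
```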

### Step-by-Step Guide

#### 1. Prerequisites

```bash
# System updates
sudo apt update && sudo apt upgrade -y

# Dependencies
sudo apt install -y build-essential cmake python3-pip

# Clone and build llama.cpp (recent versions build with CMake;
# binaries land in build/bin/)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j4
```

#### 2. Using Quantized Models from Hugging Face

My automated quantization script produces models in this format:

https://huggingface.co/matrixportal/txgemma-2b-predict-GGUF/resolve/main/txgemma-2b-predict-q4_k_m.gguf

Download your quantized model directly:

```bash
wget https://huggingface.co/matrixportal/txgemma-2b-predict-GGUF/resolve/main/txgemma-2b-predict-q4_k_m.gguf
```
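
If you prefer Python, the same file can be fetched with the `huggingface_hub` client, which also caches downloads (a sketch; assumes `pip install huggingface_hub`):

```python
from huggingface_hub import hf_hub_download

# Downloads into the local Hugging Face cache and returns the file path
path = hf_hub_download(
    repo_id="matrixportal/txgemma-2b-predict-GGUF",
    filename="txgemma-2b-predict-q4_k_m.gguf",
)
print(path)
```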

#### 3. Running the Quantized Model

Basic usage:

```bash
./build/bin/llama-cli -m txgemma-2b-predict-q4_k_m.gguf -p "Your prompt here" -n 128
```

Example with a creative writing prompt:

```bash
./build/bin/llama-cli -m txgemma-2b-predict-q4_k_m.gguf \
  -p "Write a short poem about AI quantization in the style of Shakespeare" \
  -n 256 -c 2048 -t 8 --temp 0.7
```

Advanced parameters:

```bash
./build/bin/llama-cli -m txgemma-2b-predict-q4_k_m.gguf \
  -p "Question: What is the GGUF format?
Answer:" \
  -n 256 -c 2048 -t 8 --temp 0.7 --top-k 40 --top-p 0.9
```

#### 4. Python Integration

Install the Python package:

```bash
pip install llama-cpp-python
```

Example script:

```python
from llama_cpp import Llama

# Initialize the model
llm = Llama(
    model_path="txgemma-2b-predict-q4_k_m.gguf",
    n_ctx=2048,     # context window size
    n_threads=8     # CPU threads to use
)

# Run inference
response = llm(
    "Explain GGUF quantization to a beginner",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9
)

print(response["choices"][0]["text"])
```
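
The call above returns the whole completion at once. Passing `stream=True` instead yields chunks as tokens are generated, which is handy for interactive output; this reuses the `llm` object from the script above:

```python
# Streaming variant: print tokens as they arrive instead of waiting
for chunk in llm(
    "Explain GGUF quantization to a beginner",
    max_tokens=256,
    temperature=0.7,
    stream=True,
):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()
```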

### Performance Tips

1. **Hardware utilization:**
   - Set the thread count with `-t` (typically your CPU core count)
   - Compile with a GPU backend (e.g., CUDA) for GPU acceleration
2. **Memory optimization:**
   - Lower-bit quantizations (like Q4_K_M) use less RAM
   - Adjust the context size with the `-c` parameter
3. **Speed/accuracy balance:**
   - Higher-bit quantizations are slower but more accurate
   - Reduce randomness with `--temp 0` for consistent results (a quick way to measure throughput is sketched below)
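
To see how these settings affect throughput on your hardware, you can time a generation and read the token counts that llama-cpp-python reports in the response's `usage` field (a minimal sketch; assumes the Q4_K_M file from step 2 is in the working directory):

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="txgemma-2b-predict-q4_k_m.gguf", n_ctx=2048, n_threads=8)

start = time.perf_counter()
response = llm("Explain GGUF quantization to a beginner", max_tokens=128, temperature=0.0)
elapsed = time.perf_counter() - start

# The response reports how many tokens were actually generated
generated = response["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```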

### FAQ

Q: What quantization levels are available?
A: Common options include Q4_0, Q4_K_M, Q5_0, Q5_K_M, and Q8_0.

Q: How much performance loss occurs with Q4_K_M?
A: Typically a 2-5% accuracy reduction, in exchange for a roughly 4x smaller file than F16.

Q: How do I enable GPU support?
A: Build with `cmake -B build -DGGML_CUDA=ON` and then `cmake --build build --config Release` for NVIDIA GPUs.
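
On the Python side, a GPU-enabled build of llama-cpp-python exposes the same offload through the `n_gpu_layers` constructor argument (a sketch; requires installing the package with a GPU backend compiled in):

```python
from llama_cpp import Llama

# n_gpu_layers=-1 offloads all layers to the GPU; use a smaller
# number to split the model between GPU and CPU memory.
llm = Llama(
    model_path="txgemma-2b-predict-q4_k_m.gguf",
    n_ctx=2048,
    n_gpu_layers=-1,
)
print(llm("Question: What is the GGUF format?\nAnswer:", max_tokens=64)["choices"][0]["text"])
```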

### Useful Resources

1. [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
2. [GGUF Format Specs](https://github.com/ggml-org/ggml/blob/master/docs/gguf.md)
3. [Hugging Face Model Hub](https://huggingface.co/models)