# utkmst/chimera-beta-test2-lora-merged-Q4_K_M-GGUF
## Model Description

This is a quantized GGUF version of my fine-tuned model `utkmst/chimera-beta-test2-lora-merged`, which was created by LoRA fine-tuning meta-llama/Llama-3.1-8B-Instruct and merging the resulting adapter back into the base model. The GGUF conversion was performed with llama.cpp using Q4_K_M quantization for efficient inference.
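For reference, the merge and conversion steps follow the standard PEFT plus llama.cpp workflow. The sketch below is illustrative rather than the exact script used; the adapter path and output directory are hypothetical.

```python
# Illustrative sketch of the adapter-merge step (not the exact script used).
# The adapter path and output directory are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
merged = PeftModel.from_pretrained(base, "path/to/lora-adapter").merge_and_unload()

merged.save_pretrained("chimera-beta-test2-lora-merged")
AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").save_pretrained(
    "chimera-beta-test2-lora-merged"
)

# The GGUF conversion was then done with llama.cpp, roughly:
#   python convert_hf_to_gguf.py chimera-beta-test2-lora-merged --outtype f16
#   ./llama-quantize <f16 gguf> chimera-beta-test2-lora-merged-q4_k_m.gguf Q4_K_M
```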
## Architecture

- Base Model: meta-llama/Llama-3.1-8B-Instruct
- Size: 8.03B parameters
- Type: Decoder-only transformer
- Quantization: Q4_K_M GGUF format (4-bit k-quant, medium variant)
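If you want to verify the quantization type and other metadata of the downloaded file, the `gguf` Python package (published from llama.cpp's `gguf-py`) can read the header. A minimal sketch, assuming the file has already been downloaded locally:

```python
# Minimal sketch: list GGUF metadata keys and tensor count (pip install gguf).
# The local filename is an assumption; adjust to wherever you saved the model.
from gguf import GGUFReader

reader = GGUFReader("chimera-beta-test2-lora-merged-q4_k_m.gguf")
print(f"{len(reader.tensors)} tensors")
for key in reader.fields:  # metadata keys, e.g. general.architecture
    print(key)
```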
## Training Details

- Training Method: LoRA fine-tuning followed by adapter merging
- LoRA Configuration:
  - Rank: 8
  - Alpha: 16
  - Trainable modules: attention layers and feed-forward networks
- Training Hyperparameters (see the sketch after this list):
  - Learning rate: 2e-4
  - Batch size: 2
  - Training epochs: 1
  - Optimizer: AdamW with a constant learning-rate scheduler
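The sketch below shows roughly how this configuration maps onto the peft and transformers APIs. The target module names and any options not listed above are assumptions rather than the exact recipe used.

```python
# Illustrative LoRA/training setup matching the values listed above.
# Target modules and unlisted options are assumptions, not the exact recipe.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=8,                      # rank
    lora_alpha=16,            # alpha
    target_modules=[          # attention + feed-forward projections (assumed names)
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="chimera-beta-test2-lora",
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    num_train_epochs=1,
    optim="adamw_torch",
    lr_scheduler_type="constant",
)
```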
## Dataset

The model was trained on a curated mixture of high-quality instruction datasets:
- OpenAssistant/oasst1: Human-generated conversations with AI assistants
- databricks/databricks-dolly-15k: Instruction-following examples
- Open-Orca/OpenOrca: Augmented training data based on GPT-4 generations
- mlabonne/open-perfectblend: A carefully balanced blend of open-source instruction data
- tatsu-lab/alpaca: Self-instructed data based on demonstrations
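As a rough illustration of how such a mixture can be assembled with the datasets library; the split names, sample counts, and field normalization below are assumptions, not the exact preprocessing used (the remaining datasets would be handled analogously):

```python
# Rough illustration only: load and concatenate instruction datasets.
# Splits, sample counts, and column handling are assumptions.
from datasets import load_dataset, concatenate_datasets

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
alpaca = load_dataset("tatsu-lab/alpaca", split="train")
orca = load_dataset("Open-Orca/OpenOrca", split="train[:10000]")  # subsample; it is large

def to_prompt_response(example):
    # Normalize the differing schemas to a single prompt/response pair (simplified).
    prompt = example.get("instruction") or example.get("question") or ""
    response = example.get("response") or example.get("output") or ""
    return {"prompt": prompt, "response": response}

mixture = concatenate_datasets([
    d.map(to_prompt_response, remove_columns=d.column_names)
    for d in (dolly, alpaca, orca)
])
print(mixture)
```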
## Intended Use

This model is designed for:
- General purpose assistant capabilities
- Question answering and knowledge retrieval
- Creative content generation
- Instructional guidance
Thanks to Q4_K_M quantization, it is well suited to deployment in resource-constrained environments while maintaining good response quality.
## Limitations
- Reduced numerical precision due to quantization may impact performance on certain mathematical or precise reasoning tasks
- Base model limitations including potential hallucinations and factual inaccuracies
- Limited context window compared to larger models
- Knowledge cutoff from the base Llama-3.1 model
- May exhibit biases present in training data
## Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux):

```bash
brew install llama.cpp
```

Invoke the llama.cpp server or the CLI.

### CLI:

```bash
llama-cli --hf-repo utkmst/chimera-beta-test2-lora-merged-Q4_K_M-GGUF --hf-file chimera-beta-test2-lora-merged-q4_k_m.gguf -p "The meaning to life and the universe is"
```

### Server:

```bash
llama-server --hf-repo utkmst/chimera-beta-test2-lora-merged-Q4_K_M-GGUF --hf-file chimera-beta-test2-lora-merged-q4_k_m.gguf -c 2048
```
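Once the server is running, it exposes an OpenAI-compatible endpoint (default port 8080) that any HTTP client can query. A minimal sketch using Python's requests; adjust the host and port if you started the server with different `--host`/`--port` values:

```python
# Minimal sketch: query the running llama-server via its OpenAI-compatible endpoint.
# Assumes the default host and port.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "Explain LoRA fine-tuning in two sentences."}
        ],
        "max_tokens": 128,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```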
Note: You can also use this checkpoint directly through the usage steps listed in the llama.cpp repo.

Step 1: Clone llama.cpp from GitHub.

```bash
git clone https://github.com/ggerganov/llama.cpp
```

Step 2: Move into the llama.cpp folder and build it with the LLAMA_CURL=1 flag along with other hardware-specific flags (for example, LLAMA_CUDA=1 for Nvidia GPUs on Linux).

```bash
cd llama.cpp && LLAMA_CURL=1 make
```

Step 3: Run inference through the main binary.

```bash
./llama-cli --hf-repo utkmst/chimera-beta-test2-lora-merged-Q4_K_M-GGUF --hf-file chimera-beta-test2-lora-merged-q4_k_m.gguf -p "The meaning to life and the universe is"
```

or

```bash
./llama-server --hf-repo utkmst/chimera-beta-test2-lora-merged-Q4_K_M-GGUF --hf-file chimera-beta-test2-lora-merged-q4_k_m.gguf -c 2048
```
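Alternatively, if you prefer to stay in Python, the llama-cpp-python bindings can pull the GGUF file straight from this repo. A minimal sketch; the context size and sampling settings are just examples:

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python huggingface-hub).
# Context size and generation parameters are illustrative.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="utkmst/chimera-beta-test2-lora-merged-Q4_K_M-GGUF",
    filename="chimera-beta-test2-lora-merged-q4_k_m.gguf",
    n_ctx=2048,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "The meaning to life and the universe is"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```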