---
tags:
- llama
- gguf
- quantization
license: mit
library_name: llama.cpp
base_model:
- TinyLlama/TinyLlama-1.1B-Chat-v1.0
---

# TinyLlama-1.1B-Chat - Q4_K_M GGUF

This is a GGUF build of [`TinyLlama/TinyLlama-1.1B-Chat-v1.0`](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0), quantized to **Q4_K_M** using [llama.cpp](https://github.com/ggerganov/llama.cpp)'s quantization tool.

## Model Info

- **Base model**: TinyLlama-1.1B-Chat-v1.0
- **Quantization**: Q4_K_M
- **Format**: GGUF (compatible with llama.cpp and llama-cpp-python)
- **Size**: ~500 MB
- **Use case**: Efficient CPU inference for chat and assistant use cases

## 📦 Files

- `TinyLlama-1.1B-Chat-Q4_K_M.gguf`: The quantized model

## How to Use (with `llama-cpp-python`)

```python
from llama_cpp import Llama

llm = Llama(model_path="TinyLlama-1.1B-Chat-Q4_K_M.gguf")

output = llm("Who are you?", max_tokens=128)
print(output)
```

## Credits

- Original model: [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
- Quantized with: [llama.cpp](https://github.com/ggerganov/llama.cpp)
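
## Chat-style usage (sketch)

For multi-turn chat, `llama-cpp-python` also exposes an OpenAI-style `create_chat_completion` API. The sketch below assumes the Zephyr-style chat template that TinyLlama-1.1B-Chat-v1.0 was trained with; passing `chat_format="zephyr"` explicitly is only needed if the template embedded in the GGUF file is not picked up automatically.

```python
from llama_cpp import Llama

# Load the quantized model. chat_format="zephyr" is an assumption based on the
# chat template TinyLlama-1.1B-Chat-v1.0 uses; if the GGUF already embeds its
# template, llama-cpp-python can usually detect it without this argument.
llm = Llama(
    model_path="TinyLlama-1.1B-Chat-Q4_K_M.gguf",
    n_ctx=2048,
    chat_format="zephyr",
)

# OpenAI-style chat request: a list of role/content messages.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what GGUF quantization does in one sentence."},
    ],
    max_tokens=128,
)

# The response mirrors the OpenAI chat completion schema.
print(response["choices"][0]["message"]["content"])
```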