---
tags:
- llama
- gguf
- quantization
license: mit
library_name: llama.cpp
base_model:
- TinyLlama/TinyLlama-1.1B-Chat-v1.0
---

# TinyLlama-1.1B-Chat - Q4_K_M GGUF

This is a GGUF build of [`TinyLlama/TinyLlama-1.1B-Chat-v1.0`](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0), quantized to **Q4_K_M** using [llama.cpp](https://github.com/ggerganov/llama.cpp)'s quantization tool.

## Model Info

- **Base model**: TinyLlama-1.1B-Chat-v1.0
- **Quantization**: Q4_K_M
- **Format**: GGUF (compatible with llama.cpp and llama-cpp-python)
- **Size**: ~500 MB
- **Use case**: Efficient CPU inference for chat and assistant use cases

## 📦 Files

- `TinyLlama-1.1B-Chat-Q4_K_M.gguf`: The quantized model

## How to Use (with `llama-cpp-python`)

```python
from llama_cpp import Llama

llm = Llama(model_path="TinyLlama-1.1B-Chat-Q4_K_M.gguf")

output = llm("Who are you?", max_tokens=128)
print(output)
```

## Credits

- Original model: [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
- Quantized with: [llama.cpp](https://github.com/ggerganov/llama.cpp)
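
## Chat-style usage (sketch)

For multi-turn chat, `llama-cpp-python` also exposes an OpenAI-style `create_chat_completion` API. The sketch below assumes the Zephyr-style chat template that TinyLlama-1.1B-Chat-v1.0 was trained with; passing `chat_format="zephyr"` explicitly is only needed if the template embedded in the GGUF file is not picked up automatically.

```python
from llama_cpp import Llama

# Load the quantized model. chat_format="zephyr" is an assumption based on the
# chat template TinyLlama-1.1B-Chat-v1.0 uses; if the GGUF already embeds its
# template, llama-cpp-python can usually detect it without this argument.
llm = Llama(
    model_path="TinyLlama-1.1B-Chat-Q4_K_M.gguf",
    n_ctx=2048,
    chat_format="zephyr",
)

# OpenAI-style chat request: a list of role/content messages.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what GGUF quantization does in one sentence."},
    ],
    max_tokens=128,
)

# The response mirrors the OpenAI chat completion schema.
print(response["choices"][0]["message"]["content"])
```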