# LLM Model Converter and Quantizer
Large Language Models (LLMs) are typically distributed in formats optimized for training (like PyTorch) and can be extremely large (hundreds of gigabytes), making them impractical for most real-world applications. This tool addresses two critical challenges in LLM deployment:
1. **Size**: Original models are too large to run on consumer hardware
2. **Format**: Training formats are not optimized for inference
## Why This Tool?
I built this tool to help AI researchers achieve the following:
- Converting models from Hugging Face to GGUF format (optimized for inference)
- Quantizing models to reduce their size while maintaining acceptable performance
- Making deployment possible on consumer hardware (laptops, desktops) with limited resources
### The Problem
- LLMs in their original format require significant computational resources
- Running these models typically needs:
  - High-end GPUs
  - Large amounts of RAM (32GB+)
  - Substantial storage space
  - Complex software dependencies
### The Solution
This tool provides:
1. **Format Conversion**
   - Converts from PyTorch/Hugging Face format to GGUF
   - GGUF is specifically designed for efficient inference
   - Enables memory mapping for faster loading
   - Reduces dependency requirements
2. **Quantization** (a rough size estimate follows this list)
   - Reduces model size by roughly 4-8x
   - Converts weights from FP16/FP32 to more efficient low-bit formats (e.g., 8-bit or 4-bit)
   - Maintains reasonable model performance
   - Makes models runnable on consumer-grade hardware
3. **Accessibility**
   - Enables running LLMs on standard laptops
   - Reduces RAM requirements
   - Speeds up model loading and inference
   - Simplifies deployment process
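To put the 4-8x figure in perspective, here is a minimal, weights-only size estimate for a hypothetical 7-billion-parameter model at different bit widths. It ignores per-block scales and file metadata, so real GGUF files are somewhat larger, but the relative savings are representative.
```python
# Rough, weights-only size estimate for a hypothetical 7B-parameter model.
# Real GGUF files are larger: each quantization block also stores scale/offset
# values, and the file carries embedded metadata.
PARAMS = 7_000_000_000

for name, bits in [("F32", 32), ("F16", 16), ("Q8_0", 8), ("Q4_K_M", 4), ("Q2_K", 2)]:
    gigabytes = PARAMS * bits / 8 / 1024**3
    print(f"{name:7s} ~{gigabytes:5.1f} GiB")
```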
## Purpose
This tool helps developers and researchers to:
- Download LLMs from Hugging Face Hub
- Convert models to GGUF (GPT-Generated Unified Format)
- Quantize models for efficient deployment
- Upload processed models back to Hugging Face
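As an illustration of the download step, the sketch below pulls a model snapshot from the Hugging Face Hub with the official `huggingface_hub` client. The repository ID and local directory are placeholders, and the app itself may organize this step differently.
```python
# Minimal sketch of the download step using the huggingface_hub client.
# The repo_id and local_dir are placeholders, not values used by the app.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="org-name/model-name",        # hypothetical repository ID
    local_dir="./downloads/model-name",
)
print(f"Model files downloaded to {local_path}")
```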
## Features
- **Model Download**: Direct integration with Hugging Face Hub
- **GGUF Conversion**: Convert PyTorch models to GGUF format
- **Quantization Options**: Support for various quantization levels
- **Batch Processing**: Automate the entire conversion pipeline
- **HF Upload**: Option to upload processed models back to Hugging Face
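The HF Upload feature corresponds to pushing the resulting GGUF file back to the Hub. Below is a minimal sketch using `huggingface_hub`; the repository and file names are placeholders, and this is not necessarily how the app implements the step internally.
```python
# Minimal sketch of uploading a quantized GGUF file back to the Hugging Face Hub.
# Repository and file names are placeholders. A write-capable access token must
# be configured (e.g. via `huggingface-cli login`), and the target repository
# must already exist (it can be created with HfApi.create_repo).
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="./output/model-Q4_K_M.gguf",   # hypothetical local path
    path_in_repo="model-Q4_K_M.gguf",
    repo_id="your-username/model-name-GGUF",        # hypothetical target repo
    repo_type="model",
)
```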
## Quantization Types Overview
| **Quantization Type** | **Purpose** | **Benefits** | **When to Use** |
|-----------------------|-------------|--------------|-----------------|
| **Q2_K** | 2-bit k-quant | Smallest files and lowest memory usage, with a noticeable quality loss | Highly memory-constrained environments |
| **Q3_K_L** | 3-bit k-quant, large variant | Best quality of the 3-bit variants, at a slightly larger size | When 3-bit size is needed with as little quality loss as possible |
| **Q3_K_M** | 3-bit k-quant, medium variant | Good balance of size and quality within the 3-bit family | When moderate quality and strong size reduction are both desired |
| **Q3_K_S** | 3-bit k-quant, small variant | Smallest of the 3-bit variants | When size matters more than quality |
| **Q4_0** | Legacy 4-bit quantization | Substantial size reduction with acceptable quality loss | When memory is limited; largely superseded by the 4-bit k-quants |
| **Q4_1** | Legacy 4-bit quantization with a per-block offset | Slightly better quality than Q4_0 at a slightly larger size | When a small quality gain over Q4_0 is worth the extra size |
| **Q4_K_M** | 4-bit k-quant, medium variant | Widely used balance of size and quality | A good default for consumer-hardware deployment |
| **Q4_K_S** | 4-bit k-quant, small variant | Smaller than Q4_K_M with a modest quality penalty | When both small size and reasonable quality are needed |
| **Q5_0** | Legacy 5-bit quantization | Higher quality than the 4-bit formats at a larger size | When memory allows a larger file for better quality |
| **Q5_1** | Legacy 5-bit quantization with a per-block offset | Slightly better quality than Q5_0, slightly larger | When quality is prioritized over the additional memory |
| **Q5_K_M** | 5-bit k-quant, medium variant | Very low quality loss at a moderate size | When quality is crucial but full precision is too large |
| **Q5_K_S** | 5-bit k-quant, small variant | Near Q5_K_M quality at a slightly smaller size | High-quality output under moderate memory limits |
| **Q6_K** | 6-bit k-quant | Quality close to Q8_0 at a smaller size | When quality is critical and space is less constrained |
| **Q8_0** | 8-bit quantization | Nearly lossless quality; the largest of the quantized formats | When quality matters most and the extra size is acceptable |
| **BF16** | 16-bit brain floating point (no low-bit quantization) | FP32's dynamic range at half the storage | When near-full precision is needed and the hardware supports BF16 |
| **F16** | 16-bit floating point (no low-bit quantization) | Good precision at half the size of FP32 | When maintaining high precision is essential |
| **F32** | 32-bit floating point (no quantization) | Highest precision and the largest files | When maximum precision is required, e.g. as a conversion baseline |
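For orientation, the sketch below shows how a convert-then-quantize step might be scripted around llama.cpp. The script and binary names (`convert_hf_to_gguf.py`, `llama-quantize`) vary between llama.cpp releases, and all paths are placeholders rather than the exact commands this app runs.
```python
# Sketch of a convert-then-quantize pipeline built around llama.cpp.
# Script/binary names differ between llama.cpp releases (older versions ship
# `convert-hf-to-gguf.py` and `quantize`), and every path here is a placeholder.
import subprocess

hf_model_dir = "./downloads/model-name"        # directory with the HF checkpoint
f16_gguf = "./output/model-F16.gguf"
quantized_gguf = "./output/model-Q4_K_M.gguf"

# Step 1: convert the Hugging Face checkpoint to a full-precision GGUF file.
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", hf_model_dir,
     "--outfile", f16_gguf, "--outtype", "f16"],
    check=True,
)

# Step 2: quantize the GGUF file to one of the types from the table above.
subprocess.run(
    ["llama.cpp/llama-quantize", f16_gguf, quantized_gguf, "Q4_K_M"],
    check=True,
)
```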
## Why GGUF?
GGUF (GPT-Generated Unified Format) is a file format designed specifically for efficient deployment and inference of large language models, and it offers several advantages over training-oriented formats.
## Key Benefits of GGUF:
### Optimized for Inference:
- GGUF is specifically designed for model inference (running predictions) rather than training.
- It's the native format used by llama.cpp, a popular framework for running LLMs on consumer hardware.
### Memory Efficiency:
- Reduces memory usage compared to the original PyTorch/Hugging Face formats.
- Allows running larger models on devices with limited RAM.
- Supports various quantization levels (reducing model precision from FP16/FP32 to INT8/INT4).
### Faster Loading:
- Models in GGUF format can be memory-mapped (mmap), meaning they can be loaded partially as needed.
- Reduces initial loading time and memory overhead.
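To make the memory-mapping point concrete, the sketch below maps a GGUF file read-only with Python's standard `mmap` module; the filename is a placeholder. The operating system pages data in only as it is touched, so reading the first few bytes does not load the whole file.
```python
# Sketch: memory-map a GGUF file so the OS loads pages only as they are accessed.
# The filename is a placeholder. Every valid GGUF file begins with the ASCII
# magic bytes b"GGUF".
import mmap

with open("model-Q4_K_M.gguf", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        print(mm[:4])  # b'GGUF' -- touching 4 bytes does not load the whole file
```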
### Cross-Platform Compatibility:
- Works well across different operating systems and hardware.
- Doesn't require Python or PyTorch installation.
- Can run on CPU-only systems effectively.
### Embedded Metadata:
- Contains model configuration, tokenizer, and other necessary information in a single file.
- Makes deployment simpler as all required information is bundled together.
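As an example of the bundled metadata, the sketch below reads the fixed-size GGUF header (magic, format version, tensor count, and metadata key/value count) with Python's `struct` module, following the published little-endian GGUF layout. The filename is a placeholder; the `gguf` Python package provides a higher-level reader if you need the full key/value contents.
```python
# Sketch: read the fixed GGUF header to see how much metadata a file bundles.
# Header layout (little-endian): 4-byte magic "GGUF", uint32 version,
# uint64 tensor count, uint64 metadata key/value count.
import struct

with open("model-Q4_K_M.gguf", "rb") as f:   # placeholder filename
    magic, version, n_tensors, n_kv = struct.unpack("<4sIQQ", f.read(24))

print(f"magic={magic!r} version={version} tensors={n_tensors} metadata_kv={n_kv}")
```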
## Installation
```bash
# Clone the repository
git clone https://github.com/bhaskatripathi/LLM_Quantization
cd LLM_Quantization

# Install dependencies
pip install -r requirements.txt
```
## Usage
```bash
# Run the Streamlit application
streamlit run app.py
```
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## License
This project is licensed under the MIT License.
## Requirements
- Python 3.8+
- Streamlit
- Hugging Face Hub account (for model download/upload)
- Sufficient storage space for model processing
## Supported Models
The tool currently supports various model architectures including:
- DeepSeek models
- Mistral models
- Llama models
- Qwen models
- And more...
## Need Help?
If you encounter any issues or have questions:
1. Check the existing issues
2. Create a new issue with a detailed description
3. Include relevant error messages and environment details
## Acknowledgments
- [Hugging Face](https://huggingface.co/) for the model hub
- [llama.cpp](https://github.com/ggerganov/llama.cpp) for the GGUF format implementation
- All contributors and maintainers
---
Made with ❤️ for the AI community