# LLM Model Converter and Quantizer

Large Language Models (LLMs) are typically distributed in formats optimized for training (like PyTorch) and can be extremely large (hundreds of gigabytes), making them impractical for most real-world applications. This tool addresses two critical challenges in LLM deployment:

1. **Size**: Original models are too large to run on consumer hardware
2. **Format**: Training formats are not optimized for inference
## Why This Tool?

I built this tool to help AI researchers and developers with:

- Converting models from Hugging Face to GGUF format (optimized for inference)
- Quantizing models to reduce their size while maintaining acceptable performance
- Making deployment possible on consumer hardware (laptops, desktops) with limited resources
### The Problem

- LLMs in their original format require significant computational resources
- Running these models typically requires:
  - High-end GPUs
  - Large amounts of RAM (32GB+)
  - Substantial storage space
  - Complex software dependencies
### The Solution

This tool provides:

1. **Format Conversion**
   - Converts from PyTorch/Hugging Face format to GGUF
   - GGUF is specifically designed for efficient inference
   - Enables memory mapping for faster loading
   - Reduces dependency requirements
2. **Quantization** (see the sketch after this list)
   - Reduces model size by roughly 4-8x
   - Converts weights from FP16/FP32 to lower-precision formats (INT8/INT4)
   - Maintains reasonable model performance
   - Makes models runnable on consumer-grade hardware
3. **Accessibility**
   - Enables running LLMs on standard laptops
   - Reduces RAM requirements
   - Speeds up model loading and inference
   - Simplifies the deployment process
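
Conceptually, this two-step pipeline mirrors the standard llama.cpp workflow. The sketch below is illustrative rather than a description of this app's internals: it assumes a local checkout of llama.cpp with its binaries on your PATH, the script and binary names (`convert_hf_to_gguf.py`, `llama-quantize`) vary between llama.cpp releases, and the model directory and file names are placeholders.

```bash
# 1. Format conversion: Hugging Face (PyTorch/safetensors) directory -> GGUF at FP16
python llama.cpp/convert_hf_to_gguf.py ./my-hf-model \
  --outfile my-model-f16.gguf --outtype f16

# 2. Quantization: FP16 GGUF -> 4-bit GGUF (roughly 4x smaller)
llama-quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M
```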
## Purpose

This tool helps developers and researchers to:

- Download LLMs from Hugging Face Hub
- Convert models to GGUF (GPT-Generated Unified Format)
- Quantize models for efficient deployment
- Upload processed models back to Hugging Face
## Features

- **Model Download**: Direct integration with Hugging Face Hub (see the CLI sketch below)
- **GGUF Conversion**: Convert PyTorch models to GGUF format
- **Quantization Options**: Support for various quantization levels
- **Batch Processing**: Automate the entire conversion pipeline
- **HF Upload**: Option to upload processed models back to Hugging Face
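
The Streamlit UI drives the download and upload steps for you. For reference, roughly equivalent commands with a recent huggingface_hub CLI look like the following sketch; the repo IDs and file names are examples, not defaults used by the app.

```bash
# Authenticate once (required for gated models and for uploading)
huggingface-cli login

# Download a model repository to a local directory
huggingface-cli download mistralai/Mistral-7B-v0.1 --local-dir ./Mistral-7B-v0.1

# After conversion and quantization, upload the GGUF file to your own repo
huggingface-cli upload your-username/Mistral-7B-v0.1-GGUF ./my-model-Q4_K_M.gguf
```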
## Quantization Types Overview

| **Quantization Type** | **Description** | **Benefits** | **When to Use** |
|-----------------------|-----------------|--------------|-----------------|
| **Q2_K** | 2-bit K-quant | Smallest files and lowest memory use, but the largest quality loss | Severely memory-constrained environments |
| **Q3_K_S** | 3-bit K-quant, small variant | Very small files with noticeable quality loss | When size matters more than output quality |
| **Q3_K_M** | 3-bit K-quant, medium variant | Better quality than Q3_K_S at a slightly larger size | When moderate quality and strong size reduction are both needed |
| **Q3_K_L** | 3-bit K-quant, large variant | Best quality of the 3-bit family, largest 3-bit files | When 4-bit files are still too large for the target hardware |
| **Q4_0** | Legacy 4-bit quantization | Small files with noticeable quality loss | Mostly superseded by Q4_K_M; kept for compatibility |
| **Q4_1** | Legacy 4-bit quantization with higher accuracy | Slightly better quality than Q4_0 at a slightly larger size | Mostly superseded by the K-quants; kept for compatibility |
| **Q4_K_S** | 4-bit K-quant, small variant | Good size/quality balance, smaller than Q4_K_M | Constrained hardware that cannot fit Q4_K_M |
| **Q4_K_M** | 4-bit K-quant, medium variant | Strong size reduction with modest quality loss; a common default | General-purpose deployment on consumer hardware |
| **Q5_0** | Legacy 5-bit quantization | Higher quality than 4-bit formats at a larger size | Mostly superseded by Q5_K_M; kept for compatibility |
| **Q5_1** | Legacy 5-bit quantization with higher accuracy | Slightly better quality than Q5_0, larger files | Mostly superseded by the K-quants; kept for compatibility |
| **Q5_K_S** | 5-bit K-quant, small variant | Low quality loss at moderate size | When quality matters and memory allows more than 4-bit |
| **Q5_K_M** | 5-bit K-quant, medium variant | Very low quality loss | High-quality inference within moderate memory limits |
| **Q6_K** | 6-bit K-quant | Quality close to Q8_0 at a smaller size | When near-original quality is needed and space allows |
| **Q8_0** | 8-bit quantization | Near-original quality; the largest of the quantized formats | When memory is plentiful and maximum quantized quality is wanted |
| **BF16** | 16-bit brain floating point (not quantized) | Preserves the training dynamic range at half the size of F32 | Exporting models trained in bfloat16 without precision loss |
| **F16** | 16-bit floating point (not quantized) | Half-precision baseline at half the size of F32 | As the intermediate GGUF before further quantization |
| **F32** | 32-bit floating point (not quantized) | Highest precision and the largest files | Only when exact full precision is required |
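
To see these trade-offs concretely, you can produce several quantization levels from the same FP16 GGUF and compare the resulting file sizes. This is a sketch using llama.cpp's `llama-quantize` binary; the file names are placeholders carried over from the conversion example above.

```bash
# Generate several quantization levels from one FP16 GGUF
for qtype in Q2_K Q4_K_M Q5_K_M Q8_0; do
  llama-quantize my-model-f16.gguf "my-model-${qtype}.gguf" "${qtype}"
done

# Smaller files trade output quality for lower memory use
ls -lh my-model-*.gguf
```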
## Why GGUF?

GGUF (GPT-Generated Unified Format) is a file format designed specifically for efficient deployment and inference of large language models. Its key benefits:
### Optimized for Inference

- GGUF is designed for model inference (running predictions) rather than training.
- It is the native format used by llama.cpp, a popular framework for running LLMs on consumer hardware.

### Memory Efficiency

- Reduces memory usage compared to the original PyTorch/Hugging Face formats.
- Allows running larger models on devices with limited RAM.
- Supports various quantization levels (reducing model precision from FP16/FP32 to INT8/INT4).

### Faster Loading

- GGUF files can be memory-mapped (mmap), so they are loaded into memory partially, as needed.
- Reduces initial loading time and memory overhead.

### Cross-Platform Compatibility

- Works well across different operating systems and hardware.
- Does not require a Python or PyTorch installation.
- Runs effectively on CPU-only systems (see the example below).

### Embedded Metadata

- Contains the model configuration, tokenizer, and other necessary information in a single file.
- Makes deployment simpler because everything required is bundled together.
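
As an illustration of the CPU-only, single-file point above, a quantized GGUF can be run directly with llama.cpp, with no Python environment involved. This is a sketch: the binary is called `llama-cli` in recent llama.cpp releases (`main` in older ones), and the model file name is a placeholder.

```bash
# Run a quantized GGUF on a CPU-only machine with llama.cpp
llama-cli -m my-model-Q4_K_M.gguf \
  -p "Explain GGUF in one sentence." -n 64
```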
## Installation

```bash
# Clone the repository
git clone https://github.com/bhaskatripathi/LLM_Quantization
cd LLM_Quantization

# Install dependencies
pip install -r requirements.txt
```
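
Depending on how the app invokes the conversion and quantization steps, a local build of llama.cpp may also be needed for its conversion script and `llama-quantize` binary. This is an assumption rather than a documented requirement of this repository; a typical CMake-based build looks like:

```bash
# Optional: build llama.cpp for convert_hf_to_gguf.py and llama-quantize
git clone https://github.com/ggerganov/llama.cpp
pip install -r llama.cpp/requirements.txt   # Python deps for the conversion script
cmake -S llama.cpp -B llama.cpp/build
cmake --build llama.cpp/build --config Release
# The resulting binaries land under llama.cpp/build/bin/
```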
## Usage

```bash
# Run the Streamlit application
streamlit run app.py
```
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## License

This project is licensed under the MIT License.
## Requirements

- Python 3.8+
- Streamlit
- Hugging Face Hub account (for model download/upload)
- Sufficient storage space for model processing
## Supported Models

The tool currently supports various model architectures, including:

- DeepSeek models
- Mistral models
- Llama models
- Qwen models
- And more...
## Need Help?

If you encounter any issues or have questions:

1. Check the existing issues
2. Create a new issue with a detailed description
3. Include relevant error messages and environment details
## Acknowledgments

- [Hugging Face](https://huggingface.co/) for the model hub
- [llama.cpp](https://github.com/ggerganov/llama.cpp) for the GGUF format implementation
- All contributors and maintainers
---

Made with ❤️ for the AI community