---
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
inference: true
model_type: llama
quantized_by: robertgshaw2
tags:
- nm-vllm
- marlin
- int4
---

					
						
## TinyLlama-1.1B-Chat-v1.0

This repo contains model files for [TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) optimized for [nm-vllm](https://github.com/neuralmagic/nm-vllm), a high-throughput serving engine for compressed LLMs.

This model was quantized with [GPTQ](https://arxiv.org/abs/2210.17323) and saved in the Marlin format for efficient 4-bit inference. Marlin is a highly optimized inference kernel for 4-bit models.

					
						
## Inference

Install [nm-vllm](https://github.com/neuralmagic/nm-vllm) for fast inference and low memory usage:

```bash
pip install nm-vllm[sparse]
```

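As a quick optional sanity check, confirm the package imports; nm-vllm is imported as the standard `vllm` module, as in the example below. Printing `__version__` assumes that attribute is exposed, as it is in upstream vLLM:

```bash
# Optional: verify the install. nm-vllm is imported as the `vllm` module;
# `__version__` is assumed to be exposed, as in upstream vLLM.
python -c "import vllm; print(vllm.__version__)"
```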
					
						
Run in a Python pipeline for local inference:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Load the Marlin-quantized checkpoint with vLLM.
model_id = "neuralmagic/TinyLlama-1.1B-Chat-v1.0-marlin"
model = LLM(model_id)

# Build a chat-formatted prompt with the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
    {"role": "user", "content": "How to make banana bread?"},
]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate up to 200 new tokens and print the completion.
sampling_params = SamplingParams(max_tokens=200)
outputs = model.generate(formatted_prompt, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

"""
Sure! Here's a simple recipe for banana bread:

Ingredients:
- 3-4 ripe bananas,mashed
- 1 large egg
- 2 Tbsp. Flour
- 2 tsp. Baking powder
- 1 tsp. Baking soda
- 1/2 tsp. Ground cinnamon
- 1/4 tsp. Salt
- 1/2 cup butter, melted
- 3 Cups All-purpose flour
- 1/2 tsp. Ground cinnamon

Instructions:

1. Preheat your oven to 350 F (175 C).
"""
```

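Alternatively, the same checkpoint can be served over an OpenAI-compatible HTTP API. This is a minimal sketch assuming nm-vllm keeps upstream vLLM's `vllm.entrypoints.openai.api_server` entrypoint and default port 8000:

```bash
# Launch an OpenAI-compatible server (assumes upstream vLLM's entrypoint; default port 8000).
python -m vllm.entrypoints.openai.api_server --model neuralmagic/TinyLlama-1.1B-Chat-v1.0-marlin

# Query it with the standard chat completions route.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "neuralmagic/TinyLlama-1.1B-Chat-v1.0-marlin",
        "messages": [{"role": "user", "content": "How to make banana bread?"}],
        "max_tokens": 200
      }'
```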
					
						
## Quantization

For details on how this model was quantized and converted to the Marlin format, see the `quantization/apply_gptq_save_marlin.py` script, which can be run as follows:

```bash
pip install -r quantization/requirements.txt
python3 quantization/apply_gptq_save_marlin.py --model-id TinyLlama/TinyLlama-1.1B-Chat-v1.0 --save-dir ./tinyllama-marlin
```

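For orientation only, the GPTQ step roughly follows the standard AutoGPTQ flow sketched below. This is not the repository's script; the calibration text, 4-bit/group-size-128 settings, and output path are illustrative assumptions, and the repacking into the Marlin serialization format is handled by `apply_gptq_save_marlin.py` itself.

```python
# Illustrative sketch of the GPTQ step using AutoGPTQ (not the repo's script;
# calibration text, 4-bit/group-size-128 settings, and paths are assumptions).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A real run would use a much larger calibration set.
examples = [tokenizer("nm-vllm is a high-throughput serving engine for compressed LLMs.")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

model.quantize(examples)                  # run GPTQ calibration
model.save_quantized("./tinyllama-gptq")  # GPTQ checkpoint, prior to Marlin repacking
tokenizer.save_pretrained("./tinyllama-gptq")
```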
					
						
## Slack

For further support, and to discuss these models and AI in general, join [Neural Magic's Slack Community](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ).