---
license: apache-2.0
train: false
inference: false
pipeline_tag: text-generation
---
This is an HQQ all-4-bit (group-size=128) quantized Qwen2.5-7B-Instruct model, quantized via TorchAO with GemLite as the backend.

## Usage
First, install the dependencies:
```
pip install torchao
pip install git+https://github.com/mobiusml/gemlite.git
```
To run the vLLM example below, you also need vLLM installed (`pip install vllm`).

Then you can use the sample code below:
```Python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mobiuslabsgmbh/Qwen2.5-7B-Instruct_gemlite-ao_a16w4_gs_128_pack_32bit"

# The quantization settings are stored with the checkpoint, so loading
# the pre-quantized weights requires no extra configuration.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map='cuda',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

Use in vLLM:
```Python
from vllm import LLM, SamplingParams

model_id = "mobiuslabsgmbh/Qwen2.5-7B-Instruct_gemlite-ao_a16w4_gs_128_pack_32bit"

llm = LLM(model=model_id, max_model_len=4096)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)

outputs = llm.generate(["What is the capital of Germany?"], sampling_params)
print(outputs[0].outputs[0].text)
```
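
To generate text with the `model` and `tokenizer` from the transformers example above, here is a minimal sketch; the prompt and sampling settings are illustrative, not prescriptive:
```Python
# Minimal generation sketch using the model/tokenizer loaded in the
# transformers example above. Prompt and sampling values are illustrative.
messages = [{"role": "user", "content": "What is the capital of Germany?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.8, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```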
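
vLLM can also expose the model through its OpenAI-compatible server; a minimal command sketch (the context-length flag shown is illustrative):
```
vllm serve mobiuslabsgmbh/Qwen2.5-7B-Instruct_gemlite-ao_a16w4_gs_128_pack_32bit --max-model-len 4096
```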