This is an HQQ all-4-bit (group-size=128) quantized Llama-3.1-8B-Instruct model, produced via TorchAO with GemLite as the backend.
First, install the dependencies:
pip install torchao
pip install git+https://github.com/mobiusml/gemlite.git
Then you can use the sample code below:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "mobiuslabsgmbh/Llama-3.1-8B-Instruct_gemlite-ao_a16w4_gs_128_pack_16bit"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # a16w4: fp16 activations, 4-bit weights
    device_map='cuda',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
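Once loaded, the model generates like any other transformers causal LM. A minimal sketch (the prompt and generation settings below are illustrative, not part of the model card):

# Illustrative prompt; swap in your own messages
messages = [{"role": "user", "content": "What is the capital of Germany?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))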
Use in vLLM:
from vllm import LLM
from vllm.sampling_params import SamplingParams
model_id = "mobiuslabsgmbh/Llama-3.1-8B-Instruct_gemlite-ao_a16w4_gs_128_pack_16bit"
llm = LLM(model=model_id, max_model_len=4096)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)
outputs = llm.generate(["What is the capital of Germany?"], sampling_params)
print(outputs[0].outputs[0].text)
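For chat-style prompts, recent vLLM releases also expose LLM.chat, which applies the model's chat template for you. A minimal sketch, assuming a vLLM version that provides this API:

# Let vLLM apply the chat template instead of passing a raw prompt
messages = [{"role": "user", "content": "What is the capital of Germany?"}]
chat_outputs = llm.chat(messages, sampling_params)
print(chat_outputs[0].outputs[0].text)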