This is an HQQ all-4-bit (group-size=128) quantized Phi-4-mini-instruct model, quantized via TorchAO with GemLite as the backend.

Usage

First, install the dependencies:

pip install torchao;
pip install git+https://github.com/mobiusml/gemlite.git;

Then you can use the sample code below:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mobiuslabsgmbh/Phi-4-mini-instruct_gemlite-ao_a16w4_gs_128_pack_32bit"

# Load the pre-quantized model (fp16 activations, 4-bit weights) onto the GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map='cuda',
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
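
A minimal generation sketch to go with the snippet above (the prompt and generation settings are illustrative, not part of the model card): it applies the model's chat template and decodes only the newly generated tokens.

# Illustrative prompt; Phi-4-mini-instruct expects chat-formatted input
messages = [{"role": "user", "content": "What is the capital of Germany?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Strip the prompt tokens before decoding the answer
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))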

Use in vLLM:

from vllm import LLM
from vllm.sampling_params import SamplingParams

model_id = "mobiuslabsgmbh/Phi-4-mini-instruct_gemlite-ao_a16w4_gs_128_pack_32bit"

llm = LLM(model=model_id, max_model_len=4096)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)
outputs = llm.generate(["What is the capital of Germany?"], sampling_params)
print(outputs[0].outputs[0].text)
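
For chat-style prompts, recent vLLM versions also provide LLM.chat, which applies the model's chat template for you. A short sketch under that assumption, reusing the llm and sampling_params objects from above (the prompt is illustrative):

# Chat-formatted request; vLLM applies the chat template internally
messages = [{"role": "user", "content": "What is the capital of Germany?"}]
chat_outputs = llm.chat(messages, sampling_params)
print(chat_outputs[0].outputs[0].text)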