README.md · pytorch/Phi-4-mini-instruct-float8dq at 800c265ae620109e07a03fc00bc51632ba78a4e4

File size: 5,424 Bytes

11ca438
 
f38ad3d
 
 
11ca438
 
a7bb628
 
800c265
fa9082a
 
 
 
 
800c265
 
a7bb628
 
 
 
 
 
 
 
 
 
 
 
39b90e8
a7bb628
 
 
28f465c
a7bb628
39b90e8
a7bb628
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
be21cdb
a7bb628
 
 
 
be21cdb
a7bb628
 
 
 
be21cdb
a7bb628
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
d682ce6
a7bb628
36880bf
 
a7bb628
 
d682ce6
a7bb628
 
 
d682ce6
36880bf
 
 
d682ce6
a7bb628
 
 
 
d682ce6
a7bb628
 
 
 
d682ce6
a7bb628
 
 
36880bf
 
d682ce6
a7bb628
 
 
 
 
 
 
 
 
 
d682ce6
a7bb628
 
 
 
 
 
 
f38ad3d
a7bb628
 
 
 
 
 
f38ad3d

---
library_name: transformers
tags:
- torchao
license: mit
---

[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) float8 dynamic activation and float8 weight quantization (per row granularity), by PyTorch team.

# Installation
```
pip install transformers
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
```

# Quantization Recipe

We used following code to get the quantized model:

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_id = "microsoft/Phi-4-mini-instruct"

from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
quantization_config = TorchAoConfig(quant_type=quant_config)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Push to hub
USER_ID = "YOUR_USER_ID"
save_to = f"{USER_ID}/{model_id}-float8dq"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)

# Manual Testing
prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

# Local Benchmark
import torch.utils.benchmark as benchmark
from torchao.utils import benchmark_model
import torchao

def benchmark_fn(f, *args, **kwargs):
    # Manual warmup
    for _ in range(2):
        f(*args, **kwargs)

    t0 = benchmark.Timer(
        stmt="f(*args, **kwargs)",
        globals={"args": args, "kwargs": kwargs, "f": f},
        num_threads=torch.get_num_threads(),
    )
    return f"{(t0.blocked_autorange().mean):.3f}"

torchao.quantization.utils.recommended_inductor_config_setter()
quantized_model = torch.compile(quantized_model, mode="max-autotune")
print(f"{save_to} model:", benchmark_fn(quantized_model.generate, **inputs, max_new_tokens=128))
```
# Model Quality
We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

## Installing the nightly version to get most recent updates
```
pip install git+https://github.com/EleutherAI/lm-evaluation-harness
```

## baseline
```
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
```

## float8dq
```
lm_eval --model hf --model_args pretrained=jerryzh168/phi4-mini-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
```

`TODO: more complete eval results`


| Benchmark                        |             |                   |
|----------------------------------|-------------|-------------------|
|                                  | Phi-4 mini-Ins | phi4-mini-float8dq | 
| **Popular aggregated benchmark** |             |                   |
| **Reasoning**                    |             |                   |
| HellaSwag                        | 54.57       | 54.55             |
| **Multilingual**                 |             |                   |
| **Math**                         |             |                   |
| **Overall**                      | **TODO**    | **TODO**          |
 
# Model Performance

## Download vllm source code and install vllm
```
git clone [email protected]:vllm-project/vllm.git
VLLM_USE_PRECOMPILED=1 pip install .
```

## Download dataset
Download sharegpt dataset: `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json`

Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks
## benchmark_latency

Run the following under `vllm` source code root folder:

### baseline
```
python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1
```

### float8dq
```
python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model jerryzh168/phi4-mini-float8dq --batch-size 1
```

## benchmark_serving

We also benchmarked the throughput in a serving environment.

Run the following under `vllm` source code root folder:

### baseline
Server:
```
vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
```

Client:
```
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
```

### float8dq
Server:
```
vllm serve jerryzh168/phi4-mini-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
```

Client:
```
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model jerryzh168/phi4-mini-float8dq --num-prompts 1
```

# Serving with vllm
We can use the same command we used in serving benchmarks to serve the model with vllm
```
vllm serve jerryzh168/phi4-mini-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
```