jerryzh168 committed
Commit 49ef0db · verified · 1 Parent(s): d7d66d0

Update README.md

Files changed (1)
  1. README.md +21 -43
README.md CHANGED
@@ -19,16 +19,15 @@ pipeline_tag: text-generation
 
 [Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) float8 dynamic activation and float8 weight quantization (per row granularity), by the PyTorch team.
 
- # Installation
+
+ # Quantization Recipe
+
+ First, install the required packages:
+
 ```
 pip install git+https://github.com/huggingface/transformers
 pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
- pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
 ```
- Also need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install
-
-
- # Quantization Recipe
 
 We used the following code to get the quantized model:
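Note: the quantization script itself falls outside this hunk's context lines. For reference, a float8 dynamic-activation / float8-weight (per-row) recipe with torchao and transformers typically looks roughly like the sketch below. This is not the exact code from the README; the config and class names assume a recent torchao and transformers, and the push-to-Hub repo name is hypothetical.

```python
# Sketch only: float8 dynamic activation + float8 weight, per-row granularity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

model_id = "microsoft/Phi-4-mini-instruct"

# Per-row float8 weights with dynamically quantized float8 activations.
quant_config = TorchAoConfig(
    quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Optional: push the quantized checkpoint to the Hub (hypothetical repo name).
# quantized_model.push_to_hub("your-org/Phi-4-mini-instruct-float8dq", safe_serialization=False)
```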
 
@@ -76,28 +75,14 @@ output_text = tokenizer.batch_decode(
 generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
 )
 print("Response:", output_text[0][len(prompt):])
+ ```
 
- # Local Benchmark
- import torch.utils.benchmark as benchmark
- from torchao.utils import benchmark_model
- import torchao
-
- def benchmark_fn(f, *args, **kwargs):
-     # Manual warmup
-     for _ in range(2):
-         f(*args, **kwargs)
-
-     t0 = benchmark.Timer(
-         stmt="f(*args, **kwargs)",
-         globals={"args": args, "kwargs": kwargs, "f": f},
-         num_threads=torch.get_num_threads(),
-     )
-     return f"{(t0.blocked_autorange().mean):.3f}"
-
- torchao.quantization.utils.recommended_inductor_config_setter()
- quantized_model = torch.compile(quantized_model, mode="max-autotune")
- print(f"{save_to} model:", benchmark_fn(quantized_model.generate, **inputs, max_new_tokens=128))
+ # Serving with vllm
+ We can use the same command we used in serving benchmarks to serve the model with vllm
+ ```
+ vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
 ```
+
 # Model Quality
 We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
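The hunk header just below shows the lm_eval CLI invocation for the quantized checkpoint. A roughly equivalent programmatic call, convenient for scripting the quantized-vs-baseline comparison, is sketched here assuming lm-eval's Python API; the task list and batch size are illustrative and not from the README.

```python
# Sketch: programmatic lm-eval run for the quantized model (assumed API usage).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pytorch/Phi-4-mini-instruct-float8dq",
    tasks=["hellaswag"],  # illustrative; the README reports many more tasks
    batch_size=8,         # illustrative
)
print(results["results"]["hellaswag"])
```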
 
@@ -118,21 +103,20 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq
 | mmlu (0-shot) | | x |
 | mmlu_pro (5-shot) | | x |
 | **Reasoning** | | |
- | arc_challenge (0-shot) | | x |
- | gpqa_main_zeroshot | | x |
+ | arc_challenge (0-shot) | 56.91 | x |
+ | gpqa_main_zeroshot | 30.13 | x |
 | HellaSwag | 54.57 | 54.55 |
- | openbookqa | | x |
- | piqa (0-shot) | | x |
- | social_iqa | | x |
- | truthfulqa_mc2 (0-shot) | | x |
- | winogrande (0-shot) | | x |
+ | openbookqa | 33.00 | x |
+ | piqa (0-shot) | 77.64 | x |
+ | social_iqa | 49.59 | x |
+ | truthfulqa_mc2 (0-shot) | 48.39 | x |
+ | winogrande (0-shot) | 71.11 | x |
 | **Multilingual** | | |
- | mgsm_en_cot_en | | x |
+ | mgsm_en_cot_en | 60.8 | 60.0 |
 | **Math** | | |
- | gsm8k (5-shot) | | x |
- | mathqa (0-shot) | | x |
+ | gsm8k (5-shot) | 81.88 | 80.89 |
+ | mathqa (0-shot) | 42.31 | 42.51 |
 | **Overall** | **TODO** | **TODO** |
-
 
 # Model Performance
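The performance numbers below come from the vLLM benchmark scripts (see the serving benchmark client in the next hunk). For a quick local sanity check, a rough latency measurement with vLLM's offline API could look like the following sketch; the prompt and token counts are illustrative and will not reproduce the README's benchmark setup.

```python
# Sketch: rough end-to-end latency check with vLLM's offline API (illustrative).
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="pytorch/Phi-4-mini-instruct-float8dq",
    tokenizer="microsoft/Phi-4-mini-instruct",
)
params = SamplingParams(max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(["Why is the sky blue?"], params)
elapsed = time.perf_counter() - start

completion = outputs[0].outputs[0]
print(f"Generated {len(completion.token_ids)} tokens in {elapsed:.2f}s")
print(completion.text)
```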
 
@@ -191,10 +175,4 @@ vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini
 Client:
 ```
 python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model jerryzh168/phi4-mini-float8dq --num-prompts 1
- ```
-
- # Serving with vllm
- We can use the same command we used in serving benchmarks to serve the model with vllm
- ```
- vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
 ```
 
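Once the server from the `# Serving with vllm` section is running, a quick manual request is a useful sanity check before launching the benchmark client above. The sketch below assumes vLLM's default OpenAI-compatible endpoint (http://localhost:8000/v1) and a dummy API key, neither of which is stated in the README.

```python
# Sketch: query the vLLM server started with `vllm serve ...` above.
# Endpoint, port, and dummy API key are vLLM defaults (assumed, not from the README).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="pytorch/Phi-4-mini-instruct-float8dq",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```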