Update README.md

Browse files

Files changed (1) hide show

README.md +21 -43

README.md CHANGED Viewed

@@ -19,16 +19,15 @@ pipeline_tag: text-generation
 [Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) float8 dynamic activation and float8 weight quantization (per row granularity), by PyTorch team.
-# Installation
 ```
 pip install git+https://github.com/huggingface/transformers
 pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
-pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
 ```
-Also need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install
-# Quantization Recipe
 We used following code to get the quantized model:
@@ -76,28 +75,14 @@ output_text = tokenizer.batch_decode(
     generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
 )
 print("Response:", output_text[0][len(prompt):])
-# Local Benchmark
-import torch.utils.benchmark as benchmark
-from torchao.utils import benchmark_model
-import torchao
-def benchmark_fn(f, *args, **kwargs):
-    # Manual warmup
-    for _ in range(2):
-        f(*args, **kwargs)
-    t0 = benchmark.Timer(
-        stmt="f(*args, **kwargs)",
-        globals={"args": args, "kwargs": kwargs, "f": f},
-        num_threads=torch.get_num_threads(),
-    )
-    return f"{(t0.blocked_autorange().mean):.3f}"
-torchao.quantization.utils.recommended_inductor_config_setter()
-quantized_model = torch.compile(quantized_model, mode="max-autotune")
-print(f"{save_to} model:", benchmark_fn(quantized_model.generate, **inputs, max_new_tokens=128))
 ```
 # Model Quality
 We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
@@ -118,21 +103,20 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq
 | mmlu (0-shot)                    |                |  x              |
 | mmlu_pro (5-shot)                |                |  x              |
 | **Reasoning**                    |                |                     |
-| arc_challenge (0-shot)           |                |  x              |
-| gpqa_main_zeroshot               |                |  x              |
 | HellaSwag                        | 54.57          |  54.55              |
-| openbookqa                       |                |  x              |
-| piqa (0-shot)	                   |                |  x              |
-| social_iqa                       |                |  x              |
-| truthfulqa_mc2 (0-shot)          |                |  x              |
-| winogrande  (0-shot)             |                |  x              |
 | **Multilingual**                 |                |                     |
-| mgsm_en_cot_en                   |                |   x              |
 | **Math**                         |                |                     |
-| gsm8k (5-shot)                   |                |   x             |
-| mathqa (0-shot)                  |                |   x             |
 | **Overall**                      | **TODO**       | **TODO**            |
 # Model Performance
@@ -191,10 +175,4 @@ vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini
 Client:
 ```
 python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model jerryzh168/phi4-mini-float8dq --num-prompts 1
-```
-# Serving with vllm
-We can use the same command we used in serving benchmarks to serve the model with vllm
-```
-vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
 ```

 [Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) float8 dynamic activation and float8 weight quantization (per row granularity), by PyTorch team.
+# Quantization Recipe
+First need to install the required packages:
 ```
 pip install git+https://github.com/huggingface/transformers
 pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
 ```
 We used following code to get the quantized model:
     generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
 )
 print("Response:", output_text[0][len(prompt):])
+```
+# Serving with vllm
+We can use the same command we used in serving benchmarks to serve the model with vllm
+```
+vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
 ```
 # Model Quality
 We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
 | mmlu (0-shot)                    |                |  x              |
 | mmlu_pro (5-shot)                |                |  x              |
 | **Reasoning**                    |                |                     |
+| arc_challenge (0-shot)           | 56.91          |  x              |
+| gpqa_main_zeroshot               | 30.13          |  x              |
 | HellaSwag                        | 54.57          |  54.55              |
+| openbookqa                       | 33.00          |  x              |
+| piqa (0-shot)	                   | 77.64          |  x              |
+| social_iqa                       | 49.59          |  x              |
+| truthfulqa_mc2 (0-shot)          | 48.39          |  x              |
+| winogrande  (0-shot)             | 71.11          |  x              |
 | **Multilingual**                 |                |                     |
+| mgsm_en_cot_en                   | 60.8           |  60.0               |
 | **Math**                         |                |                     |
+| gsm8k (5-shot)                   | 81.88          |  80.89              |
+| mathqa (0-shot)                  | 42.31          |  42.51              |
 | **Overall**                      | **TODO**       | **TODO**            |
 # Model Performance
 Client:
 ```
 python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model jerryzh168/phi4-mini-float8dq --num-prompts 1
 ```