Update README.md

Files changed (1)

README.md +18 -18

@@ -26,7 +26,7 @@ pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
 ```
 ## Code Example
-```
 from vllm import LLM, SamplingParams
 llm = LLM(model="pytorch/Phi-4-mini-instruct-float8dq", trust_remote_code=True)
@@ -49,14 +49,14 @@ print(output[0].outputs[0].text)
 ## Serving
 Then we can serve with the following command:
-```
 vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
 ```
 # Inference with Transformers
 Install the required packages:
-```
 pip install git+https://github.com/huggingface/transformers@main
 pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
 pip install torch
@@ -64,7 +64,7 @@ pip install accelerate
 ```
 Example:
-```
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
@@ -108,7 +108,7 @@ print(output[0]['generated_text'])
 Install the required packages:
-```
 pip install git+https://github.com/huggingface/transformers@main
 pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
 pip install torch
@@ -117,7 +117,7 @@ pip install accelerate
 Use the following code to get the quantized model:
-```
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
@@ -170,12 +170,12 @@ https://github.com/EleutherAI/lm-evaluation-harness#install
 ## baseline
-```
 lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
 ```
 ## float8 dynamic activation and float8 weight quantization (float8dq)
-```
 lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
 ```
@@ -217,7 +217,7 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq
 We can use the following code to get a sense of peak memory usage during inference:
-```
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
@@ -273,24 +273,24 @@ Note the result of latency (benchmark_latency) is in seconds, and serving (bench
 ## benchmark_latency
 Need to install vllm nightly to get some recent changes
-```
 pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
 ```
 Get vllm source code:
-```
 git clone git@github.com:vllm-project/vllm.git
 ```
 Run the following under `vllm` root folder:
 ### baseline
-```
 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1
 ```
 ### float8dq
-```
 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1
 ```
@@ -302,7 +302,7 @@ Download sharegpt dataset: `wget https://huggingface.co/datasets/anon8231489123/
 Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks
 Get vllm source code:
-```
 git clone git@github.com:vllm-project/vllm.git
 ```
@@ -310,23 +310,23 @@ Run the following under `vllm` root folder:
 ### baseline
 Server:
-```
 vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
 ```
 Client:
-```
 python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
 ```
 ### float8dq
 Server:
-```
 vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
 ```
 Client:
-```
 python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model jerryzh168/phi4-mini-float8dq --num-prompts 1
 ```

 ```
 ## Code Example
+```Py
 from vllm import LLM, SamplingParams
 llm = LLM(model="pytorch/Phi-4-mini-instruct-float8dq", trust_remote_code=True)
 ## Serving
 Then we can serve with the following command:
+```Shell
 vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
 ```
 # Inference with Transformers
 Install the required packages:
+```Shell
 pip install git+https://github.com/huggingface/transformers@main
 pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
 pip install torch
 ```
 Example:
+```Py
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
 Install the required packages:
+```Shell
 pip install git+https://github.com/huggingface/transformers@main
 pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
 pip install torch
 Use the following code to get the quantized model:
+```Py
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
 ## baseline
+```Shell
 lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
 ```
 ## float8 dynamic activation and float8 weight quantization (float8dq)
+```Shell
 lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
 ```
 We can use the following code to get a sense of peak memory usage during inference:
+```Py
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
 ## benchmark_latency
 Need to install vllm nightly to get some recent changes
+```Shell
 pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
 ```
 Get vllm source code:
+```Shell
 git clone git@github.com:vllm-project/vllm.git
 ```
 Run the following under `vllm` root folder:
 ### baseline
+```Shell
 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1
 ```
 ### float8dq
+```Shell
 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1
 ```
 Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks
 Get vllm source code:
+```Shell
 git clone git@github.com:vllm-project/vllm.git
 ```
 ### baseline
 Server:
+```Shell
 vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
 ```
 Client:
+```Shell
 python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
 ```
 ### float8dq
 Server:
+```Shell
 vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
 ```
 Client:
+```Shell
 python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model jerryzh168/phi4-mini-float8dq --num-prompts 1
 ```