drisspg committed (verified)
Commit c6246f9 · Parent: 8680f69

Update README.md

Files changed (1): README.md (+18 -18)
README.md CHANGED
@@ -26,7 +26,7 @@ pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
 ```
 
 ## Code Example
-```
+```Py
 from vllm import LLM, SamplingParams
 
 llm = LLM(model="pytorch/Phi-4-mini-instruct-float8dq", trust_remote_code=True)
@@ -49,14 +49,14 @@ print(output[0].outputs[0].text)
 
 ## Serving
 Then we can serve with the following command:
-```
+```Shell
 vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
 ```
 
 # Inference with Transformers
 
 Install the required packages:
-```
+```Shell
 pip install git+https://github.com/huggingface/transformers@main
 pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
 pip install torch
@@ -64,7 +64,7 @@ pip install accelerate
 ```
 
 Example:
-```
+```Py
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
 
@@ -108,7 +108,7 @@ print(output[0]['generated_text'])
 
 Install the required packages:
 
-```
+```Shell
 pip install git+https://github.com/huggingface/transformers@main
 pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
 pip install torch
@@ -117,7 +117,7 @@ pip install accelerate
 
 Use the following code to get the quantized model:
 
-```
+```Py
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
 
@@ -170,12 +170,12 @@ https://github.com/EleutherAI/lm-evaluation-harness#install
 
 
 ## baseline
-```
+```Shell
 lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
 ```
 
 ## float8 dynamic activation and float8 weight quantization (float8dq)
-```
+```Shell
 lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
 ```
 
@@ -217,7 +217,7 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq
 We can use the following code to get a sense of peak memory usage during inference:
 
 
-```
+```Py
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
 
@@ -273,24 +273,24 @@ Note the result of latency (benchmark_latency) is in seconds, and serving (bench
 ## benchmark_latency
 
 Need to install vllm nightly to get some recent changes
-```
+```Shell
 pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
 ```
 
 Get vllm source code:
-```
+```Shell
 git clone git@github.com:vllm-project/vllm.git
 ```
 
 Run the following under `vllm` root folder:
 
 ### baseline
-```
+```Shell
 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1
 ```
 
 ### float8dq
-```
+```Shell
 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1
 ```
 
@@ -302,7 +302,7 @@ Download sharegpt dataset: `wget https://huggingface.co/datasets/anon8231489123/
 Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks
 
 Get vllm source code:
-```
+```Shell
 git clone git@github.com:vllm-project/vllm.git
 ```
 
@@ -310,23 +310,23 @@ Run the following under `vllm` root folder:
 
 ### baseline
 Server:
-```
+```Shell
 vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
 ```
 
 Client:
-```
+```Shell
 python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
 ```
 
 ### float8dq
 Server:
-```
+```Shell
 vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
 ```
 
 Client:
-```
+```Shell
 python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model jerryzh168/phi4-mini-float8dq --num-prompts 1
 ```
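For reference, the first hunk shows only the opening lines of the vLLM code example that gains the `Py` tag. A minimal runnable sketch of that pattern is below; the prompt text and sampling parameters are illustrative assumptions, since they are not visible in the hunk context (the final `print` line is taken from the second hunk header).

```Py
from vllm import LLM, SamplingParams

# Load the float8-quantized checkpoint, as in the hunk context above.
llm = LLM(model="pytorch/Phi-4-mini-instruct-float8dq", trust_remote_code=True)

# Sampling settings here are assumptions, not taken from the README.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Generate and print the first completion.
output = llm.generate(["Why is quantization useful for LLM inference?"], sampling_params)
print(output[0].outputs[0].text)
```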
 
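The later hunks show only the imports for the "get the quantized model" and peak-memory snippets. As a hedged sketch (the README's exact code is not visible in the diff; the torchao config class names and granularity below are assumptions based on current torchao/transformers and may differ), float8 dynamic activation plus float8 weight quantization via `TorchAoConfig` typically looks like this:

```Py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
# Assumption: recent torchao nightly, as pinned by the install commands above.
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

model_id = "microsoft/Phi-4-mini-instruct"

# Assumption: per-row float8 dynamic activation / float8 weight quantization,
# matching the "float8dq" naming used throughout the README.
quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
quantization_config = TorchAoConfig(quant_type=quant_config)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Rough peak-memory check during inference, using standard torch.cuda APIs.
torch.cuda.reset_peak_memory_stats()
inputs = tokenizer("What is quantization?", return_tensors="pt").to(quantized_model.device)
quantized_model.generate(**inputs, max_new_tokens=32)
print(f"Peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```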