jerryzh168 committed on
Commit a2a1dd6 · verified · 1 Parent(s): 3f7c9d6

Update README.md

Files changed (1)
  1. README.md +3 -2
README.md CHANGED
@@ -295,10 +295,11 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
  | | Phi-4 mini-Ins | Phi-4-mini-instruct-FP8 |
  | latency (batch_size=1) | 1.61s | 1.25s (1.29x speedup) |
  | latency (batch_size=256) | 5.16s | 4.89s (1.05x speedup) |
- | serving (num_prompts=1) | 1.37 req/s | 1.74 req/s (1.27x speedup) |
- | serving (num_prompts=1000) | 66.68 req/s | 80.53 req/s (1.21x speedup) |
+ | serving (num_prompts=1) | 1.37 req/s | 1.66 req/s (1.21x speedup) |
+ | serving (num_prompts=1000) | 62.55 req/s | 72.56 req/s (1.16x speedup) |
 
  Note the result of latency (benchmark_latency) is in seconds, and serving (benchmark_serving) is in number of requests per second.
+ Note the result is not using fbgemm kernels, (no `fbgemm-gpu-genai` installed), fbgemm kernels has less speedup when num_prompts is 1000 currently.
 
  <details>
  <summary> Reproduce Model Performance Results </summary>
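
The speedup figures in the table mix two kinds of metric, so the ratio is taken in opposite directions: latency is lower-is-better (baseline divided by FP8), while serving throughput is higher-is-better (FP8 divided by baseline). The following is a minimal sketch of that arithmetic using numbers from the updated rows. The helper names `latency_speedup` and `throughput_speedup` are hypothetical and not part of the README or any benchmark script.

```python
def latency_speedup(baseline_s: float, quantized_s: float) -> float:
    """Speedup for a lower-is-better metric (seconds): baseline / quantized."""
    return baseline_s / quantized_s


def throughput_speedup(baseline_rps: float, quantized_rps: float) -> float:
    """Speedup for a higher-is-better metric (req/s): quantized / baseline."""
    return quantized_rps / baseline_rps


# Values copied from the table in the diff above (baseline, FP8).
rows = {
    "latency (batch_size=1)": latency_speedup(1.61, 1.25),
    "serving (num_prompts=1)": throughput_speedup(1.37, 1.66),
    "serving (num_prompts=1000)": throughput_speedup(62.55, 72.56),
}

for name, ratio in rows.items():
    # Two-decimal formatting, matching the style used in the table.
    print(f"{name}: {ratio:.2f}x speedup")
```

Taking the ratio the wrong way round for either metric would report a slowdown where the table shows a speedup, which is why the two helpers are kept separate.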