fengwuyao committed · Commit f2f4ce0 · verified · 1 Parent(s): 054f4e2

Update README.md

Files changed (1): README.md (+37 -11)
README.md CHANGED
@@ -11,8 +11,9 @@ tags:
This model provides a few variants of
[microsoft/Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) that are ready for
deployment on Android using the
- [LiteRT (fka TFLite) stack](https://ai.google.dev/edge/litert) and
- [MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference).
+ [LiteRT (fka TFLite) stack](https://ai.google.dev/edge/litert),
+ [MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference) and
+ [LiteRT-LM](https://github.com/google-ai-edge/LiteRT-LM).

## Use the models

@@ -28,6 +29,15 @@ on Colab could be much worse than on a local device.*

### Android

+ #### Edge Gallery App
+ * Download or build the [app](https://github.com/google-ai-edge/gallery?tab=readme-ov-file#-get-started-in-minutes) from GitHub.
+
+ * Install the [app](https://play.google.com/store/apps/details?id=com.google.ai.edge.gallery&pli=1) from Google Play.
+
+ * Follow the instructions in the app.
+
+ #### LLM Inference API
+
* Download and install
[the apk](https://github.com/google-ai-edge/mediapipe-samples/releases/latest/download/llm_inference-debug.apk).
* Follow the instructions in the app.
@@ -45,22 +55,37 @@ Note that all benchmark stats are from a Samsung S24 Ultra with

<table border="1">
  <tr>
-   <th></th>
    <th>Backend</th>
+   <th>Quantization scheme</th>
+   <th>Context length</th>
    <th>Prefill (tokens/sec)</th>
    <th>Decode (tokens/sec)</th>
    <th>Time-to-first-token (sec)</th>
-   <th>Memory (RSS in MB)</th>
    <th>Model size (MB)</th>
+   <th>Peak RSS Memory (MB)</th>
+   <th>GPU Memory (MB)</th>
  </tr>
  <tr>
-   <td>dynamic_int8</td>
-   <td>cpu</td>
-   <td><p style="text-align: right">55.60 tk/s</p></td>
-   <td><p style="text-align: right">6.08 tk/s</p></td>
-   <td><p style="text-align: right">16.66 s</p></td>
-   <td><p style="text-align: right">6,195 MB</p></td>
-   <td><p style="text-align: right">3,761 MB</p></td>
+   <td><p style="text-align: right">CPU</p></td>
+   <td><p style="text-align: right">dynamic_int8</p></td>
+   <td><p style="text-align: right">4096</p></td>
+   <td><p style="text-align: right">66.53 tk/s</p></td>
+   <td><p style="text-align: right">7.28 tk/s</p></td>
+   <td><p style="text-align: right">15.90 s</p></td>
+   <td><p style="text-align: right">3906 MB</p></td>
+   <td><p style="text-align: right">5308 MB</p></td>
+   <td><p style="text-align: right">N/A</p></td>
+ </tr>
+ <tr>
+   <td><p style="text-align: right">GPU</p></td>
+   <td><p style="text-align: right">dynamic_int8</p></td>
+   <td><p style="text-align: right">4096</p></td>
+   <td><p style="text-align: right">314.01 tk/s</p></td>
+   <td><p style="text-align: right">10.39 tk/s</p></td>
+   <td><p style="text-align: right">10.32 s</p></td>
+   <td><p style="text-align: right">3906 MB</p></td>
+   <td><p style="text-align: right">4107 MB</p></td>
+   <td><p style="text-align: right">4608 MB</p></td>
  </tr>

</table>
@@ -71,4 +96,5 @@ Note that all benchmark stats are from a Samsung S24 Ultra with
* The inference on CPU is accelerated via the LiteRT
[XNNPACK](https://github.com/google/XNNPACK) delegate with 4 threads
* Benchmark is done assuming XNNPACK cache is enabled
+ * Benchmark is run with cache enabled and initialized. During the first run, the time to first token may differ.
* dynamic_int8: quantized model with int8 weights and float activations.
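For reference, the sketch below shows roughly how the LLM Inference API route added above can be used from an Android app to load one of these LiteRT `.task` bundles. It is a minimal, unofficial sketch: the Gradle artifact is the published `com.google.mediapipe:tasks-genai` library, but the bundle filename, on-device path, and token budget are placeholder assumptions rather than values from this repository.

```kotlin
// Minimal usage sketch for the MediaPipe LLM Inference API (assumed values marked below).
// Gradle dependency: implementation("com.google.mediapipe:tasks-genai:<version>")
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

fun runPhi4Mini(context: Context): String {
    // Assumption: the downloaded .task bundle was pushed to this device path.
    val modelPath = "/data/local/tmp/llm/phi4_mini_instruct.task"

    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath(modelPath)
        .setMaxTokens(1024) // combined prompt + response budget; pick per use case
        .build()

    // Loading the model dominates startup time; reuse this instance across prompts.
    val llm = LlmInference.createFromOptions(context, options)

    // Blocking, single-shot generation; a streaming variant is also available in the API.
    return llm.generateResponse("Summarize what LiteRT is in one sentence.")
}
```

Backend selection (the CPU vs. GPU rows in the table above) and sampling parameters are configured through the same API; see the LLM Inference API documentation linked in the README for the exact option names.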