Update README.md
README.md

This model provides a few variants of
[microsoft/Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) that are ready for
deployment on Android using the
[LiteRT (fka TFLite) stack](https://ai.google.dev/edge/litert),
the [MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference), and
[LiteRT-LM](https://github.com/google-ai-edge/LiteRT-LM).

## Use the models

### Android

#### Edge Gallery App

* Download or build the [app](https://github.com/google-ai-edge/gallery?tab=readme-ov-file#-get-started-in-minutes) from GitHub.
* Install the [app](https://play.google.com/store/apps/details?id=com.google.ai.edge.gallery&pli=1) from Google Play.
* Follow the instructions in the app.

#### LLM Inference API

* Download and install [the apk](https://github.com/google-ai-edge/mediapipe-samples/releases/latest/download/llm_inference-debug.apk).
* Follow the instructions in the app.
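
If you want to call the model from your own app instead of the sample apk, a minimal Kotlin sketch using the MediaPipe LLM Inference API is shown below. The model path, file name, and option values are placeholders rather than part of this repo's instructions; see the LLM Inference API documentation for the full set of options.

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Sketch only: load a LiteRT .task bundle that has been pushed to the device
// and run a single prompt. The file name below is a placeholder for whichever
// variant of this model you downloaded.
fun runPhi4Mini(context: Context, prompt: String): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/phi4_mini_instruct.task") // placeholder path
        .setMaxTokens(1024) // combined prompt + response budget (illustrative)
        .build()

    // Creating the engine loads and initializes the model; reuse it across calls.
    val llm = LlmInference.createFromOptions(context, options)

    // Blocking, single-shot generation; the API also offers async/streaming variants.
    return llm.generateResponse(prompt)
}
```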

<table border="1">
  <tr>
    <th>Backend</th>
    <th>Quantization scheme</th>
    <th>Context length</th>
    <th>Prefill (tokens/sec)</th>
    <th>Decode (tokens/sec)</th>
    <th>Time-to-first-token (sec)</th>
    <th>Model size (MB)</th>
    <th>Peak RSS Memory (MB)</th>
    <th>GPU Memory (MB)</th>
  </tr>
  <tr>
    <td><p style="text-align: right">CPU</p></td>
    <td><p style="text-align: right">dynamic_int8</p></td>
    <td><p style="text-align: right">4096</p></td>
    <td><p style="text-align: right">66.53 tk/s</p></td>
    <td><p style="text-align: right">7.28 tk/s</p></td>
    <td><p style="text-align: right">15.90 s</p></td>
    <td><p style="text-align: right">3906 MB</p></td>
    <td><p style="text-align: right">5308 MB</p></td>
    <td><p style="text-align: right">N/A</p></td>
  </tr>
  <tr>
    <td><p style="text-align: right">GPU</p></td>
    <td><p style="text-align: right">dynamic_int8</p></td>
    <td><p style="text-align: right">4096</p></td>
    <td><p style="text-align: right">314.01 tk/s</p></td>
    <td><p style="text-align: right">10.39 tk/s</p></td>
    <td><p style="text-align: right">10.32 s</p></td>
    <td><p style="text-align: right">3906 MB</p></td>
    <td><p style="text-align: right">4107 MB</p></td>
    <td><p style="text-align: right">4608 MB</p></td>
  </tr>
</table>

* The inference on CPU is accelerated via the LiteRT
[XNNPACK](https://github.com/google/XNNPACK) delegate with 4 threads.
* Benchmarks are run with the XNNPACK cache enabled and initialized, so time-to-first-token may differ on the first run.
* dynamic_int8: quantized model with int8 weights and float activations.
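
For intuition about the dynamic_int8 scheme, the sketch below shows the basic arithmetic of int8 weights with float activations: weights are stored as int8 plus a float scale and dequantized on the fly inside the matmul. This is an illustration only, not how the LiteRT/XNNPACK kernels are actually implemented.

```kotlin
import kotlin.math.abs

// Illustration only: per-tensor symmetric quantization of float weights to int8.
fun quantizeWeights(weights: FloatArray): Pair<ByteArray, Float> {
    val maxAbs = weights.maxOf { abs(it) }
    val scale = if (maxAbs == 0f) 1f else maxAbs / 127f
    val quantized = ByteArray(weights.size) { i ->
        (weights[i] / scale).toInt().coerceIn(-127, 127).toByte()
    }
    return quantized to scale
}

// Activations stay in float; int8 weights are dequantized as they are used.
fun dotProduct(activations: FloatArray, qWeights: ByteArray, scale: Float): Float {
    var acc = 0f
    for (i in activations.indices) acc += activations[i] * (qWeights[i] * scale)
    return acc
}
```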