Update README.md
README.md

This model provides a few variants of
[microsoft/Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) that are ready for
deployment on Android using the
[LiteRT (fka TFLite) stack](https://ai.google.dev/edge/litert),
the [MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference), and
[LiteRT-LM](https://github.com/google-ai-edge/LiteRT-LM).

## Use the models

### Android

#### Edge Gallery App

* Download or build the [app](https://github.com/google-ai-edge/gallery?tab=readme-ov-file#-get-started-in-minutes) from GitHub.
* Install the [app](https://play.google.com/store/apps/details?id=com.google.ai.edge.gallery&pli=1) from Google Play.
* Follow the instructions in the app.

#### LLM Inference API

* Download and install [the apk](https://github.com/google-ai-edge/mediapipe-samples/releases/latest/download/llm_inference-debug.apk).
* Follow the instructions in the app.
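
If you want to call the model from your own app instead of the sample apk, a minimal Kotlin sketch using the MediaPipe LLM Inference API is shown below. The model path, file name, and option values are placeholders rather than part of this repo's instructions; see the LLM Inference API documentation for the full set of options.

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Sketch only: load a LiteRT .task bundle that has been pushed to the device
// and run a single prompt. The file name below is a placeholder for whichever
// variant of this model you downloaded.
fun runPhi4Mini(context: Context, prompt: String): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/phi4_mini_instruct.task") // placeholder path
        .setMaxTokens(1024) // combined prompt + response budget (illustrative)
        .build()

    // Creating the engine loads and initializes the model; reuse it across calls.
    val llm = LlmInference.createFromOptions(context, options)

    // Blocking, single-shot generation; the API also offers async/streaming variants.
    return llm.generateResponse(prompt)
}
```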

<table border="1">
  <tr>
    <th>Backend</th>
    <th>Quantization scheme</th>
    <th>Context length</th>
    <th>Prefill (tokens/sec)</th>
    <th>Decode (tokens/sec)</th>
    <th>Time-to-first-token (sec)</th>
    <th>Model size (MB)</th>
    <th>Peak RSS Memory (MB)</th>
    <th>GPU Memory (MB)</th>
  </tr>
  <tr>
    <td><p style="text-align: right">CPU</p></td>
    <td><p style="text-align: right">dynamic_int8</p></td>
    <td><p style="text-align: right">4096</p></td>
    <td><p style="text-align: right">66.53 tk/s</p></td>
    <td><p style="text-align: right">7.28 tk/s</p></td>
    <td><p style="text-align: right">15.90 s</p></td>
    <td><p style="text-align: right">3906 MB</p></td>
    <td><p style="text-align: right">5308 MB</p></td>
    <td><p style="text-align: right">N/A</p></td>
  </tr>
  <tr>
    <td><p style="text-align: right">GPU</p></td>
    <td><p style="text-align: right">dynamic_int8</p></td>
    <td><p style="text-align: right">4096</p></td>
    <td><p style="text-align: right">314.01 tk/s</p></td>
    <td><p style="text-align: right">10.39 tk/s</p></td>
    <td><p style="text-align: right">10.32 s</p></td>
    <td><p style="text-align: right">3906 MB</p></td>
    <td><p style="text-align: right">4107 MB</p></td>
    <td><p style="text-align: right">4608 MB</p></td>
  </tr>
</table>

* The inference on CPU is accelerated via the LiteRT
[XNNPACK](https://github.com/google/XNNPACK) delegate with 4 threads.
* Benchmarks are run with the XNNPACK cache enabled and initialized, so time-to-first-token may differ on the first run.
* dynamic_int8: quantized model with int8 weights and float activations.
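
For intuition about the dynamic_int8 scheme, the sketch below shows the basic arithmetic of int8 weights with float activations: weights are stored as int8 plus a float scale and dequantized on the fly inside the matmul. This is an illustration only, not how the LiteRT/XNNPACK kernels are actually implemented.

```kotlin
import kotlin.math.abs

// Illustration only: per-tensor symmetric quantization of float weights to int8.
fun quantizeWeights(weights: FloatArray): Pair<ByteArray, Float> {
    val maxAbs = weights.maxOf { abs(it) }
    val scale = if (maxAbs == 0f) 1f else maxAbs / 127f
    val quantized = ByteArray(weights.size) { i ->
        (weights[i] / scale).toInt().coerceIn(-127, 127).toByte()
    }
    return quantized to scale
}

// Activations stay in float; int8 weights are dequantized as they are used.
fun dotProduct(activations: FloatArray, qWeights: ByteArray, scale: Float): Float {
    var acc = 0f
    for (i in activations.indices) acc += activations[i] * (qWeights[i] * scale)
    return acc
}
```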