Update README.md
README.md
CHANGED
@@ -23,7 +23,7 @@ The model is suitable for mobile deployment with [ExecuTorch](https://github.com
 See [Exporting to ExecuTorch](#exporting-to-executorch) for exporting the quantized model to an ExecuTorch pte file. We also provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) for direct use.
 
 # Running in a mobile app
-The [
+The [pte file](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
 On iPhone 15 Pro, the model runs at 17.3 tokens/sec and uses 3206 Mb of memory.
 
 
@@ -37,7 +37,7 @@ pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/c
 ```
 
 ## Untie Embedding Weights
-
+We want to quantize the embedding and lm_head differently. Since those layers are tied, we first need to untie the model:
 
 ```Py
 from transformers import (