metascroy committed · verified · Commit 2821130 · Parent(s): b9abd8d

Update README.md

Update README.md

Files changed (1): README.md (+2 −2)
README.md CHANGED

````diff
@@ -23,7 +23,7 @@ The model is suitable for mobile deployment with [ExecuTorch](https://github.com
 See [Exporting to ExecuTorch](#exporting-to-executorch) for exporting the quantized model to an ExecuTorch pte file. We also provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) for direct use.
 
 # Running in a mobile app
-The [PTE file](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
+The [pte file](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
 On iPhone 15 Pro, the model runs at 17.3 tokens/sec and uses 3206 Mb of memory.
 
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/66049fc71116cebd1d3bdcf4/521rXwIlYS9HIAEBAPJjw.png)
@@ -37,7 +37,7 @@ pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/c
 ```
 
 ## Untie Embedding Weights
-Before quantization, since we need quantize input embedding and unembedding (lm_head) layer which are tied, but we want to quantize them separately, we first need to untie the model:
+We want to quantize the embedding and lm_head differently. Since those layers are tied, we first need to untie the model:
 
 ```Py
 from transformers import (
````
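
The README section being edited is about untying the input embedding and lm_head weights so they can be quantized separately. The diff cuts off before the actual snippet, but the idea can be sketched with a toy PyTorch model (the `TinyLM` class below is illustrative, not the Phi-4 code): tied layers share one weight tensor, and untying replaces the shared reference with an independent copy.

```python
# Minimal sketch of weight tying/untying, assuming plain PyTorch.
# TinyLM is a hypothetical toy model, not the actual Phi-4 architecture.
import torch

class TinyLM(torch.nn.Module):
    def __init__(self, vocab: int = 100, dim: int = 16):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab, dim)
        self.lm_head = torch.nn.Linear(dim, vocab, bias=False)
        # Tie: lm_head reuses the embedding's weight tensor.
        self.lm_head.weight = self.embed.weight

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return self.lm_head(self.embed(ids))

m = TinyLM()
assert m.lm_head.weight is m.embed.weight  # tied: one shared tensor

# Untie: give lm_head its own independent copy of the weights,
# so each layer can later be quantized with a different config.
m.lm_head.weight = torch.nn.Parameter(m.embed.weight.detach().clone())
assert m.lm_head.weight is not m.embed.weight  # now separate tensors
```

For a Hugging Face model, the same principle applies, and the checkpoint config must also stop declaring the weights as tied (e.g. the `tie_word_embeddings` flag) so the untied copy survives save/load.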