# llamafile

[CI status](https://github.com/Mozilla-Ocho/llamafile/actions/workflows/ci.yml)<br/>
[Join us on Discord](https://discord.gg/YuMNeuKStr)<br/><br/>

<img src="llamafile/llamafile-640x640.png" width="320" height="320"
     alt="[line drawing of llama animal head in front of slightly open manilla folder filled with files]">

**llamafile lets you distribute and run LLMs with a single file. ([announcement blog post](https://hacks.mozilla.org/2023/11/introducing-llamafile/))**

Llamafile aims to make open LLMs much more accessible to both developers and end users. It does this by combining [llama.cpp](https://github.com/ggerganov/llama.cpp) with [Cosmopolitan Libc](https://github.com/jart/cosmopolitan) into one framework that collapses all the complexity of LLMs down to a single-file executable (called a "llamafile") that runs locally on most computers, with no installation.<br/><br/>

<a href="https://future.mozilla.org"><img src="llamafile/mozilla-logo-bw-rgb.png" width="150"></a><br/>
llamafile is a Mozilla Builders project.<br/><br/>

## Quickstart

The easiest way to try it for yourself is to download our example llamafile for the [LLaVA](https://llava-vl.github.io/) model (license: [LLaMA 2](https://ai.meta.com/resources/models-and-libraries/llama-downloads/), [OpenAI](https://openai.com/policies/terms-of-use)). LLaVA is a new LLM that can do more than just chat; you can also upload images and ask it questions about them. With llamafile, this all happens locally; no data ever leaves your computer.

1. Download [llava-v1.5-7b-q4.llamafile](https://huggingface.co/Mozilla/llava-v1.5-7b-llamafile/resolve/main/llava-v1.5-7b-q4.llamafile?download=true) (4.29 GB).

2. Open your computer's terminal.

3. If you're using macOS, Linux, or BSD, you'll need to grant permission for your computer to execute this new file. (You only need to do this once.)

```sh
chmod +x llava-v1.5-7b-q4.llamafile
```

4. If you're on Windows, rename the file by adding ".exe" on the end.

5. Run the llamafile. e.g.:

```sh
./llava-v1.5-7b-q4.llamafile
```

6. Your browser should open automatically and display a chat interface. (If it doesn't, just open your browser and point it at http://localhost:8080)

7. When you're done chatting, return to your terminal and hit `Control-C` to shut down llamafile.

**Having trouble? See the "Gotchas and troubleshooting" section below.**

### JSON API Quickstart

When llamafile is started, in addition to hosting a web UI chat server at <http://127.0.0.1:8080/>, it also provides an [OpenAI API](https://platform.openai.com/docs/api-reference/chat) compatible chat completions endpoint. It's designed to support the most common OpenAI API use cases, in a way that runs entirely locally. We've also extended it to include llama.cpp specific features (e.g. mirostat) that may also be used. For further details on what fields and endpoints are available, refer to both the [OpenAI documentation](https://platform.openai.com/docs/api-reference/chat/create) and the [llamafile server README](llama.cpp/server/README.md#api-endpoints).

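For instance, llama.cpp-specific sampling options can be sent alongside the standard OpenAI fields. Treat the extension field names below (`mirostat`, `mirostat_tau`, `mirostat_eta`) as a sketch following llama.cpp's naming conventions; the exact set of accepted extensions depends on your llamafile version, so check the server README linked above:

```shell
# standard OpenAI fields plus llama.cpp-style sampling extensions (names are assumptions; verify against the server README)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "LLaMA_CPP",
    "messages": [{"role": "user", "content": "Write a haiku about mmap()"}],
    "temperature": 0.7,
    "mirostat": 2,
    "mirostat_tau": 5.0,
    "mirostat_eta": 0.1
  }'
```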
<details>
<summary>Curl API Client Example</summary>

The simplest way to get started using the API is to copy and paste the following curl command into your terminal.

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "LLaMA_CPP",
    "messages": [
      {
        "role": "system",
        "content": "You are LLAMAfile, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
      },
      {
        "role": "user",
        "content": "Write a limerick about python exceptions"
      }
    ]
  }' | python3 -c '
import json
import sys
json.dump(json.load(sys.stdin), sys.stdout, indent=2)
print()
'
```

The response that's printed should look like the following:

```json
{
  "choices" : [
    {
      "finish_reason" : "stop",
      "index" : 0,
      "message" : {
        "content" : "There once was a programmer named Mike\nWho wrote code that would often choke\nHe used try and except\nTo handle each step\nAnd his program ran without any hike.",
        "role" : "assistant"
      }
    }
  ],
  "created" : 1704199256,
  "id" : "chatcmpl-Dt16ugf3vF8btUZj9psG7To5tc4murBU",
  "model" : "LLaMA_CPP",
  "object" : "chat.completion",
  "usage" : {
    "completion_tokens" : 38,
    "prompt_tokens" : 78,
    "total_tokens" : 116
  }
}
```

</details>

<details>
<summary>Python API Client example</summary>

If you've already developed your software using the [`openai` Python package](https://pypi.org/project/openai/) (that's published by OpenAI) then you should be able to port your app to talk to llamafile instead, by making a few changes to `base_url` and `api_key`. This example assumes you've run `pip3 install openai` to install OpenAI's client software, which is required by this example. Their package is just a simple Python wrapper around the OpenAI API interface, which can be implemented by any server.

```python
#!/usr/bin/env python3
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8080/v1", # "http://<Your api-server IP>:port"
    api_key="sk-no-key-required"
)
completion = client.chat.completions.create(
    model="LLaMA_CPP",
    messages=[
        {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
        {"role": "user", "content": "Write a limerick about python exceptions"}
    ]
)
print(completion.choices[0].message)
```

The above code will return a Python object like this:

```python
ChatCompletionMessage(content='There once was a programmer named Mike\nWho wrote code that would often strike\nAn error would occur\nAnd he\'d shout "Oh no!"\nBut Python\'s exceptions made it all right.', role='assistant', function_call=None, tool_calls=None)
```

</details>

## Other example llamafiles

We also provide example llamafiles for other models, so you can easily try out llamafile with different kinds of LLMs.

| Model | Size | License | llamafile | other quants |
| --- | --- | --- | --- | --- |
| LLaMA 3.2 3B Instruct | 2.62 GB | [LLaMA 3.2](https://huggingface.co/Mozilla/Llama-3.2-3B-Instruct-llamafile/blob/main/LICENSE) | [Llama-3.2-3B-Instruct.Q6\_K.llamafile](https://huggingface.co/Mozilla/Llama-3.2-3B-Instruct-llamafile/blob/main/Llama-3.2-3B-Instruct.Q6_K.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/Llama-3.2-3B-Instruct-llamafile) |
| LLaMA 3.2 1B Instruct | 1.11 GB | [LLaMA 3.2](https://huggingface.co/Mozilla/Llama-3.2-1B-Instruct-llamafile/blob/main/LICENSE) | [Llama-3.2-1B-Instruct.Q6\_K.llamafile](https://huggingface.co/Mozilla/Llama-3.2-1B-Instruct-llamafile/blob/main/Llama-3.2-1B-Instruct.Q6_K.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/Llama-3.2-1B-Instruct-llamafile) |
| Gemma 2 2B Instruct | 2.32 GB | [Gemma 2](https://huggingface.co/Mozilla/gemma-2-2b-it-llamafile/blob/main/LICENSE) | [gemma-2-2b-it.Q6\_K.llamafile](https://huggingface.co/Mozilla/gemma-2-2b-it-llamafile/blob/main/gemma-2-2b-it.Q6_K.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/gemma-2-2b-it-llamafile) |
| Gemma 2 9B Instruct | 7.76 GB | [Gemma 2](https://huggingface.co/Mozilla/gemma-2-9b-it-llamafile/blob/main/LICENSE) | [gemma-2-9b-it.Q6\_K.llamafile](https://huggingface.co/Mozilla/gemma-2-9b-it-llamafile/blob/main/gemma-2-9b-it.Q6_K.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/gemma-2-9b-it-llamafile) |
| Gemma 2 27B Instruct | 22.5 GB | [Gemma 2](https://huggingface.co/Mozilla/gemma-2-27b-it-llamafile/blob/main/LICENSE) | [gemma-2-27b-it.Q6\_K.llamafile](https://huggingface.co/Mozilla/gemma-2-27b-it-llamafile/blob/main/gemma-2-27b-it.Q6_K.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/gemma-2-27b-it-llamafile) |
| LLaVA 1.5 | 3.97 GB | [LLaMA 2](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) | [llava-v1.5-7b-q4.llamafile](https://huggingface.co/Mozilla/llava-v1.5-7b-llamafile/resolve/main/llava-v1.5-7b-q4.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/llava-v1.5-7b-llamafile) |
| TinyLlama-1.1B | 2.05 GB | [Apache 2.0](https://choosealicense.com/licenses/apache-2.0/) | [TinyLlama-1.1B-Chat-v1.0.F16.llamafile](https://huggingface.co/Mozilla/TinyLlama-1.1B-Chat-v1.0-llamafile/resolve/main/TinyLlama-1.1B-Chat-v1.0.F16.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/TinyLlama-1.1B-Chat-v1.0-llamafile) |
| Mistral-7B-Instruct | 3.85 GB | [Apache 2.0](https://choosealicense.com/licenses/apache-2.0/) | [mistral-7b-instruct-v0.2.Q4\_0.llamafile](https://huggingface.co/Mozilla/Mistral-7B-Instruct-v0.2-llamafile/resolve/main/mistral-7b-instruct-v0.2.Q4_0.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/Mistral-7B-Instruct-v0.2-llamafile) |
| Phi-3-mini-4k-instruct | 7.67 GB | [Apache 2.0](https://huggingface.co/Mozilla/Phi-3-mini-4k-instruct-llamafile/blob/main/LICENSE) | [Phi-3-mini-4k-instruct.F16.llamafile](https://huggingface.co/Mozilla/Phi-3-mini-4k-instruct-llamafile/resolve/main/Phi-3-mini-4k-instruct.F16.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/Phi-3-mini-4k-instruct-llamafile) |
| Mixtral-8x7B-Instruct | 30.03 GB | [Apache 2.0](https://choosealicense.com/licenses/apache-2.0/) | [mixtral-8x7b-instruct-v0.1.Q5\_K\_M.llamafile](https://huggingface.co/Mozilla/Mixtral-8x7B-Instruct-v0.1-llamafile/resolve/main/mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/Mixtral-8x7B-Instruct-v0.1-llamafile) |
| WizardCoder-Python-34B | 22.23 GB | [LLaMA 2](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) | [wizardcoder-python-34b-v1.0.Q5\_K\_M.llamafile](https://huggingface.co/Mozilla/WizardCoder-Python-34B-V1.0-llamafile/resolve/main/wizardcoder-python-34b-v1.0.Q5_K_M.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/WizardCoder-Python-34B-V1.0-llamafile) |
| WizardCoder-Python-13B | 7.33 GB | [LLaMA 2](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) | [wizardcoder-python-13b.llamafile](https://huggingface.co/jartine/wizardcoder-13b-python/resolve/main/wizardcoder-python-13b.llamafile?download=true) | [See HF repo](https://huggingface.co/jartine/wizardcoder-13b-python) |
| LLaMA-3-Instruct-70B | 37.25 GB | [llama3](https://huggingface.co/Mozilla/Meta-Llama-3-8B-Instruct-llamafile/blob/main/Meta-Llama-3-Community-License-Agreement.txt) | [Meta-Llama-3-70B-Instruct.Q4\_0.llamafile](https://huggingface.co/Mozilla/Meta-Llama-3-70B-Instruct-llamafile/resolve/main/Meta-Llama-3-70B-Instruct.Q4_0.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/Meta-Llama-3-70B-Instruct-llamafile) |
| LLaMA-3-Instruct-8B | 5.37 GB | [llama3](https://huggingface.co/Mozilla/Meta-Llama-3-8B-Instruct-llamafile/blob/main/Meta-Llama-3-Community-License-Agreement.txt) | [Meta-Llama-3-8B-Instruct.Q5\_K\_M.llamafile](https://huggingface.co/Mozilla/Meta-Llama-3-8B-Instruct-llamafile/resolve/main/Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/Meta-Llama-3-8B-Instruct-llamafile) |
| Rocket-3B | 1.89 GB | [cc-by-sa-4.0](https://creativecommons.org/licenses/by-sa/4.0/deed.en) | [rocket-3b.Q5\_K\_M.llamafile](https://huggingface.co/Mozilla/rocket-3B-llamafile/resolve/main/rocket-3b.Q5_K_M.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/rocket-3B-llamafile) |
| OLMo-7B | 5.68 GB | [Apache 2.0](https://huggingface.co/Mozilla/OLMo-7B-0424-llamafile/blob/main/LICENSE) | [OLMo-7B-0424.Q6\_K.llamafile](https://huggingface.co/Mozilla/OLMo-7B-0424-llamafile/resolve/main/OLMo-7B-0424.Q6_K.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/OLMo-7B-0424-llamafile) |
| *Text Embedding Models* | | | | |
| E5-Mistral-7B-Instruct | 5.16 GB | [MIT](https://choosealicense.com/licenses/mit/) | [e5-mistral-7b-instruct-Q5_K_M.llamafile](https://huggingface.co/Mozilla/e5-mistral-7b-instruct/resolve/main/e5-mistral-7b-instruct-Q5_K_M.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/e5-mistral-7b-instruct) |
| mxbai-embed-large-v1 | 0.7 GB | [Apache 2.0](https://choosealicense.com/licenses/apache-2.0/) | [mxbai-embed-large-v1-f16.llamafile](https://huggingface.co/Mozilla/mxbai-embed-large-v1-llamafile/resolve/main/mxbai-embed-large-v1-f16.llamafile?download=true) | [See HF Repo](https://huggingface.co/Mozilla/mxbai-embed-large-v1-llamafile) |

Here is an example for the Mistral command-line llamafile:

```sh
./mistral-7b-instruct-v0.2.Q5_K_M.llamafile --temp 0.7 -p '[INST]Write a story about llamas[/INST]'
```

And here is an example for the WizardCoder-Python command-line llamafile:

```sh
./wizardcoder-python-13b.llamafile --temp 0 -e -r '```\n' -p '```c\nvoid *memcpy_sse2(char *dst, const char *src, size_t size) {\n'
```

And here's an example for the LLaVA command-line llamafile:

```sh
./llava-v1.5-7b-q4.llamafile --temp 0.2 --image lemurs.jpg -e -p '### User: What do you see?\n### Assistant:'
```

As before, macOS, Linux, and BSD users will need to use the "chmod" command to grant execution permissions to the file before running these llamafiles for the first time.

Unfortunately, Windows users cannot make use of many of these example llamafiles because Windows has a maximum executable file size of 4GB, and many of these examples exceed that size. (The LLaVA llamafile works on Windows because it is 30MB shy of the size limit.) But don't lose heart: llamafile allows you to use external weights; this is described later in this document.

**Having trouble? See the "Gotchas and troubleshooting" section below.**

## How llamafile works

A llamafile is an executable LLM that you can run on your own computer. It contains the weights for a given open LLM, as well as everything needed to actually run that model on your computer. There's nothing to install or configure (with a few caveats, discussed in subsequent sections of this document).

This is all accomplished by combining llama.cpp with Cosmopolitan Libc, which provides some useful capabilities:

1. llamafiles can run on multiple CPU microarchitectures. We added runtime dispatching to llama.cpp that lets new Intel systems use modern CPU features without trading away support for older computers.

2. llamafiles can run on multiple CPU architectures. We do that by concatenating AMD64 and ARM64 builds with a shell script that launches the appropriate one. Our file format is compatible with WIN32 and most UNIX shells. It can also be easily converted (by either you or your users) to the platform-native format, whenever required.

3. llamafiles can run on six OSes (macOS, Windows, Linux, FreeBSD, OpenBSD, and NetBSD). If you make your own llamafiles, you'll only need to build your code once, using a Linux-style toolchain. The GCC-based compiler we provide is itself an Actually Portable Executable, so you can build your software for all six OSes from the comfort of whichever one you prefer most for development.

4. The weights for an LLM can be embedded within the llamafile. We added support for PKZIP to the GGML library. This lets uncompressed weights be mapped directly into memory, similar to a self-extracting archive. It enables quantized weights distributed online to be prefixed with a compatible version of the llama.cpp software, thereby ensuring that its originally observed behaviors can be reproduced indefinitely.

5. Finally, with the tools included in this project you can create your *own* llamafiles, using any compatible model weights you want. You can then distribute these llamafiles to other people, who can easily make use of them regardless of what kind of computer they have.

## Using llamafile with external weights

Even though our example llamafiles have the weights built-in, you don't *have* to use llamafile that way. Instead, you can download *just* the llamafile software (without any weights included) from our releases page. You can then use it alongside any external weights you may have on hand. External weights are particularly useful for Windows users because they enable you to work around Windows' 4GB executable file size limit.

For Windows users, here's an example for the Mistral LLM:

```sh
curl -L -o llamafile.exe https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.11/llamafile-0.8.11
curl -L -o mistral.gguf https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
./llamafile.exe -m mistral.gguf
```

Windows users may need to change `./llamafile.exe` to `.\llamafile.exe` when running the above command.

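The same approach works on Linux, macOS, and BSD; the only differences are that the file doesn't need the `.exe` suffix and does need the execute bit. A minimal sketch, reusing the release and weights URLs from the Windows example above:

```sh
# download the bare llamafile runtime (no weights) and make it executable
curl -L -o llamafile https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.11/llamafile-0.8.11
chmod +x llamafile

# download external GGUF weights and point llamafile at them
curl -L -o mistral.gguf https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
./llamafile -m mistral.gguf
```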
## Gotchas and troubleshooting

On any platform, if your llamafile process is immediately killed, check if you have CrowdStrike and then ask to be whitelisted.

### Mac

On macOS with Apple Silicon you need to have Xcode Command Line Tools installed for llamafile to be able to bootstrap itself.

If you use zsh and have trouble running llamafile, try saying `sh -c ./llamafile`. This is due to a bug that was fixed in zsh 5.9+. The same is the case for Python `subprocess`, old versions of Fish, etc.

#### Mac error "... cannot be opened because the developer cannot be verified"

1. Immediately launch System Settings, then go to Privacy & Security. llamafile should be listed at the bottom, with a button to Allow.
2. If not, then change your command in the Terminal to be `sudo spctl --master-disable; [llama launch command]; sudo spctl --master-enable`. This is because `--master-disable` disables _all_ checking, so you need to turn it back on after quitting llama.

### Linux

On some Linux systems, you might get errors relating to `run-detectors` or WINE. This is due to `binfmt_misc` registrations. You can fix that by adding an additional registration for the APE file format llamafile uses:

```sh
sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
sudo chmod +x /usr/bin/ape
sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
```

### Windows

As mentioned above, on Windows you may need to rename your llamafile by adding `.exe` to the filename.

As also mentioned above, Windows has a maximum file size limit of 4GB for executables. The LLaVA server executable above is just 30MB shy of that limit, so it'll work on Windows, but with larger models like WizardCoder 13B, you need to store the weights in a separate file. An example is provided above; see "Using llamafile with external weights."

On WSL, there are many possible gotchas. One thing that helps solve them completely is this:

```
[Unit]
Description=cosmopolitan APE binfmt service
After=wsl-binfmt.service

[Service]
Type=oneshot
ExecStart=/bin/sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"

[Install]
WantedBy=multi-user.target
```

Put that in `/etc/systemd/system/cosmo-binfmt.service`.

Then run `sudo systemctl enable cosmo-binfmt`.

Another thing that's helped WSL users who experience issues is to disable the WIN32 interop feature:

```sh
sudo sh -c "echo -1 > /proc/sys/fs/binfmt_misc/WSLInterop"
```

If you get a `Permission Denied` error when disabling interop through the CLI, it can be permanently disabled by adding the following in `/etc/wsl.conf`:

```sh
[interop]
enabled=false
```

## Supported OSes

llamafile supports the following operating systems, which require a minimum stock install:

- Linux 2.6.18+ (i.e. every distro since RHEL5 c. 2007)
- Darwin (macOS) 23.1.0+ [1] (GPU is only supported on ARM64)
- Windows 10+ (AMD64 only)
- FreeBSD 13+
- NetBSD 9.2+ (AMD64 only)
- OpenBSD 7+ (AMD64 only)

On Windows, llamafile runs as a native portable executable. On UNIX systems, llamafile extracts a small loader program named `ape` to `$TMPDIR/.llamafile` or `~/.ape-1.9` which is used to map your model into memory.

[1] Darwin kernel versions 15.6+ *should* be supported, but we currently have no way of testing that.

## Supported CPUs

llamafile supports the following CPUs:

- **AMD64** microprocessors must have AVX. Otherwise llamafile will print an error and refuse to run. This means that if you have an Intel CPU, it needs to be Intel Core or newer (circa 2006+), and if you have an AMD CPU, then it needs to be K8 or newer (circa 2003+). Support for AVX512, AVX2, FMA, F16C, and VNNI is conditionally enabled at runtime if you have a newer CPU. For example, Zen4 has very good AVX512 that can speed up BF16 llamafiles. (A quick way to check your CPU's features is shown below.)

- **ARM64** microprocessors must have ARMv8a+. This means everything from Apple Silicon to 64-bit Raspberry Pis will work, provided your weights fit into memory.

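If you're unsure whether your x86 machine has AVX, one convenient way to check is to look at the CPU feature flags. This is just a convenience check (llamafile itself will print an error if AVX is missing), and the macOS command is only meaningful on Intel Macs, since Apple Silicon is ARM64 and doesn't use AVX:

```sh
# Linux: list AVX-related CPU features; empty output means no AVX
grep -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u

# macOS (Intel): look for AVX in the reported CPU features
sysctl -a | grep -i avx
```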
## GPU support

llamafile supports the following kinds of GPUs:

- Apple Metal
- NVIDIA
- AMD

GPU support on macOS ARM64 works by compiling a small module using the Xcode Command Line Tools, which need to be installed. This is a one-time cost that happens the first time you run your llamafile. The DSO built by llamafile is stored in `$TMPDIR/.llamafile` or `$HOME/.llamafile`. Offloading to GPU is enabled by default when a Metal GPU is present. This can be disabled by passing `-ngl 0` or `--gpu disable` to force llamafile to perform CPU inference.

Owners of NVIDIA and AMD graphics cards need to pass the `-ngl 999` flag to enable maximum offloading. If multiple GPUs are present then the work will be divided evenly among them by default, so you can load larger models. Multiple GPU support may be broken on AMD Radeon systems. If that happens to you, then use `export HIP_VISIBLE_DEVICES=0` which forces llamafile to only use the first GPU.

Windows users are encouraged to use our release binaries, because they contain prebuilt DLLs for both NVIDIA and AMD graphics cards, which only depend on the graphics driver being installed. If llamafile detects that NVIDIA's CUDA SDK or AMD's ROCm HIP SDK is installed, then llamafile will try to build a faster DLL that uses cuBLAS or rocBLAS. In order for llamafile to successfully build a cuBLAS module, it needs to be run on the x64 MSVC command prompt. You can use CUDA via WSL by enabling [Nvidia CUDA on WSL](https://learn.microsoft.com/en-us/windows/ai/directml/gpu-cuda-in-wsl) and running your llamafiles inside of WSL. Using WSL has the added benefit of letting you run llamafiles greater than 4GB on Windows.

On Linux, NVIDIA users will need to install the CUDA SDK (ideally using the shell script installer) and ROCm users need to install the HIP SDK. They're detected by looking to see if `nvcc` or `hipcc` are on the PATH.

If you have both an AMD GPU *and* an NVIDIA GPU in your machine, then you may need to qualify which one you want used, by passing either `--gpu amd` or `--gpu nvidia`.

In the event that GPU support couldn't be compiled and dynamically linked on the fly for any reason, llamafile will fall back to CPU inference.

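To summarize the flags mentioned above, here's how GPU offloading is typically toggled from the command line (the model filename is just a placeholder; substitute your own llamafile or `-m` weights):

```sh
./model.llamafile -ngl 999       # NVIDIA/AMD: offload as many layers as possible
./model.llamafile -ngl 0         # keep everything on the CPU
./model.llamafile --gpu disable  # another way to force CPU inference
./model.llamafile --gpu amd      # pick AMD when both AMD and NVIDIA GPUs are present
HIP_VISIBLE_DEVICES=0 ./model.llamafile -ngl 999  # restrict AMD offloading to the first GPU
```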
## Source installation

Developing on llamafile requires a modern version of the GNU `make` command (called `gmake` on some systems), `sha256sum` (otherwise `cc` will be used to build it), `wget` (or `curl`), and `unzip` available at [https://cosmo.zip/pub/cosmos/bin/](https://cosmo.zip/pub/cosmos/bin/). Windows users need [cosmos bash](https://justine.lol/cosmo3/) shell too.

```sh
make -j8
sudo make install PREFIX=/usr/local
```

Here's an example of how to generate code for a libc function using the llama.cpp command line interface, utilizing WizardCoder-Python-13B weights:

```sh
llamafile \
  -m wizardcoder-python-13b-v1.0.Q8_0.gguf \
  --temp 0 -r '}\n' -r '```\n' \
  -e -p '```c\nvoid *memcpy(void *dst, const void *src, size_t size) {\n'
```

Here's a similar example that instead utilizes Mistral-7B-Instruct weights for prose composition:

```sh
llamafile -ngl 9999 \
  -m mistral-7b-instruct-v0.1.Q4_K_M.gguf \
  -p '[INST]Write a story about llamas[/INST]'
```

Here's an example of how llamafile can be used as an interactive chatbot that lets you query knowledge contained in training data:

```sh
llamafile -m llama-65b-Q5_K.gguf -p '
The following is a conversation between a Researcher and their helpful AI assistant Digital Athena which is a large language model trained on the sum of human knowledge.
Researcher: Good morning.
Digital Athena: How can I help you today?
Researcher:' --interactive --color --batch_size 1024 --ctx_size 4096 \
  --keep -1 --temp 0 --mirostat 2 --in-prefix ' ' --interactive-first \
  --in-suffix 'Digital Athena:' --reverse-prompt 'Researcher:'
```

Here's an example of how you can use llamafile to summarize HTML URLs:

```sh
(
  echo '[INST]Summarize the following text:'
  links -codepage utf-8 \
        -force-html \
        -width 500 \
        -dump https://www.poetryfoundation.org/poems/48860/the-raven |
    sed 's/ */ /g'
  echo '[/INST]'
) | llamafile -ngl 9999 \
    -m mistral-7b-instruct-v0.2.Q5_K_M.gguf \
    -f /dev/stdin \
    -c 0 \
    --temp 0 \
    -n 500 \
    --no-display-prompt 2>/dev/null
```

Here's how you can use llamafile to describe a jpg/png/gif/bmp image:

```sh
llamafile -ngl 9999 --temp 0 \
  --image ~/Pictures/lemurs.jpg \
  -m llava-v1.5-7b-Q4_K.gguf \
  --mmproj llava-v1.5-7b-mmproj-Q4_0.gguf \
  -e -p '### User: What do you see?\n### Assistant: ' \
  --no-display-prompt 2>/dev/null
```

It's possible to use BNF grammar to enforce that the output is predictable and safe to use in your shell script. The simplest grammar would be `--grammar 'root ::= "yes" | "no"'` to force the LLM to only print to standard output either `"yes\n"` or `"no\n"`. Another example is if you wanted to write a script to rename all your image files, you could say:

```sh
llamafile -ngl 9999 --temp 0 \
    --image lemurs.jpg \
    -m llava-v1.5-7b-Q4_K.gguf \
    --mmproj llava-v1.5-7b-mmproj-Q4_0.gguf \
    --grammar 'root ::= [a-z]+ (" " [a-z]+)+' \
    -e -p '### User: What do you see?\n### Assistant: ' \
    --no-display-prompt 2>/dev/null |
  sed -e's/ /_/g' -e's/$/.jpg/'
a_baby_monkey_on_the_back_of_a_mother.jpg
```

Here's an example of how to run llama.cpp's built-in HTTP server. This example uses LLaVA v1.5-7B, a multimodal LLM that works with llama.cpp's recently-added support for image inputs.

```sh
llamafile -ngl 9999 \
  -m llava-v1.5-7b-Q8_0.gguf \
  --mmproj llava-v1.5-7b-mmproj-Q8_0.gguf \
  --host 0.0.0.0
```

The above command will launch a browser tab on your personal computer to display a web interface. It lets you chat with your LLM and upload images to it.

## Creating llamafiles

If you want to be able to just say:

```sh
./llava.llamafile
```

...and have it run the web server without having to specify arguments, then you can embed both the weights and a special `.args` file inside, which specifies the default arguments. First, let's create a file named `.args` which has this content:

```sh
-m
llava-v1.5-7b-Q8_0.gguf
--mmproj
llava-v1.5-7b-mmproj-Q8_0.gguf
--host
0.0.0.0
-ngl
9999
...
```

As we can see above, there's one argument per line. The `...` argument optionally specifies where any additional CLI arguments passed by the user are to be inserted. Next, we'll add both the weights and the argument file to the executable:

```sh
cp /usr/local/bin/llamafile llava.llamafile

zipalign -j0 \
  llava.llamafile \
  llava-v1.5-7b-Q8_0.gguf \
  llava-v1.5-7b-mmproj-Q8_0.gguf \
  .args

./llava.llamafile
```

Congratulations. You've just made your own LLM executable that's easy to share with your friends.

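Because the `.args` file above ends with `...`, any flags you pass on the command line get slotted in at that position alongside the defaults. For example (the flag here is just an illustrative llama.cpp server option):

```sh
# extra flags are inserted where the ... placeholder appears in .args
./llava.llamafile --port 8081
```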
## Distribution

One good way to share a llamafile with your friends is by posting it on Hugging Face. If you do that, then it's recommended that you mention in your Hugging Face commit message what git revision or released version of llamafile you used when building your llamafile. That way everyone online will be able to verify the provenance of its executable content. If you've made changes to the llama.cpp or cosmopolitan source code, then the Apache 2.0 license requires you to explain what changed. One way you can do that is by embedding a notice in your llamafile using `zipalign` that describes the changes, and mention it in your Hugging Face commit.

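For instance, a plain-text notice can be added to the llamafile's embedded ZIP the same way weights and `.args` are added (the filename and wording here are just a suggestion):

```sh
# write a short change notice and embed it alongside the weights
cat > NOTICE.txt <<'EOF'
Built with llamafile 0.8.11; llama.cpp modified to <describe your changes>.
EOF
zipalign -j0 mymodel.llamafile NOTICE.txt
```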
## Documentation

There's a manual page for each of the llamafile programs installed when you run `sudo make install`. The command manuals are also typeset as PDF files that you can download from our GitHub releases page. Lastly, most commands will display that information when passed the `--help` flag.

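For example, after `sudo make install` you should be able to do the following (program names other than `llamafile` itself may vary by version):

```sh
man llamafile      # browse the manual page
llamafile --help   # print usage information
```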
## Running llamafile with models downloaded by third-party applications

This section answers the question *"I already have a model downloaded locally by application X, can I use it with llamafile?"*. The general answer is "yes, as long as those models are locally stored in GGUF format", but doing so can be more or less hacky depending on the application. A few examples (tested on a Mac) follow.

### LM Studio

[LM Studio](https://lmstudio.ai/) stores downloaded models in `~/.cache/lm-studio/models`, in subdirectories with the same name as the models (following HuggingFace's `account_name/model_name` format), with the same filename you saw when you chose to download the file.

So if you have downloaded e.g. the `llama-2-7b.Q2_K.gguf` file for `TheBloke/Llama-2-7B-GGUF`, you can run llamafile as follows:

```
cd ~/.cache/lm-studio/models/TheBloke/Llama-2-7B-GGUF
llamafile -m llama-2-7b.Q2_K.gguf
```

### Ollama

When you download a new model with [ollama](https://ollama.com), all its metadata will be stored in a manifest file under `~/.ollama/models/manifests/registry.ollama.ai/library/`. The directory and manifest file name are the model name as returned by `ollama list`. For instance, for `llama3:latest` the manifest file will be named `.ollama/models/manifests/registry.ollama.ai/library/llama3/latest`.

The manifest maps each file related to the model (e.g. GGUF weights, license, prompt template, etc) to a sha256 digest. The digest corresponding to the element whose `mediaType` is `application/vnd.ollama.image.model` is the one referring to the model's GGUF file.

Each sha256 digest is also used as a filename in the `~/.ollama/models/blobs` directory (if you look into that directory you'll see *only* those sha256-* filenames). This means you can directly run llamafile by passing the sha256 digest as the model filename. So if e.g. the `llama3:latest` GGUF file digest is `sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29`, you can run llamafile as follows:

```
cd ~/.ollama/models/blobs
llamafile -m sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29
```

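If you'd rather not read the manifest by hand, something along these lines can pull the digest out for you. This is only a sketch: it assumes the manifest is OCI-style JSON with a `layers` array and that `jq` is installed, and you'll need to adjust the manifest path to match what `ollama list` reports:

```sh
# extract the GGUF blob digest from an ollama manifest and run it with llamafile
manifest=~/.ollama/models/manifests/registry.ollama.ai/library/llama3/latest
digest=$(jq -r '.layers[] | select(.mediaType == "application/vnd.ollama.image.model") | .digest' "$manifest" | tr ':' '-')
llamafile -m ~/.ollama/models/blobs/"$digest"
```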
## Technical details

Here is a succinct overview of the tricks we used to create the fattest executable format ever. The long story short is that llamafile is a shell script that launches itself and runs inference on embedded weights in milliseconds without needing to be copied or installed. What makes that possible is mmap(). Both the llama.cpp executable and the weights are concatenated onto the shell script. A tiny loader program is then extracted by the shell script, which maps the executable into memory. The llama.cpp executable then opens the shell script again as a file, and calls mmap() again to pull the weights into memory and make them directly accessible to both the CPU and GPU.

### ZIP weights embedding

The trick to embedding weights inside llama.cpp executables is to ensure the local file is aligned on a page size boundary. That way, assuming the zip file is uncompressed, once it's mmap()'d into memory we can pass pointers directly to GPUs like Apple Metal, which require that data be page size aligned. Since no existing ZIP archiving tool has an alignment flag, we had to write about [500 lines of code](llamafile/zipalign.c) to insert the ZIP files ourselves. However, once there, every existing ZIP program should be able to read them, provided they support ZIP64. This makes the weights much more easily accessible than they otherwise would have been, had we invented our own file format for concatenated files.

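One practical consequence is that you can inspect what's inside a llamafile with any ordinary ZIP tool, assuming it supports ZIP64, for example:

```sh
unzip -l llava-v1.5-7b-q4.llamafile   # lists the embedded files (weights, .args, ...)
```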
### Microarchitectural portability

On Intel and AMD microprocessors, llama.cpp spends most of its time in the matmul quants, which are usually written thrice for SSSE3, AVX, and AVX2. llamafile pulls each of these functions out into a separate file that can be `#include`ed multiple times, with varying `__attribute__((__target__("arch")))` function attributes. Then, a wrapper function is added which uses Cosmopolitan's `X86_HAVE(FOO)` feature to dispatch at runtime to the appropriate implementation.

### Architecture portability

llamafile solves architecture portability by building llama.cpp twice: once for AMD64 and again for ARM64. It then wraps them with a shell script which has an MZ prefix. On Windows, it'll run as a native binary. On Linux, it'll extract a small 8kb executable called [APE Loader](https://github.com/jart/cosmopolitan/blob/master/ape/loader.c) to `${TMPDIR:-${HOME:-.}}/.ape` that'll map the binary portions of the shell script into memory. It's possible to avoid this process by running the [`assimilate`](https://github.com/jart/cosmopolitan/blob/master/tool/build/assimilate.c) program that comes included with the `cosmocc` compiler. What the `assimilate` program does is turn the shell script executable into the host platform's native executable format. This guarantees a fallback path exists for traditional release processes when it's needed.

### GPU support

Cosmopolitan Libc uses static linking, since that's the only way to get the same executable to run on six OSes. This presents a challenge for llama.cpp, because it's not possible to statically link GPU support. The way we solve that is by checking if a compiler is installed on the host system. For Apple, that would be Xcode, and for other platforms, that would be `nvcc`. llama.cpp has a single file implementation of each GPU module, named `ggml-metal.m` (Objective C) and `ggml-cuda.cu` (Nvidia C). llamafile embeds those source files within the zip archive and asks the platform compiler to build them at runtime, targeting the native GPU microarchitecture. If it works, then it's linked with the platform C library's dlopen() implementation. See [llamafile/cuda.c](llamafile/cuda.c) and [llamafile/metal.c](llamafile/metal.c).

In order to use the platform-specific dlopen() function, we need to ask the platform-specific compiler to build a small executable that exposes these interfaces. On ELF platforms, Cosmopolitan Libc maps this helper executable into memory along with the platform's ELF interpreter. The platform C library then takes care of linking all the GPU libraries, and then runs the helper program which longjmp()'s back into Cosmopolitan. The executable program is now in a weird hybrid state where two separate C libraries exist which have different ABIs. For example, thread local storage works differently on each operating system, and programs will crash if the TLS register doesn't point to the appropriate memory. The way Cosmopolitan Libc solves that on AMD64 is by using SSE to recompile the executable at runtime to change `%fs` register accesses into `%gs`, which takes a millisecond. On ARM, Cosmo uses the `x28` register for TLS, which can be made safe by passing the `-ffixed-x28` flag when compiling GPU modules. Lastly, llamafile uses the `__ms_abi__` attribute so that function pointers passed between the application and GPU modules conform to the Windows calling convention. Amazingly enough, every compiler we tested, including nvcc on Linux and even Objective-C on macOS, supports compiling WIN32-style functions, thus ensuring your llamafile will be able to talk to Windows drivers, when it's run on Windows, without needing to be recompiled as a separate file for Windows. See [cosmopolitan/dlopen.c](https://github.com/jart/cosmopolitan/blob/master/libc/dlopen/dlopen.c) for further details.

## A note about models

The example llamafiles provided above should not be interpreted as endorsements or recommendations of specific models, licenses, or data sets on the part of Mozilla.

## Security

llamafile adds pledge() and SECCOMP sandboxing to llama.cpp. This is enabled by default. It can be turned off by passing the `--unsecure` flag. Sandboxing is currently only supported on Linux and OpenBSD on systems without GPUs; on other platforms it'll simply log a warning.

Our approach to security has these benefits:

1. After it starts up, your HTTP server isn't able to access the filesystem at all. This is good, since it means if someone discovers a bug in the llama.cpp server, then it's much less likely they'll be able to access sensitive information on your machine or make changes to its configuration. On Linux, we're able to sandbox things even further; the only networking-related system call the HTTP server will be allowed to use after starting up is accept(). That further limits an attacker's ability to exfiltrate information, in the event that your HTTP server is compromised.

2. The main CLI command won't be able to access the network at all. This is enforced by the operating system kernel. It also won't be able to write to the file system. This keeps your computer safe in the event that a bug is ever discovered in the GGUF file format that lets an attacker craft malicious weights files and post them online. The only exception to this rule is if you pass the `--prompt-cache` flag without also specifying `--prompt-cache-ro`. In that case, security currently needs to be weakened to allow `cpath` and `wpath` access, but network access will remain forbidden.

Therefore your llamafile is able to protect itself against the outside world, but that doesn't mean you're protected from llamafile. Sandboxing is self-imposed. If you obtained your llamafile from an untrusted source then its author could have simply modified it to not do that. In that case, you can run the untrusted llamafile inside another sandbox, such as a virtual machine, to make sure it behaves how you expect.

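To recap the flags involved (the model filename below is a placeholder):

```sh
./model.llamafile                                          # pledge()/SECCOMP sandbox on by default (Linux, OpenBSD)
./model.llamafile --unsecure                               # explicitly disable the sandbox
./model.llamafile --prompt-cache state --prompt-cache-ro   # read a prompt cache without weakening the sandbox
```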
## Licensing

While the llamafile project is Apache 2.0-licensed, our changes to llama.cpp are licensed under MIT (just like the llama.cpp project itself) so as to remain compatible and upstreamable in the future, should that be desired.

The llamafile logo on this page was generated with the assistance of DALL·E 3.

[Star History Chart](https://star-history.com/#Mozilla-Ocho/llamafile&Date)