Devarui379 committed
Commit 4658c90 · verified · 1 Parent(s): 45d8aa5

Delete README.md

Files changed (1)
  1. README.md +0 -794
README.md DELETED
@@ -1,794 +0,0 @@
1
- # llamafile
2
-
3
- [![ci status](https://github.com/Mozilla-Ocho/llamafile/actions/workflows/ci.yml/badge.svg)](https://github.com/Mozilla-Ocho/llamafile/actions/workflows/ci.yml)<br/>
4
- [![](https://dcbadge.vercel.app/api/server/YuMNeuKStr)](https://discord.gg/YuMNeuKStr)<br/><br/>
5
-
6
- <img src="llamafile/llamafile-640x640.png" width="320" height="320"
7
- alt="[line drawing of llama animal head in front of slightly open manilla folder filled with files]">
8
-
9
- **llamafile lets you distribute and run LLMs with a single file. ([announcement blog post](https://hacks.mozilla.org/2023/11/introducing-llamafile/))**
10
-
11
- Llamafile aims to make open LLMs much more
12
- accessible to both developers and end users. It does this by
13
- combining [llama.cpp](https://github.com/ggerganov/llama.cpp) with [Cosmopolitan Libc](https://github.com/jart/cosmopolitan) into one
14
- framework that collapses all the complexity of LLMs down to
15
- a single-file executable (called a "llamafile") that runs
16
- locally on most computers, with no installation.<br/><br/>
17
-
18
- <a href="https://future.mozilla.org"><img src="llamafile/mozilla-logo-bw-rgb.png" width="150"></a><br/>
19
- llamafile is a Mozilla Builders project.<br/><br/>
20
-
21
- ## Quickstart
22
-
23
- The easiest way to try it for yourself is to download our example
24
- llamafile for the [NuExtract-v1.5](https://huggingface.co/Devarui379/numind.NuExtract-v1.5-Q5_K_M-llamafile)
25
- model. NuExtract is an LLM that specializes in extracting
26
- structured information (for example, JSON) from unstructured
27
- text. With llamafile, this all happens locally; no data
28
- ever leaves your computer.
29
-
30
- 1. Download [numind.NuExtract-v1.5.Q5_K_M.llamafile](https://huggingface.co/Devarui379/numind.NuExtract-v1.5-Q5_K_M-llamafile/resolve/main/numind.NuExtract-v1.5.Q5_K_M.llamafile?download=true) (4.29 GB).
31
-
32
- 2. Open your computer's terminal.
33
-
34
- 3. If you're using macOS, Linux, or BSD, you'll need to grant permission
35
- for your computer to execute this new file. (You only need to do this
36
- once.)
37
-
38
- ```sh
39
- chmod +x numind.NuExtract-v1.5.Q5_K_M.llamafile
40
- ```
41
-
42
- 4. If you're on Windows, rename the file by adding ".exe" on the end.
43
-
44
- 5. Run the llamafile, e.g.:
45
-
46
- ```sh
47
- ./numind.NuExtract-v1.5.Q5_K_M.llamafile
48
- ```
49
-
50
- 6. Your browser should open automatically and display a chat interface.
51
- (If it doesn't, just open your browser and point it at http://localhost:8080)
52
-
53
- 7. When you're done chatting, return to your terminal and hit
54
- `Control-C` to shut down llamafile.
55
-
56
- **Having trouble? See the "Gotchas" section below.**
57
-
58
- ### JSON API Quickstart
59
-
60
- When llamafile is started, in addition to hosting a web
61
- UI chat server at <http://127.0.0.1:8080/>, an [OpenAI
62
- API](https://platform.openai.com/docs/api-reference/chat) compatible
63
- chat completions endpoint is provided too. It's designed to support the
64
- most common OpenAI API use cases, in a way that runs entirely locally.
65
- We've also extended it with llama.cpp-specific features (e.g.
66
- mirostat). For further details on what fields and
67
- endpoints are available, refer to both the [OpenAI
68
- documentation](https://platform.openai.com/docs/api-reference/chat/create)
69
- and the [llamafile server
70
- README](llama.cpp/server/README.md#api-endpoints).
71
-
72
- <details>
73
- <summary>Curl API Client Example</summary>
74
-
75
- The simplest way to get started using the API is to copy and paste the
76
- following curl command into your terminal.
77
-
78
- ```shell
79
- curl http://localhost:8080/v1/chat/completions \
80
- -H "Content-Type: application/json" \
81
- -H "Authorization: Bearer no-key" \
82
- -d '{
83
- "model": "LLaMA_CPP",
84
- "messages": [
85
- {
86
- "role": "system",
87
- "content": "You are LLAMAfile, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
88
- },
89
- {
90
- "role": "user",
91
- "content": "Write a limerick about python exceptions"
92
- }
93
- ]
94
- }' | python3 -c '
95
- import json
96
- import sys
97
- json.dump(json.load(sys.stdin), sys.stdout, indent=2)
98
- print()
99
- '
100
- ```
101
-
102
- The response that's printed should look like the following:
103
-
104
- ```json
105
- {
106
- "choices" : [
107
- {
108
- "finish_reason" : "stop",
109
- "index" : 0,
110
- "message" : {
111
- "content" : "There once was a programmer named Mike\nWho wrote code that would often choke\nHe used try and except\nTo handle each step\nAnd his program ran without any hike.",
112
- "role" : "assistant"
113
- }
114
- }
115
- ],
116
- "created" : 1704199256,
117
- "id" : "chatcmpl-Dt16ugf3vF8btUZj9psG7To5tc4murBU",
118
- "model" : "LLaMA_CPP",
119
- "object" : "chat.completion",
120
- "usage" : {
121
- "completion_tokens" : 38,
122
- "prompt_tokens" : 78,
123
- "total_tokens" : 116
124
- }
125
- }
126
- ```
127
-
128
- </details>
129
-
130
- <details>
131
- <summary>Python API Client example</summary>
132
-
133
- If you've already developed your software using the [`openai` Python
134
- package](https://pypi.org/project/openai/) (that's published by OpenAI)
135
- then you should be able to port your app to talk to llamafile instead,
136
- by making a few changes to `base_url` and `api_key`. This example
137
- assumes you've run `pip3 install openai` to install OpenAI's client
138
- software, which is required by this example. Their package is just a
139
- simple Python wrapper around the OpenAI API interface, which can be
140
- implemented by any server.
141
-
142
- ```python
143
- #!/usr/bin/env python3
144
- from openai import OpenAI
145
- client = OpenAI(
146
- base_url="http://localhost:8080/v1", # "http://<Your api-server IP>:port"
147
- api_key = "sk-no-key-required"
148
- )
149
- completion = client.chat.completions.create(
150
- model="LLaMA_CPP",
151
- messages=[
152
- {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
153
- {"role": "user", "content": "Write a limerick about python exceptions"}
154
- ]
155
- )
156
- print(completion.choices[0].message)
157
- ```
158
-
159
- The above code will return a Python object like this:
160
-
161
- ```python
162
- ChatCompletionMessage(content='There once was a programmer named Mike\nWho wrote code that would often strike\nAn error would occur\nAnd he\'d shout "Oh no!"\nBut Python\'s exceptions made it all right.', role='assistant', function_call=None, tool_calls=None)
163
- ```
164
-
165
- </details>
166
-
167
-
168
- ## Other example llamafiles
169
-
170
- We also provide example llamafiles for other models, so you can easily
171
- try out llamafile with different kinds of LLMs.
172
-
173
- | Model | Size | License | llamafile | other quants |
174
- | --- | --- | --- | --- | --- |
175
- | LLaMA 3.2 3B Instruct | 2.62 GB | [LLaMA 3.2](https://huggingface.co/Mozilla/Llama-3.2-3B-Instruct-llamafile/blob/main/LICENSE) | [Llama-3.2-3B-Instruct.Q6\_K.llamafile](https://huggingface.co/Mozilla/Llama-3.2-3B-Instruct-llamafile/blob/main/Llama-3.2-3B-Instruct.Q6_K.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/Llama-3.2-3B-Instruct-llamafile) |
176
- | LLaMA 3.2 1B Instruct | 1.11 GB | [LLaMA 3.2](https://huggingface.co/Mozilla/Llama-3.2-1B-Instruct-llamafile/blob/main/LICENSE) | [Llama-3.2-1B-Instruct.Q6\_K.llamafile](https://huggingface.co/Mozilla/Llama-3.2-1B-Instruct-llamafile/blob/main/Llama-3.2-1B-Instruct.Q6_K.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/Llama-3.2-1B-Instruct-llamafile) |
177
- | Gemma 2 2B Instruct | 2.32 GB | [Gemma 2](https://huggingface.co/Mozilla/gemma-2-2b-it-llamafile/blob/main/LICENSE) | [gemma-2-2b-it.Q6\_K.llamafile](https://huggingface.co/Mozilla/gemma-2-2b-it-llamafile/blob/main/gemma-2-2b-it.Q6_K.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/gemma-2-2b-it-llamafile) |
178
- | Gemma 2 9B Instruct | 7.76 GB | [Gemma 2](https://huggingface.co/Mozilla/gemma-2-9b-it-llamafile/blob/main/LICENSE) | [gemma-2-9b-it.Q6\_K.llamafile](https://huggingface.co/Mozilla/gemma-2-9b-it-llamafile/blob/main/gemma-2-9b-it.Q6_K.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/gemma-2-9b-it-llamafile) |
179
- | Gemma 2 27B Instruct | 22.5 GB | [Gemma 2](https://huggingface.co/Mozilla/gemma-2-27b-it-llamafile/blob/main/LICENSE) | [gemma-2-27b-it.Q6\_K.llamafile](https://huggingface.co/Mozilla/gemma-2-27b-it-llamafile/blob/main/gemma-2-27b-it.Q6_K.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/gemma-2-27b-it-llamafile) |
180
- | LLaVA 1.5 | 3.97 GB | [LLaMA 2](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) | [llava-v1.5-7b-q4.llamafile](https://huggingface.co/Mozilla/llava-v1.5-7b-llamafile/resolve/main/llava-v1.5-7b-q4.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/llava-v1.5-7b-llamafile) |
181
- | TinyLlama-1.1B | 2.05 GB | [Apache 2.0](https://choosealicense.com/licenses/apache-2.0/) | [TinyLlama-1.1B-Chat-v1.0.F16.llamafile](https://huggingface.co/Mozilla/TinyLlama-1.1B-Chat-v1.0-llamafile/resolve/main/TinyLlama-1.1B-Chat-v1.0.F16.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/TinyLlama-1.1B-Chat-v1.0-llamafile) |
182
- | Mistral-7B-Instruct | 3.85 GB | [Apache 2.0](https://choosealicense.com/licenses/apache-2.0/) | [mistral-7b-instruct-v0.2.Q4\_0.llamafile](https://huggingface.co/Mozilla/Mistral-7B-Instruct-v0.2-llamafile/resolve/main/mistral-7b-instruct-v0.2.Q4_0.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/Mistral-7B-Instruct-v0.2-llamafile) |
183
- | Phi-3-mini-4k-instruct | 7.67 GB | [Apache 2.0](https://huggingface.co/Mozilla/Phi-3-mini-4k-instruct-llamafile/blob/main/LICENSE) | [Phi-3-mini-4k-instruct.F16.llamafile](https://huggingface.co/Mozilla/Phi-3-mini-4k-instruct-llamafile/resolve/main/Phi-3-mini-4k-instruct.F16.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/Phi-3-mini-4k-instruct-llamafile) |
184
- | Mixtral-8x7B-Instruct | 30.03 GB | [Apache 2.0](https://choosealicense.com/licenses/apache-2.0/) | [mixtral-8x7b-instruct-v0.1.Q5\_K\_M.llamafile](https://huggingface.co/Mozilla/Mixtral-8x7B-Instruct-v0.1-llamafile/resolve/main/mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/Mixtral-8x7B-Instruct-v0.1-llamafile) |
185
- | WizardCoder-Python-34B | 22.23 GB | [LLaMA 2](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) | [wizardcoder-python-34b-v1.0.Q5\_K\_M.llamafile](https://huggingface.co/Mozilla/WizardCoder-Python-34B-V1.0-llamafile/resolve/main/wizardcoder-python-34b-v1.0.Q5_K_M.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/WizardCoder-Python-34B-V1.0-llamafile) |
186
- | WizardCoder-Python-13B | 7.33 GB | [LLaMA 2](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) | [wizardcoder-python-13b.llamafile](https://huggingface.co/jartine/wizardcoder-13b-python/resolve/main/wizardcoder-python-13b.llamafile?download=true) | [See HF repo](https://huggingface.co/jartine/wizardcoder-13b-python) |
187
- | LLaMA-3-Instruct-70B | 37.25 GB | [llama3](https://huggingface.co/Mozilla/Meta-Llama-3-8B-Instruct-llamafile/blob/main/Meta-Llama-3-Community-License-Agreement.txt) | [Meta-Llama-3-70B-Instruct.Q4\_0.llamafile](https://huggingface.co/Mozilla/Meta-Llama-3-70B-Instruct-llamafile/resolve/main/Meta-Llama-3-70B-Instruct.Q4_0.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/Meta-Llama-3-70B-Instruct-llamafile) |
188
- | LLaMA-3-Instruct-8B | 5.37 GB | [llama3](https://huggingface.co/Mozilla/Meta-Llama-3-8B-Instruct-llamafile/blob/main/Meta-Llama-3-Community-License-Agreement.txt) | [Meta-Llama-3-8B-Instruct.Q5\_K\_M.llamafile](https://huggingface.co/Mozilla/Meta-Llama-3-8B-Instruct-llamafile/resolve/main/Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/Meta-Llama-3-8B-Instruct-llamafile) |
189
- | Rocket-3B | 1.89 GB | [cc-by-sa-4.0](https://creativecommons.org/licenses/by-sa/4.0/deed.en) | [rocket-3b.Q5\_K\_M.llamafile](https://huggingface.co/Mozilla/rocket-3B-llamafile/resolve/main/rocket-3b.Q5_K_M.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/rocket-3B-llamafile) |
190
- | OLMo-7B | 5.68 GB | [Apache 2.0](https://huggingface.co/Mozilla/OLMo-7B-0424-llamafile/blob/main/LICENSE) | [OLMo-7B-0424.Q6\_K.llamafile](https://huggingface.co/Mozilla/OLMo-7B-0424-llamafile/resolve/main/OLMo-7B-0424.Q6_K.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/OLMo-7B-0424-llamafile) |
191
- | *Text Embedding Models* | | | | |
192
- | E5-Mistral-7B-Instruct | 5.16 GB | [MIT](https://choosealicense.com/licenses/mit/) | [e5-mistral-7b-instruct-Q5_K_M.llamafile](https://huggingface.co/Mozilla/e5-mistral-7b-instruct/resolve/main/e5-mistral-7b-instruct-Q5_K_M.llamafile?download=true) | [See HF repo](https://huggingface.co/Mozilla/e5-mistral-7b-instruct) |
193
- | mxbai-embed-large-v1 | 0.7 GB | [Apache 2.0](https://choosealicense.com/licenses/apache-2.0/) | [mxbai-embed-large-v1-f16.llamafile](https://huggingface.co/Mozilla/mxbai-embed-large-v1-llamafile/resolve/main/mxbai-embed-large-v1-f16.llamafile?download=true) | [See HF Repo](https://huggingface.co/Mozilla/mxbai-embed-large-v1-llamafile) |
194
-
195
- Here is an example for the Mistral command-line llamafile:
196
-
197
- ```sh
198
- ./mistral-7b-instruct-v0.2.Q4_0.llamafile --temp 0.7 -p '[INST]Write a story about llamas[/INST]'
199
- ```
200
-
201
- And here is an example for WizardCoder-Python command-line llamafile:
202
-
203
- ```sh
204
- ./wizardcoder-python-13b.llamafile --temp 0 -e -r '```\n' -p '```c\nvoid *memcpy_sse2(char *dst, const char *src, size_t size) {\n'
205
- ```
206
-
207
- And here's an example for the LLaVA command-line llamafile:
208
-
209
- ```sh
210
- ./llava-v1.5-7b-q4.llamafile --temp 0.2 --image lemurs.jpg -e -p '### User: What do you see?\n### Assistant:'
211
- ```
212
-
213
- As before, macOS, Linux, and BSD users will need to use the "chmod"
214
- command to grant execution permissions to the file before running these
215
- llamafiles for the first time.
216
-
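For example, here's a minimal sketch using the Llama 3.2 1B llamafile from the table above (substitute whichever file you downloaded):

```sh
# grant execute permission once, then run it
chmod +x Llama-3.2-1B-Instruct.Q6_K.llamafile
./Llama-3.2-1B-Instruct.Q6_K.llamafile
```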
217
- Unfortunately, Windows users cannot make use of many of these example
218
- llamafiles because Windows has a maximum executable file size of 4GB,
219
- and several of these examples exceed that size. (The LLaVA llamafile works
220
- on Windows because it is 30MB shy of the size limit.) But don't lose
221
- heart: llamafile allows you to use external weights; this is described
222
- later in this document.
223
-
224
- **Having trouble? See the "Gotchas" section below.**
225
-
226
- ## How llamafile works
227
-
228
- A llamafile is an executable LLM that you can run on your own
229
- computer. It contains the weights for a given open LLM, as well
230
- as everything needed to actually run that model on your computer.
231
- There's nothing to install or configure (with a few caveats, discussed
232
- in subsequent sections of this document).
233
-
234
- This is all accomplished by combining llama.cpp with Cosmopolitan Libc,
235
- which provides some useful capabilities:
236
-
237
- 1. llamafiles can run on multiple CPU microarchitectures. We
238
- added runtime dispatching to llama.cpp that lets new Intel systems use
239
- modern CPU features without trading away support for older computers.
240
-
241
- 2. llamafiles can run on multiple CPU architectures. We do
242
- that by concatenating AMD64 and ARM64 builds with a shell script that
243
- launches the appropriate one. Our file format is compatible with WIN32
244
- and most UNIX shells. It can also easily be converted (by either
245
- you or your users) to the platform-native format, whenever required.
246
-
247
- 3. llamafiles can run on six OSes (macOS, Windows, Linux,
248
- FreeBSD, OpenBSD, and NetBSD). If you make your own llamafiles, you'll
249
- only need to build your code once, using a Linux-style toolchain. The
250
- GCC-based compiler we provide is itself an Actually Portable Executable,
251
- so you can build your software for all six OSes from the comfort of
252
- whichever one you prefer most for development.
253
-
254
- 4. The weights for an LLM can be embedded within the llamafile.
255
- We added support for PKZIP to the GGML library. This lets uncompressed
256
- weights be mapped directly into memory, similar to a self-extracting
257
- archive. It enables quantized weights distributed online to be prefixed
258
- with a compatible version of the llama.cpp software, thereby ensuring
259
- its originally observed behaviors can be reproduced indefinitely.
260
-
261
- 5. Finally, with the tools included in this project you can create your
262
- *own* llamafiles, using any compatible model weights you want. You can
263
- then distribute these llamafiles to other people, who can easily make
264
- use of them regardless of what kind of computer they have.
265
-
266
- ## Using llamafile with external weights
267
-
268
- Even though our example llamafiles have the weights built-in, you don't
269
- *have* to use llamafile that way. Instead, you can download *just* the
270
- llamafile software (without any weights included) from our releases page.
271
- You can then use it alongside any external weights you may have on hand.
272
- External weights are particularly useful for Windows users because they
273
- enable you to work around Windows' 4GB executable file size limit.
274
-
275
- For Windows users, here's an example for the Mistral LLM:
276
-
277
- ```sh
278
- curl -L -o llamafile.exe https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.11/llamafile-0.8.11
279
- curl -L -o mistral.gguf https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
280
- ./llamafile.exe -m mistral.gguf
281
- ```
282
-
283
- Windows users may need to change `./llamafile.exe` to `.\llamafile.exe`
284
- when running the above command.
285
-
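The same approach works on macOS, Linux, and BSD. Here's a minimal sketch using the same release binary and weights as above:

```sh
curl -L -o llamafile https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.11/llamafile-0.8.11
curl -L -o mistral.gguf https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
chmod +x llamafile
./llamafile -m mistral.gguf
```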
286
-
287
- ## Gotchas and troubleshooting
288
-
289
- On any platform, if your llamafile process is immediately killed, check
290
- if you have CrowdStrike and then ask to be whitelisted.
291
-
292
- ### Mac
293
-
294
- On macOS with Apple Silicon you need to have Xcode Command Line Tools
295
- installed for llamafile to be able to bootstrap itself.
296
-
297
- If you use zsh and have trouble running llamafile, try saying `sh -c
298
- ./llamafile`. This is due to a bug that was fixed in zsh 5.9+. The same
299
- is the case for Python `subprocess`, old versions of Fish, etc.
300
-
301
-
302
- #### Mac error "... cannot be opened because the developer cannot be verified"
303
-
304
- 1. Immediately launch System Settings, then go to Privacy & Security. llamafile should be listed at the bottom, with a button to Allow.
305
- 2. If not, then change your command in the Terminal to be `sudo spctl --master-disable; [llama launch command]; sudo spctl --master-enable`. This is because `--master-disable` disables _all_ checking, so you need to turn it back on after quitting llama.
306
-
307
- ### Linux
308
-
309
- On some Linux systems, you might get errors relating to `run-detectors`
310
- or WINE. This is due to `binfmt_misc` registrations. You can fix that by
311
- adding an additional registration for the APE file format llamafile
312
- uses:
313
-
314
- ```sh
315
- sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
316
- sudo chmod +x /usr/bin/ape
317
- sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
318
- sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
319
- ```
320
-
321
- ### Windows
322
- As mentioned above, on Windows you may need to rename your llamafile by
323
- adding `.exe` to the filename.
324
-
325
- As mentioned above, Windows also has a maximum file size limit of 4GB
326
- for executables. The LLaVA server executable above is just 30MB shy of
327
- that limit, so it'll work on Windows, but with larger models like
328
- WizardCoder 13B, you need to store the weights in a separate file. An
329
- example is provided above; see "Using llamafile with external weights."
330
-
331
- On WSL, there are many possible gotchas. One thing that helps solve them
332
- completely is installing a systemd unit that registers the APE format:
333
-
334
- ```
335
- [Unit]
336
- Description=cosmopolitan APE binfmt service
337
- After=wsl-binfmt.service
338
-
339
- [Service]
340
- Type=oneshot
341
- ExecStart=/bin/sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
342
-
343
- [Install]
344
- WantedBy=multi-user.target
345
- ```
346
-
347
- Put that in `/etc/systemd/system/cosmo-binfmt.service`.
348
-
349
- Then run `sudo systemctl enable cosmo-binfmt`.
350
-
351
- Another thing that's helped WSL users who experience issues is to
352
- disable the WIN32 interop feature:
353
-
354
- ```sh
355
- sudo sh -c "echo -1 > /proc/sys/fs/binfmt_misc/WSLInterop"
356
- ```
357
-
358
- If you get a `Permission Denied` error when disabling interop
359
- through the CLI, you can permanently disable it by adding the following to
360
- `/etc/wsl.conf`:
361
-
362
- ```sh
363
- [interop]
364
- enabled=false
365
- ```
366
-
367
- ## Supported OSes
368
-
369
- llamafile supports the following operating systems, which require a minimum
370
- stock install:
371
-
372
- - Linux 2.6.18+ (i.e. every distro since RHEL5 c. 2007)
373
- - Darwin (macOS) 23.1.0+ [1] (GPU is only supported on ARM64)
374
- - Windows 10+ (AMD64 only)
375
- - FreeBSD 13+
376
- - NetBSD 9.2+ (AMD64 only)
377
- - OpenBSD 7+ (AMD64 only)
378
-
379
- On Windows, llamafile runs as a native portable executable. On UNIX
380
- systems, llamafile extracts a small loader program named `ape` to
381
- `$TMPDIR/.llamafile` or `~/.ape-1.9` which is used to map your model
382
- into memory.
383
-
384
- [1] Darwin kernel versions 15.6+ *should* be supported, but we currently
385
- have no way of testing that.
386
-
387
- ## Supported CPUs
388
-
389
- llamafile supports the following CPUs:
390
-
391
- - **AMD64** microprocessors must have AVX. Otherwise llamafile will
392
- print an error and refuse to run. This means that if you have an Intel
393
- CPU, it needs to be Intel Core or newer (circa 2006+), and if you have
394
- an AMD CPU, then it needs to be K8 or newer (circa 2003+). Support for
395
- AVX512, AVX2, FMA, F16C, and VNNI is conditionally enabled at runtime
396
- if you have a newer CPU. For example, Zen4 has very good AVX512 that
397
- can speed up BF16 llamafiles. (A quick Linux check is sketched after this list.)
398
-
399
- - **ARM64** microprocessors must have ARMv8a+. This means everything
400
- from Apple Silicon to 64-bit Raspberry Pis will work, provided your
401
- weights fit into memory.
402
-
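As a quick sanity check on Linux, you can list the AVX-family flags your CPU advertises (a sketch; other operating systems expose this information differently):

```sh
# prints avx, avx2, avx512f, etc. if present; empty output means no AVX
grep -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u
```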
403
- ## GPU support
404
-
405
- llamafile supports the following kinds of GPUs:
406
-
407
- - Apple Metal
408
- - NVIDIA
409
- - AMD
410
-
411
- GPU support on macOS ARM64 works by compiling a small module using the
412
- Xcode Command Line Tools, which need to be installed. This is a one-time
413
- cost that happens the first time you run your llamafile. The DSO built
414
- by llamafile is stored in `$TMPDIR/.llamafile` or `$HOME/.llamafile`.
415
- Offloading to GPU is enabled by default when a Metal GPU is present.
416
- This can be disabled by passing `-ngl 0` or `--gpu disable` to force
417
- llamafile to perform CPU inference.
418
-
419
- Owners of NVIDIA and AMD graphics cards need to pass the `-ngl 999` flag
420
- to enable maximum offloading. If multiple GPUs are present then the work
421
- will be divided evenly among them by default, so you can load larger
422
- models. Multiple GPU support may be broken on AMD Radeon systems. If
423
- that happens to you, then use `export HIP_VISIBLE_DEVICES=0` which
424
- forces llamafile to only use the first GPU.
425
-
426
- Windows users are encouraged to use our release binaries, because they
427
- contain prebuilt DLLs for both NVIDIA and AMD graphics cards, which only
428
- depend on the graphics driver being installed. If llamafile detects that
429
- NVIDIA's CUDA SDK or AMD's ROCm HIP SDK are installed, then llamafile
430
- will try to build a faster DLL that uses cuBLAS or rocBLAS. In order for
431
- llamafile to successfully build a cuBLAS module, it needs to be run on
432
- the x64 MSVC command prompt. You can use CUDA via WSL by enabling
433
- [Nvidia CUDA on
434
- WSL](https://learn.microsoft.com/en-us/windows/ai/directml/gpu-cuda-in-wsl)
435
- and running your llamafiles inside of WSL. Using WSL has the added
436
- benefit of letting you run llamafiles greater than 4GB on Windows.
437
-
438
- On Linux, NVIDIA users will need to install the CUDA SDK (ideally using
439
- the shell script installer) and ROCm users need to install the HIP SDK.
440
- They're detected by looking to see if `nvcc` or `hipcc` are on the PATH.
441
-
442
- If you have both an AMD GPU *and* an NVIDIA GPU in your machine, then
443
- you may need to qualify which one you want used, by passing either
444
- `--gpu amd` or `--gpu nvidia`.
445
-
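For example, here's a short recap of the GPU flags discussed above, using the Mistral llamafile from the table earlier as a stand-in (substitute your own file):

```sh
# offload as many layers as possible to the GPU
./mistral-7b-instruct-v0.2.Q4_0.llamafile -ngl 999

# pick a vendor explicitly if both AMD and NVIDIA GPUs are installed
./mistral-7b-instruct-v0.2.Q4_0.llamafile -ngl 999 --gpu nvidia

# work around broken multi-GPU support on AMD Radeon systems
HIP_VISIBLE_DEVICES=0 ./mistral-7b-instruct-v0.2.Q4_0.llamafile -ngl 999 --gpu amd

# force CPU-only inference
./mistral-7b-instruct-v0.2.Q4_0.llamafile -ngl 0
```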
446
- In the event that GPU support couldn't be compiled and dynamically
447
- linked on the fly for any reason, llamafile will fall back to CPU
448
- inference.
449
-
450
- ## Source installation
451
-
452
- Developing on llamafile requires a modern version of the GNU `make`
453
- command (called `gmake` on some systems), `sha256sum` (otherwise `cc`
454
- will be used to build it), `wget` (or `curl`), and `unzip` available at
455
- [https://cosmo.zip/pub/cosmos/bin/](https://cosmo.zip/pub/cosmos/bin/).
456
- Windows users need [cosmos bash](https://justine.lol/cosmo3/) shell too.
457
-
458
- ```sh
459
- make -j8
460
- sudo make install PREFIX=/usr/local
461
- ```
462
-
463
- Here's an example of how to generate code for a libc function using the
464
- llama.cpp command line interface, utilizing WizardCoder-Python-13B
465
- weights:
466
-
467
- ```sh
468
- llamafile \
469
- -m wizardcoder-python-13b-v1.0.Q8_0.gguf \
470
- --temp 0 -r '}\n' -r '```\n' \
471
- -e -p '```c\nvoid *memcpy(void *dst, const void *src, size_t size) {\n'
472
- ```
473
-
474
- Here's a similar example that instead utilizes Mistral-7B-Instruct
475
- weights for prose composition:
476
-
477
- ```sh
478
- llamafile -ngl 9999 \
479
- -m mistral-7b-instruct-v0.1.Q4_K_M.gguf \
480
- -p '[INST]Write a story about llamas[/INST]'
481
- ```
482
-
483
- Here's an example of how llamafile can be used as an interactive chatbot
484
- that lets you query knowledge contained in training data:
485
-
486
- ```sh
487
- llamafile -m llama-65b-Q5_K.gguf -p '
488
- The following is a conversation between a Researcher and their helpful AI assistant Digital Athena which is a large language model trained on the sum of human knowledge.
489
- Researcher: Good morning.
490
- Digital Athena: How can I help you today?
491
- Researcher:' --interactive --color --batch_size 1024 --ctx_size 4096 \
492
- --keep -1 --temp 0 --mirostat 2 --in-prefix ' ' --interactive-first \
493
- --in-suffix 'Digital Athena:' --reverse-prompt 'Researcher:'
494
- ```
495
-
496
- Here's an example of how you can use llamafile to summarize HTML URLs:
497
-
498
- ```sh
499
- (
500
- echo '[INST]Summarize the following text:'
501
- links -codepage utf-8 \
502
- -force-html \
503
- -width 500 \
504
- -dump https://www.poetryfoundation.org/poems/48860/the-raven |
505
- sed 's/  */ /g'
506
- echo '[/INST]'
507
- ) | llamafile -ngl 9999 \
508
- -m mistral-7b-instruct-v0.2.Q5_K_M.gguf \
509
- -f /dev/stdin \
510
- -c 0 \
511
- --temp 0 \
512
- -n 500 \
513
- --no-display-prompt 2>/dev/null
514
- ```
515
-
516
- Here's how you can use llamafile to describe a jpg/png/gif/bmp image:
517
-
518
- ```sh
519
- llamafile -ngl 9999 --temp 0 \
520
- --image ~/Pictures/lemurs.jpg \
521
- -m llava-v1.5-7b-Q4_K.gguf \
522
- --mmproj llava-v1.5-7b-mmproj-Q4_0.gguf \
523
- -e -p '### User: What do you see?\n### Assistant: ' \
524
- --no-display-prompt 2>/dev/null
525
- ```
526
-
527
- It's possible to use a BNF grammar to ensure the output is predictable
528
- and safe to use in your shell script. The simplest grammar would be
529
- `--grammar 'root ::= "yes" | "no"'` to force the LLM to only print to
530
- standard output either `"yes\n"` or `"no\n"`. Another example is if you
531
- wanted to write a script to rename all your image files, you could say:
532
-
533
- ```sh
534
- llamafile -ngl 9999 --temp 0 \
535
- --image lemurs.jpg \
536
- -m llava-v1.5-7b-Q4_K.gguf \
537
- --mmproj llava-v1.5-7b-mmproj-Q4_0.gguf \
538
- --grammar 'root ::= [a-z]+ (" " [a-z]+)+' \
539
- -e -p '### User: What do you see?\n### Assistant: ' \
540
- --no-display-prompt 2>/dev/null |
541
- sed -e's/ /_/g' -e's/$/.jpg/'
542
- a_baby_monkey_on_the_back_of_a_mother.jpg
543
- ```
544
-
545
- Here's an example of how to run llama.cpp's built-in HTTP server. This
546
- example uses LLaVA v1.5-7B, a multimodal LLM that works with llama.cpp's
547
- recently-added support for image inputs.
548
-
549
- ```sh
550
- llamafile -ngl 9999 \
551
- -m llava-v1.5-7b-Q8_0.gguf \
552
- --mmproj llava-v1.5-7b-mmproj-Q8_0.gguf \
553
- --host 0.0.0.0
554
- ```
555
-
556
- The above command will launch a browser tab on your personal computer to
557
- display a web interface. It lets you chat with your LLM and upload
558
- images to it.
559
-
560
- ## Creating llamafiles
561
-
562
- If you want to be able to just say:
563
-
564
- ```sh
565
- ./llava.llamafile
566
- ```
567
-
568
- ...and have it run the web server without having to specify arguments,
569
- then you can embed both the weights and a special `.args` inside, which
570
- specifies the default arguments. First, let's create a file named
571
- `.args` which has this content:
572
-
573
- ```sh
574
- -m
575
- llava-v1.5-7b-Q8_0.gguf
576
- --mmproj
577
- llava-v1.5-7b-mmproj-Q8_0.gguf
578
- --host
579
- 0.0.0.0
580
- -ngl
581
- 9999
582
- ...
583
- ```
584
-
585
- As we can see above, there's one argument per line. The `...` argument
586
- optionally specifies where any additional CLI arguments passed by the
587
- user are to be inserted. Next, we'll add both the weights and the
588
- argument file to the executable:
589
-
590
- ```sh
591
- cp /usr/local/bin/llamafile llava.llamafile
592
-
593
- zipalign -j0 \
594
- llava.llamafile \
595
- llava-v1.5-7b-Q8_0.gguf \
596
- llava-v1.5-7b-mmproj-Q8_0.gguf \
597
- .args
598
-
599
- ./llava.llamafile
600
- ```
601
-
602
- Congratulations. You've just made your own LLM executable that's easy to
603
- share with your friends.
604
-
605
- ## Distribution
606
-
607
- One good way to share a llamafile with your friends is by posting it on
608
- Hugging Face. If you do that, then it's recommended that you mention in
609
- your Hugging Face commit message what git revision or released version
610
- of llamafile you used when building your llamafile. That way everyone
611
- online will be able to verify the provenance of its executable content. If
612
- you've made changes to the llama.cpp or cosmopolitan source code, then
613
- the Apache 2.0 license requires you to explain what changed. One way you
614
- can do that is by embedding a notice in your llamafile using `zipalign`
615
- that describes the changes, and mention it in your Hugging Face commit.
616
-
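For example, here's a minimal sketch of embedding such a notice, assuming `zipalign` can add a text file to an existing llamafile the same way it adds weights and `.args` in the previous section (`mymodel.llamafile` and `NOTICE.md` are placeholder names):

```sh
# append a notice describing your llama.cpp/cosmopolitan changes to the llamafile
zipalign -j0 mymodel.llamafile NOTICE.md
```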
617
- ## Documentation
618
-
619
- There's a manual page for each of the llamafile programs installed when you
620
- run `sudo make install`. The command manuals are also typeset as PDF
621
- files that you can download from our GitHub releases page. Lastly, most
622
- commands will display that information when passing the `--help` flag.
623
-
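For instance (a sketch, assuming you've already run `sudo make install`):

```sh
man llamafile       # the manual page installed by `sudo make install`
llamafile --help    # most llamafile commands also print usage with --help
```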
624
- ## Running llamafile with models downloaded by third-party applications
625
-
626
- This section answers the question *"I already have a model downloaded locally by application X, can I use it with llamafile?"*. The general answer is "yes, as long as those models are locally stored in GGUF format", but how hacky that is depends on the application. A few examples (tested on a Mac) follow.
627
-
628
- ### LM Studio
629
- [LM Studio](https://lmstudio.ai/) stores downloaded models in `~/.cache/lm-studio/models`, in subdirectories with the same names as the models (following HuggingFace's `account_name/model_name` format), with the same filename you saw when you chose to download the file.
630
-
631
- So if you have downloaded e.g. the `llama-2-7b.Q2_K.gguf` file for `TheBloke/Llama-2-7B-GGUF`, you can run llamafile as follows:
632
-
633
- ```
634
- cd ~/.cache/lm-studio/models/TheBloke/Llama-2-7B-GGUF
635
- llamafile -m llama-2-7b.Q2_K.gguf
636
- ```
637
-
638
- ### Ollama
639
-
640
- When you download a new model with [ollama](https://ollama.com), all its metadata will be stored in a manifest file under `~/.ollama/models/manifests/registry.ollama.ai/library/`. The directory and manifest file name are the model name as returned by `ollama list`. For instance, for `llama3:latest` the manifest file will be named `.ollama/models/manifests/registry.ollama.ai/library/llama3/latest`.
641
-
642
- The manifest maps each file related to the model (e.g. GGUF weights, license, prompt template, etc) to a sha256 digest. The digest corresponding to the element whose `mediaType` is `application/vnd.ollama.image.model` is the one referring to the model's GGUF file.
643
-
644
- Each sha256 digest is also used as a filename in the `~/.ollama/models/blobs` directory (if you look into that directory you'll see *only* those sha256-* filenames). This means you can directly run llamafile by passing the sha256 digest as the model filename. So if e.g. the `llama3:latest` GGUF file digest is `sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29`, you can run llamafile as follows:
645
-
646
- ```
647
- cd ~/.ollama/models/blobs
648
- llamafile -m sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29
649
- ```
650
-
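If you'd rather not read the manifest by hand, here's a small sketch that prints the blob filename of the model layer. It assumes the manifest is OCI-style JSON with a `layers` array whose entries carry `mediaType` and `digest` fields, and that blob filenames replace the `:` in the digest with a `-`, as in the example above:

```sh
manifest=~/.ollama/models/manifests/registry.ollama.ai/library/llama3/latest
python3 -c '
import json, sys
for layer in json.load(open(sys.argv[1]))["layers"]:
    if layer["mediaType"] == "application/vnd.ollama.image.model":
        print(layer["digest"].replace(":", "-"))
' "$manifest"
```

The printed name can then be passed to `llamafile -m` from inside `~/.ollama/models/blobs`, exactly as in the example above.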
651
- ## Technical details
652
-
653
- Here is a succinct overview of the tricks we used to create the fattest
654
- executable format ever. Long story short, llamafile is a shell
655
- script that launches itself and runs inference on embedded weights in
656
- milliseconds without needing to be copied or installed. What makes that
657
- possible is mmap(). Both the llama.cpp executable and the weights are
658
- concatenated onto the shell script. A tiny loader program is then
659
- extracted by the shell script, which maps the executable into memory.
660
- The llama.cpp executable then opens the shell script again as a file,
661
- and calls mmap() again to pull the weights into memory and make them
662
- directly accessible to both the CPU and GPU.
663
-
664
- ### ZIP weights embedding
665
-
666
- The trick to embedding weights inside llama.cpp executables is to ensure
667
- the local file is aligned on a page size boundary. That way, assuming
668
- the zip file is uncompressed, once it's mmap()'d into memory we can pass
669
- pointers directly to GPUs like Apple Metal, which require that data be
670
- page size aligned. Since no existing ZIP archiving tool has an alignment
671
- flag, we had to write about [500 lines of code](llamafile/zipalign.c) to
672
- insert the ZIP files ourselves. However, once there, every existing ZIP
673
- program should be able to read them, provided they support ZIP64. This
674
- makes the weights much more easily accessible than they otherwise would
675
- have been, had we invented our own file format for concatenated files.
676
-
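For example (a sketch, assuming a ZIP64-capable `unzip` and the LLaVA llamafile from the table above):

```sh
# list the files embedded in a llamafile with an ordinary ZIP tool
unzip -l llava-v1.5-7b-q4.llamafile

# the embedded GGUF weights can be extracted like any other zip member
unzip llava-v1.5-7b-q4.llamafile '*.gguf' -d weights/
```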
677
- ### Microarchitectural portability
678
-
679
- On Intel and AMD microprocessors, llama.cpp spends most of its time in
680
- the matmul quants, which are usually written thrice for SSSE3, AVX, and
681
- AVX2. llamafile pulls each of these functions out into a separate file
682
- that can be `#include`ed multiple times, with varying
683
- `__attribute__((__target__("arch")))` function attributes. Then, a
684
- wrapper function is added which uses Cosmopolitan's `X86_HAVE(FOO)`
685
- feature to dispatch at runtime to the appropriate implementation.
686
-
687
- ### Architecture portability
688
-
689
- llamafile solves architecture portability by building llama.cpp twice:
690
- once for AMD64 and again for ARM64. It then wraps them with a shell
691
- script which has an MZ prefix. On Windows, it'll run as a native binary.
692
- On Linux, it'll extract a small 8kb executable called [APE
693
- Loader](https://github.com/jart/cosmopolitan/blob/master/ape/loader.c)
694
- to `${TMPDIR:-${HOME:-.}}/.ape` that'll map the binary portions of the
695
- shell script into memory. It's possible to avoid this process by running
696
- the
697
- [`assimilate`](https://github.com/jart/cosmopolitan/blob/master/tool/build/assimilate.c)
698
- program that comes included with the `cosmocc` compiler. What the
699
- `assimilate` program does is turn the shell script executable into
700
- the host platform's native executable format. This guarantees a fallback
701
- path exists for traditional release processes when it's needed.
702
-
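A minimal sketch of that fallback path, assuming the `assimilate` tool shipped with `cosmocc` is on your PATH and converts the file in place (check its usage text before running it on a file you care about):

```sh
# turn the APE shell-script executable into the host's native format
assimilate ./llava.llamafile
```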
703
- ### GPU support
704
-
705
- Cosmopolitan Libc uses static linking, since that's the only way to get
706
- the same executable to run on six OSes. This presents a challenge for
707
- llama.cpp, because it's not possible to statically link GPU support. The
708
- way we solve that is by checking if a compiler is installed on the host
709
- system. For Apple, that would be Xcode, and for other platforms, that
710
- would be `nvcc`. llama.cpp has a single file implementation of each GPU
711
- module, named `ggml-metal.m` (Objective C) and `ggml-cuda.cu` (Nvidia
712
- C). llamafile embeds those source files within the zip archive and asks
713
- the platform compiler to build them at runtime, targeting the native GPU
714
- microarchitecture. If it works, then it's linked with the platform C library's
715
- dlopen() implementation. See [llamafile/cuda.c](llamafile/cuda.c) and
716
- [llamafile/metal.c](llamafile/metal.c).
717
-
718
- In order to use the platform-specific dlopen() function, we need to ask
719
- the platform-specific compiler to build a small executable that exposes
720
- these interfaces. On ELF platforms, Cosmopolitan Libc maps this helper
721
- executable into memory along with the platform's ELF interpreter. The
722
- platform C library then takes care of linking all the GPU libraries, and
723
- then runs the helper program which longjmp()'s back into Cosmopolitan.
724
- The executable program is now in a weird hybrid state where two separate
725
- C libraries exist which have different ABIs. For example, thread local
726
- storage works differently on each operating system, and programs will
727
- crash if the TLS register doesn't point to the appropriate memory. The
728
- way Cosmopolitan Libc solves that on AMD64 is by using SSE to recompile
729
- the executable at runtime to change `%fs` register accesses into `%gs`
730
- which takes a millisecond. On ARM, Cosmo uses the `x28` register for TLS
731
- which can be made safe by passing the `-ffixed-x28` flag when compiling
732
- GPU modules. Lastly, llamafile uses the `__ms_abi__` attribute so that
733
- function pointers passed between the application and GPU modules conform
734
- to the Windows calling convention. Amazingly enough, every compiler we
735
- tested, including nvcc on Linux and even Objective-C on macOS,
736
- supports compiling WIN32-style functions, thus ensuring your llamafile
737
- will be able to talk to Windows drivers, when it's run on Windows,
738
- without needing to be recompiled as a separate file for Windows. See
739
- [cosmopolitan/dlopen.c](https://github.com/jart/cosmopolitan/blob/master/libc/dlopen/dlopen.c)
740
- for further details.
741
-
742
- ## A note about models
743
-
744
- The example llamafiles provided above should not be interpreted as
745
- endorsements or recommendations of specific models, licenses, or data
746
- sets on the part of Mozilla.
747
-
748
- ## Security
749
-
750
- llamafile adds pledge() and SECCOMP sandboxing to llama.cpp. This is
751
- enabled by default. It can be turned off by passing the `--unsecure`
752
- flag. Sandboxing is currently only supported on Linux and OpenBSD on
753
- systems without GPUs; on other platforms it'll simply log a warning.
754
-
755
- Our approach to security has these benefits:
756
-
757
- 1. After it starts up, your HTTP server isn't able to access the
758
- filesystem at all. This is good, since it means if someone discovers
759
- a bug in the llama.cpp server, then it's much less likely they'll be
760
- able to access sensitive information on your machine or make changes
761
- to its configuration. On Linux, we're able to sandbox things even
762
- further; the only networking-related system call the HTTP server will
763
- be allowed to use after starting up is accept(). That further limits an
764
- attacker's ability to exfiltrate information, in the event that your
765
- HTTP server is compromised.
766
-
767
- 2. The main CLI command won't be able to access the network at all. This
768
- is enforced by the operating system kernel. It also won't be able to
769
- write to the file system. This keeps your computer safe in the event
770
- that a bug is ever discovered in the GGUF file format that lets
771
- an attacker craft malicious weights files and post them online. The
772
- only exception to this rule is if you pass the `--prompt-cache` flag
773
- without also specifying `--prompt-cache-ro`. In that case, security
774
- currently needs to be weakened to allow `cpath` and `wpath` access,
775
- but network access will remain forbidden.
776
-
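For example, here's a sketch of the flags mentioned above, again using the Mistral llamafile from the table as a stand-in (`state.bin` is a placeholder cache filename and `'hello'` is just a demo prompt):

```sh
# sandboxing is on by default; this opts out of it entirely
./mistral-7b-instruct-v0.2.Q4_0.llamafile --unsecure

# prompt caching weakens the CLI sandbox unless the cache is read-only
./mistral-7b-instruct-v0.2.Q4_0.llamafile -p 'hello' \
  --prompt-cache state.bin --prompt-cache-ro
```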
777
- Therefore your llamafile is able to protect itself against the outside
778
- world, but that doesn't mean you're protected from llamafile. Sandboxing
779
- is self-imposed. If you obtained your llamafile from an untrusted source
780
- then its author could have simply modified it to not do that. In that
781
- case, you can run the untrusted llamafile inside another sandbox, such
782
- as a virtual machine, to make sure it behaves how you expect.
783
-
784
- ## Licensing
785
-
786
- While the llamafile project is Apache 2.0-licensed, our changes
787
- to llama.cpp are licensed under MIT (just like the llama.cpp project
788
- itself) so as to remain compatible and upstreamable in the future,
789
- should that be desired.
790
-
791
- The llamafile logo on this page was generated with the assistance of DALL·E 3.
792
-
793
-
794
- [![Star History Chart](https://api.star-history.com/svg?repos=Mozilla-Ocho/llamafile&type=Date)](https://star-history.com/#Mozilla-Ocho/llamafile&Date)