llamafile

llamafile lets you distribute and run LLMs with a single file. (announcement blog post)

llamafile aims to make open LLMs much more accessible to both developers and end users. It does this by combining llama.cpp with Cosmopolitan Libc into one framework that collapses all the complexity of LLMs down to a single-file executable (called a "llamafile") that runs locally on most computers, with no installation.

llamafile is a Mozilla Builders project.

Quickstart

The easiest way to try it for yourself is to download the example llamafile for the numind.NuExtract model (license: MIT). With llamafile, you can run this model locally, consuming comparatively few resources and with good performance even on CPU alone.

  1. Download numind.NuExtract-v1.5.Q5_K_M.llamafile (2.78 GB).

  2. Open your computer's terminal.

  3. If you're using macOS, Linux, or BSD, you'll need to grant permission for your computer to execute this new file. (You only need to do this once.)

chmod +x numind.NuExtract-v1.5.Q5_K_M.llamafile
  4. If you're on Windows, rename the file by adding ".exe" to the end.

  5. Run the llamafile, e.g.:

./numind.NuExtract-v1.5.Q5_K_M.llamafile
  6. Your browser should open automatically and display a chat interface. (If it doesn't, just open your browser and point it at http://localhost:8080.)

  7. When you're done chatting, return to your terminal and hit Control-C to shut down llamafile.
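
While the server from the quickstart is running, you can also talk to it programmatically instead of through the browser. The sketch below is hedged: it assumes the llamafile's built-in server exposes the OpenAI-compatible chat completions endpoint on port 8080 (as llama.cpp's server does), and the "model" field is a placeholder that the local server may simply ignore.

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NuExtract",
    "messages": [
      {"role": "user", "content": "Extract every date mentioned in: The lease was signed on 2024-03-01 and renewed on 2025-03-01."}
    ]
  }'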

**Having trouble? See the "Gotchas" section of the official llamafile GitHub repository.**

Distribution

One good way to share a llamafile with your friends is by posting it on Hugging Face. If you do that, then it's recommended that you mention in your Hugging Face commit message what git revision or released version of llamafile you used when building your llamafile. That way everyone online will be able to verify the provenance of its executable content. If you've made changes to the llama.cpp or cosmopolitan source code, then the Apache 2.0 license requires you to explain what changed. One way you can do that is by embedding a notice in your llamafile using zipalign that describes the changes, and mentioning it in your Hugging Face commit.
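
For illustration, embedding such a notice might look roughly like the following. This is a hedged sketch: NOTICE.txt is a hypothetical file you write yourself, and zipalign is the tool shipped with the llamafile project (the -j0 combination mirrors the project's own examples; check the zipalign man page for the exact semantics).

echo "Describe your llama.cpp / cosmopolitan changes here" > NOTICE.txt
zipalign -j0 numind.NuExtract-v1.5.Q5_K_M.llamafile NOTICE.txt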

Documentation

There's a manual page for each of the llamafile programs installed when you run sudo make install. The command manuals are also typeset as PDF files that you can download from the GitHub releases page. Lastly, most commands will display this information when passed the --help flag.
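
For example, assuming the programs are on your PATH after installation:

man llamafile        # typeset manual for the main program
llamafile --help     # prints the same usage information to the terminal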

Running llamafile with models downloaded by third-party applications

This section answers the question "I already have a model downloaded locally by application X, can I use it with llamafile?". The general answer is "yes, as long as those models are stored locally in GGUF format," but how straightforward that is depends on the application. A few examples (tested on a Mac) follow.
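
If you're unsure which GGUF files are already on disk, a quick search can help before digging into application-specific layouts (this is just a generic convenience command, not something llamafile requires):

find ~ -name '*.gguf' 2>/dev/null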

LM Studio

LM Studio stores downloaded models in ~/.cache/lm-studio/models, in subdirectories named after the models (following Hugging Face's account_name/model_name format), with the same filename you saw when you chose to download the file.

So if you have downloaded e.g. the llama-2-7b.Q2_K.gguf file for TheBloke/Llama-2-7B-GGUF, you can run llamafile as follows:

cd ~/.cache/lm-studio/models/TheBloke/Llama-2-7B-GGUF
llamafile -m llama-2-7b.Q2_K.gguf

Ollama

When you download a new model with ollama, all its metadata will be stored in a manifest file under ~/.ollama/models/manifests/registry.ollama.ai/library/. The directory and manifest file name correspond to the model name as returned by ollama list. For instance, for llama3:latest the manifest file will be named ~/.ollama/models/manifests/registry.ollama.ai/library/llama3/latest.

The manifest maps each file related to the model (e.g. GGUF weights, license, prompt template, etc) to a sha256 digest. The digest corresponding to the element whose mediaType is application/vnd.ollama.image.model is the one referring to the model's GGUF file.

Each sha256 digest is also used as a filename in the ~/.ollama/models/blobs directory (if you look into that directory you'll see only those sha256-* filenames). This means you can directly run llamafile by passing the sha256 digest as the model filename. So if e.g. the llama3:latest GGUF file digest is sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29, you can run llamafile as follows:

cd ~/.ollama/models/blobs
llamafile -m sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29
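
If you'd rather not read the manifest by hand, the digest can also be extracted programmatically. This is a hedged sketch: it assumes jq is installed and that the manifest is JSON with a layers array as described above; the manifest records digests as sha256:<hex> while the blob filenames use sha256-<hex>, hence the tr step.

MANIFEST=~/.ollama/models/manifests/registry.ollama.ai/library/llama3/latest
DIGEST=$(jq -r '.layers[] | select(.mediaType == "application/vnd.ollama.image.model") | .digest' "$MANIFEST" | tr ':' '-')
llamafile -m ~/.ollama/models/blobs/"$DIGEST"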

Security

llamafile adds pledge() and SECCOMP sandboxing to llama.cpp. This is enabled by default. It can be turned off by passing the --unsecure flag. Sandboxing is currently only supported on Linux and OpenBSD on systems without GPUs; on other platforms it'll simply log a warning.

Our approach to security has these benefits:

  1. After it starts up, your HTTP server isn't able to access the filesystem at all. This is good, since it means if someone discovers a bug in the llama.cpp server, then it's much less likely they'll be able to access sensitive information on your machine or make changes to its configuration. On Linux, we're able to sandbox things even further; the only networking-related system call the HTTP server will be allowed to use after starting up is accept(). That further limits an attacker's ability to exfiltrate information, in the event that your HTTP server is compromised.

  2. The main CLI command won't be able to access the network at all. This is enforced by the operating system kernel. It also won't be able to write to the file system. This keeps your computer safe in the event that a bug is ever discovered in the GGUF file format that lets an attacker craft malicious weights files and post them online. The only exception to this rule is if you pass the --prompt-cache flag without also specifying --prompt-cache-ro. In that case, security currently needs to be weakened to allow cpath and wpath access, but network access will remain forbidden.
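
As a concrete illustration (the model, prompt, and cache filenames below are placeholders):

# Read-only prompt cache: the stricter sandbox stays in place
./llamafile -m model.gguf -p 'Hello' --prompt-cache state.bin --prompt-cache-ro

# Writable prompt cache: the sandbox is weakened to allow cpath/wpath
./llamafile -m model.gguf -p 'Hello' --prompt-cache state.bin

# Disable pledge()/SECCOMP sandboxing entirely (not recommended)
./llamafile -m model.gguf --unsecure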

Therefore your llamafile is able to protect itself against the outside world, but that doesn't mean you're protected from llamafile. Sandboxing is self-imposed. If you obtained your llamafile from an untrusted source then its author could have simply modified it to not do that. In that case, you can run the untrusted llamafile inside another sandbox, such as a virtual machine, to make sure it behaves how you expect.

Licensing

While the llamafile project is Apache 2.0-licensed, the changes to llama.cpp are licensed under MIT (just like the llama.cpp project itself) so as to remain compatible and upstreamable in the future, should that be desired.
