Local Agents with llama.cpp and Pi

You can run a coding agent entirely on your own hardware. Pi connects to a local llama.cpp server to give you an experience similar to Claude Code or Codex, but everything runs on your machine.

Pi is the agent behind OpenClaw and is now integrated directly into Hugging Face, giving you access to thousands of compatible models.

Getting Started

1. Set Your Local Hardware

Configure your local hardware profile so the Hub can show you which models are compatible with your setup.

Go to huggingface.co/settings/local-apps and set up your hardware profile. In the Local Apps section, select llama.cpp, as this is the engine you’ll use.

2. Find a Compatible Model

Browse for models compatible with Pi: huggingface.co/models?apps=pi&sort=trending

3. Launch the llama.cpp Server

On the model page, click the “Use this model” button and select llama.cpp. You’ll see the exact commands for your setup. The first step is to start a llama.cpp server, e.g.

# Start a local OpenAI-compatible server:
llama-server -hf unsloth/Qwen3.5-122B-A10B-GGUF:Q4_K_M --jinja

This downloads the model and starts an OpenAI-compatible API server on your machine. See the llama.cpp guide for installation instructions.
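Once the server is up, you can sanity-check the OpenAI-compatible endpoint before wiring up Pi. Below is a minimal sketch using only the Python standard library; the `build_chat_request` helper is illustrative (not part of llama.cpp or Pi), the port assumes the default llama-server settings, and the model string mirrors the repo from step 3 (llama-server typically serves a single model and is lenient about this field).

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8080/v1",
                       model="Qwen3.5-122B-A10B-GGUF"):
    """Build an OpenAI-style chat completion request for the local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Send it (requires the llama-server from step 3 to be running):
# with urllib.request.urlopen(build_chat_request("Hello!")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

If this request returns a completion, Pi will be able to talk to the same endpoint.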

4. Install and Configure Pi

In a separate terminal, install Pi:

npm install -g @mariozechner/pi-coding-agent

Then add your local model to Pi’s configuration file at ~/.pi/agent/models.json:

{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Qwen3.5-122B-A10B-GGUF"
        }
      ]
    }
  }
}

Update the id field to match the model you launched in step 3.
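If you switch between local models often, you can script this edit instead of hand-editing the file. A minimal sketch, assuming the models.json layout shown above; `set_pi_model` is a hypothetical helper, not part of Pi:

```python
import json
from pathlib import Path

def set_pi_model(model_id, config_path="~/.pi/agent/models.json",
                 provider="llama-cpp"):
    """Point Pi's llama-cpp provider at a different local model id."""
    path = Path(config_path).expanduser()
    config = json.loads(path.read_text())
    # Replace the provider's model list with the newly launched model.
    config["providers"][provider]["models"] = [{"id": model_id}]
    path.write_text(json.dumps(config, indent=2))
    return config

# e.g. after launching a different GGUF in step 3:
# set_pi_model("GLM-4.7-Flash-GGUF")
```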

5. Run Pi

Start Pi in your project directory:

pi

Pi connects to your local llama.cpp server and gives you an interactive agent session.


How It Works

The setup has two components running locally:

  1. llama.cpp server — Serves the model as an OpenAI-compatible API on localhost.
  2. Pi — The agent process that sends prompts to the local server, reasons about tasks, and executes actions.
┌─────────┐     API calls      ┌───────────────────┐
│   Pi    │ ──────────────────▶│  llama.cpp server │
│ (agent) │ ◀──────────────────│  (local model)    │
└─────────┘     responses      └───────────────────┘
     │
     ▼
  Your files,
  terminal, etc.
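The loop at the heart of the agent is conceptually simple: send the conversation to the model, and if the reply requests a tool, execute it locally and feed the result back. Here is a minimal sketch with a stubbed model standing in for the llama.cpp server; the message and tool-call format is illustrative, not Pi's actual protocol:

```python
def run_agent(model, tools, prompt, max_steps=5):
    """Loop: query the model, execute any requested tool, repeat."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        reply = model(messages)          # in real use: POST to localhost:8080
        messages.append(reply)
        tool_call = reply.get("tool_call")
        if tool_call is None:            # plain answer: we're done
            return reply["content"]
        # Execute the tool locally and hand the result back to the model.
        result = tools[tool_call["name"]](**tool_call["args"])
        messages.append({"role": "tool", "content": str(result)})
    return None

# Stub model: asks to list files once, then answers.
def stub_model(messages):
    if not any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "",
                "tool_call": {"name": "ls", "args": {"path": "."}}}
    return {"role": "assistant", "content": "done", "tool_call": None}

print(run_agent(stub_model, {"ls": lambda path: ["a.py"]}, "list files"))
# prints "done"
```

In the real setup, `model(messages)` is a chat completion call against the local server, and the tools are file reads, edits, and shell commands in your project directory.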

Alternative: llama-agent

llama-agent takes a different approach: it builds the agent loop directly into llama.cpp as a single binary with zero external dependencies. No Node.js, no Python; just compile and run:

git clone https://github.com/gary149/llama-agent.git
cd llama-agent

# Build
cmake -B build
cmake --build build --target llama-agent

# Run (downloads the model automatically)
./build/bin/llama-agent -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL

Because tool calls happen in-process rather than over HTTP, there is no network overhead between the model and the agent. It also supports subagents, MCP servers, and an HTTP API server mode.
