---
name: llamafile
description: When setting up local LLM inference without cloud APIs. When running GGUF models locally. When needing OpenAI-compatible API from a local model. When building offline/air-gapped AI tools. When troubleshooting local LLM server connections.
---

# Llamafile

Configure and manage Mozilla Llamafile - a cross-platform executable distribution format that runs LLMs locally with an OpenAI-compatible API.

## When to Use This Skill

Use this skill when:

- Installing llamafile binary and GGUF model files
- Starting llamafile server with optimal configuration
- Integrating llamafile with LiteLLM or OpenAI SDK
- Configuring llamafile for different performance profiles (GPU, CPU, network access)
- Troubleshooting llamafile server startup or API connection issues
- Building applications requiring local LLM inference
- Setting up commit message tools, code review systems, or other developer tools with local AI
- Managing llamafile as a background service
- Selecting and downloading appropriate GGUF models
- Validating OpenAI-compatible API responses

## Core Capabilities

### What Llamafile Provides

Llamafile combines llama.cpp with Cosmopolitan Libc to create single-file executables that:

- Run on macOS, Windows, Linux, FreeBSD, OpenBSD, NetBSD
- Support AMD64 and ARM64 architectures
- Serve an OpenAI-compatible HTTP API on localhost
- Load GGUF model files for inference
- Provide a `/health` endpoint for monitoring
- Support GPU acceleration (CUDA, Metal, Vulkan)
- Enable embeddings generation with the `--embedding` flag

### API Compatibility

Llamafile exposes these OpenAI-compatible endpoints when running with `--server`:

| Endpoint | Description | Requirements |
| --- | --- | --- |
| `http://localhost:8080/v1/chat/completions` | Chat completions (primary) | Server mode |
| `http://localhost:8080/v1/completions` | Text completions | Server mode |
| `http://localhost:8080/v1/embeddings` | Generate embeddings | `--embedding` flag |
| `http://localhost:8080/health` | Health check | Server mode |

**Critical Detail**: All OpenAI-compatible endpoints require the `/v1` prefix in the URL path.

## Installation

### Download Llamafile Binary

```bash
# Download llamafile v0.9.3 binary
curl -L -o llamafile https://github.com/mozilla-ai/llamafile/releases/download/0.9.3/llamafile-0.9.3

# Make executable
chmod 755 llamafile

# Verify version
./llamafile --version
```

**Alternative download sources:**

- GitHub Release: `https://github.com/mozilla-ai/llamafile/releases/download/0.9.3/llamafile-0.9.3`
- SourceForge Mirror: `https://sourceforge.net/projects/llamafile.mirror/files/0.9.3/`
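If the same setup needs to be scripted from an application (for example, before registering llamafile as a background service), the download and permission steps above can be automated. A minimal sketch using only the Python standard library, reusing the v0.9.3 URL shown above (the install path is illustrative):

```python
import os
import subprocess
import urllib.request

LLAMAFILE_URL = (
    "https://github.com/mozilla-ai/llamafile/releases/download/0.9.3/llamafile-0.9.3"
)
TARGET = os.path.expanduser("~/.local/bin/llamafile")  # illustrative install path

# Download the single-file executable (redirects are followed automatically)
os.makedirs(os.path.dirname(TARGET), exist_ok=True)
urllib.request.urlretrieve(LLAMAFILE_URL, TARGET)

# Equivalent of `chmod 755 llamafile`
os.chmod(TARGET, 0o755)

# Verify the binary runs
print(subprocess.run([TARGET, "--version"], capture_output=True, text=True).stdout)
```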
### Download GGUF Model

Llamafile requires GGUF format models. Download from Hugging Face:

```bash
# Recommended: Gemma 3 3B (balanced speed/quality, ~2GB)
curl -L -o gemma-3-3b.gguf \
  https://huggingface.co/Mozilla/gemma-3-3b-it-gguf/resolve/main/gemma-3-3b-it-Q4_K_M.gguf

# Alternative: Pre-packaged llamafile with embedded model
curl -LO https://huggingface.co/Mozilla/llava-v1.5-7b-llamafile/resolve/main/llava-v1.5-7b-q4.llamafile
chmod +x llava-v1.5-7b-q4.llamafile
```

**Recommended models by use case:**

| Model | Size | Use Case | Download |
| --- | --- | --- | --- |
| Gemma 3 3B | ~2GB | Balanced speed/quality | [Mozilla/gemma-3-3b-it-gguf](https://huggingface.co/Mozilla/gemma-3-3b-it-gguf) |
| Qwen3-0.6B | ~500MB | Fast, lower quality | [Mozilla/Qwen3-0.6B-gguf](https://huggingface.co/Mozilla/Qwen3-0.6B-gguf) |
| Mistral 7B | ~4GB | Higher quality, slower | [Mozilla/Mistral-7B-gguf](https://huggingface.co/Mozilla/Mistral-7B-gguf) |
| Llama 3.1 8B | ~5GB | Best quality, slowest | [Mozilla/Llama-3.1-8B-gguf](https://huggingface.co/Mozilla/Llama-3.1-8B-gguf) |

**Quantization recommendation**: Use Q4_K_M quantized models for an optimal balance of quality and performance.

## Server Configuration

### Basic Server Command

Start the llamafile server for local API access:

```bash
./llamafile --server \
  -m /path/to/model.gguf \
  --nobrowser \
  --port 8080 \
  --host 127.0.0.1
```

**Critical flags explained:**

- `--server`: Required to enable HTTP API endpoints
- `-m`: Path to GGUF model file (required)
- `--nobrowser`: Prevents auto-opening a browser on startup
- `--port 8080`: Default port (note: NOT 8000)
- `--host 127.0.0.1`: Localhost only (secure default)

### Performance-Optimized Configuration

For GPU-accelerated inference with higher throughput:

```bash
./llamafile --server \
  -m /path/to/model.gguf \
  --nobrowser \
  --port 8080 \
  --host 127.0.0.1 \
  --ctx-size 4096 \
  --n-gpu-layers 99 \
  --threads 8 \
  --cont-batching \
  --parallel 4
```

**Advanced flags:**

| Flag | Purpose | Default | When to Use |
| --- | --- | --- | --- |
| `--ctx-size` | Prompt context window size | 512 | Increase for longer conversations |
| `--n-gpu-layers` | GPU offload layer count | 0 | Set to 99 to offload all layers to GPU |
| `--threads` | CPU threads for generation | Auto | Set explicitly for consistent performance |
| `--threads-batch` | Threads for batch processing | Same as `--threads` | Tune separately for prompt vs generation |
| `--cont-batching` | Continuous batching | Off | Enable for multiple concurrent requests |
| `--parallel` | Parallel sequence count | 1 | Increase for concurrent request handling |
| `--mlock` | Lock model in memory | Off | Prevent swapping on systems with sufficient RAM |
| `--embedding` | Enable embeddings endpoint | Off | Required for `/v1/embeddings` API |

### Network-Accessible Configuration

To allow connections from other machines (development/testing only):

```bash
./llamafile --server \
  -m /path/to/model.gguf \
  --nobrowser \
  --host 0.0.0.0 \
  --port 8080
```

**Security warning**: Binding to `0.0.0.0` exposes the API to network access. Use only in trusted environments.

## API Integration

### Using LiteLLM (Recommended)

LiteLLM provides a unified interface for llamafile and cloud LLM providers.

```python
import litellm

response = litellm.completion(
    model="llamafile/gemma-3-3b",  # MUST use llamafile/ prefix
    messages=[{"role": "user", "content": "Hello, world!"}],
    api_base="http://localhost:8080/v1",  # MUST include /v1 suffix
    temperature=0.3,
    max_tokens=200
)

print(response.choices[0].message.content)
```

**Critical requirements for LiteLLM:**

1. Model name MUST use `llamafile/` prefix for routing
2. `api_base` MUST include `/v1` suffix
3. No API key required (any placeholder value works)
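The same routing also supports token streaming through the OpenAI-compatible endpoint. A brief sketch, assuming the server and model from the sections above are already running:

```python
import litellm

# Stream tokens from the local llamafile server; the api_base keeps the /v1 suffix
# and the model name keeps the llamafile/ prefix, exactly as in the call above.
for chunk in litellm.completion(
    model="llamafile/gemma-3-3b",
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
    api_base="http://localhost:8080/v1",
    stream=True,
):
    # Streaming chunks carry incremental deltas; content may be None on some chunks
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```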
**Related skill**: For comprehensive LiteLLM configuration, activate the litellm skill:

```
Skill(command: "litellm")
```

### Using OpenAI Python SDK

Direct integration with the OpenAI SDK for llamafile endpoints:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # MUST include /v1
    api_key="sk-no-key-required"  # Any value works
)

response = client.chat.completions.create(
    model="local-model",  # Model name is flexible
    messages=[
        {"role": "user", "content": "Hello, world!"}
    ],
    temperature=0.3,
    max_tokens=200
)

print(response.choices[0].message.content)
```

### Using curl for Testing

Verify the llamafile server is responding correctly:

```bash
# Health check
curl http://localhost:8080/health

# Chat completions
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.3,
    "max_tokens": 200
  }'

# Embeddings (requires --embedding flag on server)
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "input": ["Hello world"]
  }'
```

## Server Management

### Process Management Script

Python script to start llamafile as a background process with health checking:

```python
import subprocess
import time

import httpx


def start_llamafile(
    llamafile_path: str,
    model_path: str,
    port: int = 8080,
    host: str = "127.0.0.1"
) -> subprocess.Popen:
    """Start llamafile server as background process."""
    cmd = [
        llamafile_path,
        "--server",
        "-m", model_path,
        "--nobrowser",
        "--port", str(port),
        "--host", host,
    ]
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    _wait_for_server(host, port)
    return process


def _wait_for_server(host: str, port: int, timeout: int = 30) -> None:
    """Wait for server to respond to health checks."""
    url = f"http://{host}:{port}/health"
    start = time.time()
    while time.time() - start < timeout:
        try:
            response = httpx.get(url, timeout=2)
            if response.status_code == 200:
                return
        except httpx.RequestError:
            pass
        time.sleep(0.5)
    raise TimeoutError(f"Server did not start within {timeout} seconds")
```

### Configuration File Pattern

Example TOML configuration for applications using llamafile:

```toml
# ~/.config/app-name/config.toml

[ai]
model = "llamafile/gemma-3-3b"  # Must use llamafile/ prefix
temperature = 0.3
max_tokens = 200

[llamafile]
path = "/home/user/.local/bin/llamafile"
model_path = "/home/user/.local/share/app-name/models/gemma-3-3b.gguf"
api_base = "http://127.0.0.1:8080/v1"  # Include /v1 suffix
```

## Troubleshooting

### Server Fails to Start

**Check if port is already in use:**

```bash
# Find process using port 8080
lsof -i :8080

# Kill existing process
kill $(lsof -t -i :8080)
```

**Verify model file exists and is readable:**

```bash
ls -lh /path/to/model.gguf
```

**Check llamafile binary permissions:**

```bash
ls -la /path/to/llamafile
# Should show: -rwxr-xr-x (executable)

# Fix permissions if needed
chmod 755 /path/to/llamafile
```

### Connection Refused Errors

**Verify server is running:**

```bash
# Check health endpoint
curl http://localhost:8080/health

# Check server is listening
netstat -tlnp | grep 8080
# or
lsof -i :8080
```

**Common causes** (a diagnostic sketch follows the list):

1. Server not started with `--server` flag
2. Wrong port number (8080 vs 8000)
3. Missing `/v1` in API URL path
4. Server bound to `127.0.0.1` but accessed from another machine
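The causes above can be narrowed down programmatically. A rough diagnostic sketch using `httpx`; the host, port, and payload are just the defaults from this skill:

```python
import socket

import httpx

HOST, PORT = "127.0.0.1", 8080

# 1. Is anything listening on the port at all?
with socket.socket() as s:
    s.settimeout(2)
    listening = s.connect_ex((HOST, PORT)) == 0
print("port open:", listening)

if listening:
    # 2. Does the server answer the health check?
    print("health:", httpx.get(f"http://{HOST}:{PORT}/health", timeout=5).status_code)

    # 3. A 404 here usually means the /v1 prefix is missing somewhere upstream.
    r = httpx.post(
        f"http://{HOST}:{PORT}/v1/chat/completions",
        json={
            "model": "local",
            "messages": [{"role": "user", "content": "ping"}],
            "max_tokens": 1,
        },
        timeout=120,  # first request may block while the model loads
    )
    print("chat completions:", r.status_code)
```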
### API Errors

**Test basic connectivity:**

```bash
# Verbose health check
curl -v http://localhost:8080/health

# Test chat completions with verbose output
curl -v http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"test","messages":[{"role":"user","content":"Hi"}]}'
```

**Common API issues:**

| Error | Cause | Solution |
| --- | --- | --- |
| 404 Not Found | Missing `/v1` in URL | Add `/v1` before endpoint path |
| Connection refused | Server not running | Start server with `--server` flag |
| Timeout | Model loading slowly | Wait longer or use smaller model |
| Invalid model | Wrong model path | Verify `-m` path to GGUF file |

### Performance Issues

**Optimize inference speed:**

1. Use quantized models (Q4_K_M recommended)
2. Enable GPU acceleration: `--n-gpu-layers 99`
3. Increase threads: `--threads 8`
4. Enable continuous batching: `--cont-batching`
5. Reduce context size if not needed: `--ctx-size 2048`

**Check GPU availability:**

```bash
# NVIDIA GPU
nvidia-smi

# AMD GPU
rocm-smi

# Apple Metal (check activity monitor)
```

## Common Pitfalls

Avoid these frequent errors when using llamafile (a validation sketch follows the list):

1. **Port 8000 vs 8080**: Llamafile defaults to **port 8080**, not 8000
2. **Missing `/v1` in API URL**: Always include the `/v1` suffix for OpenAI-compatible endpoints
3. **LiteLLM prefix**: Must use `llamafile/` prefix in model name for proper routing
4. **API key confusion**: No real API key needed, but some clients require a placeholder value
5. **Starting server from hooks**: Application hooks should check if the server is running, not start it
6. **Model path issues**: Ensure the GGUF file exists and is readable before starting the server
7. **Binary permissions**: Llamafile must be executable (`chmod 755`)
8. **GPU layers on CPU**: Setting `--n-gpu-layers` on CPU-only systems causes errors
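Several of these pitfalls can be caught before the first request. A pre-flight sketch, assuming the configuration keys from the TOML pattern earlier in this skill; the helper name `preflight` and the example values are hypothetical:

```python
import os
from pathlib import Path


def preflight(config: dict) -> list[str]:
    """Return a list of problems matching the pitfalls listed above."""
    problems = []
    if not config["api_base"].rstrip("/").endswith("/v1"):
        problems.append("api_base must end with /v1")
    if ":8000" in config["api_base"]:
        problems.append("llamafile defaults to port 8080, not 8000")
    if not config["model"].startswith("llamafile/"):
        problems.append("LiteLLM model name must use the llamafile/ prefix")
    if not Path(config["model_path"]).is_file():
        problems.append(f"GGUF model not found: {config['model_path']}")
    if not os.access(config["llamafile_path"], os.X_OK):
        problems.append("llamafile binary is not executable (chmod 755)")
    return problems


# Example call using the illustrative paths from the TOML pattern
print(preflight({
    "api_base": "http://127.0.0.1:8080/v1",
    "model": "llamafile/gemma-3-3b",
    "model_path": "/home/user/.local/share/app-name/models/gemma-3-3b.gguf",
    "llamafile_path": "/home/user/.local/bin/llamafile",
}))
```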
## Version Information

**Current stable version**: 0.9.3 (May 14, 2025)

**Version constants:**

```
LLAMAFILE_MAJOR = 0
LLAMAFILE_MINOR = 9
LLAMAFILE_PATCH = 3
```

**Recent changes in 0.9.3:**

- Added Phi4 model support
- Added Qwen3 model support
- Respects the NO_COLOR environment variable
- Fixed URL handling in JavaScript (preserves path when building relative URLs)
- Added plaintext output option to LocalScore

## Related Skills and Tools

**Skills to activate:**

- `litellm` - For unified LLM provider interface and routing

```
Skill(command: "litellm")
```

**External tools:**

- LiteLLM - Unified interface for multiple LLM providers
- OpenAI Python SDK - Direct OpenAI-compatible API access
- llama.cpp - Underlying inference engine
- GGUF format - Model format specification

## References

### Official Documentation

- [Mozilla llamafile GitHub](https://github.com/mozilla-ai/llamafile) - Primary repository and source code
- [Mozilla llamafile Documentation](https://mozilla-ai.github.io/llamafile/) - Official documentation site
- [LiteLLM llamafile Provider](https://docs.litellm.ai/docs/providers/llamafile) - LiteLLM integration guide
- [llama.cpp Server Documentation](https://github.com/ggml-org/llama.cpp/tree/master/examples/server) - Underlying server implementation
- [Releases Page](https://github.com/mozilla-ai/llamafile/releases) - Binary downloads and changelog

### Model Resources

- [Hugging Face Mozilla Models](https://huggingface.co/Mozilla) - Official Mozilla GGUF models
- [GGUF Format Specification](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) - Model file format details

### Related Technologies

- [Cosmopolitan Libc](https://justine.lol/cosmopolitan/) - Cross-platform binary format
- [llama.cpp](https://github.com/ggerganov/llama.cpp) - LLM inference engine
- [OpenAI API Reference](https://platform.openai.com/docs/api-reference) - API compatibility reference