# LLM Setup for FileScopeMCP

[Back to README](../README.md)

FileScopeMCP uses [llama.cpp](https://github.com/ggml-org/llama.cpp)'s `llama-server` as a local OpenAI-compatible LLM backend. The default model is Qwen3.6 35B A3B MoE (UD-IQ4_XS quant, ~3B active params per token). The model alias `llm-model` is what the broker expects — pass `--alias llm-model` on the llama-server command line.

Without a running llama-server, FileScopeMCP still works for file tracking and dependency analysis — you just won't get auto-generated summaries, concepts, or change-impact assessments.

Pick the guide that matches your setup:

- [Same machine (Linux/macOS)](#same-machine-linuxmacos) — **default** — llama-server and FileScopeMCP on the same Linux or macOS host (this is what agent runtimes like Hermes use)
- [Remote / LAN server](#remote--lan-server) — llama-server on a different machine on your network
- [WSL2 + Windows GPU](#wsl2--windows-gpu) — alternative for Windows users: FileScopeMCP in WSL2, llama-server on the Windows host for GPU access

---

## Same Machine (Linux/macOS)

```bash
./setup-llm.sh
```

This prints a platform-specific setup guide. It does NOT install anything for you — you build or install llama.cpp yourself.

### Linux

Build from source with the backend that matches your GPU:

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Configure for ONE backend: run only the cmake -B line that matches your GPU.
# NVIDIA:
cmake -B build -DGGML_CUDA=ON
# AMD (Vulkan is the recommended backend; see the WSL2 section for why):
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
```
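
If you script the build, the backend choice can be made explicit so that only one configure line ever runs. The following is a minimal sketch; `backend_flag` is a hypothetical helper name and is not part of `setup-llm.sh`. It covers only the two vendors listed in this section.

```bash
# Map a GPU vendor to the llama.cpp CMake backend option shown in this section.
# Unknown vendors produce an error instead of a guess.
backend_flag() {
  case "$1" in
    nvidia) echo "-DGGML_CUDA=ON" ;;
    amd)    echo "-DGGML_VULKAN=ON" ;;
    *)      echo "unknown GPU vendor: $1" >&2; return 1 ;;
  esac
}

# Usage (run inside the llama.cpp checkout):
#   cmake -B build "$(backend_flag amd)"
#   cmake --build build --config Release -j
```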

Or run the CUDA Docker image:

```bash
docker run --gpus all -p 8880:8880 \
  -v $HOME/models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf \
  --alias llm-model -c 98304 -n 32768 -ngl 99 --n-cpu-moe 14 \
  -fa on --no-mmap --mlock -b 2048 -ub 512 \
  --cache-type-k q8_0 --cache-type-v q8_0 --swa-full \
  --no-context-shift --ctx-checkpoints 128 --checkpoint-every-n-tokens 4096 \
  --cache-ram 4096 --jinja --reasoning-format deepseek --reasoning-budget 4096 \
  --host 0.0.0.0 --port 8880
```

Run `./setup-llm.sh --launch` to print the exact launch command for the native binary.

### macOS

```bash
brew install llama.cpp
```

Metal is the default backend — no configuration needed. Launch with the same command `./setup-llm.sh --launch` prints.

### First run

The GGUF file must already exist at the path specified by `-m`, so download it before launching. llama-server cannot serve requests until the model is fully loaded.
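
Because requests fail while the model is still loading, a wrapper script may want to wait for the server before continuing. The sketch below is one way to do that, under stated assumptions: `wait_until_ready` is a hypothetical helper, and the probe in the usage comment uses the `/v1/models` endpoint and the default `localhost:8880` address mentioned elsewhere in this guide.

```bash
# Retry a probe command until it succeeds or the attempts run out.
# Arguments: <max_attempts> <delay_seconds> <probe command...>
wait_until_ready() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    # Success on any attempt ends the wait immediately.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Pause only if another attempt is still to come.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage: up to 60 attempts, 5 seconds apart.
#   wait_until_ready 60 5 curl -fsS http://localhost:8880/v1/models \
#     && echo "llama-server is ready"
```

The `-f` flag makes `curl` exit non-zero on HTTP error responses, so an error status returned during loading counts as "not ready yet".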

Verify with:

```bash
./setup-llm.sh --status
```

No broker config changes are needed — the default `broker.default.json` template points at `localhost:8880`, and the broker auto-copies it to `~/.filescope/broker.json` on first start if the file is missing.

### Run as a systemd service (Linux only)

Once your launch script (typically `~/start-llama-server.sh`, the command `./setup-llm.sh --launch` prints) is in place, register llama-server as a systemd unit so it auto-starts on boot, restarts on failure, and logs to `journalctl`:

```bash
sudo ./setup-llm.sh --install-service
```

The unit lives at `/etc/systemd/system/llama-server.service` (template at `monitoring/systemd/llama-server.service`). It captures stdout/stderr to the journal, sets `OOMScoreAdjust=-500` so the kernel picks lighter workloads first under memory pressure, and writes a start-time metric used by the optional [monitoring dashboard](../monitoring/). Override the launch-script path with `--start-script /path/to/script` if it isn't at `$HOME/start-llama-server.sh`.

The flag is refused under WSL2 — see the [WSL2 + Windows GPU](#wsl2--windows-gpu) section instead.

Tail logs with:

```bash
sudo journalctl -u llama-server -f
```

---

## WSL2 + Windows GPU

Alternative setup for Windows users with a dedicated GPU. WSL2 doesn't give native GPU access to llama.cpp, so llama-server runs on Windows and FileScopeMCP connects to it across the WSL2 boundary. If you're on a native Linux host (including Hermes on Ubuntu), use the [Same Machine](#same-machine-linuxmacos) guide above instead.

### Step 1: Pick the Windows binary

Download the llama.cpp Windows release from [github.com/ggml-org/llama.cpp/releases](https://github.com/ggml-org/llama.cpp/releases). Pick the zip that matches your GPU.

| GPU | File | Backend |
|-----|------|---------|
| **AMD RDNA2/RDNA3** (RX 6800 XT, RX 7900 XT, etc.) | `llama-*-bin-win-vulkan-x64.zip` | Vulkan |
| **NVIDIA** | `llama-*-bin-win-cuda-12.X-x64.zip` | CUDA (no toolkit required for prebuilt) |
| **Intel Arc** | `llama-*-bin-win-vulkan-x64.zip` | Vulkan |

**For AMD: use Vulkan, NOT ROCm.** Two reasons:

1. The ROCm backend is broken on Windows 11 since llama.cpp build b8152 (Issue #19943) — models load CPU-only.
2. In practice, Vulkan ranges from on par with ROCm to about 50% faster on RDNA2, and the gap widens for MoE models.

No HIP SDK, no Visual Studio, no ROCm SDK needed.

### Step 2: Extract the zip

Right-click → Extract All → enter `C:\llama.cpp`. The zip may or may not create a nested subfolder. Find the folder that actually contains `llama-server.exe`:

```powershell
Get-ChildItem -Recurse -Filter llama-server.exe C:\llama.cpp
```

Note the exact folder — you will `cd` into it in Step 4.

### Step 3: Open port 8880 in Windows Firewall

In an elevated PowerShell (Run as Administrator):

```powershell
New-NetFirewallRule -DisplayName "llama-server 8880" `
  -Direction Inbound -Action Allow -Protocol TCP -LocalPort 8880
```

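The rule above does not set `-Profile`, so it applies to every firewall profile. If you restrict it to one profile, make sure it is the profile your WSL2 network interface uses (usually Private, occasionally Public).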

### Step 4: Launch llama-server

In PowerShell, from the folder that contains `llama-server.exe`:

```powershell
cd C:\llama.cpp  # or the nested subfolder from Step 2
.\llama-server.exe `
  -m "$HOME\models\Qwen3.6-35B-A3B-UD-IQ4_XS.gguf" `
  --alias llm-model `
  -c 98304 `
  -n 32768 `
  -ngl 99 `
  --n-cpu-moe 14 `
  -fa on `
  --no-mmap `
  --mlock `
  -b 2048 -ub 512 `
  --cache-type-k q8_0 --cache-type-v q8_0 `
  --swa-full `
  --no-context-shift `
  --ctx-checkpoints 128 `
  --checkpoint-every-n-tokens 4096 `
  --cache-ram 4096 `
  --jinja `
  --reasoning-format deepseek `
  --reasoning-budget 4096 `
  --chat-template-kwargs '{"preserve_thinking":true}' `
  --temp 0.6 `
  --top-p 0.95 `
  --top-k 20 `
  --min-p 0.0 `
  --presence-penalty 0.0 `
  --repeat-penalty 1.0 `
  --host 0.0.0.0 --port 8880 `
  --metrics `
  -np 1
```

Flag breakdown:

- `-ngl 99` — offload all layers to GPU
- `--n-cpu-moe 14` — keep routed expert FFNs in system RAM for 14 layers (tuned for 16GB VRAM). Raise to `99` if you hit OOM; lower for more speed if you have headroom.
- `-fa on` — flash attention
- `--no-mmap --mlock` — disable memory mapping, lock model in RAM for consistent performance
- `-b 2048 -ub 512` — logical and physical batch size
- `--cache-type-k q8_0 --cache-type-v q8_0` — KV cache in int8. **Do NOT use `q4_0` on gfx1030** — known segfault (Issue #15107).
- `--swa-full` — use a full-size sliding-window attention (SWA) cache
- `--no-context-shift --ctx-checkpoints 128 --checkpoint-every-n-tokens 4096` — context management with periodic checkpoints
- `--cache-ram 4096` — 4GB RAM cache for context checkpoints
- `--jinja` — enable Jinja chat template
- `--reasoning-format deepseek --reasoning-budget 4096` — enable reasoning mode with 4K token budget
- `--chat-template-kwargs '{"preserve_thinking":true}'` — preserve thinking blocks in output
- `-c 98304` — 96K context window
- `-n 32768` — max tokens per generation

**RAM requirement:** `--n-cpu-moe` streams routed experts from system RAM. At `--n-cpu-moe 14`, keep ~12GB of system RAM free beyond what Windows itself uses; at `--n-cpu-moe 99`, keep ~20GB. The `--cache-ram 4096` flag allows up to an additional 4GB for context checkpoints.

### Step 5: Configure the broker in WSL

```bash
mkdir -p ~/.filescope
cp ~/FileScopeMCP/broker.windows-host.json ~/.filescope/broker.json
```

The `wsl-host` placeholder in `broker.windows-host.json` is auto-resolved by the broker at startup: `src/broker/config.ts` runs `ip route show default | awk '{print $3}'` to find the Windows host IP and rewrites `baseURL` in memory. Manual editing is rarely needed; if auto-resolution fails, replace `wsl-host` with the literal IP as described in troubleshooting check 6.
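
To see what that resolution produces on your machine, the same lookup can be reproduced by hand. This is an illustrative sketch, not the broker's actual code: `gateway_from_route` and `wsl_host_base_url` are hypothetical names, the `/v1` suffix is assumed from the `baseURL` format shown in the Remote / LAN section, and unlike the plain `awk '{print $3}'` it stops at the first default route.

```bash
# Print the gateway address (third field) of the first default route,
# reading `ip route show default` output from stdin.
gateway_from_route() {
  awk '$1 == "default" { print $3; exit }'
}

# Build a base URL from that gateway: port 8880, path /v1 (assumed format).
# Fails if no default route is present in the input.
wsl_host_base_url() {
  local gw
  gw="$(gateway_from_route)"
  [ -n "$gw" ] || return 1
  echo "http://${gw}:8880/v1"
}

# Usage inside WSL2:
#   ip route show default | wsl_host_base_url
```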

### Step 6: Verify from WSL

```bash
curl http://$(ip route show default | awk '{print $3}'):8880/v1/models
```

Expected: JSON with `data[].id` containing `llm-model`.
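
If you want a pass/fail result instead of reading the JSON by eye, the check can be scripted. A minimal sketch follows, assuming the response has the `"id": "..."` shape described above; `has_model_alias` is a hypothetical helper, and it uses a plain pattern match, not a JSON parser.

```bash
# Succeed if the JSON on stdin contains an "id" field equal to the given alias.
# The alias is inserted into a regular expression, so keep it to simple names.
has_model_alias() {
  grep -Eq "\"id\"[[:space:]]*:[[:space:]]*\"$1\""
}

# Usage inside WSL2:
#   curl -fsS "http://$(ip route show default | awk '{print $3}'):8880/v1/models" \
#     | has_model_alias llm-model && echo "alias OK"
```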

### Step 7: Restart Claude Code

Start (or restart) a Claude Code session in your project. FileScopeMCP auto-spawns the broker, which connects to llama-server on Windows. Verify end-to-end with:

```bash
./setup-llm.sh --status
```

Or call `status()` from an MCP tool in Claude Code.

---

## Remote / LAN Server

llama-server runs on a different machine on your network.

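**1. On the remote machine:** Launch llama-server with the full flag set from Step 4 of the [WSL2 + Windows GPU](#wsl2--windows-gpu) section, adapting the binary name and model path to that machine's OS. Make sure `--host 0.0.0.0 --port 8880 --alias llm-model` are present.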

**2. In WSL / on the FileScopeMCP machine:**

```bash
mkdir -p ~/.filescope
cp ~/FileScopeMCP/broker.remote-lan.json ~/.filescope/broker.json
```

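**3. Edit `~/.filescope/broker.json`** and replace the placeholder address in `baseURL` (`192.168.1.100` in the template) with the actual IP of the remote machine. In the example below, `YOUR_SERVER_IP` stands for that IP: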

```json
{
  "llm": {
    "provider": "openai-compatible",
    "model": "llm-model",
    "baseURL": "http://YOUR_SERVER_IP:8880/v1",
    "maxTokensPerCall": 1024
  },
  "jobTimeoutMs": 120000,
  "maxQueueSize": 1000
}
```
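
The copy-and-edit in steps 2 and 3 can also be done in one pass. The sketch below assumes the template keeps `baseURL` on a single line in the `http://HOST:PORT/...` form shown above; `set_base_url_host` is a hypothetical helper and `10.0.0.5` is only an example address.

```bash
# Rewrite the host part of the "baseURL" value, reading JSON text on stdin
# and writing the result to stdout. Port and path are left untouched.
# The host is spliced into a sed expression, so pass a plain IP or hostname.
set_base_url_host() {
  local host="$1"
  sed -E 's#("baseURL"[[:space:]]*:[[:space:]]*"https?://)[^:/"]+#\1'"$host"'#'
}

# Usage:
#   mkdir -p ~/.filescope
#   set_base_url_host 10.0.0.5 \
#     < ~/FileScopeMCP/broker.remote-lan.json > ~/.filescope/broker.json
```

Writing to stdout instead of editing in place avoids the `sed -i` differences between GNU and BSD `sed`.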

**4. Verify connectivity:**

```bash
# Replace YOUR_SERVER_IP with the remote machine's IP.
curl http://YOUR_SERVER_IP:8880/v1/models
```

**5. Restart Claude Code.**

---

## WSL + Windows Troubleshooting

If FileScopeMCP runs in WSL2 and llama-server runs on Windows, work through these checks in order.

### 1. Is llama-server running on Windows?

Check the PowerShell window it was launched in. If you closed that window, llama-server is gone — relaunch it with the command from Step 4.

### 2. Is llama-server listening on all interfaces?

In a Windows terminal:

```powershell
netstat -an | findstr 8880
```

You should see `0.0.0.0:8880`. If you see `127.0.0.1:8880`, you forgot `--host 0.0.0.0` on the launch command.

### 3. Can WSL reach the Windows host?

From WSL:

```bash
ip route show default | awk '{print $3}'
curl http://$(ip route show default | awk '{print $3}'):8880/v1/models
```

If `curl` hangs or returns "Connection refused":

- **Firewall:** The inbound rule from Step 3 may not be active. Re-run `New-NetFirewallRule` in an elevated PowerShell.
- **VPN/proxy:** Some VPN software changes WSL2 networking. Try disconnecting the VPN temporarily.
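
The way `curl` fails can help narrow things down. The sketch below maps two of curl's documented exit codes (7, failed to connect; 28, operation timed out) to the most likely cause in this setup. The interpretations are rules of thumb, not guarantees, and `explain_curl_exit` is a hypothetical helper.

```bash
# Translate a curl exit status into a likely cause for the WSL2-to-Windows link.
explain_curl_exit() {
  case "$1" in
    0)  echo "reachable" ;;
    7)  echo "connection refused: llama-server is probably not running or is bound to 127.0.0.1" ;;
    28) echo "timed out: traffic is probably being dropped by a firewall or VPN" ;;
    *)  echo "curl failed with exit status $1" ;;
  esac
}

# Usage inside WSL2 (5-second cap so a blocked port does not hang):
#   host="$(ip route show default | awk '{print $3}')"
#   curl -sS --max-time 5 -o /dev/null "http://${host}:8880/v1/models"
#   explain_curl_exit "$?"
```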

### 4. Is the broker config correct?

```bash
cat ~/.filescope/broker.json
```

`baseURL` should contain `wsl-host:8880` (auto-resolved at startup) or the literal Windows host IP on port 8880. If it points at a different port or still has `localhost`, re-copy the template:

```bash
cp ~/FileScopeMCP/broker.windows-host.json ~/.filescope/broker.json
```

### 5. Is the broker process running?

```bash
ps aux | grep broker | grep -v grep
cat ~/.filescope/broker.log
```

Common errors:

- `ECONNREFUSED` — llama-server isn't reachable (go back to checks 1-3).
- Stale socket file — remove it and let the broker respawn:
  ```bash
  rm ~/.filescope/broker.sock
  ```
  Then restart your Claude Code session.

### 6. Is `wsl-host` resolving correctly?

```bash
ip route show default
```

This should print one line whose third field is the Windows host gateway IP. If this fails (unusual), edit `~/.filescope/broker.json` and replace `wsl-host` with the literal IP.