# LLM Setup for FileScopeMCP

[Back to README](../README.md)

FileScopeMCP uses [llama.cpp](https://github.com/ggml-org/llama.cpp)'s `llama-server` as a local OpenAI-compatible LLM backend. The default model is Qwen3.6 35B A3B MoE (UD-IQ4_XS quant, ~3B active params per token). The model alias `llm-model` is what the broker expects — pass `--alias llm-model` on the llama-server command line.

Without a running llama-server, FileScopeMCP still works for file tracking and dependency analysis — you just won't get auto-generated summaries, concepts, or change-impact assessments.

Pick the guide that matches your setup:

- [Same machine (Linux/macOS)](#same-machine-linuxmacos) — **default** — llama-server and FileScopeMCP on the same Linux or macOS host (this is what agent runtimes like Hermes use)
- [Remote / LAN server](#remote--lan-server) — llama-server on a different machine on your network
- [WSL2 + Windows GPU](#wsl2--windows-gpu) — alternative for Windows users: FileScopeMCP in WSL2, llama-server on the Windows host for GPU access

---

## Same Machine (Linux/macOS)

```bash
./setup-llm.sh
```

This prints a platform-specific setup guide. It does NOT install anything for you — you build or install llama.cpp yourself.
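
Because the script only prints instructions, it can help to check first whether a `llama-server` binary is already available. The following is a minimal sketch, not part of FileScopeMCP: the helper name `find_llama_server` is hypothetical, and the fallback path assumes llama.cpp was cloned into `$HOME/llama.cpp` and built with CMake (which places binaries under `build/bin`).

```bash
# find_llama_server [EXTRA_PATH...]
# Prints the first executable llama-server it finds: first on PATH,
# then among any extra candidate paths given as arguments.
# Returns 1 if none is found.
find_llama_server() {
  local candidate
  for candidate in "$(command -v llama-server || true)" "$@"; do
    if [ -n "$candidate" ] && [ -x "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Assumed clone location; adjust to wherever you built llama.cpp.
find_llama_server "$HOME/llama.cpp/build/bin/llama-server" \
  || echo "llama-server not found; build or install llama.cpp first"
```

If the helper prints a path, you can skip the install step for your platform and go straight to launching.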
### Linux

Build from source with the backend that matches your GPU:

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# NVIDIA:
cmake -B build -DGGML_CUDA=ON

# AMD (Vulkan is the recommended backend — see WSL2 section for why):
cmake -B build -DGGML_VULKAN=ON

cmake --build build --config Release -j
```

Or run the CUDA Docker image:

```bash
docker run --gpus all -p 8880:8880 \
  -v $HOME/models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf \
  --alias llm-model -c 98304 -n 32768 -ngl 99 --n-cpu-moe 14 \
  -fa on --no-mmap --mlock -b 2048 -ub 512 \
  --cache-type-k q8_0 --cache-type-v q8_0 --swa-full \
  --no-context-shift --ctx-checkpoints 128 --checkpoint-every-n-tokens 4096 \
  --cache-ram 4096 --jinja --reasoning-format deepseek --reasoning-budget 4096 \
  --host 0.0.0.0 --port 8880
```

Run `./setup-llm.sh --launch` to print the exact launch command for the native binary.

### macOS

```bash
brew install llama.cpp
```

Metal is the default backend — no configuration needed. Launch with the same command `./setup-llm.sh --launch` prints.

### First run

The GGUF file must already exist at the path specified by `-m`. Download it before launching. llama-server does not accept HTTP traffic until the model is fully loaded. Verify with:

```bash
./setup-llm.sh --status
```

No broker config changes are needed — the default `broker.default.json` template points at `localhost:8880`, and the broker auto-copies it to `~/.filescope/broker.json` on first start if the file is missing.

### Run as a systemd service (Linux only)

Once your launch script (typically `~/start-llama-server.sh`, the command `./setup-llm.sh --launch` prints) is in place, register llama-server as a systemd unit so it auto-starts on boot, restarts on failure, and logs to `journalctl`:

```bash
sudo ./setup-llm.sh --install-service
```

The unit lives at `/etc/systemd/system/llama-server.service` (template at `monitoring/systemd/llama-server.service`).
It captures stdout/stderr to the journal, sets `OOMScoreAdjust=-500` so the kernel picks lighter workloads first under memory pressure, and writes a start-time metric used by the optional [monitoring dashboard](../monitoring/).

Override the launch-script path with `--start-script /path/to/script` if it isn't at `$HOME/start-llama-server.sh`. The flag is refused under WSL2 — see the [WSL2 + Windows GPU](#wsl2--windows-gpu) section instead.

Tail logs with:

```bash
sudo journalctl -u llama-server -f
```

---

## WSL2 + Windows GPU

Alternative setup for Windows users with a dedicated GPU. WSL2 doesn't give native GPU access to llama.cpp, so llama-server runs on Windows and FileScopeMCP connects to it across the WSL2 boundary. If you're on a native Linux host (including Hermes on Ubuntu), use the [Same Machine](#same-machine-linuxmacos) guide above instead.

### Step 1: Pick the Windows binary

Download the llama.cpp Windows release from [github.com/ggml-org/llama.cpp/releases](https://github.com/ggml-org/llama.cpp/releases). Pick the zip that matches your GPU.

| GPU | File | Backend |
|-----|------|---------|
| **AMD RDNA2/RDNA3** (RX 6800 XT, RX 7900 XT, etc.) | `llama-*-bin-win-vulkan-x64.zip` | Vulkan |
| **NVIDIA** | `llama-*-bin-win-cuda-12.X-x64.zip` | CUDA (no toolkit required for prebuilt) |
| **Intel Arc** | `llama-*-bin-win-vulkan-x64.zip` | Vulkan |

**For AMD: use Vulkan, NOT ROCm.** Two reasons:

1. The ROCm backend is broken on Windows 11 since llama.cpp build b8152 (Issue #19943) — models load CPU-only.
2. Vulkan is 0-50% faster than ROCm on RDNA2 in practice, and the gap widens for MoE models.

No HIP SDK, no Visual Studio, no ROCm SDK needed.

### Step 2: Extract the zip

Right-click → Extract All → enter `C:\llama.cpp`. The zip may or may not create a nested subfolder. Find the folder that actually contains `llama-server.exe`:

```powershell
Get-ChildItem -Recurse -Filter llama-server.exe C:\llama.cpp
```

Note the exact folder — you will `cd` into it in Step 4.
### Step 3: Open port 8880 in Windows Firewall

In an elevated PowerShell (Run as Administrator):

```powershell
New-NetFirewallRule -DisplayName "llama-server 8880" `
  -Direction Inbound -Action Allow -Protocol TCP -LocalPort 8880
```

If your WSL2 network interface is on the Public profile (unusual — usually Private), ensure the rule covers both.

### Step 4: Launch llama-server

In PowerShell, from the folder that contains `llama-server.exe`:

```powershell
cd C:\llama.cpp   # or the nested subfolder from Step 2
.\llama-server.exe `
  -m ~/models/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf `
  --alias llm-model `
  -c 98304 `
  -n 32768 `
  -ngl 99 `
  --n-cpu-moe 14 `
  -fa on `
  --no-mmap `
  --mlock `
  -b 2048 -ub 512 `
  --cache-type-k q8_0 --cache-type-v q8_0 `
  --swa-full `
  --no-context-shift `
  --ctx-checkpoints 128 `
  --checkpoint-every-n-tokens 4096 `
  --cache-ram 4096 `
  --jinja `
  --reasoning-format deepseek `
  --reasoning-budget 4096 `
  --chat-template-kwargs '{"preserve_thinking":true}' `
  --temp 0.6 `
  --top-p 0.95 `
  --top-k 20 `
  --min-p 0.0 `
  --presence-penalty 0.0 `
  --repeat-penalty 1.0 `
  --host 0.0.0.0 --port 8880 `
  --metrics `
  -np 1
```

Flag breakdown:

- `-ngl 99` — offload all layers to GPU
- `--n-cpu-moe 14` — keep routed expert FFNs in system RAM for 14 layers (tuned for 16GB VRAM). Raise to `99` if you hit OOM; lower for more speed if you have headroom.
- `-fa on` — flash attention
- `--no-mmap --mlock` — disable memory mapping, lock model in RAM for consistent performance
- `-b 2048 -ub 512` — logical and physical batch size
- `--cache-type-k q8_0 --cache-type-v q8_0` — KV cache in int8. **Do NOT use `q4_0` on gfx1030** — known segfault (Issue #15107).
- `--swa-full` — full sliding window attention
- `--no-context-shift --ctx-checkpoints 128 --checkpoint-every-n-tokens 4096` — context management with periodic checkpoints
- `--cache-ram 4096` — 4GB RAM cache for context checkpoints
- `--jinja` — enable Jinja chat template
- `--reasoning-format deepseek --reasoning-budget 4096` — enable reasoning mode with 4K token budget
- `--chat-template-kwargs '{"preserve_thinking":true}'` — preserve thinking blocks in output
- `-c 98304` — 96K context window
- `-n 32768` — max tokens per generation

**RAM requirement:** `--n-cpu-moe` streams routed experts from system RAM. At `--n-cpu-moe 14`, keep ~12GB of system RAM free beyond what Windows itself uses; at `--n-cpu-moe 99`, keep ~20GB. The `--cache-ram 4096` flag reserves an additional 4GB for context checkpoints.

### Step 5: Configure the broker in WSL

```bash
mkdir -p ~/.filescope
cp ~/FileScopeMCP/broker.windows-host.json ~/.filescope/broker.json
```

The `wsl-host` placeholder in `broker.windows-host.json` is auto-resolved by the broker at startup — `src/broker/config.ts` runs `ip route show default | awk '{print $3}'` to find the Windows host IP and rewrites `baseURL` in memory. No manual editing required in 99% of cases.

### Step 6: Verify from WSL

```bash
curl http://$(ip route show default | awk '{print $3}'):8880/v1/models
```

Expected: JSON with `data[].id` containing `llm-model`.

### Step 7: Restart Claude Code

Start (or restart) a Claude Code session in your project. FileScopeMCP auto-spawns the broker, which connects to llama-server on Windows.

Verify end-to-end with:

```bash
./setup-llm.sh --status
```

Or call `status()` from an MCP tool in Claude Code.

---

## Remote / LAN Server

llama-server runs on a different machine on your network.

**1. On the remote machine:** Launch llama-server with the full flag set from Step 4 above, ensuring `--host 0.0.0.0 --port 8880 --alias llm-model` are present.

**2. In WSL / on the FileScopeMCP machine:**

```bash
mkdir -p ~/.filescope
cp ~/FileScopeMCP/broker.remote-lan.json ~/.filescope/broker.json
```

**3. Edit `~/.filescope/broker.json`** and replace the placeholder IP `192.168.1.100` in `baseURL` with the actual IP of the remote machine (shown here as `YOUR_SERVER_IP`):

```json
{
  "llm": {
    "provider": "openai-compatible",
    "model": "llm-model",
    "baseURL": "http://YOUR_SERVER_IP:8880/v1",
    "maxTokensPerCall": 1024
  },
  "jobTimeoutMs": 120000,
  "maxQueueSize": 1000
}
```

**4. Verify connectivity** (substitute the same IP for `YOUR_SERVER_IP`):

```bash
curl http://YOUR_SERVER_IP:8880/v1/models
```

**5. Restart Claude Code.**

---

## WSL + Windows Troubleshooting

If FileScopeMCP runs in WSL2 and llama-server runs on Windows, work through these checks in order.

### 1. Is llama-server running on Windows?

Check the PowerShell window it was launched in. If you closed that window, llama-server is gone — relaunch it with the command from Step 4.

### 2. Is llama-server listening on all interfaces?

In a Windows terminal:

```powershell
netstat -an | findstr 8880
```

You should see `0.0.0.0:8880`. If you see `127.0.0.1:8880`, you forgot `--host 0.0.0.0` on the launch command.

### 3. Can WSL reach the Windows host?

From WSL:

```bash
ip route show default | awk '{print $3}'
curl http://$(ip route show default | awk '{print $3}'):8880/v1/models
```

If `curl` hangs or returns "Connection refused":

- **Firewall:** The inbound rule from Step 3 may not be active. Re-run `New-NetFirewallRule` in an elevated PowerShell.
- **VPN/proxy:** Some VPN software changes WSL2 networking. Try disconnecting the VPN temporarily.

### 4. Is the broker config correct?

```bash
cat ~/.filescope/broker.json
```

`baseURL` should contain `wsl-host:8880` (auto-resolved at startup) or the literal Windows host IP on port 8880. If it points at a different port or still has `localhost`, re-copy the template:

```bash
cp ~/FileScopeMCP/broker.windows-host.json ~/.filescope/broker.json
```

### 5. Is the broker process running?
```bash
ps aux | grep broker | grep -v grep
cat ~/.filescope/broker.log
```

Common errors:

- `ECONNREFUSED` — llama-server isn't reachable (go back to checks 2-3).
- Stale socket file — remove it and let the broker respawn:

  ```bash
  rm ~/.filescope/broker.sock
  ```

  Then restart your Claude Code session.

### 6. Is `wsl-host` resolving correctly?

```bash
ip route show default
```

This should print one line whose third field is the Windows host gateway IP. If this fails (unusual), edit `~/.filescope/broker.json` and replace `wsl-host` with the literal IP.
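
The manual fallback in check 6 can also be scripted. This is a minimal sketch under assumptions: it assumes the only occurrence of the string `wsl-host` in `broker.json` is the placeholder in `baseURL`, and the helper name `pin_wsl_host` is hypothetical, not something FileScopeMCP ships.

```bash
# pin_wsl_host CONFIG_FILE HOST_IP
# Replaces the wsl-host placeholder in CONFIG_FILE with a literal IP.
# The original file is kept next to it as CONFIG_FILE.bak.
pin_wsl_host() {
  local config="$1" host_ip="$2"
  sed -i.bak "s/wsl-host/${host_ip}/g" "$config"
}

# Usage from inside WSL2, once you know the Windows host IP
# (substitute it for 172.20.0.1, which is only an example value):
#   pin_wsl_host ~/.filescope/broker.json 172.20.0.1
```

After pinning the IP, restart your Claude Code session so the broker rereads the config. If the Windows host IP changes later, restore `broker.json.bak` or re-copy the template.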