### [llama.cpp](https://github.com/ggerganov/llama.cpp)
> Handle: `llamacpp`
> URL: [http://localhost:33831](http://localhost:33831)

LLM inference in C/C++. Allows you to bypass the Ollama release cycle when needed, for example to get access to the latest models or features.
#### Starting
The `llamacpp` Docker image is quite large due to its dependency on CUDA and other libraries. You might want to pull it ahead of time.
```bash
# [Optional] Pull the llamacpp
# images ahead of starting the service
harbor pull llamacpp
# Start the llama.cpp service
harbor up llamacpp
# Tail service logs
harbor logs llamacpp
# Open llamacpp Web UI
harbor open llamacpp
```
- Harbor will automatically allocate GPU resources to the container if available, see [Capabilities Detection](./3.-Harbor-CLI-Reference.md#capabilities-detection).
- `llamacpp` will be connected to `aider`, `anythingllm`, `boost`, `chatui`, `cmdh`, `opint`, `optillm`, `plandex`, `traefik`, `webui` services when they are running together.
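For example, to start llama.cpp together with one of these frontends in a single command:
```bash
# Start llama.cpp alongside Open WebUI; Harbor connects them automatically
harbor up llamacpp webui
```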
#### Models
You can find GGUF models to run on HuggingFace [here](https://huggingface.co/models?sort=trending&search=gguf). After you find a model you want to run, grab the URL from the browser address bar and pass it to [`harbor config`](./3.-Harbor-CLI-Reference#harbor-config).
**Quick download from HuggingFace:**
```bash
# Pull a model directly from HuggingFace (with optional tag)
# Downloads to llama.cpp cache using an ephemeral server
harbor pull microsoft/Phi-3.5-mini-instruct-gguf
harbor pull microsoft/Phi-3.5-mini-instruct-gguf:Q4_K_M
```
This method automatically downloads the model to llama.cpp's cache. The model will be available for use immediately.
When `llamacpp` is running, you can check which models it detects in the cache with:
```bash
# List models detected in the cache
# Needs curl+jq installed on the host
harbor llamacpp models
```
**Manual configuration methods:**
```bash
# Quick lookup for the models
harbor hf find gguf
# 1. With llama.cpp's own cache:
#
# - Set the model to run; it will be downloaded when llamacpp starts
# Accepts a full URL to the GGUF file (from the browser address bar)
harbor llamacpp model https://huggingface.co/user/repo/file.gguf
# TIP: You can monitor the download progress with the one-liner below
# Replace "" with a unique portion of the "file.gguf" URL above
du -h $(harbor find .gguf | grep "")
# 2. Shared HuggingFace Hub cache, single file:
#
# - Locate the GGUF to download, for example:
# https://huggingface.co/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF/blob/main/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
# - Download a single file:
harbor hf download bartowski/Meta-Llama-3.1-70B-Instruct-GGUF Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
# - Locate the file in the cache
harbor find Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
# - Set the GGUF to llama.cpp
# "/app/models/hub" is where the HuggingFace cache is mounted in the container
harbor llamacpp gguf /app/models/hub/models--bartowski--Meta-Llama-3.1-70B-Instruct-GGUF/snapshots/83fb6e83d0a8aada42d499259bc929d922e9a558/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
# 3. Shared HuggingFace Hub cache, whole repo:
#
# - Locate and download the repo in its entirety
harbor hf download av-codes/Trinity-2-Codestral-22B-Q4_K_M-GGUF
# - Find the files from the repo
harbor find Trinity-2-Codestral-22B-Q4_K_M-GGUF
# - Set the GGUF to llama.cpp
# "/app/models/hub" is where the HuggingFace cache is mounted in the container
harbor llamacpp gguf /app/models/hub/models--av-codes--Trinity-2-Codestral-22B-Q4_K_M-GGUF/snapshots/c0a1f7283809423d193025e92eec6f287425ed59/trinity-2-codestral-22b-q4_k_m.gguf
```
> [!NOTE]
> Please note that this procedure doesn't download the model. If the model is not found in the cache, it will be downloaded on the next start of the `llamacpp` service.

Downloaded models are stored in the global `llama.cpp` cache on your local machine (the same cache the native version uses). When started with a fixed model specifier, the server runs one model at a time and must be restarted to switch models; see the router mode section below for serving multiple models.
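For example, to switch models, point the service at a different GGUF and restart it (the URL below is a placeholder):
```bash
# Set a different model (placeholder URL)
harbor llamacpp model https://huggingface.co/user/repo/another-file.gguf
# Restart the service so the new model is loaded
harbor restart llamacpp
```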
#### Multiple models (router mode)
TLDR;
```bash
harbor config set llamacpp.model.specifier ""
harbor up llamacpp # shows all your downloaded models
```
The official llama.cpp server can run in router mode to load and unload multiple models dynamically. In Harbor, this maps to starting the service without a fixed model specifier and using extra args to point to model sources.
**Start in router mode**
Router mode requires no `-m`/`--hf-repo` arguments. Clear the model specifier and restart the service:
```bash
# Clear the model specifier (router mode requires no -m / --hf-repo)
harbor config set llamacpp.model.specifier ""
# Or edit .env directly
HARBOR_LLAMACPP_MODEL_SPECIFIER=""
```
**Model sources (official docs → Harbor paths)**
- **Cache (default):** llama.cpp uses its cache to discover models. In Harbor this is mounted at `/root/.cache/llama.cpp` from `HARBOR_LLAMACPP_CACHE`.
- **Models directory:** place GGUFs under `./llamacpp/data/models` and point the router to `/app/data/models`.
- **Preset file:** place an INI file at `./llamacpp/data/models.ini` and point the router to `/app/data/models.ini`.
```bash
# Use a models directory
harbor llamacpp args "--models-dir /app/data/models"
# Or use a preset file
harbor llamacpp args "--models-preset /app/data/models.ini"
# Optional router limits
harbor llamacpp args "--models-dir /app/data/models --models-max 4 --no-models-autoload"
```
If a model has multiple GGUF shards or an `mmproj` file for multimodal, place them in a subdirectory (the `mmproj` file name must start with `mmproj`).
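As a sketch, a host-side layout for the models directory could look like this (file names are illustrative):
```bash
# Run from $(harbor home); ./llamacpp/data/models is mounted
# at /app/data/models inside the container
mkdir -p llamacpp/data/models/gemma-3-4b-it
# A single-file GGUF can live directly in the models directory
cp ~/Downloads/Qwen2.5-7B-Instruct-Q4_K_M.gguf llamacpp/data/models/
# A multimodal model keeps its shards and mmproj-* file in a subdirectory
cp ~/Downloads/gemma-3-4b-it-Q4_K_M.gguf llamacpp/data/models/gemma-3-4b-it/
cp ~/Downloads/mmproj-gemma-3-4b-it-F16.gguf llamacpp/data/models/gemma-3-4b-it/
```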
**Routing and model lifecycle**
- **POST endpoints** route by the `model` field in the JSON body.
- **GET endpoints** use the `?model=...` query parameter.
- Use `/models` to list known models and `/models/load` or `/models/unload` to manage them.
```bash
# List known models
curl http://localhost:33831/models
# Load a model
curl -X POST http://localhost:33831/models/load \
-H "Content-Type: application/json" \
-d '{"model":"ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"}'
# Unload a model
curl -X POST http://localhost:33831/models/unload \
-H "Content-Type: application/json" \
-d '{"model":"ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"}'
# Route a request to a specific model
curl http://localhost:33831/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"ggml-org/gemma-3-4b-it-GGUF:Q4_K_M","messages":[{"role":"user","content":"Hello"}]}'
```
#### Configuration
You can provide additional arguments to the `llama.cpp` CLI via `LLAMACPP_EXTRA_ARGS`. It can be set either with the Harbor CLI or in the `.env` file.
```bash
# See llama.cpp server args
harbor run llamacpp --server --help
# Set the extra arguments
harbor llamacpp args '-n 1024 -ngl 100'
# Or set it in the .env file directly
HARBOR_LLAMACPP_EXTRA_ARGS="-n 1024 -ngl 100"
```
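Calling the same helper without a value prints the current setting (assuming Harbor's usual getter/setter convention for these aliases):
```bash
# Print the currently configured extra args
harbor llamacpp args
```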
You can add `llamacpp` to default services in Harbor:
```bash
# Add llamacpp to the default services
# Will always start when running `harbor up`
harbor defaults add llamacpp
# Remove llamacpp from the default services
harbor defaults rm llamacpp
```
The following options are available via [`harbor config`](./3.-Harbor-CLI-Reference#harbor-config):
```bash
# Location of the llama.cpp own cache, either global
# or relative to $(harbor home)
LLAMACPP_CACHE ~/.cache/llama.cpp
# The port on the host machine where the llama.cpp service
# will be available
LLAMACPP_HOST_PORT 33831
# Docker images for each detected capability
LLAMACPP_IMAGE_CPU ghcr.io/ggml-org/llama.cpp:server
LLAMACPP_IMAGE_NVIDIA ghcr.io/ggml-org/llama.cpp:server-cuda
LLAMACPP_IMAGE_ROCM ghcr.io/ggml-org/llama.cpp:server-rocm
```
To switch the base image, set one or more of these variables to your preferred image/tag. Harbor picks the variable that matches detected capability (`CPU`, `NVIDIA`, or `ROCM`), so you can override just one target without affecting others. For example, use `harbor config set llamacpp.image.nvidia ghcr.io/your-org/llama.cpp:server-cuda` to keep CPU/ROCm defaults while using a custom NVIDIA image.
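The same dot-notation mapping works for the other options as well; for example (the values below are placeholders):
```bash
# Relocate the llama.cpp cache (path is a placeholder)
harbor config set llamacpp.cache /mnt/fast-ssd/llama.cpp
# Expose the service on a different host port
harbor config set llamacpp.host.port 34831
```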
#### Environment Variables
Follow Harbor's [environment variables guide](./1.-Harbor-User-Guide#environment-variables) to set arbitrary environment variables for the `llamacpp` service.
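For example, llama.cpp's server understands `LLAMA_ARG_*` variables that mirror its CLI flags; a minimal sketch assuming the `harbor env` helper described in that guide:
```bash
# Set a container environment variable for llamacpp
# (LLAMA_ARG_CTX_SIZE mirrors the -c / --ctx-size flag)
harbor env llamacpp LLAMA_ARG_CTX_SIZE 8192
# Restart so the change takes effect
harbor restart llamacpp
```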
#### `llama.cpp` CLIs and scripts
`llama.cpp` comes with many helper tools/CLIs, all of which can be accessed via the `harbor exec llamacpp` command (once the service is running).
```bash
# Show the list of available llama.cpp CLIs
harbor exec llamacpp ls
# See the help for one of the CLIs
harbor exec llamacpp ./scripts/llama-bench --help
```
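For example, to benchmark a GGUF from the cache mounted into the container (the file name is a placeholder, pick one reported by `harbor llamacpp models`):
```bash
# Benchmark a model from the llama.cpp cache mounted at /root/.cache/llama.cpp
harbor exec llamacpp ./scripts/llama-bench -m /root/.cache/llama.cpp/some-model.gguf
```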