### [llama.cpp](https://github.com/ggerganov/llama.cpp)

> Handle: `llamacpp`
> URL: [http://localhost:33831](http://localhost:33831)

![llama](https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png)

[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT) [![Server](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml/badge.svg)](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml) [![Conan Center](https://shields.io/conan/v/llama-cpp)](https://conan.io/center/llama-cpp)

LLM inference in C/C++. Allows you to bypass the Ollama release cycle when needed to get access to the latest models or features.

#### Starting

The `llamacpp` docker image is quite large due to its dependency on CUDA and other libraries. You might want to pull it ahead of time.

```bash
# [Optional] Pull the llamacpp
# images ahead of starting the service
harbor pull llamacpp

# Start the llama.cpp service
harbor up llamacpp

# Tail service logs
harbor logs llamacpp

# Open llamacpp Web UI
harbor open llamacpp
```

- Harbor will automatically allocate GPU resources to the container if available, see [Capabilities Detection](./3.-Harbor-CLI-Reference.md#capabilities-detection).
- `llamacpp` will be connected to the `aider`, `anythingllm`, `boost`, `chatui`, `cmdh`, `opint`, `optillm`, `plandex`, `traefik`, and `webui` services when they are running together.

#### Models

You can find GGUF models to run on HuggingFace [here](https://huggingface.co/models?sort=trending&search=gguf). After you find a model you want to run, grab the URL from the browser address bar and pass it to [`harbor config`](./3.-Harbor-CLI-Reference#harbor-config).

**Quick download from HuggingFace:**

```bash
# Pull a model directly from HuggingFace (with optional tag)
# Downloads to llama.cpp cache using an ephemeral server
harbor pull microsoft/Phi-3.5-mini-instruct-gguf
harbor pull microsoft/Phi-3.5-mini-instruct-gguf:Q4_K_M
```

This method automatically downloads the model to llama.cpp's cache. The model will be available for use immediately.

When `llamacpp` is running, you can check which models it detects in the cache with:

```bash
# List models detected in the cache
# Needs curl+jq installed on the host
harbor llamacpp models
```
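You can also query the server's OpenAI-compatible API directly to see which models it currently exposes (in single-model mode this lists just the loaded model). A minimal sketch, assuming the default host port `33831` and `jq` installed on the host:

```bash
# Ask llama.cpp's OpenAI-compatible endpoint which models it serves
# and print only their IDs
curl -s http://localhost:33831/v1/models | jq '.data[].id'
```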
**Manual configuration methods:**

```bash
# Quick lookup for the models
harbor hf find gguf

# 1. With llama.cpp's own cache:
#
# - Set the model to run; it will be downloaded when llamacpp starts.
#   Accepts a full URL to the GGUF file (from the browser address bar)
harbor llamacpp model https://huggingface.co/user/repo/file.gguf

# TIP: You can monitor the download progress with the one-liner below.
# Replace "<unique-part>" with the unique portion from the "file.gguf" URL above
du -h $(harbor find .gguf | grep "<unique-part>")

# 2. Shared HuggingFace Hub cache, single file:
#
# - Locate the GGUF to download, for example:
#   https://huggingface.co/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF/blob/main/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
# - Download a single file:
harbor hf download bartowski/Meta-Llama-3.1-70B-Instruct-GGUF Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
# - Locate the file in the cache
harbor find Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
# - Set the GGUF for llama.cpp.
#   "/app/models/hub" is where the HuggingFace cache is mounted in the container
harbor llamacpp gguf /app/models/hub/models--bartowski--Meta-Llama-3.1-70B-Instruct-GGUF/snapshots/83fb6e83d0a8aada42d499259bc929d922e9a558/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf

# 3. Shared HuggingFace Hub cache, whole repo:
#
# - Locate and download the repo in its entirety
harbor hf download av-codes/Trinity-2-Codestral-22B-Q4_K_M-GGUF
# - Find the files from the repo
harbor find Trinity-2-Codestral-22B-Q4_K_M-GGUF
# - Set the GGUF for llama.cpp.
#   "/app/models/hub" is where the HuggingFace cache is mounted in the container
harbor llamacpp gguf /app/models/hub/models--av-codes--Trinity-2-Codestral-22B-Q4_K_M-GGUF/snapshots/c0a1f7283809423d193025e92eec6f287425ed59/trinity-2-codestral-22b-q4_k_m.gguf
```

> [!NOTE]
> Please note that this procedure doesn't download the model. If the model is not found in the cache, it will be downloaded on the next start of the `llamacpp` service.

Downloaded models are stored in the global `llama.cpp` cache on your local machine (the same cache the native version uses).

The server can only run one model at a time and must be restarted to switch models.

#### Multiple models (router mode)

TLDR;

```bash
harbor config set llamacpp.model.specifier ""
harbor up llamacpp # shows all your downloaded models
```

The official llama.cpp server can run in router mode to load and unload multiple models dynamically. In Harbor, this maps to starting the service without a fixed model specifier and using extra args to point to model sources.

**Start in router mode**

Router mode requires no `-m`/`--hf-repo` arguments. Clear the model specifier and restart the service:

```bash
# Clear the model specifier (router mode requires no -m / --hf-repo)
harbor config set llamacpp.model.specifier ""

# Or edit .env directly
HARBOR_LLAMACPP_MODEL_SPECIFIER=""
```

**Model sources (official docs → Harbor paths)**

- **Cache (default):** llama.cpp uses its cache to discover models. In Harbor this is mounted at `/root/.cache/llama.cpp` from `HARBOR_LLAMACPP_CACHE`.
- **Models directory:** place GGUFs under `./llamacpp/data/models` and point the router to `/app/data/models`.
- **Preset file:** place an INI file at `./llamacpp/data/models.ini` and point the router to `/app/data/models.ini`.

```bash
# Use a models directory
harbor llamacpp args "--models-dir /app/data/models"

# Or use a preset file
harbor llamacpp args "--models-preset /app/data/models.ini"

# Optional router limits
harbor llamacpp args "--models-dir /app/data/models --models-max 4 --no-models-autoload"
```

If a model has multiple GGUF shards or an `mmproj` file for multimodal, place them in a subdirectory (the `mmproj` file name must start with `mmproj`), as in the sketch below.
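For illustration, a models directory with one single-file model and one multimodal model in its own subdirectory could be laid out as follows; all model file and directory names here are hypothetical:

```bash
# Hypothetical layout for ./llamacpp/data/models (seen as /app/data/models
# inside the container). Single-file models sit at the top level; sharded or
# multimodal models get their own subdirectory, with the projector file
# named mmproj*.gguf
mkdir -p ./llamacpp/data/models/gemma-3-4b-it
#
# llamacpp/data/models/
# ├── Llama-3.1-8B-Instruct-Q4_K_M.gguf      # single-file model
# └── gemma-3-4b-it/                         # multimodal model
#     ├── gemma-3-4b-it-Q4_K_M.gguf
#     └── mmproj-gemma-3-4b-it-f16.gguf
```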
**Routing and model lifecycle**

- **POST endpoints** route by the `model` field in the JSON body.
- **GET endpoints** use the `?model=...` query parameter.
- Use `/models` to list known models and `/models/load` or `/models/unload` to manage them.

```bash
# List known models
curl http://localhost:33831/models

# Load a model
curl -X POST http://localhost:33831/models/load \
  -H "Content-Type: application/json" \
  -d '{"model":"ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"}'

# Unload a model
curl -X POST http://localhost:33831/models/unload \
  -H "Content-Type: application/json" \
  -d '{"model":"ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"}'

# Route a request to a specific model
curl http://localhost:33831/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"ggml-org/gemma-3-4b-it-GGUF:Q4_K_M","messages":[{"role":"user","content":"Hello"}]}'
```

#### Configuration

You can provide additional arguments to the `llama.cpp` CLI via `LLAMACPP_EXTRA_ARGS`. It can be set either with the Harbor CLI or in the `.env` file.

```bash
# See llama.cpp server args
harbor run llamacpp --server --help

# Set the extra arguments
harbor llamacpp args '--n-predict 1024 -ngl 100'

# Edit the .env file
HARBOR_LLAMACPP_EXTRA_ARGS="--n-predict 1024 -ngl 100"
```

You can add `llamacpp` to the default services in Harbor:

```bash
# Add llamacpp to the default services
# Will always start when running `harbor up`
harbor defaults add llamacpp

# Remove llamacpp from the default services
harbor defaults rm llamacpp
```

The following options are available via [`harbor config`](./3.-Harbor-CLI-Reference#harbor-config):

```bash
# Location of llama.cpp's own cache, either global
# or relative to $(harbor home)
LLAMACPP_CACHE          ~/.cache/llama.cpp

# The port on the host machine where the llama.cpp service
# will be available
LLAMACPP_HOST_PORT      33831

# Docker images for each detected capability
LLAMACPP_IMAGE_CPU      ghcr.io/ggml-org/llama.cpp:server
LLAMACPP_IMAGE_NVIDIA   ghcr.io/ggml-org/llama.cpp:server-cuda
LLAMACPP_IMAGE_ROCM     ghcr.io/ggml-org/llama.cpp:server-rocm
```

To switch the base image, set one or more of these variables to your preferred image/tag. Harbor picks the variable that matches the detected capability (`CPU`, `NVIDIA`, or `ROCM`), so you can override just one target without affecting the others. For example, use `harbor config set llamacpp.image.nvidia ghcr.io/your-org/llama.cpp:server-cuda` to keep the CPU/ROCm defaults while using a custom NVIDIA image.

#### Environment Variables

Follow Harbor's [environment variables guide](./1.-Harbor-User-Guide#environment-variables) to set arbitrary variables for the `llamacpp` service.

#### `llama.cpp` CLIs and scripts

`llama.cpp` comes with a lot of helper tools/CLIs, all of which can be accessed via the `harbor exec llamacpp` command (once the service is running).

```bash
# Show the list of available llama.cpp CLIs
harbor exec llamacpp ls

# See the help for one of the CLIs
harbor exec llamacpp ./scripts/llama-bench --help
```
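For example, a quick benchmark run against a downloaded GGUF might look like the sketch below; the model path is a placeholder for one of the files in your llama.cpp cache, and `-p`/`-n` control the prompt and generation token counts:

```bash
# Benchmark a model from the mounted llama.cpp cache
# (replace the GGUF path with one of your downloaded files)
harbor exec llamacpp ./scripts/llama-bench \
  -m /root/.cache/llama.cpp/your-model.gguf \
  -p 512 -n 128
```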