### [llama.cpp](https://github.com/ggerganov/llama.cpp)

> Handle: `llamacpp`
> URL: [http://localhost:33831](http://localhost:33831)

![llama](https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png)

[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT) [![Server](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml/badge.svg)](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml) [![Conan Center](https://shields.io/conan/v/llama-cpp)](https://conan.io/center/llama-cpp)

LLM inference in C/C++. Allows bypassing the Ollama release cycle when needed - to get access to the latest models or features.

#### Starting

The `llamacpp` docker image is quite large due to its dependency on CUDA and other libraries. You might want to pull it ahead of time.

```bash
# [Optional] Pull the llamacpp
# images ahead of starting the service
harbor pull llamacpp

# Start the llama.cpp service
harbor up llamacpp

# Tail service logs
harbor logs llamacpp

# Open llamacpp Web UI
harbor open llamacpp
```

- Harbor will automatically allocate GPU resources to the container if available, see [Capabilities Detection](./3.-Harbor-CLI-Reference.md#capabilities-detection).
- `llamacpp` will be connected to the `aider`, `anythingllm`, `boost`, `chatui`, `cmdh`, `opint`, `optillm`, `plandex`, `traefik`, and `webui` services when they are running together.

#### Models

You can find GGUF models to run on Huggingface [here](https://huggingface.co/models?sort=trending&search=gguf). After you find a model you want to run, grab the URL from the browser address bar and pass it to [`harbor config`](./3.-Harbor-CLI-Reference#harbor-config).

```bash
# Quick lookup for the models
harbor hf find gguf

# 1. With llama.cpp own cache:
#
# - Set the model to run, will be downloaded when llamacpp starts
#   Accepts a full URL to the GGUF file (from Browser address bar)
harbor llamacpp model https://huggingface.co/user/repo/file.gguf

# TIP: You can monitor the download progress with a one-liner below
# Replace "" with the unique portion from the "file.gguf" URL above
du -h $(harbor find .gguf | grep )

# 2. Shared HuggingFace Hub cache, single file:
#
# - Locate the GGUF to download, for example:
#   https://huggingface.co/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF/blob/main/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
# - Download a single file:
harbor hf download bartowski/Meta-Llama-3.1-70B-Instruct-GGUF Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
# - Locate the file in the cache
harbor find Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
# - Set the GGUF to llama.cpp
#   "/app/models/hub" is where the HuggingFace cache is mounted in the container
harbor llamacpp gguf /app/models/hub/models--bartowski--Meta-Llama-3.1-70B-Instruct-GGUF/snapshots/83fb6e83d0a8aada42d499259bc929d922e9a558/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf

# 3. Shared HuggingFace Hub cache, whole repo:
#
# - Locate and download the repo in its entirety
harbor hf download av-codes/Trinity-2-Codestral-22B-Q4_K_M-GGUF
# - Find the files from the repo
harbor find Trinity-2-Codestral-22B-Q4_K_M-GGUF
# - Set the GGUF to llama.cpp
#   "/app/models/hub" is where the HuggingFace cache is mounted in the container
harbor llamacpp gguf /app/models/hub/models--av-codes--Trinity-2-Codestral-22B-Q4_K_M-GGUF/snapshots/c0a1f7283809423d193025e92eec6f287425ed59/trinity-2-codestral-22b-q4_k_m.gguf
```

> [!NOTE]
> Please note that this procedure doesn't download the model. If the model is not found in the cache, it will be downloaded on the next start of the `llamacpp` service.
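For example, switching to a different model and letting `llamacpp` fetch it on startup could look like the sketch below. The GGUF URL is a placeholder, and it assumes the usual `harbor down` / `harbor up` cycle to restart the service:

```bash
# Point llamacpp at the new GGUF (placeholder URL)
harbor llamacpp model https://huggingface.co/user/repo/file.gguf

# Restart so the change takes effect; the model is downloaded
# during this start if it's not already in the cache
harbor down
harbor up llamacpp

# Watch the download progress in the service logs
harbor logs llamacpp
```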
Downloaded models are stored in the global `llama.cpp` cache on your local machine (the same one the native version uses).

The server can only run one model at a time and must be restarted to switch models.

#### Configuration

You can provide additional arguments to the `llama.cpp` CLI via `LLAMACPP_EXTRA_ARGS`. It can be set either with the Harbor CLI or in the `.env` file.

```bash
# See llama.cpp server args
harbor run llamacpp --server --help

# Set the extra arguments
harbor llamacpp args '--max-tokens 1024 -ngl 100'

# Edit the .env file
HARBOR_LLAMACPP_EXTRA_ARGS="--max-tokens 1024 -ngl 100"
```

You can add `llamacpp` to the default services in Harbor:

```bash
# Add llamacpp to the default services
# Will always start when running `harbor up`
harbor defaults add llamacpp

# Remove llamacpp from the default services
harbor defaults rm llamacpp
```

The following options are available via [`harbor config`](./3.-Harbor-CLI-Reference#harbor-config):

```bash
# Location of the llama.cpp own cache, either global
# or relative to $(harbor home)
LLAMACPP_CACHE        ~/.cache/llama.cpp

# The port on the host machine where the llama.cpp service
# will be available
LLAMACPP_HOST_PORT    33831
```

#### Environment Variables

Follow Harbor's [environment variables guide](./1.-Harbor-User-Guide#environment-variables) to set arbitrary variables for the `llamacpp` service.

#### `llama.cpp` CLIs and scripts

`llama.cpp` comes with a lot of helper tools/CLIs, which can all be accessed via the `harbor exec llamacpp` command (once the service is running).

```bash
# Show the list of available llama.cpp CLIs
harbor exec llamacpp ls

# See the help for one of the CLIs
harbor exec llamacpp ./scripts/llama-bench --help
```
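As a concrete follow-up, you could benchmark a model that is already present in the container by passing its in-container path to `llama-bench`. This is only a sketch: it assumes you configured the Meta-Llama-3.1 GGUF from the "Models" section above, so the path matches the one set via `harbor llamacpp gguf`.

```bash
# Benchmark the GGUF configured earlier
# (path inside the container, under the mounted HuggingFace cache)
harbor exec llamacpp ./scripts/llama-bench \
  -m /app/models/hub/models--bartowski--Meta-Llama-3.1-70B-Instruct-GGUF/snapshots/83fb6e83d0a8aada42d499259bc929d922e9a558/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
```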