---
categories:
  - jetstream
  - llm
layout: post
date: 2025-09-30
slug: jetstream-llm-llama-cpp
title: Deploy a ChatGPT-like LLM on Jetstream with llama.cpp
---

This is a crosspost of the official Jetstream documentation: [Deploy a ChatGPT-like LLM service on Jetstream](https://docs.jetstream-cloud.org/general/llm/). I built a brand new version of that tutorial that swaps in `llama.cpp` for `vLLM` so we can run GGUF quantized models on Jetstream's GPUs without giving up speed or context length.

Tutorial last updated in September 2025.

In this tutorial we deploy a Large Language Model (LLM) on Jetstream, run inference locally on the smallest currently available GPU node (`g3.medium`, 10 GB VRAM), then install a web chat interface (Open WebUI) and serve it with HTTPS using Caddy.

Before spinning up your own GPU, consider the managed [Jetstream LLM inference service](https://docs.jetstream-cloud.org/inference-service/overview/). It may be more cost‑ and time‑effective if you just need API access to standard models.

We will deploy a single (quantized) model: **`Meta Llama 3.1 8B Instruct Q3_K_M` (GGUF)**. This quantized 8B model fits comfortably in ~8 GB of GPU memory, so it runs on a `g3.medium` (10 GB) with a little headroom. (The older `g3.small` flavor has been retired.) If you later choose a different quantization or a larger context length, or move to an unquantized 8B / 70B model, you'll need a larger flavor; adjust accordingly.

This tutorial is adapted from work by [Tijmen de Haan](https://www2.kek.jp/qup/en/member/dehaan.html), the author of [Cosmosage](https://cosmosage.online/).

## Model choice & sizing

Jetstream GPU flavors (current key options):

| Instance Type | Approx. GPU Memory (GB) |
|---------------|-------------------------|
| g3.medium     | 10                      |
| g3.large      | 20                      |
| g3.xl         | 40 (full A100)          |

We pick the quantized `Llama 3.1 8B Instruct Q3_K_M` variant (GGUF format). Its VRAM residency during inference is about 8 GB with default context settings, leaving some margin on `g3.medium`. Always keep a couple of GB free to avoid OOM errors when increasing context length or concurrency. Ensure the model is an Instruct fine‑tuned variant (it is) so it responds well to chat prompts.

## Create a Jetstream instance

Log in to Exosphere, request an Ubuntu 24 **`g3.medium`** instance (name it `chat`) and SSH into it using either your SSH key or the passphrase generated by Exosphere.

## Load Miniforge

A centrally provided Miniforge module is available on Jetstream images. Load it (each new shell) and then create the two Conda environments used below (one for the model server, one for the web UI).

```bash
module load miniforge
conda init
```

> After running `conda init`, reload your shell so `conda` is available: run `exec bash -l` (avoids logging out and back in).

## Serve the model with `llama.cpp` (OpenAI‑compatible server)

We use `llama.cpp` via the `llama-cpp-python` package, which provides an OpenAI‑style HTTP API (default port 8000) that Open WebUI can connect to. Create an environment and install (remember to `module load miniforge` first in any new shell). The last `pip install` step may take several minutes to compile llama.cpp from source, so please be patient.
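Before building, it can be worth confirming that the GPU and the CUDA compiler from the NVHPC module are visible, since the `CMAKE_ARGS` in the next step point the build at `nvcc`. A quick sanity check (a sketch; the module name is the same one used in the install commands below):

```bash
# Confirm the GPU is visible to the instance
nvidia-smi

# Confirm the CUDA compiler from the NVHPC module is on PATH
module load nvhpc/24.7/nvhpc
which nvcc
nvcc --version
```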
```bash
conda create -y -n llama python=3.11
conda activate llama
conda install -y cmake ninja scikit-build-core huggingface_hub
module load nvhpc/24.7/nvhpc

# Enable CUDA acceleration with explicit compilers, arch, release build
CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_COMPILER=$(which nvcc) -DCMAKE_C_COMPILER=$(which gcc) -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_BUILD_TYPE=Release" \
  pip install --no-cache-dir --no-build-isolation --force-reinstall "llama-cpp-python[server]==0.3.16"
```

Download the quantized GGUF file (`Q3_K_M` variant) from the QuantFactory model page: https://huggingface.co/QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF

```bash
mkdir -p ~/models
hf download QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf \
  --local-dir ~/models
```

Test run (Ctrl-C to stop):

```bash
python -m llama_cpp.server \
  --model /home/exouser/models/Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf \
  --chat_format llama-3 \
  --n_ctx 8192 \
  --n_gpu_layers -1 \
  --port 8000
```

`--n_gpu_layers -1` tells `llama.cpp` to offload **all model layers** to the GPU (full GPU inference). Without this flag the default (`n_gpu_layers=0`) keeps every layer on the CPU, which results in only ~1 GB of VRAM being used and much slower generation. Full offload of this 8B `Q3_K_M` model plus context buffers should occupy roughly 8–9 GB VRAM at `--n_ctx 8192` on first real requests.

If it fails to start with an out‑of‑memory (OOM) error, you have a few mitigation options (apply one, then retry):

* Lower the context length: e.g. `--n_ctx 4096` (largest single lever; roughly linear VRAM impact for the KV cache).
* Partially offload: replace `--n_gpu_layers -1` with a number (e.g. `--n_gpu_layers 20`). Remaining layers will run on the CPU (slower, but reduces VRAM need).
* Use a lower‑bit quantization (e.g. `Q2_K`) or a smaller model.

You can inspect VRAM usage with `watch -n 2 nvidia-smi` after starting the server.

Quick note on the “KV cache”: during generation the model reuses previously computed attention Key and Value tensors (instead of recalculating them for each new token). These tensors are stored per layer and per processed token; as your prompt and conversation grow, the cache grows linearly with the number of tokens kept in context. That’s why idle VRAM (roughly the weights alone) is lower (~6 GB) and rises toward the higher number (up to ~8–9 GB here) only after longer prompts / chats. Reducing `--n_ctx` caps the maximum KV cache size; clearing history or restarting frees it.

If it starts without errors, create a systemd service so it restarts automatically.

> Quick option: If you prefer a single copy/paste that creates **both** the `llama` and `open-webui` systemd services at once, skip the next two manual unit file sections and jump ahead to the subsection titled "(Optional) One-liner to create both services" below. You can always come back here for the longer, step-by-step version and troubleshooting notes.
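Before (or after) wiring the server into systemd, it is worth confirming that the OpenAI-compatible API actually answers. A minimal `curl` check while the server is running; the `model` value below is illustrative and may need to match the name reported by `/v1/models`:

```bash
# List the model(s) the server reports
curl -s http://localhost:8000/v1/models

# Send a short chat completion request to the OpenAI-style endpoint
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64
      }'
```

A JSON response containing a short assistant message confirms the model is loaded and generating; if the request hangs or errors, check the server output before continuing.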
Using `sudo` to run your preferred text editor, create `/etc/systemd/system/llama.service` with the following contents:

```ini
[Unit]
Description=Llama.cpp OpenAI-compatible server
After=network.target

[Service]
User=exouser
Group=exouser
WorkingDirectory=/home/exouser
ExecStart=/bin/bash -lc "module load nvhpc/24.7/nvhpc miniforge && conda run -n llama python -m llama_cpp.server --model /home/exouser/models/Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf --chat_format llama-3 --n_ctx 8192 --n_gpu_layers -1 --port 8000"
Restart=always

[Install]
WantedBy=multi-user.target
```

Enable and start:

```bash
sudo systemctl enable llama
sudo systemctl start llama
```

Troubleshooting:

* Logs: `sudo journalctl -u llama -f`
* Status: `sudo systemctl status llama`
* GPU usage: `nvidia-smi` (≈6 GB idle right after start with full offload; can grow toward ~8–9 GB under long prompts/conversations as the KV cache fills)

## Configure the chat interface

The chat interface is provided by [Open WebUI](https://openwebui.com/). Create the environment (in a new shell, remember to `module load miniforge` first):

```bash
module load miniforge
conda create -y -n open-webui python=3.11
conda activate open-webui
pip install open-webui
open-webui serve
```

If this starts with no errors, we can kill it with `Ctrl-C` and create a service for it. Using `sudo` to run your preferred text editor, create `/etc/systemd/system/webui.service` with the following contents:

```ini
[Unit]
Description=Open Web UI serving
After=network.target

[Service]
User=exouser
Group=exouser
WorkingDirectory=/home/exouser
# Run Open WebUI inside the open-webui conda environment
ExecStart=/bin/bash -lc "module load miniforge && conda run -n open-webui open-webui serve"
Restart=always
# PATH managed by module + conda

[Install]
WantedBy=multi-user.target
```

Then enable and start the service:

```bash
sudo systemctl enable webui
sudo systemctl start webui
```

### (Optional) One-liner to create both services

If you already created the Conda environments (`llama` and `open-webui`) and downloaded the model, you can create, enable, and start both systemd services (model server + Open WebUI) in a single copy/paste. Adjust `MODEL`, `N_CTX`, `USER`, and `NVHPC_MOD` if needed before running:

```bash
MODEL=/home/exouser/models/Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf
N_CTX=8192
USER=exouser
NVHPC_MOD=nvhpc/24.7/nvhpc

sudo tee /etc/systemd/system/llama.service >/dev/null <<EOF
[Unit]
Description=Llama.cpp OpenAI-compatible server
After=network.target

[Service]
User=${USER}
Group=${USER}
WorkingDirectory=/home/${USER}
ExecStart=/bin/bash -lc "module load ${NVHPC_MOD} miniforge && conda run -n llama python -m llama_cpp.server --model ${MODEL} --chat_format llama-3 --n_ctx ${N_CTX} --n_gpu_layers -1 --port 8000"
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo tee /etc/systemd/system/webui.service >/dev/null <<EOF
[Unit]
Description=Open Web UI serving
After=network.target

[Service]
User=${USER}
Group=${USER}
WorkingDirectory=/home/${USER}
ExecStart=/bin/bash -lc "module load miniforge && conda run -n open-webui open-webui serve"
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now llama webui
```
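Whichever route you took, a quick local check confirms both services are up and listening before you put the HTTPS reverse proxy in front of them. A minimal sketch, assuming the default ports (8000 for the llama.cpp server, 8080 for Open WebUI):

```bash
# Both units should report "active (running)"
sudo systemctl --no-pager status llama webui

# The model server should answer on port 8000
curl -s http://localhost:8000/v1/models

# Open WebUI should answer on its default port 8080
curl -sI http://localhost:8080 | head -n 1
```

If either check fails, `sudo journalctl -u llama -f` or `sudo journalctl -u webui -f` will usually point at the cause.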