---
categories:
- jetstream
- llm
layout: post
date: 2025-09-30
slug: jetstream-llm-llama-cpp
title: Deploy a ChatGPT-like LLM on Jetstream with llama.cpp
---

Tutorial last updated in December 2025

This is a crosspost of the official Jetstream documentation: [Deploy a ChatGPT-like LLM service on Jetstream](https://docs.jetstream-cloud.org/general/llm/).

I built a brand new version of that tutorial that replaces `vLLM` with `llama.cpp`, so we can run GGUF quantized models on Jetstream's GPUs without giving up speed or context length.

In this tutorial we deploy a Large Language Model (LLM) on Jetstream, run inference locally on the smallest currently available GPU node (`g3.medium`, 10 GB VRAM), then install a web chat interface (Open WebUI) and serve it with HTTPS using Caddy.

Before spinning up your own GPU, consider the managed [Jetstream LLM inference service](https://docs.jetstream-cloud.org/inference-service/overview/). It may be more cost- and time-effective if you just need API access to standard models.

We will deploy a single (quantized) model: **`Meta Llama 3.1 8B Instruct Q3_K_M` (GGUF)**. This quantized 8B model fits comfortably in ~8 GB of GPU memory, so it runs on a `g3.medium` (10 GB) with a little headroom. (The older `g3.small` flavor has been retired.) If you later choose a different quantization or a larger context length, or move to an unquantized 8B / 70B model, you'll need a larger flavor; adjust accordingly.

This tutorial is adapted from work by [Tijmen de Haan](https://www2.kek.jp/qup/en/member/dehaan.html), the author of [Cosmosage](https://cosmosage.online/).

## Model choice & sizing

Jetstream GPU flavors (current key options):

| Instance Type | Approx. GPU Memory (GB) |
|---------------|-------------------------|
| g3.medium     | 10                      |
| g3.large      | 20                      |
| g3.xl         | 40 (full A100)          |

We pick the quantized `Llama 3.1 8B Instruct Q3_K_M` variant (GGUF format). Its VRAM residency during inference is about 8 GB with default context settings, leaving some margin on `g3.medium`. Always keep a couple of GB free to avoid OOM errors when increasing context length or concurrency. Ensure the model is an Instruct fine-tuned variant (it is) so it responds well to chat prompts.

## Create a Jetstream instance

Log in to Exosphere, request an Ubuntu 24 **`g3.medium`** instance (name it `chat`) and SSH into it using either your SSH key or the passphrase generated by Exosphere.

## Load Miniforge

A centrally provided Miniforge module is available on Jetstream images. Load it (in each new shell) and then create the two Conda environments used below (one for the model server, one for the web UI).

```bash
module load miniforge
conda init
```

> After running `conda init`, reload your shell so `conda` is available: run `exec bash -l` (this avoids logging out and back in).
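If you want to confirm that the reloaded shell picked up Conda before continuing, here is a quick optional check (generic `conda` commands, nothing specific to this tutorial):

```bash
# conda should now be on PATH; only the "base" environment exists at this point,
# the "llama" and "open-webui" environments are created in the next sections
conda --version
conda env list
```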
## Serve the model with `llama.cpp` (OpenAI-compatible server)

We use `llama.cpp` via the `llama-cpp-python` package, which provides an OpenAI-style HTTP API (default port 8000) that Open WebUI can connect to.

Create an environment and install (remember to `module load miniforge` first in any new shell). The last `pip install` step may take several minutes to compile llama.cpp from source, so please be patient.

```bash
conda create -y -n llama python=3.11
conda activate llama
conda install -y cmake ninja scikit-build-core huggingface_hub
module load nvhpc/24.7/nvhpc
# Enable CUDA acceleration with explicit compilers, arch, release build
CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_COMPILER=$(which nvcc) -DCMAKE_C_COMPILER=$(which gcc) -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_BUILD_TYPE=Release" \
conda run -n llama python -m pip install --no-cache-dir --no-build-isolation --force-reinstall "llama-cpp-python[server]==0.3.16"
```

Download the quantized GGUF file (`Q3_K_M` variant) from the QuantFactory model page: https://huggingface.co/QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF

Set these variables once (then you can copy/paste the rest of the commands):

```bash
export HF_REPO="QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF"
export MODEL="Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf"
```

```bash
mkdir -p ~/models
hf download "$HF_REPO" \
    "$MODEL" \
    --local-dir ~/models
```

Test run (Ctrl-C to stop):

```bash
python -m llama_cpp.server \
    --model "$HOME/models/$MODEL" \
    --chat_format llama-3 \
    --n_ctx 8192 \
    --n_gpu_layers -1 \
    --port 8000
```

`--n_gpu_layers -1` tells `llama.cpp` to offload **all model layers** to the GPU (full GPU inference). Without this flag the default is CPU layers (`n_gpu_layers=0`), which results in only ~1 GB of VRAM being used and much slower generation. Full offload of this 8B `Q3_K_M` model plus context buffers should occupy roughly 8–9 GB VRAM at `--n_ctx 8192` once it serves the first real requests.

If it fails to start with an out-of-memory (OOM) error you have a few mitigation options (apply one, then retry):

* Lower the context length: e.g. `--n_ctx 4096` (largest single lever; roughly linear VRAM impact for the KV cache).
* Partially offload: replace `--n_gpu_layers -1` with a number (e.g. `--n_gpu_layers 20`). Remaining layers will run on the CPU (slower, but reduces VRAM needs).
* Use a lower-bit quantization (e.g. `Q2_K`) or a smaller model.

You can inspect VRAM usage with `watch -n 2 nvidia-smi` after starting the server.

Quick note on the “KV cache”: during generation the model reuses previously computed attention Key and Value tensors (instead of recalculating them for each new token). These tensors are stored per layer and per processed token; as your prompt and conversation grow, the cache grows linearly with the number of tokens kept in context. That’s why idle VRAM (weights only) is lower (~6 GB) and rises toward the higher number (up to ~8–9 GB here) only after longer prompts / chats. Reducing `--n_ctx` caps the maximum KV cache size; clearing history or restarting frees it.
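While the test server is still running, you can verify the OpenAI-compatible API from a second SSH session. This is a minimal sanity check, assuming the default port 8000; the `model` value in the request is a placeholder, since `llama-cpp-python` serves the single model it was started with (use the id returned by `/v1/models` if you prefer to match it exactly):

```bash
# List the model the server exposes (the id is typically the model file path)
curl -s http://localhost:8000/v1/models

# Send a short chat completion request
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64
      }'
```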
If it starts without errors, create a systemd service so it restarts automatically.

> Quick option: If you prefer a single copy/paste that creates **both** the `llama` and `open-webui` systemd services at once, skip the next two manual unit file sections and jump ahead to the subsection titled "(Optional) One-liner to create both services" below. You can always come back here for the longer, step-by-step version and troubleshooting notes.

Using `sudo` to run your preferred text editor, create `/etc/systemd/system/llama.service` with the following contents:

```ini
[Unit]
Description=Llama.cpp OpenAI-compatible server
After=network.target

[Service]
User=exouser
Group=exouser
WorkingDirectory=/home/exouser
Environment=MODEL=Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf
ExecStart=/bin/bash -lc "module load nvhpc/24.7/nvhpc miniforge && conda run -n llama python -m llama_cpp.server --model $HOME/models/$MODEL --chat_format llama-3 --n_ctx 8192 --n_gpu_layers -1 --port 8000"
Restart=always

[Install]
WantedBy=multi-user.target
```

Enable and start:

```bash
sudo systemctl daemon-reload
sudo systemctl enable llama
sudo systemctl start llama
```

Troubleshooting:

* Logs: `sudo journalctl -u llama -f`
* Status: `sudo systemctl status llama`
* GPU usage: `nvidia-smi` (≈6 GB idle right after start with full offload; can grow toward ~8–9 GB under long prompts/conversations as the KV cache fills)

## Configure the chat interface

The chat interface is provided by [Open WebUI](https://openwebui.com/). Create the environment (in a new shell remember to `module load miniforge` first):

```bash
module load miniforge
conda create -y -n open-webui python=3.11
conda run -n open-webui python -m pip install open-webui
conda run -n open-webui open-webui serve --port 8080
```

If this starts with no errors, we can kill it with `Ctrl-C` and create a service for it. Using `sudo` to run your preferred text editor, create `/etc/systemd/system/webui.service` with the following contents:

```ini
[Unit]
Description=Open Web UI serving
Wants=network-online.target
After=network-online.target llama.service
Requires=llama.service
PartOf=llama.service

[Service]
User=exouser
Group=exouser
WorkingDirectory=/home/exouser
Environment=OPENAI_API_BASE_URL=http://localhost:8000/v1
Environment=OPENAI_API_KEY=local-no-key
ExecStartPre=/bin/bash -lc 'for i in {1..600}; do /usr/bin/curl -sf http://localhost:8000/v1/models >/dev/null && exit 0; sleep 1; done; echo "llama not ready" >&2; exit 1'
ExecStart=/bin/bash -lc 'source /etc/profile.d/modules.sh 2>/dev/null || true; module load miniforge; conda run -n open-webui open-webui serve --port 8080'
Restart=on-failure
RestartSec=5
TimeoutStartSec=600
Type=simple

[Install]
WantedBy=multi-user.target
```

Then enable and start the service:

```bash
sudo systemctl daemon-reload
sudo systemctl enable webui
sudo systemctl start webui
```

### (Optional) One-liner to create both services

If you already created the Conda environments (`llama` and `open-webui`) and downloaded the model, you can create, enable, and start both systemd services (model server + Open WebUI) in a single copy/paste. The unit files are the same as the ones shown above, with the adjustable values pulled from shell variables. Adjust `MODEL`, `N_CTX`, `USER`, and `NVHPC_MOD` if needed before running:

```bash
: "${MODEL:?export MODEL (model filename) first}" ; N_CTX=8192 USER=exouser NVHPC_MOD=nvhpc/24.7/nvhpc ; sudo tee /etc/systemd/system/llama.service >/dev/null <<EOF
[Unit]
Description=Llama.cpp OpenAI-compatible server
After=network.target

[Service]
User=$USER
Group=$USER
WorkingDirectory=/home/$USER
Environment=MODEL=$MODEL
ExecStart=/bin/bash -lc "module load $NVHPC_MOD miniforge && conda run -n llama python -m llama_cpp.server --model \$HOME/models/\$MODEL --chat_format llama-3 --n_ctx $N_CTX --n_gpu_layers -1 --port 8000"
Restart=always

[Install]
WantedBy=multi-user.target
EOF
sudo tee /etc/systemd/system/webui.service >/dev/null <<EOF2
[Unit]
Description=Open Web UI serving
Wants=network-online.target
After=network-online.target llama.service
Requires=llama.service
PartOf=llama.service

[Service]
User=$USER
Group=$USER
WorkingDirectory=/home/$USER
Environment=OPENAI_API_BASE_URL=http://localhost:8000/v1
Environment=OPENAI_API_KEY=local-no-key
ExecStartPre=/bin/bash -lc 'for i in {1..600}; do /usr/bin/curl -sf http://localhost:8000/v1/models >/dev/null && exit 0; sleep 1; done; echo "llama not ready" >&2; exit 1'
ExecStart=/bin/bash -lc 'source /etc/profile.d/modules.sh 2>/dev/null || true; module load miniforge; conda run -n open-webui open-webui serve --port 8080'
Restart=on-failure
RestartSec=5
TimeoutStartSec=600
Type=simple

[Install]
WantedBy=multi-user.target
EOF2
sudo systemctl daemon-reload
sudo systemctl enable --now llama webui
```

To later change the context length: edit `/etc/systemd/system/llama.service`, modify `--n_ctx`, then run:

```bash
sudo systemctl daemon-reload
sudo systemctl restart llama
```
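Before putting a web server in front of Open WebUI, it is worth confirming that both services are healthy. A quick check, assuming the ports and unit names used above:

```bash
# Both units should report "active"
systemctl is-active llama webui

# The model server and the web UI should both answer locally
curl -sf http://localhost:8000/v1/models >/dev/null && echo "llama OK"
curl -sf -o /dev/null http://localhost:8080 && echo "webui OK"

# If one of them is not coming up, follow its logs
sudo journalctl -u webui -f
```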
## Configure web server for HTTPS

Finally we can use Caddy to serve the web interface with HTTPS.

Install [Caddy](https://caddyserver.com/). Note that the version of Caddy available in the Ubuntu APT repositories is often outdated, so follow [the instructions to install Caddy on Ubuntu](https://caddyserver.com/docs/install#debian-ubuntu-raspbian). You can copy-paste all the lines at once.

Next, edit the Caddyfile so it serves the web interface. (Note that `sensible-editor` will prompt you to choose a text editor; select the number for `/bin/nano` if you aren't sure what else to pick.)

```bash
sudo sensible-editor /etc/caddy/Caddyfile
```

Replace its contents with:

```
chat.xxx000000.projects.jetstream-cloud.org {
    reverse_proxy localhost:8080
}
```

Here `chat` is the name of your instance and `xxx000000` is the allocation code. You can find the full hostname (e.g. `chat.xxx000000.projects.jetstream-cloud.org`) in Exosphere: open your instance's details page, scroll to the Credentials section, and copy the value shown under Hostname.

Then reload Caddy:

```bash
sudo systemctl reload caddy
```

## Connect the model and test the chat interface

Point your browser to `https://chat.xxx000000.projects.jetstream-cloud.org` and you should see the chat interface. Create an account, click the profile icon in the top right, then open "Admin panel" > "Settings" > "Connections".

Once you create the first account, that user becomes the admin. Anyone else who signs up is a regular user and must be approved by the admin. This approval step is the only protection in this setup; an attacker could still leverage vulnerabilities in Open WebUI to gain access. For stronger security, use `ufw` to allow connections only from your IP.

Under "OpenAI API" enter the URL `http://localhost:8000/v1` and leave the API key empty (the local llama.cpp server is unsecured by default on localhost). Click the "Verify connection" button, then click "Save" at the bottom.

Finally you can start chatting with the model!

If you change the context length (`--n_ctx`) or increase concurrent users you may approach the 10 GB limit. Reduce `--n_ctx` (e.g. to 4096) if you encounter out-of-memory errors.

## Scaling up or changing models

Want a larger model or higher quality? Options:

* Use a higher-bit quantization (`Q4_K_M` / `Q5_K_M`) for better quality (needs more VRAM).
* Move to unquantized FP16 8B (≈16 GB VRAM) on `g3.large` or bigger.
* Increase the context length (each additional 1k tokens of context adds memory usage). If you see OOM, lower `--n_ctx`.

For production workloads, consider the managed Jetstream inference service or frameworks like `vLLM` on larger GPUs for higher throughput.

## Related tutorials

These resources are contributed by the community; make sure you understand all the steps involved, and if unsure consider opening a ticket.

* [Deploy a larger LLM (70B) on Jetstream](https://www.zonca.dev/posts/2025-09-18-deploy-70b-llm-jetstream)
* [Demo of a web application used to shelve and unshelve a Jetstream instance](https://www.zonca.dev/posts/2025-09-22-openstack-unshelver-demo), which allows community members to unshelve an instance for a time even when they are not members of the allocation. This strategy conserves Jetstream2 service units for larger instances that do not need to run continuously, yet still host a community-facing service (such as an LLM specific to a single domain of science).