---
categories:
  - jetstream
  - llm
layout: post
date: '2026-03-11'
slug: jetstream-llm-llama-cpp-review
title: Deploy a ChatGPT-like LLM on Jetstream with llama.cpp, tested on g3.medium
---

This is a tested follow-up and updated standalone version of [Deploy a ChatGPT-like LLM on Jetstream with llama.cpp](./2025-09-30-jetstream-llm-llama-cpp.md). I ran the deployment end to end on a fresh Jetstream Ubuntu 24 `g3.medium` instance and folded the corrections, runtime notes, and performance measurements into the walkthrough below, so you do not need to jump back and forth between posts.

If you want the original September 2025 version for reference, see: [Deploy a ChatGPT-like LLM on Jetstream with llama.cpp](./2025-09-30-jetstream-llm-llama-cpp.md)

In this tutorial we deploy a Large Language Model (LLM) on Jetstream, run inference locally on the smallest currently available GPU node (`g3.medium`, 10 GB VRAM), then install a web chat interface (Open WebUI) and serve it over HTTPS with Caddy.

Before spinning up your own GPU, consider the managed [Jetstream LLM inference service](https://docs.jetstream-cloud.org/inference-service/overview/). It may be more cost- and time-effective if you just need API access to standard models.

We will deploy a single quantized model: **`Meta Llama 3.1 8B Instruct Q3_K_M` (GGUF)**. This quantized 8B model fits comfortably on the tested `g3.medium` instance.

## Model choice & sizing

Jetstream GPU flavors (current key options):

| Instance Type | Approx. GPU Memory (GB) |
|---------------|-------------------------|
| g3.medium     | 10                      |
| g3.large      | 20                      |
| g3.xl         | 40 (full A100)          |

We pick the quantized `Llama 3.1 8B Instruct Q3_K_M` variant (GGUF format). Its VRAM residency during inference is about 8 GB with default context settings, leaving some margin on `g3.medium`. Always keep a couple of GB free to avoid out-of-memory (OOM) errors when increasing context length or concurrency.
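As a rough sanity check on sizing (a sketch only: real VRAM use adds the KV cache and CUDA runtime buffers on top of the weights), you can estimate the weight footprint from the parameter count and the quantization's average bits per weight. The ~3.9 bits/weight figure for `Q3_K_M` below is an approximation, not an official number:

```bash
# Approximate quantized weight size:
#   params (billions) * avg bits per weight / 8 bits per byte = GB
# 8B parameters at ~3.9 bits/weight (rough average for Q3_K_M):
awk 'BEGIN { printf "%.1f GB of weights\n", 8.0 * 3.9 / 8 }'
# prints: 3.9 GB of weights
```

That lines up with the ~4 GB GGUF file; the difference up to the observed idle VRAM is runtime overhead, and the KV cache grows on top of that with context.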
Ensure the model is an Instruct fine-tuned variant so it responds well to chat prompts.

## Create a Jetstream instance

Log in to Exosphere, request an Ubuntu 24 **`g3.medium`** instance (name it `chat`) and SSH into it using either your SSH key or the passphrase generated by Exosphere.

You will need the public hostname later for HTTPS, for example `chat.xxx000000.projects.jetstream-cloud.org`. Copy it from the instance details page in Exosphere under Credentials > Hostname. Do not rely on `hostname -f` on the VM, which may return only the internal hostname.

## Load Miniforge

A centrally provided Miniforge module is available on Jetstream images. First initialize Lmod, then load Miniforge (repeat this in each new shell) and create the two Conda environments used below (one for the model server, one for the web UI).

```bash
source /etc/profile.d/lmod.sh
module load miniforge
conda init
```

> After running `conda init`, reload your shell so `conda` is available: run `exec bash -l` (this avoids logging out and back in).
>
> In non-interactive shells (for example over SSH in a script, inside `nohup`, or under `systemd`), prefer `conda run -n ENV ...` over `conda activate ENV`. The latter depends on shell initialization and is less reliable outside an interactive login shell.

## Serve the model with `llama.cpp` (OpenAI-compatible server)

We use `llama.cpp` via the `llama-cpp-python` package, which provides an OpenAI-style HTTP API (default port 8000) that Open WebUI can connect to.

Create an environment and install. Remember to initialize Lmod and then `module load miniforge` first in any new shell. The last `pip install` step may take several minutes to compile llama.cpp from source, so be patient.
```bash
source /etc/profile.d/lmod.sh
module load miniforge
conda create -y -n llama python=3.11
conda activate llama
conda install -y cmake ninja scikit-build-core huggingface_hub
module load nvhpc/24.7/nvhpc
# Enable CUDA acceleration with explicit compilers, arch, release build
CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_COMPILER=$(which nvcc) -DCMAKE_C_COMPILER=$(which gcc) -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_BUILD_TYPE=Release" \
  conda run -n llama python -m pip install --no-cache-dir --no-build-isolation --force-reinstall "llama-cpp-python[server]==0.3.16"
```

As of March 11, 2026, `0.3.16` was the newest published `llama-cpp-python` release.

Download the quantized GGUF file (`Q3_K_M` variant) from the QuantFactory model page. Set these variables once:

```bash
export HF_REPO="QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF"
export MODEL="Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf"
```

```bash
mkdir -p ~/models
hf download "$HF_REPO" \
  "$MODEL" \
  --local-dir ~/models
```

If you are not authenticated to Hugging Face, downloads still work for this model but may be subject to lower rate limits.

Test run (Ctrl-C to stop):

```bash
source /etc/profile.d/lmod.sh
module load miniforge nvhpc/24.7/nvhpc
conda activate llama
python -m llama_cpp.server \
  --model "$HOME/models/$MODEL" \
  --chat_format llama-3 \
  --n_ctx 8192 \
  --n_gpu_layers -1 \
  --port 8000
```

Note that the `nvhpc/24.7/nvhpc` module must still be loaded at runtime, not only during the build. Without it, `llama_cpp` may fail to import with an error like `libcudart.so.12: cannot open shared object file`.

`--n_gpu_layers -1` tells `llama.cpp` to offload **all model layers** to the GPU (full GPU inference). Without this flag the default is CPU-only (`n_gpu_layers=0`), which results in only ~1 GB of VRAM being used and much slower generation. Full offload of this 8B `Q3_K_M` model plus context buffers should occupy roughly 8–9 GB of VRAM at `--n_ctx 8192` on the first real requests.

If the server fails to start with an out-of-memory (OOM) error, you have a few mitigation options (apply one, then retry):

* Lower the context length: e.g. `--n_ctx 4096` (the largest single lever; roughly linear VRAM impact for the KV cache).
* Offload partially: replace `--n_gpu_layers -1` with a number (e.g. `--n_gpu_layers 20`). The remaining layers run on the CPU (slower, but reduces VRAM needs).
* Use a lower-bit quantization (e.g. `Q2_K`) or a smaller model.

You can inspect VRAM usage with:

```bash
watch -n 2 nvidia-smi
```

A quick note on the "KV cache": during generation the model reuses previously computed attention Key and Value tensors instead of recalculating them for each new token. These tensors are stored per layer and per processed token, so as your prompt and conversation grow, the cache grows linearly with the number of tokens kept in context. That is why idle VRAM (roughly weights only, ~6 GB) rises toward the higher number only after longer prompts or chats. Reducing `--n_ctx` caps the maximum KV cache size; clearing history or restarting frees it.

On the tested `g3.medium` setup (`Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf`, full GPU offload, `--n_ctx 8192`), short local chat-completion requests generated about **85–90 tokens/second** after warm-up. In practice that feels very fast for interactive chat, closer to the "instant response" experience users expect from a non-reasoning assistant than to a slow step-by-step model. Treat it as an approximate reference point, not a guarantee: longer prompts, larger context, concurrent users, or different quantizations will reduce throughput.

If the test run works, create a `systemd` service so it restarts automatically.
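Before wiring the server into `systemd`, you can smoke-test the OpenAI-style API from a second shell while the foreground test run is still up. This is a sketch: the prompt and `max_tokens` value are arbitrary, and since this local server loads a single model, no API key or `model` field should be needed:

```bash
# List the loaded model (server from the test run on port 8000):
curl -s http://localhost:8000/v1/models

# Request a short chat completion via the OpenAI chat-completions schema:
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64
      }'
```

A JSON response with a `choices[0].message.content` field indicates end-to-end generation works.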
Using `sudo` to run your preferred text editor, create `/etc/systemd/system/llama.service` with the following contents:

```ini
[Unit]
Description=Llama.cpp OpenAI-compatible server
After=network.target

[Service]
User=exouser
Group=exouser
WorkingDirectory=/home/exouser
Environment=MODEL=Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf
ExecStart=/bin/bash -lc "source /etc/profile.d/lmod.sh; module load nvhpc/24.7/nvhpc miniforge && conda run -n llama python -m llama_cpp.server --model $HOME/models/$MODEL --chat_format llama-3 --n_ctx 8192 --n_gpu_layers -1 --port 8000"
Restart=always

[Install]
WantedBy=multi-user.target
```

Enable and start:

```bash
sudo systemctl daemon-reload
sudo systemctl enable llama
sudo systemctl start llama
```

Troubleshooting:

* Logs: `sudo journalctl -u llama -f`
* Status: `sudo systemctl status llama`
* GPU usage: `nvidia-smi`

## Configure the chat interface

The chat interface is provided by [Open WebUI](https://openwebui.com/).

Create the environment. In a new shell, remember to initialize Lmod and then `module load miniforge` first. The `open-webui` install pulls in a very large dependency set and can take a while even on a fast connection, so expect this step to be noticeably slower than the earlier `llama-cpp-python` install.

```bash
source /etc/profile.d/lmod.sh
module load miniforge
conda create -y -n open-webui python=3.11
conda run -n open-webui python -m pip install open-webui
conda run -n open-webui open-webui serve --port 8080
```

If this starts with no error, kill it with `Ctrl-C` and create a service for it.
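While the foreground `open-webui serve` process is still running, you can confirm from another shell on the VM that it answers on port 8080 before moving on (a quick check; `-I` fetches only the response headers):

```bash
# First line of the HTTP response headers from the locally running Open WebUI:
curl -sI http://localhost:8080 | head -n 1
```

Any `HTTP/1.1 ...` status line means the UI is up and ready to be put behind a service and a reverse proxy.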
Using `sudo` to run your preferred text editor, create `/etc/systemd/system/webui.service` with the following contents:

```ini
[Unit]
Description=Open Web UI serving
Wants=network-online.target
After=network-online.target llama.service
Requires=llama.service
PartOf=llama.service

[Service]
User=exouser
Group=exouser
WorkingDirectory=/home/exouser
Environment=OPENAI_API_BASE_URL=http://localhost:8000/v1
Environment=OPENAI_API_KEY=local-no-key
ExecStartPre=/bin/bash -lc 'for i in {1..600}; do /usr/bin/curl -sf http://localhost:8000/v1/models >/dev/null && exit 0; sleep 1; done; echo "llama not ready" >&2; exit 1'
ExecStart=/bin/bash -lc 'source /etc/profile.d/lmod.sh; module load miniforge; conda run -n open-webui open-webui serve --port 8080'
Restart=on-failure
RestartSec=5
TimeoutStartSec=600
Type=simple

[Install]
WantedBy=multi-user.target
```

Then enable and start:

```bash
sudo systemctl daemon-reload
sudo systemctl enable webui
sudo systemctl start webui
```

### Optional one-liner to create both services

If you already created the Conda environments (`llama` and `open-webui`) and downloaded the model, you can create, enable, and start both `systemd` services in a single copy-paste.
Adjust `MODEL`, `N_CTX`, `USER`, and `NVHPC_MOD` if needed before running: ```bash : "${MODEL:?export MODEL (model filename) first}" ; N_CTX=8192 USER=exouser NVHPC_MOD=nvhpc/24.7/nvhpc ; sudo tee /etc/systemd/system/llama.service >/dev/null </dev/null </dev/null && exit 0; sleep 1; done; echo "llama not ready" >&2; exit 1' ExecStart=/bin/bash -lc 'source /etc/profile.d/lmod.sh; module load miniforge; conda run -n open-webui open-webui serve --port 8080' Restart=on-failure RestartSec=5 TimeoutStartSec=600 Type=simple [Install] WantedBy=multi-user.target EOF2 ``` To later change context length, edit `/etc/systemd/system/llama.service`, modify `--n_ctx`, then run: ```bash sudo systemctl daemon-reload sudo systemctl restart llama ``` ## Configure web server for HTTPS Finally we can use Caddy to serve the web interface with HTTPS. Install [Caddy](https://caddyserver.com/). The version in the Ubuntu APT repositories is often outdated, so follow the [official Ubuntu installation instructions](https://caddyserver.com/docs/install#debian-ubuntu-raspbian). You can copy-paste all the lines at once. Modify the Caddyfile: ```bash sudo sensible-editor /etc/caddy/Caddyfile ``` to: ```text chat.xxx000000.projects.jetstream-cloud.org { reverse_proxy localhost:8080 } ``` Where `chat` is the instance name and `xxx000000` is the allocation code. You can find the full hostname in Exosphere: open the instance details page, scroll to Credentials, and copy Hostname. Then reload Caddy: ```bash sudo systemctl reload caddy ``` ## Connect the model and test the chat interface Point your browser to `https://chat.xxx000000.projects.jetstream-cloud.org` and you should see the chat interface. Create an account, click the profile icon in the top right, then open `Admin panel > Settings > Connections`. Once you create the first account, that user becomes the admin. Anyone else who signs up is a regular user and must be approved by the admin. 
This approval step is the only protection in this setup; an attacker could still exploit vulnerabilities in Open WebUI to gain access. For stronger security, use firewall rules to allow connections only from trusted IPs.

Under `OpenAI API` enter the URL `http://localhost:8000/v1` and leave the API key empty. Click `Verify connection`, then click `Save`. Finally you can start chatting with the model.

## What I verified on the tested deployment

On the tested `g3.medium` VM I was able to:

* build `llama-cpp-python==0.3.16` with CUDA support
* download `Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf`
* serve the model locally with `llama_cpp.server`
* install and launch Open WebUI
* expose the chat interface publicly with Caddy and HTTPS

The tested public URL was:

* `https://chat.cis230085.projects.jetstream-cloud.org`

## Related posts

Original September 2025 version of this tutorial: [Deploy a ChatGPT-like LLM on Jetstream with llama.cpp](./2025-09-30-jetstream-llm-llama-cpp.md)