---
categories:
- jetstream
- llm
layout: post
date: 2025-09-30
slug: jetstream-llm-llama-cpp
title: Deploy a ChatGPT-like LLM on Jetstream with llama.cpp
---

Tutorial last updated in December 2025

This is a crosspost of the official Jetstream documentation: [Deploy a ChatGPT-like LLM service on Jetstream](https://docs.jetstream-cloud.org/general/llm/).

I built a brand new version of that tutorial that replaces `vLLM` with `llama.cpp`, so we can run GGUF quantized models on Jetstream's GPUs without giving up speed or context length.

In this tutorial we deploy a Large Language Model (LLM) on Jetstream, run inference locally on the smallest currently available GPU node (`g3.medium`, 10 GB VRAM), then install a web chat interface (Open WebUI) and serve it with HTTPS using Caddy.

Before spinning up your own GPU, consider the managed [Jetstream LLM inference service](https://docs.jetstream-cloud.org/inference-service/overview/). It may be more cost- and time-effective if you just need API access to standard models.

We will deploy a single (quantized) model: **`Meta Llama 3.1 8B Instruct Q3_K_M` (GGUF)**. This quantized 8B model fits comfortably in ~8 GB of GPU memory, so it runs on a `g3.medium` (10 GB) with a little headroom. (The older `g3.small` flavor has been retired.) If you later choose a different quantization or a larger context length, or move to an unquantized 8B / 70B model, you'll need a larger flavor; adjust accordingly.

This tutorial is adapted from work by [Tijmen de Haan](https://www2.kek.jp/qup/en/member/dehaan.html), the author of [Cosmosage](https://cosmosage.online/).

## Model choice & sizing

Jetstream GPU flavors (current key options):

| Instance Type | Approx. GPU Memory (GB) |
|---------------|-------------------------|
| g3.medium     | 10                      |
| g3.large      | 20                      |
| g3.xl         | 40 (full A100)          |

We pick the quantized `Llama 3.1 8B Instruct Q3_K_M` variant (GGUF format). Its VRAM residency during inference is about 8 GB with default context settings, leaving some margin on `g3.medium`. Always keep a couple of GB free to avoid OOM errors when increasing context length or concurrency. Ensure the model is an Instruct fine-tuned variant (it is) so it responds well to chat prompts.

## Create a Jetstream instance

Log in to Exosphere, request an Ubuntu 24 **`g3.medium`** instance (name it `chat`) and SSH into it using either your SSH key or the passphrase generated by Exosphere.

## Load Miniforge

A centrally provided Miniforge module is available on Jetstream images. Load it (in each new shell) and then create the two Conda environments used below (one for the model server, one for the web UI).

```bash
module load miniforge
conda init
```

> After running `conda init`, reload your shell so `conda` is available: run `exec bash -l` (this avoids logging out and back in).
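If you want to confirm that the reloaded shell picked up Conda before continuing, here is a quick optional check (generic `conda` commands, nothing specific to this tutorial):

```bash
# conda should now be on PATH; only the "base" environment exists at this point,
# the "llama" and "open-webui" environments are created in the next sections
conda --version
conda env list
```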
## Serve the model with `llama.cpp` (OpenAI-compatible server)

We use `llama.cpp` via the `llama-cpp-python` package, which provides an OpenAI-style HTTP API (default port 8000) that Open WebUI can connect to.

Create an environment and install (remember to `module load miniforge` first in any new shell). The last `pip install` step may take several minutes to compile llama.cpp from source, so please be patient.

```bash
conda create -y -n llama python=3.11
conda activate llama
conda install -y cmake ninja scikit-build-core huggingface_hub
module load nvhpc/24.7/nvhpc
# Enable CUDA acceleration with explicit compilers, arch, release build
CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_COMPILER=$(which nvcc) -DCMAKE_C_COMPILER=$(which gcc) -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_BUILD_TYPE=Release" \
conda run -n llama python -m pip install --no-cache-dir --no-build-isolation --force-reinstall "llama-cpp-python[server]==0.3.16"
```

Download the quantized GGUF file (`Q3_K_M` variant) from the QuantFactory model page: https://huggingface.co/QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF

Set these variables once (then you can copy/paste the rest of the commands):

```bash
export HF_REPO="QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF"
export MODEL="Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf"
```

```bash
mkdir -p ~/models
hf download "$HF_REPO" \
    "$MODEL" \
    --local-dir ~/models
```

Test run (Ctrl-C to stop):

```bash
python -m llama_cpp.server \
    --model "$HOME/models/$MODEL" \
    --chat_format llama-3 \
    --n_ctx 8192 \
    --n_gpu_layers -1 \
    --port 8000
```

`--n_gpu_layers -1` tells `llama.cpp` to offload **all model layers** to the GPU (full GPU inference). Without this flag the default is CPU layers (`n_gpu_layers=0`), which results in only ~1 GB of VRAM being used and much slower generation. Full offload of this 8B `Q3_K_M` model plus context buffers should occupy roughly 8–9 GB VRAM at `--n_ctx 8192` once it serves the first real requests.

If it fails to start with an out-of-memory (OOM) error you have a few mitigation options (apply one, then retry):

* Lower the context length: e.g. `--n_ctx 4096` (largest single lever; roughly linear VRAM impact for the KV cache).
* Partially offload: replace `--n_gpu_layers -1` with a number (e.g. `--n_gpu_layers 20`). Remaining layers will run on the CPU (slower, but reduces VRAM needs).
* Use a lower-bit quantization (e.g. `Q2_K`) or a smaller model.

You can inspect VRAM usage with `watch -n 2 nvidia-smi` after starting the server.

Quick note on the “KV cache”: during generation the model reuses previously computed attention Key and Value tensors (instead of recalculating them for each new token). These tensors are stored per layer and per processed token; as your prompt and conversation grow, the cache grows linearly with the number of tokens kept in context. That’s why idle VRAM (weights only) is lower (~6 GB) and rises toward the higher number (up to ~8–9 GB here) only after longer prompts / chats. Reducing `--n_ctx` caps the maximum KV cache size; clearing history or restarting frees it.
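While the test server is still running, you can verify the OpenAI-compatible API from a second SSH session. This is a minimal sanity check, assuming the default port 8000; the `model` value in the request is a placeholder, since `llama-cpp-python` serves the single model it was started with (use the id returned by `/v1/models` if you prefer to match it exactly):

```bash
# List the model the server exposes (the id is typically the model file path)
curl -s http://localhost:8000/v1/models

# Send a short chat completion request
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64
      }'
```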
If it starts without errors, create a systemd service so it restarts automatically.

> Quick option: If you prefer a single copy/paste that creates **both** the `llama` and `open-webui` systemd services at once, skip the next two manual unit file sections and jump ahead to the subsection titled "(Optional) One-liner to create both services" below. You can always come back here for the longer, step-by-step version and troubleshooting notes.

Using `sudo` to run your preferred text editor, create `/etc/systemd/system/llama.service` with the following contents:

```ini
[Unit]
Description=Llama.cpp OpenAI-compatible server
After=network.target

[Service]
User=exouser
Group=exouser
WorkingDirectory=/home/exouser
Environment=MODEL=Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf
ExecStart=/bin/bash -lc "module load nvhpc/24.7/nvhpc miniforge && conda run -n llama python -m llama_cpp.server --model $HOME/models/$MODEL --chat_format llama-3 --n_ctx 8192 --n_gpu_layers -1 --port 8000"
Restart=always

[Install]
WantedBy=multi-user.target
```

Enable and start:

```bash
sudo systemctl daemon-reload
sudo systemctl enable llama
sudo systemctl start llama
```

Troubleshooting:

* Logs: `sudo journalctl -u llama -f`
* Status: `sudo systemctl status llama`
* GPU usage: `nvidia-smi` (≈6 GB idle right after start with full offload; can grow toward ~8–9 GB under long prompts/conversations as the KV cache fills)

## Configure the chat interface

The chat interface is provided by [Open WebUI](https://openwebui.com/). Create the environment (in a new shell remember to `module load miniforge` first):

```bash
module load miniforge
conda create -y -n open-webui python=3.11
conda run -n open-webui python -m pip install open-webui
conda run -n open-webui open-webui serve --port 8080
```

If this starts with no errors, we can kill it with `Ctrl-C` and create a service for it. Using `sudo` to run your preferred text editor, create `/etc/systemd/system/webui.service` with the following contents:

```ini
[Unit]
Description=Open Web UI serving
Wants=network-online.target
After=network-online.target llama.service
Requires=llama.service
PartOf=llama.service

[Service]
User=exouser
Group=exouser
WorkingDirectory=/home/exouser
Environment=OPENAI_API_BASE_URL=http://localhost:8000/v1
Environment=OPENAI_API_KEY=local-no-key
ExecStartPre=/bin/bash -lc 'for i in {1..600}; do /usr/bin/curl -sf http://localhost:8000/v1/models >/dev/null && exit 0; sleep 1; done; echo "llama not ready" >&2; exit 1'
ExecStart=/bin/bash -lc 'source /etc/profile.d/modules.sh 2>/dev/null || true; module load miniforge; conda run -n open-webui open-webui serve --port 8080'
Restart=on-failure
RestartSec=5
TimeoutStartSec=600
Type=simple

[Install]
WantedBy=multi-user.target
```

Then enable and start the service:

```bash
sudo systemctl daemon-reload
sudo systemctl enable webui
sudo systemctl start webui
```

### (Optional) One-liner to create both services

If you already created the Conda environments (`llama` and `open-webui`) and downloaded the model, you can create, enable, and start both systemd services (model server + Open WebUI) in a single copy/paste. The unit files are the same as the ones shown above, with the adjustable values pulled from shell variables. Adjust `MODEL`, `N_CTX`, `USER`, and `NVHPC_MOD` if needed before running:

```bash
: "${MODEL:?export MODEL (model filename) first}" ; N_CTX=8192 USER=exouser NVHPC_MOD=nvhpc/24.7/nvhpc ; sudo tee /etc/systemd/system/llama.service >/dev/null <<EOF
[Unit]
Description=Llama.cpp OpenAI-compatible server
After=network.target

[Service]
User=$USER
Group=$USER
WorkingDirectory=/home/$USER
Environment=MODEL=$MODEL
ExecStart=/bin/bash -lc "module load $NVHPC_MOD miniforge && conda run -n llama python -m llama_cpp.server --model \$HOME/models/\$MODEL --chat_format llama-3 --n_ctx $N_CTX --n_gpu_layers -1 --port 8000"
Restart=always

[Install]
WantedBy=multi-user.target
EOF
sudo tee /etc/systemd/system/webui.service >/dev/null <<EOF2
[Unit]
Description=Open Web UI serving
Wants=network-online.target
After=network-online.target llama.service
Requires=llama.service
PartOf=llama.service

[Service]
User=$USER
Group=$USER
WorkingDirectory=/home/$USER
Environment=OPENAI_API_BASE_URL=http://localhost:8000/v1
Environment=OPENAI_API_KEY=local-no-key
ExecStartPre=/bin/bash -lc 'for i in {1..600}; do /usr/bin/curl -sf http://localhost:8000/v1/models >/dev/null && exit 0; sleep 1; done; echo "llama not ready" >&2; exit 1'
ExecStart=/bin/bash -lc 'source /etc/profile.d/modules.sh 2>/dev/null || true; module load miniforge; conda run -n open-webui open-webui serve --port 8080'
Restart=on-failure
RestartSec=5
TimeoutStartSec=600
Type=simple

[Install]
WantedBy=multi-user.target
EOF2
sudo systemctl daemon-reload
sudo systemctl enable --now llama webui
```

To later change the context length: edit `/etc/systemd/system/llama.service`, modify `--n_ctx`, then run:

```bash
sudo systemctl daemon-reload
sudo systemctl restart llama
```
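Before putting a web server in front of Open WebUI, it is worth confirming that both services are healthy. A quick check, assuming the ports and unit names used above:

```bash
# Both units should report "active"
systemctl is-active llama webui

# The model server and the web UI should both answer locally
curl -sf http://localhost:8000/v1/models >/dev/null && echo "llama OK"
curl -sf -o /dev/null http://localhost:8080 && echo "webui OK"

# If one of them is not coming up, follow its logs
sudo journalctl -u webui -f
```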
## Configure web server for HTTPS

Finally we can use Caddy to serve the web interface with HTTPS.

Install [Caddy](https://caddyserver.com/). Note that the version of Caddy available in the Ubuntu APT repositories is often outdated, so follow [the instructions to install Caddy on Ubuntu](https://caddyserver.com/docs/install#debian-ubuntu-raspbian). You can copy-paste all the lines at once.

Next, edit the Caddyfile so it serves the web interface. (Note that `sensible-editor` will prompt you to choose a text editor; select the number for `/bin/nano` if you aren't sure what else to pick.)

```bash
sudo sensible-editor /etc/caddy/Caddyfile
```

Replace its contents with:

```
chat.xxx000000.projects.jetstream-cloud.org {
    reverse_proxy localhost:8080
}
```

Here `chat` is the name of your instance and `xxx000000` is the allocation code. You can find the full hostname (e.g. `chat.xxx000000.projects.jetstream-cloud.org`) in Exosphere: open your instance's details page, scroll to the Credentials section, and copy the value shown under Hostname.

Then reload Caddy:

```bash
sudo systemctl reload caddy
```

## Connect the model and test the chat interface

Point your browser to `https://chat.xxx000000.projects.jetstream-cloud.org` and you should see the chat interface. Create an account, click the profile icon in the top right, then open "Admin panel" > "Settings" > "Connections".

Once you create the first account, that user becomes the admin. Anyone else who signs up is a regular user and must be approved by the admin. This approval step is the only protection in this setup; an attacker could still leverage vulnerabilities in Open WebUI to gain access. For stronger security, use `ufw` to allow connections only from your IP.

Under "OpenAI API" enter the URL `http://localhost:8000/v1` and leave the API key empty (the local llama.cpp server is unsecured by default on localhost). Click the "Verify connection" button, then click "Save" at the bottom.

Finally you can start chatting with the model!

If you change the context length (`--n_ctx`) or increase concurrent users you may approach the 10 GB limit. Reduce `--n_ctx` (e.g. to 4096) if you encounter out-of-memory errors.

## Scaling up or changing models

Want a larger model or higher quality? Options:

* Use a higher-bit quantization (`Q4_K_M` / `Q5_K_M`) for better quality (needs more VRAM).
* Move to unquantized FP16 8B (≈16 GB VRAM) on `g3.large` or bigger.
* Increase the context length (each additional 1k tokens of context adds memory usage). If you see OOM, lower `--n_ctx`.

For production workloads, consider the managed Jetstream inference service or frameworks like `vLLM` on larger GPUs for higher throughput.

## Related tutorials

These resources are contributed by the community; make sure you understand all the steps involved, and if unsure consider opening a ticket.

* [Deploy a larger LLM (70B) on Jetstream](https://www.zonca.dev/posts/2025-09-18-deploy-70b-llm-jetstream)
* [Demo of a web application used to shelve and unshelve a Jetstream instance](https://www.zonca.dev/posts/2025-09-22-openstack-unshelver-demo), which allows community members to unshelve an instance for a time even when they are not members of the allocation. This strategy conserves Jetstream2 service units for larger instances that do not need to run continuously, yet still host a community-facing service (such as an LLM specific to a single domain of science).