---
categories:
  - jetstream
  - llm
  - kubernetes
  - python
layout: post
date: 2024-10-31
slug: jetstream-llm-chat
title: Deploy an LLM ChatGPT-like service on Jetstream
---

In this tutorial we will deploy an LLM ChatGPT-like service on a GPU node on Jetstream. For experimentation purposes we use the smallest and cheapest GPU instance available on Jetstream, the `g3.small`, which has a virtual GPU with 8 GB of memory, and deploy the `meta-llama/Llama-3.2-1B-Instruct` model, a roughly 1.2B-parameter model. However, the same instructions can be used to deploy any other model available on the Hugging Face Hub.

This tutorial is based on work by [Tijmen de Haan](https://www2.kek.jp/qup/en/member/dehaan.html), the author of [Cosmosage](https://cosmosage.online/).

## Choose a model

Jetstream has GPU nodes with 4 NVIDIA A100 GPUs; a user can create a Virtual Machine with an entire GPU or a fraction of one. The most important requirement is the GPU memory available to load the model parameters. Jetstream provides:

| Instance Type | GPU Memory (GB) |
|---------------|-----------------|
| g3.small      | 8               |
| g3.medium     | 10              |
| g3.large      | 20              |
| g3.xl         | 40              |

So `g3.xl` is the largest available and gets an entire A100 GPU with 40 GB of memory. We need to make sure that the model we want to deploy fits in the available memory: the Llama 3.2 1B model uses about 6.5 GB, so it fits in the `g3.small` instance. We also need to make sure the model has been fine-tuned to respond to text prompts; generally those models are marked as `Instruct`.

## Create a Jetstream instance

Log in to Exosphere, request an Ubuntu 24 `g3.small` instance, name it `chat`, and SSH into it using either the SSH key or the passphrase generated by Exosphere.

## Install Miniforge

We will use Miniforge to create two separate Python environments, one for serving the Hugging Face model and one for the web interface.

```bash
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh
```

## Configure `vllm` to serve the model

Create the environment:

```bash
conda create -y -n vllm python=3.11
conda activate vllm
pip install vllm
```

As we are using a Llama model, we need specific authorization: log in to Hugging Face and request access to the model on [the model page](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct).

Next we can serve the model:

```bash
huggingface-cli login
vllm serve "meta-llama/Llama-3.2-1B-Instruct" --max-model-len=8192
```

If this starts with no error, we can kill it with `Ctrl-C` and create a service for it.
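While the test server above is still running (or later, once the systemd service below is active), you can sanity-check it from a second terminal. `vllm` exposes an OpenAI-compatible API, by default on port 8000; this is just an optional smoke test, not a required step:

```bash
# List the models the server is exposing
curl http://localhost:8000/v1/models

# Send a minimal chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 50
      }'
```

The same base URL (`http://localhost:8000/v1`) is the one we will later configure in the chat interface.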
Create a file `/etc/systemd/system/vllm.service` with the following content:

```ini
[Unit]
Description=VLLM model serving
After=network.target

[Service]
User=exouser
Group=exouser
WorkingDirectory=/home/exouser
# Activate the conda environment and start the service
ExecStart=/bin/bash -c "source /home/exouser/miniforge3/etc/profile.d/conda.sh && conda activate vllm && vllm serve 'meta-llama/Llama-3.2-1B-Instruct' --max-model-len=8192 --enforce-eager"
Restart=always
Environment=PATH=/home/exouser/miniforge3/envs/vllm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

[Install]
WantedBy=multi-user.target
```

Then enable and start the service:

```bash
sudo systemctl enable vllm
sudo systemctl start vllm
```

In case of errors:

* Check the logs with `sudo journalctl -u vllm`
* Check the status with `sudo systemctl status vllm`

You can also check the GPU usage with `nvidia-smi`:

```
Thu Oct 31 16:51:09 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GRID A100X-8C                  On  | 00000000:04:00.0 Off |                    0 |
| N/A   N/A   P0              N/A /  N/A  |   6400MiB /  8192MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     53332      C   ...miniforge3/envs/vllm/bin/python3.11     6399MiB |
+---------------------------------------------------------------------------------------+
```

## Configure the chat interface

The chat interface is provided by [Open WebUI](https://openwebui.com/).

Create the environment:

```bash
conda create -y -n open-webui python=3.11
conda activate open-webui
pip install open-webui
open-webui serve
```

If this starts with no error, we can kill it with `Ctrl-C` and create a service for it.

Create a file `/etc/systemd/system/webui.service` with the following content:

```ini
[Unit]
Description=Open WebUI serving
After=network.target

[Service]
User=exouser
Group=exouser
WorkingDirectory=/home/exouser
# Activate the conda environment and start the service
ExecStart=/bin/bash -c "source /home/exouser/miniforge3/etc/profile.d/conda.sh && conda activate open-webui && open-webui serve"
Restart=always
Environment=PATH=/home/exouser/miniforge3/envs/open-webui/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

[Install]
WantedBy=multi-user.target
```

Then enable and start the service:

```bash
sudo systemctl enable webui
sudo systemctl start webui
```

## Configure web server for HTTPS

Finally we can use Caddy to serve the web interface with HTTPS. Follow the instructions to [install Caddy](https://caddyserver.com/docs/install#debian-ubuntu-raspbian).

Modify the Caddyfile to serve the web interface:

```bash
sudo vim /etc/caddy/Caddyfile
```

to:

```
chat.xxx000000.projects.jetstream-cloud.org {
    reverse_proxy localhost:8080
}
```

Where `xxx000000` is the allocation code of your Jetstream instance.
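Optionally, you can check the configuration for syntax errors before reloading. This assumes the `caddy` binary installed from the Debian/Ubuntu repository is on your `PATH`:

```bash
# Validate the Caddyfile without applying it
caddy validate --config /etc/caddy/Caddyfile --adapter caddyfile
```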
Then reload Caddy:

```bash
sudo systemctl reload caddy
```

## Connect the model and test the chat interface

Point your browser to `https://chat.xxx000000.projects.jetstream-cloud.org` and you should see the chat interface.

Create an account, click on the profile icon in the top right and enter the "Admin panel" section, then open "Settings" and "Connections". Under "OpenAI API" enter the URL `http://localhost:8000/v1` and leave the API key empty. Click the "Verify connection" button, then "Save" at the bottom.

Finally you can start chatting with the model!

## Use a larger model using quantization

The weights of LLMs can be quantized to a lower precision to reduce the GPU memory required to run them; larger models with quantization often outperform smaller models without quantization. Hugging Face hosts many quantized models. The most popular are GGUF models, but `vllm` has only experimental support for that format, so it is better to search explicitly for a model quantized for `vllm`, for example [`hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4`](https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4).

First stop the service and test the new model manually:

```bash
sudo systemctl stop vllm
vllm serve "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4" --max-model-len=4096 --gpu-memory-utilization=1 --enforce-eager
```

Then modify the systemd service, replacing the relevant line with:

```ini
ExecStart=/bin/bash -c "source /home/exouser/miniforge3/etc/profile.d/conda.sh && conda activate vllm && vllm serve 'hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4' --max-model-len=4096 --gpu-memory-utilization=1 --enforce-eager"
```

Then restart the service:

```bash
sudo systemctl daemon-reload
sudo systemctl start vllm
```

Check `nvidia-smi`: memory consumption should be about 7.5 GB. Unfortunately we also had to decrease `max-model-len` to fit in such a small GPU, so the model only supports a 4096-token context; it would be best to deploy this model on a slightly larger Virtual Machine and increase the number of tokens.
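To confirm which model the service is now exposing, you can query the same OpenAI-compatible endpoint used in the earlier smoke test (assuming the default port 8000):

```bash
# The "id" field in the response should now show the AWQ-quantized model
curl http://localhost:8000/v1/models
```

As a rough back-of-the-envelope check, an 8B-parameter model quantized to 4 bits needs about 4 GB just for the weights; most of the remaining memory reported by `nvidia-smi` is the KV cache that `vllm` pre-allocates for the 4096-token context.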