llama-swap
ghcr.io/mostlygeek/llama-swap:cuda
https://github.com/mostlygeek/llama-swap
cuda
CUDA GPU accelerated (NVIDIA) Extra Parameters: --gpus all
cpu
CPU only (no GPU required)
rocm
ROCm GPU accelerated (AMD) — best performance for AMD GPUs Extra Parameters: --device=/dev/kfd --device=/dev/dri --group-add video --security-opt seccomp=unconfined
vulkan
Vulkan GPU accelerated (AMD/Intel) Extra Parameters: --device=/dev/dri
bridge
false
https://forums.unraid.net/topic/197997-support-llama-swap-hot-model-swapping-for-llamacpp/
https://github.com/mostlygeek/llama-swap
https://raw.githubusercontent.com/PikkonMG/unraid-docker-templates/refs/heads/main/templates/llama-swap.xml
llama-swap is a lightweight proxy server that provides automatic model swapping for llama.cpp (llama-server).
It hot-swaps models on demand based on API requests so you can serve many GGUF models from a single endpoint without restarting.
Features: automatic model loading/unloading, macros, aliases, groups for multi-model concurrency, TTL auto-unload, streaming log viewer, and OpenAI-compatible API.
SETUP:
1. Place your GGUF model files in the Models path below.
2. Create a config.yaml (see template https://github.com/PikkonMG/unraid-docker-templates/blob/main/examples/llama-swap/example-llama-swap-config.yaml) and place it in the Config path.
3. In your config.yaml, reference models as /models/yourmodel.gguf
4. For NVIDIA GPU: install the Unraid Nvidia plugin, select the cuda tag, and set ExtraParams to: --gpus all
5. For AMD GPU: select the rocm tag and set ExtraParams to: --device /dev/kfd --device /dev/dri --group-add video --security-opt seccomp=unconfined
6. For Intel iGPU or other Vulkan-capable GPUs: select the vulkan tag and set ExtraParams to: --device /dev/dri
7. For CPU only: select the cpu tag and remove ExtraParams entirely
AI: Productivity: Tools: Other
http://[IP]:[PORT:8080]/ui
https://raw.githubusercontent.com/PikkonMG/unraid-docker-templates/main/templates/img/llama-swap.png
https://raw.githubusercontent.com/PikkonMG/unraid-docker-templates/main/docs/screenshots/llama-swap-screenshot1.png
--gpus all
-config /config/config.yaml -watch-config
8080
/mnt/user/appdata/llama-swap/models
/mnt/user/appdata/llama-swap/config
all
all