---
name: vllm-deploy-simple
description: Quick install and deploy vLLM, start serving with a simple LLM, and test the OpenAI API.
---

# vLLM Simple Deployment

A simple skill to quickly install vLLM, start a server, and validate the OpenAI-compatible API.

## What this skill does

This skill provides a streamlined workflow to:

- Detect the hardware backend (NVIDIA CUDA, AMD ROCm, Google TPU, or CPU)
- Install vLLM with the appropriate backend support
- Start the vLLM server with a configurable model and port
- Test the OpenAI-compatible API endpoint
- Validate that the deployment is working correctly
- Support virtual environment isolation

## Prerequisites

- Python 3.10+
- GPU (NVIDIA CUDA or AMD ROCm) recommended; TPU or CPU also supported
- pip or uv package manager
- curl (for API testing)
- Virtual environment (optional but recommended)

## Usage

### Create a venv

If the user did not specify a venv path, or asked to deploy in the current environment, create a venv with uv using Python 3.12 in the current folder. If uv is not found, create a folder at this path and use Python to create a virtual environment (see the sketch after the workflow list below).

### Run the complete workflow (suggested)

If the user did not specify a venv path, model, or port, use the default options:

```bash
# Default deployment options (--venv "." --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8)
scripts/quickstart.sh
```

Or with custom options:

```bash
# Use a custom virtual environment
scripts/quickstart.sh --venv /path/to/venv

# Use a custom model and port
scripts/quickstart.sh --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000

# Use a custom GPU memory utilization
scripts/quickstart.sh --gpu_memory_utilization 0.6

# Combine all options
scripts/quickstart.sh --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8
```

This will:

1. Activate the virtual environment (if specified)
2. Detect the hardware backend (CUDA/ROCm/TPU/CPU)
3. Install vLLM with the appropriate backend support
4. Start the vLLM server in the background
5. Wait for the server to be ready
6. Test the API with a sample request
7. Display the server status
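The venv-creation fallback described under "Create a venv" can be sketched as follows. This is a minimal sketch, assuming a `.venv` directory in the current folder; the exact paths and flags used by `quickstart.sh` may differ:

```bash
# Minimal sketch of the venv-creation fallback: prefer uv with Python 3.12,
# fall back to the standard library venv module if uv is not installed.
# The .venv path is an assumption, not necessarily what quickstart.sh uses.
if command -v uv >/dev/null 2>&1; then
  uv venv --python 3.12 .venv
else
  python3 -m venv .venv
fi
source .venv/bin/activate
```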
### Run individual commands (for step-by-step usage or troubleshooting)

**Install vLLM:**

```bash
scripts/quickstart.sh install

# Or with a virtual environment
scripts/quickstart.sh install --venv /path/to/venv
```

**Start the server:**

```bash
scripts/quickstart.sh start

# Or with custom options
scripts/quickstart.sh start --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8
```

**Test the API:**

```bash
scripts/quickstart.sh test

# Or with a custom port
scripts/quickstart.sh test --port 8000
```

**Stop the server:**

```bash
scripts/quickstart.sh stop

# Or with a virtual environment
scripts/quickstart.sh stop --venv /path/to/venv
```

**Check server status:**

```bash
scripts/quickstart.sh status
```

**Restart the server:**

```bash
scripts/quickstart.sh restart

# Or with custom options
scripts/quickstart.sh restart --venv /path/to/venv --port 8000 --gpu_memory_utilization 0.8
```

## Configuration

The script supports the following command-line options:

```bash
scripts/quickstart.sh [command] [OPTIONS]

Commands:
  install  - Install vLLM and dependencies
  start    - Start the vLLM server
  stop     - Stop the vLLM server
  test     - Test the OpenAI-compatible API
  status   - Show server status
  restart  - Restart the server
  all      - Run the complete workflow (default)

Options:
  --model MODEL                   Model to use (default: Qwen/Qwen2.5-1.5B-Instruct)
  --port PORT                     Port to run the server on (default: 8000)
  --venv VENV_PATH                Virtual environment path (default: .)
  --gpu_memory_utilization VRAM   GPU memory utilization (default: 0.8)
```

### Hardware Backend Detection

The script automatically detects your hardware and installs the appropriate vLLM version:

- **NVIDIA CUDA**: detected via the `nvidia-smi` command
- **AMD ROCm**: detected via the `/dev/kfd` and `/dev/dri` devices
- **Google TPU**: detected via the `TPU_NAME` environment variable or the `gcloud` command
- **CPU**: fallback if no GPU/TPU is detected

For Google TPU, the script installs `vllm-tpu` instead of the standard `vllm` package.
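As a rough illustration of the detection order above, the logic could look like the following. This is a hedged sketch, not the actual `quickstart.sh` implementation, and `detect_backend` is a hypothetical helper name:

```bash
# Sketch of the backend-detection order: CUDA, then ROCm, then TPU, then CPU.
detect_backend() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "cuda"     # NVIDIA driver tooling present
  elif [ -e /dev/kfd ] && [ -e /dev/dri ]; then
    echo "rocm"     # AMD kernel driver devices present
  elif [ -n "${TPU_NAME:-}" ] || command -v gcloud >/dev/null 2>&1; then
    echo "tpu"      # TPU env var set or gcloud available
  else
    echo "cpu"      # no accelerator found
  fi
}

detect_backend   # prints one of: cuda, rocm, tpu, cpu
```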
## API Testing

The test command sends a simple chat completion request:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Say hello!"}],
    "max_tokens": 50
  }'
```

## Troubleshooting

**Virtual environment not found:**

- Ensure the path provided with `--venv` exists and is a valid virtual environment
- Check that the activation script exists (`bin/activate` on Linux/macOS or `Scripts/activate` on Windows)
- If needed, install uv and create a new virtual environment: `uv venv /path/to/venv` (suggested), or use Python's built-in venv module: `python3 -m venv /path/to/venv`

**Server won't start:**

- Check whether the port is already in use: `lsof -i :8000`
- Verify GPU availability: `nvidia-smi` (for NVIDIA) or `rocm-smi` (for AMD)
- Check the vLLM installation: `python -c "import vllm; print(vllm.__version__)"`
- Review the server logs at `$VENV_PATH/tmp/vllm-server.log`

**API returns errors:**

- Wait a few seconds for the model to load
- Check the server logs: `cat $VENV_PATH/tmp/vllm-server.log`
- Verify the server is running: `scripts/quickstart.sh status`

**Out of memory:**

- Use a smaller model (e.g., Qwen2.5-0.5B-Instruct)
- Reduce the `--gpu_memory_utilization` value
- Close other GPU-intensive applications

**Wrong backend detected:**

- For NVIDIA: ensure `nvidia-smi` is on your PATH
- For AMD: check that the ROCm drivers are properly installed
- For TPU: set the `TPU_NAME` environment variable or install `gcloud`

## Notes

- The server runs in the background and logs to `$VENV_PATH/tmp/vllm-server.log`
- The PID is stored in `$VENV_PATH/tmp/vllm-server.pid` for easy management
- The first run downloads the model (~3 GB for Qwen2.5-1.5B-Instruct); subsequent runs use the cached copy
- The script automatically detects and uses `uv` if available, otherwise falls back to `pip`
- Virtual environment support isolates the deployment from system Python packages
- Arguments can be specified in any order (e.g., `scripts/quickstart.sh --port 8080 start --venv /path/to/venv`)
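The PID file mentioned above makes it easy to check on or stop the server by hand. A minimal sketch, assuming `VENV_PATH` is set to the path you passed via `--venv`:

```bash
# Check whether the background server is still alive using the stored PID.
PID_FILE="$VENV_PATH/tmp/vllm-server.pid"
if [ -f "$PID_FILE" ] && kill -0 "$(cat "$PID_FILE")" 2>/dev/null; then
  echo "vLLM server is running (PID $(cat "$PID_FILE"))"
  # To stop it manually:  kill "$(cat "$PID_FILE")"
else
  echo "vLLM server is not running"
fi
```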