---
name: model-serving
description: LLM and ML model deployment for inference. Use when serving models in production, building AI APIs, or optimizing inference. Covers vLLM (LLM serving), TensorRT-LLM (GPU optimization), Ollama (local), BentoML (ML deployment), Triton (multi-model), LangChain (orchestration), LlamaIndex (RAG), and streaming patterns.
---

# Model Serving

## Purpose

Deploy LLM and ML models for production inference with optimized serving engines, streaming response patterns, and orchestration frameworks. Focuses on self-hosted model serving, GPU optimization, and integration with frontend applications.

## When to Use

- Deploying LLMs for production (self-hosted Llama, Mistral, Qwen)
- Building AI APIs with streaming responses
- Serving traditional ML models (scikit-learn, XGBoost, PyTorch)
- Implementing RAG pipelines with vector databases
- Optimizing inference throughput and latency
- Integrating LLM serving with frontend chat interfaces

## Model Serving Selection

### LLM Serving Engines

**vLLM (Recommended Primary)**
- PagedAttention memory management (20-30x throughput improvement over naive serving)
- Continuous batching for dynamic request handling
- OpenAI-compatible API endpoints
- Use for: Most self-hosted LLM deployments

**TensorRT-LLM**
- Maximum GPU efficiency (2-8x faster than vLLM)
- Requires model conversion and optimization
- Use for: Production workloads needing absolute maximum throughput

**Ollama**
- Local development without GPUs
- Simple CLI interface
- Use for: Prototyping, laptop development, educational purposes

**Decision Framework:**

```
Self-hosted LLM deployment needed?
├─ Yes, need maximum throughput → vLLM
├─ Yes, need absolute max GPU efficiency → TensorRT-LLM
├─ Yes, local development only → Ollama
└─ No, use managed API (OpenAI, Anthropic) → No serving layer needed
```

### ML Model Serving (Non-LLM)

**BentoML (Recommended)**
- Python-native, easy deployment
- Adaptive batching for throughput
- Multi-framework support (scikit-learn, PyTorch, XGBoost)
- Use for: Most traditional ML model deployments

**Triton Inference Server**
- Multi-model serving on the same GPU
- Model ensembles (chain multiple models)
- Use for: NVIDIA GPU optimization, serving 10+ models

### LLM Orchestration

**LangChain**
- General-purpose workflows, agents, RAG
- 100+ integrations (LLMs, vector DBs, tools)
- Use for: Most RAG and agent applications

**LlamaIndex**
- RAG-focused with advanced retrieval strategies
- 100+ data connectors (PDF, Notion, web)
- Use for: Projects where RAG is the primary use case

## Quick Start Examples

### vLLM Server Setup

```bash
# Install
pip install vllm

# Serve a model (OpenAI-compatible API)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --dtype auto \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9 \
  --port 8000
```

**Key Parameters:**
- `--dtype`: Model precision (auto, float16, bfloat16)
- `--max-model-len`: Context window size
- `--gpu-memory-utilization`: GPU memory fraction (0.8-0.95)
- `--tensor-parallel-size`: Number of GPUs for model parallelism

### Streaming Responses (SSE Pattern)

**Backend (FastAPI):**

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI
from pydantic import BaseModel
import json

app = FastAPI()
# Async client so streaming does not block the event loop
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

class ChatRequest(BaseModel):
    message: str

@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    async def generate():
        stream = await client.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",
            messages=[{"role": "user", "content": request.message}],
            stream=True,
            max_tokens=512
        )
        async for chunk in stream:
            if chunk.choices[0].delta.content:
                token = chunk.choices[0].delta.content
                yield f"data: {json.dumps({'token': token})}\n\n"
        yield f"data: {json.dumps({'done': True})}\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache"}
    )
```

**Frontend (React):**

```typescript
// Integration with the ai-chat skill
const sendMessage = async (message: string) => {
  const response = await fetch('/chat/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message })
  })

  const reader = response.body!.getReader()
  const decoder = new TextDecoder()

  while (true) {
    const { done, value } = await reader.read()
    if (done) break

    // NOTE: assumes each read() contains whole SSE events;
    // buffer partial events across reads in production.
    const chunk = decoder.decode(value)
    const lines = chunk.split('\n\n')
    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = JSON.parse(line.slice(6))
        if (data.token) {
          setResponse(prev => prev + data.token)
        }
      }
    }
  }
}
```

### BentoML Service

```python
import bentoml
import numpy as np

@bentoml.service(
    resources={"cpu": "2", "memory": "4Gi"},
    traffic={"timeout": 10}
)
class IrisClassifier:
    model_ref = bentoml.models.get("iris_classifier:latest")

    def __init__(self):
        self.model = bentoml.sklearn.load_model(self.model_ref)

    @bentoml.api(batchable=True, max_batch_size=32)
    def classify(self, features: list[dict]) -> list[str]:
        labels = ['setosa', 'versicolor', 'virginica']
        X = np.array([[f['sepal_length'], f['sepal_width'],
                       f['petal_length'], f['petal_width']]
                      for f in features])
        predictions = self.model.predict(X)
        return [labels[p] for p in predictions]
```

### LangChain RAG Pipeline

```python
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Qdrant
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load and chunk documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Qdrant.from_documents(
    chunks,
    embeddings,
    url="http://localhost:6333",
    collection_name="docs"
)

# Create retrieval chain
llm = ChatOpenAI(model="gpt-4o")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

# Query
result = qa_chain.invoke({"query": "What is PagedAttention?"})
```

## Performance Optimization

### GPU Memory Estimation

**Rule of thumb for LLMs:**

```
GPU Memory (GB) = Model Parameters (B) × Precision (bytes) × 1.2
```

**Examples:**
- Llama-3.1-8B (FP16): 8B × 2 bytes × 1.2 = 19.2 GB
- Llama-3.1-70B (FP16): 70B × 2 bytes × 1.2 = 168 GB (requires 2-4 A100s)

**Quantization reduces memory:**
- FP16: 2 bytes per parameter
- INT8: 1 byte per parameter (2x memory reduction)
- INT4: 0.5 bytes per parameter (4x memory reduction)

### vLLM Optimization

```bash
# Enable quantization (AWQ for 4-bit)
vllm serve TheBloke/Llama-3.1-8B-AWQ \
  --quantization awq \
  --gpu-memory-utilization 0.9

# Multi-GPU deployment (tensor parallelism)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9
```

### Batching Strategies

**Continuous batching (vLLM default):**
- Dynamically adds/removes requests from the batch
- Higher throughput than static batching
- No configuration needed

**Adaptive batching (BentoML):**

```python
@bentoml.api(
    batchable=True,
    max_batch_size=32,
    max_latency_ms=1000  # Wait at most 1s to fill a batch
)
def predict(self, inputs: list[np.ndarray]) -> list[float]:
    # BentoML automatically batches concurrent requests
    return self.model.predict(np.array(inputs))
```

## Production Deployment

### Kubernetes Deployment

See `examples/k8s-vllm-deployment/` for complete YAML manifests.
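A minimal sketch of such a deployment, assuming the official `vllm/vllm-openai` container image (the names, label, and PVC claim below are illustrative, not the bundled manifests):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-8b  # illustrative; matches the service name used in the Kong example
spec:
  replicas: 1
  selector:
    matchLabels: {app: vllm-llama-8b}
  template:
    metadata:
      labels: {app: vllm-llama-8b}
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args: ["--model", "meta-llama/Llama-3.1-8B-Instruct",
                 "--gpu-memory-utilization", "0.9"]
          ports: [{containerPort: 8000}]
          resources:
            limits:
              nvidia.com/gpu: 1  # GPU resource request
          readinessProbe:
            httpGet: {path: /health, port: 8000}
          volumeMounts:
            - {name: model-cache, mountPath: /root/.cache/huggingface}
      volumes:
        - name: model-cache
          persistentVolumeClaim: {claimName: vllm-model-cache}  # hypothetical claim
```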
**Key considerations:**
- GPU resource requests: `nvidia.com/gpu: 1`
- Health checks: `/health` endpoint
- Horizontal Pod Autoscaling based on queue depth
- Persistent volume for model caching

### API Gateway Pattern

For production, add rate limiting, authentication, and monitoring:

**Kong Configuration:**

```yaml
services:
  - name: vllm-service
    url: http://vllm-llama-8b:8000
    plugins:
      - name: rate-limiting
        config:
          minute: 60  # 60 requests per minute per API key
      - name: key-auth
      - name: prometheus
```

### Monitoring Metrics

**Essential LLM metrics:**
- Tokens per second (throughput)
- Time to first token (TTFT)
- Inter-token latency
- GPU utilization and memory
- Queue depth

**Prometheus instrumentation:**

```python
import time

from prometheus_client import Counter, Histogram

requests_total = Counter('llm_requests_total', 'Total requests')
tokens_generated = Counter('llm_tokens_generated', 'Total tokens')
request_duration = Histogram('llm_request_duration_seconds', 'Request duration')

@app.post("/chat")
async def chat(request):
    requests_total.inc()
    start = time.time()
    response = await generate(request)
    tokens_generated.inc(len(response.tokens))
    request_duration.observe(time.time() - start)
    return response
```

## Integration Patterns

### Frontend (ai-chat) Integration

This skill provides the backend serving layer for the `ai-chat` skill.

**Flow:**

```
Frontend (React) → API Gateway → vLLM Server → GPU Inference
        ↑                                           ↓
        └─────────── SSE Stream (tokens) ───────────┘
```

See `references/streaming-sse.md` for complete implementation patterns.

### RAG with Vector Databases

**Architecture:**

```
User Query → LangChain
               ├─> Vector DB (Qdrant) for retrieval
               ├─> Combine context + query
               └─> LLM (vLLM) for generation
```

See `references/langchain-orchestration.md` and `examples/langchain-rag-qdrant/` for complete patterns.
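The "combine context + query" step in this architecture can be sketched as a plain prompt-assembly helper (the function name and prompt wording are illustrative, not part of any library API):

```python
def build_rag_prompt(query: str, retrieved_chunks: list[str]) -> str:
    """Join retrieved chunks into a context block and append the user query."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# The resulting string is what gets sent to the LLM (vLLM) for generation.
```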
### Async Inference Queue

For batch processing or non-real-time inference:

```
Client → API → Message Queue (Celery) → Workers (vLLM) → Results DB
```

Useful for:
- Batch document processing
- Background summarization
- Non-interactive workflows

## Benchmarking

Use `scripts/benchmark_inference.py` to measure the deployment:

```bash
python scripts/benchmark_inference.py \
  --endpoint http://localhost:8000/v1/chat/completions \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --concurrency 32 \
  --requests 1000
```

**Outputs:**
- Requests per second
- P50/P95/P99 latency
- Tokens per second
- GPU memory usage

## Bundled Resources

**Detailed Guides:**
- `references/vllm.md` - vLLM setup, PagedAttention, optimization
- `references/tgi.md` - Text Generation Inference patterns
- `references/bentoml.md` - BentoML deployment patterns
- `references/langchain-orchestration.md` - LangChain RAG and agents
- `references/inference-optimization.md` - Quantization, batching, GPU tuning

**Working Examples:**
- `examples/vllm-serving/` - Complete vLLM + FastAPI streaming setup
- `examples/ollama-local/` - Local development with Ollama
- `examples/langchain-agents/` - LangChain agent patterns

**Utility Scripts:**
- `scripts/benchmark_inference.py` - Throughput and latency benchmarking
- `scripts/validate_model_config.py` - Validate deployment configurations

## Common Patterns

### Migration from OpenAI API

vLLM provides OpenAI-compatible endpoints for easy migration:

```python
# Before (OpenAI)
from openai import OpenAI
client = OpenAI(api_key="sk-...")

# After (vLLM)
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Same API calls work!
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}]
)
```

### Multi-Model Serving

Route requests to different models based on task:

```python
MODEL_ROUTING = {
    "small": "meta-llama/Llama-3.1-8B-Instruct",   # Fast, cheap
    "large": "meta-llama/Llama-3.1-70B-Instruct",  # Accurate, expensive
    "code": "codellama/CodeLlama-34b-Instruct"     # Code-specific
}

@app.post("/chat")
async def chat(message: str, task: str = "small"):
    model = MODEL_ROUTING[task]
    # Route the request to the vLLM instance serving this model
```

### Cost Optimization

**Track token usage:**

```python
import tiktoken

def estimate_cost(text: str, model: str, price_per_1k: float) -> float:
    encoding = tiktoken.encoding_for_model(model)
    tokens = len(encoding.encode(text))
    return (tokens / 1000) * price_per_1k

# Compare costs
openai_cost = estimate_cost(text, "gpt-4o", 0.005)  # $5 per 1M tokens
self_hosted_cost = 0  # No per-token charge; fixed GPU cost instead
```

## Troubleshooting

**Out of GPU memory:**
- Reduce `--max-model-len`
- Lower `--gpu-memory-utilization` (try 0.8)
- Enable quantization (`--quantization awq`)
- Use a smaller model variant

**Low throughput:**
- Increase `--gpu-memory-utilization` (try 0.95)
- Enable continuous batching (vLLM default)
- Check GPU utilization (should be >80%)
- Consider tensor parallelism for multi-GPU

**High latency:**
- Reduce batch size if using static batching
- Check network latency to the GPU server
- Profile with `scripts/benchmark_inference.py`

## Next Steps

1. **Local Development**: Start with `examples/ollama-local/` for GPU-free testing
2. **Production Setup**: Deploy vLLM with `examples/vllm-serving/`
3. **RAG Integration**: Add a vector DB with `examples/langchain-rag-qdrant/`
4. **Kubernetes**: Scale with `examples/k8s-vllm-deployment/`
5. **Monitoring**: Add metrics with Prometheus and Grafana
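When sizing hardware for these steps, the GPU memory rule of thumb from the Performance Optimization section is easy to script (the helper name is illustrative; the 1.2 factor is the same rough overhead estimate used above):

```python
def estimate_gpu_memory_gb(params_billions: float, bytes_per_param: float,
                           overhead: float = 1.2) -> float:
    """Rule of thumb: parameters (billions) x precision (bytes) x 1.2 overhead."""
    return params_billions * bytes_per_param * overhead

# Llama-3.1-8B at FP16 (2 bytes/param): ~19.2 GB
print(estimate_gpu_memory_gb(8, 2))
# Same model with INT4 quantization (0.5 bytes/param): ~4.8 GB
print(estimate_gpu_memory_gb(8, 0.5))
```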