---
name: llm-serving-patterns
description: LLM inference infrastructure, serving frameworks (vLLM, TGI, TensorRT-LLM), quantization techniques, batching strategies, and streaming response patterns. Use when designing LLM serving infrastructure, optimizing inference latency, or scaling LLM deployments.
allowed-tools: Read, Glob, Grep
---

# LLM Serving Patterns

## When to Use This Skill

Use this skill when:

- Designing LLM inference infrastructure
- Choosing between serving frameworks (vLLM, TGI, TensorRT-LLM)
- Implementing quantization for production deployment
- Optimizing batching and throughput
- Building streaming response systems
- Scaling LLM deployments cost-effectively

**Keywords:** LLM serving, inference, vLLM, TGI, TensorRT-LLM, quantization, INT8, INT4, FP16, batching, continuous batching, streaming, SSE, WebSocket, KV cache, PagedAttention, speculative decoding

## LLM Serving Architecture Overview

```text
┌─────────────────────────────────────────────────────────────────────┐
│                          LLM Serving Stack                          │
├─────────────────────────────────────────────────────────────────────┤
│  Clients (API, Chat UI, Agents)                                     │
│         │                                                           │
│         ▼                                                           │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                  Load Balancer / API Gateway                  │  │
│  │  • Rate limiting   • Authentication   • Request routing      │  │
│  └───────────────────────────────────────────────────────────────┘  │
│         │                                                           │
│         ▼                                                           │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                       Inference Server                        │  │
│  │  ┌─────────────┐   ┌─────────────┐   ┌─────────────────────┐  │  │
│  │  │   Request   │   │   Batching  │   │      KV Cache       │  │  │
│  │  │    Queue    │──▶│    Engine   │──▶│     Management      │  │  │
│  │  └─────────────┘   └─────────────┘   └─────────────────────┘  │  │
│  │         │                                      │              │  │
│  │         ▼                                      ▼              │  │
│  │  ┌─────────────────────────────────────────────────────────┐  │  │
│  │  │                 Model Execution Engine                  │  │  │
│  │  │  • Tensor operations   • Attention   • Token sampling   │  │  │
│  │  └─────────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────────┘  │
│         │                                                           │
│         ▼                                                           │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                        GPU/TPU Cluster                        │  │
│  │  • Model sharding  • Tensor parallelism  • Pipeline parallel  │  │
│  └───────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘
```

## Serving Framework Comparison

| Framework | Strengths | Best For | Considerations |
| --------- | --------- | -------- | -------------- |
| **vLLM** | PagedAttention, high throughput, continuous batching | General LLM serving, high concurrency | Python-native, active community |
| **TGI (Text Generation Inference)** | Production-ready, Hugging Face integration | Enterprise deployment, HF models | Rust backend, Docker-first |
| **TensorRT-LLM** | NVIDIA optimization, lowest latency | NVIDIA GPUs, latency-critical | NVIDIA-only, complex setup |
| **Triton Inference Server** | Multi-model, multi-framework | Heterogeneous model serving | Enterprise complexity |
| **Ollama** | Simple local deployment | Development, edge deployment | Limited scaling features |
| **llama.cpp** | CPU inference, quantization | Resource-constrained, edge | C++ integration required |

### Framework Selection Decision Tree

```text
Need lowest latency on NVIDIA GPUs?
├── Yes → TensorRT-LLM
└── No
    └── Need high throughput with many concurrent users?
        ├── Yes → vLLM (PagedAttention)
        └── No
            └── Need enterprise features + HF integration?
                ├── Yes → TGI
                └── No
                    └── Simple local/edge deployment?
                        ├── Yes → Ollama or llama.cpp
                        └── No → vLLM (general purpose)
```
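To make the comparison concrete, below is a minimal vLLM offline-inference sketch. This is a hedged example, not the only entry point; the checkpoint name is illustrative, and any Hugging Face-compatible model should work:

```python
from vllm import LLM, SamplingParams

# Offline (batch) inference with vLLM's Python API. For an OpenAI-compatible
# HTTP server, the rough equivalent is: `vllm serve <model-name>`.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative checkpoint

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain continuous batching in one sentence."], params)

for output in outputs:
    print(output.outputs[0].text)
```

vLLM applies continuous batching and PagedAttention automatically; the sections below cover the knobs that control them.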
## Quantization Techniques

### Precision Levels

| Precision | Bits | Memory Reduction | Quality Impact | Use Case |
| --------- | ---- | ---------------- | -------------- | -------- |
| FP32 | 32 | Baseline | None | Training, reference |
| FP16/BF16 | 16 | 2x | Minimal | Standard serving |
| INT8 | 8 | 4x | Low | Production serving |
| INT4 | 4 | 8x | Moderate | Resource-constrained |
| INT2 | 2 | 16x | Significant | Experimental |

### Quantization Methods

| Method | Description | Quality | Speed |
| ------ | ----------- | ------- | ----- |
| **PTQ (Post-Training Quantization)** | Quantize after training, no retraining | Good | Fast to apply |
| **QAT (Quantization-Aware Training)** | Simulate quantization during training | Better | Requires training |
| **GPTQ** | One-shot weight quantization | Very good | Moderate |
| **AWQ (Activation-aware Weight Quantization)** | Preserves salient weights | Excellent | Moderate |
| **GGUF/GGML** | llama.cpp format, CPU-optimized | Good | Very fast inference |
| **SmoothQuant** | Migrates quantization difficulty from activations to weights | Excellent | Moderate |

### Quantization Selection

```text
Quality vs. Efficiency Trade-off:

Quality ──────────────────────────────────────────────────▶ Efficiency

  FP32      FP16      INT8    INT4 (AWQ/GPTQ)  INT4 (naive)   INT2
   ○─────────○─────────○─────────────○──────────────○──────────○
   │         │         │             │              │          │
  Best     Great      Good          Good           Fair       Poor
```

Note: AWQ and GPTQ are most commonly used for 4-bit weight quantization; calibration-aware methods at INT4 roughly match naive (round-to-nearest) INT8 quality.

## Batching Strategies

### Static Batching

```text
Request 1: [tokens: 100] ─┐
Request 2: [tokens: 50]  ─┼──▶ [Batch: pad to 100] ──▶ Process ──▶ All complete
Request 3: [tokens: 80]  ─┘

Problem: Short requests wait for long ones (head-of-line blocking)
```

### Continuous Batching (Preferred)

```text
Time ──────────────────────────────────────────────────────────▶

Req 1: [████████████████████████████████] ──▶ Complete
Req 2: [████████████] ──▶ Complete ──▶ Req 4 starts [████████████████]
Req 3: [████████████████████] ──▶ Complete ──▶ Req 5 starts [████████]

• New requests join batch as others complete
• No padding waste
• Optimal GPU utilization
```

### Batching Parameters

| Parameter | Description | Trade-off |
| --------- | ----------- | --------- |
| `max_batch_size` | Maximum concurrent requests | Memory vs. throughput |
| `max_waiting_tokens` | Tokens before forcing batch | Latency vs. throughput |
| `max_num_seqs` | Maximum sequences in batch | Memory vs. concurrency |
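These parameters map directly onto serving-engine configuration. A sketch using vLLM's engine arguments (names follow vLLM's API, but availability and defaults vary by version; the AWQ checkpoint is illustrative):

```python
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-13B-AWQ",  # illustrative pre-quantized checkpoint
    quantization="awq",                # 4-bit AWQ weights (see method table above)
    max_num_seqs=64,                   # cap on sequences in the continuous batch
    max_model_len=4096,                # bounds per-request KV cache growth
    gpu_memory_utilization=0.90,       # fraction of VRAM for weights + KV cache
    enable_prefix_caching=True,        # reuse KV cache across shared prompt prefixes
)
```

Raising `max_num_seqs` increases throughput until KV cache memory becomes the bottleneck, which is the subject of the next section.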
## KV Cache Management

### The KV Cache Problem

```text
Attention: softmax(Q × Kᵀ / √d) × V

For each new token generated:
• The token attends to the K and V of ALL previous tokens
• Caching K/V avoids recomputation, but the cache grows with sequence length
• Memory: O(batch_size × seq_len × num_layers × hidden_dim)

Example (70B model, 4K context):
• KV cache per request: ~8GB
• 10 concurrent requests: ~80GB GPU memory
```

### PagedAttention (vLLM Innovation)

```text
Traditional KV Cache:
┌──────────────────────────────────────────┐
│ Request 1 KV Cache (contiguous, fixed)   │ ← Wastes memory
├──────────────────────────────────────────┤
│ Request 2 KV Cache (contiguous, fixed)   │
├──────────────────────────────────────────┤
│ FRAGMENTED/WASTED SPACE                  │
└──────────────────────────────────────────┘

PagedAttention:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ R1 │ R2 │ R1 │ R3 │ R2 │ R1 │ R3 │ R2 │ ← Pages allocated on demand
└────┴────┴────┴────┴────┴────┴────┴────┘

• Non-contiguous memory allocation
• Near-zero memory waste
• 2-4x higher throughput
```

### KV Cache Optimization Strategies

| Strategy | Description | Memory Savings |
| -------- | ----------- | -------------- |
| **PagedAttention** | Virtual memory for KV cache | ~50% reduction |
| **Prefix Caching** | Reuse KV cache for common prefixes | 100% of shared prefix (e.g., system prompts) |
| **Quantized KV Cache** | INT8/FP8 for KV values | 50-75% reduction |
| **Sliding Window** | Limited attention context | Linear memory |
| **MQA/GQA** | Multi-/grouped-query attention (fewer KV heads) | Architecture-dependent |

## Streaming Response Patterns

### Server-Sent Events (SSE)

```text
Client                                 Server
  │                                      │
  │──── POST /v1/chat/completions ──────▶│
  │      (stream: true)                  │
  │                                      │
  │◀──── HTTP 200 OK ────────────────────│
  │      Content-Type: text/event-stream │
  │                                      │
  │◀──── data: {"token": "Hello"} ───────│
  │◀──── data: {"token": " world"} ──────│
  │◀──── data: {"token": "!"} ───────────│
  │◀──── data: [DONE] ───────────────────│
  │                                      │
```

**SSE Benefits:**

- HTTP/1.1 compatible
- Auto-reconnection support
- Simple to implement
- Wide client support
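A minimal SSE endpoint sketch using FastAPI, assuming an async token generator from your inference engine (`fake_token_stream` is a stand-in for that generator):

```python
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def fake_token_stream(prompt: str):
    # Stand-in for an inference engine's async token generator.
    for token in ["Hello", " world", "!"]:
        await asyncio.sleep(0.05)
        yield token

@app.post("/v1/chat/completions")
async def stream_completion(body: dict):
    async def event_stream():
        async for token in fake_token_stream(body.get("prompt", "")):
            # One SSE event per token, matching the diagram above.
            yield f"data: {json.dumps({'token': token})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```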
### WebSocket Streaming

```text
Client                                 Server
  │                                      │
  │──── WebSocket Upgrade ──────────────▶│
  │◀──── 101 Switching Protocols ────────│
  │                                      │
  │──── {"prompt": "Hello"} ────────────▶│
  │                                      │
  │◀──── {"token": "Hi"} ────────────────│
  │◀──── {"token": " there"} ────────────│
  │◀──── {"token": "!"} ─────────────────│
  │◀──── {"done": true} ─────────────────│
  │                                      │
```

**WebSocket Benefits:**

- Bidirectional communication
- Lower latency
- Better for chat applications
- Connection persistence

### Streaming Implementation Considerations

| Aspect | SSE | WebSocket |
| ------ | --- | --------- |
| **Reconnection** | Built-in | Manual |
| **Scalability** | Per-request | Connection pool |
| **Load Balancing** | Standard HTTP | Sticky sessions |
| **Firewall/Proxy** | Usually works | May need config |
| **Best For** | One-way streaming | Interactive chat |

## Speculative Decoding

### Concept

```text
Standard Decoding:
Large Model: [T1] → [T2] → [T3] → [T4] → [T5]
             10ms   10ms   10ms   10ms   10ms   = 50ms total

Speculative Decoding:
Draft Model: [T1, T2, T3, T4, T5]   (parallel, 5ms)
                      │
                      ▼
Large Model: [Verify T1-T5 in one pass]   (15ms)
             Accept: T1, T2, T3 ✓
             Reject: T4, T5 ✗
                      │
                      ▼
             [Generate T4, T5 correctly]

Total: ~25ms (2x speedup if 60% acceptance)
```

### Speculative Decoding Trade-offs

| Factor | Impact |
| ------ | ------ |
| **Draft model quality** | Higher match rate = more speedup |
| **Draft model size** | Larger = better quality, slower |
| **Speculation depth** | More tokens = higher risk/reward |
| **Verification cost** | Must be < sequential generation |

## Scaling Strategies

### Horizontal Scaling

```text
┌─────────────────────────────────────────────────────────┐
│                      Load Balancer                      │
│             (Round-robin, Least-connections)            │
└─────────────────────────────────────────────────────────┘
      │                    │                    │
      ▼                    ▼                    ▼
┌─────────┐          ┌─────────┐          ┌─────────┐
│  vLLM   │          │  vLLM   │          │  vLLM   │
│ Node 1  │          │ Node 2  │          │ Node 3  │
│ (GPU×4) │          │ (GPU×4) │          │ (GPU×4) │
└─────────┘          └─────────┘          └─────────┘
```

### Model Parallelism

| Strategy | Description | Use Case |
| -------- | ----------- | -------- |
| **Tensor Parallelism** | Split layers across GPUs | Single large model |
| **Pipeline Parallelism** | Different layers on different GPUs | Very large models |
| **Data Parallelism** | Same model, different batches | High throughput |

```text
Tensor Parallelism (TP=4):
┌─────────────────────────────────────────┐
│                 Layer N                 │
│  GPU0   │   GPU1   │   GPU2   │   GPU3  │
│  25%    │   25%    │   25%    │   25%   │
└─────────────────────────────────────────┘

Pipeline Parallelism (PP=4):
GPU0: Layers 0-7
GPU1: Layers 8-15
GPU2: Layers 16-23
GPU3: Layers 24-31
```

## Latency Optimization Checklist

### Pre-deployment

- [ ] Choose appropriate quantization (INT8 for production)
- [ ] Enable continuous batching
- [ ] Configure KV cache size appropriately
- [ ] Set optimal batch size for hardware
- [ ] Enable prefix caching for system prompts

### Runtime

- [ ] Monitor GPU memory utilization
- [ ] Track p50/p95/p99 latencies
- [ ] Measure time-to-first-token (TTFT)
- [ ] Monitor tokens-per-second (TPS)
- [ ] Set appropriate timeouts

### Infrastructure

- [ ] Use fastest available interconnect (NVLink, InfiniBand)
- [ ] Minimize network hops
- [ ] Place inference close to users (edge)
- [ ] Consider dedicated inference hardware

## Cost Optimization

### Cost Drivers

| Factor | Impact | Optimization |
| ------ | ------ | ------------ |
| **GPU hours** | Highest | Quantization, batching |
| **Memory** | High | PagedAttention, KV cache optimization |
| **Network** | Medium | Response compression, edge deployment |
| **Storage** | Low | Model deduplication |

### Cost Estimation Formula

```text
               (Requests/month) × (Avg tokens/request) × (GPU-seconds/token) × ($/GPU-hour)
Monthly Cost = ────────────────────────────────────────────────────────────────────────────
                                                3600

Example:
• 10M requests/month
• 500 tokens average
• 0.001 GPU-seconds/token (optimized)
• $2/GPU-hour

Cost = (10M × 500 × 0.001 × 2) / 3600 = $2,778/month
```
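The same formula as a small helper, reproducing the worked example above:

```python
def monthly_cost(requests_per_month: float,
                 avg_tokens_per_request: float,
                 gpu_seconds_per_token: float,
                 dollars_per_gpu_hour: float) -> float:
    """Implements the cost formula above; returns dollars per month."""
    gpu_seconds = requests_per_month * avg_tokens_per_request * gpu_seconds_per_token
    return gpu_seconds / 3600 * dollars_per_gpu_hour

# Worked example from above: 10M requests, 500 tokens, 0.001 GPU-s/token, $2/GPU-hr
print(f"${monthly_cost(10e6, 500, 0.001, 2.0):,.0f}/month")  # -> $2,778/month
```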
## Common Patterns

### Multi-model Routing

```text
┌─────────────────────────────────────────────────────────┐
│                         Router                          │
│  • Classify request complexity                          │
│  • Route to appropriate model                           │
└─────────────────────────────────────────────────────────┘
      │                    │                    │
      ▼                    ▼                    ▼
┌─────────┐          ┌─────────┐          ┌─────────┐
│  Small  │          │ Medium  │          │  Large  │
│  Model  │          │  Model  │          │  Model  │
│  (7B)   │          │  (13B)  │          │  (70B)  │
│  Fast   │          │ Balanced│          │ Quality │
└─────────┘          └─────────┘          └─────────┘
```

### Caching Strategies

| Cache Type | What to Cache | TTL |
| ---------- | ------------- | --- |
| **Prompt cache** | Common system prompts | Long |
| **KV cache** | Prefix tokens | Session |
| **Response cache** | Exact query matches | Varies |
| **Embedding cache** | Document embeddings | Long |

## Related Skills

- `ml-system-design` - End-to-end ML pipeline design
- `rag-architecture` - Retrieval-augmented generation patterns
- `vector-databases` - Vector search for LLM context
- `ml-inference-optimization` - General inference optimization
- `estimation-techniques` - Capacity planning for LLM systems

## Version History

- v1.0.0 (2025-12-26): Initial release - LLM serving patterns for systems design interviews

---

## Last Updated

**Date:** 2025-12-26