---
name: ai-llm-inference
description: "Operational patterns for LLM inference: latency budgeting, tail-latency control, caching, batching/scheduling, quantization/compression, parallelism, and reliable serving at scale. Emphasizes production-grade performance, cost control, and observability."
---

# LLMOps - Inference & Optimization - Production Skill Hub

**Modern Best Practices (January 2026)**:

- Treat inference as a **systems problem**: SLOs, tail latency, retries, overload, and cache strategy.
- Use **continuous batching / smart scheduling** when serving many concurrent requests (Orca scheduling: https://www.usenix.org/conference/osdi22/presentation/yu).
- Use **KV-cache aware serving** (PagedAttention/vLLM: https://arxiv.org/abs/2309.06180) and **efficient attention kernels** (FlashAttention: https://arxiv.org/abs/2205.14135).
- Use **speculative decoding** when latency is critical and draft-model quality is acceptable (speculative decoding: https://arxiv.org/abs/2302.01318).
- Quantize only with a **measured** quality impact and a rollback plan (quantization must be validated on your eval set).

This skill provides **production-ready operational patterns** for optimizing LLM inference performance, cost, and reliability. It centralizes **decision rules**, **optimization strategies**, **configuration templates**, and **operational checklists** for inference workloads.

No theory. No narrative. Only what Codex can execute.

---

## When to Use This Skill

Codex should activate this skill whenever the user asks for:

- Optimizing LLM inference latency or throughput
- Choosing quantization strategies (FP8/FP4/INT8/INT4)
- Configuring vLLM, TensorRT-LLM, or DeepSpeed inference
- Scaling LLM inference across GPUs (tensor/pipeline parallelism)
- Building high-throughput LLM APIs
- Improving context window performance (KV cache optimization)
- Using speculative decoding for faster generation
- Reducing cost per token
- Profiling and benchmarking inference workloads
- Planning infrastructure capacity
- CPU/edge deployment patterns
- High availability and resilience patterns

## Scope Boundaries (Use These Skills for Depth)

- **Prompting, tuning, datasets** -> [ai-llm](../ai-llm/SKILL.md)
- **RAG pipeline construction** -> [ai-rag](../ai-rag/SKILL.md)
- **Deployment, APIs, monitoring** -> [ai-mlops](../ai-mlops/SKILL.md)
- **Safety, governance** -> [ai-mlops](../ai-mlops/SKILL.md)
- **Performance monitoring** -> [qa-observability](../qa-observability/SKILL.md)
- **Infrastructure operations** -> [ops-devops-platform](../ops-devops-platform/SKILL.md)

---

## Quick Reference

| Task | Tool/Framework | Command/Pattern | When to Use |
|------|----------------|-----------------|-------------|
| Latency budget | SLO + load model | TTFT/ITL + P95/P99 under load | Any production endpoint |
| Tail-latency control | Scheduling + timeouts | Admission control + queue caps + backpressure | Prevent p99 explosions |
| Throughput | Batching + KV-cache aware serving | Continuous batching + KV paging | High-concurrency serving |
| Cost control | Model tiering + caching | Cache (prefix/response) + quotas | Reduce spend and overload risk |
| Long context | Prefill optimization | Chunked prefill + prompt compression | Long inputs and RAG-heavy apps |
| Parallelism | TP/PP/DP | Choose by model size and interconnect | Models that do not fit on one device |
| Reliability | Resilience patterns | Timeouts + circuit breakers + idempotency | Avoid cascading failures |
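The tail-latency row above combines admission control, queue caps, and backpressure. Below is a minimal sketch of that pattern, assuming a FastAPI front end; `MAX_CONCURRENT`, `QUEUE_TIMEOUT_S`, and `call_inference_backend` are illustrative names and values, not part of any framework.

```python
# Admission-control sketch: cap in-flight work, bound the queue with a wait
# budget, and fail fast with 503 instead of letting p99 grow unboundedly.
import asyncio
from fastapi import FastAPI, HTTPException

app = FastAPI()

MAX_CONCURRENT = 32      # in-flight generations per replica (illustrative)
QUEUE_TIMEOUT_S = 0.5    # max time a request may wait for admission

slots = asyncio.Semaphore(MAX_CONCURRENT)

async def call_inference_backend(payload: dict) -> dict:
    # Placeholder for the real engine call (vLLM, TensorRT-LLM, ...).
    await asyncio.sleep(0.01)
    return {"text": "..."}

@app.post("/generate")
async def generate(payload: dict):
    try:
        # A bounded wait for a slot doubles as the queue cap: requests that
        # cannot be admitted within the budget fail fast instead of piling up.
        await asyncio.wait_for(slots.acquire(), timeout=QUEUE_TIMEOUT_S)
    except asyncio.TimeoutError:
        raise HTTPException(status_code=503, detail="overloaded, retry with backoff")
    try:
        return await call_inference_backend(payload)
    finally:
        slots.release()
```

Rejecting at admission time is what keeps p99 bounded: a request that cannot start within its budget is worth more as a quick 503 than as a slow success.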
---

## Decision Tree: Inference Optimization Strategy

```text
Need to optimize LLM inference:

[Optimization Path]
│
├─ High throughput (>10k tok/s) OR P99 variance > 3x P50?
│  └─ YES -> Disaggregated inference (prefill/decode separation)
│     See references/disaggregated-inference.md
│
├─ Primary constraint: Throughput?
│  ├─ Many concurrent users? -> batching + KV-cache aware serving + admission control
│  ├─ Chat/agents with KV reuse? -> SGLang (RadixAttention)
│  └─ Mostly batch/offline? -> batch inference jobs + large batches + spot capacity
│
├─ Primary constraint: Cost?
│  ├─ Can accept a lower quality tier? -> model tiering (small/medium/large router)
│  └─ Must keep quality? -> caching + prompt/context reduction before quantization
│
├─ Primary constraint: Latency?
│  ├─ Draft model acceptable? -> speculative decoding
│  └─ Long context? -> prefill optimizations + FlashAttention-3 + context budgets
│
├─ Large model (>70B)?
│  ├─ Multiple GPUs? -> tensor parallelism (fast interconnect such as NVLink required)
│  └─ Deep model? -> pipeline parallelism (minimize bubbles)
│
├─ Hardware selection?
│  ├─ Memory-bound? -> more HBM, higher bandwidth
│  ├─ Latency-bound? -> faster clocks + kernel support
│  └─ Multi-node? -> prioritize interconnect (NVLink/RDMA) and topology
│
│  Notes: treat GPU/SKU advice as time-sensitive; verify with vendor docs
│  and your own benchmarks. See references/gpu-optimization-checklists.md
│  and references/infrastructure-tuning.md
│
└─ Edge deployment?
   └─ CPU + quantization -> llama.cpp/GGUF for constrained resources
```

---

## Intake Checklist (REQUIRED)

Before recommending changes, collect (or infer) these inputs:

- Model + variant (size, context length, precision/quantization, tokenizer)
- Traffic shape (prompt/output length distributions, concurrency, QPS, streaming vs non-streaming)
- SLOs and budgets (TTFT/ITL/total latency targets, error budget, cost per request)
- Serving stack (engine/version, batching/scheduling settings, caching, parallelism, autoscaling)
- Hardware and topology (GPU type/count, VRAM, NVLink/RDMA, CPU/RAM, storage, cluster/runtime)
- Constraints (quality floor, safety requirements, rollout/rollback constraints)

## Core Concepts & Practices

### Core Concepts (Vendor-Agnostic)

- **Latency components**: queueing + prefill + decode; optimize the largest contributor first.
- **Tail latency**: p99 is dominated by queueing and long prompts; fix with admission control and context budgets.
- **Retries**: retries can multiply load; bound retries and use hedged requests only with strict budgets.
- **Caching**: prefix caching helps repeated system/tool scaffolds; response caching helps repeated questions (and requires invalidation).
- **Security & privacy**: prompts/outputs can contain sensitive data; scrub logs, enforce auth/tenancy, and rate-limit abuse (OWASP LLM Top 10: https://owasp.org/www-project-top-10-for-large-language-model-applications/).

### Implementation Practices (Tooling Examples)

- **Measure under load**: benchmark TTFT/ITL and p95/p99 with realistic concurrency and prompt lengths.
- **Separate environments**: dev/stage/prod model configs; promote only after passing the inference review checklist.
- **Export telemetry**: request-level tokens, TTFT/ITL, queue depth, GPU memory headroom, and error classes (OpenTelemetry GenAI semantic conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/).

### Do / Avoid

**Do**

- Do enforce `max_input_tokens` and `max_output_tokens` at the API boundary (see the token-budget sketch after this section).
- Do cap concurrency and queue depth; return overload errors quickly.
- Do validate quality after any quantization or kernel change.

**Avoid**

- Avoid unbounded retries (they amplify outages; see the bounded-retry sketch below).
- Avoid unbounded context windows (OOM + latency spikes).
- Avoid benchmarking on single requests; always test with realistic concurrency (see the load-test sketch below).
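The first Do item is mechanical to enforce. A minimal sketch, assuming a Hugging Face tokenizer; the limits and model name are placeholders to be derived from your context window, VRAM math, and SLOs.

```python
# Token-budget enforcement at the API boundary, before any work is queued.
from transformers import AutoTokenizer

MAX_INPUT_TOKENS = 8192    # illustrative; derive from context window and SLOs
MAX_OUTPUT_TOKENS = 1024

# The tokenizer must match the served model, or counts will drift.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def validate_request(prompt: str, requested_max_tokens: int) -> int:
    """Reject oversized prompts up front and clamp the output budget."""
    n_input = len(tokenizer.encode(prompt))
    if n_input > MAX_INPUT_TOKENS:
        raise ValueError(f"prompt is {n_input} tokens; limit is {MAX_INPUT_TOKENS}")
    # Clamp rather than reject on the output side: the caller still gets a reply.
    return min(requested_max_tokens, MAX_OUTPUT_TOKENS)
```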
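For the retry items (Core Concepts and Avoid above), here is a sketch of a retry budget that caps both attempt count and total wall-clock time; all constants are illustrative. Hedged requests deserve the same discipline: issue a second attempt only while the deadline budget allows.

```python
# Bounded-retry sketch: cap attempts AND total wall-clock budget so a
# downstream brownout cannot multiply offered load.
import random
import time

RETRYABLE = (TimeoutError, ConnectionError)

def call_with_retry_budget(fn, max_attempts: int = 3, deadline_s: float = 10.0):
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RETRYABLE:
            elapsed = time.monotonic() - start
            # Stop when either budget is spent; surface the error instead.
            if attempt == max_attempts or elapsed >= deadline_s:
                raise
            # Exponential backoff with full jitter avoids retry synchronization.
            backoff = random.uniform(0, 0.2 * 2 ** attempt)
            time.sleep(min(backoff, deadline_s - elapsed))
```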
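And for "measure under load": a load-generator sketch that records TTFT and inter-token latency across concurrent streaming requests. It assumes an OpenAI-compatible `/v1/completions` endpoint (for example, a local vLLM server); the URL, model name, and concurrency level are placeholders.

```python
# Concurrency load test measuring TTFT and inter-token latency (ITL).
import asyncio
import time
import httpx

URL = "http://localhost:8000/v1/completions"  # assumed endpoint, e.g. vLLM
BODY = {"model": "my-model", "prompt": "Hello", "max_tokens": 128, "stream": True}

async def one_request(client: httpx.AsyncClient):
    """Return (TTFT seconds, list of inter-token gaps) for one streamed request."""
    start = time.monotonic()
    ttft, gaps, last = None, [], None
    async with client.stream("POST", URL, json=BODY, timeout=60.0) as resp:
        async for line in resp.aiter_lines():
            # OpenAI-compatible SSE: payload lines look like "data: {...}".
            if not line.startswith("data:") or line.endswith("[DONE]"):
                continue
            now = time.monotonic()
            if ttft is None:
                ttft = now - start       # time to first token
            else:
                gaps.append(now - last)  # inter-token latency sample
            last = now
    return ttft, gaps

async def main(concurrency: int = 32):
    async with client_scope() as client:
        results = await asyncio.gather(*(one_request(client) for _ in range(concurrency)))
    ttfts = sorted(t for t, _ in results if t is not None)
    print(f"TTFT p50={ttfts[len(ttfts) // 2]:.3f}s "
          f"p99={ttfts[int(0.99 * (len(ttfts) - 1))]:.3f}s")

def client_scope() -> httpx.AsyncClient:
    return httpx.AsyncClient()

asyncio.run(main())
```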
---

## Accuracy Protocol (REQUIRED)

- Treat performance ratios (for example, "2x faster") as hypotheses unless a source is cited and the workload is comparable.
- Do not recommend hardware/SKU changes without stating assumptions (model size, context length, concurrency, interconnect).
- Prefer a measured baseline + checklist-driven rollout over "best practice" claims.

---

## Resources (Detailed Operational Guides)

For comprehensive guides on specific topics, see:

### Infrastructure & Serving

- [Disaggregated Inference](references/disaggregated-inference.md) - Prefill/decode separation (2025+ standard)
- [Infrastructure Tuning](references/infrastructure-tuning.md) - OS, container, and Kubernetes optimization for GPU workloads
- [Serving Architectures](references/serving-architectures.md) - Production serving stack patterns (vLLM, SGLang, TensorRT-LLM, NVIDIA Dynamo)
- [Resilience & HA Patterns](references/resilience-ha-patterns.md) - Multi-region, failover, traffic management

### Performance Optimization

- [Quantization Patterns](references/quantization-patterns.md) - FP8/FP4/INT8/INT4 decision trees (FP8 first; INT8 not on Blackwell)
- [KV Cache Optimization](references/kv-cache-optimization.md) - PagedAttention, FlashAttention-3, FlashInfer, RadixAttention
- [Parallelism Patterns](references/parallelism-patterns.md) - Tensor/pipeline/expert parallelism strategies
- [Optimization Strategies](references/optimization-strategies.md) - Throughput, cost, and memory optimization
- [Batching & Scheduling](references/batching-and-scheduling.md) - Continuous batching and throughput patterns

### Deployment & Operations

- [Edge & CPU Optimization](references/edge-cpu-optimization.md) - llama.cpp, GGUF, mobile/browser deployment
- [GPU Optimization Checklists](references/gpu-optimization-checklists.md) - Hardware-specific tuning
- [Speculative Decoding Guide](references/speculative-decoding-guide.md) - Advanced generation acceleration
- [Profiling & Capacity Planning](references/profiling-and-capacity-planning.md) - Benchmarking, SLOs, replica sizing

---

## Templates

### Inference Configs

Production-ready configuration templates for leading inference engines:

- [vLLM Configuration](assets/inference/template-vllm-config.md) - Continuous batching, PagedAttention setup
- [TensorRT-LLM Configuration](assets/inference/template-tensorrtllm-config.md) - NVIDIA kernel optimizations
- [DeepSpeed Inference](assets/inference/template-deepspeed-inference.md) - PyTorch-friendly inference

### Quantization & Compression

Model compression templates for reducing memory and cost:

- [GPTQ Quantization](assets/quantization/template-gptq.md) - GPU post-training quantization
- [AWQ Quantization](assets/quantization/template-awq.md) - Activation-aware weight quantization
- [GGUF Format](assets/quantization/template-gguf.md) - CPU/edge-optimized formats

### Serving Pipelines

High-throughput serving architectures:

- [LLM API Server](assets/serving/template-llm-api.md) - FastAPI + vLLM production setup
- [High-Throughput Setup](assets/serving/template-high-throughput-setup.md) - Multi-replica scaling patterns

### Caching & Batching

Performance optimization templates (a prefix-caching sketch follows this list):

- [Prefix Caching](assets/caching/template-prefix-caching.md) - KV cache reuse strategies
- [Batching Configuration](assets/batching/template-batching-config.md) - Continuous batching tuning
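As referenced above, a prefix-caching sketch using vLLM's offline `LLM` engine. The model name is illustrative, and engine-arg names track recent vLLM releases; verify `enable_prefix_caching` against your installed version (the server equivalent is the `--enable-prefix-caching` flag).

```python
# Prefix-caching sketch: requests sharing a long system/tool scaffold reuse
# its KV blocks instead of re-running prefill on every request.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    enable_prefix_caching=True,                # reuse KV blocks for shared prefixes
    gpu_memory_utilization=0.90,               # leave headroom for spikes
)

# A long scaffold shared by every request -- the classic prefix-cache win.
SYSTEM = "You are a support agent. Follow the tool protocol below.\n" * 40

params = SamplingParams(max_tokens=128)
# The second prompt reuses the cached KV blocks of SYSTEM, skipping most prefill.
outputs = llm.generate(
    [SYSTEM + "Q: How do I reset my password?", SYSTEM + "Q: How do I request a refund?"],
    params,
)
for out in outputs:
    print(out.outputs[0].text[:80])
```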
### Benchmarking

Performance measurement and validation:

- [Latency & Throughput Testing](assets/benchmarking/template-latency-throughput-test.md) - Load-testing framework

### Checklists

- [Inference Performance Review Checklist](assets/checklists/inference-review-checklist.md) - Baseline, bottlenecks, rollout readiness

## Navigation

**Resources**

- [references/disaggregated-inference.md](references/disaggregated-inference.md)
- [references/serving-architectures.md](references/serving-architectures.md)
- [references/profiling-and-capacity-planning.md](references/profiling-and-capacity-planning.md)
- [references/gpu-optimization-checklists.md](references/gpu-optimization-checklists.md)
- [references/speculative-decoding-guide.md](references/speculative-decoding-guide.md)
- [references/resilience-ha-patterns.md](references/resilience-ha-patterns.md)
- [references/optimization-strategies.md](references/optimization-strategies.md)
- [references/kv-cache-optimization.md](references/kv-cache-optimization.md)
- [references/batching-and-scheduling.md](references/batching-and-scheduling.md)
- [references/quantization-patterns.md](references/quantization-patterns.md)
- [references/parallelism-patterns.md](references/parallelism-patterns.md)
- [references/edge-cpu-optimization.md](references/edge-cpu-optimization.md)
- [references/infrastructure-tuning.md](references/infrastructure-tuning.md)

**Templates**

- [assets/serving/template-llm-api.md](assets/serving/template-llm-api.md)
- [assets/serving/template-high-throughput-setup.md](assets/serving/template-high-throughput-setup.md)
- [assets/inference/template-vllm-config.md](assets/inference/template-vllm-config.md)
- [assets/inference/template-tensorrtllm-config.md](assets/inference/template-tensorrtllm-config.md)
- [assets/inference/template-deepspeed-inference.md](assets/inference/template-deepspeed-inference.md)
- [assets/quantization/template-awq.md](assets/quantization/template-awq.md)
- [assets/quantization/template-gptq.md](assets/quantization/template-gptq.md)
- [assets/quantization/template-gguf.md](assets/quantization/template-gguf.md)
- [assets/batching/template-batching-config.md](assets/batching/template-batching-config.md)
- [assets/caching/template-prefix-caching.md](assets/caching/template-prefix-caching.md)
- [assets/benchmarking/template-latency-throughput-test.md](assets/benchmarking/template-latency-throughput-test.md)
- [assets/checklists/inference-review-checklist.md](assets/checklists/inference-review-checklist.md)

**Data**

- [data/sources.json](data/sources.json) - Curated external references

---

## Trend Awareness Protocol

**IMPORTANT**: When users ask recommendation questions about LLM inference, you MUST use WebSearch to check current trends before answering.

### Trigger Conditions

- "What's the best inference engine for [use case]?"
- "What should I use for [serving/quantization/batching]?"
- "What's the latest in LLM inference optimization?"
- "Current best practices for [vLLM/TensorRT/quantization]?"
- "Is [inference tool] still relevant in 2026?"
- "[vLLM] vs [TensorRT-LLM] vs [SGLang]?"
- "Best quantization method for [model size]?"
- "What GPU should I use for inference?"

### Required Searches

1. Search: `"LLM inference optimization best practices 2026"`
2. Search: `"[vLLM/TensorRT-LLM/SGLang] comparison 2026"`
3. Search: `"LLM quantization trends January 2026"`
4. Search: `"LLM serving new releases 2026"`
Search: `"LLM serving new releases 2026"` ### What to Report After searching, provide: - **Current landscape**: What serving engines are popular NOW (not 6 months ago) - **Emerging trends**: New inference optimizations gaining traction - **Deprecated/declining**: Techniques or tools losing relevance - **Recommendation**: Based on fresh data, not just static knowledge ### Example Topics (verify with fresh search) - Inference engines (vLLM 0.7+, TensorRT-LLM, SGLang, llama.cpp) - Quantization methods (FP8, AWQ, GPTQ, GGUF, bitsandbytes) - Attention kernels (FlashAttention-3, FlashInfer, xFormers) - Speculative decoding advances - KV cache optimization techniques - New GPU architectures (H200, Blackwell) and their optimizations --- ## Related Skills This skill focuses on **inference-time performance**. For related workflows: - See "Scope Boundaries" above. --- ## External Resources See [data/sources.json](data/sources.json) for: - Serving frameworks (vLLM, TensorRT-LLM, DeepSpeed-MII) - Quantization libraries (GPTQ, AWQ, bitsandbytes, LLM Compressor) - FlashAttention, FlashInfer, xFormers - GPU hardware guides and optimization docs - Benchmarking frameworks and tools --- Use this skill whenever the user needs **LLM inference performance, cost reduction, or serving architecture** guidance.