---
name: high-performance-inference
description: High-performance LLM inference with vLLM, quantization (AWQ, GPTQ, FP8), speculative decoding, and edge deployment. Use when optimizing inference latency, throughput, or memory.
version: 1.0.0
tags: [vllm, quantization, inference, performance, edge, speculative, 2026]
context: fork
agent: llm-integrator
author: OrchestKit
user-invocable: false
---

# High-Performance Inference

Optimize LLM inference for production with vLLM 0.14.x, quantization, and speculative decoding.

> **vLLM 0.14.0** (Jan 2026): PyTorch 2.9.0, CUDA 12.9, AttentionConfig API, Python 3.12+ recommended.

## Overview

Use this skill when:

- Deploying LLMs with low-latency requirements
- Reducing GPU memory to fit larger models
- Maximizing throughput for batch inference
- Deploying to edge/mobile devices with constrained resources
- Optimizing cost through efficient hardware utilization

## Quick Reference

```bash
# Basic vLLM server
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 8192

# With quantization + speculative decoding
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --quantization awq \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 5}' \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9
```

## vLLM 0.14.x Key Features

| Feature | Benefit |
|---------|---------|
| **PagedAttention** | Up to 24x throughput via efficient KV-cache management |
| **Continuous Batching** | Dynamic request batching for maximum utilization |
| **CUDA Graphs** | Fast model execution with graph capture |
| **Tensor Parallelism** | Scale across multiple GPUs |
| **Prefix Caching** | Reuse KV cache for shared prefixes |
| **AttentionConfig** | New API replacing the VLLM_ATTENTION_BACKEND env var |
| **Semantic Router** | vLLM SR v0.1 "Iris" for intelligent LLM routing |

## Python vLLM Integration

```python
from vllm import LLM, SamplingParams

# Initialize with optimization flags
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization="awq",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    enable_prefix_caching=True,
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)

# Generate (example prompts)
prompts = ["Explain tensor parallelism in one paragraph."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

## Quantization Methods

| Method | Bits | Memory Savings | Speed | Quality |
|--------|------|----------------|-------|---------|
| FP16 | 16 | Baseline | Baseline | Best |
| INT8 | 8 | 50% | +10-20% | Very Good |
| AWQ | 4 | 75% | +20-40% | Good |
| GPTQ | 4 | 75% | +15-30% | Good |
| FP8 | 8 | 50% | +30-50% | Very Good |

**When to use each:**

- **FP16**: maximum quality when memory is sufficient
- **INT8/FP8**: balance of quality and efficiency
- **AWQ**: best 4-bit quality, activation-aware
- **GPTQ**: faster quantization, good quality

## Speculative Decoding

Accelerate generation by having a cheap predictor propose several tokens that the target model then verifies in a single forward pass:

```python
# N-gram based (no extra model)
speculative_config = {
    "method": "ngram",
    "num_speculative_tokens": 5,
    "prompt_lookup_max": 5,
    "prompt_lookup_min": 2,
}

# Draft model (higher quality)
speculative_config = {
    "method": "draft_model",
    "draft_model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 3,
}
```

**Expected gains**: 1.5-2.5x throughput for autoregressive tasks.
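A minimal sketch of wiring the n-gram config into the offline Python API. This assumes the `LLM` constructor accepts a `speculative_config` dict mirroring the `--speculative-config` CLI flag shown in the Quick Reference (as in recent vLLM releases); the model name and prompt are illustrative.

```python
from vllm import LLM, SamplingParams

# N-gram speculation needs no draft model, so it only adds a config dict.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 5,
        "prompt_lookup_min": 2,
    },
    gpu_memory_utilization=0.9,
)

outputs = llm.generate(
    ["Explain how PagedAttention reduces KV-cache fragmentation."],
    SamplingParams(temperature=0.0, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```

Speculation helps most on greedy or low-temperature decoding, where the target model accepts a larger share of the proposed tokens.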
## Key Decisions

| Decision | Recommendation |
|----------|----------------|
| **Quantization** | AWQ for 4-bit, FP8 for H100/H200 |
| **Batch size** | Dynamic via continuous batching |
| **GPU memory** | 0.85-0.95 utilization |
| **Parallelism** | Tensor parallel across GPUs |
| **KV cache** | Enable prefix caching for shared contexts |

## Common Mistakes

- Using GPTQ without calibration data (poor quality)
- Over-allocating GPU memory (OOM on peak loads)
- Skipping warmup requests (cold-start latency)
- Not benchmarking against actual workload patterns
- Mixing quantization with incompatible features

## Performance Benchmarking

```python
import time

from vllm import LLM, SamplingParams


def benchmark_throughput(llm, prompts, sampling_params, num_runs=3):
    """Benchmark generation throughput in tokens per second."""
    total_tokens = 0
    total_time = 0.0

    for _ in range(num_runs):
        start = time.perf_counter()
        outputs = llm.generate(prompts, sampling_params)
        elapsed = time.perf_counter() - start

        # Count only generated tokens, not prompt tokens
        tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
        total_tokens += tokens
        total_time += elapsed

    return total_tokens / total_time  # tokens/sec
```

## Advanced Patterns

See `references/` for:

- **vLLM Deployment**: PagedAttention, batching, production config
- **Quantization Guide**: AWQ, GPTQ, INT8, FP8 comparison
- **Speculative Decoding**: Draft models, n-gram, throughput tuning
- **Edge Deployment**: Mobile, resource-constrained optimization

## Related Skills

- `llm-streaming` - Streaming token responses
- `function-calling` - Tool use with inference
- `ollama-local` - Local inference with Ollama
- `prompt-caching` - Reduce redundant computation
- `semantic-caching` - Cache full responses

## Capability Details

### vllm-deployment

**Keywords:** vllm, inference server, deploy, serve, production

**Solves:**
- Deploy LLMs with vLLM for production
- Configure tensor parallelism and batching
- Optimize GPU memory utilization

### quantization

**Keywords:** quantize, AWQ, GPTQ, INT8, FP8, compress, reduce memory

**Solves:**
- Reduce model memory footprint
- Choose appropriate quantization method
- Maintain quality with lower precision

### speculative-decoding

**Keywords:** speculative, draft model, faster generation, predict tokens

**Solves:**
- Accelerate autoregressive generation
- Configure draft models or n-gram speculation
- Tune speculative token count

### edge-inference

**Keywords:** edge, mobile, embedded, constrained, optimization

**Solves:**
- Deploy on resource-constrained devices
- Optimize for mobile/edge hardware
- Balance quality and resource usage

### throughput-optimization

**Keywords:** throughput, latency, performance, benchmark, optimize

**Solves:**
- Maximize requests per second
- Reduce time to first token
- Benchmark and tune performance
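## Benchmark Usage Example

A usage sketch for the `benchmark_throughput` helper defined under Performance Benchmarking. The model name, prompt set, and batch size are illustrative placeholders, not tuned recommendations; substitute your own workload.

```python
from vllm import LLM, SamplingParams

# Illustrative setup: swap in the model and prompts from your actual workload.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.9,
)
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
prompts = ["Summarize the trade-offs between AWQ and FP8 quantization."] * 32

# Warm up once so graph capture and cache allocation do not skew timings
llm.generate(prompts[:1], sampling_params)

tokens_per_sec = benchmark_throughput(llm, prompts, sampling_params)
print(f"Throughput: {tokens_per_sec:.1f} tokens/sec")
```

Measuring against prompts representative of production traffic, after a warmup pass, avoids the cold-start and unrealistic-workload pitfalls listed under Common Mistakes.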