---
name: ml-inference-optimization
description: ML inference latency optimization, model compression, distillation, caching strategies, and edge deployment patterns. Use when optimizing inference performance, reducing model size, or deploying ML at the edge.
allowed-tools: Read, Glob, Grep
---

# ML Inference Optimization

## When to Use This Skill

Use this skill when:

- Optimizing ML inference latency
- Reducing model size for deployment
- Implementing model compression techniques
- Designing inference caching strategies
- Deploying models at the edge
- Balancing accuracy vs. latency trade-offs

**Keywords:** inference optimization, latency, model compression, distillation, pruning, quantization, caching, edge ML, TensorRT, ONNX, model serving, batching, hardware acceleration

## Inference Optimization Overview

```text
┌─────────────────────────────────────────────────────────────────────┐
│                    Inference Optimization Stack                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │                         Model Level                          │   │
│  │  Distillation │ Pruning │ Quantization │ Architecture Search │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                  │                                  │
│                                  ▼                                  │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │                        Compiler Level                        │   │
│  │    Graph optimization │ Operator fusion │ Memory planning    │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                  │                                  │
│                                  ▼                                  │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │                        Runtime Level                         │   │
│  │    Batching │ Caching │ Async execution │ Multi-threading    │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                  │                                  │
│                                  ▼                                  │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │                        Hardware Level                        │   │
│  │       GPU │ TPU │ NPU │ CPU SIMD │ Custom accelerators       │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

## Model Compression Techniques

### Technique Overview

| Technique | Size Reduction | Speed Improvement | Accuracy Impact |
| --------- | -------------- | ----------------- | --------------- |
| **Quantization** | 2-4x | 2-4x | Low (1-2%) |
| **Pruning** | 2-10x | 1-3x | Low-Medium |
| **Distillation** | 3-10x | 3-10x | Medium |
| **Low-rank factorization** | 2-5x | 1.5-3x | Low-Medium |
| **Weight sharing** | 10-100x | Variable | Medium-High |

### Knowledge Distillation

```text
┌─────────────────────────────────────────────────────────────────────┐
│                       Knowledge Distillation                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐                                                   │
│  │ Teacher Model│  (Large, accurate, slow)                          │
│  │     BERT     │                                                   │
│  └──────────────┘                                                   │
│         │                                                           │
│         ▼ Soft labels (probability distributions)                   │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │                       Training Process                       │   │
│  │  Loss = α × CrossEntropy(student, hard_labels)               │   │
│  │       + (1-α) × KL_Div(student, teacher_soft_labels)         │   │
│  └──────────────────────────────────────────────────────────────┘   │
│         │                                                           │
│         ▼                                                           │
│  ┌──────────────┐                                                   │
│  │Student Model │  (Small, nearly as accurate, fast)                │
│  │  DistilBERT  │                                                   │
│  └──────────────┘                                                   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

**Distillation Types:**

| Type | Description | Use Case |
| ---- | ----------- | -------- |
| **Response distillation** | Match teacher outputs | General compression |
| **Feature distillation** | Match intermediate layers | Better transfer |
| **Relation distillation** | Match sample relationships | Structured data |
| **Self-distillation** | Model teaches itself | Regularization |

### Pruning Strategies

```text
Unstructured Pruning (Weight-level):
  Before: [0.1, 0.8, 0.2, 0.9, 0.05, 0.7]
  After:  [0.0, 0.8, 0.0, 0.9, 0.0, 0.7]  (50% sparse)
  • Flexible, high sparsity possible
  • Needs sparse hardware/libraries

Structured
Pruning (Channel/Layer-level):
  Before: ┌───┬───┬───┬───┐
          │ C1│ C2│ C3│ C4│
          └───┴───┴───┴───┘
  After:  ┌───┬───┬───┐
          │ C1│ C3│ C4│  (Removed C2 entirely)
          └───┴───┴───┘
  • Works with standard hardware
  • Lower compression ratio
```

**Pruning Decision Criteria:**

| Method | Description | Effectiveness |
| ------ | ----------- | ------------- |
| **Magnitude-based** | Remove smallest weights | Simple, effective |
| **Gradient-based** | Remove low-gradient weights | Better accuracy |
| **Second-order** | Use Hessian information | Best but expensive |
| **Lottery ticket** | Find winning subnetwork | Theoretical insight |

### Quantization (Detailed)

```text
Precision Hierarchy:

FP32 (32 bits):  ████████████████████████████████
FP16 (16 bits):  ████████████████
BF16 (16 bits):  ████████████████  (different mantissa/exponent)
INT8 (8 bits):   ████████
INT4 (4 bits):   ████
Binary (1 bit):  █

Memory and Compute Scale Proportionally
```

**Quantization Approaches:**

| Approach | When Applied | Quality | Effort |
| -------- | ------------ | ------- | ------ |
| **Dynamic quantization** | Runtime | Good | Low |
| **Static quantization** | Post-training with calibration | Better | Medium |
| **QAT** | During training | Best | High |

## Compiler-Level Optimization

### Graph Optimization

```text
Original Graph:
  Input → Conv → BatchNorm → ReLU → Conv → BatchNorm → ReLU → Output

Optimized Graph (Operator Fusion):
  Input → FusedConvBNReLU → FusedConvBNReLU → Output

Benefits:
  • Fewer kernel launches
  • Better memory locality
  • Reduced memory bandwidth
```

### Common Optimizations

| Optimization | Description | Speedup |
| ------------ | ----------- | ------- |
| **Operator fusion** | Combine sequential ops | 1.2-2x |
| **Constant folding** | Pre-compute constants | 1.1-1.5x |
| **Dead code elimination** | Remove unused ops | Variable |
| **Layout optimization** | Optimize tensor memory layout | 1.1-1.3x |
| **Memory planning** | Optimize buffer allocation | 1.1-1.2x |

### Optimization Frameworks

|
Framework | Vendor | Best For |
| --------- | ------ | -------- |
| **TensorRT** | NVIDIA | NVIDIA GPUs, lowest latency |
| **ONNX Runtime** | Microsoft | Cross-platform, broad support |
| **OpenVINO** | Intel | Intel CPUs/GPUs |
| **Core ML** | Apple | Apple devices |
| **TFLite** | Google | Mobile, embedded |
| **Apache TVM** | Open source | Custom hardware, research |

## Runtime Optimization

### Batching Strategies

```text
No Batching:
  Request 1: [Process] → Response 1     10ms
  Request 2: [Process] → Response 2     10ms
  Request 3: [Process] → Response 3     10ms
  Total: 30ms, GPU underutilized

Dynamic Batching:
  Requests 1-3: [Wait 5ms] → [Process batch] → Responses
  Total: 15ms, 2x throughput

Trade-off: Latency vs. Throughput
  • Larger batch: Higher throughput, higher latency
  • Smaller batch: Lower latency, lower throughput
```

**Batching Parameters:**

| Parameter | Description | Trade-off |
| --------- | ----------- | --------- |
| `batch_size` | Maximum batch size | Throughput vs. latency |
| `max_wait_time` | Wait time for batch fill | Latency vs.
efficiency |
| `min_batch_size` | Minimum before processing | Latency predictability |

### Caching Strategies

```text
┌─────────────────────────────────────────────────────────────────────┐
│                      Inference Caching Layers                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Layer 1: Input Cache                                               │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  Cache exact inputs → Return cached outputs                 │    │
│  │  Hit rate: Low (inputs rarely repeat exactly)               │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                     │
│  Layer 2: Embedding Cache                                           │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  Cache computed embeddings for repeated tokens/entities     │    │
│  │  Hit rate: Medium (common tokens repeat)                    │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                     │
│  Layer 3: KV Cache (for transformers)                               │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  Cache key-value pairs for attention                        │    │
│  │  Hit rate: High (reuse across tokens in sequence)           │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                     │
│  Layer 4: Result Cache                                              │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  Cache semantic equivalents (fuzzy matching)                │    │
│  │  Hit rate: Variable (depends on query distribution)         │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

**Semantic Caching for LLMs:**

```text
Query: "What's the capital of France?"
    ↓
Hash + Embed query
    ↓
Search cache (similarity > threshold)
    ↓
├── Hit: Return cached response
└── Miss: Generate → Cache → Return
```

### Async and Parallel Execution

```text
Sequential:
  ┌─────┐ ┌─────┐ ┌─────┐
  │Prep │→│Model│→│Post │  Total: 30ms
  │10ms │ │15ms │ │5ms  │
  └─────┘ └─────┘ └─────┘

Pipelined:
  Request 1: │Prep│Model│Post│
  Request 2:      │Prep│Model│Post│
  Request 3:           │Prep│Model│Post│

  Throughput: up to 2x higher here (bounded by the slowest stage, 15ms)
  Latency per request: Same
```

## Hardware Acceleration

### Hardware Comparison

| Hardware | Strengths | Limitations | Best For |
| -------- | --------- | ----------- | -------- |
| **GPU (NVIDIA)** | High parallelism, mature ecosystem | Power, cost | Training, large batch inference |
| **TPU (Google)** | Matrix ops, cloud integration | Vendor lock-in | Google Cloud workloads |
| **NPU (Apple/Qualcomm)** | Power efficient, on-device | Limited models | Mobile, edge |
| **CPU** | Flexible, available | Slower for ML | Low-batch, CPU-bound |
| **FPGA** | Customizable, low latency | Development complexity | Specialized workloads |

### GPU Optimization

| Optimization | Description | Impact |
| ------------ | ----------- | ------ |
| **Tensor Cores** | Use FP16/INT8 tensor operations | 2-8x speedup |
| **CUDA graphs** | Reduce kernel launch overhead | 1.5-2x for small models |
| **Multi-stream** | Parallel execution | Higher throughput |
| **Memory pooling** | Reduce allocation overhead | Lower latency variance |

## Edge Deployment

### Edge Constraints

```text
┌─────────────────────────────────────────────────────────────────────┐
│                     Edge Deployment Constraints                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Resource Constraints:                                              │
│  ├── Memory: 1-4 GB (vs. 64+ GB cloud)                              │
│  ├── Compute: 1-10 TOPS (vs. 100+ TFLOPS cloud)                     │
│  ├── Power: 5-15W (vs. 300W+ cloud)                                 │
│  └── Storage: 16-128 GB (vs.
TB cloud)                             │
│                                                                     │
│  Operational Constraints:                                           │
│  ├── No network (offline operation)                                 │
│  ├── Variable ambient conditions                                    │
│  ├── Infrequent updates                                             │
│  └── Long deployment lifetime                                       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### Edge Optimization Strategies

| Strategy | Description | Use When |
| -------- | ----------- | -------- |
| **Model selection** | Use edge-native models (MobileNet, EfficientNet) | Accuracy acceptable |
| **Aggressive quantization** | INT8 or lower | Memory/power constrained |
| **On-device distillation** | Distill to tiny model | Extreme constraints |
| **Split inference** | Edge preprocessing, cloud inference | Network available |
| **Model caching** | Cache results locally | Repeated queries |

### Edge ML Frameworks

| Framework | Platform | Features |
| --------- | -------- | -------- |
| **TensorFlow Lite** | Android, iOS, embedded | Quantization, delegates |
| **Core ML** | iOS, macOS | Neural Engine optimization |
| **ONNX Runtime Mobile** | Cross-platform | Broad model support |
| **PyTorch Mobile** | Android, iOS | Familiar API |
| **TensorRT** | NVIDIA Jetson | Maximum performance |

## Latency Profiling

### Profiling Methodology

```text
┌─────────────────────────────────────────────────────────────────────┐
│                     Latency Breakdown Analysis                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. Data Loading:          ████████░░░░░░░░░░  15%                  │
│  2. Preprocessing:         ██████░░░░░░░░░░░░  10%                  │
│  3. Model Inference:       ████████████████░░  60%                  │
│  4. Postprocessing:        ████░░░░░░░░░░░░░░   8%                  │
│  5.
Response Serialization:███░░░░░░░░░░░░░░░   7%                  │
│                                                                     │
│  Target: Model inference (60% = biggest optimization opportunity)   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### Profiling Tools

| Tool | Use For |
| ---- | ------- |
| **PyTorch Profiler** | PyTorch model profiling |
| **TensorBoard** | TensorFlow visualization |
| **NVIDIA Nsight** | GPU profiling |
| **Chrome Tracing** | General timeline visualization |
| **perf** | CPU profiling |

### Key Metrics

| Metric | Description | Target |
| ------ | ----------- | ------ |
| **P50 latency** | Median latency | < SLA |
| **P99 latency** | Tail latency | < 2x P50 |
| **Throughput** | Requests/second | Meet demand |
| **GPU utilization** | Compute usage | > 80% |
| **Memory bandwidth** | Memory throughput used | < hardware limit |

## Optimization Workflow

### Systematic Approach

```text
┌─────────────────────────────────────────────────────────────────────┐
│                        Optimization Workflow                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. Baseline                                                        │
│     └── Measure current performance (latency, throughput, accuracy) │
│                                                                     │
│  2. Profile                                                         │
│     └── Identify bottlenecks (model, data, system)                  │
│                                                                     │
│  3. Optimize (in order of effort/impact):                           │
│     ├── Hardware: Use right accelerator                             │
│     ├── Compiler: Enable optimizations (TensorRT, ONNX Runtime)     │
│     ├── Runtime: Batching, caching, async                           │
│     ├── Model: Quantization, pruning                                │
│     └── Architecture: Distillation, model change                    │
│                                                                     │
│  4. Validate                                                        │
│     └── Verify accuracy maintained, latency improved                │
│                                                                     │
│  5.
Deploy and Monitor                                              │
│     └── Track real-world performance                                │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### Optimization Priority Matrix

```text
                    High Impact
                         │
    Compiler Opts    ────┼──── Quantization
    (easy win)           │     (best ROI)
                         │
Low Effort ──────────────┼──────────────── High Effort
                         │
    Batching         ────┼──── Distillation
    (quick win)          │     (major effort)
                         │
                    Low Impact
```

## Common Patterns

### Multi-Model Serving

```text
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  Request → ┌─────────┐                                              │
│            │ Router  │                                              │
│            └─────────┘                                              │
│               │ │ │                                                 │
│      ┌────────┘ │ └────────┐                                        │
│      ▼          ▼          ▼                                        │
│  ┌───────┐  ┌───────┐  ┌───────┐                                    │
│  │ Tiny  │  │ Small │  │ Large │                                    │
│  │ <10ms │  │ <50ms │  │<500ms │                                    │
│  └───────┘  └───────┘  └───────┘                                    │
│                                                                     │
│  Routing strategies:                                                │
│  • Complexity-based: Simple→Tiny, Complex→Large                     │
│  • Confidence-based: Try Tiny, escalate if low confidence           │
│  • SLA-based: Route based on latency requirements                   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### Speculative Execution

```text
Query: "Translate: Hello"
        │
        ├──▶ Small model (draft): "Bonjour" (5ms)
        │
        └──▶ Large model (verify): Check "Bonjour" (10ms parallel)
                │
                ├── Accept: Return immediately
                └── Reject: Generate with large model

Speedup: 2-3x when drafts are often accepted
```

### Cascade Models

```text
Input → ┌────────┐
        │ Filter │  ← Cheap filter (reject obvious negatives)
        └────────┘
             │ (candidates only)
             ▼
        ┌────────┐
        │ Stage 1│  ← Fast model (coarse ranking)
        └────────┘
             │ (top-100)
             ▼
        ┌────────┐
        │ Stage 2│  ← Accurate model (fine ranking)
        └────────┘
             │ (top-10)
             ▼
          Output

Benefit: 10x cheaper, similar accuracy
```

## Optimization Checklist

### Pre-Deployment

- [ ] Profile baseline performance
- [ ] Identify primary bottleneck (model, data, system)
- [ ] Apply compiler optimizations (TensorRT, ONNX Runtime)
- [ ] Evaluate quantization (INT8 usually safe)
- [ ] Tune batch size for target throughput
- [ ] Test accuracy after optimization

### Deployment

- [ ]
Configure appropriate hardware
- [ ] Enable caching where applicable
- [ ] Set up monitoring (latency, throughput, errors)
- [ ] Configure auto-scaling policies
- [ ] Implement graceful degradation

### Post-Deployment

- [ ] Monitor p99 latency
- [ ] Track accuracy metrics
- [ ] Analyze cache hit rates
- [ ] Review cost efficiency
- [ ] Plan iterative improvements

## Related Skills

- `llm-serving-patterns` - LLM-specific serving optimization
- `ml-system-design` - End-to-end ML pipeline design
- `quality-attributes-taxonomy` - Performance as quality attribute
- `estimation-techniques` - Capacity planning for ML systems

## Version History

- v1.0.0 (2025-12-26): Initial release - ML inference optimization patterns

---

## Last Updated

**Date:** 2025-12-26
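
## Appendix: Magnitude-Based Pruning Sketch

The unstructured pruning illustration under Pruning Strategies zeroes the three smallest of six weights to reach 50% sparsity. A minimal sketch of that magnitude-based rule in plain Python follows. The function name `magnitude_prune` and its `sparsity` parameter are hypothetical and chosen for illustration only; a real pipeline would more likely rely on a framework's pruning utilities and fine-tune afterwards.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights (illustrative helper, not a library API).

    `sparsity` is the fraction of weights to set to zero, in [0, 1].
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Rank positions by absolute value, smallest first
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    # Keep every weight except the n_prune smallest ones
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


# Same weights as the "Before" row in the pruning illustration
weights = [0.1, 0.8, 0.2, 0.9, 0.05, 0.7]
print(magnitude_prune(weights, sparsity=0.5))  # [0.0, 0.8, 0.0, 0.9, 0.0, 0.7]
```

The result matches the "After" row of the illustration. The zeros save memory or compute only when the runtime or hardware exploits sparsity, which is the caveat the illustration notes for unstructured pruning.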