---
name: ml-inference-optimization
description: ML inference latency optimization, model compression, distillation, caching strategies, and edge deployment patterns. Use when optimizing inference performance, reducing model size, or deploying ML at the edge.
allowed-tools: Read, Glob, Grep
---

# ML Inference Optimization

## When to Use This Skill

Use this skill when:

- Optimizing ML inference latency
- Reducing model size for deployment
- Implementing model compression techniques
- Designing inference caching strategies
- Deploying models at the edge
- Balancing accuracy vs. latency trade-offs

**Keywords:** inference optimization, latency, model compression, distillation, pruning, quantization, caching, edge ML, TensorRT, ONNX, model serving, batching, hardware acceleration

## Inference Optimization Overview

```text
┌─────────────────────────────────────────────────────────────────────┐
│                 Inference Optimization Stack                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                    Model Level                                │  │
│  │  Distillation │ Pruning │ Quantization │ Architecture Search │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                              │                                      │
│                              ▼                                      │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                   Compiler Level                              │  │
│  │  Graph optimization │ Operator fusion │ Memory planning       │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                              │                                      │
│                              ▼                                      │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                  Runtime Level                                │  │
│  │  Batching │ Caching │ Async execution │ Multi-threading      │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                              │                                      │
│                              ▼                                      │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                  Hardware Level                               │  │
│  │  GPU │ TPU │ NPU │ CPU SIMD │ Custom accelerators            │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

## Model Compression Techniques

### Technique Overview

| Technique | Size Reduction | Speed Improvement | Accuracy Impact |
| --------- | -------------- | ----------------- | --------------- |
| **Quantization** | 2-4x | 2-4x | Low (1-2%) |
| **Pruning** | 2-10x | 1-3x | Low-Medium |
| **Distillation** | 3-10x | 3-10x | Medium |
| **Low-rank factorization** | 2-5x | 1.5-3x | Low-Medium |
| **Weight sharing** | 10-100x | Variable | Medium-High |

### Knowledge Distillation

```text
┌─────────────────────────────────────────────────────────────────────┐
│                    Knowledge Distillation                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐                                                   │
│  │ Teacher Model│ (Large, accurate, slow)                          │
│  │   GPT-4      │                                                   │
│  └──────────────┘                                                   │
│         │                                                           │
│         ▼ Soft labels (probability distributions)                   │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                    Training Process                           │  │
│  │  Loss = α × CrossEntropy(student, hard_labels)               │  │
│  │       + (1-α) × KL_Div(student, teacher_soft_labels)         │  │
│  └──────────────────────────────────────────────────────────────┘  │
│         │                                                           │
│         ▼                                                           │
│  ┌──────────────┐                                                   │
│  │Student Model │ (Small, nearly as accurate, fast)                │
│  │  DistilBERT  │                                                   │
│  └──────────────┘                                                   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

**Distillation Types:**

| Type | Description | Use Case |
| ---- | ----------- | -------- |
| **Response distillation** | Match teacher outputs | General compression |
| **Feature distillation** | Match intermediate layers | Better transfer |
| **Relation distillation** | Match sample relationships | Structured data |
| **Self-distillation** | Model teaches itself | Regularization |

### Pruning Strategies

```text
Unstructured Pruning (Weight-level):
Before: [0.1, 0.8, 0.2, 0.9, 0.05, 0.7]
After:  [0.0, 0.8, 0.0, 0.9, 0.0, 0.7]  (50% sparse)
• Flexible, high sparsity possible
• Needs sparse hardware/libraries

Structured Pruning (Channel/Layer-level):
Before: ┌───┬───┬───┬───┐
        │ C1│ C2│ C3│ C4│
        └───┴───┴───┴───┘
After:  ┌───┬───┬───┐
        │ C1│ C3│ C4│  (Removed C2 entirely)
        └───┴───┴───┘
• Works with standard hardware
• Lower compression ratio
```

**Pruning Decision Criteria:**

| Method | Description | Effectiveness |
| ------ | ----------- | ------------- |
| **Magnitude-based** | Remove smallest weights | Simple, effective |
| **Gradient-based** | Remove low-gradient weights | Better accuracy |
| **Second-order** | Use Hessian information | Best but expensive |
| **Lottery ticket** | Find winning subnetwork | Theoretical insight |

### Quantization (Detailed)

```text
Precision Hierarchy:

FP32 (32 bits): ████████████████████████████████
FP16 (16 bits): ████████████████
BF16 (16 bits): ████████████████  (different mantissa/exponent)
INT8 (8 bits):  ████████
INT4 (4 bits):  ████
Binary (1 bit): █

Memory and Compute Scale Proportionally
```

**Quantization Approaches:**

| Approach | When Applied | Quality | Effort |
| -------- | ------------ | ------- | ------ |
| **Dynamic quantization** | Runtime | Good | Low |
| **Static quantization** | Post-training with calibration | Better | Medium |
| **QAT** | During training | Best | High |

## Compiler-Level Optimization

### Graph Optimization

```text
Original Graph:
Input → Conv → BatchNorm → ReLU → Conv → BatchNorm → ReLU → Output

Optimized Graph (Operator Fusion):
Input → FusedConvBNReLU → FusedConvBNReLU → Output

Benefits:
• Fewer kernel launches
• Better memory locality
• Reduced memory bandwidth
```

### Common Optimizations

| Optimization | Description | Speedup |
| ------------ | ----------- | ------- |
| **Operator fusion** | Combine sequential ops | 1.2-2x |
| **Constant folding** | Pre-compute constants | 1.1-1.5x |
| **Dead code elimination** | Remove unused ops | Variable |
| **Layout optimization** | Optimize tensor memory layout | 1.1-1.3x |
| **Memory planning** | Optimize buffer allocation | 1.1-1.2x |

### Optimization Frameworks

| Framework | Vendor | Best For |
| --------- | ------ | -------- |
| **TensorRT** | NVIDIA | NVIDIA GPUs, lowest latency |
| **ONNX Runtime** | Microsoft | Cross-platform, broad support |
| **OpenVINO** | Intel | Intel CPUs/GPUs |
| **Core ML** | Apple | Apple devices |
| **TFLite** | Google | Mobile, embedded |
| **Apache TVM** | Open source | Custom hardware, research |

## Runtime Optimization

### Batching Strategies

```text
No Batching:
Request 1: [Process] → Response 1      10ms
Request 2: [Process] → Response 2      10ms
Request 3: [Process] → Response 3      10ms
Total: 30ms, GPU underutilized

Dynamic Batching:
Requests 1-3: [Wait 5ms] → [Process batch] → Responses
Total: 15ms, 2x throughput

Trade-off: Latency vs. Throughput
• Larger batch: Higher throughput, higher latency
• Smaller batch: Lower latency, lower throughput
```

**Batching Parameters:**

| Parameter | Description | Trade-off |
| --------- | ----------- | --------- |
| `batch_size` | Maximum batch size | Throughput vs. latency |
| `max_wait_time` | Wait time for batch fill | Latency vs. efficiency |
| `min_batch_size` | Minimum before processing | Latency predictability |

### Caching Strategies

```text
┌─────────────────────────────────────────────────────────────────────┐
│                    Inference Caching Layers                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Layer 1: Input Cache                                               │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache exact inputs → Return cached outputs                   │   │
│  │ Hit rate: Low (inputs rarely repeat exactly)                 │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Layer 2: Embedding Cache                                           │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache computed embeddings for repeated tokens/entities       │   │
│  │ Hit rate: Medium (common tokens repeat)                      │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Layer 3: KV Cache (for transformers)                               │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache key-value pairs for attention                          │   │
│  │ Hit rate: High (reuse across tokens in sequence)             │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Layer 4: Result Cache                                              │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache semantic equivalents (fuzzy matching)                  │   │
│  │ Hit rate: Variable (depends on query distribution)           │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

**Semantic Caching for LLMs:**

```text
Query: "What's the capital of France?"
       ↓
Hash + Embed query
       ↓
Search cache (similarity > threshold)
       ↓
├── Hit: Return cached response
└── Miss: Generate → Cache → Return
```

### Async and Parallel Execution

```text
Sequential:
┌─────┐ ┌─────┐ ┌─────┐
│Prep │→│Model│→│Post │  Total: 30ms
│10ms │ │15ms │ │5ms  │
└─────┘ └─────┘ └─────┘

Pipelined:
Request 1: │Prep│Model│Post│
Request 2:      │Prep│Model│Post│
Request 3:           │Prep│Model│Post│

Throughput: 3x higher
Latency per request: Same
```

## Hardware Acceleration

### Hardware Comparison

| Hardware | Strengths | Limitations | Best For |
| -------- | --------- | ----------- | -------- |
| **GPU (NVIDIA)** | High parallelism, mature ecosystem | Power, cost | Training, large batch inference |
| **TPU (Google)** | Matrix ops, cloud integration | Vendor lock-in | Google Cloud workloads |
| **NPU (Apple/Qualcomm)** | Power efficient, on-device | Limited models | Mobile, edge |
| **CPU** | Flexible, available | Slower for ML | Low-batch, CPU-bound |
| **FPGA** | Customizable, low latency | Development complexity | Specialized workloads |

### GPU Optimization

| Optimization | Description | Impact |
| ------------ | ----------- | ------ |
| **Tensor Cores** | Use FP16/INT8 tensor operations | 2-8x speedup |
| **CUDA graphs** | Reduce kernel launch overhead | 1.5-2x for small models |
| **Multi-stream** | Parallel execution | Higher throughput |
| **Memory pooling** | Reduce allocation overhead | Lower latency variance |

## Edge Deployment

### Edge Constraints

```text
┌─────────────────────────────────────────────────────────────────────┐
│                      Edge Deployment Constraints                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Resource Constraints:                                              │
│  ├── Memory: 1-4 GB (vs. 64+ GB cloud)                             │
│  ├── Compute: 1-10 TOPS (vs. 100+ TFLOPS cloud)                    │
│  ├── Power: 5-15W (vs. 300W+ cloud)                                │
│  └── Storage: 16-128 GB (vs. TB cloud)                             │
│                                                                     │
│  Operational Constraints:                                           │
│  ├── No network (offline operation)                                 │
│  ├── Variable ambient conditions                                    │
│  ├── Infrequent updates                                            │
│  └── Long deployment lifetime                                       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### Edge Optimization Strategies

| Strategy | Description | Use When |
| -------- | ----------- | -------- |
| **Model selection** | Use edge-native models (MobileNet, EfficientNet) | Accuracy acceptable |
| **Aggressive quantization** | INT8 or lower | Memory/power constrained |
| **On-device distillation** | Distill to tiny model | Extreme constraints |
| **Split inference** | Edge preprocessing, cloud inference | Network available |
| **Model caching** | Cache results locally | Repeated queries |

### Edge ML Frameworks

| Framework | Platform | Features |
| --------- | -------- | -------- |
| **TensorFlow Lite** | Android, iOS, embedded | Quantization, delegates |
| **Core ML** | iOS, macOS | Neural Engine optimization |
| **ONNX Runtime Mobile** | Cross-platform | Broad model support |
| **PyTorch Mobile** | Android, iOS | Familiar API |
| **TensorRT** | NVIDIA Jetson | Maximum performance |

## Latency Profiling

### Profiling Methodology

```text
┌─────────────────────────────────────────────────────────────────────┐
│                    Latency Breakdown Analysis                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. Data Loading:          ████████░░░░░░░░░░  15%                 │
│  2. Preprocessing:         ██████░░░░░░░░░░░░  10%                 │
│  3. Model Inference:       ████████████████░░  60%                 │
│  4. Postprocessing:        ████░░░░░░░░░░░░░░   8%                 │
│  5. Response Serialization:███░░░░░░░░░░░░░░░   7%                 │
│                                                                     │
│  Target: Model inference (60% = biggest optimization opportunity)  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### Profiling Tools

| Tool | Use For |
| ---- | ------- |
| **PyTorch Profiler** | PyTorch model profiling |
| **TensorBoard** | TensorFlow visualization |
| **NVIDIA Nsight** | GPU profiling |
| **Chrome Tracing** | General timeline visualization |
| **perf** | CPU profiling |

### Key Metrics

| Metric | Description | Target |
| ------ | ----------- | ------ |
| **P50 latency** | Median latency | < SLA |
| **P99 latency** | Tail latency | < 2x P50 |
| **Throughput** | Requests/second | Meet demand |
| **GPU utilization** | Compute usage | > 80% |
| **Memory bandwidth** | Memory usage | < limit |

## Optimization Workflow

### Systematic Approach

```text
┌─────────────────────────────────────────────────────────────────────┐
│                  Optimization Workflow                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. Baseline                                                        │
│     └── Measure current performance (latency, throughput, accuracy) │
│                                                                     │
│  2. Profile                                                         │
│     └── Identify bottlenecks (model, data, system)                  │
│                                                                     │
│  3. Optimize (in order of effort/impact):                           │
│     ├── Hardware: Use right accelerator                             │
│     ├── Compiler: Enable optimizations (TensorRT, ONNX)            │
│     ├── Runtime: Batching, caching, async                          │
│     ├── Model: Quantization, pruning                                │
│     └── Architecture: Distillation, model change                    │
│                                                                     │
│  4. Validate                                                        │
│     └── Verify accuracy maintained, latency improved                │
│                                                                     │
│  5. Deploy and Monitor                                              │
│     └── Track real-world performance                                │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### Optimization Priority Matrix

```text
                    High Impact
                         │
    Compiler Opts    ────┼──── Quantization
    (easy win)           │     (best ROI)
                         │
Low Effort ──────────────┼──────────────── High Effort
                         │
    Batching         ────┼──── Distillation
    (quick win)          │     (major effort)
                         │
                    Low Impact
```

## Common Patterns

### Multi-Model Serving

```text
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  Request → ┌─────────┐                                              │
│            │ Router  │                                              │
│            └─────────┘                                              │
│               │   │   │                                             │
│      ┌────────┘   │   └────────┐                                    │
│      ▼            ▼            ▼                                    │
│  ┌───────┐   ┌───────┐   ┌───────┐                                 │
│  │ Tiny  │   │ Small │   │ Large │                                 │
│  │ <10ms │   │ <50ms │   │<500ms │                                 │
│  └───────┘   └───────┘   └───────┘                                 │
│                                                                     │
│  Routing strategies:                                                │
│  • Complexity-based: Simple→Tiny, Complex→Large                    │
│  • Confidence-based: Try Tiny, escalate if low confidence          │
│  • SLA-based: Route based on latency requirements                  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### Speculative Execution

```text
Query: "Translate: Hello"
        │
        ├──▶ Small model (draft): "Bonjour" (5ms)
        │
        └──▶ Large model (verify): Check "Bonjour" (10ms parallel)
             │
             ├── Accept: Return immediately
             └── Reject: Generate with large model

Speedup: 2-3x when drafts are often accepted
```

### Cascade Models

```text
Input → ┌────────┐
        │ Filter │ ← Cheap filter (reject obvious negatives)
        └────────┘
             │ (candidates only)
             ▼
        ┌────────┐
        │ Stage 1│ ← Fast model (coarse ranking)
        └────────┘
             │ (top-100)
             ▼
        ┌────────┐
        │ Stage 2│ ← Accurate model (fine ranking)
        └────────┘
             │ (top-10)
             ▼
         Output

Benefit: 10x cheaper, similar accuracy
```

## Optimization Checklist

### Pre-Deployment

- [ ] Profile baseline performance
- [ ] Identify primary bottleneck (model, data, system)
- [ ] Apply compiler optimizations (TensorRT, ONNX)
- [ ] Evaluate quantization (INT8 usually safe)
- [ ] Tune batch size for target throughput
- [ ] Test accuracy after optimization

### Deployment

- [ ] Configure appropriate hardware
- [ ] Enable caching where applicable
- [ ] Set up monitoring (latency, throughput, errors)
- [ ] Configure auto-scaling policies
- [ ] Implement graceful degradation

### Post-Deployment

- [ ] Monitor p99 latency
- [ ] Track accuracy metrics
- [ ] Analyze cache hit rates
- [ ] Review cost efficiency
- [ ] Plan iterative improvements

## Related Skills

- `llm-serving-patterns` - LLM-specific serving optimization
- `ml-system-design` - End-to-end ML pipeline design
- `quality-attributes-taxonomy` - Performance as quality attribute
- `estimation-techniques` - Capacity planning for ML systems

## Version History

- v1.0.0 (2025-12-26): Initial release - ML inference optimization patterns

---

## Last Updated

**Date:** 2025-12-26