---
name: llm-inference-batching-scheduler
description: Guidance for implementing batching schedulers for LLM inference systems with compilation-based accelerators. This skill applies when optimizing request batching to minimize cost while meeting latency thresholds, particularly when dealing with shape compilation costs, padding overhead, and multi-bucket request distributions. Use this skill for tasks involving batch planning, shape selection, generation-length bucketing, and cost-model-driven optimization for neural network inference.
---

# LLM Inference Batching Scheduler

This skill provides systematic approaches for designing batching schedulers that optimize LLM inference workloads on compilation-based accelerators (TPUs, custom ASICs). The core challenge is balancing competing objectives: minimizing compilation cost (fewer shapes), reducing padding waste (tighter batches), and meeting latency thresholds (P95, P99 constraints).

## When to Apply This Skill

Apply this skill when:

- Designing batch schedulers for LLM inference with shape compilation constraints
- Optimizing request packing to minimize padding overhead
- Balancing cost metrics against latency thresholds
- Working with generation-length bucketing strategies
- Implementing plan files that assign requests to batches with specific shapes

## Core Concepts

### Shape Compilation Cost Model

Compilation-based accelerators require pre-compiled shapes. Each unique (prompt_length, generation_length) shape incurs a one-time compilation cost. The total cost typically follows:

```
total_cost = per_token_cost × total_tokens + compilation_cost × num_shapes²
```

The quadratic term in `num_shapes` creates strong pressure to minimize unique shapes, while the per-token cost penalizes excessive padding.

### Padding Analysis Framework

Before implementing, calculate padding budgets mathematically:

1. **Identify the threshold constraints**: Extract the maximum allowed cost, pad_ratio, and latency percentiles
2. **Calculate the current baseline**: Sum actual tokens across all requests
3. **Derive the padding budget**: `max_padded_tokens = actual_tokens / (1 - max_pad_ratio)`
4. **Compute allowable padding**: `padding_budget = max_padded_tokens - actual_tokens`

This budget constrains the maximum generation bucket size.

### Generation-Length Bucketing

Requests are grouped by generation length into buckets. The bucket size directly affects padding:

- **Smaller buckets**: Less padding waste, but more batches (higher P95 latency risk)
- **Larger buckets**: More padding waste, but fewer batches (better latency)

To derive an optimal bucket size:

```
optimal_bucket_size ≈ padding_budget / num_requests_in_worst_bucket
```

## Systematic Approach

### Phase 1: Mathematical Analysis (Before Any Code)

1. **Parse the cost model completely**
   - Identify all cost components and their weights
   - Understand how shape count affects compilation cost
   - Map the latency calculation formulas
2. **Analyze the request distribution**
   - Compute statistics: prompt length distribution, generation length distribution
   - Identify the prompt shapes required to cover all requests (the maximum prompt length determines the minimum shape)
   - Calculate baseline token counts per bucket
3. **Derive parameter bounds from constraints**
   - From the pad_ratio threshold: calculate the maximum padding tokens allowed
   - From the cost threshold: calculate the maximum compilation overhead
   - From the latency thresholds: estimate the implied maximum batch count
4. **Document derived constraints**
   - Write down: "Max padding = X tokens", "Max shapes = Y", "Gen bucket size must be ≤ Z"
   - A sketch of this derivation appears after this list
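To make the Phase 1 derivation concrete, here is a minimal Python sketch that turns the thresholds into hard bounds before any scheduler code is written. The names (`Thresholds`, `derive_bounds`) are illustrative assumptions, not a prescribed API, and it reads `total_tokens` in the cost model as padded tokens, consistent with the discussion above:

```python
# Sketch only: the names and bound derivations here are illustrative
# assumptions, not a real scheduler API.
from dataclasses import dataclass

@dataclass
class Thresholds:
    max_cost: float
    max_pad_ratio: float  # e.g. 0.15 allows at most 15% of tokens to be padding

def derive_bounds(prompt_lengths, gen_lengths, t: Thresholds,
                  per_token_cost: float, compilation_cost: float) -> dict:
    """Derive hard parameter bounds from the thresholds (Phase 1, steps 3-4)."""
    actual_tokens = sum(prompt_lengths) + sum(gen_lengths)

    # pad_ratio = 1 - actual/padded  =>  padded_max = actual / (1 - max_pad_ratio)
    max_padded_tokens = actual_tokens / (1 - t.max_pad_ratio)
    padding_budget = max_padded_tokens - actual_tokens

    # With zero padding the token term is per_token_cost * actual_tokens;
    # the remaining headroom bounds the quadratic compilation term.
    headroom = t.max_cost - per_token_cost * actual_tokens
    max_shapes = int((headroom / compilation_cost) ** 0.5) if headroom > 0 else 0

    return {
        "actual_tokens": actual_tokens,
        "padding_budget": padding_budget,        # "Max padding = X tokens"
        "max_shapes": max_shapes,                # "Max shapes = Y"
        "min_covering_prompt_dim": max(prompt_lengths),
    }
```

Reading these numbers off before searching turns the later phases into bounded searches rather than open-ended tuning.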
### Phase 2: Systematic Parameter Search

Instead of ad-hoc trial and error, implement a structured search:

1. **Define the parameter space**
   - Generation bucket sizes: a range based on the Phase 1 analysis
   - Shape configurations: enumerate valid combinations covering all prompt lengths
2. **Implement reusable evaluation**
   - Create a standalone validation function that computes all metrics
   - Return structured results: `{cost, pad_ratio, p95_latency, p99_latency, valid: bool}`
3. **Search systematically**
   - Grid search over small parameter spaces
   - Binary search when optimizing a single parameter
   - Track all configurations and results

### Phase 3: Implementation with Invariant Checking

1. **Critical invariants to validate continuously**
   - All prompt lengths covered: `max(prompt_lengths) ≤ max(shape_prompt_dims)`
   - All requests assigned: `sum(batch_sizes) == total_requests`
   - Shape count within limit: `len(unique_shapes) ≤ max_shapes`
2. **Build validation into the workflow**
   - Check invariants after every modification
   - Fail fast when an invariant breaks

## Common Pitfalls and Mitigations

### Pitfall 1: Removing Required Shapes

**Symptom**: Assertion errors about uncovered prompt lengths after optimizing shape count.

**Cause**: When reducing shapes to cut compilation cost, accidentally removing shapes needed for long prompts.

**Prevention**:

- Before removing any shape, verify that the remaining shapes still cover every prompt: `max(remaining_shape_prompt_dims) >= max(all_prompt_lengths)`
- Always keep at least one shape covering the maximum prompt length
- Implement a `required_shapes()` function that returns the non-removable shapes

### Pitfall 2: Empirical Parameter Selection Without Mathematical Justification

**Symptom**: Cycling through bucket sizes (20, 21, 22, ...) without clear rationale, losing track of the best configurations.

**Cause**: Not computing optimal parameters from constraints first.

**Prevention**:

- Calculate bounds analytically before testing
- If bucket_size=21 works but 20 doesn't, understand why mathematically
- Log all tested configurations with their results

### Pitfall 3: Optimizing One Metric While Breaking Others

**Symptom**: Reducing pad_ratio but exceeding the cost threshold, or vice versa.

**Cause**: The metrics are interconnected: fewer shapes increases padding, and smaller buckets increase latency.

**Prevention**:

- Always evaluate ALL metrics after each change
- Create a multi-objective feasibility check
- Understand the trade-off relationships before optimizing

### Pitfall 4: Inconsistent Configuration Tracking

**Symptom**: Re-testing configurations or forgetting which parameters achieved which results.

**Prevention**:

- Maintain a results log: `{config_hash: {params: {...}, metrics: {...}}}`
- Before testing, check whether a configuration was already evaluated
- Keep "best so far" state updated

### Pitfall 5: Late Validation of Structural Constraints

**Symptom**: A plan file passes the metric thresholds but fails structural validation (duplicate requests, missing requests).

**Prevention**:

- Validate structure FIRST, before computing metrics
- Check: no duplicate request IDs, all request IDs present, batch counts match

## Verification Strategy

### Structural Verification (Run First)

```
1. Parse the generated plan file
2. Extract all request IDs assigned to batches
3. Verify: set(assigned_ids) == set(all_request_ids)
4. Verify: len(assigned_ids) == len(all_request_ids)  # no duplicates
5. Verify: every batch's shape has prompt_dim >= the max prompt length in that batch
6. Verify: unique_shape_count <= max_allowed_shapes
```
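Below is a minimal runnable sketch of this checklist, assuming a plan is held as a list of batches, each carrying its compiled shape and assigned requests. The `Shape`/`Batch` field names are illustrative, not a real plan-file schema:

```python
# Illustrative sketch: the Shape/Batch field names are assumptions,
# not a prescribed plan-file schema.
from dataclasses import dataclass

@dataclass
class Shape:
    prompt_dim: int
    gen_dim: int

@dataclass
class Batch:
    shape: Shape
    request_ids: list     # IDs assigned to this batch
    prompt_lengths: list  # actual prompt lengths, parallel to request_ids

def verify_structure(batches, all_request_ids, max_allowed_shapes):
    """Run all structural checks; raise AssertionError on the first failure."""
    assigned = [rid for b in batches for rid in b.request_ids]

    # Every request appears exactly once.
    assert set(assigned) == set(all_request_ids), "missing or unknown request IDs"
    assert len(assigned) == len(all_request_ids), "duplicate request assignment"

    # Each batch's compiled shape must fit its longest prompt.
    for b in batches:
        assert b.shape.prompt_dim >= max(b.prompt_lengths), \
            f"shape ({b.shape.prompt_dim}, {b.shape.gen_dim}) too small for batch"

    # Stay within the compilation budget.
    unique_shapes = {(b.shape.prompt_dim, b.shape.gen_dim) for b in batches}
    assert len(unique_shapes) <= max_allowed_shapes, "too many unique shapes"
```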
### Metric Verification (Run After Structural)

```
1. Compute total padded tokens using the shape dimensions
2. Compute actual tokens from the request data
3. Calculate: pad_ratio = 1 - (actual / padded)
4. Compute cost using the exact cost model formula
5. Simulate the latency distribution, extract P95/P99
6. Compare all metrics against their thresholds
```

### Iterative Verification Pattern

After any parameter change:

1. Generate a new plan
2. Run structural verification → fix if broken
3. Run metric verification → analyze trade-offs
4. Update the configuration tracking log
5. Decide the next parameter adjustment based on which metric is furthest from its threshold

## Reference: Optimization Decision Tree

```
START
│
├─ Is pad_ratio too high?
│   └─ YES → Reduce generation bucket size OR add more prompt shapes
│
├─ Is cost too high?
│   └─ YES → Reduce the number of unique shapes (but verify coverage!)
│
├─ Is P95/P99 latency too high?
│   └─ YES → Increase generation bucket size (larger batches, fewer total)
│            OR redistribute requests across batches more evenly
│
└─ All metrics passing?
    └─ YES → DONE (record configuration as the solution)
```

## Key Insight Summary

1. **Math first, code second**: Derive bounds from constraints before implementing
2. **Track everything**: Maintain a configuration → results mapping
3. **Validate continuously**: Check invariants after every modification
4. **Understand trade-offs**: Metrics are interconnected; optimize with awareness
5. **Separate concerns**: Structural validation, metric computation, and parameter search should be independent, reusable components
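Finally, to make Phase 2's standalone evaluation function and the metric verification steps concrete, here is one possible sketch. The plan representation, the threshold keys, and the per-batch latency model are assumptions for illustration; the cost formula follows the model from Core Concepts, reading `total_tokens` as padded tokens:

```python
# Illustrative sketch: the plan representation, threshold keys, and latency
# model are assumptions, not a prescribed API.
def evaluate_plan(batches, thresholds, per_token_cost, compilation_cost):
    """batches: list of dicts like
         {"prompt_dim": int, "gen_dim": int, "batch_size": int,
          "actual_tokens": int, "latency": float}
       thresholds: dict with max_cost, max_pad_ratio, max_p95, max_p99.
       Returns the structured result described in Phase 2."""
    padded_tokens = sum(
        (b["prompt_dim"] + b["gen_dim"]) * b["batch_size"] for b in batches)
    actual_tokens = sum(b["actual_tokens"] for b in batches)
    pad_ratio = 1 - actual_tokens / padded_tokens

    # Cost model from Core Concepts, with total_tokens taken as padded tokens.
    num_shapes = len({(b["prompt_dim"], b["gen_dim"]) for b in batches})
    cost = per_token_cost * padded_tokens + compilation_cost * num_shapes ** 2

    # Assumed latency model: per-batch latencies are given; take percentiles
    # over the batch population (replace with the task's real formula).
    latencies = sorted(b["latency"] for b in batches)
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    p99 = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]

    result = {"cost": cost, "pad_ratio": pad_ratio,
              "p95_latency": p95, "p99_latency": p99}
    result["valid"] = (cost <= thresholds["max_cost"]
                       and pad_ratio <= thresholds["max_pad_ratio"]
                       and p95 <= thresholds["max_p95"]
                       and p99 <= thresholds["max_p99"])
    return result
```

Because it returns every metric plus a single `valid` flag, this function can serve as the inner loop of the Phase 2 grid or binary search and as the results-log entry for configuration tracking.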