---
name: analyze-simd-usage
description: "Analyze SIMD usage opportunities in Mojo code. Use to find performance optimization opportunities."
category: mojo
mcp_fallback: none
agent: test-engineer
user-invocable: false
---

# Analyze SIMD Usage Opportunities

Identify where SIMD (Single Instruction Multiple Data) can improve performance.

## When to Use

- Performance-critical tensor operations
- Element-wise operations on large arrays
- Vectorization of loops processing multiple elements
- Optimizing matrix/vector operations
- Finding performance bottlenecks in ML code

## Quick Reference

```bash
# Find loops processing arrays/tensors
grep -n "for.*in.*range\|@unroll\|@vectorize" *.mojo

# Find element-wise operations
grep -n "\.load\|\.store\|\.broadcast" *.mojo

# Check for SIMD parameters
grep -n "simd_width\|nelems\|\[.*:\]" *.mojo

# Identify candidates
grep -n "for i in range.*:" -A 10 *.mojo | grep -E "array\[i\]|tensor\[i\]"
```

## SIMD Optimization Opportunities

**Vectorizable Patterns**:

- ✅ Element-wise addition: `a[i] + b[i]` for all i
- ✅ Scalar multiplication: `a[i] * scalar` for all i
- ✅ Unary operations: `sin(a[i])`, `exp(a[i])` for all i
- ✅ Reduction operations: sum, max, min over array
- ❌ Dependent iterations: `a[i] = a[i-1] + value` (sequential)
- ❌ Conditional branches: `if a[i] > threshold:` (hard to vectorize)
- ❌ Function calls: unpredictable latency (avoid in tight loops)

**SIMD Width Selection**:

- `@parameter fn[simd_width: Int]` - Generic SIMD width
- `simd_width=4` - Typically good for float32
- `simd_width=8` - Optimal for many operations
- `simd_width=16+` - For int32 or specialized ops
- Match hardware capabilities (AVX2=4-8, AVX512=8-16)

**Vectorization Patterns**:

- ✅ `@vectorize` decorator for simple loops
- ✅ `@unroll` for small loops (2-4 iterations)
- ✅ Manual SIMD with `.load[]` and `.store[]`
- ✅ Tensor operations with SIMD dimensions

## Analysis Workflow

1. **Profile code**: Identify bottlenecks using time/memory metrics
2. **Find loops**: Locate loops processing large amounts of data
3. **Check vectorizability**: Verify no loop-carried dependencies
4. **Estimate speedup**: SIMD could provide 4-16x improvement
5. **Implement SIMD**: Use @vectorize, @unroll, or manual SIMD
6. **Measure performance**: Verify improvement with benchmarks
7. **Document changes**: Note what was optimized and why

## Output Format

Report SIMD analysis with:

1. **Hotspots** - Functions/loops using most CPU time
2. **Vectorization Potential** - Operations that could use SIMD
3. **Estimated Speedup** - Expected performance improvement
4. **Implementation Priority** - High/medium/low impact
5. **Technical Approach** - How to implement SIMD
6. **Risks** - Potential issues with vectorization
7. **Recommendations** - Which optimizations to pursue first

## Optimization Examples

**Example 1: Element-wise Addition**

```mojo
# Before: scalar loop
fn add_scalar(a: Tensor, b: Tensor) -> Tensor:
    var result = Tensor(a.shape)
    for i in range(a.num_elements()):
        result._data[i] = a._data[i] + b._data[i]
    return result

# After: vectorized
@vectorize
fn add_simd[simd_width: Int](i: Int):
    result._data.store[simd_width](i,
        a._data.load[simd_width](i) + b._data.load[simd_width](i))

def add_vectorized(a: Tensor, b: Tensor) -> Tensor:
    var result = Tensor(a.shape)
    # 4x-8x speedup typical
    return result
```

**Example 2: Reduction (Sum)**

```mojo
# Before: scalar loop
fn sum_scalar(tensor: Tensor) -> Float32:
    var total: Float32 = 0
    for i in range(tensor.num_elements()):
        total += tensor._data[i]
    return total

# After: SIMD reduction
fn sum_simd[simd_width: Int](tensor: Tensor) -> Float32:
    # Process simd_width elements at a time
    # Then reduce results - can be much faster
    return total
```

## Error Handling

| Problem | Solution |
|---------|----------|
| Vectorization causes wrong results | Check for loop-carried dependencies |
| Segment fault with SIMD | Verify alignment and bounds |
| Minimal speedup | May not be vectorizable, profile to confirm |
| Complex logic | Break into simpler vectorizable operations |
| Type mismatches | Ensure SIMD width compatible with element type |

## SIMD Decision Tree

- Does loop process large arrays? → YES → Check vectorizability
- Loop-carried dependencies? → YES → Can't vectorize, optimize differently
- Simple operations on many elements? → YES → Use @vectorize or @unroll
- Critical path (hot loop)? → YES → Worth optimizing
- Implement → Measure → Iterate

## References

- See mojo-simd-optimize for implementation guidance
- See CLAUDE.md for SIMD code patterns
- See performance section in module documentation