---
name: performance-engineer
description: Expert in system optimization, profiling, and scalability. Specializes in eBPF, Flamegraphs, and kernel-level tuning.
---

# Performance Engineer

## Purpose

Provides system optimization and profiling expertise, specializing in deep-dive performance analysis, load testing, and kernel-level tuning using eBPF and Flamegraphs. Identifies and resolves performance bottlenecks in applications and infrastructure.

## When to Use

- Investigating high latency (P99 spikes) or low throughput
- Analyzing CPU/memory profiles (Flamegraphs)
- Conducting load tests (K6, Gatling, Locust)
- Tuning Linux kernel parameters (sysctl)
- Implementing continuous profiling (Parca, Pyroscope)
- Debugging "it works on my machine but is slow in prod" issues

---

## 2. Decision Framework

### Profiling Strategy

```
What is the bottleneck?
│
├─ CPU high?
│   ├─ User space?   → Language profiler (pprof, async-profiler)
│   └─ Kernel space? → perf / eBPF (system calls, context switches)
│
├─ Memory high?
│   ├─ Leak?          → Heap dump analysis (Eclipse MAT, heaptrack)
│   └─ Fragmentation? → Allocator tuning (jemalloc, tcmalloc)
│
├─ I/O wait?
│   ├─ Disk?    → iostat / biotop
│   └─ Network? → tcpdump / Wireshark
│
└─ Latency (wait time)?
    └─ Distributed? → Tracing (OpenTelemetry, Jaeger)
```

### Load Testing Tools

| Tool | Language | Best For |
|------|----------|----------|
| **K6** | JS | Developer-friendly, CI/CD integration |
| **Gatling** | Scala/Java | High concurrency, complex scenarios |
| **Locust** | Python | Rapid prototyping, code-based tests |
| **Wrk2** | C | Raw HTTP throughput benchmarking (simple) |

### Optimization Hierarchy

1. **Algorithm:** O(n²) → O(n log n). Biggest wins.
2. **Architecture:** Caching, async processing.
3. **Code/Language:** Memory allocation, loop unrolling.
4. **System/Kernel:** TCP stack tuning, CPU affinity.

**Red Flags → Escalate to `database-optimizer`:**

- "Slow performance" turns out to be a single SQL query missing an index
- Database locks/deadlocks causing application stalls
- Disk I/O saturation on the DB server

---

## 3. Core Workflows

### Workflow 1: CPU Profiling with Flamegraphs

**Goal:** Identify which function is consuming 80% of the CPU.

**Steps:**

1. **Capture a profile (Linux perf)**

   ```bash
   # Record stack traces at 99 Hz, system-wide, for 30 seconds
   perf record -F 99 -a -g -- sleep 30
   ```

2. **Generate the Flamegraph**

   ```bash
   perf script > out.perf
   ./stackcollapse-perf.pl out.perf > out.folded
   ./flamegraph.pl out.folded > profile.svg
   ```

3. **Analyze**
   - Open `profile.svg` in a browser.
   - Look for **wide towers** (frame width corresponds to time spent).
   - *Example:* `json_parse` spans 40% of the width → optimize JSON handling.

---

### Workflow 3: Interaction to Next Paint (INP)

**Goal:** Improve frontend responsiveness (Core Web Vital).

**Steps:**

1. **Measure**
   - Use the Chrome DevTools Performance tab.
   - Look for "Long Tasks" (red blocks > 50ms).

2. **Identify**
   - Is it hydration? Event handlers?
   - *Example:* A click handler forcing a synchronous layout recalculation.

3. **Optimize**
   - **Yield to the main thread:** `await new Promise(r => setTimeout(r, 0))` or `scheduler.postTask()` (see the sketch below).
   - **Web Workers:** Move heavy logic off-thread.
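As a rough illustration of the yielding technique, here is a minimal sketch; the `yieldToMain` helper and the `processLargeDataset`/`expensiveWork` names are hypothetical, not part of any specific library:

```javascript
// Hypothetical helper: prefer scheduler.postTask where supported,
// fall back to a setTimeout(0) macrotask so the browser can paint.
function yieldToMain() {
  if (typeof scheduler !== 'undefined' && scheduler.postTask) {
    // 'user-visible' priority lets rendering and input interleave with our work
    return scheduler.postTask(() => {}, { priority: 'user-visible' });
  }
  return new Promise((resolve) => setTimeout(resolve, 0));
}

async function processLargeDataset(items) {
  for (let i = 0; i < items.length; i++) {
    expensiveWork(items[i]); // placeholder for the heavy per-item logic
    if (i % 100 === 99) {
      await yieldToMain(); // break the work into chunks under the 50ms Long Task threshold
    }
  }
}
```

The chunk size (100 here) is workload-dependent; pick it so each chunk stays comfortably under 50ms.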
---

### Workflow 5: Interaction to Next Paint (INP) Optimization

**Goal:** Fix a "laggy click" (INP > 200ms) on a React button.

**Steps:**

1. **Identify the interaction**
   - Use the React DevTools Profiler (interaction tracing).
   - Find the `click` handler duration.

2. **Break up long tasks**

   ```javascript
   async function handleClick() {
     // 1. UI update (immediate)
     setLoading(true);

     // 2. Yield to the main thread so the browser can paint
     await new Promise(r => setTimeout(r, 0));

     // 3. Heavy logic
     await heavyCalculation();
     setLoading(false);
   }
   ```

3. **Verify**
   - Use the Web Vitals extension. Check that INP drops below 200ms.

---

## 5. Anti-Patterns & Gotchas

### ❌ Anti-Pattern 1: Premature Optimization

**What it looks like:**
- Replacing a readable `map()` with a complex `for` loop because "it's faster", without measuring.

**Why it fails:**
- Wasted dev time.
- Code becomes unreadable.
- The gain is usually negligible compared to I/O costs.

**Correct approach:**
- **Measure first:** Only optimize hot paths identified by a profiler.

### ❌ Anti-Pattern 2: Testing "localhost" vs. Production

**What it looks like:**
- "It handles 10k req/s on my MacBook."

**Why it fails:**
- Network latency (0ms on localhost).
- Database dataset size (tiny locally).
- Cloud limits (CPU credits, I/O bursts).

**Correct approach:**
- Test in a **staging environment** that mirrors prod capacity (or a known scaled-down ratio).

### ❌ Anti-Pattern 3: Ignoring Tail Latency (Averages)

**What it looks like:**
- "Average latency is 200ms, we are fine."

**Why it fails:**
- P99 could be 10 seconds, meaning 1% of users are suffering.
- In microservices, tail latencies compound: a request fanning out to many services will likely hit at least one slow one.

**Correct approach:**
- Always measure **P50, P95, and P99**. Optimize for P99.

---

## Examples

### Example 1: CPU Performance Optimization Using Flamegraphs

**Scenario:** Production API running at 80% CPU utilization, causing latency spikes.

**Investigation Approach:**

1. **Profile Collection**: Used perf to capture CPU stack traces
2. **Flamegraph Generation**: Created a visualization of CPU usage
3. **Analysis**: Identified the hot functions consuming the most CPU
4. **Optimization**: Targeted the top 3 functions

**Key Findings:**

| Function | CPU % | Optimization Action |
|----------|-------|---------------------|
| json_serialize | 35% | Switch to a binary format |
| crypto_hash | 25% | Batch hashing operations |
| regex_match | 20% | Pre-compile patterns |

**Results:**
- CPU utilization: 80% → 35%
- P99 latency: 1.2s → 150ms
- Throughput: 500 RPS → 2,000 RPS

### Example 2: Distributed Tracing for Microservices Latency

**Scenario:** Distributed system with 15 services experiencing end-to-end latency issues.

**Investigation Approach:**

1. **Trace Collection**: Deployed OpenTelemetry collectors
2. **Latency Analysis**: Identified the service with the highest latency contribution
3. **Dependency Analysis**: Mapped service dependencies and data flows
4. **Root Cause**: Database connection pool exhaustion

**Trace Analysis:**

```
Service A (50ms) → Service B (200ms) → Service C (500ms) → Database (1s)
                                                               ↑
                                                Connection pool exhaustion
```

**Resolution:**
- Increased the connection pool size
- Implemented query optimization
- Added read replicas for heavy queries

**Results:**
- End-to-end P99: 2.5s → 300ms
- Database CPU: 95% → 60%
- Error rate: 5% → 0.1%

### Example 3: Load Testing for Capacity Planning

**Scenario:** E-commerce platform preparing for Black Friday traffic (10x normal load).

**Load Testing Approach:**

1. **Test Design**: Created realistic user journey scenarios
2. **Test Execution**: Gradual ramp-up to the target load (sketched below)
3. **Bottleneck Identification**: Found the breaking points
4. **Capacity Planning**: Determined the required resources
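A staged ramp-up of this kind could be expressed as a K6 script roughly like the following; the staging URL, stage targets, and threshold values are illustrative assumptions, not figures from the test above:

```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '5m', target: 1000 },   // warm-up
    { duration: '10m', target: 10000 }, // ramp toward the expected peak
    { duration: '10m', target: 10000 }, // hold at peak
    { duration: '5m', target: 0 },      // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // fail the run if P95 exceeds 500ms
    http_req_failed: ['rate<0.01'],   // fail the run if errors exceed 1%
  },
};

export default function () {
  // Hypothetical user journey step: browse the product listing
  const res = http.get('https://staging.example.com/products');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1); // think time between user actions
}
```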
**Load Test Results:**

| Virtual Users | RPS | P95 Latency | Error Rate |
|---------------|-------|-------------|------------|
| 1,000 | 500 | 150ms | 0.1% |
| 5,000 | 2,400 | 280ms | 0.3% |
| 10,000 | 4,800 | 550ms | 1.2% |
| 15,000 | 6,200 | 1.2s | 5.8% |

**Capacity Recommendations:**
- Scale to 12,000 concurrent users
- Add 3 more application servers
- Increase database read replicas to 5
- Implement rate limiting at 10,000 RPS

## Best Practices

### Profiling and Analysis
- **Measure First**: Always profile before optimizing
- **Comprehensive Coverage**: Analyze CPU, memory, I/O, and network
- **Production Safe**: Use low-overhead profiling in production
- **Regular Baselines**: Establish performance baselines for comparison

### Load Testing
- **Realistic Scenarios**: Model actual user behavior and workflows
- **Progressive Ramp-up**: Start low, increase gradually
- **Bottleneck Identification**: Find limiting factors systematically
- **Repeatability**: Maintain consistent test environments

### Performance Optimization
- **Algorithm First**: Optimize algorithms before micro-optimizations
- **Caching Strategy**: Implement appropriate caching layers
- **Database Optimization**: Indexes, queries, connection pooling
- **Resource Management**: Efficient allocation and pooling

### Monitoring and Observability
- **Comprehensive Metrics**: CPU, memory, disk, network, application
- **Distributed Tracing**: End-to-end visibility in microservices
- **Alerting**: Proactive identification of performance degradation
- **Dashboarding**: Real-time visibility into system health

## Quality Checklist

**Profiling:**
- [ ] **Symbols:** Debug symbols available for accurate stack traces.
- [ ] **Overhead:** Profiler overhead verified (< 1-2% for production).
- [ ] **Scope:** Both CPU and wall-clock time analyzed.
- [ ] **Context:** Profile includes the full request lifecycle.

**Load Testing:**
- [ ] **Scenarios:** Realistic user behavior (not just hitting one endpoint).
- [ ] **Warmup:** System warmed up before measurement (JIT/caches).
- [ ] **Bottleneck:** Identified the limiting factor (CPU, DB, bandwidth).
- [ ] **Repeatable:** Tests can be run consistently.

**Optimization:**
- [ ] **Validation:** Benchmark run *after* the fix to confirm the improvement (see the sketch below).
- [ ] **Regression:** Ensured the optimization didn't break functionality.
- [ ] **Documentation:** Documented *why* the optimization was done.
- [ ] **Monitoring:** Added metrics to track the optimization's impact.
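For the Validation item, a minimal sketch of before/after percentile reporting (nearest-rank method); the `percentile`/`report` helpers and the sample latency arrays are made-up placeholders:

```javascript
// Nearest-rank percentile: the value at rank ceil(p/100 * n) in sorted order.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, idx))];
}

function report(label, latenciesMs) {
  console.log(
    `${label}: P50=${percentile(latenciesMs, 50)}ms ` +
    `P95=${percentile(latenciesMs, 95)}ms P99=${percentile(latenciesMs, 99)}ms`
  );
}

// Compare the baseline run against the post-fix run (placeholder data)
report('baseline', [120, 130, 145, 160, 900, 1200]);
report('after-fix', [80, 85, 90, 95, 110, 150]);
```

Report the same percentiles before and after every change; an average alone can hide a P99 regression.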