---
name: flaky-detect
description: Identify flaky tests from CI history and test execution patterns. Use when debugging intermittent test failures, auditing test reliability, or improving CI stability.
---


# Flaky Detect Skill

## Purpose

Identify flaky tests (tests that both pass and fail on unchanged code) by analyzing CI history, execution patterns, and test characteristics. Google research reports that 4.56% of tests are flaky, a significant and recurring cost in developer productivity.

## Research Foundation

| Finding | Source | Reference |
|---------|--------|-----------|
| 4.56% flaky rate | Google (2016) | [Flaky Tests at Google](https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html) |
| ML Classification | FlaKat (2024) | [arXiv:2403.01003](https://arxiv.org/abs/2403.01003) - 85%+ accuracy |
| LLM Auto-repair | FlakyFix (2023) | [arXiv:2307.00012](https://arxiv.org/html/2307.00012v4) |
| Flaky Taxonomy | Luo et al. (2014) | "An Empirical Analysis of Flaky Tests" |

## When This Skill Applies

- User reports "tests sometimes fail" or "intermittent failures"
- CI has been unstable or unreliable
- User wants to audit test suite reliability
- Pre-release quality assessment
- Debugging non-deterministic behavior

## Trigger Phrases

| Natural Language | Action |
|------------------|--------|
| "Find flaky tests" | Analyze CI history for flaky patterns |
| "Why does CI keep failing?" | Identify flaky tests causing failures |
| "Test suite is unreliable" | Full flaky test audit |
| "This test sometimes passes" | Analyze specific test for flakiness |
| "Audit test reliability" | Comprehensive flaky detection |
| "Quarantine flaky tests" | Identify and isolate flaky tests |

## Flaky Test Taxonomy (Google Research)

| Category | Percentage | Root Causes |
|----------|------------|-------------|
| **Async/Timing** | 45% | Race conditions, insufficient waits, timeouts |
| **Test Order** | 20% | Shared state, execution order dependencies |
| **Environment** | 15% | File system, network, configuration differences |
| **Resource Limits** | 10% | Memory, threads, connection pools |
| **Non-deterministic** | 10% | Random values, timestamps, UUIDs |

## Detection Methods

### 1. CI History Analysis

Parse GitHub Actions / CI logs to find inconsistent results:

```python
from collections import defaultdict

def analyze_ci_history(repo, days=30):
    """Return tests whose pass rate over the period is strictly between 5% and 95%."""
    # get_ci_runs is a placeholder for your CI client: it must return runs
    # exposing `.tests`, where each test has `.name` and `.passed`
    runs = get_ci_runs(repo, days)
    test_results = defaultdict(lambda: {"pass": 0, "fail": 0})

    # Count passes and failures per test across all runs
    for run in runs:
        for test in run.tests:
            outcome = "pass" if test.passed else "fail"
            test_results[test.name][outcome] += 1

    # Identify flaky tests (pass rate between 5% and 95%)
    flaky = []
    for test, results in test_results.items():
        total = results["pass"] + results["fail"]
        if total >= 5:  # Enough data
            pass_rate = results["pass"] / total
            if 0.05 < pass_rate < 0.95:
                flaky.append({
                    "test": test,
                    "pass_rate": pass_rate,
                    "total_runs": total
                })

    # Least reliable tests first
    return sorted(flaky, key=lambda x: x["pass_rate"])
```
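`get_ci_runs` is not defined in the snippet above. One possible building block, assuming your CI uploads JUnit-style XML reports (an assumption, not something this skill prescribes), is a parser that turns one report into objects with the `.name` and `.passed` attributes that `analyze_ci_history` reads. The names `CaseResult` and `parse_junit_report` are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class CaseResult:
    """One executed test case (hypothetical helper type)."""
    name: str
    passed: bool


def parse_junit_report(xml_text: str) -> list[CaseResult]:
    """Extract executed test cases from a JUnit-style XML report."""
    root = ET.fromstring(xml_text)
    results = []
    # iter() walks the whole tree, so both <testsuites> and <testsuite> roots work
    for case in root.iter("testcase"):
        # Skipped tests say nothing about flakiness, so leave them out
        if case.find("skipped") is not None:
            continue
        name = f'{case.get("classname", "")}::{case.get("name", "")}'
        failed = case.find("failure") is not None or case.find("error") is not None
        results.append(CaseResult(name=name, passed=not failed))
    return results
```

A `get_ci_runs` implementation could then download one report per CI run and wrap each parsed list in an object with a `tests` attribute.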

### 2. Code Pattern Analysis

Scan test code for flaky patterns:

```python
import re
from pathlib import Path

# (regex, category, description). These are heuristics: expect false positives.
FLAKY_PATTERNS = [
    # Timing issues
    (r'setTimeout|sleep|delay', "timing", "Uses explicit delays"),
    (r'Date\.now\(\)|new Date\(\)', "timing", "Uses current time"),

    # Async issues
    # A line that calls .then( with no await/return on the same line
    (r'^(?!.*\b(?:await|return)\b).*\.then\(', "async", "Promise without await"),
    # An async arrow function whose body (up to the first closing brace) has no await
    (r'async\s*\([^)]*\)\s*=>\s*\{(?:(?!await)[^}])*\}', "async", "Async without await"),

    # Non-deterministic values
    (r'Math\.random\(\)', "random", "Uses random values"),
    (r'uuid|nanoid', "random", "Uses generated IDs"),

    # Environment and network
    (r'process\.env', "environment", "Environment-dependent"),
    (r'fs\.(read|write)', "environment", "File system access"),
    (r'fetch\(|axios\.|http\.', "network", "Network calls"),
]

def scan_for_flaky_patterns(test_file):
    """Scan test file for flaky patterns"""
    content = Path(test_file).read_text(encoding="utf-8")
    matches = []

    for pattern, category, description in FLAKY_PATTERNS:
        # MULTILINE lets ^ anchor at the start of every line
        if re.search(pattern, content, re.MULTILINE):
            matches.append({
                "category": category,
                "description": description,
                "pattern": pattern
            })

    return matches
```
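The scanner above reports which patterns occur in a file, but not where. Since the report format cites locations such as `test/api/login.test.ts:45`, a line-by-line variant may be useful. The sketch below uses a small subset of the patterns for illustration; `LINE_PATTERNS` and `locate_flaky_patterns` are hypothetical names.

```python
import re

# Illustrative subset of single-line patterns: (compiled regex, category)
LINE_PATTERNS = [
    (re.compile(r"Date\.now\(\)|new Date\(\)"), "timing"),
    (re.compile(r"Math\.random\(\)"), "random"),
    (re.compile(r"process\.env"), "environment"),
]


def locate_flaky_patterns(source: str, filename: str = "<memory>") -> list[dict]:
    """Return one finding per (line, pattern) match, with a file:line location."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for regex, category in LINE_PATTERNS:
            if regex.search(line):
                findings.append({
                    "location": f"{filename}:{lineno}",
                    "category": category,
                    "line": line.strip(),
                })
    return findings
```

Line-based matching cannot see patterns that span several lines, so it complements the whole-file scan and does not replace it.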

### 3. Re-run Analysis

Run tests multiple times to detect flakiness:

```bash
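# Run tests 10 times, writing one JSON report per run
mkdir -p test-results
for i in {1..10}; do
  # --silent keeps npm's own banner out of the JSON; `|| true` continues after a failing run
  npm test --silent -- --reporter=json > "test-results/run-$i.json" || true
done

# Analyze for inconsistency (analyze_reruns.py is a project-specific script)
python analyze_reruns.py test-results/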
```
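`analyze_reruns.py` is not shown in this document. A minimal sketch of the comparison it could perform, assuming each run has already been loaded into a dict mapping test name to pass/fail (the loading step depends on your reporter's JSON schema and is left out here):

```python
from collections import defaultdict


def find_inconsistent_tests(runs: list[dict[str, bool]]) -> dict[str, float]:
    """Return pass rates for tests that both passed and failed across runs."""
    outcomes: dict[str, list[bool]] = defaultdict(list)
    for run in runs:
        for test_name, passed in run.items():
            outcomes[test_name].append(passed)

    # A test is inconsistent if it has at least one pass and at least one failure
    return {
        name: sum(results) / len(results)
        for name, results in outcomes.items()
        if any(results) and not all(results)
    }
```

Tests that fail in every run are excluded on purpose: a consistent failure points to a real defect, not flakiness.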

## Output Format

````markdown
## Flaky Test Report

**Analysis Period**: Last 30 days
**Total Tests**: 450
**Flaky Tests Found**: 12 (2.7%)

### Critical Flaky Tests (< 50% pass rate)

#### 1. `test/api/login.test.ts:45`
**Pass Rate**: 42% (21/50 runs)
**Category**: Timing
**Pattern**: Uses `Date.now()` for token expiry

```typescript
// Flaky code
it('should expire token after 1 hour', () => {
  const token = createToken();
  const expiry = Date.now() + 3600000;  // Flaky!
  expect(token.expiresAt).toBe(expiry);
});
```

**Root Cause**: The test calls `Date.now()` after `createToken()` has already read the clock, so the two values match only when both reads land in the same millisecond.

**Recommended Fix**: Use mocked time
```typescript
it('should expire token after 1 hour', () => {
  vi.useFakeTimers();
  vi.setSystemTime(new Date('2024-01-01T00:00:00Z'));
  const token = createToken();
  expect(token.expiresAt).toBe(new Date('2024-01-01T01:00:00Z').getTime());
  vi.useRealTimers();
});
```

### High Flaky Tests (50-80% pass rate)

#### 2. `test/db/connection.test.ts:23`
**Pass Rate**: 68% (34/50 runs)
**Category**: Resource
**Pattern**: Connection pool exhaustion

[... more tests ...]

### Summary by Category

| Category | Count | Impact |
|----------|-------|--------|
| Timing | 5 | HIGH |
| Async | 3 | HIGH |
| Environment | 2 | MEDIUM |
| Order | 1 | MEDIUM |
| Network | 1 | LOW |

### Recommendations

1. **Quick Win**: Fix 5 timing tests with `vi.setSystemTime()` (+0.5% stability)
2. **Medium Effort**: Add proper async handling (+0.3% stability)
3. **Infrastructure**: Add test isolation for DB tests (+0.2% stability)

### Quarantine Candidates

These tests should be skipped in CI until fixed:

```typescript
// vitest.config.ts
import { configDefaults, defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    exclude: [
      ...configDefaults.exclude,      // Keep Vitest's default excludes
      'test/api/login.test.ts',       // Timing flaky
      'test/db/connection.test.ts',   // Resource flaky
    ]
  }
});
```

**Note**: Track quarantined tests in `.aiwg/testing/flaky-quarantine.md`
````
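The sample report groups tests into Critical (< 50% pass rate) and High (50-80% pass rate) sections. A minimal sketch of that grouping, assuming entries shaped like the output of `analyze_ci_history`; the `Moderate` tier for tests above 80% is an assumption, since the report does not name one.

```python
def severity_tier(pass_rate: float) -> str:
    """Map a pass rate (0.0 to 1.0) to a report section."""
    if pass_rate < 0.50:
        return "Critical"
    if pass_rate <= 0.80:
        return "High"
    return "Moderate"  # hypothetical tier, not named in the report format


def group_by_tier(flaky_tests: list[dict]) -> dict[str, list[dict]]:
    """Group flaky-test entries by tier, least reliable first within each tier."""
    tiers: dict[str, list[dict]] = {"Critical": [], "High": [], "Moderate": []}
    for entry in sorted(flaky_tests, key=lambda t: t["pass_rate"]):
        tiers[severity_tier(entry["pass_rate"])].append(entry)
    return tiers
```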

## Quarantine Process

### 1. Identify

```bash
# Run flaky detection
python scripts/flaky_detect.py --ci-history 30 --threshold 95
```

### 2. Quarantine

```javascript
// Mark test as flaky
describe.skip('flaky: login expiry', () => {
  // FLAKY: https://github.com/org/repo/issues/123
  // Root cause: timing-dependent
  // Fix in progress: PR #456
});
```

### 3. Track

Create tracking issue:
```markdown
## Flaky Test: test/api/login.test.ts:45

- **Pass Rate**: 42%
- **Category**: Timing
- **Root Cause**: Uses real system time
- **Quarantined**: 2024-12-12
- **Fix PR**: #456
- **Target Unquarantine**: 2024-12-15
```

### 4. Fix and Unquarantine

After fix:
```bash
# Verify fix with multiple runs; stop at the first failure
for i in {1..20}; do
  npm test -- test/api/login.test.ts || { echo "Run $i failed"; break; }
done

# Remove from quarantine if all pass
```
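Whether 20 clean runs are enough depends on how flaky the test was. Assuming runs are independent (a simplification), a test that still passes with probability p produces n consecutive passes with probability p^n. The sketch below computes that chance and the number of runs needed to push it under a chosen bound; both function names and the 5% default are assumptions for illustration.

```python
import math


def chance_of_clean_streak(pass_rate: float, runs: int) -> float:
    """Probability that a still-flaky test passes `runs` times in a row."""
    return pass_rate ** runs


def runs_needed(pass_rate: float, max_false_confidence: float = 0.05) -> int:
    """Smallest n such that pass_rate ** n <= max_false_confidence."""
    if not 0.0 < pass_rate < 1.0:
        raise ValueError("pass_rate must be strictly between 0 and 1")
    return math.ceil(math.log(max_false_confidence) / math.log(pass_rate))
```

For a test that previously passed 42% of the time, 20 clean runs in a row are very unlikely to be luck. For a test that passed 95% of the time, a clean streak of 20 still occurs in roughly a third of cases under this model, so such tests need a longer verification loop.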

## Integration Points

- Works with `flaky-fix` skill for automated repairs
- Reports to CI dashboard
- Feeds into `/flow-gate-check` for release decisions
- Tracks in `.aiwg/testing/flaky-registry.md`

## Script Reference

### flaky_detect.py
Analyze CI history for flaky tests:
```bash
python scripts/flaky_detect.py --repo owner/repo --days 30
```

### flaky_scanner.py
Scan code for flaky patterns:
```bash
python scripts/flaky_scanner.py --target test/
```