---
name: ci-monitoring
description: Use after creating PR - monitor CI pipeline, resolve failures cyclically until green or issue is identified as unresolvable
---

# CI Monitoring

## Overview

Monitor CI pipeline and resolve failures until green.

**CRITICAL: CI is validation, not discovery.**

> **If CI finds a bug you didn't find locally, your local testing was insufficient.**
>
> Before blaming CI, ask yourself:
> 1. Did you run all tests locally?
> 2. Did you test against local services (postgres, redis)?
> 3. Did you run the same checks CI runs?
> 4. Did you run integration tests, not just unit tests with mocks?
>
> CI should only fail for: environment differences, flaky tests, or infrastructure issues—never for bugs you could have caught locally.

**Core principle:** CI failures are blockers. But they should never be surprises.

**Announce at start:** "I'm monitoring CI and will resolve any failures."

## The CI Loop

```
PR Created
     │
     ▼
┌─────────────┐
│ Wait for CI │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ CI Status?  │
└──────┬──────┘
       │
   ┌───┴───┐
   │       │
 Green   Red/Failed
   │       │
   ▼       ▼
┌─────────┐  ┌─────────────┐
│ MERGE   │  │ Diagnose    │
│ THE PR  │  │ failure     │
└────┬────┘  └──────┬──────┘
     │              │
     ▼              ▼
┌─────────┐  ┌─────────────┐
│ Continue│  │ Fixable?    │
│ to next │  └──────┬──────┘
│ issue   │         │
└─────────┘    ┌────┴────┐
               │         │
              Yes        No
               │         │
               ▼         ▼
          ┌─────────┐  ┌─────────────┐
          │ Fix and │  │ Document as │
          │ push    │  │ unresolvable│
          └────┬────┘  └─────────────┘
               │
               └────► Back to "Wait for CI"
```

## CRITICAL: Green CI = Merge Immediately

**When CI passes, you MUST merge the PR and continue working.**

Do NOT:
- Stop and report "CI is green, ready for review"
- Wait for user confirmation
- Summarize and ask what to do next

DO:
- Merge the PR immediately: `gh pr merge [PR_NUMBER] --squash --delete-branch`
- Mark the linked issue as Done
- Continue to the next issue in scope

```bash
# When CI passes
gh pr merge [PR_NUMBER] --squash --delete-branch

# Update linked issue status
gh issue edit [ISSUE_NUMBER] --remove-label "status:in-review" --add-label "status:done"

# Continue to next issue (do not stop)
```

**The only exception:** PRs with `do-not-merge` label require explicit user action.

## Checking CI Status

### Using GitHub CLI

```bash
# Check all CI checks
gh pr checks [PR_NUMBER]

# Watch CI in real-time
gh pr checks [PR_NUMBER] --watch

# Get detailed status
gh pr view [PR_NUMBER] --json statusCheckRollup
```

### Expected Output

```
All checks were successful
0 failing, 0 pending, 5 passing

CHECKS
✓  build          1m23s
✓  lint           45s
✓  test           3m12s
✓  typecheck      1m05s
✓  security-scan  2m30s
```

## Handling Failures

### Step 1: Identify the Failure

```bash
# Get failed check details
gh pr checks [PR_NUMBER]

# View workflow run logs
gh run view [RUN_ID] --log-failed
```

### Step 2: Diagnose the Cause

Common failure types:

| Type | Symptoms | Cause |
|------|----------|-------|
| Test failure | `FAIL` in test output | Code bug or test bug |
| Build failure | Compilation errors | Type errors, syntax errors |
| Lint failure | Style violations | Formatting, conventions |
| Typecheck failure | Type errors | Missing types, wrong types |
| Timeout | Job exceeded time limit | Performance issue or stuck test |
| Flaky test | Passes locally, fails CI | Race condition, environment difference |

### Step 3: Fix the Issue

#### Test Failures

```bash
# Reproduce locally
pnpm test

# Run specific failing test
pnpm test --grep "test name"

# Fix the code or test
# Commit and push
```

#### Build Failures

```bash
# Reproduce locally
pnpm build

# Fix compilation errors
# Commit and push
```

#### Lint Failures

```bash
# Check lint errors
pnpm lint

# Auto-fix what's possible
pnpm lint:fix

# Manually fix remaining
# Commit and push
```

#### Type Failures

```bash
# Check type errors
pnpm typecheck

# Fix type issues
# Commit and push
```

### Step 4: Push Fix and Wait

```bash
# Commit fix
git add .
git commit -m "fix(ci): Resolve test failure in user validation"

# Push
git push

# Wait for CI again
gh pr checks [PR_NUMBER] --watch
```

### Step 5: Repeat Until Green

Loop through diagnose → fix → push → wait until all checks pass.

## Flaky Tests

### Identifying Flakiness

```
Test passes locally
Test fails in CI
Test passes on retry in CI
```

### Handling Flakiness

1. **Don't just retry** - Find the root cause
2. **Check for race conditions** - Timing-dependent code
3. **Check for environment differences** - Paths, env vars, services
4. **Check for state pollution** - Tests affecting each other

```typescript
// Common flaky pattern: timing dependency
// BAD
await saveData();
await delay(100);  // Hoping 100ms is enough
const result = await loadData();

// GOOD: Wait for condition
await saveData();
await waitFor(() => dataExists());
const result = await loadData();
```

## Unresolvable Failures

Sometimes failures can't be fixed in the current PR:

### Legitimate Unresolvable Cases

| Case | Example |
|------|---------|
| CI infrastructure issue | Service down, rate limited |
| Pre-existing flaky test | Not introduced by this PR |
| Upstream dependency issue | External API changed |
| Requires manual intervention | Needs secrets, permissions |

### Process for Unresolvable

1. **Document the issue**

```bash
gh pr comment [PR_NUMBER] --body "## CI Issue

The \`security-scan\` check is failing due to a known issue with the scanner service (see #999).

This is not related to changes in this PR. The scan passes when run locally.

Requesting bypass approval from @maintainer."
```

2. **Create issue if new**

```bash
gh issue create \
  --title "CI: Security scanner service timeout" \
  --body "The security scanner is timing out in CI..."
```

3. **Request bypass if appropriate**

Some teams allow merging with known infrastructure failures.

4. **Do NOT merge with real failures**

If the failure is from your code, it must be fixed.

## CI Best Practices

### Run Locally First (MANDATORY)

**CI is the last resort, not the first check.**

Before pushing, run EVERYTHING CI will run:

```bash
# Run the same checks CI will run
pnpm lint
pnpm typecheck
pnpm test              # Unit tests
pnpm test:integration  # Integration tests against real services
pnpm build

# If you have database changes
docker-compose up -d postgres
pnpm migrate
```

**If your project has docker-compose services:**
- Start them before testing: `docker-compose up -d`
- Run integration tests against real services
- Verify migrations apply to real database
- Don't rely on mocks alone

**Skill:** `local-service-testing`

### Commit Incrementally

Don't push 10 commits at once. Push smaller changes:

```bash
# Small fix, push, verify
git push

# Wait for CI
gh pr checks --watch

# Then next change
```

### Monitor Actively

Don't "push and forget":

```bash
# Watch CI after each push
gh pr checks [PR_NUMBER] --watch
```

## Checklist

For each CI run:

- [ ] Waited for CI to complete
- [ ] All checks examined
- [ ] Failures diagnosed (if any)
- [ ] Fixes implemented (if needed)
- [ ] Re-pushed and re-checked (if fixed)
- [ ] All green

When CI is green:

- [ ] **PR merged immediately** (`gh pr merge --squash --delete-branch`)
- [ ] Linked issue marked Done
- [ ] **Continued to next issue** (do NOT stop and report)

For unresolvable issues:

- [ ] Root cause identified
- [ ] Not caused by PR changes
- [ ] Documented in PR comment
- [ ] Issue created if new problem
- [ ] Bypass approval requested if appropriate

## Integration

This skill is called by:
- `issue-driven-development` - Step 13
- `autonomous-orchestration` - Main loop and bootstrap

This skill follows:
- `pr-creation` - PR exists

This skill completes:
- The PR lifecycle - merge is the final step, not "verification-before-merge"

This skill may trigger:
- `error-recovery` - If CI reveals deeper issues