---
name: triage-ci-flake
description: Use when CI tests fail on main branch after PR merge, or when investigating flaky test failures in CI environments
allowed-tools: Write, Bash(date:*), Bash(mkdir -p *)
---

# Triage CI Failure

## Overview

Systematic workflow for triaging and fixing test failures in CI, especially flaky tests that pass locally but fail in CI. Tests that made it to `main` are usually flaky due to timing, bundling, or environment differences.

**CRITICAL RULE: You MUST run the reproduction workflow before proposing any fixes. No exceptions.**

## When to Use

- CI test fails on `main` branch after a PR was merged
- Test passes locally but fails in CI
- Test failure labeled as "flaky" or intermittent
- E2E or integration test timing out in CI only

## MANDATORY First Steps

**YOU MUST EXECUTE THESE COMMANDS. Reading code or analyzing logs does NOT count as reproduction.**

1. **Extract** the suite name, test name, and error from the CI logs
2. **EXECUTE**: Kill any process on port 3000 to avoid conflicts
3. **EXECUTE**: `pnpm dev $SUITE_NAME` (use run_in_background=true)
4. **EXECUTE**: Wait for the server to be ready (check with curl or sleep)
5. **EXECUTE**: Run the specific failing test with Playwright directly: `npx playwright test test/$SUITE_NAME/e2e.spec.ts --headed -g "exact test description"`
6. **If the test passes**, **EXECUTE**: `pnpm prepare-run-test-against-prod`
7. **EXECUTE**: `pnpm dev:prod $SUITE_NAME` and run the test again

**Only after EXECUTING these commands and seeing their output** can you proceed to analysis and fixes.

**"Analysis from logs" is NOT reproduction. You must RUN the commands.**

## Core Workflow

```dot
digraph triage_ci {
  "CI failure reported" [shape=box];
  "Extract details from CI logs" [shape=box];
  "Identify suite and test name" [shape=box];
  "Run dev server: pnpm dev $SUITE" [shape=box];
  "Run specific test by name" [shape=box];
  "Did test fail?" [shape=diamond];
  "Debug with dev code" [shape=box];
  "Run prepare-run-test-against-prod" [shape=box];
  "Run: pnpm dev:prod $SUITE" [shape=box];
  "Run specific test again" [shape=box];
  "Did test fail now?" [shape=diamond];
  "Debug bundling issue" [shape=box];
  "Unable to reproduce - check logs" [shape=box];
  "Fix and verify" [shape=box];

  "CI failure reported" -> "Extract details from CI logs";
  "Extract details from CI logs" -> "Identify suite and test name";
  "Identify suite and test name" -> "Run dev server: pnpm dev $SUITE";
  "Run dev server: pnpm dev $SUITE" -> "Run specific test by name";
  "Run specific test by name" -> "Did test fail?";
  "Did test fail?" -> "Debug with dev code" [label="yes"];
  "Did test fail?" -> "Run prepare-run-test-against-prod" [label="no"];
  "Run prepare-run-test-against-prod" -> "Run: pnpm dev:prod $SUITE";
  "Run: pnpm dev:prod $SUITE" -> "Run specific test again";
  "Run specific test again" -> "Did test fail now?";
  "Did test fail now?" -> "Debug bundling issue" [label="yes"];
  "Did test fail now?" -> "Unable to reproduce - check logs" [label="no"];
  "Debug with dev code" -> "Fix and verify";
  "Debug bundling issue" -> "Fix and verify";
}
```

## Step-by-Step Process

### 1. Extract CI Details

From the CI logs or GitHub Actions URL, identify:

- **Suite name**: Directory name (e.g., `i18n`, `fields`, `lexical`)
- **Test file**: Full path (e.g., `test/i18n/e2e.spec.ts`)
- **Test name**: Exact test description
- **Error message**: Full stack trace
- **Test type**: E2E (Playwright) or integration (Vitest)
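If the failure came in as a GitHub Actions URL, the GitHub CLI can pull the failing step's log without leaving the terminal. A minimal sketch, assuming `gh` is installed and authenticated for the repo; the run ID placeholder and grep pattern are illustrative, not part of this workflow's required tooling:

```bash
# Fetch only the logs of failed steps for the run (the ID comes from the Actions URL)
gh run view RUN_ID --log-failed > /tmp/ci-failure.log

# Pull out the spec path, test title, and error lines to fill in the details above
grep -E "e2e\.spec\.ts|Error|Timeout" /tmp/ci-failure.log | head -50
```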
### 2. Reproduce with Dev Code

**CRITICAL: Always run the specific test by name, not the full suite.**

**SERVER MANAGEMENT RULES:**

1. **ALWAYS kill all servers before starting a new one**
2. **NEVER assume ports are free**
3. **ALWAYS wait for server ready confirmation before running tests**

```bash
# ========================================
# STEP 2A: STOP ALL SERVERS
# ========================================
lsof -ti:3000 | xargs kill -9 2>/dev/null || echo "Port 3000 clear"

# ========================================
# STEP 2B: START DEV SERVER
# ========================================
# Start dev server with the suite (in background with run_in_background=true)
pnpm dev $SUITE_NAME

# ========================================
# STEP 2C: WAIT FOR SERVER READY
# ========================================
# Wait for server to be ready (REQUIRED - do not skip)
until curl -s http://localhost:3000/admin > /dev/null 2>&1; do sleep 1; done && echo "Server ready"

# ========================================
# STEP 2D: RUN SPECIFIC TEST
# ========================================
# Run ONLY the specific failing test using Playwright directly
# For E2E tests (DO NOT use pnpm test:e2e as it spawns its own server):
pnpm exec playwright test test/$SUITE_NAME/e2e.spec.ts -g "exact test name"

# For integration tests:
pnpm test:int $SUITE_NAME -t "exact test name"
```

**Did the test fail?**

- ✅ **YES**: You reproduced it! Proceed to debug with dev code.
- ❌ **NO**: Continue to step 3 (bundled code test).

### 3. Reproduce with Bundled Code

If the test passed with dev code, the issue is likely in bundled/production code.

**IMPORTANT: You MUST stop the dev server before starting the prod server.**

```bash
# ========================================
# STEP 3A: STOP ALL SERVERS (INCLUDING DEV SERVER FROM STEP 2)
# ========================================
lsof -ti:3000 | xargs kill -9 2>/dev/null || echo "Port 3000 clear"

# ========================================
# STEP 3B: BUILD AND PACK FOR PROD
# ========================================
# Build all packages and pack them (this takes time - be patient)
pnpm prepare-run-test-against-prod

# ========================================
# STEP 3C: START PROD SERVER
# ========================================
# Start prod dev server (in background with run_in_background=true)
pnpm dev:prod $SUITE_NAME

# ========================================
# STEP 3D: WAIT FOR SERVER READY
# ========================================
# Wait for server to be ready (REQUIRED - do not skip)
until curl -s http://localhost:3000/admin > /dev/null 2>&1; do sleep 1; done && echo "Server ready"

# ========================================
# STEP 3E: RUN SPECIFIC TEST
# ========================================
# Run the specific test again using Playwright directly
pnpm exec playwright test test/$SUITE_NAME/e2e.spec.ts -g "exact test name"

# OR for integration tests:
pnpm test:int $SUITE_NAME -t "exact test name"
```

**Did the test fail now?**

- ✅ **YES**: Bundling or production build issue. Look for:
  - Missing exports in package.json
  - Build configuration problems
  - Code that behaves differently when bundled
- ❌ **NO**: Unable to reproduce locally. Proceed to step 4.
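When the bundled run fails but the dev run passes, it can help to confirm what actually ships in the packed packages before digging into build config. A minimal sketch, assuming `pnpm prepare-run-test-against-prod` leaves packed tarballs in the repo; the package name and paths are hypothetical, so adjust them to wherever the script writes its output:

```bash
# List the contents of a packed tarball to confirm the compiled files you expect are present
tar -tzf path/to/packed-package.tgz | head -40

# Print the exports map the bundler will resolve against, to spot missing subpath exports
node -p "JSON.stringify(require('./packages/PACKAGE_NAME/package.json').exports, null, 2)"
```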
### 4. Unable to Reproduce

If you cannot reproduce locally after both attempts:

- Review CI logs more carefully for environment differences
- Check for race conditions (run the test multiple times: `for i in {1..10}; do pnpm test:e2e...; done`)
- Look for CI-specific constraints (memory, CPU, timing)
- Consider if it's a true race condition that's highly timing-dependent

## Common Flaky Test Patterns

### Race Conditions

- Page navigating while assertions run
- Network requests not settled before assertions
- State updates not completed

**Fix patterns:**

- Use Playwright's web-first assertions (`toBeVisible()`, `toHaveText()`)
- Wait for specific conditions, not arbitrary timeouts
- Use `waitForFunction()` with condition checks

### Test Pollution

- Tests leaving data in the database
- Shared state between tests
- Missing cleanup in `afterEach`

**Fix patterns:**

- Track created IDs and clean up in `afterEach` (see the cleanup sketch after the Common Mistakes table below)
- Use isolated test data
- Don't use `deleteAll` in a way that affects other tests

### Timing Issues

- `setTimeout`/`sleep` instead of condition-based waiting
- Not waiting for page stability
- Animations/transitions not complete

**Fix patterns:**

- Use the `waitForPageStability()` helper
- Wait for specific DOM states
- Use Playwright's built-in waiting mechanisms (see the sketch after the linting notes below)

## Linting Considerations

When fixing e2e tests, be aware of these eslint rules:

- `playwright/no-networkidle` - Avoid `waitForLoadState('networkidle')` (use condition-based waiting instead)
- `payload/no-wait-function` - Avoid custom `wait()` functions (use Playwright's built-in waits)
- `payload/no-flaky-assertions` - Avoid non-retryable assertions
- `playwright/prefer-web-first-assertions` - Use built-in Playwright assertions

**Existing code may violate these rules** - when adding new code, follow the rules even if existing code doesn't.
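To make the fix patterns and lint rules above concrete, here is a minimal sketch of a lint-compliant wait: web-first assertions that retry on their own, plus `waitForFunction()` for a condition that cannot be expressed as a locator. The URL, selectors, and test title are hypothetical and should be adapted to the suite under triage:

```ts
import { expect, test } from '@playwright/test'

test('saves the document and shows a success toast', async ({ page }) => {
  await page.goto('http://localhost:3000/admin/collections/posts/create')

  // Web-first assertion: retries until the element appears instead of
  // asserting against a one-off snapshot of the DOM.
  await expect(page.locator('#field-title')).toBeVisible()

  await page.locator('#field-title').fill('Hello world')
  await page.locator('#action-save').click()

  // Wait for a specific condition rather than waitForLoadState('networkidle')
  // or an arbitrary sleep - here, the URL changing once the document has an ID.
  await page.waitForFunction(() => !window.location.pathname.endsWith('/create'))

  // Retryable assertion on the toast text.
  await expect(page.locator('.toast-success')).toContainText('successfully')
})
```

Because every wait here is tied to an observable condition, the same spec behaves the same on a fast local machine and a slow CI runner.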
## Verification

After fixing:

```bash
# Ensure the dev server is running on port 3000

# Run the test multiple times to confirm stability
for i in {1..10}; do
  pnpm exec playwright test test/$SUITE_NAME/e2e.spec.ts -g "exact test name" || break
done

# Run the full suite
pnpm exec playwright test test/$SUITE_NAME/e2e.spec.ts

# If you modified bundled code, test with the prod build
lsof -ti:3000 | xargs kill -9 2>/dev/null
pnpm prepare-run-test-against-prod
pnpm dev:prod $SUITE_NAME
until curl -s http://localhost:3000/admin > /dev/null; do sleep 1; done
pnpm exec playwright test test/$SUITE_NAME/e2e.spec.ts
```

## The Iron Law

**NO FIX WITHOUT REPRODUCTION FIRST**

If you propose a fix before completing steps 1-3 of the workflow, you've violated this skill.

**This applies even when:**

- The fix seems obvious from the logs
- You've seen this error before
- There is time pressure from the team
- You're confident about the root cause
- The logs show clear stack traces

**No exceptions. Run the reproduction workflow first.**

## Rationalization Table

Every excuse for skipping reproduction, and why it's wrong:

| Rationalization                      | Reality                                        |
| ------------------------------------ | ---------------------------------------------- |
| "The logs show the exact error"      | Logs show symptoms, not root cause. Reproduce. |
| "I can see the problem in the code"  | You're guessing. Reproduce to confirm.         |
| "This is obviously a race condition" | Maybe. Reproduce to be sure.                   |
| "I've seen this error before"        | This might be different. Reproduce.            |
| "The stack trace is clear"           | Stack trace shows where, not why. Reproduce.   |
| "Time pressure - need to fix fast"   | Reproducing IS fast. Guessing wastes time.     |
| "The test file shows the issue"      | Reading ≠ running. Execute the commands.       |
| "I'll analyze the code first"        | Code analysis comes AFTER reproduction.        |
| "Let me investigate the root cause"  | Reproduction IS the investigation.             |
| "I need to understand the error"     | Understanding comes from seeing it fail.       |

## Red Flags - STOP

These thoughts mean you're about to violate the skill:

- "Let me just fix the obvious issue" → STOP. Execute reproduction first.
- "I'll increase the timeout" → STOP. Find the root cause via reproduction.
- "This looks like X from the logs" → STOP. Run the test.
- "I'll run the full suite" → STOP. Run the specific test first.
- "Let me check the code" → STOP. Run reproduction first.
- "I can see the problem in the stack trace" → STOP. Reproduce to confirm.
- "The logs are clear enough" → STOP. Execute the commands.
- "Let me analyze..." → STOP. Analysis comes after reproduction.

**All of these mean: Stop immediately. Execute the reproduction commands. No analysis without execution.**

## Example Session

**CORRECT approach - executes commands first:**

```
User: CI failed on i18n suite - https://github.com/.../actions/runs/XXX

Agent:
1. Extracted details from CI logs:
   - Suite: i18n
   - Test: "ensure translations update correctly when switching language"
   - File: test/i18n/e2e.spec.ts:124
   - Error: "Error loading language: en" TypeError: network error

2. Executing reproduction workflow:
   [Uses Bash tool to kill ports]
   $ lsof -ti:3000 | xargs kill -9 2>/dev/null || echo "Port 3000 clear"
   [Uses Bash tool with run_in_background=true]
   $ pnpm dev i18n
   [Waits for server to start with curl check]
   $ until curl -s http://localhost:3000/admin > /dev/null; do sleep 1; done && echo "Server ready"
   [Uses Bash tool to run Playwright directly]
   $ pnpm exec playwright test test/i18n/e2e.spec.ts -g "ensure translations update correctly when switching language"
   Output shows: Test PASSED locally

3. Test passed with dev code, trying bundled code:
   [Kills ports again]
   $ lsof -ti:3000 | xargs kill -9 2>/dev/null
   [Uses Bash tool]
   $ pnpm prepare-run-test-against-prod
   [Uses Bash tool with run_in_background=true]
   $ pnpm dev:prod i18n
   [Waits for server]
   $ until curl -s http://localhost:3000/admin > /dev/null; do sleep 1; done
   [Uses Bash tool]
   $ pnpm exec playwright test test/i18n/e2e.spec.ts -g "ensure translations update correctly when switching language"
   Output shows: Test FAILED - reproduced!

4. Now that I've reproduced it, analyzing the failure...
   [Code analysis, root cause identification]

5. Implementing fix: [specific changes]

6. Verification:
   [Uses Bash tool]
   $ for i in {1..10}; do pnpm exec playwright test test/i18n/e2e.spec.ts -g "exact test"; done
   All 10 runs passed!
```

**INCORRECT approach - analyzes without executing:**

```
User: CI failed on i18n suite

Agent: Looking at the error logs, I can see this is a race condition...
[Analyzes code, proposes fix without running any commands]

❌ WRONG - This violates the skill. Must execute reproduction commands first.
```

## Common Mistakes

| Mistake                             | Fix                                                       |
| ----------------------------------- | --------------------------------------------------------- |
| Running the full test suite first   | Run the specific test by name                              |
| Skipping dev code reproduction      | Always try dev code first                                  |
| Not testing with bundled code       | If dev passes, test with `prepare-run-test-against-prod`   |
| Proposing a fix without reproducing | Follow the workflow - reproduce first                      |
| Using `networkidle` in new code     | Use condition-based waiting with `waitForFunction()`       |
| Adding arbitrary `wait()` calls     | Use Playwright's built-in assertions and waits             |
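One pattern that is easy to get wrong while correcting these mistakes is test isolation: track the IDs a test creates and remove only those in `afterEach`, rather than deleting everything. A minimal sketch of that earlier "Test Pollution" fix pattern; the REST endpoint, collection slug, and base URL are hypothetical and should be replaced with the suite's existing helpers:

```ts
import { request, test } from '@playwright/test'

const createdIds: string[] = []

test.afterEach(async () => {
  // Delete only the documents this test created, instead of a blanket deleteAll
  // that can race with parallel workers or wipe another test's fixtures.
  const api = await request.newContext({ baseURL: 'http://localhost:3000' })
  for (const id of createdIds) {
    await api.delete(`/api/posts/${id}`)
  }
  createdIds.length = 0
  await api.dispose()
})

test('creates a post without polluting other tests', async ({ page }) => {
  // ...create the document through the UI or API, then record its ID so
  // afterEach can remove it:
  // createdIds.push(newDocId)
})
```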
## Key Principles

1. **Reproduce before fixing**: Never propose a fix without reproducing the issue
2. **Test specifically**: Run the exact failing test, not the full suite
3. **Dev first, prod second**: Check dev code before bundled code
4. **Follow the workflow**: No shortcuts - the steps exist to save time
5. **Verify stability**: Run tests multiple times to confirm the fix

## Completion: Creating a PR

**After you have:**

1. ✅ Reproduced the issue
2. ✅ Implemented a fix
3. ✅ Verified the fix passes locally (multiple runs)
4. ✅ Tested with the prod build (if applicable)

**You MUST prompt the user to create a PR:**

```
The fix has been verified and is ready for review. Would you like me to create a PR with these changes?

Summary of changes:
- [List files modified]
- [Brief description of the fix]
- [Verification results]
```

**IMPORTANT:**

- **DO NOT automatically create a PR** - always ask the user first
- Provide a clear summary of what was changed and why
- Include verification results (number of test runs, pass rate)
- Let the user decide whether to create the PR immediately or make additional changes first

This ensures the user has visibility and control over what gets submitted for review.