---
name: analyzing-test-effectiveness
description: Use to audit test quality with Google Fellow SRE scrutiny - identifies tautological tests, coverage gaming, weak assertions, missing corner cases. Creates bd epic with tasks for improvements, then runs SRE task refinement on each.
---

Audit test suites for real effectiveness, not vanity metrics. Identify tests that provide false confidence (tautological, mock-testing, line hitters) and missing corner cases. Create bd epic with tracked tasks for improvements. Run SRE task refinement on each task before execution.

**CRITICAL MINDSET: Assume tests were written by junior engineers optimizing for coverage metrics.** Default to skeptical—a test is RED or YELLOW until proven GREEN. You MUST read production code before categorizing tests. GREEN is the exception, not the rule.

MEDIUM FREEDOM - Follow the 8-phase analysis process exactly. Categorization criteria (RED/YELLOW/GREEN) are rigid. Corner case discovery adapts to the specific codebase. Output format is flexible but must include all sections.

| Phase | Action | Output |
|-------|--------|--------|
| 1. Inventory | List all test files and functions | Test catalog |
| 2. Read Production Code | Read the actual code each test claims to test | Context for analysis |
| 3. Trace Call Paths | Verify tests exercise production, not mocks/utilities | Call path verification |
| 4. Categorize (Skeptical) | Apply RED/YELLOW/GREEN - default to harsher rating | Categorized tests |
| 5. Self-Review | Challenge every GREEN - would a senior SRE agree? | Validated categories |
| 6. Corner Cases | Identify missing edge cases per module | Gap analysis |
| 7. Prioritize | Rank by business criticality | Priority matrix |
| 8. bd Issues | Create epic + tasks, run SRE refinement | Tracked improvement plan |

**MANDATORY: Read production code BEFORE categorizing tests. You cannot assess a test without understanding what it claims to test.**

**Core Questions for Each Test:**
1. What bug would this catch? (If you can't name one → RED)
2. Does it exercise PRODUCTION code or a mock/test utility? (Mock → RED or YELLOW)
3. Could code break while the test passes? (If yes → YELLOW or RED)
4. Meaningful assertion on PRODUCTION output? (`!= nil` or testing fixtures → weak)

**bd Integration (MANDATORY):**
- Create bd epic for test quality improvement
- Create bd tasks for: remove RED, strengthen YELLOW, add corner cases
- Run hyperpowers:sre-task-refinement on all tasks
- Link tasks to epic with dependencies

**Mutation Testing Validation:**
- Java: Pitest (`mvn org.pitest:pitest-maven:mutationCoverage`)
- JS/TS: Stryker (`npx stryker run`)
- Python: mutmut (`mutmut run`)
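Mutation testing turns the skeptical mindset into evidence: a surviving mutant points directly at a weak or tautological test. As a minimal sketch (hypothetical `applyDiscount` function, Jest syntax) of how a weak assertion lets a mutant survive while a precise one kills it:

```typescript
// Hypothetical production code (discount.ts)
export function applyDiscount(total: number, percent: number): number {
  return total - total * (percent / 100);
}

// A mutation tool might flip the operator to: total + total * (percent / 100)

// ❌ Mutant SURVIVES: the assertion never pins down a value
test('applyDiscount returns a value', () => {
  expect(applyDiscount(200, 20)).not.toBeNull(); // passes for original AND mutant
});

// ✅ Mutant KILLED: the exact computed value is asserted
test('applyDiscount subtracts the percentage', () => {
  expect(applyDiscount(200, 20)).toBe(160); // 200 - 20% = 160; fails if operator flips
});
```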
**Use this skill when:**
- Production bugs appear despite high test coverage
- Suspecting coverage gaming or tautological tests
- Before major refactoring (ensure tests catch regressions)
- Onboarding to unfamiliar codebase (assess test quality)
- After hyperpowers:review-implementation flags test quality issues
- Planning test improvement initiatives

**Don't use when:**
- Writing new tests (use hyperpowers:test-driven-development)
- Debugging test failures (use hyperpowers:debugging-with-tools)
- Just need to run tests (use hyperpowers:test-runner agent)

## Announcement

**Announce:** "I'm using hyperpowers:analyzing-test-effectiveness to audit test quality with Google Fellow SRE-level scrutiny."

---

## Phase 1: Test Inventory

**Goal:** Create complete catalog of tests to analyze.

```bash
# Find all test files (adapt pattern to language)
fd -e test.ts -e spec.ts -e _test.go -e Test.java -e test.py .

# Or use grep to find test functions
rg "func Test|it\(|test\(|def test_|@Test" --type-add 'test:*test*' -t test

# Count tests per module (matching lines approximate the test count)
for dir in src/*/; do
  count=$(rg "func Test|it\(" "$dir" 2>/dev/null | wc -l)
  echo "$dir: $count tests"
done
```

**Create inventory TodoWrite:**

```
- Analyze tests in src/auth/
- Analyze tests in src/api/
- Analyze tests in src/parser/
[... one per module]
```

---

## Phase 2: Read Production Code First

**MANDATORY: Before categorizing ANY test, you MUST:**

1. **Read the production code** the test claims to exercise
2. **Understand what the production code actually does**
3. **Trace the test's call path** to verify it reaches production code

**Why this matters:** Junior engineers commonly:
- Create test utilities and test THOSE instead of production code
- Set up mocks that determine the test outcome (mock-testing-mock)
- Write assertions on values defined IN THE TEST, not from production
- Copy patterns from examples without understanding the actual code

**If you haven't read production code, you WILL miscategorize tests as GREEN when they're YELLOW or RED.**
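As a minimal sketch of what call-path tracing catches (hypothetical names, invented for illustration): the test below imports a reimplemented helper from the test tree, so the production validator is never executed and could be arbitrarily broken:

```typescript
// tests/helpers.ts — a test utility that REIMPLEMENTS production logic
export function validateEmail(email: string): boolean {
  return email.includes('@'); // simplified copy that can silently drift from production
}

// tests/email.test.ts
import { validateEmail } from './helpers'; // ❌ resolves to the test utility
// import { validateEmail } from '../src/validators'; // production path, never exercised

test('validateEmail accepts a valid address', () => {
  expect(validateEmail('user@example.com')).toBe(true);
  // Production validateEmail could reject every email and this still passes
});
```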
---

## Phase 3: Categorize Each Test (Skeptical Default)

**Assume every test is RED or YELLOW until you have concrete evidence it's GREEN.**

For each test, apply these criteria:

### RED FLAGS - Must Remove or Replace

**2.1 Tautological Tests** (pass by definition)

```typescript
// ❌ RED: Verifies non-optional return is not nil
test('builder returns value', () => {
  const result = new Builder().build();
  expect(result).not.toBeNull(); // Always passes - return type guarantees this
});

// ❌ RED: Verifies enum has cases (compiler checks this)
test('status enum has values', () => {
  expect(Object.values(Status).length).toBeGreaterThan(0);
});

// ❌ RED: Duplicates implementation
test('add returns sum', () => {
  expect(add(2, 3)).toBe(2 + 3); // Tautology: testing 2+3 == 2+3
});
```

**Detection patterns:**

```bash
# Find != nil / != null on non-optional types
rg "expect\(.*\)\.not\.toBeNull|assertNotNull|!= nil" tests/

# Find enum existence checks
rg "Object\.values.*length|cases\.count" tests/

# Find tests with no meaningful assertions
rg -l "expect\(" tests/ | xargs -I {} sh -c 'grep -c "expect" {} | grep -q "^1$" && echo {}'
```

**2.2 Mock-Testing Tests** (test the mock, not production)

```typescript
// ❌ RED: Only verifies mock was called, not actual behavior
test('service fetches data', () => {
  const mockApi = { fetch: jest.fn().mockResolvedValue({ data: [] }) };
  const service = new Service(mockApi);
  service.getData();
  expect(mockApi.fetch).toHaveBeenCalled(); // Tests mock, not service logic
});

// ❌ RED: Mock determines test outcome
test('processor handles data', () => {
  const mockParser = { parse: jest.fn().mockReturnValue({ valid: true }) };
  const result = processor.process(mockParser);
  expect(result.valid).toBe(true); // Just returns what mock returns
});
```

**Detection patterns:**

```bash
# Find tests that only verify mock calls
rg "toHaveBeenCalled|verify\(mock|\.called" tests/

# Find heavy mock setup
rg -c "mock|Mock|jest\.fn|stub" tests/ | sort -t: -k2 -nr | head -20
```

**2.3 Line Hitters** (execute without asserting)

```typescript
// ❌ RED: Calls function, doesn't verify outcome
test('processor runs', () => {
  const processor = new Processor();
  processor.run(); // No assertion - just verifies no crash
});

// ❌ RED: Assertion is trivial
test('config loads', () => {
  const config = loadConfig();
  expect(config).toBeDefined(); // Too weak - doesn't verify correct values
});
```

**Detection patterns:**

```bash
# Find tests with 0-1 assertions
rg -l "test\(|it\(" tests/ | while read f; do
  assertions=$(rg -c "expect|assert" "$f" 2>/dev/null || echo 0)
  tests=$(rg -c "test\(|it\(" "$f" 2>/dev/null || echo 1)
  ratio=$((assertions / tests))
  [ "$ratio" -lt 2 ] && echo "$f: low assertion ratio ($assertions assertions, $tests tests)"
done
```

**2.4 Evergreen/Liar Tests** (always pass)

```typescript
// ❌ RED: Catches and ignores exceptions
test('parser handles input', () => {
  try {
    parser.parse(input);
    expect(true).toBe(true); // Always passes
  } catch (e) {
    // Swallowed - test passes even on exception
  }
});

// ❌ RED: Test setup bypasses code under test
test('validator validates', () => {
  const validator = new Validator({ skipValidation: true }); // Oops
  expect(validator.validate(badInput)).toBe(true);
});
```

### YELLOW FLAGS - Must Strengthen

**2.5 Happy Path Only**

```typescript
// ⚠️ YELLOW: Only tests valid input
test('parse valid json', () => {
  const result = parse('{"name": "test"}');
  expect(result.name).toBe('test');
});
// Missing: empty string, malformed JSON, deeply nested, unicode, huge payload
```

**2.6 Weak Assertions**

```typescript
// ⚠️ YELLOW: Assertion too weak
test('fetch returns data', async () => {
  const result = await fetch('/api/users');
  expect(result).not.toBeNull(); // Should verify actual content
  expect(result.length).toBeGreaterThan(0); // Should verify exact count or specific items
});
```

**2.7 Partial Coverage**

```typescript
// ⚠️ YELLOW: Tests success, not failure
test('create user succeeds', () => {
  const user = createUser({ name: 'test', email: 'test@example.com' });
  expect(user.id).toBeDefined();
});
// Missing: duplicate email, invalid email, missing fields, database error
```
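To make the upgrade path concrete, here is a minimal sketch of strengthening the 2.7 test toward GREEN. It assumes, purely hypothetically, that `createUser` normalizes email case and throws a `DuplicateEmailError` on duplicates — verify the real production behavior before copying:

```typescript
// ✅ Strengthened: exact value instead of "is defined"
test('createUser normalizes email to lowercase', () => {
  const user = createUser({ name: 'test', email: 'Test@Example.COM' });
  expect(user.email).toBe('test@example.com'); // exact normalized value
});

// ✅ Strengthened: the failure path the YELLOW test ignored
test('createUser rejects duplicate email', () => {
  createUser({ name: 'a', email: 'dup@example.com' });
  expect(() => createUser({ name: 'b', email: 'dup@example.com' }))
    .toThrow(DuplicateEmailError); // exact production error type
});
```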
### GREEN FLAGS - Exceptional Quality Required

**GREEN is the EXCEPTION, not the rule.** A test is GREEN only if ALL of the following are true:

1. **Exercises actual PRODUCTION code** - Not a mock, not a test utility, not a copy of logic
2. **Has precise assertions** - Exact values, not `!= nil` or `> 0`
3. **Would fail if production breaks** - You can name the specific bug it catches
4. **Tests behavior, not implementation** - Won't break on valid refactoring

**Before marking ANY test GREEN, you MUST state:**
- "This test exercises [specific production code path]"
- "It would catch [specific bug] because [reason]"
- "The assertion verifies [exact production behavior], not a test fixture"

**If you cannot fill in those blanks, the test is YELLOW at best.**

**3.1 Behavior Verification (Must exercise PRODUCTION code)**

```typescript
// ✅ GREEN: Verifies specific behavior with exact values FROM PRODUCTION
test('calculateTotal applies discount correctly', () => {
  const cart = new Cart([{ price: 100, quantity: 2 }]); // Real Cart class
  cart.applyDiscount('SAVE20'); // Real discount logic
  expect(cart.total).toBe(160); // 200 - 20% = 160
});
// GREEN because: Exercises Cart.applyDiscount production code
// Would catch: Discount calculation bugs, rounding errors
// Assertion: Verifies exact computed value from production
```

**3.2 Edge Case Coverage (Must test PRODUCTION paths)**

```typescript
// ✅ GREEN: Tests boundary conditions IN PRODUCTION CODE
test('username rejects empty string', () => {
  expect(() => new User({ username: '' })).toThrow(ValidationError);
});
// GREEN because: Exercises User constructor validation (production)
// Would catch: Missing empty string validation
// Assertion: Exact error type from production code

test('username handles unicode', () => {
  const user = new User({ username: '日本語ユーザー' });
  expect(user.username).toBe('日本語ユーザー');
});
// GREEN because: Exercises User constructor and storage (production)
// Would catch: Unicode corruption, encoding bugs
// Assertion: Exact value preserved through production code
```

**3.3 Error Path Testing (Must verify PRODUCTION errors)**

```typescript
// ✅ GREEN: Verifies error handling IN PRODUCTION CODE
test('fetch returns specific error on 404', async () => {
  mockServer.get('/api/user/999').reply(404); // External mock OK
  await expect(fetchUser(999)).rejects.toThrow(UserNotFoundError);
});
// GREEN because: Exercises fetchUser error handling (production)
// Would catch: Wrong error type, swallowed errors
// Assertion: Exact error type from production code
```

**CAUTION:** A test that uses mocks for EXTERNAL dependencies (APIs, databases) can still be GREEN if it exercises PRODUCTION logic. A test that mocks the code under test is RED.

---

## Phase 4: Mandatory Self-Review

**Before finalizing ANY categorization, complete this checklist:**

### For each GREEN test:
- [ ] Did I read the PRODUCTION code this test exercises?
- [ ] Does the test call PRODUCTION code or a test utility/mock?
- [ ] Can I name the SPECIFIC BUG this test would catch?
- [ ] If production code broke, would this test DEFINITELY fail?
- [ ] Am I being too generous because the test "looks reasonable"?

### For each YELLOW test:
- [ ] Should this actually be RED? Is there ANY bug-catching value here?
- [ ] Is the weakness fundamental (tests a mock) or fixable (weak assertion)?
- [ ] If I changed this to RED, would I lose any bug-catching ability?

### Self-Challenge Questions:
- "If a junior engineer showed me this test, would I accept it as GREEN?"
- "Am I marking this GREEN because I want to be done, or because it's genuinely good?"
- "Could I defend this GREEN classification to a Google SRE?"

**If you have ANY doubt about a GREEN, downgrade to YELLOW.**
**If you have ANY doubt about a YELLOW, consider RED.**

**Common mistakes that cause false GREENs:**
- Assuming a well-named test tests what its name says (verify the code!)
- Trusting test comments (comments lie, code doesn't)
- Not tracing mock/utility usage to see what's actually exercised
- Giving benefit of the doubt (junior engineers don't deserve it)
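A minimal sketch of the kind of false GREEN this checklist exists to catch (hypothetical `formatPrice`, invented for illustration): the test is well named and the assertion looks precise, but the expected value is computed by the very function under test, so the test can never fail:

```typescript
// ❌ False GREEN: f(x) === f(x) is a tautology dressed up as a precise assertion
test('formatPrice renders currency correctly', () => {
  const expected = formatPrice(19.99); // expected value produced BY production code
  expect(formatPrice(19.99)).toBe(expected); // passes no matter how broken formatPrice is
});

// ✅ Actually GREEN: the expected value comes from the spec, not from production
test('formatPrice renders currency correctly', () => {
  expect(formatPrice(19.99)).toBe('$19.99'); // fails if formatting breaks
});
```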
---

## Phase 4b: Line-by-Line Justification for RED/YELLOW

**MANDATORY: For every RED or YELLOW classification, provide detailed justification.**

This forces you to verify your classification is correct by explaining exactly WHY the test is problematic.

### Required Format for RED/YELLOW Tests:

```markdown
### [Test Name] - RED/YELLOW

**Test code (file:lines):**
- Line X: `code` - [what this line does]
- Line Y: `code` - [what this line does]
- Line Z: `assertion` - [what this asserts]

**Production code it claims to test (file:lines):**
- [Brief description of what production code does]

**Why RED/YELLOW:**
- [Specific reason with line references]
- [What bug could slip through despite this test passing]
```

### Example RED Justification:

```markdown
### testAuthWorks - RED (Tautological)

**Test code (auth_test.ts:45-52):**
- Line 46: `const auth = new AuthService()` - Creates auth instance
- Line 47: `const result = auth.login('user', 'pass')` - Calls login
- Line 48: `expect(result).not.toBeNull()` - Asserts result exists

**Production code (auth.ts:78-95):**
- login() returns AuthResult object (never null by TypeScript types)

**Why RED:**
- Line 48 asserts `!= null` but TypeScript guarantees non-null return
- If login returned {success: false, error: "invalid"}, test still passes
- Bug example: Wrong password accepted → returns {success: true} → test passes
```

### Example YELLOW Justification:

```markdown
### testParseJson - YELLOW (Weak Assertion)

**Test code (parser_test.ts:23-30):**
- Line 24: `const input = '{"name": "test"}'` - Valid JSON input
- Line 25: `const result = parse(input)` - Calls production parser
- Line 26: `expect(result).toBeDefined()` - Asserts result exists
- Line 27: `expect(result.name).toBe('test')` - Verifies one field

**Production code (parser.ts:12-45):**
- parse() handles JSON parsing with error handling and validation

**Why YELLOW:**
- Lines 26-27 only test happy path with valid input
- Missing: malformed JSON, empty string, deeply nested, unicode
- Bug example: parse('') throws unhandled exception → not caught by test
- Upgrade path: Add edge case inputs with specific error assertions
```

### Why This Matters:

Writing the justification FORCES you to:
1. Actually read the test code line by line
2. Actually read the production code
3. Articulate the specific gap
4. Consider what bugs could slip through

**If you cannot write this justification, you haven't done the analysis properly.**

---

## Phase 5: Corner Case Discovery

For each module, identify missing corner case tests:

### Input Validation Corner Cases

| Category | Examples | Tests to Add |
|----------|----------|--------------|
| Empty values | `""`, `[]`, `{}`, `null` | test_empty_X_rejected/handled |
| Boundary values | 0, -1, MAX_INT, MAX_LEN | test_boundary_X_handled |
| Unicode | RTL, emoji, combining chars, null byte | test_unicode_X_preserved |
| Injection | SQL: `'; DROP`, XSS: `