---
name: analyzing-test-effectiveness
description: Use to audit test quality with Google Fellow SRE scrutiny - identifies tautological tests, coverage gaming, weak assertions, missing corner cases. Creates bd epic with tasks for improvements, then runs SRE task refinement on each.
---
Audit test suites for real effectiveness, not vanity metrics. Identify tests that provide false confidence (tautological, mock-testing, line hitters) and missing corner cases. Create bd epic with tracked tasks for improvements. Run SRE task refinement on each task before execution.
**CRITICAL MINDSET: Assume tests were written by junior engineers optimizing for coverage metrics.** Default to skeptical—a test is RED or YELLOW until proven GREEN. You MUST read production code before categorizing tests. GREEN is the exception, not the rule.
MEDIUM FREEDOM - Follow the 8-step analysis process in the table below exactly. Categorization criteria (RED/YELLOW/GREEN) are rigid. Corner case discovery adapts to the specific codebase. Output format is flexible but must include all sections.
| Phase | Action | Output |
|-------|--------|--------|
| 1. Inventory | List all test files and functions | Test catalog |
| 2. Read Production Code | Read the actual code each test claims to test | Context for analysis |
| 3. Trace Call Paths | Verify tests exercise production, not mocks/utilities | Call path verification |
| 4. Categorize (Skeptical) | Apply RED/YELLOW/GREEN - default to harsher rating | Categorized tests |
| 5. Self-Review | Challenge every GREEN - would a senior SRE agree? | Validated categories |
| 6. Corner Cases | Identify missing edge cases per module | Gap analysis |
| 7. Prioritize | Rank by business criticality | Priority matrix |
| 8. bd Issues | Create epic + tasks, run SRE refinement | Tracked improvement plan |
**MANDATORY: Read production code BEFORE categorizing tests. You cannot assess a test without understanding what it claims to test.**
**Core Questions for Each Test:**
1. What bug would this catch? (If you can't name one → RED)
2. Does it exercise PRODUCTION code or a mock/test utility? (Mock → RED or YELLOW)
3. Could code break while test passes? (If yes → YELLOW or RED)
4. Meaningful assertion on PRODUCTION output? (`!= nil` or testing fixtures → weak)
**bd Integration (MANDATORY):**
- Create bd epic for test quality improvement
- Create bd tasks for: remove RED, strengthen YELLOW, add corner cases
- Run hyperpowers:sre-task-refinement on all tasks
- Link tasks to epic with dependencies
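A hypothetical command sequence for the steps above. The `bd` subcommand names and flags are assumptions, not the verified interface; check `bd --help` before running:
```bash
# Hypothetical bd invocations (verify subcommands/flags against your bd version)
bd create "Test quality improvement" --type epic
bd create "Remove RED tautological tests in src/auth" --type task
bd dep add <task-id> <epic-id>   # link the task under the epic
```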
**Mutation Testing Validation:**
- Java: Pitest (`mvn org.pitest:pitest-maven:mutationCoverage`)
- JS/TS: Stryker (`npx stryker run`)
- Python: mutmut (`mutmut run`)
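A sketch of how mutation results can back up a categorization, using mutmut. Flag names vary by mutmut version; the module path and mutant ID are illustrative:
```bash
# Illustrative workflow: surviving mutants point at weak tests
mutmut run --paths-to-mutate src/auth/   # mutate only the module under audit
mutmut results                           # list surviving mutants
mutmut show 42                           # inspect one survivor: if the suite
                                         # passes with this mutation, the tests
                                         # covering it are RED/YELLOW candidates
```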
**Use this skill when:**
- Production bugs appear despite high test coverage
- Suspecting coverage gaming or tautological tests
- Before major refactoring (ensure tests catch regressions)
- Onboarding to unfamiliar codebase (assess test quality)
- After hyperpowers:review-implementation flags test quality issues
- Planning test improvement initiatives
**Don't use when:**
- Writing new tests (use hyperpowers:test-driven-development)
- Debugging test failures (use hyperpowers:debugging-with-tools)
- Just need to run tests (use hyperpowers:test-runner agent)
## Announcement
**Announce:** "I'm using hyperpowers:analyzing-test-effectiveness to audit test quality with Google Fellow SRE-level scrutiny."
---
## Phase 1: Test Inventory
**Goal:** Create complete catalog of tests to analyze.
```bash
# Find all test files (adapt pattern to language)
fd -e test.ts -e spec.ts .           # JS/TS: *.test.ts, *.spec.ts
fd '(_test\.go|Test\.java)$' .       # Go, Java (regex on filename)
fd -g 'test_*.py' .                  # Python (glob)
# Or use grep to find test functions
rg "func Test|it\(|test\(|def test_|@Test" --type-add 'test:*test*' -t test
# Count tests per module
for dir in src/*/; do
count=$(rg "func Test|it\(|def test_" "$dir" 2>/dev/null | wc -l)
echo "$dir: $count tests"
done
```
**Create inventory TodoWrite:**
```
- Analyze tests in src/auth/
- Analyze tests in src/api/
- Analyze tests in src/parser/
[... one per module]
```
---
## Phase 2: Read Production Code First
**MANDATORY: Before categorizing ANY test, you MUST:**
1. **Read the production code** the test claims to exercise
2. **Understand what the production code actually does**
3. **Trace the test's call path** to verify it reaches production code
**Why this matters:** Junior engineers commonly:
- Create test utilities and test THOSE instead of production code
- Set up mocks that determine the test outcome (mock-testing-mock)
- Write assertions on values defined IN THE TEST, not from production
- Copy patterns from examples without understanding the actual code
**If you haven't read production code, you WILL miscategorize tests as GREEN when they're YELLOW or RED.**
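A minimal sketch of what call-path tracing catches. All module and function names here are hypothetical:
```typescript
// Hypothetical example: the import lines reveal what a test actually exercises
import { buildUser } from './test-utils/builders';  // test utility - suspect
import { createUser } from '../src/users/create';   // production code - good sign

test('user has name', () => {
  const user = buildUser({ name: 'a' }); // exercises the test builder, NOT createUser
  expect(user.name).toBe('a');           // asserts a value the test itself supplied
});
```
Despite importing `createUser`, nothing in this test ever calls it; only tracing the call path reveals that.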
---
## Phase 3: Categorize Each Test (Skeptical Default)
**Assume every test is RED or YELLOW until you have concrete evidence it's GREEN.**
For each test, apply these criteria:
### RED FLAGS - Must Remove or Replace
**3.1 Tautological Tests** (pass by definition)
```typescript
// ❌ RED: Verifies non-optional return is not nil
test('builder returns value', () => {
const result = new Builder().build();
expect(result).not.toBeNull(); // Always passes - return type guarantees this
});
// ❌ RED: Verifies enum has cases (compiler checks this)
test('status enum has values', () => {
expect(Object.values(Status).length).toBeGreaterThan(0);
});
// ❌ RED: Duplicates implementation
test('add returns sum', () => {
expect(add(2, 3)).toBe(2 + 3); // Tautology: testing 2+3 == 2+3
});
```
**Detection patterns:**
```bash
# Find != nil / != null on non-optional types
rg "expect\(.*\)\.not\.toBeNull|assertNotNull|!= nil" tests/
# Find enum existence checks
rg "Object\.values.*length|cases\.count" tests/
# Find test files with only a single assertion line
rg -l "expect\(" tests/ | xargs -I{} sh -c '[ "$(grep -c "expect" "$1")" -eq 1 ] && echo "$1"' _ {}
```
**3.2 Mock-Testing Tests** (test the mock, not production)
```typescript
// ❌ RED: Only verifies mock was called, not actual behavior
test('service fetches data', () => {
const mockApi = { fetch: jest.fn().mockResolvedValue({ data: [] }) };
const service = new Service(mockApi);
service.getData();
expect(mockApi.fetch).toHaveBeenCalled(); // Tests mock, not service logic
});
// ❌ RED: Mock determines test outcome
test('processor handles data', () => {
const mockParser = { parse: jest.fn().mockReturnValue({ valid: true }) };
const result = processor.process(mockParser);
expect(result.valid).toBe(true); // Just returns what mock returns
});
```
**Detection patterns:**
```bash
# Find tests that only verify mock calls
rg "toHaveBeenCalled|verify\(mock|\.called" tests/
# Find heavy mock setup
rg -c "mock|Mock|jest\.fn|stub" tests/ | sort -t: -k2 -nr | head -20
```
**3.3 Line Hitters** (execute without asserting)
```typescript
// ❌ RED: Calls function, doesn't verify outcome
test('processor runs', () => {
const processor = new Processor();
processor.run(); // No assertion - just verifies no crash
});
// ❌ RED: Assertion is trivial
test('config loads', () => {
const config = loadConfig();
expect(config).toBeDefined(); // Too weak - doesn't verify correct values
});
```
**Detection patterns:**
```bash
# Flag test files averaging fewer than 2 assertions per test
rg -l "test\(|it\(" tests/ | while read -r f; do
assertions=$(rg -c "expect|assert" "$f" 2>/dev/null || echo 0)
tests=$(rg -c "test\(|it\(" "$f" 2>/dev/null || echo 1)
ratio=$((assertions / tests))
[ "$ratio" -lt 2 ] && echo "$f: low assertion ratio ($assertions assertions, $tests tests)"
done
```
**3.4 Evergreen/Liar Tests** (always pass)
```typescript
// ❌ RED: Catches and ignores exceptions
test('parser handles input', () => {
try {
parser.parse(input);
expect(true).toBe(true); // Always passes
} catch (e) {
// Swallowed - test passes even on exception
}
});
// ❌ RED: Test setup bypasses code under test
test('validator validates', () => {
const validator = new Validator({ skipValidation: true }); // Oops
expect(validator.validate(badInput)).toBe(true);
});
```
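**Detection patterns** (heuristic sketches; adapt to your assertion library):
```bash
# Trivially-true assertions
rg "expect\(true\)\.toBe\(true\)|assertTrue\(true\)" tests/
# Empty catch blocks that swallow failures (-U enables multiline matching)
rg -U "catch\s*\([^)]*\)\s*\{\s*\}" tests/
```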
### YELLOW FLAGS - Must Strengthen
**3.5 Happy Path Only**
```typescript
// ⚠️ YELLOW: Only tests valid input
test('parse valid json', () => {
const result = parse('{"name": "test"}');
expect(result.name).toBe('test');
});
// Missing: empty string, malformed JSON, deeply nested, unicode, huge payload
```
**3.6 Weak Assertions**
```typescript
// ⚠️ YELLOW: Assertion too weak
test('fetch returns data', async () => {
const result = await fetch('/api/users');
expect(result).not.toBeNull(); // Should verify actual content
expect(result.length).toBeGreaterThan(0); // Should verify exact count or specific items
});
```
**3.7 Partial Coverage**
```typescript
// ⚠️ YELLOW: Tests success, not failure
test('create user succeeds', () => {
const user = createUser({ name: 'test', email: 'test@example.com' });
expect(user.id).toBeDefined();
});
// Missing: duplicate email, invalid email, missing fields, database error
```
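A sketch of how those missing failure paths might be covered. `DuplicateEmailError` and `ValidationError` are assumed error types, not part of the example above:
```typescript
// Sketch: cover the failure paths the happy-path test misses
// (DuplicateEmailError and ValidationError are assumed names)
test('create user rejects duplicate email', () => {
  createUser({ name: 'a', email: 'dup@example.com' });
  expect(() => createUser({ name: 'b', email: 'dup@example.com' }))
    .toThrow(DuplicateEmailError);
});

test('create user rejects invalid email', () => {
  expect(() => createUser({ name: 'a', email: 'not-an-email' }))
    .toThrow(ValidationError);
});
```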
### GREEN FLAGS - Exceptional Quality Required
**GREEN is the EXCEPTION, not the rule.** A test is GREEN only if ALL of the following are true:
1. **Exercises actual PRODUCTION code** - Not a mock, not a test utility, not a copy of logic
2. **Has precise assertions** - Exact values, not `!= nil` or `> 0`
3. **Would fail if production breaks** - You can name the specific bug it catches
4. **Tests behavior, not implementation** - Won't break on valid refactoring
**Before marking ANY test GREEN, you MUST state:**
- "This test exercises [specific production code path]"
- "It would catch [specific bug] because [reason]"
- "The assertion verifies [exact production behavior], not a test fixture"
**If you cannot fill in those blanks, the test is YELLOW at best.**
**3.8 Behavior Verification (Must exercise PRODUCTION code)**
```typescript
// ✅ GREEN: Verifies specific behavior with exact values FROM PRODUCTION
test('calculateTotal applies discount correctly', () => {
const cart = new Cart([{ price: 100, quantity: 2 }]); // Real Cart class
cart.applyDiscount('SAVE20'); // Real discount logic
expect(cart.total).toBe(160); // 200 - 20% = 160
});
// GREEN because: Exercises Cart.applyDiscount production code
// Would catch: Discount calculation bugs, rounding errors
// Assertion: Verifies exact computed value from production
```
**3.9 Edge Case Coverage (Must test PRODUCTION paths)**
```typescript
// ✅ GREEN: Tests boundary conditions IN PRODUCTION CODE
test('username rejects empty string', () => {
expect(() => new User({ username: '' })).toThrow(ValidationError);
});
// GREEN because: Exercises User constructor validation (production)
// Would catch: Missing empty string validation
// Assertion: Exact error type from production code
test('username handles unicode', () => {
const user = new User({ username: '日本語ユーザー' });
expect(user.username).toBe('日本語ユーザー');
});
// GREEN because: Exercises User constructor and storage (production)
// Would catch: Unicode corruption, encoding bugs
// Assertion: Exact value preserved through production code
```
**3.10 Error Path Testing (Must verify PRODUCTION errors)**
```typescript
// ✅ GREEN: Verifies error handling IN PRODUCTION CODE
test('fetch returns specific error on 404', async () => {
mockServer.get('/api/user/999').reply(404); // External mock OK
await expect(fetchUser(999)).rejects.toThrow(UserNotFoundError);
});
// GREEN because: Exercises fetchUser error handling (production)
// Would catch: Wrong error type, swallowed errors
// Assertion: Exact error type from production code
```
**CAUTION:** A test that uses mocks for EXTERNAL dependencies (APIs, databases) can still be GREEN if it exercises PRODUCTION logic. A test that mocks the code under test is RED.
---
## Phase 4: Mandatory Self-Review
**Before finalizing ANY categorization, complete this checklist:**
### For each GREEN test:
- [ ] Did I read the PRODUCTION code this test exercises?
- [ ] Does the test call PRODUCTION code or a test utility/mock?
- [ ] Can I name the SPECIFIC BUG this test would catch?
- [ ] If production code broke, would this test DEFINITELY fail?
- [ ] Am I being too generous because the test "looks reasonable"?
### For each YELLOW test:
- [ ] Should this actually be RED? Is there ANY bug-catching value here?
- [ ] Is the weakness fundamental (tests a mock) or fixable (weak assertion)?
- [ ] If I changed this to RED, would I lose any bug-catching ability?
### Self-Challenge Questions:
- "If a junior engineer showed me this test, would I accept it as GREEN?"
- "Am I marking this GREEN because I want to be done, or because it's genuinely good?"
- "Could I defend this GREEN classification to a Google SRE?"
**If you have ANY doubt about a GREEN, downgrade to YELLOW.**
**If you have ANY doubt about a YELLOW, consider RED.**
**Common mistakes that cause false GREENs:**
- Assuming a well-named test tests what its name says (verify the code!)
- Trusting test comments (comments lie, code doesn't)
- Not tracing mock/utility usage to see what's actually exercised
- Giving benefit of the doubt (junior engineers don't deserve it)
---
## Phase 4b: Line-by-Line Justification for RED/YELLOW
**MANDATORY: For every RED or YELLOW classification, provide detailed justification.**
This forces you to verify your classification is correct by explaining exactly WHY the test is problematic.
### Required Format for RED/YELLOW Tests:
```markdown
### [Test Name] - RED/YELLOW
**Test code (file:lines):**
- Line X: `code` - [what this line does]
- Line Y: `code` - [what this line does]
- Line Z: `assertion` - [what this asserts]
**Production code it claims to test (file:lines):**
- [Brief description of what production code does]
**Why RED/YELLOW:**
- [Specific reason with line references]
- [What bug could slip through despite this test passing]
```
### Example RED Justification:
```markdown
### testAuthWorks - RED (Tautological)
**Test code (auth_test.ts:45-52):**
- Line 46: `const auth = new AuthService()` - Creates auth instance
- Line 47: `const result = auth.login('user', 'pass')` - Calls login
- Line 48: `expect(result).not.toBeNull()` - Asserts result exists
**Production code (auth.ts:78-95):**
- login() returns AuthResult object (never null by TypeScript types)
**Why RED:**
- Line 48 asserts `!= null` but TypeScript guarantees non-null return
- If login returned {success: false, error: "invalid"}, test still passes
- Bug example: Wrong password accepted → returns {success: true} → test passes
```
### Example YELLOW Justification:
```markdown
### testParseJson - YELLOW (Weak Assertion)
**Test code (parser_test.ts:23-30):**
- Line 24: `const input = '{"name": "test"}'` - Valid JSON input
- Line 25: `const result = parse(input)` - Calls production parser
- Line 26: `expect(result).toBeDefined()` - Asserts result exists
- Line 27: `expect(result.name).toBe('test')` - Verifies one field
**Production code (parser.ts:12-45):**
- parse() handles JSON parsing with error handling and validation
**Why YELLOW:**
- Line 26-27 only test happy path with valid input
- Missing: malformed JSON, empty string, deeply nested, unicode
- Bug example: parse('') throws unhandled exception → not caught by test
- Upgrade path: Add edge case inputs with specific error assertions
```
### Why This Matters:
Writing the justification FORCES you to:
1. Actually read the test code line by line
2. Actually read the production code
3. Articulate the specific gap
4. Consider what bugs could slip through
**If you cannot write this justification, you haven't done the analysis properly.**
---
## Phase 5: Corner Case Discovery
For each module, identify missing corner case tests:
### Input Validation Corner Cases
| Category | Examples | Tests to Add |
|----------|----------|--------------|
| Empty values | `""`, `[]`, `{}`, `null` | test_empty_X_rejected/handled |
| Boundary values | 0, -1, MAX_INT, MAX_LEN | test_boundary_X_handled |
| Unicode | RTL, emoji, combining chars, null byte | test_unicode_X_preserved |
| Injection | SQL: `'; DROP`, XSS: `<script>` | test_injection_X_sanitized |