---
name: multi-ai-verification
description: Multi-layer quality assurance with 5-layer verification pyramid (Rules → Functional → Visual → Integration → Quality Scoring). Independent verification with LLM-as-judge and Agent-as-a-Judge patterns. Score 0-100 with ≥90 threshold. Use when verifying code quality, security scanning, preventing test gaming, comprehensive QA, or ensuring production readiness through multi-layer validation.
allowed-tools: Task, Read, Write, Edit, Glob, Grep, Bash
---

# Multi-AI Verification

## Overview

multi-ai-verification provides comprehensive quality assurance through a 5-layer verification pyramid, from automated rules to LLM-as-judge evaluation.

**Purpose**: Multi-layer independent verification ensuring production-ready quality

**Pattern**: Task-based (5 independent verification operations, one per layer)

**Key Innovation**: **5-layer pyramid** (95% automated at base → 0% at apex) with **independent verification** preventing bias and test gaming

**Core Principles** (validated by tri-AI research):

1. **Multi-Layer Defense** - 5 layers catch different types of issues
2. **Independent Verification** - Separate agent from implementation/testing
3. **Progressive Automation** - Automate what can be automated (95% → 0%)
4. **Quality Scoring** - Objective 0-100 scoring with ≥90 threshold
5. **Actionable Feedback** - All feedback is specific and actionable (What/Where/Why/How/Priority)

**Quality Gates**: All 5 layers must pass for production approval

---

## When to Use

Use multi-ai-verification when:

- Final quality check before commit/deployment
- Independent code review (preventing bias)
- Security verification (OWASP, vulnerabilities)
- Comprehensive QA (all layers)
- Test quality verification (prevent gaming)
- Production readiness validation

---

## Prerequisites

### Required

- Code to verify (implementation complete)
- Tests available (for functional verification)
- Quality standards defined

### Recommended

- **multi-ai-testing** - For generating/running tests
- **multi-ai-implementation** - For implementing fixes

### Tools Available

- Linters (ESLint, Pylint)
- Type checkers (TypeScript, mypy)
- Coverage tools (c8, pytest-cov)
- Security scanners (Semgrep, Bandit)
- Test frameworks (Jest, pytest)

---

## The 5-Layer Verification Pyramid

```
           Layer 5: Quality Scoring (LLM-as-Judge, 0-20% automated)
          /\
         /  \       Layer 4: Integration (E2E, System, 20-30% automated)
        /    \
       /      \     Layer 3: Visual (UI, Screenshots, 30-50% automated)
      /        \
     /          \   Layer 2: Functional (Tests, Coverage, 60-80% automated)
    /            \
   /              \ Layer 1: Rules-Based (Linting, Types, Schema, 95% automated)
```

**Principle**: Fail fast at automated layers (cheap, fast) before expensive LLM-as-judge evaluation

---

## Verification Operations

### Operation 1: Rules-Based Verification (Layer 1)

**Purpose**: Automated validation of code structure, formatting, types

**Automation**: 95% automated

**Speed**: Seconds (fast feedback)

**Confidence**: High (deterministic)

**Process**:

1. **Schema Validation** (if applicable):
   ```bash
   # Validate JSON/YAML against schemas
   ajv validate -s plan.schema.json -d plan.json
   ajv validate -s task.schema.json -d "tasks/*.json"
   ```

2. **Linting**:
   ```bash
   # JavaScript/TypeScript (quote the glob so ESLint expands it, not the shell)
   npx eslint "src/**/*.{ts,tsx,js,jsx}"

   # Python (pylint takes a package or directory)
   pylint src/

   # Expected: Zero linting errors
   ```

3. **Type Checking**:
   ```bash
   # TypeScript
   npx tsc --noEmit

   # Python
   mypy src/

   # Expected: Zero type errors
   ```

4. **Format Validation**:
   ```bash
   # Check formatting
   npx prettier --check "src/**/*.{ts,tsx}"

   # Or auto-fix
   npx prettier --write "src/**/*.{ts,tsx}"
   ```

5. **Security Scanning** (SAST):
   ```bash
   # Static security analysis (Semgrep is a standalone CLI, not an npm package)
   semgrep --config=auto src/

   # Or for Python
   bandit -r src/

   # Check for:
   # - Hardcoded secrets
   # - SQL injection risks
   # - XSS vulnerabilities
   # - Insecure dependencies
   ```

6. **Generate Layer 1 Report**:
   ```markdown
   # Layer 1: Rules-Based Verification

   ## Schema Validation
   ✅ plan.json validates
   ✅ All task files validate

   ## Linting
   ✅ 0 linting errors
   ⚠️ 3 warnings (non-blocking)

   ## Type Checking
   ✅ 0 type errors

   ## Formatting
   ✅ All files formatted correctly

   ## Security Scan (SAST)
   ✅ No critical vulnerabilities
   ⚠️ 1 medium: Weak password hashing rounds (bcrypt)

   **Layer 1 Status**: ✅ PASS (0 critical issues)
   **Issues to Address**: 1 medium security issue
   ```

**Outputs**:

- Lint report (errors/warnings)
- Type check results
- Schema validation results
- Security scan findings
- Layer 1 status (PASS/FAIL)

**Validation**:

- [ ] All automated checks run
- [ ] Results documented
- [ ] Critical issues = 0 for PASS
- [ ] Actionable feedback for warnings

**Time Estimate**: 15-30 minutes (mostly automated)

**Gate 1**: ✅ PASS if no critical issues (warnings acceptable)

---

### Operation 2: Functional Verification (Layer 2)

**Purpose**: Validate functionality through test execution and coverage

**Automation**: 60-80% automated

**Speed**: Minutes (medium feedback)

**Confidence**: High (measurable outcomes)

**Process**:

1. **Execute Complete Test Suite**:
   ```bash
   # Run all tests with coverage
   npm test -- --coverage --verbose

   # Capture results
   # - Tests passed/failed
   # - Coverage metrics
   # - Execution time
   ```

2. **Validate Example Code** (from documentation):
   ```bash
   # Extract examples from SKILL.md
   # Execute each example automatically
   # Verify outputs match expected
   # Target: ≥90% examples work
   ```

3. **Check Coverage**:
   ```markdown
   # Coverage Report

   **Line Coverage**: 87% ✅ (gate: ≥80%)
   **Branch Coverage**: 82% ✅
   **Function Coverage**: 92% ✅
   **Path Coverage**: 74% ⚠️ (not gated)

   **Gate Status**: PASS ✅ (line coverage ≥80%)

   **Uncovered Code**:
   - src/admin/legacy.ts: 23% (low priority)
   - src/utils/deprecated.ts: 15% (deprecated, ok)
   ```

4. **Regression Testing** (for updates):
   ```bash
   # Compare before/after
   git diff main...feature --stat

   # Run all tests
   npm test

   # Verify: No new failures (regression prevention)
   ```

5. **Performance Validation**:
   ```bash
   # Run performance tests
   npm run test:performance

   # Check response times
   # Verify: Within acceptable ranges
   ```

6. **Generate Layer 2 Report**:
   ```markdown
   # Layer 2: Functional Verification

   ## Test Execution
   ✅ 245/245 tests passing (100%)
   ⏱️ Execution time: 8.3 seconds

   ## Coverage
   ✅ Line: 87% (gate: ≥80%)
   ✅ Branch: 82%
   ✅ Function: 92%

   ## Example Validation
   ✅ 18/20 examples work (90%)
   ❌ 2 examples fail (outdated)

   ## Regression
   ✅ All existing tests still pass

   ## Performance
   ✅ All endpoints <200ms

   **Layer 2 Status**: ✅ PASS
   **Issues**: 2 outdated examples (update docs)
   ```

**Outputs**:

- Test execution results
- Coverage report
- Example validation results
- Regression check
- Performance metrics
- Layer 2 status

**Validation**:

- [ ] All tests executed
- [ ] Coverage meets gate (≥80%)
- [ ] Examples validated (≥90%)
- [ ] No regressions
- [ ] Performance acceptable

**Time Estimate**: 30-60 minutes

**Gate 2**: ✅ PASS if tests pass + coverage ≥80%

---

### Operation 3: Visual Verification (Layer 3)

**Purpose**: Validate UI appearance, layout, accessibility (for UI features)

**Automation**: 30-50% automated

**Speed**: Minutes-Hours

**Confidence**: Medium (subjective elements)

**Process**:

1. **Screenshot Generation**:
   ```bash
   # Generate screenshots of UI
   npx playwright test --screenshot=on

   # Or manually:
   # Open application
   # Capture screenshots of key views
   ```

2. **Visual Comparison** (if previous version exists):
   ```bash
   # Compare against baseline (creates snapshots only where none exist yet)
   npx playwright test --update-snapshots=missing

   # Or use Percy/Chromatic for visual regression
   npx percy snapshot screenshots/
   ```

3. **Layout Validation**:
   ```markdown
   # Visual Checklist

   ## Layout
   - [ ] Components positioned correctly
   - [ ] Spacing/margins match mockup
   - [ ] Alignment proper
   - [ ] No overlapping elements

   ## Styling
   - [ ] Colors match design system
   - [ ] Typography correct (fonts, sizes)
   - [ ] Icons/images display properly

   ## Responsiveness
   - [ ] Mobile view (320px-480px): ✅
   - [ ] Tablet view (768px-1024px): ✅
   - [ ] Desktop view (>1024px): ✅
   ```

4. **Accessibility Testing**:
   ```bash
   # Automated accessibility scan (axe runs against a rendered page, not source files)
   npx @axe-core/cli http://localhost:3000

   # Check WCAG compliance
   npx pa11y http://localhost:3000

   # Manual checks:
   # - Keyboard navigation
   # - Screen reader compatibility
   # - Color contrast ratios
   ```

5. **Generate Layer 3 Report**:
   ```markdown
   # Layer 3: Visual Verification

   ## Screenshot Comparison
   ✅ Login page matches mockup
   ✅ Dashboard layout correct
   ⚠️ Profile page: Avatar alignment off by 5px

   ## Responsiveness
   ✅ Mobile: All components visible
   ✅ Tablet: Layout adapts correctly
   ✅ Desktop: Full functionality

   ## Accessibility
   ✅ WCAG 2.1 AA compliance
   ✅ Keyboard navigation works
   ⚠️ 2 color contrast warnings (non-critical)

   **Layer 3 Status**: ✅ PASS (minor issues acceptable)
   **Issues**: Avatar alignment (cosmetic), contrast warnings
   ```

**Outputs**:

- Screenshots of UI
- Visual comparison results
- Responsiveness validation
- Accessibility report
- Layer 3 status

**Validation**:

- [ ] Screenshots captured
- [ ] Visual comparison done (if applicable)
- [ ] Layout validated
- [ ] Responsiveness tested
- [ ] Accessibility checked
- [ ] No critical visual issues

**Time Estimate**: 30-90 minutes (skip if no UI)

**Gate 3**: ✅ PASS if no critical visual/a11y issues

---

### Operation 4: Integration Verification (Layer 4)

**Purpose**: Validate system-level integration, data flow, API compatibility

**Automation**: 20-30% automated

**Speed**: Hours (complex)

**Confidence**: Medium-High

**Process**:

1. **Component Integration Tests**:
   ```bash
   # Run integration test suite
   npm test -- tests/integration/

   # Verify components work together
   # - Database ← → API
   # - API ← → Frontend
   # - Frontend ← → User
   ```

2. **Data Flow Validation**:
   ```markdown
   # Data Flow Verification

   **Flow 1: User Registration**
   Frontend form → API endpoint → Validation → Database → Email service
   ✅ Data flows correctly
   ✅ No data loss
   ✅ Transactions atomic

   **Flow 2: Authentication**
   Login request → API → Database lookup → Token generation → Response
   ✅ Token generated correctly
   ✅ Session stored
   ✅ Response includes token
   ```

3. **API Integration Tests**:
   ```bash
   # Test all API endpoints
   npm run test:api

   # Verify:
   # - All endpoints respond
   # - Status codes correct
   # - Response formats match spec
   # - Error handling works
   ```

4. **End-to-End Workflow Tests**:
   ```typescript
   // Complete user journeys
   // `api`, `userData`, `credentials`, and `confirmLink` are assumed test fixtures
   test('Complete registration and login flow', async () => {
     // 1. Register new user
     const registerResponse = await api.post('/register', userData);
     expect(registerResponse.status).toBe(201);

     // 2. Confirm email
     const confirmResponse = await api.get(confirmLink);
     expect(confirmResponse.status).toBe(200);

     // 3. Login
     const loginResponse = await api.post('/login', credentials);
     expect(loginResponse.status).toBe(200);
     expect(loginResponse.data.token).toBeDefined();

     // 4. Access protected resource
     const profileResponse = await api.get('/profile', {
       headers: { Authorization: `Bearer ${loginResponse.data.token}` }
     });
     expect(profileResponse.status).toBe(200);
   });
   ```

5. **Dependency Compatibility**:
   ```bash
   # Check external dependencies for known vulnerabilities
   npm audit

   # Check for outdated packages (possible breaking changes)
   npm outdated

   # Verify integration with services
   # - Database connection
   # - Redis/cache
   # - External APIs
   ```

6. **Generate Layer 4 Report**:
   ```markdown
   # Layer 4: Integration Verification

   ## Component Integration
   ✅ 12/12 integration tests passing
   ✅ All components integrate correctly

   ## Data Flow
   ✅ All 5 data flows validated
   ✅ No data loss or corruption

   ## API Integration
   ✅ All 15 endpoints functional
   ✅ Response formats correct
   ✅ Error handling works

   ## E2E Workflows
   ✅ 8/8 user journeys complete successfully
   ✅ No workflow breaks

   ## Dependencies
   ✅ 0 critical vulnerabilities
   ⚠️ 2 moderate (non-blocking)

   **Layer 4 Status**: ✅ PASS
   ```

**Outputs**:

- Integration test results
- Data flow validation
- API compatibility report
- E2E workflow results
- Dependency audit
- Layer 4 status

**Validation**:

- [ ] Integration tests pass
- [ ] Data flows validated
- [ ] APIs integrate correctly
- [ ] E2E workflows function
- [ ] Dependencies secure

**Time Estimate**: 45-90 minutes

**Gate 4**: ✅ PASS if all integration tests pass and there are no critical dependency vulnerabilities

---

### Operation 5: Quality Scoring (Layer 5)

**Purpose**: Holistic quality assessment using LLM-as-judge and Agent-as-a-Judge patterns

**Automation**: 0-20% automated

**Speed**: Hours (expensive)

**Confidence**: Medium (requires judgment)

**Process**:

1. **Spawn Independent Quality Assessor** (Agent-as-a-Judge):

   **Key**: Use different model family if possible (prevent self-preference bias)

   ```typescript
   const qualityAssessment = await task({
     description: "Assess code quality holistically",
     prompt: `Evaluate code quality in src/ and tests/.

   DO NOT read implementation conversation history.

   You have access to tools:
   - Read files
   - Execute tests
   - Run linters
   - Query database (if needed)

   Assess 5 dimensions (score each /20):

   1. CORRECTNESS (/20):
      - Logic correctness
      - Edge case handling
      - Error handling completeness
      - Security considerations

   2. FUNCTIONALITY (/20):
      - Meets all requirements
      - User workflows work
      - Performance acceptable
      - No regressions

   3. QUALITY (/20):
      - Code maintainability
      - Best practices followed
      - Anti-patterns avoided
      - Documentation complete

   4. INTEGRATION (/20):
      - Components integrate smoothly
      - API contracts correct
      - Data flow works
      - Backward compatible

   5. SECURITY (/20):
      - No vulnerabilities
      - Input validation
      - Authentication/authorization
      - Data protection

   TOTAL: /100 (sum of 5 dimensions)

   For each dimension, provide:
   - Score (/20)
   - Strengths (what's good)
   - Weaknesses (what needs improvement)
   - Evidence (file:line references)
   - Recommendations (specific, actionable)

   Write comprehensive report to: quality-assessment.md`
   });
   ```

2. **Multi-Agent Ensemble** (for critical features):

   **3-5 Agent Voting Committee**:

   ```typescript
   // Spawn 3 independent quality assessors
   const judges = await Promise.all([
     task({description: "Quality Judge 1", prompt: assessmentPrompt}),
     task({description: "Quality Judge 2", prompt: assessmentPrompt}),
     task({description: "Quality Judge 3", prompt: assessmentPrompt})
   ]);

   // Helpers (assumes each judge result exposes numeric dimension scores and a total)
   const median = (xs: number[]) => {
     const s = [...xs].sort((a, b) => a - b);
     const mid = Math.floor(s.length / 2);
     return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
   };
   const medianOf = (key: string) => median(judges.map(j => j[key]));

   // Aggregate scores (per-dimension median across judges)
   const scores = {
     correctness: medianOf("correctness"),
     functionality: medianOf("functionality"),
     quality: medianOf("quality"),
     integration: medianOf("integration"),
     security: medianOf("security")
   };
   const totalScore = Object.values(scores).reduce((a, b) => a + b, 0); // Total /100

   // Check disagreement (range between highest and lowest total)
   const totalScores = judges.map(j => j.total);
   const spread = Math.max(...totalScores) - Math.min(...totalScores);
   if (spread > 15) {
     // High disagreement → spawn 2 more judges (total 5)
     // Use 5-agent ensemble for final score
   }
   // Final score: medians over 3 or 5 judges
   ```

3. **Calibration Against Rubric**:
   ```markdown
   # Scoring Calibration

   ## Correctness: 18/20 (Excellent)
   **20**: Zero errors, all edge cases handled perfectly
   **18**: Minor edge case missing, otherwise excellent ✅ (achieved)
   **15**: 1-2 significant edge cases missing
   **10**: Some logic errors present
   **0**: Major functionality broken

   **Evidence**: All tests pass, edge cases covered except timezone DST edge case (minor)

   ## Functionality: 19/20 (Excellent)
   [Similar rubric with evidence]

   ## Quality: 17/20 (Good)
   [Similar rubric with evidence]

   ## Integration: 18/20 (Excellent)
   [Similar rubric with evidence]

   ## Security: 16/20 (Good)
   [Similar rubric with evidence]

   **Total**: 88/100 ⚠️ (Below ≥90 gate)
   ```

4. **Gap Analysis** (if <90):
   ```markdown
   # Quality Gap Analysis

   **Current Score**: 88/100
   **Target**: ≥90/100
   **Gap**: 2 points

   ## Critical Gaps (Blocking Approval)
   None

   ## High Priority (Should Fix for ≥90)
   1. **Security: Weak bcrypt rounds**
      - **What**: bcrypt using 10 rounds (outdated)
      - **Where**: src/auth/hash.ts:15
      - **Why**: Current standard is 12-14 rounds
      - **How**: Change `bcrypt.hash(password, 10)` to `bcrypt.hash(password, 12)`
      - **Priority**: High
      - **Impact**: +2 points → 90/100

   ## Medium Priority
   1. **Quality: Missing JSDoc for 3 functions**
      - Impact: +1 point → 91/100

   **Recommendation**: Fix high priority issue to reach ≥90 threshold
   **Estimated Effort**: 15 minutes
   ```

5. **Generate Comprehensive Quality Report**:
   ```markdown
   # Layer 5: Quality Scoring Report

   ## Executive Summary
   **Total Score**: 88/100 ⚠️ (Below ≥90 gate)
   **Status**: NEEDS MINOR REVISION

   ## Dimension Scores
   - Correctness: 18/20 ⭐⭐⭐⭐⭐
   - Functionality: 19/20 ⭐⭐⭐⭐⭐
   - Quality: 17/20 ⭐⭐⭐⭐
   - Integration: 18/20 ⭐⭐⭐⭐⭐
   - Security: 16/20 ⭐⭐⭐⭐

   ## Strengths
   1. Comprehensive test coverage (87%)
   2. All functionality working correctly
   3. Clean integration with all components
   4. Good error handling

   ## Weaknesses
   1. Bcrypt rounds below current standard (security)
   2. Missing documentation for helper functions (quality)
   3. One timezone edge case not handled (correctness)

   ## Recommendations (Prioritized)

   ### Priority 1 (High - Needed for ≥90)
   1. Increase bcrypt rounds: 10 → 12
      - File: src/auth/hash.ts:15
      - Effort: 5 min
      - Impact: +2 points

   ### Priority 2 (Medium - Nice to Have)
   1. Add JSDoc to helper functions
      - Files: src/utils/validation.ts
      - Effort: 30 min
      - Impact: +1 point
   2. Handle timezone DST edge case
      - File: src/auth/tokens.ts:78
      - Effort: 20 min
      - Impact: +1 point

   **Next Steps**: Apply Priority 1 fix, re-verify to reach ≥90
   ```

**Outputs**:

- Quality score (0-100) with dimension breakdown
- Calibrated against rubric
- Gap analysis
- Prioritized recommendations (Critical/High/Medium/Low)
- Evidence-based feedback (file:line references)
- Action plan to reach ≥90

**Validation**:

- [ ] All 5 dimensions scored
- [ ] Scores calibrated against rubric
- [ ] Evidence provided for each score
- [ ] Gap analysis if <90
- [ ] Recommendations actionable
- [ ] Ensemble used for critical features (optional)

**Time Estimate**: 60-120 minutes (ensemble adds 30-60 min)

**Gate 5**: ✅ PASS if total score ≥90/100

---

## Quality Gates Summary

**All 5 Gates Must Pass** for production approval:

```
Gate 1: Rules Pass ✅
   ↓ (Linting, types, schema, security)
Gate 2: Tests Pass ✅
   ↓ (All tests, coverage ≥80%)
Gate 3: Visual OK ✅
   ↓ (UI validated, a11y checked)
Gate 4: Integration OK ✅
   ↓ (E2E works, APIs integrate)
Gate 5: Quality ≥90 ✅
   ↓ (LLM-as-judge score ≥90/100)
✅ PRODUCTION APPROVED
```

**If Any Gate Fails**:

```
Failed Gate → Gap Analysis → Apply Fixes → Re-Verify → Repeat Until Pass
```

---

## Appendix A: Independence Protocol

### How Verification Independence is Maintained

**Verification Agent Spawning**:

```typescript
// After implementation and testing complete
const verification = await task({
  description: "Independent quality verification",
  prompt: `Verify code quality independently.

DO NOT read prior conversation history.

Review:
- Code: src/**/*.ts
- Tests: tests/**/*.test.ts
- Specs: specs/requirements.md

Verify against specifications ONLY (not implementation decisions).

Use tools:
- Read files to inspect code
- Run tests to verify functionality
- Execute linters for quality checks

Score quality (0-100) with evidence.

Write report to: independent-verification.md`
});
```

**Bias Prevention Checklist**:

- [ ] Specifications written BEFORE implementation
- [ ] Verification agent prompt has no implementation context
- [ ] Agent evaluates against specs, not what code does
- [ ] Fresh context (via Task tool)
- [ ] Different model family used (if possible)

**Validation of Independence**:

```markdown
## Independence Audit

**Expected Behavior**:
- ✅ Verifier finds 1-3 issues (healthy skepticism)
- ✅ Verifier references specifications
- ✅ Verifier uses tools to verify claims

**Warning Signs**:
- ⚠️ Verifier finds 0 issues (possible rubber stamp)
- ⚠️ Verifier doesn't use tools
- ⚠️ Verifier parrots implementation justifications

**If Warning**: Re-verify with stronger independence prompt
```

---

## Appendix B: Operational Scoring Rubrics

### Complete Rubrics for All 5 Dimensions

#### Correctness (/20)

- **20 (Perfect)**: Zero logic errors, all edge cases handled, security perfect
- **18 (Excellent)**: 1 minor edge case missing, otherwise flawless
- **15 (Good)**: 2-3 edge cases missing, no critical errors
- **12 (Acceptable)**: Some edge cases missing, 1 minor logic issue
- **10 (Needs Work)**: Multiple edge cases missing or 1 significant logic error
- **5 (Poor)**: Major logic errors present
- **0 (Broken)**: Critical functionality broken

#### Functionality (/20)

- **20**: All requirements met, exceeds expectations
- **18**: All requirements met, well implemented
- **15**: All requirements met, basic implementation
- **12**: 1 requirement partially missing
- **10**: 2+ requirements partially missing
- **5**: Several requirements not met
- **0**: Core functionality missing

#### Quality (/20)

- **20**: Exceptional code quality, best practices exemplified
- **18**: High quality, follows best practices
- **15**: Good quality, minor style issues
- **12**: Acceptable quality, several style issues
- **10**: Below standard, needs refactoring
- **5**: Poor quality, significant issues
- **0**: Unmaintainable code

#### Integration (/20)

- **20**: Perfect integration, all touch points verified
- **18**: Excellent integration, minor docs needed
- **15**: Good integration, all major points work
- **12**: Acceptable, 1-2 integration issues
- **10**: Integration issues present
- **5**: Multiple integration problems
- **0**: Does not integrate

#### Security (/20)

- **20**: Passes all security scans, OWASP compliant, hardened
- **18**: Passes scans, 1 minor non-critical issue
- **15**: Passes, 2-3 minor issues
- **12**: 1 medium security issue
- **10**: Multiple medium issues
- **5**: 1 critical issue present
- **0**: Multiple critical vulnerabilities

---

## Appendix C: Technical Foundation

### Verification Tools

**Linting**:
- ESLint (JavaScript/TypeScript)
- Pylint/Ruff (Python)

**Type Checking**:
- TypeScript compiler (tsc)
- mypy (Python)

**Security (SAST)**:
- Semgrep (multi-language)
- Bandit (Python)
- npm audit (JavaScript)

**Visual Testing**:
- Playwright (screenshot, visual regression)
- Percy/Chromatic (visual diff)
- axe-core (accessibility)

**Coverage**:
- c8/nyc (JavaScript)
- pytest-cov (Python)

### Cost Controls

**Budget Caps**:
- LLM-as-judge: $50/month
- Ensemble verification: $20/month
- Total verification: $70/month

**Optimization**:
- Cache quality scores for 24h (same code → same score)
- Skip Layer 5 for changes <50 lines
- Use ensemble (3-5 agents) only for critical features
- Use cheaper models for pre-filtering (Haiku for Layer 1-2)

---

## Quick Reference

### The 5 Layers

| Layer | Purpose | Automation | Time | Tools |
|-------|---------|------------|------|-------|
| 1 | Rules-based | 95% | 15-30m | Linters, types, SAST |
| 2 | Functional | 60-80% | 30-60m | Test execution, coverage |
| 3 | Visual | 30-50% | 30-90m | Screenshots, a11y |
| 4 | Integration | 20-30% | 45-90m | E2E, API tests |
| 5 | Quality Scoring | 0-20% | 60-120m | LLM-as-judge, ensemble |

**Total**: About 3-6.5 hours for complete 5-layer verification

### Quality Thresholds

- **≥90**: ✅ Excellent (production-ready)
- **80-89**: ⚠️ Good (needs minor improvements)
- **70-79**: ❌ Acceptable (needs work before production)
- **<70**: ❌ Poor (significant rework required)

### Gates

**All 5 Must Pass**:

1. Rules pass (no critical lint/type/security)
2. Tests pass + coverage ≥80%
3. Visual OK (no critical UI issues)
4. Integration OK (E2E works)
5. Quality ≥90/100

---

**multi-ai-verification provides comprehensive, multi-layer quality assurance with independent LLM-as-judge evaluation, ensuring production-ready code through systematic verification from automated rules to holistic quality assessment.**

For rubrics, see Appendix B. For independence protocol, see Appendix A.
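
### Worked Example: Ensemble Aggregation and Threshold Bands

The Layer 5 ensemble rules (per-dimension median, a 15-point disagreement limit, and the threshold bands listed under Quality Thresholds) can be sketched as a small self-contained module. This is only an illustrative sketch: the `JudgeScores` shape, the helper names `aggregate` and `classify`, and the sample scores are assumptions made for the example, not part of any skill API.

```typescript
// Hypothetical shape of one judge's result: five dimensions, each scored /20.
type Dimension = "correctness" | "functionality" | "quality" | "integration" | "security";
type JudgeScores = Record<Dimension, number>;

const DIMENSIONS: Dimension[] = [
  "correctness", "functionality", "quality", "integration", "security",
];

// Median of a non-empty list of numbers.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Sum of the five dimension scores (/100).
function total(scores: JudgeScores): number {
  return DIMENSIONS.reduce((sum, d) => sum + scores[d], 0);
}

interface EnsembleResult {
  perDimension: JudgeScores; // median of each dimension across judges
  total: number;             // sum of the per-dimension medians
  spread: number;            // highest judge total minus lowest judge total
  needsMoreJudges: boolean;  // true when the spread exceeds the limit
}

// Aggregate judge scores; maxSpread defaults to the 15-point limit from the text.
function aggregate(judges: JudgeScores[], maxSpread = 15): EnsembleResult {
  const perDimension = {} as JudgeScores;
  for (const d of DIMENSIONS) {
    perDimension[d] = median(judges.map((j) => j[d]));
  }
  const totals = judges.map(total);
  const spread = Math.max(...totals) - Math.min(...totals);
  return { perDimension, total: total(perDimension), spread, needsMoreJudges: spread > maxSpread };
}

// Map a total score to the threshold bands from the Quick Reference.
function classify(score: number): string {
  if (score >= 90) return "Excellent";
  if (score >= 80) return "Good";
  if (score >= 70) return "Acceptable";
  return "Poor";
}

// Sample data (made up for illustration).
const sampleJudges: JudgeScores[] = [
  { correctness: 18, functionality: 19, quality: 17, integration: 18, security: 16 },
  { correctness: 17, functionality: 18, quality: 18, integration: 18, security: 15 },
  { correctness: 19, functionality: 19, quality: 16, integration: 17, security: 17 },
];
const result = aggregate(sampleJudges);
console.log(result.total, classify(result.total), result.needsMoreJudges);
// prints: 88 Good false
```

With these sample judges the per-dimension medians are 18, 19, 17, 18, and 16, which reproduces the 88/100 total used in the report examples; the judge totals differ by only 2 points, so no extra judges would be spawned.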