# Evolve Agents - Agent Self-Improvement Analysis

Analyze agent performance, identify capability gaps, and propose improvements using the Evolution System.

## Usage

```bash
# Analyze all agents
.claude/commands/evolve-agents.md analyze

# Analyze specific agent
.claude/commands/evolve-agents.md analyze --agent-id=coder-agent

# Generate weekly evolution report
.claude/commands/evolve-agents.md report --period=weekly

# Check thresholds and recommend updates
.claude/commands/evolve-agents.md check-thresholds

# View capability gaps
.claude/commands/evolve-agents.md gaps

# View skill suggestions
.claude/commands/evolve-agents.md suggestions

# View prompt history
.claude/commands/evolve-agents.md prompt-history --agent-id=coder-agent

# Evolve agent prompts
.claude/commands/evolve-agents.md evolve --agent-id=coder-agent

# Run A/B test on prompts
.claude/commands/evolve-agents.md ab-test --agent-id=coder-agent
```

## Command Actions

### 1. **analyze** - Comprehensive Agent Analysis

Analyze agent performance metrics including:

- Success rate over time
- Average task duration
- Token efficiency (success per 1000 tokens)
- User feedback ratings
- Performance trends (improving, stable, declining)

**Output:**

- Performance summary table
- Trend analysis
- Comparison with previous periods
- Recommendations for improvement

**Example Output:**

```
Agent Performance Analysis
========================

Agent: coder-agent
------------------
Success Rate: 87.5% (↑ 5.2% vs last week)
Avg Duration: 45.3s (↓ 8.1s vs last week)
Token Efficiency: 0.72 (↑ 0.08 vs last week)
User Rating: 4.2/5.0
Performance Trend: IMPROVING

Tasks Completed: 120
Failures: 15
Retries: 8

Top Error Types:
1. timeout (5 occurrences)
2. validation_error (4 occurrences)
3. tool_limitation (3 occurrences)
```

### 2. **report** - Generate Evolution Report

Generate a comprehensive evolution report for the specified period:

- Overall system performance
- Per-agent performance summaries
- Identified capability gaps
- Skill suggestions
- Prompt updates
- Improvements deployed

**Periods:**

- `daily` - Last 24 hours
- `weekly` - Last 7 days (default)
- `monthly` - Last 30 days
- `custom` - Specify start/end dates

**Output:**

- Executive summary
- Detailed metrics
- Visual trends
- Actionable recommendations

### 3. **check-thresholds** - Auto-Evolution Checks

Check if any agents meet criteria for automatic evolution:

- Success rate drops below threshold
- Performance declining trend
- High failure rate in specific task types

**Thresholds:**

- Success rate drop: 10%
- Minimum task count: 10 tasks
- Declining trend: 2+ consecutive periods

**Output:**

```
Evolution Threshold Analysis
===========================

Agents Requiring Attention:
---------------------------
1. coder-agent
   Current Version: v3
   Success Rate: 72.5% (↓ 15.2% vs previous period)
   Threshold: success_rate_drop
   Recommended Action: EVOLVE
   Reason: Success rate dropped by 15.2%

2. tester-agent
   Current Version: v2
   Success Rate: stable
   Performance: DECLINING trend for 14 days
   Recommended Action: AB_TEST
   Reason: Consistent performance decline
```

### 4. **gaps** - View Capability Gaps

Display identified capability gaps from task failures:

- Gap category (missing_skill, tool_limitation, knowledge_gap, pattern_failure)
- Severity (low, medium, high, critical)
- Affected tasks
- Frequency
- Error patterns

**Output:**

```
Capability Gaps
==============

CRITICAL Gaps (2):
------------------
1. Gap ID: gap-1234567890
   Category: tool_limitation
   Description: Agent struggles with database query tasks requiring SQL execution
   Failure Count: 12
   Frequency: 3.4 failures/week
   Severity: CRITICAL
   Affected Tasks: 12 tasks
   Error Pattern: "No database client available"

2. Gap ID: gap-0987654321
   Category: missing_skill
   Description: Agent lacks capability for async/parallel task handling
   Failure Count: 8
   Frequency: 2.3 failures/week
   Severity: CRITICAL
```

### 5. **suggestions** - View Skill Suggestions

Display proposed skills to address capability gaps:

- Skill name and description
- Addressed gaps
- Estimated impact
- Implementation complexity
- Required tools/training

**Output:**

```
Skill Suggestions
================

HIGH PRIORITY (3):
------------------
1. Enhanced Database Integration
   Category: tool_usage
   Addresses Gaps: gap-1234567890
   Estimated Impact:
   - Gaps Closed: 1
   - Tasks Unblocked: 12
   - Success Rate Improvement: +15%
   Implementation: MEDIUM complexity
   Required Tools: database-client, sql-executor

2. Async Task Manager
   Category: specialized_skill
   Addresses Gaps: gap-0987654321
   Estimated Impact:
   - Gaps Closed: 1
   - Tasks Unblocked: 8
   - Success Rate Improvement: +20%
   Implementation: HIGH complexity
   Required Training: async_patterns, concurrency_control
```

### 6. **prompt-history** - View Prompt Evolution

Display prompt version history for an agent:

- Version number
- Activation/deactivation dates
- Performance summary
- Improvement over previous version

**Output:**

```
Prompt History: coder-agent
===========================

v4 (ACTIVE)
-----------
Activated: 2024-01-15 14:30:00
Tasks: 45
Success Rate: 87.5%
Avg Duration: 45.3s
Token Efficiency: 0.72
Improvement: +12.5% vs v3

v3 (ARCHIVED)
-----------
Activated: 2024-01-08 09:00:00
Deactivated: 2024-01-15 14:30:00
Tasks: 120
Success Rate: 75.0%
Avg Duration: 53.4s
Token Efficiency: 0.64
Improvement: +5.0% vs v2
```

### 7. **evolve** - Trigger Agent Evolution

Manually trigger prompt evolution for an agent:

- Analyze recent failures
- Suggest prompt mutations
- Generate new variant
- Register for A/B testing

**Process:**

1. Analyze failure patterns
2. Identify mutation opportunities
3. Generate new prompt variant
4. Add to A/B testing pool
5. Report expected improvements

**Output:**

```
Agent Evolution: coder-agent
===========================

Current Version: v3

Failure Analysis:
-----------------
- timeout errors (5x) → Add time management constraints
- validation errors (4x) → Clarify output format requirements

Suggested Mutations:
-------------------
1. ADD_CONSTRAINT (system prompt)
   Confidence: 70%
   Description: Add time management and efficiency constraints

2. CLARIFY (user prompt)
   Confidence: 80%
   Description: Clarify output format requirements

New Variant Created: v4
-----------------------
Status: TESTING
Trial Count: 0
UCB1 Score: 0.0

The new variant will be tested using UCB1 multi-armed bandit selection.
Expected to reach promotion threshold after 20 trials.
```

### 8. **ab-test** - A/B Test Management

View and manage A/B testing of prompt variants:

- Active variants
- Trial counts
- Success rates
- UCB1 scores
- Selection probability

**Output:**

```
A/B Testing Status: coder-agent
===============================

Active Variants:
---------------
v4 (TESTING)
  Trials: 15
  Success Rate: 85.0%
  Avg Duration: 42.1s
  UCB1 Score: 0.92
  Selection Prob: 65%

v3 (ACTIVE)
  Trials: 120
  Success Rate: 75.0%
  Avg Duration: 53.4s
  UCB1 Score: 0.78
  Selection Prob: 35%

Next Selection: v4 (UCB1 algorithm)

Promotion Status:
----------------
v4 needs 5 more trials before promotion consideration
Current improvement: +10.0% success rate vs v3
Threshold for promotion: +5.0% improvement
```

## Implementation Guide

### 1. Initialize Evolution System

```typescript
import Database from 'better-sqlite3';
import { EvolutionSystem } from '.claude/orchestration/evolution';

// Initialize database
const db = new Database('.claude/orchestration/db/agents.db');

// Create evolution system
const evolution = new EvolutionSystem(db, {
  autoEvolutionEnabled: true,
  explorationParameter: 2.0,
  minTrialsBeforePromotion: 20,
});
```

### 2. Track Task Completion

```typescript
// After task completes
await evolution.trackTaskCompletion({
  agentId: 'coder-agent',
  taskId: 'task-123',
  variantId: 'coder-agent-v4',
  success: true,
  duration: 45300, // ms
  tokens: 1250,
  userRating: 4.5,
});
```

### 3. Collect Implicit Feedback

```typescript
// User retries task
evolution.feedbackLoop.trackRetry('task-123', 'coder-agent');

// User edits output
evolution.feedbackLoop.trackEdit('task-123', 'coder-agent', 'minor');

// User abandons task
evolution.feedbackLoop.trackAbandon('task-123', 'coder-agent');
```

### 4. Generate Reports

```typescript
// Weekly report
const report = evolution.generateWeeklyReport();

console.log('Overall Success Rate:', report.summary.overallSuccessRate);
console.log('Total Tasks:', report.summary.totalTasks);

// Per-agent performance
for (const perf of report.agentPerformance) {
  console.log(`${perf.agentId}: ${perf.successRate}% (${perf.successRateChange > 0 ? '↑' : '↓'} ${Math.abs(perf.successRateChange)}%)`);
}
```

### 5. Check and Apply Evolution

```typescript
// Check thresholds
const updates = evolution.feedbackLoop.checkThresholds();

// Apply recommended updates
for (const update of updates) {
  if (update.recommendedAction === 'evolve') {
    await evolution.evolveAgent(update.agentId);
  }
}
```

## UCB1 Algorithm Explanation

The system uses the **UCB1 (Upper Confidence Bound)** algorithm for prompt variant selection, which balances:

**Exploitation:** Selecting variants with proven high success rates

**Exploration:** Testing new or under-tested variants

**Formula:**

```
UCB1 = avg_success_rate + c * sqrt(ln(total_trials) / variant_trials)
```

Where:

- `avg_success_rate`: Historical success rate of variant
- `c`: Exploration parameter (default: 2.0)
- `total_trials`: Total trials across all variants
- `variant_trials`: Trials for this specific variant

**Selection Strategy:**

1. Always select untried variants first (forced exploration)
2. Calculate UCB1 score for each variant
3. Select variant with highest UCB1 score
4. Update statistics after task completion
5. Promote variant to "active" after sufficient trials and proven improvement

## Configuration

Default configuration can be customized:

```typescript
const evolution = new EvolutionSystem(db, {
  // Tracking
  trackingEnabled: true,
  metricsRetentionDays: 90,

  // A/B Testing
  abTestingEnabled: true,
  minTrialsBeforePromotion: 20,
  confidenceLevel: 0.95,
  explorationParameter: 2.0,

  // Auto-Evolution
  autoEvolutionEnabled: true,
  evolutionThreshold: {
    minSuccessRateDrop: 10, // 10% drop triggers evolution
    minTaskCount: 10,
  },

  // Feedback
  implicitFeedbackWeight: 0.3,
  feedbackDecayHalfLife: 7, // days

  // Reporting
  reportFrequency: 'weekly',
  reportRetentionCount: 12,
});
```

## Database Schema

All evolution data is stored in SQLite:

- `evolution_performance_metrics` - Task completion metrics
- `evolution_user_feedback` - Explicit and implicit feedback
- `evolution_task_failures` - Detailed failure tracking
- `evolution_prompt_variants` - Prompt versions and A/B testing
- `evolution_capability_gaps` - Identified capability gaps
- `evolution_skill_suggestions` - Proposed improvements
- `evolution_reports` - Generated reports

See `.claude/orchestration/db/evolution.sql` for complete schema.

## Integration with Orchestration System

The evolution system integrates seamlessly with the existing orchestration system:

1. **Automatic Tracking**: All task completions are automatically tracked
2. **Checkpoint Integration**: Evolution state saved in checkpoints
3. **Activity Logging**: Evolution events logged to activity log
4. **Obsidian Sync**: Reports synced to Obsidian vault for review

## Best Practices

1. **Regular Reporting**: Generate weekly reports to track trends
2. **Review Gaps**: Address critical capability gaps promptly
3. **Monitor A/B Tests**: Track variant performance during testing phase
4. **Feedback Collection**: Actively collect user feedback for better evolution
5. **Gradual Evolution**: Don't change too many agents at once
6. **Version Control**: Keep prompt history for rollback capability

## See Also

- `.claude/orchestration/evolution/README.md` - Detailed system documentation
- `.claude/orchestration/PROTOCOL.md` - Orchestration protocol
- Obsidian vault: `System/Agents/Evolution/` - Evolution reports and analysis
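
## Appendix: UCB1 Selection Sketch

The selection strategy in the UCB1 Algorithm Explanation section can be illustrated with a short standalone sketch. This is not the Evolution System's actual code: `VariantStats`, `ucb1Score`, `selectVariant`, and `recordResult` are hypothetical names introduced here, and the trial counts in the usage lines are made up for illustration. It only assumes the formula and the "untried variants first" rule stated in that section.

```typescript
// Hypothetical shape for illustration; not the Evolution System's real types.
interface VariantStats {
  id: string;
  trials: number;    // completed trials for this variant
  successes: number; // successful trials for this variant
}

// Score from the documented formula:
// avg_success_rate + c * sqrt(ln(total_trials) / variant_trials)
function ucb1Score(v: VariantStats, totalTrials: number, c: number = 2.0): number {
  // Untried variants get an infinite score so they are always picked first.
  if (v.trials === 0) return Infinity;
  const avgSuccessRate = v.successes / v.trials;
  return avgSuccessRate + c * Math.sqrt(Math.log(totalTrials) / v.trials);
}

// Pick the variant with the highest score; ties go to the earliest entry.
function selectVariant(variants: VariantStats[], c: number = 2.0): VariantStats {
  if (variants.length === 0) throw new Error('no variants to select from');
  const totalTrials = variants.reduce((sum, v) => sum + v.trials, 0);
  let best = variants[0];
  let bestScore = ucb1Score(best, totalTrials, c);
  for (const v of variants.slice(1)) {
    const score = ucb1Score(v, totalTrials, c);
    if (score > bestScore) {
      best = v;
      bestScore = score;
    }
  }
  return best;
}

// Update a variant's statistics after a task completes.
function recordResult(v: VariantStats, success: boolean): void {
  v.trials += 1;
  if (success) v.successes += 1;
}

// Usage with made-up counts: the under-tested variant gets the larger
// exploration bonus and is selected.
const incumbent: VariantStats = { id: 'v3', trials: 120, successes: 90 };
const challenger: VariantStats = { id: 'v4', trials: 15, successes: 12 };
const chosen = selectVariant([incumbent, challenger]);
console.log(chosen.id); // v4
recordResult(chosen, true);
```

With `c = 2.0` the exploration term can push a score above 1.0, so scores from this sketch are only meaningful relative to each other. Promotion to "active" (step 5 of the strategy) is left out because the exact promotion rule is not specified beyond the minimum-trials and improvement thresholds.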