--- name: optimization-benchmark description: Comprehensive performance benchmarking, regression detection, automated testing, and performance validation. Use for CI/CD quality gates, baseline comparisons, and systematic performance testing. --- # Benchmark Suite Skill ## Overview This skill provides comprehensive automated performance testing capabilities including benchmark execution, regression detection, performance validation, and quality assessment for ensuring optimal system performance. ## When to Use - Running performance benchmark suites before deployment - Detecting performance regressions between versions - Validating SLA compliance through automated testing - Load, stress, and endurance testing - CI/CD pipeline performance quality gates - Comparing performance across configurations ## Quick Start ```bash # Run comprehensive benchmark suite npx claude-flow benchmark-run --suite comprehensive --duration 300 # Execute specific benchmark npx claude-flow benchmark-run --suite throughput --iterations 10 # Compare with baseline npx claude-flow benchmark-compare --current --baseline # Quality assessment npx claude-flow quality-assess --target swarm-performance --criteria throughput,latency # Performance validation npx claude-flow validate-performance --results --criteria ``` ## Architecture ``` +-----------------------------------------------------------+ | Benchmark Suite | +-----------------------------------------------------------+ | Benchmark Runner | Regression Detector | Validator | +--------------------+-----------------------+--------------+ | | | v v v +------------------+ +-------------------+ +--------------+ | Benchmark Types | | Detection Methods | | Validation | | - Throughput | | - Statistical | | - SLA | | - Latency | | - ML-based | | - Regression | | - Scalability | | - Threshold | | - Scalability| | - Coordination | | - Trend Analysis | | - Reliability| +------------------+ +-------------------+ +--------------+ | | | v v v +-----------------------------------------------------------+ | Reporter & Comparator | +-----------------------------------------------------------+ ``` ## Benchmark Types ### Standard Benchmarks | Benchmark | Metrics | Duration | Targets | |-----------|---------|----------|---------| | Throughput | requests/sec, tasks/sec, messages/sec | 5 min | min: 1000, optimal: 5000 | | Latency | p50, p90, p95, p99, max | 5 min | p50<100ms, p99<1s | | Scalability | linear coefficient, efficiency retention | variable | coefficient>0.8 | | Coordination | message latency, sync time | 5 min | <50ms | | Fault Tolerance | recovery time, failover success | 10 min | <30s recovery | ### Test Campaign Types 1. **Load Testing**: Gradual ramp-up to sustained load 2. **Stress Testing**: Find breaking points 3. **Volume Testing**: Large data set handling 4. **Endurance Testing**: Long-duration stability 5. **Spike Testing**: Sudden load changes 6. **Configuration Testing**: Different settings comparison ## Core Capabilities ### 1. Comprehensive Benchmarking ```javascript // Run benchmark suite const results = await benchmarkSuite.run({ duration: 300000, // 5 minutes iterations: 10, // 10 iterations warmupTime: 30000, // 30 seconds warmup cooldownTime: 10000, // 10 seconds cooldown parallel: false, // Sequential execution baseline: previousRun // Compare with baseline }); // Results include: // - summary: Overall scores and status // - detailed: Per-benchmark results // - baseline_comparison: Delta from baseline // - recommendations: Optimization suggestions ``` ### 2. Regression Detection Multi-algorithm detection: | Method | Description | Use Case | |--------|-------------|----------| | Statistical | CUSUM change point detection | Detect gradual degradation | | Machine Learning | Anomaly detection models | Identify unusual patterns | | Threshold | Fixed limit comparisons | Hard performance limits | | Trend | Time series regression | Long-term degradation | ```bash # Detect performance regressions npx claude-flow detect-regression --current --historical # Set up automated regression monitoring npx claude-flow regression-monitor --enable --sensitivity 0.95 ``` ### 3. Automated Performance Testing ```javascript // Execute test campaign const campaign = await tester.runTestCampaign({ tests: [ { type: 'load', config: loadTestConfig }, { type: 'stress', config: stressTestConfig }, { type: 'endurance', config: enduranceConfig } ], constraints: { maxDuration: 3600000, // 1 hour max failFast: true // Stop on first failure } }); ``` ### 4. Performance Validation Validation framework with multi-criteria assessment: | Validation Type | Criteria | |-----------------|----------| | SLA Validation | Availability, response time, throughput, error rate | | Regression Validation | Comparison with historical data | | Scalability Validation | Linear scaling, efficiency retention | | Reliability Validation | Error handling, recovery, consistency | ## MCP Integration ```javascript // Comprehensive benchmark integration const benchmarkIntegration = { // Execute performance benchmarks async runBenchmarks(config = {}) { const [benchmark, metrics, trends, cost] = await Promise.all([ mcp.benchmark_run({ suite: config.suite || 'comprehensive' }), mcp.metrics_collect({ components: ['system', 'agents', 'coordination'] }), mcp.trend_analysis({ metric: 'performance', period: '24h' }), mcp.cost_analysis({ timeframe: '24h' }) ]); return { benchmark, metrics, trends, cost, timestamp: Date.now() }; }, // Quality assessment async assessQuality(criteria) { return await mcp.quality_assess({ target: 'swarm-performance', criteria: criteria || ['throughput', 'latency', 'reliability', 'scalability'] }); } }; ``` ## Key Metrics ### Benchmark Targets ```javascript const benchmarkTargets = { throughput: { requests_per_second: { min: 1000, optimal: 5000 }, tasks_per_second: { min: 100, optimal: 500 }, messages_per_second: { min: 10000, optimal: 50000 } }, latency: { p50: { max: 100 }, // 100ms p90: { max: 200 }, // 200ms p95: { max: 500 }, // 500ms p99: { max: 1000 }, // 1s max: { max: 5000 } // 5s }, scalability: { linear_coefficient: { min: 0.8 }, efficiency_retention: { min: 0.7 } } }; ``` ### CI/CD Quality Gates | Gate | Criteria | Action on Failure | |------|----------|-------------------| | Performance | < 10% degradation | Block deployment | | Latency | p99 < 1s | Warning | | Error Rate | < 0.5% | Block deployment | | Scalability | > 80% linear | Warning | ## Load Testing Example ```javascript // Load test with gradual ramp-up const loadTest = { type: 'load', phases: [ { phase: 'ramp-up', duration: 60000, startLoad: 10, endLoad: 100 }, { phase: 'sustained', duration: 300000, load: 100 }, { phase: 'ramp-down', duration: 30000, startLoad: 100, endLoad: 0 } ], successCriteria: { p99_latency: { max: 1000 }, error_rate: { max: 0.01 }, throughput: { min: 80 } // % of expected } }; ``` ## Stress Testing Example ```javascript // Stress test to find breaking point const stressTest = { type: 'stress', startLoad: 100, maxLoad: 10000, loadIncrement: 100, duration: 60000, // Per load level breakingCriteria: { error_rate: { max: 0.05 }, // 5% errors latency_p99: { max: 5000 }, // 5s latency timeout_rate: { max: 0.10 } // 10% timeouts } }; ``` ## Integration Points | Integration | Purpose | |-------------|---------| | Performance Monitor | Continuous monitoring data for benchmarking | | Load Balancer | Validates load balancing effectiveness | | Topology Optimizer | Tests topology configurations | | CI/CD Pipeline | Automated quality gates | ## Best Practices 1. **Consistent Environment**: Run benchmarks in consistent, isolated environments 2. **Warmup Period**: Always include warmup to eliminate cold-start effects 3. **Multiple Iterations**: Run multiple iterations for statistical significance 4. **Baseline Maintenance**: Keep baseline updated with expected performance 5. **Historical Tracking**: Store all benchmark results for trend analysis 6. **Realistic Workloads**: Use production-like workload patterns ## Example: CI/CD Integration ```bash #!/bin/bash # ci-performance-gate.sh # Run benchmark suite RESULTS=$(npx claude-flow benchmark-run --suite quick --output json) # Compare with baseline COMPARISON=$(npx claude-flow benchmark-compare \ --current "$RESULTS" \ --baseline ./baseline.json) # Check for regressions if echo "$COMPARISON" | jq -e '.regression_detected == true' > /dev/null; then echo "Performance regression detected!" echo "$COMPARISON" | jq '.regressions' exit 1 fi echo "Performance validation passed" exit 0 ``` ## Related Skills - `optimization-monitor` - Real-time performance monitoring - `optimization-analyzer` - Bottleneck analysis and reporting - `optimization-load-balancer` - Load distribution optimization - `optimization-topology` - Topology performance testing --- ## Version History - **1.0.0** (2026-01-02): Initial release - converted from benchmark-suite agent with comprehensive benchmarking, regression detection, automated testing, and performance validation