# Performance Benchmarks

> **Last updated:** March 2026 · **Toolkit version:** 2.1.0 · **Python:** 3.13 · **OS:** Windows 11 (AMD64)
>
> All benchmarks use `time.perf_counter()` with 10,000 iterations (unless noted).
> Numbers are from a development workstation — CI runs on `ubuntu-latest` GitHub-hosted runners.

## TL;DR

| What you care about | Number |
|---|---|
| **Policy evaluation (single rule)** | **0.011 ms** (p50) — 84K ops/sec |
| **Policy evaluation (100 rules)** | **0.030 ms** (p50) — 32K ops/sec |
| **Kernel enforcement (allow path)** | **0.103 ms** (p50) — 9.7K ops/sec |
| **Adapter governance overhead** | **0.005–0.007 ms** (p50) — 135K–190K ops/sec |
| **Circuit breaker check** | **0.0005 ms** (p50) — 1.83M ops/sec |
| **Concurrent throughput (50 agents)** | **46,329 ops/sec** |
| **Concurrent throughput (1,000 agents)** | **47,085 ops/sec** |

**Bottom line:** Policy enforcement adds **< 0.1 ms** per action. At 1,000 concurrent agents, the governance layer sustains **47K ops/sec** with no throughput degradation — your LLM API call is three to four orders of magnitude slower.

---

## 1. Policy Evaluation

Measures `PolicyEvaluator.evaluate()` — the core enforcement path every agent action passes through.

| Benchmark | ops/sec | p50 (ms) | p95 (ms) | p99 (ms) |
|---|---:|---:|---:|---:|
| Single rule evaluation | 84,489 | 0.011 | 0.014 | 0.037 |
| 10-rule policy | 76,406 | 0.012 | 0.017 | 0.049 |
| 100-rule policy | 32,025 | 0.030 | 0.039 | 0.108 |
| SharedPolicy cross-project eval | 116,454 | 0.008 | 0.010 | 0.028 |
| YAML policy load (cold, 10 rules) | 112 | 8.432 | 12.717 | 17.763 |

**Key takeaway:** Evaluation time scales roughly linearly with rule count. Even with 100 rules, p99 is under 0.11 ms. YAML loading is a cold-start cost (once per deployment, not per action).

Source: [`agent-governance-python/agent-os/benchmarks/bench_policy.py`](agent-governance-python/agent-os/benchmarks/bench_policy.py)

## 2. Kernel Enforcement

Measures `StatelessKernel.execute()` — the full enforcement path including policy evaluation, audit logging, and execution context management.

| Benchmark | ops/sec | p50 (ms) | p95 (ms) | p99 (ms) |
|---|---:|---:|---:|---:|
| Kernel execute (allow) | 9,668 | 0.103 | 0.198 | 0.347 |
| Kernel execute (deny) | 10,239 | 0.097 | 0.191 | 0.322 |
| Circuit breaker state check | 1,828,845 | 0.001 | 0.001 | 0.001 |

### Concurrent Throughput (Scaling)

| Concurrency | Total ops | Wall time (s) | ops/sec | vs. single-threaded |
|---:|---:|---:|---:|---|
| 50 agents × 200 ops | 10,000 | 0.216 | 46,329 | 4.8× |
| 100 agents × 100 ops | 10,000 | 0.209 | 47,920 | 5.0× |
| 500 agents × 100 ops | 50,000 | 1.085 | 46,089 | 4.8× |
| **1,000 agents × 100 ops** | **100,000** | **2.124** | **47,085** | **4.9×** |

**Key takeaway:** Throughput is **stable at ~47K ops/sec** from 50 to 1,000 concurrent agents — no degradation at scale. The deny path is slightly faster than allow (no downstream execution). Circuit breaker overhead is negligible (sub-microsecond).

Source: [`agent-governance-python/agent-os/benchmarks/bench_kernel.py`](agent-governance-python/agent-os/benchmarks/bench_kernel.py)
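
For reference, the sketch below shows one way such a throughput number can be derived: spawn N workers, have each issue a fixed number of calls, and divide total operations by wall time. It is illustrative only; `governed_action` is a hypothetical no-op stand-in for a single governed call such as `StatelessKernel.execute()`, and the actual harness is `bench_kernel.py`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def governed_action() -> None:
    """Hypothetical stand-in for one governed call, e.g. StatelessKernel.execute()."""
    pass


def run_agent(per_task: int) -> None:
    # Each simulated agent issues `per_task` governed actions back to back.
    for _ in range(per_task):
        governed_action()


def bench_concurrent(concurrency: int = 50, per_task: int = 200) -> dict:
    total_ops = concurrency * per_task
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(run_agent, per_task) for _ in range(concurrency)]
        for future in futures:
            future.result()  # propagate any worker exceptions
    wall = time.perf_counter() - start
    return {
        "concurrency": concurrency,
        "total_ops": total_ops,
        "wall_time_s": round(wall, 3),
        "ops_per_sec": round(total_ops / wall),
    }


if __name__ == "__main__":
    # Mirrors the last row of the scaling table: 1,000 agents × 100 ops each.
    print(bench_concurrent(concurrency=1000, per_task=100))
```
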
## 3. Audit System

Measures audit entry creation, querying, and serialization — the observability overhead.

| Benchmark | ops/sec | p50 (ms) | p95 (ms) | p99 (ms) |
|---|---:|---:|---:|---:|
| Audit entry write | 285,202 | 0.002 | 0.006 | 0.008 |
| Audit entry serialization | 343,548 | 0.003 | 0.003 | 0.004 |
| Execution time tracking | 442,206 | 0.002 | 0.002 | 0.003 |
| Audit log query (10K entries) | 1,399 | 0.716 | 0.877 | 1.076 |

**Key takeaway:** Audit writes add ~2 µs per action. Querying 10K entries takes ~0.7 ms (in-memory scan). For production deployments, external append-only stores (e.g., OpenTelemetry export) are recommended for large-scale query workloads.

Source: [`agent-governance-python/agent-os/benchmarks/bench_audit.py`](agent-governance-python/agent-os/benchmarks/bench_audit.py)

## 4. Framework Adapter Overhead

Measures the governance check overhead per framework adapter — the cost added to each tool call or agent step.

| Adapter | ops/sec | p50 (ms) | p95 (ms) | p99 (ms) |
|---|---:|---:|---:|---:|
| GovernancePolicy init (startup) | 134,923 | 0.007 | 0.008 | 0.019 |
| Tool allowed check | 3,745,036 | 0.000 | 0.000 | 0.000 |
| Pattern match (per call) | 135,717 | 0.007 | 0.008 | 0.022 |
| **OpenAI** adapter | 166,363 | 0.005 | 0.007 | 0.017 |
| **LangChain** adapter | 156,591 | 0.006 | 0.007 | 0.019 |
| **Anthropic** adapter | 164,194 | 0.006 | 0.008 | 0.017 |
| **LlamaIndex** adapter | 156,157 | 0.006 | 0.007 | 0.016 |
| **CrewAI** adapter | 190,134 | 0.005 | 0.006 | 0.013 |
| **AutoGen** adapter | 169,358 | 0.005 | 0.007 | 0.018 |
| **Google Gemini** adapter | 180,770 | 0.006 | 0.006 | 0.011 |
| **Mistral** adapter | 182,439 | 0.005 | 0.006 | 0.015 |
| **Semantic Kernel** adapter | 170,930 | 0.005 | 0.007 | 0.014 |

**Key takeaway:** All adapters add **< 0.02 ms** (p99) per tool call. This is four to five orders of magnitude below a typical LLM API round-trip (200–2,000 ms). The governance layer is invisible to end users.

Source: [`agent-governance-python/agent-os/benchmarks/bench_adapters.py`](agent-governance-python/agent-os/benchmarks/bench_adapters.py)

## 5. Agent SRE (Reliability Engineering)

Measures chaos engineering, SLO enforcement, and observability primitives.

| Benchmark | ops/sec | p50 (µs) | p99 (µs) |
|---|---:|---:|---:|
| Fault injection | 428,253 | 1.20 | 6.60 |
| Chaos template init | 98,889 | 9.10 | 18.50 |
| Chaos schedule eval | 168,380 | 5.30 | 7.60 |
| SLO evaluation | 29,475 | 30.10 | 96.60 |
| Error budget calculation | 29,851 | 31.70 | 111.70 |
| Burn rate alert | 25,543 | 37.10 | 116.20 |
| SLI recording | 284,274 | 2.40 | 11.10 |

**Key takeaway:** SRE operations are sub-120 µs at p99. SLI recording (the hot path for every action) is ~2.4 µs. These can run alongside every agent action without measurable impact.

Source: [`agent-governance-python/agent-sre/benchmarks/`](agent-governance-python/agent-sre/benchmarks/)

## 6. Memory Footprint

Measured with `tracemalloc` — PolicyEvaluator with 100 rules, 1,000 evaluations:

| Metric | Value |
|---|---|
| Evaluator instance (100 rules) | ~2 KB |
| Per-evaluation context overhead | ~0.5 KB |
| Peak process memory (Python runtime + evaluator + 1K evals) | ~126 MB |

> **Note:** The 126 MB peak includes the entire Python runtime, standard library, and imported modules. The evaluator itself is a small fraction. For comparison, a bare `python -c "pass"` process uses ~15 MB.
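
The evaluator-level rows above can be approximated with the standard library's `tracemalloc` alone; the process-level peak in the last row also counts the interpreter and imports and needs OS-level tooling. Below is a minimal sketch of that measurement, where `build_evaluator` and `evaluate` are hypothetical stand-ins rather than the toolkit's `PolicyEvaluator` API.

```python
import tracemalloc


def build_evaluator(num_rules: int) -> list[dict]:
    """Hypothetical stand-in for constructing a PolicyEvaluator with `num_rules` rules."""
    return [{"rule_id": i, "effect": "allow"} for i in range(num_rules)]


def evaluate(evaluator: list[dict], action: str) -> bool:
    """Hypothetical stand-in for a single policy evaluation."""
    return any(rule["effect"] == "allow" for rule in evaluator)


tracemalloc.start()

evaluator = build_evaluator(100)
evaluator_bytes, _ = tracemalloc.get_traced_memory()  # bytes held right after construction

for _ in range(1_000):
    evaluate(evaluator, "tool.call")

_, peak_bytes = tracemalloc.get_traced_memory()  # high-water mark across the 1,000 evaluations
tracemalloc.stop()

print(f"evaluator: ~{evaluator_bytes / 1024:.1f} KB, peak traced: ~{peak_bytes / 1024:.1f} KB")
```
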
## Methodology

### Hardware

These benchmarks were run on a development workstation. CI runs on GitHub-hosted `ubuntu-latest` runners (2-core, 7 GB RAM). Expect ±20% variance between runs due to shared infrastructure.

### Measurement

- **Timer:** `time.perf_counter()` (nanosecond resolution)
- **Iterations:** 10,000 per benchmark (100,000 for circuit breaker, 1,000 for YAML load)
- **Percentiles:** Sorted latency array, index-based selection
- **Warm-up:** None (benchmarks measure cold-start-inclusive performance)

### Reproducing

```bash
# Clone and install
git clone https://github.com/microsoft/agent-governance-toolkit.git
cd agent-governance-toolkit

# Policy, kernel, audit, adapter benchmarks
cd agent-governance-python/agent-os
pip install -e ".[dev]"
python benchmarks/bench_policy.py
python benchmarks/bench_kernel.py
python benchmarks/bench_audit.py
python benchmarks/bench_adapters.py

# SRE benchmarks
cd ../agent-sre
pip install -e ".[dev]"
python benchmarks/bench_chaos.py
python benchmarks/bench_slo.py

# Custom concurrency levels (default: 50 agents × 200 ops)
cd ../agent-os
python -c "
from benchmarks.bench_kernel import bench_concurrent_kernel
import json
result = bench_concurrent_kernel(concurrency=1000, per_task=100)
print(json.dumps(result, indent=2))
"
```

### CI Integration

Benchmarks run automatically on every release via the [`benchmarks.yml`](.github/workflows/benchmarks.yml) workflow. Results are uploaded as workflow artifacts for comparison across releases.

## Comparison Context

For context, here's where the governance overhead sits relative to typical agent operations:

| Operation | Typical latency |
|---|---|
| **Policy evaluation (this toolkit)** | **0.01–0.03 ms** |
| **Full kernel enforcement** | **0.10 ms** |
| **Adapter overhead** | **0.005–0.007 ms** |
| Python function call | 0.001 ms |
| Redis read (local) | 0.1–0.5 ms |
| Database query (simple) | 1–10 ms |
| LLM API call (GPT-4) | 200–2,000 ms |
| LLM API call (Claude Sonnet) | 300–3,000 ms |

The governance layer adds less overhead than a single Redis read and is **roughly 10,000× faster than an LLM call**.

## Version History

| Version | Date | Notable changes |
|---|---|---|
| v2.1.0 | March 2026 | Added 1K concurrent agent benchmarks; ~15% faster policy eval vs v1.1.x |
| v1.1.0 | February 2026 | Initial published benchmarks |