# CKB Review: Benchmark Data

All numbers come from real measurements on a production PR (133 files, 19,200 lines, 37 commits), except rows marked as estimated or approximate (~).

---

## Token Savings

### LLM Review Without CKB (Scenario 1)

| Metric | Value |
|---|---|
| Model | Claude Opus 4.6 |
| Files in PR | 133 |
| Files LLM reviewed | 37 (28%) |
| Tokens consumed | 87,336 |
| Tool calls (file reads, searches) | 71 |
| Duration | 718 seconds (12 minutes) |
| Findings | 4 |
| False positives | 0 |
| Tokens per finding | 21,834 |

The LLM spent 87k tokens and still covered only 28% of files. It couldn't check secrets, breaking changes, dead code, test coverage, complexity, coupling, or churn history.

### LLM Review With CKB (Scenario 3, Final)

| Metric | Value |
|---|---|
| Model | Claude Opus 4.6 |
| CKB runtime | 5,246ms |
| CKB tokens | 0 |
| CKB findings | 31 |
| LLM files reviewed | ~10 (8%) |
| LLM tokens consumed | 45,784 |
| LLM tool calls | 47 |
| LLM duration | ~17 minutes |
| New LLM findings | 2 (verified CKB bug-patterns) |
| Total findings | 33 |
| False positives | 0 |
| Tokens per finding | 1,387 |

### Comparison

| Metric | Without CKB | With CKB | Improvement |
|---|---|---|---|
| Tokens | 87,336 | 45,784 | **-48%** |
| File coverage (structural) | 28% | 100% | **+72pp** |
| Findings | 4 | 33 | **8.3x** |
| Tokens per finding | 21,834 | 1,387 | **15.7x more efficient** |
| Secrets checked | No | Yes (all 133 files) | +133 files |
| Breaking changes checked | No | Yes (SCIP-verified) | Impossible without CKB |
| Test gaps identified | No | 16 functions | Impossible without CKB |

---

## CKB Standalone Performance

### Runtime

| PR Size | Files | Lines | CKB Duration | Checks |
|---|---|---|---|---|
| Small (measured) | 2 | 10 | ~500ms | 15 |
| Medium (estimated) | 30 | 2,000 | ~2s | 15 |
| Large (measured) | 133 | 19,200 | 5.2s | 15 |

All 15 checks run in parallel. The bottlenecks are tree-sitter complexity analysis (~1.8s) and coupling analysis (~1.8s).

### Findings Quality Progression

Over 5 tuning iterations on the same PR:

| Iteration | Total Findings | Noise | False Positives | Score |
|---|---|---|---|---|
| Raw (no tuning) | 258 | 230 | 1 | 20 |
| + Infallible-write allowlist | 89 | 62 | 1 | 54 |
| + Threshold tuning | 27 | 8 | 1 | 63 |
| + Framework symbol filter | 19 | 1 | 1 | 71 |
| + Dead-code grep verification | 18 | 0 | 0 | 74 |
| **Final (with test-gap details)** | **31** | **0** | **0** | **61** |

Score dropped from 74 → 61 in the final version because test-gap findings (10 new findings with file:line details) were added to the output. These are informational findings, not quality regressions.

### Check Execution Times (133-file PR)

| Check | Duration | What it does |
|---|---|---|
| complexity | 1,799ms | Tree-sitter cyclomatic/cognitive analysis, before/after comparison |
| coupling | 1,772ms | Git co-change analysis across history |
| tests | 904ms | SCIP + heuristic test coverage mapping |
| health | 875ms | 8-factor weighted health score per file |
| bug-patterns | 871ms | 10 AST rules with differential base comparison |
| dead-code | 812ms | SCIP reference count + grep cross-verification |
| blast-radius | 701ms | SCIP caller graph traversal |
| secrets | 395ms | Pattern + entropy scanning |
| test-gaps | 149ms | Tree-sitter function extraction + test cross-ref |
| breaking | 39ms | SCIP API surface comparison |
| format-consistency | 12ms | Output format divergence check |
| comment-drift | 3ms | Numeric reference scanning |
| risk | <1ms | Composite score (pre-computed inputs) |
| split | <1ms | Module clustering (pre-computed) |
| hotspots | <1ms | Score lookup (pre-computed by coupling check) |

Total wall clock: 5.2s (parallel execution).

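As a quick sanity check on the table above, the following Go sketch sums the per-check durations and compares the total with the reported 5.2s wall clock. The type and function names are hypothetical, and entering the three `<1ms` rows as 0 is an assumption made only for this arithmetic.

```go
package main

import "fmt"

// CheckTiming pairs a check name with its measured duration in milliseconds.
type CheckTiming struct {
	Name   string
	Millis int
}

// timings copies the check execution table. Rows reported as "<1ms"
// (risk, split, hotspots) are entered as 0, an assumption for this sketch.
var timings = []CheckTiming{
	{"complexity", 1799}, {"coupling", 1772}, {"tests", 904},
	{"health", 875}, {"bug-patterns", 871}, {"dead-code", 812},
	{"blast-radius", 701}, {"secrets", 395}, {"test-gaps", 149},
	{"breaking", 39}, {"format-consistency", 12}, {"comment-drift", 3},
	{"risk", 0}, {"split", 0}, {"hotspots", 0},
}

// SumMillis adds up all check durations, i.e. the cost of running them serially.
func SumMillis(ts []CheckTiming) int {
	total := 0
	for _, t := range ts {
		total += t.Millis
	}
	return total
}

// Slowest returns the check with the largest duration (zero value if ts is empty).
func Slowest(ts []CheckTiming) CheckTiming {
	var worst CheckTiming
	for _, t := range ts {
		if t.Millis > worst.Millis {
			worst = t
		}
	}
	return worst
}

func main() {
	const wallClockMillis = 5200 // "Total wall clock: 5.2s"
	sum := SumMillis(timings)
	slow := Slowest(timings)
	fmt.Printf("checks: %d\n", len(timings))                       // checks: 15
	fmt.Printf("sum of durations: %d ms\n", sum)                   // sum of durations: 8332 ms
	fmt.Printf("slowest check: %s (%d ms)\n", slow.Name, slow.Millis) // slowest check: complexity (1799 ms)
	fmt.Printf("wall clock / sum: %.2f\n", float64(wallClockMillis)/float64(sum)) // wall clock / sum: 0.62
}
```

The serial total is about 8.3s against a 5.2s wall clock, while the slowest single check takes 1.8s. This document does not break down what accounts for the time between those two figures.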

---

## MCP Response Sizes

| Mode | Response Size | Tokens (~4 chars/tok) | Use Case |
|---|---|---|---|
| Full JSON | 120 KB | ~30,000 | Raw data export, CI pipelines |
| Compact JSON | 4 KB | ~1,000 | LLM consumers (MCP tool calls) |
| Human text | 2 KB | ~500 | Terminal output |
| Markdown | 3 KB | ~750 | PR comments |

Compact mode strips the response down to the verdict, non-pass checks, top 10 findings, health summary, and split suggestion. The LLM gets what it needs for decision-making without spending context window on raw data.

---

## False Positive History

| Finding | Initial State | Fix | Final State |
|---|---|---|---|
| `FormatSARIF` flagged as dead code | SCIP missed cross-file reference in cmd/ckb | Added grep verification for same-package refs | Eliminated |
| 169 `discarded-error` on strings.Builder | Builder.Write never errors | Receiver-type tracking in AST | Eliminated |
| 10 `discarded-error` on hash.Hash | Hash.Write never errors | Added hash constructors to allowlist | Eliminated |
| 8 blast-radius on cobra Command vars | Framework registrations, not real callers | Framework symbol filter (skip variables/constants) | Eliminated |

Current false positive rate: **0%** (0 of 31 findings).

---

## Cost Comparison

Based on Claude Sonnet 4 pricing ($3/MTok input, $15/MTok output).

| Scenario | Input Tokens | Output Tokens | Cost | Findings |
|---|---|---|---|---|
| LLM reviews alone (small PR, 10 files) | ~20,000 | ~2,000 | $0.09 | ~2 |
| LLM reviews alone (large PR, 100 files) | ~200,000 | ~5,000 | $0.68 | ~4 |
| LLM reviews alone (huge PR, 600 files) | ~500,000 | ~10,000 | $1.65 | ~4 |
| CKB + LLM (small PR) | ~15,000 | ~2,000 | $0.08 | ~15 |
| CKB + LLM (large PR) | ~50,000 | ~3,000 | $0.20 | ~30 |
| CKB + LLM (huge PR) | ~80,000 | ~5,000 | $0.32 | ~30 |
| CKB alone (any size) | 0 | 0 | **$0.00** | 20-30 |

CKB's value scales with PR size. On a 10-file PR, savings are minimal (~10%). On a 600-file PR, savings are about **80%** ($1.65 → $0.32).

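The cost column follows directly from the token counts and the quoted per-million-token prices. A minimal Go sketch of that arithmetic, with hypothetical function names `Cost` and `Savings`, and using the approximate token counts from the table as inputs:

```go
package main

import "fmt"

// Prices quoted in this section, in USD per million tokens.
const (
	inputUSDPerMTok  = 3.0
	outputUSDPerMTok = 15.0
)

// Cost returns the USD cost of one review given its token counts.
func Cost(inputTokens, outputTokens int) float64 {
	return float64(inputTokens)/1e6*inputUSDPerMTok +
		float64(outputTokens)/1e6*outputUSDPerMTok
}

// Savings returns the fractional cost reduction from before to after.
func Savings(before, after float64) float64 {
	return (before - after) / before
}

func main() {
	// Huge PR (600 files), token counts taken from the table.
	alone := Cost(500_000, 10_000)
	withCKB := Cost(80_000, 5_000)
	fmt.Printf("LLM alone: $%.3f\n", alone)   // LLM alone: $1.650
	fmt.Printf("CKB + LLM: $%.3f\n", withCKB) // CKB + LLM: $0.315
	fmt.Printf("savings:   %.1f%%\n", Savings(alone, withCKB)*100) // savings:   80.9%
}
```

Because the token counts are approximate, the result should be read as a rough figure. It is consistent with the roughly 80% savings quoted for the 600-file case.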

---

## Environment

- **Hardware:** Apple Silicon (M-series), macOS
- **CKB version:** 8.2.0
- **Go version:** 1.26.1
- **SCIP indexer:** scip-go
- **LLM:** Claude Opus 4.6 (1M context)
- **MCP transport:** stdio