---
name: lean4-prove
description: >
  Retrieval-augmented Lean4 proof generation with self-improving lab.
  Queries 94k+ exemplars from DeepSeek-Prover V1+V2, uses hybrid search
  (BM25 + semantic + graph), generates via Claude, compiles via
  lean-interact (20 parallel silos), retries on failure. Lab mode: HF
  dataset ingestion, parallel compilation, English↔Lean4 autoformalization,
  GRPO training with compiler rewards, convergence loop, and adversarial
  benchmarking.
allowed-tools: Bash, Read, Docker
triggers:
  - prove this
  - lean4 proof
  - generate proof
  - verify lean4
  - lean4-prove
  - formalize this requirement
  - lab ingest
  - lab compile
  - lab formalize
  - lab converge
  - lab benchmark
  - lab report
metadata:
  short-description: Retrieval-augmented Lean4 proof generation with self-improving lab
provides:
  - lean4-prove
  - lean4-lab
composes:
  - memory
  - code-runner
  - scillm
  - edge-verifier
  - embedding
  - task-monitor
  - ops-runpod
  - create-gpt
  - episodic-archiver
---

> STOP. READ THIS ENTIRE SKILL.MD BEFORE CALLING ANY ENDPOINT.

# lean4-prove

Retrieval-augmented Lean4 proof generation for engineering requirements. Uses 94,000+ proven theorems from DeepSeek-Prover V1+V2 to guide proof synthesis via hybrid search (BM25 + semantic + graph traversal).

## Architecture (updated 2026-03-29)

```
Requirement + Tactics + Persona
              │
              ▼
┌───────────────────────────┐
│ 1. RECALL similar proofs  │ ← Hybrid search on ArangoDB
│    from 94k+ exemplars    │   (BM25 + semantic + graph)
└───────────────────────────┘
              │
              ▼
┌───────────────────────────┐
│ 2. BUILD support pack     │
│    - Validated imports    │
│    - Tactic patterns      │
│    - Similar proofs       │
└───────────────────────────┘
              │
              ▼
┌───────────────────────────┐
│ 3. /code-runner session   │ ← LLM via /scillm (httpx)
│    propose → compile →    │   DoD: lean4 HTTP compile
│    fix → retry (5 rounds) │   Escalation: text → claude
└───────────────────────────┘
              │
              ▼
┌───────────────────────────┐
│ 4. COMPILE via HTTP svc   │ ← lean4-prove-service:8604
│    lean-interact backend  │   (lean4 v4.28.0 + Mathlib)
└───────────────────────────┘
              │
         ┌────┴────┐
         │         │
      Success   Failure
         │         │
         ▼         ▼
      Return   /code-runner retries with
               error feedback
               (up to max_rounds)
```

### Key Changes (2026-03-29)

- **LLM**: Uses `/scillm` via httpx (NOT `claude -p` subprocess)
- **Retry loop**: Delegated to `/code-runner` (propose→compile→fix→retry)
- **Compile**: lean4-prove-service HTTP at :8604 (NOT docker exec)
- **Model**: `text` (scillm cascade) by default, not `opus`
- **Toolchain**: Pinned to Lean v4.28.0 (Mathlib not yet released for v4.29.0)

## Why Retrieval-Augmented?

1. **Determinism** - Exact provenance: "used these 3 proofs as templates"
2. **Version alignment** - Exemplars use imports that actually work
3. **Fewer hallucinations** - Constrained to lemmas that exist
4. **Tactic idioms** - Transfers working `simp` sets and proof patterns

## Usage

```bash
# Basic proof
./run.sh --requirement "Prove n + 0 = n"

# With tactics preference
./run.sh -r "Prove commutativity of addition" -t "simp,ring,omega"

# With persona context
./run.sh -r "Prove message integrity" -p "cryptographer"

# Via stdin (JSON)
echo '{"requirement": "Prove n + 0 = n", "tactics": ["rfl"]}' | ./run.sh

# Custom settings (uses /scillm text cascade by default)
./run.sh -r "Prove theorem" --candidates 3 --retries 5 --model text
```

## Output

```json
{
  "success": true,
  "code": "import Mathlib\n\ntheorem list_append_length (xs ys : List α) :\n    (xs ++ ys).length = xs.length + ys.length := by\n  induction xs with\n  | nil => simp\n  | cons x xs ih => simp [List.cons_append, ih]",
  "attempts": 1,
  "candidate": 0,
  "errors": null,
  "retrieval": {
    "retrieved": 5,
    "tactics_added": ["simp", "aesop", "norm_num", "intro"],
    "imports_count": 3
  }
}
```

On failure:

```json
{
  "success": false,
  "code": null,
  "attempts": 9,
  "errors": [
    "Candidate 0 attempt 1: unknown identifier 'natAdd'",
    "Candidate 1 attempt 1: type mismatch..."
  ],
  "retrieval": {
    "retrieved": 5,
    "tactics_added": ["simp", "exact"],
    "imports_count": 3
  }
}
```

## Parameters

| Parameter           | Default     | Description                           |
| ------------------- | ----------- | ------------------------------------- |
| `--requirement, -r` | (required)  | Theorem to prove                      |
| `--tactics, -t`     | none        | Comma-separated preferred tactics     |
| `--persona, -p`     | none        | Persona context for generation        |
| `--candidates, -n`  | 3           | Number of /code-runner sessions       |
| `--retries`         | 3           | Max rounds per /code-runner session   |
| `--model`           | text        | scillm model (text, gemini, deepseek) |
| `--container`       | lean_runner | Docker container (legacy, ignored)    |
| `--timeout`         | 120         | Compilation timeout (seconds)         |

## Environment Variables

```bash
# Proof generation
LEAN4_TIMEOUT=120       # Compile timeout (seconds)
LEAN4_MAX_RETRIES=3     # Retries per /code-runner session
LEAN4_CANDIDATES=3      # Number of /code-runner sessions
LEAN4_PROVE_MODEL=text  # scillm model (text cascade by default)

# LLM backend (uses /scillm httpx, NOT claude -p)
SCILLM_API_BASE=http://localhost:4001  # scillm Docker proxy
SCILLM_PROXY_KEY=sk-dev-proxy-123      # scillm auth key

# Compilation backend (HTTP service preferred)
LEAN4_SERVICE_URL=http://127.0.0.1:8604  # lean4-prove-service FastAPI

# Retrieval (requires ArangoDB with ingested dataset)
LEAN4_RETRIEVAL=1                 # Enable/disable retrieval (default: 1)
LEAN4_RETRIEVAL_K=5               # Number of exemplars to retrieve
ARANGO_URL=http://127.0.0.1:8529  # ArangoDB connection
ARANGO_DB=memory                  # Database name (same as memory skill)
```

## Dataset Setup

The skill uses 94,000+ theorems from DeepSeek-Prover V1+V2 for retrieval:

- **V1**: 27,503 theorems (`status: "proven"`)
- **V2**: 66,708 theorems (`status: "ok"` for 11,689 proven + others)

One-time ingest:

```bash
# Ingest full dataset (~5 min)
./ingest.sh

# Or limit for testing
./ingest.sh --limit 1000
```

This populates the `lean_theorems` collection in ArangoDB with:

- `formal_statement` - The theorem statement
- `formal_proof` - Working proof code
- `header` - Validated imports (Mathlib, Aesop, etc.)
- `tactics` - Extracted tactic names
- `source` - "deepseek-prover-v1" or "deepseek-prover-v2"
- `status` - "proven" (V1) or "ok"/"failed"/etc. (V2)

## LLM Backend

Uses `/scillm` Docker proxy via httpx (NOT `claude -p` subprocess). The scillm text cascade routes to DeepSeek-V3 by default.

## HTTP API (lean4-prove-service at :8604)

The Docker container exposes these endpoints:

| Endpoint         | Method | Description                                              |
| ---------------- | ------ | -------------------------------------------------------- |
| `/health`        | GET    | Worker count + timeout                                   |
| `/compile`       | POST   | Compile Lean4 code (no LLM)                              |
| `/compile-batch` | POST   | Compile multiple proofs in parallel                      |
| `/step-verify`   | POST   | Per-tactic incremental elaboration                       |
| `/prove`         | POST   | **Full pipeline**: generate via scillm → compile → retry |

### POST /prove (composition endpoint)

This is what `/create-evidence-case` and other skills call via httpx. No subprocess.

```json
POST http://127.0.0.1:8604/prove
{
  "requirement": "Prove that for all natural numbers n, n + 0 = n",
  "tactics": ["rfl", "simp"],
  "model": "text",
  "max_retries": 3,
  "timeout": 60
}
```

Response:

```json
{
  "success": true,
  "code": "theorem test (n : Nat) : n + 0 = n := by simp",
  "attempts": 1,
  "errors": null
}
```

The container calls `/scillm` (via `host.docker.internal:4001`) for LLM generation and compiles locally with lean-interact (20-worker pool, Mathlib cached).

## Compilation Backend

The lean4-prove-service container uses lean-interact with a 20-worker LeanServerPool. Mathlib is cached in Docker volumes (`lean4-lake-cache`, `lean4-mathlib-cache`). Toolchain pinned to **v4.28.0** (Mathlib not yet released for v4.29.0).

## Requirements

1. **lean4-prove-service** Docker container running (lean-interact + Mathlib cached)
2. **scillm** Docker proxy running at :4001 (reachable from container via `host.docker.internal`)
3. **/code-runner** skill available (for CLI invocation; not used by HTTP API)
4. **ArangoDB** running locally (for retrieval - optional but recommended)
5. **Dataset ingested** via `./ingest.sh` (one-time setup)

## Tactics

Common Lean4/Mathlib tactics to suggest:

| Tactic      | Use For                 |
| ----------- | ----------------------- |
| `rfl`       | Reflexivity proofs      |
| `simp`      | Simplification          |
| `ring`      | Ring arithmetic         |
| `omega`     | Linear arithmetic       |
| `decide`    | Decidable propositions  |
| `exact`     | Exact term construction |
| `apply`     | Apply lemmas            |
| `induction` | Inductive proofs        |

## Examples

### Engineering: List operations

```bash
./run.sh -r "Prove length(xs ++ ys) = length(xs) + length(ys)" -t "simp,induction"
```

### Engineering: State machine property

```bash
./run.sh -r "When mux_enable is false, output shall equal default_value" \
  -p "embedded systems engineer" -t "simp,cases,decide"
```

### Engineering: Protocol correctness

```bash
./run.sh -r "Prove message append preserves checksum: checksum(msg ++ data) = update(checksum(msg), data)" \
  -t "simp,induction,ring"
```

### Cryptography

```bash
./run.sh -r "Prove that XOR is self-inverse: a ⊕ a = 0" -p "cryptographer" -t "simp,decide"
```

### Complex theorem

```bash
./run.sh -r "Prove the sum of first n natural numbers equals n*(n+1)/2" \
  -t "induction,simp,ring" \
  --candidates 5 \
  --retries 5
```

## Difference from lean4-verify

| Skill          | Purpose                                                          |
| -------------- | ---------------------------------------------------------------- |
| `lean4-verify` | Compile-only. Takes Lean4 code, returns pass/fail                |
| `lean4-prove`  | Full pipeline. Takes requirement, generates + compiles + retries |

Use `lean4-verify` when you already have Lean4 code to check. Use `lean4-prove` when you need to generate the proof from a requirement.
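That verify-vs-prove choice can be sketched as a small dispatcher. This is illustrative only: `LEAN4_SERVICE_URL` and the `/compile` and `/prove` endpoint paths come from this document, but the `/compile` payload shape and the `choose_endpoint` helper are assumptions, not part of the service API.

```python
import os

SERVICE_URL = os.environ.get("LEAN4_SERVICE_URL", "http://127.0.0.1:8604")


def choose_endpoint(requirement=None, code=None):
    """Pick a service endpoint: /compile when Lean4 code already exists
    (lean4-verify style), /prove when only a requirement is given
    (lean4-prove style). Returns (url, payload).
    NOTE: the /compile payload key is a guess; check the service schema."""
    if code is not None:
        return f"{SERVICE_URL}/compile", {"code": code}
    if requirement is not None:
        return f"{SERVICE_URL}/prove", {"requirement": requirement, "model": "text"}
    raise ValueError("need either Lean4 code or a requirement")
```

A caller would then POST the payload to the returned URL via httpx, matching the composition pattern used by `/create-evidence-case`.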
## Advanced: Memory Integration

The skill integrates with the memory project for hybrid retrieval (BM25 + semantic + graph traversal):

```bash
# One-time setup: embed theorems and create edges
ARANGO_PASS=yourpass ARANGO_DB=memory python integrate_memory.py
```

This creates:

- **Embeddings**: 39k+ theorem embeddings in `lesson_embeddings` for semantic search
- **Tactic edges**: 10k+ edges between theorems sharing primary tactics
- **Similarity edges**: Edges between semantically similar theorems (cosine > 0.7)

This enables:

- **Semantic search**: Find theorems by meaning, not just keywords
- **Multi-hop traversal**: "What proofs use similar tactics?"
- **Impact analysis**: "If I change lemma X, what breaks?"

### Proof Jobs Queue (Control Extraction Pipeline)

The control extraction pipeline (s12_framework_mapper) queues `proof_jobs` in ArangoDB for lean4-prove to consume. Each job represents a requirement→control pair where the requirement claims to implement a framework control (NIST, CWE, SPARTA, etc.).

**Queue**: `proof_jobs` collection (40,210 pending jobs as of 2026-02-25)

```json
{
  "source_type": "requirement_control",
  "requirement_chunk": "datalake_chunks/",
  "control_id": "AC-2",
  "control_ref": "sparta_controls/",
  "framework": "NIST",
  "requirement_text": "The system SHALL implement automated account management...",
  "status": "pending",
  "priority": 1,
  "attempts": 0
}
```

**On success**: Creates `proof_requirement_edges` and updates `requirement_control_edges.lean4_status` from "pending" to "proved".

**Priority**: NIST and SPARTA jobs are priority 1; others priority 2.
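The ordering a consumer applies to this queue can be sketched as a pure function. The field names (`status`, `priority`, `attempts`) come from the job document above; the `next_proof_job` helper itself is hypothetical, not part of the pipeline code.

```python
def next_proof_job(jobs):
    """Pick the next proof job to attempt: pending jobs only,
    priority 1 (NIST/SPARTA) before priority 2, and among equal
    priorities prefer jobs with fewer prior attempts."""
    pending = [j for j in jobs if j.get("status") == "pending"]
    if not pending:
        return None
    return min(pending, key=lambda j: (j.get("priority", 2), j.get("attempts", 0)))
```

For example, given a priority-1 NIST job with 0 attempts and a priority-2 CWE job, the NIST job is selected first.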
**Lemma Graph Structure**:

```
sparta_controls/AC-2 (SPEC)
    ↑ chunk_control_edges (EVIDENCE — many)
    ↑ requirement_control_edges (CLAIM — subset with lean4_status)
    ↑ proof_requirement_edges (PROOF — when lean4 succeeds)
```

### Multi-Predicate Support

The QRA consistency verifier supports 5 relationship predicates via `PREDICATE_REGISTRY`:

| Predicate      | Lean4 Type                   | Parameter           | Axiom                                  |
| -------------- | ---------------------------- | ------------------- | -------------------------------------- |
| `subsumes`     | `Control -> Control -> Prop` | None                | Transitivity                           |
| `countered_by` | `Control -> Control -> Prop` | `bool` (enabled)    | Negation: enabled → ¬ threat_active    |
| `mitigated_by` | `Control -> Control -> Prop` | `float` (coverage)  | Threshold: coverage_met → risk_reduced |
| `exploits`     | `Control -> Control -> Prop` | `bool` (vulnerable) | Propagation: vulnerable → compromised  |
| `maps_to`      | `Control -> Control -> Prop` | None                | Symmetry                               |

Edge types from ArangoDB are normalized (`"maps-to"` → `maps_to`) and matched to the correct predicate automatically.

### Typed Parameters

Controls can have typed parameters for what-if analysis:

- **`bool`**: Toggle (enabled/disabled, vulnerable/not) — emits `Prop` axioms
- **`float`**: Slider with bounds (coverage %, confidence score) — emits coverage_met/risk_reduced propositions
- **`enum`**: Dropdown (NOMINAL/DEGRADED/CRITICAL) — emits `inductive ControlStatus`

Parameters are modeled via the `ControlParameter` dataclass in `qra_models.py`.
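The edge-type normalization step can be sketched roughly as follows. The predicate names come from the registry table above; the registry values and the `normalize_edge_type` helper are illustrative assumptions, not the actual `PREDICATE_REGISTRY` contents.

```python
# Hypothetical mirror of PREDICATE_REGISTRY; real entries carry Lean4
# predicate types and axiom templates, not just these summary fields.
PREDICATE_REGISTRY = {
    "subsumes": {"param": None, "axiom": "transitivity"},
    "countered_by": {"param": "bool", "axiom": "negation"},
    "mitigated_by": {"param": "float", "axiom": "threshold"},
    "exploits": {"param": "bool", "axiom": "propagation"},
    "maps_to": {"param": None, "axiom": "symmetry"},
}


def normalize_edge_type(edge_type):
    """Normalize an ArangoDB edge type ("maps-to", "Maps To", ...) to a
    registry key ("maps_to"); raise KeyError if no predicate matches."""
    key = edge_type.strip().lower().replace("-", "_").replace(" ", "_")
    if key not in PREDICATE_REGISTRY:
        raise KeyError(f"no predicate for edge type {edge_type!r}")
    return key
```

Rejecting unknown edge types loudly (rather than defaulting to `maps_to`) keeps the verifier from silently emitting the wrong axioms.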
### What-If Queries

Identify which theorem chains break when a control parameter changes:

```bash
# Dry-run: generate Lean4 without compiling
python qra_consistency.py what-if --control REC-0001 --param enabled --value false --dry-run

# Full compilation
python qra_consistency.py what-if --control REC-0001 --param enabled --value false

# JSON output for inspector panel consumption
python qra_consistency.py what-if --control REC-0001 --param enabled --value false --dry-run --output-format json
```

JSON output:

```json
{
  "control": "REC-0001",
  "parameter": "enabled",
  "new_value": false,
  "affected_chains": [
    {
      "chain": "REC-0001 -> T1595 -> SC-7",
      "predicate": "countered_by",
      "status": "BROKEN"
    }
  ],
  "summary": { "total": 4, "broken": 2, "held": 2 }
}
```

The UX Lab inspector panel composes with this command for cascade propagation analysis.

### Lemma Dependency Graph

Impact analysis via what-if queries and lemma dependency tracking:

```
Query: "What's affected if REC-0001 is disabled?"

Answer (via what-if):
REC-0001 (enabled=false)
└── countered_by T1595 (BROKEN — counter-control disabled)
    └── mitigated_by SC-7 (BROKEN — upstream broke)
```

### Engineering Use Cases

1. **Ambiguity detection**: Formalization fails → requirement is vague
2. **Completeness check**: "Are all state transitions covered?"
3. **Regression impact**: "Change to X affects theorems Y, Z" (via what-if)
4. **Cross-reference**: "Which requirements share this lemma?"
5. **Cascade analysis**: "If I disable this control, what chains break?" (via what-if)

## Common Mistakes

### WRONG: `claude -p` subprocess for LLM calls

```bash
echo "Prove n+0=n" | claude -p  # old approach, uses subprocess
```

### RIGHT: LLM calls go through /scillm via httpx

```bash
./run.sh -r "Prove n + 0 = n" --model text  # uses scillm cascade
```

### WRONG: Running proofs without lean4-prove-service

```bash
./run.sh -r "Prove theorem"  # fails if :8604 service is not running
```

### RIGHT: Verify the compilation service is healthy first

```bash
curl -s http://127.0.0.1:8604/health  # check worker count
./run.sh -r "Prove theorem"
```

### WRONG: Lean v4.29.0 toolchain (Mathlib not released for it)

```
leanpkg.toml: lean = "leanprover/lean4:v4.29.0"
```

### RIGHT: Pin to v4.28.0 (current Mathlib-compatible version)

```
./run.sh -r "Prove theorem"  # uses container's pinned toolchain
```
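The "check health first" pattern can also be wrapped as a preflight helper for scripted callers. This is a minimal sketch: only the `/health` endpoint and `LEAN4_SERVICE_URL` come from this document; the assumption that `/health` returns a JSON body, and the `service_ready` name, are illustrative.

```python
import json
import os
import urllib.request


def service_ready(base_url=None, timeout=3.0):
    """Preflight check: True only when lean4-prove-service answers on
    /health with HTTP 200 and a JSON body. Mirrors the
    `curl -s http://127.0.0.1:8604/health` step above. The body's fields
    (worker count, timeout) are deliberately not asserted here."""
    base = base_url or os.environ.get("LEAN4_SERVICE_URL", "http://127.0.0.1:8604")
    try:
        with urllib.request.urlopen(f"{base}/health", timeout=timeout) as resp:
            json.load(resp)  # must at least be valid JSON
            return resp.status == 200
    except (OSError, ValueError):
        return False
```

Callers can then fail fast with a clear message instead of burning `/code-runner` retry rounds against a dead compile service.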