---
name: lean4-prove
description: >
  Retrieval-augmented Lean4 proof generation. Queries 94k+ exemplars from DeepSeek-Prover V1+V2,
  uses hybrid search (BM25 + semantic + graph), generates via Claude, compiles in Docker, retries on failure.
allowed-tools: Bash, Read, Docker
triggers:
  - prove this
  - lean4 proof
  - generate proof
  - verify lean4
  - lean4-prove
  - formalize this requirement
metadata:
  short-description: Retrieval-augmented Lean4 proof generation
---

# lean4-prove

Retrieval-augmented Lean4 proof generation for engineering requirements. Uses 94,000+ proven theorems from DeepSeek-Prover V1+V2 to guide proof synthesis via hybrid search (BM25 + semantic + graph traversal).

## Architecture

```
Requirement + Tactics + Persona
        │
        ▼
┌───────────────────────────┐
│ 1. RECALL similar proofs  │  ← Hybrid search on ArangoDB
│    from 94k+ exemplars    │     (BM25 + semantic + graph)
└───────────────────────────┘
        │
        ▼
┌───────────────────────────┐
│ 2. BUILD support pack     │
│    - Validated imports    │
│    - Tactic patterns      │
│    - Similar proofs       │
└───────────────────────────┘
        │
        ▼
┌───────────────────────────┐
│ 3. GENERATE N candidates  │  ← Claude with exemplar context
│    constrained by corpus  │
└───────────────────────────┘
        │
        ▼
┌───────────────────────────┐
│ 4. COMPILE each in        │  ← lean_runner Docker
│    lean_runner container  │
└───────────────────────────┘
        │
   ┌────┴────┐
   │         │
 Success   Failure
   │         │
   ▼         ▼
 Return   Retry with error feedback
          (up to max_retries)
```

## Why Retrieval-Augmented?

1. **Determinism** - Exact provenance: "used these 3 proofs as templates"
2. **Version alignment** - Exemplars use imports that actually work
3. **Fewer hallucinations** - Constrained to lemmas that exist
4. **Tactic idioms** - Transfers working `simp` sets and proof patterns

## Usage

```bash
# Basic proof
./run.sh --requirement "Prove n + 0 = n"

# With tactics preference
./run.sh -r "Prove commutativity of addition" -t "simp,ring,omega"

# With persona context
./run.sh -r "Prove message integrity" -p "cryptographer"

# Via stdin (JSON)
echo '{"requirement": "Prove n + 0 = n", "tactics": ["rfl"]}' | ./run.sh

# Custom settings
./run.sh -r "Prove theorem" --candidates 5 --retries 5 --model claude-sonnet-4-20250514
```

## Output

```json
{
  "success": true,
  "code": "import Mathlib\n\ntheorem list_append_length (xs ys : List α) :\n    (xs ++ ys).length = xs.length + ys.length := by\n  induction xs with\n  | nil => simp\n  | cons x xs ih => simp [List.cons_append, ih]",
  "attempts": 1,
  "candidate": 0,
  "errors": null,
  "retrieval": {
    "retrieved": 5,
    "tactics_added": ["simp", "aesop", "norm_num", "intro"],
    "imports_count": 3
  }
}
```

On failure:

```json
{
  "success": false,
  "code": null,
  "attempts": 9,
  "errors": [
    "Candidate 0 attempt 1: unknown identifier 'natAdd'",
    "Candidate 1 attempt 1: type mismatch..."
  ],
  "retrieval": {
    "retrieved": 5,
    "tactics_added": ["simp", "exact"],
    "imports_count": 3
  }
}
```

## Parameters

| Parameter           | Default                  | Description                       |
| ------------------- | ------------------------ | --------------------------------- |
| `--requirement, -r` | (required)               | Theorem to prove                  |
| `--tactics, -t`     | none                     | Comma-separated preferred tactics |
| `--persona, -p`     | none                     | Persona context for generation    |
| `--candidates, -n`  | 3                        | Parallel proof candidates         |
| `--retries`         | 3                        | Max retries per candidate         |
| `--model`           | claude-sonnet-4-20250514 | Claude model                      |
| `--container`       | lean_runner              | Docker container name             |
| `--timeout`         | 120                      | Compilation timeout (seconds)     |

## Environment Variables

```bash
# Proof generation
LEAN4_CONTAINER=lean_runner      # Docker container
LEAN4_TIMEOUT=120                # Compile timeout
LEAN4_MAX_RETRIES=3              # Retries per candidate
LEAN4_CANDIDATES=3               # Parallel candidates
LEAN4_PROVE_MODEL=opus           # Claude model (opus recommended for proofs)

# Retrieval (requires ArangoDB with ingested dataset)
LEAN4_RETRIEVAL=1                # Enable/disable retrieval (default: 1)
LEAN4_RETRIEVAL_K=5              # Number of exemplars to retrieve
ARANGO_URL=http://127.0.0.1:8529 # ArangoDB connection
ARANGO_DB=memory                 # Database name (same as memory skill)
```

## Dataset Setup

The skill uses 94,000+ theorems from DeepSeek-Prover V1+V2 for retrieval:
- **V1**: 27,503 theorems (`status: "proven"`)
- **V2**: 66,708 theorems (`status: "ok"` for 11,689 proven + others)

One-time ingest:

```bash
# Ingest full dataset (~5 min)
./ingest.sh

# Or limit for testing
./ingest.sh --limit 1000
```

This populates the `lean_theorems` collection in ArangoDB with:
- `formal_statement` - The theorem statement
- `formal_proof` - Working proof code
- `header` - Validated imports (Mathlib, Aesop, etc.)
- `tactics` - Extracted tactic names
- `source` - "deepseek-prover-v1" or "deepseek-prover-v2"
- `status` - "proven" (V1) or "ok"/"failed"/etc. (V2)

## Authentication

Uses Claude Code CLI (`claude -p`) in headless non-interactive mode.

The CLI is called with:
- `-p` flag for print/headless mode
- `--output-format text` for plain text output
- `--max-turns 1` for single-turn operation

Environment variables `CLAUDE_CODE` and `CLAUDECODE` are cleared to avoid recursion detection when called from within Claude Code.

No separate API key required - authentication is handled via your Claude subscription.

## Requirements

1. **Docker** with `lean_runner` container running (Lean4 + Mathlib installed)
2. **Claude Code CLI** (`claude`) in PATH with valid authentication
3. **ArangoDB** running locally (for retrieval - optional but recommended)
4. **Dataset ingested** via `./ingest.sh` (one-time setup)

## Tactics

Common Lean4/Mathlib tactics to suggest:

| Tactic      | Use For                 |
| ----------- | ----------------------- |
| `rfl`       | Reflexivity proofs      |
| `simp`      | Simplification          |
| `ring`      | Ring arithmetic         |
| `omega`     | Linear arithmetic       |
| `decide`    | Decidable propositions  |
| `exact`     | Exact term construction |
| `apply`     | Apply lemmas            |
| `induction` | Inductive proofs        |

## Examples

### Engineering: List operations

```bash
./run.sh -r "Prove length(xs ++ ys) = length(xs) + length(ys)" -t "simp,induction"
```

### Engineering: State machine property

```bash
./run.sh -r "When mux_enable is false, output shall equal default_value" \
  -p "embedded systems engineer" -t "simp,cases,decide"
```

### Engineering: Protocol correctness

```bash
./run.sh -r "Prove message append preserves checksum: checksum(msg ++ data) = update(checksum(msg), data)" \
  -t "simp,induction,ring"
```

### Cryptography

```bash
./run.sh -r "Prove that XOR is self-inverse: a ⊕ a = 0" -p "cryptographer" -t "simp,decide"
```

### Complex theorem

```bash
./run.sh -r "Prove the sum of first n natural numbers equals n*(n+1)/2" \
  -t "induction,simp,ring" \
  --candidates 5 \
  --retries 5
```

## Difference from lean4-verify

| Skill          | Purpose                                                          |
| -------------- | ---------------------------------------------------------------- |
| `lean4-verify` | Compile-only. Takes Lean4 code, returns pass/fail                |
| `lean4-prove`  | Full pipeline. Takes requirement, generates + compiles + retries |

Use `lean4-verify` when you already have Lean4 code to check.
Use `lean4-prove` when you need to generate the proof from a requirement.

## Advanced: Memory Integration

The skill integrates with the memory project for hybrid retrieval (BM25 + semantic + graph traversal):

```bash
# One-time setup: embed theorems and create edges
ARANGO_PASS=yourpass ARANGO_DB=memory python integrate_memory.py
```

This creates:
- **Embeddings**: 39k+ theorem embeddings in `lesson_embeddings` for semantic search
- **Tactic edges**: 10k+ edges between theorems sharing primary tactics
- **Similarity edges**: Edges between semantically similar theorems (cosine > 0.7)

This enables:
- **Semantic search**: Find theorems by meaning, not just keywords
- **Multi-hop traversal**: "What proofs use similar tactics?"
- **Impact analysis**: "If I change lemma X, what breaks?"

### Lemma Dependency Graph (Planned)

Extract lemma dependencies during compilation to build an impact graph:

```
Query: "What's affected if mux_default changes?"

Answer:
  mux_default
    └── system_safe_state (NEEDS RE-VERIFY)
          └── requirement_42 (NEEDS RE-VERIFY)
```

### Engineering Use Cases

1. **Ambiguity detection**: Formalization fails → requirement is vague
2. **Completeness check**: "Are all state transitions covered?"
3. **Regression impact**: "Change to X affects theorems Y, Z"
4. **Cross-reference**: "Which requirements share this lemma?"