# Tutorial: Solo Track

This tutorial walks through three complete examples showing how the solver, proxy, and judge interact to solve problems on the **Solo track** — one solver subprocess per problem with a fixed per-problem budget. Each walkthrough lists the model used in its header; the reference configuration in `pipeline/config.json` is `openai/gpt-oss-120b` via `deepinfra/bf16`.

For the **Marathon track** (one subprocess for N problems, shared global budget, triage-friendly) see [`../marathon/TUTORIAL.md`](../marathon/TUTORIAL.md). The two tracks share `judge/verify.py` and the five-status mapping; they differ only in solver I/O shape and budgeting.

For setup, architecture, protocol details, and configuration, see the [README](../../README.md) in the project root.

## Running the Demos

Make sure the environment is set up first (`bash scripts/setup.sh` if not done yet):

```bash
source .env.judge
```

Run any demo on the 20-problem sample set:

```bash
# baseline: brute-force + singleton + generic LLM fallback
python3 -m pipeline.runner \
  --submission examples/solo/demos/baseline \
  --problems examples/problems/sample_20.json

# twophase: deeper search + analysis-then-implementation LLM
python3 -m pipeline.runner \
  --submission examples/solo/demos/twophase \
  --problems examples/problems/sample_20.json

# opnorm reference: 16 deterministic strategies + structural-context LLM
python3 -m pipeline.runner \
  --submission examples/solo/demos/opnorm \
  --problems examples/problems/sample_20.json
```

For the full 200-problem evaluation, replace `sample_20.json` with `sample_200.json`. LLM calls require the `OPENROUTER_API_KEY` environment variable (or a per-config `api_key_env` for direct providers — see the README for details).

Your submission is a **single file**: `solver.py`, up to 500 KB. If your solver uses the LLM, the prompt template lives as a top-level `PROMPT = "..."` string constant inside `solver.py` (the proxy extracts it via AST — the module is never imported). All three demos under `examples/solo/demos/` are self-contained — just copy any folder and run.

The three Solo demos form a learning ladder:

- `baseline` — no LLM required; brute-force + singleton + generic LLM fallback. The simplest starting point and a complete reference for the stdin/stdout protocol.
- `twophase` — gpt-oss-120b; deeper deterministic search plus an analysis-then-implementation two-phase LLM. Shown in Walkthrough 1.
- `opnorm` — gpt-oss-120b; flagship reference mining solver — 16 deterministic strategies fed into a structural-context LLM call. Shown in Walkthrough 3.

---

## Walkthrough 1: Deterministic Counterexample (`twophase`)

**Problem**: `false_919_872` -- Does Equation919 imply Equation872?

- Hypothesis (Eq919): `x = y ◇ ((y ◇ y) ◇ (y ◇ x))`
- Goal (Eq872): `x = y ◇ ((x ◇ x) ◇ (y ◇ x))`

**Model**: none — deterministic counterexample; no LLM call is made.

### What happens

The `twophase` solver's deterministic stage exhaustively searches Cayley tables on Fin 2 through Fin 7. It finds a counterexample on **Fin 5** -- a 5x5 operation table where Equation919 holds but Equation872 does not.

### Interaction log

**Step 1** -- Proxy sends the problem to solver:

```json
{"problem": {"id": "false_919_872", "eq1_id": 919, "eq2_id": 872,
  "equation1": "x = y ◇ ((y ◇ y) ◇ (y ◇ x))",
  "equation2": "x = y ◇ ((x ◇ x) ◇ (y ◇ x))"}, "budget": {...}}
```

**Step 2** -- Solver finds counterexample (no LLM needed), sends judge request:

```json
{"call": "judge", "verdict": "false",
 "code": "import JudgeProblem\nimport JudgeDecide.DecideBang\nimport JudgeFinOp.MemoFinOp\nopen MemoFinOp\n\ndef submission : Goal := by\n  let m : Magma (Fin 5) := {\n    op := finOpTable \"[[3, 4, 1, 2, 0], [2, 0, 4, 1, 3], [0, 1, 2, 3, 4], [1, 3, 0, 4, 2], [4, 2, 3, 0, 1]]\"\n  }\n  refine ⟨Fin 5, m, ?_⟩\n  decideFin!\n"}
```

`JudgeProblem` is written by the judge per submission and binds
`EquationLHS` / `EquationRHS` to the two problem equations, plus an
`abbrev Goal` that names the target type (the ∃ or ∀ statement). The
submitter's `Submission.lean` only has to expose a `submission` of
type `Goal` — the theorem statement itself is judge-controlled (lives
in `Problem.lean`, not Submission.lean).

**Step 3** -- Judge compiles and verifies the Lean proof:

```json
{"status": "accepted", "message": "certificate accepted"}
```

### Result

- **Solved**: Yes
- **LLM calls**: 0
- **Judge calls**: 1
- **Time**: 1.9 seconds

This is the ideal case: the solver finds the answer deterministically with no LLM involvement. The brute-force search + `decideFin!` handles most false problems.

---

## Walkthrough 2: LLM Feedback Loop (`baseline`)

**Problem**: `true_608_593` -- Does Equation608 imply Equation593?

- Hypothesis (Eq608): `x = y ◇ (z ◇ (w ◇ (u ◇ x)))`
- Goal (Eq593): `x = y ◇ (z ◇ (w ◇ (y ◇ x)))`

**Model**: `openai/gpt-oss-120b` via `deepinfra/bf16`.

### What happens

The solver's deterministic stages (counterexample search + singleton pattern) both fail. The solver enters the LLM loop. The LLM makes **4 attempts**, receiving judge error feedback each time, before finding the correct proof.

### Interaction log

**Round 0** -- Solver sends LLM request (proxy fills the `PROMPT` template from `solver.py`):

```json
{"call": "llm", "context": {"round": "0"}}
```

The proxy fills the template with problem data and sends to gpt-oss-120b. The filled prompt includes:
```
Does Equation608 imply Equation593?
Hypothesis: x = y ◇ (z ◇ (w ◇ (u ◇ x)))
Goal: x = y ◇ (z ◇ (w ◇ (y ◇ x)))
Previous attempts: (no previous attempts)
```

**LLM response (wrong -- guesses false)**:
```json
{"verdict": "false", "counterexample_table": [[0, 1, 2], [1, 2, 0], [2, 0, 1]]}
```

**Judge rejects** -- the table actually satisfies both equations:
```
error: (kernel) application type mismatch
  of_decide_eq_true (Eq.refl true)
```

**Round 1** -- LLM sees the error, switches to true:
```json
{"verdict": "true", "proof": "intro h\nsimpa using (h (u:=y))"}
```

**Judge rejects** -- invalid argument name:
```
error: invalid argument name 'u' for function
```

**Round 2** -- LLM tries again:
```json
{"verdict": "true", "proof": "by\n  intro h x y z w\n  specialize h x y z w y\n  simpa using h"}
```

**Judge rejects** -- too many `intro` binders (used `intro h` consuming the hypothesis name):
```
error: tactic 'introN' failed, insufficient number of binders
```

**Round 3** -- LLM finally gets the syntax right:
```json
{"verdict": "true", "proof": "intro x y z w\nsimpa using h x y z w y"}
```

**Judge accepts!**

The key insight: Equation608 has variable `u` that appears only in `u ◇ x`. The proof substitutes `y` for `u`, giving exactly the goal. The LLM needed 4 tries to get the Lean syntax right.

### Final accepted code

```lean
import JudgeProblem

def submission : Goal := by
  intro G _ h x y z w
  simpa using h x y z w y
```

### Result

- **Solved**: Yes
- **LLM calls**: 4
- **Judge calls**: 4
- **Time**: 77 seconds

This shows the feedback loop: each judge error is automatically included in `{history.attempts}` for the next LLM call, so the LLM can learn from its mistakes.

---

## Walkthrough 3: Reference Mining Solver (`opnorm`)

**Problem**: `normal_0121` -- Does Equation3580 imply Equation4304?

- Hypothesis (Eq3580): `x ◇ y = y ◇ ((z ◇ w) ◇ w)`
- Goal (Eq4304): `x ◇ (x ◇ y) = z ◇ (y ◇ y)`

**Model**: `openai/gpt-oss-120b` via `deepinfra/bf16`.

### What happens

`opnorm` is the flagship reference mining solver — not a teaching demo. It runs **16 deterministic proof strategies** (counterexample search on Fin 2-7, singleton detection, library lookup, constancy lemmas, BFS near-miss search, several calc-chain variants). All 16 fail for this problem. On the first LLM call the model emits a correct 3-step `calc` proof and the judge accepts immediately.

The shape of the win mirrors structural-context calc proofs, but the method is triggered autonomously from solver-side constancy analysis rather than a hand-crafted prompt for a designed problem. This was captured during a 1000-problem mining sweep over the `normal` set and is the first LLM-guided win catalogued for `opnorm` (mining run dated 2026-04-23).

### Solver's analysis (sent to LLM via `{solver.*}` placeholders, excerpt)

```
Goal has 3 vars, hypothesis has 4 vars – might be a direct substitution
No counterexample found on Fin 2-7 (exhaustive 2-3 + structured + random 4-7)
This strongly suggests the implication is TRUE. Focus on writing a proof.
CONSTANCY: Variables {'w', 'z'} appear ONLY on the RHS of h. For FIXED {'y'},
the RHS is constant regardless of {'w', 'z'}. So h(same_x, y1, z1) and
h(same_x, y2, z2) give the same LHS.
PROOF STRATEGY: Use a calc chain or transitivity. Apply h with specific args
to rewrite, then apply h again with different args.
```

### LLM response (correct on first try)

```json
{
  "verdict": "true",
  "proof": "intro x y z\ncalc\n  x ◇ (x ◇ y) = (x ◇ y) ◇ ((z ◇ z) ◇ z) := h x (x ◇ y) z z\n  _ = (y ◇ y) ◇ ((z ◇ z) ◇ z) := by\n    have hxy : x ◇ y = y ◇ y := (h x y z z).trans ((h y y z z).symm)\n    exact congrArg (fun t => t ◇ ((z ◇ z) ◇ z)) hxy\n  _ = z ◇ (y ◇ y) := (h z (y ◇ y) z z).symm"
}
```

**Judge accepts!**

### Final accepted code

```lean
import JudgeProblem

def submission : Goal := by
  intro G _ h
  intro x y z
  calc
    x ◇ (x ◇ y) = (x ◇ y) ◇ ((z ◇ z) ◇ z) := h x (x ◇ y) z z
    _ = (y ◇ y) ◇ ((z ◇ z) ◇ z) := by
      have hxy : x ◇ y = y ◇ y := (h x y z z).trans ((h y y z z).symm)
      exact congrArg (fun t => t ◇ ((z ◇ z) ◇ z)) hxy
    _ = z ◇ (y ◇ y) := (h z (y ◇ y) z z).symm
```

### Result

- **Solved**: Yes
- **LLM calls**: 1 (after 16 deterministic strategies failed)
- **Judge calls**: 17
- **Time**: 87.4 seconds

The mining-sweep sweet spot: deterministic strategies cover the easy bulk, and when they fall short, the solver's structural analysis (constancy, near-miss, useful instantiations) feeds the LLM enough context to nail the proof on the first try — the pattern emerging autonomously during live mining rather than from a hand-crafted prompt.

---

## Summary: Demo Comparison

| Demo | Model | Strategy | LLM Role |
|------|-------|----------|----------|
| `baseline` | OpenRouter default | Brute-force + singleton + generic LLM fallback | Generic — retries with error feedback |
| `twophase` | gpt-oss-120b | Deeper search + two-phase LLM | Structured — analysis phase, then implementation |
| `opnorm` | gpt-oss-120b | 16 deterministic strategies + structural-context LLM | Flagship — fed constancy / near-miss / instantiation analysis |

## Key Takeaways for Contestants

1. **Deterministic strategies first**: Counterexample search on Fin 2-7 solves most false problems. Singleton detection, library matching, and transitive chains solve many true problems. No LLM needed.

2. **The prompt matters more than the solver loop**: The difference between `baseline` (generic prompt, several LLM rounds for a simple problem) and `opnorm` (specialized prompt with solver-side structural analysis, often one LLM round for a hard problem) is largely in the prompt template and the structural context it receives.

3. **Judge feedback is free information**: Each judge error is automatically included in `{history.attempts}`. Design your prompt to help the LLM learn from these errors.

4. **Solver context is your secret weapon**: The `{solver.*}` placeholders let you pass arbitrary analysis to the LLM. The more structural information you compute deterministically (constancy, BFS near-misses, useful instantiations), the better the LLM performs.

5. **Budget awareness**: Each problem has a wall-clock budget (3600s in the Solo reference config — tuned for multi-round LLM loops under `reasoning_effort=medium`; subject to refinement from community feedback) and a per-call `max_output_tokens` cap on LLM responses (65536 by default) — there is **no per-problem cap** on the number of judge or LLM calls. Deterministic judge calls are cheap (~1s); LLM calls are expensive (~20s). Spend deterministic judge calls freely to try patterns; pace your LLM calls so you don't burn the wall-clock before trying the promising strategies.