---
name: crucible-research-foundations
description: Validate findings, design shuffled nulls, check label leakage, review causal features. TRIGGERS - shuffled null, label leakage
allowed-tools: Read, Grep, Glob
---

# Research Foundations: 6 epistemic disciplines

> **Self-Evolving Skill**: This skill improves through use. If a discipline's guidance fails in practice or a new trap emerges, update the relevant section AND append to `references/evolution-log.md`. Don't defer.

Read these in order. The first three (causal, labels, nulls) are hard prerequisites: violating any of them silently invalidates every downstream result.

---

## 1. Causal-feature invariant (bars[:i])

Every feature `f[i]` used at trigger/decision bar `i` must be computable using only `bars[0:i]`, never `bars[i]`, never `bars[i+1:]`. A violation produces look-ahead bias, and findings silently become worthless.

**Canonical pattern**:

```python
for i in range(n):
    lo = max(0, i - window)
    wind = values[lo:i]  # EXCLUSIVE upper bound: no peeking
    f[i] = compute(wind)
```

Note `lo:i` (exclusive), not `lo:i+1`. This discipline "feels off by one" but is correct.

**Verification test** (add to every new feature function):

```python
import numpy as np

def test_causality(fn, n=1000):
    # generate_test_bars: project helper, assumed to return a float NumPy array
    bars = generate_test_bars(n)
    f_orig = fn(bars)
    bars_mod = bars.copy()
    bars_mod[500:] *= 2  # perturb bar 500 and everything after it
    f_mod = fn(bars_mod)
    # f[500] may only read bars[0:500], so indices 0..500 must be unchanged;
    # equal_nan=True lets NaN warm-up values compare equal
    assert np.array_equal(f_orig[:501], f_mod[:501], equal_nan=True), "look-ahead detected"
```

**Silent-bug signature**: impossibly clean results (tw > 10 bps on FX, win rate > 70%, OOS matches IS perfectly).

Full reference: `findings/methodology/10-causal-feature-invariant.md`.

---

## 2. Label-leakage (bar-local scaling kills window leakage)

Forward labels must be scaled to the **triggering bar's own range**, NEVER to a window-wide scale. Window-relative labels are tautological.

**Trap**: If you label `fwd+H = UP when close[i+H] - close[i] > window.span/20`, then when `close[i]` is near `window.min` (loc=B), `fwd=UP` is near-automatic.
Agents will report spurious "signals".

**Fix**: use bar-local triple-barrier labels:

```python
r = high[i] - low[i]  # THIS bar's range, not window's
tp_level = close[i] + tp_mult * r
sl_level = close[i] - sl_mult * r
# walk forward, exit at first tp/sl/expiry
```

**Symptom that you fell into the trap**: apparent signal strengthens monotonically with `loc` quintile; collapses when you test adjacent cells.

Full reference: `findings/methodology/02-label-leakage-bar-local-scaling.md`.

---

## 3. Shuffled-null design (3 null types: get the right one)

Shuffled-null tests are mandatory before trust, but the **choice of what to shuffle** is a design decision.

| Hypothesis class                             | Shuffle WHAT                                                  | Session example                                           |
| -------------------------------------------- | ------------------------------------------------------------- | --------------------------------------------------------- |
| "Feature X predicts outcomes"                | Shuffle the feature values                                    | Phase F-B (used wrong null, "falsified" a real signal)    |
| "Trigger pattern fires at informative times" | Shuffle the trigger mask (preserve fire-rate, move locations) | Phase C (validated ngram_triple_fast_up at z=+5.74)       |
| "Filter improves selection"                  | Shuffle which trades pass the filter                          | Phase L-C (evaluated filters against N-size random draws) |

**Rule**: ask "what is the alternative hypothesis, in one sentence?" If you can't state it, you don't know what you're testing.

**Common mistakes**:

- Using feature-shuffle when testing a trigger pattern → destroys temporal structure the pattern depends on → real signal looks worse than shuffled noise
- Under-tight null (null std huge relative to observed effect) → no statistical power
- Over-tight null (too few permutations) → unreliable z-estimates; use ≥100 for z<3, ≥1000 for z<2

Full reference: `findings/methodology/03-shuffled-null-design.md`.

---

## 4. Agent significance corrections (z-scores are overstated 2-3×)

LLM agents systematically overstate z-scores.
Treat agent-reported z-scores as **upper bounds** (equivalently, agent-reported p-values as lower bounds).

**Three overstatement patterns**:

1. **Ignored multiple-testing burden**: agent tests 25 variants, reports z=2.43 vs nominal 1.96 threshold. An approximate Bonferroni-style threshold is `sqrt(2 * ln(N))`; for N=25 that's z>2.54, and the exact two-sided Bonferroni cutoff at α=0.05 is stricter still (z≈3.09).
2. **Confused sample-mean z with binomial-proportion z**: 53.5% vs 50% on N=840 gives z≈2.0 not 4.2.
3. **Extremum-of-K treated as single test**: "top combo from 17,280" has expected null-max `null_mean + null_std × sqrt(2 ln K)` ≈ null_mean + 4.4σ. An observed tw that's below that expectation is not a finding.

**Always verify**:

- How many implicit tests did the agent run?
- Re-derive z yourself: `(real - null.mean) / null.std`
- Approximate Bonferroni-style threshold for K tests: `z > sqrt(2 * ln K)`

**Trust thresholds**:

- z > 5, N > 500: likely real, test further
- z in [3, 5]: promising, mandatory gate validation
- z in [2, 3]: suspect, require adjacent-cell gradient + null test
- z < 2: treat as null

Full reference: `findings/methodology/09-agent-significance-corrections.md`.

---

## 5. Record-keeping discipline (append-only ledger + audit folders)

Every investigation, positive or null, must produce a permanent, discoverable record.

**3-layer architecture**:

```
findings/
├── evolution/
│   ├── evolution.jsonl              # append-only ledger
│   └── audits/
│       └── YYYY-MM-DD-slug/
│           ├── CLAUDE.md            # navigator
│           ├── verdict.md           # plain-English conclusion
│           ├── CHRONICLE.md         # narrative (for major findings)
│           ├── .py                  # script that regenerates headline numbers
│           └── .json                # raw telemetry
└── methodology/                     # universal principles
```

**Ledger entry fields**: `id`, `date`, `status`, `supersedes`, `superseded_by`, `headline`, `key_numbers`, `evidence` (file paths), `sha256_results`.

**The supersedes pattern**: when a later finding replaces an earlier one, ADD a new entry with `supersedes: "OLD-ID"`; UPDATE the old entry with `superseded_by: "NEW-ID"`. **Do NOT delete** the older audit folder.

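
The supersedes pattern can be sketched as a small helper. This is a minimal illustration, not the project's actual tooling: the function name `supersede` is hypothetical, and it assumes the "UPDATE" of the old entry is done by rewriting that one field in the JSONL file while leaving every other field and every other entry untouched.

```python
import json
from pathlib import Path


def supersede(ledger_path: Path, old_id: str, new_entry: dict) -> None:
    """Append new_entry and cross-link it with the entry it replaces.

    Assumption: one JSON object per line, each carrying an "id" field.
    """
    entries = [
        json.loads(line)
        for line in ledger_path.read_text().splitlines()
        if line.strip()
    ]
    if not any(e["id"] == old_id for e in entries):
        raise KeyError(f"no ledger entry with id {old_id!r}")

    # Point the old entry forward; nothing is deleted.
    for e in entries:
        if e["id"] == old_id:
            e["superseded_by"] = new_entry["id"]

    # Add the new entry with a backward pointer.
    entries.append({**new_entry, "supersedes": old_id})

    # Write to a temp file first so a crash cannot truncate the ledger.
    tmp = ledger_path.with_name(ledger_path.name + ".tmp")
    tmp.write_text("".join(json.dumps(e) + "\n" for e in entries))
    tmp.replace(ledger_path)
```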
Full reference: `findings/methodology/07-record-keeping-discipline.md`.

---

## 6. Post-mortem-before-abandon

Before declaring a signal dead, enrich every trade with causal pre-entry features and hunt filters on individual losses. A "sometimes works" signal is often a filterable signal in disguise.

**Pipeline**:

1. Run the signal across full history; collect N trade outcomes
2. Compute ~20-30 causal features at each trigger bar
3. Emit per-trade parquet + CSV (one row per trade)
4. Ship to multi-lens agents (see Skill B)
5. Each agent hunts filters that separate winners from losers
6. Evaluate filters against shuffled-null (see §3)

**Kill-selectivity metric**: `losers_killed / max(1, winners_killed)`. < 1.0 = harmful; 1.0-1.2 = marginal; 1.2-1.5 = useful; > 1.5 = strong.

Session example: `+0.178 bps` baseline → `+0.514 bps` after Phase-L filter, a 2.9× lift from the enrichment-driven filter hunt.

Full reference: `findings/methodology/06-per-trade-enrichment-postmortem.md`.

---

## Confirmation counts (provisional, as of session ca9d7ffa)

| Principle                   | Confirmed         | Notes                                                                            |
| --------------------------- | ----------------- | -------------------------------------------------------------------------------- |
| 1. causal-feature-invariant | 18+ (every phase) | Fundamental; drop only with proof                                                |
| 2. label-leakage            | 2                 | Directly caught spurious "lower-rejection-at-bottom"                             |
| 3. shuffled-null-design     | 4                 | Phase F-B wrong-null, Phase C right-null, Phase L filter-null, Phase M mgmt-null |
| 4. agent-sig-corrections    | 5+                | Combinatorialist, transition-asymmetry, trade-mgmt agents all overstated         |
| 5. record-keeping           | 5 ledger entries  | Full chain for NGRAM3FU-STRADDLE                                                 |
| 6. post-mortem              | 1                 | Phase L delivered the filter; needs re-confirmation on other campaigns           |

Higher `confirmed` = more trustworthy. Principle 6 has only one confirmation and should be treated as provisional.

---

## Post-Execution Reflection

After invoking this skill:

1. Did applying a principle catch a bug or false positive? Increment its `confirmed` count in the table above; note the session where it fired in `references/evolution-log.md`.
2. Did a principle fail (bad guidance)? Demote it in the table; add a `superseded_by` pointer in `references/archive/` with `resurrect_if:` conditions.
3. New trap that isn't covered? Draft a new section here and append to the evolution log.
4. Never silently move on. This skill's value compounds only if reality-corrections flow back.
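
When a reflection pass involves re-checking an agent's reported significance, the three re-derivations from §4 (binomial-proportion z, the `sqrt(2 ln K)` expected null maximum, and `(real - null.mean) / null.std`) can be sketched with the standard library alone. The function names below are illustrative and not part of this skill; `bonferroni_z` is an added assumption that shows the exact two-sided cutoff for comparison.

```python
import math
import statistics
from typing import Sequence


def binomial_z(wins: int, n: int, p0: float = 0.5) -> float:
    """z-score of an observed win proportion against a null rate p0."""
    p_hat = wins / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)


def expected_null_max_z(k: int) -> float:
    """Rule-of-thumb size of the largest of k independent null z-scores."""
    return math.sqrt(2 * math.log(k))


def bonferroni_z(k: int, alpha: float = 0.05) -> float:
    """Exact two-sided Bonferroni z cutoff for k tests at family-wise alpha."""
    return statistics.NormalDist().inv_cdf(1 - alpha / (2 * k))


def null_z(real: float, null_samples: Sequence[float]) -> float:
    """Re-derive z as (real - null.mean) / null.std from shuffled-null draws."""
    mean = statistics.fmean(null_samples)
    std = statistics.stdev(null_samples)  # sample std; needs >= 2 draws
    return (real - mean) / std
```

With the §4 numbers, `binomial_z(449, 840)` comes out near 2.0, `expected_null_max_z(25)` near 2.54, and `expected_null_max_z(17280)` near 4.4.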