--- name: mechinterp-next-step-planner description: Propose next mechanistic interpretability experiments based on research state, hypotheses, and existing evidence --- # MechInterp Next-Step Planner Analyze current research state and propose the next best experiments to run. Writes ExperimentSpec JSON files directly to the specs directory. ## Purpose The planner skill: - Analyzes current hypotheses and their evidence status - Identifies gaps in understanding - Proposes 2-3 experiments that would discriminate between hypotheses - Writes ExperimentSpec JSON files ready for the runner ## When to Use Use this skill when you need to: - Decide what experiment to run next - Find experiments that test uncertain hypotheses - Generate specs after updating research state - Plan a sequence of experiments ## Decision Heuristics The planner considers: 1. **Hypothesis uncertainty**: Prioritize experiments for hypotheses with confidence near 0.5 2. **Evidence gaps**: Look for hypotheses with few evidence items 3. **Validation needs**: Suggest validation for high-confidence hypotheses (>0.7) 4. **ReLU floor avoidance**: Skip weapon experiments if activation is low 5. **Constraint compliance**: All specs enforce one-rung-per-family 6. **Suppressor detection**: Suggest token_influence_sweep when single-family features may have hidden suppressors ## Planning Workflow ### 1. Load Current State ```python from splatnlp.mechinterp.state import ResearchStateManager manager = ResearchStateManager(feature_id=18712, model_type="ultra") print(manager.get_summary()) ``` ### 2. Analyze Evidence Gaps Check which hypotheses need more testing: ```python for h in manager.state.hypotheses: n_support = len(h.supporting_evidence) n_refute = len(h.refuting_evidence) print(f"{h.id}: conf={h.confidence:.0%}, support={n_support}, refute={n_refute}") if h.status == "proposed" and n_support == 0: print(f" -> NEEDS TESTING") elif h.confidence > 0.7 and h.status == "supported": print(f" -> NEEDS VALIDATION") ``` ### 3. Generate Experiment Specs Based on the analysis, create appropriate specs: ```python from splatnlp.mechinterp.schemas import ExperimentSpec, ExperimentType from splatnlp.mechinterp.state.io import get_spec_path, SPECS_DIR import json from datetime import datetime # Example: Family sweep for untested hypothesis about SCU spec = ExperimentSpec( id=datetime.now().strftime("%Y%m%d_%H%M%S"), type=ExperimentType.FAMILY_1D_SWEEP, feature_id=18712, model_type="ultra", variables={ "family": "special_charge_up", "rungs": [3, 12, 29, 41, 57], "include_absent": True }, dataset_slice={ "percentile_min": 10.0, "percentile_max": 90.0, "sample_size": 500 }, description="Test SCU response to identify threshold rung", parent_hypothesis="h001", rationale="Hypothesis h001 claims SCU >= 41 AP triggers feature. " "This sweep will identify the exact threshold." ) # Write spec to file spec_path = SPECS_DIR / spec.to_filename() spec_path.parent.mkdir(parents=True, exist_ok=True) spec_path.write_text(spec.model_dump_json(indent=2)) print(f"Spec written to: {spec_path}") ``` ### 4. Common Experiment Templates #### For Family-Specific Hypotheses ```python # "Feature responds to family X at high AP" ExperimentSpec( type=ExperimentType.FAMILY_1D_SWEEP, feature_id=feature_id, model_type="ultra", variables={"family": "swim_speed_up"}, ) ``` #### For Multi-Family Interactions ```python # "SCU and ISS have synergistic effect" ExperimentSpec( type=ExperimentType.FAMILY_2D_HEATMAP, feature_id=feature_id, model_type="ultra", variables={ "family_x": "special_charge_up", "family_y": "ink_saver_sub" }, ) ``` #### For Pattern Discovery ```python # "Feature detects specific ability combinations" ExperimentSpec( type=ExperimentType.FREQUENT_ITEMSETS, feature_id=feature_id, model_type="ultra", variables={ "min_support": 0.05, "max_size": 4, "high_activation_pct": 10.0 }, ) ``` #### For Validation ```python # Validate stable high-confidence hypothesis ExperimentSpec( type=ExperimentType.SPLIT_HALF, feature_id=feature_id, model_type="ultra", variables={"n_splits": 10}, parent_hypothesis="h001", ) ``` #### For Token Influence Analysis (Enhancers + Suppressors) ```python # Identify both enhancing and suppressing tokens # CRITICAL: Use this when you need to understand what a feature AVOIDS ExperimentSpec( type=ExperimentType.TOKEN_INFLUENCE_SWEEP, feature_id=feature_id, model_type="ultra", variables={ "min_samples": 50, "high_percentile": 0.995, # Top 0.5% "collapse_families": True, "suppressor_threshold": 0.8, # < 0.8 = suppressed "enhancer_threshold": 1.2, # > 1.2 = enhanced }, description="Identify enhancers and suppressors for complete interpretation", ) ``` ## Planner Decision Tree ``` ┌─────────────────────────────────────────┐ │ Load Research State │ └────────────────┬────────────────────────┘ │ ┌────────────▼────────────┐ │ Any hypotheses? │ └────────────┬────────────┘ │ No ◄────┴────► Yes │ │ ▼ ▼ ┌─────────────────┐ ┌─────────────────────────┐ │ Exploration │ │ Any untested? │ │ - PageRank │ └────────────┬────────────┘ │ - Itemsets │ │ │ - Family sweeps │ Yes ◄───┴───► No └─────────────────┘ │ │ ▼ ▼ ┌─────────────┐ ┌─────────────────┐ │ Test most │ │ Any uncertain? │ │ uncertain │ │ (conf ~0.5) │ └─────────────┘ └────────┬────────┘ │ Yes ◄───┴───► No │ │ ▼ ▼ ┌─────────────┐ ┌─────────────────┐ │ Discriminate│ │ Any high-conf │ │ experiments │ │ need validation?│ └─────────────┘ └────────┬────────┘ │ Yes ◄───┴───► No │ │ ▼ ▼ ┌─────────────┐ ┌───────────┐ │ Validation │ │ Research │ │ experiments │ │ complete! │ └─────────────┘ └───────────┘ ``` ## Guardrails The planner enforces these rules: 1. **One-rung-per-family**: All specs include this constraint 2. **ReLU floor check**: If state has pitfall "relu_floor_at_low_activation", avoid weapon sweeps 3. **Evidence diversification**: Prefer different experiment types than recently run 4. **Hypothesis coverage**: Ensure all active hypotheses have at least one pending experiment 5. **Suppressor check**: For single-family features, suggest token_influence_sweep to detect what the feature AVOIDS (critical for complete interpretation) 6. **Conditional sweep requirement**: After confirming primary driver with 1D sweep, ALWAYS suggest 2D heatmaps for secondary abilities (see below) ## ⚠️ CRITICAL: Conditional 2D Sweep Logic **After a 1D sweep confirms a primary driver, 1D sweeps for other abilities are MISLEADING.** ### Why 1D Sweeps Fail for Secondary Abilities When feature has strong primary driver (e.g., SCU): - Most contexts have LOW primary → activation ≈ 0 - Secondary ability can't suppress zero - 1D sweep averages across ALL contexts → delta ≈ 0 - **False conclusion**: "secondary has no effect" ### Planner Logic ```python def suggest_next_experiments(state): # Check if primary driver is confirmed primary_confirmed = any( h.status == "supported" and h.confidence > 0.7 for h in state.hypotheses if "primary" in h.tags or "monotonic" in h.tags ) if primary_confirmed: primary_family = get_confirmed_primary(state) correlated_abilities = get_overview_top_tokens(state, top_k=10) # For EACH correlated ability, suggest 2D heatmap for ability in correlated_abilities: if ability != primary_family: suggest_2d_heatmap( family_x=primary_family, family_y=ability, rationale=f"Test conditional effect of {ability} at each {primary_family} level" ) # ALWAYS suggest death-mitigation battery for death_ability in ["quick_respawn", "special_saver", "comeback"]: if not already_tested(state, primary_family, death_ability): suggest_2d_heatmap( family_x=primary_family, family_y=death_ability, rationale="Death-aversion test" ) ``` ### Template: Post-Primary Conditional Sweeps After confirming primary driver `{PRIMARY}`, automatically suggest: ```json // Batch 1: Death-mitigation (REQUIRED) [ {"type": "family_2d_heatmap", "variables": {"family_x": "{PRIMARY}", "family_y": "quick_respawn"}}, {"type": "family_2d_heatmap", "variables": {"family_x": "{PRIMARY}", "family_y": "special_saver"}}, {"type": "family_2d_heatmap", "variables": {"family_x": "{PRIMARY}", "family_y": "comeback"}} ] // Batch 2: Top correlated from overview (if not already primary) [ {"type": "family_2d_heatmap", "variables": {"family_x": "{PRIMARY}", "family_y": "{SECONDARY_1}"}}, {"type": "family_2d_heatmap", "variables": {"family_x": "{PRIMARY}", "family_y": "{SECONDARY_2}"}} ] // Batch 3: Weapon interaction (optional) [ {"type": "weapon_sweep", "variables": {"condition_family": "{PRIMARY}", "condition_rung": 29}} ] ``` ### Interpreting 2D Results | 2D Result | Meaning | Label Update | |-----------|---------|--------------| | Y suppresses at high X | Anti-synergy | Add "{Y}-averse" | | Y enhances at high X | Synergy | Add "{Y}-synergistic" | | Y flat across X levels | Spurious correlation | Remove from label | | No data at high X + high Y | Mutual exclusion | Note in label | ## Output Location Specs are written to: ``` /mnt/e/mechinterp_runs/specs/{timestamp}__f{feature_id}__{type}.json ``` Example filenames: - `20250607_142531__f18712__family-1d-sweep.json` - `20250607_143022__f18712__frequent-itemsets.json` - `20250607_143500__f18712__split-half.json` ## See Also - **mechinterp-state**: Manage research state - **mechinterp-runner**: Execute generated specs - **mechinterp-glossary-and-constraints**: Domain rules - **mechinterp-summarizer**: Process results - **mechinterp-ability-semantics**: Ability semantic groupings for interpretation - **mechinterp-investigator**: Full investigation workflow