# neurotrace — Agent Skill

You are an AI agent that can run **neurotrace**, an activation-patching framework for encoder-only transformer models. Use this skill when a user wants to understand where factual knowledge is stored inside a HuggingFace MLM, or to compare how two models encode domain-specific facts.

## When to use this skill

- User asks "does [model] actually know [domain] facts?"
- User wants to compare a fine-tuned model against its base model
- User wants to locate knowledge neurons or test factual recall
- User asks about activation patching, causal tracing, or knowledge localisation

## Prerequisites

The repo must be installed:

```bash
cd /path/to/neurotrace
uv venv && uv pip install -e .
source .venv/bin/activate
```

### Option A: Use the built-in Strands orchestrator agent

If you have a Strands-compatible LLM provider configured, you can delegate to the built-in orchestrator:

```python
from neurotrace.providers import get_model
from neurotrace.agents import create_orchestrator

model = get_model("anthropic", client_args={"api_key": "sk-..."})
# Also available: "openai", "bedrock", "gemini"
agent = create_orchestrator(model=model)
agent("Compare SecureBERT vs ModernBERT on cybersecurity knowledge")
```

The orchestrator has tools for dataset generation, tracing, charting, and interpretation. It will drive the full pipeline autonomously.

### Option B: Drive the pipeline yourself (step-by-step below)

## Step-by-step protocol

### Step 1: Gather information from the user

Ask the following questions (skip any the user has already answered):

1. **Domain**: What knowledge domain? (e.g. cybersecurity, medicine, law, finance, programming)
2. **Goal**: What do you want to find out? (e.g. "Does BioBERT store drug-disease associations differently from BERT?")
3. **Models**: Which models to compare? You need a display name + HuggingFace ID for each. At least one model, ideally two for comparison.
4. **Facts**: What factual associations to test? For each fact, you need:
   - A sentence template with `{}` where the subject goes (e.g. "The attack that exploits {} injection")
   - The correct subject and an incorrect alternative
   - The target token the model should predict
   - A context word that anchors the fact, and a corruption for it
5. **Variants** (optional): Custom prompt phrasing variants, or use defaults.

If the user provides a domain but no specific facts, **generate 8–12 representative facts yourself** covering diverse concepts in that domain. Ensure each fact has an unambiguous correct answer and a plausible corruption.

### Step 2: Build the DatasetSpec

Create the spec (see the JSON schema below for the full field list) and generate the dataset:

```python
from neurotrace.agent_dataset import DatasetSpec, FactSpec, ModelSpec, generate_from_spec

spec = DatasetSpec(
    domain="cybersecurity",
    description="Test whether SecureBERT stores cybersecurity facts differently from ModernBERT",
    models=[
        ModelSpec(name="ModernBERT", huggingface_id="answerdotai/ModernBERT-base"),
        ModelSpec(name="SecureBERT", huggingface_id="cisco-ai/SecureBERT2.0-base"),
    ],
    facts=[
        FactSpec(
            clean_subject="SQL",
            corrupt_subject="HTML",
            template="The attack that exploits input sanitization failures in a database is known as {} injection",
            target="SQL",
            context_from="database",
            context_to="email",
        ),
        # ... more facts ...
    ],
    output_dir="output",
)

dataset, config = generate_from_spec(spec, save_path="output/dataset.json")
```
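If the user supplied custom phrasing variants (Step 1, item 5), they can go on the spec itself. This is a minimal sketch, assuming `DatasetSpec` accepts a `variants` argument that mirrors the `variants` field in the JSON schema below (a list of `[prefix, suffix]` pairs wrapped around each template); the phrasings shown are illustrative, not neurotrace defaults.

```python
# Sketch only: attach custom prompt variants to the spec.
# Assumes DatasetSpec takes a `variants` argument matching the JSON schema's
# "variants" field ([prefix, suffix] pairs); the strings below are illustrative.
spec_with_variants = DatasetSpec(
    domain=spec.domain,
    description=spec.description,
    models=spec.models,
    facts=spec.facts,
    variants=[
        ["", ""],                     # bare template
        ["In cybersecurity, ", ""],   # domain-framing prefix
        ["Experts agree that ", ""],  # authority-framing prefix
    ],
    output_dir=spec.output_dir,
)
dataset, config = generate_from_spec(spec_with_variants, save_path="output/dataset.json")
```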
### Step 3: Run the trace

```python
from neurotrace import CausalTracer
from neurotrace.tracer import save_results

results = []
results_dict = {}
for model_spec in spec.models:
    tracer = CausalTracer(model_spec.name, model_spec.huggingface_id)
    result = tracer.trace(dataset)
    results.append(result)
    results_dict[model_spec.name] = result.to_dict()
    tracer.release()

save_results(results, f"{spec.output_dir}/trace_results.json")
```

### Step 4: Generate charts

```python
from neurotrace import plot_comparison, plot_patching_flow, plot_heatmap, plot_peak_bars

out = spec.output_dir

# Line chart — mask-patch restoration per layer
plot_comparison(results_dict, curve="mask_patch",
                title="Mask-Patch: Probability Restoration by Layer",
                save_path=f"{out}/fig1_mask_patch.png")

# Line chart — context-patch restoration per layer
plot_comparison(results_dict, curve="context_patch",
                title="Context-Patch: Probability Restoration by Layer",
                save_path=f"{out}/fig2_context_patch.png")

# Patching flow diagram
peak = results[0].peak_layer()[0]
plot_patching_flow(num_layers=results[0].num_layers, highlighted_layer=peak,
                   save_path=f"{out}/fig3_flow.png")

# Heatmap
plot_heatmap(results_dict, save_path=f"{out}/fig4_heatmap.png")

# Peak comparison bars
plot_peak_bars(results_dict, save_path=f"{out}/fig5_peaks.png")
```

### Step 5: Interpret the results

Report to the user:

1. **Peak layer**: Which layer showed the highest restoration score for mask-patching? This is where the model stores the domain knowledge.
2. **Peak magnitude**: How strong was the signal? A larger peak means the model relies more on context to retrieve the fact.
3. **Context-patch vs mask-patch**: If context-patch is flat (near zero), the model stores facts directly at the prediction position, not via context tokens.
4. **Clean-minus-corrupt baseline** (`clean_minus_corrupt` in the results): If positive, clean context helps. If negative, the model's knowledge is robust enough that corruption doesn't hurt — a sign of strong factual entrenchment.
5. **Cross-model comparison**: If two models peak at the same layer but with different magnitudes, the one with the *smaller* peak may actually have *stronger* knowledge (less to restore because corruption matters less).

## DatasetSpec JSON schema

```json
{
  "domain": "string (required)",
  "description": "string (required)",
  "models": [
    { "name": "string", "huggingface_id": "string" }
  ],
  "facts": [
    {
      "clean_subject": "string",
      "corrupt_subject": "string",
      "template": "string with {} placeholder",
      "target": "string",
      "context_from": "string",
      "context_to": "string"
    }
  ],
  "variants": [["prefix", "suffix"], ...],
  "output_dir": "string (default: 'output')"
}
```

## Fact design guidelines

When generating facts for the user:

- The template MUST contain `{}` where the subject goes
- `clean_subject` should be the unambiguously correct answer
- `corrupt_subject` should be a plausible but wrong alternative from the same category
- `target` is the token the model predicts (usually the same as `clean_subject`)
- `context_from` is a word in the template that supports the correct answer
- `context_to` replaces it with something that shifts the meaning but keeps the sentence grammatical
- Aim for 8–12 diverse facts per domain, covering different sub-topics
- Each fact should be testable as a fill-in-the-blank: "...is known as [MASK]..."
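As a worked example of these guidelines, here is one possible fact from a different domain (medicine). It only uses the `FactSpec` fields shown in Step 2; the specific values are illustrative and should be adapted to the user's domain.

```python
from neurotrace.agent_dataset import FactSpec

# Illustrative fact following the guidelines above; the values are examples,
# not data shipped with neurotrace.
fact = FactSpec(
    clean_subject="insulin",      # unambiguously correct answer
    corrupt_subject="glucagon",   # plausible but wrong, same category (hormones)
    template="The hormone prescribed to control blood sugar in diabetes is {}",
    target="insulin",             # token the model should predict
    context_from="diabetes",      # template word that supports the correct answer
    context_to="allergies",       # shifts the meaning, keeps the sentence grammatical
)
```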
## Output files

After a run, the output directory will contain:

| File | Contents |
|------|----------|
| `dataset.json` | The generated prompt pairs |
| `run_config.json` | Models, domain, metadata |
| `trace_results.json` | Per-layer restoration scores |
| `fig1_mask_patch.png` | Mask-patch line chart |
| `fig2_context_patch.png` | Context-patch line chart |
| `fig3_flow.png` | Patching flow diagram |
| `fig4_heatmap.png` | Layer × model heatmap |
| `fig5_peaks.png` | Peak layer bar chart |

## Error handling

- If a model fails to load, `CausalTracer.__init__` will raise an exception. Catch it and report the HuggingFace ID to the user — they may need to `pip install` a specific tokenizer or use `trust_remote_code=True`.
- If the `[MASK]` token is not found, the tracer falls back to finding the subject token position. This works but is less precise for MLMs.
- Running on CPU is fine for base-size models (~120M params). For large models, suggest the user use a GPU.
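For the first point, a defensive variant of the Step 3 loop looks roughly like this. neurotrace does not document a specific exception type for load failures, so catching a broad `Exception` is an assumption here.

```python
# Defensive version of the Step 3 loop: skip models that fail to load and
# report the offending HuggingFace ID instead of aborting the whole run.
results = []
results_dict = {}
for model_spec in spec.models:
    try:
        tracer = CausalTracer(model_spec.name, model_spec.huggingface_id)
    except Exception as exc:  # exact exception type is not documented; assumed broad
        print(f"Could not load {model_spec.name} ({model_spec.huggingface_id}): {exc}")
        print("Hint: a missing tokenizer package or trust_remote_code=True may be the cause.")
        continue
    try:
        result = tracer.trace(dataset)
        results.append(result)
        results_dict[model_spec.name] = result.to_dict()
    finally:
        tracer.release()
```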