---
name: add-step
description: Add or insert a new numbered pipeline step in the p2cs_project, following the standardized Step/DataFile/explore notebook patterns and safely renumbering subsequent steps (code, data, figures, tests, config, dependencies, and imports) when inserting in the middle of the pipeline. Use when creating a new step_* directory, inserting a step between existing ones, or updating step numbers across the project.
---

# Add Step (p2cs_project)

## When to Use

Use this skill when:

- A new pipeline step needs to be added under `code/step_{N}_{name}/`.
- A step should be **inserted between existing steps**, requiring **renumbering** of later steps.
- You need a checklist for creating the step class, substeps, README, `explore.ipynb`, and wiring dependencies/config.

This skill is **project-specific** to `p2cs_project` and assumes the architecture described in the root `README.md`.

---

## Overview: Step Pattern

Each numbered step follows the same high-level pattern:

- Directory: `code/step_{N}_{snake_name}/`
  - `step.py`: main `Step{N}{CamelName}` class inheriting from `base.Step`.
  - `data_classes.py`: data classes inheriting from `DataFile` subclasses (when the step has new artifacts).
  - `__init__.py`: re-exports the main step class (and sometimes helper symbols).
  - `README.md`: step-specific documentation (inputs, outputs, substeps, external tools).
  - `explore.ipynb`: exploration notebook following the **standard notebook structure** from the root `README.md`.
  - Numbered substeps: `1_*.py`, `2_*.py`, etc., each implementing focused logic.
- Outputs in `data/step_{N}_{snake_name}/` and figures in `figures/step_{N}_{snake_name}/` are routed via `base.paths`.
- The step is registered in:
  - `pipeline_config.yaml` under `steps: step_{N}_{snake_name}: ...`
  - Tests in `code/tests/test_step_{N}.py`.
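The pattern above can be sketched as a minimal class skeleton. Everything here is illustrative: the stand-in `Step` base, the `Step9ExampleAnalysis` name, the literal paths, and the method signatures are hypothetical placeholders for the real API in `code/base/step.py` and `code/base/paths.py`.

```python
from typing import Dict, List


class Step:
    """Stand-in for the real base class in code/base/step.py (illustration only)."""

    def run_substeps(self, substeps, step_numbers, descriptions, on_failure="strict"):
        # The real implementation also handles logging, ordering, and failure modes.
        for substep in substeps:
            substep()


class Step9ExampleAnalysis(Step):
    """Hypothetical step class for a code/step_9_example_analysis/ directory."""

    name = "step_9_example_analysis"
    description = "Illustrative step showing the standard class layout."

    @property
    def dependencies(self) -> List[str]:
        # Canonical IDs of upstream steps.
        return ["step_1_get_p2cs_data"]

    def get_input_paths(self) -> Dict[str, str]:
        # In the real project these are built from data classes plus
        # get_step_input_path / get_step_output_path, not literal strings.
        return {"pairs": "data/step_4_prepare_pairs/pairs.pkl"}

    def get_output_paths(self) -> Dict[str, str]:
        return {"results": "data/step_9_example_analysis/results.pkl"}

    def run(self) -> None:
        self.run_substeps(
            substeps=[lambda: None],  # placeholder for numbered substep callables
            step_numbers=[1],
            descriptions=["placeholder substep"],
            on_failure="strict",
        )
```

The key points the sketch encodes are that `dependencies` returns canonical step-ID strings and that `run()` delegates to `run_substeps(...)` rather than doing work inline.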
**Good templates to copy from:**

- Simple single-substep data step: `step_4_prepare_pairs/`
- Multi-substep data/tooling step: `step_2_organism_distance/`
- Modeling/evaluation step: `step_6_train_model/`, `step_7_crosstalk_estimation/`

---

## Workflow A: Append a New Step at the End

1. **Determine the new step index and name**
   - Inspect existing numbered steps under `code/step_*`.
   - Let `N_max` be the largest index (currently `8` for `step_8_generate_paper`).
   - Choose:
     - New index: `N_new = N_max + 1`
     - Snake name: `step_{N_new}_{snake_name}`
     - Class name: `Step{N_new}{CamelName}`

2. **Create the step directory**
   - Create `code/step_{N_new}_{snake_name}/` with at least:
     - `__init__.py` (re-export the main step class).
     - `step.py` (main step implementation).
     - `README.md` (step documentation).
     - `explore.ipynb` (exploration notebook).
     - One or more numbered substeps `1_*.py`, `2_*.py`, etc.
     - `data_classes.py` and/or `config.json` if this step defines new data types or config.
   - **Recommended pattern:** Copy the closest existing step directory (e.g., `step_4_prepare_pairs/`) and rename/trim to match the new step’s responsibilities.

3. **Implement the main step class**
   - In `step.py`:
     - Import `Step` and path helpers from `code/base/step.py` and `code/base/paths.py`.
     - Define a class like `class Step{N_new}{CamelName}(Step):`
     - Implement:
       - `name` and `description` properties (or class attributes).
       - `dependencies` property returning a `List[str]` of upstream steps, using canonical IDs like `"step_1_get_p2cs_data"`.
       - `get_input_paths()` and `get_output_paths()` using data classes and `get_step_input_path` / `get_step_output_path`.
       - `run()` orchestrating any substeps via `self.run_substeps(...)`.

4. **Define data classes (if needed)**
   - In `data_classes.py`:
     - Inherit from appropriate `DataFile` subclasses (e.g., `PickleDataFile`, `CSVDataFile`, `NumpyDataFile`).
     - Define schemas, descriptions, and default loaders/savers as in existing steps.
   - Use these data classes in `get_input_paths()` / `get_output_paths()` and in substeps.

5. **Create numbered substeps**
   - Add scripts `1_*.py`, `2_*.py`, etc. inside the new step directory.
   - Follow existing substep patterns:
     - Each substep is a small class/function using the step’s data classes and `paths` helpers.
     - The main step’s `run()` calls `self.run_substeps(...)` with:
       - Substep objects
       - `step_numbers=[1, 2, ...]`
       - `descriptions=[...]`
       - Appropriate `on_failure` mode (`"strict"` or `"warning"`).

6. **Create the step README**
   - In `README.md`, mirror the structure used in other steps:
     - Short description.
     - Inputs (data classes, upstream steps).
     - Outputs and their data classes.
     - Substeps and what they do.
     - Any external tools / configs required.

7. **Create the `explore.ipynb` notebook**
   - Follow the standard structure from the root `README.md`:
     - `# Imports` (path setup + step/data class imports).
     - `# Load Data`
       - `## Load Inputs` (using `step.get_input_paths()` + data classes).
       - `## Load Outputs`.
     - `# Plot`
       - Display saved figures from visualization substeps first.
       - Put any extra exploratory plots after those.
     - `# Notes` (short list of exploration ideas).
   - Respect the **collapsible headings rule** (heading-only markdown cells).

8. **Wire the step into `pipeline_config.yaml`**
   - Under `steps:`, add a new entry:
     - Key: `step_{N_new}_{snake_name}:`
     - Fields: `enabled`, `description`, `overwrite_outputs`, optional `fast_plots`, and `substeps:`.
   - Add a `substeps:` section keyed by the filenames (without `.py`), matching patterns in other steps.

9. **Add tests**
   - Create `code/tests/test_step_{N_new}.py` by copying a nearby test (e.g., `test_step_4.py`) and adjusting:
     - Imports to the new step and data classes.
     - Test names and assertions to cover the new step’s behavior.

10. **Run tests / pipeline checks**
    - Run `pytest code/tests/test_step_{N_new}.py`.
    - Optionally run the step via `cd code && python run_pipeline.py --step step_{N_new}_{snake_name}`.

---

## Workflow B: Insert a Step in the Middle (with Renumbering)

Use this when inserting a new step **between** existing steps (e.g., between `step_3_embed_proteins` and `step_4_prepare_pairs`).

### B1. Plan the new ordering

1. **Identify current step order**
   - List existing `code/step_*` directories and their indices (including `step_0_draw_theoretical`).

2. **Choose insertion point**
   - Let:
     - `N_insert_after` = index of the step **before** the new one.
     - `N_new = N_insert_after + 1`.
   - All steps with index `> N_insert_after` must be shifted **up by 1**:
     - Old `k` → new `k + 1` for all `k > N_insert_after`.

3. **Decide the new step’s ID**
   - Choose:
     - New directory name: `step_{N_new}_{snake_name}`.
     - New class name: `Step{N_new}{CamelName}`.

### B2. Renumber existing steps (highest → lowest)

Perform renaming **from highest index down to `N_insert_after + 1`** to avoid collisions. For each step index `k` in descending order where `k > N_insert_after`:

1. **Compute new index**
   - `k_new = k + 1`.

2. **Rename step directories**
   - Code: `code/step_{k}_{name}/` → `code/step_{k_new}_{name}/`.
   - Data: `data/step_{k}_{name}/` → `data/step_{k_new}_{name}/` (if it exists).
   - Figures: `figures/step_{k}_{name}/` → `figures/step_{k_new}_{name}/` (if it exists).

3. **Rename tests**
   - `code/tests/test_step_{k}.py` → `code/tests/test_step_{k_new}.py`.

4. **Update configuration keys**
   - In `pipeline_config.yaml`, change `step_{k}_{name}:` → `step_{k_new}_{name}:`.

5. **Update string references and imports**
   - Use text search for `step_{k}_{name}` and `test_step_{k}` across the repo and update to the new IDs:
     - Imports like `from step_{k}_{name}...`.
     - Dependency lists in `dependencies` properties (e.g., `return ["step_{k}_{name}", ...]`).
     - Any key strings that reference `step_{k}_{name}`.

6. **Update doc references**
   - In `README.md` files and notebooks, update any textual references to the old step name or number, if present.

### B3. Add the new step

After all affected steps `k > N_insert_after` have been shifted to `k + 1`:

1. **Create `code/step_{N_new}_{snake_name}/`**
   - Follow **Workflow A, steps 2–7** to:
     - Implement `step.py` and `data_classes.py`.
     - Add numbered substeps.
     - Add `README.md`.
     - Add `explore.ipynb`.

2. **Wire into `pipeline_config.yaml`**
   - Under `steps:` add `step_{N_new}_{snake_name}:` with its configuration and `substeps`.

3. **Update dependencies**
   - For the new step: set `dependencies` to the upstream steps, using the **renumbered IDs**.
   - For downstream steps: review their `dependencies` properties, replace any old IDs that were shifted, and add the new step as a dependency where appropriate.

4. **Add test file**
   - Create `code/tests/test_step_{N_new}.py` following neighboring step tests.

5. **Sanity check references**
   - Run a repo-wide search for any **old** step IDs (`step_{k}_{name}` where `k` was renumbered) and ensure all references are either removed or updated to the new IDs.

### B4. Validate after renumbering

1. **Run targeted tests**
   - Run `cd code && pytest tests/test_step_{N_new}.py`.
   - Plus tests for all renumbered steps: `test_step_{k_new}.py`.

2. **Run a dry pipeline**
   - Optionally run:
     - `cd code && python run_pipeline.py --list-steps` to confirm updated IDs and ordering.
     - `cd code && python run_pipeline.py --step step_{N_new}_{snake_name}` to test the new step in context.

---

## Notebook Guidelines (Quick Reference)

When creating or editing `explore.ipynb` for a step:

- Follow the standard sections:
  - `# Imports`
  - `# Load Data`
    - `## Load Inputs`
    - `## Load Outputs`
  - `# Plot`
  - `# Notes`
- Ensure each heading is in its **own markdown cell** to enable collapsible sections.
- Use the data classes for loading inputs/outputs, not raw paths.
- Display visualization substep figures first under `# Plot`; additional exploratory plots come after.

---

## Usage Summary

When asked to **add a new step**:

1. Decide whether it is an **append** (Workflow A) or **insert with renumbering** (Workflow B).
2. Follow the appropriate workflow carefully, especially:
   - Directory and file naming: `step_{N}_{snake_name}`, `test_step_{N}.py`.
   - Dependency updates and imports.
   - `pipeline_config.yaml` step and substep entries.
   - `explore.ipynb` structure and data class usage.
3. Always finish by running the relevant tests and, if feasible, a pipeline run of the new step.
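The descending rename loop from Workflow B2 can be sketched as a small helper. This is a hypothetical utility (`renumber_steps` is not part of the project) and it only plans/applies directory renames under `code/`, `data/`, and `figures/`; test files, `pipeline_config.yaml` keys, and imports still need the text-search pass described in B2.

```python
from pathlib import Path
from typing import List, Tuple


def renumber_steps(repo_root: str, n_insert_after: int, dry_run: bool = True) -> List[Tuple[str, str]]:
    """Shift every step directory with index > n_insert_after up by one.

    Renames are applied from the highest index downward so they never collide.
    Returns the planned (old, new) path pairs; pass dry_run=False to apply them.
    """
    root = Path(repo_root)
    planned: List[Tuple[str, str]] = []
    for area in ("code", "data", "figures"):
        area_dir = root / area
        if not area_dir.is_dir():
            continue  # e.g. a project checkout with no figures/ yet
        steps = []
        for path in area_dir.glob("step_*"):
            # "step_4_prepare_pairs" -> ("step", "4", "prepare_pairs")
            _, index, name = path.name.split("_", 2)
            steps.append((int(index), name, path))
        # Highest index first: step_5 -> step_6 happens before step_4 -> step_5.
        for index, name, path in sorted(steps, reverse=True):
            if index > n_insert_after:
                target = path.with_name(f"step_{index + 1}_{name}")
                planned.append((str(path), str(target)))
                if not dry_run:
                    path.rename(target)
    return planned
```

Running it with the default `dry_run=True` first gives a reviewable rename plan before anything on disk changes.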