---
name: add-step
description: Add or insert a new numbered pipeline step in the p2cs_project, following the standardized Step/DataFile/explore notebook patterns and safely renumbering subsequent steps (code, data, figures, tests, config, dependencies, and imports) when inserting in the middle of the pipeline. Use when creating a new step_* directory, inserting a step between existing ones, or updating step numbers across the project.
---

# Add Step (p2cs_project)

## When to Use

Use this skill when:

- A new pipeline step needs to be added under `code/step_{N}_{name}/`.
- A step should be **inserted between existing steps**, requiring **renumbering** of later steps.
- You need a checklist for creating the step class, substeps, README, `explore.ipynb`, and wiring dependencies/config.

This skill is **project-specific** to `p2cs_project` and assumes the architecture described in the root `README.md`.

---

## Overview: Step Pattern

Each numbered step follows the same high-level pattern:

- Directory: `code/step_{N}_{snake_name}/`
  - `step.py`: main `Step{N}{CamelName}` class inheriting from `base.Step`.
  - `data_classes.py`: data classes inheriting from `DataFile` subclasses (when the step has new artifacts).
  - `__init__.py`: re-exports the main step class (and sometimes helper symbols).
  - `README.md`: step-specific documentation (inputs, outputs, substeps, external tools).
  - `explore.ipynb`: exploration notebook following the **standard notebook structure** from the root `README.md`.
  - Numbered substeps: `1_*.py`, `2_*.py`, etc., each implementing focused logic.
- Outputs in `data/step_{N}_{snake_name}/` and figures in `figures/step_{N}_{snake_name}/` are routed via `base.paths`.
- The step is registered in:
  - `pipeline_config.yaml` under `steps: step_{N}_{snake_name}: ...`
  - Tests in `code/tests/test_step_{N}.py`.
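The pattern above can be sketched as a minimal class skeleton. Everything here is illustrative: the stand-in `Step` base, the `Step9ExampleAnalysis` name, the literal paths, and the method signatures are hypothetical placeholders for the real API in `code/base/step.py` and `code/base/paths.py`.

```python
from typing import Dict, List


class Step:
    """Stand-in for the real base class in code/base/step.py (illustration only)."""

    def run_substeps(self, substeps, step_numbers, descriptions, on_failure="strict"):
        # The real implementation also handles logging, ordering, and failure modes.
        for substep in substeps:
            substep()


class Step9ExampleAnalysis(Step):
    """Hypothetical step class for a code/step_9_example_analysis/ directory."""

    name = "step_9_example_analysis"
    description = "Illustrative step showing the standard class layout."

    @property
    def dependencies(self) -> List[str]:
        # Canonical IDs of upstream steps.
        return ["step_1_get_p2cs_data"]

    def get_input_paths(self) -> Dict[str, str]:
        # In the real project these are built from data classes plus
        # get_step_input_path / get_step_output_path, not literal strings.
        return {"pairs": "data/step_4_prepare_pairs/pairs.pkl"}

    def get_output_paths(self) -> Dict[str, str]:
        return {"results": "data/step_9_example_analysis/results.pkl"}

    def run(self) -> None:
        self.run_substeps(
            substeps=[lambda: None],  # placeholder for numbered substep callables
            step_numbers=[1],
            descriptions=["placeholder substep"],
            on_failure="strict",
        )
```

The key points the sketch encodes are that `dependencies` returns canonical step-ID strings and that `run()` delegates to `run_substeps(...)` rather than doing work inline.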
**Good templates to copy from:**

- Simple single-substep data step: `step_4_prepare_pairs/`
- Multi-substep data/tooling step: `step_2_organism_distance/`
- Modeling/evaluation step: `step_6_train_model/`, `step_7_crosstalk_estimation/`

---

## Workflow A: Append a New Step at the End

1. **Determine the new step index and name**
   - Inspect existing numbered steps under `code/step_*`.
   - Let `N_max` be the largest index (currently `8` for `step_8_generate_paper`).
   - Choose:
     - New index: `N_new = N_max + 1`
     - Snake name: `step_{N_new}_{snake_name}`
     - Class name: `Step{N_new}{CamelName}`

2. **Create the step directory**
   - Create `code/step_{N_new}_{snake_name}/` with at least:
     - `__init__.py` (re-export the main step class).
     - `step.py` (main step implementation).
     - `README.md` (step documentation).
     - `explore.ipynb` (exploration notebook).
     - One or more numbered substeps `1_*.py`, `2_*.py`, etc.
     - `data_classes.py` and/or `config.json` if this step defines new data types or config.
   - **Recommended pattern:** Copy the closest existing step directory (e.g., `step_4_prepare_pairs/`) and rename/trim to match the new step’s responsibilities.

3. **Implement the main step class**
   - In `step.py`:
     - Import `Step` and path helpers from `code/base/step.py` and `code/base/paths.py`.
     - Define a class like `class Step{N_new}{CamelName}(Step):`
     - Implement:
       - `name` and `description` properties (or class attributes).
       - `dependencies` property returning a `List[str]` of upstream steps, using canonical IDs like `"step_1_get_p2cs_data"`.
       - `get_input_paths()` and `get_output_paths()` using data classes and `get_step_input_path` / `get_step_output_path`.
       - `run()` orchestrating any substeps via `self.run_substeps(...)`.

4. **Define data classes (if needed)**
   - In `data_classes.py`:
     - Inherit from appropriate `DataFile` subclasses (e.g., `PickleDataFile`, `CSVDataFile`, `NumpyDataFile`).
     - Define schemas, descriptions, and default loaders/savers as in existing steps.
   - Use these data classes in `get_input_paths()` / `get_output_paths()` and in substeps.

5. **Create numbered substeps**
   - Add scripts `1_*.py`, `2_*.py`, etc. inside the new step directory.
   - Follow existing substep patterns:
     - Each substep is a small class/function using the step’s data classes and `paths` helpers.
     - The main step’s `run()` calls `self.run_substeps(...)` with:
       - Substep objects
       - `step_numbers=[1, 2, ...]`
       - `descriptions=[...]`
       - Appropriate `on_failure` mode (`"strict"` or `"warning"`).

6. **Create the step README**
   - In `README.md`, mirror the structure used in other steps:
     - Short description.
     - Inputs (data classes, upstream steps).
     - Outputs and their data classes.
     - Substeps and what they do.
     - Any external tools / configs required.

7. **Create the `explore.ipynb` notebook**
   - Follow the standard structure from the root `README.md`:
     - `# Imports` (path setup + step/data class imports).
     - `# Load Data`
       - `## Load Inputs` (using `step.get_input_paths()` + data classes).
       - `## Load Outputs`.
     - `# Plot`
       - Display saved figures from visualization substeps first.
       - Put any extra exploratory plots after those.
     - `# Notes` (short list of exploration ideas).
   - Respect the **collapsible headings rule** (heading-only markdown cells).

8. **Wire the step into `pipeline_config.yaml`**
   - Under `steps:`, add a new entry:
     - Key: `step_{N_new}_{snake_name}:`
     - Fields: `enabled`, `description`, `overwrite_outputs`, optional `fast_plots`, and `substeps:`.
   - Add a `substeps:` section keyed by the filenames (without `.py`), matching patterns in other steps.

9. **Add tests**
   - Create `code/tests/test_step_{N_new}.py` by copying a nearby test (e.g., `test_step_4.py`) and adjusting:
     - Imports to the new step and data classes.
     - Test names and assertions to cover the new step’s behavior.

10. **Run tests / pipeline checks**
    - Run `pytest code/tests/test_step_{N_new}.py`.
    - Optionally run the step via `cd code && python run_pipeline.py --step step_{N_new}_{snake_name}`.

---

## Workflow B: Insert a Step in the Middle (with Renumbering)

Use this when inserting a new step **between** existing steps (e.g., between `step_3_embed_proteins` and `step_4_prepare_pairs`).

### B1. Plan the new ordering

1. **Identify current step order**
   - List existing `code/step_*` directories and their indices (including `step_0_draw_theoretical`).

2. **Choose insertion point**
   - Let:
     - `N_insert_after` = index of the step **before** the new one.
     - `N_new = N_insert_after + 1`.
   - All steps with index `> N_insert_after` must be shifted **up by 1**:
     - Old `k` → new `k + 1` for all `k > N_insert_after`.

3. **Decide the new step’s ID**
   - Choose:
     - New directory name: `step_{N_new}_{snake_name}`.
     - New class name: `Step{N_new}{CamelName}`.

### B2. Renumber existing steps (highest → lowest)

Perform renaming **from highest index down to `N_insert_after + 1`** to avoid collisions. For each step index `k` in descending order where `k > N_insert_after`:

1. **Compute new index**
   - `k_new = k + 1`.

2. **Rename step directories**
   - Code: `code/step_{k}_{name}/` → `code/step_{k_new}_{name}/`.
   - Data: `data/step_{k}_{name}/` → `data/step_{k_new}_{name}/` (if it exists).
   - Figures: `figures/step_{k}_{name}/` → `figures/step_{k_new}_{name}/` (if it exists).

3. **Rename tests**
   - `code/tests/test_step_{k}.py` → `code/tests/test_step_{k_new}.py`.

4. **Update configuration keys**
   - In `pipeline_config.yaml`, change `step_{k}_{name}:` → `step_{k_new}_{name}:`.

5. **Update string references and imports**
   - Use text search for `step_{k}_{name}` and `test_step_{k}` across the repo and update to the new IDs:
     - Imports like `from step_{k}_{name}...`.
     - Dependency lists in `dependencies` properties (e.g., `return ["step_{k}_{name}", ...]`).
     - Any key strings that reference `step_{k}_{name}`.

6. **Update doc references**
   - In `README.md` files and notebooks, update any textual references to the old step name or number, if present.

### B3. Add the new step

After all affected steps `k > N_insert_after` have been shifted to `k + 1`:

1. **Create `code/step_{N_new}_{snake_name}/`**
   - Follow **Workflow A, steps 2–7** to:
     - Implement `step.py` and `data_classes.py`.
     - Add numbered substeps.
     - Add `README.md`.
     - Add `explore.ipynb`.

2. **Wire into `pipeline_config.yaml`**
   - Under `steps:` add `step_{N_new}_{snake_name}:` with its configuration and `substeps`.

3. **Update dependencies**
   - For the new step: set `dependencies` to the upstream steps, using the **renumbered IDs**.
   - For downstream steps: review their `dependencies` properties, replace any old IDs that were shifted, and add the new step as a dependency where appropriate.

4. **Add test file**
   - Create `code/tests/test_step_{N_new}.py` following neighboring step tests.

5. **Sanity check references**
   - Run a repo-wide search for any **old** step IDs (`step_{k}_{name}` where `k` was renumbered) and ensure all references are either removed or updated to the new IDs.

### B4. Validate after renumbering

1. **Run targeted tests**
   - Run `cd code && pytest tests/test_step_{N_new}.py`.
   - Plus tests for all renumbered steps: `test_step_{k_new}.py`.

2. **Run a dry pipeline**
   - Optionally run:
     - `cd code && python run_pipeline.py --list-steps` to confirm updated IDs and ordering.
     - `cd code && python run_pipeline.py --step step_{N_new}_{snake_name}` to test the new step in context.

---

## Notebook Guidelines (Quick Reference)

When creating or editing `explore.ipynb` for a step:

- Follow the standard sections:
  - `# Imports`
  - `# Load Data`
    - `## Load Inputs`
    - `## Load Outputs`
  - `# Plot`
  - `# Notes`
- Ensure each heading is in its **own markdown cell** to enable collapsible sections.
- Use the data classes for loading inputs/outputs, not raw paths.
- Display visualization substep figures first under `# Plot`; additional exploratory plots come after.

---

## Usage Summary

When asked to **add a new step**:

1. Decide whether it is an **append** (Workflow A) or **insert with renumbering** (Workflow B).
2. Follow the appropriate workflow carefully, especially:
   - Directory and file naming: `step_{N}_{snake_name}`, `test_step_{N}.py`.
   - Dependency updates and imports.
   - `pipeline_config.yaml` step and substep entries.
   - `explore.ipynb` structure and data class usage.
3. Always finish by running the relevant tests and, if feasible, a pipeline run of the new step.
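The descending rename loop from Workflow B2 can be sketched as a small helper. This is a hypothetical utility (`renumber_steps` is not part of the project) and it only plans/applies directory renames under `code/`, `data/`, and `figures/`; test files, `pipeline_config.yaml` keys, and imports still need the text-search pass described in B2.

```python
from pathlib import Path
from typing import List, Tuple


def renumber_steps(repo_root: str, n_insert_after: int, dry_run: bool = True) -> List[Tuple[str, str]]:
    """Shift every step directory with index > n_insert_after up by one.

    Renames are applied from the highest index downward so they never collide.
    Returns the planned (old, new) path pairs; pass dry_run=False to apply them.
    """
    root = Path(repo_root)
    planned: List[Tuple[str, str]] = []
    for area in ("code", "data", "figures"):
        area_dir = root / area
        if not area_dir.is_dir():
            continue  # e.g. a project checkout with no figures/ yet
        steps = []
        for path in area_dir.glob("step_*"):
            # "step_4_prepare_pairs" -> ("step", "4", "prepare_pairs")
            _, index, name = path.name.split("_", 2)
            steps.append((int(index), name, path))
        # Highest index first: step_5 -> step_6 happens before step_4 -> step_5.
        for index, name, path in sorted(steps, reverse=True):
            if index > n_insert_after:
                target = path.with_name(f"step_{index + 1}_{name}")
                planned.append((str(path), str(target)))
                if not dry_run:
                    path.rename(target)
    return planned
```

Running it with the default `dry_run=True` first gives a reviewable rename plan before anything on disk changes.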