# Research Skill

## Purpose

Read the materialized research source and extract actionable information needed to implement the proposed method. Record the findings as a structured JSON entry.

## When to Use

Phase 2, after the current implementation has been analyzed.

## Input

| Parameter | Type | Description |
|-----------|------|-------------|
| experiment_path | path | `experiments/{research_name}/` |

The research source was materialized into `{experiment_path}` during Phase 0. Its local path is recorded in `{experiment_path}/log.json` under `metadata.research_source`, and a short descriptive label Phase 0 chose is under `metadata.research_source_kind`. The label is free-form (common values: `pdf`, `file`, `git`, `kaggle_notebook`, `kaggle_dataset`, `arxiv`, `huggingface_model`, `html`, `idea`, `other`), but treat it as a hint only: always follow the actual path in `metadata.research_source`.

Inspect that path and read whatever is there:

- A single file (PDF, Markdown, HTML, `.ipynb`, text, …) → read it directly.
- A directory → read the obvious entry points first (`README*`, `*.ipynb`, top-level notebooks or code, `docs/`, dataset descriptions), then skim the rest as needed.
- A text **idea** (`research_source_kind == "idea"`, typically a short `research_source.md`) → read the user's description carefully and turn it into a concrete method plan. Pick a specific algorithm / library that matches the description, define the hyperparameters you will use, and document your interpretation explicitly in the Phase 2 log entry. If the idea is ambiguous, commit to a reasonable default and note the trade-off. Do not invent a citation or claim the idea came from a paper.

Do not try to re-fetch the source. If the content is insufficient, note what is missing in the Phase 2 log entry and proceed with the best analysis you can.

## Actions

1. **Read the materialized research source** at `metadata.research_source` (falling back to `research.pdf` for legacy experiments) and extract:
   - **Method Summary:** 2-3 short paragraphs describing what the paper proposes, what problem it solves, and how it differs from traditional approaches.
   - **Pros:** each advantage the paper claims or demonstrates.
   - **Cons:** stated or inferred limitations, assumptions, or weaknesses.
   - **Implementation Requirements:**
     - Required libraries/packages (with versions if specified)
     - Required data format or preprocessing
     - Required compute resources (GPU, memory, etc.)
     - Key hyperparameters to set
   - **Compatibility Analysis:**
     - Can the method use the same data as the current baseline?
     - Does it need different preprocessing?
     - Does it output comparable predictions (same format)?
     - Can the same metrics be used for comparison?

2. **Append a Phase 2 entry to `{experiment_path}/log.json`** under `phases`:

   ```json
   {
     "name": "Phase 2: Research",
     "completed_at": "2026-04-17T10:30:00Z",
     "paper": {
       "title": "CatBoost: Unbiased Boosting with Categorical Features",
       "authors": ["Prokhorenkova et al."],
       "method_summary": "CatBoost is a gradient-boosting framework that handles categorical features natively via ordered target statistics and uses oblivious decision trees to reduce overfitting."
     },
     "pros": [
       "Native categorical handling: no manual encoding needed",
       "Reduces target leakage with ordered boosting",
       "Strong out-of-the-box performance"
     ],
     "cons": [
       "Training slower than XGBoost for small data",
       "More memory intensive"
     ],
     "requirements": {
       "new_dependencies": ["catboost>=1.2"],
       "data_format": "pandas.DataFrame with categorical columns marked",
       "compute": "CPU is sufficient; GPU optional"
     },
     "compatibility": {
       "same_data": true,
       "same_metrics": true,
       "preprocessing_notes": "CatBoost takes raw categorical columns; do NOT pre-encode them for the new notebook."
     }
   }
   ```

   Do not overwrite earlier entries; append to the `phases` array.
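The two actions above can be sketched in Python as follows. This is a minimal illustration, not part of the skill itself: the helper names `resolve_research_source` and `append_phase_entry` are hypothetical, the sketch assumes `log.json` is a JSON object whose `phases` value is a list, and it assumes the legacy `research.pdf` sits directly inside the experiment directory (the text above does not say where it lives).

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def resolve_research_source(experiment_path: Path) -> Path:
    """Return the source path recorded under metadata.research_source.

    Assumption: when the key is missing (legacy experiments), research.pdf
    is looked up inside the experiment directory.
    """
    log = json.loads((experiment_path / "log.json").read_text(encoding="utf-8"))
    recorded = log.get("metadata", {}).get("research_source")
    if recorded:
        return Path(recorded)  # returned as recorded; no re-fetching
    return experiment_path / "research.pdf"


def append_phase_entry(experiment_path: Path, entry: dict) -> dict:
    """Append one phase entry to log.json without touching earlier entries."""
    log_path = experiment_path / "log.json"
    log = json.loads(log_path.read_text(encoding="utf-8"))

    phases = log.setdefault("phases", [])
    if not isinstance(phases, list):
        raise TypeError("expected 'phases' in log.json to be a list")

    stamped = dict(entry)
    # Fill in a UTC timestamp only if the caller did not supply one.
    stamped.setdefault(
        "completed_at",
        datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    )
    phases.append(stamped)

    log_path.write_text(
        json.dumps(log, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return log
```

A call such as `append_phase_entry(Path("experiments/my_run"), {"name": "Phase 2: Research", ...})` would add the entry at the end of `phases`; the directory name here is only a placeholder.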
## Output

- `{experiment_path}/log.json`: updated with the Phase 2 research entry
- No other files created or modified
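Before finishing, it may help to confirm that the new entry has the same shape as the example entry in the Actions section. The sketch below is one possible check. The function name `missing_phase2_fields` is hypothetical, and treating every field of the example as required is an assumption, since the skill does not state which fields are mandatory.

```python
from __future__ import annotations

# Field names copied from the example Phase 2 entry; whether all of them
# are mandatory is an assumption of this sketch.
TOP_LEVEL_FIELDS = ("name", "completed_at", "pros", "cons")
NESTED_FIELDS = {
    "paper": ("title", "authors", "method_summary"),
    "requirements": ("new_dependencies", "data_format", "compute"),
    "compatibility": ("same_data", "same_metrics", "preprocessing_notes"),
}


def missing_phase2_fields(entry: dict) -> list[str]:
    """Return dotted names of expected fields that the entry lacks."""
    missing = [key for key in TOP_LEVEL_FIELDS if key not in entry]
    for section, keys in NESTED_FIELDS.items():
        block = entry.get(section)
        if not isinstance(block, dict):
            # The whole section is absent or has the wrong type.
            missing.append(section)
            continue
        missing.extend(f"{section}.{key}" for key in keys if key not in block)
    return missing
```

An empty list means the entry matches the example's shape; anything else names the gaps to fill or to mention in the log entry as missing information.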