---
name: batch-cohort
description: Generate N analysis scripts from a single methodology template × multiple exposure/outcome combinations. The "80-person team" pattern (same validated method, swap variables only). Produces batch R/Python code + summary matrix.
triggers: batch cohort, batch analysis, 대량 분석, 변수 교체, variable swap, mass production, 80명 팀, batch generate, 일괄 코드 생성, exposure outcome matrix, combinatorial analysis
tools: Read, Write, Edit, Bash, Grep, Glob
model: opus
---

# Batch Cohort Analysis Skill

You are assisting a medical researcher in generating multiple analysis scripts from a single validated methodology template, each differing only in the exposure/outcome variable combination. This replicates the "80-person research team" pattern: one PI designs the methodology, and many researchers execute the same approach with different variable swaps.

## When to Use

- Researcher has a **validated analysis template** (e.g., from /replicate-study or /cross-national)
- Wants to explore **multiple exposure → outcome combinations** on the same database
- Goal: systematic variable-swap code generation + batch execution + result matrix

## Inputs

1. **Database path(s)**: CSV/SAS data files (KNHANES, NHANES, NHIS, or any cleaned cohort)
2. **Methodology template**: One of:
   - Path to a validated R/Python analysis script (from /replicate-study or /cross-national)
   - A paper type template name: `nhis_cohort`, `cross_national`, `survey_weighted`
   - A source paper to extract methodology from (falls back to /replicate-study Phase 1)
3.
   **Combination spec**: A list of exposure/outcome pairs, provided as:
   - Inline list: `exposures: [depression, obesity, smoking]; outcomes: [diabetes, hypertension, CVD]`
   - CSV file with columns: `exposure`, `outcome`, (optional) `subgroup_vars`
   - `"all"` keyword: generates all pairwise combinations from the lists

### Optional Inputs

- **Covariate set**: Fixed covariate list for all analyses (default: use template's set)
- **Subgroup variables**: Variables to stratify by (default: sex, age group)
- **Output format**: `code_only` (just scripts) | `execute` (run + collect results) | `full` (code + results + summary)
- **Cross-national mode**: If enabled (`cross_national: true`), generates paired scripts for both countries per combination

## Workflow

### Phase 1: Template Validation

1. Read the methodology template (R script or paper type reference).
2. Identify the **slot variables**, the parts that change per combination:
   - `EXPOSURE_VAR`: raw variable name in the database
   - `EXPOSURE_LABEL`: human-readable label for tables/figures
   - `EXPOSURE_CODING`: how to derive binary/categorical exposure
   - `OUTCOME_VAR`: raw variable name
   - `OUTCOME_LABEL`: human-readable label
   - `OUTCOME_CODING`: how to derive binary outcome
3. Verify the template runs successfully on at least one combination before batch generation.
4. Output: template summary with identified slots → user approval.

### Phase 2: Variable Specification

For each exposure and outcome in the combination spec:

1. **Look up** the variable in the database:
   - KNHANES: check variable name exists in the CSV header
   - NHANES: check which table contains the variable (use codebook.csv if available)
   - NHIS: check claims code or variable name
2. **Define coding**:
   - Binary: threshold or category mapping (e.g., `HE_glu >= 126 → diabetes = 1`)
   - Categorical: level definitions (e.g., `smoking: current/former/never`)
3.
   **Check covariate overlap**: If the exposure IS one of the standard covariates, remove it from the adjustment set for that analysis (no self-adjustment).
4. Output: **combination matrix** with all variable specifications.

```
| # | Exposure | Exposure Coding | Outcome | Outcome Coding | Covariates (adjusted) | Notes |
|---|----------|-----------------|---------|----------------|----------------------|-------|
| 1 | Depression (PHQ≥10) | BP_PHQ sum ≥10 | Diabetes | HE_glu≥126 OR HbA1c≥6.5 OR DE1_dg=1 | age,sex,edu,income,smoking,alcohol,obesity,CVD | none |
| 2 | Obesity (BMI≥25) | HE_obe ≥4 | Diabetes | same | age,sex,edu,income,smoking,alcohol,depression,CVD | obesity removed from covariates |
| ... | | | | | | |
```

### Phase 3: Batch Code Generation

For each combination in the matrix:

1. **Clone** the template script.
2. **Replace** slot variables with the combination-specific values.
3. **Adjust covariates**: Remove exposure variable from covariate list if present.
4. **Set output paths**: Each combination gets its own results subdirectory.
5. **Generate a master runner script** (`run_all.R` or `run_all.sh`) that:
   - Executes all N scripts sequentially (or in parallel via `future`/`parallel`)
   - Captures errors per script without stopping the batch
   - Logs execution time per analysis

### Phase 4: Batch Execution (if `execute` or `full` mode)

1. Run the master script.
2. Collect results from each combination's output directory.
3.
   Handle failures gracefully:
   - Log which combinations failed and why
   - Common failures: convergence issues, too few events, empty subgroups
   - Suggest fixes for failed combinations

### Phase 5: Summary Matrix

Aggregate all results into a single summary:

**Main Results Matrix** (`summary_matrix.csv`):

| Exposure | Outcome | N | Events | Model 1 OR (95% CI) | Model 2 OR (95% CI) | Model 3 OR (95% CI) | p-value | Significant |
|----------|---------|---|--------|---------------------|---------------------|---------------------|---------|-------------|
| Depression | Diabetes | 5,811 | 487 | 2.14 (1.52–3.01) | 1.89 (1.33–2.69) | 1.36 (0.91–2.05) | 0.137 | No |
| Obesity | Diabetes | 5,811 | 487 | 3.45 (2.71–4.39) | 3.38 (2.65–4.32) | 3.12 (2.42–4.02) | <0.001 | Yes |
| ... | | | | | | | | |

**Subgroup Summary** (`subgroup_matrix.csv`): Same format, stratified by subgroup variables.

**Heatmap** (optional): Visual matrix of effect sizes × significance, exposure on Y-axis, outcome on X-axis.

## Output Files

```
{working_dir}/batch_{timestamp}/
├── README.md                      # Batch run summary (N combinations, template used, date)
├── combination_matrix.csv         # All exposure/outcome specs with coding
├── template/
│   └── base_template.R            # The validated template (frozen copy)
├── scripts/
│   ├── 01_depression_diabetes.R
│   ├── 02_obesity_diabetes.R
│   ├── ...
│   └── run_all.R                  # Master execution script
├── results/
│   ├── 01_depression_diabetes/
│   │   ├── table1.csv
│   │   ├── main_results.csv
│   │   └── subgroup_results.csv
│   ├── 02_obesity_diabetes/
│   │   └── ...
│   └── ...
├── summary/
│   ├── summary_matrix.csv         # Main results across all combinations
│   ├── subgroup_matrix.csv        # Subgroup results across all combinations
│   ├── failed_runs.csv            # Combinations that failed + error messages
│   └── heatmap.png                # Optional effect size × significance visual
└── logs/
    └── batch_execution.log        # Timing + error log
```

## Critical Rules

1.
   **Never modify the core methodology** across combinations; only swap exposure/outcome/covariates.
2. **Remove self-adjustment**: If exposure = BMI, remove obesity from covariates. If exposure = education/income, remove the same variable from covariates. If outcome = MetS, consider removing obesity from covariates. Document all removals.
3. **Weighted analysis mandatory** for KNHANES/NHANES/NHIS, inherited from the template.
4. **Event count check**: Before running, verify each outcome has ≥10 events per covariate (EPV rule). Flag underpowered combinations.
5. **Multiple comparisons**: When generating >5 combinations, include a Bonferroni-corrected significance column in the summary matrix. Add a note about exploratory vs confirmatory framing.
6. **Reproducibility**: Freeze the template version. Include a SHA256 hash of the data file in README.
7. **No p-hacking framing**: The summary matrix is for **hypothesis generation**, not confirmation. State this explicitly in README and any manuscript output.
8. **Outcome definitions MUST include physician diagnosis**: Diabetes = FPG≥126 OR HbA1c≥6.5 OR physician-diagnosed (KNHANES: DE1_dg=1, NHANES: DIQ010="Yes"). Hypertension = SBP≥140 OR DBP≥90 OR physician-diagnosed (KNHANES: DI1_dg=1, NHANES: BPQ020="Yes"). Lab-only definitions systematically overestimate exposure→outcome associations (validated: the Joo 2026 replication showed a US depression→DM weighted OR of 1.92 without vs 1.54 with physician diagnosis).
9. **Full covariate set is default**: Always use 8 covariates (age, sex, education, income, smoking, alcohol, obesity, CVD) unless explicitly justified. Minimal models (age+sex+BMI only) overestimate effects due to residual confounding.
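Rules 2, 4, and 5 lend themselves to a mechanical pre-check before any script is generated. The following is a minimal Python sketch, not the skill's actual implementation: the `Combo` class, the `ALIASES` mapping, the function names, and the short covariate labels are all hypothetical, and each covariate is counted as a single parameter for the EPV ratio, which is a simplification when covariates are categorical.

```python
from dataclasses import dataclass

# Default adjustment set from rule 9. These are illustrative labels,
# not verified database column names.
FULL_COVARIATES = ["age", "sex", "education", "income",
                   "smoking", "alcohol", "obesity", "cvd"]

# Hypothetical alias map: an exposure named "bmi" overlaps the "obesity" covariate.
ALIASES = {"bmi": "obesity"}


@dataclass
class Combo:
    exposure: str
    outcome: str
    events: int  # outcome event count; must come from executed code output


def adjusted_covariates(exposure, covariates=FULL_COVARIATES):
    """Rule 2: drop the exposure (or its alias) from its own adjustment set."""
    key = ALIASES.get(exposure.lower(), exposure.lower())
    return [c for c in covariates if c != key]


def epv(events, n_covariates):
    """Rule 4: events per variable, counting each covariate as one parameter."""
    return events / n_covariates if n_covariates else float("inf")


def bonferroni_alpha(n_tests, alpha=0.05):
    """Rule 5: per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def precheck(combos, min_epv=10.0):
    """Return one row per combination with removals, EPV flag, and threshold."""
    # The skill only asks for the corrected column when there are more than 5 combinations.
    alpha = bonferroni_alpha(len(combos)) if len(combos) > 5 else None
    rows = []
    for c in combos:
        covs = adjusted_covariates(c.exposure)
        ratio = epv(c.events, len(covs))
        rows.append({
            "exposure": c.exposure,
            "outcome": c.outcome,
            "covariates": covs,
            "removed": [x for x in FULL_COVARIATES if x not in covs],
            "epv": ratio,
            "underpowered": ratio < min_epv,
            "bonferroni_alpha": alpha,
        })
    return rows


if __name__ == "__main__":
    # Placeholder event counts for illustration only, not real data.
    for row in precheck([Combo("obesity", "diabetes", 50),
                         Combo("depression", "diabetes", 200)]):
        print(row)
```

Keeping the removals in a `removed` field makes it straightforward to satisfy the "document all removals" requirement when writing `combination_matrix.csv`.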
## Cross-National Batch Mode

When `cross_national: true`:

- Generate paired scripts for each combination (Korea + US)
- Summary matrix includes both countries side-by-side
- Direction agreement column: ✓ if both countries show same direction of effect
- Uses /cross-national skill's dual-survey-design approach

## Integration with Upstream Skills

| Need | Skill |
|------|-------|
| Variable coding lookup | `analyze-stats` survey_weighted guide |
| Template creation from paper | `/replicate-study` Phase 1–3 |
| Cross-national paired analysis | `/cross-national` |
| ICD-10 claims algorithms | `analyze-stats` nhis_icd10_mapping guide |
| Write manuscript from results | `/write-paper` (nhis_cohort or cross_national type) |
| Figure generation | `/make-figures` (forest plot of all combinations) |

## Example Invocations

### Basic: Single DB, Multiple Exposures × Single Outcome

```
/batch-cohort
DB: /path/to/knhanes/HN18.csv
Template: /path/to/validated_analysis.R
Exposures: [depression, obesity, smoking, heavy_drinking, low_income, low_education]
Outcome: diabetes
Mode: full
```

### Cross-National: Full Matrix

```
/batch-cohort
DB Korea: /path/to/knhanes/HN18.csv
DB US: /path/to/nhanes/
Template: cross_national
Exposures: [depression, obesity, smoking]
Outcomes: [diabetes, hypertension, metabolic_syndrome]
cross_national: true
Mode: execute
```

### NHIS Cohort: Claims-Based Batch

```
/batch-cohort
DB: /path/to/nhis_sample_cohort.csv
Template: nhis_cohort
Exposures: [atrial_fibrillation, heart_failure, COPD, CKD]
Outcomes: [all_cause_mortality, cardiovascular_death, stroke]
Mode: code_only
```

## Anti-Hallucination

- **Never fabricate variable names, dataset column names, or variable codings.** If a variable mapping is uncertain, output `[VERIFY: variable_name]` and ask the user to confirm against the data dictionary.
- **Never fabricate statistical results**: no invented p-values, effect sizes, confidence intervals, or sample sizes.
  All numbers must come from executed code output.
- **Never generate references from memory.** Use `/search-lit` for all citations.
- If a function, package, or API does not exist or you are unsure, say so explicitly rather than guessing.
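
The variable-name rule above can be backed by a simple header check before any script is generated. This is a sketch under stated assumptions: the helper names `read_header` and `verify_variables` are hypothetical, and the in-memory header is a stand-in for a real data file (which would be opened with `open(path, newline="")`).

```python
import csv
import io


def read_header(csv_file):
    """Return the column names from the first row of an open CSV file object."""
    return next(csv.reader(csv_file), [])


def verify_variables(requested, header):
    """Map each requested name to itself if it is in the header,
    otherwise to a [VERIFY: ...] marker for the user to confirm."""
    available = set(header)
    return {
        name: name if name in available else f"[VERIFY: {name}]"
        for name in requested
    }


if __name__ == "__main__":
    # Stand-in header for illustration; a real run would read the actual data file.
    sample = io.StringIO("HE_glu,HE_obe,DE1_dg\n")
    header = read_header(sample)
    print(verify_variables(["HE_glu", "unknown_var"], header))
```

Matching is exact and case-sensitive here, so a name that differs only in case is flagged rather than silently accepted.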