---
name: data-analysis
description: Generate statistical analysis code with 4-round review. Select appropriate statistical tests, interpret results, and produce analysis reports with p-values, effect sizes, and confidence intervals. Use when analyzing experimental data for a paper.
argument-hint: [data-source]
---

# Data Analysis

Generate rigorous statistical analysis code with multi-round review.

## Input

- `$0` — Data source (CSV, JSON, pickle, or experiment logs)
- `$1` — Research goal or hypothesis to test

## References

- 4-round code review prompts: `~/.claude/skills/data-analysis/references/review-prompts.md`

## Scripts

### Statistical summary and comparison
```bash
python ~/.claude/skills/data-analysis/scripts/stat_summary.py --input results.csv --compare method --metric accuracy --output summary.json
python ~/.claude/skills/data-analysis/scripts/stat_summary.py --input results.csv --describe
```

Detects data types, recommends tests, runs comparisons, outputs effect sizes and significance stars. Requires numpy, scipy.

### Format p-values
```bash
python ~/.claude/skills/data-analysis/scripts/format_pvalue.py --values "0.001 0.05 0.23" --format stars
python ~/.claude/skills/data-analysis/scripts/format_pvalue.py --csv results.csv --column pvalue --format latex
```

Formats p-values with stars, LaTeX notation, or plain text. Stdlib-only.

## Workflow

### Step 1: Generate Analysis Code
Structure the code with these sections:
1. `# IMPORT` — pandas, numpy, scipy, statsmodels, sklearn
2. `# LOAD DATA` — Load from original data files
3. `# DATASET PREPARATIONS` — Missing values, units, exclusion criteria
4. `# DESCRIPTIVE STATISTICS` — Summary tables if needed
5. `# PREPROCESSING` — Dummy variables, normalization
6. `# ANALYSIS` — Statistical tests per hypothesis
7. `# SAVE ADDITIONAL RESULTS` — Extra results to pickle

### Step 2: 4-Round Code Review
1. **Round 1 — Code Flaws**: Mathematical/statistical errors, wrong calculations, trivial tests
2. **Round 2 — Data Handling**: Missing values, units, preprocessing, test choice
3. **Round 3 — Per-Table**: Sensible values, measures of uncertainty, missing data
4. **Round 4 — Cross-Table**: Completeness, consistency, missing variables

### Step 3: Produce Results
- Every nominal value must have uncertainty (CI, STD, or p-value)
- Statistical tests must be appropriate for the data type
- Results must match actual data — never hallucinate

## Allowed Packages

`pandas`, `numpy`, `scipy`, `statsmodels`, `sklearn`, `pickle`

## Statistical Test Selection

| Data Type | Test |
|-----------|------|
| Two groups, normal | Independent t-test |
| Two groups, non-normal | Mann-Whitney U |
| Paired samples | Paired t-test / Wilcoxon |
| Multiple groups | ANOVA / Kruskal-Wallis |
| Categorical | Chi-square / Fisher's exact |
| Correlation | Pearson / Spearman |
| Regression | OLS / Logistic / Mixed effects |

## Rules

- Always report p-values for statistical tests
- Account for relevant confounding variables
- Use inherent package functionality (e.g., `formula = "y ~ a * b"` for interactions)
- Do not manually implement available statistical functions
- Access dataframes using string-based column names, not integer indices

## Related Skills
- Upstream: [experiment-code](../experiment-code/), [experiment-design](../experiment-design/)
- Downstream: [table-generation](../table-generation/), [figure-generation](../figure-generation/), [backward-traceability](../backward-traceability/)
- See also: [math-reasoning](../math-reasoning/)