---
name: tooluniverse-image-analysis
description: Production-ready microscopy image analysis and quantitative imaging data skill for colony morphometry, cell counting, fluorescence quantification, and statistical analysis of imaging-derived measurements. Processes ImageJ/CellProfiler output (area, circularity, intensity, cell counts), performs Dunnett's test, Cohen's d effect size, power analysis, Shapiro-Wilk normality tests, two-way ANOVA, polynomial regression, natural spline regression with confidence intervals, and comparative morphometry. Supports CSV/TSV measurement tables, multi-channel fluorescence data, colony swarming assays, and neuron counting datasets. Use when analyzing microscopy measurement data, colony area/circularity, cell count statistics, swarming assays, co-culture ratio optimization, or answering questions about imaging-derived quantitative data.
---

# Microscopy Image Analysis and Quantitative Imaging Data

Production-ready skill for analyzing microscopy-derived measurement data using pandas, numpy, scipy, statsmodels, and scikit-image. Designed for BixBench imaging questions covering colony morphometry, cell counting, fluorescence quantification, regression modeling, and statistical comparisons.

**IMPORTANT**: This skill handles complex multi-workflow analysis. Most implementation details have been moved to `references/` for progressive disclosure. This document focuses on high-level decision-making and workflow orchestration.

---

## When to Use This Skill

Apply when users:
- Have microscopy measurement data (area, circularity, intensity, cell counts) in CSV/TSV
- Ask about colony morphometry (bacterial swarming, biofilm, growth assays)
- Need statistical comparisons of imaging measurements (t-test, ANOVA, Dunnett's, Mann-Whitney)
- Ask about cell counting statistics (NeuN, DAPI, marker counts)
- Need effect size calculations (Cohen's d) and power analysis
- Want regression models (polynomial, spline) fitted to dose-response or ratio data
- Ask about model comparison (R-squared, F-statistic, AIC/BIC)
- Need Shapiro-Wilk normality testing on imaging data
- Want confidence intervals for peak predictions from fitted models
- Questions mention imaging software output (ImageJ, CellProfiler, QuPath)
- Need fluorescence intensity quantification or colocalization analysis
- Ask about image segmentation results (counts, areas, shapes)

**BixBench Coverage**: 21 questions across 4 projects (bix-18, bix-19, bix-41, bix-54)

**NOT for** (use other skills instead):
- Phylogenetic analysis → Use `tooluniverse-phylogenetics`
- RNA-seq differential expression → Use `tooluniverse-rnaseq-deseq2`
- Single-cell scRNA-seq → Use `tooluniverse-single-cell`
- Statistical regression only (no imaging context) → Use `tooluniverse-statistical-modeling`

---

## Core Principles

1. **Data-first approach** - Load and inspect all CSV/TSV measurement data before analysis
2. **Question-driven** - Parse the exact statistic, comparison, or model requested
3. **Statistical rigor** - Proper effect sizes, multiple comparison corrections, model selection
4. **Imaging-aware** - Understand ImageJ/CellProfiler measurement columns (Area, Circularity, Round, Intensity)
5. **Workflow flexibility** - Support both pre-quantified data (CSV) and raw image processing
6. **Precision** - Match expected answer format (integer, range, decimal places)
7. **Reproducible** - Use standard Python/scipy equivalents to R functions

---

## Required Python Packages

```python
# Core (MUST be installed)
import pandas as pd
import numpy as np
from scipy import stats
from scipy.interpolate import BSpline, make_interp_spline
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.power import TTestIndPower
from patsy import dmatrix, bs, cr

# Optional (for raw image processing)
import skimage
import cv2
import tifffile
```

**Installation**:
```bash
pip install pandas numpy scipy statsmodels patsy scikit-image opencv-python-headless tifffile
```

---

## High-Level Workflow Decision Tree

```
START: User question about microscopy data
│
├─ Q1: What type of data is available?
│  │
│  ├─ PRE-QUANTIFIED DATA (CSV/TSV with measurements)
│  │  └─ Workflow: Load → Parse question → Statistical analysis
│  │     Pattern: Most common BixBench pattern (bix-18, bix-19, bix-41, bix-54)
│  │     See: Section "Quantitative Data Analysis" below
│  │
│  └─ RAW IMAGES (TIFF, PNG, multi-channel)
│     └─ Workflow: Load → Segment → Measure → Analyze
│        See: references/image_processing.md
│
├─ Q2: What type of analysis is needed?
│  │
│  ├─ STATISTICAL COMPARISON
│  │  ├─ Two groups → t-test or Mann-Whitney
│  │  ├─ Multiple groups → ANOVA or Dunnett's test
│  │  ├─ Two factors → Two-way ANOVA
│  │  └─ Effect size → Cohen's d, power analysis
│  │  See: references/statistical_analysis.md
│  │
│  ├─ REGRESSION MODELING
│  │  ├─ Dose-response → Polynomial (quadratic, cubic)
│  │  ├─ Ratio optimization → Natural spline
│  │  └─ Model comparison → R-squared, F-statistic, AIC/BIC
│  │  See: references/statistical_analysis.md
│  │
│  ├─ CELL COUNTING
│  │  ├─ Fluorescence (DAPI, NeuN) → Threshold + watershed
│  │  ├─ Brightfield → Adaptive threshold
│  │  └─ High-density → CellPose or StarDist (external)
│  │  See: references/cell_counting.md
│  │
│  ├─ COLONY SEGMENTATION
│  │  ├─ Swarming assays → Otsu threshold + morphology
│  │  ├─ Biofilms → Li threshold + fill holes
│  │  └─ Growth assays → Time-lapse tracking
│  │  See: references/segmentation.md
│  │
│  └─ FLUORESCENCE QUANTIFICATION
│     ├─ Intensity measurement → regionprops
│     ├─ Colocalization → Pearson/Manders
│     └─ Multi-channel → Channel-wise quantification
│     See: references/fluorescence_analysis.md
│
└─ Q3: When to use scikit-image vs OpenCV?
   ├─ scikit-image: Scientific analysis, measurements, regionprops
   ├─ OpenCV: Fast processing, real-time, large batches
   └─ Both: Often interchangeable for basic operations
   See: references/image_processing.md "Library Selection Guide"
```

---

## Quantitative Data Analysis Workflow

### Phase 0: Question Parsing and Data Discovery

**CRITICAL FIRST STEP**: Before writing ANY code, identify what data files are available and what the question is asking for.

```python
import os, glob, pandas as pd

# Discover data files
data_dir = "."
csv_files = glob.glob(os.path.join(data_dir, '**', '*.csv'), recursive=True)
tsv_files = glob.glob(os.path.join(data_dir, '**', '*.tsv'), recursive=True)
img_files = glob.glob(os.path.join(data_dir, '**', '*.tif*'), recursive=True)

# Load and inspect first measurement file
if csv_files:
    df = pd.read_csv(csv_files[0])
    print(f"Shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")
    print(df.head())
    print(df.describe())
```

**Common Column Names**:
- Area: Colony or cell area in pixels or calibrated units
- Circularity: 4*pi*area/perimeter^2, range [0,1], 1.0 = perfect circle
- Round: Roundness = 4*area/(pi*major_axis^2)
- Genotype/Strain: Biological grouping variable
- Ratio: Co-culture mixing ratio (e.g., "1:3", "5:1")
- NeuN/DAPI/GFP: Cell marker counts or intensities

### Phase 1: Grouped Statistics

```python
def grouped_summary(df, group_cols, measure_col):
    """Calculate summary statistics by group."""
    summary = df.groupby(group_cols)[measure_col].agg(
        Mean='mean',
        SD='std',
        Median='median',
        Min='min',
        Max='max',
        N='count'
    ).reset_index()
    summary['SEM'] = summary['SD'] / np.sqrt(summary['N'])
    return summary

# Example: Colony morphometry by genotype
area_summary = grouped_summary(df, 'Genotype', 'Area')
circ_summary = grouped_summary(df, 'Genotype', 'Circularity')
```

For detailed statistical functions, see: **references/statistical_analysis.md**

### Phase 2: Statistical Testing

**Decision guide**:
- Normality test needed? → Shapiro-Wilk
- Two groups comparison? → t-test or Mann-Whitney
- Multiple groups vs control? → Dunnett's test
- Multiple groups, all comparisons? → Tukey HSD
- Two factors? → Two-way ANOVA
- Effect size? → Cohen's d
- Sample size planning? → Power analysis

See: **references/statistical_analysis.md** for complete implementations

### Phase 3: Regression Modeling

**When to use each model**:
- Polynomial (quadratic/cubic): Smooth dose-response, clear peak
- Natural spline: Flexible, non-parametric, handles complex patterns
- Linear: Simple relationships, checking for trends

Model comparison metrics:
- R-squared: Overall fit (higher = better)
- Adjusted R-squared: Penalizes complexity
- F-statistic p-value: Model significance
- AIC/BIC: Compare non-nested models

See: **references/statistical_analysis.md** for complete implementations

---

## Raw Image Processing Workflow

### When Processing Raw Images

**Workflow**: Load → Preprocess → Segment → Measure → Export

```python
# Quick start for cell counting
from scripts.segment_cells import count_cells_in_image

result = count_cells_in_image(
    image_path="cells.tif",
    channel=0,  # DAPI channel
    min_area=50
)
print(f"Found {result['count']} cells")
```

### Segmentation Method Selection

**Decision guide**:

| Cell Type | Density | Best Method | Notes |
|-----------|---------|-------------|-------|
| **Nuclei (DAPI)** | Low-Medium | Otsu + watershed | Standard approach |
| **Nuclei (DAPI)** | High | CellPose/StarDist | Handles touching |
| **Colonies** | Well-separated | Otsu threshold | Fast, reliable |
| **Colonies** | Touching | Watershed | Edge detection |
| **Cells (phase)** | Any | Adaptive threshold | Handles uneven illumination |
| **Fluorescence** | Low signal | Li threshold | More sensitive |

See: **references/segmentation.md** and **references/cell_counting.md** for detailed protocols

### Library Selection: scikit-image vs OpenCV

**Use scikit-image when**:
- Scientific measurements needed (area, perimeter, intensity)
- regionprops for object properties
- Publication-quality analysis
- Easier syntax for scientists

**Use OpenCV when**:
- Processing large image batches
- Speed is critical
- Real-time processing
- Advanced computer vision features

**Both work for**:
- Thresholding, filtering, morphological operations
- Basic image transformations
- Most segmentation tasks

See: **references/image_processing.md** "Library Selection Guide"

---

## Common BixBench Patterns

### Pattern 1: Colony Morphometry (bix-18)

**Question type**: "Mean circularity of genotype with largest area?"

**Data**: CSV with Genotype, Area, Circularity columns

**Workflow**:
1. Load CSV → group by Genotype
2. Calculate mean Area per genotype
3. Identify genotype with max mean Area
4. Report mean Circularity for that genotype

See: **references/segmentation.md** "Colony Morphometry Analysis"

### Pattern 2: Cell Counting Statistics (bix-19)

**Question type**: "Cohen's d for NeuN counts between conditions?"

**Data**: CSV with Condition, NeuN_count, Sex, Hemisphere columns

**Workflow**:
1. Load CSV → filter by hemisphere/sex if needed
2. Split by Condition (KD vs CTRL)
3. Calculate Cohen's d with pooled SD
4. Report effect size

See: **references/statistical_analysis.md** "Effect Size Calculations"

### Pattern 3: Multi-Group Comparison (bix-41)

**Question type**: "Dunnett's test: How many ratios equivalent to control?"

**Data**: CSV with multiple co-culture ratios, Area, Circularity

**Workflow**:
1. Create Strain_Ratio labels
2. Run Dunnett's test for Area (vs control)
3. Run Dunnett's test for Circularity (vs control)
4. Count groups NOT significant in BOTH tests

See: **references/statistical_analysis.md** "Dunnett's Test"

### Pattern 4: Regression Optimization (bix-54)

**Question type**: "Peak frequency from natural spline model?"

**Data**: CSV with co-culture frequencies and Area measurements

**Workflow**:
1. Convert ratio strings to frequencies
2. Fit natural spline model (df=4)
3. Find peak via grid search
4. Report peak frequency + confidence interval

See: **references/statistical_analysis.md** "Regression Modeling"

---

## Quick Reference Table

| Task | Primary Tool | Reference |
|------|-------------|-----------|
| **Load measurement CSV** | pandas.read_csv() | This file |
| **Group statistics** | df.groupby().agg() | This file |
| **T-test** | scipy.stats.ttest_ind() | statistical_analysis.md |
| **ANOVA** | statsmodels.ols + anova_lm() | statistical_analysis.md |
| **Dunnett's test** | scipy.stats.dunnett() | statistical_analysis.md |
| **Cohen's d** | Custom function (pooled SD) | statistical_analysis.md |
| **Power analysis** | statsmodels TTestIndPower | statistical_analysis.md |
| **Polynomial regression** | statsmodels.OLS + poly features | statistical_analysis.md |
| **Natural spline** | patsy.cr() + statsmodels.OLS | statistical_analysis.md |
| **Cell segmentation** | skimage.filters + watershed | cell_counting.md |
| **Colony segmentation** | skimage.filters.threshold_otsu | segmentation.md |
| **Fluorescence quantification** | skimage.measure.regionprops | fluorescence_analysis.md |
| **Colocalization** | Pearson/Manders | fluorescence_analysis.md |
| **Image loading** | tifffile, skimage.io | image_processing.md |
| **Batch processing** | scripts/batch_process.py | scripts/ |

---

## Example Scripts

Ready-to-use scripts in `scripts/` directory:

1. **segment_cells.py** - Cell/nuclei counting with watershed
2. **measure_fluorescence.py** - Multi-channel intensity quantification
3. **batch_process.py** - Process folders of images
4. **colony_morphometry.py** - Measure colony area/circularity
5. **statistical_comparison.py** - Group comparison statistics

Usage:
```bash
# Count cells in image
python scripts/segment_cells.py cells.tif --channel 0 --min-area 50

# Batch process folder
python scripts/batch_process.py input_folder/ output.csv --analysis cell_count
```

---

## Detailed Reference Guides

For complete implementations and protocols:

1. **references/statistical_analysis.md** - All statistical tests, regression models
2. **references/cell_counting.md** - Cell/nuclei counting protocols
3. **references/segmentation.md** - Colony and object segmentation
4. **references/fluorescence_analysis.md** - Intensity quantification, colocalization
5. **references/image_processing.md** - Image loading, preprocessing, library selection
6. **references/troubleshooting.md** - Common issues and solutions

---

## Important Notes

### Matching R Statistical Functions

Some BixBench questions use R for analysis. Python equivalents:

- **R's Dunnett test** (`multcomp::glht`) → `scipy.stats.dunnett()` (scipy ≥ 1.10)
- **R's natural spline** (`ns(x, df=4)`) → `patsy.cr(x, knots=...)` with explicit quantile knots
- **R's t-test** (`t.test()`) → `scipy.stats.ttest_ind()`
- **R's ANOVA** (`aov()`) → `statsmodels.formula.api.ols()` + `sm.stats.anova_lm()`

See: **references/statistical_analysis.md** for exact parameter matching

### Answer Formatting

BixBench expects specific formats:
- "to the nearest thousand": `int(round(val, -3))`
- Percentages: Usually integer or 1-2 decimal places
- Cohen's d: 3 decimal places
- Sample sizes: Always integer (ceiling)
- Ratios: String format "5:1"

---

## Completeness Checklist

Before returning your answer, verify:

- [ ] Loaded all data files and inspected column names
- [ ] Identified the specific statistic or model requested
- [ ] Used correct grouping variables and filter conditions
- [ ] Applied correct rounding or format
- [ ] For "how many" questions: counted correctly based on criteria
- [ ] For statistical tests: used appropriate multiple comparison correction
- [ ] For regression: properly prepared and transformed data
- [ ] Double-checked direction of comparisons
- [ ] Verified answer falls within expected range

---

## Getting Help

- Start with decision tree at top of this file
- Check relevant reference guide for detailed protocol
- Use example scripts as templates
- See troubleshooting guide for common issues
- All statistical implementations in statistical_analysis.md