---
name: pipeline-cutandrun
description: "Execute CUT&RUN processing pipeline from FASTQ to peaks and signal tracks. Child of pipeline-guide. Provides Nextflow execution with Docker and cloud deployment. Use when processing CUT&RUN or CUT&Tag data, an alternative to ChIP-seq with lower background. Trigger on: CUT&RUN pipeline, CUT&Tag, SEACR, Henikoff, targeted chromatin, pA-MNase, process CUT&RUN."
---

# ENCODE CUT&RUN Pipeline: FASTQ to Peaks and Signal Tracks

## When to Use

- User wants to run a CUT&RUN or CUT&Tag processing pipeline from FASTQ to peaks
- User asks about "CUT&RUN pipeline", "CUT&Tag", "SEACR", "spike-in normalization", or "targeted chromatin"
- User needs to process CUT&RUN/CUT&Tag data with spike-in calibration and SEACR peak calling
- Example queries: "process my CUT&RUN FASTQs", "run SEACR on CUT&Tag data", "normalize CUT&RUN with spike-in controls"

Execute the CUT&RUN/CUT&Tag processing pipeline for targeted chromatin profiling,
producing peak calls with SEACR and spike-in normalized signal tracks.

## Pipeline Overview

```
FASTQ -> Trim -> Bowtie2 align (genome) -> Filter/dedup -> SEACR peaks
                     |                          |              |
              Bowtie2 align (spike-in)   Spike-in normalize  Signal tracks
                     |
              Scale factor calculation
```

### ENCODE Repository

- **GitHub**: `ENCODE-DCC/cutandrun-pipeline`
- **Container**: `encodedcc/cutandrun-pipeline`
- **This skill**: Nextflow DSL2 reimplementation for portability

## Core Tools and Versions

| Tool | Version | Purpose | Citation |
|------|---------|---------|----------|
| Bowtie2 | 2.5.3 | Alignment (genome + spike-in) | Langmead & Salzberg 2012 |
| SEACR | 1.3 | Peak calling (CUT&RUN-specific) | Meers et al. 2019 |
| MACS2 | 2.2.9.1 | Alternative peak caller | Zhang et al. 2008 |
| Picard | 3.1.1 | Duplicate marking | Broad Institute |
| samtools | 1.19 | BAM operations | Li et al. 2009 |
| bedtools | 2.31.0 | Genomic arithmetic | Quinlan & Hall 2010 |
| deepTools | 3.5.4 | Signal track generation | Ramirez et al. 2016 |
| FastQC | 0.12.1 | Read quality | Andrews (Babraham) |
| MultiQC | 1.21 | Aggregated QC | Ewels et al. 2016 |

## Key Literature

1. **Skene & Henikoff 2017** - "An efficient targeted nuclease strategy for
   high-resolution mapping of DNA binding sites" (eLife, ~1,500 citations)
   DOI: 10.7554/eLife.21856

2. **Meers et al. 2019** - "Peak calling by Sparse Enrichment Analysis for
   CUT&RUN chromatin profiling" (Epigenetics & Chromatin, ~800 citations)
   DOI: 10.1186/s13072-019-0287-4

3. **Kaya-Okur et al. 2019** - "CUT&Tag for efficient epigenomic profiling
   of small samples and single cells" (Nature Communications, ~1,200 citations)
   DOI: 10.1038/s41467-019-09982-5

4. **Nordin et al. 2023** - "The CUT&RUN suspect list of problematic regions"
   (Genome Biology)
   DOI: 10.1186/s13059-023-02960-3

5. **Amemiya et al. 2019** - "The ENCODE Blacklist" (Scientific Reports, ~1,372 citations)
   DOI: 10.1038/s41598-019-45839-z

## Execution

### Quick Start (Local)

```bash
nextflow run main.nf \
    -profile local \
    --reads '/data/fastq/*_R{1,2}.fastq.gz' \
    --bowtie2_index '/ref/bowtie2_index/genome' \
    --spikein_index '/ref/bowtie2_ecoli/ecoli' \
    --chrom_sizes '/ref/hg38.chrom.sizes' \
    --blacklist '/ref/hg38-blacklist.v2.bed' \
    --outdir results/ \
    -resume
```

### SLURM HPC

```bash
nextflow run main.nf \
    -profile slurm \
    --reads '/data/fastq/*_R{1,2}.fastq.gz' \
    --bowtie2_index '/ref/bowtie2_index/genome' \
    --spikein_index '/ref/bowtie2_ecoli/ecoli' \
    --chrom_sizes '/ref/hg38.chrom.sizes' \
    --blacklist '/ref/hg38-blacklist.v2.bed' \
    --outdir results/ \
    -resume
```

### Cloud (GCP / AWS)

```bash
nextflow run main.nf \
    -profile gcp \
    --reads 'gs://bucket/fastq/*_R{1,2}.fastq.gz' \
    --bowtie2_index 'gs://bucket/ref/bowtie2_index/genome' \
    --spikein_index 'gs://bucket/ref/bowtie2_ecoli/ecoli' \
    --chrom_sizes 'gs://bucket/ref/hg38.chrom.sizes' \
    --blacklist 'gs://bucket/ref/hg38-blacklist.v2.bed' \
    --outdir 'gs://bucket/results/' \
    -resume
```

## Resource Requirements

| Step | CPUs | RAM | Time (per sample) |
|------|------|-----|-------------------|
| Bowtie2 align (genome) | 8 | 8 GB | 30-60 min |
| Bowtie2 align (spike-in) | 4 | 4 GB | 10-20 min |
| Filter/dedup | 4 | 8 GB | 15-30 min |
| SEACR peaks | 2 | 4 GB | 10-20 min |
| Signal tracks | 4 | 8 GB | 15-30 min |
| **Total** | **8** | **8 GB** | **1.5-3 hours** |

## Pipeline Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--reads` | required | Glob pattern to paired FASTQ files |
| `--bowtie2_index` | required | Bowtie2 genome index prefix |
| `--spikein_index` | required | Bowtie2 E. coli spike-in index prefix |
| `--chrom_sizes` | required | Chromosome sizes file |
| `--blacklist` | required | ENCODE blacklist BED file |
| `--outdir` | `./results` | Output directory |
| `--seacr_mode` | `stringent` | SEACR mode: `stringent` or `relaxed` |
| `--seacr_norm` | `norm` | SEACR normalization: `norm` or `non` |
| `--control` | `null` | IgG control BAM (if available) |
| `--peak_caller` | `seacr` | Peak caller: `seacr` or `macs2` or `both` |
| `--skip_spikein` | `false` | Skip spike-in normalization |

## Output Files

```
results/
  fastqc/                           # Raw read quality
  alignment/
    {sample}.filtered.bam           # Filtered, deduplicated BAM
    {sample}.filtered.bam.bai
  spikein/
    {sample}.spikein_counts.txt     # Spike-in read counts
    {sample}.scale_factor.txt       # Computed scale factor
  peaks/
    {sample}.seacr.stringent.bed    # SEACR stringent peaks
    {sample}.seacr.relaxed.bed      # SEACR relaxed peaks
    {sample}.macs2_peaks.narrowPeak # MACS2 peaks (if requested)
  signal/
    {sample}.normalized.bw          # Spike-in normalized signal
    {sample}.fragments.bed          # Fragment BED file
  qc/
    {sample}.flagstat.txt
    {sample}.fragment_sizes.txt
    {sample}.frip.txt
  multiqc/
    multiqc_report.html
```

## QC Thresholds

| Metric | Pass | Warning | Fail |
|--------|------|---------|------|
| Mapping rate (genome) | >80% | 60-80% | <60% |
| Spike-in reads | 1-10% of total | 0.1-1% or 10-30% | <0.1% or >30% |
| Duplication rate | <20% | 20-40% | >40% |
| FRiP (peaks) | >10% | 5-10% | <5% |
| Peak count | >5,000 | 1,000-5,000 | <1,000 |
| Fragment size | Nucleosomal pattern | Irregular | No pattern |

### Fragment Size Distribution

CUT&RUN produces a characteristic nucleosomal ladder:
- **<120 bp**: Sub-nucleosomal (TF binding)
- **~150 bp**: Mononucleosomal (histone marks)
- **~300 bp**: Dinucleosomal
- Absence of nucleosomal pattern suggests protocol issues

## Spike-in Normalization

Spike-in normalization is CRITICAL for CUT&RUN quantitative comparison.

### How It Works

1. E. coli DNA is carried over from pA-MNase/pA-Tn5 production
2. Each sample has a different amount of spike-in reads
3. Samples with more target cleavage have fewer spike-in reads (proportionally)
4. Scale factor = 1 / (spike-in reads / minimum spike-in reads across samples)

### Scale Factor Calculation

```
Sample A: 200,000 spike-in reads -> scale = 1.0 (minimum)
Sample B: 400,000 spike-in reads -> scale = 0.5
Sample C: 100,000 spike-in reads -> scale = 2.0
```

Higher spike-in counts = less target enrichment = lower scale factor.

## SEACR vs MACS2

| Feature | SEACR | MACS2 |
|---------|-------|-------|
| Designed for | CUT&RUN/CUT&Tag | ChIP-seq |
| Background model | Sparse enrichment | Dynamic Poisson |
| Control required | Optional (IgG) | Recommended |
| Low background | Handles well | May overcall |
| Stringent mode | Very conservative | Via q-value |
| ENCODE recommendation | Primary for CUT&RUN | Alternative |

SEACR is specifically designed for the sparse, low-background signal
profile of CUT&RUN data. MACS2 may overcall peaks due to the low background.

## Critical Pitfalls

### Spike-in Calibration is CRITICAL
Without spike-in normalization, quantitative comparisons between samples are
unreliable. The amount of pA-MNase (or pA-Tn5) varies between experiments,
and spike-in reads provide the internal calibration standard.

### IgG Control vs No-Antibody Control
- **IgG control**: Non-specific antibody, captures background binding
- **No-antibody**: No antibody, captures MNase accessibility background
- IgG is preferred but not always available
- SEACR can work without control (uses top 1% of signal as threshold)

### SEACR Stringent vs Relaxed Mode
- **Stringent**: Returns only the most enriched peaks (fewer, higher confidence)
- **Relaxed**: Returns a broader set including weaker peaks
- For initial analysis, use stringent mode
- For comprehensive catalogs, use relaxed mode with downstream filtering

### CUT&RUN Suspect List (Nordin 2023)
In addition to the ENCODE blacklist, filter CUT&RUN peaks against the
suspect list (Nordin et al. 2023), which identifies regions with
artifactual signal specific to CUT&RUN/CUT&Tag protocols:

```bash
# Download suspect list
wget https://github.com/Boyle-Lab/Blacklist/raw/master/lists/CUTandRUN.suspectlist.hg38.bed.gz

# Filter peaks
bedtools intersect \
    -a peaks.bed \
    -b hg38-blacklist.v2.bed CUTandRUN.suspectlist.hg38.bed \
    -v \
    > peaks_filtered.bed
```

### CUT&RUN vs CUT&Tag
Both protocols are supported by this pipeline. Differences:
- **CUT&RUN**: Uses pA-MNase, E. coli spike-in from MNase production
- **CUT&Tag**: Uses pA-Tn5, E. coli spike-in from Tn5 production
- CUT&Tag has higher background from Tn5 insertion preference
- CUT&Tag may work better for histone marks; CUT&RUN for TFs

## Provenance Integration

After pipeline completion, log all outputs:

```python
encode_log_derived_file(
    file_path="/results/peaks/sample1.seacr.stringent.bed",
    source_accessions=["ENCSR...", "ENCFF..."],
    description="CUT&RUN peaks from ENCODE CUT&RUN pipeline",
    file_type="CUT&RUN_peaks",
    tool_used="Bowtie2 2.5.3 + SEACR 1.3",
    parameters="stringent mode, spike-in normalized, blacklist + suspect list filtered"
)
```

## Reference Files

Detailed step-by-step documentation is provided in the `references/` directory:

1. `01-qc-trimming.md` -- Read QC and adapter trimming for CUT&RUN
2. `02-bowtie2-alignment.md` -- Bowtie2 alignment to genome and spike-in
3. `03-filtering-spikein.md` -- Filtering, dedup, and spike-in normalization
4. `04-seacr-peaks.md` -- SEACR peak calling and MACS2 alternative
5. `05-qc-metrics.md` -- Fragment sizes, FRiP, spike-in QC

## Walkthrough: Processing ENCODE CUT&RUN from FASTQ to Peaks

**Goal**: Process CUT&RUN/CUT&Tag FASTQ files through the ENCODE-compatible pipeline to generate peak calls with spike-in normalization.
**Context**: CUT&RUN uses targeted MNase digestion (lower background than ChIP-seq) but requires different peak calling (SEACR instead of MACS2) and spike-in normalization for quantitative comparisons.

### Step 1: Find CUT&RUN experiment

```
encode_search_experiments(assay_title="CUT&RUN", organism="Homo sapiens")
```

Expected output:
```json
{
  "total": 35,
  "results": [
    {"accession": "ENCSR900CUR", "assay_title": "CUT&RUN", "target": "H3K27me3", "biosample_summary": "K562", "status": "released"}
  ]
}
```

### Step 2: List FASTQ files

```
encode_list_files(accession="ENCSR900CUR", file_format="fastq")
```

Expected output:
```json
{
  "files": [
    {"accession": "ENCFF900CR1", "output_type": "reads", "paired_end": "1", "file_size_mb": 800},
    {"accession": "ENCFF901CR2", "output_type": "reads", "paired_end": "2", "file_size_mb": 850}
  ]
}
```

**Interpretation**: CUT&RUN yields smaller files than ChIP-seq (~800MB vs ~2.5GB) due to lower background.

### Step 3: Run the CUT&RUN pipeline

```bash
nextflow run pipeline-cutandrun/main.nf \
  --fastq_r1 ENCFF900CR1.fastq.gz \
  --fastq_r2 ENCFF901CR2.fastq.gz \
  --genome GRCh38 \
  --spike_in_genome dm6 \
  --target H3K27me3 \
  --peak_caller seacr \
  -profile docker
```

Key pipeline steps:
1. Adapter trimming (Trim Galore)
2. Bowtie2 alignment (very-sensitive-local)
3. Spike-in alignment (E. coli or Drosophila)
4. Spike-in normalization (scale factor)
5. SEACR peak calling (stringent mode)
6. Signal track generation with spike-in scaling

### Step 4: Validate output quality

| Metric | Threshold | Purpose |
|---|---|---|
| Spike-in alignment | 0.5-5% of reads | Normalization calibration |
| Fragment size | < 150bp majority | CUT&RUN characteristic |
| FRiP (SEACR) | >= 5% | Higher than ChIP-seq due to lower background |
| Duplicate rate | < 20% | Library complexity |

**Key difference from ChIP-seq**: CUT&RUN has inherently lower background, so peak callers like MACS2 overfit. Use SEACR (Meers et al. 2019) instead.

### Step 5: Compare with ChIP-seq for the same target

```
encode_search_experiments(assay_title="Histone ChIP-seq", biosample_term_name="K562", target="H3K27me3", organism="Homo sapiens")
```

**Interpretation**: CUT&RUN typically identifies fewer but higher-confidence peaks than ChIP-seq. Concordant peaks between both methods are the highest confidence.

### Integration with downstream skills
- SEACR peaks feed into -> **histone-aggregation** for cross-experiment comparison
- Spike-in normalized signals feed into -> **visualization-workflow**
- Peak regions feed into -> **regulatory-elements** for chromatin state classification
- QC uses different thresholds than ChIP-seq -> **quality-assessment** (see suspect list)
- Pipeline provenance logged by -> **data-provenance**

## Code Examples

### 1. Survey CUT&RUN/CUT&Tag availability

```
encode_get_facets(assay_title="CUT&RUN", facet_field="target.label", organism="Homo sapiens")
```

Expected output:
```json
{
  "facets": {
    "target.label": {"H3K27me3": 15, "H3K4me3": 12, "H3K27ac": 8, "CTCF": 5}
  }
}
```

### 2. Find matching ChIP-seq for comparison

```
encode_search_experiments(assay_title="Histone ChIP-seq", biosample_term_name="K562", target="H3K27me3", organism="Homo sapiens")
```

Expected output:
```json
{
  "total": 5,
  "results": [
    {"accession": "ENCSR000CHI", "assay_title": "Histone ChIP-seq", "target": "H3K27me3", "biosample_summary": "K562"}
  ]
}
```

### 3. Track CUT&RUN experiments

```
encode_track_experiment(accession="ENCSR900CUR", notes="K562 H3K27me3 CUT&RUN - SEACR peaks for comparison with ChIP-seq")
```

Expected output:
```json
{
  "status": "tracked",
  "accession": "ENCSR900CUR",
  "notes": "K562 H3K27me3 CUT&RUN - SEACR peaks for comparison with ChIP-seq"
}
```

## Integration

| This skill produces... | Feed into... | Purpose |
|---|---|---|
| SEACR peaks | **histone-aggregation** | Cross-experiment comparison (note: different caller than ChIP-seq) |
| Spike-in normalized signal | **visualization-workflow** | Quantitatively comparable browser tracks |
| Peak regions | **regulatory-elements** | Chromatin state classification |
| CUT&RUN-specific QC | **quality-assessment** | Validate with CUT&RUN-appropriate thresholds |
| Peak coordinates | **motif-analysis** | TF motif discovery at CUT&RUN peaks |
| Pipeline parameters | **data-provenance** | Record SEACR/spike-in normalization details |
| Peak files | **variant-annotation** | Identify variants in CUT&RUN peaks |
| Comparison with ChIP-seq | **compare-biosamples** | Cross-assay concordance analysis |

## Related Skills

- `pipeline-guide` -- Parent skill with compute resource assessment and cloud setup
- `histone-aggregation` -- Aggregate histone mark data across samples
- `quality-assessment` -- Evaluate pipeline output quality metrics
- `data-provenance` -- Track all pipeline inputs, outputs, and parameters
- `download-encode` -- Download ENCODE CUT&RUN FASTQ files for pipeline input
- `publication-trust` -- Verify literature claims backing analytical decisions

## Presenting Results

When reporting CUT&RUN pipeline results:

- **SEACR peak counts**: Report peak counts for both stringent and relaxed modes. If MACS2 was also run, include those counts for comparison
- **Spike-in normalization factor**: Report the computed scale factor per sample and the spike-in read fraction (ideal 1-10% of total reads). Explain that higher spike-in counts indicate less target enrichment
- **FRiP**: Report the fraction of reads in peaks (>10% pass, 5-10% warning, <5% fail). Note that CUT&RUN FRiP thresholds differ from ChIP-seq
- **Signal track paths**: Provide paths to spike-in normalized bigWig files for genome browser visualization
- **Fragment size distribution**: Confirm the expected nucleosomal ladder pattern and note the dominant fragment class (sub-nucleosomal for TFs, mononucleosomal for histone marks)
- **Key QC metrics**: Present mapping rate (>80%), duplication rate (<20%), and spike-in calibration status in a summary table
- **Suspect list filtering**: Note whether peaks were filtered against both the ENCODE blacklist and the CUT&RUN suspect list (Nordin 2023)
- **Next steps**: Suggest `peak-annotation` for gene association of peaks, or `visualization-workflow` for genome browser session generation

## For the request: "$ARGUMENTS"