---
name: deeptools-ngs-analysis
description: "NGS CLI for ChIP/RNA/ATAC-seq. BAM→bigWig with RPGC/CPM/RPKM, sample correlation/PCA, heatmaps/profiles around features, fingerprints. For alignment use STAR/BWA; for peak calling use MACS2."
license: BSD-3-Clause
---

# deepTools — NGS Data Analysis Toolkit

## Overview

deepTools is a command-line toolkit for processing and visualizing high-throughput sequencing data. It converts BAM alignments to normalized coverage tracks (bigWig), performs quality control (correlation, PCA, fingerprint), and generates publication-quality heatmaps and profile plots around genomic features. Supports ChIP-seq, RNA-seq, ATAC-seq, and MNase-seq.

## When to Use

- Converting BAM files to normalized bigWig coverage tracks
- Comparing ChIP-seq treatment vs input control (log2 ratio tracks)
- Assessing sample quality: replicate correlation, PCA, coverage depth
- Evaluating ChIP enrichment strength (fingerprint plots)
- Creating heatmaps and profile plots around TSS, peaks, or other genomic regions
- Analyzing ATAC-seq data with Tn5 offset correction
- Generating strand-specific RNA-seq coverage tracks
- For **read alignment**, use STAR, BWA, or bowtie2 instead
- For **peak calling**, use MACS2 or HOMER instead
- For **BAM/VCF file manipulation**, use pysam instead

## Prerequisites

```bash
pip install deeptools
# Verify installation
bamCoverage --version
```

**Input requirements**: BAM files must be sorted and indexed (`.bai` file present). Generate index with `samtools index input.bam`. BED files for genomic regions (genes, peaks) in standard 3+ column format.

## Quick Start

```bash
# Convert BAM to normalized bigWig
bamCoverage --bam sample.bam --outFileName sample.bw \
    --normalizeUsing RPGC --effectiveGenomeSize 2913022398 \
    --binSize 10 --numberOfProcessors 8

# Create heatmap around TSS
computeMatrix reference-point -S sample.bw -R genes.bed \
    -b 3000 -a 3000 --referencePoint TSS -o matrix.gz
plotHeatmap -m matrix.gz -o heatmap.png --colorMap RdBu
```

## Core API

### 1. BAM to Coverage Conversion

Convert BAM alignments to normalized coverage tracks (bigWig or bedGraph).

```bash
# Basic conversion with RPGC normalization
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing RPGC --effectiveGenomeSize 2913022398 \
    --binSize 10 --numberOfProcessors 8 \
    --extendReads 200 --ignoreDuplicates

# CPM normalization (simpler, no genome size needed)
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing CPM --binSize 10 -p 8

# RNA-seq: strand-specific coverage
bamCoverage --bam rnaseq.bam --outFileName forward.bw \
    --filterRNAstrand forward --normalizeUsing CPM -p 8
# IMPORTANT: Never use --extendReads for RNA-seq (spans splice junctions)
```

### 2. Sample Comparison

Compare treatment vs control or generate ratio tracks.

```bash
# Log2 ratio: treatment / control
bamCompare -b1 treatment.bam -b2 control.bam -o log2ratio.bw \
    --operation log2 --scaleFactorsMethod readCount \
    --extendReads 200 -p 8

# Subtract control from treatment
bamCompare -b1 treatment.bam -b2 control.bam -o subtract.bw \
    --operation subtract --scaleFactorsMethod readCount
```

### 3. Quality Control

Assess sample quality, replicate concordance, and enrichment strength.

```bash
# Sample correlation heatmap
multiBamSummary bins --bamfiles rep1.bam rep2.bam rep3.bam \
    -o counts.npz --binSize 10000 -p 8
plotCorrelation -in counts.npz --corMethod pearson \
    --whatToShow heatmap -o correlation.png
# Good: replicates cluster with r > 0.9

# PCA of samples
plotPCA -in counts.npz -o pca.png --plotTitle "Sample PCA"

# ChIP enrichment fingerprint
plotFingerprint -b input.bam chip.bam -o fingerprint.png \
    --extendReads 200 --ignoreDuplicates
# Good ChIP: steep rise curve; flat diagonal = poor enrichment

# Coverage depth assessment
plotCoverage -b sample.bam -o coverage.png --ignoreDuplicates -p 8

# Fragment size distribution (paired-end)
bamPEFragmentSize -b sample.bam -o fragsize.png
```

### 4. Heatmaps and Profile Plots

Visualize signal around genomic features (TSS, peaks, gene bodies).

```bash
# Reference-point mode: signal around TSS
computeMatrix reference-point -S chip.bw -R genes.bed \
    -b 3000 -a 3000 --referencePoint TSS -o matrix.gz -p 8

# Scale-regions mode: signal across gene bodies
computeMatrix scale-regions -S chip.bw -R genes.bed \
    -b 1000 -a 1000 --regionBodyLength 5000 -o matrix.gz -p 8

# Generate heatmap
plotHeatmap -m matrix.gz -o heatmap.png \
    --colorMap RdBu --kmeans 3 --sortUsing mean

# Generate profile plot
plotProfile -m matrix.gz -o profile.png \
    --plotType lines --colors blue red

# Multiple signal files: compare marks
computeMatrix reference-point -S h3k4me3.bw h3k27me3.bw -R genes.bed \
    -b 3000 -a 3000 --referencePoint TSS -o multi_matrix.gz
plotHeatmap -m multi_matrix.gz -o multi_heatmap.png
```

### 5. Read Filtering and Processing

Filter reads before analysis or correct for assay-specific biases.

```bash
# Filter by mapping quality and fragment size
alignmentSieve --bam input.bam --outFile filtered.bam \
    --minMappingQuality 10 --minFragmentLength 150 \
    --maxFragmentLength 700

# ATAC-seq: apply Tn5 offset correction (+4/-5 bp shift)
alignmentSieve --bam atac.bam --outFile shifted.bam --ATACshift
# Then index: samtools index shifted.bam

# GC bias correction (only if significant bias detected)
computeGCBias -b input.bam --effectiveGenomeSize 2913022398 \
    -g genome.2bit --GCbiasFrequenciesFile gc_freq.txt -p 8
correctGCBias -b input.bam --effectiveGenomeSize 2913022398 \
    --GCbiasFrequenciesFile gc_freq.txt -o corrected.bam
```

### 6. Enrichment Analysis

Quantify signal enrichment at specific regions.

```bash
# Signal enrichment at peak regions
plotEnrichment -b chip.bam input.bam --BED peaks.bed \
    -o enrichment.png --ignoreDuplicates -p 8
```

## Key Concepts

### Normalization Methods

| Method | Formula | When to Use | Requires |
|--------|---------|-------------|----------|
| **RPGC** | 1× genome coverage | ChIP-seq, ATAC-seq | `--effectiveGenomeSize` |
| **CPM** | Counts per million | Any assay, quick comparison | Nothing |
| **RPKM** | Per kb per million | RNA-seq gene-level | Nothing |
| **BPM** | Bins per million | Similar to CPM | Nothing |
| **None** | Raw counts | Not recommended for comparison | Nothing |

**Rule**: Use RPGC for ChIP-seq/ATAC-seq (accounts for genome size). Use CPM for quick comparisons. Use RPKM for RNA-seq gene-level analysis.

### Effective Genome Sizes

| Organism | Assembly | Effective Size |
|----------|----------|---------------|
| Human | GRCh38/hg38 | 2,913,022,398 |
| Mouse | GRCm38/mm10 | 2,652,783,500 |
| Zebrafish | GRCz11 | 1,368,780,147 |
| *Drosophila* | dm6 | 142,573,017 |
| *C. elegans* | ce10/ce11 | 100,286,401 |

### computeMatrix Modes

| Mode | Use When | Key Params |
|------|----------|------------|
| `reference-point` | Signal around a fixed point (TSS, peak summit) | `-b`, `-a`, `--referencePoint` |
| `scale-regions` | Signal across variable-length features (gene bodies) | `-b`, `-a`, `--regionBodyLength` |

## Common Workflows

### Workflow: ChIP-seq QC and Visualization

```bash
#!/bin/bash
# Complete ChIP-seq QC + visualization pipeline
CHIP="chip.bam"
INPUT="input.bam"
GENES="genes.bed"
PEAKS="peaks.bed"
GSIZE=2913022398
THREADS=8

# 1. QC: sample correlation
multiBamSummary bins --bamfiles $INPUT $CHIP -o summary.npz -p $THREADS
plotCorrelation -in summary.npz --corMethod pearson --whatToShow heatmap -o correlation.png

# 2. QC: enrichment fingerprint
plotFingerprint -b $INPUT $CHIP -o fingerprint.png --extendReads 200 --ignoreDuplicates

# 3. Convert to normalized bigWig
bamCoverage --bam $CHIP --outFileName chip.bw --normalizeUsing RPGC \
    --effectiveGenomeSize $GSIZE --extendReads 200 --ignoreDuplicates -p $THREADS

# 4. Log2 ratio track
bamCompare -b1 $CHIP -b2 $INPUT -o log2ratio.bw --operation log2 \
    --scaleFactorsMethod readCount --extendReads 200 -p $THREADS

# 5. Heatmap at TSS
computeMatrix reference-point -S chip.bw log2ratio.bw -R $GENES \
    -b 3000 -a 3000 --referencePoint TSS -o tss_matrix.gz -p $THREADS
plotHeatmap -m tss_matrix.gz -o tss_heatmap.png --colorMap RdBu --kmeans 3

# 6. Profile at peaks
computeMatrix reference-point -S chip.bw -R $PEAKS \
    -b 2000 -a 2000 -o peak_matrix.gz -p $THREADS
plotProfile -m peak_matrix.gz -o peak_profile.png
```

### Workflow: ATAC-seq Analysis

```bash
#!/bin/bash
ATAC="atac.bam"
PEAKS="atac_peaks.bed"
GSIZE=2913022398
THREADS=8

# 1. Apply Tn5 offset correction (+4/-5 bp)
alignmentSieve --bam $ATAC --outFile shifted.bam --ATACshift -p $THREADS
samtools index shifted.bam

# 2. Generate RPGC-normalized coverage
bamCoverage --bam shifted.bam --outFileName atac.bw \
    --normalizeUsing RPGC --effectiveGenomeSize $GSIZE \
    --binSize 5 --extendReads -p $THREADS

# 3. Check nucleosome periodicity (expect 200bp/400bp peaks)
bamPEFragmentSize -b shifted.bam -o fragsize.png \
    --maxFragmentLength 1000 --binSize 1

# 4. Heatmap at ATAC peaks
computeMatrix reference-point -S atac.bw -R $PEAKS \
    -b 2000 -a 2000 -o atac_matrix.gz -p $THREADS
plotHeatmap -m atac_matrix.gz -o atac_heatmap.png --colorMap Blues --kmeans 2
```

## Key Parameters

| Parameter | Tool(s) | Default | Range | Effect |
|-----------|---------|---------|-------|--------|
| `--normalizeUsing` | bamCoverage, bamCompare | None | RPGC, CPM, RPKM, BPM, None | Coverage normalization method |
| `--effectiveGenomeSize` | bamCoverage, bamCompare | — | See table above | Required for RPGC normalization |
| `--binSize` | bamCoverage, multiBamSummary | 50 | 1–10000 | Resolution in bp; smaller = larger files |
| `--extendReads` | bamCoverage, bamCompare | False | integer (bp) | Extend to fragment length (ChIP: YES, RNA: NO) |
| `--ignoreDuplicates` | Most tools | False | True/False | Remove PCR duplicates |
| `--numberOfProcessors` | Most tools | 1 | 1–N cores | Parallel processing |
| `--operation` | bamCompare | log2 | log2, ratio, subtract, add, mean, reciprocal_ratio | Sample comparison operation |
| `--referencePoint` | computeMatrix | TSS | TSS, TES, center | Anchor point for reference-point mode |
| `-b` / `-a` | computeMatrix | 500 | 100–10000 bp | Upstream/downstream distance from reference |
| `--kmeans` | plotHeatmap | None | 1–20 | Number of clusters for heatmap rows |
| `--minMappingQuality` | Most tools | None | 0–60 | Minimum alignment quality filter |

## Best Practices

1. **Always extend reads for ChIP-seq**: Use `--extendReads 200` (or actual fragment length) — ChIP fragments are longer than reads.

2. **Never extend reads for RNA-seq**: `--extendReads` would span splice junctions, creating artifacts.

3. **Anti-pattern — comparing with different normalizations**: Always use the same normalization method across all samples in a comparison.

4. **Use `--region` for parameter testing**: Test on a single chromosome (`--region chr1:1-10000000`) before running on the full genome — saves hours.

5. **Always use `--numberOfProcessors`**: Most tools parallelize well — use all available cores.

6. **Anti-pattern — using RPGC without `--effectiveGenomeSize`**: Will silently produce wrong results. Always specify the correct genome size.

7. **Run QC before analysis**: Check fingerprint and correlation before investing time in heatmaps/profiles. Poor enrichment means downstream visualizations will be noise.

## Common Recipes

### Recipe: Multi-Sample Correlation Matrix

```bash
# Compare 6 samples across the genome
multiBamSummary bins --bamfiles sample{1..6}.bam \
    -o all_samples.npz --binSize 10000 -p 8 \
    --labels S1 S2 S3 S4 S5 S6

# Pearson correlation heatmap
plotCorrelation -in all_samples.npz --corMethod pearson \
    --whatToShow heatmap -o pearson_corr.png --plotNumbers

# Spearman correlation + PCA
plotCorrelation -in all_samples.npz --corMethod spearman \
    --whatToShow heatmap -o spearman_corr.png
plotPCA -in all_samples.npz -o pca.png
```

### Recipe: Gene Body Coverage Profile

```bash
# Scale-regions mode for gene body analysis
computeMatrix scale-regions -S sample.bw -R genes.bed \
    -b 1000 -a 1000 --regionBodyLength 5000 -o gene_body.gz -p 8
plotProfile -m gene_body.gz -o gene_body_profile.png \
    --plotType lines --perGroup
```

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| `BAM index not found` | Missing `.bai` file | Run `samtools index input.bam` |
| Out of memory | Large genome, small bin size | Increase `--binSize`; process with `--region chr1` |
| Very slow processing | Single-threaded execution | Add `-p 8` (or available cores) |
| bigWig files very large | Bin size too small | Increase `--binSize 50` or larger |
| Flat ChIP fingerprint | Poor ChIP enrichment | Biological issue — consider repeating ChIP experiment |
| RNA-seq artifacts at exon boundaries | `--extendReads` used with RNA-seq | Remove `--extendReads` for RNA-seq data |
| ATAC-seq signal offset | Missing Tn5 correction | Apply `alignmentSieve --ATACshift` before analysis |
| Mismatched genome assemblies | BAM and BED use different assemblies | Verify both use same genome build (hg38 vs hg19) |

## Related Skills

- **pysam-genomic-files** — programmatic BAM/VCF manipulation for custom filtering before deepTools
- **matplotlib-scientific-plotting** — customize deepTools output figures beyond built-in options

## References

- [deepTools documentation](https://deeptools.readthedocs.io/) — official user guide and tool reference
- [deepTools Galaxy](https://deeptools.ie-freiburg.mpg.de/) — web-based interface
- Ramirez et al. (2016) "deepTools2: a next generation web server for deep-sequencing data analysis" — [Nucleic Acids Research](https://doi.org/10.1093/nar/gkw257)