---
name: pipeline-rnaseq
description: "Execute ENCODE RNA-seq pipeline from FASTQ to gene quantification and signal tracks. Child of pipeline-guide. Provides Nextflow execution with Docker and cloud deployment. Use when processing RNA-seq data with STAR alignment, RSEM/Kallisto quantification, or generating expression matrices. Trigger on: RNA-seq pipeline, gene expression, STAR alignment, RSEM quantification, transcript quantification, TPM, FPKM, RNA processing, run RNA-seq."
---

# ENCODE RNA-seq Pipeline

## When to Use

- User wants to run an RNA-seq processing pipeline from FASTQ to gene quantification
- User asks about "RNA-seq pipeline", "STAR alignment", "RSEM", "gene expression quantification", or "Kallisto"
- User needs to process bulk RNA-seq data with ENCODE-standard 2-pass STAR alignment
- Example queries: "process my RNA-seq FASTQs", "quantify gene expression from RNA-seq", "run STAR and RSEM on my data"

Execute the ENCODE RNA-seq processing pipeline from raw FASTQ files through splice-aware
alignment, gene/transcript quantification, and strand-specific signal track generation.
This skill provides a complete Nextflow DSL2 implementation following ENCODE uniform
analysis standards.

## Overview

RNA-seq measures transcriptome-wide gene expression by sequencing cDNA derived from
cellular RNA. The ENCODE pipeline processes RNA-seq data through quality control,
splice-aware alignment with STAR (2-pass mode), gene and transcript quantification
with RSEM, optional fast pseudoalignment with Kallisto, and generation of strand-specific
signal tracks as bigWig files.

Key design decisions: STAR 2-pass mode for maximum splice junction sensitivity, RSEM
for accurate gene/transcript/isoform quantification including multi-mapped reads,
stranded library protocol (dUTP/rf-stranded) as the ENCODE standard, and paired-end
sequencing with a minimum of 30 million uniquely mapped reads per replicate.

## Key Literature

| Reference | Journal | Year | DOI | Relevance |
|-----------|---------|------|-----|-----------|
| Dobin et al. "STAR: ultrafast universal RNA-seq aligner" | Bioinformatics | 2013 | 10.1093/bioinformatics/bts635 | Splice-aware aligner (~12,000 citations) |
| Li & Dewey "RSEM: accurate transcript quantification from RNA-Seq data" | BMC Bioinformatics | 2011 | 10.1186/1471-2105-12-323 | Gene/transcript quantification (~6,000 citations) |
| Bray et al. "Near-optimal probabilistic RNA-seq quantification" | Nature Biotechnology | 2016 | 10.1038/nbt.3519 | Fast pseudoalignment (~4,000 citations) |
| Wang et al. "RSeQC: quality control of RNA-seq experiments" | Bioinformatics | 2012 | 10.1093/bioinformatics/bts356 | RNA-seq QC suite (~3,500 citations) |
| ENCODE Project Consortium "Expanded encyclopaedias" | Nature | 2020 | 10.1038/s41586-020-2493-4 | ENCODE Phase 3 standards |
| Frankish et al. "GENCODE 2021" | Nucleic Acids Research | 2021 | 10.1093/nar/gkaa1087 | Gene annotation reference |

## Pipeline Stages

```
FASTQ ──> FastQC / Trim Galore ──> STAR (2-pass) ──> Genome BAM + Transcriptome BAM
  |                                                        |              |
  |                  ┌─────────────────────────────────────┘              |
  |                  v                                                    v
  |         Signal Track Generation                              RSEM Quantification
  |          (strand-specific bigWig)                        (gene + transcript + isoform)
  |                  |                                                    |
  |                  v                                                    v
  |          Plus strand bigWig                                genes.results (TPM/FPKM)
  |          Minus strand bigWig                               isoforms.results
  |                                                                       |
  |         ┌───────────────────────────────────────────────────────────┘
  |         v
  |   Kallisto (optional fast pseudoalignment)
  |         |
  |         v
  └──> RSeQC + MultiQC ──> Aggregated QC Report
```

### Stage Summary

| Stage | Tool | Input | Output | Reference |
|-------|------|-------|--------|-----------|
| 1. QC & Trimming | FastQC, Trim Galore | Raw FASTQ | Trimmed FASTQ | references/01-qc-trimming.md |
| 2. Alignment | STAR (2-pass) | Trimmed FASTQ | Genome BAM + Transcriptome BAM | references/02-star-alignment.md |
| 3. Quantification | RSEM, Kallisto | Transcriptome BAM / FASTQ | Gene/transcript counts, TPM, FPKM | references/03-quantification.md |
| 4. Signal Tracks | bedGraphToBigWig | STAR bedGraph | Strand-specific bigWig | references/04-signal-tracks.md |
| 5. QC Metrics | RSeQC, MultiQC | BAM, counts | Strandedness, coverage, saturation | references/05-qc-metrics.md |

## Input Requirements

### Required Files
- **RNA-seq FASTQ**: Paired-end reads (ENCODE standard; single-end supported)
- **Reference genome**: STAR-indexed genome with gene annotation (GRCh38 + GENCODE for human)
- **Gene annotation**: GENCODE GTF (v38+ for human, vM27+ for mouse)

### Sample Sheet Format
```csv
sample_id,read1,read2,replicate,strandedness
SAMPLE1_rep1,rna_R1.fq.gz,rna_R2.fq.gz,1,reverse
SAMPLE1_rep2,rna_R1.fq.gz,rna_R2.fq.gz,2,reverse
```

**Strandedness**: ENCODE uses dUTP-based stranded libraries. The resulting reads are
`reverse` stranded (read 2 matches the sense strand). If unknown, the pipeline will
auto-detect strandedness using RSeQC `infer_experiment.py`.

## Library Strandedness

| Protocol | Strandedness | RSEM flag | Kallisto flag | Common Usage |
|----------|-------------|-----------|---------------|-------------|
| dUTP (ENCODE standard) | Reverse | `--strandedness reverse` | `--rf-stranded` | Most ENCODE RNA-seq |
| SMARTer / SMART-Seq2 | Unstranded | `--strandedness none` | (default) | Single-cell, low-input |
| Illumina TruSeq Stranded | Reverse | `--strandedness reverse` | `--rf-stranded` | Standard bulk RNA-seq |
| Directional ligation | Forward | `--strandedness forward` | `--fr-stranded` | Some legacy protocols |

## QC Thresholds

| Metric | Threshold | Category | Source |
|--------|-----------|----------|--------|
| Total sequenced reads | >=30M PE reads | Read depth | ENCODE |
| Uniquely mapped reads | >=70% of total | Alignment | ENCODE |
| Multi-mapped reads | <10% | Alignment | ENCODE |
| rRNA rate | <10% | Sample quality | ENCODE |
| Strandedness agreement | >90% | Library prep | RSeQC |
| Exonic rate | >60% | Mapping quality | RSeQC |
| Gene body coverage | Relatively uniform (5'/3' bias <1.5) | RNA integrity | RSeQC |
| Duplication rate | <60% | Library complexity | Picard |
| Detected genes (TPM>1) | >12,000 (human) | Sensitivity | ENCODE |
| Saturation | Approaching plateau at sequencing depth | Depth sufficiency | RSeQC |

### Read Depth Guidelines

| Application | Minimum Reads (PE) | Recommended | Notes |
|-------------|-------------------|-------------|-------|
| Gene-level expression | 20M | 30M | ENCODE minimum |
| Transcript-level expression | 40M | 60M | Isoform resolution requires more depth |
| Differential expression | 20M per sample | 30M per sample | 3+ biological replicates per condition |
| Novel junction discovery | 60M | 100M+ | STAR 2-pass mode benefits from depth |
| Fusion detection | 50M | 80M+ | Chimeric reads are rare |

## Execution

### Quick Start (Local Docker)
```bash
nextflow run scripts/main.nf \
  -profile local \
  --reads 'fastq/*_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --outdir results/
```

### SLURM HPC
```bash
nextflow run scripts/main.nf \
  -profile slurm \
  --reads 'fastq/*_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --outdir results/
```

### Google Cloud
```bash
nextflow run scripts/main.nf \
  -profile gcp \
  --reads 'gs://bucket/fastq/*_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --outdir 'gs://bucket/results/'
```

### AWS Batch
```bash
nextflow run scripts/main.nf \
  -profile aws \
  --reads 's3://bucket/fastq/*_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --outdir 's3://bucket/results/'
```

## Cloud Cost Estimates

| Platform | Instance | Cost/Sample | Time/Sample | Notes |
|----------|----------|-------------|-------------|-------|
| GCP | n1-highmem-8 | ~$3-6 | 2-4 hours | STAR index loading dominates; preemptible recommended |
| AWS | r5.2xlarge | ~$3-6 | 2-4 hours | r-series for STAR memory; spot recommended |
| Local | 8 cores, 32GB | $0 | 3-6 hours | Docker required; STAR needs 32GB+ RAM |
| SLURM | 8 cores, 32GB | Varies | 2-4 hours | Singularity recommended |

**Memory note**: STAR genome index loading requires ~32GB RAM for human. Machines with
<32GB will fail at the alignment step. Use `--limitGenomeGenerateRAM` to cap index size
for constrained environments, but this reduces mapping sensitivity.

## Output Directory Structure

```
results/
  fastqc/                   # Raw and trimmed QC reports
  trimmed/                  # Trimmed FASTQ files
  star/
    genome_bam/             # Genome-aligned BAM files
    transcriptome_bam/      # Transcriptome BAM files (for RSEM)
    logs/                   # STAR alignment logs (Log.final.out)
    junctions/              # Splice junction files (SJ.out.tab)
  rsem/
    genes/                  # genes.results (gene_id, TPM, FPKM, expected_count)
    isoforms/               # isoforms.results (transcript_id, TPM, FPKM)
  kallisto/                 # Kallisto quantification output (optional)
    abundance.tsv           # transcript TPMs and counts
  signal/
    plus/                   # Plus-strand bigWig signal tracks
    minus/                  # Minus-strand bigWig signal tracks
  qc/
    rseqc/                  # RSeQC output (strandedness, coverage, distribution)
    multiqc/                # Aggregated QC report
  logs/                     # Nextflow execution logs
```

## Common Pitfalls

### 1. Insufficient Memory for STAR
STAR loads the entire genome index into shared memory (~32GB for human). This is the most
common failure mode. Always allocate at least 32GB RAM for the alignment step. On shared
HPC systems, check memory limits per job.

### 2. Wrong Strandedness Setting
Using incorrect strandedness results in near-zero gene counts. If you see uniformly low
counts, check strandedness with `infer_experiment.py` from RSeQC. ENCODE dUTP libraries
are `reverse` stranded.

### 3. Using FPKM for Cross-Sample Comparison
FPKM values are not comparable across samples because they depend on total library
composition. Use TPM (comparable across samples) or raw counts with DESeq2/edgeR
normalization for differential expression.

### 4. Ignoring Multi-Mapped Reads
RSEM uses an expectation-maximization algorithm to probabilistically assign multi-mapped
reads. This is critical for gene families and repetitive elements. Do not pre-filter
multi-mappers before RSEM quantification.

### 5. Skipping rRNA Assessment
High rRNA contamination (>10%) indicates failed rRNA depletion. This reduces effective
sequencing depth and inflates apparent duplication rates. Always check rRNA rate before
downstream analysis.

### 6. Not Using 2-Pass Mode for Novel Junctions
STAR 1-pass mode only uses annotated splice junctions. 2-pass mode first discovers novel
junctions then re-maps, critical for non-model organisms or samples with extensive
alternative splicing.

## Pipeline Scripts

| File | Description | Lines |
|------|-------------|-------|
| `scripts/main.nf` | Nextflow DSL2 pipeline | ~130 |
| `scripts/nextflow.config` | Execution profiles (local/slurm/gcp/aws) | ~60 |
| `scripts/Dockerfile` | Docker build with STAR, RSEM, Kallisto, RSeQC | ~35 |

## ENCODE Data Integration

After running on your own data, compare with ENCODE reference:

```python
# Find matching ENCODE RNA-seq experiments
encode_search_experiments(
    assay_title="total RNA-seq",
    organ="pancreas",
    biosample_type="tissue"
)

# Download ENCODE gene quantifications for comparison
encode_batch_download(
    download_dir="/data/encode_reference/",
    output_type="gene quantifications",
    assay_title="total RNA-seq",
    organ="pancreas",
    assembly="GRCh38"
)

# Download ENCODE signal tracks for browser visualization
encode_search_files(
    file_format="bigWig",
    assay_title="total RNA-seq",
    organ="pancreas",
    output_type="signal of unique reads"
)
```

## Pitfalls & Edge Cases

- **Strandedness must match library prep**: STAR requires correct `--outSAMstrandField` and RSEM needs `--strandedness`. Using wrong strand settings can halve gene counts or assign reads to antisense genes.
- **rRNA contamination**: rRNA >10% wastes sequencing depth. Ribosomal depletion libraries should have <5%. Poly-A selection libraries should have <1%. Check with FastQC or Picard CollectRnaSeqMetrics.
- **STAR 2-pass mode is required**: The first pass discovers novel splice junctions; the second pass uses them. Single-pass STAR misses tissue-specific or rare splicing events, reducing sensitivity for differential exon usage.
- **Gene-level vs transcript-level quantification**: RSEM provides transcript-level estimates but gene-level aggregation is more robust for differential expression. Transcript-level analysis requires many more replicates (≥6).
- **TPM normalization is not for cross-sample comparison**: TPM normalizes within a sample but is NOT appropriate for comparing expression across conditions. Use DESeq2 size factors or TMM normalization for differential expression.
- **Batch effects in multi-lab data**: RNA-seq is highly sensitive to library prep method, sequencer, and lab. Always check for batch effects with PCA before combining datasets from different sources.

## Walkthrough: Processing ENCODE RNA-seq from FASTQ to Gene Quantification

**Goal**: Process raw RNA-seq FASTQ files through the ENCODE pipeline to generate gene expression quantifications (TPM/FPKM) and signal tracks.
**Context**: The ENCODE RNA-seq pipeline uses STAR 2-pass alignment and RSEM quantification, producing both gene-level and transcript-level expression estimates.

### Step 1: Find RNA-seq experiment

```
encode_get_experiment(accession="ENCSR000CPR")
```

Expected output:
```json
{
  "accession": "ENCSR000CPR",
  "assay_title": "RNA-seq",
  "biosample_summary": "K562",
  "replicates": 2,
  "status": "released"
}
```

### Step 2: List FASTQ files

```
encode_list_files(accession="ENCSR000CPR", file_format="fastq")
```

Expected output:
```json
{
  "files": [
    {"accession": "ENCFF200RN1", "output_type": "reads", "paired_end": "1", "biological_replicates": [1], "file_size_mb": 3200},
    {"accession": "ENCFF201RN2", "output_type": "reads", "paired_end": "2", "biological_replicates": [1], "file_size_mb": 3300}
  ]
}
```

### Step 3: Run the RNA-seq pipeline

```bash
nextflow run pipeline-rnaseq/main.nf \
  --fastq_r1 ENCFF200RN1.fastq.gz \
  --fastq_r2 ENCFF201RN2.fastq.gz \
  --genome GRCh38 \
  --annotation gencode.v36.annotation.gtf \
  --strandedness reverse \
  -profile docker
```

Key pipeline steps:
1. Quality trimming (fastp)
2. STAR 2-pass alignment (splice-aware)
3. RSEM gene quantification (TPM, FPKM, expected counts)
4. Kallisto transcript quantification (optional)
5. Signal track generation (bigWig)
6. QC metrics (RSeQC, MultiQC)

### Step 4: Validate output quality

| Metric | Threshold | Purpose |
|---|---|---|
| Mapping rate | >= 70% | Alignment success |
| rRNA rate | < 10% | rRNA depletion efficiency |
| Replicate correlation | >= 0.9 | Biological consistency |
| Exonic rate | > 60% | Expected for mRNA-seq |

### Step 5: Use expression data with ENCODE epigenomic data

Compare gene expression with enhancer marks:
```
encode_search_experiments(assay_title="Histone ChIP-seq", biosample_term_name="K562", target="H3K27ac", organism="Homo sapiens")
```

**Interpretation**: Genes with high TPM AND nearby H3K27ac peaks have validated enhancer-gene connections. Low expression despite nearby enhancer marks suggests poised or tissue-specific regulation.

### Integration with downstream skills
- Gene quantifications feed into -> **peak-annotation** for expression-validated peak targets
- Expression data connects to -> **gtex-expression** for tissue comparison
- Processed data feeds into -> **compare-biosamples** for differential expression analysis
- Pipeline provenance logged by -> **data-provenance**

## Code Examples

### 1. Find RNA-seq experiments for a tissue

```
encode_search_experiments(assay_title="total RNA-seq", organ="liver", organism="Homo sapiens")
```

Expected output:
```json
{
  "total": 35,
  "results": [
    {"accession": "ENCSR300RNA", "assay_title": "RNA-seq", "biosample_summary": "liver", "status": "released"}
  ]
}
```

### 2. Check for existing gene quantifications

```
encode_list_files(accession="ENCSR300RNA", file_format="tsv", output_type="gene quantifications", assembly="GRCh38")
```

Expected output:
```json
{
  "files": [
    {"accession": "ENCFF400GEQ", "output_type": "gene quantifications", "file_format": "tsv", "file_size_mb": 5.2}
  ]
}
```

### 3. Download expression data

```
encode_download_files(accessions=["ENCFF400GEQ"], download_dir="/data/rnaseq/quantification")
```

Expected output:
```json
{
  "downloaded": 1,
  "md5_verified": true,
  "files": ["/data/rnaseq/quantification/ENCFF400GEQ.tsv"]
}
```

## Integration

| This skill produces... | Feed into... | Purpose |
|---|---|---|
| Gene expression (TPM/FPKM) | **peak-annotation** | Validate enhancer targets with expression data |
| Expression matrix | **gtex-expression** | Compare cell-line vs. tissue expression |
| Differential expression results | **compare-biosamples** | Identify tissue-specific gene regulation |
| Signal tracks (bigWig) | **visualization-workflow** | Display expression signal in genome browser |
| Expression quantifications | **disease-research** | Connect gene expression to disease phenotypes |
| Pipeline run parameters | **data-provenance** | Record STAR/RSEM versions and settings |
| QC metrics | **quality-assessment** | Validate against ENCODE RNA-seq standards |

## Related Skills

- **pipeline-guide** (parent): General pipeline selection and resource assessment
- **quality-assessment**: Deep-dive QC analysis beyond basic metrics
- **integrative-analysis**: Combine RNA-seq with ChIP-seq/ATAC-seq for regulatory inference
- **compare-biosamples**: Compare expression profiles across cell types
- **single-cell-encode**: For scRNA-seq data processing (different pipeline)
- **pipeline-chipseq**: Sibling pipeline for ChIP-seq data
- **pipeline-atacseq**: Sibling pipeline for ATAC-seq data
- **publication-trust**: Verify literature claims backing analytical decisions

## Presenting Results

When reporting RNA-seq pipeline results:

- **Mapping rate**: Report STAR uniquely mapped rate (>70% expected), multi-mapped rate (<10%), and unmapped rate from Log.final.out
- **Gene/transcript counts**: Report number of detected genes (TPM>1, expect >12,000 for human) and detected transcripts
- **rRNA rate**: Report rRNA contamination percentage (<10% expected). Flag if rRNA depletion appears to have failed
- **Quantification paths**: Provide paths to RSEM genes.results (TPM, FPKM, expected_count) and Kallisto abundance.tsv files
- **Strandedness**: Confirm the detected strandedness from RSeQC infer_experiment.py matches the expected library protocol
- **Key QC metrics**: Present exonic rate (>60%), gene body coverage uniformity, duplication rate, and saturation curve status in a summary table
- **Signal tracks**: Provide paths to strand-specific bigWig files (plus/ and minus/)
- **Next steps**: Suggest `integrative-analysis` to combine RNA-seq with ChIP-seq/ATAC-seq for regulatory inference, or `compare-biosamples` for cross-tissue expression comparison

## For the request: "$ARGUMENTS"