--- name: "snakemake-workflow-engine" description: "Python-based workflow manager for reproducible, scalable pipelines. Define rules with file-based dependencies; Snakemake resolves execution order and parallelism. Runs local, SLURM, LSF, AWS, GCP via profiles; per-rule conda/Singularity envs. For NGS pipelines, ML training, and multi-step file processing. Use Nextflow for Groovy dataflow or nf-core integration." license: "MIT" --- # Snakemake — Python Workflow Engine ## Overview Snakemake is a Python-based workflow management system that scales analyses from laptop to HPC and cloud. Workflows are defined as rules with explicit input/output file dependencies; Snakemake resolves the execution order automatically and runs independent steps in parallel. Rules can call shell commands, Python/R/Julia scripts, or inline Python. Per-rule conda or Singularity environments make workflows fully reproducible. Widely used in bioinformatics for NGS, genome assembly, and variant-calling pipelines. ## When to Use - Building reproducible multi-step bioinformatics pipelines (align → sort → call variants → annotate) - Scaling the same workflow from local development to SLURM cluster without code changes - Processing multiple samples identically using wildcard-based rules - Managing dependencies automatically — only rerun steps whose inputs changed - Deploying per-rule conda or Singularity environments for tool isolation - Generating visual DAGs and dry-run previews before committing computational resources - Use `Nextflow` instead when you need Groovy DSL + dataflow channels, or when leveraging the nf-core community pipeline library - For simple shell loops, use bash scripts; Snakemake is worth the overhead only for 3+ sequential steps with branching - Use `Prefect` or `Airflow` instead for data engineering workflows with dynamic task graphs or time-based scheduling ## Prerequisites - **Python packages**: `snakemake`, `graphviz` (for DAG visualization) - **Environment**: Python 3.11+; conda/mamba recommended for per-rule environments - **Data requirements**: Input files, reference files; output paths defined as rules > **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v snakemake` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run snakemake` rather than bare `snakemake`. ```bash # Install via conda (includes optional dependencies) conda install -c conda-forge -c bioconda snakemake # Minimal pip install pip install snakemake # Verify snakemake --version # 8.x.x ``` ## Quick Start ```python # Snakefile — minimal 2-rule pipeline SAMPLES = ["sampleA", "sampleB"] rule all: # Target rule: request final outputs input: expand("results/{sample}.sorted.bam", sample=SAMPLES) rule align: input: fastq="data/{sample}.fastq", ref="refs/genome.fa" output: bam="results/{sample}.sorted.bam" threads: 4 shell: "bwa mem -t {threads} {input.ref} {input.fastq} " "| samtools sort -@ {threads} -o {output.bam}" ``` ```bash # Run: dry-run first, then execute snakemake -n # dry-run: show what would run snakemake --cores 8 # execute with 8 cores ``` ## Core API ### Module 1: Rule Definition Each rule defines one analysis step with inputs, outputs, and an execution method. 
```python
# Shell rule: run a command with {input} and {output} placeholders
rule fastqc:
    input:
        fastq="data/{sample}.fastq"
    output:
        html="qc/{sample}_fastqc.html",
        zip="qc/{sample}_fastqc.zip"
    log:
        "logs/fastqc/{sample}.log"
    shell:
        "fastqc {input.fastq} -o qc/ 2> {log}"
```

```python
# Run rule: inline Python for logic-heavy steps
rule parse_stats:
    input:
        txt="results/{sample}.flagstat.txt"
    output:
        csv="results/{sample}.stats.csv"
    run:
        import re, csv
        text = open(input.txt).read()
        # flagstat lines look like "123456 + 0 mapped (...)"
        mapped = re.search(r"(\d+) \+ \d+ mapped", text).group(1)
        with open(output.csv, "w") as f:
            csv.writer(f).writerow([wildcards.sample, mapped])
```

```python
# Script rule: delegate to an external R/Python/Julia script
rule plot_coverage:
    input:
        depth="results/{sample}.depth.txt"
    output:
        pdf="results/{sample}.coverage.pdf"
    script:
        "scripts/plot_coverage.R"

# In the R script, access inputs/outputs via the snakemake object:
#   depth_file <- snakemake@input[["depth"]]
#   pdf_path   <- snakemake@output[["pdf"]]
```

### Module 2: Wildcards and Pattern Expansion

Wildcards let one rule process any number of samples; `expand()` generates all required file paths.

```python
# Define sample list (from config or glob)
SAMPLES = ["ctrl_rep1", "ctrl_rep2", "treat_rep1", "treat_rep2"]

rule all:
    input:
        # expand() generates: qc/ctrl_rep1_fastqc.html, qc/ctrl_rep2_fastqc.html, ...
        expand("qc/{sample}_fastqc.html", sample=SAMPLES),
        expand("results/{sample}.bam", sample=SAMPLES)

# Access wildcard values inside shell/run
rule align:
    input:
        "data/{sample}.fastq"
    output:
        "results/{sample}.bam"
    shell:
        "echo Processing {wildcards.sample}; "
        "bwa mem refs/genome.fa {input} | samtools view -b > {output}"
```

```python
# Wildcard constraints prevent ambiguous matches
rule process:
    input:
        "data/{sample}_{rep}.fastq"
    output:
        "results/{sample}_{rep}.txt"
    wildcard_constraints:
        sample="[A-Za-z]+",  # letters only
        rep=r"\d+"           # digits only

# multiext: multiple outputs sharing a common path base
rule bwa_index:
    input:
        "refs/genome.fa"
    output:
        multiext("refs/genome.fa", ".amb", ".ann", ".bwt", ".pac", ".sa")
    shell:
        "bwa index {input}"
```

### Module 3: Configuration and Parameters

Config files externalize settings; `params` passes rule-level values without file dependencies.

```python
# Snakefile: declare config file
configfile: "config/config.yaml"

# config/config.yaml:
#   samples: [ctrl, treat]
#   threads:
#     align: 8
#     sort: 4
#   min_mapq: 20

SAMPLES = config["samples"]

rule filter_reads:
    input:
        "results/{sample}.bam"
    output:
        "results/{sample}.filtered.bam"
    params:
        mapq=config["min_mapq"]  # from config, not a file
    threads: config["threads"]["sort"]
    shell:
        "samtools view -@ {threads} -q {params.mapq} -b {input} > {output}"
```

```python
# Dynamic params via lambda functions
rule trim:
    input:
        fastq="data/{sample}.fastq"
    output:
        trimmed="trimmed/{sample}.fastq"
    params:
        # Adapt quality threshold based on sample name
        quality=lambda wildcards: 25 if "ctrl" in wildcards.sample else 20
    shell:
        "fastp -q {params.quality} -i {input.fastq} -o {output.trimmed}"
```

### Module 4: Resources and Environments

Declare computational resources for scheduler integration; use conda/Singularity for tool isolation.
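Resource values need not be constants: they can also be callables evaluated per job, which complements the static declarations in the examples that follow. A common pattern — sketched here assuming Snakemake ≥ 7.x, where the `retries:` directive and the `attempt` argument of resource callables are available — is to request more memory on each automatic retry. The assembly command and paths are illustrative.

```python
# Sketch: escalate memory on each retry (assumes Snakemake >= 7.x; command is illustrative)
rule assemble:
    input:
        "trimmed/{sample}.fastq"
    output:
        "assemblies/{sample}/contigs.fasta"
    retries: 2  # rerun a failed job up to 2 additional times
    resources:
        # attempt is 1 on the first try, 2 on the first retry, ...
        mem_mb=lambda wildcards, attempt: 8000 * attempt,
        runtime=120
    threads: 8
    shell:
        "spades.py -t {threads} -s {input} -o assemblies/{wildcards.sample}/"
```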
```python
# Resource declaration (used by SLURM/LSF profiles)
rule variant_calling:
    input:
        bam="results/{sample}.deduped.bam",
        ref="refs/genome.fa"
    output:
        vcf="variants/{sample}.vcf.gz"
    resources:
        mem_mb=16000,   # memory in MB
        runtime=240,    # max walltime in minutes
        disk_mb=20000   # scratch disk space
    threads: 8
    shell:
        "bcftools mpileup -f {input.ref} {input.bam} "
        "| bcftools call -m -Oz -o {output.vcf}"
```

```python
# Conda environment per rule (for reproducibility)
rule star_align:
    input:
        reads="data/{sample}.fastq",
        genome_dir="refs/star_index/"
    output:
        bam="star_out/{sample}/Aligned.sortedByCoord.out.bam"
    conda:
        "envs/star.yaml"
    # envs/star.yaml:
    #   channels:
    #     - conda-forge
    #     - bioconda
    #   dependencies:
    #     - star=2.7.10b
    #     - samtools=1.17
    threads: 8
    shell:
        "STAR --runThreadN {threads} --genomeDir {input.genome_dir} "
        "--readFilesIn {input.reads} --outSAMtype BAM SortedByCoordinate "
        "--outFileNamePrefix star_out/{wildcards.sample}/"
```

```python
# Singularity/Apptainer container
rule gatk_haplotypecaller:
    input:
        bam="results/{sample}.bam",
        ref="refs/genome.fa"
    output:
        gvcf="gvcfs/{sample}.g.vcf.gz"
    container:
        "docker://broadinstitute/gatk:4.4.0.0"
    shell:
        "gatk HaplotypeCaller -I {input.bam} -R {input.ref} "
        "-O {output.gvcf} -ERC GVCF"
```

### Module 5: Execution and Cluster Profiles

Execute locally, on clusters, or in the cloud; profiles configure executors without changing the Snakefile.

```bash
# Local execution
snakemake --cores 8     # use 8 CPU cores
snakemake --cores all   # use all available cores

# Dry run: show tasks without executing
snakemake -n --cores 8
# Output: 12 of 24 steps are complete. 12 jobs to run.

# Force rerun (ignore existing outputs)
snakemake --forceall --cores 8

# Visualize DAG as PDF
snakemake --dag | dot -Tpdf > workflow_dag.pdf
```

```bash
# SLURM cluster profile (profiles/slurm/config.yaml):
#   executor: slurm
#   jobs: 50
#   default-resources:
#     mem_mb: 2000
#     runtime: 60
#   use-conda: true

# Run with profile (cluster submit + monitor)
snakemake --profile profiles/slurm --cores 128

# Override resources at runtime
snakemake --profile profiles/slurm \
    --set-resources variant_calling:mem_mb=32000 --cores 128

# Override threads
snakemake --set-threads align=16 --cores 64
```

### Module 6: Special Output Types and Utilities

Handle temporary files, protected outputs, checkpoints, and output validation.

```python
# temp: auto-delete after downstream rules consume it
rule sort_bam:
    input:
        "results/{sample}.raw.bam"
    output:
        temp("results/{sample}.sorted_temp.bam")  # deleted after indexing
    shell:
        "samtools sort {input} -o {output}"

# protected: write-protect final outputs (prevent overwrite)
rule final_report:
    input:
        "results/{sample}.vcf.gz"
    output:
        protected("reports/{sample}.final.vcf.gz")
    shell:
        "cp {input} {output}"
```

```python
# directory: rule that outputs a directory
rule denovo_assembly:
    input:
        fastq="data/{sample}.fastq"
    output:
        directory("assemblies/{sample}/")
    shell:
        "spades.py -s {input.fastq} -o {output}"

# touch: create an empty flag file (for ordering-only dependencies)
rule validate_bam:
    input:
        "results/{sample}.bam"
    output:
        touch("checkpoints/{sample}.validated")
    shell:
        "samtools quickcheck {input} && echo OK"

# ensure: validate output properties before considering the rule complete
rule download_reference:
    output:
        ensure("refs/genome.fa", non_empty=True)
    shell:
        "wget -O {output} https://example.com/genome.fa"
```

## Key Concepts

### Rule Resolution and DAG

Snakemake works **backward from targets**: given a list of desired output files, it builds a DAG of the rules needed to produce them. Rules not needed for the current targets are ignored.
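Because resolution starts from the requested targets, any output file can also be requested directly on the command line, and Snakemake will execute only the subgraph of rules needed to produce it. The path below refers to the Quick Start example above and is otherwise illustrative.

```bash
# Build a single target; only the rules required for it are run
snakemake results/sampleA.sorted.bam --cores 4

# Dry-run the same target first to inspect the resolved subgraph
snakemake -n results/sampleA.sorted.bam
```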
```python
# rule all: declare all final outputs here
# Without this, snakemake runs only the first rule
rule all:
    input:
        expand("results/{sample}.vcf.gz", sample=SAMPLES),
        expand("qc/{sample}_fastqc.html", sample=SAMPLES)
```

### Wildcards vs Expand

- `{sample}` in rule input/output is a wildcard: Snakemake fills it in at execution time
- `expand("results/{sample}.bam", sample=SAMPLES)` is plain Python: it generates a concrete list of file names at parse time (used in `rule all`)

## Common Workflows

### Workflow 1: Standard NGS QC Pipeline

**Goal**: FastQC → trim → align → sort → dedup → flagstat for multiple samples.

```python
configfile: "config/config.yaml"

SAMPLES = config["samples"]

rule all:
    input:
        expand("qc/{sample}_fastqc.html", sample=SAMPLES),
        expand("results/{sample}.flagstat.txt", sample=SAMPLES)

rule fastqc:
    input:
        "data/{sample}.fastq"
    output:
        "qc/{sample}_fastqc.html",
        "qc/{sample}_fastqc.zip"
    shell:
        "fastqc {input} -o qc/"

rule trim:
    input:
        "data/{sample}.fastq"
    output:
        "trimmed/{sample}.fastq"
    shell:
        "fastp -q 20 -i {input} -o {output}"

rule align:
    input:
        fastq="trimmed/{sample}.fastq",
        ref="refs/genome.fa"
    output:
        temp("results/{sample}.raw.bam")
    threads: 8
    shell:
        "bwa mem -t {threads} {input.ref} {input.fastq} | samtools view -b > {output}"

rule sort_dedup:
    input:
        "results/{sample}.raw.bam"
    output:
        bam="results/{sample}.bam",
        bai="results/{sample}.bam.bai"
    threads: 4
    shell:
        "samtools sort -@ {threads} {input} | samtools markdup -r - {output.bam} "
        "&& samtools index {output.bam}"

rule flagstat:
    input:
        "results/{sample}.bam"
    output:
        "results/{sample}.flagstat.txt"
    shell:
        "samtools flagstat {input} > {output}"
```

### Workflow 2: Running on a SLURM Cluster

**Goal**: Deploy the same Snakefile to HPC with per-job resource allocation.

```bash
# 1. Create profiles/slurm/config.yaml
mkdir -p profiles/slurm
cat > profiles/slurm/config.yaml << 'EOF'
executor: slurm
jobs: 100
default-resources:
  mem_mb: 4000
  runtime: 60
use-conda: true
latency-wait: 30
rerun-incomplete: true
EOF

# 2. Add resources to compute-heavy rules in the Snakefile, e.g.:
#      resources:
#          mem_mb=16000, runtime=120

# 3. Submit
snakemake --profile profiles/slurm --cores 256 -n   # dry-run
snakemake --profile profiles/slurm --cores 256      # submit

# 4. Monitor via the scheduler; after completion, generate an HTML report
snakemake --profile profiles/slurm --report report.html
```

## Key Parameters

| Parameter | Context | Default | Range/Options | Effect |
|-----------|---------|---------|---------------|--------|
| `--cores` | CLI | 1 | 1–N or `all` | Max concurrent jobs/threads |
| `threads:` | Rule | 1 | 1–N | Threads per rule (scales to `--cores`) |
| `mem_mb:` | `resources:` | None | integer | Memory in MB (used by SLURM profile) |
| `runtime:` | `resources:` | None | integer (min) | Max walltime per job |
| `--profile` | CLI | None | path | YAML profile for executor config |
| `--use-conda` | CLI | False | flag | Activate per-rule conda environments |
| `--use-apptainer` | CLI | False | flag | Enable Singularity/Apptainer containers |
| `-n` | CLI | False | flag | Dry-run (show tasks, don't execute) |
| `--forceall` | CLI | False | flag | Rerun all rules regardless of status |
| `--rerun-incomplete` | CLI | False | flag | Rerun rules with partial outputs |
| `configfile:` | Snakefile | None | YAML path | Load config dictionary from YAML |

## Best Practices

1. **Always define `rule all`**: Without it, only the first rule in the Snakefile runs. `rule all` collects all final outputs; Snakemake runs everything needed to produce them.
2. **Use `temp()` for large intermediates**: BAM files before deduplication, unsorted BAMs, and intermediate assemblies can be marked `temp()` to auto-delete after consumption — this saves significant disk space.
3. **Separate config from code**: Put sample lists, thread counts, file paths, and thresholds in `config.yaml`. Hard-coded values in Snakefiles make pipelines brittle and non-reusable.
4. **Test with `snakemake -n` first**: The dry-run shows exactly which rules will run and in what order. Run it before every production execution to confirm the DAG is correct.
5. **Use `log:` for every shell rule**: Redirect tool stderr/stdout to per-rule log files (`2> {log}`). Without logs, debugging cluster job failures is nearly impossible.
6. **Benchmark rules in production**: Add a `benchmark:` directive (e.g. `benchmark: "benchmarks/align/{sample}.txt"`, named after the rule) to measure actual runtime and memory — essential data for tuning SLURM resource requests.

## Common Recipes

### Recipe: Generate Sample List from Files

```python
# Auto-discover samples from the input directory (no hardcoded list)
from pathlib import Path

SAMPLES = [p.stem for p in Path("data/").glob("*.fastq")]
print(f"Found {len(SAMPLES)} samples: {SAMPLES[:3]}...")

rule all:
    input:
        expand("results/{sample}.bam", sample=SAMPLES)
```

### Recipe: Conditional Execution Based on Config

```python
configfile: "config/config.yaml"

# Only run deduplication for WGS (not amplicon) data
rule dedup:
    input:
        "results/{sample}.sorted.bam"
    output:
        "results/{sample}.deduped.bam"
    run:
        if config.get("assay_type") == "WGS":
            shell("samtools markdup -r {input} {output}")
        else:
            shell("cp {input} {output}")
```

### Recipe: Aggregate Multiple Samples

```python
# Collect all per-sample stats into one summary table
rule multiqc:
    input:
        expand("qc/{sample}_fastqc.zip", sample=SAMPLES),
        expand("results/{sample}.flagstat.txt", sample=SAMPLES)
    output:
        "multiqc/multiqc_report.html"
    shell:
        "multiqc qc/ results/ -o multiqc/"
```

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| `AmbiguousRuleException` | Multiple rules match the same output | Add `wildcard_constraints:`, use `ruleorder rule_a > rule_b`, or rename outputs |
| `MissingOutputException` | Rule completed but output file absent | Check working directory in shell; verify output path; check disk space |
| `MissingRuleException` | `rule all` requests a file no rule can produce | Verify `expand()` args match wildcard names; use `-n` to trace resolution |
| Cluster jobs all fail | Resources too low for the tool | Increase `mem_mb` or `runtime`; check the cluster queue with `squeue` |
| Conda env build fails | Package conflict or wrong channel | Add `conda-forge` before `bioconda`; pin package versions |
| Rule reruns unexpectedly | Output file timestamp older than input | Touch output files with `snakemake --touch`; or delete and rerun |
| `PermissionError` on protected output | `protected()` wrapper applied | Manually remove write protection (`chmod +w`) or delete the file and regenerate without `protected()` |

## Related Skills

- **samtools-bam-processing** — BAM sorting and indexing rules commonly used in Snakemake pipelines
- **bedtools-genomic-intervals** — interval operations in downstream annotation rules
- **neuropixels-analysis** — example of a complex multi-step pipeline that benefits from Snakemake

## References

- [Snakemake documentation](https://snakemake.readthedocs.io/) — rules, wildcards, profiles, API reference
- [Snakemake GitHub](https://github.com/snakemake/snakemake) — source, releases, issue tracker
- Mölder et al. (2021) "Sustainable data analysis with Snakemake" — [F1000Research 10:33](https://doi.org/10.12688/f1000research.29032.2)
(2021) "Sustainable data analysis with Snakemake" — [F1000Research 10:33](https://doi.org/10.12688/f1000research.29032.2) - [Snakemake workflow catalog](https://snakemake.github.io/snakemake-workflow-catalog/) — community-maintained reference pipelines