---
name: bioinformatics-installer
description: "Install bioinformatics tools for ENCODE data analysis. Covers CLI tools (BWA, STAR, samtools, MACS2), R/Bioconductor packages (DESeq2, Seurat, ChIPseeker), Python packages (Scanpy, deeptools), and Nextflow pipeline infrastructure. Generates conda environments, R install scripts, and Python requirements. Use when the user needs to set up a bioinformatics workstation, install tools for a specific assay, create reproducible environments, or troubleshoot dependency issues. Trigger on: install tools, set up environment, conda create, bioinformatics setup, install R packages, install Bioconductor, install pipeline tools."
---

# Bioinformatics Installer for ENCODE Data Analysis

Install all bioinformatics tools needed for ENCODE data analysis, organized by assay type. This skill provides ready-to-use conda environment definitions, R/Bioconductor install scripts, Python package lists, and Nextflow pipeline infrastructure setup. Every environment is version-pinned for reproducibility and tested against ENCODE uniform processing standards.

## When to Use

- User wants to install bioinformatics tools needed for ENCODE data analysis
- User asks about "install tools", "conda environment", "setup bioinformatics", or "install HOMER/MACS2/deeptools"
- User needs pre-configured conda environments for specific assay pipelines (ChIP-seq, ATAC-seq, RNA-seq, etc.)
- User wants to install R/Bioconductor packages (DESeq2, Seurat, ChIPseeker) or Python packages (Scanpy, pysam)
- Example queries: "install tools for ChIP-seq analysis", "set up a conda environment for ATAC-seq", "install deeptools and bedtools"

## Overview

ENCODE data analysis requires a broad ecosystem of tools spanning command-line aligners, peak callers, signal processors, statistical analysis frameworks in R, Python visualization and single-cell packages, and workflow engines.
Setting up these tools correctly, with compatible versions, proper channel priorities, and no dependency conflicts, is a significant barrier for new users and a reproducibility concern for experienced analysts.

This skill solves that by providing:

- **7 assay-specific conda environments** with pinned tool versions matching ENCODE pipeline standards
- **R/Bioconductor install script** covering 50+ packages across 8 categories
- **Python install script** for single-cell, Hi-C, and genomics packages
- **Nextflow + container setup** for pipeline execution on local, HPC, and cloud platforms

All environments use the same channel priority (conda-forge > bioconda > defaults) and are tested for cross-platform compatibility on Linux x86_64 and macOS (Intel + Apple Silicon where possible).

## Quick Start

Install a complete environment for any assay type with a single command:

```bash
# ChIP-seq (histone or TF)
conda env create -f skills/bioinformatics-installer/environments/chipseq-env.yml

# ATAC-seq
conda env create -f skills/bioinformatics-installer/environments/atacseq-env.yml

# RNA-seq
conda env create -f skills/bioinformatics-installer/environments/rnaseq-env.yml

# Hi-C
conda env create -f skills/bioinformatics-installer/environments/hic-env.yml

# Whole-Genome Bisulfite Sequencing (WGBS)
conda env create -f skills/bioinformatics-installer/environments/wgbs-env.yml

# DNase-seq
conda env create -f skills/bioinformatics-installer/environments/dnaseseq-env.yml

# CUT&RUN / CUT&Tag
conda env create -f skills/bioinformatics-installer/environments/cutandrun-env.yml
```

Using mamba for faster solves (recommended):

```bash
mamba env create -f skills/bioinformatics-installer/environments/chipseq-env.yml
```

Install R and Python packages:

```bash
# All R/Bioconductor packages
Rscript skills/bioinformatics-installer/scripts/install-r-packages.R --all

# All Python packages
bash skills/bioinformatics-installer/scripts/install-python-packages.sh --all

# Nextflow + Docker
bash skills/bioinformatics-installer/scripts/install-nextflow.sh --docker
```

## Per-Assay Environments

### ChIP-seq Environment (`encode-chipseq`)

For histone modification and transcription factor ChIP-seq processing following ENCODE uniform pipeline standards (Landt et al. 2012, ENCODE Consortium 2020).

| Tool | Version | Purpose |
|------|---------|---------|
| BWA-MEM | 0.7.17 | Read alignment to reference genome (Li & Durbin 2009) |
| samtools | 1.19 | BAM manipulation, sorting, indexing, flagstat (Li et al. 2009) |
| MACS2 | 2.2.9.1 | Peak calling for narrow (TF) and broad (histone) marks (Zhang et al. 2008) |
| Picard | 3.1.1 | Duplicate marking and library complexity metrics (Broad Institute) |
| phantompeakqualtools | 1.2.2 | Strand cross-correlation (NSC/RSC) quality metrics (Kharchenko et al. 2008) |
| IDR | 2.0.3 | Irreproducible Discovery Rate for replicate consistency (Li et al. 2011) |
| deeptools | 3.5.5 | Signal normalization (bamCoverage), fingerprint, correlation (Ramirez et al. 2016) |
| bedtools | 2.31.0 | Interval operations, blacklist filtering (Quinlan & Hall 2010) |
| FastQC | 0.12.1 | Raw read quality assessment (Andrews 2010) |
| Trim Galore | 0.6.10 | Adapter and quality trimming via Cutadapt (Krueger 2012) |
| MultiQC | 1.21 | Aggregate QC report across all pipeline stages (Ewels et al. 2016) |
| bedGraphToBigWig | — | Convert bedGraph signal to bigWig for genome browser viewing (Kent et al. 2010) |

**Memory**: BWA index for GRCh38 requires ~5.5 GB RAM. Peak calling with MACS2 typically requires 4-8 GB. phantompeakqualtools loads full BAM into memory.

**Environment file**: `environments/chipseq-env.yml`

---

### ATAC-seq Environment (`encode-atacseq`)

For chromatin accessibility profiling via ATAC-seq following ENCODE standards (Buenrostro et al. 2013, Corces et al. 2017).
| Tool | Version | Purpose |
|------|---------|---------|
| Bowtie2 | 2.5.3 | Alignment (preferred over BWA for ATAC-seq short fragments) (Langmead & Salzberg 2012) |
| MACS2 | 2.2.9.1 | Peak calling with `--nomodel --shift -100 --extsize 200` for ATAC (Zhang et al. 2008) |
| samtools | 1.19 | BAM manipulation, mitochondrial read filtering |
| Picard | 3.1.1 | Duplicate marking, insert size metrics |
| deeptools | 3.5.5 | alignmentSieve (Tn5 offset), bamCoverage (signal tracks), plotFingerprint |
| bedtools | 2.31.0 | Blacklist filtering, interval operations |
| FastQC | 0.12.1 | Raw read quality and adapter content assessment |
| Trim Galore | 0.6.10 | Adapter trimming (Nextera adapters for ATAC-seq) |
| MultiQC | 1.21 | Aggregate QC reporting |

**Key ATAC-seq parameters**: Tn5 transposase introduces a +4/-5 bp offset that must be corrected. The fragment size distribution should show a nucleosomal ladder (sub-nucleosomal, mono-, di-, tri-). The TSS enrichment score should be at least 5 (GRCh38), 6 (hg19), or 10 (mm10) to fall in the acceptable range of the ENCODE data standards; higher scores indicate better signal-to-noise.

**Environment file**: `environments/atacseq-env.yml`

---

### RNA-seq Environment (`encode-rnaseq`)

For gene expression quantification following ENCODE RNA-seq standards (Conesa et al. 2016, ENCODE Consortium 2020).

| Tool | Version | Purpose |
|------|---------|---------|
| STAR | 2.7.11b | Splice-aware alignment with 2-pass mapping (Dobin et al. 2013) |
| RSEM | 1.3.3 | Gene/transcript quantification with expectation-maximization (Li & Dewey 2011) |
| Kallisto | 0.50.1 | Pseudoalignment-based transcript quantification (Bray et al. 2016) |
| Salmon | 1.10.3 | Quasi-mapping transcript quantification with GC bias correction (Patro et al. 2017) |
| featureCounts (subread) | 2.0.6 | Gene-level read counting for count-based DE methods (Liao et al. 2014) |
| samtools | 1.19 | BAM handling, flagstat, idxstats |
| FastQC | 0.12.1 | Read quality assessment |
| Trim Galore | 0.6.10 | Adapter and quality trimming |
| MultiQC | 1.21 | Aggregate QC report |
| RSeQC | 5.0.3 | RNA-seq-specific QC: gene body coverage, read distribution, inner distance (Wang et al. 2012) |

**Memory**: STAR genome generation requires 32+ GB RAM for the human genome. STAR alignment requires ~30 GB RAM. Kallisto and Salmon are memory-efficient alternatives (~4 GB).

**Environment file**: `environments/rnaseq-env.yml`

---

### Hi-C Environment (`encode-hic`)

For chromatin conformation capture processing following ENCODE Hi-C standards (Yardimci et al. 2019, Rao et al. 2014).

| Tool | Version | Purpose |
|------|---------|---------|
| BWA-MEM | 0.7.17 | Chimeric read alignment (each mate aligned independently) |
| pairtools | 1.0.3 | Parse, sort, deduplicate, filter contact pairs (Open2C) |
| cooler | 0.9.3 | Multi-resolution contact matrix storage and balancing (Abdennur & Mirny 2020) |
| Juicer | 2.20.00 | Contact matrix generation and HiCCUPS loop calling (Durand et al. 2016) |
| samtools | 1.19 | BAM handling for chimeric alignment parsing |
| bedtools | 2.31.0 | Restriction fragment and TAD boundary operations |
| FastQC | 0.12.1 | Read quality assessment |
| Trim Galore | 0.6.10 | Adapter trimming |
| MultiQC | 1.21 | Aggregate QC reporting |

**Key Hi-C parameters**: Cis/trans ratio > 60%, long-range cis contacts (> 20 kb) > 40%. Resolution depends on sequencing depth: ~1 billion valid pairs for 5 kb resolution on human.

**Note**: Juicer requires Java 11+. Install via `conda install -c bioconda juicer_tools` or download the `.jar` directly from the Aiden Lab GitHub.

**Environment file**: `environments/hic-env.yml`

---

### WGBS Environment (`encode-wgbs`)

For whole-genome bisulfite sequencing (DNA methylation) following ENCODE standards (Foox et al. 2021, Schultz et al. 2015).
| Tool | Version | Purpose |
|------|---------|---------|
| Bismark | 0.24.2 | Bisulfite-aware alignment and methylation extraction (Krueger & Andrews 2011) |
| MethylDackel | 0.6.1 | Fast methylation extraction from bisulfite BAMs (Ryan 2023) |
| samtools | 1.19 | BAM manipulation, merge, index |
| bedtools | 2.31.0 | Interval operations for DMR analysis |
| FastQC | 0.12.1 | Read quality assessment (note: bisulfite libraries have biased base composition) |
| Trim Galore | 0.6.10 | Adapter trimming with `--rrbs` or default mode |
| MultiQC | 1.21 | Aggregate QC reporting with Bismark module |
| tabix | 1.19 | Index methylation BED files for random access |
| bgzip | 1.19 | Block-gzip compression for indexed access |

**Key WGBS parameters**: Bisulfite conversion rate >= 98% (check unmethylated spike-in lambda DNA). CpG coverage >= 10x for reliable DMR calling. M-bias plots should be checked for end-repair artifacts.

**Environment file**: `environments/wgbs-env.yml`

---

### DNase-seq Environment (`encode-dnaseseq`)

For DNase I hypersensitive site mapping following ENCODE standards (Thurman et al. 2012, ENCODE Consortium 2020).

| Tool | Version | Purpose |
|------|---------|---------|
| BWA-MEM | 0.7.17 | Read alignment to reference genome |
| Hotspot2 | 2.3.1 | DNase-seq hotspot detection (John et al. 2011) |
| HINT-ATAC | 0.13.2 | TF footprinting from DNase-seq data (Li et al. 2019) |
| F-Seq2 | 2.0.3 | Feature density estimation for peak calling (Boyle et al. 2008, Zhao et al. 2020) |
| samtools | 1.19 | BAM handling and filtering |
| bedtools | 2.31.0 | Interval operations, blacklist filtering |
| FastQC | 0.12.1 | Read quality assessment |
| Trim Galore | 0.6.10 | Adapter trimming |
| MultiQC | 1.21 | Aggregate QC reporting |

**Environment file**: `environments/dnaseseq-env.yml`

---

### CUT&RUN / CUT&Tag Environment (`encode-cutandrun`)

For antibody-targeted chromatin profiling via CUT&RUN (Skene & Henikoff 2017) and CUT&Tag (Kaya-Okur et al. 2019).
| Tool | Version | Purpose |
|------|---------|---------|
| Bowtie2 | 2.5.3 | Alignment (recommended for shorter CUT&RUN/Tag fragments) |
| SEACR | 1.3 | Sparse Enrichment Analysis for CUT&RUN (Meers et al. 2019) |
| MACS2 | 2.2.9.1 | Alternative peak calling with adjusted parameters |
| samtools | 1.19 | BAM handling, spike-in alignment filtering |
| Picard | 3.1.1 | Duplicate marking (low duplication expected for CUT&RUN/Tag) |
| deeptools | 3.5.5 | Signal tracks, heatmaps, spike-in normalization |
| bedtools | 2.31.0 | Interval operations, suspect list filtering |
| FastQC | 0.12.1 | Read quality assessment |
| Trim Galore | 0.6.10 | Adapter trimming |
| MultiQC | 1.21 | Aggregate QC reporting |

**Key CUT&RUN/Tag notes**: These assays have inherently lower background than ChIP-seq. Do NOT apply ChIP-seq quality thresholds; use CUT&RUN-specific metrics (Nordin et al. 2023). Apply the CUT&RUN suspect list instead of the standard ENCODE blacklist. Spike-in normalization (E. coli DNA for CUT&RUN, carry-over for CUT&Tag) is strongly recommended for quantitative comparisons.

**Environment file**: `environments/cutandrun-env.yml`

## R/Bioconductor Packages

Install all R packages needed for ENCODE downstream analysis. The install script at `scripts/install-r-packages.R` handles BiocManager setup, version locking, and category-based installation.

### Core Genomic Infrastructure

These packages provide the foundation for all genomic data manipulation in R:

| Package | Purpose |
|---------|---------|
| GenomicRanges | Interval arithmetic on genomic coordinates (Lawrence et al. 2013) |
| GenomicFeatures | Gene model and transcript annotation handling |
| rtracklayer | Import/export BED, bigWig, GFF, narrowPeak, broadPeak |
| IRanges | Integer range operations (underlying GenomicRanges) |
| GenomeInfoDb | Chromosome naming conventions (UCSC vs Ensembl vs NCBI) |
| BiocGenerics | Common S4 generics across Bioconductor |
| S4Vectors | S4 class infrastructure for Bioconductor objects |
| AnnotationDbi | Unified interface to annotation databases |
| biomaRt | Ensembl BioMart query interface for gene annotation (Durinck et al. 2009) |

### Differential Analysis

| Package | Purpose |
|---------|---------|
| DESeq2 | Differential gene expression with shrinkage estimators (Love et al. 2014) |
| edgeR | Differential expression using empirical Bayes (Robinson et al. 2010) |
| limma | Linear models for microarray and RNA-seq data (Ritchie et al. 2015) |
| DiffBind | Differential binding analysis for ChIP-seq/ATAC-seq peaks (Stark & Brown 2011) |
| ChIPQC | ChIP-seq quality control in R (Carroll et al. 2014) |
| chromVAR | Chromatin accessibility variation across single cells (Schep et al. 2017) |

### Annotation and Pathway Analysis

| Package | Purpose |
|---------|---------|
| ChIPseeker | Peak annotation and visualization (Yu et al. 2015) |
| annotatr | Annotate genomic regions with CpG islands, genes, enhancers (Cavalcante & Sartor 2017) |
| clusterProfiler | Gene ontology and KEGG pathway enrichment (Yu et al. 2012) |
| org.Hs.eg.db | Human gene annotation database |
| org.Mm.eg.db | Mouse gene annotation database |
| TxDb.Hsapiens.UCSC.hg38.knownGene | Human transcript models (GRCh38) |
| TxDb.Mmusculus.UCSC.mm10.knownGene | Mouse transcript models (mm10) |

### Single-Cell Analysis

| Package | Purpose |
|---------|---------|
| Seurat | Comprehensive single-cell RNA-seq analysis (Hao et al. 2021) |
| Signac | Single-cell chromatin accessibility (ATAC-seq) analysis (Stuart et al. 2021) |
| SingleCellExperiment | Core Bioconductor container for single-cell data |
| scater | Single-cell QC, normalization, visualization (McCarthy et al. 2017) |
| scran | Single-cell normalization and feature selection (Lun et al. 2016) |

### Bulk-to-Single-Cell Deconvolution

| Package | Purpose |
|---------|---------|
| BayesPrism | Bayesian deconvolution with scRNA-seq reference (Chu et al. 2022) |
| InstaPrism | Fast approximation of BayesPrism for large datasets (Wang et al. 2024) |
| MuSiC_deconv | Multi-Subject Single Cell deconvolution (Wang et al. 2019) |
| DWLS | Dampened Weighted Least Squares deconvolution (Tsoucas et al. 2019) |
| BisqueRNA | Reference-based and marker-based deconvolution (Jew et al. 2020) |

### DNA Methylation Analysis

| Package | Purpose |
|---------|---------|
| DMRcate | Differentially methylated region detection (Peters et al. 2021) |
| bsseq | Bisulfite sequencing data handling and smoothing (Hansen et al. 2012) |
| methylKit | Methylation analysis from bisulfite sequencing (Akalin et al. 2012) |

### Visualization

| Package | Purpose |
|---------|---------|
| ComplexHeatmap | Publication-quality heatmaps with annotations (Gu et al. 2016) |
| EnhancedVolcano | Volcano plots for differential expression (Blighe et al. 2018) |
| Gviz | Genome browser-style track visualization (Hahne & Ivanek 2016) |
| ggplot2 | Grammar of graphics for all custom plots (Wickham 2016) |

### Statistics and Batch Correction

| Package | Purpose |
|---------|---------|
| sva (ComBat) | Surrogate variable analysis and batch correction (Leek et al. 2012) |
| WGCNA | Weighted Gene Co-expression Network Analysis (Langfelder & Horvath 2008) |
| ReactomePA | Reactome pathway analysis (Yu & He 2016) |

**Install script**: `scripts/install-r-packages.R`

```bash
# Install all categories
Rscript scripts/install-r-packages.R --all

# Install only specific categories
Rscript scripts/install-r-packages.R --chipseq        # DiffBind, ChIPQC, ChIPseeker
Rscript scripts/install-r-packages.R --rnaseq         # DESeq2, edgeR, limma
Rscript scripts/install-r-packages.R --singlecell     # Seurat, Signac, scater, scran
Rscript scripts/install-r-packages.R --methylation    # DMRcate, bsseq, methylKit
Rscript scripts/install-r-packages.R --deconvolution  # BayesPrism, InstaPrism, MuSiC_deconv, DWLS, BisqueRNA
```

## Python Packages

Install Python packages for single-cell analysis, Hi-C processing, signal visualization, and genomic data manipulation.

### Core Single-Cell Stack

| Package | Purpose |
|---------|---------|
| scanpy | Single-cell RNA-seq analysis framework (Wolf et al. 2018) |
| anndata | Annotated data matrix for single-cell (Virshup et al. 2021) |
| scvi-tools | Deep generative models for single-cell (Gayoso et al. 2022) |
| numpy | Numerical computing |
| pandas | Data manipulation and tabular operations |
| scipy | Scientific computing (sparse matrices, statistics) |
| matplotlib | Plotting foundation |
| seaborn | Statistical visualization |

### Genomics and Signal Processing

| Package | Purpose |
|---------|---------|
| deeptools | Signal tracks, heatmaps, correlation (also CLI; Ramirez et al. 2016) |
| pyBigWig | Read/write bigWig signal files (Ryan 2023) |
| pysam | Python interface to samtools/htslib (Li et al. 2009) |
| pybedtools | Python interface to bedtools (Dale et al. 2011) |

### Hi-C Analysis

| Package | Purpose |
|---------|---------|
| cooler | Multi-resolution contact matrices (Abdennur & Mirny 2020) |
| cooltools | Analysis toolkit for cooler data: TADs, compartments, insulation |
| hic-straw | Read .hic files from Juicer/Juicebox (Durand et al. 2016) |
| pyGenomeTracks | Genome browser visualization including Hi-C tracks |

### Single-Cell QC and Integration

| Package | Purpose |
|---------|---------|
| scrublet | Doublet detection for scRNA-seq (Wolock et al. 2019) |
| CellBender | Remove ambient RNA contamination (Fleming et al. 2023) |
| harmony-pytorch | Batch integration via Harmony in PyTorch (Korsunsky et al. 2019) |
| scanorama | Panoramic stitching of scRNA-seq datasets (Hie et al. 2019) |
| bbknn | Batch-balanced KNN graph construction (Polanski et al. 2020) |

**Install script**: `scripts/install-python-packages.sh`

```bash
# Install all Python packages
bash scripts/install-python-packages.sh --all

# Install only specific categories
bash scripts/install-python-packages.sh --singlecell  # scanpy, scvi-tools, harmony
bash scripts/install-python-packages.sh --hic         # cooler, cooltools, hic-straw
bash scripts/install-python-packages.sh --deeptools   # deeptools, pyBigWig, pysam
```

## Nextflow and Container Setup

ENCODE pipeline execution requires Nextflow DSL2 and a container runtime (Docker or Singularity).
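
Before installing a runtime, it can help to check whether one is already on the `PATH`. The following is a minimal sketch, not part of the installer scripts: the helper name `first_available` is hypothetical, and the preference order (Docker first, then Singularity) is an assumption that mirrors the order used in this section.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of install-nextflow.sh): print the first
# command from the argument list that is found on PATH; return 1 if none is.
first_available() {
  local cmd
  for cmd in "$@"; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      return 0
    fi
  done
  return 1
}

# Assumed preference order: Docker, then Singularity.
runtime="$(first_available docker singularity || echo none)"
echo "Container runtime: $runtime"
```

If the result is `none`, follow the Docker or Singularity instructions in this section before running any pipeline.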
### Nextflow Installation

```bash
# Install Nextflow (requires Java 11+)
curl -s https://get.nextflow.io | bash
chmod +x nextflow
sudo mv nextflow /usr/local/bin/

# Verify
nextflow -version
```

### Docker (recommended for local/cloud)

```bash
# macOS
brew install --cask docker

# Linux (Ubuntu/Debian); assumes Docker's apt repository is already configured
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io

# Add current user to docker group (Linux); log out and back in to apply
sudo usermod -aG docker "$USER"
```

### Singularity (for HPC clusters)

```bash
# Most HPC clusters have Singularity pre-installed
# Check with: module load singularity && singularity version

# If not available, install via conda:
conda install -c conda-forge singularity
```

### Nextflow Configuration Profiles

The pipeline skills (pipeline-chipseq, pipeline-atacseq, etc.) include `nextflow.config` files with profiles for local, SLURM, GCP, and AWS execution. Select the appropriate profile:

```bash
# Local with Docker
nextflow run main.nf -profile local

# HPC with Singularity
nextflow run main.nf -profile slurm

# Google Cloud
nextflow run main.nf -profile gcp

# AWS Batch
nextflow run main.nf -profile aws
```

**Install script**: `scripts/install-nextflow.sh`

## Motif Analysis Tools

For transcription factor binding motif discovery and scanning.

| Tool | Version | Type | Purpose |
|------|---------|------|---------|
| HOMER | 4.11 | CLI | De novo and known motif discovery, annotation (Heinz et al. 2010) |
| MEME Suite | 5.5.5 | CLI | MEME, DREME, STREME de novo discovery; FIMO scanning; AME enrichment (Bailey et al. 2015) |
| FIMO | 5.5.5 | CLI (part of MEME Suite) | Motif occurrence scanning across sequences |
| TFBSTools | R | R/Bioconductor | JASPAR motif handling, PFM/PWM conversion, motif scanning in R (Tan & Lenhard 2016) |

### HOMER Installation

```bash
# Download and configure HOMER
mkdir -p ~/software/homer
cd ~/software/homer
wget http://homer.ucsd.edu/homer/configureHomer.pl
perl configureHomer.pl -install homer
perl configureHomer.pl -install hg38  # Human genome
perl configureHomer.pl -install mm10  # Mouse genome

# Add to PATH
export PATH="$PATH:$HOME/software/homer/bin"
```

### MEME Suite Installation

```bash
# Via conda (recommended)
conda install -c bioconda meme

# Or from source
wget https://meme-suite.org/meme/meme-software/5.5.5/meme-5.5.5.tar.gz
tar xzf meme-5.5.5.tar.gz
cd meme-5.5.5
./configure --prefix=$HOME/software/meme --enable-build-libxml2 --enable-build-libxslt
make && make install
```

## Walkthrough: Setting Up a Complete ENCODE Analysis Environment

**Goal**: Install all bioinformatics tools needed to process ENCODE data, from raw FASTQ files through peak calling, annotation, and visualization, using Conda environments.

**Context**: ENCODE analysis requires dozens of specialized tools. This skill automates installation with pre-configured Conda environments for each pipeline stage.

### Step 1: Determine required tools by experiment type

```
encode_get_experiment(accession="ENCSR000AKA")
```

Expected output:

```json
{
  "accession": "ENCSR000AKA",
  "assay_title": "Histone ChIP-seq",
  "target": "H3K27ac"
}
```

**Interpretation**: Histone ChIP-seq requires: BWA-MEM (alignment), SAMtools (BAM processing), MACS2 (peak calling), IDR (reproducibility), bedtools (interval operations), deepTools (signal visualization).
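
The assay-to-environment decision in Step 1 can be sketched as a small shell helper. This is a minimal sketch under stated assumptions, not part of the installer scripts: the function name `env_file_for_assay` and the `case` patterns are hypothetical, and only the environment file paths are taken from Quick Start.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map an ENCODE assay_title to one of the environment
# files listed in Quick Start. The matching patterns are assumptions; the
# first pattern that matches wins.
env_file_for_assay() {
  local assay="$1"
  local dir="skills/bioinformatics-installer/environments"
  case "$assay" in
    *ChIP-seq*)              echo "$dir/chipseq-env.yml" ;;
    *ATAC-seq*)              echo "$dir/atacseq-env.yml" ;;
    *RNA-seq*)               echo "$dir/rnaseq-env.yml" ;;
    *Hi-C*)                  echo "$dir/hic-env.yml" ;;
    *WGBS*)                  echo "$dir/wgbs-env.yml" ;;
    *DNase-seq*)             echo "$dir/dnaseseq-env.yml" ;;
    *"CUT&RUN"*|*"CUT&Tag"*) echo "$dir/cutandrun-env.yml" ;;
    *)
      echo "no environment defined for assay: $assay" >&2
      return 1
      ;;
  esac
}
```

With the `assay_title` from the output above, the environment could then be created with `conda env create -f "$(env_file_for_assay "Histone ChIP-seq")"`.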
### Step 2: Install the ChIP-seq Conda environment

```bash
# Using the pre-configured environment YAML
conda env create -f skills/bioinformatics-installer/environments/chipseq-env.yml
conda activate encode-chipseq
```

The YAML includes:

```yaml
name: encode-chipseq
channels: [conda-forge, bioconda, defaults]
dependencies:
  - bwa=0.7.17
  - samtools=1.19
  - macs2=2.2.9.1
  - idr=2.0.3
  - bedtools=2.31.0
  - deeptools=3.5.5
  - picard=3.1.1
  - fastqc=0.12.1
  - multiqc=1.21
```

### Step 3: Install additional tools for downstream analysis

For peak annotation and motif analysis:

```bash
conda env create -f skills/bioinformatics-installer/scripts/annotation-env.yml
conda activate encode-annotation
# Includes: HOMER, GREAT, bedtools, R/Bioconductor (ChIPseeker, clusterProfiler)
```

### Step 4: Verify installation

```bash
# Quick verification of key tools
bwa 2>&1 | head -3
samtools --version | head -1
macs2 --version
bedtools --version
```

### Step 5: Download reference data for ENCODE analysis

```
encode_download_files(accessions=["ENCFF001ABC"], download_dir="/data/references")
```

Reference files needed:

- GRCh38 genome FASTA
- ENCODE blacklist v2 (Amemiya et al. 2019)
- Gene annotation GTF (GENCODE v36)

### Integration with downstream skills

- Installed tools are used by → **pipeline-chipseq** through **pipeline-cutandrun** for processing
- Reference data feeds into → **download-encode** for FASTQ retrieval
- Environment setup enables → **quality-assessment** tool execution
- Installed annotation tools support → **peak-annotation** and **motif-analysis**

## Code Examples

### 1. Find experiments to identify required tools

```
encode_search_experiments(
    assay_title="ATAC-seq",
    organ="pancreas"
)
```

Expected output:

```json
{
  "total": 8,
  "experiments": [
    {
      "accession": "ENCSR799GHJ",
      "assay_title": "ATAC-seq",
      "biosample_summary": "pancreatic islet tissue male adult (44 years)",
      "status": "released"
    }
  ]
}
```

**Install decision**: ATAC-seq requires the `atacseq-env.yml` conda environment (Bowtie2 + MACS2 + deeptools + samtools + bedtools).

### 2. Get file info to understand format requirements

```
encode_get_file_info(accession="ENCFF001ABC")
```

Expected output:

```json
{
  "accession": "ENCFF001ABC",
  "file_format": "fastq",
  "file_size_mb": 4521.3,
  "read_length": 100,
  "paired_end": true,
  "platform": "Illumina NovaSeq 6000"
}
```

**Install decision**: Paired-end FASTQ needs Bowtie2 (not BWA for ATAC-seq), Picard for duplicate marking, and samtools for BAM processing.

## Pitfalls & Edge Cases

- **Conda solver conflicts**: Large conda environments with many packages can take hours to solve. Use mamba instead of conda for faster dependency resolution, or install in smaller focused environments.
- **R/Bioconductor version mismatch**: Each Bioconductor release is tied to a specific R version (for example, Bioconductor 3.18 targets R 4.3), so installing Bioconductor 3.18 packages under R 4.4 can fail outright or produce errors at load time. Use `BiocManager::install()` to ensure version compatibility.
- **Python 2 vs Python 3**: Some legacy bioinformatics tools (for example, MACS 1.x) require Python 2. Never install Python 2 tools in the same environment as Python 3 tools; use separate conda environments.
- **ARM Mac (M1/M2/M3) compatibility**: Many bioinformatics tools lack native ARM builds. Set `CONDA_SUBDIR=osx-64` to install x86_64 packages, which then run under Rosetta 2 emulation. Some tools (samtools, BWA) have ARM-native builds.
- **Nextflow requires Java 11+**: Nextflow will not run on Java 8, and newer Nextflow releases may require a later Java version, so check the Nextflow installation documentation for the release you install. Check `java -version` before running pipelines. The `curl -s https://get.nextflow.io | bash` installer downloads only the Nextflow launcher and does not bundle Java, so install a supported JDK separately.
- **Docker vs Singularity on HPC**: Most HPC clusters do not allow Docker because its daemon runs with root privileges. Use Singularity instead. Nextflow supports both runtimes: the pipeline configs described above select one through the profile (`-profile local` for Docker, `-profile slurm` for Singularity), and the `-with-docker` and `-with-singularity` options enable a runtime directly on the command line.

## Literature Foundation

| # | Reference | Key Contribution |
|---|-----------|-----------------|
| 1 | Li & Durbin 2009, Bioinformatics, DOI:10.1093/bioinformatics/btp324 (~30,000 cit) | BWA aligner |
| 2 | Langmead & Salzberg 2012, Nat Methods, DOI:10.1038/nmeth.1923 (~25,000 cit) | Bowtie2 aligner |
| 3 | Li et al. 2009, Bioinformatics, DOI:10.1093/bioinformatics/btp352 (~20,000 cit) | SAMtools/BAM format |
| 4 | Zhang et al. 2008, Genome Biol, DOI:10.1186/gb-2008-9-9-r137 (~7,000 cit) | MACS2 peak caller |
| 5 | Dobin et al. 2013, Bioinformatics, DOI:10.1093/bioinformatics/bts635 (~15,000 cit) | STAR RNA-seq aligner |
| 6 | Love et al. 2014, Genome Biol, DOI:10.1186/s13059-014-0550-8 (~30,000 cit) | DESeq2 |
| 7 | Ramirez et al. 2016, Nucleic Acids Res, DOI:10.1093/nar/gkw257 (~3,000 cit) | deeptools |
| 8 | Wolf et al. 2018, Genome Biol, DOI:10.1186/s13059-017-1382-0 (~5,000 cit) | Scanpy |
| 9 | Hao et al. 2021, Cell, DOI:10.1016/j.cell.2021.04.048 (~8,000 cit) | Seurat v4 |
| 10 | Quinlan & Hall 2010, Bioinformatics, DOI:10.1093/bioinformatics/btq033 (~10,000 cit) | bedtools |
| 11 | Ewels et al. 2016, Bioinformatics, DOI:10.1093/bioinformatics/btw354 (~3,000 cit) | MultiQC |
| 12 | Krueger & Andrews 2011, Bioinformatics, DOI:10.1093/bioinformatics/btr167 (~5,000 cit) | Bismark |
| 13 | Heinz et al. 2010, Molecular Cell, DOI:10.1016/j.molcel.2010.05.004 (~7,000 cit) | HOMER motif analysis |
| 14 | Bailey et al. 2015, Nucleic Acids Res, DOI:10.1093/nar/gkv416 (~3,000 cit) | MEME Suite |
| 15 | Meers et al. 2019, Epigenetics Chromatin, DOI:10.1186/s13072-019-0287-4 (~800 cit) | SEACR for CUT&RUN |
| 16 | Di Tommaso et al. 2017, Nat Biotechnol, DOI:10.1038/nbt.3820 (~2,500 cit) | Nextflow |
| 17 | Landt et al. 2012, Genome Res, DOI:10.1101/gr.136184.111 (~4,000 cit) | ENCODE ChIP-seq standards |
| 18 | ENCODE Consortium 2020, Nature, DOI:10.1038/s41586-020-2493-4 (~1,656 cit) | ENCODE Phase 3 |
| 19 | Amemiya et al. 2019, Sci Rep, DOI:10.1038/s41598-019-45839-z (~1,372 cit) | ENCODE Blacklist v2 |

## Integration

| This skill produces... | Feed into... | Purpose |
|---|---|---|
| Conda environments | **pipeline-chipseq** through **pipeline-cutandrun** | Provide tool dependencies for all pipeline stages |
| Installed reference data | **download-encode** | Reference genomes and annotations for alignment |
| Tool version inventory | **data-provenance** | Record exact tool versions for reproducibility |
| QC tool installations | **quality-assessment** | Enable FastQC, MultiQC, and ENCODE QC metric tools |
| Annotation tool setup | **peak-annotation** | HOMER, ChIPseeker for peak-to-gene assignment |
| Motif scanning tools | **jaspar-motifs** | MEME Suite for motif scanning against JASPAR |
| Visualization tools | **visualization-workflow** | deepTools, IGV, R/ggplot2 for data visualization |
| Liftover utilities | **liftover-coordinates** | UCSC liftOver binary for assembly conversion |

## Related Skills

- **pipeline-guide**: Parent skill for all pipeline execution; provides overview of available pipelines and tool selection guidance
- **pipeline-chipseq**: Uses the ChIP-seq conda environment tools for FASTQ-to-peaks processing
- **pipeline-atacseq**: Uses the ATAC-seq conda environment tools for accessibility analysis
- **pipeline-rnaseq**: Uses the RNA-seq conda environment for expression quantification
- **pipeline-wgbs**: Uses the WGBS conda environment for methylation analysis
- **pipeline-hic**: Uses the Hi-C conda environment for contact matrix generation
- **pipeline-dnaseseq**: Uses the DNase-seq conda environment for hotspot detection
- **pipeline-cutandrun**: Uses the CUT&RUN conda environment for CUT&RUN/CUT&Tag processing
- **quality-assessment**: Quality metrics require properly installed tools to compute
- **setup**: Initial ENCODE Toolkit server setup (MCP connection, not bioinformatics tools)
- **motif-analysis**: Requires HOMER and MEME Suite from this installer
- **visualization-workflow**: Uses deeptools, pyGenomeTracks, and R visualization packages from this installer
- **single-cell-encode**: Uses Seurat, Signac, Scanpy from this installer
- **publication-trust**: Assess scientific integrity of publications before relying on their methods or findings

## Presenting Results

- Present installed tools as a checklist table: tool | version | status (installed/failed/skipped). Group by assay environment. Suggest: "Would you like to verify the installation by running a quick test on sample ENCODE data?"
- If any installation fails, provide the exact error and a targeted fix. Common fixes: update conda, set channel priority, install system dependencies.

## For the request: "$ARGUMENTS"