--- name: accessibility-aggregation description: Build comprehensive chromatin accessibility maps by aggregating ATAC-seq and DNase-seq narrowPeak data across multiple ENCODE experiments, donors, and labs. Use when the user wants to answer "where is chromatin accessible in my tissue?" by combining peak calls into a union peak set. Handles cross-lab variation, ATAC vs DNase platform differences, and ENCODE blocklist filtering. --- ## When to Use - User wants to combine ATAC-seq or DNase-seq peaks across multiple experiments for a tissue - User asks "where is chromatin accessible in my tissue?" or "build an open chromatin map" - User needs to merge accessibility data from different labs, donors, or platforms (ATAC vs DNase) - User wants a comprehensive set of open chromatin regions for regulatory element discovery - Example queries: "aggregate ATAC-seq peaks for pancreas", "combine DNase-seq across donors", "find all accessible regions in liver" # Aggregate Chromatin Accessibility Peaks Across Studies Build a comprehensive map of open chromatin for a tissue/cell type by merging ATAC-seq and/or DNase-seq narrowPeak files from multiple ENCODE experiments. ## Scientific Rationale **The question**: "Where is chromatin accessible in my tissue?" Like histone marks, chromatin accessibility is a **detection question**. An open chromatin region detected in one donor but not another is still a real accessible site — individual variation, sequencing depth, and technical factors explain absence. We want the **union of all detections**. ### ATAC-seq vs DNase-seq Both measure open chromatin but with different biases: | Property | ATAC-seq | DNase-seq | |----------|----------|-----------| | Method | Tn5 transposase insertion | DNase I hypersensitivity | | Input required | ~50K cells | ~1M cells | | Resolution | High | High | | GC bias | Moderate (Tn5 preference) | Low | | Mitochondrial reads | High (filter needed) | None | | ENCODE availability | Newer experiments | Extensive historical catalog | | Comparability | Generally comparable at open regions | | ### Literature Support - **Corces et al. 2017** (Nature Methods, 733 citations): Established that ATAC-seq and DNase-seq identify largely overlapping accessible regions, with ATAC capturing ~75% of DNase sites. Both are valid for union maps. - **ENCODE Blacklist** (Amemiya et al. 2019, Scientific Reports, 1,372 citations): Comprehensive set of problematic genomic regions to filter. Essential for all functional genomics analyses. [DOI](https://doi.org/10.1038/s41598-019-45839-z) - **F-Seq2** (Zhao & Boyle 2020, NAR Genomics): Improved peak caller for DNase-seq and ATAC-seq with proper test statistics for IDR compatibility. - **ENCODE Phase 3** (Gorkin et al. 2020, Nature, 301 citations): Integrated accessibility data with histone marks across tissues for chromatin state annotation. **Recommendation**: If combining ATAC-seq and DNase-seq peaks, treat them as equivalent signal sources for accessibility. The union is appropriate because both detect the same biological signal (open chromatin) through different enzymatic mechanisms. ## Step 1: Find All Available Accessibility Data ``` # ATAC-seq encode_search_experiments( assay_title="ATAC-seq", organ="pancreas", biosample_type="tissue", limit=100 ) # DNase-seq encode_search_experiments( assay_title="DNase-seq", organ="pancreas", biosample_type="tissue", limit=100 ) ``` Present a summary to the user: - Total ATAC-seq experiments - Total DNase-seq experiments - Labs represented - Whether to use one or both assay types ### Combining ATAC + DNase? Ask the user: - **Same assay only** (purest comparison, no cross-platform effects) - **Both assays combined** (maximum coverage, slight platform variation) For a comprehensive accessibility catalog, combining both is scientifically justified. ## Step 2: Quality-Gate Each Experiment ``` encode_get_experiment(accession="ENCSR...") ``` ### ATAC-seq Quality Checks - Audit status: no ERROR flags - Has IDR thresholded peaks - Low mitochondrial read fraction (ENCODE pipeline removes these) - Good TSS enrichment score - Nucleosome-free fragment enrichment visible ### DNase-seq Quality Checks - Audit status: no ERROR flags - Has Hotspot2 peaks or IDR thresholded peaks - Adequate sequencing depth (20M+ mapped reads) - Signal-to-noise ratio Track all included experiments: ``` encode_track_experiment(accession="ENCSR...") ``` ## Step 3: Download Peak Files For each experiment: ``` # ATAC-seq — IDR thresholded peaks encode_list_files( experiment_accession="ENCSR...", file_format="bed", output_type="IDR thresholded peaks", assembly="GRCh38" ) # DNase-seq — may use different output types encode_list_files( experiment_accession="ENCSR...", file_format="bed", output_type="peaks", assembly="GRCh38" ) ``` Prefer `preferred_default=True` files. ``` encode_download_files( file_accessions=["ENCFF...", ...], download_dir="/path/to/data/accessibility", organize_by="flat" ) ``` ## Step 4: Per-Sample Noise Filtering ### 4a. ENCODE Blocklist Filtering (Amemiya et al. 2019) ```bash # Download from: https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz bedtools intersect -a sample.narrowPeak -b hg38-blacklist.v2.bed -v > sample.filtered.narrowPeak ``` ### 4b. SignalValue Filtering (Perna et al. 2024) Same logic as histone aggregation — filter per-sample to top 75% by signalValue (column 7): ```bash # Per-sample: remove bottom 25% by signalValue (true distribution quantile) TOTAL=$(wc -l < sample.filtered.narrowPeak) LINE_25=$(echo "$TOTAL" | awk '{printf "%d", $1 * 0.25}') THRESHOLD=$(sort -k7,7n sample.filtered.narrowPeak | awk -v line="$LINE_25" 'NR==line{print $7}') awk -v t="$THRESHOLD" '$7 >= t' sample.filtered.narrowPeak > sample.qfiltered.narrowPeak ``` ### 4c. ATAC-specific: Remove Sub-nucleosomal Artifacts (optional) For ATAC-seq, very narrow peaks (<50bp) can be Tn5 insertion artifacts: ```bash awk '($3-$2) >= 50' sample.qfiltered.narrowPeak > sample.clean.narrowPeak ``` ## Step 5: Union Merge Accessibility peaks are **narrow/point-source** (like H3K4me3). Use default merge (overlap only, no gap tolerance). **CRITICAL**: Tag peaks by sample before concatenation to count unique SAMPLES, not overlapping peaks: ```bash # Tag each sample's peaks with a unique sample ID awk -v sid="atac_s1" 'BEGIN{OFS="\t"} {$4=sid; print}' atac_sample1.qfiltered.narrowPeak > atac_s1.tagged.bed awk -v sid="dnase_s1" 'BEGIN{OFS="\t"} {$4=sid; print}' dnase_sample1.qfiltered.narrowPeak > dnase_s1.tagged.bed # ... repeat for all samples # Concatenate all tagged peaks (ATAC + DNase combined or separate) cat *.tagged.bed > all_accessibility.bed # Sort bedtools sort -i all_accessibility.bed > all_accessibility.sorted.bed # Union merge — count UNIQUE SAMPLES (not peaks) bedtools merge \ -i all_accessibility.sorted.bed \ -c 4,7,9 \ -o count_distinct,max,max \ > union_accessible_regions.bed # Columns: chr, start, end, n_unique_samples, max_signalValue, max_qValue ``` ### If Tracking Assay Source To annotate whether peaks came from ATAC, DNase, or both: ```bash # Add assay tag to each peak before concatenation awk '{print $0"\tATAC"}' atac_peaks.bed > tagged.bed awk '{print $0"\tDNase"}' dnase_peaks.bed >> tagged.bed # After merge, use bedtools multiIntersect to track sources bedtools multiIntersect \ -i atac_sample1.bed atac_sample2.bed dnase_sample1.bed ... \ -header \ -names ATAC_1 ATAC_2 DNase_1 ... \ > multi_intersect.bed ``` ## Step 6: Confidence Annotation Same logic as histone aggregation. Given N total samples: | Confidence | Criteria | Interpretation | |-----------|----------|----------------| | **High** | ≥50% of samples | Constitutive accessible region | | **Supported** | 2+ samples | Likely real, some variation | | **Singleton** | 1 sample only | Keep — may be individual-specific or condition-specific | ```bash awk -v N=8 '{ if ($4 >= N*0.5) conf="HIGH"; else if ($4 >= 2) conf="SUPPORTED"; else conf="SINGLETON"; print $0"\t"conf"\t"$4"/"N }' union_accessible_regions.bed > union_accessible_regions.annotated.bed ``` ## Step 7: Log Provenance ``` encode_log_derived_file( file_path="/path/to/union_accessible_regions.annotated.bed", source_accessions=["ENCSR...", "ENCSR...", ...], description="Union chromatin accessibility peaks (ATAC-seq + DNase-seq) across N pancreas samples", file_type="aggregated_accessibility", tool_used="bedtools merge v2.31.0", parameters="blocklist filtered, signalValue >= 25th pctl per sample, ATAC min width 50bp, bedtools merge -d 0" ) ``` ## Step 8: Summary Statistics Report to the user: - Total input experiments: N (ATAC: X, DNase: Y) - Experiments passing QC: M - Total peaks before merge: X - Union peaks after merge: Y - High-confidence regions: Z (≥50% support) - Supported regions: W (2+ support) - Singleton regions: V (1 sample only) - Genome coverage: bp covered / total genome - Overlap between ATAC-only and DNase-only peaks (if both assays used) ## Pitfalls Specific to Accessibility Data 1. **ATAC mitochondrial reads**: ENCODE pipeline removes these, but verify in QC metrics. High mitochondrial fraction indicates poor nuclear chromatin enrichment. 2. **Tn5 sequence bias**: ATAC-seq Tn5 has mild sequence preference. For union maps this is acceptable — bias affects peak *intensity*, not *presence*. 3. **DNase hypersensitivity saturation**: Deeply sequenced DNase-seq detects more sites. Shallowly sequenced samples contribute fewer peaks but are not wrong — they just miss weaker sites. 4. **Promoter enrichment**: Both assays are enriched at promoters. When comparing accessibility across tissues, note that promoter accessibility is largely constitutive while enhancer accessibility is tissue-specific. 5. **Cell-type heterogeneity in tissue samples**: Bulk ATAC/DNase from tissue captures accessibility across ALL cell types. A peak may represent a minor cell population. This is correct for a tissue-level map but important to note. 6. **Do NOT mix assemblies**: All files must be GRCh38 or all hg19. Use `encode_compare_experiments` to verify. 7. **Peak summits lost after merge**: NarrowPeak column 10 (summit offset) is discarded by `bedtools merge`. If you need summits for motif analysis, extract them before merging and map back afterward. 8. **CUT&RUN/CUT&Tag accessibility data**: If ENCODE adds CUT&RUN-based accessibility data in the future, apply the CUT&RUN suspect list (Nordin et al. 2023, Genome Biology) in addition to the ENCODE blacklist. ## Walkthrough: Building a Pan-Donor Accessibility Map for Brain Cortex **Goal**: Merge ATAC-seq peaks from 4 brain cortex experiments into a union accessibility map. **Context**: User needs comprehensive open chromatin regions for regulatory element discovery. ### Step 1: Search for brain ATAC-seq experiments ``` encode_search_experiments( assay_title="ATAC-seq", organ="brain" ) ``` Expected output: ```json { "total": 24, "experiments": [ {"accession": "ENCSR001BRN", "biosample_summary": "brain cortex tissue male adult (53 years)"}, {"accession": "ENCSR002BRN", "biosample_summary": "brain cortex tissue female adult (49 years)"} ] } ``` ### Step 2: Download narrowPeak files ``` encode_search_files( assay_title="ATAC-seq", organ="brain", file_format="bed", output_type="IDR thresholded peaks", assembly="GRCh38" ) ``` Expected output: ```json { "total": 8, "files": [{"accession": "ENCFF001ATQ", "output_type": "IDR thresholded peaks", "assembly": "GRCh38"}] } ``` ### Step 3: Merge into union peak set ```bash cat *.narrowPeak | sort -k1,1 -k2,2n | bedtools merge -i - -c 4,5 -o count,mean > union_atac_brain.bed ``` **Interpretation**: Union peaks represent all genomic positions where chromatin is accessible in brain cortex. Peaks found in all 4 donors are constitutive regulatory elements. ## Code Examples ### 1. Find accessibility data for aggregation ``` encode_get_facets(organ="pancreas", assay_title="ATAC-seq") ``` Expected output: ```json { "facets": {"biosample_term_name": {"pancreas": 4, "pancreatic islet": 3}, "status": {"released": 6}} } ``` ## Integration | This skill produces... | Feed into... | Using tool/skill | |---|---|---| | Union open chromatin map (BED) | Enhancer identification | regulatory-elements skill | | Accessible regions for motif analysis | TF motif discovery | motif-analysis skill | | Tissue accessibility catalog | Cross-tissue comparison | compare-biosamples skill | | Open chromatin at variant sites | Variant functional annotation | variant-annotation skill | | Accessible peak coordinates | Visualization signal anchors | visualization-workflow skill | ## Related Skills - **histone-aggregation**: Same union approach for histone ChIP-seq narrowPeak data - **methylation-aggregation**: Different approach (averaging) for continuous methylation signal; HMRs + accessibility peaks mark active regulatory elements - **hic-aggregation**: Union approach for BEDPE chromatin loops; loops often anchor at accessible regions - **regulatory-elements**: Use union accessibility maps to define active regulatory elements with histone mark combinations - **motif-analysis**: Find enriched TF motifs in accessible regions using HOMER and MEME - **pipeline-atacseq**: Process raw ATAC-seq data through the full ENCODE-aligned pipeline - **batch-analysis**: Batch processing workflows for systematic accessibility aggregation - **publication-trust**: Verify literature claims backing analytical decisions ## Presenting Results - Present merged accessibility regions as: chr | start | end | assay_type | sample_count. Show ATAC vs DNase contribution. Suggest: "Would you like to run motif analysis on these accessible regions?" ## For the request: "$ARGUMENTS"