---
name: hic-aggregation
description: Build comprehensive chromatin contact maps by aggregating Hi-C loop calls (BEDPE) across multiple ENCODE experiments, donors, and labs. Use when the user wants to answer "what regions are in 3D contact in my tissue?" by creating a union catalog of chromatin loops. Handles resolution-aware anchor matching, cross-lab variation, and Hi-C-specific quality metrics.
---

# Aggregate Hi-C Chromatin Contacts Across Studies

## When to Use

- User wants to build a comprehensive catalog of chromatin loops from multiple Hi-C experiments
- User asks "what regions are in 3D contact in my tissue?" or "aggregate loop calls across donors"
- User needs a union catalog of BEDPE loops with resolution-aware anchor matching
- User wants to identify high-confidence loops supported by multiple experiments
- Example queries: "aggregate Hi-C loops for K562", "combine chromatin contacts across labs", "find consensus TAD boundaries in liver"

Build a comprehensive catalog of chromatin loops for a tissue/cell type by merging BEDPE loop calls from multiple ENCODE Hi-C experiments.

## Scientific Rationale

**The question**: "What regions are in 3D physical contact in my tissue?"

Like histone marks and accessibility, chromatin loops are a **detection question**. If a loop between Region A and Region B is detected in one donor but not another, the contact is still real — individual variation, sequencing depth, and computational resolution explain absence. We want the **union of all detected contacts**.

### Key Concepts

**Hi-C data** measures pairwise chromatin interactions genome-wide. After processing:
- **Contact matrix** (`.hic` file): Genome-wide interaction frequencies at multiple resolutions
- **Loop calls** (BEDPE): Statistically significant point interactions (loops) identified by algorithms like HICCUPS or Juicer
- **TAD boundaries**: Topologically associating domain boundaries
- **Compartments**: A/B compartment assignments

**BEDPE format** (Paired-End BED):
```
chr1  start1  end1  chr2  start2  end2  name  score  strand1  strand2
```
Each row represents a contact between two genomic anchor regions.

### Literature Support
- **Loop Catalog** (Reyna et al. 2025, Nucleic Acids Research): Created a union catalog of 4.19M unique loops across 1,089 Hi-C datasets. Demonstrated that union approach captures tissue-specific and constitutive loops. Used resolution-aware merging at 5kb, 10kb, and 25kb bins.
- **AQuA Tools** (Chakraborty et al. 2025): Toolkit for BEDPE intersection, union, and annotation. Handles paired-region arithmetic.
- **mariner** (Flores et al. 2024, Bioinformatics): R/Bioconductor package for BEDPE manipulation including merging loops across experiments with configurable anchor tolerance.
- **ENCODE Phase 3** (Gorkin et al. 2020, Nature, 301 citations): Integrated Hi-C data across tissues to define regulatory loops connecting enhancers to promoters.
- **ENCODE Blacklist** (Amemiya et al. 2019, Scientific Reports, 1,372 citations): Problematic genomic regions to filter from loop anchors. [DOI](https://doi.org/10.1038/s41598-019-45839-z)
- **Mustache** (Roayaei Ardakany et al. 2020, Genome Biology, 165 citations): Multi-scale loop caller that recovers more validated loops than HICCUPS. Different callers produce discordant loop sets.
- **Wolff et al. 2022** (GigaScience): Benchmark showing loop callers intersect by **~50% at most** — critical context for why union approach is necessary.

## Step 1: Find All Available Hi-C Data

```
encode_search_experiments(
    assay_title="Hi-C",
    organ="pancreas",           # user's tissue of interest
    biosample_type="tissue",
    limit=100
)
```

Present a summary to the user:
- Total Hi-C experiments
- Labs represented
- Unique donors/biosamples
- Resolution(s) available (check experiment metadata)

Use `encode_get_facets` to check availability:
```
encode_get_facets(assay_title="Hi-C", organ="pancreas")
```

**Note**: Hi-C data is computationally expensive to produce, so there are typically fewer experiments per tissue than ChIP-seq or ATAC-seq. Even 2-3 experiments can be valuable for union catalogs.

## Step 2: Quality-Gate Each Experiment

```
encode_get_experiment(accession="ENCSR...")
```

### Hi-C Quality Checks
- Audit status: no ERROR flags
- **Sequencing depth**: 400M+ valid read pairs for loop calling (ENCODE standard)
- **Cis/trans ratio**: >60% cis contacts expected (low cis suggests noisy library)
- **Hi-C-specific QC**: Library complexity, PCR duplicate rate
- Has loop calls (BEDPE output) — not all Hi-C experiments have called loops
- Resolution: at least 5-10kb resolution for loop detection

### Include if:
- Has BEDPE loop calls at consistent resolution
- Passes ENCODE audit (no ERROR flags)
- Adequate sequencing depth for loop resolution

### Exclude if:
- ERROR audit flags
- Only contact matrices without loop calls
- Very low sequencing depth (<200M valid pairs — insufficient for loop calling)

Track all included experiments:
```
encode_track_experiment(accession="ENCSR...")
```

## Step 3: Download Loop Call Files

For each experiment, get BEDPE loop calls:

```
# Search for loop/interaction files
encode_list_files(
    experiment_accession="ENCSR...",
    file_format="bedpe",
    assembly="GRCh38"
)

# Also check for BED-formatted loop files
encode_list_files(
    experiment_accession="ENCSR...",
    output_type="chromatin interactions",
    assembly="GRCh38"
)

# Or contact domains
encode_list_files(
    experiment_accession="ENCSR...",
    output_type="contact domains",
    assembly="GRCh38"
)
```

**File selection priority:**
1. **Chromatin interactions** (loop calls from HICCUPS or similar)
2. **Contact domains** (TADs — different analysis, handle separately)
3. **Replicated loops** (if available)

Prefer `preferred_default=True` files when available.

```
encode_download_files(
    file_accessions=["ENCFF...", ...],
    download_dir="/path/to/data/hic_loops",
    organize_by="flat"
)
```

## Step 4: Understanding Hi-C Resolution and Anchors

### Critical: Resolution-Aware Processing

Hi-C loop anchors are **binned regions**, not precise positions. The resolution determines anchor size:

| Resolution | Anchor Width | Best For | Typical Loop Count |
|-----------|-------------|---------|-------------------|
| 5 kb | 5,000 bp | Fine-scale promoter-enhancer loops | More loops |
| 10 kb | 10,000 bp | Standard analysis | Moderate |
| 25 kb | 25,000 bp | Large-scale domain contacts | Fewer loops |

**All loops being merged must be at the same resolution**, or anchors must be harmonized to a common resolution.

### Harmonizing Resolution
If experiments have loops called at different resolutions:
```bash
# Expand 5kb anchors to 10kb resolution
awk -v res=10000 'BEGIN{OFS="\t"} {
    # Bin anchor 1
    bin1_start = int($2/res) * res
    bin1_end = bin1_start + res
    # Bin anchor 2
    bin2_start = int($5/res) * res
    bin2_end = bin2_start + res
    print $1, bin1_start, bin1_end, $4, bin2_start, bin2_end, $7, $8, $9, $10
}' fine_res_loops.bedpe > harmonized_loops.bedpe
```

## Step 5: Per-Sample Filtering

### 5a. ENCODE Blocklist Filtering (Amemiya et al. 2019)
Remove loops with anchors in artifact-prone regions (download from https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz):
```bash
# Filter loops where EITHER anchor overlaps a blocklist region
# First, extract anchor 1 and anchor 2 as separate BED files
awk 'BEGIN{OFS="\t"} {print $1,$2,$3,NR}' sample.bedpe > anchors1.bed
awk 'BEGIN{OFS="\t"} {print $4,$5,$6,NR}' sample.bedpe > anchors2.bed

# Find anchor rows NOT in blocklist
bedtools intersect -a anchors1.bed -b ENCODE_blocklist.bed -v | cut -f4 > clean_rows_1.txt
bedtools intersect -a anchors2.bed -b ENCODE_blocklist.bed -v | cut -f4 > clean_rows_2.txt

# Keep only rows where BOTH anchors pass
comm -12 <(sort clean_rows_1.txt) <(sort clean_rows_2.txt) > clean_rows.txt
awk 'NR==FNR{a[$1];next} FNR in a' clean_rows.txt sample.bedpe > sample.filtered.bedpe
```

### 5b. Score Filtering
Filter by interaction score/significance:
```bash
# If BEDPE has a score column (col 8), filter to significant interactions
# Keep top 75% by score (true distribution quantile, not range-based)
TOTAL=$(wc -l < sample.filtered.bedpe)
LINE_25=$(echo "$TOTAL" | awk '{printf "%d", $1 * 0.25}')
THRESHOLD=$(sort -k8,8n sample.filtered.bedpe | awk -v line="$LINE_25" 'NR==line{print $8}')
awk -v t="$THRESHOLD" '$8 >= t' sample.filtered.bedpe > sample.qfiltered.bedpe
```

### 5c. Remove Self-Ligation Artifacts
Loops where both anchors are very close are likely artifacts:
```bash
# Remove loops where anchors are on same chromosome and < 20kb apart
awk '{
    if ($1 != $4) print $0;  # inter-chromosomal: keep (rare but real)
    else if (($5 - $3) >= 20000) print $0;  # > 20kb apart: keep
}' sample.qfiltered.bedpe > sample.clean.bedpe
```

## Step 6: Union Merge of Loops

### The Paired-Region Matching Problem

Unlike peaks (single regions), loops are **pairs of regions**. Two loops match if **both anchors overlap**:

```
Loop 1:  [anchor1A]--------[anchor1B]
Loop 2:    [anchor2A]------[anchor2B]
```

These should merge if anchor1A overlaps anchor2A AND anchor1B overlaps anchor2B.

### Method A: bedtools pairToPair (Recommended for simple union)

```bash
# Concatenate all filtered loops
cat sample1.clean.bedpe sample2.clean.bedpe ... > all_loops.bedpe

# Sort by anchor 1 coordinates
sort -k1,1 -k2,2n -k4,4 -k5,5n all_loops.bedpe > all_loops.sorted.bedpe

# Use a custom merge approach:
# 1. Bin anchors to resolution, creating a loop ID
# 2. Group by loop ID
# 3. Count support

awk -v res=10000 'BEGIN{OFS="\t"} {
    # Create binned anchor coordinates as loop identifier
    a1_bin = $1 ":" int($2/res)*res
    a2_bin = $4 ":" int($5/res)*res
    # Canonical order (smaller coordinate first) to handle orientation
    if (a1_bin < a2_bin) loop_id = a1_bin "-" a2_bin
    else loop_id = a2_bin "-" a1_bin
    print loop_id, $0
}' all_loops.sorted.bedpe | \
sort -k1,1 | \
awk 'BEGIN{OFS="\t"} {
    if ($1 != prev_id) {
        if (NR > 1) print chr1, start1, end1, chr2, start2, end2, count, max_score
        prev_id = $1
        chr1=$2; start1=$3; end1=$4; chr2=$5; start2=$6; end2=$7
        count = 1; max_score = $9
    } else {
        count++
        if ($9 > max_score) max_score = $9
        # Expand anchors to encompass all overlapping calls
        if ($3 < start1) start1 = $3
        if ($4 > end1) end1 = $4
        if ($6 < start2) start2 = $6
        if ($7 > end2) end2 = $7
    }
} END {
    print chr1, start1, end1, chr2, start2, end2, count, max_score
}' > union_loops.bedpe
```

### Method B: Resolution-Binned Approach (Loop Catalog method)

Following the Loop Catalog (Reyna et al. 2025) approach:

```bash
# Bin all loop anchors to a fixed resolution
awk -v res=10000 'BEGIN{OFS="\t"} {
    a1_start = int($2/res) * res
    a1_end = a1_start + res
    a2_start = int($5/res) * res
    a2_end = a2_start + res
    # Canonical ordering
    if ($1 < $4 || ($1 == $4 && a1_start <= a2_start))
        print $1, a1_start, a1_end, $4, a2_start, a2_end
    else
        print $4, a2_start, a2_end, $1, a1_start, a1_end
}' all_loops.sorted.bedpe | \
sort -u | \
sort -k1,1 -k2,2n -k4,4 -k5,5n | \
uniq -c | \
awk 'BEGIN{OFS="\t"} {print $2,$3,$4,$5,$6,$7,$1}' > union_loops_binned.bedpe
# Columns: chr1, start1, end1, chr2, start2, end2, n_supporting_samples
```

### Method C: Using Specialized Tools

**mariner** (R/Bioconductor):
```r
library(mariner)
# Read BEDPE files as GInteractions
loops <- lapply(bedpe_files, read.table)
# Convert to GInteractions and merge
gi <- as_ginteractions(loops)
merged <- mergePairs(gi, radius = 10000)  # 10kb tolerance
```

**AQuA Tools** (Python):
```python
# BEDPE union with anchor overlap tolerance
aqua bedpe-union -i sample1.bedpe sample2.bedpe -o union.bedpe --slop 5000
```

## Step 7: Confidence Annotation

Given N total experiments:

| Confidence | Criteria | Interpretation |
|-----------|----------|----------------|
| **High** | Detected in >=50% of samples | Constitutive loop, present across individuals |
| **Supported** | Detected in 2+ samples | Likely real, some variation |
| **Singleton** | Detected in 1 sample only | May be individual-specific or depth-dependent |

```bash
awk -v N=4 '{
    if ($7 >= N*0.5) conf="HIGH";
    else if ($7 >= 2) conf="SUPPORTED";
    else conf="SINGLETON";
    print $0"\t"conf"\t"$7"/"N
}' union_loops_binned.bedpe > union_loops.annotated.bedpe
```

**Context for singletons**: Hi-C loop detection is very sensitive to sequencing depth. Many singletons may simply be under-powered in other samples rather than biologically absent. The Loop Catalog found that a union approach captures ~3x more loops than any individual experiment.

## Step 8: Separate Analysis for TADs and Compartments

**TAD boundaries** and **A/B compartments** require different aggregation than loops:

### TAD Boundaries
TAD boundaries are single genomic positions. Aggregate like narrow peaks:
```bash
# Extract TAD boundary BED from contact domain files
# Each boundary is a narrow region
cat tad_boundaries_sample*.bed | \
bedtools sort -i - | \
bedtools merge -i - -d 40000 -c 1 -o count > union_tad_boundaries.bed
# 40kb gap tolerance because TAD boundaries are resolution-dependent
```

### A/B Compartments
Compartment calls (eigenvector sign at each bin) should be aggregated by majority vote:
```bash
# For each resolution bin, assign A or B based on majority of samples
# This is more complex and typically done in R/Python
```

## Step 9: Log Provenance

```
encode_log_derived_file(
    file_path="/path/to/union_loops.annotated.bedpe",
    source_accessions=["ENCSR...", "ENCSR...", ...],
    description="Union chromatin loops across N pancreas Hi-C experiments",
    file_type="aggregated_loops",
    tool_used="bedtools + custom merge at 10kb resolution",
    parameters="blocklist filtered, score >= 25th pctl, self-ligation >= 20kb removed, 10kb resolution binning"
)
```

## Step 10: Summary Statistics

Report to the user:
- Total input experiments: N
- Experiments passing QC: M
- Resolution used: Xkb
- Total loops before merge: X
- Union loops after merge: Y
- High-confidence loops: Z (≥50% support)
- Supported loops: W (2+ support)
- Singleton loops: V (1 sample only)
- Distance distribution: median and range of loop sizes (anchor-to-anchor)
- Inter-chromosomal loops: count (expect very few)

## Pitfalls Specific to Hi-C Data

1. **Resolution mismatch**: Loop calls at 5kb vs 25kb resolution will have very different anchor sizes. Always harmonize to a common resolution before merging.

2. **Sequencing depth sensitivity**: Loop calling requires deep sequencing (400M+ valid pairs). Shallowly sequenced experiments will call far fewer loops — this is under-detection, not absence.

3. **Algorithm differences are LARGE**: Wolff et al. 2022 (GigaScience) found that HICCUPS, Mustache, Fit-Hi-C, and HiCExplorer loop callers **intersect by ~50% at most**. Mustache tends to recover more validated loops (Roayaei Ardakany et al. 2020). If mixing callers, note this in provenance — and this discordance is itself a reason to prefer the union approach.

4. **Orientation matters**: BEDPE anchors should be canonically ordered (anchor1 < anchor2 by genomic coordinate) before merging to avoid duplicate counting.

5. **Inter-chromosomal contacts**: These are rare but real. Handle separately — they cannot be distance-filtered.

6. **Distance distribution**: Most loops are 100kb-2Mb. Very short-range contacts (<20kb) are often noise from undigested chromatin. Very long-range (>10Mb) are rare.

7. **Do NOT mix assemblies**: All files must be GRCh38 or all hg19. Hi-C resolution binning makes liftOver of loops particularly error-prone.

8. **TADs vs loops**: These are different features. TADs are domains (regions), loops are point contacts (pairs). Do not mix them in the same union.

9. **Micro-C as complement**: Micro-C achieves higher resolution than Hi-C and can detect sub-TAD loops. Treat Micro-C loops as compatible with Hi-C loops in a union (Mustache works on both).

## Walkthrough: Building a Cross-Tissue Loop Catalog for the MYC Locus

**Goal**: Aggregate Hi-C chromatin loops across tissues to identify conserved and tissue-specific 3D contacts at the MYC gene locus.
**Context**: Cancer research — MYC is regulated by distal enhancers via chromatin looping.

### Step 1: Find Hi-C experiments across tissues

```
encode_search_experiments(assay_title="Hi-C", organism="Homo sapiens", limit=50)
```

Expected output:
```json
{
  "total": 89,
  "results": [
    {"accession": "ENCSR000AKA", "assay_title": "Hi-C", "biosample_summary": "GM12878", "status": "released"},
    {"accession": "ENCSR489OCU", "assay_title": "Hi-C", "biosample_summary": "K562", "status": "released"},
    {"accession": "ENCSR382RFU", "assay_title": "Hi-C", "biosample_summary": "liver", "status": "released"}
  ]
}
```

**Interpretation**: 89 Hi-C experiments available. Select 5–10 spanning diverse tissue types for cross-tissue comparison.

### Step 2: List loop files for each experiment

```
encode_list_files(accession="ENCSR000AKA", file_format="bedpe", assembly="GRCh38")
```

Expected output:
```json
{
  "files": [
    {"accession": "ENCFF001ABC", "output_type": "contact domains", "file_format": "bedpe", "file_size_mb": 2.4},
    {"accession": "ENCFF002DEF", "output_type": "chromatin interactions", "file_format": "bedpe", "file_size_mb": 1.8}
  ]
}
```

**Interpretation**: Use "chromatin interactions" files for loop aggregation. Contact domains are TADs, not loops.

### Step 3: Download loop files

```
encode_download_files(accessions=["ENCFF002DEF", "ENCFF003GHI", "ENCFF004JKL"], download_dir="/data/hic_loops")
```

Expected output:
```json
{
  "downloaded": 3,
  "total_size_mb": 5.6,
  "md5_verified": true,
  "files": ["/data/hic_loops/ENCFF002DEF.bedpe", "/data/hic_loops/ENCFF003GHI.bedpe", "/data/hic_loops/ENCFF004JKL.bedpe"]
}
```

### Step 4: Aggregate loops with resolution-aware anchor matching

Apply union merge across tissues:
- Expand loop anchors by ±resolution (e.g., ±5kb for 5kb resolution data)
- Merge overlapping anchors using bedtools pairToPair
- Assign tissue support counts to each union loop
- Filter: require ≥2 tissue support for conserved loops

### Step 5: Filter to MYC locus

```bash
# MYC locus: chr8:127,700,000-128,000,000
awk '$1=="chr8" && $2>=127700000 && $3<=128000000' union_loops.bedpe > myc_loops.bedpe
```

**Interpretation**: Loops anchored at the MYC promoter connecting to distal enhancers. Conserved loops (≥3 tissues) likely represent fundamental regulatory architecture; tissue-specific loops may drive context-dependent MYC activation.

### Integration with downstream skills
- Feed loop anchors into → **peak-annotation** for gene assignment at anchor regions
- Overlay with → **histone-aggregation** H3K27ac peaks to identify active enhancer-promoter loops
- Cross-reference loop-disrupting variants via → **variant-annotation**
- Visualize in → **ucsc-browser** as interaction tracks

## Code Examples

### 1. Survey available Hi-C data by tissue
```
encode_get_facets(assay_title="Hi-C", facet_field="organ", organism="Homo sapiens")
```

Expected output:
```json
{
  "facets": {
    "organ": {"brain": 24, "heart": 12, "liver": 8, "lung": 6, "kidney": 4, "blood": 15}
  }
}
```

### 2. Get details for a specific Hi-C experiment
```
encode_get_experiment(accession="ENCSR000AKA")
```

Expected output:
```json
{
  "accession": "ENCSR000AKA",
  "assay_title": "Hi-C",
  "biosample_summary": "GM12878",
  "replicates": 2,
  "status": "released",
  "lab": "/labs/erez-lieberman-aiden/",
  "audit": {"WARNING": 1, "ERROR": 0}
}
```

### 3. Compare loop sets between two cell types
```
encode_compare_experiments(accession_1="ENCSR000AKA", accession_2="ENCSR489OCU")
```

Expected output:
```json
{
  "comparison": {
    "shared": {"assay": "Hi-C", "organism": "Homo sapiens", "assembly": "GRCh38"},
    "differences": {
      "biosample": ["GM12878", "K562"],
      "lab": ["/labs/erez-lieberman-aiden/", "/labs/erez-lieberman-aiden/"]
    }
  }
}
```

## Integration

| This skill produces... | Feed into... | Purpose |
|---|---|---|
| Union loop catalog (BEDPE) | **peak-annotation** | Assign genes to loop anchors |
| Conserved loop coordinates | **histone-aggregation** | Overlay H3K27ac at anchors to find active enhancer-promoter loops |
| Tissue-specific loops | **accessibility-aggregation** | Check if loop anchors overlap open chromatin |
| Loop anchor BED intervals | **variant-annotation** | Find GWAS/clinical variants disrupting loop anchors |
| Loop anchor coordinates | **liftover-coordinates** | Convert hg19 loops to GRCh38 |
| Aggregated loop statistics | **visualization-workflow** | Generate loop frequency heatmaps |
| Loop-gene assignments | **disease-research** | Connect loop disruptions to disease phenotypes |

## Related Skills

- **histone-aggregation**: Loop anchors often overlap with H3K27ac/H3K4me1 peaks — integrate with histone union sets to annotate loop function
- **accessibility-aggregation**: Loop anchors frequently coincide with accessible chromatin — validate loops by requiring anchor accessibility
- **regulatory-elements**: Use loops to connect distal enhancers (H3K27ac) to target promoters (H3K4me3)
- **epigenome-profiling**: Loops add 3D context to 1D chromatin state maps
- **pipeline-hic**: Process raw Hi-C data through the full ENCODE-aligned pipeline
- **batch-analysis**: Batch processing workflows for systematic Hi-C loop aggregation
- **publication-trust**: Verify literature claims backing analytical decisions

## Presenting Results

- Present aggregated loops as: chr | anchor1_start | anchor1_end | anchor2_start | anchor2_end | sample_count | resolution. Show loop statistics. Suggest: "Would you like to check if any GWAS variants overlap loop anchors?"

## For the request: "$ARGUMENTS"