--- author: "Mikhail Dozmorov, Ph.D." institute: "https:/bit.ly/3Dgenomics" date: "January 22, 2021" output: xaringan::moon_reader: lib_dir: libs css: ["css/xaringan-themer.css", "css/xaringan-my.css"] nature: ratio: '16:9' highlightStyle: github highlightLines: true countIncrementalSlides: false --- ```{r xaringan-themer, include = FALSE} library(xaringanthemer) library(icons) mono_light( base_color = "midnightblue", header_font_google = google_font("Noto Sans"), text_font_google = google_font("Montserrat", "500", "500i"), code_font_google = google_font("Droid Mono"), link_color = "#8B1A1A", #firebrick4, "deepskyblue1" text_font_size = "28px", code_font_size = "26px" ) ``` class: center, middle # The genome in action ### Detecting and interpreting changes in the 3D genome organization [bit.ly/3Dgenomics](https://mdozmorov.github.io/Talk_3Dgenome/) Mikhail Dozmorov, Ph.D. Associate professor, Department of Biostatistics Virginia Commonwealth University
--- ## Overview I. Normalization and differential analysis of Hi-C data - HiCcompare - multiHiCcompare
]
]
---
## Biases in Hi-C data
- Hi-C data suffers from many biases: **sequence-driven** (e.g., mappability, CG content) & **technology-driven** (e.g., type of restriction enzyme, sequencing platform)
- Most normalization methods work only on individual Hi-C dataset, one at a time
- Individual normalization methods do not perform well when the goal is comparison
---
class: center, middle
# HiCcompare
Joint normalization and differential analysis of Hi-C data
---
## Joint Normalization on the MD plot
.pull-left[
- **MD plot** represents data from two Hi-C matrices on one plot
- Similar to the MA plot (Bland-Altman plot)
- X-axis: **Genomic Distance** (off-diagonal data slices)
- Y-axis: **Mean differences in interaction frequencies** (log2(IF2/IF1))
]
.pull-right[
]
---
## Loess regression
.pull-left[
- Differences between two datasets should be minimal (symmetric around M = 0, Y-axis)
- Local Regression – fit based on local subsets of the data
- Nonlinear, data-driven - creates a smooth curve through the data
- Can use loess fit to correct for bias
]
.pull-right[
.center[
]
]
---
## Joint Loess Normalization of Hi-C Data
.pull-left[
- Differences between two datasets should be minimal (symmetric around M = 0, Y-axis)
- Perform loess regression on the MD plot to calculate $f(D)$ - the predicted interaction frequency $IF$ value at distance $D$
.small[
- $log_2(\hat{IF_{1D}})=log_2(IF_{1D})-f(D)/2$
- $log_2(\hat{IF_{2D}})=log_2(IF_{2D})+f(D)/2$
- Average $IF$ for the pair remains unchanged
]
]
.pull-right[
.center[
]
]
.small[ Benchmarking study: Lyu, Hongqiang, Erhu Liu, and Zhifang Wu. “[Comparison of Normalization Methods for Hi-C Data](https://doi.org/10.2144/btn-2019-0105)” _BioTechniques_, October 7, 2019
]
---
## Difference detection
.pull-left[
- At each distance, take a set of M-values
- Convert M-values to Z-scores $Z_i=\frac{M_i-\hat{M}}{\sigma_M}$
- Z-scores are compared to standard normal distribution to obtain p-values
- FDR multiple testing correction applied on a per-distance basis
]
.pull-right[
.center[
]
]
---
## Difference Detection After Different Normalizations
.pull-left[
- Simulated 100x100 chromatin interaction matrices with 250 controlled changes applied
- ROC curves for different normalization techniques and fold changes
- Loess provides the most power to detect small fold changes
.small[ https://bioconductor.org/packages/HiCcompare/
Stansfield, John C., Kellen G. Cresswell, Vladimir I. Vladimirov, and Mikhail G. Dozmorov. “[HiCcompare: An R-Package for Joint Normalization and Comparison of HI-C Datasets](https://doi.org/10.1186/s12859-018-2288-x)” _BMC Bioinformatics_, December 2018 ]
]
.pull-right[
.center[
]
]
---
class: center, middle
# multiHiCcompare
Normalization and differential analysis of multiple Hi-C datasets
---
## Normalization: Multiple Hi-C datasets
**Cyclic loess** (Ballman et al. 2004) to jointly normalize multiple datasets - take each pair of datasets, normalize, repeat until convergence
1. Choose two out of the N total samples, then generate an MD plot
2. Fit a loess curve $f(D)$ to the MD plot
3. Subtract $f(D)/2$ from the first dataset and add $f(D)/2$ to the second
4. Repeat until all unique pairs have been compared
5. Repeat until convergence
]
---
## Differential analysis: Multiple Hi-C datasets
- Statistical framework of differential gene expression analysis can be adapted for differential analysis of IFs ([edgeR](https://bioconductor.org/packages/edgeR/) R package)
- **Exact test**
.small[
- For comparing 2 groups without other covariates
- Similar to Fisher's exact test
- Computes exact p-values by summing over all sums of counts that have a probability less than the probability under the null hypothesis
]
- **Generalized Linear Models**
.small[
- For more complex experiments, utilize the GLM framework
- The IF value for a pair of regions $g$ at distance $d$ from sample $i$ follows $NB(M_{di}*p_{dgj},\phi_{dg})$, where $M_{di}$ is the total number of reads in sample $i$ at distance $d$, $p_{dgj}$ is the proportion of interaction counts $g$ in sample $i$ from experimental condition $j$, $\phi_{dg}$ is the dispersion
- The vector of covariates $x_i$ can be linked with $\mu_{dgj}$ through a log-linear model $log(\mu_{dgj}) = x_i^T\beta_{dg} + log(M_{dj})$
]
---
## Analysis overview: Multiple Hi-C datasets
.center[
]
.small[
Benchmarking study: Zheng, Ye, Peigen Zhou, and Sündüz Keleş. “[FreeHi-C Spike-in Simulations for Benchmarking Differential Chromatin Interaction Detection](https://doi.org/10.1016/j.ymeth.2020.07.001)” _Methods_, July 12, 2020
]
---
## Benchmarking
.center[
]
.small[ https://bioconductor.org/packages/multiHiCcompare/
Stansfield, John C, Kellen G Cresswell, and Mikhail G Dozmorov. “[MultiHiCcompare: Joint Normalization and Comparative Analysis of Complex Hi-C Experiments](https://doi.org/10.1093/bioinformatics/btz048).” Bioinformatics, January 22, 2019 ]
---
class: center, middle
# SpectralTAD
Detection of Topologically Associating Domains using Spectral Clustering
---
## Topologically Associating Domains
- TADs are 3D structures of frequently interacting regions
- Boundaries are associated with specific genomic features (CTCF, cohesin, mediator)
- Can be hierarchical (nested, TADs containing sub-TADs)
.center[
]
---
## Why are TADs Important?
- Established early in development
- Highly correlate with replication timing
- TADs create "autonomous gene-domains" partitioning the genome into discrete functional regions
- Disruptions of TADs lead to _de novo_ enhancer-promoter interactions and dysregulation of gene expression -> disease
TADs are a distinct layer of the 3D genome organization
---
## TAD detection using graph theory
.pull-left[
- Hi-C data has a natural graph structure, defined by vertices $V$ and edges $E$
- **Vertices** are genomic regions
- **Edges** represent interaction strength between any pair of regions
- Vertices and edges are stored in an **adjacency matrix** $A_{ij}$ where $ij$ is the number of edges between a given set of vertices $ij$
]
.pull-right[ .center[
]
]
---
## Step 2: Project on to Unit Circle
.center[
]
---
## Step 3: Find the k-largest gaps and partition
.center[
]
---
## Step 3: Find the k-largest gaps and partition
.center[
]
---
## Windowed Spectral Clustering
- We know the biologically maximum TAD size (2 million bp)
- We can use a 2 million bp sliding window to perform spectral clustering and aggregate
- Advantages of the sliding window
- Reduced cubic complexity of spectral clustering $O(n^{3})$ to linear complexity $O(n)$
- Naturally discards noisy interactions at large genomic distances
---
## SpectralTAD algorithm
1. Cut a window from the matrix equal to the maximum TAD size (2Mb)
2. Find the graph spectrum of the window and calculate eigenvector gaps
3. Find a set of clusters that maximize the silhouette score
4. Slide the window to the next group of loci and repeat
.center[
]
---
## Determining a hierarchy of TADs
- TADs are hierarchical in nature (organized into large meta-TADs with sub-TADs within them)
- To find sub-TADs, we use a novel metric called boundary score
- **Boundary score** is just the z-score for each eigenvector gap (Euclidean distance between adjacent points)
.center[
]
---
## Boundary score as a metric for TAD boundary detection
.center[
]
---
## Determining a hierarchy of TADs
For each initial TAD:
- Perform spectral clustering on the submatrix defined by the initial TAD
- Calculate the eigenvector gaps for each consecutive pair of regions
- Convert eigenvector gaps to boundary scores
- If any boundary score is greater than 1.96, this is a sub-TAD boundary
- Repeat for all sub-TADs until no z-score is greater than 1.96
---
## TAD Calling: SpectralTAD
- We compared SpectralTAD against four TAD callers:
- TopDom
- HiCSeg
- OnTAD
- rGMAP
- Good TAD caller must satisfy three criteria:
- Be robust to Hi-C data imperfections (resolution, sparsity, sequencing depth)
- Detect biologically significant, hierarchical TAD boundaries
- Be fast
---
## SpectralTAD is robust to resolution
.center[
]
---
## SpectralTAD is robust to sparsity
- 25 simulated matrices with pre-defined TADs (HiCToolsCompare)
- The percentage of the matrix replaced with zeros
- Jaccard similarity between the detected and pre-defined TADs
.center[
]
---
## SpectralTAD is robust to sequencing depth
- Downsample matrices uniformly at random
- Measure Jaccard similarity between TADs detected from the original and sparse data
.center[
]
---
## Hierarchical TAD boundaries differ
- Boundaries shared by two TADs (Level 2) or three TADs (Level 3) are more biologically significant
.center[
]
---
## SpectralTAD is fast
A) Runtimes for various TAD callers at different chromosome sizes
B) Runtimes for various TAD callers across all chromosomes (25kb data)
.center[
]
---
## SpectralTAD Package
- **Input**: three types of contact matrices ( $n \times n$, sparse and $n \times (n+3)$) in text format, import from `.hic` and `.cool` files supported
- Two main functions: `SpectralTAD` and `SpectralTAD_Par` (parallelized)
- **Output**: A 3-column BED file for each hierarchy level
- Visualization options include output for `Juicebox`
]
]
.pull-right[ .center[
] ]
.small[ Cresswell, Kellen G., and Mikhail G. Dozmorov. “[TADCompare: An R Package for Differential and Temporal Analysis of Topologically Associated Domains](https://doi.org/10.3389/fgene.2020.00158)” _Frontiers in Genetics_, March 10, 2020 ]
---
## Complex TAD boundary changes are frequent between biological replicates and cell/tissue types
.center[
]
---
## Distinct biology of differential TAD boundaries
- Non-differential boundaries are most enriched in CTCF and other boundary marks – conserved, biologically important
- Shifted boundaries are least enriched – shifts are likely due to noisy data
.center[
]
---
## Dynamics of the 3D genome over time course
.pull-left[
- Boundary score allows investigating changes in TAD boundaries over time course
- Six patterns of boundary changes over time
]
.pull-right[ .center[
] ]
---
## Distinct biology of different patterns of TAD boundary changes
- Auxin treatment experiment - eliminate TAD boundaries with auxin, wash auxin out, observe TAD boundaries recovery
- Early and Late appearing boundaries are most enriched in CTCF, RAD21
.center[
]
---
## Time course Hi-C data analysis
- Part of the TADcompare package
- Requires three or more time points
- Can handle replicates at each time point
]
---
## Analysis of PDX Hi-C data (PDX Hi-C)
.pull-left[
- Mouse reads minimally affect Hi-C data - direct alignment to the human genome works well
- Processing pipeline plays minimal role
- Technology is the most important for data quality
]
.pull-right[ .center[
] ]
.small[Dozmorov, Mikhail G, Katarzyna M Tyc, Nathan C Sheffield, David C Boyd, Amy L Olex, Jason Reed, and J Chuck Harrell. “[Chromatin Conformation Capture (Hi-C) Sequencing of Patient-Derived Xenografts: Analysis Guidelines](https://doi.org/10.1093/gigascience/giab022)” _GigaScience_ April 21, 2021]
---
## 3D genome of cancer metastasis and drug resistance
.center[
]
---
## Summary
- **Distance-centric view** of Hi-C data is critical (`MD plot`)
- **Joint loess normalization** effectively removes between-dataset biases ([HiCcompare](https://bioconductor.org/packages/HiCcompare/))
- **Differential analysis considering distance** has optimal performance ([multiHiCcompare](https://bioconductor.org/packages/multiHiCcompare/))
- **Spectral clustering** robustly detects biologically relevant hierarchical TADs ([SpectralTAD](https://bioconductor.org/packages/SpectralTAD/))
- **Boundary score** enables differential and time course analysis of TAD boundaries ([TADcompare](https://bioconductor.org/packages/TADCompare/))
.small[https://github.com/mdozmorov/HiC_tools]
---
## Acknowledgements
.pull-left[ .center[
]
.small[https://dozmorovlab.github.io/]
]
.pull-rignt[ ### Collaborators
- J. Chuck Harrell (VCU, Pathology)
- Ay lab (La Jolla Institute for Immunology)
### Funding
- American Cancer Society
- PhRMA Foundation
- The George and Lavinia Blick Research Fund
.center[
]
]
---
class: center, middle
# Thank you
[bit.ly/3Dgenomics](https://mdozmorov.github.io/Talk_3Dgenome/)