--- author: "Mikhail Dozmorov, Ph.D." institute: "https:/bit.ly/3Dgenomics" date: "January 22, 2021" output: xaringan::moon_reader: lib_dir: libs css: ["css/xaringan-themer.css", "css/xaringan-my.css"] nature: ratio: '16:9' highlightStyle: github highlightLines: true countIncrementalSlides: false --- ```{r xaringan-themer, include = FALSE} library(xaringanthemer) library(icons) mono_light( base_color = "midnightblue", header_font_google = google_font("Noto Sans"), text_font_google = google_font("Montserrat", "500", "500i"), code_font_google = google_font("Droid Mono"), link_color = "#8B1A1A", #firebrick4, "deepskyblue1" text_font_size = "28px", code_font_size = "26px" ) ``` class: center, middle # The genome in action ### Detecting and interpreting changes in the 3D genome organization [bit.ly/3Dgenomics](https://mdozmorov.github.io/Talk_3Dgenome/) Mikhail Dozmorov, Ph.D. Associate professor, Department of Biostatistics Virginia Commonwealth University --- ## Overview I. Normalization and differential analysis of Hi-C data - HiCcompare - multiHiCcompare
II. Detection and differential analysis of Topologically Associating Domains (TADs) - SpectralTAD - TADcompare --- ## The 3D structure of the genome - Human genome is big - ~3.2 billion base pairs - ~4 meters (~12ft) of diploid genome is packed into ~10um nucleus - ~800 trips from Earth to Sun in ~30T cells from the human body .center[

]

Human body has approximately 30 trillion human cells (excluding trillions of microbiome cells); Stretched haploid genome would be roughly 2 meters - each cell has 4 meters of DNA (1 m = 3.28 ft); 30 trillion * 4 meters = 120 trillion meters; Convert to miles: 120 trillion meters / 1609.34 = 7.45*10^{10}; Convert to Earth-Sun distance: 7.45*10^{10} / 91.43*10^6 = 814.83

--- ##The 3D genome is not static - The 3D structure of the genome plays a role in many diseases - Changes in the 3D genome organization are an ~~emerging~~ established hallmark of cancer - Disruption of 3D genomic structure can lead to rewiring of enhancer-promoter interactions and changes in gene expression -> disease --- ## Chromatin conformation capture technologies .pull-left[ - 3C, 4C, 5C, Hi-C - Capture-(Hi)C, ChIA-PET - Single-cell variants - SPRITE, GAM, ChIA-Drop - Specialized (e.g., Methyl-HiC) ] .pull-right[

] .small[ Lieberman-Aiden, Erez et al. “[Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome](https://doi.org/10.1126/science.1181369)” _Science_, October 9, 2009 ] --- ## Hi-C Data as a matrix .pull-left[ - The genome (chromosome) is split into equally sized regions - Data is represented by a symmetric matrix of contacts $C_{ij}$ where entry $ij$ corresponds to the number of times region $i$ comes into contact with region $j$ - Off-diagonal data view - increasing **distance** between interacting regions - Power-law decay of interactions with increasing **distance** ] .pull-right[

] --- ## Biases in Hi-C data - Hi-C data suffers from many biases: **sequence-driven** (e.g., mappability, CG content) & **technology-driven** (e.g., type of restriction enzyme, sequencing platform) - Most normalization methods work only on individual Hi-C dataset, one at a time - Individual normalization methods do not perform well when the goal is comparison --- class: center, middle # HiCcompare Joint normalization and differential analysis of Hi-C data --- ## Joint Normalization on the MD plot .pull-left[ - **MD plot** represents data from two Hi-C matrices on one plot - Similar to the MA plot (Bland-Altman plot) - X-axis: **Genomic Distance** (off-diagonal data slices) - Y-axis: **Mean differences in interaction frequencies** (log2(IF2/IF1)) ] .pull-right[

] --- ## Loess regression .pull-left[ - Differences between two datasets should be minimal (symmetric around M = 0, Y-axis) - Local Regression – fit based on local subsets of the data - Nonlinear, data-driven - creates a smooth curve through the data - Can use loess fit to correct for bias ] .pull-right[ .center[

] ] --- ## Joint Loess Normalization of Hi-C Data .pull-left[ - Differences between two datasets should be minimal (symmetric around M = 0, Y-axis) - Perform loess regression on the MD plot to calculate $f(D)$ - the predicted interaction frequency $IF$ value at distance $D$ .small[ - $log_2(\hat{IF_{1D}})=log_2(IF_{1D})-f(D)/2$ - $log_2(\hat{IF_{2D}})=log_2(IF_{2D})+f(D)/2$ - Average $IF$ for the pair remains unchanged ] ] .pull-right[ .center[

] ] .small[ Benchmarking study: Lyu, Hongqiang, Erhu Liu, and Zhifang Wu. “[Comparison of Normalization Methods for Hi-C Data](https://doi.org/10.2144/btn-2019-0105)” _BioTechniques_, October 7, 2019 ] --- ## Difference detection .pull-left[ - At each distance, take a set of M-values - Convert M-values to Z-scores $Z_i=\frac{M_i-\hat{M}}{\sigma_M}$ - Z-scores are compared to standard normal distribution to obtain p-values - FDR multiple testing correction applied on a per-distance basis ] .pull-right[ .center[

] ] --- ## Difference Detection After Different Normalizations .pull-left[ - Simulated 100x100 chromatin interaction matrices with 250 controlled changes applied - ROC curves for different normalization techniques and fold changes - Loess provides the most power to detect small fold changes .small[ https://bioconductor.org/packages/HiCcompare/ Stansfield, John C., Kellen G. Cresswell, Vladimir I. Vladimirov, and Mikhail G. Dozmorov. “[HiCcompare: An R-Package for Joint Normalization and Comparison of HI-C Datasets](https://doi.org/10.1186/s12859-018-2288-x)” _BMC Bioinformatics_, December 2018 ] ] .pull-right[ .center[

] ] --- class: center, middle # multiHiCcompare Normalization and differential analysis of multiple Hi-C datasets --- ## Normalization: Multiple Hi-C datasets **Cyclic loess** (Ballman et al. 2004) to jointly normalize multiple datasets - take each pair of datasets, normalize, repeat until convergence 1. Choose two out of the N total samples, then generate an MD plot 2. Fit a loess curve $f(D)$ to the MD plot 3. Subtract $f(D)/2$ from the first dataset and add $f(D)/2$ to the second 4. Repeat until all unique pairs have been compared 5. Repeat until convergence

.small[Ballman, Karla V., Diane E. Grill, Ann L. Oberg, and Terry M. Therneau. “[Faster Cyclic Loess: Normalizing RNA Arrays via Linear Models](https://doi.org/10.1093/bioinformatics/bth327).” Bioinformatics (Oxford, England) 20, no. 16 (November 1, 2004)] --- ## Differential analysis: Multiple Hi-C datasets - **Distance-centric analysis** – each off-diagonal data slice has unique statistical properties - Split Hi-C data into $d$ distance-centric matrices with $g$ rows (indices for interacting pairs of regions) and $i$ columns (samples) .center[

] --- ## Differential analysis: Multiple Hi-C datasets - Statistical framework of differential gene expression analysis can be adapted for differential analysis of IFs ([edgeR](https://bioconductor.org/packages/edgeR/) R package) - **Exact test** .small[ - For comparing 2 groups without other covariates - Similar to Fisher's exact test - Computes exact p-values by summing over all sums of counts that have a probability less than the probability under the null hypothesis ] - **Generalized Linear Models** .small[ - For more complex experiments, utilize the GLM framework - The IF value for a pair of regions $g$ at distance $d$ from sample $i$ follows $NB(M_{di}*p_{dgj},\phi_{dg})$, where $M_{di}$ is the total number of reads in sample $i$ at distance $d$, $p_{dgj}$ is the proportion of interaction counts $g$ in sample $i$ from experimental condition $j$, $\phi_{dg}$ is the dispersion - The vector of covariates $x_i$ can be linked with $\mu_{dgj}$ through a log-linear model $log(\mu_{dgj}) = x_i^T\beta_{dg} + log(M_{dj})$ ] --- ## Analysis overview: Multiple Hi-C datasets .center[

] .small[ Benchmarking study: Zheng, Ye, Peigen Zhou, and Sündüz Keleş. “[FreeHi-C Spike-in Simulations for Benchmarking Differential Chromatin Interaction Detection](https://doi.org/10.1016/j.ymeth.2020.07.001)” _Methods_, July 12, 2020 ] --- ## Benchmarking .center[

] .small[ https://bioconductor.org/packages/multiHiCcompare/ Stansfield, John C, Kellen G Cresswell, and Mikhail G Dozmorov. “[MultiHiCcompare: Joint Normalization and Comparative Analysis of Complex Hi-C Experiments](https://doi.org/10.1093/bioinformatics/btz048).” Bioinformatics, January 22, 2019 ] --- class: center, middle # SpectralTAD Detection of Topologically Associating Domains using Spectral Clustering --- ## Topologically Associating Domains - TADs are 3D structures of frequently interacting regions - Boundaries are associated with specific genomic features (CTCF, cohesin, mediator) - Can be hierarchical (nested, TADs containing sub-TADs) .center[

] --- ## Why are TADs Important? - Established early in development - Highly correlate with replication timing - TADs create "autonomous gene-domains" partitioning the genome into discrete functional regions - Disruptions of TADs lead to _de novo_ enhancer-promoter interactions and dysregulation of gene expression -> disease TADs are a distinct layer of the 3D genome organization --- ## TAD detection using graph theory .pull-left[ - Hi-C data has a natural graph structure, defined by vertices $V$ and edges $E$ - **Vertices** are genomic regions - **Edges** represent interaction strength between any pair of regions - Vertices and edges are stored in an **adjacency matrix** $A_{ij}$ where $ij$ is the number of edges between a given set of vertices $ij$ ] .pull-right[ .center[

]

] --- ## Traditional Spectral Clustering - Specifically designed to cluster graphs - Works by projecting the data into a lower-dimensional space - Excels on noisy and non-normally distributed data (Hi-C data) --- ## How to perform spectral clustering - Calculate the Laplacian: $$D = diag(A\mathbf{1_n})$$ $$\bar{L} = D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$$ - Calculate the eigenvectors of the Laplacian matrix (graph spectrum): $$\bar{L}\mathbf{v} = \lambda\mathbf{v}$$ - Normalize the eigenvectors and cluster --- ## Spectral clustering with eigenvector gaps - Rows and columns of Hi-C matrices are naturally ordered - TADs are continuous - Order points (genomic regions) in eigenvector space, and cluster - We propose a simple, novel approach to clustering ordered data using gaps between consecutive points in eigenvector space --- ## Step 1: Plot the non-normalized eigenvectors .center[

] --- ## Step 2: Project on to Unit Circle .center[

] --- ## Step 3: Find the k-largest gaps and partition .center[

] --- ## Windowed Spectral Clustering - We know the biologically maximum TAD size (2 million bp) - We can use a 2 million bp sliding window to perform spectral clustering and aggregate - Advantages of the sliding window - Reduced cubic complexity of spectral clustering $O(n^{3})$ to linear complexity $O(n)$ - Naturally discards noisy interactions at large genomic distances --- ## SpectralTAD algorithm 1. Cut a window from the matrix equal to the maximum TAD size (2Mb) 2. Find the graph spectrum of the window and calculate eigenvector gaps 3. Find a set of clusters that maximize the silhouette score 4. Slide the window to the next group of loci and repeat .center[

] --- ## Determining a hierarchy of TADs - TADs are hierarchical in nature (organized into large meta-TADs with sub-TADs within them) - To find sub-TADs, we use a novel metric called boundary score - **Boundary score** is just the z-score for each eigenvector gap (Euclidean distance between adjacent points) .center[

] --- ## Boundary score as a metric for TAD boundary detection .center[

] --- ## Determining a hierarchy of TADs For each initial TAD: - Perform spectral clustering on the submatrix defined by the initial TAD - Calculate the eigenvector gaps for each consecutive pair of regions - Convert eigenvector gaps to boundary scores - If any boundary score is greater than 1.96, this is a sub-TAD boundary - Repeat for all sub-TADs until no z-score is greater than 1.96 --- ## TAD Calling: SpectralTAD - We compared SpectralTAD against four TAD callers: - TopDom - HiCSeg - OnTAD - rGMAP - Good TAD caller must satisfy three criteria: - Be robust to Hi-C data imperfections (resolution, sparsity, sequencing depth) - Detect biologically significant, hierarchical TAD boundaries - Be fast --- ## SpectralTAD is robust to resolution .center[

] --- ## SpectralTAD is robust to sparsity - 25 simulated matrices with pre-defined TADs (HiCToolsCompare) - The percentage of the matrix replaced with zeros - Jaccard similarity between the detected and pre-defined TADs .center[

] --- ## SpectralTAD is robust to sequencing depth - Downsample matrices uniformly at random - Measure Jaccard similarity between TADs detected from the original and sparse data .center[

] --- ## Hierarchical TAD boundaries differ - Boundaries shared by two TADs (Level 2) or three TADs (Level 3) are more biologically significant .center[

] --- ## SpectralTAD is fast A) Runtimes for various TAD callers at different chromosome sizes B) Runtimes for various TAD callers across all chromosomes (25kb data) .center[

] --- ## SpectralTAD Package - **Input**: three types of contact matrices ( $n \times n$, sparse and $n \times (n+3)$) in text format, import from `.hic` and `.cool` files supported - Two main functions: `SpectralTAD` and `SpectralTAD_Par` (parallelized) - **Output**: A 3-column BED file for each hierarchy level - Visualization options include output for `Juicebox`

.small[ https://bioconductor.org/packages/SpectralTAD/ Cresswell, Kellen G., John C. Stansfield, and Mikhail G. Dozmorov. “[SpectralTAD: An R Package for Defining a Hierarchy of Topologically Associated Domains Using Spectral Clustering](https://doi.org/10.1186/s12859-020-03652-w)” _BMC Bioinformatics_, July 20, 2020 ] --- class: center, middle # TADcompare Differential and time course analysis of TAD boundaries --- ## Comparing TADs (TADcompare) .pull-left[ - Boundary score can be compared across conditions – differential boundary score .center[

] ] .pull-right[ .center[

] ] .small[ Cresswell, Kellen G., and Mikhail G. Dozmorov. “[TADCompare: An R Package for Differential and Temporal Analysis of Topologically Associated Domains](https://doi.org/10.3389/fgene.2020.00158)” _Frontiers in Genetics_, March 10, 2020 ] --- ## Complex TAD boundary changes are frequent between biological replicates and cell/tissue types .center[

] --- ## Distinct biology of differential TAD boundaries - Non-differential boundaries are most enriched in CTCF and other boundary marks – conserved, biologically important - Shifted boundaries are least enriched – shifts are likely due to noisy data .center[

] --- ## Dynamics of the 3D genome over time course .pull-left[ - Boundary score allows investigating changes in TAD boundaries over time course - Six patterns of boundary changes over time ] .pull-right[ .center[

] ] --- ## Distinct biology of different patterns of TAD boundary changes - Auxin treatment experiment - eliminate TAD boundaries with auxin, wash auxin out, observe TAD boundaries recovery - Early and Late appearing boundaries are most enriched in CTCF, RAD21 .center[

] --- ## Time course Hi-C data analysis - Part of the TADcompare package - Requires three or more time points - Can handle replicates at each time point

.small[ https://bioconductor.org/packages/TADCompare/ Cresswell, Kellen G., and Mikhail G. Dozmorov. “[TADCompare: An R Package for Differential and Temporal Analysis of Topologically Associated Domains](https://doi.org/10.3389/fgene.2020.00158)” _Frontiers in Genetics_, March 10, 2020 ] --- class: center, middle # Understanding the biology of differential chromatin interactions --- ## 3D genome of cancer metastasis and drug resistance - Patient Derived Xenograft (PDX) mouse models of breast cancer - Progression of the primary tumor to metastatic and drug resistant states .center[

] --- ## Analysis of PDX Hi-C data (PDX Hi-C) .pull-left[ - Mouse reads minimally affect Hi-C data - direct alignment to the human genome works well - Processing pipeline plays minimal role - Technology is the most important for data quality ] .pull-right[ .center[

] ] .small[Dozmorov, Mikhail G, Katarzyna M Tyc, Nathan C Sheﬃeld, David C Boyd, Amy L Olex, Jason Reed, and J Chuck Harrell. “[Chromatin Conformation Capture (Hi-C) Sequencing of Patient-Derived Xenografts: Analysis Guidelines](https://doi.org/10.1093/gigascience/giab022)” _GigaScience_ April 21, 2021] --- ## 3D genome of cancer metastasis and drug resistance .center[

] --- ## Summary - **Distance-centric view** of Hi-C data is critical (`MD plot`) - **Joint loess normalization** effectively removes between-dataset biases ([HiCcompare](https://bioconductor.org/packages/HiCcompare/)) - **Differential analysis considering distance** has optimal performance ([multiHiCcompare](https://bioconductor.org/packages/multiHiCcompare/)) - **Spectral clustering** robustly detects biologically relevant hierarchical TADs ([SpectralTAD](https://bioconductor.org/packages/SpectralTAD/)) - **Boundary score** enables differential and time course analysis of TAD boundaries ([TADcompare](https://bioconductor.org/packages/TADCompare/)) .small[https://github.com/mdozmorov/HiC_tools] --- ## Acknowledgements .pull-left[ .center[

] .small[https://dozmorovlab.github.io/] ] .pull-rignt[ ### Collaborators - J. Chuck Harrell (VCU, Pathology) - Ay lab (La Jolla Institute for Immunology) ### Funding - American Cancer Society - PhRMA Foundation - The George and Lavinia Blick Research Fund .center[

] ] --- class: center, middle # Thank you [bit.ly/3Dgenomics](https://mdozmorov.github.io/Talk_3Dgenome/)

Bioinformatics postdoc, research assistant positions available