--- title: "Tutorial: machine-learning with TGCA BIC transcriptome" subtitle: "02. Clustering" author: "Anne Badel & Jacques van Helden" date: '`r Sys.Date()`' output: html_document: self_contained: no fig_caption: yes highlight: zenburn theme: cerulean toc: yes toc_depth: 3 toc_float: yes code_folding: "hide" ioslides_presentation: slide_level: 2 self_contained: no colortheme: dolphin fig_caption: yes fig_height: 5 fig_width: 7 fonttheme: structurebold highlight: tango smaller: yes toc: yes widescreen: yes pdf_document: fig_caption: yes highlight: zenburn toc: yes toc_depth: 3 beamer_presentation: colortheme: dolphin fig_caption: yes fig_height: 5 fig_width: 7 fonttheme: structurebold highlight: tango incremental: no keep_tex: no slide_level: 2 theme: Montpellier toc: yes revealjs::revealjs_presentation: theme: night transition: none self_contained: true css: ../slides.css slidy_presentation: smart: no slide_level: 2 self_contained: yes fig_caption: yes fig_height: 5 fig_width: 7 highlight: tango incremental: no keep_md: yes smaller: yes theme: cerulean toc: yes widescreen: yes powerpoint_presentation: slide_level: 2 fig_caption: yes fig_height: 5 fig_width: 7 toc: yes font-import: http://fonts.googleapis.com/css?family=Risque font-family: Garamond transition: linear editor_options: chunk_output_type: console --- ```{r settings, include=FALSE, echo=FALSE, eval=TRUE} options(width = 300) # options(encoding = 'UTF-8') knitr::opts_chunk$set( fig.width = 7, fig.height = 5, fig.path = 'figures/tcga-bic_', fig.align = "center", size = "tiny", echo = TRUE, eval = TRUE, warning = FALSE, message = FALSE, results = TRUE, comment = "") options(scipen = 12) ## Max number of digits for non-scientific notation # knitr::asis_output("\\footnotesize") ``` ```{r libraries, echo=FALSE, eval=TRUE} #### Load required R libraries #### ## CRAN libraries requiredLib <- c("knitr", "factoextra", ## for extended distances "aricode", # for Adjusted Rand Index (ARI), "pheatmap", ## To draw beautiful heat maps "scales" ## to draw plot with semi-transparent points ) for (lib in requiredLib) { if (!require(lib, character.only = TRUE)) { install.packages(lib, ) } require(lib, character.only = TRUE) } ``` ## Data loading We provide hereafter a code to load the prepared data from a memory image on the github repository. This image has been generated at the end of the tutorial on [data loading and exploration](https://du-bii.github.io/module-3-Stat-R/stat-R_2021/tutorials/machine-learning_TCGA-BIC/01_data-loading_TCGA-BIC.html). ```{r reload_mem_image, eval=TRUE, echo=TRUE, collapse=FALSE} #### Reload memory image from github repository #### github_mem_img <- "https://github.com/DU-Bii/module-3-Stat-R/blob/master/stat-R_2021/data/TCGA_BIC_subset/bic_data.Rdata?raw=true" ## Define local destination folder bic_folder <- "~/m3-stat-R/TCGA-BIC_analysis" ## Create it if required dir.create(bic_folder, showWarnings = FALSE, recursive = TRUE) ## Define local destination for the memory image mem_image <- file.path(bic_folder, "bic_data.Rdata") if (file.exists(mem_image)) { message("Memory image already there, skipping download") } else { message("Downloading memory image from\n", github_mem_img) download.file(url = github_mem_img, destfile = mem_image) message("Local memory image\t", mem_image) } ## Load the memory image message("Loading memory image", mem_image) load(mem_image) ``` The table below indicates the main variables loaded with this memory image. | Variable name | Data content | |--------------|----------------------------------------------| | `class_color`| a vector specifying the color to be associated to each sample class (cancer type) | | `bic_meta` | metadata with a few added columns (sample color, estimators of central tendency and dispersion) | | `gene_info`| ID, name and description of the 1000 genes used here | | `bic_expr` | non-normalised expression table | | `bic_expr_centered` | median-based expression table | | `bic_expr_std` | sample-wise standardised expression table, all samples having the same median and IQR | | `bic_expr_labels` | same content as bic_expr_std but with row names replaced by human-readable gene names, and column names by human-readable sample labels | Use the command `View()` or `head()` or any other convenient way of your choice to check the content of these variables. ## Subset selection The original data table contains `r ncol(bic_expr_labels)` samples (columns) x `r nrow(bic_expr_labels)` genes (rows). These dimensions are not convenient for visualisation, so we will start by selecting a smaller set of samples and genes. Note that this smaller data subset will serve only for didactic purposes; at the end of the practical, we will apply the clustering to all the genes with all the samples, in order to get relevant results. ### Exercise: subset selection 1. In the metadata table, count the number of samples per cancer type. ```{r samples_per_cancer_type} #### Samples per cancer type #### kable(sort(table(bic_meta$cancer.type), decreasing = TRUE), caption = "Number of samples per cancer type") ``` 2. Generate a vector named `sample_subset` that will contain 20 sample IDs from each cancer type *except the unclassified*. Tip: use the commands `subset()` and `sample()`. If you are not familiar with these commands, use `help()` to read their doc. ```{r sample_subset_selection} #### Sample subset selection #### samples_per_class <- 20 ## Get the list of cancer types cancer_types <- unique(bic_meta$cancer.type) ## Initialise the result vector sample_subset <- vector() ## Select samples for each class for (type in cancer_types) { if (type != "Unclassified") { ## Skip Unclassified samples ## Select the samples of the current cancer type type_sample <- subset(x = bic_meta, cancer.type == type, select = "label") sample_subset <- append( sample_subset, sample(row.names(type_sample), size = samples_per_class, replace = FALSE) ) } } ``` 3. Create a data.frame named `subse_meta` with the metadata corresponding to this random selection and check the number of samples per class. ```{r subset_meta} #### Subset metadata #### subset_meta <- bic_meta[sample_subset, ] kable(x = table(subset_meta$cancer.type), caption = "Number of samples per class in the selected subset") ``` 3. Generate an expression table named `subset_expr_labels` from the expression table `bic_expr_labels` by keeping only - the columns corresponding to the selected samples (beware, you will need to use labels instead of row names from the subset metadata) - the 100 top-ranking genes ```{r subset_expr_labels} #### Select a subset of the expression table #### subset_gene_nb <- 100 subset_labels <- subset_meta$label subset_expr_labels <- bic_expr_labels[1:subset_gene_nb, subset_labels] ``` ## Distance computation Read the help page of the enhanced distance function of the package `factoextra` (`factoextra::get_dist`). Use this method to compute 3 dissimilarity matrices, using respectively Euclidian, Pearson and Spearman distances. Name these matrices respectively `sample_dist_euclidian`, `sample_dist_pearson`, and `sample_dist_spearman`, Tips: - Beware: `get_dist()` computes distances betwee the *rows* of a data matrix. For sample clustering we will thus need to transpose the data.frame with the function `t()`. ### Example: Euclidian distance ```{r distances_euclidian} #### Inter-sample distances #### library(factoextra) # ?factoextra::get_dist ## Euclidian distance sample_dist_euclidian <- factoextra::get_dist( x = t(subset_expr_labels), method = "euclidian") ``` ### Exercise Use `get_dist()` to compute correlation-based distances with respectiely Pearson and Spearman correlations. Save the results in variables named `sample_dist_pearson` and `sample_dist_spearman`. ```{r distances_corr} ## Pearson distance sample_dist_pearson <- factoextra::get_dist( x = t(subset_expr_labels), method = "pearson") ## Spearman distance sample_dist_spearman <- factoextra::get_dist( x = t(subset_expr_labels), method = "spearman") ``` ## Distance heatmaps Use the function `pheatmap()` (from the package of the same name) to plot the distance matrices. Use the `annotation` option to display colors above the columns indicating the cancer type and the status of the three gene markers (Her2, PR1 and ER1). ### Example: Euclidian distance heatmap ```{r dist_euclidian_heatmap, fig.width=8, fig.height=8, out.width="100%", fig.cap="Heatmap of the Euclidian distance"} #### Heatmap of the Euclidian distance ####s subset_annotation <- subset_meta[, c("cancer.type", "ER1", "PR1", "Her2")] rownames(subset_annotation) <- subset_meta$label pheatmap(sample_dist_euclidian, cluster_rows = FALSE, cluster_cols = FALSE, main = "Euclidian distance", annotation_col = subset_annotation) ``` ### Exercise: distance heatmaps Run the same analysis for the Pearson and Spearman distance matrices. Interpret the results: do some of these measures seem more relevant to highlight the similarities between samples? ```{r dist_pearson_heatmap, fig.width=8, fig.height=8, out.width="100%", fig.cap="Heatmap of the Spearman correlation-based distance"} #### Pearson distance heatmap #### ``` ```{r dist_spearman_heatmap, fig.width=8, fig.height=8, out.width="100%", fig.cap="Heatmap of the Spearman correlation-based distance"} #### Spearman distance heatmap #### ``` ### Interpretation: dissimilarity measures Comparing the three distance heatmaps above, can you already conclude something about the most relevant measure of dissimilarity between samples? - TO BE COMPLETED ## Computing a tree from a distance matrix Use the function `hclust()` to get trees from the previously computed distance matrices. ### Example: Euclidian matrix, complete linkage ```{r sample_tree_euclidian_complete, fig.width=12, fig.height=5, out.width="100%", fig.cap="Sample tree of the expression matrix with Euclidian distance and complete linkage. "} #### Sample tree with Euclidian distance, complete linkage #### sample_tree_euclidian_complete <- hclust( d = sample_dist_euclidian, method = "complete") plot(sample_tree_euclidian_complete, xlab = NA, ylab = NA, hang = -1) ``` We can also use some fancy library (e.g. `ClassDiscovery`) to color the tree labels. Other librairies can be used to generate nice tree drawings, don't hesitate to google and improve. ```{r sample_tree_euclidian_complete_colored, fig.width=12, fig.height=5, out.width="100%", fig.cap="Sample tree of the expression matrix with Euclidian distance and complete linkage. "} library(ClassDiscovery) ClassDiscovery::plotColoredClusters( sample_tree_euclidian_complete, labs = subset_meta$cancer.type, ## class labels col = "grey", ## tree color cols = subset_meta$color, cex = 0.6, main = "Euclidian distance, Ward linkage") legend("topright", legend = names(class_color), col = class_color, text.col = class_color, cex=0.9) ``` ### Exercises: other matrices and linkage rules Test the different agglomeration rules (single, average, complete, ward) and the different disimilarity measures (Euclidian, Spearman, Pearson). ### Interpretation: agglomeration rule Based on the analyses above, would you consider that some agglomeration rules are generally more relevant than other ones with this dataset? If so, is it systematic or does the best agglomeration rule vary depending on the dissimilarity measure? ## Tree cut 1. Cut the tree to obtain 4 samples clusters from the tree obtained with Euclidian distance and complete linkage. 2. Create a confusion matrix to compare the sample compositon of the clusters with the cancer types annotated in the metadata. 3. Compute the ARI from this confusion matrix. ```{r clusters_vs_cancer_types} #### Cluster - cancer type comparison #### ## Cut the tree nclust <- 4 clusters_euclidian_complete <- cutree( tree = sample_tree_euclidian_complete, k = nclust) confusion_euclidian_complete <- table( clusters_euclidian_complete, subset_meta$cancer.type) kable(confusion_euclidian_complete, row.names = TRUE) ## Compute Adjusted Rand Index (ARI) ari_euclidian_complete <- round( digits = 3, aricode::ARI( c1 = clusters_euclidian_complete, c2 = subset_meta$cancer.type)) message("ARI for Euclidian distance and complete linkage: ", ari_euclidian_complete) ``` With Euclidian distance and the complete lin,kage, we obtain an Adjusted Rank Index (ARI) of `r ari_euclidian_complete`, computed with a subset of the expression table. The next question is of course: can we obtain better results with other distances and agglomeration rules? This will be left as an exercise. ## Clustering with the full dataset Compute the ARI index with the same similarity measure and the same agglomeration rule, but with the full dataset (`bic_expr_labels`) ```{r ari_full_dataset} ## Choose the clustering parameters distance <- "euclidian" rule <- "complete" message("Computing ARI with ", distance, " dissimilarity and ", rule, " linkage") ## Euclidian distance bic_dist <- factoextra::get_dist( x = t(bic_expr_labels), method = distance) #### Sample tree with Euclidian distance, complete linkage #### sample_tree <- hclust( d = bic_dist, method = rule) ## Confusion matrix nclust <- 4 bic_clusters <- cutree( tree = sample_tree, k = nclust) bic_confusion <- table( bic_clusters, bic_meta$cancer.type) kable(bic_confusion, row.names = TRUE, caption = paste0( "Confusion table of the sample clusters versus cancer subtypes (all samples and 1000 genes). ", "Distance: ", distance, ". ", "Agglomeration rule: ", rule, ". " )) ## Compute Adjusted Rand Index (ARI) bic_ari <- round( digits = 3, aricode::ARI( c1 = bic_clusters, c2 = bic_meta$cancer.type)) message("ARI for Euclidian distance and complete linkage (full dataset): ", bic_ari) ``` With Euclidian distance and the complete lin,kage, we obtain an Adjusted Rank Index (ARI) of `r bic_ari`, computed with a subset of the expression table. ## Heatmap with biclustering (samples + genes) Biclustering amounts to cluster both the individuals (patient samples in our case) and variables (genes in our case). We will illustrate this with the Spearman correlation as distance and Ward as agglomeration rule. ```{r biclustering} #### Biclustering #### # left as an exercise ``` ## Personal work: chosing the most relevant parameters These analyses can be led collectively: each trainee could take one combination of parameters. 1. Based on the examples above, find the most relevant parameters (dissimilarity matrix, agglomeration rule) to cluster the samples of the subset expression matrix. Choose these parameters based on the confusion matrix and the ARI. 2. Draw different graphical representations of the best result: - heatmap of the distance matrix - sample tree with coloring of the annotated cancer type - heatmap (samples x genes) with coloring of the metadata 3. Use these optimal parameters to run clustering with all the samples of the BIC dataset and the `1000 genes instead of the subset of 100. Export a table with the sample clusters and another one with the gene clusters. ## Session info As usual, we write the session info in the report for the sake of traceability. ```{r session_info} sessionInfo() ```