---
title: "Tutorial: machine-learning with TGCA BIC transcriptome"
subtitle: "02. Clustering"
author: "Anne Badel & Jacques van Helden"
date: '`r Sys.Date()`'
output:
  html_document:
    self_contained: no
    fig_caption: yes
    highlight: zenburn
    theme: cerulean
    toc: yes
    toc_depth: 3
    toc_float: yes
    code_folding: "hide"
  ioslides_presentation:
    slide_level: 2
    self_contained: no
    colortheme: dolphin
    fig_caption: yes
    fig_height: 5
    fig_width: 7
    fonttheme: structurebold
    highlight: tango
    smaller: yes
    toc: yes
    widescreen: yes
  pdf_document:
    fig_caption: yes
    highlight: zenburn
    toc: yes
    toc_depth: 3
  beamer_presentation:
    colortheme: dolphin
    fig_caption: yes
    fig_height: 5
    fig_width: 7
    fonttheme: structurebold
    highlight: tango
    incremental: no
    keep_tex: no
    slide_level: 2
    theme: Montpellier
    toc: yes
  revealjs::revealjs_presentation:
    theme: night
    transition: none
    self_contained: true
    css: ../slides.css
  slidy_presentation:
    smart: no
    slide_level: 2
    self_contained: yes
    fig_caption: yes
    fig_height: 5
    fig_width: 7
    highlight: tango
    incremental: no
    keep_md: yes
    smaller: yes
    theme: cerulean
    toc: yes
    widescreen: yes
  powerpoint_presentation:
    slide_level: 2
    fig_caption: yes
    fig_height: 5
    fig_width: 7
    toc: yes
font-import: http://fonts.googleapis.com/css?family=Risque
font-family: Garamond
transition: linear
editor_options: 
  chunk_output_type: console
---


```{r settings, include=FALSE, echo=FALSE, eval=TRUE}

options(width = 300)
# options(encoding = 'UTF-8')
knitr::opts_chunk$set(
  fig.width = 7, fig.height = 5, 
  fig.path = 'figures/tcga-bic_',
  fig.align = "center", 
  size = "tiny", 
  echo = TRUE, 
  eval = TRUE, 
  warning = FALSE, 
  message = FALSE, 
  results = TRUE, 
  comment = "")

options(scipen = 12) ## Max number of digits for non-scientific notation
# knitr::asis_output("\\footnotesize")

```


```{r libraries, echo=FALSE, eval=TRUE}
#### Load required R libraries ####

## CRAN libraries
requiredLib <- c("knitr",
                 "factoextra", ## for extended distances
                 "aricode",  # for Adjusted Rand Index (ARI),
                 "pheatmap",  ## To draw beautiful heat maps
                 "scales" ## to draw plot with semi-transparent points
)
for (lib in requiredLib) {
  if (!require(lib, character.only = TRUE)) {
    install.packages(lib, )
  }
  require(lib, character.only = TRUE)
}


```


## Data loading

We provide hereafter a code to load the prepared data from a memory image on the github repository. This image has been generated at the end of the tutorial on [data loading and exploration](https://du-bii.github.io/module-3-Stat-R/stat-R_2021/tutorials/machine-learning_TCGA-BIC/01_data-loading_TCGA-BIC.html). 

```{r reload_mem_image, eval=TRUE, echo=TRUE, collapse=FALSE}
#### Reload memory image from github repository ####
github_mem_img <- "https://github.com/DU-Bii/module-3-Stat-R/blob/master/stat-R_2021/data/TCGA_BIC_subset/bic_data.Rdata?raw=true"

## Define local destination folder
bic_folder <- "~/m3-stat-R/TCGA-BIC_analysis"
## Create it if required
dir.create(bic_folder, showWarnings = FALSE, recursive = TRUE)

## Define local destination for the memory image
mem_image <- file.path(bic_folder, "bic_data.Rdata")
if (file.exists(mem_image)) {
  message("Memory image already there, skipping download")
} else {
  message("Downloading memory image from\n", github_mem_img)
  download.file(url = github_mem_img, destfile = mem_image)
  message("Local memory image\t", mem_image)
}

## Load the memory image
message("Loading memory image", mem_image)
load(mem_image)

```

The table below indicates the main variables loaded with this memory image. 

| Variable name | Data content |
|--------------|----------------------------------------------|
| `class_color`| a vector specifying the color to be associated to each sample class (cancer type) |
| `bic_meta` | metadata with a few added columns (sample color, estimators of central tendency and dispersion) |
| `gene_info`| ID, name and description of the 1000 genes used here |
| `bic_expr` | non-normalised expression table |
| `bic_expr_centered` | median-based expression table |
| `bic_expr_std` | sample-wise standardised expression table, all samples having the same median and IQR |
| `bic_expr_labels` | same content as bic_expr_std but with row names replaced by human-readable gene names, and column names by human-readable sample labels |

Use the command `View()` or `head()` or any other convenient way of your choice to check the content of these variables. 


## Subset selection

The original data table contains `r ncol(bic_expr_labels)` samples (columns) x `r nrow(bic_expr_labels)` genes (rows). These dimensions are not convenient for visualisation, so we will start by selecting a smaller set of samples and genes. 

Note that this smaller data subset will serve only for didactic purposes; at the end of the practical, we will apply the clustering to all the genes with all the samples, in order to get relevant results. 


### Exercise: subset selection

1. In the metadata table, count the number of samples per cancer type.

```{r samples_per_cancer_type}
#### Samples per cancer type ####
kable(sort(table(bic_meta$cancer.type), decreasing = TRUE),
      caption = "Number of samples per cancer type")

```


2. Generate a vector named `sample_subset` that will contain 20 sample IDs from each cancer type *except the unclassified*. 

Tip:  use the commands `subset()` and `sample()`. If you are not familiar with these commands, use `help()` to read their doc. 

```{r sample_subset_selection}

#### Sample subset selection ####

samples_per_class <- 20

## Get the list of cancer types
cancer_types <- unique(bic_meta$cancer.type)

## Initialise the result vector
sample_subset <- vector()

## Select samples for each class
for (type in cancer_types) {
  if (type != "Unclassified") { ## Skip Unclassified samples
    ## Select the samples of the current cancer type
    type_sample <- subset(x = bic_meta, cancer.type == type, 
                          select = "label")
    sample_subset <- append(
      sample_subset, 
      sample(row.names(type_sample), 
             size = samples_per_class,
             replace = FALSE)
    )
      
  }
}


```

3. Create a data.frame named `subse_meta` with the metadata corresponding to this random selection and check the number of samples per class. 

```{r subset_meta}
#### Subset metadata ####
subset_meta <- bic_meta[sample_subset, ]

kable(x = table(subset_meta$cancer.type), 
      caption = "Number of samples per class in the selected subset")

```


3. Generate an expression table named `subset_expr_labels` from the expression table `bic_expr_labels` by keeping only

    - the columns corresponding to the selected samples (beware, you will need to use labels instead of row names from the subset metadata)
    - the 100 top-ranking genes

```{r subset_expr_labels}
#### Select a subset of the expression table ####
subset_gene_nb <- 100
subset_labels <- subset_meta$label
subset_expr_labels <- bic_expr_labels[1:subset_gene_nb, subset_labels]

```

## Distance computation

Read the help page of the enhanced distance function  of the package `factoextra` (`factoextra::get_dist`). 

Use this method to compute 3 dissimilarity matrices, using respectively Euclidian, Pearson and Spearman distances. Name these matrices respectively `sample_dist_euclidian`,  `sample_dist_pearson`, and `sample_dist_spearman`, 

Tips:

- Beware: `get_dist()` computes distances betwee the *rows* of a data matrix. For sample clustering we will thus need to transpose the data.frame with the function `t()`.

### Example: Euclidian distance

```{r distances_euclidian}
#### Inter-sample distances ####
library(factoextra)
# ?factoextra::get_dist

## Euclidian distance
sample_dist_euclidian <- factoextra::get_dist(
  x = t(subset_expr_labels), method = "euclidian")

```

### Exercise

Use `get_dist()` to compute correlation-based distances with respectiely Pearson and Spearman correlations. Save the results in variables named `sample_dist_pearson` and `sample_dist_spearman`. 

```{r distances_corr}

## Pearson distance
sample_dist_pearson <- factoextra::get_dist(
  x = t(subset_expr_labels), method = "pearson")

## Spearman distance
sample_dist_spearman <- factoextra::get_dist(
  x = t(subset_expr_labels), method = "spearman")

```


## Distance heatmaps

Use the function `pheatmap()` (from the package of the same name) to plot the distance matrices. Use the `annotation` option to display colors above the columns indicating the cancer type and the status of the three gene markers (Her2, PR1 and ER1). 

### Example: Euclidian distance heatmap

```{r dist_euclidian_heatmap, fig.width=8, fig.height=8, out.width="100%", fig.cap="Heatmap of the Euclidian distance"}
#### Heatmap of the Euclidian distance ####s
subset_annotation <- subset_meta[, c("cancer.type", "ER1", "PR1", "Her2")]
rownames(subset_annotation) <- subset_meta$label
pheatmap(sample_dist_euclidian, 
         cluster_rows = FALSE, 
         cluster_cols = FALSE,
         main = "Euclidian distance", 
         annotation_col = subset_annotation)
```


### Exercise: distance heatmaps

Run the same analysis for the Pearson and Spearman distance matrices. Interpret the results: do some of these measures seem more relevant to highlight the similarities between samples?


```{r dist_pearson_heatmap, fig.width=8, fig.height=8, out.width="100%", fig.cap="Heatmap of the Spearman correlation-based distance"}
#### Pearson distance heatmap ####


```


```{r dist_spearman_heatmap, fig.width=8, fig.height=8, out.width="100%", fig.cap="Heatmap of the Spearman correlation-based distance"}
#### Spearman distance heatmap ####
```

### Interpretation: dissimilarity measures

Comparing the three distance heatmaps above, can you already conclude something about the most relevant measure of dissimilarity between samples?

- TO BE COMPLETED

## Computing a tree from a distance matrix

Use the function `hclust()` to get trees from the previously computed distance matrices. 

### Example: Euclidian matrix, complete linkage

```{r sample_tree_euclidian_complete, fig.width=12, fig.height=5, out.width="100%", fig.cap="Sample tree of the expression matrix with Euclidian distance and complete linkage. "}
#### Sample tree with Euclidian distance, complete linkage ####
sample_tree_euclidian_complete <- hclust(
  d = sample_dist_euclidian, method = "complete")

plot(sample_tree_euclidian_complete, 
     xlab = NA, ylab = NA, hang = -1)
```

We can also use some fancy library (e.g. `ClassDiscovery`) to color the tree labels. Other librairies can be used to generate nice tree drawings, don't hesitate to google and improve. 

```{r sample_tree_euclidian_complete_colored, fig.width=12, fig.height=5, out.width="100%", fig.cap="Sample tree of the expression matrix with Euclidian distance and complete linkage. "}

library(ClassDiscovery)
ClassDiscovery::plotColoredClusters(
  sample_tree_euclidian_complete, 
  labs = subset_meta$cancer.type, ## class labels
  col = "grey", ## tree color
  cols = subset_meta$color, 
  cex = 0.6,
  main = "Euclidian distance, Ward linkage")

legend("topright",
       legend = names(class_color), 
       col = class_color,
       text.col = class_color, cex=0.9)

```

### Exercises: other matrices and linkage rules

Test the different agglomeration rules (single, average, complete, ward) and the different disimilarity measures (Euclidian, Spearman, Pearson). 

### Interpretation: agglomeration rule

Based on the analyses above, would you consider that some agglomeration rules are generally more relevant than other ones with this dataset? If so, is it systematic or does the best agglomeration rule vary depending on the dissimilarity measure?


## Tree cut

1. Cut the tree to obtain 4 samples clusters from the tree obtained with Euclidian distance and complete linkage. 

2. Create a confusion matrix to compare the sample compositon of the clusters with the cancer types annotated in the metadata. 

3. Compute the ARI from this confusion matrix. 

```{r clusters_vs_cancer_types}
#### Cluster - cancer type comparison ####

## Cut the tree
nclust <- 4
clusters_euclidian_complete <- cutree(
  tree = sample_tree_euclidian_complete, k = nclust)
confusion_euclidian_complete <- table(
  clusters_euclidian_complete,
  subset_meta$cancer.type)

kable(confusion_euclidian_complete, row.names = TRUE)

## Compute Adjusted Rand Index (ARI)
ari_euclidian_complete <- round(
  digits = 3, 
  aricode::ARI(
    c1 = clusters_euclidian_complete, 
    c2 = subset_meta$cancer.type))

message("ARI for Euclidian distance and complete linkage: ",
        ari_euclidian_complete)
```

With Euclidian distance and the complete lin,kage, we obtain an Adjusted Rank Index (ARI) of `r ari_euclidian_complete`, computed with a subset of the expression table. 

The next question is of course: can we obtain better results with other distances and agglomeration rules? This will be left as an exercise. 

## Clustering with the full dataset

Compute the ARI index with the same similarity measure and the same agglomeration rule, but with the full dataset (`bic_expr_labels`)

```{r ari_full_dataset}

## Choose the clustering parameters
distance <- "euclidian"
rule <- "complete"

message("Computing ARI with ", 
        distance, " dissimilarity and ",
        rule, " linkage")

## Euclidian distance
bic_dist <- factoextra::get_dist(
  x = t(bic_expr_labels), method = distance)

#### Sample tree with Euclidian distance, complete linkage ####
sample_tree <- hclust(
  d = bic_dist, method = rule)

## Confusion matrix
nclust <- 4
bic_clusters <- cutree(
  tree = sample_tree, k = nclust)
bic_confusion <- table(
  bic_clusters,
  bic_meta$cancer.type)

kable(bic_confusion, row.names = TRUE, 
      caption = paste0(
        "Confusion table of the sample clusters versus cancer subtypes (all samples and 1000 genes). ",
        "Distance: ", distance, ". ",
        "Agglomeration rule: ", rule, ". "
        
      ))

## Compute Adjusted Rand Index (ARI)
bic_ari <- round(
  digits = 3, 
  aricode::ARI(
    c1 = bic_clusters, 
    c2 = bic_meta$cancer.type))

message("ARI for Euclidian distance and complete linkage (full dataset): ",
        bic_ari)

```

With Euclidian distance and the complete lin,kage, we obtain an Adjusted Rank Index (ARI) of `r bic_ari`, computed with a subset of the expression table. 

## Heatmap with biclustering (samples + genes)

Biclustering amounts to cluster both the individuals (patient samples in our case) and variables (genes in our case).

We will illustrate this with the Spearman correlation as distance and Ward as agglomeration rule. 

```{r biclustering}
#### Biclustering ####

# left as an exercise


```


## Personal work: chosing the most relevant parameters

These analyses can be led collectively: each trainee could take one combination of parameters. 

1. Based on the examples above, find the most relevant parameters (dissimilarity matrix, agglomeration rule) to cluster the samples of the subset expression matrix. Choose these parameters based on the confusion matrix and the ARI. 

2. Draw different graphical representations of the best result: 

    - heatmap of the distance matrix
    - sample tree with coloring of the annotated cancer type
    - heatmap (samples x genes) with coloring of the metadata
    

3. Use these optimal parameters to run clustering with all the samples of the BIC dataset and the `1000 genes instead of the subset of 100. Export a table with the sample clusters and another one with the gene clusters.


## Session info

As usual, we write the session info in the report for the sake of traceability. 

```{r session_info}
sessionInfo()

```