---
title: "Tutorial: machine-learning with TGCA BIC transcriptome"
subtitle: "04. Supervised classification"
author: "Jacques van Helden"
date: '`r Sys.Date()`'
output:
  html_document:
    self_contained: no
    fig_caption: yes
    highlight: zenburn
    theme: cerulean
    toc: yes
    toc_depth: 3
    toc_float: yes
    code_folding: "hide"
  ioslides_presentation:
    slide_level: 2
    self_contained: no
    colortheme: dolphin
    fig_caption: yes
    fig_height: 5
    fig_width: 7
    fonttheme: structurebold
    highlight: tango
    smaller: yes
    toc: yes
    widescreen: yes
  pdf_document:
    fig_caption: yes
    highlight: zenburn
    toc: yes
    toc_depth: 3
  beamer_presentation:
    colortheme: dolphin
    fig_caption: yes
    fig_height: 5
    fig_width: 7
    fonttheme: structurebold
    highlight: tango
    incremental: no
    keep_tex: no
    slide_level: 2
    theme: Montpellier
    toc: yes
  revealjs::revealjs_presentation:
    theme: night
    transition: none
    self_contained: true
    css: ../slides.css
  slidy_presentation:
    smart: no
    slide_level: 2
    self_contained: yes
    fig_caption: yes
    fig_height: 5
    fig_width: 7
    highlight: tango
    incremental: no
    keep_md: yes
    smaller: yes
    theme: cerulean
    toc: yes
    widescreen: yes
  powerpoint_presentation:
    slide_level: 2
    fig_caption: yes
    fig_height: 5
    fig_width: 7
    toc: yes
font-import: http://fonts.googleapis.com/css?family=Risque
font-family: Garamond
transition: linear
editor_options: 
  chunk_output_type: console
---


```{r settings, include=FALSE, echo=FALSE, eval=TRUE}

options(width = 300)
# options(encoding = 'UTF-8')
knitr::opts_chunk$set(
  fig.width = 7, fig.height = 5, 
  fig.path = 'figures/tcga-bic_',
  fig.align = "center", 
  size = "tiny", 
  echo = TRUE, 
  eval = TRUE, 
  warning = FALSE, 
  message = FALSE, 
  results = TRUE, 
  comment = "")

options(scipen = 12) ## Max number of digits for non-scientific notation
# knitr::asis_output("\\footnotesize")

```


```{r libraries, echo=FALSE, eval=TRUE}
#### Load required R libraries ####

## CRAN libraries
requiredLib <- c("knitr",
                 "e1071", ## For svm()
                 "caret" ## Many utilities for supervised classification
                 )
for (lib in requiredLib) {
  if (!require(lib, character.only = TRUE)) {
    install.packages(lib, )
  }
  require(lib, character.only = TRUE)
}


```


## Data loading

We provide hereafter a code to load the prepared data from a memory image on the github repository. This image has been generated at the end of the tutorial on [data loading and exploration](https://du-bii.github.io/module-3-Stat-R/stat-R_2021/tutorials/machine-learning_TCGA-BIC/01_data-loading_TCGA-BIC.html). 

```{r reload_mem_image, eval=TRUE, echo=TRUE, collapse=FALSE}
#### Reload memory image from github repository ####
github_mem_img <- "https://github.com/DU-Bii/module-3-Stat-R/blob/master/stat-R_2021/data/TCGA_BIC_subset/bic_data.Rdata?raw=true"

## Define local destination folder
bic_folder <- "~/m3-stat-R/TCGA-BIC_analysis"
## Create it if required
dir.create(bic_folder, showWarnings = FALSE, recursive = TRUE)

## Define local destination for the memory image
mem_image <- file.path(bic_folder, "bic_data.Rdata")
if (file.exists(mem_image)) {
  message("Memory image already there, skipping download")
} else {
  message("Downloading memory image from\n", github_mem_img)
  download.file(url = github_mem_img, destfile = mem_image)
  message("Local memory image\t", mem_image)
}

## Load the memory image
message("Loading memory image", mem_image)
load(mem_image)

```

The table below indicates the main variables loaded with this memory image. 

| Variable name | Data content |
|--------------|----------------------------------------------|
| `class_color`| a vector specifying the color to be associated to each sample class (cancer type) |
| `bic_meta` | metadata with a few added columns (sample color, estimators of central tendency and dispersion) |
| `gene_info`| ID, name and description of the 1000 genes used here |
| `bic_expr` | non-normalised expression table |
| `bic_expr_centered` | median-based expression table |
| `bic_expr_std` | sample-wise standardised expression table, all samples having the same median and IQR |
| `bic_expr_labels` | same content as bic_expr_std but with row names replaced by human-readable gene names, and column names by human-readable sample labels |

Use the command `View()` or `head()` or any other convenient way of your choice to check the content of these variables. 


## Discard the unclassified samples

Generate an expression matrix and a metadata table without the unclassified samples, since these ones cannot be used to train a classifier. 

```{r discard_unclassified}
#### Discard unclassified ####
bic_meta_ok <- bic_meta[bic_meta$cancer.type != "Unclassified",]

## Check that the metadata table has only the right classes
nrow(bic_meta_ok)
table(bic_meta_ok$cancer.type)

## Select the corresponding columns (samples) of the expression table
## We also transpose the data table for the supoervised
bic_expr_class_ok <- t(bic_expr_labels[, bic_meta_ok$label])
dim(bic_expr_class_ok)

bic_classes_ok <- bic_meta_ok$cancer.type

```


## Split the dataset into a testing and a training set

Split the sample set into 
- a training (2/3 of the samples) 
- a testing set (the other 1/3 of the samples)


```{r split_training_testing}
#### Split samples into training and testing sets ####

## Count the number of samples (total, train and test)
nsamples <- nrow(bic_expr_class_ok)
nsamples_train <- round(nsamples * 2/3)
nsamples_test <- nsamples - nsamples_train

## Perfoirm a random sampling of indices
resampled_indices <- sample(1:nsamples, replace = FALSE)
train_indices <- resampled_indices[1:nsamples_train]
test_indices <- setdiff(resampled_indices, train_indices)

```

## Training a classifier with the train set

Use the `svm()`funcvtion to train a support vector machine with the training set. 

```{r train_svm}
#### Train the SVM with the training subset ####
svm_kernel = "radial"

## Define training set: expression matrix and class labels
training_expr <- bic_expr_class_ok[train_indices, ]
training_classes <- bic_meta_ok[train_indices, "cancer.type"]
table(training_classes)

## Train the SVM
svm_model <- svm(x = training_expr, 
                 y = as.factor(training_classes), 
                 type = "C-classification", 
                 kernel = svm_kernel)

```


## Predicting the classes of the testing set

```{r predict_test}
#### Predict the classes of the testing set ####
testing_expr <- bic_expr_class_ok[test_indices, ]
testing_pred <- as.vector(predict(svm_model,  testing_expr))

## Check the number of predicted samples per class
table(testing_pred)

```


```{r eval_perf}
#### Evaluate the performances of the classifier ####

## Generate a contingency table comparing 
## the known and predicted classes for the testing set


## Get the annotated classes for the testing set
testing_classes <- bic_meta_ok[test_indices, "cancer.type"]
table(testing_classes)

## Compare the annotated and  predicted cancer types
contingency <- table(testing_classes, testing_pred)
kable(contingency, row.names = TRUE, caption = )

## Compute the number of correct predictions
errors <- testing_pred != testing_classes

## Compute the misclassification error rate (MER)
mer <- sum(errors) / length(errors)

message("MER = ", mer)

```

In this training / testing experiment we obtained a misclassification error rate (MER) of $MER = `r mer`$.  However, we just performed a single random sampling, so we can expect to obtain different results if we repaeat this experiment. 

Several strategies are classically used to obtain a more reliable estimation of the performances : 

1. **K-fold cross-validation:** split the dataset into $k$ subsets, each of which is used as test for one train/test experiment (and the $k-1$ remaining ones are then used for training). We could for example run a 10-fold cross-validation with this dataset. 

2. **Leave-one-out** (**LOO**, also called **jack-knife**) is a specific case of k-fold cross-validation where $k = n$, i.e. each individual (biological sample in our case) is in turn discarded from the data set ("left out"), a classifier is trained with the $n - 1$ remaining individuals, and the trained classifier ("model") is used to predict the class of the left out individual. 

3. **Iterative subsampling** consists in repeating the above procedure, where we select a given proportion of the individuals for training (e.g. 2/3) and the rest for testing. 

Each of these methods of performance estimation has its pros and cons. 


For this course, we will run a **collective iterative subsampling**: each trainee will run a trainig/test experiment and we will write the results in a shared spreadsheet, which will then enable us to compute some statistics on the MER values (mean, standard deviation, min, max). 

## Exercise: impact of the SVM  kernel

Based on the above example, test the 4 SVM kernels (linear, radial, sigmoid, polynomial) and compare the performances. Which kernel gives the best results. 


## Exercise: impact of the number of variables (genes)

Use the best kernel defined above, and estimate the MER with the 100, 200, 300, 500, 1000 top genes respectively. 

## SVM tuning

use the `tune.svm()` function to find the optimal SVM parameters.

```{r tune_svm}
#### Tune SVM ####
svm_tuning_result <- tune.svm(
  x = training_expr, 
  y = as.factor(training_classes))

```


## Session info

As usual, we write the session info in the report for the sake of traceability. 

```{r session_info}
sessionInfo()

```