--- title: "Tutorial: machine-learning with TGCA BIC transcriptome" subtitle: "04. Supervised classification" author: "Jacques van Helden" date: '`r Sys.Date()`' output: html_document: self_contained: no fig_caption: yes highlight: zenburn theme: cerulean toc: yes toc_depth: 3 toc_float: yes code_folding: "hide" ioslides_presentation: slide_level: 2 self_contained: no colortheme: dolphin fig_caption: yes fig_height: 5 fig_width: 7 fonttheme: structurebold highlight: tango smaller: yes toc: yes widescreen: yes pdf_document: fig_caption: yes highlight: zenburn toc: yes toc_depth: 3 beamer_presentation: colortheme: dolphin fig_caption: yes fig_height: 5 fig_width: 7 fonttheme: structurebold highlight: tango incremental: no keep_tex: no slide_level: 2 theme: Montpellier toc: yes revealjs::revealjs_presentation: theme: night transition: none self_contained: true css: ../slides.css slidy_presentation: smart: no slide_level: 2 self_contained: yes fig_caption: yes fig_height: 5 fig_width: 7 highlight: tango incremental: no keep_md: yes smaller: yes theme: cerulean toc: yes widescreen: yes powerpoint_presentation: slide_level: 2 fig_caption: yes fig_height: 5 fig_width: 7 toc: yes font-import: http://fonts.googleapis.com/css?family=Risque font-family: Garamond transition: linear editor_options: chunk_output_type: console --- ```{r settings, include=FALSE, echo=FALSE, eval=TRUE} options(width = 300) # options(encoding = 'UTF-8') knitr::opts_chunk$set( fig.width = 7, fig.height = 5, fig.path = 'figures/tcga-bic_', fig.align = "center", size = "tiny", echo = TRUE, eval = TRUE, warning = FALSE, message = FALSE, results = TRUE, comment = "") options(scipen = 12) ## Max number of digits for non-scientific notation # knitr::asis_output("\\footnotesize") ``` ```{r libraries, echo=FALSE, eval=TRUE} #### Load required R libraries #### ## CRAN libraries requiredLib <- c("knitr", "e1071", ## For svm() "caret" ## Many utilities for supervised classification ) for (lib in requiredLib) { if (!require(lib, character.only = TRUE)) { install.packages(lib, ) } require(lib, character.only = TRUE) } ``` ## Data loading We provide hereafter a code to load the prepared data from a memory image on the github repository. This image has been generated at the end of the tutorial on [data loading and exploration](https://du-bii.github.io/module-3-Stat-R/stat-R_2021/tutorials/machine-learning_TCGA-BIC/01_data-loading_TCGA-BIC.html). ```{r reload_mem_image, eval=TRUE, echo=TRUE, collapse=FALSE} #### Reload memory image from github repository #### github_mem_img <- "https://github.com/DU-Bii/module-3-Stat-R/blob/master/stat-R_2021/data/TCGA_BIC_subset/bic_data.Rdata?raw=true" ## Define local destination folder bic_folder <- "~/m3-stat-R/TCGA-BIC_analysis" ## Create it if required dir.create(bic_folder, showWarnings = FALSE, recursive = TRUE) ## Define local destination for the memory image mem_image <- file.path(bic_folder, "bic_data.Rdata") if (file.exists(mem_image)) { message("Memory image already there, skipping download") } else { message("Downloading memory image from\n", github_mem_img) download.file(url = github_mem_img, destfile = mem_image) message("Local memory image\t", mem_image) } ## Load the memory image message("Loading memory image", mem_image) load(mem_image) ``` The table below indicates the main variables loaded with this memory image. | Variable name | Data content | |--------------|----------------------------------------------| | `class_color`| a vector specifying the color to be associated to each sample class (cancer type) | | `bic_meta` | metadata with a few added columns (sample color, estimators of central tendency and dispersion) | | `gene_info`| ID, name and description of the 1000 genes used here | | `bic_expr` | non-normalised expression table | | `bic_expr_centered` | median-based expression table | | `bic_expr_std` | sample-wise standardised expression table, all samples having the same median and IQR | | `bic_expr_labels` | same content as bic_expr_std but with row names replaced by human-readable gene names, and column names by human-readable sample labels | Use the command `View()` or `head()` or any other convenient way of your choice to check the content of these variables. ## Discard the unclassified samples Generate an expression matrix and a metadata table without the unclassified samples, since these ones cannot be used to train a classifier. ```{r discard_unclassified} #### Discard unclassified #### bic_meta_ok <- bic_meta[bic_meta$cancer.type != "Unclassified",] ## Check that the metadata table has only the right classes nrow(bic_meta_ok) table(bic_meta_ok$cancer.type) ## Select the corresponding columns (samples) of the expression table ## We also transpose the data table for the supoervised bic_expr_class_ok <- t(bic_expr_labels[, bic_meta_ok$label]) dim(bic_expr_class_ok) bic_classes_ok <- bic_meta_ok$cancer.type ``` ## Split the dataset into a testing and a training set Split the sample set into - a training (2/3 of the samples) - a testing set (the other 1/3 of the samples) ```{r split_training_testing} #### Split samples into training and testing sets #### ## Count the number of samples (total, train and test) nsamples <- nrow(bic_expr_class_ok) nsamples_train <- round(nsamples * 2/3) nsamples_test <- nsamples - nsamples_train ## Perfoirm a random sampling of indices resampled_indices <- sample(1:nsamples, replace = FALSE) train_indices <- resampled_indices[1:nsamples_train] test_indices <- setdiff(resampled_indices, train_indices) ``` ## Training a classifier with the train set Use the `svm()`funcvtion to train a support vector machine with the training set. ```{r train_svm} #### Train the SVM with the training subset #### svm_kernel = "radial" ## Define training set: expression matrix and class labels training_expr <- bic_expr_class_ok[train_indices, ] training_classes <- bic_meta_ok[train_indices, "cancer.type"] table(training_classes) ## Train the SVM svm_model <- svm(x = training_expr, y = as.factor(training_classes), type = "C-classification", kernel = svm_kernel) ``` ## Predicting the classes of the testing set ```{r predict_test} #### Predict the classes of the testing set #### testing_expr <- bic_expr_class_ok[test_indices, ] testing_pred <- as.vector(predict(svm_model, testing_expr)) ## Check the number of predicted samples per class table(testing_pred) ``` ```{r eval_perf} #### Evaluate the performances of the classifier #### ## Generate a contingency table comparing ## the known and predicted classes for the testing set ## Get the annotated classes for the testing set testing_classes <- bic_meta_ok[test_indices, "cancer.type"] table(testing_classes) ## Compare the annotated and predicted cancer types contingency <- table(testing_classes, testing_pred) kable(contingency, row.names = TRUE, caption = ) ## Compute the number of correct predictions errors <- testing_pred != testing_classes ## Compute the misclassification error rate (MER) mer <- sum(errors) / length(errors) message("MER = ", mer) ``` In this training / testing experiment we obtained a misclassification error rate (MER) of $MER = `r mer`$. However, we just performed a single random sampling, so we can expect to obtain different results if we repaeat this experiment. Several strategies are classically used to obtain a more reliable estimation of the performances : 1. **K-fold cross-validation:** split the dataset into $k$ subsets, each of which is used as test for one train/test experiment (and the $k-1$ remaining ones are then used for training). We could for example run a 10-fold cross-validation with this dataset. 2. **Leave-one-out** (**LOO**, also called **jack-knife**) is a specific case of k-fold cross-validation where $k = n$, i.e. each individual (biological sample in our case) is in turn discarded from the data set ("left out"), a classifier is trained with the $n - 1$ remaining individuals, and the trained classifier ("model") is used to predict the class of the left out individual. 3. **Iterative subsampling** consists in repeating the above procedure, where we select a given proportion of the individuals for training (e.g. 2/3) and the rest for testing. Each of these methods of performance estimation has its pros and cons. For this course, we will run a **collective iterative subsampling**: each trainee will run a trainig/test experiment and we will write the results in a shared spreadsheet, which will then enable us to compute some statistics on the MER values (mean, standard deviation, min, max). ## Exercise: impact of the SVM kernel Based on the above example, test the 4 SVM kernels (linear, radial, sigmoid, polynomial) and compare the performances. Which kernel gives the best results. ## Exercise: impact of the number of variables (genes) Use the best kernel defined above, and estimate the MER with the 100, 200, 300, 500, 1000 top genes respectively. ## SVM tuning use the `tune.svm()` function to find the optimal SVM parameters. ```{r tune_svm} #### Tune SVM #### svm_tuning_result <- tune.svm( x = training_expr, y = as.factor(training_classes)) ``` ## Session info As usual, we write the session info in the report for the sake of traceability. ```{r session_info} sessionInfo() ```