--- title: "Clustering Algorithms" output: rmarkdown::html_vignette: toc: true description: > Vary clustering algorithm to expand or refine the space of generated cluster solutions. vignette: > %\VignetteIndexEntry{Clustering Algorithms} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` Download a copy of the vignette to follow along here: [clustering_algorithms.Rmd](https://raw.githubusercontent.com/BRANCHlab/metasnf/main/vignettes/clustering_algorithms.Rmd) ## Clustering Algorithms SNF produces a single similarity matrix that is meant to describe how similar all the observations are to each other across all the provided input features. Dividing that similarity matrix into subtypes requires can be done using clustering algorithms. Within the metasnf package, clustering is done by default using the spectral clustering algorithm (as implemented by the original SNFtool package). The code below goes over what the default clustering looks like. ### Default clustering ```{r} # Load the package library(metasnf) # Start by making a data list containing all our dataframes to more easily # identify subjects without missing data full_data_list <- generate_data_list( list(subc_v, "subcortical_volume", "neuroimaging", "continuous"), list(income, "household_income", "demographics", "continuous"), list(pubertal, "pubertal_status", "demographics", "continuous"), list(anxiety, "anxiety", "behaviour", "ordinal"), list(depress, "depressed", "behaviour", "ordinal"), uid = "unique_id" ) # Partition into a data and target list (optional) data_list <- full_data_list[1:3] target_list <- full_data_list[4:5] # Specifying 5 different sets of settings for SNF set.seed(42) settings_matrix <- generate_settings_matrix( data_list, nrow = 5, max_k = 40 ) # This matrix has clustering solutions for each of the 5 SNF runs! solutions_matrix <- batch_snf(data_list, settings_matrix) extended_solutions <- extend_solutions( solutions_matrix, target_list, cat_test = "fisher_exact" ) ``` ```{r eval = FALSE} clust_esm_manhattan <- esm_manhattan_plot( extended_solutions, threshold = 0.05, bonferroni_line = TRUE ) ggplot2::ggsave( "clust_esm_manhattan.png", clust_esm_manhattan, width = 5, height = 5, dpi = 100 ) ``` ![](https://raw.githubusercontent.com/BRANCHlab/metasnf/main/vignettes/clust_esm_manhattan_l.png) The Manhattan plot shows the p-values (y-axis) of the associations between our target features (x-axis) and each cluster solution we calculated (colour) for each row of the settings matrix. The information about the clustering used is tucked away in the settings matrix: ```{r} settings_matrix$"clust_alg" ``` The "1" corresponds to spectral clustering using the eigen-gap heuristic to determine the number of clusters, while the "2" corresponds to spectral clustering using the rotation cost heuristic to determine the number of clusters. You can find this information by running `?generate_settings_matrix`. ### Other built-in clustering options Currently, the available clustering algorithms are: * `spectral_eigen` * `spectral_rot` * `spectral_two` * `spectral_three` * `spectral_four` * `spectral_five` * `spectral_six` * `spectral_seven` * `spectral_eight` The first two are the defaults, and the remaining ones specifically use 2, 3, 4, ... and so on as their number of clusters rather than whatever is calculated by a separate heuristic function. To make use of any of these alternative algorithms, you'll need to provide `batch_snf` with a custom `clust_algs_list`. Here's what that looks like: ```{r} clust_algs_list <- generate_clust_algs_list() # The default list: clust_algs_list # A prettier format: summarize_clust_algs_list(clust_algs_list) # Adding algorithms provided by the package clust_algs_list <- generate_clust_algs_list( "two_cluster_spectral" = spectral_two, "five_cluster_spectral" = spectral_five ) # Note that this one has the default algorithms as well as the newly added ones summarize_clust_algs_list(clust_algs_list) # This list has only the newly added ones, thanks to the disable_base parameter clust_algs_list <- generate_clust_algs_list( "two_cluster_spectral" = spectral_two, "five_cluster_spectral" = spectral_five, disable_base = TRUE ) summarize_clust_algs_list(clust_algs_list) ``` **By default, the settings matrix only varies over the values 1 and 2**. This is because, by default, it only expects you to use the default clustering algorithms. If your custom list is two algorithms long, that's fine, but if it isn't, it's imperative that you either manually adjust the numbers in `settings_matrix$"clust_algs"` or (more easily) you supply your custom list during settings matrix generation: ```{r} # This list has only the newly added ones, thanks to the disable_base parameter clust_algs_list <- generate_clust_algs_list( "two_cluster_spectral" = spectral_two, "three_cluster_spectral" = spectral_three, "five_cluster_spectral" = spectral_five ) # Specifying 5 different sets of settings for SNF set.seed(42) settings_matrix <- generate_settings_matrix( data_list, nrow = 10, max_k = 40, clustering_algorithms = clust_algs_list ) settings_matrix$"clust_alg" ``` Then, make sure to provide the `clust_algs_list` once more during the call to `batch_snf`: ```{r eval = FALSE} solutions_matrix <- batch_snf( data_list, settings_matrix, clust_algs_list = clust_algs_list ) ``` ### Structure of a clustering algorithm function Any clustering algorithm can be used as long as you can write a function for it with the following format: 1. Takes a single N*N similarity_matrix as its only input 2. Returns a named list: * The first item (named "solution") is a single N-dimensional vector of numbers corresponding to the observations in the similarity matrix * The second item (named "nclust") is a single integer indicating the number of clusters that the algorithm is supposed to have generated Note that the number of clusters parameter fed into `SNFtool::spectralClustering` doesn't always return a solution containing that many clusters. For example, an affinity matrix where every observation is identical to all other observations could not yield more than a 1 cluster solution, regardless of how many clusters were requested. Also note that the function should *not* take number of clusters as an argument - if you want to explore the same clustering algorithm with a varying number of clusters, you'll need to provide a separate function for each number of clusters you're interested in. The source code for the two default functions are shown below: ```{r eval = FALSE} # Default clustering algorithm #1 spectral_eigen <- function(similarity_matrix) { estimated_n <- estimate_nclust_given_graph( W = similarity_matrix, NUMC = 2:10 ) nclust_estimate <- estimated_n$`Eigen-gap best` solution <- SNFtool::spectralClustering( similarity_matrix, nclust_estimate ) nclust <- length(unique(solution)) solution_data <- list("solution" = solution, "nclust" = nclust) if (nclust_estimate != nclust) { warning( "Spectral clustering provided a solution of size ", nclust, " when the number requested based on the eigen-gap heuristic", " was ", nclust_estimate, "." ) } return(solution_data) } # Default clustering algorithm #2 spectral_rot <- function(similarity_matrix) { estimated_n <- estimate_nclust_given_graph( W = similarity_matrix, NUMC = 2:10 ) nclust_estimate <- estimated_n$`Rotation cost best` solution <- SNFtool::spectralClustering( similarity_matrix, nclust_estimate ) nclust <- length(unique(solution)) solution_data <- list("solution" = solution, "nclust" = nclust) if (nclust_estimate != length(unique(solution))) { warning( "Spectral clustering provided a solution of size ", nclust, " when the number requested based on the rotation cost heuristic", " was ", nclust_estimate, "." ) } return(solution_data) } ``` ### Non-automated clustering You can also extract the similarity matrices for each computed row of the settings matrix and perform the clustering more "manually". This is particularly useful for clustering procedures where the transition from a similarity matrix to the final solution requires human intervention (e.g., judgement for clustering hyperparameters). ```{r} batch_snf_results <- batch_snf( data_list, settings_matrix, clust_algs_list = clust_algs_list, return_similarity_matrices = TRUE ) names(batch_snf_results) solutions_matrix <- batch_snf_results$"solutions_matrix" # Similarity matrices are in the list below: similarity_matrices <- batch_snf_results$"similarity_matrices" length(similarity_matrices) dim(similarity_matrices[[1]]) # Your manual clustering goes here... ``` ### Example of non-automated clustering: DBSCAN Let's say we wanted to cluster our similarity matrices with DBSCAN rather than with spectral clustering. Start by taking a look at the documentation for running the DBSCAN function from the dbscan R package [https://cran.r-project.org/package=dbscan](https://cran.r-project.org/package=dbscan): DBSCAN is about as challenging as custom clustering can get, as the suggested process for specifying the number of clusters is recommended to involve human intervention and the function itself operates on dissimilarity (distance) matrices rather than similarity matrices. Below is a slightly adjusted form of their `dbscan` example. You can work through this on your own. ```{r eval = FALSE} library(dbscan) ## Example 1: use dbscan on the iris data set data(iris) iris <- as.matrix(iris[, 1:4]) iris_dist <- dist(iris) ## Find suitable DBSCAN parameters: ## 1. We use minPts = dim + 1 = 5 for iris. A larger value can also be used. ## 2. We inspect the k-NN distance plot for k = minPts - 1 = 4 kNNdistplot(iris, minPts = 5) ## Noise seems to start around a 4-NN distance of .7 abline(h=.7, col = "red", lty = 2) results <- dbscan(iris_dist, eps = 0.7, minPts = 5) # The 1 is added to ensure that those with no cluster (cluster 0) are still # plotted. pairs(iris, col = results$cluster + 1) ``` After some poking around, you will often see people mentioning that DBSCAN just isn't meant to be automated. You need to look at your data to have a good idea of what values to use for the hyperparameters `eps` and `minPts`. When you need to get involved manually, that's the perfect time to manage the many similarity matrices you've created with `batch_snf` through meta clustering. Generate a wide range of similarity matrices and apply meta clustering to find a few representative similarity matrices. One way to do this would be based on how the spectral-clustering derived solutions end up clustering together (see [example](https://branchlab.github.io/metasnf/articles/a_complete_example.html#examining-meta-clusters)). Then take the corresponding affinity matrices and go through the dbscan clustering process manually. Based on `?dbscan`, it looks like the function can accept precomputed distance matrices (instead of `precomputed_nn_object`s) as long as they actually are `dist` objects (which can be done using the `as.dist()` function). There are many formulas to convert between similarity matrices and distance matrices which have their own pros and cons. Here, we'll use a common approach of `distance = max(similarity) - similarity`. Whichever matrix values had the maximum similarity will now have a distance of 0, and whichever matrix values had the lowest amount of similarity will have distance values that are the closest to the former maximum similarity value. ```{r fig.width = 5, fig.height = 4.5} library(dbscan) library(ggplot2) data_list <- generate_data_list( list( data = expression_df, name = "genes_1_and_2_exp", domain = "gene_expression", type = "continuous" ), list( data = methylation_df, name = "genes_1_and_2_meth", domain = "gene_methylation", type = "continuous" ), uid = "patient_id" ) set.seed(42) settings_matrix <- generate_settings_matrix( data_list, nrow = 5 ) batch_snf_results <- batch_snf( data_list, settings_matrix, return_similarity_matrices = TRUE ) similarity_matrices <- batch_snf_results$"similarity_matrices" solutions_matrix <- batch_snf_results$"solutions_matrix" representative_sm <- similarity_matrices[[1]] representative_sms <- similarity_matrices[c(1, 2)] distance_matrix1 <- as.dist( max(representative_sm) - representative_sm ) kNNdistplot( distance_matrix1, minPts = 10 ) ## Maybe there? abline(h=0.4872, col = "red", lty = 2) dbscan_results <- dbscan(distance_matrix1, eps = 0.4872, minPts = 10)$"cluster" spectral_results <- get_clusters(solutions_matrix[1, ]) dbscan_vs_spectral <- data.frame( dbscan = dbscan_results, spectral = spectral_results ) ggplot(dbscan_vs_spectral, aes(x = dbscan, y = spectral)) + geom_jitter(height = 0.1, width = 0.1, alpha = 0.5) + theme_bw() ``` There's one bold lie in the code chunk above, which is how easy it was to find that magic hyperparameter combination of `minPts = 10` and `eps = 0.487` parameter value of 0.4872. It wasn't based off of the visual inspection of the kNNdistplot, but rather through a lot of trial and error. Something along the lines of: ```{r eval = FALSE} for (i in seq(0.485, 0.488, by = 0.0001)) { results <- dbscan(distance_matrix1, eps = i, minPts = 10) if (length(unique(results$"cluster")) == 3) { print(i) } } ``` In this specific instance, those hyperparameters are incredibly sensitive - a slight change will get very different results. That is likely due to the shape of the actual data being clustered. Your mileage may vary!