--- title: "Stability Measures" output: rmarkdown::html_vignette: toc: true description: > Evaluating robustness of cluster solutions through resampling methods. vignette: > %\VignetteIndexEntry{Stability Measures} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` Download a copy of the vignette to follow along here: [stability_measures.Rmd](https://raw.githubusercontent.com/BRANCHlab/metasnf/main/vignettes/stability_measures.Rmd) In this vignette, we will highlight the main stability measure options in the metasnf package. ## Data set-up ```{r eval = FALSE} library(metasnf) data_list <- generate_data_list( list(cort_t, "cortical_thickness", "neuroimaging", "continuous"), list(cort_sa, "cortical_area", "neuroimaging", "continuous"), list(subc_v, "subcortical_volume", "neuroimaging", "continuous"), list(income, "household_income", "demographics", "continuous"), list(pubertal, "pubertal_status", "demographics", "continuous"), uid = "unique_id" ) set.seed(42) settings_matrix <- generate_settings_matrix( data_list, nrow = 4, max_k = 40 ) # Cluster solutions made using the full data solutions_matrix <- batch_snf(data_list, settings_matrix) ``` To begin start calculating resampling-based stability measures, we'll build subsamples of the data list using the `subsample_data_list` function. ```{r eval = FALSE} data_list_subsamples <- subsample_data_list( data_list, n_subsamples = 100, subsample_fraction = 0.85 ) ``` `data_list_subsamples` contains a list of 100 subsamples of the full data list. Each variation only has a random 85% of the original observations. ```{r eval = FALSE} batch_subsample_results <- batch_snf_subsamples( data_list_subsamples, settings_matrix ) ``` By default, the function returns a one-element list: `cluster_solutions`, which is itself a list of cluster solution data frames corresponding to each of the provided data list subsamples. Setting the parameters `return_similarity_matrices` and `return_solutions_matrices` to `TRUE` will turn the result of the function to a three-element list containing the corresponding solutions matrices and final fused similarity matrices of those cluster solutions, should you require these objects for your own stability calculations. The function `subsample_pairwise_aris` can then be used to calculate the ARIs between cluster solutions across the subsamples. ```{r eval = FALSE} subsample_cluster_solutions <- batch_subsample_results[["cluster_solutions"]] pairwise_aris <- subsample_pairwise_aris( subsample_cluster_solutions, return_raw_aris = TRUE ) ``` `pairwise_aris` is a list containing a summary data frame of the ARIs between subsamples for each row of the original settings matrix, as well as another list of all the generated inter-subsample ARIs as a result of setting `return_raw_aris` to `TRUE`. The raw inter-subsample ARIs corresponding to a particualr settings matrix row can be visualized with a heatmap: ```{r eval = FALSE} ComplexHeatmap::Heatmap( raw_aris[[1]], heatmap_legend_param = list( color_bar = "continuous", title = "Inter-Subsample\nARI", at = c(0, 0.5, 1) ), show_column_names = FALSE, show_row_names = FALSE ) ```
![](https://raw.githubusercontent.com/BRANCHlab/metasnf/main/vignettes/ss_ari_hm_l.png){width=700px}
To calculate information about how often each pair of observations clustered together across the subsamples, we can use the `calculate_coclustering` function: ```{r eval = FALSE} coclustering_results <- calculate_coclustering( subsample_cluster_solutions, solutions_matrix ) ``` Note that the runtime of this function scales quadratically with the number of observations and linearly with the number of subsamples. The output of `calculate_coclustering` is a list containing the following components: - `cocluster_dfs`: A list of data frames, one per cluster solution, that shows the number of times that every pair of subjects in the original cluster solution occurred in the same subsample, the number of times that every pair clustered together in a subsample, and the corresponding fraction of times that every pair clustered together in a subsample. - `cocluster_ss_mats`: The number of times every pair of subjects occurred in the same subsample, formatted as a pairwise matrix. - `cocluster_sc_mats`: The number of times every pair of subjects occurred in the same cluster, formatted as a pairwise matrix. - `cocluster_cf_mats`: The fraction of times every pair of subjects occurred in the same cluster, formatted as a pairwise matrix. - `cocluster_summary`: Among pairs of subjects that clustered together in the original cluster solution, the mean fraction those pairs remained clustered together across the subsample-derived solutions. This information is formatted as a data frame with one row per cluster solution. The `cocluster_dfs` component can be used to visualize co-clustering across subsamples as a density plot: ```{r eval = FALSE} cocluster_dfs <- coclustering_results$"cocluster_dfs" cocluster_density(cocluster_dfs[[1]]) ```
![](https://raw.githubusercontent.com/BRANCHlab/metasnf/main/vignettes/cocluster_density_l.png){width=700px}
Or as a heatmap: ```{r eval = FALSE} # Fraction of co-clustering between observations, grouped by original # cluster membership cocluster_heatmap( cocluster_dfs[[1]], data_list = data_list, top_hm = list( "Income" = "household_income", "Pubertal Status" = "pubertal_status" ), annotation_colours = list( "Pubertal Status" = colour_scale( c(1, 4), min_colour = "black", max_colour = "purple" ), "Income" = colour_scale( c(0, 4), min_colour = "black", max_colour = "red" ) ) ) ```
![](https://raw.githubusercontent.com/BRANCHlab/metasnf/main/vignettes/cc_hm_l.png){width=700px}