---
title: "NMI Scores"
output:
    rmarkdown::html_vignette:
        toc: true
description: >
  Calculate how important various features were to the final SNF cluster solution.
vignette: >
  %\VignetteIndexEntry{NMI Scores}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

Download a copy of the vignette to follow along here: [nmi_scores.Rmd](https://raw.githubusercontent.com/BRANCHlab/metasnf/main/vignettes/nmi_scores.Rmd)

Normalized mutual information (NMI) scores were used in the original `SNFtool` package as a unitless way to compare the relative importance of different features in a final cluster solution.
The premise of this approach is that if a feature was very important, clustering on that feature alone should produce a solution very similar to the one generated by clustering on all the features together.

In the original `SNFtool` implementation, the cluster solution based on the individual feature being assessed was always generated using squared Euclidean distance, a K hyperparameter value of 20, an alpha hyperparameter value of 0.5, and spectral clustering with the number of clusters chosen by the best eigen-gap value among possible solutions spanning 2 to 5 clusters.

In contrast, the `metasnf` implementation leverages all the architectural details and hyperparameters supplied in the original `settings_matrix` and `batch_snf()` call to make the solo-feature and all-feature solutions as comparable as possible.

The chunk below outlines how the primary NMI calculating function, `batch_nmi()`, can be used.

```{r}
library(metasnf)

data_list <- generate_data_list(
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    list(anxiety, "anxiety", "behaviour", "ordinal"),
    list(depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)

set.seed(42)
settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 20,
    min_k = 20,
    max_k = 50
)

# Generation of 20 cluster solutions
solutions_matrix <- batch_snf(data_list, settings_matrix)

# Let's just calculate NMIs of the anxiety and depression data types for the
# first 5 cluster solutions to save time:
feature_nmis <- batch_nmi(data_list[4:5], solutions_matrix[1:5, ])

print(feature_nmis)
```

One important thing to note is that if the cluster space you initially set up when calling `batch_snf()` relied on custom distance metrics, clustering algorithms, or the `automatic_standard_normalize` parameter, you should supply those same values when calling `batch_nmi()` as well.

Another important note is that by default, `batch_nmi()` will ignore the `inc_*` columns of the settings matrix, i.e., no data types are dropped during solo-feature cluster solution calculations.
This can lead to a somewhat odd interpretation if you view NMI as a direct reflection of contribution to the final SNF output: a feature that was not part of a particular cluster solution can still produce its own cluster solution that has a very high NMI score with the original one.
If you wish to suppress the calculation of NMIs for features that were not actually included in a particular SNF run (those with a 0 value in the inclusion column), you can set the `ignore_inclusions` parameter to `FALSE`.

Finally, if you'd like the NMI information to be presented in a transposed format, you can do that too by setting `transpose` to `FALSE`.
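
To make the premise behind these scores concrete, the chunk below is a minimal base R sketch of how an NMI between two cluster label vectors could be computed. The helper names (`entropy()`, `mutual_information()`, `nmi_sketch()`) are hypothetical and are not part of `metasnf` or `SNFtool`. Normalizing by the geometric mean of the two entropies is an assumption made for illustration, so the values may not match what `batch_nmi()` reports.

```{r}
# Shannon entropy (natural log) of a vector of cluster labels.
entropy <- function(labels) {
    p <- table(labels) / length(labels)
    -sum(p * log(p))
}

# Mutual information between two equal-length label vectors.
mutual_information <- function(x, y) {
    joint <- table(x, y) / length(x)
    # Joint distribution expected if the two labelings were independent
    expected <- outer(rowSums(joint), colSums(joint))
    nonzero <- joint > 0
    sum(joint[nonzero] * log(joint[nonzero] / expected[nonzero]))
}

# Hypothetical helper: NMI normalized by the geometric mean of the entropies
# (an assumed normalization, chosen for illustration only).
nmi_sketch <- function(x, y) {
    stopifnot(length(x) == length(y))
    denominator <- sqrt(entropy(x) * entropy(y))
    # A labeling with a single cluster carries no information
    if (denominator == 0) {
        return(0)
    }
    mutual_information(x, y) / denominator
}

# Toy labels standing in for an all-feature solution and two solo-feature ones
full_solution <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
relabelled <- c(2, 2, 2, 3, 3, 3, 1, 1, 1)
unrelated <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)

nmi_sketch(full_solution, relabelled)
nmi_sketch(full_solution, unrelated)
```

In this toy example, the relabelled solution groups observations identically to the full solution, so its score is 1 (up to floating point error) even though the cluster numbers differ. The unrelated solution shares no information with the full solution and scores approximately 0. A solo-feature solution with a high score is, in this sense, one that largely reproduces the all-feature grouping.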