---
title: "Region-based analysis of surface-based neuroimaging data in R"
author: "Tim Schäfer"
date: "March 02, 2022"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
rgl::setupKnitr()
```

This script demonstrates one way to analyze atlas-based neuroimaging data in R. It uses data from the [publicly available ABIDE I data set](https://fcon_1000.projects.nitrc.org/indi/abide/) (pre-processed with [FreeSurfer](https://freesurfer.net/) v5.3) to analyze group differences in cortical thickness between neurotypical controls and people with ASD. The sample is from [this DiMartino *et al.* paper](https://doi.org/10.1038/mp.2013.78) and all male. The brain regions are defined by the [Desikan-Killiany atlas](https://doi.org/10.1016/j.neuroimage.2006.01.021).


### Load required packages

```{r, message=FALSE}
library("brainnet");  # this package
library("fsbrain");   # loading neuroimaging data and visualizing results.
library("readxl");    # to read Excel demographics files
library("MatchIt");   # matching of patient/control groups
library("Rglpk");     # a solver for cardinality matching with MatchIt using the "glpk" solver (the default in this script). Requires sys deps, e.g., `sudo apt install libglpk-dev` under Linux. Use Homebrew to get it under MacOS.
library("ggplot2");   # general purpose plots
```


### Settings

```{r}
do_plot = F;    # This is a lot faster without plotting. Disable plotting during development.
do_export_high_quality_plots = FALSE;
```


```{r, echo=FALSE, results=FALSE, message=FALSE}
# Make sure we have the brain templates for plotting.
if(do_plot) {
    fsbrain::download_optional_data();
    fsbrain::download_fsaverage(accept_freesurfer_license = TRUE);
    fsbrain::fsbrain.set.default.figsize(700, 700);
}
```


## Load and process demographics and other metadata

The demographics data includes the group membership, and covariates like age, gender, and IQ. The total brain measurements are also
loaded by the `load_ABIDE_metadata_males()` function and merged with the other metadata. They are typically used to correct for
total brain volume or similar measures.

```{r}
subjects_dir = brainnet:::get_ABIDE_path_on_tims_machines(); # replace with the FreeSurfer SUBJECTS_DIR containing the neuroimaging data.
if(! dir.exists(subjects_dir)) {
    warning(sprintf("The subjects_dir '%s' does not exist. This is okay if you only use the CSV data that comes with the brainnet package.\n", subjects_dir));
}

md = load_ABIDE_metadata_males(impute_data = TRUE);
demographics = md$merged; # Extract the field that contains the merged brainstats and demographics.
```


### Define your sample

You task: Define your own sample here.

The sample should be a subset of the available data (rows) in the data.frame 'demographics' that:

1. must follow your inclusion criteria (=> e.g.,define a suitable age range for your research question and filter the subjects accordingly)
2. must not include bad quality scans (=> perform some sort of quality inspection of the data and filter subjects that do not pass the quality criteria)
3. should achieve at least approximate balance between covariates for the groups (=> check for significant IQ/age/... differences between controls and cases and reduce them using matching).

#### Data Quality checks

The next lines show a basic QC example:

```{r}
qc_segstats_files = aparcstats_files_ABIDE(measure = "thickness"); # There is no need to use the same measure for QC and your GLM analysis below, leave this alone if in doubt.
qc = fsbrain::qc.from.segstats.tables(qc_segstats_files$lh, qc_segstats_files$rh); # You can provide your own files as well (and will need to do so if you do not use the ABIDE/IXI datasets for which we provide the files).
# qc = fsbrain::qc.for.group(subjects_dir, subjects_list, measure = "thickness", atlas = "aparc"); # This could be used if you cannot get the files (which is very unlikely if you have the data required to run this command). It is *a lot* slower than reading the pre-computed file. You should NOT use this unless there really is no other way.
bad_quality_subjects = unique(c(qc$lh$failed_subjects, qc$rh$failed_subjects)); # extract failed subjects from quality check results.
my_demographics_qc = subset(demographics, !(demographics$subject_id %in% bad_quality_subjects)); # Remove all bad quality scans.
cat(sprintf("Excluded %d of %d total subjects due to bad scan quality. %d left.\n", length(bad_quality_subjects), nrow(demographics), nrow(my_demographics_qc)));
```


#### Apply your inclusion criteria

Here is an example for inclusion/exclusion criteria: we only keep subjects with IQ >= 70 and age between 12 and 18 years.

```{r}
my_demographics_qc_incl = my_demographics_qc;
my_demographics_qc_incl = subset(my_demographics_qc_incl, (my_demographics_qc_incl$age >= 12 & my_demographics_qc_incl$age <= 18)); # Filter for age range.
my_demographics_qc_incl = subset(my_demographics_qc_incl, (my_demographics_qc_incl$iq >= 70.0)); # Filter by IQ.
num_excluded_by_inclusion_criteria = nrow(my_demographics_qc) - nrow(my_demographics_qc_incl);
cat(sprintf("Excluded %d of %d total subjects due to inclusion criteria. %d left.\n", num_excluded_by_inclusion_criteria, nrow(demographics), nrow(my_demographics_qc_incl)));
```


#### Matching

Now for the matching. We use cardinality matching here, but that may not be what you want. Read the [documentation for MatchIt](https://cran.r-project.org/web/packages/MatchIt/vignettes/) to see (many!) other options.

```{r}
## The variables you need to match on also depend on your research question (mainly on the descriptor used).
solver = "glpk"; # use "gurobi" if you have or can install it, or "glpk" if you do not have gurobi. See the MatchIt documentation for reasons.
mymatch = MatchIt::matchit(group ~ age + iq + totalBrainVolume, data = my_demographics_qc_incl, method = "cardinality", solver = solver, time = 60*5);
## Show an overview of the improvements/changes before and after matching. You should look at this for various matching algorithms/settings.
# summary(mymatch);
## Get the matched data.
my_demographics_qc_incl_matched = MatchIt::match.data(mymatch);
my_demographics_qc_incl_matched$weights = NULL; # delete the 'weights' column added by matching, it is useless (all ones) in the case of cardinality matching.
num_excluded_by_matching = nrow(my_demographics_qc_incl) - nrow(my_demographics_qc_incl_matched);
cat(sprintf("Excluded %d of %d total subjects due to matching. %d left.\n", num_excluded_by_matching, nrow(demographics), nrow(my_demographics_qc_incl_matched)));
```

All done, let us actually use our modified demographics:

```{r}
demographics = my_demographics_qc_incl_matched;
```


Your sample should be complete by this line, with results in variable 'demographics'.

#### Sample overview

Print some information on the sample to make sure it matches your expectations:

```{r}
subjects_list = as.character(demographics$subject_id);

subjects_control = subjects_list[demographics$group == "control"];
subjects_asd = subjects_list[demographics$group == "asd"];
cat(sprintf("Working with %d subjects in total: %d ASD and %s controls.\n", length(subjects_list), length(subjects_asd), length(subjects_control)));
if(length(subjects_asd) + length(subjects_control) != length(subjects_list)) {
    stop("Invalid group assignment to case/control: maybe wrong factor levels/group names (expected 'control' and 'asd'), or more than 2 groups.");
}
```


## Load the FreeSurfer neuroimaging data

First we specify which data we want to load: the atlas and measure are the important ones.

```{r}
measure="thickness";  ## The native space descriptor to load from the subjects 'surf' directory, its values will be aggregated with the atlas regions (see 'atlas' below).
hemi="split";   ## For which hemisphere to compute the results. One of 'lh' for left only, 'rh' for right only, or 'split' to compute (separately) for both hemispheres. Leave this alone.
atlas="aparc";  ## The atlas you want, 'aparc' for Desikan-Killiany atlas, 'aparc.a2009s' for Destrieux atlas, 'aparc.DKTatlas40' for DKT atlas, or your custom atlas. See https://surfer.nmr.mgh.harvard.edu/fswiki/CorticalParcellation for details.
```

Now we actually load the data. There are 2 options:

1) Aggregate the native space data by atlas region from the raw per-vertex data files and native space parcellations in R. This takes quite a bit of time due to data loading for a large data set (and slow hard disks/networks). We do no do this here, but we still illustrate how it could be done:

```{r, eval = FALSE}
# This is not run.
braindata = fsbrain::group.agg.atlas.native(subjects_dir, subjects_list, measure=measure, hemi=hemi, atlas=atlas);
```

2) Alternatively, one could load CSV files produced by the FreeSurfer tool `aparcstats2table` for your descriptor (one CSV file per hemisphere). Then you would only need to do the computation once (when running that command), so here in R all you need to do is to load the resultung CSV files, which is very fast.

```{r}
model_data_segstats_files = aparcstats_files_ABIDE(measure = measure); # for ABIDE, we ship the files with this package for some measures.
braindata = region_data_from_segstat_tables(model_data_segstats_files$lh, model_data_segstats_files$rh);
braindata = subset(braindata, braindata$subject %in% demographics$subject_id); # the CSV file is for all ABIDE subjects, need to limit to our sample.
```


### Data postprocessing

#### Optional: visualize the data for a subject.

We do not actually do this, we just illustrate how it could be done.

```{r, eval = FALSE}
# This is not run.
if(do_plot) {
    fsbrain::vis.subject.morph.native(subjects_dir, subjects_list[1], measure = measure, draw_colorbar = TRUE);
}
```

#### Data cleanup

Remove empty atlas regions we do not want. Here we identify atlas regions (columns in the `data.frame`) that contain `NA` values and remove them. These are regions which were not assigned any vertices, meaning that the mean function used to aggregate data over the vertices of that brain region returned `NA`. This typically happens for the `unknown` region and for the region representing the medial wall (that region is named `corpuscallosum` in the Desikan atlas). Whether such columns exist depends on the atlas and the method to load the `braindata`.

```{r}
# Remove NA columns/empty atlas regions, if any.
na_column_indices = unique(which(is.na(braindata), arr.ind = TRUE)[,2]);
if(length(na_column_indices) > 0L) {
  del_cols = colnames(braindata)[na_column_indices];
  cat(sprintf("Ignoring %d of %d total brain atlas regions because their mean value is NA: %s.\n", length(na_column_indices), (ncol(braindata)-1L), paste(del_cols, collapse = ",")));
  braindata[,na_column_indices]=NULL;
} else {
  cat(sprintf("All %d brain atlas regions contain valid values.\n", (ncol(braindata)-1L))); # The -1 is for the 'subject' column.
}
```


#### Data format conversion

Some final data preparations: we need to exclude the non-numerical 'subject' column from the data and turn them into a `data.matrix` for the statistical analysis.

```{r}
slm_braindata = braindata;
slm_braindata$subject = NULL; # remove non-numerical 'subject' column.
slm_braindata = data.matrix(slm_braindata);
```

#### Data Normalization

Do you want to normalize the predictors (age, iq, meanCT/brainvol)?
What are the advantages and disadvantages?
If you want to do it, you can use `base::scale()` here on your data.

```{r}
# TODO: normalize here.
```

## Statistical analysis

Specify and run the model.

### Run the GLMs (on per atlas region/hemi)

Prepare the formula and compute effect sizes.


```{r}
mm <- model.matrix(~ factor(demographics$group) + demographics$age + demographics$iq + factor(demographics$site) + demographics$totalMeanCorticalThickness);

predictors <- c("group", "age", "iq", "site", "mean_ct"); # names for the predictors in the model.matrix
slm_ef_res <- brainnet::slm_effect_sizes(mm, slm_braindata, predictors, c("cohens.f", "etasq", "power"));
#slm_t_res <- brainnet::slm_t(mm, slm_braindata, model.term = "group");
```


You now have the effect sizes for all predictors in variable `slm_res$cohens.f`.


### Visualization of results

Optional: here we plot the effect size by atlas region.

```{r rgl=TRUE, dev='png', warning = 'hide', echo=FALSE, results=FALSE}
# This section only defines the plotting function, just ignore it.
plot_slm_ef_res <- function(slm_ef_res, measure, predictor, do_export = FALSE) {
  res = list("");
  measure_values = slm_ef_res[[measure]][predictor, ];
  effect_sizes_by_hemi = fsbrain::hemilist.from.prefixed.list(measure_values);
  cm = fsbrain::vis.region.values.on.subject(fsbrain::fsaverage.path(), 'fsaverage', lh_region_value_list = effect_sizes_by_hemi$lh, rh_region_value_list = effect_sizes_by_hemi$rh, atlas = atlas, draw_colorbar = TRUE);
  res$cm = cm;
  
  if(do_export) {
    output_img = sprintf("%s_%s.png", measure, predictor);
    cat(sprintf("Writing '%s' figure for predictor '%s' to file '%s' in directory '%s'.\n", measure, predictor, output_img, getwd()));
    img = fsbrain::export(cm, output_img = output_img, colorbar_legend = sprintf("%s for %s ", measure, predictor));
    res$img = img;
  }
  return(res);
}
```


```{r rgl=TRUE, dev='png', warning = 'hide'}
if(do_plot) {

    measure = "cohens.f";           # Choose one. Can use "cohens.f", "etasq", or "power".
    predictor = "group";
    
    cat(sprintf("Showing %s for predictor %s.\n", measure, predictor));
    plot_res = plot_slm_ef_res(slm_ef_res, measure, predictor, do_export = do_export_high_quality_plots);
}
```


```{r, warning = 'hide'}
if(do_plot) {
    # A single effect size (or goodness-of-fit, or power) violin plot for all predictors.
    ggplot_instance = brainnet::effect_size_violin_plots(slm_ef_res[[measure]], plot_ylabel=measure);
    ggplot_instance;
}
```