{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's keep our notebook clean, so it's a little more readable!\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Machine learning to predict age from rs-fmri\n", "\n", "The goal is to extract data from several rs-fmri images, and use that data as features in a machine learning model. We will integrate what we've learned in the previous machine learning lecture to build an unbiased model and test it on a left out sample.\n", "\n", "We're going to use a dataset that was prepared for this tutorial by [Elizabeth Dupre](https://elizabeth-dupre.com/#/), [Jake Vogel](https://github.com/illdopejake) and [Gael Varoquaux](http://gael-varoquaux.info/), by preprocessing [ds000228](https://openneuro.org/datasets/ds000228/versions/1.0.0) (from [Richardson et al. (2018)](https://dx.doi.org/10.1038%2Fs41467-018-03399-2)) through [fmriprep](https://github.com/poldracklab/fmriprep). They also created this tutorial and should be credited for it. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the data\n", "\n", "\"terms\"\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# change this to the location where you downloaded the data\n", "wdir = '/data/ds000228/'\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Now fetch the data\n", "\n", "from glob import glob\n", "import os\n", "data = sorted(glob(os.path.join(wdir,'*.gz')))\n", "confounds = sorted(glob(os.path.join(wdir,'*regressors.tsv')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many individual subjects do we have?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#len(data.func)\n", "len(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extract features\n", "\n", "![feat_xtrct](https://ars.els-cdn.com/content/image/1-s2.0-S1053811919301594-gr1.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to do our machine learning, we will need to extract feature from our rs-fmri images.\n", "\n", "Specifically, we will extract signals from a brain parcellation and compute a correlation matrix, representing regional coactivation between regions.\n", "\n", "We will practice on one subject first, then we'll extract data for all subjects" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Retrieve the atlas for extracting features and an example subject" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we're using rs-fmri data, it makes sense to use an atlas defined using rs-fmri data\n", "\n", "This paper has many excellent insights about what kind of atlas to use for an rs-fmri machine learning task. See in particular Figure 5.\n", "\n", "https://www.sciencedirect.com/science/article/pii/S1053811919301594?via%3Dihub\n", "\n", "Let's use the MIST atlas, created here in Montreal using the BASC method. This atlas has multiple resolutions, for larger networks or finer-grained ROIs. Let's use a 64-ROI atlas to allow some detail, but to ultimately keep our connectivity matrices manageable\n", "\n", "Here is a link to the MIST paper: https://mniopenresearch.org/articles/1-3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nilearn import datasets\n", "\n", "parcellations = datasets.fetch_atlas_basc_multiscale_2015(version='sym')\n", "atlas_filename = parcellations.scale064\n", "\n", "\n", "print('Atlas ROIs are located in nifti image (4D) at: %s' %\n", " atlas_filename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look at that atlas" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nilearn import plotting\n", "\n", "plotting.plot_roi(atlas_filename, draw_cross=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great, let's load an example 4D fmri time-series for one subject" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fmri_filenames = data[0]\n", "print(fmri_filenames)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look at the image! Because it is a 4D image, we can only look at one slice at a time. Or, better yet, let's look at an average image!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nilearn import image \n", "\n", "averaged_Img = image.mean_img(image.mean_img(fmri_filenames))\n", "plotting.plot_stat_map(averaged_Img)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Extract signals on a parcellation defined by labels\n", "Using the NiftiLabelsMasker\n", "\n", "So we've loaded our atlas and 4D data for a single subject. Let's practice extracting features!\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nilearn.input_data import NiftiLabelsMasker\n", "\n", "masker = NiftiLabelsMasker(labels_img=atlas_filename, standardize=True, \n", " memory='nilearn_cache', verbose=1)\n", "\n", "# Here we go from nifti files to the signal time series in a numpy\n", "# array. Note how we give confounds to be regressed out during signal\n", "# extraction\n", "conf = confounds[0]\n", "time_series = masker.fit_transform(fmri_filenames, confounds=conf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So what did we just create here?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(time_series)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "time_series.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What are these \"confounds\" and how are they used? " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas\n", "conf_df = pandas.read_table(conf)\n", "conf_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "conf_df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Compute and display a correlation matrix\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nilearn.connectome import ConnectivityMeasure\n", "\n", "correlation_measure = ConnectivityMeasure(kind='correlation')\n", "correlation_matrix = correlation_measure.fit_transform([time_series])[0]\n", "correlation_matrix.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the correlation matrix" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "# Mask the main diagonal for visualization:\n", "np.fill_diagonal(correlation_matrix, 0)\n", "\n", "# The labels we have start with the background (0), hence we skip the\n", "# first label\n", "plotting.plot_matrix(correlation_matrix, figure=(10, 8), \n", " labels=range(time_series.shape[-1]),\n", " vmax=0.8, vmin=-0.8, reorder=False)\n", "\n", "# matrices are ordered for block-like representation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Extract features from the whole dataset\n", "\n", "Here, we are going to use a for loop to iterate through each image and use the same techniques we learned above to extract rs-fmri connectivity features from every subject.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Here is a really simple for loop\n", "\n", "for i in range(10):\n", " print('the number is', i)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "container = []\n", "for i in range(10):\n", " container.append(i)\n", "\n", "container" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now lets construct a more complicated loop to do what we want" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we do some things we don't need to do in the loop. Let's reload our atlas, and re-initiate our masker and correlation_measure" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nilearn.input_data import NiftiLabelsMasker\n", "from nilearn.connectome import ConnectivityMeasure\n", "from nilearn import datasets\n", "\n", "# load atlas\n", "multiscale = datasets.fetch_atlas_basc_multiscale_2015()\n", "atlas_filename = multiscale.scale064\n", "\n", "# initialize masker (change verbosity)\n", "masker = NiftiLabelsMasker(labels_img=atlas_filename, standardize=True, \n", " memory='nilearn_cache', verbose=0)\n", "\n", "# initialize correlation measure, set to vectorize\n", "correlation_measure = ConnectivityMeasure(kind='correlation', vectorize=True,\n", " discard_diagonal=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay -- now that we have that taken care of, let's run our big loop!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NOTE**: On a laptop, this might a few minutes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "all_features = [] # here is where we will put the data (a container)\n", "\n", "for i,sub in enumerate(data):\n", " # extract the timeseries from the ROIs in the atlas\n", " time_series = masker.fit_transform(sub, confounds=confounds[i])\n", " # create a region x region correlation matrix\n", " correlation_matrix = correlation_measure.fit_transform([time_series])[0]\n", " # add to our container\n", " all_features.append(correlation_matrix)\n", " # keep track of status\n", " print('finished %s of %s'%(i+1,len(data)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's save the data to disk\n", "import numpy as np\n", "\n", "np.savez_compressed('MAIN_BASC064_subsamp_features',a = all_features)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In case you do not want to run the full loop on your computer, you can load the output of the loop here!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feat_file = 'MAIN_BASC064_subsamp_features.npz'\n", "X_features = np.load(feat_file)['a']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_features.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"terms\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay so we've got our features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can visualize our feature matrix" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "plt.imshow(X_features, aspect='auto')\n", "plt.colorbar()\n", "plt.title('feature matrix')\n", "plt.xlabel('features')\n", "plt.ylabel('subjects')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get Y (our target) and assess its distribution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's load the phenotype data\n", "\n", "pheno_path = os.path.join(wdir, 'participants.tsv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas\n", "\n", "pheno = pandas.read_csv(pheno_path, sep='\\t').sort_values('participant_id')\n", "pheno.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks like there is a column labeling age. Let's capture it in a variable" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_age = pheno['Age']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Maybe we should have a look at the distribution of our target variable" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "sns.distplot(y_age)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare data for machine learning\n", "\n", "Here, we will define a \"training sample\" where we can play around with our models. We will also set aside a \"validation\" sample that we will not touch until the end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to be sure that our training and test sample are matched! We can do that with a \"stratified split\". This dataset has a variable indicating AgeGroup. We can use that to make sure our training and testing sets are balanced!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "age_class = pheno['AgeGroup']\n", "age_class.value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# Split the sample to training/validation with a 60/40 ratio, and \n", "# stratify by age class, and also shuffle the data.\n", "\n", "X_train, X_val, y_train, y_val = train_test_split(\n", " X_features, # x\n", " y_age, # y\n", " test_size = 0.4, # 60%/40% split \n", " shuffle = True, # shuffle dataset\n", " # before splitting\n", " stratify = age_class, # keep\n", " # distribution\n", " # of ageclass\n", " # consistent\n", " # betw. train\n", " # & test sets.\n", " random_state = 123 # same shuffle each\n", " # time\n", " )\n", "\n", "# print the size of our training and test groups\n", "print('training:', len(X_train),\n", " 'testing:', len(X_val))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize the distributions to be sure they are matched" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.distplot(y_train)\n", "sns.distplot(y_val)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run your first model!\n", "\n", "Machine learning can get pretty fancy pretty quickly. We'll start with a fairly standard regression model called a Support Vector Regressor (SVR). \n", "\n", "While this may seem unambitious, simple models can be very robust. And we probably don't have enough data to create more complex models (but we can try later).\n", "\n", "For more information, see this excellent resource:\n", "https://hal.inria.fr/hal-01824205" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's fit our first model!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.svm import SVR\n", "l_svr = SVR(kernel='linear') # define the model\n", "\n", "l_svr.fit(X_train, y_train) # fit the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Well... that was easy. Let's see how well the model learned the data!\n", "\n", "\n", "\"terms\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# predict the training data based on the model\n", "y_pred = l_svr.predict(X_train) \n", "\n", "# caluclate the model accuracy\n", "acc = l_svr.score(X_train, y_train) \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's view our results and plot them all at once!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# print results\n", "print('accuracy (R2)', acc)\n", "\n", "sns.regplot(y_pred,y_train)\n", "plt.xlabel('Predicted Age')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "HOLY COW! Machine learning is amazing!!! Almost a perfect fit!\n", "\n", "...which means there's something wrong. What's the problem here?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# Split the sample to training/test with a 75/25 ratio, and \n", "# stratify by age class, and also shuffle the data.\n", "\n", "age_class2 = pheno.loc[y_train.index,'AgeGroup']\n", "\n", "X_train2, X_test, y_train2, y_test = train_test_split(\n", " X_train, # x\n", " y_train, # y\n", " test_size = 0.25, # 75%/25% split \n", " shuffle = True, # shuffle dataset\n", " # before splitting\n", " stratify = age_class2, # keep\n", " # distribution\n", " # of ageclass\n", " # consistent\n", " # betw. train\n", " # & test sets.\n", " random_state = 123 # same shuffle each\n", " # time\n", " )\n", "\n", "# print the size of our training and test groups\n", "print('training:', len(X_train2),\n", " 'testing:', len(X_test))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import mean_absolute_error\n", "\n", "# fit model just to training data\n", "l_svr.fit(X_train2,y_train2)\n", "\n", "# predict the *test* data based on the model trained on X_train2\n", "y_pred = l_svr.predict(X_test) \n", "\n", "# caluclate the model accuracy\n", "acc = l_svr.score(X_test, y_test) \n", "mae = mean_absolute_error(y_true=y_test,y_pred=y_pred)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# print results\n", "print('accuracy (R2) = ', acc)\n", "print('MAE = ',mae)\n", "\n", "sns.regplot(y_pred,y_test)\n", "plt.xlabel('Predicted Age')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not perfect, but as predicting with unseen data goes, not too bad! Especially with a training sample of \"only\" 69 subjects. But we can do better! \n", "\n", "For example, we can increase the size our training set while simultaneously reducing bias by instead using 10-fold cross-validation\n", "\n", "\"terms\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_predict, cross_val_score\n", "\n", "# predict\n", "y_pred = cross_val_predict(l_svr, X_train, y_train, cv=10)\n", "# scores\n", "acc = cross_val_score(l_svr, X_train, y_train, cv=10)\n", "mae = cross_val_score(l_svr, X_train, y_train, cv=10, scoring='neg_mean_absolute_error')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can look at the accuracy of the predictions for each fold of the cross-validation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i in range(10):\n", " print('Fold {} -- Acc = {}, MAE = {}'.format(i, acc[i],-mae[i]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also look at the overall accuracy of the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import r2_score\n", "\n", "overall_acc = r2_score(y_train, y_pred)\n", "overall_mae = mean_absolute_error(y_train,y_pred)\n", "print('R2:',overall_acc)\n", "print('MAE:',overall_mae)\n", "\n", "sns.regplot(y_pred, y_train)\n", "plt.xlabel('Predicted Age')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not too bad at all! But more importantly, this is a more accurate estimation of our model's predictive efficacy. Our sample size is larger and this is based on several rounds of prediction of unseen data. \n", "\n", "For example, we can now see that the effect is being driven by the model's successful parsing of adults vs. children, but is not performing so well within the adult or children group. This was not evident during our previous iteration of the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tweak your model\n", "\n", "It's very important to learn when and where its appropriate to \"tweak\" your model.\n", "\n", "Since we have done all of the previous analysis in our training data, it's fine to try out different models. But we **absolutely cannot** \"test\" it on our left out data. If we do, we are in great danger of overfitting.\n", "\n", "It is not uncommon to try other models, or tweak hyperparameters. In this case, due to our relatively small sample size, we are probably not powered sufficiently to do so, and we would once again risk overfitting. However, for the sake of demonstration, we will do some tweaking. \n", "\n", "\"terms\"\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will try a few different examples:\n", "* Normalizing our target data\n", "* Tweaking our hyperparameters\n", "* Trying a more complicated model\n", "* Feature selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Normalize the target data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a log transformer function and log transform Y (age)\n", "\n", "from sklearn.preprocessing import FunctionTransformer\n", "\n", "log_transformer = FunctionTransformer(func = np.log, validate=True)\n", "log_transformer.fit(y_train.values.reshape(-1,1))\n", "y_train_log = log_transformer.transform(y_train.values.reshape(-1,1))[:,0]\n", "\n", "sns.distplot(y_train_log)\n", "plt.title(\"Log-Transformed Age\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's go ahead and cross-validate our model once again with this new log-transformed target" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# predict\n", "y_pred = cross_val_predict(l_svr, X_train, y_train_log, cv=10)\n", "\n", "# scores\n", "acc = r2_score(y_train_log, y_pred)\n", "mae = mean_absolute_error(y_train_log,y_pred)\n", "\n", "print('R2:',acc)\n", "print('MAE:',mae)\n", "\n", "sns.regplot(y_pred, y_train_log)\n", "plt.xlabel('Predicted Log Age')\n", "plt.ylabel('Log Age')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Seems like a definite improvement, right? I think we can agree on that. \n", "\n", "But we can't forget about interpretability? The MAE is much less interpretable now." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Tweak the hyperparameters\n", "\n", "Many machine learning algorithms have hyperparameters that can be \"tuned\" to optimize model fitting. Careful parameter tuning can really improve a model, but haphazard tuning will often lead to overfitting.\n", "\n", "Our SVR model has multiple hyperparameters. Let's explore some approaches for tuning them" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SVR?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One way is to plot a \"Validation Curve\" -- this will let us view changes in training and validation accuracy of a model as we shift its hyperparameters. We can do this easily with sklearn." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import validation_curve\n", "\n", "C_range = 10. ** np.arange(-3, 8) # A range of different values for C\n", "\n", "train_scores, valid_scores = validation_curve(l_svr, X_train, y_train_log, \n", " param_name= \"C\",\n", " param_range = C_range,\n", " cv=10)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A bit of pandas magic to prepare the data for a seaborn plot\n", "\n", "tScores = pandas.DataFrame(train_scores).stack().reset_index()\n", "tScores.columns = ['C','Fold','Score']\n", "tScores.loc[:,'Type'] = ['Train' for x in range(len(tScores))]\n", "\n", "vScores = pandas.DataFrame(valid_scores).stack().reset_index()\n", "vScores.columns = ['C','Fold','Score']\n", "vScores.loc[:,'Type'] = ['Validate' for x in range(len(vScores))]\n", "\n", "ValCurves = pandas.concat([tScores,vScores]).reset_index(drop=True)\n", "ValCurves.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# And plot!\n", "# g = sns.lineplot(x='C',y='Score',hue='Type',data=ValCurves)\n", "# g.set_xticks(range(10))\n", "# g.set_xticklabels(C_range, rotation=90)\n", "\n", "g = sns.factorplot(x='C',y='Score',hue='Type',data=ValCurves)\n", "plt.xticks(range(10))\n", "g.set_xticklabels(C_range, rotation=90)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like accuracy is better for higher values of C, and plateaus somewhere between 0.1 and 1. The default setting is C=1, so it looks like we can't really improve by changing C." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But our SVR model actually has two hyperparameters, C and epsilon. Perhaps there is an optimal combination of settings for these two parameters.\n", "\n", "We can explore that somewhat quickly with a grid search, which is once again easily achieved with sklearn. Because we are fitting the model multiple times witih cross-validation, this will take some time" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "C_range = 10. ** np.arange(-3, 8)\n", "epsilon_range = 10. ** np.arange(-3, 8)\n", "\n", "param_grid = dict(epsilon=epsilon_range, C=C_range)\n", "\n", "grid = GridSearchCV(l_svr, param_grid=param_grid, cv=10)\n", "\n", "grid.fit(X_train, y_train_log)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that the grid search has completed, let's find out what was the \"best\" parameter combination" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(grid.best_params_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And if redo our cross-validation with this parameter set?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_pred = cross_val_predict(SVR(kernel='linear',C=0.10,epsilon=0.10, gamma='auto'), \n", " X_train, y_train_log, cv=10)\n", "\n", "# scores\n", "acc = r2_score(y_train_log, y_pred)\n", "mae = mean_absolute_error(y_train_log,y_pred)\n", "\n", "print('R2:',acc)\n", "print('MAE:',mae)\n", "\n", "sns.regplot(y_pred, y_train_log)\n", "plt.xlabel('Predicted Log Age')\n", "plt.ylabel('Log Age')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perhaps unsurprisingly, the model fit is actually exactly the same as what we had with our defaults. There's a reason they are defaults ;-)\n", "\n", "Grid search can be a powerful and useful tool. But can you think of a way that, if not properly utilized, it could lead to overfitting?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can find a nice set of tutorials with links to very helpful content regarding how to tune hyperparameters and while be aware of over- and under-fitting here:\n", "\n", "https://scikit-learn.org/stable/modules/learning_curve.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Trying a more complicated model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In principle, there is no real reason to do this. Perhaps one could make an argument for quadratic relationship with age, but we probably don't have enough subjects to learn a complicated non-linear model. But for the sake of demonstration, we can give it a shot.\n", "\n", "We'll use a validation curve to see the result of our model if, instead of fitting a linear model, we instead try to the fit a 2nd, 3rd, ... 8th order polynomial." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "validation_curve?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import validation_curve\n", "\n", "degree_range = list(range(1,8)) # A range of different values for C\n", "\n", "train_scores, valid_scores = validation_curve(SVR(kernel='poly',\n", " gamma='scale'\n", " ), \n", " X=X_train, y=y_train_log, \n", " param_name= \"degree\",\n", " param_range = degree_range,\n", " cv=10)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A bit of pandas magic to prepare the data for a seaborn plot\n", "\n", "tScores = pandas.DataFrame(train_scores).stack().reset_index()\n", "tScores.columns = ['Degree','Fold','Score']\n", "tScores.loc[:,'Type'] = ['Train' for x in range(len(tScores))]\n", "\n", "vScores = pandas.DataFrame(valid_scores).stack().reset_index()\n", "vScores.columns = ['Degree','Fold','Score']\n", "vScores.loc[:,'Type'] = ['Validate' for x in range(len(vScores))]\n", "\n", "ValCurves = pandas.concat([tScores,vScores]).reset_index(drop=True)\n", "ValCurves.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# And plot!\n", "\n", "# g = sns.lineplot(x='Degree',y='Score',hue='Type',data=ValCurves)\n", "# g.set_xticks(range(10))\n", "# g.set_xticklabels(degree_range, rotation=90)\n", "\n", "g = sns.factorplot(x='Degree',y='Score',hue='Type',data=ValCurves)\n", "plt.xticks(range(10))\n", "g.set_xticklabels(degree_range, rotation=90)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It appears that we cannot improve our model by increasing the complexity of the fit. If one looked only at the training data, one might surmise that a 2nd order fit could be a slightly better model. But that improvement does not generalize to the validation data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# y_pred = cross_val_predict(SVR(kernel='rbf', gamma='scale'), X_train, y_train_log, cv=10)\n", "\n", "# # scores\n", "# acc = r2_score(y_train_log, y_pred)\n", "# mae = mean_absolute_error(y_train_log,y_pred)\n", "\n", "# print('R2:',acc)\n", "# print('MAE:',mae)\n", "\n", "# sns.regplot(y_pred, y_train_log)\n", "# plt.xlabel('Predicted Log Age')\n", "# plt.ylabel('Log Age')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Feature selection\n", "\n", "Right now, we have 2016 features. Are all of those really going to contribute to the model stably? \n", "\n", "Intuitively, models tend to perform better when there are fewer, more important features than when there are many, less imortant features. The tough part is figuring out which features are useful or important.\n", "\n", "Here will quickly try a basic feature seclection strategy\n", "\n", "\"terms\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The SelectPercentile() function will select the top X% of features based on univariate tests. This is a way of identifying theoretically more useful features. But remember, significance != prediction! \n", "\n", "We are also in danger of overfitting here. For starters, if we want to test this with 10-fold cross-validation, we will need to do a separate feature selection within each fold! That means we'll need to do the cross-validation manually instead of using cross_val_predict(). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_selection import SelectPercentile, f_regression\n", "from sklearn.model_selection import KFold\n", "from sklearn.pipeline import Pipeline\n", "\n", "# Build a tiny pipeline that does feature selection (top 20% of features), \n", "# and then prediction with our linear svr model.\n", "model = Pipeline([\n", " ('feature_selection',SelectPercentile(f_regression,percentile=20)),\n", " ('prediction', l_svr)\n", " ])\n", "\n", "y_pred = [] # a container to catch the predictions from each fold\n", "y_index = [] # just in case, the index for each prediciton\n", "\n", "# First we create 10 splits of the data\n", "skf = KFold(n_splits=10, shuffle=True, random_state=123)\n", "\n", "# For each split, assemble the train and test samples \n", "for tr_ind, te_ind in skf.split(X_train):\n", " X_tr = X_train[tr_ind]\n", " y_tr = y_train_log[tr_ind]\n", " X_te = X_train[te_ind]\n", " y_index += list(te_ind) # store the index of samples to predict\n", " \n", " # and run our pipeline \n", " model.fit(X_tr, y_tr) # fit the data to the model using our mini pipeline\n", " predictions = model.predict(X_te).tolist() # get the predictions for this fold\n", " y_pred += predictions # add them to the list of predictions\n", "\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alrighty, let's see if only using the top 20% of features improves the model at all..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "acc = r2_score(y_train_log[y_index], y_pred)\n", "mae = mean_absolute_error(y_train_log[y_index],y_pred)\n", "\n", "print('R2:',acc)\n", "print('MAE:',mae)\n", "\n", "sns.regplot(np.array(y_pred), y_train_log[y_index])\n", "plt.xlabel('Predicted Log Age')\n", "plt.ylabel('Log Age')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nope, in fact it got a bit worse. It seems we're getting \"value at the margins\" so to speak. This is a very good example of how significance != prediction, as demonstrated in this figure from Bzdok et al., 2018 *bioRxiv*\n", "\n", "![Bzdok2018](https://www.biorxiv.org/content/biorxiv/early/2018/05/21/327437/F1.large.jpg?width=800&height=600&carousel=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See here for an explanation of different feature selection options and how to implement them in sklearn: https://scikit-learn.org/stable/modules/feature_selection.html\n", "\n", "And here is a thoughtful tutorial covering feature selection for novel machine learners: https://www.datacamp.com/community/tutorials/feature-selection-python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So there you have it. We've tried many different strategies, but most of our \"tweaks\" haven't really lead to improvements in the model. This is not always the case, but it is not uncommon. Can you think of some reasons why?\n", "\n", "Moving on to our validation data, we probably should just stick to a basic model, though predicting log age might be a good idea!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Can our model predict age in completely un-seen data?\n", "Now that we've fit a model we think has possibly learned how to decode age based on rs-fmri signal, let's put it to the test. We will train our model on all of the training data, and try to predict the age of the subjects we left out at the beginning of this section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because we performed a log transformation on our training data, we will need to transform our testing data using the *same information!* But that's easy because we stored our transformation in an object!\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Notice how we use the Scaler that was fit to X_train and apply to X_test,\n", "# rather than creating a new Scaler for X_test\n", "y_val_log = log_transformer.transform(y_val.values.reshape(-1,1))[:,0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now for the moment of truth! \n", "\n", "No cross-validation needed here. We simply fit the model with the training data and use it to predict the testing data\n", "\n", "I'm so nervous. Let's just do it all in one cell" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "l_svr.fit(X_train, y_train_log) # fit to training data\n", "y_pred = l_svr.predict(X_val) # classify age class using testing data\n", "acc = l_svr.score(X_val, y_val_log) # get accuracy (r2)\n", "mae = mean_absolute_error(y_val_log, y_pred) # get mae\n", "\n", "# print results\n", "print('accuracy (r2) =', acc)\n", "print('mae = ',mae)\n", "\n", "# plot results\n", "sns.regplot(y_pred, y_val_log)\n", "plt.xlabel('Predicted Log Age')\n", "plt.ylabel('Log Age')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Wow!!*** Congratulations. You just trained a machine learning model that used real rs-fmri data to predict the age of real humans.\n", "\n", "The proper thing to do at this point would be to repeat the train-validation split multiple times. This will ensure the results are not specific to this validation set, and will give you some confidence intervals around your results.\n", "\n", "As an assignment, you can give that a try below. Create 10 different splits of the entire dataset, fit the model and get your predictions. Then, plot the range of predictions.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# SPACE FOR YOUR ASSIGNMENT\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, it seems like something in this data does seem to be systematically related to age ... but what? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Interpreting model feature importances\n", "Interpreting the feature importances of a machine learning model is a real can of worms. This is an area of active research. Unfortunately, it's hard to trust the feature importance of some models. \n", "\n", "You can find a whole tutorial on this subject here:\n", "http://gael-varoquaux.info/interpreting_ml_tuto/index.html\n", "\n", "For now, we'll just eschew better judgement and take a look at our feature importances. While we can't ascribe any biological relevance to the features, it can still be helpful to know what the model is using to make its predictions. This is a good way to, for example, establish whether your model is actually learning based on a confound! Could you think of some examples?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can access the feature importances (weights) used my the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "l_svr.coef_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "lets plot these weights to see their distribution better" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.bar(range(l_svr.coef_.shape[-1]),l_svr.coef_[0])\n", "plt.title('feature importances')\n", "plt.xlabel('feature')\n", "plt.ylabel('weight')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or perhaps it will be easier to visualize this information as a matrix similar to the one we started with\n", "\n", "We can use the correlation measure from before to perform an inverse transform" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "correlation_measure.inverse_transform(l_svr.coef_).shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nilearn import plotting\n", "\n", "feat_exp_matrix = correlation_measure.inverse_transform(l_svr.coef_)[0]\n", "\n", "plotting.plot_matrix(feat_exp_matrix, figure=(10, 8), \n", " labels=range(feat_exp_matrix.shape[0]),\n", " reorder=False,\n", " tri='lower')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see if we can throw those features onto an actual brain.\n", "\n", "First, we'll need to gather the coordinates of each ROI of our atlas" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "coords = plotting.find_parcellation_cut_coords(atlas_filename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now we can use our feature matrix and the wonders of nilearn to create a connectome map where each node is an ROI, and each connection is weighted by the importance of the feature to the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotting.plot_connectome(feat_exp_matrix, coords, colorbar=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whoa!! That's...a lot to process. Maybe let's threshold the edges so that only the most important connections are visualized" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotting.plot_connectome(feat_exp_matrix, coords, colorbar=True, edge_threshold=0.035)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's definitely an improvement, but it's still a bit hard to see what's going on.\n", "Nilearn has a new feature that let's use view this data interactively!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib notebook\n", "plotting.view_connectome(feat_exp_matrix, coords, threshold='98%')" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 1 }