{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", " \n", " \n", " \n", "
[](http://www.datascience-paris-saclay.fr)[](https://research.pasteur.fr/en/team/group-roberto-toro/)
\n", "
\n", "\n", "

Imaging-psychiatry challenge: predicting autism

\n", "\n", "

A data challenge on Autism Spectrum Disorder detection

\n", "
\n", "
_Roberto Toro (Institut Pasteur), Nicolas Traut (Institut Pasteur), Anita Beggiato (Institut Pasteur), Katja Heuer (Institut Pasteur),
Gael Varoquaux (Inria, Parietal), Alex Gramfort (Inria, Parietal), Balazs Kegl (LAL),
Guillaume Lemaitre (CDS), Alexandre Boucaud (CDS), and Joris van den Bossche (CDS)_
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of Content\n", "\n", "0. [Prerequisites](#Software-prerequisites)\n", "1. [Introduction about the competition](#Introduction:-what-is-this-challenge-about)\n", "3. [The data](#The-data)\n", "4. [Workflow](#Workflow)\n", "5. [Evaluation](#Evaluation)\n", "6. [Submission](#Submitting-to-the-online-challenge:-ramp.studio)\n", "7. [More information](#More-information)\n", "8. [Questions](#Question)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**To download and run this notebook**: download the [full starting kit](https://github.com/ramp-kits/autism/archive/master.zip), with all the necessary files." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Software prerequisites" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This starting kit requires the following dependencies:\n", "\n", "* `numpy`\n", "* `scipy`\n", "* `pandas`\n", "* `scikit-learn`\n", "* `matplolib`\n", "* `seaborn`\n", "* `nilearn`\n", "* `jupyter`\n", "* `ramp-workflow`\n", "\n", "The following 2 cells will install if necessary the missing dependencies." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: scikit-learn in /home/lemaitre/miniconda3/lib/python3.6/site-packages\n", "Requirement already satisfied: seaborn in /home/lemaitre/miniconda3/lib/python3.6/site-packages\n", "Requirement already satisfied: nilearn in /home/lemaitre/miniconda3/lib/python3.6/site-packages\n", "Requirement already satisfied: numpy in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from seaborn)\n", "Requirement already satisfied: scipy in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from seaborn)\n", "Requirement already satisfied: matplotlib in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from seaborn)\n", "Requirement already satisfied: pandas in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from seaborn)\n", "Requirement already satisfied: nibabel>=2.0.2 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from nilearn)\n", "Requirement already satisfied: cycler>=0.10 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from matplotlib->seaborn)\n", "Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from matplotlib->seaborn)\n", "Requirement already satisfied: python-dateutil>=2.1 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from matplotlib->seaborn)\n", "Requirement already satisfied: pytz in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from matplotlib->seaborn)\n", "Requirement already satisfied: six>=1.10 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from matplotlib->seaborn)\n", "Requirement already satisfied: kiwisolver>=1.0.1 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from matplotlib->seaborn)\n", "Requirement already satisfied: setuptools in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from 
kiwisolver>=1.0.1->matplotlib->seaborn)\n", "\u001b[33mYou are using pip version 9.0.3, however version 10.0.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n" ] } ], "source": [ "# import sys\n", "# !{sys.executable} -m pip install \"scikit-learn>=0.19,<=0.21\" seaborn nilearn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Install `ramp-workflow` from the master branch on GitHub." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting https://api.github.com/repos/paris-saclay-cds/ramp-workflow/zipball/master\n", " Downloading https://api.github.com/repos/paris-saclay-cds/ramp-workflow/zipball/master (2.8MB)\n", "\u001b[K 100% |████████████████████████████████| 2.8MB 444kB/s eta 0:00:01\n", "\u001b[?25h Requirement already satisfied (use --upgrade to upgrade): ramp-workflow==0+unknown from https://api.github.com/repos/paris-saclay-cds/ramp-workflow/zipball/master in /home/lemaitre/miniconda3/lib/python3.6/site-packages\n", "Requirement already satisfied: numpy in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from ramp-workflow==0+unknown)\n", "Requirement already satisfied: scipy in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from ramp-workflow==0+unknown)\n", "Requirement already satisfied: pandas>=0.19.2 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from ramp-workflow==0+unknown)\n", "Requirement already satisfied: scikit-learn>=0.18 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from ramp-workflow==0+unknown)\n", "Requirement already satisfied: cloudpickle in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from ramp-workflow==0+unknown)\n", "Requirement already satisfied: colored in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from ramp-workflow==0+unknown)\n", "Requirement already satisfied: python-dateutil>=2.5.0 in 
/home/lemaitre/miniconda3/lib/python3.6/site-packages (from pandas>=0.19.2->ramp-workflow==0+unknown)\n", "Requirement already satisfied: pytz>=2011k in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from pandas>=0.19.2->ramp-workflow==0+unknown)\n", "Requirement already satisfied: six>=1.5 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from python-dateutil>=2.5.0->pandas>=0.19.2->ramp-workflow==0+unknown)\n", "\u001b[33mYou are using pip version 9.0.3, however version 10.0.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n" ] } ], "source": [ "# !{sys.executable} -m pip install https://api.github.com/repos/paris-saclay-cds/ramp-workflow/zipball/0.2.1" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction: what is this challenge about" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Autism spectrum disorder (ASD) is a developmental disorder affecting communication and behavior, with a wide range of symptom severity. ASD has been reported to affect approximately 1 in 166 children.\n", "\n", "Although there is a consensus that ASD is related to atypical brain networks and anatomy, the precise differences in brain anatomy and functional connectivity remain unclear. To address these issues, studies on large cohorts of subjects are necessary to ensure relevant findings." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Competition rules\n", "\n", "You can sign up to the challenge on the [ramp.studio](http://ramp.studio/) event page. Here are the competition rules:\n", "\n", "* Submissions will be trained on the 1150 subjects given in the starting kit and tested on a private set of size 1003.
More information on the distribution of the public and private sets can be found on the [challenge website](https://paris-saclay-cds.github.io/autism_challenge/).\n", "* The competition will end on July 1, 2018 at 18h UTC (20h in Paris).\n", "* All models will be trained on an AWS `m5.xlarge` instance (4 CPUs and 16 GiB of RAM).\n", "* Participants will be given a total of 80 machine hours. Submissions of a given participant will be ordered by submission timestamp. We will make an attempt to train all submissions, but starting from (and including) the first submission that makes the participant's total training time exceed 80 hours, all submissions will be disqualified from the competition (but can enter into the collaborative phase). Testing time will not count towards the limit. Training time will be displayed on the leaderboard for all submissions, rounded to the second, per cross-validation fold. Since we have 8 CV folds, the maximum total training time in the leaderboard is 10h. If a submission raises an exception, its training time will not count towards the total.\n", "* There is a timeout of 15 minutes between submissions.\n", "* Submissions made after the end of the competition will not qualify for prizes.\n", "* The public leaderboard will display validation scores obtained by running a stratified CV with 8 folds and a 20% validation set size, executed on the public (starting kit) data. The official scores will be calculated on the hidden test set and will be published after the closing of the competition. We will measure several scores for each submission: (i) the Area Under the Receiver Operating Characteristic Curve (ROC-AUC) and (ii) the accuracy. The ROC-AUC will be used to rank submissions.\n", "* The organizers will do their best so that the provided backend runs flawlessly.
We will communicate with participants in case of concerns and will try to resolve all issues, but we reserve the right to make unilateral decisions in specific cases, not covered by this set of minimal rules.\n", "* The organizers reserve the right to disqualify any participant found to violate the fair competitive spirit of the challenge. Possible reasons, without being exhaustive, are multiple accounts, attempts to access the test data, etc.\n", "* The challenge is essentially an individual contest, so there is no way to form official teams. Participants can form teams outside the platform before submitting any model individually, contact the organizers to let them know about the team, and submit on a single team member's account. However, submitting on one's own and participating in such a team at the same time is against the \"no multiple accounts\" rule, so, if discovered, may lead to disqualification.\n", "* Participants retain copyright on their submitted code and grant reuse under BSD 3-Clause License.\n", "\n", "Participants accept these rules automatically when making a submission at the RAMP site." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prizes of the competitive phase\n", "\n", "Ties in the competitive scores will be broken by earlier submission time.\n", "\n", "* 3000 €: the top submission according to private test ROC-AUC at the end of the competitive phase.\n", "* 2000 €: the second best submission according to private test ROC-AUC at the end of the competitive phase.\n", "* 1000 €: the third best submission according to private test ROC-AUC at the end of the competitive phase.\n", "* 500 €: from the fourth to the tenth best submissions according to the private test ROC-AUC at the end of the competitive phase." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sponsorship\n", "\n", "Prizes are provided by [IESF](https://www.iesf.fr/) and [Paris-Saclay CDS](http://www.datascience-paris-saclay.fr/). 
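Since the ROC-AUC is the metric used to rank submissions, it helps to recall what it measures: the probability that a randomly chosen ASD subject receives a higher score than a randomly chosen control. A minimal pure-Python sketch of this rank (Mann-Whitney U) formulation, for illustration only — it is not the scoring code used by the platform:

```python
def roc_auc(y_true, y_score):
    """ROC-AUC via its rank formulation: the fraction of (positive, negative)
    pairs in which the positive is scored higher, counting ties as half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A classifier that scores every positive above every negative reaches 1.0; random scoring hovers around 0.5 regardless of class imbalance, which is why ROC-AUC is a natural choice here.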
The computational time is provided by AWS on a research credit granted to the [Paris-Saclay CDS](http://www.datascience-paris-saclay.fr/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start by downloading the data from the internet." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from problem import get_train_data\n", "\n", "data_train, labels_train = get_train_data()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
participants_siteparticipants_sexparticipants_ageanatomy_lh_bankssts_areaanatomy_lh_caudalanteriorcingulate_areaanatomy_lh_caudalmiddlefrontal_areaanatomy_lh_cuneus_areaanatomy_lh_entorhinal_areaanatomy_lh_fusiform_areaanatomy_lh_inferiorparietal_area...anatomy_selectfmri_basc064fmri_basc122fmri_basc197fmri_craddock_scorr_meanfmri_harvard_oxford_cort_prob_2mmfmri_motionsfmri_msdlfmri_power_2011fmri_select
subject_id
19323553985361241065F9.301370977.0427.01884.01449.0463.02790.04091.0...1./data/fmri/basc064/1932355398536124106/run_1/..../data/fmri/basc122/1932355398536124106/run_1/..../data/fmri/basc197/1932355398536124106/run_1/..../data/fmri/craddock_scorr_mean/19323553985361..../data/fmri/harvard_oxford_cort_prob_2mm/19323..../data/fmri/motions/1932355398536124106/run_1/..../data/fmri/msdl/1932355398536124106/run_1/193..../data/fmri/power_2011/1932355398536124106/run...1
517404173009225377119M29.0000001279.0730.02419.01611.0467.03562.05380.0...1./data/fmri/basc064/5174041730092253771/run_1/..../data/fmri/basc122/5174041730092253771/run_1/..../data/fmri/basc197/5174041730092253771/run_1/..../data/fmri/craddock_scorr_mean/51740417300922..../data/fmri/harvard_oxford_cort_prob_2mm/51740..../data/fmri/motions/5174041730092253771/run_1/..../data/fmri/msdl/5174041730092253771/run_1/517..../data/fmri/power_2011/5174041730092253771/run...1
1021932267664353480019F45.000000926.0446.01897.02135.0570.03064.04834.0...1./data/fmri/basc064/10219322676643534800/run_1..../data/fmri/basc122/10219322676643534800/run_1..../data/fmri/basc197/10219322676643534800/run_1..../data/fmri/craddock_scorr_mean/10219322676643..../data/fmri/harvard_oxford_cort_prob_2mm/10219..../data/fmri/motions/10219322676643534800/run_1..../data/fmri/msdl/10219322676643534800/run_1/10..../data/fmri/power_2011/10219322676643534800/ru...1
106454665649191902275F9.216438983.0588.02479.01312.0525.03766.05091.0...1./data/fmri/basc064/10645466564919190227/run_1..../data/fmri/basc122/10645466564919190227/run_1..../data/fmri/basc197/10645466564919190227/run_1..../data/fmri/craddock_scorr_mean/10645466564919..../data/fmri/harvard_oxford_cort_prob_2mm/10645..../data/fmri/motions/10645466564919190227/run_1..../data/fmri/msdl/10645466564919190227/run_1/10..../data/fmri/power_2011/10645466564919190227/ru...1
1451254134264193623228M15.0500001488.0593.02309.01829.0726.03720.05432.0...1./data/fmri/basc064/14512541342641936232/run_1..../data/fmri/basc122/14512541342641936232/run_1..../data/fmri/basc197/14512541342641936232/run_1..../data/fmri/craddock_scorr_mean/14512541342641..../data/fmri/harvard_oxford_cort_prob_2mm/14512..../data/fmri/motions/14512541342641936232/run_1..../data/fmri/msdl/14512541342641936232/run_1/14..../data/fmri/power_2011/14512541342641936232/ru...1
\n", "

5 rows × 220 columns

\n", "
" ], "text/plain": [ " participants_site participants_sex participants_age \\\n", "subject_id \n", "1932355398536124106 5 F 9.301370 \n", "5174041730092253771 19 M 29.000000 \n", "10219322676643534800 19 F 45.000000 \n", "10645466564919190227 5 F 9.216438 \n", "14512541342641936232 28 M 15.050000 \n", "\n", " anatomy_lh_bankssts_area \\\n", "subject_id \n", "1932355398536124106 977.0 \n", "5174041730092253771 1279.0 \n", "10219322676643534800 926.0 \n", "10645466564919190227 983.0 \n", "14512541342641936232 1488.0 \n", "\n", " anatomy_lh_caudalanteriorcingulate_area \\\n", "subject_id \n", "1932355398536124106 427.0 \n", "5174041730092253771 730.0 \n", "10219322676643534800 446.0 \n", "10645466564919190227 588.0 \n", "14512541342641936232 593.0 \n", "\n", " anatomy_lh_caudalmiddlefrontal_area \\\n", "subject_id \n", "1932355398536124106 1884.0 \n", "5174041730092253771 2419.0 \n", "10219322676643534800 1897.0 \n", "10645466564919190227 2479.0 \n", "14512541342641936232 2309.0 \n", "\n", " anatomy_lh_cuneus_area anatomy_lh_entorhinal_area \\\n", "subject_id \n", "1932355398536124106 1449.0 463.0 \n", "5174041730092253771 1611.0 467.0 \n", "10219322676643534800 2135.0 570.0 \n", "10645466564919190227 1312.0 525.0 \n", "14512541342641936232 1829.0 726.0 \n", "\n", " anatomy_lh_fusiform_area \\\n", "subject_id \n", "1932355398536124106 2790.0 \n", "5174041730092253771 3562.0 \n", "10219322676643534800 3064.0 \n", "10645466564919190227 3766.0 \n", "14512541342641936232 3720.0 \n", "\n", " anatomy_lh_inferiorparietal_area ... \\\n", "subject_id ... \n", "1932355398536124106 4091.0 ... \n", "5174041730092253771 5380.0 ... \n", "10219322676643534800 4834.0 ... \n", "10645466564919190227 5091.0 ... \n", "14512541342641936232 5432.0 ... 
\n", "\n", " anatomy_select \\\n", "subject_id \n", "1932355398536124106 1 \n", "5174041730092253771 1 \n", "10219322676643534800 1 \n", "10645466564919190227 1 \n", "14512541342641936232 1 \n", "\n", " fmri_basc064 \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/basc064/1932355398536124106/run_1/... \n", "5174041730092253771 ./data/fmri/basc064/5174041730092253771/run_1/... \n", "10219322676643534800 ./data/fmri/basc064/10219322676643534800/run_1... \n", "10645466564919190227 ./data/fmri/basc064/10645466564919190227/run_1... \n", "14512541342641936232 ./data/fmri/basc064/14512541342641936232/run_1... \n", "\n", " fmri_basc122 \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/basc122/1932355398536124106/run_1/... \n", "5174041730092253771 ./data/fmri/basc122/5174041730092253771/run_1/... \n", "10219322676643534800 ./data/fmri/basc122/10219322676643534800/run_1... \n", "10645466564919190227 ./data/fmri/basc122/10645466564919190227/run_1... \n", "14512541342641936232 ./data/fmri/basc122/14512541342641936232/run_1... \n", "\n", " fmri_basc197 \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/basc197/1932355398536124106/run_1/... \n", "5174041730092253771 ./data/fmri/basc197/5174041730092253771/run_1/... \n", "10219322676643534800 ./data/fmri/basc197/10219322676643534800/run_1... \n", "10645466564919190227 ./data/fmri/basc197/10645466564919190227/run_1... \n", "14512541342641936232 ./data/fmri/basc197/14512541342641936232/run_1... \n", "\n", " fmri_craddock_scorr_mean \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/craddock_scorr_mean/19323553985361... \n", "5174041730092253771 ./data/fmri/craddock_scorr_mean/51740417300922... \n", "10219322676643534800 ./data/fmri/craddock_scorr_mean/10219322676643... \n", "10645466564919190227 ./data/fmri/craddock_scorr_mean/10645466564919... \n", "14512541342641936232 ./data/fmri/craddock_scorr_mean/14512541342641... 
\n", "\n", " fmri_harvard_oxford_cort_prob_2mm \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/harvard_oxford_cort_prob_2mm/19323... \n", "5174041730092253771 ./data/fmri/harvard_oxford_cort_prob_2mm/51740... \n", "10219322676643534800 ./data/fmri/harvard_oxford_cort_prob_2mm/10219... \n", "10645466564919190227 ./data/fmri/harvard_oxford_cort_prob_2mm/10645... \n", "14512541342641936232 ./data/fmri/harvard_oxford_cort_prob_2mm/14512... \n", "\n", " fmri_motions \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/motions/1932355398536124106/run_1/... \n", "5174041730092253771 ./data/fmri/motions/5174041730092253771/run_1/... \n", "10219322676643534800 ./data/fmri/motions/10219322676643534800/run_1... \n", "10645466564919190227 ./data/fmri/motions/10645466564919190227/run_1... \n", "14512541342641936232 ./data/fmri/motions/14512541342641936232/run_1... \n", "\n", " fmri_msdl \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/msdl/1932355398536124106/run_1/193... \n", "5174041730092253771 ./data/fmri/msdl/5174041730092253771/run_1/517... \n", "10219322676643534800 ./data/fmri/msdl/10219322676643534800/run_1/10... \n", "10645466564919190227 ./data/fmri/msdl/10645466564919190227/run_1/10... \n", "14512541342641936232 ./data/fmri/msdl/14512541342641936232/run_1/14... \n", "\n", " fmri_power_2011 \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/power_2011/1932355398536124106/run... \n", "5174041730092253771 ./data/fmri/power_2011/5174041730092253771/run... \n", "10219322676643534800 ./data/fmri/power_2011/10219322676643534800/ru... \n", "10645466564919190227 ./data/fmri/power_2011/10645466564919190227/ru... \n", "14512541342641936232 ./data/fmri/power_2011/14512541342641936232/ru... 
\n", "\n", " fmri_select \n", "subject_id \n", "1932355398536124106 1 \n", "5174041730092253771 1 \n", "10219322676643534800 1 \n", "10645466564919190227 1 \n", "14512541342641936232 1 \n", "\n", "[5 rows x 220 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_train.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 0 1 ... 0 1 0]\n" ] } ], "source": [ "print(labels_train)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of subjects in the training tests: 1127\n" ] } ], "source": [ "print('Number of subjects in the training tests: {}'.format(labels_train.size))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Participant features" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
participants_siteparticipants_sexparticipants_age
subject_id
19323553985361241065F9.301370
517404173009225377119M29.000000
1021932267664353480019F45.000000
106454665649191902275F9.216438
1451254134264193623228M15.050000
\n", "
" ], "text/plain": [ " participants_site participants_sex participants_age\n", "subject_id \n", "1932355398536124106 5 F 9.301370\n", "5174041730092253771 19 M 29.000000\n", "10219322676643534800 19 F 45.000000\n", "10645466564919190227 5 F 9.216438\n", "14512541342641936232 28 M 15.050000" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_train_participants = data_train[[col for col in data_train.columns if col.startswith('participants')]]\n", "data_train_participants.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Structural MRI features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A set of structural features have been extracted for each subject: (i) normalized brain volume computed using subcortical segmentation of FreeSurfer and (ii) cortical thickness and area for right and left hemisphere of FreeSurfer." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
anatomy_lh_bankssts_areaanatomy_lh_caudalanteriorcingulate_areaanatomy_lh_caudalmiddlefrontal_areaanatomy_lh_cuneus_areaanatomy_lh_entorhinal_areaanatomy_lh_fusiform_areaanatomy_lh_inferiorparietal_areaanatomy_lh_inferiortemporal_areaanatomy_lh_isthmuscingulate_areaanatomy_lh_lateraloccipital_area...anatomy_MaskVolanatomy_BrainSegVol-to-eTIVanatomy_MaskVol-to-eTIVanatomy_lhSurfaceHolesanatomy_rhSurfaceHolesanatomy_SurfaceHolesanatomy_EstimatedTotalIntraCranialVolanatomy_eTIVanatomy_BrainSegVolNotVentanatomy_select
subject_id
1932355398536124106977.0427.01884.01449.0463.02790.04091.03305.0897.04406.0...1375171.00.8409761.07747230.031.061.01.276294e+061.276294e+061058903.01
51740417300922537711279.0730.02419.01611.0467.03562.05380.03555.01155.05611.0...1807924.00.7712291.03328545.054.099.01.749685e+061.749685e+061329340.01
10219322676643534800926.0446.01897.02135.0570.03064.04834.02602.01171.06395.0...1522076.00.7741171.08210748.050.098.01.406585e+061.406585e+061072503.01
10645466564919190227983.0588.02479.01312.0525.03766.05091.03433.01028.05405.0...1544951.00.8459861.08343755.057.0112.01.425972e+061.425972e+061194831.01
145125413426419362321488.0593.02309.01829.0726.03720.05432.03956.01033.05644.0...1738955.00.7937941.08364022.042.064.01.604735e+061.604735e+061263065.01
\n", "

5 rows × 208 columns

\n", "
" ], "text/plain": [ " anatomy_lh_bankssts_area \\\n", "subject_id \n", "1932355398536124106 977.0 \n", "5174041730092253771 1279.0 \n", "10219322676643534800 926.0 \n", "10645466564919190227 983.0 \n", "14512541342641936232 1488.0 \n", "\n", " anatomy_lh_caudalanteriorcingulate_area \\\n", "subject_id \n", "1932355398536124106 427.0 \n", "5174041730092253771 730.0 \n", "10219322676643534800 446.0 \n", "10645466564919190227 588.0 \n", "14512541342641936232 593.0 \n", "\n", " anatomy_lh_caudalmiddlefrontal_area \\\n", "subject_id \n", "1932355398536124106 1884.0 \n", "5174041730092253771 2419.0 \n", "10219322676643534800 1897.0 \n", "10645466564919190227 2479.0 \n", "14512541342641936232 2309.0 \n", "\n", " anatomy_lh_cuneus_area anatomy_lh_entorhinal_area \\\n", "subject_id \n", "1932355398536124106 1449.0 463.0 \n", "5174041730092253771 1611.0 467.0 \n", "10219322676643534800 2135.0 570.0 \n", "10645466564919190227 1312.0 525.0 \n", "14512541342641936232 1829.0 726.0 \n", "\n", " anatomy_lh_fusiform_area \\\n", "subject_id \n", "1932355398536124106 2790.0 \n", "5174041730092253771 3562.0 \n", "10219322676643534800 3064.0 \n", "10645466564919190227 3766.0 \n", "14512541342641936232 3720.0 \n", "\n", " anatomy_lh_inferiorparietal_area \\\n", "subject_id \n", "1932355398536124106 4091.0 \n", "5174041730092253771 5380.0 \n", "10219322676643534800 4834.0 \n", "10645466564919190227 5091.0 \n", "14512541342641936232 5432.0 \n", "\n", " anatomy_lh_inferiortemporal_area \\\n", "subject_id \n", "1932355398536124106 3305.0 \n", "5174041730092253771 3555.0 \n", "10219322676643534800 2602.0 \n", "10645466564919190227 3433.0 \n", "14512541342641936232 3956.0 \n", "\n", " anatomy_lh_isthmuscingulate_area \\\n", "subject_id \n", "1932355398536124106 897.0 \n", "5174041730092253771 1155.0 \n", "10219322676643534800 1171.0 \n", "10645466564919190227 1028.0 \n", "14512541342641936232 1033.0 \n", "\n", " anatomy_lh_lateraloccipital_area ... \\\n", "subject_id ... 
\n", "1932355398536124106 4406.0 ... \n", "5174041730092253771 5611.0 ... \n", "10219322676643534800 6395.0 ... \n", "10645466564919190227 5405.0 ... \n", "14512541342641936232 5644.0 ... \n", "\n", " anatomy_MaskVol anatomy_BrainSegVol-to-eTIV \\\n", "subject_id \n", "1932355398536124106 1375171.0 0.840976 \n", "5174041730092253771 1807924.0 0.771229 \n", "10219322676643534800 1522076.0 0.774117 \n", "10645466564919190227 1544951.0 0.845986 \n", "14512541342641936232 1738955.0 0.793794 \n", "\n", " anatomy_MaskVol-to-eTIV anatomy_lhSurfaceHoles \\\n", "subject_id \n", "1932355398536124106 1.077472 30.0 \n", "5174041730092253771 1.033285 45.0 \n", "10219322676643534800 1.082107 48.0 \n", "10645466564919190227 1.083437 55.0 \n", "14512541342641936232 1.083640 22.0 \n", "\n", " anatomy_rhSurfaceHoles anatomy_SurfaceHoles \\\n", "subject_id \n", "1932355398536124106 31.0 61.0 \n", "5174041730092253771 54.0 99.0 \n", "10219322676643534800 50.0 98.0 \n", "10645466564919190227 57.0 112.0 \n", "14512541342641936232 42.0 64.0 \n", "\n", " anatomy_EstimatedTotalIntraCranialVol anatomy_eTIV \\\n", "subject_id \n", "1932355398536124106 1.276294e+06 1.276294e+06 \n", "5174041730092253771 1.749685e+06 1.749685e+06 \n", "10219322676643534800 1.406585e+06 1.406585e+06 \n", "10645466564919190227 1.425972e+06 1.425972e+06 \n", "14512541342641936232 1.604735e+06 1.604735e+06 \n", "\n", " anatomy_BrainSegVolNotVent anatomy_select \n", "subject_id \n", "1932355398536124106 1058903.0 1 \n", "5174041730092253771 1329340.0 1 \n", "10219322676643534800 1072503.0 1 \n", "10645466564919190227 1194831.0 1 \n", "14512541342641936232 1263065.0 1 \n", "\n", "[5 rows x 208 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_train_anatomy = data_train[[col for col in data_train.columns if col.startswith('anatomy')]]\n", "data_train_anatomy.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the column 
`anatomy_select` contains a label assigned during a manual quality check (`0` and `3`: reject; `1`: accept; `2`: accept with reserve). This column can be used during training, for instance to exclude noisy data." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "subject_id\n", "1932355398536124106 1\n", "5174041730092253771 1\n", "10219322676643534800 1\n", "10645466564919190227 1\n", "14512541342641936232 1\n", "Name: anatomy_select, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_train_anatomy['anatomy_select'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Functional MRI features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The original data are resting-state functional MRI acquisitions. Each subject comes with fMRI signals extracted on different brain parcellations and atlases, together with a set of confound signals. These atlases and parcellations are: (i) BASC parcellations with 64, 122, and 197 regions (Bellec 2010), (ii) Ncuts parcellations (Craddock 2012), (iii) Harvard-Oxford anatomical parcellations, (iv) the MSDL functional atlas (Varoquaux 2011), and (v) the Power atlas (Power 2011). The script used for this extraction can be found [here](https://github.com/ramp-kits/autism/blob/master/preprocessing/extract_time_series.py)." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fmri_basc064fmri_basc122fmri_basc197fmri_craddock_scorr_meanfmri_harvard_oxford_cort_prob_2mmfmri_motionsfmri_msdlfmri_power_2011fmri_select
subject_id
1932355398536124106./data/fmri/basc064/1932355398536124106/run_1/..../data/fmri/basc122/1932355398536124106/run_1/..../data/fmri/basc197/1932355398536124106/run_1/..../data/fmri/craddock_scorr_mean/19323553985361..../data/fmri/harvard_oxford_cort_prob_2mm/19323..../data/fmri/motions/1932355398536124106/run_1/..../data/fmri/msdl/1932355398536124106/run_1/193..../data/fmri/power_2011/1932355398536124106/run...1
5174041730092253771./data/fmri/basc064/5174041730092253771/run_1/..../data/fmri/basc122/5174041730092253771/run_1/..../data/fmri/basc197/5174041730092253771/run_1/..../data/fmri/craddock_scorr_mean/51740417300922..../data/fmri/harvard_oxford_cort_prob_2mm/51740..../data/fmri/motions/5174041730092253771/run_1/..../data/fmri/msdl/5174041730092253771/run_1/517..../data/fmri/power_2011/5174041730092253771/run...1
10219322676643534800./data/fmri/basc064/10219322676643534800/run_1..../data/fmri/basc122/10219322676643534800/run_1..../data/fmri/basc197/10219322676643534800/run_1..../data/fmri/craddock_scorr_mean/10219322676643..../data/fmri/harvard_oxford_cort_prob_2mm/10219..../data/fmri/motions/10219322676643534800/run_1..../data/fmri/msdl/10219322676643534800/run_1/10..../data/fmri/power_2011/10219322676643534800/ru...1
10645466564919190227./data/fmri/basc064/10645466564919190227/run_1..../data/fmri/basc122/10645466564919190227/run_1..../data/fmri/basc197/10645466564919190227/run_1..../data/fmri/craddock_scorr_mean/10645466564919..../data/fmri/harvard_oxford_cort_prob_2mm/10645..../data/fmri/motions/10645466564919190227/run_1..../data/fmri/msdl/10645466564919190227/run_1/10..../data/fmri/power_2011/10645466564919190227/ru...1
14512541342641936232./data/fmri/basc064/14512541342641936232/run_1..../data/fmri/basc122/14512541342641936232/run_1..../data/fmri/basc197/14512541342641936232/run_1..../data/fmri/craddock_scorr_mean/14512541342641..../data/fmri/harvard_oxford_cort_prob_2mm/14512..../data/fmri/motions/14512541342641936232/run_1..../data/fmri/msdl/14512541342641936232/run_1/14..../data/fmri/power_2011/14512541342641936232/ru...1
\n", "
" ], "text/plain": [ " fmri_basc064 \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/basc064/1932355398536124106/run_1/... \n", "5174041730092253771 ./data/fmri/basc064/5174041730092253771/run_1/... \n", "10219322676643534800 ./data/fmri/basc064/10219322676643534800/run_1... \n", "10645466564919190227 ./data/fmri/basc064/10645466564919190227/run_1... \n", "14512541342641936232 ./data/fmri/basc064/14512541342641936232/run_1... \n", "\n", " fmri_basc122 \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/basc122/1932355398536124106/run_1/... \n", "5174041730092253771 ./data/fmri/basc122/5174041730092253771/run_1/... \n", "10219322676643534800 ./data/fmri/basc122/10219322676643534800/run_1... \n", "10645466564919190227 ./data/fmri/basc122/10645466564919190227/run_1... \n", "14512541342641936232 ./data/fmri/basc122/14512541342641936232/run_1... \n", "\n", " fmri_basc197 \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/basc197/1932355398536124106/run_1/... \n", "5174041730092253771 ./data/fmri/basc197/5174041730092253771/run_1/... \n", "10219322676643534800 ./data/fmri/basc197/10219322676643534800/run_1... \n", "10645466564919190227 ./data/fmri/basc197/10645466564919190227/run_1... \n", "14512541342641936232 ./data/fmri/basc197/14512541342641936232/run_1... \n", "\n", " fmri_craddock_scorr_mean \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/craddock_scorr_mean/19323553985361... \n", "5174041730092253771 ./data/fmri/craddock_scorr_mean/51740417300922... \n", "10219322676643534800 ./data/fmri/craddock_scorr_mean/10219322676643... \n", "10645466564919190227 ./data/fmri/craddock_scorr_mean/10645466564919... \n", "14512541342641936232 ./data/fmri/craddock_scorr_mean/14512541342641... \n", "\n", " fmri_harvard_oxford_cort_prob_2mm \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/harvard_oxford_cort_prob_2mm/19323... \n", "5174041730092253771 ./data/fmri/harvard_oxford_cort_prob_2mm/51740... 
\n", "10219322676643534800 ./data/fmri/harvard_oxford_cort_prob_2mm/10219... \n", "10645466564919190227 ./data/fmri/harvard_oxford_cort_prob_2mm/10645... \n", "14512541342641936232 ./data/fmri/harvard_oxford_cort_prob_2mm/14512... \n", "\n", " fmri_motions \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/motions/1932355398536124106/run_1/... \n", "5174041730092253771 ./data/fmri/motions/5174041730092253771/run_1/... \n", "10219322676643534800 ./data/fmri/motions/10219322676643534800/run_1... \n", "10645466564919190227 ./data/fmri/motions/10645466564919190227/run_1... \n", "14512541342641936232 ./data/fmri/motions/14512541342641936232/run_1... \n", "\n", " fmri_msdl \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/msdl/1932355398536124106/run_1/193... \n", "5174041730092253771 ./data/fmri/msdl/5174041730092253771/run_1/517... \n", "10219322676643534800 ./data/fmri/msdl/10219322676643534800/run_1/10... \n", "10645466564919190227 ./data/fmri/msdl/10645466564919190227/run_1/10... \n", "14512541342641936232 ./data/fmri/msdl/14512541342641936232/run_1/14... \n", "\n", " fmri_power_2011 \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/power_2011/1932355398536124106/run... \n", "5174041730092253771 ./data/fmri/power_2011/5174041730092253771/run... \n", "10219322676643534800 ./data/fmri/power_2011/10219322676643534800/ru... \n", "10645466564919190227 ./data/fmri/power_2011/10645466564919190227/ru... \n", "14512541342641936232 ./data/fmri/power_2011/14512541342641936232/ru... 
\n", "\n", " fmri_select \n", "subject_id \n", "1932355398536124106 1 \n", "5174041730092253771 1 \n", "10219322676643534800 1 \n", "10645466564919190227 1 \n", "14512541342641936232 1 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_train_functional = data_train[[col for col in data_train.columns if col.startswith('fmri')]]\n", "data_train_functional.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unlike the anatomical and participants data, the available fMRI data are filenames of CSV files in which the time-series information is stored. We show in the next section how to read and extract meaningful information from those files." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly to the anatomical data, the column `fmri_select` gives information about the manual quality check." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "subject_id\n", "1932355398536124106 1\n", "5174041730092253771 1\n", "10219322676643534800 1\n", "10645466564919190227 1\n", "14512541342641936232 1\n", "Name: fmri_select, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_train_functional['fmri_select'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Testing data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The testing data can be loaded in the same way:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "from problem import get_test_data\n", "\n", "data_test, labels_test = get_test_data()\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
participants_siteparticipants_sexparticipants_ageanatomy_lh_bankssts_areaanatomy_lh_caudalanteriorcingulate_areaanatomy_lh_caudalmiddlefrontal_areaanatomy_lh_cuneus_areaanatomy_lh_entorhinal_areaanatomy_lh_fusiform_areaanatomy_lh_inferiorparietal_area...anatomy_selectfmri_basc064fmri_basc122fmri_basc197fmri_craddock_scorr_meanfmri_harvard_oxford_cort_prob_2mmfmri_motionsfmri_msdlfmri_power_2011fmri_select
subject_id
518140926839378534831M12.200000985.0723.02851.01844.0495.03526.05658.0...1./data/fmri/basc064/5181409268393785348/run_1/..../data/fmri/basc122/5181409268393785348/run_1/..../data/fmri/basc197/5181409268393785348/run_1/..../data/fmri/craddock_scorr_mean/51814092683937..../data/fmri/harvard_oxford_cort_prob_2mm/51814..../data/fmri/motions/5181409268393785348/run_1/..../data/fmri/msdl/5181409268393785348/run_1/518..../data/fmri/power_2011/5181409268393785348/run...1
87978650493713155509M14.0000001174.0506.01890.01327.0462.03564.04408.0...1./data/fmri/basc064/8797865049371315550/run_1/..../data/fmri/basc122/8797865049371315550/run_1/..../data/fmri/basc197/8797865049371315550/run_1/..../data/fmri/craddock_scorr_mean/87978650493713..../data/fmri/harvard_oxford_cort_prob_2mm/87978..../data/fmri/motions/8797865049371315550/run_1/..../data/fmri/msdl/8797865049371315550/run_1/879..../data/fmri/power_2011/8797865049371315550/run...1
648638587832524514720M14.4250001288.0568.02406.01546.0432.03497.04808.0...1./data/fmri/basc064/6486385878325245147/run_1/..../data/fmri/basc122/6486385878325245147/run_1/..../data/fmri/basc197/6486385878325245147/run_1/..../data/fmri/craddock_scorr_mean/64863858783252..../data/fmri/harvard_oxford_cort_prob_2mm/64863..../data/fmri/motions/6486385878325245147/run_1/..../data/fmri/msdl/6486385878325245147/run_1/648..../data/fmri/power_2011/6486385878325245147/run...1
1712643843539839458833M22.8802001179.0991.02427.01771.0363.03579.06082.0...1./data/fmri/basc064/17126438435398394588/run_1..../data/fmri/basc122/17126438435398394588/run_1..../data/fmri/basc197/17126438435398394588/run_1..../data/fmri/craddock_scorr_mean/17126438435398..../data/fmri/harvard_oxford_cort_prob_2mm/17126..../data/fmri/motions/17126438435398394588/run_1..../data/fmri/msdl/17126438435398394588/run_1/17..../data/fmri/power_2011/17126438435398394588/ru...1
166380495221139992282M8.2520551064.0721.02445.01453.0561.03262.04885.0...2./data/fmri/basc064/16638049522113999228/run_1..../data/fmri/basc122/16638049522113999228/run_1..../data/fmri/basc197/16638049522113999228/run_1..../data/fmri/craddock_scorr_mean/16638049522113..../data/fmri/harvard_oxford_cort_prob_2mm/16638..../data/fmri/motions/16638049522113999228/run_1..../data/fmri/msdl/16638049522113999228/run_1/16..../data/fmri/power_2011/16638049522113999228/ru...0
\n", "

5 rows × 220 columns

\n", "
" ], "text/plain": [ " participants_site participants_sex participants_age \\\n", "subject_id \n", "5181409268393785348 31 M 12.200000 \n", "8797865049371315550 9 M 14.000000 \n", "6486385878325245147 20 M 14.425000 \n", "17126438435398394588 33 M 22.880200 \n", "16638049522113999228 2 M 8.252055 \n", "\n", " anatomy_lh_bankssts_area \\\n", "subject_id \n", "5181409268393785348 985.0 \n", "8797865049371315550 1174.0 \n", "6486385878325245147 1288.0 \n", "17126438435398394588 1179.0 \n", "16638049522113999228 1064.0 \n", "\n", " anatomy_lh_caudalanteriorcingulate_area \\\n", "subject_id \n", "5181409268393785348 723.0 \n", "8797865049371315550 506.0 \n", "6486385878325245147 568.0 \n", "17126438435398394588 991.0 \n", "16638049522113999228 721.0 \n", "\n", " anatomy_lh_caudalmiddlefrontal_area \\\n", "subject_id \n", "5181409268393785348 2851.0 \n", "8797865049371315550 1890.0 \n", "6486385878325245147 2406.0 \n", "17126438435398394588 2427.0 \n", "16638049522113999228 2445.0 \n", "\n", " anatomy_lh_cuneus_area anatomy_lh_entorhinal_area \\\n", "subject_id \n", "5181409268393785348 1844.0 495.0 \n", "8797865049371315550 1327.0 462.0 \n", "6486385878325245147 1546.0 432.0 \n", "17126438435398394588 1771.0 363.0 \n", "16638049522113999228 1453.0 561.0 \n", "\n", " anatomy_lh_fusiform_area \\\n", "subject_id \n", "5181409268393785348 3526.0 \n", "8797865049371315550 3564.0 \n", "6486385878325245147 3497.0 \n", "17126438435398394588 3579.0 \n", "16638049522113999228 3262.0 \n", "\n", " anatomy_lh_inferiorparietal_area ... \\\n", "subject_id ... \n", "5181409268393785348 5658.0 ... \n", "8797865049371315550 4408.0 ... \n", "6486385878325245147 4808.0 ... \n", "17126438435398394588 6082.0 ... \n", "16638049522113999228 4885.0 ... 
\n", "\n", " anatomy_select \\\n", "subject_id \n", "5181409268393785348 1 \n", "8797865049371315550 1 \n", "6486385878325245147 1 \n", "17126438435398394588 1 \n", "16638049522113999228 2 \n", "\n", " fmri_basc064 \\\n", "subject_id \n", "5181409268393785348 ./data/fmri/basc064/5181409268393785348/run_1/... \n", "8797865049371315550 ./data/fmri/basc064/8797865049371315550/run_1/... \n", "6486385878325245147 ./data/fmri/basc064/6486385878325245147/run_1/... \n", "17126438435398394588 ./data/fmri/basc064/17126438435398394588/run_1... \n", "16638049522113999228 ./data/fmri/basc064/16638049522113999228/run_1... \n", "\n", " fmri_basc122 \\\n", "subject_id \n", "5181409268393785348 ./data/fmri/basc122/5181409268393785348/run_1/... \n", "8797865049371315550 ./data/fmri/basc122/8797865049371315550/run_1/... \n", "6486385878325245147 ./data/fmri/basc122/6486385878325245147/run_1/... \n", "17126438435398394588 ./data/fmri/basc122/17126438435398394588/run_1... \n", "16638049522113999228 ./data/fmri/basc122/16638049522113999228/run_1... \n", "\n", " fmri_basc197 \\\n", "subject_id \n", "5181409268393785348 ./data/fmri/basc197/5181409268393785348/run_1/... \n", "8797865049371315550 ./data/fmri/basc197/8797865049371315550/run_1/... \n", "6486385878325245147 ./data/fmri/basc197/6486385878325245147/run_1/... \n", "17126438435398394588 ./data/fmri/basc197/17126438435398394588/run_1... \n", "16638049522113999228 ./data/fmri/basc197/16638049522113999228/run_1... \n", "\n", " fmri_craddock_scorr_mean \\\n", "subject_id \n", "5181409268393785348 ./data/fmri/craddock_scorr_mean/51814092683937... \n", "8797865049371315550 ./data/fmri/craddock_scorr_mean/87978650493713... \n", "6486385878325245147 ./data/fmri/craddock_scorr_mean/64863858783252... \n", "17126438435398394588 ./data/fmri/craddock_scorr_mean/17126438435398... \n", "16638049522113999228 ./data/fmri/craddock_scorr_mean/16638049522113... 
\n", "\n", " fmri_harvard_oxford_cort_prob_2mm \\\n", "subject_id \n", "5181409268393785348 ./data/fmri/harvard_oxford_cort_prob_2mm/51814... \n", "8797865049371315550 ./data/fmri/harvard_oxford_cort_prob_2mm/87978... \n", "6486385878325245147 ./data/fmri/harvard_oxford_cort_prob_2mm/64863... \n", "17126438435398394588 ./data/fmri/harvard_oxford_cort_prob_2mm/17126... \n", "16638049522113999228 ./data/fmri/harvard_oxford_cort_prob_2mm/16638... \n", "\n", " fmri_motions \\\n", "subject_id \n", "5181409268393785348 ./data/fmri/motions/5181409268393785348/run_1/... \n", "8797865049371315550 ./data/fmri/motions/8797865049371315550/run_1/... \n", "6486385878325245147 ./data/fmri/motions/6486385878325245147/run_1/... \n", "17126438435398394588 ./data/fmri/motions/17126438435398394588/run_1... \n", "16638049522113999228 ./data/fmri/motions/16638049522113999228/run_1... \n", "\n", " fmri_msdl \\\n", "subject_id \n", "5181409268393785348 ./data/fmri/msdl/5181409268393785348/run_1/518... \n", "8797865049371315550 ./data/fmri/msdl/8797865049371315550/run_1/879... \n", "6486385878325245147 ./data/fmri/msdl/6486385878325245147/run_1/648... \n", "17126438435398394588 ./data/fmri/msdl/17126438435398394588/run_1/17... \n", "16638049522113999228 ./data/fmri/msdl/16638049522113999228/run_1/16... \n", "\n", " fmri_power_2011 \\\n", "subject_id \n", "5181409268393785348 ./data/fmri/power_2011/5181409268393785348/run... \n", "8797865049371315550 ./data/fmri/power_2011/8797865049371315550/run... \n", "6486385878325245147 ./data/fmri/power_2011/6486385878325245147/run... \n", "17126438435398394588 ./data/fmri/power_2011/17126438435398394588/ru... \n", "16638049522113999228 ./data/fmri/power_2011/16638049522113999228/ru... 
\n", "\n", " fmri_select \n", "subject_id \n", "5181409268393785348 1 \n", "8797865049371315550 1 \n", "6486385878325245147 1 \n", "17126438435398394588 1 \n", "16638049522113999228 0 \n", "\n", "[5 rows x 220 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_test.head()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 0 0 1 0 1 1 0 0]\n" ] } ], "source": [ "print(labels_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Workflow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The framework is evaluated with a cross-validation approach. The metrics used are the area under the ROC curve (ROC-AUC) and the accuracy." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": false }, "outputs": [], "source": [ "from sklearn.pipeline import make_pipeline\n", "from sklearn.model_selection import cross_validate\n", "from problem import get_cv\n", "\n", "def evaluation(X, y):\n", " pipe = make_pipeline(FeatureExtractor(), Classifier())\n", " cv = get_cv(X, y)\n", " results = cross_validate(pipe, X, y, scoring=['roc_auc', 'accuracy'], cv=cv,\n", " verbose=1, return_train_score=True,\n", " n_jobs=1)\n", " \n", " return results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Simple starting kit: using only anatomical features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### FeatureExtractor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The available structural data can be used directly to perform classification. To this end, we will use a feature extractor (i.e. `FeatureExtractor`). 
This extractor will select only the anatomical features, dropping any information related to the fMRI-based features." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "from sklearn.base import BaseEstimator\n", "from sklearn.base import TransformerMixin\n", "\n", "\n", "class FeatureExtractor(BaseEstimator, TransformerMixin):\n", " def fit(self, X_df, y):\n", " return self\n", "\n", " def transform(self, X_df):\n", " # get only the anatomical information\n", " X = X_df[[col for col in X_df.columns if col.startswith('anatomy')]]\n", " return X.drop(columns='anatomy_select')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Classifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We propose to use a logistic regression classifier preceded by a scaler, which will center the data and scale it to unit variance using statistics computed on the training set." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "from sklearn.base import BaseEstimator\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.pipeline import make_pipeline\n", "\n", "\n", "class Classifier(BaseEstimator):\n", " def __init__(self):\n", " self.clf = make_pipeline(StandardScaler(), LogisticRegression())\n", "\n", " def fit(self, X, y):\n", " self.clf.fit(X, y)\n", " return self\n", " \n", " def predict(self, X):\n", " return self.clf.predict(X)\n", "\n", " def predict_proba(self, X):\n", " return self.clf.predict_proba(X)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Testing the submission" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can test our pipeline locally using the `evaluation` function that we defined earlier."
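As a side note before evaluating: the `anatomy_select` quality label introduced earlier can be used to discard noisy subjects, and such sample filtering has to happen where the target is available (e.g. inside `fit`). A minimal sketch on a toy frame — the subject ids and feature values below are invented for illustration, only the column-name convention follows the challenge data:

```python
import pandas as pd

# Toy frame mimicking the challenge data: `anatomy_select` holds the manual
# quality-check label (0 and 3 reject, 1 accept, 2 accept with reserve).
data = pd.DataFrame(
    {'anatomy_lh_bankssts_area': [985.0, 1174.0, 1288.0, 1179.0],
     'anatomy_select': [1, 0, 2, 3]},
    index=['subj_a', 'subj_b', 'subj_c', 'subj_d'])
labels = pd.Series([0, 1, 0, 1], index=data.index)

# Keep only subjects accepted by the quality check (labels 1 and 2);
# apply the same mask to the target so X and y stay aligned.
mask = data['anatomy_select'].isin([1, 2])
data_clean, labels_clean = data[mask], labels[mask]
print(data_clean.shape)  # (2, 2)
```

Whether excluding the rejected subjects actually helps generalization is an empirical question worth testing with the cross-validation loop.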
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training score ROC-AUC: 0.850 +- 0.004\n", "Validation score ROC-AUC: 0.652 +- 0.016 \n", "\n", "Training score accuracy: 0.772 +- 0.007\n", "Validation score accuracy: 0.622 +- 0.016\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 1.1s finished\n" ] } ], "source": [ "results = evaluation(data_train, labels_train)\n", "\n", "print(\"Training score ROC-AUC: {:.3f} +- {:.3f}\".format(np.mean(results['train_roc_auc']),\n", " np.std(results['train_roc_auc'])))\n", "print(\"Validation score ROC-AUC: {:.3f} +- {:.3f} \\n\".format(np.mean(results['test_roc_auc']),\n", " np.std(results['test_roc_auc'])))\n", "\n", "print(\"Training score accuracy: {:.3f} +- {:.3f}\".format(np.mean(results['train_accuracy']),\n", " np.std(results['train_accuracy'])))\n", "print(\"Validation score accuracy: {:.3f} +- {:.3f}\".format(np.mean(results['test_accuracy']),\n", " np.std(results['test_accuracy'])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Going further: using fMRI-derived features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the framework illustrated in the figure above, steps 1 and 2 have already been computed during preprocessing and correspond to the data provided for this challenge. Therefore, our feature extractor will implement step 3, which corresponds to the extraction of functional connectivity features. Step 4 is identical to the pipeline presented for the anatomical data, with a standard scaler followed by a logistic regression classifier.
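The shape bookkeeping of step 3 can be illustrated with plain correlations, a simplified stand-in for nilearn's tangent-space connectivity: a time series with one column per brain region yields a symmetric region-by-region connectivity matrix, and vectorizing its strictly lower triangle gives `n_regions * (n_regions - 1) / 2` features per subject. The sizes below are toy values, not those of the challenge atlases:

```python
import numpy as np

rng = np.random.RandomState(42)
n_timepoints, n_regions = 150, 10  # toy sizes for illustration only

# One subject's extracted time series: one column per brain region.
time_series = rng.randn(n_timepoints, n_regions)

# Step 3 (simplified): functional connectivity as a correlation matrix
# between region signals; columns are the variables, hence rowvar=False.
connectivity = np.corrcoef(time_series, rowvar=False)

# Vectorize the symmetric matrix: keep the strictly lower triangle.
tril_rows, tril_cols = np.tril_indices(n_regions, k=-1)
features = connectivity[tril_rows, tril_cols]
print(features.shape)  # (45,) i.e. 10 * 9 / 2 features for this subject
```

The tangent-space measure used in the notebook replaces the raw correlations with a group-referenced embedding, but the per-subject feature count follows the same combinatorics.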
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We pointed out earlier that the available fMRI features are filenames of the time-series CSV files. In order to limit the amount of data to be downloaded, we provide a fetcher `fetch_fmri_time_series()` to download only the time-series linked to a specific atlas." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on function fetch_fmri_time_series in module download_data:\n", "\n", "fetch_fmri_time_series(atlas='all')\n", " Fetch the time-series extracted from the fMRI data using a specific\n", " atlas.\n", " \n", " Parameters\n", " ----------\n", " atlas : string, default='all'\n", " The name of the atlas used during the extraction. The possibilities\n", " are:\n", " \n", " * `'basc064'`, `'basc122'`, `'basc197'`: BASC parcellations with 64,\n", " 122, and 197 regions [1]_;\n", " * `'craddock_scorr_mean'`: Ncuts parcellations [2]_;\n", " * `'harvard_oxford_cort_prob_2mm'`: Harvard-Oxford anatomical\n", " parcellations;\n", " * `'msdl'`: MSDL functional atlas [3]_;\n", " * `'power_2011'`: Power atlas [4]_.\n", " \n", " Returns\n", " -------\n", " None\n", " \n", " References\n", " ----------\n", " .. [1] Bellec, Pierre, et al. \"Multi-level bootstrap analysis of stable\n", " clusters in resting-state fMRI.\" Neuroimage 51.3 (2010): 1126-1139.\n", " \n", " .. [2] Craddock, R. Cameron, et al. \"A whole brain fMRI atlas generated\n", " via spatially constrained spectral clustering.\" Human brain mapping\n", " 33.8 (2012): 1914-1928.\n", " \n", " .. [3] Varoquaux, Gaël, et al. \"Multi-subject dictionary learning to\n", " segment an atlas of brain spontaneous activity.\" Biennial International\n", " Conference on Information Processing in Medical Imaging. Springer,\n", " Berlin, Heidelberg, 2011.\n", " \n", " .. [4] Power, Jonathan D., et al. 
\"Functional network organization of the\n", " human brain.\" Neuron 72.4 (2011): 665-678.\n", "\n" ] } ], "source": [ "from download_data import fetch_fmri_time_series\n", "help(fetch_fmri_time_series)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading completed ...\n" ] } ], "source": [ "fetch_fmri_time_series(atlas='msdl')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can download all atlases at once by passing `atlas='all'`. It is also possible to execute the file as a script: `python download_data.py all`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the `FeatureExtractor` below, we first select only the filenames related to the MSDL time-series data. We create a `FunctionTransformer` which will read the time-series from the CSV files on the fly and store them in a numpy array.\n", "Those series will be used to compute the functional connectivity matrices which will later be fed to the classifier." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/lemaitre/miniconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. 
In future, it will be treated as `np.float64 == np.dtype(float).type`.\n", " from ._conv import register_converters as _register_converters\n" ] } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "from sklearn.base import BaseEstimator, TransformerMixin\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.preprocessing import FunctionTransformer\n", "\n", "from nilearn.connectome import ConnectivityMeasure\n", "\n", "\n", "def _load_fmri(fmri_filenames):\n", " \"\"\"Load time-series extracted from the fMRI using a specific atlas.\"\"\"\n", " return np.array([pd.read_csv(subject_filename,\n", " header=None).values\n", " for subject_filename in fmri_filenames])\n", "\n", "\n", "class FeatureExtractor(BaseEstimator, TransformerMixin):\n", " def __init__(self):\n", " # make a transformer which will load the time series and compute the\n", " # connectome matrix\n", " self.transformer_fmri = make_pipeline(\n", " FunctionTransformer(func=_load_fmri, validate=False),\n", " ConnectivityMeasure(kind='tangent', vectorize=True))\n", " \n", " def fit(self, X_df, y):\n", " # get only the time series for the MSDL atlas\n", " fmri_filenames = X_df['fmri_msdl']\n", " self.transformer_fmri.fit(fmri_filenames, y)\n", " return self\n", "\n", " def transform(self, X_df):\n", " fmri_filenames = X_df['fmri_msdl']\n", " return self.transformer_fmri.transform(fmri_filenames)\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "from sklearn.base import BaseEstimator\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.pipeline import make_pipeline\n", "\n", "\n", "class Classifier(BaseEstimator):\n", " def __init__(self):\n", " self.clf = make_pipeline(StandardScaler(), LogisticRegression(C=1.))\n", "\n", " def fit(self, X, y):\n", " self.clf.fit(X, y)\n", " return self\n", " \n", " def predict(self, X):\n", " return 
self.clf.predict(X)\n", "\n", " def predict_proba(self, X):\n", " return self.clf.predict_proba(X)\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training score ROC-AUC: 1.000 +- 0.000\n", "Validation score ROC-AUC: 0.612 +- 0.019 \n", "\n", "Training score accuracy: 1.000 +- 0.000\n", "Validation score accuracy: 0.587 +- 0.021\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 4.1min finished\n" ] } ], "source": [ "results = evaluation(data_train, labels_train)\n", "\n", "print(\"Training score ROC-AUC: {:.3f} +- {:.3f}\".format(np.mean(results['train_roc_auc']),\n", " np.std(results['train_roc_auc'])))\n", "print(\"Validation score ROC-AUC: {:.3f} +- {:.3f} \\n\".format(np.mean(results['test_roc_auc']),\n", " np.std(results['test_roc_auc'])))\n", "\n", "print(\"Training score accuracy: {:.3f} +- {:.3f}\".format(np.mean(results['train_accuracy']),\n", " np.std(results['train_accuracy'])))\n", "print(\"Validation score accuracy: {:.3f} +- {:.3f}\".format(np.mean(results['test_accuracy']),\n", " np.std(results['test_accuracy'])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### More elaborate pipeline: combining anatomy and fMRI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This workflow is a combination of the two previous workflows. The `FeatureExtractor` extracts both structural and functional connectivity information and concatenates them. Note that each column name will contain either **connectome** or **anatomy** depending on the type of feature. This naming will be used to train different classifiers later on."
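Because every column name starts with either `connectome_` or `anatomy_`, a downstream classifier can route each feature group with a simple prefix test, the same idiom used throughout this notebook. A toy illustration — the column names follow the stated convention, but the values are invented:

```python
import pandas as pd

# Miniature version of the concatenated feature matrix produced by the
# combined FeatureExtractor (invented values, prefix convention as in the text).
X = pd.DataFrame({'connectome_0': [0.1, -0.2],
                  'connectome_1': [0.3, 0.0],
                  'anatomy_lh_bankssts_area': [985.0, 1174.0]})

# Route each feature group to its own block by column-name prefix.
X_connectome = X[[col for col in X.columns if col.startswith('connectome')]]
X_anatomy = X[[col for col in X.columns if col.startswith('anatomy')]]
print(X_connectome.shape, X_anatomy.shape)  # (2, 2) (2, 1)
```

Selecting by prefix keeps the routing robust to the number of connectome features, which depends on the atlas chosen.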
] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "from sklearn.base import BaseEstimator, TransformerMixin\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.preprocessing import FunctionTransformer\n", "\n", "from nilearn.connectome import ConnectivityMeasure\n", "\n", "\n", "def _load_fmri(fmri_filenames):\n", " \"\"\"Load time-series extracted from the fMRI using a specific atlas.\"\"\"\n", " return np.array([pd.read_csv(subject_filename,\n", " header=None).values\n", " for subject_filename in fmri_filenames])\n", "\n", "\n", "class FeatureExtractor(BaseEstimator, TransformerMixin):\n", " def __init__(self):\n", " # make a transformer which will load the time series and compute the\n", " # connectome matrix\n", " self.transformer_fmri = make_pipeline(\n", " FunctionTransformer(func=_load_fmri, validate=False),\n", " ConnectivityMeasure(kind='tangent', vectorize=True))\n", " \n", " def fit(self, X_df, y):\n", " fmri_filenames = X_df['fmri_msdl']\n", " self.transformer_fmri.fit(fmri_filenames, y)\n", " return self\n", "\n", " def transform(self, X_df):\n", " fmri_filenames = X_df['fmri_msdl']\n", " X_connectome = self.transformer_fmri.transform(fmri_filenames)\n", " X_connectome = pd.DataFrame(X_connectome, index=X_df.index)\n", " X_connectome.columns = ['connectome_{}'.format(i)\n", " for i in range(X_connectome.columns.size)]\n", " # get the anatomical information\n", " X_anatomy = X_df[[col for col in X_df.columns\n", " if col.startswith('anatomy')]]\n", " X_anatomy = X_anatomy.drop(columns='anatomy_select')\n", " # concatenate both matrices\n", " return pd.concat([X_connectome, X_anatomy], axis=1)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will create a classifier (i.e. a random forest classifier) which will use both connectome and anatomical features."
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "from sklearn.base import BaseEstimator\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "\n", "class Classifier(BaseEstimator):\n", " def __init__(self):\n", " self.clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)\n", "\n", " def fit(self, X, y):\n", " self.clf.fit(X, y)\n", " return self\n", " \n", " def predict(self, X):\n", " return self.clf.predict(X)\n", "\n", " def predict_proba(self, X):\n", " return self.clf.predict_proba(X)\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training score ROC-AUC: 1.000 +- 0.000\n", "Validation score ROC-AUC: 0.655 +- 0.028 \n", "\n", "Training score accuracy: 1.000 +- 0.000\n", "Validation score accuracy: 0.613 +- 0.033\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 4.1min finished\n" ] } ], "source": [ "results = evaluation(data_train, labels_train)\n", "\n", "print(\"Training score ROC-AUC: {:.3f} +- {:.3f}\".format(np.mean(results['train_roc_auc']),\n", " np.std(results['train_roc_auc'])))\n", "print(\"Validation score ROC-AUC: {:.3f} +- {:.3f} \\n\".format(np.mean(results['test_roc_auc']),\n", " np.std(results['test_roc_auc'])))\n", "\n", "print(\"Training score accuracy: {:.3f} +- {:.3f}\".format(np.mean(results['train_accuracy']),\n", " np.std(results['train_accuracy'])))\n", "print(\"Validation score accuracy: {:.3f} +- {:.3f}\".format(np.mean(results['test_accuracy']),\n", " np.std(results['test_accuracy'])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can propose a\n", "more complex classifier than the previous one. We will train two classifiers independently on the sMRI-derived and the fMRI-derived features. Then, a meta-classifier will combine both sources of information. 
We left out some data to be able to train the meta-classifier." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "from sklearn.base import BaseEstimator\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.model_selection import train_test_split\n", "\n", "\n", "class Classifier(BaseEstimator):\n", " def __init__(self):\n", " self.clf_connectome = make_pipeline(StandardScaler(),\n", " LogisticRegression(C=1.))\n", " self.clf_anatomy = make_pipeline(StandardScaler(),\n", " LogisticRegression(C=1.))\n", " self.meta_clf = LogisticRegression(C=1.)\n", "\n", " def fit(self, X, y):\n", " X_anatomy = X[[col for col in X.columns if col.startswith('anatomy')]]\n", " X_connectome = X[[col for col in X.columns\n", " if col.startswith('connectome')]]\n", " train_idx, validation_idx = train_test_split(range(y.size),\n", " test_size=0.33,\n", " shuffle=True,\n", " random_state=42)\n", " X_anatomy_train = X_anatomy.iloc[train_idx]\n", " X_anatomy_validation = X_anatomy.iloc[validation_idx]\n", " X_connectome_train = X_connectome.iloc[train_idx]\n", " X_connectome_validation = X_connectome.iloc[validation_idx]\n", " y_train = y[train_idx]\n", " y_validation = y[validation_idx]\n", "\n", " self.clf_connectome.fit(X_connectome_train, y_train)\n", " self.clf_anatomy.fit(X_anatomy_train, y_train)\n", "\n", " y_connectome_pred = self.clf_connectome.predict_proba(\n", " X_connectome_validation)\n", " y_anatomy_pred = self.clf_anatomy.predict_proba(\n", " X_anatomy_validation)\n", "\n", " self.meta_clf.fit(\n", " np.concatenate([y_connectome_pred, y_anatomy_pred], axis=1),\n", " y_validation)\n", " return self\n", " \n", " def predict(self, X):\n", " X_anatomy = X[[col for col in X.columns if col.startswith('anatomy')]]\n", " X_connectome = X[[col for col in X.columns\n", " if 
col.startswith('connectome')]]\n", "\n", " y_anatomy_pred = self.clf_anatomy.predict_proba(X_anatomy)\n", " y_connectome_pred = self.clf_connectome.predict_proba(X_connectome)\n", "\n", " return self.meta_clf.predict(\n", " np.concatenate([y_connectome_pred, y_anatomy_pred], axis=1))\n", "\n", " def predict_proba(self, X):\n", " X_anatomy = X[[col for col in X.columns if col.startswith('anatomy')]]\n", " X_connectome = X[[col for col in X.columns\n", " if col.startswith('connectome')]]\n", "\n", " y_anatomy_pred = self.clf_anatomy.predict_proba(X_anatomy)\n", " y_connectome_pred = self.clf_connectome.predict_proba(X_connectome)\n", "\n", " return self.meta_clf.predict_proba(\n", " np.concatenate([y_connectome_pred, y_anatomy_pred], axis=1))\n" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training score ROC-AUC: 0.915 +- 0.010\n", "Validation score ROC-AUC: 0.649 +- 0.023 \n", "\n", "Training score accuracy: 0.854 +- 0.017\n", "Validation score accuracy: 0.606 +- 0.022\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 4.2min finished\n" ] } ], "source": [ "results = evaluation(data_train, labels_train)\n", "\n", "print(\"Training score ROC-AUC: {:.3f} +- {:.3f}\".format(np.mean(results['train_roc_auc']),\n", " np.std(results['train_roc_auc'])))\n", "print(\"Validation score ROC-AUC: {:.3f} +- {:.3f} \\n\".format(np.mean(results['test_roc_auc']),\n", " np.std(results['test_roc_auc'])))\n", "\n", "print(\"Training score accuracy: {:.3f} +- {:.3f}\".format(np.mean(results['train_accuracy']),\n", " np.std(results['train_accuracy'])))\n", "print(\"Validation score accuracy: {:.3f} +- {:.3f}\".format(np.mean(results['test_accuracy']),\n", " np.std(results['test_accuracy'])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Submitting to the online challenge: [ramp.studio](http://ramp.studio)\n", "\n", 
"Once you have found a good model, you can submit it to [ramp.studio](http://www.ramp.studio) to enter the online challenge. First, if it is your first time using the RAMP platform, [sign up](http://www.ramp.studio/sign_up), otherwise [log in](http://www.ramp.studio/login). Then sign up for the event [autism](http://www.ramp.studio/events/autism). Both signups are controlled by RAMP administrators, so there **can be a delay between asking for signup and being able to submit**.\n", "\n", "Once your signup request is accepted, you can go to your [sandbox](http://www.ramp.studio/events/autism/sandbox) and copy-paste your code there. You can also create a new starting kit in the `submissions` folder containing both `feature_extractor.py` and `classifier.py` and upload those files directly. You can check the starting kit ([`feature_extractor.py`](/edit/submissions/starting_kit/feature_extractor.py) and [`classifier.py`](/edit/submissions/starting_kit/classifier.py)) for an example. The submission is trained and tested on our backend in a similar way to how `ramp_test_submission` does it locally. While your submission is waiting in the queue and being trained, you can find it in the \"New submissions (pending training)\" table in [my submissions](http://www.ramp.studio/events/autism/my_submissions). Once it is trained, you will receive an email, and your submission will show up on the [public leaderboard](http://www.ramp.studio/events/autism/leaderboard). \n", "If there is an error (despite having tested your submission locally with `ramp_test_submission`), it will show up in the \"Failed submissions\" table in [my submissions](http://www.ramp.studio/events/autism/my_submissions). 
You can click on the error to see part of the trace.\n", "\n", "After submission, do not forget to give credit to the previous submissions you reused or integrated into your submission.\n", "\n", "The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.\n", "\n", "The usual way to work with RAMP is to explore solutions, add feature transformations, select models, perhaps do some AutoML/hyperopt, etc., _locally_, and to check them with `ramp_test_submission`. The script prints the mean cross-validation scores.\n", "\n", "The official score in this RAMP (the first score column after \"historical contributivity\" on the [leaderboard](http://www.ramp.studio/events/autism/leaderboard)) is the AUC. When the score is good enough, you can submit it on the RAMP site." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[38;5;178m\u001b[1mTesting Autism Spectrum Disorder classification\u001b[0m\n", "\u001b[38;5;178m\u001b[1mReading train and test files from ./data ...\u001b[0m\n", "\u001b[38;5;178m\u001b[1mReading cv ...\u001b[0m\n", "\u001b[38;5;178m\u001b[1mTraining ./submissions/starting_kit_anatomy ...\u001b[0m\n", "\u001b[38;5;178m\u001b[1mCV fold 0\u001b[0m\n", "\t\u001b[38;5;178m\u001b[1mscore auc acc\u001b[0m\n", "\t\u001b[38;5;10m\u001b[1mtrain\u001b[0m \u001b[38;5;10m\u001b[1m0.847\u001b[0m \u001b[38;5;150m0.767\u001b[0m\n", "\t\u001b[38;5;12m\u001b[1mvalid\u001b[0m \u001b[38;5;12m\u001b[1m0.647\u001b[0m \u001b[38;5;105m0.611\u001b[0m\n", "\t\u001b[38;5;1m\u001b[1mtest\u001b[0m \u001b[38;5;1m\u001b[1m0.765\u001b[0m \u001b[38;5;218m0.696\u001b[0m\n", "\u001b[38;5;178m\u001b[1mCV fold 1\u001b[0m\n", "\t\u001b[38;5;178m\u001b[1mscore auc acc\u001b[0m\n", "\t\u001b[38;5;10m\u001b[1mtrain\u001b[0m \u001b[38;5;10m\u001b[1m0.842\u001b[0m \u001b[38;5;150m0.766\u001b[0m\n", "\t\u001b[38;5;12m\u001b[1mvalid\u001b[0m 
\u001b[38;5;12m\u001b[1m0.662\u001b[0m \u001b[38;5;105m0.628\u001b[0m\n", "\t\u001b[38;5;1m\u001b[1mtest\u001b[0m \u001b[38;5;1m\u001b[1m0.659\u001b[0m \u001b[38;5;218m0.478\u001b[0m\n", "\u001b[38;5;178m\u001b[1mCV fold 2\u001b[0m\n", "\t\u001b[38;5;178m\u001b[1mscore auc acc\u001b[0m\n", "\t\u001b[38;5;10m\u001b[1mtrain\u001b[0m \u001b[38;5;10m\u001b[1m0.854\u001b[0m \u001b[38;5;150m0.786\u001b[0m\n", "\t\u001b[38;5;12m\u001b[1mvalid\u001b[0m \u001b[38;5;12m\u001b[1m0.645\u001b[0m \u001b[38;5;105m0.615\u001b[0m\n", "\t\u001b[38;5;1m\u001b[1mtest\u001b[0m \u001b[38;5;1m\u001b[1m0.720\u001b[0m \u001b[38;5;218m0.609\u001b[0m\n", "\u001b[38;5;178m\u001b[1mCV fold 3\u001b[0m\n", "\t\u001b[38;5;178m\u001b[1mscore auc acc\u001b[0m\n", "\t\u001b[38;5;10m\u001b[1mtrain\u001b[0m \u001b[38;5;10m\u001b[1m0.849\u001b[0m \u001b[38;5;150m0.769\u001b[0m\n", "\t\u001b[38;5;12m\u001b[1mvalid\u001b[0m \u001b[38;5;12m\u001b[1m0.645\u001b[0m \u001b[38;5;105m0.619\u001b[0m\n", "\t\u001b[38;5;1m\u001b[1mtest\u001b[0m \u001b[38;5;1m\u001b[1m0.758\u001b[0m \u001b[38;5;218m0.565\u001b[0m\n", "\u001b[38;5;178m\u001b[1mCV fold 4\u001b[0m\n", "\t\u001b[38;5;178m\u001b[1mscore auc acc\u001b[0m\n", "\t\u001b[38;5;10m\u001b[1mtrain\u001b[0m \u001b[38;5;10m\u001b[1m0.852\u001b[0m \u001b[38;5;150m0.770\u001b[0m\n", "\t\u001b[38;5;12m\u001b[1mvalid\u001b[0m \u001b[38;5;12m\u001b[1m0.650\u001b[0m \u001b[38;5;105m0.606\u001b[0m\n", "\t\u001b[38;5;1m\u001b[1mtest\u001b[0m \u001b[38;5;1m\u001b[1m0.735\u001b[0m \u001b[38;5;218m0.652\u001b[0m\n", "\u001b[38;5;178m\u001b[1mCV fold 5\u001b[0m\n", "\t\u001b[38;5;178m\u001b[1mscore auc acc\u001b[0m\n", "\t\u001b[38;5;10m\u001b[1mtrain\u001b[0m \u001b[38;5;10m\u001b[1m0.847\u001b[0m \u001b[38;5;150m0.776\u001b[0m\n", "\t\u001b[38;5;12m\u001b[1mvalid\u001b[0m \u001b[38;5;12m\u001b[1m0.680\u001b[0m \u001b[38;5;105m0.642\u001b[0m\n", "\t\u001b[38;5;1m\u001b[1mtest\u001b[0m \u001b[38;5;1m\u001b[1m0.598\u001b[0m \u001b[38;5;218m0.565\u001b[0m\n", 
"\u001b[38;5;178m\u001b[1mCV fold 6\u001b[0m\n", "\t\u001b[38;5;178m\u001b[1mscore auc acc\u001b[0m\n", "\t\u001b[38;5;10m\u001b[1mtrain\u001b[0m \u001b[38;5;10m\u001b[1m0.852\u001b[0m \u001b[38;5;150m0.764\u001b[0m\n", "\t\u001b[38;5;12m\u001b[1mvalid\u001b[0m \u001b[38;5;12m\u001b[1m0.624\u001b[0m \u001b[38;5;105m0.602\u001b[0m\n", "\t\u001b[38;5;1m\u001b[1mtest\u001b[0m \u001b[38;5;1m\u001b[1m0.773\u001b[0m \u001b[38;5;218m0.652\u001b[0m\n", "\u001b[38;5;178m\u001b[1mCV fold 7\u001b[0m\n", "\t\u001b[38;5;178m\u001b[1mscore auc acc\u001b[0m\n", "\t\u001b[38;5;10m\u001b[1mtrain\u001b[0m \u001b[38;5;10m\u001b[1m0.854\u001b[0m \u001b[38;5;150m0.779\u001b[0m\n", "\t\u001b[38;5;12m\u001b[1mvalid\u001b[0m \u001b[38;5;12m\u001b[1m0.662\u001b[0m \u001b[38;5;105m0.650\u001b[0m\n", "\t\u001b[38;5;1m\u001b[1mtest\u001b[0m \u001b[38;5;1m\u001b[1m0.644\u001b[0m \u001b[38;5;218m0.478\u001b[0m\n", "\u001b[38;5;178m\u001b[1m----------------------------\u001b[0m\n", "\u001b[38;5;178m\u001b[1mMean CV scores\u001b[0m\n", "\u001b[38;5;178m\u001b[1m----------------------------\u001b[0m\n", "\t\u001b[38;5;178m\u001b[1mscore auc acc\u001b[0m\n", "\t\u001b[38;5;10m\u001b[1mtrain\u001b[0m \u001b[38;5;10m\u001b[1m0.85\u001b[0m \u001b[38;5;150m\u001b[38;5;150m±\u001b[0m\u001b[0m \u001b[38;5;150m0.0039\u001b[0m \u001b[38;5;150m0.772\u001b[0m \u001b[38;5;150m\u001b[38;5;150m±\u001b[0m\u001b[0m \u001b[38;5;150m0.0071\u001b[0m\n", "\t\u001b[38;5;12m\u001b[1mvalid\u001b[0m \u001b[38;5;12m\u001b[1m0.652\u001b[0m \u001b[38;5;105m\u001b[38;5;105m±\u001b[0m\u001b[0m \u001b[38;5;105m0.0155\u001b[0m \u001b[38;5;105m0.622\u001b[0m \u001b[38;5;105m\u001b[38;5;105m±\u001b[0m\u001b[0m \u001b[38;5;105m0.0161\u001b[0m\n", "\t\u001b[38;5;1m\u001b[1mtest\u001b[0m \u001b[38;5;1m\u001b[1m0.706\u001b[0m \u001b[38;5;218m\u001b[38;5;218m±\u001b[0m\u001b[0m \u001b[38;5;218m0.0605\u001b[0m \u001b[38;5;218m0.587\u001b[0m \u001b[38;5;218m\u001b[38;5;218m±\u001b[0m\u001b[0m \u001b[38;5;218m0.0753\u001b[0m\n", 
"\u001b[38;5;178m\u001b[1m----------------------------\u001b[0m\n", "\u001b[38;5;178m\u001b[1mBagged scores\u001b[0m\n", "\u001b[38;5;178m\u001b[1m----------------------------\u001b[0m\n", "\t\u001b[38;5;178m\u001b[1mscore auc\u001b[0m\n", "\t\u001b[38;5;12m\u001b[1mvalid\u001b[0m \u001b[38;5;12m\u001b[1m0.651\u001b[0m\n", "\t\u001b[38;5;1m\u001b[1mtest\u001b[0m \u001b[38;5;1m\u001b[1m0.720\u001b[0m\n" ] } ], "source": [ "!ramp_test_submission --submission starting_kit_anatomy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More information\n", "\n", "You can find more information in the [README](https://github.com/paris-saclay-cds/ramp-workflow/blob/master/README.md) of the [ramp-workflow library](https://github.com/paris-saclay-cds/ramp-workflow)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Questions\n", "\n", "Questions related to the starting kit should be asked on the [issue tracker](https://github.com/ramp-kits/autism/issues). The RAMP site administrators can be pinged at the [RAMP slack team](https://ramp-studio.slack.com/shared_invite/MTg1NDUxNDAyNDk2LTE0OTUzOTcwMDQtMThlOWY1NWU0Mg) in the #autism channel." ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 2 }