{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", " \n", " \n", " \n", "
[](http://www.datascience-paris-saclay.fr)[](https://research.pasteur.fr/en/team/group-roberto-toro/)
\n", "
\n", "\n", "

Imaging-psychiatry challenge: predicting autism

\n", "\n", "

A data challenge on Autism Spectrum Disorder detection

\n", "
\n", "
_Roberto Toro (Institut Pasteur), Nicolas Traut (Institut Pasteur), Anita Beggiato (Institut Pasteur), Katja Heuer (Institut Pasteur),
Gael Varoquaux (Inria, Parietal), Alex Gramfort (Inria, Parietal), Balazs Kegl (LAL),
Guillaume Lemaitre (CDS), Alexandre Boucaud (CDS), and Joris van den Bossche (CDS)_
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of Content\n", "\n", "0. [Prerequisites](#Software-prerequisites)\n", "1. [Introduction about the competition](#Introduction:-what-is-this-challenge-about)\n", "3. [The data](#The-data)\n", "4. [Workflow](#Workflow)\n", "5. [Evaluation](#Evaluation)\n", "6. [Submission](#Submitting-to-the-online-challenge:-ramp.studio)\n", "7. [More information](#More-information)\n", "8. [Questions](#Question)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**To download and run this notebook**: download the [full starting kit](https://github.com/ramp-kits/autism/archive/master.zip), with all the necessary files." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Software prerequisites" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This starting kit requires the following dependencies:\n", "\n", "* `numpy`\n", "* `scipy`\n", "* `pandas`\n", "* `scikit-learn`\n", "* `matplolib`\n", "* `seaborn`\n", "* `nilearn`\n", "* `jupyter`\n", "* `ramp-workflow`\n", "\n", "The following 2 cells will install if necessary the missing dependencies." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: scikit-learn in /home/lemaitre/miniconda3/lib/python3.6/site-packages\n", "Requirement already satisfied: seaborn in /home/lemaitre/miniconda3/lib/python3.6/site-packages\n", "Requirement already satisfied: nilearn in /home/lemaitre/miniconda3/lib/python3.6/site-packages\n", "Requirement already satisfied: numpy in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from seaborn)\n", "Requirement already satisfied: scipy in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from seaborn)\n", "Requirement already satisfied: matplotlib in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from seaborn)\n", "Requirement already satisfied: pandas in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from seaborn)\n", "Requirement already satisfied: nibabel>=2.0.2 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from nilearn)\n", "Requirement already satisfied: cycler>=0.10 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from matplotlib->seaborn)\n", "Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from matplotlib->seaborn)\n", "Requirement already satisfied: python-dateutil>=2.1 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from matplotlib->seaborn)\n", "Requirement already satisfied: pytz in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from matplotlib->seaborn)\n", "Requirement already satisfied: six>=1.10 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from matplotlib->seaborn)\n", "Requirement already satisfied: kiwisolver>=1.0.1 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from matplotlib->seaborn)\n", "Requirement already satisfied: setuptools in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from 
kiwisolver>=1.0.1->matplotlib->seaborn)\n", "\u001b[33mYou are using pip version 9.0.3, however version 10.0.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n" ] } ], "source": [ "# import sys\n", "# !{sys.executable} -m pip install \"scikit-learn>=0.19,<=0.21\" seaborn nilearn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Install `ramp-workflow` from the master branch on GitHub." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting https://api.github.com/repos/paris-saclay-cds/ramp-workflow/zipball/master\n", " Downloading https://api.github.com/repos/paris-saclay-cds/ramp-workflow/zipball/master (2.8MB)\n", "\u001b[K 100% |████████████████████████████████| 2.8MB 444kB/s eta 0:00:01\n", "\u001b[?25h Requirement already satisfied (use --upgrade to upgrade): ramp-workflow==0+unknown from https://api.github.com/repos/paris-saclay-cds/ramp-workflow/zipball/master in /home/lemaitre/miniconda3/lib/python3.6/site-packages\n", "Requirement already satisfied: numpy in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from ramp-workflow==0+unknown)\n", "Requirement already satisfied: scipy in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from ramp-workflow==0+unknown)\n", "Requirement already satisfied: pandas>=0.19.2 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from ramp-workflow==0+unknown)\n", "Requirement already satisfied: scikit-learn>=0.18 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from ramp-workflow==0+unknown)\n", "Requirement already satisfied: cloudpickle in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from ramp-workflow==0+unknown)\n", "Requirement already satisfied: colored in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from ramp-workflow==0+unknown)\n", "Requirement already satisfied: python-dateutil>=2.5.0 in 
/home/lemaitre/miniconda3/lib/python3.6/site-packages (from pandas>=0.19.2->ramp-workflow==0+unknown)\n", "Requirement already satisfied: pytz>=2011k in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from pandas>=0.19.2->ramp-workflow==0+unknown)\n", "Requirement already satisfied: six>=1.5 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from python-dateutil>=2.5.0->pandas>=0.19.2->ramp-workflow==0+unknown)\n", "\u001b[33mYou are using pip version 9.0.3, however version 10.0.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n" ] } ], "source": [ "# !{sys.executable} -m pip install https://api.github.com/repos/paris-saclay-cds/ramp-workflow/zipball/0.2.1" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction: what is this challenge about" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Autism spectrum disorder (ASD) is a developmental disorder affecting communication and behavior, with a wide range of symptom severity. ASD has been reported to affect approximately 1 in 166 children.\n", "\n", "Although there is a consensus that ASD is related to atypical brain networks and anatomy, the precise differences in brain anatomy and functional connectivity remain unclear. To address these issues, studies on large cohorts of subjects are necessary to ensure relevant findings." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Competition rules\n", "\n", "You can sign up to the challenge on the [ramp.studio](http://ramp.studio/) event page. Here are the competition rules:\n", "\n", "* Submissions will be trained on the 1150 subjects given in the starting kit and tested on a private set of size 1003.
More information on the distribution of the public and private sets can be found on the [challenge website](https://paris-saclay-cds.github.io/autism_challenge/).\n", "* The competition will end on July 1, 2018 at 18h UTC (20h in Paris).\n", "* All models will be trained on an AWS `m5.xlarge` instance (4 CPUs and 16 GiB of RAM).\n", "* Participants will be given a total of 80 machine hours. Submissions of a given participant will be ordered by submission timestamp. We will make an attempt to train all submissions, but starting from (and including) the first submission that makes the participant's total training time exceed 80 hours, all submissions will be disqualified from the competition (but can enter into the collaborative phase). Testing time will not count towards the limit. Training time will be displayed on the leaderboard for all submissions, rounded to the second, per cross-validation fold. Since we have 8 CV folds, the maximum total training time in the leaderboard is 10h. If a submission raises an exception, its training time will not count towards the total.\n", "* There is a timeout of 15 minutes between submissions.\n", "* Submissions made after the end of the competition will not qualify for prizes.\n", "* The public leaderboard will display validation scores obtained by running a stratified CV with 8 folds and a 20% validation set size, executed on the public (starting kit) data. The official scores will be calculated on the hidden test set and will be published after the closing of the competition. We will measure several scores for each submission: (i) the Area Under the Receiver Operating Characteristic Curve (ROC-AUC) and (ii) the accuracy. The ROC-AUC will be used to rank submissions.\n", "* The organizers will do their best so that the provided backend runs flawlessly.
We will communicate with participants in case of concerns and will try to resolve all issues, but we reserve the right to make unilateral decisions in specific cases, not covered by this set of minimal rules.\n", "* The organizers reserve the right to disqualify any participant found to violate the fair competitive spirit of the challenge. Possible reasons, without being exhaustive, are multiple accounts, attempts to access the test data, etc.\n", "* The challenge is essentially an individual contest, so there is no way to form official teams. Participants can form teams outside the platform before submitting any model individually, contact the organizers to let them know about the team, and submit on a single team member's account. However, submitting on one's own and participating in such a team at the same time is against the \"no multiple accounts\" rule, so, if discovered, may lead to disqualification.\n", "* Participants retain copyright on their submitted code and grant reuse under BSD 3-Clause License.\n", "\n", "Participants accept these rules automatically when making a submission at the RAMP site." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prizes of the competitive phase\n", "\n", "Ties in the competitive scores will be broken by earlier submission time.\n", "\n", "* 3000 €: the top submission according to private test ROC-AUC at the end of the competitive phase.\n", "* 2000 €: the second best submission according to private test ROC-AUC at the end of the competitive phase.\n", "* 1000 €: the third best submission according to private test ROC-AUC at the end of the competitive phase.\n", "* 500 €: from the fourth to the tenth best submissions according to the private test ROC-AUC at the end of the competitive phase." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sponsorship\n", "\n", "Prizes are provided by [IESF](https://www.iesf.fr/) and [Paris-Saclay CDS](http://www.datascience-paris-saclay.fr/). 
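Since the ROC-AUC is the metric used to rank submissions, it helps to recall what it measures: the probability that a randomly chosen ASD subject receives a higher score than a randomly chosen control. A minimal pure-Python sketch of this rank (Mann-Whitney U) formulation, for illustration only — it is not the scoring code used by the platform:

```python
def roc_auc(y_true, y_score):
    """ROC-AUC via its rank formulation: the fraction of (positive, negative)
    pairs in which the positive is scored higher, counting ties as half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A classifier that scores every positive above every negative reaches 1.0; random scoring hovers around 0.5 regardless of class imbalance, which is why ROC-AUC is a natural choice here.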
The computational time is provided by AWS on a research credit granted to the [Paris-Saclay CDS](http://www.datascience-paris-saclay.fr/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start by downloading the data from the internet." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from problem import get_train_data\n", "\n", "data_train, labels_train = get_train_data()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
participants_siteparticipants_sexparticipants_ageanatomy_lh_bankssts_areaanatomy_lh_caudalanteriorcingulate_areaanatomy_lh_caudalmiddlefrontal_areaanatomy_lh_cuneus_areaanatomy_lh_entorhinal_areaanatomy_lh_fusiform_areaanatomy_lh_inferiorparietal_area...anatomy_selectfmri_basc064fmri_basc122fmri_basc197fmri_craddock_scorr_meanfmri_harvard_oxford_cort_prob_2mmfmri_motionsfmri_msdlfmri_power_2011fmri_select
subject_id
19323553985361241065F9.301370977.0427.01884.01449.0463.02790.04091.0...1./data/fmri/basc064/1932355398536124106/run_1/..../data/fmri/basc122/1932355398536124106/run_1/..../data/fmri/basc197/1932355398536124106/run_1/..../data/fmri/craddock_scorr_mean/19323553985361..../data/fmri/harvard_oxford_cort_prob_2mm/19323..../data/fmri/motions/1932355398536124106/run_1/..../data/fmri/msdl/1932355398536124106/run_1/193..../data/fmri/power_2011/1932355398536124106/run...1
517404173009225377119M29.0000001279.0730.02419.01611.0467.03562.05380.0...1./data/fmri/basc064/5174041730092253771/run_1/..../data/fmri/basc122/5174041730092253771/run_1/..../data/fmri/basc197/5174041730092253771/run_1/..../data/fmri/craddock_scorr_mean/51740417300922..../data/fmri/harvard_oxford_cort_prob_2mm/51740..../data/fmri/motions/5174041730092253771/run_1/..../data/fmri/msdl/5174041730092253771/run_1/517..../data/fmri/power_2011/5174041730092253771/run...1
1021932267664353480019F45.000000926.0446.01897.02135.0570.03064.04834.0...1./data/fmri/basc064/10219322676643534800/run_1..../data/fmri/basc122/10219322676643534800/run_1..../data/fmri/basc197/10219322676643534800/run_1..../data/fmri/craddock_scorr_mean/10219322676643..../data/fmri/harvard_oxford_cort_prob_2mm/10219..../data/fmri/motions/10219322676643534800/run_1..../data/fmri/msdl/10219322676643534800/run_1/10..../data/fmri/power_2011/10219322676643534800/ru...1
106454665649191902275F9.216438983.0588.02479.01312.0525.03766.05091.0...1./data/fmri/basc064/10645466564919190227/run_1..../data/fmri/basc122/10645466564919190227/run_1..../data/fmri/basc197/10645466564919190227/run_1..../data/fmri/craddock_scorr_mean/10645466564919..../data/fmri/harvard_oxford_cort_prob_2mm/10645..../data/fmri/motions/10645466564919190227/run_1..../data/fmri/msdl/10645466564919190227/run_1/10..../data/fmri/power_2011/10645466564919190227/ru...1
1451254134264193623228M15.0500001488.0593.02309.01829.0726.03720.05432.0...1./data/fmri/basc064/14512541342641936232/run_1..../data/fmri/basc122/14512541342641936232/run_1..../data/fmri/basc197/14512541342641936232/run_1..../data/fmri/craddock_scorr_mean/14512541342641..../data/fmri/harvard_oxford_cort_prob_2mm/14512..../data/fmri/motions/14512541342641936232/run_1..../data/fmri/msdl/14512541342641936232/run_1/14..../data/fmri/power_2011/14512541342641936232/ru...1
\n", "

5 rows × 220 columns

\n", "
" ], "text/plain": [ " participants_site participants_sex participants_age \\\n", "subject_id \n", "1932355398536124106 5 F 9.301370 \n", "5174041730092253771 19 M 29.000000 \n", "10219322676643534800 19 F 45.000000 \n", "10645466564919190227 5 F 9.216438 \n", "14512541342641936232 28 M 15.050000 \n", "\n", " anatomy_lh_bankssts_area \\\n", "subject_id \n", "1932355398536124106 977.0 \n", "5174041730092253771 1279.0 \n", "10219322676643534800 926.0 \n", "10645466564919190227 983.0 \n", "14512541342641936232 1488.0 \n", "\n", " anatomy_lh_caudalanteriorcingulate_area \\\n", "subject_id \n", "1932355398536124106 427.0 \n", "5174041730092253771 730.0 \n", "10219322676643534800 446.0 \n", "10645466564919190227 588.0 \n", "14512541342641936232 593.0 \n", "\n", " anatomy_lh_caudalmiddlefrontal_area \\\n", "subject_id \n", "1932355398536124106 1884.0 \n", "5174041730092253771 2419.0 \n", "10219322676643534800 1897.0 \n", "10645466564919190227 2479.0 \n", "14512541342641936232 2309.0 \n", "\n", " anatomy_lh_cuneus_area anatomy_lh_entorhinal_area \\\n", "subject_id \n", "1932355398536124106 1449.0 463.0 \n", "5174041730092253771 1611.0 467.0 \n", "10219322676643534800 2135.0 570.0 \n", "10645466564919190227 1312.0 525.0 \n", "14512541342641936232 1829.0 726.0 \n", "\n", " anatomy_lh_fusiform_area \\\n", "subject_id \n", "1932355398536124106 2790.0 \n", "5174041730092253771 3562.0 \n", "10219322676643534800 3064.0 \n", "10645466564919190227 3766.0 \n", "14512541342641936232 3720.0 \n", "\n", " anatomy_lh_inferiorparietal_area ... \\\n", "subject_id ... \n", "1932355398536124106 4091.0 ... \n", "5174041730092253771 5380.0 ... \n", "10219322676643534800 4834.0 ... \n", "10645466564919190227 5091.0 ... \n", "14512541342641936232 5432.0 ... 
\n", "\n", " anatomy_select \\\n", "subject_id \n", "1932355398536124106 1 \n", "5174041730092253771 1 \n", "10219322676643534800 1 \n", "10645466564919190227 1 \n", "14512541342641936232 1 \n", "\n", " fmri_basc064 \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/basc064/1932355398536124106/run_1/... \n", "5174041730092253771 ./data/fmri/basc064/5174041730092253771/run_1/... \n", "10219322676643534800 ./data/fmri/basc064/10219322676643534800/run_1... \n", "10645466564919190227 ./data/fmri/basc064/10645466564919190227/run_1... \n", "14512541342641936232 ./data/fmri/basc064/14512541342641936232/run_1... \n", "\n", " fmri_basc122 \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/basc122/1932355398536124106/run_1/... \n", "5174041730092253771 ./data/fmri/basc122/5174041730092253771/run_1/... \n", "10219322676643534800 ./data/fmri/basc122/10219322676643534800/run_1... \n", "10645466564919190227 ./data/fmri/basc122/10645466564919190227/run_1... \n", "14512541342641936232 ./data/fmri/basc122/14512541342641936232/run_1... \n", "\n", " fmri_basc197 \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/basc197/1932355398536124106/run_1/... \n", "5174041730092253771 ./data/fmri/basc197/5174041730092253771/run_1/... \n", "10219322676643534800 ./data/fmri/basc197/10219322676643534800/run_1... \n", "10645466564919190227 ./data/fmri/basc197/10645466564919190227/run_1... \n", "14512541342641936232 ./data/fmri/basc197/14512541342641936232/run_1... \n", "\n", " fmri_craddock_scorr_mean \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/craddock_scorr_mean/19323553985361... \n", "5174041730092253771 ./data/fmri/craddock_scorr_mean/51740417300922... \n", "10219322676643534800 ./data/fmri/craddock_scorr_mean/10219322676643... \n", "10645466564919190227 ./data/fmri/craddock_scorr_mean/10645466564919... \n", "14512541342641936232 ./data/fmri/craddock_scorr_mean/14512541342641... 
\n", "\n", " fmri_harvard_oxford_cort_prob_2mm \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/harvard_oxford_cort_prob_2mm/19323... \n", "5174041730092253771 ./data/fmri/harvard_oxford_cort_prob_2mm/51740... \n", "10219322676643534800 ./data/fmri/harvard_oxford_cort_prob_2mm/10219... \n", "10645466564919190227 ./data/fmri/harvard_oxford_cort_prob_2mm/10645... \n", "14512541342641936232 ./data/fmri/harvard_oxford_cort_prob_2mm/14512... \n", "\n", " fmri_motions \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/motions/1932355398536124106/run_1/... \n", "5174041730092253771 ./data/fmri/motions/5174041730092253771/run_1/... \n", "10219322676643534800 ./data/fmri/motions/10219322676643534800/run_1... \n", "10645466564919190227 ./data/fmri/motions/10645466564919190227/run_1... \n", "14512541342641936232 ./data/fmri/motions/14512541342641936232/run_1... \n", "\n", " fmri_msdl \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/msdl/1932355398536124106/run_1/193... \n", "5174041730092253771 ./data/fmri/msdl/5174041730092253771/run_1/517... \n", "10219322676643534800 ./data/fmri/msdl/10219322676643534800/run_1/10... \n", "10645466564919190227 ./data/fmri/msdl/10645466564919190227/run_1/10... \n", "14512541342641936232 ./data/fmri/msdl/14512541342641936232/run_1/14... \n", "\n", " fmri_power_2011 \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/power_2011/1932355398536124106/run... \n", "5174041730092253771 ./data/fmri/power_2011/5174041730092253771/run... \n", "10219322676643534800 ./data/fmri/power_2011/10219322676643534800/ru... \n", "10645466564919190227 ./data/fmri/power_2011/10645466564919190227/ru... \n", "14512541342641936232 ./data/fmri/power_2011/14512541342641936232/ru... 
\n", "\n", " fmri_select \n", "subject_id \n", "1932355398536124106 1 \n", "5174041730092253771 1 \n", "10219322676643534800 1 \n", "10645466564919190227 1 \n", "14512541342641936232 1 \n", "\n", "[5 rows x 220 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_train.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 0 1 ... 0 1 0]\n" ] } ], "source": [ "print(labels_train)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of subjects in the training tests: 1127\n" ] } ], "source": [ "print('Number of subjects in the training tests: {}'.format(labels_train.size))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Participant features" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
participants_siteparticipants_sexparticipants_age
subject_id
19323553985361241065F9.301370
517404173009225377119M29.000000
1021932267664353480019F45.000000
106454665649191902275F9.216438
1451254134264193623228M15.050000
\n", "
" ], "text/plain": [ " participants_site participants_sex participants_age\n", "subject_id \n", "1932355398536124106 5 F 9.301370\n", "5174041730092253771 19 M 29.000000\n", "10219322676643534800 19 F 45.000000\n", "10645466564919190227 5 F 9.216438\n", "14512541342641936232 28 M 15.050000" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_train_participants = data_train[[col for col in data_train.columns if col.startswith('participants')]]\n", "data_train_participants.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Structural MRI features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A set of structural features have been extracted for each subject: (i) normalized brain volume computed using subcortical segmentation of FreeSurfer and (ii) cortical thickness and area for right and left hemisphere of FreeSurfer." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
anatomy_lh_bankssts_areaanatomy_lh_caudalanteriorcingulate_areaanatomy_lh_caudalmiddlefrontal_areaanatomy_lh_cuneus_areaanatomy_lh_entorhinal_areaanatomy_lh_fusiform_areaanatomy_lh_inferiorparietal_areaanatomy_lh_inferiortemporal_areaanatomy_lh_isthmuscingulate_areaanatomy_lh_lateraloccipital_area...anatomy_MaskVolanatomy_BrainSegVol-to-eTIVanatomy_MaskVol-to-eTIVanatomy_lhSurfaceHolesanatomy_rhSurfaceHolesanatomy_SurfaceHolesanatomy_EstimatedTotalIntraCranialVolanatomy_eTIVanatomy_BrainSegVolNotVentanatomy_select
subject_id
1932355398536124106977.0427.01884.01449.0463.02790.04091.03305.0897.04406.0...1375171.00.8409761.07747230.031.061.01.276294e+061.276294e+061058903.01
51740417300922537711279.0730.02419.01611.0467.03562.05380.03555.01155.05611.0...1807924.00.7712291.03328545.054.099.01.749685e+061.749685e+061329340.01
10219322676643534800926.0446.01897.02135.0570.03064.04834.02602.01171.06395.0...1522076.00.7741171.08210748.050.098.01.406585e+061.406585e+061072503.01
10645466564919190227983.0588.02479.01312.0525.03766.05091.03433.01028.05405.0...1544951.00.8459861.08343755.057.0112.01.425972e+061.425972e+061194831.01
145125413426419362321488.0593.02309.01829.0726.03720.05432.03956.01033.05644.0...1738955.00.7937941.08364022.042.064.01.604735e+061.604735e+061263065.01
\n", "

5 rows × 208 columns

\n", "
" ], "text/plain": [ " anatomy_lh_bankssts_area \\\n", "subject_id \n", "1932355398536124106 977.0 \n", "5174041730092253771 1279.0 \n", "10219322676643534800 926.0 \n", "10645466564919190227 983.0 \n", "14512541342641936232 1488.0 \n", "\n", " anatomy_lh_caudalanteriorcingulate_area \\\n", "subject_id \n", "1932355398536124106 427.0 \n", "5174041730092253771 730.0 \n", "10219322676643534800 446.0 \n", "10645466564919190227 588.0 \n", "14512541342641936232 593.0 \n", "\n", " anatomy_lh_caudalmiddlefrontal_area \\\n", "subject_id \n", "1932355398536124106 1884.0 \n", "5174041730092253771 2419.0 \n", "10219322676643534800 1897.0 \n", "10645466564919190227 2479.0 \n", "14512541342641936232 2309.0 \n", "\n", " anatomy_lh_cuneus_area anatomy_lh_entorhinal_area \\\n", "subject_id \n", "1932355398536124106 1449.0 463.0 \n", "5174041730092253771 1611.0 467.0 \n", "10219322676643534800 2135.0 570.0 \n", "10645466564919190227 1312.0 525.0 \n", "14512541342641936232 1829.0 726.0 \n", "\n", " anatomy_lh_fusiform_area \\\n", "subject_id \n", "1932355398536124106 2790.0 \n", "5174041730092253771 3562.0 \n", "10219322676643534800 3064.0 \n", "10645466564919190227 3766.0 \n", "14512541342641936232 3720.0 \n", "\n", " anatomy_lh_inferiorparietal_area \\\n", "subject_id \n", "1932355398536124106 4091.0 \n", "5174041730092253771 5380.0 \n", "10219322676643534800 4834.0 \n", "10645466564919190227 5091.0 \n", "14512541342641936232 5432.0 \n", "\n", " anatomy_lh_inferiortemporal_area \\\n", "subject_id \n", "1932355398536124106 3305.0 \n", "5174041730092253771 3555.0 \n", "10219322676643534800 2602.0 \n", "10645466564919190227 3433.0 \n", "14512541342641936232 3956.0 \n", "\n", " anatomy_lh_isthmuscingulate_area \\\n", "subject_id \n", "1932355398536124106 897.0 \n", "5174041730092253771 1155.0 \n", "10219322676643534800 1171.0 \n", "10645466564919190227 1028.0 \n", "14512541342641936232 1033.0 \n", "\n", " anatomy_lh_lateraloccipital_area ... \\\n", "subject_id ... 
\n", "1932355398536124106 4406.0 ... \n", "5174041730092253771 5611.0 ... \n", "10219322676643534800 6395.0 ... \n", "10645466564919190227 5405.0 ... \n", "14512541342641936232 5644.0 ... \n", "\n", " anatomy_MaskVol anatomy_BrainSegVol-to-eTIV \\\n", "subject_id \n", "1932355398536124106 1375171.0 0.840976 \n", "5174041730092253771 1807924.0 0.771229 \n", "10219322676643534800 1522076.0 0.774117 \n", "10645466564919190227 1544951.0 0.845986 \n", "14512541342641936232 1738955.0 0.793794 \n", "\n", " anatomy_MaskVol-to-eTIV anatomy_lhSurfaceHoles \\\n", "subject_id \n", "1932355398536124106 1.077472 30.0 \n", "5174041730092253771 1.033285 45.0 \n", "10219322676643534800 1.082107 48.0 \n", "10645466564919190227 1.083437 55.0 \n", "14512541342641936232 1.083640 22.0 \n", "\n", " anatomy_rhSurfaceHoles anatomy_SurfaceHoles \\\n", "subject_id \n", "1932355398536124106 31.0 61.0 \n", "5174041730092253771 54.0 99.0 \n", "10219322676643534800 50.0 98.0 \n", "10645466564919190227 57.0 112.0 \n", "14512541342641936232 42.0 64.0 \n", "\n", " anatomy_EstimatedTotalIntraCranialVol anatomy_eTIV \\\n", "subject_id \n", "1932355398536124106 1.276294e+06 1.276294e+06 \n", "5174041730092253771 1.749685e+06 1.749685e+06 \n", "10219322676643534800 1.406585e+06 1.406585e+06 \n", "10645466564919190227 1.425972e+06 1.425972e+06 \n", "14512541342641936232 1.604735e+06 1.604735e+06 \n", "\n", " anatomy_BrainSegVolNotVent anatomy_select \n", "subject_id \n", "1932355398536124106 1058903.0 1 \n", "5174041730092253771 1329340.0 1 \n", "10219322676643534800 1072503.0 1 \n", "10645466564919190227 1194831.0 1 \n", "14512541342641936232 1263065.0 1 \n", "\n", "[5 rows x 208 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_train_anatomy = data_train[[col for col in data_train.columns if col.startswith('anatomy')]]\n", "data_train_anatomy.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the column 
`anatomy_select` contains a label assigned during a manual quality check (`0` and `3`: reject; `1`: accept; `2`: accept with reserve). This column can be used during training, for instance to exclude noisy data." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "subject_id\n", "1932355398536124106 1\n", "5174041730092253771 1\n", "10219322676643534800 1\n", "10645466564919190227 1\n", "14512541342641936232 1\n", "Name: anatomy_select, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_train_anatomy['anatomy_select'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Functional MRI features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The original data are resting-state functional MRI acquisitions. Each subject comes with fMRI signals extracted on different brain parcellations and atlases, together with a set of confound signals. These atlases and parcellations are: (i) BASC parcellations with 64, 122, and 197 regions (Bellec 2010), (ii) Ncuts parcellations (Craddock 2012), (iii) Harvard-Oxford anatomical parcellations, (iv) the MSDL functional atlas (Varoquaux 2011), and (v) the Power atlas (Power 2011). The script used for this extraction can be found [here](https://github.com/ramp-kits/autism/blob/master/preprocessing/extract_time_series.py)." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fmri_basc064fmri_basc122fmri_basc197fmri_craddock_scorr_meanfmri_harvard_oxford_cort_prob_2mmfmri_motionsfmri_msdlfmri_power_2011fmri_select
subject_id
1932355398536124106./data/fmri/basc064/1932355398536124106/run_1/..../data/fmri/basc122/1932355398536124106/run_1/..../data/fmri/basc197/1932355398536124106/run_1/..../data/fmri/craddock_scorr_mean/19323553985361..../data/fmri/harvard_oxford_cort_prob_2mm/19323..../data/fmri/motions/1932355398536124106/run_1/..../data/fmri/msdl/1932355398536124106/run_1/193..../data/fmri/power_2011/1932355398536124106/run...1
5174041730092253771./data/fmri/basc064/5174041730092253771/run_1/..../data/fmri/basc122/5174041730092253771/run_1/..../data/fmri/basc197/5174041730092253771/run_1/..../data/fmri/craddock_scorr_mean/51740417300922..../data/fmri/harvard_oxford_cort_prob_2mm/51740..../data/fmri/motions/5174041730092253771/run_1/..../data/fmri/msdl/5174041730092253771/run_1/517..../data/fmri/power_2011/5174041730092253771/run...1
10219322676643534800./data/fmri/basc064/10219322676643534800/run_1..../data/fmri/basc122/10219322676643534800/run_1..../data/fmri/basc197/10219322676643534800/run_1..../data/fmri/craddock_scorr_mean/10219322676643..../data/fmri/harvard_oxford_cort_prob_2mm/10219..../data/fmri/motions/10219322676643534800/run_1..../data/fmri/msdl/10219322676643534800/run_1/10..../data/fmri/power_2011/10219322676643534800/ru...1
10645466564919190227./data/fmri/basc064/10645466564919190227/run_1..../data/fmri/basc122/10645466564919190227/run_1..../data/fmri/basc197/10645466564919190227/run_1..../data/fmri/craddock_scorr_mean/10645466564919..../data/fmri/harvard_oxford_cort_prob_2mm/10645..../data/fmri/motions/10645466564919190227/run_1..../data/fmri/msdl/10645466564919190227/run_1/10..../data/fmri/power_2011/10645466564919190227/ru...1
14512541342641936232./data/fmri/basc064/14512541342641936232/run_1..../data/fmri/basc122/14512541342641936232/run_1..../data/fmri/basc197/14512541342641936232/run_1..../data/fmri/craddock_scorr_mean/14512541342641..../data/fmri/harvard_oxford_cort_prob_2mm/14512..../data/fmri/motions/14512541342641936232/run_1..../data/fmri/msdl/14512541342641936232/run_1/14..../data/fmri/power_2011/14512541342641936232/ru...1
\n", "
" ], "text/plain": [ " fmri_basc064 \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/basc064/1932355398536124106/run_1/... \n", "5174041730092253771 ./data/fmri/basc064/5174041730092253771/run_1/... \n", "10219322676643534800 ./data/fmri/basc064/10219322676643534800/run_1... \n", "10645466564919190227 ./data/fmri/basc064/10645466564919190227/run_1... \n", "14512541342641936232 ./data/fmri/basc064/14512541342641936232/run_1... \n", "\n", " fmri_basc122 \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/basc122/1932355398536124106/run_1/... \n", "5174041730092253771 ./data/fmri/basc122/5174041730092253771/run_1/... \n", "10219322676643534800 ./data/fmri/basc122/10219322676643534800/run_1... \n", "10645466564919190227 ./data/fmri/basc122/10645466564919190227/run_1... \n", "14512541342641936232 ./data/fmri/basc122/14512541342641936232/run_1... \n", "\n", " fmri_basc197 \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/basc197/1932355398536124106/run_1/... \n", "5174041730092253771 ./data/fmri/basc197/5174041730092253771/run_1/... \n", "10219322676643534800 ./data/fmri/basc197/10219322676643534800/run_1... \n", "10645466564919190227 ./data/fmri/basc197/10645466564919190227/run_1... \n", "14512541342641936232 ./data/fmri/basc197/14512541342641936232/run_1... \n", "\n", " fmri_craddock_scorr_mean \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/craddock_scorr_mean/19323553985361... \n", "5174041730092253771 ./data/fmri/craddock_scorr_mean/51740417300922... \n", "10219322676643534800 ./data/fmri/craddock_scorr_mean/10219322676643... \n", "10645466564919190227 ./data/fmri/craddock_scorr_mean/10645466564919... \n", "14512541342641936232 ./data/fmri/craddock_scorr_mean/14512541342641... \n", "\n", " fmri_harvard_oxford_cort_prob_2mm \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/harvard_oxford_cort_prob_2mm/19323... \n", "5174041730092253771 ./data/fmri/harvard_oxford_cort_prob_2mm/51740... 
\n", "10219322676643534800 ./data/fmri/harvard_oxford_cort_prob_2mm/10219... \n", "10645466564919190227 ./data/fmri/harvard_oxford_cort_prob_2mm/10645... \n", "14512541342641936232 ./data/fmri/harvard_oxford_cort_prob_2mm/14512... \n", "\n", " fmri_motions \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/motions/1932355398536124106/run_1/... \n", "5174041730092253771 ./data/fmri/motions/5174041730092253771/run_1/... \n", "10219322676643534800 ./data/fmri/motions/10219322676643534800/run_1... \n", "10645466564919190227 ./data/fmri/motions/10645466564919190227/run_1... \n", "14512541342641936232 ./data/fmri/motions/14512541342641936232/run_1... \n", "\n", " fmri_msdl \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/msdl/1932355398536124106/run_1/193... \n", "5174041730092253771 ./data/fmri/msdl/5174041730092253771/run_1/517... \n", "10219322676643534800 ./data/fmri/msdl/10219322676643534800/run_1/10... \n", "10645466564919190227 ./data/fmri/msdl/10645466564919190227/run_1/10... \n", "14512541342641936232 ./data/fmri/msdl/14512541342641936232/run_1/14... \n", "\n", " fmri_power_2011 \\\n", "subject_id \n", "1932355398536124106 ./data/fmri/power_2011/1932355398536124106/run... \n", "5174041730092253771 ./data/fmri/power_2011/5174041730092253771/run... \n", "10219322676643534800 ./data/fmri/power_2011/10219322676643534800/ru... \n", "10645466564919190227 ./data/fmri/power_2011/10645466564919190227/ru... \n", "14512541342641936232 ./data/fmri/power_2011/14512541342641936232/ru... 
\n", "\n", " fmri_select \n", "subject_id \n", "1932355398536124106 1 \n", "5174041730092253771 1 \n", "10219322676643534800 1 \n", "10645466564919190227 1 \n", "14512541342641936232 1 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_train_functional = data_train[[col for col in data_train.columns if col.startswith('fmri')]]\n", "data_train_functional.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unlike the anatomical and participants data, the available fMRI data are filenames of CSV files in which the time-series information is stored. We show in the next section how to read and extract meaningful information from those files." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly to the anatomical data, the column `fmri_select` gives information about the manual quality check." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "subject_id\n", "1932355398536124106 1\n", "5174041730092253771 1\n", "10219322676643534800 1\n", "10645466564919190227 1\n", "14512541342641936232 1\n", "Name: fmri_select, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_train_functional['fmri_select'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Testing data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The testing data can be loaded in the same way:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "from problem import get_test_data\n", "\n", "data_test, labels_test = get_test_data()\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
participants_siteparticipants_sexparticipants_ageanatomy_lh_bankssts_areaanatomy_lh_caudalanteriorcingulate_areaanatomy_lh_caudalmiddlefrontal_areaanatomy_lh_cuneus_areaanatomy_lh_entorhinal_areaanatomy_lh_fusiform_areaanatomy_lh_inferiorparietal_area...anatomy_selectfmri_basc064fmri_basc122fmri_basc197fmri_craddock_scorr_meanfmri_harvard_oxford_cort_prob_2mmfmri_motionsfmri_msdlfmri_power_2011fmri_select
subject_id
518140926839378534831M12.200000985.0723.02851.01844.0495.03526.05658.0...1./data/fmri/basc064/5181409268393785348/run_1/..../data/fmri/basc122/5181409268393785348/run_1/..../data/fmri/basc197/5181409268393785348/run_1/..../data/fmri/craddock_scorr_mean/51814092683937..../data/fmri/harvard_oxford_cort_prob_2mm/51814..../data/fmri/motions/5181409268393785348/run_1/..../data/fmri/msdl/5181409268393785348/run_1/518..../data/fmri/power_2011/5181409268393785348/run...1
87978650493713155509M14.0000001174.0506.01890.01327.0462.03564.04408.0...1./data/fmri/basc064/8797865049371315550/run_1/..../data/fmri/basc122/8797865049371315550/run_1/..../data/fmri/basc197/8797865049371315550/run_1/..../data/fmri/craddock_scorr_mean/87978650493713..../data/fmri/harvard_oxford_cort_prob_2mm/87978..../data/fmri/motions/8797865049371315550/run_1/..../data/fmri/msdl/8797865049371315550/run_1/879..../data/fmri/power_2011/8797865049371315550/run...1
648638587832524514720M14.4250001288.0568.02406.01546.0432.03497.04808.0...1./data/fmri/basc064/6486385878325245147/run_1/..../data/fmri/basc122/6486385878325245147/run_1/..../data/fmri/basc197/6486385878325245147/run_1/..../data/fmri/craddock_scorr_mean/64863858783252..../data/fmri/harvard_oxford_cort_prob_2mm/64863..../data/fmri/motions/6486385878325245147/run_1/..../data/fmri/msdl/6486385878325245147/run_1/648..../data/fmri/power_2011/6486385878325245147/run...1
1712643843539839458833M22.8802001179.0991.02427.01771.0363.03579.06082.0...1./data/fmri/basc064/17126438435398394588/run_1..../data/fmri/basc122/17126438435398394588/run_1..../data/fmri/basc197/17126438435398394588/run_1..../data/fmri/craddock_scorr_mean/17126438435398..../data/fmri/harvard_oxford_cort_prob_2mm/17126..../data/fmri/motions/17126438435398394588/run_1..../data/fmri/msdl/17126438435398394588/run_1/17..../data/fmri/power_2011/17126438435398394588/ru...1
166380495221139992282M8.2520551064.0721.02445.01453.0561.03262.04885.0...2./data/fmri/basc064/16638049522113999228/run_1..../data/fmri/basc122/16638049522113999228/run_1..../data/fmri/basc197/16638049522113999228/run_1..../data/fmri/craddock_scorr_mean/16638049522113..../data/fmri/harvard_oxford_cort_prob_2mm/16638..../data/fmri/motions/16638049522113999228/run_1..../data/fmri/msdl/16638049522113999228/run_1/16..../data/fmri/power_2011/16638049522113999228/ru...0
\n", "

5 rows × 220 columns

\n", "
" ], "text/plain": [ " participants_site participants_sex participants_age \\\n", "subject_id \n", "5181409268393785348 31 M 12.200000 \n", "8797865049371315550 9 M 14.000000 \n", "6486385878325245147 20 M 14.425000 \n", "17126438435398394588 33 M 22.880200 \n", "16638049522113999228 2 M 8.252055 \n", "\n", " anatomy_lh_bankssts_area \\\n", "subject_id \n", "5181409268393785348 985.0 \n", "8797865049371315550 1174.0 \n", "6486385878325245147 1288.0 \n", "17126438435398394588 1179.0 \n", "16638049522113999228 1064.0 \n", "\n", " anatomy_lh_caudalanteriorcingulate_area \\\n", "subject_id \n", "5181409268393785348 723.0 \n", "8797865049371315550 506.0 \n", "6486385878325245147 568.0 \n", "17126438435398394588 991.0 \n", "16638049522113999228 721.0 \n", "\n", " anatomy_lh_caudalmiddlefrontal_area \\\n", "subject_id \n", "5181409268393785348 2851.0 \n", "8797865049371315550 1890.0 \n", "6486385878325245147 2406.0 \n", "17126438435398394588 2427.0 \n", "16638049522113999228 2445.0 \n", "\n", " anatomy_lh_cuneus_area anatomy_lh_entorhinal_area \\\n", "subject_id \n", "5181409268393785348 1844.0 495.0 \n", "8797865049371315550 1327.0 462.0 \n", "6486385878325245147 1546.0 432.0 \n", "17126438435398394588 1771.0 363.0 \n", "16638049522113999228 1453.0 561.0 \n", "\n", " anatomy_lh_fusiform_area \\\n", "subject_id \n", "5181409268393785348 3526.0 \n", "8797865049371315550 3564.0 \n", "6486385878325245147 3497.0 \n", "17126438435398394588 3579.0 \n", "16638049522113999228 3262.0 \n", "\n", " anatomy_lh_inferiorparietal_area ... \\\n", "subject_id ... \n", "5181409268393785348 5658.0 ... \n", "8797865049371315550 4408.0 ... \n", "6486385878325245147 4808.0 ... \n", "17126438435398394588 6082.0 ... \n", "16638049522113999228 4885.0 ... 
\n", "\n", " anatomy_select \\\n", "subject_id \n", "5181409268393785348 1 \n", "8797865049371315550 1 \n", "6486385878325245147 1 \n", "17126438435398394588 1 \n", "16638049522113999228 2 \n", "\n", " fmri_basc064 \\\n", "subject_id \n", "5181409268393785348 ./data/fmri/basc064/5181409268393785348/run_1/... \n", "8797865049371315550 ./data/fmri/basc064/8797865049371315550/run_1/... \n", "6486385878325245147 ./data/fmri/basc064/6486385878325245147/run_1/... \n", "17126438435398394588 ./data/fmri/basc064/17126438435398394588/run_1... \n", "16638049522113999228 ./data/fmri/basc064/16638049522113999228/run_1... \n", "\n", " fmri_basc122 \\\n", "subject_id \n", "5181409268393785348 ./data/fmri/basc122/5181409268393785348/run_1/... \n", "8797865049371315550 ./data/fmri/basc122/8797865049371315550/run_1/... \n", "6486385878325245147 ./data/fmri/basc122/6486385878325245147/run_1/... \n", "17126438435398394588 ./data/fmri/basc122/17126438435398394588/run_1... \n", "16638049522113999228 ./data/fmri/basc122/16638049522113999228/run_1... \n", "\n", " fmri_basc197 \\\n", "subject_id \n", "5181409268393785348 ./data/fmri/basc197/5181409268393785348/run_1/... \n", "8797865049371315550 ./data/fmri/basc197/8797865049371315550/run_1/... \n", "6486385878325245147 ./data/fmri/basc197/6486385878325245147/run_1/... \n", "17126438435398394588 ./data/fmri/basc197/17126438435398394588/run_1... \n", "16638049522113999228 ./data/fmri/basc197/16638049522113999228/run_1... \n", "\n", " fmri_craddock_scorr_mean \\\n", "subject_id \n", "5181409268393785348 ./data/fmri/craddock_scorr_mean/51814092683937... \n", "8797865049371315550 ./data/fmri/craddock_scorr_mean/87978650493713... \n", "6486385878325245147 ./data/fmri/craddock_scorr_mean/64863858783252... \n", "17126438435398394588 ./data/fmri/craddock_scorr_mean/17126438435398... \n", "16638049522113999228 ./data/fmri/craddock_scorr_mean/16638049522113... 
\n", "\n", " fmri_harvard_oxford_cort_prob_2mm \\\n", "subject_id \n", "5181409268393785348 ./data/fmri/harvard_oxford_cort_prob_2mm/51814... \n", "8797865049371315550 ./data/fmri/harvard_oxford_cort_prob_2mm/87978... \n", "6486385878325245147 ./data/fmri/harvard_oxford_cort_prob_2mm/64863... \n", "17126438435398394588 ./data/fmri/harvard_oxford_cort_prob_2mm/17126... \n", "16638049522113999228 ./data/fmri/harvard_oxford_cort_prob_2mm/16638... \n", "\n", " fmri_motions \\\n", "subject_id \n", "5181409268393785348 ./data/fmri/motions/5181409268393785348/run_1/... \n", "8797865049371315550 ./data/fmri/motions/8797865049371315550/run_1/... \n", "6486385878325245147 ./data/fmri/motions/6486385878325245147/run_1/... \n", "17126438435398394588 ./data/fmri/motions/17126438435398394588/run_1... \n", "16638049522113999228 ./data/fmri/motions/16638049522113999228/run_1... \n", "\n", " fmri_msdl \\\n", "subject_id \n", "5181409268393785348 ./data/fmri/msdl/5181409268393785348/run_1/518... \n", "8797865049371315550 ./data/fmri/msdl/8797865049371315550/run_1/879... \n", "6486385878325245147 ./data/fmri/msdl/6486385878325245147/run_1/648... \n", "17126438435398394588 ./data/fmri/msdl/17126438435398394588/run_1/17... \n", "16638049522113999228 ./data/fmri/msdl/16638049522113999228/run_1/16... \n", "\n", " fmri_power_2011 \\\n", "subject_id \n", "5181409268393785348 ./data/fmri/power_2011/5181409268393785348/run... \n", "8797865049371315550 ./data/fmri/power_2011/8797865049371315550/run... \n", "6486385878325245147 ./data/fmri/power_2011/6486385878325245147/run... \n", "17126438435398394588 ./data/fmri/power_2011/17126438435398394588/ru... \n", "16638049522113999228 ./data/fmri/power_2011/16638049522113999228/ru... 
\n", "\n", " fmri_select \n", "subject_id \n", "5181409268393785348 1 \n", "8797865049371315550 1 \n", "6486385878325245147 1 \n", "17126438435398394588 1 \n", "16638049522113999228 0 \n", "\n", "[5 rows x 220 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_test.head()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 0 0 1 0 1 1 0 0]\n" ] } ], "source": [ "print(labels_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Workflow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The framework is evaluated with a cross-validation approach. The metrics used are the area under the ROC curve (ROC-AUC) and the accuracy." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": false }, "outputs": [], "source": [ "from sklearn.pipeline import make_pipeline\n", "from sklearn.model_selection import cross_validate\n", "from problem import get_cv\n", "\n", "def evaluation(X, y):\n", " pipe = make_pipeline(FeatureExtractor(), Classifier())\n", " cv = get_cv(X, y)\n", " results = cross_validate(pipe, X, y, scoring=['roc_auc', 'accuracy'], cv=cv,\n", " verbose=1, return_train_score=True,\n", " n_jobs=1)\n", " \n", " return results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Simple starting kit: using only anatomical features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### FeatureExtractor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The available structural data can be used directly to perform classification. To this end, we will use a feature extractor (i.e. `FeatureExtractor`). 
This extractor will select only the anatomical features, dropping any information related to the fMRI-based features." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "from sklearn.base import BaseEstimator\n", "from sklearn.base import TransformerMixin\n", "\n", "\n", "class FeatureExtractor(BaseEstimator, TransformerMixin):\n", " def fit(self, X_df, y):\n", " return self\n", "\n", " def transform(self, X_df):\n", " # get only the anatomical information\n", " X = X_df[[col for col in X_df.columns if col.startswith('anatomy')]]\n", " return X.drop(columns='anatomy_select')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Classifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We propose to use a logistic regression classifier preceded by a scaler, which will center the data and scale it to unit variance using statistics computed on the training set." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "from sklearn.base import BaseEstimator\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.pipeline import make_pipeline\n", "\n", "\n", "class Classifier(BaseEstimator):\n", " def __init__(self):\n", " self.clf = make_pipeline(StandardScaler(), LogisticRegression())\n", "\n", " def fit(self, X, y):\n", " self.clf.fit(X, y)\n", " return self\n", " \n", " def predict(self, X):\n", " return self.clf.predict(X)\n", "\n", " def predict_proba(self, X):\n", " return self.clf.predict_proba(X)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Testing the submission" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can test our pipeline locally using the `evaluation` function that we defined earlier."
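As a side note before evaluating: the `anatomy_select` quality label introduced earlier can be used to discard noisy subjects, and such sample filtering has to happen where the target is available (e.g. inside `fit`). A minimal sketch on a toy frame — the subject ids and feature values below are invented for illustration, only the column-name convention follows the challenge data:

```python
import pandas as pd

# Toy frame mimicking the challenge data: `anatomy_select` holds the manual
# quality-check label (0 and 3 reject, 1 accept, 2 accept with reserve).
data = pd.DataFrame(
    {'anatomy_lh_bankssts_area': [985.0, 1174.0, 1288.0, 1179.0],
     'anatomy_select': [1, 0, 2, 3]},
    index=['subj_a', 'subj_b', 'subj_c', 'subj_d'])
labels = pd.Series([0, 1, 0, 1], index=data.index)

# Keep only subjects accepted by the quality check (labels 1 and 2);
# apply the same mask to the target so X and y stay aligned.
mask = data['anatomy_select'].isin([1, 2])
data_clean, labels_clean = data[mask], labels[mask]
print(data_clean.shape)  # (2, 2)
```

Whether excluding the rejected subjects actually helps generalization is an empirical question worth testing with the cross-validation loop.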
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training score ROC-AUC: 0.850 +- 0.004\n", "Validation score ROC-AUC: 0.652 +- 0.016 \n", "\n", "Training score accuracy: 0.772 +- 0.007\n", "Validation score accuracy: 0.622 +- 0.016\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 1.1s finished\n" ] } ], "source": [ "results = evaluation(data_train, labels_train)\n", "\n", "print(\"Training score ROC-AUC: {:.3f} +- {:.3f}\".format(np.mean(results['train_roc_auc']),\n", " np.std(results['train_roc_auc'])))\n", "print(\"Validation score ROC-AUC: {:.3f} +- {:.3f} \\n\".format(np.mean(results['test_roc_auc']),\n", " np.std(results['test_roc_auc'])))\n", "\n", "print(\"Training score accuracy: {:.3f} +- {:.3f}\".format(np.mean(results['train_accuracy']),\n", " np.std(results['train_accuracy'])))\n", "print(\"Validation score accuracy: {:.3f} +- {:.3f}\".format(np.mean(results['test_accuracy']),\n", " np.std(results['test_accuracy'])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Going further: using fMRI-derived features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the framework illustrated in the figure above, steps 1 and 2 have already been computed during preprocessing and correspond to the data provided for this challenge. Therefore, our feature extractor will implement step 3, which corresponds to the extraction of functional connectivity features. Step 4 is identical to the pipeline presented for the anatomical data, with a standard scaler followed by a logistic regression classifier.
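The shape bookkeeping of step 3 can be illustrated with plain correlations, a simplified stand-in for nilearn's tangent-space connectivity: a time series with one column per brain region yields a symmetric region-by-region connectivity matrix, and vectorizing its strictly lower triangle gives `n_regions * (n_regions - 1) / 2` features per subject. The sizes below are toy values, not those of the challenge atlases:

```python
import numpy as np

rng = np.random.RandomState(42)
n_timepoints, n_regions = 150, 10  # toy sizes for illustration only

# One subject's extracted time series: one column per brain region.
time_series = rng.randn(n_timepoints, n_regions)

# Step 3 (simplified): functional connectivity as a correlation matrix
# between region signals; columns are the variables, hence rowvar=False.
connectivity = np.corrcoef(time_series, rowvar=False)

# Vectorize the symmetric matrix: keep the strictly lower triangle.
tril_rows, tril_cols = np.tril_indices(n_regions, k=-1)
features = connectivity[tril_rows, tril_cols]
print(features.shape)  # (45,) i.e. 10 * 9 / 2 features for this subject
```

The tangent-space measure used in the notebook replaces the raw correlations with a group-referenced embedding, but the per-subject feature count follows the same combinatorics.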
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We pointed out earlier that the available fMRI features are filenames of the time-series CSV files. In order to limit the amount of data to be downloaded, we provide a fetcher `fetch_fmri_time_series()` to download only the time-series linked to a specific atlas." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on function fetch_fmri_time_series in module download_data:\n", "\n", "fetch_fmri_time_series(atlas='all')\n", " Fetch the time-series extracted from the fMRI data using a specific\n", " atlas.\n", " \n", " Parameters\n", " ----------\n", " atlas : string, default='all'\n", " The name of the atlas used during the extraction. The possibilities\n", " are:\n", " \n", " * `'basc064'`, `'basc122'`, `'basc197'`: BASC parcellations with 64,\n", " 122, and 197 regions [1]_;\n", " * `'craddock_scorr_mean'`: Ncuts parcellations [2]_;\n", " * `'harvard_oxford_cort_prob_2mm'`: Harvard-Oxford anatomical\n", " parcellations;\n", " * `'msdl'`: MSDL functional atlas [3]_;\n", " * `'power_2011'`: Power atlas [4]_.\n", " \n", " Returns\n", " -------\n", " None\n", " \n", " References\n", " ----------\n", " .. [1] Bellec, Pierre, et al. \"Multi-level bootstrap analysis of stable\n", " clusters in resting-state fMRI.\" Neuroimage 51.3 (2010): 1126-1139.\n", " \n", " .. [2] Craddock, R. Cameron, et al. \"A whole brain fMRI atlas generated\n", " via spatially constrained spectral clustering.\" Human brain mapping\n", " 33.8 (2012): 1914-1928.\n", " \n", " .. [3] Varoquaux, Gaël, et al. \"Multi-subject dictionary learning to\n", " segment an atlas of brain spontaneous activity.\" Biennial International\n", " Conference on Information Processing in Medical Imaging. Springer,\n", " Berlin, Heidelberg, 2011.\n", " \n", " .. [4] Power, Jonathan D., et al. 
\"Functional network organization of the\n", " human brain.\" Neuron 72.4 (2011): 665-678.\n", "\n" ] } ], "source": [ "from download_data import fetch_fmri_time_series\n", "help(fetch_fmri_time_series)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading completed ...\n" ] } ], "source": [ "fetch_fmri_time_series(atlas='msdl')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can download all atlases at once by passing `atlas='all'`. It is also possible to execute the file as a script: `python download_data.py all`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the `FeatureExtractor` below, we first select only the filenames related to the MSDL time-series data. We create a `FunctionTransformer` which will read the time-series from the CSV files on the fly and store them in a numpy array.\n", "Those series will be used to compute the functional connectivity matrices which will later be fed to the classifier." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/lemaitre/miniconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. 
In future, it will be treated as `np.float64 == np.dtype(float).type`.\n", " from ._conv import register_converters as _register_converters\n" ] } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "from sklearn.base import BaseEstimator, TransformerMixin\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.preprocessing import FunctionTransformer\n", "\n", "from nilearn.connectome import ConnectivityMeasure\n", "\n", "\n", "def _load_fmri(fmri_filenames):\n", " \"\"\"Load time-series extracted from the fMRI using a specific atlas.\"\"\"\n", " return np.array([pd.read_csv(subject_filename,\n", " header=None).values\n", " for subject_filename in fmri_filenames])\n", "\n", "\n", "class FeatureExtractor(BaseEstimator, TransformerMixin):\n", " def __init__(self):\n", " # make a transformer which will load the time series and compute the\n", " # connectome matrix\n", " self.transformer_fmri = make_pipeline(\n", " FunctionTransformer(func=_load_fmri, validate=False),\n", " ConnectivityMeasure(kind='tangent', vectorize=True))\n", " \n", " def fit(self, X_df, y):\n", " # get only the time series for the MSDL atlas\n", " fmri_filenames = X_df['fmri_msdl']\n", " self.transformer_fmri.fit(fmri_filenames, y)\n", " return self\n", "\n", " def transform(self, X_df):\n", " fmri_filenames = X_df['fmri_msdl']\n", " return self.transformer_fmri.transform(fmri_filenames)\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "from sklearn.base import BaseEstimator\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.pipeline import make_pipeline\n", "\n", "\n", "class Classifier(BaseEstimator):\n", " def __init__(self):\n", " self.clf = make_pipeline(StandardScaler(), LogisticRegression(C=1.))\n", "\n", " def fit(self, X, y):\n", " self.clf.fit(X, y)\n", " return self\n", " \n", " def predict(self, X):\n", " return 
self.clf.predict(X)\n", "\n", " def predict_proba(self, X):\n", " return self.clf.predict_proba(X)\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training score ROC-AUC: 1.000 +- 0.000\n", "Validation score ROC-AUC: 0.612 +- 0.019 \n", "\n", "Training score accuracy: 1.000 +- 0.000\n", "Validation score accuracy: 0.587 +- 0.021\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 4.1min finished\n" ] } ], "source": [ "results = evaluation(data_train, labels_train)\n", "\n", "print(\"Training score ROC-AUC: {:.3f} +- {:.3f}\".format(np.mean(results['train_roc_auc']),\n", " np.std(results['train_roc_auc'])))\n", "print(\"Validation score ROC-AUC: {:.3f} +- {:.3f} \\n\".format(np.mean(results['test_roc_auc']),\n", " np.std(results['test_roc_auc'])))\n", "\n", "print(\"Training score accuracy: {:.3f} +- {:.3f}\".format(np.mean(results['train_accuracy']),\n", " np.std(results['train_accuracy'])))\n", "print(\"Validation score accuracy: {:.3f} +- {:.3f}\".format(np.mean(results['test_accuracy']),\n", " np.std(results['test_accuracy'])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### More elaborate pipeline: combining anatomy and fMRI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This workflow is a combination of the two previous workflows. The `FeatureExtractor` extracts both structural and functional connectivity information and concatenates them. Note that each column name will contain either **connectome** or **anatomy** depending on the type of feature. This naming will be used to train different classifiers later on."
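Because every column name starts with either `connectome_` or `anatomy_`, a downstream classifier can route each feature group with a simple prefix test, the same idiom used throughout this notebook. A toy illustration — the column names follow the stated convention, but the values are invented:

```python
import pandas as pd

# Miniature version of the concatenated feature matrix produced by the
# combined FeatureExtractor (invented values, prefix convention as in the text).
X = pd.DataFrame({'connectome_0': [0.1, -0.2],
                  'connectome_1': [0.3, 0.0],
                  'anatomy_lh_bankssts_area': [985.0, 1174.0]})

# Route each feature group to its own block by column-name prefix.
X_connectome = X[[col for col in X.columns if col.startswith('connectome')]]
X_anatomy = X[[col for col in X.columns if col.startswith('anatomy')]]
print(X_connectome.shape, X_anatomy.shape)  # (2, 2) (2, 1)
```

Selecting by prefix keeps the routing robust to the number of connectome features, which depends on the atlas chosen.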
] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "from sklearn.base import BaseEstimator, TransformerMixin\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.preprocessing import FunctionTransformer\n", "\n", "from nilearn.connectome import ConnectivityMeasure\n", "\n", "\n", "def _load_fmri(fmri_filenames):\n", " \"\"\"Load time-series extracted from the fMRI using a specific atlas.\"\"\"\n", " return np.array([pd.read_csv(subject_filename,\n", " header=None).values\n", " for subject_filename in fmri_filenames])\n", "\n", "\n", "class FeatureExtractor(BaseEstimator, TransformerMixin):\n", " def __init__(self):\n", " # make a transformer which will load the time series and compute the\n", " # connectome matrix\n", " self.transformer_fmri = make_pipeline(\n", " FunctionTransformer(func=_load_fmri, validate=False),\n", " ConnectivityMeasure(kind='tangent', vectorize=True))\n", " \n", " def fit(self, X_df, y):\n", " fmri_filenames = X_df['fmri_msdl']\n", " self.transformer_fmri.fit(fmri_filenames, y)\n", " return self\n", "\n", " def transform(self, X_df):\n", " fmri_filenames = X_df['fmri_msdl']\n", " X_connectome = self.transformer_fmri.transform(fmri_filenames)\n", " X_connectome = pd.DataFrame(X_connectome, index=X_df.index)\n", " X_connectome.columns = ['connectome_{}'.format(i)\n", " for i in range(X_connectome.columns.size)]\n", " # get the anatomical information\n", " X_anatomy = X_df[[col for col in X_df.columns\n", " if col.startswith('anatomy')]]\n", " X_anatomy = X_anatomy.drop(columns='anatomy_select')\n", " # concatenate both matrices\n", " return pd.concat([X_connectome, X_anatomy], axis=1)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will create a classifier (i.e. a random forest classifier) which will use both connectome and anatomical features."
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "from sklearn.base import BaseEstimator\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "\n", "class Classifier(BaseEstimator):\n", " def __init__(self):\n", " self.clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)\n", "\n", " def fit(self, X, y):\n", " self.clf.fit(X, y)\n", " return self\n", " \n", " def predict(self, X):\n", " return self.clf.predict(X)\n", "\n", " def predict_proba(self, X):\n", " return self.clf.predict_proba(X)\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training score ROC-AUC: 1.000 +- 0.000\n", "Validation score ROC-AUC: 0.655 +- 0.028 \n", "\n", "Training score accuracy: 1.000 +- 0.000\n", "Validation score accuracy: 0.613 +- 0.033\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 4.1min finished\n" ] } ], "source": [ "results = evaluation(data_train, labels_train)\n", "\n", "print(\"Training score ROC-AUC: {:.3f} +- {:.3f}\".format(np.mean(results['train_roc_auc']),\n", " np.std(results['train_roc_auc'])))\n", "print(\"Validation score ROC-AUC: {:.3f} +- {:.3f} \\n\".format(np.mean(results['test_roc_auc']),\n", " np.std(results['test_roc_auc'])))\n", "\n", "print(\"Training score accuracy: {:.3f} +- {:.3f}\".format(np.mean(results['train_accuracy']),\n", " np.std(results['train_accuracy'])))\n", "print(\"Validation score accuracy: {:.3f} +- {:.3f}\".format(np.mean(results['test_accuracy']),\n", " np.std(results['test_accuracy'])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can propose a\n", "more complex classifier than the previous one. We will train two classifiers independently on the sMRI-derived and the fMRI-derived features. Then, a meta-classifier will combine both sources of information. 
We left out some data to be able to train the meta-classifier." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "from sklearn.base import BaseEstimator\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.model_selection import train_test_split\n", "\n", "\n", "class Classifier(BaseEstimator):\n", " def __init__(self):\n", " self.clf_connectome = make_pipeline(StandardScaler(),\n", " LogisticRegression(C=1.))\n", " self.clf_anatomy = make_pipeline(StandardScaler(),\n", " LogisticRegression(C=1.))\n", " self.meta_clf = LogisticRegression(C=1.)\n", "\n", " def fit(self, X, y):\n", " X_anatomy = X[[col for col in X.columns if col.startswith('anatomy')]]\n", " X_connectome = X[[col for col in X.columns\n", " if col.startswith('connectome')]]\n", " train_idx, validation_idx = train_test_split(range(y.size),\n", " test_size=0.33,\n", " shuffle=True,\n", " random_state=42)\n", " X_anatomy_train = X_anatomy.iloc[train_idx]\n", " X_anatomy_validation = X_anatomy.iloc[validation_idx]\n", " X_connectome_train = X_connectome.iloc[train_idx]\n", " X_connectome_validation = X_connectome.iloc[validation_idx]\n", " y_train = y[train_idx]\n", " y_validation = y[validation_idx]\n", "\n", " self.clf_connectome.fit(X_connectome_train, y_train)\n", " self.clf_anatomy.fit(X_anatomy_train, y_train)\n", "\n", " y_connectome_pred = self.clf_connectome.predict_proba(\n", " X_connectome_validation)\n", " y_anatomy_pred = self.clf_anatomy.predict_proba(\n", " X_anatomy_validation)\n", "\n", " self.meta_clf.fit(\n", " np.concatenate([y_connectome_pred, y_anatomy_pred], axis=1),\n", " y_validation)\n", " return self\n", " \n", " def predict(self, X):\n", " X_anatomy = X[[col for col in X.columns if col.startswith('anatomy')]]\n", " X_connectome = X[[col for col in X.columns\n", " if 
col.startswith('connectome')]]\n", "\n", " y_anatomy_pred = self.clf_anatomy.predict_proba(X_anatomy)\n", " y_connectome_pred = self.clf_connectome.predict_proba(X_connectome)\n", "\n", " return self.meta_clf.predict(\n", " np.concatenate([y_connectome_pred, y_anatomy_pred], axis=1))\n", "\n", " def predict_proba(self, X):\n", " X_anatomy = X[[col for col in X.columns if col.startswith('anatomy')]]\n", " X_connectome = X[[col for col in X.columns\n", " if col.startswith('connectome')]]\n", "\n", " y_anatomy_pred = self.clf_anatomy.predict_proba(X_anatomy)\n", " y_connectome_pred = self.clf_connectome.predict_proba(X_connectome)\n", "\n", " return self.meta_clf.predict_proba(\n", " np.concatenate([y_connectome_pred, y_anatomy_pred], axis=1))\n" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training score ROC-AUC: 0.915 +- 0.010\n", "Validation score ROC-AUC: 0.649 +- 0.023 \n", "\n", "Training score accuracy: 0.854 +- 0.017\n", "Validation score accuracy: 0.606 +- 0.022\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 4.2min finished\n" ] } ], "source": [ "results = evaluation(data_train, labels_train)\n", "\n", "print(\"Training score ROC-AUC: {:.3f} +- {:.3f}\".format(np.mean(results['train_roc_auc']),\n", " np.std(results['train_roc_auc'])))\n", "print(\"Validation score ROC-AUC: {:.3f} +- {:.3f} \\n\".format(np.mean(results['test_roc_auc']),\n", " np.std(results['test_roc_auc'])))\n", "\n", "print(\"Training score accuracy: {:.3f} +- {:.3f}\".format(np.mean(results['train_accuracy']),\n", " np.std(results['train_accuracy'])))\n", "print(\"Validation score accuracy: {:.3f} +- {:.3f}\".format(np.mean(results['test_accuracy']),\n", " np.std(results['test_accuracy'])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Submitting to the online challenge: [ramp.studio](http://ramp.studio)\n", "\n", 
"Once you have found a good model, you can submit it to [ramp.studio](http://www.ramp.studio) to enter the online challenge. First, if it is your first time using the RAMP platform, [sign up](http://www.ramp.studio/sign_up), otherwise [log in](http://www.ramp.studio/login). Then sign up for the event [autism](http://www.ramp.studio/events/autism). Both signups are controlled by RAMP administrators, so there **can be a delay between asking for signup and being able to submit**.\n", "\n", "Once your signup request is accepted, you can go to your [sandbox](http://www.ramp.studio/events/autism/sandbox) and copy-paste your code there. You can also create a new starting kit in the `submissions` folder containing both `feature_extractor.py` and `classifier.py` and upload those files directly. You can check the starting kit ([`feature_extractor.py`](/edit/submissions/starting_kit/feature_extractor.py) and [`classifier.py`](/edit/submissions/starting_kit/classifier.py)) for an example. The submission is trained and tested on our backend in a similar way to how `ramp_test_submission` does it locally. While your submission is waiting in the queue and being trained, you can find it in the \"New submissions (pending training)\" table in [my submissions](http://www.ramp.studio/events/autism/my_submissions). Once it is trained, you will receive an email, and your submission will show up on the [public leaderboard](http://www.ramp.studio/events/autism/leaderboard). \n", "If there is an error (despite having tested your submission locally with `ramp_test_submission`), it will show up in the \"Failed submissions\" table in [my submissions](http://www.ramp.studio/events/autism/my_submissions). 
You can click on the error to see part of the trace.\n", "\n", "After submission, do not forget to give credit to the previous submissions you reused or integrated into your submission.\n", "\n", "The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.\n", "\n", "The usual way to work with RAMP is to explore solutions, add feature transformations, select models, perhaps do some AutoML/hyperopt, etc., _locally_, and to check them with `ramp_test_submission`. The script prints the mean cross-validation scores.\n", "\n", "The official score in this RAMP (the first score column after \"historical contributivity\" on the [leaderboard](http://www.ramp.studio/events/autism/leaderboard)) is the AUC. When the score is good enough, you can submit it on the RAMP site." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[38;5;178m\u001b[1mTesting Autism Spectrum Disorder classification\u001b[0m\n", "\u001b[38;5;178m\u001b[1mReading train and test files from ./data ...\u001b[0m\n", "\u001b[38;5;178m\u001b[1mReading cv ...\u001b[0m\n", "\u001b[38;5;178m\u001b[1mTraining ./submissions/starting_kit_anatomy ...\u001b[0m\n", "\u001b[38;5;178m\u001b[1mCV fold 0\u001b[0m\n", "\t\u001b[38;5;178m\u001b[1mscore auc acc\u001b[0m\n", "\t\u001b[38;5;10m\u001b[1mtrain\u001b[0m \u001b[38;5;10m\u001b[1m0.847\u001b[0m \u001b[38;5;150m0.767\u001b[0m\n", "\t\u001b[38;5;12m\u001b[1mvalid\u001b[0m \u001b[38;5;12m\u001b[1m0.647\u001b[0m \u001b[38;5;105m0.611\u001b[0m\n", "\t\u001b[38;5;1m\u001b[1mtest\u001b[0m \u001b[38;5;1m\u001b[1m0.765\u001b[0m \u001b[38;5;218m0.696\u001b[0m\n", "\u001b[38;5;178m\u001b[1mCV fold 1\u001b[0m\n", "\t\u001b[38;5;178m\u001b[1mscore auc acc\u001b[0m\n", "\t\u001b[38;5;10m\u001b[1mtrain\u001b[0m \u001b[38;5;10m\u001b[1m0.842\u001b[0m \u001b[38;5;150m0.766\u001b[0m\n", "\t\u001b[38;5;12m\u001b[1mvalid\u001b[0m 
\u001b[38;5;12m\u001b[1m0.662\u001b[0m \u001b[38;5;105m0.628\u001b[0m\n", "\t\u001b[38;5;1m\u001b[1mtest\u001b[0m \u001b[38;5;1m\u001b[1m0.659\u001b[0m \u001b[38;5;218m0.478\u001b[0m\n", "\u001b[38;5;178m\u001b[1mCV fold 2\u001b[0m\n", "\t\u001b[38;5;178m\u001b[1mscore auc acc\u001b[0m\n", "\t\u001b[38;5;10m\u001b[1mtrain\u001b[0m \u001b[38;5;10m\u001b[1m0.854\u001b[0m \u001b[38;5;150m0.786\u001b[0m\n", "\t\u001b[38;5;12m\u001b[1mvalid\u001b[0m \u001b[38;5;12m\u001b[1m0.645\u001b[0m \u001b[38;5;105m0.615\u001b[0m\n", "\t\u001b[38;5;1m\u001b[1mtest\u001b[0m \u001b[38;5;1m\u001b[1m0.720\u001b[0m \u001b[38;5;218m0.609\u001b[0m\n", "\u001b[38;5;178m\u001b[1mCV fold 3\u001b[0m\n", "\t\u001b[38;5;178m\u001b[1mscore auc acc\u001b[0m\n", "\t\u001b[38;5;10m\u001b[1mtrain\u001b[0m \u001b[38;5;10m\u001b[1m0.849\u001b[0m \u001b[38;5;150m0.769\u001b[0m\n", "\t\u001b[38;5;12m\u001b[1mvalid\u001b[0m \u001b[38;5;12m\u001b[1m0.645\u001b[0m \u001b[38;5;105m0.619\u001b[0m\n", "\t\u001b[38;5;1m\u001b[1mtest\u001b[0m \u001b[38;5;1m\u001b[1m0.758\u001b[0m \u001b[38;5;218m0.565\u001b[0m\n", "\u001b[38;5;178m\u001b[1mCV fold 4\u001b[0m\n", "\t\u001b[38;5;178m\u001b[1mscore auc acc\u001b[0m\n", "\t\u001b[38;5;10m\u001b[1mtrain\u001b[0m \u001b[38;5;10m\u001b[1m0.852\u001b[0m \u001b[38;5;150m0.770\u001b[0m\n", "\t\u001b[38;5;12m\u001b[1mvalid\u001b[0m \u001b[38;5;12m\u001b[1m0.650\u001b[0m \u001b[38;5;105m0.606\u001b[0m\n", "\t\u001b[38;5;1m\u001b[1mtest\u001b[0m \u001b[38;5;1m\u001b[1m0.735\u001b[0m \u001b[38;5;218m0.652\u001b[0m\n", "\u001b[38;5;178m\u001b[1mCV fold 5\u001b[0m\n", "\t\u001b[38;5;178m\u001b[1mscore auc acc\u001b[0m\n", "\t\u001b[38;5;10m\u001b[1mtrain\u001b[0m \u001b[38;5;10m\u001b[1m0.847\u001b[0m \u001b[38;5;150m0.776\u001b[0m\n", "\t\u001b[38;5;12m\u001b[1mvalid\u001b[0m \u001b[38;5;12m\u001b[1m0.680\u001b[0m \u001b[38;5;105m0.642\u001b[0m\n", "\t\u001b[38;5;1m\u001b[1mtest\u001b[0m \u001b[38;5;1m\u001b[1m0.598\u001b[0m \u001b[38;5;218m0.565\u001b[0m\n", 
"\u001b[38;5;178m\u001b[1mCV fold 6\u001b[0m\n", "\t\u001b[38;5;178m\u001b[1mscore auc acc\u001b[0m\n", "\t\u001b[38;5;10m\u001b[1mtrain\u001b[0m \u001b[38;5;10m\u001b[1m0.852\u001b[0m \u001b[38;5;150m0.764\u001b[0m\n", "\t\u001b[38;5;12m\u001b[1mvalid\u001b[0m \u001b[38;5;12m\u001b[1m0.624\u001b[0m \u001b[38;5;105m0.602\u001b[0m\n", "\t\u001b[38;5;1m\u001b[1mtest\u001b[0m \u001b[38;5;1m\u001b[1m0.773\u001b[0m \u001b[38;5;218m0.652\u001b[0m\n", "\u001b[38;5;178m\u001b[1mCV fold 7\u001b[0m\n", "\t\u001b[38;5;178m\u001b[1mscore auc acc\u001b[0m\n", "\t\u001b[38;5;10m\u001b[1mtrain\u001b[0m \u001b[38;5;10m\u001b[1m0.854\u001b[0m \u001b[38;5;150m0.779\u001b[0m\n", "\t\u001b[38;5;12m\u001b[1mvalid\u001b[0m \u001b[38;5;12m\u001b[1m0.662\u001b[0m \u001b[38;5;105m0.650\u001b[0m\n", "\t\u001b[38;5;1m\u001b[1mtest\u001b[0m \u001b[38;5;1m\u001b[1m0.644\u001b[0m \u001b[38;5;218m0.478\u001b[0m\n", "\u001b[38;5;178m\u001b[1m----------------------------\u001b[0m\n", "\u001b[38;5;178m\u001b[1mMean CV scores\u001b[0m\n", "\u001b[38;5;178m\u001b[1m----------------------------\u001b[0m\n", "\t\u001b[38;5;178m\u001b[1mscore auc acc\u001b[0m\n", "\t\u001b[38;5;10m\u001b[1mtrain\u001b[0m \u001b[38;5;10m\u001b[1m0.85\u001b[0m \u001b[38;5;150m\u001b[38;5;150m±\u001b[0m\u001b[0m \u001b[38;5;150m0.0039\u001b[0m \u001b[38;5;150m0.772\u001b[0m \u001b[38;5;150m\u001b[38;5;150m±\u001b[0m\u001b[0m \u001b[38;5;150m0.0071\u001b[0m\n", "\t\u001b[38;5;12m\u001b[1mvalid\u001b[0m \u001b[38;5;12m\u001b[1m0.652\u001b[0m \u001b[38;5;105m\u001b[38;5;105m±\u001b[0m\u001b[0m \u001b[38;5;105m0.0155\u001b[0m \u001b[38;5;105m0.622\u001b[0m \u001b[38;5;105m\u001b[38;5;105m±\u001b[0m\u001b[0m \u001b[38;5;105m0.0161\u001b[0m\n", "\t\u001b[38;5;1m\u001b[1mtest\u001b[0m \u001b[38;5;1m\u001b[1m0.706\u001b[0m \u001b[38;5;218m\u001b[38;5;218m±\u001b[0m\u001b[0m \u001b[38;5;218m0.0605\u001b[0m \u001b[38;5;218m0.587\u001b[0m \u001b[38;5;218m\u001b[38;5;218m±\u001b[0m\u001b[0m \u001b[38;5;218m0.0753\u001b[0m\n", 
"\u001b[38;5;178m\u001b[1m----------------------------\u001b[0m\n", "\u001b[38;5;178m\u001b[1mBagged scores\u001b[0m\n", "\u001b[38;5;178m\u001b[1m----------------------------\u001b[0m\n", "\t\u001b[38;5;178m\u001b[1mscore auc\u001b[0m\n", "\t\u001b[38;5;12m\u001b[1mvalid\u001b[0m \u001b[38;5;12m\u001b[1m0.651\u001b[0m\n", "\t\u001b[38;5;1m\u001b[1mtest\u001b[0m \u001b[38;5;1m\u001b[1m0.720\u001b[0m\n" ] } ], "source": [ "!ramp_test_submission --submission starting_kit_anatomy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More information\n", "\n", "You can find more information in the [README](https://github.com/paris-saclay-cds/ramp-workflow/blob/master/README.md) of the [ramp-workflow library](https://github.com/paris-saclay-cds/ramp-workflow)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Questions\n", "\n", "Questions related to the starting kit should be asked on the [issue tracker](https://github.com/ramp-kits/autism/issues). The RAMP site administrators can be pinged at the [RAMP slack team](https://ramp-studio.slack.com/shared_invite/MTg1NDUxNDAyNDk2LTE0OTUzOTcwMDQtMThlOWY1NWU0Mg) in the #autism channel." ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 2 }