An experiment in decentralized collaborative neuroimaging research
\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"## What is *StudyForrest*?\n",
"\n",
"The *StudyForrest* project centers around the use of the movie Forrest Gump, which provides complex sensory input that is both reproducible and is also richly laden with real-life-like content and contexts. While study participants listened to and watched the movie, we collected a rich dataset that encompasses many hours of fMRI scans, structural brain scans, eye-tracking data, and extensive annotations of the movie.\n",
"\n",
"Apart from these data acquisitions, several processed derivatives have also been generated and shared. These include functional contrasts, cortical and subcortical brain parcellations, retinotopic maps, and more.\n",
"\n",
"The *StudyForrest* and all its derivatives are structured as a nested DataLad dataset that is accessible publicly and provides fine-grained data access down to the level of individual files.\n",
"\n",
"\n",
"\n",
"## What is *DataLad*?\n",
"\n",
"DataLad is a free and open-source distributed data management system, available for all major operating systems, that was developed to aid with everything related to the evolution of digital objects. As explained in the DataLad Handbook:\n",
"\n",
"> *It is not only keeping track of code, it is not only keeping track of data, it is not only making sharing, retrieving and linking data (and metadata) easy, but it assists with the combination of all things necessary in the digital workflow of data and science.*\n",
"\n",
"\n",
"\n",
"\n",
"## Useful *StudyForrest* and *DataLad* links\n",
"\n",
"- [StudyForrest website](https://www.studyforrest.org/)\n",
"- [StudyForrest data on GitHub](https://github.com/psychoinformatics-de/studyforrest-data)\n",
"- [datalad.org](https://www.datalad.org/)\n",
"- [DataLad Handbook](http://handbook.datalad.org/en/latest/index.html)\n",
"- ['Intro to DataLad' tutorial by Adina Wagner](https://www.youtube.com/watch?v=QsAqnP7TwyY&list=PLVso6Qs8PLCiMMBXewYQjsAQLVtzAdJJX&index=4)\n",
"\n",
"\n",
"\n",
"## What is this notebook?\n",
"\n",
"By following this interactive [Jupyter Notebook](https://jupyter-notebook.readthedocs.io/en/stable/), you are guided through the step-by-step process of accessing, downloading, and processing the *StudyForrest* data and its derivatives. This is made seamless with the use of DataLad and various other open source software packages, which are already installed in this Binder instance and ready to use.\n",
"\n",
"#### Instructions:\n",
"- All code/Markdown cells can be run using the keyboard shortcut: `Shift + Enter` or `Shift + Return`\n",
"- It is important to run all of the cells in order from top to bottom. This is important because Jupyter Notebooks keep a global context, i.e. it remembers the results of the code that was recently executed, and takes that into account when running new code cells. For example, if you run a code cell to list the contents of a directory, then run a next cell to navigate to a different directory, then rerun the first cell, it will list the contents of the directory that you navigated to, not the one where you were originally located.\n",
"- If you get stuck or receive errors due to running the cells in a different order, you can restart the kernel in the `Kernel` menu option above.\n",
"\n",
"#### Content:\n",
"\n",
"1. [DataLad setup and introduction](#section1)\n",
"2. [StudyForrest dataset installation and structure](#section2)\n",
"3. [Retrieving and removing data](#section3)\n",
"4. [Visualizing structural data](#section4)\n",
"5. [Functional data quality plots](#section5)\n",
"6. [Your move :)](#section6)\n",
"\n",
"\n",
"\\**Note: this notebook is not intended to be an in-depth tutorial on the use of DataLad. For a more detailed walk-through, please use the DataLad Handbook and introductory tutorial linked above. Some content of the current notebook was imported and adapted from the tutorial.*\n",
"\n",
"\n",
"---\n",
"---\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "fa60088b",
"metadata": {},
"source": [
"
Exploring the StudyForrest data
\n",
"\n",
"---\n",
"\n",
"\n",
"\n",
"## 1. DataLad setup and introduction\n",
"\n",
"You can find more details about how to install DataLad and its dependencies on all operating systems in the DataLad handbook, in the [installation section](http://handbook.datalad.org/en/latest/intro/installation.html). It also details how to install DataLad on shared machines that you don’t have administrative privileges (sudo rights) on, such as high performance compute clusters.\n",
"\n",
"As a first step in this walk-through, let's ensure that we install the most recent version of DataLad using the Python package manager [pip](pip.pypa.io/):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "31f755b2",
"metadata": {},
"outputs": [],
"source": [
"!pip install -U datalad"
]
},
{
"cell_type": "markdown",
"id": "29045749",
"metadata": {},
"source": [
"Once installed, DataLad can be used as a command line tool or via its Python API. In the command line, an instruction always starts with the general `datalad` command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1555cd2f",
"metadata": {
"hide-output": false,
"scrolled": true
},
"outputs": [],
"source": [
"!datalad"
]
},
{
"cell_type": "markdown",
"id": "c50b1225",
"metadata": {},
"source": [
"To find out more about the available commands, type `datalad --help`. If you already have DataLad installed, **make sure that it is a recent version, at least 0.12 or higher**:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1ccadff",
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"!datalad --version"
]
},
{
"cell_type": "markdown",
"id": "c7398214",
"metadata": {},
"source": [
"DataLad has further functionality that is particularly useful for getting local access to the dataset, its subdatasets, and individual files. These include:\n",
"\n",
"```\n",
"datalad install #install a top-level DataLad dataset with the option of recursively installing subdatasets\n",
"datalad clone #install a single DataLad dataset\n",
"datalad get #download a local copy of a file or files of an installed DataLad dataset\n",
"datalad drop #remove a local copy of a file or files of an installed DataLad dataset\n",
"```\n",
"\n",
"The use of these commands are illustrated below with the StudyForrest dataset.\n"
]
},
{
"cell_type": "markdown",
"id": "4d562ab2",
"metadata": {},
"source": [
"---\n",
"\n",
"\n",
"\n",
"## 2. StudyForrest dataset cloning and structure\n",
"\n",
"Getting local access to the dataset is as simple as running the `clone` command and pointing it to the location of the DataLad dataset (in this case: https://github.com/psychoinformatics-de/studyforrest-data):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "16de093a",
"metadata": {},
"outputs": [],
"source": [
"!datalad clone https://github.com/psychoinformatics-de/studyforrest-data"
]
},
{
"cell_type": "markdown",
"id": "f5833d43",
"metadata": {},
"source": [
"Once the dataset is cloned, it exists as a light-weight directory on your local machine (in this case: `/studyforrest-data`). At this point, it contains only small metadata and information on the identity of the files in the dataset, but not actual content of the (sometimes large) data files. This fact can be verified by checking the disk usage in the relevant local directory:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e7d5ade",
"metadata": {},
"outputs": [],
"source": [
"cd studyforrest-data"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4a615b2a",
"metadata": {},
"outputs": [],
"source": [
"!du -sh "
]
},
{
"cell_type": "markdown",
"id": "5c32135f",
"metadata": {},
"source": [
"As you can see, the dataset size is tiny, and definitely not the size one would expect for a multi-modality neuroimaging dataset. However, the dataset is a complete representation of all data files. To explore this, the structure of the installed StudyForrest dataset can be viewed with the `tree` command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5682b307",
"metadata": {},
"outputs": [],
"source": [
"!tree -d"
]
},
{
"cell_type": "markdown",
"id": "3d6fa751",
"metadata": {},
"source": [
"There are subdirectories for the orignal data, derivative data, artifacts, stimuli, and code, all of which add to the rich StudyForrest dataset. In turn, the content of these subdirectories are structured as DataLad datasets themselves. This demonstrates the concept of dataset nesting, with the top-level (or super) dataset being the StudyForrest dataset that we just cloned, and the subdatasets being all the subdirectories two levels down from the superdataset. These can be identified using the `subdatasets` command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "78311026",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"!datalad subdatasets"
]
},
{
"cell_type": "markdown",
"id": "7d552edb",
"metadata": {},
"source": [
"We see something unexpected, however, when navigating down to and listing the content of specific subdatasets. For example, the `original/phase2` subdataset:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "000f5908",
"metadata": {},
"outputs": [],
"source": [
"cd original/phase2"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6dc9d196",
"metadata": {},
"outputs": [],
"source": [
"ls"
]
},
{
"cell_type": "markdown",
"id": "db485dcb",
"metadata": {},
"source": [
"Note that the `ls` command does not yield an output, implying that there are no files or folders in the `phase2` directory. This is because our initial `datalad clone https://github.com/psychoinformatics-de/studyforrest-data` command only cloned the top-level dataset and referenced its subdatasets. It did not clone the subdatasets. To clone the content of sublevel datasets *as well*, an option is to use the `install` command with an added recursive flag, as in:\n",
"\n",
"```\n",
"datalad install -r https://github.com/psychoinformatics-de/studyforrest-data\n",
"```\n",
"\n",
"More generally, the `get` command can be used (recursively or on a single level) to install subdatasets, and adding the `-n` flag prevents all data from being retrieved, as in:\n",
"\n",
"```\n",
"datalad get -r -n https://github.com/psychoinformatics-de/studyforrest-data\n",
"```\n",
"\n",
"Here, we use `get` to clone the subdataset without retrieving data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aea2cde3",
"metadata": {},
"outputs": [],
"source": [
"!datalad get -n ."
]
},
{
"cell_type": "markdown",
"id": "51f5265a",
"metadata": {},
"source": [
"Now, after successful cloning of the sublevel dataset, running the `ls` command should show the dataset content:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ccd990be",
"metadata": {},
"outputs": [],
"source": [
"ls"
]
},
{
"cell_type": "markdown",
"id": "914baf5e",
"metadata": {},
"source": [
"---\n",
"\n",
"\n",
"\n",
"## 3. Retrieving and dropping data\n",
"\n",
"So, we have cloned a subdataset and we can inspect its contents, and now we want to work with the actual data in the files. Let's try and access the `participants.tsv` file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e77ef242",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"from pathlib import Path\n",
"import pandas as pd\n",
"\n",
"tsv_file = Path('participants.tsv')\n",
"try:\n",
" tsv_abs_path = tsv_file.resolve(strict=True)\n",
"except FileNotFoundError:\n",
" print('File does not exist')\n",
"else:\n",
" print('File exists. Printing file contents.')\n",
" participant_info = pd.read_csv(tsv_file, sep='\\t')\n",
" print(participant_info)"
]
},
{
"cell_type": "markdown",
"id": "90fa1ebb",
"metadata": {},
"source": [
"Easy, right? Now let's try with a NIfTI file, for example the functional time series located at `sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "70eae076",
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"import nibabel as nib\n",
"\n",
"nii_file = Path('sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz')\n",
"try:\n",
" nii_abs_path = nii_file.resolve(strict=True)\n",
"except FileNotFoundError:\n",
" print('File does not exist')\n",
"else:\n",
" print('File exists. Printing file header content.')\n",
" img = nib.load(nii_file)\n",
" print(img.header)\n",
" "
]
},
{
"cell_type": "markdown",
"id": "2c3ba505",
"metadata": {},
"source": [
"As you can see, the returned message says `File does not exist`. If we navigate to and list the contents of the subdirectory where this file is supposed to be located, we see the following:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "15b2d76c",
"metadata": {},
"outputs": [],
"source": [
"cd sub-01/ses-movie/func/"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00c10d37",
"metadata": {},
"outputs": [],
"source": [
"ls"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6a6a6c15",
"metadata": {},
"outputs": [],
"source": [
"cd ../../.."
]
},
{
"cell_type": "markdown",
"id": "e885a79c",
"metadata": {},
"source": [
"Thus, we can see that the file named `sub-01_ses-movie_task-movie_run-1_bold.nii.gz` is there, but in reality: what appears to be the file in the dataset is merely a symbolic link (or *symlink*, indicated by the `@` at the end of the filename) to the actual file stored elsewhere. This is intentional behaviour of DataLad (and its dependeny git-annex) and underlies the core functionality of having local access to a full representation of the dataset, while (often large) data files are stored elsewhere.\n",
"\n",
"DataLad can be instructed to retrieve small text-based files upon dataset installation or cloning (technically, these are then stored and tracked with git and not with git-annex), which explains why the `tsv`-file was available and the `nii`-file not.\n",
"\n",
"To retrieve a specific file or many files, we use the `datalad get` command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "75d9746f",
"metadata": {},
"outputs": [],
"source": [
"!datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz"
]
},
{
"cell_type": "markdown",
"id": "2558ac55",
"metadata": {},
"source": [
"Now, the same code to access the `nii`-file and print its header content should run without errors:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ac6d1cc3",
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"import nibabel as nib\n",
"\n",
"nii_file = Path('sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz')\n",
"try:\n",
" nii_abs_path = nii_file.resolve(strict=True)\n",
"except FileNotFoundError:\n",
" print('File does not exist')\n",
"else:\n",
" print('File exists. Printing file header content.')\n",
" img = nib.load(nii_file)\n",
" print(img.header)"
]
},
{
"cell_type": "markdown",
"id": "b7798be0",
"metadata": {},
"source": [
"Once data processing is completed and data are not required locally anymore, the content can be dropped from the dataset to save diskspace:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d2578f67",
"metadata": {},
"outputs": [],
"source": [
"!datalad drop sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e527ea28",
"metadata": {},
"outputs": [],
"source": [
"cd ../.."
]
},
{
"cell_type": "markdown",
"id": "0622ad7b",
"metadata": {},
"source": [
"---\n",
"\n",
"\n",
"\n",
"## 4. Visualizing structural data \n",
"\n",
"Now that we are familiar with the concepts of `install`, `clone`, `get` and `drop`, we can explore and visualize the StudyForrest dataset!\n",
"\n",
"Let's view a structural T1w and T2w images of a single participant (located in the `original/3T_structural_mri`) subdataset.\n",
"\n",
"First we'll clone the dataset and get the relevant data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d26908d",
"metadata": {},
"outputs": [],
"source": [
"!datalad get -n original/3T_structural_mri"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "64796f5c",
"metadata": {},
"outputs": [],
"source": [
"!datalad get original/3T_structural_mri/sub-01/anat/sub-01_T*w.nii.gz"
]
},
{
"cell_type": "markdown",
"id": "565e051b",
"metadata": {},
"source": [
"Then we can plot the structural images in 3 dimensions. We can do this using a variety of Python tools. Here we use the functionality of `matplotlib` as an example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5d2b5dfc",
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import os\n",
"from utilities import plot_structural\n",
"t1w_fn = os.path.abspath('original/3T_structural_mri/sub-01/anat/sub-01_T1w.nii.gz')\n",
"fig = plot_structural(t1w_fn)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8d16615",
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"t2w_fn = os.path.abspath('original/3T_structural_mri/sub-01/anat/sub-01_T2w.nii.gz')\n",
"fig = plot_structural(t2w_fn)"
]
},
{
"cell_type": "markdown",
"id": "a07457ff",
"metadata": {},
"source": [
"---\n",
"\n",
"\n",
"\n",
"## 5. Functional data quality plots\n",
"\n",
"Since head movement during the acquisition of a functional MRI time series can be detrimental for the eventual data analysis and results, volume-to-volume head movement parameters are typically inspected as a quality indicator of fMRI data. Framewise displacement (FD) captures head movement in a single value per volume, resulting in an FD time series per functional run. Below we present interactive distribution plots of FD values for all participants over all runs of the 3T audiovisual movie dataset. Distributions and an example time series are also presented for a single subject and a single run.\n",
"\n",
"First, we retrieve the relevant data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f7672912",
"metadata": {},
"outputs": [],
"source": [
"!datalad get -n derivative/aligned_mri"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e1afd575",
"metadata": {},
"outputs": [],
"source": [
"!datalad get derivative/aligned_mri/sub-*/in_bold3Tp2/*_mcparams.txt"
]
},
{
"cell_type": "markdown",
"id": "1df60fa6",
"metadata": {},
"source": [
"Then we:\n",
"\n",
"1. calculate framewise displacement from the movement parameter files,\n",
"2. prepare all calculated values by structuring them in Pandas dataframes and lists, and\n",
"3. use Plotly to create interactive graphs of framewise displacement over all subjects and runs."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6128c333",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import utilities as util\n",
"from plotly.colors import sequential, n_colors, qualitative\n",
"import plotly.graph_objs as go\n",
"\n",
"dataset_dir = os.path.abspath('derivative/aligned_mri')\n",
"participants, column_names, df_subs, df_subsruns = util.prepare_fd(dataset_dir)\n",
"\n",
"colors = n_colors('rgb(255, 149, 81)', 'rgb(109, 52, 137)', 20, colortype='rgb')\n",
"data = []\n",
"layout = go.Layout(\n",
" xaxis = dict(tickangle=45),\n",
" yaxis = dict(title='Framewise displacement (mm)', range=[-0.3, 2]),\n",
" title = 'Framewise displacement for all participants over all runs (audio-visual movie task)'\n",
")\n",
"fig1 = go.Figure(layout=layout)\n",
"i = 0\n",
"for colname, color in zip(participants, colors):\n",
" data.append(df_subs[colname].dropna().to_numpy())\n",
" fig1.add_trace(go.Violin(y=data[i], line_color=colors[i], name=colname, orientation='v', side='positive', width=1.8, points=False, box_visible=True, meanline_visible=True))\n",
" i += 1\n",
"fig1.update_layout(xaxis_showgrid=False, xaxis_zeroline=False)\n",
"fig1"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fc964929",
"metadata": {},
"outputs": [],
"source": [
"sub_nr = 1\n",
"sub = f\"sub-{sub_nr:02d}\"\n",
"\n",
"data = []\n",
"layout = go.Layout(\n",
" title = 'Framewise displacement for all 8 runs of ' + sub + ' (audio-visual movie task)',\n",
" xaxis = dict(tickangle=45),\n",
" yaxis = dict(title='Framewise displacement (mm)', range=[-0.05, 1]),\n",
" height=400,\n",
")\n",
"fig2 = go.Figure(layout=layout)\n",
"i = 0\n",
"for colname, color in zip(column_names[8*(sub_nr-1):8*sub_nr], colors):\n",
" data.append(df_subsruns[colname].dropna().to_numpy())\n",
" fig2.add_trace(go.Violin(y=data[i], line_color=sequential.Viridis[i], name=colname, orientation='v', side='positive', width=1.8, points=False, box_visible=True, meanline_visible=True))\n",
" i += 1\n",
"fig2.update_layout(xaxis_showgrid=False, xaxis_zeroline=False)\n",
"fig2"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "52d29792",
"metadata": {},
"outputs": [],
"source": [
"from plotly.subplots import make_subplots\n",
"\n",
"run_nr = 2\n",
"run = f\"run-{run_nr}\"\n",
"marker = sub + '_' + run\n",
"\n",
"fig3 = make_subplots(rows=1, cols=2, column_widths=[0.85, 0.15],shared_yaxes=True, subplot_titles=(\"Time series\", \"Distribution\"), horizontal_spacing=0.01)\n",
"fig3.add_trace(go.Scatter(y=df_subsruns[marker].dropna().to_numpy(), mode='lines', line = dict(color=sequential.Viridis[run_nr-1], width=2), name='Time series', showlegend=False),\n",
" row=1, col=1)\n",
"fig3.add_trace(go.Violin(y=df_subsruns[marker].dropna().to_numpy(), line_color=sequential.Viridis[run_nr-1], name='Distribution', orientation='v', side='positive', width=1.5, points='all', jitter=0.5, box_visible=True, meanline_visible=True, showlegend=False),\n",
" row=1, col=2)\n",
"fig3.update_layout(\n",
" height=300,\n",
" yaxis = dict(title='FD (mm)',range=[-0.05, 1]),\n",
" title = 'Framewise displacement for sub-01-run2 (time series and distribution)'\n",
")\n",
"fig3.update_xaxes(showticklabels=False)\n",
"fig3.update_xaxes(showticklabels=True, row=1, col=1)\n",
"fig3"
]
},
{
"cell_type": "markdown",
"id": "ab8f7112",
"metadata": {},
"source": [
"---\n",
"\n",
"\n",
"\n",
"## 6. Your move :)\n",
"\n",
"Thanks for following along! You have now experienced the basics of working with the StudyForrest data using DataLad. You have also seen some sample scripts and visualizations of the structural and functional data.\n",
"\n",
"Now it's your turn :) Feel free to add more code cells below and test out your favorite algorithm/script/package on the StudyForrest data!\n",
"\n",
"![ChessUrl](https://media.giphy.com/media/IgGLggVL4HXYDAot0Y/source.gif \"forest_wave\")"
]
}
],
"metadata": {
"date": 1617871734.783596,
"filename": "OHBM.rst",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
},
"title": "OHBM Brainhack TrainTrack: DataLad"
},
"nbformat": 4,
"nbformat_minor": 5
}