import marimo __generated_with = "0.23.3" app = marimo.App() @app.cell(hide_code=True) def _(): import marimo as mo from pathlib import Path import os from dartbrains_tools.data import get_file, get_subjects, get_tr, load_events, load_confounds, REPO_ID, CONDITIONS from huggingface_hub import hf_hub_download import nibabel as nib import matplotlib.pyplot as plt from nilearn.plotting import view_img, plot_glass_brain, plot_anat, plot_epi, plot_stat_map from nltools.data import Brain_Data from nltools.utils import get_anatomical IMG_DIR = next(p for p in (Path.cwd(), *Path.cwd().resolve().parents) if (p / "book.yml").exists()) / "images" / "brain_data" return ( Brain_Data, get_anatomical, get_file, get_subjects, load_events, mo, nib, plot_anat, plot_glass_brain, plot_stat_map, plt, view_img, ) @app.cell(hide_code=True) def _(mo): mo.md(r""" # Introduction to Neuroimaging Data In this tutorial we will learn the basics of the organization of data folders, and how to load, plot, and manipulate neuroimaging data in Python. To introduce the basics of fMRI data structures, watch this short video by Martin Lindquist. """) return @app.cell(hide_code=True) def _(mo): mo.md(r""" """) return @app.cell(hide_code=True) def _(mo): mo.Html(""" """) return @app.cell(hide_code=True) def _(mo): mo.md(r""" ## Software Packages There are many different software packages to analyze neuroimaging data. Most of them are open source and free to use (with the exception of [BrainVoyager](https://www.brainvoyager.com/)). The most popular ones ([SPM](https://www.fil.ion.ucl.ac.uk/spm/), [FSL](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki), & [AFNI](https://afni.nimh.nih.gov/)) have been around a long time and are where many new methods are developed and distributed. These packages have focused on implementing what they believe are the best statistical methods, ease of use, and computational efficiency. They have very large user bases so many bugs have been identified and fixed over the years. There are also lots of publicly available documentation, listserves, and online tutorials, which makes it very easy to get started using these tools. There are also many more boutique packages that focus on specific types of preprocessing step and analyses such as spatial normalization with [ANTs](http://stnava.github.io/ANTs/), connectivity analyses with the [conn-toolbox](https://web.conn-toolbox.org/), representational similarity analyses with the [rsaToolbox](https://github.com/rsagroup/rsatoolbox), and prediction/classification with [pyMVPA](http://www.pymvpa.org/). Many packages have been developed within proprietary software such as [Matlab](https://www.mathworks.com/products/matlab.html) (e.g., SPM, Conn, RSAToolbox, etc). Unfortunately, this requires that your university has site license for Matlab and many individual add-on toolboxes. If you are not affiliated with a University, you may have to pay for Matlab, which can be fairly expensive. There are free alternatives such as [octave](https://www.gnu.org/software/octave/), but octave does not include many of the add-on toolboxes offered by matlab that may be required for a specific package. Because of this restrictive licensing, it is difficult to run matlab on cloud computing servers and to use with free online courses such as dartbrains. Other packages have been written in C/C++/C# and need to be compiled to run on your specific computer and operating system. While these tools are typically highly computationally efficient, it can sometimes be challenging to get them to install and work on specific computers and operating systems. There has been a growing trend to adopt the open source Python framework in the data science and scientific computing communities, which has lead to an explosion in the number of new packages available for statistics, visualization, machine learning, and web development. [pyMVPA](http://www.pymvpa.org/) was an early leader in this trend, and there are many great tools that are being actively developed such as [nilearn](https://nilearn.github.io/), [brainiak](https://brainiak.org/), [neurosynth](https://github.com/neurosynth/neurosynth), [nipype](https://nipype.readthedocs.io/en/latest/), [fmriprep](https://fmriprep.readthedocs.io/en/stable/), and many more. One exciting thing is that these newer developments have built on the expertise of decades of experience with imaging analyses, and leverage changes in high performance computing. There is also a very tight integration with many cutting edge developments in adjacent communities such as machine learning with [scikit-learn](https://scikit-learn.org/stable/), [tensorflow](https://www.tensorflow.org/), and [pytorch](https://pytorch.org/), which has made new types of analyses much more accessible to the neuroimaging community. There has also been an influx of younger contributors with software development expertise. You might be surprised to know that many of the popular tools being used had core contributors originating from the neuroimaging community (e.g., scikit-learn, seaborn, and many more). For this course, I have chosen to focus on tools developed in Python as it is an easy to learn programming language, has excellent tools, works well on distributed computing systems, has great ways to disseminate information (e.g., jupyter notebooks, jupyter-book, etc), and is free! If you are just getting started, I would spend some time working with [NiLearn](https://nilearn.github.io/) and [Brainiak](https://brainiak.org/), which have a lot of functionality, are very well tested, are reasonably computationally efficient, and most importantly have lots of documentation and tutorials to get started. We will be using many packages throughout the course such as [fmriprep](https://fmriprep.readthedocs.io/en/stable/) to perform preprocessing, and [nltools](https://nltools.org/), which is a package developed in my lab, to do basic data manipulation and analysis. NLtools is built using many other toolboxes such as [nibabel](https://nipy.org/nibabel/) and [nilearn](https://nilearn.github.io/), and we will also be using these frequently throughout the course. """) return @app.cell(hide_code=True) def _(mo): mo.md(r""" ## BIDS: Brain Imaging Dataset Specification Recently, there has been growing interest to share datasets across labs and even on public repositories such as [openneuro](https://openneuro.org/). In order to make this a successful enterprise, it is necessary to have some standards in how the data are named and organized. Historically, each lab has used their own idiosyncratic conventions, which can make it difficult for outsiders to analyze. In the past few years, there have been heroic efforts by the neuroimaging community to create a standardized file organization and naming practices. This specification is called **BIDS** for [Brain Imaging Dataset Specification](http://bids.neuroimaging.io/). As you can imagine, individuals have their own distinct method of organizing their files. Think about how you keep track of your files on your personal laptop (versus your friend). This may be okay in the personal realm, but in science, it's best if anyone (especially yourself 6 months from now!) can follow your work and know *which* files mean *what* by their titles. Our course dataset — the [dartbrains/localizer](https://huggingface.co/datasets/dartbrains/localizer) dataset on HuggingFace — follows the BIDS layout. Here's the top-level structure of the raw side: ``` localizer/ ├── dataset_description.json # dataset name, BIDS version, authors ├── participants.tsv # one row per subject (age, sex, …) ├── participants.json # column descriptions for participants.tsv ├── task-localizer_bold.json # task-level acquisition params (TR, slice timing, …) ├── README.md ├── sub-S01/ │ ├── anat/ │ │ └── metadata.csv │ └── func/ │ ├── sub-S01_task-localizer_events.tsv # stimulus onsets, durations, conditions │ └── metadata.csv ├── sub-S02/ … ├── sub-S20/ └── derivatives/ # processed outputs (see next section) ``` A few things to notice: 1. **Files are in NIfTI format**, not raw DICOMs. (In this dataset the raw `.nii.gz` files aren't hosted to keep the download small — only the `events.tsv` per subject lives under raw, with the preprocessed scans available under `derivatives/`. A complete BIDS dataset would include `sub-S01/anat/sub-S01_T1w.nii.gz` and `sub-S01/func/sub-S01_task-localizer_bold.nii.gz` here.) 2. **Scans are broken up by modality** — `anat/`, `func/`, `dwi/`, `fmap/` — for each subject. 3. **Filenames carry metadata** as `key-value` *entities* separated by underscores: `sub-S01_task-localizer_events.tsv` tells you the subject, task, and content type at a glance. 4. **Sidecar JSON files** describe acquisition parameters in a machine-readable format (echo time, slice timing, phase encoding direction, …), either alongside each scan or "inherited" from a top-level file like `task-localizer_bold.json`. Not only does this specification standardize within labs, it also makes collaboration, software development, and data publishing dramatically easier. Because the format is consistent, tools like [pybids](https://github.com/bids-standard/pybids) can programmatically index and query an entire BIDS directory. In this course, we use lightweight helper functions in `dartbrains_tools.data` that download individual files on demand from HuggingFace Hub. """) return @app.cell(hide_code=True) def _(mo): mo.md(r""" ### The `derivatives/` folder BIDS makes a strict separation between **raw data** (what came off the scanner) and **derivatives** (anything produced by running a pipeline on that raw data). Derived files live in a sibling `derivatives/` directory, with one subfolder per pipeline. Here's the actual layout for our dataset: ``` localizer/derivatives/ ├── fmriprep/ │ ├── dataset_description.json │ ├── sub-S01.html # per-subject QC report │ ├── sub-S01/ │ │ ├── anat/ │ │ │ ├── sub-S01_desc-preproc_T1w.nii.gz # T1 in native space │ │ │ ├── sub-S01_desc-brain_mask.nii.gz # brain mask, native │ │ │ ├── sub-S01_dseg.nii.gz # tissue segmentation │ │ │ ├── sub-S01_label-{GM,WM,CSF}_probseg.nii.gz # tissue probabilities │ │ │ ├── sub-S01_from-T1w_to-MNI152NLin2009cAsym_mode-image_xfm.h5 # forward transform │ │ │ ├── sub-S01_from-MNI152NLin2009cAsym_to-T1w_mode-image_xfm.h5 # inverse transform │ │ │ ├── sub-S01_space-MNI152NLin2009cAsym_desc-preproc_T1w.nii.gz # T1 in MNI space │ │ │ └── sub-S01_space-MNI152NLin2009cAsym_desc-brain_mask.nii.gz │ │ ├── func/ │ │ │ ├── sub-S01_task-localizer_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz │ │ │ ├── sub-S01_task-localizer_space-MNI152NLin2009cAsym_desc-brain_mask.nii.gz │ │ │ ├── sub-S01_task-localizer_space-MNI152NLin2009cAsym_boldref.nii.gz │ │ │ └── sub-S01_task-localizer_desc-confounds_regressors.tsv # motion + physio regressors │ │ └── figures/ # QC SVGs (carpetplot, flirtbbr, dseg, …) │ ├── sub-S02/ … │ └── logs/CITATION.{bib,html,md,tex} └── betas/ # condition-level GLM estimates ├── S01_beta_audio_computation.nii.gz ├── S01_beta_audio_left_hand.nii.gz │ … (10 conditions per subject) ├── S01_betas.nii.gz # stacked 4D image (10 conditions) ├── S02_beta_… └── … ``` Each pipeline gets its own subfolder under `derivatives/` (here: `fmriprep/` for preprocessing and `betas/` for our first-level GLM outputs; other common ones are `freesurfer/`, `mriqc/`, `xcp_d/`). This means you can run multiple pipelines on the same dataset without them colliding, and deleting and re-running a pipeline never risks the raw data. Derivative files follow BIDS naming conventions but add **entities** that describe the processing variant. The most important ones to recognize: - `desc-` describes *what kind of derivative* — `desc-preproc_bold` is the preprocessed BOLD timeseries; `desc-brain_mask` is a brain mask; `desc-confounds_regressors` is the confounds TSV. - `space-` identifies the *coordinate space* — `space-MNI152NLin2009cAsym` means the file has been warped into the MNI152 nonlinear 2009c asymmetric template; absence of `space-` means native subject space. - `from-`/`to-` on `xfm.h5` files describe the *direction of a transform* (T1w → MNI for forward warps, MNI → T1w for inverse). - `label-` distinguishes *tissue classes* on segmentation outputs (GM, WM, CSF). These conventions keep filenames self-describing: `sub-S01_task-localizer_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz` tells you it's subject S01's localizer task, preprocessed and resampled into MNI space — without opening the file. In this course, our `Code.data.get_file()` helper takes a `scope` argument that distinguishes raw from derivative data: `scope='raw'` pulls from `sub-S01/`, `scope='derivatives'` pulls from `derivatives/fmriprep/sub-S01/`, and `scope='betas'` pulls from `derivatives/betas/`. The helper downloads on demand from HuggingFace and caches locally, so you don't need the full directory structure on disk. """) return @app.cell(hide_code=True) def _(mo): mo.md(r""" ### Accessing the Dataset The Localizer dataset is hosted on [HuggingFace](https://huggingface.co/datasets/dartbrains/localizer) in BIDS format. We provide helper functions in `dartbrains_tools.data` that download files on demand and cache them locally: ```python from dartbrains_tools.data import get_file, get_subjects, load_events # Get the preprocessed BOLD file for subject S01 bold_path = get_file('S01', 'derivatives', 'bold') # Get a list of all subjects subjects = get_subjects() # ['S01', 'S02', ..., 'S20'] # Load event timing for a subject events = load_events('S01') ``` Files are downloaded from HuggingFace Hub the first time you request them and cached locally for subsequent use. """) return @app.cell(hide_code=True) def _(mo): mo.md(r""" With a BIDS dataset, we often want to know which subjects are available, and retrieve specific files by subject, data type, and scope (raw vs. derivatives). Let's start by listing the subjects in the dataset. """) return @app.cell def _(get_subjects): subjects = get_subjects() subjects[:10] return @app.cell(hide_code=True) def _(mo): mo.md(r""" We can also retrieve the path to a specific file. For example, let's get the preprocessed BOLD file for the first 10 subjects. The `get_file` function downloads the file from HuggingFace Hub on first access and returns the local cached path. """) return @app.cell def _(get_file, get_subjects): bold_files = [get_file(sub, 'derivatives', 'bold') for sub in get_subjects()[:10]] bold_files return @app.cell(hide_code=True) def _(mo): mo.md(r""" In a BIDS dataset, each file follows a structured naming convention. For example, a preprocessed BOLD file is named: `sub-S01_task-localizer_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz` The key-value pairs (`sub-S01`, `task-localizer`, `space-...`, `desc-preproc`) are called **entities** and they encode metadata directly in the filename. This is one of the core design principles of BIDS: you can understand what a file contains just by reading its name. Common BIDS entities include: - `sub-