---
name: tribev2-brain-encoding
description: Use TRIBE v2, Meta's multimodal foundation model for predicting fMRI brain responses to video, audio, and text stimuli
triggers:
  - predict brain responses to video
  - fMRI encoding model
  - TRIBE v2 brain prediction
  - multimodal brain encoding
  - in-silico neuroscience model
  - predict cortical activity from video
  - brain response to naturalistic stimuli
  - tribev2 inference and training
---

# TRIBE v2 Brain Encoding Model

> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection

TRIBE v2 is Meta's multimodal foundation model that predicts fMRI brain responses to naturalistic stimuli (video, audio, text). It combines LLaMA 3.2 (text), V-JEPA2 (video), and Wav2Vec-BERT (audio) encoders into a unified Transformer architecture that maps multimodal representations onto the cortical surface (fsaverage5, ~20k vertices).

## Installation

```bash
# Inference only
pip install -e .

# With brain visualization (PyVista & Nilearn)
pip install -e ".[plotting]"

# Full training dependencies (PyTorch Lightning, W&B, etc.)
pip install -e ".[training]"
```

## Quick Start — Inference

### Load pretrained model and predict from video

```python
from tribev2 import TribeModel

# Load from HuggingFace (downloads weights to cache)
model = TribeModel.from_pretrained("facebook/tribev2", cache_folder="./cache")

# Build events dataframe from a video file
df = model.get_events_dataframe(video_path="path/to/video.mp4")

# Predict brain responses
preds, segments = model.predict(events=df)
print(preds.shape)  # (n_timesteps, n_vertices) on fsaverage5
```

### Multimodal input — video + audio + text

```python
from tribev2 import TribeModel

model = TribeModel.from_pretrained("facebook/tribev2", cache_folder="./cache")

# All modalities together (text is auto-converted to speech and transcribed)
df = model.get_events_dataframe(
    video_path="path/to/video.mp4",
    audio_path="path/to/audio.wav",   # optional, overrides video audio
    text_path="path/to/script.txt",   # optional, auto-timed
)

preds, segments = model.predict(events=df)
print(preds.shape)  # (n_timesteps, n_vertices)
```

### Text-only prediction

```python
from tribev2 import TribeModel

model = TribeModel.from_pretrained("facebook/tribev2", cache_folder="./cache")

df = model.get_events_dataframe(text_path="path/to/narration.txt")
preds, segments = model.predict(events=df)
```

## Brain Visualization

```python
from tribev2 import TribeModel
from tribev2.plotting import plot_brain_surface

model = TribeModel.from_pretrained("facebook/tribev2", cache_folder="./cache")
df = model.get_events_dataframe(video_path="path/to/video.mp4")
preds, segments = model.predict(events=df)

# Plot a single timepoint on the cortical surface
plot_brain_surface(preds[0], backend="nilearn")   # or backend="pyvista"
```

## Training a Model from Scratch

### 1. Set environment variables

```bash
export DATAPATH="/path/to/studies"
export SAVEPATH="/path/to/output"
export SLURM_PARTITION="your_slurm_partition"
```

### 2. Authenticate with HuggingFace (required for LLaMA 3.2)

```bash
huggingface-cli login
# Paste a HuggingFace read token when prompted
# Request access at: https://huggingface.co/meta-llama/Llama-3.2-3B
```

### 3. Local test run

```bash
python -m tribev2.grids.test_run
```

### 4. Full grid search on Slurm

```bash
# Cortical surface model
python -m tribev2.grids.run_cortical

# Subcortical regions
python -m tribev2.grids.run_subcortical
```

## Key API — TribeModel

```python
from tribev2 import TribeModel

# Load pretrained weights
model = TribeModel.from_pretrained(
    "facebook/tribev2",
    cache_folder="./cache"  # local cache for HuggingFace weights
)

# Build events dataframe (word-level timings, chunking, etc.)
df = model.get_events_dataframe(
    video_path=None,   # str path to .mp4
    audio_path=None,   # str path to .wav
    text_path=None,    # str path to .txt
)

# Run prediction
preds, segments = model.predict(events=df)
# preds: np.ndarray of shape (n_timesteps, n_vertices)
# segments: list of segment metadata dicts
```

## Project Structure

```
tribev2/
├── main.py              # Experiment pipeline: Data, TribeExperiment
├── model.py             # FmriEncoder: Transformer multimodal→fMRI model
├── pl_module.py         # PyTorch Lightning training module
├── demo_utils.py        # TribeModel and inference helpers
├── eventstransforms.py  # Event transforms (word extraction, chunking)
├── utils.py             # Multi-study loading, splitting, subject weighting
├── utils_fmri.py        # Surface projection (MNI / fsaverage) and ROI analysis
├── grids/
│   ├── defaults.py      # Full default experiment configuration
│   └── test_run.py      # Quick local test entry point
├── plotting/            # Brain visualization backends
└── studies/             # Dataset definitions (Algonauts2025, Lahner2024, …)
```

## Configuration — Defaults

Edit `tribev2/grids/defaults.py` or set environment variables:

```python
# tribev2/grids/defaults.py (key fields)
{
    "datapath": "/path/to/studies",       # override with DATAPATH env var
    "savepath": "/path/to/output",        # override with SAVEPATH env var
    "slurm_partition": "learnfair",       # override with SLURM_PARTITION env var
    "model": "FmriEncoder",
    "modalities": ["video", "audio", "text"],
    "surface": "fsaverage5",              # ~20k vertices
}
```

## Custom Experiment with PyTorch Lightning

```python
from tribev2.main import Data, TribeExperiment
from tribev2.pl_module import TribePLModule
import pytorch_lightning as pl

# Configure experiment
experiment = TribeExperiment(
    datapath="/path/to/studies",
    savepath="/path/to/output",
    modalities=["video", "audio", "text"],
)

data = Data(experiment)
module = TribePLModule(experiment)

trainer = pl.Trainer(
    max_epochs=50,
    accelerator="gpu",
    devices=4,
)
trainer.fit(module, data)
```

## Working with fMRI Surfaces

```python
from tribev2.utils_fmri import project_to_fsaverage, get_roi_mask

# Project MNI coordinates to fsaverage5 surface
surface_data = project_to_fsaverage(mni_data, target="fsaverage5")

# Get a specific ROI mask (e.g., early visual cortex)
roi_mask = get_roi_mask(roi_name="V1", surface="fsaverage5")
v1_responses = preds[:, roi_mask]
print(v1_responses.shape)  # (n_timesteps, n_v1_vertices)
```

## Common Patterns

### Batch prediction over multiple videos

```python
from tribev2 import TribeModel
import numpy as np

model = TribeModel.from_pretrained("facebook/tribev2", cache_folder="./cache")

video_paths = ["video1.mp4", "video2.mp4", "video3.mp4"]
all_predictions = []

for vp in video_paths:
    df = model.get_events_dataframe(video_path=vp)
    preds, segments = model.predict(events=df)
    all_predictions.append(preds)

# all_predictions: list of (n_timesteps_i, n_vertices) arrays
```

### Extract predictions for specific brain region

```python
from tribev2 import TribeModel
from tribev2.utils_fmri import get_roi_mask

model = TribeModel.from_pretrained("facebook/tribev2", cache_folder="./cache")
df = model.get_events_dataframe(video_path="video.mp4")
preds, segments = model.predict(events=df)

# Focus on auditory cortex
ac_mask = get_roi_mask("auditory_cortex", surface="fsaverage5")
auditory_responses = preds[:, ac_mask]  # (n_timesteps, n_ac_vertices)
```

### Access segment timing metadata

```python
preds, segments = model.predict(events=df)

for i, seg in enumerate(segments):
    print(f"Segment {i}: onset={seg['onset']:.2f}s, duration={seg['duration']:.2f}s")
    print(f"  Brain response shape: {preds[i].shape}")
```

## Troubleshooting

**LLaMA 3.2 access denied**
```bash
# Must request access at https://huggingface.co/meta-llama/Llama-3.2-3B
# Then authenticate:
huggingface-cli login
# Use a HuggingFace token with read permissions
```

**CUDA out of memory during inference**
```python
# Use CPU for inference on smaller machines
import torch
model = TribeModel.from_pretrained("facebook/tribev2", cache_folder="./cache")
model.to("cpu")
```

**Missing visualization dependencies**
```bash
pip install -e ".[plotting]"
# Installs pyvista and nilearn backends
```

**Slurm training not submitting**
```bash
# Check env vars are set
echo $DATAPATH $SAVEPATH $SLURM_PARTITION
# Or edit tribev2/grids/defaults.py directly
```

**Video without audio track causes error**
```python
# Provide audio separately or use text-only mode
df = model.get_events_dataframe(
    video_path="silent_video.mp4",
    audio_path="separate_audio.wav",
)
```

## Citation

```bibtex
@article{dAscoli2026TribeV2,
  title={A foundation model of vision, audition, and language for in-silico neuroscience},
  author={d'Ascoli, St{\'e}phane and Rapin, J{\'e}r{\'e}my and Benchetrit, Yohann and Brookes, Teon
          and Begany, Katelyn and Raugel, Jos{\'e}phine and Banville, Hubert and King, Jean-R{\'e}mi},
  year={2026}
}
```

## Resources

- [Paper](https://ai.meta.com/research/publications/a-foundation-model-of-vision-audition-and-language-for-in-silico-neuroscience/)
- [Interactive Demo](https://aidemos.atmeta.com/tribev2/)
- [HuggingFace Weights](https://huggingface.co/facebook/tribev2)
- [Colab Notebook](https://colab.research.google.com/github/facebookresearch/tribev2/blob/main/tribe_demo.ipynb)