"Open

In this notebook, I will prepare human antibody structures from SAbDab (The Structural Antibody Database) for multimodal pre-training.

**Goals**
- Download human antibody structures with resolution 2.5Å or better.
- Use [proteinflow](https://github.com/adaptyvbio/ProteinFlow) to filter sequences for quality, cluster sequences, and split into train/valid/test.

## Setup

In [None]:
# Import necessary libraries
from pathlib import Path
import os

In [None]:
!pip install proteinflow &> /dev/null
!apt-get install -qq -y mmseqs2 &> /dev/null

In [None]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

path = Path("/content/gdrive/")
path_data = Path("/content/gdrive/MyDrive/data")

Mounted at /content/gdrive


In [None]:
import pandas as pd

from slugify import slugify

In [None]:
#!proteinflow generate --help

## SAbDab

Download human antibody structures with a resolution of 2.5 Å or better. The resulting set contains structures resolved by either X-ray crystallography or cryo-electron microscopy.

In [None]:
# Search filters: species Homo sapiens, resolution cutoff 2.5 Å
sabdab_summary_url = 'https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/summary/20240520_0899946/'
sabdab_url = 'https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/archive/20240520_0899946/'
fname = slugify(sabdab_summary_url.split('/')[-2], lowercase=False)

In [None]:
# The summary URL must be generated fresh every time
!wget {sabdab_summary_url} -O {path_data}/{fname}_summary.tsv

--2024-05-20 18:11:19-- https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/summary/20240520_0899946/
Resolving opig.stats.ox.ac.uk (opig.stats.ox.ac.uk)... 163.1.32.59
Connecting to opig.stats.ox.ac.uk (opig.stats.ox.ac.uk)|163.1.32.59|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1050365 (1.0M) [text/tab-separated-values]
Saving to: ‘/content/gdrive/MyDrive/data/20240520-0899946_summary.tsv’


2024-05-20 18:11:21 (2.20 MB/s) - ‘/content/gdrive/MyDrive/data/20240520-0899946_summary.tsv’ saved [1050365/1050365]



In [None]:
# The archive URL must be generated fresh every time
!wget {sabdab_url} -O {path_data}/{fname}.zip

--2024-05-20 18:01:35-- https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/archive/20240520_0899946/
Resolving opig.stats.ox.ac.uk (opig.stats.ox.ac.uk)... 163.1.32.59
Connecting to opig.stats.ox.ac.uk (opig.stats.ox.ac.uk)|163.1.32.59|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4002463609 (3.7G) [application/zip]
Saving to: ‘/content/gdrive/MyDrive/data/20240520_0899946.zip’


2024-05-20 18:05:24 (18.8 MB/s) - ‘/content/gdrive/MyDrive/data/20240520_0899946.zip’ saved [4002463609/4002463609]



In [None]:
!ls {path_data}

20240520-0899946_summary.tsv 20240520-0899946.zip
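
Before running ProteinFlow, it can be useful to sanity-check the downloaded summary file against the 2.5 Å cutoff. The sketch below uses only the standard library and assumes the summary TSV has `pdb` and `resolution` columns; the helper name `high_res_pdb_ids` and the sample rows are hypothetical, and rows whose resolution is not a single number are simply skipped.

```python
import csv
import io


def high_res_pdb_ids(tsv_text, max_resolution=2.5):
    """Return the set of PDB IDs with resolution <= max_resolution.

    Assumes (not verified here) that the TSV has `pdb` and `resolution` columns.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = set()
    for row in reader:
        try:
            resolution = float(row["resolution"])
            pdb_id = row["pdb"]
        except (KeyError, TypeError, ValueError):
            # Missing column or non-numeric resolution: skip the row
            continue
        if resolution <= max_resolution:
            kept.add(pdb_id)
    return kept


# Made-up rows for illustration only; one PDB entry can span several rows
sample = (
    "pdb\tHchain\tLchain\tresolution\n"
    "1abc\tH\tL\t2.1\n"
    "1abc\tA\tB\t2.1\n"
    "2xyz\tH\tL\t3.0\n"
    "3nmr\tH\tL\tNOT\n"
)
print(sorted(high_res_pdb_ids(sample)))
```

To apply it to the real download, pass the contents of `{path_data}/{fname}_summary.tsv` read as text.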


## ProteinFlow

**Filter**
- Discard biounits with sequences shorter than 30 residues, since they are very small and quite flexible.
- Retain the redundant set of structures, since antibodies with identical amino acid sequences can differ slightly in structure.
- Select proteins with <30% missing residues in the tails and <10% missing residues in the middle.
- Discard all biounits that contain unnatural amino acids.
- Discard biounits that contain unexpected atoms.
- Discard biounits with discrepancies between FASTA and PDB sequences.
- Discard biounits that contain chains with >10,000 amino acids in total.
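
To make the missing-residue criterion concrete, here is a minimal sketch of how such a check could work. It is not ProteinFlow's implementation: the function names, the `"-"` mask convention, and the choice to measure both fractions relative to the full chain length are assumptions made for illustration.

```python
def missing_fractions(mask):
    """Return (ends_fraction, middle_fraction) for a per-residue mask.

    `mask` has one character per residue: "-" marks a residue without
    coordinates, any other character marks a resolved residue.
    Both fractions are relative to the full chain length (an assumption).
    """
    n = len(mask)
    resolved = [i for i, ch in enumerate(mask) if ch != "-"]
    if not resolved:
        # Nothing resolved: treat the whole chain as missing tails
        return 1.0, 0.0
    first, last = resolved[0], resolved[-1]
    missing_ends = first + (n - 1 - last)
    missing_middle = (last - first + 1) - len(resolved)
    return missing_ends / n, missing_middle / n


def passes_missing_filter(mask, ends_thr=0.3, middle_thr=0.1):
    """Keep a chain only if both fractions are strictly below the thresholds."""
    ends, middle = missing_fractions(mask)
    return ends < ends_thr and middle < middle_thr


# One residue missing at the N-terminus and none in the middle: kept
print(passes_missing_filter("-XXXXXXXXX"))
# One of ten residues missing in the middle reaches the 10% limit: discarded
print(passes_missing_filter("--XXXX-XXX"))
```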

**Cluster**

SAbDab sequences are clustered with MMseqs2 across all six complementarity-determining regions (CDRs): H1, H2, H3, L1, L2, and L3, as defined by Chothia numbering. The minimum sequence identity for MMseqs2 clustering is set to 90%.

**Split**

The resulting CDR clusters are split into train, valid, and test sets at a ∼80:10:10 ratio in a way that ensures that every PDB file appears in only one subset.
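
The idea of splitting whole clusters rather than individual entries can be illustrated with a simple greedy assignment. This is a deliberately simplified sketch, not the procedure ProteinFlow uses: the function `split_clusters` and the toy cluster IDs are hypothetical, and it assumes each entry belongs to exactly one cluster.

```python
def split_clusters(clusters, valid_split=0.1, test_split=0.1):
    """Greedily assign whole clusters to train/valid/test.

    `clusters` maps a cluster ID to the list of entries it contains.
    Each cluster goes to the split that is currently furthest below its
    target size, so no cluster is ever divided between subsets.
    """
    total = sum(len(members) for members in clusters.values())
    targets = {
        "train": (1 - valid_split - test_split) * total,
        "valid": valid_split * total,
        "test": test_split * total,
    }
    splits = {name: [] for name in targets}
    sizes = {name: 0 for name in targets}
    # Place the largest clusters first; ties are broken by cluster ID
    ordered = sorted(clusters.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for cluster_id, members in ordered:
        name = max(targets, key=lambda s: targets[s] - sizes[s])
        splits[name].append(cluster_id)
        sizes[name] += len(members)
    return splits, sizes


# Toy clusters for illustration only
toy = {
    "c0": [f"entry_{i}" for i in range(8)],
    "c1": ["entry_8"],
    "c2": ["entry_9"],
}
splits, sizes = split_clusters(toy)
print(splits)
print(sizes)
```

Because clusters are indivisible, the realized ratio can only approximate the target, which is presumably why the `generate` command accepts a tolerance.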

In [None]:
!proteinflow generate --sabdab \
--sabdab_data_path {path_data}/{fname}.zip --tag {fname} \
--resolution_thr 2.5 --not_remove_redundancies \
--min_seq_id 0.9 \
--local_datasets_folder {path_data} \
--valid_split 0.1 --test_split 0.1 \
--split_tolerance 0.05

Log file: /content/gdrive/MyDrive/data/proteinflow_20240520-0899946/log.txt 

Moving files...
Unzipping /content/gdrive/MyDrive/data/20240520-0899946.zip...
100% 5071/5071 [01:55<00:00, 43.75it/s]
Filtering...
100% 1287/1287 [00:15<00:00, 84.22it/s] 
Downloading fasta files...
100% 1287/1287 [00:14<00:00, 90.75it/s]
Filter and process...
100% 2219/2219 [24:05<00:00, 1.54it/s]
<<< Too many missing values in total: 150
<<< Too many missing values in the middle: 120
<<< Incorrect alignment: 34
<<< Too many missing values in the ends: 22
<<< FASTA file not found: 10
<<< Some chains in the PDB do not appear in the fasta file: 8
<<< Unnatural amino acids found: 7
<<< PDB / mmCIF file is too large: 2
Total exceptions: 353
Checking excluded chains similarity...
100% 1869/1869 [00:37<00:00, 49.42it/s] 
Clustering with MMSeqs2 for CDR L1...
100% 1868/1868 [00:08<00:00, 232.49it/s]
100% 1121/1121 [00:00<00:00, 200666.42it/s]
Clustering with MMSeqs2 for CDR L2...
100% 1868/1868 [00:07<00:00, 243.9

In [None]:
!ls /content/gdrive/MyDrive/data/proteinflow_{fname}/

log.txt splits_dict test train valid
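
A quick way to check the realized split is to count the files in each subset folder. The sketch below assumes that each processed entry is stored as one file directly inside `train`, `valid`, or `test`; the helper names are hypothetical. Note that file counts are only a rough proxy for the split ratio, since ProteinFlow may balance the subsets on a different unit.

```python
from pathlib import Path


def count_split_files(dataset_dir, splits=("train", "valid", "test")):
    """Count the files directly inside each split folder (0 if missing)."""
    counts = {}
    for split in splits:
        folder = Path(dataset_dir) / split
        if folder.is_dir():
            counts[split] = sum(1 for p in folder.iterdir() if p.is_file())
        else:
            counts[split] = 0
    return counts


def split_fractions(counts):
    """Convert per-split counts into fractions of the total."""
    total = sum(counts.values())
    return {name: (n / total if total else 0.0) for name, n in counts.items()}
```

In this notebook it could be called as `split_fractions(count_split_files(path_data / f"proteinflow_{fname}"))`.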


In [None]:
!proteinflow generate --help

Usage: proteinflow generate [OPTIONS]

 Generate a new ProteinFlow dataset

Options:
 --max_chains INTEGER The maximum number of chains per biounit
 --random_seed INTEGER The random seed to use for splitting
 --require_ligand Use this flag to require that the PDB files
 contain a ligand
 --foldseek Whether to use FoldSeek to cluster the
 dataset
 --tanimoto_clustering Whether to use Tanimoto Clustering instead
 of MMSeqs2. Only works if load_ligands is
 set to True
 --exclude_chains_without_ligands
 Exclude chains without ligands from the
 generated dataset
 --load_ligands Whether or not to load ligands found in the
 pdbs example: data['A']['ligand'][0]['X']
 --exclude_based_on_cdr [L1|L2|L3|H1|H2|H3]
 if given and exclude_clusters is true + the
 dataset is SAbDab, exclude files based on
 only the given CDR clusters
 --exclude_clusters Exclude clusters that contain chains similar
 to chains to exclude
 --exclude_threshold FLOAT Exclude chains with sequence identity to
 exclude_chains a