"cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "source": [ "In this notebook, I will prepare human antibody structures from SAbDab (the Structural Antibody Database) for multimodal pre-training.\n", "\n", "**Goals**\n", "- Download human antibody structures with a resolution of 2.5 Å or better.\n", "- Use [proteinflow](https://github.com/adaptyvbio/ProteinFlow) to filter sequences for quality, cluster them, and split them into train/valid/test sets." ], "metadata": { "id": "p7-bYMemg_sf" } }, { "cell_type": "markdown", "source": [ "## Setup" ], "metadata": { "id": "c0tl4HMN45sA" } }, { "cell_type": "code", "source": [ "# Import necessary libraries\n", "from pathlib import Path\n", "import os" ], "metadata": { "id": "-b6nlNc5IM5q" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "!pip install proteinflow &> /dev/null\n", "!apt-get install -qq -y mmseqs2 &> /dev/null" ], "metadata": { "id": "mB_Z9PNWf-lB" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "from google.colab import drive\n", "drive.mount('/content/gdrive', force_remount=True)\n", "\n", "path = Path(\"/content/gdrive/\")\n", "path_data = Path(\"/content/gdrive/MyDrive/data\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Q1IRQGUk6d8v", "outputId": "b4da66f2-bdc8-4fb0-9705-116ad2ba0b27" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Mounted at /content/gdrive\n" ] } ] }, { "cell_type": "code", "source": [ "import pandas as pd\n", "\n", "from slugify import slugify" ], "metadata": { "id": "a2hPYQMCHCSS" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "#!proteinflow generate --help" ], "metadata": { "id": "RrR8IX1sr9fX" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## SAbDab\n", "\n", "Download human 
antibody structures with a resolution of 2.5 Å or better. The resulting structures were resolved by either X-ray crystallography or cryo-electron microscopy." ], "metadata": { "id": "koBvqfWi5CrO" } }, { "cell_type": "code", "source": [ "# Species: Homo sapiens; resolution: 2.5 Å or better\n", "sabdab_summary_url = 'https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/summary/20240520_0899946/'\n", "sabdab_url = 'https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/archive/20240520_0899946/'\n", "fname = slugify(sabdab_summary_url.split('/')[-2], lowercase=False)" ], "metadata": { "id": "XG8gSUjWL4ux" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# The summary URL must be generated fresh every time\n", "!wget {sabdab_summary_url} -O {path_data}/{fname}_summary.tsv" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "NkTHU5uTWSNi", "outputId": "e43ae073-5151-438f-be7f-161fe95790a3" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "--2024-05-20 18:11:19-- https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/summary/20240520_0899946/\n", "Resolving opig.stats.ox.ac.uk (opig.stats.ox.ac.uk)...\n", "Connecting to opig.stats.ox.ac.uk (opig.stats.ox.ac.uk)||:443... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: 1050365 (1.0M) [text/tab-separated-values]\n", "Saving to: ‘/content/gdrive/MyDrive/data/20240520-0899946_summary.tsv’\n", "\n", "/content/gdrive/MyD 100%[===================>] 1.00M 2.20MB/s in 0.5s \n", "\n", "2024-05-20 18:11:21 (2.20 MB/s) - ‘/content/gdrive/MyDrive/data/20240520-0899946_summary.tsv’ saved [1050365/1050365]\n", "\n" ] } ] }, { "cell_type": "code", "source": [ "# The archive URL must be generated fresh every time\n", "!wget {sabdab_url} -O {path_data}/{fname}.zip" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "PUDJ8WjlWxyL", "outputId": "a8adb633-a8bc-4174-ff5d-e942cace6580" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "--2024-05-20 18:01:35-- https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/archive/20240520_0899946/\n", "Resolving opig.stats.ox.ac.uk (opig.stats.ox.ac.uk)...\n", "Connecting to opig.stats.ox.ac.uk (opig.stats.ox.ac.uk)||:443... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: 4002463609 (3.7G) [application/zip]\n", "Saving to: ‘/content/gdrive/MyDrive/data/20240520_0899946.zip’\n", "\n", "/content/gdrive/MyD 100%[===================>] 3.73G 30.5MB/s in 3m 23s \n", "\n", "2024-05-20 18:05:24 (18.8 MB/s) - ‘/content/gdrive/MyDrive/data/20240520_0899946.zip’ saved [4002463609/4002463609]\n", "\n" ] } ] }, { "cell_type": "code", "source": [ "!ls {path_data}" ], "metadata": { "id": "vvmghrVs3dKI", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "d6ba09f9-9ff3-4135-e99e-2abdc1378c7d" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "20240520-0899946_summary.tsv 20240520-0899946.zip\n" ] } ] }, { "cell_type": "markdown", "source": [ "## ProteinFlow" ], "metadata": { "id": "3Wfi76x-5Gzy" } }, { "cell_type": "markdown", "source": [ "**Filter**\n", "- Discard biounits with sequences shorter than 30 residues, since they are very small and quite flexible.\n", "- Retain redundant structures, since antibodies with identical amino acid sequences can have slight variations in their structure.\n", "- Select proteins with <30% missing residues in the tails and <10% missing residues in the middle.\n", "- Discard biounits that contain unnatural amino acids.\n", "- Discard biounits that contain unexpected atoms.\n", "- Discard biounits with discrepancies between FASTA and PDB sequences.\n", "- Discard biounits that contain chains with >10,000 amino acids in total.\n", "\n", "**Cluster**\n", "\n", "SAbDab sequences are clustered with MMseqs2 across all six complementarity-determining regions (CDRs): H1, H2, H3, L1, L2, and L3, defined by Chothia numbering. The minimum sequence identity for MMseqs2 clustering is set to 90%.\n", "\n", "**Split**\n", "\n", "The resulting CDR clusters are split into train, validation, and test sets at a ∼80:10:10 ratio such that every PDB file appears in only one subset." 
], "metadata": { "id": "iudwiAZY8zD5" } }, { "cell_type": "code", "source": [ "!proteinflow generate --sabdab \\\n", "--sabdab_data_path {path_data}/{fname}.zip --tag {fname} \\\n", "--resolution_thr 2.5 --not_remove_redundancies \\\n", "--min_seq_id 0.9 \\\n", "--local_datasets_folder {path_data} \\\n", "--valid_split 0.1 --test_split 0.1 \\\n", "--split_tolerance 0.05" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "4Z1NJrRrjG-X", "outputId": "92e8a2ea-373a-4a5a-dd20-c9ef6695adcb" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Log file: /content/gdrive/MyDrive/data/proteinflow_20240520-0899946/log.txt \n", "\n", "Moving files...\n", "Unzipping /content/gdrive/MyDrive/data/20240520-0899946.zip...\n", "100% 5071/5071 [01:55<00:00, 43.75it/s]\n", "Filtering...\n", "100% 1287/1287 [00:15<00:00, 84.22it/s] \n", "Downloading fasta files...\n", "100% 1287/1287 [00:14<00:00, 90.75it/s]\n", "Filter and process...\n", "100% 2219/2219 [24:05<00:00, 1.54it/s]\n", "<<< Too many missing values in total: 150\n", "<<< Too many missing values in the middle: 120\n", "<<< Incorrect alignment: 34\n", "<<< Too many missing values in the ends: 22\n", "<<< FASTA file not found: 10\n", "<<< Some chains in the PDB do not appear in the fasta file: 8\n", "<<< Unnatural amino acids found: 7\n", "<<< PDB / mmCIF file is too large: 2\n", "Total exceptions: 353\n", "Checking excluded chains similarity...\n", "100% 1869/1869 [00:37<00:00, 49.42it/s] \n", "Clustering with MMSeqs2 for CDR L1...\n", "100% 1868/1868 [00:08<00:00, 232.49it/s]\n", "100% 1121/1121 [00:00<00:00, 200666.42it/s]\n", "Clustering with MMSeqs2 for CDR L2...\n", "100% 1868/1868 [00:07<00:00, 243.93it/s]\n", "100% 1121/1121 [00:00<00:00, 236283.97it/s]\n", "Clustering with MMSeqs2 for CDR L3...\n", "100% 1868/1868 [00:08<00:00, 210.17it/s]\n", "100% 1121/1121 [00:00<00:00, 125495.51it/s]\n", "Clustering with MMSeqs2 for CDR H1...\n", "100% 1868/1868 
[00:07<00:00, 236.49it/s]\n", "100% 1121/1121 [00:00<00:00, 228520.77it/s]\n", "Clustering with MMSeqs2 for CDR H2...\n", "100% 1868/1868 [00:08<00:00, 226.31it/s]\n", "100% 1121/1121 [00:00<00:00, 247476.96it/s]\n", "Clustering with MMSeqs2 for CDR H3...\n", "100% 1868/1868 [00:08<00:00, 212.87it/s]\n", "100% 1121/1121 [00:00<00:00, 205858.79it/s]\n", "/usr/local/lib/python3.10/dist-packages/networkx/convert_matrix.py:687: DeprecationWarning: from_numpy_matrix is deprecated and will be removed in NetworkX 3.0.\n", "Use from_numpy_array instead, e.g. from_numpy_array(A, **kwargs)\n", " warnings.warn(\n", "\n", "Split size:\n", " Train 79.89%\n", " Valid 10.04%\n", " Test 10.07%\n", "\n", "Moving files in the train set...\n", "100% 1571/1571 [00:06<00:00, 226.53it/s]\n", "Moving files in the validation set...\n", "100% 147/147 [00:00<00:00, 234.08it/s]\n", "Moving files in the test set...\n", "100% 150/150 [00:01<00:00, 148.44it/s]\n" ] } ] }, { "cell_type": "code", "source": [ "!ls /content/gdrive/MyDrive/data/proteinflow_{fname}/" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "tww6FpWknRxz", "outputId": "ab1cef11-e87e-4428-e24b-35e6d512a7b9" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "log.txt splits_dict test train valid\n" ] } ] }, { "cell_type": "code", "source": [ "!proteinflow generate --help" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "mXVNE6buUplB", "outputId": "94511098-180b-485c-fa27-f23be2f9700d" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Usage: proteinflow generate [OPTIONS]\n", "\n", " Generate a new ProteinFlow dataset\n", "\n", "Options:\n", " --max_chains INTEGER The maximum number of chains per biounit\n", " --random_seed INTEGER The random seed to use for splitting\n", " --require_ligand Use this flag to require that the PDB files\n", " contain a ligand\n", " --foldseek Whether to 
use FoldSeek to cluster the\n", " dataset\n", " --tanimoto_clustering Whether to use Tanimoto Clustering instead\n", " of MMSeqs2. Only works if load_ligands is\n", " set to True\n", " --exclude_chains_without_ligands\n", " Exclude chains without ligands from the\n", " generated dataset\n", " --load_ligands Whether or not to load ligands found in the\n", " pdbs example: data['A']['ligand'][0]['X']\n", " --exclude_based_on_cdr [L1|L2|L3|H1|H2|H3]\n", " if given and exclude_clusters is true + the\n", " dataset is SAbDab, exclude files based on\n", " only the given CDR clusters\n", " --exclude_clusters Exclude clusters that contain chains similar\n", " to chains to exclude\n", " --exclude_threshold FLOAT Exclude chains with sequence identity to\n", " exclude_chains above this threshold\n", " --exclude_chains_file TEXT Exclude specific chains from the dataset\n", " (path to a file containing the sequences to\n", " exclude, one sequence per line)\n", " -e, --exclude_chains TEXT Exclude specific chains from the dataset\n", " ({pdb_id}-{chain_id}, e.g. 
-e 1a2b-A)\n", " --require_antigen Use this flag to require that the SAbDab\n", " files contain an antigen\n", " --sabdab_data_path TEXT Path to a zip file or a directory containing\n", " SAbDab files (only used if `sabdab` is\n", " `True`)\n", " --sabdab Use this flag to generate a dataset from\n", " SAbDab files instead of PDB\n", " --min_seq_id FLOAT Minimum sequence identity for mmseqs\n", " clustering\n", " --load_live Load the files that are not in the latest\n", " PDB snapshot from the PDB FTP server\n", " (disregarded if pdb_snapshot is not none)\n", " --pdb_snapshot TEXT The pdb snapshot folder to load\n", " --valid_split FLOAT The fraction of chains to put in the\n", " validation set (default 5%)\n", " --test_split FLOAT The fraction of chains to put in the test\n", " set (default 5%)\n", " --split_tolerance FLOAT The tolerance on the split ratio (default\n", " 20%)\n", " --force When `True`, rewrite the files if they\n", " already exist\n", " --n INTEGER The number of files to process (for\n", " debugging purposes)\n", " --skip_splitting Use this flag to skip splitting the data\n", " --redundancy_thr FLOAT The threshold upon which sequences are\n", " considered as one and the same (default:\n", " 90%)\n", " --not_remove_redundancies Unless this flag is used, removes biounits\n", " that are doubles of others sequence wise\n", " --not_filter_methods Unless this flag is used, only files\n", " obtained with X-ray or EM will be processed\n", " --missing_middle_thr FLOAT The maximum fraction of missing residues in\n", " the middle (after missing ends are\n", " disregarded)\n", " --missing_ends_thr FLOAT The maximum fraction of missing residues at\n", " the ends\n", " --resolution_thr FLOAT The maximum resolution\n", " --max_length INTEGER The maximum number of residues per chain\n", " (set None for no threshold)\n", " --min_length INTEGER The minimum number of non-missing residues\n", " per chain\n", " --local_datasets_folder TEXT The folder where proteinflow 
datasets,\n", " temporary files and logs will be stored\n", " --pdb_id_list_path TEXT List of pdb ids to download and process\n", " --tag TEXT The name of the dataset\n", " --help Show this message and exit.\n" ] } ] } ], "metadata": { "colab": { "provenance": [], "machine_shape": "hm", "toc_visible": true, "authorship_tag": "ABX9TyNpqFRngY+vseSuyDb03VWc", "include_colab_link": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }