{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"toc_visible": true,
"authorship_tag": "ABX9TyPktUSzGMpGU1YmHHSsNC7z",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "vyn8znjVP_Fg"
},
"outputs": [],
"source": [
"# Install for running in colab\n",
"!pip install biopython fair-esm -U kaleido &> /dev/null"
]
},
{
"cell_type": "markdown",
"source": [
"# Zero-shot prediction of functional protein sequence variants"
],
"metadata": {
"id": "Qs3Xu2RWQhZq"
}
},
{
"cell_type": "markdown",
"source": [
"In Moderna Therapeutics's study titled [\"Therapeutic Enzyme Engineering Using a Generative Neural Network\"](https://www.nature.com/articles/s41598-022-05195-x), scientists turbocharge ornithine transcarbamylase (OTC), a critical player in the urea cycle combating a rare metabolic ailment. Their secret sauce? Marrying insights from nature's evolution dance with a dash of artificial evolution guided by savvy deep learning models. Meanwhile, in the paper [\"Efficient evolution of human antibodies from general protein language models\"](https://www.nature.com/articles/s41587-023-01763-2), authors play protein origami, deftly evolving human antibodies sans the cheat sheet on target antigen specifics binding specificity, or protein structure. No experimental data for model supervision; both approaches propose beneficial mutations in a zero-shot manner.\n",
"\n",
"Crafting functional protein sequences in the protein engineering arena is a puzzle akin to navigating a vast cosmic expanse. Picture a protein with 200 amino acids; a mere switcheroo of one amino acid spawns a array of 4000 potential variants (200*20). My [previous notebooks](https://nbviewer.org/github/arjan-hada/protein-variant-prediction/blob/master/00_protein_seq_to_fxn_model.ipynb) used machine learning as the Sherpa guiding directed evolution when armed with experimental compass. But what if the experimental compass is missing? Can we still map out a path to promising protein sequences by divining the protein's mutational horoscope?\n",
"\n",
"Nature leaves us clues, imprinting functional wisdom in protein sequences. Enter the unsupervised models, revealing higher functional variant tales from the sequence data troves, as showcased in the article [\"Language models enable zero-shot prediction of the effects of mutations on protein function\"](https://www.biorxiv.org/content/10.1101/2021.07.09.450648v2). These unsupervised maestros for predicting protein variant effects fall into three troupes: **hybrid protein language models (PLMs)**, **PLMs**, **Alignment based models**, and **Inverse folding models**.\n",
"\n",
"Our playbook? This notebook orchestrates and contrasts these model from each class, aiming to spotlight protein sequence gems that can lighten the load of experimental quests in directed evolution. The playbook uses competitive models from the benchmarks set by ProteinGym](https://proteingym.org/), providing a comprehensive evaluation of these models' capabilities in predicting the mutational landscape of proteins.\n",
"\n",
"**Goals**\n",
"- Efficiently identify promising protein sequence variants to alleviate the experimental screening burden in directed evolution.\n",
"- Provide a canvas of mutational landscape, specifically focusing on the [encapsidation protein 22K crucial for recombinant Adeno-Associated Virus (rAAV) production](https://pubmed.ncbi.nlm.nih.gov/38062776/). Maybe there are protein sequence gems that hold the key to improving rAAV production?"
],
"metadata": {
"id": "BnyB29yrtlKT"
}
},
{
"cell_type": "markdown",
"source": [
"## Protein fitness landscape prediction using Hybrid PLMs"
],
"metadata": {
"id": "CZUcWQ5fp8nn"
}
},
{
"cell_type": "markdown",
"source": [
"- Large Protein Language Models (PLMs) are trained on large protein databases encompassing all known sequences or on sets of homologous sequences spanning many protein families.\n",
"- Hybrid PLMs are PLMs that leverage multiple sequence alignment (MSAs) such as [Tranception](https://arxiv.org/abs/2205.13760), and [MSA Transformer](https://www.biorxiv.org/content/10.1101/2021.02.12.430858v1.full).\n",
"- Below is the take on hybrid PLMs from [Pascal Notin](https://twitter.com/NotinPascal) author of [ProteinGym](https://proteingym.org/benchmarks) paper."
],
"metadata": {
"id": "xAFnB7MrXxgo"
}
},
{
"cell_type": "markdown",
"source": [
""
],
"metadata": {
"id": "Q1UYmhXGv-jJ"
}
},
{
"cell_type": "markdown",
"source": [
"From the hybrid PLMs class, we will be using [MSA Transformer](https://www.biorxiv.org/content/10.1101/2021.02.12.430858v1.full) to predict mutational landscape for encapsidation protein 22K."
],
"metadata": {
"id": "eIZKlYZuktkY"
}
},
{
"cell_type": "markdown",
"source": [
"### Obtain FASTA file of protein sequence"
],
"metadata": {
"id": "QgJXp5Q4q61k"
}
},
{
"cell_type": "code",
"source": [
"query_url = \"https://rest.uniprot.org/uniprotkb/A0A7L4WH77.fasta\""
],
"metadata": {
"id": "Z3FenhSreyYO"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"import requests\n",
"from Bio import SeqIO\n",
"from io import StringIO\n",
"\n",
"def read_fasta_from_url(url):\n",
" # Download the FASTA file content from the web link\n",
" response = requests.get(url)\n",
"\n",
" if response.status_code == 200:\n",
" # Parse the content using Biopython's SeqIO\n",
" fasta_content = StringIO(response.text)\n",
" record = next(SeqIO.parse(fasta_content, 'fasta'))\n",
" sequence = str(record.seq)\n",
"\n",
" return sequence\n",
" else:\n",
" print(f\"Failed to retrieve data. HTTP Status Code: {response.status_code}\")"
],
"metadata": {
"id": "WYqDhRoTlUZ8"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"protein_sequence = read_fasta_from_url(query_url)\n",
"print(f\"Protein Sequence: {protein_sequence}\")\n",
"print(f\"Protein Sequence Length: {len(protein_sequence)}\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "M6I_sLtNmH0O",
"outputId": "072d7c5d-ee4a-4f99-f8f0-eb211251a254"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Protein Sequence: MAPKKKLQLPPPPTDEEEYWDSQAEEVLDEEEEEDMMEDWESLDEEASEAEEVSDETPSPSVAFPSPAPQKSATGSSMATTSAPQASPALPVRRPNRRWDTTGTRAGKSKQPPPLAQEQQQRQGYRSWRGHKNAIVACLQDCGGNISFARRFLLYHHGVAFPRNILHYYRHLYSPYCTGGSSSNSSGHTEAKATG\n",
"Protein Sequence Length: 195\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"# Save FASTA file locally for future reference\n",
"# The -O option in wget allows you to specify the name of the output file.\n",
"!wget $query_url -O data/Encapsidation_protein_22K.fasta"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "BcW1Th1KAJfo",
"outputId": "e5a0a218-e4e9-4c08-c850-8cec7d08148e"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"--2023-12-18 21:53:32-- https://rest.uniprot.org/uniprotkb/A0A7L4WH77.fasta\n",
"Resolving rest.uniprot.org (rest.uniprot.org)... 193.62.193.81\n",
"Connecting to rest.uniprot.org (rest.uniprot.org)|193.62.193.81|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: unspecified [text/plain]\n",
"Saving to: ‘data/Encapsidation_protein_22K.fasta’\n",
"\n",
"data/Encapsidation_ [ <=> ] 309 --.-KB/s in 0s \n",
"\n",
"2023-12-18 21:53:32 (185 MB/s) - ‘data/Encapsidation_protein_22K.fasta’ saved [309]\n",
"\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"### Generate sequence alignment in a3m format"
],
"metadata": {
"id": "mHtnbwJHwoe0"
}
},
{
"cell_type": "markdown",
"source": [
"The sequence alignment for encapsidation protein 22K was generated using [MMSeq2](https://github.com/soedinglab/MMseqs2) from [ColabFold v1.5.3 server](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb#scrollTo=kOblAo-xetgx)."
],
"metadata": {
"id": "Bbj52wdrwvaA"
}
},
{
"cell_type": "code",
"source": [
"# Number of entries in the MSA file\n",
"!grep -c \"^>\" data/Encapsidation_protein_22K.a3m"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "QcAg0eaZ_BNQ",
"outputId": "d308d0d1-8394-4b9a-a8d4-01993859fc93"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"124\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"!head -n 9 data/Encapsidation_protein_22K.a3m"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "jwQ9MEJN_WSb",
"outputId": "b1f066d1-14f1-4cd7-f58f-a68a8b005fc3"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"#195\t1\n",
">101\n",
"MAPKKKLQLPPPPTDEEEYWDSQAEEVLDEEEEEDMMEDWESLDEEASEAEEVSDETPSPSVAFPSPAPQKSATGSSMATTSAPQASPALPVRRPNRRWDTTGTRAGKSKQPPPLAQEQQQRQGYRSWRGHKNAIVACLQDCGGNISFARRFLLYHHGVAFPRNILHYYRHLYSPYCTGGSSSNSSGHTEAKATG\n",
">UniRef100_A0A7L4WJM1\t270\t0.923\t1.137E-78\t0\t194\t195\t0\t193\t194\n",
"MAPKKKLQLP-PPPTDEEEYWDSQAEEVLDEEEEDMMEDWESLDEEASEAEEVSDETPSPSVAFPSPAPQKSATGSSMATTSAPQASPALPVRRPNRRWDTTGTRAGKSKQPPPLAQEQQQRQGYRSWRGHKNAIVACLQDCGGNISFARRFLLYHHGVAFPRNILHYYRHLYSPYCTGGSSSNSSGHTEAKATG\n",
">UniRef100_A0A3Q9HLG8\t259\t0.892\t7.754E-75\t0\t194\t195\t0\t193\t194\n",
"MAPKKKLQLP-PPPTDEEEYWDSQAEEVLDEEEEDMMEDWESLDEEASEAEEVSDGTPSPSVASPSPAPQKSATGPSMATTSAPQAPPALPVRRPNRRWDTTGTRAGKSKQPPPLAQEQQQRQGYRSWRGHKNAIVACLQDCGGNISFARRFLLYHHGVAFPRNILHYYRHLYSPYCTGGSGSNSSGHTEAKAPG\n",
">UniRef100_A0A3Q9HL92\t258\t0.888\t1.457E-74\t0\t194\t195\t0\t196\t197\n",
"MAPKKKLQLPPPPPTDEEEYWDSQAEEVLDEEEEDTMEDWDSLDEEASEAEEVSDETPSPSVAFPSPAPQKSATGPSMATTSAPQAPPALPVRRPNRRWDTTGTRAGKSKQPPPLAQEQQQRQGYRSWRGHKNAIVACLQDCGGNISFARRFLLYHHGVAFPRNILHYYRHLYSPYYTGgsGSGSNSSGHTEAKATG\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"### Generate mutant triplets to make predictions"
],
"metadata": {
"id": "y86uPNMoq-rH"
}
},
{
"cell_type": "code",
"source": [
"def generate_mutant_triplet(protein_sequence):\n",
"\n",
" '''\n",
" Input:\n",
" protein_sequence: the input protein sequence for which you want to generate mutant triplets.\n",
"\n",
" Output:\n",
" This script will generate all possible mutant triplets for the input protein sequence.\n",
" '''\n",
"\n",
" # Define the amino acid alphabet (one-letter codes)\n",
" amino_acids = \"ACDEFGHIKLMNPQRSTVWY\"\n",
"\n",
" # Create a list to store the mutant sequences\n",
" mutants = []\n",
"\n",
" # Iterate through each position in the protein sequence\n",
" for position in range(len(protein_sequence)):\n",
"\n",
" # Generate mutants for each amino acid in the alphabet\n",
" for aa in amino_acids:\n",
" # Create a mutant sequence by replacing the original amino acid\n",
" mutant = protein_sequence[position] + str(position+1) + aa\n",
" mutants.append(mutant)\n",
"\n",
" # Print the list of mutant sequences\n",
" return mutants"
],
"metadata": {
"id": "Aad73eTrpp6j"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"mutant = generate_mutant_triplet(protein_sequence)"
],
"metadata": {
"id": "kUQa_MKsrLS8"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"\n",
"Encapsidation_protein_22K_mutants_df = pd.DataFrame({\n",
" \"mutant\": mutant\n",
"})\n",
"\n",
"Encapsidation_protein_22K_mutants_df['wildtype'] = Encapsidation_protein_22K_mutants_df['mutant'].str[:-1]\n",
"Encapsidation_protein_22K_mutants_df['mutation'] = Encapsidation_protein_22K_mutants_df['mutant'].str[-1]\n",
"\n",
"Encapsidation_protein_22K_mutants_df.to_csv('data/Encapsidation_protein_22K_mutants_unlabeled.csv', index=False)"
],
"metadata": {
"id": "QNomPwTQrc11"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Predict fitness landscape"
],
"metadata": {
"id": "imqfZ4aQQRVW"
}
},
{
"cell_type": "code",
"source": [
"# We meed the MSA file in .a3m format and csv file with mutant triplet to run the script below\n",
"!ls data/"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "OBS6pnWLDIYI",
"outputId": "30a8c038-1cae-41eb-b03b-1f7f7ab4f183"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Encapsidation_protein_22K.a3m\t Encapsidation_protein_22K_mutants_unlabeled.csv\n",
"Encapsidation_protein_22K.fasta\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"# Script to generate mutational landscape\n",
"# predict_esm.py script from: https://github.com/facebookresearch/esm/blob/main/examples/variant-prediction/predict.py\n",
"\n",
"!python scripts/predict_esm.py \\\n",
" --model-location esm_msa1b_t12_100M_UR50S \\\n",
" --sequence $protein_sequence \\\n",
" --dms-input ./data/Encapsidation_protein_22K_mutants_unlabeled.csv \\\n",
" --mutation-col mutant \\\n",
" --dms-output ./data/Encapsidation_protein_22K_mutants_labeled.csv \\\n",
" --offset-idx 1 \\\n",
" --scoring-strategy masked-marginals \\\n",
" --msa-path ./data/Encapsidation_protein_22K.a3m"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "VpkE4fvzt_iE",
"outputId": "61b71e55-ed55-448c-d762-383b80f1d518"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Downloading: \"https://dl.fbaipublicfiles.com/fair-esm/models/esm_msa1b_t12_100M_UR50S.pt\" to /root/.cache/torch/hub/checkpoints/esm_msa1b_t12_100M_UR50S.pt\n",
"Downloading: \"https://dl.fbaipublicfiles.com/fair-esm/regression/esm_msa1b_t12_100M_UR50S-contact-regression.pt\" to /root/.cache/torch/hub/checkpoints/esm_msa1b_t12_100M_UR50S-contact-regression.pt\n",
"Transferred model to GPU\n",
"100% 196/196 [01:43<00:00, 1.89it/s]\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"### Visualize fitness landscape"
],
"metadata": {
"id": "5RvQcP8dwzVk"
}
},
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"\n",
"# Read the csv file containing mutation score from MSA Transformer model\n",
"fitness_prot22k_msa1b = pd.read_csv('data/Encapsidation_protein_22K_mutants_labeled.csv',\n",
" index_col=0)\n",
"fitness_prot22k_msa1b"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 424
},
"id": "8uxkE4caw1xk",
"outputId": "e4f88de3-5df2-4a67-c27c-0b744a52d41e"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" mutant wildtype mutation esm_msa1b_t12_100M_UR50S\n",
"0 M1A M1 A -16.325127\n",
"1 M1C M1 C -16.986792\n",
"2 M1D M1 D -18.815874\n",
"3 M1E M1 E -17.576103\n",
"4 M1F M1 F -13.433237\n",
"... ... ... ... ...\n",
"3895 G195S G195 S -9.930399\n",
"3896 G195T G195 T -13.784996\n",
"3897 G195V G195 V -12.468152\n",
"3898 G195W G195 W -14.053451\n",
"3899 G195Y G195 Y -15.699408\n",
"\n",
"[3900 rows x 4 columns]"
],
"text/html": [
"\n",
"
| \n", " | mutant | \n", "wildtype | \n", "mutation | \n", "esm_msa1b_t12_100M_UR50S | \n", "
|---|---|---|---|---|
| 0 | \n", "M1A | \n", "M1 | \n", "A | \n", "-16.325127 | \n", "
| 1 | \n", "M1C | \n", "M1 | \n", "C | \n", "-16.986792 | \n", "
| 2 | \n", "M1D | \n", "M1 | \n", "D | \n", "-18.815874 | \n", "
| 3 | \n", "M1E | \n", "M1 | \n", "E | \n", "-17.576103 | \n", "
| 4 | \n", "M1F | \n", "M1 | \n", "F | \n", "-13.433237 | \n", "
| ... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
| 3895 | \n", "G195S | \n", "G195 | \n", "S | \n", "-9.930399 | \n", "
| 3896 | \n", "G195T | \n", "G195 | \n", "T | \n", "-13.784996 | \n", "
| 3897 | \n", "G195V | \n", "G195 | \n", "V | \n", "-12.468152 | \n", "
| 3898 | \n", "G195W | \n", "G195 | \n", "W | \n", "-14.053451 | \n", "
| 3899 | \n", "G195Y | \n", "G195 | \n", "Y | \n", "-15.699408 | \n", "
3900 rows × 4 columns
\n", "| \n", " | Unnamed: 0 | \n", "mutant | \n", "wildtype | \n", "mutation | \n", "esm_msa1b_t12_100M_UR50S | \n", "esm2_t33_650M_UR50D | \n", "esm1v_t33_650M_UR90S_1 | \n", "esm1v_t33_650M_UR90S_2 | \n", "esm1v_t33_650M_UR90S_3 | \n", "esm1v_t33_650M_UR90S_4 | \n", "esm1v_t33_650M_UR90S_5 | \n", "esm1b_t33_650M_UR50S | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \n", "0 | \n", "M1A | \n", "M1 | \n", "A | \n", "-16.325127 | \n", "-8.098816 | \n", "-8.235819 | \n", "-5.609891 | \n", "-6.854110 | \n", "-6.774120 | \n", "-6.800508 | \n", "-6.751205 | \n", "
| 1 | \n", "1 | \n", "M1C | \n", "M1 | \n", "C | \n", "-16.986792 | \n", "-11.144790 | \n", "-10.752127 | \n", "-7.939667 | \n", "-9.681019 | \n", "-10.263803 | \n", "-10.269108 | \n", "-9.223083 | \n", "
| 2 | \n", "2 | \n", "M1D | \n", "M1 | \n", "D | \n", "-18.815874 | \n", "-10.257328 | \n", "-9.317606 | \n", "-6.818065 | \n", "-8.292774 | \n", "-8.394552 | \n", "-8.780309 | \n", "-8.850616 | \n", "
| 3 | \n", "3 | \n", "M1E | \n", "M1 | \n", "E | \n", "-17.576103 | \n", "-9.041828 | \n", "-9.080865 | \n", "-6.556577 | \n", "-7.643363 | \n", "-7.661278 | \n", "-7.244850 | \n", "-7.663014 | \n", "
| 4 | \n", "4 | \n", "M1F | \n", "M1 | \n", "F | \n", "-13.433237 | \n", "-10.600483 | \n", "-9.636066 | \n", "-7.294350 | \n", "-8.199809 | \n", "-8.164932 | \n", "-8.874701 | \n", "-9.020489 | \n", "
| ... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
| 3895 | \n", "3895 | \n", "G195S | \n", "G195 | \n", "S | \n", "-9.930399 | \n", "0.551918 | \n", "0.538798 | \n", "0.282230 | \n", "0.335325 | \n", "0.482387 | \n", "0.505919 | \n", "0.841090 | \n", "
| 3896 | \n", "3896 | \n", "G195T | \n", "G195 | \n", "T | \n", "-13.784996 | \n", "0.215631 | \n", "0.102809 | \n", "0.036828 | \n", "-0.015700 | \n", "0.123055 | \n", "0.254141 | \n", "0.318797 | \n", "
| 3897 | \n", "3897 | \n", "G195V | \n", "G195 | \n", "V | \n", "-12.468152 | \n", "-0.506383 | \n", "-0.631024 | \n", "-0.503328 | \n", "-0.551369 | \n", "-0.467469 | \n", "-0.509331 | \n", "-0.360502 | \n", "
| 3898 | \n", "3898 | \n", "G195W | \n", "G195 | \n", "W | \n", "-14.053451 | \n", "-3.002845 | \n", "-2.173256 | \n", "-1.882592 | \n", "-1.959077 | \n", "-2.038070 | \n", "-2.005057 | \n", "-2.339551 | \n", "
| 3899 | \n", "3899 | \n", "G195Y | \n", "G195 | \n", "Y | \n", "-15.699408 | \n", "-2.778253 | \n", "-2.155714 | \n", "-1.750376 | \n", "-1.710923 | \n", "-1.869018 | \n", "-1.828054 | \n", "-2.049861 | \n", "
3900 rows × 12 columns
\n", "