{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Introducing InstaNovo-P, a de novo sequencing model for phosphoproteomics\n", "\n", "## Announcing InstaNovo-P\n", "\n", "We are happy to share our newest model in the InstaNovo family, InstaNovo-P. [link to bioArxiv preprint]\n", "\n", "InstaNovo-P is a fine-tuned version of our base model, InstaNovo v1.0.0, specifically tailored towards application in phosphoproteomics mass spectrometry experiments. InstaNovo-P was further trained on approx. 2.8 million PSMs from 75 thousand phosphorylated peptides. InstaNovo-P was extended to recognize the residues phospho-tyrosine, -serine and -threonine, achieving high accuracy in detecting phosphorylated peptides while retaining its performance in unmodified peptides. \n" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "## Loading the InstaNovo-P model\n", "\n", "We first install the latest version of `instanovo` from [PyPi](https://pypi.org/project/instanovo/) to be able to load the InstaNovo-P checkpoint." ] }, { "cell_type": "code", "execution_count": null, "id": "2", "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "if 'google.colab' in sys.modules or 'KAGGLE_KERNEL_RUN_TYPE' in os.environ:\n", " # Suppress TensorFlow warnings\n", " os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'\n", " # Install torchvision when running on Google Colab to prevent errors\n", " !uv pip install --system \"instanovo[cu124]>=1.1.2\" pyopenms-viz torchvision tf-nightly\n", "else:\n", " !uv pip install \"instanovo[cu124]>=1.1.2\" pyopenms-viz" ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "We can use `instanovo version` to check the version of InstaNovo (the transformer-based model)." ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "!instanovo version" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "InstaNovo-P is a finetune of InstaNovo, so we import the transformer-based InstaNovo model." ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "from instanovo.transformer.model import InstaNovo" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "Set the device to GPU if available (recommended), otherwise use CPU." ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "import torch\n", "\n", "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", "device" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "InstaNovo supports automatic model downloads. You can see the IDs of the pretrained models that are available." ] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "InstaNovo.get_pretrained()" ] }, { "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "We are now ready to download the InstaNovo-P checkpoint. " ] }, { "cell_type": "code", "execution_count": null, "id": "12", "metadata": {}, "outputs": [], "source": [ "model, config = InstaNovo.from_pretrained('instanovo-phospho-v1.0.0')\n", "model = model.to(device).eval()" ] }, { "cell_type": "markdown", "id": "13", "metadata": {}, "source": [ "## Loading the InstaNovo-P dataset\n", "\n", "Download the [test fold of the InstaNovo-P dataset](https://huggingface.co/datasets/InstaDeepAI/InstaNovo-P) dataset from HuggingFace.\n", "\n", "Normally, we would use the InstaNovo SpectrumDataFrame class to download a dataset from Hugging Face directly like this:\n", "\n", "```python\n", "from instanovo.utils import SpectrumDataFrame\n", "\n", "sdf = SpectrumDataFrame.from_huggingface(\n", " \"InstaDeepAI/InstaNovo-P\",\n", " is_annotated=True,\n", " shuffle=False,\n", " split=\"test\",\n", ")\n", "```\n", "\n", "But this downloads the whole dataset and the train partition alone is more then 6 GB, so this would take a while. So we use `load_dataset` to only download the `test` partition. " ] }, { "cell_type": "code", "execution_count": null, "id": "14", "metadata": {}, "outputs": [], "source": [ "from datasets import load_dataset\n", "\n", "data_files = {\"test\": \"dataset-phospho-test-0000-0001.parquet\"}\n", "dataset = load_dataset(\"InstaDeepAI/InstaNovo-P\", data_files=data_files, split=\"test[:10%]\")\n", "dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "15", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.DataFrame(dataset)\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "16", "metadata": {}, "outputs": [], "source": [ "from instanovo.utils import SpectrumDataFrame\n", "sdf = SpectrumDataFrame.from_pandas(df)" ] }, { "cell_type": "code", "execution_count": null, "id": "17", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "pd.options.plotting.backend = \"ms_matplotlib\"\n", "row = sdf[0]\n", "row_df = pd.DataFrame({\"mz\": row[\"mz_array\"], \"intensity\": row[\"intensity_array\"]})\n", "row_df.plot(\n", " kind=\"spectrum\",\n", " x=\"mz\",\n", " y=\"intensity\",\n", " annotate_mz=True,\n", " bin_method=\"none\",\n", " annotate_top_n_peaks=5,\n", " aggregate_duplicates=True,\n", " title=f\"Mass spectrum of {row['sequence']}\",\n", ");\n" ] }, { "cell_type": "code", "execution_count": null, "id": "18", "metadata": {}, "outputs": [], "source": [ "from instanovo.transformer.dataset import SpectrumDataset, collate_batch\n", "\n", "ds = SpectrumDataset(\n", " sdf,\n", " model.residue_set,\n", " config.get(\"n_peaks\", 200),\n", " return_str=True,\n", " annotated=True,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "19", "metadata": {}, "outputs": [], "source": [ "from torch.utils.data import DataLoader\n", "\n", "# When using SpectrumDataFrame, workers and shuffle is handled internally.\n", "dl = DataLoader(ds, batch_size=64, shuffle=False, num_workers=0, collate_fn=collate_batch)" ] }, { "cell_type": "code", "execution_count": null, "id": "20", "metadata": {}, "outputs": [], "source": [ "batch = next(iter(dl))\n", "\n", "spectra, precursors, spectra_mask, peptides, _ = batch\n", "spectra = spectra.to(device)\n", "precursors = precursors.to(device)" ] }, { "cell_type": "markdown", "id": "21", "metadata": {}, "source": [ "## Decoding\n", "\n", "We have three options for decoding:\n", "- Greedy Search\n", "- Beam Search\n", "- Knapsack Beam Search\n", "\n", "For the best results and highest peptide recall, use **Knapsack Beam Search**. \n", "For fastest results (over 10x speedup), use **Greedy Search**.\n", "\n", "We generally use a beam size of 5 for Beam Search and Knapsack Beam Search, a higher beam size should increase recall at the cost of performance and vice versa.\n", "\n", "_Note: in our findings, greedy search has similar performance as knapsack beam search at 5% FDR. I.e. if you plan to filter at 5% FDR anyway, use greedy search for optimal performance._\n", "\n", "### Greedy Search and Beam Search\n", "\n", "Greedy search is used when `num_beams=1`, and beam search is used when `num_beams>1`" ] }, { "cell_type": "code", "execution_count": null, "id": "22", "metadata": {}, "outputs": [], "source": [ "from instanovo.inference import BeamSearchDecoder, GreedyDecoder\n", "\n", "num_beams = 1 # Change this, defaults are 1 or 5\n", "\n", "if num_beams > 1:\n", " decoder = BeamSearchDecoder(model=model)\n", "else:\n", " decoder = GreedyDecoder(model=model)" ] }, { "cell_type": "markdown", "id": "23", "metadata": {}, "source": [ "### Knapsack Beam Search\n", "\n", "Setup knapsack beam search decoder. This may take a few minutes." ] }, { "cell_type": "code", "execution_count": null, "id": "24", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "from instanovo.constants import MASS_SCALE\n", "from instanovo.inference.knapsack import Knapsack\n", "from instanovo.inference.knapsack_beam_search import KnapsackBeamSearchDecoder\n", "\n", "num_beams = 5\n", "\n", "def _setup_knapsack(model: InstaNovo) -> Knapsack:\n", " # Cannot allow negative masses in knapsack graph\n", " if \"(-17.03)\" in model.residue_set.residue_masses:\n", " model.residue_set.residue_masses[\"(-17.03)\"] = 1e3\n", " if \"[UNIMOD:385]\" in model.residue_set.residue_masses:\n", " model.residue_set.residue_masses[\"[UNIMOD:385]\"] = 1e3\n", "\n", " residue_masses = dict(model.residue_set.residue_masses.copy())\n", " if any(x < 0 for x in residue_masses.values()):\n", " raise NotImplementedError(\n", " \"Negative mass found in residues, this will break the knapsack graph. \"\n", " \"Either disable knapsack or use strictly positive masses\"\n", " )\n", " for special_residue in list(model.residue_set.residue_to_index.keys())[:3]:\n", " residue_masses[special_residue] = 0\n", " residue_indices = model.residue_set.residue_to_index\n", " return Knapsack.construct_knapsack(\n", " residue_masses=residue_masses,\n", " residue_indices=residue_indices,\n", " max_mass=4000.00,\n", " mass_scale=MASS_SCALE,\n", " )\n", "\n", "\n", "knapsack_path = Path(\"./checkpoints/knapsack/phospho\")\n", "\n", "if not knapsack_path.exists():\n", " print(\"Knapsack path missing or not specified, generating...\")\n", " knapsack = _setup_knapsack(model)\n", " decoder = KnapsackBeamSearchDecoder(model, knapsack)\n", " print(f\"Saving knapsack to {knapsack_path}\")\n", " knapsack_path.parent.mkdir(parents=True, exist_ok=True)\n", " knapsack.save(knapsack_path)\n", "else:\n", " print(\"Knapsack path found. Loading...\")\n", " decoder = KnapsackBeamSearchDecoder.from_file(model=model, path=knapsack_path)" ] }, { "cell_type": "markdown", "id": "25", "metadata": {}, "source": [ "## Inference time 🚀\n", "\n", "Evaluating a single batch..." ] }, { "cell_type": "code", "execution_count": null, "id": "26", "metadata": {}, "outputs": [], "source": [ "from instanovo.inference import ScoredSequence\n", "\n", "with torch.no_grad():\n", " p = decoder.decode(\n", " spectra=spectra,\n", " precursors=precursors,\n", " beam_size=num_beams,\n", " max_length=config[\"max_length\"],\n", " )\n", "\n", "preds = [x.sequence if isinstance(x, ScoredSequence) else [] for x in p]\n", "probs = [x.sequence_log_probability if isinstance(x, ScoredSequence) else -float(\"inf\") for x in p]" ] }, { "cell_type": "markdown", "id": "27", "metadata": {}, "source": [ "### Confidence probabilities\n", "The model returns per-residue confidences in the form of token log-probabilities. We can visualize these or use them as part of a workflow." ] }, { "cell_type": "code", "execution_count": null, "id": "28", "metadata": {}, "outputs": [], "source": [ "from typing import Optional\n", "\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import numpy as np\n", "\n", "from instanovo.inference.beam_search import ScoredSequence\n", "\n", "\n", "def plot_residue_confidence(prediction: ScoredSequence, peptide: Optional[str] = None) -> None:\n", " if not prediction:\n", " return\n", " ticks = list(range(len(prediction.sequence)))\n", " token_probabilities = np.exp(prediction.token_log_probabilities[:len(ticks)])\n", " sequence_confidence = np.exp(prediction.sequence_log_probability)\n", "\n", " _, ax = plt.subplots()\n", " bars = sns.barplot(x=ticks, y=token_probabilities, errorbar=None, ax=ax)\n", "\n", " # Increase Y-axis limit to create space for text labels\n", " ax.set_ylim(0, max(token_probabilities) * 1.2)\n", "\n", " # Add numbers above bars with a slanted angle\n", " for bar, prob in zip(bars.patches, token_probabilities):\n", " height = bar.get_height()\n", " ax.text(\n", " bar.get_x() + bar.get_width() / 2,\n", " float(height) + 0.02,\n", " f\"{float(prob):.4f}\",\n", " ha=\"center\",\n", " va=\"bottom\",\n", " fontsize=9,\n", " color=\"black\",\n", " rotation=45,\n", " )\n", "\n", " # Check if any residue contains a PTM (e.g., \"S(+79.97)\")\n", " has_ptm = any(\"[\" in res and \"]\" in res for res in prediction.sequence)\n", "\n", " # Set labels\n", " x_label = f\" Prediction: {''.join(prediction.sequence)}\"\n", " if peptide is not None:\n", " x_label += f\"\\nGround truth: {peptide}\"\n", " ax.set_xlabel(x_label)\n", " ax.set_ylabel(\"Confidence Probability\")\n", "\n", " # Set title with sequence confidence\n", " ax.set_title(\n", " f\"Residue Confidence per Position\\nSequence Probability: {sequence_confidence:.4f}\"\n", " )\n", "\n", " # Set X-ticks\n", " ax.set_xticks(ticks)\n", " ax.set_xticklabels(\n", " prediction.sequence,\n", " rotation=45 if has_ptm else 0,\n", " ha=\"right\" if has_ptm else \"center\",\n", " )\n", "\n", " plt.show()" ] }, { "cell_type": "markdown", "id": "29", "metadata": {}, "source": [ "For a spectrum that is sequenced correctly, the sequence probability and per-residue probabilities are uniformly high:" ] }, { "cell_type": "code", "execution_count": null, "id": "30", "metadata": {}, "outputs": [], "source": [ "plot_residue_confidence(p[1], peptides[1])" ] }, { "cell_type": "markdown", "id": "31", "metadata": {}, "source": [ "For another spectrum which is sequenced incorrectly, the sequence probability is low and the per-residue probabilities of the incorrectly sequenced residues (up to isomerism) are lower than of those correctly sequenced:" ] }, { "cell_type": "code", "execution_count": null, "id": "32", "metadata": {}, "outputs": [], "source": [ "plot_residue_confidence(p[0], peptides[0])" ] }, { "cell_type": "markdown", "id": "33", "metadata": {}, "source": [ "### Evaluation" ] }, { "cell_type": "code", "execution_count": null, "id": "34", "metadata": {}, "outputs": [], "source": [ "from instanovo.utils.metrics import Metrics\n", "\n", "metrics = Metrics(model.residue_set, config[\"isotope_error_range\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "35", "metadata": {}, "outputs": [], "source": [ "aa_precision, aa_recall, peptide_recall, peptide_precision = metrics.compute_precision_recall(\n", " peptides, preds\n", ")\n", "peptide_recall" ] }, { "cell_type": "markdown", "id": "36", "metadata": {}, "source": [ "Evaluating on the test fold of the InstaNovo-P dataset:" ] }, { "cell_type": "code", "execution_count": null, "id": "37", "metadata": {}, "outputs": [], "source": [ "from tqdm.notebook import tqdm\n", "\n", "preds = []\n", "targs = []\n", "probs = []\n", "\n", "for _, batch in tqdm(enumerate(dl), total=len(dl)):\n", " spectra, precursors, _, peptides, _ = batch\n", " spectra = spectra.to(device)\n", " precursors = precursors.to(device)\n", "\n", " with torch.no_grad():\n", " p = decoder.decode(\n", " spectra=spectra,\n", " precursors=precursors,\n", " beam_size=config[\"n_beams\"],\n", " max_length=config[\"max_length\"],\n", " )\n", "\n", " preds += [x.sequence if isinstance(x, ScoredSequence) else [] for x in p]\n", " probs += [\n", " x.sequence_log_probability if isinstance(x, ScoredSequence) else -float(\"inf\") for x in p\n", " ]\n", " targs += list(peptides)" ] }, { "cell_type": "markdown", "id": "38", "metadata": {}, "source": [ "### Evaluation metrics\n", "\n", "Model performance without filtering:" ] }, { "cell_type": "code", "execution_count": null, "id": "39", "metadata": {}, "outputs": [], "source": [ "aa_precision, aa_recall, peptide_recall, peptide_precision = metrics.compute_precision_recall(\n", " targs, preds\n", ")\n", "aa_error_rate = metrics.compute_aa_er(targs, preds)\n", "auc = metrics.calc_auc(targs, preds, np.exp(pd.Series(probs)))\n", "\n", "print(f\"amino acid error rate: {aa_error_rate:.5f}\")\n", "print(f\"amino acid precision: {aa_precision:.5f}\")\n", "print(f\"amino acid recall: {aa_recall:.5f}\")\n", "print(f\"peptide precision: {peptide_precision:.5f}\")\n", "print(f\"peptide recall: {peptide_recall:.5f}\")\n", "print(f\"area under the PR curve: {auc:.5f}\")" ] }, { "cell_type": "markdown", "id": "40", "metadata": {}, "source": [ "### We can find a threshold to ensure a desired FDR:\n", "\n", "Model performance at 5% FDR:" ] }, { "cell_type": "code", "execution_count": null, "id": "41", "metadata": {}, "outputs": [], "source": [ "fdr = 5 / 100 # Desired FDR\n", "\n", "_, threshold = metrics.find_recall_at_fdr(targs, preds, np.exp(probs), fdr=fdr)\n", "aa_precision_fdr, aa_recall_fdr, peptide_recall_fdr, peptide_precision_fdr = (\n", " metrics.compute_precision_recall(targs, preds, np.exp(probs), threshold=threshold)\n", ")\n", "print(f\"Performance at {fdr*100:.1f}% FDR:\\n\")\n", "print(f\"amino acid precision: {aa_precision_fdr:.5f}\")\n", "print(f\"amino acid recall: {aa_recall_fdr:.5f}\")\n", "print(f\"peptide precision: {peptide_precision_fdr:.5f}\")\n", "print(f\"peptide recall: {peptide_recall_fdr:.5f}\")\n", "print(f\"area under the PR curve: {auc:.5f}\")\n", "print(f\"confidence threshold: {threshold:.5f} <-- Use this as a confidence cutoff\")" ] }, { "cell_type": "markdown", "id": "42", "metadata": {}, "source": [ "_Note: to reproduce the results of the paper, the entire InstaNovo-P test set should be evaluated._" ] }, { "cell_type": "markdown", "id": "43", "metadata": {}, "source": [ "### Saving the predictions..." ] }, { "cell_type": "code", "execution_count": null, "id": "44", "metadata": {}, "outputs": [], "source": [ "pred_df = pd.DataFrame(\n", " {\n", " \"targets\": targs,\n", " \"tokenized_predictions\": preds,\n", " \"predictions\": [\"\".join(x) for x in preds],\n", " \"log_probabilities\": probs,\n", " }\n", ")\n", "pred_df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "45", "metadata": {}, "outputs": [], "source": [ "pred_df.to_csv(\"predictions_instanovo_phospho.csv\", index=False)" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 5 }