{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "# Introducing InstaNovo-P, a de novo sequencing model for phosphoproteomics\n",
    "\n",
    "## Announcing InstaNovo-P\n",
    "\n",
    "We are happy to share our newest model in the InstaNovo family, InstaNovo-P. [link to bioArxiv preprint]\n",
    "\n",
    "InstaNovo-P is a fine-tuned version of our base model, InstaNovo v1.0.0, specifically tailored towards application in phosphoproteomics mass spectrometry experiments. InstaNovo-P was further trained on approx. 2.8 million PSMs from 75 thousand phosphorylated peptides. InstaNovo-P was extended to recognize the residues phospho-tyrosine, -serine and -threonine, achieving high accuracy in detecting phosphorylated peptides while retaining its performance in unmodified peptides. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1",
   "metadata": {},
   "source": [
    "## Loading the InstaNovo-P model\n",
    "\n",
    "We first install the latest version of `instanovo` from [PyPi](https://pypi.org/project/instanovo/) to be able to load the InstaNovo-P checkpoint."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "if 'google.colab' in sys.modules or 'KAGGLE_KERNEL_RUN_TYPE' in os.environ:\n",
    "    # Suppress TensorFlow warnings\n",
    "    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'\n",
    "    # Install torchvision when running on Google Colab to prevent errors\n",
    "    !uv pip install --system \"instanovo[cu124]>=1.1.2\" pyopenms-viz torchvision tf-nightly\n",
    "else:\n",
    "    !uv pip install \"instanovo[cu124]>=1.1.2\" pyopenms-viz"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3",
   "metadata": {},
   "source": [
    "We can use `instanovo version` to check the version of InstaNovo (the transformer-based model)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4",
   "metadata": {},
   "outputs": [],
   "source": [
    "!instanovo version"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5",
   "metadata": {},
   "source": [
    "InstaNovo-P is a finetune of InstaNovo, so we import the transformer-based InstaNovo model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from instanovo.transformer.model import InstaNovo"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7",
   "metadata": {},
   "source": [
    "Set the device to GPU if available (recommended), otherwise use CPU."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
    "device"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9",
   "metadata": {},
   "source": [
    "InstaNovo supports automatic model downloads. You can see the IDs of the pretrained models that are available."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10",
   "metadata": {},
   "outputs": [],
   "source": [
    "InstaNovo.get_pretrained()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11",
   "metadata": {},
   "source": [
    "We are now ready to download the InstaNovo-P checkpoint. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12",
   "metadata": {},
   "outputs": [],
   "source": [
    "model, config = InstaNovo.from_pretrained('instanovo-phospho-v1.0.0')\n",
    "model = model.to(device).eval()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13",
   "metadata": {},
   "source": [
    "## Loading the InstaNovo-P dataset\n",
    "\n",
    "Download the [test fold of the InstaNovo-P dataset](https://huggingface.co/datasets/InstaDeepAI/InstaNovo-P) dataset from HuggingFace.\n",
    "\n",
    "Normally, we would use the InstaNovo SpectrumDataFrame class to download a dataset from Hugging Face directly like this:\n",
    "\n",
    "```python\n",
    "from instanovo.utils import SpectrumDataFrame\n",
    "\n",
    "sdf = SpectrumDataFrame.from_huggingface(\n",
    "    \"InstaDeepAI/InstaNovo-P\",\n",
    "    is_annotated=True,\n",
    "    shuffle=False,\n",
    "    split=\"test\",\n",
    ")\n",
    "```\n",
    "\n",
    "But this downloads the whole dataset and the train partition alone is more then 6 GB, so this would take a while.  So we use `load_dataset` to only download the `test` partition. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14",
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "data_files = {\"test\": \"dataset-phospho-test-0000-0001.parquet\"}\n",
    "dataset = load_dataset(\"InstaDeepAI/InstaNovo-P\", data_files=data_files, split=\"test[:10%]\")\n",
    "dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "df = pd.DataFrame(dataset)\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16",
   "metadata": {},
   "outputs": [],
   "source": [
    "from instanovo.utils import SpectrumDataFrame\n",
    "sdf = SpectrumDataFrame.from_pandas(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "pd.options.plotting.backend = \"ms_matplotlib\"\n",
    "row = sdf[0]\n",
    "row_df = pd.DataFrame({\"mz\": row[\"mz_array\"], \"intensity\": row[\"intensity_array\"]})\n",
    "row_df.plot(\n",
    "    kind=\"spectrum\",\n",
    "    x=\"mz\",\n",
    "    y=\"intensity\",\n",
    "    annotate_mz=True,\n",
    "    bin_method=\"none\",\n",
    "    annotate_top_n_peaks=5,\n",
    "    aggregate_duplicates=True,\n",
    "    title=f\"Mass spectrum of {row['sequence']}\",\n",
    ");\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18",
   "metadata": {},
   "outputs": [],
   "source": [
    "from instanovo.transformer.dataset import SpectrumDataset, collate_batch\n",
    "\n",
    "ds = SpectrumDataset(\n",
    "    sdf,\n",
    "    model.residue_set,\n",
    "    config.get(\"n_peaks\", 200),\n",
    "    return_str=True,\n",
    "    annotated=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "19",
   "metadata": {},
   "outputs": [],
   "source": [
    "from torch.utils.data import DataLoader\n",
    "\n",
    "# When using SpectrumDataFrame, workers and shuffle is handled internally.\n",
    "dl = DataLoader(ds, batch_size=64, shuffle=False, num_workers=0, collate_fn=collate_batch)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20",
   "metadata": {},
   "outputs": [],
   "source": [
    "batch = next(iter(dl))\n",
    "\n",
    "spectra, precursors, spectra_mask, peptides, _ = batch\n",
    "spectra = spectra.to(device)\n",
    "precursors = precursors.to(device)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21",
   "metadata": {},
   "source": [
    "## Decoding\n",
    "\n",
    "We have three options for decoding:\n",
    "- Greedy Search\n",
    "- Beam Search\n",
    "- Knapsack Beam Search\n",
    "\n",
    "For the best results and highest peptide recall, use **Knapsack Beam Search**. \n",
    "For fastest results (over 10x speedup), use **Greedy Search**.\n",
    "\n",
    "We generally use a beam size of 5 for Beam Search and Knapsack Beam Search, a higher beam size should increase recall at the cost of performance and vice versa.\n",
    "\n",
    "_Note: in our findings, greedy search has similar performance as knapsack beam search at 5% FDR. I.e. if you plan to filter at 5% FDR anyway, use greedy search for optimal performance._\n",
    "\n",
    "### Greedy Search and Beam Search\n",
    "\n",
    "Greedy search is used when `num_beams=1`, and beam search is used when `num_beams>1`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22",
   "metadata": {},
   "outputs": [],
   "source": [
    "from instanovo.inference import BeamSearchDecoder, GreedyDecoder\n",
    "\n",
    "num_beams = 1  # Change this, defaults are 1 or 5\n",
    "\n",
    "if num_beams > 1:\n",
    "    decoder = BeamSearchDecoder(model=model)\n",
    "else:\n",
    "    decoder = GreedyDecoder(model=model)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23",
   "metadata": {},
   "source": [
    "### Knapsack Beam Search\n",
    "\n",
    "Setup knapsack beam search decoder. This may take a few minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "from instanovo.constants import MASS_SCALE\n",
    "from instanovo.inference.knapsack import Knapsack\n",
    "from instanovo.inference.knapsack_beam_search import KnapsackBeamSearchDecoder\n",
    "\n",
    "num_beams = 5\n",
    "\n",
    "def _setup_knapsack(model: InstaNovo) -> Knapsack:\n",
    "    # Cannot allow negative masses in knapsack graph\n",
    "    if \"(-17.03)\" in model.residue_set.residue_masses:\n",
    "        model.residue_set.residue_masses[\"(-17.03)\"] = 1e3\n",
    "    if \"[UNIMOD:385]\" in model.residue_set.residue_masses:\n",
    "        model.residue_set.residue_masses[\"[UNIMOD:385]\"] = 1e3\n",
    "\n",
    "    residue_masses = dict(model.residue_set.residue_masses.copy())\n",
    "    if any(x < 0 for x in residue_masses.values()):\n",
    "        raise NotImplementedError(\n",
    "            \"Negative mass found in residues, this will break the knapsack graph. \"\n",
    "            \"Either disable knapsack or use strictly positive masses\"\n",
    "        )\n",
    "    for special_residue in list(model.residue_set.residue_to_index.keys())[:3]:\n",
    "        residue_masses[special_residue] = 0\n",
    "    residue_indices = model.residue_set.residue_to_index\n",
    "    return Knapsack.construct_knapsack(\n",
    "        residue_masses=residue_masses,\n",
    "        residue_indices=residue_indices,\n",
    "        max_mass=4000.00,\n",
    "        mass_scale=MASS_SCALE,\n",
    "    )\n",
    "\n",
    "\n",
    "knapsack_path = Path(\"./checkpoints/knapsack/phospho\")\n",
    "\n",
    "if not knapsack_path.exists():\n",
    "    print(\"Knapsack path missing or not specified, generating...\")\n",
    "    knapsack = _setup_knapsack(model)\n",
    "    decoder = KnapsackBeamSearchDecoder(model, knapsack)\n",
    "    print(f\"Saving knapsack to {knapsack_path}\")\n",
    "    knapsack_path.parent.mkdir(parents=True, exist_ok=True)\n",
    "    knapsack.save(knapsack_path)\n",
    "else:\n",
    "    print(\"Knapsack path found. Loading...\")\n",
    "    decoder = KnapsackBeamSearchDecoder.from_file(model=model, path=knapsack_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25",
   "metadata": {},
   "source": [
    "## Inference time 🚀\n",
    "\n",
    "Evaluating a single batch..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26",
   "metadata": {},
   "outputs": [],
   "source": [
    "from instanovo.inference import ScoredSequence\n",
    "\n",
    "with torch.no_grad():\n",
    "    p = decoder.decode(\n",
    "        spectra=spectra,\n",
    "        precursors=precursors,\n",
    "        beam_size=num_beams,\n",
    "        max_length=config[\"max_length\"],\n",
    "    )\n",
    "\n",
    "preds = [x.sequence if isinstance(x, ScoredSequence) else [] for x in p]\n",
    "probs = [x.sequence_log_probability if isinstance(x, ScoredSequence) else -float(\"inf\") for x in p]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27",
   "metadata": {},
   "source": [
    "### Confidence probabilities\n",
    "The model returns per-residue confidences in the form of token log-probabilities. We can visualize these or use them as part of a workflow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "28",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Optional\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import numpy as np\n",
    "\n",
    "from instanovo.inference.beam_search import ScoredSequence\n",
    "\n",
    "\n",
    "def plot_residue_confidence(prediction: ScoredSequence, peptide: Optional[str] = None) -> None:\n",
    "    if not prediction:\n",
    "        return\n",
    "    ticks = list(range(len(prediction.sequence)))\n",
    "    token_probabilities = np.exp(prediction.token_log_probabilities[:len(ticks)])\n",
    "    sequence_confidence = np.exp(prediction.sequence_log_probability)\n",
    "\n",
    "    _, ax = plt.subplots()\n",
    "    bars = sns.barplot(x=ticks, y=token_probabilities, errorbar=None, ax=ax)\n",
    "\n",
    "    # Increase Y-axis limit to create space for text labels\n",
    "    ax.set_ylim(0, max(token_probabilities) * 1.2)\n",
    "\n",
    "    # Add numbers above bars with a slanted angle\n",
    "    for bar, prob in zip(bars.patches, token_probabilities):\n",
    "        height = bar.get_height()\n",
    "        ax.text(\n",
    "            bar.get_x() + bar.get_width() / 2,\n",
    "            float(height) + 0.02,\n",
    "            f\"{float(prob):.4f}\",\n",
    "            ha=\"center\",\n",
    "            va=\"bottom\",\n",
    "            fontsize=9,\n",
    "            color=\"black\",\n",
    "            rotation=45,\n",
    "        )\n",
    "\n",
    "    # Check if any residue contains a PTM (e.g., \"S(+79.97)\")\n",
    "    has_ptm = any(\"[\" in res and \"]\" in res for res in prediction.sequence)\n",
    "\n",
    "    # Set labels\n",
    "    x_label = f\"    Prediction: {''.join(prediction.sequence)}\"\n",
    "    if peptide is not None:\n",
    "        x_label += f\"\\nGround truth: {peptide}\"\n",
    "    ax.set_xlabel(x_label)\n",
    "    ax.set_ylabel(\"Confidence Probability\")\n",
    "\n",
    "    # Set title with sequence confidence\n",
    "    ax.set_title(\n",
    "        f\"Residue Confidence per Position\\nSequence Probability: {sequence_confidence:.4f}\"\n",
    "    )\n",
    "\n",
    "    # Set X-ticks\n",
    "    ax.set_xticks(ticks)\n",
    "    ax.set_xticklabels(\n",
    "        prediction.sequence,\n",
    "        rotation=45 if has_ptm else 0,\n",
    "        ha=\"right\" if has_ptm else \"center\",\n",
    "    )\n",
    "\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29",
   "metadata": {},
   "source": [
    "For a spectrum that is sequenced correctly, the sequence probability and per-residue probabilities are uniformly high:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30",
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_residue_confidence(p[1], peptides[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31",
   "metadata": {},
   "source": [
    "For another spectrum which is sequenced incorrectly, the sequence probability is low and the per-residue probabilities of the incorrectly sequenced residues (up to isomerism) are lower than of those correctly sequenced:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "32",
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_residue_confidence(p[0], peptides[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "33",
   "metadata": {},
   "source": [
    "### Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34",
   "metadata": {},
   "outputs": [],
   "source": [
    "from instanovo.utils.metrics import Metrics\n",
    "\n",
    "metrics = Metrics(model.residue_set, config[\"isotope_error_range\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35",
   "metadata": {},
   "outputs": [],
   "source": [
    "aa_precision, aa_recall, peptide_recall, peptide_precision = metrics.compute_precision_recall(\n",
    "    peptides, preds\n",
    ")\n",
    "peptide_recall"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36",
   "metadata": {},
   "source": [
    "Evaluating on the test fold of the InstaNovo-P dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "37",
   "metadata": {},
   "outputs": [],
   "source": [
    "from tqdm.notebook import tqdm\n",
    "\n",
    "preds = []\n",
    "targs = []\n",
    "probs = []\n",
    "\n",
    "for _, batch in tqdm(enumerate(dl), total=len(dl)):\n",
    "    spectra, precursors, _, peptides, _ = batch\n",
    "    spectra = spectra.to(device)\n",
    "    precursors = precursors.to(device)\n",
    "\n",
    "    with torch.no_grad():\n",
    "        p = decoder.decode(\n",
    "            spectra=spectra,\n",
    "            precursors=precursors,\n",
    "            beam_size=config[\"n_beams\"],\n",
    "            max_length=config[\"max_length\"],\n",
    "        )\n",
    "\n",
    "    preds += [x.sequence if isinstance(x, ScoredSequence) else [] for x in p]\n",
    "    probs += [\n",
    "        x.sequence_log_probability if isinstance(x, ScoredSequence) else -float(\"inf\") for x in p\n",
    "    ]\n",
    "    targs += list(peptides)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "38",
   "metadata": {},
   "source": [
    "### Evaluation metrics\n",
    "\n",
    "Model performance without filtering:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39",
   "metadata": {},
   "outputs": [],
   "source": [
    "aa_precision, aa_recall, peptide_recall, peptide_precision = metrics.compute_precision_recall(\n",
    "    targs, preds\n",
    ")\n",
    "aa_error_rate = metrics.compute_aa_er(targs, preds)\n",
    "auc = metrics.calc_auc(targs, preds, np.exp(pd.Series(probs)))\n",
    "\n",
    "print(f\"amino acid error rate:    {aa_error_rate:.5f}\")\n",
    "print(f\"amino acid precision:     {aa_precision:.5f}\")\n",
    "print(f\"amino acid recall:        {aa_recall:.5f}\")\n",
    "print(f\"peptide precision:        {peptide_precision:.5f}\")\n",
    "print(f\"peptide recall:           {peptide_recall:.5f}\")\n",
    "print(f\"area under the PR curve:  {auc:.5f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40",
   "metadata": {},
   "source": [
    "### We can find a threshold to ensure a desired FDR:\n",
    "\n",
    "Model performance at 5% FDR:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "41",
   "metadata": {},
   "outputs": [],
   "source": [
    "fdr = 5 / 100  # Desired FDR\n",
    "\n",
    "_, threshold = metrics.find_recall_at_fdr(targs, preds, np.exp(probs), fdr=fdr)\n",
    "aa_precision_fdr, aa_recall_fdr, peptide_recall_fdr, peptide_precision_fdr = (\n",
    "    metrics.compute_precision_recall(targs, preds, np.exp(probs), threshold=threshold)\n",
    ")\n",
    "print(f\"Performance at {fdr*100:.1f}% FDR:\\n\")\n",
    "print(f\"amino acid precision:     {aa_precision_fdr:.5f}\")\n",
    "print(f\"amino acid recall:        {aa_recall_fdr:.5f}\")\n",
    "print(f\"peptide precision:        {peptide_precision_fdr:.5f}\")\n",
    "print(f\"peptide recall:           {peptide_recall_fdr:.5f}\")\n",
    "print(f\"area under the PR curve:  {auc:.5f}\")\n",
    "print(f\"confidence threshold:     {threshold:.5f}  <-- Use this as a confidence cutoff\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42",
   "metadata": {},
   "source": [
    "_Note: to reproduce the results of the paper, the entire InstaNovo-P test set should be evaluated._"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43",
   "metadata": {},
   "source": [
    "### Saving the predictions..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "44",
   "metadata": {},
   "outputs": [],
   "source": [
    "pred_df = pd.DataFrame(\n",
    "    {\n",
    "        \"targets\": targs,\n",
    "        \"tokenized_predictions\": preds,\n",
    "        \"predictions\": [\"\".join(x) for x in preds],\n",
    "        \"log_probabilities\": probs,\n",
    "    }\n",
    ")\n",
    "pred_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45",
   "metadata": {},
   "outputs": [],
   "source": [
    "pred_df.to_csv(\"predictions_instanovo_phospho.csv\", index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}