{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Performing differential expression analysis on PTM data\n",
    "\n",
    "Here, we will take phospho data processed with Spectronaut and \n",
    " * perform phosphosite inference and perform differential expression analysis on the phospho sites\n",
    " * take proteome data and normalize the differential expression results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining input files\n",
    "As with the standard differential expression analysis, we need:\n",
    "\n",
    "* an input file from a proteomics search engine. We currently only support Spectronaut, DIA-NN will come soon\n",
    "* a sample mapping file that maps each sample to a condition (e.g.  sample 'brain_replicate_1' is mapped to condition 'brain'). In the GUI, there is some functionality to help create such a file\n",
    "* (optional) a results directory can be defined on where to save the data\n",
    "* (optional) a list where we specify, which conditions we compare\n",
    "\n",
    "Simple specifications on how to export the Spectronaut file for PTM analysis can be found in the [README](https://github.com/MannLabs/alphaquant/blob/master/README.md#preparing-input-files)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PHOSPHO_FILE = \"./data/phospho/phospho_subset.tsv\"\n",
    "SAMPLEMAP_PHOSPHO = \"./data/phospho/samplemap_phospho.tsv\"\n",
    "RESULTS_DIR_PHOSPHO = \"./data/phospho/results_phospho\"\n",
    "\n",
    "PROTEOME_FILE = \"./data/phospho/proteome_subset.tsv\"\n",
    "SAMPLEMAP_PROTEOME = \"./data/phospho/samplemap_proteome.tsv\"\n",
    "RESULTS_DIR_PROTEOME = \"./data/phospho/results_proteome\"\n",
    "\n",
    "CONDPAIRS_LIST = [(\"egf_treated\", \"untreated\")] #this means each fc is egf_treated - untreated\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's quickly check what the phospho tables look like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "phospho_df = pd.read_csv(PHOSPHO_FILE, sep=\"\\t\")\n",
    "samplemap_phospho_df = pd.read_csv(SAMPLEMAP_PHOSPHO, sep=\"\\t\")\n",
    "display(phospho_df.head())\n",
    "#check the ptm columns\n",
    "display([x for x in phospho_df.columns if \"EG.PTM\" in x])\n",
    "#show the samplemap\n",
    "display(samplemap_phospho_df.head())\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "importantly, here are site probability columns for different types of variable modifications including phospho"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running AlphaQuant on phospho"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calling AlphaQuant on ptm data, we additionally have to specify the modification we are interested in. In our case it is `[Phospho (STY)]` as listed in the headers above. And we have to set `perform_ptm_mapping=True` to perform the phosphosite inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import alphaquant.run_pipeline as aq_pipeline\n",
    "\n",
    "aq_pipeline.run_pipeline(input_file=PHOSPHO_FILE, samplemap_file=SAMPLEMAP_PHOSPHO, results_dir=RESULTS_DIR_PHOSPHO,\n",
    "                        condpairs_list=CONDPAIRS_LIST, perform_ptm_mapping=True,modification_type=\"[Phospho (STY)]\",organism=\"human\", valid_values_filter_mode=\"both\") #counting statistics together with PTM mapping is currently an experimental feature, so we set valid_values_filter_mode to \"both\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Inspecting and visualizing phospho results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's check out the results table located in the results directory:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Volcano plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import alphaquant.plotting.pairwise as aq_plotting_pairwise\n",
    "\n",
    "results_file_phospho = RESULTS_DIR_PHOSPHO + \"/egf_treated_VS_untreated.results.tsv\"\n",
    "df_phospho = pd.read_csv(results_file_phospho, sep=\"\\t\")\n",
    "display(df_phospho.head())\n",
    "\n",
    "aq_plotting_pairwise.volcano_plot(df_phospho)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Normalization check\n",
    "\n",
    "Here we see the distribution of each sample against the median across all samples. These distributions should be centered around 0. Samples from the same condition should have similar distributions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "normalized_df = pd.read_csv(RESULTS_DIR_PHOSPHO + \"/egf_treated_VS_untreated.normed.tsv\", sep='\\t')\n",
    "samplemap_df = pd.read_csv(SAMPLEMAP_PHOSPHO, sep='\\t')\n",
    "aq_plotting_pairwise.plot_normalization_overview(normalized_df, samplemap_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Run AlphaQuant on proteome data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import alphaquant.run_pipeline as aq_pipeline\n",
    "\n",
    "aq_pipeline.run_pipeline(input_file=PROTEOME_FILE, samplemap_file=SAMPLEMAP_PROTEOME,\n",
    "                         results_dir=RESULTS_DIR_PROTEOME, condpairs_list=CONDPAIRS_LIST)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Volcano plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "results_file_proteome = RESULTS_DIR_PROTEOME + \"/egf_treated_VS_untreated.results.tsv\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import alphaquant.plotting.pairwise as aq_plotting_pairwise\n",
    "\n",
    "df_proteome = pd.read_csv(results_file_proteome, sep=\"\\t\")\n",
    "aq_plotting_pairwise.volcano_plot(df_proteome)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is very little regulation on the protein level."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Normalization plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "normalized_df = pd.read_csv(RESULTS_DIR_PHOSPHO + \"/egf_treated_VS_untreated.normed.tsv\", sep='\\t')\n",
    "samplemap_df = pd.read_csv(SAMPLEMAP_PHOSPHO, sep='\\t')\n",
    "aq_plotting_pairwise.plot_normalization_overview(normalized_df, samplemap_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Combining phospho and proteome data by proteome-normalization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following command writes out the proteome normed files into a new results directory with the ending \"_protnormed\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import alphaquant.ptm.protein_ptm_normalization as aq_ptm_normalization\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "aq_ptm_normalization.PTMResultsNormalizer(results_dir_ptm=RESULTS_DIR_PHOSPHO,\n",
    "                                          results_dir_proteome=RESULTS_DIR_PROTEOME, organism=\"human\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Volcano plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import alphaquant.plotting.pairwise as aq_plotting_pairwise\n",
    "results_file_protnormed = f\"{RESULTS_DIR_PHOSPHO}_protnormed/egf_treated_VS_untreated.results.tsv\"\n",
    "df_protnormed = pd.read_csv(results_file_protnormed, sep=\"\\t\")\n",
    "aq_plotting_pairwise.volcano_plot(df_protnormed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As expected, there is little qualitative change to this plot, because the protein regulation is low. Let's investigate this a bit futher:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "merged_df = df_phospho.merge(df_protnormed, on=\"protein\", how=\"inner\", suffixes=(\"_phospho\", \"_proteome\"))\n",
    "plt.scatter(merged_df[\"log2fc_phospho\"], merged_df[\"log2fc_proteome\"])\n",
    "plt.xlabel(\"log2fc_phospho\")\n",
    "plt.ylabel(\"log2fc_proteome\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Indeed we see few changes per phospo site, but some difference due to some protein fold changes, as expected."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "test",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}