{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This code performs four main tasks:\n", "\n", "**[First part]**\n", "- annotate the MitImpact's variants with additional information provided by supplementary files available in the `data/APOGEE2_2022` folder. \n", "- select the features of interest that will later be used in the learning process and store them in a file named `mitimpact_features.csv` stored in a newly created `extracted_data` folder.\n", "\n", "**[Seconda part]**\n", "- perform a nested cross-validation. You will be given the chance to choose the machine learning classifier among a few alternatives, the number of splits and partitions for the test cross-validation, the number of splits for the grid-search cross-validation.\n", "- calculate a few performance measures.\n", "\n", "Note that several hours are required to execute the entire notebook and that some of the hyperparameters have been dropped from the grid-search procedure in order to speed up the computation. The notebook will create a `test/` folder to store models and predictions (if `test/` is not empty its content will be overwritten). After completing the cross-validation procedure, the notebook will calculate the mean auROC and its confidence intervals." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### File description\n", "`data/`\\\n", " `MitImpact_db_3.0.6.1.txt`: MitImpact file; it contains all possible nonsynonymous mitochondrial SNVs\\\n", " `mtmam.csv`: MtMam substitution matrix for mitochondrial amino acid changes\\\n", " `ddg_dict.pk`: pickle file encoding a Python dictionary for the precomputed ΔΔG values (ΔG_mutant - ΔG_wildtype) resulting from the amino acid changes\\\n", " `AA3Dposition_aligned.csv`: 3D coordinates of the amino acids in the mitochondrial proteins\\\n", " `dataset_21-04-21.csv`: labels in the reference dataset (`P` and `N` indicate respectively that a variant is pathogenic or neutral); only SNVs are reported in the file\n", "\n", "`extracted_data/`\n", " `mitimpact_features.csv`: features extracted from MitImpact and used to annotateall SNVs\n", "\n", "*n.b., `extracted_data/` is created in the section '**Create of a list containing the names of features to extract from the `mitimpact` dataframe**' below*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import pickle as pk\n", "import os, warnings\n", "\n", "from matplotlib import pyplot as plt\n", "from scipy import stats\n", "\n", "from sklearn.experimental import enable_iterative_imputer\n", "from sklearn.base import clone\n", "\n", "from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold, GridSearchCV, ParameterGrid\n", "from sklearn.metrics import roc_auc_score, roc_curve\n", "from sklearn.metrics import precision_recall_curve\n", "from sklearn.metrics import average_precision_score\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.impute import IterativeImputer\n", "from sklearn.feature_selection import SelectFromModel\n", "\n", "from imblearn.pipeline import Pipeline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if os.path.exists(\"./data/APOGEE2_2022/\"):\n", " PATH= \".\"\n", "else: # not os.path.exists(\"./playgrounds/data/APOGEE2_2022/\"):\n", " !git clone https://github.com/mazzalab/playgrounds.git\n", " PATH=\"./playgrounds\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# First part" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Retrieve and prepare the MitImpact database flat file" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mitimpact = pd.read_csv(PATH+\"/data/APOGEE2_2022/MitImpact_db_3.0.6.1.txt\", sep=\"\\t\", low_memory=False, index_col=\"MitImpact_id\")\n", "\n", "# Rename the column Start to Pos\n", "mitimpact.rename(columns={\"Start\": \"Pos\"}, inplace=True)\n", "\n", "# Replace \".\" with NaN values\n", "mitimpact.replace(\".\", np.nan, inplace=True)\n", "\n", "mitimpact.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Read the mitochondrial substitution matrix (mtMam)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mtmam = pd.read_csv(PATH+\"/data/APOGEE2_2022/mtmam.csv\", sep=\"\\t\", index_col=0)\n", "mtmam.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Annotate the mitimpact variants with mtMam scores\n", "mitimpact[\"mtmam\"] = mtmam.stack().loc[mitimpact.set_index([\"AA_ref\", \"AA_alt\"]).index].to_numpy()\n", "mitimpact[[\"mtmam\"]].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Load, format, and integrate the precomputed energetic variation values, calculated as $\\Delta \\Delta G = \\Delta G_{mutant} - \\Delta G_{wt}$, into MitImpact " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ddg = pd.DataFrame.from_dict(pk.load(open(PATH+\"/data/APOGEE2_2022/ddg_dict.pk\", \"rb\")), orient=\"index\").stack().to_frame()\n", "\n", "# Parse the dataframe\n", "ddg = pd.DataFrame(ddg[0].values.tolist(), index=ddg.index).stack().to_frame()\n", "\n", "# Rename the column of the scores as \"ddg\"\n", "ddg.columns = [\"ddg\"]\n", "\n", "# Reset index\n", "ddg.reset_index(inplace=True)\n", "\n", "# Parse the information about the position of AA in the protein\n", "ddg[\"AA_pos\"] = ddg.level_1.str.split(r\"[A-Z]\", expand=True)[1].astype(float)\n", "\n", "# Parse the information about the reference AA\n", "ddg[\"level_1\"] = ddg.level_1.str.split(\"\", expand=True)[1]\n", "\n", "# Rename the columns\n", "ddg.rename(columns={\"level_0\":\"Gene_symbol\",\"level_2\":\"AA_alt\",\"level_1\":\"AA_ref\"}, inplace=True)\n", "\n", "# Annotate variants in MitImpact with the ddg values (from the `ddg` dataframe)\n", "mitimpact[\"ddg\"] = pd.merge(\n", " mitimpact[[\"Gene_symbol\",\"AA_pos\",\"AA_ref\",\"AA_alt\"]],\n", " ddg,\n", " on=[\"Gene_symbol\",\"AA_pos\",\"AA_ref\",\"AA_alt\"],\n", " how=\"left\",\n", ").ddg.to_numpy()\n", "\n", "mitimpact[[\"ddg\"]].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Read the 3D coordinates of the amino acids" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "coords3D = pd.read_csv(PATH+\"/data/APOGEE2_2022/AA3Dposition_aligned.csv\", sep=\"\\t\", header=None)\n", "\n", "# Rename the columns appropriately\n", "coords3D.columns=[\"Gene_symbol\", \"AA_pos\", \"X\", \"Y\", \"Z\"]\n", "\n", "# Annotate the variants in MitImpact with X, Y, and Z values corresponding to the 3D coordinates of the amino acids.\n", "mitimpact[[\"X\", \"Y\", \"Z\"]] = pd.merge(\n", " mitimpact[[\"Gene_symbol\",\"AA_pos\"]],\n", " coords3D,\n", " on=[\"Gene_symbol\",\"AA_pos\"],\n", " how=\"left\",\n", ")[[\"X\", \"Y\", \"Z\"]].to_numpy()\n", "\n", "mitimpact[[\"X\", \"Y\", \"Z\"]].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create of a list containing the names of features to extract from the `mitimpact` dataframe" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features = [\n", " \"PhyloP_100V\", \"PhastCons_100V\", \"PolyPhen2_score\", \"SIFT_score\", \"FatHmmW_score\",\n", " \"FatHmm_score\", \"PROVEAN_score\", \"MutationAssessor_score\", \"EFIN_SP_score\", \"EFIN_HD_score\",\n", " \"CADD_phred_score\", \"VEST_pvalue\", \"PANTHER_score\", \"PhD-SNP_score\", \"SNAP_score\", \"MutationTaster_score\",\n", " \"mtmam\", \"ddg\", \"Pos\", \"X\", \"Y\", \"Z\"\n", "]\n", "\n", "# Cast the variables into float values.\n", "mitimpact[features] = mitimpact[features].astype(float)\n", "\n", "# Save the dataframe obtained.\n", "try:\n", " os.mkdir(PATH+\"/data/APOGEE2_2022/extracted_data/\")\n", "except FileExistsError:\n", " print(\"The folder exists. Data will be overwritten\")\n", " \n", "mitimpact[features].to_csv(PATH+\"/data/APOGEE2_2022/extracted_data/mitimpact_features.csv\", sep=\"\\t\")\n", "mitimpact[features].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Second part" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Handy functions and files check" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import shutil\n", "\n", "def _empty_or_create(folder_path):\n", " if(os.path.exists(folder_path)):\n", " warnings.warn(folder_path+\" is not empty. It will be emptied\")\n", " shutil.rmtree(folder_path)\n", " \n", " os.mkdir(folder_path)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Existence file check\n", "if not 'dataset_21-04-21.csv' in os.listdir(PATH+\"/data/APOGEE2_2022/\"):\n", " raise FileNotFoundError(f\"Cannot find 'dataset_21-04-21.csv' in folder {PATH}/data/APOGEE2_2022/\")\n", "elif not 'mitimpact_features.csv' in os.listdir(PATH+\"/data/APOGEE2_2022/extracted_data\"):\n", " raise FileNotFoundError(f\"Cannot find 'mitimpact_features.csv' in folder {PATH}/data/APOGEE2_2022/. Please ensure you have executed the first part of the code above\")\n", "else:\n", " print(\"Required files present\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read the classes of training-set variants:\n", "- `0` means Benign variants;\n", "- `1` means Pathogenic variants.\n", "\n", "The index values are the `MitImpact_ID` of the variants." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Y = pd.read_csv(PATH+\"/data/APOGEE2_2022/dataset_21-04-21.csv\", sep=\"\\t\", index_col=\"MitImpact_ID\", comment=\"#\").Class\n", "Y.replace([\"N\", \"P\"], [0, 1], inplace=True)\n", "\n", "# Target Vector\n", "Y.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Annotate the training-set variants (`Y` vector) with the features extracted from the `mitimpact_features.csv` file (previously generated in the **First part**)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = pd.read_csv(PATH+\"/data/APOGEE2_2022/extracted_data/mitimpact_features.csv\", sep=\"\\t\", index_col=\"MitImpact_id\").loc[Y.index]\n", "\n", "# Feature Matrix\n", "X.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ML Settings" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#@title ##Settings\n", "\n", "#@markdown