{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ProteoFAV - Protein Features, Annotations and Variants\n",
    "\n",
    "Open-source framework for simple and fast integration of protein structure data with sequence annotations and genetic variation.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installing and configuration\n",
    "View instructions provided in the main README.md available at https://github.com/bartongroup/ProteoFAV"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import proteofav"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Configuration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ProteoFAV implements two approaches to handle datasets. One can fetch a few files on the fly using functions conveniently provided. For large scale studies, however, is preferable to use a local source for the multiple data used, such as the mmCIF files for three-dimensional protein structures."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Setting Logging Level\n",
    "from proteofav.config import logging\n",
    "logger = logging.getLogger()\n",
    "assert len(logger.handlers) == 1\n",
    "handler = logger.handlers[0]\n",
    "handler.setLevel(logging.WARNING)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Downloading a protein structure in mmCIF and PDB format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from proteofav.structures import mmCIF, PDB\n",
    "\n",
    "pdb_id = \"2pah\"\n",
    "\n",
    "# create tmp dir\n",
    "out_dir = os.path.join(os.getcwd(), \"tmp\")\n",
    "os.makedirs(out_dir, exist_ok=True)\n",
    "\n",
    "# output file names\n",
    "out_mmcif = os.path.join(out_dir, \"{}.cif\".format(pdb_id))\n",
    "out_mmcif_bio = os.path.join(out_dir, \"{}_bio.cif\".format(pdb_id))\n",
    "out_pdb = os.path.join(out_dir, \"{}.pdb\".format(pdb_id))\n",
    "\n",
    "# download structures\n",
    "mmCIF.download(identifier=pdb_id, filename=out_mmcif)\n",
    "mmCIF.download(identifier=pdb_id, filename=out_mmcif_bio, \n",
    "               bio_unit=True, bio_unit_preferred=True)\n",
    "PDB.download(identifier=pdb_id, filename=out_pdb)\n",
    "\n",
    "assert os.path.exists(out_mmcif)\n",
    "assert os.path.exists(out_mmcif_bio)\n",
    "assert os.path.exists(out_pdb)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loading the structures onto a Pandas DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  group_PDB  id type_symbol label_atom_id label_alt_id label_comp_id  \\\n",
      "0      ATOM   1           N             N            .           VAL   \n",
      "1      ATOM   2           C            CA            .           VAL   \n",
      "2      ATOM   3           C             C            .           VAL   \n",
      "3      ATOM   4           O             O            .           VAL   \n",
      "4      ATOM   5           C            CB            .           VAL   \n",
      "\n",
      "  label_asym_id label_entity_id label_seq_id pdbx_PDB_ins_code  \\\n",
      "0             A               1            1                 ?   \n",
      "1             A               1            1                 ?   \n",
      "2             A               1            1                 ?   \n",
      "3             A               1            1                 ?   \n",
      "4             A               1            1                 ?   \n",
      "\n",
      "         ...         Cartn_z  occupancy  B_iso_or_equiv  pdbx_formal_charge  \\\n",
      "0        ...          18.770        1.0           56.51                   ?   \n",
      "1        ...          20.244        1.0           59.09                   ?   \n",
      "2        ...          20.700        1.0           44.63                   ?   \n",
      "3        ...          20.204        1.0           59.84                   ?   \n",
      "4        ...          20.638        1.0           53.90                   ?   \n",
      "\n",
      "   auth_seq_id auth_comp_id auth_asym_id auth_atom_id pdbx_PDB_model_num  \\\n",
      "0          118          VAL            A            N                  1   \n",
      "1          118          VAL            A           CA                  1   \n",
      "2          118          VAL            A            C                  1   \n",
      "3          118          VAL            A            O                  1   \n",
      "4          118          VAL            A           CB                  1   \n",
      "\n",
      "  pdbe_label_seq_id  \n",
      "0                 1  \n",
      "1                 1  \n",
      "2                 1  \n",
      "3                 1  \n",
      "4                 1  \n",
      "\n",
      "[5 rows x 22 columns]\n",
      "Index(['group_PDB', 'id', 'type_symbol', 'label_atom_id', 'label_alt_id',\n",
      "       'label_comp_id', 'label_asym_id', 'label_entity_id', 'label_seq_id',\n",
      "       'pdbx_PDB_ins_code', 'Cartn_x', 'Cartn_y', 'Cartn_z', 'occupancy',\n",
      "       'B_iso_or_equiv', 'pdbx_formal_charge', 'auth_seq_id', 'auth_comp_id',\n",
      "       'auth_asym_id', 'auth_atom_id', 'pdbx_PDB_model_num',\n",
      "       'pdbe_label_seq_id'],\n",
      "      dtype='object')\n"
     ]
    }
   ],
   "source": [
    "mmcif = mmCIF.read(filename=out_mmcif)\n",
    "print(mmcif.head())\n",
    "print(mmcif.columns)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  group_PDB  id type_symbol label_atom_id label_alt_id label_comp_id  \\\n",
      "0      ATOM   1           N             N            .           VAL   \n",
      "1      ATOM   2           C            CA            .           VAL   \n",
      "2      ATOM   3           C             C            .           VAL   \n",
      "3      ATOM   4           O             O            .           VAL   \n",
      "4      ATOM   5           C            CB            .           VAL   \n",
      "\n",
      "  label_asym_id label_entity_id label_seq_id pdbx_PDB_ins_code  \\\n",
      "0             A               1            1                 ?   \n",
      "1             A               1            1                 ?   \n",
      "2             A               1            1                 ?   \n",
      "3             A               1            1                 ?   \n",
      "4             A               1            1                 ?   \n",
      "\n",
      "         ...         B_iso_or_equiv  pdbx_formal_charge  auth_seq_id  \\\n",
      "0        ...                  56.51                   ?          118   \n",
      "1        ...                  59.09                   ?          118   \n",
      "2        ...                  44.63                   ?          118   \n",
      "3        ...                  59.84                   ?          118   \n",
      "4        ...                  53.90                   ?          118   \n",
      "\n",
      "   auth_comp_id  auth_asym_id auth_atom_id pdbx_PDB_model_num  \\\n",
      "0           VAL             A            N                  1   \n",
      "1           VAL             A           CA                  1   \n",
      "2           VAL             A            C                  1   \n",
      "3           VAL             A            O                  1   \n",
      "4           VAL             A           CB                  1   \n",
      "\n",
      "  pdbe_label_seq_id orig_label_asym_id orig_auth_asym_id  \n",
      "0                 1                  A                 A  \n",
      "1                 1                  A                 A  \n",
      "2                 1                  A                 A  \n",
      "3                 1                  A                 A  \n",
      "4                 1                  A                 A  \n",
      "\n",
      "[5 rows x 24 columns]\n",
      "Index(['group_PDB', 'id', 'type_symbol', 'label_atom_id', 'label_alt_id',\n",
      "       'label_comp_id', 'label_asym_id', 'label_entity_id', 'label_seq_id',\n",
      "       'pdbx_PDB_ins_code', 'Cartn_x', 'Cartn_y', 'Cartn_z', 'occupancy',\n",
      "       'B_iso_or_equiv', 'pdbx_formal_charge', 'auth_seq_id', 'auth_comp_id',\n",
      "       'auth_asym_id', 'auth_atom_id', 'pdbx_PDB_model_num',\n",
      "       'pdbe_label_seq_id', 'orig_label_asym_id', 'orig_auth_asym_id'],\n",
      "      dtype='object')\n"
     ]
    }
   ],
   "source": [
    "mmcif_bio = mmCIF.read(filename=out_mmcif_bio)\n",
    "print(mmcif_bio.head())\n",
    "print(mmcif_bio.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a forma description of each colum please see http://mmcif.wwpdb.org/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  group_PDB    id label_atom_id label_alt_id label_comp_id label_asym_id  \\\n",
      "0    HETATM  5316            FE            F           E .                 \n",
      "1    HETATM  5317            FE            F           E .                 \n",
      "\n",
      "  label_seq_id_full label_seq_id pdbx_PDB_ins_code  Cartn_x  \\\n",
      "0              FE C         FE C                 ?  . ?   6   \n",
      "1              FE D         FE D                 ?  . ? -39   \n",
      "\n",
      "         ...           Cartn_z occupancy B_iso_or_equiv type_symbol  \\\n",
      "0        ...          284 42.9    25 1.0         0  84.          FE   \n",
      "1        ...          235 43.6    84 1.0         0  91.          FE   \n",
      "\n",
      "  auth_atom_id auth_comp_id auth_asym_id auth_seq_id_full auth_seq_id  \\\n",
      "0           FE          E .                          FE C        FE C   \n",
      "1           FE          E .                          FE D        FE D   \n",
      "\n",
      "  pdbx_PDB_model_num  \n",
      "0                  1  \n",
      "1                  1  \n",
      "\n",
      "[2 rows x 21 columns]\n",
      "Index(['group_PDB', 'id', 'label_atom_id', 'label_alt_id', 'label_comp_id',\n",
      "       'label_asym_id', 'label_seq_id_full', 'label_seq_id',\n",
      "       'pdbx_PDB_ins_code', 'Cartn_x', 'Cartn_y', 'Cartn_z', 'occupancy',\n",
      "       'B_iso_or_equiv', 'type_symbol', 'auth_atom_id', 'auth_comp_id',\n",
      "       'auth_asym_id', 'auth_seq_id_full', 'auth_seq_id',\n",
      "       'pdbx_PDB_model_num'],\n",
      "      dtype='object')\n"
     ]
    }
   ],
   "source": [
    "#  Column names mimic of a PDB file mimics those of the mmCIF format\n",
    "#  Please prefer processing mmCIF instead PDB, which were deprecated\n",
    "pdb = PDB.read(filename=out_pdb)\n",
    "print(pdb.head())\n",
    "print(pdb.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dowloading a SIFTS xml record for obtaining PDB-UniProt  mapping"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from proteofav.sifts import SIFTS\n",
    "\n",
    "# output file names\n",
    "out_sifts = os.path.join(out_dir, \"{}.xml\".format(pdb_id))\n",
    "\n",
    "SIFTS.download(identifier=pdb_id, filename=out_sifts)\n",
    "\n",
    "assert os.path.exists(out_sifts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loading the SIFTS record"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   PDB_regionId  PDB_regionStart  PDB_regionEnd PDB_regionResNum  \\\n",
      "0             1                1            335                1   \n",
      "1             1                1            335                2   \n",
      "2             1                1            335                3   \n",
      "3             1                1            335                4   \n",
      "4             1                1            335                5   \n",
      "\n",
      "  PDB_dbAccessionId PDB_dbResNum PDB_dbResName PDB_dbChainId PDB_Annotation  \\\n",
      "0              2pah          118           VAL             A       Observed   \n",
      "1              2pah          119           PRO             A       Observed   \n",
      "2              2pah          120           TRP             A       Observed   \n",
      "3              2pah          121           PHE             A       Observed   \n",
      "4              2pah          122           PRO             A       Observed   \n",
      "\n",
      "  PDB_entityId         ...          SCOP_regionEnd  SCOP_regionResNum  \\\n",
      "0            A         ...                     335                  1   \n",
      "1            A         ...                     335                  2   \n",
      "2            A         ...                     335                  3   \n",
      "3            A         ...                     335                  4   \n",
      "4            A         ...                     335                  5   \n",
      "\n",
      "   SCOP_dbAccessionId PDB_codeSecondaryStructure PDB_nameSecondaryStructure  \\\n",
      "0               42581                          T                       loop   \n",
      "1               42581                          T                       loop   \n",
      "2               42581                          T                       loop   \n",
      "3               42581                          T                       loop   \n",
      "4               42581                          T                       loop   \n",
      "\n",
      "  Pfam_regionId Pfam_regionStart  Pfam_regionEnd  Pfam_regionResNum  \\\n",
      "0             -                0               0                NaN   \n",
      "1             1                2             332                  2   \n",
      "2             1                2             332                  3   \n",
      "3             1                2             332                  4   \n",
      "4             1                2             332                  5   \n",
      "\n",
      "   Pfam_dbAccessionId  \n",
      "0                 NaN  \n",
      "1             PF00351  \n",
      "2             PF00351  \n",
      "3             PF00351  \n",
      "4             PF00351  \n",
      "\n",
      "[5 rows x 34 columns]\n"
     ]
    }
   ],
   "source": [
    "sifts = SIFTS.read(filename=out_sifts)\n",
    "print(sifts.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The SIFT record also contains mappings to many other databases, such as:\n",
    "- CATH\n",
    "- SCOP\n",
    "- PFAM\n",
    "\n",
    "Bear in mind that SIFT mapping occurs at residue, but also at the domain level. \n",
    "The default action is to load the residue mapping.\n",
    "\n",
    "Also see the *PDB_Annotation* which flags several types of annotation at residue level, for example whether a given UniProt residues was observed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dowloading a DSSP record for obtaining Secondary Structure information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "True\n"
     ]
    }
   ],
   "source": [
    "from proteofav.dssp import DSSP\n",
    "\n",
    "# output file names\n",
    "out_dssp = os.path.join(out_dir, \"{}.dssp\".format(pdb_id))\n",
    "\n",
    "DSSP.download(identifier=pdb_id, filename=out_dssp)\n",
    "\n",
    "# sometimes fecthing from the DSSP FTP server at ftp://ftp.cmbi.ru.nl/pub/molbio/data/dssp/ times out...\n",
    "print(os.path.exists(out_dssp))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loading the DSSP record"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   RES RES_FULL INSCODE CHAIN AA SS  ACC    TCO  KAPPA  ALPHA    PHI    PSI\n",
      "0  118      118             A  V     127  0.000  360.0  360.0  360.0  124.7\n",
      "1  119      119             A  P      42 -0.071  360.0 -105.4  -51.6  149.9\n",
      "2  120      120             A  W     120 -0.593   41.9 -178.8  -81.0  139.2\n",
      "3  121      121             A  F      17 -0.980   33.0  -92.9 -138.9  150.9\n",
      "4  122      122             A  P       4 -0.405   27.9 -176.6  -65.3  130.9\n"
     ]
    }
   ],
   "source": [
    "dssp = DSSP.read(filename=out_dssp)\n",
    "print(dssp.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dowload a PDBe Validation XML record"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "from proteofav.validation import Validation\n",
    "\n",
    "out_validation = os.path.join(out_dir, \"{}_validation.xml\".format(pdb_id))\n",
    "\n",
    "Validation.download(identifier=pdb_id, filename=out_validation)\n",
    "\n",
    "assert os.path.exists(out_validation)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loading the Validation record"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   validation_rscc validation_rama validation_icode validation_ligRSRZ  \\\n",
      "0            0.896             NaN                ?                NaN   \n",
      "1            0.960         Favored                ?                NaN   \n",
      "2            0.961         Favored                ?                NaN   \n",
      "3            0.920         Favored                ?                NaN   \n",
      "4            0.973         Favored                ?                NaN   \n",
      "\n",
      "  validation_ligRSRnbrMean validation_flippable-sidechain  validation_psi  \\\n",
      "0                      NaN                            NaN             NaN   \n",
      "1                      NaN                            NaN           149.9   \n",
      "2                      NaN                            NaN           139.2   \n",
      "3                      NaN                            NaN           150.9   \n",
      "4                      NaN                            NaN           130.9   \n",
      "\n",
      "   validation_rsr  validation_owab validation_ligRSRnumnbrs        ...         \\\n",
      "0           0.233            52.97                      NaN        ...          \n",
      "1           0.190            28.84                      NaN        ...          \n",
      "2           0.154            33.47                      NaN        ...          \n",
      "3           0.229            39.98                      NaN        ...          \n",
      "4           0.197            26.14                      NaN        ...          \n",
      "\n",
      "   validation_chain validation_phi validation_said validation_rsrz  \\\n",
      "0                 A            NaN               A          -0.160   \n",
      "1                 A          -51.6               A          -0.274   \n",
      "2                 A          -81.0               A          -0.874   \n",
      "3                 A         -138.9               A          -0.308   \n",
      "4                 A          -65.3               A          -0.204   \n",
      "\n",
      "  validation_seq validation_ligRSRnbrStdev  validation_altcode  \\\n",
      "0              1                       NaN                   .   \n",
      "1              2                       NaN                   .   \n",
      "2              3                       NaN                   .   \n",
      "3              4                       NaN                   .   \n",
      "4              5                       NaN                   .   \n",
      "\n",
      "  validation_lig_rsrz_nbr_id  validation_NatomsEDS validation_resnum  \n",
      "0                        NaN                     7               118  \n",
      "1                        NaN                     7               119  \n",
      "2                        NaN                    14               120  \n",
      "3                        NaN                    11               121  \n",
      "4                        NaN                     7               122  \n",
      "\n",
      "[5 rows x 27 columns]\n"
     ]
    }
   ],
   "source": [
    "validation = Validation.read(filename=out_validation)\n",
    "print(validation.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PDB validation record is convenient when filtering a protein structure for analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Select only CA residues in for a single chain"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Protein structure representation is a hierarchical data structure (See http://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ). So to obtain the data in tabular format, ProteoFAV transforms the data. For example, for use cases that require one residue per row, the residue three-dimensional coordinates can be represented by the residue's Cα. Other filtering parameters are obtained with *filter_structures*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   group_PDB  id type_symbol label_atom_id label_alt_id label_comp_id  \\\n",
      "1       ATOM   2           C            CA            .           VAL   \n",
      "8       ATOM   9           C            CA            .           PRO   \n",
      "15      ATOM  16           C            CA            .           TRP   \n",
      "29      ATOM  30           C            CA            .           PHE   \n",
      "40      ATOM  41           C            CA            .           PRO   \n",
      "\n",
      "   label_asym_id label_entity_id label_seq_id pdbx_PDB_ins_code  \\\n",
      "1              A               1            1                 ?   \n",
      "8              A               1            2                 ?   \n",
      "15             A               1            3                 ?   \n",
      "29             A               1            4                 ?   \n",
      "40             A               1            5                 ?   \n",
      "\n",
      "         ...         B_iso_or_equiv  pdbx_formal_charge  auth_seq_id  \\\n",
      "1        ...                  59.09                   ?          118   \n",
      "8        ...                  20.13                   ?          119   \n",
      "15       ...                  33.96                   ?          120   \n",
      "29       ...                  34.42                   ?          121   \n",
      "40       ...                  28.65                   ?          122   \n",
      "\n",
      "    auth_comp_id  auth_asym_id auth_atom_id pdbx_PDB_model_num  \\\n",
      "1            VAL             A           CA                  1   \n",
      "8            PRO             A           CA                  1   \n",
      "15           TRP             A           CA                  1   \n",
      "29           PHE             A           CA                  1   \n",
      "40           PRO             A           CA                  1   \n",
      "\n",
      "   pdbe_label_seq_id label_seq_id_full auth_seq_id_full  \n",
      "1                  1                 1              118  \n",
      "8                  2                 2              119  \n",
      "15                 3                 3              120  \n",
      "29                 4                 4              121  \n",
      "40                 5                 5              122  \n",
      "\n",
      "[5 rows x 24 columns]\n"
     ]
    }
   ],
   "source": [
    "from proteofav.structures import filter_structures\n",
    "\n",
    "mmcif_sel = filter_structures(mmcif, excluded_cols=None,\n",
    "                              models='first', chains='A', res=None, res_full=None,\n",
    "                              comps=None, atoms='CA', lines=None, category='auth',\n",
    "                              residue_agg=False, \n",
    "                              add_res_full=True, add_atom_altloc=False, reset_atom_id=True,\n",
    "                              remove_altloc=False, remove_hydrogens=True, remove_partial_res=False)\n",
    "print(mmcif_sel.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Aggregating atoms residue-by-residue\n",
    "Three dimensional coordinates of all atoms can be represented by the residues centroid"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   index pdbx_PDB_model_num auth_asym_id auth_seq_id group_PDB  id  \\\n",
      "0      0                  1            A         118      ATOM   1   \n",
      "1      1                  1            A         119      ATOM   8   \n",
      "2      2                  1            A         120      ATOM  15   \n",
      "3      3                  1            A         121      ATOM  29   \n",
      "4      4                  1            A         122      ATOM  40   \n",
      "\n",
      "  type_symbol label_atom_id label_alt_id label_comp_id        ...         \\\n",
      "0           N             N            .           VAL        ...          \n",
      "1           N             N            .           PRO        ...          \n",
      "2           N             N            .           TRP        ...          \n",
      "3           N             N            .           PHE        ...          \n",
      "4           N             N            .           PRO        ...          \n",
      "\n",
      "  pdbx_PDB_ins_code   Cartn_x    Cartn_y    Cartn_z  occupancy  \\\n",
      "0                 ? -7.310714  21.031714  20.424143        1.0   \n",
      "1                 ? -4.434571  21.470714  22.630714        1.0   \n",
      "2                 ? -1.007000  15.673429  23.737071        1.0   \n",
      "3                 ? -5.282455  15.784000  26.269091        1.0   \n",
      "4                 ? -2.392571  13.578286  28.745714        1.0   \n",
      "\n",
      "   B_iso_or_equiv  pdbx_formal_charge  auth_comp_id  auth_atom_id  \\\n",
      "0       52.974286                   ?           VAL             N   \n",
      "1       28.844286                   ?           PRO             N   \n",
      "2       33.466429                   ?           TRP             N   \n",
      "3       39.981818                   ?           PHE             N   \n",
      "4       26.141429                   ?           PRO             N   \n",
      "\n",
      "  pdbe_label_seq_id  \n",
      "0                 1  \n",
      "1                 2  \n",
      "2                 3  \n",
      "3                 4  \n",
      "4                 5  \n",
      "\n",
      "[5 rows x 23 columns]\n"
     ]
    }
   ],
   "source": [
    "from proteofav.structures import residues_aggregation\n",
    "\n",
    "mmcif_sel = residues_aggregation(mmcif, agg_method='centroid', category='auth')\n",
    "print(mmcif_sel.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Write a PDB-formatted file from a mmCIF structure"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "new_out_pdb = os.path.join(out_dir, \"{}_new.pdb\".format(pdb_id)) \n",
    "PDB.write(table=mmcif, filename=new_out_pdb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get a UniProt-PDB mapping from the SIFTS xml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   PDB_regionId  PDB_regionStart  PDB_regionEnd PDB_regionResNum  \\\n",
      "0             1                1            335                1   \n",
      "1             1                1            335                2   \n",
      "2             1                1            335                3   \n",
      "3             1                1            335                4   \n",
      "4             1                1            335                5   \n",
      "\n",
      "  PDB_dbAccessionId PDB_dbResNum PDB_dbResName PDB_dbChainId PDB_Annotation  \\\n",
      "0              2pah          118           VAL             A       Observed   \n",
      "1              2pah          119           PRO             A       Observed   \n",
      "2              2pah          120           TRP             A       Observed   \n",
      "3              2pah          121           PHE             A       Observed   \n",
      "4              2pah          122           PRO             A       Observed   \n",
      "\n",
      "  PDB_entityId         ...          SCOP_regionEnd  SCOP_regionResNum  \\\n",
      "0            A         ...                     335                  1   \n",
      "1            A         ...                     335                  2   \n",
      "2            A         ...                     335                  3   \n",
      "3            A         ...                     335                  4   \n",
      "4            A         ...                     335                  5   \n",
      "\n",
      "   SCOP_dbAccessionId PDB_codeSecondaryStructure PDB_nameSecondaryStructure  \\\n",
      "0               42581                          T                       loop   \n",
      "1               42581                          T                       loop   \n",
      "2               42581                          T                       loop   \n",
      "3               42581                          T                       loop   \n",
      "4               42581                          T                       loop   \n",
      "\n",
      "  Pfam_regionId Pfam_regionStart  Pfam_regionEnd  Pfam_regionResNum  \\\n",
      "0             -                0               0                NaN   \n",
      "1             1                2             332                  2   \n",
      "2             1                2             332                  3   \n",
      "3             1                2             332                  4   \n",
      "4             1                2             332                  5   \n",
      "\n",
      "   Pfam_dbAccessionId  \n",
      "0                 NaN  \n",
      "1             PF00351  \n",
      "2             PF00351  \n",
      "3             PF00351  \n",
      "4             PF00351  \n",
      "\n",
      "[5 rows x 34 columns]\n",
      "['P00439']\n"
     ]
    }
   ],
   "source": [
    "sifts = SIFTS.read(filename=out_sifts)\n",
    "print(sifts.head())\n",
    "\n",
    "uniprot_ids = sifts.UniProt_dbAccessionId.unique()\n",
    "print(uniprot_ids)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Downloading a sequence Annotation (GFF) from UniProt\n",
    "UniProt provides extensive, high-quality annotation for residues in proteins"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "from proteofav.annotation import Annotation\n",
    "\n",
    "out_annotation = os.path.join(out_dir, \"{}.gff\".format(uniprot_ids[0]))\n",
    "\n",
    "Annotation.download(identifier=uniprot_ids[0], filename=out_annotation)\n",
    "\n",
    "assert os.path.exists(out_annotation)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loading the sequence Annotation\n",
    "Note also that GFF files althoug tabular, contains some extra level nesting in the `GROUP` column. ProteoFAV tries to deconvolute this information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     NAME     SOURCE           TYPE START  END SCORE STRAND FRAME  \\\n",
      "0  P00439  UniProtKB          Chain     1  452     .      .     .   \n",
      "1  P00439  UniProtKB         Domain    36  114     .      .     .   \n",
      "2  P00439  UniProtKB  Metal binding   285  285     .      .     .   \n",
      "3  P00439  UniProtKB  Metal binding   290  290     .      .     .   \n",
      "4  P00439  UniProtKB  Metal binding   330  330     .      .     .   \n",
      "\n",
      "                                               GROUP Dbxref                ID  \\\n",
      "0  ID=PRO_0000205548;Note=Phenylalanine-4-hydroxy...    NaN  [PRO_0000205548]   \n",
      "1  Note=ACT;Ontology_term=ECO:0000255;evidence=EC...    NaN               NaN   \n",
      "2  Note=Iron%3B via tele nitrogen;Ontology_term=E...    NaN               NaN   \n",
      "3  Note=Iron%3B via tele nitrogen;Ontology_term=E...    NaN               NaN   \n",
      "4  Note=Iron;Ontology_term=ECO:0000250;evidence=E...    NaN               NaN   \n",
      "\n",
      "                            Note  Ontology_term  \\\n",
      "0  [Phenylalanine-4-hydroxylase]            NaN   \n",
      "1                          [ACT]  [ECO:0000255]   \n",
      "2      [Iron; via tele nitrogen]  [ECO:0000250]   \n",
      "3      [Iron; via tele nitrogen]  [ECO:0000250]   \n",
      "4                         [Iron]  [ECO:0000250]   \n",
      "\n",
      "                                 evidence  \n",
      "0                                     NaN  \n",
      "1  [ECO:0000255|PROSITE-ProRule:PRU01007]  \n",
      "2          [ECO:0000250|UniProtKB:P04176]  \n",
      "3          [ECO:0000250|UniProtKB:P04176]  \n",
      "4          [ECO:0000250|UniProtKB:P04176]  \n"
     ]
    }
   ],
   "source": [
    "annotation = Annotation.read(filename=out_annotation)\n",
    "print(annotation.head())"
   ]
  },
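  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The unpacking of the `GROUP` column can be illustrated with plain pandas. The following is a minimal sketch, not ProteoFAV's actual implementation: the helper `split_group` and the shortened example string are hypothetical and only mirror the `Note` and `Ontology_term` columns shown in the output above.\n",
    "\n",
    "```python\n",
    "from urllib.parse import unquote\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def split_group(group):\n",
    "    # hypothetical helper: turn 'key=value;key=value' into {key: [value]}\n",
    "    fields = {}\n",
    "    for item in group.split(';'):\n",
    "        key, _, value = item.partition('=')\n",
    "        # percent-encoded characters such as %3B (';') are decoded\n",
    "        fields[key] = [unquote(value)]\n",
    "    return fields\n",
    "\n",
    "\n",
    "# shortened, illustrative GROUP value modelled on the table above\n",
    "gff = pd.DataFrame({'GROUP': ['Note=Iron%3B via tele nitrogen;Ontology_term=ECO:0000250']})\n",
    "expanded = gff['GROUP'].apply(split_group).apply(pd.Series)\n",
    "print(gff.join(expanded))\n",
    "```\n",
    "\n",
    "Each key becomes its own column, and rows lacking a key would be left as `NaN`, which resembles the layout of the table printed after `Annotation.read`."
   ]
  },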
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Downloading variants based on the UniProt ID\n",
    "We could fetch genetic variants from UniProt and Ensembl with:\n",
    "\n",
    "```python\n",
    "Variants.fetch(identifier=uniprot_ids[0], id_source='uniprot', \n",
    "               synonymous=False, uniprot_vars=True,\n",
    "               ensembl_germline_vars=True, ensembl_somatic_vars=True)\n",
    "```\n",
    "\n",
    "but `select_variants` handles merging of Ensembl vars for us"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "from proteofav.variants import Variants\n",
    "\n",
    "uniprot, ensembl = Variants.select(identifier=uniprot_ids[0], id_source='uniprot', \n",
    "                                   synonymous=False, uniprot_vars=True,\n",
    "                                   ensembl_germline_vars=True, ensembl_somatic_vars=True)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Glancing over the variants"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  accession alternativeSequence association_description association_disease  \\\n",
      "0    P00439                   A                     NaN                True   \n",
      "1    P00439                   L          haplotypes 1,4                True   \n",
      "2    P00439                   L                     NaN                True   \n",
      "3    P00439                   S            haplotype 36                True   \n",
      "4    P00439                   V                     NaN                True   \n",
      "\n",
      "  association_evidences_code  \\\n",
      "0                ECO:0000269   \n",
      "1                ECO:0000269   \n",
      "2                        NaN   \n",
      "3                ECO:0000269   \n",
      "4                ECO:0000269   \n",
      "\n",
      "         association_evidences_source_alternativeUrl  \\\n",
      "0  [http://europepmc.org/abstract/MED/22513348, h...   \n",
      "1  [http://europepmc.org/abstract/MED/22513348, h...   \n",
      "2                                                NaN   \n",
      "3          http://europepmc.org/abstract/MED/2014802   \n",
      "4  [http://europepmc.org/abstract/MED/8889590, ht...   \n",
      "\n",
      "                   association_evidences_source_id  \\\n",
      "0           [22513348, 8889590, 8088845, 12501224]   \n",
      "1  [22513348, 1672290, 8889590, 12501224, 1672294]   \n",
      "2                                              NaN   \n",
      "3                                          2014802   \n",
      "4          [8889590, 12501224, 22513348, 23792259]   \n",
      "\n",
      "  association_evidences_source_name  \\\n",
      "0                            PubMed   \n",
      "1                            PubMed   \n",
      "2                               NaN   \n",
      "3                            PubMed   \n",
      "4                            PubMed   \n",
      "\n",
      "                    association_evidences_source_url  \\\n",
      "0  [http://www.ncbi.nlm.nih.gov/pubmed/22513348, ...   \n",
      "1  [http://www.ncbi.nlm.nih.gov/pubmed/22513348, ...   \n",
      "2                                                NaN   \n",
      "3         http://www.ncbi.nlm.nih.gov/pubmed/2014802   \n",
      "4  [http://www.ncbi.nlm.nih.gov/pubmed/8889590, h...   \n",
      "\n",
      "                                    association_name  \\\n",
      "0  [Phenylketonuria (PKU), Hyperphenylalaninemia ...   \n",
      "1                              Phenylketonuria (PKU)   \n",
      "2                        Hyperphenylalaninemia (HPA)   \n",
      "3                              Phenylketonuria (PKU)   \n",
      "4                              Phenylketonuria (PKU)   \n",
      "\n",
      "                         ...                         siftPrediction siftScore  \\\n",
      "0                        ...                              tolerated      0.11   \n",
      "1                        ...                            deleterious         0   \n",
      "2                        ...                                    NaN       NaN   \n",
      "3                        ...                                    NaN       NaN   \n",
      "4                        ...                              tolerated      0.06   \n",
      "\n",
      "  somaticStatus  sourceType taxid     type wildType     xrefs_id  \\\n",
      "0             0       mixed  9606  VARIANT        V  rs796052017   \n",
      "1             0       mixed  9606  VARIANT        P    rs5030851   \n",
      "2             0     uniprot  9606  VARIANT        Q  rs199475662   \n",
      "3             0     uniprot  9606  VARIANT        L   rs62642930   \n",
      "4             0       mixed  9606  VARIANT        A    rs5030857   \n",
      "\n",
      "                                 xrefs_name  \\\n",
      "0  [dbSNP, Ensembl, 1000Genomes, ESP, ExAC]   \n",
      "1               [dbSNP, Ensembl, ESP, ExAC]   \n",
      "2                          [dbSNP, Ensembl]   \n",
      "3                          [dbSNP, Ensembl]   \n",
      "4  [dbSNP, Ensembl, 1000Genomes, ESP, ExAC]   \n",
      "\n",
      "                                           xrefs_url  \n",
      "0  [http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?t...  \n",
      "1  [http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?t...  \n",
      "2  [http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?t...  \n",
      "3  [http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?t...  \n",
      "4  [http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?t...  \n",
      "\n",
      "[5 rows x 42 columns]\n"
     ]
    }
   ],
   "source": [
    "print(uniprot.head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "            Parent         allele  begin clinical_significance   codons  \\\n",
      "0  ENST00000553106  HGMD_MUTATION    377                    []            \n",
      "1  ENST00000553106            C/T     75                    []  Gat/Aat   \n",
      "2  ENST00000553106  HGMD_MUTATION    300                    []            \n",
      "3  ENST00000553106  HGMD_MUTATION    245                    []            \n",
      "4  ENST00000553106  HGMD_MUTATION    415                    []            \n",
      "\n",
      "           consequenceType  end          feature_type frequency  \\\n",
      "0  coding_sequence_variant  377  transcript_variation       NaN   \n",
      "1         missense_variant   75  transcript_variation       NaN   \n",
      "2  coding_sequence_variant  300  transcript_variation       NaN   \n",
      "3  coding_sequence_variant  245  transcript_variation       NaN   \n",
      "4  coding_sequence_variant  415  transcript_variation       NaN   \n",
      "\n",
      "   polyphenScore residues  seq_region_name  siftScore      translation  \\\n",
      "0            NaN           ENSP00000448059        NaN  ENSP00000448059   \n",
      "1          0.014      D/N  ENSP00000448059       0.49  ENSP00000448059   \n",
      "2            NaN           ENSP00000448059        NaN  ENSP00000448059   \n",
      "3            NaN           ENSP00000448059        NaN  ENSP00000448059   \n",
      "4            NaN           ENSP00000448059        NaN  ENSP00000448059   \n",
      "\n",
      "      xrefs_id  \n",
      "0     CD011183  \n",
      "1  rs767453024  \n",
      "2     CM950893  \n",
      "3     CM941133  \n",
      "4     CM920564  \n"
     ]
    }
   ],
   "source": [
    "print(ensembl.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Merging down the two Variants tables\n",
    "For merging variants from the UniProt and Ensembl"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Parent accession allele alternativeSequence association_description  \\\n",
      "0    NaN    P00439    NaN                 del                     NaN   \n",
      "1    NaN    P00439    NaN                 del                     NaN   \n",
      "2    NaN    P00439    NaN                   K                     NaN   \n",
      "3    NaN    P00439    NaN                 del                     NaN   \n",
      "4    NaN    P00439    NaN                   L                     NaN   \n",
      "\n",
      "  association_disease association_evidences_code  \\\n",
      "0                 NaN                        NaN   \n",
      "1                 NaN                        NaN   \n",
      "2                 NaN                        NaN   \n",
      "3                 NaN                        NaN   \n",
      "4                True                ECO:0000269   \n",
      "\n",
      "  association_evidences_source_alternativeUrl association_evidences_source_id  \\\n",
      "0                                         NaN                             NaN   \n",
      "1                                         NaN                             NaN   \n",
      "2                                         NaN                             NaN   \n",
      "3                                         NaN                             NaN   \n",
      "4  http://europepmc.org/abstract/MED/23792259                        23792259   \n",
      "\n",
      "  association_evidences_source_name  \\\n",
      "0                               NaN   \n",
      "1                               NaN   \n",
      "2                               NaN   \n",
      "3                               NaN   \n",
      "4                            PubMed   \n",
      "\n",
      "                         ...                         siftScore somaticStatus  \\\n",
      "0                        ...                               NaN           0.0   \n",
      "1                        ...                               NaN           0.0   \n",
      "2                        ...                                 0           0.0   \n",
      "3                        ...                               NaN           0.0   \n",
      "4                        ...                               NaN           0.0   \n",
      "\n",
      "          sourceType   taxid translation     type wildType    xrefs_id  \\\n",
      "0            uniprot  9606.0         NaN  VARIANT        L         NaN   \n",
      "1            uniprot  9606.0         NaN  VARIANT        Y         NaN   \n",
      "2  large_scale_study  9606.0         NaN  VARIANT        T  COSM546084   \n",
      "3            uniprot  9606.0         NaN  VARIANT        L         NaN   \n",
      "4            uniprot  9606.0         NaN  VARIANT        F         NaN   \n",
      "\n",
      "       xrefs_name                                          xrefs_url  \n",
      "0             NaN                                                NaN  \n",
      "1             NaN                                                NaN  \n",
      "2  cosmic curated  http://cancer.sanger.ac.uk/cosmic/mutation/ove...  \n",
      "3             NaN                                                NaN  \n",
      "4             NaN                                                NaN  \n",
      "\n",
      "[5 rows x 50 columns]\n"
     ]
    }
   ],
   "source": [
    "from proteofav.mergers import uniprot_vars_ensembl_vars_merger\n",
    "\n",
    "variants = uniprot_vars_ensembl_vars_merger(uniprot, ensembl)\n",
    "print(variants.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Merging the Structure, DSSP, SIFTS, Validation, Annotation and Variants data onto a single DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   index pdbx_PDB_model_num auth_asym_id auth_seq_id group_PDB  id  \\\n",
      "0      0                  1            A         118      ATOM   1   \n",
      "1      1                  1            A         119      ATOM   8   \n",
      "2      1                  1            A         119      ATOM   8   \n",
      "3      2                  1            A         120      ATOM  15   \n",
      "4      3                  1            A         121      ATOM  29   \n",
      "\n",
      "  type_symbol label_atom_id label_alt_id label_comp_id  \\\n",
      "0           N             N            .           VAL   \n",
      "1           N             N            .           PRO   \n",
      "2           N             N            .           PRO   \n",
      "3           N             N            .           TRP   \n",
      "4           N             N            .           PHE   \n",
      "\n",
      "                         ...                             siftScore  \\\n",
      "0                        ...                                  0.14   \n",
      "1                        ...                                  0.01   \n",
      "2                        ...                          (0.03, 0.01)   \n",
      "3                        ...                                     0   \n",
      "4                        ...                                   NaN   \n",
      "\n",
      "  somaticStatus         sourceType   taxid      translation     type  \\\n",
      "0           0.0  large_scale_study  9606.0  ENSP00000448059  VARIANT   \n",
      "1           0.0  large_scale_study  9606.0  ENSP00000448059  VARIANT   \n",
      "2           0.0  large_scale_study  9606.0  ENSP00000448059  VARIANT   \n",
      "3           0.0  large_scale_study  9606.0  ENSP00000448059  VARIANT   \n",
      "4           0.0            uniprot  9606.0              NaN  VARIANT   \n",
      "\n",
      "   wildType     xrefs_id           xrefs_name  \\\n",
      "0         V  rs776442422                 ExAC   \n",
      "1         P  rs398123292  (1000Genomes, ExAC)   \n",
      "2         P  rs374999809          (ExAC, ESP)   \n",
      "3         W  rs775327122                 ExAC   \n",
      "4         F          NaN                  NaN   \n",
      "\n",
      "                                           xrefs_url  \n",
      "0  http://exac.broadinstitute.org/awesome?query=r...  \n",
      "1  (http://www.ensembl.org/Homo_sapiens/Variation...  \n",
      "2  (http://evs.gs.washington.edu/EVS/PopStatsServ...  \n",
      "3  http://exac.broadinstitute.org/awesome?query=r...  \n",
      "4                                                NaN  \n",
      "\n",
      "[5 rows x 155 columns]\n"
     ]
    }
   ],
   "source": [
    "from proteofav.mergers import table_merger\n",
    "\n",
    "# before merging we need to select/filter or add extra columns with necessary data\n",
    "from proteofav.structures import filter_structures\n",
    "from proteofav.dssp import filter_dssp\n",
    "from proteofav.sifts import filter_sifts\n",
    "from proteofav.validation import filter_validation\n",
    "from proteofav.annotation import filter_annotation\n",
    "\n",
    "# does residue aggregation and adds 'res_full' and removes hydrogens\n",
    "mmcif = filter_structures(mmcif, excluded_cols=None,\n",
    "                          models='first', chains=None, res=None, res_full=None,\n",
    "                          comps=None, atoms=None, lines=None, category='auth',\n",
    "                          residue_agg=True, agg_method='centroid',\n",
    "                          add_res_full=True, add_atom_altloc=False, reset_atom_id=True,\n",
    "                          remove_altloc=False, remove_hydrogens=True, remove_partial_res=False)\n",
    "\n",
    "# adds 'full_chain' and 'rsa'\n",
    "dssp = filter_dssp(dssp, excluded_cols=None,\n",
    "                   chains=None, chains_full=None, res=None,\n",
    "                   add_full_chain=True, add_ss_reduced=False,\n",
    "                   add_rsa=True, rsa_method=\"Sander\", add_rsa_class=False,\n",
    "                   reset_res_id=True)\n",
    "\n",
    "# does nothing\n",
    "sifts = filter_sifts(sifts, excluded_cols=None, chains=None,\n",
    "                     chain_auth=None, res=None, uniprot=None, site=None)\n",
    "\n",
    "# adds 'res_full'\n",
    "validation = filter_validation(validation, excluded_cols=None, chains=None, res=None,\n",
    "                      add_res_full=True)\n",
    "\n",
    "# annotation residue aggregation\n",
    "annotation = filter_annotation(annotation, identifier=None, annotation_agg=True, \n",
    "                               query_type='', group_residues=True,\n",
    "                               drop_types=('Helix', 'Beta strand', 'Turn', 'Chain'))\n",
    "\n",
    "table = table_merger(mmcif_table=mmcif, dssp_table=dssp, sifts_table=sifts,\n",
    "                     validation_table=validation, annotation_table=annotation,\n",
    "                     variants_table=variants)\n",
    "print(table.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Automating all the work done so far with the Merger class"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   index pdbx_PDB_model_num auth_asym_id auth_seq_id group_PDB  id  \\\n",
      "0      0                  1            A         118      ATOM   1   \n",
      "1      1                  1            A         119      ATOM   8   \n",
      "2      1                  1            A         119      ATOM   8   \n",
      "3      2                  1            A         120      ATOM  15   \n",
      "4      3                  1            A         121      ATOM  29   \n",
      "\n",
      "  type_symbol label_atom_id label_alt_id label_comp_id  \\\n",
      "0           N             N            .           VAL   \n",
      "1           N             N            .           PRO   \n",
      "2           N             N            .           PRO   \n",
      "3           N             N            .           TRP   \n",
      "4           N             N            .           PHE   \n",
      "\n",
      "                         ...                             siftScore  \\\n",
      "0                        ...                                  0.14   \n",
      "1                        ...                                  0.01   \n",
      "2                        ...                          (0.03, 0.01)   \n",
      "3                        ...                                     0   \n",
      "4                        ...                                   NaN   \n",
      "\n",
      "  somaticStatus         sourceType   taxid      translation     type  \\\n",
      "0           0.0  large_scale_study  9606.0  ENSP00000448059  VARIANT   \n",
      "1           0.0  large_scale_study  9606.0  ENSP00000448059  VARIANT   \n",
      "2           0.0  large_scale_study  9606.0  ENSP00000448059  VARIANT   \n",
      "3           0.0  large_scale_study  9606.0  ENSP00000448059  VARIANT   \n",
      "4           0.0            uniprot  9606.0              NaN  VARIANT   \n",
      "\n",
      "   wildType     xrefs_id           xrefs_name  \\\n",
      "0         V  rs776442422                 ExAC   \n",
      "1         P  rs398123292  (1000Genomes, ExAC)   \n",
      "2         P  rs374999809          (ExAC, ESP)   \n",
      "3         W  rs775327122                 ExAC   \n",
      "4         F          NaN                  NaN   \n",
      "\n",
      "                                           xrefs_url  \n",
      "0  http://exac.broadinstitute.org/awesome?query=r...  \n",
      "1  (http://www.ensembl.org/Homo_sapiens/Variation...  \n",
      "2  (http://evs.gs.washington.edu/EVS/PopStatsServ...  \n",
      "3  http://exac.broadinstitute.org/awesome?query=r...  \n",
      "4                                                NaN  \n",
      "\n",
      "[5 rows x 142 columns]\n"
     ]
    }
   ],
   "source": [
    "from proteofav.mergers import Tables\n",
    "\n",
    "# files are read/stored in the directories defined in the user defined config.ini file.\n",
    "table = Tables.generate(merge_tables=True, uniprot_id=None, pdb_id=pdb_id, bio_unit=False,\n",
    "                        sifts=True, dssp=False, validation=True, annotations=True, variants=True,\n",
    "                        residue_agg='centroid', overwrite=False)\n",
    "print(table.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Use case 1: characterising the structural properties of protein posttranslational modified sites (or any other site)\n",
    "\n",
    "One can use ProteoFAV for high-throughput structural characterization of binding sites, such as in Britto-Borges and Barton, 2017.\n",
    "\n",
    "For example, the cAMP-dependent protein kinase catalytic subunit alpha (PKAα) is a small protein kinase that is critical homeostatic process in human tissue and in stress response in lower organisms [UniProt:P17612](http://www.uniprot.org/uniprot/P17612). Accordinly, the function of the protein has been extensively studied, including the three dimensional structure with high sequence coverage and resolution.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "uniprot_id = 'P17612'\n",
    "gff_path = os.path.join(out_dir, uniprot_id + \".gff\")\n",
    "\n",
    "Annotation.download(\n",
    "    identifier=uniprot_id, \n",
    "    filename=gff_path)\n",
    "P17612_annotation = Annotation.read(filename=gff_path)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>NAME</th>\n",
       "      <th>SOURCE</th>\n",
       "      <th>TYPE</th>\n",
       "      <th>START</th>\n",
       "      <th>END</th>\n",
       "      <th>SCORE</th>\n",
       "      <th>STRAND</th>\n",
       "      <th>FRAME</th>\n",
       "      <th>GROUP</th>\n",
       "      <th>Dbxref</th>\n",
       "      <th>ID</th>\n",
       "      <th>Note</th>\n",
       "      <th>Ontology_term</th>\n",
       "      <th>evidence</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>P17612</td>\n",
       "      <td>UniProtKB</td>\n",
       "      <td>Modified residue</td>\n",
       "      <td>11</td>\n",
       "      <td>11</td>\n",
       "      <td>.</td>\n",
       "      <td>.</td>\n",
       "      <td>.</td>\n",
       "      <td>Note=Phosphoserine%3B by autocatalysis;Ontolog...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[Phosphoserine; by autocatalysis]</td>\n",
       "      <td>[ECO:0000250]</td>\n",
       "      <td>[ECO:0000250|UniProtKB:P05132]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>P17612</td>\n",
       "      <td>UniProtKB</td>\n",
       "      <td>Modified residue</td>\n",
       "      <td>49</td>\n",
       "      <td>49</td>\n",
       "      <td>.</td>\n",
       "      <td>.</td>\n",
       "      <td>.</td>\n",
       "      <td>Note=Phosphothreonine;Ontology_term=ECO:000024...</td>\n",
       "      <td>[PMID:18691976]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[Phosphothreonine]</td>\n",
       "      <td>[ECO:0000244]</td>\n",
       "      <td>[ECO:0000244|PubMed:18691976]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>P17612</td>\n",
       "      <td>UniProtKB</td>\n",
       "      <td>Modified residue</td>\n",
       "      <td>140</td>\n",
       "      <td>140</td>\n",
       "      <td>.</td>\n",
       "      <td>.</td>\n",
       "      <td>.</td>\n",
       "      <td>Note=Phosphoserine;Ontology_term=ECO:0000250;e...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[Phosphoserine]</td>\n",
       "      <td>[ECO:0000250]</td>\n",
       "      <td>[ECO:0000250|UniProtKB:P05132]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>P17612</td>\n",
       "      <td>UniProtKB</td>\n",
       "      <td>Modified residue</td>\n",
       "      <td>196</td>\n",
       "      <td>196</td>\n",
       "      <td>.</td>\n",
       "      <td>.</td>\n",
       "      <td>.</td>\n",
       "      <td>Note=Phosphothreonine;Ontology_term=ECO:000026...</td>\n",
       "      <td>[PMID:12372837]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[Phosphothreonine]</td>\n",
       "      <td>[ECO:0000269]</td>\n",
       "      <td>[ECO:0000269|PubMed:12372837]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>P17612</td>\n",
       "      <td>UniProtKB</td>\n",
       "      <td>Modified residue</td>\n",
       "      <td>198</td>\n",
       "      <td>198</td>\n",
       "      <td>.</td>\n",
       "      <td>.</td>\n",
       "      <td>.</td>\n",
       "      <td>Note=Phosphothreonine%3B by PDPK1;Ontology_ter...</td>\n",
       "      <td>[PMID:12372837,PMID:16765046,PMID:20137943,PMI...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[Phosphothreonine; by PDPK1]</td>\n",
       "      <td>[ECO:0000269,ECO:0000269,ECO:0000269,ECO:00002...</td>\n",
       "      <td>[ECO:0000269|PubMed:12372837,ECO:0000269|PubMe...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>P17612</td>\n",
       "      <td>UniProtKB</td>\n",
       "      <td>Modified residue</td>\n",
       "      <td>202</td>\n",
       "      <td>202</td>\n",
       "      <td>.</td>\n",
       "      <td>.</td>\n",
       "      <td>.</td>\n",
       "      <td>Note=Phosphothreonine;Ontology_term=ECO:000026...</td>\n",
       "      <td>[PMID:17909264]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[Phosphothreonine]</td>\n",
       "      <td>[ECO:0000269]</td>\n",
       "      <td>[ECO:0000269|PubMed:17909264]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>P17612</td>\n",
       "      <td>UniProtKB</td>\n",
       "      <td>Modified residue</td>\n",
       "      <td>331</td>\n",
       "      <td>331</td>\n",
       "      <td>.</td>\n",
       "      <td>.</td>\n",
       "      <td>.</td>\n",
       "      <td>Note=Phosphotyrosine;Ontology_term=ECO:0000250...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[Phosphotyrosine]</td>\n",
       "      <td>[ECO:0000250]</td>\n",
       "      <td>[ECO:0000250|UniProtKB:P05132]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>P17612</td>\n",
       "      <td>UniProtKB</td>\n",
       "      <td>Modified residue</td>\n",
       "      <td>339</td>\n",
       "      <td>339</td>\n",
       "      <td>.</td>\n",
       "      <td>.</td>\n",
       "      <td>.</td>\n",
       "      <td>Note=Phosphoserine;Ontology_term=ECO:0000244,E...</td>\n",
       "      <td>[PMID:18691976,PMID:19690332,PMID:24275569,PMI...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[Phosphoserine]</td>\n",
       "      <td>[ECO:0000244,ECO:0000244,ECO:0000244,ECO:00002...</td>\n",
       "      <td>[ECO:0000244|PubMed:18691976,ECO:0000244|PubMe...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      NAME     SOURCE              TYPE START  END SCORE STRAND FRAME  \\\n",
       "10  P17612  UniProtKB  Modified residue    11   11     .      .     .   \n",
       "11  P17612  UniProtKB  Modified residue    49   49     .      .     .   \n",
       "12  P17612  UniProtKB  Modified residue   140  140     .      .     .   \n",
       "13  P17612  UniProtKB  Modified residue   196  196     .      .     .   \n",
       "14  P17612  UniProtKB  Modified residue   198  198     .      .     .   \n",
       "15  P17612  UniProtKB  Modified residue   202  202     .      .     .   \n",
       "16  P17612  UniProtKB  Modified residue   331  331     .      .     .   \n",
       "17  P17612  UniProtKB  Modified residue   339  339     .      .     .   \n",
       "\n",
       "                                                GROUP  \\\n",
       "10  Note=Phosphoserine%3B by autocatalysis;Ontolog...   \n",
       "11  Note=Phosphothreonine;Ontology_term=ECO:000024...   \n",
       "12  Note=Phosphoserine;Ontology_term=ECO:0000250;e...   \n",
       "13  Note=Phosphothreonine;Ontology_term=ECO:000026...   \n",
       "14  Note=Phosphothreonine%3B by PDPK1;Ontology_ter...   \n",
       "15  Note=Phosphothreonine;Ontology_term=ECO:000026...   \n",
       "16  Note=Phosphotyrosine;Ontology_term=ECO:0000250...   \n",
       "17  Note=Phosphoserine;Ontology_term=ECO:0000244,E...   \n",
       "\n",
       "                                               Dbxref   ID  \\\n",
       "10                                                NaN  NaN   \n",
       "11                                    [PMID:18691976]  NaN   \n",
       "12                                                NaN  NaN   \n",
       "13                                    [PMID:12372837]  NaN   \n",
       "14  [PMID:12372837,PMID:16765046,PMID:20137943,PMI...  NaN   \n",
       "15                                    [PMID:17909264]  NaN   \n",
       "16                                                NaN  NaN   \n",
       "17  [PMID:18691976,PMID:19690332,PMID:24275569,PMI...  NaN   \n",
       "\n",
       "                                 Note  \\\n",
       "10  [Phosphoserine; by autocatalysis]   \n",
       "11                 [Phosphothreonine]   \n",
       "12                    [Phosphoserine]   \n",
       "13                 [Phosphothreonine]   \n",
       "14       [Phosphothreonine; by PDPK1]   \n",
       "15                 [Phosphothreonine]   \n",
       "16                  [Phosphotyrosine]   \n",
       "17                    [Phosphoserine]   \n",
       "\n",
       "                                        Ontology_term  \\\n",
       "10                                      [ECO:0000250]   \n",
       "11                                      [ECO:0000244]   \n",
       "12                                      [ECO:0000250]   \n",
       "13                                      [ECO:0000269]   \n",
       "14  [ECO:0000269,ECO:0000269,ECO:0000269,ECO:00002...   \n",
       "15                                      [ECO:0000269]   \n",
       "16                                      [ECO:0000250]   \n",
       "17  [ECO:0000244,ECO:0000244,ECO:0000244,ECO:00002...   \n",
       "\n",
       "                                             evidence  \n",
       "10                     [ECO:0000250|UniProtKB:P05132]  \n",
       "11                      [ECO:0000244|PubMed:18691976]  \n",
       "12                     [ECO:0000250|UniProtKB:P05132]  \n",
       "13                      [ECO:0000269|PubMed:12372837]  \n",
       "14  [ECO:0000269|PubMed:12372837,ECO:0000269|PubMe...  \n",
       "15                      [ECO:0000269|PubMed:17909264]  \n",
       "16                     [ECO:0000250|UniProtKB:P05132]  \n",
       "17  [ECO:0000244|PubMed:18691976,ECO:0000244|PubMe...  "
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# phosphorylated sites in UniProt\n",
    "P17612_annotation[P17612_annotation.GROUP.str.contains('Note=Phospho')]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "phospho_residues = P17612_annotation.loc[P17612_annotation.GROUP.str.contains('Note=Phospho'), 'START']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "from proteofav.sifts import sifts_best"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "P17612_best_structure = sifts_best('P17612')['P17612'][0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "P17612_best_structure['experimental_method'] == 'X-ray diffraction'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "P17612_best_structure['tax_id'] == 9606  # human"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "table = Tables.generate(\n",
    "    merge_tables=True, \n",
    "    uniprot_id='P17612', \n",
    "    bio_unit=False,\n",
    "    sifts=True,\n",
    "    validation=True, \n",
    "    annotations=True, \n",
    "    residue_agg='centroid', \n",
    "    overwrite=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "# discard every structure residue that is not mapped to a UniProt position\n",
    "table.dropna(subset=['UniProt_dbResNum'], axis=0, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "table['UniProt_dbResNum'] = table['UniProt_dbResNum'].astype(int)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>index</th>\n",
       "      <th>pdbx_PDB_model_num</th>\n",
       "      <th>auth_asym_id</th>\n",
       "      <th>auth_seq_id</th>\n",
       "      <th>group_PDB</th>\n",
       "      <th>id</th>\n",
       "      <th>type_symbol</th>\n",
       "      <th>label_atom_id</th>\n",
       "      <th>label_alt_id</th>\n",
       "      <th>label_comp_id</th>\n",
       "      <th>...</th>\n",
       "      <th>CATH_regionResNum</th>\n",
       "      <th>CATH_dbAccessionId</th>\n",
       "      <th>Pfam_regionId</th>\n",
       "      <th>Pfam_regionStart</th>\n",
       "      <th>Pfam_regionEnd</th>\n",
       "      <th>Pfam_regionResNum</th>\n",
       "      <th>Pfam_dbAccessionId</th>\n",
       "      <th>annotation</th>\n",
       "      <th>site</th>\n",
       "      <th>accession</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <td>286</td>\n",
       "      <td>1</td>\n",
       "      <td>A</td>\n",
       "      <td>48</td>\n",
       "      <td>ATOM</td>\n",
       "      <td>277</td>\n",
       "      <td>N</td>\n",
       "      <td>N</td>\n",
       "      <td>.</td>\n",
       "      <td>THR</td>\n",
       "      <td>...</td>\n",
       "      <td>49</td>\n",
       "      <td>3.30.200.20</td>\n",
       "      <td>1</td>\n",
       "      <td>44.0</td>\n",
       "      <td>298.0</td>\n",
       "      <td>49</td>\n",
       "      <td>PF00069</td>\n",
       "      <td>Domain: ['Protein kinase'] (nan), Modified res...</td>\n",
       "      <td>49</td>\n",
       "      <td>P17612</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>130</th>\n",
       "      <td>39</td>\n",
       "      <td>1</td>\n",
       "      <td>A</td>\n",
       "      <td>139</td>\n",
       "      <td>ATOM</td>\n",
       "      <td>1040</td>\n",
       "      <td>N</td>\n",
       "      <td>N</td>\n",
       "      <td>.</td>\n",
       "      <td>SER</td>\n",
       "      <td>...</td>\n",
       "      <td>140</td>\n",
       "      <td>1.10.510.10</td>\n",
       "      <td>1</td>\n",
       "      <td>44.0</td>\n",
       "      <td>298.0</td>\n",
       "      <td>140</td>\n",
       "      <td>PF00069</td>\n",
       "      <td>Domain: ['Protein kinase'] (nan), Modified res...</td>\n",
       "      <td>140</td>\n",
       "      <td>P17612</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>189</th>\n",
       "      <td>101</td>\n",
       "      <td>1</td>\n",
       "      <td>A</td>\n",
       "      <td>195</td>\n",
       "      <td>ATOM</td>\n",
       "      <td>1517</td>\n",
       "      <td>N</td>\n",
       "      <td>N</td>\n",
       "      <td>.</td>\n",
       "      <td>THR</td>\n",
       "      <td>...</td>\n",
       "      <td>196</td>\n",
       "      <td>1.10.510.10</td>\n",
       "      <td>1</td>\n",
       "      <td>44.0</td>\n",
       "      <td>298.0</td>\n",
       "      <td>196</td>\n",
       "      <td>PF00069</td>\n",
       "      <td>Domain: ['Protein kinase'] (nan), Modified res...</td>\n",
       "      <td>196</td>\n",
       "      <td>P17612</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>191</th>\n",
       "      <td>103</td>\n",
       "      <td>1</td>\n",
       "      <td>A</td>\n",
       "      <td>197</td>\n",
       "      <td>HETATM</td>\n",
       "      <td>1538</td>\n",
       "      <td>N</td>\n",
       "      <td>N</td>\n",
       "      <td>.</td>\n",
       "      <td>TPO</td>\n",
       "      <td>...</td>\n",
       "      <td>198</td>\n",
       "      <td>1.10.510.10</td>\n",
       "      <td>1</td>\n",
       "      <td>44.0</td>\n",
       "      <td>298.0</td>\n",
       "      <td>198</td>\n",
       "      <td>PF00069</td>\n",
       "      <td>Domain: ['Protein kinase'] (nan), Modified res...</td>\n",
       "      <td>198</td>\n",
       "      <td>P17612</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>195</th>\n",
       "      <td>109</td>\n",
       "      <td>1</td>\n",
       "      <td>A</td>\n",
       "      <td>201</td>\n",
       "      <td>ATOM</td>\n",
       "      <td>1567</td>\n",
       "      <td>N</td>\n",
       "      <td>N</td>\n",
       "      <td>.</td>\n",
       "      <td>THR</td>\n",
       "      <td>...</td>\n",
       "      <td>202</td>\n",
       "      <td>1.10.510.10</td>\n",
       "      <td>1</td>\n",
       "      <td>44.0</td>\n",
       "      <td>298.0</td>\n",
       "      <td>202</td>\n",
       "      <td>PF00069</td>\n",
       "      <td>Domain: ['Protein kinase'] (nan), Mutagenesis:...</td>\n",
       "      <td>202</td>\n",
       "      <td>P17612</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>326</th>\n",
       "      <td>251</td>\n",
       "      <td>1</td>\n",
       "      <td>A</td>\n",
       "      <td>330</td>\n",
       "      <td>ATOM</td>\n",
       "      <td>2586</td>\n",
       "      <td>N</td>\n",
       "      <td>N</td>\n",
       "      <td>.</td>\n",
       "      <td>TYR</td>\n",
       "      <td>...</td>\n",
       "      <td>331</td>\n",
       "      <td>3.30.200.20</td>\n",
       "      <td>-</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Domain: ['AGC-kinase C-terminal'] (nan), Modif...</td>\n",
       "      <td>331</td>\n",
       "      <td>P17612</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>334</th>\n",
       "      <td>259</td>\n",
       "      <td>1</td>\n",
       "      <td>A</td>\n",
       "      <td>338</td>\n",
       "      <td>HETATM</td>\n",
       "      <td>2648</td>\n",
       "      <td>N</td>\n",
       "      <td>N</td>\n",
       "      <td>.</td>\n",
       "      <td>SEP</td>\n",
       "      <td>...</td>\n",
       "      <td>339</td>\n",
       "      <td>3.30.200.20</td>\n",
       "      <td>-</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Domain: ['AGC-kinase C-terminal'] (nan), Modif...</td>\n",
       "      <td>339</td>\n",
       "      <td>P17612</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>7 rows × 91 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     index pdbx_PDB_model_num auth_asym_id auth_seq_id group_PDB    id  \\\n",
       "35     286                  1            A          48      ATOM   277   \n",
       "130     39                  1            A         139      ATOM  1040   \n",
       "189    101                  1            A         195      ATOM  1517   \n",
       "191    103                  1            A         197    HETATM  1538   \n",
       "195    109                  1            A         201      ATOM  1567   \n",
       "326    251                  1            A         330      ATOM  2586   \n",
       "334    259                  1            A         338    HETATM  2648   \n",
       "\n",
       "    type_symbol label_atom_id label_alt_id label_comp_id    ...     \\\n",
       "35            N             N            .           THR    ...      \n",
       "130           N             N            .           SER    ...      \n",
       "189           N             N            .           THR    ...      \n",
       "191           N             N            .           TPO    ...      \n",
       "195           N             N            .           THR    ...      \n",
       "326           N             N            .           TYR    ...      \n",
       "334           N             N            .           SEP    ...      \n",
       "\n",
       "    CATH_regionResNum CATH_dbAccessionId Pfam_regionId Pfam_regionStart  \\\n",
       "35                 49        3.30.200.20             1             44.0   \n",
       "130               140        1.10.510.10             1             44.0   \n",
       "189               196        1.10.510.10             1             44.0   \n",
       "191               198        1.10.510.10             1             44.0   \n",
       "195               202        1.10.510.10             1             44.0   \n",
       "326               331        3.30.200.20             -              0.0   \n",
       "334               339        3.30.200.20             -              0.0   \n",
       "\n",
       "     Pfam_regionEnd  Pfam_regionResNum  Pfam_dbAccessionId  \\\n",
       "35            298.0                 49             PF00069   \n",
       "130           298.0                140             PF00069   \n",
       "189           298.0                196             PF00069   \n",
       "191           298.0                198             PF00069   \n",
       "195           298.0                202             PF00069   \n",
       "326             0.0                NaN                 NaN   \n",
       "334             0.0                NaN                 NaN   \n",
       "\n",
       "                                            annotation  site accession  \n",
       "35   Domain: ['Protein kinase'] (nan), Modified res...    49    P17612  \n",
       "130  Domain: ['Protein kinase'] (nan), Modified res...   140    P17612  \n",
       "189  Domain: ['Protein kinase'] (nan), Modified res...   196    P17612  \n",
       "191  Domain: ['Protein kinase'] (nan), Modified res...   198    P17612  \n",
       "195  Domain: ['Protein kinase'] (nan), Mutagenesis:...   202    P17612  \n",
       "326  Domain: ['AGC-kinase C-terminal'] (nan), Modif...   331    P17612  \n",
       "334  Domain: ['AGC-kinase C-terminal'] (nan), Modif...   339    P17612  \n",
       "\n",
       "[7 rows x 91 columns]"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table[table['UniProt_dbResNum'].isin(phospho_residues)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "phospho_residues_b = table.loc[table['UniProt_dbResNum'].isin(phospho_residues), 'B_iso_or_equiv'].mean()\n",
    "all_residues_b = table.loc[:, 'B_iso_or_equiv'].mean()\n",
    "\n",
    "phospho_residues_b > all_residues_b"
   ]
  },
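  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Overall, phosphorylated Ser/Thr residues tend to have high B-factors ('hot' residues), but that is not the case for the `3ovv` structure, where the mean B-factor of the phosphosites is not above the mean over all residues."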
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "T    4\n",
       "H    2\n",
       "E    1\n",
       "Name: PDB_codeSecondaryStructure, dtype: int64"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table.loc[table['UniProt_dbResNum'].isin(phospho_residues), 'PDB_codeSecondaryStructure'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4 of the 7 residues occur in turns (`T`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Observed'"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table.loc[table['UniProt_dbResNum'].isin(phospho_residues), 'PDB_Annotation'].all()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All of these residues were observed in the structure; none is listed as missing in REMARK 465."
   ]
  },
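  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of a stricter version of this check, run on a small made-up Series rather than on the real table. It assumes that observed residues carry the exact string `'Observed'` in `PDB_Annotation`; the name `toy_annotation` and its values are hypothetical. Comparing against the label first makes `.all()` reduce over booleans instead of over strings, where any non-empty string counts as true."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# hypothetical stand-in for table.loc[mask, 'PDB_Annotation']\n",
    "toy_annotation = pd.Series(['Observed', 'Observed', 'Not_Observed'])\n",
    "\n",
    "# element-wise comparison first, then reduce: True only if every label matches\n",
    "all_observed = (toy_annotation == 'Observed').all()\n",
    "all_observed"
   ]
  },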
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "35     Favored\n",
       "130    Favored\n",
       "189    Favored\n",
       "191        NaN\n",
       "195    Favored\n",
       "326    Favored\n",
       "334        NaN\n",
       "Name: validation_rama, dtype: object"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table.loc[table['UniProt_dbResNum'].isin(phospho_residues), 'validation_rama']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "5 out of 7 residues are in favored Ramachandran regions, so none of them is an outlier. The two NaN values belong to the phosphorylated residues (`TPO` and `SEP`) observed in the protein crystal."
   ]
  },
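  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same tally can be made explicit by counting each class, missing values included. This is a sketch on made-up data: `toy_rama` is a hypothetical stand-in for the `validation_rama` column, not output from ProteoFAV."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# hypothetical stand-in for table.loc[mask, 'validation_rama']\n",
    "toy_rama = pd.Series(['Favored', 'Favored', np.nan, 'Favored', np.nan])\n",
    "\n",
    "# dropna=False keeps NaN as its own category instead of silently dropping it\n",
    "toy_rama.value_counts(dropna=False)"
   ]
  },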
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Use case 2: Spatial clustering of genetic variants"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (proteofav)",
   "language": "python",
   "name": "proteofav"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}