{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ProteoFAV - Protein Features, Annotations and Variants\n", "\n", "Open-source framework for simple and fast integration of protein structure data with sequence annotations and genetic variation.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installation and configuration\n", "See the instructions in the main README.md, available at https://github.com/bartongroup/ProteoFAV" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import proteofav" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configuration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ProteoFAV implements two approaches to handling datasets. One can fetch a few files on the fly using the functions provided. For large-scale studies, however, it is preferable to use a local copy of the data, such as the mmCIF files for three-dimensional protein structures." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Setting Logging Level\n", "from proteofav.config import logging\n", "logger = logging.getLogger()\n", "assert len(logger.handlers) == 1\n", "handler = logger.handlers[0]\n", "handler.setLevel(logging.WARNING)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Downloading a protein structure in mmCIF and PDB format" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import os\n", "from proteofav.structures import mmCIF, PDB\n", "\n", "pdb_id = \"2pah\"\n", "\n", "# create tmp dir\n", "out_dir = os.path.join(os.getcwd(), \"tmp\")\n", "os.makedirs(out_dir, exist_ok=True)\n", "\n", "# output file names\n", "out_mmcif = os.path.join(out_dir, \"{}.cif\".format(pdb_id))\n", "out_mmcif_bio = os.path.join(out_dir, \"{}_bio.cif\".format(pdb_id))\n", "out_pdb = os.path.join(out_dir, \"{}.pdb\".format(pdb_id))\n", "\n", "# download structures\n", "mmCIF.download(identifier=pdb_id, filename=out_mmcif)\n", "mmCIF.download(identifier=pdb_id, filename=out_mmcif_bio, \n", " bio_unit=True, bio_unit_preferred=True)\n", "PDB.download(identifier=pdb_id, filename=out_pdb)\n", "\n", "assert os.path.exists(out_mmcif)\n", "assert os.path.exists(out_mmcif_bio)\n", "assert os.path.exists(out_pdb)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the structures onto a Pandas DataFrame" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " group_PDB id type_symbol label_atom_id label_alt_id label_comp_id \\\n", "0 ATOM 1 N N . VAL \n", "1 ATOM 2 C CA . VAL \n", "2 ATOM 3 C C . VAL \n", "3 ATOM 4 O O . VAL \n", "4 ATOM 5 C CB . VAL \n", "\n", " label_asym_id label_entity_id label_seq_id pdbx_PDB_ins_code \\\n", "0 A 1 1 ? \n", "1 A 1 1 ? \n", "2 A 1 1 ? \n", "3 A 1 1 ? \n", "4 A 1 1 ? \n", "\n", " ... 
Cartn_z occupancy B_iso_or_equiv pdbx_formal_charge \\\n", "0 ... 18.770 1.0 56.51 ? \n", "1 ... 20.244 1.0 59.09 ? \n", "2 ... 20.700 1.0 44.63 ? \n", "3 ... 20.204 1.0 59.84 ? \n", "4 ... 20.638 1.0 53.90 ? \n", "\n", " auth_seq_id auth_comp_id auth_asym_id auth_atom_id pdbx_PDB_model_num \\\n", "0 118 VAL A N 1 \n", "1 118 VAL A CA 1 \n", "2 118 VAL A C 1 \n", "3 118 VAL A O 1 \n", "4 118 VAL A CB 1 \n", "\n", " pdbe_label_seq_id \n", "0 1 \n", "1 1 \n", "2 1 \n", "3 1 \n", "4 1 \n", "\n", "[5 rows x 22 columns]\n", "Index(['group_PDB', 'id', 'type_symbol', 'label_atom_id', 'label_alt_id',\n", " 'label_comp_id', 'label_asym_id', 'label_entity_id', 'label_seq_id',\n", " 'pdbx_PDB_ins_code', 'Cartn_x', 'Cartn_y', 'Cartn_z', 'occupancy',\n", " 'B_iso_or_equiv', 'pdbx_formal_charge', 'auth_seq_id', 'auth_comp_id',\n", " 'auth_asym_id', 'auth_atom_id', 'pdbx_PDB_model_num',\n", " 'pdbe_label_seq_id'],\n", " dtype='object')\n" ] } ], "source": [ "mmcif = mmCIF.read(filename=out_mmcif)\n", "print(mmcif.head())\n", "print(mmcif.columns)\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " group_PDB id type_symbol label_atom_id label_alt_id label_comp_id \\\n", "0 ATOM 1 N N . VAL \n", "1 ATOM 2 C CA . VAL \n", "2 ATOM 3 C C . VAL \n", "3 ATOM 4 O O . VAL \n", "4 ATOM 5 C CB . VAL \n", "\n", " label_asym_id label_entity_id label_seq_id pdbx_PDB_ins_code \\\n", "0 A 1 1 ? \n", "1 A 1 1 ? \n", "2 A 1 1 ? \n", "3 A 1 1 ? \n", "4 A 1 1 ? \n", "\n", " ... B_iso_or_equiv pdbx_formal_charge auth_seq_id \\\n", "0 ... 56.51 ? 118 \n", "1 ... 59.09 ? 118 \n", "2 ... 44.63 ? 118 \n", "3 ... 59.84 ? 118 \n", "4 ... 53.90 ? 
118 \n", "\n", " auth_comp_id auth_asym_id auth_atom_id pdbx_PDB_model_num \\\n", "0 VAL A N 1 \n", "1 VAL A CA 1 \n", "2 VAL A C 1 \n", "3 VAL A O 1 \n", "4 VAL A CB 1 \n", "\n", " pdbe_label_seq_id orig_label_asym_id orig_auth_asym_id \n", "0 1 A A \n", "1 1 A A \n", "2 1 A A \n", "3 1 A A \n", "4 1 A A \n", "\n", "[5 rows x 24 columns]\n", "Index(['group_PDB', 'id', 'type_symbol', 'label_atom_id', 'label_alt_id',\n", " 'label_comp_id', 'label_asym_id', 'label_entity_id', 'label_seq_id',\n", " 'pdbx_PDB_ins_code', 'Cartn_x', 'Cartn_y', 'Cartn_z', 'occupancy',\n", " 'B_iso_or_equiv', 'pdbx_formal_charge', 'auth_seq_id', 'auth_comp_id',\n", " 'auth_asym_id', 'auth_atom_id', 'pdbx_PDB_model_num',\n", " 'pdbe_label_seq_id', 'orig_label_asym_id', 'orig_auth_asym_id'],\n", " dtype='object')\n" ] } ], "source": [ "mmcif_bio = mmCIF.read(filename=out_mmcif_bio)\n", "print(mmcif_bio.head())\n", "print(mmcif_bio.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a formal description of each column, please see http://mmcif.wwpdb.org/" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " group_PDB id label_atom_id label_alt_id label_comp_id label_asym_id \\\n", "0 HETATM 5316 FE F E . \n", "1 HETATM 5317 FE F E . \n", "\n", " label_seq_id_full label_seq_id pdbx_PDB_ins_code Cartn_x \\\n", "0 FE C FE C ? . ? 6 \n", "1 FE D FE D ? . ? -39 \n", "\n", " ... Cartn_z occupancy B_iso_or_equiv type_symbol \\\n", "0 ... 284 42.9 25 1.0 0 84. FE \n", "1 ... 235 43.6 84 1.0 0 91. FE \n", "\n", " auth_atom_id auth_comp_id auth_asym_id auth_seq_id_full auth_seq_id \\\n", "0 FE E . FE C FE C \n", "1 FE E . 
FE D FE D \n", "\n", " pdbx_PDB_model_num \n", "0 1 \n", "1 1 \n", "\n", "[2 rows x 21 columns]\n", "Index(['group_PDB', 'id', 'label_atom_id', 'label_alt_id', 'label_comp_id',\n", " 'label_asym_id', 'label_seq_id_full', 'label_seq_id',\n", " 'pdbx_PDB_ins_code', 'Cartn_x', 'Cartn_y', 'Cartn_z', 'occupancy',\n", " 'B_iso_or_equiv', 'type_symbol', 'auth_atom_id', 'auth_comp_id',\n", " 'auth_asym_id', 'auth_seq_id_full', 'auth_seq_id',\n", " 'pdbx_PDB_model_num'],\n", " dtype='object')\n" ] } ], "source": [ "# Column names for a PDB file mimic those of the mmCIF format\n", "# Prefer processing mmCIF over PDB files; the PDB format is deprecated\n", "pdb = PDB.read(filename=out_pdb)\n", "print(pdb.head())\n", "print(pdb.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Downloading a SIFTS XML record to obtain the PDB-UniProt mapping" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from proteofav.sifts import SIFTS\n", "\n", "# output file names\n", "out_sifts = os.path.join(out_dir, \"{}.xml\".format(pdb_id))\n", "\n", "SIFTS.download(identifier=pdb_id, filename=out_sifts)\n", "\n", "assert os.path.exists(out_sifts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the SIFTS record" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " PDB_regionId PDB_regionStart PDB_regionEnd PDB_regionResNum \\\n", "0 1 1 335 1 \n", "1 1 1 335 2 \n", "2 1 1 335 3 \n", "3 1 1 335 4 \n", "4 1 1 335 5 \n", "\n", " PDB_dbAccessionId PDB_dbResNum PDB_dbResName PDB_dbChainId PDB_Annotation \\\n", "0 2pah 118 VAL A Observed \n", "1 2pah 119 PRO A Observed \n", "2 2pah 120 TRP A Observed \n", "3 2pah 121 PHE A Observed \n", "4 2pah 122 PRO A Observed \n", "\n", " PDB_entityId ... SCOP_regionEnd SCOP_regionResNum \\\n", "0 A ... 335 1 \n", "1 A ... 335 2 \n", "2 A ... 335 3 \n", "3 A ... 335 4 \n", "4 A ... 
335 5 \n", "\n", " SCOP_dbAccessionId PDB_codeSecondaryStructure PDB_nameSecondaryStructure \\\n", "0 42581 T loop \n", "1 42581 T loop \n", "2 42581 T loop \n", "3 42581 T loop \n", "4 42581 T loop \n", "\n", " Pfam_regionId Pfam_regionStart Pfam_regionEnd Pfam_regionResNum \\\n", "0 - 0 0 NaN \n", "1 1 2 332 2 \n", "2 1 2 332 3 \n", "3 1 2 332 4 \n", "4 1 2 332 5 \n", "\n", " Pfam_dbAccessionId \n", "0 NaN \n", "1 PF00351 \n", "2 PF00351 \n", "3 PF00351 \n", "4 PF00351 \n", "\n", "[5 rows x 34 columns]\n" ] } ], "source": [ "sifts = SIFTS.read(filename=out_sifts)\n", "print(sifts.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The SIFTS record also contains mappings to many other databases, such as:\n", "- CATH\n", "- SCOP\n", "- Pfam\n", "\n", "Bear in mind that SIFTS mapping occurs at the residue level, but also at the domain level. \n", "The default action is to load the residue mapping.\n", "\n", "Also see the *PDB_Annotation* column, which flags several types of annotation at the residue level, for example whether a given UniProt residue was observed." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Downloading a DSSP record to obtain secondary structure information" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n" ] } ], "source": [ "from proteofav.dssp import DSSP\n", "\n", "# output file names\n", "out_dssp = os.path.join(out_dir, \"{}.dssp\".format(pdb_id))\n", "\n", "DSSP.download(identifier=pdb_id, filename=out_dssp)\n", "\n", "# sometimes fetching from the DSSP FTP server at ftp://ftp.cmbi.ru.nl/pub/molbio/data/dssp/ times out...\n", "print(os.path.exists(out_dssp))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the DSSP record" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " RES RES_FULL INSCODE CHAIN AA SS ACC TCO KAPPA ALPHA PHI PSI\n", "0 118 118 A V 127 0.000 360.0 360.0 360.0 124.7\n", "1 119 119 A P 42 -0.071 360.0 -105.4 -51.6 149.9\n", "2 120 120 A W 120 -0.593 41.9 -178.8 -81.0 139.2\n", "3 121 121 A F 17 -0.980 33.0 -92.9 -138.9 150.9\n", "4 122 122 A P 4 -0.405 27.9 -176.6 -65.3 130.9\n" ] } ], "source": [ "dssp = DSSP.read(filename=out_dssp)\n", "print(dssp.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Downloading a PDBe Validation XML record" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from proteofav.validation import Validation\n", "\n", "out_validation = os.path.join(out_dir, \"{}_validation.xml\".format(pdb_id))\n", "\n", "Validation.download(identifier=pdb_id, filename=out_validation)\n", "\n", "assert os.path.exists(out_validation)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the Validation record" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " validation_rscc 
validation_rama validation_icode validation_ligRSRZ \\\n", "0 0.896 NaN ? NaN \n", "1 0.960 Favored ? NaN \n", "2 0.961 Favored ? NaN \n", "3 0.920 Favored ? NaN \n", "4 0.973 Favored ? NaN \n", "\n", " validation_ligRSRnbrMean validation_flippable-sidechain validation_psi \\\n", "0 NaN NaN NaN \n", "1 NaN NaN 149.9 \n", "2 NaN NaN 139.2 \n", "3 NaN NaN 150.9 \n", "4 NaN NaN 130.9 \n", "\n", " validation_rsr validation_owab validation_ligRSRnumnbrs ... \\\n", "0 0.233 52.97 NaN ... \n", "1 0.190 28.84 NaN ... \n", "2 0.154 33.47 NaN ... \n", "3 0.229 39.98 NaN ... \n", "4 0.197 26.14 NaN ... \n", "\n", " validation_chain validation_phi validation_said validation_rsrz \\\n", "0 A NaN A -0.160 \n", "1 A -51.6 A -0.274 \n", "2 A -81.0 A -0.874 \n", "3 A -138.9 A -0.308 \n", "4 A -65.3 A -0.204 \n", "\n", " validation_seq validation_ligRSRnbrStdev validation_altcode \\\n", "0 1 NaN . \n", "1 2 NaN . \n", "2 3 NaN . \n", "3 4 NaN . \n", "4 5 NaN . \n", "\n", " validation_lig_rsrz_nbr_id validation_NatomsEDS validation_resnum \n", "0 NaN 7 118 \n", "1 NaN 7 119 \n", "2 NaN 14 120 \n", "3 NaN 11 121 \n", "4 NaN 7 122 \n", "\n", "[5 rows x 27 columns]\n" ] } ], "source": [ "validation = Validation.read(filename=out_validation)\n", "print(validation.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The PDBe validation record is convenient when filtering a protein structure for analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting only CA atoms for a single chain" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A protein structure is represented as a hierarchical data structure (see http://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ), so ProteoFAV transforms the data to obtain a tabular format. For example, for use cases that require one residue per row, a residue's three-dimensional coordinates can be represented by those of its Cα atom. 
Other filtering options are available through the parameters of *filter_structures*" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " group_PDB id type_symbol label_atom_id label_alt_id label_comp_id \\\n", "1 ATOM 2 C CA . VAL \n", "8 ATOM 9 C CA . PRO \n", "15 ATOM 16 C CA . TRP \n", "29 ATOM 30 C CA . PHE \n", "40 ATOM 41 C CA . PRO \n", "\n", " label_asym_id label_entity_id label_seq_id pdbx_PDB_ins_code \\\n", "1 A 1 1 ? \n", "8 A 1 2 ? \n", "15 A 1 3 ? \n", "29 A 1 4 ? \n", "40 A 1 5 ? \n", "\n", " ... B_iso_or_equiv pdbx_formal_charge auth_seq_id \\\n", "1 ... 59.09 ? 118 \n", "8 ... 20.13 ? 119 \n", "15 ... 33.96 ? 120 \n", "29 ... 34.42 ? 121 \n", "40 ... 28.65 ? 122 \n", "\n", " auth_comp_id auth_asym_id auth_atom_id pdbx_PDB_model_num \\\n", "1 VAL A CA 1 \n", "8 PRO A CA 1 \n", "15 TRP A CA 1 \n", "29 PHE A CA 1 \n", "40 PRO A CA 1 \n", "\n", " pdbe_label_seq_id label_seq_id_full auth_seq_id_full \n", "1 1 1 118 \n", "8 2 2 119 \n", "15 3 3 120 \n", "29 4 4 121 \n", "40 5 5 122 \n", "\n", "[5 rows x 24 columns]\n" ] } ], "source": [ "from proteofav.structures import filter_structures\n", "\n", "mmcif_sel = filter_structures(mmcif, excluded_cols=None,\n", " models='first', chains='A', res=None, res_full=None,\n", " comps=None, atoms='CA', lines=None, category='auth',\n", " residue_agg=False, \n", " add_res_full=True, add_atom_altloc=False, reset_atom_id=True,\n", " remove_altloc=False, remove_hydrogens=True, remove_partial_res=False)\n", "print(mmcif_sel.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Aggregating atoms residue-by-residue\n", "The three-dimensional coordinates of all atoms in a residue can be represented by the residue's centroid" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " index pdbx_PDB_model_num auth_asym_id auth_seq_id group_PDB id \\\n", "0 0 1 A 118 ATOM 1 \n", "1 1 
1 A 119 ATOM 8 \n", "2 2 1 A 120 ATOM 15 \n", "3 3 1 A 121 ATOM 29 \n", "4 4 1 A 122 ATOM 40 \n", "\n", " type_symbol label_atom_id label_alt_id label_comp_id ... \\\n", "0 N N . VAL ... \n", "1 N N . PRO ... \n", "2 N N . TRP ... \n", "3 N N . PHE ... \n", "4 N N . PRO ... \n", "\n", " pdbx_PDB_ins_code Cartn_x Cartn_y Cartn_z occupancy \\\n", "0 ? -7.310714 21.031714 20.424143 1.0 \n", "1 ? -4.434571 21.470714 22.630714 1.0 \n", "2 ? -1.007000 15.673429 23.737071 1.0 \n", "3 ? -5.282455 15.784000 26.269091 1.0 \n", "4 ? -2.392571 13.578286 28.745714 1.0 \n", "\n", " B_iso_or_equiv pdbx_formal_charge auth_comp_id auth_atom_id \\\n", "0 52.974286 ? VAL N \n", "1 28.844286 ? PRO N \n", "2 33.466429 ? TRP N \n", "3 39.981818 ? PHE N \n", "4 26.141429 ? PRO N \n", "\n", " pdbe_label_seq_id \n", "0 1 \n", "1 2 \n", "2 3 \n", "3 4 \n", "4 5 \n", "\n", "[5 rows x 23 columns]\n" ] } ], "source": [ "from proteofav.structures import residues_aggregation\n", "\n", "mmcif_sel = residues_aggregation(mmcif, agg_method='centroid', category='auth')\n", "print(mmcif_sel.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Write a PDB-formatted file from a mmCIF structure" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "new_out_pdb = os.path.join(out_dir, \"{}_new.pdb\".format(pdb_id)) \n", "PDB.write(table=mmcif, filename=new_out_pdb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get a UniProt-PDB mapping from the SIFTS xml" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " PDB_regionId PDB_regionStart PDB_regionEnd PDB_regionResNum \\\n", "0 1 1 335 1 \n", "1 1 1 335 2 \n", "2 1 1 335 3 \n", "3 1 1 335 4 \n", "4 1 1 335 5 \n", "\n", " PDB_dbAccessionId PDB_dbResNum PDB_dbResName PDB_dbChainId PDB_Annotation \\\n", "0 2pah 118 VAL A Observed \n", "1 2pah 119 PRO A Observed \n", "2 2pah 120 
TRP A Observed \n", "3 2pah 121 PHE A Observed \n", "4 2pah 122 PRO A Observed \n", "\n", " PDB_entityId ... SCOP_regionEnd SCOP_regionResNum \\\n", "0 A ... 335 1 \n", "1 A ... 335 2 \n", "2 A ... 335 3 \n", "3 A ... 335 4 \n", "4 A ... 335 5 \n", "\n", " SCOP_dbAccessionId PDB_codeSecondaryStructure PDB_nameSecondaryStructure \\\n", "0 42581 T loop \n", "1 42581 T loop \n", "2 42581 T loop \n", "3 42581 T loop \n", "4 42581 T loop \n", "\n", " Pfam_regionId Pfam_regionStart Pfam_regionEnd Pfam_regionResNum \\\n", "0 - 0 0 NaN \n", "1 1 2 332 2 \n", "2 1 2 332 3 \n", "3 1 2 332 4 \n", "4 1 2 332 5 \n", "\n", " Pfam_dbAccessionId \n", "0 NaN \n", "1 PF00351 \n", "2 PF00351 \n", "3 PF00351 \n", "4 PF00351 \n", "\n", "[5 rows x 34 columns]\n", "['P00439']\n" ] } ], "source": [ "sifts = SIFTS.read(filename=out_sifts)\n", "print(sifts.head())\n", "\n", "uniprot_ids = sifts.UniProt_dbAccessionId.unique()\n", "print(uniprot_ids)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Downloading a sequence Annotation (GFF) from UniProt\n", "UniProt provides extensive, high-quality annotation for residues in proteins" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "from proteofav.annotation import Annotation\n", "\n", "out_annotation = os.path.join(out_dir, \"{}.gff\".format(uniprot_ids[0]))\n", "\n", "Annotation.download(identifier=uniprot_ids[0], filename=out_annotation)\n", "\n", "assert os.path.exists(out_annotation)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the sequence Annotation\n", "Note also that GFF files, although tabular, contain an extra level of nesting in the `GROUP` column. ProteoFAV tries to unpack this information into separate columns" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " NAME SOURCE TYPE START END SCORE STRAND FRAME \\\n", "0 P00439 UniProtKB Chain 1 452 . . . 
\n", "1 P00439 UniProtKB Domain 36 114 . . . \n", "2 P00439 UniProtKB Metal binding 285 285 . . . \n", "3 P00439 UniProtKB Metal binding 290 290 . . . \n", "4 P00439 UniProtKB Metal binding 330 330 . . . \n", "\n", " GROUP Dbxref ID \\\n", "0 ID=PRO_0000205548;Note=Phenylalanine-4-hydroxy... NaN [PRO_0000205548] \n", "1 Note=ACT;Ontology_term=ECO:0000255;evidence=EC... NaN NaN \n", "2 Note=Iron%3B via tele nitrogen;Ontology_term=E... NaN NaN \n", "3 Note=Iron%3B via tele nitrogen;Ontology_term=E... NaN NaN \n", "4 Note=Iron;Ontology_term=ECO:0000250;evidence=E... NaN NaN \n", "\n", " Note Ontology_term \\\n", "0 [Phenylalanine-4-hydroxylase] NaN \n", "1 [ACT] [ECO:0000255] \n", "2 [Iron; via tele nitrogen] [ECO:0000250] \n", "3 [Iron; via tele nitrogen] [ECO:0000250] \n", "4 [Iron] [ECO:0000250] \n", "\n", " evidence \n", "0 NaN \n", "1 [ECO:0000255|PROSITE-ProRule:PRU01007] \n", "2 [ECO:0000250|UniProtKB:P04176] \n", "3 [ECO:0000250|UniProtKB:P04176] \n", "4 [ECO:0000250|UniProtKB:P04176] \n" ] } ], "source": [ "annotation = Annotation.read(filename=out_annotation)\n", "print(annotation.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Downloading variants based on the UniProt ID\n", "We could fetch genetic variants from UniProt and Ensembl with:\n", "\n", "```python\n", "Variants.fetch(identifier=uniprot_ids[0], id_source='uniprot', \n", " synonymous=False, uniprot_vars=True,\n", " ensembl_germline_vars=True, ensembl_somatic_vars=True)\n", "```\n", "\n", "but `Variants.select` handles merging of the Ensembl variants for us" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "from proteofav.variants import Variants\n", "\n", "uniprot, ensembl = Variants.select(identifier=uniprot_ids[0], id_source='uniprot', \n", " synonymous=False, uniprot_vars=True,\n", " ensembl_germline_vars=True, ensembl_somatic_vars=True)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Glancing over the variants" ] 
}, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " accession alternativeSequence association_description association_disease \\\n", "0 P00439 A NaN True \n", "1 P00439 L haplotypes 1,4 True \n", "2 P00439 L NaN True \n", "3 P00439 S haplotype 36 True \n", "4 P00439 V NaN True \n", "\n", " association_evidences_code \\\n", "0 ECO:0000269 \n", "1 ECO:0000269 \n", "2 NaN \n", "3 ECO:0000269 \n", "4 ECO:0000269 \n", "\n", " association_evidences_source_alternativeUrl \\\n", "0 [http://europepmc.org/abstract/MED/22513348, h... \n", "1 [http://europepmc.org/abstract/MED/22513348, h... \n", "2 NaN \n", "3 http://europepmc.org/abstract/MED/2014802 \n", "4 [http://europepmc.org/abstract/MED/8889590, ht... \n", "\n", " association_evidences_source_id \\\n", "0 [22513348, 8889590, 8088845, 12501224] \n", "1 [22513348, 1672290, 8889590, 12501224, 1672294] \n", "2 NaN \n", "3 2014802 \n", "4 [8889590, 12501224, 22513348, 23792259] \n", "\n", " association_evidences_source_name \\\n", "0 PubMed \n", "1 PubMed \n", "2 NaN \n", "3 PubMed \n", "4 PubMed \n", "\n", " association_evidences_source_url \\\n", "0 [http://www.ncbi.nlm.nih.gov/pubmed/22513348, ... \n", "1 [http://www.ncbi.nlm.nih.gov/pubmed/22513348, ... \n", "2 NaN \n", "3 http://www.ncbi.nlm.nih.gov/pubmed/2014802 \n", "4 [http://www.ncbi.nlm.nih.gov/pubmed/8889590, h... \n", "\n", " association_name \\\n", "0 [Phenylketonuria (PKU), Hyperphenylalaninemia ... \n", "1 Phenylketonuria (PKU) \n", "2 Hyperphenylalaninemia (HPA) \n", "3 Phenylketonuria (PKU) \n", "4 Phenylketonuria (PKU) \n", "\n", " ... siftPrediction siftScore \\\n", "0 ... tolerated 0.11 \n", "1 ... deleterious 0 \n", "2 ... NaN NaN \n", "3 ... NaN NaN \n", "4 ... 
tolerated 0.06 \n", "\n", " somaticStatus sourceType taxid type wildType xrefs_id \\\n", "0 0 mixed 9606 VARIANT V rs796052017 \n", "1 0 mixed 9606 VARIANT P rs5030851 \n", "2 0 uniprot 9606 VARIANT Q rs199475662 \n", "3 0 uniprot 9606 VARIANT L rs62642930 \n", "4 0 mixed 9606 VARIANT A rs5030857 \n", "\n", " xrefs_name \\\n", "0 [dbSNP, Ensembl, 1000Genomes, ESP, ExAC] \n", "1 [dbSNP, Ensembl, ESP, ExAC] \n", "2 [dbSNP, Ensembl] \n", "3 [dbSNP, Ensembl] \n", "4 [dbSNP, Ensembl, 1000Genomes, ESP, ExAC] \n", "\n", " xrefs_url \n", "0 [http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?t... \n", "1 [http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?t... \n", "2 [http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?t... \n", "3 [http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?t... \n", "4 [http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?t... \n", "\n", "[5 rows x 42 columns]\n" ] } ], "source": [ "print(uniprot.head())" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Parent allele begin clinical_significance codons \\\n", "0 ENST00000553106 HGMD_MUTATION 377 [] \n", "1 ENST00000553106 C/T 75 [] Gat/Aat \n", "2 ENST00000553106 HGMD_MUTATION 300 [] \n", "3 ENST00000553106 HGMD_MUTATION 245 [] \n", "4 ENST00000553106 HGMD_MUTATION 415 [] \n", "\n", " consequenceType end feature_type frequency \\\n", "0 coding_sequence_variant 377 transcript_variation NaN \n", "1 missense_variant 75 transcript_variation NaN \n", "2 coding_sequence_variant 300 transcript_variation NaN \n", "3 coding_sequence_variant 245 transcript_variation NaN \n", "4 coding_sequence_variant 415 transcript_variation NaN \n", "\n", " polyphenScore residues seq_region_name siftScore translation \\\n", "0 NaN ENSP00000448059 NaN ENSP00000448059 \n", "1 0.014 D/N ENSP00000448059 0.49 ENSP00000448059 \n", "2 NaN ENSP00000448059 NaN ENSP00000448059 \n", "3 NaN ENSP00000448059 NaN ENSP00000448059 \n", "4 NaN ENSP00000448059 NaN ENSP00000448059 \n", "\n", 
" xrefs_id \n", "0 CD011183 \n", "1 rs767453024 \n", "2 CM950893 \n", "3 CM941133 \n", "4 CM920564 \n" ] } ], "source": [ "print(ensembl.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merging down the two Variants tables\n", "For merging variants from the UniProt and Ensembl" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Parent accession allele alternativeSequence association_description \\\n", "0 NaN P00439 NaN del NaN \n", "1 NaN P00439 NaN del NaN \n", "2 NaN P00439 NaN K NaN \n", "3 NaN P00439 NaN del NaN \n", "4 NaN P00439 NaN L NaN \n", "\n", " association_disease association_evidences_code \\\n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 True ECO:0000269 \n", "\n", " association_evidences_source_alternativeUrl association_evidences_source_id \\\n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 http://europepmc.org/abstract/MED/23792259 23792259 \n", "\n", " association_evidences_source_name \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 PubMed \n", "\n", " ... siftScore somaticStatus \\\n", "0 ... NaN 0.0 \n", "1 ... NaN 0.0 \n", "2 ... 0 0.0 \n", "3 ... NaN 0.0 \n", "4 ... NaN 0.0 \n", "\n", " sourceType taxid translation type wildType xrefs_id \\\n", "0 uniprot 9606.0 NaN VARIANT L NaN \n", "1 uniprot 9606.0 NaN VARIANT Y NaN \n", "2 large_scale_study 9606.0 NaN VARIANT T COSM546084 \n", "3 uniprot 9606.0 NaN VARIANT L NaN \n", "4 uniprot 9606.0 NaN VARIANT F NaN \n", "\n", " xrefs_name xrefs_url \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 cosmic curated http://cancer.sanger.ac.uk/cosmic/mutation/ove... 
\n", "3 NaN NaN \n", "4 NaN NaN \n", "\n", "[5 rows x 50 columns]\n" ] } ], "source": [ "from proteofav.mergers import uniprot_vars_ensembl_vars_merger\n", "\n", "variants = uniprot_vars_ensembl_vars_merger(uniprot, ensembl)\n", "print(variants.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merging the Structure, DSSP, SIFTS, Validation, Annotation and Variants data onto a single DataFrame" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " index pdbx_PDB_model_num auth_asym_id auth_seq_id group_PDB id \\\n", "0 0 1 A 118 ATOM 1 \n", "1 1 1 A 119 ATOM 8 \n", "2 1 1 A 119 ATOM 8 \n", "3 2 1 A 120 ATOM 15 \n", "4 3 1 A 121 ATOM 29 \n", "\n", " type_symbol label_atom_id label_alt_id label_comp_id \\\n", "0 N N . VAL \n", "1 N N . PRO \n", "2 N N . PRO \n", "3 N N . TRP \n", "4 N N . PHE \n", "\n", " ... siftScore \\\n", "0 ... 0.14 \n", "1 ... 0.01 \n", "2 ... (0.03, 0.01) \n", "3 ... 0 \n", "4 ... NaN \n", "\n", " somaticStatus sourceType taxid translation type \\\n", "0 0.0 large_scale_study 9606.0 ENSP00000448059 VARIANT \n", "1 0.0 large_scale_study 9606.0 ENSP00000448059 VARIANT \n", "2 0.0 large_scale_study 9606.0 ENSP00000448059 VARIANT \n", "3 0.0 large_scale_study 9606.0 ENSP00000448059 VARIANT \n", "4 0.0 uniprot 9606.0 NaN VARIANT \n", "\n", " wildType xrefs_id xrefs_name \\\n", "0 V rs776442422 ExAC \n", "1 P rs398123292 (1000Genomes, ExAC) \n", "2 P rs374999809 (ExAC, ESP) \n", "3 W rs775327122 ExAC \n", "4 F NaN NaN \n", "\n", " xrefs_url \n", "0 http://exac.broadinstitute.org/awesome?query=r... \n", "1 (http://www.ensembl.org/Homo_sapiens/Variation... \n", "2 (http://evs.gs.washington.edu/EVS/PopStatsServ... \n", "3 http://exac.broadinstitute.org/awesome?query=r... 
\n", "4 NaN \n", "\n", "[5 rows x 155 columns]\n" ] } ], "source": [ "from proteofav.mergers import table_merger\n", "\n", "# before merging we need to select/filter or add extra columns with necessary data\n", "from proteofav.structures import filter_structures\n", "from proteofav.dssp import filter_dssp\n", "from proteofav.sifts import filter_sifts\n", "from proteofav.validation import filter_validation\n", "from proteofav.annotation import filter_annotation\n", "\n", "# does residue aggregation and adds 'res_full' and removes hydrogens\n", "mmcif = filter_structures(mmcif, excluded_cols=None,\n", " models='first', chains=None, res=None, res_full=None,\n", " comps=None, atoms=None, lines=None, category='auth',\n", " residue_agg=True, agg_method='centroid',\n", " add_res_full=True, add_atom_altloc=False, reset_atom_id=True,\n", " remove_altloc=False, remove_hydrogens=True, remove_partial_res=False)\n", "\n", "# adds 'full_chain' and 'rsa'\n", "dssp = filter_dssp(dssp, excluded_cols=None,\n", " chains=None, chains_full=None, res=None,\n", " add_full_chain=True, add_ss_reduced=False,\n", " add_rsa=True, rsa_method=\"Sander\", add_rsa_class=False,\n", " reset_res_id=True)\n", "\n", "# does nothing\n", "sifts = filter_sifts(sifts, excluded_cols=None, chains=None,\n", " chain_auth=None, res=None, uniprot=None, site=None)\n", "\n", "# adds 'res_full'\n", "validation = filter_validation(validation, excluded_cols=None, chains=None, res=None,\n", " add_res_full=True)\n", "\n", "# annotation residue aggregation\n", "annotation = filter_annotation(annotation, identifier=None, annotation_agg=True, \n", " query_type='', group_residues=True,\n", " drop_types=('Helix', 'Beta strand', 'Turn', 'Chain'))\n", "\n", "table = table_merger(mmcif_table=mmcif, dssp_table=dssp, sifts_table=sifts,\n", " validation_table=validation, annotation_table=annotation,\n", " variants_table=variants)\n", "print(table.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 
Automating all the work done so far with the Merger class" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " index pdbx_PDB_model_num auth_asym_id auth_seq_id group_PDB id \\\n", "0 0 1 A 118 ATOM 1 \n", "1 1 1 A 119 ATOM 8 \n", "2 1 1 A 119 ATOM 8 \n", "3 2 1 A 120 ATOM 15 \n", "4 3 1 A 121 ATOM 29 \n", "\n", " type_symbol label_atom_id label_alt_id label_comp_id \\\n", "0 N N . VAL \n", "1 N N . PRO \n", "2 N N . PRO \n", "3 N N . TRP \n", "4 N N . PHE \n", "\n", " ... siftScore \\\n", "0 ... 0.14 \n", "1 ... 0.01 \n", "2 ... (0.03, 0.01) \n", "3 ... 0 \n", "4 ... NaN \n", "\n", " somaticStatus sourceType taxid translation type \\\n", "0 0.0 large_scale_study 9606.0 ENSP00000448059 VARIANT \n", "1 0.0 large_scale_study 9606.0 ENSP00000448059 VARIANT \n", "2 0.0 large_scale_study 9606.0 ENSP00000448059 VARIANT \n", "3 0.0 large_scale_study 9606.0 ENSP00000448059 VARIANT \n", "4 0.0 uniprot 9606.0 NaN VARIANT \n", "\n", " wildType xrefs_id xrefs_name \\\n", "0 V rs776442422 ExAC \n", "1 P rs398123292 (1000Genomes, ExAC) \n", "2 P rs374999809 (ExAC, ESP) \n", "3 W rs775327122 ExAC \n", "4 F NaN NaN \n", "\n", " xrefs_url \n", "0 http://exac.broadinstitute.org/awesome?query=r... \n", "1 (http://www.ensembl.org/Homo_sapiens/Variation... \n", "2 (http://evs.gs.washington.edu/EVS/PopStatsServ... \n", "3 http://exac.broadinstitute.org/awesome?query=r... 
\n", "4 NaN \n", "\n", "[5 rows x 142 columns]\n" ] } ], "source": [ "from proteofav.mergers import Tables\n", "\n", "# files are read/stored in the directories defined in the user-defined config.ini file.\n", "table = Tables.generate(merge_tables=True, uniprot_id=None, pdb_id=pdb_id, bio_unit=False,\n", " sifts=True, dssp=False, validation=True, annotations=True, variants=True,\n", " residue_agg='centroid', overwrite=False)\n", "print(table.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Use case 1: characterising the structural properties of post-translationally modified sites (or any other site)\n", "\n", "One can use ProteoFAV for high-throughput structural characterization of binding sites, as in Britto-Borges and Barton, 2017.\n", "\n", "For example, the cAMP-dependent protein kinase catalytic subunit alpha (PKAα) is a small protein kinase that is critical for homeostatic processes in human tissue and for the stress response in lower organisms [UniProt:P17612](http://www.uniprot.org/uniprot/P17612). Accordingly, the function of the protein has been extensively studied, and three-dimensional structures with high sequence coverage and resolution are available.\n", "\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "uniprot_id = 'P17612'\n", "gff_path = os.path.join(out_dir, uniprot_id + \".gff\")\n", "\n", "Annotation.download(\n", " identifier=uniprot_id, \n", " filename=gff_path)\n", "P17612_annotation = Annotation.read(filename=gff_path)\n" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>NAME</th><th>SOURCE</th><th>TYPE</th><th>START</th><th>END</th><th>SCORE</th><th>STRAND</th><th>FRAME</th><th>GROUP</th><th>Dbxref</th><th>ID</th><th>Note</th><th>Ontology_term</th><th>evidence</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>10</th><td>P17612</td><td>UniProtKB</td><td>Modified residue</td><td>11</td><td>11</td><td>.</td><td>.</td><td>.</td><td>Note=Phosphoserine%3B by autocatalysis;Ontolog...</td><td>NaN</td><td>NaN</td><td>[Phosphoserine; by autocatalysis]</td><td>[ECO:0000250]</td><td>[ECO:0000250|UniProtKB:P05132]</td></tr>\n",
"    <tr><th>11</th><td>P17612</td><td>UniProtKB</td><td>Modified residue</td><td>49</td><td>49</td><td>.</td><td>.</td><td>.</td><td>Note=Phosphothreonine;Ontology_term=ECO:000024...</td><td>[PMID:18691976]</td><td>NaN</td><td>[Phosphothreonine]</td><td>[ECO:0000244]</td><td>[ECO:0000244|PubMed:18691976]</td></tr>\n",
"    <tr><th>12</th><td>P17612</td><td>UniProtKB</td><td>Modified residue</td><td>140</td><td>140</td><td>.</td><td>.</td><td>.</td><td>Note=Phosphoserine;Ontology_term=ECO:0000250;e...</td><td>NaN</td><td>NaN</td><td>[Phosphoserine]</td><td>[ECO:0000250]</td><td>[ECO:0000250|UniProtKB:P05132]</td></tr>\n",
"    <tr><th>13</th><td>P17612</td><td>UniProtKB</td><td>Modified residue</td><td>196</td><td>196</td><td>.</td><td>.</td><td>.</td><td>Note=Phosphothreonine;Ontology_term=ECO:000026...</td><td>[PMID:12372837]</td><td>NaN</td><td>[Phosphothreonine]</td><td>[ECO:0000269]</td><td>[ECO:0000269|PubMed:12372837]</td></tr>\n",
"    <tr><th>14</th><td>P17612</td><td>UniProtKB</td><td>Modified residue</td><td>198</td><td>198</td><td>.</td><td>.</td><td>.</td><td>Note=Phosphothreonine%3B by PDPK1;Ontology_ter...</td><td>[PMID:12372837,PMID:16765046,PMID:20137943,PMI...</td><td>NaN</td><td>[Phosphothreonine; by PDPK1]</td><td>[ECO:0000269,ECO:0000269,ECO:0000269,ECO:00002...</td><td>[ECO:0000269|PubMed:12372837,ECO:0000269|PubMe...</td></tr>\n",
"    <tr><th>15</th><td>P17612</td><td>UniProtKB</td><td>Modified residue</td><td>202</td><td>202</td><td>.</td><td>.</td><td>.</td><td>Note=Phosphothreonine;Ontology_term=ECO:000026...</td><td>[PMID:17909264]</td><td>NaN</td><td>[Phosphothreonine]</td><td>[ECO:0000269]</td><td>[ECO:0000269|PubMed:17909264]</td></tr>\n",
"    <tr><th>16</th><td>P17612</td><td>UniProtKB</td><td>Modified residue</td><td>331</td><td>331</td><td>.</td><td>.</td><td>.</td><td>Note=Phosphotyrosine;Ontology_term=ECO:0000250...</td><td>NaN</td><td>NaN</td><td>[Phosphotyrosine]</td><td>[ECO:0000250]</td><td>[ECO:0000250|UniProtKB:P05132]</td></tr>\n",
"    <tr><th>17</th><td>P17612</td><td>UniProtKB</td><td>Modified residue</td><td>339</td><td>339</td><td>.</td><td>.</td><td>.</td><td>Note=Phosphoserine;Ontology_term=ECO:0000244,E...</td><td>[PMID:18691976,PMID:19690332,PMID:24275569,PMI...</td><td>NaN</td><td>[Phosphoserine]</td><td>[ECO:0000244,ECO:0000244,ECO:0000244,ECO:00002...</td><td>[ECO:0000244|PubMed:18691976,ECO:0000244|PubMe...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " NAME SOURCE TYPE START END SCORE STRAND FRAME \\\n", "10 P17612 UniProtKB Modified residue 11 11 . . . \n", "11 P17612 UniProtKB Modified residue 49 49 . . . \n", "12 P17612 UniProtKB Modified residue 140 140 . . . \n", "13 P17612 UniProtKB Modified residue 196 196 . . . \n", "14 P17612 UniProtKB Modified residue 198 198 . . . \n", "15 P17612 UniProtKB Modified residue 202 202 . . . \n", "16 P17612 UniProtKB Modified residue 331 331 . . . \n", "17 P17612 UniProtKB Modified residue 339 339 . . . \n", "\n", " GROUP \\\n", "10 Note=Phosphoserine%3B by autocatalysis;Ontolog... \n", "11 Note=Phosphothreonine;Ontology_term=ECO:000024... \n", "12 Note=Phosphoserine;Ontology_term=ECO:0000250;e... \n", "13 Note=Phosphothreonine;Ontology_term=ECO:000026... \n", "14 Note=Phosphothreonine%3B by PDPK1;Ontology_ter... \n", "15 Note=Phosphothreonine;Ontology_term=ECO:000026... \n", "16 Note=Phosphotyrosine;Ontology_term=ECO:0000250... \n", "17 Note=Phosphoserine;Ontology_term=ECO:0000244,E... \n", "\n", " Dbxref ID \\\n", "10 NaN NaN \n", "11 [PMID:18691976] NaN \n", "12 NaN NaN \n", "13 [PMID:12372837] NaN \n", "14 [PMID:12372837,PMID:16765046,PMID:20137943,PMI... NaN \n", "15 [PMID:17909264] NaN \n", "16 NaN NaN \n", "17 [PMID:18691976,PMID:19690332,PMID:24275569,PMI... NaN \n", "\n", " Note \\\n", "10 [Phosphoserine; by autocatalysis] \n", "11 [Phosphothreonine] \n", "12 [Phosphoserine] \n", "13 [Phosphothreonine] \n", "14 [Phosphothreonine; by PDPK1] \n", "15 [Phosphothreonine] \n", "16 [Phosphotyrosine] \n", "17 [Phosphoserine] \n", "\n", " Ontology_term \\\n", "10 [ECO:0000250] \n", "11 [ECO:0000244] \n", "12 [ECO:0000250] \n", "13 [ECO:0000269] \n", "14 [ECO:0000269,ECO:0000269,ECO:0000269,ECO:00002... \n", "15 [ECO:0000269] \n", "16 [ECO:0000250] \n", "17 [ECO:0000244,ECO:0000244,ECO:0000244,ECO:00002... 
\n", "\n", " evidence \n", "10 [ECO:0000250|UniProtKB:P05132] \n", "11 [ECO:0000244|PubMed:18691976] \n", "12 [ECO:0000250|UniProtKB:P05132] \n", "13 [ECO:0000269|PubMed:12372837] \n", "14 [ECO:0000269|PubMed:12372837,ECO:0000269|PubMe... \n", "15 [ECO:0000269|PubMed:17909264] \n", "16 [ECO:0000250|UniProtKB:P05132] \n", "17 [ECO:0000244|PubMed:18691976,ECO:0000244|PubMe... " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# phosphorylated sites in UniProt\n", "P17612_annotation[P17612_annotation.GROUP.str.contains('Note=Phospho')]" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "phospho_residues = P17612_annotation.loc[P17612_annotation.GROUP.str.contains('Note=Phospho'), 'START']" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "from proteofav.sifts import sifts_best" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "P17612_best_structure = sifts_best('P17612')['P17612'][0]" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P17612_best_structure['experimental_method'] == 'X-ray diffraction'" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P17612_best_structure['tax_id'] == 9606 # human" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "table = Tables.generate(\n", " merge_tables=True, \n", " uniprot_id='P17612', \n", " bio_unit=False,\n", " sifts=True,\n", " validation=True, \n", " annotations=True, \n", " residue_agg='centroid', \n", " overwrite=False)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, 
"outputs": [], "source": [ "# every residue in the structure not mapped to the UniProt is discarded\n", "table.dropna(subset=['UniProt_dbResNum'], axis=0, inplace=True) " ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "table['UniProt_dbResNum'] = table['UniProt_dbResNum'].astype(int)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>index</th><th>pdbx_PDB_model_num</th><th>auth_asym_id</th><th>auth_seq_id</th><th>group_PDB</th><th>id</th><th>type_symbol</th><th>label_atom_id</th><th>label_alt_id</th><th>label_comp_id</th><th>...</th><th>CATH_regionResNum</th><th>CATH_dbAccessionId</th><th>Pfam_regionId</th><th>Pfam_regionStart</th><th>Pfam_regionEnd</th><th>Pfam_regionResNum</th><th>Pfam_dbAccessionId</th><th>annotation</th><th>site</th><th>accession</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>35</th><td>286</td><td>1</td><td>A</td><td>48</td><td>ATOM</td><td>277</td><td>N</td><td>N</td><td>.</td><td>THR</td><td>...</td><td>49</td><td>3.30.200.20</td><td>1</td><td>44.0</td><td>298.0</td><td>49</td><td>PF00069</td><td>Domain: ['Protein kinase'] (nan), Modified res...</td><td>49</td><td>P17612</td></tr>\n",
"    <tr><th>130</th><td>39</td><td>1</td><td>A</td><td>139</td><td>ATOM</td><td>1040</td><td>N</td><td>N</td><td>.</td><td>SER</td><td>...</td><td>140</td><td>1.10.510.10</td><td>1</td><td>44.0</td><td>298.0</td><td>140</td><td>PF00069</td><td>Domain: ['Protein kinase'] (nan), Modified res...</td><td>140</td><td>P17612</td></tr>\n",
"    <tr><th>189</th><td>101</td><td>1</td><td>A</td><td>195</td><td>ATOM</td><td>1517</td><td>N</td><td>N</td><td>.</td><td>THR</td><td>...</td><td>196</td><td>1.10.510.10</td><td>1</td><td>44.0</td><td>298.0</td><td>196</td><td>PF00069</td><td>Domain: ['Protein kinase'] (nan), Modified res...</td><td>196</td><td>P17612</td></tr>\n",
"    <tr><th>191</th><td>103</td><td>1</td><td>A</td><td>197</td><td>HETATM</td><td>1538</td><td>N</td><td>N</td><td>.</td><td>TPO</td><td>...</td><td>198</td><td>1.10.510.10</td><td>1</td><td>44.0</td><td>298.0</td><td>198</td><td>PF00069</td><td>Domain: ['Protein kinase'] (nan), Modified res...</td><td>198</td><td>P17612</td></tr>\n",
"    <tr><th>195</th><td>109</td><td>1</td><td>A</td><td>201</td><td>ATOM</td><td>1567</td><td>N</td><td>N</td><td>.</td><td>THR</td><td>...</td><td>202</td><td>1.10.510.10</td><td>1</td><td>44.0</td><td>298.0</td><td>202</td><td>PF00069</td><td>Domain: ['Protein kinase'] (nan), Mutagenesis:...</td><td>202</td><td>P17612</td></tr>\n",
"    <tr><th>326</th><td>251</td><td>1</td><td>A</td><td>330</td><td>ATOM</td><td>2586</td><td>N</td><td>N</td><td>.</td><td>TYR</td><td>...</td><td>331</td><td>3.30.200.20</td><td>-</td><td>0.0</td><td>0.0</td><td>NaN</td><td>NaN</td><td>Domain: ['AGC-kinase C-terminal'] (nan), Modif...</td><td>331</td><td>P17612</td></tr>\n",
"    <tr><th>334</th><td>259</td><td>1</td><td>A</td><td>338</td><td>HETATM</td><td>2648</td><td>N</td><td>N</td><td>.</td><td>SEP</td><td>...</td><td>339</td><td>3.30.200.20</td><td>-</td><td>0.0</td><td>0.0</td><td>NaN</td><td>NaN</td><td>Domain: ['AGC-kinase C-terminal'] (nan), Modif...</td><td>339</td><td>P17612</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>7 rows × 91 columns</p>\n",
"</div>
" ], "text/plain": [ " index pdbx_PDB_model_num auth_asym_id auth_seq_id group_PDB id \\\n", "35 286 1 A 48 ATOM 277 \n", "130 39 1 A 139 ATOM 1040 \n", "189 101 1 A 195 ATOM 1517 \n", "191 103 1 A 197 HETATM 1538 \n", "195 109 1 A 201 ATOM 1567 \n", "326 251 1 A 330 ATOM 2586 \n", "334 259 1 A 338 HETATM 2648 \n", "\n", " type_symbol label_atom_id label_alt_id label_comp_id ... \\\n", "35 N N . THR ... \n", "130 N N . SER ... \n", "189 N N . THR ... \n", "191 N N . TPO ... \n", "195 N N . THR ... \n", "326 N N . TYR ... \n", "334 N N . SEP ... \n", "\n", " CATH_regionResNum CATH_dbAccessionId Pfam_regionId Pfam_regionStart \\\n", "35 49 3.30.200.20 1 44.0 \n", "130 140 1.10.510.10 1 44.0 \n", "189 196 1.10.510.10 1 44.0 \n", "191 198 1.10.510.10 1 44.0 \n", "195 202 1.10.510.10 1 44.0 \n", "326 331 3.30.200.20 - 0.0 \n", "334 339 3.30.200.20 - 0.0 \n", "\n", " Pfam_regionEnd Pfam_regionResNum Pfam_dbAccessionId \\\n", "35 298.0 49 PF00069 \n", "130 298.0 140 PF00069 \n", "189 298.0 196 PF00069 \n", "191 298.0 198 PF00069 \n", "195 298.0 202 PF00069 \n", "326 0.0 NaN NaN \n", "334 0.0 NaN NaN \n", "\n", " annotation site accession \n", "35 Domain: ['Protein kinase'] (nan), Modified res... 49 P17612 \n", "130 Domain: ['Protein kinase'] (nan), Modified res... 140 P17612 \n", "189 Domain: ['Protein kinase'] (nan), Modified res... 196 P17612 \n", "191 Domain: ['Protein kinase'] (nan), Modified res... 198 P17612 \n", "195 Domain: ['Protein kinase'] (nan), Mutagenesis:... 202 P17612 \n", "326 Domain: ['AGC-kinase C-terminal'] (nan), Modif... 331 P17612 \n", "334 Domain: ['AGC-kinase C-terminal'] (nan), Modif... 
339 P17612 \n", "\n", "[7 rows x 91 columns]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table[table['UniProt_dbResNum'].isin(phospho_residues)] " ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "phospho_residues_b = table.loc[table['UniProt_dbResNum'].isin(phospho_residues), 'B_iso_or_equiv'].mean()\n", "all_residues_b = table.loc[:, 'B_iso_or_equiv'].mean()\n", "\n", "phospho_residues_b > all_residues_b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Phosphorylated Ser/Thr residues generally have high B-factors (they are 'hot' residues), but that is not the case for the `3ovv` structure: here their mean B-factor is not higher than the mean over all residues." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "T 4\n", "H 2\n", "E 1\n", "Name: PDB_codeSecondaryStructure, dtype: int64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table.loc[table['UniProt_dbResNum'].isin(phospho_residues), 'PDB_codeSecondaryStructure'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Four of the seven residues occur in turns (`T`)." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Observed'" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table.loc[table['UniProt_dbResNum'].isin(phospho_residues), 'PDB_Annotation'].all()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All seven residues were observed in the structure; none is listed as missing in the REMARK 465 record." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "35 Favored\n", "130 Favored\n", "189 Favored\n", "191 NaN\n", "195 Favored\n", "326 Favored\n", "334 NaN\n", "Name: validation_rama, dtype: object" ] }, 
"execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table.loc[table['UniProt_dbResNum'].isin(phospho_residues), 'validation_rama']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Five of the seven residues fall in favored Ramachandran regions, so none is an outlier; the two `NaN` values belong to the phosphorylated residues (TPO and SEP) observed in the protein crystal." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Use case 2: Spatial clustering of genetic variants" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python (proteofav)", "language": "python", "name": "proteofav" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }