{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Solution-1\n", "This tutorial shows how to find proteins for a specific organism, how to calculate protein-protein interactions, and how to visualize the results." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "from pyspark.sql.functions import substring_index\n", "from mmtfPyspark.datasets import pdbjMineDataset\n", "from mmtfPyspark.webfilters import PdbjMineSearch\n", "from mmtfPyspark.interactions import InteractionFilter, InteractionFingerprinter\n", "from mmtfPyspark.io import mmtfReader\n", "from ipywidgets import interact, IntSlider\n", "import py3Dmol" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Configure Spark" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "spark = SparkSession.builder.appName(\"Solution-1\").getOrCreate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Find protein structures for mouse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For our first task, we need to run a taxonomy query using SIFTS data. See the [examples](https://github.com/sbl-sdsc/mmtf-pyspark/blob/master/demos/datasets/PDBMetaDataDemo.ipynb) and the [SIFTS demo](https://github.com/sbl-sdsc/mmtf-pyspark/blob/master/demos/datasets/SiftsDataDemo.ipynb).\n", "\n", "To see how to query by taxonomy, the cell below lists the first 10 rows of the SIFTS taxonomy table. As the output shows, we can use the `scientific_name` column to query for a specific organism." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+-----+------+--------------------+----------------+\n", "|pdbid|chain|tax_id| scientific_name|structureChainId|\n", "+-----+-----+------+--------------------+----------------+\n", "| 101M| A| 9755| PHYMC| 101M.A|\n", "| 101M| A| 9755| Physeter catodon| 101M.A|\n", "| 101M| A| 9755|Physeter catodon ...| 101M.A|\n", "| 101M| A| 9755|Physeter macrocep...| 101M.A|\n", "| 101M| A| 9755|Physeter macrocep...| 101M.A|\n", "| 101M| A| 9755| Sperm whale| 101M.A|\n", "| 101M| A| 9755| sperm whale| 101M.A|\n", "| 102L| A| 10665| BPT4| 102L.A|\n", "| 102L| A| 10665| Bacteriophage T4| 102L.A|\n", "| 102L| A| 10665|Enterobacteria ph...| 102L.A|\n", "+-----+-----+------+--------------------+----------------+\n", "\n" ] } ], "source": [ "taxonomy_query = \"SELECT * FROM sifts.pdb_chain_taxonomy LIMIT 10\"\n", "taxonomy = pdbjMineDataset.get_dataset(taxonomy_query)\n", "taxonomy.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TODO-1: specify a taxonomy query where the scientific name is 'Mus musculus'" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+-----+------+---------------+----------------+\n", "|pdbid|chain|tax_id|scientific_name|structureChainId|\n", "+-----+-----+------+---------------+----------------+\n", "| 12E8| H| 10090| Mus musculus| 12E8.H|\n", "| 12E8| L| 10090| Mus musculus| 12E8.L|\n", "| 12E8| M| 10090| Mus musculus| 12E8.M|\n", "| 12E8| P| 10090| Mus musculus| 12E8.P|\n", "| 15C8| H| 10090| Mus musculus| 15C8.H|\n", "| 15C8| L| 10090| Mus musculus| 15C8.L|\n", "| 1914| A| 10090| Mus musculus| 1914.A|\n", "| 1A0Q| H| 10090| Mus musculus| 1A0Q.H|\n", "| 1A0Q| L| 10090| Mus musculus| 1A0Q.L|\n", "| 1A14| H| 10090| Mus musculus| 1A14.H|\n", "+-----+-----+------+---------------+----------------+\n", "only 
showing top 10 rows\n", "\n" ] } ], "source": [ "taxonomy_query = \"SELECT * FROM sifts.pdb_chain_taxonomy WHERE scientific_name = 'Mus musculus'\"\n", "taxonomy = pdbjMineDataset.get_dataset(taxonomy_query)\n", "taxonomy.show(10)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "path = \"../resources/mmtf_full_sample/\"\n", "\n", "# Read a random 10% sample of the structures\n", "pdb = mmtfReader.read_sequence_file(path, fraction=0.1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TODO-2: Use the taxonomy query from above to filter the PDB structures" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "pdb = pdb.filter(PdbjMineSearch(taxonomy_query)).cache()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculate polymer-polymer interactions for this subset of structures\n", "Find protein-protein interactions using a 6 Å distance cutoff." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "distance_cutoff = 6.0\n", "interactionFilter = InteractionFilter(distance_cutoff, minInteractions=10)\n", "\n", "interactions = InteractionFingerprinter.get_polymer_interactions(pdb, interactionFilter).cache()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | structureChainId | \n", "queryChainId | \n", "targetChainId | \n", "groupNumbers | \n", "sequenceIndices | \n", "sequence | \n", "structureId | \n", "
---|---|---|---|---|---|---|---|
0 | \n", "4M48.A | \n", "H | \n", "A | \n", "[337, 338, 498, 501, 502, 503, 504, 505, 506, ... | \n", "[70, 274, 275, 435, 438, 439, 440, 441, 442, 4... | \n", "MNSISDERETWSGKVDFLLSVIGFAVDLANVWRFPYLCYKNGGGAF... | \n", "4M48 | \n", "
1 | \n", "4M48.H | \n", "A | \n", "H | \n", "[100, 101, 102, 103, 31, 33, 50, 52, 53, 54, 5... | \n", "[49, 51, 68, 70, 71, 72, 73, 74, 75, 77, 117, ... | \n", "MNFGLRLVFLVLILKGVQCEVQLVESGGGLVKPGGSLKLSCAASGF... | \n", "4M48 | \n", "
2 | \n", "4M48.L | \n", "H | \n", "L | \n", "[1, 100, 101, 115, 117, 118, 119, 120, 121, 12... | \n", "[22, 53, 54, 56, 58, 60, 63, 64, 65, 66, 67, 6... | \n", "MDFQVQIFSFLLISASVAMSRGENVLTQSPAIMSTSPGEKVTMTCR... | \n", "4M48 | \n", "
3 | \n", "4M48.H | \n", "L | \n", "H | \n", "[100, 101, 102, 103, 104, 105, 106, 107, 108, ... | \n", "[53, 55, 57, 60, 61, 62, 63, 64, 65, 68, 77, 7... | \n", "MNFGLRLVFLVLILKGVQCEVQLVESGGGLVKPGGSLKLSCAASGF... | \n", "4M48 | \n", "
4 | \n", "4NN5.A | \n", "C | \n", "A | \n", "[126, 127, 129, 130, 131, 132, 133, 134, 136, ... | \n", "[11, 14, 15, 16, 19, 20, 23, 28, 30, 31, 32, 3... | \n", "YNFSNCNFTSITKIYCNIIFHDLTGDLKGAKFEQIEDCESKPACLL... | \n", "4NN5 | \n", "
5 | \n", "4NN5.C | \n", "A | \n", "C | \n", "[106, 107, 108, 109, 110, 112, 113, 143, 144, ... | \n", "[16, 41, 42, 68, 69, 70, 71, 73, 74, 86, 87, 8... | \n", "AAAVTSRGDVTVVCHDLETVEVTWGSGPDHHGANLSLEFRYGTGAL... | \n", "4NN5 | \n", "
6 | \n", "2QDQ.A | \n", "B | \n", "A | \n", "[2496, 2497, 2498, 2500, 2501, 2502, 2504, 250... | \n", "[4, 5, 6, 8, 9, 10, 12, 13, 15, 16, 17, 19, 20... | \n", "GAMVGGIAQIIAAQEEMLRKERELEEARKKLAQIRQQQYKFLPSEL... | \n", "2QDQ | \n", "
7 | \n", "2QDQ.B | \n", "A | \n", "B | \n", "[2497, 2498, 2500, 2501, 2504, 2505, 2507, 250... | \n", "[5, 6, 8, 9, 12, 13, 15, 16, 17, 19, 20, 22, 2... | \n", "GAMVGGIAQIIAAQEEMLRKERELEEARKKLAQIRQQQYKFLPSEL... | \n", "2QDQ | \n", "
8 | \n", "4P3A.C | \n", "D | \n", "C | \n", "[698, 701, 702, 704, 705, 706, 708, 709, 710, ... | \n", "[21, 24, 25, 27, 28, 29, 31, 32, 33, 34, 35, 3... | \n", "GANLHLLRQKIEEQAAKYKHSVPKKCCYDGARVNFYETCEERVARV... | \n", "4P3A | \n", "
9 | \n", "4P3A.D | \n", "C | \n", "D | \n", "[698, 701, 702, 704, 705, 706, 708, 709, 710, ... | \n", "[21, 24, 25, 27, 28, 29, 31, 32, 33, 34, 35, 3... | \n", "GANLHLLRQKIEEQAAKYKHSVPKKCCYDGARVNFYETCEERVARV... | \n", "4P3A | \n", "