{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Solution-1\n", "This tutorial shows how to find proteins for a specific organism, how to calculate protein-protein interactions, and visualize the results." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "from pyspark.sql.functions import substring_index\n", "from mmtfPyspark.datasets import pdbjMineDataset\n", "from mmtfPyspark.webfilters import PdbjMineSearch\n", "from mmtfPyspark.interactions import InteractionFilter, InteractionFingerprinter\n", "from mmtfPyspark.io import mmtfReader\n", "from ipywidgets import interact, IntSlider\n", "import py3Dmol" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Configure Spark" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "spark = SparkSession.builder.appName(\"Solution-1\").getOrCreate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Find protein structures for mouse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For our first task, we need to run a taxonomy query using SIFTS data. [See examples](https://github.com/sbl-sdsc/mmtf-pyspark/blob/master/demos/datasets/PDBMetaDataDemo.ipynb) and [SIFTS demo](https://github.com/sbl-sdsc/mmtf-pyspark/blob/master/demos/datasets/SiftsDataDemo.ipynb)\n", "\n", "To figure out how to query for taxonomy, the command below lists the first 10 entries for the SIFTS taxonomy table. As you can see, we can use the science_name field to query for a specific organism." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+-----+------+--------------------+----------------+\n", "|pdbid|chain|tax_id| scientific_name|structureChainId|\n", "+-----+-----+------+--------------------+----------------+\n", "| 101M| A| 9755| PHYMC| 101M.A|\n", "| 101M| A| 9755| Physeter catodon| 101M.A|\n", "| 101M| A| 9755|Physeter catodon ...| 101M.A|\n", "| 101M| A| 9755|Physeter macrocep...| 101M.A|\n", "| 101M| A| 9755|Physeter macrocep...| 101M.A|\n", "| 101M| A| 9755| Sperm whale| 101M.A|\n", "| 101M| A| 9755| sperm whale| 101M.A|\n", "| 102L| A| 10665| BPT4| 102L.A|\n", "| 102L| A| 10665| Bacteriophage T4| 102L.A|\n", "| 102L| A| 10665|Enterobacteria ph...| 102L.A|\n", "+-----+-----+------+--------------------+----------------+\n", "\n" ] } ], "source": [ "taxonomy_query = \"SELECT * FROM sifts.pdb_chain_taxonomy LIMIT 10\"\n", "taxonomy = pdbjMineDataset.get_dataset(taxonomy_query)\n", "taxonomy.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TODO-1: specify a taxonomy query where the scientific name is 'Mus musculus'" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+-----+------+---------------+----------------+\n", "|pdbid|chain|tax_id|scientific_name|structureChainId|\n", "+-----+-----+------+---------------+----------------+\n", "| 12E8| H| 10090| Mus musculus| 12E8.H|\n", "| 12E8| L| 10090| Mus musculus| 12E8.L|\n", "| 12E8| M| 10090| Mus musculus| 12E8.M|\n", "| 12E8| P| 10090| Mus musculus| 12E8.P|\n", "| 15C8| H| 10090| Mus musculus| 15C8.H|\n", "| 15C8| L| 10090| Mus musculus| 15C8.L|\n", "| 1914| A| 10090| Mus musculus| 1914.A|\n", "| 1A0Q| H| 10090| Mus musculus| 1A0Q.H|\n", "| 1A0Q| L| 10090| Mus musculus| 1A0Q.L|\n", "| 1A14| H| 10090| Mus musculus| 1A14.H|\n", "+-----+-----+------+---------------+----------------+\n", "only showing top 10 rows\n", "\n" ] } ], "source": [ "taxonomy_query = \"SELECT * FROM sifts.pdb_chain_taxonomy WHERE scientific_name = 'Mus musculus'\"\n", "taxonomy = pdbjMineDataset.get_dataset(taxonomy_query)\n", "taxonomy.show(10)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "path = \"../resources/mmtf_full_sample/\"\n", "\n", "pdb = mmtfReader.read_sequence_file(path, fraction=0.1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TODO-2: Take the taxonomy from above and use it to filter the pdb structures" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "pdb = pdb.filter(PdbjMineSearch(taxonomy_query)).cache()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculate polymer-polymer interactions for this subset of structures\n", "Find protein-protein interactions with a 6 A distance cutoff" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "distance_cutoff = 6.0\n", "interactionFilter = InteractionFilter(distance_cutoff, minInteractions=10)\n", "\n", "interactions = InteractionFingerprinter.get_polymer_interactions(pdb, interactionFilter).cache()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
structureChainIdqueryChainIdtargetChainIdgroupNumberssequenceIndicessequencestructureId
04M48.AHA[337, 338, 498, 501, 502, 503, 504, 505, 506, ...[70, 274, 275, 435, 438, 439, 440, 441, 442, 4...MNSISDERETWSGKVDFLLSVIGFAVDLANVWRFPYLCYKNGGGAF...4M48
14M48.HAH[100, 101, 102, 103, 31, 33, 50, 52, 53, 54, 5...[49, 51, 68, 70, 71, 72, 73, 74, 75, 77, 117, ...MNFGLRLVFLVLILKGVQCEVQLVESGGGLVKPGGSLKLSCAASGF...4M48
24M48.LHL[1, 100, 101, 115, 117, 118, 119, 120, 121, 12...[22, 53, 54, 56, 58, 60, 63, 64, 65, 66, 67, 6...MDFQVQIFSFLLISASVAMSRGENVLTQSPAIMSTSPGEKVTMTCR...4M48
34M48.HLH[100, 101, 102, 103, 104, 105, 106, 107, 108, ...[53, 55, 57, 60, 61, 62, 63, 64, 65, 68, 77, 7...MNFGLRLVFLVLILKGVQCEVQLVESGGGLVKPGGSLKLSCAASGF...4M48
44NN5.ACA[126, 127, 129, 130, 131, 132, 133, 134, 136, ...[11, 14, 15, 16, 19, 20, 23, 28, 30, 31, 32, 3...YNFSNCNFTSITKIYCNIIFHDLTGDLKGAKFEQIEDCESKPACLL...4NN5
54NN5.CAC[106, 107, 108, 109, 110, 112, 113, 143, 144, ...[16, 41, 42, 68, 69, 70, 71, 73, 74, 86, 87, 8...AAAVTSRGDVTVVCHDLETVEVTWGSGPDHHGANLSLEFRYGTGAL...4NN5
62QDQ.ABA[2496, 2497, 2498, 2500, 2501, 2502, 2504, 250...[4, 5, 6, 8, 9, 10, 12, 13, 15, 16, 17, 19, 20...GAMVGGIAQIIAAQEEMLRKERELEEARKKLAQIRQQQYKFLPSEL...2QDQ
72QDQ.BAB[2497, 2498, 2500, 2501, 2504, 2505, 2507, 250...[5, 6, 8, 9, 12, 13, 15, 16, 17, 19, 20, 22, 2...GAMVGGIAQIIAAQEEMLRKERELEEARKKLAQIRQQQYKFLPSEL...2QDQ
84P3A.CDC[698, 701, 702, 704, 705, 706, 708, 709, 710, ...[21, 24, 25, 27, 28, 29, 31, 32, 33, 34, 35, 3...GANLHLLRQKIEEQAAKYKHSVPKKCCYDGARVNFYETCEERVARV...4P3A
94P3A.DCD[698, 701, 702, 704, 705, 706, 708, 709, 710, ...[21, 24, 25, 27, 28, 29, 31, 32, 33, 34, 35, 3...GANLHLLRQKIEEQAAKYKHSVPKKCCYDGARVNFYETCEERVARV...4P3A
\n", "
" ], "text/plain": [ " structureChainId queryChainId targetChainId \\\n", "0 4M48.A H A \n", "1 4M48.H A H \n", "2 4M48.L H L \n", "3 4M48.H L H \n", "4 4NN5.A C A \n", "5 4NN5.C A C \n", "6 2QDQ.A B A \n", "7 2QDQ.B A B \n", "8 4P3A.C D C \n", "9 4P3A.D C D \n", "\n", " groupNumbers \\\n", "0 [337, 338, 498, 501, 502, 503, 504, 505, 506, ... \n", "1 [100, 101, 102, 103, 31, 33, 50, 52, 53, 54, 5... \n", "2 [1, 100, 101, 115, 117, 118, 119, 120, 121, 12... \n", "3 [100, 101, 102, 103, 104, 105, 106, 107, 108, ... \n", "4 [126, 127, 129, 130, 131, 132, 133, 134, 136, ... \n", "5 [106, 107, 108, 109, 110, 112, 113, 143, 144, ... \n", "6 [2496, 2497, 2498, 2500, 2501, 2502, 2504, 250... \n", "7 [2497, 2498, 2500, 2501, 2504, 2505, 2507, 250... \n", "8 [698, 701, 702, 704, 705, 706, 708, 709, 710, ... \n", "9 [698, 701, 702, 704, 705, 706, 708, 709, 710, ... \n", "\n", " sequenceIndices \\\n", "0 [70, 274, 275, 435, 438, 439, 440, 441, 442, 4... \n", "1 [49, 51, 68, 70, 71, 72, 73, 74, 75, 77, 117, ... \n", "2 [22, 53, 54, 56, 58, 60, 63, 64, 65, 66, 67, 6... \n", "3 [53, 55, 57, 60, 61, 62, 63, 64, 65, 68, 77, 7... \n", "4 [11, 14, 15, 16, 19, 20, 23, 28, 30, 31, 32, 3... \n", "5 [16, 41, 42, 68, 69, 70, 71, 73, 74, 86, 87, 8... \n", "6 [4, 5, 6, 8, 9, 10, 12, 13, 15, 16, 17, 19, 20... \n", "7 [5, 6, 8, 9, 12, 13, 15, 16, 17, 19, 20, 22, 2... \n", "8 [21, 24, 25, 27, 28, 29, 31, 32, 33, 34, 35, 3... \n", "9 [21, 24, 25, 27, 28, 29, 31, 32, 33, 34, 35, 3... \n", "\n", " sequence structureId \n", "0 MNSISDERETWSGKVDFLLSVIGFAVDLANVWRFPYLCYKNGGGAF... 4M48 \n", "1 MNFGLRLVFLVLILKGVQCEVQLVESGGGLVKPGGSLKLSCAASGF... 4M48 \n", "2 MDFQVQIFSFLLISASVAMSRGENVLTQSPAIMSTSPGEKVTMTCR... 4M48 \n", "3 MNFGLRLVFLVLILKGVQCEVQLVESGGGLVKPGGSLKLSCAASGF... 4M48 \n", "4 YNFSNCNFTSITKIYCNIIFHDLTGDLKGAKFEQIEDCESKPACLL... 4NN5 \n", "5 AAAVTSRGDVTVVCHDLETVEVTWGSGPDHHGANLSLEFRYGTGAL... 4NN5 \n", "6 GAMVGGIAQIIAAQEEMLRKERELEEARKKLAQIRQQQYKFLPSEL... 2QDQ \n", "7 GAMVGGIAQIIAAQEEMLRKERELEEARKKLAQIRQQQYKFLPSEL... 2QDQ \n", "8 GANLHLLRQKIEEQAAKYKHSVPKKCCYDGARVNFYETCEERVARV... 4P3A \n", "9 GANLHLLRQKIEEQAAKYKHSVPKKCCYDGARVNFYETCEERVARV... 4P3A " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "interactions = interactions.withColumn(\"structureId\", substring_index(interactions.structureChainId, '.', 1)).cache()\n", "interactions.toPandas().head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualize the protein-protein interactions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Extract id columns as lists (required for visualization)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "structure_ids = interactions.select(\"structureId\").rdd.flatMap(lambda x: x).collect()\n", "query_chain_ids = interactions.select(\"queryChainID\").rdd.flatMap(lambda x: x).collect()\n", "target_chain_ids = interactions.select(\"targetChainID\").rdd.flatMap(lambda x: x).collect()\n", "target_groups = interactions.select(\"groupNumbers\").rdd.flatMap(lambda x: x).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Disable scrollbar for the visualization below" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "#%%javascript \n", "#IPython.OutputArea.prototype._should_scroll = function(lines) {return false;}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Show protein-protein interactions within cutoff distance (query = orange, target = blue)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def view_protein_protein_interactions(structure_ids, query_chain_ids, target_chain_ids, target_groups, distance=4.5):\n", " \n", " def view3d(i=0):\n", " \n", " print(f\"PDB: {structure_ids[i]}, query: {query_chain_ids[i]}, target: {target_chain_ids[i]}\")\n", "\n", " target = {'chain': target_chain_ids[i], 'resi': target_groups[i]}\n", " \n", " viewer = py3Dmol.view(query='pdb:' + structure_ids[i], width=600, height=600)\n", " viewer.setStyle({})\n", "\n", " viewer.setStyle({'chain': query_chain_ids[i]}, {'line': {'colorscheme': 'orangeCarbon'}})\n", " viewer.setStyle({'chain' : query_chain_ids[i], 'within':{'distance' : distance, 'sel':{'chain': target_chain_ids[i]}}}, {'sphere': {'colorscheme': 'orangeCarbon'}}); \n", " viewer.setStyle({'chain': target_chain_ids[i]}, {'line': {'colorscheme': 'lightblueCarbon'}})\n", " viewer.setStyle(target, {'stick': {'colorscheme': 'lightblueCarbon'}})\n", " viewer.zoomTo(target)\n", "\n", " return viewer.show()\n", "\n", " s_widget = IntSlider(min=0, max=len(structure_ids)-1, description='Structure', continuous_update=False)\n", " return interact(view3d, i=s_widget)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "702797ced24445edb912996910b0c869", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(IntSlider(value=0, continuous_update=False, description='Structure', max=47), Output()),…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "view_protein_protein_interactions(structure_ids, query_chain_ids, target_chain_ids, \\\n", " target_groups, distance=distance_cutoff);" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "spark.stop()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 4 }