{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Problem-2\n", "Here we combine what you've learned about ColumnarStructure with skills from previous tutorials. For each structure, we use ColumnarStructure to calculate the number of water molecules per amino acid residue (waterRatio) and the average B-factor (temperature factor) of the protein atoms. We collect the results in a dataset and then plot waterRatio against resolution to see whether there is a trend in the data." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import Row, SparkSession\n", "from mmtfPyspark.filters import ContainsLProteinChain, ExperimentalMethods\n", "from mmtfPyspark.io import mmtfReader\n", "from mmtfPyspark.ml import pythonRDDToDataset\n", "from mmtfPyspark.utils import ColumnarStructure\n", "import matplotlib.pyplot as plt\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure Spark" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "spark = SparkSession.builder.appName(\"Solution-2\").getOrCreate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read PDB structures" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "path = \"../resources/mmtf_full_sample\"\n", "pdb = mmtfReader.read_sequence_file(path).cache()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TODO-1: filter structures: exclusively protein structures" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "pdb = ... your code here ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TODO-2: filter structures: only structures determined by ExperimentalMethods.X_RAY_DIFFRACTION" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "pdb = ... your code here ..." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Keep only structures with one model" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "pdb = pdb.filter(lambda t: t[1].num_models == 1)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def calcProperties(s):\n", "    # s[0]: PDB ID\n", "    # s[1]: MMTF structure record\n", "    arrays = ColumnarStructure(s[1], firstModelOnly=True)\n", "\n", "    # get column arrays (one entry per atom)\n", "    atom_names = arrays.get_atom_names()\n", "    entity_types = arrays.get_entity_types()\n", "\n", "    # TODO-3: get array of B-factors\n", "    b_factors = ... your code here ...\n", "\n", "    # calculate number of protein residues using boolean indexing\n", "    # (one CA atom is counted per residue)\n", "    pro_idx = (entity_types == 'PRO') & (atom_names == 'CA')\n", "    num_pro = int(np.sum(pro_idx))\n", "\n", "    # TODO-4: calculate number of water residues using boolean indexing\n", "    wat_idx = ... your code here ...\n", "    num_wat = int(np.sum(wat_idx))\n", "\n", "    # calculate average B-factor for protein atoms\n", "    pro_atom_idx = (entity_types == 'PRO')\n", "    pro_b_factors = b_factors[pro_atom_idx]\n", "    ave_b = float(np.average(pro_b_factors))\n", "\n", "    return Row(s[0], s[1].resolution, ave_b, num_pro, num_wat, num_wat/num_pro)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "rows = pdb.map(lambda s: calcProperties(s))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "col_names = [\"pdbId\", \"resolution\", \"ave_b\", \"numPro\", \"numWat\", \"waterRatio\"]\n", "summary = pythonRDDToDataset.get_dataset(rows, col_names).cache()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- pdbId: string (nullable = false)\n", " |-- resolution: float (nullable = false)\n", " |-- ave_b: float (nullable = false)\n", " |-- numPro: 
integer (nullable = false)\n", " |-- numWat: integer (nullable = false)\n", " |-- waterRatio: float (nullable = false)\n", "\n" ] } ], "source": [ "summary.printSchema()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+----------+---------+------+------+----------+\n", "|pdbId|resolution| ave_b|numPro|numWat|waterRatio|\n", "+-----+----------+---------+------+------+----------+\n", "| 1LBU| 1.8| 15.45025| 214| 229| 1.0700935|\n", "| 1LC0| 1.2|17.563633| 290| 407| 1.4034482|\n", "| 1LC5| 1.46| 16.42075| 355| 271| 0.7633803|\n", "| 1LFP| 1.72| 35.73947| 243| 350| 1.4403292|\n", "| 1LFW| 1.8|13.556223| 468| 570| 1.2179487|\n", "| 1LGH| 2.4| 37.60027| 396| 64|0.16161616|\n", "| 1LH0| 2.0|31.355368| 419| 278| 0.6634845|\n", "| 1LJ8| 1.7| 18.55931| 492| 437|0.88821137|\n", "| 1LKI| 2.0|22.673143| 172| 50|0.29069766|\n", "| 1LMI| 1.5| 19.07365| 131| 172| 1.3129771|\n", "| 1LML| 1.86|23.073887| 465| 212| 0.455914|\n", "| 1LO7| 1.5| 18.83036| 140| 173| 1.2357143|\n", "| 1LQ9| 1.3| 13.54245| 224| 258| 1.1517857|\n", "| 1LQV| 1.6|26.535683| 411| 429| 1.0437956|\n", "| 1LR0| 1.914|26.655485| 126| 122|0.96825397|\n", "| 1LRI| 1.45|21.217518| 98| 99| 1.0102041|\n", "| 1LRZ| 2.1|27.266705| 400| 318| 0.795|\n", "| 1LS1| 1.1|21.858797| 316| 316| 1.0|\n", "| 1LTS| 1.95| 32.31854| 741| 293| 0.3954116|\n", "| 1LU0| 1.03|15.596555| 61| 76| 1.2459016|\n", "+-----+----------+---------+------+------+----------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "summary.show()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "data = summary.toPandas()\n", "data.plot(x='resolution', y='waterRatio', kind='scatter');" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "spark.stop()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }