{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 1-Metadata\n", "This tutorial shows how to use Spark datasets to retrieve metadata about PDB structures. mmtfPyspark provides a number of modules to fetch data from [external resources](https://github.com/sbl-sdsc/mmtf-pyspark/tree/master/mmtfPyspark/datasets).\n", "\n", "This notebook shows how to download and analyze PDB metadata from the [SIFTS project](https://www.ebi.ac.uk/pdbe/docs/sifts/methodology.html) as Spark Datasets.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "from pyspark.sql.functions import substring_index\n", "from mmtfPyspark.datasets import pdbjMineDataset\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Configure Spark" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "spark = SparkSession.builder.appName(\"1-Metadata\").getOrCreate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download up-to-date EC classification data\n", "The SIFTS project maintains up-to-date mappings of protein chains in the PDB to Enzyme Commission ([EC](http://www.sbcs.qmul.ac.uk/iubmb/enzyme/)) classifications. We use the [pdbjMineDataset module](https://github.com/sbl-sdsc/mmtf-pyspark/blob/master/mmtfPyspark/datasets/pdbjMineDataset.py) to retrieve these mappings. An extensive [demo](https://nbviewer.jupyter.org/github/sbl-sdsc/mmtf-pyspark/blob/master/demos/datasets/SiftsDataDemo.ipynb) shows how to query SIFTS data with pdbjMineDataset." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Query EC data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "query = \"SELECT * FROM sifts.pdb_chain_enzyme\"\n", "enzymes = pdbjMineDataset.get_dataset(query).cache()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+-----+---------+---------+----------------+\n", "|pdbid|chain|accession|ec_number|structureChainId|\n", "+-----+-----+---------+---------+----------------+\n", "| 102L| A| P00720| 3.2.1.17| 102L.A|\n", "| 103L| A| P00720| 3.2.1.17| 103L.A|\n", "| 104L| A| P00720| 3.2.1.17| 104L.A|\n", "| 104L| B| P00720| 3.2.1.17| 104L.B|\n", "| 107L| A| P00720| 3.2.1.17| 107L.A|\n", "| 108L| A| P00720| 3.2.1.17| 108L.A|\n", "| 109L| A| P00720| 3.2.1.17| 109L.A|\n", "| 10GS| A| P09211| 2.5.1.18| 10GS.A|\n", "| 10GS| B| P09211| 2.5.1.18| 10GS.B|\n", "| 10MH| A| P05102| 2.1.1.37| 10MH.A|\n", "| 110L| A| P00720| 3.2.1.17| 110L.A|\n", "| 111L| A| P00720| 3.2.1.17| 111L.A|\n", "| 112L| A| P00720| 3.2.1.17| 112L.A|\n", "| 113L| A| P00720| 3.2.1.17| 113L.A|\n", "| 114L| A| P00720| 3.2.1.17| 114L.A|\n", "| 115L| A| P00720| 3.2.1.17| 115L.A|\n", "| 117E| A| P00817| 3.6.1.1| 117E.A|\n", "| 117E| B| P00817| 3.6.1.1| 117E.B|\n", "| 118L| A| P00720| 3.2.1.17| 118L.A|\n", "| 119L| A| P00720| 3.2.1.17| 119L.A|\n", "+-----+-----+---------+---------+----------------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "enzymes.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### For better formatting, we can convert the dataset to pandas" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pdbidchainaccessionec_numberstructureChainId
0102LAP007203.2.1.17102L.A
1103LAP007203.2.1.17103L.A
2104LAP007203.2.1.17104L.A
3104LBP007203.2.1.17104L.B
4107LAP007203.2.1.17107L.A
5108LAP007203.2.1.17108L.A
6109LAP007203.2.1.17109L.A
710GSAP092112.5.1.1810GS.A
810GSBP092112.5.1.1810GS.B
910MHAP051022.1.1.3710MH.A
10110LAP007203.2.1.17110L.A
11111LAP007203.2.1.17111L.A
12112LAP007203.2.1.17112L.A
13113LAP007203.2.1.17113L.A
14114LAP007203.2.1.17114L.A
15115LAP007203.2.1.17115L.A
16117EAP008173.6.1.1117E.A
17117EBP008173.6.1.1117E.B
18118LAP007203.2.1.17118L.A
19119LAP007203.2.1.17119L.A
\n", "
" ], "text/plain": [ " pdbid chain accession ec_number structureChainId\n", "0 102L A P00720 3.2.1.17 102L.A\n", "1 103L A P00720 3.2.1.17 103L.A\n", "2 104L A P00720 3.2.1.17 104L.A\n", "3 104L B P00720 3.2.1.17 104L.B\n", "4 107L A P00720 3.2.1.17 107L.A\n", "5 108L A P00720 3.2.1.17 108L.A\n", "6 109L A P00720 3.2.1.17 109L.A\n", "7 10GS A P09211 2.5.1.18 10GS.A\n", "8 10GS B P09211 2.5.1.18 10GS.B\n", "9 10MH A P05102 2.1.1.37 10MH.A\n", "10 110L A P00720 3.2.1.17 110L.A\n", "11 111L A P00720 3.2.1.17 111L.A\n", "12 112L A P00720 3.2.1.17 112L.A\n", "13 113L A P00720 3.2.1.17 113L.A\n", "14 114L A P00720 3.2.1.17 114L.A\n", "15 115L A P00720 3.2.1.17 115L.A\n", "16 117E A P00817 3.6.1.1 117E.A\n", "17 117E B P00817 3.6.1.1 117E.B\n", "18 118L A P00720 3.2.1.17 118L.A\n", "19 119L A P00720 3.2.1.17 119L.A" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "enzymes.toPandas().head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Remove redundancy\n", "Here we keep a single protein chain for each unique UniProt accession number." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "enzymes = enzymes.dropDuplicates([\"accession\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Add columns for enzyme type and subtype\n", "We use the [withColumn](http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumn) method to add a new column and the [substring_index](http://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#pyspark.sql.functions.substring_index) function to extract the first level (enzyme type) and the first two levels (enzyme subtype) of the EC number hierarchy." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "enzymes = enzymes.withColumn(\"enzymeType\", substring_index(enzymes.ec_number, '.', 1))\n", "enzymes = enzymes.withColumn(\"enzymeSubtype\", substring_index(enzymes.ec_number, '.', 2))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pdbidchainaccessionec_numberstructureChainIdenzymeTypeenzymeSubtype
03P96AA0QJI13.1.3.33P96.A33.1
12IXAAA4Q8F73.2.1.492IXA.A33.2
26IDOAA6TGP02.7.7.66IDO.A22.7
33FIEAA7GBG33.4.24.693FIE.A33.4
44Z38AA7Z4702.3.1.394Z38.A22.3
56DREBA8GG782.4.2.316DRE.B22.4
63ZH4AB1IBM32.5.1.73ZH4.A22.5
76CYZAB1MLU62.7.1.396CYZ.A22.7
82FUQAC6XZB64.2.2.72FUQ.A44.2
93C61AD0VWT21.3.98.13C61.A11.3
105MZ61G5ED393.4.22.495MZ6.133.4
115A2AAI1VWH93.2.1.15A2A.A33.2
125J1SAO146563.6.4.-5J1S.A33.6
132J63AO159224.2.99.182J63.A44.2
144Q5RAO185982.5.1.184Q5R.A22.5
155BSMAO241466.2.1.125BSM.A66.2
161KUFAO574133.4.24.-1KUF.A33.4
173ABDXO606732.7.7.73ABD.X22.7
183H0LAO666106.3.5.73H0L.A66.3
192PCJAO666467.6.2.-2PCJ.A77.6
\n", "
" ], "text/plain": [ " pdbid chain accession ec_number structureChainId enzymeType enzymeSubtype\n", "0 3P96 A A0QJI1 3.1.3.3 3P96.A 3 3.1\n", "1 2IXA A A4Q8F7 3.2.1.49 2IXA.A 3 3.2\n", "2 6IDO A A6TGP0 2.7.7.6 6IDO.A 2 2.7\n", "3 3FIE A A7GBG3 3.4.24.69 3FIE.A 3 3.4\n", "4 4Z38 A A7Z470 2.3.1.39 4Z38.A 2 2.3\n", "5 6DRE B A8GG78 2.4.2.31 6DRE.B 2 2.4\n", "6 3ZH4 A B1IBM3 2.5.1.7 3ZH4.A 2 2.5\n", "7 6CYZ A B1MLU6 2.7.1.39 6CYZ.A 2 2.7\n", "8 2FUQ A C6XZB6 4.2.2.7 2FUQ.A 4 4.2\n", "9 3C61 A D0VWT2 1.3.98.1 3C61.A 1 1.3\n", "10 5MZ6 1 G5ED39 3.4.22.49 5MZ6.1 3 3.4\n", "11 5A2A A I1VWH9 3.2.1.1 5A2A.A 3 3.2\n", "12 5J1S A O14656 3.6.4.- 5J1S.A 3 3.6\n", "13 2J63 A O15922 4.2.99.18 2J63.A 4 4.2\n", "14 4Q5R A O18598 2.5.1.18 4Q5R.A 2 2.5\n", "15 5BSM A O24146 6.2.1.12 5BSM.A 6 6.2\n", "16 1KUF A O57413 3.4.24.- 1KUF.A 3 3.4\n", "17 3ABD X O60673 2.7.7.7 3ABD.X 2 2.7\n", "18 3H0L A O66610 6.3.5.7 3H0L.A 6 6.3\n", "19 2PCJ A O66646 7.6.2.- 2PCJ.A 7 7.6" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "enzymes.toPandas().head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Count the occurrences of the enzyme types" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
enzymeTypecount
024911
134635
212799
341203
45805
56620
67261
\n", "
" ], "text/plain": [ " enzymeType count\n", "0 2 4911\n", "1 3 4635\n", "2 1 2799\n", "3 4 1203\n", "4 5 805\n", "5 6 620\n", "6 7 261" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counts = enzymes.groupBy(\"enzymeType\")\\\n", " .count()\\\n", " .sort(\"count\", ascending=False)\\\n", " .toPandas()\n", " \n", "counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use pandas to plot the occurances with Matplotlib" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYAAAAEGCAYAAABsLkJ6AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAFYFJREFUeJzt3X2wnnV95/H3hwRBgZWHnDJAsGFGujyMJGgWqMjKQxvjwzRuByyCGCkl0y0d3G07W1y7w67Cjh27QnFbW0ai0KoIWCV1HTXLw1oQlERQDKlLiihBhJggal2EkO/+cf9CjzHxnJOcc+4cf+/XzJn7un7X7/pd3ys55/6c+3o6qSokSf3ZY9gFSJKGwwCQpE4ZAJLUKQNAkjplAEhSpwwASeqUASBJnTIAJKlTBoAkdWr2sAv4eebMmVPz5s0bdhmSNKOsXr36e1U1Mla/3ToA5s2bx6pVq4ZdhiTNKEm+NZ5+HgKSpE6NKwCSPJzk/iT3JVnV2g5MsjLJg+31gNaeJFclWZfka0lePmqcpa3/g0mWTs0uSZLGYyKfAE6rqgVVtbDNXwLcUlVHAre0eYDXAke2r2XAB2AQGMClwInACcClW0NDkjT9duUcwBLg1DZ9LXA78Met/boaPGf67iT7Jzmk9V1ZVZsAkqwEFgMf24UaJHXu2WefZf369Tz99NPDLmXa7b333sydO5c999xzp9YfbwAU8PkkBfx1VV0NHFxVj7Xl3wUObtOHAY+MWnd9a9tRuyTttPXr17Pffvsxb948kgy7nGlTVWzcuJH169dzxBFH7NQY4w2AV1XVo0l+CViZ5B+3KaRaOOyyJMsYHDriJS95yWQMKekX2NNPP93dmz9AEg466CA2bNiw02OM6xxAVT3aXp8APsngGP7j7dAO7fWJ1v1R4PBRq89tbTtq33ZbV1fVwqpaODIy5mWsktTdm/9Wu7rfYwZAkn2S7Ld1GlgEfB1YAWy9kmcpcHObXgG8tV0NdBLwVDtU9DlgUZID2snfRa1NkjQE4zkEdDDwyZY0s4GPVtVnk9wD3JDkAuBbwJta/88ArwPWAT8Gzgeoqk1J3g3c0/q9a+sJ4cky75L/NZnD/YyH3/P6KR1f0q6b7PeB3eHn/sorr2TZsmW86EUvmtRxxwyAqnoImL+d9o3AGdtpL+CiHYy1HFg+8TIlqV9XXnklb3nLWyY9ALwTWJImwXXXXcdxxx3H/PnzOe+883j44Yc5/fTTOe644zjjjDP49re/DcDb3vY2brrppufX23fffQG4/fbbOfXUUznzzDM56qijOPfcc6kqrrrqKr7zne9w2mmncdppp01qzbv1s4AkaSZYs2YNl112GV/84heZM2cOmzZt
YunSpc9/LV++nIsvvphPfepTP3ece++9lzVr1nDooYdy8sknc+edd3LxxRfzvve9j9tuu405c+ZMat1+ApCkXXTrrbdy1llnPf8GfeCBB3LXXXdxzjnnAHDeeedxxx13jDnOCSecwNy5c9ljjz1YsGABDz/88FSWbQBI0nSaPXs2W7ZsAWDLli0888wzzy/ba6+9np+eNWsWmzdvntJaDABJ2kWnn346N954Ixs3bgRg06ZNvPKVr+T6668H4CMf+QinnHIKMHjM/erVqwFYsWIFzz777Jjj77fffvzwhz+c9Lo9ByDpF8owLts89thjeec738mrX/1qZs2axfHHH8/73/9+zj//fN773vcyMjLChz70IQAuvPBClixZwvz581m8eDH77LPPmOMvW7aMxYsXc+ihh3LbbbdNWt0ZXLW5e1q4cGFN5A/CeB+A1J+1a9dy9NFHD7uModne/idZPerJzTvkISBJ6pQBIEmdMgAkzXi786HsqbSr++1J4N3Jf33xFI//1NSOLw3B3nvvzcaNGznooIO6eiro1r8HsPfee+/0GAaApBlt7ty5rF+/fpeeiz9Tbf2LYDvLAJA0o+255547/Rexeuc5AEnqlAEgSZ0yACSpUwaAJHXKAJCkThkAktQpA0CSOmUASFKnDABJ6pQBIEmdMgAkqVMGgCR1ygCQpE4ZAJLUKQNAkjplAEhSpwwASeqUASBJnTIAJKlTBoAkdWrcAZBkVpJ7k3y6zR+R5EtJ1iX5eJIXtPa92vy6tnzeqDHe0dq/keQ1k70zkqTxm8gngLcDa0fN/ylwRVW9FHgSuKC1XwA82dqvaP1IcgxwNnAssBj4yySzdq18SdLOGlcAJJkLvB74YJsPcDpwU+tyLfDGNr2kzdOWn9H6LwGur6qfVNU3gXXACZOxE5KkiRvvJ4Argf8EbGnzBwHfr6rNbX49cFibPgx4BKAtf6r1f759O+tIkqbZmAGQ5A3AE1W1ehrqIcmyJKuSrNqwYcN0bFKSujSeTwAnA7+R5GHgegaHfv4c2D/J7NZnLvBom34UOBygLX8xsHF0+3bWeV5VXV1VC6tq4cjIyIR3SJI0PmMGQFW9o6rmVtU8Bidxb62qc4HbgDNbt6XAzW16RZunLb+1qqq1n92uEjoCOBL48qTtiSRpQmaP3WWH/hi4PsllwL3ANa39GuBvkqwDNjEIDapqTZIbgAeAzcBFVfXcLmxfkrQLJhQAVXU7cHubfojtXMVTVU8DZ+1g/cuByydapCRp8nknsCR1ygCQpE4ZAJLUKQNAkjplAEhSpwwASeqUASBJnTIAJKlTBoAkdcoAkKROGQCS1CkDQJI6ZQBIUqcMAEnqlAEgSZ0yACSpUwaAJHXKAJCkThkAktQpA0CSOmUASFKnDABJ6pQBIEmdMgAkqVMGgCR1ygCQpE4ZAJLUKQNAkjplAEhSpwwASeqUASBJnTIAJKlTBoAkdcoAkKROjRkASfZO8uUkX02yJsl/a+1HJPlSknVJPp7kBa19rza/ri2fN2qsd7T2byR5zVTtlCRpbOP5BPAT4PSqmg8sABYnOQn4U+CKqnop8CRwQet/AfBka7+i9SPJMcDZwLHAYuAvk8yazJ2RJI3fmAFQAz9qs3u2rwJOB25q7dcCb2zTS9o8bfkZSdLar6+qn1TVN4F1wAmTsheSpAkb1zmAJLOS3Ac8AawE/gn4flVtbl3WA4e16cOARwDa8qeAg0a3b2ed0dtalmRVklUbNmyY+B5JksZlXAFQVc9V1QJgLoPf2o+aqoKq6uqqWlhVC0dGRqZqM5LUvQldBVRV3wduA34V2D/J7LZoLvBom34UOBygLX8xsHF0+3bWkSRNs/FcBTSSZP82/ULg14G1DILgzNZtKXBzm17R5mnLb62qau1nt6uEjgCOBL48WTsiSZqY2WN34RDg2nbFzh7ADVX16SQPANcnuQy4F7im9b8G+Jsk64BNDK78oarWJLkBeADYDFxU
Vc9N7u5IksZrzACoqq8Bx2+n/SG2cxVPVT0NnLWDsS4HLp94mZKkyeadwJLUKQNAkjo1nnMA0ri87NqXTen49y+9f0rHl3rjJwBJ6pQBIEmdMgAkqVMGgCR1ygCQpE4ZAJLUKQNAkjplAEhSpwwASeqUASBJnTIAJKlTBoAkdcoAkKROGQCS1CkDQJI6ZQBIUqcMAEnqlAEgSZ0yACSpUwaAJHXKAJCkThkAktQpA0CSOmUASFKnDABJ6pQBIEmdMgAkqVMGgCR1ygCQpE4ZAJLUqTEDIMnhSW5L8kCSNUne3toPTLIyyYPt9YDWniRXJVmX5GtJXj5qrKWt/4NJlk7dbkmSxjKeTwCbgT+sqmOAk4CLkhwDXALcUlVHAre0eYDXAke2r2XAB2AQGMClwInACcClW0NDkjT9xgyAqnqsqr7Spn8IrAUOA5YA17Zu1wJvbNNLgOtq4G5g/ySHAK8BVlbVpqp6ElgJLJ7UvZEkjduEzgEkmQccD3wJOLiqHmuLvgsc3KYPAx4Ztdr61raj9m23sSzJqiSrNmzYMJHyJEkTMO4ASLIv8AngP1TVD0Yvq6oCajIKqqqrq2phVS0cGRmZjCElSdsxrgBIsieDN/+PVNXftebH26Ed2usTrf1R4PBRq89tbTtqlyQNwXiuAgpwDbC2qt43atEKYOuVPEuBm0e1v7VdDXQS8FQ7VPQ5YFGSA9rJ30WtTZI0BLPH0edk4Dzg/iT3tbb/DLwHuCHJBcC3gDe1ZZ8BXgesA34MnA9QVZuSvBu4p/V7V1VtmpS9kCRN2JgBUFV3ANnB4jO207+Ai3Yw1nJg+UQKlCRNDe8ElqROGQCS1CkDQJI6ZQBIUqcMAEnqlAEgSZ0yACSpUwaAJHXKAJCkThkAktQpA0CSOmUASFKnDABJ6pQBIEmdMgAkqVMGgCR1ygCQpE4ZAJLUKQNAkjplAEhSpwwASeqUASBJnTIAJKlTBoAkdcoAkKROGQCS1CkDQJI6ZQBIUqdmD7sAaXex9qijp3T8o/9x7ZSOL02UnwAkqVMGgCR1ygCQpE4ZAJLUqTEDIMnyJE8k+fqotgOTrEzyYHs9oLUnyVVJ1iX5WpKXj1pnaev/YJKlU7M7kqTxGs8ngA8Di7dpuwS4paqOBG5p8wCvBY5sX8uAD8AgMIBLgROBE4BLt4aGJGk4xgyAqvoCsGmb5iXAtW36WuCNo9qvq4G7gf2THAK8BlhZVZuq6klgJT8bKpKkabSz5wAOrqrH2vR3gYPb9GHAI6P6rW9tO2r/GUmWJVmVZNWGDRt2sjxJ0lh2+SRwVRVQk1DL1vGurqqFVbVwZGRksoaVJG1jZwPg8XZoh/b6RGt/FDh8VL+5rW1H7ZKkIdnZAFgBbL2SZylw86j2t7argU4CnmqHij4HLEpyQDv5u6i1SZKGZMxnASX5GHAqMCfJegZX87wHuCHJBcC3gDe17p8BXgesA34MnA9QVZuSvBu4p/V7V1Vte2JZkjSNxgyAqnrzDhadsZ2+BVy0g3GWA8snVJ0kacp4J7AkdcoAkKRO+fcApF8Qf/G7t07p+Bf91elTOr6mn58AJKlTBoAkdcoAkKROGQCS1CkDQJI6ZQBIUqcMAEnqlPcBSBq6//Fbb5jS8f/w45+e0vFnKj8BSFKnDABJ6pQBIEmdMgAkqVMGgCR1ygCQpE4ZAJLUKQNAkjplAEhSpwwASeqUASBJnTIAJKlTBoAkdcoAkKRO+ThoSdpF6y/5hykdf+57TpmScf0EIEmdMgAkqVMGgCR1ygCQpE4ZAJLUKQNAkjplAEhSp6Y9AJIsTvKNJOuSXDLd25ckDUxrACSZBfwF8FrgGODNSY6ZzhokSQPT/QngBGBdVT1UVc8A1wNLprkGSRKQqpq+jSVnAour6nfa/HnAiVX1+6P6LAOWtdl/DXxjCkuaA3xvCsefatY/XNY/PDO5dpj6+n+5qkbG6rTbPQuoqq4Grp6ObSVZVVULp2NbU8H6
h8v6h2cm1w67T/3TfQjoUeDwUfNzW5skaZpNdwDcAxyZ5IgkLwDOBlZMcw2SJKb5EFBVbU7y+8DngFnA8qpaM501bGNaDjVNIesfLusfnplcO+wm9U/rSWBJ0u7DO4ElqVMGgCR1ygCQpE4ZADNIkhOS/Js2fUySP0jyumHX1aMk1w27BmlX7XY3gk2lJEcBhwFfqqofjWpfXFWfHV5lY0tyKYNnKM1OshI4EbgNuCTJ8VV1+VAL/AWWZNtLlQOclmR/gKr6jemvaucleRWDx7J8vao+P+x6xpLkRGBtVf0gyQuBS4CXAw8A/72qnhpqgWNIcjHwyap6ZNi1bKubq4Daf8JFwFpgAfD2qrq5LftKVb18mPWNJcn9DOreC/guMHfUD8SXquq4oRa4C5KcX1UfGnYdO5LkKwzebD4IFIMA+BiD+1ioqv8zvOrGluTLVXVCm76Qwc/BJ4FFwN9X1XuGWd9YkqwB5rfLyK8GfgzcBJzR2n9zqAWOIclTwD8D/8Tg++bGqtow3KqaquriC7gf2LdNzwNWMQgBgHuHXd846r93e9Nt/r5h17eL+/btYdcwRn17AP8RWAksaG0PDbuuCdQ/+nvnHmCkTe8D3D/s+sZR/9pR01/ZZtlu/70P3Nu+hxYB1wAbgM8CS4H9hllbT4eA9qh22KeqHk5yKnBTkl9m8Bvd7u6ZJC+qqh8Dr9jamOTFwJbhlTU+Sb62o0XAwdNZy0RV1RbgiiQ3ttfHmVmHT/dIcgCDN6FU++2zqv45yebhljYuXx/1KfGrSRZW1aokvwI8O+zixqHa99Dngc8n2ZPB4dw3A38GjPnQtqkyk76Jd9XjSRZU1X0AVfWjJG8AlgMvG25p4/Jvq+on8Pwb0lZ7MvhNYnd3MPAa4Mlt2gN8cfrLmbiqWg+cleT1wA+GXc8EvBhYzeDfupIcUlWPJdmXmfHLz+8Af57kTxg8QfOuJI8Aj7Rlu7uf+jeuqmcZPAJnRZIXDaekgZ7OAcwFNlfVd7ez7OSqunMIZXUjyTXAh6rqju0s+2hVnTOEsrrW3nwOrqpvDruW8Ujyr4AjGPziur6qHh9ySeOS5Feq6v8Ou47t6SYAJEk/zfsAJKlTBoAkdaqnk8DStEjySQbHqvdlcIXH1mPsv1dVM+KEt/rgOQBpirRLjf+oqt4w7Fqk7fEQkGa0JG9J8uUk9yX56ySzkvwoyeVJvprk7iQHt773jfr6f0leneTBJCNt+R5J1iUZSfLhJB9o6z+U5NQky5OsTfLhUdtflOSuJF9JcmO7tHJHtS5KctOo+de2dWYn+X6Sq5KsSbIyyUGtz5FJPpdkdZIvtGvfpUlhAGjGSnI08FvAyVW1AHgOOJfBHa53V9V84AvAhQBVtaD1+y8M7gT/IvC3bR2AXwO+Wv9ym/4BwK8yuAt4BXAFcCzwsiQLkswB/gT4tRo8SmQV8Ac/p+T/DRy39c0dOJ/BfSgwuFb/zqo6Frir1QiDvxz1e1X1CuAdwP+c2L+StGOeA9BMdgaDu6LvSQLwQuAJ4Bng063PauDXt66Q5EjgvcBpVfVskuXAzcCVwG8Do59J9PdVVe05TI9X1f1tjDUMHicyFzgGuLNt/wUM3ry3q6q2JPkIcE57fQWDu0EDbAZubF3/Fvhoe9jcScAn2vjgz6wmkd9MmskCXFtV7/ipxuSP6l9Obj1H+z5vh2duAC6sqscAquqRJI8nOZ3BEzLPHTXUT9rrllHTW+dnt7FXVtWbJ1DzcuATbfrjVfVcku39HG596Nz32qcWadJ5CEgz2S3AmUl+CSDJge3ZTjuynMHdyP+wTfsHGfzWfWNVPTeB7d8NnJzkpW37+4x1jL4GjwT+HoNHGn941KLZwNanWp4D3FFVTwKPJfl3bfw9ksyfQH3Sz2UAaMaqqgcYHIP/fHvY3ErgkO31bcFwJvDbo04EL2yLVzC4ZHNCj6Ru5wreBnysbf8u4KhxrPpR4JvbPB7gKeCUdnjpVcBl
rf1s4HeTfBVYA3hFkSaNl4Gqey0IrqiqU6Zpe38F3FVV17b52QwO9ew/HduXtvIcgLqW5BLg3/PTx/6ncnv3MXgi6sXTsT3p5/ETgCR1ynMAktQpA0CSOmUASFKnDABJ6pQBIEmd+v9bmh77J2be1wAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "counts.plot(x='enzymeType', y='count', kind='bar');" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "spark.stop()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 4 }