{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# An introduction to k-mers for genome comparison and analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "k-mers provide sensitive and specific methods for comparing and analyzing genomes.\n", "\n", "This notebook provides pure Python implementations of some of the basic k-mer comparison techniques implemented in sourmash, including hash-based subsampling techniques.\n", "\n", "### Running this notebook\n", "\n", "You can run this notebook interactively via mybinder; click on this button:\n", "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/dib-lab/sourmash/master?filepath=doc%2Fkmers-and-minhash.ipynb)\n", "\n", "A rendered version of this notebook is available at [sourmash.readthedocs.io](https://sourmash.readthedocs.io) under \"Tutorials and notebooks\".\n", "\n", "You can also get this notebook from the [doc/ subdirectory of the sourmash GitHub repository](https://github.com/dib-lab/sourmash/tree/master/doc). See [binder/environment.yml](https://github.com/dib-lab/sourmash/blob/master/binder/environment.yml) for installation dependencies.\n", "\n", "### What is this?\n", "\n", "This is a Jupyter Notebook using Python 3. If you are running this via [binder](https://mybinder.org), you can use Shift-ENTER to run cells, and double-click on code cells to edit them.\n", "\n", "Contact: C. Titus Brown, ctbrown@ucdavis.edu. Please [file issues on GitHub](https://github.com/dib-lab/sourmash/issues/) if you have any questions or comments!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculating Jaccard similarity and containment\n", "\n", "Given any two collections of k-mers, we can calculate similarity and containment using Python's built-in set union and intersection operations. The Jaccard similarity of two sets $A$ and $B$ is $|A \\cap B| / |A \\cup B|$, and the containment of $A$ in $B$ is $|A \\cap B| / |A|$."
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "def jaccard_similarity(a, b):\n", "    # shared k-mers divided by all distinct k-mers in either collection\n", "    a = set(a)\n", "    b = set(b)\n", "\n", "    intersection = len(a.intersection(b))\n", "    union = len(a.union(b))\n", "\n", "    return intersection / union" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def jaccard_containment(a, b):\n", "    # fraction of the distinct k-mers in a that are also in b\n", "    a = set(a)\n", "    b = set(b)\n", "\n", "    intersection = len(a.intersection(b))\n", "\n", "    return intersection / len(a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's try these functions out on some simple examples!" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "a = ['ATGG', 'AACC']\n", "b = ['ATGG', 'CACA']\n", "c = ['ATGC', 'CACA']" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jaccard_similarity(a, a)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jaccard_containment(a, a)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.3333333333333333" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jaccard_similarity(b, a)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jaccard_similarity(a, c)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jaccard_containment(b, a)"
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAATAAAADqCAYAAAAlKRkOAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4wLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvqOYd8AAAG6NJREFUeJzt3WmQnMd93/Fv71w7ewALYA9cC4IE\nSZEgDpISSVGiRBJ0WZbkQ0kqdlXsMt8orvKRVGJXOaU4ymTlKJGcVGxLcRLbSWynYtkVOapYLsmk\nbIoSJREAD/ECSBAgQBA3sItzd2fn7rx4FiYE7cyDBWaf7p75faqmRldp/nxmn9/008+/+zHWWkRE\nQtTjugARkeulABORYCnARCRYCjARCZYCTESCpQATkWApwEQkWAowEQmWAkxEgqUAE5FgKcBEJFgK\nMBEJlgJMRIKlABORYCnARCRYCjARCZYCTESCpQATkWApwEQkWAowEQmWAkxEgqUAE5FgKcBEJFgK\nMBEJlgJMRIKVdl2AxDAmB/QDfVe854DsVa80UAOqMe81oARcAC4S+KPZzYQxwDJgBdFxyRAdi7j3\nOlAGKle8ykARmJ1/L9qCLSX4jyOLZAL/++0cxhiik3DkitcKlvZHpk4UZOeveJ0Dpn0LtvmgGiQ6\nJiuAlfPvQ0BqCT+6TnRczgCT86/ztuDX8elWCjBXjBkERnk3rIaJRgc+qPFusE0C72DtdJIFmAkz\nCGwgOjYriYLKlyuGGjDFFaFmC/aS25K6kwIsKcb0AjcDNxEFV6/bghbtPPAOcAQ43e4R2vwIa5To\n+NxENLoKSYkozA4Db+vSMxkKsKVkTBbYCGwC1tE5N01KREF2BDiKtdXr+T8xEyYDjBONtDYQXqg3\n0wBOAG8Bh23BVhzX07EUYO1mTJootG4hOjmXcn7GBw3gJNHo7C1s65GHmTC9wK1EgbWWzgn1ZurA\nMeAg8I4tXF/Yy8IUYO1gTIrohNw0/+7LXE3S6sDbwF6sPX3lf2EmzChwF1Gwd3qoN1MjGrUeBI7Y\ngq07rid4CrAbEY22NgPbiNob5F1TlSx7hn8dptNsIbpJIe8qAq8Cr9uCrbkuJlQKsOthTIZoNLGN\nzpm3aZtyjvrBOykd2UTuUg7z5BDlv1hJ7lK6a0derZSIgmyvLi8XTwG2GNGk/FZgC1HTpFyhkqW+\nbzulozeTt6kfnNuqQuPp5cz90Qi9MykF2QLKwB5gjy3YsutiQqEAuxZRC8Q2osvFrONqvFNL0Tiw\nhdLbt5NrxIyyyob6X62g9OeryJd7On4C/3pUgNeBV9WKEU8B1koUXHcTBVe3Tsw31TDYt9/D3IG7\nyNayizs+Mz3UvrySyl+uJF83mKWqMWA14A3gJQVZcwqwZoy5A3gAXSou6MwaSq/cT0+578ZGpOdT\nVL+4mtrzA+TbVVuHKQO7bcHuc12IjxRgVzNmBfAhYLXrUnxUzlF/5QHKZ9a1967rC/0Uf3c12Qtp\njXSbOAV8xxbsedeF+EQBdlnUy3UvsJ3Ob668Lm/fRnHf3eTqS3Q3sWSo/68Ryn+1Qi0pTTSI7li+\nqB6yiAIMwJhh4FHCW3+XiNkBqi8+RP3SimRaRg7lKH1uLamTWW8Wt/vmPPC0Ldgp14W41t0BZkwP\ncM/8S6OuBRy5heKe99HbSCV7fCqGxh+MUnpySKOxJhrAS0ST/A3XxbjSvQFmzBDRqGvEdS
k+qqVo\nvPQByqfXu51c3zVA8T+uoVctF01NEY3GunJurDsDzJgNwGP4s/+WVy6soPzChzGlG7zD2C5nU1R+\ncz32YK/uCDdRBZ6yBXvEdSFJ674AM2YL8CCo92ghb99O8fV76LWejXhq0PjjEUp/uVKXlE1YYJct\n2NdcF5Kk7gmwaL7rA0RNqXIVC/bV+yke3US/61pa+ZtlzH5xNX1Wza/NvAF8r1vmxbojwKI1jD8C\nrHddio9qKRrPPUz53FgYzaR788wV1pPTvFhTx4G/6YaNFDs/wIxZBvwY0Z7qcpW5PLWdj1EvDoY1\nv3QiQ/k3xklNZdT42sQF4IlO36u/swPMmNXAj6ItbxZ0cYjKrh30VHNhhsBMD7V/NU5dk/tNlYBv\n2II95bqQpdK5AWbMbcCH6d7dP1u6sJLyzh2k65mwj0/JUP+X49QO5BViTTSAZ2zB7nddyFLozACL\nFmJ/2HUZvjq/ivKuHaSXaklQ0sqG+m+MU3tTIdbKM524ILzzJkGN2Ui0GFsWcG54fuTVIeEFkLOk\nPnuU9J1FtBFgcx8yE+Zm10W0W2cFmDFriRpUdYt9AWdHKO16lHTcpoMhyllS//YYma1FtHfWwgyw\nw0yYda4LaafOCTBjRoCPoDmvBV1YQXn3o2Q7Mbwuy1p6/s0xsrfNaSTWRAr4UTNhOmb5XGcEWLSu\n8aNoadCCin1Ud+0glfSCbBeylp7PHCM1UkUPyFhYBviomTAd0VYU/h+0MQPAx1CrxIKqGerP/gh2\nsVs+h2ygQfq3jmAH6mjPrIX1Ah83E2bAdSE3KuwAi/as/xgQ/BexFBoGu3MH1VK/H4uykzRcI/vZ\no1QzDbpiSc116Ac+Nv+k9GCFG2DRsxk/ijrsm3r+w8xdWtm9I9NbyvR++rgm9VsYIrqcDHbqJdwA\ng0fQXl5N7dvG7ORa7dxwT5G+xyeZdV2Hx0aI9sULUpgBZsydQMf1tLTL5BiltzYrvC77++fou2dW\nI7EWNpoJE+QuLeEFWHTH8UHXZfiq3EvtxYdIo+1m/k4PmF8/QXpljZrrWjz2/hDvTIYVYNGeXjvQ\nQ2YXZMHufphaN91xvFYDDdL/+hg1Y+nAtXNtkQYeMxMmqEwIqljgfmDYdRG+2nsvxW6etI+zqUzv\nL5yh6LoOj60iOseCEU6AGbMO2Oa6DF9NjVI6fLvmveL8+AX6t2s+rJVtZsIEs/FnGAEW9XsFe6dk\nqc0/QSilea9r82snSeXUH9bKI6H0h4URYNHWOBpdNLHnPkrlvJZRXasVdTK/fFqjsBb6gIddF3Et\n/A+wqGVio+syfHVumPKxjWHsZe+TRy/Rd5d2rmjlphBaK/wOMGP6UctEUw2DfelBjC4dr8+vnqQn\npbuSrbzf9/WSfgcYPIBaJprav5Xi3ED3rXNsl9Ea2Z+b0l3JFtJ4flfS3wAzZhS41XUZvir3Ujt0\nhy4db9RPnic/pAbXVm41E2bMdRHN+Btg0UNopYm991Dphv29llrW0vMLZ+j45yfeIG+ncfw8AYy5\nFRh1XYavppdRObFBo692+eA0+ZvKCrEWRs2E8fJqyL8Ai5YL3ee6DJ+9dh91ejRx3y49YH7plDY/\njHG/j8uMvCsIuAMYdF2ErybHKJ0b1eir3TaXyGvHipYGiM5Nr/gVYMakgHtcl+Gz1+91XUHn+sdn\nNKqNca+ZMF49FMavAIPNRFvdygKmRilPD2mx9lIZr5DbNqsnGrXQB9zluogr+RNgxqSBu12X4bP9\nW7V+b6n97JSOcYy7zYTxpjfTnwCLer40t9PE9DIqmvtaeptLuiMZoxe4zXURl/kUYN5NEPrkzW1q\ntkzKz03pWMfw5lz1I8CMWYn6vpqay1M7tU6jr6TcN0N+uKoQa2HETJhVrosAXwLMo0T30aE7KKvv\nKzkpMH/vnCbzY3hxzroPsKh1wptrah8d36gF20l7eF
rHPMatPrRUuA+w6PFoOddF+GpyNaVKrzYr\nTNryOpn3zqixtYUcHjza0IcA82Io6qvDt2m/Klc+fkHHPobzc9dtgBmzDFjrtAaP1dI0zqzR6NSV\nu2fJ9dW1RrKFtWbCLHNZgOsR2Hscf77Xjt5C2WrLHGcy0PPYJU3mx3A6CnN3chhjUIC1dHyj6wrk\n0Yv6AYlxu8tdKlx+OWvQk4aaqqVpXFihdY+u3VIm16tHsLXSR3QuO+EywNY5/GzvnV6n3i8fpMC8\nf1qXkTGcncsuA0yT9y2cHNcdMF98cEbfRQxn57KbAIt2nhhx8tmBmBpTI6UvthbJGj1+rZVhM2Gc\n9Cq6GoGtdvjZ3js3TKWW1ePkfNHfIH3HnHaoaKGH6Jx28sEu6PKxhdPrqLquQX7Q+2e0uDuGk3Na\nAeahcyManfpm85y+kxhdEmDGZNH8V0uXhjT/5ZuNZX0nMYbNhEn8GLn4VVkNag9oZmaQSj2D81X+\n8oN6Lalx7dTaisFBP5iLANPlYwtnx7T2zldbipoHi6EA63ZnR3S73ldb51xX4L3Ez+1kAyxa/7gy\n0c8MzIVVunz01aaSvpsYK82ESXR6KOkRWJ+DzwxKqU+bF/pqpKrvJkYPCT/XNekwGUz484JSylNr\naPscb2WgRw/7iJXoOa4A88jMMk3g+268ogCLoQDrVtPLFWC+Gy/rJkuMgSQ/LOkAS/QfLjQzy3Ry\n+O6mivYGi6ERWLeaGdT8l+/WVPQdxVCAdatyXisUfDdUV4DF6NAAi3rAdAnZQjWjk8N3AwqwOP1J\n9oIl+WWoByxGTWsgvZdv6DuKkWgvWJKBosvHGPW0At53vZYe7c4aK7FzPckTRg9obaGSpYHRHFgI\nBuu6ExkjsXM9yW2LvRldvBcefwO29sH0FEy4rgegmqWOR8foRv2zJ3j86CW25lJMf+kf+HGM22Ww\nTv1S2oNLyd/ncSbZSoZp/oVXxzixv+MkTxj3X/i8n4Vnfw++4LqOK1WznfWr/shGnv3F9/l1jNtl\noOHJJeQ2nuXjXh7jjgwwb0YXvwoHxmHWdR1XanhzdNrjE3dwYLjPr2PcLmlf5sAe5ADLvTzGCrBu\nY40nJ4XE0h9yrMSuthRgnrCavg9Gj35q4nRkH1hHzfG0m9FJEYyGfmziJPbXrADzhLFqoQiF/pBj\nJXaIkmyj8OZ73wqfPAS3l2BgAD7/M/DV/wHfc1lTjzdHpz1+5et88tQMt1fqDPzDL/P5D23gq//0\nAbfHuF0avjxV67/wSc5zOzUG+CyfZwtf5ae8OMaJbQvVlQH2Gvx31zVcLV3trDnC//wx/45xu8yk\nPAmwX/L2GCd2rid50mizvhay5c4KsE52MeVPT6OnOjLA9FDQFjIVevClv0hamtZzC+Ikdq4n+UVM\nJ/hZwTFgUlpj572yoW61ZjVOYud6kgE2i0fzYD5KV3SZ7btij/6GY1gSXOWSXIBZm+g/WIgyVV1C\n+k4BFmvWFmxHzoGBLiNbypZ0cvjuYkrfUYxEz/GkA2wm4c8LSv+MTg7fHctplByjowNMI7AWBi/q\n7pbvjmVdV+C9RAcpCjCPDF7U3S3fHcnqRyaGRmDdauBioisj5DocyamJNYYCrFvl50gb9YJ5qwqN\nyQwZ13V4rqMDTL1gMfJzVF3XIAs7l6bmugbPJd4qlWyARb1gFxL9zMAsO6dmVl8d7NV3E+N8kj1g\n4GaX1BMOPjMYq864rkCa2dOnFooYJ5P+QAWYZ4ZPayLfV3vymv+Kkfi57SLATpLglrOhGbxEtqem\nSxXflA31wzn9uMToghGYtWXgbOKfG5BlF7X1kG+OZKlqF4qWztqCLSX9oa6a8nQZ2cKKSd2p9c0b\neY2KYzg5pxVgHho7rmZJ3zw3oO8kRlcF2Ck0D9bUqjPkUlX94vuiZKi/2kfOdR0esziY/wJXAWZt\nBZhy8tkBMGBWnd
E8mC/29FHW/FdLZ23BOvl7dbkwVZeRLaw5qhGqL54dcF2B95ydyy4D7LjDz/be\n6mPkaCjEXGuA3Tmoy8cYXRlgJ4DEb7uGIlMlpXYK997JUZnRY9RaKQHHXH24uwCztgHsd/b5AVh3\nWBP5rn17UN9BjANJr3+8kuvN2fY5/nyvjR8iZxrqCXOlBo0nh3T5GMPpOew2wKy9QNRSIQvIVkit\nOk3ZdR3d6rU+Srp8bOm0LdjzLgtwPQIDjcJa2njAdQXd66+H1DoRw/m560OAHSLBR5GHZuw4vZmy\n5mGSNtNDbdcAva7r8FgVOOi6CPcBZm0NDw6ErwyYtUd0tzZp3x1U82qMg7Zgne9Q6z7AIs6Hoj67\n5Q2yWPWEJaUB9isr0QPUWvPinPUjwKydRFvsNNU/S2bkpEZhSXm5j9LJrDYvbOGcLVgv9g72I8Ai\nb7ouwGfvec2r76qj/e9hHesYXoy+wK8A2w9qGWhm6By5obPMua6j072Vo3Qgr96vFip41IDuT4BF\nO1S86roMn93+miaVl9qXhjXXGONVVztPLMSfAIu8htZHNjV6kt6+aY1Sl8qpDOXnB8i7rsNjJaJz\n1Bt+BVjUUvGy6zJ8dufLGiEslT8a0bGN8bItWK8evOxXgEX2AkXXRfhqzTF6l5/VKLXdDuYoPTuo\nxtUWisDrrou4mn8BZm0deMF1GT7b+jxGfWHt9Xtjml+M8aIPjatX8y/AIm+ivrCmhs6TGzuuO5Lt\n8nw/c7rz2NI5PGqduJKfAWatBXa6LsNnd71IRlvt3LgaNP7rmB5YG+NZW7Bejvj9DDAAa08Ah12X\n4au+Ipmb3tIo7EZ9Y4i5yYy67lt4xxast8+v8DfAIrtAo4xm7niZfG4Or+4KheR8iur/HFHbRAsN\nonPQW34HmLWXgOddl+GrdJ2e7bu11c71+uJqauUez88Bt16wBXvRdRGt+P/lWfsKegRbU6Mn6V19\nVG0ni7VrgKKaVls6Abziuog4/gdY5GnUod/U9t3kMmW8u8Xtq5kear+9WncdWygDT/s6cX+lMALM\n2lngGddl+CpTJbV9t+bCrtUXVlMtaq/7Vp6xBTvruohrEUaAAVh7GHjDdRm+Wn2c/Pq3dSkZ5+ll\nFHcO6tKxhX22YN92XcS1CifAIjuBC66L8NW23eQHLmqxdzPHM5R/d7XCq4WLwLOui1iMsAIsWuz9\nTdRasaAei7n/W/SkarozebWSoV5YT09d+9w30wCe8nG5UCthBRiAtVPAc67L8FVfkczdO/WUp6v9\nzhoqp7VNdCvP24Kdcl3EYoUXYADWvgocd12Gr9YcI3/TAYKYhE3CE8uZ/Z7mvVo5YQvW+5aJhYQZ\nYJGnAKdPBfbZlhfoWzGppUb7epn7b2P0ua7DYxeAv3VdxPUKN8CsLQFfB2Zcl+IjA+aBp8l18w6u\nJzKUPz1OTvNeTc0AX7MFG2yPZbgBBpf7w76GmlwXlK7T88G/JZUtdV+P2MUU1U9tIFXSUqFmSsDX\nQ+n3aib8L9fai0Qjsa47Sa9FrkT6waew3XRnsmSof2ocey6tbXKaqAJ/bQs2+Jak8AMMLt+ZfBK6\n5yRdjMFLZO/7NtVu2D+sBo3PrKd6NKcnazdRB560BTvpupB26IwAg8v7hz0F2mp5IcNn6H3vdyl3\ncojVoPG5tZRf69Pe9k1Yol6vjtkcoXMCDC4vN9KaySZWHyf/vu9QNvXOC7EqNP7dOsq71S7Rynds\nwR52XUQ7dVaAAVj7Jp5vwubS2Any9z9DpZNCrAqN31xPRdvjtPScLVgv97W/Ecb6v2PG9TFmC/Ag\n6Bb6QibHKD3/MNlGKuwfsSo0JtZTeaVfl41NWGCXLVivHkjbLp0bYADGbAAeAy0hWcjZEUrPPUKm\nng5za5mSof6Z9VQ159VUlWjO64jrQpZKZwcYgDErgR8DBlyX4qOZQSo7d0C5L6y7
dudTVD81jj2u\nu43NzABP2II957qQpdT5AQZgTB74CDDquhQfVbLUd+2gemlFGCOZQzlKn15P5lKgI8cEnAG+YQu2\n4/eH644AAzAmBTwCbHJciZcaBvviQ8ydXu/3usFdAxQ/t5a8lgc1dQj4Vmjb4lyv7gmwy4x5H3Cv\n6zJ8tW8bs29tpg/PAqIB9isrKf7JCP2ua/HYS7Zgu+opXt0XYADG3Ao8DLoEWcjkGKXvf5B0NefH\nUpzpHmr/YS21l3SnsZkG8G1bsAdcF5K07gwwAGNGgUeB5a5L8VE5R/3Fh6icG3XbW/V6L3OfXUdW\n811NXQK+aQv2jOtCXOjeAAMwJg3cD2xxXYqvDmymuH8rvTbhXR3qYP90mLkvr/J7Ts6xvcDubpnv\nWkh3B9hlxqwlmuBXq8UCLqyg/P2HoDiQzLMUT6epfH4t9kBez25sYobokrHrdyVWgF1mTBZ4ALjT\ndSk+ahjsgS0U37qTvF2i7v0qNP7vKub+fBV9usvY1D6izno99wAF2A8zZgz4ELDSdSk+mu2n+tIH\nqF0Ybu/c2P5e5n5rDWk9eKOpc8B3bcGecl2ITxRgCzGmB9gKvBf8uBPnm6M3U9x7L9la9saOz2wP\ntT8cpfLUcs11NVEDvg+8agu2Yxbgt4sCrBVjBolC7FY6ceeOG1RL0di/ldLh28g1FnmXsGyof22I\n0p8Nk9e2zwtqAG8BL9qCnXZdjK8UYNciCrK7gfegIPsh5Rz1fdspHbuZfNzdyho0vrmcuT8Zplet\nEQtqAG8CLyu44inAFsOYfmA70US/Tr6rzOWp7b2Xyqlx8ld38jfA7h5g7g9HyUxmNM+1gDrRBP0r\ntmD1pK1rpAC7Hsb0AduAzWiO7IcU+6ju30b1+AZ6Kyl4ZhmlP1tFRhP0C6oBrxPNcXX84ut2U4Dd\nCGN6iSb77wJt63KV6ZlB9tzxT+B4li3AoOuCPFPh3eDSYwGvkwKsHaIesluIJvvX0N27wB4l6hA/\nyvwfl5kwBhgnCvpxh7W5ZoGTwEHgoHq5bpwCrN2ivcduIdq2Z7XjapIyBRwB9mPtpVb/QzNhlgG3\nATcBwwnU5oNTRKH1ti4T20sBtpSiSf9NRIHWSZsp1oHjwDvAkfknpC+amTD9wAaiMFtLZ80nThKF\n1iFNyi8dBVhSolaMTcBGYBXh3cUsEo2y3gGOY9u7gNhMmDSwjijMNkBwja114CxwmCi0Wo5EpT0U\nYC5Enf6rgJH51ygwhF9zZzNEy1cmiUZZiT7J2UyYYaIwGwFW4NdNAAtcIDo2Z+bfz6pTPnkKMF8Y\nkyGaE7ocaCMkc9LOAueJwur8372srSbw2dfMTJgMUZBdfq2cf09ih9ZpfjCspmzBr+PTrRRgPov2\nK+snupy6/N4H5OZf2SteaaLHaNVavF/+12XeDaqg74SZCZPl3VDLET1CLz3/yrR4rxG1Mlz5KhFd\nKheJgr0IzHbzflu+U4CJSLC0rk9EgqUAE5FgKcBEJFgKMBEJlgJMRIKlABORYCnARCRYCjARCZYC\nTESCpQALmDHmW8aY88YYPcF6EYwxh40xc8aYmfnj9zVjTDdvtBgsBVigjDEbiR7Aa4GfdFpMmH7C\nWjtAtIPuaeCLjuuR66AAC9fPA7uAPwYed1tKuKy1JeAviB7QIoHppB0wu83PA/8J2A3sMsaMWWtP\nO64pOCZ6wtTPEP0YSGAUYAEyxjxEtNnf/7HWThljDgL/CPhtt5UF5f8ZY2pE2xRNAh9xXI9cB11C\nhulx4BvW2qn5f/8ldBm5WJ+w1g4BvcCvAN82xnTLQ1g6hgIsMCZ66tFPAw8bY04ZY04B/xzYbozZ\n7ra68Fhr69barxDtaf+Q63pkcRRg4fkE0cm2Gbh7/nUn8B2ieTFZBBP5KaIdXd9wXY8sjnZkDYwx\n5glgr7X21676z38a+AKw3rb5iUGdxhhzGBgj
+iGwRE9a+vfW2j91WZcsngJMRIKlS0gRCZYCTESC\npQATkWApwEQkWAowEQmWAkxEgqUAE5FgKcBEJFgKMBEJ1v8HqH8idkv7OpIAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "from matplotlib_venn import venn2, venn3\n", "\n", "venn2([set(a), set(b)])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAV0AAADKCAYAAAAGnJP4AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4wLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvqOYd8AAAIABJREFUeJzt3XtwnNd53/Hvs7vYXSwWF4IXgCQo\nkaJI8WJaokyKUm1Fsq268Uwta8ayNY1qq5PRH03jpK6bTqapY4b12J70EmeS2m3HkzRpx84kYyuO\n3diqFVuWLUsidbGupHgRRUoECZAgCGKx98vpH++uSUIg9gW4eM95d5/PDAYjioP34cHub8973nMR\nYwxKKaWCEbFdgFJKdRINXaWUCpCGrlJKBUhDVymlAqShq5RSAdLQVUqpAGnoKqVUgDR0lVIqQBq6\nSikVIA1dpZQKkIauUkoFSENXKaUCpKGrlFIB0tBVSqkAaegqpVSANHSVUipAGrpKKRUgDV2llAqQ\nhq5SSgVIQ1cppQKkoauUUgHS0FVKqQBp6CqlVIA0dJVSKkAx2wV0IhFiQB/QA0TxPvwaXwLUZn0V\ngWljyFkp2BLZJwKkgV6gC699Gu0lgOHKdqoAM0DG7DVVGzVbI5LEa6durmynRseqBlS51FZ5YBpj\nisEX29nEGGO7hrYjguAFah/eG+Hy731AcpE/ugpkgOm5vhtD+doqD57skwRzt1MvXuAu5m7MADku\ntU2jnTLAtNlrwvfhJRLFa5O52qkXiC/yJ5d452vpUnsZU7u2wtVsGrotIsIAMAKsBdbg9cyCdh4Y\nBU4BY8ZQsVDDvOohu/ayrz4LZeTx2mkUOGX2mqyFGuYnEgFW4rXRCLCK4IcDa8AYl9rqHBoY10xD\nd5FE6ObSG2ItXs/WJVVgHC+AR4EJYwj8ly37JAIMc6mdVuANDbhkikvBctrsNSUrVYgMcOnDaA2L\n770ulSJwmsYHuzHTlusJJQ3dBRBhGXATXoAMWi5noRpvmDeB48awZLeNsk/iwI3AerzADdOzgxpw\nDngbOLzkvWCRtXhttRZvOCVMMngBfAxjTtsuJiw0dJsQIQJsALYBqy2X0yp54HXgkDHMtOqHyj4Z\nBLYDmwhX0F5NDXgLeM3sNaMt+6kicbwP721Af8t+rl0XgIPAUYylO4WQ0NC9ChF6gK3AFiBluZyl\nYmiEiuHUYn5AffjgBrywHWphba6ZwguVI4sefhBZgddOG2mPD6W5lIFjwGsYM2m7GBdp6M4iwghe\nD+R63Bt7XErTeKFy2BiaTiOSfZLGa6eb8KYpdYoKXqgcNHvNRNO/7c062IjXVquWtjTnjOG9po7r\nLIhLNHTrRLgB2AUM2K7FsipwCHjBGAqz/6fskz7gNrwhl076UJrLGHDA7DVj7/g/Xti+C7iZxU8R\nbBd54EW83m/Hh2/Hh64Iw8DtdF4vpJkS3hvlVWOoyD5JAu/BG3LRlYxXOgHsN3vNRQBENuN9gIft\nwdhSmwaexZg3bBdiU8eGrgh9eGG73nIpbpNaho//4TG2/d523JvC
5JLaRyd49a+/zlCi2NZj261w\nDngaM8cdQgfouNAVIQrsxLvti1oux21rxvPsfjlKqhgnfqLA6j+KkHxTg3eWwQqV3xqjtCtLKlqm\nuuVlihuOtO3D11Y6AjyDMe8YxmpnHRW6IqwD3oudVVDh0Z2vcPtLJYYnZgVH1dD/oxyrvt5NpNDx\nQwxiMPdPkn/gPImEufIDPH2Rwi1PIwMXSNiqLySKwLPAoU5Z7dYRoVvfYOa9eE/a1XzWn8qx++UE\nsdrV7wKiF8us+WKN1KGODZSVZcr7TlFbV5onVA1m4yFyW19ybrWii84AP8KEcF+MBWr70BWhH/jH\nhG8FWbCkZtjzcp4Np3zeFlcNK/5PnuXf7rjb6N0z5H/nNPGU8Tc8tewc+d0/JR4v6XBWEzngx+2+\nuq2tQ1eEDcBd6AOg+XXnK7x/f4X+mYVPbUq9kGPtl5OdMNwgBvPwWXL/dIpUZIHT5eIFyrt/Sm3Z\neR1uaMLgzXB40XYhS6UtQ7e+dPc24N22a3He6rMF3vt8F13VxffCYhMlRv4AEifb9sOtr0J17yil\nzYXFLwSRGrUtL1HY+Lo+ZPPhJPB4Oy4pbrvQFSEF3IO30Yqazy0Hs2w5nkJascihXGPovxcYeKzt\nAmVrjuLvjxLtrbVm6e7K0+Te8yTJWFXnOzcxDfwDxsfKvxBpq9AVYQ3wQTprWerCJYpVfuXZEium\nWt9OvU/kWP2VbqTaFqvVPn6e3IMTdEdbvPoumaW05yfQO61DX01Ugacw5pDtQlqlbUJXhI3AB9Cl\nqfPrLlT40M9qpIpL92ZPHC1w3b+PEymGuif3mTPkPji9dEMBkQrV2x+nMjih47w+vIgxB2wX0Qqh\nflM0aOD6FETgAhQ3JXnryyVqidCus1/qwAWoxYg+835ikyuabzCkuAWR22wX0QqhD10NXJ+CCtyG\nEAdvEIHboMG7IG0RvKEOXQ1cn4IO3IYQBm+QgdugwbsgoQ/e0IauBq5PtgK3IUTBayNwGzR4FyTU\nwRvK0NXA9cl24DaEIHhtBm6DBu+ChDZ4Qxe6Grg+uRK4DQ4HrwuB26DBuyChDN5Qha4Iq4D3o4E7\nP6kZPvB01ZnAbShuSnLmd5zaxu/j590J3IZajOiBu4gWuqnYriUEbkFki+0iFiI0oStCEm+lWWhq\ntub2F/P0Zd2c+zlze4rJjzqxk9T2HIUHJ9xcSFOJE9t/N5Wa0B4T6ZfWe+uHfoZCKAJMvGWqH0CP\nP2nuhrdyrD/tVM/tHc79iyT5TVZvn/sqVP/DKLFWrzRrpcwAyVd248QHlOOiwD31o+2dF4rQBW4F\nRmwX4bz+6RK7XgnBIYixCKOfj1BNV21cXQzmD05RbtVeCkvp7Y30jF5H3nYdIdAH3G27CD+cD936\nkei32q7DebFKjbsPQNQ4/zsFoDrQxejnrOwg9fBZcpuK4Tmh96U9xLNpyrbrCIH1iNxsu4hmnH6D\nitCDzlTw533PF0gVQnF79Uv57d2c+2Q2yEvuyZC/dypcJznUYkT3v59aJYpzMz8cdBsiq20XMR9n\nQ7e+J+49EJ4eiTVbj+VYfc7tcdyrmbw/RXZnIDMaVpYpf/ZMOHf1yqVJvHgHTs38cJQAH0TE2feD\ns6GLtwm5HmXdzOBUkXe/7uQTeH8iwunfjVEZWNLpUWIw+05R83vEjovG1pE6caM+WPMhhXeH7CQn\nQ1eEFcAO23W4zxj+0QsQCfnwS60nxvhvLun47v2T5Oc9RDIkDu0kUUxg5QFkyKxxdf6uc6Fbnx52\nJzqO29zWN/L05kIfJIA3fze3Y0lunwcrVB5ok7PJqjGir+zW1Wo+3YaIc79350IX2AqstF2E85LF\nCu864twL6pqc+dcRTLTliwF+a4xSIsTDCrONrSN1fqWO7/qQBPbYLmI2p0K3vupst+06QmH3yyVi\ntbYJEgAqQ3HOf6KlY5Y7sxR2
Zd1a5tsKL92OGHS1mg9bEFllu4jLORW6wHugPW4Dl9TyC0VGxtsu\nSACY/FiSSl9LxizFYP7VeHsOU+XSJE5s1kUTPt1hu4DLORO6IvTjDS2oZna90r49HJOIcu7XWzJm\n+atT5IfL7fshfngHcZ2768sQIhtsF9HgTOjiTRFzqR43jZzJMzjd3nOXp+/uprT6mlZgddWoPXie\nrlaV5KJKnNiRHdrb9ek2RJzIFyeKEGEIcOaTyF3GcOtr7TWOO6eoMP4vr2ne7gPnKfRX2zt0AU5s\npruY1C0gfXDmTtqJ0EX3VvBn/WienpAt9V2s3K3dFNctau5uV43aRy6077DC5WpRIkfepVPIfNrp\nQm/XegEi9AHrbNcRCluOd0Av9zKTH1tUD+6eixTCvPJsoU6tJ1mN6NiuDylgve0irIcusM12AaHQ\nP11i2XRH9N5+KfO+xGKO9/nIVOcELkC1i+ipDdrb9cl63lgNXRGiwGabNYTGtjc6b9zOJKJc/NCC\nFgFsylNsh+W+C/XmTe05NW4JrEFkwGYBtnu6G9FdxJqLVWqMnOm4IAHgwr0L6rXeP9mZt9kz/SSn\nlmlv1yervV3boWu9qx8KN54stN3qM7/Kwwly2331dlNVqrtnOq+X23Bsm26E49NmRKydGmItdOs7\niTm1PM9Zm044f6zMkpr8mK/FIB+5QKHLfkfCmvG1JMtdGrw+xIFNti5u8wW63eK1w2PVRIF0vjOm\niV1NdmfCz9LgD0+1/7zc+ZgokRObdIjBJ2t32VZCV4Q43niuambrGx05RnmlWIQLH513iGHXDIXl\n1XCeCtFKJze5f9imI5YjYuWQBFs93fWgL46mItUawxP6oBFg+q55e7EfutiZD9BmK6SITw1qb9en\nG21c1Fbo6nHqfgydLxEJyem+S60yFJ/vSJ935bSX2zA2osuCfVpr46K23tBW/rGhs3ZMe2+Xm9k9\n57LgdUVKvTW9c2o4N9xZi0OuwQAi6aAvGnjoijAIhPggxQANT+ib53LZ2+b849tmtGd3uellxHVZ\nsG+BdwBt9HR1aMGPRLHaNueftUpu+5xDCLuzuhrrciZCZGKYJT3os41o6Kq6NWf1TTNbrTc2e+ex\nqMHcWNDx3NnG1+p8XZ/aO3Trey0MB3nN0Foz3r6nQ1yLmT1XDCVsyVNsp0MnW2ViSMe4fepGZHmQ\nFwy6pzuEThXzZ9VkR0/0v6rsriuGEm6f0R7dXHK9JIoJbRufAu3tBh26OrTgR+9MmWRJQ3cuhRvj\nlx/TfktWe7lXc3aNjuv6FGguBR26qwO+XjitPndN54O1NZOIUrixBN4JEZ24jaNf54b1iHafhhEJ\n7GFs0KHbH/D1wqk/o2+W+RSvqwIMl6lE0ZkLV5Pt7dzNfxYoBvQEdbHAfikidKF75/qTzmmQzKc0\nYgBGSjpmOZ9CSp+fLEBvUBcK8pMwsH9U6PXkdJxyPuXVArCmpLfP8ykmiBq0jXxqy9DtC/Ba4dZd\n1Ido8ymvigKs0ZHv+UWQXI+u1vMpsHzS0HVNV7lKrKpjcfOprIgArC7peG4z2T4dgvGpLUNXhxf8\n6M1qz6SZam8MEzUrKzpdrJlsr+7B4JMOL3Ssvhl9kzQVEUrD5cGKPihqZqZPx3R90p5ux+rNauj6\n0JsfKcWNTolqJpfWIRifuoM6rFJD1zXprL5JfBjKrtMPJx9yPXo3sACBZFQgv5D6mWgOjb+95yE4\ntANSGZjYZ7uaKySL7oTuo595iOm3dxBNZPjYN91pp1f+avvxw3/38QcgtmMVT37uV3jUdkmfeZSH\n3p5mRyJK5psfw5m2Ksfd6um+Bx46BDtSkJnAnXaq6wYuLPVFgurpOnYb+OBT8NU/sV3FnKIOdeDW\n3/0Uu37DrXaqloTD3/ln6997zx/++b3sPTTB7iffsr+8/O71PPUbu3CrrYBaxK3QfRCe+irutV
Nd\nIDkVVBg61MsF+OxRWJe1XcWcIg4999hy31FSK9xqpzce20C899zykRsneuJUt67g2Z+c4GbbZd23\nhaMrUrjVVoARt0L3s3B0He61U10gOdWhPV2HSc2pN4lzZsYGSPRORmveyNhgNxcuFllmuSpnGX01\nLURb9XQ1dP1yqafrsIjR50N+GMeGFxzXVqGrVGukh6coZgYbfbjJPMv6E0v/8CO09DPcOUGFrkNP\nhxynowvzu+GeE5Qyq86fProiWyJ6aILdd63nJdtluUqMxu4CBJJTYszS/05ESAH/fMkv5NuOh+H4\nZiikoTsDD3wX/uzntqsC4ENP5lk+5cYR9d//9MPMjG2mWkoTS2S47s7vsue37bfTK994V/TwIx9P\nUIpuX8nPP38XP7Bd0qe/z8NjM2wuVUknYmTuvI7v/vYerLdVtEz1w99y50H2Dnj4OGwuQLobMg/A\nd/8M++1U9yjGvLXUFwkqdJPAp5b8Qu3gnp/nWXnBjdB12K2xH2T2Vb6mC26aiJWo/Oq3dYGET9/H\nmFNLfREdXnCNbjDmSzWi+wL5Eanp8MICBJJTgbzDjaEEuq+nL4WEvkl8yMQntZ18iBe1w7MA+SAu\nEmS3aibAa4XXTGBHNYXauZ63nBmndFl3TvfTXYBMEBcJMnSnA7xWeE2ndfqCD5nu0a6iaKA0kwok\nRtpCDmMCuRsPMnT11+9Hpkd7cE3VDPGxrgsxHbJqJq3vOr8Caynt6bpGQ7e5aKaCVOVcTMcrm0lP\n6wIonwLLJ+3puqbcFaUS1dvm+cQmqgBn4hq6zaQyOl3MJ+3pdrRcUkN3Pl1nawCjXbqvwLxqmFTW\nnYURjtOebkfLdetY5Xy6TgMw6tgG3a5JFKkI2kY+tV9P1xjKQCGo64VaJmW7ArfFRwE4Hdde3HyS\nOl1sIdqypwswFfD1wumirm6dV8KbozvWRayq+2hdVU9Gx7x9qmBMYBurBx26ZwK+XjidWdVluwRn\nSbFK8kgcoBwh8nacou2SXLVK321+BdpSQYfukm8m0RZmerooxMu2y3BS8kgJuXQewgs9egs9J4NZ\ndZqE7TJCItBcCjp0x9E9GPw5u1zbaS49z10xnLA/rVOi5tKToRwv6Zi3T6NBXizQ0DWGGjrE4M/o\nkI5VziW9/4qQPdRNXJcDv9OKcfROyZ8cxkwGeUEbq1V0iMGP06sS6K7/V4peLJMYjV/+R0aQI0lK\ntkpy1dCo9nJ9Oh30BW2EbqBd+dAqxaNkerS3crnug3O2x3M9+uF0OalSWz5OvPnfVFjoBAYeusYw\nCeSCvm4oja/Q0L1cz4E5J/rruO6V+qcoRmu654JPgXcCbf1itLfrx6lhvUW8XPrAnL230QTxi1Ed\nw2xYeUbn5/o0FeT83AYNXZedHYxTFX0DAXSNFYlNX/VD6NVuDd2GoVPa8/fJyvMlW6F7Ap061lwt\nGmFspS6dBuh7fN4ZCv9vQG+nAZJZSgMXdH6uT8dsXNTKC7V+ZpqVf3DoHNqoQwxUagz833mD5Bc9\nJCdiOoth/VHtzPg0gTFnbVzYZu/gNYvXDo9zyxNkUp291DX9XGG+oYWG7w90duBEqtSuP6a9XJ8O\n2rqwtdA1hvN4K9RUM0c2dPbk/2WP+Ort//0AiXJAx2i7aPgUha6yzs/1weqdtu1xMGufNqHyxrok\nlUhnBm/X6SKpQ756b7ko0f29nbsBzsaD+gDNp8NBHUI5F9uhexzdY7e5aizC22s6M0yWfW9BPddv\nDVp/TVvRO0Whf0oXRPhktbNn9QVqDFXgdZs1hMbBjZ3Xi5Filf7HFjRG+UaSxFvxzvsg33C4c4dV\nFmgUYy7aLMCFXsEhdCPq5qZ740z2d1aY9P60QKS44Nfo3y3rrNdTrERl5E26bdcREtYf4FsPXWPI\nAG/briMUXr+ho8KEwW8vajP3x/tI5Dpo57GRExQjRs9C82
EGOGm7COuhW/eC7QJC4eSaJDPdnTEX\ntee53OwdxfwqR4h8Z7AzhhgiFaqbXiVpu46Q+AXGWO+4OBG6xnAW76GampcIL2zvgLG7So1V/+Oa\njiz61iDdUx2wH8OGIxQTRZ0m5sMUjjw/ciJ06w7QwXMsfRsdTnK+zcd2+3+cJz5+TaFbjhD53yva\nO3S7ilS0l+vbARd6ueBQ6BrDNDpv159nd7Tv+J0Uqqz8Xy0JkscGSJ3uat95uze9QjlWdec97LAx\njDlhu4gG135hz6Pzdpu7MJDg7eH23JN4+d8Uic607Hb5vw2350yGVIbi9Ue1l+uDAZ6yXcTlnApd\nYyjiDTOoZp7dkaASba8n9F1jRQa/1dKpT6+kSD6TbrNN8w3mlmdA0BkLPhzCmAnbRVzOqdAFMIbX\n0T0Zmismorx8U3vdOg//sbn8ePVW+epQex1eufot8oMTurGND3kc7MQ5F7p1T6ILJpo7vKGbi+n2\nCN70z3OkXluS2+WpGLFvrGiPsd1omeqO5zRwfdqPMc5NsXQydOs7kL1kuw73ifD0LVAL+ekSkZkK\nQ19b0iD5zjK6T7TB8uDtL1CMl3SKmA+nMOaI7SLm4mTo1j0HjNkuwnkXBhK8uDXEvbiqYe2XKn72\ny70WRpC964hlI+Hdc3fNSXLXHSdlu44QyAI/tl3E1TgbusZQA/4Bb1xGzefwDd2cGgrnw6Llf50j\n9UogT+EnY8T+82oqtRAOXaUyFG9+Rmcr+ODlhjHO3tU4G7oAxpADfkQI3ySBe2pn+JYIp17Ks+Kv\neoK85PNpko8Mhms2Q7RCdc/jRPRYdV/2Y4zTD+Kd/yUaw2ngWdt1OK8ai/CTPYRms/Po+RJrvmRl\n/9e/XEnPwWR47qBueZpST5ZrWqHXIY5jzCu2i2jG+dAFMIYXcWB3IOdl0nEO3ByC3m65xsg+iOas\nPRD6wgjxiyHYm2H9EbKrT+m2jT5cBJ6wXYQfoQjdup8AGdtFOO/k2m6Oj2RtlzGvVV8vkHzT6ikH\nM1GiX1hLrerw0FXfJIXtz+uDMx8qwGMY4/yHKIQodOur1R6D9pnkvmQOvDvFxbSbDxLSP8+x7AdO\nBMnhbhJ/udLNYYZYicqeJ4jpqjNffoYxk7aL8Cs0oQtgDBN4D9bCPS91qZmI8OM7YmSTbg01JF/P\ns/q/OvUE/m8HST3aj1N3BtEK1dsfp5oo6EGTPjyPMUdtF7EQoQpdAGM4gQZvc4VEjB/eGXEmeJOv\n51n3ewkiZedec18dpseV4I1WqN7xIyoDk7rqzIfnMeZ520UslHNvAD+M4U00eJtzJXgdDtwGF4JX\nA3dBQhm4ENLQBQ1e32wHbwgCt8Fm8GrgLkhoAxdCHLqgweubreANUeA22AheDdwFCXXgQshDFzR4\nfQs6eEMYuA1BBq8G7oKEPnChDUIXNHh9Cyp4Qxy4DUEErwbugrRF4AKII2e1tYQIw8A9oBPK59VV\nrnLnsyWGJlu/0qn/h1mGvpZCqm0xv/QjF8j9+lmSsRZ3UFIzFPc8TqRnRpf3NlEBnnR1m8bFaKvQ\nBRChG/ggsMZ2Lc7bcTjLtqMpIi2YgC/FKsN/WqLvibZbsropT/Hzo0QGqq0JyOG3yO18mqRuYNPU\nRbyVZqFZ+OBH24UugAgC7AJ22q7FeUPnCrzv+RjxyuIn4sfGS6z7fSF+pm17bekq1c+NUtqeX/w+\nCFKltv0FCuuP6Z2YD8eBJ8KytHch2jJ0G0S4Dng/6JjZvJLFCnftrzA4vfDVYulncqz+L0kixY7o\ntX3yHNn7J0lFFrg8N5GnvPsJagMX9LXYRA14BmNetV3IUmnr0AUQoRdvnHel7VrcZgy7Xs2x6aTP\n/W0rNVb9eYFl3+u4XtvOLIXfPU2sp+Zvme7yMfK7niTeVdZjdprI4m1A7vR+uNeq7UMXQIQocAew\nzXYtzlt3Os/tL8WJVa
8eENELZdZ+oUb30Y7ttQ1WqOw9ReWG4jynOdQwm14jd9OrBLpRe0idAn7s\n8okPrdIRodsgwhrgfcCA7VqclixW2P1yiZHxWb3YqmHgBzlW/kV3pwwnNHPvJPlPThBPmit7sX2T\nFHY+TaR3GqtbWIZAHu+0h7aZndBMR4UugAgR4N3AraC7OM1r1USBPS9FSOfjJI4WWP2VCIm3NURm\n6atQ/fQ4xTtmSMVKVLb9gpIeINmUAQ4BB1w8Jn0pdVzoNoiQBvYAG23X4rRIbZr7v3iEbZ/fgT6Q\nnE/tw+d5+Tv/k+F4iWHbxThuDHgaY87ZLsSGjg3dBhFWArcDq23X4pgi8ALwmjHUZJ8k8KbgbQd9\nIDTLG8ABs9d4J5uI3ADcBvTZLMpBU3g92xO2C7Gp40O3QYTrgd3AoO1aLCsDB4FfGMM7bvtkn6Tx\n2mkjbbKM/BqM4oXtO3tsIhG8B7e3oCsks8AvgNcxpuOX6mvozlJfSrwd2EBnhcoFvLA9OlfYzib7\nJAVsAbZCRz2dLwFHgYNmr7nQ9G974bse7zXVaXdTo8BrwEk0aH5JQ/cq6suJG6GStlzOUqkBJ/CG\nEM4s5gfIPokA1+P16ta2rjTnTOIFyDGzd5GrpESW4bXTJmjbWQ0l4DBwEGMu2i7GRRq6TdSXFF+H\n11MZsVxOq2Txnhy/bgy5Vv1Q2ScDeKGymfYIlRrectSDZq8Za9lPFenCC95ttM9w1gTendIxjKnY\nLsZlGroLIEIfXqCM4K1wC9NOWjm82703gZPGLN3R47JPYnhjvtfjbTwUpgCuAuPAW8ARs3eJJ+uL\nDAM34t0l9C/ptVpvCm9RwzGMOWu7mLDQ0F0kERJ4gbIWL4Rde1JdBs7gBe2oMVjZqUn2ieB9QI3g\ntdUQ7o2Vn8cLj1FgzOy11FMTSXOpndbCPKvd7MjjtZHXVsY4cZhn2Gjotkh9j4dGAK8h+DeMAc5x\nKTzGjXFvU/d6L7jxYbUWO7fXM1xqp9El780ulshyLoXwMMEv5qlw6YP7VLttsWiLhu4SqT+I68Xr\nAc/+3sPihiZKwDSQmeN7xsWQbaYewldrp14WFzQ1vGCdZo72MntDuAJKRPBeN1drp8VuOZmj/vrh\nyvbKAFmdddB6GroW1Jcip+tfEbzFBpH6l+CFRuOrirdQIWMMRSsFW1SfmpbGGxeOzPoyeO3TaKsK\nXthmzd4Oe2GLxPACuJtL7dN4XcGl11Ljex7I6EOv4GnoKqVUgFx7oKGUUm1NQ1cppQKkoauUUgHS\n0FVKqQBp6CqlVIA0dJVSKkAaukopFSANXYeIyK+JyHMiMiMiZ0TkByLyPtt1uURETohIvt5GF0Tk\n70Vkne26XCUiP6m3kx615AgNXUeIyGeBPwa+hLcpzHXA14CP2qzLUR8xxqTxNgUfB/7Ucj1OEpH1\nwJ14K/futVqM+iUNXQeISD/wH4HfNMY8YozJGmPKxpjvGWP+ne36XGWMKQDfwtuXVr3Tp4BngL8A\nHrJbimrQI8jdcAfermR/a7uQMBGRFPAAXrCod/oU8EfAfuAZERkyxoxbrqnjaei6YTkwYXTzEb++\nIyIVvF23zgH/xHI9zqk/C7ge+BtjzISIvAH8GvAVu5UpHV5ww3lghXg7Ranm7jPGDODdHXwaeEK8\nExjUJQ8BPzTGTNT/+5voEIMTNHTd8DTe9o332S4kTIwxVWPMI3hbFeosjzoR6QY+AdwlImMiMgb8\nG+BmEbnZbnVKQ9cBxjs19fOWjhWeAAAAsElEQVTAV0XkPhFJiUiXiHxYRP6T7fpcJZ6PAsvwDtpU\nnvvwPoi2AbfUv7YCP8Mb51UW6X66DhGRB/F6JFvxdu5/HviiMeYpq4U5RERO4E2pq+JNhToJfNkY\n8w2bdblERB4FXjPG/NtZf/4J4E+AEX1+YI+GrlJKBUiHF5RSKkAaukopFSANXaWUCpCG
rlJKBUhD\nVymlAqShq5RSAdLQVUqpAGnoKqVUgDR0lVIqQP8fgyxciwkTZw8AAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "venn3([set(a), set(b), set(c)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculating k-mers from DNA sequences\n", "\n", "To extract k-mers from DNA sequences, we walk over the sequence with a sliding window:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def build_kmers(sequence, ksize):\n", " kmers = []\n", " n_kmers = len(sequence) - ksize + 1\n", " \n", " for i in range(n_kmers):\n", " kmer = sequence[i:i + ksize]\n", " kmers.append(kmer)\n", " \n", " return kmers" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ATGGACCAGATATAGGGAGAG',\n", " 'TGGACCAGATATAGGGAGAGC',\n", " 'GGACCAGATATAGGGAGAGCC',\n", " 'GACCAGATATAGGGAGAGCCA',\n", " 'ACCAGATATAGGGAGAGCCAG',\n", " 'CCAGATATAGGGAGAGCCAGG',\n", " 'CAGATATAGGGAGAGCCAGGT',\n", " 'AGATATAGGGAGAGCCAGGTA',\n", " 'GATATAGGGAGAGCCAGGTAG',\n", " 'ATATAGGGAGAGCCAGGTAGG',\n", " 'TATAGGGAGAGCCAGGTAGGA',\n", " 'ATAGGGAGAGCCAGGTAGGAC',\n", " 'TAGGGAGAGCCAGGTAGGACA']" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "build_kmers('ATGGACCAGATATAGGGAGAGCCAGGTAGGACA', 21)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the k-mers that are output, you can see how the sequence shifts to the right - look at the pattern in the middle.\n", "\n", "So, now, you can compare two sequences!" 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "seq1 = 'ATGGACCAGATATAGGGAGAGCCAGGTAGGACA'\n", "seq2 = 'ATGGACCAGATATTGGGAGAGCCGGGTAGGACA'\n", "# differences:        ^         ^" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 0.09090909090909091\n" ] } ], "source": [ "K = 10\n", "kmers1 = build_kmers(seq1, K)\n", "kmers2 = build_kmers(seq2, K)\n", "\n", "print(K, jaccard_similarity(kmers1, kmers2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading k-mers in from a file\n", "\n", "In practice, we often need to work with hundreds of thousands of k-mers, which means loading them from sequences stored in files.\n", "\n", "There are three cut-down genome files in the `genomes/` directory that we will use below:\n", "\n", "```\n", "akkermansia.fa\n", "shew_os185.fa\n", "shew_os223.fa\n", "```\n", "The last two are strains of *Shewanella baltica*; the first is an unrelated genome, *Akkermansia muciniphila*."
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "import screed  # a library for reading in FASTA/FASTQ\n", "\n", "def read_kmers_from_file(filename, ksize):\n", "    all_kmers = []\n", "    for record in screed.open(filename):\n", "        sequence = record.sequence\n", "\n", "        kmers = build_kmers(sequence, ksize)\n", "        all_kmers += kmers\n", "\n", "    return all_kmers" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "akker_kmers = read_kmers_from_file('genomes/akkermansia.fa', 31)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['AAATCTTATAAAATAACCACATAACTTAAAA',\n", " 'AATCTTATAAAATAACCACATAACTTAAAAA',\n", " 'ATCTTATAAAATAACCACATAACTTAAAAAG',\n", " 'TCTTATAAAATAACCACATAACTTAAAAAGA',\n", " 'CTTATAAAATAACCACATAACTTAAAAAGAA']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "akker_kmers[:5]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "499970\n" ] } ], "source": [ "print(len(akker_kmers))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "shew1_kmers = read_kmers_from_file('genomes/shew_os185.fa', 31)\n", "shew2_kmers = read_kmers_from_file('genomes/shew_os223.fa', 31)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see the relationship between these three like so:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "akker vs shew1 0.0\n", "akker vs shew2 0.0\n", "shew1 vs shew2 0.23675152210020398\n" ] } ], "source": [ "print('akker vs shew1', jaccard_similarity(akker_kmers, shew1_kmers))\n", "print('akker vs shew2', jaccard_similarity(akker_kmers, shew2_kmers))\n", "print('shew1 vs shew2', jaccard_similarity(shew1_kmers, shew2_kmers))" ] }, { 
"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "akker vs shew1 0.0\n", "akker vs shew2 0.0\n", "shew1 vs shew2 0.38397187523995907\n" ] } ], "source": [ "print('akker vs shew1', jaccard_containment(akker_kmers, shew1_kmers))\n", "print('akker vs shew2', jaccard_containment(akker_kmers, shew2_kmers))\n", "print('shew1 vs shew2', jaccard_containment(shew1_kmers, shew2_kmers))" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAV0AAACpCAYAAACI/O4MAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4wLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvqOYd8AAAIABJREFUeJztnXl8lOW1x79nJpmsJCyBhH0RkUUE\nVETcABfU1qq91q1WqFot19ba1lurdQnR6m2t2r23Xpe6VHuLdanaVlFxRQUEVHZBJBCSEAIkZJ/M\nzHP/eCZAhcBMMvOuz/fzmQ9hYOY5b973/b3nOc95zhGlFAaDwWCwhoDdBhgMBoOfMKJrMBgMFmJE\n12AwGCzEiK7BYDBYiBFdg8FgsBAjugaDwWAhRnQNBoPBQozoGgwGg4UY0TUYDAYLMaJrMBgMFmJE\n12AwGCzEiK7BYDBYiBFdg8FgsBAjugaDwWAhRnQNBoPBQozoGgwGg4UY0TUYDAYLMaJrMBgMFmJE\n12AwGCzEiK7BYDBYiBFdg8FgsBAjugaDwWAhRnQNBoPBQjLsNsDgLkTIQD+sY4AClFLE7LXK0CVE\nQkARUADkAnnxPzteHedaiJ9rIAq0As1AU/zPZqAR2IFSDdYehPsQpZTdNjgaKZMAkA/0iL8K9vk5\nxN6LsuPCjKIFKYK+EHfv+1KlqtHiQ0gYEfKAQvQxFsR/LkQff8cNeCA6jrUh/tr35walaEmv5YZD\nIhIEitEi2zf+Z2EaRmoFavd5VaNUcxrGcS1GdL+AlEk+MAgYiL5I89CCmiqiaDGqBbYCFapUNaXw\n+xNCBEEf36D4qzfpm/k0oI+1EthqRNgiRLKBIcAw9PWcaZMl24FyoByldthkg2PwvehKmWQDA9AX\n5UC0h2c1u4CK+KtKlapIOgYRoeOBMhh9rKF0jJMAO4kLMFCpFO022eE9RLKAUcAIoB+pdRhSQSOw\nCVjnVwH2pejGQwbDgTFowXUSUbQgrQHKVWn3TpAImcARwGi0N+s02oENwGql8OVNmBJEioBxwGG4\nZ61mG7Aa2IhSUbuNsQpfiW48dDAGLUA5NpuTCI3oi3KtKlWtyXxQhJ7om3AU9k0rk6UG/bD5TCnS\n4u17CpEAMBJ9nvvabE13aAXWAiv9EP/1vOhKmQh6Oj02/qfTpluJEAU2AqtUqarp7D/F47SDgSPR\nYQS3EgY+BVYpRb3dxjgSkRHAZNKzGGYXEWAV8BFKtdltTLrwtOhKmQwFjgN62W1LCqkBFqlSVbXv\
nmyKMQB+rHTHpdBED1gFLlcLzHlBCiAxEn2c3e7aHIgx8hPZ8PTfj8aToSpn0A6YA/e22JY1sAhYx\nV2UBU9GZCF4lAqwAlvs27CBSCJyEXgD1C83AIpRab7chqcRTohvPRJiCXjjyNpGCKNXfa+GTrwlL\nx+UQzfDD7sJG4AOl2Gi3IZYhIsB44Fjcs0CWasqBd7wS7/WM6EqZjEFPu7LstiXt7PpSC9tnh1C5\nQQBaQ+0sHRdh80A3LA6mggrgHaXw9u4n7d1OA0rsNsUBtAHvecHrdb3oSplkATPQSeDeJpofpaI0\nTOvoA4trZd9m3j0m2ydebxvwplKU221IWhA5Eu1E+NW77Yxy4C1Uctk8TsLVoitl0hc4Hb0l19u0\nHN7G1tsDRHsePP2rMSfMm1OgId+ujQ9W8wmw2DP1H0QygFPQqWCGA9MIvOLWzRWuFV0pk7HoBaSg\n3baknV1nN1NzdTZkJubBRgJRFk8IU+6bcEM18LpSWL6dOqWI5AFnousiGA5OBHgDpT6325BkcZ3o\nSplkAifjB09ABRVVP2ihYVpulz6/YUgzS8bngLgxNzlZWoEFSlFhtyFdQqQYOANd3cuQOEtRaqnd\nRiSDq0RXyqQHcDbQ025b0k6kd4QtP40SHty9hcFdBa28MSWTtizvzwh0lbf3lWKl3YYkhchhwHT8\nMGtLDxuBBSjlihCTa0Q3LrhfQZcZ9DaR3hE2/UoR7ZWa7buNOWHmnxT0ifCCFt4VdhuRECKj0BkK\nfpiNpJPNwKtuqOHgCtH1n+DeHyPaJ7ULYf4T3g+U4hO7jTgoIqPRi2aG1FCBXmBztPA6PrXIX4Lb\nM0L5fakXXID8lhAzF0bIanP0BZlCjhfhKLuN6BTt4RrBTS2DgDPihYAci6ONkzIpwDeCWxCl/L4Y\nkaL0pXrlN2cxc2GEUNhPwjvBbiP2Q2Q4OqRgSD1DgNPjO/kciWNFNy645+Abwb0/SqRf+nNr85uz\nmPmun4R3iqOEV9e9nYGJ4aaTYegKbI7EkaIrZZKBzlf0vuCqoGLLzyJEiq3bzNCjOYvT3m9HYs4P\n6KeGKSIO2LEokoO+rs0us/QzERFHppU6UnTRsS4vlWPsnOrrWrqdFtYVejZkc+xKTxQQSZAZIjbu\nXNRxxpnonnsGa5iGiONKYDpOdKVMxuGHjQ8Au6e1sPs0+5LhR27OY3ClX5pEZgFniNiWC3sy3i6/\n6USCwExEHLXhxFGiG6+DO9VuOywh3L+d6uvsr49w/Ech8pr80hiyCDjR8lF1poL3y406kzzgVLuN\n2BfHiG68Fu7pOMimtBHLjFExN4ZyQM5sRizIjEUxAlFX7OZJAaNFLBRAXU/hBMvGMxyIAYiMtduI\nDhwhcPE+Zqfih4UzgOrrW2kf4Jy6vz2as5jysWtL5XWBE0XoY9FYJ2Nfq3vDXo5HxBHVCB0huugO\nvW5upJg4DSd2vYBNOhlWmeuj+G4GMC3eyDN96LCC/VkTBoifc7uNAAeIrpRJDrpYs/eJZcbYNse5\n6ULHrswgGPFLmKEI3SE6PZiwghNxRJjBdtFF9zTzx/SrdnbrIYuQ20l2OJMJa/3i7QJMFiFdNYeP\nwy/Xtbs4DhFbQ3u2iq6USTEwyk4bLCNc3M6uL2fbbcYhObw8x0fZDCHSMcsS6Y1f0h7dRwiYZKcB\ndnu6x9s8vnXUzGkHF/QuC6gAx670U5vzUSL0TvF3HofZ5utkxiFi26K9bSIgZTIcvySLtxzeRtOx\nzls864wB23Pos6vNbjMsQkjlw1+kBLN45nSCwDF2DW6L6MZTxPyxeAZQM8d9i1PHrvBLXQaAQSIM\nSNF3TUnR9xjSyyhEbOlAY5enOwQotGlsa2kdHqZ1lPsaRPbenU3vOr94uwBHdvsbRAbgl9mb+xFg\noh0D2yW6tqdtWMbOC9wbHx2zwS/lHwGGinS7GI1/rmtvcBgil
i9uWy668Tq5g60e1xZi2TEapzo/\nY6EzBm7LJrPdL8Ir6E06Xfy05KLruBrcQxAbamLY4en6xxuoO7sVFXJ+xkJnBFWAwzf5KcQwRqTL\n98QY7M8GMiRP1x+0XcTSi0TKxJYni23s+rJzd58lyshy9x9D4uQAw5P+lK6VOzrl1hisoAARS2fe\nVj+ZD0PXNfU+zeNaLe0GkS7yWkMUb/dTMZxxXfjMYExxcjdjqbdrtej6J7Sw63z3pYl1xpiN3jmW\nQ1MiknTXkuS9Y4OTGISIZTM6y0Q3Xtimn1Xj2YoSRdMk73j0/XZk+6ifGiSzuUF3nTWbIdxNBqQs\nT/uQWOnpDrRwLHtpGR12RIHyVBGMBei7008Laslcq8WAezNUDB0Ms2ogK0XXH/VyAZqOc29ubmcM\n3OanEEP/JHqpDU2rJQarsGy2YjzddNA0yTtebgcltd47ps4JAiUJ/l8jut4gFxFLwp+WiK6USU/8\nsroby4rRNtT9WQtfpKAh5KONEpDIzEwkB7Bl/74hLfS3YhCrPF3/hBaaj2pzRQnHZAkglGwP222G\nhSQyMytKuxUGK7HkfFolDv4JLTRO8e4q/8Aa7x7b/hSJHHKBrK8llhiswpLzaZXo+ufibDnCe15u\nB73r/BTXhUN7PsbT9RYFiKQ9NJh2gZAyyQDcU8C7u0T6eHfbbG6Ld4/twBQc4t/940z4h7Q/SK3w\nyhzRa94SYpkxYj28K0yZ0SAZvukWDAe7dnVJQH8sDvsLI7quor3E+6v7+U3ey0HunIN5uv65rv1F\n2s+rFaJ7qCmadwgP8L4gFTR6/8Gyl4PdgP4JmfmLtJ9XI7qppH2A91f3ezT5KbxwsGvXiK43SXvI\nyIQXUkl4kB9E10+txUMinZYiNfFcb5L2h6kViz6Ji+4mevEcVxCOexgjeJsLWcDHDOI1LiNKFtns\n4HIephe6xusnDORVvkGEHIQY3+VuconwTyazkrMByKKOS3iEYhp5gnOpYiKgCLGbC3iUwdSzkFEs\n4Fqy2AHAQJZxGf9I6kjb+yYuSC9/fza7t4wnmNXABU+VAfD5gkF89Og3ULFMRKKMv+wpRp61iTXP\njuKTJ64lM0/b1vvwZUwv1bat+Ms41v39YlAB+o1/l1NufRmA9+6dTuWS02lv7svZv72BnsMaAWio\nyuGtuVcSbuyNUkEGT53Pcde9l7Dd+c1Jiu7bveDyK6Ax7jWe8Tb83wL48yD48WXQlgW9d8ArD8Pw\nVqgPwoxvwOahIApu/Cv86FP92fognHEprD8CJAZXPg/3LodZp8A/p0NAQagVfvlnuLAKHhgGt12u\nP6uAK16Eez5Kzn56AAcq9uMaT7cVZAjcUgh16+F3P4cj7oevRSFjMJQvhMdzIbYOcs+D2Tugbwa0\n/xIeuwQqAdZDznkwqxoGCHAHPPYd2Pg4DLoJvhGBzABEy+Cpb8Omm2HCQ3CegApA9GaYdz1ssPlX\nkQieEN3Ex8ggxnT+xiQ2U0cWf+RWVrGGV5nFCTzNCaznBU7gJWZyOS/QToCXuYozeYQJVFBDHiGi\ntBNgGRdzNXMpppFHuYBXmcE3eJFzmE8vXgDgb5zKy5zD1TwJQCEb+B6/6/KRxnISF6Rh098jlP8G\nyx+5Ys97K566gJFnvcj4r69ixZNHsuqvFzDyrPsAyO27ga/877/bFg0L656/lBN//Cv6HLGLl675\nCZvf/ZghJ1VRMukzRpy+gnfuvuHfPrP8oenkFFVxzgO/Z9fn+cy/4U4mXrmIUF5isdpgNEnRzYrB\n3L/BFZuhPAsm3gpPr4EfzYL/ehpuWA/fOgHmzIRXXoBrT9afq70DlveAs74H378bMhVc/CXo2QC7\nboN2gbVxb7NsMTz+tv75J0fBTRfChb+B8yrh8rsgNwaLCmHGbTD3E/33hOns+nVNh+fL4bQSqGqB\nnHaQn8IVT8L950LNWXDud
TD1YVg4B84eAVvWwv88ByU/gEsvgV8CXAQXHw+rHoEH6iFYAyGAuXDB\nNfDiXFhVCkfeBRd8G+67Ftb+FD4OAk/CwOvgmuuh1MZfQ6IEEMlCqbRV1bMivJB4Qv0g6pnEZgB6\n0kYeVeygJy3043jWA3AUa6jkaADeZSw9qGACFQD0o4kMFDEEEJoJEQPaySafOoA9HjJAOyG0C5Qa\nVDBxQRp9/npyi5r2ez/cpG/mcGMOmXl1B/2Oz14dTqjHdvofXUsoL0rRmCVsenMCACNO20LJxB37\nf0gU0bZsVAxa67IIZjaRkZW4CAWS/XVNqdeCCzC0DfpVwac9YWc/+L4+p8xaAx/qc8rG/jBlrf55\nUgPkNMMj8aIyC0+ER/+lf85UMF577wzfp7NF4z7hgJLwXoGt76qD0dk94oqNIu9AzyUw/jJ4F2A1\n5AUhci7UAJwJq99C308V0P8MWAvwVaiugz5LocdnkLMRRj0Y/45CiB4OLR1j1MUfQLsgpyf6PhsM\nbcG9/57lsphUWs+tFZ5u14R9I31oYDDj+ZwlVPE6EzmDj1jMMbTRG4BaigH4FdfTTj6DWcIlzCeL\nKJN5kicoJUgbOdRwBU/t+e7HOZ8KjidIC7O5b8/79YzgHm4ji3pO42mOpCo5ozO6d21NuvKvfPDL\n69k4/2sohGm3/XzPvzXXjuDZy24jM7eeCbOeZsjJVTRW9ySrx849/yen9y7qPh9x0DGOueYNXrvp\nOzx90T3EItmM/dqDBDISV9JArBsP6tf6QOVg+Prn8IcquGUi/Owj+O0x0KDPKUdUwNsToHkJvNsL\nqobC+l6wfpv+90vOg3WjoGg7PP4XOKZBv3/ZdPjH6RDNgL/cv3fMPw6HW2bD7t7wg0eS9HKh8xvQ\nFTsPvw0X3wbP7IzX/D0SGmMQfAiGfgvKn4Fj6tCdMoZCxXNw9A9gwwMwbDf0+Qh6ZUIsFxomwzcr\nYNAQKH8B/joAwr+Av34Trn8YvqZAXoA91+yNMPEh+GozFNwHv7XpV9AV0vqMsOLCSX6MerJ4hjkc\nyzx60cqXeIxVTOMX3EKYbASdmhUjQB0juZSH+U9+wVYm8Q6jaSPIaqZxKT/lZm6kgK3Mi8d3AWbx\nPD/hJoawiNeYAcBYNvM9buZG7uQoFvAS1yZttwp2z2te+/w0Rp49jwufvomRM+ex6NezARh84ma+\n8r838x9P3smw6QtY/Pvkbetg3YvjyOtbwYXzbuSUW+5k3QuX0rgt8SLcorp4jFuy4LI5MGee9kx/\n9xjMmwbFt0BTNgTi6XZ/WAh96mDYLXDdxTD4M8hQ0BqExl4w+TPYdheM3QhXXbj3+598E+puhW8+\nC7d9ae/7cz6HHXPh/+6GJ86G2mQdjc5uQMc7b7fC+EJouIr47BH9BPlveLAMLuoPN+dCayA+23sQ\nXm6EnH5w2//AqcWwJRNiYQhugyHXwFs18NNsCF8BZwHcD9O+DfMa4KarYd5VMLtjrHvgo51Qejf8\n4V44z/JfQNdxvegm51m0EeRPzGEoi5jJcgDGUM33+TU/4i4ms5gctgNQwC568SnFNJJPmBJWUskQ\nVsarmh3OdgLAOD5kO4ftN9ZUFrM1HqroRSuF8QWTGawkRpBt5CdluyQb7/wCO9dPZeI39TFPumop\nLTuHAZBf3EpukbZt/GUrUbEgdZvyyS+po63DQwRadvYiq3DXQcfYuugEBp2wDAnAgMnbCeXXUr08\n0dqxoKQLx9gQhGlz4JRF8At9fHy1Gjb+WgvonMXQS59TcmPw9jyouRPW/QFacuGYbTC2ETLC8LP4\n569bClsOUHj6/iWweuL+719QDVlt8HyyxZc6e8g4PlPlAxi5AiYUwN1z4epNcMQEuPI7sHEL/KIK\n/vsUWN8XtgEMh9Zl8FgN3LkUHmmC/JOhdgLsyoddc+BzgAtg6cZ40e8PYerP0ffpvbB06wE
6MPwQ\n1u+EohUkeT/ZR1rPrbNENwb8iVkUUMVFvLbn/ap4BkQU4Q2+zCjeAmAyq2lgII2EaCfAdkbRlyr6\nUUcz/amOn+QNjKEgHipYt0+ftg+ZQB7VAGylYI+lHzIMEPrSmNyhRrp3sjKy61n7/CgA1jwzmlB+\nDQA71heg4sZteHkYKKFwSCMjTt9EuKEfVcv7EG4KUrtmMsOmfXzQMbJ67KR6ue5+unNDD9oaiika\nXZuwjbFAktPzKHDyLBhYBU/vPacsj2e1tAvc/mX4sj6nVIegMl505K4xEIjqTIQgMPoTuF//fvjz\naCip1D+/tE/x6bnjoZf+vTG/DzTHr/EFvWF7CUw5QJz7kAdwIByfr/waPNcIP94NP5kLDw6DdR/D\nI8vjGUW7IOMhOHMW+n5aDzn18XDKt+CkEbB+OLROht2FsOt53ZqI+TBmEPp+yof6+2EUwD0wulc8\nVvwP6Nvxi/sTDIlCxliSvZ9sI62ia0VMN/EdTIsZSTXHk8tW7uE2ACbzHLX047N4GKA/yzgHneJU\nRDNjeY3f8xNAUcxKTmUFAGN5icf4EUKUbHZwCY8C8Cr/wfMUIyiy2cH58cyFDziaDUxHiBKgndN4\nMOlHkkQTP1n//O63aKweRTScz9MX/pwhJ7/AkZc+weqnL2bNMwEk2M5Rs54AYN3fj6Zq2XREokiw\nnQmzH0QCkJEV44hz/8LCn38flNB33EKGnKwfLgvvOZXKJWcSaStg/n/dTuHgFZz5yyc4+up/8N69\n3+SZy24HJQw/9dk96WSJEEvW0f3dSPj4eCjaCv30OeU/n4N1/eAVfU45ehk8EE9bW9kDLrpep4sV\n1MFjj+z9rl8/A7OvhHsvhrwGePQx/f7PZsDsMRCMQk4T/OZP+v3nRsKlZ+v3JQbXPbV38S3xI+7k\nfdfuPvwhzPwEjlIgZ8BbP4Z1APOh/61whYAqhsrn4fGOz9wFf5kDV10NGX1g+9/hMYDb4Ym74eJ7\nIJAJ7XfDEwAPwdGzYGoQopkQvgsedMXKoyat51ZUV0N0iQ5QJhcAfdI6iFPYckczzZNck7/ZJWp6\nt/D6Ca5Jl0oBzyjF/t6xyMnAGOvNMaSZKEo9nM4BrAgvuGVK0X0ytzk+ztdtmnK9f4z/TkMn7zdb\naoXBKtJ+Xq0Q3d0WjOEMQpWOX9HuNg2+Et02peisRZERXW/iCdHtzFPwHqEKH4huvotCc93mYNeu\nEV1v4gnR9Y+nm1npfUHaneeKTQEp4mDXrhFdb2JE11WEqjMh5u3pd0Oedztj7M/BPF3/zOD8RdrP\nqwkvpBKJCsEG7xb5DmdEiXqwvXzndO4wKNWKnxaJ/UPiOetdJO03kCpVUfw0FcvY4dr8zUPSku3d\nYzswh3IY0n6DGizH/aIbp8aicewnZ613Pd0dPf0kugri280751D/bnAX9SjVWbZKyrBKdCssGsd+\n8hZ7dzFta7GfQgu1Sh2wePm+GNH1FpacT6tuoq0WjWM/uZ+EwINtymMoqvt21rrGiyRyzZrwgrew\n5HxaIrqqVNXjl0WHQHuArM/TPkWxnN092oj4ahHt0LMzvZh28KpuBjdRacUgVt5EPgoxLPdeXLeq\nr/eOqXMiEK8+d2jK02mIwTKaUMo7nm4c/4QY8hd7L5d1a7F3Y9X7U6VUwqUbN6XTEINlWPbwNKKb\nDrI/DSGt3vEMI4Eotb1MPPdAKFXDPv3CDK7Fe6KrSlUriU/Z3I0oIW952rqJWk5NnzZUwPt1Jfay\n+dD/5d8wIQZ3E8GieC5Y31xvjcXj2Ufv57wTYlhzmJ8W0CqV4uBdmPdnUzoMMVjGZpSybGZq9c20\nEfZpge5lctaEyKx0v7fbmBOmpijxxpXuZ3UXPrMFv2TneJO1Vg5mqejGtwRbeoC20usl98d11w/z\n0y60Zrriter2K/6ZxXmLepSyNLPKjmnjGlzQSTUlFM7
PRtrcK7zRQIwNQ/3k5a5JImvhi6zFBc0q\nDfvRlZlNt7BcdFWpakBPx7xPoC1Aj/fcG2LYUtLqow0RMbrjrSrVQrxFucE1RIg35bQSu26oVTaN\naz29n3Hvgtqaw9xre/JsUqrb1fD8c117gw1WFLj5IraIripVW/DL9sms8hDZq92Xx1nbs4W6wpDd\nZljIym5/g1LVWJh6ZOgWMeAjOwa2c+q42MaxraX4j0FXdZRQKJaM99MOtHKlUpZD7p/r2t2sRSlb\nutrYJrqqVJUDVXaNbynZn4fIX+web3drsZ+83BiwKGXfpneobUrZ9xnSQQRYZtfgdi+SfGDz+NbR\n74EQtDt/dTsqMZaOy7TbDAtZ24XNEIdiMX7J0HEnK1HKtm42toquKlXb8UvebmZtBr3/7nxv99Ph\nrTTn+kV0W4EPU/6tStUBn6b8ew2poA2bYrkd2O3pgp7a+WOXWtGfcwnucG6t3easMJ8c4ae83MVK\npe3aW4xfrmt3sciOjIV9sV10Valqwy+LDxIVSn7v3M0SS46KEgvafk1YxDal0jjL0nm776Xt+w1d\noQKlbJ9ZO+IGU6VqLX6p1JS/JIeC15zXHXnjoCYqi3PsNsMiIsDbaR9FqQ2YRTWnEMaKc54AjhDd\nOG8AtqRwWE7J73IIbXHOTrX6/FYWH5VrtxkW8rZSluWJv4MJMziBD1DKEUWJHCO6qlSFgdcA506/\nU4VEhUGlAUcUOm8PRnnzuKCP6uWuUooNlo2mwwwLLRvPcCC2OCGs0IFjRBdAlapa/HKBZm7PpP/9\n9i+qvT8p7KNshRrgfctHVeozbCisYgCgAT2LdgyOEl3YE9+1vAiFLfR4P4fCf9kX3103rImtJX6J\n47YCr3ajilh3eQ+zRdhqIsAr8a7NjsFxohvnXWCH3UZYQvEDOWRttP6i2FHYyrJxfonjKuB1pWiy\nzwIVQ4fPGmyzwX+8gVI77TbiizhSdOPFzl/BDwtrEhUG35xJ5lbrFtbq81tZMDUE4pc47ntKOaAx\nqva4XgHa7TbFByxFKUeW2nSk6AKoUtUIvIgfhDfYHGTof2VY0t6nPr+NV0/M9FGd3IVKOajkova8\nXscUPE8nG1Bqqd1GdIajbzxVqprwjfA2Bhl6QwaZ1ekT3t15bbx6YgbtmX6pIPauowS3A6U2Awsw\n9RnSwefAm3YbcTAcLbrgQ+EdckMGGdtSn9XQkNvG/JP8JrjOzRhQaiNaHIzwpo5y4PV4/NyxiO6p\n53ykTPKAc4BCu21JO5GeEcrvixHpl5ryih2CGw75RXDfUcoljSJFRgCn4gIHyOFsBBY4XXDBRaIL\nIGWSC5wN9LHblrQTKYhScUc7bYd1rwBNbc8W3jou5BPBjaE9XMckwieEyDC08PqpPVIq+RR4C5eI\nmatEF0DKJAicCIy225a0o4KKbdc2Uz8zL/nPolg3vJnlY3N9kqXQDLyWwg4Q1iJSBMwE8u02xUUo\nYDFKfWy3IcngOtHtQMpkFHASfvAO6mc0s+272ahQYlPQSDDKe5PCPtr4sBWdh+uoJPikEckBzgBK\n7DbFBYTR4YTNdhuSLK4VXQApk97oi9T7cd62wWEq7oBI0cHjvLvz2nhjSsAnW3sVuu3KMqU8siAl\nEkA7E96fyXWdevROs1R3/LAEV4sugJRJJjANGGG3LWknlh1j661tNE84sAe7uaSZ9ydl+6Qmbiuw\nQCkq7DYkLYiMBqYCfnh4JsNnwLso5ZwqfUnietHtQMrkcOB4wPtT6voZzdRcEyKWr0MrLVntLD4q\n4qN6uJvQC2bOq0ucSkTy0Q7FQLtNcQAtwDsotcluQ7qLZ0QXQMokBBwLjAO8vXgUzY2y7TvNLJkd\n4OPROT7xbnejd5htsdsQSxEZg3Yo/Or1fgYsdFrhmq7iKdHtIB7rnQIMttuWNLIBWMxcFUJPQ73s\nDYXRzQRXKOWDess
HQnu9JwDDbLbESnaji49vstuQVOJJ0e1AymQA2kMostuWFFIJLIp3Ut6DCEOA\n44DetliVHmLAKmC56zMTUoVHFKmlAAAETElEQVRIMfo897fblDTSgl4gXeOGzQ7J4mnR7SAuvmPR\nXoIbp+ERdAL4alV68FJ1IgwAjgSG4t4QSwu6pvJqpXBEixXHIeLFh2w78DGwAqU8W4nNF6LbgZRJ\nDjoVZwzuSEKvR3t6n8bbGSWMCPnoB81owC1t1SvRHRY22Vhs3D2ICNqRGAcMsNeYbtEErAFWeyVu\nezB8JbodSJkIOt47FhiEs7zfdqACWKNKVbfToUTIAA5DP2j6dff70kAb2otfoxSuzLt0BCI90eJ7\nOJCamh3pZyvaqSh3yxbeVOBL0d2XeJ5vf/RC1ECsn64poBYttFuAGlWanjiWCNnoh0zHy47OEQrY\njvZqtwLVvl0cSwciGcBIdN76AJzlUADUoVP+1qFUvc222ILvRfeLxEMQA9ACXAL0AFJZLCaMXpXt\nENqtqtSeRG8R+qDFdzD6YZOuMMRO9opslVLY35DTD4hkos/tUGAIkGWDFQqoRpdd3IRS3i/RegiM\n6CZAvLpZjwO8QmhPIoBetFLoFvIdrwa0wO552SWwiSBCCL2luhAoiL8K0ceagX747PsAUvFXGH2s\njfE/G/b9u1KmPY3t6PhvX3QmT1H8516k3hNuRDsUtegZTY2bd4+lAyO6hqQRQTxT68DPiATRM5xC\ndKjpi68M9joVCp3CF0VvwW7+wqsR2IFSLdYehPswomswGAwW4rQgu8FgMHgaI7oGg8FgIUZ0DQaD\nwUKM6BoMBoOFGNE1GAwGCzGi2w1EZJOItIhIo4jsEpF/iIiXy0kiIl8XkQ/jx1wlIv8SkZPststg\ncAtGdLvPV5RS+eitxNuA39psT9oQkR8CvwLuBorRu5z+AJxnp12G9CIib8adCjt2tHkOI7opQunq\nSH9DF9HxHCJSCNwBfEcp9axSqkkp1a6UelEp9SO77TOkBxEZBpyM3hxxrq3GeAQjuilCRHKBi4EP\n7LYlTUxF12Z4zm5DDJYyC31NPwrMttcUb5BhtwEe4HkRiQB56L3mZ9psT7roA9QqpSJ2G2KwlFnA\n/cAi4AMRKVZKbbPZJldjPN3uc75SqifaC/wu8JaIlNhsUzrYARSJLh1o8AHxBdKhwDyl1FJ0g8iv\n22uV+zGimyKUUlGl1LPogiBeXM1/H11w/Hy7DTFYxmxgvlKqNv73pzAhhm5jvJYUIbp03rnocnlr\nbDYn5Sil6kXkduD38XDKfHSXi9OBGUqpG2010JBSRCQHuAgIikh1/O0soKeITFBKfWyfde7GiG73\neVFEoujV3XJgtlJqlc02pQWl1H3xG/BW4El0zdylwF22GmZIB+ejZ23j4d+Kzs9Dx3lvsMMoL2BK\nOxoMhv0QkZeBVUqpG77w/kXAb4BBZlG1axjRNRgMBgsxC2kGg8FgIUZ0DQaDwUKM6BoMBoOFGNE1\nGAwGCzGiazAYDBZiRNdgMBgsxIiuwWAwWIgRXYPBYLAQI7oGg8FgIf8PqyzoVXvFWZQAAAAASUVO\nRK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "venn3([set(akker_kmers), set(shew1_kmers), set(shew2_kmers)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Let's hash!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Choose a hash function!\n", "\n", "We need to pick a hash function that takes DNA k-mers and converts them into numbers.\n", "\n", "Both the [mash](https://mash.readthedocs.io/en/latest/) software for MinHash, and the [sourmash](https://sourmash.readthedocs.io) software for modulo and MinHash, use MurmurHash:\n", "\n", "https://en.wikipedia.org/wiki/MurmurHash\n", "\n", "this is implemented in the 'mmh3' library in Python.\n", "\n", "The other thing we need to do here is take into account the fact that DNA is double stranded, and so\n", "\n", "```\n", "hash_kmer('ATGG')\n", "```\n", "should be equivalent to\n", "```\n", "hash_kmer('CCAT')\n", "```\n", "Following mash's lead, for every input k-mer we will choose a *canonical* k-mer that is the lesser of the k-mer and its reverse complement." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "import mmh3\n", "\n", "def hash_kmer(kmer):\n", "    # calculate the reverse complement\n", "    rc_kmer = screed.rc(kmer)\n", "\n", "    # choose the lexicographically lesser of the k-mer and its reverse complement\n", "    if kmer < rc_kmer:\n", "        canonical_kmer = kmer\n", "    else:\n", "        canonical_kmer = rc_kmer\n", "\n", "    # calculate murmurhash using a hash seed of 42\n", "    hash_value = mmh3.hash64(canonical_kmer, 42)[0]\n", "\n", "    # hash64 returns a signed integer; shift negatives into the range 0 .. 2**64 - 1\n", "    if hash_value < 0:\n", "        hash_value += 2**64\n", "\n", "    return hash_value" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is now a function that we can use to turn any DNA \"word\" into a number:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "13663093258475204077" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hash_kmer('ATGGC')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same input word always returns the same number:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "13663093258475204077" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hash_kmer('ATGGC')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "as does its reverse complement:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "13663093258475204077" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hash_kmer('GCCAT')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and nearby words return very different numbers:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1777382721305265773" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"hash_kmer('GCCAA')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Note that hashing collections of k-mers doesn't change Jaccard calculations:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "def hash_kmers(kmers):\n", "    hashes = []\n", "    for kmer in kmers:\n", "        hashes.append(hash_kmer(kmer))\n", "    return hashes" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "shew1_hashes = hash_kmers(shew1_kmers)\n", "shew2_hashes = hash_kmers(shew2_kmers)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.23675152210020398\n" ] } ], "source": [ "print(jaccard_similarity(shew1_kmers, shew2_kmers))" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.2371520123045373\n" ] } ], "source": [ "print(jaccard_similarity(shew1_hashes, shew2_hashes))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(OK, it changes it a little, because a k-mer and its reverse complement now map to the same hash!)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Implementing subsampling with modulo hashing\n", "\n", "We are now ready to implement k-mer subsampling with modulo hashing.\n", "\n", "We need to pick a sampling rate, and know the maximum possible hash value.\n", "\n", "For a sampling rate, let's start with 1000, which keeps roughly 1 in every 1000 k-mers.\n", "\n", "The MurmurHash function turns k-mers into numbers between 0 and `2**64 - 1` (the maximum 64-bit number).\n", "\n", "Let's define these as variables:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "scaled = 1000\n", "MAX_HASH = 2**64" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, choose the range of hash values that we'll keep." 
] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.844674407370955e+16\n" ] } ], "source": [ "keep_below = MAX_HASH / scaled\n", "print(keep_below)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and write a filter function:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "def subsample_modulo(kmers):\n", "    keep = []\n", "    for kmer in kmers:\n", "        # keep only k-mers whose hash falls below the threshold\n", "        if hash_kmer(kmer) < keep_below:\n", "            keep.append(kmer)\n", "        # otherwise, discard\n", "\n", "    return keep" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Now let's apply this to our big collections of k-mers!" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "akker_sub = subsample_modulo(akker_kmers)\n", "shew1_sub = subsample_modulo(shew1_kmers)\n", "shew2_sub = subsample_modulo(shew2_kmers)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "499970 502\n", "499970 513\n", "499970 503\n" ] } ], "source": [ "print(len(akker_kmers), len(akker_sub))\n", "print(len(shew1_kmers), len(shew1_sub))\n", "print(len(shew2_kmers), len(shew2_sub))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we go from ~500,000 k-mers to ~500 k-mers! Do the Jaccard calculations change?" 
] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "akker vs akker, total 1.0\n", "akker vs akker, sub 1.0\n" ] } ], "source": [ "print('akker vs akker, total', jaccard_similarity(akker_kmers, akker_kmers))\n", "print('akker vs akker, sub', jaccard_similarity(akker_sub, akker_sub))" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "akker vs shew1, total 0.0\n", "akker vs shew1, sub 0.0\n" ] } ], "source": [ "print('akker vs shew1, total', jaccard_similarity(akker_kmers, shew1_kmers))\n", "print('akker vs shew1, sub', jaccard_similarity(akker_sub, shew1_sub))" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "shew1 vs shew2, total 0.23675152210020398\n", "shew1 vs shew2, sub 0.2281795511221945\n" ] } ], "source": [ "print('shew1 vs shew2, total', jaccard_similarity(shew1_kmers, shew2_kmers))\n", "print('shew1 vs shew2, sub', jaccard_similarity(shew1_sub, shew2_sub))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that the numbers differ, but not by much (about 0.237 vs. 0.228 for shew1 vs shew2): the Jaccard similarity is being *estimated*, so it is not exact, but it is close." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's visualize --" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAV0AAACpCAYAAACI/O4MAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4wLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvqOYd8AAAIABJREFUeJztnXl8lOW1x79nJpmsJCyBhH0RkUUE\nVETcABfU1qq91q1WqFot19ba1lurdQnR6m2t2r23Xpe6VHuLdanaVlFxRQUEVHZBJBCSEAIkZJ/M\nzHP/eCZAhcBMMvOuz/fzmQ9hYOY5b973/b3nOc95zhGlFAaDwWCwhoDdBhgMBoOfMKJrMBgMFmJE\n12AwGCzEiK7BYDBYiBFdg8FgsBAjugaDwWAhRnQNBoPBQozoGgwGg4UY0TUYDAYLMaJrMBgMFmJE\n12AwGCzEiK7BYDBYiBFdg8FgsBAjugaDwWAhRnQNBoPBQozoGgwGg4UY0TUYDAYLMaJrMBgMFmJE\n12AwGCzEiK7BYDBYiBFdg8FgsBAjugaDwWAhRnQNBoPBQjLsNsDgLkTIQD+sY4AClFLE7LXK0CVE\nQkARUADkAnnxPzteHedaiJ9rIAq0As1AU/zPZqAR2IFSDdYehPsQpZTdNjgaKZMAkA/0iL8K9vk5\nxN6LsuPCjKIFKYK+EHfv+1KlqtHiQ0gYEfKAQvQxFsR/LkQff8cNeCA6jrUh/tr35walaEmv5YZD\nIhIEitEi2zf+Z2EaRmoFavd5VaNUcxrGcS1GdL+AlEk+MAgYiL5I89CCmiqiaDGqBbYCFapUNaXw\n+xNCBEEf36D4qzfpm/k0oI+1EthqRNgiRLKBIcAw9PWcaZMl24FyoByldthkg2PwvehKmWQDA9AX\n5UC0h2c1u4CK+KtKlapIOgYRoeOBMhh9rKF0jJMAO4kLMFCpFO022eE9RLKAUcAIoB+pdRhSQSOw\nCVjnVwH2pejGQwbDgTFowXUSUbQgrQHKVWn3TpAImcARwGi0N+s02oENwGql8OVNmBJEioBxwGG4\nZ61mG7Aa2IhSUbuNsQpfiW48dDAGLUA5NpuTCI3oi3KtKlWtyXxQhJ7om3AU9k0rk6UG/bD5TCnS\n4u17CpEAMBJ9nvvabE13aAXWAiv9EP/1vOhKmQh6Oj02/qfTpluJEAU2AqtUqarp7D/F47SDgSPR\nYQS3EgY+BVYpRb3dxjgSkRHAZNKzGGYXEWAV8BFKtdltTLrwtOhKmQwFjgN62W1LCqkBFqlSVbXv\nmyKMQB+rHTHpdBED1gFLlcLzHlBCiAxEn2c3e7aHIgx8hPZ8PTfj8aToSpn0A6YA/e22JY1sAhYx\nV2UBU9GZCF4lAqwAlvs27CBSCJyEXgD1C83AIpRab7chqcRTohvPRJiCXjjyNpGCKNXfa+GTrwlL\nx+UQzfDD7sJG4AOl2Gi3IZYhIsB44Fjcs0CWasqBd7wS7/WM6EqZjEFPu7LstiXt7PpSC9tnh1C5\nQQBaQ+0sHRdh80A3LA6mggrgHaXw9u4n7d1OA0rsNsUBtAHvecHrdb3oSplkATPQSeDeJpofpaI0\nTOvoA4trZd9m3j0m2ydebxvwplKU221IWhA5Eu1E+NW77Yxy4C1Uctk8TsLVoi
tl0hc4Hb0l19u0\nHN7G1tsDRHsePP2rMSfMm1OgId+ujQ9W8wmw2DP1H0QygFPQqWCGA9MIvOLWzRWuFV0pk7HoBaSg\n3baknV1nN1NzdTZkJubBRgJRFk8IU+6bcEM18LpSWL6dOqWI5AFnousiGA5OBHgDpT6325BkcZ3o\nSplkAifjB09ABRVVP2ihYVpulz6/YUgzS8bngLgxNzlZWoEFSlFhtyFdQqQYOANd3cuQOEtRaqnd\nRiSDq0RXyqQHcDbQ025b0k6kd4QtP40SHty9hcFdBa28MSWTtizvzwh0lbf3lWKl3YYkhchhwHT8\nMGtLDxuBBSjlihCTa0Q3LrhfQZcZ9DaR3hE2/UoR7ZWa7buNOWHmnxT0ifCCFt4VdhuRECKj0BkK\nfpiNpJPNwKtuqOHgCtH1n+DeHyPaJ7ULYf4T3g+U4hO7jTgoIqPRi2aG1FCBXmBztPA6PrXIX4Lb\nM0L5fakXXID8lhAzF0bIanP0BZlCjhfhKLuN6BTt4RrBTS2DgDPihYAci6ONkzIpwDeCWxCl/L4Y\nkaL0pXrlN2cxc2GEUNhPwjvBbiP2Q2Q4OqRgSD1DgNPjO/kciWNFNy645+Abwb0/SqRf+nNr85uz\nmPmun4R3iqOEV9e9nYGJ4aaTYegKbI7EkaIrZZKBzlf0vuCqoGLLzyJEiq3bzNCjOYvT3m9HYs4P\n6KeGKSIO2LEokoO+rs0us/QzERFHppU6UnTRsS4vlWPsnOrrWrqdFtYVejZkc+xKTxQQSZAZIjbu\nXNRxxpnonnsGa5iGiONKYDpOdKVMxuGHjQ8Au6e1sPs0+5LhR27OY3ClX5pEZgFniNiWC3sy3i6/\n6USCwExEHLXhxFGiG6+DO9VuOywh3L+d6uvsr49w/Ech8pr80hiyCDjR8lF1poL3y406kzzgVLuN\n2BfHiG68Fu7pOMimtBHLjFExN4ZyQM5sRizIjEUxAlFX7OZJAaNFLBRAXU/hBMvGMxyIAYiMtduI\nDhwhcPE+Zqfih4UzgOrrW2kf4Jy6vz2as5jysWtL5XWBE0XoY9FYJ2Nfq3vDXo5HxBHVCB0huugO\nvW5upJg4DSd2vYBNOhlWmeuj+G4GMC3eyDN96LCC/VkTBoifc7uNAAeIrpRJDrpYs/eJZcbYNse5\n6ULHrswgGPFLmKEI3SE6PZiwghNxRJjBdtFF9zTzx/SrdnbrIYuQ20l2OJMJa/3i7QJMFiFdNYeP\nwy/Xtbs4DhFbQ3u2iq6USTEwyk4bLCNc3M6uL2fbbcYhObw8x0fZDCHSMcsS6Y1f0h7dRwiYZKcB\ndnu6x9s8vnXUzGkHF/QuC6gAx670U5vzUSL0TvF3HofZ5utkxiFi26K9bSIgZTIcvySLtxzeRtOx\nzls864wB23Pos6vNbjMsQkjlw1+kBLN45nSCwDF2DW6L6MZTxPyxeAZQM8d9i1PHrvBLXQaAQSIM\nSNF3TUnR9xjSyyhEbOlAY5enOwQotGlsa2kdHqZ1lPsaRPbenU3vOr94uwBHdvsbRAbgl9mb+xFg\noh0D2yW6tqdtWMbOC9wbHx2zwS/lHwGGinS7GI1/rmtvcBgili9uWy668Tq5g60e1xZi2TEapzo/\nY6EzBm7LJrPdL8Ir6E06Xfy05KLruBrcQxAbamLY4en6xxuoO7sVFXJ+xkJnBFWAwzf5KcQwRqTL\n98QY7M8GMiRP1x+0XcTSi0TKxJYni23s+rJzd58lyshy9x9D4uQAw5P+lK6VOzrl1hisoAARS2fe\nVj+ZD0PXNfU+zeNaLe0GkS7yWkMUb/dTMZxxXfjMYExxcjdjqbdrtej6J7Sw63z3pYl1xpiN3jmW\nQ1MiknTXkuS9Y4OTGISIZTM6y0Q3Xtimn1Xj2YoSRdMk73j0/XZk+6ifGiSzuUF3nTWbIdxNBqQs\nT/uQWOnpDrRwLHtpGR12RIHyVBGMBei700
8Laslcq8WAezNUDB0Ms2ogK0XXH/VyAZqOc29ubmcM\n3OanEEP/JHqpDU2rJQarsGy2YjzddNA0yTtebgcltd47ps4JAiUJ/l8jut4gFxFLwp+WiK6USU/8\nsroby4rRNtT9WQtfpKAh5KONEpDIzEwkB7Bl/74hLfS3YhCrPF3/hBaaj2pzRQnHZAkglGwP222G\nhSQyMytKuxUGK7HkfFolDv4JLTRO8e4q/8Aa7x7b/hSJHHKBrK8llhiswpLzaZXo+ufibDnCe15u\nB73r/BTXhUN7PsbT9RYFiKQ9NJh2gZAyyQDcU8C7u0T6eHfbbG6Ld4/twBQc4t/940z4h7Q/SK3w\nyhzRa94SYpkxYj28K0yZ0SAZvukWDAe7dnVJQH8sDvsLI7quor3E+6v7+U3ey0HunIN5uv65rv1F\n2s+rFaJ7qCmadwgP8L4gFTR6/8Gyl4PdgP4JmfmLtJ9XI7qppH2A91f3ezT5KbxwsGvXiK43SXvI\nyIQXUkl4kB9E10+txUMinZYiNfFcb5L2h6kViz6Ji+4mevEcVxCOexgjeJsLWcDHDOI1LiNKFtns\n4HIephe6xusnDORVvkGEHIQY3+VuconwTyazkrMByKKOS3iEYhp5gnOpYiKgCLGbC3iUwdSzkFEs\n4Fqy2AHAQJZxGf9I6kjb+yYuSC9/fza7t4wnmNXABU+VAfD5gkF89Og3ULFMRKKMv+wpRp61iTXP\njuKTJ64lM0/b1vvwZUwv1bat+Ms41v39YlAB+o1/l1NufRmA9+6dTuWS02lv7svZv72BnsMaAWio\nyuGtuVcSbuyNUkEGT53Pcde9l7Dd+c1Jiu7bveDyK6Ax7jWe8Tb83wL48yD48WXQlgW9d8ArD8Pw\nVqgPwoxvwOahIApu/Cv86FP92fognHEprD8CJAZXPg/3LodZp8A/p0NAQagVfvlnuLAKHhgGt12u\nP6uAK16Eez5Kzn56AAcq9uMaT7cVZAjcUgh16+F3P4cj7oevRSFjMJQvhMdzIbYOcs+D2Tugbwa0\n/xIeuwQqAdZDznkwqxoGCHAHPPYd2Pg4DLoJvhGBzABEy+Cpb8Omm2HCQ3CegApA9GaYdz1ssPlX\nkQieEN3Ex8ggxnT+xiQ2U0cWf+RWVrGGV5nFCTzNCaznBU7gJWZyOS/QToCXuYozeYQJVFBDHiGi\ntBNgGRdzNXMpppFHuYBXmcE3eJFzmE8vXgDgb5zKy5zD1TwJQCEb+B6/6/KRxnISF6Rh098jlP8G\nyx+5Ys97K566gJFnvcj4r69ixZNHsuqvFzDyrPsAyO27ga/877/bFg0L656/lBN//Cv6HLGLl675\nCZvf/ZghJ1VRMukzRpy+gnfuvuHfPrP8oenkFFVxzgO/Z9fn+cy/4U4mXrmIUF5isdpgNEnRzYrB\n3L/BFZuhPAsm3gpPr4EfzYL/ehpuWA/fOgHmzIRXXoBrT9afq70DlveAs74H378bMhVc/CXo2QC7\nboN2gbVxb7NsMTz+tv75J0fBTRfChb+B8yrh8rsgNwaLCmHGbTD3E/33hOns+nVNh+fL4bQSqGqB\nnHaQn8IVT8L950LNWXDudTD1YVg4B84eAVvWwv88ByU/gEsvgV8CXAQXHw+rHoEH6iFYAyGAuXDB\nNfDiXFhVCkfeBRd8G+67Ftb+FD4OAk/CwOvgmuuh1MZfQ6IEEMlCqbRV1bMivJB4Qv0g6pnEZgB6\n0kYeVeygJy3043jWA3AUa6jkaADeZSw9qGACFQD0o4kMFDEEEJoJEQPaySafOoA9HjJAOyG0C5Qa\nVDBxQRp9/npyi5r2ez/cpG/mcGMOmXl1B/2Oz14dTqjHdvofXUsoL0rRmCVsenMCACNO20LJxB37\nf0gU0bZsVAxa67IIZjaRkZW4CAWS/XVNqdeCCzC0DfpVwac9YWc/+L4+p8xaAx/qc8rG/jBlrf55\nUgPkNM
Mj8aIyC0+ER/+lf85UMF577wzfp7NF4z7hgJLwXoGt76qD0dk94oqNIu9AzyUw/jJ4F2A1\n5AUhci7UAJwJq99C308V0P8MWAvwVaiugz5LocdnkLMRRj0Y/45CiB4OLR1j1MUfQLsgpyf6PhsM\nbcG9/57lsphUWs+tFZ5u14R9I31oYDDj+ZwlVPE6EzmDj1jMMbTRG4BaigH4FdfTTj6DWcIlzCeL\nKJN5kicoJUgbOdRwBU/t+e7HOZ8KjidIC7O5b8/79YzgHm4ji3pO42mOpCo5ozO6d21NuvKvfPDL\n69k4/2sohGm3/XzPvzXXjuDZy24jM7eeCbOeZsjJVTRW9ySrx849/yen9y7qPh9x0DGOueYNXrvp\nOzx90T3EItmM/dqDBDISV9JArBsP6tf6QOVg+Prn8IcquGUi/Owj+O0x0KDPKUdUwNsToHkJvNsL\nqobC+l6wfpv+90vOg3WjoGg7PP4XOKZBv3/ZdPjH6RDNgL/cv3fMPw6HW2bD7t7wg0eS9HKh8xvQ\nFTsPvw0X3wbP7IzX/D0SGmMQfAiGfgvKn4Fj6tCdMoZCxXNw9A9gwwMwbDf0+Qh6ZUIsFxomwzcr\nYNAQKH8B/joAwr+Av34Trn8YvqZAXoA91+yNMPEh+GozFNwHv7XpV9AV0vqMsOLCSX6MerJ4hjkc\nyzx60cqXeIxVTOMX3EKYbASdmhUjQB0juZSH+U9+wVYm8Q6jaSPIaqZxKT/lZm6kgK3Mi8d3AWbx\nPD/hJoawiNeYAcBYNvM9buZG7uQoFvAS1yZttwp2z2te+/w0Rp49jwufvomRM+ex6NezARh84ma+\n8r838x9P3smw6QtY/Pvkbetg3YvjyOtbwYXzbuSUW+5k3QuX0rgt8SLcorp4jFuy4LI5MGee9kx/\n9xjMmwbFt0BTNgTi6XZ/WAh96mDYLXDdxTD4M8hQ0BqExl4w+TPYdheM3QhXXbj3+598E+puhW8+\nC7d9ae/7cz6HHXPh/+6GJ86G2mQdjc5uQMc7b7fC+EJouIr47BH9BPlveLAMLuoPN+dCayA+23sQ\nXm6EnH5w2//AqcWwJRNiYQhugyHXwFs18NNsCF8BZwHcD9O+DfMa4KarYd5VMLtjrHvgo51Qejf8\n4V44z/JfQNdxvegm51m0EeRPzGEoi5jJcgDGUM33+TU/4i4ms5gctgNQwC568SnFNJJPmBJWUskQ\nVsarmh3OdgLAOD5kO4ftN9ZUFrM1HqroRSuF8QWTGawkRpBt5CdluyQb7/wCO9dPZeI39TFPumop\nLTuHAZBf3EpukbZt/GUrUbEgdZvyyS+po63DQwRadvYiq3DXQcfYuugEBp2wDAnAgMnbCeXXUr08\n0dqxoKQLx9gQhGlz4JRF8At9fHy1Gjb+WgvonMXQS59TcmPw9jyouRPW/QFacuGYbTC2ETLC8LP4\n569bClsOUHj6/iWweuL+719QDVlt8HyyxZc6e8g4PlPlAxi5AiYUwN1z4epNcMQEuPI7sHEL/KIK\n/vsUWN8XtgEMh9Zl8FgN3LkUHmmC/JOhdgLsyoddc+BzgAtg6cZ40e8PYerP0ffpvbB06wE6MPwQ\n1u+EohUkeT/ZR1rPrbNENwb8iVkUUMVFvLbn/ap4BkQU4Q2+zCjeAmAyq2lgII2EaCfAdkbRlyr6\nUUcz/amOn+QNjKEgHipYt0+ftg+ZQB7VAGylYI+lHzIMEPrSmNyhRrp3sjKy61n7/CgA1jwzmlB+\nDQA71heg4sZteHkYKKFwSCMjTt9EuKEfVcv7EG4KUrtmMsOmfXzQMbJ67KR6ue5+unNDD9oaiika\nXZuwjbFAktPzKHDyLBhYBU/vPacsj2e1tAvc/mX4sj6nVIegMl505K4xEIjqTIQgMPoTuF//fvjz\naCip1D+/tE/x6bnjoZf+vTG/DzTHr/EFvWF7CUw5QJz7kAdwIByfr/wa
PNcIP94NP5kLDw6DdR/D\nI8vjGUW7IOMhOHMW+n5aDzn18XDKt+CkEbB+OLROht2FsOt53ZqI+TBmEPp+yof6+2EUwD0wulc8\nVvwP6Nvxi/sTDIlCxliSvZ9sI62ia0VMN/EdTIsZSTXHk8tW7uE2ACbzHLX047N4GKA/yzgHneJU\nRDNjeY3f8xNAUcxKTmUFAGN5icf4EUKUbHZwCY8C8Cr/wfMUIyiy2cH58cyFDziaDUxHiBKgndN4\nMOlHkkQTP1n//O63aKweRTScz9MX/pwhJ7/AkZc+weqnL2bNMwEk2M5Rs54AYN3fj6Zq2XREokiw\nnQmzH0QCkJEV44hz/8LCn38flNB33EKGnKwfLgvvOZXKJWcSaStg/n/dTuHgFZz5yyc4+up/8N69\n3+SZy24HJQw/9dk96WSJEEvW0f3dSPj4eCjaCv30OeU/n4N1/eAVfU45ehk8EE9bW9kDLrpep4sV\n1MFjj+z9rl8/A7OvhHsvhrwGePQx/f7PZsDsMRCMQk4T/OZP+v3nRsKlZ+v3JQbXPbV38S3xI+7k\nfdfuPvwhzPwEjlIgZ8BbP4Z1APOh/61whYAqhsrn4fGOz9wFf5kDV10NGX1g+9/hMYDb4Ym74eJ7\nIJAJ7XfDEwAPwdGzYGoQopkQvgsedMXKoyat51ZUV0N0iQ5QJhcAfdI6iFPYckczzZNck7/ZJWp6\nt/D6Ca5Jl0oBzyjF/t6xyMnAGOvNMaSZKEo9nM4BrAgvuGVK0X0ytzk+ztdtmnK9f4z/TkMn7zdb\naoXBKtJ+Xq0Q3d0WjOEMQpWOX9HuNg2+Et02peisRZERXW/iCdHtzFPwHqEKH4huvotCc93mYNeu\nEV1v4gnR9Y+nm1npfUHaneeKTQEp4mDXrhFdb2JE11WEqjMh5u3pd0Oedztj7M/BPF3/zOD8RdrP\nqwkvpBKJCsEG7xb5DmdEiXqwvXzndO4wKNWKnxaJ/UPiOetdJO03kCpVUfw0FcvY4dr8zUPSku3d\nYzswh3IY0n6DGizH/aIbp8aicewnZ613Pd0dPf0kugri280751D/bnAX9SjVWbZKyrBKdCssGsd+\n8hZ7dzFta7GfQgu1Sh2wePm+GNH1FpacT6tuoq0WjWM/uZ+EwINtymMoqvt21rrGiyRyzZrwgrew\n5HxaIrqqVNXjl0WHQHuArM/TPkWxnN092oj4ahHt0LMzvZh28KpuBjdRacUgVt5EPgoxLPdeXLeq\nr/eOqXMiEK8+d2jK02mIwTKaUMo7nm4c/4QY8hd7L5d1a7F3Y9X7U6VUwqUbN6XTEINlWPbwNKKb\nDrI/DSGt3vEMI4Eotb1MPPdAKFXDPv3CDK7Fe6KrSlUriU/Z3I0oIW952rqJWk5NnzZUwPt1Jfay\n+dD/5d8wIQZ3E8GieC5Y31xvjcXj2Ufv57wTYlhzmJ8W0CqV4uBdmPdnUzoMMVjGZpSybGZq9c20\nEfZpge5lctaEyKx0v7fbmBOmpijxxpXuZ3UXPrMFv2TneJO1Vg5mqejGtwRbeoC20usl98d11w/z\n0y60Zrriter2K/6ZxXmLepSyNLPKjmnjGlzQSTUlFM7PRtrcK7zRQIwNQ/3k5a5JImvhi6zFBc0q\nDfvRlZlNt7BcdFWpakBPx7xPoC1Aj/fcG2LYUtLqow0RMbrjrSrVQrxFucE1RIg35bQSu26oVTaN\naz29n3Hvgtqaw9xre/JsUqrb1fD8c117gw1WFLj5IraIripVW/DL9sms8hDZq92Xx1nbs4W6wpDd\nZljIym5/g1LVWJh6ZOgWMeAjOwa2c+q42MaxraX4j0FXdZRQKJaM99MOtHKlUpZD7p/r2t2sRSlb\nutrYJrqqVJUDVXaNbynZn4fIX+web3drsZ+83BiwKGXfpneobUrZ9xnSQQRYZtfgdi+SfGDz+NbR\n74EQtDt/dTsqMZaOy7TbDAtZ24XN
EIdiMX7J0HEnK1HKtm42toquKlXb8UvebmZtBr3/7nxv99Ph\nrTTn+kV0W4EPU/6tStUBn6b8ew2poA2bYrkd2O3pgp7a+WOXWtGfcwnucG6t3easMJ8c4ae83MVK\npe3aW4xfrmt3sciOjIV9sV10Valqwy+LDxIVSn7v3M0SS46KEgvafk1YxDal0jjL0nm776Xt+w1d\noQKlbJ9ZO+IGU6VqLX6p1JS/JIeC15zXHXnjoCYqi3PsNsMiIsDbaR9FqQ2YRTWnEMaKc54AjhDd\nOG8AtqRwWE7J73IIbXHOTrX6/FYWH5VrtxkW8rZSluWJv4MJMziBD1DKEUWJHCO6qlSFgdcA506/\nU4VEhUGlAUcUOm8PRnnzuKCP6uWuUooNlo2mwwwLLRvPcCC2OCGs0IFjRBdAlapa/HKBZm7PpP/9\n9i+qvT8p7KNshRrgfctHVeozbCisYgCgAT2LdgyOEl3YE9+1vAiFLfR4P4fCf9kX3103rImtJX6J\n47YCr3ajilh3eQ+zRdhqIsAr8a7NjsFxohvnXWCH3UZYQvEDOWRttP6i2FHYyrJxfonjKuB1pWiy\nzwIVQ4fPGmyzwX+8gVI77TbiizhSdOPFzl/BDwtrEhUG35xJ5lbrFtbq81tZMDUE4pc47ntKOaAx\nqva4XgHa7TbFByxFKUeW2nSk6AKoUtUIvIgfhDfYHGTof2VY0t6nPr+NV0/M9FGd3IVKOajkova8\nXscUPE8nG1Bqqd1GdIajbzxVqprwjfA2Bhl6QwaZ1ekT3t15bbx6YgbtmX6pIPauowS3A6U2Awsw\n9RnSwefAm3YbcTAcLbrgQ+EdckMGGdtSn9XQkNvG/JP8JrjOzRhQaiNaHIzwpo5y4PV4/NyxiO6p\n53ykTPKAc4BCu21JO5GeEcrvixHpl5ryih2CGw75RXDfUcoljSJFRgCn4gIHyOFsBBY4XXDBRaIL\nIGWSC5wN9LHblrQTKYhScUc7bYd1rwBNbc8W3jou5BPBjaE9XMckwieEyDC08PqpPVIq+RR4C5eI\nmatEF0DKJAicCIy225a0o4KKbdc2Uz8zL/nPolg3vJnlY3N9kqXQDLyWwg4Q1iJSBMwE8u02xUUo\nYDFKfWy3IcngOtHtQMpkFHASfvAO6mc0s+272ahQYlPQSDDKe5PCPtr4sBWdh+uoJPikEckBzgBK\n7DbFBYTR4YTNdhuSLK4VXQApk97oi9T7cd62wWEq7oBI0cHjvLvz2nhjSsAnW3sVuu3KMqU8siAl\nEkA7E96fyXWdevROs1R3/LAEV4sugJRJJjANGGG3LWknlh1j661tNE84sAe7uaSZ9ydl+6Qmbiuw\nQCkq7DYkLYiMBqYCfnh4JsNnwLso5ZwqfUnietHtQMrkcOB4wPtT6voZzdRcEyKWr0MrLVntLD4q\n4qN6uJvQC2bOq0ucSkTy0Q7FQLtNcQAtwDsotcluQ7qLZ0QXQMokBBwLjAO8vXgUzY2y7TvNLJkd\n4OPROT7xbnejd5htsdsQSxEZg3Yo/Or1fgYsdFrhmq7iKdHtIB7rnQIMttuWNLIBWMxcFUJPQ73s\nDYXRzQRXKOWDessHQnu9JwDDbLbESnaji49vstuQVOJJ0e1AymQA2kMostuWFFIJLIp3Ut6DCEOA\n44DetliVHmLAKmC56zMTUoVHFKmlAAAETElEQVRIMfo897fblDTSgl4gXeOGzQ7J4mnR7SAuvmPR\nXoIbp+ERdAL4alV68FJ1IgwAjgSG4t4QSwu6pvJqpXBEixXHIeLFh2w78DGwAqU8W4nNF6LbgZRJ\nDjoVZwzuSEKvR3t6n8bbGSWMCPnoB81owC1t1SvRHRY22Vhs3D2ICNqRGAcMsNeYbtEErAFWeyVu\nezB8JbodSJkIOt47FhiEs7zfdqACWKNKVbfToUTIAA5DP2j6dff70kAb2otfoxSuzLt0BCI90eJ7\n
OJCamh3pZyvaqSh3yxbeVOBL0d2XeJ5vf/RC1ECsn64poBYttFuAGlWanjiWCNnoh0zHy47OEQrY\njvZqtwLVvl0cSwciGcBIdN76AJzlUADUoVP+1qFUvc222ILvRfeLxEMQA9ACXAL0AFJZLCaMXpXt\nENqtqtSeRG8R+qDFdzD6YZOuMMRO9opslVLY35DTD4hkos/tUGAIkGWDFQqoRpdd3IRS3i/RegiM\n6CZAvLpZjwO8QmhPIoBetFLoFvIdrwa0wO552SWwiSBCCL2luhAoiL8K0ceagX747PsAUvFXGH2s\njfE/G/b9u1KmPY3t6PhvX3QmT1H8516k3hNuRDsUtegZTY2bd4+lAyO6hqQRQTxT68DPiATRM5xC\ndKjpi68M9joVCp3CF0VvwW7+wqsR2IFSLdYehPswomswGAwW4rQgu8FgMHgaI7oGg8FgIUZ0DQaD\nwUKM6BoMBoOFGNE1GAwGCzGi2w1EZJOItIhIo4jsEpF/iIiXy0kiIl8XkQ/jx1wlIv8SkZPststg\ncAtGdLvPV5RS+eitxNuA39psT9oQkR8CvwLuBorRu5z+AJxnp12G9CIib8adCjt2tHkOI7opQunq\nSH9DF9HxHCJSCNwBfEcp9axSqkkp1a6UelEp9SO77TOkBxEZBpyM3hxxrq3GeAQjuilCRHKBi4EP\n7LYlTUxF12Z4zm5DDJYyC31NPwrMttcUb5BhtwEe4HkRiQB56L3mZ9psT7roA9QqpSJ2G2KwlFnA\n/cAi4AMRKVZKbbPZJldjPN3uc75SqifaC/wu8JaIlNhsUzrYARSJLh1o8AHxBdKhwDyl1FJ0g8iv\n22uV+zGimyKUUlGl1LPogiBeXM1/H11w/Hy7DTFYxmxgvlKqNv73pzAhhm5jvJYUIbp03rnocnlr\nbDYn5Sil6kXkduD38XDKfHSXi9OBGUqpG2010JBSRCQHuAgIikh1/O0soKeITFBKfWyfde7GiG73\neVFEoujV3XJgtlJqlc02pQWl1H3xG/BW4El0zdylwF22GmZIB+ejZ23j4d+Kzs9Dx3lvsMMoL2BK\nOxoMhv0QkZeBVUqpG77w/kXAb4BBZlG1axjRNRgMBgsxC2kGg8FgIUZ0DQaDwUKM6BoMBoOFGNE1\nGAwGCzGiazAYDBZiRNdgMBgsxIiuwWAwWIgRXYPBYLAQI7oGg8FgIf8PqyzoVXvFWZQAAAAASUVO\nRK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "venn3([set(akker_kmers), set(shew1_kmers), set(shew2_kmers)])" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAV0AAACoCAYAAABDoD2pAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4wLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvqOYd8AAAHjFJREFUeJztnXuU3FWV7z+7+v1I0kk67zcEEvIg\nISRiIEhEVJQRGRn0XgbEx+gsxDvqdcDRgdtTZsQZ7wj3ypK17r1roc4IM6JwWcYgAheQVxJIJEDe\nD0g6SSfppPPq7upXVe37x6mGQDpJd7p+5/we57PWb3WlulJnV/9OfX/7t/c++4iq4vF4PB47pFwb\n4PF4PEnCi67H4/FYxIuux+PxWMSLrsfj8VjEi67H4/FYxIuux+PxWMSLrsfj8VjEi67H4/FYxIuu\nx+PxWMSLrsfj8VjEi67H4/FYxIuux+PxWMSLrsfj8VjEi67H4/FYxIuux+PxWMSLrsfj8VjEi67H\n4/FYxIuux+PxWMSLrsfj8VjEi67H4/FYxIuux+PxWMSLrsfj8VjEi67H4/FYpNS1AVFC0pICagpH\nbeFnCSCFAyB7wtENHAeOaYP2WDe4SIhQAQzBzJfeIwVo4QDzWduANlXyLuz0DBARAYYDIzDzufqE\nowao4N35De+e724g08dxBGhBtdveh4geoqpnflXCKIjraGAcMBIzIWuBKt6dgAMlAxwtHMcKR7M2\naOegDS4CIpQCdcDQPn5WDuCtFOjACHDrCT+bgRZV/IRzhchQYCxQD4zCzO0gHK9jwEHgEOa8N6Pq\nL8QFvOjyjsjWA+MLx1js3QUcBBqB3dqgzZbGBECEOmBS4RiH8WqCpAvYBzQBTaocDni8ZGM82THA\nlMJR58iSbswc3wXsTronnFjRLQjtZGAGRmjL3FoEGA9xD++KcFEnZ8GbHY/53JMwIQOXdGBEeCfw\ntio5t+bEBJGxmHk9hYHdpdggjznn24EdqGYd22OdxImupGUIMBMzKasdm3M6csA2YL026KA8QhGG\nA3OA8whvHL8T2ApsVuWoa2Mih0gp5vzOwoQNokAX5pxvRPWYa2NskQjRLXi1U4ALgAmcfVzWFXuB\nN7VBGwfyn0SYDMzFfOYo0QRswni/PhZ4OkRqgQuB84Fyx9YMhj3AenRgczyKxFp0C2I7E1hAuL3a\n/nIUWA9s1Ya+b8tEKMN48XMwSbAo0wlsBN5QJdFxwJMQqcTM61nEq/RzP/AKqvtdGxIUsRVdSct0\nYCHRF56+6ATWAJu0wZxAEVIYoV1AtD2evugEXgM2Jj7uK1KG8WwvJBx5iKBoxIhv7JKtsRNdScso\n4DJMyVfcaQFW8g9aDnyQeF5gTqQVWKXK264NcYLI+cAlmNLFJKCYmO/KOFU8xEZ0JS2VwAcw4YRk\n0DOqh6bbe9hxhbB6XjldFUGXfIWFPcDLiUm4idQAl2OqTpJIBng+LvHeWIiupGUi8GGS4gGoKC3/\nKUPLDVVQZuJ5PSU5XpvVxY4pcYhd94c88Koqr7s2JFCMd3sp8QsZnQ1bgZej7vVGWnQlLQJcDFxE\n9CoSzo5sXZY9
6Sxd5/Rdf7lvVIYXLq4kVxqn5MrpaASeVaXLtSFFxSTKlpJc7/ZUZIBnUd3r2pCz\nJbKiK2mpBq7EFPsng8wFXey9q4T8kNPX2rZVdfPMB4X2mjgnWk6kDXhaFasr+gJDZATwcdwvXgkr\nCqxC9U3XhpwNkRRdScsEjOAmI5wAcORTGZq/VAn99GCzJTleWtBN05ik/I3ywGpVIvlFfAeRqZhQ\nWVIumINhK/ACqpGqaImc6EpaFmBCCskIJ2iJsu9vO2ldMnDxVJQN52V4c0ZNAJaFlZ3Ac5Gs6xVZ\ngClz9PSfA8BTqGZcG9JfIiW6kpbFmBVWySBbl2X33Tm6J1UM6n32jcrw/MJK8iVJifMeBFZERnhF\nUpj47XTHlkSVNmBFVJYSR0Z0JS2XYor/k0F2aI5d9+bIji5O1vrg8A7+3+JKNJWMOwTTVnBF6BNs\nRnA/AkxzbUrEyWCE94hrQ85EJDwfScsSkiS4udocjT/OFk1wAUYdqWLp6k6IyFV28NQDfyYSui5b\n72IE96N4wS0G1cCnCknIUBN60ZW0fAizvjwZ5Kpz7Ppxlp6xgwsp9MXYliqueKUjQcI7krAKr+l1\neyWmEZOnOFQC1yDiqm9wvwhteKFQg/shTPOWZJCvzLPrnp5Bx3DPxO6xGV5cmJRFFACHMaGGDteG\nvIPIUkxnME/xaQceQ7XdtSF9EWZPdxGJEtyKPLv+e3fgggswaX81i/8UmWxvERgBfLLQxN09Ir2t\nGD3BUAN8rNBjOHSEUnQlLVOB+Y7NsMveu7ronmrvNnhqUzVzt4TSEwiIkZj+BW4RmYRpWuMJllGY\nO+XQETrRlbQMw5TPJIeWv8iQmWd/EcOsbdWMagnFxpiWOE/EYX7AxBo/QlJqzN0zHZHQOW+hEl1J\nSykmm5uc5h4d53Vx6CY3q8ZSCEvWllDeHakVPYPkUhEHbT9FyjFLe5Mzt8PBIkRC1b8iVKKLuR0I\nfclH0chV59h7VwpK3Hk+ld1lXL4mGosIikMKuMpBRcNlwDDLY3rMXcVSREKzHD40oitpmUPSVuQ0\nfa+b3HD3a+xHH65i9tYkxXdrgStFLN3mi0zBbBrpcUMlYYjnFwiF6EpahmN2PkgOLTe4ieOeijlb\nqxl5JNyrt4rLRGwka0UqCNEXPsFMRSQUTl0oRBdYQnhsCZ7ucT0cujFcBfsmvitIPpyF28GwQCTw\nLY4uJR6bosaBS8MQZnAudJKW84Fxru2wyv5v5PrdotEm1Z3lzN6WpPrdEkysNRhMAseHFcJDJcbB\nc4rTL76kpXdDxeTQuriDjtnh8nJPZNaOKqo6+9zePaZMEgmg94Hpq7C46O/rGSzTEHHq5Ln2ti6C\nEK6LDwotUQ78dbg3jyzJp1j4ZpKqGQAuKWxhX0zOx1crhJUPuBzcmehKWoaQpM5hAEeu7SA3Mvx1\nmhMOVDHseJKEdygwu2jvZpaf+mbk4WVMYYcOJ7j0dBdhYmrJIF+Rp+Wz4RdcAEFYuD5JCybAJNWK\n1fdiDj55FnYWFTq9WceJ6Ba83HNdjO2Mls92kK8NZQOOPhl9uIr6w0kqIasALhj0u5iVZ/MG/T6e\noBmOoySnK093Fklaf66iHP1E8N3Dis3sbUnzdmcVYcHETCiax+wJlgtdDGpddCUtJSSpZSNA65LO\nM26bHkbGHqyioitJwlvL4JuKD95b9thiBCJjbQ/qwtM9lyRVLAAcuTaaXn0K4fydSQoxwGB2KRGZ\niK9YiBrFS6D2Exeia/1DOqV7TA+dM6N7kTm30X1vCLtMFDlr4UzOtlLxYZrtVWpWRVfSMhrTXDg5\nHPlMtEuvqrrKGNecpJ67cDaOgUgxQhMe+6QwcXirA9okWZ6AlijHr4h+UmXmW3nXJljm/LPY2mc6\nSUoOxwurWydZE11JS4qkbTXduqSTfE30EmjvZ/ShpCXUyoGBNr72Xm50GYaItV
i8TU93DJCs+GDr\n4nh07EohjG9OWkJtYr9faWKC9nej8BQTaxdNm6Lb/0kcFzpmx+ciM/6AawtsM5D5OhkfWog6U20N\nZFN0J1gcyz3d43rI1cVHdEcdic9n6R+1A6hi8KGF6DMGEStVRlZEV9JSQdKqFtoWxas9YlVXGTXt\nPa7NsMyZvV3TwjF5d3HxQ7B0Hm15uhNI2u1X+8J4xHNPZEJz0kS3P3dnI2DAlQ6ecDLGxiA2RTdZ\ndMyIfqnY+xnX7NoC24zvR5/dZN3BxRsr59KW6CZrO57Oad1odfzaVtYfiUZryuJRDtSf4TVedOPD\nCBvtHgMXXUmLQOCb/4WLzunxiuf2Up4tTVi9Lpy5l8KZRNkTHUoxLR8DxYanW2tpnPDQMz5+8dxe\najPxvKCcmiGn/I1Joo2wZ4rHAoFfRG2IYbK8XIDu8fG9yAxpT5qne7r5W0fSHIr4E/hF1MaEObWn\nEFd6xsS3UmNoW3y9+L45nej6LXniR03QA3hPNwiyI+OXROultj2+F5S+OZ3T4EU3fgR+Tr3oFhst\nUXJD41u3WduRNNGtETnlBqpedONH4OfUhjgEE17IUMr93E6eUpQSJrCWm1jOIyxlK1fRxShu5duM\noQ2ATYzlcW6hjcnM5DE+x1OB2NVTn4VUcZfMPvHNWzi+ey4lFa1c/1AagLefmci6n9+E5ssQyTH3\nLx9i+tU7WfeLebz15KdBFJEcs254mBnXbi+aLVUdAXrxh0ph7u2QLYV8CSxaC08shydHwhe+Apla\nmLgLXnoAhuXguRFwyxegoxo0BV97FNLrAzBsCHC0j+cDvxW1yVC4uxw6U+aPmdsPd2+C6k/CV4/C\nyDpoeQL+9wzI3AYfeBiuVqACuv4ZHrwJ9rj+DEUgFp5uMF3ZK8nyFe7hDpbxTZaxn9m8yjTOZQef\n417KaXnP6+toZyn/wbSAxLaXbH3xE01Tl77Mwlt/8p7n3nzoeqZfvZzP/HIZ06/+LRt+dT0A51+z\nmT//t+/zmV8u46Iv/4L1/35zUW2p6AnwQj08C6/cAweXwc5l8MZsuH8afOt6uOFpOHon1GTg60vM\n6+/4JHxoLTT/I9z/f+DeGwMy7FRfxNh5us/CPc2wbD/cDfA1+MQ82HwE7poHm2+FqwFmwqEX4V8O\nwfe/Ar+7A25ya3nRKEMk0D4jNkQ3mDFSwDBMu8EeStDCLeB8dnPO+wQXYBytXMwuUgSbfdcA1g/M\nvG4b1fXtJz3f3W4uaN1tVZTVGE+sur4LSfX+vvir4iQfYHihBJhUaCHZXgK5EnOi35oBP/yTef7z\nK+Gl+e/+n7ZCk5LmKhhyLCDDTjWHY79Y5HWY911YCfBdWLkO5gP8F3NSMgD/Gd4+bqG+1SKBnlcb\n4YXgbkezCPdwJ52MYjLPsYi3Axurv2ipnez+RV/6Favu/QZvPfkXKMIVd/3zO79b97P57Hjyz8l2\nDeWiL99X1HFTQffQ6BSYfCccGQWXPweXHYSKDqgu7F4x9wgcrzOP71sO13wDaq+EbDn89N6AjDqV\n6MaqXEyAq+AbAJ+A538OL7TD0EvgGMBCONbeR47me3DZTAgirOOKQM9rdD1dgFKUO1jGbXyHI0xl\nPeMDG6u/qKUc2ubHrmD6Jx7mhl//HdM/9jCr/+ct7/xu/hfXcf2/NzDv8/ez+dFPF31syQd4YalU\naF4GG78DO6bCM6fZIvtfFsHSldD2HfjxfXDHl6AniIvCqeZwrJKKj8OPDsAPnoCfrICl98B5J/6+\nBBB4z7n/Ecx4Dpb8Ah61aWvARF50g2ckHYxmCxsStNPw4W2Lmf+F1wC46Mtr6Tg89aTXzLxuG91t\n9RzdWWvZuiJwXgfM2QJ/PAe6qiBTmKtvDoehhaTWM0vgb9aYx7e9BdkyWB/BzxoOLiskCy+C1oWw\n7iWYWgPHVxeWQq+GYdXQ2vv6B2HC3XDzz+
Cns+Hk8JenT2yIbjCbGu6nlpZCkq6dMg4wi1HsD2Ss\ngSCWVsmWVh5j82NmQ71Nj8ykvNa0AGt6dRRa+JO/9fRk8vlShk1uK+rYmgrIw3u9FrYVEq/NZfDG\nLJi1D87ZAt9dYJ7/18Vw6TrzuO4w/LKwk+sjY43oXth68vsOmlPN4dgsFGmC8l1Q0fv4DZh1ITTN\ng9d/CIsBfgiL58HrAM/BiNvg1mXwwLUQt/ZzgW7EKqrBzhtJy00EkeV9gwn8gS+ipFCEiazhL1nB\nb7iSLXycHoZSRiujeJOv8m/sZSg/5+/JUYmgpOjiazQwnOJuL94+v4s9y4qbwHr8639F2/7zyXXX\nUlrRyuTLf0vd1ANs/PXn0HwKKenhwpsf4tyPNvL8Dz7OwQ2LkVSOVEk3F1z/SFFLxvIov/qzgET3\nwQnwrS+aiiUVuGQNrFgBf6g3JWMdNTChEV58wFQ6/HocfOtm6Kowd/p/8wjcuTEAwx5X7aMcSuQa\nYtK29A9QfxPcCpCHkkth9XL4/QaouQa+egxG1MHhx+F/XQCZhXDzelgwFA4D9JaYuf0UReNBVAPz\n3G2I7o2YpjfJIDO7k93/ZGXbDyfkUnke/mQ8wlL9Z7kq+056VuQq4Bz75ngC5gFUA7tltfHl6bAw\nRngoPRTfJcAAXWVJ6zIGp57DGatWeGzQE6Tggh3RPW5hjPBQ1lwKudjE+k6ioyppXcaUE5JH78OL\nbvwIPCHoRbfYiAolx+PrDbZWx/eC0jcZ1VMuqPGiGz8CP6c2RDeIbHK4KT0UX2+wrSZpons6p8GL\nbvyIhegmy9MFKGsOtOTEKa01sVoQ0A9O5zT42tT44UU3kpTvi6832FoT70ThyZxu/h6DgHt5eGxz\nOOgBbIhuOwEXG4eOsr3xLalqq45vr+C+ObXoquax8CX1WOVg0AMELg7aoErSvN2qbfH0BrtLs3RV\nxPOznZq++uieSOBfUo81spz5fA8aWx5Zk6VxwkHFrnJS7fGrYDg4otu1CZbpgj7ahL6XQzYM8Vjh\nEEGvFsOe6Maho/zAqNoUP4FqGu3aAts0qZ6xv4IX3fhg5Vza9HTjm1zqi5o1ri0oPk2jA+2oH0L6\n4ywcBnqCNsRjhQM2BrEiutqg3cSvE9HpqX0lXgKVqegmUx2vz3Rm9p7xFSaZlrw7ufiRB3bbGMhm\nlj1ZE7PsYBklR+LjATWPjF+M+vQcV+13AnhXoJZ4bLAfVSshQZuie2avIW5Ur4+P6O5LXDx3IPO1\nkaSFz+KHtQunTdFtBuKXXDodQ16Kx+qtvORpGl38TS7DTf/vzFQ7sRQP9ARG/ERXGzQPvGVrvFBQ\nu6qSVFv0b8v313fSXZ6k+twuBh7f2xmAHR47HEHV2loC2yungujqH14kJwx9psu1GYNm8znxXWHX\nN1tUGejFcjtJW3kZH7baHMzql0kb9BBJuw0b8Vg5BLl7bsBkKrs5MCq+O2GcjHI2zoFqBu/tRpEc\nsMXmgC48mA0OxnRH2cEyqjYVdx82m2ybEv3wyMDYPYCqhfeTrDu5eLCjEJO3hgvRfYukbeEz/LFo\nJtTykmf7lKQl0M5eOFWbsLB231NUrF8orYtuIaG22fa4TqldXUHJ0eiVjzWNTloC7TiDL5D33m50\nOISq9UVbrhIkG0lSXaOoUPd49MrlNpyXJMEF2NiPXgtnYgsQ3XBSsnjDxaBORFcbtB3Y5mJsZ4z4\nTRWp1ujER/fXZzhcl6TQQgewadDvotoDvDbo9/EETQuq210M7LIU6FUYcFlOdEn1pKh/MBrebh5l\nzZykNStfq1q0xjUbgbYivZcnGF5xNbAz0S14u2+6Gt8JdY9XUdocfuHdPa6D1tpy12ZY5CjFzDOo\n5oAYtpmLDU2oWmlu0xeui97XkaQdVUWFMfeHu4A+m8qxdk6SwgoAq1SLvrBhG34rn7DizMsFx6Kr\nDdoDrH
Rpg3Vq11ZS/Vp4LzQbzutK2JY8O1VpLPq7mh0IkjW3o8F2FxULJ+La00UbdAdJa/s49r4y\n6Amfx9tW1c3G6VWuzbBIFng5sHdX3UvSyiPDTQdBnu9+4lx0C7xIkrayLjtYxuifhausKC95XlgI\nSDQXcpwdr6oGnvBahU+qhYUXbK8+64tQiK426HGM8CaH4curqXklPGGGdRd0cnRYkpJnu1QtJHJN\nY+znAx/Hcya2o7rTtREQEtEF0AbdQtJuxcb/qJLSQ+6rGfaNyrDlnGrXZljkOPCstdFU95C0uR0u\nQhFW6CU0olvgJeCgayOskepKMeH7OI3vdlT08OLFSeoilgWeUrXeUH8lvprBBQo8E4awQi+hEl1t\n0BzwFElaRln5djmjH3DTczcveZ5flCdbGqp5EDAvqtJifVSzUu0PJGluh4OVhYRmaAjdl00btA14\nhiT1Zhj+uyon8d03ZnQmbKnvJlW7Davfg2or8DS+2bktNqO63rUR7yd0ogugDboHWOvaDquM/6dK\nKrfaa3m5fXI7m6YnKY7bTBjieqb9o3s74s9+QpqcD6XoAmiD/gkI3VUqMFI9KSZ9r4KKHcHffr49\nMcOrF9YEPk54OAj8XjUkZYmqG0laM3+7tAJPoRrKO4rQii6ANujLJKk/Q6orxeTvlFO+MzjhbRyb\nYdX8pHm4K1QJ1151qi9heW+uhNAO/A7V0G6UEGrRBdAGXQm87toOa6S6Uky5vZzy3cUXib2jM7x0\ncZJWnB0AHndQqdBf/gjscG1EjMhgBLfVtSGnI/SiC6ANupok9ShNdaaY/LellDUVT3j312d4flFV\nglac7Sfcgtvbn+EZktZbOhjagN+iesy1IWdCzHmPBpKWhcAC13ZYI1ebY/c/9tB17uDqaBvHZnh5\nQRWaSpLg/r6I/XGDRUSAJcAFrk2JKMeBFWH3cHuJlOgCSFrmApcQES990KgoB76e4djHBp74yqO8\nfkEHm89NUgx3K6YWN3oN8kXmAh8EknJxLAZ7gadRDVfM/jRETnQBJC1jgI8Ata5tscbRj2Y4cGsl\nlPXvYtNdmuWFhVma65Oy2iwLvKwa8eW2IhOBq4Ak9cE4W9YDq8JapXAqIim6AJKWCuDDwGTXtlij\nc1o3e9JCbnjZaV93rLaLZy8poaMqKVvuHMMs7Y3HMluRYcDHgTrXpoSUPPAiqpG8wEZWdHuRtMwD\nFpGUcEOuNseehm46Z/ZdhdA4NsPKiyrJlyTj7wHbgRciE7/tLyLlwOXAua5NCRmtwLOo7ndtyNkS\nedGFd8INVwHJKfg//OkMh26uQAu7PHSVZVkzp4fGCUkpCctittnZ6NqQQBGZhkmyJeW8no4NwGpU\noxevP4FYiC6ApKUcuBiYTVK83uyILPu+2c3Ga2DN3Ap6ypKyzc7bwEoLDcjDgUglRnjPcW2KI1qB\nPxaWUEee2IhuL5KW4cBlwHjXtljgAPAy/6BlwKXACMf2BM1RTLIsWds79WK83sUkJ4Gcx2xn/0rU\nvdsTiZ3o9iJpmQx8gHgKUTuwWht0e+8TIggwE1hI/G5F2zENkLaoJqj7XF+IlGDu5uYDca5M2Q6s\nQfW4a0OKTWxFF0DSIsB0TNhhqGNzisEhTC+KHdrQd5mMCCWYzzwHGGnRtiBow5QFbQhNs5qwYBJt\n84C5QJyqVHZjPFv7PY8tEWvRPRFJy0SMJziVaMV8FdgFvKkNum8g/1GEcZgv5RSiU3CvQCOwCdid\neM/2TIhUYzzfmUT3DiePmePr0YHN8SiSGNHtRdJSBczATNIwe789mH211mvD4JY3ijCEd7+YYS26\nb8d83s2qtLs2JnKIpDCJtlnAWMfW9JcM5pxvQjUx5zxxonsikpYJGAGeQDi8hFbM7VUj0KQNxU0e\niJACxgCTMItKXMe7jwNNGC+n0Xu1RUJkBOYCOwUY4tia95MF9mBitjuj
tpqsGCRadE+kUPUwHhhX\n+GkjSZHHNGdpBBq1QY9aGPMdRKjBCPAkzIUnaC+4DSOyTUBTYkq+XGIEeErhGO3IigzmwroL2Itq\nouPzXnRPgaRlBEZ8R2BKdHqPs0la5DFe3VHMktVjhcct2qChWElVqH4YAgzr46il/zHhLEZcTzxa\ngQOqxC4THSlEqjB3OqMKRz3Fdy7yQAsm6XsQOBjnpNjZ4EV3gBR6PvQKcA1QwnsFKYcRnizQjRHY\nVm2I7h+6UBHRe8EpKfzsTUZq4egBWlX9breRQqQWU+VSUziqTzgqMec5hZnj+cLRg/FeM5hYfO/j\no8DhJIYMBoIXXY/H47FIlEqnPB6PJ/J40fV4PB6LeNH1eDwei3jR9Xg8Hot40fV4PB6LeNEdJCKy\nU0Q6RKRNRI6IyAoRmeTariARkRtFZE3hM+8Tkd+LyBLXdnk8UcCLbnH4lKrWYlazHQDuc2xPYIjI\nfwX+B3A3ptB+MnA/8GmXdnmCQ0SeKzgUFa5tiQNedIuIqnYCv8E0HYkdYjZM/D5wm6o+qqrtqtqj\nqstV9XbX9nmKj4hMxezVpsC1To2JCV50i4iYNnufA1a5tiUgFmNWKf1f14Z4rPF5zHz+OXCLW1Pi\nQZyaH7vkMRHJYpZRHsRsnx1HRgKHNEZbp3jOyOeBe4DVwCoRGaOqBxzbFGm8p1scrlPVOowX+HXg\njyISlZ6mA6EFqBcRf7FOAIXk6BTgYVVdC+wAbnRrVfTxoltEVDWnqo9imt7EMZu/EugCrnNtiMcK\ntwBPquqhwr8fwocYBo33WIqIiAgm2TAcs91MrFDVYyLy34CfFsIpT2I6Tl0FfFhV73BqoKdoiGkD\n+VmgRET2F56uAOpEZJ6qvu7OumjjRbc4LBeRHO/uZ3aLqm5wbFMgqOqPC1/CO4EHMb1y1wI/cGqY\np9hch7ljm4tpUdrLw5g477ddGBUHfGtHj8dzEiLyBLBBVb/9vuc/C/wEmOgTqmeHF12Px+OxiE+k\neTwej0W86Ho8Ho9FvOh6PB6PRbzoejwej0W86Ho8Ho9FvOh6PB6PRbzoejwej0W86Ho8Ho9FvOh6\nPB6PRf4/9BK4ugTvf7IAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "venn3([set(akker_sub), set(shew1_sub), set(shew2_sub)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A full list of notebooks\n", "\n", "[An introduction to k-mers for genome comparison and analysis](kmers-and-minhash.ipynb)\n", "\n", "[Some sourmash command line examples!](sourmash-examples.ipynb)\n", "\n", "[Working with private collections of signatures.](sourmash-collections.ipynb)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }