{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Explorations into Genomics\n", "\n", "*Inspired/stolen from work by John Jacobsen: http://eigenhombre.com/2013/11/03/nucleotide-repetition-lengths/*\n", "\n", "An exercise in laziness and counting.\n", "\n", "We represent genomes as strings on an alphabet of four letters, A, C, G, and T. Thus part of a genome might look like the following:\n", "\n", "    \"...AACCGTGTGCGTTTATTAATTATTGCTTTA...\"\n", "\n", "Each letter corresponds to one of the four nucleotide bases (adenine, cytosine, guanine, and thymine) in a very long strand of DNA. A single genome can be quite large, and scientists generate them at an alarming rate.\n", "\n", "In this section we'll process genomes and perform simple counting analytics on them. We'll start by counting frequencies of the individual bases (how often does `A` appear?), move on to counting lengths of repeated bases (how often do we see long runs like 'AAAAAA'?), and finish with a quick Markov chain.\n", "\n", "Throughout we'll exercise laziness and become familiar with the following functions from PyToolz." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "from toolz.curried import (map, frequencies, pipe, take, concat, drop, count,\n", "                           partitionby, identity, compose, sliding_window,\n", "                           merge_with, merge)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've stored an actual genome (yeast) in the data folder, split across several `chr*.fa` files." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from glob import glob\n", "\n", "file_pattern = 'data/yeast-genome/chr*.fa'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Snoot - an infinite and uniformly random organism\n", "\n", "We'll compare our results against the genome of a snoot. A snoot is a magical creature with an infinitely long gene sequence where each letter is chosen uniformly at random. We can implement a snoot genome with a Python generator.\n", "\n", "By comparing our physical genomes against that of the snoot we'll highlight ways in which real genomes deviate from randomness." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "import random\n", "\n", "def random_genome():\n", "    \"\"\" Yield an endless stream of uniformly random bases \"\"\"\n", "    while True:\n", "        yield random.choice('ACTG')\n", "\n", "snoot = random_genome()\n", "\n", "list(take(10, snoot))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ "['A', 'C', 'A', 'T', 'C', 'T', 'G', 'C', 'C', 'T']" ] } ], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "# Set up matplotlib and build a small function to plot histograms of count dictionaries\n", "\n", "from pylab import hist, show, xticks, title\n", "%matplotlib inline\n", "\n", "def histdict(d, *args, **kwargs):\n", "    \"\"\" Plot a histogram given a dictionary of counts\n", "\n", "    See Also:\n", "        matplotlib.pyplot.hist\n", "    \"\"\"\n", "    # list(...) keeps this working when keys()/values() return views\n", "    if all(isinstance(k, (int, float)) for k in d.keys()):\n", "        hist(list(d.keys()), *args, weights=list(d.values()), **kwargs)\n", "    else:\n", "        keydict = dict(enumerate(d.keys()))\n", "        revdict = {v: k for k, v in keydict.items()}\n", "        hist(list(map(revdict.get, d.keys())), *args, weights=list(d.values()), **kwargs)\n", "        xticks(list(keydict.keys()), list(keydict.values()))\n", "\n", "histdict({'dog': 3, 'cat': 5})\n", "show()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "display_data", "png": 
"iVBORw0KGgoAAAANSUhEUgAAAW0AAAD/CAYAAAA62IfeAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAC8xJREFUeJzt3VuIVWUbwPFnewjxkChkWQpWVjo2OtskrZycLPMilKys\n7KCUBQUVhgRJENNFVGSI0U1EdISKusi0kAIbO1GTlFQUFYFh54N5VsyZ9V1kfmk6ezvjzPTU7wcL\nZtiLdz1z8+fl3XtrqSiKIgBIoUd3DwBA9UQbIBHRBkhEtAESEW2AREQbIJFe1dw0YsSIOPLII6Nn\nz57Ru3fvaG5u7uy5ADiAqqJdKpWiqakpBg8e3NnzANCGqo9HfAcHoPtVFe1SqRTnnXdeTJgwIR55\n5JHOngmAg6jqeOTtt9+OoUOHxs8//xzTpk2LUaNGRX19fUT8EXQADl17TjCqivbQoUMjIuKoo46K\nWbNmRXNz895ot/fBAN3lj81md3erfRveiscj27dvjy1btkRExLZt2+LVV1+N2tradj0MgI6puNP+\n8ccfY9asWRERsXv37rjyyivj/PPP7/TBAPi7Ukf/adZSqeR4BEjln3I80p52+kYkQCKiDZCIaAMk\nItoAiYg2QCKiDZCIaAMkItoAiYg2QCKiDZCIaAMkItoAiYg2QCKiDZCIaAMkItoAiYg2QCKiDZCI\naAMkItoAiYg2QCKiDZCIaAMkItoAiYg2QCKiDZCIaAMkItoAiYg2QCKiDZCIaAMkItoAiYg2QCKi\nDZCIaAMkItoAiVQV7ZaWliiXyzFjxozOngeANlQV7aVLl0ZNTU2USqXOngeANlSM9jfffBOvvPJK\nXHfddVEURVfMBMBB9Kp0w6233hr3339/bN68+aD3NDY27v25oaEhGhoaDsdsAP8iTXuujmkz2itW\nrIghQ4ZEuVyOpqaDP+yv0QbgQBr2XH+6q12rtHk88s4778RLL70Uxx9/fMyZMydWrVoVc+fObdeD\nAOi4UlHlQfXq1atj8eLFsXz58n0XKJWcdQOp/PGhiu7uVvvaeUif0/bpEYDuVfVO+6AL2GkDyfxn\ndtoAdC/RBkhEtAESEW2AREQbIBHRBkhEtAESEW2AREQbIBHRBkhEtAESEW2AREQbIBHRBkhEtAES\nEW2AREQbIBHRBkhEtAESEW2AREQbIBHRBkhEtAESEW2AREQbIBHRBkhEtAESEW2AREQbIBHRBkhE\ntAESEW2AREQbIBHRBkhEtAESqRjtnTt3xsSJE6Ouri5qampi0aJFXTEXAAdQKoqiqHTT9u3bo2/f\nvrF79+6YPHlyLF68OCZPnvzHAqVSVLEEwD9GqVSKiO7uVvvaWdXxSN++fSMiYteuXdHS0hKDBw8+\n5AcB0HG9qrmptbU1xo8fH1999VXceOONUVNTs8/rs2df0ynDVatUiliw4Po488wzu3UOgM5WVbR7\n9OgRa9eujU2bNsX06dOjqakpGhoa9r7+wgu//eXuUyJi1OGdsoJS6YWoq2sSbeAfrGnP1TFVRftP\nAwcOjAsuuCDWrFmzT7QjXuzwIB1RKn3Zrc8HqKxhz/Wnu9q1SsUz7V9++SU2btwYERE7duyI1157\nLcrlcrseBkDHVNxpf//99zFv3rxobW2N1tbWuPrqq+Pcc8/titkA2E/FaNfW1sYHH3zQFbMAUIFv\nRAIkItoAiYg2QCKiDZCIaAMkItoAiYg2QCKiDZCIaAMkItoAiYg2QCKiDZCIaAMkItoAiYg2QCKi\nDZCIaAMkItoAiYg2QCKiDZCIaAMkItoAiYg2QCKiDZCIaAMkItoAiYg2QCKiDZCIaAMkItoAiYg2\nQCKiDZCIaAMkItoAiYg2QCKiDZBIxWivX78+zjnnnBgzZkyceuqp8eCDD3bFXAAcQK9KN/Tu3TuW\nLFkSdXV1sXXr1jjttNNi2rRpMXr06K6YD4C/qLjTPuaYY6Kur
i4iIvr37x+jR4+O7777rtMHA+Dv\nDulMe926dfHhhx/GxIkTO2seANpQ8XjkT1u3bo1LLrkkli5dGv3799/v1ca//Nyw5wLg/5r2XB1T\nVbR///33uPjii+Oqq66KCy+88AB3NHZ4EIB/t4bYd0N7V7tWqXg8UhRFzJ8/P2pqamLBggXteggA\nh0fFaL/99tvx9NNPx+uvvx7lcjnK5XKsXLmyK2YDYD8Vj0cmT54cra2tXTELABX4RiRAIqINkIho\nAyQi2gCJiDZAIqINkIhoAyQi2gCJiDZAIqINkIhoAyQi2gCJiDZAIqINkIhoAyQi2gCJiDZAIqIN\nkIhoAyQi2gCJiDZAIqINkIhoAyQi2gCJiDZAIqINkIhoAyQi2gCJiDZAIqINkIhoAyQi2gCJiDZA\nIqINkIhoAyQi2gCJVIz2tddeG0cffXTU1tZ2xTwAtKFitK+55ppYuXJlV8wCQAUVo11fXx+DBg3q\nilkAqKDX4Vmm8S8/N+y5APi/pj1Xx3RCtAH4u4bYd0N7V7tW8ekRgEREGyCRitGeM2dOnHnmmfHF\nF1/E8OHD47HHHuuKuQA4gIpn2s8880xXzAFAFRyPACQi2gCJiDZAIqINkIhoAyQi2gCJiDZAIqIN\nkIhoAyQi2gCJiDZAIqINkIhoAyQi2gCJiDZAIqINkIhoAyQi2gCJiDZAIqINkIhoAyQi2gCJiDZA\nIqINkIhoAyQi2gCJiDZAIqINkIhoAyQi2gCJiDZAIqINkIhoAyQi2gCJiDZAIhWjvXLlyhg1alSc\ndNJJcd9993XFTAAcRJvRbmlpiZtuuilWrlwZn376aTzzzDPx2WefddVsAOynzWg3NzfHyJEjY8SI\nEdG7d++4/PLLY9myZV01GwD76dXWi99++20MHz587+/Dhg2L995772/3DRw44/BPdgh27vwsevS4\nrltnAOgKbUa7VCpVtcimTSsOyzAdsWjRoli0aFF3jwGkUV3f/mnajPZxxx0X69ev3/v7+vXrY9iw\nYfvcUxRF50wGwN+0eaY9YcKE+PLLL2PdunWxa9eueO6552LmzJldNRsA+2lzp92rV6946KGHYvr0\n6dHS0hLz58+P0aNHd9VsAOynVBzi+UZjY2MMGDAgFi5c2FkzAfznrF69Oo444og444wz2rzvkL8R\nWe2bkwBU7/XXX4933nmn4n1VRfvuu++OU045Jerr6+Pzzz+PiIi1a9fGpEmTYty4cXHRRRfFxo0b\nIyLi/fffj7Fjx0a5XI7bbrstamtrO/BnAOT25JNPxrhx46Kuri7mzp0bK1asiEmTJsX48eNj2rRp\n8dNPP8W6devi4YcfjiVLlkS5XI633nrr4AsWFaxZs6aora0tduzYUWzevLkYOXJksXjx4mLs2LHF\nG2+8URRFUdx5553FggULiqIoijFjxhTvvvtuURRFcfvttxe1tbWVHgHwr/TJJ58UJ598cvHrr78W\nRVEUGzZsKH777be9rz/yyCPFwoULi6IoisbGxuKBBx6ouGabb0RGRLz55ptx0UUXRZ8+faJPnz4x\nc+bM2LZtW2zcuDHq6+sjImLevHkxe/bs2LRpU2zdujUmTpwYERFXXHFFrFjR/Z/hBugOq1atiksv\nvTQGDx4cERGDBg2Kjz/+OC699NL44YcfYteuXXHCCSfsvb+o4i3GiscjpVKp4kIHe72aAQD+rQ7U\nz5tvvjluueWW+Oijj+Lhhx+OHTt2HNKaFaN99tlnx4svvhg7d+6MLVu2xPLly6Nfv34xaNCgvecu\nTz31VDQ0NMTAgQNjwIAB0dzcHBERzz777CENA/BvMnXq1Hj++edjw4YNERGxYcOG2Lx5cxx77LER\nEfH444/vvXfAgAGxZcuWimtWPB4pl8tx2WWXxbhx42LIkCFx+umnR6lUiieeeCJuuOGG2L59e5x4\n4onx2GOPRUTEo48+Gtdff
3306NEjpkyZEgMHDmzP3wqQXk1NTdxxxx0xZcqU6NmzZ5TL5WhsbIzZ\ns2fHoEGDYurUqfH1119HRMSMGTPikksuiWXLlsVDDz0UZ5111gHXPOTPaVeybdu26NevX0RE3Hvv\nvfHjjz/GkiVLDucjAP6zKu60D9XLL78c99xzT+zevTtGjBixz/YfgI457DttADqP/yMSIBHRBkhE\ntAESEW2AREQbIBHRBkjkf6yy1KK4Rv2gAAAAAElFTkSuQmCC\n", "text": [ "" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "counts = frequencies(take(10000, snoot))\n", "histdict(counts)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "display_data", "png": "iVBORw0KGgoAAAANSUhEUgAAAXwAAAD9CAYAAAC/fMwDAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAFU1JREFUeJzt3HFMlPfhx/HP0921zaJYl8hh7rE9F87hIQ6oOc2yZljE\nUvYrwbmy0a2AtVkCMdVpsnRZ1uD+GGzLsrV191vT8geziWC6COaX9MIWxa0mOzcmMe2ZePkFU+44\nyCw/W+zo6PR+f2BPmQIHwt3p9/1KLoHn7nvP9/Gr73t8fNBKJBIJAQDuefdlegIAgPQg+ABgCIIP\nAIYg+ABgCIIPAIYg+ABgiFmD/8knn2jz5s0qLi6Wz+fTD3/4Q0nS2NiYKioqtG7dOm3fvl2XL19O\njmltbZXX61VBQYF6e3uT2/v7+1VUVCSv16u9e/cu0eEAAGYya/AffPBBnTx5UgMDAzp37pxOnjyp\nd955R21tbaqoqNCFCxdUXl6utrY2SVI4HFZXV5fC4bCCwaCam5v12W3+TU1Nam9vVyQSUSQSUTAY\nXPqjAwAkzXlJ5/Of/7wkaXJyUlevXtXKlSt1/PhxNTQ0SJIaGhrU3d0tSerp6VFdXZ2cTqc8Ho/y\n8/MVCoUUj8c1Pj4uv98vSaqvr0+OAQCkx5zBv3btmoqLi+VyubR161YVFhZqdHRULpdLkuRyuTQ6\nOipJGh4elm3bybG2bSsWi92y3e12KxaLLfaxAABm4ZjrBffdd58GBgb04Ycf6oknntDJkyenPW9Z\nlizLWpTJLNb7AIBpUvlfcuYM/mdWrFihr3/96+rv75fL5dLIyIjy8vIUj8eVm5sraerMfWhoKDkm\nGo3Ktm253W5Fo9Fp291u94InjezU0tKilpaWTE8DC8Da3d1SPVme9ZLOpUuXknfgTExM6A9/+INK\nSkpUXV2tjo4OSVJHR4dqamokSdXV1ers7NTk5KQGBwcViUTk9/uVl5ennJwchUIhJRIJHT58ODkG\nAJAes57hx+NxNTQ06Nq1a7p27ZqeffZZlZeXq6SkRLW1tWpvb5fH49HRo0clST6fT7W1tfL5fHI4\nHAoEAslPnkAgoMbGRk1MTKiqqkqVlZVLf3QAgCQrm/57ZMuyuKRzF+vr61NZWVmmp4EFYO3ubqm2\nk+ADwF0u1XbyXysAgCEIPgAYguADgCEIPgAYguADgCEIPgAYguADgCEIPgAYguADgCEIPgAYguAD\ngCEIPgAYguADgCEIPgAYguADgCEIPgAYguADgCEIPgAYguADgCEIPgAYguADgCEIPgAYguADgCEI\nPgAYguADgCEIPgAYguADgCFmDf7Q0JC2bt2qwsJCbdiwQa+88ookqaWlRbZtq6SkRCUlJXr77beT\nY1pbW+X1elVQUKDe3t7k9v7+fhUVFcnr9Wrv3r1LdDgAgJlYiUQiMdOTIy
MjGhkZUXFxsa5cuaJH\nH31U3d3dOnr0qJYvX679+/dPe304HNYzzzyjv/71r4rFYtq2bZsikYgsy5Lf79ehQ4fk9/tVVVWl\nF154QZWVldMnY1maZTrIgJycL2h8/P8yOofly1fqo4/GMjoHIJul2s5Zz/Dz8vJUXFwsSVq2bJnW\nr1+vWCwmSbd9856eHtXV1cnpdMrj8Sg/P1+hUEjxeFzj4+Py+/2SpPr6enV3d8/7oJB+U7FPZPSR\n6Q8c4F6R8jX8ixcv6uzZs9qyZYsk6dVXX9WXv/xl7d69W5cvX5YkDQ8Py7bt5BjbthWLxW7Z7na7\nkx8cAFKXk/MFWZaV8UdOzhcy/UuBBXCk8qIrV67om9/8pl5++WUtW7ZMTU1NeumllyRJP/7xj3Xg\nwAG1t7cvyoS+9rWvJb9+5JFH9MgjnkV531Q5HA4dOLBfy5YtS+t+gVTc+BtXpudhZXoKRuvr61Nf\nX9+8x80Z/E8//VQ7d+7Ud7/7XdXU1EiScnNzk88///zzeuqppyRNnbkPDQ0ln4tGo7JtW263W9Fo\ndNp2t9t92/396U/l8z6IxfTAA/+tp576L5WWlmZ0HgAwk7KyMpWVlSW/P3jwYErjZg1+IpHQ7t27\n5fP5tG/fvuT2eDyu1atXS5KOHTumoqIiSVJ1dbWeeeYZ7d+/X7FYTJFIRH6///pfAXMUCoXk9/t1\n+PBhvfDCCzPs9aWUJr5UHnyQf1sAkLpsuLEhVbMG//Tp03rzzTe1ceNGlZSUSJJ++tOf6siRIxoY\nGJBlWVq7dq1ee+01SZLP51Ntba18Pp8cDocCgYAsa+qvfoFAQI2NjZqYmFBVVdUtd+gAwN0oOy6z\npXaJbdbbMtNt6sMhs9NZsaJUJ068wSWd67JhTSRu1/1MdqyHxJrckB1rsgi3ZQIA7h0EHwAMQfAB\nwBAEHwAMQfABwBAEHwAMQfABwBAEHwAMQfABwBAEHwAMQfABwBAEHwAMQfABwBAEHwAMQfABwBAE\nHwAMQfABwBAEHwAMQfABwBAEHwAMQfABwBAEHwAMQfABwBAEHwAMQfABwBAEHwAMQfABwBAEHwAM\nQfABwBCzBn9oaEhbt25VYWGhNmzYoFdeeUWSNDY2poqKCq1bt07bt2/X5cuXk2NaW1vl9XpVUFCg\n3t7e5Pb+/n4VFRXJ6/Vq7969S3Q4AICZzBp8p9OpX/3qV3rvvff0l7/8Rb/5zW90/vx5tbW1qaKi\nQhcuXFB5ebna2tokSeFwWF1dXQqHwwoGg2publYikZAkNTU1qb29XZFIRJFIRMFgcOmPDgCQNGvw\n8/LyVFxcLElatmyZ1q9fr1gspuPHj6uhoUGS1NDQoO7ubklST0+P6urq5HQ65fF4lJ+fr1AopHg8\nrvHxcfn9fklSfX19cgwAID0cqb7w4sWLOnv2rDZv3qzR0VG5XC5Jksvl0ujoqCRpeHhYW7ZsSY6x\nbVuxWExOp1O2bSe3u91uxWKxGfbUctPXZdcfAIAb+q4/5iel4F+5ckU7d+7Uyy+/rOXLl097zrIs\nWZY17x3PrGUR3wsA7kVlmn4yfDClUXPepfPpp59q586devbZZ1VTUyNp6qx+ZGREkhSPx5Wbmytp\n6sx9aGgoOTYajcq2bbndbkWj0Wnb3W53ShMEACyOWYOfSCS0e/du+Xw+7du3L7m9urpaHR0dkqSO\njo7kB0F1dbU6Ozs1OTmpwcFBRSIR+f1+5eXlKScnR6FQSIlEQocPH06OAQCkx6yXdE6fPq0333xT\nGzduVElJiaSp2y5ffPFF1dbWqr29XR6PR0ePHpUk+Xw+1dbWyufzyeFwKBAIJC/3BAIBNTY2amJi\nQlVVVaqsrFziQwMA3MxKfHbfZBaY+nDI7HRWrCjViRNvqLS0NKPzyBbZsCaSpSz6bZpR2bEeEmty\nQ3asSWrrwU/aAoAhCD4AGILgA4AhCD
4AGILgA4AhCD4AGILgA4AhCD4AGILgA4AhCD4AGILgA4Ah\nCD4AGILgA4AhCD4AGILgA4AhCD4AGILgA4AhCD4AGILgA4AhCD4AGILgA4AhCD4AGILgA4AhCD4A\nGILgA4AhCD4AGILgA4AhCD4AGGLO4D/33HNyuVwqKipKbmtpaZFt2yopKVFJSYnefvvt5HOtra3y\ner0qKChQb29vcnt/f7+Kiork9Xq1d+/eRT4MAMBc5gz+rl27FAwGp22zLEv79+/X2bNndfbsWT35\n5JOSpHA4rK6uLoXDYQWDQTU3NyuRSEiSmpqa1N7erkgkokgkcst7AgCW1pzBf+yxx7Ry5cpbtn8W\n8pv19PSorq5OTqdTHo9H+fn5CoVCisfjGh8fl9/vlyTV19eru7t7EaYPAEiVY6EDX331Vf3ud7/T\npk2b9Mtf/lIPPfSQhoeHtWXLluRrbNtWLBaT0+mUbdvJ7W63W7FYbIZ3brnp67LrDwDADX3XH/Oz\noH+0bWpq0uDgoAYGBrR69WodOHBgIW8zg5abHmWL+L4AcK8o0/RWpmZBwc/NzZVlWbIsS88//7zO\nnDkjaerMfWhoKPm6aDQq27bldrsVjUanbXe73QvZNQBggRYU/Hg8nvz62LFjyTt4qqur1dnZqcnJ\nSQ0ODioSicjv9ysvL085OTkKhUJKJBI6fPiwampqFucIAAApmfMafl1dnU6dOqVLly5pzZo1Onjw\noPr6+jQwMCDLsrR27Vq99tprkiSfz6fa2lr5fD45HA4FAgFZliVJCgQCamxs1MTEhKqqqlRZWbm0\nRwYAmMZK3O52mwyZ+nDI7HRWrCjViRNvqLS0NKPzyBbZsCaSddu7wkyUHeshsSY3ZMeapLYe/KQt\nABiC4AOAIQg+ABiC4AOAIQg+ABiC4AOAIQg+ABiC4AOAIQg+ABiC4AOAIQg+ABiC4AOAIQg+ABiC\n4AOAIQg+ABiC4AOAIQg+ABiC4AOAIQg+ABiC4AOAIQg+ABiC4AOAIQg+ABiC4AOAIQg+ABiC4AOA\nIQg+ABiC4AOAIeYM/nPPPSeXy6WioqLktrGxMVVUVGjdunXavn27Ll++nHyutbVVXq9XBQUF6u3t\nTW7v7+9XUVGRvF6v9u7du8iHAQCYy5zB37Vrl4LB4LRtbW1tqqio0IULF1ReXq62tjZJUjgcVldX\nl8LhsILBoJqbm5VIJCRJTU1Nam9vVyQSUSQSueU9AQBLa87gP/bYY1q5cuW0bcePH1dDQ4MkqaGh\nQd3d3ZKknp4e1dXVyel0yuPxKD8/X6FQSPF4XOPj4/L7/ZKk+vr65BgAQHo4FjJodHRULpdLkuRy\nuTQ6OipJGh4e1pYtW5Kvs21bsVhMTqdTtm0nt7vdbsVisRneveWmr8uuPwAAN/Rdf8zPgoJ/M8uy\nZFnWnb7NTVoW8b0A4F5UpuknwwdTGrWgu3RcLpdGRkYkSfF4XLm5uZKmztyHhoaSr4tGo7JtW263\nW9FodNp2t9u9kF0DABZoQcGvrq5WR0eHJKmjo0M1NTXJ7Z2dnZqcnNTg4KAikYj8fr/y8vKUk5Oj\nUCikRCKhw4cPJ8cAANJjzks6dXV1OnXqlC5duqQ1a9boJz/5iV588UXV1taqvb1dHo9HR48elST5\nfD7V1tbK5/PJ4XAoEAgkL/cEAgE1NjZqYmJCVVVVqqysXNojAwBMYyU+u28yC0x9OGR2OitWlOrE\niTdUWlqa0Xlki2xYE8lSFv02zajsWA+JNbkhO9YktfXgJ20BwBAEHwAMQfABwBAEHwAMQfABwBAE\nHwAMQfABwBAEHwAMQfABwBAEHwAMQfABwBAEHwAMQfABwBAEHwAMQfABwBAEHwAMQfABwBAEHwAM\nQfABwBAEHwAMQfABwBAEHwAMQfABwBAEHwAMQfABwBAEHwAMQfABwBAEHwAMcUfB93g82rhxo0pK\nSu
T3+yVJY2Njqqio0Lp167R9+3Zdvnw5+frW1lZ5vV4VFBSot7f3zmYOAJiXOwq+ZVnq6+vT2bNn\ndebMGUlSW1ubKioqdOHCBZWXl6utrU2SFA6H1dXVpXA4rGAwqObmZl27du3OjwAAkJI7vqSTSCSm\nfX/8+HE1NDRIkhoaGtTd3S1J6unpUV1dnZxOpzwej/Lz85MfEgCApXfHZ/jbtm3Tpk2b9Prrr0uS\nRkdH5XK5JEkul0ujo6OSpOHhYdm2nRxr27Zisdid7B4AMA+OOxl8+vRprV69Wv/4xz9UUVGhgoKC\nac9bliXLsmYcf/vnWm76uuz6AwBwQ9/1x/zcUfBXr14tSVq1apV27NihM2fOyOVyaWRkRHl5eYrH\n48rNzZUkud1uDQ0NJcdGo1G53e7bvGvLnUwJAAxQpuknwwdTGrXgSzr//Oc/NT4+Lkn6+OOP1dvb\nq6KiIlVXV6ujo0OS1NHRoZqaGklSdXW1Ojs7NTk5qcHBQUUikeSdPQCApbfgM/zR0VHt2LFDkvTv\nf/9b3/nOd7R9+3Zt2rRJtbW1am9vl8fj0dGjRyVJPp9PtbW18vl8cjgcCgQCs17uAQAsLivxn7fZ\nZNDUB0Bmp7NiRalOnHhDpaWlGZ1HtsiGNZGsW+4GM1V2rIfEmtyQHWuS2nrwk7YAYAiCDwCGIPgA\nYAiCDwCGIPgAYAiCDwCGIPgAYAiCDwCGIPgAYAiCDwCGIPgAYAiCDwCGIPgAYAiCDwCGIPgAYAiC\nDwCGIPgAYAiCDwCGIPgAYAiCDwCGIPgAYAiCDwCGIPgAYAiCDwCGIPgAYAiCDwCGIPgAYAiCDwCG\nSGvwg8GgCgoK5PV69bOf/Sydu0Za9GV6AliwvkxPAGmQtuBfvXpVe/bsUTAYVDgc1pEjR3T+/Pl0\n7R5p0ZfpCWDB+jI9AaRB2oJ/5swZ5efny+PxyOl06tvf/rZ6enrStXsAMJ4jXTuKxWJas2ZN8nvb\nthUKhW553YoVT6VrSrc1MfG/uu8+/mkDwL0nbcG3LCul13344f8s8UzmVlJSkukpZJnU1m7KwaWZ\nQYq/f8ywVL8W81s71uRmd8evRdqC73a7NTQ0lPx+aGhItm1Pe00ikUjXdADAOGm7drFp0yZFIhFd\nvHhRk5OT6urqUnV1dbp2DwDGS9sZvsPh0KFDh/TEE0/o6tWr2r17t9avX5+u3QOA8axEllxH6e7u\n1je+8Q2dP39eX/rSlzI9HczDyMiI9u3bp7/97W966KGH5HK59Otf/1perzfTU8MsPvjgA23btk3S\n1Bp+7nOf06pVq2RZlkKhkJxOZ4ZniLmMjo7q+9//vkKhkFauXKn7779fP/jBD1RTU3Pb12dN8L/1\nrW9pYmJCpaWlamlpyfR0kKJEIqGvfOUr2rVrl773ve9Jks6dO6ePPvpIX/3qVzM8O6Tq4MGDWr58\nufbv35/pqSBFt/uz9/777+v48ePas2fPbcdkxf2HV65cUSgU0qFDh9TV1ZXp6WAeTp48qfvvvz/5\nG06SNm7cSOzvQlly7ocUnThxQg888MC0P3sPP/zwjLGXsiT4PT09qqys1MMPP6xVq1bp73//e6an\nhBS9++67evTRRzM9DcA47733nkpLS+c1JiuCf+TIET399NOSpKefflpHjhzJ8IyQKu7FBjLjP//s\n7dmzR8XFxfL7/TOOSdtdOjMZGxvTyZMn9e6778qyLF29elWWZekXv/hFpqeGFBQWFuqtt97K9DQA\n4xQWFur3v/998vtDhw7pgw8+0KZNm2Yck/Ez/Lfeekv19fW6ePGiBgcH9f7772vt2rX685//nOmp\nIQWPP/64/vWvf+n1119Pbjt37pzeeeedDM4KuPc9/vjj+uSTT/Tb3/42ue3jjz+edUzGg9/Z2akd\nO3ZM27Zz5051dnZmaEaYr2PHjumPf/yj8vPztWHDBv3oRz/S6tWr
Mz0tzBOX5+4+3d3dOnXqlL74\nxS9q8+bNamxs1M9//vMZX581t2UCAJZWxs/wAQDpQfABwBAEHwAMQfABwBAEHwAMQfABwBD/D50+\nJB6ICAkYAAAAAElFTkSuQmCC\n", "text": [ "" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "counts = pipe(file_pattern, glob, map(open), map(drop(1)), \n", " concat, map(str.upper), map(str.strip), \n", " concat, frequencies)\n", "histdict(counts)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "display_data", "png": "iVBORw0KGgoAAAANSUhEUgAAAY8AAAD9CAYAAABEB/uZAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XFMm/edP/D3w8FOmlpwQI1JbSYGfhxKoEAu2Oh0kyDM\npulUoEcwRHeBZlT3C1NzaS/qctOWlUzKJdkU6bo2yZ0mUABpNQunApoKgcvBXbqTHCVrVS3ZT0UX\nN9jG8LsZm9JewBA+vz8IzyUjCTykiQN5v6RHMt/n+X79eZ6vzZvHfmwUEREQERHpEBfrAoiIaO1h\neBARkW4MDyIi0o3hQUREujE8iIhIN4YHERHptqLwuHnzJgoKCvDSSy8BACYmJuBwOGC1WuF0OhGJ\nRLRtjx49ClVVkZWVhf7+fq398uXLyM3Nhaqq2L9/v9Y+MzODmpoaqKqKoqIiXL9+XVvX2toKq9UK\nq9WKtrY2rd3r9cJut0NVVdTW1mJ2dnb1R4CIiHRbUXi8/fbbyM7OhqIoAIBjx47B4XDg008/RWlp\nKY4dOwYAuHr1Kjo6OnD16lX09fXhe9/7HhY/RtLY2Ijm5mYMDw9jeHgYfX19AIDm5makpKRgeHgY\nb7zxBg4ePAhgIaB+8pOf4OLFi7h48SIOHz6MyclJAMDBgwdx4MABDA8PY8OGDWhubv5qjwoREd3X\nsuHh9/vxwQcf4NVXX9WCoKenB/X19QCA+vp6dHV1AQC6u7uxa9cuJCQkID09HRaLBR6PB8FgEFNT\nU7DZbACAuro6rc/tY1VVVeH8+fMAgHPnzsHpdMJgMMBgMMDhcKC3txcigsHBQezcuXPJ/RMR0aOx\nbHi88cYb+NnPfoa4uP/ddHx8HEajEQBgNBoxPj4OABgdHYXZbNa2M5vNCAQCS9pNJhMCgQAAIBAI\nIC0tDQAQHx+PpKQkhEKhe441MTEBg8Gg1XP7WERE9GjE32/lr3/9a2zcuBEFBQUYGhq66zaKomgv\nZz1seu/nUdVFRLSerORbq+4bHv/5n/+Jnp4efPDBB5iensbnn3+O3bt3w2g0YmxsDKmpqQgGg9i4\ncSOAhbMAn8+n9ff7/TCbzTCZTPD7/UvaF/uMjIzg2WefxdzcHCYnJ5GSkgKTyXRHYPl8Pmzfvh3J\nycmIRCKYn59HXFwc/H4/TCbTAx0Eevw0NTWhqakp1mXQKnH+1q6V/tF935et/uEf/gE+nw9erxdu\ntxvbt29He3s7ysvL0draCmDhiqjKykoAQHl5OdxuN6LRKLxeL4aHh2Gz2ZCamorExER4PB6ICNrb\n21FRUaH1WRyrs7MTpaWlAACn04n+/n5EIhGEw2EMDAygrKwMiqKgpKQEZ8+eXXL/RET0aNz3zOOP\nLSbS3//938PlcqG5uRnp6en41a9+BQDIzs6Gy+VCdnY24uPjcerUKa3PqVOn8Morr+DGjRt48cUX\n8cILLwAAGhoasHv3bqiq
ipSUFLjdbgBAcnIyDh06hMLCQgDAW2+9BYPBAAA4fvw4amtr8aMf/Qhb\nt25FQ0PDV3AoiIhopZT1/JXsiqLwZas1amhoCMXFxbEug1aJ87d2rfT3JsODiIg0K/29ya8nISIi\n3RgeRESkG8ODiIh0Y3gQEZFuDA8iItKN4UFERLrp+pDgWmQ0Zsb0/v/2b/8PfvjD78e0BiKir9q6\nD4//9//6l9/ooTmLTz75vzG8f6L7S0xMxtRUOKY1PP30Bnz++URMayD91n14ALE889gI4NMY3j/R\n/S0ER2w/SDs1xW+/Xov4ngcREenG8CAiIt0YHkREpBvDg4iIdGN4EBGRbgwPIiLSjeFBRES6MTyI\niEg3hgcREel23/CYnp6G3W5Hfn4+srOz8YMf/AAA0NTUBLPZjIKCAhQUFKC3t1frc/ToUaiqiqys\nLPT3/+9Xg1y+fBm5ublQVRX79+/X2mdmZlBTUwNVVVFUVITr169r61pbW2G1WmG1WtHW1qa1e71e\n2O12qKqK2tpazM7OPviRICKilZNlfPnllyIiMjs7K3a7XS5cuCBNTU1y4sSJJdteuXJF8vLyJBqN\nitfrlczMTJmfnxcRkcLCQvF4PCIismPHDunt7RURkZMnT0pjY6OIiLjdbqmpqRERkVAoJBkZGRIO\nhyUcDktGRoZEIhEREamurpaOjg4REdm7d6+cPn36rrUDEEBiuDSLy7VnuUNMFDOxf44s1ECPj5XO\nx7IvW339618HAESjUdy8eRMbNmxYDJ0l23Z3d2PXrl1ISEhAeno6LBYLPB4PgsEgpqamYLPZAAB1\ndXXo6uoCAPT09KC+vh4AUFVVhfPnzwMAzp07B6fTCYPBAIPBAIfDgd7eXogIBgcHsXPnTgBAfX29\nNhYRET0ay4bH/Pw88vPzYTQaUVJSgi1btgAA3nnnHeTl5aGhoQGRSAQAMDo6CrPZrPU1m80IBAJL\n2k0mEwKBAAAgEAggLS0NABAfH4+kpCSEQqF7jjUxMQGDwYC4uLglYxER0aOx7LfqxsXF4eOPP8bk\n5CTKysowNDSExsZG/PjHPwYAHDp0CAcOHEBzc/NDL1ZRVvPtm0233S6+tRAREQAMDQ1haGhId78V\nfyV7UlISvvOd7+DSpUsoLi7W2l999VW89NJLABbOAnw+n7bO7/fDbDbDZDLB7/cvaV/sMzIygmef\nfRZzc3OYnJxESkoKTCbTHTvk8/mwfft2JCcnIxKJYH5+HnFxcfD7/TCZTPepvGmlu0hE9MQpLi6+\n43f64cOHV9Tvvi9b/eEPf9Bekrpx4wYGBgZQUFCAsbExbZv3338fubm5AIDy8nK43W5Eo1F4vV4M\nDw/DZrMhNTUViYmJ8Hg8EBG0t7ejoqJC69Pa2goA6OzsRGlpKQDA6XSiv78fkUgE4XAYAwMDKCsr\ng6IoKCkpwdmzZwEsXJFVWVm5op0lIqKvxn3PPILBIOrr6zE/P4/5+Xns3r0bpaWlqKurw8cffwxF\nUfDNb34T//zP/wwAyM7OhsvlQnZ2NuLj43Hq1CntpaZTp07hlVdewY0bN/Diiy/ihRdeAAA0NDRg\n9+7dUFUVKSkpcLvdAIDk5GQcOnQIhYWFAIC33noLBoMBAHD8+HHU1tbiRz/6EbZu3YqGhoaHc3SI\niOiuFLnbZVPrxEJwxXL3WuByfYiOjpYY1kB0b7F/jgCActerNyk2FGVl88FPmBMRkW4MDyIi0o3h\nQUREujE8iIhIN4YHERHpxvAgIiLdGB5ERKQbw4OIiHRjeBARkW4MDyIi0o3hQUREujE8iIhIN4YH\nERHpxvAgIiLdGB5ERKQbw4OIiHRjeBARkW4MDyIi0o3hQUREut03PKanp2G325Gfn4/s7Gz84Ac/\nAABMTEzA4XDAarXC6XQiEolofY4ePQpVVZGVlYX+/n6t/fLly8jNzYWqqti/f7/WPjMzg5
qaGqiq\niqKiIly/fl1b19raCqvVCqvVira2Nq3d6/XCbrdDVVXU1tZidnb2wY8EERGtnCzjyy+/FBGR2dlZ\nsdvtcuHCBXnzzTfl+PHjIiJy7NgxOXjwoIiIXLlyRfLy8iQajYrX65XMzEyZn58XEZHCwkLxeDwi\nIrJjxw7p7e0VEZGTJ09KY2OjiIi43W6pqakREZFQKCQZGRkSDoclHA5LRkaGRCIRERGprq6Wjo4O\nERHZu3evnD59+q61AxBAYrg0i8u1Z7lDTBQzsX+OLNRAj4+VzseyL1t9/etfBwBEo1HcvHkTGzZs\nQE9PD+rr6wEA9fX16OrqAgB0d3dj165dSEhIQHp6OiwWCzweD4LBIKampmCz2QAAdXV1Wp/bx6qq\nqsL58+cBAOfOnYPT6YTBYIDBYIDD4UBvby9EBIODg9i5c+eS+yciokcjfrkN5ufnsXXrVvzXf/0X\nGhsbsWXLFoyPj8NoNAIAjEYjxsfHAQCjo6MoKirS+prNZgQCASQkJMBsNmvtJpMJgUAAABAIBJCW\nlrZQTHw8kpKSEAqFMDo6ekefxbEmJiZgMBgQFxe3ZKy7a7rtdvGthYiIAGBoaAhDQ0O6+y0bHnFx\ncfj4448xOTmJsrIyDA4O3rFeURQoiqL7jldjdffT9FWXQUS0bhQXF6O4uFj7+fDhwyvqt+KrrZKS\nkvCd73wHly9fhtFoxNjYGAAgGAxi48aNABbOAnw+n9bH7/fDbDbDZDLB7/cvaV/sMzIyAgCYm5vD\n5OQkUlJSlozl8/lgMpmQnJyMSCSC+fl5bSyTybTS3SAioq/AfcPjD3/4g3Yl1Y0bNzAwMICCggKU\nl5ejtbUVwMIVUZWVlQCA8vJyuN1uRKNReL1eDA8Pw2azITU1FYmJifB4PBARtLe3o6KiQuuzOFZn\nZydKS0sBAE6nE/39/YhEIgiHwxgYGEBZWRkURUFJSQnOnj275P6JiOgRud+76Z988okUFBRIXl6e\n5Obmyk9/+lMRWbgSqrS0VFRVFYfDIeFwWOtz5MgRyczMlM2bN0tfX5/WfunSJcnJyZHMzEzZt2+f\n1j49PS3V1dVisVjEbreL1+vV1rW0tIjFYhGLxSJnzpzR2q9duyY2m00sFou4XC6JRqP3vGqAV1sR\n3VvsnyO82upxs9L5UG5tvC4tvEcSy91rgcv1ITo6WmJYA9G9xf45AgAK1vGvoTVHUVY2H/yEORER\n6cbwICIi3RgeRESkG8ODiIh0Y3gQEZFuDA8iItKN4UFERLoxPIiISDeGBxER6cbwICIi3RgeRESk\nG8ODiIh0Y3gQEZFuDA8iItKN4UFERLoxPIiISDeGBxER6cbwICIi3RgeRESk27Lh4fP5UFJSgi1b\ntiAnJwc///nPAQBNTU0wm80oKChAQUEBent7tT5Hjx6FqqrIyspCf3+/1n758mXk5uZCVVXs379f\na5+ZmUFNTQ1UVUVRURGuX7+urWttbYXVaoXVakVbW5vW7vV6YbfboaoqamtrMTs7+2BHgoiIVk6W\nEQwG5aOPPhIRkampKbFarXL16lVpamqSEydOLNn+ypUrkpeXJ9FoVLxer2RmZsr8/LyIiBQWForH\n4xERkR07dkhvb6+IiJw8eVIaGxtFRMTtdktNTY2IiIRCIcnIyJBwOCzhcFgyMjIkEomIiEh1dbV0\ndHSIiMjevXvl9OnTS2oBIIDEcGkWl2vPcoeYKGZi/xxZqIEeHyudj2XPPFJTU5Gfnw8AeOqpp/Dc\nc88hEAgsBs+S7bu7u7Fr1y4kJCQgPT0dFosFHo8HwWAQU1NTsNlsAIC6ujp0dXUBAHp6elBfXw8A\nqKqqwvnz5wEA586dg9PphMFggMFggMPhQG9vL0QEg4OD2LlzJwCgvr5eG4uIiB6+eD0bf/bZZ/jo\no49QVFSE3/zmN3jnnXfQ1taGbdu24cSJEzAYDBgdHU
VRUZHWx2w2IxAIICEhAWazWWs3mUxaCAUC\nAaSlpS0UFB+PpKQkhEIhjI6O3tFncayJiQkYDAbExcUtGWuppttuF99aiIgIAIaGhjA0NKS734rD\n44svvsDOnTvx9ttv46mnnkJjYyN+/OMfAwAOHTqEAwcOoLm5WXcBeimKorNH08Mog4hoXSguLkZx\ncbH28+HDh1fUb0VXW83OzqKqqgp//dd/jcrKSgDAxo0boSgKFEXBq6++iosXLwJYOAvw+XxaX7/f\nD7PZDJPJBL/fv6R9sc/IyAgAYG5uDpOTk0hJSVkyls/ng8lkQnJyMiKRCObn57WxTCbTinaYiIge\n3LLhISJoaGhAdnY2Xn/9da09GAxqt99//33k5uYCAMrLy+F2uxGNRuH1ejE8PAybzYbU1FQkJibC\n4/FARNDe3o6KigqtT2trKwCgs7MTpaWlAACn04n+/n5EIhGEw2EMDAygrKwMiqKgpKQEZ8+eBbBw\nRdZiqBER0SOw3DvqFy5cEEVRJC8vT/Lz8yU/P18++OAD2b17t+Tm5srzzz8vFRUVMjY2pvU5cuSI\nZGZmyubNm6Wvr09rv3TpkuTk5EhmZqbs27dPa5+enpbq6mqxWCxit9vF6/Vq61paWsRisYjFYpEz\nZ85o7deuXRObzSYWi0VcLpdEo9G7XjXAq62I7i32zxFebfW4Wel8KLc2XpcW3h+J5e61wOX6EB0d\nLTGsgejeYv8cAQDlrlduUmwoysrmg58wJyIi3RgeRESkG8ODiIh0Y3gQEZFuuj5hTvQgEhOTMTUV\njnUZePrpDfj884lYl0G0pjE86JFZCI7YX1UzNaX3WwqIHo3H5Q+slWB4EBE9Jh6PP7BW9scV3/Mg\nIiLdGB5ERKQbw4OIiHRjeBARkW4MDyIi0o3hQUREujE8iIhIN4YHERHpxvAgIiLdGB5ERKQbw4OI\niHRjeBARkW7LhofP50NJSQm2bNmCnJwc/PznPwcATExMwOFwwGq1wul0IhKJaH2OHj0KVVWRlZWF\n/v5+rf3y5cvIzc2FqqrYv3+/1j4zM4OamhqoqoqioiJcv35dW9fa2gqr1Qqr1Yq2tjat3ev1wm63\nQ1VV1NbWYnZ29sGOBBERrZwsIxgMykcffSQiIlNTU2K1WuXq1avy5ptvyvHjx0VE5NixY3Lw4EER\nEbly5Yrk5eVJNBoVr9crmZmZMj8/LyIihYWF4vF4RERkx44d0tvbKyIiJ0+elMbGRhERcbvdUlNT\nIyIioVBIMjIyJBwOSzgcloyMDIlEIiIiUl1dLR0dHSIisnfvXjl9+vSS2gEIIDFcmsXl2rPcIX5i\nxH4+FpdlH/ZPjMdjTjgfi9bSfCx75pGamor8/HwAwFNPPYXnnnsOgUAAPT09qK+vBwDU19ejq6sL\nANDd3Y1du3YhISEB6enpsFgs8Hg8CAaDmJqags1mAwDU1dVpfW4fq6qqCufPnwcAnDt3Dk6nEwaD\nAQaDAQ6HA729vRARDA4OYufOnUvun4iIHj5d/8/js88+w0cffQS73Y7x8XEYjUYAgNFoxPj4OABg\ndHQURUVFWh+z2YxAIICEhASYzWat3WQyIRAIAAACgQDS0tIWCoqPR1JSEkKhEEZHR+/oszjWxMQE\nDAYD4uLiloy1VNNtt4tvLUREtGDo1qLPisPjiy++QFVVFd5++208/fTTd6xTFAWK8mj+O5v++2l6\nGGUQEa0Txbjzj+rDK+q1oqutZmdnUVVVhd27d6OyshLAwtnG2NgYACAYDGLjxo0AFs4CfD6f1tfv\n98NsNsNkMsHv9y9pX+wzMjICAJibm8Pk5CRSUlKWjOXz+WAymZCcnIxIJIL5+XltLJPJtKIdJiKi\nB7dseIgIGhoakJ2djddff11rLy8vR2trK4CFK6IWQ6W8vBxutxvRaBRerxfDw8Ow2WxITU1FYmIi\nPB4PRATt7e2oqK
hYMlZnZydKS0sBAE6nE/39/YhEIgiHwxgYGEBZWRkURUFJSQnOnj275P6JiOgR\nWO4d9QsXLoiiKJKXlyf5+fmSn58vvb29EgqFpLS0VFRVFYfDIeFwWOtz5MgRyczMlM2bN0tfX5/W\nfunSJcnJyZHMzEzZt2+f1j49PS3V1dVisVjEbreL1+vV1rW0tIjFYhGLxSJnzpzR2q9duyY2m00s\nFou4XC6JRqOP4ZULvNrqdrGfD17d88cejznhfCxaS/Oh3Cp4XVp4fySWu9cCl+tDdHS0xLCGx0fs\n52ORgnX8sNfl8ZgTzseitTQf/IQ5ERHpxvAgIiLdGB5ERKQbw4OIiHRjeBARkW4MDyIi0o3hQURE\nujE8iIhIN4YHERHpxvAgIiLdGB5ERKQbw4OIiHRjeBARkW4MDyIi0o3hQUREujE8iIhIN4YHERHp\nxvAgIiLdGB5ERKTbsuHx3e9+F0ajEbm5uVpbU1MTzGYzCgoKUFBQgN7eXm3d0aNHoaoqsrKy0N/f\nr7VfvnwZubm5UFUV+/fv19pnZmZQU1MDVVVRVFSE69eva+taW1thtVphtVrR1tamtXu9Xtjtdqiq\nitraWszOzq7+CBARkX6yjP/4j/+Q3/72t5KTk6O1NTU1yYkTJ5Zse+XKFcnLy5NoNCper1cyMzNl\nfn5eREQKCwvF4/GIiMiOHTukt7dXREROnjwpjY2NIiLidrulpqZGRERCoZBkZGRIOByWcDgsGRkZ\nEolERESkurpaOjo6RERk7969cvr06bvWDkAAieHSLC7XnuUO8RMj9vOxuCz7sH9iPB5zwvlYtJbm\nY9kzj29961vYsGHD3UJnSVt3dzd27dqFhIQEpKenw2KxwOPxIBgMYmpqCjabDQBQV1eHrq4uAEBP\nTw/q6+sBAFVVVTh//jwA4Ny5c3A6nTAYDDAYDHA4HOjt7YWIYHBwEDt37gQA1NfXa2MREdGjEb/a\nju+88w7a2tqwbds2nDhxAgaDAaOjoygqKtK2MZvNCAQCSEhIgNls1tpNJhMCgQAAIBAIIC0tbaGY\n+HgkJSUhFAphdHT0jj6LY01MTMBgMCAuLm7JWHfXdNvt4lsLEREtGLq16LOqN8wbGxvh9Xrx8ccf\nY9OmTThw4MBqhtFNUZRV9Gq6bSn+6oohIloXinHn78mVWVV4bNy4EYqiQFEUvPrqq7h48SKAhbMA\nn8+nbef3+2E2m2EymeD3+5e0L/YZGRkBAMzNzWFychIpKSlLxvL5fDCZTEhOTkYkEsH8/Lw2lslk\nWs1uEBHRKq0qPILBoHb7/fff167EKi8vh9vtRjQahdfrxfDwMGw2G1JTU5GYmAiPxwMRQXt7Oyoq\nKrQ+ra2tAIDOzk6UlpYCAJxOJ/r7+xGJRBAOhzEwMICysjIoioKSkhKcPXsWwMIVWZWVlas/AkRE\npN9y76jX1tbKpk2bJCEhQcxmszQ3N8vu3bslNzdXnn/+eamoqJCxsTFt+yNHjkhmZqZs3rxZ+vr6\ntPZLly5JTk6OZGZmyr59+7T26elpqa6uFovFIna7Xbxer7aupaVFLBaLWCwWOXPmjNZ+7do1sdls\nYrFYxOVySTQafUyvXODVVreL/Xzw6p4/9njMCedj0VqaD+VWwevSwnsksdy9FrhcH6KjoyWGNTw+\nYj8fixSs44e9Lo/HnHA+Fq2l+eAnzImISDeGBxER6cbwICIi3RgeRESkG8ODiIh0Y3gQEZFuDA8i\nItKN4UFERLoxPIiISDeGBxER6cbwICIi3RgeRESkG8ODiIh0Y3gQEZFuDA8iItKN4UFERLoxPIiI\nSDeGBxER6cbwICIi3ZYNj+9+97swGo3Izc3V2iYmJuBwOGC1WuF0OhGJRLR1R48ehaqqyMrKQn9/\nv9Z++fJl5ObmQlVV7N+/X2ufmZlBTU0NVFVFUVERrl+/rq1rbW2F1WqF1WpFW1ub
1u71emG326Gq\nKmprazE7O7v6I0BERLotGx579uxBX1/fHW3Hjh2Dw+HAp59+itLSUhw7dgwAcPXqVXR0dODq1avo\n6+vD9773Pe0fqTc2NqK5uRnDw8MYHh7WxmxubkZKSgqGh4fxxhtv4ODBgwAWAuonP/kJLl68iIsX\nL+Lw4cOYnJwEABw8eBAHDhzA8PAwNmzYgObm5q/uiBAR0bKWDY9vfetb2LBhwx1tPT09qK+vBwDU\n19ejq6sLANDd3Y1du3YhISEB6enpsFgs8Hg8CAaDmJqags1mAwDU1dVpfW4fq6qqCufPnwcAnDt3\nDk6nEwaDAQaDAQ6HA729vRARDA4OYufOnUvun4iIHo341XQaHx+H0WgEABiNRoyPjwMARkdHUVRU\npG1nNpsRCASQkJAAs9mstZtMJgQCAQBAIBBAWlraQjHx8UhKSkIoFMLo6OgdfRbHmpiYgMFgQFxc\n3JKx7q7pttvFtxYiIlowdGvRZ1XhcTtFUaAoyoMOs+L70q/pqy6DiGgdKcadf1QfXlGvVV1tZTQa\nMTY2BgAIBoPYuHEjgIWzAJ/Pp23n9/thNpthMpng9/uXtC/2GRkZAQDMzc1hcnISKSkpS8by+Xww\nmUxITk5GJBLB/Py8NpbJZFrNbhAR0SqtKjzKy8vR2toKYOGKqMrKSq3d7XYjGo3C6/VieHgYNpsN\nqampSExMhMfjgYigvb0dFRUVS8bq7OxEaWkpAMDpdKK/vx+RSAThcBgDAwMoKyuDoigoKSnB2bNn\nl9w/ERE9IrKM2tpa2bRpkyQkJIjZbJaWlhYJhUJSWloqqqqKw+GQcDisbX/kyBHJzMyUzZs3S19f\nn9Z+6dIlycnJkczMTNm3b5/WPj09LdXV1WKxWMRut4vX69XWtbS0iMViEYvFImfOnNHar127Jjab\nTSwWi7hcLolGo3etHYAAEsOlWVyuPcsd4idG7OdjcVn2Yf/EeDzmhPOxaC3Nh3Kr4HVp4T2SWO5e\nC1yuD9HR0RLDGh4fsZ+PRQrW8cNel8djTjgfi9bSfPAT5kREpBvDg4iIdGN4EBGRbgwPIiLSjeFB\nRES6MTyIiEg3hgcREenG8CAiIt0YHkREpBvDg4iIdGN4EBGRbgwPIiLSjeFBRES6MTyIiEg3hgcR\nEenG8CAiIt0YHkREpBvDg4iIdHug8EhPT8fzzz+PgoIC2Gw2AMDExAQcDgesViucTicikYi2/dGj\nR6GqKrKystDf36+1X758Gbm5uVBVFfv379faZ2ZmUFNTA1VVUVRUhOvXr2vrWltbYbVaYbVa0dbW\n9iC7QUREej3IP2tPT0+XUCh0R9ubb74px48fFxGRY8eOycGDB0VE5MqVK5KXlyfRaFS8Xq9kZmbK\n/Py8iIgUFhaKx+MREZEdO3ZIb2+viIicPHlSGhsbRUTE7XZLTU2NiIiEQiHJyMiQcDgs4XBYu/3H\nEPN/Jt8sLteeBznE60rs52NxeaCH/bryeMwJ52PRWpqPB37ZamF//1dPTw/q6+sBAPX19ejq6gIA\ndHd3Y9euXUhISEB6ejosFgs8Hg+CwSCmpqa0M5e6ujqtz+1jVVVV4fz58wCAc+fOwel0wmAwwGAw\nwOFwoK+v70F3hYiIVuiBwkNRFHz729/Gtm3b8Itf/AIAMD4+DqPRCAAwGo0YHx8HAIyOjsJsNmt9\nzWYzAoHAknaTyYRAIAAACAQCSEtLAwDEx8cjKSkJoVDonmMREdGjEf8gnX/zm99g06ZN+O///m84\nHA5kZWXo8ShzAAAFwElEQVTdsV5RFCiK8kAFPrim224X31qIiGjB0K1FnwcKj02bNgEAnnnmGbz8\n8su4ePEijEYjxsbGkJqaimAwiI0bNwJYOKPw+XxaX7/fD7PZDJPJBL/fv6R9sc/IyAieffZZzM3N\nYXJyEikpKTCZTBgaGtL6+Hw+bN++/R5VNj3I
LhIRrXPFuPOP6sMr6rXql63+53/+B1NTUwCAL7/8\nEv39/cjNzUV5eTlaW1sBLFwRVVlZCQAoLy+H2+1GNBqF1+vF8PAwbDYbUlNTkZiYCI/HAxFBe3s7\nKioqtD6LY3V2dqK0tBQA4HQ60d/fj0gkgnA4jIGBAZSVla12V4iISKdVn3mMj4/j5ZdfBgDMzc3h\nr/7qr+B0OrFt2za4XC40NzcjPT0dv/rVrwAA2dnZcLlcyM7ORnx8PE6dOqW9pHXq1Cm88soruHHj\nBl588UW88MILAICGhgbs3r0bqqoiJSUFbrcbAJCcnIxDhw6hsLAQAPDWW2/BYDCs/igQEZEuivzx\n5VLryEI4xXL3WuByfYiOjpYY1vD4iP18LFKWXCX4pHo85oTzsWgtzQc/YU5ERLoxPIiISDeGBxER\n6cbwICIi3RgeRESkG8ODiIh0Y3gQEZFuDA8iItKN4UFERLoxPIiISDeGBxER6cbwICIi3RgeRESk\nG8ODiIh0Y3gQEZFuDA8iItKN4UFERLoxPIiISDeGBxER6bamw6Ovrw9ZWVlQVRXHjx+PdTn0lRqK\ndQH0QIZiXQA9ZGs2PG7evInXXnsNfX19uHr1Kt577z38/ve/j3VZ9JUZinUB9ECGYl0APWRrNjwu\nXrwIi8WC9PR0JCQkoLa2Ft3d3bEui4joiRAf6wJWKxAIIC0tTfvZbDbD4/Es2S4p6aVHWdYdotER\nxMcXxuz+iYgeljUbHoqirGi7yclfP+RK7u+Xv/wEv/xlc0xreLysbN4WHH54Vazw8fNkeFjHYuXz\nx/m43do4Fms2PEwmE3w+n/azz+eD2Wy+YxsRedRlERE9Edbsex7btm3D8PAwPvvsM0SjUXR0dKC8\nvDzWZRERPRHW7JlHfHw83n33XZSVleHmzZtoaGjAc889F+uyiIieCIqsw9d2urq68Jd/+Zf4/e9/\nj82bN8e6HNJhbGwMr7/+Oi5dugSDwQCj0Yh//Md/hKqqsS6NlhEKhfDtb38bwMI8/smf/AmeeeYZ\nKIoCj8eDhISEGFdI9zM+Po433ngDHo8HGzZswNe+9jV8//vfR2Vl5V23X5fhUVNTgxs3bmDr1q1o\namqKdTm0QiKCP//zP8eePXvwN3/zNwCATz75BJ9//jn+4i/+IsbVkR6HDx/G008/jb/7u7+LdSm0\nAnd77o2MjKCnpwevvfbaXfus2fc87uWLL76Ax+PBu+++i46OjliXQzoMDg7ia1/7mvbgBYDnn3+e\nwbFGrcO/S9etf/u3f8Of/umf3vHc+8Y3vnHP4ADWYXh0d3fjhRdewDe+8Q0888wz+O1vfxvrkmiF\nfve73+HP/uzPYl0G0RPnypUr2Lp1q64+6y483nvvPVRXVwMAqqur8d5778W4IlopXutPFBt//Nx7\n7bXXkJ+fD5vNds8+a/Zqq7uZmJjA4OAgfve730FRFNy8eROKouBnP/tZrEujFdiyZQs6OztjXQbR\nE2fLli34l3/5F+3nd999F6FQCNu2bbtnn3V15tHZ2Ym6ujp89tln8Hq9GBkZwTe/+U1cuHAh1qXR\nCmzfvh0zMzP4xS9+obV98skn+PDDD2NYFdH6t337dkxPT+Of/umftLYvv/zyvn3WVXi43W68/PLL\nd7RVVVXB7XbHqCLS6/3338e//uu/wmKxICcnBz/84Q+xadOmWJdFq8CXIdeWrq4u/Pu//zsyMjJg\nt9vxyiuv4Kc//ek9t1+Xl+oSEdHDta7OPIiI6NFgeBARkW4MDyIi0o3hQUREujE8iIhIN4YHERHp\n9v8B/mI60TATFaEAAAAASUVORK5CYII=\n", "text": [ "" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Counting long strings\n", 
"\n", "How often do very long repetitions of a single base occur? E.g. are sequences like \"TTTTTTTTT\" very rare or are they commonplace?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Using partitionby, a close relative of itertools.groupby\n", "from toolz.curried import partitionby, identity\n", "\n", "animals = ['cat', 'dog', 'hen', 'goose', 'moose', 'rat', 'giraffe']\n", "list(partitionby(len, animals))  # Partition consecutive animals by their name length" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ "[['cat', 'dog', 'hen'], ['goose', 'moose'], ['rat'], ['giraffe']]" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "# We want to partition our gene sequences 'AACCTTTTTGCT' by the letters themselves\n", "# Instead of `len`, our key function is just the trivial identity function\n", "identity('A')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ "'A'" ] } ], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "# Let's use partitionby with identity as the key to collect our sequence\n", "# into groups of repeated elements\n", "\n", "partitions = list(partitionby(identity, 'AACCTTTTTGCT'))\n", "partitions" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ "[['A', 'A'], ['C', 'C'], ['T', 'T', 'T', 'T', 'T'], ['G'], ['C'], ['T']]" ] } ], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "# From here we just want to compute the length of each group\n", "\n", "list(map(len, partitions))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ "[2, 2, 5, 1, 1, 1]" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Counting recurrences on the random genome\n", 
"\n", "Lets pull this together and count recurring base pairs on our random genome." ] }, { "cell_type": "code", "collapsed": false, "input": [ "pipe(snoot, take(20), partitionby(identity), map(len), list)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ "[1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1]" ] } ], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "result = pipe(snoot, take(1000000), partitionby(identity), map(len), frequencies)\n", "\n", "histdict(result, log=True)\n", "title('Counts of Repeated Base Pairs - Random')\n", "result\n" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "{1: 562222,\n", " 2: 140330,\n", " 3: 35289,\n", " 4: 8843,\n", " 5: 2219,\n", " 6: 577,\n", " 7: 138,\n", " 8: 33,\n", " 9: 8,\n", " 10: 2}" ] }, { "metadata": {}, "output_type": "display_data", "png": "iVBORw0KGgoAAAANSUhEUgAAAXkAAAEICAYAAAC6fYRZAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XtcVHX+P/DXKHhJh5sXRAZCAQOEtMQUXHSszCzFuw5t\nqdBlM7HVdV3dskAzC2/rmo9sNUVNRSo1THHc1cJsTajUWgUfoDE2gplXLl64DJ/fH/w8X0dhnBlm\nGObwej4ePB7MOXPO+31mmPd8eJ/PnFEIIQSIiEiWWjg6ASIish8WeSIiGWORJyKSMRZ5IiIZY5En\nIpIxFnkiIhljkSdcuHABAwcOhJubG2bPnu3odBrNlClT8NZbbzk6DbsIDw/HN9984+g0bE7Oz5m9\nsMhbYevWrYiMjIRSqUTXrl3xzDPP4L///a/d47Zo0QK//PKLzfe7Zs0adO7cGaWlpViyZMk966dM\nmYLWrVtDqVTCy8sLTzzxBE6ePGnzPCzV0MdDoVBAoVDUuW7Dhg1o2bIllEollEolAgMD8dFHH1kd\ny1otWrRA+/btoVQqoVKpMGvWLNTU1Nx3uxMnTmDgwIGNkKFxjr6+vnj99ddRXV1tl1imnjOqG4u8\nhZYvX46ZM2di3rx5+P3336HX6zFt2jTs2rWrUeLb47NrZ8+eRWhoaL3rFQoF5syZg7KyMhQXF8Pf\n3x/x8fE2z8MaDX08TG0/YMAAlJWVoaysDNu3b8ff/vY3HD9+vEHxrPHzzz+jrKwMBw4cwNatW7F2\n7doG7c9gMNgos/9zO8dvvvkGO3bswJo1a2we4zZ+ftMyLPIWKCkpQVJSEj788EOMGjUKbdu2RcuW\nLfHss88iJSUFAFBRUYEZM2bA19cXvr6+mDlzJiorKwHUjg5jYmKM9nnnaHTKlCmYNm0ahg8fDjc3\nN/Tv319ad3tU1qtXLyiVSnz22We4dOkShg8fDk9PT3To0AEDBw6s9wVw+PBh9O3bFx4eHnjsscfw\
n3XffSTE3bdqExYsXQ6lU4quvvjL5GLRp0wbjx483GskXFxdj7Nix6Ny5M7p3744PPvhAWpecnIxx\n48ZBo9HAzc0Nffr0wc8//2zWtjk5OYiKioKnpye6du2K6dOno6qqqt7HAwB2796N3r17w9PTEwMG\nDMD//vc/aX/Hjh3Do48+Cjc3N2g0Gty6dcvksd75WPbu3RuhoaE4deqUtGz8+PHw8fGBh4cHBg0a\nhNzcXGldZmYmevbsCTc3N6hUKixbtkxaZypHUx566CHExMTg5MmT+OWXX/D444+jY8eO6NSpE55/\n/nmUlJRI9w0ICJCey9vPwQsvvAB3d3ds3LgROTk5iIyMhLu7O7p06YJZs2aZlcP9BAYGYsCAAUaP\nxZ///Gf4+/vD3d0dkZGR+Pbbb6V1ycnJmDBhAiZPngw3NzeEh4fjxx9/lNbf7zlbu3YtgoOD0aFD\nB4wcORLnz5+X1rVo0QKrV69GcHAw3Nzc8Pbbb+PMmTOIioqCh4cHNBqN9Pcka4LMtnfvXuHi4iIM\nBkO993nrrbdEVFSUuHjxorh48aKIjo4Wb731lhBCiNTUVPGHP/zB6P4KhUKcOXNGCCHE5MmTRYcO\nHcT3338vqqurxR//+Eeh0WjqvK8QQsydO1e8+uqrorq6WlRXV4tvv/22zpwuX74sPDw8xObNm4XB\nYBBpaWnC09NTXLlyRQghxJQpU6Qc6zJlyhQxb948IYQQ5eXl4vnnnxeDBw8WQghhMBjEo48+Kt55\n5x1RVVUlfvnlF9G9e3exb98+IYQQSUlJwtXVVWzfvl1UV1eLpUuXim7duonq6ur7bvvjjz+K7Oxs\nYTAYhE6nE6GhoWLFihX1Ph5Hjx4VnTt3Fjk5OaKmpkZs3LhRBAQEiMrKSlFRUSH8/f3FihUrRHV1\ntfj888+Fq6trvcd993OVnZ0tPDw8REFBgdF9ysvLRWVlpZgxY4bo3bu3tK5Lly7S83Ht2jVx9OhR\nkzlWVFTUmYdCoRCnT58WQghx8uRJ0aVLF7F+/Xpx+vRpsX//flFZWSkuXrwoBg4cKGbMmCFtFxAQ\nIA4cOGD0HGRkZAghhLh586bo37+/2Lx5sxBCiOvXr4sjR47UGd8cd+aYl5cnfHx8xMaNG6X1mzdv\nFleuXBEGg0EsW7ZMdOnSRTrepKQk0aZNG7F3715RU1Mj/v73v4v+/fsLIcR9n7MDBw6Ijh07imPH\njomKigoxffp0MXDgQKO8Ro0aJcrKysTJkydFq1atxODBg0VhYaEoKSkRYWFhRnnKFYu8BTZv3iy6\ndOli8j6BgYFi79690u19+/aJgIAAIcT9i/yUKVPEyy+/LK3LzMwUISEhdd5XCCHefvttMXLkSOkF\nVp9NmzaJfv36GS2LiooSGzZskOLeLuJ1mTx5smjTpo3w8PAQLVq0EN27dxcXL14UQghx5MgR4e/v\nb3T/RYsWifj4eCFE7Ys4KipKWldTUyN8fHzEoUOH7rvt3f7xj3+I0aNHS7fvfjxeffXVe4r2Qw89\nJA4ePCgOHjwounbtarTuzjfgu6WmpgoXFxfh4eEhlEqlUCgU4vXXX6/zvkIIcfXqVaFQKERpaakQ\nQgh/f3/xr3/9S5SUlBjdz1SOdVEoFMLNzU14enqKwMBA8dZbb4mampp77rdz507xyCOPSLfvLvKD\nBg0yuv/AgQNFUlKS9Dw2xO0c27VrJxQKhZg+fbrJ+3t6eoqff/5Zym3IkCHSupMnT4q2bdsKIcR9\nn7OEhAQxZ84caV15eblwdXUVZ8+elfI6fPiwtL5Pnz5i8eLF0u1Zs2YZvTHKFds1FujQoQMuXbpk\n8sRXcXExHnzwQem2v78/iouLzY7h7e0t/d62bVuUl5fXe9/Zs2cjKCgITz31FAIDA6WWUV05+fv7\nGy178MEHzc5LoVBg9uzZuHr1KnQ6HVq3bo1NmzYBqO3nFxcXw
9PTU/p577338Pvvv0vbq1Qqo32p\nVCoUFxfj119/Nbltfn4+hg8fDh8fH7i7u+PNN9/E5cuX683z7NmzWLZsmdH+zp07h/Pnz6O4uBi+\nvr73PAbCRH+3f//+uHr1KkpLS/Hbb7/hxIkTeOONNwDU9rXnzp2LoKAguLu7o1u3blAoFLh06RIA\nYPv27cjMzERAQADUajWOHDly3xzrc+zYMVy5cgWnT5/GggULoFAocOHCBWg0GqhUKri7u+OFF14w\n+djc+RwAwLp165Cfn4/Q0FA89thj2LNnT53bDRs2TDr5nJaWZjLH8vJypKenY9OmTTh79qy0bunS\npQgLC4OHhwc8PT1RUlIiPU6A8d/8Aw88gFu3bqGmpqbe5+y28+fPG91u164dOnTogKKiojr33bZt\nW4teX3LBIm+BqKgotG7dGjt37qz3Pl27doVOp5Nu//rrr+jatSuA2j/CGzduSOt+++23BuXTvn17\nLF26FGfOnMGuXbuwfPnyOnvqvr6+Ri86oLbY3P0CMuV2MfTz88PKlSvxzjvvoLS0FH5+fujWrRuu\nXr0q/ZSWlmL37t3Stnq9Xvq9pqYG586dg6+v7323nTp1KsLCwnD69GmUlJTg3XffNfkG6+/vjzff\nfNNof+Xl5Zg4cSJ8fHyMXvy3HwNzZ2p07twZY8aMwZdffgmgdobVrl27cODAAZSUlKCwsBCi9j9j\nAEBkZCS++OILXLx4EaNGjcKECRPum6Ml3njjDbRs2RInTpxASUkJPvnkE5OPzd3HGRQUhK1bt+Li\nxYuYM2cOxo0bh5s3b96z3d69e6WTz3FxcffNa/z48Rg+fDiSk5MBAIcOHcKSJUvw2Wef4dq1a7h6\n9Src3d3NOnla33N2292vtevXr+Py5csW/V03ByzyFnB3d8eCBQswbdo0ZGRk4MaNG6iqqsLevXsx\nZ84cAEBcXBwWLlyIS5cu4dKlS1iwYAFeeOEFALUnCU+ePImffvoJt27dkl4It93vD9/b2xtnzpyR\nbu/ZswenT5+GEAJubm5o2bIlWrZsec92zzzzDPLz85GWlobq6mqkp6fj1KlTGD58uFlx717/5JNP\nIigoCKtXr0a/fv2gVCqxePFi3Lx5EwaDASdOnMAPP/wg3f/HH3/Ezp07UV1djRUrVqBNmzbo378/\n+vbta3Lb8vJyKJVKPPDAAzh16hRWr15t8vF4+eWX8dFHHyEnJwdCCFy/fh179uxBeXk5oqOj4eLi\ngpUrV6Kqqgo7duzA999/b/K473T58mXs3LkT4eHhUm6tW7eGl5cXrl+/Lo3wAaCqqgpbtmxBSUmJ\nNA3z9vNiKkdLlJeXo127dnBzc0NRUVGdU19N2bx5My5evAig9u9aoVCgRQvblIO5c+ciLS0N586d\nQ1lZGVxcXNCxY0dUVlZiwYIFKC0tNWs/UVFRJp+zuLg4pKam4qeffkJFRQXeeOMN9O/f/57/Wu90\n59+yOW80csAib6G//OUvWL58ORYuXIjOnTvD398fH374IUaPHg0AmDdvHiIjI/Hwww/j4YcfRmRk\nJObNmwcA6NGjB95++208+eST0kyJO0dYdc0BvvN2cnIyJk+eDE9PT3z22WcoKCjAkCFDoFQqER0d\njWnTpmHQoEH35Ozl5YXdu3dj2bJl6NixI5YuXYrdu3fDy8ur3rh353D3+tmzZ2PlypUwGAzYvXs3\njh8/ju7du6NTp0545ZVXpBeyQqHAyJEjkZ6eDi8vL2zZsgU7duyQ3pBMbbt06VJs3boVbm5ueOWV\nV6DRaOp9PD7//HP06dMHa9euRWJiIry8vBAcHCy1lVxdXbFjxw5s2LABHTp0wKeffoqxY8eaPObv\nvvtOalWEhYXB29tbmv0zadIkPPjgg/D19UV4eDiioqKMctu8eTO6desGd3d3rFmzBlu2bAEAkznW\nl0ddkpKScPToUbi7u2PEi
BEYO3Zsvfet6/nbt28fwsPDoVQqMXPmTGzbtg2tW7euNw9T7t53eHg4\nHn/8cSxfvhxPP/00nn76afTo0QMBAQFo27atURE29TffqlUrk8/ZE088gXfeeQdjx45F165dUVhY\niG3bttWb193Lmsuce4Ww8duZEALz5s1DWVkZIiMjMWnSJFvunpzM/Pnzcfr0aXzyySeOToWoWbL5\nSP6LL75AUVERWrVqdc/JHmp+msu/xERNlVlFPiEhAd7e3oiIiDBartVqERISguDgYGlmR35+PgYM\nGIClS5fe00Ol5qe5/EtM1FSZVeTj4+Oh1WqNlhkMBiQmJkKr1SI3NxdpaWnIy8uDSqWCh4dH7c5t\ndCKHnFdSUpLJnjMR2ZdZVTgmJgaenp5Gy3JychAUFISAgAC4urpCo9EgIyMDY8aMwb59+/D6669D\nrVbbI2ciIjKTi7UbFhUVwc/PT7qtUqmQnZ2Ntm3b4uOPP77v9vwXnojIOpac67K6n2KLIp2UlISv\nv/5a+hCJnH6SkpIcngOPj8fG45PPz9dff42kpCSL66zVI3lfX1+jTzLq9XqLZ9Pc/WEgIiKqm1qt\nhlqtxvz58y3azuqRfGRkJAoKCqDT6VBZWYn09HTExsZatI/k5GRkZWVZmwIRUbORlZVl3cBYmEGj\n0QgfHx/RqlUroVKpxPr164UQtVdJ7NGjhwgMDBSLFi0yZ1cSM0M7ra+//trRKdiVnI9PzscmBI/P\n2VlaO23+iVdzKRQKJCUlSf+CEBFR/bKyspCVlYX58+fDkrLt0CLvoNBERE7L0trJTysREcmY1bNr\nbCE5OblR2jWffrodX375b7vGUCiAP//5FfTp08eucYioebrdrrFUs2jXDBs2EVqtG4BIu8Vo2fIz\npKQMs9kXIhMR1cXS2unQkXzjehKAZd++Y4kWLfLttm8iIms5tCfPefJEROaxdp68w3vyRER0f43+\niVciImr62K4hInICbNcQEckY2zVERHQPFnkiIhljkScikjGeeCUicgI88UpEJGM88UpERPdgkSci\nkjEWeSIiGWORJyKSMRZ5IiIZ4xRKIiInwCmUREQyximURER0DxZ5IiIZY5EnIpIxFnkbevPNJCgU\nCrv+uLl5OfowiciJ2LzIZ2VlISYmBlOnTsXBgwdtvfsmraLiOgBh15+ysquNd0BE5PRsXuRbtGgB\npVKJiooKqFQqW++eiIgsYFaRT0hIgLe3NyIiIoyWa7VahISEIDg4GCkpKQCAmJgYZGZm4v3330dS\nUpLtMyYiIrOZVeTj4+Oh1WqNlhkMBiQmJkKr1SI3NxdpaWnIy8uDQqEAAHh4eKCiosL2GRMRkdnM\n+jBUTEwMdDqd0bKcnBwEBQUhICAAAKDRaJCRkYFTp05h3759uHbtGqZPn27rfImIyAJWf+K1qKgI\nfn5+0m2VSoXs7GzMnTsXo0ePNmsfd37i9fanuYiI6P9kZWU16PIvVhf5222ZhuBlDYiITLt7ANxo\nlzXw9fWFXq+Xbuv1eotn0/ACZURE5rH2AmVWF/nIyEgUFBRAp9OhsrIS6enpiI2NtXZ3RERkB2YV\n+bi4OERHRyM/Px9+fn5ITU2Fi4sLVq1ahaFDhyIsLAwTJ05EaGioRcGTk5PZhyciMoNarbbfpYbT\n0tLqXD5s2DAMGzbM4qBERNQ4+KUhREROwNqevEIIIWyfjhmBFQo0VuhhwyZCqx0DYKLdYri6zkJV\n1XLUXmPGnhrvcSOipsfS2smRPBGRE+BI3gSO5IlILpxqJE9ERPbFdg0RkRNgu8YEtmuISC7YriEi\nIgmLPBGRjLEnT0TkBNiTN4E9eSKSC0trp9XXkydHcbHJtfxNUSo9UVp6xa4xiKhxsMg7nWrY+7+F\nsjL7vokQUeNhT56IyAmwJ2+C3Hry7PsTNV+cJ09ERBIWeSIiGWORJyKSMRZ5IiIZY5EnIpI
xTqEk\nInICnEJpAqdQWh6DUyiJmiZOoSQiIgmLPBGRjLHIExHJGIs8EZGM2aXIX79+HX379sWePXvssXsi\nIjKTXYr84sWLMXGi/WayEBGRecwq8gkJCfD29kZERITRcq1Wi5CQEAQHByMlJQUA8J///AdhYWHo\n1KmT7bMlIiKLmFXk4+PjodVqjZYZDAYkJiZCq9UiNzcXaWlpyMvLw8GDB3HkyBFs3boVa9eu5Xxr\nIiIHMuuboWJiYqDT6YyW5eTkICgoCAEBAQAAjUaDjIwMLFy4EACwceNGdOrUye5fVUdERPWz+uv/\nioqK4OfnJ91WqVTIzs6Wbk+ePPm++7jzI7pqtRpqtdradIiIZCkrK6tBl3+xusjbYoRuzXUYiIia\nk7sHwPPnz7doe6uLvK+vL/R6vXRbr9dDpVJZtI/k5GSO4JskF7u32ZRKT5SWXrFrDCI5sXZEb/UU\nysjISBQUFECn06GyshLp6emIjY21aB+3izw1NdWovQia/X7Kyq423uEQyYBarbaq+2FWkY+Li0N0\ndDTy8/Ph5+eH1NRUuLi4YNWqVRg6dCjCwsIwceJEhIaGWhSclxomIjIPLzVsAi813DRjcHotkeWc\n6lLDHMkTEZmHI3kTOJJvmjE4kieyHEfyREQyxJG8CRzJN80YHMkTWc6pRvJERGRfbNcQETkBtmtM\nYLumacZgu4bIcmzXEBGRhEWeiEjG2JMnInIC7MmbwJ5804zBnjyR5diTJyIiCYs8EZGMsSdPROQE\n2JM3gT35phmDPXkiy7EnT0REEhZ5IiIZY5EnIpIxF0cnQM2VCxQKhV0jKJWeKC29YtcYRE0dizw5\nSDXsfXK3rMy+byJEzoBTKImInACnUJrAKZTNNwanaZLccAolERFJWOSJiGSMRZ6ISMZY5ImIZMzm\nRf7UqVOYOnUqJkyYgHXr1tl690REZAGbF/mQkBCsXr0a27Ztw759+2y9eyIisoBZRT4hIQHe3t6I\niIgwWq7VahESEoLg4GCkpKRIy7/88ks8++yz0Gg0ts2WiIgsYlaRj4+Ph1arNVpmMBiQmJgIrVaL\n3NxcpKWlIS8vDwAwYsQI7N27Fxs3brR9xkREZDazLmsQExMDnU5ntCwnJwdBQUEICAgAAGg0GmRk\nZOD333/Hjh07cOvWLQwePNjW+RIRkQWsvnZNUVER/Pz8pNsqlQrZ2dkYNGgQBg0aZNY+7vyIrlqt\nhlqttjYdIiJZysrKatDlX6wu8ra4gqA112EgImpO7h4Az58/36LtrZ5d4+vrC71eL93W6/VQqVQW\n7YMXKCMiMo+1FyizushHRkaioKAAOp0OlZWVSE9PR2xsrEX7SE5OZouGiMgMarXafkU+Li4O0dHR\nyM/Ph5+fH1JTU+Hi4oJVq1Zh6NChCAsLw8SJExEaGmpRcI7kiYjMw0sNm8BLDTffGLzUMMmNU11q\nmCN5IiLzcCRvAkfyzTcGR/IkNxzJExHJEEfyJnAk31xjuKL2C8PtS6n0RGnpFbvHIQIsr51WfxiK\nqOmrhv3fSICysoZ/MJDIXtiuISJyAmzXmMB2DWPYOw5P8FJjcaoTr0REZF8s8kREMsaePBGRE2BP\n3gT25BnD3nHYk6fGwp48ERFJWOSJiGSMRZ6ISMZ44pWIyAnwxKsJPPHKGPaOwxOv1Fh44pWIiCQs\n8kREMsYiT0QkYyzyREQyxtk1REROgLNrTODsGsawdxzOrqHGwtk1REQkYZEnIpIxFnkiIhljkSci\nkjEXe+w0IyMDe/bsQWlpKV588UUMGTLEHmGImggXKBQKu0ZQKj1RWnrFrjFInuxS5EeOHImRI0fi\n2rVr+Otf/8oiTzJXDXvP4ikrs++bCMmX2e2ahIQEeHt7IyIiwmi5VqtFSEgIgoODkZKSYrRu4cKF\nSExMtE2mRERkMbOLfHx8PLRardEyg8GAxMREaLVa5Ob
mIi0tDXl5eRBCYM6cORg2bBh69+5t86SJ\niMg8ZrdrYmJioNPpjJbl5OQgKCgIAQEBAACNRoOMjAzs378fBw4cQGlpKU6fPo0//elPtsyZiIjM\n1KCefFFREfz8/KTbKpUK2dnZ+OCDDzB9+vT7bn/nR3TVajXUanVD0iEikp2srKwGXf6lQUW+oTMK\nrLkOAxFRc3L3AHj+/PkWbd+gefK+vr7Q6/XSbb1eD5VKZfb2vEAZEZF5rL1AWYOKfGRkJAoKCqDT\n6VBZWYn09HTExsaavX1ycjJbNEREZlCr1fYt8nFxcYiOjkZ+fj78/PyQmpoKFxcXrFq1CkOHDkVY\nWBgmTpyI0NBQs4NzJE9EZB5eatgEXmqYMZw/Di9nTLWc6lLDHMkTEZmHI3kTOJJnDOePw5E81XKq\nkTwREdkX2zVERE6A7RoT2K5hDOePw3YN1WK7hoiIJGzXEBE5AbZrTGC7hjGcPw7bNVSL7RoiIpKw\nXUNE5ATYrjGB7RrGcP44bNdQLbZriIhI0qAvDSGixuLS4C/puR+l0hOlpVfsGoMaH4s8kVOohr1b\nQmVl9n0TIcfgiVciIifAE68m8MQrYzh/HJ7cpVo88UpERBIWeSIiGWORJyKSMRZ5IiIZ4+waIiIn\nwNk1JnB2DWM4fxzOrqFanF1DREQSFnkiIhljkScikjEWeSIiGWORJyKSMZsX+cLCQrz00ksYP368\nrXdNREQWsnmR79atGz7++GNb75aIiKxgVpFPSEiAt7c3IiIijJZrtVqEhIQgODgYKSkpdkmQiIis\nZ1aRj4+Ph1arNVpmMBiQmJgIrVaL3NxcpKWlIS8vzy5JEhGRdcwq8jExMfD09DRalpOTg6CgIAQE\nBMDV1RUajQYZGRm4cuUKXn31VRw/fpyjeyIiB7P66/+Kiorg5+cn3VapVMjOzoaXlxc++ugjs/Zx\n53UY1Go11Gq1tekQEclSVlZWg67xZXWRt9WXCrO4ExHV73aNtLbYW13kfX19odfrpdt6vR4qlcqi\nfVhzRTUioubodrGfP3++RdtZPYUyMjISBQUF0Ol0qKysRHp6OmJjYy3aBy81TNSUuEChUNj1x83N\ny9EH6bSsvdQwhBk0Go3w8fERrVq1EiqVSqxfv14IIURmZqbo0aOHCAwMFIsWLTJnVxIzQ9vE009P\nEMA2AQi7/bi6/kUAsGuM2h/GaFox5HQsjRODGsbSx9Csdk1aWlqdy4cNG4Zhw4ZZ/s7y/yUnJ7Mn\nT0RkBmt78vzSEBvhl4Y01xiNFUc+MRxUcmSDXxpCREQSfscrEZET4He8msB2DWM4fxz5xGC7pmHY\nriEiIgnbNUREToDtGhPYrmEM548jnxhs1zQM2zVERCRhu4aIyAmwXWMC2zWM4fxx5BOD7ZqGYbuG\niIgkLPJERDLGIk9EJGM88UpE5AR44tUEnnhlDOePI58YPPHaMDzxSkREEhZ5IiIZY5EnIpIxFnki\nIhljkScikjGzvsjbXvhF3kTNjQsUCoVdIyiVnigtvWLXGI7AL/I2gVMoGcP54zCGJTHkPE2TUyiJ\niEjCIk9EJGMs8kREMsYiT0QkYzafXXP9+nW89tpraN26NdRqNZ577jlbhyAiIjPZfCS/Y8cOTJgw\nAWvWrMGuXbtsvXsnkuXoBOwsy9EJ2FGWoxOwsyxHJ0CNyKwin5CQAG9vb0RERBgt12q1CAkJQXBw\nMFJSUgAARUVF8PPzAwC0bNnSxuk6kyxHJ2BnWY5OwI6yHJ2AnWU5OgFqRGYV+fj4eGi1WqNlBoMB\niYmJ0Gq1yM3NRVpaGvLy8qBSqaDX6wEANTU1ts+YiIjMZlaRj4mJgaenp9GynJwcBAUFISAgAK6u\nrtBoNMjIyMCYMWOwfft2vPbaa4iNjbVL0kREZB6rT7ze2ZYBAJVKhezsbDzwwANYv369Wfuw98eb\njX0KQGO3vVdV3f7
tzmOab6dojfG4mROjocfXVI6jLpYeW1M+lrrUd3zOdhz1RGjU2tK0WV3kG/og\nyvljx0RETYXVs2t8fX2l3jsA6PV6qFQqmyRFRES2YXWRj4yMREFBAXQ6HSorK5Gens4ePBFRE2NW\nkY+Li0N0dDTy8/Ph5+eH1NRUuLi4YNWqVRg6dCjCwsIwceJEhIaG3ndfdU27lAu9Xo/BgwejZ8+e\nCA8Px8qVKx2dkl0YDAY88sgjGDFihKNTsblr165h3LhxCA0NRVhYGI4cOeLolGzqvffeQ8+ePRER\nEYHnnns1aE4WAAAEKUlEQVQOFRUVjk6pQeqa3n3lyhUMGTIEPXr0wFNPPYVr1645MEPr1XVss2fP\nRmhoKHr16oUxY8agpKTk/jsSjai6uloEBgaKwsJCUVlZKXr16iVyc3MbMwW7On/+vDh27JgQQoiy\nsjLRo0cPWR3fbcuWLRPPPfecGDFihKNTsblJkyaJdevWCSGEqKqqEteuXXNwRrZTWFgounXrJm7d\nuiWEEGLChAliw4YNDs6qYb755htx9OhRER4eLi2bPXu2SElJEUII8f7774s5c+Y4Kr0GqevY/v3v\nfwuDwSCEEGLOnDlmHVujXrumvmmXctGlSxf07t0bANC+fXuEhoaiuLjYwVnZ1rlz55CZmYmXXnpJ\ndifPS0pKcOjQISQkJAAAXFxc4O7u7uCsbMfNzQ2urq64ceMGqqurcePGDfj6+jo6rQapa3r3rl27\nMHnyZADA5MmT8cUXXzgitQar69iGDBmCFi1qy3a/fv1w7ty5++6nUYt8XdMui4qKGjOFRqPT6XDs\n2DH069fP0anY1MyZM7FkyRLpD01OCgsL0alTJ8THx+PRRx/Fyy+/jBs3bjg6LZvx8vLCrFmz4O/v\nj65du8LDwwNPPvmko9OyuQsXLsDb2xsA4O3tjQsXLjg4I/tYv349nnnmmfver1Ffqc1l7mp5eTnG\njRuHf/7zn2jfvr2j07GZ3bt3o3PnznjkkUdkN4oHgOrqahw9ehSvvfYajh49inbt2uH99993dFo2\nc+bMGaxYsQI6nQ7FxcUoLy/Hli1bHJ2WXSkUClnWnXfffRetWrUy6wKQjVrkm8O0y6qqKowdOxbP\nP/88Ro0a5eh0bOrw4cPYtWsXunXrhri4OHz11VeYNGmSo9OyGZVKBZVKhb59+wIAxo0bh6NHjzo4\nK9v54YcfEB0djQ4dOsDFxQVjxozB4cOHHZ2WzXl7e+O3334DAJw/fx6dO3d2cEa2tWHDBmRmZpr9\nBt2oRV7u0y6FEHjxxRcRFhaGGTNmODodm1u0aBH0ej0KCwuxbds2PP7449i0aZOj07KZLl26wM/P\nD/n5+QCA/fv3o2fPng7OynZCQkJw5MgR3Lx5E0II7N+/H2FhYY5Oy+ZiY2OxceNGAMDGjRtlNdjS\narVYsmQJMjIy0KZNG/M2sstpYRMyMzNFjx49RGBgoFi0aFFjh7erQ4cOCYVCIXr16iV69+4tevfu\nLfbu3evotOwiKytLlrNrjh8/LiIjI8XDDz8sRo8eLavZNUIIkZKSIsLCwkR4eLiYNGmSqKysdHRK\nDaLRaISPj49wdXUVKpVKrF+/Xly+fFk88cQTIjg4WAwZMkRcvXrV0Wla5e5jW7dunQgKChL+/v5S\nfZk6dep996MQQobNVSIiAsCv/yMikjUWeSIiGWORJyKSMRZ5IiIZY5EnIpIxFnkiIhn7f0TOYryM\ntEX2AAAAAElFTkSuQmCC\n", "text": [ "" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Counting recurrences on the Yeast genome" ] }, { "cell_type": "code", "collapsed": false, 
"input": [ "result = pipe(file_pattern, glob, map(open), map(drop(1)), \n", " concat, map(str.upper), map(str.strip), concat,\n", " partitionby(identity), map(len), frequencies)\n", "\n", "histdict(result, log=True)\n", "title('Counts of Repeated Base Pairs - Yeast')\n", "result" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ "{1: 6118373,\n", " 2: 1729275,\n", " 3: 488751,\n", " 4: 158660,\n", " 5: 52793,\n", " 6: 17076,\n", " 7: 7079,\n", " 8: 2700,\n", " 9: 1305,\n", " 10: 778,\n", " 11: 496,\n", " 12: 344,\n", " 13: 247,\n", " 14: 135,\n", " 15: 91,\n", " 16: 64,\n", " 17: 52,\n", " 18: 26,\n", " 19: 34,\n", " 20: 23,\n", " 21: 14,\n", " 22: 17,\n", " 23: 13,\n", " 24: 19,\n", " 25: 8,\n", " 26: 10,\n", " 27: 4,\n", " 28: 4,\n", " 29: 3,\n", " 30: 2,\n", " 31: 4,\n", " 32: 1,\n", " 33: 1,\n", " 34: 1,\n", " 35: 2,\n", " 36: 1,\n", " 37: 1,\n", " 42: 1}" ] }, { "metadata": {}, "output_type": "display_data", "png": 
"iVBORw0KGgoAAAANSUhEUgAAAXkAAAEICAYAAAC6fYRZAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XtYVNX6B/DvCGYFg2AKIgONAQYoCYXXQqf7sQyzTAeP\npmB1ulDZ6WJ30cwTll3taJqSlySOxxRLGE9ZY3aDLnY6ij5iOTaClrfkUoqM7+8PH/ePwQFnhoEN\nm+/neeZ53Htmr/XuNfCyXXuttXUiIiAiIk3qpHYARETUcpjkiYg0jEmeiEjDmOSJiDSMSZ6ISMOY\n5ImINIxJns7q119/xbBhwxAUFIRHH31U7XBazeTJk/HMM8+oHUaL6NevHz777DO1w6BWwCTfAlau\nXImUlBTo9Xr06tULN9xwA7744osWr7dTp074+eeffV7uwoULERoaisrKSrz44otnvD958mR06dIF\ner0e3bp1w9VXX41t27b5PA5PNbc9dDoddDqdy/feeecd+Pn5Qa/XQ6/XIzo6GgsWLPC6Lm916tQJ\ngYGB0Ov1MBgMePjhh3Hy5MmzHrd161YMGzasRWMTEQwfPhwzZ8502r9s2TLExMTg2LFjPq+zpX4H\n2jMmeR97+eWX8dBDD+Hpp5/Gb7/9Brvdjvvuuw/r1q1rlfpbYm7bnj17EB8f3+j7Op0O06ZNQ1VV\nFSoqKhAVFYWMjAyfx+GN5rZHU8dffvnlqKqqQlVVFVavXo3HHnsMP/zwQ7Pq88aPP/6IqqoqbNy4\nEStXrsSiRYuaVZ7D4fBJXDqdDm+//TZeeeUVlJaWAgAOHDiARx55BIsXL8a5557rk3oa4vxOZ0zy\nPnT06FFMnz4d//znP3HzzTfjvPPOg5+fH2688Ubk5OQAAI4fP46pU6ciIiICEREReOihh1BbWwvg\n1NVhamqqU5n1r0wmT56M++67DyNHjkRQUBAGDx6svHf6qqx///7Q6/VYtWoVDh48iJEjRyIkJAQX\nXHABhg0b1ugvwJdffokBAwYgODgYAwcOxFdffaXUuWzZMsyZMwd6vR6ffPJJk21w7rnn4rbbbnO6\nkq+oqMCtt96K0NBQXHTRRXjjjTeU97KzszFmzBiYzWYEBQXhsssuw48//ujWsSUlJRgyZAhCQkLQ\nq1cv3H///Thx4kSj7QEAH374IZKSkhASEoLLL78c//vf/5TytmzZgksvvRRBQUEwm81nvdKs35ZJ\nSUmIj4/Hjh07lH233XYbwsPDERwcjOHDhyuJDgAKCwvRt29fBAUFwWAwYO7cucp7TcXYlIsvvhip\nqanYtm0bfv75Z1x11VXo3r07evTogQkTJuDo0aPKZ41Go/Jdnv4OJk6ciK5du2Lp0qUoKSlBSkoK\nunbtip49e+Lhhx92K4aGYmNj8dRTT2HKlCkQETzwwAMYM2YMhg8f3uR5vvDCC4iJiUFQUBD69u2L\ntWvXKu/t2rULw4cPR3BwMHr06IH09HQAjX/nHZ6QzxQVFYm/v784HI5GP/PMM8/IkCFD5MCBA3Lg\nwAEZOnSoPPPMMyIikpubK1dccYXT53U6nfz0008iIjJp0iS54IIL5JtvvpG6ujr561//Kmaz2eVn\nRUQef/xxufvuu6Wurk7q6urk888/dxnToUOHJDg4WFasWCEOh0Py8vIkJCREDh8+LCIikydPVmJ0\nZfLkyfL000+LiEh1dbVMmDBBrrzyShERcTgccumll8pzzz0nJ06ckJ9//lkuuugi2bBhg4iITJ8+\nXTp37iyrV6+Wuro6eemll6R3795SV1d31mO/++47KS4uFofDITabTeLj4+XVV19ttD2+//57CQ0N\nlZKSEjl58qQsXbpUjEaj1NbWyvHjxyUqKkpeffVVqaurk3//+9/SuXPnRs+74XdVXFwswcHBUlZW\n5vSZ6upqqa2tlalTp0pSUpLyXs+ePZXv4/fff5fvv/++yRiPH
z/uMg6dTie7du0SEZFt27ZJz549\nZcmSJbJr1y75+OOPpba2Vg4cOCDDhg2TqVOnKscZjUbZuHGj03dQUFAgIiJ//vmnDB48WFasWCEi\nIjU1NfL111+7rN8dDodDBg0aJKNHj5YLL7xQqqurm/wuRERWrVol+/btExGR/Px8CQgIkP3794uI\niNlsltmzZ4uIyPHjx+WLL75wao/63zmJMMn70IoVK6Rnz55NfiY6OlqKioqU7Q0bNojRaBSRsyf5\nyZMny5133qm8V1hYKHFxcS4/KyLy7LPPyqhRo5Qk0Jhly5bJoEGDnPYNGTJE3nnnHaXe00nclUmT\nJsm5554rwcHB0qlTJ7nooovkwIEDIiLy9ddfS1RUlNPnZ8+eLRkZGSJyKsEMGTJEee/kyZMSHh4u\nmzdvPuuxDb3yyisyevRoZbthe9x9991nJO2LL75YNm3aJJs2bZJevXo5vVf/D3BDubm54u/vL8HB\nwaLX60Wn08kDDzzg8rMiIkeOHBGdTieVlZUiIhIVFSVvvfWWHD161OlzTcXoik6nk6CgIAkJCZHo\n6Gh55pln5OTJk2d8bs2aNZKcnKxsN0zyw4cPd/r8sGHDZPr06cr32Fzbtm0TnU4n69atExHPzzMp\nKUk59vbbb5e77rpL9u7de8bnmOTPxO4aH7rgggtw8ODBJm98VVRU4MILL1S2o6KiUFFR4XYdYWFh\nyr/PO+88VFdXN/rZRx99FDExMbjuuusQHR2tdBm5iikqKspp34UXXuh2XDqdDo8++iiOHDkCm82G\nLl26YNmyZQBO9edXVFQgJCREef3jH//Ab7/9phxvMBicyjIYDKioqMAvv/zS5LE7d+7EyJEjER4e\njq5du+Kpp57CoUOHGo1zz549mDt3rlN5e/fuxb59+1BRUYGIiIgz2kCa6N8dPHgwjhw5gsrKSuzf\nvx9bt27Fk08+CeBUv/bjjz+OmJgYdO3aFb1794ZOp8PBgwcBAKtXr0ZhYSGMRiNMJhO+/vrrs8bY\nmC1btuDw4cPYtWsXZs6cCZ1Oh19//RVmsxkGgwFdu3bFxIkTm2yb+t8BACxevBg7d+5EfHw8Bg4c\niPXr17s8bsSIEcrN57y8vEbLT0hIAAD07dvXrfNctmwZkpOTlfe2bt2qtN2cOXMgIhg4cCD69euH\n3NzcRusl9sn71JAhQ9ClSxesWbOm0c/06tULNptN2f7ll1/Qq1cvAEBAQAD++OMP5b39+/c3K57A\nwEC89NJL+Omnn7Bu3Tq8/PLLLvvUIyIisGfPHqd9e/bsOSPpNeV0MoyMjMTrr7+O5557DpWVlYiM\njETv3r1x5MgR5VVZWYkPP/xQOdZutyv/PnnyJPbu3YuIiIizHnvPPfcgISEBu3btwtGjR/H88883\n+Qc2KioKTz31lFN51dXVGDduHMLDw1FeXn5GGzQ2uqah0NBQ3HLLLfjggw8AnBphtW7dOmzcuBFH\njx7F7t27Iaf+5wwASElJwdq1a3HgwAHcfPPNGDt27Flj9MSTTz4JPz8/bN26FUePHsXy5cubbJuG\n5xkTE4OVK1fiwIEDmDZtGsaMGYM///zzjOOKioqUm8+n+8bd0dR57tmzB3fddRfefPNNHD58GEeO\nHEG/fv2UtgsLC8PChQtRXl6Ot956C/feey9H1DSBSd6HunbtipkzZ+K+++5DQUEB/vjjD5w4cQJF\nRUWYNm0aACA9PR2zZs3CwYMHcfDgQcycORMTJ04EcOqG0bZt2/Df//4Xx44dQ3Z2tlP5TV1VAqd+\n+H/66Sdle/369di1axdEBEFBQfDz84Ofn98Zx91www3YuXMn8vLyUFdXh/z8fOzYsQMjR450q96G\n719zzTWIiYnB/PnzMWjQIOj1esyZMwd//vknHA4Htm7dim+//Vb5/HfffYc1a9agrq4Or776Ks49\n91wMHjwYAwYMaPLY6upq6
PV6nH/++dixYwfmz5/fZHvceeedWLBgAUpKSiAiqKmpwfr161FdXY2h\nQ4fC398fr7/+Ok6cOIH3338f33zzTZPnXd+hQ4ewZs0a9OvXT4mtS5cu6NatG2pqapQrfAA4ceIE\n3n33XRw9elQZhnn6e2kqRk9UV1cjICAAQUFBKC8vdzn0tSkrVqzAgQMHAJz6udbpdOjUyXfpoqnz\nrKmpgU6nQ/fu3XHy5Enk5uZi69atyrGrVq3C3r17AQDBwcFOsTX8zgm88doS3n33XUlJSZGAgADp\n2bOnjBw5Ur766isRETl27Jg88MADEh4eLuHh4fLggw863VR7/vnnpXv37hIVFSUrVqyQTp06OfXJ\n1+/H/PTTTyUyMlLZXrBggYSHh0twcLD861//kldeeUWMRqMEBASIwWCQWbNmNRrz559/Lpdddpl0\n7dpVUlJSnG5muXPjteH7+fn50qtXL6mtrZWKigpJT0+Xnj17SkhIiAwZMkTpD87OzpYxY8bIuHHj\nRK/Xy6WXXipbtmxRymnq2M8++0zi4uIkMDBQUlNT5dlnn5XU1FSX7bFq1SoREbFYLDJgwAAJDg6W\n8PBwGTt2rFRVVYmIyLfffivJycmi1+tl3LhxYjabGz3vd955R/z8/CQwMFACAwMlNDRUxo8fr/Rh\nV1dXy6hRo0Sv14vRaJRly5Yp32Vtba385S9/kZCQEAkKCpKBAwc6tXdTMTZU/+ejvm3btslll10m\ngYGBkpycLHPnznX6WanfJ5+dnS0TJ050On7ChAkSGhoqgYGB0q9fP+WmbHM0jLWp83zqqaekW7du\n0r17d/n73/8uJpNJFi9eLCIijz32mEREREhgYKBER0fLokWLlDJdfecdnU7Et4NKP//8c7z77ruo\nq6tDaWlpq0wCovZrxowZ2LVrF5YvX652KESa5O/rAq+44gpcccUVKCgowMCBA31dPGmMj68xiKgB\ntzrZMjMzERYWhsTERKf9FosFcXFxiI2NPWPkxsqVKzF+/HjfRUqa1NTSAUTUfG5112zevBmBgYG4\n/fbblVlpDocDF198MT7++GNERERgwIAByMvLQ3x8PH755RfMmjULCxcubPETICKixrl1JZ+amoqQ\nkBCnfSUlJYiJiYHRaETnzp1hNptRUFAAAFiyZAkyMzN9Hy0REXnE6z758vJyREZGKtsGgwHFxcUA\ncMbQP1f4X3QiIu94ci/L6yTviyTtKtDq6mqkpaWjuvp4s8v3VEXFT3j//bw2dcM4OzvbrT+ara0t\nxsWY3MOY3NcW4/I093qd5CMiIpxmKtrt9jOmRnvj0KFD+OqrYhw7tqLZZXnK3/9ebN26tU0leSKi\n5vA6yaekpKCsrAw2mw29evVCfn5+k2tXuJKdnQ2TyQSTyeS038/vPADXeRua13Q6favXSUTkDqvV\nCqvV6vFxbt14TU9Px9ChQ7Fz505ERkYiNzcX/v7+mDdvHq6//nokJCRg3LhxTT5YwpXTSb6t8PPr\nqXYIZ2hL7VNfW4yLMbmHMbmvLcVlMpm86jry+YxXtyvW6Vz2ye/Zswd9+w5DTc0eF0e1rICATLz+\n+hUcGUREbVZjubMxXKCMiEjDVE3y2dnZXvUxERF1NFarld01zcXuGiJq69hdQ0RECiZ5IiINY5In\nItIw3nglImoHeOPVB3jjlYjaOt54JSIiBZM8EZGGMckTEWkYkzwRkYYxyRMRaRiTPBGRhjHJExFp\nGJM8EZGGccYrEVE7wBmvPsAZr0TU1nHGKxERKZjkiYg0jEmeiEjD/H1doIjg6aefRlVVFVJSUnD7\n7bf7ugoiInKTz6/k165di/LycpxzzjkwGAy+Lp6IiDzgVpLPzMxEWFgYEhMTnfZbLBbExcUhNjYW\nOTk5AICdO3fi8ssvx0svvYT58+f7PmIiInKbW0k+IyMDFovFaZ/D4UBWVhYsFgtKS0uRl5e
H7du3\nw2AwIDg4+FThndjlT0SkJrf65FNTU2Gz2Zz2lZSUICYmBkajEQBgNptRUFCABx98EPfffz82b94M\nk8nUZLn1B/abTKazfp6IqKOxWq3NmjTq9Y3X8vJyREZGKtsGgwHFxcU477zz8Pbbb7tVhjezt4iI\nOpKGF8AzZszw6Hiv+1N0Op23hxIRUSvxOslHRETAbrcr23a7naNpiIjaGK+TfEpKCsrKymCz2VBb\nW4v8/HykpaV5VAYXKCMico+3C5S5leTT09MxdOhQ7Ny5E5GRkcjNzYW/vz/mzZuH66+/HgkJCRg3\nbhzi4+M9qjw7O5s3W4mI3GAymbxK8m7deM3Ly3O5f8SIERgxYoTHlRIRUevgQHYiIg3jQ0OIiNoB\nPjTEB/jQECJq6/jQECIiUjDJExFpGJM8EZGG8cYrEVE7wBuvPsAbr0TU1vHGKxERKZjkiYg0jEme\niEjDmOSJiDSMSZ6ISMOY5ImINIxJnohIw5jkiYg0jDNeiYjaAc549QHOeCWito4zXomISMEkT0Sk\nYT5P8larFampqbjnnnuwadMmXxdPREQe8HmS79SpE/R6PY4fPw6DweDr4omIyANuJfnMzEyEhYUh\nMTHRab/FYkFcXBxiY2ORk5MDAEhNTUVhYSFeeOEFTJ8+3fcRExGR29xK8hkZGbBYLE77HA4HsrKy\nYLFYUFpairy8PGzfvh06nQ4AEBwcjOPHj/s+YiIicpu/Ox9KTU2FzWZz2ldSUoKYmBgYjUYAgNls\nRkFBAXbs2IENGzbg999/x/333+/reImIyANuJXlXysvLERkZqWwbDAYUFxfj8ccfx+jRo90qo/7A\nfpPJBJPJ5G04RESaZLVamzVp1Oskf7pbpjm8mb1FRNSRNLwAnjFjhkfHez26JiIiAna7Xdm22+0c\nTUNE1MZ4neRTUlJQVlYGm82G2tpa5OfnIy0tzaMyuHYNEZF7vF27xq0kn56ejqFDh2Lnzp2IjIxE\nbm4u/P39MW/ePFx//fVISEjAuHHjEB8f71Hl2dnZ7IcnInKDyWTyKsm71Sefl5fncv+IESMwYsQI\njyslIqLWwbVriIg0jOvJExG1A1xP3ge4njwRtXVcT56IiBRM8kREGsYkT0SkYUzyREQaxtE1RETt\nAEfX+ABH1xBRW8fRNUREpGCSJyLSMCZ5IiINY5Jv4L77HoROp2v1V1BQN7VPnYg0yOsnQ2nVsWPV\nAFr/XnRVVfOftEVE1BCv5ImINIxJnohIw5jkiYg0jDNeiYjaAc549YGAgEzU1ORCjRuvgGez2Iio\nY+KMVyIiUjDJExFpWIsk+ZqaGgwYMADr169vieKJiMhNLZLk58yZg3HjxrVE0URE5AG3knxmZibC\nwsKQmJjotN9isSAuLg6xsbHIyckBAHz00UdISEhAjx49fB8tERF5xK0kn5GRAYvF4rTP4XAgKysL\nFosFpaWlyMvLw/bt27Fp0yZ8/fXXWLlyJRYtWsQRI0REKnJr7ZrU1FTYbDanfSUlJYiJiYHRaAQA\nmM1mFBQUYNasWQCApUuXokePHtDpuCYLEZFavF6grLy8HJGRkcq2wWBAcXGxsj1p0qSzllF/YL/J\nZILJZPI2HCIiTbJarc2aNOp1kvfFFbo3s7eIiDqShhfAM2bM8Oh4r0fXREREwG63K9t2ux0Gg8Hb\n4oiIqAV4neRTUlJQVlYGm82G2tpa5OfnIy0tzaMyuHYNEZF7WnTtmvT0dGzatAmHDh1CaGgoZs6c\niYyMDBQVFWHq1KlwOByYMmUKnnjiCfcr5to1DXDtGiI6O0/XruECZfUwyRNRW8cFyoiISMH15ImI\n2gGuJ+8D7K4horaO3TVERKRgkici0jAmeSIiDWOSJyLSMI6uISJqBzi6xgc4uoaI2jqOriEiIgWT\nPBGRhjHJExFpGJM8EZGGMckTEWkYkzwRkYYxyRMRaRi
TPBGRhnHGKxFRO8AZrz6g7ozXzgDqVKgX\n0OtDUFl5WJW6icgzns549W/BWMgjdVDnjwtQVaVTpV4iannskyci0jCfJ/kdO3bgnnvuwdixY7F4\n8WJfF09ERB7weZKPi4vD/Pnz8d5772HDhg2+Lp6IiDzgVpLPzMxEWFgYEhMTnfZbLBbExcUhNjYW\nOTk5yv4PPvgAN954I8xms2+jJSIij7iV5DMyMmCxWJz2ORwOZGVlwWKxoLS0FHl5edi+fTsA4Kab\nbkJRURGWLl3q+4iJiMhtbo2uSU1Nhc1mc9pXUlKCmJgYGI1GAIDZbEZBQQF+++03vP/++zh27Biu\nvPJKX8dLREQe8HoIZXl5OSIjI5Vtg8GA4uJiDB8+HMOHD3erjPoD+00mE0wmk7fhEBFpktVqbdak\nUa+TvE7X/LHV3szeIiLqSBpeAM+YMcOj470eXRMREQG73a5s2+12GAwGb4sjIqIW4HWST0lJQVlZ\nGWw2G2pra5Gfn4+0tDSPyuDaNURE7mnRtWvS09OxadMmHDp0CKGhoZg5cyYyMjJQVFSEqVOnwuFw\nYMqUKXjiiSfcr5hr1zSgU6neU3WrtIQREXnI07VruEBZPUzyRNTWeZrkuXYNEZGGcT15IqJ2gOvJ\n+wC7a4iorWN3DRERKZjkiYg0jEmeiEjDmOSJiDSMo2uIiNoBjq7xAY6uIaK2ztPRNV6vQkla4u+T\nVUU9pdeHoLLycKvXS9SRMMkTgDqo8b+IqqrW/8NC1NHwxisRkYYxyRMRaRiTPBGRhjHJExFpGJM8\nEZGGcTIUEVE7wMlQPtCRJ0Opdc6chEXkGS41TERECiZ5IiINY5InItKwFlnWoKCgAOvXr0dlZSWm\nTJmCa6+9tiWqISKis2iRJD9q1CiMGjUKv//+Ox555BEmeSIilbjdXZOZmYmwsDAkJiY67bdYLIiL\ni0NsbCxycnKc3ps1axaysrJ8EykREXnM7SSfkZEBi8XitM/hcCArKwsWiwWlpaXIy8vD9u3bISKY\nNm0aRowYgaSkJJ8HTURE7nG7uyY1NRU2m81pX0lJCWJiYmA0GgEAZrMZBQUF+Pjjj7Fx40ZUVlZi\n165d+Nvf/ubLmImIyE3N6pMvLy9HZGSksm0wGFBcXIw33ngD999//1mPrz97y2QywWQyNSccanfU\neVgJwAeWUPthtVqbtTJAs5J8c39BvZmiS1qizsNKAD6whNqPhhfAM2bM8Oj4Zo2Tj4iIgN1uV7bt\ndjsMBkNziiQiIh9qVpJPSUlBWVkZbDYbamtrkZ+fj7S0NLeP5wJlRETu8XaBMoibzGazhIeHyznn\nnCMGg0GWLFkiIiKFhYXSp08fiY6OltmzZ7tbnDRWtc1mk4CAKAGk1V8BARkCQJW61atXzbrVPWei\n9sjTn12uQlkPV6HsKPWeqlulH32iZuEqlEREpOBDQ4iI2gE+NMQH2F3TUeoFgM44NYSzdXF8PjWX\np901LbJAGVHbp84YfY7Pp9bGPnkiIg1jkici0jAmeSIiDePoGiKidoCja3yAo2s6Sr1q1s1JWNQ8\nnAxFREQKJnkiIg1jkici0jAmeSIiDWOSJyLSMCZ5IiINY5InItIwVRcoy87OPuMhtUTa5g+dTp1F\nyrgCZvtmtVq9mjzKyVD1cDJUR6lXzbr5NCxqHk6GIiIiBZM8EZGGMckTEWmYz5P87t27cccdd+C2\n227zddFEROQhnyf53r174+233/Z1sURE5AW3knxmZibCwsKQmJjotN9isSAuLg6xsbHIyclpkQCJ\niMh7biX5jIwMWCwWp30OhwNZWVmwWCwoLS1FXl4etm/f3iJBEhGRd9xK8qmpqQgJCXHaV1JSgpiY\nGBiNRnTu3BlmsxkFBQU4fPgw7r77bvzwww+8uiciUpnXM17Ly8sRGRmpbBsMBhQXF6Nbt25YsGCB\nW2XUf5QVZ74SEZ3
J25mup3md5H0xNdub5xUSEXUkDS+AZ8yY4dHxXo+uiYiIgN1uV7btdjsMBoO3\nxRERUQvwOsmnpKSgrKwMNpsNtbW1yM/PR1pamkdlZGdnN+u/IUTkiVOLo7X2Kyiom9onrglWq9W7\n3g9xg9lslvDwcDnnnHPEYDDIkiVLRESksLBQ+vTpI9HR0TJ79mx3ilI0VrXNZpOAgCgBpNVfAQEZ\nAkCVutWrV826ec4do254lBuoaZ62J1ehrIerUHaUetWsu2Oes0ppRpO4CiURESlUTfLskycico+3\nffLsrqmH3TUdpV416+6Y58zuGt9hdw0RESlUfcYrEXUEfK6tmpjkiaiF1UGtLqqqKnX+uLQl7K4h\nItIwjq4hImoHOLrGBzi6pqPUq2bdPOfWrltrI3s4uoaIiBRM8kREGsYkT0SkYUzyREQaxiRPRKRh\nTPJERBrGJE9EpGGcDEVE1A5wMpQPcDJUR6lXzbp5zq1dNydDERGRZjHJExFpGJM8EZGG+Xw9+Zqa\nGtx7773o0qULTCYTxo8f7+sqiIjITT6/kn///fcxduxYLFy4EOvWrfN18S3K4dindgguWNUOoBFW\ntQNwwap2AC5Y1Q7ABavaAbhgVTsAl7Qw+s+tJJ+ZmYmwsDAkJiY67bdYLIiLi0NsbCxycnIAAOXl\n5YiMjAQA+Pn5+TjcluVw7Fc7BBesagfQCKvaAbhgVTsAF6xqB+CCVe0AXLCqHYBLHSbJZ2RkwGKx\nOO1zOBzIysqCxWJBaWkp8vLysH37dhgMBtjtdgDAyZMnfR8xERG5za0kn5qaipCQEKd9JSUliImJ\ngdFoROfOnWE2m1FQUIBbbrkFq1evxr333ou0tLQWCZqIiNwkbtq9e7f069dP2V61apXccccdyvby\n5cslKyvL3eIEp2ZH8MUXX3zx5eHLE16PrtHpmvcUdNHYLDQiorbI69E1ERERSt87ANjtdhgMBp8E\nRUREvuF1kk9JSUFZWRlsNhtqa2uRn5/PPngiojbGrSSfnp6OoUOHYufOnYiMjERubi78/f0xb948\nXH/99UhISMC4ceMQHx9/1rJcDbtsC4xGIy655BIkJydj4MCBqsTgaqjq4cOHce2116JPnz647rrr\n8Pvvv6seU3Z2NgwGA5KTk5GcnHzGyKuWZrfbceWVV6Jv377o168fXn/9dQDqtlVjMandVseOHcOg\nQYOQlJSEhIQEPPHEEwDUbavGYlK7rYBTowaTk5Nx0003AVD/989VTB63k0c9+M1UV1cn0dHRsnv3\nbqmtrZW0AJdEAAAEIklEQVT+/ftLaWlpa4bQKKPRKIcOHVI1hs8++0y+//57pxvcjz76qOTk5IiI\nyAsvvCDTpk1TPabs7GyZO3duq8ZR3759+2TLli0iIlJVVSV9+vSR0tJSVduqsZjUbisRkZqaGhER\nOXHihAwaNEg2b96s+s+Vq5jaQlvNnTtXxo8fLzfddJOIqP/75yomT9upVdeuaWzYZVshKt8MdjVU\ndd26dZg0aRIAYNKkSVi7dq3qMQHqtlXPnj2RlJQEAAgMDER8fDzKy8tVbavGYgLU/7k6//zzAQC1\ntbVwOBwICQlR/efKVUyAum21d+9eFBYW4o477lDiULudXMUkIm13qeH6s2EBwGAwKL8IatPpdLjm\nmmuQkpKCRYsWqR2O4tdff0VYWBgAICwsDL/++qvKEZ3yxhtvoH///pgyZYoq/4U9zWazYcuWLRg0\naFCbaavTMQ0ePBiA+m118uRJJCUlISwsTOlSUrutXMUEqNtWDz30EF588UV06vT/aVHtdnIVk06n\n86idWjXJN3fYZUv64osvsGXLFhQVFeHNN9/E5s2b1Q7pDDqdrk204T333IPdu3fjhx9+QHh4OB5+\n+GFV4qiursatt96K1157DXq93uk9tdqquroaY8aMwWuvvYbAwMA20VadOnXCDz/8g
L179+Kzzz7D\np59+6vS+Gm3VMCar1apqW3344YcIDQ1FcnJyo1fJrd1OjcXkaTu1apJvy8Muw8PDAQA9evTA6NGj\nUVJSonJEp4SFhWH//lNr6uzbtw+hoaEqRwSEhoYqP/B33HGHKm114sQJ3HrrrZg4cSJuvvlmAOq3\n1emYJkyYoMTUFtrqtK5du+LGG2/Ed999p3pbNYzp22+/VbWtvvzyS6xbtw69e/dGeno6PvnkE0yc\nOFHVdnIV0+233+5xO7Vqkm+rwy7/+OMPVFVVATi1VPJ//vOfMxZjU0taWhqWLl0KAFi6dKmSPNS0\nb9//r9a5Zs2aVm8rEcGUKVOQkJCAqVOnKvvVbKvGYlK7rQ4ePKj8d/7PP//ERx99hOTkZFXbqrGY\nTidToPXbavbs2bDb7di9ezfee+89XHXVVVi+fLmq7eQqpmXLlnn+M+XLu8DuKCwslD59+kh0dLTM\nnj27tat36eeff5b+/ftL//79pW/fvqrFZTabJTw8XDp37iwGg0GWLFkihw4dkquvvlpiY2Pl2muv\nlSNHjqga0+LFi2XixImSmJgol1xyiYwaNUr279/fqjFt3rxZdDqd9O/fX5KSkiQpKUmKiopUbStX\nMRUWFqreVj/++KMkJydL//79JTExUebMmSMiompbNRaT2m11mtVqVUayqP37d9qnn36qxDRhwgSP\n2km1B3kTEVHL4+P/iIg0jEmeiEjDmOSJiDSMSZ6ISMOY5ImINIxJnohIw/4PB3pZQvbJt0oAAAAA\nSUVORK5CYII=\n", "text": [ "" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Markov Chains\n", "\n", "Do certain base pairs commonly occur after others? Given that we've just seen \"AG\" is it more or less likely that we'll see 'G' again? What about 'T'? These questions can be answered with Markov chains. \n", "\n", "\n", "To do this exercise we'll need to use `toolz.sliding_window`. Sliding window is a generalization on functions like sliding average or sliding max. 
Here are some examples demonstrating its use:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "list(sliding_window(3, range(10)))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ "[(0, 1, 2),\n", " (1, 2, 3),\n", " (2, 3, 4),\n", " (3, 4, 5),\n", " (4, 5, 6),\n", " (5, 6, 7),\n", " (6, 7, 8),\n", " (7, 8, 9)]" ] } ], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "list(sliding_window(5, range(10)))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 15, "text": [ "[(0, 1, 2, 3, 4),\n", " (1, 2, 3, 4, 5),\n", " (2, 3, 4, 5, 6),\n", " (3, 4, 5, 6, 7),\n", " (4, 5, 6, 7, 8),\n", " (5, 6, 7, 8, 9)]" ] } ], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "list(map(sum, sliding_window(5, range(10))))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ "[10, 15, 20, 25, 30, 35]" ] } ], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "from numpy import mean\n", "data = [4, 4, 6, 3, 4, 6, 7, 3, 8, 9, 10, 10, 10, 10, 8, 7]\n", "\n", "moving_average = compose(map(mean), sliding_window(3))\n", "\n", "list(moving_average(data))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 17, "text": [ "[4.666666666666667,\n", " 4.333333333333333,\n", " 4.333333333333333,\n", " 4.333333333333333,\n", " 5.666666666666667,\n", " 5.333333333333333,\n", " 6.0,\n", " 6.666666666666667,\n", " 9.0,\n", " 9.6666666666666661,\n", " 10.0,\n", " 10.0,\n", " 9.3333333333333339,\n", " 8.3333333333333339]" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll use sliding window without a function like `sum` or `mean`. 
Instead, we'll use it directly to chop our sequence of base pairs into a sequence of overlapping tuples of base pairs, called k-mers." ] }, { "cell_type": "code", "collapsed": false, "input": [ "seq = list(take(20, snoot))\n", "print(seq)\n", "pipe(seq, sliding_window(3), list)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['C', 'T', 'A', 'C', 'C', 'A', 'G', 'A', 'G', 'G', 'T', 'G', 'C', 'C', 'T', 'C', 'C', 'T', 'A', 'A']\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ "[('C', 'T', 'A'),\n", " ('T', 'A', 'C'),\n", " ('A', 'C', 'C'),\n", " ('C', 'C', 'A'),\n", " ('C', 'A', 'G'),\n", " ('A', 'G', 'A'),\n", " ('G', 'A', 'G'),\n", " ('A', 'G', 'G'),\n", " ('G', 'G', 'T'),\n", " ('G', 'T', 'G'),\n", " ('T', 'G', 'C'),\n", " ('G', 'C', 'C'),\n", " ('C', 'C', 'T'),\n", " ('C', 'T', 'C'),\n", " ('T', 'C', 'C'),\n", " ('C', 'C', 'T'),\n", " ('C', 'T', 'A'),\n", " ('T', 'A', 'A')]" ] } ], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "pipe(snoot, take(1000000), sliding_window(3), frequencies)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ "{('A', 'A', 'A'): 15769,\n", " ('A', 'A', 'C'): 15635,\n", " ('A', 'A', 'G'): 15442,\n", " ('A', 'A', 'T'): 15915,\n", " ('A', 'C', 'A'): 15599,\n", " ('A', 'C', 'C'): 15733,\n", " ('A', 'C', 'G'): 15629,\n", " ('A', 'C', 'T'): 15649,\n", " ('A', 'G', 'A'): 15747,\n", " ('A', 'G', 'C'): 15731,\n", " ('A', 'G', 'G'): 15585,\n", " ('A', 'G', 'T'): 15609,\n", " ('A', 'T', 'A'): 15923,\n", " ('A', 'T', 'C'): 15486,\n", " ('A', 'T', 'G'): 15648,\n", " ('A', 'T', 'T'): 15609,\n", " ('C', 'A', 'A'): 15641,\n", " ('C', 'A', 'C'): 15628,\n", " ('C', 'A', 'G'): 15758,\n", " ('C', 'A', 'T'): 15424,\n", " ('C', 'C', 'A'): 15629,\n", " ('C', 'C', 'C'): 15548,\n", " ('C', 'C', 'G'): 15426,\n", " ('C', 'C', 'T'): 15793,\n", " ('C', 'G', 'A'): 15466,\n", " ('C', 'G', 'C'): 
15468,\n", " ('C', 'G', 'G'): 15654,\n", " ('C', 'G', 'T'): 15599,\n", " ('C', 'T', 'A'): 15867,\n", " ('C', 'T', 'C'): 15621,\n", " ('C', 'T', 'G'): 15367,\n", " ('C', 'T', 'T'): 15654,\n", " ('G', 'A', 'A'): 15487,\n", " ('G', 'A', 'C'): 15518,\n", " ('G', 'A', 'G'): 15614,\n", " ('G', 'A', 'T'): 15710,\n", " ('G', 'C', 'A'): 15634,\n", " ('G', 'C', 'C'): 15702,\n", " ('G', 'C', 'G'): 15518,\n", " ('G', 'C', 'T'): 15571,\n", " ('G', 'G', 'A'): 15578,\n", " ('G', 'G', 'C'): 15626,\n", " ('G', 'G', 'G'): 15692,\n", " ('G', 'G', 'T'): 15613,\n", " ('G', 'T', 'A'): 15655,\n", " ('G', 'T', 'C'): 15478,\n", " ('G', 'T', 'G'): 15736,\n", " ('G', 'T', 'T'): 15581,\n", " ('T', 'A', 'A'): 15864,\n", " ('T', 'A', 'C'): 15829,\n", " ('T', 'A', 'G'): 15857,\n", " ('T', 'A', 'T'): 15617,\n", " ('T', 'C', 'A'): 15589,\n", " ('T', 'C', 'C'): 15413,\n", " ('T', 'C', 'G'): 15614,\n", " ('T', 'C', 'T'): 15496,\n", " ('T', 'G', 'A'): 15538,\n", " ('T', 'G', 'C'): 15600,\n", " ('T', 'G', 'G'): 15578,\n", " ('T', 'G', 'T'): 15629,\n", " ('T', 'T', 'A'): 15722,\n", " ('T', 'T', 'C'): 15527,\n", " ('T', 'T', 'G'): 15595,\n", " ('T', 'T', 'T'): 15565}" ] } ], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "def reshape(item):\n", "    # Unpack one (3-mer, count) pair from the frequencies dict\n", "    ((a, b, c), n) = item\n", "    # Key on the first two bases; map the third base to its count\n", "    return {(a, b): {c: n}}\n", "\n", "# merge_with(merge) may evaluate eagerly to {} rather than curry, so wrap it in a lambda\n", "pipe(snoot, take(1000000), sliding_window(3), frequencies, dict.items, map(reshape),\n", "     lambda dicts: merge_with(merge, dicts))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "pipe(file_pattern, glob, map(open), map(drop(1)), concat, map(str.upper), map(str.strip), concat,\n", " sliding_window(3),
frequencies, dict.items, map(reshape), lambda dicts: merge_with(merge, dicts))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }