{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# NLTK" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and an active discussion forum.\n", "\n", "Library documentation: http://www.nltk.org/" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# needed to display the graphs\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "showing info http://nltk.github.com/nltk_data/\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# import the library and download sample texts\n", "import nltk\n", "nltk.download()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "*** Introductory Examples for the NLTK Book ***\n", "Loading text1, ..., text9 and sent1, ..., sent9\n", "Type the name of the text or sentence to view it.\n", "Type: 'texts()' or 'sents()' to list the materials.\n", "text1: Moby Dick by Herman Melville 1851\n", "text2: Sense and Sensibility by Jane Austen 1811\n", "text3: The Book of Genesis\n", "text4: Inaugural Address Corpus\n", "text5: Chat Corpus\n", "text6: Monty Python and the Holy Grail\n", "text7: Wall Street Journal\n", "text8: Personals Corpus\n", "text9: The Man Who Was Thursday by G . K . 
Chesterton 1908\n" ] } ], "source": [ "# load the sample texts (text1 to text9) and sentences (sent1 to sent9)\n", "from nltk.book import *" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Displaying 11 of 11 matches:\n", "ong the former , one was of a most monstrous size . ... This came towards us , \n", "ON OF THE PSALMS . \" Touching that monstrous bulk of the whale or ork we have r\n", "ll over with a heathenish array of monstrous clubs and spears . Some were thick\n", "d as you gazed , and wondered what monstrous cannibal and savage could ever hav\n", "that has survived the flood ; most monstrous and most mountainous ! That Himmal\n", "they might scout at Moby Dick as a monstrous fable , or still worse and more de\n", "th of Radney .'\" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l\n", "ing Scenes . In connexion with the monstrous pictures of whales , I am strongly\n", "ere to enter upon those still more monstrous stories of them which are to be fo\n", "ght have been rummaged out of this monstrous cabinet there is no telling . 
But \n", "of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u\n" ] } ], "source": [ "# examine concordances (word + context)\n", "text1.concordance(\"monstrous\")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "imperial subtly impalpable pitiable curious abundant perilous\n", "trustworthy untoward singular lamentable few determined maddens\n", "horrible tyrannical lazy mystifying christian exasperate\n" ] } ], "source": [ "text1.similar(\"monstrous\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a_pretty is_pretty a_lucky am_glad be_glad\n" ] } ], "source": [ "text2.common_contexts([\"monstrous\", \"very\"])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": [ "iVBORw0KGgoAAAANSUhEUgAAAakAAAEZCAYAAAAt5touAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\n", "AAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xu4HFWZ7/HvDwJyJwnwiCIYEJWAkWBQLhPIDjDeTsDk\n", "iAIKKp4z4mhEHWYAxRkSPTpRZw5BFJhxRgRERQUzEB1umo5yEwIhBAhogKCAIEFuotzf+aNWpWvX\n", "7t637N577eT3eZ56unrVury1+vJ2VfXerYjAzMwsRxuMdABmZmbtOEmZmVm2nKTMzCxbTlJmZpYt\n", "JykzM8uWk5SZmWXLScrWG5IOkHTnEPSzStLBa9H+/ZIuX9s4hspQzcsgxn1J0i7DPa6NLk5Slq21\n", "TQZ1EfHLiNhtKLpKSw+Svi3pWUlPpmW5pC9J2qoSxwUR8bYhiGNIDOG8dCNpQkpET6XlXkknDaKf\n", "D0n65VDHZ6ODk5TlrG0yyFgAX46IrYBtgWOBfYFrJG02UkFJGsnX+tYRsSVwFPBPkt46grHYKOMk\n", "ZaOOCidLWilptaQLJY1L286S9KNK3S9Luiqtd0n6XWXbjpIulvSH1M8Zqfw1kn6eyh6R9B1JWw8k\n", "RICIeC4ilgCHAdtQJKxuRwZpX06T9LCkJyTdKmn3tO3bks6WdEU6KmtI2qkS/26SrpT0qKQ7Jb2n\n", "su3baS5+KulPQJekd0q6I/V1v6QT2szLxDTWY5Juk3Rord9vSFqY+rm+v6fsIuJ64HbgDT0mTNpa\n", "0nnpsVgl6ZQ0NxOBs4D90tHYH/v7INi6wUnKRqPjKd74DwReATwGfCNt+ztgkqQPSjoA+DDwgXoH\n", "kjYEFgL3Aq8GdgC+X6nyxdT3RGBHYM5gg42IPwFXAge02PzWVP7aiNgaeA9QfSN+H/B5iqOyW4AL\n", 
"Uvybpz6/A2wHHAmcmd7US0cBX4iILYBrgf8E/iYd5e0B/LwejKSNgEuBy1K/nwAukPS6SrUjKOZj\n", "HLCSYq56k/KN/iqNu7RFnTOALYGdgWkUj9mxEbEC+ChwXURsGRHj+xjL1jFOUjYaHQd8LiIejIjn\n", "gbnA4ZI2iIi/AMcApwHnA7Mj4sEWfbyFIgn9Q0T8JSKejYhrACLi7oj4WUQ8HxGrU1/T1jLm3wOt\n", "3mCfp3hznpjivysiHqpsXxgRV0fEc8ApFEcUrwJmAPdGxLkR8VJE3AJcTJHkSgsi4rq0T88AzwF7\n", "SNoqIp6IiFbJYl9g84iYFxEvRMQiimR+VKXOxRGxJCJepEiak/vY99XAo8A3gZNSn2ukDwxHAJ+J\n", "iKcj4j7gXykeR0hHprZ+cpKy0WgC8ON0Ouox4A7gBeDlABFxA3BPqvvDNn3sCNwXES/VN0h6uaTv\n", "p1NiT1Aku23WMuYdKN6ou4mInwNfpzgSfFjSv0nastwM3F+p+zTFUdYrKY7+9innIM3D+0hzkNqu\n", "OYWXvBt4J7Aqnc7bt0Wcr2zR7r5UXvb7cGXbX4At2u51YZuIGB8Ru0fE11ts3xbYKI1T+i3FnNl6\n", "zknKRqPfAm+PiHGVZbOI+D2ApI8DGwMPAie26eN3wE7pU3zdl4AXgTekU3DHMLDXSrcve0jaAjgE\n", "aPkNtYg4IyL2BnYHXgf8Q9mUIplW+xkPPEAxB4trc7BlRHy8bVDF0c9MitN4C4AftKj2ILCjpOrR\n", "y6vTmJ2ymuKIckKlbCeaCXq0fXnGhpCTlOVuY0mbVJYxwNnAl8ovEUjaTtJhaf11wBeA91Nc1zhR\n", "0p4t+r2B4hTcPEmbpb73T9u2AJ4GnpS0A82k0R9KC5JeJmkKRUJ4FDinR2Vpb0n7pGtBfwaeoUiQ\n", "pXdK+itJG6f9ui4iHgB+ArxO0tGSNkrLmyWVXyVXbZyNVPx91tbpNN1TtXFKv0pxnJjadFGcWiyv\n", "1w35qbcUzw+AL0raQtKrgU9TXG+D4sjtVWmObD3jJGW5+ynFm2a5/BNwOnAJcIWkJ4HrgLeko6Lz\n", "gXkRsTwiVgKfBc6vvMEFrHljPBTYleKo5HfAe1OducCbgCcovkRwEf3/NB8Ub/BPUhwhnAvcCOyf\n", "rpeVdcr+tgL+neI03qrU5quVet8FTqVIcnsBR6f4n6L40sWRFEc5vwf+meIIsj5G6Wjg3nQK8yMU\n", "ibwaN+na16HAO4BHKE5FHhMRv+6l397mpr/bPkHxweAeiiPOC2gm9Z9RfCvwIUl/6KU/WwfJP3po\n", "lidJ5wD3R8Q/jnQsZiPFR1Jm+fK32my95yRllq/R+B83zIaUT/eZmVm2fCRlZmbZGjPSAeRCkg8p\n", "zcwGISI6dv3UR1IVEZH9cuqpp454DOtCjI7Tcea+jJY4O81JyszMsuUkZWZm2XKSGmW6urpGOoQ+\n", "jYYYwXEONcc5tEZLnJ3mr6AnksJzYWY2MJIIf3HCzMzWR05SZmaWLScpMzPLlpOUmZlly0nKzMyy\n", "5SRlZmbZcpIyM7NsOUmZmVm2nKTMzCxbTlJmZpYtJykzM8uWk5SZmWXLScrMzLLlJGVmZtlykjIz\n", "s2w5SZmZWbacpMzMLFtOUmZmli0nKTMzy5aTlJmZZctJyszMsuUkZWZm2XKSMjOzbDlJmZlZtpyk\n", "zMwsW05SZmaWrRFLUhLHSRyT1j8k8YrKtm9KTByp2MzMLA+KiJGOAYlFwN9HcNPIxaDIYS7MzEYT\n", "SUSEOtX/sB1JSXxAYpnELRLnSZwqcYLEu4G9gQskbpbYRKIhMUXiUImlablL4p7U15RUZ4nEZRLb\n", 
"p/KGxDyJX6X6U1P5HqlsaYph11YxNhrNBWD+/O5l5TJ7dvN2/vxmvfnzu/dVLa+2r96HZj/1cer1\n", "Zs1qPbf1mEtl/XJbff/qbat1W5k1q4h10qTithxv1qzW+1+WlftR1q/HUx+7WlatD825r96vt2m3\n", "D/U6ZWzVWNrFVo+7HLfed3UOquWtnkfVvgcyL632rz5Pvc1Hu3HqWrXrrT4Uz43+PK69jVOqz2W7\n", "8lbtq2Xl87O/9Vtt621/2r3+W/XR6rVXfx8on1tTp/Z8PyjHW58MS5KS2AM4BZgewWTgk2lTRHAR\n", "sAR4XwRviuAZINK2SyPYK4K9gFuAr0qMAc4A3h3B3sA5wBfL/oANI9gH+BRwair/KHB66mcKcH+r\n", "OOtPmgULWr+5LFzYvF2woFlvwYLufVXLq+2r96HZT32cer1Fi1rPbz3mUll/qJLUokVFrCtWFLfl\n", "eIsWtd7/sqzcj7L+QJJUtT405756v96mv0mqjK0aS7vY6nGX49b7rs5Btby3JFWfo8Ekqfo8jVSS\n", "WrFi6JJUfS7blfeVdMrnZ3/rt9rW2/60e/236qO3JFW+D5TPrSVLer4flOOtT8YM0zgHAT+I4I8A\n", "ETymngeHbQ8XJU4E/hzBWRJvAPYArkp9bAg8WKl+cbq9GZiQ1q8FTpF4FXBxBCvXZmfMzGx4DFeS\n", "CnpJQpU6PUgcArwbOLAsAm6PYP82/Tybbl8k7V8E35O4HpgB/FTiuAh6HJc0GnMq611AVx8hm5mt\n", "XxqNBo2+DqmH0HAlqZ8DP5b4/xH8UWJ8Ki8T11PAVvVGEq8GvgG8NWJN8rkL2E5i3wiul9gIeG0E\n", "d7QbXGKXCO4BzpDYCZgEPZNUV9ecyvoA99DMbD3Q1dVFV+UNcu7cuR0db1iSVAR3SHwRWCzxIrAU\n", "WEXz6OnbwNkSf4Y1R0gCPgiMBxakU3sPRDBD4nDgaxJbp304DVomqbL/90ocDTwP/J7mNSwzM8vY\n", "cB1JEcF5wHlttl1M81oSwPR0exPw+Rb1lwHTWpRPr6yvBnZJ6/OAeX3FWD96mjkTJk/uWW/16qLu\n", "6tWwa/qe4OTJMHZs977Gjm2WV/up358xo+inPla93vTptFSNe+bMnvXL7a2ODutlvR1BTp8OO+wA\n", "ixfDtGnN8caNa+5vvZ+yrNyP6py2G7u+P/W5qm6fMaNnm3b7UK/TKt529+txr17dun51/qvlvc1r\n", "2aa/89Kqr/o89TYf/Ympt3a9mThx4OO1K6/PZbvyvp7X06f3/fzobd/62p9y7gfyfGpVVr4PrExX\n", "zPfeu3udsv9287KuyuLvpHLgv5MyMxu4debvpMzMzAbKScrMzLLlJGVmZtlykjIzs2w5SZmZWbac\n", "pMzMLFtOUmZmli0nKTMzy5aTlJmZZctJyszMsuUkZWZm2XKSMjOzbDlJmZlZtpykzMwsW05SZmaW\n", "LScpMzPLlpOUmZlly0nKzMyy5SRlZmbZcpIyM7NsOUmZmVm2nKTMzCxbTlJmZpYtJykzM8uWk5SZ\n", "mWXLScrMzLLlJGVmZtlykjIzs2wNKElJzJE4oVPBmJmZVQ30SCo6EkU/SYwZyfFz0mgMfV8D7bPR\n", "GNo4bGTNnz/0j2m7vsryqVOb486fX9yvxlKuV2PsRDxDqdU8zp7dLKsvZVz112G1/fr8OuszSUmc\n", "InGXxC+B16ey10j8t8QSiV9Ia8q/LXGmxHUSd0t0SZwrcYfEOZU+j5K4VWK5xLxK+dslbpK4ReLK\n", "VDZH4nyJq4FzJV6dxrwpLftV2p+U+r1F4ksSu0jcVNn+2ur90cxJyobaggXDn6SWLGmOu2BBcb8a\n", 
"S7lejbET8QylVvO4cKGT1GD1emQiMQU4AtgT2Ai4GbgJ+DfgoxGslNgHOBM4ODUbG8F+EocBlwD7\n", "AXcAN0rsCTwCzAPeBDwOXCHxLuBa4N+BAyK4T2JsJZTdgKkRPCuxKfDXaf21wHeBN0u8AzgMeEsE\n", "z0iMjeBxiSck9oxgGXAs8K21mjEzMxs2fZ0+OwC4OIJngGckLgE2AfYHfiitqbdxug3g0rR+G/BQ\n", "BLcDSNwOTEhLI4JHU/kFwIHAi8AvIrgPIILHK31eEsGzlbG+nhLei8BrU/khwLdSrNX2/wEcK/F3\n", "wHuBN7fb2Tlz5qxZ7+rqoqurq4/pMTNbvzQaDRrDeGjXV5IKQLWyDYDHI9irTZvn0u1LsCaxlPfH\n", "AM/X6tf7b+XPlfVPA7+P4BiJDaFISm1iBbgIOBX4ObAkgsfaDVJNUmZm1lP9A/zcuXM7Ol5f16R+\n", "AcyU2ERiS+BQioRxr8ThABKSeGM/xwvgBmCaxDYpyRwJNIDrgQMlJqR+x7fpYyvgobT+AWDDtH4l\n", "xRHTpqn9OIB0BHY5cBY0r4uZmVn+ej2SimCpxIXAMuAPFAkmgPcDZ0l8juJa1feAW8tm1S5a9PmQ\n", "xMnAIoojn4URxSlCiY8AF0tsADwMvK1FP2cCF0l8ALgM+FPq93KJycASieeAnwCfS22+C8wCruh9\n", "OkaPoTwTWfY10D59NnTdMnMmTJ48tH22e46U5Xvv3Rx37Fh44YWescyc2T3GTsQzlFrN44wZvY9d\n", "3dbq9bg+v9YUMaLfKh8WEn8PbBnBqe3rKNaHuTAzG0qSiIj+XLYZlHX+744kfgzsDBw00rGYmdnA\n", "rBdHUv3hIykzs4Hr9JGU/3efmZlly0nKzMyy5SRlZmbZcpIyM7NsOUmZmVm2nKTMzCxbTlJmZpYt\n", "JykzM8uWk5SZmWXLScrMzLLlJGVmZtlykjIzs2w5SZmZWbacpMzMLFtOUmZmli0nKTMzy5aTlJmZ\n", "ZctJyszMsuUkZWZm2XKSMjOzbDlJmZlZtpykzMwsW05SZmaWLScpMzPLlpOUmZlly0nKzMyy1bEk\n", "JXG8xB0S5w9xvw2JKUPZp5mZ5amTR1J/CxwSwTFlgcSYIeg30jKiZs+GRqN5f/785nq1fKg1GsVS\n", "jlcfa6jGLvspx2vXb7VetWwgcfQ1xnAqY5g1q2dZb/UHuq2/9WbPbl2vr777M3b1OVuut3seV+ej\n", "XSzVx3Awr4eyfTlW+Twvl0YDJk3q3uf8+UX9ss7s2cXt1KmtY6m/bqqx1ee6HL/Va6F8/VfXyz7K\n", "eFrNUV+v27JdedtoFPtSLjvv3H1bqz7WNR1JUhJnA7sAl0k8LnGexNXAuRLbSvxI4oa07J/abC7x\n", "LYlfSdwscVgq31Ti++mo7GJg08o4R0ncKrFcYl6l/E8SX5G4TeJKiX0lFkvcLXHoUOzjwoXdnxQL\n", "FjTXhyNJleM5SQ2tMoZFi3qW9VZ/oNv6W2/hwtb1hiJJVZ+z5Xq753F1PtrFUn0MB/N6KNuXY5XP\n", "83JpNGDFiu59LlhQ1C/rLFxY3C5Z0jqW+uumGlt9rsvxW70Wytd/db3so4yn1Rz19bot25W3jUax\n", "L+Vy333dt7XqY13TkSQVwUeBB4Eu4DRgInBwBO8HvgacFsFbgMOB/0jNTgF+FsE+wEHAVyU2ozgi\n", "+1MEuwOnQnGqT+KVwDxgOjAZeLPEu1Jfm6W+3gA8BXw+9TkrrZuZ2SgwFKffeqN0e0kEz6b1Q4CJ\n", "0po6W0psDrwVOFTi71P5y4CdgAOA0wEiWC5xa+r3zUAjgkcBJC4ADgT+C3gugstTP8uBZyJ4UeI2\n", 
"YEK7YOfMmbNmvauri66urkHttJnZuqrRaNAYxsO2Tiep0p8r6wL2ieC5aoWUtP53BL9pUS56ql+X\n", "UqXs+Ur5S1CMFcFLvV0XqyYpMzPrqf4Bfu7cuR0dbyS+gn4FcHx5R2LPtHp5rXyvtPoL4H2p7A3A\n", "GymS0Q3ANIltJDYEjgQWdzx6MzMbNp08koo268cD35BYlsZfDHwM+AIwP53O2wC4BzgMOAs4R+IO\n", "YAWwBCCChyROBhZRHEUtjODSFuP1FsugzZgB1bOBM2c21zt5lrDse+zY1mMN1dhlP33116reQGPI\n", "6axqGcv06T3Leqs/0G39rTdjRut6/X1celN9zpbr7Z7H1floF8vavh7KesuWNe+Xz3OAyZPhoou6\n", "1505E8aNg2nTivsrV8Kuu8ILL3SvU4+rVcyt5nrs2GLcet3Vq5v3q+szZsADDxTxlO3q8db7qm+f\n", "PLn7uFdd1az3wAPNOnU5vY6GkiJG/NvcWZAUngszs4GRRES0uiQzJPwfJ8zMLFtOUmZmli0nKTMz\n", "y5aTlJmZZctJyszMsuUkZWZm2XKSMjOzbDlJmZlZtpykzMwsW05SZmaWLScpMzPLlpOUmZlly0nK\n", "zMyy5SRlZmbZcpIyM7NsOUmZmVm2nKTMzCxbTlJmZpYtJykzM8uWk5SZmWXLScrMzLLlJGVmZtly\n", "kjIzs2w5SZmZWbacpMzMLFtOUmZmli0nKTMzy5aTlJmZZSu7JCUxR+KEXrbvKfGOyv1DJU4anujM\n", "zGw4ZZekgOhj+17AO9dUDi6N4MudDQkajWIp16u3APPnN+tU6wLMnl1sb9VnvZ92YwLMmtX9frV9\n", "2X91nFZj1vWnTn28duWzZ7feXo273Kd6WbuYWo3Z37Leyvsao69+yvatHoPe+mv1PGqn2l9Zt3ye\n", "1cfsTX/2b7BtBtP3UMewro4PrV/vrQzkdTzaZJGkJE6RuEvil8DrU9kiiSlpfVuJeyU2Aj4PHCGx\n", "VOK9Eh+SOCPV207iRxI3pGX/VD4t1V8qcbPEFgONsa8ktWBB+yS1cGGxvVWf9X7ajQmwaFH7JFX2\n", "Xx2n1Zh1/alTH69d+cKFrbdX4y73qV7WLqZOJam+xuirn7J9q8dgqJJUtb+ybvk8q4/ZGyep0Tk+\n", "9D9JDeR1PNqMGekAUiI6AtgT2Ai4Gbgpbe52VBXB8xL/CEyJ4PjU/oOVKqcDp0VwjcROwGXA7sAJ\n", "wMciuE5iM+DZTu6TmZkNjRFPUsABwMURPAM8I3FJH/WVllYOASaquXVLic2Ba4DTJC5IYz3QqvGc\n", "OXPWrHd1ddHV1dXPXTAzWz80Gg0aw3iYmUOSClonnReADdP6Jv3sS8A+ETxXK/+yxELgfwHXSLwt\n", "grvqjatJyszMeqp/gJ87d25Hx8vhmtQvgJkSm0hsCRyayldBcU0KOLxS/0lgy8r9aoK7AorTgAAS\n", "k9PtayK4PYKvADeSrnuZmVneRvxIKoKlEhcCy4A/ADdQHF39C/ADiY8AP6F5fWoRcLLEUuCfU3m5\n", "7XjgGxLLKPZtMfAx4JMS04GXgNuA/x5onNUzf+V6tWzmTJg8uXXbGTNg113b99nurGK9fPr09nGM\n", "HduMoxpTX/pTp1089fIZM1pvr8Zd3i5b1n7/qzG1GrO/Zb2V9zVGX/2U7Vs9Br311+rxa6c+RllW\n", "Ps+qY/ZmMGet+9umk2fER/ps+0iPD/1/vgzkdTzaKKKvb3yvHySF58LMbGAkERHtview1nI43Wdm\n", "ZtaSk5SZmWXLScrMzLLlJGVmZtlykjIzs2w5SZmZWbacpMzMLFtOUmZmli0nKTMzy5aTlJmZZctJ\n", 
"yszMsuUkZWZm2XKSMjOzbDlJmZlZtpykzMwsW05SZmaWLScpMzPLlpOUmZlly0nKzMyy5SRlZmbZ\n", "cpIyM7NsOUmZmVm2nKTMzCxbTlJmZpYtJykzM8uWk5SZmWXLScrMzLLlJGVmZtnqaJKSmCnxksTr\n", "O9T/FInTO9G3mZmNPEVE5zoXFwKbAjdHMGeI+x4TwQtD15+ik3NhZrYukkREqFP9d+xISmILYB9g\n", "NnBEKuuSWCyxQOJuiXkSx0jcIHGrxC6p3nYSP0rlN0jsn8rnSJwvcTVwnsQ0iUvL8STOSf0sk5iV\n", "ys+UuFHiNql/iXL+fGg0ivVGo7kMh+q4neh3XTDc+zKa5m7+/GIZaJvZs4ulbFv2Uy+DnvPRaBT1\n", "6mXl66i3+WvX16xZvcdcb1fd73osreKfPbvZRxnn9tsXt7Nmwc4794y92r7cv/J+2Ud1n6vL/Pkw\n", "aVKxPmlSMcbUqc1t5VxPmlTcVrdNmlTcnz+/aFc+VvX+R9PzdCDGdLDvdwGXRfBbiUck3pTK3wjs\n", "BjwG3At8M4K3SBwPfAL4NHA6cFoE10jsBFwG7J7a7wZMjeBZia7KeP8IPBbBGwEkxqbyUyJ4TGJD\n", "4CqJSREs7y3wBQvg8cehq6v7A9/V1abBEGo0muMO5XhD3d9IGu59GU1zt2BBcfupTw2szapVxfqE\n", "CUXbsp9Vq7qXfepTPeej0YCFC+HrX+9e1mgUryNoP3/t+irbtVNvV93veixl3NX4Fy6Ebbct+ihf\n", "7w8/XGxbtAiefLL52i/HqbYvYyjvl/ta3ed6vCtWNG9/9zt45pnuiXDVKrj/fnjqKXjooea2FStg\n", "zJhiueUWGJve2bbdtnv/5XvWuqaTSeoo4LS0/sN0fyFwYwQPA0isBC5PdW4Dpqf1Q4CJah5Abimx\n", "ORDAJRE822K8g0lHbAARlE+VIyT+hmJfX0GR7HpNUmZmloeOJCmJ8RQJ5w0SAWxIkWB+At0SzEuV\n", "+y9V4hGwTwTP1foF+HNvQ9fq7wycAOwdwRMS5wCbtGs8Z84coPhEs2pVF3Q7UDMzs0ajQWMYzy12\n", "6kjqcOC8CP62LJBoAAf2s/0VwPHAv6S2e0awrI82VwIfpzhdWJ7u2wp4GnhS4uXAO4BF7Took1Sj\n", "UZziMDOz7rq6uuiqnFecO3duR8fr1BcnjgR+XCu7KJW3+wpdVLYdD+ydvgBxO3BcrV6rNv8PGCex\n", "XOIWoCsltqXAncAFwNWD3B8zMxsBHTmSiuCgFmVnAGfUyqZX1hcDi9P6oxQJrd7H3Nr9apungQ+1\n", "aHPsQOOfORMmTy7Wh/tCZDneUI+7Ll1QHanHZDSYOXNwbVauLNZ33bV7PytX9iyrz0dXF6xe3bNs\n", "7Njm66iddn098MDA2lX3e8aM1tuq8a9e3eyjfL2ffXZRtmxZ8QWFdmOU5eUXGKr72m6fx46FRx8t\n", "6l50UTGnjzzSbAvFXC9eDNOmNccv2229dTH+uHGwww4956A/cz1adfTvpEYT/52UmdnAjdq/kzIz\n", "M1tbTlJmZpYtJykzM8uWk5SZmWXLScrMzLLlJGVmZtlykjIzs2w5SZmZWbacpMzMLFtOUmZmli0n\n", "KTMzy5aTlJmZZctJyszMsuUkZWZm2XKSMjOzbDlJmZlZtpykzMwsW05SZmaWLScpMzPLlpOUmZll\n", "y0nKzMyy5SRlZmbZcpIyM7NsOUmZmVm2nKTMzCxbTlJmZpYtJykzM8uWk5SZmWXLSWqUaTQaIx1C\n", "n0ZDjOA4h5rjHFqjJc5Oc5IaZUbDE3c0xAiOc6g5zqE1WuLsNCcpMzPLlpOUmZllSxEx0jFkQZIn\n", 
"wsxsECJCnerbScrMzLLl031mZpYtJykzM8vWep+kJL1d0p2SfiPppGEYb0dJiyTdLuk2Scen8vGS\n", "rpT0a0lXSBpbafOZFN+dkt5aKZ8iaXnadnql/GWSLkzl10t69VrEu6GkpZIuzTVOSWMl/UjSCkl3\n", "SNon0zg/kx735ZK+m/od8TglfUvSw5KWV8qGJS5JH0xj/FrSBwYR51fT475M0sWSth7JOFvFWNl2\n", "gqSXJI3PcS5T+SfSfN4m6csjHScAEbHeLsCGwEpgArARcAswscNjbg9MTutbAHcBE4GvACem8pOA\n", "eWl99xTXRinOlTSvJd4AvCWt/xR4e1r/GHBmWj8C+P5axPt3wAXAJel+dnEC5wIfTutjgK1zizON\n", "dQ/wsnT/QuCDOcQJHADsBSyvlHU8LmA8cDcwNi13A2MHGOdfAxuk9XkjHWerGFP5jsBlwL3A+Ezn\n", "cjpwJbBRur/dSMcZEet9ktoPuKxy/2Tg5GGOYQFwCHAn8PJUtj1wZ1r/DHBSpf5lwL7AK4AVlfIj\n", "gbMrdfZJ62OARwYZ26uAq9KT99JUllWcFAnpnhblucU5nuIDybjUx6UUb7BZxEnx5lN9w+p4XMBR\n", "wFmVNmcDRw4kztq2WcB3RjrOVjECPwTeSPckldVcAj8ADmpRb0TjXN9P9+0A/K5y//5UNiwkTaD4\n", "NPMrijeEh9Omh4GXp/VXprhKZYz18gdoxr5mvyLiBeCJ6imGATgN+AfgpUpZbnHuDDwi6RxJN0v6\n", "pqTNc4szIv4I/CvwW+BB4PGIuDK3OCs6Hdc2vfQ1WB+m+DSfVZyS3gXcHxG31jZlE2PyWuDAdHqu\n", "IWnvHOJc35NUjNTAkrYALgI+GRFPVbdF8RFjxGIDkDQD+ENELAVa/g1EDnFSfEp7E8WphTcBT1Mc\n", "Ea+RQ5ySXgN8iuLT6yuBLSQdXa2TQ5yt5BpXlaRTgOci4rsjHUuVpM2AzwKnVotHKJy+jAHGRcS+\n", "FB9OfzBZmLzQAAAFFElEQVTC8QBOUg9QnCsu7Uj3LN8RkjaiSFDnR8SCVPywpO3T9lcAf2gT46tS\n", "jA+k9Xp52Wan1NcYYOv0SX4g9gcOk3Qv8D3gIEnnZxjn/RSfUm9M939EkbQeyizOvYFrI+LR9Mny\n", "YorTzbnFWer04/xoi74G9fqT9CHgncD7K8W5xPkaig8my9Jr6VXATZJenlGMpfspnpek19NLkrYd\n", "8Th7Oxe4ri8UnxzupngSbczwfHFCwHnAabXyr5DO+1IcCdQvAG9McWrrbpoXLX8F7JP6rF+0PCua\n", "54kH/cWJ1Mc0mteksosT+AXwurQ+J8WYVZzAnsBtwKap/3OBj+cSJz2vT3Q8LorrdPdQXEAfV64P\n", "MM63A7cD29bqjVic9Rhr26rXpHKby+OAuWn9dcBvs4hzsG9c68oCvIPigvZK4DPDMN5Uims8twBL\n", "0/L29OBdBfwauKL6wFGcLlhJcTH7bZXyKcDytO1rlfKXURyq/wa4HpiwljFPo/ntvuzipEgANwLL\n", "KD4Jbp1pnCdSvKEup0hSG+UQJ8WR8oPAcxTXEY4drrjSWL9JywcHGOeHU7v7aL6WzhzJOCsxPlvO\n", "ZW37PaQklclcrokzPR/PT+PeBHSNdJwR4X+LZGZm+Vrfr0mZmVnGnKTMzCxbTlJmZpYtJykzM8uW\n", "k5SZmWXLScrMzLLlJGU2AJJOk/TJyv3LJX2zcv9fJX16kH13Kf0kSottUyX9Kv2MwgpJf1PZtl3a\n", "dlOq9x4VP1nys0HE8NnBxG7WKU5SZgNzNcW/jELSBsA2FH+RX9oPuKY/HaX2/am3PcXPpRwXERMp\n", 
"/iD8OEnvTFUOBm6NiCkRcTXwf4D/GxEH96f/ms8Moo1ZxzhJmQ3MdRSJCGAPin919JSKH158GcVv\n", "g90s6eD0X9lvlfSfkjYGkLRK0jxJNwHvUfGjmyvS/Vltxvw4cE5E3AIQxf9AOxE4WdKewJeBd6n4\n", "ccp/Av4K+Jakr0jaQ9INaduy9I9ukXR0OvpaKulsSRtImgdsmsrO78DcmQ3YmJEOwGw0iYgHJb0g\n", "aUeKZHUdxU8N7Ac8CdxK8WOa51D8Ns9KSecCfwucTvHfxFdHxBRJm1D826HpEXG3pAtp/d/Gdwe+\n", "XSu7CdgjIpalxDQlIspfeZ4OnBARN0v6GjA/Ir6b/tHnGEkTgfcC+0fEi5LOBN4fESdL+nhE7DVU\n", "82W2tnwkZTZw11Kc8tufIkldl9bLU32vB+6NiJWp/rnAgZX2F6bb3VK9u9P979D+Zxx6+3kH9bL9\n", "OuCzkk6k+P9pz1CcHpwCLJG0FDiI4h+HmmXHScps4K6hOKU2ieKfa15PM2ld26K+6H6E9HSbftsl\n", "mjsokkrVFIpTjb2KiO8BhwJ/AX6ajrIAzo2IvdKyW0R8vq++zEaCk5TZwF0LzAAejcJjFD89sF/a\n", "9mtgQnn9BzgGWNyinztTvV3S/aPajPcN4EPp+hPpF07nUfycRq8k7RwR90bEGcB/USTWnwGHS9ou\n", "1RkvaafU5Pl0WtAsC05SZgN3G8W3+q6vlN1K8ZPwf0yn1I4FfijpVuAF4OxUb80RVar3EeAn6YsT\n", "D9PimlREPAQcDXxT0gqKI7n/jIifVPps93MG75V0WzqttwdwXkSsAD4HXCFpGcVPcWyf6v87cKu/\n", "OGG58E91mJlZtnwkZWZm2XKSMjOzbDlJmZlZtpykzMwsW05SZmaWLScpMzPLlpOUmZlly0nKzMyy\n", "9T9kxgDDol5HqgAAAABJRU5ErkJggg==\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# see where in a text certain words are found to occur\n", "text4.dispersion_plot([\"citizens\", \"democracy\", \"freedom\", \"duties\", \"America\"])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "44764" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# count of all tokens (including punctuation)\n", "len(text3)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2789" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# number of distinct tokens\n", "len(set(text3))" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'among',\n", " u'the',\n", " 
u'merits',\n", " u'and',\n", " u'the',\n", " u'happiness',\n", " u'of',\n", " u'Elinor',\n", " u'and',\n", " u'Marianne',\n", " u',',\n", " u'let',\n", " u'it',\n", " u'not',\n", " u'be',\n", " u'ranked',\n", " u'as',\n", " u'the',\n", " u'least',\n", " u'considerable',\n", " u',',\n", " u'that',\n", " u'though',\n", " u'sisters',\n", " u',',\n", " u'and',\n", " u'living',\n", " u'almost',\n", " u'within',\n", " u'sight',\n", " u'of',\n", " u'each',\n", " u'other',\n", " u',',\n", " u'they',\n", " u'could',\n", " u'live',\n", " u'without',\n", " u'disagreement',\n", " u'between',\n", " u'themselves',\n", " u',',\n", " u'or',\n", " u'producing',\n", " u'coolness',\n", " u'between',\n", " u'their',\n", " u'husbands',\n", " u'.',\n", " u'THE',\n", " u'END']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the texts are just lists of strings\n", "text2[141525:]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "FreqDist({u',': 18713, u'the': 13721, u'.': 6862, u'of': 6536, u'and': 6024, u'a': 4569, u'to': 4542, u';': 4072, u'in': 3916, u'that': 2982, ...})" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# build a frequency distribution\n", "fdist1 = FreqDist(text1) \n", "fdist1" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(u',', 18713),\n", " (u'the', 13721),\n", " (u'.', 6862),\n", " (u'of', 6536),\n", " (u'and', 6024),\n", " (u'a', 4569),\n", " (u'to', 4542),\n", " (u';', 4072),\n", " (u'in', 3916),\n", " (u'that', 2982),\n", " (u\"'\", 2684),\n", " (u'-', 2552),\n", " (u'his', 2459),\n", " (u'it', 2209),\n", " (u'I', 2124),\n", " (u's', 1739),\n", " (u'is', 1695),\n", " (u'he', 1661),\n", " (u'with', 1659),\n", " (u'was', 1632)]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], 
"source": [ "fdist1.most_common(20)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "906" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fdist1['whale']" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": [ "iVBORw0KGgoAAAANSUhEUgAAAZQAAAEZCAYAAACw69OmAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\n", "AAALEgAACxIB0t1+/AAAIABJREFUeJztnXuc1FX9/58vQQEVXVEDvJviBUVBECxXRSvDvJcpWmpK\n", "3tAfWmYudlGzzEtlakHfvCRewzTvhpiK14BSNvGCookJChmyAtqqwPv3xznjDrADOzuf2Tm7834+\n", "HvPYz+fM57zm9ZnZnfee8z4XmRmO4ziOUyprVNqA4ziO0zHwgOI4juNkggcUx3EcJxM8oDiO4ziZ\n", "4AHFcRzHyQQPKI7jOE4mlC2gSLpe0jxJ0/PKekh6WNKrkiZKqsl7brSkmZJmSNo/r3ygpOnxuSvz\n", "yrtIGh/LJ0vaMu+54+NrvCrpuHLdo+M4jtNEOVsofwCGrVBWBzxsZtsBj8RzJPUFjgL6xjpjJCnW\n", "GQuMMLM+QB9JOc0RwPxYfgVwadTqAfwYGBwf5+cHLsdxHKc8lC2gmNmTwIIVig8BxsXjccBh8fhQ\n", "4DYz+8TMZgGvAUMk9Qa6m9nUeN2NeXXyte4EvhCPvwxMNLMGM2sAHmblwOY4juNkTFvnUHqa2bx4\n", "PA/oGY83AWbnXTcb2LSZ8jmxnPjzLQAzWwK8L2nDVWg5juM4ZaRiSXkLa774ui+O4zgdhM5t/Hrz\n", "JPUys7mxO+s/sXwOsHnedZsRWhZz4vGK5bk6WwBvS+oMrG9m8yXNAYbm1dkceLQ5M9tuu60tXryY\n", "efNCo2mbbbahe/fu1NfXA9C/f38AP/dzP/fzqj/v2TN0KM2bNw8zy+W4l8fMyvYAtgKm551fBpwb\n", "j+uAS+JxX6AeWAvYGngdUHxuCjAEEPAgMCyWjwTGxuPhwB/jcQ/gX0ANsEHuuIA/K4Xzzz+/pPod\n", "SSMFD1lopOAhFY0UPKSikYKHVDTi92az3/lla6FIug3YB9hI0luEkVeXALdLGgHMAo6M3+ovSbod\n", "eAlYAoyMxnOB4wagG/CgmU2I5dcBN0maCcyPQQUze0/SRcDf43UXWkjOr0Qu4raWxsbGkup3JI0U\n", "PGShkYKHVDRS8JCKRgoeUtIoRNkCipkdXeCpLxa4/mLg4mbKnwX6NVP+ETEgNfPcHwjDlh3HcZw2\n", "oqpnyudyJ61l2LDSRyN3FI0UPGShkYKHVDRS8JCKRgoeUtIohJp6lqoPSVbN9+84jlMskgom5au6\n", "hZIbxdBaGhqaTc1UpUYKHrLQSMFDKhopeEhFIwUPKWkUoqoDiuM4jpMd3uVVxffvOI5TLN7l5TiO\n", "45Sdqg4onkPJTiMFD1lopOAhFY0UPKSikYKHlDQKUdUBxXEcx8kOz6FU8f07juMUi+dQHMdxnLJT\n", "1QHFcyjZaaTgIQuNFDykopGCh1Q0UvCQkkYhqjqgOI7jONnhOZQqvn/HcZxiWVUOpa032HIcx3Ha\n", 
"GDN4/XV44gnYZhvYZ5/yvE5Vd3l5DiU7jRQ8ZKGRgodUNFLwkIpGCh6K0Vi2DKZPhzFjYPhw2HRT\n", "6NMHRoyABx4oXw7FWyiO4zjtnCVLYNq00AJ54gl48klYsGD5azbaCPbeGwYOLJ+PiuRQJJ0JfJuw\n", "re81ZnalpB7AeGBL4m6OuZ0WJY0GTgSWAqPMbGIsH0jYzbErYTfHM2N5F+BGYDfCbo5Hmdmbzfjw\n", "HIrjOO2Oxkb4+9+bAsgzz8Dixctfs9lmoWtr773DY/vtQc3vBF8Uq8qhtHlAkbQzcBuwO/AJMAE4\n", "FTgF+K+ZXSbpXGADM6uT1Be4NV6/KfBXoI+ZmaSpwBlmNlXSg8BVZjZB0khgZzMbKeko4HAzG96M\n", "Fw8ojuMkz+LFIWjkAsiUKfDxx8tf06dPU/DYe2/YcstsAsiKpDaxcQdgipk1mtlS4HHga8AhwLh4\n", "zTjgsHh8KHCbmX1iZrOA14AhknoD3c1sarzuxrw6+Vp3Al9ozojnULLTSMFDFhopeEhFIwUPqWi0\n", "tYeFC+HBB+Hcc2GPPaCmBr785ZD/ePLJEEz69YPTT4fx4+Htt+HVV+Haa+G442CrrQoHk3LOQ6lE\n", "DuUF4Gexi6sR+ArwD6CnmeX25J0H9IzHmwCT8+rPJrRUPonHOebEcuLPtwDMbImk9yX1MLP3ynA/\n", "juM4JbFgQch7PP54eEybFhLrOdZYAwYNgsMPhwsvhNpa6NGjcn4L0eYBxcxmSLoUmAh8ANQTciP5\n", "15iksvdFLVq0iLq6Orp27QrAoEGDqK2tpaamBmiK5IXOc2Utvb7Qeb5Wa+pncV5TU1PR+im9n6XW\n", "T+HzyL+H9v55pPB+Zv15vPsu/O1vDdTXw5131jB9Ouy6a7i+vr6Gzp1h+PAG+veHfv1q+PznYdmy\n", "nF7bvp/19fVMmjSJxsZGVkfFJzZK+hmhpXEmMNTM5sburMfMbAdJdQBmdkm8fgJwPvBmvGbHWH40\n", "sLeZnRavucDMJkvqDLxjZhs389qeQ3Ecp+zMndvU+nj8cXjppeWfX2stGDIkJNH32Qc+9zlYZ53K\n", "eF0dqeVQkPSZ+HML4KuEpPu9wPHxkuOBu+PxvcBwSWtJ2hroA0w1s7nAQklDJAk4Frgnr05O6wjg\n", "keZ8eA4lO40UPGShkYKHVDRS8JCKRrH1Z8+GW26Bk08Oo6t694ZLLmlg7NgQTLp2hX33hQsugMce\n", "g4aGkGy/6CL44hcLB5MU3otVUal5KHdI2pCQBxlpZu9LugS4XdII4rBhADN7SdLtwEvAknh9rlkx\n", "kjBsuBth2PCEWH4dcJOkmYRhwyuN8HIcx8mKWbOWb4H861/LP7/OOiEH8vWvhxbIoEHQpUtFrJaV\n", "ind5VRLv8nIcp1jM4LXXQosiF0D+/e/lr1lvvZA4z3Vh7bYbrLlmZfxmja/l5TiO00rM4JVXYNKk\n", "EDyeeCIM081ngw1gr72aAkj//tCpU0XsVpSqDihZ5FDyR3BUs0YKHrLQSMFDKhopeKiERn4AyT16\n", "926gvr6pfm4Zk1wA6dcvDO3NykPqGoWo6oDiOI5jBjNnhuR4LoDMnbv8NX37wpFHNgWQvn3LMwu9\n", "veM5lCq+f8epRnI5kEmTmoLIO+8sf81nPhNGYQ0dGh5ZrYPVEfAciuM4VUtuL5D8Lqw5c5a/ZuON\n", "Q+DIBZEddvAA0hqqOqB4DiU7jRQ8ZKGRgodUNFLw0FqNf/87tD4efTQ8Ntpo5RxIrvWx776w446r\n", "DiDt+b0oh0YhqjqgOI7TMZg3rymAPPZY6NLK57Ofha99rSmAeA6kPHgOpYrv33HaKwsWhCG8uRbI\n", 
"iy8u//x664Xgsd9+IYDsvPPqR2E5LcNzKI7jtGsWL4annmoKIM89F3IjObp1C/NA9tsvPAYMgM7+\n", "7dbmVPVb7jmU7DRS8JCFRgoeUtGopAczeP55uPtueO21Bv74xxqWLGl6fs01wwKKuQAyePCqlzJp\n", "z+9FihqFqOqA4jhOOixdCk8/HYLI3XfDG2+E8v79w94gQ4Y0dWHtuSesvXZl/Tor4zmUKr5/x6k0\n", "jY3w17/CXXfBvffCf//b9FzPnnDooXDQQWFW+vrrV86n04TnUBzHSYaGhrC97V13wV/+Ah980PTc\n", "ttuGXQkPOyxsfeuJ9PZFVX9cvh9KdhopeMhCIwUPqWhk6eHtt2Hs2LAv+sYbwze+AXfcEYLJbruF\n", "fUCmTw/7ol92GXz+803BpKO9Fx1BoxDeQnEcpyzMnBm6s268ESZPbirv1CnkQQ47LDy22KJyHp1s\n", "8RxKFd+/42TNSy+Flsedd4ZRWjm6dg2tk8MPDzmRDTesnEenNJLLoUgaDXwTWAZMB04A1gHGA1sS\n", "d2w0s4a8608ElgKjzGxiLB9I2LGxK2HHxjNjeRfgRmA3wo6NR5nZm210e45TNZiFrqo77giPl19u\n", "em799eGQQ0IQ2X//dPdId7KjzXMokrYCTgJ2M7N+QCfCFr11wMNmth1hD/i6eH1f4CigLzAMGBP3\n", "kAcYC4wwsz5AH0nDYvkIYH4svwK4tDkvnkPJTiMFD1lopOAhFY1C9c3g2Wdh9GjYbjvYddeQA3n5\n", "ZejRA048MSTd//MfuOqqBg4/vLRgkvJ7Ua0ahahEC2UhYS/5tSUtBdYG3gZGA/vEa8YBkwhB5VDg\n", "NjP7BJgl6TVgiKQ3ge5mNjXWuRE4DJgAHAKcH8vvBH5T7ptynI7MsmUwdWpTd9asWU3PbbwxfPWr\n", "cMQRYa+Q/K1uP/ywza06FaQiORRJJwO/BP4HPGRmx0paYGYbxOcFvGdmG0i6GphsZrfE564F/kLo\n", "FrvEzL4Uy/cCvm9mB0uaDnzZzN6Oz70GDDaz91bw4TkUxynA0qXwzDMhgNx5J8ye3fRc795NQWSv\n", "vapzu9tqJakciqRtgLOArYD3gT9J+mb+NWZmksr+Tb/NNttQV1dH165dARg0aBC1tbWfLkuQaxr6\n", "uZ9Xy7kZvPFGDTfdBM8/38B77/Hpsu9f+lID++wDQ4fW8LnPwcKFoX6nTun49/Psz+vr65k0aRKN\n", "jY2sFjNr0wchH3Jt3vmxwG+Bl4Fesaw3MCMe1wF1eddPAIYAvYCX88qPBsbmXbNHPO4MvNucl/79\n", "+1spLFiwoKT6HUkjBQ9ZaKTgoRIa//632c9/bta3r1nIkpj177/Att7a7JxzzCZPNlu6tLweUtZI\n", "wUMqGiFsNP/9XokcygzgR5K6AY3AF4GpwAfA8YQE+vHA3fH6e4FbJf0K2BToA0w1M5O0UNKQWP9Y\n", "4Kq8OscDk4EjCEl+x3HyWLQodGXddFPYQyTX+7vRRnD00eGxxx6+b4jTciqVQ/k+4Qt/GfAc8G2g\n", "O3A7sAUrDxs+jzBseAlwppk9FMtzw4a7EYYNj4rlXYCbgAGEYcPDzWxWMz6sEvfvOJViyZIw2fCm\n", "m8LSJ//7Xyjv0iUM8T32WBg2bPnEuuPks6ocik9srOL7d6oDM6ivD0Hk1lvD7oY59toLjjsuJNfL\n", "tKK508FYVUDxtbxKIJUx4SlopOAhC40UPGSlMXNmA5ddBrvsEtbLuuKKEEy22y7MG/nXv+CJJ+Db\n", "324+mKRyHylopOAhJY1C+FpejtOB+PDD0JV1ww1hKfj6+lC+4YYwfHjo0ho82PMiTnnwLq8qvn+n\n", 
"Y2AWFl/8wx9g/HhYuDCUr7UWHHxwCCIHHBDOHadUkpqH4jhONrz9dsiL3HADzJjRVD54MJxwAhx1\n", "FGywQcXsOVWI51BKIJX+zBQ0UvCQhUYKHlal8dFHYfmTAw+EzTeHuroQTHr2hO99D154AaZMgVNP\n", "BaljvxdtqZGCh5Q0CuEtFMdJHDOYNi10ad16K7wXFxBac82wn8gJJ4Shvp39r9mpMJ5DqeL7d9Lm\n", "3XfhlltCIMnfW6R//xBEjjkmTEJ0nLbEcyiO005YsiTss3799XD//eEcwiitb3wjBJISe2odp2x4\n", "DqUEUunPTEEjBQ9ZaFTKw5tvwo9/DFtuGWasz5oVFmo86KCwPMrbb8OVVxYXTNrre5GiRgoeUtIo\n", "hLdQHKdCfPJJaIVccw1MmNC0llafPnDyySE/0rt3ZT06TjF4DqWK79+pDG+8AddeG7q15s4NZWut\n", "FZY/Oflk2Htvn3jopEtJORRJ6wL/M7OlkrYHtgf+YmEHRcdxWsDHH8O994bWyMSJTeU77ggnnRTW\n", "09pww8r5c5wsaEkO5Qmgi6RNgYcIy8TfUE5TbYXnULLTSMFDFhpZe3jttTBXZPPN4etfD8Gka9cw\n", "e/3JJ+HFF+E731k5mHTE96I9a6TgISWNQrQkhyIz+1DSCGCMmV0m6Z9lc+Q47ZyPPw5LoPz+9/Do\n", "o03lO+8curS++U2fwe50TFabQ5E0DRgJXAGMMLMXJU03s35tYbCceA7FyZLZs+G3vw35kf/+N5R1\n", "6xaWQDn5ZN+syukYlDoP5SxgNHBXDCbbAI9ladBx2jN//3tYGv5Pf2qaN7LLLnDKKWHyoe8z4lQL\n", "Lcmh9DSzQ8zsUgAzex14qrUvKGl7SdPyHu9LGiWph6SHJb0qaaKkmrw6oyXNlDRD0v555QMlTY/P\n", "XZlX3kXS+Fg+WdKWzXnxHEp2Gil4yEKjpfWXLg3zQ2prw2KMt90Whv0eeSQ89VQD9fUwcmTrg0l7\n", "ei+qQSMFDylpFKIlAWV0C8tahJm9YmYDzGwAMBD4ELgLqAMeNrPtCHvA1wFI6gscBfQFhgFjpE87\n", "DsYSuuH6AH0kDYvlI4D5sfwKwj71jlMyCxfCr38N224bhvk+/TSsv35YmPFf/wq5k5128q4tpzop\n", "mEORdADwFcKX+R+B3J9Id6CvmQ0u+cVDa+NHZraXpBnAPmY2T1IvYJKZ7SBpNLAs10KSNAG4AHgT\n", "eNTMdozlw4GhZnZqvOZ8M5siqTPwjplt3Mzrew7FaRGzZsFVV4X8yKJFoWybbeDMM+Fb34Lu3Svp\n", "znHajtbmUN4GngUOjT9zAguB72TkbThwWzzuaWa53a7nAT3j8SbA5Lw6s4FNgU/icY45sZz48y0A\n", "M1sSu9V6mNl7Gfl2qgAzeOaZkB+56y5YtiyU77NPGOp70EHQqVNlPTpOShQMKGb2T+Cfkm4pxyRG\n", "SWsBBwPnNvPaJqnsTYe9996buro6unbtCsCgQYOora2lJnZ85/oaC53Pnj2bddddt8XXN3e+ePFi\n", "Nttss1bXz1FTU9Pq+vl1K1U/lfdz8eLF9Oy5GXfcAffc08Arr0B9fQ2dO8PZZzdwxBEweHD6n0cW\n", "72cKn0cq72cKn0el3s/6+nomTZpEY2Mjq8XMVvkAaoGHgZnAG/Hxr9XVa4HuocCEvPMZQK943BuY\n", "EY/rgLq86yYAQ4BewMt55UcDY/Ou2SMedwbebc5D//79rRQWLFhQUv2OpJGCh1I1Fi0yGzNmgW22\n", "mVlon5j16GF23nlmc+a0jYeUNFLwkIpGCh5S0Qhho/nv9ZbMQ3mFMHT4OWBpXiD67+rD1Sp1/0hY\n", 
"wmVcPL+MkEi/VFIdUGNmdTEpfyswmNCV9VdgWzMzSVOAUcBU4AHgKjObIGkk0M/MTou5lcPMbHgz\n", "Hmx19+90fJYtC/uOnHsuvPNOKNthBzjrrDCjfe21K+vPcVJiVTmUlgSUKWY2JGND6xCS6lub2aJY\n", "1gO4HdgCmAUcaWYN8bnzgBOBJcCZZvZQLB9IWAamG/CgmY2K5V2Am4ABwHxguJnNasaHB5QqZ+pU\n", "GDUqbJsLsPvucOGF8OUvwxpVvbmD4zRPqQHlEqAT8Gfgo1y5mT2XpclKMGDAAJs2bVqr6zc0NHza\n", "31jtGil4KEbjnXdg9GgYNy6c9+oFl1wCBx/cQI8e7ec+yqmRgodUNFLwkIpGqTPl9wAMGLRC+b6t\n", "duQ4FeKjj8I8kp/+FBYvDsvGf+c78IMfhKG/ZZzz5TgdHt8PpYrvv5owg/vug+9+F15/PZQdcgj8\n", "8pdhkqLjOC2j1P1Qzie0UBR/AmBmP8nMoeOUkZdeCq2Q3D4kO+4YWin777/qeo7jFEdL0o4fxMdi\n", "YBlh9vxWZfTUZvhaXtlppOBhRY0FC8JIrV12CcFk/fVDIPnnPwsHkxTvo1IaKXhIRSMFDylpFGK1\n", "LRQz+0X+uaTLgYkFLnecirN0aVgi5Yc/DMvIr7EGnHoq/OQnsPFKC/A4jpMVRedQ4vDeqWbW7nue\n", "PYfS8Xj88bC+1j/jFnB77w1XXgklNkYdx4mUmkOZnne6BvAZwPMnTlK89RacfXbYkwRgiy3gF78I\n", "KwL7yr+O0za0JIdycHwcBOwPbGJmV5fVVRvhOZTsNCrl4eOP4dJLw8z2P/0JBg9u4MIL4eWXwx7u\n", "xQaTFN7LVDRS8JCKRgoeUtIoREtyKLMk9Qf2IozyehLwPeWdivPoo3DGGSF4AHzta/Czn8H221fW\n", "l+NUKy2ZKX8mcBJhpryAw4BrzOyq8tsrL55DaZ/MmRM2tPrjH8N5nz5w9dVhuRTHccpLqUuvTCes\n", "3PtBPF8HmGxm/TJ32sZ4QGlffPJJCBznnx9muXfrFma4f+970KVLpd05TnWwqoDS0uXvlhU4btd4\n", "DiU7jXJ7eOIJ2G23kHhfvBgOPTRMWPzBD5YPJqnfR3vSSMFDKhopeEhJoxAtWcvrD8AUSfldXteX\n", "zZHj5DF3LpxzDtx8czj/7GfDVrwHHlhZX47jrEyL5qHEZeJriUl5M2v9Er0J4V1e6bJkCYwZAz/6\n", "ESxcGFoho0fD978furocx6kMrcqhSBoMbGRmD65Q/hVgnpk9m7nTNsYDSpo88wyMHNk0OfErXwmt\n", "km22qawvx3Fan0O5FHipmfKXgF80U16MoRpJd0h6WdJLkoZI6iHpYUmvSpooqSbv+tGSZkqaIWn/\n", "vPKBkqbH567MK+8iaXwsnyxpy+Z8eA4lO40sPLzxRgMnngh77hmCyZZbwt13w/33tzyYpHAfHUUj\n", "BQ+paKTgISWNQqwqoHRvbpfDWLZRia97JWGHxR2BXQj7ydcBD5vZdsAj8Zy4BfBRQF9gGDBG+nS6\n", "2lhghJn1AfpIGhbLRxC2E+4DXEEIjk6iLFsGv/sdHHcc/OEPYY+SH/4wJN0PPdRnujtOe2FVXV6v\n", "FVqva1XPrfYFpfWBaWb22RXKZwD7mNk8Sb2ASWa2g6TRwDIzuzReNwG4gLCF8KMxKBH3jh9qZqfG\n", "a843symSOgPvmNlKywJ6l1flmTULRowIkxQhrAJ89dWw3XYVteU4TgFa2+X1iKSf5bUGkLSGpIuA\n", "R0vwszXwrqQ/SHpO0jVxbktPM5sXr5kH9IzHmwCz8+rPBjZtpnxOLCf+fAvAzJYA78dFLZ1EMINr\n", 
"roF+/UIw2WgjGD8eJkzwYOI47ZVVBZSzgW2A1yX9OQ4bnglsF59rLZ2B3YAxZrYbYa+VuvwLYrOh\n", "7E0Hz6Fkp1FM/bfegmHD4OSTw5ySr34VXnwR9t+/oeTuLe8rz04jBQ+paKTgISWNQhSch2Jmi4Hh\n", "krYBdiJ8wb9kZq+X+Jqzgdlm9vd4fgcwGpgrqZeZzZXUG/hPfH4OsHle/c2ixpx4vGJ5rs4WwNux\n", "y2t9M3tvRSPrrbcedXV1dO3aFYBBgwZRW1tLTU0YD5B74wudL168eJXPt+R88eLFJdXPp7X12+p8\n", "wYIGJkyAU0+tYeFCqK1t4Kyz4KtfrUGC2bMr/352pM+j1N9P//1O6/Oo1PtZX1/PpEmTaGxsZHVU\n", "ZE95SU8A3zazVyVdAKwdn5pvZpdKqgNqzKwuJuVvBQYTurL+CmxrZiZpCjAKmAo8AFxlZhMkjQT6\n", "mdlpMbdymJkNb8aH51DaiLffDi2SBx4I5wcfDP/3f9C7d2V9OY5THCWt5VUOJO0KXAusBbwOnAB0\n", "Am4ntCxmAUeaWUO8/jzgRGAJcKaZPRTLBwI3AN0Io8ZGxfIuwE3AAGA+MLy5EWseUMqPGdxyC/y/\n", "/wcNDWEb3quvhm9+00dvOU57JLmAkgoDBgywadNaP+m/oaHh0+ZhtWs0V3/ePDjlFLjnnnB+wAEh\n", "Eb/pps0IZOAhC40UPKSikYKHVDRS8JCKRsmLQ0raS9IJ8XhjSVu32o3T4TELI7Z22ikEk+7d4brr\n", "QndXoWDiOE77pyXL118ADAS2N7PtJG0K3G5me7aBv7LiXV7Z8+67YdmUO+4I51/6Elx7bdiS13Gc\n", "9k+pLZTDgUMJw3sxszlA9+zsOR2FO+8MrZI77oB11gmz3x96yIOJ41QLLQkoH5nZp3ugxEmIHQKf\n", "h5KNxvz5cPbZDRxxRGihDB0K06eH/EkxifdK30cqHlLRSMFDKhopeEhJoxAtCSh/kvR/QI2kkwnr\n", "bF1bNkdOu+K++2DnncNs97XXDiO4HnkEtvYsm+NUHS3dD2V/ILfK70Nm9nBZXbURnkNpPe+/D2ed\n", "BTfcEM5ra8PCjtu2aoU3x3HaC6XuKX828MeYO+lQeEBpHRMnhgUdZ88OG19dfDGceSZ06lRpZ47j\n", "lJtSk/LdgYmSnpJ0hqSeq63RTvAcSnEaixbBqafCl78cgsngwVBfD9/9Lixa1H7uI3UPqWik4CEV\n", "jRQ8pKRRiNUGFDO7wMx2Ak4HegNPSHqkbI6cJHn8cdh117Bcypprws9+Bk8/DTvsUGlnjuOkQotn\n", "yscFG48AjgbWNbNdymmsLfAur9Xz4Ydw3nlwZdwPs39/GDcOdmn3n77jOK2hpC4vSSMlTSKM7tqI\n", "sKijf51UAZMnw4ABIZh06gQ/+hFMmeLBxHGc5mlJDmUL4Cwz62tm55tZc/vMt0s8h9K8xkcfQV1d\n", "2Nv91Vehb98QXH7yk7A9b1t4qJRGCh5S0UjBQyoaKXhISaMQBfdDkbSemS0ELgdsxR0Pm9tfxGn/\n", "PPssHH982PBKgu9/Hy68EOKWMY7jOAVZ1Z7yD5jZgZJm0czuiWbW7qeueQ6liU8+CYn2n/4Uli4N\n", "80nGjYPPf77SzhzHSQlfvr4AHlAC06eHVkluJf9Ro+DnPw8z3x3HcfIpNSm/0hDhjjJs2HMoYab7\n", "oEFg1sCWW4YlVK68svhgUun7yEojBQ+paKTgIRWNFDykpFGIggFFUjdJGwIbS+qR99iKsBVvq5E0\n", "S9LzkqZJmhrLekh6WNKrkiZKqsm7frSkmZJmxGVgcuUDJU2Pz12ZV95F0vhYPlnSlqX47YiYhe6t\n", 
"E06Ajz+GAw8MLZV99620M8dx2iuryqGcBZwJbAK8nffUIuD3ZvabVr+o9AYwMD+xL+ky4L9mdpmk\n", "c4ENVthTfnea9pTvE/eUnwqcYWZTJT3I8nvK72xmIyUdBRzue8o3sXQpnHFGWF5eCgs6nn56pV05\n", "jtMeKHUtr1FmdlXGht4ABpnZ/LyyGcA+ZjZPUi9gkpntIGk0sMzMLo3XTQAuAN4EHjWzHWP5cGCo\n", "mZ0arznfzKZI6gy8Y2YbN+Oj6gLKhx/CMceEnRS7dIFbb4WvfrXSrhzHaS+UlEMxs6sk7SzpSEnH\n", "5R4lejLgr5L+IemkWNbTzObF43lAbs2wTYDZeXVnE1oqK5bPoakrblPgreh/CfD+isOeofpyKPPn\n", "wxe/GIJJTQ389a9NwcT7mNPxkIpGCh5S0UjBQ0oahSg4DyVH3AJ4H2An4AHgAOAp4MYSXndPM3tH\n", "0sbAw7FoMv9WAAAfOElEQVR18imxO6u6mg5lZtYsGDYMXnkFNt8cJkwIExYdx3GyYrUBhbB+167A\n", "c2Z2Qlxt+JZSXtTM3ok/35V0FzAYmCepl5nNjeuG/SdePgfYPK/6ZoSWyZx4vGJ5rs4WwNuxy2v9\n", "5iZiLlq0iLq6OrrGWXuDBg2itraWmpowHiAXyQud58paen2h83yt1tRf3fmsWTUccAD06tXA4YfD\n", "1VfXsOmmy19fU1NT0uuVWj+l97PU+lmcp/B+llq/I72fKXwelXo/6+vrmTRpEo2NjayOluRQ/m5m\n", "u0t6FtgPWAjMMLPtV6vevN7aQCczWxS3E54IXAh8EZhvZpdKqgNqVkjKD6YpKb9tbMVMAUYBUwmt\n", "p/ykfD8zOy3mVg6r1qT8I4/A4YeHpeeHDoW774b116+0K8dx2iul7ofyd0kbANcA/wCmAc+U4Kcn\n", "8KSkemAKcL+ZTQQuAb4k6VVC4LoEIK4ddjvwEvAXYGReFBhJ2I54JvCamU2I5dcBG0qaCZwF1DVn\n", "pKPnUG69FQ44IASTo44K3VyFgon3MafjIRWNFDykopGCh5Q0CrHaLi8zGxkPfyfpIWA9M/tna1/Q\n", "zN4AVvomj11SXyxQ52Lg4mbKnwX6NVP+EXBkaz22d8zgl7+Ec84J59/5DvziF7BGS/59cBzHaSWr\n", "mocykGbW8MphZs+Vy1Rb0RG7vJYtg7PPhl//Opz/8pdhR0XHcZwsaNU8lLgHyqoCSrufU93RAkpj\n", "Y1iT6/bbw66KN94Iw1fKHDmO47SeVuVQzGyome1b6FE+u21HR8qhNDSEYcG33w7rrRfyJcUEE+9j\n", "TsdDKhopeEhFIwUPKWkUoiXzUI6n+eXrS5mH4mTIu++GCYovvAC9e8Nf/hL2f3ccx2lLWjJs+Dc0\n", "BZRuhBFYz5nZEWX2VnY6QpfXiy+Glsns2bDDDqFlsqUvhek4TpnIdD+UuArweDP7chbmKkl7DyhT\n", "poRg0tAQNsK67z7osdICM47jONlR6jyUFfkQaPe7NUL7zqE8+WRYl6uhAU4/vYG//rW0YOJ9zOl4\n", "SEUjBQ+paKTgISWNQrQkh3Jf3ukaQF/CREOnQjz6KBx8cFg5+Oijw57v3bpV2pXjONVOS3IoQ/NO\n", "lwBvmtlb5TTVVrTHLq8JE8JSKo2N8K1vwbXXQqdOlXblOE61kEkORdJ65LVomltssb3R3gLKPffA\n", "kUeGHRZPOQXGjPHZ747jtC2l7il/iqS5wHTg2fj4R7YWK0N7yqH86U9wxBEhmJx5Jowd2xRMUuhX\n", "TcFDFhopeEhFIwUPqWik4CEljUK0ZPn6cwjb6f63bC6cVXLzzWEG/LJlcO658POfh617HcdxUqIl\n", 
"OZSJhD3ZP2gbS21He+jyuu46OOmksODj+eeHhwcTx3EqRal7yu8G3AD8Dfg4FpuZjcrSZCVIPaCM\n", "GQOnnx6Of/5zqGt2EX7HcZy2o9R5KL8nbGo1mZA7yeVR2j0p51B+9aumYHLFFasOJin0q6bgIQuN\n", "FDykopGCh1Q0UvCQkkYhWpJD6WRmvgB6G3LxxfCDH4TjMWPgtNMq68dxHKcltKTL62LgTeBe4KNc\n", "eanDhiV1IrR4ZpvZwZJ6AOOBLYFZwJFm1hCvHQ2cCCwFRsUdHnN7ttwAdAUeNLMzY3kX4EZgN2A+\n", "cJSZvdmMh6S6vHJ5kosuCnmS666DE06otCvHcZwmSu3yOoawhe4zNHV3ZdHldSZhW9/cN3od8LCZ\n", "bQc8Es+Je8ofRZihPwwYI32alh4LjDCzPkAfScNi+QjC/vR9gCuASzPwW1bMwgiuiy4KExVvvtmD\n", "ieM47YvVBhQz28rMtl7xUcqLStoM+AphP/hccDgEGBePxwGHxeNDgdvM7BMzmwW8BgyR1BvobmZT\n", "43U35tXJ17oT+EJzPlLJoZiFuSWXXw6dO8P48XDMMW3vo5L1U9FIwUMqGil4SEUjBQ8paRSiUvuh\n", "XEGY37JeXllPM5sXj+cBPePxJoQBATlmA5sCn8TjHHNiOfHnW9HnEknvS+qR4uz+Zcvg1FPh97+H\n", "tdaCO+4I63Q5juO0N1qSlN+dZvZDIbQIikbSQcB/zGzaCuuEfYqZmaSyJzcWLVpEXV0dXbt2BWDQ\n", "oEHU1tZSU1MDNEXyQue5spZev+L5/PkNXH55CCZdu8K99zaw++4ArdMr5bympqai9bN4P1f8z6tS\n", "9VP4PPLvob1/Him8nyl8HpV6P+vr65k0aRKNjY2sjjbfDyUm+Y8lLDTZldBK+TMhcA01s7mxO+sx\n", "M9tBUh2AmV0S608AzicMFHjMzHaM5UcDe5vZafGaC8xssqTOwDtmtnEzXiqWlF+yBI47Dm67DdZZ\n", "B+6/H4YOrYgVx3GcFpPUfihmdp6ZbR7zMMOBR83sWMIosuPjZccDd8fje4HhktaStDXQB5hqZnOB\n", "hZKGxCT9scA9eXVyWkcQkvwrUakcytKlYaXg226Dz32ugYceKi2YpNCvmoKHLDRS8JCKRgoeUtFI\n", "wUNKGoVIYT+UXBPhEuB2SSOIw4YBzOwlSbcTRoQtAUbmNStGEoYNdyMMG54Qy68DbpI0kzBseHiG\n", "fkti6VI48US45RZYd92QiN9zz0q7chzHKZ3W7Icyy8xmF7i8XdHWXV7LloV1ua6/PnRzTZgAtbVt\n", "9vKO4zgls6our4ItFEl9CCOvJq1QXiupi5m9nq3Njk1uNNf114fdFR94wIOJ4zgdi1XlUH4NLGym\n", "fGF8rt3TVjkUMzjjDLjmmjCa6/77YZ99itPIwkc5NVLwkIVGCh5S0UjBQyoaKXhISaMQqwooPc3s\n", "+RULY1lJExuridykxbFjoUsXuPde2G+/SrtyHMfJnoI5FEmvmdm2xT7Xnih3DsUMvvtd+PWvw6TF\n", "e+6BYcNWX89xHCdVWjts+B+STm5G7CQ6yPL15SS3Ntevfw1rrgl//rMHE8dxOjarCihnASdIelzS\n", "r+LjccLCi2e1jb3yUq4cillYfj63Ntcdd8CBBxankYWPttRIwUMWGil4SEUjBQ+paKTgISWNQhQc\n", "5RVnrH8e2BfYmTBf5H4ze7RsbjoIF1wQdljs1Cks9HjIIZV25DiOU36KXnqlI1GOHMpPfhL2NOnU\n", "KcyE//rXM5V3HMepKFkvveIU4OKLQzBZYw246SYPJo7jVBdVHVCyzKFcdlnIm0gwbhwcfXTxGln4\n", 
"qJRGCh6y0EjBQyoaKXhIRSMFDylpFKKqA0pW/OpXYUSXFGbCf/OblXbkOI7T9ngOpcT7v+qqMHER\n", "wkz4b387A2OO4ziJ4jmUMvHb3zYFk9/9zoOJ4zjVTVUHlFJyKDffDNdeG/oif/MbOOWU1umk0ifq\n", "fczpeEhFIwUPqWik4CEljUJUdUBpLR9+CGefHY5/+Us4/fTK+nEcx0kBz6G04v6vvBLOOgsGDYKp\n", "U0My3nEcpxpIKociqaukKZLqJb0k6eexvIekhyW9Kmli3Ls+V2e0pJmSZkjaP698oKTp8bkr88q7\n", "SBofyydL2jIr/42NcOml4fjHP/Zg4jiOk6PNA4qZNQL7mll/YBdgX0m1QB3wsJltR9gDvg5AUl/g\n", "KMLWw8OAMXEPeYCxwAgz6wP0kZRbfnEEMD+WXwFc2pyX1uRQrrsO3nkH+veH2to0+jNT0EjBQxYa\n", "KXhIRSMFD6lopOAhJY1CVCSHYmYfxsO1gE7AAuAQYFwsHwccFo8PBW4zs0/MbBbwGjBEUm+gu5lN\n", "jdfdmFcnX+tO4AtZ+P7oI7jkknDsrRPHcZzlqUhAkbSGpHpgHvCYmb1I2NBrXrxkHtAzHm8C5O9h\n", "PxvYtJnyObGc+PMtADNbArwvqceKPurr64vyfcMNMHs29OsHhx4KNTU1q62zOjqKRgoestBIwUMq\n", "Gil4SEUjBQ8paRSi4GrD5cTMlgH9Ja0PPCRp3xWeN0llHy2wzTbbUFdXR9euXQEYNGgQtbW1n77h\n", "uaZhTU0NH38Mf/5zA/37w3nn1bDGGss/v+L1fu7nfu7nHeG8vr6eSZMm0djYyGoxs4o+gB8B3wNm\n", "AL1iWW9gRjyuA+ryrp8ADAF6AS/nlR8NjM27Zo943Bl4t7nX7t+/v7WUa681A7O+fc2WLg1lCxYs\n", "aHH9QnQUjRQ8ZKGRgodUNFLwkIpGCh5S0Qhho/nv80qM8tooN4JLUjfgS8A04F7g+HjZ8cDd8fhe\n", "YLiktSRtDfQBpprZXGChpCExSX8scE9enZzWEYQkf6v55BP42c/C8Q9/GFYTdhzHcZanzeehSOpH\n", "SJivER83mdnlMcdxO7AFMAs40swaYp3zgBOBJcCZZvZQLB8I3AB0Ax40s1GxvAtwEzAAmA8Mt5DQ\n", "X9GLteT+x42Db30LttsOXnop7HXiOI5TjaxqHopPbFzN/S9ZAn37wsyZIbAcd1wbmXMcx0mQpCY2\n", "pkRL5qGMHx+CyWc/C8ccs/xzqYwJT0EjBQ9ZaKTgIRWNFDykopGCh5Q0ClHVAWV1LF0KP/1pOP7B\n", "D6BzRcbEOY7jtA+8y2sV9z9+PAwfDltuGVopa67ZhuYcx3ESxLu8WsGyZXDRReH4vPM8mDiO46yO\n", "qg4oq8qh3HUXvPgibL45HH9889ek0p+ZgkYKHrLQSMFDKhopeEhFIwUPKWkUoqoDSiGWLYOf/CQc\n", "19VBly6V9eM4jtMe8BxKM/d/zz1w2GGwySbw+usQV2ZxHMepejyHUgRmTbmTc8/1YOI4jtNSqjqg\n", "NJdD+ctf4NlnoWdPOOmkVddPpT8zBY0UPGShkYKHVDRS8JCKRgoeUtIoRFUHlBUxa8qdfP/70K1b\n", "Zf04juO0JzyHknf/EyfCl78MG28Mb7wB66xTQXOO4zgJ4jmUFmAGF14Yjr/3PQ8mjuM4xVLVASU/\n", "h/LYY/DMM9CjB5x2Wsvqp9KfmYJGCh6y0EjBQyoaKXhIRSMFDylpFKKqA0o+udzJd78L3btX1ovj\n", "OE57xHMoZjz+OAwdCjU1MGsWrL9+pZ05juOkiedQVkNu3slZZ3kwcRzHaS2V2AJ4c0mPSXpR0guS\n", 
"crss9pD0sKRXJU3MbRMcnxstaaakGZL2zysfKGl6fO7KvPIuksbH8smStmzOS//+/Xn6aXjkEVhv\n", "PRg1qrh7SaU/MwWNFDxkoZGCh1Q0UvCQikYKHlLSKEQlWiifAN8xs52APYDTJe0I1AEPm9l2hD3g\n", "6wAk9QWOAvoCw4AxcQ95gLHACDPrA/SRNCyWjwDmx/IrgEsLmcm1TkaNgg02yPI2HcdxqouK51Ak\n", "3Q38Jj72MbN5knoBk8xsB0mjgWVmdmm8fgJwAfAm8KiZ7RjLhwNDzezUeM35ZjZFUmfgHTPbuJnX\n", "NjDWXTfkTjbcsA1u2HEcpx2TbA5F0lbAAGAK0NPM5sWn5gE94/EmwOy8arOBTZspnxPLiT/fAjCz\n", "JcD7knoU8nHGGR5MHMdxSqVim9pKWhe4EzjTzBY19WKBmVloPZSXvffem9dfr2PJkq5ccAEMGjSI\n", "2tpaampC+ibX11jofPbs2ay77rotvr6588WLF7PZZpu1un6OmpqaVtfPr1up+qm8nx3l88ji/Uzh\n", "80jl/Uzh86jU+1lfX8+kSZNobGxktZhZmz+ANYGHgLPyymYAveJxb2BGPK4D6vKumwAMAXoBL+eV\n", "Hw2Mzbtmj3jcGXi3OR/9+/e3733PWs2CBQtaX7mDaaTgIQuNFDykopGCh1Q0UvCQikYIG81/t7d5\n", "DiUm1McRkubfySu/LJZdKqkOqDGzupiUvxUYTOjK+iuwrZmZpCnAKGAq8ABwlZlNkDQS6Gdmp8Xc\n", "ymFmNrwZLzZ3rtGz54rPOI7jOM2xqhxKJQJKLfAE8DyQe/HRhKBwO7AFMAs40swaYp3zgBOBJYQu\n", "sodi+UDgBqAb8KCZ5YYgdwFuIuRn5gPDzWxWM16sre/fcRynPZNUQEmJAQMG2LRp01pdv6Gh4dP+\n", "xmrXSMFDFhopeEhFIwUPqWik4CEVjWRHeTmO4zgdh6puoXiXl+M4TnF4C8VxHMcpO1UdUJrbU74Y\n", "UllXJwWNFDxkoZGCh1Q0UvCQikYKHlLSKERVBxTHcRwnOzyHUsX37ziOUyyeQ3Ecx3HKTlUHFM+h\n", "ZKeRgocsNFLwkIpGCh5S0UjBQ0oahajqgOI4juNkh+dQqvj+HcdxisVzKI7jOE7ZqeqA4jmU7DRS\n", "8JCFRgoeUtFIwUMqGil4SEmjEFUdUBzHcZzs8BxKFd+/4zhOsXgOxXEcxyk7FQkokq6XNE/S9Lyy\n", "HpIelvSqpImSavKeGy1ppqQZkvbPKx8oaXp87sq88i6SxsfyyZK2bM6H51Cy00jBQxYaKXhIRSMF\n", "D6lopOAhJY1CVKqF8gdg2ApldcDDZrYd8Eg8J24BfBTQN9YZE7cRBhgLjDCzPkAfSTnNEYTthPsA\n", "VwCXNmdi0aJFJd3EU089VVL9jqSRgocsNFLwkIpGCh5S0UjBQ0oahahIQDGzJ4EFKxQfQthrnvjz\n", "sHh8KHCbmX0St/F9DRgiqTfQ3cymxutuzKuTr3Un8IXmfLz++usl3cc//vGPkup3JI0UPGShkYKH\n", "VDRS8JCKRgoeUtIoREo5lJ5mNi8ezwN6xuNNgNl5180GNm2mfE4sJ/58C8DMlgDvS+pRJt+O4zgO\n", "aQWUT4lDr8o+/Kpnz56rv2gVNDY2luyho2ik4CELjRQ8pKKRgodUNFLwkJJGQcysIg9gK2B63vkM\n", "oFc87g3MiMd1QF3edROAIUAv4OW88qOBsXnX7BGPOwPvFvBg/vCHP/zhj+Iehb7XO5MO9wLHExLo\n", "xwN355XfKulXhK6sPsBUMzNJCyUNAaYCxwJXraA1GTiCkORfiUJjqR3HcZziqcjERkm3AfsAGxHy\n", 
"JT8G7gFuB7YAZgFHmllDvP484ERgCXCmmT0UywcCNwDdgAfNbFQs7wLcBAwA5gPDY0LfcRzHKRNV\n", "PVPecRzHyY6UurzanDj0+D0z+6jCPnqZ2dxKemgNceRcH6BLrszMnmhjD8u9d6l8ppWmvf5OOe2b\n", "JEd5tSE3A69I+kWFfVxX4dcvGkknAY8TBkBcCDwEXFABKyu+d0V/ppL2lPQNScfHx3HFGJDUtSVl\n", "q6j/RnxMKeZ1V8ODGWq1GElntqSsQN3FkhYVeCws0seRktaLxz+SdJek3YqoXytp3Xh8rKRfFVpx\n", "YxUa20t6RNKL8XwXST8sRiPWK+n3s02p1CivVB6EoLpTC6/tRfgCmxDP+xJm6reV15viz7NK0FgM\n", "LCrwWFiEzguE3FV9PN8BuKsVfnoBBwMHAZ+pwGd6M/AMMAa4Ovco8vWea0lZWz6Aaa2oczmwHrAm\n", "YSDLf4FjS33d3O9IG9//9PizFpgUf7+mFFMfELArMA04HXi8SA9PEEakTovnAl4sUiOL38+SP9eW\n", "Pqq6ywvAzJYBL7bw8hsIy8b8IJ7PJAwkaKsWxkBJmwAnSrqR8Av6aRLMzN5bnYCZ5f7r+inwNuEX\n", "FuAbhMmiLaXRzP4nCUldzWyGpO2LqI+kIwm/7I/Hot9IOsfM/lSMzooU+ZkOBPpa/Msrhti9tgmw\n", "dvzvN/d5rAesXaxexlzTijr7m9k5kg4nDIz5KvAkYYDLKpF0NHAMsLWk+/Ke6k4YGNPWLI0/DwKu\n", "MbP7JV1URP0lZmaSDgN+a2bXShpRpIe1zWxKbqWoqPdJkRqt/v3Mo9Wfa7FUfUApko3MbLykOgAz\n", "+0TSkjZ8/d8R/sP4LPBsM89vXYTWIWa2S975WEnPAz9qYf23JG1AGN79sKQFhF/WYvghsLuZ/QdA\n", "0saE+yspoBTJC4R5T2+3ou7+wLcIw9l/mVe+CDivZGclYGZjWlEt931wEHCHmb0vqaVfZM8A7wAb\n", "A78gBFcI78U/W+GlVOZI+j3wJeCS2AVZTBf/oji69JvAXpI6Ef7DL4Z3JW2bO5F0BOE9KoZSfj9z\n", "lPK5tuqFnJaxWNKGuRNJewDvt9WLm9lVwFWSxgL/B+xN+I/4STOrL1LuA0nfBG6L58MJ3WEt9XJ4\n", "PLxA0iTCf+UTivQg4N288/k0fRG1FRsDL0maCuQS+WZmh6yuopmNA8ZJOsLM7iinyTbiPkkzgEbg\n", "NEmficerxczeBN4E9iijv2I4krCY7OVm1hBbk+cUUf8oQovrRDObK2kLQqAshjMIf6fbS3ob+Bch\n", "QK2WvFbeurTy9zOPVn+uxeLDhosgznu5GtiJ0KWyMXCEmbXpf2AxyXkS8OdYdDihWX9V4VoraWwN\n", "XAl8PhY9TZjjMytDq6vzcDmhj/pWQiA5CnjezL7fhh6GNlduZpOK1DmIkFP7NBlvZj8pxVsliP8w\n", "NZjZUknrEBZgXe1oMUlPm9mekhaT1w0bMTNbrxx+Uya2ir5GWBWkB7CQ8F6s9vci7/fSWPmfLDOz\n", "xymC1n6uxeIBpUgkrQnkcgWvmFmxfaJZeJhOWFrmg3i+DjDZzPq1tZdSkHQZMIWQODXgKcJ9tVlA\n", "yQJJ/0cYoLAfIXfxdUICuNg+94og6Qtm9oikr9EUDHJfYmZmfy5QtcORZWCU9BDQQOiezuV0MLNf\n", "Fqy0ssZlK/49SLrUzM5tqUas0w/YkfB7atHHjcVotOh1PKAUh6Q9Cf9xdKaMH8xqPEwHBpvZ/+J5\n", "N8JyNC0OKLHZexJN9wLhD+bEjO2uysM0MxuwQtn0tgiMGX9xTDezfpKeN7Nd4nDTCWZWm6npMiHp\n", 
"QjM7X9INrPxeYGYntL2r9o+kF8xs5xI1Sv4bkXQBYWWSnYAHgAOAp8zsiFK8NYfnUIpA0s2EhHg9\n", "ef9xEPZiaUv+AEyR9GfCf5KHAdcXqXEPYVjjw8CyWNYm/11IOg0YCWyjvF07CSOCnm4LD2a2Z/y5\n", "bgZy/4s/P5S0KSEX1CsD3TbBzM6Ph6fS1EXj3w2l84ykXczs+WIrZvw3cgSha/k5MztBUk/glmI9\n", "tQT/pSmOLIbwlYyZ/UrS4zR1FX3LzKYVKdOt2GZzhtwK/AW4BDiXvBFBZlaJIaalcl8c8XY5TaPv\n", "WjNst9LcQ1MXTRnXOO/Y5AWATsAJkt5g+YT6Ls3XXI4s/0b+F3MnSyStD/wH2LxIjRbhXV5FIOlP\n", "hMR1KUP4kiDOQ/mbmT1QaS8diZiI7WpxYdP2RBZdNA5I2mpVz7dk4Iuk9cxsYUymN9cNudo5Z3la\n", "Ywhz544CzgY+IEy2zLwr0wNKC1hhCN8AwnL5rR3ClwQxd7A28DGQG1hQlaNxsiAvt9YpV9bWubVS\n", "ifM2ftOaLhonWyQ9YGYHxtbNSphZi+ecxa76xwmDXv4HrFeuz9gDSgvIG8J3GWEse/4wvsvMbHCb\n", "m8oANS3umD/UtajhiE7h3JqZ/b+KmSqCFbpo+gCt6aJxyoCkWwjB4Ekze7mVGvsBexG6yLcFnot6\n", "v87MaO61PKC0nEqOSsoahcUdRwGbEb4I9yB0ge1XUWPtEEkvk0BurbVk0UXjlIcYDGoJAWEbwrpi\n", "RQcDSZ2BQYSh7acS8ipFLZXUotdpp38DbUr+iAvg9bynugNPm9k3KmKsBCS9AOxOCCL9Je0IXJw3\n", "A95pIR0pt+akR6nBQNIjwDrA3wjdXk/mljvKGh/l1TI62qgkWHlxx5dV5OKO1U7Gy2M4zko0EwwG\n", "tSIYPE8ISDsTZusvkPS33Dy2LPGA0gLM7H3Cml3DK+0lQ7JY3LHayc14vgw4lBVya21vx+mAlBwM\n", "zOw7AJK6ExYz/QNhnlSXVVRrFd7l5eQGHaxHmN39cYXttDs6Um7NSZO8YPA9oJeZtTgYSPp/hBzM\n", "QMKAiycJ3V6PZu3TWyhO0QshOoEUZvw7HZtmgsH1hIBQDF0Jrennyr32oLdQHKeVxFnHG9CxcmtO\n", "Qkg6h7BEUtmDQRZ4QHEcx3EyoZgdzBzHcRynIB5QHMdxnEzwgOI4juNkggcUx8kAST+Q9IKkf0qa\n", "Jqls67tJmhS3o3acpPBhw45TIpI+BxwIDDCzT+Kim5lPGsvDaKPN0BynGLyF4jil0wv4b25Yp5m9\n", "Z2bvSPqRpKmSpsd954FPWxi/kvR3SS9L2l3SXZJelXRRvGYrSTMk3SzpJUl/ils9L4ek/SU9I+lZ\n", "SbdLWieWXyLpxdhiuryN3genyvGA4jilMxHYXNIrkn4rae9Y/hszGxxnzHeTdFAsN+AjM9sdGEvY\n", "KfFUwvIa34pL4gBsB/zWzPoSlt0Ymf+ikjYibJz0BTMbSNhp8buxhXSYme1kZrsCF5Xrxh0nHw8o\n", "jlMiZvYBYSbzycC7wHhJxwP7SZos6XnCSrF986rdG3++ALxgZvPisjf/oml71rfM7G/x+GbCMuY5\n", "RNhyoC9h7/JpwHHAFoR15xolXSfpcJr2vHecsuI5FMfJADNbRtgI6fG4DMupQD9goJnNkXQ+eRuZ\n", "0bQq8bK849x57u8yP08ims+bPGxmx6xYGAcFfAE4AjgjHjtOWfEWiuOUiKTtJPXJKxoAzCAEgPmS\n", "1gW+3grpLSTtEY+PYfk1nAyYDOwpaZvoYx1JfWIepcbM/gJ8F9i1Fa/tOEXjLRTHKZ11gasl1QBL\n", 
"gJnAKUADoUtrLjClQN1Vjdh6BThd0vXAi4R8S1NFs/9K+hZwm6TcqLIfAIuAeyR1JbRsvtPK+3Kc\n", "ovC1vBwnQeK2vPf5EvhOe8K7vBwnXfy/Padd4S0Ux3EcJxO8heI4juNkggcUx3EcJxM8oDiO4ziZ\n", "4AHFcRzHyQQPKI7jOE4meEBxHMdxMuH/A8A9viCKa0WSAAAAAElFTkSuQmCC\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fdist1.plot(20, cumulative=True)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'CIRCUMNAVIGATION',\n", " u'Physiognomically',\n", " u'apprehensiveness',\n", " u'cannibalistically',\n", " u'characteristically',\n", " u'circumnavigating',\n", " u'circumnavigation',\n", " u'circumnavigations',\n", " u'comprehensiveness',\n", " u'hermaphroditical',\n", " u'indiscriminately',\n", " u'indispensableness',\n", " u'irresistibleness',\n", " u'physiognomically',\n", " u'preternaturalness',\n", " u'responsibilities',\n", " u'simultaneousness',\n", " u'subterraneousness',\n", " u'supernaturalness',\n", " u'superstitiousness',\n", " u'uncomfortableness',\n", " u'uncompromisedness',\n", " u'undiscriminating',\n", " u'uninterpenetratingly']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# apply a list comprehension to get words over 15 characters\n", "V = set(text1)\n", "long_words = [w for w in V if len(w) > 15]\n", "sorted(long_words)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'#14-19teens',\n", " u'#talkcity_adults',\n", " u'((((((((((',\n", " u'........',\n", " u'Question',\n", " u'actually',\n", " u'anything',\n", " u'computer',\n", " u'cute.-ass',\n", " u'everyone',\n", " u'football',\n", " u'innocent',\n", " u'listening',\n", " u'remember',\n", " u'seriously',\n", " u'something',\n", " u'together',\n", " u'tomorrow',\n", " u'watching']" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fdist2 = 
FreqDist(text5)\n", "sorted(w for w in set(text5) if len(w) > 7 and fdist2[w] > 7)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "United States; fellow citizens; four years; years ago; Federal\n", "Government; General Government; American people; Vice President; Old\n", "World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;\n", "God bless; every citizen; Indian tribes; public debt; one another;\n", "foreign nations; political parties\n" ] } ], "source": [ "# word sequences that appear together unusually often\n", "text4.collocations()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Raw Text Processing" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1176896" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# download raw text from an online repository\n", "import urllib2\n", "url = \"http://www.gutenberg.org/files/2554/2554.txt\"\n", "response = urllib2.urlopen(url)\n", "raw = response.read().decode('utf8')\n", "len(raw)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "u'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\\r\\n'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw[:75]" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "254352" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# tokenize the raw text\n", "from nltk import word_tokenize\n", "tokens = word_tokenize(raw)\n", "len(tokens)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ 
"[u'The',\n", " u'Project',\n", " u'Gutenberg',\n", " u'EBook',\n", " u'of',\n", " u'Crime',\n", " u'and',\n", " u'Punishment',\n", " u',',\n", " u'by']" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokens[:10]" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'CHAPTER',\n", " u'I',\n", " u'On',\n", " u'an',\n", " u'exceptionally',\n", " u'hot',\n", " u'evening',\n", " u'early',\n", " u'in',\n", " u'July',\n", " u'a',\n", " u'young',\n", " u'man',\n", " u'came',\n", " u'out',\n", " u'of',\n", " u'the',\n", " u'garret',\n", " u'in',\n", " u'which',\n", " u'he',\n", " u'lodged',\n", " u'in',\n", " u'S.',\n", " u'Place',\n", " u'and',\n", " u'walked',\n", " u'slowly',\n", " u',',\n", " u'as',\n", " u'though',\n", " u'in',\n", " u'hesitation',\n", " u',',\n", " u'towards',\n", " u'K.',\n", " u'bridge',\n", " u'.']" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = nltk.Text(tokens)\n", "text[1024:1062]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya\n", "Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old\n", "woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;\n", "great deal; Nikodim Fomitch; young man; Ilya Petrovitch; n't know;\n", "Project Gutenberg; Dmitri Prokofitch; Andrey Semyonovitch; Hay Market\n" ] } ], "source": [ "text.collocations()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "5338" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw.find(\"PART I\")" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, 
"outputs": [ { "data": { "text/plain": [ "[u'BBC',\n", " u'NEWS',\n", " u'|',\n", " u'Health',\n", " u'|',\n", " u'Blondes',\n", " u\"'to\",\n", " u'die',\n", " u'out',\n", " u'in']" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# HTML parsing using the Beautiful Soup library\n", "from bs4 import BeautifulSoup\n", "url = \"http://news.bbc.co.uk/2/hi/health/2284783.stm\"\n", "html = urllib2.urlopen(url).read().decode('utf8')\n", "raw = BeautifulSoup(html).get_text()\n", "tokens = word_tokenize(raw)\n", "tokens[0:10]" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Displaying 5 of 5 matches:\n", "hey say too few people now carry the gene for blondes to last beyond the next \n", "blonde hair is caused by a recessive gene . In order for a child to have blond\n", " have blonde hair , it must have the gene on both sides of the family in the g\n", "ere is a disadvantage of having that gene or by chance . 
They do n't disappear\n", "des would disappear is if having the gene was a disadvantage and I do not thin\n" ] } ], "source": [ "# isolate just the article text\n", "tokens = tokens[110:390]\n", "text = nltk.Text(tokens)\n", "text.concordance('gene')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regular Expressions" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# regular expression library\n", "import re\n", "wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'abaissed',\n", " u'abandoned',\n", " u'abased',\n", " u'abashed',\n", " u'abatised',\n", " u'abed',\n", " u'aborted',\n", " u'abridged',\n", " u'abscessed',\n", " u'absconded']" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# match the end of a word\n", "[w for w in wordlist if re.search('ed$', w)][0:10]" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'abjectly',\n", " u'adjuster',\n", " u'dejected',\n", " u'dejectly',\n", " u'injector',\n", " u'majestic',\n", " u'objectee',\n", " u'objector',\n", " u'rejecter',\n", " u'rejector']" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# wildcard matches any single character\n", "[w for w in wordlist if re.search('^..j..t..$', w)][0:10]" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'gold', u'golf', u'hold', u'hole']" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# combination of caret (start of word) and sets\n", "[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]" ] }, { "cell_type": "code", "execution_count": 
31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',\n", " u'miiiiiinnnnnnnnnneeeeeeee',\n", " u'mine',\n", " u'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))\n", "\n", "# plus symbol matches one or more repetitions of the preceding character\n", "[w for w in chat_words if re.search('^m+i+n+e+$', w)]" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'0.0085',\n", " u'0.05',\n", " u'0.1',\n", " u'0.16',\n", " u'0.2',\n", " u'0.25',\n", " u'0.28',\n", " u'0.3',\n", " u'0.4',\n", " u'0.5']" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wsj = sorted(set(nltk.corpus.treebank.words()))\n", "\n", "# match decimal numbers: digits, an escaped literal dot, then digits\n", "[w for w in wsj if re.search('^[0-9]+\\.[0-9]+$', w)][0:10]" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'C$', u'US$']" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[w for w in wsj if re.search('^[A-Z]+\\$$', w)]" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'1614',\n", " u'1637',\n", " u'1787',\n", " u'1901',\n", " u'1903',\n", " u'1917',\n", " u'1925',\n", " u'1929',\n", " u'1933',\n", " u'1934']" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[w for w in wsj if re.search('^[0-9]{4}$', w)][0:10]" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'10-day',\n", " u'10-lap',\n", " u'10-year',\n", " u'100-share',\n", " u'12-point',\n", " u'12-year',\n", " u'14-hour',\n", 
u'15-day',\n", " u'150-point',\n", " u'190-point']" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)][0:10]" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'black-and-white',\n", " u'bread-and-butter',\n", " u'father-in-law',\n", " u'machine-gun-toting',\n", " u'savings-and-loan']" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)][0:10]" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'62%-owned',\n", " u'Absorbed',\n", " u'According',\n", " u'Adopting',\n", " u'Advanced',\n", " u'Advancing',\n", " u'Alfred',\n", " u'Allied',\n", " u'Annualized',\n", " u'Anything']" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[w for w in wsj if re.search('(ed|ing)$', w)][0:10]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(u'io', 549),\n", " (u'ea', 476),\n", " (u'ie', 331),\n", " (u'ou', 329),\n", " (u'ai', 261),\n", " (u'ia', 253),\n", " (u'ee', 217),\n", " (u'oo', 174),\n", " (u'ua', 109),\n", " (u'au', 106),\n", " (u'ue', 105),\n", " (u'ui', 95)]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# using \"findall\" to extract partial matches from words\n", "fd = nltk.FreqDist(vs for word in wsj \n", " for vs in re.findall(r'[aeiou]{2,}', word))\n", "fd.most_common(12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Normalizing Text" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# NLTK has several word stemmers built in\n", "porter = 
nltk.PorterStemmer()\n", "lancaster = nltk.LancasterStemmer()" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'UK',\n", " u'Blond',\n", " u\"'to\",\n", " u'die',\n", " u'out',\n", " u'in',\n", " u'200',\n", " u\"years'\",\n", " u'Scientist',\n", " u'believ']" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[porter.stem(t) for t in tokens][0:10]" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'uk',\n", " u'blond',\n", " u\"'to\",\n", " u'die',\n", " u'out',\n", " u'in',\n", " u'200',\n", " u\"years'\",\n", " u'sci',\n", " u'believ']" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[lancaster.stem(t) for t in tokens][0:10]" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'UK',\n", " u'Blondes',\n", " u\"'to\",\n", " u'die',\n", " u'out',\n", " u'in',\n", " u'200',\n", " u\"years'\",\n", " u'Scientists',\n", " u'believe']" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wnl = nltk.WordNetLemmatizer()\n", "[wnl.lemmatize(t) for t in tokens][0:10]" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# also has a tokenizer that takes a regular expression as a parameter\n", "text = 'That U.S.A. poster-print costs $12.40...'\n", "pattern = r'''(?x) # set flag to allow verbose regexps\n", " ([A-Z]\\.)+ # abbreviations, e.g. U.S.A.\n", " | \\w+(-\\w+)* # words with optional internal hyphens\n", " | \\$?\\d+(\\.\\d+)?%? # currency and percentages, e.g. 
$12.40, 82%\n", " | \\.\\.\\. # ellipsis\n", " | [][.,;\"'?():-_`] # these are separate tokens; includes ], [\n", "'''\n", "nltk.regexp_tokenize(text, pattern)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tagging" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('They', 'PRP'),\n", " ('refuse', 'VBP'),\n", " ('to', 'TO'),\n", " ('permit', 'VB'),\n", " ('us', 'PRP'),\n", " ('to', 'TO'),\n", " ('obtain', 'VB'),\n", " ('the', 'DT'),\n", " ('refuse', 'NN'),\n", " ('permit', 'NN')]" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use a built-in tokenizer and tagger\n", "text = word_tokenize(\"They refuse to permit us to obtain the refuse permit\")\n", "nltk.pos_tag(text)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "man time day year car moment world family house country child boy\n", "state job way war girl place word work\n" ] } ], "source": [ "# Find words that occur in similar contexts (uses the plain, untagged Brown words)\n", "text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())\n", "text.similar('woman')" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(u'The', u'AT'),\n", " (u'Fulton', u'NP-TL'),\n", " (u'County', u'NN-TL'),\n", " (u'Grand', u'JJ-TL'),\n", " (u'Jury', u'NN-TL'),\n", " (u'said', u'VBD'),\n", " (u'Friday', u'NR'),\n", " (u'an', u'AT'),\n", " (u'investigation', u'NN'),\n", " (u'of', u'IN')]" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Tagged words are stored as (word, tag) tuples\n", "nltk.corpus.brown.tagged_words()[0:10]" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(u'The', u'DET'),\n", " (u'Fulton', 
u'NOUN'),\n", " (u'County', u'NOUN'),\n", " (u'Grand', u'ADJ'),\n", " (u'Jury', u'NOUN'),\n", " (u'said', u'VERB'),\n", " (u'Friday', u'NOUN'),\n", " (u'an', u'DET'),\n", " (u'investigation', u'NOUN'),\n", " (u'of', u'ADP')]" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nltk.corpus.brown.tagged_words(tagset='universal')[0:10]" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(u'NOUN', 30640),\n", " (u'VERB', 14399),\n", " (u'ADP', 12355),\n", " (u'.', 11928),\n", " (u'DET', 11389),\n", " (u'ADJ', 6706),\n", " (u'ADV', 3349),\n", " (u'CONJ', 2717),\n", " (u'PRON', 2535),\n", " (u'PRT', 2264),\n", " (u'NUM', 2166),\n", " (u'X', 106)]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.corpus import brown\n", "brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')\n", "tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)\n", "tag_fd.most_common()" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "VERB ADV ADP ADJ . 
PRT \n", " 37 8 7 6 4 2 \n" ] } ], "source": [ "# Part-of-speech tag counts for the words that follow \"often\" in a text\n", "brown_lrnd_tagged = brown.tagged_words(categories='learned', tagset='universal')\n", "tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'often']\n", "fd = nltk.FreqDist(tags)\n", "fd.tabulate()" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Load tagged and untagged news sentences\n", "from nltk.corpus import brown\n", "brown_tagged_sents = brown.tagged_sents(categories='news')\n", "brown_sents = brown.sents(categories='news')" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('I', 'NN'),\n", " ('do', 'NN'),\n", " ('not', 'NN'),\n", " ('like', 'NN'),\n", " ('green', 'NN'),\n", " ('eggs', 'NN'),\n", " ('and', 'NN'),\n", " ('ham', 'NN'),\n", " (',', 'NN'),\n", " ('I', 'NN'),\n", " ('do', 'NN'),\n", " ('not', 'NN'),\n", " ('like', 'NN'),\n", " ('them', 'NN'),\n", " ('Sam', 'NN'),\n", " ('I', 'NN'),\n", " ('am', 'NN'),\n", " ('!', 'NN')]" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Default tagger: assigns the same tag (here NN, the most frequent news tag) to every token\n", "tags = [tag for (word, tag) in brown.tagged_words(categories='news')]\n", "nltk.FreqDist(tags).max()\n", "raw = 'I do not like green eggs and ham, I do not like them Sam I am!'\n", "tokens = word_tokenize(raw)\n", "default_tagger = nltk.DefaultTagger('NN')\n", "default_tagger.tag(tokens)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.13089484257215028" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Evaluate the performance against a tagged corpus\n", "default_tagger.evaluate(brown_tagged_sents)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, 
"outputs": [ { "data": { "text/plain": [ "[(u'Various', u'JJ'),\n", " (u'of', u'IN'),\n", " (u'the', u'AT'),\n", " (u'apartments', u'NNS'),\n", " (u'are', u'BER'),\n", " (u'of', u'IN'),\n", " (u'the', u'AT'),\n", " (u'terrace', u'NN'),\n", " (u'type', u'NN'),\n", " (u',', u','),\n", " (u'being', u'BEG'),\n", " (u'on', u'IN'),\n", " (u'the', u'AT'),\n", " (u'ground', u'NN'),\n", " (u'floor', u'NN'),\n", " (u'so', u'QL'),\n", " (u'that', u'CS'),\n", " (u'entrance', u'NN'),\n", " (u'is', u'BEZ'),\n", " (u'direct', u'JJ'),\n", " (u'.', u'.')]" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Training a unigram tagger\n", "from nltk.corpus import brown\n", "brown_tagged_sents = brown.tagged_sents(categories='news')\n", "brown_sents = brown.sents(categories='news')\n", "unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)\n", "unigram_tagger.tag(brown_sents[2007])" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.9349006503968017" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now evaluate it (scored on its own training sentences, so the result is optimistic)\n", "unigram_tagger.evaluate(brown_tagged_sents)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.9730592517453309" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Combining taggers: the bigram tagger backs off to the unigram tagger, then to the default tagger\n", "t0 = nltk.DefaultTagger('NN')\n", "t1 = nltk.UnigramTagger(brown_tagged_sents, backoff=t0)\n", "t2 = nltk.BigramTagger(brown_tagged_sents, backoff=t1)\n", "t2.evaluate(brown_tagged_sents)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classifying Text" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'last_letter': 'k'}" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } 
], "source": [ "# Define a feature extractor\n", "def gender_features(word):\n", "    return {'last_letter': word[-1]}\n", "gender_features('Shrek')" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Prepare a shuffled list of (name, gender) examples\n", "from nltk.corpus import names\n", "labeled_names = ([(name, 'male') for name in names.words('male.txt')] +\n", "                 [(name, 'female') for name in names.words('female.txt')])\n", "import random\n", "random.shuffle(labeled_names)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Extract features, hold out the first 500 names for testing,\n", "# and train a naive Bayes classifier on the rest\n", "featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]\n", "train_set, test_set = featuresets[500:], featuresets[:500]\n", "classifier = nltk.NaiveBayesClassifier.train(train_set)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'male'" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "classifier.classify(gender_features('Neo'))" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'female'" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "classifier.classify(gender_features('Trinity'))" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.752\n" ] } ], "source": [ "print(nltk.classify.accuracy(classifier, test_set))" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Most Informative Features\n", " last_letter = u'a' female : male = 35.4 : 1.0\n", " last_letter = u'k' male : female = 31.9 : 1.0\n", " last_letter = u'f' male : female = 17.4 
: 1.0\n", " last_letter = u'p' male : female = 11.3 : 1.0\n", " last_letter = u'm' male : female = 10.2 : 1.0\n" ] } ], "source": [ "classifier.show_most_informative_features(5)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Document classification\n", "from nltk.corpus import movie_reviews\n", "documents = [(list(movie_reviews.words(fileid)), category)\n", "             for category in movie_reviews.categories()\n", "             for fileid in movie_reviews.fileids(category)]\n", "random.shuffle(documents)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [], "source": [ "all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())\n", "# Use 2000 words as features. In NLTK 3, keys() is not ordered by frequency;\n", "# all_words.most_common(2000) would select the most frequent words instead.\n", "word_features = list(all_words.keys())[:2000]\n", "\n", "def document_features(document):\n", "    document_words = set(document)\n", "    features = {}\n", "    for word in word_features:\n", "        features['contains(%s)' % word] = (word in document_words)\n", "    return features" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": false }, "outputs": [], "source": [ "featuresets = [(document_features(d), c) for (d, c) in documents]\n", "train_set, test_set = featuresets[100:], featuresets[:100]\n", "classifier = nltk.NaiveBayesClassifier.train(train_set)" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.64\n" ] } ], "source": [ "print(nltk.classify.accuracy(classifier, test_set))" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Most Informative Features\n", " contains(sans) = True neg : pos = 8.4 : 1.0\n", " contains(uplifting) = True pos : neg = 8.2 : 1.0\n", " contains(mediocrity) = True neg : pos = 7.7 : 1.0\n", " contains(dismissed) = True pos : neg = 7.0 : 1.0\n", " contains(overwhelmed) = True pos : neg = 
6.3 : 1.0\n" ] } ], "source": [ "classifier.show_most_informative_features(5)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.9" } }, "nbformat": 4, "nbformat_minor": 0 }