{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# NLTK"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and an active discussion forum.\n",
"\n",
"Library documentation: http://www.nltk.org/"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# needed to display the graphs\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"showing info http://nltk.github.com/nltk_data/\n"
]
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# import the library and open the NLTK downloader to fetch the sample texts\n",
"import nltk\n",
"nltk.download()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"*** Introductory Examples for the NLTK Book ***\n",
"Loading text1, ..., text9 and sent1, ..., sent9\n",
"Type the name of the text or sentence to view it.\n",
"Type: 'texts()' or 'sents()' to list the materials.\n",
"text1: Moby Dick by Herman Melville 1851\n",
"text2: Sense and Sensibility by Jane Austen 1811\n",
"text3: The Book of Genesis\n",
"text4: Inaugural Address Corpus\n",
"text5: Chat Corpus\n",
"text6: Monty Python and the Holy Grail\n",
"text7: Wall Street Journal\n",
"text8: Personals Corpus\n",
"text9: The Man Who Was Thursday by G . K . Chesterton 1908\n"
]
}
],
"source": [
"# load the sample texts (text1 to text9) and sentences (sent1 to sent9)\n",
"from nltk.book import *"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Displaying 11 of 11 matches:\n",
"ong the former , one was of a most monstrous size . ... This came towards us , \n",
"ON OF THE PSALMS . \" Touching that monstrous bulk of the whale or ork we have r\n",
"ll over with a heathenish array of monstrous clubs and spears . Some were thick\n",
"d as you gazed , and wondered what monstrous cannibal and savage could ever hav\n",
"that has survived the flood ; most monstrous and most mountainous ! That Himmal\n",
"they might scout at Moby Dick as a monstrous fable , or still worse and more de\n",
"th of Radney .'\" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l\n",
"ing Scenes . In connexion with the monstrous pictures of whales , I am strongly\n",
"ere to enter upon those still more monstrous stories of them which are to be fo\n",
"ght have been rummaged out of this monstrous cabinet there is no telling . But \n",
"of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u\n"
]
}
],
"source": [
"# examine concordances (word + context)\n",
"text1.concordance(\"monstrous\")"
]
},
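{
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea behind a concordance can be sketched in plain Python. This is only an illustration, not NLTK's implementation: `simple_concordance` and its `width` parameter are hypothetical names, and the function works on any list of token strings. Like the output above, it matches the word case-insensitively."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative sketch only: simple_concordance is a hypothetical helper, not an NLTK function\n",
"def simple_concordance(tokens, word, width=3):\n",
"    '''Return (left, match, right) for each occurrence of word, with up to width tokens per side.'''\n",
"    target = word.lower()\n",
"    lines = []\n",
"    for i, token in enumerate(tokens):\n",
"        if token.lower() == target:\n",
"            # join the neighbouring tokens on each side of the match\n",
"            left = ' '.join(tokens[max(0, i - width):i])\n",
"            right = ' '.join(tokens[i + 1:i + 1 + width])\n",
"            lines.append((left, token, right))\n",
"    return lines\n",
"\n",
"# toy token list, made up for this example\n",
"sample = 'the monstrous whale and the most monstrous fable of the sea'.split()\n",
"for left, hit, right in simple_concordance(sample, 'monstrous'):\n",
"    print(left, '[' + hit + ']', right)"
]
},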
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"imperial subtly impalpable pitiable curious abundant perilous\n",
"trustworthy untoward singular lamentable few determined maddens\n",
"horrible tyrannical lazy mystifying christian exasperate\n"
]
}
],
"source": [
"# list words that appear in contexts similar to the given word\n",
"text1.similar(\"monstrous\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"a_pretty is_pretty a_lucky am_glad be_glad\n"
]
}
],
"source": [
"# show the contexts shared by both words\n",
"text2.common_contexts([\"monstrous\", \"very\"])"
]
},
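{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both calls above rest on the notion of a word's context, meaning the words immediately before and after it. The cell below is a simplified sketch of that idea under stated assumptions, not NLTK's implementation: `contexts_by_word` and `shared_contexts` are hypothetical helpers, a context is taken to be the (previous, next) word pair, and all tokens are lowercased."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative sketch only: these helpers are hypothetical, not part of NLTK\n",
"from collections import defaultdict\n",
"\n",
"def contexts_by_word(tokens):\n",
"    '''Map each lowercased word to the set of (previous, next) word pairs around it.'''\n",
"    contexts = defaultdict(set)\n",
"    words = [t.lower() for t in tokens]\n",
"    # the first and last tokens have no full context, so they are skipped\n",
"    for i in range(1, len(words) - 1):\n",
"        contexts[words[i]].add((words[i - 1], words[i + 1]))\n",
"    return contexts\n",
"\n",
"def shared_contexts(tokens, first, second):\n",
"    '''Return the contexts in which both words occur, sorted for stable output.'''\n",
"    contexts = contexts_by_word(tokens)\n",
"    return sorted(contexts[first.lower()] & contexts[second.lower()])\n",
"\n",
"# toy token list, made up for this example\n",
"sample = 'a very pretty girl and a monstrous pretty house is very lucky'.split()\n",
"print(shared_contexts(sample, 'monstrous', 'very'))"
]
},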
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": [
"iVBORw0KGgoAAAANSUhEUgAAAakAAAEZCAYAAAAt5touAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\n",
"AAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xu4HFWZ7/HvDwJyJwnwiCIYEJWAkWBQLhPIDjDeTsDk\n",
"iAIKKp4z4mhEHWYAxRkSPTpRZw5BFJhxRgRERQUzEB1umo5yEwIhBAhogKCAIEFuotzf+aNWpWvX\n",
"7t637N577eT3eZ56unrVury1+vJ2VfXerYjAzMwsRxuMdABmZmbtOEmZmVm2nKTMzCxbTlJmZpYt\n",
"JykzM8uWk5SZmWXLScrWG5IOkHTnEPSzStLBa9H+/ZIuX9s4hspQzcsgxn1J0i7DPa6NLk5Slq21\n",
"TQZ1EfHLiNhtKLpKSw+Svi3pWUlPpmW5pC9J2qoSxwUR8bYhiGNIDOG8dCNpQkpET6XlXkknDaKf\n",
"D0n65VDHZ6ODk5TlrG0yyFgAX46IrYBtgWOBfYFrJG02UkFJGsnX+tYRsSVwFPBPkt46grHYKOMk\n",
"ZaOOCidLWilptaQLJY1L286S9KNK3S9Luiqtd0n6XWXbjpIulvSH1M8Zqfw1kn6eyh6R9B1JWw8k\n",
"RICIeC4ilgCHAdtQJKxuRwZpX06T9LCkJyTdKmn3tO3bks6WdEU6KmtI2qkS/26SrpT0qKQ7Jb2n\n",
"su3baS5+KulPQJekd0q6I/V1v6QT2szLxDTWY5Juk3Rord9vSFqY+rm+v6fsIuJ64HbgDT0mTNpa\n",
"0nnpsVgl6ZQ0NxOBs4D90tHYH/v7INi6wUnKRqPjKd74DwReATwGfCNt+ztgkqQPSjoA+DDwgXoH\n",
"kjYEFgL3Aq8GdgC+X6nyxdT3RGBHYM5gg42IPwFXAge02PzWVP7aiNgaeA9QfSN+H/B5iqOyW4AL\n",
"Uvybpz6/A2wHHAmcmd7US0cBX4iILYBrgf8E/iYd5e0B/LwejKSNgEuBy1K/nwAukPS6SrUjKOZj\n",
"HLCSYq56k/KN/iqNu7RFnTOALYGdgWkUj9mxEbEC+ChwXURsGRHj+xjL1jFOUjYaHQd8LiIejIjn\n",
"gbnA4ZI2iIi/AMcApwHnA7Mj4sEWfbyFIgn9Q0T8JSKejYhrACLi7oj4WUQ8HxGrU1/T1jLm3wOt\n",
"3mCfp3hznpjivysiHqpsXxgRV0fEc8ApFEcUrwJmAPdGxLkR8VJE3AJcTJHkSgsi4rq0T88AzwF7\n",
"SNoqIp6IiFbJYl9g84iYFxEvRMQiimR+VKXOxRGxJCJepEiak/vY99XAo8A3gZNSn2ukDwxHAJ+J\n",
"iKcj4j7gXykeR0hHprZ+cpKy0WgC8ON0Ouox4A7gBeDlABFxA3BPqvvDNn3sCNwXES/VN0h6uaTv\n",
"p1NiT1Aku23WMuYdKN6ou4mInwNfpzgSfFjSv0nastwM3F+p+zTFUdYrKY7+9innIM3D+0hzkNqu\n",
"OYWXvBt4J7Aqnc7bt0Wcr2zR7r5UXvb7cGXbX4At2u51YZuIGB8Ru0fE11ts3xbYKI1T+i3FnNl6\n",
"zknKRqPfAm+PiHGVZbOI+D2ApI8DGwMPAie26eN3wE7pU3zdl4AXgTekU3DHMLDXSrcve0jaAjgE\n",
"aPkNtYg4IyL2BnYHXgf8Q9mUIplW+xkPPEAxB4trc7BlRHy8bVDF0c9MitN4C4AftKj2ILCjpOrR\n",
"y6vTmJ2ymuKIckKlbCeaCXq0fXnGhpCTlOVuY0mbVJYxwNnAl8ovEUjaTtJhaf11wBeA91Nc1zhR\n",
"0p4t+r2B4hTcPEmbpb73T9u2AJ4GnpS0A82k0R9KC5JeJmkKRUJ4FDinR2Vpb0n7pGtBfwaeoUiQ\n",
"pXdK+itJG6f9ui4iHgB+ArxO0tGSNkrLmyWVXyVXbZyNVPx91tbpNN1TtXFKv0pxnJjadFGcWiyv\n",
"1w35qbcUzw+AL0raQtKrgU9TXG+D4sjtVWmObD3jJGW5+ynFm2a5/BNwOnAJcIWkJ4HrgLeko6Lz\n",
"gXkRsTwiVgKfBc6vvMEFrHljPBTYleKo5HfAe1OducCbgCcovkRwEf3/NB8Ub/BPUhwhnAvcCOyf\n",
"rpeVdcr+tgL+neI03qrU5quVet8FTqVIcnsBR6f4n6L40sWRFEc5vwf+meIIsj5G6Wjg3nQK8yMU\n",
"ibwaN+na16HAO4BHKE5FHhMRv+6l397mpr/bPkHxweAeiiPOC2gm9Z9RfCvwIUl/6KU/WwfJP3po\n",
"lidJ5wD3R8Q/jnQsZiPFR1Jm+fK32my95yRllq/R+B83zIaUT/eZmVm2fCRlZmbZGjPSAeRCkg8p\n",
"zcwGISI6dv3UR1IVEZH9cuqpp454DOtCjI7Tcea+jJY4O81JyszMsuUkZWZm2XKSGmW6urpGOoQ+\n",
"jYYYwXEONcc5tEZLnJ3mr6AnksJzYWY2MJIIf3HCzMzWR05SZmaWLScpMzPLlpOUmZlly0nKzMyy\n",
"5SRlZmbZcpIyM7NsOUmZmVm2nKTMzCxbTlJmZpYtJykzM8uWk5SZmWXLScrMzLLlJGVmZtlykjIz\n",
"s2w5SZmZWbacpMzMLFtOUmZmli0nKTMzy5aTlJmZZctJyszMsuUkZWZm2XKSMjOzbDlJmZlZtpyk\n",
"zMwsW05SZmaWrRFLUhLHSRyT1j8k8YrKtm9KTByp2MzMLA+KiJGOAYlFwN9HcNPIxaDIYS7MzEYT\n",
"SUSEOtX/sB1JSXxAYpnELRLnSZwqcYLEu4G9gQskbpbYRKIhMUXiUImlablL4p7U15RUZ4nEZRLb\n",
"p/KGxDyJX6X6U1P5HqlsaYph11YxNhrNBWD+/O5l5TJ7dvN2/vxmvfnzu/dVLa+2r96HZj/1cer1\n",
"Zs1qPbf1mEtl/XJbff/qbat1W5k1q4h10qTithxv1qzW+1+WlftR1q/HUx+7WlatD825r96vt2m3\n",
"D/U6ZWzVWNrFVo+7HLfed3UOquWtnkfVvgcyL632rz5Pvc1Hu3HqWrXrrT4Uz43+PK69jVOqz2W7\n",
"8lbtq2Xl87O/9Vtt621/2r3+W/XR6rVXfx8on1tTp/Z8PyjHW58MS5KS2AM4BZgewWTgk2lTRHAR\n",
"sAR4XwRviuAZINK2SyPYK4K9gFuAr0qMAc4A3h3B3sA5wBfL/oANI9gH+BRwair/KHB66mcKcH+r\n",
"OOtPmgULWr+5LFzYvF2woFlvwYLufVXLq+2r96HZT32cer1Fi1rPbz3mUll/qJLUokVFrCtWFLfl\n",
"eIsWtd7/sqzcj7L+QJJUtT405756v96mv0mqjK0aS7vY6nGX49b7rs5Btby3JFWfo8Ekqfo8jVSS\n",
"WrFi6JJUfS7blfeVdMrnZ3/rt9rW2/60e/236qO3JFW+D5TPrSVLer4flOOtT8YM0zgHAT+I4I8A\n",
"ETymngeHbQ8XJU4E/hzBWRJvAPYArkp9bAg8WKl+cbq9GZiQ1q8FTpF4FXBxBCvXZmfMzGx4DFeS\n",
"CnpJQpU6PUgcArwbOLAsAm6PYP82/Tybbl8k7V8E35O4HpgB/FTiuAh6HJc0GnMq611AVx8hm5mt\n",
"XxqNBo2+DqmH0HAlqZ8DP5b4/xH8UWJ8Ki8T11PAVvVGEq8GvgG8NWJN8rkL2E5i3wiul9gIeG0E\n",
"d7QbXGKXCO4BzpDYCZgEPZNUV9ecyvoA99DMbD3Q1dVFV+UNcu7cuR0db1iSVAR3SHwRWCzxIrAU\n",
"WEXz6OnbwNkSf4Y1R0gCPgiMBxakU3sPRDBD4nDgaxJbp304DVomqbL/90ocDTwP/J7mNSwzM8vY\n",
"cB1JEcF5wHlttl1M81oSwPR0exPw+Rb1lwHTWpRPr6yvBnZJ6/OAeX3FWD96mjkTJk/uWW/16qLu\n",
"6tWwa/qe4OTJMHZs977Gjm2WV/up358xo+inPla93vTptFSNe+bMnvXL7a2ODutlvR1BTp8OO+wA\n",
"ixfDtGnN8caNa+5vvZ+yrNyP6py2G7u+P/W5qm6fMaNnm3b7UK/TKt529+txr17dun51/qvlvc1r\n",
"2aa/89Kqr/o89TYf/Ympt3a9mThx4OO1K6/PZbvyvp7X06f3/fzobd/62p9y7gfyfGpVVr4PrExX\n",
"zPfeu3udsv9287KuyuLvpHLgv5MyMxu4debvpMzMzAbKScrMzLLlJGVmZtlykjIzs2w5SZmZWbac\n",
"pMzMLFtOUmZmli0nKTMzy5aTlJmZZctJyszMsuUkZWZm2XKSMjOzbDlJmZlZtpykzMwsW05SZmaW\n",
"LScpMzPLlpOUmZlly0nKzMyy5SRlZmbZcpIyM7NsOUmZmVm2nKTMzCxbTlJmZpYtJykzM8uWk5SZ\n",
"mWXLScrMzLLlJGVmZtlykjIzs2wNKElJzJE4oVPBmJmZVQ30SCo6EkU/SYwZyfFz0mgMfV8D7bPR\n",
"GNo4bGTNnz/0j2m7vsryqVOb486fX9yvxlKuV2PsRDxDqdU8zp7dLKsvZVz112G1/fr8OuszSUmc\n",
"InGXxC+B16ey10j8t8QSiV9Ia8q/LXGmxHUSd0t0SZwrcYfEOZU+j5K4VWK5xLxK+dslbpK4ReLK\n",
"VDZH4nyJq4FzJV6dxrwpLftV2p+U+r1F4ksSu0jcVNn+2ur90cxJyobaggXDn6SWLGmOu2BBcb8a\n",
"S7lejbET8QylVvO4cKGT1GD1emQiMQU4AtgT2Ai4GbgJ+DfgoxGslNgHOBM4ODUbG8F+EocBlwD7\n",
"AXcAN0rsCTwCzAPeBDwOXCHxLuBa4N+BAyK4T2JsJZTdgKkRPCuxKfDXaf21wHeBN0u8AzgMeEsE\n",
"z0iMjeBxiSck9oxgGXAs8K21mjEzMxs2fZ0+OwC4OIJngGckLgE2AfYHfiitqbdxug3g0rR+G/BQ\n",
"BLcDSNwOTEhLI4JHU/kFwIHAi8AvIrgPIILHK31eEsGzlbG+nhLei8BrU/khwLdSrNX2/wEcK/F3\n",
"wHuBN7fb2Tlz5qxZ7+rqoqurq4/pMTNbvzQaDRrDeGjXV5IKQLWyDYDHI9irTZvn0u1LsCaxlPfH\n",
"AM/X6tf7b+XPlfVPA7+P4BiJDaFISm1iBbgIOBX4ObAkgsfaDVJNUmZm1lP9A/zcuXM7Ol5f16R+\n",
"AcyU2ERiS+BQioRxr8ThABKSeGM/xwvgBmCaxDYpyRwJNIDrgQMlJqR+x7fpYyvgobT+AWDDtH4l\n",
"xRHTpqn9OIB0BHY5cBY0r4uZmVn+ej2SimCpxIXAMuAPFAkmgPcDZ0l8juJa1feAW8tm1S5a9PmQ\n",
"xMnAIoojn4URxSlCiY8AF0tsADwMvK1FP2cCF0l8ALgM+FPq93KJycASieeAnwCfS22+C8wCruh9\n",
"OkaPoTwTWfY10D59NnTdMnMmTJ48tH22e46U5Xvv3Rx37Fh44YWescyc2T3GTsQzlFrN44wZvY9d\n",
"3dbq9bg+v9YUMaLfKh8WEn8PbBnBqe3rKNaHuTAzG0qSiIj+XLYZlHX+744kfgzsDBw00rGYmdnA\n",
"rBdHUv3hIykzs4Hr9JGU/3efmZlly0nKzMyy5SRlZmbZcpIyM7NsOUmZmVm2nKTMzCxbTlJmZpYt\n",
"JykzM8uWk5SZmWXLScrMzLLlJGVmZtlykjIzs2w5SZmZWbacpMzMLFtOUmZmli0nKTMzy5aTlJmZ\n",
"ZctJyszMsuUkZWZm2XKSMjOzbDlJmZlZtpykzMwsW05SZmaWLScpMzPLlpOUmZlly0nKzMyy1bEk\n",
"JXG8xB0S5w9xvw2JKUPZp5mZ5amTR1J/CxwSwTFlgcSYIeg30jKiZs+GRqN5f/785nq1fKg1GsVS\n",
"jlcfa6jGLvspx2vXb7VetWwgcfQ1xnAqY5g1q2dZb/UHuq2/9WbPbl2vr777M3b1OVuut3seV+ej\n",
"XSzVx3Awr4eyfTlW+Twvl0YDJk3q3uf8+UX9ss7s2cXt1KmtY6m/bqqx1ee6HL/Va6F8/VfXyz7K\n",
"eFrNUV+v27JdedtoFPtSLjvv3H1bqz7WNR1JUhJnA7sAl0k8LnGexNXAuRLbSvxI4oa07J/abC7x\n",
"LYlfSdwscVgq31Ti++mo7GJg08o4R0ncKrFcYl6l/E8SX5G4TeJKiX0lFkvcLXHoUOzjwoXdnxQL\n",
"FjTXhyNJleM5SQ2tMoZFi3qW9VZ/oNv6W2/hwtb1hiJJVZ+z5Xq753F1PtrFUn0MB/N6KNuXY5XP\n",
"83JpNGDFiu59LlhQ1C/rLFxY3C5Z0jqW+uumGlt9rsvxW70Wytd/db3so4yn1Rz19bot25W3jUax\n",
"L+Vy333dt7XqY13TkSQVwUeBB4Eu4DRgInBwBO8HvgacFsFbgMOB/0jNTgF+FsE+wEHAVyU2ozgi\n",
"+1MEuwOnQnGqT+KVwDxgOjAZeLPEu1Jfm6W+3gA8BXw+9TkrrZuZ2SgwFKffeqN0e0kEz6b1Q4CJ\n",
"0po6W0psDrwVOFTi71P5y4CdgAOA0wEiWC5xa+r3zUAjgkcBJC4ADgT+C3gugstTP8uBZyJ4UeI2\n",
"YEK7YOfMmbNmvauri66urkHttJnZuqrRaNAYxsO2Tiep0p8r6wL2ieC5aoWUtP53BL9pUS56ql+X\n",
"UqXs+Ur5S1CMFcFLvV0XqyYpMzPrqf4Bfu7cuR0dbyS+gn4FcHx5R2LPtHp5rXyvtPoL4H2p7A3A\n",
"GymS0Q3ANIltJDYEjgQWdzx6MzMbNp08koo268cD35BYlsZfDHwM+AIwP53O2wC4BzgMOAs4R+IO\n",
"YAWwBCCChyROBhZRHEUtjODSFuP1FsugzZgB1bOBM2c21zt5lrDse+zY1mMN1dhlP33116reQGPI\n",
"6axqGcv06T3Leqs/0G39rTdjRut6/X1celN9zpbr7Z7H1floF8vavh7KesuWNe+Xz3OAyZPhoou6\n",
"1505E8aNg2nTivsrV8Kuu8ILL3SvU4+rVcyt5nrs2GLcet3Vq5v3q+szZsADDxTxlO3q8db7qm+f\n",
"PLn7uFdd1az3wAPNOnU5vY6GkiJG/NvcWZAUngszs4GRRES0uiQzJPwfJ8zMLFtOUmZmli0nKTMz\n",
"y5aTlJmZZctJyszMsuUkZWZm2XKSMjOzbDlJmZlZtpykzMwsW05SZmaWLScpMzPLlpOUmZlly0nK\n",
"zMyy5SRlZmbZcpIyM7NsOUmZmVm2nKTMzCxbTlJmZpYtJykzM8uWk5SZmWXLScrMzLLlJGVmZtly\n",
"kjIzs2w5SZmZWbacpMzMLFtOUmZmli0nKTMzy5aTlJmZZSu7JCUxR+KEXrbvKfGOyv1DJU4anujM\n",
"zGw4ZZekgOhj+17AO9dUDi6N4MudDQkajWIp16u3APPnN+tU6wLMnl1sb9VnvZ92YwLMmtX9frV9\n",
"2X91nFZj1vWnTn28duWzZ7feXo273Kd6WbuYWo3Z37Leyvsao69+yvatHoPe+mv1PGqn2l9Zt3ye\n",
"1cfsTX/2b7BtBtP3UMewro4PrV/vrQzkdTzaZJGkJE6RuEvil8DrU9kiiSlpfVuJeyU2Aj4PHCGx\n",
"VOK9Eh+SOCPV207iRxI3pGX/VD4t1V8qcbPEFgONsa8ktWBB+yS1cGGxvVWf9X7ajQmwaFH7JFX2\n",
"Xx2n1Zh1/alTH69d+cKFrbdX4y73qV7WLqZOJam+xuirn7J9q8dgqJJUtb+ybvk8q4/ZGyep0Tk+\n",
"9D9JDeR1PNqMGekAUiI6AtgT2Ai4Gbgpbe52VBXB8xL/CEyJ4PjU/oOVKqcDp0VwjcROwGXA7sAJ\n",
"wMciuE5iM+DZTu6TmZkNjRFPUsABwMURPAM8I3FJH/WVllYOASaquXVLic2Ba4DTJC5IYz3QqvGc\n",
"OXPWrHd1ddHV1dXPXTAzWz80Gg0aw3iYmUOSClonnReADdP6Jv3sS8A+ETxXK/+yxELgfwHXSLwt\n",
"grvqjatJyszMeqp/gJ87d25Hx8vhmtQvgJkSm0hsCRyayldBcU0KOLxS/0lgy8r9aoK7AorTgAAS\n",
"k9PtayK4PYKvADeSrnuZmVneRvxIKoKlEhcCy4A/ADdQHF39C/ADiY8AP6F5fWoRcLLEUuCfU3m5\n",
"7XjgGxLLKPZtMfAx4JMS04GXgNuA/x5onNUzf+V6tWzmTJg8uXXbGTNg113b99nurGK9fPr09nGM\n",
"HduMoxpTX/pTp1089fIZM1pvr8Zd3i5b1n7/qzG1GrO/Zb2V9zVGX/2U7Vs9Br311+rxa6c+RllW\n",
"Ps+qY/ZmMGet+9umk2fER/ps+0iPD/1/vgzkdTzaKKKvb3yvHySF58LMbGAkERHtview1nI43Wdm\n",
"ZtaSk5SZmWXLScrMzLLlJGVmZtlykjIzs2w5SZmZWbacpMzMLFtOUmZmli0nKTMzy5aTlJmZZctJ\n",
"yszMsuUkZWZm2XKSMjOzbDlJmZlZtpykzMwsW05SZmaWLScpMzPLlpOUmZlly0nKzMyy5SRlZmbZ\n",
"cpIyM7NsOUmZmVm2nKTMzCxbTlJmZpYtJykzM8uWk5SZmWXLScrMzLLlJGVmZtnqaJKSmCnxksTr\n",
"O9T/FInTO9G3mZmNPEVE5zoXFwKbAjdHMGeI+x4TwQtD15+ik3NhZrYukkREqFP9d+xISmILYB9g\n",
"NnBEKuuSWCyxQOJuiXkSx0jcIHGrxC6p3nYSP0rlN0jsn8rnSJwvcTVwnsQ0iUvL8STOSf0sk5iV\n",
"ys+UuFHiNql/iXL+fGg0ivVGo7kMh+q4neh3XTDc+zKa5m7+/GIZaJvZs4ulbFv2Uy+DnvPRaBT1\n",
"6mXl66i3+WvX16xZvcdcb1fd73osreKfPbvZRxnn9tsXt7Nmwc4794y92r7cv/J+2Ud1n6vL/Pkw\n",
"aVKxPmlSMcbUqc1t5VxPmlTcVrdNmlTcnz+/aFc+VvX+R9PzdCDGdLDvdwGXRfBbiUck3pTK3wjs\n",
"BjwG3At8M4K3SBwPfAL4NHA6cFoE10jsBFwG7J7a7wZMjeBZia7KeP8IPBbBGwEkxqbyUyJ4TGJD\n",
"4CqJSREs7y3wBQvg8cehq6v7A9/V1abBEGo0muMO5XhD3d9IGu59GU1zt2BBcfupTw2szapVxfqE\n",
"CUXbsp9Vq7qXfepTPeej0YCFC+HrX+9e1mgUryNoP3/t+irbtVNvV93veixl3NX4Fy6Ebbct+ihf\n",
"7w8/XGxbtAiefLL52i/HqbYvYyjvl/ta3ed6vCtWNG9/9zt45pnuiXDVKrj/fnjqKXjooea2FStg\n",
"zJhiueUWGJve2bbdtnv/5XvWuqaTSeoo4LS0/sN0fyFwYwQPA0isBC5PdW4Dpqf1Q4CJah5Abimx\n",
"ORDAJRE822K8g0lHbAARlE+VIyT+hmJfX0GR7HpNUmZmloeOJCmJ8RQJ5w0SAWxIkWB+At0SzEuV\n",
"+y9V4hGwTwTP1foF+HNvQ9fq7wycAOwdwRMS5wCbtGs8Z84coPhEs2pVF3Q7UDMzs0ajQWMYzy12\n",
"6kjqcOC8CP62LJBoAAf2s/0VwPHAv6S2e0awrI82VwIfpzhdWJ7u2wp4GnhS4uXAO4BF7Took1Sj\n",
"UZziMDOz7rq6uuiqnFecO3duR8fr1BcnjgR+XCu7KJW3+wpdVLYdD+ydvgBxO3BcrV6rNv8PGCex\n",
"XOIWoCsltqXAncAFwNWD3B8zMxsBHTmSiuCgFmVnAGfUyqZX1hcDi9P6oxQJrd7H3Nr9apungQ+1\n",
"aHPsQOOfORMmTy7Wh/tCZDneUI+7Ll1QHanHZDSYOXNwbVauLNZ33bV7PytX9iyrz0dXF6xe3bNs\n",
"7Njm66iddn098MDA2lX3e8aM1tuq8a9e3eyjfL2ffXZRtmxZ8QWFdmOU5eUXGKr72m6fx46FRx8t\n",
"6l50UTGnjzzSbAvFXC9eDNOmNccv2229dTH+uHGwww4956A/cz1adfTvpEYT/52UmdnAjdq/kzIz\n",
"M1tbTlJmZpYtJykzM8uWk5SZmWXLScrMzLLlJGVmZtlykjIzs2w5SZmZWbacpMzMLFtOUmZmli0n\n",
"KTMzy5aTlJmZZctJyszMsuUkZWZm2XKSMjOzbDlJmZlZtpykzMwsW05SZmaWLScpMzPLlpOUmZll\n",
"y0nKzMyy5SRlZmbZcpIyM7NsOUmZmVm2nKTMzCxbTlJmZpYtJykzM8uWk5SZmWXLSWqUaTQaIx1C\n",
"n0ZDjOA4h5rjHFqjJc5Oc5IaZUbDE3c0xAiOc6g5zqE1WuLsNCcpMzPLlpOUmZllSxEx0jFkQZIn\n",
"wsxsECJCnerbScrMzLLl031mZpYtJykzM8vWep+kJL1d0p2SfiPppGEYb0dJiyTdLuk2Scen8vGS\n",
"rpT0a0lXSBpbafOZFN+dkt5aKZ8iaXnadnql/GWSLkzl10t69VrEu6GkpZIuzTVOSWMl/UjSCkl3\n",
"SNon0zg/kx735ZK+m/od8TglfUvSw5KWV8qGJS5JH0xj/FrSBwYR51fT475M0sWSth7JOFvFWNl2\n",
"gqSXJI3PcS5T+SfSfN4m6csjHScAEbHeLsCGwEpgArARcAswscNjbg9MTutbAHcBE4GvACem8pOA\n",
"eWl99xTXRinOlTSvJd4AvCWt/xR4e1r/GHBmWj8C+P5axPt3wAXAJel+dnEC5wIfTutjgK1zizON\n",
"dQ/wsnT/QuCDOcQJHADsBSyvlHU8LmA8cDcwNi13A2MHGOdfAxuk9XkjHWerGFP5jsBlwL3A+Ezn\n",
"cjpwJbBRur/dSMcZEet9ktoPuKxy/2Tg5GGOYQFwCHAn8PJUtj1wZ1r/DHBSpf5lwL7AK4AVlfIj\n",
"gbMrdfZJ62OARwYZ26uAq9KT99JUllWcFAnpnhblucU5nuIDybjUx6UUb7BZxEnx5lN9w+p4XMBR\n",
"wFmVNmcDRw4kztq2WcB3RjrOVjECPwTeSPckldVcAj8ADmpRb0TjXN9P9+0A/K5y//5UNiwkTaD4\n",
"NPMrijeEh9Omh4GXp/VXprhKZYz18gdoxr5mvyLiBeCJ6imGATgN+AfgpUpZbnHuDDwi6RxJN0v6\n",
"pqTNc4szIv4I/CvwW+BB4PGIuDK3OCs6Hdc2vfQ1WB+m+DSfVZyS3gXcHxG31jZlE2PyWuDAdHqu\n",
"IWnvHOJc35NUjNTAkrYALgI+GRFPVbdF8RFjxGIDkDQD+ENELAVa/g1EDnFSfEp7E8WphTcBT1Mc\n",
"Ea+RQ5ySXgN8iuLT6yuBLSQdXa2TQ5yt5BpXlaRTgOci4rsjHUuVpM2AzwKnVotHKJy+jAHGRcS+\n",
"FB9OfzBZmLzQAAAFFElEQVTC8QBOUg9QnCsu7Uj3LN8RkjaiSFDnR8SCVPywpO3T9lcAf2gT46tS\n",
"jA+k9Xp52Wan1NcYYOv0SX4g9gcOk3Qv8D3gIEnnZxjn/RSfUm9M939EkbQeyizOvYFrI+LR9Mny\n",
"YorTzbnFWer04/xoi74G9fqT9CHgncD7K8W5xPkaig8my9Jr6VXATZJenlGMpfspnpek19NLkrYd\n",
"8Th7Oxe4ri8UnxzupngSbczwfHFCwHnAabXyr5DO+1IcCdQvAG9McWrrbpoXLX8F7JP6rF+0PCua\n",
"54kH/cWJ1Mc0mteksosT+AXwurQ+J8WYVZzAnsBtwKap/3OBj+cSJz2vT3Q8LorrdPdQXEAfV64P\n",
"MM63A7cD29bqjVic9Rhr26rXpHKby+OAuWn9dcBvs4hzsG9c68oCvIPigvZK4DPDMN5Uims8twBL\n",
"0/L29OBdBfwauKL6wFGcLlhJcTH7bZXyKcDytO1rlfKXURyq/wa4HpiwljFPo/ntvuzipEgANwLL\n",
"KD4Jbp1pnCdSvKEup0hSG+UQJ8WR8oPAcxTXEY4drrjSWL9JywcHGOeHU7v7aL6WzhzJOCsxPlvO\n",
"ZW37PaQklclcrokzPR/PT+PeBHSNdJwR4X+LZGZm+Vrfr0mZmVnGnKTMzCxbTlJmZpYtJykzM8uW\n",
"k5SZmWXLScrMzLLlJGU2AJJOk/TJyv3LJX2zcv9fJX16kH13Kf0kSottUyX9Kv2MwgpJf1PZtl3a\n",
"dlOq9x4VP1nys0HE8NnBxG7WKU5SZgNzNcW/jELSBsA2FH+RX9oPuKY/HaX2/am3PcXPpRwXERMp\n",
"/iD8OEnvTFUOBm6NiCkRcTXwf4D/GxEH96f/ms8Moo1ZxzhJmQ3MdRSJCGAPin919JSKH158GcVv\n",
"g90s6eD0X9lvlfSfkjYGkLRK0jxJNwHvUfGjmyvS/Vltxvw4cE5E3AIQxf9AOxE4WdKewJeBd6n4\n",
"ccp/Av4K+Jakr0jaQ9INaduy9I9ukXR0OvpaKulsSRtImgdsmsrO78DcmQ3YmJEOwGw0iYgHJb0g\n",
"aUeKZHUdxU8N7Ac8CdxK8WOa51D8Ns9KSecCfwucTvHfxFdHxBRJm1D826HpEXG3pAtp/d/Gdwe+\n",
"XSu7CdgjIpalxDQlIspfeZ4OnBARN0v6GjA/Ir6b/tHnGEkTgfcC+0fEi5LOBN4fESdL+nhE7DVU\n",
"82W2tnwkZTZw11Kc8tufIkldl9bLU32vB+6NiJWp/rnAgZX2F6bb3VK9u9P979D+Zxx6+3kH9bL9\n",
"OuCzkk6k+P9pz1CcHpwCLJG0FDiI4h+HmmXHScps4K6hOKU2ieKfa15PM2ld26K+6H6E9HSbftsl\n",
"mjsokkrVFIpTjb2KiO8BhwJ/AX6ajrIAzo2IvdKyW0R8vq++zEaCk5TZwF0LzAAejcJjFD89sF/a\n",
"9mtgQnn9BzgGWNyinztTvV3S/aPajPcN4EPp+hPpF07nUfycRq8k7RwR90bEGcB/USTWnwGHS9ou\n",
"1RkvaafU5Pl0WtAsC05SZgN3G8W3+q6vlN1K8ZPwf0yn1I4FfijpVuAF4OxUb80RVar3EeAn6YsT\n",
"D9PimlREPAQcDXxT0gqKI7n/jIifVPps93MG75V0WzqttwdwXkSsAD4HXCFpGcVPcWyf6v87cKu/\n",
"OGG58E91mJlZtnwkZWZm2XKSMjOzbDlJmZlZtpykzMwsW05SZmaWLScpMzPLlpOUmZlly0nKzMyy\n",
"9T9kxgDDol5HqgAAAABJRU5ErkJggg==\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# plot where each word occurs across the text, by token offset\n",
"text4.dispersion_plot([\"citizens\", \"democracy\", \"freedom\", \"duties\", \"America\"])"
]
},
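{
"cell_type": "markdown",
"metadata": {},
"source": [
"A dispersion plot draws one mark per occurrence of a word, positioned by how many tokens into the text it appears. The data behind such a plot can be sketched as follows. `word_offsets` is a hypothetical helper written for this example, not an NLTK function, and it matches tokens exactly (case-sensitive)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative sketch only: word_offsets is a hypothetical helper, not part of NLTK\n",
"def word_offsets(tokens, words):\n",
"    '''Map each target word to the list of token positions where it occurs.'''\n",
"    offsets = {w: [] for w in words}\n",
"    for position, token in enumerate(tokens):\n",
"        if token in offsets:\n",
"            offsets[token].append(position)\n",
"    return offsets\n",
"\n",
"# toy token list, made up for this example\n",
"sample = 'citizens value freedom and citizens defend freedom'.split()\n",
"print(word_offsets(sample, ['citizens', 'freedom', 'duties']))"
]
},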
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"44764"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# count of all tokens (including punctuation)\n",
"len(text3)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"2789"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# number of distinct tokens\n",
"len(set(text3))"
]
},
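{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two counts above combine into a simple measure of vocabulary richness: distinct tokens divided by total tokens. For text3 that is 2789 / 44764, or roughly 0.062. A minimal sketch, assuming a helper named `lexical_diversity` defined here (it is not imported from NLTK):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative sketch only: lexical_diversity is defined here, not imported from NLTK\n",
"def lexical_diversity(tokens):\n",
"    '''Fraction of tokens that are distinct (0.0 for an empty sequence).'''\n",
"    tokens = list(tokens)\n",
"    if not tokens:\n",
"        return 0.0\n",
"    # float() keeps the division exact under Python 2 as well as Python 3\n",
"    return len(set(tokens)) / float(len(tokens))\n",
"\n",
"# toy token list, made up for this example\n",
"sample = 'in the beginning God created the heaven and the earth'.split()\n",
"print(lexical_diversity(sample))"
]
},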
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'among',\n",
" u'the',\n",
" u'merits',\n",
" u'and',\n",
" u'the',\n",
" u'happiness',\n",
" u'of',\n",
" u'Elinor',\n",
" u'and',\n",
" u'Marianne',\n",
" u',',\n",
" u'let',\n",
" u'it',\n",
" u'not',\n",
" u'be',\n",
" u'ranked',\n",
" u'as',\n",
" u'the',\n",
" u'least',\n",
" u'considerable',\n",
" u',',\n",
" u'that',\n",
" u'though',\n",
" u'sisters',\n",
" u',',\n",
" u'and',\n",
" u'living',\n",
" u'almost',\n",
" u'within',\n",
" u'sight',\n",
" u'of',\n",
" u'each',\n",
" u'other',\n",
" u',',\n",
" u'they',\n",
" u'could',\n",
" u'live',\n",
" u'without',\n",
" u'disagreement',\n",
" u'between',\n",
" u'themselves',\n",
" u',',\n",
" u'or',\n",
" u'producing',\n",
" u'coolness',\n",
" u'between',\n",
" u'their',\n",
" u'husbands',\n",
" u'.',\n",
" u'THE',\n",
" u'END']"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# a Text can be indexed and sliced like a list of token strings\n",
"text2[141525:]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"FreqDist({u',': 18713, u'the': 13721, u'.': 6862, u'of': 6536, u'and': 6024, u'a': 4569, u'to': 4542, u';': 4072, u'in': 3916, u'that': 2982, ...})"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# build a frequency distribution (token -> count) for text1\n",
"fdist1 = FreqDist(text1)\n",
"fdist1"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(u',', 18713),\n",
" (u'the', 13721),\n",
" (u'.', 6862),\n",
" (u'of', 6536),\n",
" (u'and', 6024),\n",
" (u'a', 4569),\n",
" (u'to', 4542),\n",
" (u';', 4072),\n",
" (u'in', 3916),\n",
" (u'that', 2982),\n",
" (u\"'\", 2684),\n",
" (u'-', 2552),\n",
" (u'his', 2459),\n",
" (u'it', 2209),\n",
" (u'I', 2124),\n",
" (u's', 1739),\n",
" (u'is', 1695),\n",
" (u'he', 1661),\n",
" (u'with', 1659),\n",
" (u'was', 1632)]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# the 20 most frequent tokens with their counts\n",
"fdist1.most_common(20)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"906"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# look up the count of a single token\n",
"fdist1['whale']"
]
},
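{
"cell_type": "markdown",
"metadata": {},
"source": [
"In NLTK 3, `FreqDist` is a subclass of Python's `collections.Counter`, so the lookups above can be tried on any token list without the corpus data. The cell below is a small sketch using `Counter` directly. The sample sentence is made up for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative sketch only: uses the standard library Counter on a made-up token list\n",
"from collections import Counter\n",
"\n",
"tokens = 'the whale , the sea , and the ship'.split()\n",
"counts = Counter(tokens)\n",
"\n",
"# the two most frequent tokens with their counts\n",
"print(counts.most_common(2))\n",
"# count of one token; tokens that never occur count as zero\n",
"print(counts['whale'], counts['harpoon'])\n",
"# tokens that occur exactly once\n",
"hapaxes = sorted(w for w, c in counts.items() if c == 1)\n",
"print(hapaxes)"
]
},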
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": [
"iVBORw0KGgoAAAANSUhEUgAAAZQAAAEZCAYAAACw69OmAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\n",
"AAALEgAACxIB0t1+/AAAIABJREFUeJztnXuc1FX9/58vQQEVXVEDvJviBUVBECxXRSvDvJcpWmpK\n",
"3tAfWmYudlGzzEtlakHfvCRewzTvhpiK14BSNvGCookJChmyAtqqwPv3xznjDrADOzuf2Tm7834+\n",
"HvPYz+fM57zm9ZnZnfee8z4XmRmO4ziOUyprVNqA4ziO0zHwgOI4juNkggcUx3EcJxM8oDiO4ziZ\n",
"4AHFcRzHyQQPKI7jOE4mlC2gSLpe0jxJ0/PKekh6WNKrkiZKqsl7brSkmZJmSNo/r3ygpOnxuSvz\n",
"yrtIGh/LJ0vaMu+54+NrvCrpuHLdo+M4jtNEOVsofwCGrVBWBzxsZtsBj8RzJPUFjgL6xjpjJCnW\n",
"GQuMMLM+QB9JOc0RwPxYfgVwadTqAfwYGBwf5+cHLsdxHKc8lC2gmNmTwIIVig8BxsXjccBh8fhQ\n",
"4DYz+8TMZgGvAUMk9Qa6m9nUeN2NeXXyte4EvhCPvwxMNLMGM2sAHmblwOY4juNkTFvnUHqa2bx4\n",
"PA/oGY83AWbnXTcb2LSZ8jmxnPjzLQAzWwK8L2nDVWg5juM4ZaRiSXkLa774ui+O4zgdhM5t/Hrz\n",
"JPUys7mxO+s/sXwOsHnedZsRWhZz4vGK5bk6WwBvS+oMrG9m8yXNAYbm1dkceLQ5M9tuu60tXryY\n",
"efNCo2mbbbahe/fu1NfXA9C/f38AP/dzP/fzqj/v2TN0KM2bNw8zy+W4l8fMyvYAtgKm551fBpwb\n",
"j+uAS+JxX6AeWAvYGngdUHxuCjAEEPAgMCyWjwTGxuPhwB/jcQ/gX0ANsEHuuIA/K4Xzzz+/pPod\n",
"SSMFD1lopOAhFY0UPKSikYKHVDTi92az3/lla6FIug3YB9hI0luEkVeXALdLGgHMAo6M3+ovSbod\n",
"eAlYAoyMxnOB4wagG/CgmU2I5dcBN0maCcyPQQUze0/SRcDf43UXWkjOr0Qu4raWxsbGkup3JI0U\n",
"PGShkYKHVDRS8JCKRgoeUtIoRNkCipkdXeCpLxa4/mLg4mbKnwX6NVP+ETEgNfPcHwjDlh3HcZw2\n",
"oqpnyudyJ61l2LDSRyN3FI0UPGShkYKHVDRS8JCKRgoeUtIohJp6lqoPSVbN9+84jlMskgom5au6\n",
"hZIbxdBaGhqaTc1UpUYKHrLQSMFDKhopeEhFIwUPKWkUoqoDiuM4jpMd3uVVxffvOI5TLN7l5TiO\n",
"45Sdqg4onkPJTiMFD1lopOAhFY0UPKSikYKHlDQKUdUBxXEcx8kOz6FU8f07juMUi+dQHMdxnLJT\n",
"1QHFcyjZaaTgIQuNFDykopGCh1Q0UvCQkkYhqjqgOI7jONnhOZQqvn/HcZxiWVUOpa032HIcx3Ha\n",
"GDN4/XV44gnYZhvYZ5/yvE5Vd3l5DiU7jRQ8ZKGRgodUNFLwkIpGCh6K0Vi2DKZPhzFjYPhw2HRT\n",
"6NMHRoyABx4oXw7FWyiO4zjtnCVLYNq00AJ54gl48klYsGD5azbaCPbeGwYOLJ+PiuRQJJ0JfJuw\n",
"re81ZnalpB7AeGBL4m6OuZ0WJY0GTgSWAqPMbGIsH0jYzbErYTfHM2N5F+BGYDfCbo5Hmdmbzfjw\n",
"HIrjOO2Oxkb4+9+bAsgzz8Dixctfs9lmoWtr773DY/vtQc3vBF8Uq8qhtHlAkbQzcBuwO/AJMAE4\n",
"FTgF+K+ZXSbpXGADM6uT1Be4NV6/KfBXoI+ZmaSpwBlmNlXSg8BVZjZB0khgZzMbKeko4HAzG96M\n",
"Fw8ojuMkz+LFIWjkAsiUKfDxx8tf06dPU/DYe2/YcstsAsiKpDaxcQdgipk1mtlS4HHga8AhwLh4\n",
"zTjgsHh8KHCbmX1iZrOA14AhknoD3c1sarzuxrw6+Vp3Al9ozojnULLTSMFDFhopeEhFIwUPqWi0\n",
"tYeFC+HBB+Hcc2GPPaCmBr785ZD/ePLJEEz69YPTT4fx4+Htt+HVV+Haa+G442CrrQoHk3LOQ6lE\n",
"DuUF4Gexi6sR+ArwD6CnmeX25J0H9IzHmwCT8+rPJrRUPonHOebEcuLPtwDMbImk9yX1MLP3ynA/\n",
"juM4JbFgQch7PP54eEybFhLrOdZYAwYNgsMPhwsvhNpa6NGjcn4L0eYBxcxmSLoUmAh8ANQTciP5\n",
"15iksvdFLVq0iLq6Orp27QrAoEGDqK2tpaamBmiK5IXOc2Utvb7Qeb5Wa+pncV5TU1PR+im9n6XW\n",
"T+HzyL+H9v55pPB+Zv15vPsu/O1vDdTXw5131jB9Ouy6a7i+vr6Gzp1h+PAG+veHfv1q+PznYdmy\n",
"nF7bvp/19fVMmjSJxsZGVkfFJzZK+hmhpXEmMNTM5sburMfMbAdJdQBmdkm8fgJwPvBmvGbHWH40\n",
"sLeZnRavucDMJkvqDLxjZhs389qeQ3Ecp+zMndvU+nj8cXjppeWfX2stGDIkJNH32Qc+9zlYZ53K\n",
"eF0dqeVQkPSZ+HML4KuEpPu9wPHxkuOBu+PxvcBwSWtJ2hroA0w1s7nAQklDJAk4Frgnr05O6wjg\n",
"keZ8eA4lO40UPGShkYKHVDRS8JCKRrH1Z8+GW26Bk08Oo6t694ZLLmlg7NgQTLp2hX33hQsugMce\n",
"g4aGkGy/6CL44hcLB5MU3otVUal5KHdI2pCQBxlpZu9LugS4XdII4rBhADN7SdLtwEvAknh9rlkx\n",
"kjBsuBth2PCEWH4dcJOkmYRhwyuN8HIcx8mKWbOWb4H861/LP7/OOiEH8vWvhxbIoEHQpUtFrJaV\n",
"ind5VRLv8nIcp1jM4LXXQosiF0D+/e/lr1lvvZA4z3Vh7bYbrLlmZfxmja/l5TiO00rM4JVXYNKk\n",
"EDyeeCIM081ngw1gr72aAkj//tCpU0XsVpSqDihZ5FDyR3BUs0YKHrLQSMFDKhopeKiERn4AyT16\n",
"926gvr6pfm4Zk1wA6dcvDO3NykPqGoWo6oDiOI5jBjNnhuR4LoDMnbv8NX37wpFHNgWQvn3LMwu9\n",
"veM5lCq+f8epRnI5kEmTmoLIO+8sf81nPhNGYQ0dGh5ZrYPVEfAciuM4VUtuL5D8Lqw5c5a/ZuON\n",
"Q+DIBZEddvAA0hqqOqB4DiU7jRQ8ZKGRgodUNFLw0FqNf/87tD4efTQ8Ntpo5RxIrvWx776w446r\n",
"DiDt+b0oh0YhqjqgOI7TMZg3rymAPPZY6NLK57Ofha99rSmAeA6kPHgOpYrv33HaKwsWhCG8uRbI\n",
"iy8u//x664Xgsd9+IYDsvPPqR2E5LcNzKI7jtGsWL4annmoKIM89F3IjObp1C/NA9tsvPAYMgM7+\n",
"7dbmVPVb7jmU7DRS8JCFRgoeUtGopAczeP55uPtueO21Bv74xxqWLGl6fs01wwKKuQAyePCqlzJp\n",
"z+9FihqFqOqA4jhOOixdCk8/HYLI3XfDG2+E8v79w94gQ4Y0dWHtuSesvXZl/Tor4zmUKr5/x6k0\n",
"jY3w17/CXXfBvffCf//b9FzPnnDooXDQQWFW+vrrV86n04TnUBzHSYaGhrC97V13wV/+Ah980PTc\n",
"ttuGXQkPOyxsfeuJ9PZFVX9cvh9KdhopeMhCIwUPqWhk6eHtt2Hs2LAv+sYbwze+AXfcEYLJbruF\n",
"fUCmTw/7ol92GXz+803BpKO9Fx1BoxDeQnEcpyzMnBm6s268ESZPbirv1CnkQQ47LDy22KJyHp1s\n",
"8RxKFd+/42TNSy+Flsedd4ZRWjm6dg2tk8MPDzmRDTesnEenNJLLoUgaDXwTWAZMB04A1gHGA1sS\n",
"d2w0s4a8608ElgKjzGxiLB9I2LGxK2HHxjNjeRfgRmA3wo6NR5nZm210e45TNZiFrqo77giPl19u\n",
"em799eGQQ0IQ2X//dPdId7KjzXMokrYCTgJ2M7N+QCfCFr11wMNmth1hD/i6eH1f4CigLzAMGBP3\n",
"kAcYC4wwsz5AH0nDYvkIYH4svwK4tDkvnkPJTiMFD1lopOAhFY1C9c3g2Wdh9GjYbjvYddeQA3n5\n",
"ZejRA048MSTd//MfuOqqBg4/vLRgkvJ7Ua0ahahEC2UhYS/5tSUtBdYG3gZGA/vEa8YBkwhB5VDg\n",
"NjP7BJgl6TVgiKQ3ge5mNjXWuRE4DJgAHAKcH8vvBH5T7ptynI7MsmUwdWpTd9asWU3PbbwxfPWr\n",
"cMQRYa+Q/K1uP/ywza06FaQiORRJJwO/BP4HPGRmx0paYGYbxOcFvGdmG0i6GphsZrfE564F/kLo\n",
"FrvEzL4Uy/cCvm9mB0uaDnzZzN6Oz70GDDaz91bw4TkUxynA0qXwzDMhgNx5J8ye3fRc795NQWSv\n",
"vapzu9tqJakciqRtgLOArYD3gT9J+mb+NWZmksr+Tb/NNttQV1dH165dARg0aBC1tbWfLkuQaxr6\n",
"uZ9Xy7kZvPFGDTfdBM8/38B77/Hpsu9f+lID++wDQ4fW8LnPwcKFoX6nTun49/Psz+vr65k0aRKN\n",
"jY2sFjNr0wchH3Jt3vmxwG+Bl4Fesaw3MCMe1wF1eddPAIYAvYCX88qPBsbmXbNHPO4MvNucl/79\n",
"+1spLFiwoKT6HUkjBQ9ZaKTgoRIa//632c9/bta3r1nIkpj177/Att7a7JxzzCZPNlu6tLweUtZI\n",
"wUMqGiFsNP/9XokcygzgR5K6AY3AF4GpwAfA8YQE+vHA3fH6e4FbJf0K2BToA0w1M5O0UNKQWP9Y\n",
"4Kq8OscDk4EjCEl+x3HyWLQodGXddFPYQyTX+7vRRnD00eGxxx6+b4jTciqVQ/k+4Qt/GfAc8G2g\n",
"O3A7sAUrDxs+jzBseAlwppk9FMtzw4a7EYYNj4rlXYCbgAGEYcPDzWxWMz6sEvfvOJViyZIw2fCm\n",
"m8LSJ//7Xyjv0iUM8T32WBg2bPnEuuPks6ocik9srOL7d6oDM6ivD0Hk1lvD7oY59toLjjsuJNfL\n",
"tKK508FYVUDxtbxKIJUx4SlopOAhC40UPGSlMXNmA5ddBrvsEtbLuuKKEEy22y7MG/nXv+CJJ+Db\n",
"324+mKRyHylopOAhJY1C+FpejtOB+PDD0JV1ww1hKfj6+lC+4YYwfHjo0ho82PMiTnnwLq8qvn+n\n",
"Y2AWFl/8wx9g/HhYuDCUr7UWHHxwCCIHHBDOHadUkpqH4jhONrz9dsiL3HADzJjRVD54MJxwAhx1\n",
"FGywQcXsOVWI51BKIJX+zBQ0UvCQhUYKHlal8dFHYfmTAw+EzTeHuroQTHr2hO99D154AaZMgVNP\n",
"BaljvxdtqZGCh5Q0CuEtFMdJHDOYNi10ad16K7wXFxBac82wn8gJJ4Shvp39r9mpMJ5DqeL7d9Lm\n",
"3XfhlltCIMnfW6R//xBEjjkmTEJ0nLbEcyiO005YsiTss3799XD//eEcwiitb3wjBJISe2odp2x4\n",
"DqUEUunPTEEjBQ9ZaFTKw5tvwo9/DFtuGWasz5oVFmo86KCwPMrbb8OVVxYXTNrre5GiRgoeUtIo\n",
"hLdQHKdCfPJJaIVccw1MmNC0llafPnDyySE/0rt3ZT06TjF4DqWK79+pDG+8AddeG7q15s4NZWut\n",
"FZY/Oflk2Htvn3jopEtJORRJ6wL/M7OlkrYHtgf+YmEHRcdxWsDHH8O994bWyMSJTeU77ggnnRTW\n",
"09pww8r5c5wsaEkO5Qmgi6RNgYcIy8TfUE5TbYXnULLTSMFDFhpZe3jttTBXZPPN4etfD8Gka9cw\n",
"e/3JJ+HFF+E731k5mHTE96I9a6TgISWNQrQkhyIz+1DSCGCMmV0m6Z9lc+Q47ZyPPw5LoPz+9/Do\n",
"o03lO+8curS++U2fwe50TFabQ5E0DRgJXAGMMLMXJU03s35tYbCceA7FyZLZs+G3vw35kf/+N5R1\n",
"6xaWQDn5ZN+syukYlDoP5SxgNHBXDCbbAI9ladBx2jN//3tYGv5Pf2qaN7LLLnDKKWHyoe8z4lQL\n",
"Lcmh9DSzQ8zsUgAzex14qrUvKGl7SdPyHu9LGiWph6SHJb0qaaKkmrw6oyXNlDRD0v555QMlTY/P\n",
"XZlX3kXS+Fg+WdKWzXnxHEp2Gil4yEKjpfWXLg3zQ2prw2KMt90Whv0eeSQ89VQD9fUwcmTrg0l7\n",
"ei+qQSMFDylpFKIlAWV0C8tahJm9YmYDzGwAMBD4ELgLqAMeNrPtCHvA1wFI6gscBfQFhgFjpE87\n",
"DsYSuuH6AH0kDYvlI4D5sfwKwj71jlMyCxfCr38N224bhvk+/TSsv35YmPFf/wq5k5128q4tpzop\n",
"mEORdADwFcKX+R+B3J9Id6CvmQ0u+cVDa+NHZraXpBnAPmY2T1IvYJKZ7SBpNLAs10KSNAG4AHgT\n",
"eNTMdozlw4GhZnZqvOZ8M5siqTPwjplt3Mzrew7FaRGzZsFVV4X8yKJFoWybbeDMM+Fb34Lu3Svp\n",
"znHajtbmUN4GngUOjT9zAguB72TkbThwWzzuaWa53a7nAT3j8SbA5Lw6s4FNgU/icY45sZz48y0A\n",
"M1sSu9V6mNl7Gfl2qgAzeOaZkB+56y5YtiyU77NPGOp70EHQqVNlPTpOShQMKGb2T+Cfkm4pxyRG\n",
"SWsBBwPnNvPaJqnsTYe9996buro6unbtCsCgQYOora2lJnZ85/oaC53Pnj2bddddt8XXN3e+ePFi\n",
"Nttss1bXz1FTU9Pq+vl1K1U/lfdz8eLF9Oy5GXfcAffc08Arr0B9fQ2dO8PZZzdwxBEweHD6n0cW\n",
"72cKn0cq72cKn0el3s/6+nomTZpEY2Mjq8XMVvkAaoGHgZnAG/Hxr9XVa4HuocCEvPMZQK943BuY\n",
"EY/rgLq86yYAQ4BewMt55UcDY/Ou2SMedwbebc5D//79rRQWLFhQUv2OpJGCh1I1Fi0yGzNmgW22\n",
"mVlon5j16GF23nlmc+a0jYeUNFLwkIpGCh5S0Qhho/nv9ZbMQ3mFMHT4OWBpXiD67+rD1Sp1/0hY\n",
"wmVcPL+MkEi/VFIdUGNmdTEpfyswmNCV9VdgWzMzSVOAUcBU4AHgKjObIGkk0M/MTou5lcPMbHgz\n",
"Hmx19+90fJYtC/uOnHsuvPNOKNthBzjrrDCjfe21K+vPcVJiVTmUlgSUKWY2JGND6xCS6lub2aJY\n",
"1gO4HdgCmAUcaWYN8bnzgBOBJcCZZvZQLB9IWAamG/CgmY2K5V2Am4ABwHxguJnNasaHB5QqZ+pU\n",
"GDUqbJsLsPvucOGF8OUvwxpVvbmD4zRPqQHlEqAT8Gfgo1y5mT2XpclKMGDAAJs2bVqr6zc0NHza\n",
"31jtGil4KEbjnXdg9GgYNy6c9+oFl1wCBx/cQI8e7ec+yqmRgodUNFLwkIpGqTPl9wAMGLRC+b6t\n",
"duQ4FeKjj8I8kp/+FBYvDsvGf+c78IMfhKG/ZZzz5TgdHt8PpYrvv5owg/vug+9+F15/PZQdcgj8\n",
"8pdhkqLjOC2j1P1Qzie0UBR/AmBmP8nMoeOUkZdeCq2Q3D4kO+4YWin777/qeo7jFEdL0o4fxMdi\n",
"YBlh9vxWZfTUZvhaXtlppOBhRY0FC8JIrV12CcFk/fVDIPnnPwsHkxTvo1IaKXhIRSMFDylpFGK1\n",
"LRQz+0X+uaTLgYkFLnecirN0aVgi5Yc/DMvIr7EGnHoq/OQnsPFKC/A4jpMVRedQ4vDeqWbW7nue\n",
"PYfS8Xj88bC+1j/jFnB77w1XXgklNkYdx4mUmkOZnne6BvAZwPMnTlK89RacfXbYkwRgiy3gF78I\n",
"KwL7yr+O0za0JIdycHwcBOwPbGJmV5fVVRvhOZTsNCrl4eOP4dJLw8z2P/0JBg9u4MIL4eWXwx7u\n",
"xQaTFN7LVDRS8JCKRgoeUtIoREtyKLMk9Qf2IozyehLwPeWdivPoo3DGGSF4AHzta/Czn8H221fW\n",
"l+NUKy2ZKX8mcBJhpryAw4BrzOyq8tsrL55DaZ/MmRM2tPrjH8N5nz5w9dVhuRTHccpLqUuvTCes\n",
"3PtBPF8HmGxm/TJ32sZ4QGlffPJJCBznnx9muXfrFma4f+970KVLpd05TnWwqoDS0uXvlhU4btd4\n",
"DiU7jXJ7eOIJ2G23kHhfvBgOPTRMWPzBD5YPJqnfR3vSSMFDKhopeEhJoxAtWcvrD8AUSfldXteX\n",
"zZHj5DF3LpxzDtx8czj/7GfDVrwHHlhZX47jrEyL5qHEZeJriUl5M2v9Er0J4V1e6bJkCYwZAz/6\n",
"ESxcGFoho0fD978furocx6kMrcqhSBoMbGRmD65Q/hVgnpk9m7nTNsYDSpo88wyMHNk0OfErXwmt\n",
"km22qawvx3Fan0O5FHipmfKXgF80U16MoRpJd0h6WdJLkoZI6iHpYUmvSpooqSbv+tGSZkqaIWn/\n",
"vPKBkqbH567MK+8iaXwsnyxpy+Z8eA4lO40sPLzxRgMnngh77hmCyZZbwt13w/33tzyYpHAfHUUj\n",
"BQ+paKTgISWNQqwqoHRvbpfDWLZRia97JWGHxR2BXQj7ydcBD5vZdsAj8Zy4BfBRQF9gGDBG+nS6\n",
"2lhghJn1AfpIGhbLRxC2E+4DXEEIjk6iLFsGv/sdHHcc/OEPYY+SH/4wJN0PPdRnujtOe2FVXV6v\n",
"FVqva1XPrfYFpfWBaWb22RXKZwD7mNk8Sb2ASWa2g6TRwDIzuzReNwG4gLCF8KMxKBH3jh9qZqfG\n",
"a843symSOgPvmNlKywJ6l1flmTULRowIkxQhrAJ89dWw3XYVteU4TgFa2+X1iKSf5bUGkLSGpIuA\n",
"R0vwszXwrqQ/SHpO0jVxbktPM5sXr5kH9IzHmwCz8+rPBjZtpnxOLCf+fAvAzJYA78dFLZ1EMINr\n",
"roF+/UIw2WgjGD8eJkzwYOI47ZVVBZSzgW2A1yX9OQ4bnglsF59rLZ2B3YAxZrYbYa+VuvwLYrOh\n",
"7E0Hz6Fkp1FM/bfegmHD4OSTw5ySr34VXnwR9t+/oeTuLe8rz04jBQ+paKTgISWNQhSch2Jmi4Hh\n",
"krYBdiJ8wb9kZq+X+Jqzgdlm9vd4fgcwGpgrqZeZzZXUG/hPfH4OsHle/c2ixpx4vGJ5rs4WwNux\n",
"y2t9M3tvRSPrrbcedXV1dO3aFYBBgwZRW1tLTU0YD5B74wudL168eJXPt+R88eLFJdXPp7X12+p8\n",
"wYIGJkyAU0+tYeFCqK1t4Kyz4KtfrUGC2bMr/352pM+j1N9P//1O6/Oo1PtZX1/PpEmTaGxsZHVU\n",
"ZE95SU8A3zazVyVdAKwdn5pvZpdKqgNqzKwuJuVvBQYTurL+CmxrZiZpCjAKmAo8AFxlZhMkjQT6\n",
"mdlpMbdymJkNb8aH51DaiLffDi2SBx4I5wcfDP/3f9C7d2V9OY5THCWt5VUOJO0KXAusBbwOnAB0\n",
"Am4ntCxmAUeaWUO8/jzgRGAJcKaZPRTLBwI3AN0Io8ZGxfIuwE3AAGA+MLy5EWseUMqPGdxyC/y/\n",
"/wcNDWEb3quvhm9+00dvOU57JLmAkgoDBgywadNaP+m/oaHh0+ZhtWs0V3/ePDjlFLjnnnB+wAEh\n",
"Eb/pps0IZOAhC40UPKSikYKHVDRS8JCKRsmLQ0raS9IJ8XhjSVu32o3T4TELI7Z22ikEk+7d4brr\n",
"QndXoWDiOE77pyXL118ADAS2N7PtJG0K3G5me7aBv7LiXV7Z8+67YdmUO+4I51/6Elx7bdiS13Gc\n",
"9k+pLZTDgUMJw3sxszlA9+zsOR2FO+8MrZI77oB11gmz3x96yIOJ41QLLQkoH5nZp3ugxEmIHQKf\n",
"h5KNxvz5cPbZDRxxRGihDB0K06eH/EkxifdK30cqHlLRSMFDKhopeEhJoxAtCSh/kvR/QI2kkwnr\n",
"bF1bNkdOu+K++2DnncNs97XXDiO4HnkEtvYsm+NUHS3dD2V/ILfK70Nm9nBZXbURnkNpPe+/D2ed\n",
"BTfcEM5ra8PCjtu2aoU3x3HaC6XuKX828MeYO+lQeEBpHRMnhgUdZ88OG19dfDGceSZ06lRpZ47j\n",
"lJtSk/LdgYmSnpJ0hqSeq63RTvAcSnEaixbBqafCl78cgsngwVBfD9/9Lixa1H7uI3UPqWik4CEV\n",
"jRQ8pKRRiNUGFDO7wMx2Ak4HegNPSHqkbI6cJHn8cdh117Bcypprws9+Bk8/DTvsUGlnjuOkQotn\n",
"yscFG48AjgbWNbNdymmsLfAur9Xz4Ydw3nlwZdwPs39/GDcOdmn3n77jOK2hpC4vSSMlTSKM7tqI\n",
"sKijf51UAZMnw4ABIZh06gQ/+hFMmeLBxHGc5mlJDmUL4Cwz62tm55tZc/vMt0s8h9K8xkcfQV1d\n",
"2Nv91Vehb98QXH7yk7A9b1t4qJRGCh5S0UjBQyoaKXhISaMQBfdDkbSemS0ELgdsxR0Pm9tfxGn/\n",
"PPssHH982PBKgu9/Hy68EOKWMY7jOAVZ1Z7yD5jZgZJm0czuiWbW7qeueQ6liU8+CYn2n/4Uli4N\n",
"80nGjYPPf77SzhzHSQlfvr4AHlAC06eHVkluJf9Ro+DnPw8z3x3HcfIpNSm/0hDhjjJs2HMoYab7\n",
"oEFg1sCWW4YlVK68svhgUun7yEojBQ+paKTgIRWNFDykpFGIggFFUjdJGwIbS+qR99iKsBVvq5E0\n",
"S9LzkqZJmhrLekh6WNKrkiZKqsm7frSkmZJmxGVgcuUDJU2Pz12ZV95F0vhYPlnSlqX47YiYhe6t\n",
"E06Ajz+GAw8MLZV99620M8dx2iuryqGcBZwJbAK8nffUIuD3ZvabVr+o9AYwMD+xL+ky4L9mdpmk\n",
"c4ENVthTfnea9pTvE/eUnwqcYWZTJT3I8nvK72xmIyUdBRzue8o3sXQpnHFGWF5eCgs6nn56pV05\n",
"jtMeKHUtr1FmdlXGht4ABpnZ/LyyGcA+ZjZPUi9gkpntIGk0sMzMLo3XTQAuAN4EHjWzHWP5cGCo\n",
"mZ0arznfzKZI6gy8Y2YbN+Oj6gLKhx/CMceEnRS7dIFbb4WvfrXSrhzHaS+UlEMxs6sk7SzpSEnH\n",
"5R4lejLgr5L+IemkWNbTzObF43lAbs2wTYDZeXVnE1oqK5bPoakrblPgreh/CfD+isOeofpyKPPn\n",
"wxe/GIJJTQ389a9NwcT7mNPxkIpGCh5S0UjBQ0oahSg4DyVH3AJ4H2An4AHgAOAp4MYSXndPM3tH\n",
"0sbAw7FoMv9WAAAfOElEQVR18imxO6u6mg5lZtYsGDYMXnkFNt8cJkwIExYdx3GyYrUBhbB+167A\n",
"c2Z2Qlxt+JZSXtTM3ok/35V0FzAYmCepl5nNjeuG/SdePgfYPK/6ZoSWyZx4vGJ5rs4WwNuxy2v9\n",
"5iZiLlq0iLq6OrrGWXuDBg2itraWmpowHiAXyQud58paen2h83yt1tRf3fmsWTUccAD06tXA4YfD\n",
"1VfXsOmmy19fU1NT0uuVWj+l97PU+lmcp/B+llq/I72fKXwelXo/6+vrmTRpEo2NjayOluRQ/m5m\n",
"u0t6FtgPWAjMMLPtV6vevN7aQCczWxS3E54IXAh8EZhvZpdKqgNqVkjKD6YpKb9tbMVMAUYBUwmt\n",
"p/ykfD8zOy3mVg6r1qT8I4/A4YeHpeeHDoW774b116+0K8dx2iul7ofyd0kbANcA/wCmAc+U4Kcn\n",
"8KSkemAKcL+ZTQQuAb4k6VVC4LoEIK4ddjvwEvAXYGReFBhJ2I54JvCamU2I5dcBG0qaCZwF1DVn\n",
"pKPnUG69FQ44IASTo44K3VyFgon3MafjIRWNFDykopGCh5Q0CrHaLi8zGxkPfyfpIWA9M/tna1/Q\n",
"zN4AVvomj11SXyxQ52Lg4mbKnwX6NVP+EXBkaz22d8zgl7+Ec84J59/5DvziF7BGS/59cBzHaSWr\n",
"mocykGbW8MphZs+Vy1Rb0RG7vJYtg7PPhl//Opz/8pdhR0XHcZwsaNU8lLgHyqoCSrufU93RAkpj\n",
"Y1iT6/bbw66KN94Iw1fKHDmO47SeVuVQzGyome1b6FE+u21HR8qhNDSEYcG33w7rrRfyJcUEE+9j\n",
"TsdDKhopeEhFIwUPKWkUoiXzUI6n+eXrS5mH4mTIu++GCYovvAC9e8Nf/hL2f3ccx2lLWjJs+Dc0\n",
"BZRuhBFYz5nZEWX2VnY6QpfXiy+Glsns2bDDDqFlsqUvhek4TpnIdD+UuArweDP7chbmKkl7DyhT\n",
"poRg0tAQNsK67z7osdICM47jONlR6jyUFfkQaPe7NUL7zqE8+WRYl6uhAU4/vYG//rW0YOJ9zOl4\n",
"SEUjBQ+paKTgISWNQrQkh3Jf3ukaQF/CREOnQjz6KBx8cFg5+Oijw57v3bpV2pXjONVOS3IoQ/NO\n",
"lwBvmtlb5TTVVrTHLq8JE8JSKo2N8K1vwbXXQqdOlXblOE61kEkORdJ65LVomltssb3R3gLKPffA\n",
"kUeGHRZPOQXGjPHZ747jtC2l7il/iqS5wHTg2fj4R7YWK0N7yqH86U9wxBEhmJx5Jowd2xRMUuhX\n",
"TcFDFhopeEhFIwUPqWik4CEljUK0ZPn6cwjb6f63bC6cVXLzzWEG/LJlcO658POfh617HcdxUqIl\n",
"OZSJhD3ZP2gbS21He+jyuu46OOmksODj+eeHhwcTx3EqRal7yu8G3AD8Dfg4FpuZjcrSZCVIPaCM\n",
"GQOnnx6Of/5zqGt2EX7HcZy2o9R5KL8nbGo1mZA7yeVR2j0p51B+9aumYHLFFasOJin0q6bgIQuN\n",
"FDykopGCh1Q0UvCQkkYhWpJD6WRmvgB6G3LxxfCDH4TjMWPgtNMq68dxHKcltKTL62LgTeBe4KNc\n",
"eanDhiV1IrR4ZpvZwZJ6AOOBLYFZwJFm1hCvHQ2cCCwFRsUdHnN7ttwAdAUeNLMzY3kX4EZgN2A+\n",
"cJSZvdmMh6S6vHJ5kosuCnmS666DE06otCvHcZwmSu3yOoawhe4zNHV3ZdHldSZhW9/cN3od8LCZ\n",
"bQc8Es+Je8ofRZihPwwYI32alh4LjDCzPkAfScNi+QjC/vR9gCuASzPwW1bMwgiuiy4KExVvvtmD\n",
"ieM47YvVBhQz28rMtl7xUcqLStoM+AphP/hccDgEGBePxwGHxeNDgdvM7BMzmwW8BgyR1BvobmZT\n",
"43U35tXJ17oT+EJzPlLJoZiFuSWXXw6dO8P48XDMMW3vo5L1U9FIwUMqGil4SEUjBQ8paRSiUvuh\n",
"XEGY37JeXllPM5sXj+cBPePxJoQBATlmA5sCn8TjHHNiOfHnW9HnEknvS+qR4uz+Zcvg1FPh97+H\n",
"tdaCO+4I63Q5juO0N1qSlN+dZvZDIbQIikbSQcB/zGzaCuuEfYqZmaSyJzcWLVpEXV0dXbt2BWDQ\n",
"oEHU1tZSU1MDNEXyQue5spZev+L5/PkNXH55CCZdu8K99zaw++4ArdMr5bympqai9bN4P1f8z6tS\n",
"9VP4PPLvob1/Him8nyl8HpV6P+vr65k0aRKNjY2sjjbfDyUm+Y8lLDTZldBK+TMhcA01s7mxO+sx\n",
"M9tBUh2AmV0S608AzicMFHjMzHaM5UcDe5vZafGaC8xssqTOwDtmtnEzXiqWlF+yBI47Dm67DdZZ\n",
"B+6/H4YOrYgVx3GcFpPUfihmdp6ZbR7zMMOBR83sWMIosuPjZccDd8fje4HhktaStDXQB5hqZnOB\n",
"hZKGxCT9scA9eXVyWkcQkvwrUakcytKlYaXg226Dz32ugYceKi2YpNCvmoKHLDRS8JCKRgoeUtFI\n",
"wUNKGoVIYT+UXBPhEuB2SSOIw4YBzOwlSbcTRoQtAUbmNStGEoYNdyMMG54Qy68DbpI0kzBseHiG\n",
"fkti6VI48US45RZYd92QiN9zz0q7chzHKZ3W7Icyy8xmF7i8XdHWXV7LloV1ua6/PnRzTZgAtbVt\n",
"9vKO4zgls6our4ItFEl9CCOvJq1QXiupi5m9nq3Njk1uNNf114fdFR94wIOJ4zgdi1XlUH4NLGym\n",
"fGF8rt3TVjkUMzjjDLjmmjCa6/77YZ99itPIwkc5NVLwkIVGCh5S0UjBQyoaKXhISaMQqwooPc3s\n",
"+RULY1lJExuridykxbFjoUsXuPde2G+/SrtyHMfJnoI5FEmvmdm2xT7Xnih3DsUMvvtd+PWvw6TF\n",
"e+6BYcNWX89xHCdVWjts+B+STm5G7CQ6yPL15SS3Ntevfw1rrgl//rMHE8dxOjarCihnASdIelzS\n",
"r+LjccLCi2e1jb3yUq4cillYfj63Ntcdd8CBBxankYWPttRIwUMWGil4SEUjBQ+paKTgISWNQhQc\n",
"5RVnrH8e2BfYmTBf5H4ze7RsbjoIF1wQdljs1Cks9HjIIZV25DiOU36KXnqlI1GOHMpPfhL2NOnU\n",
"KcyE//rXM5V3HMepKFkvveIU4OKLQzBZYw246SYPJo7jVBdVHVCyzKFcdlnIm0gwbhwcfXTxGln4\n",
"qJRGCh6y0EjBQyoaKXhIRSMFDylpFKKqA0pW/OpXYUSXFGbCf/OblXbkOI7T9ngOpcT7v+qqMHER\n",
"wkz4b387A2OO4ziJ4jmUMvHb3zYFk9/9zoOJ4zjVTVUHlFJyKDffDNdeG/oif/MbOOWU1umk0ifq\n",
"fczpeEhFIwUPqWik4CEljUJUdUBpLR9+CGefHY5/+Us4/fTK+nEcx0kBz6G04v6vvBLOOgsGDYKp\n",
"U0My3nEcpxpIKociqaukKZLqJb0k6eexvIekhyW9Kmli3Ls+V2e0pJmSZkjaP698oKTp8bkr88q7\n",
"SBofyydL2jIr/42NcOml4fjHP/Zg4jiOk6PNA4qZNQL7mll/YBdgX0m1QB3wsJltR9gDvg5AUl/g\n",
"KMLWw8OAMXEPeYCxwAgz6wP0kZRbfnEEMD+WXwFc2pyX1uRQrrsO3nkH+veH2to0+jNT0EjBQxYa\n",
"KXhIRSMFD6lopOAhJY1CVCSHYmYfxsO1gE7AAuAQYFwsHwccFo8PBW4zs0/MbBbwGjBEUm+gu5lN\n",
"jdfdmFcnX+tO4AtZ+P7oI7jkknDsrRPHcZzlqUhAkbSGpHpgHvCYmb1I2NBrXrxkHtAzHm8C5O9h\n",
"PxvYtJnyObGc+PMtADNbArwvqceKPurr64vyfcMNMHs29OsHhx4KNTU1q62zOjqKRgoestBIwUMq\n",
"Gil4SEUjBQ8paRSi4GrD5cTMlgH9Ja0PPCRp3xWeN0llHy2wzTbbUFdXR9euXQEYNGgQtbW1n77h\n",
"uaZhTU0NH38Mf/5zA/37w3nn1bDGGss/v+L1fu7nfu7nHeG8vr6eSZMm0djYyGoxs4o+gB8B3wNm\n",
"AL1iWW9gRjyuA+ryrp8ADAF6AS/nlR8NjM27Zo943Bl4t7nX7t+/v7WUa681A7O+fc2WLg1lCxYs\n",
"aHH9QnQUjRQ8ZKGRgodUNFLwkIpGCh5S0Qhho/nv80qM8tooN4JLUjfgS8A04F7g+HjZ8cDd8fhe\n",
"YLiktSRtDfQBpprZXGChpCExSX8scE9enZzWEYQkf6v55BP42c/C8Q9/GFYTdhzHcZanzeehSOpH\n",
"SJivER83mdnlMcdxO7AFMAs40swaYp3zgBOBJcCZZvZQLB8I3AB0Ax40s1GxvAtwEzAAmA8Mt5DQ\n",
"X9GLteT+x42Db30LttsOXnop7HXiOI5TjaxqHopPbFzN/S9ZAn37wsyZIbAcd1wbmXMcx0mQpCY2\n",
"pkRL5qGMHx+CyWc/C8ccs/xzqYwJT0EjBQ9ZaKTgIRWNFDykopGCh5Q0ClHVAWV1LF0KP/1pOP7B\n",
"D6BzRcbEOY7jtA+8y2sV9z9+PAwfDltuGVopa67ZhuYcx3ESxLu8WsGyZXDRReH4vPM8mDiO46yO\n",
"qg4oq8qh3HUXvPgibL45HH9889ek0p+ZgkYKHrLQSMFDKhopeEhFIwUPKWkUoqoDSiGWLYOf/CQc\n",
"19VBly6V9eM4jtMe8BxKM/d/zz1w2GGwySbw+usQV2ZxHMepejyHUgRmTbmTc8/1YOI4jtNSqjqg\n",
"NJdD+ctf4NlnoWdPOOmkVddPpT8zBY0UPGShkYKHVDRS8JCKRgoeUtIoRFUHlBUxa8qdfP/70K1b\n",
"Zf04juO0JzyHknf/EyfCl78MG28Mb7wB66xTQXOO4zgJ4jmUFmAGF14Yjr/3PQ8mjuM4xVLVASU/\n",
"h/LYY/DMM9CjB5x2Wsvqp9KfmYJGCh6y0EjBQyoaKXhIRSMFDylpFKKqA0o+udzJd78L3btX1ovj\n",
"OE57xHMoZjz+OAwdCjU1MGsWrL9+pZ05juOkiedQVkNu3slZZ3kwcRzHaS2V2AJ4c0mPSXpR0guS\n",
"crss9pD0sKRXJU3MbRMcnxstaaakGZL2zysfKGl6fO7KvPIuksbH8smStmzOS//+/Xn6aXjkEVhv\n",
"PRg1qrh7SaU/MwWNFDxkoZGCh1Q0UvCQikYKHlLSKEQlWiifAN8xs52APYDTJe0I1AEPm9l2hD3g\n",
"6wAk9QWOAvoCw4AxcQ95gLHACDPrA/SRNCyWjwDmx/IrgEsLmcm1TkaNgg02yPI2HcdxqouK51Ak\n",
"3Q38Jj72MbN5knoBk8xsB0mjgWVmdmm8fgJwAfAm8KiZ7RjLhwNDzezUeM35ZjZFUmfgHTPbuJnX\n",
"NjDWXTfkTjbcsA1u2HEcpx2TbA5F0lbAAGAK0NPM5sWn5gE94/EmwOy8arOBTZspnxPLiT/fAjCz\n",
"JcD7knoU8nHGGR5MHMdxSqVim9pKWhe4EzjTzBY19WKBmVloPZSXvffem9dfr2PJkq5ccAEMGjSI\n",
"2tpaampC+ibX11jofPbs2ay77rotvr6588WLF7PZZpu1un6OmpqaVtfPr1up+qm8nx3l88ji/Uzh\n",
"80jl/Uzh86jU+1lfX8+kSZNobGxktZhZmz+ANYGHgLPyymYAveJxb2BGPK4D6vKumwAMAXoBL+eV\n",
"Hw2Mzbtmj3jcGXi3OR/9+/e3733PWs2CBQtaX7mDaaTgIQuNFDykopGCh1Q0UvCQikYIG81/t7d5\n",
"DiUm1McRkubfySu/LJZdKqkOqDGzupiUvxUYTOjK+iuwrZmZpCnAKGAq8ABwlZlNkDQS6Gdmp8Xc\n",
"ymFmNrwZLzZ3rtGz54rPOI7jOM2xqhxKJQJKLfAE8DyQe/HRhKBwO7AFMAs40swaYp3zgBOBJYQu\n",
"sodi+UDgBqAb8KCZ5YYgdwFuIuRn5gPDzWxWM16sre/fcRynPZNUQEmJAQMG2LRp01pdv6Gh4dP+\n",
"xmrXSMFDFhopeEhFIwUPqWik4CEVjWRHeTmO4zgdh6puoXiXl+M4TnF4C8VxHMcpO1UdUJrbU74Y\n",
"UllXJwWNFDxkoZGCh1Q0UvCQikYKHlLSKERVBxTHcRwnOzyHUsX37ziOUyyeQ3Ecx3HKTlUHFM+h\n",
"ZKeRgocsNFLwkIpGCh5S0UjBQ0oahajqgOI4juNkh+dQqvj+HcdxisVzKI7jOE7ZqeqA4jmU7DRS\n",
"8JCFRgoeUtFIwUMqGil4SEmjEFUdUBzHcZzs8BxKFd+/4zhOsXgOxXEcxyk7FQkokq6XNE/S9Lyy\n",
"HpIelvSqpImSavKeGy1ppqQZkvbPKx8oaXp87sq88i6SxsfyyZK2bM6H51Cy00jBQxYaKXhIRSMF\n",
"D6lopOAhJY1CVKqF8gdg2ApldcDDZrYd8Eg8J24BfBTQN9YZE7cRBhgLjDCzPkAfSTnNEYTthPsA\n",
"VwCXNmdi0aJFJd3EU089VVL9jqSRgocsNFLwkIpGCh5S0UjBQ0oahahIQDGzJ4EFKxQfQthrnvjz\n",
"sHh8KHCbmX0St/F9DRgiqTfQ3cymxutuzKuTr3Un8IXmfLz++usl3cc//vGPkup3JI0UPGShkYKH\n",
"VDRS8JCKRgoeUtIoREo5lJ5mNi8ezwN6xuNNgNl5180GNm2mfE4sJ/58C8DMlgDvS+pRJt+O4zgO\n",
"aQWUT4lDr8o+/Kpnz56rv2gVNDY2luyho2ik4CELjRQ8pKKRgodUNFLwkJJGQcysIg9gK2B63vkM\n",
"oFc87g3MiMd1QF3edROAIUAv4OW88qOBsXnX7BGPOwPvFvBg/vCHP/zhj+Iehb7XO5MO9wLHExLo\n",
"xwN355XfKulXhK6sPsBUMzNJCyUNAaYCxwJXraA1GTiCkORfiUJjqR3HcZziqcjERkm3AfsAGxHy\n",
"JT8G7gFuB7YAZgFHmllDvP484ERgCXCmmT0UywcCNwDdgAfNbFQs7wLcBAwA5gPDY0LfcRzHKRNV\n",
"PVPecRzHyY6UurzanDj0+D0z+6jCPnqZ2dxKemgNceRcH6BLrszMnmhjD8u9d6l8ppWmvf5OOe2b\n",
"JEd5tSE3A69I+kWFfVxX4dcvGkknAY8TBkBcCDwEXFABKyu+d0V/ppL2lPQNScfHx3HFGJDUtSVl\n",
"q6j/RnxMKeZ1V8ODGWq1GElntqSsQN3FkhYVeCws0seRktaLxz+SdJek3YqoXytp3Xh8rKRfFVpx\n",
"YxUa20t6RNKL8XwXST8sRiPWK+n3s02p1CivVB6EoLpTC6/tRfgCmxDP+xJm6reV15viz7NK0FgM\n",
"LCrwWFiEzguE3FV9PN8BuKsVfnoBBwMHAZ+pwGd6M/AMMAa4Ovco8vWea0lZWz6Aaa2oczmwHrAm\n",
"YSDLf4FjS33d3O9IG9//9PizFpgUf7+mFFMfELArMA04HXi8SA9PEEakTovnAl4sUiOL38+SP9eW\n",
"Pqq6ywvAzJYBL7bw8hsIy8b8IJ7PJAwkaKsWxkBJmwAnSrqR8Av6aRLMzN5bnYCZ5f7r+inwNuEX\n",
"FuAbhMmiLaXRzP4nCUldzWyGpO2LqI+kIwm/7I/Hot9IOsfM/lSMzooU+ZkOBPpa/Msrhti9tgmw\n",
"dvzvN/d5rAesXaxexlzTijr7m9k5kg4nDIz5KvAkYYDLKpF0NHAMsLWk+/Ke6k4YGNPWLI0/DwKu\n",
"MbP7JV1URP0lZmaSDgN+a2bXShpRpIe1zWxKbqWoqPdJkRqt/v3Mo9Wfa7FUfUApko3MbLykOgAz\n",
"+0TSkjZ8/d8R/sP4LPBsM89vXYTWIWa2S975WEnPAz9qYf23JG1AGN79sKQFhF/WYvghsLuZ/QdA\n",
"0saE+yspoBTJC4R5T2+3ou7+wLcIw9l/mVe+CDivZGclYGZjWlEt931wEHCHmb0vqaVfZM8A7wAb\n",
"A78gBFcI78U/W+GlVOZI+j3wJeCS2AVZTBf/oji69JvAXpI6Ef7DL4Z3JW2bO5F0BOE9KoZSfj9z\n",
"lPK5tuqFnJaxWNKGuRNJewDvt9WLm9lVwFWSxgL/B+xN+I/4STOrL1LuA0nfBG6L58MJ3WEt9XJ4\n",
"PLxA0iTCf+UTivQg4N288/k0fRG1FRsDL0maCuQS+WZmh6yuopmNA8ZJOsLM7iinyTbiPkkzgEbg\n",
"NEmficerxczeBN4E9iijv2I4krCY7OVm1hBbk+cUUf8oQovrRDObK2kLQqAshjMIf6fbS3ob+Bch\n",
"QK2WvFbeurTy9zOPVn+uxeLDhosgznu5GtiJ0KWyMXCEmbXpf2AxyXkS8OdYdDihWX9V4VoraWwN\n",
"XAl8PhY9TZjjMytDq6vzcDmhj/pWQiA5CnjezL7fhh6GNlduZpOK1DmIkFP7NBlvZj8pxVsliP8w\n",
"NZjZUknrEBZgXe1oMUlPm9mekhaT1w0bMTNbrxx+Uya2ir5GWBWkB7CQ8F6s9vci7/fSWPmfLDOz\n",
"xymC1n6uxeIBpUgkrQnkcgWvmFmxfaJZeJhOWFrmg3i+DjDZzPq1tZdSkHQZMIWQODXgKcJ9tVlA\n",
"yQJJ/0cYoLAfIXfxdUICuNg+94og6Qtm9oikr9EUDHJfYmZmfy5QtcORZWCU9BDQQOiezuV0MLNf\n",
"Fqy0ssZlK/49SLrUzM5tqUas0w/YkfB7atHHjcVotOh1PKAUh6Q9Cf9xdKaMH8xqPEwHBpvZ/+J5\n",
"N8JyNC0OKLHZexJN9wLhD+bEjO2uysM0MxuwQtn0tgiMGX9xTDezfpKeN7Nd4nDTCWZWm6npMiHp\n",
"QjM7X9INrPxeYGYntL2r9o+kF8xs5xI1Sv4bkXQBYWWSnYAHgAOAp8zsiFK8NYfnUIpA0s2EhHg9\n",
"ef9xEPZiaUv+AEyR9GfCf5KHAdcXqXEPYVjjw8CyWNYm/11IOg0YCWyjvF07CSOCnm4LD2a2Z/y5\n",
"bgZy/4s/P5S0KSEX1CsD3TbBzM6Ph6fS1EXj3w2l84ykXczs+WIrZvw3cgSha/k5MztBUk/glmI9\n",
"tQT/pSmOLIbwlYyZ/UrS4zR1FX3LzKYVKdOt2GZzhtwK/AW4BDiXvBFBZlaJIaalcl8c8XY5TaPv\n",
"WjNst9LcQ1MXTRnXOO/Y5AWATsAJkt5g+YT6Ls3XXI4s/0b+F3MnSyStD/wH2LxIjRbhXV5FIOlP\n",
"hMR1KUP4kiDOQ/mbmT1QaS8diZiI7WpxYdP2RBZdNA5I2mpVz7dk4Iuk9cxsYUymN9cNudo5Z3la\n",
"Ywhz544CzgY+IEy2zLwr0wNKC1hhCN8AwnL5rR3ClwQxd7A28DGQG1hQlaNxsiAvt9YpV9bWubVS\n",
"ifM2ftOaLhonWyQ9YGYHxtbNSphZi+ecxa76xwmDXv4HrFeuz9gDSgvIG8J3GWEse/4wvsvMbHCb\n",
"m8oANS3umD/UtajhiE7h3JqZ/b+KmSqCFbpo+gCt6aJxyoCkWwjB4Ekze7mVGvsBexG6yLcFnot6\n",
"v87MaO61PKC0nEqOSsoahcUdRwGbEb4I9yB0ge1XUWPtEEkvk0BurbVk0UXjlIcYDGoJAWEbwrpi\n",
"RQcDSZ2BQYSh7acS8ipFLZXUotdpp38DbUr+iAvg9bynugNPm9k3KmKsBCS9AOxOCCL9Je0IXJw3\n",
"A95pIR0pt+akR6nBQNIjwDrA3wjdXk/mljvKGh/l1TI62qgkWHlxx5dV5OKO1U7Gy2M4zko0EwwG\n",
"tSIYPE8ISDsTZusvkPS33Dy2LPGA0gLM7H3Cml3DK+0lQ7JY3LHayc14vgw4lBVya21vx+mAlBwM\n",
"zOw7AJK6ExYz/QNhnlSXVVRrFd7l5eQGHaxHmN39cYXttDs6Um7NSZO8YPA9oJeZtTgYSPp/hBzM\n",
"QMKAiycJ3V6PZu3TWyhO0QshOoEUZvw7HZtmgsH1hIBQDF0Jrennyr32oLdQHKeVxFnHG9CxcmtO\n",
"Qkg6h7BEUtmDQRZ4QHEcx3EyoZgdzBzHcRynIB5QHMdxnEzwgOI4juNkggcUx8kAST+Q9IKkf0qa\n",
"Jqls67tJmhS3o3acpPBhw45TIpI+BxwIDDCzT+Kim5lPGsvDaKPN0BynGLyF4jil0wv4b25Yp5m9\n",
"Z2bvSPqRpKmSpsd954FPWxi/kvR3SS9L2l3SXZJelXRRvGYrSTMk3SzpJUl/ils9L4ek/SU9I+lZ\n",
"SbdLWieWXyLpxdhiuryN3genyvGA4jilMxHYXNIrkn4rae9Y/hszGxxnzHeTdFAsN+AjM9sdGEvY\n",
"KfFUwvIa34pL4gBsB/zWzPoSlt0Ymf+ikjYibJz0BTMbSNhp8buxhXSYme1kZrsCF5Xrxh0nHw8o\n",
"jlMiZvYBYSbzycC7wHhJxwP7SZos6XnCSrF986rdG3++ALxgZvPisjf/oml71rfM7G/x+GbCMuY5\n",
"RNhyoC9h7/JpwHHAFoR15xolXSfpcJr2vHecsuI5FMfJADNbRtgI6fG4DMupQD9goJnNkXQ+eRuZ\n",
"0bQq8bK849x57u8yP08ims+bPGxmx6xYGAcFfAE4AjgjHjtOWfEWiuOUiKTtJPXJKxoAzCAEgPmS\n",
"1gW+3grpLSTtEY+PYfk1nAyYDOwpaZvoYx1JfWIepcbM/gJ8F9i1Fa/tOEXjLRTHKZ11gasl1QBL\n",
"gJnAKUADoUtrLjClQN1Vjdh6BThd0vXAi4R8S1NFs/9K+hZwm6TcqLIfAIuAeyR1JbRsvtPK+3Kc\n",
"ovC1vBwnQeK2vPf5EvhOe8K7vBwnXfy/Padd4S0Ux3EcJxO8heI4juNkggcUx3EcJxM8oDiO4ziZ\n",
"4AHFcRzHyQQPKI7jOE4meEBxHMdxMuH/A8A9viCKa0WSAAAAAElFTkSuQmCC\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"fdist1.plot(20, cumulative=True)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'CIRCUMNAVIGATION',\n",
" u'Physiognomically',\n",
" u'apprehensiveness',\n",
" u'cannibalistically',\n",
" u'characteristically',\n",
" u'circumnavigating',\n",
" u'circumnavigation',\n",
" u'circumnavigations',\n",
" u'comprehensiveness',\n",
" u'hermaphroditical',\n",
" u'indiscriminately',\n",
" u'indispensableness',\n",
" u'irresistibleness',\n",
" u'physiognomically',\n",
" u'preternaturalness',\n",
" u'responsibilities',\n",
" u'simultaneousness',\n",
" u'subterraneousness',\n",
" u'supernaturalness',\n",
" u'superstitiousness',\n",
" u'uncomfortableness',\n",
" u'uncompromisedness',\n",
" u'undiscriminating',\n",
" u'uninterpenetratingly']"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# apply a list comprehension to get words over 15 characters\n",
"V = set(text1)\n",
"long_words = [w for w in V if len(w) > 15]\n",
"sorted(long_words)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'#14-19teens',\n",
" u'#talkcity_adults',\n",
" u'((((((((((',\n",
" u'........',\n",
" u'Question',\n",
" u'actually',\n",
" u'anything',\n",
" u'computer',\n",
" u'cute.-ass',\n",
" u'everyone',\n",
" u'football',\n",
" u'innocent',\n",
" u'listening',\n",
" u'remember',\n",
" u'seriously',\n",
" u'something',\n",
" u'together',\n",
" u'tomorrow',\n",
" u'watching']"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# words longer than 7 characters that occur more than 7 times in the chat corpus\n",
"fdist2 = FreqDist(text5)\n",
"sorted(w for w in set(text5) if len(w) > 7 and fdist2[w] > 7)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"United States; fellow citizens; four years; years ago; Federal\n",
"Government; General Government; American people; Vice President; Old\n",
"World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;\n",
"God bless; every citizen; Indian tribes; public debt; one another;\n",
"foreign nations; political parties\n"
]
}
],
"source": [
"# word sequences that appear together unusually often\n",
"text4.collocations()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Raw Text Processing"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1176896"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
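],
"source": [
"# download raw text from an online repository\n",
"try:\n",
"    from urllib.request import urlopen  # Python 3\n",
"except ImportError:\n",
"    from urllib2 import urlopen  # Python 2\n",
"url = \"http://www.gutenberg.org/files/2554/2554.txt\"\n",
"response = urlopen(url)\n",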
"raw = response.read().decode('utf8')\n",
"len(raw)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"u'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\\r\\n'"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"raw[:75]"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"254352"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# tokenize the raw text\n",
"from nltk import word_tokenize\n",
"tokens = word_tokenize(raw)\n",
"len(tokens)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'The',\n",
" u'Project',\n",
" u'Gutenberg',\n",
" u'EBook',\n",
" u'of',\n",
" u'Crime',\n",
" u'and',\n",
" u'Punishment',\n",
" u',',\n",
" u'by']"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokens[:10]"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'CHAPTER',\n",
" u'I',\n",
" u'On',\n",
" u'an',\n",
" u'exceptionally',\n",
" u'hot',\n",
" u'evening',\n",
" u'early',\n",
" u'in',\n",
" u'July',\n",
" u'a',\n",
" u'young',\n",
" u'man',\n",
" u'came',\n",
" u'out',\n",
" u'of',\n",
" u'the',\n",
" u'garret',\n",
" u'in',\n",
" u'which',\n",
" u'he',\n",
" u'lodged',\n",
" u'in',\n",
" u'S.',\n",
" u'Place',\n",
" u'and',\n",
" u'walked',\n",
" u'slowly',\n",
" u',',\n",
" u'as',\n",
" u'though',\n",
" u'in',\n",
" u'hesitation',\n",
" u',',\n",
" u'towards',\n",
" u'K.',\n",
" u'bridge',\n",
" u'.']"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# wrap the tokens in an nltk.Text object, which can be sliced like a list\n",
"text = nltk.Text(tokens)\n",
"text[1024:1062]"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya\n",
"Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old\n",
"woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;\n",
"great deal; Nikodim Fomitch; young man; Ilya Petrovitch; n't know;\n",
"Project Gutenberg; Dmitri Prokofitch; Andrey Semyonovitch; Hay Market\n"
]
}
],
"source": [
"text.collocations()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"5338"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"raw.find(\"PART I\")"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'BBC',\n",
" u'NEWS',\n",
" u'|',\n",
" u'Health',\n",
" u'|',\n",
" u'Blondes',\n",
" u\"'to\",\n",
" u'die',\n",
" u'out',\n",
" u'in']"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
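],
"source": [
"# HTML parsing using the Beautiful Soup library\n",
"from bs4 import BeautifulSoup\n",
"try:\n",
"    from urllib.request import urlopen  # Python 3\n",
"except ImportError:\n",
"    from urllib2 import urlopen  # Python 2\n",
"url = \"http://news.bbc.co.uk/2/hi/health/2284783.stm\"\n",
"html = urlopen(url).read().decode('utf8')\n",
"# name the parser explicitly so the result does not depend on which parsers are installed\n",
"raw = BeautifulSoup(html, 'html.parser').get_text()\n",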
"tokens = word_tokenize(raw)\n",
"tokens[0:10]"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Displaying 5 of 5 matches:\n",
"hey say too few people now carry the gene for blondes to last beyond the next \n",
"blonde hair is caused by a recessive gene . In order for a child to have blond\n",
" have blonde hair , it must have the gene on both sides of the family in the g\n",
"ere is a disadvantage of having that gene or by chance . They do n't disappear\n",
"des would disappear is if having the gene was a disadvantage and I do not thin\n"
]
}
],
"source": [
"# isolate just the article text\n",
"tokens = tokens[110:390]\n",
"text = nltk.Text(tokens)\n",
"text.concordance('gene')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Regular Expressions"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# regular expression library\n",
"import re\n",
"wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'abaissed',\n",
" u'abandoned',\n",
" u'abased',\n",
" u'abashed',\n",
" u'abatised',\n",
" u'abed',\n",
" u'aborted',\n",
" u'abridged',\n",
" u'abscessed',\n",
" u'absconded']"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# match the end of a word\n",
"[w for w in wordlist if re.search('ed$', w)][0:10]"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'abjectly',\n",
" u'adjuster',\n",
" u'dejected',\n",
" u'dejectly',\n",
" u'injector',\n",
" u'majestic',\n",
" u'objectee',\n",
" u'objector',\n",
" u'rejecter',\n",
" u'rejector']"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# wildcard matches any single character\n",
"[w for w in wordlist if re.search('^..j..t..$', w)][0:10]"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'gold', u'golf', u'hold', u'hole']"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# anchors (^ and $) combined with character sets; each set matches exactly one character\n",
"[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',\n",
" u'miiiiiinnnnnnnnnneeeeeeee',\n",
" u'mine',\n",
" u'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))\n",
"\n",
"# the plus symbol matches one or more repetitions of the preceding character\n",
"[w for w in chat_words if re.search('^m+i+n+e+$', w)]"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'0.0085',\n",
" u'0.05',\n",
" u'0.1',\n",
" u'0.16',\n",
" u'0.2',\n",
" u'0.25',\n",
" u'0.28',\n",
" u'0.3',\n",
" u'0.4',\n",
" u'0.5']"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wsj = sorted(set(nltk.corpus.treebank.words()))\n",
"\n",
"# match decimal numbers: digits, a literal dot (escaped), then more digits\n",
"[w for w in wsj if re.search(r'^[0-9]+\\.[0-9]+$', w)][0:10]"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'C$', u'US$']"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# uppercase letters followed by a literal dollar sign\n",
"[w for w in wsj if re.search(r'^[A-Z]+\\$$', w)]"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'1614',\n",
" u'1637',\n",
" u'1787',\n",
" u'1901',\n",
" u'1903',\n",
" u'1917',\n",
" u'1925',\n",
" u'1929',\n",
" u'1933',\n",
" u'1934']"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# braces give an exact repeat count: tokens made of exactly four digits\n",
"[w for w in wsj if re.search('^[0-9]{4}$', w)][0:10]"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'10-day',\n",
" u'10-lap',\n",
" u'10-year',\n",
" u'100-share',\n",
" u'12-point',\n",
" u'12-year',\n",
" u'14-hour',\n",
" u'15-day',\n",
" u'150-point',\n",
" u'190-point']"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# digits, a hyphen, then a lowercase word of 3 to 5 letters\n",
"[w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)][0:10]"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'black-and-white',\n",
" u'bread-and-butter',\n",
" u'father-in-law',\n",
" u'machine-gun-toting',\n",
" u'savings-and-loan']"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# three hyphen-joined lowercase parts: 5+ letters, 2-3 letters, then up to 6 letters\n",
"[w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)][0:10]"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'62%-owned',\n",
" u'Absorbed',\n",
" u'According',\n",
" u'Adopting',\n",
" u'Advanced',\n",
" u'Advancing',\n",
" u'Alfred',\n",
" u'Allied',\n",
" u'Annualized',\n",
" u'Anything']"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# the pipe expresses alternatives: words ending in 'ed' or 'ing'\n",
"[w for w in wsj if re.search('(ed|ing)$', w)][0:10]"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(u'io', 549),\n",
" (u'ea', 476),\n",
" (u'ie', 331),\n",
" (u'ou', 329),\n",
" (u'ai', 261),\n",
" (u'ia', 253),\n",
" (u'ee', 217),\n",
" (u'oo', 174),\n",
" (u'ua', 109),\n",
" (u'au', 106),\n",
" (u'ue', 105),\n",
" (u'ui', 95)]"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# using \"findall\" to extract partial matches from words\n",
"fd = nltk.FreqDist(vs for word in wsj \n",
" for vs in re.findall(r'[aeiou]{2,}', word))\n",
"fd.most_common(12)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Normalizing Text"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# NLTK has several word stemmers built in\n",
"porter = nltk.PorterStemmer()\n",
"lancaster = nltk.LancasterStemmer()"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'UK',\n",
" u'Blond',\n",
" u\"'to\",\n",
" u'die',\n",
" u'out',\n",
" u'in',\n",
" u'200',\n",
" u\"years'\",\n",
" u'Scientist',\n",
" u'believ']"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[porter.stem(t) for t in tokens][0:10]"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'uk',\n",
" u'blond',\n",
" u\"'to\",\n",
" u'die',\n",
" u'out',\n",
" u'in',\n",
" u'200',\n",
" u\"years'\",\n",
" u'sci',\n",
" u'believ']"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[lancaster.stem(t) for t in tokens][0:10]"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'UK',\n",
" u'Blondes',\n",
" u\"'to\",\n",
" u'die',\n",
" u'out',\n",
" u'in',\n",
" u'200',\n",
" u\"years'\",\n",
" u'Scientists',\n",
" u'believe']"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
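],
"source": [
"# The WordNet lemmatizer maps each word to its dictionary form (treating it as a noun\n",
"# by default) and leaves the word unchanged when no such form is found\n",
"wnl = nltk.WordNetLemmatizer()\n",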
"[wnl.lemmatize(t) for t in tokens][0:10]"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# NLTK also provides a tokenizer that takes a regular expression as a parameter;\n",
"# the groups are non-capturing so that whole matches are returned as tokens\n",
"text = 'That U.S.A. poster-print costs $12.40...'\n",
"pattern = r'''(?x) # set flag to allow verbose regexps\n",
" (?:[A-Z]\\.)+ # abbreviations, e.g. U.S.A.\n",
" | \\w+(?:-\\w+)* # words with optional internal hyphens\n",
" | \\$?\\d+(?:\\.\\d+)?%? # currency and percentages, e.g. $12.40, 82%\n",
" | \\.\\.\\. # ellipsis\n",
" | [][.,;\"'?():_`-] # these are separate tokens; includes ], [\n",
"'''\n",
"nltk.regexp_tokenize(text, pattern)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tagging"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('They', 'PRP'),\n",
" ('refuse', 'VBP'),\n",
" ('to', 'TO'),\n",
" ('permit', 'VB'),\n",
" ('us', 'PRP'),\n",
" ('to', 'TO'),\n",
" ('obtain', 'VB'),\n",
" ('the', 'DT'),\n",
" ('refuse', 'NN'),\n",
" ('permit', 'NN')]"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Use a built-in tokenizer and tagger\n",
"text = word_tokenize(\"They refuse to permit us to obtain the refuse permit\")\n",
"nltk.pos_tag(text)"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"man time day year car moment world family house country child boy\n",
"state job way war girl place word work\n"
]
}
],
"source": [
"# Find words that occur in contexts similar to 'woman' (uses the untagged Brown words)\n",
"text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())\n",
"text.similar('woman')"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(u'The', u'AT'),\n",
" (u'Fulton', u'NP-TL'),\n",
" (u'County', u'NN-TL'),\n",
" (u'Grand', u'JJ-TL'),\n",
" (u'Jury', u'NN-TL'),\n",
" (u'said', u'VBD'),\n",
" (u'Friday', u'NR'),\n",
" (u'an', u'AT'),\n",
" (u'investigation', u'NN'),\n",
" (u'of', u'IN')]"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Tagged words are saved as tuples\n",
"nltk.corpus.brown.tagged_words()[0:10]"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(u'The', u'DET'),\n",
" (u'Fulton', u'NOUN'),\n",
" (u'County', u'NOUN'),\n",
" (u'Grand', u'ADJ'),\n",
" (u'Jury', u'NOUN'),\n",
" (u'said', u'VERB'),\n",
" (u'Friday', u'NOUN'),\n",
" (u'an', u'DET'),\n",
" (u'investigation', u'NOUN'),\n",
" (u'of', u'ADP')]"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nltk.corpus.brown.tagged_words(tagset='universal')[0:10]"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(u'NOUN', 30640),\n",
" (u'VERB', 14399),\n",
" (u'ADP', 12355),\n",
" (u'.', 11928),\n",
" (u'DET', 11389),\n",
" (u'ADJ', 6706),\n",
" (u'ADV', 3349),\n",
" (u'CONJ', 2717),\n",
" (u'PRON', 2535),\n",
" (u'PRT', 2264),\n",
" (u'NUM', 2166),\n",
" (u'X', 106)]"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from nltk.corpus import brown\n",
"brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')\n",
"tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)\n",
"tag_fd.most_common()"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"VERB ADV ADP ADJ . PRT \n",
" 37 8 7 6 4 2 \n"
]
}
],
"source": [
"# Part of speech tag count for words following \"often\" in a text\n",
"brown_lrnd_tagged = brown.tagged_words(categories='learned', tagset='universal')\n",
"tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'often']\n",
"fd = nltk.FreqDist(tags)\n",
"fd.tabulate()"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Load the tagged and untagged news sentences used to train and evaluate the taggers\n",
"from nltk.corpus import brown\n",
"brown_tagged_sents = brown.tagged_sents(categories='news')\n",
"brown_sents = brown.sents(categories='news')"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('I', 'NN'),\n",
" ('do', 'NN'),\n",
" ('not', 'NN'),\n",
" ('like', 'NN'),\n",
" ('green', 'NN'),\n",
" ('eggs', 'NN'),\n",
" ('and', 'NN'),\n",
" ('ham', 'NN'),\n",
" (',', 'NN'),\n",
" ('I', 'NN'),\n",
" ('do', 'NN'),\n",
" ('not', 'NN'),\n",
" ('like', 'NN'),\n",
" ('them', 'NN'),\n",
" ('Sam', 'NN'),\n",
" ('I', 'NN'),\n",
" ('am', 'NN'),\n",
" ('!', 'NN')]"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Default tagger (assigns the same tag to every token);\n",
"# the most frequent tag in the news category serves as the default\n",
"tags = [tag for (word, tag) in brown.tagged_words(categories='news')]\n",
"most_common_tag = nltk.FreqDist(tags).max()\n",
"raw = 'I do not like green eggs and ham, I do not like them Sam I am!'\n",
"tokens = word_tokenize(raw)\n",
"default_tagger = nltk.DefaultTagger(most_common_tag)\n",
"default_tagger.tag(tokens)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.13089484257215028"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Evaluate the performance against a tagged corpus\n",
"default_tagger.evaluate(brown_tagged_sents)"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(u'Various', u'JJ'),\n",
" (u'of', u'IN'),\n",
" (u'the', u'AT'),\n",
" (u'apartments', u'NNS'),\n",
" (u'are', u'BER'),\n",
" (u'of', u'IN'),\n",
" (u'the', u'AT'),\n",
" (u'terrace', u'NN'),\n",
" (u'type', u'NN'),\n",
" (u',', u','),\n",
" (u'being', u'BEG'),\n",
" (u'on', u'IN'),\n",
" (u'the', u'AT'),\n",
" (u'ground', u'NN'),\n",
" (u'floor', u'NN'),\n",
" (u'so', u'QL'),\n",
" (u'that', u'CS'),\n",
" (u'entrance', u'NN'),\n",
" (u'is', u'BEZ'),\n",
" (u'direct', u'JJ'),\n",
" (u'.', u'.')]"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Training a unigram tagger\n",
"from nltk.corpus import brown\n",
"brown_tagged_sents = brown.tagged_sents(categories='news')\n",
"brown_sents = brown.sents(categories='news')\n",
"unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)\n",
"unigram_tagger.tag(brown_sents[2007])"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.9349006503968017"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Now evalute it\n",
"unigram_tagger.evaluate(brown_tagged_sents)"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.9730592517453309"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Combining taggers\n",
"t0 = nltk.DefaultTagger('NN')\n",
"t1 = nltk.UnigramTagger(brown_tagged_sents, backoff=t0)\n",
"t2 = nltk.BigramTagger(brown_tagged_sents, backoff=t1)\n",
"t2.evaluate(brown_tagged_sents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Classifying Text"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'last_letter': 'k'}"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Define a feature extractor\n",
"def gender_features(word):\n",
" return {'last_letter': word[-1]}\n",
"gender_features('Shrek')"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Prepare a list of examples\n",
"from nltk.corpus import names\n",
"labeled_names = ([(name, 'male') for name in names.words('male.txt')] +\n",
" [(name, 'female') for name in names.words('female.txt')])\n",
"import random\n",
"random.shuffle(labeled_names)"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Process the names data\n",
"featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]\n",
"train_set, test_set = featuresets[500:], featuresets[:500]\n",
"classifier = nltk.NaiveBayesClassifier.train(train_set)"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'male'"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"classifier.classify(gender_features('Neo'))"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'female'"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"classifier.classify(gender_features('Trinity'))"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.752\n"
]
}
],
"source": [
"print(nltk.classify.accuracy(classifier, test_set))"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Most Informative Features\n",
" last_letter = u'a' female : male = 35.4 : 1.0\n",
" last_letter = u'k' male : female = 31.9 : 1.0\n",
" last_letter = u'f' male : female = 17.4 : 1.0\n",
" last_letter = u'p' male : female = 11.3 : 1.0\n",
" last_letter = u'm' male : female = 10.2 : 1.0\n"
]
}
],
"source": [
"classifier.show_most_informative_features(5)"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Document classification\n",
"from nltk.corpus import movie_reviews\n",
"documents = [(list(movie_reviews.words(fileid)), category)\n",
" for category in movie_reviews.categories()\n",
" for fileid in movie_reviews.fileids(category)]\n",
"random.shuffle(documents)"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())\n",
"word_features = all_words.keys()[:2000]\n",
"\n",
"def document_features(document):\n",
" document_words = set(document)\n",
" features = {}\n",
" for word in word_features:\n",
" features['contains(%s)' % word] = (word in document_words)\n",
" return features"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"featuresets = [(document_features(d), c) for (d,c) in documents]\n",
"train_set, test_set = featuresets[100:], featuresets[:100]\n",
"classifier = nltk.NaiveBayesClassifier.train(train_set)"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.64\n"
]
}
],
"source": [
"print(nltk.classify.accuracy(classifier, test_set))"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Most Informative Features\n",
" contains(sans) = True neg : pos = 8.4 : 1.0\n",
" contains(uplifting) = True pos : neg = 8.2 : 1.0\n",
" contains(mediocrity) = True neg : pos = 7.7 : 1.0\n",
" contains(dismissed) = True pos : neg = 7.0 : 1.0\n",
" contains(overwhelmed) = True pos : neg = 6.3 : 1.0\n"
]
}
],
"source": [
"classifier.show_most_informative_features(5)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}