{ "cells": [ { "cell_type": "code", "execution_count": 69, "id": "9a8e5623", "metadata": { "collapsed": false }, "outputs": [], "source": [ "import sys\n", "#reload(sys)\n", "#sys.setdefaultencoding(\"utf-8\")" ] }, { "cell_type": "markdown", "id": "bcc1b649", "metadata": {}, "source": [ "# Basic NLTK Usage" ] }, { "cell_type": "code", "execution_count": 70, "id": "19298ae8", "metadata": { "collapsed": false }, "outputs": [], "source": [ "import nltk" ] }, { "cell_type": "code", "execution_count": 71, "id": "2fe59107", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "showing info http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nltk.download()" ] }, { "cell_type": "code", "execution_count": 72, "id": "ddd9651f", "metadata": { "collapsed": false }, "outputs": [], "source": [ "from nltk.book import *" ] }, { "cell_type": "markdown", "id": "2ef671fe", "metadata": {}, "source": [ "# Concordances" ] }, { "cell_type": "markdown", "id": "dd8cd428", "metadata": {}, "source": [ "Wikipedia:\n", "> A concordance is an alphabetical list of the principal words used in a \n", "> book or body of work, with their immediate contexts. Because of the time, \n", "> difficulty, and expense involved in creating a concordance in the \n", "> pre-computer era, only works of special importance, such as the Vedas,\n", "> Bible, Qur'an or the works of Shakespeare, had concordances prepared for them." ] }, { "cell_type": "code", "execution_count": 73, "id": "92c78aea", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Displaying 11 of 11 matches:\n", "ong the former , one was of a most monstrous size . ... This came towards us , \n", "ON OF THE PSALMS . 
\" Touching that monstrous bulk of the whale or ork we have r\n", "ll over with a heathenish array of monstrous clubs and spears . Some were thick\n", "d as you gazed , and wondered what monstrous cannibal and savage could ever hav\n", "that has survived the flood ; most monstrous and most mountainous ! That Himmal\n", "they might scout at Moby Dick as a monstrous fable , or still worse and more de\n", "th of Radney .'\" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere l\n", "ing Scenes . In connexion with the monstrous pictures of whales , I am strongly\n", "ere to enter upon those still more monstrous stories of them which are to be fo\n", "ght have been rummaged out of this monstrous cabinet there is no telling . But \n", "of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u\n" ] } ], "source": [ "print text1\n", "text1.concordance(\"monstrous\")" ] }, { "cell_type": "code", "execution_count": 74, "id": "7a2bcde6", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Displaying 25 of 67 matches:\n", " unto Enoch was born Irad : and Irad begat Mehujael : and Mehujael begat Methus\n", "d Irad begat Mehujael : and Mehujael begat Methusa and Methusael begat Lamech .\n", "Mehujael begat Methusa and Methusael begat Lamech . And Lamech took unto him tw\n", "ed an hundred and thirty years , and begat a son in his own likeness , and afte\n", "n Seth were eight hundred yea and he begat sons and daughters : And all the day\n", "ived an hundred and five years , and begat Enos : And Seth lived after he begat\n", "begat Enos : And Seth lived after he begat Enos eight hundred and seven years ,\n", " eight hundred and seven years , and begat sons and daughte And all the days of\n", " . 
And Enos lived ninety years , and begat Cainan : And Enos lived after he beg\n", "gat Cainan : And Enos lived after he begat Cainan eight hundred and fifteen yea\n", "ight hundred and fifteen years , and begat sons and daughte And all the days of\n", ". And Cainan lived seventy years and begat Mahalaleel : And Cainan lived after \n", "halaleel : And Cainan lived after he begat Mahalaleel eight hundred and forty y\n", " eight hundred and forty years , and begat sons and daughte And all the days of\n", "eel lived sixty and five years , and begat Jared : And Mahalaleel lived after h\n", "ared : And Mahalaleel lived after he begat Jared eight hundred and thirty years\n", "eight hundred and thirty years , and begat sons and daughte And all the days of\n", "hundred sixty and two years , and he begat Eno And Jared lived after he begat E\n", "e begat Eno And Jared lived after he begat Enoch eight hundred years , and bega\n", "egat Enoch eight hundred years , and begat sons and daughte And all the days of\n", "och lived sixty and five years , and begat Methuselah : And Enoch walked with G\n", ": And Enoch walked with God after he begat Methuselah three hundred years , and\n", "Methuselah three hundred years , and begat sons and daughte And all the days of\n", "hundred eighty and seven years , and begat Lamech . And Methuselah lived after \n", "mech . And Methuselah lived after he begat Lamech seven hundred eighty and two \n" ] } ], "source": [ "print text3\n", "text3.concordance(\"begat\")" ] }, { "cell_type": "markdown", "id": "bbee8ae4", "metadata": {}, "source": [ "# Contexts" ] }, { "cell_type": "markdown", "id": "67d7b3b0", "metadata": {}, "source": [ "NLTK can perform contextual analyses. Here, it looks for contexts of the word \"monstrous\" (\"most monstrous size\"), and then looks for other words that appear in such a context (\"most ____ size\")." 
] }, { "cell_type": "code", "execution_count": 75, "id": "0aeb527a", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "abundant candid careful christian contemptible curious delightfully\n", "determined doleful domineering exasperate fearless few gamesome\n", "horrible impalpable imperial lamentable lazy loving\n" ] } ], "source": [ "print text1\n", "text1.similar(\"monstrous\")" ] }, { "cell_type": "code", "execution_count": 76, "id": "f39bb082", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "very exceedingly heartily so a amazingly as extremely good great\n", "remarkably sweet vast\n" ] } ], "source": [ "print text2\n", "text2.similar(\"monstrous\")" ] }, { "cell_type": "code", "execution_count": 77, "id": "5f2d9157", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "a_lucky a_pretty am_glad be_glad is_pretty\n" ] } ], "source": [ "print text2\n", "text2.common_contexts([\"monstrous\",\"very\"])" ] }, { "cell_type": "code", "execution_count": 78, "id": "56a1fe7c", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "most_and\n" ] } ], "source": [ "print text1\n", "text1.common_contexts([\"monstrous\",\"curious\"])" ] }, { "cell_type": "markdown", "id": "a2ab7ddd", "metadata": {}, "source": [ "# Dispersion Plots" ] }, { "cell_type": "markdown", "id": "6f69498c", "metadata": {}, "source": [ "A *dispersion plot* marks each position in a corpus where a given word occurs,\n", "so clusters and gaps in its usage are visible at a glance. The search-match markers in an editor's scroll bar work the same way."
] }, { "cell_type": "code", "execution_count": 79, "id": "be25c5ec", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAokAAAGFCAYAAACR/WnHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xt8FNXB//HvEm6K3BEQEIlE5BLudxRdoIDVoAEVFETF\nSxUNWmopVH8q2IJ4ocVrtT4KchFRW1FSsFRgBQ2gBKQKeAUEAX0g3EEgIef3x3k2u9mZ3Wxuu4F8\n3q/XvnbmzJkzZ87MLl9mZsFjjDECAAAAglSIdwcAAABQ9hASAQAA4EBIBAAAgAMhEQAAAA6ERAAA\nADgQEgEAAOBASARQ5q1cuVItW7YsdjvNmjXT0qVLi7z+3LlzNXDgwGL3o6SU1LgUVoUKFbRly5aY\nbxdAbBESAZS44oaxUL1799ZXX31V7HY8Ho88Ho/rsltvvVVVqlRRjRo1VKNGDbVt21YPPvigDh06\nlFdnxIgR+ve//13sfpSUkhqXUNu2bVOFChVUvXp1Va9eXYmJiXriiScK3c7MmTPVu3fvEu8fgNgg\nJAIocZHCWFnl8Xg0fvx4HTp0SHv37tWMGTO0evVqXXLJJTp27Fjc+pWbmxu3bR88eFCHDx/WvHnz\n9Nhjj2nJkiVx6wuA2CMkAogZY4ymTp2qpKQk1atXT8OGDdP+/fslSaNHj9Z1112XV3f8+PH61a9+\nJUny+Xw6//zz85bt2LFDQ4YMUf369VWvXj2NGTNGkvT999+rb9++qlevns4991zddNNNOnjwYKH6\nJ0mVK1dWly5d9P777ysrK0szZsyQlP/KmDFGY8eOVYMGDVSzZk21a9dOmzZtkmSvSt59990aMGCA\natSoIa/Xq+3bt+dt56uvvlL//v1Vt25dtWzZUm+//XbesltvvVWjR4/WlVdeqXPOOUc+n0+LFi1S\n69atVaNGDTVp0kTTpk1zHZfNmzfL6/Wqdu3aSk5O1sKFC/O1e++99yolJUU1atRQjx49or5l3KNH\nD7Vp00ZffvmlY9nBgwd18803q379+mrWrJkmT54sY4w2b96s0aNHa9WqVapevbrq1KkT1bYAlB2E\nRAAx8+yzz+r999/XihUrtHv3btWuXVv33nuvJOkvf/mLvvjiC73++utauXKlXnvtNc2aNcvRxqlT\np5SSkqLExET98MMP2rlzp2644Ya85Q899JB2796tzZs3a8eOHZo4cWKR+3vOOeeof//+WrlypWPZ\nkiVLtHLlSn377bc6ePCg3n777XxB6I033tAjjzyivXv3qkOHDhoxYoQk6ejRo+rfv79uuukm7dmz\nR2+++abuuecebd68OW/defPm6eGHH9aRI0fUq1cv3X777XrllVd06NAhbdy4UX379nX0Jzs7W4MG\nDdIVV1yhPXv26LnnntOIESP0zTff5NWZP3++Jk6cqP379yspKUkPPfRQxP03xsgYo08++UQbN25U\nx44dHXXGjBmjw4cPa+vWrfroo480a9YszZgxQ61atdJLL72knj176vDhw9q3b1/BAw6gTCEkAoiZ\nl19+WX/+85/VqFEjVapUSY8++qjeeecd5ebm6qyzztLs2bM1duxYjRw5Us8//7waNWrkaOPTTz/V\n7t279dRTT+mss85SlSpVdMkll0iSmjdvrn79+qlSpUqqV6+exo4dq48++qhYfT7vvPNcA06lSpV0\n+PBhbd68Wbm5ubr44ovVsGHDvOUpKSm69NJLVblyZU2ePFmrVq3Sjz/+qPT0dCUmJuqWW25RhQo
V\n1KFDBw0ZMiTf1cTU1FT17NlTklS1alVVrlxZGzdu1KFDh1SzZk3XsLZ69WodPXpUEyZMUMWKFdWn\nTx+lpKRo3rx5eXWGDBmiLl26KCEhQSNGjNDnn38ecd/r1aununXr6s4779QTTzyhPn365Ft+6tQp\nzZ8/X48//riqVaumCy64QA888IBmz54tKXBlFsDpiZAIIGa2bdumwYMHq3bt2qpdu7Zat26tihUr\n6ueff5YkdevWTRdeeKEk6frrr3dtY8eOHbrgggtUoYLz6+vnn3/WDTfcoCZNmqhmzZoaOXKksrKy\nitXnnTt3qm7duo7yvn37Ki0tTffee68aNGigu+66S4cPH5Zkn29s0qRJXt1q1aqpTp062rVrl374\n4QetWbMmbwxq166tN954I28MPB5PvlvIkvSPf/xDixYtUrNmzeT1erV69WpHf3bt2uVY74ILLtCu\nXbvy2m3QoEHesrPOOktHjhyJuO9ZWVnat2+fNm3apLS0NMfyvXv3Kjs7WxdccEFeWdOmTbVz586I\n7QI4PRASAcRM06ZN9cEHH2j//v15r2PHjum8886TJL3wwgs6efKkGjVqpCeffNK1jfPPP1/bt2/X\nqVOnHMsefPBBJSQk6Msvv9TBgwc1e/bsQv3wI/THNkeOHNGHH34Y9he6Y8aM0dq1a7Vp0yZ98803\neuqppyTZK2g7duzI186+ffvUuHFjNW3aVJdffnm+MTh8+LBeeOGFsP3q0qWLFixYoD179ig1NVVD\nhw511GnUqJF27NiR7+rdDz/8oMaNG0e9/4VVr149VapUSdu2bcsr2759e15APt1+vAQgP0IigFJx\n8uRJHT9+PO+Vk5Oju+++Ww8++GDejzj27Nmj999/X5L0zTff6OGHH9bcuXM1a9YsPfnkk9qwYYOj\n3W7duum8887ThAkTdOzYMR0/flwZGRmSbBirVq2aatSooZ07d+aFtmj4n7+TpBMnTigzM1Opqamq\nW7euRo0a5ai/du1arVmzRtnZ2Tr77LNVtWpVJSQk5C1ftGiRPvnkE508eVIPP/ywevbsqcaNG+uq\nq67SN998ozlz5ig7O1vZ2dn67LPP8v4pm9BbtNnZ2Zo7d64OHjyohIQEVa9ePd92/Lp3766zzz5b\nTz75pLKzs+Xz+ZSenp73vGZp3PpNSEjQ0KFD9dBDD+nIkSP64Ycf9Ne//lU33XSTJKlBgwb68ccf\nlZ2dXeLbBlD6CIkASsWVV16ps88+O+/12GOP6f7779fVV1+d96vfnj176tNPP9WpU6c0cuRITZgw\nQW3btlVSUpKmTJmikSNH5gUM/1WphIQELVy4UN99952aNm2q888/X2+99ZYk6dFHH9W6detUs2ZN\nDRo0SNdee23UV7M8Ho+efPJJ1ahRQ/Xq1dMtt9yirl27KiMjQ2eddVZeHX97hw4d0m9+8xvVqVNH\nzZo1U7169TRu3Li8esOHD9ekSZNUt25drV+/XnPmzJEkVa9eXUuWLNGbb76pxo0b67zzztMf//hH\nnTx50rENvzlz5igxMVE1a9bU3//+d82dOzdfvyX7i+yFCxdq8eLFOvfcc5WWlqbZs2erRYsWYduN\nNDbRLnvuuedUrVo1XXjhherdu7dGjBiRF6r79eunNm3aqGHDhqpfv37Y9gCUTR7Dk8UAUKJGjRql\nJk2a6E9/+lO8uwIARcaVRAAoYfzdG8CZgJAIACXsdPwfZwAgFLebAQAA4FAx3h0oC/gbPwAAOJ3E\n4hoft5v/j/+fv+BlX48++mjc+1AWX4wL48K4MCaMC+MS71esEBIBAADgQEgEAACAAyERrrxeb7y7\nUCYxLu4YF3eMixNj4o5xcce4xBe/bpb94QrDAAAATgexyi1cSQQAAIADIREAAAAOhEQAAAA4EBIB\nAADgQEgEAACAAyERAAAADoREAAAAOBASAQAA4EBIBAAAgAM
hEQAAAA6ERAAAADgQEgEAAOBASAQA\nAIADIREAAAAOhEQAAAA4EBIBAADgQEgEAACAAyERAAAADoREAAAAOBASAQAA4EBIBAAAgAMhEQAA\nAA6ERAAAADgQEgEAAOBASAQAAIADIREAAAAOhEQAAAA4EBIBAADgEJeQ+PLL0uzZdnrmTGn37sCy\nO++UNm+OfZ+mT7fvPl/+d/90aHlw/eAy/3zwtNvy4PZDtxPatltfQl/B+xHchlv/Q/sTXE+S0tLc\n++fGbV+Ctx2uPbdxDl0vXD23+YL4x2TwYDvfu3f+4xFcJy0t0J9wx8n/HnysQ/sVek4Fr1/Q/rht\nN1J74foarm8FbSvaOqHnuts6BX2m3NoLbcN/TArqX7h23bYfbr1oP5uhyyJtL5r5UOGOaaQ+hhNN\nf6NZN9yywn4eJbt/vXuHX9/teEUj2rpu3zdubRXncxNazz/t/x4qbJtF+d4rzLnp9mdFYY5DQZ/D\nSMv9f064facOHuz+mQ0+htEcTxReXELiXXdJI0fa6ddfl3btCix75RWpVavY92nBAvte0B9o/vfg\n+sFl/vngabflkf4gCm3brS+hr+D9CG7Drf+h/QmuJ0np6YX7AgytG7ztcO25jXPoeuHquc0XxD8m\ny5fb+c8+y388guukpwf6E80XXXA9t/1x63tRAkSk9sL1NVzfCtpWtHVCz3W3dQr6TLm1F9qG/5gU\n1L9o/1B361ukcXL7bIYui7S9aOZDhTumkfoYTjT9jWbdcMsK+3mU7P599ln49QsTTsKtV9D2o2mr\nOJ+b0Hr+af/3UGHbLMr3XmHOTbc/KwpzHAr6HEZa7v9zwu07dfly989s8DGM5nii8CrGYiOzZknT\npkkej9SundS8uXTOOVKzZtLatdKIEdLZZ0sZGdIVV9i6u3ZJjzxi1z92TMrOlrZskTIzpQcekI4c\nkerVs1ciGzaUvF6pRw97Mh04IL36qnTppdLGjdJtt0knT0q5udI//iElJcVirwEAAE5fpR4SN26U\nJk+WVq2S6tSR9u+Xnn3WBsZrr5Wef96Gwk6dbH2Px74GDbIvSRo2zIbAnBxpzBhp4UKpbl1p/nzp\noYdsIPR4pFOnpDVrpMWLpUmTpP/8R3rpJen++6Xhw+36OTnu/dy2baImTvT/TcYryVvaQwMAAFAg\nn88nX1Eu2RdTqYfEZcukoUNtQJSk2rWddYwJv/6TT9qrjKNHS19+aUPnr35ll506JTVqFKg7ZIh9\n79RJ2rbNTvfqZUPqjz/a5eGuIjZrZkPixIk2kMbhWAAAADh4vV55vd68+UmTJsVku6UeEj2eyCHQ\nX8fNhx/a28MrVth5Y6Q2bextaTdVqtj3hITAFcMbb7S3odPTpSuvtD+a6dOn8PsBAABQnpR6SOzb\n1/4y6Xe/s1cT9+2z5f7gWL26dOiQc70ffpDuvVdasiQQ/i6+WNqzR1q92ga/7Gzp22+l1q3Db3/L\nFunCC+1t6u3bpS++cA+Jqan23R/UgwK763RofX9Zhw52ulatwLTbcjf+tmrVcrbt1hc3/m342whd\nJ1x/g5elpBS8nUj98Y+Nn1t7buMcul64euG2G4nXa8fEfyW7a1f3/a9VS/ruu0B/3I5XaJ+Cj7Xb\n/kTT92jqRGrP7Xzzr+PWt4K2FW2dSNsOHadw56Fbe6H1UlKi61+kfXHbfrR9i/TZjHZcC3sOR/q+\nKOx3QzT9jWbdcMsK26Zk9+/UqfDrF7XP0dZ1+74pTFtF+Y70T4e7UFFQmyV17KL5Pi7Knz8FnQ+R\nlgf/ORH6nbphg/tnNvi7IprjicLzGFPQdb7imzVLeuope4WvY0f7g5Xq1W1w/Oc/pQcfDPxw5de/\nlp5+WvrXv6TnnpOaNLFtNG5srwZu2CDdd5908KC9Wjh2rHT77fZD53+2ce9eqVs3GxCnTpXmzJEq\nVZLOO0964438J5YkeTw
exWAYAAAAii1WuSUmIbGsIyQCAIDTRaxyC//jCgAAABwIiQAAAHAgJAIA\nAMCBkAgAAAAHQiIAAAAcCIkAAABwICQCAADAgZAIAAAAB0IiAAAAHAiJAAAAcCAkAgAAwIGQCAAA\nAAdCIgAAABwIiQAAAHAgJAIAAMCBkAgAAAAHQiIAAAAcCIkAAABwICQCAADAgZAIAAAAB0IiAAAA\nHAiJAAAAcCAkAgAAwIGQCAAAAAdCIgAAABwIiQAAAHAgJAIAAMChUCFx4kRp2rRS6gkAAADKjEKF\nRI+ntLoRnZyc+G6/vPP5ysa2S6MfPl989w+IN59Pmj49MB1P0W7fX2/6dCktLf/n2P/eu3dgv/x1\nQ9sP9/kPXq+0xXvMC+Ifo0jflWlpgbrB64Qrc5sPrhtuPlwZSl6BIXHyZOnii+0H7euvbdn330u/\n/rXUpYt02WWB8ltvle65R+rZU2re3B7EW26RWreWRo0KtDlvntSundS2rTRhQqD8gw+kzp2lDh2k\n/v1t2cSJ0siR0qWX2rZ++MFus3Nn+1q1KrD+E0/Ydjt0kB58UNqyxdbx+/bb/PMoHEIicOby+aQF\nCwLT8VTYkLhggZSe7h5KPvsssF/+utGGxOD1Slu8x7wg0YTE9PRA3eB1wpW5zQfXDTcfrgwlr2Kk\nhZmZ0vz50oYNUna21KmTDVl33SW99JKUlCStWWOD4dKldp0DB2xwe/996eqr7XTr1lLXrradc8+1\nwXDdOqlWLWnAAOm996RevaTf/EZauVK64ALbjt9XX0kffyxVqSL98ov0n//Y6W+/lYYPt18Cixfb\nbX76qVS1ql2/Vi2pZk273fbtpRkzpNtuK83hBAAAODNEDIkrV0pDhtjQVbWqDX3Hj0sZGdL11wfq\nnTxp3z0eadAgO52cLDVsKLVpY+fbtJG2bbMvr1eqW9eWjxghrVghJSTYK4QXXGDLa9UKtHn11TYU\n+reVlmaDX0KCDYqS9OGHNgBWrZp//TvusOHwL3+R3nrLBko3EydOzJv2er3yer2RhgYAACAmfD6f\nfHG4fBoxJHo8kjH5y3JzbQBbv959ncqV7XuFCoFg55/PyZEqVcpfP7R9N2efHZj+61+l886TZs+W\nTp0KhEK3vkrStddKkyZJffva2+O1a7tvIzgkAgAAlBWhF68mTZoUk+1GfCbxssvsMxnHj0uHD0sL\nF9rAlpgovfOOrWOM9N//Rrcxj0fq1k366CMpK8uGvDfftFcWe/SwVxS3bbN19+1zb+PQIXuFUpJm\nzbJtSPYZxhkz7O1oSdq/375XqSINHCiNHp3/uUgAAACEF/FKYseO0rBh9nm++vVtwPN4pLlzbej6\n85/ts4o33mh/MCLl/wW026+hGzaUpk6V+vSxATMlJXCL+u9/t7e3c3OlBg2kf//b2c4999irg7Nm\nSVdcIZ1zji0fOFD6/HN7tbByZemqq2z/JPvc4rvv2ucfUXTxvAMfvO3S6AdPF6C883oDj+nE+/MQ\n7fb99VJTpe++c/+e6NrVLvdLTbU/boxme8HrlbZ4j3lBoulfSkr+um7HI7Qdt3aLUgelw2NMNDd8\nT29PP22vhIa7OuvxeFQOhgEAAJwBYpVbIl5JPBMMHixt3SotWxbvngAAAJw+ysWVxIJwJREAAJwu\nYpVb+L+bAQAA4EBIBAAAgAMhEQAAAA6ERAAAADgQEgEAAOBASAQAAIADIREAAAAOhEQAAAA4EBIB\nAADgQEgEAACAAyERAAAADoREAAAAOBASAQAA4EBIBAAAgAMhEQAAAA6ERAAAADgQEgEAAOBASAQA\nAIADIREAAAAOhEQAAAA4EBIBAADgQEgEAACAAyERAAAADoREAAAAOBASAQAA4EBIBAAAgAMhEQAA\nAA6lFhKffVZq3VoaObJk2/V6pczMkm0TAAAA+ZVaSPzb36QPP5Rmzw6U5eQUv12Px77Ks
rQ0yedz\nXzZ9unt5uPrx5O+Tzxd4BZdHWicegrftnw433pHWDVcWzf4XVrTjeroLPTZpaZHrFKa9kqxbUm1M\nn+5+PhanzZJYr6Bz2u3z4ra9SN9xbuu4jUVonWi3XVj+dqdPz7+N4H0OruNf1rt3+L76z+HgcfAf\n8+C2Bg92jnu44xBpW6FCPz9ux7GgMR882FkWPBah54R/X9y279aX0OMZ7fni77vbmKSlBcbVP/7T\np9tjFdynSN+pZ/L3bGkolZB4993Sli3SFVdItWpJN98sXXqpdMst0t690nXXSd262VdGhl3n6FHp\nttuk7t2lTp2k99+35b/8It1wg70qOWSInfebN09q105q21aaMCFQfs450h/+ICUnS/37S6tXS5df\nLjVvLi1cWBp7nF96evgTccEC9/KyeOKeCSEx3HhHWjdcGSGx6EKPTXp65DqFaa8k65ZUGwsWnJ4h\n0e3z4ra9SN9xbutEExKj3XZh+dtdsCD/NoL3ObiOf9lnn0UObunp+cfBf8yD21q+vHRCYujnx+04\nFjTmy5c7y4LHIvSc8O+L2/bd+hJ6PKM9X/x9dxuT9PTAuPrHf8ECe6yC+0RILDmlEhJfeklq1Mge\njLFjpc2bpaVLpblzpfvus2Wffiq98450xx12ncmTpX79pDVrpGXLpHHjpGPH7BXJc86RNm2SJk0K\n3GretcsGw+XLpc8/tyfJe+/ZZceO2ba+/FKqXl165BHb5rvv2mkAAABEVrE0GzfGvl99tVSlip3+\n8EMbGv0OH7ZXEZcssVf5nn7alp84IW3fLq1cKd1/vy1r29ZeOTTGhkKvV6pb1y4bMUJasUK65hqp\ncmVp4MDAOlWrSgkJ9sritm3ufZ04cWLetNfrldfrLYERAAAAKB6fzydfHC6DlmpI9Dv77MC0MfZq\nYeXKznr//Kd00UXOcn/YDBb6XKIxgbJKlQLlFSoEtlWhQvjnIoNDIgAAQFkRevFq0qRJMdluzP8J\nnAED7C+f/TZssO8DB+YvX7/evl92mfTGG3b6yy+l//7XhsFu3aSPPpKysqRTp6Q337TPHQIAAKD4\nSu1KYvCVvuDpZ5+V7r1Xat/eXtW7/HLpxRelhx+Wfvtbezs5N1e68EL745XRo6VRo+wPV1q1krp0\nse00bChNnSr16WOvIqakSIMGObcXqS+lJSXF3gp3k5rqXl4W7277+xTat0h9jed+BG/bPx1uvCOt\nG64s3HgUh1ufz0Sh+7l3b+Q6hWmvJOuWVBupqVKHDpHXL2q/SnK90HPa7fPitl6k7zi3ddzO89A6\n0W67sPzthrbvts/+aa/XPkcf6TvQfw4Ht9Ohg/2xpn++du3ovj8jfQ+EG3+3OqF9Cbdcsn92hpYF\nj0Xo+PTpEygL3b5bX8KNd6R1/OsF9z14eUqKlJRkx7VxY1uWlGQvFLVv76xfkp+78spjjNvN3PLF\n4/GIYQAAAKeDWOUW/scVAAAAOBASAQAA4EBIBAAAgAMhEQAAAA6ERAAAADgQEgEAAOBASAQAAIAD\nIREAAAAOhEQAAAA4EBIBAADgQEgEAACAAyERAAAADoREAAAAOBASAQAA4EBIBAAAgAMhEQAAAA6E\nRAAAADgQEgEAAOBASAQAAIADIREAAAAOhEQAAAA4EBIBAADgQEgEAACAAyERAAAADoREAAAAOBAS\nAQAA4EBIBAAAgEPcQuI559j3Xbuk66+30zNnSmPGFK/d6dOlX34pXhsAAADlXdxCosdj3xs1kt5+\nO39ZUZ06JT3zjHTsWPHaAQAAKO/ifrt52zapbVs7bYy0Y4fUp4/UooX02GOBenPmSN27Sx07Snff\nLeXm2vJzzpF+/3upQwdpyhR7ZbJPH6lvX2nGDGns2EAbr7wi/e53kfvj80U/75/2+dzLw603fbpz\nuVtZUbltvzhtRNte8Hj430PHprgKOj6hZdFse/p0W
2/wYDuflhYoC203dB8L6l9ByyPVj6bvbv0p\nyrErTcHnQWH2P9zyeO1TYbbr9hko6c9lUZaH1i3M905B32GFWd9tPpp1ilsvUl238rS0ktl2pPO4\npM+PaMc53DEsieMSzfdaUbbj77P/+zn4e7qgY1XSfxaVF3EPiaE+/VT65z+l//7XXmHMzJQ2b5be\nekvKyJDWr5cqVJDmzrX1jx2TevSQPv9cevhhe2XS55OWLZOGDpUWLrRXGCV7O/v22yNvPxYhccEC\n53K3sqIiJEZeHmrBAltv+XI7n54eKAttl5BYeITEshsSC/O9U9B3WGHWd5uPZp3i1otU1608Pb1k\ntl0WQ2K4Y1iWQ6K/z/7v5+Dv6YKOFSGxaCrGuwOhBgyQate200OGSB9/LCUk2LDYpYst/+UXqWFD\nO52QIF17rXtb1arZK4oLF0otW0rZ2VKbNqW/DwAAAKe7MhcSgxkTeE7xllvs7eRQVatGfpbxjjuk\nyZOlVq2k224LX2/ixImS/H/b8Mrr9Ra12wAAACXG5/PJF4dLoWUuJP7nP9L+/Tb8vfeefa7wrLOk\na66xzxeee660b5905IjUtKlz/erVpUOHpDp17Hy3btKPP9rb1F98EX67/pA4caJEPgQAAGWF15v/\n4tWkSZNist24hcTgq3/+aY/Hhrprr7XBbuRIqVMnu+zPf7a3onNzpUqVpBdftCEx9Crib34jXXGF\n1LixtHSpLRs6VNqwQapZs/T3CwAA4EwQt5B46JB9b9bM/khFsreUb7nFvf7QofYVrh2/tDTnr5w+\n/rjgXzX7hV5FjDTvny5ondCy1FTncreyoiqJK6Fu+xntOuHGpSQUdqyj6UNqqv11/IYNdj4lRUpK\nsmWFbbeg7UXT/2jbCq4Trm9l4ap4cfaxoOMbS4XZbml9BooyXpHq1qpVtG0X5fuqMOd+YeoUpl6k\num7lKSkls+1I53FJniOF+byEO4YlcVyi+cwXZTv+Pvu/s2vVCnxPF3SsysJ34enIY4wx8e5EaTlw\nwP6zOR06SPPnh6/n8Xh0Bg8DAAA4g8Qqt5zRITFahEQAAHC6iFVuKXP/TiIAAADij5AIAAAAB0Ii\nAAAAHAiJAAAAcCAkAgAAwIGQCAAAAAdCIgAAABwIiQAAAHAgJAIAAMCBkAgAAAAHQiIAAAAcCIkA\nAABwICQCAADAgZAIAAAAB0IiAAAAHAiJAAAAcCAkAgAAwIGQCAAAAAdCIgAAABwIiQAAAHAgJAIA\nAMCBkAgAAAAHQiIAAAAcCIkAAABwICQCAADAgZAIAAAAB0IiAAAAHAiJAAAAcChTIXHiRGnatPDL\nN2yQFi8OzC9cKD3xRKl3Kyyfz75Cy9ymQ+uHTrvVl6S0NPftuPUl0ny4+uHqDR4cXbs+nzR9ev75\nSPtdWMFtF1c0fQitk5YWXdtpac6+BrcVvLygsfcvK+wxjabNwoimvtvxKeqxjrRuNJ+10POwsNt1\nW6e47YTh5FaQAAAdWklEQVTOu/XR/17cc704416abZZGv0pKWe5bUZ2J++RXlO/EYCX550l5UaZC\noscTefn69dKiRYH5QYOk8eNLt0+RxCIkpqfHJyQuXx5duz6ftGBB/vmSDInBbRdXUUJWenp0baen\nO/sa3Fbw8jMpJLodn3iFxNDzsLDbjUVIdOuj/7245zohsfDKct+K6kzcJ7/ihsSS/POkvIh7SJw8\nWbr4Yql3b+nrr21Znz5SZqad3rtXSkyUsrOlRx6R5s+XOnaU3npLmjlTGjPG1tuzR7ruOqlbN/vK\nyLDlH31k63fsKHXqJB05EvNdBAAAOO1UjOfGMzNt6NuwwYbATp2kzp3tstCripUqSX/6k13n2Wdt\n2euvB5bff780dqx0ySXS9u3SFVdImzbZ29cvvij17CkdOyZVqeLel4kTJ+ZNe71eeb3eEttPAACA\novL5fPLF4TJxX
EPiypXSkCFS1ar2dfXVkesbY19uPvxQ2rw5MH/4sHT0qA2NY8dKI0bYbTVu7L5+\ncEgEAAAoK0IvXk2aNCkm241rSPR43ENfxYrSqVN2+vjx6NoyRlqzRqpcOX/5+PFSSor0r3/ZwPjv\nf9vb2wAAAAgvriHxssukW2+V/vhHe7t54ULprrukZs3sbeWuXaV33gnUr1HDXiH0Cw6YAwbY29C/\n/72d//xzqUMH6fvvpTZt7Ouzz+xzjyUVEt3uSAeXhS4Pt8w/7dZeSop7eUF9KWidSNuU7HOh0bTr\n9Uq1akXebnHu3KemFn3dUEUZx5SU6NpOSZGSksK3Fby8oLEPt6wo/S/s+oWt73Z8inO8w60bzWet\noPOwpLZbmHZC54PHK/Q8KO65XhpPyJREm2X5yZ2y3LeiOhP3ya+wf86FKsk/T8oLjzHhbuDGxpQp\n9tnC+vWlCy6wzyVedZU0dKiUkGCn586VtmyR9u+XBg60gfKPf5R++SXwjGJWlnTvvfaWc06OdPnl\n9lnE++6zv9StUEFKTrY/dqlUKX8fPB6P4jwMAAAAUYlVbol7SCwLCIkAAOB0EavcEvd/AgcAAABl\nDyERAAAADoREAAAAOBASAQAA4EBIBAAAgAMhEQAAAA6ERAAAADgQEgEAAOBASAQAAIADIREAAAAO\nhEQAAAA4EBIBAADgQEgEAACAAyERAAAADoREAAAAOBASAQAA4EBIBAAAgAMhEQAAAA6ERAAAADgQ\nEgEAAOBASAQAAIADIREAAAAOhEQAAAA4EBIBAADgQEgEAACAAyERAAAADoREAAAAOBASAQAA4FCq\nIXHBAqlCBenrr0un/cxM6f77S6dtny/wCi2XpOnTS2e7pSHcPsRTWejDmeZ0H9PTvf+lafr0kv/O\n8beXlhZ4D/3O8/nyl/t8zn745ws6ftOnu3+nBm8r9Hu3sOdEpPrTp9t98fejsMKtE26//OMabp3g\n9+D2w7XnLx882L78y9PSAmX+dtz6Gnrcgtd3Kw/to9uxCT1O/rrt2gXOnenTbd/851Hw+Aef12lp\ngZd/3eD99K/rLw8+lv5p/3zoORnaBqJXqiFx3jwpJcW+l7ScHKlzZ+mZZ0q+bangkLhgQelstzQQ\nEsuH031MT/f+l6YFC0r+O8ffXnp64N0tJAaX+3zOfvjnCzp+CxbENyQuWGD3xd+Pwgq3Trj98o9r\nuHWC34PbD9eev3z5cvvyL09PD5T523Hra+hxC17frTy0j9GERH/dTZsC586CBbZv/vMoePyDz+v0\n9MDLv27wfvrX9ZcHH0v/tH8+9JwMbQPRK7WQeOSItGaN9Pzz0vz5tsznky6/XEpNlZo3lyZMkGbP\nlrp1s3/z2LLF1tuzR7ruOlverZuUkWHLJ06URo6ULr1Uuvlm6aOPpEGDAtsbNcq207699O67tvye\ne6SuXaXkZLs+AAAAClaxtBp+7z3piiukpk2lc8+V1q2z5f/9r/TVV1Lt2lJionTnndKnn0rPPis9\n95z017/aW8hjx0qXXCJt327b2bTJrv/VV9LHH0tVquT/29Kf/mTb/O9/7fyBA/Z98mRbfuqU9Ktf\nSV98IbVt6+zvxKAE6fV6JXlLdkAAAACKwOfzyReH2y2lFhLnzbNBT5Kuvz5w67lrV6lBA1uelCQN\nHGink5MDl8s//FDavDnQ1uHD0tGjkscjXX21DYihli4NXLGUpFq17Pv8+dIrr9jb07t327BZUEiU\nuPUFAADKBq/X+38XsKxJkybFZLulEhL37bOB78svbbA7dcq+X3VV/oBXoUJgvkIFG+QkyRh7q7py\nZWfbZ58dfrvG5J/fulWaNk1au1aqWdPejj5+vHj7BgAAUB6USkh85x37zODf/hYo83qlFSuiW3/A\nAHv7+fe/t/MbNtjnDCPp31964QV7u1qyt5sPHZKqVZNq1JB+/llavFjq0ye6PgQ
Fdtfy1NTo2ikL\nQvcl3L7FUlnow5nmdB/T073/pak0vm/8baakBN7dviv27s1f7r9LE9pOQccvNVXq0CH8crf1C3tO\nRKqfmip99529gxWpH4VtO9x++cc13DrB78Hth2vPX75hQ/76KSnSzp3R9TNY8Ppu5aF9jPTnSOg6\nrVsHzp2kJPvIV+PGtqxWrcD+BfcpuB/+dXfuDLTtX/e772x548aBY+nfjmTn/edo8LrBbSB6HmNC\nr78VX9++9kcpAwYEyp57zobGpCTp/fdtWZ8+9kpfp072RyjTptllWVnSvffaW845OfbHLi++KE2a\nJFWvLv3ud3b94HWOHrXrZGZKCQn2RyqpqfbqYUaGdP759uS4+mobYPMNgsejUhgGAACAEher3FIq\nIfF0Q0gEAACni1jlFv7HFQAAADgQEgEAAOBASAQAAIADIREAAAAOhEQAAAA4EBIBAADgQEgEAACA\nAyERAAAADoREAAAAOBASAQAA4EBIBAAAgAMhEQAAAA6ERAAAADgQEgEAAOBASAQAAIADIREAAAAO\nhEQAAAA4EBIBAADgQEgEAACAAyERAAAADoREAAAAOBASAQAA4EBIBAAAgAMhEQAAAA6ERAAAADgQ\nEgEAAOBASAQAAIBDXEOi1ytlZsazBwAAAHAT15Do8dhXacnNLb22UbJ8PvtCZGlp0Y1VpOXTp0eu\nU5rHIbjvxT3mPp8dj/KKz0t0SnKcGPPiK4kx5DjETsxC4tGj0lVXSR06SG3bSm+9lX/5PfdIXbtK\nycnSxIm27IMPpKFDA3V8PmnQIDu9ZInUq5fUubOtc/SoLW/WTJowwZZPnWrf/b79Nv88yg5CYnTS\n04sfEhcsiFzndAqJ6enF7tJpi89LdAiJZQsh8fQSs5D4wQdS48bS559LX3whXXFF/uWTJ0uffSZt\n2CB99JH05ZdS//7SmjXSL7/YOvPnSzfeKO3da+svXWpvV3fuLP3lL7aOxyPVq2fLH3xQqlnTtilJ\nM2ZIt90Wqz0GAAA4fVWM1YbatZN+/3t7lS8lRbr00vzL58+XXnlFysmRdu+WNm2yVxWvuEJ6/33p\n2mulRYukp5+Wli+3y3v1suuePBmYlqRhwwLTd9xhw+Ff/mKvXn72mXv/JvovX0ryer3yer0lst8A\nAADF4fP55IvDJdSYhcSLLpLWr5f+9S/p//0/qW/fwLKtW6Vp06S1a+2Vv1GjpOPH7bIbbpCef16q\nU8fejq5WzZb37y+98Yb7tvx1JBsuJ02y2+vSRapd232d4JAIAABQVoRevJo0aVJMthuz2827d0tV\nq0ojRtgriuvX23JjpEOHbLCrUUP6+Wdp8eLAepddJq1bZ68y3nCDLeveXfrkE+n77+380aP2eUM3\nVapIAwdKo0fb8AkAAICCxexK4hdfSOPGSRUqSJUrSy++aMOixyO1by917Ci1bCmdf37+W9EJCfb2\n9OuvS7Nm2bJzz5VmzrTPJ544YcsmT7ZXK90MHy69+640YECp7iKKgbv70UlJiW6sItVJTY1cpzSP\nRXDbxd2O12ufTy6v+MxEpyTHiTEvvpIYQ45D7HiMMSbenShtTz8tHT5sbzu78Xg8KgfDAAAAzgCx\nyi0xu5IYL4MH22cely2Ld08AAABOH+XiSmJBuJIIAABOF7HKLfzfzQAAAHAgJAIAAMCBkAgAAAAH\nQiIAAAAcCIkAAABwICQCAADAgZAIAAAAB0IiAAAAHAiJAAAAcCAkAgAAwIGQCAAAAAdCIgAAABwI\niQAAAHAgJAIAAMCBkAgAAAAHQiIAAAAcCIkAAABwICQCAADAgZAIAAAAB0IiAAAAHAiJAAAAcCAk\nAgAAwIGQCAAAAAdCIgAAABwIiQAAAHAgJAIAAMCBkAgAAAAHQiIAAAAcYhYSn3pKeu45Oz12rNSv\nn51etky66SZp3jypXTupbVtpwoTAeuecI/3
hD1JystS/v7R6tXT55VLz5tLChbbOqVPSuHFSt25S\n+/bS3/9uy30+yeuVrr9eatXKbgcAzlQ+n32FTserLwWZPr30th3cdnBf4jkm8eS236FlwedONG2F\nHr9o1i+J8Q/Xb7/p06PbXxQsZiHxssuklSvt9Nq10tGjUk6OLWvRwgbD5culzz+XPvtMeu89W/fY\nMRsov/xSql5deuQRGyzffddOS9Krr0q1akmffmpfr7wibdtml33+ufTMM9KmTdKWLdInn8RqjwEg\ntk63kLhgQeltO7htQmLphMTQ41dWQuKCBYTEkhKzkNipk5SZKR0+LFWtKvXsacPixx/bgNenj1S3\nrpSQII0YIa1YYderXFkaONBOt21r6yUk2CuL/iC4ZIk0a5bUsaPUo4e0b5/03XeSx2OvLjZqZKc7\ndAisAwAAgPAqxmpDlSpJiYnSzJlSr1721vKyZTbMNWtmA6SfMTbU+dfzq1DBhkb/dE5OYNnzz9vb\n0cF8PqlKlcB8QkL+dYJNnDgxb9rr9crr9RZuBwEAAEqBz+eTLw6XQmMWEiWpd2/p6aelGTPslcCx\nY6WuXe3Vvvvuk7Ky7FXFN9+089EaOFB68UV7lbFiRembb6QmTQrXt+CQCAAAUFaEXryaNGlSTLYb\n01839+4t/fSTvdVcv7501lm2rGFDaepUG/I6dJC6dJEGDbLr+K8o+gXP+6fvuENq3dre0m7bVho9\n2l4x9Hgirw8AAAB3Mb2S2LevdOJEYP7rrwPTN9xgX6EOHQpMP/qo+zKPR5o82b6CXX65ffn5f10N\nAGei4Kdk4v3ETDTbT00tvW3XquXel3iPS7y47XdomX++oDHyLw89ftGsXxLjH67ffqmp9oJTaWy7\nvPEYY0y8OxFvHo9HDAMAADgdxCq38I9pAwAAwIGQCAAAAAdCIgAAABwIiQAAAHAgJAIAAMCBkAgA\nAAAHQiIAAAAcCIkAAABwICQCAADAgZAIAAAAB0IiAAAAHAiJAAAAcCAkAgAAwIGQCAAAAAdCIgAA\nABwIiQAAAHAgJAIAAMCBkAgAAAAHQiIAAAAcCIkAAABwICQCAADAgZAIAAAAB0IiAAAAHAiJAAAA\ncCAkAgAAwIGQCAAAAAdCIgAAABwIiQAAAHAgJMKVz+eLdxfKJMbFHePijnFxYkzcMS7uGJf4IiTC\nFR9Md4yLO8bFHePixJi4Y1zcMS7xRUgEAACAAyERAAAADh5jjIl3J+LN4/HEuwsAAABRi0V8q1jq\nWzgNkJMBAADy43YzAAAAHAiJAAAAcCj3IfGDDz5Qy5YtddFFF+mJJ56Id3dK3I4dO9SnTx+1adNG\nycnJevbZZyVJ+/btU//+/dWiRQsNGDBABw4cyFvn8ccf10UXXaSWLVtqyZIleeWZmZlq27atLrro\nIt1///155SdOnNCwYcN00UUXqUePHvrhhx9it4PFcOrUKXXs2FGDBg2SxJhI0oEDB3TdddepVatW\nat26tdasWcO4yO5nmzZt1LZtWw0fPlwnTpwol+Ny2223qUGDBmrbtm1eWazG4fXXX1eLFi3UokUL\nzZo1q5T3tHDcxmXcuHFq1aqV2rdvryFDhujgwYN5y8rDuLiNid+0adNUoUIF7du3L6+sPIyJFH5c\nnnvuObVq1UrJyckaP358Xnncx8WUYzk5OaZ58+Zm69at5uTJk6Z9+/Zm06ZN8e5Widq9e7dZv369\nMcaYw4cPmxYtWphNmzaZcePGmSeeeMIYY8zUqVPN+PHjjTHGbNy40bRv396cPHnSbN261TRv3tzk\n5uYaY4zp2rWrWbNmjTHGmF//+tdm8eLFxhhjXnjhBTN69GhjjDFvvvmmGTZsWEz3saimTZtmhg8f\nbgYNGmSMMYyJMebmm282r776qjHGmOzsbHPgwIFyPy5bt241iYmJ5vjx48YYY4YOHWpmzpxZLsdl\nxYoVZt2
6dSY5OTmvLBbjkJWVZS688EKzf/9+s3///rzpssJtXJYsWWJOnTpljDFm/Pjx5W5c3MbE\nGGO2b99uBg4caJo1a2aysrKMMeVnTIxxH5dly5aZX/3qV+bkyZPGGGP+93//1xhTNsalXIfEjIwM\nM3DgwLz5xx9/3Dz++ONx7FHpu+aaa8x//vMfc/HFF5uffvrJGGOD5MUXX2yMMWbKlClm6tSpefUH\nDhxoVq1aZXbt2mVatmyZVz5v3jxz11135dVZvXq1McYGi3r16sVqd4psx44dpl+/fmbZsmUmJSXF\nGGPK/ZgcOHDAJCYmOsrL+7hkZWWZFi1amH379pns7GyTkpJilixZUm7HZevWrfn+gIvFOLzxxhvm\n7rvvzlvnrrvuMvPmzSulPSya0HEJ9s9//tOMGDHCGFO+xsVtTK677jqzYcOGfCGxPI2JMc5xuf76\n683SpUsd9crCuJTr2807d+7U+eefnzffpEkT7dy5M449Kl3btm3T+vXr1b17d/38889q0KCBJKlB\ngwb6+eefJUm7du1SkyZN8tbxj0loeePGjfPGKngcK1asqJo1a+a7jVAWjR07Vk899ZQqVAh8BMr7\nmGzdulXnnnuuRo0apU6dOunOO+/U0aNHy/241KlTRw888ICaNm2qRo0aqVatWurfv3+5Hxe/0h6H\nrKyssG2dLl577TVdeeWVksr3uLz33ntq0qSJ2rVrl6+8PI+JJH377bdasWKFevToIa/Xq7Vr10oq\nG+NSrkNiefr3EY8cOaJrr71WzzzzjKpXr55vmcfjKVdjkZ6ervr166tjx45h//mj8jYmkpSTk6N1\n69bpnnvu0bp161StWjVNnTo1X53yOC7ff/+9pk+frm3btmnXrl06cuSI5syZk69OeRwXN4yD0+TJ\nk1W5cmUNHz483l2Jq2PHjmnKlCmaNGlSXlm479/yJicnR/v379fq1av11FNPaejQofHuUp5yHRIb\nN26sHTt25M3v2LEjX9I+U2RnZ+vaa6/VyJEjlZqaKsn+jf+nn36SJO3evVv169eX5ByTH3/8UU2a\nNFHjxo31448/Osr962zfvl2SPdkPHjyoOnXqxGTfiiIjI0Pvv/++EhMTdeONN2rZsmUaOXJkuR4T\nyf7NskmTJuratask6brrrtO6devUsGHDcj0ua9euVa9evVS3bl1VrFhRQ4YM0apVq8r9uPiV9uem\nbt26p+139cyZM7Vo0SLNnTs3r6y8jsv333+vbdu2qX379kpMTNSPP/6ozp076+effy63Y+LXpEkT\nDRkyRJLUtWtXVahQQXv37i0b41KoG+lnmOzsbHPhhRearVu3mhMnTpyRP1zJzc01I0eONL/97W/z\nlY8bNy7vWYfHH3/c8VD1iRMnzJYtW8yFF16Y96Bst27dzOrVq01ubq7jQVn/sw7z5s0rsw/du/H5\nfHnPJDImxvTu3dt8/fXXxhhjHn30UTNu3LhyPy6ff/65adOmjTl27JjJzc01N998s3n++efL7biE\nPk8Vi3HIysoyiYmJZv/+/Wbfvn1502VJ6LgsXrzYtG7d2uzZsydfvfI0LpGe03T74Up5GBNjnOPy\n0ksvmUceecQYY8zXX39tzj//fGNM2RiXch0SjTFm0aJFpkWLFqZ58+ZmypQp8e5OiVu5cqXxeDym\nffv2pkOHDqZDhw5m8eLFJisry/Tr189cdNFFpn///vlOlsmTJ5vmzZubiy++2HzwwQd55WvXrjXJ\nycmmefPmZsyYMXnlx48fN9dff71JSkoy3bt3N1u3bo3lLhaLz+fL+3UzY2IDUZcuXUy7du3M4MGD\nzYEDBxgXY8wTTzxhWrdubZKTk83NN99sTp48WS7H5YYbbjDnnXeeqVSpkmnSpIl57bXXYjYOr732\nmklKSjJJSUlm5syZMdnfaIWOy6uvvmqSkpJM06ZN8753/b84NaZ8jIt/T
CpXrpx3rgRLTEzMC4nG\nlI8xMcZ9XE6ePGluuukmk5ycbDp16mSWL1+eVz/e48L/3QwAAACHcv1MIgAAANwREgEAAOBASAQA\nAIADIREAAAAOhEQA5cLYsWP1zDPP5M0PHDhQd955Z978Aw88oL/+9a9Fatvn82nQoEGuyz7++GN1\n795drVq1UqtWrfTKK6/kLduzZ4+6d++uzp076+OPP9bbb7+t1q1bq1+/foXuw5QpU4rUdwAIh5AI\noFy49NJLlZGRIUnKzc1VVlaWNm3alLd81apVuuSSS6JqKzc3N6p6P/30k0aMGKGXX35Zmzdv1scf\nf6yXX35ZixYtkiQtXbpU7dq1U2Zmpi699FK9+uqr+p//+R8tXbq0kHsnPf7444VeBwAiISQCKBd6\n9uypVatWSZI2btyo5ORkVa9eXQcOHNCJEye0efNmderUSUuXLlWnTp3Url073X777Tp58qQkqVmz\nZpowYYI6d+6st99+Wx988IFatWqlzp07691333Xd5gsvvKBRo0apQ4cOkqS6devqySef1NSpU7Vh\nwwaNHz9e7733njp27KjHHntMn3zyiW677Tb94Q9/0MaNG9WtWzd17NhR7du31/fffy9JmjNnjrp3\n766OHTvq7rvvVm5uriZMmKBffvlFHTt21MiRI2MwmgDKg4rx7gAAxEKjRo1UsWJF7dixQ6tWrVLP\nnj21c+dOrVq1SjVq1FC7du106tQpjRo1SsuWLVNSUpJuueUW/e1vf9P9998vj8ejevXqKTMzU8eP\nH1eLFi20fPlyNW/eXMOGDXP9P4s3bdqkW2+9NV9Z586dtXHjRrVv316PPfaYMjMz9eyzz0qSli9f\nrmnTpqlTp06677779Nvf/lbDhw9XTk6OcnJytHnzZr311lvKyMhQQkKC7rnnHs2dO1dTp07VCy+8\noPXr18diKAGUE1xJBFBu9OrVSxkZGcrIyFDPnj3Vs2dPZWRk5N1q/vrrr5WYmKikpCRJ0i233KIV\nK1bkrT9s2DBJ0ldffaXExEQ1b95cknTTTTcp3P9LEOn/KzD2f71yXdazZ09NmTJFTz75pLZt26aq\nVatq6dKlyszMVJcuXdSxY0ctW7ZMW7duLdJYAEBBCIkAyo1LLrlEn3zyib744gu1bdtWPXr0yAuN\nvXr1ctQ3xuS7QlitWjXXdsMFvdatWyszMzNfWWZmppKTkwvs64033qiFCxfqrLPO0pVXXqnly5dL\nssF1/fr1Wr9+vb766is98sgjBbYFAEVBSARQbvTq1Uvp6emqW7euPB6PateurQMHDmjVqlXq1auX\nWrRooW3btuU9/zd79mxdfvnljnZatmypbdu2acuWLZKkefPmuW7v3nvv1cyZM7VhwwZJUlZWliZM\nmKA//OEPBfZ169atSkxM1JgxY3TNNdfoiy++UL9+/fTOO+9oz549kqR9+/Zp+/btkqRKlSopJyen\n8IMCAGEQEgGUG8nJycrKylKPHj3yytq1a6datWqpTp06qlq1qmbMmKHrr79e7dq1U8WKFXX33XdL\nUr4rilWrVtXf//53XXXVVercubMaNGjg+kxiw4YNNWfOHN15551q1aqVLrnkEt1+++266qqr8tp0\nW0+S3nrrLSUnJ6tjx47auHGjbr75ZrVq1Up//vOfNWDAALVv314DBgzQTz/9JEn6zW9+o3bt2vHD\nFQAlxmMiPTADAACAcokriQAAAHAgJAIAAMCBkAgAAAAHQiIAAAAcCIkAAABwICQCAADAgZAIAAAA\nh/8PlbxZEtSLxOIAAAAASUVORK5CYII=\n" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "print text4\n", "figsize(10,6)\n", 
"text4.dispersion_plot([\"citizens\",\"democracy\",\"freedom\",\"liberty\",\"duties\",\"America\",\"slavery\",\"women\"])" ] }, { "cell_type": "markdown", "id": "34d6a401", "metadata": {}, "source": [ "# N-Gram Text Generation" ] }, { "cell_type": "markdown", "id": "73fb521e", "metadata": {}, "source": [ "For n-gram text generation, we estimate the conditional probabilities\n", "$P(x_n \\mid x_{n-1}, \\ldots, x_1)$ of a token given the tokens that precede it, and then sample from this distribution." ] }, { "cell_type": "code", "execution_count": 80, "id": "a35b9d84", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "In the mount Gilead . And Joseph answered and said , God hath caused\n", "me to Ephron the Hittite for a token of a tree to be gathered togeth\n", "water ye the sheep . And to every beast of the sons of Javan ; Elishah\n", ", and five years , and they fell before him , and your little ones . I\n", "will give according as he had done speaking , that the land which I\n", "have set for a witness between me and thee and thy flocks , and Admah\n", ", and with harp ? And they smote the\n" ] } ], "source": [ "print text3\n", "text3.generate()" ] }, { "cell_type": "markdown", "id": "8c328630", "metadata": {}, "source": [ "# Basic Text Access" ] }, { "cell_type": "markdown", "id": "a6760d1d", "metadata": {}, "source": [ "Texts in NLTK are represented as `Text` objects.\n", "A `Text` object behaves much like a list of words or tokens."
] }, { "cell_type": "code", "execution_count": 81, "id": "3995dd5f", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the', 'House', 'of', 'Representatives', ':', 'Among', 'the', 'vicissitudes', 'incident', 'to', 'life', 'no', 'event', 'could', 'have', 'filled', 'me', 'with', 'greater', 'anxieties', 'than', 'that']\n" ] } ], "source": [ "print list(text4)[:30]" ] }, { "cell_type": "code", "execution_count": 82, "id": "dc70ff4b", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "44764" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(text3)" ] }, { "cell_type": "code", "execution_count": 83, "id": "04d49aac", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[u'!', u\"'\", u'(', u')', u',', u',)', u'.', u'.)', u':', u';', u';)', u'?', u'?)', u'A', u'Abel', u'Abelmizraim', u'Abidah', u'Abide', u'Abimael', u'Abimelech', u'Abr', u'Abrah', u'Abraham', u'Abram', u'Accad', u'Achbor', u'Adah', u'Adam', u'Adbeel', u'Admah', u'Adullamite', u'After', u'Aholibamah', u'Ahuzzath', u'Ajah', u'Akan', u'All', u'Allonbachuth', u'Almighty', u'Almodad', u'Also', u'Alvah', u'Alvan', u'Am', u'Amal', u'Amalek', u'Amalekites', u'Ammon', u'Amorite', u'Amorites', u'Amraphel', u'An', u'Anah', u'Anamim', u'And', u'Aner', u'Angel', u'Appoint', u'Aram', u'Aran', u'Ararat', u'Arbah', u'Ard', u'Are', u'Areli', u'Arioch', u'Arise', u'Arkite', u'Arodi', u'Arphaxad', u'Art', u'Arvadite', u'As', u'Asenath', u'Ashbel', u'Asher', u'Ashkenaz', u'Ashteroth', u'Ask', u'Asshur', u'Asshurim', u'Assyr', u'Assyria', u'At', u'Atad', u'Avith', u'Baalhanan', u'Babel', u'Bashemath', u'Be', u'Because', u'Becher', u'Bedad', u'Beeri', u'Beerlahairoi', u'Beersheba', u'Behold', u'Bela', u'Belah', u'Benam']\n" ] } ], "source": [ "print sorted(set(text3))[:100]" ] }, { 
"cell_type": "code", "execution_count": 84, "id": "93a1e6b9", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "16.050197203298673" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(text3)*1.0/len(set(text3))" ] }, { "cell_type": "code", "execution_count": 85, "id": "6e685599", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text3.count(\"smote\")" ] }, { "cell_type": "code", "execution_count": 86, "id": "944416a5", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['Call', 'me', 'Ishmael', '.']" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sent1" ] }, { "cell_type": "code", "execution_count": 87, "id": "3dc3e82b", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['and', 'to', 'teach', 'them', 'by', 'what', 'name', 'a', 'whale', '-']" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text1[100:110]" ] }, { "cell_type": "code", "execution_count": 88, "id": "6e77d94a", "metadata": { "collapsed": false }, "outputs": [], "source": [ "from nltk import *" ] }, { "cell_type": "markdown", "id": "014759e1", "metadata": {}, "source": [ "# Frequency Distributions" ] }, { "cell_type": "markdown", "id": "66a8e2f6", "metadata": {}, "source": [ "Frequency distributions (`FreqDist`) are histograms: they count how many times each word occurs in a text." 
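] }, { "cell_type": "markdown", "id": "f3a91c07", "metadata": {}, "source": [ "As a minimal sketch on a toy token list (the list and the names `toy_tokens` and `toy_fdist` are made up for illustration), a `FreqDist` can be built from any sequence of tokens and indexed like a dictionary:" ] }, { "cell_type": "code", "execution_count": null, "id": "0b6e4d2a", "metadata": { "collapsed": false }, "outputs": [], "source": [ "from nltk import FreqDist\n", "\n", "# Hypothetical toy data, not one of the NLTK book texts.\n", "toy_tokens = [\"the\", \"cat\", \"sat\", \"on\", \"the\", \"mat\"]\n", "toy_fdist = FreqDist(toy_tokens)\n", "\n", "# Look up how many times \"the\" occurs in toy_tokens.\n", "toy_fdist[\"the\"]" 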
] }, { "cell_type": "code", "execution_count": 89, "id": "f6f0c0e3", "metadata": { "collapsed": false }, "outputs": [], "source": [ "fdist = FreqDist(text1)" ] }, { "cell_type": "code", "execution_count": 90, "id": "80c3de4d", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "471" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fdist['them']" ] }, { "cell_type": "code", "execution_count": 91, "id": "519d74fc", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAnQAAAFtCAYAAACOWPcBAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3X1wlfWB6PHvqYnrWoRIlIQmaCyciNHwohBz5167tDEB\n2W3Amy2ILS8t7u7AdrXq7qXdWys4o+DMenfUmrluG7uB7m3wMivQnRLCUGOrdzZpeZluG1tPtwjJ\nSUjlJWxEICLP/eMsR14SBExy8uR8PzNnSJ5znpPfk3mM3/k953meSBAEAZIkSQqtT6R6AJIkSfp4\nDDpJkqSQM+gkSZJCzqCTJEkKOYNOkiQp5Aw6SZKkkLtg0B0/fpw777yTKVOmUFRUxDe+8Q0ADh06\nRHl5OYWFhVRUVNDV1ZVcZ/Xq1USjUSZOnEhDQ0Ny+Y4dOyguLiYajfLQQw8ll584cYL58+cTjUYp\nLS1l7969yedqa2spLCyksLCQtWvX9ttGS5IkDScXDLqrrrqKV199ld27d/OLX/yCV199lddff501\na9ZQXl7OW2+9RVlZGWvWrAGgpaWF9evX09LSQn19PcuXL+f0Ze6WLVtGTU0NsViMWCxGfX09ADU1\nNWRnZxOLxXj44YdZsWIFkIjGJ554gubmZpqbm1m1atVZ4ShJkqSEjzzkevXVVwPQ09PDBx98wLXX\nXsvmzZtZvHgxAIsXL2bjxo0AbNq0iQULFpCZmUlBQQETJkygqamJjo4Ouru7KSkpAWDRokXJdc58\nr6qqKrZv3w7A1q1bqaioICsri6ysLMrLy5MRKEmSpA9lfNQLTp06xe23386///u/s2zZMm699VY6\nOzvJyckBICcnh87OTgDa29spLS1Nrpufn088HiczM5P8/Pzk8ry8POLxOADxeJxx48YlBpORwahR\nozh48CDt7e1nrXP6vc4UiUQud7slSZIG3UDdoOsjg+4Tn/gEu3fv5siRI8ycOZNXX331rOcjkUhK\nw8o7l+lirFy5kpUrV6Z6GAoJ9xddLPcVXYqB7KWLPst11KhR/PEf/zE7duwgJyeH/fv3A9DR0cGY\nMWOAxMxba2trcp22tjby8/PJy8ujra3tvOWn19m3bx8AJ0+e5MiRI2RnZ5/3Xq2trWfN2EmSJCnh\ngkF34MCB5IkIx44dY9u2bUydOpXKykpqa2uBxJmoc+fOBaCyspK6ujp6enrYs2cPsViMkpIScnNz\nGTlyJE1NTQRBwLp165gzZ05yndPvtWHDBsrKygCoqKigoaGBrq4uDh8+zLZt25g5c+bA/BYkSZJC\
n7IKHXDs6Oli8eDGnTp3i1KlTLFy4kLKyMqZOncq8efOoqamhoKCAl19+GYCioiLmzZtHUVERGRkZ\nVFdXJ6cXq6urWbJkCceOHWP27NnMmjULgKVLl7Jw4UKi0SjZ2dnU1dUBMHr0aB577DGmT58OwOOP\nP05WVtaA/SI0vM2YMSPVQ1CIuL/oYrmvaKiIBCH+EFokEvEzdJIkKRQGslu8U4QkSVLIGXSSJEkh\nZ9BJkiSFnEEnSZIUcgadJElSyBl0kiRJIWfQSZIkhZxBJ0mSFHIGnSRJUsgZdJIkSSFn0EmSJIWc\nQSdJkhRyBp0kSVLIGXSSJEkhZ9BJkiSFnEEnSZIUcgadJElSyBl0kiRJIWfQSZIkhZxBJ0mSFHIG\nnSRJUsgZdJIkSSFn0EmSJIWcQSdJkhRyBp0kSVLIGXSSJEkhZ9BJkiSFnEEnSZIUcgadJElSyBl0\nkiRJIWfQSZIkhZxBJ0mSFHIGnSRJUsgZdJIkSSFn0EmSJIWcQSdJkhRyBp0kSVLIGXSSJEkhl5Hq\nAUiSJPWnU6fg+HE4ehTee+/Dx+V+v2IF/MmfpHqrLsygkyRJg6a/Y6u3ZcePw1VXwdVXJx6f/OSH\nX/f2/ell113X+2smTkz1b+2jRYIgCFI9iMsViUQI8fAlSRpShmJsXWyMnfn9H/4hfGIIfqhsILvF\noJMkKQT6K7Yu9Jr+jq3e4muoxtZgMOj6YNBJkoaCwY6ti5mlupzvr7oqfWNrMBh0fTDoJEkf5XJj\n61KC7EKx1V/xZWyFn0HXB4NOksJtqMXW5caXsaWLYdD1waCTpIFzsbH1cT40/3Fi62JjzNjSUGHQ\n9cGgk5Suwh5bp783tpROUhZ0ra2tLFq0iN///vdEIhH+/M//nAcffJCVK1fy3e9+l+uvvx6Ap556\ninvuuQeA1atX89JLL3HFFVfw3HPPUVFRAcCOHTtYsmQJx48fZ/bs2Tz77LMAnDhxgkWLFrFz506y\ns7NZv349N954IwC1tbU8+eSTAHzzm99k0aJFZw/eoJM0BPUVW/1x+YdzY2ugPq9lbEn9L2VBt3//\nfvbv38+UKVN49913ueOOO9i4cSMvv/wy11xzDY888shZr29paeH+++/nZz/7GfF4nLvvvptYLEYk\nEqGkpIRvf/vblJSUMHv2bB588EFmzZpFdXU1v/zlL6murmb9+vW88sor1NXVcejQIaZPn86OHTsA\nuOOOO9ixYwdZWVmD8ouRlN6CALq74cCBDx/vvHP296cfBw8mQutSY+vjxJexJYXPQHbLBe8UkZub\nS25uLgAjRozglltuIR6PA/Q6oE2bNrFgwQIyMzMpKChgwoQJNDU1ceONN9Ld3U1JSQkAixYtYuPG\njcyaNYvNmzezatUqAKqqqvjqV78KwNatW6moqEgGXHl5OfX19dx33339tOmS0snx4+eHWF+Bdnr5\nH/wBXH994urx5z4+/ekPvx49GkaMMLYkpc5F3/rr7bffZteuXZSWlvLGG2/w/PPPs3btWqZNm8Yz\nzzxDVlYW7e3tlJaWJtfJz88nHo+TmZlJfn5+cnleXl4yDOPxOOPGjUsMJiODUaNGcfDgQdrb289a\n5/R7nWvlypXJr2fMmMGMGTMueuMlhdPJk3Do0KUF2vvv9x1nRUXnL8vOToSZJF2uxsZGGhsbB+Vn\nXVTQvfvuu/zpn/4pzz77LCNGjGDZsmV861vfAuCxxx7j0UcfpaamZkAH2pczg05S+AQBHDly4Zmy\ncx//8R+QldV7oI0bB7fffv7yESMgEkn11kpKJ+dONJ0+IjkQPjLo3n//faqqqvjSl77E3LlzARgz\nZkzy+QceeIDPf/7zQGLmrbW1NflcW1sb+fn55OXl0dbWdt7y0+vs27ePT33qU5w8eZIjR46QnZ1N\nXl7eWVXb2trK5z73uY+3tZIG3HvvXVqcHTyYuBVQX7Nn0ej5z
2VlwRVXpHpLJWnouGDQBUHA0qVL\nKSoq4mtf+1pyeUdHB2PHjgXglVdeobi4GIDKykruv/9+HnnkEeLxOLFYjJKSEiKRCCNHjqSpqYmS\nkhLWrVvHgw8+mFyntraW0tJSNmzYQFlZGQAVFRX87d/+LV1dXQRBwLZt23j66acH5JcgqXfvv58I\nrksJtA8+SARYb4FWXHz+8uxsuPLKVG+pJIXbBYPujTfe4Pvf/z6TJk1i6tSpQOISJT/4wQ/YvXs3\nkUiEm266iRdffBGAoqIi5s2bR1FRERkZGVRXVxP5z2Mc1dXVLFmyhGPHjjF79mxmzZoFwNKlS1m4\ncCHRaJTs7Gzq6uoAGD16NI899hjTp08H4PHHHz/rDFdJl+bUqcShzY86EeDMx7vvJj7w31uc3XQT\nlJScv/zqqz20KUmDzQsLS8NATw/E47BvH+zdm/j39KOtLRFrhw4lPkfW16HN6647/7lRozxbU5L6\ni3eK6INBp3QQBHD4cO+xdvpx4ACMHQs33AA33pj49/QjPx/GjEnMtGVmpnprJCl9GXR9MOg0HFxo\ndu30IzPzw0A7N9huuCERc54kIElDm0HXB4NOQ92Zs2t9BduZs2u9Rdu4cTByZKq3RJL0cRl0fTDo\nlGpnzq71FWxnzq71FmzOrklSejDo+mDQaaAdPQqxWN/B5uyaJOliGXR9MOjUH4IAOjrg178+/3Hg\nAIwfDwUFvUebs2uSpItl0PXBoNOlOHECfvvb86PtN79JXDtt4sQPHzffnPj3hhsMNklS/zDo+mDQ\nqTcHDpwfbL/+NbS2Jmbazgy30/F27bWpHrUkabgz6Ppg0KWvkyfh7bd7P0x68iTccsv54fbpT3sd\nNklS6hh0fTDo0sN778G//Rvs3Jl47NoFLS2Qm/vhodEzH2PGeOspSdLQY9D1waAbfo4cgd27Pwy3\nnTvhd79LhNrttyceU6fCpEnwyU+merSSJF08g64PBl24vfPO2eG2cyfs35+ItalTPwy4W2+FK69M\n9WglSfp4DLo+GHThsX8/NDefHXDd3R+G2+l/b77Zs0olScOTQdcHg25oOnUK3nwT3ngDXn898e+h\nQ1BSAnfc8eHM2003+Vk3SVL6MOj6YNANDceOwc9+lgi3N96A//f/EpcB+a//NfH4b/8tcdbpJz6R\n6pFKkpQ6Bl0fDLrUeOeds2fffvELKCpKhNvpiBs7NtWjlCRpaDHo+mDQDY6jR+G112DbtsSjrQ1K\nSz+cfSsp8YxTSZI+ikHXB4NuYHzwAezY8WHA/fznMG0a3H03lJcnvvbEBUmSLo1B1weDrv/s2QMN\nDYmA+/GP4VOfSsRbeTl85jMwYkSqRyhJUrgZdH0w6C7fqVOJExk2bYLNmxOfi6uoSATc3Xcngk6S\nJPUfg64PBt2lOXYMtm9PBNwPfwijR0NlJcyZk/gcnGehSpI0cAy6Phh0H+3IkcQs3MaNiZibMiUR\ncJWVMGFCqkcnSVL6MOj6YND17uhR+Jd/gbq6xOfhZsyAqiqYPRuuuy7Vo5MkKT0ZdH0w6D50/DjU\n1ycibssW+C//Be67D+bOhaysVI9OkiQZdH1I96ALgsSJDd/7Hrz8cuKm9vfdB//9v8P116d6dJIk\n6UwD2S0ZA/KuGlD798O6dfCP/wg9PbBkSeKG9zfckOqRSZKkVDDoQiII4NVX4YUXEp+Lu/de+N//\nO3GnBm9wL0lSejPohrgjR2DtWqiuTtyd4S//MjEzd801qR6ZJEkaKgy6IWrPHvhf/wv+6Z8SF/t9\n8UW46y5n4yRJ0vm8lOwQs2sX3H8/TJ+euN3WL38J69cnbr9lzEmSpN4YdENAECQu+jtzJnz+83D7\n7fC738Hq1d6CS5IkfTQPuaZQEMA//3Mi3I4ehf/xP+CLX4Qrr0z1yCRJUpgYdCny4x/D178OH3wA\njz8Of/In3ktVkiRdHoNuk
O3enQi53/4WnnwSvvAFQ06SJH08psQgeftt+NKX4J57Ep+Ta2mB+fON\nOUmS9PGZEwOspweeegqmTYNoFGKxxLXk/JycJEnqLx5yHUCvvgrLl8OECfDzn0NBQapHJEmShiOD\nbgB0dsJf/zW89ho89xzMmeM15CRJ0sDxkGs/W78eJk2C3NzE5+TmzjXmJEnSwHKGrp8cPJj4bNzu\n3fDDH0JJSapHJEmS0oUzdP1g2zaYPBnGjk3cusuYkyRJg8kZuo/h5MnERYFra2HtWvjc51I9IkmS\nlI4MusvU2Qnz5sEf/AHs3AljxqR6RJIkKV15yPUy/OIXcOed8Ed/BPX1xpwkSUotZ+gu0ebN8MAD\nicuR3HdfqkcjSZJk0F2S6urE/Vf/5V888UGSJA0dFzzk2traymc/+1luvfVWbrvtNp577jkADh06\nRHl5OYWFhVRUVNDV1ZVcZ/Xq1USjUSZOnEhDQ0Ny+Y4dOyguLiYajfLQQw8ll584cYL58+cTjUYp\nLS1l7969yedqa2spLCyksLCQtWvX9ttGX46nn4ZnnoHXXzfmJEnS0HLBoMvMzOTv//7v+dWvfsW/\n/uu/8sILL/Dmm2+yZs0aysvLeeuttygrK2PNmjUAtLS0sH79elpaWqivr2f58uUEQQDAsmXLqKmp\nIRaLEYvFqK+vB6Cmpobs7GxisRgPP/wwK1asABLR+MQTT9Dc3ExzczOrVq06KxwHSxDA//yfiTNZ\nf/ITuOmmQR+CJEnSBV0w6HJzc5kyZQoAI0aM4JZbbiEej7N582YWL14MwOLFi9m4cSMAmzZtYsGC\nBWRmZlJQUMCECRNoamqio6OD7u5uSv5zamvRokXJdc58r6qqKrZv3w7A1q1bqaioICsri6ysLMrL\ny5MROJi+9S340Y8SMZeXN+g/XpIk6SNd9Fmub7/9Nrt27eLOO++ks7OTnJwcAHJycujs7ASgvb2d\n/Pz85Dr5+fnE4/Hzlufl5RGPxwGIx+OMGzcOgIyMDEaNGsXBgwf7fK/B9Hd/Bxs2QEMDXHfdoP5o\nSZKki3ZRJ0W8++67VFVV8eyzz3LNNdec9VwkEiGSwpuVrly5Mvn1jBkzmDFjRr+873e+Ay+8AD/9\nKVx/fb+8pSRJSiONjY00NjYOys/6yKB7//33qaqqYuHChcydOxdIzMrt37+f3NxcOjo6GPOfF2LL\ny8ujtbU1uW5bWxv5+fnk5eXR1tZ23vLT6+zbt49PfepTnDx5kiNHjpCdnU1eXt5Zv4TW1lY+18ut\nGM4Muv6yeTOsXAmvvQZnTBJKkiRdtHMnmlatWjVgP+uCh1yDIGDp0qUUFRXxta99Lbm8srKS2tpa\nIHEm6unQq6yspK6ujp6eHvbs2UMsFqOkpITc3FxGjhxJU1MTQRCwbt065syZc957bdiwgbKyMgAq\nKipoaGigq6uLw4cPs23bNmbOnNn/v4Fz/OIXievMvfIKTJgw4D9OkiTpY4sEp09D7cXrr7/OZz7z\nGSZNmpQ8rLp69WpKSkqYN28e+/bto6CggJdffpmsrCwAnnrqKV566SUyMjJ49tlnkxG2Y8cOlixZ\nwrFjx5g9e3byEignTpxg4cKF7Nq1i+zsbOrq6igoKADge9/7Hk899RQA3/zmN5MnTyQHH4lwgeFf\nst//PnFJkqefhvnz++1tJUmS+r1bznrvCwXdUNefv5hTp2DmzETQPflkv7ylJElS0kAGnfdy/U9r\n1sCJEzCAh7clSZIGhLf+At54I3Fv1p//HDL8jUiSpJBJ+xm6996DJUvgxRc9o1WSJIVT2n+G7q//\nGuJx+MEP+mlQkiRJvRjIz9Cl9QHG5mb4/vfh3/4t1SORJEm6fGl7yDUI4MEHEydDeCcISZIUZmkb\ndHV18P77sGhRqkciSZL08aTlZ+iOHYOJE2HtWvijPxqAgUmSJJ3D69D1s+eegzvuMOYkSdL
wkHYz\ndN3dMH48vPYa3HLLAA1MkiTpHM7Q9aPqavjc54w5SZI0fKTVDN3Ro/DpT8OPfwy33jqAA5MkSTqH\nM3T95B/+Ae66y5iTJEnDS9rM0J06BdEo/NM/QWnpAA9MkiTpHM7Q9YP6esjKgjvvTPVIJEmS+lfa\nBF11NfzlX0IkkuqRSJIk9a+0OOT6zjuJw63xOHzyk4MwMEmSpHN4yPVj+r//F2bPNuYkSdLwlBZB\n93/+D9x/f6pHIUmSNDCG/SHXvXsTt/lqb4crrxykgUmSJJ3DQ64fwz//M8yda8xJkqTha9gH3Q9/\nCJWVqR6FJEnSwBnWh1y7uuCGG2D/frj66kEcmCRJ0jk85HqZtm5N3OrLmJMkScPZsA66H/4QPv/5\nVI9CkiRpYA3bQ66nTkFODuzcCePGDfLAJEmSzuEh18vw5pswcqQxJ0mShr9hG3Q/+Ql85jOpHoUk\nSdLAM+gkSZJCblgGXRAYdJIkKX0My6Brb4eeHvj0p1M9EkmSpIE3LINu587E/VsjkVSPRJIkaeAN\ny6DbsSMRdJIkSelgWAbdzp1w++2pHoUkSdLgMOgkSZJCbtgFXVcXHDkCBQWpHokkSdLgGHZB95vf\nwM03e0KEJElKH8Mu6H79a5g4MdWjkCRJGjwGnSRJUsgZdJIkSSE3LIPu5ptTPQpJkqTBEwmCIEj1\nIC5XJBLhzOGfOgWf/CQcOJD4V5Ikaag4t1v607Caofv972HECGNOkiSll2EVdHv3wo03pnoUkiRJ\ng8ugkyRJCjmDTpIkKeQMOkmSpJC7YNB95StfIScnh+Li4uSylStXkp+fz9SpU5k6dSpbtmxJPrd6\n9Wqi0SgTJ06koaEhuXzHjh0UFxcTjUZ56KGHkstPnDjB/PnziUajlJaWsnfv3uRztbW1FBYWUlhY\nyNq1ay9qYww6SZKUji4YdF/+8pepr68/a1kkEuGRRx5h165d7Nq1i3vuuQeAlpYW1q9fT0tLC/X1\n9Sxfvjx5au6yZcuoqakhFosRi8WS71lTU0N2djaxWIyHH36YFStWAHDo0CGeeOIJmpubaW5uZtWq\nVXR1dX3kxuzbZ9BJkqT0c8Ggu+uuu7j22mvPW97bNVQ2bdrEggULyMzMpKCggAkTJtDU1ERHRwfd\n3d2UlJQAsGjRIjZu3AjA5s2bWbx4MQBVVVVs374dgK1bt1JRUUFWVhZZWVmUl5efF5a9aW+HvLyP\nfJkkSdKwknE5Kz3//POsXbuWadOm8cwzz5CVlUV7ezulpaXJ1+Tn5xOPx8nMzCQ/Pz+5PC8vj3g8\nDkA8HmfcuHGJgWRkMGrUKA4ePEh7e/tZ65x+r96sXLkSSFxU+PDhGWRnz7icTZIkSepXjY2NNDY2\nDsrPuuSgW7ZsGd/61rcAeOyxx3j00Uepqanp94FdrNNBF4/Dd78LV1yRsqFIkiQlzZgxgxkzZiS/\nX7Vq1YD9rEs+y3XMmDFEIhEikQgPPPAAzc3NQGLmrbW1Nfm6trY28vPzycvLo62t7bzlp9fZt28f\nACdPnuTIkSNkZ2ef916tra1nzdj1Zv9+yM291K2RJEkKv0sOuo6OjuTXr7zySvIM2MrKSurq6ujp\n6WHPnj3EYjFKSkrIzc1l5MiRNDU1EQQB69atY86cOcl1amtrAdiwYQNlZWUAVFRU0NDQQFdXF4cP\nH2bbtm3MnDnzguMy6CRJUrq64CHXBQsW8Nprr3HgwAHGjRvHqlWraGxsZPfu3UQiEW666SZefPFF\nAIqKipg3bx5FRUVkZGRQXV1NJBIBoLq6miVLlnDs2DFmz57NrFmzAFi6dCkLFy4kGo2SnZ1NXV0d\nAKNHj+axxx5j+vTpADz++ONkZWVdcEMMOkmSlK4iQW+nrIZEJBJJnnH75JNw9Cg89VSKByVJktSL\nM7ulvw2bO0U4QydJktKVQSdJkhRywyrocnJSPQpJkqT
BN2yC7uBBuO66VI9CkiRp8A2roMvOTvUo\nJEmSBt+wOMs1CODKKxNnuV55ZapHJUmSdD7Pcv0I//EfcNVVxpwkSUpPwyLoDh3ycKskSUpfwyLo\n/PycJElKZ8Mm6EaPTvUoJEmSUmPYBJ0zdJIkKV0Ni6Dr6oKsrFSPQpIkKTWGRdB1d8M116R6FJIk\nSakxLILu3XcNOkmSlL6GRdA5QydJktKZQSdJkhRywyLo3n0XRoxI9SgkSZJSY1gEnTN0kiQpnRl0\nkiRJIWfQSZIkhdywCDovWyJJktLZsAi67m5PipAkSelr2ASdM3SSJCldhT7oTp6EEyfg6qtTPRJJ\nkqTUCH3QHT2aONwaiaR6JJIkSakR+qDzosKSJCndhT7o3nvPw62SJCm9hT7ojh2DP/zDVI9CkiQp\ndQw6SZKkkBsWQXfVVakehSRJUuoMi6Bzhk6SJKUzg06SJCnkDDpJkqSQM+gkSZJCLvRBd/y4QSdJ\nktJb6IPOs1wlSVK6GxZB5wydJElKZwadJElSyBl0kiRJIRf6oDt+3M/QSZKk9Bb6oHv/fcjMTPUo\nJEmSUsegkyRJCjmDTpIkKeQMOkmSpJAz6CRJkkLOoJMkSQo5g06SJCnkLhh0X/nKV8jJyaG4uDi5\n7NChQ5SXl1NYWEhFRQVdXV3J51avXk00GmXixIk0NDQkl+/YsYPi4mKi0SgPPfRQcvmJEyeYP38+\n0WiU0tJS9u7dm3yutraWwsJCCgsLWbt2bZ9jNOgkSVK6u2DQffnLX6a+vv6sZWvWrKG8vJy33nqL\nsrIy1qxZA0BLSwvr16+npaWF+vp6li9fThAEACxbtoyamhpisRixWCz5njU1NWRnZxOLxXj44YdZ\nsWIFkIjGJ554gubmZpqbm1m1atVZ4Xgmg06SJKW7CwbdXXfdxbXXXnvWss2bN7N48WIAFi9ezMaN\nGwHYtGkTCxYsIDMzk4KCAiZMmEBTUxMdHR10d3dTUlICwKJFi5LrnPleVVVVbN++HYCtW7dSUVFB\nVlYWWVlZlJeXnxeWpxl0kiQp3WVc6gqdnZ3k5OQAkJOTQ2dnJwDt7e2UlpYmX5efn088HiczM5P8\n/Pzk8ry8POLxOADxeJxx48YlBpKRwahRozh48CDt7e1nrXP6vXrT2rqSl16CLVtgxowZzJgx41I3\nSZIkqd81NjbS2Ng4KD/rkoPuTJFIhEgk0l9juSzZ2Sv56lfhjI/5SZIkpdy5E02rVq0asJ91yWe5\n5uTksH//fgA6OjoYM2YMkJh5a21tTb6ura2N/Px88vLyaGtrO2/56XX27dsHwMmTJzly5AjZ2dnn\nvVdra+tZM3Zn8pCrJElKd5ccdJWVldTW1gKJM1Hnzp2bXF5XV0dPTw979uwhFotRUlJCbm4uI0eO\npKmpiSAIWLduHXPmzDnvvTZs2EBZWRkAFRUVNDQ00NXVxeHDh9m2bRszZ87sdTwGnSRJSncXPOS6\nYMECXnvtNQ4cOMC4ceN44okn+PrXv868efOoqamhoKCAl19+GYCioiLmzZtHUVERGRkZVFdXJw/H\nVldXs2TJEo4dO8bs2bOZNWsWAEuXLmXhwoVEo1Gys7Opq6sDYPTo0Tz22GNMnz4dgMcff5ysrKxe\nx2jQSZKkdBcJTl9bJIQikQg5OQG7dsHYsakejSRJUt8ikQgDlV3eKUKSJCnkDDpJkqSQM+gkSZJC\nzqCTJEkKudAH3QcfQMbHujyyJElSuIU+6DIyIMU3q5AkSUqp0Aedh1slSVK6M+gkSZJCLvRBd8UV\nqR6BJElSahl0kiRJIRf6oPtE6LdAkiTp4wl9DjlDJ0mS0p1BJ0mSFHKhDzoPuUqSpHQX+hxyhk6S\nJKU7g06SJCnkDDpJkqSQC33Q+Rk6SZKU7kKfQ87QSZKkdGfQSZIkhVzog85DrpIkKd2FPoecoZMk\nSenOoJPSKNmAAAA
K3UlEQVQkSQq50Aedh1wlSVK6C30OOUMnSZLSnUEnSZIUcqEPOg+5SpKkdBf6\nHHKGTpIkpTuDTpIkKeRCH3QecpUkSeku9DnkDJ0kSUp3Bp0kSVLIGXSSJEkhF/qg8zN0kiQp3YU+\nhww6SZKU7kKfQ5FIqkcgSZKUWgadJElSyBl0kiRJIWfQSZIkhZxBJ0mSFHIGnSRJUsgZdJIkSSFn\n0EmSJIWcQSdJkhRyBp0kSVLIGXSSJEkhZ9BJkiSF3GUHXUFBAZMmTWLq1KmUlJQAcOjQIcrLyyks\nLKSiooKurq7k61evXk00GmXixIk0NDQkl+/YsYPi4mKi0SgPPfRQcvmJEyeYP38+0WiU0tJS9u7d\n2+s4DDpJkpTuLjvoIpEIjY2N7Nq1i+bmZgDWrFlDeXk5b731FmVlZaxZswaAlpYW1q9fT0tLC/X1\n9SxfvpwgCABYtmwZNTU1xGIxYrEY9fX1ANTU1JCdnU0sFuPhhx9mxYoVfYzjcrdAkiRpePhYh1xP\nR9lpmzdvZvHixQAsXryYjRs3ArBp0yYWLFhAZmYmBQUFTJgwgaamJjo6Ouju7k7O8C1atCi5zpnv\nVVVVxfbt23sdg0EnSZLSXcblrhiJRLj77ru54oor+Iu/+Av+7M/+jM7OTnJycgDIycmhs7MTgPb2\ndkpLS5Pr5ufnE4/HyczMJD8/P7k8Ly+PeDwOQDweZ9y4cYlBZmQwatQoDh06xOjRo88ax86dK1m5\nMvH1jBkzmDFjxuVukiRJUr9pbGyksbFxUH7WZQfdG2+8wdixY3nnnXcoLy9n4sSJZz0fiUSIDML0\n2bRpHwadJEnSUHHuRNOqVasG7Gdd9iHXsWPHAnD99ddz77330tzcTE5ODvv37wego6ODMWPGAImZ\nt9bW1uS6bW1t5Ofnk5eXR1tb23nLT6+zb98+AE6ePMmRI0fOm50DD7lKkiRdVtC99957dHd3A3D0\n6FEaGhooLi6msrKS2tpaAGpra5k7dy4AlZWV1NXV0dPTw549e4jFYpSUlJCbm8vIkSNpamoiCALW\nrVvHnDlzkuucfq8NGzZQVlb2sTdWkiRpOLqsQ66dnZ3ce++9QGL27Itf/CIVFRVMmzaNefPmUVNT\nQ0FBAS+//DIARUVFzJs3j6KiIjIyMqiurk4ejq2urmbJkiUcO3aM2bNnM2vWLACWLl3KwoULiUaj\nZGdnU1dX1+tYnKGTJEnpLhKce6pqiEQiEZYvD3jhhVSPRJIk6cIikch5VwjpL94pQpIkKeQMOkmS\npJAz6CRJkkLOoJMkSQo5g06SJCnkDDpJkqSQM+gkSZJCzqCTJEkKOYNOkiQp5Aw6SZKkkDPoJEmS\nQs6gkyRJCjmDTpIkKeQMOkmSpJAz6CRJkkLOoJMkSQo5g06SJCnkDDpJkqSQM+gkSZJCzqCTJEkK\nOYNOkiQp5EIfdJIkSeku9EHnDJ0kSUp3Bp0kSVLIGXSSJEkhZ9BJkiSFnEEnSZIUcgadJElSyBl0\nkiRJIWfQSZIkhVzog66iItUjkCRJSq1IEARBqgdxuSKRCCEeviRJSiMD2S2hn6GTJElKdwadJElS\nyBl0kiRJIWfQSZIkhZxBJ0mSFHIGnSRJUsgZdJIkSSFn0EmSJIWcQSdJkhRyBp0kSVLIGXSSJEkh\nZ9BJkiSFnEEnSZIUcgadJElSyBl0SguNjY2pHoJCxP1FF8t9RUPFkA66+vp6Jk6cSDQa5emnn071\ncBRi/tHVpXB/0cVyX9FQMWSD7oMPPuCrX/0q9fX1tLS08IMf/IA333wz1cOSJEkacoZs0DU3NzNh\nwgQKCgrIzMzkvvvuY9OmTakeliRJ0pATCYIgSPUgerNhwwa2bt3Kd77zHQC+//3v09TUxPPPP598\nTSQSSdXwJEmSLtlAZVfGgLxrP7iYWBuiLSpJkjSohuwh17y8PFpbW5Pft7a2kp+fn
8IRSZIkDU1D\nNuimTZtGLBbj7bffpqenh/Xr11NZWZnqYUmSJA05Q/aQa0ZGBt/+9reZOXMmH3zwAUuXLuWWW25J\n9bAkSZKGnCE7Qwdwzz338Jvf/Ibf/va3fOMb3zjrOa9RJ4CCggImTZrE1KlTKSkpAeDQoUOUl5dT\nWFhIRUUFXV1dydevXr2aaDTKxIkTaWhoSC7fsWMHxcXFRKNRHnrooUHfDvW/r3zlK+Tk5FBcXJxc\n1p/7xokTJ5g/fz7RaJTS0lL27t07OBumAdHb/rJy5Ury8/OZOnUqU6dOZcuWLcnn3F/SV2trK5/9\n7Ge59dZbue2223juueeAIfD3JQihkydPBuPHjw/27NkT9PT0BJMnTw5aWlpSPSylQEFBQXDw4MGz\nlv3N3/xN8PTTTwdBEARr1qwJVqxYEQRBEPzqV78KJk+eHPT09AR79uwJxo8fH5w6dSoIgiCYPn16\n0NTUFARBENxzzz3Bli1bBnErNBB+8pOfBDt37gxuu+225LL+3DdeeOGFYNmyZUEQBEFdXV0wf/78\nQds29b/e9peVK1cGzzzzzHmvdX9Jbx0dHcGuXbuCIAiC7u7uoLCwMGhpaUn535chPUPXF69RpzMF\n55ztvHnzZhYvXgzA4sWL2bhxIwCbNm1iwYIFZGZmUlBQwIQJE2hqaqKjo4Pu7u7kDN+iRYuS6yi8\n7rrrLq699tqzlvXnvnHme1VVVbF9+/bB2jQNgN72F+j9agruL+ktNzeXKVOmADBixAhuueUW4vF4\nyv++hDLo4vE448aNS36fn59PPB5P4YiUKpFIhLvvvptp06Ylr1nY2dlJTk4OADk5OXR2dgLQ3t5+\n1pnSp/ebc5fn5eW5Pw1T/blvnPl3KCMjg1GjRnHo0KHB2hQNkueff57JkyezdOnS5CE09xed9vbb\nb7Nr1y7uvPPOlP99CWXQeUFhnfbGG2+wa9cutmzZwgsvvMBPf/rTs56PRCLuL+qV+4Y+yrJly9iz\nZw+7d+9m7NixPProo6kekoaQd999l6qqKp599lmuueaas55Lxd+XUAad16jTaWPHjgXg+uuv5957\n76W5uZmcnBz2798PQEdHB2PGjAHO32/a2trIz88nLy+Ptra2s5bn5eUN4lZosPTHvnH6b01eXh77\n9u0D4OTJkxw5coTRo0cP1qZoEIwZMyb5P+YHHniA5uZmwP1F8P7771NVVcXChQuZO3cukPq/L6EM\nOq9RJ4D33nuP7u5uAI4ePUpDQwPFxcVUVlZSW1sLQG1tbfI/tsrKSurq6ujp6WHPnj3EYjFKSkrI\nzc1l5MiRNDU1EQQB69atS66j4aU/9o05c+ac914bNmygrKwsNRulAdPR0ZH8+pVXXkmeAev+kt6C\nIGDp0qUUFRXxta99Lbk85X9f+umkj0H3ox/9KCgsLAzGjx8fPPXUU6kejlLgd7/7XTB58uRg8uTJ\nwa233prcDw4ePBiUlZUF0Wg0KC8vDw4fPpxc58knnwzGjx8f3HzzzUF9fX1y+c9//vPgtttuC8aP\nHx/81V/91aBvi/rffffdF4wdOzbIzMwM8vPzg5deeqlf943jx48HX/jCF4IJEyYEd955Z7Bnz57B\n3Dz1s3P3l5qammDhwoVBcXFxMGnSpGDOnDnB/v37k693f0lfP/3pT4NIJBJMnjw5mDJlSjBlypRg\ny5YtKf/7EgkCb4gqSZIUZqE85CpJkqQPGXSSJEkhZ9BJkiSFnEEnSZIUcgadJElSyBl0kiRJIff/\nAQryBUhe0NItAAAAAElFTkSuQmCC\n" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot(list(fdist._cumulative_frequencies()))" ] }, { "cell_type": "code", "execution_count": 
92, "id": "ed72ee81", "metadata": { "collapsed": false }, "outputs": [], "source": [ "vocabulary = fdist.keys()" ] }, { "cell_type": "code", "execution_count": 93, "id": "435d24bb", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', \"'\", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '\"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']\n" ] } ], "source": [ "print vocabulary[:50]" ] }, { "cell_type": "code", "execution_count": 94, "id": "5782dfe7", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sent1: Call me Ishmael .\n", "sent2: The family of Dashwood had long been settled in Sussex .\n", "sent3: In the beginning God created the heaven and the earth .\n", "sent4: Fellow - Citizens of the Senate and of the House of Representatives :\n", "sent5: I have a problem with people PMing me to lol JOIN\n", "sent6: SCENE 1 : [ wind ] [ clop clop clop ] KING ARTHUR : Whoa there !\n", "sent7: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .\n", "sent8: 25 SEXY MALE , seeks attrac older single lady , for discreet encounters .\n", "sent9: THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a cloud of sunset .\n" ] } ], "source": [ "sents()" ] }, { "cell_type": "markdown", "id": "ca54611f", "metadata": {}, "source": [ "# Tokenizers" ] }, { "cell_type": "markdown", "id": "bafc92e6", "metadata": {}, "source": [ "Tokenizers split text into smaller units called \"tokens\", such as sentences or words." ] }, { "cell_type": "code", "execution_count": 95, "id": "13f5e031", "metadata": { "collapsed": false }, "outputs": [], "source": [ "s = \"Call me Ishmael. 
The quick brown fox is sleeping.\"" ] }, { "cell_type": "markdown", "id": "5af696a9", "metadata": {}, "source": [ "The `sent_tokenize` function splits a text into sentences." ] }, { "cell_type": "code", "execution_count": 96, "id": "d1d4c652", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['Call me Ishmael.', 'The quick brown fox is sleeping.']" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nltk.tokenize.sent_tokenize(s)" ] }, { "cell_type": "markdown", "id": "31f58de3", "metadata": {}, "source": [ "The `word_tokenize` function splits a text into words. In the output below, the period after the first sentence stays attached to `Ishmael.`, which suggests tokenizing one sentence at a time." ] }, { "cell_type": "code", "execution_count": 97, "id": "5a6068f2", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['Call',\n", " 'me',\n", " 'Ishmael.',\n", " 'The',\n", " 'quick',\n", " 'brown',\n", " 'fox',\n", " 'is',\n", " 'sleeping',\n", " '.']" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nltk.tokenize.word_tokenize(s)" ] }, { "cell_type": "markdown", "id": "faad784e", "metadata": {}, "source": [ "# Lemmatizer" ] }, { "cell_type": "markdown", "id": "887366f8", "metadata": {}, "source": [ "Lemmatizers take words and turn them into _lemmas_ (or _lemmata_), the canonical or dictionary form of a word." 
] }, { "cell_type": "code", "execution_count": 98, "id": "c3c01c7e", "metadata": { "collapsed": false }, "outputs": [], "source": [ "from nltk import stem\n", "L = stem.WordNetLemmatizer()" ] }, { "cell_type": "code", "execution_count": 99, "id": "74c5f6e8", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'dog'" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "L.lemmatize(\"dogs\")" ] }, { "cell_type": "code", "execution_count": 100, "id": "992a4d3a", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'shop'" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "L.lemmatize(\"shops\")" ] }, { "cell_type": "code", "execution_count": 101, "id": "67d1138c", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'mouse'" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "L.lemmatize(\"mice\")" ] }, { "cell_type": "code", "execution_count": 102, "id": "06976f10", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'fantasize'" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "L.lemmatize('fantasized','v')" ] }, { "cell_type": "markdown", "id": "21ea6f90", "metadata": {}, "source": [ "Lemmatization sometimes requires knowledge of the part of speech. When no part of speech is given, `lemmatize` treats the word as a noun." 
] }, { "cell_type": "code", "execution_count": 103, "id": "b27c9311", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'ha'" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "L.lemmatize('has')" ] }, { "cell_type": "code", "execution_count": 104, "id": "f891fd05", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'have'" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "L.lemmatize('has','v')" ] }, { "cell_type": "code", "execution_count": 105, "id": "0938e76c", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'have'" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "L.lemmatize('having','v')" ] }, { "cell_type": "markdown", "id": "b75bbe64", "metadata": {}, "source": [ "# Stemming" ] }, { "cell_type": "markdown", "id": "79c36ae1", "metadata": {}, "source": [ "Stemming reduces inflected forms to a common base form, the _stem_.\n", "\n", "The stem may or may not be the dictionary form of the word; what matters primarily is that\n", "different inflected forms map to the same stem, and different lemmas map to different stems.\n", "\n", "Stemming is important for information retrieval and web search, as different inflected forms\n", "are usually treated the same for query purposes." 
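] }, { "cell_type": "markdown", "id": "5e0b7a21", "metadata": {}, "source": [ "As a minimal sketch of the first property, the cell below stems three inflected forms of one verb, which should share a single stem. The word list and the name `toy_stemmer` are made up for illustration." ] }, { "cell_type": "code", "execution_count": null, "id": "8c4d19f6", "metadata": { "collapsed": false }, "outputs": [], "source": [ "from nltk.stem.porter import PorterStemmer\n", "\n", "# Hypothetical example words: three inflected forms of the same verb.\n", "toy_stemmer = PorterStemmer()\n", "[toy_stemmer.stem(word) for word in [\"bend\", \"bends\", \"bending\"]]" 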
] }, { "cell_type": "code", "execution_count": 106, "id": "66249ef1", "metadata": { "collapsed": false }, "outputs": [], "source": [ "P = nltk.stem.porter.PorterStemmer()" ] }, { "cell_type": "code", "execution_count": 107, "id": "c94a0b23", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'overdo'" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P.stem(\"overdoing\")" ] }, { "cell_type": "code", "execution_count": 108, "id": "d1969c37", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'bender'" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P.stem(\"bender\")" ] }, { "cell_type": "code", "execution_count": 109, "id": "b1485c0e", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'bend'" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P.stem(\"bending\")" ] }, { "cell_type": "code", "execution_count": 110, "id": "58eb4d89", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'commun'" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P.stem(\"communities\")" ] }, { "cell_type": "markdown", "id": "269c8216", "metadata": {}, "source": [ "# POS Tagging" ] }, { "cell_type": "markdown", "id": "37ccc934", "metadata": {}, "source": [ "Recall that part-of-speech is the linguistic category of a word/lexical item.\n", "\n", "We had nouns, verbs, adjectives, etc.\n", "\n", "Some of these classes are open (they grow, like nouns), some of them are closed (like prepositions).\n", "\n", "POS tagging infers the part-of-speech of a word in the context of a sentence.\n", "The sentence context is needed, since many words belong to different\n", "parts of speech depending on how they are used." 
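] }, { "cell_type": "markdown", "id": "2d7f60b9", "metadata": {}, "source": [ "As a minimal sketch of why context matters, the cell below tags two made-up sentences that use \"book\" once as a verb and once as a noun. The tags actually assigned depend on the installed tagger model, so no particular output is claimed here." ] }, { "cell_type": "code", "execution_count": null, "id": "a94c3e15", "metadata": { "collapsed": false }, "outputs": [], "source": [ "from nltk import pos_tag, word_tokenize\n", "\n", "# Hypothetical sentences: \"book\" is a verb in the first and a noun in the second.\n", "toy_sentences = [\"They book a flight.\", \"They read a book.\"]\n", "[pos_tag(word_tokenize(sentence)) for sentence in toy_sentences]" 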
] }, { "cell_type": "code", "execution_count": 111, "id": "685b71af", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('The', 'DT'),\n", " ('quick', 'NN'),\n", " ('brown', 'NN'),\n", " ('fox', 'NN'),\n", " ('jumps', 'NNS'),\n", " ('over', 'IN'),\n", " ('the', 'DT'),\n", " ('lazy', 'NN'),\n", " ('dogs', 'NNS'),\n", " ('.', '.')]" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pos_tag(word_tokenize(\"The quick brown fox jumps over the lazy dogs.\"))" ] }, { "cell_type": "code", "execution_count": 112, "id": "3be144c8", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('I', 'PRP'),\n", " ('compare', 'VBP'),\n", " ('thee', 'JJ'),\n", " ('to', 'TO'),\n", " ('a', 'DT'),\n", " ('summer', 'NN'),\n", " (\"'s\", 'POS'),\n", " ('day', 'NN'),\n", " ('?', '.')]" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pos_tag(word_tokenize(\"I compare thee to a summer's day?\"))" ] }, { "cell_type": "code", "execution_count": 113, "id": "7d7a984b", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PRP: pronoun, personal\n", " hers herself him himself hisself it itself me myself one oneself ours\n", " ourselves ownself self she thee theirs them themselves they thou thy us\n" ] } ], "source": [ "nltk.help.upenn_tagset('PRP')" ] }, { "cell_type": "code", "execution_count": 114, "id": "5aab2887", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "JJ: adjective or numeral, ordinal\n", " third ill-mannered pre-war regrettable oiled calamitous first separable\n", " ectoplasmic battery-powered participatory fourth still-to-be-named\n", " multilingual multi-disciplinary ...\n" ] } ], "source": [ "nltk.help.upenn_tagset(\"JJ\")" ] }, { "cell_type": "code", "execution_count": 115, "id": "ee399433", "metadata": { "collapsed": false }, 
"outputs": [ { "data": { "text/plain": [ "[('Buffalo', 'NNP'), ('buffalo', 'VBD'), ('buffalo', 'NN'), ('.', '.')]" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pos_tag(word_tokenize(\"Buffalo buffalo buffalo.\"))" ] }, { "cell_type": "markdown", "id": "e37ac659", "metadata": {}, "source": [ "# Parsing" ] }, { "cell_type": "markdown", "id": "3996087e", "metadata": {}, "source": [ "Parsing is the process of recovering the phrase structure (a tree structure)\n", "of a sentence from the linear sequence of words.\n", "\n", "Parsing is driven by a grammar.\n", "\n", "Recall your context-free grammars from the introductory class.\n", "\n", "Recall _terminals_ and _non-terminals_.\n", "\n", "Context free grammars are grammars in which there is only a single non-terminal on the left hand\n", "side of any production." ] }, { "cell_type": "code", "execution_count": 116, "id": "d8fb931c", "metadata": { "collapsed": false }, "outputs": [], "source": [ "tree = nltk.bracket_parse('(NP (Adj old) (NP (N men) (Conj and) (N women)))')\n", "tree.draw() " ] }, { "cell_type": "code", "execution_count": 117, "id": "da52a216", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grammar = parse_cfg(\"\"\"\n", "S -> NP VP\n", "PP -> P NP\n", "NP -> 'the' N | N PP | 'the' N PP\n", "VP -> V NP | V PP | V NP PP\n", "N -> 'cat'\n", "N -> 'dog'\n", "N -> 'rug'\n", "V -> 'chased'\n", "V -> 'sat'\n", "P -> 'in'\n", "P -> 'on'\n", "\"\"\")\n", "grammar" ] }, { "cell_type": "code", "execution_count": 118, "id": "82c15dd9", "metadata": { "collapsed": false }, "outputs": [], "source": [ "from nltk.parse import ShiftReduceParser\n", "P = ShiftReduceParser(grammar)" ] }, { "cell_type": "code", "execution_count": 119, "id": "84973ac6", "metadata": { "collapsed": false }, "outputs": [], "source": [ "tree = P.parse(word_tokenize(\"the 
cat chased the dog\"))" ] }, { "cell_type": "code", "execution_count": 120, "id": "cc6eeec6", "metadata": { "collapsed": false }, "outputs": [], "source": [ "tree.draw()" ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 5 }