{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a Jupyter Notebook file" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "import matplotlib\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "*** Introductory Examples for the NLTK Book ***\n", "Loading text1, ..., text9 and sent1, ..., sent9\n", "Type the name of the text or sentence to view it.\n", "Type: 'texts()' or 'sents()' to list the materials.\n", "text1: Moby Dick by Herman Melville 1851\n", "text2: Sense and Sensibility by Jane Austen 1811\n", "text3: The Book of Genesis\n", "text4: Inaugural Address Corpus\n", "text5: Chat Corpus\n", "text6: Monty Python and the Holy Grail\n", "text7: Wall Street Journal\n", "text8: Personals Corpus\n", "text9: The Man Who Was Thursday by G . K . Chesterton 1908\n" ] } ], "source": [ "from nltk.book import *\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Searching for Words" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Displaying 25 of 1226 matches:\n", "s , and to teach them by what name a whale - fish is to be called in our tongue\n", "t which is not true .\" -- HACKLUYT \" WHALE . ... Sw . and Dan . HVAL . This ani\n", "ulted .\" -- WEBSTER ' S DICTIONARY \" WHALE . ... It is more immediately from th\n", "ISH . WAL , DUTCH . HWAL , SWEDISH . WHALE , ICELANDIC . WHALE , ENGLISH . BALE\n", "HWAL , SWEDISH . WHALE , ICELANDIC . WHALE , ENGLISH . BALEINE , FRENCH . BALLE\n", "least , take the higgledy - piggledy whale statements , however authentic , in \n", " dreadful gulf of this monster ' s ( whale ' s ) mouth , are immediately lost a\n", " patient Job .\" -- RABELAIS . 
\" This whale ' s liver was two cartloads .\" -- ST\n", " Touching that monstrous bulk of the whale or ork we have received nothing cert\n", " of oil will be extracted out of one whale .\" -- IBID . \" HISTORY OF LIFE AND D\n", "ise .\" -- KING HENRY . \" Very like a whale .\" -- HAMLET . \" Which to secure , n\n", "restless paine , Like as the wounded whale to shore flies thro ' the maine .\" -\n", ". OF SPERMA CETI AND THE SPERMA CETI WHALE . VIDE HIS V . E . \" Like Spencer ' \n", "t had been a sprat in the mouth of a whale .\" -- PILGRIM ' S PROGRESS . \" That \n", "EN ' S ANNUS MIRABILIS . \" While the whale is floating at the stern of the ship\n", "e ship called The Jonas - in - the - Whale . ... Some say the whale can ' t ope\n", " in - the - Whale . ... Some say the whale can ' t open his mouth , but that is\n", " masts to see whether they can see a whale , for the first discoverer has a duc\n", " for his pains . ... I was told of a whale taken near Shetland , that had above\n", "oneers told me that he caught once a whale in Spitzbergen that was white all ov\n", "2 , one eighty feet in length of the whale - bone kind came in , which ( as I w\n", "n master and kill this Sperma - ceti whale , for I could never hear of any of t\n", " . 1729 . \"... and the breath of the whale is frequendy attended with such an i\n", "ed with hoops and armed with ribs of whale .\" -- RAPE OF THE LOCK . \" If we com\n", "contemptible in the comparison . 
The whale is doubtless the largest animal in c\n" ] } ], "source": [ "text1.concordance(\"whale\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sea man it ship by him hand them whale view ships land me life death\n", "water way head nature fear\n" ] } ], "source": [ "text1.similar(\"love\")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "affection sister heart mother time see town life it dear elinor\n", "marianne me word family her him do regard head\n" ] } ], "source": [ "text2.similar(\"love\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "join part hi hey and wb well ty lmao yeah hiya ok oh hello you what\n", "yes haha no all\n" ] } ], "source": [ "text5.similar(\"lol\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Positioning Words" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAY0AAAEWCAYAAACaBstRAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAZK0lEQVR4nO3de5wlZX3n8c9vGJiJgDNyCYLANII38IIwGiTgNCuiIhp9RRcIruCqiBt1jRKFhTht9uVmAYUQdQNqCIkoAU3MshoXiEpQkMuAXEVuAqIgMGGRi8j1t3/UU3bNmXO6n56+Tc983q/XeZ2qp5566nmqqs+3T53q05GZSJJUY95sd0CSNHcYGpKkaoaGJKmaoSFJqmZoSJKqGRqSpGqGhua8iPh2RBw6yTYOi4gfTLKN6yNieDJtTKWp2C9rsM2RiDhjJrepmWVoaEZFxO0Rse9UtpmZb8jMv5vKNrsiYigiMiIeLo97IuKbEfHann7skpkXTFc/Jmq69ktEnB4Rj5d9cX9EnB8RL1yDdqb8XND0MzSkeoszcxPgZcD5wDci4rDZ6kxEzJ+tbQPHl32xLXAvcPos9kUzyNDQWiMiDoiIqyLigYi4OCJeWsp3LL/R7lbmt4mIle2loIi4ICLe02nnvRFxQ0Q8FBE/7qx3VETc2il/65r0MzN/mZknAyPAcRExr7T/29+cI+KVEbEiIh4s70xOLOXtu5bDI+KuiLg7Ij7a6fu8Tj//PSLOjojNetZ9d0T8DPhuRCyMiDNK3Qci4vKI2Kp3v5R2j42IOyLi3oj4+4hY1NPuoRHxs7Jvj6ncF78Gvgq8uN/yiHhzuWz3QOnPi0r5l4Htgf9T3rF8bKLHQbPD0NBaobywnwa8D9gcOBU4JyIWZOatwMeBr0TEM4C/BU7vdykoIt5O82L+TuCZwJuBfy+LbwX2BhYBnwTOiIitJ9HtfwJ+F3hBn2UnAydn5jOBHYGze5bvAzwP2A84qnOZ5kPAW4BlwDbA/wM+37PuMuBFwOuAQ8t4tqPZb0cAj/bpz2HlsQ/wXGAT4HM9dfYqY3kN8In2BX4sEbEJcAjwoz7Lng+cCXwY2BL4F5qQ2Cgz/xPwM+BNmblJZh4/3ra0djA0tLZ4L3BqZl6amU+Va/GPAXsAZOYXgZuBS4GtgUG/Cb+H5tLJ5dm4JTPvKG18LTPvysynM/Os0t4rJ9Hnu8rzZn2WPQHsFBFbZObDmXlJz/JPZuYjmXktTQgeXMrfBxyTmT/PzMdoAvBtPZeiRsq6j5btbA7sVPbbFZn5YJ/+HAKcmJk/zcyHgaOBg3ra/WRmPpqZVwNX01yGG+TIiHgAuIUmgA7rU+dA4FuZeX5mPgF8GvgdYM8x2tVaztDQ2mIJ8NFyGeOB8oK0Hc1v260v0lwG+Wx5Qe1nO5p3FKuJiHd2Ln89UNraYhJ9fk55vr/PsncDzwd+Ui4ZHdCz/M7O9B2MjnMJzWclbR9vAJ4Cthqw7peBc4F/KJe7jo+IDfv0Z5uyne425/e0+8vO9K9pwmCQT2fm4sx8dma+ubwbHHObmfl06ftz+tTVHGFoaG1xJ/Cp8kLUPp6RmWfCby+D/CXwN8BIe51/QDs79hZGxBKa0PkAsHlmLgauA2ISfX4rzYfAN/YuyMybM/NgmstXxwFfj4iNO1W260xvz+i7ljuBN/Tsh4WZ+Ytu853tPJGZn8zMnWl+gz+A5tJcr7toAqm7zSeBeyrHuiZW2WZEBM2427H4FdtzkKGh2bBh+QC3fcyneUE/IiJ+LxobR8QbI2LTss7JwBWZ+R7gW8ApA9r+Es2lk91LOzuVwNiY5kXqPoCIeBcDPrwdT0RsFREfAJYDR5ffoHvrvCMitizLHijFT3Wq/FlEPCMidgHeBZxVyk8BPlX6TERsGRF/MEZf9omIl0TEBsCDNJernupT9UzgTyJihxLA/wM4KzOfnMjYJ+hs4I0R8Zry7uejNJc
cLy7L76H5fEVziKGh2fAvNB/Wto+RzFxB87nG52g+/L2Fcp28vGi+nuZDXoCPALtFxCG9DWfm14BP0dzR8xDwz8Bmmflj4DPAD2lerF4CXDTBfj8QEY8A1wL7A2/PzNMG1H09cH1EPEwTeAdl5m86y/+tjPE7NJd6zivlJwPnAOdFxEPAJcDvjdGnZwNfpwmMG0q7/f647jSaS1kXArcBvwE+OPZwJyczbwTeAXwWWAm8ieaD78dLlb8Aji2X4o6czr5o6oT/hEmaORExRPOiveE0/5YvTQvfaUiSqhkakqRqXp6SJFXznYYkqdpsfuHZjNhiiy1yaGhotrshSXPGFVdcsTIzt+y3bJ0PjaGhIVasWDHb3ZCkOSMi7hi0zMtTkqRqhoYkqZqhIUmqZmhIkqoZGpKkaoaGJKmaoSFJqmZoSJKqGRqSpGqGhiSpmqEhSapmaEiSqhkakqRqhoYkqZqhIUmqZmhIkqoZGpKkaoaGJKmaoSFJqmZoSJKqGRqSpGqGhiSpmqEhSapmaEiSqhkakqRqhoYkqZqhIUmqZmhIkqoZGpKkaoaGJKmaoSFJqmZoSJKqGRqSpGqGhiSpmqEhSapmaEiSqhkakqRqhoYkqZqhIUmqZmhIkqoZGpKkaoaGJKmaoSFJqmZoSJKqGRqSpGqGhiSpmqEhSapmaEiSqhkakqRqhoYkqZqhIUmqZmhIkqoZGpKkaoaGJKmaoSFJqmZoSJKqGRqSpGqGhiSpmqEhSapmaEiSqhkakqRqhoYkqZqhIUmqZmhIkqoZGpKkaoaGJKmaoSFJqmZoSJKqzWhoRPDwBOsfFsHnpqs/4xkehnnzIKJ5zJsHixePLu+dHhkZne9O92u33/zwcPMYGuq/3sjI6uuOjDT1e9cZGVm9DwsX9u9bO98+z5/fjLfdXtv28PDqYxwZGd0/3ce8eU39hQtXXadtb2RkdFlbv22vu52hoaY/bVvtuv320fBwU3fx4qZudzpidN/1jqHbp/nzV9/P/fZxW6e3Xu85046t7UvvPl68eHSbEaN9XbiwebRjarfdu5/b7S9cOHp+to+hoea57UM7hnZftut29+nw8Oj67Tja/dL7c9C7f7v9bLff1u22MTw82lbbp7ad9tHW793H3X3YPnfPs94xjIyMtt3Wa8+Nfse5O917nvWen+052113rJ/73mX9fra6P7fd/rb7sV2nHWf73Pa1HVt7Dk2HyMzpabnfxoKHM9lkAvUPA5Zm8oE13ebSpUtzxYoVa7RuRP/ydpdFrDo9aFm/drvL2vnu9vqt27uN3j72Kx+0nbHK+427W9475hprus5Y9Xv3UU3bveOvab/fPh6vrKYPE1VzbKaqzcmYrjZh1XNwrPN1Tdtek+Pb25eJ/Nx31+/X5mSt6ct7RFyRmUv7LZvSdxoRfCyCD5XpkyL4bpl+TQRnlOlPRXB1BJdEsFUpe1MEl0bwowj+tS3vaXvLCP4xgsvL4/ensu+SpPFN9eWpC4G9y/RSYJMINgT2Ar4PbAxcksnLSt33lro/APbI5OXAPwAf69P2ycBJmbwC+EPgS4M6ERGHR8SKiFhx3333TcGwJEkA86e4vSuA3SPYFHgMuJImPPYGPgQ8DnyzU/e1ZXpb4KwItgY2Am7r0/a+wM6dt2zPjGDTTB7qrZiZXwC+AM3lqckPS5IEUxwamTwRwe3Au4CLgWuAfYAdgRuAJzJpX8Sf6mz/s8CJmZwTwTAw0qf5ecCrMnl0KvssSao3HXdPXQgcWZ6/DxwBXNUJi34WAb8o04cOqHMejH4gHsGuk+/q2JYtW/XDqAhYtGh0vnd6+fLR+e50v3b7zS9b1jyWLOm/3vLlq6+7fHlTv3ed5ctX78OCBf371s63zxtssOr22raXLVt9jIPGGdHUX7Bg1Tpte8uXr7osYrS97naWLGn607bVrttvHy1b1tRdtKip251ul/eOvzvG5cubdXr3c7993Nbprdd7zrRja/vSu48XLRr
dJoz2dcGC5tGOadA50W5/wYLR87N9LFkyeo5GjI6h3Zftut19umzZ6PrtONr90h1P2+/u/u32s91+W7fbRneftX1q22kfbf3efdzdh+1z9zzrHcPy5aNtt/Xac6Pfce5O955nvedne8521x3r5753Wb+fre7PVLe/7X5s12nH2T63fW3HBqPHZapN+d1TEbwG+L/A4kweieAm4JRMTuzePRXB24ADMjksgj8ATqIJjkuAV2Qy3L17KoItgM8DL6J5h3JhJkeM15/J3D0lSeujse6emtFbbmeDoSFJEzNjt9xKktZthoYkqZqhIUmqZmhIkqoZGpKkaoaGJKmaoSFJqmZoSJKqGRqSpGqGhiSpmqEhSapmaEiSqhkakqRqhoYkqZqhIUmqZmhIkqoZGpKkaoaGJKmaoSFJqmZoSJKqGRqSpGqGhiSpmqEhSapmaEiSqhkakqRqhoYkqZqhIUmqZmhIkqoZGpKkaoaGJKmaoSFJqmZoSJKqGRqSpGqGhiSpmqEhSapmaEiSqhkakqRqhoYkqZqhIUmqZmhIkqoZGpKkaoaGJKmaoSFJqmZoSJKqGRqSpGqGhiSpmqEhSapmaEiSqhkakqRqhoYkqZqhIUmqZmhIkqoZGpKkaoaGJKmaoSFJqmZoSJKqGRqSpGqGhiSpmqEhSapmaEiSqhkakqRqhoYkqZqhIUmqtlaFRgQfjuAZs92PruHh0emRkVWX9c6vq2rHOV37Y3i4abt9TNTIyGgbg5b3zq/pWLrnSzvdr73ufDvdXXc84/W5X/9rlnfLu/0ZGurf1qB9NZGxzJTJHNe5ZjrHGZk5fa1PUAS3A0szWTmBdTbI5KlBy5cuXZorVqyYTJ9od1F3ut/8uqp2nNO1PyJWnZ/oNrrr91u333Fdk+30ttVO92tvUL3abY7X535tjXUuj9fGoHN/0L5aG382JnNc55rJ7v+IuCIzl/ZbNu47jQiGIvhJBF+K4LoIvhLBvhFcFMHNEbwygs0i+OcIrongkgheWtYdieC0CC6I4KcRfKiUbxzBtyK4urR5YFm2DfC9CL5X6u0XwQ8juDKCr0WwSSm/PYJPRPAD4O1rvmskSRMxv7LeTjQvzocDlwN/BOwFvBn4b8CdwI8yeUsE/wH4e2DXsu4LgX2ATYEbI/hr4PXAXZm8ESCCRZn8KoKPAPtksjKCLYBjgX0zeSSCjwMfAf68tPubTPbq19mIOLz0le23375yiJKk8dR+pnFbJtdm8jRwPfCdTBK4FhiiCZAvA2TyXWDzCBaVdb+VyWPlktO9wFZlvX0jOC6CvTP5VZ9t7gHsDFwUwVXAocCSzvKzBnU2M7+QmUszc+mWW25ZOURJ0nhq32k81pl+ujP/dGnjyT7rtFfUuus+BczP5KYIdgf2B/4igvMyf/sOohXA+ZkcPKBPj1T2XZI0RWpDYzwXAocA/z2CYWBlJg/2foDZimAb4P5MzojgYeCwsughmstYK4FLgM9HsFMmt5S7qrbN5KYp6nOVZctGp5cvX3VZ7/y6qnac07U/li2b3N04y5fDBRcMbmMqj2v3fGmn+7XXLWunu+uOZ7w+125zrHW6/VmypH/dQftqImOZKevLzytM71jHvXsqgiHgm5m8uMyfXua/3i4DXg38LbAD8Gvg8EyuiWAEeDiTT5d1rwMOAF4AnEDzTuUJ4P2ZrIjgg8AfA3dnsk/5fOQ4YEHpzrGZnDORu6wme/eUJK1vxrp7aq265XY6GBqSNDGTuuVWkqSWoSFJqmZoSJKqGRqSpGqGhiSpmqEhSapmaEiSqhkakqRqhoYkqZqhIUmqZmhIkqoZGpKkaoaGJKmaoSFJqmZoSJKqGRqSpGqGhiSpmqEhSapmaEiSqhkakqRqhoYkqZqhIUmqZmhIkqoZGpKkaoaGJKmaoSFJqmZoSJKqGRqSpGqGhiSpmqEhSapmaEiSqhkakqRqhoYkqZqhIUmqZmhIkqoZGpKkaoaGJKmaoSFJqmZoSJK
qGRqSpGqGhiSpmqEhSapmaEiSqhkakqRqhoYkqZqhIUmqZmhIkqoZGpKkaoaGJKmaoSFJqmZoSJKqGRqSpGqGhiSpmqEhSapmaEiSqhkakqRqhoYkqZqhIUmqZmhIkqoZGpKkaoaGJKmaoSFJqmZoSJKqGRqSpGqGhiSpmqEhSapmaEiSqkVmznYfplVE3AfcsYarbwGsnMLurK3Wl3HC+jPW9WWcsP6MdSbHuSQzt+y3YJ0PjcmIiBWZuXS2+zHd1pdxwvoz1vVlnLD+jHVtGaeXpyRJ1QwNSVI1Q2NsX5jtDsyQ9WWcsP6MdX0ZJ6w/Y10rxulnGpKkar7TkCRVMzQkSdUMjT4i4vURcWNE3BIRR812f2pFxO0RcW1EXBURK0rZZhFxfkTcXJ6fVcojIv6qjPGaiNit086hpf7NEXFop3z30v4tZd2YwbGdFhH3RsR1nbJpH9ugbczCWEci4hfl2F4VEft3lh1d+n1jRLyuU973PI6IHSLi0jKmsyJio1K+oMzfUpYPTfM4t4uI70XEDRFxfUT811K+Th3XMcY5N49pZvroPIANgFuB5wIbAVcDO892vyr7fjuwRU/Z8cBRZfoo4LgyvT/wbSCAPYBLS/lmwE/L87PK9LPKssuAV5V1vg28YQbH9mpgN+C6mRzboG3MwlhHgCP71N25nKMLgB3KubvBWOcxcDZwUJk+BXh/mf4vwCll+iDgrGke59bAbmV6U+CmMp516riOMc45eUxn5Ad+Lj3KCXZuZ/5o4OjZ7ldl329n9dC4Edi6TG8N3FimTwUO7q0HHAyc2ik/tZRtDfykU75KvRka3xCrvpBO+9gGbWMWxjroBWaV8xM4t5zDfc/j8uK5Ephfyn9br123TM8v9WIGj+//Bl67Lh/XnnHOyWPq5anVPQe4szP/81I2FyRwXkRcERGHl7KtMvNugPL8u6V80DjHKv95n/LZNBNjG7SN2fCBclnmtM7llImOdXPggcx8sqd8lbbK8l+V+tOuXDZ5OXAp6/Bx7RknzMFjamisrt91+rlyX/LvZ+ZuwBuAP46IV49Rd9A4J1q+NloXx/bXwI7ArsDdwGdK+VSOdVb2Q0RsAvwj8OHMfHCsqn3K5sxx7TPOOXlMDY3V/RzYrjO/LXDXLPVlQjLzrvJ8L/AN4JXAPRGxNUB5vrdUHzTOscq37VM+m2ZibIO2MaMy857MfCoznwa+SHNsYeJjXQksjoj5PeWrtFWWLwLun/rRjIqIDWleSL+Smf9Uite549pvnHP1mBoaq7sceF65G2Ejmg+PzpnlPo0rIjaOiE3baWA/4Dqavrd3kxxKcz2VUv7OckfKHsCvytv0c4H9IuJZ5e3yfjTXR+8GHoqIPcodKO/stDVbZmJsg7Yxo9oXuOKtNMcWmv4dVO6S2QF4Hs2Hv33P42wubn8PeFtZv3e/tWN9G/DdUn+6xhTA3wA3ZOaJnUXr1HEdNM45e0xn6sOfufSguUvjJpo7FY6Z7f5U9vm5NHdTXA1c3/ab5vrld4Cby/NmpTyAz5cxXgss7bT1n4FbyuNdnfKl5cS+FfgcM/sh6Zk0b+GfoPnt6d0zMbZB25iFsX65jOUamheCrTv1jyn9vpHOHW2DzuNyrlxW9sHXgAWlfGGZv6Usf+40j3Mvmksl1wBXlcf+69pxHWOcc/KY+jUikqRqXp6SJFUzNCRJ1QwNSVI1Q0OSVM3QkCRVMzS03ouIkyLiw535cyPiS535z0TERybR/khEHDlg2eER8ZPyuCwi9uos27t8K+pVEfE7EXFCmT9hgtsfiog/WtP+S12GhgQXA3sCRMQ8YAtgl87yPYGLahqKiA1qNxoRBwDvA/bKzBcCRwBfjYhnlyqHAJ/OzF0z89FSd7fM/NPabRRDgKGhKWFoSE0g7Fmmd6H5Y7CHyl8YLwBeBPyo/CXyCRFxXTT/o+FAgIgYjub/JXyV5o+1iIhjovm/B/8KvGDAdj8O/GlmrgTIzCuBv6P
53rD3AP8R+EREfCUizgE2Bi6NiAMj4u2lH1dHxIVlmxuU/l1evgTvfWU7/xPYu7xj+ZOp3HFa/8wfv4q0bsvMuyLiyYjYniY8fkjz7aCvovlW0Gsy8/GI+EOaL5d7Gc27kcvbF2ya7w16cWbeFhG703zFw8tpfsauBK7os+ld+pSvAA7NzD8rl6q+mZlfB4iIhzNz1zJ9LfC6zPxFRCwu676b5qs1XlHC7qKIOI/m/0UcmZkHTG5PSYaG1GrfbewJnEgTGnvShMbFpc5ewJmZ+RTNF979G/AK4EHgssy8rdTbG/hGZv4aoLxLqBXUfQvpRcDpEXE20H7R337ASyOi/Q6iRTTfW/T4BLYvjcnLU1Kj/VzjJTSXpy6heafR/TxjrH9v+0jPfM0L/4+B3XvKdivlY8rMI4Bjab7B9KqI2Lz074PlM5BdM3OHzDyvoh9SNUNDalwEHADcn83XVd8PLKYJjh+WOhcCB5bPDrak+besl/Vp60LgreWOp02BNw3Y5vHAceUFn4jYFTgM+F/jdTYidszMSzPzEzRfjb0dzbe9vj+ar+EmIp4fzTceP0Tzb0alSfPylNS4luZziq/2lG3SflBN8z9KXkXzTcIJfCwzfxkRL+w2lJlXRsRZNN9megfw/X4bzMxzIuI5wMURkTQv7u/I8h/lxnFCRDyP5t3Fd0qfrqG5U+rK8nXc9wFvKeVPRsTVwOmZeVJF+1JffsutJKmal6ckSdUMDUlSNUNDklTN0JAkVTM0JEnVDA1JUjVDQ5JU7f8DluDMLh2ZMcQAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "text1.dispersion_plot([\"whale\", \"monster\"])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZIAAAEWCAYAAABMoxE0AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAYjklEQVR4nO3debhsVX3m8e8rF3FARYQoTlznWRGuGg3KdYgjTm1sxyhGRE202wGNNirX7mhaMEYTkzi1gSggSGuiRltpFEkgIheUwYGAAs4KoVHBieHXf+x1oCjOObeq1pmufD/PU8+pWnvXWr+9qk69Z+9dpypVhSRJs7reahcgSdq6GSSSpC4GiSSpi0EiSepikEiSuhgkkqQuBol+6yT5TJLnd/axT5J/7ezja0k29vSxlJZiXmYYc1OSD6/kmFp5BolWVZLzkjxqKfusqsdV1aFL2eeoJOuTVJJL2uXHST6V5PfH6rhXVR23XHVMa7nmJckhSX7T5uKiJMckufsM/Sz5c0ErwyCRZrdDVW0P3A84Bvh4kn1Wq5gk61ZrbOCgNhe3BX4CHLKKtWiFGSRas5LsneSrSS5OcmKS+7b2O7W/fHdvt2+d5MK5w0hJjkuy70g/L0ryjSQ/T/L1kfu9Lsm3RtqfOkudVfWjqnoXsAl4W5Lrtf6v+gs7yQOTbE7ys7YH847WPrd3s1+SHyT5YZJXj9R+vZE6/yPJUUl2HLvvC5N8B/h8khsk+XBb9+IkJye55fi8tH7fkOT8JD9J8g9JbjbW7/OTfKfN7QETzsUvgMOBe8+3PMmT2iG/i1s992jtHwJuD3yy7dm8dtrHQavHINGa1F7sPwi8GLgF8F7gE0m2q6pvAX8KHJbkRsDfA4fMdxgpydMZXuCfB9wUeBLwH23xt4CHAjcD3gx8OMkuHWV/DPgd4G7zLHsX8K6quilwJ+CoseUPB+4CPBp43cghnv8CPAXYC7g18P+Avxm7717APYDHAM9v23M7hnl7CfDLeerZp10eDtwR2B5499g6e7ZteSTwprkX/cUk2R54DvCVeZbdFTgCeAWwM/BphuC4flX9IfAd4IlVtX1VHbSlsbR2GCRaq14EvLeqTqqqK9qx/V8DvwtQVe8HzgZOAnYBFvqLeV+Gwy4n1+Ccqjq/9fHRqvpBVV1ZVUe2/h7YUfMP2s8d51l2GXDnJDtV1SVV9aWx5W+uqkur6gyGYHxWa38xcEBVfa+qfs0Qin8wdhhrU7vvL9s4twDu3ObtlKr62Tz1PAd4R1V9u6ouAV4PPHOs3zdX1S+r6jTgNIZDeAvZP8nFwDkMobTPPOs8A/jnqjqmqi4D3g7cEHjIIv1qK2CQaK3aFXh1OwRycXuRuh3DX+Vz3s9wCOWv24vsfG7HsOdxLUmeN3Lo7OLW104dNd+m/bxonmUvBO4KfLMdbtp7bPl3R66fz9XbuSvDuZe5Gr8BXAHccoH7fgj4LPCRdqjsoCTbzlPPrds4o2OuG+v3RyPXf8EQEAt5e1XtUFW3qqontb3GRcesqitb7beZZ11tRQwSrVXfBd7SXpzmLjeqqiPgqkMo7wT+F7Bp7rzBAv3cabwxya4MQfQy4BZVtQNwJpCOmp/KcKL5rPEFVXV2VT2L4dDX24Cjk9x4ZJXbjVy/PVfv3XwXeNzYPNygqr4/2v3IOJdV1Zur6p4Mf+nvzXBYb9wPGEJqdMzLgR9PuK2zuMaYScKw3X
Pb4keRb6UMEq0F27aTxHOXdQwv8i9J8qAMbpzkCUlu0u7zLuCUqtoX+GfgPQv0/QGGwy57tH7u3ELkxgwvXBcAJHkBC5wg3pIkt0zyMuBA4PXtL+3xdZ6bZOe27OLWfMXIKm9McqMk9wJeABzZ2t8DvKXVTJKdkzx5kVoenuQ+SbYBfsZwqOuKeVY9Anhlkju0UH4rcGRVXT7Ntk/pKOAJSR7Z9pJezXC48sS2/McM52u0lTFItBZ8muGE8NxlU1VtZjhP8m6GE8zn0I67txfSxzKcSAZ4FbB7kueMd1xVHwXewvBOop8D/wjsWFVfB/4C+DeGF7D7ACdMWffFSS4FzgAeDzy9qj64wLqPBb6W5BKGEHxmVf1qZPkX2zYey3CY6HOt/V3AJ4DPJfk58CXgQYvUdCvgaIYQ+Ubrd75/CPwgw2Gw44FzgV8BL198c/tU1VnAc4G/Bi4Enshwcv03bZU/B97QDuPtv5y1aGnFL7aSVk+S9Qwv5Nsu896AtGzcI5EkdTFIJEldPLQlSeriHokkqctqfsjbqthpp51q/fr1q12GJG1VTjnllAurauf5ll3ngmT9+vVs3rx5tcuQpK1KkvMXWuahLUlSF4NEktTFIJEkdTFIJEldDBJJUheDRJLUxSCRJHUxSCRJXQwSSVIXg0SS1MUgkSR1MUgkSV0MEklSF4NEktTFIJEkdTFIJEldDBJJUheDRJLUxSCRJHUxSCRJXQwSSVIXg0SS1MUgkSR1MUgkSV0MEklSF4NEktTFIJEkdTFIJEldDBJJUheDRJLUxSCRJHUxSCRJXQwSSVIXg0SS1MUgkSR1MUgkSV0MEklSF4NEktTFIJEkdTFIJEldDBJJUheDRJLUxSCRJHUxSCRJXQwSSVIXg0SS1MUgkSR1MUgkSV0MEklSF4NEktTFIJEkdTFIJEldDBJJUheDRJLUxSCRJHUxSCRJXQwSSVIXg0SS1MUgkSR1MUgkSV0MEklSF4NEktTFIJEkdTFIJEldDBJJUheDRJLUxSCRJHUxSCRJXdZEkCRcsto1TGPTpi23rVsHGzcuvP6W+p+7zFrHtGMuNP4OO2x5/PXrZxtj48Zrb+fcnM03zmI1LNQ+2v98ffdaaPzRsbZU/ySP9yTPh6U2Ot60z8Vpn4eTPA+mNUnN09Q367i92zI+Lwv9vk3yHFouqarl633SIsIlVWy/EmNt2LChNm/e3NVHAuPTNt6WDD+r5l9/S/3PWex+i9Ux7ZgLjb9YDT1jLTTGJHM7ybLR2ub675mThWxp/MXWmVs2akvrreSv66TbMN/yaZ8b49u3FI/VJDWPjtnT12Lr9m7L+OMACz/nFlq2NHXklKraMN+yNbFHMichCQcnnJlwRsIzWvuRCY8fWe+QhKclbNPWPznh9IQXr171knTdtKaCBPhPwG7A/YBHAQcn7AJ8BK4KlesDjwQ+DbwQ+GkVDwAeALwo4Q7jnSbZL8nmJJsvuOCCldkSSbqOWGtBsidwRBVXVPFj4IsMAfEZ4BEJ2wGPA46v4pfAo4HnJXwVOAm4BXCX8U6r6n1VtaGqNuy8884rtS2SdJ2wbrULGJP5Gqv4VcJxwGMY9kyOGFn/5VV8dmXKkySNW2tBcjzw4oRDgR2BhwGvacs+AuwLbAD2aW2fBV6a8PkqLku4K/D9Ki5dziIPPHDLbdtsA3vuufD60/Y/bR3TjrlQv+9855bX23XX2cY47rhrv6Nlr70Wr2fSZfPNw3x991po/NGxZql/1vWW0uiYWxp/fPm0z8Px9ZbisZq25uVat3dbxp/D55032bjTLu+xpt61lRDgIIbDVwX8WRVHtnW2BX4EfKKKF7S26wF/BjyRYe/kAuApVfx0obGW4l1bknRds9i7ttZEkKwkg0SSprfVvP1XkrT1MUgkSV0MEklSF4NEktTFIJEkdTFIJEldDBJJUheDRJLUxSCRJHUxSCRJXQwSSVIXg0SS1MUgkSR1MUgkSV0MEklSF4
NEktTFIJEkdTFIJEldDBJJUheDRJLUxSCRJHUxSCRJXQwSSVIXg0SS1MUgkSR1MUgkSV0MEklSF4NEktTFIJEkdTFIJEldDBJJUheDRJLUxSCRJHUxSCRJXQwSSVIXg0SS1MUgkSR1MUgkSV0MEklSF4NEktTFIJEkdTFIJEldDBJJUheDRJLUxSCRJHUxSCRJXQwSSVIXg0SS1MUgkSR1MUgkSV0MEklSF4NEktTFIJEkdTFIJEldDBJJUheDRJLUxSCRJHUxSCRJXQwSSVIXg0SS1MUgkSR1MUgkSV0MEklSl1UPkoQTV7uGSW3cOPm6mzbNPs7cfSfpo2ecSftYijHWmpWYt1n7Weh5tn79wuNu2jRdPYutv1zb1Xu/TZsm+x2c7/dnJZ7DC/3eTjp27+vL+Phzz5e5tmn6n1aqavl6Hx0orKvi8pHb21RxxYoMPmLDhg21efPmme6bwKTTNc26C913kj56xpm0j6UYY61ZiXmbtZ+F+l3seZEMP6d5fi60/nJtV+/9Jt3G+eZpJZ7DCz0+k47d+/oyPv749kPfHCQ5pao2zLdsi3skCesTvpnwgYQzEw5LeFTCCQlnJzywXU5M+Er7ebd2330SPprwSeBzCRsTvpBwOHBGW+eS9nP7hGMTTk04I+HJIzW8sdVwTMIRCfu39jsl/J+EUxL+JeHus0+TJGkW6yZc787A04H9gJOBZwN7Ak8C/hvwPOBhVVye8CjgrcDT2n0fDNy3iosSNgIPBO5dxbljY/wKeGoVP0vYCfhSwieAPVpf92/1ngqc0u7zPuAlVZyd8CDgb4FHjBefZL9WO7e//e0n3GRJ0iQmDZJzq67ag/gacGwVlXAGsB64GXBowl2AArYdue8xVVw0cvvL84QIQIC3JjwMuBK4DXBLhsD6pyp+2cb/ZPu5PfAQ4KNzu23AdvMVX1XvYwgdNmzY8Ft2kEaSVtekQfLrketXjty+svXxP4AvVPHUhPXAcSPrXzrW1/jtOc8Bdgb2qOKyhPOAGzAEzHyuB1xcxW4TboMkaRlMGiRbcjPg++36Ph19/KSFyMOBXVv7vwLvTfhzhnqfALy/HQI7N+HpVXw0IQyH0E6bfTMWt9dek6974IGzjzN330n66Bln0j6WYoy1ZiXmbdZ+Fnqe7brrwuNOW8ti6y/XdvXe78AD4bjjJr//aD8r8Rxe6Pd20rF7X1/Gx597vsy1TTJ3s9riu7baHsanqrh3u31Iu3303DLgRcChwAXA54E/rGJ9wj7Ahipe1u67Edi/ir1H+r+kiu3beZFPMhwW+yrwe8DjqjgvYRPwLOD8NsZxVbw/4Q7A3wG7tPt9pIr/vtj29LxrS5KuqxZ719aKvf23R8L2VVyScCPgeGC/Kk6dpS+DRJKmt1iQLNWhreX2voR7MpwzOXTWEJEkLb2tIkiqePZq1yBJmt+qf0SKJGnrZpBIkroYJJKkLgaJJKmLQSJJ6mKQSJK6GCSSpC4GiSSpi0EiSepikEiSuhgkkqQuBokkqYtBIknqYpBIkroYJJKkLgaJJKmLQSJJ6mKQSJK6GCSSpC4GiSSpi0EiSepikEiSuhgkkqQuBokkqYtBIknqYpBIkroYJJKkLgaJJKmLQSJJ6mKQSJK6GCSSpC4GiSSpi0EiSepikEiSuhgkkqQuBokkqYtBIknqYpBIkroYJJKkLgaJJKmLQSJJ6mKQSJK6GCSSpC4GiSSpi0EiSepikEiSuhgkkqQuBokkqYtBIknqYpBIkroYJJKkLgaJJKmLQSJJ6mKQSJK6GCSSpC4GiSSpi0EiSepikEiSuhgkkqQuBokkqYtBIknqYpBIkroYJJKkLgaJJKmLQSJJ6pKqWu0aVlSSC4DzZ7jrTsCFS1zOctqa6rXW5WGty2NrqhWWrt5dq2rn+RZc54JkVkk2V9WG1a5jUltTvda6PKx1eWxNtcLK1OuhLUlSF4NEktTFIJnc+1a7gCltTfVa6/Kw1uWxNdUKK1
Cv50gkSV3cI5EkdTFIJEldDJIJJHlskrOSnJPkdSs47u2SfCHJN5J8Lcl/be07Jjkmydnt581be5L8Vavz9CS7j/T1/Lb+2UmeP9K+R5Iz2n3+Kkk6a94myVeSfKrdvkOSk9q4Rya5fmvfrt0+py1fP9LH61v7WUkeM9K+ZI9Dkh2SHJ3km21+H7xW5zXJK9vjf2aSI5LcYC3Na5IPJvlJkjNH2pZ9LhcaY4ZaD27Pg9OTfDzJDrPO2SyPyzS1jizbP0kl2WktzCtV5WWRC7AN8C3gjsD1gdOAe67Q2LsAu7frNwH+HbgncBDwutb+OuBt7frjgc8AAX4XOKm17wh8u/28ebt+87bsy8CD230+Azyus+ZXAYcDn2q3jwKe2a6/B3hpu/7HwHva9WcCR7br92xzvB1whzb32yz14wAcCuzbrl8f2GEtzitwG+Bc4IYj87nPWppX4GHA7sCZI23LPpcLjTFDrY8G1rXrbxupdeo5m/ZxmbbW1n474LMM/1i905qY154XjevCpU30Z0duvx54/SrV8k/A7wNnAbu0tl2As9r19wLPGln/rLb8WcB7R9rf29p2Ab450n6N9Wao77bAscAjgE+1J+iFI7+kV81l+0V4cLu+rq2X8fmdW28pHwfgpgwvzhlrX3PzyhAk320vBOvavD5mrc0rsJ5rvjgv+1wuNMa0tY4teypw2HxzsaU5m+X5PkutwNHA/YDzuDpIVnVePbS1ZXO/yHO+19pWVNsVvj9wEnDLqvohQPv5O221hWpdrP1787TP6p3Aa4Er2+1bABdX1eXz9H9VTW35T9v6027DLO4IXAD8fYbDcB9IcmPW4LxW1feBtwPfAX7IME+nsDbnddRKzOVCY/T4I4a/zmepdZbn+1SSPAn4flWdNrZoVefVINmy+Y5tr+h7ppNsD/xv4BVV9bPFVp2nrWZon1qSvYGfVNUpE9Sz2LJlr5XhL8Ldgb+rqvsDlzLswi9kNef15sCTGQ6t3Bq4MfC4RfpfzXmdxJqtL8kBwOXAYXNNU9Y0y/N9mvpuBBwAvGm+xVPWtKTzapBs2fcYjknOuS3wg5UaPMm2DCFyWFV9rDX/OMkubfkuwE+2UOti7bedp30Wvwc8Kcl5wEcYDm+9E9ghybp5+r+qprb8ZsBFM2zDLL4HfK+qTmq3j2YIlrU4r48Czq2qC6rqMuBjwENYm/M6aiXmcqExptZOQu8NPKfaMZ0Zar2Q6R+XadyJ4Q+K09rv2W2BU5PcaoZal3ZeZzlue126MPz1+u32AM6dWLvXCo0d4B+Ad461H8w1T4Yd1K4/gWuecPtya9+R4ZzAzdvlXGDHtuzktu7cCbfHL0HdG7n6ZPtHuebJxz9u1/+Ea558PKpdvxfXPMH5bYaTm0v6OAD/AtytXd/U5nTNzSvwIOBrwI1aX4cCL19r88q1z5Es+1wuNMYMtT4W+Dqw89h6U8/ZtI/LtLWOLTuPq8+RrOq8LvuL4W/DheEdEf/O8E6NA1Zw3D0ZdjdPB77aLo9nOLZ6LHB2+zn3xAjwN63OM4ANI339EXBOu7xgpH0DcGa7z7uZ4ATgBHVv5OoguSPDu0POab9k27X2G7Tb57Tldxy5/wGtnrMYebfTUj4OwG7A5ja3/9h+ydbkvAJvBr7Z+vsQwwvbmplX4AiG8zeXMfyl+8KVmMuFxpih1nMYziPM/Y69Z9Y5m+VxmabWseXncXWQrOq8+hEpkqQuniORJHUxSCRJXQwSSVIXg0SS1MUgkSR1MUikMUn+MskrRm5/NskHRm7/RZJXdfS/Kcn+Cyzbr30S7TeTfDnJniPLHprhU4C/muSG7VNrv5bk4CnHX5/k2bPWL40zSKRrO5Hhv8dJcj1gJ4Z/TpvzEOCESTpKss2kg7aPmXkxsGdV3R14CXB4+89lgOcAb6+q3arql23d3avqNZOO0awHDBItGYNEurYTaEHCECBnAj9PcvMk2wH3AL7SvgPi4AzfE3JGkmcAJN
mY4XtkDmf45zCSHNC+v+L/AndbYNw/BV5TVRcCVNWpDP/J/idJ9gX+M/CmJIcl+QTD526dlOQZSZ7e6jgtyfFtzG1afSe376h4cRvnfwIPbXs2r1zKidN107otryJdt1TVD5JcnuT2DIHybwyfjPpghk9tPb2qfpPkaQz/IX8/hr2Wk+dexIEHAveuqnOT7MHwsRj3Z/idO5XhE3zH3Wue9s3A86vqje0w16eq6miAJJdU1W7t+hnAY6rq+7n6i5leCPy0qh7QAvCEJJ9j+NiL/atq776ZkgYGiTS/ub2ShwDvYAiShzAEyYltnT2BI6rqCoYPuvsi8ADgZwyfdXRuW++hwMer6hcAbW9iUmGyT2U9ATgkyVEMH+wIwxc23TfJH7TbNwPuAvxmivGlLfLQljS/ufMk92E4tPUlhj2S0fMji3197qVjtycJg68De4y17d7aF1VVLwHewPBJr19NcotW38vbOZXdquoOVfW5CeqQpmKQSPM7geFjxS+qqiuq6iKGr+N9MMOhLoDjgWe0cxE7M3w16pfn6et44KntnVY3AZ64wJgHAW9rIUCS3Ri+Vvdvt1RskjtV1UlV9SaGjzOf+zrWl7avIiDJXdsXeP2c4aubpSXhoS1pfmcwnPc4fKxt+7mT4cDHGYLlNIY9jtdW1Y+S3H20o6o6NcmRDJ8sez7DR9hfS1V9IsltgBOTFMML/nOrfVvdFhyc5C4MeyHHtppOZ3iH1qlJwvCtkE9p7ZcnOQ04pKr+coL+pQX56b+SpC4e2pIkdTFIJEldDBJJUheDRJLUxSCRJHUxSCRJXQwSSVKX/w+xtyY3ZdpXrQAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "text2.dispersion_plot([\"love\", \"marriage\"])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZAAAAEWCAYAAABIVsEJAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAYLklEQVR4nO3de5RlZX3m8e8jjXJTAbujqEgL3vGC0GpgIN1o4hUxLnXUYAJeomii44UoDEbbteLMAEbFaAYv4yUKCDKacYgOOhpCAgo0ylVAEBpBvEAcgihRwN/8sd+iTx+rqqvequoq5PtZ66za5917v+/vvKfOeersfeqcVBWSJM3WPRa7AEnSXZMBIknqYoBIkroYIJKkLgaIJKmLASJJ6mKA6LdGki8nOXiOfRyS5F/m2MclSdbMpY/5NB/z0jHm2iSf2ZxjavMzQLQokqxP8vvz2WdVPauqPjWffY5KsjJJJbmlXX6c5NQkfzBWx+5VdfpC1TFbCzUvST6Z5FdtLn6a5KtJHtXRz7z/LmjzMECk2du+qrYDngB8FfhCkkMWq5gkyxZrbODoNhcPBn4CfHIRa9FmZoBoyUlyQJLzk9yU5Kwkj2/tu7W/dPds1x+Y5MaJw0VJTk/yqpF+/jTJpUl+luQ7I/sdnuR7I+3P76mzqn5UVccCa4Gjktyj9X/nX9RJnpxkXZKb2yuW97b2iVczr05yfZIfJnnLSO33GKnzX5OcnGTHsX1fmeT7wNeTbJXkM23bm5Kcm+T+4/PS+n17kmuS/CTJ3yW571i/Byf5fpvbI2c4F78ATgAeO9n6JAe2Q3s3tXoe3do/DTwE+N/tlcxbZ3s/aPEYIFpS2pP8x4HXAPcDPgx8Mcm9qup7wNuA45NsA3wC+ORkh4uSvIjhif1PgPsABwL/2lZ/D9gPuC/wLuAzSXaaQ9mfB34HeOQk644Fjq2q+wC7ASePrd8feDjwdODwkUM5bwD+EFgNPBD4f8CHxvZdDTwaeAZwcLs9OzPM26HArZPUc0i77A/sCmwHfHBsm33bbXka8I6JJ/vpJNkOOAj49iTrHgGcCLwRWAF8iSEw7llVfwx8H3huVW1XVUdvaiwtHQaIlpo/BT5cVWdX1R3t2P0vgd8FqKqPAlcAZwM7AVP9hfwqhsMr59bgyqq6pvXxuaq6vqp+XVUntf6ePIear28/d5xk3W3Aw5Isr6pbquqbY+vfVVU/r6qLGALxpa39NcCRVXVdVf2SIQxfOHa4am3b99Y2zv2Ah7V5O6+qbp6knoOA91bVVVV1C3AE8JKxft9VVbdW1QXABQyH6qZyWJKbgCsZwuiQSbZ5MfAPVfXVqroNeA+wNbDPNP3qLsAA0VKzC/CWdqjjpvbktDPDX+ETPspwqORv2pPrZHZmeKXxG5L8ycghsptaX8vnUPOD2s+fTrLulcAjgMvaYaUDxtZfO7J8DRtu5y4M51YmarwUuAO4/xT7fho4DfhsOyR2dJItJ6nngW2c0TGXjfX7o5HlXzAEw1TeU1XbV9UDqurA9ipx2jGr6tet9gdNsq3uQgwQLTXXAu9uT0oTl22q6kS481DJ+4H/AaydOC8wRT+7jTcm2YUhgP4cuF9VbQ9cDGQONT+f4QTy5eMrquqKqnopwyGuo4BTkmw7ssnOI8sPYcOrmWuBZ43Nw1ZV9YPR7kfGua2q3lVVj2H4y/4AhsN3465nCKfRMW8HfjzD29pjoz
GThOF2T9wWPxL8LsoA0WLasp38nbgsY3hyPzTJUzLYNslzkty77XMscF5VvQr4B+C4Kfr+GMPhlb1aPw9r4bEtwxPWDQBJXs4UJ343Jcn9k/w58E7giPaX9fg2L0uyoq27qTXfMbLJXybZJsnuwMuBk1r7ccC7W80kWZHkedPUsn+SxyXZAriZ4ZDWHZNseiLwpiQPbWH8X4CTqur22dz2WToZeE6Sp7VXRW9hOCx5Vlv/Y4bzMbqLMUC0mL7EcKJ34rK2qtYxnAf5IMOJ4ytpx9XbE+gzGU4QA7wZ2DPJQeMdV9XngHczvDPoZ8DfAztW1XeAvwa+wfDE9TjgzFnWfVOSnwMXAc8GXlRVH59i22cClyS5hSH8XlJV/z6y/p/abfwaw+Ggr7T2Y4EvAl9J8jPgm8BTpqnpAcApDOFxaet3sn/k+zjD4a4zgKuBfwdeP/3NnZuquhx4GfA3wI3AcxlOmv+qbfJfgbe3w3WHLWQtml/xC6WkzS/JSoYn8C0X+K9/acH4CkSS1MUAkSR18RCWJKmLr0AkSV0W80PYNpvly5fXypUrF7sMSbpLOe+8826sqhVTrb9bBMjKlStZt27dYpchSXcpSa6Zbr2HsCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0MEElSFwNEktTFAJEkdTFAJEldDBBJUhcDRJLUxQCRJHUxQCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0MEElSFwNEktTFAJEkdTFAJEldDBBJUhcDRJLUxQCRJHUxQCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0MEElSFwNEktTFAJEkdTFAJEldDBBJUhcDRJLUxQCRJHUxQCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0MEElSFwNEktTFAJEkdTFAJEldDBBJUhcDRJLUxQCRJHUxQCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0MEElSFwNEktTFAJEkdTFAJEldDBBJUhcDRJLUxQCRJHUxQCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0MEElSFwNEktTFAJEkddlkgCSsTLh4LoPMRx/T9H16wqqF6HvU2rXDZc2a4TK+bqb7T7ffTPqZzHg9c+1vvo3XsWYNrFy5cfvE3C7EeKNtM72vNqfpxpvvWlaunL9xZrPPfN+O8d+dmY65VB4TExainvl6XpmJVNX0G4SVwKlVPLZ7kHnoY5q+TwcOq2LdVNusWrWq1q2bcvVMx9nI6LQlG1+fbv/p9ptJP1P1Pdl+vf3Nt8lu54SJ9snmZ77GG22b6X21OedtuvHmu5b5/F2ZzT4LeTtmc5uWymNiwkLUM1/PK8O+Oa+qpvwDfaaHsLZI+GjCJQlfSdh69C//hOUJ69vy7gnnJJyfcGHCw1sfyxI+1dpOSdimbf+OhHMTLk74SEJa++kJR7W+vpuwX2vfOuGzrZ+TgK37pkaSNBczDZCHAx+qYnfgJuAF02x7KHBsFXsAq4DrWvsjgY9U8XjgZuB1rf2DVTypvTrZGjhgpK9lVTwZeCPwztb2WuAXrZ93A3tNVkSSVydZl2TdDTfcMMObKUmaqZkGyNVVnN+WzwNWTrPtN4D/nPA2YJcqbm3t11ZxZlv+DLBvW94/4eyEi4CnAruP9PX5Scb8vbY/VVwIXDhZEVX1kapaVVWrVqxYMYObKEmajZkGyC9Hlu8AlgG3j+y/1cTKKk4ADgRuBU5LeOrEqrE+K2Er4G+BF1bxOOCjo32NjDsx5p37zrBuSdICWbbpTaa0nuHw0TnACycaE3YFrqriA2358cBVwEMS9q7iG8BLgX9hQ1jcmLBd6+eUTYx7BnAQ8I8Jj239L7h3tgNop58+9bqZ7D9d20z6mczq1TMfczGM17F6NaxfD4ccsvE2k83tfIw32tZ7Xy2k6cab71p22WX+xpnNPvN9O0b7m6rvmTzmFttC1DNfzyszMet3YSUcBmwHfBY4GbgF+DrwsipWJhwBvAy4DfgR8EfAfYAvMT
z57wNcAfxxFb9I+CvgJQyBdC1wTRVrR99dlbAcWNf63xr4BPAY4HzgYcAbFvpdWJJ0d7Opd2FtMkB+GxggkjR78/U2XkmSNmKASJK6GCCSpC4GiCSpiwEiSepigEiSuhggkqQuBogkqYsBIknqYoBIkroYIJKkLgaIJKmLASJJ6mKASJK6GCCSpC4GiCSpiwEiSepigEiSuhggkqQuBogkqYsBIknqYoBIkroYIJKkLgaIJKmLASJJ6mKASJK6GCCSpC4GiCSpiwEiSepigEiSuhggkqQuBogkqYsBIknqYoBIkroYIJKkLgaIJKmLASJJ6mKASJK6GCCSpC4GiCSpiwEiSepigEiSuhggkqQuBogkqYsBIknqYoBIkroYIJKkLgaIJKmLASJJ6mKASJK6GCCSpC4GiCSpiwEiSepigEiSuhggkqQuBogkqYsBIknqYoBIkroYIJKkLgaIJKmLASJJ6mKASJK6GCCSpC5LNkASvpSwfVt+Q8KlCccvRi1r1sDatcPy+M/x5alMtc3atb/Z10z6m0nfvfv09DeTsUb7XbNm09v0jLGpbWZ7v823+Rxzrn3Ndf/x+3AhxpiJ2Tz+eh5fk42xZg2sXDn7fqbqezY1TbbtZM8ho5eFkqpauN7nScJlwLOquLpn/1WrVtW6devmMj4AVcPy6M+J9Zuaxqm2Ge17suszrW+2d+N0+/T0N5OxppuzybbpGWNT28Ds7rf5Np9jzrWvzbH/5pjj2dTR8/iabIzefqbqezZ9TXZ7xx9b43rrTHJeVa2aav2ivQJJeGvCG9ry+xK+3paflvCZhPUJyxOOA3YFvpjwpoRtEz6ecG7CtxOet1i3QZLuzhbzENYZwH5teRWwXcKWwL7AP09sVMWhwPXA/lW8DzgS+HoVTwL2B45J2Ha88ySvTrIuybobbrhhgW+KJN39LGaAnAfslXBv4JfANxiCZD9GAmQSTwcOTzgfOB3YCnjI+EZV9ZGqWlVVq1asWDHftUvS3d6yxRq4itsS1gMvB84CLmR4RbEbcOk0uwZ4QRWXL3iRkqQpLVqANGcAhwGvAC4C3gucV0VNdiKoOQ14fcLr23ZPrOLbC1nk6tUb3nHyzndu/HN8eSpTbTPePpO+Ztp37z49/c1krNF+V6/e9DY9Y8xmm/m+nTMxn2POta+57j9+Hy7EGDMxm/t+vn6/Vq+G9ev7+tpU3z3bz/b5aL4s6ruwEp4G/B9g+yp+nvBd4Lgq3ttenayq4sax5a2B9wP7MLwaWV/FAdONM9d3YUnS3dGm3oW1qK9AqvgasOXI9UeMLK+cYvlW4DWbp0JJ0lSW7D8SSpKWNgNEktTFAJEkdTFAJEldDBBJUhcDRJLUxQCRJHUxQCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0MEElSFwNEktTFAJEkdTFAJEldDBBJUhcDRJLUxQCRJHUxQCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0MEElSFwNEktTFAJEkdTFAJEldDBBJUhcDRJLUxQCRJHUxQCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0MEElSFwNEktTFAJEkdTFAJEldDBBJUhcDRJLUxQCRJHUxQCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0MEElSFwNEktTFAJEkdTFAJEldDBBJUhcDRJLUxQCRJHUxQCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0MEElSFwNEktTFAJEkdTFAJEldDBBJUhcDRJLUxQCRJHVJVS12DQsuyQ3ANZ27LwdunMdyFpK1LgxrXTh3pXrvjrXuUlUrplp5twiQuUiyrqpWLXYdM2GtC8NaF85dqV5r/U0ewpIkdTFAJEldDJBN+8hiFzAL1rowrHXh3JXqtdYxngORJHXxFYgkqYsBIknqYoBMIckzk1ye5Mokh2/GcXdO8o9JLk1ySZL/1Np3TPLVJFe0nzu09iT5QKvzwiR7jvR1cNv+iiQHj7TvleSits8HkmSONW
+R5NtJTm3XH5rk7DbuSUnu2drv1a5f2davHOnjiNZ+eZJnjLTP2/2QZPskpyS5rM3v3kt1XpO8qd3/Fyc5MclWS2lek3w8yU+SXDzStuBzOdUYHbUe034PLkzyhSTb985Zz/0ym1pH1h2WpJIsXwrzCkBVeRm7AFsA3wN2Be4JXAA8ZjONvROwZ1u+N/Bd4DHA0cDhrf1w4Ki2/Gzgy0CA3wXObu07Ale1nzu05R3aunOAvds+XwaeNcea3wycAJzarp8MvKQtHwe8ti2/DjiuLb8EOKktP6bN8b2Ah7a532K+7wfgU8Cr2vI9ge2X4rwCDwKuBrYemc9DltK8Ar8H7AlcPNK24HM51RgdtT4dWNaWjxqpddZzNtv7Zba1tvadgdMY/iF6+VKY16oyQKa4E/cGThu5fgRwxCLV8r+APwAuB3ZqbTsBl7flDwMvHdn+8rb+pcCHR9o/3Np2Ai4bad9ou476Hgx8DXgqcGr7xbxx5MF551y2B8DebXlZ2y7j8zux3XzeD8B9GJ6UM9a+5OaVIUCubU8Ay9q8PmOpzSuwko2flBd8LqcaY7a1jq17PnD8ZHOxqTnr+X3vqRU4BXgCsJ4NAbLo8+ohrMlNPIAnXNfaNqv2kveJwNnA/avqhwDt5++0zaaqdbr26yZp7/V+4K3Ar9v1+wE3VdXtk/R/Z01t/b+17Wd7G3rsCtwAfCLD4baPJdmWJTivVfUD4D3A94EfMszTeSzNeR21OeZyqjHm4hUMf4331Nrz+z4rSQ4EflBVF4ytWvR5NUAmN9mx6836fuck2wH/E3hjVd083aaTtFVH+6wlOQD4SVWdN4N6plu34LUy/AW4J/Dfq+qJwM8ZXqpPZTHndQfgeQyHUB4IbAs8a5r+F3NeZ2LJ1pfkSOB24PiJplnW1PP7Ppv6tgGOBN4x2epZ1jTv82qATO46hmOOEx4MXL+5Bk+yJUN4HF9Vn2/NP06yU1u/E/CTTdQ6XfuDJ2nv8R+AA5OsBz7LcBjr/cD2SZZN0v+dNbX19wV+2nEbelwHXFdVZ7frpzAEylKc198Hrq6qG6rqNuDzwD4szXkdtTnmcqoxZq2dXD4AOKjasZuOWm9k9vfLbOzG8IfEBe1x9mDgW0ke0FHr/M9rzzHa3/YLw1+rV7U7buKE2e6baewAfwe8f6z9GDY+yXV0W34OG59IO6e178hwzH+Hdrka2LGtO7dtO3Ei7dnzUPcaNpxE/xwbn1R8XVv+MzY+qXhyW96djU9cXsVw0nJe7wfgn4FHtuW1bU6X3LwCTwEuAbZpfX0KeP1Sm1d+8xzIgs/lVGN01PpM4DvAirHtZj1ns71fZlvr2Lr1bDgHsvjzOtcnjt/WC8M7HL7L8M6LIzfjuPsyvKy8EDi/XZ7NcOz0a8AV7efEL0SAD7U6LwJWjfT1CuDKdnn5SPsq4OK2zweZwYm9GdS9hg0BsivDuz2ubA+ue7X2rdr1K9v6XUf2P7LVczkj716az/sB2ANY1+b279uDa0nOK/Au4LLW36cZntCWzLwCJzKcn7mN4S/bV26OuZxqjI5ar2Q4TzDxGDuud8567pfZ1Dq2fj0bAmRR57Wq/CgTSVIfz4FIkroYIJKkLgaIJKmLASJJ6mKASJK6GCBSk+R9Sd44cv20JB8buf7XSd48h/7XJjlsinWvbp8Oe1mSc5LsO7JuvwyfzHt+kq3bJ8lekuSYWY6/Mskf9dYvjTNApA3OYviPb5LcA1jO8I9lE/YBzpxJR0m2mOmg7SNhXgPsW1WPAg4FTmj/bQxwEPCeqtqjqm5t2+5ZVX8x0zGalYABonljgEgbnEkLEIbguBj4WZIdktwLeDTw7fY9DMdk+K6Oi5K8GCDJmgzf5XICwz92keTI9h0S/xd45BTjvg34i6q6EaCqvsXw3+d/luRVwH8E3pHk+CRfZPhsrLOTvDjJi1odFyQ5o425Ravv3PY9Ea9p4/w3YL/2SuZN8zlxuntatulNpLuHqr
o+ye1JHsIQJN9g+LTSvRk+SfXCqvpVkhcw/Ff7ExhepZw78eQNPBl4bFVdnWQvho+weCLDY+1bDJ+qO273SdrXAQdX1V+2w1mnVtUpAEluqao92vJFwDOq6gfZ8KVIrwT+raqe1ILvzCRfYfiIisOq6oC5zZQ0MECkjU28CtkHeC9DgOzDECBntW32BU6sqjsYPoTun4AnATczfB7R1W27/YAvVNUvANqrh5kKM/uk1DOBTyY5meFDF2H4sqTHJ3lhu35f4OHAr2YxvrRJHsKSNjZxHuRxDIewvsnwCmT0/Md0X1X787HrMwmB7wB7jbXt2dqnVVWHAm9n+PTV85Pcr9X3+nbOZI+qemhVfWUGdUizYoBIGzuT4SO+f1pVd1TVTxm++nZvhkNaAGcAL27nGlYwfA3pOZP0dQbw/PbOqXsDz51izKOBo9qTP0n2YPgK27/dVLFJdquqs6vqHQwfLT7x1aevbV8LQJJHtC/P+hnD1yRL88JDWNLGLmI4r3HCWNt2Eye5gS8wBMoFDK8w3lpVP0ryqNGOqupbSU5i+LTXaxg+Tv43VNUXkzwIOCtJMTzRv6zaN8RtwjFJHs7wquNrraYLGd5x9a0kYfgmxj9s7bcnuQD4ZFW9bwb9S1Py03glSV08hCVJ6mKASJK6GCCSpC4GiCSpiwEiSepigEiSuhggkqQu/x/HDt7dTBR1tAAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "text2.dispersion_plot([\"husband\", \"wife\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Types vs. tokens" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "906" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text1.count(\"whale\")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "282" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text1.count(\"Whale\")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "38" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text1.count(\"WHALE\")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "text1_tokens = []\n", "for t in text1:\n", " if t.isalpha():\n", " t = t.lower()\n", " text1_tokens.append(t)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "text1_tokens = [t.lower() for t in text1 if t.isalpha()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Length and unique words" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1226" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text1_tokens.count(\"whale\")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text1_tokens.count(\"Whale\")" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 17, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "text1_tokens.count(\"WHALE\")" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "218361" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(text1_tokens)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "260819" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(text1)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "16948" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Long alternative\n", "\n", "x = set(text1_tokens)\n", "len(x)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "16948" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Short and sweet alternative\n", "len(set(text1_tokens))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lexical density" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.07761459234936642" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(set(text1_tokens)) / len(text1_tokens)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "text1_slice = text1_tokens[0:10000]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.2816" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(set(text1_slice)) / len(text1_slice)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.1786" ] }, "execution_count": 25, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "text2_tokens = []\n", "for t in text2:\n", " if t.isalpha():\n", " t = t.lower()\n", " text2_tokens.append(t)\n", " \n", "text2_slice = text2_tokens[0:10000]\n", "\n", "len(set(text2_slice)) / len(text2_slice)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Cleaning: removing Stopwords" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "from nltk.corpus import stopwords" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "stops = stopwords.words('english')" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', \"you're\", \"you've\", \"you'll\", \"you'd\", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', \"she's\", 'her', 'hers', 'herself', 'it', \"it's\", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', \"that'll\", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', \"don't\", 'should', \"should've\", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', \"aren't\", 'couldn', \"couldn't\", 'didn', \"didn't\", 'doesn', \"doesn't\", 
'hadn', \"hadn't\", 'hasn', \"hasn't\", 'haven', \"haven't\", 'isn', \"isn't\", 'ma', 'mightn', \"mightn't\", 'mustn', \"mustn't\", 'needn', \"needn't\", 'shan', \"shan't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren', \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]\n" ] } ], "source": [ "print(stops)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "text1_stops = []\n", "for t in text1_tokens:\n", "    if t not in stops:\n", "        text1_stops.append(t)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "text1_stops = [t for t in text1_tokens if t not in stops]" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['moby', 'dick', 'herman', 'melville', 'etymology', 'supplied', 'late', 'consumptive', 'usher', 'grammar', 'school', 'pale', 'usher', 'threadbare', 'coat', 'heart', 'body', 'brain', 'see', 'ever', 'dusting', 'old', 'lexicons', 'grammars', 'queer', 'handkerchief', 'mockingly', 'embellished', 'gay', 'flags']\n" ] } ], "source": [ "print(text1_stops[:30])" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "110459" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(text1_stops)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "16802" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(set(text1_stops))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Cleaning: Lemmatizing Words" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "from nltk.stem import WordNetLemmatizer" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "wordnet_lemmatizer = WordNetLemmatizer()" ] }, 
{ "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'child'" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordnet_lemmatizer.lemmatize(\"children\")" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'better'" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordnet_lemmatizer.lemmatize(\"better\")" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'good'" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordnet_lemmatizer.lemmatize(\"better\", pos='a')" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "text1_clean = []\n", "for t in text1_stops:\n", " t_lem = wordnet_lemmatizer.lemmatize(t)\n", " text1_clean.append(t_lem)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "text1_clean = [wordnet_lemmatizer.lemmatize(t) for t in text1_stops]" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "110459\n", "14750\n" ] } ], "source": [ "print(len(text1_clean))\n", "print(len(set(text1_clean)))" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['aback',\n", " 'abaft',\n", " 'abandon',\n", " 'abandoned',\n", " 'abandonedly',\n", " 'abandonment',\n", " 'abased',\n", " 'abasement',\n", " 'abashed',\n", " 'abate',\n", " 'abated',\n", " 'abatement',\n", " 'abating',\n", " 'abbreviate',\n", " 'abbreviation',\n", " 'abeam',\n", " 'abed',\n", " 'abednego',\n", " 'abel',\n", " 'abhorred',\n", " 'abhorrence',\n", " 'abhorrent',\n", " 'abhorring',\n", " 'abide',\n", " 'abided',\n", " 'abiding',\n", " 'ability',\n", " 'abjectly',\n", " 
'abjectus',\n", " 'able']" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted(set(text1_clean))[:30]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Cleaning: Stemming Words" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "from nltk.stem import PorterStemmer\n", "porter_stemmer = PorterStemmer()" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "berri\n", "berri\n", "berry\n", "berry\n" ] } ], "source": [ "print(porter_stemmer.stem('berry'))\n", "print(porter_stemmer.stem('berries'))\n", "print(wordnet_lemmatizer.lemmatize('berry'))\n", "print(wordnet_lemmatizer.lemmatize('berries'))" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "abandon\n", "abandon\n", "abandonli\n", "abandon\n" ] } ], "source": [ "print(porter_stemmer.stem('abandon'))\n", "print(porter_stemmer.stem('abandoned'))\n", "print(porter_stemmer.stem('abandonly'))\n", "print(porter_stemmer.stem('abandonment'))" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "t1_porter = []\n", "for t in text1_clean:\n", "    t_stemmed = porter_stemmer.stem(t)\n", "    t1_porter.append(t_stemmed)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "t1_porter = [porter_stemmer.stem(t) for t in text1_clean]" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10501\n", "['aback', 'abaft', 'abandon', 'abandonedli', 'abas', 'abash', 'abat', 'abbrevi', 'abe', 'abeam', 'abednego', 'abel', 'abhor', 'abhorr', 'abid', 'abil', 'abjectli', 'abjectu', 'abl', 'ablut', 'aboard', 'abod', 'abomin', 'aborigin', 'abort', 'abound', 'aboundingli', 'abraham', 'abreast', 
'abridg']\n" ] } ], "source": [ "print(len(set(t1_porter)))\n", "print(sorted(set(t1_porter))[:30])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data cleaning: results\n" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "my_dist = FreqDist(text1_clean)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "nltk.probability.FreqDist" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(my_dist)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYsAAAEhCAYAAACOZ4wDAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nO3dd3hcZ5n38e+t3mxVFzkusuOS4lQplWSBJGSTwBKWkNAJ2UCWBZZQFgxLeLNAdiEsddmlhBRCWRYSWmzi9DhOc2w5CbHjEnfH3ZJcJdmypPv945yRx7KkOaORNCq/z3XNpZkzc5/zSDOa+zz1mLsjIiLSk4x0F0BERAY/JQsREUlIyUJERBJSshARkYSULEREJKGsdBegP1RUVHhVVVWv45ubm8nPz1e84hWv+BEVv3Tp0jp3H9Plk+4+7G7V1dWeitraWsUrXvGKH3HxQK13872qZigREUlIyUJERBJSshARkYSULEREJCElCxERSUjJQkREElKy6IJrJV4RkWMoWcSZv2w7F93+BHe/fCDdRRERGVSG5Qzu3srLyWTLnmaKs3LSXRQRkUFFNYs4s8aNAmDzviNqihIRiaNkEaeyOI9RuVkcaHHqDrakuzgiIoOGkkUcM2Pm+KB28dpO9VuIiMQoWXQyM2yKWr1DyUJEJEbJopNZ44oA1SxEROIpWXQSa4ZarWQhItJByaKT2Iio13Yc0IgoEZGQkkUn5UW5FOdm0NjSxta9zekujojIoKBk0YXJxcFcRfVbiIgElCy6MGl0kCxW7ziY5pKIiAwOShZdUM1CRORYShZdiCULzbUQEQn0W7Iws7vNbJeZLe/iuX8xMzezivCxmdl/mdlaM3vFzM6Oe+31ZrYmvF3fX+WNF2uGWrv7IK1t7QNxSBGRQa0/axY/B67ovNHMJgFvATbHbb4SmBHebgJ+HL62DLgVOA84F7jVzEr7scwAFGRncEJJPi2t7WxqaOrvw4mIDHr9lizcfSHQ0MVT3wO+AMRPYrga+IUHFgElZlYJ/C3wqLs3uPse4FG6SED9YWY4k3uN+i1ERLD+nHhmZlXAPHefHT5+O3Cpu99sZhuBGnevM7N5wDfd/ZnwdY8Dc4A3AXnuflu4/StAs7t/u4tj3URQK6GysrJ67ty5vS53U1MTv1/bxp9WN/LuU4u47pSipOMLCgpSOr7iFa94xQ90fE1NzVJ3r+nySXfvtxtQBSwP7xcALwDF4eONQEV4/y/ARXFxjwPVwOeBW+K2fwX4XKLjVldXeypqa2v990tf9ylz5vnHf7W0V/GpHl/xile84g
c6Hqj1br5XB3I01InAVOCvYa1iIvCimY0HtgCT4l47EdjWw/Z+17H6rJqhREQGLlm4+zJ3H+vuVe5eRZAIznb3HcADwIfCUVHnA/vcfTvwMHC5mZWGHduXh9v63fSxRWQYbKhr5HBr20AcUkRk0OrPobO/AZ4HZpnZFjO7sYeXPwisB9YCPwM+DuDuDcDXgSXh7Wvhtn6Xl51JVXkhbe3O+t2NA3FIEZFBK6u/duzu703wfFXcfQc+0c3r7gbu7tPCRTRz3CjW1zXy2s4DnFw5Oh1FEBEZFDSDuwcd17bQTG4RGeGULHrQcW0LdXKLyAinZNGDWeOD+RUaESUiI52SRQ+mlBeSk5nB6w3NNB5uTXdxRETSRsmiB9mZGUwbUwjAml26toWIjFxKFgnMjLsmt4jISKVkkcCs8ZrJLSKiZJHATI2IEhFRskgkNnxWcy1EZCRTskhgYmk++dmZ7DpwmD2NLekujohIWihZJJCRYR0XQlJTlIiMVEoWEajfQkRGOiWLCDQiSkRGOiWLCI7OtdDEPBEZmZQsIoivWXg/XrNcRGSwUrKIYOyoXIrzs9nXfIRdBw6nuzgiIgNOySICM9N8CxEZ0ZQsIpo5XsNnRWTkUrKISDULERnJlCwi0lwLERnJlCwiiiWLNbsO0t6uEVEiMrIoWURUWpjD2FG5NLW0sXVvc7qLIyIyoJQsktAx30L9FiIywihZJCHWFKVlP0RkpFGySMIsdXKLyAjVb8nCzO42s11mtjxu23+a2Soze8XM/mhmJXHPfcnM1prZajP727jtV4Tb1prZF/urvFHMVDOUiIxQ/Vmz+DlwRadtjwKz3f104DXgSwBmdgrwHuDUMOZHZpZpZpnA/wBXAqcA7w1fmxYzxgYT89bvbuRIW3u6iiEiMuD6LVm4+0KgodO2R9y9NXy4CJgY3r8a+D93P+zuG4C1wLnhba27r3f3FuD/wtemRWFuFpPK8mlpa2dTfWO6iiEiMuCsP1dRNbMqYJ67z+7iubnAb939V2b238Aid/9V+NxdwPzwpVe4+0fC7R8EznP3T3axv5uAmwAqKyur586d2+tyNzU1UVBQ0OVz33hmD7XbD/O580u4cFJe0vGpHl/xile84vsrvqamZqm713T5pLv32w2oApZ3sf3LwB85mqz+B/hA3PN3AdcA1wJ3xm3/IPDDRMetrq72VNTW1nb73O3zV/qUOfP8O4+s7lV8qsdXvOIVr/j+igdqvZvv1axepZ8UmNn1wNuAS8PCAWwBJsW9bCKwLbzf3fa0iM21eE2d3CIyggzo0FkzuwKYA7zd3ZvinnoAeI+Z5ZrZVGAGsBhYAswws6lmlkPQCf7AQJa5M60RJSIjUb/VLMzsN8CbgAoz2wLcSjD6KRd41Mwg6Kf4mLu/ama/A1YArcAn3L0t3M8ngYeBTOBud3+1v8ocxbQxhWRmGBvrGzl0pI287Mx0FkdEZED0W7Jw9/d2sfmuHl7/78C/d7H9QeDBPixaSnKzMplaUcjaXQdZu+sgs08oTneRRET6nWZw94JmcovISKNk0QtaI0pERholi16YFbvEqkZEicgIoWTRC0dHRB1Mc0lERAaGkkUvTCkvJCcrg617mzlw6Ei6iyMi0u+ULHohM8OYPiZsilLtQkRGACWLXuqYya1ObhEZAZQseqljRJQ6uUVkBFCy6KWOEVGqWYjICKBk0UtaI0pERhIli146oSSfwpxM6g62UHfwcLqLIyLSr5QsesnMOq7JrdqFiAx3ShYp6FgjSp3cIjLMKVmk4OgaUZprISLDm5JFCmJzLdaoGUpEhjklixTErz579AqxIiLDj5JFCiqKcigrzOHAoVZ27D+U7uKIiPQbJYsUmBkzxwWT8zSTW0SGMyWLFOmqeSIyEihZpCg212L1Do2IEpHhS8kiRapZiMhIoGSRohlhsliz6wBt7RoRJSLDk5JFiorzs6kszuPQkXZeb2hKd3FERPqFkkUfiJ9vISIyHP
VbsjCzu81sl5ktj9tWZmaPmtma8GdpuN3M7L/MbK2ZvWJmZ8fFXB++fo2ZXd9f5U1Fx1XzNHxWRIap/qxZ/By4otO2LwKPu/sM4PHwMcCVwIzwdhPwYwiSC3ArcB5wLnBrLMEMJqpZiMhw12/Jwt0XAg2dNl8N3Bvevxd4R9z2X3hgEVBiZpXA3wKPunuDu+8BHuX4BJR2GhElIsOd9eeaRmZWBcxz99nh473uXhL3/B53LzWzecA33f2ZcPvjwBzgTUCeu98Wbv8K0Ozu3+7iWDcR1EqorKysnjt3bq/L3dTUREFBQeTXH2513v/HnWQY/Pqd4zhyqDmp+FSPr3jFK17xfRFfU1Oz1N1runzS3ZO6AaXA6RFfWwUsj3u8t9Pze8KffwEuitv+OFANfB64JW77V4DPJTpudXW1p6K2tjbpmDd+6wmfMmeer9q+v1fxqR5f8YpXvOJTjQdqvZvv1UjNUGa2wMxGh30IfwXuMbPv9iJx7Qyblwh/7gq3bwEmxb1uIrCth+2DjvotRGQ4i9pnUezu+4F3Ave4ezVwWS+O9wAQG9F0PfDnuO0fCkdFnQ/sc/ftwMPA5WZWGnZsXx5uG3Q0IkpEhrOoySIrrAlcB8yLEmBmvwGeB2aZ2RYzuxH4JvAWM1sDvCV8DPAgsB5YC/wM+DiAuzcAXweWhLevhdsGHdUsRGQ4y4r4uq8SnNE/4+5LzGwasKanAHd/bzdPXdrFax34RDf7uRu4O2I506ajZrHzAJw6Os2lERHpW1GTxXZ3Pz32wN3X97LPYtiqKi8kO9PY3NDEodaidBdHRKRPRW2G+mHEbSNWTlYG0yqKcIct+9vSXRwRkT7VY83CzC4ALgTGmNln454aDWT2Z8GGopnjR7F65wE27z+S7qKIiPSpRDWLHKCIIKmMirvtB97Vv0UbemaFl1h9fV9rmksiItK3eqxZuPtTwFNm9nN33zRAZRqyThofdGyvrlfNQkSGl6gd3LlmdgfBjOyOGHe/pD8KNVRdcGI5edkZrK4/wo59hxhfnJfuIomI9ImoHdz3AS8BtxAswRG7SZzC3CzeNHMsAA8t357m0oiI9J2oyaLV3X/s7ovdfWns1q8lG6KuPG08AA8u35HmkoiI9J2oyWKumX3czCrDCxiVhetESSeXnDSW7AxYsrGBXQcOpbs4IiJ9ImqyuJ6g2ek5YGl4q+2vQg1lo/KyOXN8Lu7w8Ks7010cEZE+ESlZuPvULm7T+rtwQ9X5E4OO7fnL1G8hIsNDpNFQZvahrra7+y/6tjjDwzmVuWRnGovW11N/8DDlRbnpLpKISEqiNkOdE3e7GPg34O39VKYhrzAng4umV9Du8MgKNUWJyNAXtRnqn+NuHwXOIpjdLd248rRKAB5UU5SIDANRaxadNQEz+rIgw83lp4wjK8N4bl09expb0l0cEZGURL2s6lwzeyC8/QVYzdGr3EkXSgpyuODEctranUdXqilKRIa2qMt9fDvufiuwyd239EN5hpWrTqvk6TV1zF+2netqJiUOEBEZpKL2WTwFrCJYcbYUULtKBJefMo4Mg2fW1rGvWYsLisjQFbUZ6jpgMXAtwXW4XzAzLVGeQHlRLudPK+dIm/O4mqJEZAiL2sH9ZeAcd7/e3T8EnAt8pf+KNXwcHRWltaJEZOiKmiwy3H1X3OP6JGJHtL89dRxmsHDNbg4cUlOUiAxNUb/wHzKzh83sw2b2YeAvwIP9V6zhY+yoPM6pKqOltZ0nVu1KHCAiMgj1mCzMbLqZvcHdPw/8FDgdOAN4HrhjAMo3LFw1O1i2fL6aokRkiEpUs/g+cADA3f/g7p91988Q1Cq+39+FGy6umB30Wzy5eheNh3V9bhEZehIliyp3f6XzRnevJbjEqkQwvjiP6imlHG5tZ8Hq3ekujohI0hIli54uIp3f24Oa2WfM7FUzW25mvzGzPDObamYvmNkaM/utmeWEr80NH68Nn6/q7XHT6crZsSvoaa0oER
l6EiWLJWb20c4bzexGggsgJc3MTgA+BdS4+2wgE3gPcDvwPXefAewBbgxDbgT2uPt04Hvh64ac2BDaJ1ftormlLc2lERFJTqJk8WngBjNbYGbfCW9PAR8Bbk7huFlAvpllAQXAduAS4P7w+XuBd4T3rw4fEz5/qZlZCsdOixNK8jljUglNLW089ZqaokRkaDF3T/wiszcDs8OHr7r7Eykd1Oxm4N+BZuARgsSzKKw9YGaTgPnuPtvMlgNXxNaiMrN1wHnuXtdpnzcBNwFUVlZWz507t9fla2pqoqCgoM/j/7S6kV++coCLJ+fx6fNKBvz4ile84hXfk5qamqXuXtPlk+4+oDeCtaWeAMYA2cCfgA8Ca+NeMwlYFt5/FZgY99w6oLynY1RXV3sqamtr+yV+U12jT5kzz0/9fw95c0vrgB9f8YpXvOJ7AtR6N9+r6ZiFfRmwwd13u/sR4A/AhUBJ2CwFMBHYFt7fQpA8CJ8vBhoGtsh9Y3J5AadOGM3Bw608s6YucYCIyCCRjmSxGTjfzArCvodLgRXAk0BsccLrOXq9jAfCx4TPPxFmwCHpqthaURoVJSJDyIAnC3d/gaCj+kVgWViGO4A5wGfNbC1QDtwVhtwFlIfbPwt8caDL3JdiQ2gfXbGTltb2NJdGRCSaqBc/6lPufitwa6fN6wlWs+382kMES6MPC9PGFHHS+FGs2nGAZ9fV8eZZY9NdJBGRhLRybBpcGS7/MX+ZmqJEZGhQskiDq04LmqIeWbGTI21qihKRwU/JIg1mjBvF9LFF7G06wqL19ekujohIQkoWaRJbtlxX0BORoUDJIk1ia0U98uoOWtUUJSKDnJJFmpw0fhRTKwqpb2xh8cYhOcdQREYQJYs0MbOOORe6gp6IDHZKFmkUm8390Ks7aGsfspPSRWQEULJIo1MnjGZSWT67Dxxm6aY96S6OiEi3lCzSyMy4Kpyg96Am6InIIKZkkWaxUVEPLd9Bu5qiRGSQUrJIszMmFjOhOI8d+w/x0ut7010cEZEuKVmkmZl11C60VpSIDFZKFoNAbK2o+ct3MIQv1SEiw5iSxSBw1qRSxo3OZeveZl7Zsi/dxREROY6SxSCQkWEdy5brCnoiMhgpWQwS8bO51RQlIoONksUgUVNVRkVRLpsbmtiwtzXdxREROYaSxSCRmWFcMXscAH9Y1UhDY0uaSyQicpSSxSByzdkTyTB4fssh3vDNJ/ja3BVs39ec7mKJiChZDCZnTS7lvo9dwFnjc2g+0sbdz27gb771JF+4/6+s330w3cUTkRFMyWKQqZ5Sxi0Xl/GXT13E206vpK3d+V3tFi797lN8/NdLWb5VQ2tFZOBlpbsA0rVTJxTz3+87mw11jdyxcB2/X7qVB5ft4MFlO7h4RgWfePN0zptahpmlu6giMgKoZjHITa0o5BvvPJ2FX3gzH714KgU5mTy9po733LGIa378HI+t2KkFCEWk3ylZDBHji/P48ltP4dk5l/Dpy2ZQUpDNi5v38pFf1HLlD57mTy9t1bW8RaTfpCVZmFmJmd1vZqvMbKWZXWBmZWb2qJmtCX+Whq81M/svM1trZq+Y2dnpKPNgUVqYw6cvm8mzcy7hlreezPjReazeeYBP//Zl3vydBfxy0SYOt6mmISJ9K101ix8AD7n7ScAZwErgi8Dj7j4DeDx8DHAlMCO83QT8eOCLO/gU5mbxkYun8dQX3sTt15zG1IpCXm9o5it/Ws5N83Zx27wVbKhrTHcxRWSYGPBkYWajgb8B7gJw9xZ33wtcDdwbvuxe4B3h/auBX3hgEVBiZpUDXOxBKzcrk3efM5nHPvtG/ud9Z3PGxGIOtjh3PrOBN397AR+48wUeWr5dTVQikhIb6HWIzOxM4A5gBUGtYilwM7DV3UviXrfH3UvNbB7wTXd/Jtz+ODDH3Ws77fcmgpoHlZWV1XPnzu11GZuamigoKBiy8cu3HeCpre0883ozLW3BtrK8DC6bls9l0w
ooz8/s1+MrXvGKH5rxNTU1S929pssn3X1Ab0AN0AqcFz7+AfB1YG+n1+0Jf/4FuChu++NAdU/HqK6u9lTU1tYOi/i9jS1+19Pr/ZJvP+lT5szzKXPm+bQv/cU/eu8Sf2r1Lm9ra+/X4yte8YofWvFArXfzvZqOeRZbgC3u/kL4+H6C/omdZlbp7tvDZqZdca+fFBc/Edg2YKUdwooLsvmHi6ZywxuqeH59Pb9etJmHX93BIyt28siKnVSVF/C+8yZzbfUkSgtz0l1cERnEBrzPwt13AK+b2axw06UETVIPANeH264H/hzefwD4UDgq6nxgn7vrog9JMDMuPLGC/3n/2Tz3pUv4l8tnMqE4j431TfzHg6s47xuP89nfvszSTXu0PLqIdCldM7j/Gfi1meUA64EbCBLX78zsRmAzcG342geBq4C1QFP4WumlsaPy+OQlM/inN03nyVW7+NULm3jqtd384aWt/OGlrZxcOZqzy9uhYg+nTywmO1NTcUQkTcnC3V8m6Lvo7NIuXuvAJ/q9UCNMZoZx2SnjuOyUcWyub+J/F2/md7Wvs3L7flZuh18vf47CnExqqso4f1o5F5xYzuwJo8lS8hAZkbQ2lDC5vIAvXnkSn3nLDB5fuYs/Pb+StQcyWL+7kade281Tr+0GoCg3i3OqSjuSx6kTisnM0NpUIiOBkoV0yM3K5KrTKhnXso3q6mp27j/EovX1LFpfz/Pr6tlY38STq3fz5OogeYzKzeLcqWVccGI5508r5+TK0UoeIsOUkoV0a9zoPK4+8wSuPvMEALbvaw6Sx7oGnl9fz+aGJh5ftYvHVwUD10bnZXHu1HIqMhrZkb2d6WOLqKooIDer53kdIjL4KVlIZJXF+fz9WRP5+7MmArB1bzOL1tXzfFjz2Lq3mcdW7gTg/159EYAMg8llBZw4pogTxxYxfUwRJ44t5MQxRZQUaLiuyFChZCG9dkJJPtdUT+Sa6iB5vN7QxOINDSx8ZS2NGYWs293IpvpGNtY3sbG+qaMGElNRlMO0MUVMH1sUJJMxhUwfW0S7hu+KDDpKFtJnJpUVMKmsgCp2Ul1dDcDh1jY21TexdtdB1u06yLrdB1m7+yDrdjVSd7CFuoMNLN7QcMx+cjJh+rNPdySR6WOD2khVeSF52WrSEkkHJQvpV7lZmcwcN4qZ40Yds7293dm+/9DRBNLxs5G6g4dZsX0/K7bvPyYmw4KE1JFAxhR23FeTlkj/UrKQtMjIME4oyeeEknz+ZuaYY5576vkljDphekcCWberkXW7D7KpvpFN9U1sqm/iiU5NWuWFOZwY1kTKvZmpJ7VQpiVMRPqMkoUMOkU5GZw9uZSzJ5cesz3WpHVsbSRIJPWNLdRvONqk9aPaR6meUsplJwcTD08cU5SOX0Vk2FCykCGjuyYtd2f7vkOs232Q1TsOMK92Ha/WHWHJxj0s2biHb8xfxbSKwmDG+snjqJ5SqvkgIklSspAhz8yYUJLPhJJ8Lp4xhrMK9jDz1NNZ+Fodj63cyROrdrG+rpE7Fq7njoXrKS3I5s0njeXyU8Zx8YwxFObq30AkEf2XyLA0Ki+bt55eyVtPr6S1rZ3aTXt4fOVOHl2xk431Tfzhxa384cWt5GRmcOH0ci47eRyXnjyWyuL8dBddZFBSspBhLyszg/OnBUuS/OtVJ7NudyOPrdzJYyt2snTzHhas3s2C1bu55U8w+4TRnJB3hDMOrGVCcX5YY8lj3Og8rcArI5qShYwoZsb0scFw24+98UTqDh7myVW7eGzlTha+VsfyrftZDjy8bnWnOBg3Ko/KkrwggRTndTR9BUklj7LCHMzUFyLDk5KFjGgVRblcWzOJa2smcehIGy9saOCpF1eSOaqCbfsOsW1vM9v2NrPrwGF27D/Ejv2HeGnz3i73lZuVwYSSfMbmtnJZ43qqq0qZPaGYnCzVSGToU7IQCeVlZ/LGmWMoOrCZ6upTjnnuSFs7O/YdYnssgexrDh
PJ0YSy/1ArG+oa2QC8sHUlECSQMyaVUDOllJqqUqonl1FckJ2G304kNUoWIhFkZ2Z0LGfSnYOHW9m6p5kHnn2ZeopZsrGBdbsbWbzh2CVNZo4ronpKWUcCmVxWoOYrGfSULET6SFFuFrPGj+KSqgKqq08HYE9jC0s37aF20x6Wbmrgr1v28drOg7y28yC/WbwZCJrCYomjpqqMA4fbOXSkjdysDCURGTSULET6UWlhTsflayGYhb586z5qN8YSyB7qDh7moVd38NCrO44GPvAQGQb52Znk52RRkJMZ3s+kILzl52SRn51BQU5WsD18/mBdMznj9lFVUcCoPDV5Sd9QshAZQLlZmVRPKaN6Shn/SDD7fH1dI0s37qF2UwMvbt7Ljj2NtLQbLW3tNLa00djSlvRxvv/CM0BQa5lWUcjUikKqwp/TxhQyuaxAK/hKUpQsRNLIzMJreRRx3TmTAFi6dCnV1dW0trXTdKSNQy1tNIW35iOtNLe009TSSvORo9sPHWmjqaWVxsNtrNy0nb2tOWyoD1bwrTt4mMUbGzodFyYU5zNtTLD0+9SKQqaOKWRqeSGNLe0cOHSEzAwjw4JbcB81i41gShYig1RWZgajMzMYnWRT0tKlh6iurqa93dm2r5mNdU1sqDvIho6fjby+p5mte4Pb02vqjt/Jnx/pct9mkGlGRoYFPy1YQTg+sUwqcr5c3kBNVVlvfm0ZpJQsRIapjAxjYmkBE0sLuGhGxTHPHWlr5/WGpmCob9xtY10je5sOY5ZBmzvtHlx7pM0dd3CHVndo7/5qhnUH4V0/eZ4LTyznU5fO4Pxp5f39q8oAULIQGYGyMzOYNqaIaV0s3R5rBuvMw+TR1u60e3AL7h9NKIeOtPGDuYt5aP1hnltXz3Pr6jlvahk3XzaDC6aVqxlrCEvb1FIzyzSzl8xsXvh4qpm9YGZrzOy3ZpYTbs8NH68Nn69KV5lFRjIL+y5ysjLIy86kICeLUXnZFOdnU1qYQ0VRLhNLC3jPqaN4Zs4lfOaymYzOy+KFDQ2872cvcN1Pn+fpNbtxXWN9SErnOgQ3AyvjHt8OfM/dZwB7gBvD7TcCe9x9OvC98HUiMogV52dz82UzePaLl/Avl8+kpCCbJRv38MG7FnPNj59jwepdShpDTFqShZlNBN4K3Bk+NuAS4P7wJfcC7wjvXx0+Jnz+UlNdVmRIGJWXzScvmcEzcy7hC1fMorQgmxc37+XD9yzhHT96jidW7VTSGCIsHW+Umd0PfAMYBfwL8GFgUVh7wMwmAfPdfbaZLQeucPct4XPrgPPcva7TPm8CbgKorKysnjt3bq/L19TUREFB98s6KF7xiu9dfHNrOw+va+bPqxvZf7gdgBNLs7j2lCJqKnMxs0Fd/uEeX1NTs9Tda7p80t0H9Aa8DfhReP9NwDxgDLA27jWTgGXh/VeBiXHPrQPKezpGdXW1p6K2tlbxild8P8Y3HW71ny1c5zW3PepT5szzKXPm+ZXfX+jzl233xUuW9PvxFd81oNa7+V5Nx2ioNwBvN7OrgDxgNPB9oMTMsty9FZgIbAtfv4UgeWwxsyygGGg4frciMlTk52TykYun8YHzp/CbxZv58YJ1rNi+n4/9ainjCzN5x+5VXHXaeE47oVgjqAaJAe+zcPcvuftEd68C3gM84e7vB54E3hW+7Hrgz+H9B8LHhM8/EWZAERni8rIzueENU1n4hTfztatPpbI4jx2NbfzkqXW8/b+f5aLbn+S2eStYuqmB9h7mdkj/G0zzLOYA/2dmt0xcqDoAABkSSURBVAEvAXeF2+8CfmlmawlqFO9JU/lEpJ/kZWfyoQuqeP95U/jVw8+zvmU085fvYOveZu58ZgN3PrOBcaNzueLU8Vx5WiXnVJWRmaEax0BKa7Jw9wXAgvD+euDcLl5zCLh2QAsmImmRmWHMHpvL9dWzufXvTuWl1/fw4LIdPBQmjnuf38S9z2+ioiiHt5wynqtOG8/508
p1ffQBMJhqFiIiHTIyrGOF3lveejKvbNnH/OU7mL98O5vqm/jN4s38ZvFmSgqyecvJ47jytPG8YXpF4h1LryhZiMigZ2acMamEMyaVMOeKWazcfoD5y7czf/kO1u46yH1Lt3Df0i2Mys1iWkkGU9e8RHlRLmWFOVQU5VBWmEt5UQ7lhTmUF+VSmJOpjvMkKVmIyJBiZpwyYTSnTBjN5y6fxZqdB8Iaxw5Wbt/PX3fCX3du63EfOVkZVBTmUFaUQ3lhbphEgqRSv7ORlUc2kRmurGsWNI9lZliw5IkZmRnE3Y97jRlb9hxhysHDlBfmDKuEpGQhIkPajHGjmDFuFJ+6dAab65t4ZNHLlI6fTENjC3WNh2k42EJ9Ywv1Bw+HP1toPtLGtn2H2LbvUNc7fWl5aoV67DFyszKYUJJPZXEeE0rymRD7WZLPhJI8KovzKcwdOl/BQ6ekIiIJTC4v4KzxuVRXT+zxdU0trdQfbKGhsYX6xsPUxyWUzdt2UFY+hvZwdd0293BVXYLVdtvjVtvtWHn36Cq82+v3s7fF2Nd8pGPp9+4U52cfk0gqS/I4sqeZrLF7mVJeQElBTl//iXpNyUJERpyCnCwKyrKYVHb8shhLlzZTXX1ar/cdW+K98XAr2/c1s3XvIbbtbWb73uD+9n3NbNvbzLZ9h9jXfIR9zUdYuX3/Mfv43gvPAjA6L4sp5YVMLi9gSlkBU8oLmFxWyJTyAsaPziNjAIcPK1mIiPSDwtwspo8dxfSxo7p83t2pb2wJEkeYULbtbWbZhm3sa8thc0MT+w+1smzrPpZt3XdcfE5WBpNK84NkEiaSKeUFHNjfypnt3ufzUJQsRETSwMyoKMqloiiX0+NazYKaTTXuTt3BFjY3NLKpvolN9U1sbghum+qbqDt4mHW7G1m3+/hmrovPPUJZYd82YSlZiIgMQmbGmFG5jBmVS/WU469n3ni4tSNxxBLK5oYmNu/aS2lBctdtj0LJQkRkCCrMzeLkytGcXDn6mO1Lly7tlyG7miMvIiIJKVmIiEhCShYiIpKQkoWIiCSkZCEiIgkpWYiISEJKFiIikpCShYiIJGTuw+8i6Ga2G9iUwi4qgDrFK17xih9h8VPcfUyXz7i7bp1uQK3iFa94xY/E+O5uaoYSEZGElCxERCQhJYuu3aF4xSte8SM0vkvDsoNbRET6lmoWIiKSkJKFiIgkpGQhIiIJKVnEMbPCoVoGM8sxs9PN7DQz69uL744AZpYbZZvISKXLqgJmdiFwJ1AETDazM4B/dPePR4yfCfwYGOfus83sdODt7n7bQJTBzN4K/ARYBxgw1cz+0d3nRz1+uJ9MYBxxnwt33xwhLg+4ETgVyIuL/Yckj38CMKXT8RdGjB0PnAs4sMTddyRzbOB54OwI27o7/kXADHe/x8zGAEXuviFBzDt7et7d/xDx2L36/JnZMoK/V3fHPz3CsY+/OPSx+2iIsI9c4BqgimPf+68lio3bRyqfndvdfU6ibd3E9tV7OJfj34t9QC3wU3c/FGU//UmjoQAzewF4F/CAu58Vblvu7rMjxj8FfJ7gTU06PtUymNkq4G3uvjZ8fCLwF3c/KYnj/zNwK7ATaA83e8QvjPuAVcD7gK8B7wdWuvvNSRz/duDdwAqgLe74b48Q+xHg/wFPECTLNwJfc/e7I8SOB04AfhWWP3bx4tHAT6L8Dc3sVqAGmOXuM81sAnCfu78hQdw94d2xwIVh+QHeDCxw9x6/iOL206vPn5lNCe9+Ivz5y/Dn+4GmKF/WZraB4EvOgMnAnvB+CbDZ3adG2MdDBF+MSzn63uPu30kUG8b3+rMTxr/o7md32vZKxM9+X72HPwDGAL8JN70b2AHkA6Pd/YM9xB6g+0TzOXdfH6UMCfXHtPChdgNeCH++FLftr0nEL+ki/uWBKgOwsNNj67wtwj7WAuW9/Pu9FP58JfyZDTyR5D5WA7m9PP7q+LID5cDqiLHXA08CB8KfsdsDwD
sj7uPl8G8e/969kkT55wGVcY8rgT8M1OcPeDbKtgT7+AlwVdzjK4HvRIxd3pv3PdXPDvBPwDKgEXgl7rYB+FWS+0r1PTzu/zW2DXg1QexXgX8ERhGc5NxEcPL0boKE1eu/bfxNzVCB18NmIA/b+z8FrEwivi48mw++qc3eBWwfwDK8amYPAr8Ly3AtsCRWRfZoVeHXCc5GeuNI+HOvmc0mOCOqSnIf6wmSzOFeHH8LwZd9zAGC3ychd78XuNfMrnH33/fi2AAt7u5mFnv/k+13qnL3+M/LTmBmEvGpfv4Kzewid38mjL8QSPZ3OMfdPxZ74O7zzezrEWOfM7PT3H1ZkseM6e1n53+B+cA3gC/GbT/gEZrPOkn1PRxjZpM9bPY1s8kECwICtCSIvcLdz4t7fIeZLXL3r5nZvyZRhh4pWQQ+BvyAoDliC/AIR6vmUXyCYNbkSWa2leDM5AMDWIY8gg/nG8PHu4Ey4O8IvkCiJIv1wAIz+wtx/3Tu/t0IsXeYWSnwFYIz8iKCM5tkNAEvm9njnY7/qe4CzOyz4d2twAtm9meC3/dqYHEyB3f334d9P537XaK0m//OzH4KlJjZR4F/AH6WxOEXmNnDBE0QDryHoHYTVaqfvxuBu82sOHy8l+B3SEadmd1C0Jzn4fHrewqI6zPJAm4ws/UE770RsQk0lPRnJ3x+H8EJ0ns79TlVmNlUT9Dn1Emq7+HngGfMrKPfEfh4eOJxb4LYdjO7Drg/fPyuuOf6rJ9BfRZ9KHxjM9z9QMIXHx9b1vlsphcf2F4L292P4+5fHaDjX9/N8bv9R+muzHGxkctuZj8BCgjamu8k+Idb7O43Roj9Z4La1LkE/+gPu/ujUY8d7uOdwMXhw4Xu/sdk4sN99PrzF8aPJvhOSLqGGXZ03wr8TbhpIfDVns7Q4/pMuuTukS4z0JvPTqf4XvU5dbGflN7DsKP/JILP0CqP2KltZtMITjQvIEgOi4DPEJxEVcdqjKka0cnCzH5Iz6NBejwzidtPX4zmeBa40t33h49PJvjAdttJaWZfcPdvdfd7RC1/p32OCkL9YBIx44D/ACa4+5Vmdgpwgbvflezx0yXWoRn3s4igzfnyCLG3EZxJvgjcTZAsBuwfy8xKgA9x/Oevx/c/rmbWpYi1ypR1M6LqgLsf6WJ7fxz/ZeAs4EU/OkAgUgd3H5fjQo5/D38xkGXoyUhvhqrto/38maOjOXrT5g7Bl+3csClkFvALglEpPYn1aaT8e4R9Db8kaL7CzOqAD7n7qxHCfw7cA3w5fPwa8FsgcrIwsxkEbcencGwz0LQIsWOAL3B8E9IlUY8PNIc/m8Izy3qCpoCE3P0WM/sKcDlwA/DfZvY74C53X9dDuWOjWIxjk32sGWZ0xLI/SHA2uYyjI9miGJXEa3uU4nvwIjCJY0dSbTezXcBH3X1pN8f8nbtfZ90MAU7iyz7VPqdYreJ2glFRRpLvoZn9EjiRYLBEx4gugu+BRLFjgI9yfKJJtimxRyM6WUStpkYw0d2vSLEsfzGzbIK+ilHAO9x9TYKYueHPvvg97gA+6+5PApjZmwja3S+MEFvh7r8zsy+F5Wk1s7ZEQZ3cQ9CM8T2CpqAbODqMNZFfEySntxH0/VxP0G+TjHnhGfp/Enx5OUn0O4RfNjsImqNagVLgfjN71N2/0E1Mx5e1mZ3JsU0Yf02i7Hnu3mMtoZvj92UTYyrvwUPAH939YQAzuxy4gmDAxo+A87qJiw3NflsvyxyTap8TwLeAv3P3ZAbGxKsBTulljfTPwNPAY8QNPe5rI7oZKibMzHM4/qw20pmpmd0B/LA3ozm6aEK6hKCzeWNYhoRNSRZMyvoXjj+ziHxmbWZ/dfczEm3rJnYBQTPco+5+tpmdD9zu7m/sOfKYfSx192ozW+bup4Xbnnb3i5OI7Wg6MLOnkjl+p/3lEnwBR2q7N7NPEXw51hH0d/zJ3Y+YWQawxt1PjBD/UYKBCAa8A/iZu/
8w4vE/AxwkGL4Z38EbaUSPBXMFujozj3xmmsp7YGa17l7T1TYze9ndz4xajt7ooz6nZ5Pt4+gUfx/wqU4jqqLG9vvfCEZ4zSJO7KzorfTuzPQiej+ao3MTUpdV7gTuIxjnfie9P7NYHzalxCZmfYBgVE0UnyUYBXVi2PcyhmNHZERxKPblamafJOicGxsxNta2vT1sxtsGTEzm4GGt7p842kG7wMx+GrHdvIJgTsYxHbLu3m5mUc56PwKc7+6NYVluJ5g9HilZEAyt/E+CZsDYl74DCZvwQvPi7ucBf0/wN0xGKu9Bg5nNAf4vfPxuYI8FKwokbFYLT05+CJwM5ACZQGMSzXjjCGopsT6nxyLGxas1s98Cf+LYhB1pBjfBZ2iFmS3uFB9lYuE8M7vK3R9MpsDJUs2C1M9Mw1EdpcQ1IwB7o47mSFWs/Cnuo5Rgcs9FhJP6gH9z9z0RYq8FHiZod76GoNngK+7+YhLHP4egD6YE+DrB5KJvufsLEWLfRlANn0TwpTGaYCTOA0kc/06CsfqxJr0PAm3u/pGo++itsM39nNjoFwuWT1kSq2FFiF8HnOfudX1UngzgsSRrpr1+D8ysgqAJMvbZe4bgs7gPmOzhygQ9xNcSDDC4j6A550PAdHf/ck9xnfZhHO1zqiFoAuuxz6lT/D1dbPaotTMz6/K7xt2fihB7gGBezGGCpJ1sn1ckqlkEUj0zfQfB2WGsGeGXBG2eCc8MU+mkixtFMtfMPg78kV40Q4Sv3QN8yoKx9u2e3PDLr7j7fWHCuQz4DsFaRd21NXdZBIK/2xSCL20I/oYJa2fuHjsz3kfQ39Eb53RqcnvCzJLpN0jFPQTzRGJDLd9BEoMDgFcJ5hr0lRkES3dElsp7ECa5f+7m6R4TRdw+1ppZpru3AfeY2XNJliHpPqdO8Tckc7wu4p8KRxWeE25a7O67Isb22UCFnqhmQepnpmb2CsFQ0VgzQiHwfJRmKDOrdPft1s2Y855qJ3bsujwdIXGxUZshYmf2d3N0hMw+4B+6G4nSKfYldz/LzL4BLHP3/41tS+L4qwnWNzpmRE+U2llfjAYxsxeBa2NnkhaMXb/fO60Z1F/M7GzianXu/lISsX8kGIX0JElMSouL7zwqawfwJU9iRrulsJimmT1J1ydKUfsMFxKcpNwZln078OEo/W1hfEp9TuE+UlpM04JJdf8JLCB4Hy4GPu/u9/cQc5K7rwo/O8dJpmYfqYxKFqlLtRmhD45/HfCQu+8P+x3OBr6eZDPQK8An3P3p8PFFwI8iJrx5BH0MlwHVBMNQF0f9Zw338Yy7XxT19Z1inyNI9p0Xokvmy+5SgjP82KJrVcANHo4OG8wsxUlpfVSGXi+maWbxTah5BE2ZrVHO6MP4KQQrGOQQTEYrBv4niSakrxE0OR13YmJmJ3uEEU6W4mKaYS32LbHaRHgC9FhP/0Nmdoe73xQm2848mWbESGVUskj9zNSCyU3XEzQDQdCM8HN3/36E2K5WjIQk2h3t6ESyiwjma3wH+Fc/dr2YRPs4bjRH1BEeZlZAMNRxmbuvMbNK4DR3fySJ418KvBfovGRDwg7CvhgNEib4zwGXhpseBb7ng2Bp6IFgZm8nrnM/rlkpavwSdz8nvkaZyvuSZJ/hze7+g0Tb+lNc7Tr2v5hNMKoqau2oYxRg+DiDYCHRATnhjEJ9FoGUxim7+3ctGD4aa0a4IWozQh+1N8bK/FaCZbX/bGb/luQ+Flsw1jy2ts27CUYEnR2Ws9tairs3Ebf+lAfD/5IdAngDwVIH2cQtkU60da36YjTIL4D9BJ3rECSuXxIsyjgoJejv8iSaYb5J0Fb+63DTzWb2Bnf/UhLF6fVihnbsDO4Mgtrp+CSOfT3BchfxPtzFtv6U6mKaD9nRtaUg+P+L/Hm2AZj9rZoFAzdOub/0UTNQrCob+0DE2q9jNZw+rdJ2cfxlyZ5FdWprT2k0iKUwzyRd4vq7fkfQBNTxFMFIsu
si7ucV4Ex3bw8fZxIsdx55uYuwj+cOgkmcewiGXb8/Yp9TfN9baxj7NU+wppGZvZeg2ecigpO9mNEEzViXRS1/qiy4psrvgdMIVjQoIhj48dMk9nEN8AaO9ltFWlvKupn9HbXPKirVLAIDMk65H11H0Az0bXffGzYDfT5BTGcLOj12SG59qxQtMrNT3H1F1AA/dgZ0GcEonrzuI3r0kpmd7+6Lwv2dBzzby30NCD86gWt65y9lM4t84atQCRAbPVfc0wu7sZWgz+dJgiVj9hOc8Sf8/HiECyR14zmC2ksFQdNrzAGC61IMpF9ydH24WF/RuGR2EPax9WaZ/FRmf0c2opNFp/6CfzWzwwRnNtAP45T7Sx81A8UvHJhHsIRCb5cu6I2LgOvDs8ykJjaGZ3U3Ewx3fhk4n+CL5NKe4sLYWBNONvAhM9scPp5CcOW1QcvM/gn4ODAtrB3EjCK5RPcfwIthU6oR9F0k0wQFQVPuXoKJbUlN6LMuJkQSdJT3OCEyTJCbgAvs2Mvqrnb31p5i+0Gv1ofriz5LYDlBs13Ss7+ToWYoOqpxTwNPRxn5MBJYsOTFA+7+twN0vKSHDsfFLiNoc1/k7meGZ9Vfdfd39/a4yRw/XSyYE1NKihfvCT//awiajzYTXLUxqWuYRx351E1sShMizexGgkl9SV9Wt6+k8vuncMzYdbtHAWcSXMMl2dnfkY3omkWcewjObP8rbHt9iSBxDGQH2WBTQPTlIlKW4pfyIXc/ZGaYWW449nzWABw3rTzu4j0p7ir2+X87wXv+spktTPLzn8rV7lKdEPkF4Cx3rwcws3KCmuWAJQtSv9pfb3ybIDneTjACMya2rU8pWQDu/kQ4TvwcgtmnHwNmM7CjKdKq04iaTIL1nQaqvyJVWyxYMfZPwKNmtofk1zYasbr5/J9KhM+/9c3V7trM7EQ/dkJkMqMSe31Z3VT10e/fKx4uBWJm2d5pWRAzy+/r46kZCrDgcoyFBIu3PQ084xGn2g8XnZpjWoGdaWj3TZkFa+wUE0xSTHTtYiG1z39fNOPZsRMijaC/KPKESDP7BcEopM6X1X0tLEO/XcQpnc2Y8X1WQPwExFHAs+6e7KWdez6ekgWY2fcIhpweJugYXEiwXEdzj4Eiw8Bg+PyHfWSzoOOSosl0EvfZ5XWHkr7qs4p8PCWLoyy4lOYNBNeGGO/uuWkuksiASefnfyAmlUlq1GcBWHD9hIsJzq42EXSMPd1jkMgwke7Pf3eTyohwSdEwvi8uqysJKFkE8oHvAkuHYju9SIrS/flPdVJZX1xWVxJQM5SIpJWlcEnRML5PL6srXVPNQkTSotOkst5eUhT64LK6kpiShYikS19NKrstHBn0OY5evOzTfVVICShZiEha9OGksmsJ5oYsB94cLir5bWBunxVWlCxEJD36cCHE0919b+yBuzeYWeRL+ko0ShYiki7/C8wn9UllGWZW6u57oGO5en239TH9QUUkLfpwIcTvECzkdz9Bh/l1wL+nuE/pRENnRWTIM7NTgEsIOscfT+YiWhKNkoWIiCSUke4CiIjI4KdkISIiCSlZiCRgZl82s1fN7BUze9nMzuvHYy0ws5r+2r9Ib2k0lEgPzOwCggXqznb3w2ZWAeSkuVgiA041C5GeVQJ1sYvxuHudu28zs/9nZkvMbLmZ3WFmBh01g++Z2UIzW2lm55jZH8xsjZndFr6mysxWmdm9YW3lfjMr6HxgM7vczJ43sxfN7L7wehOY2TfNbEUY++0B/FvICKZkIdKzR4BJZvaamf0ovGwrwH+7+znuPptgie+3xcW0uPvfAD8huNTnJwiu6f5hMysPXzMLuCNcJXU/wUzmDmEN5hbgMnc/G6gFPhtOOPt74NQw9rZ++J1FjqNkIdIDdz9IcFGgmwiukfBbM/swwRpEL5jZMoLx/afGhT0Q/lwGvOru28OayXpgUvjc6+4eW9LiV8BFnQ59PnAK8KyZvUxwjYYpBInlEH
Cnmb0TaOqzX1akB+qzEEnA3duABcCCMDn8I3A6UOPur5vZvxF3hTaOLrPdHnc/9jj2P9d5glPnxwY86u7HzW42s3OBS4H3AJ8kSFYi/Uo1C5EemNksM5sRt+lMYHV4vy7sR3hXL3Y9Oew8h2C5i2c6Pb8IeIOZTQ/LUWBmM8PjFbv7gwTLcJ/Zi2OLJE01C5GeFQE/NLMSoBVYS9AktZegmWkjsKQX+10JXG9mPwXWAD+Of9Ldd4fNXb8xs9xw8y3AAeDPZpZHUPv4TC+OLZI0LfchMsDMrAqYF3aOiwwJaoYSEZGEVLMQEZGEVLMQEZGElCxERCQhJQsREUlIyUJERBJSshARkYT+P0nEiGlmTSbWAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_dist.plot(20)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('whale', 1494),\n", " ('one', 940),\n", " ('like', 650),\n", " ('ship', 605),\n", " ('upon', 566),\n", " ('sea', 542),\n", " ('man', 527),\n", " ('ahab', 512),\n", " ('boat', 483),\n", " ('ye', 472),\n", " ('old', 450),\n", " ('time', 446),\n", " ('would', 432),\n", " ('head', 431),\n", " ('though', 384),\n", " ('captain', 353),\n", " ('yet', 345),\n", " ('hand', 344),\n", " ('long', 333),\n", " ('thing', 320)]" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_dist.most_common(20)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "b_words = ['god', 'apostle', 'angel']" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "my_list = []\n", "for word in b_words:\n", " if word in text1_clean:\n", " my_list.append(word)\n", " else:\n", " pass\n", " " ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['god', 'angel']\n" ] } ], "source": [ "print(my_list)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "my_list2 = [word for word in b_words if word in text1_clean]" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_list == my_list2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Make Your Own Corpus" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "from 
urllib.request import urlopen" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "my_url = \"http://www.gutenberg.org/files/996/996-0.txt\"" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "# Download the raw bytes and close the connection when done.\n", "with urlopen(my_url) as response:\n", "    raw = response.read()" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "# Decode the bytes to str; UTF-8 is also the default for bytes.decode().\n", "don = raw.decode('utf-8')" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "str" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(don)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "don_tokens = nltk.word_tokenize(don)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "498721" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(don_tokens)" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['\\ufeff',\n", " 'The',\n", " 'Project',\n", " 'Gutenberg',\n", " 'EBook',\n", " 'of',\n", " 'The',\n", " 'History',\n", " 'of',\n", " 'Don']" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "don_tokens[:10]" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "# Drop the first 320 tokens (Project Gutenberg header and front matter).\n", "dq_text = don_tokens[320:]" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['I', 'CHAPTER', 'I', 'WHICH', 'TREATS', 'OF', 'THE', 'CHARACTER', 'AND', 'PURSUITS', 'OF', 'THE', 'FAMOUS', 'GENTLEMAN', 'DON', 'QUIXOTE', 'OF', 'LA', 'MANCHA', 'CHAPTER', 'II', 'WHICH', 'TREATS', 'OF', 'THE', 'FIRST', 'SALLY', 'THE', 'INGENIOUS', 'DON']\n" ] } ], "source": [ "print(dq_text[:30])" ] }, { "cell_type": "code", "execution_count": 68, 
"metadata": {}, "outputs": [], "source": [ "dq_nltk_text = nltk.Text(dq_text)" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "nltk.text.Text" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(dq_nltk_text)" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['chapter', 'treats', 'character', 'pursuits', 'famous', 'gentleman', 'quixote', 'la', 'mancha', 'chapter', 'ii', 'treats', 'first', 'sally', 'ingenious', 'quixote', 'made', 'home', 'chapter', 'iii', 'wherein', 'related', 'droll', 'way', 'quixote', 'dubbed', 'knight', 'chapter', 'iv', 'happened', 'knight', 'left', 'inn', 'chapter', 'v', 'narrative', 'knight', 'mishap', 'continued', 'chapter', 'vi', 'diverting', 'important', 'scrutiny', 'curate', 'barber', 'made', 'library', 'ingenious', 'gentleman']\n" ] } ], "source": [ "dq_clean = []\n", "for word in dq_text:\n", "    if word.isalpha():\n", "        if word.lower() not in stops:\n", "            dq_clean.append(word.lower())\n", "print(dq_clean[:50])" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "from nltk.stem import WordNetLemmatizer\n", "wordnet_lemmatizer = WordNetLemmatizer()\n", "\n", "dq_lemmatized = []\n", "for t in dq_clean:\n", "    dq_lemmatized.append(wordnet_lemmatizer.lemmatize(t))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part-of-Speech Tagging" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "dq_tagged = nltk.pos_tag(dq_text)" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('I', 'PRP'), ('CHAPTER', 'VBP'), ('I', 'PRP'), ('WHICH', 'NNP'), ('TREATS', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CHARACTER', 'NNP'), ('AND', 'NNP'), ('PURSUITS', 'NNP')]\n" ] } ], "source": [ 
"print(dq_tagged[:10])" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "tag_dict = {}\n", "# For every (word, tag) pair in the tagged list, count how often the tag occurs.\n", "for (word, tag) in dq_tagged:\n", "    if tag in tag_dict:\n", "        tag_dict[tag] += 1\n", "    else:\n", "        tag_dict[tag] = 1" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'PRP': 36100,\n", " 'VBP': 9658,\n", " 'NNP': 31836,\n", " 'IN': 57945,\n", " 'VBD': 23503,\n", " ',': 36910,\n", " 'CC': 22993,\n", " 'VB': 21198,\n", " 'MD': 7256,\n", " 'DT': 40778,\n", " ':': 6442,\n", " 'CD': 3108,\n", " 'VBZ': 8316,\n", " 'RP': 1916,\n", " 'JJ': 24445,\n", " 'NN': 62303,\n", " 'WP': 4157,\n", " 'NNS': 15271,\n", " 'RB': 20227,\n", " 'VBN': 10087,\n", " 'WDT': 3546,\n", " '.': 7119,\n", " 'EX': 1073,\n", " 'TO': 13801,\n", " 'PRP$': 12231,\n", " 'VBG': 7727,\n", " 'RBS': 253,\n", " 'JJS': 954,\n", " 'PDT': 1118,\n", " 'RBR': 655,\n", " 'JJR': 1294,\n", " 'FW': 381,\n", " '(': 574,\n", " ')': 574,\n", " 'WP$': 137,\n", " 'WRB': 2147,\n", " 'POS': 14,\n", " 'NNPS': 155,\n", " 'UH': 85,\n", " \"''\": 111,\n", " '$': 3}" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tag_dict" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('NN', 62303), ('IN', 57945), ('DT', 40778), (',', 36910), ('PRP', 36100), ('NNP', 31836), ('JJ', 24445), ('VBD', 23503), ('CC', 22993), ('VB', 21198), ('RB', 20227), ('NNS', 15271), ('TO', 13801), ('PRP$', 12231), ('VBN', 10087), ('VBP', 9658), ('VBZ', 8316), ('VBG', 7727), ('MD', 7256), ('.', 7119), (':', 6442), ('WP', 4157), ('WDT', 3546), ('CD', 3108), ('WRB', 2147), ('RP', 1916), ('JJR', 1294), ('PDT', 1118), ('EX', 1073), ('JJS', 954), ('RBR', 655), ('(', 574), (')', 574), ('FW', 381), ('RBS', 253), ('NNPS', 155), ('WP$', 137), (\"''\", 111), ('UH', 85), ('POS', 14), ('$', 3)]\n" ] } ], 
"source": [ "tag_dict_sorted = sorted(tag_dict.items(),\n", "                         reverse=True,\n", "                         key=lambda kv: kv[1])\n", "print(tag_dict_sorted)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }