{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Searching for Meaning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook considers some strategies for searching for word forms and word meanings, especially using WordNet. It's part of [The Art of Literary Text Analysis](ArtOfLiteraryTextAnalysis.ipynb) and assumes that you've already worked through previous notebooks ([Getting Setup](GettingSetup.ipynb), [Getting Started](GettingStarted.ipynb), [Getting Texts](GettingTexts.ipynb), [Getting NLTK](GettingNltk.ipynb), and [Getting Graphical](GettingGraphical.ipynb)). In this notebook we'll look in particular at:\n", "\n", "* [Matching characters and words](#Matching-Characters-and-Words)\n", "* [Stemming and lemmatization](#Stemming-and-Lemmatization)\n", "* [Searching for meaning with WordNet](#Searching-for-meaning-with-WordNet)\n", "* [Defining functions in Python](#Defining-Functions)\n", "\n", "This notebook assumes you've saved _The Gold Bug_ into a plain text file, as described in [Getting Texts](GettingTexts.ipynb). 
If that's not the case, you may wish to include the following:\n", "\n", "```python\n", "import urllib.request\n", "# retrieve Poe plain text value\n", "poeUrl = \"http://www.gutenberg.org/files/2147/2147-0.txt\"\n", "poeString = urllib.request.urlopen(poeUrl).read().decode()\n", "```\n", "\n", "And then this, in a separate cell so that we don't read repeatedly from Gutenberg:\n", "\n", "```python\n", "import os\n", "# isolate The Gold Bug\n", "start = poeString.find(\"THE GOLD-BUG\")\n", "end = poeString.find(\"FOUR BEASTS IN ONE\")\n", "goldBugString = poeString[start:end]\n", "# save the file locally\n", "directory = \"data\"\n", "if not os.path.exists(directory):\n", "    os.makedirs(directory)\n", "with open(\"data/goldBug.txt\", \"w\") as f:\n", "    f.write(goldBugString)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Matching Characters and Words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's begin by importing _The Gold Bug_ text and creating token lists that we can use subsequently." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# read Gold Bug plain text into string\n", "with open(\"data/goldBug.txt\", \"r\") as f:\n", "    goldBugString = f.read()" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import nltk\n", "goldBugTokens = nltk.word_tokenize(goldBugString)\n", "goldBugWordTokens = [word for word in goldBugTokens if word[0].isalpha()]\n", "goldBugWordTokensLowercase = [word.lower() for word in goldBugWordTokens]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks can be deceiving, and it's easy to miss things when you're looking for them. This is as true for humans as for computers, though of course for very different reasons. 
Humans are easily distracted and error-prone (try counting the number of \"t\" characters in this paragraph, for instance), whereas computers can be frustratingly literal when searching. We saw this previously when we searched for terms in a case-sensitive way." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of lowercase matches: 37\n", "Number of upper-case matches: 1\n", "Number of converted lower-case matches: 38\n" ] } ], "source": [ "print(\"Number of lowercase matches: \", goldBugString.count(\"bug\"))\n", "print(\"Number of upper-case matches: \", goldBugString.count(\"BUG\"))\n", "print(\"Number of converted lower-case matches: \", goldBugString.lower().count(\"bug\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(BTW, if you did count the \"t\" characters in the previous text cell, did you count 24? Did you count the capital T or did you decide not to?)\n", "\n", "It's worth emphasizing that our previous ```count()``` calls were operating on a string, so the counting matches any sequence of characters in the string, not just whole words. This means that we may actually be catching more than we intend to (which could be good or bad). 
Let's demonstrate this with another string count:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "testBugSentence = \"These bugs are bugging me said the bug.\"\n", "testBugSentence.count(\"bug\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It may be surprising at first to see 3 occurrences of \"bug\" in our sentence, but of course that's the case:\n", "\n", "> _These bugs are bugging me said the bug._\n", "\n", "A common way to differentiate between string characters and entire word tokens is to use regular expressions, but we'll cover regular expressions in a subsequent notebook. For now, to avoid charges of being coy and to provide a preview of how regular expressions work, let's give two quick examples that we won't fully explain here." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of matches 38\n", "Variant forms: {'bug', 'BUG', 'bugs'}\n" ] } ], "source": [ "\n", "import re\n", "# \\w matches a word character, \\w* means match zero or more word characters\n", "bugMatches = re.compile(r\"bug\\w*\", re.IGNORECASE).findall(goldBugString)\n", "print(\"Number of matches \", len(bugMatches))\n", "print(\"Variant forms: \", set(bugMatches)) # set removes duplicates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, even without regular expressions, we can list the words that match \"bug\" by operating on the word tokens. To accomplish this we'll use the ```in``` operator that allows us to ask if one string is contained in another. This is a variant of the list comprehension syntax we've already seen." 
] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of matches 38\n", "Variant forms: {'bug', 'goole-bugs', 'goole-bug', 'GOLD-BUG'}\n" ] } ], "source": [ "bugTokens = [word for word in goldBugWordTokens if \"bug\" in word.lower()]\n", "print(\"Number of matches \", len(bugTokens))\n", "print(\"Variant forms: \", set(bugTokens))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we have the same counts for regular expressions and for our matching of tokens (which is reassuring), though this most recent strategy also shows examples of hyphenated words (the \"bug\" part of each hyphenated word is also matched by the regular expression, but our search expression doesn't capture the full token)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stemming and Lemmatization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rather than matching string variants, another approach is to normalize or regularize string forms before they're matched (converting a string to lowercase is actually a simple example of normalization). One very common technique is called stemming, where we ask the computer to follow a set of rules to reduce words to a common root (even if that root isn't necessarily a real word). We won't use stemming much here, but it's useful to see an example of how stemming can work." 
] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bug: bug\n", "bugs: bug\n", "bugging: bug\n", "baking: bake\n", "bakery: bakeri\n", "bakeries: bakeri\n" ] } ], "source": [ "from nltk.stem import PorterStemmer\n", "stemmer = PorterStemmer()\n", "print(\"bug: \", stemmer.stem(\"bug\")) # bug\n", "print(\"bugs: \", stemmer.stem(\"bugs\")) # bug\n", "print(\"bugging: \", stemmer.stem(\"bugging\")) # bug\n", "print(\"baking: \", stemmer.stem(\"baking\")) # bake\n", "print(\"bakery: \", stemmer.stem(\"bakery\")) # bakeri\n", "print(\"bakeries: \", stemmer.stem(\"bakeries\")) # bakeri" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, our mileage varies. Stemming is good at reducing \"bug\", \"bugs\" and \"bugging\" to \"bug\", as well as \"baking\" to \"bake\", but \"bakery\" and \"bakeries\" become the non-existent stem \"bakeri\". However, in many cases stemming can be very useful, even when it produces stems that aren't real words. Search engines typically do stemming since the non-words are never really shown to the user.\n", "\n", "Another approach is to try to lemmatize words, which essentially tries to reduce words to other existing words (not just a stem). How exactly this is done depends on the lemmatizer used, but one useful lemmatizer is WordNet. Let's compare results with the stemmer above." 
] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bug: bug\n", "bugs: bug\n", "bugging: bugging\n", "baking: baking\n", "bakery: bakery\n", "bakeries: bakery\n" ] } ], "source": [ "from nltk.stem import WordNetLemmatizer\n", "wnl = WordNetLemmatizer()\n", "print(\"bug: \", wnl.lemmatize(\"bug\")) # bug\n", "print(\"bugs: \", wnl.lemmatize(\"bugs\")) # bug\n", "print(\"bugging: \", wnl.lemmatize(\"bugging\")) # bugging\n", "print(\"baking: \", wnl.lemmatize(\"baking\")) # baking\n", "print(\"bakery: \", wnl.lemmatize(\"bakery\")) # bakery\n", "print(\"bakeries: \", wnl.lemmatize(\"bakeries\")) # bakery" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's clearly a trade-off between the more aggressive (and faster) stemming from the previous example and the more conservative lemmatization example above. It's not quite fair to say that WordNet is less powerful (because more words seem to be left unchanged), but it _is_ more conservative – if there's any doubt about the lemma (the lemmatized form), then no change is made. Doubt is often introduced because it's not quite clear if the word form should be a verb, a noun or some other part of speech (unless told otherwise, NLTK's WordNet lemmatizer treats each word as a noun). Is the word \"duck\" a bird or the action of getting out of the way of something?\n", "\n", "Let's see what happens when we provide the part-of-speech (pos) tag for baking (pos tags might include verb or noun or abbreviated forms like \"v\" and \"n\" – the next notebook deals more extensively with [Parts of Speech](PartsOfSpeech.ipynb))." 
] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "baking: baking\n", "baking: bake\n" ] } ], "source": [ "print(\"baking: \", wnl.lemmatize(\"baking\")) # baking\n", "print(\"baking: \", wnl.lemmatize(\"baking\", pos=\"v\")) # bake" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More sophisticated lemmatizers will guess at the part-of-speech based on the structure of sentences – _I read_ (verb) vs. _a good read_ (noun) – but for the sake of simplicity and expediency, we'll just ask NLTK to use WordNet to lemmatize each word out of its linguistic context (which essentially helps in making most plural forms into singular forms, like with \"bug\" and \"bugs\")." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Count of \"bug\" in tokens 32\n", "Count of \"bugs\" in tokens 0\n", "Count of \"bug\" in lemmas 32\n", "Count of \"eye\" in tokens 20\n", "Count of \"eyes\" in tokens 8\n", "Count of \"eye\" in lemmas 28\n" ] } ], "source": [ "goldBugLemmas = [wnl.lemmatize(word) for word in goldBugWordTokens]\n", "print('Count of \"bug\" in tokens', goldBugWordTokens.count(\"bug\"))\n", "print('Count of \"bugs\" in tokens', goldBugWordTokens.count(\"bugs\"))\n", "print('Count of \"bug\" in lemmas', goldBugLemmas.count(\"bug\"))\n", "print('Count of \"eye\" in tokens', goldBugWordTokens.count(\"eye\"))\n", "print('Count of \"eyes\" in tokens', goldBugWordTokens.count(\"eyes\"))\n", "print('Count of \"eye\" in lemmas', goldBugLemmas.count(\"eye\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So our lemmatization doesn't make much difference for the word _bug_ but it does make a difference for _eye_, which combines _eye_ and _eyes_. 
The dispersion plots for \"eye\" tokens will be different than for \"eye\" lemmas – for instance, the eye lemmas are more visible in the first part of the story." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYEAAAEZCAYAAABxbJkKAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAE6BJREFUeJzt3XuwZWV95vHvg1y8gA1NKFAuQkyIQBzFThQxxlaIYShR\npsY7MFFmjDPRJGUsUdGRJikNmjJgjIaJhUgQFTWkI2gZjNipKCCKXKXRIBdpEBA0QIgKyG/+WO+R\n3YfTh6Z7n9t+v5+qXb33Wmu/6/2ts3s9a71rn3VSVUiS+rTFQndAkrRwDAFJ6pghIEkdMwQkqWOG\ngCR1zBCQpI4ZAhq7JM9NcvUY2rk+yUGb8f4jkvzT5vZjXMa1XTZhvQ8k+eX5Xq+WBkNAm72zna6q\n/rWqnjKOptrjIZJ8LMnPktzVHlckeU+Sx4/044yq+t0x9GMsxrhd1pNkz7ajv7s9rkvy1k1o5zVJ\n/nXc/dPiZggIZtnZLmIFvLeqHg/8EvBa4ADga0keu1CdSrKQ/6eWVdV2wKuAdyV54QL2RUuEIaAN\nyuBtSa5JcnuSM5Ps0Ob9TZLPjiz73iT/3J6vTHLjyLzdk5yV5LbWzgfb9CcnOa9N+2GSjydZ9ki6\nCFBV91bVN4EXAzsyBMJ6R7atlhOT3JrkziSXJ9m3zftYkpOTnNvOKtYk2WOk/09J8qUkdyS5OsnL\nRuZ9rG2LLyT5D2BlkkOTXNXaWpfkzRvYLvu0df04yZVJDpvW7oeSnNPauXBjh3Sq6kLg28CvP2SD\nJcuS/F37WVyf5B1t2+wD/A3w7HY28aON/SFoaTMENJs/Ytix/jbwBODHwIfavD8Bnprk95I8Fzga\n+B/TG0jyKOAc4DrgScCuwKdGFnl3a3sfYHdg1aZ2tqr+A/gS8NwZZr+wTf/VqloGvAwY3dG9GvhT\nhrOKS4EzWv8f19r8OLAT8Ergw22nOeVVwJ9V1bbA+cApwOvaWcp+wHnTO5NkK+Bs4Iut3T8Ezkiy\n98hir2DYHjsA1zBsq9m0/Xme09Z7yQzLfBDYDtgLeB7Dz+y1VbUW+N/ABVW1XVUtf5h1aUIYAprN\n64F3VtXNVXUfcDzw0iRbVNVPgKOAE4HTgTdW1c0ztPFMhp38W6rqJ1X1s6r6GkBVfa+qvlxV91XV\n7a2t521mn38AzLQDu49h57dP6/93quqWkfnnVNVXq+pe4B0MR8S7AS8Crquq06rqgaq6FDiLIUSm\nrK6qC1pNPwXuBfZL8viqurOqZtoZHwA8rqpOqKr7q+orDGH5qpFlzqqqb1bVzxlC6ekPU/vtwB3A\nR4C3tjZ/oQXyK4C3V9U9VXUD8H6GnyO0Myv1xRDQbPYE/qENV/wYuAq4H9gZoKouAq5ty35mA23s\nDtxQVQ9Mn5Fk5ySfakMmdzKEyY6b2eddGXaE66mq84C/ZjiTuTXJ/0uy3dRsYN3IsvcwnCU8keHs\n5VlT26Bth1fTtkF77y+GeJr/DhwKXN+Gew6YoZ9PnOF9N7TpU+3eOjLvJ8C2G6x6sGNVLa+qfavq\nr2eY/0vAVm09U77PsM3UKUNAs/k+cEhV7TDyeGxV/QAgyRuArYGbgWM20MaNwB7tKHS69wA/B369\nDdEcxSP7TK53MTvJtsDBwIzfcKmqD1bVbwD7AnsDb5l6K0NYjbazHLiJYRv8y7RtsF1VvWGDnRqO\n3g9nGOZZDXx6hsV
uBnZPMnr0/aS2zrlyO8MZ0Z4j0/bgwQBcal8O0BgYApqydZJHjzy2BE4G3jN1\nkTTJTkle3J7vDfwZcATDuPIxSZ42Q7sXMQzRnJDksa3tA9u8bYF7gLuS7MqDO+WNkfYgyTZJVjDs\ncO8ATn3IwslvJHlWG4v/T+CnDAE05dAkz0mydavrgqq6Cfg8sHeSI5Ns1R6/mWTqq56Ztp6tMvx+\nwrI2jHP3tPVM+XrrxzHtPSsZhp6mrpeMfWim9efTwLuTbJvkScCbGK53wHDmsVvbRuqEIaApX2DY\nKU093gV8APgccG6Su4ALgGe2o/rTgROq6oqqugY4Fjh9ZAdS8Isdz2HArzAcVd8IvLwtczzwDOBO\nhoukf8/GH40Www70LoYj3NOAbwAHtusVU8tMtfd44G8Zhnmub+/5i5HlPgEcxxAi+wNHtv7fzXBR\n+ZUMR+k/AP6c4Qxo+jqmHAlc14a4fp8hKEf7Tbv2cBjwX4EfMgxVHVVV352l3dm2zcbO+0OG4L2W\n4YzpDB4MzS8zfKvoliS3zdKeJkj8ozLqXZJTgXVV9X8Xui/SfPNMQPJbMeqYISAtzd+YlsbC4SBJ\n6phnApLUsS3ne4VJPPWQpE1QVWO/frUgZwJVNbGP4447bsH7YH3W11ttPdQ3VxwOkqSOGQKS1DFD\nYMxWrly50F2YU9a3dE1ybTD59c2Vef+KaJKa73VK0lKXhJqUC8OSpMXBEJCkjhkCktQxQ0CSOmYI\nSFLHDAFJ6pghIEkdMwQkqWOGgCR1zBCQpI4ZApLUMUNAkjpmCEhSxwwBSeqYISBJHTMEJKljhoAk\ndcwQkKSOGQKS1DFDQJI6ZghIUscMAUnqmCEgSR0zBCSpY4aAJHXMEJCkjhkCktQxQ0CSOmYISFLH\nDAFJ6pghIEkdMwQkqWOGgCR1zBCQpI4ZApLUMUNAkjpmCEhSxwwBSeqYISBJHTMEJKljhoAkdcwQ\nkKSOGQKS1DFDQJI6ZghIUscMAUnqmCEgSR0zBCSpY4aAJHXMEJCkjhkCktQxQ0CSOmYISFLHDAFJ\n6pghIEkdMwQkqWOGgCR1zBCQpI4ZApLUMUNAkjpmCEhSxwwBSeqYISBJHTMEJKljhoAkdcwQkKSO\nGQKS1DFDQJI6ZghIUscMAUnqmCEgSR0zBCSpY4aAJHXMEJCkjhkCktQxQ0CSOmYISFLHDAFJ6pgh\nIEkdMwQkqWOGgCR1zBCQpI4ZApLUMUNAkjpmCEhSxwwBSeqYISBJHTMEJKljhoAkdcwQkKSOGQKS\n1DFDQJI6ZghIUscMAUnqmCEgSR0zBCSpY4aAJHXMEJCkjhkCktQxQ0CSOmYISFLHDAFJ6tiChcCa\nNQu15v6cdNKwvd/4xuHf+dz2J520/uupfkjjNtvnetyf+emf6/lY51x52BBIODLh6wmXJJyccHTC\niSPzX5fwlxtYdoPtL5UNNAlWrx629znnzH8IrF69/uupfkjjNp8hMP1zPR/rnCuzhkDCPsDLgQOr\n2B/4OXAfcFjCo9pirwFOmWHZB4Aj5qrjkqTNt+XDzD8IWAF8MwHg0cBtwHkMQXA1sFUV305447Rl\nHwPcMlOjq1atYs0aWLUKVq5cycqVKze7EEmaJGvWrGHNPJxOPFwIAJxWxbGjExKeCbwDWAt8dLZl\nZ7Jq1SpWrRpCQJL0UNMPkI8//vg5Wc/DXRP4MvDShJ0AEpYn7FHFRcBuwKuBT8627Jz0WpI0FrOe\nCVSxNuGdwLntIu99wB8A3wc+DTytijs3YtmHcARo/hx+ODz96XD77fO/3Q8/fP3XK1cO/ZDGbbbP\n9rg/99M/1/OxzrmSqtq0N4azgb+s4iuP7H2pTV2nJPUqCVWVcbf7iH9PIGH7hO8A//lIA0CStLhs\n8pnAJq/QMwFJesQWzZmAJGlyGAKS1DFDQJI6ZghIUscMAUnqmCEgSR0zBCSpY4aAJ
HXMEJCkjhkC\nktQxQ0CSOmYISFLHDAFJ6pghIEkdMwQkqWOGgCR1zBCQpI4ZApLUMUNAkjpmCEhSxwwBSeqYISBJ\nHTMEJKljhoAkdcwQkKSOGQKS1DFDQJI6ZghIUscMAUnqmCEgSR0zBCSpY4aAJHXMEJCkjhkCktQx\nQ0CSOmYISFLHDAFJ6pghIEkdMwQkqWOGgCR1zBCQpI4ZApLUMUNAkjpmCEhSxwwBSeqYISBJHTME\nJKljhoAkdcwQkKSOGQKS1DFDQJI6ZghIUscMAUnqmCEgSR0zBCSpY4aAJHXMEJCkjhkCktQxQ0CS\nOmYISFLHDAFJ6pghIEkdMwQkqWOGgCR1zBCQpI4ZApLUMUNAkjpmCEhSxwwBSeqYISBJHTMEJKlj\nhoAkdcwQkKSOGQKS1DFDQJI6ZghIUscMAUnqmCEgSR0zBCSpY4aAJHXMEJCkjhkCktQxQ0CSOmYI\nSFLHDAFJ6pghIEkdMwQkqWOGgCR1zBCQpI4ZApLUMUNAkjpmCEhSxwwBSeqYISBJHTMEJKljhoAk\ndcwQkKSOGQKS1DFDQJI6ZghIUscMAUnqmCEwZmvWrFnoLswp61u6Jrk2mPz65oohMGaT/kG0vqVr\nkmuDya9vrhgCktQxQ0CSOpaqmt8VJvO7QkmaEFWVcbc57yEgSVo8HA6SpI4ZApLUsc0OgSS7J/lK\nkm8nuTLJH7Xpy5N8Kcl3k5ybZPuR97w9yb8luTrJC0emr0hyRZv3gc3t2zgleVSSS5Kc3V5PTH1J\ntk/y2SRrk1yV5FkTVt+b2mfziiSfSLLNUq0vyUeT3JrkipFpY6ulbZsz2/QLkzxp/qrbYH1/0T6b\nlyU5K8mykXlLvr6ReW9O8kCS5SPT5r6+qtqsB7AL8PT2fFvgO8A+wPuAY9r0twIntOf7ApcCWwF7\nAtfw4LWJi4BntudfAA7Z3P6N6wH8CXAG8Ln2emLqA04Djm7PtwSWTUp9wK7AtcA27fWZwO8t1fqA\n5wL7A1eMTBtbLcAfAB9uz18BfGoR1Pc7wBbt+QmTVl+bvjvwReA6YPl81jcXRa4GDgauBnZu03YB\nrm7P3w68dWT5LwIHAE8A1o5MfyVw8nz+gGapaTfgn4HnA2e3aRNRH8MO/9oZpk9KfbsC3wd2YAi4\ns9tOZcnW13YIozvJsdXSlnlWe74l8MOFrm/avP8GfHzS6gM+A/wX1g+BealvrNcEkuzJkHJfZ/hQ\n3tpm3Qrs3J4/EVg38rZ1DP9Rp0+/qU1fDE4E3gI8MDJtUurbC/hhklOTfCvJR5I8jgmpr6puAt7P\nEAQ3A/9eVV9iQuprxlnLrsCNAFV1P3Dn6PDEInA0w5EvTEh9SV4CrKuqy6fNmpf6xhYCSbYF/h74\n46q6e3ReDbG0JL+LmuRFwG1VdQkw43d0l3J9DEcLz2A4hXwGcA/wttEFlnJ9SXYAXsxw9PVEYNsk\nR44us5Trm26SapkuyTuAe6vqEwvdl3FJ8ljgWOC40cnz2YexhECSrRgC4PSqWt0m35pklzb/CcBt\nbfpNDONfU3ZjSLWb2vPR6TeNo3+b6UDgxUmuAz4JvCDJ6UxOfesYjkK+0V5/liEUbpmQ+g4Grquq\nO9qR0VnAs5mc+mA8n8V1I+/Zo7W1JbCsqn40d13fOEleAxwKHDEyeRLqezLDAcplbR+zG3Bxkp2Z\np/rG8e2gAKcAV1XVSSOzPsdwAY727+qR6a9MsnWSvYBfBS6qqluAuzJ8MyXAUSPvWTBVdWxV7V5V\nezGMvZ1XVUcxOfXdAtyYZO826WDg2wxj50u+PuAG4IAkj2n9Ohi4ismpD8bzWfzHGdp6KfDl+Shg\nNkkOYRiOfUlV/XRk1pKvr6quqKqdq2qvto9ZBzyjDe/NT31juMjxWwxj5ZcCl7THIcByhoup3wXO\nBbYfec+xDFe6rwZ+d2T6CuCKNu+v5vuCzUbU+
jwe/HbQxNQHPA34BnAZw5HysgmrbxWwtvXtNIZv\nWyzJ+hjORm8G7mUY+33tOGsBtgE+DfwbcCGw5wLXd3Tryw0j+5cPT0B9P5v6+U2bfy3twvB81edt\nIySpY/7GsCR1zBCQpI4ZApLUMUNAkjpmCEhSxwwBSeqYIaBFKcmJSf545PU/JfnIyOv3J3nTJra9\nMu2W4DPM+60kX2+3Ll6b5HUj83Zq8y5uy70sw623H/EvHCU5dlP6Lo2bIaDF6qsMt+wgyRbAjgy3\n1p3ybOBrG9NQe//GLLcLw+3CX19V+zD8IuTrkxzaFjkIuLyqVlTVV4H/CfyvqjpoY9qf5u2b8B5p\n7AwBLVYXMOzoAfYDrgTuzvAHcLZh+JsV30pyULv76eVJTkmyNUCS65OckORi4GVJDmlH9hcz3I54\nJm8ATq2qSwGq6g7gGOBtSZ4GvBd4SYY/LvQu4DnAR5O8L8l+SS5q8y5L8uTWjyPb2cMlSU5OskWS\nE4DHtGmnz8G2kzbalgvdAWkmVXVzkvuT7M4QBhcw3Cb32cBdwOXAo4BTgRdU1TVJTgP+D/ABhjtp\n3l5VK5I8muGWCs+vqu8lOZOZ77S5L/CxadMuBvarqsvajn9FVU399bznA2+uqm8l+SvgpKr6RLtx\n15ZJ9gFeDhxYVT9P8mHgiKp6W5I3VNX+49pe0qbyTECL2fkMQ0IHMoTABe351FDQrzHcIfSatvxp\nwG+PvP/M9u9T2nLfa68/zoZv1zvbbXwzy/wLgGOTHMNwv5afMgwfrQC+meQS4AUMf79BWjQMAS1m\nX2MYcnkqw82yLuTBUDh/huXD+kf492yg3Q3tyK9i2GmPWsEwFDWrqvokcBjwE+AL7SwB4LSq2r89\nnlJVf/pwbUnzyRDQYnY+8CLgjhr8GNie4UzgfIYhnj2nxt8Zbqn7LzO0c3Vb7pfb61dtYH0fAl7T\nxv9JsiPD37R938N1NMleVXVdVX2Q4ba+T2W4je9Lk+zUllmeZI/2lvvasJG0oAwBLWZXMnwr6MKR\naZcz/InIH7Uhl9cCn0lyOXA/cHJb7hdnBG253wc+3y4M38oM1wRquE/7kcBHkqxlOBM5pao+P9Lm\nhm67+/IkV7Zhn/2Av6uqtcA7gXOTXMZwm+dd2vJ/C1zuhWEtNG8lLUkd80xAkjpmCEhSxwwBSeqY\nISBJHTMEJKljhoAkdcwQkKSOGQKS1LH/DzwsPXMABJXvAAAAAElFTkSuQmCC\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYEAAAEZCAYAAABxbJkKAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAFABJREFUeJzt3XuwLWV95vHvg1y8gIdLKFAuQhwZgTgGT0YRYzgIYwgl\nOtZ4BybKjHEmmqSMJTcdgaRi0JRBYzRMLESCqKhhGEHLYISTigIiylXO0SAXuQgIOkCICshv/uh3\nw2K7zz63vddei/f7qVp11up+19u/7r13P91vr9MrVYUkqU+bLHUBkqSlYwhIUscMAUnqmCEgSR0z\nBCSpY4aAJHXMENCCS/LiJKsXoJ8bkxy4Ee8/LMk/bGwdC2WhtssGLPfhJL867uVqOhgC2uid7WxV\n9c9V9eyF6Ko9fkmSTyT5eZJ72+PqJO9N8tSROs6sqt9egDoWxAJul8dIslvb0d/XHjckOXoD+nlj\nkn9e6Po02QwBwTw72wlWwPuq6qnArwBvAvYFvp7kyUtVVJKl/JtaVlVbAa8H3pPkpUtYi6aEIaA1\nyuCYJNcluSvJWUm2afP+JsnnR9q+L8k/tucrktw8Mm+XJGcnubP18+E2/ZlJLmjTfpTkk0mWrU+J\nAFX1QFVdBrwc2I4hEB5zZNvW5eQkdyS5J8lVSfZq8z6R5JQk57ezipVJdh2p/9lJvpLk7iSrk7x6\nZN4n2rb4UpJ/BVYkOSTJta2vW5K8Yw3bZc+2rJ8kuSbJobP6/UiS81o/l6zrkE5VXQJ8B/i1X9pg\nybIkf9d+FjcmeVfbNnsCfwO8sJ1N/HhdfwiaboaA5vOHDDvW3wKeBvwE+Eib98fAc5L8bpIXA0cC\n/3V2B0meAJwH3AA8A9gJ+MxIkz9rfe8J7AKcsKHFVtW/Al8BXjzH7Je26c+qqmXAq4HRHd0bgD9h\nOKu4Ajiz1f+U1ucnge2B1wEfbTvNGa8H/rSqtgQuAk4F3tzOUvYGLphdTJLNgHOBL7d+/wA4M8ke\nI81ey7A9tgGuY9hW82n787yoLffyOdp8GNgK2B3Yn+Fn9qaqWgX8D+DiqtqqqrZdy7L0OGEIaD5v\nAd5dVbdV1YPAicCrkmxSVT8FjgBOBs4A3lZVt83Rx/MZdvLvrKqfVtXPq+rrAFX1/ar6alU9WFV3\ntb7238iafwjMtQN7kGHnt2er/7tVdfvI/POq6mtV9QDwLoYj4p2BlwE3VNXpVfVwVV0BnM0QIjPO\nqaqL2zr9DHgA2DvJU6vqnqqaa2e8L/CUqjqpqh6qqgsZwvL1I23OrqrLquoXDKH062tZ97uAu4GP\nAUe3Ph/RAvm1wLFVdX9V3QR8gOHnCO3MSn0xBDSf3YD/04YrfgJcCzwE7ABQVZcC17e2n1tDH7sA\nN1XVw7NnJNkhyWfakMk9DGGy3UbWvBPDjvAxquoC4K8ZzmTuSPK/k2w1Mxu4ZaTt/QxnCU9nOHt5\nwcw2aNvhDbRt0N77yBBP81+AQ4Ab23DPvnPU+fQ53ndTmz7T7x0j834KbLnGtR5sV1XbVtVeVfXX\nc8z/FWCztpwZP2DYZuqUIaD5/AA4uKq2GXk8uap+CJDkrcDmwG3AUWvo42Zg13YUOtt7gV8Av9aG\naI5g/X4nH3MxO8mWwEHAnJ9wqaoPV9VvAHsBewDvnHkrQ1iN9rMtcCvDNvinWdtgq6p66xqLGo7e\n/zPDMM85wGfnaHYbsEuS0aPvZ7RlLpa7GM6IdhuZtiuPBuC0fThAC8AQ0IzNkzxx5LEpcArw3pmL\npEm2T/Ly9nwP4E+BwxjGlY9K8tw5+r2UYYjmpCRPbn3v1+ZtCdwP3JtkJx7dKa+LtAdJtkiynGGH\nezdw2i81Tn4jyQvaWPy/AT9jCKAZhyR5UZLN23pdXFW3Al8E9khyeJLN2uM/Jpn5qGdmLWezDP8/\nYVkbxrlv1nJmfKPVcVR7zwqGoaeZ6yULPjTT6vks8GdJtkzyD
ODtDNc7YDjz2LltI3XCENCMLzHs\nlGYe7wE+BHwBOD/JvcDFwPPbUf0ZwElVdXVVXQccB5wxsgMpeGTHcyjw7xiOqm8GXtPanAg8D7iH\n4SLp37PuR6PFsAO9l+EI93Tgm8B+7XrFTJuZ/p4K/C3DMM+N7T1/MdLuU8DxDCGyD3B4q/8+hovK\nr2M4Sv8h8OcMZ0CzlzHjcOCGNsT1ewxBOVo37drDocDvAD9iGKo6oqq+N0+/822bdZ33BwzBez3D\nGdOZPBqaX2X4VNHtSe6cpz89jsQvlVHvkpwG3FJV/2upa5HGzTMByU/FqGOGgDSd/2NaWhAOB0lS\nxzwTkKSObTruBSbx1EOSNkBVLfj1qyU5E6iqqX0cf/zxS16D9S99HT3WP821Px7qXywOB0lSxwwB\nSeqYIbCeVqxYsdQlbBTrX1rTXP801w7TX/9iGftHRJPUuJcpSdMuCfV4uTAsSZoMhoAkdcwQkKSO\nGQKS1DFDQJI6ZghIUscMAUnqmCEgSR0zBCSpY4aAJHXMEJCkjhkCktQxQ0CSOmYISFLHDAFJ6pgh\nIEkdMwQkqWOGgCR1zBCQpI4ZApLUMUNAkjpmCEhSxwwBSeqYISBJHTMEJKljhoAkdcwQkKSOGQKS\n1DFDQJI6ZghIUscMAUnqmCEgSR0zBCSpY4aAJHXMEJCkjhkCktQxQ0CSOmYISFLHDAFJ6pghIEkd\nMwQkqWOGgCR1zBCQpI4ZApLUMUNAkjpmCEhSxwwBSeqYISBJHTMEJKljhoAkdcwQkKSOGQKS1DFD\nQJI6ZghIUscMAUnqmCEgSR0zBCSpY4aAJHXMEJCkjhkCktQxQ0CSOmYISFLHDAFJ6pghIEkdMwQk\nqWOGgCR1zBCQpI4ZApLUMUNAkjpmCEhSxwwBSeqYISBJHTMEJKljhoAkdcwQkKSOGQKS1DFDQJI6\nZghIUscMAUnqmCEgSR0zBCSpY4aAJHXMEJCkjhkCktQxQ0CSOmYISFLHDAFJ6pghIEkdMwQkqWOG\ngCR1zBCQpI4ZApLUMUNAkjpmCEhSxwwBSeqYISBJHTMEJKljhoAkdcwQkKSOGQKS1DFDQJI6ZghI\nUse6CYGVKye7v8XudxKsXDk83va24d8PfnDxljO771e+cuOWN6k/l0mtaxKtaVvNtw3XdfvO/G5v\nbD9LYa0hkHB4wjcSLk84JeHIhJNH5r854S/X0HZiQsYQWHozfyjnnTf8e845i7ec2X1feOHGLW9S\nfy6TWtckMgTmNu9OOmFP4DXAflXsA/wCeBA4NOEJrdkbgVPnaPswcNhiFS5J2nibrmX+gcBy4LIE\ngCcCdwIXMATBamCzKr6T8LZZbZ8E3D5XpyeccMIjz1esWMGKFSs2YhUk6fFn5cqVrBzDKcTaQgDg\n9CqOG52Q8HzgXcAq4OPztZ3LaAhIkn7Z7APkE088cVGWs7Yx+68Cr0rYHiBh24Rdq7gU2Bl4A/Dp\n+douStWSpAUx75lAFasS3g2c3y7yPgj8PvAD4LPAc6u4Zx3aLrmFHnFarBGsx/PI2My63XXX8Hzr\nrRdvObP7PuAA2H//jetzEk1qXZNoTdtqvm24rtt3be0m+eeUqtqwN4Zzgb+s4sL1e19qQ5cpSb1K\nQlVloftd749wJmyd8F3g39Y3ACRJk2WDzwQ2eIGeCUjSepuYMwFJ0uOHISBJHTMEJKljhoAkdcwQ\nkKSOGQKS1DFDQJI6ZghIUscMAUnqmCEgSR0zBCSpY4aAJHXMEJCkjhkCktQxQ0CSOmYISFLHDAFJ\n6pghIEkdMwQkqWOGgCR1zBCQpI4ZApLUMUNAkjpmCEhSxwwBSeqYISBJHTMEJKljhoAkdcwQkKSO\nGQKS1DFDQJI6ZghIUscMAUnqmCEgSR0zBCSpY4aAJHXMEJCkjhkCktQxQ0CSOmYISFLHDAFJ6pgh\nIEkdMwQkqWOGgCR1zBCQp
I4ZApLUMUNAkjpmCEhSxwwBSeqYISBJHTMEJKljhoAkdcwQkKSOGQKS\n1DFDQJI6ZghIUscMAUnqmCEgSR0zBCSpY4aAJHXMEJCkjhkCktQxQ0CSOmYISFLHDAFJ6pghIEkd\nMwQkqWOGgCR1zBCQpI4ZApLUMUNAkjpmCEhSxwwBSeqYISBJHTMEJKljhoAkdcwQkKSOGQKS1DFD\nQJI6ZghIUscMAUnqmCEgSR0zBCSpY4aAJHXMEJCkjhkCktQxQ0CSOmYISFLHDAFJ6pghIEkdMwQk\nqWOGgCR1zBCQpI4ZApLUMUNAkjpmCEhSxwwBSeqYISBJHTMEJKljhoAkdcwQkKSOGQLraeXKlUtd\nwkax/qU1zfVPc+0w/fUvFkNgPU37L5L1L61prn+aa4fpr3+xGAKS1DFDQJI6lqoa7wKT8S5Qkh4n\nqioL3efYQ0CSNDkcDpKkjhkCktSxsYZAkoOTrE7yL0mOHuey1yTJLkkuTPKdJNck+cM2fdskX0ny\nvSTnJ9l65D3HtnVYneSlI9OXJ7m6zfvQmNfjCUkuT3LutNWfZOskn0+yKsm1SV4wZfW/vf3uXJ3k\nU0m2mNT6k3w8yR1Jrh6ZtmC1tnU/q02/JMkzxlD/X7TfnSuTnJ1k2TTVPzLvHUkeTrLtWOuvqrE8\ngCcA1wG7AZsBVwB7jmv589S1I/Dr7fmWwHeBPYH3A0e16UcDJ7Xne7XaN2vrch2PXlu5FHh+e/4l\n4OAxrscfA2cCX2ivp6Z+4HTgyPZ8U2DZtNQP7ARcD2zRXp8F/O6k1g+8GNgHuHpk2oLVCvw+8NH2\n/LXAZ8ZQ/38CNmnPT5q2+tv0XYAvAzcA246z/kX/Ax9ZyRcCXx55fQxwzLiWvx51ngMcBKwGdmjT\ndgRWt+fHAkePtP8ysC/wNGDVyPTXAaeMqeadgX8EDgDObdOmon6GHf71c0yflvp3An4AbMMQYOe2\nndLE1t92KKM70QWrtbV5QXu+KfCjxa5/1rxXAp+ctvqBzwH/gceGwFjqH+dw0E7AzSOvb2nTJkaS\n3RhS+hsMfxR3tFl3ADu0509nqH3GzHrMnn4r41u/k4F3Ag+PTJuW+ncHfpTktCTfTvKxJE9hSuqv\nqluBDzAEwW3A/6uqrzAl9TcLWesjf+dV9RBwz+jwxhgcyXBkDFNSf5JXALdU1VWzZo2l/nGGwER/\nFjXJlsDfA39UVfeNzqshViey/iQvA+6sqsuBOT9DPMn1MxytPI/hFPZ5wP0MZ4mPmOT6k2wDvJzh\n6O7pwJZJDh9tM8n1zzZNtc6W5F3AA1X1qaWuZV0leTJwHHD86ORx1jDOELiVYdxrxi48Ns2WTJLN\nGALgjKo6p02+I8mObf7TgDvb9NnrsTPDetzano9Ov3Ux6272A16e5Abg08BLkpzB9NR/C8NR0Dfb\n688zhMLtU1L/QcANVXV3O/I6m2Hoc1rqh4X5Xbll5D27tr42BZZV1Y8Xr/RBkjcChwCHjUyehvqf\nyXAAcWX7G94Z+FaSHcZV/zhD4DLgWUl2S7I5w0WLL4xx+XNKEuBU4Nqq+uDIrC8wXOCj/XvOyPTX\nJdk8ye7As4BLq+p24N4Mn2wJcMTIexZNVR1XVbtU1e4MY4MXVNURU1T/7cDNSfZokw4CvsMwtj7x\n9QM3AfsmeVJb7kHAtVNU/0xNG1vr/52jr1cBX13s4pMczDAc+oqq+tnIrImvv6qurqodqmr39jd8\nC/C8Njw3nvoX+qLHWi6I/A7Dp2+uA44d57Lnqek3GcbSrwAub4+DgW0ZLrZ+Dzgf2HrkPce1dVgN\n/PbI9OXA1W3eXy3BuuzPo58Ompr6gecC3wSuZDiSXjZl9Z8ArGrLPp3h0xwTWT/D2eJtwAMMY8dv\nWshagS2AzwL/AlwC7LbI9R/ZlnXTyN/vR6eg/p/PbP9Z86+nXRgeV/3eNkKSOub/GJakjhk
CktQx\nQ0CSOmYISFLHDAFJ6pghIEkdMwQ0kZKcnOSPRl7/Q5KPjbz+QJK3b2DfK9JuuT3HvN9M8o12a+JV\nSd48Mm/7Nu9brd2rM9z6er3/Q1GS4zakdmmhGQKaVF9juCUGSTYBtmO4te6MFwJfX5eO2vvXpd2O\nDLfjfktV7cnwHwnfkuSQ1uRA4KqqWl5VXwP+G/Dfq+rAdel/lmM34D3SgjMENKkuZtjRA+wNXAPc\nl+ELaLZg+M6Hbyc5sN199Kokp7ZbkpDkxiQnJfkW8OoMX2i0qr1+5RqW+VbgtKq6AqCq7gaOAo5J\n8lzgfcArMnx5z3uAFwEfT/L+JHsnubTNuzLJM1sdh7ezh8uTnJJkkyQnAU9q085YhG0nrbNNl7oA\naS5VdVuSh5LswhAGFzPcJveFwL3AVQxfVHQa8JKqui7J6cD/BD7EcCfMu6pqeZInMtwS4YCq+n6S\ns5j7Tpl7AZ+YNe1bwN5VdWXb8S+vqplvnzsAeEdVfTvJXwEfrKpPtRt3bZpkT+A1wH5V9YskHwUO\nq6pjkry1qvZZqO0lbSjPBDTJLmIYEtqPIQQubs9nhoL+PcMdPK9r7U8Hfmvk/We1f5/d2n2/vf4k\na75d73y38c088y8GjktyFMP9Wn7GMHy0HLgsyeXASxi+P0GaGIaAJtnXGYZcnsNws6xLeDQULpqj\nfXjsEf79a+h3TTvyaxl22qOWMwxFzauqPg0cCvwU+FI7SwA4var2aY9nV9WfrK0vaZwMAU2yi4CX\nAXfX4CfA1gxnAhcxDPHsNjP+znBL3X+ao5/Vrd2vttevX8PyPgK8sY3/k2Q7hu+sff/aCk2ye1Xd\nUFUfZrit73MYbuP7qiTbtzbbJtm1veXBNmwkLSlDQJPsGoZPBV0yMu0qhq9w/HEbcnkT8LkkVwEP\nAae0do+cEbR2vwd8sV0YvoM5rgnUcJ/2w4GPJVnFcCZyalV9caTPNd129zVJrmnDPnsDf1dVq4B3\nA+cnuZLhNs07tvZ/C1zlhWEtNW8lLUkd80xAkjpmCEhSxwwBSeqYISBJHTMEJKljhoAkdcwQkKSO\nGQKS1LH/DwRxOF3ASCoQAAAAAElFTkSuQmCC\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "nltk.Text(goldBugWordTokens).dispersion_plot([\"eye\"]) # first graph with tokens (eye)\n", "nltk.Text(goldBugLemmas).dispersion_plot([\"eye\"]) # second graph with lemmas (eye and eyes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've looked at converting to lowercase, stemming, and lemmatization as ways of helping to ensure that we're matching a greater number of relevant words. Another tool in our NLTK toolbox is to use the semantic features of WordNet to explore words with related meanings." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Searching for meaning with WordNet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "WordNet can be thought of as a large dictionary that allows us to explore the interconnectedness of lexical word forms and their meanings. Entries are organized into sets of cognitive synonyms, called synsets. For instance, we can ask WordNet for the synsets of the word \"bug\":" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Synset('bug.n.01'),\n", " Synset('bug.n.02'),\n", " Synset('bug.n.03'),\n", " Synset('hemipterous_insect.n.01'),\n", " Synset('microbe.n.01'),\n", " Synset('tease.v.01'),\n", " Synset('wiretap.v.01')]" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.corpus import wordnet as wn\n", "wn.synsets(\"bug\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a list (the square brackets) composed of elements (separated by commas) that represent Synset objects. We'll look more closely at Synset objects as we go along, but already we can see a structure that resembles a dictionary, with parts of speech (\"n\" for noun, \"v\" for verb) and multiple meanings:\n", "\n", "* bug.n.01\n", "* bug.n.02\n", "* bug.n.03\n", "* hemipterous_insect.n.01\n", "* microbe.n.01\n", "* tease.v.01\n", "* wiretap.v.01\n", "\n", "In our _Gold Bug_ story we're especially interested in the noun \"bug\", not the verb meanings (like to tease or to wiretap). 
We can reduce the number of synsets by specifying a part of speech:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Synset('bug.n.01'),\n", " Synset('bug.n.02'),\n", " Synset('bug.n.03'),\n", " Synset('hemipterous_insect.n.01'),\n", " Synset('microbe.n.01')]" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wn.synsets(\"bug\", pos=\"n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We may be able to reduce our list further by looking more closely at the definitions for each synset." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bug.n.01 : general term for any insect or similar creeping or crawling invertebrate\n", "bug.n.02 : a fault or defect in a computer program, system, or machine\n", "bug.n.03 : a small hidden microphone; for listening secretly\n", "hemipterous_insect.n.01 : insects with sucking mouthparts and forewings thickened and leathery at the base; usually show incomplete metamorphosis\n", "microbe.n.01 : a minute life form (especially a disease-causing bacterium); the term is not in technical use\n" ] } ], "source": [ "for synset in wn.synsets(\"bug\", pos=\"n\"):\n", "    print(synset.name(), \": \", synset.definition())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, given our knowledge of Poe's short story (though of course we won't always have this familiarity with the text we wish to analyze), we can zero in on the one synset of interest." 
] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Synset('bug.n.01') general term for any insect or similar creeping or crawling invertebrate\n" ] } ], "source": [ "bugSynset = wn.synset(\"bug.n.01\")\n", "print(bugSynset, bugSynset.definition())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "WordNet's real strength is in exploring words with related meanings. There are several kinds of relationships, but here are some of the main ones:\n", "\n", "* synonyms: words with very similar meanings\n", "* hypernyms: words with more general meanings (\"insect\" is more general than \"bug\")\n", "* hyponyms: words with more specific meanings (\"scarabaeus\" is more specific than \"bug\")\n", "\n", "In looking for words related to \"bug\", one possibility might be to look for a more general word and then to collect all of the more specific words that are related. In other words, we look for the hypernym of \"bug\" and then we look for all hyponyms of that word. Let's start by looking for hypernyms." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Synset('insect.n.01')]" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bugSynset.hypernyms()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that there's only one hypernym for our bug synset, so let's create a variable for that synset." 
] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Synset('insect.n.01') small air-breathing arthropod\n" ] } ], "source": [ "bugHypernym = wn.synset(\"insect.n.01\")\n", "print(bugHypernym, bugHypernym.definition())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've gone up a level in generality from \"bug\" to \"insect\"; now we want to go back down by asking for the hyponyms (more specific meanings) of \"insect\"." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Synset('beetle.n.01'),\n", " Synset('bird_louse.n.01'),\n", " Synset('bug.n.01'),\n", " Synset('collembolan.n.01'),\n", " Synset('defoliator.n.01'),\n", " Synset('dictyopterous_insect.n.01'),\n", " Synset('dipterous_insect.n.01'),\n", " Synset('earwig.n.01'),\n", " Synset('ephemerid.n.01'),\n", " Synset('ephemeron.n.01'),\n", " Synset('flea.n.01'),\n", " Synset('gallfly.n.03'),\n", " Synset('hemipterous_insect.n.01'),\n", " Synset('heteropterous_insect.n.01'),\n", " Synset('holometabola.n.01'),\n", " Synset('homopterous_insect.n.01'),\n", " Synset('hymenopterous_insect.n.01'),\n", " Synset('imago.n.02'),\n", " Synset('leaf_miner.n.01'),\n", " Synset('lepidopterous_insect.n.01'),\n", " Synset('louse.n.01'),\n", " Synset('mecopteran.n.01'),\n", " Synset('neuropteron.n.01'),\n", " Synset('odonate.n.01'),\n", " Synset('orthopterous_insect.n.01'),\n", " Synset('phasmid.n.01'),\n", " Synset('pollinator.n.01'),\n", " Synset('proturan.n.01'),\n", " Synset('psocopterous_insect.n.01'),\n", " Synset('pupa.n.01'),\n", " Synset('queen.n.01'),\n", " Synset('social_insect.n.01'),\n", " Synset('stonefly.n.01'),\n", " Synset('termite.n.01'),\n", " Synset('thysanopter.n.01'),\n", " Synset('thysanuran_insect.n.01'),\n", " Synset('trichopterous_insect.n.01'),\n", " Synset('web_spinner.n.01'),\n", " Synset('worker.n.03')]" ] }, "execution_count": 51, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "bugHypernym.hyponyms()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's quite a list of critters. We can see \"beetle.n.01\", \"bug.n.01\" and many other synsets, but we also want the hyponyms of these hyponyms, and the hyponyms of those hyponyms, and so on, until no further hyponyms are found. In programming, this kind of repeated descent is usually handled with recursion (a function that calls itself), and it can be accomplished in several ways.\n", "\n", "One very succinct but somewhat tricky way of achieving this in Python would look something like this:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['tusser', 'tussock_moth', 'tussore', 'tussur', 'two-spotted_ladybug', 'two-winged_insects', 'tzetze', 'tzetze_fly', 'underwing', 'vedalia', 'velvet_ant', 'vespid', 'vespid_wasp', 'viceroy', 'vinegar_fly', 'walking_leaf', 'walking_stick', 'walkingstick', 'warble_fly', 'wasp', 'water_beetle', 'water_boatman', 'water_bug', 'water_scorpion', 'water_skater', 'water_strider', 'wax_insect', 'wax_moth', 'web_spinner', 'webbing_clothes_moth', 'webbing_moth', 'webworm_moth', 'weevil', 'wheel_bug', 'whirligig_beetle', 'white-faced_hornet', 'white_admiral', 'white_ant', 'whitefly', 'wood_ant', 'woolly_adelgid', 'woolly_alder_aphid', 'woolly_aphid', 'woolly_apple_aphid', 'woolly_plant_louse', 'worker', 'worker_bee', 'yellow-fever_mosquito', 'yellow_hornet', 'yellow_jacket']\n" ] } ], "source": [ "insectHyponymsTricky = sorted(set([l.name() for s in bugHypernym.closure(lambda s:s.hyponyms()) for l in s.lemmas()]))\n", "print(insectHyponymsTricky[-50:]) # peek at last 50" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yikes, that looks a bit scary, with the [closure](http://www.shutupandship.com/2012/01/python-closures-explained.html), [lambda function](https://docs.python.org/2/reference/expressions.html#lambda), and multiple ```for``` structures. 
We're going to take a different approach by defining our own function. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Defining Functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To define a function in Python you use the keyword ```def```. A function usually receives arguments and returns a value. Here's a simplified example:\n", "\n", "```python\n", "def add_five(val):\n", "    return val+5 # add 5 to the value and return it\n", "\n", "add_five(0) # returns 5\n", "add_five(5) # returns 10\n", "```\n", "\n", "Now imagine that we keep adding 5 until we reach at least 20:\n", "\n", "```python\n", "def add_five_until_twenty(val):\n", "    val+=5\n", "    if val >= 20: # we have at least 20, so return\n", "        return val\n", "    else: # call this function again\n", "        return add_five_until_twenty(val)\n", "\n", "print(add_five_until_twenty(0)) # prints 20\n", "```\n", "\n", "We will do something similar for finding hyponyms, as per below. Note that we're not returning a value in the function below; instead, we're accumulating elements in a list that gets passed as an argument to the function. This is a bit less elegant than some other solutions, but it allows our list of hyponym names to remain a flat, one-dimensional list (instead of nested lists that we would then need to flatten)."
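, "\n", "\n",
"Before turning to WordNet, here is a minimal sketch of the same accumulate-into-a-list pattern on a made-up hierarchy. The names `toy_tree`, `collect_descendants` and `toy_names` are hypothetical, invented for illustration only; they are not part of NLTK or WordNet.\n",
"\n",
"```python\n",
"# hypothetical toy hierarchy (not WordNet data): each term maps to its more specific terms\n",
"toy_tree = {\n",
"    \"insect\": [\"beetle\", \"bug\"],\n",
"    \"beetle\": [\"scarab\", \"weevil\"],\n",
"    \"bug\": [],\n",
"    \"scarab\": [],\n",
"    \"weevil\": [],\n",
"}\n",
"\n",
"def collect_descendants(term, names):\n",
"    for child in toy_tree.get(term, []): # go through this term's children\n",
"        names.append(child) # add the child to our flat list\n",
"        collect_descendants(child, names) # then descend into the child's own children\n",
"\n",
"toy_names = [] # this list is shared by every recursive call\n",
"collect_descendants(\"insect\", toy_names)\n",
"print(sorted(toy_names))\n",
"```\n",
"\n",
"Because every call appends to the same list, the result stays flat however deep the hierarchy goes."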
] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def collect_hyponym_lemma_names(synset, hyponym_names):\n", "    for hyponym in synset.hyponyms(): # go through this synset's hyponyms\n", "        for lemma in hyponym.lemmas(): # go through each of this hyponym's lemmas\n", "            hyponym_names.append(lemma.name()) # add this lemma name to our list\n", "        collect_hyponym_lemma_names(hyponym, hyponym_names) # recurse into this hyponym's hyponyms" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['tusser', 'tussock_moth', 'tussore', 'tussur', 'two-spotted_ladybug', 'two-winged_insects', 'tzetze', 'tzetze_fly', 'underwing', 'vedalia', 'velvet_ant', 'vespid', 'vespid_wasp', 'viceroy', 'vinegar_fly', 'walking_leaf', 'walking_stick', 'walkingstick', 'warble_fly', 'wasp', 'water_beetle', 'water_boatman', 'water_bug', 'water_scorpion', 'water_skater', 'water_strider', 'wax_insect', 'wax_moth', 'web_spinner', 'webbing_clothes_moth', 'webbing_moth', 'webworm_moth', 'weevil', 'wheel_bug', 'whirligig_beetle', 'white-faced_hornet', 'white_admiral', 'white_ant', 'whitefly', 'wood_ant', 'woolly_adelgid', 'woolly_alder_aphid', 'woolly_aphid', 'woolly_apple_aphid', 'woolly_plant_louse', 'worker', 'worker_bee', 'yellow-fever_mosquito', 'yellow_hornet', 'yellow_jacket']\n" ] } ], "source": [ "insect_hyponym_names = [] # this list will keep track of our hyponyms\n", "collect_hyponym_lemma_names(bugHypernym, insect_hyponym_names) # call our function with the bugHypernym\n", "insect_hyponym_names = sorted(set(insect_hyponym_names))\n", "print(insect_hyponym_names[-50:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make this functionality even more convenient by defining a new function that takes a single synset and returns a sorted list of the hyponym names found under the synset's hypernym. 
Defining functions lets us organize reusable code into small units." ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_hyponym_names_from_hypernym(synset):\n", "    names = []\n", "    for hypernym in synset.hypernyms():\n", "        collect_hyponym_lemma_names(hypernym, names)\n", "    return sorted(set(names))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bug Hunting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use our shiny new function to get all the lemmas that are hyponyms of the hypernym (\"insect\") of our bug synset. This is the same list as we had above, but it demonstrates the retrieval of a word list from a synset in one fell swoop." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Synset('bug.n.01') has 886 hyponyms\n" ] } ], "source": [ "bug_hypernym_hyponyms = get_hyponym_names_from_hypernym(bugSynset)\n", "print(bugSynset, \"has\", len(bug_hypernym_hyponyms), \"hyponyms\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which of our _Gold Bug_ lemmas are in our insect word list? Easy." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['bee', 'bug', 'beetle', 'scarabaeus', 'soldier']" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bugRelatedWords = list(set([word for word in goldBugLemmas if word in bug_hypernym_hyponyms]))\n", "bugRelatedWords" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nice! We used WordNet to find words related to bugs (of the insect persuasion). The only word that may not fit is \"soldier\"; let's have a look."
] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Displaying 1 of 1 matches:\n", " looking dis here way wid he head down and he soldier up and a white a a gose And den he keep a syp\n" ] } ], "source": [ "goldBugText = nltk.Text(goldBugLemmas)\n", "goldBugText.concordance(\"soldier\", 100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's one match, and it's not all that easy to decipher, but \"soldier\" here seems to be a verb and to have little to do with bugs. So we can remove it." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['bee', 'bug', 'beetle', 'scarabaeus']" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bugRelatedWordsFiltered = [word for word in bugRelatedWords if \"soldier\" not in word]\n", "bugRelatedWordsFiltered" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At the simplest level, we can use our bug list to create a new dispersion plot. 
" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAagAAAEZCAYAAADCJLEQAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAGiBJREFUeJzt3XeUZWWd7vHvIw0GGlpRlhjARkcR0FGCCqhjGcY0irrE\nBJiv4yiGpV6zDt3j6EUdrxHGq8tBBnRU0MuYBsVQJkAECY0NKApIEExcQEZE5Hf/2G/Rh6I6V3hP\n8/2sdVbv8+53v/u3d1ed5+xQ56SqkCSpN7da6AIkSZqJASVJ6pIBJUnqkgElSeqSASVJ6pIBJUnq\nkgGlW5QkD09yziyMc0GSR2/E8gck+drG1jFbZmu/bMB6b0hyz/ler8aDAaWubWwQTFdV36uq+87G\nUO1xM0k+meRPSa5qjxVJ3pVk65E6PlVVj5uFOmbFLO6Xm0iytIXQ1e1xfpI3bsA4L0jyvdmuT30z\noNS71QZBxwp4d1VtDdwJeCGwF/CDJLdbqKKSLOTv+5Kq2gp4DvCPSR67gLVoTBhQGksZvCnJeUl+\nm+SzSe7Q5v1rkmNG+r47yTfa9ESSi0bmbZ/kC0l+3cb5cGu/V5JvtbbfJDkqyZL1KRGgqq6rqlOA\nfYE7MoTVTY4I2ra8P8nlSa5McmaSXdq8Tyb5aJKvt6OxySQ7jNR/3yTHJ/ldknOSPGNk3ifbvvhq\nkj8AE0memGRlG+viJK9bzX7Zua3riiRnJXnytHEPTfLlNs5J63qarqpOAn4C3O9mOyxZkuTf2//F\nBUne2vbNzsC/Anu3o7Dfr+t/gsabAaVx9SqGF/2/Ae4CXAEc2ua9Frh/kucneTjwIuB50wdIshnw\nZeB84B7A3YDPjHR5Zxt7Z2B7YNmGFltVfwCOBx4+w+zHtvZ7V9US4BnA6Ivw/sA/MRyNnQ58qtW/\nZRvzKGBb4NnAYe0FfcpzgHdU1WLgBOATwEva0d2uwLemF5Nkc+BLwHFt3FcCn0pyn5Fuz2LYH3cA\nzmPYV2vSsiYPbes9bYY+Hwa2AnYEHsHwf/bCqjob+AfgxKraqqq2Wcu6tIkwoDSuXgq8raourao/\nA8uB/ZLcqqr+CDwXeD9wJPCKqrp0hjEezBBAr6+qP1bVn6rqBwBV9fOq+mZV/bmqftvGesRG1vwr\nYKYX1z8zvDDv3Oo/t6ouG5n/5ar6flVdB7yV4Uji7sCTgPOr6oiquqGqTge+wBBwU46tqhPbNl0L\nXAfsmmTrqrqyqmYKir2ALavqkKq6vqq+zRDkzxnp84WqOqWq/sIQmA9cy7b/Fvgd8HHgjW3MG7U3\nC88C3lxV11TVhcD7GP4foR2R6pbFgNK4Wgr833YK6gpgJXA9cGeAqjoZ+EXre/RqxtgeuLCqbpg+\nI8mdk3ymnQa7kiHo7riRNd+N4UX6JqrqW8BHGI4AL0/yf5JsNTUbuHik7zUMR1d3ZTjqe8jUPmj7\nYX/aPmjL3njarnk68ETggnYKb68Z6rzrDMtd2Nqnxr18ZN4fgcWr3erBHatqm6rapao+MsP8OwGb\nt/VM+SXDPtMtlAGlcfVL4PFVdYeRx+2q6lcASQ4CtgAuBd6wmjEuAnZo796nexfwF+B+7bTbc1m/\n35eb3NiRZDHwGGDGO9Gq6sNVtSewC3Af4PVTizIE6eg42wCXMOyD70zbB1tV1UGrLWo46nkqw6m7\nY4HPzdDtUmD7JKNHLfdo65wrv2U4klw60rYDq8J53G6U0SwwoDQOtkhym5HHIuCjwLumbhhIsm2S\nfdv0fYB3AAcwXMd4Q5IHzDDuyQyn3Q5Jcrs29j5t3mLgGuCqJHdjVWCsi7QHSW6dZA+GMPgdcPjN\nOid7JnlIu/bz38C1DOE4
5YlJHppki7ZdJ1bVJcBXgPskOTDJ5u3xoCRTt4tn2no2z/D3V0vaqbmr\np61nyg9bHW9oy0wwnE6cuj4366fbWj2fA96ZZHGSewCvYbi+BsMR293bPtIthAGlcfBVhhfMqcc/\nAh8Evgh8PclVwInAg9vR0JHAIVW1oqrOA94CHDny4lZw44vik4G/YjgauQh4ZuuzHNgduJLhhoHP\ns+7v4ovhxf0qhiODI4AfAfu062NTfabG2xr4GMOpuwvaMu8d6fdp4GCGgNsNOLDVfzXDDRbPZji6\n+RXwvxiOHKevY8qBwPnttOXfM4T4aN20a11PBp4A/Ibh9ONzq+qnaxh3TftmXee9kuFNwS8YjjQ/\nxapA/ybD3X+XJfn1GsbTJiR+YaHUrySHAxdX1dsXuhZpvnkEJfXNu9d0i2VASX0bx0/SkGaFp/gk\nSV3yCEqS1KVFC11AT5J4OClJG6CqZv16qUdQ01TV2D4OPvjgBa/B+he+jlti/eNc+6ZQ/1wxoCRJ\nXTKgJEldMqA2IRMTEwtdwkax/oU1zvWPc+0w/vXPFW8zH5Gk3B+StH6SUN4kIUm6pTCgJEldMqAk\nSV0yoCRJXTKgJEldMqAkSV0yoCRJXTKgJEldMqAkSV0yoCRJXTKgJEldMqAkSV0yoCRJXTKgJEld\nMqAkSV0yoCRJXTKgJEldMqAkSV0yoCRJXTKgJEldMqAkSV0yoCRJXTKgJEldMqAkSV0yoCRJXTKg\nJEldMqAkSV0yoCRJXTKgJEldMqAkSV0yoCRJXTKgJEldMqAkSV0yoCRJXRrbgEpYmrBioeuQJM2N\nsQ0oSdpUTE4udAV9GveAWpRwVMLKhKMTbpuwR8JkwikJxyVsB5Bwr4T/au3fTdhpoYuXJDCgVmfc\nA2on4NAqdgGuAl4BfAjYr4o9gcOBd7a+HwNe2dpfDxy2APVKktbRooUuYCNdVMWJbfoo4K3A/YDj\nEwA2Ay5N2BLYBzi6tQNsMdOAy5Ytu3F6YmKCiYmJ2a9aksbY5OQkk/Nw2JeqmvOVzIWEpcBkFUvb\n80cxHEFtV8U+0/puDZxTxV3XPGZqXPeHpPG1bNnwGFdJqKqsvef6GfdTfDsk7NWm9wdOAradakvY\nPGGXKq4Czk/Yr7Un4a8XpmRJ0roY54Aq4FzgoISVwBLa9Sfg3QmnA6cBe7f+BwAvbu1nAfvOf8mS\ndHNeSZjZ2J7imwue4pOk9ecpPknSLYoBJUnqkgElSeqSASVJ6pIBJUnqkgElSeqSASVJ6pIBJUnq\nkgElSeqSASVJ6pIBJUnqkgElSeqSASVJ6pIBJUnqkgElSeqSASVJ6pIBJUnqkgElSeqSASVJ6pIB\nJUnqkgElSeqSASVJ6pIBJUnqkgElSeqSASVJ6pIBJUnqkgElSeqSASVJ6pIBJUnqkgElSeqSASVJ\n6pIBJUnqkgElSeqSASVJ6pIBJUnq0tgGVMLShBULXYckaW6MbUBJkjZt4x5QixKOSliZcHTC7RIu\nSNgGIGHPhG+36W0Tjk84K+Hjo/3myuTk8PjAB4Z/12e52TK67g0d9wMfmPn59PY1jT86b0PreMUr\nVi0/W9uyJuu6jtn8/9pYPdWykNwP6276vupp3417QO0EHFrFLsBVwMuBWk3fg4FvVHE/4Bhgh7ku\nbuqF9NhjFy6gRte9oeMee+zMz6e3z3VAffnLq5afrW1ZEwNqfLkf1p0BNXcuquLENn0U8LA19H0o\n8BmAKr4GXDHHtUmSNsKihS5gI40eLQW4AbieVcF7m2n9s7YBly1bduP0xMQEExMTG1WgJG1qJicn\nmZyHQ61xD6gdEvaq4iRgf+D7wFbAnsBxwNNH+v4AeCbwnoTHAneYacDRgJIk3dz0N+/Lly+fk/WM\n8ym+As4FDkpYCSwBDgOWAx9M+BHD0dTUUdZy4LHt1vT9gMuAq+e9aknSOhnbI6gqLgR2nm
HW9xlu\nnpjuSuBxVfwlYW9gzyr+PJc1Tr3BuP3t4YEPXP/lZsNTn7pq3Rs67lOfOvPz6e1rGn903obW8aQn\nbdzycPOa12Rd19PTWeCeallI7od1N31f9bTvUrW6m942LQl/BXyO4ajxOuBlVZx60z6pW8r+kKTZ\nkoSqWus1/vUe1xfkVQwoSVp/cxVQ43wNSpK0CTOgJEldMqAkSV0yoCRJXTKgJEldMqAkSV0yoCRJ\nXTKgJEldMqAkSV0yoCRJXTKgJEldMqAkSV0yoCRJXTKgJEldMqAkSV0yoCRJXTKgJEldMqAkSV0y\noCRJXTKgJEldMqAkSV0yoCRJXTKgJEldMqAkSV0yoCRJXTKgJEldMqAkSV0yoCRJXTKgJEldMqAk\nSV0yoCRJXTKgJEldMqAkSV0yoCRJXeoioBKWJqyYhXGekrDzyPNPJjx9Y8eVJM2/LgJqFj0N2GXk\nebWHJGnM9BRQixKOSliZcHTCbRP2SJhMOCXhuITtABLulfBfrf27CTsl7AM8GXhvwo8T7tnGTVtm\nxrF6MTm5fu2zvZ7ZGG+msWdzfZOTszPeuowzvc+69N+YejZ2jPmyITVOX2a+fta1/nr7WewpoHYC\nDq1iF+Aq4BXAh4D9qtgTOBx4Z+v7MeCVrf31wGFVnAB8EfifVexexS9a30rYHPgw8PQZxuqCAbVu\n6zKgFpYBtWnr7Wdx0UIXMOKiKk5s00cBbwXuBxyfALAZcGnClsA+wNGtHWCLkXHCTYUh/HYFvjE6\n1mxvgCRp9vQUUKPXisJwFPWTKvYZ7ZSwNXBFFbutwzijbjbWTJYtW3bj9MTEBBMTE2tbRJJuUSYn\nJ5mch8OsngJqh4S9qjgJ2B84CXjJVFs7TXfvKlYmnJ+wXxXHJAS4fxVnAlcDW08bt4BzgW1nGmt6\nEaMBJUm6uelv3pcvXz4n6+nlGtRUiByUsBJYQrv+BLw74XTgNGDv1v8A4MWt/Sxg39b+GeD1CaeO\n3CRBFX9ew1iSpA51cQRVxYWw6u+XRpwBPGKG/hcAT5ih/QSGa01TXjgyb8axerG6M4mzfYZxLseb\naezZXN9sjbUu40zvs7ZlNqa2qWXH4WzyhtS4rvtyHLZ/U9fbz2Kq/DOhKUnK/SFJ6ycJVTX9BrWN\n1sspPkmSbsKAkiR1yYCSJHXJgJIkdcmAkiR1yYCSJHXJgJIkdcmAkiR1yYCSJHXJgJIkdcmAkiR1\nyYCSJHXJgJIkdcmAkiR1yYCSJHXJgJIkdcmAkiR1yYCSJHXJgJIkdcmAkiR1yYCSJHXJgJIkdcmA\nkiR1yYCSJHXJgJIkdcmAkiR1yYCSJHXJgJIkdcmAkiR1yYCSJHXJgJIkdcmAkiR1yYCSJHXJgJIk\ndcmAkiR1qZuASvjDQtcgSerHvAVUwqK1dKl5KUTSnJqcXOgKtKlYY0AlbJnwlYTTE1YkPDPhQQkn\ntLYfJixOWJrw3YRT22PvtvxEwvcS/hM4q7Udm3BKwlkJL5m2vv/d2r+RcKfW9pKEk9v6jkm4bWvf\ntj0/uT32ae3LEl43MuZZCTvMtC2zuiclAQaUZs/ajmoeD1xSxd8BJGwNnAY8s4pTExYDfwQuB/62\nij8l3Bv4NPCgNsZuwK5VXNiev7CKK1rQnJxwTBVXAFsCP6ritQlvBw4GXgl8voqPt/W/A3gx8BHg\ng8D7q/hBwg7AccAu3PxIrICsZlskSZ1aW0CdCfxLwiHAl4ErgV9VcSpA1XDdKGEL4CMJDwD+Atx7\nZIyTR8IJ4NUJT23T27e+JwM3AJ9t7UcBX2jT90/4Z2AJsJghiAAeA+yc3DjuVglbrmY7avq2VPH9\nmTouW7bsxumJiQkmJiZWM6Qk3TJNTk4yOQ+HymsMqCp+lrAb8HfAPwPfXk3X1zAE13MTNgOuHZl3\nzdREwgTwaGCvKq5N+DZwmxnGC6uOhD4J7FvFioTnA4
8Y6fOQKq67yYLhem566vI2M21LwjereMf0\nFY8GlCTp5qa/eV++fPmcrGdt16DuAlxbxaeAfwEeDGyXsGebv1ULpK2By9pizwM2W82QWwNXtHC6\nL7DXtFqe0ab3B77XphcDlyVsDhw40v/rwKtGan1gm7wA2L217Q7suJpt2X1N2y5JWlhrO8V3f+C9\nCTcA1wEvYwiSD7drSP/NcKrtMODzCc9jOAU3esv46DWh44B/SFgJnAucODLvGuDBCW9juKb1rNb+\nduCHwG/av4tb+6uAQxPOaNvxHeDlwOeB5yWc1fqfu4ZtkTTLPCuu2ZIq7+6ekqTcH5K0fpJQVVl7\nz/XTzR/qSpI0yoCSJHXJgJIkdcmAkiR1yYCSJHXJgJIkdcmAkiR1yYCSJHXJgJIkdcmAkiR1yYCS\nJHXJgJIkdcmAkiR1yYCSJHXJgJIkdcmAkiR1yYCSJHXJgJIkdcmAkiR1yYCSJHXJgJIkdcmAkiR1\nyYCSJHXJgJIkdcmAkiR1yYCSJHXJgJIkdcmAkiR1yYCSJHXJgJIkdcmAkiR1yYCSJHXJgJIkdcmA\nkiR1yYCSJHXJgJIkdcmA2oRMTk4udAkbxfoX1jjXP861w/jXP1cMqE3IuP+QW//CGuf6x7l2GP/6\n54oBJUnqkgElSepSqmqha+hGEneGJG2Aqspsj2lASZK65Ck+SVKXDChJUpcMqCbJ45Ock+RnSd64\n0PUAJNk+ybeT/CTJWUle1dq3SXJ8kp8m+XqS248s8+a2DeckeexI+x5JVrR5H5zn7dgsyWlJvjRu\n9Se5fZJjkpydZGWSh4xZ/a9pPzsrknw6ya17rT/JvyW5PMmKkbZZq7Vt+2db+0lJ7jEP9b+3/eyc\nkeQLSZaMU/0j816X5IYk28xr/VV1i38AmwHnAUuBzYHTgZ07qGs74IFtejFwLrAz8B7gDa39jcAh\nbXqXVvvmbVvOY9V1xpOBB7fprwKPn8fteC3wKeCL7fnY1A8cAbyoTS8CloxL/cDdgF8At27PPws8\nv9f6gYcDuwErRtpmrVbg5cBhbfpZwGfmof6/BW7Vpg8Zt/pb+/bAccD5wDbzWf+c/4KPwwPYGzhu\n5PmbgDctdF0z1Hks8BjgHODOrW074Jw2/WbgjSP9jwP2Au4CnD3S/mzgo/NU892BbwCPBL7U2sai\nfoYw+sUM7eNS/92AXwJ3YAjXL7UXzG7rby92oy/ws1Zr6/OQNr0I+M1c1z9t3tOAo8atfuBo4K+5\naUDNS/2e4hvcDbho5PnFra0bSZYyvLv5IcMv7OVt1uXAndv0XRlqnzK1HdPbL2H+tu/9wOuBG0ba\nxqX+HYHfJDk8yY+TfDzJloxJ/VV1CfA+hpC6FPh/VXU8Y1J/M5u13vh7XlXXA1eOnrKaBy9iOKKA\nMak/yVOAi6vqzGmz5qV+A2rQ9b32SRYDnwdeXVVXj86r4e1Il/UneRLw66o6DZjxbyR6rp/hXd7u\nDKcldgeuYTi6vlHP9Se5A7Avw7viuwKLkxw42qfn+qcbp1qnS/JW4Lqq+vRC17KuktwOeAtw8Gjz\nfNZgQA0uYTjPOmV7bvouYMEk2ZwhnI6sqmNb8+VJtmvz7wL8urVP3467M2zHJW16tP2Suay72QfY\nN8n5wH8Aj0pyJONT/8UM7x5/1J4fwxBYl41J/Y8Bzq+q37V3rF9gOJ09LvXD7PysXDyyzA5trEXA\nkqr6/dyVPkjyAuCJwAEjzeNQ/70Y3tyc0X6H7w6cmuTO81W/ATU4Bbh3kqVJtmC4gPfFBa6JJAE+\nAaysqg+MzPoiw8Vu2r/HjrQ/O8kWSXYE7g2cXFWXAVdluAMtwHNHlpkzVfWWqtq+qnZkOBf9rap6\n7hjVfxlwUZL7tKbHAD9huJbTff3AhcBeSW7b1vsYYOUY1T9V08bW+p8zjLUf8M25Lj7J4xlOcT+l\nqq4dmdV9/VW1oq
ruXFU7tt/hi4Hd2ynX+al/ti+yjesDeALDXXLnAW9e6HpaTQ9juHZzOnBaezwe\n2IbhxoOfAl8Hbj+yzFvaNpwDPG6kfQ9gRZv3oQXYlkew6i6+sakfeADwI+AMhiOQJWNW/zLg7Lbu\nIxjuuuqyfoaj7EuB6xiuVbxwNmsFbg18DvgZcBKwdI7rf1Fb14Ujv7+HjUH9f5ra/9Pm/4J2k8R8\n1e9HHUmSuuQpPklSlwwoSVKXDChJUpcMKElSlwwoSVKXDChJUpcMKGk9JXl/klePPP9ako+PPH9f\nktds4NgTaV9LMsO8hyX5Yfv6hrOTvGRk3rZt3qmt3zMyfD3Iev8xZ5K3bEjt0mwzoKT1932Gj3Ei\nya2AOzJ8/cCUvYEfrMtAbfl16bcdw1eWvLSqdmb4I+6XJnli6/Jo4Myq2qOqvg+8GPgfVfXodRl/\nmjdvwDLSrDOgpPV3IkMIAewKnAVcneHLDW/N8J1dP07y6PYp6Gcm+UT7GC2SXJDkkCSnAs/I8GWZ\nZ7fnT1vNOg8CDq+q0wGq6nfAG4A3JXkA8G7gKRm+GPIfgYcC/5bkPUl2TXJym3dGknu1Og5sR12n\nJfloklslOQS4bWs7cg72nbTOFi10AdK4qapLk1yfZHuGoDqR4asE9gauAs5k+BLMw4FHVdV5SY4A\nXgZ8kOETuX9bVXskuQ3Dx/g8sqp+nuSzzPyJ3bsAn5zWdiqwa1Wd0UJpj6qa+tblRwKvq6ofJ/kQ\n8IGq+nT7kM5FSXYGngnsU1V/SXIYcEBVvSnJQVW122ztL2lDeQQlbZgTGE7z7cMQUCe26anTezsx\nfJL4ea3/EcDfjCz/2fbvfVu/n7fnR7H6rzRY01cdZA3zTwTekuQNDJ9/di3DKcE9gFOSnAY8iuH7\nr6RuGFDShvkBw2m0+zN8MOZJrAqsE2boH256ZHTNasZdXcisZAiUUXswnF5co6r6D+DJwB+Br7aj\nK4Ajqmq39rhvVf3T2saS5pMBJW2YE4AnAb+rwRXA7RmOoE5gOG23dOp6D8PXDnxnhnHOaf3u2Z4/\nZzXrOxR4QbveRJI7AocA71lboUl2rKrzq+rDDF99cH+GrzrYL8m2rc82SXZoi/y5nQqUFpQBJW2Y\nsxju3jtppO1Mhq9V/307jfZC4OgkZwLXAx9t/W48kmr9/h74SrtJ4nJmuAZVw/fsHAh8PMnZDEdw\nn6iqr4yMubqvJnhmkrPaqbxdgX+vqrOBtwFfT3IGw1dZbNf6fww405sktND8ug1JUpc8gpIkdcmA\nkiR1yYCSJHXJgJIkdcmAkiR1yYCSJHXJgJIkdcmAkiR16f8DOWei3bHzb4kAAAAASUVORK5CYII=\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "goldBugText.dispersion_plot(bugRelatedWordsFiltered)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also do our own kind of lemmatization to group identify all occurrences of our bug words as the same." 
] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZkAAAEZCAYAAABFFVgWAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAFc9JREFUeJzt3XuUZWV95vHvgw0YBJEmLJCbEJUI6IigclG0RccYFipZ\n8QIKE3DGmAkYl3HkZkYgWTFoxqWGaEhcXgig4oUQQZeCSjsCjYhyFdAgF7kIChJABuT2mz/2W/Sh\nqG6qu+utqtP9/ax1Vu3a+z3v/u1dVfs5796n9klVIUlSD2vNdQGSpNWXISNJ6saQkSR1Y8hIkrox\nZCRJ3RgykqRuDBmNpSR7Jrl6Bvq5PskrV+H5b03yzVWtY6bM1H5ZifU+kuT3Znu9mv8MGc2KVT2Y\nT1ZV36uq58xEV+3xOEk+m+S3Se5uj8uTfCDJU0fqOKWq/mAG6pgRM7hfHiPJNi1I7mmP65IcvhL9\nHJTkezNdn+YvQ0azZZkH83msgA9W1VOB3wUOBnYDzkuy3lwVlWQu/243rKoNgP2B9yd59RzWojFg\nyGhOZXBEkmuS3J7k1CQbtWX/lOTLI20/mORbbXpRkhtHlm2V5LQkv2z9HN/mPzPJd9q8XyU5OcmG\nK1IiQFU9UFUXAa8DNmYInMe8Mm/b8pEktyW5K8llSXZoyz6b5IQkZ7VR0eIkW4/U/5wkZye5I8nV\nSd44suyzbV98PclvgEVJ9k5yZevrpiTvWcZ+2b6t684kVyR57aR+P57kzNbPBdM95VVVFwA/Bp77\nuB2WbJjkX9vP4vok72v7Znvgn4Dd22jo19P9IWh8GTKaa3/BcOB+GfB04E7g423ZXwLPS/InSfYE\n3gb8t8kdJHkScCZwHfAMYAvgCyNN/rb1vT2wFXDMyhZbVb8Bzgb2nGLxq9v8Z1fVhsAbgdED6VuA\nv2YYFV0CnNLqf0rr82RgE2A/4BPtoDxhf+Bvqmp94HzgU8Db2yhrR+A7k4tJsjZwBvCN1u87gVOS\nbDfS7M0M+2Mj4BqGfbU8LS/ykrbei6doczywAbAt8HKGn9nBVXUV8GfAkqraoKoWPsG6tBowZDTX\n3gH8VVXdUlUPAscCb0iyVlXdBxwIfAQ4CTi0qm6Zoo8XM4TIe6vqvqr6bVWdB1BVP6uqb1fVg1V1\ne+vr5atY8y+AqQ6QDzIcXLdv9f+kqm4dWX5mVZ1bVQ8A72N4Rb8lsA9wXVWdWFWPVNUlwGkMITXh\n9Kpa0rbpfuABYMckT62qu6pqqoP9bsBTquq4qnqoqs5hCOP9R9qcVlUXVdXDDKG30xNs++3AHcAn\ngcNbn49qgf9m4MiqureqbgA+zPBzhDYy1JrDkNFc2wb4t3Y6507gSuAhYFOAqroQuLa1/dIy+tgK\nuKGqHpm8IMmmSb7QTindxRBWG69izVswHGgfo6q+A/wjw0jstiT/nGSDicXATSNt72UY5WzOMPra\ndWIftP3wFto+aM999BRY88fA3sD17XTYblPUufkUz7uhzZ/o97aRZfcB6y9zqwcbV9XCqtqhqv5x\niuW/C6zd1jPh5wz7TGsgQ0Zz7efAa6pqo5HHelX1C4AkhwDrALcAhy2jjxuBrdur6Mk+ADwMPLed\nwjqQFfu9f8ybFZKsD7wKmPIdUlV1fFW9ENgB2A5478RTGcJwtJ+FwM0M++C7k/bBBlV1yDKLGkYf\n+zKcBjsd+OIUzW4BtkoyOnp4RltnL7czjOi2GZm3NUsDdtze/KFVZMhoNq2T5MkjjwXACcAHJi6C\nJ9kkyeva9HbA3wBvZTivf1iS50/R74UMp7COS7Je63uPtmx94F7g7iRbsPSgPx1pD5Ksm2QXhgP6\nHcBnHtc4eWGSXdu1kP8H3M
8QcBP2TvKSJOu07VpSVTcDXwO2S3JAkrXb40VJJt6KnEnrWTvD/+ds\n2E5z3TNpPRO+3+o4rD1nEcOpuYnrVTN+6qrV80Xgb5Osn+QZwLsZrjfBMHLasu0jrQEMGc2mrzMc\n9CYe7wc+BnwVOCvJ3cAS4MVtVHIScFxVXV5V1wBHASeNHKAKHj2wvRZ4FsOo4EbgTa3NscDOwF0M\nF8G/wvRfTRfDAfpuhlfoJwI/APZo14sm2kz091TgXxhOg13fnvP3I+0+BxzNEFIvAA5o9d/D8KaB\n/RhGGb8A/o5hBDd5HRMOAK5rpwD/lCGIR+umXft5LfCHwK8YTuUdWFU/XU6/y9s30132ToZgv5Zh\nxHcKS0P52wzvSrs1yS+X059WE/FDy6T+knwGuKmq/vdc1yLNJkcy0uzwXVVaIxky0uwYxzseSKvM\n02WSpG4cyUiSulkw1wX0kMThmSSthKqa0euHq+1IpqrG9nH00UfPeQ1rav3jXLv1z/1j3OvvYbUN\nGUnS3DNkJEndGDLz0KJFi+a6hFUyzvWPc+1g/XNt3OvvYbV8C3OSWh23S5J6SkJ54V+SNC4MGUlS\nN4aMJKkbQ0aS1I0hI0nqxpCRJHVjyEiSujFkJEndGDKSpG4MGUlSN4aMJKkbQ0aS1I0hI0nqxpCR\nJHVjyEiSujFkJEndGDKSpG4MGUlSN4aMJKkbQ0aS1I0hI0nqxpCRJHVjyEiSujFkJEndGDKSpG4M\nGUlSN4aMJKkbQ0aS1I0hI0nqxpCRJHVjyEiSujFkJEndGDKSpG4MGUlSN4aMJKkbQ0aS1I0hI0nq\nxpCRJHVjyEiSujFkJEndGDKSpG4MGUlSN4aMJKkbQ0aS1I0hI0nqxpCRJHVjyEiSujFkJEndGDKS\npG4MGUlSN4aMJKkbQ0aS1I0hI0nqxpCRJHVjyEiSujFkJEndGDKSpG4MGUlSN4aMJKkbQ0aS1I0h\nI0nqxpCRJHVjyEiSujFkJEndGDKSpG4MGUlSN4aMJKkbQ0aS1I0hI0nqxpCRJHVjyEiSujFkJEnd\nGDKSpG4MGUlSN4aMJKkbQ0aS1I0hI0nqxpCRJHVjyEiSujFkJEndGDKSpG4MGUlSN4aMJKkbQ0aS\n1I0hI0nqxpCRJHVjyEiSujFkJEndGDKSpG4MGUlSN4aMJKkbQ0aS1I0hI0nqxpCRJHVjyEiSujFk\nJEndGDKSpG4MGUlSN4aMJKkbQ0aS1I0hI0nqZrkhk7BNwuWzVcyKSjgo4fi5rkOSNLWxGskks1/v\n4sXDA+DQQ6fXfvL06LyV9dGPPraWFe1z9LmjfcJjt2uqdqPrm2r7VsShhy5d78TXFTVR47Jqnard\n6LzltV0VM9HHRD+9TWcdk9tM92c/k/XPxr6YSXNV71R/2/Nl303noL0g4eSEKxO+lLBewvUJCwES\nXphwTpveJOHshCsSPtnabZzw3oR3tjYfSfh2m94r4eQ2vX/CZQmXJxw3sfKE3yT8n4RLgN0TDk74\nScL3gT1meH88zuiB48wzp9d+8vRM/LBPP33mQ+b004evo9vVO2TOPHPpeie+rihDZnbWYcisuPkS\nMhPHi/lgOiHz+8DHq9gBuBv4c6CW0fZo4FtVPBf4MrB1a/t/gT1bmxcCT0lY0OZ9N2Fz4DjgFcBO\nwIsSXt/arwdcUMVOwLXAMQzh8lJgh+XUIkmaYwum0ebGKpa06ZOBdy2n7UuAfQGq+GbCnW3+j4Bd\nEjYA7gcuYgiblwLvBF4ELK7iDoCEU4CXAf8OPAx8pfWzK3DOSLtTge2mKuSYY455dHrRokUsWrRo\nGpsqSWuOxYsXs7jzkGc6ITM6UgjwCPAQS0dBT57UPo/roHgw4TrgIOB84DJgL+BZVVydPC4oMrLe\n+6sena5J/T9uXRNGQ0aS9HiTX4Afe+yxM76O6Zwu2zphtzb9FuBc4HqGkQjAH4+0PQ94E0DC
q4GN\nRpZ9D/hfwHfb9J8xjHAAfgC8vF2/eRKwX2s32YWt3cKEtYE3TqN+SdIceaKRTAE/AQ5J+DTwY+AT\nDAf7TyXcDSxm6ajjWODzCQcCS4BbgXvasnOBo4AlVdyXcB9D2FDFLxKOAM5hGJ2cWcUZIzUw0u6Y\n1vd/AhfT+ZrM6Fm2ffZZsfYT0zNxpm7ffWGnnaZez3RM1X7ffYevo9u1rH6n2paV2a599oFnPeux\n619R013vVO2eaPtWxUydkZ2NM7vTWcfkNtP92c9k/eN2lnuu6p283snHi7mUqpk7RiesAzxcxcMJ\nuzO8YWDnGVvBtOtIzeR2SdKaIAlVtczLECtjOtdkVsTWwBfb/7M8ALx9hvuXJI2RGR3JzBeOZCRp\nxfUYyYzVf/xLksaLISNJ6saQkSR1Y8hIkroxZCRJ3RgykqRuDBlJUjeGjCSpG0NGktSNISNJ6saQ\nkSR1Y8hIkroxZCRJ3RgykqRuDBlJUjeGjCSpG0NGktSNISNJ6saQkSR1Y8hIkroxZCRJ3RgykqRu\nDBlJUjeGjCSpG0NGktSNISNJ6saQkSR1Y8hIkroxZCRJ3RgykqRuDBlJUjeGjCSpG0NGktSNISNJ\n6saQkSR1Y8hIkroxZCRJ3RgykqRuDBlJUjeGjCSpG0NGktSNISNJ6saQkSR1Y8hIkroxZCRJ3Rgy\nkqRuDBlJUjeGjCSpG0NGktSNISNJ6saQkSR1Y8hIkroxZCRJ3RgykqRuDBlJUjeGjCSpG0NGktSN\nISNJ6saQkSR1Y8hIkroxZCRJ3RgykqRuDBlJUjeGjCSpG0NGktSNISNJ6saQkSR1Y8hIkroxZCRJ\n3RgykqRuDBlJUjeGjCSpG0NGktSNISNJ6saQkSR1Y8hIkroxZCRJ3RgykqRuDBlJUjeGjCSpG0NG\nktSNISNJ6saQkSR1Y8hIkroxZCRJ3RgykqRuDBlJUjeGjCSpG0NGktSNISNJ6saQkSR1Y8hIkrox\nZCRJ3RgykqRuDBlJUjeGjCSpG0NGktSNISNJ6saQkSR1Y8hIkroxZCRJ3Rgy89DixYvnuoRVMs71\nj3PtYP1zbdzr78GQmYfG/Rd1nOsf59rB+ufauNffgyEjSerGkJEkdZOqmusaZlyS1W+jJGkWVFVm\nsr/VMmQkSfODp8skSd0YMpKkblarkEnymiRXJ/mPJIfPdT0TkmyV5JwkP05yRZK/aPMXJjk7yU+T\nnJXkaSPPObJtx9VJXj0yf5ckl7dlH5vFbXhSkouTnDGGtT8tyZeTXJXkyiS7jln9726/N5cn+VyS\ndedz/Uk+neS2JJePzJuxetv2n9rmX5DkGbNQ/9+3359Lk5yWZMNxqn9k2XuSPJJk4azVX1WrxQN4\nEnANsA2wNnAJsP1c19Vq2wzYqU2vD/wE2B74EHBYm384cFyb3qHVv3bbnmtYev3sQuDFbfrrwGtm\naRv+EjgF+Gr7fpxqPxF4W5teAGw4LvUDWwDXAuu2708F/mQ+1w/sCbwAuHxk3ozVC/w58Ik2/Wbg\nC7NQ/38F1mrTx41b/W3+VsA3gOuAhbNVf/c/8Nl6ALsD3xj5/gjgiLmuaxm1ng68Crga2LTN2wy4\nuk0fCRw+0v4bwG7A04GrRubvB5wwC/VuCXwLeAVwRps3LrVvCFw7xfxxqX8L4OfARgwBeUY74M3r\n+tsBa/QgPWP1tja7tukFwK961z9p2R8BJ49b/cCXgP/CY0Ome/2r0+myLYAbR76/qc2bV5Jsw/Aq\n4/sMf3S3tUW3AZu26c0Z6p8wsS2T59/M7GzjR4D3Ao+MzBuX2rcFfpXkM0l+lOSTSZ7CmNRfVTcD\nH2YImluA/6yqsxmT+kfMZL2P/q1X1UPAXaOnf2bB2xhe2cOY1J/k9cBNVXXZpEXd61+dQmbevxc7\nyfrAV4B3VdU9o8tqeFkw77YhyT7AL6vqYmDK98/P19qb
BcDODMP7nYF7GUa5j5rP9SfZCHgdwyvT\nzYH1kxww2mY+1z+Vcat3VJL3AQ9U1efmupbpSrIecBRw9Ojs2Vr/6hQyNzOcc5ywFY9N4jmVZG2G\ngDmpqk5vs29Lsllb/nTgl23+5G3ZkmFbbm7To/Nv7lk3sAfwuiTXAZ8H9kpy0pjUTlv3TVX1g/b9\nlxlC59Yxqf9VwHVVdUd71Xgaw6nhcal/wkz8vtw08pytW18LgA2r6tf9Sh8kOQjYG3jryOxxqP+Z\nDC9SLm1/x1sCP0yy6WzUvzqFzEXAs5Nsk2QdhgtSX53jmgBIEuBTwJVV9dGRRV9luIhL+3r6yPz9\nkqyTZFvg2cCFVXUrcHeGd0cFOHDkOV1U1VFVtVVVbctwXvY7VXXgONTe6r8VuDHJdm3Wq4AfM1zb\nmPf1AzcAuyX5nbbeVwFXjlH9E2bi9+Xfp+jrDcC3exef5DUMp4xfX1X3jyya9/VX1eVVtWlVbdv+\njm8Cdm6nL/vXP9MXnObyAfwhwzu3rgGOnOt6Rup6KcP1jEuAi9vjNcBChgvqPwXOAp428pyj2nZc\nDfzByPxdgMvbsn+Y5e14OUvfXTY2tQPPB34AXMowEthwzOo/BriqrftEhncCzdv6GUa8twAPMJy7\nP3gm6wXWBb4I/AdwAbBN5/rf1tZ1w8jf7yfGoP7fTuz/ScuvpV34n436va2MJKmb1el0mSRpnjFk\nJEndGDKSpG4MGUlSN4aMJKkbQ0aS1I0hozVWko8kedfI999M8smR7z+c5N0r2feitI9FmGLZS5N8\nv906/qokbx9Ztklb9sPW7o0ZPp5ghf9hL8lRK1O7NJMMGa3JzmW4bQ5J1gI2Zrj1+YTdgfOm01F7\n/nTabcbwkQnvqKrtGf5R9x1J9m5NXglcVlW7VNW5wH8H/kdVvXI6/U9y5Eo8R5pRhozWZEsYggRg\nR+AK4J4MH3K2LsNn/vwoySvbHZwvS/Kpdtsiklyf5LgkPwTemOFD865q3//RMtZ5CPCZqroEoKru\nAA4DjkjyfOCDwOszfEDc+4GXAJ9O8qEkOya5sC27NMkzWx0HtNHPxUlOSLJWkuOA32nzTuqw76Rp\nWTDXBUhzpapuSfJQkq0YwmYJw23MdwfuBi5j+DC8zwB7VdU1SU4E/ifwMYY7Cd9eVbskeTLDLVNe\nUVU/S3IqU99peAfgs5Pm/RDYsaoubcGyS1VNfHrqK4D3VNWPkvwD8NGq+ly7MeGCJNsDbwL2qKqH\nk3wCeGtVHZHkkKp6wUztL2llOJLRmu58hlNmezCEzJI2PXGq7PcZ7oJ8TWt/IvCykeef2r4+p7X7\nWfv+ZJZ9O/Xl3WY9y1m+BDgqyWEM94u6n+H02i7ARUkuBvZi+AwdaV4wZLSmO4/hlNTzGG4GeAFL\nQ+f8KdqHx45Q7l1Gv8sKiisZQmHULgyn6parqj4PvBa4D/h6G+UAnFhVL2iP51TVXz9RX9JsMWS0\npjsf2Ae4owZ3Ak9jGMmcz3AKbJuJ6x8Mtzz/7hT9XN3a/V77fv9lrO/jwEHt+gtJNmb4zPgPPVGh\nSbatquuq6niG264/j+E2629IsklrszDJ1u0pD7bTatKcMWS0pruC4V1lF4zMu4zhY45/3U5JHQx8\nKcllwEPACa3doyOa1u5Pga+1C/+3McU1mRo+p+MA4JNJrmIYSX2qqr420ueybo3+piRXtNNiOwL/\nWlVXAX8FnJXkUobb6G/W2v8LcJkX/jWXvNW/JKkbRzKSpG4MGUlSN4aMJKkbQ0aS1I0hI0nqxpCR\nJHVjyEiSujFkJEnd/H9Z6ahaiBuh3wAAAABJRU5ErkJggg==\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "goldBugLemmasForBugs = 
[\"bugword\" if word in bugRelatedWordsFiltered else word for word in goldBugLemmas]\n", "goldBugTextForBugs = nltk.Text(goldBugLemmasForBugs)\n", "goldBugTextForBugs.dispersion_plot([\"bugword\"]) # this plots a composition of bee, scarabaeus, bug and beetle" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "56 32\n" ] }, { "data": { "text/plain": [ "175.0" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(goldBugLemmasForBugs.count(\"bugword\"),goldBugTokens.count(\"bug\"))\n", "goldBugLemmasForBugs.count(\"bugword\") * 100 / goldBugTokens.count(\"bug\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a larger and more semantically robust net for catching our \"bug\" words, we might also be interested in seeing which other words occur in proximity to our bug words. It is as if we generated a concordance of our bug words and then counted all of the remaining words that aren't stop-words. Sometimes these are called collocates (co-located terms). An NLTK Text object has a convenient [similar()](http://www.nltk.org/api/nltk.html?highlight=similar#nltk.text.Text.similar) function that gives a related view: it lists the words that appear in the same contexts as a given word, which is not quite the same thing as the words that occur near it."
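, "\n", "\n",
"To make the idea of counting nearby words concrete, here is a rough sketch in plain Python. Everything in it is an assumption made for illustration: `toy_tokens` is an invented stand-in for the lemmatized story, `toy_stopwords` is a tiny made-up stop-word list, and the `window` of two tokens on either side is an arbitrary choice.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# hypothetical toy data standing in for goldBugLemmasForBugs\n",
"toy_tokens = [\"the\", \"gold\", \"bugword\", \"fell\", \"the\", \"man\", \"saw\", \"the\", \"gold\", \"bugword\", \"fell\"]\n",
"toy_stopwords = {\"the\"} # assumption: a minimal stop-word list\n",
"window = 2 # assumption: look two tokens to either side\n",
"\n",
"neighbours = Counter()\n",
"for i, token in enumerate(toy_tokens):\n",
"    if token == \"bugword\":\n",
"        start = max(0, i - window) # don't run off the start of the list\n",
"        context = toy_tokens[start:i] + toy_tokens[i + 1:i + 1 + window]\n",
"        neighbours.update(w for w in context if w not in toy_stopwords)\n",
"\n",
"print(neighbours.most_common(5))\n",
"```\n",
"\n",
"Unlike the list printed by ```similar()```, a counter of this kind keeps a raw count for each neighbouring word."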
] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "scarabæus tree skull parchment whole insect peg boat island hut spade\n", "hotel treasure pit scrap solution word hearth slip half\n" ] } ], "source": [ "goldBugTextForBugs.similar(\"bugword\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's convenient, but we don't get any frequency information, and with such a small corpus (one short story), it's probably not all that helpful.\n", "\n", "If we really want numeric information we can use a [ContextIndex](http://www.nltk.org/api/nltk.html?highlight=similar#nltk.text.ContextIndex) object, which in many ways is similar to a Text object. ```ContextIndex``` has a [word_similarity_dict()](http://www.nltk.org/api/nltk.html?highlight=similar#nltk.text.ContextIndex.word_similarity_dict) function that calculates a similarity score for _every_ word in the document, even ones that share no contexts with our word (it's not a raw frequency count, but it will serve our purposes)."
] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "goldBugBugContextIndex = nltk.text.ContextIndex(goldBugLemmasForBugs)\n", "goldBugBugSimilarities = goldBugBugContextIndex.word_similarity_dict(\"bugword\")\n", "del goldBugBugSimilarities['bugword'] # we don't want to include \"bugword\" itself in our dictionary\n", "goldBugBugSimilarityFreqs = nltk.FreqDist(goldBugBugSimilarities) # copy the dictionary into a FreqDist\n", "goldBugBugSimilarityFreqs.plot(30)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's not clear that this really helps to provide any insights into the text, not least because bug words are so evenly distributed throughout. Still, the primary purpose of this notebook was to explore how we could search and count more flexibly and powerfully using a mix of strategies ranging from lexical transformations such as case changes, stemming and lemmatization, to word sense lookups with WordNet."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next Steps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are some tasks to try:\n", "\n", "* What's the count of \"bug\" in our ```goldBugWordTokens``` list and of \"bugword\" in our ```goldBugLemmasForBugs``` list?\n", "* What's the percentage increase in coverage that we get by lemmatizing and looking up related words using WordNet?\n", "* Find another word in _The Gold Bug_ for which the percentage increase is greater, using the same process we followed:\n", "    * Choose a word and find the synset that is most relevant to the text's meaning\n", "    * Create a list of all the hyponyms of the hypernym of the synset (using our ```get_hyponym_names_from_hypernym``` function)\n", "    * Replace all occurrences of the hyponyms with a standard word form (we used \"bugword\" last time)\n", "\n", "In the next notebook we'll look at [Parts of Speech](PartsOfSpeech.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "[CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0/) From [The Art of Literary Text Analysis](ArtOfLiteraryTextAnalysis.ipynb) by [Stéfan Sinclair](http://stefansinclair.name) & [Geoffrey Rockwell](http://geoffreyrockwell.com). Edited and revised by [Melissa Mony](http://melissamony.com).\n", "Created February 7, 2015 and last modified January 14, 2018 (Jupyter 5.0.0)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 1 }