{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# An Overview of Natural Language Processing with Python NLTK"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Introduction\n",
    "\n",
    "In the broad field of artificial intelligence, the ability to parse and understand natural language is an important goal with many applications. Online retailers and service providers may wish to analyse customer reviews and feedback; governments may need to understand the content of large-scale surveys and responses; and researchers may attempt to determine the sentiments expressed towards certain topics or people on social media. This tutorial steps through some of the key elements involved in processing natural language.\n",
    "\n",
    "The difficulty of understanding natural language is tied to the fact that text data is unstructured: it doesn’t come in a format that is easy to analyse and interpret. When performing data analysis, we want to evaluate information quantitatively, but text is inherently qualitative. Natural language processing transforms a piece of text into a structured format that a computer can process and begin to understand."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dataset\n",
    "\n",
    "To perform natural language processing, we need some data containing natural language to work with. Often you can collect your own data for projects by scraping the web or downloading existing files in txt, csv, or json format. NLTK also provides access to a collection of ready-made corpora that are easy to import once they have been downloaded with `nltk.download()`. These include classic novels, film scripts, tweets, speeches, and even real-life conversations overheard in New York. We will be using a set of movie reviews for our analysis.\n",
    "\n",
    "After importing a selected dataset, you can call its `readme()` method to learn more about the structure and purpose of the collected data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
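   },
   "outputs": [],
   "source": [
    "import nltk\n",
    "from nltk.corpus import movie_reviews\n",
    "# Run once if the corpus is not installed: nltk.download('movie_reviews')\n",
    "# Uncomment to read the corpus description:\n",
    "# print(movie_reviews.readme())"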
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can start trying to understand the data by simply printing some of it to see what we are dealing with. To get the entire collection of movie reviews as a single string, we call the `raw()` method (though we limit what we print to the first 3000 characters for readability)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "plot : two teen couples go to a church party , drink and then drive . \n",
      "they get into an accident . \n",
      "one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \n",
      "what's the deal ? \n",
      "watch the movie and \" sorta \" find out . . . \n",
      "critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \n",
      "which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . \n",
      "they seem to have taken this pretty neat concept , but executed it terribly . \n",
      "so what are the problems with the movie ? \n",
      "well , its main problem is that it's simply too jumbled . \n",
      "it starts off \" normal \" but then downshifts into this \" fantasy \" world in which you , as an audience member , have no idea what's going on . \n",
      "there are dreams , there are characters coming back from the dead , there are others who look like the dead , there are strange apparitions , there are disappearances , there are a looooot of chase scenes , there are tons of weird things that happen , and most of it is simply not explained . \n",
      "now i personally don't mind trying to unravel a film every now and then , but when all it does is give me the same clue over and over again , i get kind of fed up after a while , which is this film's biggest problem . \n",
      "it's obviously got this big secret to hide , but it seems to want to hide it completely until its final five minutes . \n",
      "and do they make things entertaining , thrilling or even engaging , in the meantime ? \n",
      "not really . \n",
      "the sad part is that the arrow and i both dig on flicks like this , so we actually figured most of it out by the half-way point , so all of the strangeness after that did start to make a little bit of sense , but it still didn't the make the film all that more entertaining . \n",
      "i guess the bottom line with movies like this is that you should always make sure that the audience is \" into it \" even before they are given the secret password to enter your world of understanding . \n",
      "i mean , showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy ! ! \n",
      "okay , we get it . . . there \n",
      "are people chasing her and we don't know who they are . \n",
      "do we really need to see it over and over again ? \n",
      "how about giving us different scenes offering further insight into all of the strangeness going down in the movie ? \n",
      "apparently , the studio took this film away from its director and chopped it up themselves , and it shows . \n",
      "there might've been a pretty decent teen mind-fuck movie in here somewhere , but i guess \" the suits \" decided that turning it into a music video with little edge , would make more sense . \n",
      "the actors are pretty good for the most part , although wes bentley just seemed to \n"
     ]
    }
   ],
   "source": [
    "raw = movie_reviews.raw()\n",
    "print(raw[:3000])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, this outputs the reviews as plain text with no structure at all. Because `raw` is one long string, indexing into it returns a single character rather than a word."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "p\n"
     ]
    }
   ],
   "source": [
    "print(raw[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use the `words()` method to split our movie reviews into individual tokens (words and punctuation marks), and store them all in one corpus. Now we can start to analyse things like individual words and their frequencies within the corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]\n"
     ]
    }
   ],
   "source": [
    "corpus = movie_reviews.words()\n",
    "print(corpus)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "plot\n"
     ]
    }
   ],
   "source": [
    "print(corpus[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can calculate the frequency distribution (counting the number of times each token appears in the corpus), and then print or plot the top k tokens. This shows us which tokens are used most often. Let’s take a look at the top 50."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<FreqDist with 39768 samples and 1583820 outcomes>\n",
      "[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), (\"'\", 30585), ('is', 25195), ('in', 21822), ('s', 18513), ('\"', 17612), ('it', 16107), ('that', 15924), ('-', 15595), (')', 11781), ('(', 11664), ('as', 11378), ('with', 10792), ('for', 9961), ('his', 9587), ('this', 9578), ('film', 9517), ('i', 8889), ('he', 8864), ('but', 8634), ('on', 7385), ('are', 6949), ('t', 6410), ('by', 6261), ('be', 6174), ('one', 5852), ('movie', 5771), ('an', 5744), ('who', 5692), ('not', 5577), ('you', 5316), ('from', 4999), ('at', 4986), ('was', 4940), ('have', 4901), ('they', 4825), ('has', 4719), ('her', 4522), ('all', 4373), ('?', 3771), ('there', 3770), ('like', 3690), ('so', 3683), ('out', 3637)]\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZUAAAEeCAYAAABCLIggAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzsnXl8VdW1+L/rZk4gCQmCERAQcAIHvFFxQGuxFVtfsa1a\nbK3UR7Wtvmqrrz/09dn37Kt92sG2ytPWyquofVWcgRat4oRFhgRQBkXmSQYJCYGEzOv3x94XLpd7\nk5uQm5thfT+fk3vO2mfvtfe9J3udvdc6+4iqYhiGYRjtQSDZFTAMwzC6D2ZUDMMwjHbDjIphGIbR\nbphRMQzDMNoNMyqGYRhGu2FGxTAMw2g3zKgYhmEY7YYZFcMwDKPdMKNiGIZhtBupiSxcRH4IfBtQ\nYDlwA5ANPAMMATYC16hquT//LmAy0AjcqqqvenkQeBzIAv4G3KaqKiIZwBNAECgDvqaqG5urU9++\nfXXIkCFtas+BAwfIyspqVVqi5aa7a+sw3R2vuyN0dDXd8VBaWrpbVY9p8URVTcgGDAA2AFn+eAbw\nLeAXwJ1edidwv98/FXgfyACGAuuAFJ+2CBgDCDAHuNzLbwZ+7/cnAs+0VK9gMKhtpaSkpNVpiZab\n7q6tw3R3Tx1dTXc8ACUaR9+f6OmvVCBLRFJxI5RPgAnAdJ8+HbjS708AnlbVWlXdAKwFzhGRIiBX\nVRf4hj0RkSdU1nPAOBGRBLfJMAzDiEHCjIqqbgN+BWwGtgN7VfXvQH9V3e5P2wH09/sDgC1hRWz1\nsgF+P1J+WB5VbQD2AoXt3hjDMAwjLkQTtEqxiPQBnge+BlQAz+JGE1NVNT/svHJV7SMiU4EFqvqU\nl0/DTXVtBO5T1Uu9fCwwRVWvEJEVwHhV3erT1gHnquruiLrcBNwEUFRUFJw1a1ab2lRdXU12dnar\n0hItN91dW4fp7njdHaGjq+mOh+Li4lJVLW7xxHjmyNqyAVcD08KOrwceBlYDRV5WBKz2+3cBd4Wd\n/ypwnj/nozD5tcAfws/x+6nAbryhjLWZT8V0dyYdprt76uhquuOBTuBT2QyMEZFs7+cYB3wIzAQm\n+XMmAS/7/ZnARBHJEJGhwAhgkbqpskoRGePLuT4iT6isq4A3fOMNwzCMJJCwkGJVXSgizwFLgAZg\nKfAo0AuYISKTgU3ANf78lSIyA1jlz79FVRt9cTdzKKR4jt8ApgFPishaYA8uAswwDMNIEgl9TkVV\n/wP4jwhxLW7UEu38e4F7o8hLgFFR5DW4abaE89GOSpbuqOUsVSzAzDAMIzr2RH2c3D/nI342r5xr\n/vAe89ftbjmDYRhGD8SMShyoKueeUEjvdGHxxnK+/seFXPvoAhZv3JPsqhmGYXQqEjr91V0QEb57\n8TBGZexhaVU+f5y3nvfWl3H1799j7Ii+3P65E5NdRcMwjE6BjVRaQVZagO+PG8G8KZ/l1nEj6JWR\nyrw1u/nyw/NZ/ElNsqtnGIaRdMyotIG8rDRu/9yJvDvlEr561kAA5m02o2IYhmFG5SjIz07nxouG\nArCmrD7JtTEMw0g+ZlSOkhH9epOTnsKu6kZ27bPRimEYPRszKkdJSkA4Y5BbymzZ5ook18YwDCO5\nmFFpB0Yf74zK0i1mVAzD6NmYUWkHRg/qA8DSzeVJrolhGEZyMaPSDpzpRyofbN1LQ2NTkmtjGIaR\nPMyotAN9e2XQPyeF6rpGPt65P9nVMQzDSBpmVNqJEwvTAFi6xabADMPouZhRaSdOLPBGxSLADMPo\nwZhRaSdOLEwHzFlvGEbPxoxKOzE4P5X01ADrPq1ib7U9XW8YRs/EjEo7kRYQThuQB8CyrTYFZhhG\nz8SMSjsy2p6sNwyjh5MwoyIiJ4nIsrCtUkR+ICIFIvKaiKzxn33C8twlImtFZLWIXBYmD4rIcp/2\noPj3+YpIhog84+ULRWRIotoTD6OP9w
9BWgSYYRg9lIQZFVVdrapnquqZQBCoBl4E7gTmquoIYK4/\nRkROBSYCI4HxwMMikuKLewS4ERjht/FePhkoV9XhwG+A+xPVnng4uFzL5gpUNZlVMQzDSAodNf01\nDlinqpuACcB0L58OXOn3JwBPq2qtqm4A1gLniEgRkKuqC9T11E9E5AmV9RwwLjSKSQZFeZn0z81g\n74F6NuyuSlY1DMMwkoZ0xB21iPwvsERVp4pIharme7ngRhr5IjIVWKCqT/m0acAcYCNwn6pe6uVj\ngSmqeoWIrADGq+pWn7YOOFdVd0fovwm4CaCoqCg4a9asNrWjurqa7OzsZtN+Mb+chdtq+f7ZeXxm\nSFbMPO0lb8+yTHfH6zDdHa+7I3R0Nd3xUFxcXKqqxS2eqKoJ3YB0YDfQ3x9XRKSX+8+pwHVh8mnA\nVUAx8HqYfCww2++vAAaGpa0D+jZXn2AwqG2lpKSkxbTfv7VWB0+ZrT9+8YNm87SXvCN09FTdHaHD\ndHdPHV1NdzwAJRpHn98R01+X40YpO/3xTj+lhf/c5eXbgEFh+QZ62Ta/Hyk/LI+IpAJ5QFkC2hA3\nB531FgFmGEYPpCOMyrXAX8KOZwKT/P4k4OUw+UQf0TUU55BfpKrbgUoRGeOny66PyBMq6yrgDW9R\nk8ZpA/JICQgf7dhHdV1DMqtiGIbR4aQmsnARyQE+B3wnTHwfMENEJgObgGsAVHWliMwAVgENwC2q\n2ujz3Aw8DmTh/CxzvHwa8KSIrAX24KLHkkpWegonH9ublZ9Usnzr3sR+wYZhGJ2MhPZ5qloFFEbI\nynDRYNHOvxe4N4q8BBgVRV4DXN0ulW1HRh+fz8pPKlm6pYKzeyW7NoZhGB2HPVGfAOxNkIZh9FTM\nqCSA0EOQS+whSMMwehhmVBLA0L455GWl8em+WnYfsNcLG4bRczCjkgBE5OBo5eOyuiTXxjAMo+Mw\no5IgQn6Vj8vs3SqGYfQczKgkiNMG5gKwea89q2IYRs/BjEqC6Nc7E4DKWvOpGIbRczCjkiAKe7l3\n1u81o2IYRg/CjEqCKMhxRqWytsnCig3D6DGYUUkQGakp9M5IpVGh8oD5VQzD6BmYUUkgoSmwsqra\nJNfEMAyjYzCjkkBCU2BlVfasimEYPQMzKgmksFcGAGX7zagYhtEzMKOSQApzbPrLMIyehRmVBBLy\nqeyxkYphGD0EMyoJpCDHT3+ZT8UwjB6CGZUEUmiOesMwehhmVBLIwZDi/eZTMQyjZ5BQoyIi+SLy\nnIh8JCIfish5IlIgIq+JyBr/2Sfs/LtEZK2IrBaRy8LkQRFZ7tMeFBHx8gwRecbLF4rIkES2p7WE\nQor32EjFMIweQqJHKr8DXlHVk4EzgA+BO4G5qjoCmOuPEZFTgYnASGA88LCIpPhyHgFuBEb4bbyX\nTwbKVXU48Bvg/gS3p1X07WU+FcMwehYJMyoikgdcBEwDUNU6Va0AJgDT/WnTgSv9/gTgaVWtVdUN\nwFrgHBEpAnJVdYG6RbSeiMgTKus5YFxoFNMZ6JN9aKTS1GTrfxmG0f2RRC12KCJnAo8Cq3CjlFLg\nNmCbqub7cwQ30sgXkanAAlV9yqdNA+YAG4H7VPVSLx8LTFHVK0RkBTBeVbf6tHXAuaq6O6IuNwE3\nARQVFQVnzZrVpjZVV1eTnZ3dqrRvvriD6gZ4fEI/eqcHWjy/tfL2LMt0d7wO093xujtCR1fTHQ/F\nxcWlqlrc4omqmpANKAYacJ08uKmw/wIqIs4r959TgevC5NOAq3w5r4fJxwKz/f4KYGBY2jqgb3P1\nCgaD2lZKSkpanTbmv+bo4Cmzdc3OfXGd31p5e5Zlujteh+nunjq6mu54AEo0jr4/kT6VrcBWVV3o\nj58DzgJ2+ikt/Ocun74NGBSWf6CXbfP7kfLD8ohIKpAHlLV7S46CvAz3FZuz3jCMnkDCjIqq7gC2\niM
hJXjQONxU2E5jkZZOAl/3+TGCij+gainPIL1LV7UCliIzx02XXR+QJlXUV8Ia3qJ2GXG9ULKzY\nMIyeQGqCy/8+8GcRSQfWAzfgDNkMEZkMbAKuAVDVlSIyA2d4GoBbVLXRl3Mz8DiQhfOzzPHyacCT\nIrIW2IOLHutUHDQqNlIxDKMHkFCjoqrLcD6RSMbFOP9e4N4o8hJgVBR5DXD1UVYzoeRlhkYqZlQM\nw+j+2BP1CSb3oE/Fpr8Mw+j+mFFJMCFH/W6b/jIMowdgRiXBHByp2PSXYRg9ADMqCSbvoKPepr8M\nw+j+mFFJMPacimEYPQkzKgmmd5hRsfW/DMPo7phRSTCpASEvK40mhYoD9cmujmEYRkIxo9IBHHxX\nvflVDMPo5phR6QBCrxXebRFghmF0c8yodAD2BkjDMHoKZlQ6gMLQGyBtUUnDMLo5ZlQ6gND0ly0q\naRhGd8eMSgdw0KiYT8UwjG6OGZUOoMBPf5lPxTCM7o4ZlQ6g78HoL/OpGIbRvTGj0gEU9LLoL8Mw\negZmVDqAwhwf/WVGxTCMbo4ZlQ6gT3YaAOXVdTTa+l+GYXRjEmpURGSjiCwXkWUiUuJlBSLymois\n8Z99ws6/S0TWishqEbksTB705awVkQdFRLw8Q0Se8fKFIjIkke1pK6kpAfKz01CFimobrRiG0X3p\niJHKJap6pqqG3lV/JzBXVUcAc/0xInIqMBEYCYwHHhaRFJ/nEeBGYITfxnv5ZKBcVYcDvwHu74D2\ntAl7VsUwjJ5AMqa/JgDT/f504Mow+dOqWquqG4C1wDkiUgTkquoCVVXgiYg8obKeA8aFRjGdjYN+\nFXtWxTCMboy4fjpBhYtsAPYCjcAfVPVREalQ1XyfLriRRr6ITAUWqOpTPm0aMAfYCNynqpd6+Vhg\niqpeISIrgPGqutWnrQPOVdXdEfW4CbgJoKioKDhr1qw2tae6uprs7OxWpYXkv5xfzoJttdw+Jo8L\nBmW1eH576k6UvLvr7ggdprvjdXeEjq6mOx6Ki4tLw2acYqOqCduAAf6zH/A+cBFQEXFOuf+cClwX\nJp8GXAUUA6+HyccCs/3+CmBgWNo6oG9zdQoGg9pWSkpKWp0Wkv/bCx/o4Cmzdfr8DXGd3566EyXv\n7ro7Qofp7p46uprueABKNI5+P6HTX6q6zX/uAl4EzgF2+ikt/Ocuf/o2YFBY9oFets3vR8oPyyMi\nqUAeUJaIthwtoUUlbfl7wzC6MwkzKiKSIyK9Q/vA53Eji5nAJH/aJOBlvz8TmOgjuobiHPKLVHU7\nUCkiY/x02fUReUJlXQW84S1qp6Mwx17UZRhG9yc1gWX3B170fvNU4P9U9RURWQzMEJHJwCbgGgBV\nXSkiM4BVQANwi6o2+rJuBh4HsnB+ljlePg14UkTWAntw0WOdktDbH81RbxhGdyZhRkVV1wNnRJGX\nAeNi5LkXuDeKvAQYFUVeA1x91JXtAAospNgwjB6APVHfQRwKKbbpL8Mwui9mVDqIQltU0jCMHoAZ\nlQ6iT3Y6IlBeXU9DY1Oyq2MYhpEQzKh0ECkBoU+2G62UV9cnuTaGYRiJodVGRUT6iMjpiahMd6cg\nx6bADMPo3sRlVETkLRHJFZECYAnwRxF5ILFV634cele9OesNw+iexDtSyVPVSuArwBOqei5waeKq\n1T05+KyKjVQMw+imxGtUUv2SKtcAsxNYn26NhRUbhtHdideo3AO8CqxV1cUicgKwJnHV6p6YT8Uw\njO5OvE/Ub1fVg855VV1vPpXW09dPf+2uqoPCJFfGMAwjAcQ7UnkoTpnRDAV++muPrf9lGEY3pdmR\nioicB5wPHCMit4cl5QIp0XMZsTi0/lctkJHcyhiGYSSAlqa/0oFe/rzeYfJK3FLzRivoe1j0lxkV\nwzC6H80aFVV9G3hbRB5X1U0dVKduy8GRik1/GYbRTYnXUZ8hIo8C
Q8LzqOpnE1Gp7kp+djoBgb0H\n6mlo6pTvEjMMwzgq4jUqzwK/Bx4DGls414hBaP2vsqo69tXaopKGYXQ/4jUqDar6SEJr0kMo7OWM\nyl4zKoZhdEPiDSmeJSI3i0iRiBSEtngyikiKiCwVkdn+uEBEXhORNf6zT9i5d4nIWhFZLSKXhcmD\nIrLcpz3o31WPf5/9M16+UESGxN3yJBHyq1SaUTEMoxsSr1GZBPwImA+U+q0kzry3AR+GHd8JzFXV\nEcBcf4yInIp7x/xIYDzwsIiEwpYfAW4ERvhtvJdPBspVdTjwG+D+OOuUNAp7uagvG6kYhtEdicuo\nqOrQKNsJLeUTkYHAF3G+mBATgOl+fzpwZZj8aVWtVdUNwFrgHL/mWK6qLlBVBZ6IyBMq6zlgXGgU\n01kptJGKYRjdGHH9dAsniVwfTa6qT7SQ7zngv3HPuPyrql4hIhWqmu/TBTfSyBeRqcACVX3Kp00D\n5gAbgftU9VIvHwtM8WWtAMar6laftg44V1V3R9TjJuAmgKKiouCsWbNabHM0qquryc7OblVapHzG\nyv08s2o/XxqezqTRR84gxlvO0eRpL3l3190ROkx3x+vuCB1dTXc8FBcXl6pqcYsnqmqLG25JltD2\nR2A98FwLea4AHvb7nwFm+/2KiPPK/edU4Low+TTcA5bFwOth8rFhZa0ABoalrQP6NlevYDCobaWk\npKTVaZHyJ97bqIOnzNZv/2HuUZVzNHnaS97ddXeEDtPdPXV0Nd3xAJRoHPYirugvVf1++LGI5ANP\nt5DtAuBLIvIFIBPIFZGngJ0iUqSq2/3U1i5//jZgUFj+gV62ze9HysPzbBWRVCAPKIunTckiNP21\nt8amvwzD6H609R31VcDQ5k5Q1btUdaCqDsE54N9Q1euAmTjHP/7zZb8/E5joI7qG4hzyi1R1O1Ap\nImP8dNn1EXlCZV3ldXTqpwrNp2IYRncmrpGKiMwCQp11CnAKMKONOu8DZojIZGAT7sVfqOpKEZkB\nrAIagFtUNfSg5c3A40AWzs8yx8unAU+KyFpgD854dWpCb3+06C/DMLoj8T78+Kuw/QZgk3rneDyo\n6lvAW36/DBgX47x7gXujyEuAUVHkNcDV8dajM9AvN5PUgLB9fyMvL9vGhDMHJLtKhmEY7Ua8IcVv\nAx/horj6ALYiYhvJzUzjR5edBMAdM97nzdW7WshhGIbRdYjLqIjINcAi3KjgGmChiNjS923kOxcP\nY8JJOTQ0Kd97qpTSTXuSXSXDMIx2IV5H/Y+Bs1V1kqpeD5wD3J24anV/vnlaL64pHkhNfRM3/Gkx\nH+2oTHaVDMMwjpp4jUpAVcPnacpakdeIgojw8y+fxudP7U9lTQPXT1vElj3Vya6WYRjGURGvYXhF\nRF4VkW+JyLeAvwJ/S1y1egapKQEevHY05w4tYNe+Wr45bSEVNfZmAcMwui7NGhURGS4iF6jqj4A/\nAKf77T3g0Q6oX7cnMy2FxyYVM/K4XDaWVXPPO+Ws2bkv2dUyDMNoEy2NVH6Lex89qvqCqt6uqrcD\nL/o0ox3onZnG9H8+h6F9c9i8t4EvPDiPX726mpp6G7UYhtG1aMmo9FfV5ZFCLxuSkBr1UPr2yuCl\nmy/g8ydkUd+oTH1zLZf99h3mrfk02VUzDMOIm5aMSn4zaVntWRED8rLT+E4wj+e/dx4n9e/NprJq\nvjltEbc9vdR8LYZhdAlaMiolInJjpFBEvo17UZeRAIKDC5h964VMGX8ymWkBXl72Cbe9spv5a3e3\nnNkwDCOJtGRUfgDcICJviciv/fY27o2LtyW+ej2XtJQA3/vMMP7+g4sZO6Iv++uVbz2+mNdW7Ux2\n1QzDMGLSrFFR1Z2qej5wD+5lWRuBe1T1PFXdkfjqGccXZjP9hnMYPyybuoYmvvtUKS8t3dZyRsMw\njCQQ7/tU3gTeTHBdjBgEAsK3
R/dm2PFF/M+b6/jhjGXsq6nnm+cNSXbVDMMwDsOeiu8iiAg/uuxk\n7rz8ZFTh7pdX8vBba5NdLcMwjMOId+l7o5Pw3YuH0TszlX9/aQW/eGU1lQcauPSYTv1eMsMwehA2\nUumCfOPcwfz2a2eSGhB+//Y67n23nLW79ie7WoZhGGZUuioTzhzAo9cH6Z2RytIddYz/7Tv8bPYq\nKmvqk101wzB6MGZUujCfPbk/b/7oM1w6NItGVR57dwOX/PItnl60mcYmmxIzDKPjSZhREZFMEVkk\nIu+LyEoRucfLC0TkNRFZ4z/7hOW5S0TWishqEbksTB4UkeU+7UERES/PEJFnvHyhiAxJVHs6K317\nZfC94jxm/cuFnD2kD2VVddz5wnIm/M+7bKywUYthGB1LIkcqtcBnVfUM4ExgvIiMAe4E5qrqCGCu\nP0ZETgUmAiOB8cDDIpLiy3oEuBEY4bfxXj4ZKFfV4cBvgPsT2J5OzagBecz4znk8eO1oivIyWbGt\nkl+9V0GTjVgMw+hAEmZU1BHyHqf5TYEJwHQvnw5c6fcnAE+raq2qbgDWAueISBGQq6oLVFWBJyLy\nhMp6DhgXGsX0RESEL51xHHPvuJgB+Vls39/I2x/bgpSGYXQc4vrpBBXuRhqlwHDgf1R1iohUqGq+\nTxfcSCNfRKYCC1T1KZ82DZiDe4r/PlW91MvHAlNU9QoRWQGMV9WtPm0dcK6q7o6ox03ATQBFRUXB\nWbNmtak91dXVZGdntyot0fJYaS+truLJD/ZxRv90fnJRQYfq7gh5d9Fhujted0fo6Gq646G4uLhU\nVYtbPFFVE77hVjt+ExgFVESklfvPqcB1YfJpwFVAMfB6mHwsMNvvrwAGhqWtA/o2V5dgMKhtpaSk\npNVpiZbHSquoqtMT/+2vOnjKbF2zs7JDdXeEvLvoMN3dU0dX0x0PQInG0d93SPSXqlZ4ozIe2Omn\ntPCfu/xp24BBYdkGetk2vx8pPyyPiKQCeUBZYlrRtcjLTuPiwZkAPD5/Y3IrYxhGjyGR0V/HiEho\nmisL+BzwETATmORPmwS87PdnAhN9RNdQnEN+kapuBypFZIyfLrs+Ik+orKuAN7xFNYAvjMgB4PnS\nbew9YJFghmEknkSOVIqAN0XkA2Ax8JqqzgbuAz4nImuAS/0xqroSmAGsAl4BblHV0JupbgYewznv\n1+F8LeCmyApFZC1wOz6SzHAMyk3lwuF9OVDfyIzFW5JdHcMwegAJW/tLVT8ARkeRlwHjYuS5F7g3\nirwE54+JlNcAVx91ZbsxN1wwhHfX7mb6exv55wuHkhLoscFxhmF0APZEfTfnkpP6Mbgwm63lB3j9\nQ3vBl2EYicWMSjcnEBAm+feuPP6PjUmti2EY3R8zKj2Aq4oHkpOewnvry/hwe2Wyq2MYRjfGjEoP\nIDczjauLXbS2jVYMw0gkZlR6CNefNxiAl5Zto7K2Kcm1MQyju2JGpYdwwjG9uOSkY6htaOL19dXJ\nro5hGN0UMyo9iG9dMBSAv62tZp+9zMswjARgRqUHcdGIvpwxMI/ymiZ+OmtVsqtjGEY3xIxKD0JE\n+PU1Z5AegGdLt/L3lTuSXSXDMLoZZlR6GMP79eYbp/cG4K4XlrN7f22Sa2QYRnfCjEoP5AvDszl/\nWCFlVXXc9cJybA1OwzDaCzMqPZCACL+8+gx6Z6Ty2qqdPFu6NdlVMgyjm2BGpYcyID+LeyaMBOCn\ns1axZY+FGRuGcfSYUenBfHn0AC4fdSz7axu449n3abJpMMMwjhIzKj0YEeHeL59G314ZLNqwh9kf\n22jFMIyjw4xKD6cgJ537v3oaAH9esY+PdtiCk4ZhtB0zKgbjTunPtecMoqEJfvD0MmrqG1vOZBiG\nEQUzKgYA//7FUzm2Vwof7djHr/++OtnVMQyji5IwoyIig0TkTRFZJSIrReQ2Ly8QkddEZI3/7B
OW\n5y4RWSsiq0XksjB5UESW+7QHRUS8PENEnvHyhSIyJFHt6e7kZKRy2zl5pASEP87bwD/W7k52lQzD\n6IIkcqTSANyhqqcCY4BbRORU4E5grqqOAOb6Y3zaRGAkMB54WERSfFmPADcCI/w23ssnA+WqOhz4\nDXB/AtvT7TmxMJ1bPzsCgDtmvM/ealt00jCM1pEwo6Kq21V1id/fB3wIDAAmANP9adOBK/3+BOBp\nVa1V1Q3AWuAcESkCclV1gbpHv5+IyBMq6zlgXGgUY7SNWy4Zxujj89lRWcO/vWRP2xuG0TqkIzoN\nPy31DjAK2Kyq+V4uuJFGvohMBRao6lM+bRowB9gI3Keql3r5WGCKql4hIiuA8aq61aetA85V1d0R\n+m8CbgIoKioKzpo1q03tqK6uJjs7u1VpiZYnQseO/Q3c8VoZNQ3K98/J45xjtEe0Oxk6THfH6+4I\nHV1NdzwUFxeXqmpxiyeqakI3oBdQCnzFH1dEpJf7z6nAdWHyacBVQDHweph8LDDb768ABoalrQP6\nNlefYDCobaWkpKTVaYmWJ0rHM4s36+Aps3XkT17Rv729oEN1xyPvLjpMd/fU0dV0xwNQonH0+QmN\n/hKRNOB54M+q+oIX7/RTWvjPXV6+DRgUln2gl23z+5Hyw/KISCqQB5S1f0t6HlcHBzJ+pHva/sFF\ne6lrsFcQG4bRMomM/hLcaONDVX0gLGkmMMnvTwJeDpNP9BFdQ3EO+UWquh2oFJExvszrI/KEyroK\neMNbVOMoERF+/pXT6Nc7gw9313P+fXP5+d8+ZO2u/cmummEYnZhEjlQuAL4JfFZElvntC8B9wOdE\nZA1wqT9GVVcCM4BVwCvALaoaegrvZuAxnPN+Hc7XAs5oFYrIWuB2fCSZ0T4U5KTzyHVBBuamsnt/\nHY++s55LH3ibrz4ynxmLt1BV25DsKhqG0clITVTBqvouECsSa1yMPPcC90aRl+Cc/JHyGuDqo6im\n0QLBwX347ecLCfQbxozFW5j1/ieUbiqndFM598xayReHZ3LaGU2kp9pztIZh2BP1RhyICGcd34f7\nvno6i358Kb+86nTOHtKHqrpGZqyq4oqH5rF0c3myq2kYRifAjIrRKnIyUrm6eBDPfvd8nrlpDEW9\nUvh4536+8sh8fjprFdV1NiVmGD0ZMypGmzn3hEJ+/fm+fPfiYQRE+N9/bODzv3mHd9fYEi+G0VNJ\nmE/F6BlkpAh3Xn4yV5xexP977gNWba/kumkLOT4vlUFLF1CQk0FBdhp9ctIpyEln/6c15AyoZHBB\nDlnpKS0phQhcAAAgAElEQVQrMAyjS2FGxWgXRg3I4+V/uYBH31nP7+auYfPeBjbvjf7I0C/mzwPg\n2NxMBhdmM6Qwh75SzejRSiBgq+wYRlfGjIrRbqSlBLjlkuFcd+5gXp1fSv/jh1FeVceeqjrKq+so\nq6pj9eadlDeksmVPNTsqa9hRWcPCDXsA6Fe0iUnnD0luIwzDOCrMqBjtTl52GsP6pBE88Zgj0kpL\n6wgGgzQ0NrF9bw0by6pYvGEPD76xll+9uprLTzuWfr0zk1BrwzDaA3PUG0khNSXAoIJsxo44hh9+\n7kSCRRnsq23g53/9MNlVMwzjKDCjYiQdEeGfz+xNRmqAl5Z9wvx1Fj1mGF0VMypGp+DYXqn8yyXD\nAbj7pRW2gKVhdFHMqBidhpsuPoET+uaw7tMqHnt3fbKrYxhGGzCjYnQaMlJT+OkEt8Tbg3PXsGVP\ndZJrZBhGazGjYnQqLhzRl3864zhq6pu4Z9aqZFfHMIxWYkbF6HT8+xdPoVdGKq9/uJPFn9QkuzqG\nYbQCMypGp6N/biZ3fP5EAKYtreTdNbtZurmcj3fuY1vFAfZW19PQaI58w+iM2MOPRqfkm2MG82zJ\n1oNriUUjLyPAkPfeZWCfbAb0yWJgnywG5GdRV9UY9XzDMB
KPGRWjU5KaEuC3E8/kP55diGT0oqq2\ngf21DVTVNrr9ugb21jbx/ta9vL917xH5By14gzFDCzlvWCFjTijkuPysJLTCMHoeCTMqIvK/wBXA\nLlUd5WUFwDPAEGAjcI2qlvu0u4DJQCNwq6q+6uVB4HEgC/gbcJuqqohkAE8AQaAM+JqqbkxUe4yO\n58T+vfnX8/oQDAaPSGtsUubOX0zBwOFsqzjA1vLQVk3pxjK27DnAlj1bebZ0KwCDC7M5KU+Z3KeM\ns4cU2MKVhpEgEjlSeRyYiuv4Q9wJzFXV+0TkTn88RUROBSYCI4HjgNdF5ET/jvpHgBuBhTijMh73\njvrJQLmqDheRicD9wNcS2B6jE5ESEAqzUggOKaA4Im1RSQnZRSN4b10ZC9aXsWjDHjaVVbOpDP7+\n6AIG5GfxpTOP48ujB3Bi/95Jqb9hdFcS+Y76d0RkSIR4AvAZvz8deAuY4uVPq2otsEFE1gLniMhG\nIFdVFwCIyBPAlTijMgH4T1/Wc8BUERFV1cS0yOgqpIgwakAeowbkceNFJ9DQ2MTKTyp5fO4yFu1o\nYlvFAR55ax2PvLWOU4pyOauwiX29djG8Xy+Oy8uyUYxhHAWSyD7YG5XZYdNfFaqa7/cFN9LIF5Gp\nwAJVfcqnTcMZjo3Afap6qZePBaao6hUisgIYr6pbfdo64FxVPWLhKBG5CbgJoKioKDhr1qw2tae6\nuprs7OxWpSVabrpbV1ZmVhYf7q5n3uYDzN9SQ1X94dd/RoowoHcKA3NTKeqdCo31pKenI4Dg/gjQ\n1FBPdmY6aQEhNQBpKUJqQMgN1DG8Xy/c5d3x7eupv2tn09HVdMdDcXFxqapGTgwciaombMP5TlaE\nHVdEpJf7z6nAdWHyacBVQDHweph8LM5IAawABoalrQP6tlSnYDCobaWkpKTVaYmWm+62y2vqG3TO\n8u36nUff0Il/eE+D//WaDp4y+6i3c+99XW9/Zpm+uGSr7qqsSVr72irv7ro7QkdX0x0PQInG0e93\ndPTXThEpUtXtIlIE7PLybcCgsPMGetk2vx8pD8+zVURSgTycw94w4iIjNYXxo47lmNptB4MB9lbX\ns/bTfazdtZ9NZdVs/WQ7/fv3RxUU/Keyfccu8voUUtfY5LaGJmobmvhgUxk7Kmt4fslWnl/iggRO\nPrY3A7IaOHbzclICQkCElIDbdu/axwc1G+ifm0m/3hn0651Jv9wMMtPsVctG16SjjcpMYBJwn/98\nOUz+fyLyAM5RPwJYpKqNIlIpImNwjvrrgYciynoPN6p5w1tTw2gzedlpBAcXEBxcAEBpaRXB4KlH\nnFdaWkMwePoR8pKSEnIGnMi7a3Yzb+1uFq4v46Md+/gIYMPmqDpf+OjI5Wjys9O4YEAaQ06qpbBX\nxlG1yTA6kkSGFP8F55TvKyJbgf/AGZMZIjIZ2ARcA6CqK0VkBrAKaABuURf5BXAzh0KK5/gN3BTZ\nk96pvwcXPWYYSUVEOKUol1OKcrnxohOoqW+kdFM5by/5kIGDBtHUpDQq/lPZsHkrab0L2FVZy659\nteyqrOHT/bVUVNfz1zX1vP3Lt7jpohOYfOFQcjLssTKj85PI6K9rYySNi3H+vcC9UeQlwKgo8hrg\n6qOpo2Ekmsy0FC4Y3pfMvdkEg0OOSC8trSAYPO0wWVOTsmp7JXc/u4ilO+p44LWPeeK9Tdw2bjgT\nzzm+g2puGG3Dbn0Mo5MRCLiQ6H8fW0Bd/hDue+Uj3t9Swd0vr+SxdzdwQVGAyl67GFmUS7/czGRX\n1zAOw4yKYXRizhtWyEs3n8+rK3fwi1dXs/7TKjaVwf+tWAxA314ZnHpcLqcW5ZJTW8OA4TUcm2eG\nxkgeZlQMo5MjIowfVcSlp/RnzoodvFKymt0NmazaXsnu/bW88/GnvPPxpwD8esFcBvbJonhwH7fa\nwOA+NFr8itGBmFExjC
5CakqAfzrjOI5r2E4wGERV2Vp+gFXbK1n5SSVvr9jEuoqmg+ugvbTsE5cv\nAPmvvE5uVip5WWnkZqaRm5VG7b69HL99FVnpKWSmuS0rLYXs9BQqP63juL0H6N8701YYMFqFGRXD\n6KKICIMKshlUkM1lI4/l4oJ9nDn6LFbv2Efppj0s3lhO6aZytlUcYPf+Wnbvrz2ykPUbYpb/k7fe\nICM1wPEF2QwuzGZwYQ4HKvaztHr9QSOUmRYgMzWFjTtqObBmN4GAWyYnJSAEAsLaPfVkfVJJeqqQ\nlhIgLSVAaoqwv66J+sYm0lLslU7dDTMqhtGNSAmI87Ecl8s3zxsCwPyFJQw7ZRSVB+rZe6Ceypp6\nKg80sOLjdfQrGsCBuiYO1DdS47equkY+3vopu2uEsqo61uzaz5pd+w8pWfFhdOXzor/3hrnzostf\nnkN6SoCcjBSy01PplZFKTkYKaY0HOGXbSoryMinKz+I4/9nYZNN4XQEzKobRzclIFfrnZtI/IlJs\nUNMOgsFhUfOUlpYSDAbZV1PPprJqNu+pZlNZNR9v3EJ+4THU1DdRW99ITUMjNfVNlJVXkNOrN41N\nSpMqjf55nH379pOemUVdYxMNjUp9oxuhVNfWU9uIW42guony6vrD9C/ctvGIOqUIDHzrzYMjp+ML\nsjm+IIc9ZXUENpcTEEEEAn7dtY0V9eTu3EdqSoDUQGikJKSmBDjQ0ERtQyOpgQAB4Yi12oy2Y0bF\nMIyY9M5MO7jiM4Seqxl5xHkhI9Qa+VlnnUVtQxNV/uVr+/2L2Ba8v4qsguP4ZO8Bduyt4ZO9NWyv\nOMCufbX+FQbVzFsTUeAb86M34LV3YjfuxVcO7qb6ZXNSRMl55XUyUgNkpAbITEshIzVAfU01x64s\nISc9haz0VHLSU8jOSGXvp1Vskq0U5KTTt1cGhb3SKchJj62zB2BGxTCMpCAiB30zhb0OyVP2ZBEM\nnnDE+fMXlXDM4JMOjpzc6KmKrZ9WkJ2djQJNqqhCk0JVdTVp6Rk0NOnBUVJDkx8t1TfSJEJDYxNN\nijvHT69V10fxPQHLd+2M3pAP3j9ClJUqFLz2BrlZaeRmpvrPNPKz0+hdf4BBJ9bQr3f3DP02o2IY\nRpcgI0UY0b83IyJerNbaUVJkWmjJnIZGZVHpEk4eeRq19W56rKa+iZqGRpav+ojjjj+B6roGqusa\nqa5zo6u1m7eRmpPPnqo6du+vo2x/LXuq6jjQoGyrOMC2igNR9f924VxG9OvF+cMKOW9YX8acUHCU\n307nwYyKYRg9mkBACCCkpUCv9MARvieAQFkGwVHHHiEvLd1HMDj6MJmqMm9BCUNPGhkWGOGCI3bt\nq+H19zeyek/jwQCI6e9tQgSyUoSUma8eoUO1kV6vziUzLUBGqou4y0hLoaZqP3nLXHCEiBDyCu2r\nrKRgRQnpqUJqwEXbpfnPvMZqYtjZdsOMimEYRjsiIuSkB1y4d5T08/IqOe2M0by/tYL5a8v4x7rd\nLN1cTnWDQkND1DL319VEV7briHcSOnZEn6q7YFDip9zMqBiGYXQw6akBzh5SwNlDCrjt0hHUNjSy\nsGQJZ5555hHnli5ZxkmnjqKmvpHahiYf+t3Eqo9WM3zECPdyrNDJCh+vWcOQE4Y5H1KYL6mhsYna\nPZ8kvG1mVAzDMJJMRmoKOWkBcjPTjkjLzQhwXH7WEfL0igyCJx5z5PlVWwiOPHKqDqC0NPHvMbTH\nWQ3DMIx2w4yKYRiG0W6YUTEMwzDaDTMqhmEYRrvR5Y2KiIwXkdUislZE7kx2fQzDMHoyXdqoiEgK\n8D/A5cCpwLUicmpya2UYhtFz6dJGBTgHWKuq61W1DngamJDkOhmGYfRYuvpzKgOALWHHW4FzI08S\nkZuAm/zhfhFZ3UZ9fYEYj7DGTEu03HR3bR2mu3vq6Gq642FwXGepapfdgKuAx8KOvwlM
TaC+ktam\nJVpuuru2DtPdPXV0Nd3tuXX16a9tcNjyOgO9zDAMw0gCXd2oLAZGiMhQEUkHJgIzk1wnwzCMHkuX\n9qmoaoOI/AvwKpAC/K+qrkygykfbkJZouenu2jpMd/fU0dV0txvi59kMwzAM46jp6tNfhmEYRifC\njIphGIbRbphRMQzDMNoNMyqtRESKRCTjKPL3EZFzROSi0Nae9WsvRORqEent9/9dRF4QkbPaUM4R\n31UMWdS3ColIjoikiIhES484t8VzegIicn88ss5EvNdJd0BECkTk30TkdhHJTbCuDv9ezai0nieB\nj0TkV63JJCLHisi3gXdw0Wr3+M//jHLuk/7ztmbK6y8iV/itXxz6LxCRHL9/nYg8ICKD/fH5IvJ1\nEbk+tAF3q+o+EbkQuBSYBjzSgo5o9V0bRfZeFNnffBkBX5e/isgu4CNgO24lhNdFZGyEznQR+ayI\nTAcmxahXZPsmicifm2tLa/A3CqeLyFmhrY3lXBCPLA4+F0V2uS/vBRH5ooh0tv/9aNfE6vBrMuza\njEnof6clWUuISKmI3CIifVo4L/Tbp7SiT3ge6IVbEeQ9ETkhRtnZzejNjjhOE5FbReQ5v31fRNKI\n/r1Gk7UbXTqkOBmo6qX+jnisiEwDjlPVy/1Cluep6rQYWacBxwNnAwtU9RIRqfLlVEacmyMiVwLr\nReQJIPIO/FLgl8BbPu0hEfmRlynwqapGLlfzCHCGiJwB3AE8BjwhIpuBYcAyoDHUzLD9LwKPqupf\nReRnIvKmT9+jqldF6JgE/A4OjjwGAMeIyOiwNuQC0f5ZQulvAq8DdwErVLXJl1eEM8Qv+oVEtwOZ\nuFDyvwM7gRf8b/MYMBq4E7fKQrT2DRaRdHVrxh2qhMgvgJ8BB4BXgNOBHwKFwJ+AfRHljwW+Bazz\n5eLT8kVkX5gs1EYFinG/R39VHSUipwNfAr4CRBqkh0TkPOCrwBDC/mdV9aci0h/4OXAc7hmtH/q2\nfRBWRm/gH37/YeAG4EEReRb4k6qu9m0/WFbYNX0lMC6yrqr6MxE5MUY7/gDcGFlf4CfAQ8CF/nuY\n57/rAJAV5TrJxf2/gPutxwFLRGQmcH2U7+NWYGT4lyciqUAw1u+qqk/FaMfb/jtdLCIluN/+76qq\nIvKWb2cqUArs8t/v+RG6I3//EDlAUFVzReTvwNsiUoH7v/w28FvcNdYLON7/z35HVW8WkfOjpQHp\nQBru9wW3LNVnY3yvMY1Vu5DoR/a76wbMAa4B3vfHqcDyFvIs9p/LgAy/vxL4L+Bm3D9/LvAM8ClQ\nC6wP2zb4z/eBfmHlHhOqRzO6l/jPnwCTQzLgQ3xoecT5s3Gdw3ogH8jwegfjjOOAsHOvBWYB5biO\nbaYvuxxowBmK0DYT+EoUfTf7z7Q4vvssoAjID5OFfofLgBdwnUtz7XsC9/Ds3cDtYdsyn/5l3I1A\nnm93rPJXA+mtvHbexi2GutQfn4czklsi6vKfXvcr/pr4f7iO5w7gjsjr0Nd1GFDhf6fQVhClDnnA\nd73O+ThD8wpHXtP7w+vq5SuitSOU5su735f11bDtNa8n1W/f8ue/iTPWzV4nuOvwFV/+A76sSX57\n1pfRAFSGbWXAf8f6XZtrh/8M4AzINmAz7sbmA5/2beAev/8BzjDNxN3IfCW0Rfnu/wEMCTsW3A1Y\nNu66XohbKSRafaKmEfH/77+T/fF8r+292Uil7fRV1RkichccfBCzsYU8W0UkH3gJeE1EyoFNuDu/\nM8LO+5qIvA88B/weCPld3lHV90VkuaruCju/jJanMvf5ul4HXOSnP9JwF+SxuE4tnGuA8cCvVLXC\njxR+hBsdKc7ohUZD833+vsCvw3UCw1T12Rbqhqo+7D/r4zj3AO6OM5zQndgXgCdVdaUftcRq3zq/\nBXDGPETof+KLwLOqutcV02z5+bi71XjJVtVFcsgF
lB6mO7wulbj17V5U1fExyjp4HarqXmCviGxU\n1U2xlItIIa7juw5YCvwZN3o4X1XHR1zTgYi6guu4o7UjlJatqlOi6L1bVf8UJnpcRH6gbtT+VVV9\nPladPVXAUKBKVW+PSJvudfw38AvgRNzoBtz1+k9+P/J3jdkOP2K5AfebP8+h72mE/3+4BvhxWJ5M\n3P/iZ/1xOhAa2YTzQ9yN4EYAdVYgtLxUtYigqlsi6nOwb4mR1igiw1R1nZfNAz4G7o3je21XzKi0\nnSr/z+luNUTGAHuby6CqX/a7/+mnkfJwd15vicg3cEv3K+7OvwrnT3gKd2cswJMi8kdgjoi8CvzF\nl/c1vE+iGb4GfB03StkhIsfjpstuAFaJyCLcyChU1y95vaHj7biOeWiUdm3CGcfzouhdIiJfxN3Z\nZ4bl+WkL9W0tpX4qYShwl4j8ze9XELt9iEgvf7zfH2eJyEc4o/U9ETkGqPFlhJffG2jC3QUvFZEV\n0cqPwW4RGcahqZFjcKPX70YzBiIyX0ROU9XlUcpq1XUoIi8CJ+F8g1eo6g6f9IyIfCtKWTXhdRWR\nqzhkoHfHSCsVkS+oauQ1WSYi13Hour0W1wkDzBWRBzh0A/U27h1JoZuMFOAUYAawU0RuxI2mw7/z\nPbiR9Tu4dQCXAWNwPoTZMX7XWO0YCvwGN6q5U1VDehaKyDU4f+i7qrpYnE9kjareEPFdb/BlloaJ\nlUPToFF9KcAWP82l4vwit+FG3M2lPQq8KSLr/XlDcP/bF4nIYVOC/rtq7/+/g9gT9W1EnDP2IWAU\n7m71GOAqVf2g2YzRyxqC80VcgLvY/gH8ADdUPU9Vq/x5Obh/kDm4YfCFvoh5wJhod4dx6L44mlxV\n325FGe+q6oUxfAgZOGN5CW4u+CpgkapObm1dW6hDADgTN/rKwHVOxxBmGCMow3WsBf54N3C9H4EU\nAHtVtVGcQzQXNxI5E1jvR26FuCmLv+CmCZfjjAzQ/PfnO6FHcXPw5bhpzW8AjxN9Dv5YYASuw6zF\nd0qqenprr0MRuRxn4C/w9X0XeERVa8LKGomblj0G12ndEVHX61R1Y4x2XOe/ixxf13oOdaKn+fLP\n88fzgVtVdbOIPO/rP91X9Zs4H8rd/rgB2KSqW0XkFuBe3A1D6PtSVT1BRJZzyG95poicDPxcVb8S\n7Xf1N1jR2nGnqr4b7TuMRSwfkzr/UwHuNwy/sYp6jYhIX1x/cKn/7v4O3KaqZbHScKOjV3HG5Er/\nHf8Y938XIhO4AvhQVf+5NW1rDWZUjgLvBDwJ9+OujmfqppXlLwfOVtUaf5yJ8wPUq+pZEed+oKqn\nRymjuQ5fVTXRIY0f+M4v9NkLmKOqY1vM3Do938b9c0XeoS6ONLbiwmvHAj9W1Te97IfA1UBkBM8A\n3NTExhiq/6CqZ8dIi1XXDJxxHYIzapW43+avYadl4vwQDbi3m/bxdQZ3J16B84eMARYR53UoIjO8\nvlD029dxvqmr/fX1Lzi/0T7c9/eQNzg5QEBV90Up84i01nSi/vxlqnpmpMzXJfT9LlLVXf5u/BxV\nPeK9ICKyWFXP9nnPVdVaEVmpqiNFZBRu9BNepydEJMUbmoPtkOhBC+ep6jT/PU0mYvSN82f9CHdN\njPb1WYFzvEdem/NVdVyU+qfgDO1vYnxPBX5EFi4bCrzs/78uxPlofwX8RCMCdvy196qqfiZa+e2B\nTX8dHedwKALlLD8X+kRrC/FD8WjRMn/CDbdf9MeTcdMAeRI7uucwVPVC/xk+V9+Rxibk+6gWkeNw\nI4Sidio7nNs4PLLuZFyn8DkgcgR3Oa6Nb4bJ8nF31//E4d/HBbjv9ohpP3/ePHHz+DM5fCpmSTN1\nfRlnFJYAn4TlKY047x9+2m4bzil8cBoU+KOqPiQi/+M7sHgXUh2lquGv3H5TRFb5/SdwBufn/vjr\nwJ/9KGIIkOrn
8k9X1atE5DC/hhya568kSicqIhOJcp37u+YDInJhaHQgLpQ6G2cw3+LwKMe1QHWM\n9kX1W4rIfwCfwRmVv+GugXd9mzeISCgY4g1fzuO4/7+Qz+Rjnz4N/1gBzuD9FDfK/JDYPqZY1+YR\neOP2ddzUWzRmicjlqlrpv6dTcEEKoRuJL+Kujb+KyM+i5M/G/S4Jw4xKGxEX+x4tVLXVRgXXyczD\nhdKGO+SeF+fkC01zfR13d/rfuHDWEPsi715aIpaxSQCz/T/5L3GdqOKmwdqbGn9HHbobuwTnYG2M\nYYBzRORuXAcB7p9yPfA9Dg/f3YQzQFGnC8T5xsB1nCGUQ87aaAzUKI53f3cfIoALPc7D3UyMCZsG\nvR8/isD5Ir4KvKDxTTssEZExqrrAl3UuUOLTohmc/biOsZRDRjOkJ9a1cwPRO9Go17nne8B0Ecnz\nx+Vez9nqg1L8zdfruN9pmf/uww35rc34LUuBM3BRUzf4kchT/tyTcdNCtwDTRGQ2LjorViDOcD+y\nm6Cq00Xk/3y7YvmY+oZfm6r6kYicFOO7A3hXRKbijFhVWPuW+O9xljg/5Um4/uYbwM9E5A+4m6j7\n/f9AwM92hH6vANAPN5JJGGZU2k4xcGqc/8gtETVaBg5eSJF3vde2g84OQVVDF/Dz/p81U12UUntz\n2B0qh6ZvdnC4Af6Fql7n77KHcMjn8g7wzz5/aBQRcuSGOonzOfIuO3zOOl5iOd5LOdQBNOCm3Cbj\n5unDO+FGDkWjfQcXftwgIjXEGGmGdS5pXv9mfzwYd9cN0Q1Ovap+LUY7HlbVTyOFInJFjE60vhm/\n34e4qK1huFHjXlyYebQox5f81izh020iUqOqTSLSIO4p9l34F/ypajUuAGCGuIcdfwecLLEDIEKj\nggo/pbYD11l/HeebOVlEtnHIV/bbaKOnZqoemgYMd6Yr8Fk/AknD+VJ6A19W1Y/FBQ9Ei9ZczaGp\n03zgb1FGxO2K+VTaiLgHx25VFxV1tGX9DDfH2lIEV5ckRmfclhFdvPouxjm3X8Q9JBbOfNzzA3/B\njWZCTuQQ76jqqChlRo5Mz/T5onZuqvpAlDJCHXsqURzvuBDtmzn84cBHvGySbw84R+zjqvpbX26L\n/gvxqyfE4FWgDmdwTsI9jxEyOHuBi6IYQETkY5zhewY3Uir38hdxo5Uf4EZs5b7sJcS4zv30U8iY\nhwzoFb5e4VGOH6jqFHEv5TvRy1v0Z4rIw8C/4V7kdwfuGY5l6iO2/DXzNVzHXIKbdvsyUQIgxPnv\nnscFHjyOu8bu9vtH+Mo0LNLK68kDXtGIB29bqP9DHH6djsOFxG+Egw9+Rst3K27KMTR1eiV+6jRe\n3a3FjEorEZFZuB+3N65jiRqq2soy9xElWibRTvSOINY0Yax/gnbUO1tVr5BDYZ3hTxT39rLwV0+H\nOvbXcc7p5RHlfUjYyNTP0cdEVe+JUqfmOnZwU4SxnOhnERbtp6pLfZnRAhSiOoFj0UK9Xsd1kkdE\nnvm85+A66iuBVcDTqhqaVgp1onM49GxLL19O6FjVPVm+ItKY+w5xC4cCFOap6osi8hlclNhGX59B\nwCRVfaeZNj6FC1OehxuB5qqPkBORjbjndWYAM8OmGaMG4vippdAUaVqoHbjosUjDiKqGP7vVIhI9\nSOAO3Gg6Kqo6PZrcT/0eEUGqUYJ62gszKq3E/5MI7onh/xeeBNwfGW3RinJbFS3TVYjsjJOg/2Bn\noqofhckfUdXvhR23NIpYTTuNTJup66oIn0ZUWUR6zBDadqrTYKJEnmnE8zTiQl0fAL6hqikxynrK\n55+nqh9GpD1KhDH3I/iJuE76f3FRSyoipcDX9dDyMicCf1HVYDPtuMS3YSzuJmcpblT6OxHJDTm+\nI/JEHWH7UdVe3HRl+LTkDdFGua1FRObggwRU9Qxv3Jaq6mltKCtqBGlbyooX86
m0klBHLyJpUaYY\nstpSZqy7TdwQt6sT64n2jmIariN5yDtRl+A6te9FnHdFM/kVt1pA+EOUw3DTD5uI8mxJG0dizTnR\nY3FYgEIcTuDWciVRIs9w32cubopoIu77eBEXERmL0G/xYMRv8TvcKOxbfmQZbsxHAJ/HTadNFRcS\nnRMyKADep5BGM6jqmyLyDs4AX4JbomYkzn9SJ+7Zl/AQ4YtwfpdogTixAi1OiuEray3RVusY6HWE\nO97D2xdr5BEZQXol7ndIGGZUWomIfA83x32CxBnWGwdxhxx2FSKmCWM+0Z5oYnQmo/ALX4adF9Vx\nKiL/xaGR6ZVhSTNx8/Jv4HwGbSZOJ3osYi390140F3n2vtf7U1VtceXbFn6Ly2PkURHZgXOGN+BG\nTceIW8YotCr2N2jB+IrIXNwU83u4KbCDUWVEDxEuAEbEGGEfFmgRMcq9QdxzNEdMFbaCaKskhK6D\nWDc/UVHVB+TwCNIbQlOnicKmv1qJuJDHPrRDWG9YmTEf2GqHKieFRE0TtqEekZ3JuxERRfGWs0TD\nHhiSUSwAAAW5SURBVDiVQ8921OOefzjs4YTWXAst+VpiGbwo5bTJCdxCmTGnT0REfKd/2FI3zZTV\nqt9C3KsUrsetdvAY8JKq1vs6bOWQj2EeLhKtNnpJICK/AYK4zv4fPu97qnpARJaq6mg59IBuGs6I\njQqf7mxmijTdy6Ouzxbv7xemp91W60gGNlJpJeoX7aN9w3oTfbfZ4SRimrCNfIDrTEbhfrcKEXlP\n3aKULdLMyLQ/7v8nk8PXdmppXacjaG2n00w5ifDBNTd9MtIHYhQAIiKf4hzmK2KU1drfogC3ou7B\n70fcE+fTgLGRfpnmUNUf+vy9casj/wk3LZvB4SHCb+Ec+dkcOcLegZsKjKWjvX7HJf4G4WCQALBH\nRKKNADpdUI+NVDoZibjbTAbhnTHO9xCiN/APVb2ug+sT6kz+FThWVeN6+11LI9NIh393RGJHns3n\n8KVuPoMLEjg/akH/v737B5GriuI4/v2lMf4BRW0U1EjUGFMocSOKlUbslVSimMoUSWOwM0iEFHYp\nlCiCNioqirAiIkpkwfiPRImarK5KmoAK2WJREZHosbj3zc7M7izunftmJuvv0+zMvDfvvWHYuZz7\nzj1n8XhF30XX+4+Q1mysJiV3D+l+zq2krLGP8mf5UL0pwq+RWiss5GvsHILRRtgjTcOvyYOKtaKN\nacLC6xj4YzKqa1irJH0VvS0bln2ta1uV70Kpcd1m0n2t7hXnS9YGdb3nsXy+LyLibN+25VKEd0XE\nlX37LVtfrzaNKQ2/Fk9/WStamiYssZ6U6rrkx8SGdkq9pW4eJN1nGGSo70LSSxHxEKlp1kGW9sIZ\nKCJWavU7zWKK8FZSvbdLKybirFbNah0j50jFzIoolTR5kvQjDCkS2B8RCy2db5ZU8v09UnJEjyES\nZToLLychwlbFah3j4EjFzEptJK1mX0f6LdlOKsvS1hTRc8Bh4Fp6U4hXnRzRp5MiPM4Ie1LS8Ifl\nSMXMikiaI91sP0Fvg7JWMxdrJUeskCJcur5k2OuZiDT8YXlQMbMiyj15xn0dpWqtD6qtf01Ufm0k\nSQI1eFAxsyKStpOmiQ7TO00zqIWzrWDS0vBLeVAxsyJKBSJvJHWdbKa/Ilrsf76WTUKSQA0eVMys\niKS5iKhZvNLWgHXjvgAzO2d9otTrw6zDkYqZFVHqlbOR1DZ3bFlTNlk8qJhZkUHZU+PKmrLJ4EHF\nzMyq8T0VMzOrxoOKmZlV40HFbAiSHpd0UtLXko4r9ZVv61wzkqbaOr5ZDS4oaVZI0h2knuFbcwvo\ny0mtZc3+txypmJW7AphveqNHxHxE/CTpCUlHJZ2Q9LwkQSfSOCjpmKRvJW2T9JakHyQdyPtskPSd\npFfyPm9KuqD/xJLulfSppC8lvdH0iZf0lK
TZHDmt1EPErBUeVMzKvQ9cJel7SYdylVmAZyJiW+7R\ncT4pmmn8FRFTpDLu08BuUs/2nZIuy/tsAg5FxGbgV1I9qI4cEe0D7smFB48Be/P77wO25LUiB1r4\nzGYr8qBiVigifie1xn0EOAO8LmkncJekz3Np9buBLV1vezv//QY4GRE/50jnFKk3CcDpiGi6DL7M\nYn/4xu3ATcDHko4DDwPXkPqA/Am8IOl+4I9qH9bsP/I9FbMhRMTfwAwwkweRXaQmVVMRcVrSflIb\n3UZTzfefrsfN8+b/sX/xWP9zAR9ExJJGUpJuIzXL2gHsIQ1qZiPjSMWskKRNkq7veukWYC4/ns/3\nOXYUHPrqnAQA8ABwpG/7Z8Cdkq7L13GhpBvy+S6OiHeBR4GbC85tNhRHKmblLgKelnQJcBb4kTQV\ntkDqhvgLcLTguHPAbkkvArPAs90bI+JMnmZ7VdJ5+eV9wG/AtKT1pGhmb8G5zYbiMi1mE0TSBuCd\nfJPf7Jzj6S8zM6vGkYqZmVXjSMXMzKrxoGJmZtV4UDEzs2o8qJiZWTUeVMzMrJp/AewSvAMnfGRs\nAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x10e596550>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "freq_dist = nltk.FreqDist(corpus)\n",
    "print(freq_dist)\n",
    "print(freq_dist.most_common(50))\n",
    "freq_dist.plot(50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first line of our output tells us how many tokens (words and punctuation marks) are in our corpus in total (1,583,820 outcomes) and how many unique tokens it contains (39,768 samples). The next line lists the 50 most frequently appearing tokens with their counts, and the plotted graph can be saved as a .png file.\n",
    "\n",
    "We can start to see that attempting analysis on the text in its natural format is not yielding useful results - the inclusion of punctuation and common words such as ‘the’ does not help us understand the content or meaning of the text. The next few sections will explain some of the most common and useful steps involved in NLP, which begin to transform the given text documents into something that can be analysed and utilised more effectively."
   ]
  },
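  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough, library-free illustration of what the frequency distribution reports, the sketch below uses Python's `collections.Counter` on a made-up token list (the `toy_tokens` name and its contents are assumptions for illustration, not part of the movie review corpus). The sum of the counts corresponds to what NLTK calls 'outcomes', and the number of distinct keys to its 'samples'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "# Hypothetical toy token list standing in for the full corpus\n",
    "toy_tokens = ['the', 'film', 'is', 'the', 'best', 'film', ',', 'the', 'end', '.']\n",
    "\n",
    "counts = Counter(toy_tokens)\n",
    "\n",
    "# 'Outcomes' are all tokens counted; 'samples' are the distinct tokens\n",
    "n_outcomes = sum(counts.values())\n",
    "n_samples = len(counts)\n",
    "\n",
    "print(n_outcomes, n_samples)\n",
    "print(counts.most_common(3))"
   ]
  },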
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tokenization\n",
    "\n",
    "When we previously split our raw text document into individual words, we were doing something very similar to tokenization. Tokenization takes a document, breaks it down into individual ‘tokens’ (often words), and stores these in a new data structure. Other minor formatting can be applied here too; for example, all punctuation could be removed.\n",
    "\n",
    "To test this out on an individual review, first we will need to split our raw text document up, this time by review instead of by word.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Collect the raw text of every review, one list entry per file ID\n",
    "reviews = [movie_reviews.raw(fileid) for fileid in movie_reviews.fileids()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The movie reviews are stored according to a file ID such as 'neg/cv000_29416.txt', so this code goes through each of the file IDs and adds the raw text associated with that ID to the ‘reviews’ list. We can now call, for example, reviews[0] to look at the first review by itself."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "plot : two teen couples go to a church party , drink and then drive . \n",
      "they get into an accident . \n",
      "one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \n",
      "what's the deal ? \n",
      "watch the movie and \" sorta \" find out . . . \n",
      "critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \n",
      "which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . \n",
      "they seem to have taken this pretty neat concept , but executed it terribly . \n",
      "so what are the problems with the movie ? \n",
      "well , its main problem is that it's simply too jumbled . \n",
      "it starts off \" normal \" but then downshifts into this \" fantasy \" world in which you , as an audience member , have no idea what's going on . \n",
      "there are dreams , there are characters coming back from the dead , there are others who look like the dead , there are strange apparitions , there are disappearances , there are a looooot of chase scenes , there are tons of weird things that happen , and most of it is simply not explained . \n",
      "now i personally don't mind trying to unravel a film every now and then , but when all it does is give me the same clue over and over again , i get kind of fed up after a while , which is this film's biggest problem . \n",
      "it's obviously got this big secret to hide , but it seems to want to hide it completely until its final five minutes . \n",
      "and do they make things entertaining , thrilling or even engaging , in the meantime ? \n",
      "not really . \n",
      "the sad part is that the arrow and i both dig on flicks like this , so we actually figured most of it out by the half-way point , so all of the strangeness after that did start to make a little bit of sense , but it still didn't the make the film all that more entertaining . \n",
      "i guess the bottom line with movies like this is that you should always make sure that the audience is \" into it \" even before they are given the secret password to enter your world of understanding . \n",
      "i mean , showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy ! ! \n",
      "okay , we get it . . . there \n",
      "are people chasing her and we don't know who they are . \n",
      "do we really need to see it over and over again ? \n",
      "how about giving us different scenes offering further insight into all of the strangeness going down in the movie ? \n",
      "apparently , the studio took this film away from its director and chopped it up themselves , and it shows . \n",
      "there might've been a pretty decent teen mind-fuck movie in here somewhere , but i guess \" the suits \" decided that turning it into a music video with little edge , would make more sense . \n",
      "the actors are pretty good for the most part , although wes bentley just seemed to be playing the exact same character that he did in american beauty , only in a new neighborhood . \n",
      "but my biggest kudos go out to sagemiller , who holds her own throughout the entire film , and actually has you feeling her character's unraveling . \n",
      "overall , the film doesn't stick because it doesn't entertain , it's confusing , it rarely excites and it feels pretty redundant for most of its runtime , despite a pretty cool ending and explanation to all of the craziness that came before it . \n",
      "oh , and by the way , this is not a horror or teen slasher flick . . . it's \n",
      "just packaged to look that way because someone is apparently assuming that the genre is still hot with the kids . \n",
      "it also wrapped production two years ago and has been sitting on the shelves ever since . \n",
      "whatever . . . skip \n",
      "it ! \n",
      "where's joblo coming from ? \n",
      "a nightmare of elm street 3 ( 7/10 ) - blair witch 2 ( 7/10 ) - the crow ( 9/10 ) - the crow : salvation ( 4/10 ) - lost highway ( 10/10 ) - memento ( 10/10 ) - the others ( 9/10 ) - stir of echoes ( 8/10 ) \n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(reviews[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now taking the first review in its natural form, we can break it down even further using tokenizers. The sent_tokenize function will split the review into tokens of sentences, and word_tokenize will split the review into tokens of words. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "plot : two teen couples go to a church party , drink and then drive .\n",
      "plot\n"
     ]
    }
   ],
   "source": [
    "from nltk.tokenize import word_tokenize, sent_tokenize\n",
    "\n",
    "# Split the first review into sentence tokens and word tokens\n",
    "sentences = sent_tokenize(reviews[0])\n",
    "words = word_tokenize(reviews[0])\n",
    "print(sentences[0])\n",
    "print(words[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are lots of options for tokenizing in NLTK which you can read about in the API documentation [here](http://www.nltk.org/api/nltk.tokenize.html). We are going to combine a regular expression tokenizer to remove punctuation, along with the Python function for transforming strings to lowercase, to build our final tokenizer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['plot', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', 'drink', 'and', 'then', 'drive', 'they', 'get', 'into', 'an', 'accident', 'one', 'of', 'the', 'guys', 'dies', 'but', 'his', 'girlfriend', 'continues', 'to', 'see', 'him', 'in', 'her', 'life', 'and', 'has', 'nightmares', 'what', 's', 'the', 'deal', 'watch', 'the', 'movie', 'and', 'sorta', 'find', 'out', 'critique', 'a', 'mind', 'fuck', 'movie', 'for', 'the', 'teen', 'generation', 'that', 'touches', 'on', 'a', 'very', 'cool', 'idea', 'but', 'presents', 'it', 'in', 'a', 'very', 'bad', 'package', 'which', 'is', 'what', 'makes', 'this', 'review', 'an', 'even', 'harder', 'one', 'to', 'write', 'since', 'i', 'generally', 'applaud', 'films', 'which', 'attempt', 'to', 'break', 'the', 'mold', 'mess', 'with', 'your', 'head', 'and', 'such', 'lost', 'highway', 'memento', 'but', 'there', 'are', 'good', 'and', 'bad', 'ways', 'of', 'making', 'all', 'types', 'of', 'films', 'and', 'these', 'folks', 'just', 'didn', 't', 'snag', 'this', 'one', 'correctly', 'they', 'seem', 'to', 'have', 'taken', 'this', 'pretty', 'neat', 'concept', 'but', 'executed', 'it', 'terribly', 'so', 'what', 'are', 'the', 'problems', 'with', 'the', 'movie', 'well', 'its', 'main', 'problem', 'is', 'that', 'it', 's', 'simply', 'too', 'jumbled', 'it', 'starts', 'off', 'normal', 'but', 'then', 'downshifts', 'into', 'this', 'fantasy', 'world', 'in', 'which', 'you', 'as', 'an', 'audience', 'member', 'have', 'no', 'idea', 'what', 's', 'going', 'on', 'there', 'are', 'dreams', 'there', 'are', 'characters', 'coming', 'back', 'from', 'the', 'dead', 'there', 'are', 'others', 'who', 'look', 'like', 'the', 'dead', 'there', 'are', 'strange', 'apparitions', 'there', 'are', 'disappearances', 'there', 'are', 'a', 'looooot', 'of', 'chase', 'scenes', 'there', 'are', 'tons', 'of', 'weird', 'things', 'that', 'happen', 'and', 'most', 'of', 'it', 'is', 'simply', 'not', 'explained', 'now', 'i', 'personally', 'don', 't', 'mind', 'trying', 'to', 'unravel', 'a', 
'film', 'every', 'now', 'and', 'then', 'but', 'when', 'all', 'it', 'does', 'is', 'give', 'me', 'the', 'same', 'clue', 'over', 'and', 'over', 'again', 'i', 'get', 'kind', 'of', 'fed', 'up', 'after', 'a', 'while', 'which', 'is', 'this', 'film', 's', 'biggest', 'problem', 'it', 's', 'obviously', 'got', 'this', 'big', 'secret', 'to', 'hide', 'but', 'it', 'seems', 'to', 'want', 'to', 'hide', 'it', 'completely', 'until', 'its', 'final', 'five', 'minutes', 'and', 'do', 'they', 'make', 'things', 'entertaining', 'thrilling', 'or', 'even', 'engaging', 'in', 'the', 'meantime', 'not', 'really', 'the', 'sad', 'part', 'is', 'that', 'the', 'arrow', 'and', 'i', 'both', 'dig', 'on', 'flicks', 'like', 'this', 'so', 'we', 'actually', 'figured', 'most', 'of', 'it', 'out', 'by', 'the', 'half', 'way', 'point', 'so', 'all', 'of', 'the', 'strangeness', 'after', 'that', 'did', 'start', 'to', 'make', 'a', 'little', 'bit', 'of', 'sense', 'but', 'it', 'still', 'didn', 't', 'the', 'make', 'the', 'film', 'all', 'that', 'more', 'entertaining', 'i', 'guess', 'the', 'bottom', 'line', 'with', 'movies', 'like', 'this', 'is', 'that', 'you', 'should', 'always', 'make', 'sure', 'that', 'the', 'audience', 'is', 'into', 'it', 'even', 'before', 'they', 'are', 'given', 'the', 'secret', 'password', 'to', 'enter', 'your', 'world', 'of', 'understanding', 'i', 'mean', 'showing', 'melissa', 'sagemiller', 'running', 'away', 'from', 'visions', 'for', 'about', '20', 'minutes', 'throughout', 'the', 'movie', 'is', 'just', 'plain', 'lazy', 'okay', 'we', 'get', 'it', 'there', 'are', 'people', 'chasing', 'her', 'and', 'we', 'don', 't', 'know', 'who', 'they', 'are', 'do', 'we', 'really', 'need', 'to', 'see', 'it', 'over', 'and', 'over', 'again', 'how', 'about', 'giving', 'us', 'different', 'scenes', 'offering', 'further', 'insight', 'into', 'all', 'of', 'the', 'strangeness', 'going', 'down', 'in', 'the', 'movie', 'apparently', 'the', 'studio', 'took', 'this', 'film', 'away', 'from', 'its', 'director', 'and', 'chopped', 
'it', 'up', 'themselves', 'and', 'it', 'shows', 'there', 'might', 've', 'been', 'a', 'pretty', 'decent', 'teen', 'mind', 'fuck', 'movie', 'in', 'here', 'somewhere', 'but', 'i', 'guess', 'the', 'suits', 'decided', 'that', 'turning', 'it', 'into', 'a', 'music', 'video', 'with', 'little', 'edge', 'would', 'make', 'more', 'sense', 'the', 'actors', 'are', 'pretty', 'good', 'for', 'the', 'most', 'part', 'although', 'wes', 'bentley', 'just', 'seemed', 'to', 'be', 'playing', 'the', 'exact', 'same', 'character', 'that', 'he', 'did', 'in', 'american', 'beauty', 'only', 'in', 'a', 'new', 'neighborhood', 'but', 'my', 'biggest', 'kudos', 'go', 'out', 'to', 'sagemiller', 'who', 'holds', 'her', 'own', 'throughout', 'the', 'entire', 'film', 'and', 'actually', 'has', 'you', 'feeling', 'her', 'character', 's', 'unraveling', 'overall', 'the', 'film', 'doesn', 't', 'stick', 'because', 'it', 'doesn', 't', 'entertain', 'it', 's', 'confusing', 'it', 'rarely', 'excites', 'and', 'it', 'feels', 'pretty', 'redundant', 'for', 'most', 'of', 'its', 'runtime', 'despite', 'a', 'pretty', 'cool', 'ending', 'and', 'explanation', 'to', 'all', 'of', 'the', 'craziness', 'that', 'came', 'before', 'it', 'oh', 'and', 'by', 'the', 'way', 'this', 'is', 'not', 'a', 'horror', 'or', 'teen', 'slasher', 'flick', 'it', 's', 'just', 'packaged', 'to', 'look', 'that', 'way', 'because', 'someone', 'is', 'apparently', 'assuming', 'that', 'the', 'genre', 'is', 'still', 'hot', 'with', 'the', 'kids', 'it', 'also', 'wrapped', 'production', 'two', 'years', 'ago', 'and', 'has', 'been', 'sitting', 'on', 'the', 'shelves', 'ever', 'since', 'whatever', 'skip', 'it', 'where', 's', 'joblo', 'coming', 'from', 'a', 'nightmare', 'of', 'elm', 'street', '3', '7', '10', 'blair', 'witch', '2', '7', '10', 'the', 'crow', '9', '10', 'the', 'crow', 'salvation', '4', '10', 'lost', 'highway', '10', '10', 'memento', '10', '10', 'the', 'others', '9', '10', 'stir', 'of', 'echoes', '8', '10']\n"
     ]
    }
   ],
   "source": [
    "from nltk.tokenize import RegexpTokenizer\n",
    "tokenizer = RegexpTokenizer(r'\\w+')\n",
    "tokens = tokenizer.tokenize(reviews[0].lower())\n",
    "print(tokens)"
   ]
  },
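  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what the `\\w+` pattern is doing, here is a small sketch using Python's built-in `re` module on a made-up sentence (the `sample` string and the alternative pattern are assumptions for illustration, not something the tutorial relies on). It shows that contractions are split in two, just as ‘what’ and ‘s’ appear separately in the output above, and offers one possible pattern that keeps them together."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Hypothetical sample sentence, not taken from the corpus\n",
    "sample = \"What's the deal? It didn't make sense...\"\n",
    "\n",
    "# Same pattern as the tokenizer above: runs of letters, digits and underscores\n",
    "basic = re.findall(r'\\w+', sample.lower())\n",
    "\n",
    "# Assumed alternative: allow one internal apostrophe so contractions stay whole\n",
    "keep_contractions = re.findall(r\"\\w+(?:'\\w+)?\", sample.lower())\n",
    "\n",
    "print(basic)\n",
    "print(keep_contractions)"
   ]
  },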
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Stop Words\n",
    "\n",
    "We’re starting to head towards a more concise representation of the reviews, which only holds important and useful parts of the text. One further key step in NLP is the removal of stop words, for example ‘the’, ‘and’, ‘to’, which add no value in terms of content or meaning and are used very frequently in almost all forms of text. To do this we can run our document against a predefined list of stop words and remove matching instances."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['plot', 'two', 'teen', 'couples', 'go', 'church', 'party', 'drink', 'drive', 'get', 'accident', 'one', 'guys', 'dies', 'girlfriend', 'continues', 'see', 'life', 'nightmares', 'deal', 'watch', 'movie', 'sorta', 'find', 'critique', 'mind', 'fuck', 'movie', 'teen', 'generation', 'touches', 'cool', 'idea', 'presents', 'bad', 'package', 'makes', 'review', 'even', 'harder', 'one', 'write', 'since', 'generally', 'applaud', 'films', 'attempt', 'break', 'mold', 'mess', 'head', 'lost', 'highway', 'memento', 'good', 'bad', 'ways', 'making', 'types', 'films', 'folks', 'snag', 'one', 'correctly', 'seem', 'taken', 'pretty', 'neat', 'concept', 'executed', 'terribly', 'problems', 'movie', 'well', 'main', 'problem', 'simply', 'jumbled', 'starts', 'normal', 'downshifts', 'fantasy', 'world', 'audience', 'member', 'idea', 'going', 'dreams', 'characters', 'coming', 'back', 'dead', 'others', 'look', 'like', 'dead', 'strange', 'apparitions', 'disappearances', 'looooot', 'chase', 'scenes', 'tons', 'weird', 'things', 'happen', 'simply', 'explained', 'personally', 'mind', 'trying', 'unravel', 'film', 'every', 'give', 'clue', 'get', 'kind', 'fed', 'film', 'biggest', 'problem', 'obviously', 'got', 'big', 'secret', 'hide', 'seems', 'want', 'hide', 'completely', 'final', 'five', 'minutes', 'make', 'things', 'entertaining', 'thrilling', 'even', 'engaging', 'meantime', 'really', 'sad', 'part', 'arrow', 'dig', 'flicks', 'like', 'actually', 'figured', 'half', 'way', 'point', 'strangeness', 'start', 'make', 'little', 'bit', 'sense', 'still', 'make', 'film', 'entertaining', 'guess', 'bottom', 'line', 'movies', 'like', 'always', 'make', 'sure', 'audience', 'even', 'given', 'secret', 'password', 'enter', 'world', 'understanding', 'mean', 'showing', 'melissa', 'sagemiller', 'running', 'away', 'visions', '20', 'minutes', 'throughout', 'movie', 'plain', 'lazy', 'okay', 'get', 'people', 'chasing', 'know', 'really', 'need', 'see', 'giving', 'us', 'different', 'scenes', 'offering', 'insight', 
'strangeness', 'going', 'movie', 'apparently', 'studio', 'took', 'film', 'away', 'director', 'chopped', 'shows', 'might', 'pretty', 'decent', 'teen', 'mind', 'fuck', 'movie', 'somewhere', 'guess', 'suits', 'decided', 'turning', 'music', 'video', 'little', 'edge', 'would', 'make', 'sense', 'actors', 'pretty', 'good', 'part', 'although', 'wes', 'bentley', 'seemed', 'playing', 'exact', 'character', 'american', 'beauty', 'new', 'neighborhood', 'biggest', 'kudos', 'go', 'sagemiller', 'holds', 'throughout', 'entire', 'film', 'actually', 'feeling', 'character', 'unraveling', 'overall', 'film', 'stick', 'entertain', 'confusing', 'rarely', 'excites', 'feels', 'pretty', 'redundant', 'runtime', 'despite', 'pretty', 'cool', 'ending', 'explanation', 'craziness', 'came', 'oh', 'way', 'horror', 'teen', 'slasher', 'flick', 'packaged', 'look', 'way', 'someone', 'apparently', 'assuming', 'genre', 'still', 'hot', 'kids', 'also', 'wrapped', 'production', 'two', 'years', 'ago', 'sitting', 'shelves', 'ever', 'since', 'whatever', 'skip', 'joblo', 'coming', 'nightmare', 'elm', 'street', '3', '7', '10', 'blair', 'witch', '2', '7', '10', 'crow', '9', '10', 'crow', 'salvation', '4', '10', 'lost', 'highway', '10', '10', 'memento', '10', '10', 'others', '9', '10', 'stir', 'echoes', '8', '10']\n"
     ]
    }
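   ],
   "source": [
    "from nltk.corpus import stopwords\n",
    "\n",
    "# Build the stop word set once so each membership test is fast\n",
    "stop_words = set(stopwords.words('english'))\n",
    "tokens = [token for token in tokens if token not in stop_words]\n",
    "print(tokens)"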
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This has reduced the number of tokens in the first review from 726 to 343, so just over half of the words in this instance were essentially redundant."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Lemmatization and Stemming\n",
    "\n",
    "Often different inflections of a word have the same general meaning (at least in terms of data analysis), and it may be useful to group them together as one. For example, instead of handling the words ‘walk’, ‘walks’, ‘walked’, ‘walking’ individually, we may want to treat them all as the same word. There are two common approaches to this - lemmatization and stemming.\n",
    "\n",
    "Lemmatization takes any inflected form of a word and returns its base form - the lemma. To achieve this, we need some context about how the word is used, such as whether it is a noun or adjective. Stemming is a somewhat cruder attempt at generating a root form, often returning simply the first few characters that are shared by every form of the word (which is not always a real word itself). To understand this better, let's test an example.\n",
    "\n",
    "First we’ll import and define a stemmer and lemmatizer - there are different versions available but these are two of the most widely used. Next we’ll compare the word generated by each approach."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "worri worrying worry worrying\n"
     ]
    }
   ],
   "source": [
    "from nltk.stem import PorterStemmer, WordNetLemmatizer\n",
    "stemmer = PorterStemmer()\n",
    "lemmatizer = WordNetLemmatizer()\n",
    "\n",
    "test_word = \"worrying\"\n",
    "word_stem = stemmer.stem(test_word)\n",
    "word_lemmatise = lemmatizer.lemmatize(test_word)\n",
    "word_lemmatise_verb = lemmatizer.lemmatize(test_word, pos=\"v\")\n",
    "word_lemmatise_adj = lemmatizer.lemmatize(test_word, pos=\"a\")\n",
    "print(word_stem, word_lemmatise, word_lemmatise_verb, word_lemmatise_adj)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, the stemmer in this case has output ‘worri’, which is not a real word but a stemmed version of ‘worry’. For the lemmatizer to correctly lemmatize the word ‘worrying’, it needs to know whether the word has been used as a verb or an adjective; without a tag it treats the word as a noun and returns it unchanged. The process for assigning these contextual tags is called part-of-speech tagging and is explained in the following section. Although stemming is a less thorough approach than lemmatization, in practice it often performs equally well or only negligibly worse. You can test other words such as ‘walking’ and see that in this case the stemmer and the lemmatizer (when told the word is a verb) give the same result."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part-of-Speech Tagging\n",
    "\n",
    "As briefly mentioned, part-of-speech tagging refers to tagging a word with a grammatical category. Doing this requires context from the sentence in which the word appears - for example, which words it is adjacent to, and how its meaning could change depending on this. The following function can be called to view a list of all possible part-of-speech tags."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Uncomment the next line to print the full list of tags with descriptions\n",
    "# nltk.help.upenn_tagset()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The extensive list includes PoS tags such as VB (verb in base form), VBD (verb in past tense), VBG (verb as present participle) and so on.\n",
    "\n",
    "To generate the tags for our tokens, we simply import the library and call the pos_tag function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('plot', 'NN'), ('two', 'CD'), ('teen', 'NN'), ('couples', 'NNS'), ('go', 'VBP'), ('church', 'NN'), ('party', 'NN'), ('drink', 'VBP'), ('drive', 'JJ'), ('get', 'NN'), ('accident', 'JJ'), ('one', 'CD'), ('guys', 'NN'), ('dies', 'VBZ'), ('girlfriend', 'VBP'), ('continues', 'VBZ'), ('see', 'VBP'), ('life', 'NN'), ('nightmares', 'NNS'), ('deal', 'VBP'), ('watch', 'JJ'), ('movie', 'NN'), ('sorta', 'NN'), ('find', 'VBP'), ('critique', 'JJ'), ('mind', 'NN'), ('fuck', 'JJ'), ('movie', 'NN'), ('teen', 'JJ'), ('generation', 'NN'), ('touches', 'NNS'), ('cool', 'VBP'), ('idea', 'NN'), ('presents', 'NNS'), ('bad', 'JJ'), ('package', 'NN'), ('makes', 'VBZ'), ('review', 'VB'), ('even', 'RB'), ('harder', 'RBR'), ('one', 'CD'), ('write', 'NN'), ('since', 'IN'), ('generally', 'RB'), ('applaud', 'VBN'), ('films', 'NNS'), ('attempt', 'VB'), ('break', 'JJ'), ('mold', 'NN'), ('mess', 'NN'), ('head', 'NN'), ('lost', 'VBD'), ('highway', 'RB'), ('memento', 'JJ'), ('good', 'JJ'), ('bad', 'JJ'), ('ways', 'NNS'), ('making', 'VBG'), ('types', 'NNS'), ('films', 'NNS'), ('folks', 'NNS'), ('snag', 'VBP'), ('one', 'CD'), ('correctly', 'RB'), ('seem', 'VBP'), ('taken', 'VBN'), ('pretty', 'RB'), ('neat', 'JJ'), ('concept', 'NN'), ('executed', 'VBD'), ('terribly', 'RB'), ('problems', 'NNS'), ('movie', 'NN'), ('well', 'RB'), ('main', 'JJ'), ('problem', 'NN'), ('simply', 'RB'), ('jumbled', 'VBD'), ('starts', 'NNS'), ('normal', 'JJ'), ('downshifts', 'NNS'), ('fantasy', 'JJ'), ('world', 'NN'), ('audience', 'NN'), ('member', 'NN'), ('idea', 'NN'), ('going', 'VBG'), ('dreams', 'JJ'), ('characters', 'NNS'), ('coming', 'VBG'), ('back', 'RB'), ('dead', 'JJ'), ('others', 'NNS'), ('look', 'VBP'), ('like', 'IN'), ('dead', 'JJ'), ('strange', 'JJ'), ('apparitions', 'NNS'), ('disappearances', 'NNS'), ('looooot', 'JJ'), ('chase', 'NN'), ('scenes', 'NNS'), ('tons', 'NNS'), ('weird', 'VBP'), ('things', 'NNS'), ('happen', 'VB'), ('simply', 'RB'), ('explained', 'VBN'), ('personally', 'RB'), ('mind', 'VB'), 
('trying', 'VBG'), ('unravel', 'JJ'), ('film', 'NN'), ('every', 'DT'), ('give', 'NN'), ('clue', 'NN'), ('get', 'VBP'), ('kind', 'NN'), ('fed', 'NN'), ('film', 'NN'), ('biggest', 'JJS'), ('problem', 'NN'), ('obviously', 'RB'), ('got', 'VBD'), ('big', 'JJ'), ('secret', 'JJ'), ('hide', 'NN'), ('seems', 'VBZ'), ('want', 'JJ'), ('hide', 'NN'), ('completely', 'RB'), ('final', 'JJ'), ('five', 'CD'), ('minutes', 'NNS'), ('make', 'VBP'), ('things', 'NNS'), ('entertaining', 'VBG'), ('thrilling', 'VBG'), ('even', 'RB'), ('engaging', 'VBG'), ('meantime', 'RB'), ('really', 'RB'), ('sad', 'JJ'), ('part', 'NN'), ('arrow', 'NN'), ('dig', 'NN'), ('flicks', 'NNS'), ('like', 'IN'), ('actually', 'RB'), ('figured', 'VBN'), ('half', 'JJ'), ('way', 'NN'), ('point', 'NN'), ('strangeness', 'JJ'), ('start', 'RB'), ('make', 'VB'), ('little', 'JJ'), ('bit', 'NN'), ('sense', 'NN'), ('still', 'RB'), ('make', 'VB'), ('film', 'NN'), ('entertaining', 'VBG'), ('guess', 'JJ'), ('bottom', 'JJ'), ('line', 'NN'), ('movies', 'NNS'), ('like', 'IN'), ('always', 'RB'), ('make', 'VB'), ('sure', 'JJ'), ('audience', 'NN'), ('even', 'RB'), ('given', 'VBN'), ('secret', 'JJ'), ('password', 'NN'), ('enter', 'NN'), ('world', 'NN'), ('understanding', 'VBG'), ('mean', 'JJ'), ('showing', 'VBG'), ('melissa', 'JJ'), ('sagemiller', 'NN'), ('running', 'VBG'), ('away', 'RB'), ('visions', 'NNS'), ('20', 'CD'), ('minutes', 'NNS'), ('throughout', 'IN'), ('movie', 'NN'), ('plain', 'NN'), ('lazy', 'JJ'), ('okay', 'JJ'), ('get', 'NN'), ('people', 'NNS'), ('chasing', 'VBG'), ('know', 'VBP'), ('really', 'RB'), ('need', 'JJ'), ('see', 'VBP'), ('giving', 'VBG'), ('us', 'PRP'), ('different', 'JJ'), ('scenes', 'NNS'), ('offering', 'VBG'), ('insight', 'JJ'), ('strangeness', 'NN'), ('going', 'VBG'), ('movie', 'NN'), ('apparently', 'RB'), ('studio', 'NN'), ('took', 'VBD'), ('film', 'NN'), ('away', 'RB'), ('director', 'NN'), ('chopped', 'VBD'), ('shows', 'NNS'), ('might', 'MD'), ('pretty', 'VB'), ('decent', 'JJ'), ('teen', 'JJ'), 
('mind', 'NN'), ('fuck', 'JJ'), ('movie', 'NN'), ('somewhere', 'RB'), ('guess', 'JJ'), ('suits', 'NNS'), ('decided', 'VBD'), ('turning', 'VBG'), ('music', 'NN'), ('video', 'NN'), ('little', 'JJ'), ('edge', 'NN'), ('would', 'MD'), ('make', 'VB'), ('sense', 'NN'), ('actors', 'NNS'), ('pretty', 'RB'), ('good', 'JJ'), ('part', 'NN'), ('although', 'IN'), ('wes', 'NN'), ('bentley', 'NN'), ('seemed', 'VBD'), ('playing', 'VBG'), ('exact', 'JJ'), ('character', 'JJ'), ('american', 'JJ'), ('beauty', 'NN'), ('new', 'JJ'), ('neighborhood', 'NN'), ('biggest', 'JJS'), ('kudos', 'NN'), ('go', 'VBP'), ('sagemiller', 'NN'), ('holds', 'VBZ'), ('throughout', 'IN'), ('entire', 'JJ'), ('film', 'NN'), ('actually', 'RB'), ('feeling', 'VBG'), ('character', 'NN'), ('unraveling', 'VBG'), ('overall', 'JJ'), ('film', 'NN'), ('stick', 'NN'), ('entertain', 'NN'), ('confusing', 'VBG'), ('rarely', 'RB'), ('excites', 'VBZ'), ('feels', 'NNS'), ('pretty', 'RB'), ('redundant', 'JJ'), ('runtime', 'NN'), ('despite', 'IN'), ('pretty', 'JJ'), ('cool', 'JJ'), ('ending', 'VBG'), ('explanation', 'NN'), ('craziness', 'NN'), ('came', 'VBD'), ('oh', 'JJ'), ('way', 'NN'), ('horror', 'NN'), ('teen', 'IN'), ('slasher', 'JJR'), ('flick', 'NN'), ('packaged', 'VBD'), ('look', 'NN'), ('way', 'NN'), ('someone', 'NN'), ('apparently', 'RB'), ('assuming', 'VBG'), ('genre', 'NNS'), ('still', 'RB'), ('hot', 'JJ'), ('kids', 'NNS'), ('also', 'RB'), ('wrapped', 'VBD'), ('production', 'NN'), ('two', 'CD'), ('years', 'NNS'), ('ago', 'RB'), ('sitting', 'VBG'), ('shelves', 'NNS'), ('ever', 'RB'), ('since', 'IN'), ('whatever', 'WDT'), ('skip', 'JJ'), ('joblo', 'NN'), ('coming', 'VBG'), ('nightmare', 'JJ'), ('elm', 'JJ'), ('street', 'NN'), ('3', 'CD'), ('7', 'CD'), ('10', 'CD'), ('blair', 'NN'), ('witch', 'NN'), ('2', 'CD'), ('7', 'CD'), ('10', 'CD'), ('crow', 'NN'), ('9', 'CD'), ('10', 'CD'), ('crow', 'NN'), ('salvation', 'NN'), ('4', 'CD'), ('10', 'CD'), ('lost', 'VBD'), ('highway', 'RB'), ('10', 'CD'), ('10', 'CD'), ('memento', 
'NN'), ('10', 'CD'), ('10', 'CD'), ('others', 'NNS'), ('9', 'CD'), ('10', 'CD'), ('stir', 'NN'), ('echoes', 'NNS'), ('8', 'CD'), ('10', 'CD')]\n"
     ]
    }
   ],
   "source": [
    "from nltk import pos_tag\n",
    "\n",
    "# Tag each token with its Penn Treebank part-of-speech label\n",
    "pos_tokens = pos_tag(tokens)\n",
    "print(pos_tokens)"
   ]
  },
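  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These tags can feed back into the lemmatizer from the previous section, which expects a coarser set of part-of-speech codes (such as 'v' for verb and 'a' for adjective). The helper below is a minimal sketch of one common way to bridge the two; the function name `penn_to_wordnet` and the choice to fall back to noun are assumptions for illustration, not part of NLTK. Each resulting pair could then be passed on as `lemmatizer.lemmatize(word, pos=code)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def penn_to_wordnet(tag):\n",
    "    \"\"\"Map a Penn Treebank tag to a single-letter WordNet part-of-speech code.\"\"\"\n",
    "    if tag.startswith('J'):\n",
    "        return 'a'  # adjectives: JJ, JJR, JJS\n",
    "    if tag.startswith('V'):\n",
    "        return 'v'  # verbs: VB, VBD, VBG, VBN, VBP, VBZ\n",
    "    if tag.startswith('RB'):\n",
    "        return 'r'  # adverbs: RB, RBR, RBS\n",
    "    return 'n'      # nouns and everything else (assumed default)\n",
    "\n",
    "# A few (word, tag) pairs copied from the tagger output above\n",
    "tagged = [('couples', 'NNS'), ('continues', 'VBZ'), ('bad', 'JJ'), ('terribly', 'RB')]\n",
    "wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]\n",
    "print(wordnet_tags)"
   ]
  },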
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Vocabulary\n",
    "\n",
    "So far we’ve mostly been testing these NLP tools on the first individual review in our corpus, but in a real task we would be using all 2000 reviews. In that case, our corpus would generate a total of 1,336,782 word tokens after the tokenization step. But the resulting list contains every use of a word, even when the same word is repeated multiple times. For instance, the sentence “He walked and walked” generates the tokens [‘he’, ‘walked’, ‘and’, ‘walked’]. This is useful for counting frequencies and other such analysis, but we may also want a data structure containing each unique word used in the corpus only once, for example [‘he’, ‘walked’, ‘and’]. For this, we can use Python's built-in set type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "corpus_tokens = tokenizer.tokenize(raw.lower())\n",
    "vocab = sorted(set(corpus_tokens))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This gives us the vocabulary of a corpus, and we can use it to compare the sizes of vocabulary in different texts, or the percentages of the vocabulary which refer to a certain topic or are a certain PoS type. For example, whilst the corpus of reviews contains 1,336,782 word tokens (after tokenization), its vocabulary contains only 39,696 unique words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tokens: 1336782\n",
      "Vocabulary: 39696\n"
     ]
    }
   ],
   "source": [
    "print(\"Tokens:\", len(corpus_tokens))\n",
    "print(\"Vocabulary:\", len(vocab))"
   ]
  },
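  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One simple way to compare vocabularies across texts is the ratio of unique words to total tokens, sometimes called lexical diversity or the type-token ratio. The sketch below is a minimal, library-free illustration; the function name `lexical_diversity` and the toy token lists are assumptions for illustration. Applied to the full corpus figures above, the ratio would be 39,696 / 1,336,782, or roughly 0.03."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def lexical_diversity(tokens):\n",
    "    \"\"\"Return the fraction of tokens that are distinct (0.0 for an empty list).\"\"\"\n",
    "    if not tokens:\n",
    "        return 0.0\n",
    "    return len(set(tokens)) / len(tokens)\n",
    "\n",
    "# Toy examples; the first is the sentence used earlier in this section\n",
    "repetitive = ['he', 'walked', 'and', 'walked']\n",
    "varied = ['he', 'walked', 'and', 'ran']\n",
    "\n",
    "print(lexical_diversity(repetitive))  # 3 unique words out of 4 tokens\n",
    "print(lexical_diversity(varied))      # 4 unique words out of 4 tokens"
   ]
  },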
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Conclusion\n",
    "\n",
    "We have been able to transform movie reviews from long raw text documents into something that a computer can begin to understand. These structured forms can be used for data analysis or as input into machine learning algorithms to determine topics discussed, analyse sentiment expressed, or infer meaning. \n",
    "\n",
    "You can now go ahead and use some of the NLP concepts discussed to start working with text-based data in your own projects. The NLTK movie reviews data set is already categorised into positive and negative reviews, so see if you can use the tools from this tutorial to aid you in creating a machine learning classifier to predict the label of a given review. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [conda root]",
   "language": "python",
   "name": "conda-root-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}