{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lesson 12 - Introduction to NLP\n",
    "\n",
    "> Introduction to Natural Language Processing (NLP)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/lewtun/dslectures/master?urlpath=lab/tree/notebooks%2Flesson12_nlp-intro.ipynb) [![slides](https://img.shields.io/static/v1?label=slides&message=lesson12_nlp-intro.pdf&color=blue&logo=Google-drive)](https://drive.google.com/open?id=11m5iXGNJEUlvjSMLdQAztJyVZ2LLj4oz)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Learning objectives\n",
    "In this lecture we cover the basics of NLP to build a sentiment classifier in scikit-learn. The learning goals are:\n",
    "* Know the basics of string processing in python\n",
    "* Preprocessing steps in NLP\n",
    "* Count and TF-IDF encodings\n",
    "* Naïve Bayes classifier\n",
    "\n",
    "## References\n",
    "* Chapter 10: Representing and Mining Text in _Data Science for Business_ by F. Provost and P. Fawcett\n",
    "\n",
    "## Homework\n",
    "As homework read the references, work carefully through the notebook and solve the exercises. \n",
    "\n",
    "## Introduction to NLP\n",
    "\n",
    "<div style=\"text-align: center\">\n",
    "<img src='images/natural-language-processing-so-hot-right-now.jpg' width='400'>\n",
    "</div>\n",
    "\n",
    "Natural language processing (NLP) concerns the part of Machine Learning about the analysis of digital, human written texts. The topic of NLP is as old as machine learning itself and dates back to Alan Turing himself. Since text is a widely used medium there are plenty of applications of machine learning:\n",
    "\n",
    "- Text classification\n",
    "- Question/answering systems\n",
    "- Dialogue systems\n",
    "- Named entity recognition\n",
    "- Summarization\n",
    "- Text generation\n",
    "\n",
    "Especially in the past few years there has been exciting and rapid progress in the field. One example is the release of OpenAI's GPT-2, a language model able to not only create realistic text samples but also solve tasks of many NLP benchmarks without special training. See the figure below for an example output of GPT-2.\n",
    "\n",
    "If you want to try your own examples you can do so at [talktotransformer.com](https://talktotransformer.com/) or read the original article on [OpenAI's webpage](https://openai.com/blog/better-language-models/).\n",
    "\n",
    "<div style=\"text-align: center\">\n",
    "<img src='images/gpt2-example.png' width='400'>\n",
    "</div>\n",
    "    \n",
    "Natural text is different to other data sources such as numerical tables or images. One way to look at text is to consider each word to be a feature. Since most languages have of the order of 100k words in their vocabulary plus many variations this leads to an enormous feature space. At the same time most words in the vocabulary do not appear in a small text. This leads to extreme sparsity. These properties call for a different approach to NLP than the methods we encountered and used for tabular data.\n",
    "\n",
    "## Notebook overview\n",
    "The goal of this notebook is to classify movie reviews in terms of positive or negative feedback. This task is called sentiment analysis and is a common NLP application. As a company you might use a sentiment classifier to analyse customer feedback or detect toxic comments on your website.\n",
    "\n",
    "<div style=\"text-align: center\">\n",
    "<img src='images/sentiment-meme.jpg' width='400'>\n",
    "</div>\n",
    "\n",
    "Text data can be messy and require some clean up. The specific steps for the clean-up can depend on how the data was generated or where it was found. Text from the web might have some html artifacts that need cleaning or product reviews could include meta information on the review. Python offers powerful tools to manipulate strings. If cleaning requires complex rules one can also resort to regular expressions or regex for short.\n",
    "\n",
    "Once the text is cleaned we have to encode it in a way that machine learning methods can handle. Directly using text representation as input is not possible. Most machine learning methods can only handle numerical data such as vectors and matrices. So we have to encode the input texts as vectors or matrices. These text representations are called vector encodings. Furthermore, we look at n-grams to keep some of the sequential structure of text.\n",
    "\n",
    "Finally, we can train a model to classify the movie review texts. However, the Random Forest models we already know well do not work well for the high-dimensional data. We introduce a new methods that is common for text data called the Naïve Bayes classifier that utilises Bayes theorem.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import pickle\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "\n",
    "from dslectures.core import get_dataset\n",
    "\n",
    "from tqdm import tqdm\n",
    "tqdm.pandas(desc=\"progress\")\n",
    "\n",
    "import nltk\n",
    "nltk.download('stopwords')\n",
    "nltk.download('punkt')\n",
    "\n",
    "from nltk import word_tokenize\n",
    "from nltk.corpus import stopwords\n",
    "from nltk.stem.snowball import SnowballStemmer\n",
    "\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "from sklearn.metrics.pairwise import cosine_similarity\n",
    "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn.metrics import accuracy_score"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 1: Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we load the IMDB dataset as a dataframe. Note that this is not the original dataset from [here](https://ai.stanford.edu/~amaas/data/sentiment/), but a version that I pre-processed for the ease of use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Download of imdb.csv dataset complete.\n"
     ]
    }
   ],
   "source": [
    "get_dataset('imdb.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_imdb = pd.read_csv('../data/imdb.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>filename</th>\n",
       "      <th>text</th>\n",
       "      <th>sentiment</th>\n",
       "      <th>train_label</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>4715_9</td>\n",
       "      <td>For a movie that gets no respect there sure ar...</td>\n",
       "      <td>pos</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>12390_8</td>\n",
       "      <td>Bizarre horror movie filled with famous faces ...</td>\n",
       "      <td>pos</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>8329_7</td>\n",
       "      <td>A solid, if unremarkable film. Matthau, as Ein...</td>\n",
       "      <td>pos</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>9063_8</td>\n",
       "      <td>It's a strange feeling to sit alone in a theat...</td>\n",
       "      <td>pos</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3092_10</td>\n",
       "      <td>You probably all already know this by now, but...</td>\n",
       "      <td>pos</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  filename                                               text sentiment  \\\n",
       "0   4715_9  For a movie that gets no respect there sure ar...       pos   \n",
       "1  12390_8  Bizarre horror movie filled with famous faces ...       pos   \n",
       "2   8329_7  A solid, if unremarkable film. Matthau, as Ein...       pos   \n",
       "3   9063_8  It's a strange feeling to sit alone in a theat...       pos   \n",
       "4  3092_10  You probably all already know this by now, but...       pos   \n",
       "\n",
       "  train_label  \n",
       "0       train  \n",
       "1       train  \n",
       "2       train  \n",
       "3       train  \n",
       "4       train  "
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_imdb.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset consists of a `filename`, `text`, `sentiment` and a `train_label`. The latter splits the data into a train and test set which is used as the official benchmark. We will follow that same split. \n",
    "\n",
    "But first we want to make the `sentiment` column categorical:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_imdb['sentiment'] = df_imdb['sentiment'].astype('category')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's have a look at a few text examples. For that purpose we wrote a helper function to print examples from the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def print_n_samples(df, n):\n",
    "    \"\"\"\n",
    "    Helper function to print data samples from IMDB dataset.\n",
    "    \"\"\"\n",
    "    for i in range(n):\n",
    "        print('SAMPLE', i+1, '\\n')\n",
    "        df_sample = df.sample(1)\n",
    "        print(df_sample['text'].values[0])\n",
    "        print('\\nSentiment:', df_sample['sentiment'].values[0],'\\n')\n",
    "        print(\"\".join(100*['=']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can show a few examples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "SAMPLE 1 \n",
      "\n",
      "I have just seen Today You Die. It is bad, almost very bad.<br /><br />1) The direction and editing are awful, just awful. Almost made me turn off the movie, Fauntleroy (the director) has no idea what he is doing, he seems to be filming things at random and some scenes don't make sense at all. Also, I hate it when the same scene is used again in the same movie, in this movie some scenes were used 3 or 4 times. Pretty bad.<br /><br />2) The dialogue is sometimes good, sometimes awful. I like the fact that they wanted to make Seagal's character and Treach's character seem like they were in a similar relationship to the characters in Lethal Weapon, but it did not work simply because some of the dialogue DID NOT MAKE SENSE, and I speak English very well, it's not that I did not understand the words, it was the fact that the jokes and dialogue lines had no meaning whatsoever.<br /><br />3) The script is pretty bad. Why do they always try to complicate DTV action movies? Seagal's wife in the movie has psychic abilities, why? Is it useful to the movie? NO. Seagal eliminates a whole bunch of people who work for the guy who betrayed him and he knows these people without having ever met them in the movie. STUPID. The story sometimes goes off track and the jumps back without any reason. The story is messy and pointless sometimes. They should have kept it simple and it would have worked.<br /><br />4) In some of the action scenes it is not Seagal, it is his stunt double. You can tell because they only film him from behind and never show his face. He also beats the guys with movie martial arts, not real ones like the aikido Steven knows. The stunt double uses cheesy kicks and punches.<br /><br />5) Steven is good in the movie. 90-95% of the lines are said with his real voice. The rest is dubbing but it is not that bad. This was good. Also Steven seems to be enjoying himself in the movie and is more into the action that he was in Submerged. He likes Treach as a partner; at least he does not seem to dislike him. Also, he seems to have been in better shape than in some of his recent movies. I hate the fact that he wears clothes to hide his body, but in the same clothes that he wears on the DVD cover he looks more than OK and he should have wore those clothes for most of the movie not the stupid long leather coat.<br /><br />I really think that Seagal was willing to make a good movie. The fact that he came late and took off early from the set ON TWO MOVIES directed by Fauntleroy does not look like a coincidence to me. I think he realized that the crew were amateurs or only in it for a quick buck and he did not give a damn anymore.<br /><br />In the hands of a better company and crew this might have been a damn good action movie for Seagal. Something like Out for Justice or Above the Law. I honestly believe that. But the people who made the movie are not very good at their jobs or they did not have enough money to do the job properly. Too bad since I liked Steven in the movie and Treach was cool (Ice Cool ) too, but the rest was bad. Hey, at least this gives me hope for Black Dawn and Shadows of the past. I think that Mercenary might be just as badly handled. But hey, Steven seemed to be back into the same mood he was in while making his better movies and at least THAT is reason enough to watch the movie.<br /><br />I liked it, but it could have been SO much better. 4/10\n",
      "\n",
      "Sentiment: neg \n",
      "\n",
      "====================================================================================================\n",
      "SAMPLE 2 \n",
      "\n",
      "Boring, long, pretentious, repetitive, self-involved  this move felt like a bad date. Worse, the tedious art-school direction -- with a heavy-handed use of the whirling shot that gets so overdone it almost made me throw up - is constantly screaming to be noticed. Add the thinnest of plots and virtually no dialogue, and the film begins to feel like a four hour epic about 30 minutes in. It gets worse: instead of dialogue there are poorly written voice-overs AND quotes and songs that comment all too obviously on the characters. Really loud opera music too. Blame it all on the director.<br /><br />The actors are all quite good. The lead actor Miguel Angel Hoppe is particularly suited for film stardom. He and the other actors have some tender erotic moments. Even these start to get boring after 5 minutes however, and one wonders if the director is auditioning for a Bel Ami porn job. The stunning college campus architecture as a location in Mexico City is inspiring. How come universities in the US are so bland (SFSU, UC, etc.)? But wait for the DVD on this film. You'll want to use the fast scan button  a lot.\n",
      "\n",
      "Sentiment: neg \n",
      "\n",
      "====================================================================================================\n"
     ]
    }
   ],
   "source": [
    "print_n_samples(df_imdb, 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the reviews are medium sized texts with positive and negative labels."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 1\n",
    "A few exploratory and processing tasks:\n",
    "* Create a plot with showing the distribution of positive and negative comments in the train and test dataset.\n",
    "* Study the distribution of the text lengths. You can perform string operations on a `pandas.DataFrame` by accessing the `str` object of a column: `df['YOUR_TEXT_COLUMN'].str.len()`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 2: Preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this section we have a look at the basics of string processing. Being able to filter/combine/manipulate strings is a crucial skill to do natural language processing.\n",
    "\n",
    "Cleaning up text for NLP tasks usually involves the following steps.\n",
    "* Normalization\n",
    "* Tokenization\n",
    "* Remove stop-words\n",
    "* Remove non-alphabetical tokens\n",
    "* Stemming\n",
    "\n",
    "Some of the steps might not be necessary or you need to add steps depending on the text, task and method. For our task these steps are fine. We apply these steps on one text as an example and then build a function to apply it to all texts."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### String processing in Python\n",
    "Python offers powerful properties and functions to manipulate strings. The Python primer notebook offers an introduction to string processing with Python. Make sure to check it out. Once you are armed with this arsenal of string processing tools, we can preprocess the texts in the dataset to bring them to a cleaner form. Fortunately we don't need to implement everything from scratch. One of the richest Python libraries to process texts is the Natural Language Toolkit (NLTK) which offers some powerful functions we will use.\n",
    "\n",
    "### Exercise 2\n",
    "Work through the string processing introduction in the Python primer notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Input\n",
    "We see that the raw text as several features such as capitalisation, special characters, numericals and puncuations. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan \"The Skipper\" Hale jr. as a police Sgt.\n"
     ]
    }
   ],
   "source": [
    "text = df_imdb.loc[0, 'text']\n",
    "print(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Normalize\n",
    "This is the process of transforming the text to lower-case."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "for a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. imagine a movie where joe piscopo is actually funny! maureen stapleton is a scene stealer. the moroni character is an absolute scream. watch for alan \"the skipper\" hale jr. as a police sgt.\n"
     ]
    }
   ],
   "source": [
    "text = text.lower()\n",
    "print(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tokenize\n",
    "Now we split the text in words/tokens."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['for', 'a', 'movie', 'that', 'gets', 'no', 'respect', 'there', 'sure', 'are', 'a', 'lot', 'of', 'memorable', 'quotes', 'listed', 'for', 'this', 'gem', '.', 'imagine', 'a', 'movie', 'where', 'joe', 'piscopo', 'is', 'actually', 'funny', '!', 'maureen', 'stapleton', 'is', 'a', 'scene', 'stealer', '.', 'the', 'moroni', 'character', 'is', 'an', 'absolute', 'scream', '.', 'watch', 'for', 'alan', '``', 'the', 'skipper', \"''\", 'hale', 'jr.', 'as', 'a', 'police', 'sgt', '.']\n"
     ]
    }
   ],
   "source": [
    "tokens = word_tokenize(text)\n",
    "print(tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Stop Words\n",
    "Next, we remove words that are too common and don't add the the content of sentences. These words are commonly called 'stop words'. NLTK provides a list of stop words:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'shan', 'some', 'couldn', 'their', 'them', 'just', 'me', 'haven', 'didn', 'her', 'have', 'ours', 'this', 'again', \"should've\", 'my', 'it', 'below', 'why', 'few', 'won', 't', 'that', 'you', 'i', 'then', 'wouldn', 'while', 've', 'own', 'in', \"won't\", 'themselves', \"hasn't\", 'both', \"you're\", 'at', 'he', 'they', 'a', 'nor', 'him', 'aren', 'mightn', 'now', 'ma', 'wasn', \"shouldn't\", 'theirs', 'doesn', \"couldn't\", 'than', 'as', \"weren't\", \"you'll\", 'during', 'not', \"hadn't\", 'these', 'isn', 'y', 'which', 'such', 'yourself', 'where', 'any', 'your', 'been', 'what', 'herself', 'we', \"wasn't\", 'over', 'be', 'an', 'yourselves', 'each', 'more', 'most', 'did', 'same', 'for', 'down', 'is', \"wouldn't\", 'itself', 'by', 'before', 'so', 'between', 'from', 're', 'm', \"you'd\", 'on', 'above', 'do', 'had', 'off', 'd', \"aren't\", 'needn', \"you've\", 'yours', 'doing', 'the', 'has', 'there', 'too', 'further', \"she's\", 'because', 'once', 'shouldn', 'being', 'of', 'will', 's', \"haven't\", 'o', 'his', \"isn't\", 'whom', 'should', 'all', \"that'll\", 'those', 'very', 'after', 'don', 'hers', 'mustn', 'and', 'were', \"doesn't\", 'or', \"needn't\", 'its', 'weren', 'myself', 'was', 'to', 'no', \"don't\", 'does', 'll', 'hadn', \"it's\", 'if', 'when', 'until', 'through', \"shan't\", 'with', 'are', 'himself', 'about', 'here', 'how', 'only', 'can', \"mightn't\", \"mustn't\", 'against', 'but', 'who', 'she', 'up', 'under', 'other', 'into', 'am', \"didn't\", 'hasn', 'ain', 'having', 'ourselves', 'out', 'our'}\n"
     ]
    }
   ],
   "source": [
    "stop_words = set(stopwords.words('english'))\n",
    "print(stop_words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We keep only the words that are **not** in the list of stop words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['movie', 'gets', 'respect', 'sure', 'lot', 'memorable', 'quotes', 'listed', 'gem', '.', 'imagine', 'movie', 'joe', 'piscopo', 'actually', 'funny', '!', 'maureen', 'stapleton', 'scene', 'stealer', '.', 'moroni', 'character', 'absolute', 'scream', '.', 'watch', 'alan', '``', 'skipper', \"''\", 'hale', 'jr.', 'police', 'sgt', '.']\n"
     ]
    }
   ],
   "source": [
    "tokens = [i for i in tokens if not i in stop_words]\n",
    "print(tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Punctuation\n",
    "We also want to get of all tokens that are not composed of letters (e.g. punctuation and numbers). We can check if a words is only composed of alphabetic letters with the `isalpha()` and filter with it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['movie', 'gets', 'respect', 'sure', 'lot', 'memorable', 'quotes', 'listed', 'gem', 'imagine', 'movie', 'joe', 'piscopo', 'actually', 'funny', 'maureen', 'stapleton', 'scene', 'stealer', 'moroni', 'character', 'absolute', 'scream', 'watch', 'alan', 'skipper', 'hale', 'police', 'sgt']\n"
     ]
    }
   ],
   "source": [
    "tokens = [i for i in tokens if i.isalpha()]\n",
    "print(tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Stemming\n",
    "As a final step we want to trim the words to the stem. This helps drastically decrease the vocabulary size and maps similar/same words onto the same word. E.g. plural/singular words or different forms of verbs:\n",
    "* pen, pens --> pen\n",
    "* happy, happier --> happi\n",
    "* go, goes --> go\n",
    "\n",
    "There are several languages available in nltk since this is a **language dependant process**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')\n"
     ]
    }
   ],
   "source": [
    "print(SnowballStemmer.languages)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Applied to the text sample this yields:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['movi', 'get', 'respect', 'sure', 'lot', 'memor', 'quot', 'list', 'gem', 'imagin', 'movi', 'joe', 'piscopo', 'actual', 'funni', 'maureen', 'stapleton', 'scene', 'stealer', 'moroni', 'charact', 'absolut', 'scream', 'watch', 'alan', 'skipper', 'hale', 'polic', 'sgt']\n"
     ]
    }
   ],
   "source": [
    "stemmer = SnowballStemmer(\"english\")\n",
    "tokens = [stemmer.stem(i) for i in tokens]\n",
    "print(tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Preprocessing function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def preprocessing(text, language='english', stemming=True):\n",
    "    \"\"\"\n",
    "    preprocess a string and return processed tokens.\n",
    "    args:\n",
    "        text: text string\n",
    "    return:\n",
    "        tokens: list of processed and cleaned words\n",
    "    \"\"\"\n",
    "    \n",
    "    stop_words = set(stopwords.words(language))\n",
    "    stemmer = SnowballStemmer(language)    \n",
    "    \n",
    "    text = text.lower()\n",
    "    tokens = word_tokenize(text)\n",
    "    tokens = [i for i in tokens if not i in stop_words]\n",
    "    tokens = [i for i in tokens if i.isalpha()]\n",
    "    if stemming:\n",
    "        tokens = [stemmer.stem(i) for i in tokens]\n",
    "    \n",
    "    return tokens"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can apply these steps to all texts. We use the `apply` function of pandas which applies a function to every entry in a DataFrame column. Since we registered `tqdm` we can use the `progress_apply` function which uses `apply` and adds a progress bar to it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "progress: 100%|██████████| 50000/50000 [03:17<00:00, 252.92it/s]\n"
     ]
    }
   ],
   "source": [
    "df_imdb['text_processed_stemmed'] = df_imdb['text'].progress_apply(preprocessing)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 3: Vector encoding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Vectorizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we cleaned up and tokenized the text corpus we are now ready to encode the texts in vectors. In class we had a look at simple **one-hot encodings** that can be extended to count encodings and **TF-IDF encodings**.\n",
    "\n",
    "Scikit-learn comes with functions to do both count and TF-IDF encodings on text. The interface is very similar to the classifier just the `predict` step is replace with `transform`:\n",
    "\n",
    "```python\n",
    "count_vectorizer = CountVectorizer(your_settings)\n",
    "count_vectorizer.fit(your_dataset)\n",
    "vec = count_vectorizer.transform('your_text')\n",
    "```\n",
    "\n",
    "This creates a vectorizer that can transform texts to vectors. We can also limit the number of words take into account when building the vector. This limits the vector size and cuts off words that occur rarely. If you set `max_features=10000` only the 10000 most occurring words are used to build the vector and all rare words are excluded. This means that the encoding vector then has a dimension of 10000. For now we take all words (`max_features=None`). Since we used our own tokenizer and preprocessing step we overwrite the standard steps in the vectorizer library with the `vec_default_settings`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vec_default_settings = {'analyzer':'word', 'tokenizer':lambda x: x, 'preprocessor':lambda x: x, 'token_pattern':None,}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tfidf_vec = TfidfVectorizer(max_features=None, **vec_default_settings)\n",
    "count_vec = CountVectorizer(max_features=None, **vec_default_settings)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's test both vectorizers on a small, dummy dataset with **4 documents**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "corpus = [\n",
    "    ['this','is','the','first','document','in','the','corpus'],\n",
    "    ['this','document','is','the','second','document','in','the','corpus'],\n",
    "    ['and','this','is','the','third','one','in','this','corpus'],\n",
    "    ['is','this','the','first','document','in','this','corpus'],\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we fit a count vectorizer to the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
       "                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
       "                lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
       "                ngram_range=(1, 1),\n",
       "                preprocessor=<function <lambda> at 0x123e15bf8>,\n",
       "                stop_words=None, strip_accents=None, token_pattern=None,\n",
       "                tokenizer=<function <lambda> at 0x123e15ae8>, vocabulary=None)"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "count_vec.fit(corpus)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the a vectorizer is fitted, we can investigate the vocabulary. It is a dictionary that points each word to the index in the vector it corresponds to. For example the word `'this'` corresponds to the 10+1-nth (+1 because we start counting at zero) entry in the vector and the word `'and'` corresponds to the the first entry."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "11"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(count_vec.vocabulary_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'this': 10,\n",
       " 'is': 5,\n",
       " 'the': 8,\n",
       " 'first': 3,\n",
       " 'document': 2,\n",
       " 'in': 4,\n",
       " 'corpus': 1,\n",
       " 'second': 7,\n",
       " 'and': 0,\n",
       " 'third': 9,\n",
       " 'one': 6}"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "count_vec.vocabulary_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can transform the corpus and get a list of vectors in the form of a matrix (each row corresponds to a document vector):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0 1 1 1 1 1 0 0 2 0 1]\n",
      " [0 1 2 0 1 1 0 1 2 0 1]\n",
      " [1 1 0 0 1 1 1 0 1 1 2]\n",
      " [0 1 1 1 1 1 0 0 1 0 2]]\n"
     ]
    }
   ],
   "source": [
    "X = count_vec.transform(corpus)\n",
    "print(X.toarray())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we now do the same thing with the TF-IDF vectorizer we see that the output looks different:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0.         0.29137467 0.35639305 0.44021632 0.29137467 0.29137467\n",
      "  0.         0.         0.58274934 0.         0.29137467]\n",
      " [0.         0.23798402 0.58217725 0.         0.23798402 0.23798402\n",
      "  0.         0.45604658 0.47596805 0.         0.23798402]\n",
      " [0.43943636 0.22931612 0.         0.         0.22931612 0.22931612\n",
      "  0.43943636 0.         0.22931612 0.43943636 0.45863224]\n",
      " [0.         0.29137467 0.35639305 0.44021632 0.29137467 0.29137467\n",
      "  0.         0.         0.29137467 0.         0.58274934]]\n"
     ]
    }
   ],
   "source": [
    "tfidf_vec.fit(corpus)\n",
    "X = tfidf_vec.transform(corpus)\n",
    "print(X.toarray())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* The shape of the matrix is the same.\n",
    "* Instead of integers (corresponding to counts) we have continous values.\n",
    "* Elements that occur in multilple documents have lower scores than those appearing in fewer.\n",
    "\n",
    "This should just illustrate how count and TF-IDF vectorizer work. Now let's apply this to our dataset and create encodigs with `100000` words:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### n-grams\n",
    "When we use a count or TF-IDF vectorizer we through away all sequential information in the texts. From the vector encodings above we could not reconstruct the original sentences. For this reason these encodings are called Bag-of-Words encodings (all words go in a bag and are shuffeled). However, sequential information can be important for the meaning of a sentence. As an example imagine the sentence:\n",
    "\n",
    "```Python\n",
    "text = 'The movie was good and not bad.'\n",
    "```\n",
    "\n",
    "It is important to know if the word `'not'` is in front of `'good'` or `'bad'` for determining the sentiment of the sentence. We can preserve some of that information by using n-grams. Instead of just encoding single words we can also encode tuple, triplets etc. called n-grams. The n encodes how many words we bundle together. \n",
    "\n",
    "The vectorizers can do this for us if we provide them a range of n's we want to include. In the following example we encode the text in 1- and 2-grams."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "count_vec = CountVectorizer(max_features=None, ngram_range=(1,2), **vec_default_settings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
       "                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
       "                lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
       "                ngram_range=(1, 2),\n",
       "                preprocessor=<function <lambda> at 0x123e15bf8>,\n",
       "                stop_words=None, strip_accents=None, token_pattern=None,\n",
       "                tokenizer=<function <lambda> at 0x123e15ae8>, vocabulary=None)"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "count_vec.fit(corpus)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that this drastically increases the vocabulary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "30"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(count_vec.vocabulary_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now the vocabulary also contains word tuples next to the words:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'this': 25,\n",
       " 'is': 11,\n",
       " 'the': 18,\n",
       " 'first': 6,\n",
       " 'document': 3,\n",
       " 'in': 8,\n",
       " 'corpus': 2,\n",
       " 'this is': 28,\n",
       " 'is the': 12,\n",
       " 'the first': 20,\n",
       " 'first document': 7,\n",
       " 'document in': 4,\n",
       " 'in the': 9,\n",
       " 'the corpus': 19,\n",
       " 'second': 16,\n",
       " 'this document': 27,\n",
       " 'document is': 5,\n",
       " 'the second': 21,\n",
       " 'second document': 17,\n",
       " 'and': 0,\n",
       " 'third': 23,\n",
       " 'one': 14,\n",
       " 'and this': 1,\n",
       " 'the third': 22,\n",
       " 'third one': 24,\n",
       " 'one in': 15,\n",
       " 'in this': 10,\n",
       " 'this corpus': 26,\n",
       " 'is this': 13,\n",
       " 'this the': 29}"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "count_vec.vocabulary_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The encodings look very similar but are larger due to the larger vocabulary:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0 0 1 1 1 0 1 1 1 1 0 1 1 0 0 0 0 0 2 1 1 0 0 0 0 1 0 0 1 0]\n",
      " [0 0 1 2 1 1 0 0 1 1 0 1 1 0 0 0 1 1 2 1 0 1 0 0 0 1 0 1 0 0]\n",
      " [1 1 1 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 1 0 0 0 1 1 1 2 1 0 1 0]\n",
      " [0 0 1 1 1 0 1 1 1 0 1 1 0 1 0 0 0 0 1 0 1 0 0 0 0 2 1 0 0 1]]\n"
     ]
    }
   ],
   "source": [
    "X = count_vec.transform(corpus)\n",
    "print(X.toarray())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Vectorize dataset\n",
    "Now we want to encode the real text. We put an upper limit on the vocabulary size. Since there a lot of unique words in our corpus there are a lot of combinations of words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_features=100000\n",
    "ngrams=(1,3)\n",
    "\n",
    "count_vec = CountVectorizer(max_features=max_features, ngram_range=ngrams, **vec_default_settings)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the example above we used the `fit` and `transform` function. We can avoid these two steps with the combined function `fit_transform`. First we need to split the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text_train = df_imdb.loc[df_imdb['train_label']=='train', 'text_processed_stemmed']\n",
    "text_test = df_imdb.loc[df_imdb['train_label']=='test', 'text_processed_stemmed']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train = count_vec.fit_transform(text_train)\n",
    "X_test = count_vec.transform(text_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This yields a vocabulary with `100000` entries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "100000"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(count_vec.vocabulary_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking at the shape of the returned matrix we see that it still has as many rows as the input but now has `100000` entries per row (the feature vector)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(25000, 100000)"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 4: Naïve Bayes classifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we featurised the text we can train a model on it. In this section we will use a Naïve Bayes classifier to determine whether a review is positive or negative. Firs we split the labels of the training and test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_train = df_imdb.loc[df_imdb['train_label']=='train', 'sentiment']\n",
    "y_test = df_imdb.loc[df_imdb['train_label']=='test', 'sentiment']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can initialise a Naïve Bayes classifier the same way we initialised the Random Forest models. Also the fit/predict interface is the same."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nb_clf = MultinomialNB()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 90.6 ms, sys: 21.5 ms, total: 112 ms\n",
      "Wall time: 130 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "nb_clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred = nb_clf.predict(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can calculate the prediction accuracy on the test set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.85192"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "accuracy_score(y_test, y_pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Random Forest\n",
    "We can compare the Naïve Bayes model with a Random Forest on the same task with the same input data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rf_clf = RandomForestClassifier(n_jobs=-1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1min 23s, sys: 1.29 s, total: 1min 24s\n",
      "Wall time: 27.2 s\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n",
       "                       criterion='gini', max_depth=None, max_features='auto',\n",
       "                       max_leaf_nodes=None, max_samples=None,\n",
       "                       min_impurity_decrease=0.0, min_impurity_split=None,\n",
       "                       min_samples_leaf=1, min_samples_split=2,\n",
       "                       min_weight_fraction_leaf=0.0, n_estimators=100,\n",
       "                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,\n",
       "                       warm_start=False)"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "rf_clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.85328"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_pred = rf_clf.predict(X_test)\n",
    "accuracy_score(y_test, y_pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that we get similar performance while being **~1000x faster**!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Custom prediction\n",
    "Now that we have a model we want to use it to make some custom predictions. We can build an easy pipeline that preprocesses, vectorises and predicts:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "texts = ['This movie sucked!!',\n",
    "         'This movie is awesome :)',\n",
    "         'I did not like that movie at all.',\n",
    "         'The movie was boring.']\n",
    "\n",
    "enc = count_vec.transform(preprocessing(text) for text in texts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['neg' 'pos' 'neg' 'neg']\n"
     ]
    }
   ],
   "source": [
    "print(nb_clf.predict(enc))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 3\n",
    "Retrain the model with a TF-IDF encoding instead of the count encoding. Experiment with the n-gram setting and run the experiment with the following settings: `(1,1)`, `(1,2)`, `(1,3)` and `(1,4)`."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}