{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Movie reviews sentiment analysis with Naive Bayes\n", "\n", "Project 2, MS621
\n", "*Spring 2021*\n", "\n", "## Goal\n", "\n", "In this project, you will build a multinomial naive bayes classifier to predict whether a movie review is positive or negative. As part of the project, you will also learn to do *k*-fold cross validation testing.\n", "\n", "You will do your work in a `bayes`-*userid* repository. Please keep all of your files in the root directory of the repository. Do not add data to the repo!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting started\n", "\n", "Download and uncompress [polarity data set v2.0](https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz) into the root directory of your repository, but do not add the data to git. My directory looks like:\n", "\n", "```\n", "$ ls\n", "bayes.py review_polarity test_bayes.py\n", "```\n", "\n", "Download the [test_bayes.py](https://github.com/parrt/msds621/blob/master/projects/bayes/test_bayes.py) script into the root directory of your repository; you can add this if you want but I will overwrite it when testing. It assumes that the `review_polarity` directory is in the same directory (the root of the repository).\n", "\n", "Download the [bayes.py starter kit](https://github.com/parrt/msds621/blob/master/projects/bayes/bayes.py) into the root directory of your repository. Make sure to add this to the repo.\n", "\n", "See Naive Bayes discussion, p258 in [Introduction to Information Retrieval](https://nlp.stanford.edu/IR-book/).\n", "\n", "**Please do not add the data to your repository!**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Discussion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Naive bayes classifiers for text documents\n", "\n", "A text classifier predicts to which class, $c$, an unknown document $d$ belongs. In our case, the predictions are binary: $c=0$ for negative movie review and $c=1$ for positive movie review. We can think about classification mathematically as picking the most likely class:\n", "\n", "$$\n", "c^*= \\underset{c}{argmax} ~P(c|d)\n", "$$\n", "\n", "We can replace $P(c|d)$, using Bayes' theorem:\n", "\n", "$$\n", "P(c | d) = \\frac{P(c)P(d|c)}{P(d)}\n", "$$\n", "\n", "to get the formula \n", "\n", "$$\n", "c^*= \\underset{c}{argmax} \\frac{P(c)P(d|c)}{P(d)}\n", "$$\n", "\n", "Since $P(d)$ is a constant for any given document $d$, we can use the following equivalent but simpler formula to find the most likely (even though the scalar $P(c)P(d|c)$ will not technically be a probability):\n", "\n", "$$\n", "c^*= \\underset{c}{argmax} ~ P(c)P(d|c)\n", "$$\n", "\n", "Training then consists of estimating $P(c)$ and $P(c|d)$, which we'll get to shortly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Representing documents\n", "\n", "Text classification requires a representation for document $d$. When loading a document, we first load the text and then tokenize the words, stripping away punctuation and stop words like *the*. The list of words is a fine representation for a document except that each document has a different length, which makes training most models problematic as they assume tabular data with a fixed number of features. The simplest and most common approach is to create an overall vocabulary, $V$, created as a set of unique words across all documents in all classes. *Sort the unique words in the vocab alphabetically so we standardize which word is associated with which word vector index.* Then, the training features are those words." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One way to represent a document then is with a binary word vector, with a 1 in each column if that word is present in the document. Something like this:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
catfoodhongkong
01100
10011
\n", "
" ], "text/plain": [ " cat food hong kong\n", "0 1 1 0 0\n", "1 0 0 1 1" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df = pd.DataFrame(data=[[1,1,0,0],\n", " [0,0,1,1]],\n", " columns=['cat','food','hong','kong'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tends to work well for very short strings/documents, such as article titles or tweets. For longer documents, using a binary presence or absence loses information. Instead, it's better to count the number of times each word is present. For example, here are 3 documents and resulting word vectors:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
catfoodhongkong
03100
10022
21011
\n", "
" ], "text/plain": [ " cat food hong kong\n", "0 3 1 0 0\n", "1 0 0 2 2\n", "2 1 0 1 1" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d1 = \"cats food cats cats\"\n", "d2 = \"hong kong hong kong\"\n", "d3 = \"cats in hong kong\" # assume we strip stop words like \"in\"\n", "df = pd.DataFrame(data=[[3,1,0,0],\n", " [0,0,2,2],\n", " [1,0,1,1]],\n", " columns=['cat','food','hong','kong'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These word vectors with fixed lengths are how most models expect data, including sklearn's implementation. Here's how to train a Naive Bayes model with sklearn using the trivial/toy `df` data and get the training set error:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Correct = 3 / 3 = 100.0%\n" ] } ], "source": [ "from sklearn.naive_bayes import MultinomialNB\n", "import numpy as np\n", "\n", "X = df.values\n", "y = [0, 1, 1] # assume document classes\n", "sknb = MultinomialNB()\n", "sknb.fit(X, y)\n", "y_pred = sknb.predict(X)\n", "print(f\"Correct = {np.sum(y==y_pred)} / {len(y)} = {100*np.sum(y==y_pred) / len(y):.1f}%\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because it is convenient to keep word vectors in a 2D matrix and it is what sklearn likes, we will use the same representation in this project. Given the directory name, your function `load_docs()` will return a list of word lists where each word list is the raw list of tokens, typically with repeated words. Then, function `vocab()` will create the combined vocabulary as a mapping from word to word-feature-index, starting with index 1. Index 0 is reserved for unknown words. Vocabulary $V$ should be a `defaultintdict()`, which I provided, so that unknown words get mapped to value/index 0. Finally, function `vectorize()` will convert that to a 2D matrix, one row per document:\n", "\n", "```\n", "neg = load_docs(neg_dir)\n", "pos = load_docs(pos_dir)\n", "V = vocab(neg,pos)\n", "vneg = vectorize_docs(neg, V)\n", "vpos = vectorize_docs(pos, V)\n", "```\n", "\n", "The `defaultintdict` class behaves exactly like `defaultdict(int)` except `d['foo']` does **not** add 'foo' to dictionary `d`. (Boo for that default behavior in `defaultdict`!)\n", "\n", "```\n", "class defaultintdict(dict):\n", " def __init__(self): # Create dictionary of ints\n", " self._factory=int\n", " super().__init__()\n", "\n", " def __missing__(self, key):\n", " \"Override default behavior so missing returns 0\"\n", " return 0\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Estimating probabilities" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To train our model, we need to estimate $P(c)$ and $P(d|c)$ for all classes and documents. Estimating $P(c)$ is easy: it's just the number of documents in class $c$ divided by the total number of documents. To estimate $P(d | c)$, Naive Bayes assumes that each word is *conditionally independent*, given the class, meaning:\n", "\n", "$$\n", "P(d | c) = \\prod_{w \\in d} P(w | c)\n", "$$\n", "\n", "so that gives us:\n", "\n", "$$\n", "c^*= \\underset{c}{argmax} ~ P(c) \\prod_{w \\in d} P(w | c)\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where $w$ iterates through all (non-unique) words in $d$, so the product includes $P(w|c)$ 5 times if $w$ appears 5 times in $d$.\n", "\n", "Because we are going to use word counts, not binary word vectors, in fixed-length vectors, we need to include $P(w|c)$ explicitly multiple times for repeated $w$ in $d$:\n", "\n", "$$\n", "c^*= \\underset{c}{argmax} ~ P(c) \\prod_{w \\in V} P(w | c)^{n_w(d)}\n", "$$\n", "\n", "where $n_w(d)$ is the number of times $w$ occurs in $d$, $V$ is the overall vocabulary (set of unique words from all documents); $n_w(d)=0$ when $w$ isn't present in $d$.\n", "\n", "Now we have to figure out how to estimate $P(w | c)$, the probability of seeing word $w$ given that we're looking at a document from class $c$. That's just the number of times $w$ appears in all documents from class $c$ divided by the total number of words (including repeats) in all documents from class $c$:\n", "\n", "$$P(w | c) = \\frac{wordcount(w,c)}{wordcount(c)}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Making predictions\n", "\n", "Once we have the appropriate parameter estimates, we have a model that can make predictions in an ideal setting:\n", "\n", "$$\n", "c^*= \\underset{c}{argmax} ~ P(c) \\prod_{w \\in V} P(w | c)^{n_w(d)}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Avoiding $P(w|c)=0$\n", "\n", "If word $w$ does not exist in class $c$ (but is in $V$), then the overall product goes to 0 (and when we take the log below, the classifier would try to evaluate $log(0)$, which is undefined). To solve the problem, we use *Laplace Smoothing*, which just means adding 1 to each word count when computing $P(w|c)$ and making sure to compensate by adding $|V|$ to the denominator (adding 1 for each vocabulary word):\n", "\n", "$$P(w | c) = \\frac{wordcount(w,c) + 1}{wordcount(c) + |V|}$$\n", "\n", "where $|V|$ is the size of the vocabulary for all documents in all classes. Adding this to the denominator, keeps $P(w|c)$ a probability. This way, even if $wordcount(w,c)$ is 0, this ratio > 0. If $w$ is not in any doc of class $c$ then $P(w|c)=1/(wordcount(c)+|V|)$, which is a very low probability. (Note: Each doc's word vector has length $|V|$. During training, any $w$ not found in docs of $c$, will have word count 0. When we add +1, then $c$ looks like it has every word in $V$. Hence, we must divide by $|V|$ not $|V_c|$.) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Dealing with missing words\n", "\n", "Laplace smoothing deals with $w$ that are in the vocabulary $V$ but that are not in a class, hence, giving $wordcount(w,c)=0$ for some $c$. There's one last slightly different problem. If a future unknown document contains a word not in $V$ (i.e., not in the training data), then what should $wordcount(w,c)$ be? Probably not 0 because that would mean we had data indicating it does not appear in class $c$ when we have *no* training data on it.\n", "\n", "To be strictly correct and keep $P(w | c)$ a probability in the presence of unknown words, all we have to do is add 1 to the denominator in addition to the Laplace smoothing changes:\n", "\n", "$$P(w | c) = \\frac{wordcount(w,c) + 1}{wordcount(c) + |V| + 1}$$\n", "\n", "We are lumping all unknown words into a single \"wildcard\" word that exists in every $V$. That has the effect of increasing the overall vocabulary size for class $c$ to include room for an unknown word (and all unknown words map to that same spot). In this way, an unknown word gets probability:\n", "\n", "$$P(unknown|c) = \\frac{0 + 1}{wordcount(c) + |V| + 1}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the end, this is no big deal as all classes will get the same nudge for the unknown word so classification won't be affected.\n", "\n", "To deal with unknown words in the project, we can reserve word index 0 to mean unknown word. All words in the training vocabulary start at index 1. So, if we normally have $|V|$ words in the training vocabulary, we will now have $|V|+1$; no word will ever have 0 word count. Each word vector will be of length $|V|+1$. \n", "\n", "The `vocab()` function in your project returns $|V| = |uniquewords|+1$ to handle the unknown word wildcard. Once computed, the size of the vocabulary should never change; all word vectors are size $|V|$.\n", "\n", "With this new adjusted estimate of $P(w|c)$, we can simplify the overall prediction problem to use $w \\in V$:\n", "\n", "$$\n", "c^*= \\underset{c}{argmax} ~ P(c) \\prod_{w \\in V} P(w | c)^{n_w(d)}\n", "$$\n", "\n", "That means we can use dot products for prediction, which is faster than iterating in python through unique document words." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Floating point underflow\n", "\n", "The first problem involves the limitations of floating-point arithmetic in computers. Because the probabilities are less than one and there could be tens of thousands multiplied together, we risk floating-point underflow. That just means that eventually the product will attenuate to zero and our classifier is useless. The solution is simply to take the log of the right hand side because log is monotonic and won't affect the $argmax$:\n", "\n", "$$\n", "c^*= \\underset{c}{argmax} \\left \\{ log(P(c)) + \\sum_{w \\in V} log(P(w | c)^{n_w(d)}) \\right \\}\n", "$$\n", "\n", "Or,\n", "\n", "$$\n", "c^* = \\underset{c}{argmax} \\left \\{ log(P(c)) + \\sum_{w \\in V} n_w(d) \\times log(P(w | c)) \\right \\}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Speed issues\n", "\n", "For large data sets, Python loops often are too slow and so we have to rely on vector operations, which are implemented in C. For example, the `predict(X)` method receives a 2D matrix of word vectors and must make a prediction for each one. The temptation is to write the very readable:\n", "\n", "```\n", "y_pred = []\n", "for each row d in X:\n", " y_pred = prediction for d\n", "return y_pred\n", "```\n", "\n", "But, you should use the built-in `numpy` functions such as `np.dot` (same as the `@` operator) and apply functions across vectors. For example, if I have a vector, $v$, and I'd like the log of each value, don't write a loop. Use `np.log(v)`, which will give you a vector with the results.\n", "\n", "My `predict()` method consists primarily of a matrix-vector multiplication per class followed by `argmax`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deliverables\n", "\n", "*You must implement naive bayes from scratch; you cannot use sklearn's built-in `MultinomialNB` class for your implementation. Of course, we do use that for testing purposes.*\n", "\n", "To submit your project, ensure that your `bayes.py` file is submitted to your repository. That file must be in the root of your `bayes`-*userid* repository. It should not have a main program; it should consist of a collection of functions. You must implement the following functions:\n", "\n", "* `load_docs(docs_dirname)`\n", "* `vocab(neg, pos)`\n", "* `vectorize(V, docwords)`\n", "* `vectorize_docs(docs, V)`\n", "* `kfold_CV(model, X, y, k=4)` (You must implement manually; don't use sklearn's version)\n", "\n", "and implement class `NaiveBayes621` with these methods\n", "\n", "* `fit(self, X, y)`\n", "* `predict(self, X)`\n", "\n", "\n", "**Please do not add the data to your repository!**\n", "\n", "**Please do not use sklearn's vectorizer / counter objects; you must learn to implement word vectorization yourself.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Submission\n", "\n", "In your github `bayes`-*userid* repository, you should submit your `bayes.py` file in the root directory. It should not have a main program that runs when the file is imported.\n", "\n", "*Please do not add data files to your repository!*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation\n", "\n", "To evaluate your projects I will run `test_bayes.py` from your repo root directory. Here is a sample test run:\n", "\n", "```\n", "$ pytest -v test_bayes.py \n", "platform darwin -- Python 3.8.8, pytest-6.2.3, py-1.10.0, pluggy-0.13.1 -- /Users/parrt/opt/anaconda3/bin/python\n", "cachedir: .pytest_cache\n", "rootdir: /Users/parrt/courses/msds621-private/projects/bayes\n", "plugins: anyio-2.2.0\n", "collected 10 items \n", "\n", "test_bayes.py::test_load PASSED [ 10%]\n", "test_bayes.py::test_vocab PASSED [ 20%]\n", "test_bayes.py::test_vectorize_docs PASSED [ 30%]\n", "test_bayes.py::test_vectorize PASSED [ 40%]\n", "test_bayes.py::test_simple_docs_error PASSED [ 50%]\n", "test_bayes.py::test_unknown_words_vectorize PASSED [ 60%]\n", "test_bayes.py::test_unknown_words_training_error PASSED [ 70%]\n", "test_bayes.py::test_training_error PASSED [ 80%]\n", "test_bayes.py::test_kfold_621 PASSED [ 90%]\n", "test_bayes.py::test_kfold_sklearn_vs_621 PASSED [100%]\n", "\n", "============================= 10 passed in 16.75s ==============================\n", "```\n", "\n", "Notice that it takes about 17 seconds. If your project takes more than 45 seconds, I will take off 10 points from 100. Each test is evaluated in a binary fashion: it either works or it does not. Each failed test cost you 8 points.\n", "\n", "I ran these tests and my hidden 50 times w/o failure so they tests should be ok.\n", "\n", "I have created a hidden test using a totally different data set for this project that has 3 extra tests, each of which is worth 5%. In other words the best you can do without passing the unknown tests is 85%." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Automatic testing using github actions\n", "\n", "Github recently introduced a feature called [actions](https://docs.github.com/en/free-pro-team@latest/actions) for *continuous integration*, which just means that every push to the repository at github from your laptop triggers the unit tests. All you have to do is put the [test.yml](https://github.com/parrt/msds621/blob/master/projects/bayes/test.yml) file I have prepared for you into repo subdirectory `.github/workflows`, commit, and push back to github. Then go to the Actions tab of your repository and you'll see something like this:\n", "\n", "\n", "\n", "Naturally it will only work if you have your software written and added to the repository. Once you have something basic working, this functionality is very nice because it automatically shows you how your software is going to run on a different computer (a linux computer). This will catch the usual errors where you have hardcoded something from your machine into the software. It also gets you in the habit of committing software to the repository as you develop it, rather than using the repository as a homework submission device.\n", "\n", "If you want to get fancy, you can use the following \"badge\" code in your repo README.md file:\n", "\n", "```\n", "Test naive bayes\n", "\n", "![bayes](https://github.com/parrt/parrt-bayes/workflows/Test%20MSDS621%20bayes/badge.svg)\n", "```\n", "\n", "which looks like this on your repo homepage:\n", "\n", "" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 4 }