{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Assignment 2: Naive Bayes\n",
"Welcome to week two of this specialization. You will learn about Naive Bayes. Concretely, you will be using Naive Bayes for sentiment analysis on tweets. Given a tweet, you will decide if it has a positive sentiment or a negative one. Specifically you will: \n",
"\n",
"* Train a naive bayes model on a sentiment analysis task\n",
"* Test using your model\n",
"* Compute ratios of positive words to negative words\n",
"* Do some error analysis\n",
"* Predict on your own tweet\n",
"\n",
"You may already be familiar with Naive Bayes and its justification in terms of conditional probabilities and independence.\n",
"* In this week's lectures and assignments we used the ratio of probabilities between positive and negative sentiments.\n",
"* This approach gives us simpler formulas for these 2-way classification tasks.\n",
"\n",
"Load the cell below to import some packages.\n",
"You may want to browse the documentation of unfamiliar libraries and functions."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from utils import process_tweet, lookup\n",
"import pdb\n",
"from nltk.corpus import stopwords, twitter_samples\n",
"import numpy as np\n",
"import pandas as pd\n",
"import nltk\n",
"import string\n",
"from nltk.tokenize import TweetTokenizer\n",
"from os import getcwd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are running this notebook in your local computer,\n",
"don't forget to download the twitter samples and stopwords from nltk.\n",
"\n",
"```\n",
"nltk.download('stopwords')\n",
"nltk.download('twitter_samples')\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# add folder, tmp2, from our local workspace containing pre-downloaded corpora files to nltk's data path\n",
"filePath = f\"{getcwd()}/../tmp2/\"\n",
"nltk.data.path.append(filePath)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# get the sets of positive and negative tweets\n",
"all_positive_tweets = twitter_samples.strings('positive_tweets.json')\n",
"all_negative_tweets = twitter_samples.strings('negative_tweets.json')\n",
"\n",
"# split the data into two pieces, one for training and one for testing (validation set)\n",
"test_pos = all_positive_tweets[4000:]\n",
"train_pos = all_positive_tweets[:4000]\n",
"test_neg = all_negative_tweets[4000:]\n",
"train_neg = all_negative_tweets[:4000]\n",
"\n",
"train_x = train_pos + train_neg\n",
"test_x = test_pos + test_neg\n",
"\n",
"# avoid assumptions about the length of all_positive_tweets\n",
"train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))\n",
"test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Part 1: Process the Data\n",
"\n",
"For any machine learning project, once you've gathered the data, the first step is to process it to make useful inputs to your model.\n",
"- **Remove noise**: You will first want to remove noise from your data -- that is, remove words that don't tell you much about the content. These include all common words like 'I, you, are, is, etc...' that would not give us enough information on the sentiment.\n",
"- We'll also remove stock market tickers, retweet symbols, hyperlinks, and hashtags because they can not tell you a lot of information on the sentiment.\n",
"- You also want to remove all the punctuation from a tweet. The reason for doing this is because we want to treat words with or without the punctuation as the same word, instead of treating \"happy\", \"happy?\", \"happy!\", \"happy,\" and \"happy.\" as different words.\n",
"- Finally you want to use stemming to only keep track of one variation of each word. In other words, we'll treat \"motivation\", \"motivated\", and \"motivate\" similarly by grouping them within the same stem of \"motiv-\".\n",
"\n",
"We have given you the function `process_tweet()` that does this for you."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['hello', 'great', 'day', ':)', 'good', 'morn']\n"
]
}
],
"source": [
"custom_tweet = \"RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np\"\n",
"\n",
"# print cleaned tweet\n",
"print(process_tweet(custom_tweet))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1.1 Implementing your helper functions\n",
"\n",
"To help train your naive bayes model, you will need to build a dictionary where the keys are a (word, label) tuple and the values are the corresponding frequency. Note that the labels we'll use here are 1 for positive and 0 for negative.\n",
"\n",
"You will also implement a `lookup()` helper function that takes in the `freqs` dictionary, a word, and a label (1 or 0) and returns the number of times that word and label tuple appears in the collection of tweets.\n",
"\n",
"For example: given a list of tweets `[\"i am rather excited\", \"you are rather happy\"]` and the label 1, the function will return a dictionary that contains the following key-value pairs:\n",
"\n",
"{\n",
" (\"rather\", 1): 2\n",
" (\"happi\", 1) : 1\n",
" (\"excit\", 1) : 1\n",
"}\n",
"\n",
"- Notice how for each word in the given string, the same label 1 is assigned to each word.\n",
"- Notice how the words \"i\" and \"am\" are not saved, since it was removed by process_tweet because it is a stopword.\n",
"- Notice how the word \"rather\" appears twice in the list of tweets, and so its count value is 2.\n",
"\n",
"#### Instructions\n",
"Create a function `count_tweets()` that takes a list of tweets as input, cleans all of them, and returns a dictionary.\n",
"- The key in the dictionary is a tuple containing the stemmed word and its class label, e.g. (\"happi\",1).\n",
"- The value the number of times this word appears in the given collection of tweets (an integer)."
]
},
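One possible sketch of `count_tweets()` and `lookup()` following the instructions above. This is a sketch, not the provided implementations: the signature of `count_tweets()` (accumulating into a passed-in `result` dictionary alongside a list of labels `ys`) is an assumption, and the real `process_tweet()` from `utils` is stubbed here with a toy lowercase-and-split cleaner so the example is self-contained:

```python
# placeholder cleaner (assumption): the real process_tweet, imported from
# utils in this notebook, also removes stopwords/handles/punctuation and stems
def process_tweet(tweet):
    return tweet.lower().split()

def count_tweets(result, tweets, ys):
    """Build a {(word, label): count} dictionary over cleaned tweets.

    result -- dictionary to fill (allows accumulating across calls)
    tweets -- list of raw tweet strings
    ys     -- labels (1 positive, 0 negative), same length as tweets
    """
    for y, tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            result[pair] = result.get(pair, 0) + 1
    return result

def lookup(freqs, word, label):
    """Return the stored count for (word, label), or 0 if unseen."""
    return freqs.get((word, label), 0)

freqs = count_tweets({}, ["happy happy day", "sad day"], [1, 0])
print(lookup(freqs, "happy", 1))  # 2
print(lookup(freqs, "day", 0))    # 1
```

Using `dict.get` with a default of 0 keeps `lookup()` safe for (word, label) pairs that never occur in the training tweets.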
{
"cell_type": "markdown",
"metadata": {},
"source": [
" \n",
"\n",
" Hints\n",
"
\n",
"\n",
"
\n",
"
\n", " Words\n", " | \n", "\n", " Positive word count\n", " | \n", "\n", " Negative Word Count\n", " | \n", "
\n", " glad\n", " | \n", "\n", " 41\n", " | \n", "\n", " 2\n", " | \n", "
\n", " arriv\n", " | \n", "\n", " 57\n", " | \n", "\n", " 4\n", " | \n", "
\n", " :(\n", " | \n", "\n", " 1\n", " | \n", "\n", " 3663\n", " | \n", "
\n", " :-(\n", " | \n", "\n", " 0\n", " | \n", "\n", " 378\n", " | \n", "