{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Assignment 1: Logistic Regression\n",
"Welcome to week one of this specialization. You will learn about logistic regression. Concretely, you will be implementing logistic regression for sentiment analysis on tweets. Given a tweet, you will decide if it has a positive sentiment or a negative one. Specifically you will: \n",
"\n",
"* Learn how to extract features for logistic regression given some text\n",
"* Implement logistic regression from scratch\n",
"* Apply logistic regression on a natural language processing task\n",
"* Test using your logistic regression\n",
"* Perform error analysis\n",
"\n",
"We will be using a data set of tweets. Hopefully you will get more than 99% accuracy. \n",
"Run the cell below to load in the packages."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import functions and data"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# run this cell to import nltk\n",
"import nltk\n",
"from os import getcwd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Imported functions\n",
"\n",
"Download the data needed for this assignment. Check out the [documentation for the twitter_samples dataset](http://www.nltk.org/howto/twitter.html).\n",
"\n",
"* twitter_samples: if you're running this notebook on your local computer, you will need to download it using:\n",
"```Python\n",
"nltk.download('twitter_samples')\n",
"```\n",
"\n",
"* stopwords: if you're running this notebook on your local computer, you will need to download it using:\n",
"```python\n",
"nltk.download('stopwords')\n",
"```\n",
"\n",
"#### Import some helper functions that we provided in the utils.py file:\n",
"* `process_tweet()`: cleans the text, tokenizes it into separate words, removes stopwords, and converts words to stems.\n",
"* `build_freqs()`: this counts how often a word in the 'corpus' (the entire set of tweets) was associated with a positive label '1' or a negative label '0', then builds the `freqs` dictionary, where each key is a (word,label) tuple, and the value is the count of its frequency within the corpus of tweets."
]
},
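{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of what `process_tweet()` produces (the tweet and output below are hypothetical; the exact result depends on the tokenizer, stopword list, and stemmer used in `utils.py`):\n",
"\n",
"```python\n",
"tweet = '@user Great day for #NLP!!! :) https://example.com'\n",
"print(process_tweet(tweet))\n",
"# might print something like: ['great', 'day', 'nlp', ':)']\n",
"```"
]
},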
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"# add folder, tmp2, from our local workspace containing pre-downloaded corpora files to nltk's data path\n",
"# this enables importing of these files without downloading it again when we refresh our workspace\n",
"\n",
"filePath = f\"{getcwd()}/../tmp2/\"\n",
"nltk.data.path.append(filePath)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from nltk.corpus import twitter_samples \n",
"\n",
"from utils import process_tweet, build_freqs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prepare the data\n",
"* The `twitter_samples` contains subsets of 5,000 positive tweets, 5,000 negative tweets, and the full set of 10,000 tweets. \n",
" * If you used all three datasets, we would introduce duplicates of the positive tweets and negative tweets. \n",
" * You will select just the five thousand positive tweets and five thousand negative tweets."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"# select the set of positive and negative tweets\n",
"all_positive_tweets = twitter_samples.strings('positive_tweets.json')\n",
"all_negative_tweets = twitter_samples.strings('negative_tweets.json')"
]
},
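{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (an extra step, not part of the graded assignment), you can confirm that both subsets loaded and contain 5,000 tweets each:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# each subset should contain 5,000 tweets\n",
"print(f\"positive tweets: {len(all_positive_tweets)}\")\n",
"print(f\"negative tweets: {len(all_negative_tweets)}\")"
]
},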
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Train test split: 20% will be in the test set, and 80% in the training set.\n"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"# split the data into two pieces, one for training and one for testing (validation set) \n",
"test_pos = all_positive_tweets[4000:]\n",
"train_pos = all_positive_tweets[:4000]\n",
"test_neg = all_negative_tweets[4000:]\n",
"train_neg = all_negative_tweets[:4000]\n",
"\n",
"train_x = train_pos + train_neg \n",
"test_x = test_pos + test_neg"
]
},
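{
"cell_type": "markdown",
"metadata": {},
"source": [
"A brief check (added for illustration) that the 80/20 split came out as expected, with 8,000 training tweets and 2,000 test tweets:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# the training set should hold 8,000 tweets and the test set 2,000\n",
"print(f\"train_x: {len(train_x)} tweets\")\n",
"print(f\"test_x: {len(test_x)} tweets\")"
]
},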
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Create the numpy array of positive labels and negative labels."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"# combine positive and negative labels\n",
"train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)\n",
"test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"train_y.shape = (8000, 1)\n",
"test_y.shape = (2000, 1)\n"
]
}
],
"source": [
"# Print the shape train and test sets\n",
"print(\"train_y.shape = \" + str(train_y.shape))\n",
"print(\"test_y.shape = \" + str(test_y.shape))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Create the frequency dictionary using the imported `build_freqs()` function. \n",
" * We highly recommend that you open `utils.py` and read the `build_freqs()` function to understand what it is doing.\n",
" * To view the file directory, go to the menu and click File->Open.\n",
"\n",
"```Python\n",
" for y,tweet in zip(ys, tweets):\n",
" for word in process_tweet(tweet):\n",
" pair = (word, y)\n",
" if pair in freqs:\n",
" freqs[pair] += 1\n",
" else:\n",
" freqs[pair] = 1\n",
"```\n",
"* Notice how the outer for loop goes through each tweet, and the inner for loop steps through each word in a tweet.\n",
"* The `freqs` dictionary is the frequency dictionary that's being built. \n",
"* The key is the tuple (word, label), such as (\"happy\",1) or (\"happy\",0). The value stored for each key is the count of how many times the word \"happy\" was associated with a positive label, or how many times \"happy\" was associated with a negative label."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"type(freqs) =
\n", "
\n", "
\n", "