{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Visualizing tweets and the Logistic Regression model\n", "\n", "**Objectives:** Visualize and interpret the logistic regression model\n", "\n", "**Steps:**\n", "* Plot tweets in a scatter plot using their positive and negative sums.\n", "* Plot the output of the logistic regression model in the same plot as a solid line." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import the required libraries\n", "\n", "We will be using [*NLTK*](http://www.nltk.org/howto/twitter.html), an open-source NLP library, for collecting, handling, and processing Twitter data. In this lab, we will use the example dataset that comes with NLTK. This dataset has been manually annotated, so it is useful for quickly establishing model baselines.\n", "\n", "So, to start, let's import the required libraries." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import nltk # NLP toolbox\n", "from os import getcwd\n", "import pandas as pd # Library for DataFrames\n", "from nltk.corpus import twitter_samples\n", "import matplotlib.pyplot as plt # Library for visualization\n", "import numpy as np # Library for math functions\n", "\n", "from utils import process_tweet, build_freqs # Our functions for NLP" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the NLTK sample dataset\n", "\n", "To complete this lab, you need the sample dataset from the previous lab. Here, we assume the files are already available, and we only need to load them into Python lists."
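] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: if the Twitter samples corpus is not yet present on your machine, it can be fetched once with NLTK's downloader. This step assumes an internet connection; skip it if the files are already in place, as we assume above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "\n", "# Download the Twitter samples corpus (only needed once per machine)\n", "nltk.download('twitter_samples')"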
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of tweets: 8000\n" ] } ], "source": [ "# Select the sets of positive and negative tweets\n", "all_positive_tweets = twitter_samples.strings('positive_tweets.json')\n", "all_negative_tweets = twitter_samples.strings('negative_tweets.json')\n", "\n", "tweets = all_positive_tweets + all_negative_tweets # Concatenate the lists\n", "labels = np.append(np.ones((len(all_positive_tweets), 1)), np.zeros((len(all_negative_tweets), 1)), axis=0) # 1 = positive, 0 = negative\n", "\n", "# Build the training set from the first 4000 tweets of each class; the remainder is held out for validation\n", "train_pos = all_positive_tweets[:4000]\n", "train_neg = all_negative_tweets[:4000]\n", "\n", "train_x = train_pos + train_neg\n", "\n", "print(\"Number of tweets: \", len(train_x))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the extracted features\n", "\n", "Part of this week's assignment is the creation of the numerical features needed for the logistic regression model. To avoid interfering with that assignment, we have precomputed these features for the entire training set and stored them in a CSV file.\n", "\n", "So, let's load the features created for the tweets sample." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "|   | bias | positive | negative | sentiment |\n", "|---|------|----------|----------|-----------|\n", "| 0 | 1.0 | 3020.0 | 61.0 | 1.0 |\n", "| 1 | 1.0 | 3573.0 | 444.0 | 1.0 |\n", "| 2 | 1.0 | 3005.0 | 115.0 | 1.0 |\n", "| 3 | 1.0 | 2862.0 | 4.0 | 1.0 |\n", "| 4 | 1.0 | 3119.0 | 225.0 | 1.0 |\n", "| 5 | 1.0 | 2955.0 | 119.0 | 1.0 |\n", "| 6 | 1.0 | 3934.0 | 538.0 | 1.0 |\n", "| 7 | 1.0 | 3162.0 | 276.0 | 1.0 |\n", "| 8 | 1.0 | 628.0 | 189.0 | 1.0 |\n", "| 9 | 1.0 | 264.0 | 112.0 | 1.0 |\n", "