{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Sentiment Analysis using Airline Tweets\n", "Author: Matthew Huh\n", "\n", "## Introduction\n", "\n", "Social media is a treasure trove of textual data. It’s a free and easy way for users to express themselves and share whatever they want to say, get attention, and even start movements. \n", "\n", "It’s also a powerful way for companies to learn what users think about them, understand how their brands are perceived, and identify ways to improve business by analyzing the concerns that users have with their products or services. That is what I aim to accomplish in this project: analyze mentions of a few airlines to determine how people perceive their options and which concerns they raise. \n", "\n", "\n", "## About the Data\n", "\n", "The data for this project has been obtained from two sources. The first data set was pre-compiled by CrowdFlower and is freely available on Kaggle. The second data set was collected through Twitter’s API (using tweepy, to be precise). \n", "\n", "The first data set has far more information, as human reviewers labeled each tweet’s sentiment and the rationale behind negative comments. For the sake of simplicity, we will assume that those labels are correct. The data set contains tweets from February 2015 and mentions only 6 airlines ('American', 'Delta', 'Southwest', 'US Airways', 'United', 'Virgin America'). \n", "\n", "Our second data set is more recent (November 2018) and includes both the airlines in our training set and a few that are not. This is done in order to introduce data that we haven’t trained with yet. 
In theory, the models developed in this project should be able to classify these unseen tweets as well.\n", "\n", "## Research Question\n", "\n", "How accurate a model can we build to determine a tweet's sentiment?\n", "\n", "## Sources\n", "\n", "https://www.kaggle.com/crowdflower/twitter-airline-sentiment (Pre-compiled Kaggle data)\n", "\n", "http://nbviewer.jupyter.org/github/mhuh22/Thinkful/blob/master/Bootcamp/Unit%207/Twitter%20API%20%40Airline%20Tweets.ipynb (Tweepy Script)\n", "\n", "
\n", "Note: The visuals for this project do not render correctly on GitHub; to view the presentation as intended, please use the following link.\n", "\n", "http://nbviewer.jupyter.org/github/mhuh22/Portfolio/blob/master/Sentiment%20Analysis%20with%20Airline%20Tweets/Airline%20Sentiment%20Analysis%20using%20Twitter%20Data.ipynb\n", "
\n", "\n", "## Packages" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\mhuh22\\Anaconda3\\lib\\site-packages\\sklearn\\ensemble\\weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.\n", " from numpy.core.umath_tests import inner1d\n" ] }, { "data": { "text/html": [ "" ], "text/vnd.plotly.v1+html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Necessary imports\n", "import os\n", "import time\n", "import timeit\n", "import numpy as np\n", "import pandas as pd\n", "import scipy\n", "import sklearn\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from sklearn.utils import resample\n", "%matplotlib inline\n", "\n", "# Modelling packages\n", "from sklearn import ensemble\n", "from sklearn.feature_selection import chi2, f_classif, SelectKBest \n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import adjusted_rand_score, classification_report, confusion_matrix, silhouette_score\n", "from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.preprocessing import normalize\n", "\n", "# Natural Language processing\n", "import nltk\n", "import re\n", "import spacy\n", "from collections import Counter\n", "from nltk.corpus import stopwords\n", "from nltk.stem import WordNetLemmatizer\n", "from sklearn.datasets import fetch_rcv1\n", "from sklearn.decomposition import NMF, TruncatedSVD, LatentDirichletAllocation as LDA\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.preprocessing import Normalizer\n", "from nltk.corpus import stopwords\n", "from wordcloud import WordCloud, STOPWORDS\n", "\n", "# Clustering 
packages\n", "import sklearn.cluster as cluster\n", "from sklearn.cluster import KMeans, MeanShift, estimate_bandwidth, SpectralClustering, AffinityPropagation\n", "from scipy.spatial.distance import cdist\n", "\n", "# Plotly packages\n", "import cufflinks as cf\n", "import ipywidgets as widgets\n", "import plotly as py\n", "import plotly.figure_factory as ff\n", "import plotly.graph_objs as go\n", "from plotly import tools\n", "from scipy import special\n", "py.offline.init_notebook_mode(connected=True)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " tweet_id airline_sentiment airline_sentiment_confidence \\\n", "0 570306133677760513 neutral 1.0000 \n", "1 570301130888122368 positive 0.3486 \n", "2 570301083672813571 neutral 0.6837 \n", "3 570301031407624196 negative 1.0000 \n", "4 570300817074462722 negative 1.0000 \n", "\n", " negativereason negativereason_confidence airline \\\n", "0 NaN NaN Virgin America \n", "1 NaN 0.0000 Virgin America \n", "2 NaN NaN Virgin America \n", "3 Bad Flight 0.7033 Virgin America \n", "4 Can't Tell 1.0000 Virgin America \n", "\n", " airline_sentiment_gold name negativereason_gold retweet_count \\\n", "0 NaN cairdin NaN 0 \n", "1 NaN jnardino NaN 0 \n", "2 NaN yvonnalynn NaN 0 \n", "3 NaN jnardino NaN 0 \n", "4 NaN jnardino NaN 0 \n", "\n", " text tweet_coord \\\n", "0 @VirginAmerica What @dhepburn said. NaN \n", "1 @VirginAmerica plus you've added commercials t... NaN \n", "2 @VirginAmerica I didn't today... Must mean I n... NaN \n", "3 @VirginAmerica it's really aggressive to blast... NaN \n", "4 @VirginAmerica and it's a really big bad thing... 
NaN \n", "\n", " tweet_created tweet_location user_timezone \n", "0 2015-02-24 11:35:52 -0800 NaN Eastern Time (US & Canada) \n", "1 2015-02-24 11:15:59 -0800 NaN Pacific Time (US & Canada) \n", "2 2015-02-24 11:15:48 -0800 Lets Play Central Time (US & Canada) \n", "3 2015-02-24 11:15:36 -0800 NaN Pacific Time (US & Canada) \n", "4 2015-02-24 11:14:45 -0800 NaN Pacific Time (US & Canada) " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import the data\n", "tweets = pd.read_csv(\"airline_tweets/Tweets.csv\")\n", "\n", "# Preview the dataset\n", "tweets.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(14640, 15)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View the size of the dataset\n", "tweets.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset has more information than we need for this project. We need the tweet text, since that is what we are evaluating; the sentiment, since that is what we are trying to predict; the negative reason, to see which kinds of complaints people raise; and the airline, to compare carriers. The remaining columns could have some impact on the outcome, but they are not what we are trying to measure, so we'll drop them before continuing in order to improve our runtimes." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Condense dataframe to only include what we want\n", "tweets = tweets[['airline_sentiment', 'negativereason', 'airline', 'text']]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['American', 'Delta', 'Southwest', 'US Airways', 'United', 'Virgin America']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print unique airlines in the dataset\n", "sorted(tweets['airline'].unique())" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "airline_sentiment 3\n", "negativereason 10\n", "airline 6\n", "text 14427\n", "dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count the unique values in each column\n", "tweets.nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Visualization" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "data": [ { "labels": [ "negative", "neutral", "positive" ], "type": "pie", "uid": "f1ffdb71-7d4c-4c53-995d-d3319f78fc07", "values": [ 9178, 3099, 2363 ] } ], "layout": { "autosize": false, "height": 400, "title": "Tweet Sentiment", "width": 500, "yaxis": { "title": "Number of tweets" } } }, "text/html": [ "
" ], "text/vnd.plotly.v1+html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# View distribution of tweets by sentiment \n", "# (Changing colors to red/gray/green would be nice)\n", "trace = go.Pie(labels=tweets['airline_sentiment'].value_counts().index, \n", "               values=tweets['airline_sentiment'].value_counts())\n", "\n", "# Create the layout\n", "layout = go.Layout(\n", "    title = 'Tweet Sentiment',\n", "    height = 400,\n", "    width = 500,\n", "    autosize = False,\n", "    yaxis = dict(title='Number of tweets')\n", ")\n", "\n", "fig = go.Figure(data = [trace], layout = layout)\n", "py.offline.iplot(fig, filename='cufflinks/simple')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the sentiment labels in our dataset, nearly ⅔ of the tweets (9,178 of 14,640) are negative, while only about ⅙ (2,363) are positive. We do have to keep in mind that these are tweets, not reviews, so they are more often a way to voice a specific concern or complaint than a considered judgment of the experience as a whole. " ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "data": [ { "opacity": 0.7, "text": [ 4220, 2280, 1771, 1269, 915, 852, 837, 698, 240, 134 ], "textposition": "outside", "type": "bar", "uid": "6f6f29de-ce95-4dc3-9ad0-fc8b3c04af42", "x": [ "Customer Service Issue", "Late Flight", "Can't Tell", "Cancelled Flight", "Lost Luggage", "Flight Booking Problems", "Bad Flight", "Flight Attendant Complaints", "longlines", "Damaged Luggage" ], "y": [ 4220, 2280, 1771, 1269, 915, 852, 837, 698, 240, 134 ] } ], "layout": { "title": "Negative Tweets by Reason", "yaxis": { "title": "Number of tweets" } } }, "text/html": [ "
" ], "text/vnd.plotly.v1+html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plots the complaint reasons, and their frequency\n", "# (It might be nice to somehow show how common each reason is for each airline)\n", "\n", "# The input is the number of negative tweets by reason\n", "data = [go.Bar(\n", "    x = tweets.negativereason.value_counts().index,\n", "    y = tweets.negativereason.value_counts(),\n", "    opacity = 0.7,\n", "    text=tweets.negativereason.value_counts(),\n", "    textposition='outside'\n", ")]\n", "\n", "# Create the layout\n", "layout = go.Layout(\n", "    title = 'Negative Tweets by Reason',\n", "    yaxis = dict(title='Number of tweets')\n", ")\n", "\n", "fig = go.Figure(data = data, layout = layout)\n", "py.offline.iplot(fig, filename='cufflinks/simple')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the roughly 9,000 negative tweets in our dataset, the reasons have been distilled into 10 categories, the most common of which relates to customer service. Let’s take a look at what those complaints look like." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "15 @united @CheerTymeDad So I can buy tix 3 days before flight but can't transfer the tix. Flawed security logic. Flawed customer service\n", "16 @united I did start a claim but 8-10 weeks is unrealistic, am I really supposed to go that long with out a car seat for my child.Ridiculous!\n", "25 @united beginning of Feb I called United they said they would send another voucher by mail. Never got anything. #tiredofwaiting\n", "26 @United the internet is a great thing. 
I am emailing executives in your company, maybe they will respond to me in a timely manner.\n", "32 @united I am trying to book awards for September and need flights on @aegeanairlines but they will not show even w/ many award seats availab\n", "Name: text, dtype: object" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets.loc[tweets['negativereason']=='Customer Service Issue']['text'].head()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "data": [ { "marker": { "color": "rgba(200,0,0,.7)" }, "name": "Negative", "type": "bar", "uid": "2422b61d-e296-45e2-9fe1-8056d7805a91", "x": [ "American", "Delta", "Southwest", "US Airways", "United", "Virgin America" ], "y": [ 1960, 955, 1186, 2263, 2633, 181 ] }, { "marker": { "color": "rgba(150,150,150,.7)" }, "name": "Neutral", "type": "bar", "uid": "921d525a-6a78-49cd-b32d-7c2bbfc328a5", "x": [ "American", "Delta", "Southwest", "US Airways", "United", "Virgin America" ], "y": [ 463, 723, 664, 381, 697, 171 ] }, { "marker": { "color": "rgba(0,200,0,.7)" }, "name": "Positive", "type": "bar", "uid": "1b55df28-1606-4f2f-bbb6-d8d6908d3ea1", "x": [ "American", "Delta", "Southwest", "US Airways", "United", "Virgin America" ], "y": [ 336, 544, 570, 269, 492, 152 ] } ], "layout": { "barmode": "group", "title": "Airline Sentiment (Total Tweets)", "yaxis": { "title": "Number of tweets" } } }, "text/html": [ "
" ], "text/vnd.plotly.v1+html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Show distribution of texts\n", "\n", "trace1 = go.Bar(\n", " x = sorted(tweets['airline'].unique()),\n", " y = tweets[tweets['airline_sentiment'] == 'negative'].groupby('airline')['airline_sentiment'].value_counts(),\n", " name = 'Negative',\n", " marker = dict(color='rgba(200,0,0,.7)')\n", ")\n", "\n", "trace2 = go.Bar(\n", " x = sorted(tweets['airline'].unique()),\n", " y = tweets[tweets['airline_sentiment'] == 'neutral'].groupby('airline')['airline_sentiment'].value_counts(),\n", " name = 'Neutral',\n", " marker = dict(color='rgba(150,150,150,.7)')\n", ")\n", "\n", "trace3 = go.Bar(\n", " x = sorted(tweets['airline'].unique()),\n", " y = tweets[tweets['airline_sentiment'] == 'positive'].groupby('airline')['airline_sentiment'].value_counts(),\n", " name = 'Positive',\n", " marker = dict(color='rgba(0,200,0,.7)')\n", ")\n", "\n", "data = [trace1, trace2, trace3]\n", "layout = go.Layout(\n", " title = 'Airline Sentiment (Total Tweets)',\n", " barmode='group',\n", " yaxis = dict(title='Number of tweets')\n", ")\n", "\n", "fig = go.Figure(data=data, layout=layout)\n", "py.offline.iplot(fig, filename='stacked-bar')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "For the above visual, please hover over the bars to see the actual values, and feel free to toggle the sentiment categories on the right.\n", "
\n", "\n", "First impressions say a lot, and the one you might be getting with this chart is that there are a lot of tweets about American, US Airways, and United, and it doesn’t look very good. Granted, they are some of the largest airlines in the world, but it seems like people have just as much to say.\n", "\n", "### Class Imbalance\n", "\n", "Based on the number of total tweets above, we can see that certain airlines don't have the same presence as others, namely Virgin America, which is a much smaller airline than the others in our dataset. In order to fix this issue, I will be upsampling the data so that all of the airlines will have an equal number of tweets and look at the ratio of tweet sentiments rather than the absolute numbers." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "airline\n", "United 3822\n", "US Airways 2913\n", "American 2759\n", "Southwest 2420\n", "Delta 2222\n", "Virgin America 504\n", "dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count # of tweets for each airline\n", "tweets.groupby(['airline']).size().sort_values(ascending = False)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "airline\n", "American 3822\n", "Delta 3822\n", "Southwest 3822\n", "US Airways 3822\n", "United 3822\n", "Virgin America 3822\n", "dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create temporary dataframe for all airlines to upsample and concatenate\n", "tweets_united = tweets[tweets.airline=='United']\n", "sample_size = len(tweets[tweets['airline']=='United'])\n", "\n", "# Upsample all other airlines\n", "tweets_usairways = resample(tweets[tweets.airline=='US Airways'], \n", " replace=True, n_samples=sample_size)\n", "tweets_american = resample(tweets[tweets.airline=='American'], \n", " replace=True, 
n_samples=sample_size)\n", "tweets_southwest = resample(tweets[tweets.airline=='Southwest'], \n", " replace=True, n_samples=sample_size)\n", "tweets_delta = resample(tweets[tweets.airline=='Delta'], \n", " replace=True, n_samples=sample_size)\n", "tweets_virgin = resample(tweets[tweets.airline=='Virgin America'], \n", " replace=True, n_samples=sample_size)\n", "\n", "# Concatenate the individual dataframes\n", "tweets = pd.concat([tweets_united, \n", " tweets_usairways, \n", " tweets_american, \n", " tweets_southwest,\n", " tweets_delta,\n", " tweets_virgin])\n", "\n", "tweets = tweets.reset_index(drop=True)\n", "\n", "# Count # of tweets for each airline to verify\n", "tweets.groupby(['airline']).size()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "data": [ { "marker": { "color": "rgba(200,0,0,.7)" }, "name": "Negative", "type": "bar", "uid": "c565e91d-d9d5-45da-9dfb-460cec80b277", "x": [ "American", "Delta", "Southwest", "US Airways", "United", "Virgin America" ], "y": [ 0.7087912087912088, 0.4330193615907902, 0.4869178440607012, 0.7846677132391419, 0.6889063317634746, 0.3555729984301413 ] }, { "marker": { "color": "rgba(150,150,150,.7)" }, "name": "Neutral", "type": "bar", "uid": "d62aac71-c2c9-4c54-a059-43106b6c24bb", "x": [ "American", "Delta", "Southwest", "US Airways", "United", "Virgin America" ], "y": [ 0.1698063840920984, 0.31998953427524857, 0.27341705913134484, 0.12663526949241236, 0.18236525379382523, 0.3362114076399791 ] }, { "marker": { "color": "rgba(0,200,0,.7)" }, "name": "Positive", "type": "bar", "uid": "8d43e555-adee-4c1e-bd86-3660d70a332d", "x": [ "American", "Delta", "Southwest", "US Airways", "United", "Virgin America" ], "y": [ 0.12140240711669283, 0.24699110413396128, 0.23966509680795395, 0.08869701726844584, 0.12872841444270017, 0.3082155939298796 ] } ], "layout": { "barmode": "group", "title": "Airline Sentiment (Percentage)", "yaxis": { "title": "% of 
tweets" } } }, "text/html": [ "
" ], "text/vnd.plotly.v1+html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Show distribution of texts\n", "\n", "trace1 = go.Bar(\n", " x = sorted(tweets['airline'].unique()),\n", " y = (tweets[tweets['airline_sentiment'] == 'negative'].groupby('airline')['airline_sentiment'].value_counts().values) / (tweets['airline'].value_counts().sort_index().values),\n", " name = 'Negative',\n", " marker = dict(color='rgba(200,0,0,.7)')\n", ")\n", "\n", "trace2 = go.Bar(\n", " x = sorted(tweets['airline'].unique()),\n", " y = (tweets[tweets['airline_sentiment'] == 'neutral'].groupby('airline')['airline_sentiment'].value_counts().values) / (tweets['airline'].value_counts().sort_index().values),\n", " name = 'Neutral',\n", " marker = dict(color='rgba(150,150,150,.7)')\n", ")\n", "\n", "trace3 = go.Bar(\n", " x = sorted(tweets['airline'].unique()),\n", " y = (tweets[tweets['airline_sentiment'] == 'positive'].groupby('airline')['airline_sentiment'].value_counts().values) / (tweets['airline'].value_counts().sort_index().values),\n", " name = 'Positive',\n", " marker = dict(color='rgba(0,200,0,.7)')\n", ")\n", "\n", "data = [trace1, trace2, trace3]\n", "layout = go.Layout(\n", " title = 'Airline Sentiment (Percentage)',\n", " barmode='group',\n", " yaxis = dict(title='% of tweets')\n", ")\n", "\n", "fig = go.Figure(data=data, layout=layout)\n", "py.offline.iplot(fig, filename='stacked-bar')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "For the above visual, please feel free to toggle the sentiment categories on the right.\n", "
\n", "\n", "After balancing the number of tweets for each airline, we can get a better idea of what people think about their choice of airline. Virgin America seems to fare the best with an almost equal number of positive and negative tweets, followed by Delta and Southwest. \n", "\n", "## Text Cleaning\n", "\n", "There’s only so much that you can fit into 140 characters or less (the training set is from February 2015, before Twitter updated their character limit), and it’s not as if most of these tweets have as much post-processing put into them as an academic essay, so there are going to be quite a few components that need to be cleaned up before modelling. To keep it short, here is what needs to happen first:\n", "\n", "* Remove special characters\n", "* Remove excess spaces\n", "* Remove stopwords like ‘like’, ‘the’, and ‘and’ \n", "* Reduce words to their lemmas (base forms)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def text_cleaner(text):\n", "    # Visual inspection identifies a form of punctuation spaCy does not\n", "    text = re.sub(r'\\n',' ',text)\n", "    text = re.sub(r'\\t',' ',text)\n", "    text = re.sub(r'--',' ',text)\n", "    # Drop anything enclosed in square brackets\n", "    text = re.sub(r\"[\\[].*?[\\]]\", \"\", text)\n", "    # Collapse runs of whitespace into single spaces\n", "    text = ' '.join(text.split())\n", "    return text" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 @united thanks\n", "1 @united Thanks for taking care of that MR!! Happy customer.\n", "2 @united still no refund or word via DM. Please resolve this issue as your Cancelled Flightled flight was useless to my assistant's trip.\n", "3 @united Delayed due to lack of crew and now delayed again because there's a long line for deicing... Still need to improve service #united\n", "4 @united thanks we filled it out. How's our luck with this? 
Is it common?\n", "Name: text, dtype: object" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Remove non-essential punctuation from the tweets\n", "pd.options.display.max_colwidth = 200\n", "tweets['text'] = tweets['text'].map(lambda x: text_cleaner(str(x)))\n", "tweets['text'].head()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Reduce each word to its lemma and store the result back in the dataframe\n", "lemmatizer = WordNetLemmatizer()\n", "\n", "tweets['text'] = tweets['text'].map(\n", "    lambda tweet: ' '.join(lemmatizer.lemmatize(word) for word in tweet.split()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now for the last step before modelling: converting the sentiment from text to numbers. Most models expect numerical inputs, so we have to remap these values before continuing." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Modify values of sentiment to numerical values\n", "sentiment = {'negative': -1, 'neutral': 0, 'positive': 1}\n", "tweets['airline_sentiment'] = tweets['airline_sentiment'].map(lambda x: sentiment[x])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Natural Language Processing\n", "\n", "Now that we’ve reached the end of the course, it’s time to pull out all the stops for natural language processing. We’ll try out a few methods and combine them to generate the most useful features for sentiment prediction. We’ll start with a few simpler methods, like extracting the types of words used in each tweet using spaCy and looking at the characteristics of each tweet, and finally extract the most useful words in the dataset using tf-idf vectorization." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# Identify our predictor (tweets) and outcome (sentiment) variable\n", "X = tweets['text']\n", "y = tweets['airline_sentiment']\n", "\n", "# Instantiating spaCy\n", "nlp = spacy.load('en')\n", "X_words = []\n", "\n", "# Create list of dataframes that we'll combine later\n", "nlp_methods = []" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# Count parts of speech\n", "\n", "for row in X:\n", "    row_doc = nlp(row) # Processing each row for tokens\n", "    sent_len = len(row_doc) # Calculating length of each sentence\n", "\n", "    advs = 0 # Initializing counts of different parts of speech\n", "    verb = 0\n", "    noun = 0\n", "    adj = 0\n", "    \n", "    for token in row_doc:\n", "        # Identifying each part of speech and adding to counts\n", "        if token.pos_ == 'ADV':\n", "            advs +=1\n", "        elif token.pos_ == 'VERB':\n", "            verb +=1\n", "        elif token.pos_ == 'NOUN':\n", "            noun +=1\n", "        elif token.pos_ == 'ADJ':\n", "            adj +=1\n", "    # Creating a list of all features for each sentence\n", "    X_words.append([row_doc, advs, verb, noun, adj, sent_len])\n", "\n", "# Create dataframe with count of adverbs, verbs, nouns, and adjectives\n", "X_count = pd.DataFrame(data=X_words, columns=['BOW', 'ADV', 'VERB', 'NOUN', 'ADJ', 'sent_length'])\n", "\n", "# Change token count to token percentage\n", "for column in X_count.columns[1:5]:\n", "    X_count[column] = X_count[column] / X_count['sent_length']\n", "\n", "# Normalize X_count\n", "X_counter = normalize(X_count.drop('BOW',axis=1))\n", "X_counter = pd.DataFrame(data=X_counter)\n", "\n", "nlp_methods.append(X_counter)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Track tweet composition\n", "\n", "# Create new dataframe to track tweet characteristics\n", "tweet_characteristics = pd.DataFrame()\n", "\n", "# Track characteristics of each tweet\n", 
"# (build the stopword set once instead of reloading the list for every word)\n", "stop_words = set(stopwords.words('english'))\n", "tweet_characteristics['word_count'] = tweets['text'].apply(lambda x: len(str(x).split(\" \")))\n", "tweet_characteristics['char_count'] = tweets['text'].str.len()\n", "tweet_characteristics['stop_count'] = tweets['text'].apply(lambda x: len([w for w in x.split() if w in stop_words]))\n", "# special_count is the number of hashtags in the tweet\n", "tweet_characteristics['special_count'] = tweets['text'].apply(lambda x: len([w for w in x.split() if w.startswith('#')]))\n", "\n", "# Preview the tweet characteristics\n", "tweet_characteristics.head()\n", "\n", "nlp_methods.append(pd.DataFrame(normalize(tweet_characteristics)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tf-idf Vectorization\n", "The next set of features that we are going to add are the most useful words in our dataset. How are we going to determine which words are the most \"useful\"? With a tf-idf vectorizer, of course.\n", "\n", "Tf (term frequency) tracks how often each word appears in a given tweet, while idf (inverse document frequency) places less weight on words that occur in many tweets and therefore carry little predictive power. Put together, it's a tool that allows us to assign an 'importance' value to each word based on its frequency within a tweet and across the entire dataset." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# Parameters for TF-idf vectorizer\n", "vectorizer = TfidfVectorizer(max_df=0.5, # Throw out words that occur in over half of tweets\n", " min_df=3, # Words need to apper at least 3 times to count\n", " max_features=1200, \n", " stop_words='english', # Ignore stop words\n", " lowercase=True, # Ignore case\n", " use_idf=True, # Penalize frequent words\n", " norm=u'l2',\n", " smooth_idf=True # Add 1 to df in case we have to divide by 0\n", " )\n", "\n", "#Applying the vectorizer\n", "X_tfidf=vectorizer.fit_transform(X)\n", "\n", "#splitting into training and test sets\n", "X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(X_tfidf, y, test_size=0.25, random_state=42)\n", "\n", "#Removes all zeros from the matrix\n", "X_train_tfidf_csr = X_train_tfidf.tocsr()\n", "\n", "#number of paragraphs\n", "n = X_train_tfidf_csr.shape[0]\n", "\n", "#A list of dictionaries, one per paragraph\n", "tfidf_bypara = [{} for _ in range(0,n)]\n", "\n", "#List of features\n", "terms = vectorizer.get_feature_names()\n", "\n", "#for each paragraph, lists the feature words and their tf-idf scores\n", "for i, j in zip(*X_train_tfidf_csr.nonzero()):\n", " tfidf_bypara[i][terms[j]] = X_train_tfidf_csr[i, j]\n", "\n", "# Normalize the dataset \n", "X_norm = normalize(X_train_tfidf)\n", "\n", "# Convert from tf-idf matrix to dataframe\n", "X_normal = pd.DataFrame(data=X_norm.toarray())\n", "\n", "# Append tf-idf vectorizer to our list of nlp methods\n", "nlp_methods.append(X_normal)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "## Creating tf-idf matrix\n", "synopsis_tfidf = vectorizer.fit_transform(tweets['text'])\n", "\n", "# Getting the word list.\n", "terms = vectorizer.get_feature_names()\n", "\n", "# Linking words to topics\n", "def word_topic(tfidf,solution, wordlist):\n", " \n", " # Loading scores for each word on each 
topic/component.\n", "    words_by_topic=tfidf.T * solution\n", "\n", "    # Linking the loadings to the words in an easy-to-read way.\n", "    components=pd.DataFrame(words_by_topic,index=wordlist)\n", " \n", "    return components\n", "\n", "# Extracts the top N words and their loadings for each topic.\n", "def top_words(components, n_top_words):\n", "    n_topics = range(components.shape[1])\n", "    index= np.repeat(n_topics, n_top_words, axis=0)\n", "    topwords=pd.Series(index=index)\n", "    for column in range(components.shape[1]):\n", "        # Sort the column so that highest loadings are at the top.\n", "        sortedwords=components.iloc[:,column].sort_values(ascending=False)\n", "        # Choose the N highest loadings.\n", "        chosen=sortedwords[:n_top_words]\n", "        # Combine loading and index into a string.\n", "        chosenlist=chosen.index +\" \"+round(chosen,2).map(str) \n", "        topwords.loc[column]=chosenlist\n", "    return topwords\n", "\n", "# Number of words to look at for each topic.\n", "n_top_words = 200\n", "\n", "# Number of possible outcomes (3 = positive, neutral, negative)\n", "ntopics = tweets['airline_sentiment'].nunique()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Parameters for LSA\n", "svd= TruncatedSVD(ntopics)\n", "lsa = make_pipeline(svd, Normalizer(copy=False))\n", "\n", "# Time and run LSA model\n", "start_time = timeit.default_timer()\n", "synopsis_lsa = lsa.fit_transform(synopsis_tfidf)\n", "elapsed_lsa = timeit.default_timer() - start_time\n", "\n", "# Extract most common words for LSA\n", "components_lsa = word_topic(synopsis_tfidf, synopsis_lsa, terms)\n", "topwords=pd.DataFrame()\n", "topwords['LSA']=top_words(components_lsa, n_top_words)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# Parameters for NNMF\n", "nmf = NMF(alpha=0.0, \n", "          init='nndsvdar', # how starting values are calculated\n", "          l1_ratio=0.0, # Sets whether regularization is L2 (0), L1 (1), or a 
combination (values between 0 and 1)\n", " max_iter=200, # when to stop even if the model is not converging (to prevent running forever)\n", " n_components=ntopics, \n", " random_state=0, \n", " solver='cd', # Use Coordinate Descent to solve\n", " tol=0.0001, # model will stop if tfidf-WH <= tol\n", " verbose=0 # amount of output to give while iterating\n", " )\n", "\n", "# Time and run NNMF model\n", "start_time = timeit.default_timer()\n", "synopsis_nmf = nmf.fit_transform(synopsis_tfidf)\n", "elapsed_nmf = timeit.default_timer() - start_time\n", "\n", "# Extract most common words for NNMF\n", "components_nmf = word_topic(synopsis_tfidf, synopsis_nmf, terms)\n", "topwords['NNMF']=top_words(components_nmf, n_top_words)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top positive words: \n", "\n", " LSA NNMF\n", "0 flight 863.54 flight 42.82\n", "0 southwestair 824.93 southwestair 30.41\n", "0 united 761.71 americanair 29.94\n", "0 americanair 739.49 united 28.63\n", "0 usairways 687.61 usairways 26.72\n", "0 jetblue 562.68 cancelled 19.22\n", "0 thanks 484.34 thanks 15.63\n", "0 virginamerica 452.83 thank 12.26\n", "0 http 424.11 flightled 12.0\n", "0 just 324.94 help 10.99 \n", "\n", "\n", "Top neutral words: \n", "\n", " LSA NNMF\n", "1 jetblue 755.76 jetblue 83.23\n", "1 http 226.13 http 26.59\n", "1 fleek 105.45 fleek 13.72\n", "1 fleet 103.5 fleet 13.46\n", "1 rt 43.34 thanks 12.68\n", "1 jfk 30.73 thank 10.1\n", "1 love 26.45 flight 9.42\n", "1 wall 22.44 just 5.21\n", "1 blue 21.41 great 5.0\n", "1 ceo 20.18 rt 4.92 \n", "\n", "\n", "Top negative words: \n", "\n", " LSA NNMF\n", "2 virginamerica 749.26 virginamerica 73.34\n", "2 http 192.65 http 21.2\n", "2 love 53.64 thanks 18.15\n", "2 virgin 51.36 flight 10.85\n", "2 website 50.33 flights 5.86\n", "2 site 46.05 love 5.82\n", "2 thanks 45.6 guys 5.53\n", "2 carrieunderwood 44.49 flying 5.37\n", "2 ladygaga 41.73 
website 4.89\n", "2 flying 41.38 airline 4.75\n" ] } ], "source": [ "# View top words identified by LSA and NNMF\n", "print('Top positive words: \\n\\n', topwords[:10], '\\n\\n')\n", "print('Top neutral words: \\n\\n', topwords[200:210], '\\n\\n')\n", "print('Top negative words: \\n\\n', topwords[400:410])" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>0</th><th>1</th><th>2</th><th>3</th><th>0</th><th>...</th><th>1190</th><th>1191</th><th>1192</th><th>1193</th><th>1194</th><th>1195</th><th>1196</th><th>1197</th><th>1198</th><th>1199</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>0.000000</td><td>0.000000</td><td>0.235702</td><td>0.235702</td><td>0.942809</td><td>0.141421</td><td>0.989949</td><td>0.000000</td><td>0.000000</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
" <tr><th>1</th><td>0.000000</td><td>0.011831</td><td>0.017747</td><td>0.005916</td><td>0.999755</td><td>0.166899</td><td>0.984702</td><td>0.050070</td><td>0.000000</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
" <tr><th>2</th><td>0.001479</td><td>0.005917</td><td>0.010354</td><td>0.004438</td><td>0.999918</td><td>0.166470</td><td>0.984345</td><td>0.057903</td><td>0.000000</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
" <tr><th>3</th><td>0.005486</td><td>0.009601</td><td>0.006858</td><td>0.001372</td><td>0.999914</td><td>0.170984</td><td>0.983159</td><td>0.064119</td><td>0.007124</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
" <tr><th>4</th><td>0.003086</td><td>0.009258</td><td>0.006172</td><td>0.009258</td><td>0.999890</td><td>0.190428</td><td>0.979343</td><td>0.068010</td><td>0.000000</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
" </tbody>\n", "</table>\n", "<p>5 rows × 1209 columns</p>\n", "</div>
" ], "text/plain": [ " 0 1 2 3 4 0 1 \\\n", "0 0.000000 0.000000 0.235702 0.235702 0.942809 0.141421 0.989949 \n", "1 0.000000 0.011831 0.017747 0.005916 0.999755 0.166899 0.984702 \n", "2 0.001479 0.005917 0.010354 0.004438 0.999918 0.166470 0.984345 \n", "3 0.005486 0.009601 0.006858 0.001372 0.999914 0.170984 0.983159 \n", "4 0.003086 0.009258 0.006172 0.009258 0.999890 0.190428 0.979343 \n", "\n", " 2 3 0 ... 1190 1191 1192 1193 1194 1195 1196 \\\n", "0 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.050070 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "2 0.057903 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "3 0.064119 0.007124 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "4 0.068010 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", " 1197 1198 1199 \n", "0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", "\n", "[5 rows x 1209 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create final training dataset\n", "X_train = pd.concat(nlp_methods, axis=1)\n", "\n", "# Review the data that we are placing into our models\n", "X_train.head()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Generates a wordcloud of the most common words in negative tweets\n", "df=tweets[tweets['airline_sentiment']==-1]\n", "words = ' '.join(df['text'])\n", "cleaned_word = \" \".join([word for word in words.split()\n", " if 'http' not in word\n", " and not word.startswith('@')\n", " and word != 'RT'\n", " ])\n", "\n", "wordcloud = WordCloud(stopwords=STOPWORDS,\n", " background_color='black',\n", " max_words = 100\n", " ).generate(cleaned_word)\n", "\n", "plt.figure(1,figsize=(12, 12))\n", "plt.imshow(wordcloud)\n", "plt.axis('off')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Modelling Phase\n", "\n", "So, the tweets have been parsed and prepared for our models. We're trying to determine if it's possible to predict sentiment (a binary variable) so that means that we will require some classification models. As for which models to test, we will be using \n", "* Logistic Regression\n", "* Random Forest Ensemble\n", "* Gradient Boosting Ensemble\n", "* Neural Networks\n", "\n", "These four models differ fundamentally in how they operate, and range from least to most complex, meaning that logistic regression will almost always take the least amount of time, but the latter three will likely be more accurate." 
] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "# Create lists to track runtime and scores\n", "models = ['Logistic regression' , 'Random forest', 'Gradient Boosting', 'Neural Networks']\n", "runtime = []\n", "train_score = []\n", "test_score = []" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "def run_model(model):\n", " \n", "    # Cross-validate the model on the training set\n", "    train_set = cross_val_score(model, X_train_tfidf, y_train_tfidf, cv=5, n_jobs=-1)\n", " \n", "    # Cross-validate the model on the test set and time it\n", "    start_time = timeit.default_timer()\n", "    test_set = cross_val_score(model, X_test_tfidf, y_test_tfidf, cv=5, n_jobs=-1)\n", "    elapsed_time = timeit.default_timer() - start_time\n", " \n", "    # Append the scores and runtime to our lists\n", "    train_score.append(train_set.mean())\n", "    test_score.append(test_set.mean())\n", "    runtime.append(elapsed_time)\n", " \n", "    # Fit the model to the training data\n", "    model.fit(X_train_tfidf, y_train_tfidf)\n", " \n", "    # Predict sentiment for the test set\n", "    y_pred = model.predict(X_test_tfidf)\n", " \n", "    # Print scores and runtime\n", "    print(str(model), '\\n\\nTrain score: {:.5f}(+/- {:.2f})\\n'.format(train_set.mean(), train_set.std()*2))\n", "    print('Test score: {:.5f}(+/- {:.2f})\\n'.format(test_set.mean(), test_set.std()*2))\n", "    print('Runtime:', elapsed_time, 'seconds\\n')\n", " \n", "    # Generate and print the confusion matrix\n", "    print('Confusion matrix:\\n\\n', confusion_matrix(y_test_tfidf, y_pred))\n", " \n", "    # Print the model's statistics\n", "    print('\\nClassification Report:\\n\\n' + classification_report(y_test_tfidf, y_pred))" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, max_iter=100, multi_class='multinomial',\n", " n_jobs=1, 
penalty='l2', random_state=None, solver='lbfgs',\n", " tol=0.0001, verbose=0, warm_start=False) \n", "\n", "Train score: 0.81918(+/- 0.01)\n", "\n", "Test score: 0.77256(+/- 0.02)\n", "\n", "Runtime: 9.902313456999991 seconds\n", "\n", "Confusion matrix:\n", "\n", " [[2997 223 56]\n", " [ 449 808 99]\n", " [ 155 104 842]]\n", "\n", "Classification Report:\n", "\n", " precision recall f1-score support\n", "\n", " -1 0.83 0.91 0.87 3276\n", " 0 0.71 0.60 0.65 1356\n", " 1 0.84 0.76 0.80 1101\n", "\n", "avg / total 0.81 0.81 0.81 5733\n", "\n" ] } ], "source": [ "f" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RandomForestClassifier(bootstrap=True, class_weight='balanced',\n", " criterion='gini', max_depth=None, max_features='auto',\n", " max_leaf_nodes=None, min_impurity_decrease=0.0,\n", " min_impurity_split=None, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " n_estimators=60, n_jobs=1, oob_score=True, random_state=None,\n", " verbose=0, warm_start=False) \n", "\n", "Train score: 0.88278(+/- 0.02)\n", "\n", "Test score: 0.79628(+/- 0.03)\n", "\n", "Runtime: 19.516328983999983 seconds\n", "\n", "Confusion matrix:\n", "\n", " [[3099 131 46]\n", " [ 260 1025 71]\n", " [ 109 48 944]]\n", "\n", "Classification Report:\n", "\n", " precision recall f1-score support\n", "\n", " -1 0.89 0.95 0.92 3276\n", " 0 0.85 0.76 0.80 1356\n", " 1 0.89 0.86 0.87 1101\n", "\n", "avg / total 0.88 0.88 0.88 5733\n", "\n" ] } ], "source": [ "# Setting up grid search to return the best results for the random forest model\n", "rfc = ensemble.RandomForestClassifier(n_jobs=-1)\n", "param_grid = {'n_estimators' : [10, 20, 40, 60],\n", " 'class_weight': ['balanced', 'balanced_subsample'],\n", " 'oob_score': [True, False]}\n", "\n", "# Run grid search to find ideal parameters\n", "rfc_grid = GridSearchCV(rfc, param_grid, cv=5, n_jobs=-1)\n", "\n", "# Fit the model to the 
data\n", "rfc_grid.fit(X_train_tfidf, y_train_tfidf)\n", "\n", "# Run the model with the best parameters\n", "rfc_2 = ensemble.RandomForestClassifier(**rfc_grid.best_params_)\n", "run_model(rfc_2)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GradientBoostingClassifier(criterion='friedman_mse', init=None,\n", " learning_rate=0.1, loss='deviance', max_depth=3,\n", " max_features=None, max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=100,\n", " presort='auto', random_state=None, subsample=1.0, verbose=0,\n", " warm_start=False) \n", "\n", "Train score: 0.75435(+/- 0.01)\n", "\n", "Test score: 0.72929(+/- 0.03)\n", "\n", "Runtime: 27.98217151499989 seconds\n", "\n", "Confusion matrix:\n", "\n", " [[3122 82 72]\n", " [ 735 526 95]\n", " [ 338 99 664]]\n", "\n", "Classification Report:\n", "\n", " precision recall f1-score support\n", "\n", " -1 0.74 0.95 0.84 3276\n", " 0 0.74 0.39 0.51 1356\n", " 1 0.80 0.60 0.69 1101\n", "\n", "avg / total 0.75 0.75 0.73 5733\n", "\n" ] } ], "source": [ "# Gradient Boosting Model\n", "gbc = ensemble.GradientBoostingClassifier()\n", "run_model(gbc)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,\n", " beta_2=0.999, early_stopping=False, epsilon=1e-08,\n", " hidden_layer_sizes=(100, 10), learning_rate='constant',\n", " learning_rate_init=0.001, max_iter=200, momentum=0.9,\n", " nesterovs_momentum=True, power_t=0.5, random_state=None,\n", " shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,\n", " verbose=False, warm_start=False) \n", "\n", "Train score: 0.88162(+/- 0.01)\n", "\n", "Test score: 0.77378(+/- 0.02)\n", "\n", "Runtime: 
59.17632372000003 seconds\n", "\n", "Confusion matrix:\n", "\n", " [[3057 151 68]\n", " [ 183 1103 70]\n", " [ 87 63 951]]\n", "\n", "Classification Report:\n", "\n", " precision recall f1-score support\n", "\n", " -1 0.92 0.93 0.93 3276\n", " 0 0.84 0.81 0.83 1356\n", " 1 0.87 0.86 0.87 1101\n", "\n", "avg / total 0.89 0.89 0.89 5733\n", "\n" ] } ], "source": [ "# Neural Network Model\n", "mlp = MLPClassifier(hidden_layer_sizes=(100,10))\n", "run_model(mlp)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scores: \n", " Model Train_score Test_score Runtime\n", "0 Logistic regression 0.819176 0.772555 9.902313\n", "1 Random forest 0.882784 0.796281 19.516329\n", "2 Gradient Boosting 0.754346 0.729295 27.982172\n", "3 Neural Networks 0.881621 0.773777 59.176324 \n", "\n" ] } ], "source": [ "# Create dataframes for the models and the scores\n", "results = pd.DataFrame({'Model': models,\n", " 'Train_score': train_score,\n", " 'Test_score': test_score,\n", " 'Runtime': runtime})\n", "\n", "# Print out the results\n", "print('Scores: \\n', results, '\\n')" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "data": [ { "name": "Test Score", "type": "bar", "uid": "cc59c46a-c562-450d-9401-a6c186cb0988", "x": [ "Logistic regression", "Random forest", "Gradient Boosting", "Neural Networks" ], "y": [ 0.7725552381082571, 0.7962808542825767, 0.7292947657649036, 0.7737773342628917 ] }, { "name": "Runtime", "type": "bar", "uid": "f946d8c0-83ec-424c-8089-b176e5e05991", "x": [ "Logistic regression", "Random forest", "Gradient Boosting", "Neural Networks" ], "y": [ 9.902313456999991, 19.516328983999983, 27.98217151499989, 59.17632372000003 ] } ], "layout": { "barmode": "group", "title": "Results" } }, "text/html": [ "
" ], "text/vnd.plotly.v1+html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# (Heavily consider separating the two later)\n", "\n", "trace1 = go.Bar(\n", " x = results.Model,\n", " y = results.Test_score,\n", " name = 'Test Score'\n", ")\n", "\n", "trace2 = go.Bar(\n", " x = results.Model,\n", " y = results.Runtime,\n", " name = 'Runtime'\n", ")\n", "\n", "data = [trace1, trace2]\n", "layout = go.Layout(\n", " title = 'Results',\n", " barmode = 'group'\n", ")\n", "\n", "fig = go.Figure(data=data, layout=layout)\n", "py.offline.iplot(fig, filename='stacked-bar')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Comparison\n", "\n", "So, after running all 4 types of models it seems as though logistic regression took the least time, but random forests fared almost as well due to parallelization and forgiving parameters, and is also the most accurate of the 4. While it was prone to overfitting, it still behaves well enough for the training set\n", "\n", "## Clustering\n", "\n", "Now, it's time to determine if we can identify any trends with our data through unsupervised machine learning. The way that we'll be doing that is through clustering, to determine if there are groups of tweet with a similar tf-idf score" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>0</th><th>1</th><th>2</th><th>3</th><th>0</th><th>...</th><th>1190</th><th>1191</th><th>1192</th><th>1193</th><th>1194</th><th>1195</th><th>1196</th><th>1197</th><th>1198</th><th>1199</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>0.000000</td><td>0.000000</td><td>0.235702</td><td>0.235702</td><td>0.942809</td><td>0.141421</td><td>0.989949</td><td>0.000000</td><td>0.000000</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
" <tr><th>1</th><td>0.000000</td><td>0.011831</td><td>0.017747</td><td>0.005916</td><td>0.999755</td><td>0.166899</td><td>0.984702</td><td>0.050070</td><td>0.000000</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
" <tr><th>2</th><td>0.001479</td><td>0.005917</td><td>0.010354</td><td>0.004438</td><td>0.999918</td><td>0.166470</td><td>0.984345</td><td>0.057903</td><td>0.000000</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
" <tr><th>3</th><td>0.005486</td><td>0.009601</td><td>0.006858</td><td>0.001372</td><td>0.999914</td><td>0.170984</td><td>0.983159</td><td>0.064119</td><td>0.007124</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
" <tr><th>4</th><td>0.003086</td><td>0.009258</td><td>0.006172</td><td>0.009258</td><td>0.999890</td><td>0.190428</td><td>0.979343</td><td>0.068010</td><td>0.000000</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
" </tbody>\n", "</table>\n", "<p>5 rows × 1209 columns</p>\n", "</div>
" ], "text/plain": [ " 0 1 2 3 4 0 1 \\\n", "0 0.000000 0.000000 0.235702 0.235702 0.942809 0.141421 0.989949 \n", "1 0.000000 0.011831 0.017747 0.005916 0.999755 0.166899 0.984702 \n", "2 0.001479 0.005917 0.010354 0.004438 0.999918 0.166470 0.984345 \n", "3 0.005486 0.009601 0.006858 0.001372 0.999914 0.170984 0.983159 \n", "4 0.003086 0.009258 0.006172 0.009258 0.999890 0.190428 0.979343 \n", "\n", " 2 3 0 ... 1190 1191 1192 1193 1194 1195 1196 \\\n", "0 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.050070 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "2 0.057903 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "3 0.064119 0.007124 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "4 0.068010 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", " 1197 1198 1199 \n", "0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", "\n", "[5 rows x 1209 columns]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.head()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Size of graph\n", "X_train.dropna(inplace=True)\n", "plt.rcParams['figure.figsize'] = [8,5]\n", "\n", "# k means determine k\n", "distortions = []\n", "K = range(1,10)\n", "for k in K:\n", " kmeanModel = KMeans(n_clusters=k).fit(X_train)\n", " kmeanModel.fit(X_train)\n", " distortions.append(sum(np.min(cdist(X_train, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X_train_tfidf.shape[0])\n", "\n", "# Plot the elbow\n", "plt.plot(K, distortions, 'bx-')\n", "plt.xlabel('k')\n", "plt.ylabel('Distortion')\n", "plt.title('The Elbow Method showing the optimal k')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th>col_0</th><th>0</th><th>1</th><th>2</th></tr>\n",
" <tr><th>airline_sentiment</th><th></th><th></th><th></th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>-1</th><td>7690</td><td>1030</td><td>1220</td></tr>\n",
" <tr><th>0</th><td>2132</td><td>971</td><td>924</td></tr>\n",
" <tr><th>1</th><td>1635</td><td>896</td><td>701</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ "col_0 0 1 2\n", "airline_sentiment \n", "-1 7690 1030 1220\n", " 0 2132 971 924\n", " 1 1635 896 701" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calulate predicted values\n", "kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42, n_init=20)\n", "y_pred = kmeans.fit_predict(X_train)\n", "\n", "pd.crosstab(y_train_tfidf, y_pred)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(6,6))\n", "\n", "# We are limiting our feature space to 2 components here. \n", "# This makes it easier to graph and see the clusters.\n", "svd= TruncatedSVD(2)\n", "\n", "# Normalize the data.\n", "X_norm = normalize(X_train)\n", "\n", "# Reduce it to two components.\n", "X_svd = svd.fit_transform(X_norm)\n", "\n", "# Calculate predicted values.\n", "y_pred = KMeans(n_clusters=3, random_state=42).fit_predict(X_svd)\n", "\n", "# Plot the solution.\n", "plt.scatter(X_svd[:, 0], X_svd[:, 1], c=y_pred, cmap='viridis')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Comparing k-means clusters against the data:\n", "airline_sentiment -1 0 1\n", "row_0 \n", "0 8680 2918 2414\n", "1 122 284 231\n", "2 1138 825 587\n" ] } ], "source": [ "# Check the solution against the data.\n", "print('Comparing k-means clusters against the data:')\n", "print(pd.crosstab(y_pred, y_train_tfidf))" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Adjusted Rand Score: 0.07606666\n", "Silhouette Score: 0.003654095\n" ] } ], "source": [ "# View clustering score\n", "print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y_train_tfidf, y_pred)))\n", "print('Silhouette Score: {:0.7}'.format(silhouette_score(X_train, y_pred, sample_size=60000, metric='euclidean')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Testing Set\n", "\n", "Finally, we get to open the testing set. What if we decide to look at more recent tweets from the same airline, or how about we also decide to look at other airlines that weren't even in our dataset? \n", "\n", "Enough with the hypotheticals. 
The next dataset is a collection of recent tweets. We won't be able to measure accuracy on it, since we don't have the true labels, but we should get a good idea of how the model rates sentiment for more recent tweets targeting the airlines in our dataset, as well as for airlines that aren't in our training data. \n", "\n", "This will demonstrate how useful our models are for predicting the sentiment of tweets that they have not seen yet." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>airline</th><th>text</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>@AlaskaAir</td><td>RT @AlaskaAir: Happy birthday to our Chief Football Officer, @DangeRussWilson! We hope your thirties are as fly as you! 🎂 https://t.co/XfTm…</td></tr>\n",
" <tr><th>1</th><td>@AlaskaAir</td><td>@AlaskaAir My monthly+ flughts in and out of Montana will have to be with another airline if these flights after 9pm aren't possible.</td></tr>\n",
" <tr><th>2</th><td>@AlaskaAir</td><td>RT @ChrisEgan5: UW #Husky fans fired up for @pac12 championship game! Leaving #Seattle on @AlaskaAir on plane filled with @UW_Football @uw…</td></tr>\n",
" <tr><th>3</th><td>@AlaskaAir</td><td>@ChrisEgan5 @mcclainfan59 @pac12 @AlaskaAir @UW_Football @UW That's a Virgin Plane. 🤣</td></tr>\n",
" <tr><th>4</th><td>@AlaskaAir</td><td>@AlaskaAir Stupid, stupid decision!!</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " airline \\\n", "0 @AlaskaAir \n", "1 @AlaskaAir \n", "2 @AlaskaAir \n", "3 @AlaskaAir \n", "4 @AlaskaAir \n", "\n", " text \n", "0 RT @AlaskaAir: Happy birthday to our Chief Football Officer, @DangeRussWilson! We hope your thirties are as fly as you! 🎂 https://t.co/XfTm… \n", "1 @AlaskaAir My monthly+ flughts in and out of Montana will have to be with another airline if these flights after 9pm aren't possible. \n", "2 RT @ChrisEgan5: UW #Husky fans fired up for @pac12 championship game! Leaving #Seattle on @AlaskaAir on plane filled with @UW_Football @uw… \n", "3 @ChrisEgan5 @mcclainfan59 @pac12 @AlaskaAir @UW_Football @UW That's a Virgin Plane. 🤣 \n", "4 @AlaskaAir Stupid, stupid decision!! " ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import the training dataset\n", "tweets_test = pd.read_csv(\"airline_tweets/test_set.csv\", usecols=[1,2])\n", "\n", "# Preview the data\n", "tweets_test.head()" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(18000, 2)" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View size of the testing dataframe\n", "tweets_test.shape" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['@AlaskaAir',\n", " '@Allegiant',\n", " '@AmericanAir',\n", " '@Delta',\n", " '@FlyFrontier',\n", " '@HawaiianAir',\n", " '@JetBlue',\n", " '@SouthwestAir',\n", " '@SpiritAirlines',\n", " '@united']" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print unique airlines in the test dataset\n", "sorted(tweets_test['airline'].unique())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the original dataset was obtained in February 2015, two of the airlines in the dataset no longer operate under the same name, 'US Airways' and 'Virgin' which have since merged 
with 'American' and 'Alaska' respectively. Because of this, we can only compare the remaining 4 airlines in the original dataset with 6 other major airlines currently in service in the United States." ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "# Show more of each tweet when previewing, then remove non-essential punctuation from the tweets\n", "pd.options.display.max_colwidth = 200\n", "tweets_test['text'] = tweets_test['text'].map(lambda x: text_cleaner(str(x)))\n", "\n", "# Reduce all text to their lemmas\n", "for tweet in tweets_test['text']:\n", "    tweet = lemmatizer.lemmatize(tweet)\n", " \n", "# Specify the new test input\n", "X = tweets_test['text']\n", "\n", "# Apply the already-fitted vectorizer to the test data\n", "# (transform only: refitting would change the vocabulary the model was trained on)\n", "X_tfidf_2 = vectorizer.transform(X)\n", "\n", "# Predict the results using our best model\n", "tweets_test['airline_sentiment'] = rfc_2.predict(X_tfidf_2)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "# Map numerical sentiment values to text labels\n", "sentiment = {-1:'negative', 0:'neutral', 1:'positive'}\n", "tweets_test['airline_sentiment'] = tweets_test['airline_sentiment'].map(lambda x: sentiment[x])" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "data": [ { "labels": [ "negative", "neutral", "positive" ], "type": "pie", "uid": "185fd559-5bf9-4a10-91c0-7882cdb4ce8d", "values": [ 15681, 1369, 950 ] } ], "layout": { "autosize": false, "height": 400, "title": "Tweet Sentiment (Nov 2018)", "width": 500, "yaxis": { "title": "Number of tweets" } } }, "text/html": [ "
" ], "text/vnd.plotly.v1+html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# View distribution of tweets by sentiment \n", "# (Changing colors to red/gray/green would be nice)\n", "trace = go.Pie(labels=tweets_test['airline_sentiment'].value_counts().index, \n", " values=tweets_test['airline_sentiment'].value_counts())\n", "\n", "# Create the layout\n", "layout = go.Layout(\n", " title = 'Tweet Sentiment (Nov 2018)',\n", " height = 400,\n", " width = 500,\n", " autosize = False,\n", " yaxis = dict(title='Number of tweets')\n", ")\n", "\n", "fig = go.Figure(data = [trace], layout = layout)\n", "py.offline.iplot(fig, filename='cufflinks/simple')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we apply the random forest model to our testing dataset, the results do look a bit more ... negative. Now, the differences in our dataset could be due to a number of things, so let's take a look at the data and see what our model spit out." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th>airline</th><th>@AlaskaAir</th><th>@Allegiant</th><th>@AmericanAir</th><th>@Delta</th><th>@FlyFrontier</th><th>@HawaiianAir</th><th>@JetBlue</th><th>@SouthwestAir</th><th>@SpiritAirlines</th><th>@united</th></tr>\n",
" <tr><th>airline_sentiment</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>negative</th><td>1728</td><td>1692</td><td>1693</td><td>1239</td><td>1566</td><td>1710</td><td>1566</td><td>1396</td><td>1489</td><td>1602</td></tr>\n",
" <tr><th>neutral</th><td>54</td><td>72</td><td>107</td><td>184</td><td>90</td><td>90</td><td>180</td><td>394</td><td>108</td><td>90</td></tr>\n",
" <tr><th>positive</th><td>18</td><td>36</td><td>0</td><td>377</td><td>144</td><td>0</td><td>54</td><td>10</td><td>203</td><td>108</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ "airline @AlaskaAir @Allegiant @AmericanAir @Delta @FlyFrontier \\\n", "airline_sentiment \n", "negative 1728 1692 1693 1239 1566 \n", "neutral 54 72 107 184 90 \n", "positive 18 36 0 377 144 \n", "\n", "airline @HawaiianAir @JetBlue @SouthwestAir @SpiritAirlines \\\n", "airline_sentiment \n", "negative 1710 1566 1396 1489 \n", "neutral 90 180 394 108 \n", "positive 0 54 10 203 \n", "\n", "airline @united \n", "airline_sentiment \n", "negative 1602 \n", "neutral 90 \n", "positive 108 " ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Pivot table of airline sentiment\n", "tweets_test.pivot_table(index=['airline_sentiment'],columns='airline', aggfunc='size', fill_value=0)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Show distribution of texts\n", "\n", "# Share of each sentiment per airline (rows: airline, columns: sentiment).\n", "# unstack(fill_value=0) keeps airlines that have no tweets in a category; counting each\n", "# category separately dropped them and raised the (8,) vs (10,) broadcast error.\n", "sentiment_share = tweets_test.groupby('airline')['airline_sentiment'].value_counts(normalize=True).unstack(fill_value=0)\n", "\n", "trace1 = go.Bar(\n", "    x = sentiment_share.index,\n", "    y = sentiment_share['negative'],\n", "    name = 'Negative',\n", "    marker = dict(color='rgba(200,0,0,.7)')\n", ")\n", "\n", "trace2 = go.Bar(\n", "    x = sentiment_share.index,\n", "    y = sentiment_share['neutral'],\n", "    name = 'Neutral',\n", "    marker = dict(color='rgba(150,150,150,.7)')\n", ")\n", "\n", "trace3 = go.Bar(\n", "    x = sentiment_share.index,\n", "    y = sentiment_share['positive'],\n", "    name = 'Positive',\n", "    marker = dict(color='rgba(0,200,0,.7)')\n", ")\n", "\n", "data = [trace1, trace2, trace3]\n", "layout = go.Layout(\n", "    title = 'Airline Sentiment (Percentage)',\n", "    barmode='group',\n", "    yaxis = dict(title='% of tweets')\n", ")\n", "\n", "fig = go.Figure(data=data, layout=layout)\n", "py.offline.iplot(fig, filename='stacked-bar')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "For the above visual, please feel free to toggle the sentiment categories on the right. For example, toggle negative and neutral off to view only positive tweets.\n", "
\n", "\n", "The model predicted so few positive tweets that these bars look a bit off. Wasn't 'American' one of the most poorly received airlines in the training set? It was, but that doesn't appear to be the case here. There are two likely explanations:\n", "\n", "* The training set was collected in February 2015 and the test set in November 2018, so the topics and language of the tweets may have shifted.\n", "* The original model didn't capture enough characteristics / words, and it requires a larger data set." ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 | airline | text | airline_sentiment
0 | @AlaskaAir | RT @AlaskaAir: Happy birthday to our Chief Football Officer, @DangeRussWilson! We hope your thirties are as fly as you! 🎂 https://t.co/XfTm… | negative
1 | @AlaskaAir | @AlaskaAir My monthly+ flughts in and out of Montana will have to be with another airline if these flights after 9pm aren't possible. | negative
2 | @AlaskaAir | RT @ChrisEgan5: UW #Husky fans fired up for @pac12 championship game! Leaving #Seattle on @AlaskaAir on plane filled with @UW_Football @uw… | negative
3 | @AlaskaAir | @ChrisEgan5 @mcclainfan59 @pac12 @AlaskaAir @UW_Football @UW That's a Virgin Plane. 🤣 | negative
4 | @AlaskaAir | @AlaskaAir Stupid, stupid decision!! | negative
\n", "
" ], "text/plain": [ " airline \\\n", "0 @AlaskaAir \n", "1 @AlaskaAir \n", "2 @AlaskaAir \n", "3 @AlaskaAir \n", "4 @AlaskaAir \n", "\n", " text \\\n", "0 RT @AlaskaAir: Happy birthday to our Chief Football Officer, @DangeRussWilson! We hope your thirties are as fly as you! 🎂 https://t.co/XfTm… \n", "1 @AlaskaAir My monthly+ flughts in and out of Montana will have to be with another airline if these flights after 9pm aren't possible. \n", "2 RT @ChrisEgan5: UW #Husky fans fired up for @pac12 championship game! Leaving #Seattle on @AlaskaAir on plane filled with @UW_Football @uw… \n", "3 @ChrisEgan5 @mcclainfan59 @pac12 @AlaskaAir @UW_Football @UW That's a Virgin Plane. 🤣 \n", "4 @AlaskaAir Stupid, stupid decision!! \n", "\n", " airline_sentiment \n", "0 negative \n", "1 negative \n", "2 negative \n", "3 negative \n", "4 negative " ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Preview the top tweets for @AlaskaAir to see the correlation between tweets and predicted results\n", "tweets_test.loc[tweets_test['airline']=='@AlaskaAir'].head()" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 | airline_sentiment | negativereason | airline | text
7644 | -1 | Customer Service Issue | American | @AmericanAir I've been on hold for 55 mins about my Cancelled Flighted international flight. Am out of country, so can't leave a call back #. Help?
7645 | -1 | Customer Service Issue | American | “@AmericanAir Thanks for info on super large passengers- the extra seat Mr. Big needed was the one i was sitting in already #customerservice
7646 | -1 | Customer Service Issue | American | @AmericanAir Been trying to call all day, get hung up on every time. Your guys are horrible!
7647 | -1 | Late Flight | American | @AmericanAir Hey so I'm very disappointed with my time traveling with you! I've had the worst experience and both times the flight delayed!
7648 | 0 | NaN | American | @AmericanAir Any way that we could look at other options for today?
\n", "
" ], "text/plain": [ " airline_sentiment negativereason airline \\\n", "7644 -1 Customer Service Issue American \n", "7645 -1 Customer Service Issue American \n", "7646 -1 Customer Service Issue American \n", "7647 -1 Late Flight American \n", "7648 0 NaN American \n", "\n", " text \n", "7644 @AmericanAir I've been on hold for 55 mins about my Cancelled Flighted international flight. Am out of country, so can't leave a call back #. Help? \n", "7645 “@AmericanAir Thanks for info on super large passengers- the extra seat Mr. Big needed was the one i was sitting in already #customerservice \n", "7646 @AmericanAir Been trying to call all day, get hung up on every time. Your guys are horrible! \n", "7647 @AmericanAir Hey so I'm very disappointed with my time traveling with you! I've had the worst experience and both times the flight delayed! \n", "7648 @AmericanAir Any way that we could look at other options for today? " ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Preview the top tweets for 'American' in our training set (human-labeled Kaggle data)\n", "tweets.loc[tweets['airline']=='American'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "For this project, I aimed to build the best model that I could for performing sentiment analysis based on a fairly small training set, and to apply that model to a more recent, equally sized testing set. Based on the training and validation steps, the model did fairly well within the original dataset, but when the same model was applied to external data, it did not hold up as well.\n", "\n", "## Next Steps\n", "\n", "Although the project may not have achieved an ideal outcome, it provides a sufficient framework for performing the task. It would likely benefit from more data in the training phase, a larger vocabulary, or testing with an expansive word2vec model."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }