{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 08\n", "\n", "## Analyze how travelers expressed their feelings on Twitter\n", "\n", "A sentiment analysis job about the problems of each major U.S. airline. \n", "Twitter data was scraped from February of 2015 and contributors were \n", "asked to first classify positive, negative, and neutral tweets, followed\n", "by categorizing negative reasons (such as \"late flight\" or \"rude service\")." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>airline_sentiment</th><th>airline_sentiment_confidence</th><th>negativereason</th><th>negativereason_confidence</th><th>airline</th><th>airline_sentiment_gold</th><th>name</th><th>negativereason_gold</th><th>retweet_count</th><th>text</th><th>tweet_coord</th><th>tweet_created</th><th>tweet_location</th><th>user_timezone</th></tr>\n",
"<tr><th>tweet_id</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>570306133677760513</th><td>neutral</td><td>1.0000</td><td>NaN</td><td>NaN</td><td>Virgin America</td><td>NaN</td><td>cairdin</td><td>NaN</td><td>0</td><td>@VirginAmerica What @dhepburn said.</td><td>NaN</td><td>2015-02-24 11:35:52 -0800</td><td>NaN</td><td>Eastern Time (US &amp; Canada)</td></tr>\n",
"<tr><th>570301130888122368</th><td>positive</td><td>0.3486</td><td>NaN</td><td>0.0000</td><td>Virgin America</td><td>NaN</td><td>jnardino</td><td>NaN</td><td>0</td><td>@VirginAmerica plus you've added commercials t...</td><td>NaN</td><td>2015-02-24 11:15:59 -0800</td><td>NaN</td><td>Pacific Time (US &amp; Canada)</td></tr>\n",
"<tr><th>570301083672813571</th><td>neutral</td><td>0.6837</td><td>NaN</td><td>NaN</td><td>Virgin America</td><td>NaN</td><td>yvonnalynn</td><td>NaN</td><td>0</td><td>@VirginAmerica I didn't today... Must mean I n...</td><td>NaN</td><td>2015-02-24 11:15:48 -0800</td><td>Lets Play</td><td>Central Time (US &amp; Canada)</td></tr>\n",
"<tr><th>570301031407624196</th><td>negative</td><td>1.0000</td><td>Bad Flight</td><td>0.7033</td><td>Virgin America</td><td>NaN</td><td>jnardino</td><td>NaN</td><td>0</td><td>@VirginAmerica it's really aggressive to blast...</td><td>NaN</td><td>2015-02-24 11:15:36 -0800</td><td>NaN</td><td>Pacific Time (US &amp; Canada)</td></tr>\n",
"<tr><th>570300817074462722</th><td>negative</td><td>1.0000</td><td>Can't Tell</td><td>1.0000</td><td>Virgin America</td><td>NaN</td><td>jnardino</td><td>NaN</td><td>0</td><td>@VirginAmerica and it's a really big bad thing...</td><td>NaN</td><td>2015-02-24 11:14:45 -0800</td><td>NaN</td><td>Pacific Time (US &amp; Canada)</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " airline_sentiment airline_sentiment_confidence \\\n", "tweet_id \n", "570306133677760513 neutral 1.0000 \n", "570301130888122368 positive 0.3486 \n", "570301083672813571 neutral 0.6837 \n", "570301031407624196 negative 1.0000 \n", "570300817074462722 negative 1.0000 \n", "\n", " negativereason negativereason_confidence airline \\\n", "tweet_id \n", "570306133677760513 NaN NaN Virgin America \n", "570301130888122368 NaN 0.0000 Virgin America \n", "570301083672813571 NaN NaN Virgin America \n", "570301031407624196 Bad Flight 0.7033 Virgin America \n", "570300817074462722 Can't Tell 1.0000 Virgin America \n", "\n", " airline_sentiment_gold name negativereason_gold \\\n", "tweet_id \n", "570306133677760513 NaN cairdin NaN \n", "570301130888122368 NaN jnardino NaN \n", "570301083672813571 NaN yvonnalynn NaN \n", "570301031407624196 NaN jnardino NaN \n", "570300817074462722 NaN jnardino NaN \n", "\n", " retweet_count \\\n", "tweet_id \n", "570306133677760513 0 \n", "570301130888122368 0 \n", "570301083672813571 0 \n", "570301031407624196 0 \n", "570300817074462722 0 \n", "\n", " text \\\n", "tweet_id \n", "570306133677760513 @VirginAmerica What @dhepburn said. \n", "570301130888122368 @VirginAmerica plus you've added commercials t... \n", "570301083672813571 @VirginAmerica I didn't today... Must mean I n... \n", "570301031407624196 @VirginAmerica it's really aggressive to blast... \n", "570300817074462722 @VirginAmerica and it's a really big bad thing... 
\n", "\n", " tweet_coord tweet_created tweet_location \\\n", "tweet_id \n", "570306133677760513 NaN 2015-02-24 11:35:52 -0800 NaN \n", "570301130888122368 NaN 2015-02-24 11:15:59 -0800 NaN \n", "570301083672813571 NaN 2015-02-24 11:15:48 -0800 Lets Play \n", "570301031407624196 NaN 2015-02-24 11:15:36 -0800 NaN \n", "570300817074462722 NaN 2015-02-24 11:14:45 -0800 NaN \n", "\n", " user_timezone \n", "tweet_id \n", "570306133677760513 Eastern Time (US & Canada) \n", "570301130888122368 Pacific Time (US & Canada) \n", "570301083672813571 Central Time (US & Canada) \n", "570301031407624196 Pacific Time (US & Canada) \n", "570300817074462722 Pacific Time (US & Canada) " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "\n", "# read the data from the zip archive and use the first column (tweet_id) as the index\n", "import zipfile\n", "with zipfile.ZipFile('../datasets/Tweets.zip', 'r') as z:\n", "    f = z.open('Tweets.csv')\n", "    tweets = pd.read_csv(f, index_col=0)\n", "\n", "tweets.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(14640, 14)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Proportion of tweets with each sentiment" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "negative 9178\n", "neutral 3099\n", "positive 2363\n", "Name: airline_sentiment, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets['airline_sentiment'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Proportion of tweets per airline\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { 
"text/plain": [ "United 3822\n", "US Airways 2913\n", "American 2759\n", "Southwest 2420\n", "Delta 2222\n", "Virgin America 504\n", "Name: airline, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets['airline'].value_counts()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tweets[\"airline\"].value_counts().plot(kind=\"bar\", figsize=(8, 6), rot=0)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pd.crosstab(index=tweets[\"airline\"], columns=tweets[\"airline_sentiment\"]).plot(kind=\"bar\", figsize=(10, 6), alpha=0.5, rot=0, stacked=True, title=\"Sentiment by airline\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 8.1 \n", "\n", "Predict the sentiment using CountVectorizer, stop words, n-grams, a stemmer, and TfidfVectorizer.\n", "\n", "Use a Random Forest classifier." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split, cross_val_score\n", "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n", "from sklearn.ensemble import RandomForestClassifier\n", "from nltk.stem.snowball import SnowballStemmer\n", "from nltk.stem import WordNetLemmatizer" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "X = tweets['text']\n", "y = tweets['airline_sentiment'].map({'negative':-1,'neutral':0,'positive':1})" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 8.2\n", "\n", "Train a Deep Neural Network with the following architecture:\n", "\n", "- Input = text \n", "- Dense(128)\n", "- ReLU Activation\n", "- BatchNormalization\n", "- Dropout(0.5)\n", "- Dense(10, Softmax)\n", "\n", "Optimize with rmsprop, using categorical_crossentropy as the loss.\n", "\n", "Hints: \n", "- test with two iterations, then try more. \n", "- the learning rate can be adjusted\n", "\n", "Evaluate the performance using the testing set (approx. 55% with 50 epochs)" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "from keras.models import Sequential\n", "from keras.utils import np_utils\n", "from keras.layers import Dense, Dropout, Activation, BatchNormalization\n", "from keras.optimizers import RMSprop\n", "from keras.callbacks import History\n", "from livelossplot import PlotLossesKeras" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }