{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Sentiment analysis\n", "\n", "### What is sentiment analysis?\n", "\n", "In this notebook, we're going to perform [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) on a dataset of tweets about US airlines. Sentiment analysis is the task of extracting [affective states][1] from text. Sentiment analysis is most ofen used to answer questions like:\n", "\n", "[1]: https://en.wikipedia.org/wiki/Affect_(psychology)\n", "\n", "- _what do our customers think of us?_\n", "- _do our users like the look of our product?_\n", "- _what aspects of our service are users dissatisfied with?_\n", "\n", "### Sentiment analysis as text classification\n", "\n", "We're going to treat sentiment analysis as a text classification problem. Text classification is just like other instances of classification in data science. We use the term \"text classification\" when the features come from natural language data. (You'll also hear it called \"document classification\" at times.) What makes text classification interestingly different from other instances of classification is the way we extract numerical features from text. \n", "\n", "### Dataset\n", "\n", "The dataset was collected by [Crowdflower](https://www.crowdflower.com/), which they then made public through [Kaggle](https://www.kaggle.com/crowdflower/twitter-airline-sentiment). I've downloaded it for you and put it in the \"data\" directory. Note that this is a nice clean dataset; not the norm in real-life data science! I've chosen this dataset so that we can concentrate on understanding what text classification is and how to do it.\n", "\n", "### Topics\n", "\n", "Here's what we'll cover in our hour. Like any data science task, we'll first do some EDA to understand what data we've got. Then, like always, we'll have to preprocess our data. Because this is text data, it'll be a little different from preprocessing other types of data. Next we'll perform our sentiment classification. Finally, we'll interpret the results of the classifier in terms of our big question.\n", "\n", "We'll cover:\n", "\n", "[EDA](#eda)
\n", "\n", "[Preprocess](#preprocess)
\n", "\n", "[Classification](#classification)
\n", "\n", "[Interpret](#interpret)
\n", "\n", "### Time\n", "- Teaching: 30 minutes\n", "- Exercises: 30 minutes\n", "\n", "### Final note\n", "\n", "Don't worry if you don't follow every single step in our hour. If things are moving too quickly, concentrate on the big picture. Afterwards, you can go through the notebook line by line to understand all the details." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import os\n", "import warnings\n", "warnings.simplefilter(action='ignore', category=FutureWarning)\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n", "from sklearn.linear_model import LogisticRegressionCV\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import confusion_matrix\n", "from sklearn.ensemble import RandomForestClassifier\n", "sns.set()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## EDA " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "DATA_DIR = 'data'\n", "fname = os.path.join(DATA_DIR, 'tweets.csv')\n", "df = pd.read_csv(fname)\n", "df.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which airlines are tweeted about and how many of each in this dataset?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.countplot(df['airline'], order=df['airline'].value_counts().index);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Challenge\n", "\n", "- How many tweets are in the dataset?\n", "- How many tweets are positive, neutral and negative?\n", "- What **proportion** of tweets are positive, neutral and negative?\n", "- Visualize these last two questions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extra challenge\n", "\n", "- When did the tweets come from?\n", "- Who gets more retweets: positive, negative or neutral tweets?\n", "- What are the three main reasons why people are tweeting negatively? What could airline companies do to improve this?\n", "- What's the distribution of time zones in which people are tweeting?\n", "- Is this distribution consistent depending on what airlines they're tweeting about?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocess \n", "\n", "### Regular expressions\n", "\n", "Regular expressions are like advanced find-and-replace. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocess <a id='preprocess'></a>\n", "\n", "### Regular expressions\n", "\n", "Regular expressions are like advanced find-and-replace. They allow us to specify complicated patterns in text data and find all the matches. They're indispensable in text processing. You can learn more about them [here](https://github.com/geoffbacon/regular-expressions-in-python).\n", "\n", "We can use regular expressions to find hashtags and user mentions in a tweet. We first write the pattern we're looking for as a (raw) string, using the special regular-expression syntax. The `twitter_handle_pattern` says \"find me an @ sign immediately followed by one or more upper or lower case letters, digits or underscores\". The `hashtag_pattern` is a little more complicated; it says \"find me exactly one # (or its full-width variant ＃), immediately followed by one or more upper or lower case letters, digits or underscores, but only if it's at the beginning of a line or immediately after a whitespace character\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "twitter_handle_pattern = r'@(\\w+)'\n", "hashtag_pattern = r'(?:^|\\s)[#＃](\\w+)'\n", "url_pattern = r'https?:\\/\\/.*\\.com'  # a simple pattern; good enough for the URLs in this dataset\n", "example_tweet = \"lol @justinbeiber and @BillGates are like soo #yesterday #amiright saw it on https://twitter.com #yolo\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "re.findall(twitter_handle_pattern, example_tweet)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "re.findall(hashtag_pattern, example_tweet)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "re.findall(url_pattern, example_tweet)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pandas` has great built-in support for operating with regular expressions on columns. Note that `extract` returns only the first match in each string. We can `extract` the first user mention from each tweet in a column of text like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['text'].str.extract(twitter_handle_pattern).head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the first hashtag like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['text'].str.extract(hashtag_pattern).head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Challenge\n", "\n", "Often in preprocessing text data, we don't care about the exact hashtag/user/URL that someone used (although sometimes we do!). Your job is to replace all the hashtags with `'HASHTAG'`, the user mentions with `'USER'` and the URLs with `'URL'`. To do this, you'll use the `str.replace` method of the `text` column. The result will be a Series, which you should add to `df` as a column called `clean_text`. **See the docs [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html) for more information on the method.** If you get stuck, one possible solution appears after the empty cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }
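, { "cell_type": "markdown", "metadata": {}, "source": [ "The cells below assume that `df` has a `clean_text` column, so here is one possible solution. It's only a sketch (it mirrors the `clean_tweets` function we'll meet later); the exact order of the replacements is up to you." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# One possible solution (a sketch); later cells rely on df['clean_text']\n", "# Note: hashtag_pattern also consumes the preceding whitespace, just like\n", "# the clean_tweets function defined further below.\n", "clean = df['text'].str.replace(hashtag_pattern, 'HASHTAG')\n", "clean = clean.str.replace(twitter_handle_pattern, 'USER')\n", "df['clean_text'] = clean.str.replace(url_pattern, 'URL')\n", "df['clean_text'].head()" ] }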
, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bag of words\n", "\n", "Now that we've cleaned the text, we need to turn it into numbers for our classifier. We're going to use a \"bag of words\" as our features. A bag of words is essentially a frequency count of all the words that appear in a tweet. It's called a bag because we ignore the order of the words; we just care about which words are in the tweet. To build it, we can use `scikit-learn`'s `CountVectorizer`. `CountVectorizer` replaces each tweet with a vector (think of a list) of counts. Each position in the vector represents a unique word in the corpus, and the value of an entry represents the number of times that word appeared in that tweet. Below, we restrict the vocabulary to the 5,000 most frequent words and make the counts binary: 0 (the word is not in the tweet) or 1 (it is)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countvectorizer = CountVectorizer(max_features=5000, binary=True)\n", "X = countvectorizer.fit_transform(df['clean_text'])\n", "features = X.toarray()\n", "features" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = df['airline_sentiment'].values\n", "response" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split into train/test datasets\n", "\n", "We don't want to train our classifier on the same dataset that we test it on, so let's split it into training and test sets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(features, response, test_size=0.2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classification <a id='classification'></a>\n", "\n", "OK, so now that we've turned our data into numbers, we're ready to feed it into a classifier. We're not going to concentrate too much on the code below, but here's the big picture. In the `fit_logistic_regression` function defined below, we use logistic regression as a classifier to take in the numerical representation of the tweets and spit out whether each one is positive, neutral or negative. Then we use `test_model` to test the model's performance against our test data and print out some results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def fit_logistic_regression(X_train, y_train):\n", "    model = LogisticRegressionCV(Cs=5, penalty='l1', cv=3, solver='liblinear', refit=True)\n", "    model.fit(X_train, y_train)\n", "    return model\n", "\n", "def conmat(model, X_test, y_test):\n", "    \"\"\"Wrapper for sklearn's confusion matrix.\"\"\"\n", "    labels = model.classes_\n", "    y_pred = model.predict(X_test)\n", "    c = confusion_matrix(y_test, y_pred)\n", "    sns.heatmap(c, annot=True, fmt='d',\n", "                xticklabels=labels,\n", "                yticklabels=labels,\n", "                cmap=\"YlGnBu\", cbar=False)\n", "    plt.ylabel('Ground truth')\n", "    plt.xlabel('Prediction')\n", "\n", "def test_model(model, X_test, y_test):\n", "    conmat(model, X_test, y_test)\n", "    print('Accuracy: ', model.score(X_test, y_test))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lr = fit_logistic_regression(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_model(lr, X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Challenge\n", "\n", "Use the `fit_random_forest` function below to train a random forest classifier on the training set, then test it on the test set. Which model performs better? One possible solution appears after the empty cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def fit_random_forest(X_train, y_train):\n", "    model = RandomForestClassifier()\n", "    model.fit(X_train, y_train)\n", "    return model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [] }
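, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to check your answer, here's one possible solution. It's only a sketch, and random forests are stochastic, so your accuracy may differ from run to run." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# One possible solution (a sketch): train the random forest, then evaluate it\n", "rf = fit_random_forest(X_train, y_train)\n", "test_model(rf, X_test, y_test)" ] }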
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def fit_random_forest(X_train, y_train):\n", " model = RandomForestClassifier()\n", " model.fit(X_train, y_train)\n", " return model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Challenge\n", "\n", "Use the `test_tweet` function below to test your classifier's performance on a list of tweets. Write your tweets " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def clean_tweets(tweets):\n", " tweets = [re.sub(hashtag_pattern, 'HASHTAG', t) for t in tweets]\n", " tweets = [re.sub(twitter_handle_pattern, 'USER', t) for t in tweets]\n", " return [re.sub(url_pattern, 'URL', t) for t in tweets]\n", "\n", "def test_tweets(tweets, model):\n", " tweets = clean_tweets(tweets)\n", " features = countvectorizer.transform(tweets)\n", " predictions = model.predict(features)\n", " return list(zip(tweets, predictions))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_tweets = [example_tweet,\n", " 'omg I am never flying on Delta again',\n", " 'I love @VirginAmerica so much #friendlystaff']\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Interpret \n", "\n", "Now we can interpret the classifier by the features that it found important." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vocab = [(v,k) for k,v in countvectorizer.vocabulary_.items()]\n", "vocab = sorted(vocab, key=lambda x: x[0])\n", "vocab = [word for num,word in vocab]\n", "coef = list(zip(vocab, lr.coef_[0]))\n", "important = pd.DataFrame(lr.coef_).T\n", "important.columns = lr.classes_\n", "important['word'] = vocab\n", "important.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "important.sort_values(by='negative', ascending=False).head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "important.sort_values(by='positive', ascending=False).head(10)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 1 }