{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<div style=\"width: 100%; overflow: hidden;\">\n", " <div style=\"width: 150px; float: left;\"> <img src=\"data/D4Sci_logo_ball.png\" alt=\"Data For Science, Inc\" align=\"left\" border=\"0\"> </div>\n", " <div style=\"float: left; margin-left: 10px;\"> \n", " <h1>Natural Language Processing For Everyone</h1>\n", " <h1>Sentiment Analysis</h1>\n", " <p>Bruno Gonçalves<br/>\n", " <a href=\"http://www.data4sci.com/\">www.data4sci.com</a><br/>\n", " @bgoncalves, @data4sci</p></div>\n", "</div>" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import string\n", "import gzip\n", "from collections import Counter\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "import matplotlib.cm as cm\n", "\n", "import watermark\n", "\n", "%matplotlib inline\n", "%load_ext watermark" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "List out the versions of all loaded libraries" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python implementation: CPython\n", "Python version : 3.11.7\n", "IPython version : 8.20.0\n", "\n", "Compiler : Clang 14.0.6 \n", "OS : Darwin\n", "Release : 23.4.0\n", "Machine : arm64\n", "Processor : arm\n", "CPU cores : 16\n", "Architecture: 64bit\n", "\n", "Git hash: ed7010f27131287f28d990d9846e4ce6cd87e34d\n", "\n", "watermark : 2.4.3\n", "numpy : 1.26.4\n", "pandas : 2.1.4\n", "matplotlib: 3.8.0\n", "\n" ] } ], "source": [ "%watermark -n -v -m -g -iv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set the default style" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "plt.style.use('d4sci.mplstyle')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Word counting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start by taking the simplest approach and simply counting positive and negative words. 
We'll use Hu and Liu's Lexicon from their 2004 KDD paper: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "pos = np.loadtxt('data/positive-words.txt', dtype='str', comments=';')\n", "neg = np.loadtxt('data/negative-words.txt', dtype='str', comments=';')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['a+', 'abound', 'abounds', ..., 'zenith', 'zest', 'zippy'],\n", " dtype='<U20')" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pos" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['2-faced', '2-faces', 'abnormal', ..., 'zealous', 'zealously',\n", " 'zombie'], dtype='<U24')" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "neg" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a dictionary and assign the valence to each positive and negative word" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "valence = {}\n", "\n", "for word in pos:\n", "    valence[word.lower()] = +1\n", "\n", "for word in neg:\n", "    valence[word.lower()] = -1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's the simple word extraction function we defined in Lesson I, with an extra check so that tokens made up entirely of punctuation don't crash it" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def extract_words(text):\n", "    temp = text.split()  # Split the text on whitespace\n", "    text_words = []\n", "\n", "    for word in temp:\n", "        # Remove any punctuation characters present at the beginning of the word\n", "        while len(word) > 0 and word[0] in string.punctuation:\n", "            word = word[1:]\n", "\n", "        # Remove any punctuation characters present at the end of the word\n", "        while len(word) > 0 and word[-1] in string.punctuation:\n", "            word = word[:-1]\n", "\n", "        # Skip tokens that contained nothing but punctuation\n", "        if len(word) == 0:\n", "            continue\n", "\n", "        # Append this word to our list of words\n", "        text_words.append(word.lower())\n", "\n", "    return text_words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now use it to define a sentiment-measuring function that returns the valence of a sentence or piece of text. Notice that we take the valence directly from the dictionary instead of treating positive and negative words separately. This will prove useful later on ;)\n", "\n", "$$ sentiment=\\frac{P-N}{P+N}$$\n", "\n", "where $P$ and $N$ are the number of positive and negative words in the text, respectively." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def sentiment(text, valence):\n", "    words = extract_words(text.lower())\n", "\n", "    word_count = 0\n", "    score = 0\n", "\n", "    for word in words:\n", "        if word in valence:\n", "            score += valence[word]\n", "            word_count += 1\n", "\n", "    # Guard against texts that contain no lexicon words\n", "    if word_count == 0:\n", "        return 0\n", "\n", "    return score/word_count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's test our code with a few simple examples" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I'm very happy : 1.0\n", "The product is pretty annoying, and I hate it : -0.3333333333333333\n", "I'm sad : -1.0\n" ] } ], "source": [ "texts = [\"I'm very happy\",\n", "         \"The product is pretty annoying, and I hate it\",\n", "         \"I'm sad\",\n", "        ]\n", "\n", "for text in texts:\n", "    print(text, ':', sentiment(text, valence))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a bit surprising. 
One might expect the second sentence to be negative; after all, \"pretty annoying\" and \"hate\" sound pretty negative. However, since each word is taken by itself, regardless of context, we end up with:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pretty 1\n", "annoying -1\n", "hate -1\n" ] } ], "source": [ "words = extract_words(texts[1].lower())\n", "\n", "for word in words:\n", "    if word in valence:\n", "        print(word, valence[word])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll see in a bit how to handle cases like this, but the solution requires two important changes to our current approach: modifier words and real-valued weights" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Modifiers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first step is to define a dictionary of modifiers" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "modifiers = {\n", "    \"very\": 1.5,\n", "    \"much\": 1.3,\n", "    \"not\": -1,\n", "    \"pretty\": 1.5,\n", "    \"somewhat\": 1.2\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The second step is to change our sentiment-measuring function to take the modifiers into account." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "code_folding": [] }, "outputs": [], "source": [ "def sentiment_modified(text, valence, modifiers, verbose=False):\n", "    words = extract_words(text.lower())\n", "\n", "    ngrams = [[]]\n", "\n", "    # Group consecutive modifier and valence words into ngrams\n", "    for word in words:\n", "        if word in modifiers:\n", "            ngrams[-1].append(word)\n", "            continue\n", "\n", "        if word in valence:\n", "            ngrams[-1].append(word)\n", "        else:\n", "            if len(ngrams[-1]) > 0:\n", "                ngrams.append([])\n", "\n", "    # Remove the trailing empty ngram if necessary\n", "    if len(ngrams[-1]) == 0:\n", "        ngrams = ngrams[:-1]\n", "\n", "    # Guard against texts with no relevant words at all\n", "    if len(ngrams) == 0:\n", "        return 0\n", "\n", "    score = 0\n", "\n", "    for ngram in ngrams:\n", "        value = 1\n", "\n", "        # Multiply together the modifier weights and the valence of each ngram\n", "        for word in ngram:\n", "            if word in modifiers:\n", "                value *= modifiers[word]\n", "            elif word in valence:\n", "                value *= valence[word]\n", "\n", "        if verbose:\n", "            print(ngram, value)\n", "\n", "        score += value\n", "\n", "    return score/len(ngrams)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This implementation is still relatively simple, but, as you can see, the results are already better." ] }
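, { "cell_type": "markdown", "metadata": {}, "source": [ "For instance, a negation now flips the sign of the word it modifies. Here's a minimal check (\"good\" appears in the positive word list, and \"not\" carries a weight of -1 in our modifier dictionary):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# \"not\" (-1) and \"good\" (+1) form a single ngram with combined value -1\n", "sentiment_modified(\"This is not good\", valence, modifiers, verbose=True)" ] }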
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The product is pretty annoying, and I hate it\n" ] } ], "source": [ "print(texts[1])" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['pretty', 'annoying'] -1.5\n", "['hate'] -1\n" ] }, { "data": { "text/plain": [ "-1.25" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentiment_modified(texts[1], valence, modifiers, verbose=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A more complete implementation would be more careful in handling the modifiers and would build larger ngrams so that cases like this one would also work:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['not', 'very', 'good'] -1.5\n" ] }, { "data": { "text/plain": [ "-1.5" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentiment_modified(\"It was not very good\", valence, modifiers, True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And even more complex (and unrealistic) examples work fine" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['not', 'not', 'very', 'very', 'good'] 2.25\n" ] }, { "data": { "text/plain": [ "2.25" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentiment_modified(\"It was not not very very good\", valence, modifiers, True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Continuous weights" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "VADER is a state of the art sentiment analysis tool. Here we will use their excelent and well documented [lexicon](https://github.com/cjhutto/vaderSentiment) to explore non binary weights. 
Their approach is significantly more advanced than what we present here, but some of the fundamental ideas are the same" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "vader = pd.read_csv(\"data/vader_lexicon.txt\", sep='\\t', header=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The vader lexicon includes a lot of interesting information:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " <th>3</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>$:</td>\n", " <td>-1.5</td>\n", " <td>0.80623</td>\n", " <td>[-1, -1, -1, -1, -3, -1, -3, -1, -2, -1]</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>%)</td>\n", " <td>-0.4</td>\n", " <td>1.01980</td>\n", " <td>[-1, 0, -1, 0, 0, -2, -1, 2, -1, 0]</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>%-)</td>\n", " <td>-1.5</td>\n", " <td>1.43178</td>\n", " <td>[-2, 0, -2, -2, -1, 2, -2, -3, -2, -3]</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>&-:</td>\n", " <td>-0.4</td>\n", " <td>1.42829</td>\n", " <td>[-3, -1, 0, 0, -1, -1, -1, 2, -1, 2]</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>&:</td>\n", " <td>-0.7</td>\n", " <td>0.64031</td>\n", " <td>[0, -1, -1, -1, 1, -1, -1, -1, -1, -1]</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2 3\n", "0 $: -1.5 0.80623 [-1, -1, -1, -1, -3, -1, -3, -1, -2, -1]\n", "1 %) -0.4 1.01980 [-1, 0, -1, 0, 0, -2, -1, 2, -1, 0]\n", "2 %-) -1.5 1.43178 [-2, 0, -2, -2, -1, 2, -2, -3, -2, -3]\n", "3 &-: -0.4 1.42829 [-3, -1, 0, 0, -1, -1, -1, 2, -1, 2]\n", "4 &: -0.7 0.64031 [0, -1, -1, -1, 1, -1, -1, -1, -1, -1]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vader.head()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " <th>3</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>7512</th>\n", " <td>}:</td>\n", " <td>-2.1</td>\n", " <td>0.83066</td>\n", " <td>[-1, -1, -3, -2, -3, -2, -2, -1, -3, -3]</td>\n", " </tr>\n", " <tr>\n", " <th>7513</th>\n", " <td>}:(</td>\n", " <td>-2.0</td>\n", " <td>0.63246</td>\n", " <td>[-3, -1, -2, -1, -3, -2, -2, -2, -2, -2]</td>\n", " </tr>\n", " <tr>\n", " <th>7514</th>\n", " <td>}:)</td>\n", " <td>0.4</td>\n", " <td>1.42829</td>\n", " <td>[1, 1, -2, 1, 2, -2, 1, -1, 2, 1]</td>\n", " </tr>\n", " <tr>\n", " <th>7515</th>\n", " <td>}:-(</td>\n", " <td>-2.1</td>\n", " <td>0.70000</td>\n", " <td>[-2, -1, -2, -2, -2, -4, -2, -2, -2, -2]</td>\n", " </tr>\n", " <tr>\n", " <th>7516</th>\n", " <td>}:-)</td>\n", 
" <td>0.3</td>\n", " <td>1.61555</td>\n", " <td>[1, 1, -2, 1, -1, -3, 2, 2, 1, 1]</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2 3\n", "7512 }: -2.1 0.83066 [-1, -1, -3, -2, -3, -2, -2, -1, -3, -3]\n", "7513 }:( -2.0 0.63246 [-3, -1, -2, -1, -3, -2, -2, -2, -2, -2]\n", "7514 }:) 0.4 1.42829 [1, 1, -2, 1, 2, -2, 1, -1, 2, 1]\n", "7515 }:-( -2.1 0.70000 [-2, -1, -2, -2, -2, -4, -2, -2, -2, -2]\n", "7516 }:-) 0.3 1.61555 [1, 1, -2, 1, -1, -3, 2, 2, 1, 1]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vader.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similies are also included and, in addition to the average sentiment of each word (in column 1) and it's standard deviation (in column 2) it provides the raw human generated scores in column 3. So that we may easily check (and possibly modify) their weights. To extract the raw scores for the word \"love\" we could simply do:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(7517, 4)\n" ] } ], "source": [ "print(vader.shape)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 love\n", "1 3.2\n", "2 0.4\n", "3 [3, 3, 3, 3, 3, 3, 3, 4, 4, 3]\n", "Name: 4446, dtype: object\n" ] } ], "source": [ "print(vader.iloc[4446])" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[3, 3, 3, 3, 3, 3, 3, 4, 4, 3]\n" ] } ], "source": [ "scores = eval(vader.iloc[4446][3])\n", "print(scores)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scores[8]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can see that 8/10 people thought that the word love should receive a score of 3 and two others a score of 4. This gives us insight into how uniform the scores are. If for some reason, we thought that there was some problem with the 2 values of 4 or perhaps just not appropriate to our purposes we might discard them and recalculate the valence of the word. 
\n", "\n", "One justification for this might be the fact that the scores for the closely related word, \"loved\", are significantly different with a wider range of variation in the human scores" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 loved\n", "1 2.9\n", "2 0.7\n", "3 [3, 3, 4, 2, 2, 4, 3, 2, 3, 3]\n", "Name: 4447, dtype: object" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vader.iloc[4447]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we convert this dataset into a dictionary similar to the one we used above" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "valence_vader = dict(vader[[0, 1]].values)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3.2" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "valence_vader['love']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use this new dictionary we just have to modify the arguments to the sentiment_modified function:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['not', 'not', 'very', 'very', 'good'] 4.2749999999999995\n" ] }, { "data": { "text/plain": [ "4.2749999999999995" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentiment_modified(\"It was not not very very good\", valence_vader, modifiers, verbose=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One important detail to keep in mind is that scores obtained through different methods are not comparable. In this example, the score of the sentence \"It was not not very very good\" went from 2.25 to 4.27 when we switched dictionaries. This is due not only to different levels of coverage in differnet dictionaries but also to differnet choices in the possible ranges of values." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I'm very happy : 4.050000000000001\n", "The product is pretty annoying, and I hate it : -2.625\n", "I'm sad : -2.1\n" ] } ], "source": [ "texts = [\"I'm very happy\",\n", " \"The product is pretty annoying, and I hate it\",\n", " \"I'm sad\",\n", " ]\n", "\n", "for text in texts:\n", " print(text, ':', sentiment_modified(text, valence_vader, modifiers))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<center>\n", " <img src=\"data/D4Sci_logo_full.png\" alt=\"Data For Science, Inc\" align=\"center\" border=\"0\" width=300px> \n", "</center>" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }