{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
\n", "
\n", "

# Natural Language Processing For Everyone

\n", "

## Sentiment Analysis

\n", "

Bruno Gonçalves
\n", " www.data4sci.com
\n", " @bgoncalves, @data4sci

\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import string\n", "import gzip\n", "from collections import Counter\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "import matplotlib.cm as cm\n", "\n", "import watermark\n", "\n", "%matplotlib inline\n", "%load_ext watermark" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "List out the versions of all loaded libraries" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python implementation: CPython\n", "Python version : 3.11.7\n", "IPython version : 8.20.0\n", "\n", "Compiler : Clang 14.0.6 \n", "OS : Darwin\n", "Release : 23.4.0\n", "Machine : arm64\n", "Processor : arm\n", "CPU cores : 16\n", "Architecture: 64bit\n", "\n", "Git hash: ed7010f27131287f28d990d9846e4ce6cd87e34d\n", "\n", "watermark : 2.4.3\n", "numpy : 1.26.4\n", "pandas : 2.1.4\n", "matplotlib: 3.8.0\n", "\n" ] } ], "source": [ "%watermark -n -v -m -g -iv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set the default style" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "plt.style.use('d4sci.mplstyle')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Word counting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start by taking the simplest approach and simply counting positive and negative words. 
We'll use Hu and Liu's Lexicon from their 2004 KDD paper: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "pos = np.loadtxt('data/positive-words.txt', dtype='str', comments=';')\n", "neg = np.loadtxt('data/negative-words.txt', dtype='str', comments=';')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['a+', 'abound', 'abounds', ..., 'zenith', 'zest', 'zippy'],\n", " dtype=' 0:\n", " ngrams.append([])\n", "\n", " score = 0\n", " \n", " # Remove the trailing empty ngram if necessary\n", " if len(ngrams[-1]) == 0:\n", " ngrams = ngrams[:-1]\n", "\n", " for ngram in ngrams:\n", " value = 1\n", "\n", " for word in ngram:\n", " if word in modifiers:\n", " value *= modifiers[word]\n", " elif word in valence:\n", " value *= valence[word]\n", "\n", " if verbose:\n", " print(ngram, value)\n", "\n", " score += value\n", "\n", " return score/len(ngrams)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This implementation is still relatively simple, but, as you can see, the results are already better." 
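To make the averaging in `sentiment_modified` concrete, here is the arithmetic it performs for the sentence "The product is pretty annoying, and I hate it" (a hand-traced sketch; the modifier weight of 1.5 for "pretty" and the -1 valences are taken from the verbose output in the cells below):

```python
# Hand-traced version of the ngram arithmetic in sentiment_modified.
# "pretty" acts as a x1.5 modifier, while "annoying" and "hate" each
# carry a valence of -1 in the binary lexicon.
modifiers = {"pretty": 1.5}
valence = {"annoying": -1, "hate": -1}

# The sentence splits into two ngrams: ['pretty', 'annoying'] and ['hate'].
ngram1 = modifiers["pretty"] * valence["annoying"]  # 1.5 * -1 = -1.5
ngram2 = valence["hate"]                            # -1

score = (ngram1 + ngram2) / 2  # average over the ngrams
print(score)  # -1.25
```

The final score is the mean over ngrams, which is why a single strongly negative ngram gets diluted by the neutral remainder of the sentence.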
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The product is pretty annoying, and I hate it\n" ] } ], "source": [ "print(texts[1])" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['pretty', 'annoying'] -1.5\n", "['hate'] -1\n" ] }, { "data": { "text/plain": [ "-1.25" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentiment_modified(texts[1], valence, modifiers, verbose=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A more complete implementation would be more careful in handling the modifiers and would build larger ngrams so that cases like this one would also work:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['not', 'very', 'good'] -1.5\n" ] }, { "data": { "text/plain": [ "-1.5" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentiment_modified(\"It was not very good\", valence, modifiers, True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And even more complex (and unrealistic) examples work fine" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['not', 'not', 'very', 'very', 'good'] 2.25\n" ] }, { "data": { "text/plain": [ "2.25" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentiment_modified(\"It was not not very very good\", valence, modifiers, True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Continuous weights" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "VADER is a state of the art sentiment analysis tool. 
Here we will use their excellent and well-documented [lexicon](https://github.com/cjhutto/vaderSentiment) to explore non-binary weights. Their approach is significantly more advanced than what we present here, but some of the fundamental ideas are the same." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "vader = pd.read_csv(\"data/vader_lexicon.txt\", sep='\\t', header=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The VADER lexicon includes a lot of interesting information:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
0$:-1.50.80623[-1, -1, -1, -1, -3, -1, -3, -1, -2, -1]
1%)-0.41.01980[-1, 0, -1, 0, 0, -2, -1, 2, -1, 0]
2%-)-1.51.43178[-2, 0, -2, -2, -1, 2, -2, -3, -2, -3]
3&-:-0.41.42829[-3, -1, 0, 0, -1, -1, -1, 2, -1, 2]
4&:-0.70.64031[0, -1, -1, -1, 1, -1, -1, -1, -1, -1]
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 $: -1.5 0.80623 [-1, -1, -1, -1, -3, -1, -3, -1, -2, -1]\n", "1 %) -0.4 1.01980 [-1, 0, -1, 0, 0, -2, -1, 2, -1, 0]\n", "2 %-) -1.5 1.43178 [-2, 0, -2, -2, -1, 2, -2, -3, -2, -3]\n", "3 &-: -0.4 1.42829 [-3, -1, 0, 0, -1, -1, -1, 2, -1, 2]\n", "4 &: -0.7 0.64031 [0, -1, -1, -1, 1, -1, -1, -1, -1, -1]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vader.head()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
7512}:-2.10.83066[-1, -1, -3, -2, -3, -2, -2, -1, -3, -3]
7513}:(-2.00.63246[-3, -1, -2, -1, -3, -2, -2, -2, -2, -2]
7514}:)0.41.42829[1, 1, -2, 1, 2, -2, 1, -1, 2, 1]
7515}:-(-2.10.70000[-2, -1, -2, -2, -2, -4, -2, -2, -2, -2]
7516}:-)0.31.61555[1, 1, -2, 1, -1, -3, 2, 2, 1, 1]
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "7512 }: -2.1 0.83066 [-1, -1, -3, -2, -3, -2, -2, -1, -3, -3]\n", "7513 }:( -2.0 0.63246 [-3, -1, -2, -1, -3, -2, -2, -2, -2, -2]\n", "7514 }:) 0.4 1.42829 [1, 1, -2, 1, 2, -2, 1, -1, 2, 1]\n", "7515 }:-( -2.1 0.70000 [-2, -1, -2, -2, -2, -4, -2, -2, -2, -2]\n", "7516 }:-) 0.3 1.61555 [1, 1, -2, 1, -1, -3, 2, 2, 1, 1]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vader.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Smilies are also included and, in addition to the average sentiment of each word (in column 1) and its standard deviation (in column 2), it provides the raw human-generated scores in column 3, so that we may easily check (and possibly modify) their weights. To extract the raw scores for the word \"love\" we could simply do:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(7517, 4)\n" ] } ], "source": [ "print(vader.shape)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 love\n", "1 3.2\n", "2 0.4\n", "3 [3, 3, 3, 3, 3, 3, 3, 4, 4, 3]\n", "Name: 4446, dtype: object\n" ] } ], "source": [ "print(vader.iloc[4446])" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[3, 3, 3, 3, 3, 3, 3, 4, 4, 3]\n" ] } ], "source": [ "scores = eval(vader.iloc[4446][3])\n", "print(scores)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scores[8]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can see that 8 of the 10 people thought that the word love should receive a score of 3, while the other two gave it a score of 4. 
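We can verify the published valence of "love" directly from the raw scores printed above, and also see what the recalculated valence would be if the two scores of 4 were discarded (a quick arithmetic check using the values from row 4446):

```python
# Raw human scores for "love", as shown in column 3 of row 4446
scores = [3, 3, 3, 3, 3, 3, 3, 4, 4, 3]

# The published valence (column 1) is just the mean of the raw scores
mean = sum(scores) / len(scores)
print(mean)  # 3.2

# Recalculated valence after discarding the two scores of 4
trimmed = [s for s in scores if s != 4]
print(sum(trimmed) / len(trimmed))  # 3.0
```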
This gives us insight into how uniform the scores are. If, for some reason, we thought there was some problem with the two scores of 4, or simply that they were not appropriate for our purposes, we might discard them and recalculate the valence of the word. \n", "\n", "One justification for this might be the fact that the scores for the closely related word, \"loved\", are significantly different, with a wider range of variation in the human scores." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 loved\n", "1 2.9\n", "2 0.7\n", "3 [3, 3, 4, 2, 2, 4, 3, 2, 3, 3]\n", "Name: 4447, dtype: object" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vader.iloc[4447]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we convert this dataset into a dictionary similar to the one we used above:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "valence_vader = dict(vader[[0, 1]].values)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3.2" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "valence_vader['love']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use this new dictionary we just have to modify the arguments to the sentiment_modified function:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['not', 'not', 'very', 'very', 'good'] 4.2749999999999995\n" ] }, { "data": { "text/plain": [ "4.2749999999999995" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentiment_modified(\"It was not not very very good\", valence_vader, modifiers, verbose=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One important detail to keep in mind is that scores obtained 
through different methods are not comparable. In this example, the score of the sentence \"It was not not very very good\" went from 2.25 to 4.27 when we switched dictionaries. This is due not only to different levels of coverage in different dictionaries but also to different choices in the possible ranges of values." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I'm very happy : 4.050000000000001\n", "The product is pretty annoying, and I hate it : -2.625\n", "I'm sad : -2.1\n" ] } ], "source": [ "texts = [\"I'm very happy\",\n", " \"The product is pretty annoying, and I hate it\",\n", " \"I'm sad\",\n", " ]\n", "\n", "for text in texts:\n", " print(text, ':', sentiment_modified(text, valence_vader, modifiers))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
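One crude way to make scores from different dictionaries roughly comparable is to rescale each score by the largest absolute word valence in its lexicon, so that single-word scores land in [-1, 1]. This is only a sketch under that assumption (real calibration would need a shared evaluation set), and the two dictionaries below are illustrative stand-ins rather than the full lexicons:

```python
def normalize(score, lexicon):
    # Rescale by the largest absolute valence in the lexicon so that
    # single-word scores fall in [-1, 1] regardless of the dictionary.
    max_abs = max(abs(v) for v in lexicon.values())
    return score / max_abs

binary_lexicon = {"good": 1, "annoying": -1}                        # Hu & Liu style
continuous_lexicon = {"good": 1.9, "great": 3.1, "annoying": -1.8}  # VADER-style

print(normalize(1.0, binary_lexicon))      # 1.0
print(normalize(1.9, continuous_lexicon))  # ~0.61
```

This removes the range mismatch, but it cannot fix differences in coverage: a word missing from one lexicon still contributes nothing to its score.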
" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }