{ "cells": [ { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "---\n", "# **Unsupervised Semantic Sentiment Analysis of IMDB Reviews**\n", "## **A model to capture sentiment complexity and text subjectivity**\n", "\n", "### by [Ahmad Hashemi](https://www.linkedin.com/in/ahmad-hashemi-oxford/)\n", "---\n", "\n", "![alt text](../reports/figures/distribution_of_high_confidence_predictions_on_PSS_NSS_plane.png)" ] }, { "cell_type": "markdown", "metadata": { "id": "xxuUFISrqjML" }, "source": [ "\n", "\n", "# Table of contents \n", "\n", "1. [Introduction](#Introduction)\n", " >- Problem overview\n", " >- Importing necessary libraries \n", " \n", "2. [Data Preprocessing](#Preprocessing)\n", " >- Utility module\n", " \n", "3. [Supervised Models](#Supervised_Models)\n", "\n", "4. [Unsupervised Approach](#Unsupervised_Approach)\n", " >- Training the word embedding model\n", " >- Defining the negative and positive sets\n", " >- Calculating the semantic sentiment of the reviews\n", " >- High confidence predictions\n", " \n", "5. [Further Analysis](#Further_Analysis)\n", " >- Sentiment complexity \n", " >- A Qualitative Assessment\n", " >- Now it's your turn!" ] }, { "cell_type": "markdown", "metadata": { "id": "pqJIZlMwqjMM" }, "source": [ "---\n", "\n", "# 1. Introduction\n", "---" ] }, { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "## Problem overview\n", "\n", "Sentiment analysis, also called opinion mining, is a common application of Natural Language Processing (NLP) widely used to analyze the overall effect and underlying sentiment of a given sentence or statement. In its most basic form, a sentiment analysis model classifies the text into positive or negative (and sometimes neutral) sentiments. Naturally, the most successful approaches use supervised models, which need a fair amount of labeled data to be trained. 
Providing such data is expensive and time-consuming, and in many cases it is not feasible at all. Additionally, the output of such models is a single number indicating how similar the text is to the positive examples provided during training; it does not account for nuances such as the sentiment complexity of the text.\n", "\n", "Relying on my background in close reading and qualitative analysis of text, I present an unsupervised semantic model that not only captures the overall sentiment of the text but also provides a way to analyze the polarity strength and complexity of emotions in the text while maintaining high performance.\n", "\n", "To demonstrate this approach, I use the well-known IMDB dataset. Released to the public by [Stanford University](http://ai.stanford.edu/~amaas/data/sentiment/), this dataset is a collection of 50,000 reviews from IMDB that contains an equal number of positive and negative reviews, with no more than 30 reviews per movie. As noted in the dataset's documentation, \"a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. 
Neutral reviews are not included in the dataset.\"\n", "\n", "The dataset can be obtained from http://ai.stanford.edu/~amaas/data/sentiment/\n" ] }, { "cell_type": "markdown", "metadata": { "id": "tUvSpLcaqjMM" }, "source": [ "\n", "\n", "## Importing necessary libraries" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7UYAZY9lqjMN", "outputId": "d2cdcf59-3f96-4954-c91b-99e085f06a89" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "*** --> Modules are imported: \n", "Python version: 3.6.13 (default, Dec 12 2021, 15:04:37) \n", "[GCC Apple LLVM 13.0.0 (clang-1300.0.29.3)]\n", "numpy version: 1.19.5\n", "pandas version: 1.1.5\n", "plotly version: 5.4.0\n", "sklearn version: 0.24.2\n", "nltk version: 3.6.5\n", "gensim version: 4.1.2\n" ] } ], "source": [ "# Data processing and manipulation\n", "import numpy as np # linear algebra\n", "import pandas as pd # data processing\n", "\n", "import sklearn\n", "from sklearn.model_selection import train_test_split\n", " \n", "# Libraries and packages for NLP\n", "import nltk\n", "import gensim\n", "from gensim.models import Word2Vec\n", "\n", "# Visualization \n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "import plotly\n", "import plotly.express as px\n", "%matplotlib inline\n", "\n", "plt.style.use('fivethirtyeight')\n", "matplotlib.rcParams['axes.labelsize'] = 14\n", "matplotlib.rcParams['xtick.labelsize'] = 12\n", "matplotlib.rcParams['figure.figsize'] = (12, 10)\n", "matplotlib.rcParams['ytick.labelsize'] = 12\n", "matplotlib.rcParams['text.color'] = 'k'\n", "\n", "import os\n", "import sys\n", "import warnings\n", "if not sys.warnoptions:\n", " warnings.simplefilter(\"ignore\")\n", " \n", "print('*** --> Modules are imported: ') \n", "print(\"Python version:\", sys.version)\n", "print(\"numpy version:\", np.__version__)\n", "print(\"pandas version:\", pd.__version__)\n", "\n", "print(\"plotly version:\", 
plotly.__version__)\n", "print(\"sklearn version:\", sklearn.__version__)\n", "print(\"nltk version:\", nltk.__version__)\n", "print(\"gensim version:\", gensim.__version__)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | review | \n", "sentiment | \n", "
---|---|---|
0 | \n", "In 1974, the teenager Martha Moxley (Maggie Gr... | \n", "1 | \n", "
1 | \n", "OK... so... I really like Kris Kristofferson a... | \n", "0 | \n", "
2 | \n", "***SPOILER*** Do not read this, if you think a... | \n", "0 | \n", "