{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " Sentiment_Analysis.ipynb: Sentiment Analysis with Pandas and Watson Natural Language Understanding\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "With the rapid growth in the volume of highly subjective user-generated text, such as online product reviews, recommendations, blogs, and discussion forums, sentiment analysis has gained a lot of attention over the last decade. The goal of sentiment analysis is to automatically detect the underlying sentiment of a user towards an entity of interest. While sentiment analysis is one of the most prominent and commonly used natural language processing (NLP) features, it is typically combined with other NLP features and text analytics to gain insight into the user experience for customer care and feedback analytics, product analytics, and brand intelligence.\n", "This notebook shows how the open source library [Text Extensions for Pandas](https://github.com/CODAIT/text-extensions-for-pandas) lets you use [Pandas](https://pandas.pydata.org/) DataFrames and the [Watson Natural Language Understanding](https://www.ibm.com/cloud/watson-natural-language-understanding) service to conduct exploratory sentiment analysis over product reviews. \n", "\n", "We start out with the [Edmunds-Consumer Car Ratings and Reviews](https://www.kaggle.com/ankkur13/edmundsconsumer-car-ratings-and-reviews) dataset, obtained from Kaggle. This dataset contains consumers' written reviews and star ratings of cars, organized by manufacturer, model, and type.\n", "We pass each review to the Watson Natural Language \n", "Understanding (NLU) service. Then we use Text Extensions for Pandas to convert the output of the \n", "Watson NLU service to Pandas DataFrames. 
Next, we perform an example exploratory data analysis and a machine learning task with \n", "Pandas to show how Pandas simplifies both analyzing the dataset and the prediction task." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Environment Setup\n", "\n", "This notebook requires a Python 3.7 or later environment with the following packages:\n", "* The dependencies listed in the [\"requirements.txt\" file for Text Extensions for Pandas](https://github.com/CODAIT/text-extensions-for-pandas/blob/master/requirements.txt)\n", "* The \"[ibm-watson](https://pypi.org/project/ibm-watson/)\" package, available via `pip install ibm-watson`\n", "* `text_extensions_for_pandas`\n", "\n", "You can satisfy the dependency on `text_extensions_for_pandas` in either of two ways:\n", "\n", "* Run `pip install text_extensions_for_pandas` before running this notebook. This command adds the library to your Python environment.\n", "* Run this notebook out of your local copy of the Text Extensions for Pandas project's [source tree](https://github.com/CODAIT/text-extensions-for-pandas). In this case, the notebook will use the version of Text Extensions for Pandas in your local source tree **if the package is not installed in your Python environment**." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Core Python libraries\n", "import json\n", "import os\n", "import sys\n", "import pandas as pd\n", "import numpy as np\n", "import glob\n", "import re\n", "import time\n", "import warnings\n", "from typing import *\n", "\n", "# IBM Watson libraries\n", "import ibm_watson\n", "import ibm_watson.natural_language_understanding_v1 as nlu\n", "import ibm_cloud_sdk_core\n", "\n", "# Machine Learning libraries\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import mean_squared_error, r2_score\n", "\n", "# Visualization\n", "import matplotlib.pyplot as plt\n", "\n", "\n", "# And of course we need the text_extensions_for_pandas library itself.\n", "try:\n", " import text_extensions_for_pandas as tp\n", "except ModuleNotFoundError as e:\n", " # If we're running from within the project source tree and the parent Python\n", " # environment doesn't have the text_extensions_for_pandas package, use the\n", " # version in the local source tree.\n", " if not os.getcwd().endswith(\"notebooks\"):\n", " raise e\n", " if \"..\" not in sys.path:\n", " sys.path.insert(0, \"..\")\n", " import text_extensions_for_pandas as tp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Set up the Watson Natural Language Understanding Service\n", "\n", "In this part of the notebook, we will use the Watson Natural Language Understanding (NLU) service to extract the keywords and their sentiment and emotion from each of the product reviews.\n", "\n", "You can create an instance of Watson NLU on the IBM Cloud for free by navigating to [this page](https://www.ibm.com/cloud/watson-natural-language-understanding) and clicking on the button marked \"Get started free\". 
You can also install your own instance of Watson NLU on [OpenShift](https://www.openshift.com/) by using [IBM Watson Natural Language Understanding for IBM Cloud Pak for Data](\n", "https://catalog.redhat.com/software/operators/detail/5e9873e13f398525a0ceafe5).\n", "\n", "You'll need two pieces of information to access your instance of Watson NLU: An **API key** and a **service URL**. If you're using Watson NLU on the IBM Cloud, you can find your API key and service URL in the IBM Cloud web UI. Navigate to the [resource list](https://cloud.ibm.com/resources) and click on your instance of Natural Language Understanding to open the management UI for your service. Then click on the \"Manage\" tab to show a page with your API key and service URL.\n", "\n", "The cell that follows assumes that you are using the environment variables `IBM_API_KEY` and `IBM_SERVICE_URL` to store your credentials. If you're running this notebook in Jupyter on your laptop, you can set these environment variables while starting up `jupyter notebook` or `jupyter lab`. 
For example:\n", "``` console\n", "IBM_API_KEY='' \\\n", "IBM_SERVICE_URL='' \\\n", " jupyter lab\n", "```\n", "\n", "Alternately, you can uncomment the first two lines of code below to set the `IBM_API_KEY` and `IBM_SERVICE_URL` environment variables directly.\n", "**Be careful not to store your API key in any publicly-accessible location!**" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# If you need to embed your credentials inline, uncomment the following two lines and\n", "# paste your credentials in the indicated locations.\n", "# os.environ[\"IBM_API_KEY\"] = \"\"\n", "# os.environ[\"IBM_SERVICE_URL\"] = \"\"\n", "\n", "# Retrieve the API key for your Watson NLU service instance\n", "if \"IBM_API_KEY\" not in os.environ:\n", " raise ValueError(\"Expected Watson NLU api key in the environment variable 'IBM_API_KEY'\")\n", "api_key = os.environ.get(\"IBM_API_KEY\")\n", " \n", "# Retrieve the service URL for your Watson NLU service instance\n", "if \"IBM_SERVICE_URL\" not in os.environ:\n", " raise ValueError(\"Expected Watson NLU service URL in the environment variable 'IBM_SERVICE_URL'\")\n", "service_url = os.environ.get(\"IBM_SERVICE_URL\") " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Connect to the Watson Natural Language Understanding Python API\n", "\n", "This notebook uses the IBM Watson Python SDK to perform authentication on the IBM Cloud via the \n", "`IAMAuthenticator` class. See [the IBM Watson Python SDK documentation](https://github.com/watson-developer-cloud/python-sdk#iam) for more information. \n", "\n", "We start by using the API key and service URL from the previous cell to create an instance of the\n", "Python API for Watson NLU." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "natural_language_understanding = ibm_watson.NaturalLanguageUnderstandingV1(\n", " version=\"2019-07-12\",\n", " authenticator=ibm_cloud_sdk_core.authenticators.IAMAuthenticator(api_key)\n", ")\n", "natural_language_understanding.set_service_url(service_url)\n", "natural_language_understanding.set_disable_ssl_verification(True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pass a Review through the Watson NLU Service\n", "\n", "Once you've opened a connection to the Watson NLU service, you can pass documents through \n", "the service by invoking the [`analyze()` method](https://cloud.ibm.com/apidocs/natural-language-understanding?code=python#analyze).\n", "\n", "First, download the [Edmunds-Consumer Car Ratings and Reviews](https://www.kaggle.com/ankkur13/edmundsconsumer-car-ratings-and-reviews/download/) dataset from the Kaggle website and place the archive.zip file in the notebooks/outputs directory. The archive contains 50 CSV files of reviews, one for each of 50 major car brands. We read them into a single DataFrame, with the brand name listed under the \"Car_Make\" column.\n", "\n", "Let's read the reviews and show what they look like:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Review_Date Author_Name \\\n", "0 on 09/18/11 00:19 AM (PDT) wizbang_fl \n", "1 on 07/07/10 05:28 AM (PDT) carlo frazzano \n", "2 on 10/19/09 21:41 PM (PDT) NewBeetleDriver \n", "3 on 01/01/09 19:13 PM (PST) Kayemtee \n", "4 on 08/02/08 13:43 PM (PDT) jik \n", "5 on 05/16/08 12:07 PM (PDT) Ray Cavanagh \n", "6 on 03/28/08 22:04 PM (PDT) harvestmoon \n", "7 on 01/03/08 17:53 PM (PST) The Husband \n", "8 on 09/27/07 08:42 AM (PDT) Kristina \n", "9 on 08/01/07 22:24 PM (PDT) bug lover \n", "\n", " Vehicle_Title \\\n", "0 2007 Volkswagen New Beetle Convertible 2.5 2dr... \n", "1 2007 Volkswagen New Beetle Convertible 2.5 PZE... \n", "2 2007 Volkswagen New Beetle Convertible Triple ... \n", "3 2007 Volkswagen New Beetle Convertible 2.5 2dr... \n", "4 2007 Volkswagen New Beetle Convertible 2.5 2dr... \n", "5 2007 Volkswagen New Beetle Convertible Triple ... \n", "6 2007 Volkswagen New Beetle Convertible 2.5 2dr... \n", "7 2007 Volkswagen New Beetle Convertible Triple ... \n", "8 2007 Volkswagen New Beetle Convertible 2.5 2dr... \n", "9 2007 Volkswagen New Beetle Convertible Triple ... \n", "\n", " Review_Title \\\n", "0 New Beetle- Holds up well & Fun to Drive, but ... \n", "1 Quality Review \n", "2 Adore it \n", "3 Nice Ragtop \n", "4 Luv, luv, luv my dream car \n", "5 The Best One So Far.... \n", "6 Don't Fall Under The Cute Spell! \n", "7 Not for Cold Weather!!! \n", "8 I love my Beetle \n", "9 Bug lover review \n", "\n", " Review Rating\\r Car_Make \n", "0 I've had my Beetle Convertible for over 4.5 y... 4.500 Volkswagen \n", "1 We bought the car new in 2007 and are general... 4.375 Volkswagen \n", "2 I adore my New Beetle. Even though I'm a male... 4.375 Volkswagen \n", "3 My wife chose this car to replace a Sebring c... 4.375 Volkswagen \n", "4 4 of us carpool 1 way 30 min. Backseat ok fo... 4.750 Volkswagen \n", "5 I owned a 2002 SLK and 2003 BMW Z-4. After s... 5.000 Volkswagen \n", "6 Fell in love with the car's look and would be... 
2.750 Volkswagen \n", "7 The car is beautiful and performs well in the... 3.750 Volkswagen \n", "8 I love my car. I previously owned an Explore... 5.000 Volkswagen \n", "9 My 2005 was so good, I had to have the Triple... 5.000 Volkswagen " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from zipfile import ZipFile\n", "path = r'./outputs/archive' # path to compressed directory of data\n", "\n", "with ZipFile(path+'.zip', 'r') as zipObj:\n", " # Extract all the contents of the zip file into the notebooks/outputs/archive directory\n", " zipObj.extractall(path)\n", " \n", "all_files = glob.glob(path + \"/*.csv\")\n", "\n", "li = []\n", "\n", "for filename in all_files:\n", " df = pd.read_csv(filename, index_col=0, header=0, lineterminator='\\n')\n", " df['Car_Make'] = re.split('_|\\\\.',os.path.basename(filename))[-2] # Extract the car brand from the file name\n", " li.append(df)\n", "\n", "frame = pd.concat(li, axis=0, ignore_index=True)\n", "frame.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's count the distinct values in each column (vehicle titles, reviews, reviewers, and so on) per car make in our dataset:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Review_Date Author_Name Vehicle_Title Review_Title Review \\\n", "Car_Make \n", "AMGeneral 5 5 2 5 5 \n", "Acura 5632 5807 494 5681 6512 \n", "AlfaRomeo 77 76 22 77 77 \n", "AstonMartin 82 89 31 89 89 \n", "Audi 5069 5389 753 5467 6006 \n", "BMW 6833 7106 829 7202 7984 \n", "Bentley 150 146 39 141 150 \n", "Bugatti 9 9 4 9 9 \n", "Buick 3406 3242 374 3334 3615 \n", "Cadillac 3539 3531 457 3593 3902 \n", "Chevrolet 15781 16254 2760 16500 19334 \n", "GMC 4327 4425 1261 4415 4964 \n", "Honda 11611 10646 1704 11141 12559 \n", "Toyota 16145 15483 2328 15913 18553 \n", "Volkswagen 8260 8219 1577 8481 9334 \n", "chrysler 4958 4960 495 4996 5529 \n", "dodge 6781 7373 1173 7462 8460 \n", "ferrari 156 159 47 156 161 \n", "fiat 394 380 68 391 391 \n", "ford 16908 17136 3261 17718 20576 \n", "genesis 78 75 16 78 77 \n", "hummer 537 541 35 531 559 \n", "hyundai 7679 7032 943 7250 8156 \n", "infiniti 3914 3874 370 3846 4277 \n", "isuzu 1002 1146 175 1093 1173 \n", "jaguar 1730 1729 254 1770 1878 \n", "jeep 4824 4311 643 4540 4932 \n", "kia 5561 5225 705 5353 5926 \n", "lamborghini 83 85 24 82 86 \n", "land-rover 1752 1711 186 1743 1831 \n", "lexus 5371 5474 370 5424 6083 \n", "lincoln 2801 2776 308 2792 3012 \n", "lotus 136 133 16 133 137 \n", "maserati 235 234 61 234 239 \n", "maybach 24 24 6 24 24 \n", "mazda 7165 6830 938 7036 7820 \n", "mclaren 1 1 1 1 1 \n", "mercedes-benz 6063 6542 804 6638 7308 \n", "mercury 3002 3002 291 3037 3355 \n", "mini 1033 977 127 997 1036 \n", "mitsubishi 3982 4382 601 4222 4773 \n", "nissan 10729 10025 1735 10401 11760 \n", "pontiac 5066 5294 345 5239 5927 \n", "porsche 1636 1646 280 1657 1774 \n", "ram 564 505 281 551 553 \n", "rolls-royce 33 34 15 33 33 \n", "subaru 5994 5711 970 5958 6510 \n", "suzuki 2142 2151 460 2124 2326 \n", "tesla 140 136 37 139 140 \n", "volvo 4269 4310 452 4405 4818 \n", "\n", " Rating\\r \n", "Car_Make \n", "AMGeneral 4 \n", "Acura 32 \n", "AlfaRomeo 5 \n", "AstonMartin 17 \n", "Audi 33 \n", 
"BMW 33 \n", "Bentley 21 \n", "Bugatti 7 \n", "Buick 33 \n", "Cadillac 33 \n", "Chevrolet 33 \n", "GMC 33 \n", "Honda 33 \n", "Toyota 33 \n", "Volkswagen 33 \n", "chrysler 33 \n", "dodge 33 \n", "ferrari 17 \n", "fiat 25 \n", "ford 33 \n", "genesis 5 \n", "hummer 29 \n", "hyundai 33 \n", "infiniti 32 \n", "isuzu 33 \n", "jaguar 32 \n", "jeep 33 \n", "kia 33 \n", "lamborghini 15 \n", "land-rover 33 \n", "lexus 32 \n", "lincoln 32 \n", "lotus 16 \n", "maserati 25 \n", "maybach 8 \n", "mazda 33 \n", "mclaren 1 \n", "mercedes-benz 33 \n", "mercury 33 \n", "mini 29 \n", "mitsubishi 33 \n", "nissan 33 \n", "pontiac 33 \n", "porsche 30 \n", "ram 22 \n", "rolls-royce 11 \n", "subaru 33 \n", "suzuki 33 \n", "tesla 11 \n", "volvo 33 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.groupby('Car_Make').nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the number of car makes:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "50" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.groupby('Car_Make').nunique().shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's randomly sample the DataFrame, keeping at most 200 records per car make:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Review_Date 7321\n", "Author_Name 7434\n", "Vehicle_Title 5292\n", "Review_Title 7665\n", "Review 8338\n", "Rating\\r 33\n", "Car_Make 50\n", "dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n = 200\n", "sampled_df = frame.groupby('Car_Make').apply(lambda x: x.sample(min(n,len(x)))).reset_index(drop=True)\n", "sampled_df.nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Checking the number of reviews and columns in the sampled 
corpus:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(8392, 7)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sampled_df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's combine each review's title and body into a single Review_Content column for later analysis:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Review_Date Author_Name \\\n", "0 on 06/15/02 00:00 AM (PDT) mike6382 \n", "1 on 12/18/05 19:55 PM (PST) Clayton \n", "2 on 01/19/06 19:46 PM (PST) REUBEN \n", "3 on 08/23/03 00:00 AM (PDT) Bobby Keene \n", "4 on 08/30/02 00:00 AM (PDT) bluice3309 \n", "\n", " Vehicle_Title Review_Title \\\n", "0 2000 AM General Hummer SUV Hard Top 4dr SUV AWD What a waste \n", "1 2000 AM General Hummer SUV 4dr SUV AWD HUMMER NOT A bummer \n", "2 2000 AM General Hummer SUV Hard Top 4dr SUV AWD AWESOME HUMMER \n", "3 2000 AM General Hummer SUV Hard Top 4dr SUV AWD H1 Review \n", "4 2000 AM General Hummer SUV 4dr SUV AWD a true ride \n", "\n", " Review Rating\\r Car_Make \\\n", "0 I have owned this car for a year and a \\rhalf... 1.000 AMGeneral \n", "1 Vehicle is a beast. I don't recommend HUMMER ... 5.000 AMGeneral \n", "2 Hummer is unstoppable. May only get 12 mpg bu... 5.000 AMGeneral \n", "3 The truck is incredible. I have a long histo... 4.500 AMGeneral \n", "4 this beast can go through just about \\ranythi... 4.625 AMGeneral \n", "\n", " Review_Content \n", "0 What a waste: I have owned this car for a year... \n", "1 HUMMER NOT A bummer : Vehicle is a beast. I do... \n", "2 AWESOME HUMMER: Hummer is unstoppable. May onl... \n", "3 H1 Review: The truck is incredible. I have a ... \n", "4 a true ride: this beast can go through just ab... " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sampled_df['Review_Content'] = sampled_df['Review_Title']+ ':' + sampled_df['Review']\n", "sampled_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see what the reviews in our dataset look like by showing one:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'What a waste: I have owned this car for a year and a \rhalf now and it is not reliabile at \rall. I have driven it through \reverything and it stalls on me all the \rtime. 
I would never buy this car \ragain. and trying to sell it is like \rtrying to sell fire in hell, just wont \rhappen.'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sampled_df['Review_Content'][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Watson Natural Language Understanding Analysis\n", "Now let's see how Watson Natural Language Understanding can help us analyze the reviews, starting with the first one:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the code below, we instruct Watson Natural Language Understanding to perform keyword analysis, including the sentiment and emotion attached to each keyword, on the first review.\n", "\n", "See [the Watson NLU documentation](https://cloud.ibm.com/apidocs/natural-language-understanding?code=python#text-analytics-features) for a full description of the types of analysis that NLU can perform." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "warnings.filterwarnings('ignore')\n", "# Use Watson Natural Language Understanding to analyze the Review_Content\n", "# Make the request\n", "nlu_response_review = natural_language_understanding.analyze(\n", " text=sampled_df['Review_Content'][0],\n", " return_analyzed_text=True,\n", " features=nlu.Features(\n", " keywords=nlu.KeywordsOptions(sentiment=True, emotion=True)\n", " )).get_result()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The response from the analyze() method is a Python dictionary. The dictionary contains an entry for each pass of analysis requested, plus some additional entries with metadata about the API request itself. 
Here's a list of the keys in the response:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['usage', 'language', 'keywords', 'analyzed_text'])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlu_response_review.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here's the whole output of Watson NLU's text analysis for the first review in the dataset:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'usage': {'text_units': 1, 'text_characters': 284, 'features': 1},\n", " 'language': 'en',\n", " 'keywords': [{'text': 'waste',\n", " 'sentiment': {'score': -0.875215, 'label': 'negative'},\n", " 'relevance': 0.685741,\n", " 'emotion': {'sadness': 0.192383,\n", " 'joy': 0.024961,\n", " 'fear': 0.313145,\n", " 'disgust': 0.08332,\n", " 'anger': 0.277825},\n", " 'count': 1},\n", " {'text': 'fire',\n", " 'sentiment': {'score': -0.934513, 'label': 'negative'},\n", " 'relevance': 0.598326,\n", " 'emotion': {'sadness': 0.360925,\n", " 'joy': 0.002355,\n", " 'fear': 0.26649,\n", " 'disgust': 0.069938,\n", " 'anger': 0.442759},\n", " 'count': 1},\n", " {'text': 'car',\n", " 'sentiment': {'score': -0.844774, 'label': 'negative'},\n", " 'relevance': 0.581432,\n", " 'emotion': {'sadness': 0.144346,\n", " 'joy': 0.150177,\n", " 'fear': 0.246102,\n", " 'disgust': 0.06176,\n", " 'anger': 0.203999},\n", " 'count': 2},\n", " {'text': 'hell',\n", " 'sentiment': {'score': -0.934513, 'label': 'negative'},\n", " 'relevance': 0.577011,\n", " 'emotion': {'sadness': 0.360925,\n", " 'joy': 0.002355,\n", " 'fear': 0.26649,\n", " 'disgust': 0.069938,\n", " 'anger': 0.442759},\n", " 'count': 1},\n", " {'text': 'year',\n", " 'sentiment': {'score': -0.875215, 'label': 'negative'},\n", " 'relevance': 0.563676,\n", " 'emotion': {'sadness': 0.192383,\n", " 'joy': 0.024961,\n", " 'fear': 
0.313145,\n", " 'disgust': 0.08332,\n", " 'anger': 0.277825},\n", " 'count': 1},\n", " {'text': 'time',\n", " 'sentiment': {'score': 0, 'label': 'neutral'},\n", " 'relevance': 0.466983,\n", " 'emotion': {'sadness': 0.266573,\n", " 'joy': 0.401314,\n", " 'fear': 0.08908,\n", " 'disgust': 0.024027,\n", " 'anger': 0.065767},\n", " 'count': 1}],\n", " 'analyzed_text': 'What a waste: I have owned this car for a year and a \\rhalf now and it is not reliabile at \\rall. I have driven it through \\reverything and it stalls on me all the \\rtime. I would never buy this car \\ragain. and trying to sell it is like \\rtrying to sell fire in hell, just wont \\rhappen.'}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlu_response_review" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's explore the output dictionary based on its keys:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'What a waste: I have owned this car for a year and a \\rhalf now and it is not reliabile at \\rall. I have driven it through \\reverything and it stalls on me all the \\rtime. I would never buy this car \\ragain. 
and trying to sell it is like \\rtrying to sell fire in hell, just wont \\rhappen.'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlu_response_review['analyzed_text']" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'text': 'waste',\n", " 'sentiment': {'score': -0.875215, 'label': 'negative'},\n", " 'relevance': 0.685741,\n", " 'emotion': {'sadness': 0.192383,\n", " 'joy': 0.024961,\n", " 'fear': 0.313145,\n", " 'disgust': 0.08332,\n", " 'anger': 0.277825},\n", " 'count': 1},\n", " {'text': 'fire',\n", " 'sentiment': {'score': -0.934513, 'label': 'negative'},\n", " 'relevance': 0.598326,\n", " 'emotion': {'sadness': 0.360925,\n", " 'joy': 0.002355,\n", " 'fear': 0.26649,\n", " 'disgust': 0.069938,\n", " 'anger': 0.442759},\n", " 'count': 1},\n", " {'text': 'car',\n", " 'sentiment': {'score': -0.844774, 'label': 'negative'},\n", " 'relevance': 0.581432,\n", " 'emotion': {'sadness': 0.144346,\n", " 'joy': 0.150177,\n", " 'fear': 0.246102,\n", " 'disgust': 0.06176,\n", " 'anger': 0.203999},\n", " 'count': 2},\n", " {'text': 'hell',\n", " 'sentiment': {'score': -0.934513, 'label': 'negative'},\n", " 'relevance': 0.577011,\n", " 'emotion': {'sadness': 0.360925,\n", " 'joy': 0.002355,\n", " 'fear': 0.26649,\n", " 'disgust': 0.069938,\n", " 'anger': 0.442759},\n", " 'count': 1},\n", " {'text': 'year',\n", " 'sentiment': {'score': -0.875215, 'label': 'negative'},\n", " 'relevance': 0.563676,\n", " 'emotion': {'sadness': 0.192383,\n", " 'joy': 0.024961,\n", " 'fear': 0.313145,\n", " 'disgust': 0.08332,\n", " 'anger': 0.277825},\n", " 'count': 1},\n", " {'text': 'time',\n", " 'sentiment': {'score': 0, 'label': 'neutral'},\n", " 'relevance': 0.466983,\n", " 'emotion': {'sadness': 0.266573,\n", " 'joy': 0.401314,\n", " 'fear': 0.08908,\n", " 'disgust': 0.024027,\n", " 'anger': 0.065767},\n", " 'count': 1}]" ] }, "execution_count": 15, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "nlu_response_review['keywords']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For many data scientists and machine learning engineers a common task workflow includes using Pandas to do exploratory data analysis followed by using scikit-learn for applying the machine learning techniques over the data. \n", "\n", "Text Extensions for Pandas includes a function parse_response() that turns the output of Watson NLU's analyze() function into a dictionary of Pandas DataFrames. Let's run our response object through that conversion. Let's first begin by parsing the Watson NLU response by text extensions for pandas, to see what information has been captured for each review:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'syntax': Empty DataFrame\n", " Columns: []\n", " Index: [],\n", " 'entities': Empty DataFrame\n", " Columns: []\n", " Index: [],\n", " 'entity_mentions': Empty DataFrame\n", " Columns: []\n", " Index: [],\n", " 'keywords': text sentiment.label sentiment.score relevance emotion.sadness \\\n", " 0 waste negative -0.875215 0.685741 0.192383 \n", " 1 fire negative -0.934513 0.598326 0.360925 \n", " 2 car negative -0.844774 0.581432 0.144346 \n", " 3 hell negative -0.934513 0.577011 0.360925 \n", " 4 year negative -0.875215 0.563676 0.192383 \n", " 5 time neutral 0.000000 0.466983 0.266573 \n", " \n", " emotion.joy emotion.fear emotion.disgust emotion.anger count \n", " 0 0.024961 0.313145 0.083320 0.277825 1 \n", " 1 0.002355 0.266490 0.069938 0.442759 1 \n", " 2 0.150177 0.246102 0.061760 0.203999 2 \n", " 3 0.002355 0.266490 0.069938 0.442759 1 \n", " 4 0.024961 0.313145 0.083320 0.277825 1 \n", " 5 0.401314 0.089080 0.024027 0.065767 1 ,\n", " 'relations': Empty DataFrame\n", " Columns: []\n", " Index: [],\n", " 'semantic_roles': Empty DataFrame\n", " Columns: []\n", " Index: []}" ] }, "execution_count": 16, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "df_analyzed_review = tp.io.watson.nlu.parse_response(nlu_response_review)\n", "df_analyzed_review" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['syntax', 'entities', 'entity_mentions', 'keywords', 'relations', 'semantic_roles'])" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_analyzed_review.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output of each analysis pass that Watson NLU performed is now a DataFrame. Let's look at the DataFrame for the \"keywords\" pass:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>text</th><th>sentiment.label</th><th>sentiment.score</th><th>relevance</th><th>emotion.sadness</th><th>emotion.joy</th><th>emotion.fear</th><th>emotion.disgust</th><th>emotion.anger</th><th>count</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>waste</td><td>negative</td><td>-0.875215</td><td>0.685741</td><td>0.192383</td><td>0.024961</td><td>0.313145</td><td>0.083320</td><td>0.277825</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>fire</td><td>negative</td><td>-0.934513</td><td>0.598326</td><td>0.360925</td><td>0.002355</td><td>0.266490</td><td>0.069938</td><td>0.442759</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>car</td><td>negative</td><td>-0.844774</td><td>0.581432</td><td>0.144346</td><td>0.150177</td><td>0.246102</td><td>0.061760</td><td>0.203999</td><td>2</td></tr>\n",
"    <tr><th>3</th><td>hell</td><td>negative</td><td>-0.934513</td><td>0.577011</td><td>0.360925</td><td>0.002355</td><td>0.266490</td><td>0.069938</td><td>0.442759</td><td>1</td></tr>\n",
"    <tr><th>4</th><td>year</td><td>negative</td><td>-0.875215</td><td>0.563676</td><td>0.192383</td><td>0.024961</td><td>0.313145</td><td>0.083320</td><td>0.277825</td><td>1</td></tr>\n",
"    <tr><th>5</th><td>time</td><td>neutral</td><td>0.000000</td><td>0.466983</td><td>0.266573</td><td>0.401314</td><td>0.089080</td><td>0.024027</td><td>0.065767</td><td>1</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>\n", "
" ], "text/plain": [ " text sentiment.label sentiment.score relevance emotion.sadness \\\n", "0 waste negative -0.875215 0.685741 0.192383 \n", "1 fire negative -0.934513 0.598326 0.360925 \n", "2 car negative -0.844774 0.581432 0.144346 \n", "3 hell negative -0.934513 0.577011 0.360925 \n", "4 year negative -0.875215 0.563676 0.192383 \n", "5 time neutral 0.000000 0.466983 0.266573 \n", "\n", " emotion.joy emotion.fear emotion.disgust emotion.anger count \n", "0 0.024961 0.313145 0.083320 0.277825 1 \n", "1 0.002355 0.266490 0.069938 0.442759 1 \n", "2 0.150177 0.246102 0.061760 0.203999 2 \n", "3 0.002355 0.266490 0.069938 0.442759 1 \n", "4 0.024961 0.313145 0.083320 0.277825 1 \n", "5 0.401314 0.089080 0.024027 0.065767 1 " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_analyzed_review['keywords']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Buried in the above data structure is all the information we need to perform our sentence-level sentiment analysis task:\n", "\n", "\n", " - The sentiment label and score of every sentence in the review. The score ranges from -1 to 1, with -1 being negative, 0 being neutral and 1 being positive. It provides sentiment on each keyword based on its sentence's sentiment, which can come in useful since it calculates the sentiment in the context.\n", " - The emotion score of every sentence (i.e., sadness, joy, fear, disgust, and anger) in the review.\n", "\n", " - The list of the most important words/phrases in a review including both sentiment/emotion-bearing words/phrases as well as objective words/phrases in the review extracted under the keywords. Note that the sentiment assigned to each keyword has calculated based on its context and in the sentence level." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's concat the watson nlu sentiment analysis dataframe above(output of text enstensions for pandas) with its corresponding review." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>text</th><th>sentiment.label</th><th>sentiment.score</th><th>relevance</th><th>emotion.sadness</th><th>emotion.joy</th><th>emotion.fear</th><th>emotion.disgust</th><th>emotion.anger</th><th>count</th><th>0</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>waste</td><td>negative</td><td>-0.875215</td><td>0.685741</td><td>0.192383</td><td>0.024961</td><td>0.313145</td><td>0.083320</td><td>0.277825</td><td>1</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"    <tr><th>1</th><td>fire</td><td>negative</td><td>-0.934513</td><td>0.598326</td><td>0.360925</td><td>0.002355</td><td>0.266490</td><td>0.069938</td><td>0.442759</td><td>1</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"    <tr><th>2</th><td>car</td><td>negative</td><td>-0.844774</td><td>0.581432</td><td>0.144346</td><td>0.150177</td><td>0.246102</td><td>0.061760</td><td>0.203999</td><td>2</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"    <tr><th>3</th><td>hell</td><td>negative</td><td>-0.934513</td><td>0.577011</td><td>0.360925</td><td>0.002355</td><td>0.266490</td><td>0.069938</td><td>0.442759</td><td>1</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"    <tr><th>4</th><td>year</td><td>negative</td><td>-0.875215</td><td>0.563676</td><td>0.192383</td><td>0.024961</td><td>0.313145</td><td>0.083320</td><td>0.277825</td><td>1</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"    <tr><th>5</th><td>time</td><td>neutral</td><td>0.000000</td><td>0.466983</td><td>0.266573</td><td>0.401314</td><td>0.089080</td><td>0.024027</td><td>0.065767</td><td>1</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>\n", "
" ], "text/plain": [ " text sentiment.label sentiment.score relevance emotion.sadness \\\n", "0 waste negative -0.875215 0.685741 0.192383 \n", "1 fire negative -0.934513 0.598326 0.360925 \n", "2 car negative -0.844774 0.581432 0.144346 \n", "3 hell negative -0.934513 0.577011 0.360925 \n", "4 year negative -0.875215 0.563676 0.192383 \n", "5 time neutral 0.000000 0.466983 0.266573 \n", "\n", " emotion.joy emotion.fear emotion.disgust emotion.anger count \\\n", "0 0.024961 0.313145 0.083320 0.277825 1 \n", "1 0.002355 0.266490 0.069938 0.442759 1 \n", "2 0.150177 0.246102 0.061760 0.203999 2 \n", "3 0.002355 0.266490 0.069938 0.442759 1 \n", "4 0.024961 0.313145 0.083320 0.277825 1 \n", "5 0.401314 0.089080 0.024027 0.065767 1 \n", "\n", " 0 \n", "0 What a waste: I have owned this car for a year... \n", "1 What a waste: I have owned this car for a year... \n", "2 What a waste: I have owned this car for a year... \n", "3 What a waste: I have owned this car for a year... \n", "4 What a waste: I have owned this car for a year... \n", "5 What a waste: I have owned this car for a year... " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "keywords_review = pd.concat ([df_analyzed_review['keywords'] , pd.Series([nlu_response_review['analyzed_text']]*len(df_analyzed_review['keywords']))], axis = 1)\n", "keywords_review" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's merge the above dataframe with its corresponding review's information:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>text</th><th>sentiment.label</th><th>sentiment.score</th><th>relevance</th><th>emotion.sadness</th><th>emotion.joy</th><th>emotion.fear</th><th>emotion.disgust</th><th>emotion.anger</th><th>count</th><th>Review_Date</th><th>Author_Name</th><th>Vehicle_Title</th><th>Review_Title</th><th>Review</th><th>Rating\\r</th><th>Car_Make</th><th>Review_Content</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>waste</td><td>negative</td><td>-0.875215</td><td>0.685741</td><td>0.192383</td><td>0.024961</td><td>0.313145</td><td>0.083320</td><td>0.277825</td><td>1</td><td>on 06/15/02 00:00 AM (PDT)</td><td>mike6382</td><td>2000 AM General Hummer SUV Hard Top 4dr SUV AWD</td><td>What a waste</td><td>I have owned this car for a year and a \\rhalf...</td><td>1.0</td><td>AMGeneral</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"    <tr><th>1</th><td>fire</td><td>negative</td><td>-0.934513</td><td>0.598326</td><td>0.360925</td><td>0.002355</td><td>0.266490</td><td>0.069938</td><td>0.442759</td><td>1</td><td>on 06/15/02 00:00 AM (PDT)</td><td>mike6382</td><td>2000 AM General Hummer SUV Hard Top 4dr SUV AWD</td><td>What a waste</td><td>I have owned this car for a year and a \\rhalf...</td><td>1.0</td><td>AMGeneral</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"    <tr><th>2</th><td>car</td><td>negative</td><td>-0.844774</td><td>0.581432</td><td>0.144346</td><td>0.150177</td><td>0.246102</td><td>0.061760</td><td>0.203999</td><td>2</td><td>on 06/15/02 00:00 AM (PDT)</td><td>mike6382</td><td>2000 AM General Hummer SUV Hard Top 4dr SUV AWD</td><td>What a waste</td><td>I have owned this car for a year and a \\rhalf...</td><td>1.0</td><td>AMGeneral</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"    <tr><th>3</th><td>hell</td><td>negative</td><td>-0.934513</td><td>0.577011</td><td>0.360925</td><td>0.002355</td><td>0.266490</td><td>0.069938</td><td>0.442759</td><td>1</td><td>on 06/15/02 00:00 AM (PDT)</td><td>mike6382</td><td>2000 AM General Hummer SUV Hard Top 4dr SUV AWD</td><td>What a waste</td><td>I have owned this car for a year and a \\rhalf...</td><td>1.0</td><td>AMGeneral</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"    <tr><th>4</th><td>year</td><td>negative</td><td>-0.875215</td><td>0.563676</td><td>0.192383</td><td>0.024961</td><td>0.313145</td><td>0.083320</td><td>0.277825</td><td>1</td><td>on 06/15/02 00:00 AM (PDT)</td><td>mike6382</td><td>2000 AM General Hummer SUV Hard Top 4dr SUV AWD</td><td>What a waste</td><td>I have owned this car for a year and a \\rhalf...</td><td>1.0</td><td>AMGeneral</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"    <tr><th>5</th><td>time</td><td>neutral</td><td>0.000000</td><td>0.466983</td><td>0.266573</td><td>0.401314</td><td>0.089080</td><td>0.024027</td><td>0.065767</td><td>1</td><td>on 06/15/02 00:00 AM (PDT)</td><td>mike6382</td><td>2000 AM General Hummer SUV Hard Top 4dr SUV AWD</td><td>What a waste</td><td>I have owned this car for a year and a \\rhalf...</td><td>1.0</td><td>AMGeneral</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>\n", "
" ], "text/plain": [ " text sentiment.label sentiment.score relevance emotion.sadness \\\n", "0 waste negative -0.875215 0.685741 0.192383 \n", "1 fire negative -0.934513 0.598326 0.360925 \n", "2 car negative -0.844774 0.581432 0.144346 \n", "3 hell negative -0.934513 0.577011 0.360925 \n", "4 year negative -0.875215 0.563676 0.192383 \n", "5 time neutral 0.000000 0.466983 0.266573 \n", "\n", " emotion.joy emotion.fear emotion.disgust emotion.anger count \\\n", "0 0.024961 0.313145 0.083320 0.277825 1 \n", "1 0.002355 0.266490 0.069938 0.442759 1 \n", "2 0.150177 0.246102 0.061760 0.203999 2 \n", "3 0.002355 0.266490 0.069938 0.442759 1 \n", "4 0.024961 0.313145 0.083320 0.277825 1 \n", "5 0.401314 0.089080 0.024027 0.065767 1 \n", "\n", " Review_Date Author_Name \\\n", "0 on 06/15/02 00:00 AM (PDT) mike6382 \n", "1 on 06/15/02 00:00 AM (PDT) mike6382 \n", "2 on 06/15/02 00:00 AM (PDT) mike6382 \n", "3 on 06/15/02 00:00 AM (PDT) mike6382 \n", "4 on 06/15/02 00:00 AM (PDT) mike6382 \n", "5 on 06/15/02 00:00 AM (PDT) mike6382 \n", "\n", " Vehicle_Title Review_Title \\\n", "0 2000 AM General Hummer SUV Hard Top 4dr SUV AWD What a waste \n", "1 2000 AM General Hummer SUV Hard Top 4dr SUV AWD What a waste \n", "2 2000 AM General Hummer SUV Hard Top 4dr SUV AWD What a waste \n", "3 2000 AM General Hummer SUV Hard Top 4dr SUV AWD What a waste \n", "4 2000 AM General Hummer SUV Hard Top 4dr SUV AWD What a waste \n", "5 2000 AM General Hummer SUV Hard Top 4dr SUV AWD What a waste \n", "\n", " Review Rating\\r Car_Make \\\n", "0 I have owned this car for a year and a \\rhalf... 1.0 AMGeneral \n", "1 I have owned this car for a year and a \\rhalf... 1.0 AMGeneral \n", "2 I have owned this car for a year and a \\rhalf... 1.0 AMGeneral \n", "3 I have owned this car for a year and a \\rhalf... 1.0 AMGeneral \n", "4 I have owned this car for a year and a \\rhalf... 1.0 AMGeneral \n", "5 I have owned this car for a year and a \\rhalf... 
1.0 AMGeneral \n", "\n", " Review_Content \n", "0 What a waste: I have owned this car for a year... \n", "1 What a waste: I have owned this car for a year... \n", "2 What a waste: I have owned this car for a year... \n", "3 What a waste: I have owned this car for a year... \n", "4 What a waste: I have owned this car for a year... \n", "5 What a waste: I have owned this car for a year... " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(keywords_review.merge(sampled_df, left_on=0, right_on = sampled_df.Review_Content)).drop(columns=[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Repeat the Preprocessing over Multiple Reviews\n", "Let's see how we can apply the same operations to multiple entries from our car reviews dataset and use the results for correlation and sentiment analysis:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "def analyze_with_retry(text: str) -> Any:\n", "    \"\"\"\n", "    Call Watson NLU on `text`, retrying up to five times to compensate for\n", "    the occasional \"service unavailable due to rate-limiting\" error message.\n", "    \"\"\"\n", "    num_retries_left = 5\n", "    last_exception = None\n", "    while num_retries_left > 0:\n", "        num_retries_left -= 1\n", "        try:\n", "            return natural_language_understanding.analyze(\n", "                text=text,\n", "                language=\"en\",\n", "                return_analyzed_text=True,\n", "                features=nlu.Features(\n", "                    keywords=nlu.KeywordsOptions(sentiment=True, emotion=True))\n", "            ).get_result()\n", "        # Catch Exception rather than BaseException so that Ctrl-C still stops the loop\n", "        except Exception as e:\n", "            last_exception = e\n", "            # Back off briefly before the next attempt\n", "            time.sleep(0.2)\n", "    # All attempts failed; surface the most recent error\n", "    raise last_exception\n", "\n", "\n", "warnings.filterwarnings('ignore')\n", "nlu_response_reviews = sampled_df['Review_Content'].dropna().apply(analyze_with_retry)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "tp_parsed_reviews = [tp.io.watson.nlu.parse_response(r) for r in nlu_response_reviews]" ] }, { "cell_type": "markdown", "metadata": 
{}, "source": [ "That's it. With the DataFrame version of this data, we can perform our exploratory and sentiment analysis tasks easily with a few lines of code.\n", "\n", "Specifically, we use Pandas to concatenate the Watson NLU sentiment DataFrame (the output of Text Extensions for Pandas) with its corresponding review, and then we conduct some exploratory analysis on the data." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>text</th><th>sentiment.label</th><th>sentiment.score</th><th>relevance</th><th>emotion.sadness</th><th>emotion.joy</th><th>emotion.fear</th><th>emotion.disgust</th><th>emotion.anger</th><th>count</th><th>0</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>waste</td><td>negative</td><td>-0.875215</td><td>0.685741</td><td>0.192383</td><td>0.024961</td><td>0.313145</td><td>0.083320</td><td>0.277825</td><td>1.0</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"    <tr><th>1</th><td>fire</td><td>negative</td><td>-0.934513</td><td>0.598326</td><td>0.360925</td><td>0.002355</td><td>0.266490</td><td>0.069938</td><td>0.442759</td><td>1.0</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"    <tr><th>2</th><td>car</td><td>negative</td><td>-0.844774</td><td>0.581432</td><td>0.144346</td><td>0.150177</td><td>0.246102</td><td>0.061760</td><td>0.203999</td><td>2.0</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"    <tr><th>3</th><td>hell</td><td>negative</td><td>-0.934513</td><td>0.577011</td><td>0.360925</td><td>0.002355</td><td>0.266490</td><td>0.069938</td><td>0.442759</td><td>1.0</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"    <tr><th>4</th><td>year</td><td>negative</td><td>-0.875215</td><td>0.563676</td><td>0.192383</td><td>0.024961</td><td>0.313145</td><td>0.083320</td><td>0.277825</td><td>1.0</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"    <tr><th>5</th><td>time</td><td>neutral</td><td>0.000000</td><td>0.466983</td><td>0.266573</td><td>0.401314</td><td>0.089080</td><td>0.024027</td><td>0.065767</td><td>1.0</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"    <tr><th>0</th><td>Top speed</td><td>negative</td><td>-0.537564</td><td>0.881037</td><td>0.509224</td><td>0.199172</td><td>0.038777</td><td>0.065161</td><td>0.044472</td><td>1.0</td><td>HUMMER NOT A bummer : Vehicle is a beast. I do...</td></tr>\n",
"    <tr><th>1</th><td>OK cause</td><td>positive</td><td>0.647515</td><td>0.786985</td><td>0.063022</td><td>0.432975</td><td>0.107965</td><td>0.016918</td><td>0.090944</td><td>1.0</td><td>HUMMER NOT A bummer : Vehicle is a beast. I do...</td></tr>\n",
"    <tr><th>2</th><td>HUMMER H</td><td>neutral</td><td>0.000000</td><td>0.639671</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>1.0</td><td>HUMMER NOT A bummer : Vehicle is a beast. I do...</td></tr>\n",
"    <tr><th>3</th><td>seat cushion</td><td>negative</td><td>-0.537564</td><td>0.593566</td><td>0.509224</td><td>0.199172</td><td>0.038777</td><td>0.065161</td><td>0.044472</td><td>1.0</td><td>HUMMER NOT A bummer : Vehicle is a beast. I do...</td></tr>\n",
"    <tr><th>4</th><td>HUMMER</td><td>negative</td><td>-0.913874</td><td>0.582162</td><td>0.172604</td><td>0.221188</td><td>0.146469</td><td>0.022002</td><td>0.029588</td><td>1.0</td><td>HUMMER NOT A bummer : Vehicle is a beast. I do...</td></tr>\n",
"    <tr><th>5</th><td>speed</td><td>positive</td><td>0.305110</td><td>0.548092</td><td>0.286123</td><td>0.316074</td><td>0.073371</td><td>0.041039</td><td>0.067708</td><td>1.0</td><td>HUMMER NOT A bummer : Vehicle is a beast. I do...</td></tr>\n",
"    <tr><th>6</th><td>thing</td><td>negative</td><td>-0.949193</td><td>0.534867</td><td>0.645165</td><td>0.010028</td><td>0.216581</td><td>0.022772</td><td>0.055427</td><td>1.0</td><td>HUMMER NOT A bummer : Vehicle is a beast. I do...</td></tr>\n",
"    <tr><th>7</th><td>Vehicle</td><td>negative</td><td>-0.961235</td><td>0.531410</td><td>0.180207</td><td>0.061332</td><td>0.192902</td><td>0.008274</td><td>0.046232</td><td>1.0</td><td>HUMMER NOT A bummer : Vehicle is a beast. I do...</td></tr>\n",
"    <tr><th>8</th><td>beast</td><td>negative</td><td>-0.961235</td><td>0.524488</td><td>0.180207</td><td>0.061332</td><td>0.192902</td><td>0.008274</td><td>0.046232</td><td>1.0</td><td>HUMMER NOT A bummer : Vehicle is a beast. I do...</td></tr>\n",
"    <tr><th>9</th><td>thats</td><td>positive</td><td>0.647515</td><td>0.462435</td><td>0.063022</td><td>0.432975</td><td>0.107965</td><td>0.016918</td><td>0.090944</td><td>1.0</td><td>HUMMER NOT A bummer : Vehicle is a beast. I do...</td></tr>\n",
"    <tr><th>10</th><td>bummer</td><td>negative</td><td>-0.961235</td><td>0.360189</td><td>0.180207</td><td>0.061332</td><td>0.192902</td><td>0.008274</td><td>0.046232</td><td>1.0</td><td>HUMMER NOT A bummer : Vehicle is a beast. I do...</td></tr>\n",
"    <tr><th>11</th><td>average</td><td>negative</td><td>-0.857270</td><td>0.341687</td><td>0.165002</td><td>0.381043</td><td>0.100036</td><td>0.035730</td><td>0.012944</td><td>1.0</td><td>HUMMER NOT A bummer : Vehicle is a beast. I do...</td></tr>\n",
"    <tr><th>0</th><td>AWESOME HUMMER</td><td>positive</td><td>0.734682</td><td>0.833177</td><td>0.032499</td><td>0.493942</td><td>0.116809</td><td>0.009257</td><td>0.024046</td><td>1.0</td><td>AWESOME HUMMER: Hummer is unstoppable. May onl...</td></tr>\n",
"    <tr><th>1</th><td>mph</td><td>neutral</td><td>0.000000</td><td>0.635404</td><td>0.499977</td><td>0.151388</td><td>0.039640</td><td>0.036049</td><td>0.064654</td><td>1.0</td><td>AWESOME HUMMER: Hummer is unstoppable. May onl...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>\n", "
" ], "text/plain": [ " text sentiment.label sentiment.score relevance \\\n", "0 waste negative -0.875215 0.685741 \n", "1 fire negative -0.934513 0.598326 \n", "2 car negative -0.844774 0.581432 \n", "3 hell negative -0.934513 0.577011 \n", "4 year negative -0.875215 0.563676 \n", "5 time neutral 0.000000 0.466983 \n", "0 Top speed negative -0.537564 0.881037 \n", "1 OK cause positive 0.647515 0.786985 \n", "2 HUMMER H neutral 0.000000 0.639671 \n", "3 seat cushion negative -0.537564 0.593566 \n", "4 HUMMER negative -0.913874 0.582162 \n", "5 speed positive 0.305110 0.548092 \n", "6 thing negative -0.949193 0.534867 \n", "7 Vehicle negative -0.961235 0.531410 \n", "8 beast negative -0.961235 0.524488 \n", "9 thats positive 0.647515 0.462435 \n", "10 bummer negative -0.961235 0.360189 \n", "11 average negative -0.857270 0.341687 \n", "0 AWESOME HUMMER positive 0.734682 0.833177 \n", "1 mph neutral 0.000000 0.635404 \n", "\n", " emotion.sadness emotion.joy emotion.fear emotion.disgust \\\n", "0 0.192383 0.024961 0.313145 0.083320 \n", "1 0.360925 0.002355 0.266490 0.069938 \n", "2 0.144346 0.150177 0.246102 0.061760 \n", "3 0.360925 0.002355 0.266490 0.069938 \n", "4 0.192383 0.024961 0.313145 0.083320 \n", "5 0.266573 0.401314 0.089080 0.024027 \n", "0 0.509224 0.199172 0.038777 0.065161 \n", "1 0.063022 0.432975 0.107965 0.016918 \n", "2 NaN NaN NaN NaN \n", "3 0.509224 0.199172 0.038777 0.065161 \n", "4 0.172604 0.221188 0.146469 0.022002 \n", "5 0.286123 0.316074 0.073371 0.041039 \n", "6 0.645165 0.010028 0.216581 0.022772 \n", "7 0.180207 0.061332 0.192902 0.008274 \n", "8 0.180207 0.061332 0.192902 0.008274 \n", "9 0.063022 0.432975 0.107965 0.016918 \n", "10 0.180207 0.061332 0.192902 0.008274 \n", "11 0.165002 0.381043 0.100036 0.035730 \n", "0 0.032499 0.493942 0.116809 0.009257 \n", "1 0.499977 0.151388 0.039640 0.036049 \n", "\n", " emotion.anger count 0 \n", "0 0.277825 1.0 What a waste: I have owned this car for a year... 
\n", "1 0.442759 1.0 What a waste: I have owned this car for a year... \n", "2 0.203999 2.0 What a waste: I have owned this car for a year... \n", "3 0.442759 1.0 What a waste: I have owned this car for a year... \n", "4 0.277825 1.0 What a waste: I have owned this car for a year... \n", "5 0.065767 1.0 What a waste: I have owned this car for a year... \n", "0 0.044472 1.0 HUMMER NOT A bummer : Vehicle is a beast. I do... \n", "1 0.090944 1.0 HUMMER NOT A bummer : Vehicle is a beast. I do... \n", "2 NaN 1.0 HUMMER NOT A bummer : Vehicle is a beast. I do... \n", "3 0.044472 1.0 HUMMER NOT A bummer : Vehicle is a beast. I do... \n", "4 0.029588 1.0 HUMMER NOT A bummer : Vehicle is a beast. I do... \n", "5 0.067708 1.0 HUMMER NOT A bummer : Vehicle is a beast. I do... \n", "6 0.055427 1.0 HUMMER NOT A bummer : Vehicle is a beast. I do... \n", "7 0.046232 1.0 HUMMER NOT A bummer : Vehicle is a beast. I do... \n", "8 0.046232 1.0 HUMMER NOT A bummer : Vehicle is a beast. I do... \n", "9 0.090944 1.0 HUMMER NOT A bummer : Vehicle is a beast. I do... \n", "10 0.046232 1.0 HUMMER NOT A bummer : Vehicle is a beast. I do... \n", "11 0.012944 1.0 HUMMER NOT A bummer : Vehicle is a beast. I do... \n", "0 0.024046 1.0 AWESOME HUMMER: Hummer is unstoppable. May onl... \n", "1 0.064654 1.0 AWESOME HUMMER: Hummer is unstoppable. May onl... 
" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Concatenation\n", "keywords_review = [pd.concat ([parsed_review['keywords'] , pd.Series([r['analyzed_text']]*len(parsed_review['keywords']))], axis = 1) for (parsed_review,r) in zip(tp_parsed_reviews,pd.Series(nlu_response_reviews))]\n", "# Convert list of dataframes to the dataframe\n", "keywords_review_df = pd.concat(keywords_review, axis = 0)\n", "keywords_review_df.head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Merging each review in the resulted dataframe with its Title, Author, Rating, and other info as below and then grouping based on the Review_Title:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>text</th><th>sentiment.label</th><th>sentiment.score</th><th>relevance</th><th>emotion.sadness</th><th>emotion.joy</th><th>emotion.fear</th><th>emotion.disgust</th><th>emotion.anger</th><th>count</th><th>Review_Date</th><th>Author_Name</th><th>Vehicle_Title</th><th>Review_Title</th><th>Review</th><th>Rating\\r</th><th>Car_Make</th><th>Review_Content</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>waste</td><td>negative</td><td>-0.875215</td><td>0.685741</td><td>0.192383</td><td>0.024961</td><td>0.313145</td><td>0.083320</td><td>0.277825</td><td>1.0</td><td>on 06/15/02 00:00 AM (PDT)</td><td>mike6382</td><td>2000 AM General Hummer SUV Hard Top 4dr SUV AWD</td><td>What a waste</td><td>I have owned this car for a year and a \\rhalf...</td><td>1.0</td><td>AMGeneral</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"    <tr><th>1</th><td>fire</td><td>negative</td><td>-0.934513</td><td>0.598326</td><td>0.360925</td><td>0.002355</td><td>0.266490</td><td>0.069938</td><td>0.442759</td><td>1.0</td><td>on 06/15/02 00:00 AM (PDT)</td><td>mike6382</td><td>2000 AM General Hummer SUV Hard Top 4dr SUV AWD</td><td>What a waste</td><td>I have owned this car for a year and a \\rhalf...</td><td>1.0</td><td>AMGeneral</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"    <tr><th>2</th><td>car</td><td>negative</td><td>-0.844774</td><td>0.581432</td><td>0.144346</td><td>0.150177</td><td>0.246102</td><td>0.061760</td><td>0.203999</td><td>2.0</td><td>on 06/15/02 00:00 AM (PDT)</td><td>mike6382</td><td>2000 AM General Hummer SUV Hard Top 4dr SUV AWD</td><td>What a waste</td><td>I have owned this car for a year and a \\rhalf...</td><td>1.0</td><td>AMGeneral</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"    <tr><th>3</th><td>hell</td><td>negative</td><td>-0.934513</td><td>0.577011</td><td>0.360925</td><td>0.002355</td><td>0.266490</td><td>0.069938</td><td>0.442759</td><td>1.0</td><td>on 06/15/02 00:00 AM (PDT)</td><td>mike6382</td><td>2000 AM General Hummer SUV Hard Top 4dr SUV AWD</td><td>What a waste</td><td>I have owned this car for a year and a \\rhalf...</td><td>1.0</td><td>AMGeneral</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"    <tr><th>4</th><td>year</td><td>negative</td><td>-0.875215</td><td>0.563676</td><td>0.192383</td><td>0.024961</td><td>0.313145</td><td>0.083320</td><td>0.277825</td><td>1.0</td><td>on 06/15/02 00:00 AM (PDT)</td><td>mike6382</td><td>2000 AM General Hummer SUV Hard Top 4dr SUV AWD</td><td>What a waste</td><td>I have owned this car for a year and a \\rhalf...</td><td>1.0</td><td>AMGeneral</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"    <tr><th>5</th><td>time</td><td>neutral</td><td>0.000000</td><td>0.466983</td><td>0.266573</td><td>0.401314</td><td>0.089080</td><td>0.024027</td><td>0.065767</td><td>1.0</td><td>on 06/15/02 00:00 AM (PDT)</td><td>mike6382</td><td>2000 AM General Hummer SUV Hard Top 4dr SUV AWD</td><td>What a waste</td><td>I have owned this car for a year and a \\rhalf...</td><td>1.0</td><td>AMGeneral</td><td>What a waste: I have owned this car for a year...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>\n", "
" ], "text/plain": [ " text sentiment.label sentiment.score relevance emotion.sadness \\\n", "0 waste negative -0.875215 0.685741 0.192383 \n", "1 fire negative -0.934513 0.598326 0.360925 \n", "2 car negative -0.844774 0.581432 0.144346 \n", "3 hell negative -0.934513 0.577011 0.360925 \n", "4 year negative -0.875215 0.563676 0.192383 \n", "5 time neutral 0.000000 0.466983 0.266573 \n", "\n", " emotion.joy emotion.fear emotion.disgust emotion.anger count \\\n", "0 0.024961 0.313145 0.083320 0.277825 1.0 \n", "1 0.002355 0.266490 0.069938 0.442759 1.0 \n", "2 0.150177 0.246102 0.061760 0.203999 2.0 \n", "3 0.002355 0.266490 0.069938 0.442759 1.0 \n", "4 0.024961 0.313145 0.083320 0.277825 1.0 \n", "5 0.401314 0.089080 0.024027 0.065767 1.0 \n", "\n", " Review_Date Author_Name \\\n", "0 on 06/15/02 00:00 AM (PDT) mike6382 \n", "1 on 06/15/02 00:00 AM (PDT) mike6382 \n", "2 on 06/15/02 00:00 AM (PDT) mike6382 \n", "3 on 06/15/02 00:00 AM (PDT) mike6382 \n", "4 on 06/15/02 00:00 AM (PDT) mike6382 \n", "5 on 06/15/02 00:00 AM (PDT) mike6382 \n", "\n", " Vehicle_Title Review_Title \\\n", "0 2000 AM General Hummer SUV Hard Top 4dr SUV AWD What a waste \n", "1 2000 AM General Hummer SUV Hard Top 4dr SUV AWD What a waste \n", "2 2000 AM General Hummer SUV Hard Top 4dr SUV AWD What a waste \n", "3 2000 AM General Hummer SUV Hard Top 4dr SUV AWD What a waste \n", "4 2000 AM General Hummer SUV Hard Top 4dr SUV AWD What a waste \n", "5 2000 AM General Hummer SUV Hard Top 4dr SUV AWD What a waste \n", "\n", " Review Rating\\r Car_Make \\\n", "0 I have owned this car for a year and a \\rhalf... 1.0 AMGeneral \n", "1 I have owned this car for a year and a \\rhalf... 1.0 AMGeneral \n", "2 I have owned this car for a year and a \\rhalf... 1.0 AMGeneral \n", "3 I have owned this car for a year and a \\rhalf... 1.0 AMGeneral \n", "4 I have owned this car for a year and a \\rhalf... 1.0 AMGeneral \n", "5 I have owned this car for a year and a \\rhalf... 
1.0 AMGeneral \n", "\n", " Review_Content \n", "0 What a waste: I have owned this car for a year... \n", "1 What a waste: I have owned this car for a year... \n", "2 What a waste: I have owned this car for a year... \n", "3 What a waste: I have owned this car for a year... \n", "4 What a waste: I have owned this car for a year... \n", "5 What a waste: I have owned this car for a year... " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import display, HTML\n", "display(HTML(\"\"))\n", "merged_keywords_review_df = (keywords_review_df.merge(sampled_df, left_on=0, right_on = sampled_df.Review_Content)).drop(columns=[0])\n", "grouped_merged_keywords_review_df = merged_keywords_review_df.groupby('Review_Title')\n", "grouped_merged_keywords_review_df.get_group('What a waste').head(30)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As mentioned above, Watson NLU assigns a sentiment to each keyword based on its context within the sentence, so all keywords within one sentence get the same sentiment score. To get the aggregated sentiment of each review, we therefore calculate the mean sentiment score of its sentences, counting the sentiment assigned to one keyword per sentence. More specifically, we first drop duplicate sentiment scores within each review and then average the sentiment and emotion scores for each review:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
emotion.sadnessemotion.joyemotion.fearemotion.disgustemotion.angersentiment.scoreRating\\r
Review_Title
1 sweet R320.1515430.5321620.0678590.0185010.1129940.6498254.875
2002 Trans Am/Sunset Orange Metallic0.1763220.4652100.2570640.0328420.0389080.1480354.625
42 days of driving 8 days in the shop0.2064780.5634660.1145060.0100820.082325-0.0541263.375
A great little car0.2785750.4705860.0638230.0152180.0396880.5037854.875
AWESOME FUN MY LITTLE TIGER0.0076290.6283120.0130150.0014520.0247820.9860295.000
I LOVE my Focus0.0740190.5891960.1117220.0081240.0660920.6219834.750
Looks Good But Hunk Of Junk0.1446710.0613580.0606130.0504940.116835-0.9846222.875
Mr TACOMA0.1227660.8256530.0347770.0231240.0303440.6338035.000
Veracruz0.1069810.5243710.0914820.0123440.0544930.5918164.750
You will pay for that warranty0.3963060.1104580.0569800.0211920.119030-0.3735832.750
everyday rSx0.0384860.5158520.1334190.0080350.0339980.6772864.000
got new weel0.1085070.3483900.0791940.0346430.2391770.6540344.625
i'm on my second one0.0631240.0248400.0539510.0264020.165089-0.9734465.000
! un happy Camper0.4249260.2195060.0666270.0365780.084982-0.3881822.625
\"\"\"I can't believe it \"\"0.2446710.0258680.0531330.0675970.165597-0.9040221.000
\"06\" GTO0.1020050.6324000.1137960.0557680.0673480.7599985.000
\"Acceleration failure\" - Genesis phraseology0.1398950.1437800.2594970.0493440.086211-0.6326393.000
\"Cry wolf\" tire light and redundant warning screen0.3046940.1657820.1287620.0439230.172683-0.7442373.000
\"Downgraded\" to an LS 430 but best upgrade ever!0.3817720.4018420.0281010.0180070.0260350.4809245.000
\"First Ride\" Impressions when I visited Tesla's Factory0.2318700.4431380.0362620.0236400.0514120.6106194.875
\n", "
" ], "text/plain": [ " emotion.sadness \\\n", "Review_Title \n", " 1 sweet R32 0.151543 \n", " 2002 Trans Am/Sunset Orange Metallic 0.176322 \n", " 42 days of driving 8 days in the shop 0.206478 \n", " A great little car 0.278575 \n", " AWESOME FUN MY LITTLE TIGER 0.007629 \n", " I LOVE my Focus 0.074019 \n", " Looks Good But Hunk Of Junk 0.144671 \n", " Mr TACOMA 0.122766 \n", " Veracruz 0.106981 \n", " You will pay for that warranty 0.396306 \n", " everyday rSx 0.038486 \n", " got new weel 0.108507 \n", " i'm on my second one 0.063124 \n", "! un happy Camper 0.424926 \n", "\"\"\"I can't believe it \"\" 0.244671 \n", "\"06\" GTO 0.102005 \n", "\"Acceleration failure\" - Genesis phraseology 0.139895 \n", "\"Cry wolf\" tire light and redundant warning screen 0.304694 \n", "\"Downgraded\" to an LS 430 but best upgrade ever! 0.381772 \n", "\"First Ride\" Impressions when I visited Tesla's... 0.231870 \n", "\n", " emotion.joy emotion.fear \\\n", "Review_Title \n", " 1 sweet R32 0.532162 0.067859 \n", " 2002 Trans Am/Sunset Orange Metallic 0.465210 0.257064 \n", " 42 days of driving 8 days in the shop 0.563466 0.114506 \n", " A great little car 0.470586 0.063823 \n", " AWESOME FUN MY LITTLE TIGER 0.628312 0.013015 \n", " I LOVE my Focus 0.589196 0.111722 \n", " Looks Good But Hunk Of Junk 0.061358 0.060613 \n", " Mr TACOMA 0.825653 0.034777 \n", " Veracruz 0.524371 0.091482 \n", " You will pay for that warranty 0.110458 0.056980 \n", " everyday rSx 0.515852 0.133419 \n", " got new weel 0.348390 0.079194 \n", " i'm on my second one 0.024840 0.053951 \n", "! un happy Camper 0.219506 0.066627 \n", "\"\"\"I can't believe it \"\" 0.025868 0.053133 \n", "\"06\" GTO 0.632400 0.113796 \n", "\"Acceleration failure\" - Genesis phraseology 0.143780 0.259497 \n", "\"Cry wolf\" tire light and redundant warning screen 0.165782 0.128762 \n", "\"Downgraded\" to an LS 430 but best upgrade ever! 0.401842 0.028101 \n", "\"First Ride\" Impressions when I visited Tesla's... 
0.443138 0.036262 \n", "\n", " emotion.disgust \\\n", "Review_Title \n", " 1 sweet R32 0.018501 \n", " 2002 Trans Am/Sunset Orange Metallic 0.032842 \n", " 42 days of driving 8 days in the shop 0.010082 \n", " A great little car 0.015218 \n", " AWESOME FUN MY LITTLE TIGER 0.001452 \n", " I LOVE my Focus 0.008124 \n", " Looks Good But Hunk Of Junk 0.050494 \n", " Mr TACOMA 0.023124 \n", " Veracruz 0.012344 \n", " You will pay for that warranty 0.021192 \n", " everyday rSx 0.008035 \n", " got new weel 0.034643 \n", " i'm on my second one 0.026402 \n", "! un happy Camper 0.036578 \n", "\"\"\"I can't believe it \"\" 0.067597 \n", "\"06\" GTO 0.055768 \n", "\"Acceleration failure\" - Genesis phraseology 0.049344 \n", "\"Cry wolf\" tire light and redundant warning screen 0.043923 \n", "\"Downgraded\" to an LS 430 but best upgrade ever! 0.018007 \n", "\"First Ride\" Impressions when I visited Tesla's... 0.023640 \n", "\n", " emotion.anger \\\n", "Review_Title \n", " 1 sweet R32 0.112994 \n", " 2002 Trans Am/Sunset Orange Metallic 0.038908 \n", " 42 days of driving 8 days in the shop 0.082325 \n", " A great little car 0.039688 \n", " AWESOME FUN MY LITTLE TIGER 0.024782 \n", " I LOVE my Focus 0.066092 \n", " Looks Good But Hunk Of Junk 0.116835 \n", " Mr TACOMA 0.030344 \n", " Veracruz 0.054493 \n", " You will pay for that warranty 0.119030 \n", " everyday rSx 0.033998 \n", " got new weel 0.239177 \n", " i'm on my second one 0.165089 \n", "! un happy Camper 0.084982 \n", "\"\"\"I can't believe it \"\" 0.165597 \n", "\"06\" GTO 0.067348 \n", "\"Acceleration failure\" - Genesis phraseology 0.086211 \n", "\"Cry wolf\" tire light and redundant warning screen 0.172683 \n", "\"Downgraded\" to an LS 430 but best upgrade ever! 0.026035 \n", "\"First Ride\" Impressions when I visited Tesla's... 
0.051412 \n", "\n", " sentiment.score Rating\\r \n", "Review_Title \n", " 1 sweet R32 0.649825 4.875 \n", " 2002 Trans Am/Sunset Orange Metallic 0.148035 4.625 \n", " 42 days of driving 8 days in the shop -0.054126 3.375 \n", " A great little car 0.503785 4.875 \n", " AWESOME FUN MY LITTLE TIGER 0.986029 5.000 \n", " I LOVE my Focus 0.621983 4.750 \n", " Looks Good But Hunk Of Junk -0.984622 2.875 \n", " Mr TACOMA 0.633803 5.000 \n", " Veracruz 0.591816 4.750 \n", " You will pay for that warranty -0.373583 2.750 \n", " everyday rSx 0.677286 4.000 \n", " got new weel 0.654034 4.625 \n", " i'm on my second one -0.973446 5.000 \n", "! un happy Camper -0.388182 2.625 \n", "\"\"\"I can't believe it \"\" -0.904022 1.000 \n", "\"06\" GTO 0.759998 5.000 \n", "\"Acceleration failure\" - Genesis phraseology -0.632639 3.000 \n", "\"Cry wolf\" tire light and redundant warning screen -0.744237 3.000 \n", "\"Downgraded\" to an LS 430 but best upgrade ever! 0.480924 5.000 \n", "\"First Ride\" Impressions when I visited Tesla's... 
0.610619 4.875 " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentiment_cols = [str(c) for c in merged_keywords_review_df.columns\n", " if c.startswith('emotion.')] + ['sentiment.score']\n", "agg_merged_keywords_review_df = (\n", " merged_keywords_review_df[sentiment_cols + ['Review_Title', 'Rating\\r']]\n", " .drop_duplicates(['Review_Title','sentiment.score'])\n", " .groupby('Review_Title')\n", " .mean())\n", "agg_merged_keywords_review_df.head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can find the correlation among the variables using pearson method:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 emotion.sadnessemotion.joyemotion.fearemotion.disgustemotion.angersentiment.scoreRating\r", "
emotion.sadness1.000000-0.6354960.0990180.0620680.158305-0.519823-0.353612
emotion.joy-0.6354961.000000-0.391679-0.226616-0.4847400.7612110.518204
emotion.fear0.099018-0.3916791.0000000.0778240.149845-0.321152-0.187657
emotion.disgust0.062068-0.2266160.0778241.0000000.136611-0.213541-0.155754
emotion.anger0.158305-0.4847400.1498450.1366111.000000-0.440811-0.352083
sentiment.score-0.5198230.761211-0.321152-0.213541-0.4408111.0000000.620320
Rating\r", "-0.3536120.518204-0.187657-0.155754-0.3520830.6203201.000000
\n" ], "text/plain": [ "" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corr = agg_merged_keywords_review_df.corr(method ='pearson')\n", "corr.style.background_gradient(cmap='coolwarm')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the table above shows, the review ratings are positively correlated with the Watson NLU sentiment score (0.62) and the joy emotion (0.52), and negatively correlated with the sadness and anger emotions (about -0.35 each). The results also show a strong positive correlation between the Watson NLU sentiment score and the joy emotion (0.76). In contrast, as expected, the sadness emotion is negatively correlated with the sentiment score (-0.52)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Univariate Linear Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's perform the regression. To do that, we first need to determine the input features. Since the sentiment.score field shows a relatively high correlation with the rating, let's try a regression based on just that value:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "scrolled": true }, "outputs": [], "source": [ "X = agg_merged_keywords_review_df.dropna()['sentiment.score'].values.reshape(-1, 1) # .values converts the column to a NumPy array\n", "Y = agg_merged_keywords_review_df.dropna()['Rating\\r'].values.reshape(-1, 1) # reshape(-1, 1): infer the number of rows, keep a single column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's split the data into training and testing sets:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state=9)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now need to create an instance of the LinearRegression model from Scikit-Learn:" ] }, { "cell_type": "code", "execution_count": 
29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
LinearRegression()
" ], "text/plain": [ "LinearRegression()" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "linear_regressor = LinearRegression() # create object for the class\n", "linear_regressor.fit(X_train, Y_train) # fit the model on the training data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that the model has been fit we can make predictions by calling the predict command. We are making predictions on the testing set:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "Y_pred = linear_regressor.predict(X_test) # make predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll now check the predictions against the actual values by using the mean squared error (MSE) and R-2 metrics, two metrics commonly used to evaluate regression tasks:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean Squared Error = 0.6260140152618712\n", "R-Squared = 0.37732446794693686\n" ] } ], "source": [ "test_set_mse = mean_squared_error(Y_test, Y_pred)\n", "print(f\"Mean Squared Error = {test_set_mse}\")\n", "test_set_r2 = r2_score(Y_test, Y_pred)\n", "print(f\"R-Squared = {test_set_r2}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Multivariate Linear Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's try adding the fine-grained sentiment scores from Watson to our model and see if the coefficient of determination (r^2) goes up" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's determine the input features:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
emotion.sadnessemotion.joyemotion.fearemotion.disgustemotion.angersentiment.score
Review_Title
1 sweet R320.1515430.5321620.0678590.0185010.1129940.649825
2002 Trans Am/Sunset Orange Metallic0.1763220.4652100.2570640.0328420.0389080.148035
42 days of driving 8 days in the shop0.2064780.5634660.1145060.0100820.082325-0.054126
A great little car0.2785750.4705860.0638230.0152180.0396880.503785
AWESOME FUN MY LITTLE TIGER0.0076290.6283120.0130150.0014520.0247820.986029
\n", "
" ], "text/plain": [ " emotion.sadness emotion.joy \\\n", "Review_Title \n", " 1 sweet R32 0.151543 0.532162 \n", " 2002 Trans Am/Sunset Orange Metallic 0.176322 0.465210 \n", " 42 days of driving 8 days in the shop 0.206478 0.563466 \n", " A great little car 0.278575 0.470586 \n", " AWESOME FUN MY LITTLE TIGER 0.007629 0.628312 \n", "\n", " emotion.fear emotion.disgust \\\n", "Review_Title \n", " 1 sweet R32 0.067859 0.018501 \n", " 2002 Trans Am/Sunset Orange Metallic 0.257064 0.032842 \n", " 42 days of driving 8 days in the shop 0.114506 0.010082 \n", " A great little car 0.063823 0.015218 \n", " AWESOME FUN MY LITTLE TIGER 0.013015 0.001452 \n", "\n", " emotion.anger sentiment.score \n", "Review_Title \n", " 1 sweet R32 0.112994 0.649825 \n", " 2002 Trans Am/Sunset Orange Metallic 0.038908 0.148035 \n", " 42 days of driving 8 days in the shop 0.082325 -0.054126 \n", " A great little car 0.039688 0.503785 \n", " AWESOME FUN MY LITTLE TIGER 0.024782 0.986029 " ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_df = agg_merged_keywords_review_df.drop(columns='Rating\\r').dropna().iloc[:, :7]\n", "X_df.head()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "scrolled": true }, "outputs": [], "source": [ "X = X_df.values.reshape(-1, 6) # values converts it into a numpy array\n", "Y = agg_merged_keywords_review_df.dropna()['Rating\\r'].values.reshape(-1, 1) # -1 means that calculate the dimension of rows, but have 1 column" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state=9)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
LinearRegression()
" ], "text/plain": [ "LinearRegression()" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "linear_regressor = LinearRegression() # create object for the class\n", "linear_regressor.fit(X_train, Y_train) # fit the model on the training data" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "Y_pred = linear_regressor.predict(X_test) # make predictions" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean Squared Error = 0.6149240275777909\n", "R-Squared = 0.3883553136042148\n" ] } ], "source": [ "mse = mean_squared_error(Y_test, Y_pred)\n", "print(f\"Mean Squared Error = {mse}\")\n", "test_set_r2 = r2_score(Y_test, Y_pred)\n", "print(f\"R-Squared = {test_set_r2}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our multivariate model shows better value of Coefficient of determination or R-squared and hence the better fit." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For every feature we get the coefficient value. Since we have 7 features we get 7 coefficients. Magnitude and direction(+/-) of all these values affect the prediction results. " ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Feature Coefficients = [[-0.22073291 0.32887635 0.37191993 -0.32082964 -1.60607343 1.01721109]]\n" ] }, { "data": { "text/plain": [ "array([4.0674602])" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coef = linear_regressor.coef_\n", "print(f\"Feature Coefficients = {coef}\")\n", "linear_regressor.intercept_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Predicted Rating against actual Rating plot\n", "We have our predictions in Y_pred. 
Now let's first create a dataframe of the predicted and actual ratings and then visualize it:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Predicted RatingActual Rating
04.3007834.875
13.8957314.625
24.3423263.375
34.7053464.875
43.2564595.000
.........
15274.6233393.125
15283.7471425.000
15293.8741024.500
15303.4682263.875
15314.4060955.000
\n", "

1532 rows × 2 columns

\n", "
" ], "text/plain": [ " Predicted Rating Actual Rating\n", "0 4.300783 4.875\n", "1 3.895731 4.625\n", "2 4.342326 3.375\n", "3 4.705346 4.875\n", "4 3.256459 5.000\n", "... ... ...\n", "1527 4.623339 3.125\n", "1528 3.747142 5.000\n", "1529 3.874102 4.500\n", "1530 3.468226 3.875\n", "1531 4.406095 5.000\n", "\n", "[1532 rows x 2 columns]" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predicted_actual = pd.DataFrame(zip(np.squeeze(Y_pred), np.squeeze(Y)), columns=['Predicted Rating', 'Actual Rating'])\n", "predicted_actual" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 1.0, 'Rating From Dataset Vs Rating Predicted By Model')" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.scatter(Y_test, Y_pred, alpha=0.2)\n", "plt.xlabel('Rating From Dataset')\n", "plt.ylabel('Rating Predicted By Model')\n", "plt.rcParams[\"figure.figsize\"] = (10,6) # Custom figure size in inches\n", "plt.title(\"Rating From Dataset Vs Rating Predicted By Model\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Random Forest:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's fit Random forest regressor to the dataset to see if can improve the R-squared value even more:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
RandomForestRegressor(random_state=0)
" ], "text/plain": [ "RandomForestRegressor(random_state=0)" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fitting Random Forest Regression to the dataset\n", "# import the regressor\n", "from sklearn.ensemble import RandomForestRegressor\n", " \n", "# create regressor object\n", "regressor = RandomForestRegressor(n_estimators = 100, random_state = 0)\n", " \n", "# fit the regressor with x and y data\n", "regressor.fit(X_train, Y_train) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predicting a new result:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "Y_pred = regressor.predict(X_test) # test the output by changing values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reporting mean squared error and R-2 score:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean Squared Error = 0.572035684052662\n", "R-Squared = 0.43101493698694837\n" ] } ], "source": [ "mse = mean_squared_error(Y_test, Y_pred)\n", "print(f\"Mean Squared Error = {mse}\")\n", "\n", "test_set_r2 = r2_score(Y_test, Y_pred)\n", "print(f\"R-Squared = {test_set_r2}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Predicted against actual Y plot" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 1.0, 'Rating From Dataset Vs Rating Predicted By Model')" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.scatter(Y_test, Y_pred, alpha=0.2)\n", "plt.xlabel('Rating From Dataset')\n", "plt.ylabel('Rating Predicted By Model')\n", "plt.rcParams[\"figure.figsize\"] = (10,6) # Custom figure size in inches\n", "plt.title(\"Rating From Dataset Vs Rating Predicted By Model\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Gradient Boosting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try the Gradient Boosting here:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean Squared Error = 0.5519593529266892\n", "R-Squared = 0.4509842026276053\n" ] } ], "source": [ "from sklearn.ensemble import GradientBoostingRegressor\n", "reg = GradientBoostingRegressor(random_state=0)\n", "reg.fit(X_train, Y_train)\n", "Y_pred = reg.predict(X_test)\n", "print(f\"Mean Squared Error = {mean_squared_error(Y_test, Y_pred)}\")\n", "print(f\"R-Squared = {r2_score(Y_test, Y_pred)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Predicted against actual Y plot" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 1.0, 'Rating From Dataset Vs Rating Predicted By Model')" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.scatter(Y_test, Y_pred, alpha=0.2)\n", "plt.xlabel('Rating From Dataset')\n", "plt.ylabel('Rating Predicted By Model')\n", "plt.rcParams[\"figure.figsize\"] = (10,6) # Custom figure size in inches\n", "plt.title(\"Rating From Dataset Vs Rating Predicted By Model\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Further Evaluations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see how well the model fits the data when it comes to prediction of the Car_Make level ratings. For that we need to keep the Car_Make in our dataset datarame; fit the regression on individual reviews and then calulate the average mean squared error and R-squared in the Car_Make level:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sentiment.scoreemotion.sadnessemotion.joyemotion.fearemotion.disgustemotion.angerRating\\rCar_MakeReview_Content
Review_Title
1 sweet R320.6498250.1515430.5321620.0678590.0185010.1129944.875Volkswagen1 sweet R32: I was looking into buying a Subaru WRX \\rSTI, but after two test drives in each \\rand reading as many \\rRoad&Track,Car&Driver,and any other \\rinfo I could find I desided to go with \\rthe R32. I traded in my 2003 GTI VR6 \\rthat had 29,000 miles on it. That was a \\rgreat car but this is a whole new \\rbeast. Once you own an all wheel drive \\rthere is just no going back. This car \\rhandles like a dream, the seats are the \\rbest I've ever been in. Cabin is put \\rtogether very well and the pipes are \\rcrazy. The climit control is awsome, \\rheated seats are so sweet on those cold \\rwinter days. I live in the central \\rvalley of California so these tire are \\rthe best. If there was one thing I \\rwould change(give me a spare tire)!!!!!
2002 Trans Am/Sunset Orange Metallic0.1480350.1763220.4652100.2570640.0328420.0389084.625pontiac2002 Trans Am/Sunset Orange Metallic: This Is Pontiac's most exciting vehicle \\rof all time.It has so much performance \\rthat it is a big disapointment that it \\rwill be discontinued this year.The only \\rarea that this vehicle does not excell \\rin would be the fuel economy \\rdepartment.I guess that if you can \\rafford one of these dream cars, you \\rreally dony worry about how far it will \\rtravel on a tankfull of gas.
42 days of driving 8 days in the shop-0.0541260.2064780.5634660.1145060.0100820.0823253.375chrysler42 days of driving 8 days in the shop : I was given the sebring for my 20th wedding anniversary. I have been in love with it for years and finally got it. After 42 days I blew most of the electrical system. It has been at the dealer for 8 days and they can not find the problem. Right now I am not very happy.
A great little car0.5037850.2785750.4705860.0638230.0152180.0396884.875kiaA great little car: Bought my Spectra about one year ago, currently has about 18,000 miles on it. I have had absolutely no problems with it. I had cruise control added at the time of purchase, other than that it's stock. This is my daily driver, it's comfortable, reliable and gets decent mileage. The Spectra happens to be my second Kia, I have a Sedona van that has been to the dealer several times (however everything was covered by the warranty) it currently has 58,000 miles on it. The Spectra's a great handling car.
AWESOME FUN MY LITTLE TIGER0.9860290.0076290.6283120.0130150.0014520.0247825.000fiatAWESOME FUN MY LITTLE TIGER: Abarth is ultimately more fun than my old mustang or Z a little power house that doesn't shy away from a fight love the engine growl and the kick more room than you think awesome bang for the buck .Fun the most fun than any car I have ever own worth every penny a pleasure to drive.
I LOVE my Focus0.6219830.0740190.5891960.1117220.0081240.0660924.750fordI LOVE my Focus: I LOVE my Focus. I've had it about 2 \\ryears. It drives great, looks good, \\rgets great gas milage and never slows \\rdown. I'm even thinking of getting \\ranother one on my next car purchase!
Looks Good But Hunk Of Junk-0.9846220.1446710.0613580.0606130.0504940.1168352.875maseratiLooks Good But Hunk Of Junk: This car is strictly \"looks only\", it is not reliable or even close to it.I have already sank $13,760 in repairs at only 23K miles.This is totally unacceptable for a $140K car when new.I am taking it to the auction next week to \"unload\" before it can empty my wallet again.But if you want a sharp car that sits good in the driveway - this is it!Just don't drive it anywhere!!
Mr TACOMA0.6338030.1227660.8256530.0347770.0231240.0303445.000ToyotaMr TACOMA: Great truck. The Handling is pretty \\rnice and the engine is stronger. The V6 \\rwith 3100 pounds can really make this \\rtruck move.
Veracruz0.5918160.1069810.5243710.0914820.0123440.0544934.750hyundaiVeracruz: This is a crossover with the ride of a cruse ship. The car has so many bells and whistles. Have it one week and already over 1100 miles. Finding wonderful things about it every day. Could be the best car ever.
You will pay for that warranty-0.3735830.3963060.1104580.0569800.0211920.1190302.750kiaYou will pay for that warranty: Own a 2002 KIA Sedona EX. I complained about lights going dim while under warranty. Kia checked, said everything within parameters. Guess what, 3000 miles out of warranty alternator died. KIA says it's on you now. 63,000 miles and they want $565.00 to repair; that includes alternator, belts and labor. It's not a repair you can do either, seems AC lines are in the way. Do you think KIA planned it? Ask them about changing spark plugs the rear 3, seems you need to remove the air intake manifold? That will require new gaskets? Not sure of that cost. I hope to dump this Sedona by then! Think twice before you buy, they will get you to pay for that supposedly free 5year/60000 bumper to bumper warranty. RIPOFF.
\n", "
" ], "text/plain": [ " sentiment.score emotion.sadness \\\n", "Review_Title \n", " 1 sweet R32 0.649825 0.151543 \n", " 2002 Trans Am/Sunset Orange Metallic 0.148035 0.176322 \n", " 42 days of driving 8 days in the shop -0.054126 0.206478 \n", " A great little car 0.503785 0.278575 \n", " AWESOME FUN MY LITTLE TIGER 0.986029 0.007629 \n", " I LOVE my Focus 0.621983 0.074019 \n", " Looks Good But Hunk Of Junk -0.984622 0.144671 \n", " Mr TACOMA 0.633803 0.122766 \n", " Veracruz 0.591816 0.106981 \n", " You will pay for that warranty -0.373583 0.396306 \n", "\n", " emotion.joy emotion.fear \\\n", "Review_Title \n", " 1 sweet R32 0.532162 0.067859 \n", " 2002 Trans Am/Sunset Orange Metallic 0.465210 0.257064 \n", " 42 days of driving 8 days in the shop 0.563466 0.114506 \n", " A great little car 0.470586 0.063823 \n", " AWESOME FUN MY LITTLE TIGER 0.628312 0.013015 \n", " I LOVE my Focus 0.589196 0.111722 \n", " Looks Good But Hunk Of Junk 0.061358 0.060613 \n", " Mr TACOMA 0.825653 0.034777 \n", " Veracruz 0.524371 0.091482 \n", " You will pay for that warranty 0.110458 0.056980 \n", "\n", " emotion.disgust emotion.anger \\\n", "Review_Title \n", " 1 sweet R32 0.018501 0.112994 \n", " 2002 Trans Am/Sunset Orange Metallic 0.032842 0.038908 \n", " 42 days of driving 8 days in the shop 0.010082 0.082325 \n", " A great little car 0.015218 0.039688 \n", " AWESOME FUN MY LITTLE TIGER 0.001452 0.024782 \n", " I LOVE my Focus 0.008124 0.066092 \n", " Looks Good But Hunk Of Junk 0.050494 0.116835 \n", " Mr TACOMA 0.023124 0.030344 \n", " Veracruz 0.012344 0.054493 \n", " You will pay for that warranty 0.021192 0.119030 \n", "\n", " Rating\\r Car_Make \\\n", "Review_Title \n", " 1 sweet R32 4.875 Volkswagen \n", " 2002 Trans Am/Sunset Orange Metallic 4.625 pontiac \n", " 42 days of driving 8 days in the shop 3.375 chrysler \n", " A great little car 4.875 kia \n", " AWESOME FUN MY LITTLE TIGER 5.000 fiat \n", " I LOVE my Focus 4.750 ford \n", " Looks Good But Hunk Of Junk 2.875 
maserati \n", " Mr TACOMA 5.000 Toyota \n", " Veracruz 4.750 hyundai \n", " You will pay for that warranty 2.750 kia \n", "\n", " Review_Content \n", "Review_Title \n", " 1 sweet R32 1 sweet R32: I was looking into buying a Subaru WRX \\rSTI, but after two test drives in each \\rand reading as many \\rRoad&Track,Car&Driver,and any other \\rinfo I could find I desided to go with \\rthe R32. I traded in my 2003 GTI VR6 \\rthat had 29,000 miles on it. That was a \\rgreat car but this is a whole new \\rbeast. Once you own an all wheel drive \\rthere is just no going back. This car \\rhandles like a dream, the seats are the \\rbest I've ever been in. Cabin is put \\rtogether very well and the pipes are \\rcrazy. The climit control is awsome, \\rheated seats are so sweet on those cold \\rwinter days. I live in the central \\rvalley of California so these tire are \\rthe best. If there was one thing I \\rwould change(give me a spare tire)!!!!! \n", " 2002 Trans Am/Sunset Orange Metallic 2002 Trans Am/Sunset Orange Metallic: This Is Pontiac's most exciting vehicle \\rof all time.It has so much performance \\rthat it is a big disapointment that it \\rwill be discontinued this year.The only \\rarea that this vehicle does not excell \\rin would be the fuel economy \\rdepartment.I guess that if you can \\rafford one of these dream cars, you \\rreally dony worry about how far it will \\rtravel on a tankfull of gas. \n", " 42 days of driving 8 days in the shop 42 days of driving 8 days in the shop : I was given the sebring for my 20th wedding anniversary. I have been in love with it for years and finally got it. After 42 days I blew most of the electrical system. It has been at the dealer for 8 days and they can not find the problem. Right now I am not very happy. \n", " A great little car A great little car: Bought my Spectra about one year ago, currently has about 18,000 miles on it. I have had absolutely no problems with it. 
I had cruise control added at the time of purchase, other than that it's stock. This is my daily driver, it's comfortable, reliable and gets decent mileage. The Spectra happens to be my second Kia, I have a Sedona van that has been to the dealer several times (however everything was covered by the warranty) it currently has 58,000 miles on it. The Spectra's a great handling car. \n", " AWESOME FUN MY LITTLE TIGER AWESOME FUN MY LITTLE TIGER: Abarth is ultimately more fun than my old mustang or Z a little power house that doesn't shy away from a fight love the engine growl and the kick more room than you think awesome bang for the buck .Fun the most fun than any car I have ever own worth every penny a pleasure to drive. \n", " I LOVE my Focus I LOVE my Focus: I LOVE my Focus. I've had it about 2 \\ryears. It drives great, looks good, \\rgets great gas milage and never slows \\rdown. I'm even thinking of getting \\ranother one on my next car purchase! \n", " Looks Good But Hunk Of Junk Looks Good But Hunk Of Junk: This car is strictly \"looks only\", it is not reliable or even close to it.I have already sank $13,760 in repairs at only 23K miles.This is totally unacceptable for a $140K car when new.I am taking it to the auction next week to \"unload\" before it can empty my wallet again.But if you want a sharp car that sits good in the driveway - this is it!Just don't drive it anywhere!! \n", " Mr TACOMA Mr TACOMA: Great truck. The Handling is pretty \\rnice and the engine is stronger. The V6 \\rwith 3100 pounds can really make this \\rtruck move. \n", " Veracruz Veracruz: This is a crossover with the ride of a cruse ship. The car has so many bells and whistles. Have it one week and already over 1100 miles. Finding wonderful things about it every day. Could be the best car ever. \n", " You will pay for that warranty You will pay for that warranty: Own a 2002 KIA Sedona EX. I complained about lights going dim while under warranty. 
Kia checked, said everything within parameters. Guess what, 3000 miles out of warranty alternator died. KIA says it's on you now. 63,000 miles and they want $565.00 to repair; that includes alternator, belts and labor. It's not a repair you can do either, seems AC lines are in the way. Do you think KIA planned it? Ask them about changing spark plugs the rear 3, seems you need to remove the air intake manifold? That will require new gaskets? Not sure of that cost. I hope to dump this Sedona by then! Think twice before you buy, they will get you to pay for that supposedly free 5year/60000 bumper to bumper warranty. RIPOFF. " ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.set_option(\"display.max_colwidth\", 10000)\n", "agg_merged_keywords_review_df = merged_keywords_review_df.drop_duplicates(['Review_Title','sentiment.score']).groupby([\"Review_Title\"]).agg({\n", " 'sentiment.score': 'mean',\n", " 'emotion.sadness': 'mean',\n", " 'emotion.joy': 'mean',\n", " 'emotion.fear': 'mean',\n", " 'emotion.disgust': 'mean',\n", " 'emotion.anger': 'mean',\n", " 'Rating\\r': 'first',\n", " 'Car_Make': 'first',\n", " 'Review_Content': 'first'\n", "})\n", "agg_merged_keywords_review_df.head(10)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "train_set = agg_merged_keywords_review_df.sample(frac=0.75, random_state=0)\n", "test_set = agg_merged_keywords_review_df.drop(train_set.index)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Car_Make\n", "AMGeneral 2\n", "Acura 154\n", "AlfaRomeo 60\n", "AstonMartin 55\n", "Audi 144\n", "BMW 143\n", "Bentley 102\n", "Bugatti 7\n", "Buick 134\n", "Cadillac 140\n", "Chevrolet 157\n", "GMC 133\n", "Honda 143\n", "Toyota 130\n", "Volkswagen 152\n", "chrysler 136\n", "dodge 142\n", "ferrari 111\n", "fiat 142\n", "ford 138\n", "genesis 48\n", "hummer 149\n", "hyundai 142\n", "infiniti 
134\n", "isuzu 137\n", "jaguar 128\n", "jeep 127\n", "kia 126\n", "lamborghini 54\n", "land-rover 141\n", "lexus 125\n", "lincoln 138\n", "lotus 102\n", "maserati 136\n", "maybach 15\n", "mazda 137\n", "mclaren 1\n", "mercedes-benz 133\n", "mercury 131\n", "mini 142\n", "mitsubishi 118\n", "nissan 125\n", "pontiac 132\n", "porsche 136\n", "ram 152\n", "rolls-royce 23\n", "subaru 129\n", "suzuki 121\n", "tesla 100\n", "volvo 135\n", "dtype: int64" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_set.groupby(\"Car_Make\").size()" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "scrolled": true }, "outputs": [], "source": [ "X_train = train_set.dropna().iloc[:, :6].values.reshape(-1, 6) # first six columns (sentiment and emotion scores) as a NumPy array\n", "X_test = test_set.dropna().iloc[:, :6].values.reshape(-1, 6) # same six feature columns for the test rows\n", "Y_train = train_set.dropna()['Rating\\r'].values.reshape(-1, 1) # -1 lets NumPy infer the number of rows; the result has one column\n", "Y_test = test_set.dropna()['Rating\\r'].values.reshape(-1, 1) # same shape for the test targets" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "reg = GradientBoostingRegressor(random_state=0) # create the regressor with a fixed seed\n", "reg.fit(X_train, Y_train) # fit the model on the training data\n", "Y_pred = reg.predict(X_test) # predict ratings for the test rows" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Rows with missing values were not scored, so their prediction stays at 0\n", "predicted_y_with_na = np.zeros(len(test_set.index), dtype=object)\n", "predicted_y_with_na[~test_set.isna().any(axis=1)] = Y_pred\n", "test_set['Predicted_Y'] = predicted_y_with_na" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " emotion.sadness emotion.joy emotion.fear emotion.disgust \\\n", " mean mean mean mean \n", "Car_Make \n", "AMGeneral 0.233502 0.416527 0.149416 0.030530 \n", "Acura 0.186803 0.467307 0.134082 0.020744 \n", "AlfaRomeo 0.179048 0.434307 0.101048 0.030507 \n", "AstonMartin 0.161465 0.532924 0.093979 0.030943 \n", "Audi 0.196609 0.490965 0.092430 0.021402 \n", "BMW 0.191940 0.474563 0.086499 0.024038 \n", "Bentley 0.187771 0.528449 0.089768 0.028513 \n", "Bugatti 0.188314 0.587727 0.055366 0.023677 \n", "Buick 0.260452 0.392718 0.099129 0.025511 \n", "Cadillac 0.218803 0.429976 0.102534 0.030479 \n", "Chevrolet 0.200984 0.441252 0.102457 0.037943 \n", "GMC 0.218755 0.416075 0.111135 0.033378 \n", "Honda 0.191337 0.418301 0.117569 0.029766 \n", "Toyota 0.199025 0.431638 0.104728 0.027125 \n", "Volkswagen 0.190767 0.429078 0.126892 0.031965 \n", "chrysler 0.234828 0.398700 0.116330 0.035023 \n", "dodge 0.218513 0.408767 0.119761 0.026407 \n", "ferrari 0.159649 0.539798 0.108343 0.019731 \n", "fiat 0.203202 0.401303 0.100537 0.030897 \n", "ford 0.238288 0.362188 0.121460 0.028063 \n", "genesis 0.211237 0.430926 0.078043 0.031972 \n", "hummer 0.181750 0.502888 0.126320 0.027008 \n", "hyundai 0.220990 0.393721 0.096735 0.027251 \n", "infiniti 0.200538 0.469661 0.090667 0.024671 \n", "isuzu 0.201127 0.404813 0.122677 0.028856 \n", "jaguar 0.163703 0.556705 0.086785 0.025675 \n", "jeep 0.253255 0.396047 0.104165 0.025074 \n", "kia 0.266675 0.394792 0.111849 0.025904 \n", "lamborghini 0.127094 0.629176 0.082221 0.029788 \n", "land-rover 0.274524 0.355115 0.109827 0.034846 \n", "lexus 0.202997 0.439800 0.100562 0.031977 \n", "lincoln 0.213433 0.466320 0.112260 0.027192 \n", "lotus 0.158467 0.457916 0.137350 0.023458 \n", "maserati 0.174925 0.523992 0.087822 0.037616 \n", "maybach 0.178194 0.515520 0.077657 0.015687 \n", "mazda 0.203442 0.444971 0.111021 0.026170 \n", "mercedes-benz 0.227015 0.387488 0.105910 0.026781 \n", "mercury 0.200664 0.462219 
0.105958 0.025067 \n", "mini 0.218531 0.443708 0.096154 0.026429 \n", "mitsubishi 0.175781 0.481554 0.118347 0.025169 \n", "nissan 0.241268 0.373916 0.111114 0.034341 \n", "pontiac 0.190257 0.430994 0.110610 0.026520 \n", "porsche 0.145226 0.510727 0.093447 0.026101 \n", "ram 0.240626 0.367544 0.108349 0.040745 \n", "rolls-royce 0.260943 0.412009 0.080448 0.037039 \n", "subaru 0.202121 0.470115 0.103731 0.020184 \n", "suzuki 0.206990 0.410432 0.111564 0.033440 \n", "tesla 0.296216 0.379065 0.064811 0.024911 \n", "volvo 0.204267 0.433600 0.113805 0.024111 \n", "\n", " emotion.anger sentiment.score Rating\\r Predicted_Y \n", " mean mean mean mean \n", "Car_Make \n", "AMGeneral 0.065171 0.021626 4.833333 4.264005 \n", "Acura 0.064254 0.330192 4.538690 4.455716 \n", "AlfaRomeo 0.088323 0.268080 4.187500 4.435494 \n", "AstonMartin 0.063027 0.470149 4.613636 4.631018 \n", "Audi 0.059860 0.303431 4.453431 4.421119 \n", "BMW 0.072675 0.243201 4.468750 4.354955 \n", "Bentley 0.057980 0.441103 4.239583 4.57587 \n", "Bugatti 0.054105 0.430882 4.750000 4.718716 \n", "Buick 0.086554 0.088404 4.162736 4.075432 \n", "Cadillac 0.069962 0.274075 4.395408 4.335975 \n", "Chevrolet 0.074343 0.173096 4.104730 4.23044 \n", "GMC 0.067579 0.071438 4.089912 4.138499 \n", "Honda 0.063076 0.118789 3.832386 4.130003 \n", "Toyota 0.070333 0.154561 4.350543 4.243079 \n", "Volkswagen 0.071180 0.130390 4.396875 4.17746 \n", "chrysler 0.075663 0.086065 4.140957 4.168451 \n", "dodge 0.076249 0.100277 4.133929 4.163826 \n", "ferrari 0.082763 0.463863 4.767241 4.530158 \n", "fiat 0.076235 0.087065 3.818878 4.134359 \n", "ford 0.088316 0.078135 4.040094 4.005134 \n", "genesis 0.057856 0.156253 4.608696 4.316763 \n", "hummer 0.055950 0.297203 4.404605 4.462752 \n", "hyundai 0.079915 0.161732 4.109375 4.14899 \n", "infiniti 0.059761 0.322187 4.566860 4.393907 \n", "isuzu 0.101334 0.205943 4.220238 4.306578 \n", "jaguar 0.055537 0.375661 4.584091 4.497573 \n", "jeep 0.079216 0.106891 4.108607 4.15378 
\n", "kia 0.070382 0.141621 4.141827 4.12064 \n", "lamborghini 0.052919 0.657044 4.725000 4.665769 \n", "land-rover 0.086169 0.080642 3.848837 4.0177 \n", "lexus 0.076374 0.231606 4.306122 4.294433 \n", "lincoln 0.075736 0.163414 4.269231 4.264042 \n", "lotus 0.080445 0.307135 4.702381 4.490183 \n", "maserati 0.071795 0.311256 4.431250 4.394369 \n", "maybach 0.072357 0.633714 4.958333 4.733771 \n", "mazda 0.062387 0.230268 4.479651 4.318612 \n", "mercedes-benz 0.085045 0.091385 4.095745 4.094125 \n", "mercury 0.063838 0.246360 4.311224 4.390451 \n", "mini 0.071672 0.167861 4.036184 4.190949 \n", "mitsubishi 0.070454 0.300219 4.346698 4.417798 \n", "nissan 0.079956 0.102161 4.247093 4.119348 \n", "pontiac 0.078449 0.165777 4.375000 4.221752 \n", "porsche 0.080697 0.382931 4.662500 4.552274 \n", "ram 0.076267 0.000294 3.861111 4.113366 \n", "rolls-royce 0.072188 0.321649 4.843750 4.508778 \n", "subaru 0.070802 0.301044 4.257212 4.327014 \n", "suzuki 0.075242 0.114764 4.235119 4.255248 \n", "tesla 0.066818 0.154607 4.673387 4.284923 \n", "volvo 0.071134 0.220652 4.380814 4.281185 " ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "agg_grouped_test_set = (\n", " test_set[sentiment_cols + ['Car_Make', 'Rating\\r', 'Predicted_Y']]\n", " .groupby('Car_Make')\n", " .agg(['mean']))\n", "agg_grouped_test_set" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sentiment.score emotion.sadness emotion.joy emotion.fear \\\n", "Car_Make \n", "AMGeneral 3 3 3 3 \n", "Acura 42 42 42 42 \n", "AlfaRomeo 16 16 16 16 \n", "AstonMartin 33 33 33 33 \n", "Audi 51 51 51 51 \n", "BMW 48 48 48 48 \n", "Bentley 36 36 36 36 \n", "Bugatti 1 1 1 1 \n", "Buick 53 53 53 53 \n", "Cadillac 49 49 49 49 \n", "Chevrolet 37 37 37 37 \n", "GMC 57 57 57 57 \n", "Honda 44 44 44 44 \n", "Toyota 46 46 46 46 \n", "Volkswagen 40 40 40 40 \n", "chrysler 46 47 47 47 \n", "dodge 42 42 42 42 \n", "ferrari 29 29 29 29 \n", "fiat 49 49 49 49 \n", "ford 53 53 53 53 \n", "genesis 23 23 23 23 \n", "hummer 38 38 38 38 \n", "hyundai 48 48 48 48 \n", "infiniti 43 43 43 43 \n", "isuzu 42 42 42 42 \n", "jaguar 55 55 55 55 \n", "jeep 61 61 61 61 \n", "kia 52 52 52 52 \n", "lamborghini 20 20 20 20 \n", "land-rover 43 43 43 43 \n", "lexus 48 49 49 49 \n", "lincoln 39 39 39 39 \n", "lotus 21 21 21 21 \n", "maserati 40 40 40 40 \n", "maybach 3 3 3 3 \n", "mazda 43 43 43 43 \n", "mercedes-benz 47 47 47 47 \n", "mercury 49 49 49 49 \n", "mini 38 38 38 38 \n", "mitsubishi 53 53 53 53 \n", "nissan 43 43 43 43 \n", "pontiac 40 40 40 40 \n", "porsche 40 40 40 40 \n", "ram 34 36 36 36 \n", "rolls-royce 4 4 4 4 \n", "subaru 52 52 52 52 \n", "suzuki 42 42 42 42 \n", "tesla 31 31 31 31 \n", "volvo 43 43 43 43 \n", "\n", " emotion.disgust emotion.anger Rating\\r Review_Content \\\n", "Car_Make \n", "AMGeneral 3 3 2 3 \n", "Acura 42 42 12 42 \n", "AlfaRomeo 16 16 4 16 \n", "AstonMartin 33 33 10 33 \n", "Audi 51 51 15 51 \n", "BMW 48 48 16 48 \n", "Bentley 36 36 14 36 \n", "Bugatti 1 1 1 1 \n", "Buick 53 53 19 53 \n", "Cadillac 49 49 15 49 \n", "Chevrolet 37 37 17 37 \n", "GMC 57 57 21 57 \n", "Honda 44 44 18 44 \n", "Toyota 46 46 11 46 \n", "Volkswagen 40 40 15 40 \n", "chrysler 47 47 19 47 \n", "dodge 42 42 16 42 \n", "ferrari 29 29 7 29 \n", "fiat 49 49 11 49 \n", "ford 53 53 18 53 \n", "genesis 23 23 3 23 \n", "hummer 38 38 15 38 \n", "hyundai 48 48 16 48 \n", 
"infiniti 43 43 13 43 \n", "isuzu 42 42 16 42 \n", "jaguar 55 55 13 55 \n", "jeep 61 61 20 61 \n", "kia 52 52 20 52 \n", "lamborghini 20 20 7 20 \n", "land-rover 43 43 21 43 \n", "lexus 49 49 15 49 \n", "lincoln 39 39 16 39 \n", "lotus 21 21 9 21 \n", "maserati 40 40 13 40 \n", "maybach 3 3 2 3 \n", "mazda 43 43 14 43 \n", "mercedes-benz 47 47 19 47 \n", "mercury 49 49 18 49 \n", "mini 38 38 15 38 \n", "mitsubishi 53 53 18 53 \n", "nissan 43 43 18 43 \n", "pontiac 40 40 14 40 \n", "porsche 40 40 10 40 \n", "ram 36 36 10 36 \n", "rolls-royce 4 4 3 4 \n", "subaru 52 52 16 52 \n", "suzuki 42 42 16 42 \n", "tesla 31 31 5 31 \n", "volvo 43 43 15 43 \n", "\n", " Predicted_Y \n", "Car_Make \n", "AMGeneral 3 \n", "Acura 41 \n", "AlfaRomeo 16 \n", "AstonMartin 32 \n", "Audi 47 \n", "BMW 44 \n", "Bentley 32 \n", "Bugatti 1 \n", "Buick 51 \n", "Cadillac 47 \n", "Chevrolet 37 \n", "GMC 55 \n", "Honda 42 \n", "Toyota 43 \n", "Volkswagen 36 \n", "chrysler 47 \n", "dodge 39 \n", "ferrari 26 \n", "fiat 48 \n", "ford 52 \n", "genesis 22 \n", "hummer 38 \n", "hyundai 47 \n", "infiniti 41 \n", "isuzu 42 \n", "jaguar 48 \n", "jeep 60 \n", "kia 50 \n", "lamborghini 16 \n", "land-rover 43 \n", "lexus 46 \n", "lincoln 36 \n", "lotus 21 \n", "maserati 38 \n", "maybach 3 \n", "mazda 42 \n", "mercedes-benz 45 \n", "mercury 48 \n", "mini 33 \n", "mitsubishi 51 \n", "nissan 43 \n", "pontiac 39 \n", "porsche 40 \n", "ram 36 \n", "rolls-royce 4 \n", "subaru 48 \n", "suzuki 41 \n", "tesla 31 \n", "volvo 42 " ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# To get the number of reviews per Car Name:\n", "test_set.groupby('Car_Make').nunique()" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "R-Squared = 0.5833818585164156\n", "Mean Squared Error = 0.03221873091669909\n" ] } ], "source": [ "# r2_score for predicted y and target y avg per group!\n", 
"agg_r2_score = r2_score(agg_grouped_test_set['Rating\\r'], agg_grouped_test_set['Predicted_Y'])\n", "print(f\"R-Squared = {agg_r2_score}\")\n", "agg_mse = mean_squared_error(agg_grouped_test_set['Rating\\r'], agg_grouped_test_set['Predicted_Y'])\n", "print(f\"Mean Squared Error = {agg_mse}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the mean_squared error shows when it comes to the average the model has fitted the data moderately well. The R-squareds shows a moderate effect size indicates that ~44% of the variability in the Rating cannot be explained by the model." ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>Rating\\r</th><th>Predicted_Y</th></tr>\n",
"<tr><th></th><th>mean</th><th>mean</th></tr>\n",
"<tr><th>Car_Make</th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>AMGeneral</th><td>4.833333</td><td>4.264005</td></tr>\n",
"<tr><th>Acura</th><td>4.538690</td><td>4.455716</td></tr>\n",
"<tr><th>AlfaRomeo</th><td>4.187500</td><td>4.435494</td></tr>\n",
"<tr><th>AstonMartin</th><td>4.613636</td><td>4.631018</td></tr>\n",
"<tr><th>Audi</th><td>4.453431</td><td>4.421119</td></tr>\n",
"<tr><th>BMW</th><td>4.468750</td><td>4.354955</td></tr>\n",
"<tr><th>Bentley</th><td>4.239583</td><td>4.57587</td></tr>\n",
"<tr><th>Bugatti</th><td>4.750000</td><td>4.718716</td></tr>\n",
"<tr><th>Buick</th><td>4.162736</td><td>4.075432</td></tr>\n",
"<tr><th>Cadillac</th><td>4.395408</td><td>4.335975</td></tr>\n",
"<tr><th>Chevrolet</th><td>4.104730</td><td>4.23044</td></tr>\n",
"<tr><th>GMC</th><td>4.089912</td><td>4.138499</td></tr>\n",
"<tr><th>Honda</th><td>3.832386</td><td>4.130003</td></tr>\n",
"<tr><th>Toyota</th><td>4.350543</td><td>4.243079</td></tr>\n",
"<tr><th>Volkswagen</th><td>4.396875</td><td>4.17746</td></tr>\n",
"<tr><th>chrysler</th><td>4.140957</td><td>4.168451</td></tr>\n",
"<tr><th>dodge</th><td>4.133929</td><td>4.163826</td></tr>\n",
"<tr><th>ferrari</th><td>4.767241</td><td>4.530158</td></tr>\n",
"<tr><th>fiat</th><td>3.818878</td><td>4.134359</td></tr>\n",
"<tr><th>ford</th><td>4.040094</td><td>4.005134</td></tr>\n",
"<tr><th>genesis</th><td>4.608696</td><td>4.316763</td></tr>\n",
"<tr><th>hummer</th><td>4.404605</td><td>4.462752</td></tr>\n",
"<tr><th>hyundai</th><td>4.109375</td><td>4.14899</td></tr>\n",
"<tr><th>infiniti</th><td>4.566860</td><td>4.393907</td></tr>\n",
"<tr><th>isuzu</th><td>4.220238</td><td>4.306578</td></tr>\n",
"<tr><th>jaguar</th><td>4.584091</td><td>4.497573</td></tr>\n",
"<tr><th>jeep</th><td>4.108607</td><td>4.15378</td></tr>\n",
"<tr><th>kia</th><td>4.141827</td><td>4.12064</td></tr>\n",
"<tr><th>lamborghini</th><td>4.725000</td><td>4.665769</td></tr>\n",
"<tr><th>land-rover</th><td>3.848837</td><td>4.0177</td></tr>\n",
"<tr><th>lexus</th><td>4.306122</td><td>4.294433</td></tr>\n",
"<tr><th>lincoln</th><td>4.269231</td><td>4.264042</td></tr>\n",
"<tr><th>lotus</th><td>4.702381</td><td>4.490183</td></tr>\n",
"<tr><th>maserati</th><td>4.431250</td><td>4.394369</td></tr>\n",
"<tr><th>maybach</th><td>4.958333</td><td>4.733771</td></tr>\n",
"<tr><th>mazda</th><td>4.479651</td><td>4.318612</td></tr>\n",
"<tr><th>mercedes-benz</th><td>4.095745</td><td>4.094125</td></tr>\n",
"<tr><th>mercury</th><td>4.311224</td><td>4.390451</td></tr>\n",
"<tr><th>mini</th><td>4.036184</td><td>4.190949</td></tr>\n",
"<tr><th>mitsubishi</th><td>4.346698</td><td>4.417798</td></tr>\n",
"<tr><th>nissan</th><td>4.247093</td><td>4.119348</td></tr>\n",
"<tr><th>pontiac</th><td>4.375000</td><td>4.221752</td></tr>\n",
"<tr><th>porsche</th><td>4.662500</td><td>4.552274</td></tr>\n",
"<tr><th>ram</th><td>3.861111</td><td>4.113366</td></tr>\n",
"<tr><th>rolls-royce</th><td>4.843750</td><td>4.508778</td></tr>\n",
"<tr><th>subaru</th><td>4.257212</td><td>4.327014</td></tr>\n",
"<tr><th>suzuki</th><td>4.235119</td><td>4.255248</td></tr>\n",
"<tr><th>tesla</th><td>4.673387</td><td>4.284923</td></tr>\n",
"<tr><th>volvo</th><td>4.380814</td><td>4.281185</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Rating\\r Predicted_Y\n", " mean mean\n", "Car_Make \n", "AMGeneral 4.833333 4.264005\n", "Acura 4.538690 4.455716\n", "AlfaRomeo 4.187500 4.435494\n", "AstonMartin 4.613636 4.631018\n", "Audi 4.453431 4.421119\n", "BMW 4.468750 4.354955\n", "Bentley 4.239583 4.57587\n", "Bugatti 4.750000 4.718716\n", "Buick 4.162736 4.075432\n", "Cadillac 4.395408 4.335975\n", "Chevrolet 4.104730 4.23044\n", "GMC 4.089912 4.138499\n", "Honda 3.832386 4.130003\n", "Toyota 4.350543 4.243079\n", "Volkswagen 4.396875 4.17746\n", "chrysler 4.140957 4.168451\n", "dodge 4.133929 4.163826\n", "ferrari 4.767241 4.530158\n", "fiat 3.818878 4.134359\n", "ford 4.040094 4.005134\n", "genesis 4.608696 4.316763\n", "hummer 4.404605 4.462752\n", "hyundai 4.109375 4.14899\n", "infiniti 4.566860 4.393907\n", "isuzu 4.220238 4.306578\n", "jaguar 4.584091 4.497573\n", "jeep 4.108607 4.15378\n", "kia 4.141827 4.12064\n", "lamborghini 4.725000 4.665769\n", "land-rover 3.848837 4.0177\n", "lexus 4.306122 4.294433\n", "lincoln 4.269231 4.264042\n", "lotus 4.702381 4.490183\n", "maserati 4.431250 4.394369\n", "maybach 4.958333 4.733771\n", "mazda 4.479651 4.318612\n", "mercedes-benz 4.095745 4.094125\n", "mercury 4.311224 4.390451\n", "mini 4.036184 4.190949\n", "mitsubishi 4.346698 4.417798\n", "nissan 4.247093 4.119348\n", "pontiac 4.375000 4.221752\n", "porsche 4.662500 4.552274\n", "ram 3.861111 4.113366\n", "rolls-royce 4.843750 4.508778\n", "subaru 4.257212 4.327014\n", "suzuki 4.235119 4.255248\n", "tesla 4.673387 4.284923\n", "volvo 4.380814 4.281185" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "agg_grouped_test_set[['Rating\\r', 'Predicted_Y']]" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "emotion.sadness mean float64\n", "emotion.joy mean float64\n", "emotion.fear mean float64\n", "emotion.disgust mean float64\n", "emotion.anger mean float64\n", 
"sentiment.score mean float64\n", "Rating\\r mean float64\n", "Predicted_Y mean object\n", "dtype: object" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "agg_grouped_test_set.dtypes" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "y = 0.51x + 2.10\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pylab as pylab\n", "\n", "# plot the data itself\n", "pylab.plot(agg_grouped_test_set['Rating\\r'],agg_grouped_test_set['Predicted_Y'],'o')\n", "pylab.xlabel('Rating From Dataset')\n", "pylab.ylabel('Rating Predicted By Model')\n", "\n", "# calc the trendline\n", "z = np.polyfit(np.squeeze(agg_grouped_test_set['Rating\\r']),\n", " np.squeeze(agg_grouped_test_set['Predicted_Y'].astype(float)), 1)\n", "p = np.poly1d(z)\n", "pylab.plot(agg_grouped_test_set['Predicted_Y'],p(agg_grouped_test_set['Predicted_Y']),\"r--\")\n", "pylab.title(\"Rating From Dataset Vs Rating Predicted By Model\")\n", "# the trendline equation:\n", "print (\"y = %.2fx + %.2f\"%(z[0],z[1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above results suggest a clear better fit for the model in average; showing that the Gradient Boosting Regressor model explains 74% of the fitted Car Make level Rating in the regression model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conclusion\n", "\n", "In this notebook we demonstrated how Text Extensions for Pandas can be used to perform Sentiment Analysis tasks. We started by loading our car reviews and passing it through Watson NLU service. We extracted the keywords and their corresponding sentiment and fine-grained emotion using the Watson NLU service. We used Text Extensions for Pandas to convert the Watson NLU output to pandas dataframe and calculated the reveiw-level sentiment and emotion. Using the resulted Pandas dataframe, we showed the correlation of Watson NLU's extracted features and user's Rating first and then developed the Univariate/Multivariate Regression, Random Forest, and Gradient Boosting models for predicting the Ratings for a given review. 
Finally, we evaluated the model's ability to predict the rating for each car make.\n", "\n", "This notebook also demonstrates how easy it is to use IBM Watson NLU, Pandas, and scikit-learn together to conduct exploratory analysis or prediction on your data." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.17" } }, "nbformat": 4, "nbformat_minor": 4 }