{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# *Data Visualization and Statistics*\n", "\n", "Gallery of Matplotlib examples: [https://matplotlib.org/gallery.html](https://matplotlib.org/gallery.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## First, let's import some packages.\n", "\n", "import os\n", "from pprint import pprint\n", "from textblob import TextBlob\n", "\n", "import numpy as np\n", "from scipy import stats\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "%matplotlib inline\n", "# The line above tells Jupyter to display Matplotlib graphics within the notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Download sample text corpora from GitHub, then unzip.\n", "\n", "os.chdir('/sharedfolder/')\n", "\n", "!wget -N https://github.com/pcda17/pcda17.github.io/blob/master/week/8/Sample_corpora.zip?raw=true -O Sample_corpora.zip\n", "!unzip -o Sample_corpora.zip" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "os.chdir('/sharedfolder/Sample_corpora')\n", "\n", "os.listdir('./')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ls Jane_Austen" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ls Herman_Melville" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Loading a Melville novel as a TextBlob object\n", "\n", "melville_path = 'Herman_Melville/Moby_Dick.txt'\n", "\n", "melville_blob = TextBlob(open(melville_path).read().replace('\\n', ' '))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Loading an Austen novel as a TextBlob object\n", "\n", "austen_path = 'Jane_Austen/Pride_and_Prejudice.txt'\n", "\n", "austen_blob = TextBlob(open(austen_path).read().replace('\\n', ' '))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Recall that 'some_textblob_object.words' is a WordList object ...\n", "\n", "melville_blob.words[5100:5140]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ... which we can cast to an ordinary list.\n", "\n", "list(melville_blob.words[5100:5140])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## And 'some_textblob_object.sentences' is a list of Sentence objects ...\n", "\n", "austen_blob.sentences[100:105]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ... which we can convert to a list of strings using a list comprehension.\n", "\n", "[str(item) for item in austen_blob.sentences[100:105]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## For reference, here's another example of a list comprehension:\n", "\n", "word_list = ['Call', 'me', 'Ishmael.']\n", "\n", "uppercase_list = [word.upper() for word in word_list]\n", "\n", "uppercase_list" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## And one more for good measure:\n", "\n", "string_nums = [str(i) for i in range(12)]\n", "\n", "string_nums" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ▷ Sentiment analysis with TextBlob\n", "\n", "Details on the training data that NLTK (via TextBlob) uses to measure polarity:\n", "[http://www.cs.cornell.edu/people/pabo/movie-review-data/](http://www.cs.cornell.edu/people/pabo/movie-review-data/)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Negative sentiment polarity example\n", "# (result between -1 and +1)\n", "\n", "from textblob import TextBlob\n", "\n", "text = \"This is a very mean and nasty sentence.\"\n", "\n", "blob = TextBlob(text)\n", "\n", "sentiment_score = blob.sentiment.polarity\n", "\n", "print(sentiment_score)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Positive sentiment polarity example\n", "# (result between -1 and +1)\n", "\n", "text = \"This is a very nice and positive sentence.\"\n", "\n", "blob = TextBlob(text)\n", "\n", "sentiment_score = blob.sentiment.polarity\n", "\n", "print(sentiment_score)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Neutral polarity / not enough information\n", "\n", "text = \"What is this?\"\n", "\n", "blob = TextBlob(text)\n", "\n", "sentiment_score = blob.sentiment.polarity\n", "\n", "print(sentiment_score)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## High subjectivity example\n", "# result between 0 and 1\n", "\n", "text=\"This is a very mean and nasty sentence.\"\n", "\n", "blob = TextBlob(text)\n", "\n", "sentiment_score = blob.sentiment.subjectivity\n", "\n", "print(sentiment_score)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Low subjectivity example\n", "# result between 0 and 1\n", "\n", "text=\"This sentence states a fact, with an apparently objective adjective.\"\n", "\n", "blob = TextBlob(text)\n", "\n", "sentiment_score=blob.sentiment.subjectivity\n", "\n", "print(sentiment_score)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### ▷ Plotting Sentiment Values\n", "\n", "Let's map sentiment polarity values across the course of a full novel." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Viewing Pyplot style templates\n", "\n", "pprint(plt.style.available)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Selecting a Pyplot style\n", "\n", "plt.style.use('ggplot')\n", "\n", "# The 'ggplot' style imitates the R graphing package 'ggplot2.' (http://ggplot2.org)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "austen_sentiments = [item.sentiment.polarity for item in austen_blob.sentences]\n", "\n", "austen_sentiments[:15]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Austen sentiment values for first 60 sentences\n", "\n", "plt.figure(figsize=(18,8))\n", "plt.plot(austen_sentiments[:60])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "austen_blob.sentences[30]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "austen_blob.sentences[37]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Plotting 'Pride and Prejudice' sentence sentiment values over full novel\n", "\n", "plt.figure(figsize=(18,8))\n", "\n", "plt.plot(austen_sentiments)\n", "\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Finding the most 'positive' sentences in 'Pride and Prejudice' and printing them\n", "\n", "max_sentiment = max(austen_sentiments)\n", "\n", "print(max_sentiment) # max sentiment polarity value\n", "print()\n", "\n", "for sentence in austen_blob.sentences:\n", " if sentence.sentiment.polarity == max_sentiment:\n", " print(sentence)\n", " print()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Finding the most 'negative' sentences in 'Pride and Prejudice' and printing them\n", "\n", "min_sentiment = min(austen_sentiments)\n", "\n", "print(min_sentiment) # max sentiment polarity value\n", "print()\n", "\n", "for sentence in austen_blob.sentences:\n", " if sentence.sentiment.polarity == min_sentiment:\n", " print(sentence)\n", " print()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Example: smoothing a list of numbers using the 'pandas' package\n", "\n", "some_values = [5, 4, 5, 6, 6, 7, 6, 19, 4, 4, 3, 3, 3, 1, 5, 5, 6, 7, 0]\n", "\n", "pandas_series = pd.Series(some_values)\n", "\n", "list(pandas_series.rolling(window=4).mean())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Smoothing our data before plotting\n", "\n", "austen_sentiments_pd = pd.Series(austen_sentiments)\n", "\n", "austen_sentiments_smooth = austen_sentiments_pd.rolling(window=200).mean()\n", "\n", "print(austen_sentiments_smooth[190:220])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Plotting smoothed sentiment polarity values for each sentence in 'Pride and Prejudice'\n", "\n", "plt.figure(figsize=(18,8))\n", "\n", "plt.plot(austen_sentiments_smooth)\n", "\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Comparing 'Moby Dick' sentiment values\n", "\n", "melville_sentiments = [item.sentiment.polarity for item in melville_blob.sentences]\n", "\n", "melville_sentiments_pd = pd.Series(melville_sentiments)\n", "\n", "melville_sentiments_smooth = melville_sentiments_pd.rolling(window=200).mean()\n", "\n", "plt.figure(figsize=(18,8))\n", "\n", "plt.plot(melville_sentiments_smooth)\n", "\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Finding and printing the most 'negative' sentence in a list of smoothed sentiment values\n", "\n", "min_sentiment = min(melville_sentiments_smooth[199:])\n", "\n", "print(min_sentiment) # min sentiment polarity value\n", "print()\n", "\n", "min_sentiment_index = list(melville_sentiments_smooth).index(min_sentiment) # index position of the 'min_sentiment' value\n", "\n", "print(melville_blob.sentences[min_sentiment_index])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Finding and printing the most 'positive' sentence in a list of smoothed sentiment values\n", "\n", "max_sentiment = max(melville_sentiments_smooth[199:])\n", "\n", "print(max_sentiment) # max sentiment polarity value\n", "print()\n", "\n", "max_sentiment_index = list(melville_sentiments_smooth).index(max_sentiment) # index position of the 'min_sentiment' value\n", "\n", "print(melville_blob.sentences[max_sentiment_index])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Finding and printing the most 'positive' sentence in a list of smoothed sentiment values\n", "\n", "max_sentiment = max(austen_sentiments_smooth[199:])\n", "\n", "print(max_sentiment) # max sentiment polarity value\n", "print()\n", "\n", "max_sentiment_index = list(austen_sentiments_smooth).index(max_sentiment) # index position of the 'max_sentiment' value\n", "\n", "print(austen_blob.sentences[max_sentiment_index])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Finding and printing the most 'negative' sentence in a list of smoothed sentiment values\n", "\n", "min_sentiment = min(austen_sentiments_smooth[199:])\n", "\n", "print(min_sentiment) # min sentiment polarity value\n", "print()\n", "\n", "min_sent_index=list(austen_sentiments_smooth).index(min_sentiment) # index position of the 'min_sentiment' value\n", "\n", "print(austen_blob.sentences[min_sent_index])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Creating functions to expedite the steps we put together above process\n", "# This function accepts an optional second argument for smoothing window size. The default is 200 windows.\n", "\n", "def plot_polarity(text_path, window=200):\n", " text_in = open(text_path).read().replace('\\n', ' ')\n", " blob = TextBlob(text_in)\n", " sentiments = [sentence.sentiment.polarity for sentence in blob.sentences]\n", " sentiments_pd = pd.Series(sentiments)\n", " sentiments_smooth = sentiments_pd.rolling(window).mean()\n", " plt.figure(figsize = (18,8))\n", " plt.plot(sentiments_smooth)\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!find ./" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_polarity('George_Eliot/Silas_Marner.txt')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_polarity('Joseph_Conrad/Heart_of_Darkness.txt')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ▷ Plotting smoothed random data (for comparison)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Plotting completely random data\n", "\n", "random_vals = np.random.rand(4000)\n", "\n", "vals_pd = pd.Series(random_vals)\n", "vals_smooth = vals_pd.rolling(window=200).mean()\n", "\n", "plt.figure(figsize=(18,8))\n", "plt.plot(vals_smooth)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ▷ Working with multiple files" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ls *" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "os.chdir('/sharedfolder/Sample_corpora/Inaugural_Speeches/')\n", "sorted(os.listdir('./'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "inaugural_filenames = sorted(os.listdir('./'))\n", "\n", "inaugural_sentiment_values = []\n", "\n", "for filename in inaugural_filenames:\n", " inaugural_text = open(filename).read()\n", " sentiment_polarity_value = TextBlob(inaugural_text).sentiment.polarity\n", " inaugural_sentiment_values.append(sentiment_polarity_value)\n", "\n", "print(inaugural_sentiment_values)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Creating nicely formatted labels for the sentiment values above\n", "\n", "inaugural_labels = [item.replace('.txt','').replace('_', ' ').title() for item in inaugural_filenames]\n", "\n", "inaugural_labels" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Plotting presidential inaugural address sentiment values over time\n", "\n", "plt.figure(figsize = (20,8))\n", "\n", "plt.xticks(range(len(inaugural_sentiment_values)), inaugural_labels) # two arguments: tick positions, tick display list\n", "\n", "plt.xticks(rotation=-85)\n", "\n", "plt.ylabel('Sentiment Polarity Value')\n", "\n", "plt.plot(inaugural_sentiment_values)\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ▷ Assignment\n", "\n", " For each author in our set of corpora, which is their most 'positive' novel? Their most 'negative'?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ▷ Sentiment Histograms" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "os.chdir('/sharedfolder/Sample_corpora/')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text_in = open('Jane_Austen/Pride_and_Prejudice.txt').read().replace('\\n', ' ')\n", "\n", "blob = TextBlob(text_in)\n", "sentiments = [sentence.sentiment.polarity for sentence in blob.sentences]\n", "plt.figure(figsize=(20,10))\n", "plt.hist(sentiments, bins=25)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text_in = open('Jane_Austen/Pride_and_Prejudice.txt').read().replace('\\n', ' ')\n", "\n", "blob = TextBlob(text_in)\n", "sentiments = [sentence.sentiment.subjectivity for sentence in blob.sentences]\n", "plt.figure(figsize=(20,10))\n", "plt.hist(sentiments, bins=25)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ▷ Cleaning sentiment values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text_in = open('Jane_Austen/Pride_and_Prejudice.txt').read().replace('\\n', ' ')\n", "\n", "blob = TextBlob(text_in)\n", "sentiments = [sentence.sentiment.polarity for sentence in blob.sentences]\n", "sentiments_cleaned = [value for value in sentiments if value!=0]\n", "plt.figure(figsize=(20,10))\n", "plt.hist(sentiments_cleaned, bins=25)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def polarity_histogram_cleaned(text_path):\n", " text_in = open(text_path).read().replace('\\n', ' ')\n", " blob = TextBlob(text_in)\n", " sentiments = [sentence.sentiment.polarity for sentence in blob.sentences]\n", " sentiments_cleaned = [value for value in sentiments if value!=0]\n", " plt.figure(figsize=(20,10))\n", " plt.hist(sentiments_cleaned, bins=25)\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!find ./" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "polarity_histogram_cleaned('./Joseph_Conrad/The_Secret_Agent.txt')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## ▷ Comparing Sentiment Distributions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "melville_blob = TextBlob(open('Herman_Melville/Moby_Dick.txt').read().replace('\\n', ' '))\n", "austen_blob = TextBlob(open('Jane_Austen/Pride_and_Prejudice.txt').read().replace('\\n', ' '))\n", "\n", "melville_sentiments = [sentence.sentiment.polarity for sentence in melville_blob.sentences]\n", "melville_sentiments_cleaned = [value for value in melville_sentiments if value!=0.0]\n", "\n", "austen_sentiments = [sentence.sentiment.polarity for sentence in austen_blob.sentences]\n", "austen_sentiments_cleaned = [value for value in austen_sentiments if value!=0.0]\n", "\n", "plt.figure(figsize=(15,8))\n", "\n", "plt.hist(melville_sentiments_cleaned, bins=25, alpha=0.5, label='Moby Dick')\n", "plt.hist(austen_sentiments_cleaned, bins=25, alpha=0.5, label='Pride and Prejudice')\n", "\n", "plt.legend(loc='upper right')\n", "\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(np.mean(melville_sentiments_cleaned))\n", "print(np.mean(austen_sentiments_cleaned))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ▷ Statistical Tests" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## t-test of independent values\n", "# (used to determine whether two *normally distributed* sets of values are significantly different)\n", "\n", "from scipy import stats\n", "\n", "stats.ttest_ind(melville_sentiments_cleaned, austen_sentiments_cleaned)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Mann-Whitney U test\n", "# (used to test two sets of *non-normally distributed* values are significantly different)\n", "\n", "stats.mannwhitneyu(melville_sentiments, austen_sentiments)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ▷ Assignment\n", "\n", " Is George Eliot significantly more subjective than Jane Austen?\n", " Is Herman Melville significantly more 'positive' than Joseph Conrad?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ▷ Assignment\n", "\n", " Write a function that takes two texts' paths as arguments and \n", " (a) plots a histogram comparing their sentences' sentiment distributions\n", " (b) tests whether their sentiment values are significantly different" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }