{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Analysis: Moral Foundations Theory\n", "---\n", "\n", "\n", "### Professor Amy Tick\n", "\n", "Moral Foundations Theory (MFT) hypothesizes that people's sensitivity to the foundations is different based on their political ideology: liberals are more sensitive to care and fairness, while conservatives are equally sensitive to all five. Here, we'll explore whether we can find evidence for MFT in the campaign speeches of 2016 United States presidential candidates. For our main analysis, we'll go through the data science process start to finish to recreate a simplified version of the analysis done by Jesse Graham, Jonathan Haidt, and Brian A. Nosek in their 2009 paper [\"Liberals and Conservatives Rely on Different Sets of Moral Foundations\"](http://projectimplicit.net/nosek/papers/GHN2009.pdf). Finally, we'll explore other ways to visualize and use this data in rhetorical analysis.\n", "\n", "*Estimated Time: 50 minutes*\n", "\n", "---\n", "\n", "### Topics Covered\n", "- Word count using a dictionary\n", "- Data visualization with pandas\n", "- Graph interpretations\n", "\n", "### Table of Contents\n", "\n", "\n", "1 - [Data Set and Test Statistic](#section 1)
\n", "\n", "       1.1 - [2016 Campaign Speeches](#subsection 1)
\n", "\n", "       1.2 - [Moral Foundations Dictionary](#subsection 2)
\n", "\n", "2 - [Data Analysis](#section 2)
\n", "\n", "       2.1 - [Calculating Percentages](#subsection 3)
\n", "\n", "       2.2 - [Filtering Table Rows](#subsection 4)
\n", "\n", "       2.3 - [Democrats](#subsection 5)
\n", "\n", "       2.4 - [Republicans](#subsection 6)
\n", "\n", "       2.5 - [Democrats vs Republicans](#subsection 7)
\n", "\n", "3 - [Additional Visualizations](#section 3)
\n", "\n", "4 - [Assignment: Run Analysis with Your Dictionary](#section 4)
\n", "\n", "**Dependencies:**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "import json\n", "from nltk.stem.snowball import SnowballStemmer\n", "import os\n", "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Intro: The Data Science Process" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Module 01 defined data science as an interdisciplinary field, combining statistics, computer science, and domain expertise to understand the world and solve problems. The data science process can be thought of like this:\n", "\n", "\n", "\n", "This module walks through a simplified version of the process to explore speech data and probe Moral Foundations Theory. Steps done in this module are in bold.\n", "\n", "1. Raw Data Collection: speech data is collected into csv files via web-scraping.\n", "2. **Data Processing/Cleaning**: speech data is transformed to enable analysis. Some processing/cleaning has already been done.\n", "3. **Exploratory Data Analysis**: transform, visualize, and summarize data with the goal of understanding the data set, finding possible issues, and looking for potential questions to explore further.\n", "4. **Models and Algorithms**: develop and test a *model*- a theory of how the data was generated (in this case, Moral Foundations Theory).\n", "5. Communicate, Visualize, Report: to be discussed in Module 03." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Part 1: Speech Data and Foundations Dictionary " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Part 1, we'll get familiar with our data set and determine a way to answer questions using the data." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2016 Campaign Speeches " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the cell below to load the data.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Candidate</th><th>Party</th><th>Type</th><th>Date</th><th>Title</th><th>Speech</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Jeb Bush</td><td>R</td><td>c</td><td>June 15, 2015</td><td>Remarks Announcing Candidacy for President at ...</td><td>Thank you all very much. I always feel welcome...</td></tr>\n",
"    <tr><th>1</th><td>Jeb Bush</td><td>R</td><td>c</td><td>July 30, 2015</td><td>Remarks to the National Urban League Conferenc...</td><td>Thank you all very much. I appreciate your hos...</td></tr>\n",
"    <tr><th>2</th><td>Jeb Bush</td><td>R</td><td>c</td><td>August 11, 2015</td><td>Remarks at the Ronald Reagan Presidential Libr...</td><td>Thank you very much. It's good to be with all ...</td></tr>\n",
"    <tr><th>3</th><td>Jeb Bush</td><td>R</td><td>c</td><td>September 9, 2015</td><td>Remarks in Garner, North Carolina</td><td>Thank you very much. I appreciate your hospita...</td></tr>\n",
"    <tr><th>4</th><td>Jeb Bush</td><td>R</td><td>c</td><td>November 2, 2015</td><td>Remarks in Tampa, Florida</td><td>Thank you. It's great to be in Tampa with so m...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Candidate Party Type Date \\\n", "0 Jeb Bush R c June 15, 2015 \n", "1 Jeb Bush R c July 30, 2015 \n", "2 Jeb Bush R c August 11, 2015 \n", "3 Jeb Bush R c September 9, 2015 \n", "4 Jeb Bush R c November 2, 2015 \n", "\n", " Title \\\n", "0 Remarks Announcing Candidacy for President at ... \n", "1 Remarks to the National Urban League Conferenc... \n", "2 Remarks at the Ronald Reagan Presidential Libr... \n", "3 Remarks in Garner, North Carolina \n", "4 Remarks in Tampa, Florida \n", "\n", " Speech \n", "0 Thank you all very much. I always feel welcome... \n", "1 Thank you all very much. I appreciate your hos... \n", "2 Thank you very much. It's good to be with all ... \n", "3 Thank you very much. I appreciate your hospita... \n", "4 Thank you. It's great to be in Tampa with so m... " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# load the data from csv files into a table. \n", "speeches = pd.read_csv('campaign_2016.csv', index_col=0)\n", "\n", "# show the first 5 rows of the table\n", "speeches.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take a moment to look at this table. Before doing any analysis, it's important to understand:\n", "* the size of the table (how much data does it contain?)\n", "* the structure of the table (how is the data organized?)\n", "* what information it contains (what are the aspects of each record described in columns? 
what does each record (row) represent?)\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(430, 6)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use this cell to explore the speeches DataFrame\n", "# the `shape` attribute is useful to get the number of rows and columns\n", "speeches.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Moral Foundations Dictionary " ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "In [\"Liberals and Conservatives Rely on Different Sets of Moral Foundations\"](http://projectimplicit.net/nosek/papers/GHN2009.pdf), one of the methods Graham, Haidt, and Nosek use to measure how much people rely on each moral foundation is to count how often they use words related to that foundation. This will be our test statistic for today. To calculate it, we'll need a dictionary of words related to each moral foundation. \n", "\n", "The dictionary we'll use today comes from a database called [WordNet](https://wordnet.princeton.edu), in which \"nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.\" By querying WordNet for semantically related words, it was possible to build a dictionary automatically using a Python program.\n", "\n", "Run the cell below to load the dictionary and assign it to the variable `mft_dict`." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Load a dictionary into the mft_dict variable\n", "# The path is the argument for the open function. 
It gives the location of the dictionary file.\n", "# To use the WordNet dictionary from the Module 02 lecture, set the path to '../mft_data/foundations_dict.json'\n", "# To use your hand-coded dictionary, set the path to '../mft_data/my_dict.json'\n", "with open('../mft_data/foundations_dict.json') as json_data:\n", "    mft_dict = json.load(json_data)\n", "\n", "# Stem the words in your dictionary (this will help you get more matches)\n", "stemmer = SnowballStemmer('english')\n", "\n", "for foundation in mft_dict.keys():\n", "    curr_words = mft_dict[foundation]\n", "    stemmed_words = [stemmer.stem(word) for word in curr_words]\n", "    mft_dict[foundation] = stemmed_words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see the keys of the dictionary using the `.keys()` method:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['authority/subversion',\n", " 'liberty/oppression',\n", " 'loyalty/betrayal',\n", " 'care/harm',\n", " 'sanctity/degradation',\n", " 'fairness/cheating']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "keys = mft_dict.keys()\n", "list(keys)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can look up the entries associated with a key by putting the key in brackets, as in `mft_dict['care/harm']`. The cell below shows the whole dictionary, with every key and its entries:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'authority/subversion': ['respect',\n", " 'esteem',\n", " 'regard',\n", " 'subver',\n", " 'say-so',\n", " 'offic',\n", " 'disrespect',\n", " 'valu',\n", " 'obedi',\n", " 'assur',\n", " 'honor',\n", " 'disesteem',\n", " 'agenc',\n", " 'corrupt',\n", " 'honour',\n", " 'domin',\n", " 'author',\n", " 'observ',\n", " 'confid',\n", " 'defer',\n", " 'bureau',\n", " 'authori',\n", " 'sure',\n", " 'sanction'],\n", " 'care/harm': ['hurt',\n", " 'scath',\n", " 'precaut',\n", " 'concern',\n", " 'attent',\n", " 'damag',\n", " 'care',\n", " 
'manag',\n", " 'impair',\n", " 'worri',\n", " 'harm',\n", " 'trauma',\n", " 'guardianship',\n", " 'aid',\n", " 'tend',\n", " 'caution',\n", " 'forethought',\n", " 'tutelag',\n", " 'injuri',\n", " 'upkeep',\n", " 'mainten',\n", " 'charg'],\n", " 'fairness/cheating': ['equiti',\n", " 'fair',\n", " 'cuckold',\n", " 'unsportsmanlik',\n", " 'screw',\n", " 'dirti',\n", " 'candour',\n", " 'cheat',\n", " 'proport',\n", " 'balanc',\n", " 'inequ',\n", " 'chican',\n", " 'betray',\n", " 'candor',\n", " 'adult',\n", " 'chous',\n", " 'unsport',\n", " 'unfair',\n", " 'two-tim',\n", " 'foul',\n", " 'shaft',\n", " 'fair-mind'],\n", " 'liberty/oppression': ['self-direct',\n", " 'self-suffici',\n", " 'autonomi',\n", " 'conquest',\n", " 'burdensom',\n", " 'independ',\n", " 'subjug',\n", " 'oner',\n", " 'oppress',\n", " 'subject',\n", " 'self-r',\n", " 'liberti',\n", " 'conquer',\n", " 'heavi'],\n", " 'loyalty/betrayal': ['traitor',\n", " 'disloyalti',\n", " 'treason',\n", " 'betray',\n", " 'commit',\n", " 'dedic',\n", " 'commit',\n", " 'consign',\n", " 'perfidi',\n", " 'truth',\n", " 'subver',\n", " 'allegi',\n", " 'trueness',\n", " 'veriti',\n", " 'inscript',\n", " 'treacheri',\n", " 'fealti',\n", " 'loyalti',\n", " 'committ',\n", " 'falsiti'],\n", " 'sanctity/degradation': ['pure',\n", " 'guilt',\n", " 'respect',\n", " 'impur',\n", " 'reward',\n", " 'disrespect',\n", " 'deba',\n", " 'honor',\n", " 'sanctitud',\n", " 'white',\n", " 'sanctiti',\n", " 'honour',\n", " 'holi',\n", " 'degrad',\n", " 'adult',\n", " 'dross',\n", " 'observ',\n", " 'innoc',\n", " 'natur',\n", " 'ingenu',\n", " 'aba',\n", " 'dishonor',\n", " 'puriti',\n", " 'abject',\n", " 'unholi',\n", " 'sinless',\n", " 'humili']}" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mft_dict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try looking up the entries for the other keys by filling in for '...' in the cell below." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# look up a key in mft_dict\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's something odd about some of the entries: they're not words! The entries in this dictionary have been **stemmed**, meaning they have been reduced to their smallest meaningful root. \n", "\n", "We can see why this is helpful with an example. Python can count the number of times one string appears in another using the string method `count`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Counts the number of times the second string appears in the first string\n", "\"Data science is the best major, says data scientist.\".count('science')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It returns one match, for the second word. But 'scientist' is very closely related to 'science', and we will often want to match them both. A stem allows Python to find all words with a common root. Try running the count again with a stem that matches both 'science' and 'scientist'." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fill in the parentheses with a stem that will match both 'science' and 'scientist'\n", "\"Data science is the best major, says data scientist.\".count('...')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another thing you might have noticed is that all the entries in our dictionary are lowercase. This could be a problem when we do our text analysis. Try counting the number of times 'rhetoric' appears in the example sentence." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fill in the parentheses to count how often 'rhetoric' appears in the sentence\n", "\"Rhetoric major says back: NEVER argue with a rhetoric student.\".count('...')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can clearly see the word 'rhetoric' appears twice, but the `count` method returns only 1. That's because Python differentiates between capital and lowercase letters:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'r' == 'R'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get around this, we can use the `.lower()` method, which changes all letters in the string to lowercase:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'rhetoric major says back: never argue with a rhetoric student.'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"Rhetoric major says back: NEVER argue with a rhetoric student.\".lower()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's add a column to our 'speeches' table that contains the lowercase text of the speeches. The `clean_text` function converts the text to lowercase and also applies some of the text cleaning methods seen in Module 01, like removing the punctuation and splitting the text into individual words." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Candidate</th><th>Party</th><th>Type</th><th>Date</th><th>Title</th><th>Speech</th><th>clean_speech</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Jeb Bush</td><td>R</td><td>c</td><td>June 15, 2015</td><td>Remarks Announcing Candidacy for President at ...</td><td>Thank you all very much. I always feel welcome...</td><td>[thank, you, all, very, much, i, always, feel,...</td></tr>\n",
"    <tr><th>1</th><td>Jeb Bush</td><td>R</td><td>c</td><td>July 30, 2015</td><td>Remarks to the National Urban League Conferenc...</td><td>Thank you all very much. I appreciate your hos...</td><td>[thank, you, all, very, much, i, appreciate, y...</td></tr>\n",
"    <tr><th>2</th><td>Jeb Bush</td><td>R</td><td>c</td><td>August 11, 2015</td><td>Remarks at the Ronald Reagan Presidential Libr...</td><td>Thank you very much. It's good to be with all ...</td><td>[thank, you, very, much, it, s, good, to, be, ...</td></tr>\n",
"    <tr><th>3</th><td>Jeb Bush</td><td>R</td><td>c</td><td>September 9, 2015</td><td>Remarks in Garner, North Carolina</td><td>Thank you very much. I appreciate your hospita...</td><td>[thank, you, very, much, i, appreciate, your, ...</td></tr>\n",
"    <tr><th>4</th><td>Jeb Bush</td><td>R</td><td>c</td><td>November 2, 2015</td><td>Remarks in Tampa, Florida</td><td>Thank you. It's great to be in Tampa with so m...</td><td>[thank, you, it, s, great, to, be, in, tampa, ...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Candidate Party Type Date \\\n", "0 Jeb Bush R c June 15, 2015 \n", "1 Jeb Bush R c July 30, 2015 \n", "2 Jeb Bush R c August 11, 2015 \n", "3 Jeb Bush R c September 9, 2015 \n", "4 Jeb Bush R c November 2, 2015 \n", "\n", " Title \\\n", "0 Remarks Announcing Candidacy for President at ... \n", "1 Remarks to the National Urban League Conferenc... \n", "2 Remarks at the Ronald Reagan Presidential Libr... \n", "3 Remarks in Garner, North Carolina \n", "4 Remarks in Tampa, Florida \n", "\n", " Speech \\\n", "0 Thank you all very much. I always feel welcome... \n", "1 Thank you all very much. I appreciate your hos... \n", "2 Thank you very much. It's good to be with all ... \n", "3 Thank you very much. I appreciate your hospita... \n", "4 Thank you. It's great to be in Tampa with so m... \n", "\n", " clean_speech \n", "0 [thank, you, all, very, much, i, always, feel,... \n", "1 [thank, you, all, very, much, i, appreciate, y... \n", "2 [thank, you, very, much, it, s, good, to, be, ... \n", "3 [thank, you, very, much, i, appreciate, your, ... \n", "4 [thank, you, it, s, great, to, be, in, tampa, ... " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def clean_text(text):\n", " # remove punctuation using a regular expression (not covered in these modules)\n", " p = re.compile(r'[^\\w\\s]')\n", " no_punc = p.sub(' ', text)\n", " # convert to lowercase\n", " no_punc_lower = no_punc.lower()\n", " # split into individual words\n", " clean = no_punc_lower.split()\n", " return clean\n", " \n", "speeches['clean_speech'] = [clean_text(s) for s in speeches['Speech']]\n", "\n", "speeches.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Part 2: Exploratory Data Analysis " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our speech data and our dictionary, we can start our exploratory analysis. 
The exploratory analysis in this module will be more focused than in most cases since we already have a model in mind: Moral Foundations Theory.\n", "\n", "To get a sense of how Moral Foundations words were used in campaign speeches, we'll do three things:\n", "1. Count the occurrences of words from our dictionary in each speech\n", "2. Calculate how often words from each category are used by each political party\n", "3. Plot the percentages on a bar graph\n", "\n", "Think about what you know about Moral Foundations Theory. If this data is consistent with the theory, what should our analysis show for Republican candidates? What about for Democratic candidates? Try sketching a possible graph for each political party, assuming that candidates' speech aligns with the theory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Calculating Percentages " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're interested in knowing the percentage of words in each speech that correspond to a Moral Foundation; in other words, how often candidates use words related to a specific foundation. \n", "\n", "(Bonus question: why don't we just use the **number** of Moral Foundation words instead of the **percent** as our test statistic?)\n", "\n", "To calculate the percent, we'll first need the total number of words in each speech." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Candidate</th><th>Party</th><th>Type</th><th>Date</th><th>Title</th><th>Speech</th><th>clean_speech</th><th>total_words</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Jeb Bush</td><td>R</td><td>c</td><td>June 15, 2015</td><td>Remarks Announcing Candidacy for President at ...</td><td>Thank you all very much. I always feel welcome...</td><td>[thank, you, all, very, much, i, always, feel,...</td><td>2284</td></tr>\n",
"    <tr><th>1</th><td>Jeb Bush</td><td>R</td><td>c</td><td>July 30, 2015</td><td>Remarks to the National Urban League Conferenc...</td><td>Thank you all very much. I appreciate your hos...</td><td>[thank, you, all, very, much, i, appreciate, y...</td><td>2638</td></tr>\n",
"    <tr><th>2</th><td>Jeb Bush</td><td>R</td><td>c</td><td>August 11, 2015</td><td>Remarks at the Ronald Reagan Presidential Libr...</td><td>Thank you very much. It's good to be with all ...</td><td>[thank, you, very, much, it, s, good, to, be, ...</td><td>3735</td></tr>\n",
"    <tr><th>3</th><td>Jeb Bush</td><td>R</td><td>c</td><td>September 9, 2015</td><td>Remarks in Garner, North Carolina</td><td>Thank you very much. I appreciate your hospita...</td><td>[thank, you, very, much, i, appreciate, your, ...</td><td>1880</td></tr>\n",
"    <tr><th>4</th><td>Jeb Bush</td><td>R</td><td>c</td><td>November 2, 2015</td><td>Remarks in Tampa, Florida</td><td>Thank you. It's great to be in Tampa with so m...</td><td>[thank, you, it, s, great, to, be, in, tampa, ...</td><td>2550</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Candidate Party Type Date \\\n", "0 Jeb Bush R c June 15, 2015 \n", "1 Jeb Bush R c July 30, 2015 \n", "2 Jeb Bush R c August 11, 2015 \n", "3 Jeb Bush R c September 9, 2015 \n", "4 Jeb Bush R c November 2, 2015 \n", "\n", " Title \\\n", "0 Remarks Announcing Candidacy for President at ... \n", "1 Remarks to the National Urban League Conferenc... \n", "2 Remarks at the Ronald Reagan Presidential Libr... \n", "3 Remarks in Garner, North Carolina \n", "4 Remarks in Tampa, Florida \n", "\n", " Speech \\\n", "0 Thank you all very much. I always feel welcome... \n", "1 Thank you all very much. I appreciate your hos... \n", "2 Thank you very much. It's good to be with all ... \n", "3 Thank you very much. I appreciate your hospita... \n", "4 Thank you. It's great to be in Tampa with so m... \n", "\n", " clean_speech total_words \n", "0 [thank, you, all, very, much, i, always, feel,... 2284 \n", "1 [thank, you, all, very, much, i, appreciate, y... 2638 \n", "2 [thank, you, very, much, it, s, good, to, be, ... 3735 \n", "3 [thank, you, very, much, i, appreciate, your, ... 1880 \n", "4 [thank, you, it, s, great, to, be, in, tampa, ... 2550 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a new column called 'total_words'\n", "speeches['total_words'] = [len(speech) for speech in speeches['clean_speech']]\n", "speeches.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we need to calculate the number of matches to entries in our dictionary for each foundation for each speech. \n", "\n", "Run the next cell to add six new columns to `speeches`, one per foundation, that show the number of word matches." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Candidate</th><th>Party</th><th>Type</th><th>Date</th><th>Title</th><th>Speech</th><th>clean_speech</th><th>total_words</th><th>authority/subversion</th><th>care/harm</th><th>fairness/cheating</th><th>liberty/oppression</th><th>loyalty/betrayal</th><th>sanctity/degradation</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Jeb Bush</td><td>R</td><td>c</td><td>June 15, 2015</td><td>Remarks Announcing Candidacy for President at ...</td><td>Thank you all very much. I always feel welcome...</td><td>[thank, you, all, very, much, i, always, feel,...</td><td>2284</td><td>4.0</td><td>4.0</td><td>3.0</td><td>0.0</td><td>7.0</td><td>4.0</td></tr>\n",
"    <tr><th>1</th><td>Jeb Bush</td><td>R</td><td>c</td><td>July 30, 2015</td><td>Remarks to the National Urban League Conferenc...</td><td>Thank you all very much. I appreciate your hos...</td><td>[thank, you, all, very, much, i, appreciate, y...</td><td>2638</td><td>8.0</td><td>2.0</td><td>7.0</td><td>0.0</td><td>4.0</td><td>9.0</td></tr>\n",
"    <tr><th>2</th><td>Jeb Bush</td><td>R</td><td>c</td><td>August 11, 2015</td><td>Remarks at the Ronald Reagan Presidential Libr...</td><td>Thank you very much. It's good to be with all ...</td><td>[thank, you, very, much, it, s, good, to, be, ...</td><td>3735</td><td>12.0</td><td>5.0</td><td>1.0</td><td>0.0</td><td>4.0</td><td>5.0</td></tr>\n",
"    <tr><th>3</th><td>Jeb Bush</td><td>R</td><td>c</td><td>September 9, 2015</td><td>Remarks in Garner, North Carolina</td><td>Thank you very much. I appreciate your hospita...</td><td>[thank, you, very, much, i, appreciate, your, ...</td><td>1880</td><td>3.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>1.0</td><td>4.0</td></tr>\n",
"    <tr><th>4</th><td>Jeb Bush</td><td>R</td><td>c</td><td>November 2, 2015</td><td>Remarks in Tampa, Florida</td><td>Thank you. It's great to be in Tampa with so m...</td><td>[thank, you, it, s, great, to, be, in, tampa, ...</td><td>2550</td><td>8.0</td><td>3.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>7.0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Candidate Party Type Date \\\n", "0 Jeb Bush R c June 15, 2015 \n", "1 Jeb Bush R c July 30, 2015 \n", "2 Jeb Bush R c August 11, 2015 \n", "3 Jeb Bush R c September 9, 2015 \n", "4 Jeb Bush R c November 2, 2015 \n", "\n", " Title \\\n", "0 Remarks Announcing Candidacy for President at ... \n", "1 Remarks to the National Urban League Conferenc... \n", "2 Remarks at the Ronald Reagan Presidential Libr... \n", "3 Remarks in Garner, North Carolina \n", "4 Remarks in Tampa, Florida \n", "\n", " Speech \\\n", "0 Thank you all very much. I always feel welcome... \n", "1 Thank you all very much. I appreciate your hos... \n", "2 Thank you very much. It's good to be with all ... \n", "3 Thank you very much. I appreciate your hospita... \n", "4 Thank you. It's great to be in Tampa with so m... \n", "\n", " clean_speech total_words \\\n", "0 [thank, you, all, very, much, i, always, feel,... 2284 \n", "1 [thank, you, all, very, much, i, appreciate, y... 2638 \n", "2 [thank, you, very, much, it, s, good, to, be, ... 3735 \n", "3 [thank, you, very, much, i, appreciate, your, ... 1880 \n", "4 [thank, you, it, s, great, to, be, in, tampa, ... 2550 \n", "\n", " authority/subversion care/harm fairness/cheating liberty/oppression \\\n", "0 4.0 4.0 3.0 0.0 \n", "1 8.0 2.0 7.0 0.0 \n", "2 12.0 5.0 1.0 0.0 \n", "3 3.0 1.0 1.0 0.0 \n", "4 8.0 3.0 1.0 1.0 \n", "\n", " loyalty/betrayal sanctity/degradation \n", "0 7.0 4.0 \n", "1 4.0 9.0 \n", "2 4.0 5.0 \n", "3 1.0 4.0 \n", "4 0.0 7.0 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Note: much of the following code is not covered in these modules. 
Read the comments to get a sense of what it does.\n", "\n", "# repeat the following for each foundation\n", "for foundation in mft_dict.keys():\n", "    # start with a count of zero for every speech\n", "    num_match_words = np.zeros(len(speeches))\n", "    stems = mft_dict[foundation]\n", "\n", "    # repeat the following for each stem in this foundation\n", "    for stem in stems:\n", "        # count the words in each speech that start with this stem\n", "        wd_count = np.array([sum([wd.startswith(stem) for wd in speech]) for speech in speeches['clean_speech']])\n", "        # add the number of matches to the total\n", "        num_match_words += wd_count\n", "\n", "    # create a new column for each foundation with the number of foundation words per speech\n", "    speeches[foundation] = num_match_words\n", "\n", "speeches.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To calculate the percentage of foundation words per speech, divide the number of matched words by the number of total words and multiply by 100." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "for foundation in mft_dict.keys():\n", "    speeches[foundation] = (speeches[foundation] / speeches['total_words']) * 100\n", "\n", "speeches.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filtering Table Rows " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To examine the data for a particular political party, it is necessary to filter out rows of our table that correspond to speeches from the other party, something we can do with **Boolean indexing**.\n", "\n", "A **Boolean** is a Python data type with exactly two values: `True` and `False`. A Boolean expression is an expression that evaluates to `True` or `False`. Boolean expressions are often conditions on two variables; that is, they ask how one variable compares to another (e.g. is `a` greater than `b`? Does `a` equal `c`?)." 
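, "\n", "\n", "As a plain-Python preview of the idea, here is a minimal sketch in which a list of Booleans decides which items to keep. The `parties` and `titles` lists are made up for illustration and are not taken from the speech data, and no pandas is involved.\n", "\n", "
```python
# Made-up rows standing in for the speeches table (not real data)
parties = ['R', 'D', 'R', 'D']
titles = ['speech 0', 'speech 1', 'speech 2', 'speech 3']

# One Boolean per row: is this speech from a Democrat?
is_democrat = [party == 'D' for party in parties]

# Keep only the rows whose Boolean is True
democrat_titles = [title for title, keep in zip(titles, is_democrat) if keep]

print(is_democrat)
print(democrat_titles)
```
", "\n", "pandas applies the same idea when a column of Booleans is placed in square brackets after a DataFrame's name."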
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# These are all Booleans\n", "True\n", "\n", "not False\n", "\n", "6 > 0\n", "\n", "\"Ted Cruz\" == \"zodiac killer\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that Python uses `==` to check if two things are equal. This is because the `=` sign is already used for variable assignment.\n", "\n", "Filtering out DataFrame rows can be broken into three steps:\n", "1. identify the correct feature column \n", "2. specify the desired condition for that column\n", "3. index the DataFrame with that condition in square brackets\n", "\n", "Here's an example of how to create a new table with only Bernie Sanders' speeches." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# find the column\n", "candidate_col = speeches['Candidate']\n", "\n", "# specify the condition\n", "sanders_condition = candidate_col == 'Bernie Sanders'\n", "\n", "# index the original DataFrame by the condition\n", "sanders_speeches = speeches[sanders_condition]\n", "sanders_speeches.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Democrats " ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Let's start by looking at Democratic candidates. First, we need to make a table that contains only Democrats' speeches, using Boolean indexing." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Filter out non-Democrat speeches\n", "party_col = speeches['Party']\n", "\n", "dem_cond = party_col == 'D'\n", "\n", "democrats = speeches[dem_cond]\n", "\n", "democrats.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have our percentages for the Democratic Party, but it's much easier to understand what's going on when the results are in graph form. 
Let's start by looking at the average percents for Democrats as a group. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# select the foundations columns and calculate the mean percent for each\n", "avg_dem_stats = (democrats.loc[:, list(mft_dict.keys())]\n", " .apply(np.mean)\n", " .to_frame('D_percent'))\n", "\n", "avg_dem_stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, create a horizontal bar plot by calling the `.plot.barh()` method on `avg_dem_stats`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "avg_dem_stats.plot.barh()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take a look at this graph. What does it show? How does it compare with the predictions of MFT?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Republicans " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's repeat the process for Republicans. Replace the ellipses with the correct code to select only Republican speeches, then run the cell to create the table. \n", "\n", "(Hint: look back at how we made the 'democrats' table to see how to fill in the ellipses)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Filter out non-Republican speeches\n", "\n", "# select 'Party' column from 'speeches'\n", "party_col = speeches['Party']\n", "\n", "# create a condition (boolean expression) that checks if a party is Republican\n", "republican_cond = party_col == 'R'\n", "\n", "# index `speeches` using `republican_cond`\n", "republicans = speeches[republican_cond]\n", "\n", "# uncomment the next line to show the first 5 rows of the `republican` DataFrame\n", "republicans.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, calculate the averages." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# select the foundations columns and calculate the mean percent for each\n", "avg_rep_stats = (republicans.loc[:, list(mft_dict.keys())]\n", " .apply(np.mean)\n", " .to_frame('R_percent'))\n", "\n", "avg_rep_stats " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, create a bar plot of `avg_rep_stats` using the `.plot.barh()` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# your code here\n", "avg_rep_stats.plot.barh()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How does this plot compare with Moral Foundations Theory predictions?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Democrats vs Republicans " ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Comparing two groups becomes much easier when they are plotted on the same graph. \n", "\n", "First, combine `avg_dem_stats` and `avg_rep_stats` into one DataFrame with the `join` function. `join` is called on one table using `.join()`, takes the other table as its argument (in the parentheses), and returns a table with the indices matched. 
\n", "\n", "Here's an example of a simple join:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "peanut_butter = pd.DataFrame(data=[2.99, 3.49], index=['Trader Joes', 'Safeway'], columns=['pb_price'])\n", "peanut_butter" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "jelly = pd.DataFrame(data=[4.99, 3.59], index=['Trader Joes', 'Safeway'], columns=['jelly_price'])\n", "jelly" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "jelly.join(peanut_butter)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, write the code to join `avg_dem_stats` with `avg_rep_stats`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# fill in the ellipses with your code\n", "all_avg_stats = avg_dem_stats.join(avg_rep_stats)\n", "all_avg_stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, make a horizontal bar plot for `all_avg_stats`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# your code here\n", "all_avg_stats.plot.barh()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "It can be hard to compare bars whose lengths are very similar. The next cell plots only the difference in average foundation word usage between Democrats and Republicans. A positive value means Democrats use words from that foundation more frequently; a negative value means Republicans use them more frequently." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# plot the difference in the average percent of foundation words per speech by party (Democrats minus Republicans)\n", "party_diffs = pd.DataFrame(data = avg_dem_stats['D_percent'] - avg_rep_stats['R_percent'],\n", " columns = [\"dem_rep_pct_diff\"], \n", " index = list(mft_dict.keys()))\n", "party_diffs.plot.bar()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Part 3: Additional Visualizations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many different graphs can be generated from the same data set to facilitate different comparisons. For example, we can compare the average use of foundation words by individual Democrats..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dem_indivs = (democrats.loc[:, list(mft_dict.keys()) + ['Candidate']]\n", " .groupby('Candidate')\n", " .mean())\n", "\n", "dem_indivs.plot.barh(figsize=(8, 8))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "...or individual Republicans." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "rep_indivs = (republicans.loc[:, list(mft_dict.keys()) + ['Candidate']]\n", " .groupby('Candidate')\n", " .mean())\n", "\n", "rep_indivs.plot.barh(figsize=(8, 20))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also examine how a candidate uses foundation words over time. The following plot shows foundation word usage for Donald Trump from July 2016 onward, in the months leading up to the election." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# select Trump's speeches and drop unnecessary columns\n", "trump = (republicans[republicans['Candidate'] == \"Donald Trump\"]\n", " .loc[:, list(mft_dict.keys()) + ['Date']])\n", "\n", "# convert the speech dates to datetimes and set them as the table index\n", "trump['Date'] = pd.to_datetime(trump['Date'])\n", "# sort by date so the date slice works, then keep speeches from July 1, 2016 onward\n", "trump = (trump.set_index('Date')\n", " .sort_index()\n", " .loc['2016-07-01':])\n", "\n", "# plot the data\n", "trump.plot(figsize = (10, 6))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What other kinds of plots could be generated from this data? What other questions might we be able to explore with these or other plots?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Part 4: Run Analysis with Your Dictionary " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the advantages of coding is how easy it is to repeat one method of analysis with different parameters. For instance, changing a single line of code means that all of the word counts, proportions, and graphs in the above sections can be recalculated using a different dictionary of Moral Foundations words.\n", "\n", "To change which dictionary is loaded into the `mft_dict` variable, go to [Part 1.2: Moral Foundations Dictionary](#subsection 2)
and follow the instructions in the first code cell. \n", "\n", "Once the dictionary-loading code has been changed, the easiest way to regenerate all the tables, percents, and graphs is to go to the `Cell` menu and click `Run all`. This ensures that all the statistics used to make the graphs will be recalculated with the new dictionary.\n", "\n", "For this assignment, answer the following three questions about the graphs made using **your hand-coded dictionary**:\n", "\n", "1. What does each graph show?\n", "2. How are these graphs different from the ones made using the WordNet dictionary?\n", "3. Do these graphs support Moral Foundations Theory?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Bibliography" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Election documents scraped from http://www.presidency.ucsb.edu/2016_election.php\n", "* Graham, J., Haidt, J., & Nosek, B. A. (2009). Liberals and conservatives rely on different sets of moral foundations. Journal of Personality and Social Psychology, 96(5), 1029. http://projectimplicit.net/nosek/papers/GHN2009.pdf, October 9, 2017." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "Notebook developed by: Keeley Takimoto, Sean Seungwoo Son, Sujude Dalieh\n", "\n", "Data Science Modules: http://data.berkeley.edu/education/modules\n" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" } }, "nbformat": 4, "nbformat_minor": 1 }