{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the fourth in a series of notebooks designed to show you how to analyze social media data. For demonstration purposes we are looking at tweets sent by the CSR-related Twitter accounts -- accounts related to ethics, equality, the environment, etc. -- of Fortune 200 firms in 2013. We assume you have already downloaded the data and completed the steps taken in Chapter 1, Chapter 2, and Chapter 3. In this fourth notebook I will show you how to conduct various analyses of the hashtags included in Twitter data; specifically, we'll cover how to create and graph counts of the most frequently used hashtags and how to build several different tag clouds."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Chapter 4: Analyzing Hashtags"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we will import several necessary Python packages and set some options for viewing the data. As with prior chapters, we will be using the Python Data Analysis Library, or PANDAS, extensively for our data manipulations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import packages and set viewing options"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from pandas import DataFrame\n",
"from pandas import Series"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Set PANDAS to show all columns in DataFrame\n",
"pd.set_option('display.max_columns', None)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I'm using version 0.16.2 of PANDAS."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'0.16.2'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Import graphing packages"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll be producing some figures at the end of this tutorial, so we need to import various graphing capabilities. The default Matplotlib library is solid."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.4.3\n"
]
}
],
"source": [
"import matplotlib\n",
"print matplotlib.__version__"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#NECESSARY FOR XTICKS OPTION, ETC.\n",
"from pylab import *"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the great innovations of the IPython Notebook is the ability to see output and graphics \"inline\" -- that is, on the same page and immediately below each line of code. To enable this feature for graphics we run the following line."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will be using Seaborn to help pretty up the default Matplotlib graphics. Seaborn does not come installed with Anaconda Python, so you will have to open up a terminal and run `pip install seaborn`."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.6.0\n"
]
}
],
"source": [
"import seaborn as sns\n",
"print sns.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following line makes the default plots bigger."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"plt.rcParams['figure.figsize'] = (15, 5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read in data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In Chapter 1 we deleted tweets from one unneeded Twitter account and also omitted several unnecessary columns (variables). We then saved, or \"pickled,\" the updated dataframe. Let's now open this saved file. As we can see in the operations below, this dataframe contains 54 variables for 32,330 tweets."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"32330\n"
]
}
],
"source": [
"df = pd.read_pickle('CSR tweets - 2013 by 41 accounts.pkl')\n",
"print len(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For our analyses we will look at all original tweets -- those that are not retweets. This lets us see more clearly what the organizations choose to include in their own tweets. Here we will rely on the retweeted_status column in our dataframe. This is a variable I created in the code we used to download the tweets: its value is \"THIS IS A RETWEET\" if the tweet is not original and blank otherwise. We'll use it to create a new version of our dataframe, called df_original, comprising all rows in df where the value of retweeted_status does not equal (as indicated by `!=`) \"THIS IS A RETWEET\". We can see that our new dataframe has 26,257 tweets -- meaning 6,073 of the 32,330 tweets were retweets."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"26257\n",
"6073\n"
]
},
{
"data": {
"text/html": [
      "<p>[HTML preview of the first two rows of df_original omitted; it showed all 54 columns, including entities_hashtags, which holds each tweet's hashtags as a comma-separated string.]</p>"
     ]
    },
    "execution_count": 11,
    "metadata": {},
    "output_type": "execute_result"
   }
  ],
  "source": [
   "df_original = df[df['retweeted_status'] != 'THIS IS A RETWEET']\n",
   "print len(df_original)\n",
   "print len(df) - len(df_original)\n",
   "df_original.head(2)"
  ]
 },
 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "### Count hashtag frequencies"
  ]
 },
 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "Now we can count how often each hashtag appears in our original tweets. The cell below is a minimal sketch, assuming the entities_hashtags column stores each tweet's tags as a comma-separated string without the leading `#` (as in the dataframe preview above) and is empty for tweets with no hashtags, so we drop those rows first."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {
   "collapsed": false
  },
  "outputs": [],
  "source": [
   "#Gather every hashtag from the original tweets into one list.\n",
   "#ASSUMPTION: entities_hashtags is a comma-separated string of tags\n",
   "#without the '#' symbol; tweets with no hashtags are NaN.\n",
   "tags = []\n",
   "for row in df_original['entities_hashtags'].dropna():\n",
   "    tags.extend(['#' + tag.strip().lower() for tag in row.split(',')])\n",
   "\n",
   "#Count how often each hashtag appears, most frequent first\n",
   "tag_frequency = Series(tags).value_counts()\n",
   "tag_frequency"
  ]
 },
 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "Our original tweets contain 3,400 distinct hashtags. The most frequently used are #stem (932), #csr (708), #smartercities (586), #edtech (439), #youthspark (412), #education (348), #bigdata (283), #girlrising (276), #impactx (276), and #walmart (272), while thousands of tags appear only once."
  ]
 }
],
"metadata": {
 "kernelspec": {
  "display_name": "Python 2",
  "language": "python",
  "name": "python2"
 }
},
"nbformat": 4,
"nbformat_minor": 0
}