{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is the fifth in a series of notebooks designed to show you how to analyze social media data. For demonstration purposes we are looking at tweets sent by CSR-related Twitter accounts -- accounts related to ethics, equality, the environment, etc. -- of Fortune 200 firms in 2013. We assume you have already downloaded the data and have completed the steps taken in <a href=\"http://nbviewer.ipython.org/github/gdsaxton/PANDAS/blob/master/Chapter%201%20-%20Import%20Data%2C%20Select%20Cases%20and%20Variables%2C%20Save%20DataFrame.ipynb\" target=\"_blank\">Chapter 1</a>, <a href=\"http://nbviewer.ipython.org/github/gdsaxton/PANDAS/blob/master/Chapter%202%20-%20Aggregating%20and%20Analyzing%20Data%20by%20Twitter%20Account.ipynb\" target=\"_blank\">Chapter 2</a>, <a href=\"http://nbviewer.ipython.org/github/gdsaxton/PANDAS/blob/master/Chapter%203%20-%20Analyzing%20Twitter%20Data%20by%20Time%20Period.ipynb\" target=\"_blank\">Chapter 3</a>, and <a href=\"http://nbviewer.ipython.org/github/gdsaxton/PANDAS/blob/master/Chapter%204%20-%20Analyzing%20Hashtags.ipynb\" target=\"_blank\">Chapter 4</a>. In this fifth notebook I will show you how to generate new variables -- such as dummy variables -- from the variables already in your dataframe."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 5: Generating New Variables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we will import several necessary Python packages and set some options for viewing the data. As with prior chapters, we will be using the <a href=\"http://pandas.pydata.org/\">Python Data Analysis Library</a>, or <i>PANDAS</i>, extensively for our data manipulations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import packages and set viewing options"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from pandas import DataFrame\n",
    "from pandas import Series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Set PANDAS to show all columns in DataFrame\n",
    "pd.set_option('display.max_columns', None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I'm using version 0.16.2 of PANDAS."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'0.16.2'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.__version__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read in data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In <a href=\"http://nbviewer.ipython.org/github/gdsaxton/PANDAS/blob/master/Chapter%204%20-%20Analyzing%20Hashtags.ipynb\" target=\"_blank\">Chapter 4</a> we created a version of the dataframe that omitted all retweets, allowing us to focus only on original messages sent by the 41 Twitter accounts. Let's now open this saved file. As the output below shows, this dataframe contains 54 variables for 26,257 tweets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# of variables: 54\n",
      "# of tweets: 26257\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rowid</th>\n",
       "      <th>query</th>\n",
       "      <th>tweet_id_str</th>\n",
       "      <th>inserted_date</th>\n",
       "      <th>language</th>\n",
       "      <th>coordinates</th>\n",
       "      <th>retweeted_status</th>\n",
       "      <th>created_at</th>\n",
       "      <th>month</th>\n",
       "      <th>year</th>\n",
       "      <th>content</th>\n",
       "      <th>from_user_screen_name</th>\n",
       "      <th>from_user_id</th>\n",
       "      <th>from_user_followers_count</th>\n",
       "      <th>from_user_friends_count</th>\n",
       "      <th>from_user_listed_count</th>\n",
       "      <th>from_user_favourites_count</th>\n",
       "      <th>from_user_statuses_count</th>\n",
       "      <th>from_user_description</th>\n",
       "      <th>from_user_location</th>\n",
       "      <th>from_user_created_at</th>\n",
       "      <th>retweet_count</th>\n",
       "      <th>favorite_count</th>\n",
       "      <th>entities_urls</th>\n",
       "      <th>entities_urls_count</th>\n",
       "      <th>entities_hashtags</th>\n",
       "      <th>entities_hashtags_count</th>\n",
       "      <th>entities_mentions</th>\n",
       "      <th>entities_mentions_count</th>\n",
       "      <th>in_reply_to_screen_name</th>\n",
       "      <th>in_reply_to_status_id</th>\n",
       "      <th>source</th>\n",
       "      <th>entities_expanded_urls</th>\n",
       "      <th>entities_media_count</th>\n",
       "      <th>media_expanded_url</th>\n",
       "      <th>media_url</th>\n",
       "      <th>media_type</th>\n",
       "      <th>video_link</th>\n",
       "      <th>photo_link</th>\n",
       "      <th>twitpic</th>\n",
       "      <th>num_characters</th>\n",
       "      <th>num_words</th>\n",
       "      <th>retweeted_user</th>\n",
       "      <th>retweeted_user_description</th>\n",
       "      <th>retweeted_user_screen_name</th>\n",
       "      <th>retweeted_user_followers_count</th>\n",
       "      <th>retweeted_user_listed_count</th>\n",
       "      <th>retweeted_user_statuses_count</th>\n",
       "      <th>retweeted_user_location</th>\n",
       "      <th>retweeted_tweet_created_at</th>\n",
       "      <th>Fortune_2012_rank</th>\n",
       "      <th>Company</th>\n",
       "      <th>CSR_sustainability</th>\n",
       "      <th>specific_project_initiative_area</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>67340</td>\n",
       "      <td>humanavitality</td>\n",
       "      <td>306897327585652736</td>\n",
       "      <td>2014-03-09 13:46:50.222857</td>\n",
       "      <td>en</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2013-02-27 22:43:19.000000</td>\n",
       "      <td>2</td>\n",
       "      <td>2013</td>\n",
       "      <td>@louloushive (Tweet 2) We encourage other empl...</td>\n",
       "      <td>humanavitality</td>\n",
       "      <td>274041023</td>\n",
       "      <td>2859</td>\n",
       "      <td>440</td>\n",
       "      <td>38</td>\n",
       "      <td>25</td>\n",
       "      <td>1766</td>\n",
       "      <td>This is the official Twitter account for Human...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Tue Mar 29 16:23:02 +0000 2011</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>louloushive</td>\n",
       "      <td>1</td>\n",
       "      <td>louloushive</td>\n",
       "      <td>3.062183e+17</td>\n",
       "      <td>web</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>121</td>\n",
       "      <td>19</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>79</td>\n",
       "      <td>Humana</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>39454</td>\n",
       "      <td>FundacionPfizer</td>\n",
       "      <td>308616393706844160</td>\n",
       "      <td>2014-03-09 13:38:20.679967</td>\n",
       "      <td>es</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2013-03-04 16:34:17.000000</td>\n",
       "      <td>3</td>\n",
       "      <td>2013</td>\n",
       "      <td>¿Sabes por qué la #vacuna contra la #neumonía ...</td>\n",
       "      <td>FundacionPfizer</td>\n",
       "      <td>188384056</td>\n",
       "      <td>2464</td>\n",
       "      <td>597</td>\n",
       "      <td>50</td>\n",
       "      <td>11</td>\n",
       "      <td>2400</td>\n",
       "      <td>Noticias sobre Responsabilidad Social y Fundac...</td>\n",
       "      <td>México</td>\n",
       "      <td>Wed Sep 08 16:14:11 +0000 2010</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>vacuna, neumonía</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>web</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>138</td>\n",
       "      <td>20</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>40</td>\n",
       "      <td>Pfizer</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   rowid            query        tweet_id_str               inserted_date  \\\n",
       "0  67340   humanavitality  306897327585652736  2014-03-09 13:46:50.222857   \n",
       "1  39454  FundacionPfizer  308616393706844160  2014-03-09 13:38:20.679967   \n",
       "\n",
       "  language coordinates retweeted_status                  created_at  month  \\\n",
       "0       en         NaN              NaN  2013-02-27 22:43:19.000000      2   \n",
       "1       es         NaN              NaN  2013-03-04 16:34:17.000000      3   \n",
       "\n",
       "   year                                            content  \\\n",
       "0  2013  @louloushive (Tweet 2) We encourage other empl...   \n",
       "1  2013  ¿Sabes por qué la #vacuna contra la #neumonía ...   \n",
       "\n",
       "  from_user_screen_name  from_user_id  from_user_followers_count  \\\n",
       "0        humanavitality     274041023                       2859   \n",
       "1       FundacionPfizer     188384056                       2464   \n",
       "\n",
       "   from_user_friends_count  from_user_listed_count  \\\n",
       "0                      440                      38   \n",
       "1                      597                      50   \n",
       "\n",
       "   from_user_favourites_count  from_user_statuses_count  \\\n",
       "0                          25                      1766   \n",
       "1                          11                      2400   \n",
       "\n",
       "                               from_user_description from_user_location  \\\n",
       "0  This is the official Twitter account for Human...                NaN   \n",
       "1  Noticias sobre Responsabilidad Social y Fundac...             México   \n",
       "\n",
       "             from_user_created_at  retweet_count  favorite_count  \\\n",
       "0  Tue Mar 29 16:23:02 +0000 2011              0               0   \n",
       "1  Wed Sep 08 16:14:11 +0000 2010              1               0   \n",
       "\n",
       "  entities_urls  entities_urls_count entities_hashtags  \\\n",
       "0           NaN                    0               NaN   \n",
       "1           NaN                    0  vacuna, neumonía   \n",
       "\n",
       "   entities_hashtags_count entities_mentions  entities_mentions_count  \\\n",
       "0                        0       louloushive                        1   \n",
       "1                        2               NaN                        0   \n",
       "\n",
       "  in_reply_to_screen_name  in_reply_to_status_id source  \\\n",
       "0             louloushive           3.062183e+17    web   \n",
       "1                     NaN                    NaN    web   \n",
       "\n",
       "  entities_expanded_urls  entities_media_count media_expanded_url media_url  \\\n",
       "0                    NaN                   NaN                NaN       NaN   \n",
       "1                    NaN                   NaN                NaN       NaN   \n",
       "\n",
       "  media_type  video_link  photo_link  twitpic  num_characters  num_words  \\\n",
       "0        NaN           0           0        0             121         19   \n",
       "1        NaN           0           0        0             138         20   \n",
       "\n",
       "   retweeted_user retweeted_user_description retweeted_user_screen_name  \\\n",
       "0             NaN                        NaN                        NaN   \n",
       "1             NaN                        NaN                        NaN   \n",
       "\n",
       "   retweeted_user_followers_count  retweeted_user_listed_count  \\\n",
       "0                             NaN                          NaN   \n",
       "1                             NaN                          NaN   \n",
       "\n",
       "   retweeted_user_statuses_count retweeted_user_location  \\\n",
       "0                            NaN                     NaN   \n",
       "1                            NaN                     NaN   \n",
       "\n",
       "  retweeted_tweet_created_at  Fortune_2012_rank Company  CSR_sustainability  \\\n",
       "0                        NaN                 79  Humana                   0   \n",
       "1                        NaN                 40  Pfizer                   0   \n",
       "\n",
       "   specific_project_initiative_area  \n",
       "0                                 1  \n",
       "1                                 1  "
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_pickle('Original 2013 CSR Tweets.pkl')\n",
    "print(\"# of variables: %d\" % len(df.columns))\n",
    "print(\"# of tweets: %d\" % len(df))\n",
    "df.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Note that, in earlier notebooks, we have been creating new dataframes -- taking our tweet-level data and converting it into a dataframe organized by account, by company, or by time period. Now we are going to do something different. We are going to keep the organization of the data at the same level (tweets) but merely add columns. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generate Dummy Variables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To run statistical procedures we have to convert text-based columns into numerical format. For example, let's say we're interested in exploring whether the language used in a tweet influences how often a message gets favorited or retweeted. To do this we need to transform the `language` variable in our dataset. To see why this is necessary, let's take a closer look at the variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>content</th>\n",
       "      <th>language</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>@louloushive (Tweet 2) We encourage other empl...</td>\n",
       "      <td>en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>¿Sabes por qué la #vacuna contra la #neumonía ...</td>\n",
       "      <td>es</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>RT @droso: @RodrigoReinaL con un tema muy inte...</td>\n",
       "      <td>es</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>\"Every child is born a scientist. We’re all bo...</td>\n",
       "      <td>en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>RT @ClintonGlobal Watch Pres. @BillClinton's f...</td>\n",
       "      <td>en</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                             content language\n",
       "0  @louloushive (Tweet 2) We encourage other empl...       en\n",
       "1  ¿Sabes por qué la #vacuna contra la #neumonía ...       es\n",
       "2  RT @droso: @RodrigoReinaL con un tema muy inte...       es\n",
       "3  \"Every child is born a scientist. We’re all bo...       en\n",
       "4  RT @ClintonGlobal Watch Pres. @BillClinton's f...       en"
      ]
     },
     "execution_count": 89,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[['content', 'language']].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>What we see is that values of this variable are text-based, with \"en\" representing English, \"es\" representing Spanish, etc. Let's see how many languages are used and also inspect the frequencies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "23"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['language'].nunique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "en     25483\n",
       "es       503\n",
       "de        84\n",
       "fr        38\n",
       "und       36\n",
       "tl        15\n",
       "vi        15\n",
       "sk        15\n",
       "pt        12\n",
       "in        10\n",
       "ht         8\n",
       "da         7\n",
       "it         7\n",
       "nl         5\n",
       "id         4\n",
       "pl         3\n",
       "et         3\n",
       "sv         2\n",
       "sl         2\n",
       "fi         2\n",
       "zh         1\n",
       "lv         1\n",
       "ar         1\n",
       "dtype: int64"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['language'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>So there are 23 different language codes in our tweet database, with English by far the most common (not surprising given the data source): 25,483 of the 26,257 tweets, or about 97%. What if we are interested in analyzing whether the \"success\" of each tweet is related to whether the language used was English, Spanish, French, Dutch, or Chinese? To do this in a regression analysis, for instance, we need to generate a separate variable for each of these languages. Most statistical programs include a shortcut for creating these <i>dummy variables</i>.\n",
    "\n",
    "In PANDAS we can generate 23 dummy variables -- one for each language -- with a single line of code. The following command creates the dummies and shows the first five rows of the result (we haven't added them to our dataframe yet)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>lang_ar</th>\n",
       "      <th>lang_da</th>\n",
       "      <th>lang_de</th>\n",
       "      <th>lang_en</th>\n",
       "      <th>lang_es</th>\n",
       "      <th>lang_et</th>\n",
       "      <th>lang_fi</th>\n",
       "      <th>lang_fr</th>\n",
       "      <th>lang_ht</th>\n",
       "      <th>lang_id</th>\n",
       "      <th>lang_in</th>\n",
       "      <th>lang_it</th>\n",
       "      <th>lang_lv</th>\n",
       "      <th>lang_nl</th>\n",
       "      <th>lang_pl</th>\n",
       "      <th>lang_pt</th>\n",
       "      <th>lang_sk</th>\n",
       "      <th>lang_sl</th>\n",
       "      <th>lang_sv</th>\n",
       "      <th>lang_tl</th>\n",
       "      <th>lang_und</th>\n",
       "      <th>lang_vi</th>\n",
       "      <th>lang_zh</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   lang_ar  lang_da  lang_de  lang_en  lang_es  lang_et  lang_fi  lang_fr  \\\n",
       "0        0        0        0        1        0        0        0        0   \n",
       "1        0        0        0        0        1        0        0        0   \n",
       "2        0        0        0        0        1        0        0        0   \n",
       "3        0        0        0        1        0        0        0        0   \n",
       "4        0        0        0        1        0        0        0        0   \n",
       "\n",
       "   lang_ht  lang_id  lang_in  lang_it  lang_lv  lang_nl  lang_pl  lang_pt  \\\n",
       "0        0        0        0        0        0        0        0        0   \n",
       "1        0        0        0        0        0        0        0        0   \n",
       "2        0        0        0        0        0        0        0        0   \n",
       "3        0        0        0        0        0        0        0        0   \n",
       "4        0        0        0        0        0        0        0        0   \n",
       "\n",
       "   lang_sk  lang_sl  lang_sv  lang_tl  lang_und  lang_vi  lang_zh  \n",
       "0        0        0        0        0         0        0        0  \n",
       "1        0        0        0        0         0        0        0  \n",
       "2        0        0        0        0         0        0        0  \n",
       "3        0        0        0        0         0        0        0  \n",
       "4        0        0        0        0         0        0        0  "
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.get_dummies(df['language'], prefix='lang').head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Here's what has happened. Let's take the first two rows as examples. Here is the content of those two tweets to refresh your memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>content</th>\n",
       "      <th>language</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>@louloushive (Tweet 2) We encourage other empl...</td>\n",
       "      <td>en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>¿Sabes por qué la #vacuna contra la #neumonía ...</td>\n",
       "      <td>es</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                             content language\n",
       "0  @louloushive (Tweet 2) We encourage other empl...       en\n",
       "1  ¿Sabes por qué la #vacuna contra la #neumonía ...       es"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[['content', 'language']].head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>The first tweet is in English, as indicated by the value \"en\" in our `language` column. The `get_dummies` command first creates a separate dummy variable for each of our 23 language values; it then assigns each row a value of \"1\" on the appropriate language dummy and a value of \"0\" on each of the other 22 dummy variables. So, in our first row, the value of the dummy variable `lang_en` is \"1\" and the value is \"0\" for all the others. In contrast, the second row is assigned a value of \"1\" on `lang_es` and a value of \"0\" on all the others. In effect, each row has a value of \"1\" on exactly one of the dummy variables.\n",
    "\n",
    "Note that in our commands above we have not actually added our columns to our dataframe. To do that we need to concatenate the columns containing the dummy variables to our primary dataset using PANDAS' `concat` command."
   ]
  },
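  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Before applying this to the full dataset, here is a minimal sketch of both steps on a small made-up dataframe. The names `toy`, `toy_dummies`, `rows_ok`, and `toy_with_dummies`, along with the four example rows, are illustrative assumptions and not part of our tweet data. The sketch checks that every row has a \"1\" on exactly one dummy and then uses `concat` with `axis=1` to attach the dummies as new columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Toy data (hypothetical values, not drawn from the tweet dataset)\n",
    "toy = pd.DataFrame({'content': ['hello', 'hola', 'bonjour', 'hi'],\n",
    "                    'language': ['en', 'es', 'fr', 'en']})\n",
    "\n",
    "# One dummy column per distinct language value\n",
    "toy_dummies = pd.get_dummies(toy['language'], prefix='lang')\n",
    "\n",
    "# Each row should be flagged on exactly one dummy\n",
    "# (newer PANDAS versions may store the flags as True/False rather than 1/0)\n",
    "rows_ok = (toy_dummies.sum(axis=1) == 1).all()\n",
    "\n",
    "# axis=1 appends the dummies as new columns, aligned on the row index\n",
    "toy_with_dummies = pd.concat([toy, toy_dummies], axis=1)\n",
    "toy_with_dummies"
   ]
  },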
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# of variables: 77\n",
      "# of tweets: 26257\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rowid</th>\n",
       "      <th>query</th>\n",
       "      <th>tweet_id_str</th>\n",
       "      <th>inserted_date</th>\n",
       "      <th>language</th>\n",
       "      <th>coordinates</th>\n",
       "      <th>retweeted_status</th>\n",
       "      <th>created_at</th>\n",
       "      <th>month</th>\n",
       "      <th>year</th>\n",
       "      <th>content</th>\n",
       "      <th>from_user_screen_name</th>\n",
       "      <th>from_user_id</th>\n",
       "      <th>from_user_followers_count</th>\n",
       "      <th>from_user_friends_count</th>\n",
       "      <th>from_user_listed_count</th>\n",
       "      <th>from_user_favourites_count</th>\n",
       "      <th>from_user_statuses_count</th>\n",
       "      <th>from_user_description</th>\n",
       "      <th>from_user_location</th>\n",
       "      <th>from_user_created_at</th>\n",
       "      <th>retweet_count</th>\n",
       "      <th>favorite_count</th>\n",
       "      <th>entities_urls</th>\n",
       "      <th>entities_urls_count</th>\n",
       "      <th>entities_hashtags</th>\n",
       "      <th>entities_hashtags_count</th>\n",
       "      <th>entities_mentions</th>\n",
       "      <th>entities_mentions_count</th>\n",
       "      <th>in_reply_to_screen_name</th>\n",
       "      <th>in_reply_to_status_id</th>\n",
       "      <th>source</th>\n",
       "      <th>entities_expanded_urls</th>\n",
       "      <th>entities_media_count</th>\n",
       "      <th>media_expanded_url</th>\n",
       "      <th>media_url</th>\n",
       "      <th>media_type</th>\n",
       "      <th>video_link</th>\n",
       "      <th>photo_link</th>\n",
       "      <th>twitpic</th>\n",
       "      <th>num_characters</th>\n",
       "      <th>num_words</th>\n",
       "      <th>retweeted_user</th>\n",
       "      <th>retweeted_user_description</th>\n",
       "      <th>retweeted_user_screen_name</th>\n",
       "      <th>retweeted_user_followers_count</th>\n",
       "      <th>retweeted_user_listed_count</th>\n",
       "      <th>retweeted_user_statuses_count</th>\n",
       "      <th>retweeted_user_location</th>\n",
       "      <th>retweeted_tweet_created_at</th>\n",
       "      <th>Fortune_2012_rank</th>\n",
       "      <th>Company</th>\n",
       "      <th>CSR_sustainability</th>\n",
       "      <th>specific_project_initiative_area</th>\n",
       "      <th>lang_ar</th>\n",
       "      <th>lang_da</th>\n",
       "      <th>lang_de</th>\n",
       "      <th>lang_en</th>\n",
       "      <th>lang_es</th>\n",
       "      <th>lang_et</th>\n",
       "      <th>lang_fi</th>\n",
       "      <th>lang_fr</th>\n",
       "      <th>lang_ht</th>\n",
       "      <th>lang_id</th>\n",
       "      <th>lang_in</th>\n",
       "      <th>lang_it</th>\n",
       "      <th>lang_lv</th>\n",
       "      <th>lang_nl</th>\n",
       "      <th>lang_pl</th>\n",
       "      <th>lang_pt</th>\n",
       "      <th>lang_sk</th>\n",
       "      <th>lang_sl</th>\n",
       "      <th>lang_sv</th>\n",
       "      <th>lang_tl</th>\n",
       "      <th>lang_und</th>\n",
       "      <th>lang_vi</th>\n",
       "      <th>lang_zh</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>67340</td>\n",
       "      <td>humanavitality</td>\n",
       "      <td>306897327585652736</td>\n",
       "      <td>2014-03-09 13:46:50.222857</td>\n",
       "      <td>en</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2013-02-27 22:43:19.000000</td>\n",
       "      <td>2</td>\n",
       "      <td>2013</td>\n",
       "      <td>@louloushive (Tweet 2) We encourage other empl...</td>\n",
       "      <td>humanavitality</td>\n",
       "      <td>274041023</td>\n",
       "      <td>2859</td>\n",
       "      <td>440</td>\n",
       "      <td>38</td>\n",
       "      <td>25</td>\n",
       "      <td>1766</td>\n",
       "      <td>This is the official Twitter account for Human...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Tue Mar 29 16:23:02 +0000 2011</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>louloushive</td>\n",
       "      <td>1</td>\n",
       "      <td>louloushive</td>\n",
       "      <td>3.062183e+17</td>\n",
       "      <td>web</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>121</td>\n",
       "      <td>19</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>79</td>\n",
       "      <td>Humana</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>39454</td>\n",
       "      <td>FundacionPfizer</td>\n",
       "      <td>308616393706844160</td>\n",
       "      <td>2014-03-09 13:38:20.679967</td>\n",
       "      <td>es</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2013-03-04 16:34:17.000000</td>\n",
       "      <td>3</td>\n",
       "      <td>2013</td>\n",
       "      <td>¿Sabes por qué la #vacuna contra la #neumonía ...</td>\n",
       "      <td>FundacionPfizer</td>\n",
       "      <td>188384056</td>\n",
       "      <td>2464</td>\n",
       "      <td>597</td>\n",
       "      <td>50</td>\n",
       "      <td>11</td>\n",
       "      <td>2400</td>\n",
       "      <td>Noticias sobre Responsabilidad Social y Fundac...</td>\n",
       "      <td>México</td>\n",
       "      <td>Wed Sep 08 16:14:11 +0000 2010</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>vacuna, neumonía</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>web</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>138</td>\n",
       "      <td>20</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>40</td>\n",
       "      <td>Pfizer</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   rowid            query        tweet_id_str               inserted_date  \\\n",
       "0  67340   humanavitality  306897327585652736  2014-03-09 13:46:50.222857   \n",
       "1  39454  FundacionPfizer  308616393706844160  2014-03-09 13:38:20.679967   \n",
       "\n",
       "  language coordinates retweeted_status                  created_at  month  \\\n",
       "0       en         NaN              NaN  2013-02-27 22:43:19.000000      2   \n",
       "1       es         NaN              NaN  2013-03-04 16:34:17.000000      3   \n",
       "\n",
       "   year                                            content  \\\n",
       "0  2013  @louloushive (Tweet 2) We encourage other empl...   \n",
       "1  2013  ¿Sabes por qué la #vacuna contra la #neumonía ...   \n",
       "\n",
       "  from_user_screen_name  from_user_id  from_user_followers_count  \\\n",
       "0        humanavitality     274041023                       2859   \n",
       "1       FundacionPfizer     188384056                       2464   \n",
       "\n",
       "   from_user_friends_count  from_user_listed_count  \\\n",
       "0                      440                      38   \n",
       "1                      597                      50   \n",
       "\n",
       "   from_user_favourites_count  from_user_statuses_count  \\\n",
       "0                          25                      1766   \n",
       "1                          11                      2400   \n",
       "\n",
       "                               from_user_description from_user_location  \\\n",
       "0  This is the official Twitter account for Human...                NaN   \n",
       "1  Noticias sobre Responsabilidad Social y Fundac...             México   \n",
       "\n",
       "             from_user_created_at  retweet_count  favorite_count  \\\n",
       "0  Tue Mar 29 16:23:02 +0000 2011              0               0   \n",
       "1  Wed Sep 08 16:14:11 +0000 2010              1               0   \n",
       "\n",
       "  entities_urls  entities_urls_count entities_hashtags  \\\n",
       "0           NaN                    0               NaN   \n",
       "1           NaN                    0  vacuna, neumonía   \n",
       "\n",
       "   entities_hashtags_count entities_mentions  entities_mentions_count  \\\n",
       "0                        0       louloushive                        1   \n",
       "1                        2               NaN                        0   \n",
       "\n",
       "  in_reply_to_screen_name  in_reply_to_status_id source  \\\n",
       "0             louloushive           3.062183e+17    web   \n",
       "1                     NaN                    NaN    web   \n",
       "\n",
       "  entities_expanded_urls  entities_media_count media_expanded_url media_url  \\\n",
       "0                    NaN                   NaN                NaN       NaN   \n",
       "1                    NaN                   NaN                NaN       NaN   \n",
       "\n",
       "  media_type  video_link  photo_link  twitpic  num_characters  num_words  \\\n",
       "0        NaN           0           0        0             121         19   \n",
       "1        NaN           0           0        0             138         20   \n",
       "\n",
       "   retweeted_user retweeted_user_description retweeted_user_screen_name  \\\n",
       "0             NaN                        NaN                        NaN   \n",
       "1             NaN                        NaN                        NaN   \n",
       "\n",
       "   retweeted_user_followers_count  retweeted_user_listed_count  \\\n",
       "0                             NaN                          NaN   \n",
       "1                             NaN                          NaN   \n",
       "\n",
       "   retweeted_user_statuses_count retweeted_user_location  \\\n",
       "0                            NaN                     NaN   \n",
       "1                            NaN                     NaN   \n",
       "\n",
       "  retweeted_tweet_created_at  Fortune_2012_rank Company  CSR_sustainability  \\\n",
       "0                        NaN                 79  Humana                   0   \n",
       "1                        NaN                 40  Pfizer                   0   \n",
       "\n",
       "   specific_project_initiative_area  lang_ar  lang_da  lang_de  lang_en  \\\n",
       "0                                 1        0        0        0        1   \n",
       "1                                 1        0        0        0        0   \n",
       "\n",
       "   lang_es  lang_et  lang_fi  lang_fr  lang_ht  lang_id  lang_in  lang_it  \\\n",
       "0        0        0        0        0        0        0        0        0   \n",
       "1        1        0        0        0        0        0        0        0   \n",
       "\n",
       "   lang_lv  lang_nl  lang_pl  lang_pt  lang_sk  lang_sl  lang_sv  lang_tl  \\\n",
       "0        0        0        0        0        0        0        0        0   \n",
       "1        0        0        0        0        0        0        0        0   \n",
       "\n",
       "   lang_und  lang_vi  lang_zh  \n",
       "0         0        0        0  \n",
       "1         0        0        0  "
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.concat([df, pd.get_dummies(df['language'], prefix='lang')], axis=1)\n",
    "print \"# of variables:\", len(df.columns)\n",
    "print \"# of tweets:\", len(df)\n",
    "df.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>We see there are now 77 columns in the dataframe (23 more than before) but the same number of rows, exactly as intended. But let's suppose we're not really interested in exploring the differences among all those languages, so let's reload our saved file to revert to our original dataframe with only 54 columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# of variables: 54\n",
      "# of tweets: 26257\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rowid</th>\n",
       "      <th>query</th>\n",
       "      <th>tweet_id_str</th>\n",
       "      <th>inserted_date</th>\n",
       "      <th>language</th>\n",
       "      <th>coordinates</th>\n",
       "      <th>retweeted_status</th>\n",
       "      <th>created_at</th>\n",
       "      <th>month</th>\n",
       "      <th>year</th>\n",
       "      <th>content</th>\n",
       "      <th>from_user_screen_name</th>\n",
       "      <th>from_user_id</th>\n",
       "      <th>from_user_followers_count</th>\n",
       "      <th>from_user_friends_count</th>\n",
       "      <th>from_user_listed_count</th>\n",
       "      <th>from_user_favourites_count</th>\n",
       "      <th>from_user_statuses_count</th>\n",
       "      <th>from_user_description</th>\n",
       "      <th>from_user_location</th>\n",
       "      <th>from_user_created_at</th>\n",
       "      <th>retweet_count</th>\n",
       "      <th>favorite_count</th>\n",
       "      <th>entities_urls</th>\n",
       "      <th>entities_urls_count</th>\n",
       "      <th>entities_hashtags</th>\n",
       "      <th>entities_hashtags_count</th>\n",
       "      <th>entities_mentions</th>\n",
       "      <th>entities_mentions_count</th>\n",
       "      <th>in_reply_to_screen_name</th>\n",
       "      <th>in_reply_to_status_id</th>\n",
       "      <th>source</th>\n",
       "      <th>entities_expanded_urls</th>\n",
       "      <th>entities_media_count</th>\n",
       "      <th>media_expanded_url</th>\n",
       "      <th>media_url</th>\n",
       "      <th>media_type</th>\n",
       "      <th>video_link</th>\n",
       "      <th>photo_link</th>\n",
       "      <th>twitpic</th>\n",
       "      <th>num_characters</th>\n",
       "      <th>num_words</th>\n",
       "      <th>retweeted_user</th>\n",
       "      <th>retweeted_user_description</th>\n",
       "      <th>retweeted_user_screen_name</th>\n",
       "      <th>retweeted_user_followers_count</th>\n",
       "      <th>retweeted_user_listed_count</th>\n",
       "      <th>retweeted_user_statuses_count</th>\n",
       "      <th>retweeted_user_location</th>\n",
       "      <th>retweeted_tweet_created_at</th>\n",
       "      <th>Fortune_2012_rank</th>\n",
       "      <th>Company</th>\n",
       "      <th>CSR_sustainability</th>\n",
       "      <th>specific_project_initiative_area</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>67340</td>\n",
       "      <td>humanavitality</td>\n",
       "      <td>306897327585652736</td>\n",
       "      <td>2014-03-09 13:46:50.222857</td>\n",
       "      <td>en</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2013-02-27 22:43:19.000000</td>\n",
       "      <td>2</td>\n",
       "      <td>2013</td>\n",
       "      <td>@louloushive (Tweet 2) We encourage other empl...</td>\n",
       "      <td>humanavitality</td>\n",
       "      <td>274041023</td>\n",
       "      <td>2859</td>\n",
       "      <td>440</td>\n",
       "      <td>38</td>\n",
       "      <td>25</td>\n",
       "      <td>1766</td>\n",
       "      <td>This is the official Twitter account for Human...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Tue Mar 29 16:23:02 +0000 2011</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>louloushive</td>\n",
       "      <td>1</td>\n",
       "      <td>louloushive</td>\n",
       "      <td>3.062183e+17</td>\n",
       "      <td>web</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>121</td>\n",
       "      <td>19</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>79</td>\n",
       "      <td>Humana</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>39454</td>\n",
       "      <td>FundacionPfizer</td>\n",
       "      <td>308616393706844160</td>\n",
       "      <td>2014-03-09 13:38:20.679967</td>\n",
       "      <td>es</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2013-03-04 16:34:17.000000</td>\n",
       "      <td>3</td>\n",
       "      <td>2013</td>\n",
       "      <td>¿Sabes por qué la #vacuna contra la #neumonía ...</td>\n",
       "      <td>FundacionPfizer</td>\n",
       "      <td>188384056</td>\n",
       "      <td>2464</td>\n",
       "      <td>597</td>\n",
       "      <td>50</td>\n",
       "      <td>11</td>\n",
       "      <td>2400</td>\n",
       "      <td>Noticias sobre Responsabilidad Social y Fundac...</td>\n",
       "      <td>México</td>\n",
       "      <td>Wed Sep 08 16:14:11 +0000 2010</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>vacuna, neumonía</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>web</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>138</td>\n",
       "      <td>20</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>40</td>\n",
       "      <td>Pfizer</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   rowid            query        tweet_id_str               inserted_date  \\\n",
       "0  67340   humanavitality  306897327585652736  2014-03-09 13:46:50.222857   \n",
       "1  39454  FundacionPfizer  308616393706844160  2014-03-09 13:38:20.679967   \n",
       "\n",
       "  language coordinates retweeted_status                  created_at  month  \\\n",
       "0       en         NaN              NaN  2013-02-27 22:43:19.000000      2   \n",
       "1       es         NaN              NaN  2013-03-04 16:34:17.000000      3   \n",
       "\n",
       "   year                                            content  \\\n",
       "0  2013  @louloushive (Tweet 2) We encourage other empl...   \n",
       "1  2013  ¿Sabes por qué la #vacuna contra la #neumonía ...   \n",
       "\n",
       "  from_user_screen_name  from_user_id  from_user_followers_count  \\\n",
       "0        humanavitality     274041023                       2859   \n",
       "1       FundacionPfizer     188384056                       2464   \n",
       "\n",
       "   from_user_friends_count  from_user_listed_count  \\\n",
       "0                      440                      38   \n",
       "1                      597                      50   \n",
       "\n",
       "   from_user_favourites_count  from_user_statuses_count  \\\n",
       "0                          25                      1766   \n",
       "1                          11                      2400   \n",
       "\n",
       "                               from_user_description from_user_location  \\\n",
       "0  This is the official Twitter account for Human...                NaN   \n",
       "1  Noticias sobre Responsabilidad Social y Fundac...             México   \n",
       "\n",
       "             from_user_created_at  retweet_count  favorite_count  \\\n",
       "0  Tue Mar 29 16:23:02 +0000 2011              0               0   \n",
       "1  Wed Sep 08 16:14:11 +0000 2010              1               0   \n",
       "\n",
       "  entities_urls  entities_urls_count entities_hashtags  \\\n",
       "0           NaN                    0               NaN   \n",
       "1           NaN                    0  vacuna, neumonía   \n",
       "\n",
       "   entities_hashtags_count entities_mentions  entities_mentions_count  \\\n",
       "0                        0       louloushive                        1   \n",
       "1                        2               NaN                        0   \n",
       "\n",
       "  in_reply_to_screen_name  in_reply_to_status_id source  \\\n",
       "0             louloushive           3.062183e+17    web   \n",
       "1                     NaN                    NaN    web   \n",
       "\n",
       "  entities_expanded_urls  entities_media_count media_expanded_url media_url  \\\n",
       "0                    NaN                   NaN                NaN       NaN   \n",
       "1                    NaN                   NaN                NaN       NaN   \n",
       "\n",
       "  media_type  video_link  photo_link  twitpic  num_characters  num_words  \\\n",
       "0        NaN           0           0        0             121         19   \n",
       "1        NaN           0           0        0             138         20   \n",
       "\n",
       "   retweeted_user retweeted_user_description retweeted_user_screen_name  \\\n",
       "0             NaN                        NaN                        NaN   \n",
       "1             NaN                        NaN                        NaN   \n",
       "\n",
       "   retweeted_user_followers_count  retweeted_user_listed_count  \\\n",
       "0                             NaN                          NaN   \n",
       "1                             NaN                          NaN   \n",
       "\n",
       "   retweeted_user_statuses_count retweeted_user_location  \\\n",
       "0                            NaN                     NaN   \n",
       "1                            NaN                     NaN   \n",
       "\n",
       "  retweeted_tweet_created_at  Fortune_2012_rank Company  CSR_sustainability  \\\n",
       "0                        NaN                 79  Humana                   0   \n",
       "1                        NaN                 40  Pfizer                   0   \n",
       "\n",
       "   specific_project_initiative_area  \n",
       "0                                 1  \n",
       "1                                 1  "
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_pickle('Original 2013 CSR Tweets.pkl')\n",
    "print \"# of variables:\", len(df.columns)\n",
    "print \"# of tweets:\", len(df)\n",
    "df.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generating Binary Variables from Categorical Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are not always interested in every single value in a text-based or categorical variable. Instead, we're often interested in creating a single binary (0,1) variable from one of the values in a given categorical variable. For instance, let's assume we're interested in examining whether tweets written in English receive more retweets than those written in other languages. To do this we'll need to create a numeric variable with one numerical value for English tweets and another numerical value for non-English tweets. The convention for these binary variables is to use the values \"0\" and \"1\" and to name your variable after the category you're interested in. Accordingly, we'll call our new variable `English` and assign values of \"1\" to tweets written in English, and \"0\" to all non-English tweets.\n",
    "\n",
    "There are two steps to generating our new variable. First we name our new variable and assign it values of \"true\" if the text in our `language` column matches \"en\" and \"false\" otherwise. PANDAS' string methods are powerful. We are using the `match` method here, which tests whether the text in our `language` column begins with the given pattern. Because \"en\" is the only language code in our data that begins with those letters, this works as an exact match here. In other cases you might use the `contains` method instead if you are interested in a match anywhere in the cell. The `na=False` argument assigns \"false\" to any rows where the language is missing."
   ]
  },
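  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see how these string methods differ, the following minimal sketch applies them to a few made-up language codes. The codes `en-gb` and `zh-en` are hypothetical values chosen to make the methods disagree; they do not appear in our data.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical values, including one missing entry\n",
    "codes = pd.Series(['en', 'en-gb', 'zh-en', None])\n",
    "\n",
    "starts_with_en = codes.str.match('en', na=False)   # pattern anchored at the start\n",
    "contains_en = codes.str.contains('en', na=False)   # pattern found anywhere\n",
    "exactly_en = codes == 'en'                         # plain equality, no pattern matching\n",
    "\n",
    "assert starts_with_en.tolist() == [True, True, False, False]\n",
    "assert contains_en.tolist() == [True, True, True, False]\n",
    "assert exactly_en.tolist() == [True, False, False, False]\n",
    "```\n",
    "\n",
    "In our dataframe all three approaches flag the same rows, because `en` is the only language code that starts with or contains those two letters."
   ]
  },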
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>content</th>\n",
       "      <th>language</th>\n",
       "      <th>English</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>@louloushive (Tweet 2) We encourage other empl...</td>\n",
       "      <td>en</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>¿Sabes por qué la #vacuna contra la #neumonía ...</td>\n",
       "      <td>es</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>RT @droso: @RodrigoReinaL con un tema muy inte...</td>\n",
       "      <td>es</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>\"Every child is born a scientist. We’re all bo...</td>\n",
       "      <td>en</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>RT @ClintonGlobal Watch Pres. @BillClinton's f...</td>\n",
       "      <td>en</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                             content language English\n",
       "0  @louloushive (Tweet 2) We encourage other empl...       en    True\n",
       "1  ¿Sabes por qué la #vacuna contra la #neumonía ...       es   False\n",
       "2  RT @droso: @RodrigoReinaL con un tema muy inte...       es   False\n",
       "3  \"Every child is born a scientist. We’re all bo...       en    True\n",
       "4  RT @ClintonGlobal Watch Pres. @BillClinton's f...       en    True"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['English'] = df['language'].str.match('en', na=False)\n",
    "df[['content', 'language', 'English']].head()           #SHOW FIRST FIVE ROWS OF THREE CHOSEN COLUMNS OF DATAFRAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>In the second step we convert our variable from true/false values to a numerical format. As you can see in the output, each row now has a value of \"0\" or \"1\", with tweets in English being assigned values of \"1\"."
   ]
  },
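  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The conversion itself can be sketched on a small made-up series of true/false values. The name `flags` is hypothetical, and `astype(int)` is shown only as an alternative to the `astype(float)` call used in this notebook.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical true/false values standing in for the English column\n",
    "flags = pd.Series([True, False, True])\n",
    "\n",
    "as_float = flags.astype(float)   # True becomes 1.0, False becomes 0.0\n",
    "as_int = flags.astype(int)       # True becomes 1, False becomes 0\n",
    "\n",
    "assert as_float.tolist() == [1.0, 0.0, 1.0]\n",
    "assert as_int.tolist() == [1, 0, 1]\n",
    "\n",
    "# The mean of a 0/1 variable is the share of rows coded 1\n",
    "assert abs(as_int.mean() - 2.0 / 3) < 1e-9\n",
    "```\n",
    "\n",
    "Either numeric type works for a binary variable; the integer version simply avoids the decimal point."
   ]
  },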
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>content</th>\n",
       "      <th>language</th>\n",
       "      <th>English</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>@louloushive (Tweet 2) We encourage other empl...</td>\n",
       "      <td>en</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>¿Sabes por qué la #vacuna contra la #neumonía ...</td>\n",
       "      <td>es</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>RT @droso: @RodrigoReinaL con un tema muy inte...</td>\n",
       "      <td>es</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>\"Every child is born a scientist. We’re all bo...</td>\n",
       "      <td>en</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>RT @ClintonGlobal Watch Pres. @BillClinton's f...</td>\n",
       "      <td>en</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                             content language  English\n",
       "0  @louloushive (Tweet 2) We encourage other empl...       en        1\n",
       "1  ¿Sabes por qué la #vacuna contra la #neumonía ...       es        0\n",
       "2  RT @droso: @RodrigoReinaL con un tema muy inte...       es        0\n",
       "3  \"Every child is born a scientist. We’re all bo...       en        1\n",
       "4  RT @ClintonGlobal Watch Pres. @BillClinton's f...       en        1"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['English'] = df['English'].astype(float)\n",
    "df[['content', 'language', 'English']].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Below we can see that our new column has been added to the end of the dataframe. There are now 55 variables and the same number of tweets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# of variables: 55\n",
      "# of tweets: 26257\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rowid</th>\n",
       "      <th>query</th>\n",
       "      <th>tweet_id_str</th>\n",
       "      <th>inserted_date</th>\n",
       "      <th>language</th>\n",
       "      <th>coordinates</th>\n",
       "      <th>retweeted_status</th>\n",
       "      <th>created_at</th>\n",
       "      <th>month</th>\n",
       "      <th>year</th>\n",
       "      <th>content</th>\n",
       "      <th>from_user_screen_name</th>\n",
       "      <th>from_user_id</th>\n",
       "      <th>from_user_followers_count</th>\n",
       "      <th>from_user_friends_count</th>\n",
       "      <th>from_user_listed_count</th>\n",
       "      <th>from_user_favourites_count</th>\n",
       "      <th>from_user_statuses_count</th>\n",
       "      <th>from_user_description</th>\n",
       "      <th>from_user_location</th>\n",
       "      <th>from_user_created_at</th>\n",
       "      <th>retweet_count</th>\n",
       "      <th>favorite_count</th>\n",
       "      <th>entities_urls</th>\n",
       "      <th>entities_urls_count</th>\n",
       "      <th>entities_hashtags</th>\n",
       "      <th>entities_hashtags_count</th>\n",
       "      <th>entities_mentions</th>\n",
       "      <th>entities_mentions_count</th>\n",
       "      <th>in_reply_to_screen_name</th>\n",
       "      <th>in_reply_to_status_id</th>\n",
       "      <th>source</th>\n",
       "      <th>entities_expanded_urls</th>\n",
       "      <th>entities_media_count</th>\n",
       "      <th>media_expanded_url</th>\n",
       "      <th>media_url</th>\n",
       "      <th>media_type</th>\n",
       "      <th>video_link</th>\n",
       "      <th>photo_link</th>\n",
       "      <th>twitpic</th>\n",
       "      <th>num_characters</th>\n",
       "      <th>num_words</th>\n",
       "      <th>retweeted_user</th>\n",
       "      <th>retweeted_user_description</th>\n",
       "      <th>retweeted_user_screen_name</th>\n",
       "      <th>retweeted_user_followers_count</th>\n",
       "      <th>retweeted_user_listed_count</th>\n",
       "      <th>retweeted_user_statuses_count</th>\n",
       "      <th>retweeted_user_location</th>\n",
       "      <th>retweeted_tweet_created_at</th>\n",
       "      <th>Fortune_2012_rank</th>\n",
       "      <th>Company</th>\n",
       "      <th>CSR_sustainability</th>\n",
       "      <th>specific_project_initiative_area</th>\n",
       "      <th>English</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>67340</td>\n",
       "      <td>humanavitality</td>\n",
       "      <td>306897327585652736</td>\n",
       "      <td>2014-03-09 13:46:50.222857</td>\n",
       "      <td>en</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2013-02-27 22:43:19.000000</td>\n",
       "      <td>2</td>\n",
       "      <td>2013</td>\n",
       "      <td>@louloushive (Tweet 2) We encourage other empl...</td>\n",
       "      <td>humanavitality</td>\n",
       "      <td>274041023</td>\n",
       "      <td>2859</td>\n",
       "      <td>440</td>\n",
       "      <td>38</td>\n",
       "      <td>25</td>\n",
       "      <td>1766</td>\n",
       "      <td>This is the official Twitter account for Human...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Tue Mar 29 16:23:02 +0000 2011</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>louloushive</td>\n",
       "      <td>1</td>\n",
       "      <td>louloushive</td>\n",
       "      <td>3.062183e+17</td>\n",
       "      <td>web</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>121</td>\n",
       "      <td>19</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>79</td>\n",
       "      <td>Humana</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>39454</td>\n",
       "      <td>FundacionPfizer</td>\n",
       "      <td>308616393706844160</td>\n",
       "      <td>2014-03-09 13:38:20.679967</td>\n",
       "      <td>es</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2013-03-04 16:34:17.000000</td>\n",
       "      <td>3</td>\n",
       "      <td>2013</td>\n",
       "      <td>¿Sabes por qué la #vacuna contra la #neumonía ...</td>\n",
       "      <td>FundacionPfizer</td>\n",
       "      <td>188384056</td>\n",
       "      <td>2464</td>\n",
       "      <td>597</td>\n",
       "      <td>50</td>\n",
       "      <td>11</td>\n",
       "      <td>2400</td>\n",
       "      <td>Noticias sobre Responsabilidad Social y Fundac...</td>\n",
       "      <td>México</td>\n",
       "      <td>Wed Sep 08 16:14:11 +0000 2010</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>vacuna, neumonía</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>web</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>138</td>\n",
       "      <td>20</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>40</td>\n",
       "      <td>Pfizer</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   rowid            query        tweet_id_str               inserted_date  \\\n",
       "0  67340   humanavitality  306897327585652736  2014-03-09 13:46:50.222857   \n",
       "1  39454  FundacionPfizer  308616393706844160  2014-03-09 13:38:20.679967   \n",
       "\n",
       "  language coordinates retweeted_status                  created_at  month  \\\n",
       "0       en         NaN              NaN  2013-02-27 22:43:19.000000      2   \n",
       "1       es         NaN              NaN  2013-03-04 16:34:17.000000      3   \n",
       "\n",
       "   year                                            content  \\\n",
       "0  2013  @louloushive (Tweet 2) We encourage other empl...   \n",
       "1  2013  ¿Sabes por qué la #vacuna contra la #neumonía ...   \n",
       "\n",
       "  from_user_screen_name  from_user_id  from_user_followers_count  \\\n",
       "0        humanavitality     274041023                       2859   \n",
       "1       FundacionPfizer     188384056                       2464   \n",
       "\n",
       "   from_user_friends_count  from_user_listed_count  \\\n",
       "0                      440                      38   \n",
       "1                      597                      50   \n",
       "\n",
       "   from_user_favourites_count  from_user_statuses_count  \\\n",
       "0                          25                      1766   \n",
       "1                          11                      2400   \n",
       "\n",
       "                               from_user_description from_user_location  \\\n",
       "0  This is the official Twitter account for Human...                NaN   \n",
       "1  Noticias sobre Responsabilidad Social y Fundac...             México   \n",
       "\n",
       "             from_user_created_at  retweet_count  favorite_count  \\\n",
       "0  Tue Mar 29 16:23:02 +0000 2011              0               0   \n",
       "1  Wed Sep 08 16:14:11 +0000 2010              1               0   \n",
       "\n",
       "  entities_urls  entities_urls_count entities_hashtags  \\\n",
       "0           NaN                    0               NaN   \n",
       "1           NaN                    0  vacuna, neumonía   \n",
       "\n",
       "   entities_hashtags_count entities_mentions  entities_mentions_count  \\\n",
       "0                        0       louloushive                        1   \n",
       "1                        2               NaN                        0   \n",
       "\n",
       "  in_reply_to_screen_name  in_reply_to_status_id source  \\\n",
       "0             louloushive           3.062183e+17    web   \n",
       "1                     NaN                    NaN    web   \n",
       "\n",
       "  entities_expanded_urls  entities_media_count media_expanded_url media_url  \\\n",
       "0                    NaN                   NaN                NaN       NaN   \n",
       "1                    NaN                   NaN                NaN       NaN   \n",
       "\n",
       "  media_type  video_link  photo_link  twitpic  num_characters  num_words  \\\n",
       "0        NaN           0           0        0             121         19   \n",
       "1        NaN           0           0        0             138         20   \n",
       "\n",
       "   retweeted_user retweeted_user_description retweeted_user_screen_name  \\\n",
       "0             NaN                        NaN                        NaN   \n",
       "1             NaN                        NaN                        NaN   \n",
       "\n",
       "   retweeted_user_followers_count  retweeted_user_listed_count  \\\n",
       "0                             NaN                          NaN   \n",
       "1                             NaN                          NaN   \n",
       "\n",
       "   retweeted_user_statuses_count retweeted_user_location  \\\n",
       "0                            NaN                     NaN   \n",
       "1                            NaN                     NaN   \n",
       "\n",
       "  retweeted_tweet_created_at  Fortune_2012_rank Company  CSR_sustainability  \\\n",
       "0                        NaN                 79  Humana                   0   \n",
       "1                        NaN                 40  Pfizer                   0   \n",
       "\n",
       "   specific_project_initiative_area  English  \n",
       "0                                 1        1  \n",
       "1                                 1        0  "
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print \"# of variables:\", len(df.columns)\n",
    "print  \"# of tweets:\", len(df)\n",
    "df.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generating Binary Variables from Numerical Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also are often interested in generating binary versions from numerical data in our dataframe. Most tweets, for instance, are never retweeted even once. We might thus be interested in analyzing what differentiates those tweets that do get retweeted from those that don't. Here we thus create a new binary variable called `RTs_binary`. We use the `numpy` package's `where` function to look for tweets with a retweet count of 0 (`df[retweet_count']==0`). The final two numbers in the first line of code (`0,1`) indicate that we are assigning values of \"0\" to tweets that meet this condition, otherwise assigning values of \"1\". We now have a variable that numerically differentiates retweeted tweets from ignored tweets. This comes in handy for logistic regression, which we'll delve into in a future tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# of variables in dataframe: 56\n",
      "# of tweets in dataframe: 26257\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>content</th>\n",
       "      <th>retweet_count</th>\n",
       "      <th>RTs_binary</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>@louloushive (Tweet 2) We encourage other empl...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>¿Sabes por qué la #vacuna contra la #neumonía ...</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>RT @droso: @RodrigoReinaL con un tema muy inte...</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>\"Every child is born a scientist. We’re all bo...</td>\n",
       "      <td>198</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>RT @ClintonGlobal Watch Pres. @BillClinton's f...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                             content  retweet_count  \\\n",
       "0  @louloushive (Tweet 2) We encourage other empl...              0   \n",
       "1  ¿Sabes por qué la #vacuna contra la #neumonía ...              1   \n",
       "2  RT @droso: @RodrigoReinaL con un tema muy inte...              3   \n",
       "3  \"Every child is born a scientist. We’re all bo...            198   \n",
       "4  RT @ClintonGlobal Watch Pres. @BillClinton's f...              0   \n",
       "\n",
       "   RTs_binary  \n",
       "0           0  \n",
       "1           1  \n",
       "2           1  \n",
       "3           1  \n",
       "4           0  "
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['RTs_binary'] = np.where(df['retweet_count']==0, 0, 1)\n",
    "print \"# of variables in dataframe:\", len(df.columns)\n",
    "print  \"# of tweets in dataframe:\", len(df)\n",
    "df[['content','retweet_count','RTs_binary']].head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Let's also do the same for the favorite count."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# of variables in dataframe: 57\n",
      "# of tweets in dataframe: 26257\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>content</th>\n",
       "      <th>favorite_count</th>\n",
       "      <th>favorites_binary</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>@louloushive (Tweet 2) We encourage other empl...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>¿Sabes por qué la #vacuna contra la #neumonía ...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>RT @droso: @RodrigoReinaL con un tema muy inte...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>\"Every child is born a scientist. We’re all bo...</td>\n",
       "      <td>99</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>RT @ClintonGlobal Watch Pres. @BillClinton's f...</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                             content  favorite_count  \\\n",
       "0  @louloushive (Tweet 2) We encourage other empl...               0   \n",
       "1  ¿Sabes por qué la #vacuna contra la #neumonía ...               0   \n",
       "2  RT @droso: @RodrigoReinaL con un tema muy inte...               0   \n",
       "3  \"Every child is born a scientist. We’re all bo...              99   \n",
       "4  RT @ClintonGlobal Watch Pres. @BillClinton's f...               1   \n",
       "\n",
       "   favorites_binary  \n",
       "0                 0  \n",
       "1                 0  \n",
       "2                 0  \n",
       "3                 1  \n",
       "4                 1  "
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['favorites_binary'] = np.where(df['favorite_count']==0, 0, 1)\n",
    "print \"# of variables in dataframe:\", len(df.columns)\n",
    "print  \"# of tweets in dataframe:\", len(df)\n",
    "df[['content','favorite_count','favorites_binary']].head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>While we're at it, let's create three final binary variables indicating whether the tweet contains any hashtags, user mentions, or URLs, respectively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# of variables in dataframe: 60\n",
      "# of tweets in dataframe: 26257\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>content</th>\n",
       "      <th>entities_hashtags_count</th>\n",
       "      <th>hashtags_binary</th>\n",
       "      <th>entities_mentions_count</th>\n",
       "      <th>mentions_binary</th>\n",
       "      <th>entities_urls_count</th>\n",
       "      <th>URLs_binary</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>@louloushive (Tweet 2) We encourage other empl...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>¿Sabes por qué la #vacuna contra la #neumonía ...</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>RT @droso: @RodrigoReinaL con un tema muy inte...</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>\"Every child is born a scientist. We’re all bo...</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>RT @ClintonGlobal Watch Pres. @BillClinton's f...</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                             content  entities_hashtags_count  \\\n",
       "0  @louloushive (Tweet 2) We encourage other empl...                        0   \n",
       "1  ¿Sabes por qué la #vacuna contra la #neumonía ...                        2   \n",
       "2  RT @droso: @RodrigoReinaL con un tema muy inte...                        1   \n",
       "3  \"Every child is born a scientist. We’re all bo...                        2   \n",
       "4  RT @ClintonGlobal Watch Pres. @BillClinton's f...                        1   \n",
       "\n",
       "   hashtags_binary  entities_mentions_count  mentions_binary  \\\n",
       "0                0                        1                1   \n",
       "1                1                        0                0   \n",
       "2                1                        2                1   \n",
       "3                1                        0                0   \n",
       "4                1                        3                1   \n",
       "\n",
       "   entities_urls_count  URLs_binary  \n",
       "0                    0            0  \n",
       "1                    0            0  \n",
       "2                    1            1  \n",
       "3                    1            1  \n",
       "4                    1            1  "
      ]
     },
     "execution_count": 85,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['hashtags_binary'] = np.where(df['entities_hashtags_count']==0, 0, 1)\n",
    "df['mentions_binary'] = np.where(df['entities_mentions_count']==0, 0, 1)\n",
    "df['URLs_binary'] = np.where(df['entities_urls_count']==0, 0, 1)\n",
    "print \"# of variables in dataframe:\", len(df.columns)\n",
    "print  \"# of tweets in dataframe:\", len(df)\n",
    "df[['content','entities_hashtags_count','hashtags_binary','entities_mentions_count','mentions_binary','entities_urls_count','URLs_binary']].head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Below we can see that our six new binary variables have been included as new columns in our dataframe. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rowid</th>\n",
       "      <th>query</th>\n",
       "      <th>tweet_id_str</th>\n",
       "      <th>inserted_date</th>\n",
       "      <th>language</th>\n",
       "      <th>coordinates</th>\n",
       "      <th>retweeted_status</th>\n",
       "      <th>created_at</th>\n",
       "      <th>month</th>\n",
       "      <th>year</th>\n",
       "      <th>content</th>\n",
       "      <th>from_user_screen_name</th>\n",
       "      <th>from_user_id</th>\n",
       "      <th>from_user_followers_count</th>\n",
       "      <th>from_user_friends_count</th>\n",
       "      <th>from_user_listed_count</th>\n",
       "      <th>from_user_favourites_count</th>\n",
       "      <th>from_user_statuses_count</th>\n",
       "      <th>from_user_description</th>\n",
       "      <th>from_user_location</th>\n",
       "      <th>from_user_created_at</th>\n",
       "      <th>retweet_count</th>\n",
       "      <th>favorite_count</th>\n",
       "      <th>entities_urls</th>\n",
       "      <th>entities_urls_count</th>\n",
       "      <th>entities_hashtags</th>\n",
       "      <th>entities_hashtags_count</th>\n",
       "      <th>entities_mentions</th>\n",
       "      <th>entities_mentions_count</th>\n",
       "      <th>in_reply_to_screen_name</th>\n",
       "      <th>in_reply_to_status_id</th>\n",
       "      <th>source</th>\n",
       "      <th>entities_expanded_urls</th>\n",
       "      <th>entities_media_count</th>\n",
       "      <th>media_expanded_url</th>\n",
       "      <th>media_url</th>\n",
       "      <th>media_type</th>\n",
       "      <th>video_link</th>\n",
       "      <th>photo_link</th>\n",
       "      <th>twitpic</th>\n",
       "      <th>num_characters</th>\n",
       "      <th>num_words</th>\n",
       "      <th>retweeted_user</th>\n",
       "      <th>retweeted_user_description</th>\n",
       "      <th>retweeted_user_screen_name</th>\n",
       "      <th>retweeted_user_followers_count</th>\n",
       "      <th>retweeted_user_listed_count</th>\n",
       "      <th>retweeted_user_statuses_count</th>\n",
       "      <th>retweeted_user_location</th>\n",
       "      <th>retweeted_tweet_created_at</th>\n",
       "      <th>Fortune_2012_rank</th>\n",
       "      <th>Company</th>\n",
       "      <th>CSR_sustainability</th>\n",
       "      <th>specific_project_initiative_area</th>\n",
       "      <th>English</th>\n",
       "      <th>RTs_binary</th>\n",
       "      <th>favorites_binary</th>\n",
       "      <th>hashtags_binary</th>\n",
       "      <th>mentions_binary</th>\n",
       "      <th>URLs_binary</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>67340</td>\n",
       "      <td>humanavitality</td>\n",
       "      <td>306897327585652736</td>\n",
       "      <td>2014-03-09 13:46:50.222857</td>\n",
       "      <td>en</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2013-02-27 22:43:19.000000</td>\n",
       "      <td>2</td>\n",
       "      <td>2013</td>\n",
       "      <td>@louloushive (Tweet 2) We encourage other empl...</td>\n",
       "      <td>humanavitality</td>\n",
       "      <td>274041023</td>\n",
       "      <td>2859</td>\n",
       "      <td>440</td>\n",
       "      <td>38</td>\n",
       "      <td>25</td>\n",
       "      <td>1766</td>\n",
       "      <td>This is the official Twitter account for Human...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Tue Mar 29 16:23:02 +0000 2011</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>louloushive</td>\n",
       "      <td>1</td>\n",
       "      <td>louloushive</td>\n",
       "      <td>3.062183e+17</td>\n",
       "      <td>web</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>121</td>\n",
       "      <td>19</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>79</td>\n",
       "      <td>Humana</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>39454</td>\n",
       "      <td>FundacionPfizer</td>\n",
       "      <td>308616393706844160</td>\n",
       "      <td>2014-03-09 13:38:20.679967</td>\n",
       "      <td>es</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2013-03-04 16:34:17.000000</td>\n",
       "      <td>3</td>\n",
       "      <td>2013</td>\n",
       "      <td>¿Sabes por qué la #vacuna contra la #neumonía ...</td>\n",
       "      <td>FundacionPfizer</td>\n",
       "      <td>188384056</td>\n",
       "      <td>2464</td>\n",
       "      <td>597</td>\n",
       "      <td>50</td>\n",
       "      <td>11</td>\n",
       "      <td>2400</td>\n",
       "      <td>Noticias sobre Responsabilidad Social y Fundac...</td>\n",
       "      <td>México</td>\n",
       "      <td>Wed Sep 08 16:14:11 +0000 2010</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>vacuna, neumonía</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>web</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>138</td>\n",
       "      <td>20</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>40</td>\n",
       "      <td>Pfizer</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   rowid            query        tweet_id_str               inserted_date  \\\n",
       "0  67340   humanavitality  306897327585652736  2014-03-09 13:46:50.222857   \n",
       "1  39454  FundacionPfizer  308616393706844160  2014-03-09 13:38:20.679967   \n",
       "\n",
       "  language coordinates retweeted_status                  created_at  month  \\\n",
       "0       en         NaN              NaN  2013-02-27 22:43:19.000000      2   \n",
       "1       es         NaN              NaN  2013-03-04 16:34:17.000000      3   \n",
       "\n",
       "   year                                            content  \\\n",
       "0  2013  @louloushive (Tweet 2) We encourage other empl...   \n",
       "1  2013  ¿Sabes por qué la #vacuna contra la #neumonía ...   \n",
       "\n",
       "  from_user_screen_name  from_user_id  from_user_followers_count  \\\n",
       "0        humanavitality     274041023                       2859   \n",
       "1       FundacionPfizer     188384056                       2464   \n",
       "\n",
       "   from_user_friends_count  from_user_listed_count  \\\n",
       "0                      440                      38   \n",
       "1                      597                      50   \n",
       "\n",
       "   from_user_favourites_count  from_user_statuses_count  \\\n",
       "0                          25                      1766   \n",
       "1                          11                      2400   \n",
       "\n",
       "                               from_user_description from_user_location  \\\n",
       "0  This is the official Twitter account for Human...                NaN   \n",
       "1  Noticias sobre Responsabilidad Social y Fundac...             México   \n",
       "\n",
       "             from_user_created_at  retweet_count  favorite_count  \\\n",
       "0  Tue Mar 29 16:23:02 +0000 2011              0               0   \n",
       "1  Wed Sep 08 16:14:11 +0000 2010              1               0   \n",
       "\n",
       "  entities_urls  entities_urls_count entities_hashtags  \\\n",
       "0           NaN                    0               NaN   \n",
       "1           NaN                    0  vacuna, neumonía   \n",
       "\n",
       "   entities_hashtags_count entities_mentions  entities_mentions_count  \\\n",
       "0                        0       louloushive                        1   \n",
       "1                        2               NaN                        0   \n",
       "\n",
       "  in_reply_to_screen_name  in_reply_to_status_id source  \\\n",
       "0             louloushive           3.062183e+17    web   \n",
       "1                     NaN                    NaN    web   \n",
       "\n",
       "  entities_expanded_urls  entities_media_count media_expanded_url media_url  \\\n",
       "0                    NaN                   NaN                NaN       NaN   \n",
       "1                    NaN                   NaN                NaN       NaN   \n",
       "\n",
       "  media_type  video_link  photo_link  twitpic  num_characters  num_words  \\\n",
       "0        NaN           0           0        0             121         19   \n",
       "1        NaN           0           0        0             138         20   \n",
       "\n",
       "   retweeted_user retweeted_user_description retweeted_user_screen_name  \\\n",
       "0             NaN                        NaN                        NaN   \n",
       "1             NaN                        NaN                        NaN   \n",
       "\n",
       "   retweeted_user_followers_count  retweeted_user_listed_count  \\\n",
       "0                             NaN                          NaN   \n",
       "1                             NaN                          NaN   \n",
       "\n",
       "   retweeted_user_statuses_count retweeted_user_location  \\\n",
       "0                            NaN                     NaN   \n",
       "1                            NaN                     NaN   \n",
       "\n",
       "  retweeted_tweet_created_at  Fortune_2012_rank Company  CSR_sustainability  \\\n",
       "0                        NaN                 79  Humana                   0   \n",
       "1                        NaN                 40  Pfizer                   0   \n",
       "\n",
       "   specific_project_initiative_area  English  RTs_binary  favorites_binary  \\\n",
       "0                                 1        1           0                 0   \n",
       "1                                 1        0           1                 0   \n",
       "\n",
       "   hashtags_binary  mentions_binary  URLs_binary  \n",
       "0                0                1            0  \n",
       "1                1                0            0  "
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Verification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before we start using our new data, it is critical that we verify that our data transformations worked. So let's take a couple of steps, using our new variable `RTs_binary` as the example. First, we are expecting a new variable with values of only 0 or 1; `value_counts()` will tell us whether the new variable has the expected values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1    14334\n",
       "0    11923\n",
       "dtype: int64"
      ]
     },
     "execution_count": 91,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['RTs_binary'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>OK, that's perfect. But we should also double-check that every value of our old variable was translated properly. We can run this check visually with PANDAS' `crosstab` function. As expected, we see that there are no instances where a `retweet_count` value of `0` has been assigned anything other than `0` on `RTs_binary`, and likewise, no instances where a `retweet_count` value greater than `0` has been assigned anything other than `1` on `RTs_binary`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>RTs_binary</th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>retweet_count</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>11923</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>5583</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>2908</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>1744</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>1067</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0</td>\n",
       "      <td>710</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0</td>\n",
       "      <td>470</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>0</td>\n",
       "      <td>339</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>0</td>\n",
       "      <td>236</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>0</td>\n",
       "      <td>174</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>0</td>\n",
       "      <td>122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>0</td>\n",
       "      <td>120</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>0</td>\n",
       "      <td>103</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>0</td>\n",
       "      <td>68</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>0</td>\n",
       "      <td>62</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>0</td>\n",
       "      <td>42</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>0</td>\n",
       "      <td>42</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>0</td>\n",
       "      <td>38</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>0</td>\n",
       "      <td>38</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>0</td>\n",
       "      <td>27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>0</td>\n",
       "      <td>26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>0</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>0</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>0</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>0</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>0</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>0</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>0</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>0</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>0</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>446</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>448</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>468</th>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>471</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>473</th>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>476</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>491</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>505</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>517</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>528</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>557</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>559</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>578</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>581</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>585</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>596</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>601</th>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>648</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>655</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>656</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>750</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>850</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>910</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1113</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1423</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1549</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1756</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1899</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2979</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3719</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>190 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "RTs_binary         0     1\n",
       "retweet_count             \n",
       "0              11923     0\n",
       "1                  0  5583\n",
       "2                  0  2908\n",
       "3                  0  1744\n",
       "4                  0  1067\n",
       "5                  0   710\n",
       "6                  0   470\n",
       "7                  0   339\n",
       "8                  0   236\n",
       "9                  0   174\n",
       "10                 0   122\n",
       "11                 0   120\n",
       "12                 0   103\n",
       "13                 0    68\n",
       "14                 0    62\n",
       "15                 0    42\n",
       "16                 0    42\n",
       "17                 0    38\n",
       "18                 0    38\n",
       "19                 0    27\n",
       "20                 0    26\n",
       "21                 0    15\n",
       "22                 0    16\n",
       "23                 0    18\n",
       "24                 0    12\n",
       "25                 0    17\n",
       "26                 0    13\n",
       "27                 0    13\n",
       "28                 0     8\n",
       "29                 0     6\n",
       "...              ...   ...\n",
       "446                0     1\n",
       "448                0     1\n",
       "468                0     2\n",
       "471                0     1\n",
       "473                0     2\n",
       "476                0     1\n",
       "491                0     1\n",
       "505                0     1\n",
       "517                0     1\n",
       "528                0     1\n",
       "557                0     1\n",
       "559                0     1\n",
       "578                0     1\n",
       "581                0     1\n",
       "585                0     1\n",
       "596                0     1\n",
       "601                0     2\n",
       "648                0     1\n",
       "655                0     1\n",
       "656                0     1\n",
       "750                0     1\n",
       "850                0     1\n",
       "910                0     1\n",
       "1113               0     1\n",
       "1423               0     1\n",
       "1549               0     1\n",
       "1756               0     1\n",
       "1899               0     1\n",
       "2979               0     1\n",
       "3719               0     1\n",
       "\n",
       "[190 rows x 2 columns]"
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.crosstab(df['retweet_count'], df['RTs_binary'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Alternatively, we could run a conditional version of the above crosstab to see the same thing in a condensed format. Everything is as expected: Our new binary variable `RTs_binary` constitutes a perfect binary representation of our original ratio-level `retweet_count` variable. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>RTs_binary</th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>retweet_count</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>False</th>\n",
       "      <td>11923</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>0</td>\n",
       "      <td>14334</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "RTs_binary         0      1\n",
       "retweet_count              \n",
       "False          11923      0\n",
       "True               0  14334"
      ]
     },
     "execution_count": 83,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.crosstab(df['retweet_count']>0, df['RTs_binary'])"
   ]
  },
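  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>The visual checks above can also be expressed as assertions, which fail loudly if anything is off. The following is a minimal sketch rather than part of the original workflow; it assumes `df` is the dataframe built in the cells above, and the name `expected` is introduced here purely for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Assumes `df` is the dataframe from the cells above\n",
    "# 1) The new variable should contain only 0s and 1s (missing values would fail this check)\n",
    "assert df['RTs_binary'].isin([0, 1]).all()\n",
    "\n",
    "# 2) It should equal 1 exactly when the tweet was retweeted at least once\n",
    "expected = (df['retweet_count'] > 0).astype(int)\n",
    "assert (df['RTs_binary'] == expected).all()"
   ]
  },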
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save new dataframe"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now save a copy of this dataframe for future use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df.to_pickle('Original 2013 CSR Tweets with 3 binary variables.pkl')"
   ]
  },
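  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>As an optional sanity check, we could read the pickled file back in and confirm that it matches the dataframe in memory. This is only a sketch: it assumes the file was written to the current working directory by the cell above, and `df_check` is a name introduced here for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Reload the saved file and compare it with the dataframe still in memory\n",
    "df_check = pd.read_pickle('Original 2013 CSR Tweets with 3 binary variables.pkl')\n",
    "assert df_check.shape == df.shape\n",
    "assert list(df_check.columns) == list(df.columns)"
   ]
  },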
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>In this tutorial we have covered how to generate new variables from your existing data. Specifically, we have covered how to create dummy variables from an existing categorical variable. We have also covered how to generate binary variables indicating the presence of URLs, user mentions, and hashtags; whether a tweet was written in English; and whether a tweet was favorited or retweeted. Such data transformations are essential for moving beyond mere description to statistical analyses of the data.\n",
    "\n",
    "This is intended merely as an introduction. There are many other methods for generating new variables. For some additional recipes, see <a href=\"http://social-metrics.org/python-pandas-cookbook/#Generating_New_Variables_Arrays_etc\" target=\"_blank\">my PANDAS cookbook</a>.\n",
    "\n",
    "For more Notebooks as well as additional Python and Big Data tutorials, please visit http://social-metrics.org or follow me on Twitter <a href='https://twitter.com/gregorysaxton'>@gregorysaxton</a>.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}