{ "metadata": { "name": "Tutorial - Import Data, Select Cases, Save Dataset" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": "This is the first in a series of notebooks designed to show you how to analyze social media data. We assume you have already downloaded the data and are now ready to begin examining it. In this first notebook I will show you how to set up your ipython working environment and import the Twitter data we have downloaded." }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": "Chapter 1: Set up Jupyter, Import Twitter Data and Select Cases" }, { "cell_type": "markdown", "metadata": {}, "source": "
\n\nFirst, we will import several necessary Python packages. We will be using the Python Data Analysis Library, or PANDAS, extensively for our data manipulations. It is invaluable for analyzing datasets. " }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": "Import packages" }, { "cell_type": "code", "collapsed": false, "input": "import numpy as np\nimport pandas as pd\nfrom pandas import DataFrame\nfrom pandas import Series", "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": "
\n\nPANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include this line at the top of all my notebooks." }, { "cell_type": "code", "collapsed": false, "input": "#Set PANDAS to show all columns in DataFrame\npd.set_option('display.max_columns', None)", "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": "
\n\nWe can check which version of various packages we're using. You can see I'm running PANDAS 0.13 here." }, { "cell_type": "code", "collapsed": false, "input": "print pd.__version__", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "0.13.1\n" } ], "prompt_number": 86 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": "Read in data" }, { "cell_type": "markdown", "metadata": {}, "source": "PANDAS can read in data from a variety of different data types. If you've followed some of my earlier tutorials you have downloaded tweets into an SQLite database, then converted to a CSV file. That's what we have here. We have a set of tweets by Fortune 200 firms. So, in the following three lines we'll first import the CSV file and assign it to the name 'df' -- short for 'dataframe', the PANDAS name for a dataset. Second, we'll use the len function to see how many rows (tweets) there are in the dataset; there are 34,097 tweets in total. Finally, we will use the head function to show the first two rows of the dataset. " }, { "cell_type": "code", "collapsed": false, "input": "df = pd.read_csv('CSR_user_timeline_2013.csv', sep=',', low_memory=False)\nprint len(df)\ndf.head(2)", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "34097\n" }, { "html": "
\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
rowidquerytweet_idtweet_id_strinserted_datetruncatedlanguagepossibly_sensitivecoordinatesretweeted_statuswithheld_in_countrieswithheld_scopecreated_at_textcreated_atmonthyearcontentfrom_user_screen_namefrom_user_idfrom_user_followers_countfrom_user_friends_countfrom_user_listed_countfrom_user_favourites_countfrom_user_statuses_countfrom_user_descriptionfrom_user_locationfrom_user_created_atretweet_countfavorite_countentities_urlsentities_urls_countentities_hashtagsentities_hashtags_countentities_mentionsentities_mentions_countin_reply_to_screen_namein_reply_to_status_idsourceentities_expanded_urlsentities_media_countmedia_expanded_urlmedia_urlmedia_typevideo_linkphoto_linktwitpicnum_charactersnum_wordsretweeted_userretweeted_user_descriptionretweeted_user_screen_nameretweeted_user_followers_countretweeted_user_listed_countretweeted_user_statuses_countretweeted_user_locationretweeted_tweet_created_atFortune_2012_rankCompanyCSR_sustainabilityspecific_project_initiative_area
0 67340 humanavitality 306897327585652736 306897327585652736 2014-03-09 13:46:50.222857 0 enNaN NaN NaNNaNNaN Wed Feb 27 22:43:19 +0000 2013 2013-02-27 22:43:19.000000 2 2013 @louloushive (Tweet 2) We encourage other empl... humanavitality 274041023 2859 440 38 25 1766 This is the official Twitter account for Human... NaN Tue Mar 29 16:23:02 +0000 2011 0 0 NaN 0 NaN 0 louloushive 1 louloushive 3.062183e+17 web NaNNaN NaN NaN NaN 0 0 0 121 19NaN NaN NaNNaNNaNNaN NaN NaN 79 Humana 0 1
1 39454 FundacionPfizer 308616393706844160 308616393706844160 2014-03-09 13:38:20.679967 0 esNaN NaN NaNNaNNaN Mon Mar 04 16:34:17 +0000 2013 2013-03-04 16:34:17.000000 3 2013 \u00bfSabes por qu\u00e9 la #vacuna contra la #neumon\u00eda ... FundacionPfizer 188384056 2464 597 50 11 2400 Noticias sobre Responsabilidad Social y Fundac... M\u00e9xico Wed Sep 08 16:14:11 +0000 2010 1 0 NaN 0 vacuna, neumon\u00eda 2 NaN 0 NaN NaN web NaNNaN NaN NaN NaN 0 0 0 138 20NaN NaN NaNNaNNaNNaN NaN NaN 40 Pfizer 0 1
\n

2 rows \u00d7 60 columns

\n
", "metadata": {}, "output_type": "pyout", "prompt_number": 67, "text": " rowid query tweet_id tweet_id_str \\\n0 67340 humanavitality 306897327585652736 306897327585652736 \n1 39454 FundacionPfizer 308616393706844160 308616393706844160 \n\n inserted_date truncated language possibly_sensitive \\\n0 2014-03-09 13:46:50.222857 0 en NaN \n1 2014-03-09 13:38:20.679967 0 es NaN \n\n coordinates retweeted_status withheld_in_countries withheld_scope \\\n0 NaN NaN NaN NaN \n1 NaN NaN NaN NaN \n\n created_at_text created_at month year \\\n0 Wed Feb 27 22:43:19 +0000 2013 2013-02-27 22:43:19.000000 2 2013 \n1 Mon Mar 04 16:34:17 +0000 2013 2013-03-04 16:34:17.000000 3 2013 \n\n content from_user_screen_name \\\n0 @louloushive (Tweet 2) We encourage other empl... humanavitality \n1 \u00bfSabes por qu\u00e9 la #vacuna contra la #neumon\u00eda ... FundacionPfizer \n\n from_user_id from_user_followers_count from_user_friends_count \\\n0 274041023 2859 440 \n1 188384056 2464 597 \n\n from_user_listed_count from_user_favourites_count \\\n0 38 25 \n1 50 11 \n\n from_user_statuses_count \\\n0 1766 \n1 2400 \n\n from_user_description from_user_location \\\n0 This is the official Twitter account for Human... NaN \n1 Noticias sobre Responsabilidad Social y Fundac... M\u00e9xico \n\n from_user_created_at retweet_count favorite_count \\\n0 Tue Mar 29 16:23:02 +0000 2011 0 0 \n1 Wed Sep 08 16:14:11 +0000 2010 1 0 \n\n entities_urls entities_urls_count entities_hashtags \\\n0 NaN 0 NaN \n1 NaN 0 vacuna, neumon\u00eda \n\n entities_hashtags_count entities_mentions entities_mentions_count \\\n0 0 louloushive 1 \n1 2 NaN 0 \n\n in_reply_to_screen_name in_reply_to_status_id source \\\n0 louloushive 3.062183e+17 web \n1 NaN NaN web \n\n entities_expanded_urls entities_media_count media_expanded_url media_url \\\n0 NaN NaN NaN NaN \n1 NaN NaN NaN NaN \n\n media_type video_link photo_link twitpic num_characters num_words \\\n0 NaN 0 0 0 121 19 \n1 NaN 0 0 0 138 20 \n\n retweeted_user retweeted_user_description retweeted_user_screen_name \\\n0 NaN NaN NaN \n1 NaN NaN NaN \n\n retweeted_user_followers_count retweeted_user_listed_count \\\n0 NaN NaN \n1 NaN NaN \n\n retweeted_user_statuses_count retweeted_user_location \\\n0 NaN NaN \n1 NaN NaN \n\n retweeted_tweet_created_at Fortune_2012_rank Company CSR_sustainability \\\n0 NaN 79 Humana 0 \n1 NaN 40 Pfizer 0 \n\n specific_project_initiative_area \n0 1 \n1 1 \n\n[2 rows x 60 columns]" } ], "prompt_number": 67 }, { "cell_type": "markdown", "metadata": {}, "source": "
\n\nList all the columns in the DataFrame" }, { "cell_type": "code", "collapsed": false, "input": "df.columns", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 68, "text": "Index([u'rowid', u'query', u'tweet_id', u'tweet_id_str', u'inserted_date', u'truncated', u'language', u'possibly_sensitive', u'coordinates', u'retweeted_status', u'withheld_in_countries', u'withheld_scope', u'created_at_text', u'created_at', u'month', u'year', u'content', u'from_user_screen_name', u'from_user_id', u'from_user_followers_count', u'from_user_friends_count', u'from_user_listed_count', u'from_user_favourites_count', u'from_user_statuses_count', u'from_user_description', u'from_user_location', u'from_user_created_at', u'retweet_count', u'favorite_count', u'entities_urls', u'entities_urls_count', u'entities_hashtags', u'entities_hashtags_count', u'entities_mentions', u'entities_mentions_count', u'in_reply_to_screen_name', u'in_reply_to_status_id', u'source', u'entities_expanded_urls', u'entities_media_count', u'media_expanded_url', u'media_url', u'media_type', u'video_link', u'photo_link', u'twitpic', u'num_characters', u'num_words', u'retweeted_user', u'retweeted_user_description', u'retweeted_user_screen_name', u'retweeted_user_followers_count', u'retweeted_user_listed_count', u'retweeted_user_statuses_count', u'retweeted_user_location', u'retweeted_tweet_created_at', u'Fortune_2012_rank', u'Company', u'CSR_sustainability', u'specific_project_initiative_area'], dtype='object')" } ], "prompt_number": 68 }, { "cell_type": "markdown", "metadata": {}, "source": "
We can use the len function again here to see how many columns there are in the dataframe: 60." }, { "cell_type": "code", "collapsed": false, "input": "len(df.columns)", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 69, "text": "60" } ], "prompt_number": 69 }, { "cell_type": "markdown", "metadata": {}, "source": "
We should also inspect the format for our columns. We can see that some are integers, some are 'float' (can have a decimal), and some are 'objects' (text). If you have a identifying text variable that has accidentally been imported as a float, for instance, that could cause problems down the road, so you should fix it before continuing." }, { "cell_type": "code", "collapsed": false, "input": "df.dtypes", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 85, "text": "rowid int64\nquery object\ntweet_id_str int64\ninserted_date object\nlanguage object\ncoordinates object\nretweeted_status object\ncreated_at object\nmonth int64\nyear int64\ncontent object\nfrom_user_screen_name object\nfrom_user_id int64\nfrom_user_followers_count int64\nfrom_user_friends_count int64\nfrom_user_listed_count int64\nfrom_user_favourites_count int64\nfrom_user_statuses_count int64\nfrom_user_description object\nfrom_user_location object\nfrom_user_created_at object\nretweet_count int64\nfavorite_count int64\nentities_urls object\nentities_urls_count int64\nentities_hashtags object\nentities_hashtags_count int64\nentities_mentions object\nentities_mentions_count int64\nin_reply_to_screen_name object\nin_reply_to_status_id float64\nsource object\nentities_expanded_urls object\nentities_media_count float64\nmedia_expanded_url object\nmedia_url object\nmedia_type object\nvideo_link int64\nphoto_link int64\ntwitpic int64\nnum_characters int64\nnum_words int64\nretweeted_user float64\nretweeted_user_description object\nretweeted_user_screen_name object\nretweeted_user_followers_count float64\nretweeted_user_listed_count float64\nretweeted_user_statuses_count float64\nretweeted_user_location object\nretweeted_tweet_created_at object\nFortune_2012_rank int64\nCompany object\nCSR_sustainability int64\nspecific_project_initiative_area int64\nLength: 54, dtype: object" } ], "prompt_number": 85 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": "Remove Unneeded Columns" }, { "cell_type": "markdown", "metadata": {}, "source": "Every researcher will have different preferences about keeping or deleting unneeded columns. You might want to delete variables to make your dataset cleaner, you might want to save memory, or you might want a smaller dataset for some specific analyses. In any case, we can use the drop command to delete individual columns. Let's drop six that are not needed here. The first two have duplicate columns in another format, while the latter four all have zero variation (all are blank).\t " }, { "cell_type": "code", "collapsed": false, "input": "df = df.drop('created_at_text',1)\ndf = df.drop('tweet_id',1)\ndf = df.drop('withheld_in_countries',1)\ndf = df.drop('withheld_scope',1)\ndf = df.drop('truncated',1)\ndf = df.drop('possibly_sensitive',1)", "language": "python", "metadata": {}, "outputs": [], "prompt_number": 70 }, { "cell_type": "markdown", "metadata": {}, "source": "
There are now 54 columns in the dataframe." }, { "cell_type": "code", "collapsed": false, "input": "len(df.columns)", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 71, "text": "54" } ], "prompt_number": 71 }, { "cell_type": "code", "collapsed": false, "input": "df.head(2)", "language": "python", "metadata": {}, "outputs": [ { "html": "
\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
rowidquerytweet_id_strinserted_datelanguagecoordinatesretweeted_statuscreated_atmonthyearcontentfrom_user_screen_namefrom_user_idfrom_user_followers_countfrom_user_friends_countfrom_user_listed_countfrom_user_favourites_countfrom_user_statuses_countfrom_user_descriptionfrom_user_locationfrom_user_created_atretweet_countfavorite_countentities_urlsentities_urls_countentities_hashtagsentities_hashtags_countentities_mentionsentities_mentions_countin_reply_to_screen_namein_reply_to_status_idsourceentities_expanded_urlsentities_media_countmedia_expanded_urlmedia_urlmedia_typevideo_linkphoto_linktwitpicnum_charactersnum_wordsretweeted_userretweeted_user_descriptionretweeted_user_screen_nameretweeted_user_followers_countretweeted_user_listed_countretweeted_user_statuses_countretweeted_user_locationretweeted_tweet_created_atFortune_2012_rankCompanyCSR_sustainabilityspecific_project_initiative_area
0 67340 humanavitality 306897327585652736 2014-03-09 13:46:50.222857 en NaN NaN 2013-02-27 22:43:19.000000 2 2013 @louloushive (Tweet 2) We encourage other empl... humanavitality 274041023 2859 440 38 25 1766 This is the official Twitter account for Human... NaN Tue Mar 29 16:23:02 +0000 2011 0 0 NaN 0 NaN 0 louloushive 1 louloushive 3.062183e+17 web NaNNaN NaN NaN NaN 0 0 0 121 19NaN NaN NaNNaNNaNNaN NaN NaN 79 Humana 0 1
1 39454 FundacionPfizer 308616393706844160 2014-03-09 13:38:20.679967 es NaN NaN 2013-03-04 16:34:17.000000 3 2013 \u00bfSabes por qu\u00e9 la #vacuna contra la #neumon\u00eda ... FundacionPfizer 188384056 2464 597 50 11 2400 Noticias sobre Responsabilidad Social y Fundac... M\u00e9xico Wed Sep 08 16:14:11 +0000 2010 1 0 NaN 0 vacuna, neumon\u00eda 2 NaN 0 NaN NaN web NaNNaN NaN NaN NaN 0 0 0 138 20NaN NaN NaNNaNNaNNaN NaN NaN 40 Pfizer 0 1
\n

2 rows \u00d7 54 columns

\n
", "metadata": {}, "output_type": "pyout", "prompt_number": 72, "text": " rowid query tweet_id_str inserted_date \\\n0 67340 humanavitality 306897327585652736 2014-03-09 13:46:50.222857 \n1 39454 FundacionPfizer 308616393706844160 2014-03-09 13:38:20.679967 \n\n language coordinates retweeted_status created_at month \\\n0 en NaN NaN 2013-02-27 22:43:19.000000 2 \n1 es NaN NaN 2013-03-04 16:34:17.000000 3 \n\n year content \\\n0 2013 @louloushive (Tweet 2) We encourage other empl... \n1 2013 \u00bfSabes por qu\u00e9 la #vacuna contra la #neumon\u00eda ... \n\n from_user_screen_name from_user_id from_user_followers_count \\\n0 humanavitality 274041023 2859 \n1 FundacionPfizer 188384056 2464 \n\n from_user_friends_count from_user_listed_count \\\n0 440 38 \n1 597 50 \n\n from_user_favourites_count from_user_statuses_count \\\n0 25 1766 \n1 11 2400 \n\n from_user_description from_user_location \\\n0 This is the official Twitter account for Human... NaN \n1 Noticias sobre Responsabilidad Social y Fundac... M\u00e9xico \n\n from_user_created_at retweet_count favorite_count \\\n0 Tue Mar 29 16:23:02 +0000 2011 0 0 \n1 Wed Sep 08 16:14:11 +0000 2010 1 0 \n\n entities_urls entities_urls_count entities_hashtags \\\n0 NaN 0 NaN \n1 NaN 0 vacuna, neumon\u00eda \n\n entities_hashtags_count entities_mentions entities_mentions_count \\\n0 0 louloushive 1 \n1 2 NaN 0 \n\n in_reply_to_screen_name in_reply_to_status_id source \\\n0 louloushive 3.062183e+17 web \n1 NaN NaN web \n\n entities_expanded_urls entities_media_count media_expanded_url media_url \\\n0 NaN NaN NaN NaN \n1 NaN NaN NaN NaN \n\n media_type video_link photo_link twitpic num_characters num_words \\\n0 NaN 0 0 0 121 19 \n1 NaN 0 0 0 138 20 \n\n retweeted_user retweeted_user_description retweeted_user_screen_name \\\n0 NaN NaN NaN \n1 NaN NaN NaN \n\n retweeted_user_followers_count retweeted_user_listed_count \\\n0 NaN NaN \n1 NaN NaN \n\n retweeted_user_statuses_count retweeted_user_location \\\n0 NaN NaN \n1 NaN NaN \n\n retweeted_tweet_created_at Fortune_2012_rank Company CSR_sustainability \\\n0 NaN 79 Humana 0 \n1 NaN 40 Pfizer 0 \n\n specific_project_initiative_area \n0 1 \n1 1 \n\n[2 rows x 54 columns]" } ], "prompt_number": 72 }, { "cell_type": "markdown", "metadata": {}, "source": "
If you have only a few columns to delete you can use the drop command as shown above. On the other hand, if you only want to keep a few columns, you can create a new version of the dataframe with only those columns you like. Note that the double square brackets -- \"[[...]]\" -- in PANDAS forms a dataframe representation. In the following example, I am creating a new dataframe with only three variables. You can see that this new dataframe has the same number of tweets but fewer columns (variables)." }, { "cell_type": "code", "collapsed": false, "input": "df2 = df[['created_at', 'from_user_screen_name', 'retweet_count']]\nprint len(df2)\ndf2.head(2)", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "34097\n" }, { "html": "
\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
created_atfrom_user_screen_nameretweet_count
0 2013-02-27 22:43:19.000000 humanavitality 0
1 2013-03-04 16:34:17.000000 FundacionPfizer 1
\n

2 rows \u00d7 3 columns

\n
", "metadata": {}, "output_type": "pyout", "prompt_number": 73, "text": " created_at from_user_screen_name retweet_count\n0 2013-02-27 22:43:19.000000 humanavitality 0\n1 2013-03-04 16:34:17.000000 FundacionPfizer 1\n\n[2 rows x 3 columns]" } ], "prompt_number": 73 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": "View Twitter Accounts Represented in DF " }, { "cell_type": "markdown", "metadata": {}, "source": "We can use the unique function to find how many unique Twitter accounts are represented in the dataset. First, I'll show you what unique function does -- it creates an array of all the screen_names of the Twitter accounts." }, { "cell_type": "code", "collapsed": false, "input": "pd.unique(df.from_user_screen_name.ravel())", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 74, "text": "array(['humanavitality', 'FundacionPfizer', 'Intelinvolved', 'CiscoCSR',\n 'IBMSmartCities', 'citizenIBM', 'BofA_Community', 'PG_CSDW',\n 'mathmovesu', 'PPGIdeascapes', 'gehealthy', 'msftcitizenship',\n 'TICalculators', 'CiscoEDU', 'DuPont_ability', 'Dell4Good',\n 'verizongiving', 'DellEDU', 'SprintGreenNews', 'TeachingMoney',\n 'WalmartGreen', 'ecomagination', 'WalmartAction', 'Microsoft_Green',\n 'Comcastdreambig', 'DE_Youtility', 'hpglobalcitizen',\n 'CitizenDisney', 'AmgenFoundation', 'ATTAspire', 'AlcoaFoundation',\n 'nikebetterworld', 'HoneywellBuild', 'googlestudents',\n '3M_FoodSafety', 'HeartRescue', 'EnviroSears', 'ClickToEmpower',\n 'GreenIBM', 'PromesaPepsiCo', 'FordDriveGreen', 'JNJStories'], dtype=object)" } ], "prompt_number": 74 }, { "cell_type": "markdown", "metadata": {}, "source": "
Note again how we can use len to find out how many accounts are in the array: 42" }, { "cell_type": "code", "collapsed": false, "input": "len(pd.unique(df.from_user_screen_name.ravel()))", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 75, "text": "42" } ], "prompt_number": 75 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": "Remove Tweets from One Specific Account" }, { "cell_type": "markdown", "metadata": {}, "source": "We want to get rid of all tweets by TICalculators from the dataframe. Unlike the other 41 Twitter accounts in the dataset, this account is not a CSR-related account. First, we can use the len function combined with a dataframe query to count the number of tweets that are not sent by TICalculators: 32,300" }, { "cell_type": "code", "collapsed": false, "input": "len(df[df['from_user_screen_name'] != 'TICalculators'])", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 76, "text": "32330" } ], "prompt_number": 76 }, { "cell_type": "markdown", "metadata": {}, "source": "
We should then also check how many tweets are sent by TICalculators: 1,767" }, { "cell_type": "code", "collapsed": false, "input": "len(df[df['from_user_screen_name'] == 'TICalculators'])", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 77, "text": "1767" } ], "prompt_number": 77 }, { "cell_type": "markdown", "metadata": {}, "source": "
We can use Python to do \"math.\" Let's use this to show whether the two numbers returned in the above steps add up to the total number of tweets in our dataframe. They do. While this may seem like an unnecessary step, it is always critical to perform such basic data checks in order to avert unexpected data disasters." }, { "cell_type": "code", "collapsed": false, "input": "1767 + 32330", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 78, "text": "34097" } ], "prompt_number": 78 }, { "cell_type": "markdown", "metadata": {}, "source": "
We can also do this another way" }, { "cell_type": "code", "collapsed": false, "input": "(1767 + 32330) - len(df)", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 79, "text": "0" } ], "prompt_number": 79 }, { "cell_type": "markdown", "metadata": {}, "source": "
Or even" }, { "cell_type": "code", "collapsed": false, "input": "len(df[df['from_user_screen_name'] != 'TICalculators']) + len(df[df['from_user_screen_name'] == 'TICalculators']) - len(df)", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 80, "text": "0" } ], "prompt_number": 80 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": "Remove Tweets by TICalculators" }, { "cell_type": "markdown", "metadata": {}, "source": "In the next block of code we will create a new version of our dataframe, this time limiting it to only those tweets that are not sent by TICalculators. As we can see, there are now 32,300 tweets in this dataframe, the same number as we calculated above. " }, { "cell_type": "code", "collapsed": false, "input": "df = df[df['from_user_screen_name'] != 'TICalculators']\nprint len(df)\ndf.head(2)", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "32330\n" }, { "html": "
\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
rowidquerytweet_id_strinserted_datelanguagecoordinatesretweeted_statuscreated_atmonthyearcontentfrom_user_screen_namefrom_user_idfrom_user_followers_countfrom_user_friends_countfrom_user_listed_countfrom_user_favourites_countfrom_user_statuses_countfrom_user_descriptionfrom_user_locationfrom_user_created_atretweet_countfavorite_countentities_urlsentities_urls_countentities_hashtagsentities_hashtags_countentities_mentionsentities_mentions_countin_reply_to_screen_namein_reply_to_status_idsourceentities_expanded_urlsentities_media_countmedia_expanded_urlmedia_urlmedia_typevideo_linkphoto_linktwitpicnum_charactersnum_wordsretweeted_userretweeted_user_descriptionretweeted_user_screen_nameretweeted_user_followers_countretweeted_user_listed_countretweeted_user_statuses_countretweeted_user_locationretweeted_tweet_created_atFortune_2012_rankCompanyCSR_sustainabilityspecific_project_initiative_area
0 67340 humanavitality 306897327585652736 2014-03-09 13:46:50.222857 en NaN NaN 2013-02-27 22:43:19.000000 2 2013 @louloushive (Tweet 2) We encourage other empl... humanavitality 274041023 2859 440 38 25 1766 This is the official Twitter account for Human... NaN Tue Mar 29 16:23:02 +0000 2011 0 0 NaN 0 NaN 0 louloushive 1 louloushive 3.062183e+17 web NaNNaN NaN NaN NaN 0 0 0 121 19NaN NaN NaNNaNNaNNaN NaN NaN 79 Humana 0 1
1 39454 FundacionPfizer 308616393706844160 2014-03-09 13:38:20.679967 es NaN NaN 2013-03-04 16:34:17.000000 3 2013 \u00bfSabes por qu\u00e9 la #vacuna contra la #neumon\u00eda ... FundacionPfizer 188384056 2464 597 50 11 2400 Noticias sobre Responsabilidad Social y Fundac... M\u00e9xico Wed Sep 08 16:14:11 +0000 2010 1 0 NaN 0 vacuna, neumon\u00eda 2 NaN 0 NaN NaN web NaNNaN NaN NaN NaN 0 0 0 138 20NaN NaN NaNNaNNaNNaN NaN NaN 40 Pfizer 0 1
\n

2 rows \u00d7 54 columns

\n
", "metadata": {}, "output_type": "pyout", "prompt_number": 81, "text": " rowid query tweet_id_str inserted_date \\\n0 67340 humanavitality 306897327585652736 2014-03-09 13:46:50.222857 \n1 39454 FundacionPfizer 308616393706844160 2014-03-09 13:38:20.679967 \n\n language coordinates retweeted_status created_at month \\\n0 en NaN NaN 2013-02-27 22:43:19.000000 2 \n1 es NaN NaN 2013-03-04 16:34:17.000000 3 \n\n year content \\\n0 2013 @louloushive (Tweet 2) We encourage other empl... \n1 2013 \u00bfSabes por qu\u00e9 la #vacuna contra la #neumon\u00eda ... \n\n from_user_screen_name from_user_id from_user_followers_count \\\n0 humanavitality 274041023 2859 \n1 FundacionPfizer 188384056 2464 \n\n from_user_friends_count from_user_listed_count \\\n0 440 38 \n1 597 50 \n\n from_user_favourites_count from_user_statuses_count \\\n0 25 1766 \n1 11 2400 \n\n from_user_description from_user_location \\\n0 This is the official Twitter account for Human... NaN \n1 Noticias sobre Responsabilidad Social y Fundac... M\u00e9xico \n\n from_user_created_at retweet_count favorite_count \\\n0 Tue Mar 29 16:23:02 +0000 2011 0 0 \n1 Wed Sep 08 16:14:11 +0000 2010 1 0 \n\n entities_urls entities_urls_count entities_hashtags \\\n0 NaN 0 NaN \n1 NaN 0 vacuna, neumon\u00eda \n\n entities_hashtags_count entities_mentions entities_mentions_count \\\n0 0 louloushive 1 \n1 2 NaN 0 \n\n in_reply_to_screen_name in_reply_to_status_id source \\\n0 louloushive 3.062183e+17 web \n1 NaN NaN web \n\n entities_expanded_urls entities_media_count media_expanded_url media_url \\\n0 NaN NaN NaN NaN \n1 NaN NaN NaN NaN \n\n media_type video_link photo_link twitpic num_characters num_words \\\n0 NaN 0 0 0 121 19 \n1 NaN 0 0 0 138 20 \n\n retweeted_user retweeted_user_description retweeted_user_screen_name \\\n0 NaN NaN NaN \n1 NaN NaN NaN \n\n retweeted_user_followers_count retweeted_user_listed_count \\\n0 NaN NaN \n1 NaN NaN \n\n retweeted_user_statuses_count retweeted_user_location \\\n0 NaN NaN \n1 NaN NaN \n\n retweeted_tweet_created_at Fortune_2012_rank Company CSR_sustainability \\\n0 NaN 79 Humana 0 \n1 NaN 40 Pfizer 0 \n\n specific_project_initiative_area \n0 1 \n1 1 \n\n[2 rows x 54 columns]" } ], "prompt_number": 81 }, { "cell_type": "markdown", "metadata": {}, "source": "
Now let's check again for all the unique accounts in the dataframe -- as you can see, TICalculators is gone and there are now 41 accounts." }, { "cell_type": "code", "collapsed": false, "input": "pd.unique(df.from_user_screen_name.ravel())", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 82, "text": "array(['humanavitality', 'FundacionPfizer', 'Intelinvolved', 'CiscoCSR',\n 'IBMSmartCities', 'citizenIBM', 'BofA_Community', 'PG_CSDW',\n 'mathmovesu', 'PPGIdeascapes', 'gehealthy', 'msftcitizenship',\n 'CiscoEDU', 'DuPont_ability', 'Dell4Good', 'verizongiving',\n 'DellEDU', 'SprintGreenNews', 'TeachingMoney', 'WalmartGreen',\n 'ecomagination', 'WalmartAction', 'Microsoft_Green',\n 'Comcastdreambig', 'DE_Youtility', 'hpglobalcitizen',\n 'CitizenDisney', 'AmgenFoundation', 'ATTAspire', 'AlcoaFoundation',\n 'nikebetterworld', 'HoneywellBuild', 'googlestudents',\n '3M_FoodSafety', 'HeartRescue', 'EnviroSears', 'ClickToEmpower',\n 'GreenIBM', 'PromesaPepsiCo', 'FordDriveGreen', 'JNJStories'], dtype=object)" } ], "prompt_number": 82 }, { "cell_type": "code", "collapsed": false, "input": "len(pd.unique(df.from_user_screen_name.ravel()))", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 83, "text": "41" } ], "prompt_number": 83 }, { "cell_type": "markdown", "metadata": {}, "source": "
We now have our new dataframe without TICalculators. We are left with 32,330 tweets sent by 41 companies over the course of 2013." }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": "Save New DataFrame" }, { "cell_type": "markdown", "metadata": {}, "source": "We will now save the dataframe in PANDAS' native format. It's called 'pickling' a file, so we'll give it the typical 'pkl' extension." }, { "cell_type": "code", "collapsed": false, "input": "df.to_pickle('CSR tweets - 2013 by 41 accounts.pkl')", "language": "python", "metadata": {}, "outputs": [], "prompt_number": 87 }, { "cell_type": "markdown", "metadata": {}, "source": "
\n\nFor more Notebooks as well as additional Python and Big Data tutorials, please visit http://social-metrics.org or follow me on Twitter @gregorysaxton" } ], "metadata": {} } ] }