{
"metadata": {
"name": "Tutorial - Import Data, Select Cases, Save Dataset"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "This is the first in a series of notebooks designed to show you how to analyze social media data. We assume you have already downloaded the data and are now ready to begin examining it. In this first notebook I will show you how to set up your IPython working environment and import the Twitter data we have downloaded."
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": "Chapter 1: Set up Jupyter, Import Twitter Data and Select Cases"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "First, we will import several necessary Python packages. We will be using the Python Data Analysis Library, or PANDAS, extensively for our data manipulations. It is invaluable for analyzing datasets."
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Import packages"
},
{
"cell_type": "code",
"collapsed": false,
"input": "import numpy as np\nimport pandas as pd\nfrom pandas import DataFrame\nfrom pandas import Series",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": "PANDAS lets you set various options, including how the data are displayed when you inspect them. I like to be able to see all of the columns, so I typically include this line at the top of all my notebooks."
},
{
"cell_type": "code",
"collapsed": false,
"input": "#Set PANDAS to show all columns in DataFrame\npd.set_option('display.max_columns', None)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
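{
"cell_type": "markdown",
"metadata": {},
"source": "If you would rather not change the display setting for the whole notebook, PANDAS also provides an option_context manager that applies an option only inside a with block. The snippet below is a minimal sketch of that alternative, not part of the original workflow; the variable names are made up for illustration.

```python
import pandas as pd

# Remember the current setting so it can be compared afterwards.
previous_max_columns = pd.get_option('display.max_columns')

# Inside the with block every column is displayed (None means no limit).
with pd.option_context('display.max_columns', None):
    inside_block = pd.get_option('display.max_columns')

# After the block, the earlier setting is restored automatically.
after_block = pd.get_option('display.max_columns')
print(inside_block)
print(after_block == previous_max_columns)
```

When the with block ends, the earlier setting comes back, so the rest of the notebook is unaffected."
},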
{
"cell_type": "markdown",
"metadata": {},
"source": "We can check which version of a package we're using. You can see I'm running PANDAS 0.13.1 here."
},
{
"cell_type": "code",
"collapsed": false,
"input": "print(pd.__version__)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.13.1\n"
}
],
"prompt_number": 86
},
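{
"cell_type": "markdown",
"metadata": {},
"source": "Because PANDAS behaviour can differ between releases, it is sometimes handy to compare the installed version against the one a tutorial was written with. The following is a rough sketch; version_tuple is a hypothetical helper written for this example, not a PANDAS function, and it assumes the version string starts with two numeric parts, as in 0.13.1.

```python
import pandas as pd


def version_tuple(version_string):
    # Hypothetical helper: keep only the leading major and minor numbers.
    major, minor = version_string.split('.')[:2]
    return (int(major), int(minor))


installed = version_tuple(pd.__version__)
tutorial = version_tuple('0.13.1')

# Tuples compare element by element, so (1, 5) >= (0, 13) is True.
print(installed >= tutorial)
```

If this prints False, you are on an older release than the one used here and some calls may behave differently."
},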
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Read in data"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "PANDAS can read in data from a variety of file formats. If you've followed some of my earlier tutorials, you have downloaded tweets into an SQLite database and then exported them to a CSV file. That's what we have here: a set of tweets by Fortune 200 firms. In the following three lines we'll first import the CSV file and assign it to the name 'df' (short for 'dataframe', the PANDAS name for a dataset). Second, we'll use the len function to see how many rows (tweets) there are in the dataset; there are 34,097 tweets in total. Finally, we will use the head method to show the first two rows of the dataset."
},
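{
"cell_type": "markdown",
"metadata": {},
"source": "If you do not have the CSV file to hand, the same three steps can be tried on a tiny in-memory CSV. This is only a sketch: the three rows are made-up sample data, the column names are borrowed from the real file, and io.StringIO stands in for a file on disk.

```python
import io
import pandas as pd

# Made-up sample rows; only the column names come from the real dataset.
csv_text = u'''from_user_screen_name,retweet_count,language
account_a,0,en
account_b,1,es
account_c,5,en
'''

# Step 1: read the CSV into a DataFrame.
sample = pd.read_csv(io.StringIO(csv_text), sep=',')

# Step 2: count the rows (one row per tweet).
n_rows = len(sample)

# Step 3: keep the first two rows for a quick look.
first_two = sample.head(2)

print(n_rows)
print(first_two)
```

With the real file, the only change is passing the filename to read_csv in place of the StringIO object."
},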
{
"cell_type": "code",
"collapsed": false,
"input": "df = pd.read_csv('CSR_user_timeline_2013.csv', sep=',', low_memory=False)\nprint(len(df))\ndf.head(2)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "34097\n"
},
{
"html": "
\n | rowid | \nquery | \ntweet_id | \ntweet_id_str | \ninserted_date | \ntruncated | \nlanguage | \npossibly_sensitive | \ncoordinates | \nretweeted_status | \nwithheld_in_countries | \nwithheld_scope | \ncreated_at_text | \ncreated_at | \nmonth | \nyear | \ncontent | \nfrom_user_screen_name | \nfrom_user_id | \nfrom_user_followers_count | \nfrom_user_friends_count | \nfrom_user_listed_count | \nfrom_user_favourites_count | \nfrom_user_statuses_count | \nfrom_user_description | \nfrom_user_location | \nfrom_user_created_at | \nretweet_count | \nfavorite_count | \nentities_urls | \nentities_urls_count | \nentities_hashtags | \nentities_hashtags_count | \nentities_mentions | \nentities_mentions_count | \nin_reply_to_screen_name | \nin_reply_to_status_id | \nsource | \nentities_expanded_urls | \nentities_media_count | \nmedia_expanded_url | \nmedia_url | \nmedia_type | \nvideo_link | \nphoto_link | \ntwitpic | \nnum_characters | \nnum_words | \nretweeted_user | \nretweeted_user_description | \nretweeted_user_screen_name | \nretweeted_user_followers_count | \nretweeted_user_listed_count | \nretweeted_user_statuses_count | \nretweeted_user_location | \nretweeted_tweet_created_at | \nFortune_2012_rank | \nCompany | \nCSR_sustainability | \nspecific_project_initiative_area | \n
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n67340 | \nhumanavitality | \n306897327585652736 | \n306897327585652736 | \n2014-03-09 13:46:50.222857 | \n0 | \nen | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \nWed Feb 27 22:43:19 +0000 2013 | \n2013-02-27 22:43:19.000000 | \n2 | \n2013 | \n@louloushive (Tweet 2) We encourage other empl... | \nhumanavitality | \n274041023 | \n2859 | \n440 | \n38 | \n25 | \n1766 | \nThis is the official Twitter account for Human... | \nNaN | \nTue Mar 29 16:23:02 +0000 2011 | \n0 | \n0 | \nNaN | \n0 | \nNaN | \n0 | \nlouloushive | \n1 | \nlouloushive | \n3.062183e+17 | \nweb | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \n0 | \n0 | \n0 | \n121 | \n19 | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \n79 | \nHumana | \n0 | \n1 | \n
1 | \n39454 | \nFundacionPfizer | \n308616393706844160 | \n308616393706844160 | \n2014-03-09 13:38:20.679967 | \n0 | \nes | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \nMon Mar 04 16:34:17 +0000 2013 | \n2013-03-04 16:34:17.000000 | \n3 | \n2013 | \n\u00bfSabes por qu\u00e9 la #vacuna contra la #neumon\u00eda ... | \nFundacionPfizer | \n188384056 | \n2464 | \n597 | \n50 | \n11 | \n2400 | \nNoticias sobre Responsabilidad Social y Fundac... | \nM\u00e9xico | \nWed Sep 08 16:14:11 +0000 2010 | \n1 | \n0 | \nNaN | \n0 | \nvacuna, neumon\u00eda | \n2 | \nNaN | \n0 | \nNaN | \nNaN | \nweb | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \n0 | \n0 | \n0 | \n138 | \n20 | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \n40 | \nPfizer | \n0 | \n1 | \n
2 rows \u00d7 60 columns
\n\n | rowid | \nquery | \ntweet_id_str | \ninserted_date | \nlanguage | \ncoordinates | \nretweeted_status | \ncreated_at | \nmonth | \nyear | \ncontent | \nfrom_user_screen_name | \nfrom_user_id | \nfrom_user_followers_count | \nfrom_user_friends_count | \nfrom_user_listed_count | \nfrom_user_favourites_count | \nfrom_user_statuses_count | \nfrom_user_description | \nfrom_user_location | \nfrom_user_created_at | \nretweet_count | \nfavorite_count | \nentities_urls | \nentities_urls_count | \nentities_hashtags | \nentities_hashtags_count | \nentities_mentions | \nentities_mentions_count | \nin_reply_to_screen_name | \nin_reply_to_status_id | \nsource | \nentities_expanded_urls | \nentities_media_count | \nmedia_expanded_url | \nmedia_url | \nmedia_type | \nvideo_link | \nphoto_link | \ntwitpic | \nnum_characters | \nnum_words | \nretweeted_user | \nretweeted_user_description | \nretweeted_user_screen_name | \nretweeted_user_followers_count | \nretweeted_user_listed_count | \nretweeted_user_statuses_count | \nretweeted_user_location | \nretweeted_tweet_created_at | \nFortune_2012_rank | \nCompany | \nCSR_sustainability | \nspecific_project_initiative_area | \n
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n67340 | \nhumanavitality | \n306897327585652736 | \n2014-03-09 13:46:50.222857 | \nen | \nNaN | \nNaN | \n2013-02-27 22:43:19.000000 | \n2 | \n2013 | \n@louloushive (Tweet 2) We encourage other empl... | \nhumanavitality | \n274041023 | \n2859 | \n440 | \n38 | \n25 | \n1766 | \nThis is the official Twitter account for Human... | \nNaN | \nTue Mar 29 16:23:02 +0000 2011 | \n0 | \n0 | \nNaN | \n0 | \nNaN | \n0 | \nlouloushive | \n1 | \nlouloushive | \n3.062183e+17 | \nweb | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \n0 | \n0 | \n0 | \n121 | \n19 | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \n79 | \nHumana | \n0 | \n1 | \n
1 | \n39454 | \nFundacionPfizer | \n308616393706844160 | \n2014-03-09 13:38:20.679967 | \nes | \nNaN | \nNaN | \n2013-03-04 16:34:17.000000 | \n3 | \n2013 | \n\u00bfSabes por qu\u00e9 la #vacuna contra la #neumon\u00eda ... | \nFundacionPfizer | \n188384056 | \n2464 | \n597 | \n50 | \n11 | \n2400 | \nNoticias sobre Responsabilidad Social y Fundac... | \nM\u00e9xico | \nWed Sep 08 16:14:11 +0000 2010 | \n1 | \n0 | \nNaN | \n0 | \nvacuna, neumon\u00eda | \n2 | \nNaN | \n0 | \nNaN | \nNaN | \nweb | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \n0 | \n0 | \n0 | \n138 | \n20 | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \nNaN | \n40 | \nPfizer | \n0 | \n1 | \n
2 rows \u00d7 54 columns
\n\n | created_at | \nfrom_user_screen_name | \nretweet_count | \n
---|---|---|---|
0 | \n2013-02-27 22:43:19.000000 | \nhumanavitality | \n0 | \n
1 | \n2013-03-04 16:34:17.000000 | \nFundacionPfizer | \n1 | \n
2 rows \u00d7 3 columns
\n