{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Text-Mining the DLD14 Conference" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the IPython Notebook behind this analysis of the Twitter buzz around the DLD Conference 2014." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Reading the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The tweets for this analysis have been collected with the TAGS Google Drive script and the search query \"#DLD14 OR #DLD\". The data (the \"Archive\" tab) were then exported as a CSV file. Because a TAGS document tends to reach the Google Documents size limit quite fast, I have split the data gathering across multiple TAGS files.\n", "\n", "The first step of the analysis is to read the CSV files and combine them into a single data set. This step also requires deleting duplicate tweets." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "\n", "# Read all TAGS export files and concatenate them into one DataFrame\n", "filenames = ['TAGS - DLD14 - Archive.csv',\n", "             'TAGS DLD14.2 - Archive.csv',\n", "             'TAGS DLD14.3 - Archive.csv',\n", "             'TAGS DLD14.4 - Archive.csv',\n", "             'TAGS DLD14.5 - Archive.csv']\n", "data = pd.concat([pd.read_csv(f, parse_dates={'Timestamp': ['created_at']})\n", "                  for f in filenames])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "# Last year's DLD13 tweets, used for comparison\n", "data_old = pd.read_csv(\"dld13.csv\", sep=\",\",\n", "                       parse_dates={'Timestamp': ['created_at']})" ], "language": "python", "metadata": {}, "outputs": [], 
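"prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick sanity check on the combined data: comparing the total number of rows with the number of distinct tweet IDs shows how many duplicate rows the overlapping TAGS archives have produced. (This cell is a sketch; the actual counts depend on the archives used.)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Total rows vs. distinct tweet IDs in the combined data\n", "len(data), data.id_str.nunique()" ], "language": "python", "metadata": {}, "outputs": [],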
"prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following is an example of duplicated content. We have to clean the data set to get rid of such duplicate tweets." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from collections import Counter\n", "\n", "# Count how often each tweet ID occurs in the combined data\n", "c = Counter(data.id_str)\n", "# Show the first three rows of one of the most frequent tweet IDs\n", "data[data.id_str == c.most_common()[1][0]][:3]" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>Timestamp</th>\n", "      <th>id_str</th>\n", "      <th>from_user</th>\n", "      <th>text</th>\n", "      <th>time</th>\n", "      <th>geo_coordinates</th>\n", "      <th>user_lang</th>\n", "      <th>in_reply_to_user_id_str</th>\n", "      <th>in_reply_to_screen_name</th>\n", "      <th>from_user_id_str</th>\n", "      <th>in_reply_to_status_id_str</th>\n", "      <th>source</th>\n", "      <th>profile_image_url</th>\n", "      <th>user_followers_count</th>\n", "      <th>user_friends_count</th>\n", "      <th>user_utc_offset</th>\n", "      <th>status_url</th>\n", "      <th>entities_str</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "
    <tr>\n", "      <th>6904</th>\n", "      <td>2014-01-19 12:01:09</td>\n", "      <td>4.248742e+17</td>\n", "      <td>ardakutsal</td>\n", "      <td>RT @atillayurtseven: @webrazzi #DLD14 medya sp...</td>\n", "      <td>19/01/2014 12:01:09</td>\n", "      <td>NaN</td>\n", "      <td>en</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>43854330</td>\n", "      <td>NaN</td>\n", "      <td>&lt;a href=\"http://twitter.com/download/iphone\" r...</td>\n", "      <td>http://pbs.twimg.com/profile_images/3027796712...</td>\n", "      <td>18151</td>\n", "      <td>676</td>\n", "      <td>7200</td>\n", "      <td>http://twitter.com/ardakutsal/statuses/4248741...</td>\n", "      <td>{\"symbols\":[],\"urls\":[],\"hashtags\":[{\"text\":\"D...</td>\n", "    </tr>\n", "
    <tr>\n", "      <th>9072</th>\n", "      <td>2014-01-19 12:01:09</td>\n", "      <td>4.248742e+17</td>\n", "      <td>ardakutsal</td>\n", "      <td>RT @atillayurtseven: @webrazzi #DLD14 medya sp...</td>\n", "      <td>19/01/2014 12:01:09</td>\n", "      <td>NaN</td>\n", "      <td>en</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>43854330</td>\n", "      <td>NaN</td>\n", "      <td>&lt;a href=\"http://twitter.com/download/iphone\" r...</td>\n", "      <td>http://pbs.twimg.com/profile_images/3027796712...</td>\n", "      <td>18149</td>\n", "      <td>676</td>\n", "      <td>7200</td>\n", "      <td>http://twitter.com/ardakutsal/statuses/4248741...</td>\n", "      <td>{\"symbols\":[],\"urls\":[],\"hashtags\":[{\"text\":\"D...</td>\n", "    </tr>\n", "
    <tr>\n", "      <th>11010</th>\n", "      <td>2014-01-19 12:01:09</td>\n", "      <td>4.248742e+17</td>\n", "      <td>ardakutsal</td>\n", "      <td>RT @atillayurtseven: @webrazzi #DLD14 medya sp...</td>\n", "      <td>19/01/2014 12:01:09</td>\n", "      <td>NaN</td>\n", "      <td>en</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>43854330</td>\n", "      <td>NaN</td>\n", "      <td>&lt;a href=\"http://twitter.com/download/iphone\" r...</td>\n", "      <td>http://pbs.twimg.com/profile_images/3027796712...</td>\n", "      <td>18148</td>\n", "      <td>676</td>\n", "      <td>7200</td>\n", "      <td>http://twitter.com/ardakutsal/statuses/4248741...</td>\n", "      <td>{\"symbols\":[],\"urls\":[],\"hashtags\":[{\"text\":\"D...</td>\n", "    </tr>\n", "
  </tbody>\n", "</table>\n", "