{ "metadata": { "name": "", "signature": "sha256:6f8ec8d8cadf4f1e2969732ccd907ef6a1aec5e010f21d6883aa66e7e0091dc4" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Data collection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we provide the code used to collect Twitter data for our [AAAI paper](http://www.cs.iit.edu/~culotta/pubs/culotta15predicting.pdf).\n", "\n", "Note that this is a long-running process (days) and may result in data that differs from that in the original paper. So, if you're interested in reproducing the results, I'd instead start with the [data_processing.ipynb](https://github.com/tapilab/aaai-2015-demographics/blob/master/src/data_processing.ipynb) notebook.\n", "\n", "For each brand in [brands.json](https://github.com/tapilab/aaai-2015-demographics/blob/master/data/brands.json), we collect 300 *followers*. For each of these followers, we collect up to 5,000 of their *friends* (i.e., users they follow).\n", "\n", "The results are stored in three pickle files:\n", "\n", "- `username2brand.pkl`: a dict from Twitter handle to brand demographics.\n", "- `id2brand.pkl`: a dict from Twitter user id to brand.\n", "- `brand2counts.pkl`: a dict from brand Twitter id to a Counter object. The Counter object is a dict from Twitter id to count, representing the number of times a follower of brand X is friends with user Y. E.g., `{123: {456: 10, 789: 5}}` means that of the users who follow 123, 10 also follow 456 and 5 also follow 789.\n", "\n", "MongoDB is used to store data for intermediate computations. You may want to distribute calls to `process_followers` and `process_friends` for efficiency." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "from pymongo import MongoClient\n", "\n", "dbconn = MongoClient('localhost', 27017)\n", "db = dbconn.twitter_demographics\n", "\n", "print db" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Database(MongoClient('localhost', 27017), u'twitter_demographics')\n" ] } ], "prompt_number": 1 },
{ "cell_type": "code", "collapsed": false, "input": [ "import json\n", "\n", "# Insert Twitter credentials into the DB.\n", "twitter_cred = db['twitter_cred']\n", "# Remove all existing credentials before inserting.\n", "twitter_cred.remove({})\n", "with open('../data/twitter-cred.json','r') as f:\n", "    for line in f.readlines():\n", "        user = json.loads(line)\n", "        user['is_taken'] = False\n", "        twitter_cred.insert(user)\n", "\n", "print 'Inserted %d user credentials'%(twitter_cred.count())\n", "\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Inserted 4 user credentials\n" ] } ], "prompt_number": 5 },
{ "cell_type": "code", "collapsed": false, "input": [ "# Insert brand details into the DB.\n", "brands = db['brands']\n", "\n", "brands.remove({})\n", "with open('../data/brands.json','r') as f:\n", "    for line in f.readlines():\n", "        # eval executes each line as Python, so only use a trusted brands.json.\n", "        brands.insert(eval(line))\n", "\n", "print 'Inserted %d brands'%(brands.count())" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Inserted 1072 brands\n" ] } ], "prompt_number": 7 },
{ "cell_type": "code", "collapsed": false, "input": [ "# Helper functions.\n", "import sys\n", "import time\n", "\n", "def get_credentials():\n", "    \"\"\"Gets Twitter credentials from the DB.\n", "    Returns:\n", "        Credentials dict, or None if none are available.\n", "    \"\"\"\n", "    twitter_cred = db['twitter_cred']\n", "    return twitter_cred.find_and_modify(query={'is_taken':False}, update={'$set':{'is_taken':True}}, upsert=False, sort=None)\n", "\n", "def get_brand_to_process():\n", "    \"\"\"Gets an unprocessed brand from the DB and marks it as taken.\n", "    Returns:\n", "        brand dict or None\n", "    \"\"\"\n", "    brandObj = db['brands']\n", "    brand = brandObj.find_and_modify(query={'is_processed':False,'is_taken':False}, update={'$set':{'is_taken':True}}, upsert=False, sort=None, full_response=False)\n", "    if brand:\n", "        brand['brand_id'] = brand.pop('_id')\n", "    return brand\n", "\n", "def get_follower_to_process():\n", "    \"\"\"Gets an unprocessed follower from the DB and marks it as taken.\n", "    Returns:\n", "        follower details dict or None\n", "    \"\"\"\n", "    xmatrix = db['xmatrix']\n", "    followersDtls = xmatrix.find_and_modify(query={'is_processed':False,'is_taken':False}, update={'$set':{'is_taken':True}}, upsert=False, sort=None, full_response=False)\n", "    if followersDtls:\n", "        followersDtls['follower_id'] = followersDtls.pop('_id')\n", "    return followersDtls\n", "\n", "def add_followers_to_db(brand_id,followers_ids):\n", "    \"\"\"Adds followers to the DB.\n", "    Args:\n", "        brand_id: id of the brand that the users follow\n", "        followers_ids: ids of the followers of the brand\n", "    \"\"\"\n", "    xmatrix = db['xmatrix']\n", "    for follower_id in followers_ids:\n", "        # The final True enables upsert, so unseen followers are created.\n", "        xmatrix.update({'_id':follower_id},{'$addToSet':{'follows':brand_id},'$set':{'is_processed':False,'is_taken':False}},True)\n", "\n", "def add_friends_to_dB(follower_id,friends_list):\n", "    \"\"\"Adds friends to the DB.\n", "    Args:\n", "        follower_id: id of the follower\n", "        friends_list: list of ids of the follower's friends\n", "    \"\"\"\n", "    xmatrix = db['xmatrix']\n", "    xmatrix.update({'_id':follower_id},\n", "                   {'$addToSet':{\n", "                       'follows':{\n", "                           '$each':friends_list\n", "                       }\n", "                   }\n", "                   })\n", "\n", "def update_processed_flag(user_id):\n", "    \"\"\"Sets the processed flag to True and releases the user.\n", "    Args:\n", "        user_id: id of the user whose processed flag is updated\n", "    \"\"\"\n", "    xmatrix = db['xmatrix']\n", "    xmatrix.update({'_id':user_id},{'$set':{'is_taken':False,'is_processed':True}})\n", "\n", "def remove_user_from_x(user_id):\n", "    \"\"\"Removes a user from xmatrix.\n", "    Args:\n", "        user_id: user id\n", "    \"\"\"\n", "    xmatrix = db['xmatrix']\n", "    xmatrix.remove({'_id':user_id})\n", "\n", "def robust_request(twitter, resource, params, max_tries=5):\n", "    \"\"\" If a Twitter request fails, sleep for 15 minutes.\n", "    Do this at most max_tries times before quitting.\n", "    Args:\n", "      twitter .... A TwitterAPI object.\n", "      resource ... A resource string to request.\n", "      params ..... A parameter dictionary for the request.\n", "      max_tries .. The maximum number of tries to attempt.\n", "    Returns:\n", "      A TwitterResponse object, or None if failed.\n", "    \"\"\"\n", "    for i in range(max_tries):\n", "        request = twitter.request(resource, params)\n", "        if request.status_code == 200:\n", "            return request\n", "        r = [r for r in request][0]\n", "        if ('code' in r and r['code'] == 34) or ('error' in r and r['error'] == 'Not authorized.'):  # 34 == user does not exist.\n", "            print >> sys.stderr, 'skipping bad request', resource, params\n", "            return None\n", "        else:\n", "            print >> sys.stderr, 'Got error:', request.text, '\\nsleeping for 15 minutes.'\n", "            sys.stderr.flush()\n", "            time.sleep(60 * 15)\n", "    # All max_tries attempts failed.\n", "    return None" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 },
{ "cell_type": "code", "collapsed": false, "input": [ "# Create the TwitterAPI object from the stored credentials.\n", "from TwitterAPI import TwitterAPI\n", "twitter = get_credentials()\n", "if twitter:\n", "    twitterObj = TwitterAPI(\n", "        twitter['api_key'],\n", "        twitter['api_secret'],\n", "        twitter['access_token_key'],\n", "        twitter['access_token_secret'])\n", "else:\n", "    print >> sys.stderr,'Twitter credentials not available'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 },
{ "cell_type": "code", "collapsed": false, "input": [ "def get_followers(user_id):\n", "    \"\"\"Gets up to 300 followers of the given Twitter id.\n", "    Args:\n", "      user_id... twitter user id\n", "    Returns:\n", "      followers list\n", "    \"\"\"\n", "    followers = []\n", "    request = robust_request(twitterObj, 'followers/ids',\n", "                             {'user_id': user_id, 'count': 300})\n", "    if request:\n", "        for result in request:\n", "            if 'ids' in result:\n", "                followers += result['ids']\n", "    return followers\n", "\n", "\n", "def get_friends(user_id):\n", "    \"\"\"Gets up to 5,000 friends of the given Twitter id.\n", "    Args:\n", "      user_id... twitter user id\n", "    Returns:\n", "      friends list\n", "    \"\"\"\n", "    friends = []\n", "    request = robust_request(twitterObj, 'friends/ids',\n", "                             {'user_id': user_id, 'count': 5000})\n", "    if request:\n", "        for result in request:\n", "            if 'ids' in result:\n", "                friends += result['ids']\n", "    return friends" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 },
{ "cell_type": "code", "collapsed": false, "input": [ "# For testing purposes, stop after this many brands.\n", "# Set it to -1 to process everything.\n", "cut_off = 3" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 },
{ "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "def process_followers():\n", "    \"\"\"Gets unprocessed brands from the DB,\n", "    fetches 300 followers of each brand, and\n", "    adds them to the DB.\n", "    Halts when all brands are processed or cut_off is reached.\n", "    \"\"\"\n", "    global cut_off\n", "    while True:\n", "        if cut_off == 0:\n", "            print >> sys.stderr, 'cut off reached'\n", "            break\n", "        cut_off -= 1\n", "        brand = get_brand_to_process()\n", "        if not brand:\n", "            print >> sys.stderr, 'No brands to process'\n", "            break\n", "        followers = get_followers(brand['brand_id'])\n", "        if len(followers) > 0:\n", "            add_followers_to_db(brand['brand_id'],followers)\n", "            print 'added %d followers of %s to DB'%(len(followers),brand['brand_id'])\n", "process_followers()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "added 300 followers of 255784266 to DB\n", "added 300 followers of 261927470 to DB" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "added 300 followers of 268439864 to DB" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] }, { "output_type": "stream", "stream": "stderr", "text": [ "cut off reached\n" ] } ], "prompt_number": 12 },
{ "cell_type": "code", "collapsed": false, "input": [ "# For testing purposes, stop after this many followers.\n", "# Set it to -1 to process everything.\n", "cut_off = 3" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "def process_friends():\n", "    \"\"\"Gets unprocessed followers, fetches up to\n", "    5,000 of their friends, and adds them to the DB.\n", "    Halts when all followers are processed or cut_off is reached.\n", "    \"\"\"\n", "    global cut_off\n", "    while True:\n", "        if cut_off == 0:\n", "            print >> sys.stderr, 'cut off reached'\n", "            break\n", "        cut_off -= 1\n", "        follower = get_follower_to_process()\n", "        if not follower:\n", "            print >> sys.stderr, 'No follower to process'\n", "            break\n", "        friends = get_friends(str(follower['follower_id']))\n", "        if len(friends) > 0:\n", "            add_friends_to_dB(follower['follower_id'],friends)\n", "            print 'added %d friends of %s to DB'%(len(friends),follower['follower_id'])\n", "            # Pass the id unchanged so it matches the _id type stored in xmatrix.\n", "            update_processed_flag(follower['follower_id'])\n", "        else:\n", "            remove_user_from_x(follower['follower_id'])\n", "            print >> sys.stderr, 'removed user, unable to fetch friends list for %s'%(str(follower['follower_id']))\n", "\n", "process_friends()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "added 763 friends of 2891952572 to DB\n", "added 571 friends of 2890499755 to DB" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "added 155 friends of 317563215 to DB" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] }, { "output_type": "stream", "stream": "stderr", "text": [ "cut off reached\n" ] } ], "prompt_number": 14 },
{ "cell_type": "code", "collapsed": false, "input": [ "# Get all brands.\n", "id2brand = dict()\n", "username2brand = dict()\n", "for brand in db.brands.find():\n", "    id2brand[brand['_id']] = brand\n", "    username2brand[brand['brand_name'].lower()] = brand\n", "print 'read', len(id2brand), 'brands'" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "read 1072 brands\n" ] } ], "prompt_number": 2 },
{ "cell_type": "code", "collapsed": false, "input": [ "# Iterate over sampled followers for each brand.\n", "from collections import Counter, defaultdict\n", "\n", "# Count how often each friend appears so we can remove those occurring fewer than N times.\n", "friend_counts = Counter()\n", "count = 0\n", "for follower in db.xmatrix.find({'is_processed':True}):\n", "    friend_counts.update(follower['follows'])\n", "    count += 1\n", "    if count % 1000 == 0:\n", "        print 'read', count" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "read 1000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2000\n", "read" ] }, { "output_type": 
"stream", "stream": "stdout", "text": [ " 3000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 4000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 5000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 6000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 7000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 8000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 9000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 10000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 11000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 12000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 13000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 14000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 15000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 16000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 17000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 18000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 19000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 20000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 21000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 22000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 23000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 24000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 25000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 26000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 27000\n", "read" ] 
}, { "output_type": "stream", "stream": "stdout", "text": [ " 28000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 29000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 30000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 31000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 32000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 33000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 34000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 35000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 36000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 37000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 38000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 39000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 40000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 41000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 42000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 43000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 44000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 45000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 46000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 47000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 48000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 49000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 50000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 51000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", 
"text": [ " 52000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 53000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 54000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 55000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 56000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 57000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 58000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 59000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 60000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 61000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 62000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 63000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 64000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 65000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 66000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 67000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 68000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 69000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 70000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 71000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 72000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 73000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 74000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 75000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 76000\n", "read" ] }, { "output_type": 
"stream", "stream": "stdout", "text": [ " 77000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 78000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 79000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 80000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 81000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 82000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 83000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 84000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 85000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 86000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 87000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 88000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 89000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 90000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 91000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 92000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 93000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 94000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 95000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 96000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 97000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 98000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 99000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 100000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 101000\n", 
"read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 102000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 103000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 104000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 105000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 106000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 107000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 108000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 109000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 110000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 111000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 112000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 113000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 114000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 115000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 116000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 117000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 118000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 119000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 120000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 121000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 122000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 123000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 124000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 125000\n", "read" ] }, { "output_type": 
"stream", "stream": "stdout", "text": [ " 126000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 127000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 128000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 129000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 130000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 131000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 132000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 133000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 134000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 135000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 136000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 137000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 138000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 139000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 140000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 141000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 142000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 143000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 144000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 145000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 146000\n" ] } ], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "# Filter accounts not appearing a minimum number of times.\n", "count_thresh = 100\n", "friend_set = set(f for f, v in friend_counts.iteritems() if v >= count_thresh)\n", "print len(friend_set), 'of', len(friend_counts), 
'appear at least', count_thresh, 'times'" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "67760 of 21185659 appear at least 100 times\n" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "# Now construct friend counts for each brand, using the filtered set of accounts.\n", "brand2counts = defaultdict(Counter)\n", "count = 0\n", "for follower in db.xmatrix.find({'is_processed':True}):\n", "    count += 1\n", "    if count % 1000 == 0:\n", "        print 'read', count\n", "    brandids = set([f for f in follower['follows'] if f in id2brand])\n", "    friends = set(follower['follows']) & friend_set\n", "    for b in brandids:\n", "        brand2counts[b].update(friends)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "read 1000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 3000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 4000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 5000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 6000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 7000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 8000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 9000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 10000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 11000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 12000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 13000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 14000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 
15000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 16000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 17000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 18000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 19000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 20000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 21000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 22000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 23000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 24000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 25000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 26000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 27000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 28000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 29000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 30000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 31000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 32000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 33000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 34000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 35000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 36000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 37000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 38000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 39000\n", "read" ] }, { "output_type": "stream", 
"stream": "stdout", "text": [ " 40000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 41000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 42000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 43000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 44000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 45000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 46000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 47000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 48000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 49000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 50000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 51000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 52000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 53000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 54000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 55000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 56000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 57000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 58000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 59000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 60000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 61000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 62000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 63000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 64000\n", "read" ] }, { 
"output_type": "stream", "stream": "stdout", "text": [ " 65000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 66000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 67000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 68000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 69000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 70000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 71000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 72000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 73000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 74000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 75000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 76000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 77000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 78000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 79000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 80000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 81000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 82000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 83000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 84000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 85000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 86000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 87000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 88000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ 
" 89000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 90000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 91000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 92000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 93000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 94000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 95000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 96000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 97000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 98000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 99000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 100000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 101000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 102000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 103000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 104000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 105000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 106000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 107000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 108000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 109000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 110000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 111000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 112000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 113000\n", "read" ] }, { "output_type": 
"stream", "stream": "stdout", "text": [ " 114000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 115000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 116000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 117000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 118000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 119000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 120000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 121000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 122000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 123000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 124000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 125000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 126000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 127000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 128000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 129000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 130000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 131000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 132000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 133000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 134000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 135000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 136000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 137000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", 
"text": [ " 138000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 139000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 140000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 141000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 142000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 143000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 144000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 145000\n", "read" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 146000\n" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "# Print the top accounts for the first brand.\n", "print brand2counts.keys()[0], sorted(brand2counts.values()[0].items(), key=lambda x: -x[1])[:5]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "15650816 " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "[(15650816, 626), (35764757, 322), (15846407, 306), (25525507, 283), (90420314, 273)]\n" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "# Read the demographics data for each brand.\n", "import json\n", "username2demo = dict()\n", "for line in open('../data/demo.json', 'rt'):\n", " js = json.loads(line)\n", " username2demo[js['twitter'].lower()] = js\n", "print 'read', len(username2demo), 'demographics'" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "read 1513 demographics\n" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "# Add demographics to each brand dict.\n", "for username, brand in username2brand.iteritems():\n", " brand['demo'] = username2demo[username]\n", " if not 'Female' in brand['demo']:\n", " print brand['brand_name'] # , 
brand['demo']['Female']" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "stltoday\n", "World_Wildlife\n", "BettyBuzz\n", "TeamRankings\n", "InsideHoops\n", "GrindTV\n" ] } ], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "# Set self-reference counts to 0.\n", "for brand in brand2counts:\n", " brand2counts[brand][brand] = 0.\n", "# The brand id should no longer appear among the top counts.\n", "print brand2counts.keys()[0], sorted(brand2counts.values()[0].items(), key=lambda x: -x[1])[:5]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "15650816 " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "[(35764757, 322), (15846407, 306), (25525507, 283), (90420314, 273), (34381878, 251)]\n" ] } ], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "# Pickle everything, closing each file once it is written.\n", "import pickle\n", "with open('username2brand.pkl', 'wb') as f:\n", " pickle.dump(username2brand, f)\n", "with open('id2brand.pkl', 'wb') as f:\n", " pickle.dump(id2brand, f)\n", "# Store brand2counts as a plain dict.\n", "with open('brand2counts.pkl', 'wb') as f:\n", " pickle.dump(dict(brand2counts), f)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }