{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Final Project - James Quacinella\n", "\n", "Fo the final project, I will look at the follower network of one of the think tank Twitter account and perform clustering to find groups of associated accounts. Looking at the clusters, I hope to identify what joins them by performing some NLP tasks on the account's profile contents.\n", "\n", "## Step 1 - Crawl Twitter for Followers\n", "\n", "The next section of code does not run in the notebook, but is a copy of the crawler code created for this project. It will take a single account, get the first level followers, and then grab the 'second-level' followers. Those second level follower are only added if they were nodes in the first level (so we focus on the main account, not other accounts tangentially related)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#import graphlab as gl\n", "import pickle\n", "import twitter\n", "import logging\n", "import time\n", "from collections import defaultdict\n", "\n", "\n", "### Setup a console and file logger\n", "\n", "logger = logging.getLogger('crawler')\n", "logger.setLevel(logging.DEBUG)\n", "fh = logging.FileHandler('crawler.log')\n", "fh.setLevel(logging.INFO)\n", "ch = logging.StreamHandler()\n", "ch.setLevel(logging.INFO)\n", "formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')\n", "ch.setFormatter(formatter)\n", "fh.setFormatter(formatter)\n", "logger.addHandler(ch)\n", "logger.addHandler(fh)\n", "\n", "### Setup signals to make sure API calls only take 60s at most\n", "\n", "from functools import wraps\n", "import errno\n", "import os\n", "import signal\n", "\n", "class TimeoutError(Exception):\n", " pass\n", "\n", "def timeout(seconds=60, error_message=os.strerror(errno.ETIME)):\n", " def decorator(func):\n", " def _handle_timeout(signum, frame):\n", " raise TimeoutError(error_message)\n", "\n", 
" def wrapper(*args, **kwargs):\n", " signal.signal(signal.SIGALRM, _handle_timeout)\n", " signal.alarm(seconds)\n", " try:\n", " result = func(*args, **kwargs)\n", " finally:\n", " signal.alarm(0)\n", " return result\n", "\n", " return wraps(func)(wrapper)\n", "\n", " return decorator\n", "\n", "@timeout()\n", "def getFollowers(api, follower):\n", " ''' Function that will get a user's list of followers from an api object. \n", " NOTE: the decorator ensures that this only runs for 60s at most. '''\n", " # return api.GetFollowerIDs(follower)\n", " return api.GetFriendIDs(follower)\n", "\n", "\n", "### Twitter API\n", "\n", "# Lets create our list of api OAuth parameters\n", "API_TOKENS = [\n", " {\"consumer_key\": 'yp4wi4FASXbsRKa6JxYqzhUlH',\n", " \"consumer_secret\": 'Wkh1d5ygAOp4Bp65syFzHRN4xQsS8O4FvU3zHWosX8NXCqMpcl',\n", " \"access_token_key\": '16562593-F6lRFe7iyoQEahezhPmaI64oInHZD0LNpcIbbq7Wy',\n", " \"access_token_secret\": 'weregYL8n6DI7yZy9pkizIJ78rH2GY02Do9jvpTe7rCey',\n", " \"requests_timeout\": 60},\n", "\n", " {\"consumer_key\": 'NsNYFG9LtZV2XMyigPaCKVyVz',\n", " \"consumer_secret\": '4J1vlowybipqXnSrKgLBvmzPmwqx71uHN32noljTgDLS2xQNfI',\n", " \"access_token_key\": '16562593-NCuQWVnpzcnB55w7VLdoCkdobdUQBRDJKjIPXAksP',\n", " \"access_token_secret\": 'nX9OksrYQxj0jBXYJTkUjlX5mZh4rZljfVRXtSM3Tjc8c',\n", " \"requests_timeout\": 60},\n", "\n", " {\"consumer_key\": 'ZcAMGe2MUcnTO9ATCIo563SHN',\n", " \"consumer_secret\": 'dJAB7mBfoYyx27Yccbmzz98GtNigAA67Ish9Y1NjN2wNznciM1',\n", " \"access_token_key\": '16562593-AmaoKVLEYL3o8rVUS3b6u4PUbVPTI6BPsyaqCdwxY',\n", " \"access_token_secret\": '8pjYJCFWTErJlb2WSkLwsYNoptVazQQs95JAvIU8JApUA',\n", " \"requests_timeout\": 60},\n", "\n", " {\"consumer_key\": 'avZpjObqQN9vue2Y4gu9zIF9X',\n", " \"consumer_secret\": 'Ka6WCj3fyon5yGgf5YJIIl8nVcLcUh5YT99N58qy8qv4kfaMbc',\n", " \"access_token_key\": '16562593-VNuGD09Cr29ZlzNCWnV5MOujU7PsexSwfTgfKQNqC',\n", " \"access_token_secret\": 
'9P3hB3qDb9zPDFCUhWU16N4CMXPwHacl6HJbCc0EuGj7s',\n", " \"requests_timeout\": 60},\n", "\n", " {\"consumer_key\": 'sQ9H5NKteroNZSWvIrkSWvXR0',\n", " \"consumer_secret\": 'lC0ttZKdIZhhJAE1I5RxMxdjpSiADQCVUnHS7LbtfVmI2pz2F2',\n", " \"access_token_key\": '16562593-4LOk7QkXWD0boF01BmZ6NP2oPtHmDZ1OVJ883aANG',\n", " \"access_token_secret\": 'JJ85qMqzVowN1KdQ6w4YlhJB9YF9eWbw6SGbxQoU6gvne',\n", " \"requests_timeout\": 60},\n", "\n", " {\"consumer_key\": 'DHppZ2LG3iYj8vEx7ibRRLN35',\n", " \"consumer_secret\": 'wdTQeyp7ZNDN7ne40IriRw7Ah1J8cAi2OIlw4MVtgpq5MMKjYE',\n", " \"access_token_key\": '16562593-WN8zvEWAxVfJPrneMwUjDoVQw0geuLckOOJqFimsC',\n", " \"access_token_secret\": 'ZgVi2onPB3RPGtRmPBs6QXymIMgXwJHUOQycesp64S0Hp',\n", " \"requests_timeout\": 60},\n", "\n", " {\"consumer_key\": 'lIgtfdkC2WmN7XAcicrGygQBp',\n", " \"consumer_secret\": '2D9WIJN2MIPwFpMeIGcP6vWjQC8vvy7G5ZlHMSH1F1CsgWGKfz',\n", " \"access_token_key\": '16562593-7lhPpeZNNAGoQQJnqcnTtBiGq1O52XMZ4CMeVqXiY',\n", " \"access_token_secret\": 'WKRBQsr36MMB2EpCcZLr89ik0MSJfPoBORCKu9E1hw96I',\n", " \"requests_timeout\": 60},\n", "\n", " {\"consumer_key\": '1XFu2urZzoMoC5sadXAjA7IoQ',\n", " \"consumer_secret\": 'FrJOlHfNLp3M7ejJWiO5k74E9ai6L5EzQJ45HmlsUINbh8qUUi',\n", " \"access_token_key\": '16562593-Texko6g7VyCwhNUfxBDoJKJl4058hpvQkqAYWRKpi',\n", " \"access_token_secret\": 'ISZCTvN6bYJVaJ3Z2iidQObTzE2pxkINBLi0WWe9Ab2Zv',\n", " \"requests_timeout\": 60},\n", "\n", " {\"consumer_key\": 'r8Bvdm6I8QrRPuVzP4VtRYpqd',\n", " \"consumer_secret\": 'CzA8u8M8nDiDCCrSzCsXpR3SyTGCaLppDWbdTxSg78ZKgtKkhh',\n", " \"access_token_key\": '16562593-I3l0ZSmfZbMxIQ2NbiiM2eDMA4KNzFmFBeUkWxunR',\n", " \"access_token_secret\": '9HkILP4kSMF0hgvsB126jpoUzsRXETYMlSM0YSKb2yMJH',\n", " \"requests_timeout\": 60},\n", "\n", " {\"consumer_key\": 'NmMjfP1Zt3n2VDZ15X7SDGM6G',\n", " \"consumer_secret\": 'j9JBx7HUbMpcDnFteiIAAgHSoA8idlqQ20A1xbvnMrqMrOHQ1n',\n", " \"access_token_key\": '16562593-zUNyMUdO9JnSIstmTrqdyHHmX2lpv9NqkQxGC8faP',\n", " 
\"access_token_secret\": 'DEeHvLjTXlxNGmqDntXOK0cJCX08cnpg0btoRXWATW3X2',\n", " \"requests_timeout\": 60}\n", "]\n", "\n", "# Now create a list of twitter API objects\n", "apis = []\n", "for token in API_TOKENS:\n", " apis.append( twitter.Api(consumer_key=token['consumer_key'],\n", " consumer_secret=token['consumer_secret'],\n", " access_token_key=token['access_token_key'],\n", " access_token_secret=token['access_token_secret'],\n", " requests_timeout=60))\n", "\n", "\n", "# The account id / screen name we want followers from\n", "account_screen_name = 'fairmediawatch'\n", "account_id = '54679731'\n", "\n", "# Keep track of nodes connected to account, and all edges we need in the graph\n", "nodes = set()\n", "edges = defaultdict(set)\n", "\n", "\n", "# Try to load first level followers from pickle;\n", "# otherwise, generate them from a single API call and save via pickle\n", "try:\n", " logger.info(\"Loading followers for %s\" % account_screen_name)\n", " f = open(\"following1\", \"rb\")\n", " following = pickle.load(f)\n", "except Exception as e:\n", " logger.info(\"Failed. Generating followers for %s\" % account_screen_name)\n", " following = api.GetFriendIDs(screen_name=account_screen_name)\n", " pickle.dump(following, open(\"following1\", \"wb\"))\n", "\n", "# Try to load the nodes and first level edges from pickle;\n", "# otherwise generate them from the 'following' list and save\n", "try:\n", " logger.info(\"Loading nodes and edges for depth = 1, for %s\" % account_screen_name)\n", " n = open(\"nodes.follow1.set\", \"rb\")\n", " e = open(\"edges.follow1.dict\", \"rb\")\n", " nodes = pickle.load(n)\n", " edges = pickle.load(e)\n", "except Exception as e:\n", " logger.info(\"Failed. 
Generating nodes and edges for depth = 1, for %s\" % account_screen_name)\n", " for follower in following:\n", " nodes.add(follower)\n", " edges[account_id].add(follower)\n", " pickle.dump(nodes, open(\"nodes.follow1.set\", \"wb\"))\n", " pickle.dump(edges, open(\"edges.follow1.dict\", \"wb\"))\n", "\n", "\n", "\n", "### Crawling for Depth2\n", "\n", "\n", "# Index the api list, and start from the first api object\n", "api_idx = 0\n", "api = apis[api_idx]\n", "\n", "# Some accounts give us issues (either too many followers or no permissions)\n", "blacklist= [74323323, 43532023, 19608297, 25757924, 240369959, 173634807, 17008482, 142143804]\n", "api_updated = False\n", "\n", "# It is nice to start from a point in the list, instead of from the beginning\n", "starting_point = 142143804\n", "if starting_point:\n", " starting_point_idx = following.index(starting_point)\n", " following_iter = range(starting_point_idx, len(following))\n", "else:\n", " following_iter = range(len(following))\n", "\n", "# Try loading second layer of followers from pickle, otherwise start from scratch\n", "try:\n", " f = open(\"edges.follow2.dict\", \"rb\")\n", " edges = pickle.load(f)\n", " logger.info(\"Loaded edges.follow2 into memory!\")\n", "except Exception as e:\n", " logger.info(\"Starting from SCRATCH: did not load edges.follow2 into memory!\")\n", " pass\n", "\n", "# For each follower of the main account ...\n", "for follower_idx in following_iter:\n", " follower = following[follower_idx]\n", " success = False\n", " \n", " # ... 
check if they are on the blacklist; if so, skip\n", " if follower in blacklist:\n", " logger.info(\"Skipping due to blacklist\")\n", " continue\n", "\n", " # Otherwise, attempt to get list of their followers\n", " followers_depth2_list = []\n", " while not success:\n", " try:\n", " logger.info(\"Getting followers for follower %s\" % follower)\n", " followers_depth2_list = getFollowers(api, follower)\n", " success = True\n", " except TimeoutError as e:\n", " # If the API call takes too long, move on\n", " logger.info(\"Timeout after 60s for follower %d\" % follower)\n", " success = True # technically not a success but setting flag so next loop moves on\n", " continue\n", " except Exception as e:\n", " # If we get here, then we hit API limits\n", " logger.info(\"API Exception %s; api-idx = %d\" % (str(e), api_idx))\n", " \n", " # Are we at the beginning of the API list? \n", " # If so, dump edges so far via pickle and sleep\n", " if api_updated and api_idx % len(API_TOKENS) == 0 and api_idx >= len(API_TOKENS):\n", " logger.info(\"Save edges to pickle file for follower = %s\" % follower)\n", " pickle.dump(edges, open(\"edges.follow2.dict\", \"wb\"))\n", " logger.info(\"Sleeping ...\")\n", " time.sleep(60)\n", " api_updated = False\n", " # Otherwise, move on to the next api object and try again\n", " else:\n", " api_idx += 1\n", " api = apis[api_idx % len(API_TOKENS)]\n", " api_updated = True\n", " \n", " \n", " # After getting the followers, intersect them with the first-level\n", " # follower set and add the result to the edge dict\n", " if followers_depth2_list:\n", " logger.info(\"Adding followers to the graph\")\n", " edges[follower].update(nodes.intersection(followers_depth2_list))\n", "\n", "\n", "# Write out final list of edges via pickle\n", "logger.info(\"Save edges to pickle file for follower = %s\" % follower)\n", "pickle.dump(edges, open(\"edges.follow2.dict\", \"wb\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of 
running the above, let's just load everything via pickle:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pickle\n", "n = open(\"nodes.follow1.set\", \"rb\")\n", "nodes = pickle.load(n)\n", "\n", "e = open(\"edges.follow2.dict\", \"rb\")\n", "edges = pickle.load(e)\n", "\n", "f = open(\"following1\", \"rb\")\n", "following = pickle.load(f)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Step 2 - Generate Graph from Crawl\n", "\n", "First, we generate CSV files so we can load the data into GraphLab Create." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Hide some silly output\n", "import logging\n", "logging.getLogger(\"requests\").setLevel(logging.WARNING)\n", "logging.getLogger(\"urllib3\").setLevel(logging.WARNING)\n", "\n", "# Import everything we need\n", "import graphlab as gl\n", "\n", "# Generate CSVs from the previous crawl\n", "f = open('vertices.csv', 'w')\n", "f.write('id\\n')\n", "for node in nodes:\n", " f.write(str(node) + \"\\n\")\n", "f.close()\n", "\n", "f = open('edges.csv', 'w')\n", "f.write('src,dst,relation\\n')\n", "for node, followers in edges.iteritems():\n", " for follower in followers:\n", " f.write('%s,%s,%s\\n' % (follower, node, 'follows'))\n", "f.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's use these CSV files and load them into a graph object called g:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[INFO] This non-commercial license of GraphLab Create is assigned to james.quacinella@gmail.comand will expire on January 01, 2038. 
For commercial licensing options, visit https://dato.com/buy/.\n", "\n", "[INFO] Start server at: ipc:///tmp/graphlab_server-18863 - Server binary: /usr/local/lib/python2.7/dist-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1437714775.log\n", "[INFO] GraphLab Server Version: 1.5.1\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "PROGRESS: Finished parsing file /home/james/Development/Masters/IndependentStudy/Final/vertices.csv\n", "PROGRESS: Parsing completed. Parsed 100 lines in 0.024206 secs.\n", "------------------------------------------------------\n", "Inferred types from first line of file as \n", "column_type_hints=[int]\n", "If parsing fails due to incorrect types, you can correct\n", "the inferred type list above and pass it to read_csv in\n", "the column_type_hints argument\n", "------------------------------------------------------\n", "PROGRESS: Finished parsing file /home/james/Development/Masters/IndependentStudy/Final/vertices.csv\n", "PROGRESS: Parsing completed. Parsed 1108 lines in 0.018389 secs.\n", "PROGRESS: Finished parsing file /home/james/Development/Masters/IndependentStudy/Final/edges.csv\n", "PROGRESS: Parsing completed. Parsed 100 lines in 0.114743 secs.\n", "------------------------------------------------------\n", "Inferred types from first line of file as \n", "column_type_hints=[int,int,str]\n", "If parsing fails due to incorrect types, you can correct\n", "the inferred type list above and pass it to read_csv in\n", "the column_type_hints argument\n", "------------------------------------------------------\n", "PROGRESS: Finished parsing file /home/james/Development/Masters/IndependentStudy/Final/edges.csv\n", "PROGRESS: Parsing completed. 
Parsed 105006 lines in 0.076969 secs.\n" ] } ], "source": [ "# Load Data\n", "gvertices = gl.SFrame.read_csv('vertices.csv')\n", "gedges = gl.SFrame.read_csv('edges.csv')\n", "\n", "# Create graph; add each edge in both directions so it behaves as undirected\n", "g = gl.SGraph()\n", "g = g.add_vertices(vertices=gvertices, vid_field='id')\n", "g = g.add_edges(edges=gedges, src_field='src', dst_field='dst')\n", "g = g.add_edges(edges=gedges, src_field='dst', dst_field='src')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to visualize the graph!" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Canvas is accessible via web browser at the URL: http://localhost:48677/index.html\n", "Opening Canvas in default web browser.\n" ] } ], "source": [ "# Visualize graph?\n", "gl.canvas.set_target('browser')\n", "g.show(vlabel=\"id\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks like it's too large a graph to display.\n", "\n", "## Central / Important Nodes\n", "\n", "Let's use PageRank to find important nodes in the network:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PROGRESS: Counting out degree\n", "PROGRESS: Done counting out degree\n", "PROGRESS: +-----------+-----------------------+\n", "PROGRESS: | Iteration | L1 change in pagerank |\n", "PROGRESS: +-----------+-----------------------+\n", "PROGRESS: | 1 | 617.534 |\n", "PROGRESS: | 2 | 135.25 |\n", "PROGRESS: | 3 | 30.9247 |\n", "PROGRESS: | 4 | 8.64859 |\n", "PROGRESS: | 5 | 2.52531 |\n", "PROGRESS: | 6 | 0.885368 |\n", "PROGRESS: | 7 | 0.323184 |\n", "PROGRESS: | 8 | 0.126578 |\n", "PROGRESS: | 9 | 0.0503135 |\n", "PROGRESS: | 10 | 0.0203128 |\n", "PROGRESS: | 11 | 0.00825336 |\n", "PROGRESS: +-----------+-----------------------+\n" ] }, { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
__idpagerankdelta
546797317.158930546982.08163053328e-05
591597715.735895085025.1434297017e-06
1691827275.682489858632.4887386834e-05
169352924.989570112233.37281975513e-05
19473014.393396145391.60673868965e-06
238398354.361130118461.93642112549e-05
160760324.341638947197.78788532063e-06
101178923.965206836724.10425955666e-06
169559913.845120609141.25983857791e-05
4782030183.401068989182.55084466252e-05
\n", "[10 rows x 3 columns]
\n", "
" ], "text/plain": [ "Columns:\n", "\t__id\tint\n", "\tpagerank\tfloat\n", "\tdelta\tfloat\n", "\n", "Rows: 10\n", "\n", "Data:\n", "+-----------+---------------+-------------------+\n", "| __id | pagerank | delta |\n", "+-----------+---------------+-------------------+\n", "| 54679731 | 7.15893054698 | 2.08163053328e-05 |\n", "| 59159771 | 5.73589508502 | 5.1434297017e-06 |\n", "| 169182727 | 5.68248985863 | 2.4887386834e-05 |\n", "| 16935292 | 4.98957011223 | 3.37281975513e-05 |\n", "| 1947301 | 4.39339614539 | 1.60673868965e-06 |\n", "| 23839835 | 4.36113011846 | 1.93642112549e-05 |\n", "| 16076032 | 4.34163894719 | 7.78788532063e-06 |\n", "| 10117892 | 3.96520683672 | 4.10425955666e-06 |\n", "| 16955991 | 3.84512060914 | 1.25983857791e-05 |\n", "| 478203018 | 3.40106898918 | 2.55084466252e-05 |\n", "+-----------+---------------+-------------------+\n", "[10 rows x 3 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pr = gl.pagerank.create(g)\n", "pr.get('pagerank').topk(column_name='pagerank')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import The Graph to iGraph\n", "\n", "Next we will load the graph data into igrpah and perform clustering to find communities" ] }, { "cell_type": "code", "execution_count": 140, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from igraph import *\n", "\n", "# Create empty graph\n", "twitter_graph = Graph(directed=False)\n", "\n", "# Setup the nodes\n", "for node in nodes:\n", " if isinstance(node, int):\n", " twitter_graph.add_vertex(name=str(node))" ] }, { "cell_type": "code", "execution_count": 142, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Setup the edges\n", "for user in edges:\n", " for follower in edges[user]:\n", " try:\n", " twitter_graph.add_edge(str(follower), str(user))\n", " except Exception as e:\n", " print user, follower\n", " print e\n", " break" ] }, { "cell_type": "code", "execution_count": 143, 
"metadata": { "collapsed": false }, "outputs": [], "source": [ "# Add the 'ego' edges\n", "following = pickle.load(open(\"following1\", \"rb\"))\n", "for node in following:\n", " twitter_graph.add_edge(str(node), \"54679731\")" ] }, { "cell_type": "code", "execution_count": 145, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# for v in twitter_graph.vs.select(name_eq=\"54679731\"):\n", "# print v\n", " \n", "# for e in twitter_graph.es.select(_source=v.index): print e\n", "# for e in twitter_graph.es.select(_target=533): print e" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "igraph.Vertex(,544,{'name': '54679731'})\n" ] } ], "source": [ "# for v in twitter_graph.vs:\n", "# if len(twitter_graph.es.select(_source=v.index)) == 0 and len(twitter_graph.es.select(_target=v.index)) == 0:\n", "# print v" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pickle.dump(twitter_graph, open(\"twitter_graph\", \"wb\"))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Load twitter grapg to prevent running the above\n", "twitter_graph = pickle.load(open(\"twitter_graph\", \"rb\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Display the Graph" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "layout = twitter_graph.layout_drl()\n", "plt1 = plot(twitter_graph, 'graph.drl.png', layout = layout)" ] }, { "cell_type": "code", "execution_count": 257, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 257, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import HTML\n", "s = \"\"\"\"\"\"\n", "h = HTML(s); h" ] }, { "cell_type": "code", 
"execution_count": 214, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 214, "metadata": {}, "output_type": "execute_result" } ], "source": [ "layout = twitter_graph.layout(\"graphopt\")\n", "plt2 = plot(twitter_graph, 'graph.graphopt.png', layout = layout)" ] }, { "cell_type": "code", "execution_count": 258, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 258, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = \"\"\"\"\"\"\n", "h = HTML(s); h" ] }, { "cell_type": "code", "execution_count": 227, "metadata": { "collapsed": true }, "outputs": [], "source": [ "layout = twitter_graph.layout(\"lgl\")\n", "plt2 = plot(twitter_graph, 'graph.lgl.png', layout = layout)" ] }, { "cell_type": "code", "execution_count": 260, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 260, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = \"\"\"\"\"\"\n", "h = HTML(s); h" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets trim down the graph to only large nodes:" ] }, { "cell_type": "code", "execution_count": 237, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# https://lists.nongnu.org/archive/html/igraph-help/2012-11/msg00047.html\n", "twitter_graph2 = twitter_graph.copy()\n", "nodes = twitter_graph2.vs(_degree_lt=200)\n", "twitter_graph2.es.select(_within=nodes).delete()\n", "twitter_graph2.vs(_degree_lt=200).delete()\n", "layout = twitter_graph2.layout_drl()\n", "plt1 = plot(twitter_graph2, 'graph2.drl.png', layout = layout)" ] }, { "cell_type": "code", "execution_count": 261, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 261, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = \"\"\"\"\"\"\n", "h = HTML(s); h" ] }, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Clustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, lets run the walktrap community algorithm, which seems to produce a heirarchical clustering:" ] }, { "cell_type": "code", "execution_count": 243, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 243, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wc = twitter_graph.community_walktrap()\n", "plot(wc, 'cluster.walktrap.png', bbox=(3000,3000))" ] }, { "cell_type": "code", "execution_count": 266, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 266, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = \"\"\"\"\"\"\n", "h = HTML(s); h" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eigen = twitter_graph.community_leading_eigenvector()\n", "plot(eigen, 'cluster.eigen.test.png', mark_groups=True, bbox=(5000,5000))" ] }, { "cell_type": "code", "execution_count": 270, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 270, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = \"\"\"\"\"\"\n", "h = HTML(s); h" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Still messy, lets try doing the same thing but on the smaller graph:" ] }, { "cell_type": "code", "execution_count": 238, "metadata": { "collapsed": true }, "outputs": [], "source": [ "eigen2 = twitter_graph2.community_leading_eigenvector()\n", "plot(eigen2, 'cluster2.eigen.png', mark_groups=True, bbox=(5000,5000))" ] }, { "cell_type": "code", "execution_count": 269, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" 
], "text/plain": [ "" ] }, "execution_count": 269, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = \"\"\"\"\"\"\n", "h = HTML(s); h" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Bios of Twitter Users\n", "\n", "Now that we have the graph communities, lets crawl twitter for the bios on each user. I will not reproduce the code, as it is mostly a copy of the above twitter crawl, using a different API call. Check biocrawl.py in the repo. We'll load the data from pickle:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bios = pickle.load(open(\"bios\", \"rb\"))\n", "\n", "from collections import defaultdict\n", "documents = defaultdict(str)\n", "for v_idx, cluster in zip(range(len(eigen.membership)), eigen.membership):\n", " twitter_id = int(twitter_graph.vs[v_idx].attributes()['name'])\n", " if twitter_id in bios:\n", " documents[cluster] += \"\\n%s\" % bios[ twitter_id ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Important Nodes from PageRank\n", "\n", "Lets look at the important nodes in the network and their associated bios:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------------+--------------------------------------------------------------+------------+\n", "| Rank | Bio | Cluster ID |\n", "+---------------+--------------------------------------------------------------+------------+\n", "| 4.39339614539 | Instigating progress since 1865 | 0 |\n", "| 3.84512060914 | Providing fearless political journalism and cultural | 0 |\n", "| | analysis since the dawn of the digital era. We're also at | |\n", "| | http://t.co/arAhFcxlTr | |\n", "| 3.30740490311 | Investigative journalist, blogger at Clear it With Sidney at | 0 |\n", "| | the Hillman Foundation and Duly Noted at In These Times. 
Co- | |\n", "| | Host of @PointofInquiry. #binders | |\n", "| 3.2642239473 | Media Matters for America is the nation's premier | 0 |\n", "| | progressive media watchdog, research and information center. | |\n", "| 3.24584856758 | AlterNet is a progressive news magazine and online community | 0 |\n", "| | http://t.co/f4LPGamU | |\n", "| 3.05310397834 | Investigative journalism, politics, chart-tastic, and | 0 |\n", "| | sometimes sarcastic. We're the nonprofit news organization | |\n", "| | that brought you the 47 percent video. | |\n", "| 2.63898887362 | The latest political news from The Huffington Post's | 0 |\n", "| | politics team. | |\n", "| 2.46692594602 | Dad/husband. @CNN. Co-founded @RebuildDream, @GreenForAll, | 0 |\n", "| | @ColorOfChange, @EllaBakerCenter, @YesWeCode & #cut50. | |\n", "| | Wrote: GreenCollar Economy & Rebuild The Dream | |\n", "| 2.38625057759 | Bloomberg @BW reporter, covering politics/ policy/ labor. | 0 |\n", "| | Send me your tips (and complaints): jeidelson at bloomberg | |\n", "| | dot net [Usual disclaimers] | |\n", "| 2.3499563445 | Adele M. Stan is a columnist at The American Prospect, and | 0 |\n", "| | editor of Clarion, the newspaper of Professional Staff | |\n", "| | Congress/CUNY. She also plays the ukulele. | |\n", "| 2.34208284562 | The official Twitter of http://t.co/HJOFeYodXw | 0 |\n", "| 2.25057081559 | The official Twitter of @msnbc's Melissa Harris-Perry. | 0 |\n", "| | Exploring politics, culture, art, and community beyond the | |\n", "| | beltway every Saturday and Sunday, 10a-12p ET. | |\n", "| 2.19802709748 | labor movement, left politics, superhero comics, rock & roll | 0 |\n", "| | & online organizing. 
| |\n", "| 2.19673751375 | Host of The Katie Halper Show on WBAI, lefty comedian / | 0 |\n", "| | blogger at Salon, Vice, The Nation, Feministing, Raw Story, | |\n", "| | Alternet, Comedy Central & more / filmmaker | |\n", "| 2.17501663699 | Reporter/enviro editor at @HuffPostPol and VP of membership | 0 |\n", "| | at @sejorg. Fan of gravity, fermentation, and the serial | |\n", "| | comma. | |\n", "| 2.11935441039 | Writer/Producer/Comic. Co creator of The Daily Show. Author: | 0 |\n", "| | Lizz Free Or Die. Exposer of crackpots at | |\n", "| | http://t.co/COQwCZPcmF and http://t.co/WhllLpegoQ | |\n", "| 2.09854384862 | Analyst, advocate, writer, mom. Feminist. Thinker & doer. | 0 |\n", "| | Realist & dreamer. Progressive but not predictable. Editor- | |\n", "| | in-Chief, RH RealityCheck. UW Madison alum | |\n", "| 2.06228546594 | Executive Editor and Columnist, The Nation | 0 |\n", "| 2.05351480712 | Columnist @GuardianUS | Feminist author | Pasta enthusiast, | 0 |\n", "| | native NYer | My books http://t.co/Ct9Ck2TTqb | Eat Me | |\n", "| | http://t.co/qkg3oItKfw | |\n", "| 2.03231408258 | Author of THE TEACHER WARS. Staff writer @MarshallProj. | 0 |\n", "| | http://t.co/NA3HloQjwN | |\n", "| 2.03109035963 | Writer. Lawyer. Eater. Formerly @Cosmopolitan senior | 0 |\n", "| | political writer, @GuardianUS columnist, @Feministe blogger. | |\n", "| 2.02762932794 | Wake Forest University Professor, Director @AJCCenter, | 0 |\n", "| | Executive Director @WakeEngaged, MSNBC Host of @MHPShow, | |\n", "| | contributor to @TheNation & @EssenceMag | |\n", "| 2.0121987956 | National Editor @buzzfeednews. PGP: http://t.co/xGj7z2Ljki | 0 |\n", "| | adam.serwer@buzzfeed.com https://t.co/zl6RcFyMTN | |\n", "| 1.99455894381 | Senior Editor, @tnr. Host of @IntersectionTNR, a new podcast | 0 |\n", "| | about race, gender, and all the ways we identify. 
| |\n", "+---------------+--------------------------------------------------------------+------------+\n", "+---------------+--------------------------------------------------------------+------------+\n", "| Rank | Bio | Cluster ID |\n", "+---------------+--------------------------------------------------------------+------------+\n", "| 4.98957011223 | Independent, Daily Global News Hour Anchored by Amy Goodman | 1 |\n", "| | & Juan González. Stream Live 8am ET http://t.co/SL25z1kZE5. | |\n", "| | Support Independent Media - Donate Today | |\n", "| 3.96520683672 | Investigative reporting, political commentary, cultural | 1 |\n", "| | coverage, activism, interviews, poetry, and humor since | |\n", "| | 1909. | |\n", "| 3.04835708177 | YES! Magazine's award-winning journalism reframes the | 1 |\n", "| | biggest problems of our time in terms of their solutions. | |\n", "| | Independent, nonprofit, reader-supported. | |\n", "| 2.90741081734 | Drilling beneath the headlines. Follow us for provocative | 1 |\n", "| | and insightful news, features and analysis. | |\n", "| 2.89312248828 | Occupying Wall Street since Sep 17, 2011. Standing with the | 1 |\n", "| | global #Occupy movement. About our team: | |\n", "| | http://t.co/7SfBMRuTjZ #OWS | |\n", "| 2.66065430813 | Monthly news magazine committed to informing and analyzing | 1 |\n", "| | movements for social, environmental and economic justice. | |\n", "| | Founded in 1976. | |\n", "| 2.47763903414 | Pursuing stories with moral force. Curating your best | 1 |\n", "| | #muckreads. Tweets by @terryparrisjr + @amzam. Send tips | |\n", "| | securely: https://t.co/JWIupK6Wrl | |\n", "| 2.36358932648 | Host,The Laura Flanders Show on @GRITtv, radio commentator; | 1 |\n", "| | author, BUSHWOMEN, BLUE GRIT; Editor, At the Tea Party... | |\n", "| 2.18561738802 | Colorlines is a daily news site where race matters, | 1 |\n", "| | featuring award-winning investigative reporting and news | |\n", "| | analysis. 
| |\n", "| 2.177018398 | TMC is a North American network of leading independent media | 1 |\n", "| | outlets. Follow us for news & nuance you can't find anywhere | |\n", "| | else! Tweets by @jgksf + @manolialive | |\n", "| 2.06230444459 | Truthout is dedicated to providing independent news & | 1 |\n", "| | commentary. We hope to inspire the direct action necessary | |\n", "| | to save the planet & humanity. Official Tweets. | |\n", "| 2.00360103986 | progressive, bold, 100% independent, journalism and | 1 |\n", "| | advocacy, tenth anniversary online | |\n", "+---------------+--------------------------------------------------------------+------------+\n", "+---------------+--------------------------------------------------------------+------------+\n", "| Rank | Bio | Cluster ID |\n", "+---------------+--------------------------------------------------------------+------------+\n", "| 5.73589508502 | The Nation magazine Editor and Publisher | 2 |\n", "| 3.325182719 | Washington editor @the_intercept, a First Look Media | 2 |\n", "| | publication. froomkin@theintercept.com How to leak to me: | |\n", "| | http://t.co/md5GQRJby1 | |\n", "| 3.05833410529 | Host of All In with Chris Hayes on MSNBC, Weeknights at 8pm. | 2 |\n", "| | Editor at Large at The Nation. Cubs fan. | |\n", "| 2.99593940177 | Covering national politics for @washingtonpost. Finishing a | 2 |\n", "| | book about progressive rock (W.W. Norton). | |\n", "| | daveweigel@gmail.com, 302-507-6806. | |\n", "| 2.99589140109 | I see political people...\r", " | 2 |\n", "| 2.76573594395 | Moving news forward. Editor-In-Chief @JuddLegum | 2 |\n", "| 2.67779981051 | Editor-in-chief, http://t.co/5gESirESRH. Policy analyst at | 2 |\n", "| | MSNBC. Hater of filibuster. Lover of charts. Come work | |\n", "| | @Voxdotcom! http://t.co/VhALOi3yKC | |\n", "| 2.6711824588 | Senior Political Reporter and Politics Managing Editor, | 2 |\n", "| | HuffPost. 
aterkel at huffingtonpost dot com Sign up for my | |\n", "| | newsletter: https://t.co/tM0sM6PgOR | |\n", "| 2.64430685492 | Photojournalist covering media, culture & politics I also | 2 |\n", "| | write (mostly here lately) Also @tigerbeat on instagram | |\n", "| | Email srhodes at gmail | |\n", "| 2.55744476724 | I teach journalism at NYU, direct the Studio 20 program | 2 |\n", "| | there, critique the press and try to understand digital | |\n", "| | logic. I also advise media companies sometimes. | |\n", "| 2.52993123165 | http://t.co/3NqFkIfQya: the Aggressive Progressives since | 2 |\n", "| | 2000. 2 million strong and growing. Yes We Will! | |\n", "| 2.45535873698 | Editor-in-Chief, FiveThirtyEight. Author, The Signal and the | 2 |\n", "| | Noise (http://t.co/9mLliQYI8N). Sports/politics/food geek. | |\n", "| 2.43432862068 | CNN Anchor and Chief Washington Correspondent. Dissecting my | 2 |\n", "| | tweets with Talmudic meticulousness will result in wrong | |\n", "| | conclusions. RTs do not = endorsement. | |\n", "| 2.43060658092 | Making sense of what matters, tweets from ‘Moyers & Company’ | 2 |\n", "| | producers and Bill Moyers. Keep track of the corrupting | |\n", "| | influence of $ on politics. | |\n", "| 2.39529425002 | A blog about politics, politics, and politics | 2 |\n", "| 2.35927440968 | Author of a dozen books. Film producer. Ed. of Editor & | 2 |\n", "| | Publisher and Crawdaddy. Daily blogger. Next book optioned | |\n", "| | for Paul Greengrass flick. | |\n", "| 2.34320763082 | DC editor of Mother Jones, MSNBC analyst & author of the new | 2 |\n", "| | book, SHOWDOWN: The Inside Story of How Obama Fought Back | |\n", "| | Against Boehner, Cantor & the Tea Party | |\n", "| 2.33831887877 | Editor in Chief, @YahooPolitics, a @YahooNews production. | 2 |\n", "| | Politics, media, breaking. | |\n", "| 2.27046419486 | Breaking news and analysis from the TPM team. 
| 2 |\n", "| 2.27025295481 | CNN's senior media correspondent and host of @CNNReliable. | 2 |\n", "| | Formerly @nytimes, @tvnewser and Top of the Morning. Email: | |\n", "| | bstelter@gmail.com | |\n", "| 2.25702219683 | Nobel laureate. Op-Ed columnist, @nytopinion. Author, “The | 2 |\n", "| | Return of Depression Economics,” “The Great Unraveling,” | |\n", "| | “The Age of Diminished Expectations” + more. | |\n", "| 2.17156961195 | Investigative Journalist. @the_intercept | 2 |\n", "| | lee.fang@theintercept.com | |\n", "| 2.16913235691 | NY Times columnist, co-author of Half the Sky & A Path | 2 |\n", "| | Appears, http://t.co/bcxQaJYCMg Newsletter: | |\n", "| | http://t.co/EYhBhaKPv1 | |\n", "| 2.14931459631 | The New Yorker is a weekly magazine with a mix of reporting | 2 |\n", "| | of politics and culture, humor and cartoons, fiction and | |\n", "| | poetry, and reviews and criticism. | |\n", "| 2.12878229972 | Your favorite national security reporter's favorite national | 2 |\n", "| | security reporter. Bette's dad. | |\n", "| | spencer.ackerman@theguardian.com Public key: | |\n", "| | http://t.co/hRo2CKhJ6Q | |\n", "| 2.12198501232 | Monitoring the press, tracking the evolving media business & | 2 |\n", "| | encouraging excellence in journalism since 1961. | |\n", "| 2.09821142588 | Founder of Daily Kos, Co-founder Vox Media | 2 |\n", "| 2.04895071097 | DFH/blogger/humanoid | 2 |\n", "| 2.03705055969 | A little of this, a little of that. | 2 |\n", "| 2.01207242886 | Website of the Center for Responsive Politics, the most | 2 |\n", "| | comprehensive, nonpartisan money-in-politics resource | |\n", "| | around. 
Get the must-reads: http://t.co/3722t5iaZH | |\n", "+---------------+--------------------------------------------------------------+------------+\n", "+---------------+--------------------------------------------------------------+------------+\n", "| Rank | Bio | Cluster ID |\n", "+---------------+--------------------------------------------------------------+------------+\n", "| 5.68248985863 | Editor of FAIR's magazine Extra! since 1990. \r", " | 3 |\n", "| 4.36113011846 | independent journalist, co founder of @the_intercept PGP | 3 |\n", "| | key/contact: https://t.co/lnq46VuHN0 | |\n", "| 4.34163894719 | Journalist with @The_Intercept - author, No Place to Hide - | 3 |\n", "| | dog/animal fanatic - email/PGP public key | |\n", "| | (https://t.co/uJnK90oulZ) | |\n", "| 3.40106898918 | Doing communications at @ncacensorship. Many years at | 3 |\n", "| | @fairmediawatch. Will get better at surfing someday. | |\n", "| 3.37780426034 | Journalist focused on prisons & harsh sentencing. More fun | 3 |\n", "| | than I sound. | |\n", "| 3.15649462152 | they say I'm polarizing | 3 |\n", "| 2.93662754386 | Senior Writer, http://t.co/UX4ClyaE8E Author, The 51 Day | 3 |\n", "| | War: Ruin and Resistance in Gaza http://t.co/faFdf2BdZ3 | |\n", "| 2.83537931907 | Do @accuracy (news releases), @dcstakeout (questioning | 3 |\n", "| | politicos), @votepact (left & right pairing up) and | |\n", "| | @xposefacts (whistleblowers). Also, artsy. | |\n", "| 2.71412498047 | Author of When the World Outlawed War, War Is A Lie and | 3 |\n", "| | Daybreak: Undoing the Imperial Presidency and Forming a More | |\n", "| | Perfect Union. | |\n", "| 2.69338887562 | @thinkprogress, @unitedrepublic, and @boldprogressive in my | 3 |\n", "| | past, @Alternet in my present. Love cats, the South, and | |\n", "| | cheesecake | |\n", "| 2.6779919512 | We open governments. 
| 3 |\n", "| 2.67055757303 | Abundant tweets about civil liberties and national security, | 3 |\n", "| | football, Beer Mecca, and other craic. | |\n", "| 2.61628118821 | The Center for Constitutional Rights is dedicated to | 3 |\n", "| | advancing and protecting the rights guaranteed by the U.S. | |\n", "| | Constitution and the UDHR. | |\n", "| 2.56572209168 | @IBTimes Senior Editor, Investigations. Also: Denverite, | 3 |\n", "| | vegetarian, author, newspaper columnist, real guy | |\n", "| | represented by the character on ABC's The Goldbergs | |\n", "| 2.49664644121 | Journalist and author of THE DIVIDE, GRIFTOPIA and THE GREAT | 3 |\n", "| | DERANGEMENT | |\n", "| 2.4337795872 | HRW provides timely information about #humanrights crises in | 3 |\n", "| | 90+ countries. Curated by @jimmurphysf & @astroehlein Staff | |\n", "| | list: https://t.co/wBw0SILvlQ | |\n", "| 2.32695707327 | Co-host of @CitizenRadio. Independent journalist. I've | 3 |\n", "| | written for places. Author of the new @CitizenRadio book | |\n", "| | #NEWSFAIL. Order: http://t.co/OenJarCjeU | |\n", "| 2.31465869927 | @AJAM columnist & author, The Passion of Chelsea Manning: | 3 |\n", "| | The Story behind the Wikileaks Whistleblower. | |\n", "| 2.30283084457 | Journalist who covers dissent, whistleblowing, secrecy, | 3 |\n", "| | police, spying, etc. Co-host of Unauthorized Disclosure | |\n", "| | (@UnauthorizedDis) podcast. Outside agitator. | |\n", "| 2.25773320367 | I write think pieces on twitter. | 3 |\n", "| 2.22078808059 | Independent journalist. Objectivity is bullshit. | 3 |\n", "| 2.162909011 | Communications professional. Amateur expertician. Vellichor | 3 |\n", "| | sufferer. Tsundoku artist. | |\n", "| 2.14320488526 | Journalist + musician + digger + dad. Author, SPIES FOR | 3 |\n", "| | HIRE. Covering war + biz @TheNation. Raised in Japan & South | |\n", "| | Korea. Honorary citizen of Kwangju. 
| |\n", "| 2.13987832419 | Filmmaker ('Fahrenheit 9/11'), author ('Stupid White Men'), | 3 |\n", "| | citizen ('United States of America'). | |\n", "| 2.12001844581 | Publisher, @truthout. Opinions my own. I don't hate you, but | 3 |\n", "| | I hate to critique/overrate you. | |\n", "| 2.07655045491 | Defending the American soldier imprisoned for revealing the | 3 |\n", "| | truth about war crimes and illegal foreign policies. | |\n", "| 2.06283984859 | Investigative reporter @vicenews. FOIA terrorist. Band | 3 |\n", "| | Tshirt hoarder. Author: News Junkie, a memoir, and The Abu | |\n", "| | Zubaydah Diaries. PGP: http://t.co/i2X8fWclVf | |\n", "| 2.05084546198 | Author, Political Activist, Columnist | 3 |\n", "| 2.03496812847 | Journalist and radio host specializing in economics & | 3 |\n", "| | politics | |\n", "| 2.01309910723 | Executive director, @FreedomofPress. Columnist, @GuardianUS | 3 |\n", "| | and @CJR. Remote operator, @Drones. [Views here are my own.] | |\n", "| 2.00461660532 | independent journalist. Democracy Now! correspondent. Nation | 3 |\n", "| | Institute fellow. | |\n", "| 2.00000440304 | Radio host for KPFK in L.A., Liberty Radio Network 12-2E. | 3 |\n", "| | 3,500 interviews since 2003. Married to reporter @larisa_a. | |\n", "| | Fan of, but not the lawyer from Harper's. | |\n", "| 1.98513659195 | Mondoweiss is a news website devoted to covering American | 3 |\n", "| | politics & policy in Israel/Palestine & the broader Middle | |\n", "| | East. @Mondowitz does most of the tweeting. 
| |\n", "+---------------+--------------------------------------------------------------+------------+\n" ] } ], "source": [ "import prettytable\n", "\n", "pts = []\n", "\n", "for cluster in set(eigen.membership):\n", " # Create a pretty table of PageRank score, bio, and cluster ID for each cluster\n", " pt = prettytable.PrettyTable([\"Rank\", \"Bio\", \"Cluster ID\"])\n", " pt.align[\"Bio\"] = \"l\" # Left align bio\n", " pt.max_width = 60 \n", " pt.padding_width = 1 # One space between column edges and contents (default)\n", " pts.append(pt)\n", "\n", "# Loop through the top 100 nodes by PageRank\n", "x = pr.get('pagerank')\n", "for row in x.topk(column_name='pagerank', k=100):\n", " if row['__id'] in bios:\n", " vidx = twitter_graph.vs.select(name_eq=str(row['__id']))[0].index\n", " clusterid = eigen.membership[vidx]\n", " pts[clusterid].add_row([row['pagerank'], bios[row['__id']].split(\"\\n\")[0], clusterid ])\n", "\n", "# Let's see the results!\n", "for pt in pts:\n", " print pt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the tables above, I can propose explanations for the clusters just by examining the most important nodes as ranked by PageRank:\n", "\n", "- Cluster 0: Media personalities\n", " - This is a little tougher to gauge, but cluster 0 is more oriented towards editors and personalities, while the others tend to be organizations or institutional accounts\n", "- Cluster 1: Independent media, 'far-left' media outlets\n", " - Amy Goodman (Democracy Now!), muckreads, #Occupy, GritTV, Truthout\n", " - Notice how this is a smaller group than the others\n", "- Cluster 2: Mainstream liberal organizations or media personalities\n", " - HuffPost, NY Times, CNN, MSNBC, Bill Moyers, New Yorker\n", "- Cluster 3: People / journalists involved with civil liberties, human rights and whistleblowing\n", " - FOIA, PGP, whistleblower, secrecy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bigram Analysis\n", "\n", "Let's do a bigram analysis on
the documents created by combining each user's bio with their latest tweet, using a custom stop word filter and keeping only tokens of length 3 or more:" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import string\n", "\n", "import nltk\n", "from nltk.collocations import *\n", "from nltk.tokenize import RegexpTokenizer\n", "from nltk.corpus import stopwords\n", "\n", "# Custom stopwords, mostly twitter slang and other internet rubbish\n", "my_stopwords = stopwords.words('english')\n", "my_stopwords = my_stopwords + ['http', 'https', 'bit', 'ly', 'co', 'rt', 'rts', 'com', 'org', 'dot', 'go', 'via', 'follow', 'us', 'retweet', 'also', 'run']\n", "\n", "def preProcess(text):\n", " text = text.lower()\n", " tokenizer = RegexpTokenizer(r'\\w+')\n", " tokens = tokenizer.tokenize(text)\n", " filtered_words = [w for w in tokens if w not in my_stopwords and not w.isdigit() and len(w) > 2]\n", " return \" \".join(filtered_words)\n", "\n", "def getBigrams(content, threshold=5):\n", " tokens = nltk.wordpunct_tokenize(preProcess(content))\n", " bigram_measures = nltk.collocations.BigramAssocMeasures()\n", " finder = BigramCollocationFinder.from_words(tokens)\n", " finder.apply_freq_filter(threshold)\n", " scored = finder.score_ngrams(bigram_measures.raw_freq)\n", " return sorted(scored, key=lambda t: t[1], reverse=True)" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------------+--------------------------+--------------------------+---------------------------+\n", "| Cluster 0 | Cluster 1 | Cluster 2 | Cluster 3 |\n", "+--------------------------+--------------------------+--------------------------+---------------------------+\n", "| executive director | social justice | new york | human rights |\n", "| 
official twitter | new york | donald trump | national security |\n", "| senior fellow | non profit | york times | foreign policy |\n", "| social justice | media democracy | editor chief | independent journalist |\n", "| women rights | sandra bland | staff writer | middle east |\n", "| contributing writer | social change | white house | views mine |\n", "| editor large | award winning | climate change | civil liberties |\n", "| editor nation | center media | new yorker | senior editor |\n", "| fast food | fast food | washington post | author new |\n", "| human rights | public citizen | climate science | new book |\n", "| investigative journalism | public interest | managing editor | award winning |\n", "| new york | advocacy organization | news chief | center economic |\n", "| planned parenthood | around world | talk show | columnist author |\n", "| pop culture | crime charges | times columnist | director center |\n", "| reproductive rights | director center | abc news | donald trump |\n", "| senior editor | dylann roof | climate hawk | economic policy |\n", "| social change | economic justice | daily kos | editor chief |\n", "| staff writer | executive director | editor publisher | guardian columnist |\n", "| american prospect | food workers | fox news | investigative journalism |\n", "| american way | hate crime | hillary clinton | iran deal |\n", "| calling women | human rights | huffington post | journalist the_intercept |\n", "| cartoonist illustrator | independent media | husband father | managing editor |\n", "| committed diversifying | institute policy | iran deal | new york |\n", "| community journalists | justice peace | level jobs | policy research |\n", "| contributor thenation | media justice | lindsey graham | radio host |\n", "| crime indictment | news analysis | media critic | sandra bland |\n", "| deputy editor | official twitter | media reporter | social justice |\n", "| director culture | people history | megyn kelly | treason charges |\n", "| 
diversifying world | policy studies | new book | writer editor |\n", "| doctorow last | public radio | news commentary | account tweets |\n", "| editor thenation | racial justice | nytopinion author | activist photographer |\n", "| essay nation | wall street | politics culture | activist recent |\n", "| feminist majority | activist author | politics media | activist writer |\n", "| food workers | alec new | pulitzer prize | agenda report |\n", "| great doctorow | america populist | views expressed | american politics |\n", "| hate crime | american legislative | washington correspondent | americans knew |\n", "| hillman foundation | american university | affairs correspondent | analysis current |\n", "| huffington post | author diet | amp gov | angeles times |\n", "| india clarke | blacklivesmatter protest | analyst author | apology massincarceration |\n", "| indictment dylannroof | charleston church | anchor abc | around world |\n", "| investigative journalist | children health | anchor chief | associate editor |\n", "| iran deal | citizen global | anchor cnn | attack free |\n", "| journalists thought | civil liberties | animation one | barrett brown |\n", "| last essay | civil rights | anti immigrant | based institute |\n", "| leaders committed | community organizer | associate professor | bear expect |\n", "| lgbt issues | community radio | author new | bernie sanders |\n", "| long way | conspiracy theory | board member | bianca jagger |\n", "| looks like | constitutional rights | book pulitzer | bill clinton |\n", "| majority foundation | contact press | book reviewer | black agenda |\n", "| managing editor | corporate power | box emmy | center constitutional |\n", "| media matters | create new | breaking news | clemencies bill |\n", "| nation zjuklguchz | criminal justice | breast cancer | climate change |\n", "| netroots nation | cultural critic | bureau chief | clinton half |\n", "| new book | current events | business government | cold war |\n", "| news 
organization | democracy yjbegzsujz | campaign coverage | constitutional rights |\n", "| non profit | digital media | cell phone | contact email |\n", "| nonprofit news | doesn explain | center author | contributing editor |\n", "| organization dedicated | drug policy | chief foreign | contributing writer |\n", "| people american | drug war | chief washington | cops well |\n", "| political research | editor progressive | chief white | current events |\n", "| power media | exchange council | chris hayes | debunks nuclear |\n", "| public policy | exec director | city hall | deepa kumar |\n", "| race gender | fast track | clean energy | deray sandrabland |\n", "| reporting analysis | fighting rights | columnist author | digital rights |\n", "| research associates | flanders show | columnist nytopinion | discourse providing |\n", "| sidney hillman | former journalist | columnist pulitzer | documented analysis |\n", "| statement thejusticedept | former press | communications director | dylann roof |\n", "| thejusticedept hate | forward thinking | confinement xj7wtgcdze | editor author |\n", "| thenation great | global trade | contributing editor | editor jacobin |\n", "| thought leaders | high school | contributor cjr | email pgp |\n", "| twitter account | independent nonprofit | correspondent host | enemy within |\n", "| united states | investigative reporting | coverage far | events issues |\n", "| vice president | journalism media | critics theirandeal | expect make |\n", "| views expressed | justice democracy | daily news | free press |\n", "| wam nyc | justice media | daily show | fundraising campaign |\n", "| women media | justice system | dancing bug | global affairs |\n", "| world conversation | kids need | democracy tweets | great work |\n", "| writer editor | labor love | dylann roof | hacking team |\n", "| writer feminist | last week | emmy nominated | half apology |\n", 
"+--------------------------+--------------------------+--------------------------+---------------------------+\n" ] } ], "source": [ "bigrams = []\n", "\n", "for clusteridx in documents:\n", " bigrams.append(getBigrams(documents[clusteridx], threshold=2))\n", "\n", "# Create a pretty table of the top bigrams, one column per cluster\n", "bigram_pt = prettytable.PrettyTable([\"Cluster 0\", \"Cluster 1\", \"Cluster 2\", \"Cluster 3\"])\n", "bigram_pt.align[\"Cluster 0\"] = \"l\" \n", "bigram_pt.align[\"Cluster 1\"] = \"l\" \n", "bigram_pt.align[\"Cluster 2\"] = \"l\" \n", "bigram_pt.align[\"Cluster 3\"] = \"l\" \n", "bigram_pt.max_width = 60 \n", "bigram_pt.padding_width = 1 # One space between column edges and contents (default)\n", "\n", "# Only emit as many rows as the shortest bigram list, so add_row never indexes past the end\n", "for idx in range(min(len(b) for b in bigrams)):\n", " bigram_pt.add_row([\" \".join(bigram[idx][0]) for bigram in bigrams])\n", "\n", "# Let's see the results!\n", "print bigram_pt" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Looking through the bigrams for each cluster supports the outline of the clusters given above. To expand on that outline with each cluster's top bigrams:\n", "\n", "- Cluster 0: Media personalities\n", " - executive director, contributing writer, senior fellow, contributing editor, senior editor, staff writer\n", "- Cluster 1: Independent media, 'far-left' media outlets\n", " - social justice, media democracy, independent media, media justice, news analysis, digital media, investigative reporting\n", "- Cluster 2: Mainstream liberal organizations or media personalities\n", " - new york times, new yorker, washington post, daily kos, washington correspondent\n", "- Cluster 3: People / journalists involved with civil liberties, human rights and whistleblowing\n", " - human rights, national security, foreign policy, civil liberties, digital rights, pgp(!)\n", "\n", "### Future Considerations\n", "\n", "- Use TF-IDF so that bigrams or tokens that appear in all documents are weighted less" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }