{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Final Project - James Quacinella\n",
    "\n",
    "Fo the final project, I will look at the follower network of one of the think tank Twitter account and perform clustering to find groups of associated accounts. Looking at the clusters, I hope to identify what joins them by performing some NLP tasks on the account's profile contents.\n",
    "\n",
    "## Step 1 - Crawl Twitter for Followers\n",
    "\n",
    "The next section of code does not run in the notebook, but is a copy of the crawler code created for this project. It will take a single account, get the first level followers, and then grab the 'second-level' followers. Those second level follower are only added if they were nodes in the first level (so we focus on the main account, not other accounts tangentially related)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#import graphlab as gl\n",
    "import pickle\n",
    "import twitter\n",
    "import logging\n",
    "import time\n",
    "from collections import defaultdict\n",
    "\n",
    "\n",
    "### Setup a console and file logger\n",
    "\n",
    "logger = logging.getLogger('crawler')\n",
    "logger.setLevel(logging.DEBUG)\n",
    "fh = logging.FileHandler('crawler.log')\n",
    "fh.setLevel(logging.INFO)\n",
    "ch = logging.StreamHandler()\n",
    "ch.setLevel(logging.INFO)\n",
    "formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')\n",
    "ch.setFormatter(formatter)\n",
    "fh.setFormatter(formatter)\n",
    "logger.addHandler(ch)\n",
    "logger.addHandler(fh)\n",
    "\n",
    "### Setup signals to make sure API calls only take 60s at most\n",
    "\n",
    "from functools import wraps\n",
    "import errno\n",
    "import os\n",
    "import signal\n",
    "\n",
    "class TimeoutError(Exception):\n",
    "    pass\n",
    "\n",
    "def timeout(seconds=60, error_message=os.strerror(errno.ETIME)):\n",
    "    def decorator(func):\n",
    "        def _handle_timeout(signum, frame):\n",
    "            raise TimeoutError(error_message)\n",
    "\n",
    "        def wrapper(*args, **kwargs):\n",
    "            signal.signal(signal.SIGALRM, _handle_timeout)\n",
    "            signal.alarm(seconds)\n",
    "            try:\n",
    "                result = func(*args, **kwargs)\n",
    "            finally:\n",
    "                signal.alarm(0)\n",
    "            return result\n",
    "\n",
    "        return wraps(func)(wrapper)\n",
    "\n",
    "    return decorator\n",
    "\n",
    "@timeout()\n",
    "def getFollowers(api, follower):\n",
    "    ''' Function that will get a user's list of followers from an api object. \n",
    "    NOTE: the decorator ensures that this only runs for 60s at most. '''\n",
    "    # return api.GetFollowerIDs(follower)\n",
    "    return api.GetFriendIDs(follower)\n",
    "\n",
    "\n",
    "### Twitter API\n",
    "\n",
    "# Lets create our list of api OAuth parameters\n",
    "API_TOKENS = [\n",
    "    {\"consumer_key\": 'yp4wi4FASXbsRKa6JxYqzhUlH',\n",
    "    \"consumer_secret\": 'Wkh1d5ygAOp4Bp65syFzHRN4xQsS8O4FvU3zHWosX8NXCqMpcl',\n",
    "    \"access_token_key\": '16562593-F6lRFe7iyoQEahezhPmaI64oInHZD0LNpcIbbq7Wy',\n",
    "    \"access_token_secret\": 'weregYL8n6DI7yZy9pkizIJ78rH2GY02Do9jvpTe7rCey',\n",
    "    \"requests_timeout\": 60},\n",
    "\n",
    "    {\"consumer_key\": 'NsNYFG9LtZV2XMyigPaCKVyVz',\n",
    "    \"consumer_secret\": '4J1vlowybipqXnSrKgLBvmzPmwqx71uHN32noljTgDLS2xQNfI',\n",
    "    \"access_token_key\": '16562593-NCuQWVnpzcnB55w7VLdoCkdobdUQBRDJKjIPXAksP',\n",
    "    \"access_token_secret\": 'nX9OksrYQxj0jBXYJTkUjlX5mZh4rZljfVRXtSM3Tjc8c',\n",
    "    \"requests_timeout\": 60},\n",
    "\n",
    "    {\"consumer_key\": 'ZcAMGe2MUcnTO9ATCIo563SHN',\n",
    "    \"consumer_secret\": 'dJAB7mBfoYyx27Yccbmzz98GtNigAA67Ish9Y1NjN2wNznciM1',\n",
    "    \"access_token_key\": '16562593-AmaoKVLEYL3o8rVUS3b6u4PUbVPTI6BPsyaqCdwxY',\n",
    "    \"access_token_secret\": '8pjYJCFWTErJlb2WSkLwsYNoptVazQQs95JAvIU8JApUA',\n",
    "    \"requests_timeout\": 60},\n",
    "\n",
    "    {\"consumer_key\": 'avZpjObqQN9vue2Y4gu9zIF9X',\n",
    "    \"consumer_secret\": 'Ka6WCj3fyon5yGgf5YJIIl8nVcLcUh5YT99N58qy8qv4kfaMbc',\n",
    "    \"access_token_key\": '16562593-VNuGD09Cr29ZlzNCWnV5MOujU7PsexSwfTgfKQNqC',\n",
    "    \"access_token_secret\": '9P3hB3qDb9zPDFCUhWU16N4CMXPwHacl6HJbCc0EuGj7s',\n",
    "    \"requests_timeout\": 60},\n",
    "\n",
    "    {\"consumer_key\": 'sQ9H5NKteroNZSWvIrkSWvXR0',\n",
    "    \"consumer_secret\": 'lC0ttZKdIZhhJAE1I5RxMxdjpSiADQCVUnHS7LbtfVmI2pz2F2',\n",
    "    \"access_token_key\": '16562593-4LOk7QkXWD0boF01BmZ6NP2oPtHmDZ1OVJ883aANG',\n",
    "    \"access_token_secret\": 'JJ85qMqzVowN1KdQ6w4YlhJB9YF9eWbw6SGbxQoU6gvne',\n",
    "    \"requests_timeout\": 60},\n",
    "\n",
    "    {\"consumer_key\": 'DHppZ2LG3iYj8vEx7ibRRLN35',\n",
    "    \"consumer_secret\": 'wdTQeyp7ZNDN7ne40IriRw7Ah1J8cAi2OIlw4MVtgpq5MMKjYE',\n",
    "    \"access_token_key\": '16562593-WN8zvEWAxVfJPrneMwUjDoVQw0geuLckOOJqFimsC',\n",
    "    \"access_token_secret\": 'ZgVi2onPB3RPGtRmPBs6QXymIMgXwJHUOQycesp64S0Hp',\n",
    "    \"requests_timeout\": 60},\n",
    "\n",
    "    {\"consumer_key\": 'lIgtfdkC2WmN7XAcicrGygQBp',\n",
    "    \"consumer_secret\": '2D9WIJN2MIPwFpMeIGcP6vWjQC8vvy7G5ZlHMSH1F1CsgWGKfz',\n",
    "    \"access_token_key\": '16562593-7lhPpeZNNAGoQQJnqcnTtBiGq1O52XMZ4CMeVqXiY',\n",
    "    \"access_token_secret\": 'WKRBQsr36MMB2EpCcZLr89ik0MSJfPoBORCKu9E1hw96I',\n",
    "    \"requests_timeout\": 60},\n",
    "\n",
    "    {\"consumer_key\": '1XFu2urZzoMoC5sadXAjA7IoQ',\n",
    "    \"consumer_secret\": 'FrJOlHfNLp3M7ejJWiO5k74E9ai6L5EzQJ45HmlsUINbh8qUUi',\n",
    "    \"access_token_key\": '16562593-Texko6g7VyCwhNUfxBDoJKJl4058hpvQkqAYWRKpi',\n",
    "    \"access_token_secret\": 'ISZCTvN6bYJVaJ3Z2iidQObTzE2pxkINBLi0WWe9Ab2Zv',\n",
    "    \"requests_timeout\": 60},\n",
    "\n",
    "    {\"consumer_key\": 'r8Bvdm6I8QrRPuVzP4VtRYpqd',\n",
    "    \"consumer_secret\": 'CzA8u8M8nDiDCCrSzCsXpR3SyTGCaLppDWbdTxSg78ZKgtKkhh',\n",
    "    \"access_token_key\": '16562593-I3l0ZSmfZbMxIQ2NbiiM2eDMA4KNzFmFBeUkWxunR',\n",
    "    \"access_token_secret\": '9HkILP4kSMF0hgvsB126jpoUzsRXETYMlSM0YSKb2yMJH',\n",
    "    \"requests_timeout\": 60},\n",
    "\n",
    "    {\"consumer_key\": 'NmMjfP1Zt3n2VDZ15X7SDGM6G',\n",
    "    \"consumer_secret\": 'j9JBx7HUbMpcDnFteiIAAgHSoA8idlqQ20A1xbvnMrqMrOHQ1n',\n",
    "    \"access_token_key\": '16562593-zUNyMUdO9JnSIstmTrqdyHHmX2lpv9NqkQxGC8faP',\n",
    "    \"access_token_secret\": 'DEeHvLjTXlxNGmqDntXOK0cJCX08cnpg0btoRXWATW3X2',\n",
    "    \"requests_timeout\": 60}\n",
    "]\n",
    "\n",
    "# Now create a list of twitter API objects\n",
    "apis = []\n",
    "for token in API_TOKENS:\n",
    "    apis.append( twitter.Api(consumer_key=token['consumer_key'],\n",
    "                    consumer_secret=token['consumer_secret'],\n",
    "                    access_token_key=token['access_token_key'],\n",
    "                    access_token_secret=token['access_token_secret'],\n",
    "                    requests_timeout=60))\n",
    "\n",
    "\n",
    "# The account id / screen name we want followers from\n",
    "account_screen_name = 'fairmediawatch'\n",
    "account_id = '54679731'\n",
    "\n",
    "# Keep track of nodes connected to account, and all edges we need in the graph\n",
    "nodes = set()\n",
    "edges = defaultdict(set)\n",
    "\n",
    "\n",
    "# Try to load first level followers from pickle;\n",
    "# otherwise, generate them from a single API call and save via pickle\n",
    "try:\n",
    "    logger.info(\"Loading followers for %s\" % account_screen_name)\n",
    "    f = open(\"following1\", \"rb\")\n",
    "    following = pickle.load(f)\n",
    "except Exception as e:\n",
    "    logger.info(\"Failed. Generating followers for %s\" % account_screen_name)\n",
    "    following = api.GetFriendIDs(screen_name=account_screen_name)\n",
    "    pickle.dump(following, open(\"following1\", \"wb\"))\n",
    "\n",
    "# Try to load the nodes and first level edges from pickle;\n",
    "# otherwise generate them from the 'following' list and save\n",
    "try:\n",
    "    logger.info(\"Loading nodes and edges for depth = 1, for %s\" % account_screen_name)\n",
    "    n = open(\"nodes.follow1.set\", \"rb\")\n",
    "    e = open(\"edges.follow1.dict\", \"rb\")\n",
    "    nodes = pickle.load(n)\n",
    "    edges = pickle.load(e)\n",
    "except Exception as e:\n",
    "    logger.info(\"Failed. Generating nodes and edges for depth = 1, for %s\" % account_screen_name)\n",
    "    for follower in following:\n",
    "        nodes.add(follower)\n",
    "        edges[account_id].add(follower)\n",
    "    pickle.dump(nodes, open(\"nodes.follow1.set\", \"wb\"))\n",
    "    pickle.dump(edges, open(\"edges.follow1.dict\", \"wb\"))\n",
    "\n",
    "\n",
    "\n",
    "### Crawling for Depth2\n",
    "\n",
    "\n",
    "# Index the api list, and start from the first api object\n",
    "api_idx = 0\n",
    "api = apis[api_idx]\n",
    "\n",
    "# Some accounts give us issues (either too many followers or no permissions)\n",
    "blacklist= [74323323, 43532023, 19608297, 25757924, 240369959, 173634807, 17008482, 142143804]\n",
    "api_updated = False\n",
    "\n",
    "# It is nice to start from a point in the list, instead of from the beginning\n",
    "starting_point = 142143804\n",
    "if starting_point:\n",
    "    starting_point_idx = following.index(starting_point)\n",
    "    following_iter = range(starting_point_idx, len(following))\n",
    "else:\n",
    "    following_iter = range(len(following))\n",
    "\n",
    "# Try loading second layer of followers from pickle, otherwise start from scratch\n",
    "try:\n",
    "    f = open(\"edges.follow2.dict\", \"rb\")\n",
    "    edges = pickle.load(f)\n",
    "    logger.info(\"Loaded edges.follow2 into memory!\")\n",
    "except Exception as e:\n",
    "    logger.info(\"Starting from SCRATCH: did not load edges.follow2 into memory!\")\n",
    "    pass\n",
    "\n",
    "# For each follower of the main account ...\n",
    "for follower_idx in following_iter:\n",
    "    follower = following[follower_idx]\n",
    "    success = False\n",
    "    \n",
    "    # ... check if they are on the blacklist; if so, skip\n",
    "    if follower in blacklist:\n",
    "        logger.info(\"Skipping due to blacklist\")\n",
    "        continue\n",
    "\n",
    "    # Otherwise, attempt to get list of their followers\n",
    "    followers_depth2_list = []\n",
    "    while not success:\n",
    "        try:\n",
    "            logger.info(\"Getting followers for follower %s\" % follower)\n",
    "            followers_depth2_list = getFollowers(api, follower)\n",
    "            success = True\n",
    "        except TimeoutError as e:\n",
    "            # If api call takes too long, move on\n",
    "            logger.info(\"Timeout after 60s for follower %d\" % follower)\n",
    "            success = True      # technically not a success but setting flag so next loop moves on\n",
    "            continue\n",
    "        except Exception as e:\n",
    "            # IF we get here, then we hit API limits\n",
    "            logger.info(\"API Exception %s; api-idx = %d\" % (str(e), api_idx))\n",
    "            \n",
    "            # Are we at the begining of api list? \n",
    "            # IF so, dump edges so far via pickle and sleep\n",
    "            if api_updated and api_idx % len(API_TOKENS) == 0 and api_idx >= len(API_TOKENS):\n",
    "                logger.info(\"Save edges to pickle file for follower = %s\" % follower)\n",
    "                pickle.dump(edges, open(\"edges.follow2.dict\", \"wb\"))\n",
    "                logger.info(\"Sleeping ...\")\n",
    "                time.sleep(60)\n",
    "                api_updated = False\n",
    "            # Otherwise, move on to the next api object and try again\n",
    "            else:\n",
    "                api_idx += 1\n",
    "                api = apis[api_idx % len(API_TOKENS)]\n",
    "                api_updated = True\n",
    "            \n",
    "    \n",
    "    # After getting the followers, find the intersection of those followers\n",
    "    # with those of the first-level followers and add to edge dict\n",
    "    if followers_depth2_list:\n",
    "        logger.info(\"Adding followers to the graph\")\n",
    "        edges[follower].update(nodes.intersection(followers_depth2_list))\n",
    "\n",
    "\n",
    "# Write out final list of edges via pickle\n",
    "logger.info(\"Save edges to pickle file for follower = %s\" % follower)\n",
    "pickle.dump(edges, open(\"edges.follow2.dict\", \"wb\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instead of running the above, lets just load everything via pickle:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pickle\n",
    "n = open(\"nodes.follow1.set\", \"rb\")\n",
    "nodes = pickle.load(n)\n",
    "\n",
    "e = open(\"edges.follow2.dict\", \"rb\")\n",
    "edges = pickle.load(e)\n",
    "\n",
    "f = open(\"following1\", \"rb\")\n",
    "following = pickle.load(f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Step 2 - Generate Graph from Crawl\n",
    "\n",
    "First, we generate CSV files so we can load data into GraphLab Create."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Hide some silly output\n",
    "import logging\n",
    "logging.getLogger(\"requests\").setLevel(logging.WARNING)\n",
    "logging.getLogger(\"urllib3\").setLevel(logging.WARNING)\n",
    "\n",
    "# Import everything we need\n",
    "import graphlab as gl\n",
    "\n",
    "# Generate CSVs from the previous crawl\n",
    "# TODO\n",
    "f = open('vertices.csv', 'w')\n",
    "f.write('id\\n')\n",
    "for node in nodes:\n",
    "    f.write(str(node) + \"\\n\")\n",
    "f.close()\n",
    "\n",
    "f = open('edges.csv', 'w')\n",
    "f.write('src,dst,relation\\n')\n",
    "for node, followers in edges.iteritems():\n",
    "    for follower in followers:\n",
    "        f.write('%s,%s,%s\\n' % (follower, node, 'follows'))\n",
    "f.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let us use these CSV files and load them into a graph object called g:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[INFO] This non-commercial license of GraphLab Create is assigned to james.quacinella@gmail.comand will expire on January 01, 2038. For commercial licensing options, visit https://dato.com/buy/.\n",
      "\n",
      "[INFO] Start server at: ipc:///tmp/graphlab_server-18863 - Server binary: /usr/local/lib/python2.7/dist-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1437714775.log\n",
      "[INFO] GraphLab Server Version: 1.5.1\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "PROGRESS: Finished parsing file /home/james/Development/Masters/IndependentStudy/Final/vertices.csv\n",
      "PROGRESS: Parsing completed. Parsed 100 lines in 0.024206 secs.\n",
      "------------------------------------------------------\n",
      "Inferred types from first line of file as \n",
      "column_type_hints=[int]\n",
      "If parsing fails due to incorrect types, you can correct\n",
      "the inferred type list above and pass it to read_csv in\n",
      "the column_type_hints argument\n",
      "------------------------------------------------------\n",
      "PROGRESS: Finished parsing file /home/james/Development/Masters/IndependentStudy/Final/vertices.csv\n",
      "PROGRESS: Parsing completed. Parsed 1108 lines in 0.018389 secs.\n",
      "PROGRESS: Finished parsing file /home/james/Development/Masters/IndependentStudy/Final/edges.csv\n",
      "PROGRESS: Parsing completed. Parsed 100 lines in 0.114743 secs.\n",
      "------------------------------------------------------\n",
      "Inferred types from first line of file as \n",
      "column_type_hints=[int,int,str]\n",
      "If parsing fails due to incorrect types, you can correct\n",
      "the inferred type list above and pass it to read_csv in\n",
      "the column_type_hints argument\n",
      "------------------------------------------------------\n",
      "PROGRESS: Finished parsing file /home/james/Development/Masters/IndependentStudy/Final/edges.csv\n",
      "PROGRESS: Parsing completed. Parsed 105006 lines in 0.076969 secs.\n"
     ]
    }
   ],
   "source": [
    "# Load Data\n",
    "gvertices = gl.SFrame.read_csv('vertices.csv')\n",
    "gedges = gl.SFrame.read_csv('edges.csv')\n",
    "\n",
    "# Create graph\n",
    "g = gl.SGraph()\n",
    "g = g.add_vertices(vertices=gvertices, vid_field='id')\n",
    "g = g.add_edges(edges=gedges, src_field='src', dst_field='dst')\n",
    "g = g.add_edges(edges=gedges, src_field='dst', dst_field='src')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets try to visualize the graph!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Canvas is accessible via web browser at the URL: http://localhost:48677/index.html\n",
      "Opening Canvas in default web browser.\n"
     ]
    }
   ],
   "source": [
    "# Visualize graph?\n",
    "gl.canvas.set_target('browser')\n",
    "g.show(vlabel=\"id\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looks like its too large of a graph to display. \n",
    "\n",
    "## Central / Important Nodes\n",
    "\n",
    "Lets use pagerank to find important nodes in the network:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "PROGRESS: Counting out degree\n",
      "PROGRESS: Done counting out degree\n",
      "PROGRESS: +-----------+-----------------------+\n",
      "PROGRESS: | Iteration | L1 change in pagerank |\n",
      "PROGRESS: +-----------+-----------------------+\n",
      "PROGRESS: | 1         | 617.534               |\n",
      "PROGRESS: | 2         | 135.25                |\n",
      "PROGRESS: | 3         | 30.9247               |\n",
      "PROGRESS: | 4         | 8.64859               |\n",
      "PROGRESS: | 5         | 2.52531               |\n",
      "PROGRESS: | 6         | 0.885368              |\n",
      "PROGRESS: | 7         | 0.323184              |\n",
      "PROGRESS: | 8         | 0.126578              |\n",
      "PROGRESS: | 9         | 0.0503135             |\n",
      "PROGRESS: | 10        | 0.0203128             |\n",
      "PROGRESS: | 11        | 0.00825336            |\n",
      "PROGRESS: +-----------+-----------------------+\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\"><table frame=\"box\" rules=\"cols\">\n",
       "    <tr>\n",
       "        <th style=\"padding-left: 1em; padding-right: 1em; text-align: center\">__id</th>\n",
       "        <th style=\"padding-left: 1em; padding-right: 1em; text-align: center\">pagerank</th>\n",
       "        <th style=\"padding-left: 1em; padding-right: 1em; text-align: center\">delta</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">54679731</td>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">7.15893054698</td>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">2.08163053328e-05</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">59159771</td>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">5.73589508502</td>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">5.1434297017e-06</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">169182727</td>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">5.68248985863</td>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">2.4887386834e-05</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">16935292</td>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">4.98957011223</td>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">3.37281975513e-05</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">1947301</td>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">4.39339614539</td>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">1.60673868965e-06</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">23839835</td>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">4.36113011846</td>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">1.93642112549e-05</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">16076032</td>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">4.34163894719</td>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">7.78788532063e-06</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">10117892</td>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">3.96520683672</td>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">4.10425955666e-06</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">16955991</td>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">3.84512060914</td>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">1.25983857791e-05</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">478203018</td>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">3.40106898918</td>\n",
       "        <td style=\"padding-left: 1em; padding-right: 1em; text-align: center; vertical-align: top\">2.55084466252e-05</td>\n",
       "    </tr>\n",
       "</table>\n",
       "[10 rows x 3 columns]<br/>\n",
       "</div>"
      ],
      "text/plain": [
       "Columns:\n",
       "\t__id\tint\n",
       "\tpagerank\tfloat\n",
       "\tdelta\tfloat\n",
       "\n",
       "Rows: 10\n",
       "\n",
       "Data:\n",
       "+-----------+---------------+-------------------+\n",
       "|    __id   |    pagerank   |       delta       |\n",
       "+-----------+---------------+-------------------+\n",
       "|  54679731 | 7.15893054698 | 2.08163053328e-05 |\n",
       "|  59159771 | 5.73589508502 |  5.1434297017e-06 |\n",
       "| 169182727 | 5.68248985863 |  2.4887386834e-05 |\n",
       "|  16935292 | 4.98957011223 | 3.37281975513e-05 |\n",
       "|  1947301  | 4.39339614539 | 1.60673868965e-06 |\n",
       "|  23839835 | 4.36113011846 | 1.93642112549e-05 |\n",
       "|  16076032 | 4.34163894719 | 7.78788532063e-06 |\n",
       "|  10117892 | 3.96520683672 | 4.10425955666e-06 |\n",
       "|  16955991 | 3.84512060914 | 1.25983857791e-05 |\n",
       "| 478203018 | 3.40106898918 | 2.55084466252e-05 |\n",
       "+-----------+---------------+-------------------+\n",
       "[10 rows x 3 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pr = gl.pagerank.create(g)\n",
    "pr.get('pagerank').topk(column_name='pagerank')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import  The Graph to iGraph\n",
    "\n",
    "Next we will load the graph data into igrpah and perform clustering to find communities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 140,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from igraph import *\n",
    "\n",
    "# Create empty graph\n",
    "twitter_graph = Graph(directed=False)\n",
    "\n",
    "# Setup the nodes\n",
    "for node in nodes:\n",
    "    if isinstance(node, int):\n",
    "        twitter_graph.add_vertex(name=str(node))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 142,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Setup the edges\n",
    "for user in edges:\n",
    "    for follower in edges[user]:\n",
    "        try:\n",
    "            twitter_graph.add_edge(str(follower), str(user))\n",
    "        except Exception as e:\n",
    "            print user, follower\n",
    "            print e\n",
    "            break"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 143,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Add the 'ego' edges\n",
    "following = pickle.load(open(\"following1\", \"rb\"))\n",
    "for node in following:\n",
    "    twitter_graph.add_edge(str(node), \"54679731\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 145,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# for v in twitter_graph.vs.select(name_eq=\"54679731\"):\n",
    "#     print v\n",
    "    \n",
    "# for e in twitter_graph.es.select(_source=v.index): print e\n",
    "# for e in twitter_graph.es.select(_target=533): print e"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "igraph.Vertex(<igraph.Graph object at 0x7faa9404f528>,544,{'name': '54679731'})\n"
     ]
    }
   ],
   "source": [
    "# for v in twitter_graph.vs:\n",
    "#     if len(twitter_graph.es.select(_source=v.index)) == 0 and len(twitter_graph.es.select(_target=v.index)) == 0:\n",
    "#         print v"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pickle.dump(twitter_graph, open(\"twitter_graph\", \"wb\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Load twitter grapg to prevent running the above\n",
    "twitter_graph = pickle.load(open(\"twitter_graph\", \"rb\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Display the Graph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "layout = twitter_graph.layout_drl()\n",
    "plt1 = plot(twitter_graph, 'graph.drl.png', layout = layout)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 257,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"https://raw.githubusercontent.com/jquacinella/IndependentStudy/master/Final/graph.drl.png\" width=\"500\" /></td>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 257,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from IPython.display import HTML\n",
    "s = \"\"\"<img src=\"https://raw.githubusercontent.com/jquacinella/IndependentStudy/master/Final/graph.drl.png\" width=\"500\" /></td>\"\"\"\n",
    "h = HTML(s); h"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 214,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<igraph.drawing.Plot at 0x7faa8979c250>"
      ]
     },
     "execution_count": 214,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "layout = twitter_graph.layout(\"graphopt\")\n",
    "plt2 = plot(twitter_graph, 'graph.graphopt.png', layout = layout)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 258,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"https://raw.githubusercontent.com/jquacinella/IndependentStudy/master/Final/graph.graphopt.png\" width=\"500\" /></td>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 258,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = \"\"\"<img src=\"https://raw.githubusercontent.com/jquacinella/IndependentStudy/master/Final/graph.graphopt.png\" width=\"500\" /></td>\"\"\"\n",
    "h = HTML(s); h"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 227,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "layout = twitter_graph.layout(\"lgl\")\n",
    "plt2 = plot(twitter_graph, 'graph.lgl.png', layout = layout)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 260,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"https://raw.githubusercontent.com/jquacinella/IndependentStudy/master/Final/graph.lgl.png\" width=\"500\" /></td>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 260,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = \"\"\"<img src=\"https://raw.githubusercontent.com/jquacinella/IndependentStudy/master/Final/graph.lgl.png\" width=\"500\" /></td>\"\"\"\n",
    "h = HTML(s); h"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets trim down the graph to only large nodes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 237,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# https://lists.nongnu.org/archive/html/igraph-help/2012-11/msg00047.html\n",
    "twitter_graph2 = twitter_graph.copy()\n",
    "nodes = twitter_graph2.vs(_degree_lt=200)\n",
    "twitter_graph2.es.select(_within=nodes).delete()\n",
    "twitter_graph2.vs(_degree_lt=200).delete()\n",
    "layout = twitter_graph2.layout_drl()\n",
    "plt1 = plot(twitter_graph2, 'graph2.drl.png', layout = layout)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 261,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"https://raw.githubusercontent.com/jquacinella/IndependentStudy/master/Final/graph2.drl.png\" width=\"500\" /></td>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 261,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = \"\"\"<img src=\"https://raw.githubusercontent.com/jquacinella/IndependentStudy/master/Final/graph2.drl.png\" width=\"500\" /></td>\"\"\"\n",
    "h = HTML(s); h"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clustering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, lets run the walktrap community algorithm, which seems to produce a heirarchical clustering:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 243,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<igraph.drawing.Plot at 0x7faa74930690>"
      ]
     },
     "execution_count": 243,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wc = twitter_graph.community_walktrap()\n",
    "plot(wc, 'cluster.walktrap.png', bbox=(3000,3000))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 266,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"https://raw.githubusercontent.com/jquacinella/IndependentStudy/master/Final/cluster.walktrap.png\" width=\"900\" /></td>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 266,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = \"\"\"<img src=\"https://raw.githubusercontent.com/jquacinella/IndependentStudy/master/Final/cluster.walktrap.png\" width=\"900\" /></td>\"\"\"\n",
    "h = HTML(s); h"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<igraph.drawing.Plot at 0x7fb94c6fe610>"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "eigen = twitter_graph.community_leading_eigenvector()\n",
    "plot(eigen, 'cluster.eigen.test.png', mark_groups=True, bbox=(5000,5000))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 270,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"https://s3.amazonaws.com/jqmasters/cluster.eigen.jpg\" width=\"900\" /></td>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 270,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = \"\"\"<img src=\"https://s3.amazonaws.com/jqmasters/cluster.eigen.jpg\" width=\"900\" /></td>\"\"\"\n",
    "h = HTML(s); h"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Still messy, lets try doing the same thing but on the smaller graph:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 238,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "eigen2 = twitter_graph2.community_leading_eigenvector()\n",
    "plot(eigen2, 'cluster2.eigen.png', mark_groups=True, bbox=(5000,5000))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 269,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"https://s3.amazonaws.com/jqmasters/cluster2.eigen.jpg\" width=\"900\" /></td>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 269,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = \"\"\"<img src=\"https://s3.amazonaws.com/jqmasters/cluster2.eigen.jpg\" width=\"900\" /></td>\"\"\"\n",
    "h = HTML(s); h"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Bios of Twitter Users\n",
    "\n",
    "Now that we have the graph communities, lets crawl twitter for the bios on each user. I will not reproduce the code, as it is mostly a copy of the above twitter crawl, using a different API call. Check biocrawl.py in the repo. We'll load the data from pickle:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "bios = pickle.load(open(\"bios\", \"rb\"))\n",
    "\n",
    "from collections import defaultdict\n",
    "documents = defaultdict(str)\n",
    "for v_idx, cluster in zip(range(len(eigen.membership)), eigen.membership):\n",
    "    twitter_id = int(twitter_graph.vs[v_idx].attributes()['name'])\n",
    "    if twitter_id in bios:\n",
    "        documents[cluster] += \"\\n%s\" % bios[ twitter_id ]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Important Nodes from PageRank\n",
    "\n",
    "Lets look at the important nodes in the network and their associated bios:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------------+--------------------------------------------------------------+------------+\n",
      "|      Rank     | Bio                                                          | Cluster ID |\n",
      "+---------------+--------------------------------------------------------------+------------+\n",
      "| 4.39339614539 | Instigating progress since 1865                              |     0      |\n",
      "| 3.84512060914 | Providing fearless political journalism and cultural         |     0      |\n",
      "|               | analysis since the dawn of the digital era. We're also at    |            |\n",
      "|               | http://t.co/arAhFcxlTr                                       |            |\n",
      "| 3.30740490311 | Investigative journalist, blogger at Clear it With Sidney at |     0      |\n",
      "|               | the Hillman Foundation and Duly Noted at In These Times. Co- |            |\n",
      "|               | Host of @PointofInquiry. #binders                            |            |\n",
      "|  3.2642239473 | Media Matters for America is the nation's premier            |     0      |\n",
      "|               | progressive media watchdog, research and information center. |            |\n",
      "| 3.24584856758 | AlterNet is a progressive news magazine and online community |     0      |\n",
      "|               | http://t.co/f4LPGamU                                         |            |\n",
      "| 3.05310397834 | Investigative journalism, politics, chart-tastic, and        |     0      |\n",
      "|               | sometimes sarcastic.  We're the nonprofit news organization  |            |\n",
      "|               | that brought you the 47 percent video.                       |            |\n",
      "| 2.63898887362 | The latest political news from The Huffington Post's         |     0      |\n",
      "|               | politics team.                                               |            |\n",
      "| 2.46692594602 | Dad/husband. @CNN. Co-founded @RebuildDream, @GreenForAll,   |     0      |\n",
      "|               | @ColorOfChange, @EllaBakerCenter, @YesWeCode & #cut50.       |            |\n",
      "|               | Wrote: GreenCollar Economy & Rebuild The Dream               |            |\n",
      "| 2.38625057759 | Bloomberg @BW reporter, covering politics/ policy/ labor.    |     0      |\n",
      "|               | Send me your tips (and complaints): jeidelson at bloomberg   |            |\n",
      "|               | dot net [Usual disclaimers]                                  |            |\n",
      "|  2.3499563445 | Adele M. Stan is a columnist at The American Prospect, and   |     0      |\n",
      "|               | editor of Clarion, the newspaper of Professional Staff       |            |\n",
      "|               | Congress/CUNY. She also plays the ukulele.                   |            |\n",
      "| 2.34208284562 | The official Twitter of http://t.co/HJOFeYodXw               |     0      |\n",
      "| 2.25057081559 | The official Twitter of @msnbc's Melissa Harris-Perry.       |     0      |\n",
      "|               | Exploring politics, culture, art, and community beyond the   |            |\n",
      "|               | beltway every Saturday and Sunday, 10a-12p ET.               |            |\n",
      "| 2.19802709748 | labor movement, left politics, superhero comics, rock & roll |     0      |\n",
      "|               | & online organizing.                                         |            |\n",
      "| 2.19673751375 | Host of The Katie Halper Show on WBAI, lefty comedian /      |     0      |\n",
      "|               | blogger at Salon, Vice, The Nation, Feministing, Raw Story,  |            |\n",
      "|               | Alternet, Comedy Central & more / filmmaker                  |            |\n",
      "| 2.17501663699 | Reporter/enviro editor at @HuffPostPol and VP of membership  |     0      |\n",
      "|               | at @sejorg. Fan of gravity, fermentation, and the serial     |            |\n",
      "|               | comma.                                                       |            |\n",
      "| 2.11935441039 | Writer/Producer/Comic. Co creator of The Daily Show. Author: |     0      |\n",
      "|               | Lizz Free Or Die. Exposer of crackpots at                    |            |\n",
      "|               | http://t.co/COQwCZPcmF and http://t.co/WhllLpegoQ            |            |\n",
      "| 2.09854384862 | Analyst, advocate, writer, mom. Feminist. Thinker & doer.    |     0      |\n",
      "|               | Realist & dreamer. Progressive but not predictable. Editor-  |            |\n",
      "|               | in-Chief, RH RealityCheck. UW Madison alum                   |            |\n",
      "| 2.06228546594 | Executive Editor and Columnist, The Nation                   |     0      |\n",
      "| 2.05351480712 | Columnist @GuardianUS | Feminist author | Pasta enthusiast,  |     0      |\n",
      "|               | native NYer | My books http://t.co/Ct9Ck2TTqb | Eat Me       |            |\n",
      "|               | http://t.co/qkg3oItKfw                                       |            |\n",
      "| 2.03231408258 | Author of THE TEACHER WARS. Staff writer @MarshallProj.      |     0      |\n",
      "|               | http://t.co/NA3HloQjwN                                       |            |\n",
      "| 2.03109035963 | Writer. Lawyer. Eater. Formerly @Cosmopolitan senior         |     0      |\n",
      "|               | political writer, @GuardianUS columnist, @Feministe blogger. |            |\n",
      "| 2.02762932794 | Wake Forest University Professor, Director @AJCCenter,       |     0      |\n",
      "|               | Executive Director @WakeEngaged, MSNBC Host of @MHPShow,     |            |\n",
      "|               | contributor to @TheNation & @EssenceMag                      |            |\n",
      "|  2.0121987956 | National Editor @buzzfeednews. PGP: http://t.co/xGj7z2Ljki   |     0      |\n",
      "|               | adam.serwer@buzzfeed.com https://t.co/zl6RcFyMTN             |            |\n",
      "| 1.99455894381 | Senior Editor, @tnr. Host of @IntersectionTNR, a new podcast |     0      |\n",
      "|               | about race, gender, and all the ways we identify.            |            |\n",
      "+---------------+--------------------------------------------------------------+------------+\n",
      "+---------------+--------------------------------------------------------------+------------+\n",
      "|      Rank     | Bio                                                          | Cluster ID |\n",
      "+---------------+--------------------------------------------------------------+------------+\n",
      "| 4.98957011223 | Independent, Daily Global News Hour Anchored by Amy Goodman  |     1      |\n",
      "|               | & Juan González. Stream Live 8am ET http://t.co/SL25z1kZE5.  |            |\n",
      "|               | Support Independent Media - Donate Today                     |            |\n",
      "| 3.96520683672 | Investigative reporting, political commentary, cultural      |     1      |\n",
      "|               | coverage, activism, interviews, poetry, and humor since      |            |\n",
      "|               | 1909.                                                        |            |\n",
      "| 3.04835708177 | YES! Magazine's award-winning journalism reframes the        |     1      |\n",
      "|               | biggest problems of our time in terms of their solutions.    |            |\n",
      "|               | Independent, nonprofit, reader-supported.                    |            |\n",
      "| 2.90741081734 | Drilling beneath the headlines. Follow us for provocative    |     1      |\n",
      "|               | and insightful news, features and analysis.                  |            |\n",
      "| 2.89312248828 | Occupying Wall Street since Sep 17, 2011. Standing with the  |     1      |\n",
      "|               | global #Occupy movement. About our team:                     |            |\n",
      "|               | http://t.co/7SfBMRuTjZ #OWS                                  |            |\n",
      "| 2.66065430813 | Monthly news magazine committed to informing and analyzing   |     1      |\n",
      "|               | movements for social, environmental and economic justice.    |            |\n",
      "|               | Founded in 1976.                                             |            |\n",
      "| 2.47763903414 | Pursuing stories with moral force. Curating your best        |     1      |\n",
      "|               | #muckreads. Tweets by @terryparrisjr + @amzam. Send tips     |            |\n",
      "|               | securely: https://t.co/JWIupK6Wrl                            |            |\n",
      "| 2.36358932648 | Host,The Laura Flanders Show on @GRITtv, radio commentator;  |     1      |\n",
      "|               | author, BUSHWOMEN, BLUE GRIT;  Editor, At the Tea Party...   |            |\n",
      "| 2.18561738802 | Colorlines is a daily news site where race matters,          |     1      |\n",
      "|               | featuring award-winning investigative reporting and news     |            |\n",
      "|               | analysis.                                                    |            |\n",
      "|  2.177018398  | TMC is a North American network of leading independent media |     1      |\n",
      "|               | outlets. Follow us for news & nuance you can't find anywhere |            |\n",
      "|               | else! Tweets by @jgksf + @manolialive                        |            |\n",
      "| 2.06230444459 | Truthout is dedicated to providing independent news &        |     1      |\n",
      "|               | commentary. We hope to inspire the direct action necessary   |            |\n",
      "|               | to save the planet & humanity. Official Tweets.              |            |\n",
      "| 2.00360103986 | progressive, bold, 100% independent, journalism and          |     1      |\n",
      "|               | advocacy, tenth anniversary online                           |            |\n",
      "+---------------+--------------------------------------------------------------+------------+\n",
      "+---------------+--------------------------------------------------------------+------------+\n",
      "|      Rank     | Bio                                                          | Cluster ID |\n",
      "+---------------+--------------------------------------------------------------+------------+\n",
      "| 5.73589508502 | The Nation magazine Editor and Publisher                     |     2      |\n",
      "|  3.325182719  | Washington editor @the_intercept, a First Look Media         |     2      |\n",
      "|               | publication. froomkin@theintercept.com How to leak to me:    |            |\n",
      "|               | http://t.co/md5GQRJby1                                       |            |\n",
      "| 3.05833410529 | Host of All In with Chris Hayes on MSNBC, Weeknights at 8pm. |     2      |\n",
      "|               | Editor at Large at The Nation. Cubs fan.                     |            |\n",
      "| 2.99593940177 | Covering national politics for @washingtonpost. Finishing a  |     2      |\n",
      "|               | book about progressive rock (W.W. Norton).                   |            |\n",
      "|               | daveweigel@gmail.com, 302-507-6806.                          |            |\n",
      "| 2.99589140109 | I see political people...\r",
      "                                   |     2      |\n",
      "| 2.76573594395 | Moving news forward. Editor-In-Chief @JuddLegum              |     2      |\n",
      "| 2.67779981051 | Editor-in-chief, http://t.co/5gESirESRH. Policy analyst at   |     2      |\n",
      "|               | MSNBC. Hater of filibuster. Lover of charts.  Come work      |            |\n",
      "|               | @Voxdotcom! http://t.co/VhALOi3yKC                           |            |\n",
      "|  2.6711824588 | Senior Political Reporter and Politics Managing Editor,      |     2      |\n",
      "|               | HuffPost. aterkel at huffingtonpost dot com Sign up for my   |            |\n",
      "|               | newsletter: https://t.co/tM0sM6PgOR                          |            |\n",
      "| 2.64430685492 | Photojournalist covering media, culture & politics I also    |     2      |\n",
      "|               | write (mostly here lately) Also @tigerbeat on instagram      |            |\n",
      "|               | Email srhodes at gmail                                       |            |\n",
      "| 2.55744476724 | I teach journalism at NYU, direct the Studio 20 program      |     2      |\n",
      "|               | there, critique the press and try to understand digital      |            |\n",
      "|               | logic. I also advise media companies sometimes.              |            |\n",
      "| 2.52993123165 | http://t.co/3NqFkIfQya: the Aggressive Progressives since    |     2      |\n",
      "|               | 2000. 2 million strong and growing. Yes We Will!             |            |\n",
      "| 2.45535873698 | Editor-in-Chief, FiveThirtyEight. Author, The Signal and the |     2      |\n",
      "|               | Noise (http://t.co/9mLliQYI8N). Sports/politics/food geek.   |            |\n",
      "| 2.43432862068 | CNN Anchor and Chief Washington Correspondent. Dissecting my |     2      |\n",
      "|               | tweets with Talmudic meticulousness will result in wrong     |            |\n",
      "|               | conclusions. RTs do not = endorsement.                       |            |\n",
      "| 2.43060658092 | Making sense of what matters, tweets from ‘Moyers & Company’ |     2      |\n",
      "|               | producers and Bill Moyers. Keep track of the corrupting      |            |\n",
      "|               | influence of $ on politics.                                  |            |\n",
      "| 2.39529425002 | A blog about politics, politics, and politics                |     2      |\n",
      "| 2.35927440968 | Author of a dozen books. Film producer.  Ed. of Editor &     |     2      |\n",
      "|               | Publisher and Crawdaddy. Daily blogger. Next book optioned   |            |\n",
      "|               | for Paul Greengrass flick.                                   |            |\n",
      "| 2.34320763082 | DC editor of Mother Jones, MSNBC analyst & author of the new |     2      |\n",
      "|               | book, SHOWDOWN: The Inside Story of How Obama Fought Back    |            |\n",
      "|               | Against Boehner, Cantor & the Tea Party                      |            |\n",
      "| 2.33831887877 | Editor in Chief, @YahooPolitics, a @YahooNews production.    |     2      |\n",
      "|               | Politics, media, breaking.                                   |            |\n",
      "| 2.27046419486 | Breaking news and analysis from the TPM team.                |     2      |\n",
      "| 2.27025295481 | CNN's senior media correspondent and host of @CNNReliable.   |     2      |\n",
      "|               | Formerly @nytimes, @tvnewser and Top of the Morning. Email:  |            |\n",
      "|               | bstelter@gmail.com                                           |            |\n",
      "| 2.25702219683 | Nobel laureate. Op-Ed columnist, @nytopinion. Author, “The   |     2      |\n",
      "|               | Return of Depression Economics,” “The Great Unraveling,”     |            |\n",
      "|               | “The Age of Diminished Expectations” + more.                 |            |\n",
      "| 2.17156961195 | Investigative Journalist. @the_intercept                     |     2      |\n",
      "|               | lee.fang@theintercept.com                                    |            |\n",
      "| 2.16913235691 | NY Times columnist, co-author of       Half the Sky & A Path |     2      |\n",
      "|               | Appears, http://t.co/bcxQaJYCMg Newsletter:                  |            |\n",
      "|               | http://t.co/EYhBhaKPv1                                       |            |\n",
      "| 2.14931459631 | The New Yorker is a weekly magazine with a mix of reporting  |     2      |\n",
      "|               | of politics and culture, humor and cartoons, fiction and     |            |\n",
      "|               | poetry, and reviews and criticism.                           |            |\n",
      "| 2.12878229972 | Your favorite national security reporter's favorite national |     2      |\n",
      "|               | security reporter. Bette's dad.                              |            |\n",
      "|               | spencer.ackerman@theguardian.com Public key:                 |            |\n",
      "|               | http://t.co/hRo2CKhJ6Q                                       |            |\n",
      "| 2.12198501232 | Monitoring the press, tracking the evolving media business & |     2      |\n",
      "|               | encouraging excellence in journalism since 1961.             |            |\n",
      "| 2.09821142588 | Founder of Daily Kos, Co-founder Vox Media                   |     2      |\n",
      "| 2.04895071097 | DFH/blogger/humanoid                                         |     2      |\n",
      "| 2.03705055969 | A little of this, a little of that.                          |     2      |\n",
      "| 2.01207242886 | Website of the Center for Responsive Politics, the most      |     2      |\n",
      "|               | comprehensive, nonpartisan money-in-politics resource        |            |\n",
      "|               | around. Get the must-reads: http://t.co/3722t5iaZH           |            |\n",
      "+---------------+--------------------------------------------------------------+------------+\n",
      "+---------------+--------------------------------------------------------------+------------+\n",
      "|      Rank     | Bio                                                          | Cluster ID |\n",
      "+---------------+--------------------------------------------------------------+------------+\n",
      "| 5.68248985863 | Editor of FAIR's magazine Extra! since 1990. \r",
      "               |     3      |\n",
      "| 4.36113011846 | independent journalist, co founder of @the_intercept PGP     |     3      |\n",
      "|               | key/contact: https://t.co/lnq46VuHN0                         |            |\n",
      "| 4.34163894719 | Journalist with @The_Intercept - author, No Place to Hide -  |     3      |\n",
      "|               | dog/animal fanatic - email/PGP public key                    |            |\n",
      "|               | (https://t.co/uJnK90oulZ)                                    |            |\n",
      "| 3.40106898918 | Doing communications at @ncacensorship. Many years at        |     3      |\n",
      "|               | @fairmediawatch. Will get better at surfing someday.         |            |\n",
      "| 3.37780426034 | Journalist focused on prisons & harsh sentencing. More fun   |     3      |\n",
      "|               | than I sound.                                                |            |\n",
      "| 3.15649462152 | they say I'm polarizing                                      |     3      |\n",
      "| 2.93662754386 | Senior Writer, http://t.co/UX4ClyaE8E Author, The 51 Day     |     3      |\n",
      "|               | War: Ruin and Resistance in Gaza http://t.co/faFdf2BdZ3      |            |\n",
      "| 2.83537931907 | Do @accuracy (news releases), @dcstakeout (questioning       |     3      |\n",
      "|               | politicos), @votepact (left & right pairing up) and          |            |\n",
      "|               | @xposefacts (whistleblowers). Also, artsy.                   |            |\n",
      "| 2.71412498047 | Author of When the World Outlawed War, War Is A Lie and      |     3      |\n",
      "|               | Daybreak: Undoing the Imperial Presidency and Forming a More |            |\n",
      "|               | Perfect Union.                                               |            |\n",
      "| 2.69338887562 | @thinkprogress, @unitedrepublic, and @boldprogressive in my  |     3      |\n",
      "|               | past, @Alternet in my present. Love cats, the South, and     |            |\n",
      "|               | cheesecake                                                   |            |\n",
      "|  2.6779919512 | We open governments.                                         |     3      |\n",
      "| 2.67055757303 | Abundant tweets about civil liberties and national security, |     3      |\n",
      "|               | football, Beer Mecca, and other craic.                       |            |\n",
      "| 2.61628118821 | The Center for Constitutional Rights is dedicated to         |     3      |\n",
      "|               | advancing and protecting the rights guaranteed by the U.S.   |            |\n",
      "|               | Constitution and the UDHR.                                   |            |\n",
      "| 2.56572209168 | @IBTimes Senior Editor, Investigations. Also: Denverite,     |     3      |\n",
      "|               | vegetarian, author, newspaper columnist, real guy            |            |\n",
      "|               | represented by the character on ABC's The Goldbergs          |            |\n",
      "| 2.49664644121 | Journalist and author of THE DIVIDE, GRIFTOPIA and THE GREAT |     3      |\n",
      "|               | DERANGEMENT                                                  |            |\n",
      "|  2.4337795872 | HRW provides timely information about #humanrights crises in |     3      |\n",
      "|               | 90+ countries. Curated by @jimmurphysf & @astroehlein Staff  |            |\n",
      "|               | list: https://t.co/wBw0SILvlQ                                |            |\n",
      "| 2.32695707327 | Co-host of @CitizenRadio. Independent journalist. I've       |     3      |\n",
      "|               | written for places. Author of the new @CitizenRadio book     |            |\n",
      "|               | #NEWSFAIL. Order: http://t.co/OenJarCjeU                     |            |\n",
      "| 2.31465869927 | @AJAM columnist & author, The Passion of Chelsea Manning:    |     3      |\n",
      "|               | The Story behind the Wikileaks Whistleblower.                |            |\n",
      "| 2.30283084457 | Journalist who covers dissent, whistleblowing, secrecy,      |     3      |\n",
      "|               | police, spying, etc. Co-host of Unauthorized Disclosure      |            |\n",
      "|               | (@UnauthorizedDis) podcast. Outside agitator.                |            |\n",
      "| 2.25773320367 | I write think pieces on twitter.                             |     3      |\n",
      "| 2.22078808059 | Independent journalist. Objectivity is bullshit.             |     3      |\n",
      "|  2.162909011  | Communications professional. Amateur expertician. Vellichor  |     3      |\n",
      "|               | sufferer. Tsundoku artist.                                   |            |\n",
      "| 2.14320488526 | Journalist + musician + digger + dad. Author, SPIES FOR      |     3      |\n",
      "|               | HIRE. Covering war + biz @TheNation. Raised in Japan & South |            |\n",
      "|               | Korea. Honorary citizen of Kwangju.                          |            |\n",
      "| 2.13987832419 | Filmmaker ('Fahrenheit 9/11'), author ('Stupid White Men'),  |     3      |\n",
      "|               | citizen ('United States of America').                        |            |\n",
      "| 2.12001844581 | Publisher, @truthout. Opinions my own. I don't hate you, but |     3      |\n",
      "|               | I hate to critique/overrate you.                             |            |\n",
      "| 2.07655045491 | Defending the American soldier imprisoned for revealing the  |     3      |\n",
      "|               | truth about war crimes and illegal foreign policies.         |            |\n",
      "| 2.06283984859 | Investigative reporter @vicenews. FOIA terrorist. Band       |     3      |\n",
      "|               | Tshirt hoarder. Author: News Junkie, a memoir, and The Abu   |            |\n",
      "|               | Zubaydah Diaries. PGP: http://t.co/i2X8fWclVf                |            |\n",
      "| 2.05084546198 | Author, Political Activist, Columnist                        |     3      |\n",
      "| 2.03496812847 | Journalist and radio host specializing in economics &        |     3      |\n",
      "|               | politics                                                     |            |\n",
      "| 2.01309910723 | Executive director, @FreedomofPress. Columnist, @GuardianUS  |     3      |\n",
      "|               | and @CJR. Remote operator, @Drones. [Views here are my own.] |            |\n",
      "| 2.00461660532 | independent journalist. Democracy Now! correspondent. Nation |     3      |\n",
      "|               | Institute fellow.                                            |            |\n",
      "| 2.00000440304 | Radio host for KPFK in L.A., Liberty Radio Network 12-2E.    |     3      |\n",
      "|               | 3,500 interviews since 2003. Married to reporter @larisa_a.  |            |\n",
      "|               | Fan of, but not the lawyer from Harper's.                    |            |\n",
      "| 1.98513659195 | Mondoweiss is a news website devoted to covering American    |     3      |\n",
      "|               | politics & policy in Israel/Palestine & the broader Middle   |            |\n",
      "|               | East. @Mondowitz does most of the tweeting.                  |            |\n",
      "+---------------+--------------------------------------------------------------+------------+\n"
     ]
    }
   ],
   "source": [
    "import prettytable\n",
    "\n",
    "pts = []\n",
    "\n",
    "for cluster in set(eigen.membership):\n",
    "    # Create a pretty table of tweet contents and any expanded urls\n",
    "    pt = prettytable.PrettyTable([\"Rank\", \"Bio\", \"Cluster ID\"])\n",
    "    pt.align[\"Bio\"] = \"l\" # Left align bio\n",
    "    pt.max_width = 60 \n",
    "    pt.padding_width = 1 # One space between column edges and contents (default)\n",
    "    pts.append(pt)\n",
    "\n",
    "# Loop thru top 100 page-rank'ed nodes\n",
    "x = pr.get('pagerank')\n",
    "for row in x.topk(column_name='pagerank', k=100):\n",
    "    if row['__id'] in bios:\n",
    "        vidx = twitter_graph.vs.select(name_eq=str(row['__id']))[0].index\n",
    "        clusterid = eigen.membership[vidx]\n",
    "        pts[clusterid].add_row([row['pagerank'], bios[row['__id']].split(\"\\n\")[0], clusterid ])\n",
    "\n",
    "# Lets see the results!\n",
    "for pt in pts:\n",
    "    print pt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking at the above, I think I can come up with explanations of the clusters just by looking at the important nodes as found by page rank:\n",
    "\n",
    "- Cluster 0: Media personalities\n",
    "  - This is a little tougher to gauge, but cluster 0 is more oriented towards editors and personalities, while the others tend to be organizations or institutional accounts\n",
    "- Cluster 1: Independent media, 'far-left' media outlets\n",
    "  - Amy Goodman (democracy now), muckreads, #Occupy, GritTV, Truthout\n",
    "  - Notice how this is a smaller group than the others\n",
    "- Cluster 2: Mainstream Liberal organizations or media personalities\n",
    " - HuffPost, NY Times, CNN, MSNBC, Bill Moyers, New Yorker\n",
    "- Cluster 3: people / journalists involved with civil liberties, human rights and whistleblowing\n",
    "  - FOIA, PGP, Whistleblower, secrecy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Bigram Analysis\n",
    "\n",
    "Lets do a bigram analysis on the documents that created from a combination from user bio's and their latest tweet, using a custom stop word filter and filtering for tokens of length of 3 or more:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import string\n",
    "\n",
    "import nltk\n",
    "from nltk.collocations import *\n",
    "from nltk.tokenize import RegexpTokenizer\n",
    "from nltk.corpus import stopwords\n",
    "\n",
    "# Custom stopwords, mostly twitter slang and other internet rubbish\n",
    "my_stopwords = stopwords.words('english')\n",
    "my_stopwords = my_stopwords + ['http', 'https', 'bit', 'ly', 'co', 'rt', 'rts', 'com', 'org', 'dot', 'go', 'via', 'follow', 'us', 'follow', 'retweet', 'also', 'run']\n",
    "\n",
    "def preProcess(text):\n",
    "    text = text.lower()\n",
    "    tokenizer = RegexpTokenizer(r'\\w+')\n",
    "    tokens = tokenizer.tokenize(text)\n",
    "    filtered_words = [w for w in tokens if not w in my_stopwords and not w.isdigit() and len(w) > 2]\n",
    "    return \" \".join(filtered_words)\n",
    "\n",
    "def getBigrams(content, threshold=5):\n",
    "    tokens = nltk.wordpunct_tokenize(preProcess(content))\n",
    "    bigram_measures = nltk.collocations.BigramAssocMeasures()\n",
    "    finder = BigramCollocationFinder.from_words(tokens)\n",
    "    finder.apply_freq_filter(threshold)\n",
    "    scored = finder.score_ngrams(bigram_measures.raw_freq)\n",
    "    return sorted([ (bigram, score) for (bigram, score) in scored ], key=lambda t: t[1], reverse=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------------+--------------------------+--------------------------+---------------------------+\n",
      "| Cluster 0                | Cluster 1                | Cluster 2                | Cluster 3                 |\n",
      "+--------------------------+--------------------------+--------------------------+---------------------------+\n",
      "| executive director       | social justice           | new york                 | human rights              |\n",
      "| official twitter         | new york                 | donald trump             | national security         |\n",
      "| senior fellow            | non profit               | york times               | foreign policy            |\n",
      "| social justice           | media democracy          | editor chief             | independent journalist    |\n",
      "| women rights             | sandra bland             | staff writer             | middle east               |\n",
      "| contributing writer      | social change            | white house              | views mine                |\n",
      "| editor large             | award winning            | climate change           | civil liberties           |\n",
      "| editor nation            | center media             | new yorker               | senior editor             |\n",
      "| fast food                | fast food                | washington post          | author new                |\n",
      "| human rights             | public citizen           | climate science          | new book                  |\n",
      "| investigative journalism | public interest          | managing editor          | award winning             |\n",
      "| new york                 | advocacy organization    | news chief               | center economic           |\n",
      "| planned parenthood       | around world             | talk show                | columnist author          |\n",
      "| pop culture              | crime charges            | times columnist          | director center           |\n",
      "| reproductive rights      | director center          | abc news                 | donald trump              |\n",
      "| senior editor            | dylann roof              | climate hawk             | economic policy           |\n",
      "| social change            | economic justice         | daily kos                | editor chief              |\n",
      "| staff writer             | executive director       | editor publisher         | guardian columnist        |\n",
      "| american prospect        | food workers             | fox news                 | investigative journalism  |\n",
      "| american way             | hate crime               | hillary clinton          | iran deal                 |\n",
      "| calling women            | human rights             | huffington post          | journalist the_intercept  |\n",
      "| cartoonist illustrator   | independent media        | husband father           | managing editor           |\n",
      "| committed diversifying   | institute policy         | iran deal                | new york                  |\n",
      "| community journalists    | justice peace            | level jobs               | policy research           |\n",
      "| contributor thenation    | media justice            | lindsey graham           | radio host                |\n",
      "| crime indictment         | news analysis            | media critic             | sandra bland              |\n",
      "| deputy editor            | official twitter         | media reporter           | social justice            |\n",
      "| director culture         | people history           | megyn kelly              | treason charges           |\n",
      "| diversifying world       | policy studies           | new book                 | writer editor             |\n",
      "| doctorow last            | public radio             | news commentary          | account tweets            |\n",
      "| editor thenation         | racial justice           | nytopinion author        | activist photographer     |\n",
      "| essay nation             | wall street              | politics culture         | activist recent           |\n",
      "| feminist majority        | activist author          | politics media           | activist writer           |\n",
      "| food workers             | alec new                 | pulitzer prize           | agenda report             |\n",
      "| great doctorow           | america populist         | views expressed          | american politics         |\n",
      "| hate crime               | american legislative     | washington correspondent | americans knew            |\n",
      "| hillman foundation       | american university      | affairs correspondent    | analysis current          |\n",
      "| huffington post          | author diet              | amp gov                  | angeles times             |\n",
      "| india clarke             | blacklivesmatter protest | analyst author           | apology massincarceration |\n",
      "| indictment dylannroof    | charleston church        | anchor abc               | around world              |\n",
      "| investigative journalist | children health          | anchor chief             | associate editor          |\n",
      "| iran deal                | citizen global           | anchor cnn               | attack free               |\n",
      "| journalists thought      | civil liberties          | animation one            | barrett brown             |\n",
      "| last essay               | civil rights             | anti immigrant           | based institute           |\n",
      "| leaders committed        | community organizer      | associate professor      | bear expect               |\n",
      "| lgbt issues              | community radio          | author new               | bernie sanders            |\n",
      "| long way                 | conspiracy theory        | board member             | bianca jagger             |\n",
      "| looks like               | constitutional rights    | book pulitzer            | bill clinton              |\n",
      "| majority foundation      | contact press            | book reviewer            | black agenda              |\n",
      "| managing editor          | corporate power          | box emmy                 | center constitutional     |\n",
      "| media matters            | create new               | breaking news            | clemencies bill           |\n",
      "| nation zjuklguchz        | criminal justice         | breast cancer            | climate change            |\n",
      "| netroots nation          | cultural critic          | bureau chief             | clinton half              |\n",
      "| new book                 | current events           | business government      | cold war                  |\n",
      "| news organization        | democracy yjbegzsujz     | campaign coverage        | constitutional rights     |\n",
      "| non profit               | digital media            | cell phone               | contact email             |\n",
      "| nonprofit news           | doesn explain            | center author            | contributing editor       |\n",
      "| organization dedicated   | drug policy              | chief foreign            | contributing writer       |\n",
      "| people american          | drug war                 | chief washington         | cops well                 |\n",
      "| political research       | editor progressive       | chief white              | current events            |\n",
      "| power media              | exchange council         | chris hayes              | debunks nuclear           |\n",
      "| public policy            | exec director            | city hall                | deepa kumar               |\n",
      "| race gender              | fast track               | clean energy             | deray sandrabland         |\n",
      "| reporting analysis       | fighting rights          | columnist author         | digital rights            |\n",
      "| research associates      | flanders show            | columnist nytopinion     | discourse providing       |\n",
      "| sidney hillman           | former journalist        | columnist pulitzer       | documented analysis       |\n",
      "| statement thejusticedept | former press             | communications director  | dylann roof               |\n",
      "| thejusticedept hate      | forward thinking         | confinement xj7wtgcdze   | editor author             |\n",
      "| thenation great          | global trade             | contributing editor      | editor jacobin            |\n",
      "| thought leaders          | high school              | contributor cjr          | email pgp                 |\n",
      "| twitter account          | independent nonprofit    | correspondent host       | enemy within              |\n",
      "| united states            | investigative reporting  | coverage far             | events issues             |\n",
      "| vice president           | journalism media         | critics theirandeal      | expect make               |\n",
      "| views expressed          | justice democracy        | daily news               | free press                |\n",
      "| wam nyc                  | justice media            | daily show               | fundraising campaign      |\n",
      "| women media              | justice system           | dancing bug              | global affairs            |\n",
      "| world conversation       | kids need                | democracy tweets         | great work                |\n",
      "| writer editor            | labor love               | dylann roof              | hacking team              |\n",
      "| writer feminist          | last week                | emmy nominated           | half apology              |\n",
      "+--------------------------+--------------------------+--------------------------+---------------------------+\n"
     ]
    }
   ],
   "source": [
    "bigrams = []\n",
    "\n",
    "for clusteridx in documents:\n",
    "    bigrams.append(getBigrams(documents[clusteridx], threshold=2))\n",
    "\n",
    "# Create a pretty table of tweet contents and any expanded urls\n",
    "bigram_pt = prettytable.PrettyTable([\"Cluster 0\", \"Cluster 1\", \"Cluster 2\", \"Cluster 3\"])\n",
    "bigram_pt.align[\"Cluster 0\"] = \"l\" \n",
    "bigram_pt.align[\"Cluster 1\"] = \"l\" \n",
    "bigram_pt.align[\"Cluster 2\"] = \"l\" \n",
    "bigram_pt.align[\"Cluster 3\"] = \"l\"     \n",
    "bigram_pt.max_width = 60 \n",
    "bigram_pt.padding_width = 1 # One space between column edges and contents (default)\n",
    "\n",
    "for idx in range(len(bigrams[0])):\n",
    "    bigram_pt.add_row([\" \".join(bigram[idx][0]) for bigram in bigrams])\n",
    "\n",
    "# Lets see the results!\n",
    "print bigram_pt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Looking through the bigrams for each cluster seems to help support the outline above of the different clusters. To expand on the outline below:\n",
    "\n",
    "- Cluster 0: Media personalities\n",
    "  - executive director, contributing writer , senior fellow, contributing editor\n",
    "- Cluster 1: Independent media, 'far-left' media outlets, senior editor, staff writer\n",
    "  - social justice, media democracy, independent media , media justice, news analysis , digital media, nvestigative reporting\n",
    "- Cluster 2: Mainstream Liberal organizations or media personalities\n",
    "  - new york times, new yorker, washington post, daliy kos, washington correspondent\n",
    "- Cluster 3: people / journalists involved with civil liberties, human rights and whistleblowing\n",
    "  - Bigrams: human rights , national security , foreign policy, civil liberties,  digital rights, pgp(!)\n",
    "\n",
    "### Future Considerations\n",
    "\n",
    "- Using TF-IDF to make sure bigrams or tokens that appear in all documents are weighed less"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}