{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Analyzing Links using SmappDragon\n",
    "by [Leon Yin](twitter.com/leonyin)<br>\n",
    "2018-02-16\n",
    "\n",
    "This Tutorial shows how to \n",
    "1. <a href=\"#tweep\">Download tweets from Twitter using Tweepy</a>,\n",
    "2. <a href=\"#filter\">Filter and parse tweets using SmappDragon</a>,\n",
    "3. <a href=\"#link\">Create a link metadata table using SmappDragon</a>, and\n",
    "4. <a href=\"#fake\">Analyze links from questionable websites using Pandas and the OpenSources.co dataset</a>.\n",
    "\n",
    "View this on [Github](https://github.com/yinleon/smappdragon-tutorials/blob/master/smappdragon-tutorial-link-analysis.ipynb).\n",
    "View this on [NBViewer](https://nbviewer.jupyter.org/github/yinleon/smappdragon-tutorials/blob/master/smappdragon-tutorial-link-analysis.ipynb).\n",
    "Visit my Lab's [website](https://wp.nyu.edu/smapp/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Downloading Tweets with Tweepy <a id='tweep'></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# !pip install requirements.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import json\n",
    "import tweepy\n",
    "from smappdragon import JsonCollection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# fill these in with your Twitter API credentials, I store them as enviornment variables.\n",
    "consumer_key = os.environ.get('TWEEPY_API_KEY')\n",
    "consumer_secret = os.environ.get('TWEEPY_API_SECRET')\n",
    "access_key = os.environ.get('TWEEPY_ACCESS_TOKEN')\n",
    "access_secret = os.environ.get('TWEEPY_TOKEN_SECRET')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "auth = tweepy.OAuthHandler(consumer_key, consumer_secret)\n",
    "auth.set_access_token(access_key, access_secret)\n",
    "api = tweepy.API(auth, retry_count=2, retry_delay=5, \n",
    "                 wait_on_rate_limit=True,\n",
    "                 wait_on_rate_limit_notify=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "screen_name = 'seanhannity'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use the tweepy `Cursor` to hit the Twitter API for up to 3.2K tweets per user."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3230"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "user_tweets= []\n",
    "for tweet in tweepy.Cursor(api.user_timeline, screen_name=screen_name).items():\n",
    "    user_tweets.append(tweet._json)\n",
    "len(user_tweets)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's store this data in a new directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!mkdir ./data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "tweet_file = './data/tweets.json'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "with open(tweet_file, 'w') as f:\n",
    "    for tweet in user_tweets:\n",
    "        f.write(json.dumps(tweet) + '\\n')"
   ]
  },
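  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The file we just wrote holds one JSON object per line. As a minimal sketch, such a file can be read back lazily with the standard library alone. `read_json_lines` is a hypothetical helper written for illustration; it is not part of SmappDragon or Tweepy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "def read_json_lines(path):\n",
    "    '''Yield one parsed record per non-empty line (hypothetical helper).'''\n",
    "    with open(path) as f:\n",
    "        for line in f:\n",
    "            line = line.strip()\n",
    "            if line:\n",
    "                yield json.loads(line)\n",
    "\n",
    "# `tweet_file` is the path defined earlier in this notebook.\n",
    "# Count the records without loading them all into memory.\n",
    "n_records = sum(1 for _ in read_json_lines(tweet_file))\n",
    "n_records"
   ]
  },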
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We could work with this JSON in a variety of ways.<br>\n",
    "At my lab we created a module which works wih JSON records in a `collection` object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "collect = JsonCollection(tweet_file, throw_error=0, verbose=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<smappdragon.collection.json_collection.JsonCollection at 0x10c69a400>"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "collect"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We access the tweets stored in the `collect` the same way for any generator."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<generator object JsonCollection.get_iterator at 0x10bda0308>"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "collect.get_iterator()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What is  generator?\n",
    "A generator is an interator that only keeps track of location.<br>\n",
    "In other words, the entirety of the tweet json is not held in memory.<br>\n",
    "They are created by functions that _yield_ objects, rather than _return_ objects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def simple_generator_function():\n",
    "    yield 1\n",
    "    yield 2\n",
    "    yield 3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "gen = simple_generator_function()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<generator object simple_generator_function at 0x10bda0f68>"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gen"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that this is similar to what is returned from `collect.get_iterator()`.<br>\n",
    "We access the values in a generator by iterating through it.<br>\n",
    "For loops are the easiest way to iterate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n",
      "2\n",
      "3\n"
     ]
    }
   ],
   "source": [
    "for i in gen:\n",
    "    print(i)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice, when a generator is iterated through, it is no longer usable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "for i in gen:\n",
    "    print(i)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When we use the `get_iterator` function, we convert the collection into a generator.\n",
    "Unlike conventional generators, when do use this function, we can contiue to iterate through the object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"created_at\": \"Fri Feb 16 02:47:44 +0000 2018\",\n",
      "  \"id\": 964330423466778624,\n",
      "  \"id_str\": \"964330423466778624\",\n",
      "  \"text\": \"Dr. Daniel Bober and @RealDrGina are next with how you can spot the warning signs that a tragic event might happen #Hannity\",\n",
      "  \"truncated\": false,\n",
      "  \"entities\": {\n",
      "    \"hashtags\": [\n",
      "      {\n",
      "        \"text\": \"Hannity\",\n",
      "        \"indices\": [\n",
      "          115,\n",
      "          123\n",
      "        ]\n",
      "      }\n",
      "    ],\n",
      "    \"symbols\": [],\n",
      "    \"user_mentions\": [\n",
      "      {\n",
      "        \"screen_name\": \"RealDrGina\",\n",
      "        \"name\": \"Gina Gentry Loudon\",\n",
      "        \"id\": 20118767,\n",
      "        \"id_str\": \"20118767\",\n",
      "        \"indices\": [\n",
      "          21,\n",
      "          32\n",
      "        ]\n",
      "      }\n",
      "    ],\n",
      "    \"urls\": []\n",
      "  },\n",
      "  \"source\": \"<a href=\\\"http://twitter.com/download/iphone\\\" rel=\\\"nofollow\\\">Twitter for iPhone</a>\",\n",
      "  \"in_reply_to_status_id\": null,\n",
      "  \"in_reply_to_status_id_str\": null,\n",
      "  \"in_reply_to_user_id\": null,\n",
      "  \"in_reply_to_user_id_str\": null,\n",
      "  \"in_reply_to_screen_name\": null,\n",
      "  \"user\": {\n",
      "    \"id\": 41634520,\n",
      "    \"id_str\": \"41634520\",\n",
      "    \"name\": \"Sean Hannity\",\n",
      "    \"screen_name\": \"seanhannity\",\n",
      "    \"location\": \"NYC\",\n",
      "    \"description\": \"TV Host Fox News Channel 9 PM EST. Nationally Syndicated Radio Host 3-6 PM EST. https://t.co/z23FRgA02S Retweets, Follows NOT endorsements! Due to hackings, no DM\\u2019s!\",\n",
      "    \"url\": \"https://t.co/gEpXK0qpWl\",\n",
      "    \"entities\": {\n",
      "      \"url\": {\n",
      "        \"urls\": [\n",
      "          {\n",
      "            \"url\": \"https://t.co/gEpXK0qpWl\",\n",
      "            \"expanded_url\": \"http://hannity.com\",\n",
      "            \"display_url\": \"hannity.com\",\n",
      "            \"indices\": [\n",
      "              0,\n",
      "              23\n",
      "            ]\n",
      "          }\n",
      "        ]\n",
      "      },\n",
      "      \"description\": {\n",
      "        \"urls\": [\n",
      "          {\n",
      "            \"url\": \"https://t.co/z23FRgA02S\",\n",
      "            \"expanded_url\": \"http://Hannity.com\",\n",
      "            \"display_url\": \"Hannity.com\",\n",
      "            \"indices\": [\n",
      "              80,\n",
      "              103\n",
      "            ]\n",
      "          }\n",
      "        ]\n",
      "      }\n",
      "    },\n",
      "    \"protected\": false,\n",
      "    \"followers_count\": 3426039,\n",
      "    \"friends_count\": 7442,\n",
      "    \"listed_count\": 17916,\n",
      "    \"created_at\": \"Thu May 21 17:41:12 +0000 2009\",\n",
      "    \"favourites_count\": 115,\n",
      "    \"utc_offset\": -18000,\n",
      "    \"time_zone\": \"Eastern Time (US & Canada)\",\n",
      "    \"geo_enabled\": false,\n",
      "    \"verified\": true,\n",
      "    \"statuses_count\": 37332,\n",
      "    \"lang\": \"en\",\n",
      "    \"contributors_enabled\": false,\n",
      "    \"is_translator\": false,\n",
      "    \"is_translation_enabled\": false,\n",
      "    \"profile_background_color\": \"663333\",\n",
      "    \"profile_background_image_url\": \"http://pbs.twimg.com/profile_background_images/378800000111343835/4ed961f1836bf5e9e1ae3de108c38501.jpeg\",\n",
      "    \"profile_background_image_url_https\": \"https://pbs.twimg.com/profile_background_images/378800000111343835/4ed961f1836bf5e9e1ae3de108c38501.jpeg\",\n",
      "    \"profile_background_tile\": false,\n",
      "    \"profile_image_url\": \"http://pbs.twimg.com/profile_images/378800000709183776/6273b31aa1836ac86426478aaa82a597_normal.jpeg\",\n",
      "    \"profile_image_url_https\": \"https://pbs.twimg.com/profile_images/378800000709183776/6273b31aa1836ac86426478aaa82a597_normal.jpeg\",\n",
      "    \"profile_banner_url\": \"https://pbs.twimg.com/profile_banners/41634520/1398970584\",\n",
      "    \"profile_link_color\": \"0084B4\",\n",
      "    \"profile_sidebar_border_color\": \"000000\",\n",
      "    \"profile_sidebar_fill_color\": \"CCCCFF\",\n",
      "    \"profile_text_color\": \"000000\",\n",
      "    \"profile_use_background_image\": true,\n",
      "    \"has_extended_profile\": false,\n",
      "    \"default_profile\": false,\n",
      "    \"default_profile_image\": false,\n",
      "    \"following\": false,\n",
      "    \"follow_request_sent\": false,\n",
      "    \"notifications\": false,\n",
      "    \"translator_type\": \"none\"\n",
      "  },\n",
      "  \"geo\": null,\n",
      "  \"coordinates\": null,\n",
      "  \"place\": null,\n",
      "  \"contributors\": null,\n",
      "  \"is_quote_status\": false,\n",
      "  \"retweet_count\": 184,\n",
      "  \"favorite_count\": 747,\n",
      "  \"favorited\": false,\n",
      "  \"retweeted\": false,\n",
      "  \"lang\": \"en\"\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "for tweet in collect.get_iterator():\n",
    "    print(json.dumps(tweet, indent=2))\n",
    "    break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're breaking only because we don't want to print all the tweets in our json file."
   ]
  },
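  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate why `get_iterator` can be called again and again, here is a minimal sketch of the general pattern: an object that builds a fresh generator on every call. `TinyCollection` is a made-up class for illustration only, and SmappDragon's own implementation may differ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class TinyCollection:\n",
    "    '''A toy stand-in for a collection (hypothetical, for illustration only).'''\n",
    "    def __init__(self, records):\n",
    "        self.records = records\n",
    "\n",
    "    def get_iterator(self):\n",
    "        # Each call starts a brand-new generator from the beginning.\n",
    "        for record in self.records:\n",
    "            yield record\n",
    "\n",
    "toy = TinyCollection([1, 2, 3])\n",
    "first_pass = list(toy.get_iterator())\n",
    "second_pass = list(toy.get_iterator())\n",
    "\n",
    "# Both passes see every record, because neither reuses an exhausted generator.\n",
    "first_pass == second_pass"
   ]
  },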
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Crunching Numbers <a id=\"filter\"></a>\n",
    "We can study the structure of each tweet, and crunch some numbers.<br>\n",
    "For this example let's count who the user is tweeting with?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3229 rows are ok.\n",
      "0 rows are corrupt.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[('SaraCarterDC', 159),\n",
       " ('newtgingrich', 130),\n",
       " ('JaySekulow', 117),\n",
       " ('GreggJarrett', 102),\n",
       " ('GeraldoRivera', 91),\n",
       " ('SebGorka', 90),\n",
       " ('IngrahamAngle', 85),\n",
       " ('POTUS', 82),\n",
       " ('realDonaldTrump', 79),\n",
       " ('seanhannity', 74)]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from collections import Counter\n",
    "\n",
    "counter = Counter()\n",
    "for tweet in collect.get_iterator():\n",
    "    for user in tweet['entities']['user_mentions']:\n",
    "        counter.update([user['screen_name']])\n",
    "    \n",
    "counter.most_common(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also created conditional statements to filter the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def exclude_retweets(tweet):\n",
    "    '''\n",
    "    An example of a filter for a smappcollection.\n",
    "    Either True or False, the input will always be a json record.\n",
    "    '''\n",
    "    if tweet['retweeted'] == True:\n",
    "        return False\n",
    "    return True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<smappdragon.collection.json_collection.JsonCollection at 0x10c69a400>"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "collect.set_custom_filter(exclude_retweets)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3229 rows are ok.\n",
      "0 rows are corrupt.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "3230"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "filtered_tweets = []\n",
    "for tweet in collect.get_iterator():\n",
    "    filtered_tweets.append(tweet)\n",
    "\n",
    "len(filtered_tweets)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can dump the filtered collection to a compressed csv."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3229 rows are ok.\n",
      "0 rows are corrupt.\n"
     ]
    }
   ],
   "source": [
    "filtered_tweet_file = 'tweets_filtered.csv.gz'\n",
    "collect.dump_to_csv(filtered_tweet_file, \n",
    "                    input_fields = ['user.id', 'text', 'created_at'], \n",
    "                    compression = 'gzip')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What are the columns available for the `input_fields` argument?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def get_all_columns(d, key=[]):\n",
    "    '''\n",
    "    A recursive function that traverses json keys.\n",
    "    The values return\n",
    "    '''\n",
    "    if not isinstance(d, dict):\n",
    "        print('.'.join(key))\n",
    "        return\n",
    "    \n",
    "    for k, v in d.items():\n",
    "        key_path = key + [k]\n",
    "        get_all_columns(d[k], key_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "created_at\n",
      "id\n",
      "id_str\n",
      "text\n",
      "truncated\n",
      "entities.hashtags\n",
      "entities.symbols\n",
      "entities.user_mentions\n",
      "entities.urls\n",
      "source\n",
      "in_reply_to_status_id\n",
      "in_reply_to_status_id_str\n",
      "in_reply_to_user_id\n",
      "in_reply_to_user_id_str\n",
      "in_reply_to_screen_name\n",
      "user.id\n",
      "user.id_str\n",
      "user.name\n",
      "user.screen_name\n",
      "user.location\n",
      "user.description\n",
      "user.url\n",
      "user.entities.url.urls\n",
      "user.entities.description.urls\n",
      "user.protected\n",
      "user.followers_count\n",
      "user.friends_count\n",
      "user.listed_count\n",
      "user.created_at\n",
      "user.favourites_count\n",
      "user.utc_offset\n",
      "user.time_zone\n",
      "user.geo_enabled\n",
      "user.verified\n",
      "user.statuses_count\n",
      "user.lang\n",
      "user.contributors_enabled\n",
      "user.is_translator\n",
      "user.is_translation_enabled\n",
      "user.profile_background_color\n",
      "user.profile_background_image_url\n",
      "user.profile_background_image_url_https\n",
      "user.profile_background_tile\n",
      "user.profile_image_url\n",
      "user.profile_image_url_https\n",
      "user.profile_banner_url\n",
      "user.profile_link_color\n",
      "user.profile_sidebar_border_color\n",
      "user.profile_sidebar_fill_color\n",
      "user.profile_text_color\n",
      "user.profile_use_background_image\n",
      "user.has_extended_profile\n",
      "user.default_profile\n",
      "user.default_profile_image\n",
      "user.following\n",
      "user.follow_request_sent\n",
      "user.notifications\n",
      "user.translator_type\n",
      "geo\n",
      "coordinates\n",
      "place\n",
      "contributors\n",
      "is_quote_status\n",
      "retweet_count\n",
      "favorite_count\n",
      "favorited\n",
      "retweeted\n",
      "lang\n"
     ]
    }
   ],
   "source": [
    "get_all_columns(tweet)"
   ]
  },
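  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dotted names printed above mirror the nesting of the JSON record. As a minimal sketch of that correspondence, the hypothetical helper `get_by_path` (not a SmappDragon function) walks a record one key at a time. The toy record is invented for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_by_path(record, path):\n",
    "    '''Look up a dot-delimited path such as 'user.id' in a nested dict.'''\n",
    "    value = record\n",
    "    for key in path.split('.'):\n",
    "        value = value[key]\n",
    "    return value\n",
    "\n",
    "# A toy record with the same shape as the top of a tweet.\n",
    "toy_record = {'text': 'hello', 'user': {'id': 1, 'screen_name': 'example'}}\n",
    "get_by_path(toy_record, 'user.screen_name')"
   ]
  },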
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Link Analysis <a id=\"link\"></a>\n",
    "Let's parse out all the links out of the tweet.<br>\n",
    "We can't just return the value, as there can be multiple links per Tweet.<br>\n",
    "We can solve this by using a generator, and unpacking each using `itertools`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import itertools\n",
    "import requests\n",
    "from urllib.parse import urlparse\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def get_link(tweet):\n",
    "    '''\n",
    "    Returns a generator containing tweet metadata about media.\n",
    "    '''\n",
    "    if not isinstance(tweet, dict):\n",
    "        return\n",
    "        \n",
    "    row = {\n",
    "        'user.id': tweet['user']['id'],\n",
    "        'tweet.id': tweet['id'],\n",
    "        'tweet.created_at': tweet['created_at'],\n",
    "        'tweet.text' : tweet['text']\n",
    "    }\n",
    "\n",
    "    list_urls = tweet['entities']['urls']\n",
    "    \n",
    "    if list_urls:\n",
    "        for url in list_urls:\n",
    "            r = row.copy()\n",
    "            r['link.url_long'] = url.get('expanded_url')\n",
    "            \n",
    "            if r['link.url_long']:\n",
    "                r['link.domain'] = urlparse(r['link.url_long']).netloc.lower().lstrip('www.')\n",
    "                r['link.url_short'] = url.get('url')\n",
    "\n",
    "                yield r  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3229 rows are ok.\n",
      "0 rows are corrupt.\n"
     ]
    }
   ],
   "source": [
    "df_links = pd.DataFrame(\n",
    "    list(\n",
    "        itertools.chain.from_iterable(\n",
    "            [ get_link(tweet) for tweet in collect.get_iterator() if tweet ]\n",
    "        )\n",
    "    )\n",
    ")"
   ]
  },
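  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the flattening step concrete, here is a toy sketch of `itertools.chain.from_iterable` applied to generators that each yield zero or more rows. `toy_rows` and its inputs are invented for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import itertools\n",
    "\n",
    "def toy_rows(n):\n",
    "    '''Yield n small dicts, mimicking a tweet with n links (toy example).'''\n",
    "    for i in range(n):\n",
    "        yield {'tweet': n, 'link': i}\n",
    "\n",
    "# Three toy tweets with 2, 0 and 1 links become one flat list of 3 rows.\n",
    "flat = list(itertools.chain.from_iterable(toy_rows(n) for n in [2, 0, 1]))\n",
    "len(flat)"
   ]
  },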
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>link.domain</th>\n",
       "      <th>link.url_long</th>\n",
       "      <th>link.url_short</th>\n",
       "      <th>tweet.created_at</th>\n",
       "      <th>tweet.id</th>\n",
       "      <th>tweet.text</th>\n",
       "      <th>user.id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>twitter.com</td>\n",
       "      <td>https://twitter.com/i/web/status/9643237186746...</td>\n",
       "      <td>https://t.co/VkG8xp9cph</td>\n",
       "      <td>Fri Feb 16 02:21:05 +0000 2018</td>\n",
       "      <td>964323718674644993</td>\n",
       "      <td>Coming up... President Trump is vowing tough a...</td>\n",
       "      <td>41634520</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>twitter.com</td>\n",
       "      <td>https://twitter.com/i/web/status/9643128779927...</td>\n",
       "      <td>https://t.co/RL57eiZmas</td>\n",
       "      <td>Fri Feb 16 01:38:00 +0000 2018</td>\n",
       "      <td>964312877992734721</td>\n",
       "      <td>Tonight on #Hannity I’m joined by @JudgeJeanin...</td>\n",
       "      <td>41634520</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>hannity.com</td>\n",
       "      <td>https://www.hannity.com/media-room/capitol-rev...</td>\n",
       "      <td>https://t.co/31pzBs5ulS</td>\n",
       "      <td>Thu Feb 15 21:56:52 +0000 2018</td>\n",
       "      <td>964257227900116993</td>\n",
       "      <td>https://t.co/31pzBs5ulS</td>\n",
       "      <td>41634520</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>hannity.com</td>\n",
       "      <td>https://www.hannity.com/media-room/nice-try-th...</td>\n",
       "      <td>https://t.co/FVWAi8hzlH</td>\n",
       "      <td>Thu Feb 15 21:14:18 +0000 2018</td>\n",
       "      <td>964246514561363974</td>\n",
       "      <td>Nice try Joy https://t.co/FVWAi8hzlH</td>\n",
       "      <td>41634520</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>hannity.com</td>\n",
       "      <td>https://www.hannity.com/media-room/red-flag-fb...</td>\n",
       "      <td>https://t.co/BltIZp6vOd</td>\n",
       "      <td>Thu Feb 15 20:18:17 +0000 2018</td>\n",
       "      <td>964232417107173376</td>\n",
       "      <td>WATCH: FBI Agent comments on claims the bureau...</td>\n",
       "      <td>41634520</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   link.domain                                      link.url_long  \\\n",
       "0  twitter.com  https://twitter.com/i/web/status/9643237186746...   \n",
       "1  twitter.com  https://twitter.com/i/web/status/9643128779927...   \n",
       "2  hannity.com  https://www.hannity.com/media-room/capitol-rev...   \n",
       "3  hannity.com  https://www.hannity.com/media-room/nice-try-th...   \n",
       "4  hannity.com  https://www.hannity.com/media-room/red-flag-fb...   \n",
       "\n",
       "            link.url_short                tweet.created_at  \\\n",
       "0  https://t.co/VkG8xp9cph  Fri Feb 16 02:21:05 +0000 2018   \n",
       "1  https://t.co/RL57eiZmas  Fri Feb 16 01:38:00 +0000 2018   \n",
       "2  https://t.co/31pzBs5ulS  Thu Feb 15 21:56:52 +0000 2018   \n",
       "3  https://t.co/FVWAi8hzlH  Thu Feb 15 21:14:18 +0000 2018   \n",
       "4  https://t.co/BltIZp6vOd  Thu Feb 15 20:18:17 +0000 2018   \n",
       "\n",
       "             tweet.id                                         tweet.text  \\\n",
       "0  964323718674644993  Coming up... President Trump is vowing tough a...   \n",
       "1  964312877992734721  Tonight on #Hannity I’m joined by @JudgeJeanin...   \n",
       "2  964257227900116993                            https://t.co/31pzBs5ulS   \n",
       "3  964246514561363974               Nice try Joy https://t.co/FVWAi8hzlH   \n",
       "4  964232417107173376  WATCH: FBI Agent comments on claims the bureau...   \n",
       "\n",
       "    user.id  \n",
       "0  41634520  \n",
       "1  41634520  \n",
       "2  41634520  \n",
       "3  41634520  \n",
       "4  41634520  "
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_links.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# filter out Twitter links\n",
    "df_links = df_links[df_links['link.domain'] != 'twitter.com']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also expand shortened links fron bit.ly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def resolve_shortened_link(link):\n",
    "    '''\n",
    "    Handles link shorteners like bit.ly.\n",
    "    '''\n",
    "    if link['link.domain'] in ['bit.ly']:\n",
    "        r = requests.head(link['link.url_long'], allow_redirects=True)\n",
    "        return r.url\n",
    "    else:\n",
    "        return link['link.domain']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use the `apply()` function on a Pandas dataframe to apply a function to entire rows (`axis=1`) or columns (`axis=2`)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df_links.loc[:, 'link.domain'] = df_links.apply(resolve_shortened_link, axis=1)"
   ]
  },
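  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small sketch of what `axis` means in `apply()`, consider a toy dataframe (the data is invented for illustration). With `axis=1` the function receives one row at a time; with `axis=0`, the default, it receives one column at a time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "toy = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})\n",
    "\n",
    "# axis=1: the function sees each row, so we get one value per row.\n",
    "row_sums = toy.apply(lambda row: row['a'] + row['b'], axis=1)\n",
    "\n",
    "# axis=0: the function sees each column, so we get one value per column.\n",
    "col_sums = toy.apply(lambda col: col.sum(), axis=0)\n",
    "\n",
    "row_sums.tolist(), col_sums.to_dict()"
   ]
  },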
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "hannity.com           302\n",
       "amzn.to                37\n",
       "thehill.com            31\n",
       "mediaequalizer.com     22\n",
       "angelocarusone.com     19\n",
       "youtu.be               14\n",
       "breitbart.com          12\n",
       "mediaite.com           11\n",
       "foxnews.com             9\n",
       "youtube.com             8\n",
       "ashingtonpost.com       7\n",
       "dailycaller.com         6\n",
       "circa.com               6\n",
       "ashingtontimes.com      6\n",
       "google.com              6\n",
       "Name: link.domain, dtype: int64"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_links['link.domain'].value_counts().head(15)"
   ]
  },
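  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A note on cleaning domains: `str.lstrip` removes any leading characters found in its argument rather than a literal prefix, which is how `www.washingtonpost.com` can turn into `ashingtonpost.com`. Here is a minimal sketch of the difference, using a hypothetical helper `strip_www`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def strip_www(domain):\n",
    "    '''Remove a literal leading 'www.' and nothing else (hypothetical helper).'''\n",
    "    if domain.startswith('www.'):\n",
    "        return domain[len('www.'):]\n",
    "    return domain\n",
    "\n",
    "domain = 'www.washingtonpost.com'\n",
    "\n",
    "# lstrip keeps removing 'w' and '.' characters, so it also eats the first\n",
    "# letter of the name: ('ashingtonpost.com', 'washingtonpost.com')\n",
    "domain.lstrip('www.'), strip_www(domain)"
   ]
  },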
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see the most common words associated with each link using a simple count sans-stop words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from nltk.corpus import stopwords"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What does his own site focus on?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('Trump', 19),\n",
       " ('FBI', 13),\n",
       " ('Over', 13),\n",
       " ('GOP', 12),\n",
       " ('The', 11),\n",
       " ('https://t.co/9hkyEX1UVi', 11),\n",
       " ('@realDonaldTrump', 10),\n",
       " ('Tax', 10),\n",
       " ('After', 8),\n",
       " ('Cuts', 8)]"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_count = Counter()\n",
    "for sent in df_links[df_links['link.domain'] == 'hannity.com']['tweet.text'].values:\n",
    "    word_count.update([w for w in sent.split() if w not in stopwords.words('English')])\n",
    "\n",
    "word_count.most_common(10)"
   ]
  },
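  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One caveat with a case-sensitive filter: NLTK's English stop words are lowercase, so capitalized tokens such as 'The' pass through. A minimal sketch of a case-insensitive count follows; the tiny stop word set and the sentences are invented for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "toy_stop_words = {'the', 'over', 'after'}  # stand-in for a real stop word list\n",
    "toy_sentences = ['The FBI memo', 'After the memo']\n",
    "\n",
    "toy_count = Counter()\n",
    "for sent in toy_sentences:\n",
    "    # Compare the lowercased token, but count the token as written.\n",
    "    toy_count.update(w for w in sent.split() if w.lower() not in toy_stop_words)\n",
    "\n",
    "toy_count.most_common(2)"
   ]
  },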
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What about Amazon?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('radio', 26),\n",
       " ('joins', 19),\n",
       " ('new', 19),\n",
       " ('book', 15),\n",
       " ('discuss', 11),\n",
       " ('great', 7),\n",
       " ('talk', 7),\n",
       " ('#Hannity', 7),\n",
       " ('author', 4),\n",
       " ('.@newtgingrich', 4)]"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_count = Counter()\n",
    "for sent in df_links[df_links['link.domain'] == 'amzn.to']['tweet.text']:\n",
    "    word_count.update([w for w in sent.split() if w not in stopwords.words('English')])\n",
    "\n",
    "word_count.most_common(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Questionable Media Domains <a id=\"fake\"></a>\n",
    "We can use the open sources dataset to filter domains on various criteria.<br>\n",
    "Here is a <a href=\"https://nbviewer.jupyter.org/github/yinleon/fake_news/blob/master/opensources-lite.ipynb\">notebook</a> that makes the data machine-readible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "opensources_clean_url = 'https://raw.githubusercontent.com/yinleon/fake_news/master/data/sources_clean.tsv'\n",
    "df_os = pd.read_csv(opensources_clean_url, sep='\\t')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>domain</th>\n",
       "      <th>bias</th>\n",
       "      <th>clickbait</th>\n",
       "      <th>conspiracy</th>\n",
       "      <th>fake</th>\n",
       "      <th>hate</th>\n",
       "      <th>junksci</th>\n",
       "      <th>political</th>\n",
       "      <th>reliable</th>\n",
       "      <th>rumor</th>\n",
       "      <th>satire</th>\n",
       "      <th>state</th>\n",
       "      <th>unreliable</th>\n",
       "      <th>notes</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>100percentfedup.com</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>16wmpo.com</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>http://www.politifact.com/punditfact/article/2...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>21stcenturywire.com</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>24newsflash.com</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>24wpn.com</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>http://www.politifact.com/punditfact/article/2...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                domain  bias  clickbait  conspiracy  fake  hate  junksci  \\\n",
       "0  100percentfedup.com     1          0           0     0     0        0   \n",
       "1           16wmpo.com     0          0           0     1     0        0   \n",
       "2  21stcenturywire.com     0          0           1     0     0        0   \n",
       "3      24newsflash.com     0          0           0     1     0        0   \n",
       "4            24wpn.com     0          0           0     1     0        0   \n",
       "\n",
       "   political  reliable  rumor  satire  state  unreliable  \\\n",
       "0          0         0      0       0      0           0   \n",
       "1          0         0      0       0      0           0   \n",
       "2          0         0      0       0      0           0   \n",
       "3          0         0      0       0      0           0   \n",
       "4          0         0      0       0      0           0   \n",
       "\n",
       "                                               notes  \n",
       "0                                                NaN  \n",
       "1  http://www.politifact.com/punditfact/article/2...  \n",
       "2                                                NaN  \n",
       "3                                                NaN  \n",
       "4  http://www.politifact.com/punditfact/article/2...  "
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_os.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df_questionable = pd.merge(left=df_links, left_on='link.domain',\n",
    "                           right=df_os, right_on='domain', how='inner')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What is the breakdown of links shared from questionable sites?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "breitbart.com                12\n",
       "dailycaller.com               6\n",
       "theblaze.com                  3\n",
       "lifezette.com                 3\n",
       "americanthinker.com           3\n",
       "nationalreview.com            2\n",
       "freebeacon.com                2\n",
       "conservativetribune.com       1\n",
       "thedailybeast.com             1\n",
       "pjmedia.com                   1\n",
       "cnsnews.com                   1\n",
       "ijr.com                       1\n",
       "conservapedia.com             1\n",
       "thegatewaypundit.com          1\n",
       "newsmax.com                   1\n",
       "conservativereview.com        1\n",
       "thefreethoughtproject.com     1\n",
       "Name: link.domain, dtype: int64"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_questionable['link.domain'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can sum the category columns to see the breakdown of questionable links."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['bias',\n",
       " 'clickbait',\n",
       " 'conspiracy',\n",
       " 'fake',\n",
       " 'hate',\n",
       " 'junksci',\n",
       " 'political',\n",
       " 'reliable',\n",
       " 'rumor',\n",
       " 'satire',\n",
       " 'state',\n",
       " 'unreliable']"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# these are the columns we'll base our calculations on.\n",
    "media_classes = [c for c in df_os.columns if c not in ['domain', 'notes']]\n",
    "media_classes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "bias          30\n",
       "clickbait     13\n",
       "conspiracy     3\n",
       "fake           0\n",
       "hate           0\n",
       "junksci        0\n",
       "political     27\n",
       "reliable       0\n",
       "rumor          0\n",
       "satire         0\n",
       "state          0\n",
       "unreliable    18\n",
       "dtype: int64"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breakdown = df_questionable[media_classes].sum(axis=0)\n",
    "breakdown"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x1156b0b38>"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAE0CAYAAAA8O8g/AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHlpJREFUeJzt3Xu8bfW8//HXu5tQqbQlVNslET9dbNVDcZLQzU/8ih+V\nhLNz8ItzIh0nSvKTS665nAhbdUrEEeUkyS/VUfau7MqOkijddpLShfbu/ftjfJc1W63LXNex1ne8\nn4/HfKw5x5hjzc8ca673/I7vGOM7ZJuIiJj7Vmm7gIiImBoJ9IiISiTQIyIqkUCPiKhEAj0iohIJ\n9IiISiTQo2+SXiDpV23XMVGSrpe0S0uvvaGk8yXdLenYcS77F0lP6fO5lvS0iVU5t0n6mqRD266j\nTQn0PknaUdJFkv4s6Q5JF0p6Xgt1POwfVtKRkk6a7te2/VPbm/e87oQDUtL88l7OHDL9JElHTrLU\n2WghcDuwju1Dhs4sYXT0cAvaXsv2ddNd4EwoX04Dtwcl3dfzeN/J/G7bb7D90amqdS5are0C5gJJ\n6wDfB/4JOA1YA3gB8Nc266rE9pJ2sH1h24X0S9JqtleMc7FNgV+6Q2fySRIg2w8OTLO9Vs/864E3\n2/5RC+VVKS30/jwdwPYptlfavs/2D20vHXiCpDdKWibpT5LOlrRpz7xPS7pB0l2Slkh6Qc+8IyWd\nJunrZXP8KkkLJlPsZF6vtLrfJWlp2Rr5hqQ1y7ydJN1Y7p8IbAJ8r7SuDpV0pqT/M6SWpZL2GqXc\njwLDtkwlvUHSBUOm/X0LpbRqPy/pB6WGCyU9XtKnyt/haklbD/m1z5P0yzL/qwPvrfy+PSVdLunO\nsjX2nCHr5T2SlgL3SHpYY0jS8yX9vKy3n0t6/kCdwAHAoaXOcW3VDPOeP1fW9d2SLpb01BGW27F8\nDl6kxicl3VbqWyrp2SMs9xNJH5Z0SXnudyWt3zN/+7J+7pT0C0k7DVn2Q5IuBO4F+uoq6ln+keX9\n3SzpRkkfk7R6mberpGslfUDNVvJ1kvbpWfZUSYf3PN6nvM+7JF0j6cXjqWVOsp3bGDdgHeCPwCJg\nN2C9IfP3Aq4Fnkmz1XM4cFHP/P2Ax5Z5hwC3AGuWeUcC9wO7A6sCHwZ+NkotBp42ZNqRwElT8XrA\n9cAlwBOA9YFlwFvKvJ2AG4c8d5eex68GLu55vGVZb2sM8z7ml/eyFvCHgd8DnAQcWe6/AbhgpPcP\nfI2mG+O5wJrAj4HfAq8v7+1o4Lwh9V4JbFze24XA0WXeNsBtwHZl2QPK8x/Rs+zlZdlHDvN+1gf+\nBOxf1vtry+PH9tR69Ch/1xHnD/Oe7wC2La9zMnDq0OcCLwNuALYt018GLAHWBUTzWd1ohNf7Sfmb\nPBt4NHA65fMFPLH8TXenaRC+pDye17Ps74FnlfpWH+U9P+TzU6Z9FPgpsAGwIfBz4N/KvF2BFTSf\n2TWAXWi+NJ5c5p8KHF7uv6Cs/xeVOjcBnt52lkz3LS30Pti+C9iR5p/lS8BySWdI2rA85SDgw7aX\nudkU/7/AViqtdNsn2f6j7RW2jwUeAWze8xIX2D7L9krgRJogHM2lpXV0p6Q7gcOG1DvZ1/uM7Zts\n3wF8D9iqj9UE8F1gM0mblcf7A9+w/bdRlrkf+BAjtNL78B3bS2zfD3wHuN/218t7+wYwtIV+nO0b\nynv7EE3wAvwj8O+2L3azFbaIpktt+55lP1OWvW+YOvYArrF9YlnvpwBXAy+f4PsazbdtX1I+ayfz\n8L/PPsDxwO62LynTHgDWBp5B0w2yzPbNo7zGibavtH0P8D7g1ZJWpWksnFU+Pw/aPgdYTBPwA75m\n+6qyHh4Y53vbFzjC9u22b6X5XOzfM38F8AHbf3PTVfMjYO9hfs+bgS/aPq/U+Xvbvx5nLXNOAr1P\n5R/gDbafRNNyeQLwqTJ7U+DTPQF7B0
0r6IkAkg5R0x3z5zL/MTQtkAG39Ny/F1hzuE36HtvYXnfg\nBhzTO3MKXm/o/LXog+2/0uxj2E/SKjRheWIfi34J2FDSRMLv1p779w3zeGjtN/Tc/x3N3xGav+Eh\nQ74oN+6ZP3TZoZ5Qfl+v31E+A1NsrL/PO4HTbF8xMMH2j4HjgM8Bt0o6Xs2+oZEMXU+r03yGNgX2\nGbKedgQ2GmHZvkkS8Hgeuh6HrsPl5cu7d37v32jAxsBvJlLHXJZAnwDbV9Ns+g70Qd4AHNQbsrYf\nafsiNf3X76HpjlivBPCfaQJ/ys3w6w23g28RTSvrxcC9tv97zF/StOI+AHyQh9Z5D/CogQeSHj+p\nahsb99zfBLip3L8B+NCQv+GjSkv776WO8ntvogm7XpvQdF3MtH2AvSS9s3ei7c/Yfi5Nd8jTgXeP\n8juGrqcHaLq3bqBpvfeup0fb7m1UTGjHr23TfFn1rseh63CD3v0ePPRv2OsGYNh9CzVLoPdB0jNK\nq/dJ5fHGNK3Pn5WnfBH4V0nPKvMf07OzZm2azcTlwGqS3k/TJz9dZvL1bmXITq8S4A8Cx9Jf63zA\niTRdQ7v2TPsF8CxJW5V/4iMnVW3jbZKeVHbyvZemWwaarYS3SNqu7EB8tKQ9JK3d5+89C3i6pNdJ\nWk3Sa4AtaI6O6teqktbsua0xjmV73UTzhXqwpLcCSHpeeW+r03xR3g+sHOV37CdpC0mPAo4CvlW6\nsU4CXi7pZZIG6t1p4H9jCpwCHCHpsZIeB/xbec0BqwPvk7SGpJ1p+vBPH+b3fBk4SNILJa0iaWNJ\nT5+iGmetBHp/7qbZWXaxpHtogvxKmh2O2P4O8BHgVEl3lXm7lWXPBn4A/Jpm8/B+JrhJ2qeZfL0P\nA4eXTe939Uz/OvA/eOg/4qhKWBxBs3NxYNqvacLkR8A1wAXDLz0u/wH8ELiu3I4ur7WYph/9OJqd\nadfS7JTtt/4/AnvSfCb+CBwK7Gn79nHUdhhNN9HA7cfjWHZoPb+nCfX3SHozzZf6l2je2+9KjR8f\n5VecSLMVegvNDueDy++9AXgFzZfhcprP1ruZuix5P/BL4CqandAX0uwoHXA9TYPlFuArwIEe5hh9\n2z8F3gJ8nmYL9Vxgqr50Zi01WzkRU0fS64GFtndsu5YYP0k/oTmq5ctt19JL0q40O7U7eSZsP9JC\njylVNtHfSnOURUTMoAR6TBlJL6PZDL+VpmsjImZQulwiIiqRFnpERCVmdHCuDTbYwPPnz5/Jl4yI\nmPOWLFlyu+15Yz1vRgN9/vz5LF68eCZfMiJizpM09CzkYaXLJSKiEgn0iIhKJNAjIiqRQI+IqEQC\nPSKiEgn0iIhKjBnoZXjMS8q1A6+S9IEy/clqrmd4jZrrTk50qM+IiJgC/bTQ/wrsbHtLmktd7Spp\ne5rhYj9pezOaITnfNH1lRkTEWMYMdDf+Uh6uXm4Gdga+VaYvorlQckREtKSvM0XLxWGX0FxN/HM0\n1+q7s1ykFuBGRrh2oqSFwEKATTbZZLL1RkT0bf5hZ7ZdAtcfs8eMvVZfO0XLVdC3ornix7bAM4d7\n2gjLHm97ge0F8+aNORRBRERM0LiOcrF9J/ATYHtg3Z4rxT+J4S/UGhERM6Sfo1zmSVq33H8ksAuw\nDDgP2Ls87QDgu9NVZEREjK2fPvSNgEWlH30V4DTb35f0S5qLIh8NXAacMI11RkTEGMYMdNtLga2H\nmX4dTX96RETMAjlTNCKiEgn0iIhKJNAjIiqRQI+IqEQCPSKiEgn0iIhKJNAjIiqRQI+IqEQCPSKi\nEgn0iIhKJNAjIiqRQI+IqEQCPSKiEgn0iIhKJNAjIirR10WiI2Lu6NqFkWNQWugREZVIoEdEVCKB\nHhFRiQR6REQlEugREZVIoEdEVCKBHhFRiQR6REQlEugREZUYM9AlbSzpPEnLJF0l6R1l+pGS/iDp\n8n
LbffrLjYiIkfRz6v8K4BDbl0paG1gi6Zwy75O2Pz595UVERL/GDHTbNwM3l/t3S1oGPHG6C4uI\niPEZVx+6pPnA1sDFZdLbJS2V9BVJ642wzEJJiyUtXr58+aSKjYiIkfUd6JLWAk4H3mn7LuALwFOB\nrWha8McOt5zt420vsL1g3rx5U1ByREQMp69Al7Q6TZifbPvbALZvtb3S9oPAl4Btp6/MiIgYSz9H\nuQg4AVhm+xM90zfqedorgSunvryIiOhXP0e57ADsD1wh6fIy7b3AayVtBRi4HjhoWiqMiIi+9HOU\nywWAhpl11tSXExERE5UzRSMiKpFAj4ioRAI9IqISCfSIiEok0CMiKpFAj4ioRAI9IqISCfSIiEok\n0CMiKpFAj4ioRAI9IqISCfSIiEok0CMiKpFAj4ioRAI9IqISCfSIiEok0CMiKpFAj4ioRAI9IqIS\nCfSIiEok0CMiKpFAj4ioRAI9IqISCfSIiEok0CMiKpFAj4ioxJiBLmljSedJWibpKknvKNPXl3SO\npGvKz/Wmv9yIiBhJPy30FcAhtp8JbA+8TdIWwGHAubY3A84tjyMioiVjBrrtm21fWu7fDSwDngi8\nAlhUnrYI2Gu6ioyIiLGNqw9d0nxga+BiYEPbN0MT+sDjRlhmoaTFkhYvX758ctVGRMSI+g50SWsB\npwPvtH1Xv8vZPt72AtsL5s2bN5EaIyKiD30FuqTVacL8ZNvfLpNvlbRRmb8RcNv0lBgREf3o5ygX\nAScAy2x/omfWGcAB5f4BwHenvryIiOjXan08Zwdgf+AKSZeXae8FjgFOk/Qm4PfAPtNTYkRE9GPM\nQLd9AaARZr94asuJiIiJypmiERGVSKBHRFQigR4RUYkEekREJRLoERGVSKBHRFQigR4RUYkEekRE\nJfo5UzRi1pt/2Jltl8D1x+zRdgnRcWmhR0RUIoEeEVGJBHpERCUS6BERlUigR0RUIoEeEVGJBHpE\nRCUS6BERlUigR0RUIoEeEVGJBHpERCUS6BERlUigR0RUIoEeEVGJBHpERCUS6BERlRgz0CV9RdJt\nkq7smXakpD9Iurzcdp/eMiMiYiz9tNC/Buw6zPRP2t6q3M6a2rIiImK8xgx02+cDd8xALRERMQmT\n6UN/u6SlpUtmvZGeJGmhpMWSFi9fvnwSLxcREaOZaKB/AXgqsBVwM3DsSE+0fbztBbYXzJs3b4Iv\nFxERY5lQoNu+1fZK2w8CXwK2ndqyIiJivCYU6JI26nn4SuDKkZ4bEREzY7WxniDpFGAnYANJNwJH\nADtJ2gowcD1w0DTWGBERfRgz0G2/dpjJJ0xDLRERMQk5UzQiohIJ9IiISiTQIyIqkUCPiKhEAj0i\nohIJ9IiISiTQIyIqkUCPiKjEmCcWxew1/7Az2y6B64/Zo+0SIqJICz0iohIJ9IiISiTQIyIqkUCP\niKhEAj0iohIJ9IiISiTQIyIqkUCPiKhEAj0iohIJ9IiISiTQIyIqkUCPiKhEAj0iohIJ9IiISiTQ\nIyIqkUCPiKhEAj0iohJjBrqkr0i6TdKVPdPWl3SOpGvKz/Wmt8yIiBhLPy30rwG7Dpl2GHCu7c2A\nc8vjiIho0ZiBbvt84I4hk18BLCr3FwF7TXFdERExThPtQ9/Q9s0A5efjRnqipIWSFktavHz58gm+\nXEREjGXad4raPt72AtsL5s2bN90vFxHRWRMN9FslbQRQft42dSVFRMRETDTQzwAOKPcPAL47NeVE\nRMRE9XPY4inAfwObS7pR0puAY4CXSLoGeEl5HBERLVptrCfYfu0Is148xbVERMQk5EzRiIhKJNAj\nIiqRQI+IqEQCPSKiEgn0iIhKJNAjIiqRQI+IqEQCPSKiEgn0iIhKJNAjIiqRQI+IqEQCPSKiEgn0\niIhKJNAjIiqRQI+IqEQCPSKiEgn0iIhKJNAjIiqRQI+IqEQCPSKi
Egn0iIhKJNAjIiqRQI+IqEQC\nPSKiEgn0iIhKrDaZhSVdD9wNrARW2F4wFUVFRMT4TSrQixfZvn0Kfk9ERExCulwiIiox2UA38ENJ\nSyQtHO4JkhZKWixp8fLlyyf5chERMZLJBvoOtrcBdgPeJumFQ59g+3jbC2wvmDdv3iRfLiIiRjKp\nQLd9U/l5G/AdYNupKCoiIsZvwoEu6dGS1h64D7wUuHKqCouIiPGZzFEuGwLfkTTwe/7D9n9NSVUR\nETFuEw5029cBW05hLRERMQk5bDEiohIJ9IiISiTQIyIqkUCPiKhEAj0iohIJ9IiISiTQIyIqkUCP\niKjEVIyHPqPmH3Zm2yVw/TF7tF1CRMTDpIUeEVGJBHpERCUS6BERlUigR0RUIoEeEVGJBHpERCUS\n6BERlUigR0RUIoEeEVGJBHpERCUS6BERlUigR0RUIoEeEVGJBHpERCUS6BERlUigR0RUIoEeEVGJ\nSQW6pF0l/UrStZIOm6qiIiJi/CYc6JJWBT4H7AZsAbxW0hZTVVhERIzPZFro2wLX2r7O9t+AU4FX\nTE1ZERExXrI9sQWlvYFdbb+5PN4f2M7224c8byGwsDzcHPjVxMudEhsAt7dcw2yRdTEo62JQ1sWg\n2bIuNrU9b6wnrTaJF9Aw0x727WD7eOD4SbzOlJK02PaCtuuYDbIuBmVdDMq6GDTX1sVkulxuBDbu\nefwk4KbJlRMRERM1mUD/ObCZpCdLWgP438AZU1NWRESM14S7XGyvkPR24GxgVeArtq+assqmz6zp\n/pkFsi4GZV0MyroYNKfWxYR3ikZExOySM0UjIiqRQI+IqEQCPSKiEgn0jpG0Tz/TImLu6cRO0XI0\nzsm2/9R2LW2TdKntbcaa1gWSBOwLPMX2UZI2AR5v+5KWS5sxkl412nzb356pWmYLSR8FjgbuA/4L\n2BJ4p+2TWi2sD5M5U3QueTzwc0mXAl8BznYXvsl6SNoN2B14oqTP9MxaB1jRTlWt+zzwILAzcBRw\nN3A68Lw2i5phLx9lnoHOBTrwUtuHSnolzQmU+wDnAQn02cD24ZLeB7wUOBA4TtJpwAm2f9NudTPm\nJmAx8D+BJT3T7wb+uZWK2red7W0kXQZg+0/lJLnOsH1g2zXMQquXn7sDp9i+o9mYm/06EegAti3p\nFuAWmhbpesC3JJ1j+9B2q5t+tn8B/ELSyba72iIf6oEyDLQBJM2jabF3kqQ9gGcBaw5Ms31UexW1\n5nuSrqbpcnlr+Vzc33JNfelKH/rBwAE0o6Z9GfhP2w9IWgW4xvZTWy1wBkg6zfarJV3B8IOoPaeF\nslolaV/gNcA2wCJgb+Bw299stbAWSPoi8CjgRTT/I3sDl9h+U6uFtUTSesBdtldKehSwju1b2q5r\nLF0J9KNould+N8y8Z9pe1kJZM0rSRrZvlrTpcPOHWzddIOkZwItpRg89twufheFIWmr7OT0/1wK+\nbfulbdfWBknPprlwT+/Wytfbq6g/XelyOQu4Y+CBpLWBLWxf3JV/YNs3l5+dDO5ektbveXgbcErv\nPNt3PHyp6t1Xft4r6QnAH4Ent1hPayQdAexEE+hn0VyV7QIggT5LfIFms3rAPcNM6wRJ2wOfBZ4J\nrEEzsNo9ttdptbCZtYSm22mkMf2fMrPlzArfl7Qu8DHgUpr18OV2S2rN3jSHKl5m+0BJGzJH1kVX\nAl29hynaflBSV977UMfRDHX8TWAB8Hrgaa1WNMNsd7LlORrbHyx3T5f0fWBN239us6YW3VcyYoWk\ndWi24ubEl3xXzhS9TtLBklYvt3cA17VdVFtsXwusanul7a/S7AjrJEmvkvQJScdK2qvtetoi6W2l\nhY7tvwKrSHpry2W1ZXFZF1+i2Zq7FJgTJ5t1Zafo44DP0JxAYuBcmjO/bmu1sBZIOh/YhWYT8hbg\nZuANtrdstbAWSPo8zdbJQB/6
a4Df2H5be1W1Q9LltrcaMu0y21u3VdNsIGk+zREuS1supS+dCPQY\nVI5yuZWm//yfgccAny+t9k6RdBXw7IHuuHIY6xW2n9VuZTNP0lJgy551sSqwtEvrQtIzbF8tadh9\na7YvnemaxqsT/ciS1gTexMNPmnhja0W1xPbvytmQ82lO6/6V7b+1W1VrfgVsAgwc+bMxMCdaYtPg\nbOC0cjy6gbfQjGPSJf8CLASO5aHnaqg83rmNosajEy10Sd8ErgZeRzNmx77AMtvvaLWwFpSzAb8I\n/Ibmg/pk4CDbP2i1sBkk6Xs0/6CPoRm35ZLyeDvgItu7tFheK8rWyUEMHpP/Q+DLtle2WlgLJD0S\neCuwI83n4qfAF2zP+rNFuxLol9neuuekidVpBuia9d+4U62c0rznQBeLpKcCZ9p+RruVzRxJ/zDa\nfNv/b6ZqidmnjPN0F3BymfRaYF3br26vqv50ossFeKD8vLOcAXYLTZdDF902pL/8OprDsjojgT0o\nQ0IMa/MhBwmcJ+kXrVUzDl0J9OPL2AyHA2cAawHva7ekmdUz7vVVks4CTqP5B94H+HlrhbUoJ1kB\nMNDtuGerVcwul0na3vbPACRtB1zYck19qT7QS9/gXeXiFuczR04QmAa9417fCgx0OyynGXmyi4Y7\nyWqzViuaYQNDQgBvtf2e3nmSPgK85+FL1alnK2V14PWSfl8ebwr8ss3a+tWVPvTzbb+w7Tpmg+HG\nKpH0ZNu/baumtkhabHvBwL6VMu0i289vu7aZNsKVrJZ2qctlpIHrBsyFcZCqb6EX50h6F/ANmnFc\nAOjoIEzfk7Sb7bugGW2SpoX67HbLasW95RDOy8tlx24GHt1yTTNK0j/RHNHxlHIs+oC1mSPdDFNl\nLgT2WLrSQh+u9Wnbnet+KYctHgrsAWxOM4LcvrYvb7WwFpQW2W00m9idPMlK0mNoutw+DBzWM+vu\njjZ45rROBHo8VBmz5FCaVtirbF/TcknREknr2L5ryJDCf5dQn1uqDnRJO9v+8UhXNu/SFc0lfZaH\nHpa2M80hi9cD2D64hbJakUP1Bkn6vu09y1bs0CGFO7kVO5fV3of+D8CPGf7K5l27ovniIY+XDPus\nbsiheoXtPcvPDClcgapb6PFwkh4N3D9wSncZhOkRtu9tt7Jow0gDUQ2YCwNSxaBOBLqkxwJHMDg2\nwwXAUbb/2GphLZD0M2AX238pj9cCftilQ/Uk3c1gV8tAF8NAd4O7dGKRpPNGme0uDo8xl9Xe5TLg\nVJqTiv5XebwvzSGMnRuEieZKNH8ZeGD7L+Wq5p1he+22a5gtbHf24iY16soVi9a3/UHbvy23o4F1\n2y6qJff0bmZLei6DFwjuHEk7Sjqw3N9AUif7ksuVvA6W9K1ye3sZxC7mkK50uXycZqfgaWXS3sCz\nbB/RXlXtkPQ8mi2Wm8qkjYDX2O7cTtJydfcFNIMxPb1c7f6btndoubQZJ+nLNMfjLyqT9gdW2n5z\ne1XFeHUl0O+mOQNwJU0/6SoMnjHaqT5TaFpjNCcVCbja9gNjLFIlSZcDWwOXDlxqrWunuw+Q9Iuh\nlyEcblrMbp3oQ0+f6ajH5G8mqVPH5Pf4m21LGrjsWqdO+x9ipaSn2v4NgKSn0DSAYg7pRKBL2gG4\n3PY9kvYDtgE+Zfv3LZc2k3JM/sOdJunfgXUl/SPwRporvXfRu2nG/b6uPJ4PHNheOTERXelyWQps\nCTwHOBE4geaU91GvXBP1k/QS4KU03U9n2z6n5ZJaUa67ewjNJegAzgE+ORcuuxaDuhLol9reRtL7\ngT/YPmG44UJrJulfRptv+xMzVctsUE6oOruL1w8dzgiXXVvP9j7tVRXj1YkuF+BuSf8K7Ae8sPwz\nd+2QrNH2I9T/rT6E7ZWS7pX0GNt/brueWWDOXnYtBnUl0F8DvA54k+1bJG0CfKzlmmaU7Q8ASF
oE\nvMP2neXxesCxbdbWovuBKySdw0PHye/MQGU95uxl12JQJ7pcYpCkywYO0RttWhdIOmC46bYXDTe9\nZpKW0RzKOnCgwCbAMuBBmkN7O3co51zUiRZ6OVTvI8DjaHZ+dW7Mjh6rSFqvXGOVMg52Jz4HQ3Ux\nuEexa9sFxOR15R/5o8DLbS9ru5BZ4FjgIknfouk7fzXwoXZLakc5nPVImosAr8bgF33nxgCv4fJr\n0ZEuF0kXdvF07pFI2oLmAhcCzrU9J65oPtUkXU1z6bkl9JxE08VROKMOXQn0TwOPB/4T+OvA9I6e\nHRmFpIttb9d2HRFTpSuB/tVhJtv2G2e8mJg1JB0DrEpzlmzvF30u6hBzUicCPWI4PRd36L3YRS7q\nEHNWJ3aKSnoS8FlgBwavWPQO2ze2Wli07SfDTEsLJ+asrlzg4qvAGcATgCcC3yvTotv+0nNbQXPo\n3vw2C4qYjE50uUi63PZWY02LbpP0COAM2y9ru5aIiehKC/12SftJWrXc9gNyaFoM9Sigc8egRz06\n0YdOM871ccAnafpILyJjPXeepCsY7DNfFZgHHNVeRRGT05Uul0XAO4ec7v7xHLbYbZI27Xm4ArjV\n9oq26omYrK600J8zEOYAtu+Q1LnBqOKhcrp71KYrfeirlGFigW4PSBUR9epKqGVAqoioXif60CED\nUkVE/ToT6BERtetKH3pERPUS6BERlUigR0RUIoEeEVGJ/w8sbHKay0OdzQAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x1156b0518>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# we'll filter out the non-represented classes, sort them, and plot it!\n",
    "breakdown[breakdown != 0].sort_values().plot(\n",
    "    kind='bar', title='Sean Hannity Number of Links per Topic'\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}