{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Analyzing Links using SmappDragon\n", "by [Leon Yin](https://twitter.com/leonyin)
\n", "2018-02-16\n", "\n", "This tutorial shows how to:\n", "1. Download tweets from Twitter using Tweepy,\n", "2. Filter and parse tweets using SmappDragon,\n", "3. Create a link metadata table using SmappDragon, and\n", "4. Analyze links from questionable websites using Pandas and the OpenSources.co dataset.\n", "\n", "View this on [GitHub](https://github.com/yinleon/smappdragon-tutorials/blob/master/smappdragon-tutorial-link-analysis.ipynb).\n", "View this on [NBViewer](https://nbviewer.jupyter.org/github/yinleon/smappdragon-tutorials/blob/master/smappdragon-tutorial-link-analysis.ipynb).\n", "Visit my lab's [website](https://wp.nyu.edu/smapp/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Downloading Tweets with Tweepy " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# !pip install -r requirements.txt" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import os\n", "import json\n", "import tweepy\n", "from smappdragon import JsonCollection" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Fill these in with your Twitter API credentials; I store them as environment variables.\n", "consumer_key = os.environ.get('TWEEPY_API_KEY')\n", "consumer_secret = os.environ.get('TWEEPY_API_SECRET')\n", "access_key = os.environ.get('TWEEPY_ACCESS_TOKEN')\n", "access_secret = os.environ.get('TWEEPY_TOKEN_SECRET')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "auth = tweepy.OAuthHandler(consumer_key, consumer_secret)\n", "auth.set_access_token(access_key, access_secret)\n", "api = tweepy.API(auth, retry_count=2, retry_delay=5,\n", "                 wait_on_rate_limit=True,\n", "                 wait_on_rate_limit_notify=True)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": 
[ "screen_name = 'seanhannity'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the tweepy `Cursor` to hit the Twitter API for up to 3.2K tweets per user." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3230" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "user_tweets= []\n", "for tweet in tweepy.Cursor(api.user_timeline, screen_name=screen_name).items():\n", " user_tweets.append(tweet._json)\n", "len(user_tweets)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's store this data in a new directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!mkdir ./data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "tweet_file = './data/tweets.json'" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "with open(tweet_file, 'w') as f:\n", " for tweet in user_tweets:\n", " f.write(json.dumps(tweet) + '\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could work with this JSON in a variety of ways.
\n", "At my lab we created a module which works with JSON records in a `collection` object." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "collect = JsonCollection(tweet_file, throw_error=0, verbose=1)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "collect" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We access the tweets stored in `collect` the same way we would with any generator." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "collect.get_iterator()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is a generator?\n", "A generator is an iterator that only keeps track of its current position.
\n", "In other words, the entirety of the tweet JSON is never held in memory at once.
\n", "They are created by functions that _yield_ objects, rather than _return_ objects." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def simple_generator_function():\n", " yield 1\n", " yield 2\n", " yield 3" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "gen = simple_generator_function()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gen" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that this is similar to what is returned from `collect.get_iterator()`.
\n", "We access the values in a generator by iterating through it.
\n", "For loops are the easiest way to iterate." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n", "2\n", "3\n" ] } ], "source": [ "for i in gen:\n", "    print(i)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that once a generator has been iterated through, it is exhausted: looping over it again produces nothing." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "for i in gen:\n", "    print(i)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we use the `get_iterator` function, we convert the collection into a generator.\n", "Unlike conventional generators, when we use this function, we can continue to iterate through the object." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"created_at\": \"Fri Feb 16 02:47:44 +0000 2018\",\n", " \"id\": 964330423466778624,\n", " \"id_str\": \"964330423466778624\",\n", " \"text\": \"Dr. 
Daniel Bober and @RealDrGina are next with how you can spot the warning signs that a tragic event might happen #Hannity\",\n", " \"truncated\": false,\n", " \"entities\": {\n", " \"hashtags\": [\n", " {\n", " \"text\": \"Hannity\",\n", " \"indices\": [\n", " 115,\n", " 123\n", " ]\n", " }\n", " ],\n", " \"symbols\": [],\n", " \"user_mentions\": [\n", " {\n", " \"screen_name\": \"RealDrGina\",\n", " \"name\": \"Gina Gentry Loudon\",\n", " \"id\": 20118767,\n", " \"id_str\": \"20118767\",\n", " \"indices\": [\n", " 21,\n", " 32\n", " ]\n", " }\n", " ],\n", " \"urls\": []\n", " },\n", " \"source\": \"Twitter for iPhone\",\n", " \"in_reply_to_status_id\": null,\n", " \"in_reply_to_status_id_str\": null,\n", " \"in_reply_to_user_id\": null,\n", " \"in_reply_to_user_id_str\": null,\n", " \"in_reply_to_screen_name\": null,\n", " \"user\": {\n", " \"id\": 41634520,\n", " \"id_str\": \"41634520\",\n", " \"name\": \"Sean Hannity\",\n", " \"screen_name\": \"seanhannity\",\n", " \"location\": \"NYC\",\n", " \"description\": \"TV Host Fox News Channel 9 PM EST. Nationally Syndicated Radio Host 3-6 PM EST. https://t.co/z23FRgA02S Retweets, Follows NOT endorsements! 
Due to hackings, no DM\\u2019s!\",\n", " \"url\": \"https://t.co/gEpXK0qpWl\",\n", " \"entities\": {\n", " \"url\": {\n", " \"urls\": [\n", " {\n", " \"url\": \"https://t.co/gEpXK0qpWl\",\n", " \"expanded_url\": \"http://hannity.com\",\n", " \"display_url\": \"hannity.com\",\n", " \"indices\": [\n", " 0,\n", " 23\n", " ]\n", " }\n", " ]\n", " },\n", " \"description\": {\n", " \"urls\": [\n", " {\n", " \"url\": \"https://t.co/z23FRgA02S\",\n", " \"expanded_url\": \"http://Hannity.com\",\n", " \"display_url\": \"Hannity.com\",\n", " \"indices\": [\n", " 80,\n", " 103\n", " ]\n", " }\n", " ]\n", " }\n", " },\n", " \"protected\": false,\n", " \"followers_count\": 3426039,\n", " \"friends_count\": 7442,\n", " \"listed_count\": 17916,\n", " \"created_at\": \"Thu May 21 17:41:12 +0000 2009\",\n", " \"favourites_count\": 115,\n", " \"utc_offset\": -18000,\n", " \"time_zone\": \"Eastern Time (US & Canada)\",\n", " \"geo_enabled\": false,\n", " \"verified\": true,\n", " \"statuses_count\": 37332,\n", " \"lang\": \"en\",\n", " \"contributors_enabled\": false,\n", " \"is_translator\": false,\n", " \"is_translation_enabled\": false,\n", " \"profile_background_color\": \"663333\",\n", " \"profile_background_image_url\": \"http://pbs.twimg.com/profile_background_images/378800000111343835/4ed961f1836bf5e9e1ae3de108c38501.jpeg\",\n", " \"profile_background_image_url_https\": \"https://pbs.twimg.com/profile_background_images/378800000111343835/4ed961f1836bf5e9e1ae3de108c38501.jpeg\",\n", " \"profile_background_tile\": false,\n", " \"profile_image_url\": \"http://pbs.twimg.com/profile_images/378800000709183776/6273b31aa1836ac86426478aaa82a597_normal.jpeg\",\n", " \"profile_image_url_https\": \"https://pbs.twimg.com/profile_images/378800000709183776/6273b31aa1836ac86426478aaa82a597_normal.jpeg\",\n", " \"profile_banner_url\": \"https://pbs.twimg.com/profile_banners/41634520/1398970584\",\n", " \"profile_link_color\": \"0084B4\",\n", " \"profile_sidebar_border_color\": \"000000\",\n", 
" \"profile_sidebar_fill_color\": \"CCCCFF\",\n", " \"profile_text_color\": \"000000\",\n", " \"profile_use_background_image\": true,\n", " \"has_extended_profile\": false,\n", " \"default_profile\": false,\n", " \"default_profile_image\": false,\n", " \"following\": false,\n", " \"follow_request_sent\": false,\n", " \"notifications\": false,\n", " \"translator_type\": \"none\"\n", " },\n", " \"geo\": null,\n", " \"coordinates\": null,\n", " \"place\": null,\n", " \"contributors\": null,\n", " \"is_quote_status\": false,\n", " \"retweet_count\": 184,\n", " \"favorite_count\": 747,\n", " \"favorited\": false,\n", " \"retweeted\": false,\n", " \"lang\": \"en\"\n", "}\n" ] } ], "source": [ "for tweet in collect.get_iterator():\n", " print(json.dumps(tweet, indent=2))\n", " break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're breaking only because we don't want to print all the tweets in our json file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Crunching Numbers \n", "We can study the structure of each tweet, and crunch some numbers.
\n", "For this example, let's count which accounts the user mentions most often." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3229 rows are ok.\n", "0 rows are corrupt.\n" ] }, { "data": { "text/plain": [ "[('SaraCarterDC', 159),\n", " ('newtgingrich', 130),\n", " ('JaySekulow', 117),\n", " ('GreggJarrett', 102),\n", " ('GeraldoRivera', 91),\n", " ('SebGorka', 90),\n", " ('IngrahamAngle', 85),\n", " ('POTUS', 82),\n", " ('realDonaldTrump', 79),\n", " ('seanhannity', 74)]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from collections import Counter\n", "\n", "counter = Counter()\n", "for tweet in collect.get_iterator():\n", "    for user in tweet['entities']['user_mentions']:\n", "        counter.update([user['screen_name']])\n", "\n", "counter.most_common(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also create conditional statements to filter the data." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def exclude_retweets(tweet):\n", "    '''\n", "    An example of a filter for a smappcollection.\n", "    Returns True to keep a record and False to drop it; the input will always be a JSON record.\n", "    '''\n", "    # Note: `retweeted` reports whether the authenticating user has retweeted this tweet.\n", "    # Retweets in a timeline are marked by the presence of a `retweeted_status` key instead.\n", "    if tweet['retweeted']:\n", "        return False\n", "    return True" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "collect.set_custom_filter(exclude_retweets)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3229 rows are ok.\n", "0 rows are corrupt.\n" ] }, { "data": { "text/plain": [ "3230" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "filtered_tweets = []\n", "for tweet in collect.get_iterator():\n", "    filtered_tweets.append(tweet)\n", "\n", "len(filtered_tweets)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can dump the filtered collection to a compressed CSV." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3229 rows are ok.\n", "0 rows are corrupt.\n" ] } ], "source": [ "filtered_tweet_file = 'tweets_filtered.csv.gz'\n", "collect.dump_to_csv(filtered_tweet_file,\n", "                    input_fields=['user.id', 'text', 'created_at'],\n", "                    compression='gzip')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What are the columns available for the `input_fields` argument?" 
] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_all_columns(d, key=[]):\n", "    '''\n", "    A recursive function that traverses JSON keys.\n", "    Prints the dot-separated path to each value that is not a dictionary.\n", "    '''\n", "    if not isinstance(d, dict):\n", "        print('.'.join(key))\n", "        return\n", "\n", "    for k, v in d.items():\n", "        key_path = key + [k]\n", "        get_all_columns(v, key_path)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "created_at\n", "id\n", "id_str\n", "text\n", "truncated\n", "entities.hashtags\n", "entities.symbols\n", "entities.user_mentions\n", "entities.urls\n", "source\n", "in_reply_to_status_id\n", "in_reply_to_status_id_str\n", "in_reply_to_user_id\n", "in_reply_to_user_id_str\n", "in_reply_to_screen_name\n", "user.id\n", "user.id_str\n", "user.name\n", "user.screen_name\n", "user.location\n", "user.description\n", "user.url\n", "user.entities.url.urls\n", "user.entities.description.urls\n", "user.protected\n", "user.followers_count\n", "user.friends_count\n", "user.listed_count\n", "user.created_at\n", "user.favourites_count\n", "user.utc_offset\n", "user.time_zone\n", "user.geo_enabled\n", "user.verified\n", "user.statuses_count\n", "user.lang\n", "user.contributors_enabled\n", "user.is_translator\n", "user.is_translation_enabled\n", "user.profile_background_color\n", "user.profile_background_image_url\n", "user.profile_background_image_url_https\n", "user.profile_background_tile\n", "user.profile_image_url\n", "user.profile_image_url_https\n", "user.profile_banner_url\n", "user.profile_link_color\n", "user.profile_sidebar_border_color\n", "user.profile_sidebar_fill_color\n", "user.profile_text_color\n", "user.profile_use_background_image\n", "user.has_extended_profile\n", "user.default_profile\n", "user.default_profile_image\n", "user.following\n", "user.follow_request_sent\n", "user.notifications\n", 
"user.translator_type\n", "geo\n", "coordinates\n", "place\n", "contributors\n", "is_quote_status\n", "retweet_count\n", "favorite_count\n", "favorited\n", "retweeted\n", "lang\n" ] } ], "source": [ "get_all_columns(tweet)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Link Analysis \n", "Let's parse out all the links out of the tweet.
\n", "We can't just return the value, as there can be multiple links per Tweet.
\n", "We can solve this by using a generator and flattening the results with `itertools`." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import itertools\n", "import requests\n", "from urllib.parse import urlparse\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_link(tweet):\n", "    '''\n", "    Yields one row of tweet metadata for each link in the tweet.\n", "    '''\n", "    if not isinstance(tweet, dict):\n", "        return\n", "\n", "    row = {\n", "        'user.id': tweet['user']['id'],\n", "        'tweet.id': tweet['id'],\n", "        'tweet.created_at': tweet['created_at'],\n", "        'tweet.text': tweet['text']\n", "    }\n", "\n", "    list_urls = tweet['entities']['urls']\n", "\n", "    if list_urls:\n", "        for url in list_urls:\n", "            r = row.copy()\n", "            r['link.url_long'] = url.get('expanded_url')\n", "\n", "            if r['link.url_long']:\n", "                domain = urlparse(r['link.url_long']).netloc.lower()\n", "                # Remove a leading 'www.' prefix; str.lstrip('www.') strips characters, not a prefix.\n", "                r['link.domain'] = domain[4:] if domain.startswith('www.') else domain\n", "                r['link.url_short'] = url.get('url')\n", "\n", "                yield r" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3229 rows are ok.\n", "0 rows are corrupt.\n" ] } ], "source": [ "df_links = pd.DataFrame(\n", "    list(\n", "        itertools.chain.from_iterable(\n", "            [ get_link(tweet) for tweet in collect.get_iterator() if tweet ]\n", "        )\n", "    )\n", ")" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>link.domain</th><th>link.url_long</th><th>link.url_short</th><th>tweet.created_at</th><th>tweet.id</th><th>tweet.text</th><th>user.id</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>twitter.com</td><td>https://twitter.com/i/web/status/9643237186746...</td><td>https://t.co/VkG8xp9cph</td><td>Fri Feb 16 02:21:05 +0000 2018</td><td>964323718674644993</td><td>Coming up... President Trump is vowing tough a...</td><td>41634520</td></tr>\n",
"    <tr><th>1</th><td>twitter.com</td><td>https://twitter.com/i/web/status/9643128779927...</td><td>https://t.co/RL57eiZmas</td><td>Fri Feb 16 01:38:00 +0000 2018</td><td>964312877992734721</td><td>Tonight on #Hannity I’m joined by @JudgeJeanin...</td><td>41634520</td></tr>\n",
"    <tr><th>2</th><td>hannity.com</td><td>https://www.hannity.com/media-room/capitol-rev...</td><td>https://t.co/31pzBs5ulS</td><td>Thu Feb 15 21:56:52 +0000 2018</td><td>964257227900116993</td><td>https://t.co/31pzBs5ulS</td><td>41634520</td></tr>\n",
"    <tr><th>3</th><td>hannity.com</td><td>https://www.hannity.com/media-room/nice-try-th...</td><td>https://t.co/FVWAi8hzlH</td><td>Thu Feb 15 21:14:18 +0000 2018</td><td>964246514561363974</td><td>Nice try Joy https://t.co/FVWAi8hzlH</td><td>41634520</td></tr>\n",
"    <tr><th>4</th><td>hannity.com</td><td>https://www.hannity.com/media-room/red-flag-fb...</td><td>https://t.co/BltIZp6vOd</td><td>Thu Feb 15 20:18:17 +0000 2018</td><td>964232417107173376</td><td>WATCH: FBI Agent comments on claims the bureau...</td><td>41634520</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " link.domain link.url_long \\\n", "0 twitter.com https://twitter.com/i/web/status/9643237186746... \n", "1 twitter.com https://twitter.com/i/web/status/9643128779927... \n", "2 hannity.com https://www.hannity.com/media-room/capitol-rev... \n", "3 hannity.com https://www.hannity.com/media-room/nice-try-th... \n", "4 hannity.com https://www.hannity.com/media-room/red-flag-fb... \n", "\n", " link.url_short tweet.created_at \\\n", "0 https://t.co/VkG8xp9cph Fri Feb 16 02:21:05 +0000 2018 \n", "1 https://t.co/RL57eiZmas Fri Feb 16 01:38:00 +0000 2018 \n", "2 https://t.co/31pzBs5ulS Thu Feb 15 21:56:52 +0000 2018 \n", "3 https://t.co/FVWAi8hzlH Thu Feb 15 21:14:18 +0000 2018 \n", "4 https://t.co/BltIZp6vOd Thu Feb 15 20:18:17 +0000 2018 \n", "\n", " tweet.id tweet.text \\\n", "0 964323718674644993 Coming up... President Trump is vowing tough a... \n", "1 964312877992734721 Tonight on #Hannity I’m joined by @JudgeJeanin... \n", "2 964257227900116993 https://t.co/31pzBs5ulS \n", "3 964246514561363974 Nice try Joy https://t.co/FVWAi8hzlH \n", "4 964232417107173376 WATCH: FBI Agent comments on claims the bureau... 
\n", "\n", " user.id \n", "0 41634520 \n", "1 41634520 \n", "2 41634520 \n", "3 41634520 \n", "4 41634520 " ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_links.head()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# filter out Twitter links\n", "df_links = df_links[df_links['link.domain'] != 'twitter.com']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also expand shortened links from bit.ly." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def resolve_shortened_link(link):\n", "    '''\n", "    Handles link shorteners like bit.ly.\n", "    Returns the domain that the shortened link resolves to.\n", "    '''\n", "    if link['link.domain'] in ['bit.ly']:\n", "        r = requests.head(link['link.url_long'], allow_redirects=True)\n", "        # Return the domain of the resolved URL, since the result is stored in 'link.domain'.\n", "        domain = urlparse(r.url).netloc.lower()\n", "        return domain[4:] if domain.startswith('www.') else domain\n", "    else:\n", "        return link['link.domain']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the `apply()` function on a Pandas dataframe to apply a function to each row (`axis=1`) or each column (`axis=0`)." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df_links.loc[:, 'link.domain'] = df_links.apply(resolve_shortened_link, axis=1)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "hannity.com 302\n", "amzn.to 37\n", "thehill.com 31\n", "mediaequalizer.com 22\n", "angelocarusone.com 19\n", "youtu.be 14\n", "breitbart.com 12\n", "mediaite.com 11\n", "foxnews.com 9\n", "youtube.com 8\n", "ashingtonpost.com 7\n", "dailycaller.com 6\n", "circa.com 6\n", "ashingtontimes.com 6\n", "google.com 6\n", "Name: link.domain, dtype: int64" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_links['link.domain'].value_counts().head(15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see the most common words 
associated with each link using a simple word count that excludes stop words." ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from nltk.corpus import stopwords" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What does his own site focus on?" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Trump', 19),\n", " ('FBI', 13),\n", " ('Over', 13),\n", " ('GOP', 12),\n", " ('The', 11),\n", " ('https://t.co/9hkyEX1UVi', 11),\n", " ('@realDonaldTrump', 10),\n", " ('Tax', 10),\n", " ('After', 8),\n", " ('Cuts', 8)]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "word_count = Counter()\n", "for sent in df_links[df_links['link.domain'] == 'hannity.com']['tweet.text'].values:\n", "    word_count.update([w for w in sent.split() if w not in stopwords.words('english')])\n", "\n", "word_count.most_common(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What about Amazon?" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('radio', 26),\n", " ('joins', 19),\n", " ('new', 19),\n", " ('book', 15),\n", " ('discuss', 11),\n", " ('great', 7),\n", " ('talk', 7),\n", " ('#Hannity', 7),\n", " ('author', 4),\n", " ('.@newtgingrich', 4)]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "word_count = Counter()\n", "for sent in df_links[df_links['link.domain'] == 'amzn.to']['tweet.text']:\n", "    word_count.update([w for w in sent.split() if w not in stopwords.words('english')])\n", "\n", "word_count.most_common(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Questionable Media Domains \n", "We can use the OpenSources dataset to filter domains on various criteria.
\n", "Here is a notebook that makes the data machine-readable." ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": true }, "outputs": [], "source": [ "opensources_clean_url = 'https://raw.githubusercontent.com/yinleon/fake_news/master/data/sources_clean.tsv'\n", "df_os = pd.read_csv(opensources_clean_url, sep='\\t')" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>domain</th><th>bias</th><th>clickbait</th><th>conspiracy</th><th>fake</th><th>hate</th><th>junksci</th><th>political</th><th>reliable</th><th>rumor</th><th>satire</th><th>state</th><th>unreliable</th><th>notes</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>100percentfedup.com</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>NaN</td></tr>\n",
"    <tr><th>1</th><td>16wmpo.com</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>http://www.politifact.com/punditfact/article/2...</td></tr>\n",
"    <tr><th>2</th><td>21stcenturywire.com</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>NaN</td></tr>\n",
"    <tr><th>3</th><td>24newsflash.com</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>NaN</td></tr>\n",
"    <tr><th>4</th><td>24wpn.com</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>http://www.politifact.com/punditfact/article/2...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " domain bias clickbait conspiracy fake hate junksci \\\n", "0 100percentfedup.com 1 0 0 0 0 0 \n", "1 16wmpo.com 0 0 0 1 0 0 \n", "2 21stcenturywire.com 0 0 1 0 0 0 \n", "3 24newsflash.com 0 0 0 1 0 0 \n", "4 24wpn.com 0 0 0 1 0 0 \n", "\n", " political reliable rumor satire state unreliable \\\n", "0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 \n", "\n", " notes \n", "0 NaN \n", "1 http://www.politifact.com/punditfact/article/2... \n", "2 NaN \n", "3 NaN \n", "4 http://www.politifact.com/punditfact/article/2... " ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_os.head()" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df_questionable = pd.merge(left= df_links, left_on= 'link.domain', \n", " right= df_os, right_on= 'domain', how= 'inner')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the breakdown of links shared from questionable sites?" 
] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "breitbart.com 12\n", "dailycaller.com 6\n", "theblaze.com 3\n", "lifezette.com 3\n", "americanthinker.com 3\n", "nationalreview.com 2\n", "freebeacon.com 2\n", "conservativetribune.com 1\n", "thedailybeast.com 1\n", "pjmedia.com 1\n", "cnsnews.com 1\n", "ijr.com 1\n", "conservapedia.com 1\n", "thegatewaypundit.com 1\n", "newsmax.com 1\n", "conservativereview.com 1\n", "thefreethoughtproject.com 1\n", "Name: link.domain, dtype: int64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_questionable['link.domain'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can do some simple matrix math to see the breakdown of questionable links." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['bias',\n", " 'clickbait',\n", " 'conspiracy',\n", " 'fake',\n", " 'hate',\n", " 'junksci',\n", " 'political',\n", " 'reliable',\n", " 'rumor',\n", " 'satire',\n", " 'state',\n", " 'unreliable']" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# These are the columns we'll base our calculations on.\n", "media_classes = [c for c in df_os.columns if c not in ['domain', 'notes']]\n", "media_classes" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bias 30\n", "clickbait 13\n", "conspiracy 3\n", "fake 0\n", "hate 0\n", "junksci 0\n", "political 27\n", "reliable 0\n", "rumor 0\n", "satire 0\n", "state 0\n", "unreliable 18\n", "dtype: int64" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "breakdown = df_questionable[media_classes].sum(axis=0)\n", "breakdown" ] }, { 
"cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAE0CAYAAAA8O8g/AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHlpJREFUeJzt3Xu8bfW8//HXu5tQqbQlVNslET9dbNVDcZLQzU/8ih+V\nhLNz8ItzIh0nSvKTS665nAhbdUrEEeUkyS/VUfau7MqOkijddpLShfbu/ftjfJc1W63LXNex1ne8\nn4/HfKw5x5hjzc8ca673/I7vGOM7ZJuIiJj7Vmm7gIiImBoJ9IiISiTQIyIqkUCPiKhEAj0iohIJ\n9IiISiTQo2+SXiDpV23XMVGSrpe0S0uvvaGk8yXdLenYcS77F0lP6fO5lvS0iVU5t0n6mqRD266j\nTQn0PknaUdJFkv4s6Q5JF0p6Xgt1POwfVtKRkk6a7te2/VPbm/e87oQDUtL88l7OHDL9JElHTrLU\n2WghcDuwju1Dhs4sYXT0cAvaXsv2ddNd4EwoX04Dtwcl3dfzeN/J/G7bb7D90amqdS5are0C5gJJ\n6wDfB/4JOA1YA3gB8Nc266rE9pJ2sH1h24X0S9JqtleMc7FNgV+6Q2fySRIg2w8OTLO9Vs/864E3\n2/5RC+VVKS30/jwdwPYptlfavs/2D20vHXiCpDdKWibpT5LOlrRpz7xPS7pB0l2Slkh6Qc+8IyWd\nJunrZXP8KkkLJlPsZF6vtLrfJWlp2Rr5hqQ1y7ydJN1Y7p8IbAJ8r7SuDpV0pqT/M6SWpZL2GqXc\njwLDtkwlvUHSBUOm/X0LpbRqPy/pB6WGCyU9XtKnyt/haklbD/m1z5P0yzL/qwPvrfy+PSVdLunO\nsjX2nCHr5T2SlgL3SHpYY0jS8yX9vKy3n0t6/kCdwAHAoaXOcW3VDPOeP1fW9d2SLpb01BGW27F8\nDl6kxicl3VbqWyrp2SMs9xNJH5Z0SXnudyWt3zN/+7J+7pT0C0k7DVn2Q5IuBO4F+uoq6ln+keX9\n3SzpRkkfk7R6mberpGslfUDNVvJ1kvbpWfZUSYf3PN6nvM+7JF0j6cXjqWVOsp3bGDdgHeCPwCJg\nN2C9IfP3Aq4Fnkmz1XM4cFHP/P2Ax5Z5hwC3AGuWeUcC9wO7A6sCHwZ+NkotBp42ZNqRwElT8XrA\n9cAlwBOA9YFlwFvKvJ2AG4c8d5eex68GLu55vGVZb2sM8z7ml/eyFvCHgd8DnAQcWe6/AbhgpPcP\nfI2mG+O5wJrAj4HfAq8v7+1o4Lwh9V4JbFze24XA0WXeNsBtwHZl2QPK8x/Rs+zlZdlHDvN+1gf+\nBOxf1vtry+PH9tR69Ch/1xHnD/Oe7wC2La9zMnDq0OcCLwNuALYt018GLAHWBUTzWd1ohNf7Sfmb\nPBt4NHA65fMFPLH8TXenaRC+pDye17Ps74FnlfpWH+U9P+TzU6Z9FPgpsAGwIfBz4N/KvF2BFTSf\n2TWAXWi+NJ5c5p8KHF7uv6Cs/xeVOjcBnt52lkz3LS30Pti+C9iR5p/lS8BySWdI2rA85SDgw7aX\nudkU/7/AViqtdNsn2f6j7RW2jwUeAWze8xIX2D7L9krgRJogHM2lpXV0p6Q7gcOG1DvZ1/uM7Zts\n3wF8D9iqj9UE8F1gM0mblcf7A9+w/bdRlrkf+BAjtNL78B3bS2zfD3wHuN/218t7+wYwtIV+nO0b\nynv7EE3wAvwj8O+2L3azFbaIpktt+55lP1OWvW+YOvYArrF9YlnvpwBXAy+f4PsazbdtX1I+ayf
", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# we'll filter out the non-represented classes, sort them, and plot it!\n", "breakdown[breakdown != 0].sort_values().plot(\n", " kind='bar', title='Sean Hannity Number of Links per Topic'\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }