{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tools for Link Analysis: urlExpander\n", "View this on [GitHub](https://github.com/yinleon/links-as-data/blob/master/nbs/congress-links.ipynb) | [NBviewer](http://nbviewer.jupyter.org/github/yinleon/links-as-data/blob/master/nbs/congress-links.ipynb?flush_cache=true) | [Binder](https://mybinder.org/v2/gh/yinleon/links-as-data/master?filepath=nbs%2Fcongress-links.ipynb)
\n", "Author: Leon Yin
\n", "Updated on: 2018-10-01\n", "
\n", "\n", "## Intro\n", "This notebook walks through using links as data with the [urlExpander](https://github.com/SMAPPNYU/urlExpander) package.\n", "1. What kind of link data does Twitter provide?\n", "2. How to extract link data from Tweets (`urlexpander.tweet_utils.get_link`)\n", "3. Processing data by expanding shortened URLs (`urlexpander.expand`)\n", "4. Avenues of analysis with link data (`urlexpander.tweet_utils.count_matrix`, `urlexpander.html_utils.get_webpage_meta`)\n", "5. Using links as features to predict political affiliation\n", "\n", "The software needed for this tutorial is listed in the `requirements.txt` file and can be installed as follows:\n", "```\n", "pip install -r requirements.txt\n", "```\n", "\n", "Download the data with:\n", "```\n", "python download_data.py\n", "```\n", "\n", "NOTE: at the time of this writing, `download_data.py` does not work! Please download the data from [OSF](https://osf.io/36b5w/) in the meantime." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What Kind of Link Data Does Twitter Provide?\n", "If you have yet to look at the backend of a Tweet, here you go: [https://bit.ly/tweet_anatomy_link](https://bit.ly/tweet_anatomy_link).
\n", "In addition to hashtags and the like, Tweets contain metadata fields for URLs. The code below will show you how to extract and work with links from Tweets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're working with Tweets from members of Congress collected by Greg Eady." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import json\n", "import glob\n", "import itertools\n", "from multiprocessing import Pool\n", "\n", "from tqdm import tqdm\n", "import pandas as pd\n", "import urlexpander\n", "from smappdragon import JsonCollection\n", "\n", "from config import INTERMEDIATE_DIRECTORY, \\\n", " RAW_TWEETS_DIRECTORY, \\\n", " CONGRESS_METADATA_DIRECTORY" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1950" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# config setting\n", "pd.options.display.float_format = '{:.0f}'.format\n", "\n", "# these are the files we'll be producing here\n", "file_raw_links = os.path.join(INTERMEDIATE_DIRECTORY, 'links_raw.csv')\n", "file_cache = os.path.join(INTERMEDIATE_DIRECTORY, 'cache.json')\n", "file_expanded_links = os.path.join(INTERMEDIATE_DIRECTORY, 'links_expanded_all.csv')\n", "\n", "# this is the raw data we're working with\n", "files = glob.glob(os.path.join(RAW_TWEETS_DIRECTORY, '*.json.bz2'))\n", "len(files)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's preview one file. Each file is saved as newline-delimited JSON, like this:\n", "```\n", "{\"tweet_id\" : \"123\", \"more_data\" : {\"here\" : \"it is\"}}\n", "{\"tweet_id\" : \"124\", \"more_data\" : {\"here\" : \"it is again\"}}\n", "```\n", "and is bzip2-compressed!" 
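, "

To make the file format concrete, here is a minimal sketch that reads such a file with only the Python standard library. It is an illustration of the format, not how smappdragon is implemented; `iter_tweets`, `demo_rows` and `demo_path` are hypothetical names, and silently skipping malformed rows is our own choice for this sketch.

```python
import bz2
import json
import os
import tempfile


def iter_tweets(path):
    '''Yield one parsed Tweet (a dict) per non-empty line of a .json.bz2 file.'''
    # 'rt' decompresses on the fly and hands back text, one line at a time
    with bz2.open(path, 'rt', encoding='utf-8') as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # skip malformed rows instead of stopping (a choice made for this sketch)
                continue


# build a tiny throwaway file shaped like the example above
demo_rows = [
    {'tweet_id': '123', 'more_data': {'here': 'it is'}},
    {'tweet_id': '124', 'more_data': {'here': 'it is again'}},
]
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.json.bz2')
with bz2.open(demo_path, 'wt', encoding='utf-8') as fh:
    for demo_row in demo_rows:
        print(json.dumps(demo_row), file=fh)

# read it back one row at a time
for tweet in iter_tweets(demo_path):
    print(tweet['tweet_id'], tweet['more_data'])
```

Because `iter_tweets` is a generator, only one row is held in memory at a time, which is what makes this format practical for large Tweet collections."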
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'data/tweets-raw/1089859058__2018-03.json.bz2'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f = files[2]\n", "f" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The file structure is newline-delimited JSON. We developed software (like [smappdragon](https://github.com/SMAPPNYU/smappdragon)) to work with Tweets like this:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "collect = JsonCollection(f, compression='bz2', throw_error=False, verbose=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "smappdragon's `JsonCollection` class reads through JSON files as a [generator](https://wiki.python.org/moin/Generators)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "collect" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The con is that generators are hard to inspect; the pro is that they don't load the whole file into memory. We access the data on a row-by-row basis by iterating through the `collect` object. Here we only get the first row; we can see its contents by printing `row`:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "for row in collect.get_iterator():\n", "    break" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "#print(json.dumps(row, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How do we get the links?\n", "Each Tweet can have more than one link, so we need to unpack those values! 
urlexpander has a function to do just this:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[0;31mSignature:\u001b[0m \u001b[0murlexpander\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtweet_utils\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_link\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtweet\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mDocstring:\u001b[0m\n", "Returns a generator containing tweet metadata about media.\n", "\n", "The metadata dict contains the following columns:\n", "\n", "columns = {\n", " 'link_domain' : 'the domain of the URL', \n", " 'link_url_long' : 'the URL (this can be short!)', \n", " 'link_url_short' : 'The t.co URL', \n", " 'tweet_created_at' : 'When the tweet was created', \n", " 'tweet_id' : 'The ID of the tweet', \n", " 'tweet_text' : 'The Full text of the tweet', \n", " 'user_id' : 'The Twitter ID of the tweeter'\n", "}\n", "\n", ":input tweet: a nested dictionary of a Tweet either from the streaming or search API.\n", ":returns: a generator of dictionaries\n", "\u001b[0;31mFile:\u001b[0m ~/anaconda3/lib/python3.6/site-packages/urlexpander/core/tweet_utils.py\n", "\u001b[0;31mType:\u001b[0m function\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "?urlexpander.tweet_utils.get_link" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once again, we get a generator:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# returns a generator, which can't be inspected directly!\n", "urlexpander.tweet_utils.get_link(row)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'user_id': 1089859058, 'tweet_id': 976517212063322112, 'tweet_created_at': 'Wed Mar 21 17:53:40 +0000 2018', 'tweet_text': 
None, 'link_url_long': 'http://bit.ly/2FC7bMz', 'link_domain': 'bit.ly', 'link_url_short': 'https://t.co/5P1JAaxwQV'}\n" ] } ], "source": [ "# we can access the data by iterating through it.\n", "for link_meta in urlexpander.tweet_utils.get_link(row):\n", "    print(link_meta)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'user_id': 1089859058,\n", " 'tweet_id': 976517212063322112,\n", " 'tweet_created_at': 'Wed Mar 21 17:53:40 +0000 2018',\n", " 'tweet_text': None,\n", " 'link_url_long': 'http://bit.ly/2FC7bMz',\n", " 'link_domain': 'bit.ly',\n", " 'link_url_short': 'https://t.co/5P1JAaxwQV'}]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# to unwrap the generator into a flat list, we chain it and call list()\n", "list(itertools.chain.from_iterable([urlexpander.tweet_utils.get_link(row)]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's generalize the workflow into a function. Below is a boilerplate function you can use as a starting point for your own workflow." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def read_file_extract_links(f):\n", "    '''\n", "    Takes a Tweet file that is bzip2-compressed, newline-delimited JSON\n", "    and returns a list of dictionaries of link data.\n", "    '''\n", "    # read the json file into a generator\n", "    collection = JsonCollection(f, compression='bz2', throw_error=False)\n", "    \n", "    # iterate through the json file, extract links, flatten the generator of links\n", "    # into a list, and store into a Pandas dataframe\n", "    df_ = pd.DataFrame(list(\n", "        itertools.chain.from_iterable(\n", "            [urlexpander.tweet_utils.get_link(t) \n", "             for t in collection.get_iterator() \n", "             if t]\n", "    )))\n", "    df_['file'] = f\n", "    return df_.to_dict(orient='records')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can loop through the files and run the function on each one. From there, we can instantiate a Pandas `DataFrame`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 2/2 [00:01<00:00, 1.03s/it]\n" ] } ], "source": [ "data = []\n", "for f in tqdm(files[:2]):\n", "    # extract the links from this file and add them to the running list\n", "    data.extend(read_file_extract_links(f))\n", "df_links = pd.DataFrame(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced (but practical) usage\n", "The for loop is slow! This task is not memory-intensive (because we're using generators).\n", "\n", "We can parallelize this task by using the `Pool` class from the `multiprocessing` module to have each core on our computer read a JSON file of Tweets and filter for links. The `if` statement prevents repeating work once we have already read the files. We cache this intermediate result at the `file_raw_links` file path declared at the beginning of the notebook." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>file</th><th>link_domain</th><th>link_url_long</th><th>link_url_short</th><th>tweet_created_at</th><th>tweet_id</th><th>tweet_text</th><th>user_id</th></tr>\n", "</thead>\n", "<tbody>\n", "
<tr><th>0</th><td>/scratch/olympus/projects/mediascore/Data/json...</td><td>frc.org</td><td>https://www.frc.org/wwlivewithtonyperkins/rep-...</td><td>https://t.co/l9dXT0L7oT</td><td>Fri Mar 23 14:38:34 +0000 2018</td><td>977192888781168640</td><td>nan</td><td>2966758114</td></tr>\n", "
<tr><th>1</th><td>/scratch/olympus/projects/mediascore/Data/json...</td><td>thehill.com</td><td>http://thehill.com/379188-watch-fund-governmen...</td><td>https://t.co/YbdvepWNQ3</td><td>Thu Mar 22 15:21:32 +0000 2018</td><td>976841314024206339</td><td>nan</td><td>2966758114</td></tr>\n", "
</tbody>\n", "</table>
" ], "text/plain": [ " file link_domain \\\n", "0 /scratch/olympus/projects/mediascore/Data/json... frc.org \n", "1 /scratch/olympus/projects/mediascore/Data/json... thehill.com \n", "\n", " link_url_long link_url_short \\\n", "0 https://www.frc.org/wwlivewithtonyperkins/rep-... https://t.co/l9dXT0L7oT \n", "1 http://thehill.com/379188-watch-fund-governmen... https://t.co/YbdvepWNQ3 \n", "\n", " tweet_created_at tweet_id tweet_text user_id \n", "0 Fri Mar 23 14:38:34 +0000 2018 977192888781168640 nan 2966758114 \n", "1 Thu Mar 22 15:21:32 +0000 2018 976841314024206339 nan 2966758114 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "N_CPU = 4 # this is the number of cores we can use to parallelize the task\n", "if not os.path.exists(file_raw_links):\n", "    data = []\n", "    with Pool(N_CPU) as pool:\n", "        iterable = pool.imap_unordered(read_file_extract_links, files)\n", "        for link_data in tqdm(iterable, total=len(files)):\n", "            data.extend(link_data)\n", "    df_links = pd.DataFrame(data)\n", "    df_links.to_csv(file_raw_links, index=False)\n", "\n", "else:\n", "    df_links = pd.read_csv(file_raw_links)\n", "\n", "df_links.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How useful is this data?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of the URLs we encounter in the wild have been sent through a link shortener. Link shorteners record transactional information whenever a link is clicked. Unfortunately, they also make it hard for us to see what is being shared." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['http://goo.gl/kDUwP',\n", " 'http://bit.ly/12clU3p',\n", " 'http://nyti.ms/Z4rdlU',\n", " 'http://goo.gl/LxkrY',\n", " 'http://www.huffingtonpost.com/rep-diana-degette/reducing-gun-violence-mea_b_3018506.html']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "links = df_links['link_url_long'].tolist()\n", "links[-5:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is why `urlexpander` was made. We can run the `expand` function on a single URL or on a list of URLs." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://degette.house.gov/index.php?option=com_content&view=article&id=1260:congressional-lgbt-equality-caucus-praises-re-introduction-of-employment-non-discrimination-act&catid=76:press-releases-&Itemid=227'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "urlexpander.expand(links[-5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, urlexpander will expand every URL it is given. However, you can pass a boolean function (one that returns True or False for an input string) to the `filter_function` parameter to restrict which URLs get expanded." 
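, "

A filter function only needs to map a URL string to True or False. The sketch below shows one way to write your own; `SHORTENER_DOMAINS` and `looks_shortened` are hypothetical names for illustration, the domain list is an assumption, and this is not how `urlexpander.is_short` is implemented.

```python
from urllib.parse import urlparse

# hypothetical, hand-picked list of shortener domains (an assumption for this sketch)
SHORTENER_DOMAINS = {'bit.ly', 'goo.gl', 'ow.ly'}


def looks_shortened(url):
    '''Return True if the URL's host is one of the shortener domains above.'''
    # urlparse only fills in netloc when a scheme is present, so add one if missing
    if '://' not in url:
        url = 'http://' + url
    host = urlparse(url).netloc.lower()
    if host.startswith('www.'):
        host = host[4:]
    return host in SHORTENER_DOMAINS


links_demo = [
    'http://goo.gl/kDUwP',
    'http://bit.ly/12clU3p',
    'http://nyti.ms/Z4rdlU',
    'http://goo.gl/LxkrY',
    'http://www.huffingtonpost.com/rep-diana-degette/reducing-gun-violence-mea_b_3018506.html',
]
for link in links_demo:
    print(looks_shortened(link), link)
```

Passing such a function as `filter_function=looks_shortened` to `urlexpander.expand` should then limit expansion to the domains in the list."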
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['https://degette.house.gov/index.php?option=com_content&view=article&id=1260:congressional-lgbt-equality-caucus-praises-re-introduction-of-employment-non-discrimination-act&catid=76:press-releases-&Itemid=227',\n", " 'http://www.rollcall.com/news/house_hydro_bill_tests_water_for_broad_energy_deals-224277-1.html',\n", " 'http://nyti.ms/Z4rdlU',\n", " 'http://www.civiccenterconservancy.org/history-2012-nhl-designation_25.html',\n", " 'http://www.huffingtonpost.com/rep-diana-degette/reducing-gun-violence-mea_b_3018506.html']" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "urlexpander.expand(links[-5:], filter_function=urlexpander.is_short)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What's happening behind the scenes? \n", "['abc.com/123', 'bbc.co.uk/123', 'abc.com/123', 'bit.ly/cbc23']
\n", "--> Remove duplicates
\n", " ['abc.com/123', 'bbc.co.uk/123', 'bit.ly/cbc23']
\n", "--> Filter for shortened URLs
\n", "['bit.ly/cbc23']
\n", "--> Check the cache file: did we already expand this? Unshorten new URLs
\n", "[{'original_url': 'bit.ly/cbc23', 'resolved' : 'cspan.com/123'}]
\n", "--> join back in
\n", "['abc.com/123', 'bbc.co.uk/123', 'abc.com/123', 'cspan.com/123']
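
The steps above can be sketched in a few lines of plain Python. This is a simplified illustration under our own assumptions, not urlexpander's actual implementation: `expand_sketch`, `toy_is_short` and `toy_resolve` are hypothetical names, the cache is an in-memory dict rather than a JSON file, and no network requests or threads are involved.

```python
def expand_sketch(urls, is_short, resolve, cache=None):
    '''Dedupe, filter, consult the cache, resolve, then join back in order.'''
    cache = {} if cache is None else cache

    # 1. remove duplicates while keeping first-seen order
    unique = list(dict.fromkeys(urls))

    # 2. keep only the URLs that look shortened
    short = [u for u in unique if is_short(u)]

    # 3. resolve only what is not in the cache yet
    for u in short:
        if u not in cache:
            cache[u] = resolve(u)

    # 4. join back in: every input keeps its position, duplicates included
    return [cache.get(u, u) for u in urls]


# hypothetical stand-ins for a shortener check and a network lookup
def toy_is_short(url):
    return url.startswith('bit.ly/')


def toy_resolve(url):
    return {'bit.ly/cbc23': 'cspan.com/123'}.get(url, url)


urls = ['abc.com/123', 'bbc.co.uk/123', 'abc.com/123', 'bit.ly/cbc23']
resolved = expand_sketch(urls, toy_is_short, toy_resolve)
print(resolved)
```

Deduplicating and caching first means each shortened URL is resolved at most once, however many times it was shared.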
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "urlexpander parallelizes, filters and caches the input, which is essential for social media data." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[0;31mSignature:\u001b[0m \u001b[0murlexpander\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexpand\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlinks_to_unshorten\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mchunksize\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1280\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mn_workers\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcache_file\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrandom_seed\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m303\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mverbose\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfilter_function\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mDocstring:\u001b[0m\n", "Calls expand with multiple (``n_workers``) threads to unshorten a list of urls. Unshortens all urls by default, unless one sets a ``filter_function``.\n", "\n", ":param links_to_unshorten: (list, str) either an idividual or list (str) of urls to unshorten\n", ":param chunksize: (int) chunks links_to_unshorten, which makes computation quicker with larger inputs\n", ":param n_workers: (int) how many threads\n", ":param cache_file: (str) a path to a json file to read and write results in\n", ":param random_seed: (int) initializes the random state for shuffling the input\n", ":param verbose: (int) whether to print updates and errors. 0 is silent. 1 is progress bar. 
2 is progress bar and errors.\n", ":param filter_function: (func) a boolean used to filter url shorteners out\n", " \n", "\n", ":returns: (list) a list of resolved urls\n", "\u001b[0;31mFile:\u001b[0m ~/anaconda3/lib/python3.6/site-packages/urlexpander/core/api.py\n", "\u001b[0;31mType:\u001b[0m function\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "?urlexpander.expand" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above is a toy example with 5 links; let's see how this works for 1.7 million links. For reference, this took me an hour on an 8-core computer with a reliable internet connection." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "resolved_urls = urlexpander.expand(links, \n", " filter_function=urlexpander.is_short,\n", " n_workers=64,\n", " chunksize=1280,\n", " cache_file=file_cache,\n", " verbose=1)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1700150" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(resolved_urls)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "df_links['link_resolved'] = resolved_urls\n", "df_links['link_resolved_domain'] = df_links['link_resolved'].apply(urlexpander.get_domain)" ] }, { "cell_type": "code", "execution_count": 254, "metadata": {}, "outputs": [], "source": [ "df_links.to_csv(file_expanded_links, index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analytics\n", "With the links resolved, how can we use links as data?" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>file</th><th>link_domain</th><th>link_url_long</th><th>link_url_short</th><th>tweet_created_at</th><th>tweet_id</th><th>tweet_text</th><th>user_id</th><th>link_resolved</th><th>link_resolved_domain</th></tr>\n", "</thead>\n", "<tbody>\n", "
<tr><th>0</th><td>/scratch/olympus/projects/mediascore/Data/json...</td><td>frc.org</td><td>https://www.frc.org/wwlivewithtonyperkins/rep-...</td><td>https://t.co/l9dXT0L7oT</td><td>Fri Mar 23 14:38:34 +0000 2018</td><td>977192888781168640</td><td>nan</td><td>2,966,758,114</td><td>https://www.frc.org/wwlivewithtonyperkins/rep-...</td><td>frc.org</td></tr>\n", "
<tr><th>1</th><td>/scratch/olympus/projects/mediascore/Data/json...</td><td>thehill.com</td><td>http://thehill.com/379188-watch-fund-governmen...</td><td>https://t.co/YbdvepWNQ3</td><td>Thu Mar 22 15:21:32 +0000 2018</td><td>976841314024206339</td><td>nan</td><td>2,966,758,114</td><td>https://thehill.com/379188-watch-fund-governme...</td><td>thehill.com</td></tr>\n", "
</tbody>\n", "</table>
" ], "text/plain": [ " file link_domain \\\n", "0 /scratch/olympus/projects/mediascore/Data/json... frc.org \n", "1 /scratch/olympus/projects/mediascore/Data/json... thehill.com \n", "\n", " link_url_long link_url_short \\\n", "0 https://www.frc.org/wwlivewithtonyperkins/rep-... https://t.co/l9dXT0L7oT \n", "1 http://thehill.com/379188-watch-fund-governmen... https://t.co/YbdvepWNQ3 \n", "\n", " tweet_created_at tweet_id tweet_text \\\n", "0 Fri Mar 23 14:38:34 +0000 2018 977192888781168640 nan \n", "1 Thu Mar 22 15:21:32 +0000 2018 976841314024206339 nan \n", "\n", " user_id link_resolved \\\n", "0 2,966,758,114 https://www.frc.org/wwlivewithtonyperkins/rep-... \n", "1 2,966,758,114 https://thehill.com/379188-watch-fund-governme... \n", "\n", " link_resolved_domain \n", "0 frc.org \n", "1 thehill.com " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_links = pd.read_csv(file_expanded_links)\n", "df_links.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can get an overview of the most frequently shared domains:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "twitter.com 255532\n", "house.gov 199218\n", "youtube.com 93986\n", "facebook.com 90061\n", "senate.gov 78645\n", "washingtonpost.com 29886\n", "instagram.com 28460\n", "nytimes.com 25014\n", "thehill.com 22925\n", "politico.com 13488\n", "foxnews.com 12045\n", "cnn.com 11611\n", "wsj.com 11289\n", "twimg.com 9633\n", "ow.ly 9463\n", "Name: link_resolved_domain, dtype: int64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_links['link_resolved_domain'].value_counts().head(15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text-based URL Metadata\n", "We can also look at the contents of each URL. 
Twitter provides URL metadata ([if you pay](https://developer.twitter.com/en/docs/tweets/enrichments/overview/expanded-and-enhanced-urls)), we provided a workaround!" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "OrderedDict([('url',\n", " 'http://www.rollcall.com/news/house_hydro_bill_tests_water_for_broad_energy_deals-224277-1.html'),\n", " ('title', 'House Hydro Bill Tests Water for Broad Energy Deals'),\n", " ('description',\n", " ' In February, the House did something rare: It passed an energy bill unanimously. Unlike the previous Congress’ standard fare of anti-EPA, pro-drilling measures, the first energy bill of the 113th Congress promoted small-scale hydropower projects and the electrification of existing dams.'),\n", " ('paragraphs',\n", " ['',\n", " '',\n", " 'In February, the House did something rare: It passed an energy bill unanimously. Unlike the previous Congress’ standard fare of anti-EPA, pro-drilling measures, the first energy bill of the 113th Congress promoted small-scale hydropower projects and the electrification of existing dams.',\n", " 'In other words, the Republican-controlled House passed a clean-energy bill.',\n", " 'Of course, few Republicans could object on ideological grounds to legislation that aimed to expedite or remove regulatory requirements for expanding hydropower facilities. And it certainly helped that the chamber passed similar legislation in 2012 by a large margin. ',\n", " 'Nevertheless, hydropower proponents say that developing the resource represents “low-hanging fruit” that members tired of the partisanship that has permeated energy policy can all agree is worth advancing to President Barack Obama’s desk. Hydropower represented nearly two-thirds — the largest share by far — of domestic renewable-energy production and 8 percent of total U.S. electricity generation in 2011, according to the Energy Information Administration. 
More than half of that total powered the Pacific Northwest region, and hydro is one of the few renewable resources that can provide baseload electricity — power that is available at all times — to the grid.',\n", " 'But just 3 percent of the 80,000 dams in the United States generate power, representing great potential for growing the resource, according to legislation championed by Reps. Cathy McMorris Rodgers, R-Wash., and Diana DeGette, D-Colo.',\n", " '“If you can work on regulatory reform for those projects, then you can have small hydro throughout this country,” DeGette said. ',\n", " 'The House lawmakers’ legislation would let small hydroelectric facilities generating up to 10 megawatts of power bypass Federal Energy Regulatory Commission licensing requirements that currently apply to projects producing more than 5 megawatts. The bill also would require FERC to study the feasibility of carrying out a two-year hydropower licensing pilot program at unpowered dams and would allow the commission to extend preliminary permits for two additional years. ',\n", " 'Jeff Leahey, government affairs director at the National Hydropower Association, said House leaders probably moved the bill so they could promote energy legislation that “checked the boxes” on encouraging the development of a resource that is renewable and reliable. It also helped that the bill — along with another measure passed April 10 that would designate an Interior Department agency as the lead regulator of small federal conduits — moved through the House last Congress and didn’t need much additional work, he said.',\n", " 'A significant factor in the refocused spotlight on the “original renewable” is the new leadership on the Senate Energy and Natural Resources Committee and its representation of key hydropower-producing states. 
Chairman Ron Wyden, D-Ore., promised industry representatives at the hydropower association’s annual conference this week that he plans to “quickly” mark up hydropower legislation after a panel hearing Tuesday. ',\n", " 'Wyden attributes the rise in hydro’s profile to better environmental stewardship on the part of facility operators and a more cooperative relationship between hydropower lobbyists and environmental groups focused on protecting river ecosystems. The effect of dams on fisheries and riverine habitat, as well as operational costs, has compelled organizations to promote the removal of dams in some instances.',\n", " '“Hydro’s environmental performance has improved dramatically,” Wyden said. ',\n", " 'Association President David Moller of Pacific Gas and Electric Co. acknowledged the role that historically low natural-gas prices have played in limiting hydropower expansion in recent years. But he said opportunities for hydropower still flourish because of state renewable portfolio mandates, coal-fired power plants being pushed into retirement and technological advances in powering existing dams and water channels.',\n", " '“The price of natural gas has dropped, but it will never match hydropower’s fuel price of zero, or its attributes of being renewable and non-carbon-based,” he said. ',\n", " 'Hydro proponents in the private sector and in Congress said this week that they will continue to promote hydropower development, possibly in future legislation. That could include examining additional regulatory issues that contribute to long lead times for completing electrified projects or adjusting current benefits that exist for clean power in the tax code. 
Prospects for he latter — which would involve extending the production tax credit for a multiyear period or making it permanent as Obama proposes — are dim outside a comprehensive tax code overhaul.',\n", " '“I think at the end of the day, it’s all about making sure that hydropower and the benefits that come from hydropower projects are competitive in the marketplace with other energy projects,” Leahey said. ',\n", " 'Whether the bipartisan camaraderie that has surrounded promoting hydropower can translate to moving broader energy legislation is anyone’s guess. But members close to the debate express optimism that the current spate of legislative action could beget compromise in the future.',\n", " '“I’m not sure we’re any closer to that comprehensive policy, but I’d think that common ground on these issues like hydro can only be helpful,” DeGette said.',\n", " '',\n", " '×',\n", " '$${CardTitle}',\n", " '$${CardTitle}'])])" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = 'http://www.rollcall.com/news/house_hydro_bill_tests_water_for_broad_energy_deals-224277-1.html'\n", "meta = urlexpander.html_utils.get_webpage_meta(url)\n", "meta" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categorizing what is shared\n", "If we want to know more about what kinds of information are being shared by members of congress, we can enrich the dataset by joining metadata about domains. In this example we will use the [Local News Dataset](https://github.com/yinleon/LocalNewsDataset) to inspect the local media outlets that members of congress share:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>name</th><th>state</th><th>website</th><th>domain</th><th>twitter</th><th>youtube</th><th>facebook</th><th>owner</th><th>medium</th><th>source</th><th>collection_date</th></tr>\n", "</thead>\n", "<tbody>\n", "
<tr><th>0</th><td>KWHE</td><td>HI</td><td>http://www.kwhe.com/</td><td>kwhe.com</td><td>NaN</td><td>NaN</td><td>NaN</td><td>LeSea</td><td>TV station</td><td>stationindex</td><td>2018-08-02 14:55:24.612585</td></tr>\n", "
<tr><th>1</th><td>WGVK</td><td>MI</td><td>http://www.wgvu.org/</td><td>wgvu.org</td><td>NaN</td><td>NaN</td><td>NaN</td><td>Grand Valley State University</td><td>TV station</td><td>stationindex</td><td>2018-08-02 14:55:24.612585</td></tr>\n", "
<tr><th>3</th><td>KTUU</td><td>AK</td><td>http://www.ktuu.com/</td><td>ktuu.com</td><td>NaN</td><td>NaN</td><td>NaN</td><td>Schurz Communications</td><td>TV station</td><td>stationindex</td><td>2018-08-02 14:55:24.612585</td></tr>\n", "
<tr><th>4</th><td>KTBY</td><td>AK</td><td>http://www.ktbytv.com/</td><td>ktbytv.com</td><td>NaN</td><td>NaN</td><td>NaN</td><td>Coastal Television Broadcasting</td><td>TV station</td><td>stationindex</td><td>2018-08-02 14:55:24.612585</td></tr>\n", "
<tr><th>5</th><td>KYES</td><td>AK</td><td>http://www.kyes.com/</td><td>kyes.com</td><td>NaN</td><td>NaN</td><td>NaN</td><td>Fireweed Communications</td><td>TV station</td><td>stationindex</td><td>2018-08-02 14:55:24.612585</td></tr>\n", "
</tbody>\n", "</table>
" ], "text/plain": [ " name state website domain twitter youtube facebook \\\n", "0 KWHE HI http://www.kwhe.com/ kwhe.com NaN NaN NaN \n", "1 WGVK MI http://www.wgvu.org/ wgvu.org NaN NaN NaN \n", "3 KTUU AK http://www.ktuu.com/ ktuu.com NaN NaN NaN \n", "4 KTBY AK http://www.ktbytv.com/ ktbytv.com NaN NaN NaN \n", "5 KYES AK http://www.kyes.com/ kyes.com NaN NaN NaN \n", "\n", " owner medium source \\\n", "0 LeSea TV station stationindex \n", "1 Grand Valley State University TV station stationindex \n", "3 Schurz Communications TV station stationindex \n", "4 Coastal Television Broadcasting TV station stationindex \n", "5 Fireweed Communications TV station stationindex \n", "\n", " collection_date \n", "0 2018-08-02 14:55:24.612585 \n", "1 2018-08-02 14:55:24.612585 \n", "3 2018-08-02 14:55:24.612585 \n", "4 2018-08-02 14:55:24.612585 \n", "5 2018-08-02 14:55:24.612585 " ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "local_news_url = 'https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/local_news_dataset_2018.csv'\n", "df_local = pd.read_csv(local_news_url)\n", "df_local = df_local[~(df_local.domain.isnull()) & \n", " (df_local.domain != 'facebook.com')]\n", "df_local.head()" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "# this is a SQL-like join in Pandas that merges the two datasets based on domain name!\n", "df_links_state = df_links.merge(df_local, \n", " left_on='link_resolved_domain', \n", " right_on='domain')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset unlocks insights regarding the locality of news articles shared, as well as media ownership." 
] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "TX 43962\n", "CA 35029\n", "NJ 33442\n", "NY 33145\n", "MI 28912\n", "FL 21519\n", "NC 19841\n", "PA 19060\n", "OH 18857\n", "MD 16533\n", "GA 13419\n", "MO 12660\n", "TN 12324\n", "AZ 11052\n", "MA 10767\n", "WA 10325\n", "NV 10261\n", "LA 9519\n", "MN 9510\n", "OR 9404\n", "Name: state, dtype: int64" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_links_state.state.value_counts().head(20)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Nexstar 7543\n", "Advance Local 4464\n", "Tegna Media 4265\n", "Sinclair 3457\n", "Hearst 2571\n", "Tribune 2086\n", "Gray Television 1817\n", "Fox Television Stations 1580\n", "Hearst Television 1528\n", "Acvance Local 1522\n", "Raycom 1396\n", "NBC Universal 1344\n", "Georgia Public Broadcasting 1248\n", "New Jersey Public Broadcasting Authority 1156\n", "The Philadelphia Inquirer 1103\n", "ABC 978\n", "Evening Post Publishing 820\n", "Meredith 750\n", "Oregon Public Broadcasting 740\n", "Georgia Public Telecommunications Commission 624\n", "Graham Media Group 588\n", "E. W. 
Scripps Company 540\n", "Cox Enterprises 526\n", "WGBH Educational Foundation 425\n", "Piedmont Television 333\n", "Name: owner, dtype: int64" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_links_state.owner.value_counts().head(25)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "66891808 100\n", "818948638890217472 68\n", "1065995022 66\n", "1058345042 64\n", "90651198 63\n", "368948092 56\n", "27676828 48\n", "2929491549 48\n", "2987671552 44\n", "58579942 44\n", "Name: user_id, dtype: int64" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_links_state[df_links_state.owner == 'Sinclair']['user_id'].value_counts().head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is one example of dataset enrichment; you can create your own categorizations and join them in. Alexa.com is a good starting place." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### User-Aggregated Activity\n", "The frequency of domains shared per user makes a rich set of features. We can aggregate the data using the following utility function:" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
1011fmtheanswer.com10best.com10tv.com11alive.com123formbuilder.com12news.com12newsnow.com13abc.com13wham.com13wmaz.com...yorkdispatch.comyouarecurrent.comyoucaring.comyoungcons.comyourconroenews.comyourdailyjournal.comyoutube.comzeldinforcongress.comzeldinforsenate.comzpolitics.com
user_id
8132860000000000...000000198000
9390910000000000...000000106000
54969320000000000...000000427000
55117520000000000...00000073000
555831216000000000...00000053000
\n", "

5 rows × 2376 columns

\n", "
" ], "text/plain": [ " 1011fmtheanswer.com 10best.com 10tv.com 11alive.com \\\n", "user_id \n", "813286 0 0 0 0 \n", "939091 0 0 0 0 \n", "5496932 0 0 0 0 \n", "5511752 0 0 0 0 \n", "5558312 16 0 0 0 \n", "\n", " 123formbuilder.com 12news.com 12newsnow.com 13abc.com 13wham.com \\\n", "user_id \n", "813286 0 0 0 0 0 \n", "939091 0 0 0 0 0 \n", "5496932 0 0 0 0 0 \n", "5511752 0 0 0 0 0 \n", "5558312 0 0 0 0 0 \n", "\n", " 13wmaz.com ... yorkdispatch.com youarecurrent.com \\\n", "user_id ... \n", "813286 0 ... 0 0 \n", "939091 0 ... 0 0 \n", "5496932 0 ... 0 0 \n", "5511752 0 ... 0 0 \n", "5558312 0 ... 0 0 \n", "\n", " youcaring.com youngcons.com yourconroenews.com \\\n", "user_id \n", "813286 0 0 0 \n", "939091 0 0 0 \n", "5496932 0 0 0 \n", "5511752 0 0 0 \n", "5558312 0 0 0 \n", "\n", " yourdailyjournal.com youtube.com zeldinforcongress.com \\\n", "user_id \n", "813286 0 198 0 \n", "939091 0 106 0 \n", "5496932 0 427 0 \n", "5511752 0 73 0 \n", "5558312 0 53 0 \n", "\n", " zeldinforsenate.com zpolitics.com \n", "user_id \n", "813286 0 0 \n", "939091 0 0 \n", "5496932 0 0 \n", "5511752 0 0 \n", "5558312 0 0 \n", "\n", "[5 rows x 2376 columns]" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "count_matrix = urlexpander.tweet_utils.count_matrix(\n", " df_links,\n", " user_col='user_id',\n", " domain_col='link_resolved_domain',\n", " min_freq=20,\n", ")\n", "\n", "count_matrix.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Are links good features to predict political affiliation?\n", "Let's see how the `count_matrix` features fare in machine learning. To do so, we need to enrich our data with an output variable. For our purposes, we will try to predict political affiliation."
] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/leonyin/anaconda3/lib/python3.6/site-packages/numba/errors.py:102: UserWarning: Insufficiently recent colorama version found. Numba requires colorama >= 0.3.9\n", " warnings.warn(msg)\n" ] } ], "source": [ "%matplotlib inline\n", "import umap\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These files have the affiliation of each account." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "meta = []\n", "for f in glob.glob(os.path.join(CONGRESS_METADATA_DIRECTORY, '*')):\n", " _df = pd.read_csv(f)\n", " _df = _df[~_df.twitter_id.isnull()]\n", " meta.extend(_df.to_dict(orient='records'))" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "df_meta = pd.DataFrame(meta)\n", "df_meta.twitter_id = df_meta.twitter_id.astype(float, errors='ignore').astype(str)\n", "df_links.user_id = df_links.user_id.astype(float, errors='ignore').astype(str)\n", "look_up = df_meta[df_meta['twitter_id'].isin(df_links.user_id)].drop_duplicates(subset=['twitter_id'])[['twitter_id', 'affiliation']]\n", "df_links_ = df_links[df_links.user_id.isin(look_up['twitter_id'])]" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(971, 971)" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df_links_.user_id.unique()), len(look_up)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Unsupervised Learning\n", "This will be an exploratory step, where we will try to visualize the dataset of counts of links shared by each user. 
We will reduce the high-dimensional data (where each domain is one dimension) to two dimensions for visualization using the [UMAP](https://umap-learn.readthedocs.io/en/latest/) algorithm (much like the popular [t-SNE](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) algorithm)." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "def color(party):\n", " if party == 'Democrat':\n", " return 'blue' \n", " elif party == 'Republican':\n", " return 'red'\n", " else:\n", " return 'black'\n", "\n", "def viz_umap_embed(count_matrix, title=\"Members of Congress Embedded by UMAP\", threshold=5, **kwargs):\n", " '''\n", " Visualizes the count matrix in 2 dimensions using UMAP.\n", " '''\n", " if threshold:\n", " count_matrix = count_matrix[count_matrix.sum(axis=1) >= threshold]\n", " parties = look_up.set_index('twitter_id').loc[count_matrix.index].affiliation\n", "\n", " embedding = umap.UMAP(n_components=2, **kwargs).fit_transform(count_matrix.values)\n", "\n", " plt.figure(figsize=(14,8))\n", " ax = plt.scatter(x = embedding[:,0], \n", " y = embedding[:,1],\n", " s = 100,\n", " c = parties.apply(color),\n", " alpha = .4)\n", "\n", " plt.title(title)\n", " plt.axis('off')\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "# domains to exclude in our count matrix\n", "exclude = ['youtube.com', 'twitter.com', 'fb.com', \n", " 'facebook.com', 'instagram.com', 'ow.ly', \n", " 'house.gov', 'senate.gov', 'usa.gov']" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "count_matrix = urlexpander.tweet_utils.count_matrix(\n", " df_links_,\n", " user_col='user_id',\n", " domain_col='link_resolved_domain',\n", " min_freq=5,\n", " exclude_domain_list=exclude,\n", ")" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
frc.orgthehill.comiheart.comc-span.orgnews9.comspeaker.govfoxnews.comfrcaction.orgkoco.comokcfox.com...mullinforcongress.comtherepublicanstandard.combarbaracomstockforcongress.comthefriendshipchallenge.comtomreedforcongress.comsincomillas.comdetodopr.comthedowneypatriot.comgarretgravesforcongress.comabout.com
user_id
1004855106.00000000000...0000000000
1009269193.0012016021000...0000000000
\n", "

2 rows × 5301 columns

\n", "
" ], "text/plain": [ " frc.org thehill.com iheart.com c-span.org news9.com \\\n", "user_id \n", "1004855106.0 0 0 0 0 0 \n", "1009269193.0 0 12 0 16 0 \n", "\n", " speaker.gov foxnews.com frcaction.org koco.com okcfox.com \\\n", "user_id \n", "1004855106.0 0 0 0 0 0 \n", "1009269193.0 2 1 0 0 0 \n", "\n", " ... mullinforcongress.com therepublicanstandard.com \\\n", "user_id ... \n", "1004855106.0 ... 0 0 \n", "1009269193.0 ... 0 0 \n", "\n", " barbaracomstockforcongress.com thefriendshipchallenge.com \\\n", "user_id \n", "1004855106.0 0 0 \n", "1009269193.0 0 0 \n", "\n", " tomreedforcongress.com sincomillas.com detodopr.com \\\n", "user_id \n", "1004855106.0 0 0 0 \n", "1009269193.0 0 0 0 \n", "\n", " thedowneypatriot.com garretgravesforcongress.com about.com \n", "user_id \n", "1004855106.0 0 0 0 \n", "1009269193.0 0 0 0 \n", "\n", "[2 rows x 5301 columns]" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "count_matrix.head(2)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/leonyin/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py:475: DataConversionWarning: Data with input dtype float32 was converted to bool by check_pairwise_arrays.\n", " warnings.warn(msg, DataConversionWarning)\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "viz_umap_embed(count_matrix, \n", " title=\"Link Sharing of Members of Congress Embedded by UMAP\",\n", " threshold=5,\n", " # umap params\n", " n_neighbors=50,\n", " min_dist=0.1,\n", " metric='dice',\n", " random_state=303)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Supervised Learning\n", "In the unsupervised case, we already see Democrats and Republicans sectioned off. 
Here we will fit a [logistic regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model to predict whether a member of Congress is a Democrat or a Republican based on the `count_matrix` we just created." ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LogisticRegression" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(820, 145)" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# filter out independents and accounts that shared no links\n", "dems_repubs = look_up[look_up.affiliation != 'Independent'].twitter_id\n", "count_matrix_ = count_matrix[(count_matrix.index.isin(dems_repubs)) &\n", " (count_matrix.sum(axis=1) >= 1)]\n", "parties = look_up.set_index('twitter_id').loc[count_matrix_.index].affiliation\n", "\n", "# create the training set\n", "X, y = count_matrix_.values, parties\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=303, test_size=.15)\n", "len(X_train), len(X_test) " ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9517241379310345" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logreg = LogisticRegression(penalty='l2', C=.7,\n", " solver='liblinear',\n", " random_state=303)\n", "logreg.fit(X_train, y_train)\n", "logreg.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.metrics import confusion_matrix" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "def plot_confusion_matrix(cm, classes,\n", " 
normalize=False,\n", " title='Confusion matrix',\n", " cmap=plt.cm.Blues):\n", " \"\"\"\n", " This function prints and plots the confusion matrix.\n", " Normalization can be applied by setting `normalize=True`.\n", " \"\"\"\n", " if normalize:\n", " cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n", " print(\"Normalized confusion matrix\")\n", " else:\n", " print('Confusion matrix, without normalization')\n", "\n", " print(cm)\n", "\n", " plt.imshow(cm, interpolation='nearest', cmap=cmap)\n", " plt.title(title)\n", " plt.colorbar()\n", " tick_marks = np.arange(len(classes))\n", " plt.xticks(tick_marks, classes, rotation=45)\n", " plt.yticks(tick_marks, classes)\n", "\n", " fmt = '.2f' if normalize else 'd'\n", " thresh = cm.max() / 2.\n", " for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n", " plt.text(j, i, format(cm[i, j], fmt),\n", " horizontalalignment=\"center\",\n", " color=\"white\" if cm[i, j] > thresh else \"black\")\n", "\n", " plt.ylabel('True label')\n", " plt.xlabel('Predicted label')\n", " plt.tight_layout()" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Confusion matrix, without normalization\n", "[[49 2]\n", " [ 5 89]]\n", "Normalized confusion matrix\n", "[[0.96 0.04]\n", " [0.05 0.95]]\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "y_pred = logreg.predict(X_test)\n", "class_names = logreg.classes_\n", "\n", "# Compute confusion matrix\n", "cnf_matrix = confusion_matrix(y_test, y_pred)\n", "np.set_printoptions(precision=2)\n", "\n", "# Plot non-normalized confusion matrix\n", "plt.figure()\n", "plot_confusion_matrix(cnf_matrix, classes=class_names,\n", " title='Confusion matrix, without 
normalization')\n", "\n", "# Plot normalized confusion matrix\n", "plt.figure()\n", "plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,\n", " title='Normalized confusion matrix')\n", "\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
twitter_name
twitter_id
9.410800851211756e+17SenDougJones
23820360.0billhuizenga
136526394.0WebsterCongress
4615689368.0GeneGreen29
19726613.0SenatorCollins
242376736.0RepCharlieDent
16056306.0JeffFlake
\n", "
" ], "text/plain": [ " twitter_name\n", "twitter_id \n", "9.410800851211756e+17 SenDougJones\n", "23820360.0 billhuizenga\n", "136526394.0 WebsterCongress\n", "4615689368.0 GeneGreen29\n", "19726613.0 SenatorCollins\n", "242376736.0 RepCharlieDent\n", "16056306.0 JeffFlake" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# these are what we got wrong!\n", "df_meta.set_index('twitter_id').loc[\n", " y_test[y_test != y_pred].index\n", "][['twitter_name']]" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "indep = look_up[look_up.affiliation == 'Independent'].twitter_id\n", "count_matrix_indep_ = count_matrix[(count_matrix.index.isin(indep)) &\n", " (count_matrix.sum(axis=1) >= 2)]" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
twitter_namepreds
twitter_id
1068481578.0SenAngusKingRepublican
216776631.0BernieSandersDemocrat
2915095729.0AkGovBillWalkerRepublican
29442313.0SenSandersDemocrat
3196634042.0GovernorMappRepublican
\n", "
" ], "text/plain": [ " twitter_name preds\n", "twitter_id \n", "1068481578.0 SenAngusKing Republican\n", "216776631.0 BernieSanders Democrat\n", "2915095729.0 AkGovBillWalker Republican\n", "29442313.0 SenSanders Democrat\n", "3196634042.0 GovernorMapp Republican" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ind = df_meta.set_index('twitter_id').loc[count_matrix_indep_.index][['twitter_name']]\n", "df_ind['preds'] = logreg.predict(count_matrix_indep_)\n", "df_ind" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## K-Fold Cross Validation" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "logreg_cv = LogisticRegression(penalty='l2', C=.7,\n", " solver='liblinear',\n", " random_state=303)\n", "scores = cross_val_score(logreg_cv, X, y, cv=5)" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.96 (+/- 0.02)\n" ] } ], "source": [ "print(\"Accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 4 }