{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Parsing OpenSources\n",
    "by [leon yin](https://twitter.com/leonyin)<br>\n",
    "2017-11-22\n",
    "\n",
    "## What is this?\n",
    "In this Jupyter Notebook we will\n",
    "1. <a href='#bash'>download a real-world dataset</a>,\n",
    "2. <a href='#whoops'>clean up human-entered text</a> (<a href='#clean-up'>twice</a>),\n",
    "3. <a href='#hot'>one-hot encode categories of misleading websites</a>,\n",
    "4. <a href='#analysis'>use Pandas to analyze these sites</a>, and\n",
    "5. make a [machine-readable file](https://github.com/yinleon/fake_news/blob/master/data/sources_clean.tsv).\n",
    "\n",
    "Please view the [detailed version](https://nbviewer.jupyter.org/github/yinleon/fake_news/blob/master/opensources.ipynb) if you want to know how everything works, or if you're unfamiliar with Jupyter Notebooks and Python.\n",
    "\n",
    "View this on [GitHub](https://github.com/yinleon/fake_news/blob/master/opensources-lite.ipynb).<br>\n",
    "View this on [NBViewer](https://nbviewer.jupyter.org/github/yinleon/fake_news/blob/master/opensources-lite.ipynb).<br>\n",
    "Visit my lab's [website](https://wp.nyu.edu/smapp/).\n",
    "\n",
    "## Intro\n",
    "[OpenSources](http://www.opensources.co/) offers \"professionally curated lists of online sources, available free for public use\" by Melissa Zimdars and colleagues. It contains websites labeled with categories ranging from state-sponsored media outlets to conspiracy-theory rumor mills. It is a comprehensive resource for researchers and technologists interested in propaganda and mis/disinformation. \n",
    "\n",
    "The OpenSources project is in fact open-sourced in JSON and CSV format.<br>\n",
    "One issue, however, is that the data is entered by people and is not readily machine-readable.\n",
    "\n",
    "Let's take a moment to appreciate the work of _people_ <br>\n",
    "<img src='https://media1.giphy.com/media/6tHy8UAbv3zgs/giphy.gif'></img>\n",
    "\n",
    "And optimize this information for machines,\n",
    "<img src='https://media.giphy.com/media/gBW8Qgfaa2ije/giphy.gif'></img>\n",
    "\n",
    "Using some good ol' fashioned data wrangling.\n",
    "\n",
    "## Let's Code Yo! <a id='bash'></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "filename = \"data/sources.json\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "\r",
      "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r",
      "100  136k  100  136k    0     0   136k      0  0:00:01 --:--:--  0:00:01  510k\n"
     ]
    }
   ],
   "source": [
    "%%sh -s $filename\n",
    "mkdir -p data\n",
    "curl https://raw.githubusercontent.com/BigMcLargeHuge/opensources/master/sources/sources.json --output $1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df = pd.read_json(filename, orient='index')"
   ]
  },
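  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the call above, `orient='index'` tells pandas to treat each top-level JSON key as a row label. Here is a minimal sketch with a made-up two-entry payload (the `raw` string, the `*.example` domains and the `toy_df` name are hypothetical and not part of OpenSources):\n",
    "\n",
    "```python\n",
    "import io\n",
    "import json\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical payload shaped like {domain: {column: value}}.\n",
    "raw = json.dumps({\n",
    "    'a.example': {'type': 'fake', 'notes': ''},\n",
    "    'b.example': {'type': 'bias', 'notes': 'made-up note'},\n",
    "})\n",
    "\n",
    "# With orient='index', each top-level key becomes a row label.\n",
    "toy_df = pd.read_json(io.StringIO(raw), orient='index')\n",
    "print(toy_df)\n",
    "```\n",
    "\n",
    "Each domain ends up in the index, which is why the next cell can simply name that index `domain`."
   ]
  },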
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df.index.name = 'domain'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's simplify this long column name into something that's short and sweet."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "replace_col = {'Source Notes (things to know?)' : 'notes'}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df.columns = [replace_col.get(c, c) for c in df.columns]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's also reorder the columns for readability."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df = df[['type', '2nd type', '3rd type', 'notes']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>type</th>\n",
       "      <th>2nd type</th>\n",
       "      <th>3rd type</th>\n",
       "      <th>notes</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>domain</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>100percentfedup.com</th>\n",
       "      <td>bias</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16wmpo.com</th>\n",
       "      <td>fake</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>http://www.politifact.com/punditfact/article/2...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21stcenturywire.com</th>\n",
       "      <td>conspiracy</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24newsflash.com</th>\n",
       "      <td>fake</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24wpn.com</th>\n",
       "      <td>fake</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>http://www.politifact.com/punditfact/article/2...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>365usanews.com</th>\n",
       "      <td>bias</td>\n",
       "      <td>conspiracy</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4threvolutionarywar.wordpress.com</th>\n",
       "      <td>bias</td>\n",
       "      <td>conspiracy</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>70news.wordpress.com</th>\n",
       "      <td>fake</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>82.221.129.208</th>\n",
       "      <td>conspiracy</td>\n",
       "      <td>fake</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Acting-Man.com</th>\n",
       "      <td>unreliable</td>\n",
       "      <td>conspiracy</td>\n",
       "      <td></td>\n",
       "      <td>publishes articles denying climate change</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                         type    2nd type 3rd type  \\\n",
       "domain                                                               \n",
       "100percentfedup.com                      bias                        \n",
       "16wmpo.com                               fake                        \n",
       "21stcenturywire.com                conspiracy                        \n",
       "24newsflash.com                          fake                        \n",
       "24wpn.com                                fake                        \n",
       "365usanews.com                           bias  conspiracy            \n",
       "4threvolutionarywar.wordpress.com        bias  conspiracy            \n",
       "70news.wordpress.com                     fake                        \n",
       "82.221.129.208                     conspiracy        fake            \n",
       "Acting-Man.com                     unreliable  conspiracy            \n",
       "\n",
       "                                                                               notes  \n",
       "domain                                                                                \n",
       "100percentfedup.com                                                                   \n",
       "16wmpo.com                         http://www.politifact.com/punditfact/article/2...  \n",
       "21stcenturywire.com                                                                   \n",
       "24newsflash.com                                                                       \n",
       "24wpn.com                          http://www.politifact.com/punditfact/article/2...  \n",
       "365usanews.com                                                                        \n",
       "4threvolutionarywar.wordpress.com                                                     \n",
       "70news.wordpress.com                                                                  \n",
       "82.221.129.208                                                                        \n",
       "Acting-Man.com                             publishes articles denying climate change  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data Processing - Making categories standard <a id='whoops'></a>\n",
    "If we look at all the available categories, we'll see some inconsistencies. The mapping below standardizes them:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "replace_vals = {\n",
    "    'fake news' : 'fake',\n",
    "    'satirical' : 'satire',\n",
    "    'unrealiable': 'unreliable',\n",
    "    'blog' : np.nan\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can group all our data preprocessing in one function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def clean_type(value):\n",
    "    '''\n",
    "    Clean a single type value.\n",
    "    \n",
    "    If the value is not null,\n",
    "    the value is cast to a string,\n",
    "    leading and trailing whitespace is removed,\n",
    "    cast to lower case,\n",
    "    and redundant values are replaced.\n",
    "    \n",
    "    Returns either None or a cleaned string.\n",
    "    '''\n",
    "    # NaN never equals itself, so test with pd.notnull rather than `!= np.nan`.\n",
    "    if value and pd.notnull(value):\n",
    "        value = str(value)\n",
    "        value = value.strip().lower()\n",
    "        value = replace_vals.get(value, value)\n",
    "        return value\n",
    "    else:\n",
    "        return None"
   ]
  },
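  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Missing values deserve care in a check like the one in `clean_type`: NaN never compares equal to anything, itself included, so `pd.isnull` and `pd.notnull` are the reliable tests. A minimal sketch (the `missing` name is only for illustration):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "missing = np.nan\n",
    "\n",
    "# An inequality test against NaN is always True, even for NaN itself.\n",
    "print(missing != np.nan)\n",
    "\n",
    "# The dedicated null checks tell the two cases apart.\n",
    "print(pd.isnull(missing))\n",
    "print(pd.isnull('fake'))\n",
    "```"
   ]
  },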
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df.fillna(value=0, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll now loop through each of the columns,<br>\n",
    "and run the `clean_type` function on all the values in each column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "for col in ['type', '2nd type', '3rd type']:\n",
    "    df[col] = df[col].apply(clean_type)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### One-Hot Encoding <a id='hot'></a>\n",
    "One-hot encoding turns a single categorical column into a sparse matrix with one indicator column per category.<br>\n",
    "To encode all three type columns consistently, we first collect every category that appears in any of them."
   ]
  },
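  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a toy illustration (the `toy` series below is made up and is not part of the OpenSources data), `pd.get_dummies` produces one indicator column per distinct category. This is only a sketch of the idea, using `dtype=int` so the indicators print as 0/1:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data, not taken from OpenSources.\n",
    "toy = pd.Series(['fake', 'bias', 'fake'],\n",
    "                index=['a.example', 'b.example', 'c.example'])\n",
    "\n",
    "# One indicator column per distinct category.\n",
    "toy_hot = pd.get_dummies(toy, dtype=int)\n",
    "print(toy_hot)\n",
    "```\n",
    "\n",
    "Each row has exactly one 1, in the column that matches its original category."
   ]
  },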
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "all_hot_encodings = pd.Series(pd.unique(df[['type', '2nd type', '3rd type']].values.ravel('K')))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0           bias\n",
       "1           fake\n",
       "2     conspiracy\n",
       "3     unreliable\n",
       "4        junksci\n",
       "5      political\n",
       "6           hate\n",
       "7      clickbait\n",
       "8         satire\n",
       "9          rumor\n",
       "10      reliable\n",
       "11         state\n",
       "12          None\n",
       "13           NaN\n",
       "dtype: object"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_hot_encodings"
   ]
  },
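  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Append every known category so all three frames share the same columns.\n",
    "# pd.concat replaces Series.append, which was removed in pandas 2.0.\n",
    "dum1 = pd.get_dummies(pd.concat([df['type'], all_hot_encodings]))\n",
    "dum2 = pd.get_dummies(pd.concat([df['2nd type'], all_hot_encodings]))\n",
    "dum3 = pd.get_dummies(pd.concat([df['3rd type'], all_hot_encodings]))"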
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take the element-wise max across the three one-hot encoded frames.<br>\n",
    "By doing so we can combine the information from all three columns into one dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "__d = dum1.where(dum1 > dum2, dum2)\n",
    "__d = __d.where(__d > dum3, dum3)"
   ]
  },
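  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two `where` calls above amount to an element-wise maximum. A minimal sketch with made-up frames (`a` and `b` are hypothetical stand-ins for `dum1` and `dum2`) shows that the maximum stays at 0 or 1 even when the same label appears in both inputs, whereas adding the frames does not:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical indicator frames; 'x.example' carries 'fake' in both.\n",
    "a = pd.DataFrame({'bias': [0, 1], 'fake': [1, 0]},\n",
    "                 index=['x.example', 'y.example'])\n",
    "b = pd.DataFrame({'bias': [0, 0], 'fake': [1, 1]},\n",
    "                 index=['x.example', 'y.example'])\n",
    "\n",
    "# Keep a where it is larger, otherwise take b: an element-wise maximum.\n",
    "combined = a.where(a > b, b)\n",
    "\n",
    "print(combined)  # every entry is still 0 or 1\n",
    "print(a + b)     # the duplicated 'fake' label for x.example adds up to 2\n",
    "```"
   ]
  },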
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Why not take the sum?\n",
    "Taking a sum is also an option, but I noticed that some rows repeat the same category in more than one column.<br> Summing would then produce one-hot encoded values of 2 or 3!\n",
    "\n",
    "Lastly, let's remove the rows we appended for the unique categorical values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "dummies = __d.iloc[:-len(all_hot_encodings)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have a wonderful new dataset!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>bias</th>\n",
       "      <th>clickbait</th>\n",
       "      <th>conspiracy</th>\n",
       "      <th>fake</th>\n",
       "      <th>hate</th>\n",
       "      <th>junksci</th>\n",
       "      <th>political</th>\n",
       "      <th>reliable</th>\n",
       "      <th>rumor</th>\n",
       "      <th>satire</th>\n",
       "      <th>state</th>\n",
       "      <th>unreliable</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>100percentfedup.com</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16wmpo.com</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21stcenturywire.com</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24newsflash.com</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24wpn.com</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>365usanews.com</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4threvolutionarywar.wordpress.com</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>70news.wordpress.com</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>82.221.129.208</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Acting-Man.com</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                   bias  clickbait  conspiracy  fake  hate  \\\n",
       "100percentfedup.com                   1          0           0     0     0   \n",
       "16wmpo.com                            0          0           0     1     0   \n",
       "21stcenturywire.com                   0          0           1     0     0   \n",
       "24newsflash.com                       0          0           0     1     0   \n",
       "24wpn.com                             0          0           0     1     0   \n",
       "365usanews.com                        1          0           1     0     0   \n",
       "4threvolutionarywar.wordpress.com     1          0           1     0     0   \n",
       "70news.wordpress.com                  0          0           0     1     0   \n",
       "82.221.129.208                        0          0           1     1     0   \n",
       "Acting-Man.com                        0          0           1     0     0   \n",
       "\n",
       "                                   junksci  political  reliable  rumor  \\\n",
       "100percentfedup.com                      0          0         0      0   \n",
       "16wmpo.com                               0          0         0      0   \n",
       "21stcenturywire.com                      0          0         0      0   \n",
       "24newsflash.com                          0          0         0      0   \n",
       "24wpn.com                                0          0         0      0   \n",
       "365usanews.com                           0          0         0      0   \n",
       "4threvolutionarywar.wordpress.com        0          0         0      0   \n",
       "70news.wordpress.com                     0          0         0      0   \n",
       "82.221.129.208                           0          0         0      0   \n",
       "Acting-Man.com                           0          0         0      0   \n",
       "\n",
       "                                   satire  state  unreliable  \n",
       "100percentfedup.com                     0      0           0  \n",
       "16wmpo.com                              0      0           0  \n",
       "21stcenturywire.com                     0      0           0  \n",
       "24newsflash.com                         0      0           0  \n",
       "24wpn.com                               0      0           0  \n",
       "365usanews.com                          0      0           0  \n",
       "4threvolutionarywar.wordpress.com       0      0           0  \n",
       "70news.wordpress.com                    0      0           0  \n",
       "82.221.129.208                          0      0           0  \n",
       "Acting-Man.com                          0      0           1  "
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dummies.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's add the notes to this new dataset by concatenating `dummies` with the `notes` column of `df` side by side (`axis=1`), so rows are matched on the domain index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df_news = pd.concat([dummies, df['notes']], axis=1)"
   ]
  },
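  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`pd.concat` with `axis=1`, as used above, places its inputs next to each other and lines rows up by index label. A minimal sketch with made-up data (`flags`, `notes` and `joined` are hypothetical names):\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical inputs that share the same index labels.\n",
    "flags = pd.DataFrame({'fake': [1, 0]}, index=['a.example', 'b.example'])\n",
    "notes = pd.Series(['made-up note', ''],\n",
    "                  index=['a.example', 'b.example'], name='notes')\n",
    "\n",
    "# axis=1 adds the series as a new column, matched on the index.\n",
    "joined = pd.concat([flags, notes], axis=1)\n",
    "print(joined)\n",
    "```"
   ]
  },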
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>bias</th>\n",
       "      <th>clickbait</th>\n",
       "      <th>conspiracy</th>\n",
       "      <th>fake</th>\n",
       "      <th>hate</th>\n",
       "      <th>junksci</th>\n",
       "      <th>political</th>\n",
       "      <th>reliable</th>\n",
       "      <th>rumor</th>\n",
       "      <th>satire</th>\n",
       "      <th>state</th>\n",
       "      <th>unreliable</th>\n",
       "      <th>notes</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>domain</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>100percentfedup.com</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16wmpo.com</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>http://www.politifact.com/punditfact/article/2...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21stcenturywire.com</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24newsflash.com</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24wpn.com</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>http://www.politifact.com/punditfact/article/2...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>365usanews.com</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4threvolutionarywar.wordpress.com</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>70news.wordpress.com</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>82.221.129.208</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Acting-Man.com</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>publishes articles denying climate change</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                   bias  clickbait  conspiracy  fake  hate  \\\n",
       "domain                                                                       \n",
       "100percentfedup.com                   1          0           0     0     0   \n",
       "16wmpo.com                            0          0           0     1     0   \n",
       "21stcenturywire.com                   0          0           1     0     0   \n",
       "24newsflash.com                       0          0           0     1     0   \n",
       "24wpn.com                             0          0           0     1     0   \n",
       "365usanews.com                        1          0           1     0     0   \n",
       "4threvolutionarywar.wordpress.com     1          0           1     0     0   \n",
       "70news.wordpress.com                  0          0           0     1     0   \n",
       "82.221.129.208                        0          0           1     1     0   \n",
       "Acting-Man.com                        0          0           1     0     0   \n",
       "\n",
       "                                   junksci  political  reliable  rumor  \\\n",
       "domain                                                                   \n",
       "100percentfedup.com                      0          0         0      0   \n",
       "16wmpo.com                               0          0         0      0   \n",
       "21stcenturywire.com                      0          0         0      0   \n",
       "24newsflash.com                          0          0         0      0   \n",
       "24wpn.com                                0          0         0      0   \n",
       "365usanews.com                           0          0         0      0   \n",
       "4threvolutionarywar.wordpress.com        0          0         0      0   \n",
       "70news.wordpress.com                     0          0         0      0   \n",
       "82.221.129.208                           0          0         0      0   \n",
       "Acting-Man.com                           0          0         0      0   \n",
       "\n",
       "                                   satire  state  unreliable  \\\n",
       "domain                                                         \n",
       "100percentfedup.com                     0      0           0   \n",
       "16wmpo.com                              0      0           0   \n",
       "21stcenturywire.com                     0      0           0   \n",
       "24newsflash.com                         0      0           0   \n",
       "24wpn.com                               0      0           0   \n",
       "365usanews.com                          0      0           0   \n",
       "4threvolutionarywar.wordpress.com       0      0           0   \n",
       "70news.wordpress.com                    0      0           0   \n",
       "82.221.129.208                          0      0           0   \n",
       "Acting-Man.com                          0      0           1   \n",
       "\n",
       "                                                                               notes  \n",
       "domain                                                                                \n",
       "100percentfedup.com                                                                   \n",
       "16wmpo.com                         http://www.politifact.com/punditfact/article/2...  \n",
       "21stcenturywire.com                                                                   \n",
       "24newsflash.com                                                                       \n",
       "24wpn.com                          http://www.politifact.com/punditfact/article/2...  \n",
       "365usanews.com                                                                        \n",
       "4threvolutionarywar.wordpress.com                                                     \n",
       "70news.wordpress.com                                                                  \n",
       "82.221.129.208                                                                        \n",
       "Acting-Man.com                             publishes articles denying climate change  "
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_news.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With one-hot encoding, it is fast and easy to filter the OpenSources dataset for domains that are considered fake news."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['16wmpo.com', '24newsflash.com', '24wpn.com', '70news.wordpress.com',\n",
       "       '82.221.129.208', 'Amposts.com', 'BB4SP.com', 'DIYhours.net',\n",
       "       'DeadlyClear.wordpress.com', 'DonaldTrumpPotus45.com',\n",
       "       ...\n",
       "       'washingtonpost.com.co', 'webdaily.com', 'weeklyworldnews.com',\n",
       "       'worldpoliticsnow.com', 'worldpoliticsus.com', 'worldrumor.com',\n",
       "       'worldstoriestoday.com', 'wtoe5news.com', 'yesimright.com',\n",
       "       'yourfunpage.com'],\n",
       "      dtype='object', name='domain', length=271)"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_news[df_news['fake'] == 1].index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can count how many domains were categorized as conspiracy theory sites."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "201"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_news['conspiracy'].sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can look at a sample of sites that use the `.org` top-level domain."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>bias</th>\n",
       "      <th>clickbait</th>\n",
       "      <th>conspiracy</th>\n",
       "      <th>fake</th>\n",
       "      <th>hate</th>\n",
       "      <th>junksci</th>\n",
       "      <th>political</th>\n",
       "      <th>reliable</th>\n",
       "      <th>rumor</th>\n",
       "      <th>satire</th>\n",
       "      <th>state</th>\n",
       "      <th>unreliable</th>\n",
       "      <th>notes</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>domain</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>heartland.org</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>http://www.sourcewatch.org/index.php/Heartland...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ExperimentalVaccines.org</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>heritage.org</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>breakpoint.org</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>bigbluevision.org</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>witscience.org</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>freedomworks.org</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>adflegal.org/media</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>https://www.splcenter.org/fighting-hate/extrem...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>moonofalabama.org</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>thefreepatriot.org</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                          bias  clickbait  conspiracy  fake  hate  junksci  \\\n",
       "domain                                                                       \n",
       "heartland.org                1          0           0     0     0        0   \n",
       "ExperimentalVaccines.org     0          0           1     0     0        1   \n",
       "heritage.org                 0          0           0     0     0        0   \n",
       "breakpoint.org               0          0           0     0     0        0   \n",
       "bigbluevision.org            1          1           0     0     0        0   \n",
       "witscience.org               0          0           0     0     0        0   \n",
       "freedomworks.org             0          0           0     0     0        0   \n",
       "adflegal.org/media           0          0           0     0     1        0   \n",
       "moonofalabama.org            1          0           0     0     0        0   \n",
       "thefreepatriot.org           1          1           0     1     0        0   \n",
       "\n",
       "                          political  reliable  rumor  satire  state  \\\n",
       "domain                                                                \n",
       "heartland.org                     0         0      0       0      0   \n",
       "ExperimentalVaccines.org          0         0      0       0      0   \n",
       "heritage.org                      1         0      0       0      0   \n",
       "breakpoint.org                    0         0      0       0      0   \n",
       "bigbluevision.org                 0         0      0       0      0   \n",
       "witscience.org                    0         0      0       1      0   \n",
       "freedomworks.org                  1         0      0       0      0   \n",
       "adflegal.org/media                0         0      0       0      0   \n",
       "moonofalabama.org                 0         0      0       0      0   \n",
       "thefreepatriot.org                0         0      0       0      0   \n",
       "\n",
       "                          unreliable  \\\n",
       "domain                                 \n",
       "heartland.org                      0   \n",
       "ExperimentalVaccines.org           0   \n",
       "heritage.org                       0   \n",
       "breakpoint.org                     1   \n",
       "bigbluevision.org                  0   \n",
       "witscience.org                     0   \n",
       "freedomworks.org                   0   \n",
       "adflegal.org/media                 0   \n",
       "moonofalabama.org                  0   \n",
       "thefreepatriot.org                 0   \n",
       "\n",
       "                                                                      notes  \n",
       "domain                                                                       \n",
       "heartland.org             http://www.sourcewatch.org/index.php/Heartland...  \n",
       "ExperimentalVaccines.org                                                     \n",
       "heritage.org                                                                 \n",
       "breakpoint.org                                                               \n",
       "bigbluevision.org                                                            \n",
       "witscience.org                                                               \n",
       "freedomworks.org                                                             \n",
       "adflegal.org/media        https://www.splcenter.org/fighting-hate/extrem...  \n",
       "moonofalabama.org                                                            \n",
       "thefreepatriot.org                                                           "
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_news[df_news.index.str.contains('.org', regex=False)].sample(10, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Some Last Clean-ups <a id='clean-up'></a>\n",
    "One domain includes a path, \"/media\". Is the rest of the site OK?\n",
    "\n",
    "Let's clean up the domain names a bit...\n",
    "1. remove \"www.\"\n",
    "2. remove subsites like \"/media\"\n",
    "3. cast to lower case"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def preprocess_domains(value):\n",
    "    '''\n",
    "    Removes subsites from domains by splitting on forward slashes,\n",
    "    removes a leading www. from domains, and\n",
    "    returns a lowercase cleaned-up domain.\n",
    "    '''\n",
    "    value = value.split('/')[0].lower()\n",
    "    if value.startswith('www.'):\n",
    "        value = value[len('www.'):]\n",
    "    return value"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the index is a pandas `Index`, which has no `apply` method (that only works on Series or DataFrames), we can use `map` or a list comprehension to apply the `preprocess_domains` function to each element in the index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df_news.index = df_news.index.map(preprocess_domains)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use pandas `to_csv` to write the cleaned-up DataFrame to a tab-separated values (TSV) file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df_news.to_csv('data/sources_clean.tsv', sep='\\t')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sources.csv       sources.json      sources_clean.tsv\r\n"
     ]
    }
   ],
   "source": [
    "!ls data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "OpenSources is a great resource for research and technology.<br>\n",
    "If you are aware of other projects that have categorized the online news ecosystem, I'd love to hear about them.\n",
    "\n",
    "Let's recap what we've covered:\n",
    "1. How to download data from the web using bash commands\n",
    "2. How to search and explore Pandas DataFrames\n",
    "3. How to preprocess messy real world data, twice!\n",
    "4. How to one-hot encode a categorical dataset\n",
    "\n",
    "In the next notebook, we'll use this new dataset to analyze links shared on Twitter.\n",
    "We can begin to build a profile of how sites categorized by OpenSources are used during viral campaigns.\n",
    "\n",
    "### Thank yous:\n",
    "Rishab and Robyn from D&S.<br>\n",
    "Also my friend and colleague Andrew Guess, who introduced me to links as data.\n",
    "\n",
    "### About the Author:\n",
    "Leon Yin is an engineer and scientist at NYU's Social Media and Political Participation Lab and the Center for Data Science.<br> He is interested in using images and links as data, and finding odd applications for cutting-edge machine learning techniques."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}