{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Clustering with News Publications\n",
    "Author: Matthew Huh\n",
    "    \n",
    "## Overview\n",
    "\n",
    "For the most part, people are free to choose what news outlets they read and follow. In the United States, there is a near-endless list of sites that people can choose from in order to get their daily news and over time, they develop preferences for sites that they are more attached to, and do their best to avoid. Now these affinities are developed through a combination of means ranging from affiliations, vocabulary, prose, and so forth.\n",
    "\n",
    "What I would like to examine in this project is if it is possible to differentiate from several different publications with their respective perks / quirks. \n",
    "\n",
    "## About the Data\n",
    "\n",
    "This dataset was obtained from Kaggle, and contains a collection of 142,570 articles from 15 different publications.\n",
    "\n",
    "The publications within this dataset are\n",
    "1. CNN\n",
    "2. Breitbart\n",
    "3. Vox\n",
    "4. Washington Post\n",
    "5. New York Post\n",
    "6. National Review\n",
    "7. NPR\n",
    "8. Guardian\n",
    "9. Talking Points Memo\n",
    "10. Atlantic\n",
    "11. Reuters\n",
    "12. Fox News\n",
    "13. Business Insider\n",
    "14. Buzzfeed News\n",
    "15. New York Times\n",
    "\n",
    "## Research Question\n",
    "\n",
    "As this is an unsupervised learning project first and foremost, the project will have 3 goals.\n",
    "\n",
    "1. The first goal is to prepare the articles in the dataset for modelling using various Natural Language Processing (NLP) methods to re-represent the data in numbers rather than words\n",
    "2. Cluster the data to determine if we can identify the articles and associate them as different groups.\n",
    "3. Determine if we can predict the structure of the article based on the publisher.\n",
    "\n",
    "## Packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Basic imports\n",
    "import os\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import scipy\n",
    "import sklearn\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "%matplotlib inline\n",
    "\n",
    "# Machine Learning packages\n",
    "from sklearn.feature_selection import SelectKBest, f_classif\n",
    "from sklearn.feature_selection import chi2\n",
    "from sklearn.preprocessing import normalize\n",
    "from sklearn import ensemble\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "# Clustering packages\n",
    "import sklearn.cluster as cluster\n",
    "from sklearn.cluster import KMeans\n",
    "from sklearn.cluster import MeanShift, estimate_bandwidth\n",
    "from sklearn.cluster import SpectralClustering\n",
    "from sklearn.cluster import AffinityPropagation\n",
    "from scipy.spatial.distance import cdist\n",
    "\n",
    "# Natural Language processing\n",
    "import re\n",
    "import spacy\n",
    "import nltk\n",
    "from nltk.corpus import stopwords\n",
    "from nltk.stem import WordNetLemmatizer\n",
    "from collections import Counter\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.datasets import fetch_rcv1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Preview\n",
    "\n",
    "The first matter of business is to import the articles from a local directory and merge them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unnamed: 0</th>\n",
       "      <th>id</th>\n",
       "      <th>title</th>\n",
       "      <th>publication</th>\n",
       "      <th>author</th>\n",
       "      <th>date</th>\n",
       "      <th>year</th>\n",
       "      <th>month</th>\n",
       "      <th>url</th>\n",
       "      <th>content</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>17283</td>\n",
       "      <td>House Republicans Fret About Winning Their Hea...</td>\n",
       "      <td>New York Times</td>\n",
       "      <td>Carl Hulse</td>\n",
       "      <td>2016-12-31</td>\n",
       "      <td>2016.0</td>\n",
       "      <td>12.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>WASHINGTON  —   Congressional Republicans have...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>17284</td>\n",
       "      <td>Rift Between Officers and Residents as Killing...</td>\n",
       "      <td>New York Times</td>\n",
       "      <td>Benjamin Mueller and Al Baker</td>\n",
       "      <td>2017-06-19</td>\n",
       "      <td>2017.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>After the bullet shells get counted, the blood...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>17285</td>\n",
       "      <td>Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...</td>\n",
       "      <td>New York Times</td>\n",
       "      <td>Margalit Fox</td>\n",
       "      <td>2017-01-06</td>\n",
       "      <td>2017.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>When Walt Disney’s “Bambi” opened in 1942, cri...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>17286</td>\n",
       "      <td>Among Deaths in 2016, a Heavy Toll in Pop Musi...</td>\n",
       "      <td>New York Times</td>\n",
       "      <td>William McDonald</td>\n",
       "      <td>2017-04-10</td>\n",
       "      <td>2017.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Death may be the great equalizer, but it isn’t...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>17287</td>\n",
       "      <td>Kim Jong-un Says North Korea Is Preparing to T...</td>\n",
       "      <td>New York Times</td>\n",
       "      <td>Choe Sang-Hun</td>\n",
       "      <td>2017-01-02</td>\n",
       "      <td>2017.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>SEOUL, South Korea  —   North Korea’s leader, ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Unnamed: 0     id                                              title  \\\n",
       "0           0  17283  House Republicans Fret About Winning Their Hea...   \n",
       "1           1  17284  Rift Between Officers and Residents as Killing...   \n",
       "2           2  17285  Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...   \n",
       "3           3  17286  Among Deaths in 2016, a Heavy Toll in Pop Musi...   \n",
       "4           4  17287  Kim Jong-un Says North Korea Is Preparing to T...   \n",
       "\n",
       "      publication                         author        date    year  month  \\\n",
       "0  New York Times                     Carl Hulse  2016-12-31  2016.0   12.0   \n",
       "1  New York Times  Benjamin Mueller and Al Baker  2017-06-19  2017.0    6.0   \n",
       "2  New York Times                   Margalit Fox  2017-01-06  2017.0    1.0   \n",
       "3  New York Times               William McDonald  2017-04-10  2017.0    4.0   \n",
       "4  New York Times                  Choe Sang-Hun  2017-01-02  2017.0    1.0   \n",
       "\n",
       "   url                                            content  \n",
       "0  NaN  WASHINGTON  —   Congressional Republicans have...  \n",
       "1  NaN  After the bullet shells get counted, the blood...  \n",
       "2  NaN  When Walt Disney’s “Bambi” opened in 1942, cri...  \n",
       "3  NaN  Death may be the great equalizer, but it isn’t...  \n",
       "4  NaN  SEOUL, South Korea  —   North Korea’s leader, ...  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Create list of files from directory\n",
    "filelist = os.listdir('articles')\n",
    "\n",
    "# Import the files\n",
    "df_list = [pd.read_csv(file) for file in filelist]\n",
    "\n",
    "#concatenate them together\n",
    "articles = pd.concat(df_list)\n",
    "\n",
    "# Preview the data\n",
    "articles.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(142570, 10)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print the size of the dataset\n",
    "articles.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So we have 142,570 articles in the dataset but unfortunately, NLP is quite memory intensive, so we will have to sample the dataset unless you happen to have over 120 GB of memory on your local device. Using a 10% sample still leaves us with 140,000 articles and will be used for the duration of this project."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sample the dataset for optimal performance\n",
    "articles = articles.sample(frac=0.1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['Washington Post', 'CNN', 'New York Post', 'Buzzfeed News', 'NPR',\n",
       "       'Guardian', 'Breitbart', 'Atlantic', 'Business Insider',\n",
       "       'National Review', 'New York Times', 'Talking Points Memo',\n",
       "       'Reuters', 'Vox', 'Fox News'], dtype=object)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print out unique publisher names\n",
    "articles.publication.unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "title          14251\n",
       "publication       15\n",
       "author          3957\n",
       "date            1056\n",
       "url             8605\n",
       "content        14246\n",
       "dtype: int64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Describe unique occurences for each categorical variable\n",
    "articles.select_dtypes(include=['object']).nunique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are also other ways to trim down the dataset before processing. We aren't particularly interested in examining the dates for this research question, but it may be of interest in another. Let's check to see how many articles each author wrote; it may not be very useful to examine authors that are only responsible for a single article, as different authors from the same publisher may choose compose their works differently."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Drop variables that have no impact on the outcome\n",
    "articles = articles[['title', 'publication', 'author', 'content']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "author\n",
       "Breitbart News                                                    144\n",
       "Associated Press                                                  130\n",
       "Pam Key                                                           120\n",
       "Charlie Spiering                                                   94\n",
       "Daniel Nussbaum                                                    86\n",
       "Jerome Hudson                                                      85\n",
       "AWR Hawkins                                                        84\n",
       "John Hayward                                                       74\n",
       "Warner Todd Huston                                                 66\n",
       "Joel B. Pollak                                                     63\n",
       "Camila Domonoske                                                   63\n",
       "Breitbart London                                                   60\n",
       "Post Editorial Board                                               56\n",
       "Ian Hanchett                                                       55\n",
       "Trent Baker                                                        52\n",
       "Reuters                                                            52\n",
       "Alex Swoyer                                                        49\n",
       "David French                                                       46\n",
       "NPR Staff                                                          43\n",
       "Bob Price                                                          42\n",
       "Charlie Nash                                                       41\n",
       "Jeff Poor                                                          41\n",
       "David A. Graham                                                    39\n",
       "German Lopez                                                       38\n",
       "Breitbart Jerusalem                                                38\n",
       "Ben Kew                                                            38\n",
       "Keith J. Kelly                                                     36\n",
       "Bill Chappell                                                      35\n",
       "Esme Cribb                                                         35\n",
       "Katherine Rodriguez                                                34\n",
       "                                                                 ... \n",
       "Larry Celona, Tina Moore and Laura Italiano                         1\n",
       "Larry Celona, Tina Moore and Kenneth Garger                         1\n",
       "Larry Celona, Tina Moore and Jazmin Rosa                            1\n",
       "Larry Celona, Shawn Cohen and Natalie Musumeci                      1\n",
       "Larry Celona, Jennifer Bain, Rebecca Rosenberg and Shawn Cohen      1\n",
       "Larry Celona, Jamie Schram and Emily Saul                           1\n",
       "Lauren Russell                                                      1\n",
       "Lauren Sommer                                                       1\n",
       "Lauren Windle, The Sun                                              1\n",
       "Laurence Blair                                                      1\n",
       "Lesley Wroughton and Yeganeh Torbati                                1\n",
       "Lesley McClurg                                                      1\n",
       "Lenore Skenazy                                                      1\n",
       "Lela Moore and Sona Patel                                           1\n",
       "Lela Moore and Lindsey Underwood                                    1\n",
       "Leika Kihara and Tetsushi Kajimoto                                  1\n",
       "Leigh Alexander                                                     1\n",
       "Lee Liberman Otis                                                   1\n",
       "Lee Glendinning                                                     1\n",
       "Leanna Garfield                                                     1\n",
       "Lawrence Torcello                                                   1\n",
       "Lawrence Summers                                                    1\n",
       "Lawrence Ostlere                                                    1\n",
       "Lawrence K. Altman, M.d.                                            1\n",
       "Lawrence Hurley and Valerie Volcovici                               1\n",
       "Lawrence Hurley and Richard Cowan                                   1\n",
       "Lawrence Hurley and Andrew Chung                                    1\n",
       "Laurie Goodstein                                                    1\n",
       "Laurie Goering                                                      1\n",
       " Faith Karimi                                                       1\n",
       "Length: 3957, dtype: int64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# View most frequently occurring authors\n",
    "articles.groupby(['author']).size().sort_values(ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Well, that partly explains how there are so many authors in this dataset. It seems as though there are over 15,000 authors, and many of them have only published one article, or have co-written multiple articles with other authors. This complicates the problem, so in order to best represent each author's writing style, let's see what happens if we simply remove all authors that only published one article as is."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<script>requirejs.config({paths: { 'plotly': ['https://cdn.plot.ly/plotly-latest.min']},});if(!window.Plotly) {{require(['plotly'],function(plotly) {window.Plotly=plotly;});}}</script>"
      ],
      "text/vnd.plotly.v1+html": [
       "<script>requirejs.config({paths: { 'plotly': ['https://cdn.plot.ly/plotly-latest.min']},});if(!window.Plotly) {{require(['plotly'],function(plotly) {window.Plotly=plotly;});}}</script>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.plotly.v1+json": {
       "data": [
        {
         "labels": [
          "Washington Post",
          "CNN",
          "New York Post",
          "Buzzfeed News",
          "NPR",
          "Guardian",
          "Breitbart",
          "Atlantic",
          "Business Insider",
          "National Review",
          "New York Times",
          "Talking Points Memo",
          "Reuters",
          "Vox",
          "Fox News"
         ],
         "type": "pie",
         "values": [
          2312,
          1809,
          1168,
          1167,
          1119,
          1104,
          838,
          816,
          731,
          672,
          599,
          502,
          485,
          482,
          453
         ]
        }
       ],
       "layout": {
        "autosize": false,
        "height": 600,
        "title": "Articles by Publication",
        "width": 800
       }
      },
      "text/html": [
       "<div id=\"790a5a9a-9779-49f8-afd9-da7f54d1e035\" style=\"height: 600px; width: 800px;\" class=\"plotly-graph-div\"></div><script type=\"text/javascript\">require([\"plotly\"], function(Plotly) { window.PLOTLYENV=window.PLOTLYENV || {};window.PLOTLYENV.BASE_URL=\"https://plot.ly\";Plotly.newPlot(\"790a5a9a-9779-49f8-afd9-da7f54d1e035\", [{\"type\": \"pie\", \"labels\": [\"Washington Post\", \"CNN\", \"New York Post\", \"Buzzfeed News\", \"NPR\", \"Guardian\", \"Breitbart\", \"Atlantic\", \"Business Insider\", \"National Review\", \"New York Times\", \"Talking Points Memo\", \"Reuters\", \"Vox\", \"Fox News\"], \"values\": [2312, 1809, 1168, 1167, 1119, 1104, 838, 816, 731, 672, 599, 502, 485, 482, 453]}], {\"title\": \"Articles by Publication\", \"height\": 600, \"width\": 800, \"autosize\": false}, {\"showLink\": true, \"linkText\": \"Export to plot.ly\"})});</script>"
      ],
      "text/vnd.plotly.v1+html": [
       "<div id=\"790a5a9a-9779-49f8-afd9-da7f54d1e035\" style=\"height: 600px; width: 800px;\" class=\"plotly-graph-div\"></div><script type=\"text/javascript\">require([\"plotly\"], function(Plotly) { window.PLOTLYENV=window.PLOTLYENV || {};window.PLOTLYENV.BASE_URL=\"https://plot.ly\";Plotly.newPlot(\"790a5a9a-9779-49f8-afd9-da7f54d1e035\", [{\"type\": \"pie\", \"labels\": [\"Washington Post\", \"CNN\", \"New York Post\", \"Buzzfeed News\", \"NPR\", \"Guardian\", \"Breitbart\", \"Atlantic\", \"Business Insider\", \"National Review\", \"New York Times\", \"Talking Points Memo\", \"Reuters\", \"Vox\", \"Fox News\"], \"values\": [2312, 1809, 1168, 1167, 1119, 1104, 838, 816, 731, 672, 599, 502, 485, 482, 453]}], {\"title\": \"Articles by Publication\", \"height\": 600, \"width\": 800, \"autosize\": false}, {\"showLink\": true, \"linkText\": \"Export to plot.ly\"})});</script>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Plotly packages\n",
    "import plotly as py\n",
    "import plotly.graph_objs as go\n",
    "from plotly import tools\n",
    "import cufflinks as cf\n",
    "import ipywidgets as widgets\n",
    "from scipy import special\n",
    "py.offline.init_notebook_mode(connected=True)\n",
    "\n",
    "# Pass in values for our pie chart\n",
    "trace = go.Pie(labels=articles['publication'].unique(), values = articles['publication'].value_counts())\n",
    "\n",
    "# Create the layout\n",
    "layout = go.Layout(\n",
    "    title = 'Articles by Publication',\n",
    "    height = 600,\n",
    "    width = 800,\n",
    "    autosize = False\n",
    ")\n",
    "\n",
    "# Construct the chart\n",
    "fig = go.Figure(data = [trace], layout = layout)\n",
    "py.offline.iplot(fig, filename ='cufflinks/simple')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Selection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Drop author from the dataframe if they wrote less than 5 articles\n",
    "vc = articles['author'].value_counts()\n",
    "u  = [i not in set(vc[vc<=4].index) for i in articles['author']]\n",
    "articles = articles[u]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "title          9324\n",
       "publication      15\n",
       "author          608\n",
       "content        9318\n",
       "dtype: int64"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Reprint how many unique authors there are\n",
    "articles.select_dtypes(include=['object']).nunique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(9326, 4)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# View number of articles after feature selection\n",
    "articles.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So after removing authors that composed fewer than 5 articles, we are left with 9k articles, or 67% of the data, and roughly 600/3900 of the authors. Now, we can create a better representation of each author since each author has at least 5 articles to evaluate from.\n",
    "\n",
    "## Text Cleaning\n",
    "\n",
    "Now that we've chosen which articles to use, it's time to clean them up and prepare them for feature engineering. What this section covers is the removal of annoying punctuation from the content, and reducing words to their lemmas to reduce the number of words that we are examining. Finally, we'll divide the articles into training and testing sets and separate our predictor, the words in the content, and the target, the publisher."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "def text_cleaner(text):\n",
    "    # Visual inspection identifies a form of punctuation spaCy does not\n",
    "    # recognize: the double dash '--'.  Better get rid of it now!\n",
    "    text = re.sub(r'--',' ',text)\n",
    "    text = re.sub(\"[\\[].*?[\\]]\", \"\", text)\n",
    "    text = ' '.join(text.split())\n",
    "    return text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>publication</th>\n",
       "      <th>author</th>\n",
       "      <th>content</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>34296</th>\n",
       "      <td>Lawmakers to Trump: Turn over transcript of me...</td>\n",
       "      <td>Washington Post</td>\n",
       "      <td>Elise Viebeck</td>\n",
       "      <td>A growing number of Republican and Democratic ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39572</th>\n",
       "      <td>Man charged with attacking Uber driver sues dr...</td>\n",
       "      <td>New York Post</td>\n",
       "      <td>Associated Press</td>\n",
       "      <td>COSTA MESA, Calif. — A Southern California man...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5253</th>\n",
       "      <td>Democratic Response To Trump’s Address To Cong...</td>\n",
       "      <td>NPR</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Following President Trump’s address to a joint...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48670</th>\n",
       "      <td>Trump tries to salvage travel ban amid numerou...</td>\n",
       "      <td>Guardian</td>\n",
       "      <td>Ben Jacobs</td>\n",
       "      <td>Donald Trump scrambled to salvage his controve...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32448</th>\n",
       "      <td>Influential conservative group: Trump, DeVos s...</td>\n",
       "      <td>Washington Post</td>\n",
       "      <td>Emma Brown</td>\n",
       "      <td>A policy manifesto from an influential conserv...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                   title      publication  \\\n",
       "34296  Lawmakers to Trump: Turn over transcript of me...  Washington Post   \n",
       "39572  Man charged with attacking Uber driver sues dr...    New York Post   \n",
       "5253   Democratic Response To Trump’s Address To Cong...              NPR   \n",
       "48670  Trump tries to salvage travel ban amid numerou...         Guardian   \n",
       "32448  Influential conservative group: Trump, DeVos s...  Washington Post   \n",
       "\n",
       "                 author                                            content  \n",
       "34296     Elise Viebeck  A growing number of Republican and Democratic ...  \n",
       "39572  Associated Press  COSTA MESA, Calif. — A Southern California man...  \n",
       "5253                NaN  Following President Trump’s address to a joint...  \n",
       "48670        Ben Jacobs  Donald Trump scrambled to salvage his controve...  \n",
       "32448        Emma Brown  A policy manifesto from an influential conserv...  "
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Remove annoying punctuation from the articles\n",
    "articles['content'] = articles.content.map(lambda x: text_cleaner(str(x)))\n",
    "articles.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "lemmatizer = WordNetLemmatizer()\n",
    "\n",
    "# Reduce all text to their lemmas\n",
    "for article in articles['content']:\n",
    "    article = lemmatizer.lemmatize(article)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Identify predictor and target variables\n",
    "X = articles['content']\n",
    "y = articles['publication']\n",
    "\n",
    "# Create training and testing sets\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tf-idf Vectorization\n",
    "\n",
    "The first types of features that we are going to add are the most useful words in our dataset. Now how are we going to determine which words are deemed the most \"useful\"? With TF-IDF vectorizer, of course.\n",
    "\n",
    "TF tracks the term frequency, or how often each word appears in all articles of text, while idf (or Inverse Document Frequency) is a value that places less weight on variables that occur too often and lose their predictive power. Put together, it's a tool that allows us to assign an importance value to each word in the entire dataset based on frequency in each row and throughout the database.\n",
    "\n",
    "These are the parameters that will be used for TF-IDF\n",
    "1. All words that appear in over half of the articles will be thrown out of the dataframe\n",
    "2. Only words that occur more than 5 times will be tracked\n",
    "3. Only the top 150 features (words) will be kept\n",
    "4. Stop words will be ignored (like, as, the)\n",
    "5. Cases will be ignored\n",
    "6. Shorter and longer articles will be treated equally\n",
    "7. Add 1 to document frequency in case we have to divide by 0"
   ]
  },
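  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the weighting above concrete, here is a sketch of what the vectorizer computes with these settings, following the formula scikit-learn documents for `smooth_idf=True`. The symbols $n$ (number of articles), $\\mathrm{df}(t)$ (number of articles containing term $t$) and $\\mathrm{tf}(t, d)$ (count of $t$ in article $d$) are notation introduced here for illustration.\n",
    "\n",
    "$$\\mathrm{idf}(t) = \\ln\\frac{1 + n}{1 + \\mathrm{df}(t)} + 1, \\qquad w(t, d) = \\mathrm{tf}(t, d) \\cdot \\mathrm{idf}(t)$$\n",
    "\n",
    "Each article's weight vector is then divided by its Euclidean length (`norm='l2'`), so that long and short articles end up on the same scale.\n",
    "\n",
    "As a toy example with $n = 4$ articles: a term found in only one article gets $\\mathrm{idf} = \\ln(5/2) + 1 \\approx 1.92$, while a term found in all four would get $\\mathrm{idf} = \\ln(5/5) + 1 = 1$, the lowest possible value."
   ]
  },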
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of features: 150\n"
     ]
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "# Parameters for TF-idf vectorizer\n",
    "vectorizer = TfidfVectorizer(max_df=0.5,\n",
    "                             min_df=5, \n",
    "                             max_features=150, \n",
    "                             stop_words='english', \n",
    "                             lowercase=True, \n",
    "                             use_idf=True,\n",
    "                             norm=u'l2',\n",
    "                             smooth_idf=True\n",
    "                            )\n",
    "\n",
    "#Applying the vectorizer\n",
    "X_tfidf=vectorizer.fit_transform(X)\n",
    "print(\"Number of features: %d\" % X_tfidf.get_shape()[1])\n",
    "\n",
    "#splitting into training and test sets\n",
    "X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.25, random_state=42)\n",
    "\n",
    "# Ensure the matrix is in CSR format for efficient row indexing\n",
    "X_train_tfidf_csr = X_train_tfidf.tocsr()\n",
    "\n",
    "# Number of articles in the training set\n",
    "n = X_train_tfidf_csr.shape[0]\n",
    "\n",
    "# A list of dictionaries, one per article\n",
    "tfidf_bypara = [{} for _ in range(0,n)]\n",
    "\n",
    "# List of feature names (on scikit-learn < 1.0, use get_feature_names() instead)\n",
    "terms = vectorizer.get_feature_names_out()\n",
    "\n",
    "# For each article, record the feature words and their tf-idf scores\n",
    "for i, j in zip(*X_train_tfidf_csr.nonzero()):\n",
    "    tfidf_bypara[i][terms[j]] = X_train_tfidf_csr[i, j]\n",
    "\n",
    "# Normalize the dataset    \n",
    "X_norm = normalize(X_train_tfidf)\n",
    "\n",
    "# Convert from tf-idf matrix to dataframe\n",
    "X_normal  = pd.DataFrame(data=X_norm.toarray())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Phrase count with spacy\n",
    "\n",
    "The second set of features counts how often each article uses each part of speech (adverbs, verbs, nouns, and adjectives), along with the article length in tokens."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Instantiating spaCy (the 'en' shortcut was removed in spaCy 3, so load the model by name)\n",
    "nlp = spacy.load('en_core_web_sm')\n",
    "X_train_words = []\n",
    "\n",
    "for row in X_train:\n",
    "    # Processing each article for tokens\n",
    "    row_doc = nlp(row)\n",
    "    # Article length in tokens\n",
    "    sent_len = len(row_doc) \n",
    "    # Initializing counts of different parts of speech\n",
    "    advs = 0\n",
    "    verb = 0\n",
    "    noun = 0\n",
    "    adj = 0\n",
    "    for token in row_doc:\n",
    "        # Identifying each part of speech and adding to counts\n",
    "        if token.pos_ == 'ADV':\n",
    "            advs +=1\n",
    "        elif token.pos_ == 'VERB':\n",
    "            verb +=1\n",
    "        elif token.pos_ == 'NOUN':\n",
    "            noun +=1\n",
    "        elif token.pos_ == 'ADJ':\n",
    "            adj +=1\n",
    "    # Creating a list of all features for each article\n",
    "    X_train_words.append([row_doc, advs, verb, noun, adj, sent_len])\n",
    "\n",
    "# Create dataframe with count of adverbs, verbs, nouns, and adjectives\n",
    "X_count = pd.DataFrame(data=X_train_words, columns=['BOW', 'ADV', 'VERB', 'NOUN', 'ADJ', 'sent_length'])\n",
    "\n",
    "# Change token count to token percentage\n",
    "for column in X_count.columns[1:5]:\n",
    "    X_count[column] = X_count[column] / X_count['sent_length']\n",
    "\n",
    "# Normalize X_count\n",
    "X_counter = normalize(X_count.drop('BOW',axis=1))\n",
    "X_counter  = pd.DataFrame(data=X_counter)"
   ]
  },
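  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One detail of the cell above is worth a closer look: `normalize` rescales each row to unit L2 length, and the raw token count is far larger than the part-of-speech shares. The sketch below uses hypothetical rows (`toy_counts` is made up for illustration) to show the effect, and contrasts it with column-wise standardization as one possible alternative. This is an assumption about what might work better, not something the notebook evaluates.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.preprocessing import normalize, StandardScaler\n",
    "\n",
    "# Hypothetical rows: [ADV share, VERB share, NOUN share, ADJ share, token count]\n",
    "toy_counts = np.array([[0.05, 0.15, 0.20, 0.08, 400.0],\n",
    "                       [0.03, 0.12, 0.25, 0.06, 1200.0],\n",
    "                       [0.07, 0.18, 0.15, 0.10, 250.0]])\n",
    "\n",
    "# Row-wise L2 normalization: the token count dominates every row,\n",
    "# so the last column ends up close to 1 and the shares close to 0\n",
    "row_normalized = normalize(toy_counts)\n",
    "\n",
    "# Possible alternative: standardize each column to zero mean and unit variance\n",
    "col_scaled = StandardScaler().fit_transform(toy_counts)\n",
    "```\n",
    "\n",
    "With row-wise normalization the length column carries almost all of each row's magnitude, which would explain a column of values near 1 in the combined feature table."
   ]
  },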
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>...</th>\n",
       "      <th>140</th>\n",
       "      <th>141</th>\n",
       "      <th>142</th>\n",
       "      <th>143</th>\n",
       "      <th>144</th>\n",
       "      <th>145</th>\n",
       "      <th>146</th>\n",
       "      <th>147</th>\n",
       "      <th>148</th>\n",
       "      <th>149</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.000328</td>\n",
       "      <td>0.000705</td>\n",
       "      <td>0.000623</td>\n",
       "      <td>0.000393</td>\n",
       "      <td>0.999999</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.197798</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.334632</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.000110</td>\n",
       "      <td>0.000327</td>\n",
       "      <td>0.000275</td>\n",
       "      <td>0.000125</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.059944</td>\n",
       "      <td>0.062871</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.048974</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.452318</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.000582</td>\n",
       "      <td>0.001136</td>\n",
       "      <td>0.000554</td>\n",
       "      <td>0.000360</td>\n",
       "      <td>0.999999</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.000137</td>\n",
       "      <td>0.000630</td>\n",
       "      <td>0.000425</td>\n",
       "      <td>0.000182</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.000128</td>\n",
       "      <td>0.000548</td>\n",
       "      <td>0.000411</td>\n",
       "      <td>0.000183</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.118669</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.05139</td>\n",
       "      <td>0.674409</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 155 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        0         1         2         3         4         0         1    2    \\\n",
       "0  0.000328  0.000705  0.000623  0.000393  0.999999  0.000000  0.000000  0.0   \n",
       "1  0.000110  0.000327  0.000275  0.000125  1.000000  0.059944  0.062871  0.0   \n",
       "2  0.000582  0.001136  0.000554  0.000360  0.999999  0.000000  0.000000  0.0   \n",
       "3  0.000137  0.000630  0.000425  0.000182  1.000000  0.000000  0.000000  0.0   \n",
       "4  0.000128  0.000548  0.000411  0.000183  1.000000  0.000000  0.000000  0.0   \n",
       "\n",
       "        3    4      ...     140       141       142  143  144       145  146  \\\n",
       "0  0.000000  0.0    ...     0.0  0.000000  0.000000  0.0  0.0  0.197798  0.0   \n",
       "1  0.048974  0.0    ...     0.0  0.000000  0.452318  0.0  0.0  0.000000  0.0   \n",
       "2  0.000000  0.0    ...     0.0  0.000000  0.000000  0.0  0.0  0.000000  0.0   \n",
       "3  0.000000  0.0    ...     0.0  0.000000  0.000000  0.0  0.0  0.000000  0.0   \n",
       "4  0.000000  0.0    ...     0.0  0.118669  0.000000  0.0  0.0  0.000000  0.0   \n",
       "\n",
       "        147      148       149  \n",
       "0  0.334632  0.00000  0.000000  \n",
       "1  0.000000  0.00000  0.000000  \n",
       "2  0.000000  0.00000  0.000000  \n",
       "3  0.000000  0.00000  0.000000  \n",
       "4  0.000000  0.05139  0.674409  \n",
       "\n",
       "[5 rows x 155 columns]"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Combine tf-idf matrix and phrase count matrix\n",
    "features = pd.concat([X_counter,X_normal], ignore_index=False, axis=1)\n",
    "features.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now we have our list of features. It doesn't look anything like our original dataset, now does it? That's because our sentences have been transformed into numbers to feed into our clustering algorithms and predictive models.\n",
    "\n",
    "# Clustering\n",
    "\n",
    "Now it's finally time for some unsupervised machine learning. Each article has been converted into a numeric feature vector, and it's time to see whether each publisher has a distinct writing style that clustering can pick up."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAmsAAAFNCAYAAABfUShSAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzt3Xd4VHXaxvHvA5HFgoKCBVDAuqAiaMAOIopYsSKoa1vXsqsoi6Io9s5asLuuBduqKBaaooshIrYEEFdwQcRCNyiKioLI8/7xO/NmiGmETM6Z5P5c11yZOXNm5plD3Nz7q+buiIiIiEgy1Yu7ABEREREpm8KaiIiISIIprImIiIgkmMKaiIiISIIprImIiIgkmMKaiIiISIIprIlUMzO7xsyeqoHPaW1mbmY50eMJZnZWpj+3JlTndzGzYWZ2QxVe52a2fXXUUMb7729mMzP1/qV8Xka/T1WZ2eVm9nCG3vsLMzuojOeq9HshEgeFNZG1ZGY/pt1Wm9nPaY9PrubPGmZmK0t85rTq/IyqSguLU0ocbxrV/EUl36dGwm3SuPtEd98pE++d1OBuZgeY2bz0Y+5+k7snrlaRJFFYE1lL7r5R6gZ8BRyZduzpDHzkkPTPdPfdMvAZ62JDM9sl7fFJwOdxFSMiUtsorIlkRgMze8LMfjCz6WaWm3rCzJqb2QgzKzKzz82sXzV+7nZm9oGZfW9mr5jZpmmfe1RUy3dRy0vb6PgZZjYq7bzZZjY87fFcM+tQzmc+CZyW9vhU4In0E8r6zmbWE7gcOLGUVsNWZjYpuoavm1nTir5L9FxHM5sSve45oGFZhZvZ9maWH12vJdH56Q4ys0/NbKmZ3WdmFr2unpkNNrMvzezr6N96k+i5x81sQHS/RdT6+Ne0z/vWgjVamaIuu4vN7KOonufMrGHa8wPNbKGZLTCzs8rq1jSzG4H9gXuja3pvRd8net2ZZvZJ9Nw4M2tVznUr7/p/YWaDzGxG9F6PmVlDM9sQeBVobsWtxM3TW1atuLX2jOj3bqmZnWtmnaLr8l369zGz7czsTTP7Jvr3e9rMGpdVdznfp5GZ5ZnZ3enXRCQpFNZEMuMo4FmgMTASuBfCH3lgFDANaAF0By4ys0Oq6XNPBc4EmgOrgLujz90ReAa4CGgGjAVGmVkDIB/YPwogWwHrAftGr9sW2Aj4qJzPfAroY2b1oz/ajYD3U0+W953d/TXgJuC5UloNTwLOADYHGgAXV/Rdou/zMiFAbgo8DxxXTu3XA68DTYCWwD0lnj8C6ATsBvQGUv9Op0e3bkDqGqVCRD5wQHS/KzAn+gnQBZjoZe/z1xvoCbQB2kefkQq1fwcOArZPe7/fcfcrgInA+dE1Pb+i72NmRxNC87GEazqRcI1/p4LfpZSTo/feDtgRGOzuPwGHAgvSWokXlPE19gR2AE4EhgJXRN99Z6C3maW+vwE3E37f2wJbA9eUdW3K+D6bAeOBSe7er5x/G5HYKKyJZMbb7j7W3X8jBIdUCOkENHP369x9pbvPAf4F9CnnvS6OWhRSt8fLOfdJd/84+sN4JeEPW33CH70x7v6Gu/8K3AasD+wT1fAD0IEQAsYB883sj9Hjie6+upzPnAfMJPwxPY0SrWpV/M4Aj7n7LHf/GRge1Ud53wXYixA2h7r7r+7+AlBQzmf8CrQCmrv7L+7+donnb3H379z9KyAvrYaTgTvcfY67/wgMIgTWHNLCLyGcDSEKv4TrmV9OPXe7+wJ3/5YQcFOf1zu6HtPdfTlwbTnvUZ6yvs85wM3u/om7ryIE6A5ltK6Vd/1T7nX3udH3uBHou5Z1Xh/9e7wO/AQ84+5fu/t8QpDsCODus6M6Vrh7EXAH5QTZUjQn/Hs87+6D17JGkRqjsCaSGYvS7i8HGkZ/yFsRuoH+P3wRWjS2KOe9bnP3xmm308o5d27a/S8JwaUp4Y/Sl6kn
ovA1l9DSBcWtQV2i+xMIf/QqChcpTxBagfoSWtrSVeU7w++v4UbR/fK+S3NgfonWkS8p20BC68wHUbfemVWpIbqfA2zh7p8BPxKC0P7AaGCBme1ExdezvM9L/7dNv782ynr/VsBdaf8+3xKuSwt+r6LfpZL1fRm9Zm0sTrv/cymPNwIws83N7Fkzm29mywi/e02pvMMJQfPBtaxPpEYprInUrLnA5yXCVyN3P6ya3n/rtPvbEFqOlgALCH+QAYjG5WwNzI8OpcLa/tH9fNYurI0g/OGb4+4lw1FF33ltu53K+y4LgRYlxh1tU9Ybufsid/+LuzcntC7dX9o4sIpqiD5jFcWhIh84HmgQtQblE7qomwAfVuL9S1pI6KZN2bqsEyNre03nAueU+Dda393fKeXcin6XSta3TfSaqtRVkZuj92zv7hsDpxBCZmX9C3gNGBuNqRNJJIU1kZr1AbDMzC41s/WjcV67mFmnanr/U8ysnZltAFwHvBB1xQ4HDjez7ma2HjAAWAGk/hjnE8Zfre/u8whdTT2BzYCpFX1o1O16IFDaEgwVfefFQOuo27Ayyvsu7xJCUz8zyzGzY4HOZb2RmZ1gZqkQtJTwh/+3StTwDNDfzNqY2UYUj7tbFT2fD5wPvBU9ngBcQOger8z7lzQcOMPM2kb/tldVcP5iwli6ynoQGGRmOwOY2SZmdkI5tZT3uwTwNzNraWGCy+VAauLGYmAziyZjVINGhFbM78ysBXBJFd7jfEI3/mgzW7+a6hKpVgprIjUo+kN9JKGL7HNCq9fDQHl/vAbamuusLSnn3CeBYYTuroZAv+hzZxJaHe6JPvNIwpIjK6PnZxH+6E2MHi8jDIyfVNlw4e6FURfg2n7n56Of31iJNdvK+Jwyv0v0fY4ldMkuJYyverGct+sEvG9mPxImglzo7pVZduRRwrV+K/pOvxDCWEo+IUikwtrbwAZpj9eKu79KmCySB8wmhFIIIak0dwHHR7Mp767E+78E3Ao8G3UnfkyYDFDaueX+LkX+TZi4MSe63RC99n+EoDsn6nJd2+7Rkq4Fdge+B8ZQ/r91qaIu87MJrYuvWNoMXJGkME18ERHJLtGs24+BP6S15iWChcWQz3L3/8Rdi0htoZY1EZEsYGbHRMuTNCG0go1KWlATkcxQWBMRyQ7nAEXAZ4RxdefFW46I1BR1g4qIiIgkmFrWRERERBJMYU1EREQkwXLiLqC6NG3a1Fu3bh13GSIiIiIVmjx58hJ3b1aZc2tNWGvdujWFhYVxlyEiIiJSITMrbyu8NagbVERERCTBFNZEREREEkxhTURERCTBFNZEREREEkxhTURERCTBFNZEREREEkxhTURERCTBFNYqYcgQyMtb81heXjguIiIikkkKa5XQqRP07l0c2PLywuNOneKtS0RERGq/WrODQSZ16wbPPQe9ekHfvvDiizB8eDguIiIikklqWauk3Fz47Td46CE491wFNREREakZCmuVNHky1K8f7g8d+vsxbCIiIiKZoLBWCakxai+9BPvuC/XqwQknKLCJiIhI5imsVUJBQRij1r073H8//PQT7LNPOC4iIiKSSQprlTBwYPEYtfbt4YILYPRojVsTERGRzFNYq4JrroEttoC//jVMOhARERHJFIW1KthkE7j9digshIcfjrsaERERqc0U1qqob9/QDTpoEBQVxV2NiIiI1FYKa1VkBvfeCz/8AJddFnc1IiIiUlsprK2Ddu2gf3949FF45524qxEREZHaSGFtHV11FbRsCX/7G6xaFXc1IiIiUtsorK2jjTaCO++EDz+EBx6IuxoRERGpbRTWqsFxx0GPHjB4MCxaFHc1IiIiUpsorFUDM7jnHvjlF7jkkrirERERkdpEYa2a7LhjCGpPPQVvvRV3NSIiIlJbKKxVo8svh1atws4Gv/4adzUiIiJSGyisVaMNNoC774bp08NPERERkXWV0bBmZj3NbKaZzTaz3y0da2atzGy8
mX1kZhPMrGV0vIOZvWtm06PnTsxkndXpyCPh8MPD/qHz58ddjYiIiGS7jIU1M6sP3AccCrQD+ppZuxKn3QY84e7tgeuAm6Pjy4FT3X1noCcw1MwaZ6rW6mQWWtVWrYIBA+KuRkRERLJdJlvWOgOz3X2Ou68EngV6lTinHTA+up+Xet7dZ7n7p9H9BcDXQLMM1lqttt027Bn63HMwfnzF54uIiIiUJZNhrQUwN+3xvOhYumnAcdH9Y4BGZrZZ+glm1hloAHyWoTozYuBA2G67sLPBihVxVyMiIiLZKpNhzUo55iUeXwx0NbOpQFdgPvD/mzaZ2VbAk8AZ7r76dx9gdraZFZpZYVFRUfVVXg0aNgxrr82cCXfcEXc1IiIikq0yGdbmAVunPW4JLEg/wd0XuPux7t4RuCI69j2AmW0MjAEGu/t7pX2Auz/k7rnuntusWfJ6SQ89FI45Bq6/Hr76Ku5qREREJBtlMqwVADuYWRszawD0AUamn2BmTc0sVcMg4NHoeAPgJcLkg+czWGPGDR0afl50Ubx1iIiISHbKWFhz91XA+cA44BNguLtPN7PrzOyo6LQDgJlmNgvYArgxOt4b6AKcbmYfRrcOmao1k7bZBq68El56CV59Ne5qREREJNuYe8lhZNkpNzfXCwsL4y6jVCtXQvv2YTmPjz8O49lERESk7jKzye6eW5lztYNBDWjQAO67Dz77DIYMibsaERERySYKazWke3c48US4+WaYMyfuakRERCRbKKzVoNtvh5wc6NcPaknvs4iIiGSYwloNatEi7Bk6ZgyMHFnh6SIiIiIKazWtXz/YeWe48EJYvjzuakRERCTpFNZq2Hrrwf33w5dfwk03xV2NiIiIJJ3CWgy6dIFTToF//ANmzYq7GhEREUkyhbWY/OMfYb2188/XZAMREREpm8JaTLbcEm64Ad54A154Ie5qREREJKkU1mJ03nnQoQP07w8//hh3NSIiIpJECmsxyskJkw3mz4frrou7GhEREUkihbWY7b03nHkm3HknTJ8edzUiIiKSNAprCXDLLdCoEfztb5psICIiImtSWEuAZs3Cmmv5+fDMM3FXIyIiIkmisJYQf/kL5ObCgAHw/fdxVyMiIiJJobCWEPXrwwMPwOLFcPXVcVcjIiIiSaGwliC5uXDOOXDPPTBtWtzViIiISBIorCXMjTfCppuGyQarV8ddjYiIiMRNYS1hNt0Ubr0VJk2CJ56IuxoRERGJm8JaAp1+elh/beBAWLo07mpEREQkTgprCVSvXtjZ4Jtv4Ior4q5GRERE4qSwllAdOoRxaw8+CJMnx12NiIiIxEVhLcGuvx423xz++ldNNhAREamrFNYSbJNN4Lbb4IMP4OGH465GRERE4qCwlnAnnwxdusCgQbBkSdzViIiISE1TWEs4M7jvvrAF1aBBcVcjIiIiNU1hLQvssgtcdFHoCn3vvbirERERkZqksJYlrr4amjcPkw1++y3uakRERKSmKKxliUaN4I47YOrUsOG7iIiI1A0Ka1mkd2846CAYPBgWL467GhEREakJCmtZxAzuvReWLw9bUYmIiEjtp7CWZXbaCS6+OGzyPnFi3NWIiIhIpimsZaErroBttgmTDX79Ne5qREREJJMU1rLQhhvC0KHw8cehW1RERERqL4W1LHX00XDooWFJjwUL4q5GREREMkVhLUuZwT33wMqVMGBA3NWIiIhIpiisZbHttoP99oNnn4Xx44uP5+XBkCHx1SUiIiLVJ6Nhzcx6mtlMM5ttZpeV8nwrMxtvZh+Z2QQza5n23Glm9ml0Oy2TdWaziy+GevXgjDNCK1teXliPrVOnuCsTERGR6pCxsGZm9YH7gEOBdkBfM2tX4rTbgCfcvT1wHXBz9NpNgauBPYHOwNVm1iRTtWaznj3hhhtg7lzYZ58Q1IYPh27d4q5MREREqkMmW9Y6A7PdfY67rwSeBXqVOKcdkOrAy0t7/hDgDXf/1t2XAm8APTNY
a1YbNAj23BMmT4aGDUP3qIiIiNQOmQxrLYC5aY/nRcfSTQOOi+4fAzQys80q+VrM7GwzKzSzwqKiomorPNvk5cFnn8Gxx8L8+bDzzjBqVNxViYiISHXIZFizUo55iccXA13NbCrQFZgPrKrka3H3h9w9191zmzVrtq71ZqXUGLXhw2HEiLCzwS+/wFFHhVmiK1fGXaGIiIisi0yGtXnA1mmPWwJrrAjm7gvc/Vh37whcER37vjKvlaCgYM0xaqecAqNHh/Frd9wB++8Pn38eb40iIiJSdZkMawXADmbWxswaAH2AkeknmFlTM0vVMAh4NLo/DuhhZk2iiQU9omNSwsCBv59McMghMGkSvPACzJwJHTvCSy/FU5+IiIism4yFNXdfBZxPCFmfAMPdfbqZXWdmR0WnHQDMNLNZwBbAjdFrvwWuJwS+AuC66JisheOOgylTYMcdw3i2fv1gxYq4qxIREZG1Ye6/GwqWlXJzc72wsDDuMhJp5Uq49NKwn+gee8Bzz2nGqIiISJzMbLK751bmXO1gUAc0aAB33gkvvxxmjXbsGMa5iYiISPIprNUhvXrBhx+GpT1OPBHOOy/MHBUREZHkUlirY1q1grfegksugQcfhL32glmz4q5KREREyqKwVgett17Y6H30aJg3L4xj+/e/465KRERESqOwVocdfnjoFu3QAU4+Gf7yF1i+PO6qREREJJ3CWh3XsmXYBWHQIHj44bDH6CefxF2ViIiIpCisCTk5cNNN8NprsHgx5ObC44/HXZWIiIiAwpqkOeSQ0C3auTOcfnq4/fRT3FWJiIjUbQprsobmzeE//4GrrgqbwnfqBB9/HHdVIiIidZfCmvxO/fpw7bUhtH37bQhsjzwCtWSzCxERkayisCZlOvDA0C26775w1lnwpz/BDz/EXZWIiEjdorAm5dpySxg3Dq6/Hp55Jkw+mDYt7qpERETqDoU1qVD9+jB4MLz5Jvz4Y1je48EH1S0qIiJSExTWpNK6dg3dot26hX1F+/SBZcvirkpERKR2U1iTtdKsGYwZA7fcAiNGwO67w+TJcVclIiJSeymsyVqrVw8uvRTy82HFCthnH7j3XnWLioiIZILCmlTZvvuGbtEePeCCC+D44+G77+KuSkREpHZRWJN1stlmMHIk3HZb+NmxI3zwQdxViYiI1B4Ka7LOzGDAAJg4MXSF7rcf3HmnukVFRESqg8KaVJu99oKpU+Hww+Hvf4edd4ZXXlnznLw8GDIknvpERESykcKaVKsmTeDFF+Guu2DWLDj22DD5AEJQ6907bF8lIiIilaOwJtXODPr1g/fegy22CJMPuncPQW348LBOm4iIiFSOwppkTG4ufPIJtG0bdj/YaCPYcce4qxIREckuCmuSUVOmQFERHHkkfPFFCG4vvxx3VSIiItlDYU0yJjVGbfjwsKzH44/Dzz/DMceE7aqWL4+7QhERkeRTWJOMKShYc4zaqaeGraq6dAkbwefmwrRp8dYoIiKSdAprkjEDB/5+MkGPHmGbqjfeCLsddO4MQ4fC6tXx1CgiIpJ0CmsSi4MOgo8+gp49oX//sDbb4sVxVyUiIpI8CmsSm6ZNw2SD+++HCROgfXt49dW4qxIREUkWhTWJlVmYbFBYGNZkO+wwuOgi+OWXuCsTERFJhkqHNTOrb2bNzWyb1C2ThUndsvPOYQP4fv3C7gd77gkzZsRdlYiISPwqFdbM7AJgMfAGMCa6jc5gXVIHNWwYgtqYMbBwIeyxBzzwgDaEFxGRuq2yLWsXAju5+87uvmt0a5/JwqTuOuywMPmga1f461/h6KNhyZK4qxIREYlHZcPaXOD7TBYikm7LLWHsWLjzTnjttTD5YPz4uKsSERGpeZUNa3OACWY2yMz+nrplsjCRevXCZIP334dNNoGDD4ZLL4WVK+OuTEREpOZUNqx9RRiv1gBolHYTybgOHWDyZDj7bBgyBPbZB2bNirsqERGRmpFTmZPc/VoAM2sU
HvqPlXmdmfUE7gLqAw+7+y0lnt8GeBxoHJ1zmbuPNbP1gIeB3aMan3D3myv3laQ22mCDsEXVIYfAWWfB7rvD3XfDGWeE5T9ERERqq8rOBt3FzKYCHwPTzWyyme1cwWvqA/cBhwLtgL5m1q7EaYOB4e7eEegD3B8dPwH4g7vvCuwBnGNmrSv3laQ2O+aYsJ9o587w5z9Dnz6wdGncVYmIiGROZbtBHwL+7u6t3L0VMAD4VwWv6QzMdvc57r4SeBboVeIcBzaO7m8CLEg7vqGZ5QDrAyuBZZWsVWq5li3D3qI33wwvvhi6SSdOjLsqERGRzKhsWNvQ3fNSD9x9ArBhBa9pQZhFmjIvOpbuGuAUM5sHjAUuiI6/APwELCSMl7vN3b+tZK1SB9SvD5ddBpMmwXrrwQEHwFVXwapVcVcmIiJSvSo9G9TMrjSz1tFtMPB5Ba8pbSRRyeVN+wLD3L0lcBjwpJnVI7TK/QY0B9oAA8xs2999gNnZZlZoZoVFRUWV/CpSm3TuDFOnwp/+BNdfD126wOcV/WaKiIhkkcqGtTOBZsCLwEvR/TMqeM08YOu0xy0p7uZM+TMwHMDd3wUaAk2Bk4DX3P1Xd/8amATklvwAd3/I3XPdPbdZs2aV/CpS2zRqBMOGwTPPwPTpoVv03/+OuyoREZHqUamw5u5L3b2fu+/u7h3d/UJ3r2hYdwGwg5m1MbMGhAkEI0uc8xXQHcDM2hLCWlF0/EALNgT2Av5X+a8ldVGfPmHywS67wMknw6mnwjKNdBQRkSxXblgzs6HRz1FmNrLkrbzXuvsq4HxgHPAJYdbndDO7zsyOik4bAPzFzKYBzwCnu7sTZpFuRJh9WgA85u4frcP3lDqidWvIz4drroGnn4aOHcOiuiIiItnKvJxdss1sD3efbGZdS3ve3fMzVtlays3N9cLCwrjLkASZNCm0sM2bB9deGyYk1K8fd1UiIiJgZpPd/XdDvEpTbsuau0+O7nZw9/z0G9BhXQsVyaR994UPP4Tjj4fBg+HAA2Hu3IpfJyIikiSVnWBwWinHTq/GOkQyonHjMPFg2LCwZdVuu8ELL8RdlYiISOVVNGatr5mNArYtMV4tD/imZkoUWTdmcNppoZVtu+3ghBPCkh9jx655Xl5e2HtUREQkSSraG/QdwsK0TYHb047/AGjAv2SV7bcP49iuvhpuuQWOOgruuw/OOScEtd69YfjwuKsUERFZU7kTDOD/9/gc5+4H1UxJVaMJBrI2UuFsyZIwtu1//4Pnn4du3eKuTERE6oJqm2AA4O6/AcvNbJN1rkwkIbp1CwFt551Da9v338OoUTB/ftyViYiIrKmyEwx+Af5rZo+Y2d2pWyYLE8m0jz6CxYvh3HPDkh533QVt2sDZZ8Ps2XFXJyIiElQ2rI0BrgTeAian3USyUvoYtQcegFdfDTNHDz0UnngCdtoJTjoJ/vvfuCsVEZG6rrLbTT1O2GEgFdL+HR0TyUoFBSGopcaodesWlvTYd9+wEfyAAaFbtH37MBHhvffirVdEROquCicYAJjZAcDjwBeAETZoP83d38pkcWtDEwykun37LdxzT+geXbo0BLorrgiL65rFXZ2IiGSzap1gELkd6OHuXd29C3AIcGdVCxTJBptuGpb5+PJLuO22MCHhoINgr73glVdg9eq4KxQRkbqgsmFtPXefmXrg7rOA9TJTkkiyNGoUukXnzIEHH4SiIjj66NBF+vTTsGpV3BWKiEhtVtmwVhjNBD0guv0LTTCQOqZhw7CA7qxZ8NRT4dgpp8COO8I//wm//BJvfSIiUjtVNqydB0wH+gEXAjOAczJVlEiS5eTAySeHpT9efhmaNg3Lf2y7Ldx+O/z4Y9wViohIbVLZsHauu9/h7se6+zHufichwInUWfXqQa9e8P778J//QNu2cPHF0KoVXHttmKAgIiKyriob1k4r5djp1ViHSNYyg+7dYfx4ePdd2G8/uOaaENouuQQWLoy7QhER
yWblhjUz62tmo4A2ZjYy7TYB+KZGKhTJIqmZoh99FNZnu+OOsCvCeeeF9dtERETWVrnrrJlZK6ANcDNwWdpTPwAfuXti5sFpnTVJotmzYcgQGDYsLPXRty9cdlnYk1REROqualtnzd2/dPcJwEHARHfPBxYCLQmL44pIObbfHh56KLSq9esHL74Iu+wCxx4bdlEQERGpSGXHrL0FNDSzFsB44AxgWKaKEqltWrQIXaJffglXXhn2Ju3cGXr0gAkToBIbiYiISB1V2bBm7r4cOBa4x92PAdplriyR2qlpU7juuhDabr01jG3r1i3sSTp6tEKbiIj8XqXDmpntDZwMjImO5WSmJJHab+ONYeDA0D16332wYAEceSR06AAnnRSWAkmXlxfGvomISN1T2bB2ETAIeMndp5vZtkBe5soSqRvWXx/++lf49FN4/HFYuRKeeQYOOSSs2bZiRQhqvXtDp05xVysiInEodzZoNtFsUKkNVq8OuyJcdlkIcOuvH7pG774b/vKXuKsTEZHqUm2zQc1saPRzVIl11kaa2cjqKFZEitWrF2aKzpwZtrT6+efQ2nb22dCxYwhtS5bEXaWIiNSkirpBn4x+3gbcXspNRDJgwgQYNy7MHG3SBC64IAS5Cy+E5s3h+ONhzBhYlZiVDkVEJFPKnSTg7pOjn/lm1iy6X1QThYnUVakxasOHh5mi3boVP95sM3jsMXjqKRgxArbaCv70JzjjDPjjH+OuXEREMqGiblAzs2vMbAnwP2CWmRWZ2VU1U55I3VNQUBzUIPwcPjwcb98e7rwT5s8PC+zm5sLtt4dN5PfeOyzA+/338dYvIiLVq6LtpvoDhwFnu/vn0bFtgQeA19z9zhqpshI0wUDqqkWLQkvbY4/BjBlhUsKxx4bWtm7dQvepiIgky9pMMKgorE0FDnb3JSWONwNed/eO61RpNVJYk7rOHQoLQ2h75hn47jto1QpOOw1OPz1sKC8iIslQbbNBgfVKBjX4/3Fr61WlOBHJDLOwFtv998PChSGw7bQTXH89bLttaGV74gn46ae4KxURkbVRUVhbWcXnRCRGDRtCnz5hRumXX8INN8C8eaGVbaut4KyzYNIkbW8lIpINKuoG/Q0o7f+HG9DQ3RNoL5g6AAAY7ElEQVTTuqZuUJHyucPbb4du0uHDQwvbjjuGLtJTTw2bzYuISM2otm5Qd6/v7huXcmuUpKAmIhUzg/33h0cfDZMSHnsMttwSLr8cttkGevaE556DX36Ju1IREUmneWIiddBGG4UWtfx8mD07BLYZM0LXafPmcP75MHmyuklFRJJAYU2kjttuuzAJ4fPP4fXX4dBD4ZFHwhpuu+0W1nX7+utw7pAhYdHedHl54biIiGRGRsOamfU0s5lmNtvMLivl+W3MLM/MpprZR2Z2WNpz7c3sXTObbmb/NbOGmaxVpK6rXx8OPhiefjrMJn3wQdhgA/j738N4tqOPDvuU9u5dHNhSuy106hRv7SIitVm5EwzW6Y3N6gOzgIOBeUAB0NfdZ6Sd8xAw1d0fMLN2wFh3b21mOcAU4E/uPs3MNgO+c/ffyvo8TTAQyYwZM2DYMHjyyTDWrXFjWLEiTEoYMWLN3RZERKRyqnOdtXXRGZjt7nPcfSXwLNCrxDkObBzd3wRYEN3vAXzk7tMA3P2b8oKaiGROu3ahm3PuXBg1Cg48MIS1f/4TcnJg2rTiblIREal+mQxrLYC5aY/nRcfSXQOcYmbzgLHABdHxHQE3s3FmNsXMBmawThGphJwcOOKIMPmgSZMwe3TJEujfP0xKOPJIeP55zSYVEalumQxrVsqxkn2ufYFh7t6SsAfpk2ZWD8gB9gNOjn4eY2bdf/cBZmebWaGZFRYVFVVv9SLyO6kxas8/D6++GiYkNGkSjk2dGn5utRWcey68845mk4qIVIdMhrV5wNZpj1tS3M2Z8mdgOIC7vws0BJpGr8139yXuvpzQ6rZ7yQ9w94fcPdfdc5s1
a5aBryAi6QoK1hyj1q1bGLfWoUPYKeGNN0IL25NPwr77hkV3r78evvgi1rJFRLJaJicY5BAmGHQH5hMmGJzk7tPTznkVeM7dh5lZW2A8oau0cXR/P8K2Vq8Bd7r7mLI+TxMMRJLjhx/gxRfDXqR5eaGFrUuXsN3V8cfDxhtX/B4iIrVZIiYYuPsq4HxgHPAJMNzdp5vZdWZ2VHTaAOAvZjYNeAY43YOlwB2EgPchMKW8oCYiydKoUQhm48eHVrUbbwwzSf/8Z9hiCzjppLBv6W+aNiQiUqGMtazVNLWsiSSbO3zwQWhte+YZWLo0jG87+eQQ7HbZJe4KRURqTiJa1kRE0pnBnnvCffeFRXdHjIDOnWHoUNh1V9h993B/8eK4KxURSRaFNRGpcX/4Axx7LLz8MixYAHffHXZQ6N8/7JagZUBERIoprIlIrJo1gwsuCDNNp0+Hiy/WMiAiIukU1kQkMdq1g1tuCcuAvP56WIRXy4CISF2nsCYiiZPaVD61H+mwYbD11nDVVdCmDXTtCo8+CsuWxV2piEjmKayJSKKllgF5883ylwG55Zawplu6vLywr6mISDZTWBORrNGqFVx+Ofzvf/Dee3DmmfDaa2Gf0n/8Aw4/PLS4QfHWWJ06xVuziMi60jprIpLVVqyAMWPC+m2jRsHq1WHSwo8/wm23hQkK9fR/S0UkYbTOmojUGenLgCxaBIccAkVFYdmPv/0tLAVy9tkwejT8/HPc1YqIrD2FNRGpNT7+GCZPhiuvhE03DV2mXbrAs8+GtduaNoVjjoHHHoOvv467WhGRysmJuwARkeqQGqM2fDh06xZuqcdPPAH5+TByZLi9/HLYUWGffeCoo8Ltj3+M+xuIiJROLWsiUisUFBQHNQg/hw8Px//wB+jRA+69N6zhNmUKXH116Ba99FJo2xZ22gkuuQQmTtQG8yKSLJpgICJ12ty5YWLCyJFheZBff4XNNgsL8h51VAh5G20Ud5UiUtuszQQDhTURkciyZWHNtpEjwwzTpUuhQQPo3h169Qrj3po3j7tKEakNFNZERNbRqlXw9tshuL3yCsyZE47n5oYWt169YNddw9g3EZG1pbAmIlKN3OGTT0JoGzkS3n8/HGvVqji4dekC660Xd6Uiki0U1kREMmjRorBu28iR8MYbYU23TTaBQw8Nwa1nT2jcOO4qRSTJFNZERGrI8uUhsI0cGSYqFBVBTk7YbD61LMjw4WHbq9RMVQhLjRQUwMCB8dUuIvFRWBMRicFvv8EHHxR3l37ySTi+7bahNe7228NuCvn5a64JJyJ1j8KaiEgCfPpp8UK8EyeGcW6NGoXlQW6+GS64AOrXj7tKEYmD9gYVEUmAHXaAAQNCS1pRURjP9sMPIaz17x+WATn33NCN+uuvcVcrIkmlsCYiUgM++ggmTQr7ljZuHH4ecAA89VRYeHfLLeHMM8P6bitWxF2tiCSJwpqISIal71t63XXw/PPwwAOhVa2oCF56CQ47DF58MeycsPnmcPLJ4fHy5XFXLyJxU1gTEcmw8vYtXX99OPpoePJJ+PprGDsWjj8+7KRw3HHQrBmccAI8+2zoQhWRukcTDEREEmjVqjDWbcSI0PK2aFHxhvTHHReWBGnSJO4qRaSqNBtURKQW+e03ePddeOGF0DU6d25Yy6179xDcjj46tMCJSPZQWBMRqaXcQ/fpiBHh9tlnUK9e2O7quOPg2GO12bxINlBYExGpA9zDLNMXXgjBLbUI7z77FAe31q1jLVFEyqCwJiJSB33ySXGL24cfhmN77BGC23HHwY47xlufiBTTorgiInVQ27YweDBMnQqzZ8Ott4YdEi6/HHbaCdq3h2uvhY8/Dq1yAEOGhKVF0uXlheMikgwKayIitdB224VN4t9/H776CoYODYvxXnst7Lor/PGPIcRtsklYAy4V2FJrwnXqFG/9IlJM3aAiInXIokVhKZARI2DChDDTdIstYNkyOPFEGD1aG8yL1AR1
g4qISKm23BLOOw/+858Q3B55BHbfPWxxNWwY/PQTPPooPPccfPdd3NWKCCisiYjUWU2bhv1IL7kkLLB73HGwejWMHAl9+oTnu3WD22+HmTOLx7mJSM1SWBMRqcNSY9Sefz4sAfLqq9CgAdx9dxjz9s03cPHFYYzbjjtC//4wfjysXBl35SJ1h8KaiEgdVta+pT//DDfdFNZx++ILuO8+2GGHsAH9QQeFVrcTToDHHw97mopI5mR0goGZ9QTuAuoDD7v7LSWe3wZ4HGgcnXOZu48t8fwM4Bp3v628z9IEAxGRzPvpp9CyNnp0uC1cCGaw555wxBHh1r59OCYiZUvEorhmVh+YBRwMzAMKgL7uPiPtnIeAqe7+gJm1A8a6e+u050cAq4H3FdZERJLFPazplgpuBQXheMuWxcHtwANh/fXjrVMkiZIyG7QzMNvd57j7SuBZoFeJcxzYOLq/CbAg9YSZHQ3MAaZnsEYREakiszCT9Kqr4IMPQivbI4+ENdqefDKEtc02gyOPhH/+E+bNi7tikeyUk8H3bgHMTXs8D9izxDnXAK+b2QXAhsBBAGa2IXApoVXu4gzWKCIi1WTLLcPs0jPPDEuB5OfDmDEwalRoeQPo0KG41a1Tp7AJvYiUL5P/mZQ2YqFkn2tfYJi7twQOA540s3rAtcCd7v5juR9gdraZFZpZYVFRUbUULSIi6+4Pf4AePeCuu+Czz2DGjLCF1cYbh4kLe+0FW20FZ5wRFuhdtizuikWSK5Nj1vYmTAw4JHo8CMDdb047ZzrQ093nRo/nAHsBI4Cto9MaE8atXeXu95b1eRqzJiKSHb79FsaNC61tr74KS5fCeutB167FrW7bbRfCXadOa+6mkJcXxsYNHBhf/SLVISlj1gqAHcysjZk1APoAI0uc8xXQHcDM2gINgSJ339/dW0eTDYYCN5UX1EREJHtsuin07QtPPx2W/XjrrbB+24IFcNFFsP32YVP6yZPhmGPgjTfC67RvqdRVGRuz5u6rzOx8YBxhWY5H3X26mV0HFLr7SGAA8C8z60/oIj3da8tmpSIiUqGcHNh//3C79VaYMyeMcxs9Gl5+OSy+e8gh0K5d2JD+6ae1b6nUPdrIXUREEumHH8IeptdeC9OmhWM5OaG79Kijwq1161hLFKmypHSDioiIVFmjRtC4McyfD1dcEe6fcEJYIuTCC6FNm7AA75VXhnFsq1fHXbFIZiisiYhIIqXGqA0fDjfcAC++GMav3XsvfPop3HFHWMft5puhc+ewGO8554Ru1J9/jrt6keqjsCYiIolU1r6lBQVhEkL//iHQLV4cFuHdbz/497/DbNKmTcPkhMceA63sJNlOY9ZERKTWSC3GO3JkuM2dG3Za2Gef4nFuO+2kvUslfonYG7SmKayJiEg6d/jww+LgNmVKOL7DDsXBbZ99wqQFkZqmsCYiIlLC3LlhSZBXXoE334Rffw1j3g4/PAS3Hj3CpAaRmqCwJiIiUo5ly+D110OL25gxYVeFBg3gwANDcDvyyDBhQSRTFNZEREQqadUqeOed0OL2yithL1OAPfYo7i7dbTeNc5PqpXXWREREKiknB7p0gdtvD0uCzJgBt9wSNqO/5hro2BFatYLzzw+tcStXhn1L8/LWfJ+8vHBcpLoprImIiETMwr6kl14KkybBokXwyCOw++7w6KNh66umTWHsWOjVK7TEgfYtlcxSN6iIiEgl/PwzjB8fxrmNGhWCHMDWW8M338CNN8J554UWOZGKaMyaiIhIBq1eDYWF8Pe/hxa4lIYNw+K83bqFyQq5uVoaREq3NmFNv0IiIiJrqV49+OknmDkz7E16//1w0UWwZEnoEr3iinBeo0ZhPFwqvO22W3ityNpQWBMREVlL6fuWdusWbqnHQ4eGLa4mTAjrueXlheVBADbdFA44oDi8tW2rWaZSMXWDioiIrKUhQ8JkgtS+pRBCWUEBDBz4+/Pnzw/Pv/lmuH35ZTi+xRYhtKXC
27bbKrzVFRqzJiIikmCff14c3PLyYOHCcHybbUJoSwU4LcxbeymsiYiIZAn3MPYtFd4mTAizSyHsY5oKbwccAJtvHmelUp0U1kRERLLU6tXw3/8Wh7e33grbYwHssktxeOvaFRo3jrdWqTqFNRERkVpi1SqYMqU4vL39dljzrV69sLtCKrzttx9stNHaj6eTeCisiYiI1FIrVsAHHxSHt/feC1tg5eRA585hksKoUfDcc2HHhZIzVyUZFNZERETqiOXLw0b0qfBWWAi//Raea9UqLCNy/fVhd4X114+3VimmsCYiIlJHLVsGEyeGgPb++8XHGzQILW9du4bb3nuHblOJx9qENa2jLCIiUotsvDFssAF89lnYXWGzzeCmm+DCC0N36S23QI8e0KQJ7LVX2LR+7NjiSQySPGpZExERqUVKjlEr+fiHH+DddyE/P9w++AB+/bV4wkKq5W2//cKOC5IZ6gYVERGpo9Z2Nujy5WGSQiq8vfdemMRgBrvuWhzeunSBZs1q7nvUdgprIiIiUiW//BJa2/Lzwxpv77wTAh1Au3bFwa1rV9hqq3hrzWYKayIiIlItVq6EyZOLW97efht+/DE8t8MOxS1vXbvC1lvHW2s2UVgTERGRjFi1CqZOLQ5vEyfC99+H59q0WbPlrU0bbUxfFoU1ERERqRG//Ra2x0qFt7feKt7btGXLNVveXnopLB+i3RUU1kRERCQmq1fDjBnF4S0/H77+OjzXpEkY/3buuXD22bBoEZx4Yt3cXUFhTURERBLBHWbODC1u+fnw+uuwZEl4ziy0uPXpE/Y33X77utNtqrAmIiIiieQO/frBvffCLruELtOFC8NzLVqEFrZu3UJ4a9061lIzam3CWk6mixERERFJmTABnn027K7wwANhw/kWLcLYtbw8GDcOnnoqnNu6dXF469YtjIGri9SyJiIiIjWiot0VILS8TZ9eHN4mTIClS8NzO+ywZnjbYovYvso6UzeoiIiIJM7a7q4AYcLCtGnF4S0/P2yZBWGR3lRwO+CAsA9qtlBYExERkVpp1SqYMqU4vE2cWLzDwm67FYe3Ll2gceN4ay2PwpqIiIjUCStXhpa5VHibNCnsbZramP7AA0N4228/aNQo7mqLrU1Yq5fhQnqa2Uwzm21ml5Xy/DZmlmdmU83sIzM7LDp+sJlNNrP/Rj8PzGSdIiIikp0aNIB994XBg2H8ePjuuxDaBg+GDTaAoUPhsMPCGm977w1XXAH/+U9xa1zKkCHhdeny8sLxuGWsZc3M6gOzgIOBeUAB0NfdZ6Sd8xAw1d0fMLN2wFh3b21mHYHF7r7AzHYBxrl7i/I+Ty1rIiIiUtLy5aG1LdXyVlAQdl1o0AD23LN4mZBffoFTTil/8kN1SsrSHZ2B2e4+JyrqWaAXMCPtHAc2ju5vAiwAcPepaedMBxqa2R/cfUUG6xUREZFaZoMN4OCDww3C5ISJE4vD2/XXw3XXQcOG0LYtHHFECGmjRydnZ4VMhrUWwNy0x/OAPUuccw3wupldAGwIHFTK+xxHaH1TUBMREZF10qhR6BY97LDweOnSsLtCKrwtXw7DhoV14JIQ1CCzY9ZK2zCiZJ9rX2CYu7cEDgOeNLP/r8nMdgZuBc4p9QPMzjazQjMrLCoqqqayRUREpK5o0gR69Qpj24YODct/nHVWWLC35Bi2uGQyrM0Dtk573JKomzPNn4HhAO7+LtAQaApgZi2Bl4BT3f2z0j7A3R9y91x3z23WrFk1ly8iIiJ1RWqM2vPPw7/+FbpAe/dORmDLZFgrAHYwszZm1gDoA4wscc5XQHcAM2tLCGtFZtYYGAMMcvdJGaxRREREhIKCNceodesWHhcUxFsXZHidtWgpjqFAfeBRd7/RzK4DCt19ZDQD9F/ARoQu0oHu/rqZDQYGAZ+mvV0Pd/+6rM/SbFARERHJFloUV0RERCTBErMoroiIiIisG4U1ERERkQRTWBMRERFJMIU1ERERkQRT
WBMRERFJMIU1ERERkQRTWBMRERFJsFqzzpqZFQFfxl1HzJoCS+IuIgvpulWNrlvV6LpVja5b1ei6VU1NXLdW7l6pvTJrTVgTMLPCyi6wJ8V03apG161qdN2qRtetanTdqiZp103doCIiIiIJprAmIiIikmAKa7XLQ3EXkKV03apG161qdN2qRtetanTdqiZR101j1kREREQSTC1rIiIiIgmmsJblzGxrM8szs0/MbLqZXRh3TdnEzOqb2VQzGx13LdnCzBqb2Qtm9r/o927vuGvKBmbWP/pv9GMze8bMGsZdU1KZ2aNm9rWZfZx2bFMze8PMPo1+NomzxiQq47r9I/pv9SMze8nMGsdZYxKVdt3SnrvYzNzMmsZRW4rCWvZbBQxw97bAXsDfzKxdzDVlkwuBT+IuIsvcBbzm7n8EdkPXr0Jm1gLoB+S6+y5AfaBPvFUl2jCgZ4ljlwHj3X0HYHz0WNY0jN9ftzeAXdy9PTALGFTTRWWBYfz+umFmWwMHA1/VdEElKaxlOXdf6O5Tovs/EP5wtoi3quxgZi2Bw4GH464lW5jZxkAX4BEAd1/p7t/FW1XWyAHWN7McYANgQcz1JJa7vwV8W+JwL+Dx6P7jwNE1WlQWKO26ufvr7r4qevge0LLGC0u4Mn7fAO4EBgKxD+5XWKtFzKw10BF4P95KssZQwn+Iq+MuJItsCxQBj0Xdxw+b2YZxF5V07j4fuI3w/9AXAt+7++vxVpV1tnD3hRD+Tyqwecz1ZKMzgVfjLiIbmNlRwHx3nxZ3LaCwVmuY2UbACOAid18Wdz1JZ2ZHAF+7++S4a8kyOcDuwAPu3hH4CXVHVSgaX9ULaAM0BzY0s1PirUrqEjO7gjBs5um4a0k6M9sAuAK4Ku5aUhTWagEzW48Q1J529xfjridL7AscZWZfAM8CB5rZU/GWlBXmAfPcPdV6+wIhvEn5DgI+d/cid/8VeBHYJ+aass1iM9sKIPr5dcz1ZA0zOw04AjjZtV5XZWxH+D9W06K/ES2BKWa2ZVwFKaxlOTMzwvihT9z9jrjryRbuPsjdW7p7a8JA7zfdXS0dFXD3RcBcM9spOtQdmBFjSdniK2AvM9sg+m+2O5qYsbZGAqdF908DXomxlqxhZj2BS4Gj3H153PVkA3f/r7tv7u6to78R84Ddo//9i4XCWvbbF/gToWXow+h2WNxFSa12AfC0mX0EdABuirmexItaIl8ApgD/Jfxvb6JWSE8SM3sGeBfYyczmmdmfgVuAg83sU8IMvVvirDGJyrhu9wKNgDeivw8PxlpkApVx3RJFOxiIiIiIJJha1kREREQSTGFNREREJMEU1kREREQSTGFNREREJMEU1kREREQSTGFNRKQUZtbazD6Ouw4REYU1ERERkQRTWBMRqYCZbRttXN8p7lpEpO5RWBMRKUe0tdYI4Ax3L4i7HhGpe3LiLkBEJMGaEfagPM7dp8ddjIjUTWpZExEp2/fAXMIevCIisVDLmohI2VYCRwPjzOxHd/933AWJSN2jsCYiUg53/8nMjgDeMLOf3P2VuGsSkbrF3D3uGkRERESkDBqzJiIiIpJgCmsiIiIiCaawJiIiIpJgCmsiIiIiCaawJiIiIpJgCmsiIiIiCaawJiIiIpJgCmsiIiIiCfZ/1W7enmpbBoUAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 720x360 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Size of graph\n",
    "plt.rcParams['figure.figsize'] = [10,5]\n",
    "\n",
    "# Fit k-means for each candidate k and record the mean distance to the nearest centroid\n",
    "distortions = []\n",
    "K = range(1,15)\n",
    "for k in K:\n",
    "    kmeanModel = KMeans(n_clusters=k).fit(features)\n",
    "    distortions.append(sum(np.min(cdist(features, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / features.shape[0])\n",
    "\n",
    "# Plot the elbow\n",
    "plt.plot(K, distortions, 'bx-')\n",
    "plt.xlabel('k')\n",
    "plt.ylabel('Distortion')\n",
    "plt.title('The Elbow Method showing the optimal k')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the elbow method, the ideal number of clusters (k) appears to be 3, since that is where the decrease in distortion levels off. That's relatively small compared to the 15 publications that we actually have, but it may be more useful to work with for prediction.\n",
    "\n",
    "### K-means\n",
    "\n",
    "The first clustering method I'll use for modelling the dataset is K-means, that requires the user to input k number of centroids, determining the nearest centroid for each data point, and adjusting the centroids until the best clusters are found, or until a set number of iterations has passed. However, we want to see if we can cluster the articles into 15 clusters representing each of the publishers, so that will be k."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>col_0</th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>publication</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Atlantic</th>\n",
       "      <td>28</td>\n",
       "      <td>103</td>\n",
       "      <td>287</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Breitbart</th>\n",
       "      <td>216</td>\n",
       "      <td>419</td>\n",
       "      <td>1006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Business Insider</th>\n",
       "      <td>23</td>\n",
       "      <td>91</td>\n",
       "      <td>299</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Buzzfeed News</th>\n",
       "      <td>3</td>\n",
       "      <td>44</td>\n",
       "      <td>183</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CNN</th>\n",
       "      <td>28</td>\n",
       "      <td>124</td>\n",
       "      <td>465</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Fox News</th>\n",
       "      <td>52</td>\n",
       "      <td>39</td>\n",
       "      <td>192</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Guardian</th>\n",
       "      <td>10</td>\n",
       "      <td>51</td>\n",
       "      <td>176</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>NPR</th>\n",
       "      <td>29</td>\n",
       "      <td>60</td>\n",
       "      <td>376</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>National Review</th>\n",
       "      <td>33</td>\n",
       "      <td>88</td>\n",
       "      <td>173</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>New York Post</th>\n",
       "      <td>29</td>\n",
       "      <td>107</td>\n",
       "      <td>898</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>New York Times</th>\n",
       "      <td>6</td>\n",
       "      <td>15</td>\n",
       "      <td>55</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Reuters</th>\n",
       "      <td>3</td>\n",
       "      <td>37</td>\n",
       "      <td>123</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Talking Points Memo</th>\n",
       "      <td>23</td>\n",
       "      <td>168</td>\n",
       "      <td>182</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Vox</th>\n",
       "      <td>22</td>\n",
       "      <td>74</td>\n",
       "      <td>197</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Washington Post</th>\n",
       "      <td>39</td>\n",
       "      <td>190</td>\n",
       "      <td>228</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "col_0                  0    1     2\n",
       "publication                        \n",
       "Atlantic              28  103   287\n",
       "Breitbart            216  419  1006\n",
       "Business Insider      23   91   299\n",
       "Buzzfeed News          3   44   183\n",
       "CNN                   28  124   465\n",
       "Fox News              52   39   192\n",
       "Guardian              10   51   176\n",
       "NPR                   29   60   376\n",
       "National Review       33   88   173\n",
       "New York Post         29  107   898\n",
       "New York Times         6   15    55\n",
       "Reuters                3   37   123\n",
       "Talking Points Memo   23  168   182\n",
       "Vox                   22   74   197\n",
       "Washington Post       39  190   228"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Calulate predicted values\n",
    "kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42, n_init=20)\n",
    "y_pred0 = kmeans.fit_predict(features)\n",
    "\n",
    "pd.crosstab(y_train, y_pred0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Adjusted Rand Score: 0.004298765\n",
      "Silhouette Score: 0.04092314\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import adjusted_rand_score\n",
    "from sklearn.metrics import silhouette_score\n",
    "\n",
    "print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y_train, y_pred0)))\n",
    "print('Silhouette Score: {:0.7}'.format(silhouette_score(features, y_pred0, sample_size=60000, metric='euclidean')))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>col_0</th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "      <th>10</th>\n",
       "      <th>11</th>\n",
       "      <th>12</th>\n",
       "      <th>13</th>\n",
       "      <th>14</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>publication</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Atlantic</th>\n",
       "      <td>13</td>\n",
       "      <td>71</td>\n",
       "      <td>70</td>\n",
       "      <td>5</td>\n",
       "      <td>25</td>\n",
       "      <td>15</td>\n",
       "      <td>12</td>\n",
       "      <td>14</td>\n",
       "      <td>17</td>\n",
       "      <td>8</td>\n",
       "      <td>11</td>\n",
       "      <td>10</td>\n",
       "      <td>110</td>\n",
       "      <td>22</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Breitbart</th>\n",
       "      <td>76</td>\n",
       "      <td>491</td>\n",
       "      <td>262</td>\n",
       "      <td>64</td>\n",
       "      <td>181</td>\n",
       "      <td>49</td>\n",
       "      <td>115</td>\n",
       "      <td>53</td>\n",
       "      <td>43</td>\n",
       "      <td>15</td>\n",
       "      <td>66</td>\n",
       "      <td>37</td>\n",
       "      <td>120</td>\n",
       "      <td>33</td>\n",
       "      <td>36</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Business Insider</th>\n",
       "      <td>16</td>\n",
       "      <td>106</td>\n",
       "      <td>52</td>\n",
       "      <td>7</td>\n",
       "      <td>22</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>14</td>\n",
       "      <td>4</td>\n",
       "      <td>23</td>\n",
       "      <td>6</td>\n",
       "      <td>80</td>\n",
       "      <td>6</td>\n",
       "      <td>61</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Buzzfeed News</th>\n",
       "      <td>7</td>\n",
       "      <td>59</td>\n",
       "      <td>30</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>31</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>13</td>\n",
       "      <td>6</td>\n",
       "      <td>31</td>\n",
       "      <td>15</td>\n",
       "      <td>26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CNN</th>\n",
       "      <td>28</td>\n",
       "      <td>178</td>\n",
       "      <td>78</td>\n",
       "      <td>1</td>\n",
       "      <td>23</td>\n",
       "      <td>15</td>\n",
       "      <td>59</td>\n",
       "      <td>0</td>\n",
       "      <td>28</td>\n",
       "      <td>14</td>\n",
       "      <td>34</td>\n",
       "      <td>27</td>\n",
       "      <td>97</td>\n",
       "      <td>24</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Fox News</th>\n",
       "      <td>6</td>\n",
       "      <td>75</td>\n",
       "      <td>31</td>\n",
       "      <td>9</td>\n",
       "      <td>49</td>\n",
       "      <td>5</td>\n",
       "      <td>36</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>15</td>\n",
       "      <td>20</td>\n",
       "      <td>13</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Guardian</th>\n",
       "      <td>9</td>\n",
       "      <td>41</td>\n",
       "      <td>32</td>\n",
       "      <td>0</td>\n",
       "      <td>8</td>\n",
       "      <td>4</td>\n",
       "      <td>21</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>16</td>\n",
       "      <td>12</td>\n",
       "      <td>69</td>\n",
       "      <td>11</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>NPR</th>\n",
       "      <td>14</td>\n",
       "      <td>77</td>\n",
       "      <td>42</td>\n",
       "      <td>2</td>\n",
       "      <td>21</td>\n",
       "      <td>12</td>\n",
       "      <td>18</td>\n",
       "      <td>13</td>\n",
       "      <td>17</td>\n",
       "      <td>105</td>\n",
       "      <td>11</td>\n",
       "      <td>13</td>\n",
       "      <td>89</td>\n",
       "      <td>18</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>National Review</th>\n",
       "      <td>20</td>\n",
       "      <td>51</td>\n",
       "      <td>54</td>\n",
       "      <td>15</td>\n",
       "      <td>25</td>\n",
       "      <td>14</td>\n",
       "      <td>4</td>\n",
       "      <td>9</td>\n",
       "      <td>18</td>\n",
       "      <td>4</td>\n",
       "      <td>6</td>\n",
       "      <td>3</td>\n",
       "      <td>58</td>\n",
       "      <td>11</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>New York Post</th>\n",
       "      <td>22</td>\n",
       "      <td>200</td>\n",
       "      <td>66</td>\n",
       "      <td>3</td>\n",
       "      <td>17</td>\n",
       "      <td>25</td>\n",
       "      <td>54</td>\n",
       "      <td>47</td>\n",
       "      <td>19</td>\n",
       "      <td>73</td>\n",
       "      <td>32</td>\n",
       "      <td>44</td>\n",
       "      <td>272</td>\n",
       "      <td>45</td>\n",
       "      <td>115</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>New York Times</th>\n",
       "      <td>3</td>\n",
       "      <td>20</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>16</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Reuters</th>\n",
       "      <td>4</td>\n",
       "      <td>34</td>\n",
       "      <td>12</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>53</td>\n",
       "      <td>10</td>\n",
       "      <td>0</td>\n",
       "      <td>8</td>\n",
       "      <td>7</td>\n",
       "      <td>2</td>\n",
       "      <td>9</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Talking Points Memo</th>\n",
       "      <td>15</td>\n",
       "      <td>77</td>\n",
       "      <td>98</td>\n",
       "      <td>18</td>\n",
       "      <td>20</td>\n",
       "      <td>4</td>\n",
       "      <td>7</td>\n",
       "      <td>8</td>\n",
       "      <td>36</td>\n",
       "      <td>5</td>\n",
       "      <td>47</td>\n",
       "      <td>2</td>\n",
       "      <td>15</td>\n",
       "      <td>16</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Vox</th>\n",
       "      <td>9</td>\n",
       "      <td>26</td>\n",
       "      <td>47</td>\n",
       "      <td>4</td>\n",
       "      <td>20</td>\n",
       "      <td>12</td>\n",
       "      <td>9</td>\n",
       "      <td>12</td>\n",
       "      <td>25</td>\n",
       "      <td>7</td>\n",
       "      <td>10</td>\n",
       "      <td>2</td>\n",
       "      <td>97</td>\n",
       "      <td>5</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Washington Post</th>\n",
       "      <td>17</td>\n",
       "      <td>62</td>\n",
       "      <td>138</td>\n",
       "      <td>14</td>\n",
       "      <td>35</td>\n",
       "      <td>5</td>\n",
       "      <td>25</td>\n",
       "      <td>17</td>\n",
       "      <td>37</td>\n",
       "      <td>1</td>\n",
       "      <td>23</td>\n",
       "      <td>16</td>\n",
       "      <td>35</td>\n",
       "      <td>24</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "col_0                0    1    2   3    4   5    6   7   8    9   10  11   12  \\\n",
       "publication                                                                     \n",
       "Atlantic             13   71   70   5   25  15   12  14  17    8  11  10  110   \n",
       "Breitbart            76  491  262  64  181  49  115  53  43   15  66  37  120   \n",
       "Business Insider     16  106   52   7   22   6    7   3  14    4  23   6   80   \n",
       "Buzzfeed News         7   59   30   1    2   5   31   0   2    2  13   6   31   \n",
       "CNN                  28  178   78   1   23  15   59   0  28   14  34  27   97   \n",
       "Fox News              6   75   31   9   49   5   36   5   6    2   5  15   20   \n",
       "Guardian              9   41   32   0    8   4   21   0   5    2  16  12   69   \n",
       "NPR                  14   77   42   2   21  12   18  13  17  105  11  13   89   \n",
       "National Review      20   51   54  15   25  14    4   9  18    4   6   3   58   \n",
       "New York Post        22  200   66   3   17  25   54  47  19   73  32  44  272   \n",
       "New York Times        3   20    9   0    4   1    0   1   6    0   3   4   16   \n",
       "Reuters               4   34   12   2    3   0    3  53  10    0   8   7    2   \n",
       "Talking Points Memo  15   77   98  18   20   4    7   8  36    5  47   2   15   \n",
       "Vox                   9   26   47   4   20  12    9  12  25    7  10   2   97   \n",
       "Washington Post      17   62  138  14   35   5   25  17  37    1  23  16   35   \n",
       "\n",
       "col_0                13   14  \n",
       "publication                   \n",
       "Atlantic             22   15  \n",
       "Breitbart            33   36  \n",
       "Business Insider      6   61  \n",
       "Buzzfeed News        15   26  \n",
       "CNN                  24   11  \n",
       "Fox News             13    6  \n",
       "Guardian             11    7  \n",
       "NPR                  18   13  \n",
       "National Review      11    2  \n",
       "New York Post        45  115  \n",
       "New York Times        4    5  \n",
       "Reuters               9   16  \n",
       "Talking Points Memo  16    5  \n",
       "Vox                   5    8  \n",
       "Washington Post      24    8  "
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Calulate predicted values\n",
    "kmeans = KMeans(n_clusters=15, init='k-means++', random_state=42, n_init=20)\n",
    "y_pred = kmeans.fit_predict(features)\n",
    "\n",
    "pd.crosstab(y_train, y_pred)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Adjusted Rand Score: 0.02919908\n",
      "Silhouette Score: 0.06723925\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import adjusted_rand_score\n",
    "from sklearn.metrics import silhouette_score\n",
    "\n",
    "print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y_train, y_pred)))\n",
    "print('Silhouette Score: {:0.7}'.format(silhouette_score(features, y_pred, sample_size=60000, metric='euclidean')))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So based on the two results, it seems as though it's much better to stick to 3 clusters as our silhouette score suffers dramatically if we actually want to cluster all 15 different publications.\n",
    "\n",
    "### Spectral Clustering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>col_0</th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>publication</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Atlantic</th>\n",
       "      <td>297</td>\n",
       "      <td>92</td>\n",
       "      <td>29</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Breitbart</th>\n",
       "      <td>1094</td>\n",
       "      <td>335</td>\n",
       "      <td>212</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Business Insider</th>\n",
       "      <td>314</td>\n",
       "      <td>75</td>\n",
       "      <td>24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Buzzfeed News</th>\n",
       "      <td>189</td>\n",
       "      <td>39</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CNN</th>\n",
       "      <td>478</td>\n",
       "      <td>113</td>\n",
       "      <td>26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Fox News</th>\n",
       "      <td>194</td>\n",
       "      <td>36</td>\n",
       "      <td>53</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Guardian</th>\n",
       "      <td>181</td>\n",
       "      <td>46</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>NPR</th>\n",
       "      <td>380</td>\n",
       "      <td>55</td>\n",
       "      <td>30</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>National Review</th>\n",
       "      <td>189</td>\n",
       "      <td>69</td>\n",
       "      <td>36</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>New York Post</th>\n",
       "      <td>920</td>\n",
       "      <td>88</td>\n",
       "      <td>26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>New York Times</th>\n",
       "      <td>60</td>\n",
       "      <td>11</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Reuters</th>\n",
       "      <td>135</td>\n",
       "      <td>25</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Talking Points Memo</th>\n",
       "      <td>215</td>\n",
       "      <td>136</td>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Vox</th>\n",
       "      <td>205</td>\n",
       "      <td>65</td>\n",
       "      <td>23</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Washington Post</th>\n",
       "      <td>242</td>\n",
       "      <td>175</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "col_0                   0    1    2\n",
       "publication                        \n",
       "Atlantic              297   92   29\n",
       "Breitbart            1094  335  212\n",
       "Business Insider      314   75   24\n",
       "Buzzfeed News         189   39    2\n",
       "CNN                   478  113   26\n",
       "Fox News              194   36   53\n",
       "Guardian              181   46   10\n",
       "NPR                   380   55   30\n",
       "National Review       189   69   36\n",
       "New York Post         920   88   26\n",
       "New York Times         60   11    5\n",
       "Reuters               135   25    3\n",
       "Talking Points Memo   215  136   22\n",
       "Vox                   205   65   23\n",
       "Washington Post       242  175   40"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sc = SpectralClustering(n_clusters=3)\n",
    "y_pred2 = sc.fit_predict(features)\n",
    "\n",
    "pd.crosstab(y_train, y_pred2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Adjusted Rand Score: 0.003924217\n",
      "Silhouette Score: 0.03775445\n"
     ]
    }
   ],
   "source": [
    "print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y_train, y_pred2)))\n",
    "print('Silhouette Score: {:0.7}'.format(silhouette_score(features, y_pred2, sample_size=60000, metric='euclidean')))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Affinity Propagation\n",
    "\n",
    "Now, for our final attempt at clustering, affinity propagation. It's a method that will group like data points, but most likely result in an excessive number of clusters. Let's see if that can work to our advantage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>col_0</th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "      <th>...</th>\n",
       "      <th>235</th>\n",
       "      <th>236</th>\n",
       "      <th>237</th>\n",
       "      <th>238</th>\n",
       "      <th>239</th>\n",
       "      <th>240</th>\n",
       "      <th>241</th>\n",
       "      <th>242</th>\n",
       "      <th>243</th>\n",
       "      <th>244</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>publication</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Atlantic</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>7</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Breitbart</th>\n",
       "      <td>8</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>18</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>6</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>14</td>\n",
       "      <td>1</td>\n",
       "      <td>7</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Business Insider</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Buzzfeed News</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CNN</th>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>8</td>\n",
       "      <td>5</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>25</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Fox News</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Guardian</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>NPR</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>7</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>9</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>8</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>National Review</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>New York Post</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>6</td>\n",
       "      <td>...</td>\n",
       "      <td>10</td>\n",
       "      <td>12</td>\n",
       "      <td>23</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>New York Times</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Reuters</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Talking Points Memo</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Vox</th>\n",
       "      <td>3</td>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Washington Post</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>15 rows × 245 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "col_0                0    1    2    3    4    5    6    7    8    9   ...   \\\n",
       "publication                                                           ...    \n",
       "Atlantic               2    1    5    7    4    2    1    0    0    0 ...    \n",
       "Breitbart              8    3    5   18    2    4    6    4    0    3 ...    \n",
       "Business Insider       0    0    0    2    1    2    4    1    0    0 ...    \n",
       "Buzzfeed News          1    0    0    1    3    1    1    0    1    0 ...    \n",
       "CNN                    2    0    2    4    1    1    0    0    8    5 ...    \n",
       "Fox News               0    1    0    2    1    0    0    1    0    1 ...    \n",
       "Guardian               0    0    1    2    0    0    0    1    0    0 ...    \n",
       "NPR                    3    0    0    1    0    7    0    0    0    1 ...    \n",
       "National Review        2    2    3    5    0    2    0    0    0    0 ...    \n",
       "New York Post          0    1    0    3    2    2    1    0    0    6 ...    \n",
       "New York Times         0    0    0    0    0    0    0    1    0    0 ...    \n",
       "Reuters                0    1    0    2    1    0    0    0    0    0 ...    \n",
       "Talking Points Memo    0    1    0    2    1    0    2    1    0    0 ...    \n",
       "Vox                    3    6    0    1    1    2    0    0    0    0 ...    \n",
       "Washington Post        0    1    0    4    0    0    0    1    0    0 ...    \n",
       "\n",
       "col_0                235  236  237  238  239  240  241  242  243  244  \n",
       "publication                                                            \n",
       "Atlantic               0    1    2    0    0    1    0    0    2    1  \n",
       "Breitbart              0    1    5   14    1    7    0    0    5    6  \n",
       "Business Insider       0    0    3    2    0    0    0    1    2    0  \n",
       "Buzzfeed News          0    0    1    0    0    1    0    0    1    1  \n",
       "CNN                    1    2    2    1    1    3   25    0    1    1  \n",
       "Fox News               0    2    4    0    0    0    0    0    0    0  \n",
       "Guardian               0    0    6    0    1    0    0    0    0    1  \n",
       "NPR                    0    1    0    0    9    3    1    8    4    0  \n",
       "National Review        0    0    0    2    0    4    0    0    4    0  \n",
       "New York Post         10   12   23    0    4    1    0    2    2    3  \n",
       "New York Times         0    1    0    0    0    1    0    1    0    0  \n",
       "Reuters                0    0    0    0    0    2    0    0    0    0  \n",
       "Talking Points Memo    0    0    0    1    0    2    1    0    1    2  \n",
       "Vox                    0    0    0    0    0    1    0    3    0    0  \n",
       "Washington Post        0    1    0    0    0    3    0    0    1    0  \n",
       "\n",
       "[15 rows x 245 columns]"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "af = AffinityPropagation()\n",
    "y_pred3 = af.fit_predict(features)\n",
    "\n",
    "pd.crosstab(y_train, y_pred3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Adjusted Rand Score: 0.01145051\n",
      "Silhouette Score: 0.0110909\n"
     ]
    }
   ],
   "source": [
    "print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y_train, y_pred3)))\n",
    "print('Silhouette Score: {:0.7}'.format(silhouette_score(features, y_pred3, sample_size=60000, metric='euclidean')))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And the results are worthless, just pitiful. Seems like k-means is the best clustering algorithm- mostly because our other methods were far worse, not because it performed well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train_cluster = pd.DataFrame(features)\n",
    "X_train_cluster['kmeans'] = y_pred"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Training the Model\n",
    "\n",
    "So now that we attempted clustering with the datset, it's time to run the models.\n",
    "\n",
    "### Random Forest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Random forest classifier score (without clustering): 0.41634(+/- 0.02)\n",
      "\n",
      "Random forest classifier score (with clustering): 0.41505(+/- 0.02)\n"
     ]
    }
   ],
   "source": [
    "rfc = ensemble.RandomForestClassifier()\n",
    "rfc_train = cross_val_score(rfc, features, y_train, cv=5, n_jobs=-1)\n",
    "print('Random forest classifier score (without clustering): {:.5f}(+/- {:.2f})\\n'.format(rfc_train.mean(), rfc_train.std()*2))\n",
    "\n",
    "rfc_train_c = cross_val_score(rfc, X_train_cluster, y_train, cv=5, n_jobs=-1)\n",
    "print('Random forest classifier score (with clustering): {:.5f}(+/- {:.2f})'.format(rfc_train_c.mean(), rfc_train_c.std()*2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Logistic Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Logistic regression score (without clustering): 0.44837(+/- 0.02)\n",
      "\n",
      "Logistic regression score (with clustering): 0.44837(+/- 0.02)\n"
     ]
    }
   ],
   "source": [
    "lr = LogisticRegression()\n",
    "lr_train = cross_val_score(lr, features, y_train, cv=5, n_jobs=-1)\n",
    "print('Logistic regression score (without clustering): {:.5f}(+/- {:.2f})\\n'.format(lr_train.mean(), lr_train.std()*2))\n",
    "\n",
    "lr_train_c = cross_val_score(lr, X_train_cluster, y_train, cv=5, n_jobs=-1)\n",
    "print('Logistic regression score (with clustering): {:.5f}(+/- {:.2f})'.format(lr_train_c.mean(), lr_train_c.std()*2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Gradient Boosting Classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Gradient boosting classifier score (without clustering): 0.49597(+/- 0.02)\n",
      "\n",
      "Gradient boosting classifier score (with clustering): 0.49569(+/- 0.03)\n"
     ]
    }
   ],
   "source": [
    "gbc = ensemble.GradientBoostingClassifier()\n",
    "gbc_train = cross_val_score(gbc, features, y_train, cv=5, n_jobs=-1)\n",
    "print('Gradient boosting classifier score (without clustering): {:.5f}(+/- {:.2f})\\n'.format(gbc_train.mean(), gbc_train.std()*2))\n",
    "\n",
    "gbc_train_c = cross_val_score(gbc, X_train_cluster, y_train, cv=5, n_jobs=-1)\n",
    "print('Gradient boosting classifier score (with clustering): {:.5f}(+/- {:.2f})'.format(gbc_train_c.mean(), gbc_train_c.std()*2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It looks like out of the 3 models, gradient boosting fares the best, and without incorporating clustering at that. Let's see if we can tune our model with better parameters using GridSearchCV.\n",
    "\n",
    "### Optimized Gradient Boosting Classifier "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Best parameters:\n",
      "{'loss': 'deviance', 'max_depth': 20, 'max_features': 'sqrt', 'min_samples_leaf': 20, 'n_estimators': 400}\n",
      "Best Score:\n",
      "0.520017157563626\n"
     ]
    }
   ],
   "source": [
    "# Parameters for gradient boosting classifier\n",
    "param_grid  = {'loss':['deviance'],\n",
    "               'max_features': ['sqrt'],\n",
    "               'n_estimators': [400, 800],\n",
    "               'max_depth': [12, 20],\n",
    "               \"min_samples_leaf\" : [12, 20]}\n",
    "\n",
    "# Run grid search to find ideal parameters\n",
    "gbc_grid = GridSearchCV(gbc, param_grid = param_grid, n_jobs=-1)\n",
    "\n",
    "# Initialize and fit the model.\n",
    "gbc_grid.fit(features, y_train)\n",
    "\n",
    "# Return best parameters and best score\n",
    "print('Best parameters:')\n",
    "print(gbc_grid.best_params_)\n",
    "print('Best Score:')\n",
    "print(gbc_grid.best_score_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's not much of an improvement, but we'll take anything we can at this time. We could attempt to improve the accuracy of our model using larger parameter values, but expanding this model could increase the runtime exponentially.\n",
    "\n",
    "# Testing the Model\n",
    "\n",
    "Recall that earlier, we split the data into 2 sets, a training set and a test set. Now, it's time to test the test set and see if the settings from our training model will work with the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Normalize Tf-idf vectors\n",
    "X_test_norm = normalize(X_test_tfidf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_test_words = []\n",
    "\n",
    "for row in X_test:\n",
    "    # Processing each row for tokens\n",
    "    row_doc = nlp(row)\n",
    "    # Calculating length of each sentence\n",
    "    sent_len = len(row_doc) \n",
    "    # Initializing counts of different parts of speech\n",
    "    advs = 0\n",
    "    verb = 0\n",
    "    noun = 0\n",
    "    adj = 0\n",
    "    for token in row_doc:\n",
    "        # Identifying each part of speech and adding to counts\n",
    "        if token.pos_ == 'ADV':\n",
    "            advs +=1\n",
    "        elif token.pos_ == 'VERB':\n",
    "            verb +=1\n",
    "        elif token.pos_ == 'NOUN':\n",
    "            noun +=1\n",
    "        elif token.pos_ == 'ADJ':\n",
    "            adj +=1\n",
    "    # Creating a list of all features for each sentence\n",
    "    X_test_words.append([row_doc, advs, verb, noun, adj, sent_len])\n",
    "    \n",
    "# Data frame for features\n",
    "X_test_count = pd.DataFrame(data=X_test_words, columns=['BOW', 'ADV', 'VERB', 'NOUN', 'ADJ', 'sent_length'])\n",
    "\n",
    "# Change token count to token percentage\n",
    "for column in X_test_count.columns[1:5]:\n",
    "    X_test_count[column] = X_test_count[column] / X_test_count['sent_length']\n",
    "\n",
    "# Normalize X_count\n",
    "X_test_counter = normalize(X_test_count.drop('BOW',axis=1))\n",
    "X_test_counter  = pd.DataFrame(data=X_test_counter)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>...</th>\n",
       "      <th>140</th>\n",
       "      <th>141</th>\n",
       "      <th>142</th>\n",
       "      <th>143</th>\n",
       "      <th>144</th>\n",
       "      <th>145</th>\n",
       "      <th>146</th>\n",
       "      <th>147</th>\n",
       "      <th>148</th>\n",
       "      <th>149</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.000127</td>\n",
       "      <td>0.000465</td>\n",
       "      <td>0.000388</td>\n",
       "      <td>0.000249</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.132212</td>\n",
       "      <td>0.143814</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.358175</td>\n",
       "      <td>0.11451</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.000273</td>\n",
       "      <td>0.000606</td>\n",
       "      <td>0.000606</td>\n",
       "      <td>0.000318</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.242403</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.000046</td>\n",
       "      <td>0.000189</td>\n",
       "      <td>0.000236</td>\n",
       "      <td>0.000090</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.100599</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.320476</td>\n",
       "      <td>0.094936</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.000057</td>\n",
       "      <td>0.000312</td>\n",
       "      <td>0.000327</td>\n",
       "      <td>0.000154</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.146941</td>\n",
       "      <td>0.14051</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.000459</td>\n",
       "      <td>0.001582</td>\n",
       "      <td>0.000408</td>\n",
       "      <td>0.000459</td>\n",
       "      <td>0.999998</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.138932</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 155 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        0         1         2         3         4    0    1         2    3    \\\n",
       "0  0.000127  0.000465  0.000388  0.000249  1.000000  0.0  0.0  0.000000  0.0   \n",
       "1  0.000273  0.000606  0.000606  0.000318  1.000000  0.0  0.0  0.000000  0.0   \n",
       "2  0.000046  0.000189  0.000236  0.000090  1.000000  0.0  0.0  0.100599  0.0   \n",
       "3  0.000057  0.000312  0.000327  0.000154  1.000000  0.0  0.0  0.000000  0.0   \n",
       "4  0.000459  0.001582  0.000408  0.000459  0.999998  0.0  0.0  0.000000  0.0   \n",
       "\n",
       "   4   ...        140       141       142  143       144       145      146  \\\n",
       "0  0.0 ...   0.000000  0.132212  0.143814  0.0  0.000000  0.000000  0.00000   \n",
       "1  0.0 ...   0.242403  0.000000  0.000000  0.0  0.000000  0.000000  0.00000   \n",
       "2  0.0 ...   0.000000  0.000000  0.000000  0.0  0.320476  0.094936  0.00000   \n",
       "3  0.0 ...   0.000000  0.000000  0.000000  0.0  0.000000  0.146941  0.14051   \n",
       "4  0.0 ...   0.138932  0.000000  0.000000  0.0  0.000000  0.000000  0.00000   \n",
       "\n",
       "        147      148  149  \n",
       "0  0.358175  0.11451  0.0  \n",
       "1  0.000000  0.00000  0.0  \n",
       "2  0.000000  0.00000  0.0  \n",
       "3  0.000000  0.00000  0.0  \n",
       "4  0.000000  0.00000  0.0  \n",
       "\n",
       "[5 rows x 155 columns]"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Combining features into one data frame\n",
    "X_test_norm_df = pd.DataFrame(data=X_test_norm.toarray())\n",
    "features_test = pd.concat([X_test_counter, X_test_norm_df], ignore_index=False, axis=1)\n",
    "features_test.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>col_0</th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "      <th>10</th>\n",
       "      <th>11</th>\n",
       "      <th>12</th>\n",
       "      <th>13</th>\n",
       "      <th>14</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>publication</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Atlantic</th>\n",
       "      <td>4</td>\n",
       "      <td>35</td>\n",
       "      <td>6</td>\n",
       "      <td>2</td>\n",
       "      <td>15</td>\n",
       "      <td>11</td>\n",
       "      <td>8</td>\n",
       "      <td>14</td>\n",
       "      <td>4</td>\n",
       "      <td>10</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Breitbart</th>\n",
       "      <td>56</td>\n",
       "      <td>45</td>\n",
       "      <td>36</td>\n",
       "      <td>24</td>\n",
       "      <td>56</td>\n",
       "      <td>7</td>\n",
       "      <td>18</td>\n",
       "      <td>83</td>\n",
       "      <td>12</td>\n",
       "      <td>3</td>\n",
       "      <td>76</td>\n",
       "      <td>18</td>\n",
       "      <td>25</td>\n",
       "      <td>80</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Business Insider</th>\n",
       "      <td>9</td>\n",
       "      <td>19</td>\n",
       "      <td>8</td>\n",
       "      <td>4</td>\n",
       "      <td>10</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "      <td>22</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>29</td>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Buzzfeed News</th>\n",
       "      <td>4</td>\n",
       "      <td>8</td>\n",
       "      <td>4</td>\n",
       "      <td>6</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "      <td>9</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>15</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>6</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CNN</th>\n",
       "      <td>9</td>\n",
       "      <td>21</td>\n",
       "      <td>17</td>\n",
       "      <td>28</td>\n",
       "      <td>27</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>12</td>\n",
       "      <td>10</td>\n",
       "      <td>8</td>\n",
       "      <td>43</td>\n",
       "      <td>1</td>\n",
       "      <td>8</td>\n",
       "      <td>4</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Fox News</th>\n",
       "      <td>13</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>13</td>\n",
       "      <td>8</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>10</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>18</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Guardian</th>\n",
       "      <td>2</td>\n",
       "      <td>15</td>\n",
       "      <td>5</td>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>10</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>NPR</th>\n",
       "      <td>4</td>\n",
       "      <td>27</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>21</td>\n",
       "      <td>5</td>\n",
       "      <td>3</td>\n",
       "      <td>22</td>\n",
       "      <td>5</td>\n",
       "      <td>42</td>\n",
       "      <td>28</td>\n",
       "      <td>6</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>National Review</th>\n",
       "      <td>7</td>\n",
       "      <td>15</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>13</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>New York Post</th>\n",
       "      <td>5</td>\n",
       "      <td>80</td>\n",
       "      <td>14</td>\n",
       "      <td>24</td>\n",
       "      <td>9</td>\n",
       "      <td>29</td>\n",
       "      <td>20</td>\n",
       "      <td>13</td>\n",
       "      <td>15</td>\n",
       "      <td>20</td>\n",
       "      <td>80</td>\n",
       "      <td>1</td>\n",
       "      <td>6</td>\n",
       "      <td>11</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>New York Times</th>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>7</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>6</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Reuters</th>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>6</td>\n",
       "      <td>8</td>\n",
       "      <td>18</td>\n",
       "      <td>4</td>\n",
       "      <td>7</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Talking Points Memo</th>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>20</td>\n",
       "      <td>4</td>\n",
       "      <td>8</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>27</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>17</td>\n",
       "      <td>9</td>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Vox</th>\n",
       "      <td>9</td>\n",
       "      <td>27</td>\n",
       "      <td>11</td>\n",
       "      <td>2</td>\n",
       "      <td>11</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>12</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>10</td>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Washington Post</th>\n",
       "      <td>8</td>\n",
       "      <td>5</td>\n",
       "      <td>16</td>\n",
       "      <td>5</td>\n",
       "      <td>18</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>44</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "col_0                0   1   2   3   4   5   6   7   8   9   10  11  12  13  \\\n",
       "publication                                                                   \n",
       "Atlantic              4  35   6   2  15  11   8  14   4  10  18   1   5   5   \n",
       "Breitbart            56  45  36  24  56   7  18  83  12   3  76  18  25  80   \n",
       "Business Insider      9  19   8   4  10  19   0  22   4   2  29   2   5   4   \n",
       "Buzzfeed News         4   8   4   6   5   6   0   9   3   1  15   0   1   6   \n",
       "CNN                   9  21  17  28  27   2   0  12  10   8  43   1   8   4   \n",
       "Fox News             13   4   2  13   8   4   1  10   4   0  18   4   5   6   \n",
       "Guardian              2  15   5   3  16   4   0  10   1   3  16   0   3   1   \n",
       "NPR                   4  27   5   5  21   5   3  22   5  42  28   6   3   1   \n",
       "National Review       7  15   4   1  13   0   5  18   1   0   4   5   5   0   \n",
       "New York Post         5  80  14  24   9  29  20  13  15  20  80   1   6  11   \n",
       "New York Times        3   5   3   0   7   1   1   6   1   0   3   0   2   0   \n",
       "Reuters               4   0   1   0   6   8  18   4   7   0   3   0   1   2   \n",
       "Talking Points Memo   2   4  20   4   8   1   3  27   0   3  17   9   5   4   \n",
       "Vox                   9  27  11   2  11   4   2  12   0   1  10   2   6   1   \n",
       "Washington Post       8   5  16   5  18   5   2  44   3   2  16   3   5   1   \n",
       "\n",
       "col_0                14  \n",
       "publication              \n",
       "Atlantic              5  \n",
       "Breitbart             9  \n",
       "Business Insider      2  \n",
       "Buzzfeed News         5  \n",
       "CNN                   7  \n",
       "Fox News              1  \n",
       "Guardian              5  \n",
       "NPR                  10  \n",
       "National Review       1  \n",
       "New York Post        12  \n",
       "New York Times        1  \n",
       "Reuters               9  \n",
       "Talking Points Memo   6  \n",
       "Vox                   4  \n",
       "Washington Post       6  "
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Calculate predicted values: fit K-means on the test features and assign each article to a cluster\n",
    "kmeans = KMeans(n_clusters=15, init='k-means++', random_state=42, n_init=20)\n",
    "y_pred_test = kmeans.fit_predict(features_test)\n",
    "\n",
    "pd.crosstab(y_test, y_pred_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Adjusted Rand Score: 0.02586774\n",
      "Silhouette Score: 0.06681075\n"
     ]
    }
   ],
   "source": [
    "print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y_test, y_pred_test)))\n",
    "print('Silhouette Score: {:0.7}'.format(silhouette_score(features_test, y_pred_test, sample_size=60000, metric='euclidean')))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "X2_test_c = pd.DataFrame(features_test)\n",
    "X2_test_c['kmeans_clust'] = y_pred_test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Test set score: 0.47591(+/- 0.023)\n"
     ]
    }
   ],
   "source": [
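    "# 5-fold cross-validation on the test split; cross_val_score clones gbc_grid and refits it on each fold\n",
    "gbc_grid_scores_test = cross_val_score(gbc_grid, features_test, y_test, cv=5)\n",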
    "print('Test set score: {:.5f}(+/- {:.3f})'.format(gbc_grid_scores_test.mean(), gbc_grid_scores_test.std()*2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Conclusion\n",
    "\n",
    "So what did we learn today? For starters, natural language processing can be extremely taxing on memory, since it requires us to create a massive dataframe of words. With more memory, we could likely have used a larger sample of the original dataset (we only used 1%) and/or retained more words as features (we kept 150 out of thousands). The results reflect those limits: on the test set, K-means with 15 clusters reached an adjusted Rand score of about 0.03 and a silhouette score of about 0.07, so the clusters barely line up with the publications, and the classifier's cross-validated score was about 0.48. Another likely issue is simply that we are looking at too many different publications for the model to distinguish them all accurately.\n",
    "\n",
    "\n",
    "# Source\n",
    "\n",
    "https://www.kaggle.com/snapcrack/all-the-news"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}