{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 19 - Natural Language Processing\n", "\n", "by [Alejandro Correa Bahnsen](albahnsen.com/)\n", "\n", "version 1.1, July 2018\n", "\n", "## Part of the class [Applied Deep Learning](https://github.com/albahnsen/AppliedDeepLearningClass)\n", "\n", "\n", "This notebook is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/deed.en_US). Special thanks goes to [Kevin Markham](https://github.com/justmarkham)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is NLP?\n", "\n", "- Using computers to process (analyze, understand, generate) natural human languages\n", "- Most knowledge created by humans is unstructured text, and we need a way to make sense of it\n", "- Build probabilistic model using data about a language\n", "\n", "### What are some of the higher level task areas?\n", "\n", "- **Information retrieval**: Find relevant results and similar results\n", " - [Google](https://www.google.com/)\n", "- **Information extraction**: Structured information from unstructured documents\n", " - [Events from Gmail](https://support.google.com/calendar/answer/6084018?hl=en)\n", "- **Machine translation**: One language to another\n", " - [Google Translate](https://translate.google.com/)\n", "- **Text simplification**: Preserve the meaning of text, but simplify the grammar and vocabulary\n", " - [Rewordify](https://rewordify.com/)\n", " - [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page)\n", "- **Predictive text input**: Faster or easier typing\n", " - [My application](https://justmarkham.shinyapps.io/textprediction/)\n", " - [A much better application](https://farsite.shinyapps.io/swiftkey-cap/)\n", "- **Sentiment analysis**: Attitude of speaker\n", " - [Hater News](http://haternews.herokuapp.com/)\n", "- **Automatic summarization**: Extractive or abstractive summarization\n", " - [autotldr](https://www.reddit.com/r/technology/comments/35brc8/21_million_people_still_use_aol_dialup/cr2zzj0)\n", "- **Natural Language Generation**: Generate text from data\n", " - [How a computer describes a sports match](http://www.bbc.com/news/technology-34204052)\n", " - [Publishers withdraw more than 120 gibberish papers](http://www.nature.com/news/publishers-withdraw-more-than-120-gibberish-papers-1.14763)\n", "- **Speech recognition and generation**: Speech-to-text, text-to-speech\n", " - [Google's Web Speech API demo](https://www.google.com/intl/en/chrome/demos/speech.html)\n", " - [Vocalware Text-to-Speech demo](https://www.vocalware.com/index/demo)\n", "- **Question answering**: Determine the intent of the question, match query with knowledge base, evaluate hypotheses\n", " - [How did supercomputer Watson beat Jeopardy champion Ken Jennings?](http://blog.ted.com/how-did-supercomputer-watson-beat-jeopardy-champion-ken-jennings-experts-discuss/)\n", " - [IBM's Watson Trivia Challenge](http://www.nytimes.com/interactive/2010/06/16/magazine/watson-trivia-game.html)\n", " - [The AI Behind Watson](http://www.aaai.org/Magazine/Watson/watson.php)\n", " \n", "### What are some of the lower level components?\n", "\n", "- **Tokenization**: breaking text into tokens (words, sentences, n-grams)\n", "- **Stopword removal**: a/an/the\n", "- **Stemming and lemmatization**: root word\n", "- **TF-IDF**: word importance\n", "- **Part-of-speech tagging**: noun/verb/adjective\n", "- **Named entity recognition**: person/organization/location\n", "- 
**Spelling correction**: \"New Yrok City\"\n", "- **Word sense disambiguation**: \"buy a mouse\"\n", "- **Segmentation**: \"New York City subway\"\n", "- **Language detection**: \"translate this page\"\n", "- **Machine learning**\n", "\n", "### Why is NLP hard?\n", "\n", "- **Ambiguity**:\n", " - Hospitals are Sued by 7 Foot Doctors\n", " - Juvenile Court to Try Shooting Defendant\n", " - Local High School Dropouts Cut in Half\n", "- **Non-standard English**: text messages\n", "- **Idioms**: \"throw in the towel\"\n", "- **Newly coined words**: \"retweet\"\n", "- **Tricky entity names**: \"Where is A Bug's Life playing?\"\n", "- **World knowledge**: \"Mary and Sue are sisters\", \"Mary and Sue are mothers\"\n", "\n", "NLP requires an understanding of the **language** and the **world**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import scipy as sp\n", "from sklearn.model_selection import train_test_split, cross_val_score\n", "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n", "from sklearn.naive_bayes import MultinomialNB\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn import metrics\n", "# from textblob import TextBlob, Word\n", "from nltk.stem.snowball import SnowballStemmer\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('../datasets/mashable_texts.csv', index_col=0)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
authorauthor_websharestexttitlefacebogooglelinkedtwittetwitter_followers
0Seth Fiegermanhttp://mashable.com/people/seth-fiegerman/4900\\nApple's long and controversial ebook case ha...The Supreme Court smacked down Apple todayhttp://www.facebook.com/sfiegermanNaNhttp://www.linkedin.com/in/sfiegermanhttps://twitter.com/sfiegerman14300
1Rebecca Ruizhttp://mashable.com/people/rebecca-ruiz/1900Analysis\\n\\n\\n\\n\\n\\nThere is a reason that Don...Every woman has met a man like Donald TrumpNaNNaNNaNhttps://twitter.com/rebecca_ruiz3738
2Davina Merchanthttp://mashable.com/people/568bdab351984019310...7000LONDON - Last month we reported on a dog-sized...Adorable dog-sized rabbit finally finds his fo...NaNhttps://plus.google.com/105525238342980116477?...NaNNaN0
3Scott Gerber[]5000Today's digital marketing experts must have a ...15 essential skills all digital marketing hire...NaNNaNNaNNaN0
4Josh Dickeyhttp://mashable.com/people/joshdickey/1600LOS ANGELES — For big, fun, populist popcorn m...Mashable top 10: 'The Force Awakens' is the be...NaNhttps://plus.google.com/109213469090692520544?...NaNhttps://twitter.com/JLDlite11200
\n", "
" ], "text/plain": [ " author author_web shares \\\n", "0 Seth Fiegerman http://mashable.com/people/seth-fiegerman/ 4900 \n", "1 Rebecca Ruiz http://mashable.com/people/rebecca-ruiz/ 1900 \n", "2 Davina Merchant http://mashable.com/people/568bdab351984019310... 7000 \n", "3 Scott Gerber [] 5000 \n", "4 Josh Dickey http://mashable.com/people/joshdickey/ 1600 \n", "\n", " text \\\n", "0 \\nApple's long and controversial ebook case ha... \n", "1 Analysis\\n\\n\\n\\n\\n\\nThere is a reason that Don... \n", "2 LONDON - Last month we reported on a dog-sized... \n", "3 Today's digital marketing experts must have a ... \n", "4 LOS ANGELES — For big, fun, populist popcorn m... \n", "\n", " title \\\n", "0 The Supreme Court smacked down Apple today \n", "1 Every woman has met a man like Donald Trump \n", "2 Adorable dog-sized rabbit finally finds his fo... \n", "3 15 essential skills all digital marketing hire... \n", "4 Mashable top 10: 'The Force Awakens' is the be... \n", "\n", " facebo \\\n", "0 http://www.facebook.com/sfiegerman \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " google \\\n", "0 NaN \n", "1 NaN \n", "2 https://plus.google.com/105525238342980116477?... \n", "3 NaN \n", "4 https://plus.google.com/109213469090692520544?... \n", "\n", " linked twitte \\\n", "0 http://www.linkedin.com/in/sfiegerman https://twitter.com/sfiegerman \n", "1 NaN https://twitter.com/rebecca_ruiz \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN https://twitter.com/JLDlite \n", "\n", " twitter_followers \n", "0 14300 \n", "1 3738 \n", "2 0 \n", "3 0 \n", "4 11200 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Tokenization\n", "\n", "- **What:** Separate text into units such as sentences or words\n", "- **Why:** Gives structure to previously unstructured text\n", "- **Notes:** Relatively easy with English language text, not easy with some languages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the target feature (number of shares)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 82.000000\n", "mean 3090.487805\n", "std 8782.031594\n", "min 437.000000\n", "25% 893.500000\n", "50% 1200.000000\n", "75% 2275.000000\n", "max 63100.000000\n", "Name: shares, dtype: float64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y = df.shares\n", "y.describe()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "y = pd.cut(y, [0, 893, 1200, 2275, 63200], labels=[0, 1, 2, 3])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 22\n", "3 21\n", "0 21\n", "2 18\n", "Name: shares, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y.value_counts()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "df['y'] = y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### create document-term matrices " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "X = df.text" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# use CountVectorizer to create document-term matrices from X\n", "vect = CountVectorizer()\n", "X_dtm = vect.fit_transform(X)" ] }, { "cell_type": "code", 
"execution_count": 16, "metadata": {}, "outputs": [], "source": [ "temp=X_dtm.todense()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'apple': 682,\n", " 'long': 4303,\n", " 'and': 617,\n", " 'controversial': 1747,\n", " 'ebook': 2401,\n", " 'case': 1307,\n", " 'has': 3367,\n", " 'reached': 5734,\n", " 'its': 3884,\n", " 'final': 2893,\n", " 'chapter': 1383,\n", " 'it': 3878,\n", " 'not': 4883,\n", " 'the': 7054,\n", " 'happy': 3352,\n", " 'ending': 2527,\n", " 'company': 1612,\n", " 'wanted': 7620,\n", " 'supreme': 6865,\n", " 'court': 1809,\n", " 'on': 4969,\n", " 'monday': 4687,\n", " 'rejected': 5841,\n", " 'an': 603,\n", " 'appeal': 673,\n", " 'filed': 2882,\n", " 'by': 1224,\n", " 'to': 7150,\n", " 'overturn': 5075,\n", " 'stinging': 6723,\n", " 'ruling': 6087,\n", " 'that': 7051,\n", " 'led': 4181,\n", " 'broad': 1147,\n", " 'conspiracy': 1706,\n", " 'with': 7748,\n", " 'several': 6303,\n", " 'major': 4374,\n", " 'publishers': 5610,\n", " 'fix': 2927,\n", " 'price': 5483,\n", " 'of': 4935,\n", " 'books': 1088,\n", " 'sold': 6528,\n", " 'through': 7106,\n", " 'online': 4979,\n", " 'bookstore': 1089,\n", " 'decision': 2009,\n", " 'means': 4496,\n", " 'now': 4895,\n", " 'no': 4858,\n", " 'choice': 1437,\n", " 'but': 1215,\n", " 'pay': 5178,\n", " 'out': 5037,\n", " '400': 223,\n", " 'million': 4611,\n", " 'consumers': 1714,\n", " 'additional': 446,\n", " '50': 252,\n", " 'in': 3664,\n", " 'legal': 4187,\n", " 'fees': 2846,\n", " 'according': 400,\n", " 'original': 5021,\n", " 'settlement': 6301,\n", " '2014': 153,\n", " 'see': 6237,\n", " 'also': 575,\n", " 'here': 3440,\n", " 'how': 3559,\n", " 'marshalled': 4437,\n", " 'entire': 2564,\n", " 'tech': 6989,\n", " 'industry': 3712,\n", " 'fight': 2876,\n", " 'fbi': 2826,\n", " 'for': 2996,\n", " 'verdict': 7519,\n", " 'is': 3863,\n", " 'more': 4700,\n", " 'damaging': 1939,\n", " 'reputation': 5917,\n", " 'as': 734,\n", " 'consumer': 1713,\n", " 'friendly': 3072,\n", " 'brand': 1120,\n", " 'mention': 4539,\n", " 'legacy': 4186,\n", " 'beloved': 959,\n", " 'founder': 3036,\n", " 'steve': 6715,\n", " 'jobs': 3938,\n", " 'than': 7045,\n", " 'actual': 432,\n", " 'bottom': 1102,\n", " 'line': 4252,\n", " 'put': 5636,\n", " 'fine': 2903,\n", " 'context': 1727,\n", " 'total': 7188,\n", " '450': 237,\n", " 'payout': 5183,\n", " 'equal': 2581,\n", " 'about': 374,\n", " 'little': 4276,\n", " 'half': 3328,\n", " 'sales': 6118,\n", " 'generates': 3151,\n", " 'average': 821,\n", " 'each': 2382,\n", " 'day': 1972,\n", " 'based': 895,\n", " '75': 314,\n", " 'billion': 998,\n", " 'revenue': 5987,\n", " 'reported': 5897,\n", " 'most': 4707,\n", " 'recent': 5773,\n", " 'quarter': 5651,\n", " 'fixing': 2929,\n", " 'episode': 2579,\n", " 'dates': 1962,\n", " 'back': 854,\n", " 'late': 4130,\n", " '2009': 148,\n", " 'just': 3986,\n", " 'ahead': 513,\n", " 'ipad': 3856,\n", " 'launch': 4141,\n", " 'recognizing': 5780,\n", " 'would': 7787,\n", " 'likely': 4244,\n", " 'be': 924,\n", " 'big': 990,\n", " 'selling': 6253,\n", " 'point': 5339,\n", " 'tablet': 6925,\n", " 'began': 942,\n", " 'courting': 1812,\n", " 'what': 7686,\n", " 'were': 7681,\n", " 'then': 7065,\n", " 'five': 2926,\n", " 'book': 1084,\n", " 'series': 6283,\n", " 'mails': 4367,\n", " 'later': 4132,\n", " 'released': 5854,\n", " 'government': 3237,\n", " 'personally': 5232,\n", " 'persuaded': 5234,\n", " 'publishing': 5611,\n", " 'executives': 2680,\n", " 're': 5732,\n", " 'think': 7078,\n", " 'flat': 2941,\n", " '99': 344,\n", " 
'pricing': 5487,\n", " 'previously': 5481,\n", " 'imposed': 3652,\n", " 'amazon': 584,\n", " 'giant': 3181,\n", " 'world': 7779,\n", " 'all': 558,\n", " 'tell': 7004,\n", " 'us': 7462,\n", " 'new': 4830,\n", " 'releases': 5855,\n", " 'eroding': 2593,\n", " 'value': 7496,\n", " 'perception': 5210,\n", " 'their': 7058,\n", " 'products': 5525,\n", " 'customer': 1911,\n", " 'minds': 4619,\n", " 'they': 7072,\n", " 'do': 2262,\n", " 'want': 7619,\n", " 'this': 7082,\n", " 'practice': 5417,\n", " 'continue': 1729,\n", " 'wrote': 7801,\n", " 'one': 4972,\n", " 'email': 2473,\n", " 'james': 3898,\n", " 'murdoch': 4746,\n", " 'executive': 2679,\n", " 'at': 765,\n", " 'news': 4833,\n", " 'corp': 1777,\n", " 'which': 7699,\n", " 'owns': 5083,\n", " 'harper': 3362,\n", " 'collins': 1552,\n", " 'ceo': 1348,\n", " 'sent': 6268,\n", " 'exec': 2675,\n", " 'image': 3625,\n", " 'screengrab': 6201,\n", " 'mashablethe': 4452,\n", " 'unhappy': 7404,\n", " 'unfavorable': 7400,\n", " 'terms': 7024,\n", " 'agreed': 507,\n", " 'signed': 6404,\n", " 'plan': 5303,\n", " 'used': 7466,\n", " 'competition': 1621,\n", " 'pressure': 5469,\n", " 'into': 3821,\n", " 'changing': 1375,\n", " 'own': 5079,\n", " 'structure': 6774,\n", " 'while': 7700,\n", " 'some': 6539,\n", " 'argued': 707,\n", " 'move': 4724,\n", " 'helped': 3432,\n", " 'break': 1125,\n", " 'up': 7441,\n", " 'potential': 5404,\n", " 'monopoly': 4692,\n", " 'market': 4427,\n", " 'accused': 406,\n", " 'colluding': 1553,\n", " 'keep': 4015,\n", " 'prices': 5486,\n", " 'high': 3461,\n", " 'hachette': 3318,\n", " 'harpercollins': 3363,\n", " 'macmillan': 4355,\n", " 'penguin': 5200,\n", " 'simon': 6420,\n", " 'schuster': 6181,\n", " 'settled': 6300,\n", " 'department': 2087,\n", " 'justice': 3987,\n", " 'before': 941,\n", " 'going': 3214,\n", " 'trial': 7274,\n", " 'only': 4980,\n", " 'armed': 713,\n", " 'unwavering': 7438,\n", " 'belief': 953,\n", " 'rightness': 6014,\n", " 'courts': 1814,\n", " 'we': 7653,\n", " 'are': 700,\n", " 'ready': 5747,\n", " 'distribute': 2249,\n", " 'mandated': 4396,\n", " 'funds': 3097,\n", " 'kindle': 4046,\n", " 'customers': 1912,\n", " 'soon': 6551,\n", " 'instructed': 3783,\n", " 'forward': 3029,\n", " 'spokesperson': 6623,\n", " 'said': 6111,\n", " 'statement': 6686,\n", " 'provided': 5585,\n", " 'mashable': 4448,\n", " 'reps': 5913,\n", " 'did': 2160,\n", " 'immediately': 3637,\n", " 'respond': 5948,\n", " 'our': 5036,\n", " 'request': 5918,\n", " 'comment': 1584,\n", " 'however': 3560,\n", " 'after': 490,\n", " 'loss': 4319,\n", " '2013': 152,\n", " 'says': 6158,\n", " 'when': 7694,\n", " 'introduced': 3825,\n", " 'ibookstore': 3599,\n", " '2010': 149,\n", " 'gave': 3138,\n", " 'injecting': 3745,\n", " 'much': 4740,\n", " 'needed': 4808,\n", " 'innovation': 3753,\n", " 'breaking': 1126,\n", " 'monopolistic': 4691,\n", " 'grip': 3280,\n", " 'time': 7129,\n", " 've': 7511,\n", " 'done': 2293,\n", " 'nothing': 4887,\n", " 'wrong': 7799,\n", " 'have': 3380,\n", " 'once': 4971,\n", " 'again': 493,\n", " 'determined': 2135,\n", " 'otherwise': 5033,\n", " 'analysis': 608,\n", " 'there': 7069,\n", " 'reason': 5760,\n", " 'donald': 2289,\n", " 'trump': 7296,\n", " 'outrageous': 5052,\n", " 'statements': 6687,\n", " 'behavior': 948,\n", " 'feel': 2842,\n", " 'familiar': 2789,\n", " 'many': 4410,\n", " 'women': 7757,\n", " 'because': 933,\n", " 'know': 4065,\n", " 'his': 3484,\n", " 'declarative': 2014,\n", " 'style': 6796,\n", " 'trademark': 7221,\n", " 'shrug': 6390,\n", " 'from': 3077,\n", " 'reality': 5751,\n", " 'television': 
7002,\n", " 'or': 5005,\n", " 'political': 5350,\n", " 'debates': 1995,\n", " 'nor': 4873,\n", " 'outsized': 5055,\n", " 'role': 6051,\n", " 'american': 590,\n", " 'business': 1211,\n", " 'made': 4357,\n", " 'unforgettable': 7402,\n", " 'impression': 3655,\n", " 'them': 7061,\n", " 'targets': 6961,\n", " 'republicans': 5916,\n", " 'say': 6156,\n", " 'he': 3389,\n", " 'could': 1792,\n", " 'cost': 1786,\n", " 'party': 5150,\n", " 'everything': 2643,\n", " 'eerie': 2424,\n", " 'familiarity': 2790,\n", " 'personal': 5230,\n", " 'encountered': 2522,\n", " 'man': 4388,\n", " 'like': 4242,\n", " 'him': 3473,\n", " 'home': 3507,\n", " 'work': 7768,\n", " 'social': 6518,\n", " 'media': 4505,\n", " 'relationship': 5848,\n", " 'extols': 2739,\n", " 'virtues': 7563,\n", " 'problem': 5508,\n", " 'reducing': 5799,\n", " 'sex': 6305,\n", " 'objects': 4915,\n", " 'casts': 1314,\n", " 'himself': 3474,\n", " 'unflappable': 7401,\n", " 'blames': 1030,\n", " 'woman': 7756,\n", " 'weaknesses': 7655,\n", " 'revealed': 5982,\n", " 'insists': 3767,\n", " 'responsibility': 5954,\n", " 'denies': 2078,\n", " 'deflects': 2046,\n", " 'perhaps': 5221,\n", " 'even': 2627,\n", " 'turns': 7322,\n", " 'violent': 7557,\n", " 'wrongdoing': 7800,\n", " 'so': 6513,\n", " 'me': 4489,\n", " 'wow': 7791,\n", " 'tough': 7195,\n", " 'nobody': 4860,\n", " 'respect': 5946,\n", " 'realdonaldtrump': 5750,\n", " 'march': 4417,\n", " '26': 180,\n", " '2016': 155,\n", " 'psychological': 5595,\n", " 'warfare': 7623,\n", " 'gaslighting': 3134,\n", " 'subtle': 6807,\n", " 'form': 3013,\n", " 'emotional': 2504,\n", " 'abuse': 382,\n", " 'puts': 5637,\n", " 'victim': 7536,\n", " 'defensive': 2039,\n", " 'go': 3206,\n", " 'strategy': 6752,\n", " 'sees': 6246,\n", " 'actions': 422,\n", " 'arguably': 706,\n", " 'projects': 5540,\n", " 'voters': 7597,\n", " 'tuesday': 7305,\n", " 'florida': 2962,\n", " 'police': 5346,\n", " 'charged': 1387,\n", " 'campaign': 1250,\n", " 'manager': 4392,\n", " 'corey': 1775,\n", " 'lewandowski': 4213,\n", " 'battery': 908,\n", " 'female': 2853,\n", " 'reporter': 5899,\n", " 'issued': 3876,\n", " 'tweets': 7333,\n", " 'refuting': 5817,\n", " 'video': 7542,\n", " 'evidence': 2648,\n", " 'discrediting': 2221,\n", " 'journalist': 3960,\n", " 'implying': 3648,\n", " 'she': 6332,\n", " 'been': 939,\n", " 'dangerous': 1945,\n", " 'deadly': 1984,\n", " 'threat': 7094,\n", " 'my': 4760,\n", " 'very': 7527,\n", " 'decent': 2005,\n", " 'was': 7632,\n", " 'assaulting': 749,\n", " 'look': 4306,\n", " 'tapes': 6956,\n", " '29': 189,\n", " 'powerful': 5415,\n", " 'left': 4185,\n", " 'right': 6013,\n", " 'forcefully': 3000,\n", " 'criticized': 1863,\n", " 'reaction': 5740,\n", " 'charges': 1389,\n", " 'too': 7172,\n", " 'often': 4952,\n", " 'victims': 7538,\n", " 'violence': 7556,\n", " 'stay': 6694,\n", " 'silent': 6411,\n", " 'remarks': 5870,\n", " 'today': 7152,\n", " 'demonstrate': 2072,\n", " 'reasons': 5762,\n", " 'why': 7709,\n", " 'dawn': 1970,\n", " 'laguens': 4104,\n", " 'vice': 7533,\n", " 'president': 5458,\n", " 'planned': 5305,\n", " 'parenthood': 5130,\n", " 'action': 420,\n", " 'fund': 3094,\n", " 'underscore': 7386,\n", " 'fiction': 2867,\n", " 'supportive': 6862,\n", " 'any': 659,\n", " 'country': 1802,\n", " 'reporters': 5900,\n", " 'found': 3033,\n", " 'tape': 6954,\n", " 'facility': 2765,\n", " 'changed': 1373,\n", " 'her': 3439,\n", " 'tune': 7308,\n", " 'pic': 5265,\n", " 'twitter': 7336,\n", " 'com': 1560,\n", " 'n5815rs1at': 4764,\n", " 'touching': 7194,\n", " 'leave': 4176,\n", " 'conference': 1664,\n", " 
'hand': 3336,\n", " 'hqb8dl0fhn': 3562,\n", " 'wednesday': 7664,\n", " 'group': 3283,\n", " 'conservative': 1695,\n", " 'journalists': 3961,\n", " 'commentators': 1588,\n", " 'called': 1238,\n", " 'firing': 2916,\n", " 'track': 7216,\n", " 'record': 5783,\n", " 'making': 4381,\n", " 'defenders': 2036,\n", " 'seem': 6241,\n", " 'foolish': 2991,\n", " 'ones': 4974,\n", " 'nichole': 4845,\n", " 'bauer': 914,\n", " 'assistant': 755,\n", " 'professor': 5529,\n", " 'science': 6183,\n", " 'university': 7414,\n", " 'alabama': 541,\n", " 'response': 5951,\n", " 'classic': 1482,\n", " 'practiced': 5418,\n", " 'denying': 2084,\n", " 'previous': 5480,\n", " 'incontrovertible': 3684,\n", " 'exists': 2686,\n", " 'isn': 3871,\n", " 'exclusive': 2673,\n", " 'commentary': 1586,\n", " 'course': 1808,\n", " 'wasn': 7637,\n", " 'reference': 5803,\n", " 'menstruation': 4535,\n", " 'joked': 3950,\n", " 'fox': 3040,\n", " 'anchor': 614,\n", " 'megyn': 4522,\n", " 'kelly': 4019,\n", " 'having': 3382,\n", " 'blood': 1050,\n", " 'coming': 1579,\n", " 'wherever': 7697,\n", " 'during': 2363,\n", " 'debate': 1994,\n", " 'certainly': 1351,\n", " 'use': 7465,\n", " 'schlonged': 6177,\n", " 'vulgar': 7603,\n", " 'term': 7022,\n", " 'describing': 2104,\n", " 'badly': 865,\n", " 'pres': 5450,\n", " 'barack': 885,\n", " 'obama': 4909,\n", " 'beat': 928,\n", " 'hillary': 3470,\n", " 'clinton': 1503,\n", " '2008': 147,\n", " 'democratic': 2069,\n", " 'presidential': 5459,\n", " 'primary': 5489,\n", " 'press': 5461,\n", " 'way': 7650,\n", " 'convince': 1755,\n", " 'people': 5205,\n", " 'opposite': 5000,\n", " 'definitely': 2042,\n", " 'accepted': 388,\n", " 'among': 596,\n", " 'republican': 5915,\n", " 'mainstream': 4369,\n", " 'sort': 6556,\n", " 'outmoded': 5050,\n", " 'sexism': 6306,\n", " 'demonstrates': 2074,\n", " 'real': 5749,\n", " 'place': 5298,\n", " 'discourse': 2218,\n", " 'yet': 7832,\n", " 'instead': 3779,\n", " 'apologizing': 669,\n", " 'acknowledging': 413,\n", " 'question': 5660,\n", " 'matter': 4469,\n", " 'might': 4590,\n", " 'find': 2899,\n", " 'offensive': 4939,\n", " 'doubled': 2305,\n", " 'down': 2308,\n", " 'obfuscated': 4910,\n", " 'whether': 7698,\n", " 'coworker': 1821,\n", " 'who': 7703,\n", " 'meaning': 4493,\n", " 'sexist': 6307,\n", " 'partner': 5144,\n", " 'quick': 5664,\n", " 'end': 2525,\n", " 'argument': 709,\n", " 'using': 7471,\n", " 'word': 7765,\n", " 'telling': 7005,\n", " 'don': 2287,\n", " 'deserve': 2106,\n", " 'same': 6123,\n", " 'agency': 498,\n", " 'men': 4533,\n", " 'denials': 2076,\n", " 'may': 4477,\n", " 'make': 4376,\n", " 'appear': 674,\n", " 'unassailable': 7366,\n", " 'send': 6260,\n", " 'entirely': 2565,\n", " 'different': 2171,\n", " 'message': 4551,\n", " 'nature': 4788,\n", " 'tend': 7014,\n", " 'support': 6858,\n", " 'traditional': 7226,\n", " 'roles': 6052,\n", " 'believing': 957,\n", " 'deference': 2040,\n", " 'male': 4383,\n", " 'authority': 807,\n", " 'figure': 2879,\n", " 'husband': 3594,\n", " 'faultless': 2817,\n", " 'effectively': 2430,\n", " 'takes': 6941,\n", " 'autonomy': 815,\n", " 'away': 833,\n", " 'creating': 1840,\n", " 'climate': 1498,\n", " 'always': 580,\n", " 'explain': 2708,\n", " 'performed': 5217,\n", " 'poorly': 5363,\n", " 'survey': 6879,\n", " 'nbc': 4795,\n", " 'wsj': 7802,\n", " 'poll': 5354,\n", " '47': 242,\n", " 'cannot': 1267,\n", " 'themselves': 7064,\n", " 'voting': 7599,\n", " 'favorability': 2820,\n", " 'ratings': 5726,\n", " 'higher': 3462,\n", " '59': 282,\n", " 'amongst': 597,\n", " 'registered': 5823,\n", " 'cnn': 1520,\n", " 
'orc': 5006,\n", " 'though': 7087,\n", " 'ask': 739,\n", " 'intended': 3795,\n", " 'vote': 7594,\n", " 'don_vito_08': 2288,\n", " 'picture': 5274,\n", " 'worth': 7786,\n", " 'thousand': 7091,\n", " 'words': 7767,\n", " 'lyingted': 4347,\n", " 'nevercruz': 4828,\n", " 'melaniatrump': 4524,\n", " '5bvvewmvf8': 285,\n", " '24': 174,\n", " 'these': 7071,\n", " 'polls': 5356,\n", " 'fully': 3089,\n", " 'reflect': 5808,\n", " 'unseemly': 7429,\n", " 'attacks': 776,\n", " 'heidi': 3423,\n", " 'cruz': 1883,\n", " 'wife': 7718,\n", " 'opponent': 4995,\n", " 'sen': 6258,\n", " 'ted': 6997,\n", " 'if': 3613,\n", " 'abortion': 373,\n", " 'banned': 881,\n", " 'illegally': 3620,\n", " 'should': 6371,\n", " 'face': 2758,\n", " 'punishment': 5623,\n", " 'clarified': 1478,\n", " 'hypothetical': 3598,\n", " 'reversed': 5989,\n", " 'position': 5385,\n", " 'clear': 1488,\n", " 'willing': 7724,\n", " 'publicly': 5605,\n", " 'embarrass': 2478,\n", " 'degrade': 2050,\n", " 'consider': 1696,\n", " 'depriving': 2096,\n", " 'physical': 5263,\n", " 'freedom': 3054,\n", " 'such': 6814,\n", " 'power': 5413,\n", " 'dynamics': 2372,\n", " 'aren': 703,\n", " 'abusive': 383,\n", " 'workplaces': 7776,\n", " 'relationships': 5849,\n", " 'fathom': 2815,\n", " 'mirrors': 4637,\n", " 'private': 5500,\n", " 'hell': 3428,\n", " 'doing': 2280,\n", " 'control': 1744,\n", " 'others': 5032,\n", " 'elevate': 2454,\n", " 'insulate': 3785,\n", " 'attack': 772,\n", " 'jackie': 3890,\n", " 'white': 7702,\n", " 'emerita': 2491,\n", " 'psychology': 5597,\n", " 'senior': 6262,\n", " 'research': 5925,\n", " 'scientist': 6186,\n", " 'center': 1342,\n", " 'health': 3401,\n", " 'wellness': 7678,\n", " 'north': 4878,\n", " 'carolina': 1291,\n", " 'greensboro': 3269,\n", " 'approach': 690,\n", " 'victimizes': 7537,\n", " 'playing': 5323,\n", " 'part': 5138,\n", " 'degradation': 2049,\n", " 'manipulation': 4401,\n", " 'present': 5453,\n", " 'self': 6251,\n", " 'assured': 763,\n", " 'knowledgable': 4067,\n", " 'person': 5228,\n", " 'meanwhile': 4498,\n", " 'silences': 6410,\n", " 'depressed': 2094,\n", " 'passive': 5161,\n", " 'paralyzed': 5127,\n", " 'feelings': 2844,\n", " 'doubt': 2307,\n", " 'insecurity': 3758,\n", " 'set': 6295,\n", " 'experience': 2701,\n", " 'well': 7677,\n", " 'worries': 7782,\n", " 'tactics': 6931,\n", " 'display': 2236,\n", " 'consciously': 1690,\n", " 'subconsciously': 6800,\n", " 'lead': 4155,\n", " 'endorse': 2530,\n", " 'harmful': 3361,\n", " 'stereotypes': 6714,\n", " 'engage': 2539,\n", " 'particularly': 5141,\n", " 'threatened': 7095,\n", " 'demographic': 2071,\n", " 'change': 1372,\n", " 'evolving': 2652,\n", " 'gender': 3147,\n", " 'inspire': 3771,\n", " 'justified': 3989,\n", " 'clinging': 1501,\n", " 'deserved': 2107,\n", " 'elevated': 2455,\n", " 'status': 6693,\n", " 'makes': 4380,\n", " 'offering': 4942,\n", " 'public': 5604,\n", " 'performance': 5215,\n", " 'fallout': 2785,\n", " 'unconscionable': 7374,\n", " 'both': 1101,\n", " 'london': 4302,\n", " 'last': 4127,\n", " 'month': 4695,\n", " 'dog': 2276,\n", " 'sized': 6444,\n", " 'rabbit': 5678,\n", " 'desperate': 2117,\n", " 'need': 4807,\n", " 'under': 7379,\n", " 'atlas': 770,\n", " 'permanent': 5223,\n", " 'story': 6739,\n", " 'went': 7680,\n", " 'global': 3199,\n", " 'over': 5061,\n", " 'including': 3678,\n", " 'canada': 1255,\n", " 'france': 3043,\n", " 'started': 6679,\n", " 'reaching': 5736,\n", " 'scottish': 6195,\n", " 'society': 6520,\n", " 'prevention': 5477,\n", " 'cruelty': 1879,\n", " 'animals': 633,\n", " 'thanks': 7049,\n", " 'jen': 
3920,\n", " 'hislop': 3485,\n", " 'ayrshire': 839,\n", " 'adorable': 459,\n", " 'bunny': 1196,\n", " 'will': 7721,\n", " 'get': 3173,\n", " 'native': 4785,\n", " 'scotland': 6193,\n", " 'buggy': 1178,\n", " 'facebook': 2759,\n", " 'spcajen': 6581,\n", " 'financial': 2898,\n", " 'fraud': 3049,\n", " 'investigator': 3836,\n", " 'told': 7157,\n", " 'charity': 1391,\n", " 'burst': 1205,\n", " 'tears': 6985,\n", " 'got': 3230,\n", " 'phone': 5251,\n", " 'call': 1237,\n", " 'saying': 6157,\n", " 'had': 3322,\n", " 'chosen': 1448,\n", " 'cried': 1854,\n", " 'collected': 1547,\n", " '43': 231,\n", " 'year': 7822,\n", " 'old': 4962,\n", " 'two': 7339,\n", " 'bunnies': 1195,\n", " 'currently': 1906,\n", " 'rex': 5997,\n", " 'named': 4772,\n", " 'coconut': 1528,\n", " 'looking': 4309,\n", " 'still': 6722,\n", " 'growing': 3287,\n", " 'summer': 6839,\n", " 'house': 3552,\n", " 'heating': 3414,\n", " 'air': 525,\n", " 'conditioning': 1659,\n", " 'accommodation': 396,\n", " 'large': 4122,\n", " 'garden': 3128,\n", " 'enclosure': 2519,\n", " 'run': 6089,\n", " 'perfect': 5211,\n", " 'addition': 445,\n", " 'family': 2792,\n", " 'thing': 7076,\n", " 'name': 4771,\n", " 'hello': 3430,\n", " 'atilla': 767,\n", " 'bun': 1193,\n", " 'binky': 1003,\n", " 'master': 4458,\n", " 'jazz': 3913,\n", " 'paws': 5177,\n", " 'worry': 7783,\n", " 'you': 7836,\n", " 'can': 1254,\n", " 'atty': 789,\n", " 'short': 6360,\n", " 'digital': 2179,\n", " 'marketing': 4430,\n", " 'experts': 2707,\n", " 'must': 4754,\n", " 'diverse': 2254,\n", " 'skill': 6447,\n", " 'sophisticated': 6553,\n", " 'grasp': 3257,\n", " 'available': 816,\n", " 'channels': 1378,\n", " 'ability': 370,\n", " 'identify': 3608,\n", " 'opportunities': 4997,\n", " 'top': 7177,\n", " 'basic': 897,\n", " 'skills': 6448,\n", " 'brilliant': 1140,\n", " 'marketer': 4428,\n", " 'possess': 5389,\n", " 'balance': 871,\n", " 'critical': 1860,\n", " 'creative': 1843,\n", " 'thinking': 7079,\n", " 'order': 5008,\n", " 'drive': 2336,\n", " 'measurable': 4499,\n", " 'success': 6811,\n", " 'asked': 740,\n", " '15': 72,\n", " 'members': 4528,\n", " 'young': 7837,\n", " 'entrepreneur': 2570,\n", " 'council': 1794,\n", " 'yec': 7824,\n", " 'hiring': 3483,\n", " 'marketers': 4429,\n", " 'best': 974,\n", " 'answers': 653,\n", " 'below': 960,\n", " 'paid': 5106,\n", " 'advertising': 471,\n", " 'expertise': 2706,\n", " 'hire': 3480,\n", " 'versed': 7523,\n", " 'especially': 2595,\n", " 'similar': 6419,\n", " 'platform': 5312,\n", " 'uses': 7470,\n", " 'regularly': 5826,\n", " 'able': 371,\n", " 'understand': 7387,\n", " 'implement': 3647,\n", " 'analytics': 611,\n", " 'insights': 3764,\n", " 'create': 1837,\n", " 'lookalike': 4307,\n", " 'custom': 1910,\n", " 'audiences': 793,\n", " 'experiment': 2704,\n", " 'test': 7033,\n", " 'images': 3627,\n", " 'secure': 6234,\n", " 'knowledge': 4068,\n", " 'overall': 5062,\n", " 'landscape': 4112,\n", " 'budget': 1173,\n", " 'saving': 6153,\n", " 'within': 7750,\n", " 'space': 6576,\n", " 'sure': 6866,\n", " 'talent': 6943,\n", " 'knows': 4070,\n", " 'ins': 3757,\n", " 'outs': 5053,\n", " 'popular': 5369,\n", " 'easy': 2398,\n", " 'miles': 4599,\n", " 'jennings': 3922,\n", " 'recruiter': 5790,\n", " 'web': 7661,\n", " 'developers': 2142,\n", " 'data': 1959,\n", " 'scientists': 6187,\n", " 'ai': 514,\n", " 'hired': 3481,\n", " 'years': 7823,\n", " 'outsourced': 5056,\n", " 'successfully': 6813,\n", " 'better': 979,\n", " 'turn': 7317,\n", " 'your': 7840,\n", " 'closing': 1511,\n", " 'deals': 1989,\n", " 'directly': 2198,\n", " 'sell': 
6252,\n", " 'll': 4281,\n", " 'wasting': 7640,\n", " 'valuable': 7495,\n", " 'dollars': 2282,\n", " 'without': 7751,\n", " 'generating': 3152,\n", " 'qualified': 5646,\n", " 'team': 6982,\n", " 'mark': 4425,\n", " 'cenicola': 1341,\n", " 'bannerview': 882,\n", " 'specific': 6594,\n", " 'channel': 1376,\n", " 'extol': 2738,\n", " 'every': 2639,\n", " 'conceivable': 1646,\n", " 'seo': 6270,\n", " 'sem': 6254,\n", " 'etc': 2609,\n", " 'key': 4026,\n", " 'successful': 6812,\n", " 'focusing': 2977,\n", " 'few': 2861,\n", " 'really': 5756,\n", " 'understanding': 7388,\n", " 'deeply': 2027,\n", " 'leveraging': 4211,\n", " 'those': 7086,\n", " 'example': 2659,\n", " 'helping': 3434,\n", " 'local': 4288,\n", " 'businesses': 1212,\n", " 'pinterest': 5282,\n", " 'less': 4200,\n", " 'interesting': 3806,\n", " 'although': 578,\n", " 'drivers': 2338,\n", " 'ecommerce': 2403,\n", " 'traffic': 7230,\n", " 'rather': 5723,\n", " 'focus': 2975,\n", " 'intricate': 3823,\n", " 'client': 1495,\n", " 'google': 3225,\n", " 'maps': 4416,\n", " 'yelp': 7827,\n", " 'trevor': 7273,\n", " 'sumner': 6841,\n", " 'localvox': 4291,\n", " 'objectively': 4914,\n", " 'usually': 7475,\n", " ...}" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vect.vocabulary_" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(82, 7969)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rows are documents, columns are terms (aka \"tokens\" or \"features\")\n", "X_dtm.shape" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['ydwnm50jlu', 'ye', 'yeah', 'year', 'years', 'yec', 'yeezy', 'yellow', 'yelp', 'yep', 'yes', 'yesterday', 'yesweather', 'yet', 'yoga', 'yong', 'york', 'you', 'young', 'younger', 'youngest', 'your', 'yourself', 'youth', 'youtube', 'youtubeduck', 'yup', 'yuyuan', 'yücel', 'zach', 'zaxoqbv487', 'zero', 'zgkymde1lzewlza0l2zkl1n0yxj0dxayljq0mdvhlmpwzwpwcxrodw1ictexnxgxmtujcmujanbn', 'zgkymde1lzewlza0l2zkl1n0yxj0dxayljq0mdvhlmpwzwpwcxrodw1icteymdb4nji3iwplcwpwzw', 'zgkymde1lzewlza0l2zkl1n0yxj0dxayljq0mdvhlmpwzwpwcxrodw1icti4ohgxnjijcmujanbn', 'zgkymde1lzewlza0l2zkl1n0yxj0dxayljq0mdvhlmpwzwpwcxrodw1ictk1mhg1mzqjcmujanbn', 'zgkymde1lzewlza0l2zkl1n0yxj0dxayljq0mdvhlmpwzwpwcxrodw1ictu2mhg3ntakzqlqcgc', 'zgkymde1lzewlza0l2zkl1n0yxj0dxayljq0mdvhlmpwzwpwcxrodw1ictywmhgzmzgjcmujanbn', 'zgkymde1lzewlza0lzm1l2jpcmrfdgfudhj1lmu3zwmzlmpwzwpwcxrodw1ictexnxgxmtujcmujanbn', 'zgkymde1lzewlza0lzm1l2jpcmrfdgfudhj1lmu3zwmzlmpwzwpwcxrodw1icteymdb4nji3iwplcwpwzw', 'zgkymde1lzewlza0lzm1l2jpcmrfdgfudhj1lmu3zwmzlmpwzwpwcxrodw1icti4ohgxnjijcmujanbn', 'zgkymde1lzewlza0lzm1l2jpcmrfdgfudhj1lmu3zwmzlmpwzwpwcxrodw1ictk1mhg1mzqjcmujanbn', 'zgkymde1lzewlza0lzm1l2jpcmrfdgfudhj1lmu3zwmzlmpwzwpwcxrodw1ictu2mhg3ntakzqlqcgc', 'zgkymde1lzewlza0lzm1l2jpcmrfdgfudhj1lmu3zwmzlmpwzwpwcxrodw1ictywmhgzmzgjcmujanbn', 'zgkymde1lzewlzaxlzhhl1rttfnjcmvlblnolmnkmgjklnbuzwpwcxrodw1ictexnxgxmtujcmujanbn', 'zgkymde1lzewlzaxlzhhl1rttfnjcmvlblnolmnkmgjklnbuzwpwcxrodw1icteymdb4nji3iwplcwpwzw', 'zgkymde1lzewlzaxlzhhl1rttfnjcmvlblnolmnkmgjklnbuzwpwcxrodw1icti4ohgxnjijcmujanbn', 'zgkymde1lzewlzaxlzhhl1rttfnjcmvlblnolmnkmgjklnbuzwpwcxrodw1ictk1mhg1mzqjcmujanbn', 'zgkymde1lzewlzaxlzhhl1rttfnjcmvlblnolmnkmgjklnbuzwpwcxrodw1ictu2mhg3ntakzqlqcgc', 'zgkymde1lzewlzaxlzhhl1rttfnjcmvlblnolmnkmgjklnbuzwpwcxrodw1ictywmhgzmzgjcmujanbn']\n" ] } ], "source": [ "# 
50 features from near the end of the vocabulary\n", "print(vect.get_feature_names()[-150:-100])" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n", "        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n", "        lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", "        ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", "        strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", "        tokenizer=None, vocabulary=None)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# show vectorizer options\n", "vect" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **lowercase:** boolean, True by default\n", "- Convert all characters to lowercase before tokenizing." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(82, 8759)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vect = CountVectorizer(lowercase=False)\n", "X_dtm = vect.fit_transform(X)\n", "X_dtm.shape" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8097" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_dtm.todense()[0].argmax()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'the'" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vect.get_feature_names()[8097]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **ngram_range:** tuple (min_n, max_n)\n", "- The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used."
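,
    "\n",
    "- For example, with ngram_range=(1, 2) the phrase \"the supreme court\" yields the tokens \"the\", \"supreme\", \"court\", \"the supreme\" and \"supreme court\"; the cell below casts a wider net with ngram_range=(1, 4)."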
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(82, 115172)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# include 1-grams and 2-grams\n", "vect = CountVectorizer(ngram_range=(1, 4))\n", "X_dtm = vect.fit_transform(X)\n", "X_dtm.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['you to fly', 'you to fly your', 'you to know', 'you to know quite', 'you to robertdowneyjr', 'you to robertdowneyjr for', 'you to sit', 'you to sit back', 'you to stand', 'you to stand away', 'you to watch', 'you to watch out', 'you twisty', 'you twisty the', 'you twisty the clown', 'you ve', 'you ve created', 'you ve created for', 'you ve destroyed', 'you ve destroyed just', 'you ve done', 'you ve done this', 'you ve experienced', 'you ve experienced similar', 'you ve got', 'you ve got seven', 'you ve gotten', 'you ve gotten yourself', 'you ve made', 'you ve made that', 'you ve sown', 'you ve sown across', 'you venture', 'you venture out', 'you venture out into', 'you want', 'you want to', 'you want to create', 'you want to drive', 'you want to lock', 'you want to rent', 'you want to talk', 'you what', 'you what they', 'you what they are', 'you what they look', 'you when', 'you when done', 'you when done properly', 'you which']\n" ] } ], "source": [ "# last 50 features\n", "print(vect.get_feature_names()[-1000:-950])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict shares" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 10.000000\n", "mean 0.420094\n", "std 0.117514\n", "min 0.250000\n", "25% 0.366477\n", "50% 0.409722\n", "75% 0.500000\n", "max 0.571429\n", "dtype: float64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Default CountVectorizer\n", "vect = CountVectorizer()\n", "X_dtm = vect.fit_transform(X)\n", "\n", "# use Naive Bayes to predict the star rating\n", "nb = MultinomialNB()\n", "pd.Series(cross_val_score(nb, X_dtm, y, cv=10)).describe()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# define a function that accepts a vectorizer and calculates the accuracy\n", "def tokenize_test(vect):\n", " X_dtm = vect.fit_transform(X)\n", " print('Features: ', X_dtm.shape[1])\n", " nb = MultinomialNB()\n", " print(pd.Series(cross_val_score(nb, X_dtm, y, cv=10)).describe())" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features: 37905\n", "count 10.000000\n", "mean 0.405808\n", "std 0.087028\n", "min 0.250000\n", "25% 0.375000\n", "50% 0.375000\n", "75% 0.440476\n", "max 0.571429\n", "dtype: float64\n" ] } ], "source": [ "# include 1-grams and 2-grams\n", "vect = CountVectorizer(ngram_range=(1, 2))\n", "tokenize_test(vect)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Stopword Removal\n", "\n", "- **What:** Remove common words that will likely appear in any text\n", "- **Why:** They don't tell you much about your text\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "- **stop_words:** string {'english'}, list, or None (default)\n", "- If 'english', a built-in stop word list for English is used.\n", "- If a 
list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.\n", "- If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features: 7710\n", "count 10.000000\n", "mean 0.355411\n", "std 0.085808\n", "min 0.250000\n", "25% 0.270833\n", "50% 0.369318\n", "75% 0.415179\n", "max 0.500000\n", "dtype: float64\n" ] } ], "source": [ "# remove English stop words\n", "vect = CountVectorizer(stop_words='english')\n", "tokenize_test(vect)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "frozenset({'full', 'please', 'anyhow', 'cant', 'everyone', 'been', 'there', 'behind', 'or', 'such', 'through', 'once', 'anyone', 'becoming', 'perhaps', 'why', 'yourselves', 'call', 'her', 'twenty', 'while', 'enough', 'amoungst', 'anywhere', 'many', 'above', 'an', 'elsewhere', 'see', 'than', 'my', 'no', 'who', 'nobody', 'do', 'thereafter', 'always', 'down', 'wherever', 'due', 'empty', 'hereby', 'others', 'become', 'well', 'last', 'afterwards', 'during', 'co', 'be', 'almost', 'on', 'are', 'same', 'must', 'another', 'into', 'name', 'nothing', 'itself', 'every', 'back', 'beforehand', 'hasnt', 'may', 're', 'should', 'though', 'towards', 'our', 'could', 'upon', 'thin', 'here', 'you', 'together', 'onto', 'none', 'became', 'myself', 'third', 'throughout', 'its', 'will', 'noone', 'six', 'has', 'thus', 'somehow', 'among', 'seems', 'serious', 'mostly', 'when', 'done', 'between', 'out', 'someone', 'two', 'me', 'put', 'everything', 'thick', 'one', 'and', 'because', 'by', 'often', 'although', 'this', 'go', 'seemed', 'hers', 'most', 'until', 'herein', 'very', 'bill', 'couldnt', 'what', 'whatever', 'wherein', 'which', 'whole', 'top', 'else', 'whereby', 'give', 'fifty', 'where', 'about', 'former', 'a', 'eight', 'front', 'beyond', 'hence', 'show', 'his', 'might', 'take', 'per', 'four', 'so', 'whereas', 'him', 'whose', 'fill', 'all', 'nevertheless', 'con', 'their', 'some', 'sometimes', 'but', 'have', 'he', 'himself', 'latter', 'these', 'we', 'etc', 'ever', 'whom', 'had', 'ourselves', 'interest', 'how', 'still', 'toward', 'that', 'whether', 'somewhere', 'find', 'those', 'whenever', 'am', 'hereupon', 'from', 'cannot', 'own', 'ie', 'namely', 'something', 'your', 'already', 'yourself', 'eleven', 'yours', 'everywhere', 'found', 'next', 'other', 'thru', 'detail', 'side', 'themselves', 'below', 'whereafter', 'becomes', 'would', 'was', 'besides', 'bottom', 'mine', 'inc', 'whoever', 'except', 'seeming', 'before', 'first', 'sometime', 'to', 'ten', 'for', 'even', 'being', 'across', 'ltd', 'whereupon', 'amongst', 'i', 'anything', 'both', 'rather', 'the', 'alone', 'thereupon', 'least', 'system', 'un', 'now', 'moreover', 'then', 'she', 'indeed', 'only', 'de', 'as', 'fifteen', 'them', 'yet', 'neither', 'otherwise', 'after', 'hundred', 'too', 'either', 'therefore', 'keep', 'along', 'beside', 'five', 'hereafter', 'anyway', 'however', 'latterly', 'describe', 'much', 'over', 'meanwhile', 'part', 'us', 'whither', 'within', 'not', 'fire', 'sixty', 'thereby', 'less', 'therein', 'mill', 'cry', 'of', 'amount', 'twelve', 'also', 'few', 'off', 'under', 'more', 'further', 'they', 'three', 'forty', 'move', 'nor', 'since', 'without', 'in', 'nowhere', 'any', 'sincere', 
'with', 'were', 'can', 'eg', 'formerly', 'again', 'herself', 'nine', 'up', 'each', 'made', 'whence', 'if', 'never', 'via', 'it', 'ours', 'seem', 'at', 'is', 'get', 'around', 'thence', 'several', 'against'})\n" ] } ], "source": [ "# set of stop words\n", "print(vect.get_stop_words())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Other CountVectorizer Options\n", "\n", "- **max_features:** int or None, default=None\n", "- If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features: 100\n", "count 10.000000\n", "mean 0.375126\n", "std 0.168480\n", "min 0.125000\n", "25% 0.250000\n", "50% 0.401786\n", "75% 0.486111\n", "max 0.625000\n", "dtype: float64\n" ] } ], "source": [ "# remove English stop words and only keep 100 features\n", "vect = CountVectorizer(stop_words='english', max_features=100)\n", "tokenize_test(vect)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['01', '10', '11', '15', '1cd', '2015', '2016', '28', 'article', 'australian', 'author', 'best', 'big', 'business', 'campaign', 'com', 'company', 'conversion', 'cystic', 'daniel', 'day', 'description', 'digital', 'don', 'downey', 'entertainment', 'facebook', 'false', 'fibrosis', 'function', 'good', 'hot', 'http', 'https', 'image', 'initpage', 'instagram', 'internal', 'iron', 'jpg', 'jr', 'js', 'just', 'know', 'life', 'like', 'make', 'man', 'marketing', 'mashable', 'media', 'movie', 'movies', 'mshcdn', 'new', 'null', 'oct', 'og', 'old', 'open', 'paris', 'people', 'photo', 'pic', 'platform', 'police', 'posted', 'premiere', 'pu', 'rack', 'rdj', 'return', 'rights', 'rising', 'robert', 'said', 'sailthru', 'says', 'season', 'short_url', 'state', 'time', 'timer', 'title', 'topics', 'travel', 'true', 'trump', 'twitter', 'twttr', 'uncategorized', 'url', 've', 'watercooler', 'way', 'window', 'work', 'world', 'year', 'years']\n" ] } ], "source": [ "# all 100 features\n", "print(vect.get_feature_names())" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features: 1000\n", "count 10.000000\n", "mean 0.405574\n", "std 0.130813\n", "min 0.250000\n", "25% 0.270833\n", "50% 0.414773\n", "75% 0.500000\n", "max 0.571429\n", "dtype: float64\n" ] } ], "source": [ "# include 1-grams and 2-grams, and limit the number of features\n", "vect = CountVectorizer(ngram_range=(1, 2), max_features=1000)\n", "tokenize_test(vect)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **min_df:** float in range [0.0, 1.0] or int, default=1\n", "- When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts." 
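,
    "\n",
    "- For example, min_df=2 ignores any term that appears in only one of the 82 articles, while a float such as min_df=0.05 would require a term to appear in at least 5% of them."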
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features: 7620\n", "count 10.000000\n", "mean 0.407594\n", "std 0.141763\n", "min 0.125000\n", "25% 0.366477\n", "50% 0.409722\n", "75% 0.500000\n", "max 0.571429\n", "dtype: float64\n" ] } ], "source": [ "# include 1-grams and 2-grams, and only include terms that appear at least 2 times\n", "vect = CountVectorizer(ngram_range=(1, 2), min_df=2)\n", "tokenize_test(vect)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Stemming and Lemmatization\n", "\n", "**Stemming:**\n", "\n", "- **What:** Reduce a word to its base/stem/root form\n", "- **Why:** Often makes sense to treat related words the same way\n", "- **Notes:**\n", " - Uses a \"simple\" and fast rule-based approach\n", " - Stemmed words are usually not shown to users (used for analysis/indexing)\n", " - Some search engines treat words with the same stem as synonyms" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "# initialize stemmer\n", "stemmer = SnowballStemmer('english')\n", "\n", "# words" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n", " dtype=, encoding='utf-8', input='content',\n", " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", " tokenizer=None, vocabulary=None)" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vect = CountVectorizer()\n", "vect.fit(X)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "words = list(vect.vocabulary_.keys())[:100]" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['appl', 'long', 'and', 'controversi', 'ebook', 'case', 'has', 'reach', 'it', 'final', 'chapter', 'it', 'not', 'the', 'happi', 'end', 'compani', 'want', 'suprem', 'court', 'on', 'monday', 'reject', 'an', 'appeal', 'file', 'by', 'to', 'overturn', 'sting', 'rule', 'that', 'led', 'broad', 'conspiraci', 'with', 'sever', 'major', 'publish', 'fix', 'price', 'of', 'book', 'sold', 'through', 'onlin', 'bookstor', 'decis', 'mean', 'now', 'no', 'choic', 'but', 'pay', 'out', '400', 'million', 'consum', 'addit', '50', 'in', 'legal', 'fee', 'accord', 'origin', 'settlement', '2014', 'see', 'also', 'here', 'how', 'marshal', 'entir', 'tech', 'industri', 'fight', 'fbi', 'for', 'verdict', 'is', 'more', 'damag', 'reput', 'as', 'consum', 'friend', 'brand', 'mention', 'legaci', 'belov', 'founder', 'steve', 'job', 'than', 'actual', 'bottom', 'line', 'put', 'fine', 'context']\n" ] } ], "source": [ "# stem each word\n", "print([stemmer.stem(word) for word in words])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Lemmatization**\n", "\n", "- **What:** Derive the canonical form ('lemma') of a word\n", "- **Why:** Can be better than stemming\n", "- **Notes:** Uses a dictionary-based approach (slower than stemming)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "from nltk.stem import WordNetLemmatizer\n", "wordnet_lemmatizer = WordNetLemmatizer()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "[nltk_data] Downloading package wordnet to /home/al/nltk_data...\n", "[nltk_data] Package wordnet is already up-to-date!\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import nltk\n", "nltk.download('wordnet')" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['apple', 'long', 'and', 'controversial', 'ebook', 'case', 'ha', 'reached', 'it', 'final', 'chapter', 'it', 'not', 'the', 'happy', 'ending', 'company', 'wanted', 'supreme', 'court', 'on', 'monday', 'rejected', 'an', 'appeal', 'filed', 'by', 'to', 'overturn', 'stinging', 'ruling', 'that', 'led', 'broad', 'conspiracy', 'with', 'several', 'major', 'publisher', 'fix', 'price', 'of', 'book', 'sold', 'through', 'online', 'bookstore', 'decision', 'mean', 'now', 'no', 'choice', 'but', 'pay', 'out', '400', 'million', 'consumer', 'additional', '50', 'in', 'legal', 'fee', 'according', 'original', 'settlement', '2014', 'see', 'also', 'here', 'how', 'marshalled', 'entire', 'tech', 'industry', 'fight', 'fbi', 'for', 'verdict', 'is', 'more', 'damaging', 'reputation', 'a', 'consumer', 'friendly', 'brand', 'mention', 'legacy', 'beloved', 'founder', 'steve', 'job', 'than', 'actual', 'bottom', 'line', 'put', 'fine', 'context']\n" ] } ], "source": [ "# assume every word is a noun\n", "print([wordnet_lemmatizer.lemmatize(word) for word in words])" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['apple', 'long', 'and', 'controversial', 'ebook', 'case', 'have', 'reach', 'its', 'final', 'chapter', 'it', 'not', 'the', 'happy', 'end', 'company', 'want', 'supreme', 'court', 'on', 'monday', 'reject', 'an', 'appeal', 'file', 'by', 'to', 'overturn', 'sting', 'rule', 'that', 'lead', 'broad', 'conspiracy', 'with', 'several', 'major', 'publishers', 'fix', 'price', 'of', 'book', 'sell', 'through', 'online', 'bookstore', 'decision', 'mean', 'now', 'no', 'choice', 'but', 'pay', 'out', '400', 'million', 'consumers', 'additional', '50', 'in', 'legal', 'fee', 'accord', 'original', 'settlement', '2014', 'see', 'also', 'here', 'how', 'marshal', 'entire', 'tech', 'industry', 'fight', 'fbi', 'for', 'verdict', 'be', 'more', 'damage', 'reputation', 'as', 'consumer', 'friendly', 'brand', 'mention', 'legacy', 'beloved', 'founder', 'steve', 'job', 'than', 'actual', 'bottom', 'line', 'put', 'fine', 'context']\n" ] } ], "source": [ "# assume every word is a verb\n", "print([wordnet_lemmatizer.lemmatize(word,pos='v') for word in words])" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# define a function that accepts text and returns a list of lemmas\n", "def split_into_lemmas(text):\n", " text = text.lower()\n", " words = text.split()\n", " return [wordnet_lemmatizer.lemmatize(word) for word in words]" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features: 10208\n", "count 10.000000\n", "mean 0.423990\n", "std 0.112463\n", "min 0.250000\n", "25% 0.375000\n", "50% 0.436508\n", "75% 0.500000\n", "max 0.571429\n", "dtype: float64\n" ] } ], "source": [ "# use split_into_lemmas as the feature extraction function (WARNING: SLOW!)\n", "vect = CountVectorizer(analyzer=split_into_lemmas)\n", "tokenize_test(vect)" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "# Term Frequency-Inverse Document Frequency (TF-IDF)\n", "\n", "- **What:** Computes \"relative frequency\" that a word appears in a document compared to its frequency across all documents\n", "- **Why:** More useful than \"term frequency\" for identifying \"important\" words in each document (high frequency in that document, low frequency in other documents)\n", "- **Notes:** Used for search engine scoring, text summarization, document clustering" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "# example documents\n", "simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cabcallmepleasetonightyou
0010011
1111000
2011200
\n", "
" ], "text/plain": [ " cab call me please tonight you\n", "0 0 1 0 0 1 1\n", "1 1 1 1 0 0 0\n", "2 0 1 1 2 0 0" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Term Frequency\n", "vect = CountVectorizer()\n", "tf = pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())\n", "tf" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cabcallmepleasetonightyou
0132111
\n", "
" ], "text/plain": [ " cab call me please tonight you\n", "0 1 3 2 1 1 1" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Document Frequency\n", "vect = CountVectorizer(binary=True)\n", "df_ = vect.fit_transform(simple_train).toarray().sum(axis=0)\n", "pd.DataFrame(df_.reshape(1, 6), columns=vect.get_feature_names())" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cabcallmepleasetonightyou
00.00.3333330.00.01.01.0
11.00.3333330.50.00.00.0
20.00.3333330.52.00.00.0
\n", "
" ], "text/plain": [ " cab call me please tonight you\n", "0 0.0 0.333333 0.0 0.0 1.0 1.0\n", "1 1.0 0.333333 0.5 0.0 0.0 0.0\n", "2 0.0 0.333333 0.5 2.0 0.0 0.0" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Term Frequency-Inverse Document Frequency (simple version)\n", "tf/df_" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cabcallmepleasetonightyou
00.0000000.3853720.0000000.0000000.6524910.652491
10.7203330.4254410.5478320.0000000.0000000.000000
20.0000000.2660750.3426200.9010080.0000000.000000
\n", "
" ], "text/plain": [ " cab call me please tonight you\n", "0 0.000000 0.385372 0.000000 0.000000 0.652491 0.652491\n", "1 0.720333 0.425441 0.547832 0.000000 0.000000 0.000000\n", "2 0.000000 0.266075 0.342620 0.901008 0.000000 0.000000" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# TfidfVectorizer\n", "vect = TfidfVectorizer()\n", "pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**More details:** [TF-IDF is about what matters](http://planspace.org/20150524-tfidf_is_about_what_matters/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Using TF-IDF to Summarize a text\n" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(82, 7710)" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a document-term matrix using TF-IDF\n", "vect = TfidfVectorizer(stop_words='english')\n", "dtm = vect.fit_transform(X)\n", "features = vect.get_feature_names()\n", "dtm.shape" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "# choose a random text\n", "review_id = 40\n", "review_text = X[review_id]\n", "review_length = len(review_text)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "# create a dictionary of words and their TF-IDF scores\n", "word_scores = {}\n", "for word in vect.vocabulary_.keys():\n", " word = word.lower()\n", " if word in features:\n", " word_scores[word] = dtm[review_id, features.index(word)]" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TOP SCORING WORDS:\n", "sanders\n", "iowa\n", "precinct\n", "coin\n", "des\n" ] } ], "source": [ "# print words with the top 5 TF-IDF scores\n", "print('TOP SCORING WORDS:')\n", "top_scores = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:5]\n", "for word, score in top_scores:\n", " print(word)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RANDOM WORDS:\n", "fann\n", "simplereach\n", "hey\n", "dubai\n", "28z\n" ] } ], "source": [ "# print 5 random words\n", "print('\\n' + 'RANDOM WORDS:')\n", "random_words = np.random.choice(list(word_scores.keys()), size=5, replace=False)\n", "for word in random_words:\n", " print(word)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conclusion\n", "\n", "- NLP is a gigantic field\n", "- Understanding the basics broadens the types of data you can work with\n", "- Simple techniques go a long way\n", "- Use scikit-learn for NLP whenever possible" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" }, "name": "_merged" }, "nbformat": 4, "nbformat_minor": 2 }