{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Scattertext\n", "\n", "## @jasonkessler\n", "\n", "https://github.com/JasonKessler/scattertext\n", "\n", "\n", "\n", "Cite as:\n", "Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL): System Demonstrations. 2017.\n", "\n", "Link to preprint: https://arxiv.org/abs/1703.00565\n", "\n", "`\n", "@article{kessler2017scattertext,\n", " author = {Kessler, Jason S.},\n", " title = {Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ},\n", " booktitle = {Proceedings of ACL-2017 System Demonstrations},\n", " year = {2017},\n", " address = {Vancouver, Canada},\n", " publisher = {Association for Computational Linguistics},\n", "}\n", "`" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "import scattertext as st\n", "import re, io\n", "from pprint import pprint\n", "import pandas as pd\n", "import numpy as np\n", "from scipy.stats import rankdata, hmean, norm\n", "import spacy.en\n", "import os, pkgutil, json, urllib\n", "from urllib.request import urlopen\n", "from IPython.display import IFrame\n", "from IPython.core.display import display, HTML\n", "from scattertext import CorpusFromPandas, produce_scattertext_explorer\n", "display(HTML(\"\"))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "nlp = spacy.en.English()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Grab the 2012 political convention data set and preview it" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "convention_df = st.SampleCorpora.ConventionData2012.get_data()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "party democrat\n", "speaker BARACK OBAMA\n", "text Thank you. Thank you. Thank you. Thank you so ...\n", "Name: 0, dtype: object" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "convention_df.iloc[0]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Document Count\n", "party\n", "democrat 123\n", "republican 66\n", "Name: text, dtype: int64\n", "Word Count\n" ] } ], "source": [ "print(\"Document Count\")\n", "print(convention_df.groupby('party')['text'].count())\n", "print(\"Word Count\")\n", "convention_df.groupby('party').apply(lambda x: x.text.apply(lambda x: len(x.split())).sum())\n", "convention_df['parsed'] = convention_df.text.apply(nlp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Turn it into a Scattertext corpus, and have spaCy parse it." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "corpus = st.CorpusFromParsedDocuments(convention_df, category_col='party', parsed_col='parsed').build()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Scattertext has some functions to find how associated words are with categories\n", "## Lots of ways to do this. 
I'm partial to a novel technique called Scaled F-Score\n", "# Intuition:\n", "### Associated terms have a *relatively* high category-specific precision and recall\n", "### F-score is the harmonic mean of precision and recall" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## We can calculate the Democratic precision and recall of each term\n", "### - Given the roughly balanced class labels and Zipf's law, precisions (around 50%) will be much higher than recalls\n", "### - Typically < 1% of documents contain a particular word\n", "### - This throws off the harmonic mean, favoring frequent terms" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
democrat freqrepublican freqdem_precisiondem_recalldem_f_score
term
the340225320.5733060.0223430.043009
and270922330.5481590.0177910.034464
to234016670.5839780.0153680.029948
a160213450.5436040.0105210.020643
of156913770.5325870.0103040.020218
that140010510.5711950.0091950.018098
we131811460.5349030.0086560.017036
in12919860.5669740.0084790.016708
i10988510.5633660.0072110.014240
's10376310.6217030.0068110.013473
\n", "
" ], "text/plain": [ " democrat freq republican freq dem_precision dem_recall dem_f_score\n", "term \n", "the 3402 2532 0.573306 0.022343 0.043009\n", "and 2709 2233 0.548159 0.017791 0.034464\n", "to 2340 1667 0.583978 0.015368 0.029948\n", "a 1602 1345 0.543604 0.010521 0.020643\n", "of 1569 1377 0.532587 0.010304 0.020218\n", "that 1400 1051 0.571195 0.009195 0.018098\n", "we 1318 1146 0.534903 0.008656 0.017036\n", "in 1291 986 0.566974 0.008479 0.016708\n", "i 1098 851 0.563366 0.007211 0.014240\n", "'s 1037 631 0.621703 0.006811 0.013473" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "term_freq_df = corpus.get_term_freq_df()\n", "term_freq_df['dem_precision'] = term_freq_df['democrat freq'] * 1./(term_freq_df['democrat freq'] + term_freq_df['republican freq'])\n", "term_freq_df['dem_recall'] = term_freq_df['democrat freq'] * 1./term_freq_df['democrat freq'].sum()\n", "term_freq_df['dem_f_score'] = term_freq_df.apply(lambda x: (hmean([x['dem_precision'], x['dem_recall']])\n", " if x['dem_precision'] > 0 and x['dem_recall'] > 0 \n", " else 0), axis=1) \n", "term_freq_df.sort_values(by='dem_f_score', ascending=False).iloc[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution:\n", "### Take the normal CDF of precision and recall scores, which will fall between 0 and 1, which scales and standardizes both scores." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
democrat freqrepublican freqdem_precisiondem_recalldem_f_scoredem_precision_normcdfdem_recall_normcdfdem_scaled_f_score
term
middle class148180.8915660.0009720.0019420.7697621.0000000.869905
auto3701.0000000.0002430.0004860.8360100.8893070.861835
fair4530.9375000.0002960.0005910.7994850.9339620.861507
insurance5460.9000000.0003550.0007090.7753970.9659590.860251
forward105160.8677690.0006900.0013780.7534430.9998580.859334
president barack4740.9215690.0003090.0006170.7894470.9425720.859241
class161250.8655910.0010570.0021120.7519191.0000000.858395
middle164270.8586390.0010770.0021510.7470211.0000000.855194
the middle98170.8521740.0006440.0012860.7424220.9996400.852041
medicare84150.8484850.0005520.0011030.7397780.9980500.849722
\n", "
" ], "text/plain": [ " democrat freq republican freq dem_precision dem_recall \\\n", "term \n", "middle class 148 18 0.891566 0.000972 \n", "auto 37 0 1.000000 0.000243 \n", "fair 45 3 0.937500 0.000296 \n", "insurance 54 6 0.900000 0.000355 \n", "forward 105 16 0.867769 0.000690 \n", "president barack 47 4 0.921569 0.000309 \n", "class 161 25 0.865591 0.001057 \n", "middle 164 27 0.858639 0.001077 \n", "the middle 98 17 0.852174 0.000644 \n", "medicare 84 15 0.848485 0.000552 \n", "\n", " dem_f_score dem_precision_normcdf dem_recall_normcdf \\\n", "term \n", "middle class 0.001942 0.769762 1.000000 \n", "auto 0.000486 0.836010 0.889307 \n", "fair 0.000591 0.799485 0.933962 \n", "insurance 0.000709 0.775397 0.965959 \n", "forward 0.001378 0.753443 0.999858 \n", "president barack 0.000617 0.789447 0.942572 \n", "class 0.002112 0.751919 1.000000 \n", "middle 0.002151 0.747021 1.000000 \n", "the middle 0.001286 0.742422 0.999640 \n", "medicare 0.001103 0.739778 0.998050 \n", "\n", " dem_scaled_f_score \n", "term \n", "middle class 0.869905 \n", "auto 0.861835 \n", "fair 0.861507 \n", "insurance 0.860251 \n", "forward 0.859334 \n", "president barack 0.859241 \n", "class 0.858395 \n", "middle 0.855194 \n", "the middle 0.852041 \n", "medicare 0.849722 " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#term_freq_df['dem_precision_pctl'] = rankdata(term_freq_df['dem_precision'])*1./len(term_freq_df)\n", "#term_freq_df['dem_recall_pctl'] = rankdata(term_freq_df['dem_recall'])*1./len(term_freq_df)\n", "def normcdf(x):\n", " return norm.cdf(x, x.mean(), x.std())\n", "term_freq_df['dem_precision_normcdf'] = normcdf(term_freq_df['dem_precision'])\n", "term_freq_df['dem_recall_normcdf'] = normcdf(term_freq_df['dem_recall'])\n", "term_freq_df['dem_scaled_f_score'] = hmean([term_freq_df['dem_precision_normcdf'], term_freq_df['dem_recall_normcdf']])\n", "term_freq_df.sort_values(by='dem_scaled_f_score', ascending=False).iloc[:10]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
democrat freqrepublican freqRepublican ScoreDemocratic Scoredem_corner_score
term
auto3700.00.7735670.227781
america forward2800.00.7401000.227870
insurance companies2400.00.7217650.227934
auto industry2400.00.7217650.227934
pell2300.00.7168240.227961
last week2200.00.7117350.227990
pell grants2100.00.7064980.228024
platform2000.00.7011100.228059
women 's2000.00.7011100.228059
coverage1800.00.6898770.228159
\n", "
" ], "text/plain": [ " democrat freq republican freq Republican Score \\\n", "term \n", "auto 37 0 0.0 \n", "america forward 28 0 0.0 \n", "insurance companies 24 0 0.0 \n", "auto industry 24 0 0.0 \n", "pell 23 0 0.0 \n", "last week 22 0 0.0 \n", "pell grants 21 0 0.0 \n", "platform 20 0 0.0 \n", "women 's 20 0 0.0 \n", "coverage 18 0 0.0 \n", "\n", " Democratic Score dem_corner_score \n", "term \n", "auto 0.773567 0.227781 \n", "america forward 0.740100 0.227870 \n", "insurance companies 0.721765 0.227934 \n", "auto industry 0.721765 0.227934 \n", "pell 0.716824 0.227961 \n", "last week 0.711735 0.227990 \n", "pell grants 0.706498 0.228024 \n", "platform 0.701110 0.228059 \n", "women 's 0.701110 0.228059 \n", "coverage 0.689877 0.228159 " ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "term_freq_df['dem_corner_score'] = corpus.get_rudder_scores('democrat')\n", "term_freq_df.sort_values(by='dem_corner_score', ascending=True).iloc[:10]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top 10 Democratic terms\n", "['auto',\n", " 'america forward',\n", " 'fought for',\n", " 'fair',\n", " 'insurance companies',\n", " 'auto industry',\n", " 'president barack',\n", " 'pell',\n", " 'fighting for',\n", " 'last week']\n", "Top 10 Republican terms\n", "['unemployment',\n", " 'do better',\n", " 'liberty',\n", " 'olympics',\n", " 'built it',\n", " 'reagan',\n", " 'it has',\n", " 'ann',\n", " 'big government',\n", " 'story of']\n" ] } ], "source": [ "term_freq_df = corpus.get_term_freq_df()\n", "term_freq_df['Republican Score'] = corpus.get_scaled_f_scores('republican')\n", "term_freq_df['Democratic Score'] = corpus.get_scaled_f_scores('democrat')\n", "print(\"Top 10 Democratic terms\")\n", "pprint(list(term_freq_df.sort_values(by='Democratic Score', ascending=False).index[:10]))\n", "print(\"Top 10 Republican terms\")\n", "pprint(list(term_freq_df.sort_values(by='Republican Score', ascending=False).index[:10]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Make and visualize chart, scale based on raw frequency.\n", "### - A word used 10 times by Republicans will be at position 10 on the on the x-axis \n", "### - This isn't very useful. 
Everything but the most frequent terms are squished into the lower-left corner\n", "### - The corner-distance scores are largely stopwords\n", "### - By default, color words by Scaled F-Score" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "html = produce_scattertext_explorer(corpus,\n", "                                    category='democrat',\n", "                                    category_name='Democratic',\n", "                                    not_category_name='Republican',\n", "                                    width_in_pixels=1000,\n", "                                    minimum_term_frequency=5,\n", "                                    pmi_filter_thresold=4,\n", "                                    transform=st.Scalers.scale,\n", "                                    metadata=convention_df['speaker'])\n", "file_name = 'Conventions2012ScattertextScale.html'\n", "open(file_name, 'wb').write(html.encode('utf-8'))\n", "IFrame(src=file_name, width = 1200, height=700)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using log scales seems to help a bit, but blank space and stop words still dominate the graph\n", "### The characteristic terms look much more informative" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "html = st.produce_scattertext_explorer(corpus,\n", "                                       category='democrat',\n", "                                       category_name='Democratic',\n", "                                       not_category_name='Republican',\n", "                                       minimum_term_frequency=5,\n", "                                       pmi_filter_thresold=4,\n", "                                       width_in_pixels=1000,\n", "                                       transform=st.Scalers.log_scale_standardize)\n", "file_name = 'Conventions2012ScattertextLog.html'\n", "open(file_name, 'wb').write(html.encode('utf-8'))\n", "IFrame(src=file_name, width = 1200, height=700)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Rank terms by frequency percentiles instead of raw frequencies.\n", "### A term at the middle of the x-axis will be mentioned by Republicans at the median frequency.\n", "### This nicely distributes terms throughout the space.\n", "### But terms occurring with the same frequencies in both classes are stacked atop each other.\n", "### You can't mouse over points that aren't at the top of a stack (see the sketch below)."
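, "\n", "Below is a minimal sketch of what a percentile (rank-based) transform does, built on `scipy.stats.rankdata`; it is an illustration, not necessarily the exact implementation behind `st.Scalers.percentile`. Tied counts receive identical ranks, which is why equally frequent terms land on the same coordinate:\n", "```python\n", "from scipy.stats import rankdata\n", "\n", "def percentile_transform(freqs):\n", "    # tied frequencies share a rank, so equally frequent terms share a coordinate\n", "    return rankdata(freqs) / len(freqs)\n", "\n", "x_coords = percentile_transform(term_freq_df['republican freq'].values)\n", "y_coords = percentile_transform(term_freq_df['democrat freq'].values)\n", "```"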
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "html = produce_scattertext_explorer(corpus,\n", " category='democrat',\n", " category_name='Democratic',\n", " not_category_name='Republican',\n", " width_in_pixels=1000,\n", " minimum_term_frequency=5,\n", " pmi_filter_thresold=4, \n", " transform=st.Scalers.percentile,\n", " metadata=convention_df['speaker'])\n", "file_name = 'Conventions2012ScattertextRankData.html'\n", "open(file_name, 'wb').write(html.encode('utf-8'))\n", "IFrame(src=file_name, width = 1200, height=700)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# One solution is to randomly jitter each point\n", "## Points don't leave enough space for many labels\n", "## Top terms laregely result of jitter" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "html = produce_scattertext_explorer(corpus,\n", " category='democrat',\n", " category_name='Democratic',\n", " not_category_name='Republican',\n", " width_in_pixels=1000,\n", " jitter=0.1,\n", " minimum_term_frequency=5,\n", " pmi_filter_thresold=4,\n", " transform=st.Scalers.percentile,\n", " metadata=convention_df['speaker'])\n", "file_name = 'Conventions2012ScattertextRankDataJitter.html'\n", "open(file_name, 'wb').write(html.encode('utf-8'))\n", "IFrame(src=file_name, width = 1200, height=700)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The preferred solution is to fall back to alphabetic order among equally frequent terms\n", "## Lets you mouseover all points\n", "## Leaves a bit of room for labels\n", "## Top points may be slightly distorted" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "html = produce_scattertext_explorer(corpus,\n", " category='democrat',\n", " category_name='Democratic',\n", " not_category_name='Republican',\n", " width_in_pixels=1000,\n", " minimum_term_frequency=5,\n", " pmi_filter_thresold=4,\n", " metadata=convention_df['speaker'],\n", " term_significance = st.LogOddsRatioUninformativeDirichletPrior())\n", "file_name = 'Conventions2012ScattertextRankDefault.html'\n", "open(file_name, 'wb').write(html.encode('utf-8'))\n", "IFrame(src=file_name, width = 1200, height=700)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Scattertext can also be used for alternative visualizations\n", "## Visualize L2-penalized logistic regression coefficients vs. log term frequency\n", "Similar to Monroe et al. (2008).\n", "\n", "Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin’ words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "def scale(ar): \n", " return (ar - ar.min()) / (ar.max() - ar.min())\n", "\n", "def zero_centered_scale(ar):\n", " ar[ar > 0] = scale(ar[ar > 0])\n", " ar[ar < 0] = -scale(-ar[ar < 0])\n", " return (ar + 1) / 2.\n", "\n", "frequencies_scaled = scale(np.log(term_freq_df.sum(axis=1).values))\n", "scores = corpus.get_logreg_coefs('democrat',\n", " LogisticRegression(penalty='l2', C=10, max_iter=10000, n_jobs=-1))\n", "scores_scaled = zero_centered_scale(scores)\n", "\n", "html = produce_scattertext_explorer(corpus,\n", " category='democrat',\n", " category_name='Democratic',\n", " not_category_name='Republican',\n", " minimum_term_frequency=5,\n", " pmi_filter_thresold=4,\n", " width_in_pixels=1000,\n", " x_coords=frequencies_scaled,\n", " y_coords=scores_scaled,\n", " scores=scores,\n", " sort_by_dist=False,\n", " metadata=convention_df['speaker'],\n", " x_label='Log frequency',\n", " y_label='L2-penalized logistic regression coef')\n", "file_name = 'L2vsLog.html'\n", "open(file_name, 'wb').write(html.encode('utf-8'))\n", "IFrame(src=file_name, width = 1200, height=700)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [Root]", "language": "python", "name": "Python [Root]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }