{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Scattertext\n", "\n", "## @jasonkessler\n", "\n", "https://github.com/JasonKessler/scattertext\n", "\n", "\n", "\n", "Cite as:\n", "Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL): System Demonstrations. 2017.\n", "\n", "Link to preprint: https://arxiv.org/abs/1703.00565\n", "\n", "`\n", "@article{kessler2017scattertext,\n", " author = {Kessler, Jason S.},\n", " title = {Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ},\n", " booktitle = {Proceedings of ACL-2017 System Demonstrations},\n", " year = {2017},\n", " address = {Vancouver, Canada},\n", " publisher = {Association for Computational Linguistics},\n", "}\n", "`" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "import scattertext as st\n", "import re, io\n", "from pprint import pprint\n", "import pandas as pd\n", "import numpy as np\n", "from scipy.stats import rankdata, hmean, norm\n", "import spacy.en\n", "import os, pkgutil, json, urllib\n", "from urllib.request import urlopen\n", "from IPython.display import IFrame\n", "from IPython.core.display import display, HTML\n", "from scattertext import CorpusFromPandas, produce_scattertext_explorer\n", "display(HTML(\"\"))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "nlp = spacy.en.English()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Grab the 2012 political convention data set and preview it" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "convention_df = st.SampleCorpora.ConventionData2012.get_data()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "party democrat\n", "speaker BARACK OBAMA\n", "text Thank you. Thank you. Thank you. Thank you so ...\n", "Name: 0, dtype: object" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "convention_df.iloc[0]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Document Count\n", "party\n", "democrat 123\n", "republican 66\n", "Name: text, dtype: int64\n", "Word Count\n" ] } ], "source": [ "print(\"Document Count\")\n", "print(convention_df.groupby('party')['text'].count())\n", "print(\"Word Count\")\n", "convention_df.groupby('party').apply(lambda x: x.text.apply(lambda x: len(x.split())).sum())\n", "convention_df['parsed'] = convention_df.text.apply(nlp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Turn it into a Scattertext corpus, and have spaCy parse it." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "corpus = st.CorpusFromParsedDocuments(convention_df, category_col='party', parsed_col='parsed').build()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Scattertext has some functions to find how associated words are with categories\n", "## Lots of ways to do this. 
I'm partial to a novel technique called Scaled F-Score\n", "# Intuition:\n", "### Associated terms have a *relatively* high category-specific precision and recall\n", "### F-score is the harmonic mean of precision and recall" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## We can calculate the Democratic precision and recall of each term\n", "### - Given the roughly balanced class labels and Zipf's law, precisions (around 50%) will be much higher than recalls\n", "### - Typically < 1% of documents contain a particular word\n", "### - This throws off the harmonic mean, favoring frequent terms" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
democrat freqrepublican freqdem_precisiondem_recalldem_f_score
term
the340225320.5733060.0223430.043009
and270922330.5481590.0177910.034464
to234016670.5839780.0153680.029948
a160213450.5436040.0105210.020643
of156913770.5325870.0103040.020218
that140010510.5711950.0091950.018098
we131811460.5349030.0086560.017036
in12919860.5669740.0084790.016708
i10988510.5633660.0072110.014240
's10376310.6217030.0068110.013473
\n", "
" ], "text/plain": [ " democrat freq republican freq dem_precision dem_recall dem_f_score\n", "term \n", "the 3402 2532 0.573306 0.022343 0.043009\n", "and 2709 2233 0.548159 0.017791 0.034464\n", "to 2340 1667 0.583978 0.015368 0.029948\n", "a 1602 1345 0.543604 0.010521 0.020643\n", "of 1569 1377 0.532587 0.010304 0.020218\n", "that 1400 1051 0.571195 0.009195 0.018098\n", "we 1318 1146 0.534903 0.008656 0.017036\n", "in 1291 986 0.566974 0.008479 0.016708\n", "i 1098 851 0.563366 0.007211 0.014240\n", "'s 1037 631 0.621703 0.006811 0.013473" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "term_freq_df = corpus.get_term_freq_df()\n", "term_freq_df['dem_precision'] = term_freq_df['democrat freq'] * 1./(term_freq_df['democrat freq'] + term_freq_df['republican freq'])\n", "term_freq_df['dem_recall'] = term_freq_df['democrat freq'] * 1./term_freq_df['democrat freq'].sum()\n", "term_freq_df['dem_f_score'] = term_freq_df.apply(lambda x: (hmean([x['dem_precision'], x['dem_recall']])\n", " if x['dem_precision'] > 0 and x['dem_recall'] > 0 \n", " else 0), axis=1) \n", "term_freq_df.sort_values(by='dem_f_score', ascending=False).iloc[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution:\n", "### Take the normal CDF of precision and recall scores, which will fall between 0 and 1, which scales and standardizes both scores." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
democrat freqrepublican freqdem_precisiondem_recalldem_f_scoredem_precision_normcdfdem_recall_normcdfdem_scaled_f_score
term
middle class148180.8915660.0009720.0019420.7697621.0000000.869905
auto3701.0000000.0002430.0004860.8360100.8893070.861835
fair4530.9375000.0002960.0005910.7994850.9339620.861507
insurance5460.9000000.0003550.0007090.7753970.9659590.860251
forward105160.8677690.0006900.0013780.7534430.9998580.859334
president barack4740.9215690.0003090.0006170.7894470.9425720.859241
class161250.8655910.0010570.0021120.7519191.0000000.858395
middle164270.8586390.0010770.0021510.7470211.0000000.855194
the middle98170.8521740.0006440.0012860.7424220.9996400.852041
medicare84150.8484850.0005520.0011030.7397780.9980500.849722
\n", "
" ], "text/plain": [ " democrat freq republican freq dem_precision dem_recall \\\n", "term \n", "middle class 148 18 0.891566 0.000972 \n", "auto 37 0 1.000000 0.000243 \n", "fair 45 3 0.937500 0.000296 \n", "insurance 54 6 0.900000 0.000355 \n", "forward 105 16 0.867769 0.000690 \n", "president barack 47 4 0.921569 0.000309 \n", "class 161 25 0.865591 0.001057 \n", "middle 164 27 0.858639 0.001077 \n", "the middle 98 17 0.852174 0.000644 \n", "medicare 84 15 0.848485 0.000552 \n", "\n", " dem_f_score dem_precision_normcdf dem_recall_normcdf \\\n", "term \n", "middle class 0.001942 0.769762 1.000000 \n", "auto 0.000486 0.836010 0.889307 \n", "fair 0.000591 0.799485 0.933962 \n", "insurance 0.000709 0.775397 0.965959 \n", "forward 0.001378 0.753443 0.999858 \n", "president barack 0.000617 0.789447 0.942572 \n", "class 0.002112 0.751919 1.000000 \n", "middle 0.002151 0.747021 1.000000 \n", "the middle 0.001286 0.742422 0.999640 \n", "medicare 0.001103 0.739778 0.998050 \n", "\n", " dem_scaled_f_score \n", "term \n", "middle class 0.869905 \n", "auto 0.861835 \n", "fair 0.861507 \n", "insurance 0.860251 \n", "forward 0.859334 \n", "president barack 0.859241 \n", "class 0.858395 \n", "middle 0.855194 \n", "the middle 0.852041 \n", "medicare 0.849722 " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#term_freq_df['dem_precision_pctl'] = rankdata(term_freq_df['dem_precision'])*1./len(term_freq_df)\n", "#term_freq_df['dem_recall_pctl'] = rankdata(term_freq_df['dem_recall'])*1./len(term_freq_df)\n", "def normcdf(x):\n", " return norm.cdf(x, x.mean(), x.std())\n", "term_freq_df['dem_precision_normcdf'] = normcdf(term_freq_df['dem_precision'])\n", "term_freq_df['dem_recall_normcdf'] = normcdf(term_freq_df['dem_recall'])\n", "term_freq_df['dem_scaled_f_score'] = hmean([term_freq_df['dem_precision_normcdf'], term_freq_df['dem_recall_normcdf']])\n", "term_freq_df.sort_values(by='dem_scaled_f_score', ascending=False).iloc[:10]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
democrat freqrepublican freqRepublican ScoreDemocratic Scoredem_corner_score
term
auto3700.00.7735670.227781
america forward2800.00.7401000.227870
insurance companies2400.00.7217650.227934
auto industry2400.00.7217650.227934
pell2300.00.7168240.227961
last week2200.00.7117350.227990
pell grants2100.00.7064980.228024
platform2000.00.7011100.228059
women 's2000.00.7011100.228059
coverage1800.00.6898770.228159
\n", "
" ], "text/plain": [ " democrat freq republican freq Republican Score \\\n", "term \n", "auto 37 0 0.0 \n", "america forward 28 0 0.0 \n", "insurance companies 24 0 0.0 \n", "auto industry 24 0 0.0 \n", "pell 23 0 0.0 \n", "last week 22 0 0.0 \n", "pell grants 21 0 0.0 \n", "platform 20 0 0.0 \n", "women 's 20 0 0.0 \n", "coverage 18 0 0.0 \n", "\n", " Democratic Score dem_corner_score \n", "term \n", "auto 0.773567 0.227781 \n", "america forward 0.740100 0.227870 \n", "insurance companies 0.721765 0.227934 \n", "auto industry 0.721765 0.227934 \n", "pell 0.716824 0.227961 \n", "last week 0.711735 0.227990 \n", "pell grants 0.706498 0.228024 \n", "platform 0.701110 0.228059 \n", "women 's 0.701110 0.228059 \n", "coverage 0.689877 0.228159 " ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "term_freq_df['dem_corner_score'] = corpus.get_rudder_scores('democrat')\n", "term_freq_df.sort_values(by='dem_corner_score', ascending=True).iloc[:10]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top 10 Democratic terms\n", "['auto',\n", " 'america forward',\n", " 'fought for',\n", " 'fair',\n", " 'insurance companies',\n", " 'auto industry',\n", " 'president barack',\n", " 'pell',\n", " 'fighting for',\n", " 'last week']\n", "Top 10 Republican terms\n", "['unemployment',\n", " 'do better',\n", " 'liberty',\n", " 'olympics',\n", " 'built it',\n", " 'reagan',\n", " 'it has',\n", " 'ann',\n", " 'big government',\n", " 'story of']\n" ] } ], "source": [ "term_freq_df = corpus.get_term_freq_df()\n", "term_freq_df['Republican Score'] = corpus.get_scaled_f_scores('republican')\n", "term_freq_df['Democratic Score'] = corpus.get_scaled_f_scores('democrat')\n", "print(\"Top 10 Democratic terms\")\n", "pprint(list(term_freq_df.sort_values(by='Democratic Score', ascending=False).index[:10]))\n", "print(\"Top 10 Republican terms\")\n", "pprint(list(term_freq_df.sort_values(by='Republican Score', ascending=False).index[:10]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Make and visualize chart, scale based on raw frequency.\n", "### - A word used 10 times by Republicans will be at position 10 on the on the x-axis \n", "### - This isn't very useful. 
Everything but the most frequent terms are squished into the lower-left corner\n", "### - The corner-distance scores are largely stopwords\n", "### - By default, color words by Scaled F-Score" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "html = produce_scattertext_explorer(corpus,\n", "                                    category='democrat',\n", "                                    category_name='Democratic',\n", "                                    not_category_name='Republican',\n", "                                    width_in_pixels=1000,\n", "                                    minimum_term_frequency=5,\n", "                                    pmi_filter_thresold=4,\n", "                                    transform=st.Scalers.scale,\n", "                                    metadata=convention_df['speaker'])\n", "file_name = 'Conventions2012ScattertextScale.html'\n", "open(file_name, 'wb').write(html.encode('utf-8'))\n", "IFrame(src=file_name, width = 1200, height=700)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using log scales seems to help a bit, but blank space and stop words still dominate the graph\n", "### The characteristic terms look much more informative" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "html = st.produce_scattertext_explorer(corpus,\n", "                                       category='democrat',\n", "                                       category_name='Democratic',\n", "                                       not_category_name='Republican',\n", "                                       minimum_term_frequency=5,\n", "                                       pmi_filter_thresold=4,\n", "                                       width_in_pixels=1000,\n", "                                       transform=st.Scalers.log_scale_standardize)\n", "file_name = 'Conventions2012ScattertextLog.html'\n", "open(file_name, 'wb').write(html.encode('utf-8'))\n", "IFrame(src=file_name, width = 1200, height=700)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Rank terms by frequency percentiles instead of raw frequencies.\n", "### A term at the middle of the x-axis will be mentioned by Republicans at the median frequency.\n", "### This nicely distributes terms throughout the space.\n", "### But terms occurring with the same frequencies in both classes are stacked atop each other.\n", "### You can't mouse over points that aren't at the top of a stack (see the sketch below)."
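, "\n", "Below is a minimal sketch of what a percentile (rank-based) transform does, built on `scipy.stats.rankdata`; it is an illustration, not necessarily the exact implementation behind `st.Scalers.percentile`. Tied counts receive identical ranks, which is why equally frequent terms land on the same coordinate:\n", "```python\n", "from scipy.stats import rankdata\n", "\n", "def percentile_transform(freqs):\n", "    # tied frequencies share a rank, so equally frequent terms share a coordinate\n", "    return rankdata(freqs) / len(freqs)\n", "\n", "x_coords = percentile_transform(term_freq_df['republican freq'].values)\n", "y_coords = percentile_transform(term_freq_df['democrat freq'].values)\n", "```"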
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "html = produce_scattertext_explorer(corpus,\n", " category='democrat',\n", " category_name='Democratic',\n", " not_category_name='Republican',\n", " width_in_pixels=1000,\n", " minimum_term_frequency=5,\n", " pmi_filter_thresold=4, \n", " transform=st.Scalers.percentile,\n", " metadata=convention_df['speaker'])\n", "file_name = 'Conventions2012ScattertextRankData.html'\n", "open(file_name, 'wb').write(html.encode('utf-8'))\n", "IFrame(src=file_name, width = 1200, height=700)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# One solution is to randomly jitter each point\n", "## Points don't leave enough space for many labels\n", "## Top terms laregely result of jitter" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "html = produce_scattertext_explorer(corpus,\n", " category='democrat',\n", " category_name='Democratic',\n", " not_category_name='Republican',\n", " width_in_pixels=1000,\n", " jitter=0.1,\n", " minimum_term_frequency=5,\n", " pmi_filter_thresold=4,\n", " transform=st.Scalers.percentile,\n", " metadata=convention_df['speaker'])\n", "file_name = 'Conventions2012ScattertextRankDataJitter.html'\n", "open(file_name, 'wb').write(html.encode('utf-8'))\n", "IFrame(src=file_name, width = 1200, height=700)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The preferred solution is to fall back to alphabetic order among equally frequent terms\n", "## Lets you mouseover all points\n", "## Leaves a bit of room for labels\n", "## Top points may be slightly distorted" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "html = produce_scattertext_explorer(corpus,\n", " category='democrat',\n", " category_name='Democratic',\n", " not_category_name='Republican',\n", " width_in_pixels=1000,\n", " minimum_term_frequency=5,\n", " pmi_filter_thresold=4,\n", " metadata=convention_df['speaker'],\n", " term_significance = st.LogOddsRatioUninformativeDirichletPrior())\n", "file_name = 'Conventions2012ScattertextRankDefault.html'\n", "open(file_name, 'wb').write(html.encode('utf-8'))\n", "IFrame(src=file_name, width = 1200, height=700)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Scattertext can also be used for alternative visualizations\n", "## Visualize L2-penalized logistic regression coefficients vs. log term frequency\n", "Similar to Monroe et al. (2008).\n", "\n", "Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin’ words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "def scale(ar): \n", " return (ar - ar.min()) / (ar.max() - ar.min())\n", "\n", "def zero_centered_scale(ar):\n", " ar[ar > 0] = scale(ar[ar > 0])\n", " ar[ar < 0] = -scale(-ar[ar < 0])\n", " return (ar + 1) / 2.\n", "\n", "frequencies_scaled = scale(np.log(term_freq_df.sum(axis=1).values))\n", "scores = corpus.get_logreg_coefs('democrat',\n", " LogisticRegression(penalty='l2', C=10, max_iter=10000, n_jobs=-1))\n", "scores_scaled = zero_centered_scale(scores)\n", "\n", "html = produce_scattertext_explorer(corpus,\n", " category='democrat',\n", " category_name='Democratic',\n", " not_category_name='Republican',\n", " minimum_term_frequency=5,\n", " pmi_filter_thresold=4,\n", " width_in_pixels=1000,\n", " x_coords=frequencies_scaled,\n", " y_coords=scores_scaled,\n", " scores=scores,\n", " sort_by_dist=False,\n", " metadata=convention_df['speaker'],\n", " x_label='Log frequency',\n", " y_label='L2-penalized logistic regression coef')\n", "file_name = 'L2vsLog.html'\n", "open(file_name, 'wb').write(html.encode('utf-8'))\n", "IFrame(src=file_name, width = 1200, height=700)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [Root]", "language": "python", "name": "Python [Root]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }