{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Natural Language Visualization With Scattertext\n", "## Jason S. Kessler @jasonkessler\n", "### Global AI Conference 2018, Seattle, WA. April 27, 2018.\n", "\n", "The GitHub repository for this talk is at [https://github.com/JasonKessler/GlobalAI2018](https://github.com/JasonKessler/GlobalAI2018). \n", "\n", "Visualizations were made using [Scattertext](https://github.com/JasonKessler/scattertext).\n", "\n", "Please cite as: \n", "Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. 2017." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import scattertext as st\n", "import spacy\n", "from IPython.display import IFrame, display, HTML\n", "display(HTML(\"\"))\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Compare version components as integers (assumes purely numeric components);\n", "# comparing the raw strings would order '0.0.2.3' above '0.0.2.25'.\n", "assert tuple(int(part) for part in st.__version__.split('.')) >= (0, 0, 2, 25)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The data\n", " \n", "The dataset consists of movie reviews and plot descriptions. Each plot description is guaranteed to come from a movie that was reviewed. \n", "\n", "The dataset is from http://www.cs.cornell.edu/people/pabo/movie-review-data/\n", "\n", "References:\n", "* Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan, Thumbs up? Sentiment Classification using Machine Learning Techniques, Proceedings of EMNLP 2002.\n", "\n", "* Bo Pang and Lillian Lee, A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, Proceedings of ACL 2004." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Positive 2455\n", "Negative 2411\n", "Plot 156\n", "Name: category_name, dtype: int64\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " text movie_name \\\n", "0 A senior at an elite college (Katie Holmes), a... abandon \n", "1 Will Lightman is a hip Londoner who one day re... about_a_boy \n", "2 Warren Schmidt (Nicholson) is forced to deal w... about_schmidt \n", "3 An account of screenwriter Charlie Kaufman's (... adaptation \n", "4 Ali G unwittingly becomes a pawn in the evil C... ali_g_indahouse \n", "\n", " category_name \n", "0 Plot \n", "1 Plot \n", "2 Plot \n", "3 Plot \n", "4 Plot " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rdf = st.SampleCorpora.RottenTomatoes.get_data()\n", "rdf['category_name'] = rdf['category'].apply(lambda x: {'plot': 'Plot', 'rotten': 'Negative', 'fresh': 'Positive'}[x])\n", "print(rdf.category_name.value_counts())\n", "rdf[['text', 'movie_name', 'category_name']].head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "corpus = (st.CorpusFromPandas(rdf, \n", " category_col='category_name', \n", " text_col='text',\n", " nlp = st.whitespace_nlp_with_sentences)\n", " .build())\n", "corpus.get_term_freq_df().to_csv('term_freqs.csv')\n", "unigram_corpus = corpus.get_unigram_corpus()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's visualize the corpus using Scattertext\n", "\n", "The x-axis indicates the rank of a word or bigram in the set of positive reviews, and the y-axis negative reviews.\n", "\n", "Ranks are determined using \"dense\" ranking, meaning the most frequent terms, regardless of ties, are given rank 1, the next most frequent terms, regardless of ties, are given rank 2, etc.\n", "\n", "It appears that terms more associated with a class are a further distance from the diagonal line between the lower-left and upper-right corners. Terms are colored according to this distance. 
We'll return to this in a bit.\n", "\n", "Scattertext labels points selectively so that labels do not overlap other elements of the graph. Mouse over points and term labels for a preview, and click for a keyword-in-context view.\n", "\n", "References:\n", "* Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. 2017." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "html = st.produce_scattertext_explorer(\n", " corpus,\n", " category='Positive',\n", " not_categories=['Negative'],\n", " sort_by_dist=False,\n", " metadata=rdf['movie_name'],\n", " term_scorer=st.RankDifference(),\n", " transform=st.Scalers.percentile_dense\n", ")\n", "file_name = 'rotten_fresh_stdense.html'\n", "open(file_name, 'wb').write(html.encode('utf-8'))\n", "IFrame(src=file_name, width = 1300, height=700)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### We can see more terms by breaking ties in ranking alphabetically.\n", "Lower-frequency terms are more prominent in this view, and more terms can be labeled." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "html = st.produce_scattertext_explorer(\n", " corpus,\n", " category='Positive',\n", " not_categories=['Negative'],\n", " sort_by_dist=False,\n", " metadata=rdf['movie_name'],\n", " term_scorer=st.RankDifference(),\n", ")\n", "file_name = 'rotten_fresh_st.html'\n", "open(file_name, 'wb').write(html.encode('utf-8'))\n", "IFrame(src=file_name, width = 1300, height=700)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Naive approach 1\n", "### tf.idf difference (not recommended)\n", "$$ \\mbox{Term Frequency}(\\mbox{term}, \\mbox{category}) = \\#(\\mbox{term}\\in\\mbox{category}) $$\n", "\n", "$$ \\mbox{Inverse Document Frequency}(\\mbox{term}) = \\log \\frac{\\mbox{# of categories}}{\\mbox{# of categories containing term}} $$\n", "\n", "$$ \\mbox{tf.idf}(\\mbox{term}, \\mbox{category}) = \\mbox{Term Frequency}(\\mbox{term}, \\mbox{category}) \\times \\mbox{Inverse Document Frequency}(\\mbox{term}) $$\n", "\n", "$$ \\mbox{tf.idf-difference}(\\mbox{term}, \\mbox{category}_a, \\mbox{category}_b) = \\mbox{tf.idf}(\\mbox{term}, \\mbox{category}_a) - \\mbox{tf.idf}(\\mbox{term}, \\mbox{category}_b) $$\n", "\n", "Tf.idf ignores terms used in every category. Since we consider only three categories (positive, negative, and plot descriptions), a large number of terms have scores of zero (log 1). The problem is that tf.idf doesn't weight how often a term is used in another category. 
This causes eccentric, brittle, low-frequency terms to be favored.\n", "\n", "This formulation does take into account data from a background corpus.\n", "\n", "$$ \\#(\\mbox{term}, \\mbox{category}) \\times \\log \\frac{\\mbox{# of categories}}{\\mbox{# of categories containing term}} $$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scaled F-Score\n", "### Associated terms have a *relatively* high category-specific precision and category-specific term frequency (i.e., the % of terms in the category that are the term)\n", "### Take the harmonic mean of precision and frequency (both have to be high)\n", "### We will make two adjustments to this method in order to come up with the final formulation of Scaled F-Score\n", "\n", "Given a word $w_i \\in W$ and a category $c_j \\in C$, define the precision of the word $w_i$ with respect to a category as:\n", "$$ \\mbox{prec}(i,j) = \\frac{\\#(w_i, c_j)}{\\sum_{c \\in C} \\#(w_i, c)}. $$\n", "\n", "The function $\\#(w_i, c_j)$ represents either the number of times $w_i$ occurs in documents labeled with the category $c_j$ or the number of documents labeled $c_j$ which contain $w_i$.\n", "\n", "Similarly, define the frequency with which a word occurs in the category as:\n", "\n", "$$ \\mbox{freq}(i, j) = \\frac{\\#(w_i, c_j)}{\\sum_{w \\in W} \\#(w, c_j)}. $$\n", "\n", "The $\\beta$-weighted harmonic mean of these two values is defined as:\n", "\n", "$$ \\mathcal{H}_\\beta(i,j) = (1 + \\beta^2) \\frac{\\mbox{prec}(i,j) \\cdot \\mbox{freq}(i,j)}{\\beta^2 \\cdot \\mbox{prec}(i,j) + \\mbox{freq}(i,j)}. $$\n", "\n", "$\\beta \\in \\mathbb{R}^+$ is a scaling factor. As written, $\\mathcal{H}_\\beta$ approaches the precision as $\\beta \\to 0$ and the frequency as $\\beta \\to \\infty$, so precision is favored if $\\beta < 1$, frequency if $\\beta > 1$, and both are equally weighted if $\\beta = 1$. F-Score is equivalent to the harmonic mean where $\\beta = 1$." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Positive freq Negative freq pos_precision pos_freq_pct pos_hmean\n", "term \n", "the 2346 2288 0.506258 0.048037 0.087748\n", "a 1775 1613 0.523908 0.036345 0.067975\n", "and 1637 1179 0.581321 0.033520 0.063385\n", "of 1480 1235 0.545120 0.030305 0.057418\n", "to 942 1010 0.482582 0.019289 0.037095\n", "it 826 801 0.507683 0.016913 0.032736\n", "is 818 726 0.529793 0.016750 0.032473\n", "s 808 749 0.518947 0.016545 0.032067\n", "in 676 622 0.520801 0.013842 0.026967\n", "that 617 602 0.506153 0.012634 0.024652" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scipy.stats import hmean\n", "\n", "term_freq_df = corpus.get_unigram_corpus().get_term_freq_df()[['Positive freq', 'Negative freq']]\n", "term_freq_df = term_freq_df[term_freq_df.sum(axis=1) > 0]\n", "\n", "term_freq_df['pos_precision'] = (term_freq_df['Positive freq'] * 1./\n", " (term_freq_df['Positive freq'] + term_freq_df['Negative freq']))\n", "\n", "term_freq_df['pos_freq_pct'] = (term_freq_df['Positive freq'] * 1.\n", " /term_freq_df['Positive freq'].sum())\n", "\n", "term_freq_df['pos_hmean'] = (term_freq_df\n", " .apply(lambda x: (hmean([x['pos_precision'], x['pos_freq_pct']])\n", " if x['pos_precision'] > 0 and x['pos_freq_pct'] > 0 \n", " else 0), axis=1))\n", "term_freq_df.sort_values(by='pos_hmean', ascending=False).iloc[:10]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 12032.000000\n", "mean 0.000083\n", "std 0.000826\n", "min 0.000000\n", "25% 0.000000\n", "50% 0.000020\n", "75% 0.000041\n", "max 0.048037\n", "Name: pos_freq_pct, dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "term_freq_df.pos_freq_pct.describe()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 12032.000000\n", "mean 0.506651\n", "std 0.418623\n", "min 
0.000000\n", "25% 0.000000\n", "50% 0.500000\n", "75% 1.000000\n", "max 1.000000\n", "Name: pos_precision, dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "term_freq_df.pos_precision.describe()" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The plot looks a bit better if you Anscombe transform the data, but it doesn't make a difference in SFS\n", "#freq = 2*(np.sqrt(term_freq_df.pos_freq_pct.values)+3/8)\n", "\n", "freq = term_freq_df.pos_freq_pct.values\n", "prec = term_freq_df.pos_precision.values\n", "html = st.produce_scattertext_explorer(\n", " corpus.remove_terms(set(corpus.get_terms()) - set(term_freq_df.index)),\n", " category='Positive',\n", " not_category_name='Negative',\n", " not_categories=['Negative'],\n", " \n", " x_label = 'Portion of words used in positive reviews',\n", " original_x = freq,\n", " x_coords = (freq - freq.min())/freq.max(),\n", " x_axis_values = [int(freq.min()*1000)/1000., \n", " int(freq.max() * 1000)/1000.],\n", " \n", " y_label = 'Portion of documents containing word that are positive', \n", " original_y = prec,\n", " y_coords = (prec - prec.min())/prec.max(),\n", " y_axis_values = [int(prec.min() * 1000)/1000., \n", " int((prec.max()/2.)*1000)/1000., \n", " int(prec.max() * 1000)/1000.],\n", " scores = term_freq_df.pos_hmean.values,\n", " \n", " sort_by_dist=False,\n", " show_characteristic=False\n", ")\n", "file_name = 'not_normed_freq_prec.html'\n", "open(file_name, 'wb').write(html.encode('utf-8'))\n", "IFrame(src=file_name, width = 1300, height=700)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import numpy as np\n", 
"import matplotlib.pyplot as plt\n", "import matplotlib\n", "import seaborn as sns\n", "from scipy.stats import norm\n", "\n", "fig, ax = plt.subplots(figsize=(15,10))\n", "freqs = term_freq_df.pos_freq_pct[term_freq_df.pos_freq_pct > 0]\n", "log_freqs = np.log(freqs)\n", "\n", "sns.distplot(log_freqs[:1000], kde=False, rug=True, hist=False, rug_kws={\"color\": \"k\"})\n", "\n", "x = np.linspace(log_freqs.min(), \n", " log_freqs.max(), \n", " 100)\n", "frozen_norm = norm(log_freqs.mean(), log_freqs.std())\n", "y = frozen_norm.pdf(x)\n", "plt.plot(x, y ,color='k')\n", "term = 'beauty'\n", "word_freq = log_freqs.loc[term]\n", "term_cdf = frozen_norm.cdf(word_freq)\n", "plt.axvline(x=word_freq, color='red', label='Log frequency of \"'+term+'\"')\n", "plt.fill_between(x[x < word_freq], \n", " y[x < word_freq], y[x < word_freq] * 0, \n", " facecolor='blue', \n", " alpha=0.5,\n", " label=\"Log-normal CDF of %s: $%0.3f \\in [0,1]$\" % (term, term_cdf) )\n", "ax.set_xlabel('Log term frequency')\n", "ax.set_ylabel('Cumulative term probability')\n", "plt.legend()\n", "for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +\n", " ax.get_xticklabels() + ax.get_yticklabels() ):\n", " item.set_fontsize(20)\n", "plt.rc('legend', fontsize=20) \n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem: harmonic means are dominated by the precision\n", "### Take the normal CDF of the precision and frequency percentage scores. The results fall between 0 and 1, which scales and standardizes both scores.\n", "\n", "Define the Normal CDF as:\n", "\n", "$$ \\Phi(z) = \\int_{-\\infty}^z \\mathcal{N}(x; \\mu, \\sigma^2)\\ \\mathrm{d}x,$$\n", "\n", "where $ \\mathcal{N} $ is the PDF of the Normal distribution, $\\mu$ is the mean, and $\\sigma^2$ is the variance.\n", "\n", "$\\Phi$ is used to scale and standardize the precisions and frequencies, and place them on the same scale $[0,1]$.\n", "\n", "Now we can define Scaled F-Score as the harmonic mean of the 
Normal-CDF-transformed frequency and precision:\n", "\n", "$$ \\mbox{S-CAT}_{\\beta}(i, j) = \\mathcal{H}_{\\beta}(\\Phi(\\mbox{prec}(i, j)), \\Phi(\\mbox{freq}(i, j))).$$\n", "\n", "$\\mu$ and $\\sigma^2$ are computed separately for precision and for frequency, as the mean and variance of each.\n", "\n", "A $\\beta$ of 2 is recommended and is the default value in Scattertext.\n", "\n", "Note that any function with the range of $[0,1]$ (this includes the identity function) may be used in place of $\\Phi$. Also, when the precision is very small (e.g., of a tiny minority class), normalization may be forgone." ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Positive freq Negative freq pos_precision pos_freq_pct \\\n", "term \n", "best 108 36 0.750000 0.002211 \n", "entertaining 58 13 0.816901 0.001188 \n", "fun 73 26 0.737374 0.001495 \n", "heart 45 11 0.803571 0.000921 \n", "great 61 23 0.726190 0.001249 \n", "still 63 26 0.707865 0.001290 \n", "our 42 11 0.792453 0.000860 \n", "performance 53 19 0.736111 0.001085 \n", "love 61 25 0.709302 0.001249 \n", "both 52 19 0.732394 0.001065 \n", "\n", " pos_hmean pos_precision_normcdf pos_freq_pct_normcdf \\\n", "term \n", "best 0.004410 0.719483 0.995008 \n", "entertaining 0.002372 0.770690 0.909394 \n", "fun 0.002983 0.709233 0.956259 \n", "heart 0.001841 0.760924 0.844900 \n", "great 0.002494 0.700011 0.920936 \n", "still 0.002575 0.684620 0.927988 \n", "our 0.001718 0.752608 0.826505 \n", "performance 0.002167 0.708199 0.887454 \n", "love 0.002494 0.685839 0.920936 \n", "both 0.002126 0.705143 0.882645 \n", "\n", " pos_scaled_f_score \n", "term \n", "best 0.835107 \n", "entertaining 0.834316 \n", "fun 0.814427 \n", "heart 0.800716 \n", "great 0.795418 \n", "still 0.787940 \n", "our 0.787827 \n", "performance 0.787758 \n", "love 0.786188 \n", "both 0.783972 " ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scipy.stats import norm\n", "\n", "def normcdf(x):\n", " return norm.cdf(x, x.mean(), x.std ())\n", "\n", "term_freq_df['pos_precision_normcdf'] = normcdf(term_freq_df.pos_precision)\n", "\n", "term_freq_df['pos_freq_pct_normcdf'] = normcdf(term_freq_df.pos_freq_pct.values)\n", "\n", "term_freq_df['pos_scaled_f_score'] = hmean([term_freq_df['pos_precision_normcdf'], term_freq_df['pos_freq_pct_normcdf']])\n", "\n", "term_freq_df.sort_values(by='pos_scaled_f_score', ascending=False).iloc[:10]\n" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Positive freq Negative freq pos_precision pos_freq_pct \\\n", "term \n", "brawny 0 1 0.0 0.0 \n", "derivativeness 0 1 0.0 0.0 \n", "blatant 0 1 0.0 0.0 \n", "jams 0 1 0.0 0.0 \n", "staleness 0 1 0.0 0.0 \n", "luck 0 2 0.0 0.0 \n", "screenplays 0 2 0.0 0.0 \n", "tripe 0 1 0.0 0.0 \n", "lackluster 0 6 0.0 0.0 \n", "stoop 0 1 0.0 0.0 \n", "\n", " pos_hmean pos_precision_normcdf pos_freq_pct_normcdf \\\n", "term \n", "brawny 0.0 0.113086 0.459931 \n", "derivativeness 0.0 0.113086 0.459931 \n", "blatant 0.0 0.113086 0.459931 \n", "jams 0.0 0.113086 0.459931 \n", "staleness 0.0 0.113086 0.459931 \n", "luck 0.0 0.113086 0.459931 \n", "screenplays 0.0 0.113086 0.459931 \n", "tripe 0.0 0.113086 0.459931 \n", "lackluster 0.0 0.113086 0.459931 \n", "stoop 0.0 0.113086 0.459931 \n", "\n", " pos_scaled_f_score \n", "term \n", "brawny 0.181537 \n", "derivativeness 0.181537 \n", "blatant 0.181537 \n", "jams 0.181537 \n", "staleness 0.181537 \n", "luck 0.181537 \n", "screenplays 0.181537 \n", "tripe 0.181537 \n", "lackluster 0.181537 \n", "stoop 0.181537 " ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "term_freq_df.sort_values(by='pos_scaled_f_score', ascending=True).iloc[:10]" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "freq = term_freq_df.pos_freq_pct_normcdf.values\n", "prec = term_freq_df.pos_precision_normcdf.values\n", "html = st.produce_scattertext_explorer(\n", " corpus.remove_terms(set(corpus.get_terms()) - set(term_freq_df.index)),\n", " category='Positive',\n", " not_category_name='Negative',\n", " not_categories=['Negative'],\n", " \n", " x_label = 'Portion of words used in positive reviews (norm-cdf)',\n", " original_x = freq,\n", " x_coords = (freq - freq.min())/freq.max(),\n", " x_axis_values = 
[int(freq.min()*1000)/1000., \n", " int(freq.max() * 1000)/1000.],\n", " \n", " y_label = 'Portion of documents containing word that are positive (norm-cdf)', \n", " original_y = prec,\n", " y_coords = (prec - prec.min())/prec.max(),\n", " y_axis_values = [int(prec.min() * 1000)/1000., \n", " int((prec.max()/2.)*1000)/1000., \n", " int(prec.max() * 1000)/1000.],\n", " scores = term_freq_df.pos_scaled_f_score.values,\n", " \n", " sort_by_dist=False,\n", " show_characteristic=False\n", ")\n", "file_name = 'normed_freq_prec.html'\n", "open(file_name, 'wb').write(html.encode('utf-8'))\n", "IFrame(src=file_name, width = 1300, height=700)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A second problem: the lowest scores go to brittle, low-frequency terms.\n", "## Make the approach fair to negative-scoring terms\n", "### Solution: compute the SFS of the negative class. If that score has a higher magnitude than the positive SFS, keep that, but as a negative score.\n", "\n", "Define the Scaled F-Score for category $j$ as\n", "$$ \\mbox{S-CAT}^{j} = \\mbox{S-CAT}_{\\beta}(i, j). $$\n", "\n", "Define a class $\\neg j$ which includes all categories other than $j$,\n", "\n", "and the Scaled F-Score for all other categories as\n", "$$ \\mbox{S-CAT}^{\\neg j} = \\mbox{S-CAT}_{\\beta}(i, \\neg j). $$\n", "\n", "Let the corrected version of Scaled F-Score be:\n", "\n", "$$\\mathcal{S}_{\\beta} = 2 \\cdot \\big(-0.5 + \\begin{cases}\n", " \\mbox{S-CAT}^{j} & \\text{if}\\ \\mbox{S-CAT}^{j} > \\mbox{S-CAT}^{\\neg j}, \\\\\n", " 1 - \\mbox{S-CAT}^{\\neg j} & \\text{if}\\ \\mbox{S-CAT}^{j} < \\mbox{S-CAT}^{\\neg j}, \\\\\n", " 0 & \\text{otherwise}.\n", " \\end{cases} \\big).$$\n", " \n", "Note that the range of $\\mathcal{S}$ is now $[-1, 1]$, where $\\mathcal{S} < 0$ indicates a term less associated with the category in question than average, and $\\mathcal{S} > 0$ indicates a term more associated with it." ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Positive freq Negative freq pos_precision pos_freq_pct \\\n", "term \n", "best 108 36 0.750000 0.002211 \n", "entertaining 58 13 0.816901 0.001188 \n", "fun 73 26 0.737374 0.001495 \n", "heart 45 11 0.803571 0.000921 \n", "great 61 23 0.726190 0.001249 \n", "still 63 26 0.707865 0.001290 \n", "our 42 11 0.792453 0.000860 \n", "performance 53 19 0.736111 0.001085 \n", "love 61 25 0.709302 0.001249 \n", "both 52 19 0.732394 0.001065 \n", "\n", " pos_hmean pos_precision_normcdf pos_freq_pct_normcdf \\\n", "term \n", "best 0.004410 0.719483 0.995008 \n", "entertaining 0.002372 0.770690 0.909394 \n", "fun 0.002983 0.709233 0.956259 \n", "heart 0.001841 0.760924 0.844900 \n", "great 0.002494 0.700011 0.920936 \n", "still 0.002575 0.684620 0.927988 \n", "our 0.001718 0.752608 0.826505 \n", "performance 0.002167 0.708199 0.887454 \n", "love 0.002494 0.685839 0.920936 \n", "both 0.002126 0.705143 0.882645 \n", "\n", " pos_scaled_f_score neg_precision_normcdf neg_freq_pct_normcdf \\\n", "term \n", "best 0.835107 0.280517 0.805491 \n", "entertaining 0.834316 0.229310 0.596336 \n", "fun 0.814427 0.290767 0.723380 \n", "heart 0.800716 0.239076 0.575415 \n", "great 0.795418 0.299989 0.695802 \n", "still 0.787940 0.315380 0.723380 \n", "our 0.787827 0.247392 0.575415 \n", "performance 0.787758 0.291801 0.657250 \n", "love 0.786188 0.314161 0.714324 \n", "both 0.783972 0.294857 0.657250 \n", "\n", " neg_scaled_f_score scaled_f_score \n", "term \n", "best 0.416118 0.670214 \n", "entertaining 0.331246 0.668633 \n", "fun 0.414801 0.628854 \n", "heart 0.337801 0.601433 \n", "great 0.419230 0.590836 \n", "still 0.439254 0.575880 \n", "our 0.346019 0.575654 \n", "performance 0.404164 0.575515 \n", "love 0.436395 0.572376 \n", "both 0.407086 0.567945 " ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "term_freq_df['neg_precision_normcdf'] = normcdf((term_freq_df['Negative freq'] * 1./\n", " (term_freq_df['Negative freq'] 
+ term_freq_df['Positive freq'])))\n", "\n", "term_freq_df['neg_freq_pct_normcdf'] = normcdf((term_freq_df['Negative freq'] * 1.\n", " /term_freq_df['Negative freq'].sum()))\n", "\n", "term_freq_df['neg_scaled_f_score'] = hmean([term_freq_df['neg_precision_normcdf'], term_freq_df['neg_freq_pct_normcdf']])\n", "\n", "# Initialize as float so the float scores assigned below match the column dtype.\n", "term_freq_df['scaled_f_score'] = 0.\n", "term_freq_df.loc[term_freq_df['pos_scaled_f_score'] > term_freq_df['neg_scaled_f_score'], \n", " 'scaled_f_score'] = term_freq_df['pos_scaled_f_score']\n", "term_freq_df.loc[term_freq_df['pos_scaled_f_score'] < term_freq_df['neg_scaled_f_score'], \n", " 'scaled_f_score'] = 1-term_freq_df['neg_scaled_f_score']\n", "term_freq_df['scaled_f_score'] = 2 * (term_freq_df['scaled_f_score'] - 0.5)\n", "term_freq_df.sort_values(by='scaled_f_score', ascending=False).iloc[:10]" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Positive freq Negative freq pos_precision pos_freq_pct pos_hmean \\\n", "term \n", "bad 17 105 0.139344 0.000348 0.000694 \n", "too 42 147 0.222222 0.000860 0.001713 \n", "were 14 50 0.218750 0.000287 0.000573 \n", "only 43 100 0.300699 0.000880 0.001756 \n", "would 33 72 0.314286 0.000676 0.001349 \n", "no 65 130 0.333333 0.001331 0.002651 \n", "just 76 145 0.343891 0.001556 0.003098 \n", "video 11 39 0.220000 0.000225 0.000450 \n", "script 25 57 0.304878 0.000512 0.001022 \n", "should 27 58 0.317647 0.000553 0.001104 \n", "\n", " pos_precision_normcdf pos_freq_pct_normcdf pos_scaled_f_score \\\n", "term \n", "bad 0.190130 0.625807 0.291652 \n", "too 0.248430 0.826505 0.382030 \n", "were 0.245811 0.597317 0.348291 \n", "only 0.311369 0.832785 0.453267 \n", "would 0.322931 0.763424 0.453872 \n", "no 0.339430 0.934547 0.497990 \n", "just 0.348713 0.962723 0.511979 \n", "video 0.246752 0.568300 0.344099 \n", "script 0.314906 0.698142 0.434035 \n", "should 0.325819 0.715199 0.447687 \n", "\n", " neg_precision_normcdf neg_freq_pct_normcdf neg_scaled_f_score \\\n", "term \n", "bad 0.809870 0.996676 0.893614 \n", "too 0.751570 0.999939 0.858145 \n", "were 0.754189 0.892009 0.817330 \n", "only 0.688631 0.995056 0.813959 \n", "would 0.677069 0.966222 0.796206 \n", "no 0.660570 0.999644 0.795481 \n", "just 0.651287 0.999924 0.788800 \n", "video 0.753248 0.826890 0.788353 \n", "script 0.685094 0.922954 0.786432 \n", "should 0.674181 0.926760 0.780546 \n", "\n", " scaled_f_score \n", "term \n", "bad -0.787229 \n", "too -0.716289 \n", "were -0.634660 \n", "only -0.627919 \n", "would -0.592412 \n", "no -0.590963 \n", "just -0.577600 \n", "video -0.576706 \n", "script -0.572864 \n", "should -0.561092 " ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "term_freq_df.sort_values(by='scaled_f_score', ascending=True).iloc[:10]" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { 
"text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "is_pos = term_freq_df.pos_scaled_f_score > term_freq_df.neg_scaled_f_score\n", "freq = term_freq_df.pos_freq_pct_normcdf*is_pos - term_freq_df.neg_freq_pct_normcdf*~is_pos\n", "prec = term_freq_df.pos_precision_normcdf*is_pos - term_freq_df.neg_precision_normcdf*~is_pos\n", "def scale(ar): \n", " return (ar - ar.min())/(ar.max() - ar.min())\n", "def close_gap(ar): \n", " ar[ar > 0] -= ar[ar > 0].min()\n", " ar[ar < 0] -= ar[ar < 0].max()\n", " return ar\n", "\n", "html = st.produce_scattertext_explorer(\n", " corpus.remove_terms(set(corpus.get_terms()) - set(term_freq_df.index)),\n", " category='Positive',\n", " not_category_name='Negative',\n", " not_categories=['Negative'],\n", " \n", " x_label = 'Frequency',\n", " original_x = freq,\n", " x_coords = scale(close_gap(freq)),\n", " x_axis_labels = ['Frequent in Neg', \n", " 'Not Frequent', \n", " 'Frequent in Pos'],\n", " \n", " y_label = 'Precision', \n", " original_y = prec,\n", " y_coords = scale(close_gap(prec)),\n", " y_axis_labels = ['Neg Precise', \n", " 'Imprecise', \n", " 'Pos Precise'],\n", " \n", " \n", " scores = (term_freq_df.scaled_f_score.values + 1)/2,\n", " sort_by_dist=False,\n", " show_characteristic=False\n", ")\n", "file_name = 'sfs_explain.html'\n", "open(file_name, 'wb').write(html.encode('utf-8'))\n", "IFrame(src=file_name, width = 1300, height=700)" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "html = st.produce_frequency_explorer(\n", " corpus.remove_terms(set(corpus.get_terms()) - set(term_freq_df.index)),\n", " category='Positive',\n", " not_category_name='Negative',\n", " not_categories=['Negative'],\n", " term_scorer=st.ScaledFScorePresets(beta=1, 
one_to_neg_one=True),\n", "    metadata = rdf['movie_name'],\n", "    grey_threshold=0\n", ")\n", "file_name = 'freq_sfs.html'\n", "# Use a context manager so the file is flushed and closed before the IFrame loads it.\n", "with open(file_name, 'wb') as f:\n", "    f.write(html.encode('utf-8'))\n", "IFrame(src=file_name, width = 1300, height=700)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [py36]", "language": "python", "name": "Python [py36]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }