{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sebastian Raschka 23/05/2015 \n", "\n", "CPython 3.4.3\n", "IPython 3.1.0\n", "\n", "scikit-learn 0.16.1\n", "numpy 1.9.2\n", "scipy 0.15.1\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -a 'Sebastian Raschka' -p scikit-learn,numpy,scipy -v -d" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#Tf-idf Walkthrough for scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When I was using scikit-learn extremely handy `TfidfVectorizer` I had some trouble interpreting the results since they seem to be a little bit different from the standard convention of how the *term frequency-inverse document frequency* (tf-idf) is calculated. Here, I just put together a brief overview of how the `TfidfVectorizer` works -- mainly as personal reference sheet, but maybe it is useful to one or the other." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "### Sections\n", "- [What are Tf-idfs all about?](What-are-Tf-idfs-all-about?)\n", "- [Example data](#Example-data)\n", "- [Raw term frequency](#Raw-term-frequency)\n", "- [Normalized term frequency](#Normalized-term-frequency)\n", "- [Term frequency-inverse document frequency -- tf-idf](#Term-frequency-inverse-document-frequency----tf-idf)\n", "- [Inverse document frequency](#Inverse-document-frequency)\n", "- [Normalized tf-idf](#Normalized-tf-idf)\n", "- [Smooth idf](#Smooth-idf)\n", "- [Tf-idf in scikit-learn](#Tf-idf-in-scikit-learn)\n", "- [TfidfVectorizer defaults](#TfidfVectorizer-defaults)\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What are Tf-idfs all about?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tf-idfs are a way to represent documents as feature vectors. Tf-idfs can be understood as a modification of the *raw term frequencies* (tf); the tf is the count of how often a particular word occurs in a given document. The concept behind the tf-idf is to downweight terms proportionally to the number of documents in which they occur. Here, the idea is that terms that occur in many different documents are likely unimportant or don't contain any useful information for Natural Language Processing tasks such as document classification." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the following sections, let us consider the following dataset that consists of 3 documents:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "docs = np.array([\n", " 'The sun is shining',\n", " 'The weather is sweet',\n", " 'The sun is shining and the weather is sweet'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Raw term frequency" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we will start with the *raw term frequency* tf(t, d), which is defined by the number of times a term *t* occurs in a document *t*\n", "\n", "
\n", "Alternative term frequency definitions include the binary term frequency, log-scaled term frequency, and augmented term frequency [[1](#References)].\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from scikit-learn, we can construct a bag-of-words model with the term frequencies as follows:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[0, 1, 1, 1, 0, 1, 0],\n", " [0, 1, 0, 0, 1, 1, 1],\n", " [1, 2, 1, 1, 1, 2, 1]], dtype=int64)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "cv = CountVectorizer()\n", "tf = cv.fit_transform(docs).toarray()\n", "tf" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'and': 0, 'is': 1, 'shining': 2, 'sun': 3, 'sweet': 4, 'the': 5, 'weather': 6}" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cv.vocabulary_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the vocabulary, the word \"and\" would be the 1st feature in each document vector in `tf`, the word \"is\" the 2nd, the word \"shining\" the 3rd, etc." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Normalized term frequency" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section, we will take a brief look at how the tf-feature vector can be normalized, which will be useful later." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most common way to normalize the raw term frequency is l2-normalization, i.e., dividing the raw term frequency vector $v$ by its length $||v||_2$ (L2- or Euclidean norm).\n", "\n", "$$v_{norm} = \\frac{v}{||v||_2} = \\frac{v}{\\sqrt{v{_1}^2 + v{_2}^2 + \\dots + v{_n}^2}} = \\frac{v}{\\big(\\sum_{i=1}^n v_i \\big)^{\\frac{1}{2}}}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, we would normalize our 3rd document `'The sun is shining and the weather is sweet'` as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\\frac{[1, 2, 1, 1, 1, 2, 1]}{2} = [ 0.2773501, 0.5547002, 0.2773501, 0.2773501, 0.2773501,\n", " 0.5547002, 0.2773501]$" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0.2773501, 0.5547002, 0.2773501, 0.2773501, 0.2773501,\n", " 0.5547002, 0.2773501])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tf_norm = tf[2] / np.sqrt(np.sum(tf[2]**2))\n", "tf_norm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In scikit-learn, we can use the [`TfidfTransformer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) to normalize the term frequencies if we disable the inverse document frequency calculation (`use_idf: False` and `smooth_idf=False`):" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Normalized term frequencies of document 3:\n", " [ 0.2773501 0.5547002 0.2773501 0.2773501 0.2773501 0.5547002\n", " 0.2773501]\n" ] } ], "source": [ "from sklearn.feature_extraction.text import TfidfTransformer\n", "tfidf = TfidfTransformer(use_idf=False, norm='l2', smooth_idf=False)\n", "tf_norm = tfidf.fit_transform(tf).toarray()\n", "print('Normalized term frequencies of document 3:\\n %s' % tf_norm[-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Term frequency-inverse document frequency -- tf-idf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most commonly, the term frequency-inverse document frequency (tf-idf) is calculated as follows [[1](#References)]:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\\text{tf-idf}(t, d) = \\text{tf}(t, d) \\times \\text{idf}(t),$$\n", "where idf is the inverse document frequency, which we will introduce in the next section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inverse document frequency" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to understand the *inverse document frequency* idf, let us first introduce the term *document frequency* $\\text{df}(d,t)$, which simply the number of documents $d$ that contain the term $t$. We can then define the idf as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\\text{idf}(t) = log{\\frac{n_d}{1+\\text{df}(d,t)}},$$ \n", "where \n", "$n_d$: The total number of documents \n", "$\\text{df}(d,t)$: The number of documents that contain term $t$.\n", "\n", "Note that the constant 1 is added to the denominator to avoid a zero-division error if a term is not contained in any document in the test dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, Let us calculate the idfs of the words \"and\", \"is,\" and \"shining:\"" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "idf \"and\": 0.405465108108\n", "idf \"is\": -0.287682072452\n", "idf \"shining\": 0.0\n" ] } ], "source": [ "n_docs = len(docs)\n", "\n", "df_and = 1\n", "idf_and = np.log(n_docs / (1 + df_and))\n", "print('idf \"and\": %s' % idf_and)\n", "\n", "df_is = 3\n", "idf_is = np.log(n_docs / (1 + df_is))\n", "print('idf \"is\": %s' % idf_is)\n", "\n", "df_shining = 2\n", "idf_shining = np.log(n_docs / (1 + df_shining))\n", "print('idf \"shining\": %s' % idf_shining)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using those idfs, we can eventually calculate the tf-idfs for the 3rd document:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tf-idfs in document 3:\n", "\n", "tf-idf \"and\": 0.405465108108\n", "tf-idf \"is\": -0.575364144904\n", "tf-idf \"shining\": 0.0\n" ] } ], "source": [ "print('Tf-idfs in document 3:\\n')\n", "print('tf-idf \"and\": %s' % (1 * idf_and))\n", "print('tf-idf \"is\": %s' % (2 * idf_is))\n", "print('tf-idf \"shining\": %s' % (1 * idf_shining))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tf-idf in scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The tf-idfs in scikit-learn are calculated a little bit differently. Here, the `+1` count is added to the idf, whereas instead of the denominator if the df:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\\text{idf}(t) = log{\\frac{n_d}{\\text{df}(d,t)}} + 1$$ " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can demonstrate this by calculating the idfs manually using the equation above and comparing the results to the TfidfTransformer output using the settings `use_idf=True, smooth_idf=False, norm=None`.\n", "In the following examples, we will be again using the words \"and,\" \"is,\" and \"shining:\" from document 3." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2.09861228866811" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tf_and = 1 \n", "df_and = 1 \n", "tf_and * (np.log(n_docs / df_and) + 1)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2.0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tf_shining = 2 \n", "df_shining = 3 \n", "tf_shining * (np.log(n_docs / df_shining) + 1)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1.4054651081081644" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tf_is = 1 \n", "df_is = 2 \n", "tf_is * (np.log(n_docs / df_is) + 1)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 2.09861229, 2. , 1.40546511])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tfidf = TfidfTransformer(use_idf=True, smooth_idf=False, norm=None)\n", "tfidf.fit_transform(tf).toarray()[-1][0:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Normalized tf-idf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let us calculate the normalized tf-idfs. Our feature vector of un-normalized tf-idfs for document 3 would look as follows if we'd applied the equation from the previous section to all words in the document:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tf_idfs_d3 = np.array([[ 2.09861229, 2.0, 1.40546511, 1.40546511, 1.40546511, 2.0, 1.40546511]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the l2-norm, we then normalize the tf-idfs as follows:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0.46572049, 0.44383662, 0.31189844, 0.31189844, 0.31189844,\n", " 0.44383662, 0.31189844])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tf_idfs_d3_norm = tf_idfs_d3[-1] / np.sqrt(np.sum(tf_idfs_d3[-1]**2))\n", "tf_idfs_d3_norm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And finally, we compare the results to the results that the `TfidfTransformer` returns." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0.46572049, 0.44383662, 0.31189844, 0.31189844, 0.31189844,\n", " 0.44383662, 0.31189844])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tfidf = TfidfTransformer(use_idf=True, smooth_idf=False, norm='l2')\n", "tfidf.fit_transform(tf).toarray()[-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Smooth idf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another parameter in the `TfidfTransformer` is the `smooth_idf`, which is described as\n", "> smooth_idf : boolean, default=True \n", "Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, our idf would then be defined as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\\text{idf}(t) = log{\\frac{1 + n_d}{1+\\text{df}(d,t)}} + 1$$ " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To confirm that we understand the `smooth_idf` parameter correctly, let us walk through the 3-word example again:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 1.69314718, 2. , 1.28768207])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tfidf = TfidfTransformer(use_idf=True, smooth_idf=True, norm=None)\n", "tfidf.fit_transform(tf).toarray()[-1][:3]" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1.6931471805599454" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tf_and = 1 \n", "df_and = 1 \n", "tf_and * (np.log((n_docs+1) / (df_and+1)) + 1) " ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2.0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tf_is = 2\n", "df_is = 3 \n", "tf_is * (np.log((n_docs+1) / (df_is+1)) + 1) " ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1.2876820724517808" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tf_shining = 1 \n", "df_shining = 2\n", "tf_shining * (np.log((n_docs+1) / (df_shining+1)) + 1) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#TfidfVectorizer defaults" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we now understand the default settings in the `TfidfTransformer`, which are:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- `use_idf=True`\n", "- `smooth_idf=True`\n", "- `norm='l2'`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the equation can be summarized as \n", "$\\text{tf-idf} = \\text{tf}(t) \\times (\\text{idf}(t, d) + 1),$ \n", "where \n", "$\\text{idf}(t) = log{\\frac{1 + n_d}{1+\\text{df}(d,t)}}.$" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0.40474829, 0.47810172, 0.30782151, 0.30782151, 0.30782151,\n", " 0.47810172, 0.30782151])" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tfidf = TfidfTransformer()\n", "tfidf.fit_transform(tf).toarray()[-1]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0.40474829, 0.47810172, 0.30782151, 0.30782151, 0.30782151,\n", " 0.47810172, 0.30782151])" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "smooth_tfidfs_d3 = np.array([[ 1.69314718, 2.0, 1.28768207, 1.28768207, 1.28768207, 2.0, 1.28768207]])\n", "smooth_tfidfs_d3_norm = smooth_tfidfs_d3[-1] / np.sqrt(np.sum(smooth_tfidfs_d3[-1]**2))\n", "smooth_tfidfs_d3_norm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[1] G. Salton and M. J. McGill. Introduction to modern information retrieval. 1983." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }