\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
evilhorizonofproblemqueen
010110
110001
201010
\n", "
" ], "text/plain": [ " evil horizon of problem queen\n", "0 1 0 1 1 0\n", "1 1 0 0 0 1\n", "2 0 1 0 1 0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "pd.DataFrame(X.toarray(), columns=vec.get_feature_names())" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "There are some issues with this approach, however: the raw word counts lead to features which put too much weight on words that appear very frequently, and this can be sub-optimal in some classification algorithms.\n", "One approach to fix this is known as *term frequency-inverse document frequency* (*TF–IDF*) which weights the word counts by a measure of how often they appear in the documents.\n", "The syntax for computing these features is similar to the previous example:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
evilhorizonofproblemqueen
00.5178560.0000000.6809190.5178560.000000
10.6053490.0000000.0000000.0000000.795961
20.0000000.7959610.0000000.6053490.000000
\n", "