{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sebastian Raschka 23/05/2015 \n",
"\n",
"CPython 3.4.3\n",
"IPython 3.1.0\n",
"\n",
"scikit-learn 0.16.1\n",
"numpy 1.9.2\n",
"scipy 0.15.1\n"
]
}
],
"source": [
"%load_ext watermark\n",
"%watermark -a 'Sebastian Raschka' -p scikit-learn,numpy,scipy -v -d"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Tf-idf Walkthrough for scikit-learn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When I was using scikit-learn extremely handy `TfidfVectorizer` I had some trouble interpreting the results since they seem to be a little bit different from the standard convention of how the *term frequency-inverse document frequency* (tf-idf) is calculated. Here, I just put together a brief overview of how the `TfidfVectorizer` works -- mainly as personal reference sheet, but maybe it is useful to one or the other."
]
},
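{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the discussion concrete, here is a minimal usage sketch of the `TfidfVectorizer` on a small, hypothetical toy corpus (the documents below are just placeholders):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"# a small, hypothetical toy corpus\n",
"docs = ['The sun is shining',\n",
"        'The weather is sweet']\n",
"\n",
"tfidf = TfidfVectorizer()\n",
"X = tfidf.fit_transform(docs)  # sparse matrix of tf-idf weights"
]
},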
{
"cell_type": "markdown",
"metadata": {},
"source": [
"