{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explanation of Term frequency - inverse document frequency."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For feature extraction we used Sci-Kit Learns, tf-idf vectorizer.  It is a count vectorizer combined with idf.  The count vectorizer measures term frequency(tf), ie how often a word appears in a title.  If we do this for the following sentences then we produce the matrix below.  \n",
    "\n",
    "##### Title 1: The dog jumped over the fence\n",
    "##### Title 2: The cat chased the dog\n",
    "##### Title 3: The white cat chased the brown cat who jumped over the orange cat\n",
    "\n",
    "\n",
    "|          |the | dog | jumped | over | fence | cat | chased | white | brown | who | orange|\n",
    "|:----------------------------------------------------------------------------------------:|\n",
    "|Title 1   | 2  |  1  |  1     |  1   |   1   |  0  |   0    |   0   |   0   |  0  |   0   |\n",
    "|Title 2   | 1  |  1  |  0     |  0   |   0   |  1  |   1    |   0   |   0   |  0  |   0   |\n",
    "|Title 3   | 3  |  0   | 1     |  1   |   0   |  3  |   1    |   1   |   1   |  1  |   1   |\n",
    "\n",
    "The downside of just using tf is that words that appear most often tend to dominate the vector.  To overcome this we use a combination of term frequency - inverse document frequency(tf-idf).  Idf is measure of whether a term is common or rare across all documents [Side note 2].  Idf is the log of one plus the number of documents(N) divided by the number of documents a term(n) appears in.  The one is present so that the equation doesn't evaluate to zero.\n",
    "\n",
    "\\begin{equation*}\n",
    "log(1 +\\frac{N}{n_t})\n",
    "\\end{equation*}\n",
    "\n",
    "Essentially, Tf-idf creates a word vector in which a word is weighted by its occurence not only in the title it was derived from but also the entire group of titles(corpus). Tf-idf is calculated by the following formula\n",
    "\n",
    "t = term,\n",
    "d = single title,\n",
    "D = all titles\n",
    "\n",
    "\\begin{equation*}\n",
    "tfidf(t,d,D) = tf(t,d)\\cdot idf(t, D)\n",
    "\\end{equation*}\n",
    "\n",
    "\n",
    "Below is the workflow for calculating tfidf for the term \"cat\" in the above titles.\n",
    "\n",
    "\\begin{equation*}\n",
    "tf(\"cat\",d_1) = \\frac{0}{6} = 0\n",
    "\\end{equation*}\n",
    "\n",
    "\\begin{equation*}\n",
    "tf(\"cat\",d_2) = \\frac{1}{4} = 0.250\n",
    "\\end{equation*}\n",
    "\n",
    "\\begin{equation*}\n",
    "tf(\"cat\",d_3) = \\frac{3}{13} \\approx 0.231\n",
    "\\end{equation*}\n",
    "\n",
    "\\begin{equation*}\n",
    "idf(\"cat\",D) = log(1 + \\frac{3}{2}) \\approx 0.4\n",
    "\\end{equation*}\n",
    "\n",
    "\\begin{equation*}\n",
    "tfidf(\"cat\", d_1) = tf(\"cat\", d_1) \\times idf(\"cat\", D) = 0 \\times 0.4 = 0\n",
    "\\end{equation*}\n",
    "\n",
    "\\begin{equation*}\n",
    "tfidf(\"cat\", d_2) = tf(\"cat\", d_2) \\times idf(\"cat\", D) = 0.250 \\times 0.4 = 0.1\n",
    "\\end{equation*}\n",
    "\n",
    "\\begin{equation*}\n",
    "tfidf(\"cat\", d_3) = tf(\"cat\", d_3) \\times idf(\"cat\", D) = 0.231 \\times 0.4 = 0.0924\n",
    "\\end{equation*}\n"
   ]
  },
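  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The hand calculation for the term 'cat' can be reproduced with a few lines of plain Python.  This is only a minimal sketch of the formulas as written in this notebook, assuming lowercase whitespace tokenization, tf as count divided by title length, and a base-10 log.  The helper names `tf`, `idf` and `tfidf` are illustrative and are not part of any library.  scikit-learn's `TfidfVectorizer` by default uses raw counts, a smoothed natural-log idf and L2 row normalization, so its output is not expected to match these numbers.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "titles = [\n",
    "    'The dog jumped over the fence',\n",
    "    'The cat chased the dog',\n",
    "    'The white cat chased the brown cat who jumped over the orange cat',\n",
    "]\n",
    "\n",
    "# Assumption: lowercase and split on whitespace as a stand-in for a real tokenizer.\n",
    "docs = [title.lower().split() for title in titles]\n",
    "\n",
    "\n",
    "def tf(term, doc):\n",
    "    # Count of the term divided by the number of words in the title.\n",
    "    return doc.count(term) / len(doc)\n",
    "\n",
    "\n",
    "def idf(term, docs):\n",
    "    # log10(1 + N / n_t); assumes the term occurs in at least one title.\n",
    "    n_t = sum(1 for doc in docs if term in doc)\n",
    "    return math.log10(1 + len(docs) / n_t)\n",
    "\n",
    "\n",
    "def tfidf(term, doc, docs):\n",
    "    return tf(term, doc) * idf(term, docs)\n",
    "\n",
    "\n",
    "term = 'cat'\n",
    "for i, doc in enumerate(docs, start=1):\n",
    "    print(f'Title {i}: tfidf({term!r}) = {tfidf(term, doc, docs):.3f}')\n",
    "```\n",
    "\n",
    "Because the idf value is not rounded here, the printed results can differ in the third decimal place from a hand calculation that rounds idf to 0.4."
   ]
  },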
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}