{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Explanation of Term frequency - inverse document frequency." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For feature extraction we used Sci-Kit Learns, tf-idf vectorizer. It is a count vectorizer combined with idf. The count vectorizer measures term frequency(tf), ie how often a word appears in a title. If we do this for the following sentences then we produce the matrix below. \n", "\n", "##### Title 1: The dog jumped over the fence\n", "##### Title 2: The cat chased the dog\n", "##### Title 3: The white cat chased the brown cat who jumped over the orange cat\n", "\n", "\n", "| |the | dog | jumped | over | fence | cat | chased | white | brown | who | orange|\n", "|:----------------------------------------------------------------------------------------:|\n", "|Title 1 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |\n", "|Title 2 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |\n", "|Title 3 | 3 | 0 | 1 | 1 | 0 | 3 | 1 | 1 | 1 | 1 | 1 |\n", "\n", "The downside of just using tf is that words that appear most often tend to dominate the vector. To overcome this we use a combination of term frequency - inverse document frequency(tf-idf). Idf is measure of whether a term is common or rare across all documents [Side note 2]. Idf is the log of one plus the number of documents(N) divided by the number of documents a term(n) appears in. The one is present so that the equation doesn't evaluate to zero.\n", "\n", "\\begin{equation*}\n", "log(1 +\\frac{N}{n_t})\n", "\\end{equation*}\n", "\n", "Essentially, Tf-idf creates a word vector in which a word is weighted by its occurence not only in the title it was derived from but also the entire group of titles(corpus). Tf-idf is calculated by the following formula\n", "\n", "t = term,\n", "d = single title,\n", "D = all titles\n", "\n", "\\begin{equation*}\n", "tfidf(t,d,D) = tf(t,d)\\cdot idf(t, D)\n", "\\end{equation*}\n", "\n", "\n", "Below is the workflow for calculating tfidf for the term \"cat\" in the above titles.\n", "\n", "\\begin{equation*}\n", "tf(\"cat\",d_1) = \\frac{0}{6} = 0\n", "\\end{equation*}\n", "\n", "\\begin{equation*}\n", "tf(\"cat\",d_2) = \\frac{1}{4} = 0.250\n", "\\end{equation*}\n", "\n", "\\begin{equation*}\n", "tf(\"cat\",d_3) = \\frac{3}{13} \\approx 0.231\n", "\\end{equation*}\n", "\n", "\\begin{equation*}\n", "idf(\"cat\",D) = log(1 + \\frac{3}{2}) \\approx 0.4\n", "\\end{equation*}\n", "\n", "\\begin{equation*}\n", "tfidf(\"cat\", d_1) = tf(\"cat\", d_1) \\times idf(\"cat\", D) = 0 \\times 0.4 = 0\n", "\\end{equation*}\n", "\n", "\\begin{equation*}\n", "tfidf(\"cat\", d_2) = tf(\"cat\", d_2) \\times idf(\"cat\", D) = 0.250 \\times 0.4 = 0.1\n", "\\end{equation*}\n", "\n", "\\begin{equation*}\n", "tfidf(\"cat\", d_3) = tf(\"cat\", d_3) \\times idf(\"cat\", D) = 0.231 \\times 0.4 = 0.0924\n", "\\end{equation*}\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }