{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<small><small><i>\n", "All the IPython Notebooks in **[Python Natural Language Processing](https://github.com/milaan9/Python_Python_Natural_Language_Processing)** lecture series by **[Dr. Milaan Parmar](https://www.linkedin.com/in/milaanparmar/)** are available @ **[GitHub](https://github.com/milaan9)**\n", "</i></small></small>" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "<a href=\"https://colab.research.google.com/github/milaan9/Python_Python_Natural_Language_Processing/blob/main/10_TF_IFD.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" ] }, { "cell_type": "markdown", "metadata": { "id": "ctzoI3Zjr43C" }, "source": [ "# 10 TF-IDF (Term Frequency-Inverse Document Frequency )\n", "\n", "The scoring method being used above takes the count of each word and represents the word in the vector by the number of counts of that particular word. What does a word having high word count signify?\n", "\n", "Does this mean that the word is important in retrieving information about documents? The answer is NO. Let me explain, if a word occurs many times in a document but also along with many other documents in our dataset, maybe it is because this word is just a frequent word; not because it is relevant or meaningful.One approach is to rescale the frequency of words\n", "- **TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.**\n", "\n", "- **The term frequency (TF)** of a word in a document is number of times nth word occurred in a document by total number of words in a document. \n", "\n", "- TF(i,j)=n(i,j)/Σ n(i,j)\n", "\n", "n(i,j )= number of times nth word occurred in a document\n", "\n", "Σn(i,j) = total number of words in a document. \n", "\n", "- The **inverse document frequency(IDF)** of the word across a set of documents is logarithmic of total number of documents in the dataset by total number of documents in which nth word occur.\n", "\n", "- IDF=1+log(N/dN)\n", "\n", "N=Total number of documents in the dataset\n", "\n", "dN=total number of documents in which nth word occur \n", "\n", "**NOTE:** The 1 added in the above formula is so that terms with zero IDF don’t get suppressed entirely.\n", " - if the word is very common and appears in many documents, this number will approach 0. Otherwise, it will approach 1.\n", "\n", " The TF-IDF is obtained by **TF-IDF=TF*IDF**\n", "\n", " ## **Limitations of Bag-of-Words**\n", "\n", "1. The model ignores the location information of the word. The location information is a piece of very important information in the text. For example “today is off” and “Is today off”, have the exact same vector representation in the BoW model.\n", "2. Bag of word models doesn’t respect the semantics of the word. For example, words ‘soccer’ and ‘football’ are often used in the same context. However, the vectors corresponding to these words are quite different in the bag of words model. The problem becomes more serious while modeling sentences. Ex: “Buy used cars” and “Purchase old automobiles” are represented by totally different vectors in the Bag-of-words model.\n", "3. The range of vocabulary is a big issue faced by the Bag-of-Words model. For example, if the model comes across a new word it has not seen yet, rather we say a rare, but informative word like Biblioklept(means one who steals books). The BoW model will probably end up ignoring this word as this word has not been seen by the model yet." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "70C8SlGKOrbC" }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "# list of text documents\n", "text = [\"The car is driven on the road.\",\"The truck is driven on the highway\"]" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 119 }, "id": "D292_wmIOrbH", "outputId": "e7af493d-5c6d-4692-f692-d79936b400d8" }, "outputs": [ { "data": { "text/plain": [ "CountVectorizer()" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create the transform\n", "vectorizer = CountVectorizer()\n", "\n", "# tokenize and build vocab\n", "vectorizer.fit(text)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "id": "vcO22gdUOrbN", "outputId": "c5aecba1-dfe6-42e2-90c3-540d2308e8c7" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'the': 6, 'car': 0, 'is': 3, 'driven': 1, 'on': 4, 'road': 5, 'truck': 7, 'highway': 2}\n" ] } ], "source": [ "# summarize\n", "print(vectorizer.vocabulary_)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "id": "rv7KW5fbOrbY", "outputId": "bbe43e2d-31a2-44c4-9ccf-7d217e737012" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[1 1 0 1 1 1 2 0]\n", " [0 1 1 1 1 0 2 1]]\n" ] } ], "source": [ "# encode document\n", "newvector = vectorizer.transform(text)\n", "\n", "# summarize encoded vector\n", "print(newvector.toarray())" ] }, { "cell_type": "markdown", "metadata": { "id": "FFxTiPTdOrbb" }, "source": [ "# TF - IDF" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "SR58kbYKOrbd" }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "# list of text documents\n", "text = [\"The car is driven on the road.\",\"The truck is driven on the highway\"]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "-sBLXBhxOrbg" }, "outputs": [], "source": [ "# create the transform\n", "vectorizer = TfidfVectorizer()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 136 }, "id": "iAw4AQWOOrbl", "outputId": "e9ac4f93-5361-4d43-923b-ae36d940b87d" }, "outputs": [ { "data": { "text/plain": [ "TfidfVectorizer()" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# tokenize and build vocab\n", "vectorizer.fit(text)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "id": "-i8H3uq5Orbp", "outputId": "2d9344e9-1328-448f-baa9-ac579721b457" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1.40546511 1. 1.40546511 1. 1. 1.40546511\n", " 1. 1.40546511]\n" ] } ], "source": [ "#Focus on IDF VALUES\n", "print(vectorizer.idf_)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "id": "8Ky0-yh6Orbr", "outputId": "30c505bb-755f-45a6-a2ff-0f9ac75c0c73" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'the': 6, 'car': 0, 'is': 3, 'driven': 1, 'on': 4, 'road': 5, 'truck': 7, 'highway': 2}\n" ] } ], "source": [ "# summarize\n", "print(vectorizer.vocabulary_)" ] }, { "cell_type": "markdown", "metadata": { "id": "e49UjkY9ALiC" }, "source": [ "# **TF-IDF (KN)**" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "zU7HiomaArVq" }, "outputs": [], "source": [ "# import nltk\n", "# nltk.download('popular')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "3naVzyup_4kV" }, "outputs": [], "source": [ "import nltk\n", "\n", "paragraph = \"\"\"I have three visions for India. In 3000 years of our history, people from all over \n", " the world have come and invaded us, captured our lands, conquered our minds. \n", " From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,\n", " I see four milestones in my career\"\"\"\n", " \n", " \n", "# Cleaning the texts\n", "import re\n", "from nltk.corpus import stopwords\n", "from nltk.stem.porter import PorterStemmer\n", "from nltk.stem import WordNetLemmatizer\n", "\n", "ps = PorterStemmer()\n", "wordnet=WordNetLemmatizer()\n", "sentences = nltk.sent_tokenize(paragraph)\n", "corpus = []\n", "for i in range(len(sentences)):\n", " review = re.sub('[^a-zA-Z]', ' ', sentences[i])\n", " review = review.lower()\n", " review = review.split()\n", " review = [wordnet.lemmatize(word) for word in review if not word in set(stopwords.words('english'))]\n", " review = ' '.join(review)\n", " corpus.append(review)\n", " \n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "id": "4TSHKWHO_4Gu" }, "outputs": [], "source": [ "# Creating the TF-IDF model\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "cv = TfidfVectorizer()\n", "X = cv.fit_transform(corpus).toarray()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "_GuYYecv_32k", "outputId": "0ec93dfe-61a0-4bc8-88fb-9eb40f2d5b6f" }, "outputs": [ { "data": { "text/plain": [ "array([[0. , 0. , 0. , 0. , 0. ,\n", " 0. , 0. , 0. , 0. , 0.57735027,\n", " 0. , 0. , 0. , 0. , 0. ,\n", " 0. , 0. , 0. , 0. , 0.57735027,\n", " 0. , 0.57735027, 0. , 0. ],\n", " [0. , 0. , 0.31622777, 0. , 0.31622777,\n", " 0.31622777, 0. , 0. , 0.31622777, 0. ,\n", " 0.31622777, 0.31622777, 0. , 0.31622777, 0. ,\n", " 0. , 0.31622777, 0. , 0. , 0. ,\n", " 0. , 0. , 0.31622777, 0.31622777],\n", " [0.30151134, 0.30151134, 0. , 0.30151134, 0. ,\n", " 0. , 0.30151134, 0.30151134, 0. , 0. ,\n", " 0. , 0. , 0.30151134, 0. , 0.30151134,\n", " 0.30151134, 0. , 0.30151134, 0.30151134, 0. ,\n", " 0.30151134, 0. , 0. , 0. ]])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "s14ADW13_3yZ" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "collapsed_sections": [], "include_colab_link": true, "name": "10_TF_IFD.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 4 }