{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Topic identification\n", "\n", "This project was originally from [here](https://www.datacamp.com/courses/natural-language-processing-fundamentals-in-python)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Word Vectors\n", "\n", "From [Wikipedia](https://en.wikipedia.org/wiki/Word_embedding):\n", "\n", "> Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.\n", "\n", "Word vectors are multi-dimensional representations of words that allow one to capture relationships between words. These relationships are learned by NLP algorithms from how the words are used throughout a text corpus. An example is the difference between word vectors: [the difference is similar](https://www.datacamp.com/courses/natural-language-processing-fundamentals-in-python) for pairs of words such as man and woman and king and queen." ] },
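{ "cell_type": "markdown", "metadata": {}, "source": [ "The classic illustration is the analogy *king* - *man* + *woman*, whose result is a vector close to that of *queen*. Below is a minimal sketch (an illustrative aside, not part of the original project) of how this could be checked with pretrained GloVe vectors loaded through `gensim.downloader`; it downloads data, so it is left unexecuted here:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch: requires an internet connection for gensim.downloader.\n", "# Load small pretrained GloVe vectors and query the words closest to\n", "# vector('king') - vector('man') + vector('woman'); 'queen' should rank near the top.\n", "import gensim.downloader as api\n", "\n", "word_vectors = api.load('glove-wiki-gigaword-50')  # a KeyedVectors object\n", "print(word_vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))" ] },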
{ "cell_type": "markdown", "metadata": {}, "source": [ "### `Gensim` dictionary class and corpus\n", "\n", "This is best explained with an example. Consider the list `quotes` containing quotes from the Chinese philosopher and writer [Lao Tzu](https://en.wikipedia.org/wiki/Laozi):\n", "- `word_tokenize` tokenizes each string in `quotes` after lowercasing it; stopwords and tokens with four or fewer characters are then dropped\n", "- The `Dictionary` class creates a mapping with an id for each token, which can be inspected using `token2id`. \n", "- A `Gensim` corpus transforms a document into a bag-of-words using the token ids and their frequencies in the document. The corpus is a list of sublists, each sublist corresponding to one document\n", "\n", "Since we will be counting tokens, I introduced some extra repeated words in the quotes!" ] }, { "cell_type": "code", "execution_count": 171, "metadata": {}, "outputs": [], "source": [ "from IPython.core.interactiveshell import InteractiveShell\n", "InteractiveShell.ast_node_interactivity = \"all\" # see the value of multiple statements at once." ] }, { "cell_type": "code", "execution_count": 172, "metadata": {}, "outputs": [], "source": [ "#nltk.download('punkt')" ] }, { "cell_type": "code", "execution_count": 173, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "from gensim.corpora.dictionary import Dictionary\n", "from nltk.tokenize import word_tokenize\n", "\n", "quotes = ['When I let go of what I am am, I become become become what what I might be',\n", " 'Mastering others is strength strength strength. Mastering Mastering Mastering Mastering Mastering\\\n", " yourself yourself yourself yourself is true power',\n", " 'When you are content content content to be simply yourself and do not compare or compete, everybody will respect you',\n", " 'Great acts are made up of small deeds deeds deeds deeds',\n", " 'An ant on the move does more than a dozing dozing dozing dozing ox',\n", " 'Anticipate Anticipate Anticipate Anticipate the difficult by managing the easy',\n", " 'Nature does not hurry, yet everything everything everything everything is accomplished accomplished accomplished',\n", " 'To see things in the seed, that is genius genius genius genius genius genius']\n", "\n", "tokenized_quotes = [word_tokenize(quote.lower()) for quote in quotes] " ] }, { "cell_type": "code", "execution_count": 174, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['mastering', 'others', 'is', 'strength', 'strength', 'strength', '.', 'mastering', 'mastering', 'mastering', 'mastering', 'mastering', 'yourself', 'yourself', 'yourself', 'yourself', 'is', 'true', 'power']\n" ] } ], "source": [ "print(tokenized_quotes[1])" ] }, { "cell_type": "code", "execution_count": 175, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[['become', 'become', 'become', 'might'], ['mastering', 'others', 'strength', 'strength', 'strength', 'mastering', 'mastering', 'mastering', 'mastering', 'mastering', 'power'], ['content', 'content', 'content', 'simply', 'compare', 'compete', 'everybody', 'respect'], ['great', 'small', 'deeds', 'deeds', 'deeds', 'deeds'], ['dozing', 'dozing', 'dozing', 'dozing'], ['anticipate', 'anticipate', 'anticipate', 'anticipate', 'difficult', 'managing'], ['nature', 'hurry', 'everything', 'everything', 'everything', 'everything', 'accomplished', 'accomplished', 'accomplished'], ['things', 'genius', 'genius', 'genius', 'genius', 'genius', 'genius']]\n" ] } ], "source": [ "from stop_words import get_stop_words\n", "en_stop_words = get_stop_words('en')\n", "tokenized_quotes_stopped = []\n", "for token_lst in tokenized_quotes:\n", " tokenized_quotes_stopped.append([i for i in token_lst if i not in en_stop_words and len(i) > 4])\n", "print(tokenized_quotes_stopped)" ] }, { "cell_type": "code", "execution_count": 176, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'mastering', 'others', 'power', 'strength'}" ] }, "execution_count": 176, "metadata": {}, "output_type": "execute_result" } ], "source": [ "set(tokenized_quotes_stopped[1])" ] }, { "cell_type": "code", "execution_count": 177, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'become': 0, 'might': 1, 'mastering': 2, 'others': 3, 'power': 4, 'strength': 5, 'compare': 6, 'compete': 7, 'content': 8, 'everybody': 9, 'respect': 10, 'simply': 11, 'deeds': 12, 'great': 13, 'small': 14, 'dozing': 15, 'anticipate': 16, 'difficult': 17, 'managing': 18, 'accomplished': 19, 'everything': 20, 'hurry': 21, 'nature': 22, 'genius': 23, 'things': 24}\n" ] } ], "source": [ "dictionary = Dictionary(tokenized_quotes_stopped) \n", "print(dictionary.token2id)" ] }, { "cell_type": "code", "execution_count": 178, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[(0, 3), (1, 1)], [(2, 6), (3, 1), (4, 1), (5, 3)], [(6, 1), (7, 1), (8, 3), (9, 1), (10, 1), (11, 1)], [(12, 4), (13, 1), (14, 1)], [(15, 4)], [(16, 4), (17, 1), (18, 1)], [(19, 3), (20, 4), (21, 1), (22, 1)], [(23, 6), (24, 1)]] \n", "\n", "[(0, 3), (1, 1)]\n", "['become', 'become', 'become', 'might']\n", "[(2, 6), (3, 1), (4, 1), (5, 3)]\n", "['mastering', 'others', 'strength', 'strength', 'strength', 'mastering', 'mastering', 'mastering', 'mastering', 'mastering', 'power'] \n", "\n", "dict_keys(['become', 'might', 'mastering', 'others', 'power', 'strength', 'compare', 'compete', 'content', 'everybody', 'respect', 'simply', 'deeds', 'great', 'small', 'dozing', 'anticipate', 'difficult', 'managing', 'accomplished', 'everything', 'hurry', 'nature', 'genius', 'things'])\n" ] } ], "source": [ "corpus = [dictionary.doc2bow(doc) for doc in tokenized_quotes_stopped]\n", "print(corpus,'\\n')\n", "print(corpus[0])\n", "print(tokenized_quotes_stopped[0])\n", "print(corpus[1])\n", "print(tokenized_quotes_stopped[1],'\\n')\n", "print(dictionary.token2id.keys())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the corpus above, consider the first and second sublists, corresponding to the first and second quotes:\n", "\n", " corpus[0] -> [(0, 3), (1, 1)]\n", " corpus[1] -> [(2, 6), (3, 1), (4, 1), (5, 3)]\n", "\n", "These tuples represent:\n", "\n", " (token id from the dictionary, token frequency in the quote)\n", "\n", "The first tuple (2, 6) of corpus[1], for example, says that the token 'mastering', whose id = 2 (the id can be obtained using `.get()`, as shown below), appeared six times in the second quote. " ] }, { "cell_type": "code", "execution_count": 179, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 179, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dictionary.token2id.get(\"mastering\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Most common terms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To obtain the most common terms in the second quote (and across all quotes) we can proceed as follows. First we sort the tuples in `corpus[1]` by frequency. Note the syntax here: the `key` argument defines the sorting criterion, a function that maps each tuple `w` to its element with index 1 (the frequency):\n", "\n", " w -> w[1]\n", " \n", "This function is implemented using a **lambda**. " ] }, { "cell_type": "code", "execution_count": 180, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(2, 6), (5, 3), (3, 1), (4, 1)]" ] }, "execution_count": 180, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quote = corpus[1]\n", "quote = sorted(quote, key=lambda w: w[1], reverse=True)\n", "quote" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using `dictionary.get(word_id)` we find the word corresponding to the id `word_id` in the dictionary. The `for` loop below lists the three most frequent terms of the second quote of the corpus, together with their counts:" ] }, { "cell_type": "code", "execution_count": 181, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2 | 6 | mastering | 6\n", "5 | 3 | strength | 3\n", "3 | 1 | others | 1\n" ] } ], "source": [ "for word_id, word_count in quote[:3]:\n", " print(word_id,'|', word_count,'|',dictionary.get(word_id),'|', word_count)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now create an empty dictionary using `defaultdict` to hold the total word counts:" ] }, { "cell_type": "code", "execution_count": 182, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "defaultdict(int, {})" ] }, "execution_count": 182, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from collections import defaultdict\n", "total_word_count = defaultdict(int)\n", "total_word_count " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will also need `itertools.chain.from_iterable`. From the docs:\n", "\n", "> Make an iterator that returns elements from the first iterable until it is exhausted, then proceeds to the next iterable, until all of the iterables are exhausted. Used for treating consecutive sequences as a single sequence. \n", "\n", "An example:" ] }, { "cell_type": "code", "execution_count": 183, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(0, 3), (1, 1), (2, 6), (3, 1), (4, 1), (5, 3), (6, 1), (7, 1), (8, 3), (9, 1), (10, 1), (11, 1)]\n", "(word_id, word_count) of tuple 0: (0, 3)\n", "(word_id, word_count) of tuple 1: (1, 1)\n", "(word_id, word_count) of tuple 2: (2, 6)\n", "(word_id, word_count) of tuple 3: (3, 1)\n", "(word_id, word_count) of tuple 4: (4, 1)\n", "(word_id, word_count) of tuple 5: (5, 3)\n", "(word_id, word_count) of tuple 6: (6, 1)\n", "(word_id, word_count) of tuple 7: (7, 1)\n", "(word_id, word_count) of tuple 8: (8, 3)\n", "(word_id, word_count) of tuple 9: (9, 1)\n", "(word_id, word_count) of tuple 10: (10, 1)\n", "(word_id, word_count) of tuple 11: (11, 1)\n" ] } ], "source": [ "import itertools\n", "import operator\n", "\n", "print(list(itertools.chain.from_iterable(corpus[0:3])))\n", "\n", "i=0\n", "for word_id, word_count in itertools.chain.from_iterable(corpus[0:3]):\n", " print('(word_id, word_count) of tuple {}:'.format(i),(word_id, word_count))\n", " i +=1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So `itertools.chain.from_iterable` joins the (word_id, word_count) tuples of all the documents into one flat sequence." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using a `for` loop we create a word_id entry in the empty dictionary and, for the word corresponding to each id, we sum all of its occurrences in the corpus:" ] }, { "cell_type": "code", "execution_count": 184, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(2, 6), (23, 6), (12, 4), (15, 4), (16, 4), (20, 4), (0, 3), (5, 3), (8, 3), (19, 3), (1, 1), (3, 1), (4, 1), (6, 1), (7, 1), (9, 1), (10, 1), (11, 1), (13, 1), (14, 1), (17, 1), (18, 1), (21, 1), (22, 1), (24, 1)]\n" ] } ], "source": [ "total_word_count = defaultdict(int)\n", "for word_id, word_count in itertools.chain.from_iterable(corpus):\n", " total_word_count[word_id] += word_count\n", "sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True) \n", "print(sorted_word_count)" ] },
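{ "cell_type": "markdown", "metadata": {}, "source": [ "As a cross-check (a minimal sketch reusing the `corpus` built above, not part of the original flow), the same totals can be obtained more compactly with `collections.Counter`, whose `most_common` method also takes care of the sorting:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "\n", "# Each document is a list of (token id, count) pairs; Counter(dict(doc)) turns it into\n", "# a Counter, and summing the Counters adds up the counts over the whole corpus.\n", "total = sum((Counter(dict(doc)) for doc in corpus), Counter())\n", "print(total.most_common(5))" ] },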
{ "cell_type": "markdown", "metadata": {}, "source": [ "We end up with the number of times each word appears in the full corpus:" ] }, { "cell_type": "code", "execution_count": 185, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Frequency of the term \"mastering\" is: 6\n", "Frequency of the term \"genius\" is: 6\n", "Frequency of the term \"deeds\" is: 4\n", "Frequency of the term \"dozing\" is: 4\n", "Frequency of the term \"anticipate\" is: 4\n" ] } ], "source": [ "for word_id, word_count in sorted_word_count[:5]:\n", " print('Frequency of the term'+' \"'+dictionary.get(word_id)+'\"'+' is:', word_count)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }