{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# !pip install nltk\n", "# !pip install gensim" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import gensim, logging, nltk, string\n", "from nltk.corpus import brown\n", "from nltk.util import ngrams\n", "from random import shuffle\n", "\n", "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building Tf-Idf from the Brown Corpus\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we need to get the Brown Corpus, which is easily accessible through the Natural Language Toolkit ([nltk](https://www.nltk.org/))." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[nltk_data] Downloading package brown to /Users/ethan/nltk_data...\n", "[nltk_data] Package brown is already up-to-date!\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nltk.download('brown')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can view the words in this corpus quite easily:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['The',\n", " 'Fulton',\n", " 'County',\n", " 'Grand',\n", " 'Jury',\n", " 'said',\n", " 'Friday',\n", " 'an',\n", " 'investigation',\n", " 'of',\n", " \"Atlanta's\",\n", " 'recent',\n", " 'primary',\n", " 'election',\n", " 'produced',\n", " '``',\n", " 'no',\n", " 'evidence',\n", " \"''\",\n", " 'that']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "brown.words()[0:20]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Get the brown docs\n", "brown_docs = 
[brown.words(file_id) for file_id in brown.fileids()]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "500" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(brown_docs)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...],\n", " ['Austin', ',', 'Texas', '--', 'Committee', 'approval', ...]]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "brown_docs[0:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1: Build a Vocabulary and a Bag-of-words Representation of the Documents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our first step is to build a vocabulary and a bag-of-words representation of the Brown Corpus documents. \n", "\n", "The bag-of-words representation of the corpus is simply a matrix representation of the documents in which each row represents a document and each column a token. We use this representation to build the Tf-Idf model.\n", "\n", "The vocabulary is the set of tokens (i.e. the column names in the bag-of-words representation) in our corpus. These are the only tokens that our model will be capable of scoring; if a word or phrase is not in this set, it will be ignored." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from itertools import chain\n", "\n", "def tokenize(tokenized_doc):\n", "    unigrams = ngrams(tokenized_doc, 1)\n", "    bigrams = ngrams(tokenized_doc, 2)\n", "    tokens = chain(unigrams, bigrams)\n", "    return (\" \".join(token) for token in tokens if all(map(lambda x: x.isalpha(), token)))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['I', 'love', 'eating', 'pasta', 'I love', 'love eating', 'eating pasta']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(tokenize(nltk.word_tokenize(\"I love eating pasta.\")))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's generate the corpus bag-of-words representation and the dictionary using gensim's Text2BowTransformer." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text2BowTransformer(prune_at=2000000,\n", " tokenizer=)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Instantiate a transformer that can take a set of documents, tokenize them, and build a dictionary.\n", "from gensim.sklearn_api import Text2BowTransformer\n", "bow_transformer = Text2BowTransformer(tokenizer=tokenize)\n", "bow_transformer" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2018-07-22 12:09:42,420 : INFO : adding document #0 to Dictionary(0 unique tokens: [])\n", "2018-07-22 12:09:45,300 : INFO : built Dictionary(397456 unique tokens: ['A', 'A Highway', 'A revolving', 'A similar', 'A veteran']...) 
from 500 documents (total 1819791 corpus positions)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 21.6 s, sys: 1.18 s, total: 22.8 s\n", "Wall time: 23.4 s\n" ] } ], "source": [ "%%time\n", "# This will take a while so be patient\n", "corpus_bow = bow_transformer.fit_transform(brown_docs)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0, 4),\n", " (1, 1),\n", " (2, 1),\n", " (3, 1),\n", " (4, 1),\n", " (5, 1),\n", " (6, 1),\n", " (7, 1),\n", " (8, 2),\n", " (9, 1)]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corpus_bow[0][0:10]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "397456" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab = bow_transformer.gensim_model\n", "len(vocab)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['A',\n", " 'action of',\n", " 'does provide',\n", " 'legislators act',\n", " 'rejected a',\n", " 'traditional',\n", " 'Navigation',\n", " 'called for',\n", " 'insurance firms',\n", " 'representing the',\n", " 'would produce',\n", " 'boost',\n", " 'immediate action',\n", " 'retirement systems',\n", " 'year opposed',\n", " 'also called',\n", " 'element',\n", " 'ministers',\n", " 'such problems',\n", " 'Indicating',\n", " 'council voted',\n", " 'motorists',\n", " 'the rescue',\n", " 'Scotch Plains',\n", " 'explicit on']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Look up each token by its id (avoids shadowing the built-in `id`)\n", "tokens = [vocab[token_id] for token_id in vocab]\n", "tokens[0:10000:400]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2: Build the Tf-Idf model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can quite easily build a Tf-Idf model from the bag-of-words representation of the corpus." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2018-07-22 12:10:02,645 : INFO : collecting document frequencies\n", "2018-07-22 12:10:02,647 : INFO : PROGRESS: processing document #0\n", "2018-07-22 12:10:03,038 : INFO : calculating IDF weights for 500 documents and 397455 features (1077838 matrix non-zeros)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.3 s, sys: 33 ms, total: 1.33 s\n", "Wall time: 1.35 s\n" ] } ], "source": [ "%%time\n", "tfidf_model = gensim.models.TfidfModel(corpus_bow)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2018-07-22 12:10:04,002 : INFO : saving TfidfModel object under ./brown_tfidf.mm, separately None\n", "2018-07-22 12:10:06,056 : INFO : saved ./brown_tfidf.mm\n", "2018-07-22 12:10:06,057 : INFO : saving Dictionary object under ./brown_vocab.mm, separately None\n", "2018-07-22 12:10:06,362 : INFO : saved ./brown_vocab.mm\n" ] } ], "source": [ "tfidf_model.save('./brown_tfidf.mm')\n", "vocab.save('./brown_vocab.mm')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we are done with our setup of the Tf-Idf model and dictionary. Both have been saved to disk so that they can be used in some service." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting Keywords with the Tf-Idf Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have the Tf-Idf model, we can use it to extract keywords. 
There are two steps involved in this process: 1) candidate selection, and 2) keyword scoring and selection.\n", "\n", "To accomplish the candidate selection, we'll use a few functions:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def get_pairs(phrase, tag_combos=(('JJ', 'NN'),)):\n", "    # Yield adjacent word pairs whose POS tags match one of the allowed combinations\n", "    tagged = nltk.pos_tag(nltk.word_tokenize(phrase))\n", "    bigrams = nltk.ngrams(tagged, 2)\n", "    for bigram in bigrams:\n", "        tokens, tags = zip(*bigram)\n", "        if tags in tag_combos:\n", "            yield tokens\n", "\n", "\n", "def get_unigrams(phrase, tags=('NN',)):\n", "    # Yield single words (as 1-tuples) whose POS tag is one of the allowed tags\n", "    tagged = nltk.pos_tag(nltk.word_tokenize(phrase))\n", "    return ((unigram,) for unigram, tag in tagged if tag in tags)\n", "\n", "def get_tokens(doc):\n", "    unigram_tags = ('NNP', 'NN')\n", "    bigram_tag_combos = (('JJ', 'NN'), ('JJ', 'NNS'), ('JJR', 'NN'), ('JJR', 'NNS'))\n", "    unigrams = list(get_unigrams(doc, tags=unigram_tags))\n", "    bigrams = list(get_pairs(doc, tag_combos=bigram_tag_combos))\n", "    return unigrams + bigrams\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "sample_text = \"\"\"\n", "Just in the case of contract it is the explicit stipulation, which constitutes the true transference of \n", "property (§ 79), so in the case of the ethical bond of marriage the public celebration of consent, \n", "and the corresponding recognition and acceptance of it by the family and the community, constitute its\n", "consummation and reality. The function of the church is a separate feature, which is not to be considered\n", "here. Thus the union is established and completed ethically, only when preceded by social ceremony, the \n", "symbol of language being the most spiritual embodi- ment of the spiritual (§ 78). The sensual element\n", "pertain- ing to the natural life has place in the ethical relation only as an after result and accident \n", "belonging to the external reality of the ethical union. 
The union can be expressed fully only in mutual \n", "love and assistance.k \n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('case',),\n", " ('contract',),\n", " ('stipulation',),\n", " ('transference',),\n", " ('property',),\n", " ('case',),\n", " ('bond',),\n", " ('marriage',),\n", " ('celebration',),\n", " ('consent',),\n", " ('recognition',),\n", " ('acceptance',),\n", " ('family',),\n", " ('community',),\n", " ('consummation',),\n", " ('reality',),\n", " ('function',),\n", " ('church',),\n", " ('feature',),\n", " ('union',),\n", " ('ceremony',),\n", " ('symbol',),\n", " ('language',),\n", " ('embodi-',),\n", " ('ment',),\n", " ('element',),\n", " ('pertain-',),\n", " ('life',),\n", " ('place',),\n", " ('relation',),\n", " ('result',),\n", " ('accident',),\n", " ('belonging',),\n", " ('reality',),\n", " ('union',),\n", " ('union',),\n", " ('love',),\n", " ('assistance.k',),\n", " ('true', 'transference'),\n", " ('ethical', 'bond'),\n", " ('public', 'celebration'),\n", " ('separate', 'feature'),\n", " ('social', 'ceremony'),\n", " ('spiritual', 'embodi-'),\n", " ('sensual', 'element'),\n", " ('natural', 'life'),\n", " ('ethical', 'relation'),\n", " ('ethical', 'union'),\n", " ('mutual', 'love')]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_tokens(sample_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we can extract candidates, all that's left is to score them using our model. Here's a function that will do that." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def get_keywords(text, model, vocab):\n", "    tokens = [\" \".join(x) for x in get_tokens(text)]\n", "    bow = vocab.doc2bow(tokens)\n", "    scores = model[bow]\n", "    sorted_list = sorted(scores, key=lambda x: x[1], reverse=True)\n", "    for word_id, score in sorted_list:\n", "        yield vocab[word_id], score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function takes two steps. First it transforms the set of candidate tokens into a bag-of-words representation, and then it scores them by passing the bag-of-words representation to the Tf-Idf model." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('union', 0.3997340417677787),\n", " ('stipulation', 0.2943272612593516),\n", " ('consummation', 0.2943272612593516),\n", " ('natural life', 0.2943272612593516),\n", " ('mutual love', 0.2943272612593516),\n", " ('transference', 0.2943272612593516),\n", " ('reality', 0.22001701826569206),\n", " ('consent', 0.18076162087590167),\n", " ('belonging', 0.17284984744748857),\n", " ('celebration', 0.17284984744748857),\n", " ('ceremony', 0.1601447173381622),\n", " ('bond', 0.15487701001931523),\n", " ('accident', 0.14793378876278002),\n", " ('feature', 0.1332446805892596),\n", " ('element', 0.13018810069374329),\n", " ('recognition', 0.13018810069374329),\n", " ('contract', 0.12731688522504053),\n", " ('symbol', 0.12594401951839607),\n", " ('acceptance', 0.12204917790619353),\n", " ('relation', 0.11730917020370957),\n", " ('marriage', 0.11619475289064184),\n", " ('function', 0.10540678050842718),\n", " ('property', 0.10121284285347901),\n", " ('church', 0.09590290588351243),\n", " ('language', 0.09590290588351243),\n", " ('case', 0.08726686061042656),\n", " ('community', 0.08620372974228124),\n", " ('love', 0.07575261252991683),\n", " ('family', 0.05830090771763938),\n", " ('result', 
0.05797762625357845),\n", " ('life', 0.02865958850496821),\n", " ('place', 0.026788731055855025)]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(get_keywords(sample_text, tfidf_model, vocab))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we have a set of keywords, scored by Tf-Idf. The results in this case, though subjective, aren't great. Some keywords like \"union\" and \"mutual love\" seem representative of the text, whereas others like \"belonging\" and \"place\" are not at all. So there is room for improvement, but this is also a challenging text for which to select keywords." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }