{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Wikipedia training\n", "\n", "In this tutorial we will:\n", " - Learn how to train the NMF topic model on the English Wikipedia corpus\n", " - Compare it with the LDA model\n", " - Evaluate the results" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "\n", "import itertools\n", "import json\n", "import logging\n", "import numpy as np\n", "import pandas as pd\n", "import scipy.sparse\n", "import smart_open\n", "import time\n", "from tqdm import tqdm, tqdm_notebook\n", "\n", "import gensim.downloader as api\n", "from gensim import matutils\n", "from gensim.corpora import MmCorpus, Dictionary\n", "from gensim.models import LdaModel, CoherenceModel\n", "from gensim.models.nmf import Nmf\n", "from gensim.parsing.preprocessing import preprocess_string\n", "\n", "tqdm.pandas()\n", "\n", "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\n", " level=logging.INFO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Wikipedia dump\n", "Let's use `gensim.downloader` (imported above as `api`) for that" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Section title: Introduction\n", "Section text: \n", "\n", "\n", "\n", "\n", "'''Anarchism''' is a political philosophy that advocates self-governed societies based on volun\n", "Section title: Etymology and terminology\n", "Section text: \n", "\n", "The word ''anarchism'' is composed from the word ''anarchy'' and the suffix ''-ism'', themselves d\n", "Section title: History\n", "Section text: \n", "\n", "===Origins===\n", "Woodcut from a Diggers document by William Everard\n", "\n", "The earliest anarchist themes ca\n", "Section title: Anarchist schools of thought\n", "Section
text: \n", "Portrait of philosopher Pierre-Joseph Proudhon (1809–1865) by Gustave Courbet. Proudhon was the pri\n", "Section title: Internal issues and debates\n", "Section text: \n", "consistent with anarchist values is a controversial subject among anarchists.\n", "\n", "Anarchism is a philo\n", "Section title: Topics of interest\n", "Section text: Intersecting and overlapping between various schools of thought, certain topics of interest and inte\n", "Section title: Criticisms\n", "Section text: \n", "Criticisms of anarchism include moral criticisms and pragmatic criticisms. Anarchism is often evalu\n", "Section title: See also\n", "Section text: * Anarchism by country\n", "\n", "Section title: References\n", "Section text: \n", "\n", "Section title: Further reading\n", "Section text: * Barclay, Harold, ''People Without Government: An Anthropology of Anarchy'' (2nd ed.), Left Bank Bo\n", "Section title: External links\n", "Section text: \n", "* \n", "* \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ] } ], "source": [ "data = api.load(\"wiki-english-20171001\")\n", "article = next(iter(data))\n", "\n", "for section_title, section_text in zip(\n", "    article['section_titles'],\n", "    article['section_texts']\n", "):\n", "    print(\"Section title: %s\" % section_title)\n", "    print(\"Section text: %s\" % section_text[:100])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Preprocess and save articles" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def save_preprocessed_articles(filename, articles):\n", "    # `smart_open` is imported as a module, so call its `open` function\n", "    with smart_open.open(filename, 'w+', encoding=\"utf8\") as writer:\n", "        for article in tqdm_notebook(articles):\n", "            article_text = \" \".join(\n", "                \" \".join(section)\n", "                for section\n", "                in zip(\n", "                    article['section_titles'],\n", "                    article['section_texts']\n", "                )\n", "            )\n", "            article_text =
preprocess_string(article_text)\n", "\n", "            writer.write(json.dumps(article_text) + '\\n')\n", "\n", "\n", "def get_preprocessed_articles(filename):\n", "    # `smart_open` is imported as a module, so call its `open` function\n", "    with smart_open.open(filename, 'r', encoding=\"utf8\") as reader:\n", "        for line in tqdm_notebook(reader):\n", "            yield json.loads(\n", "                line\n", "            )" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "SAVE_ARTICLES = False\n", "\n", "if SAVE_ARTICLES:\n", "    save_preprocessed_articles('wiki_articles.jsonlines', data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create and save dictionary" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "lines_to_next_cell": 2, "scrolled": true }, "outputs": [], "source": [ "SAVE_DICTIONARY = False\n", "\n", "if SAVE_DICTIONARY:\n", "    dictionary = Dictionary(get_preprocessed_articles('wiki_articles.jsonlines'))\n", "    dictionary.save('wiki.dict')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load and filter dictionary" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-01-15 19:31:03,151 : INFO : loading Dictionary object from wiki.dict\n", "2019-01-15 19:31:04,024 : INFO : loaded wiki.dict\n", "2019-01-15 19:31:06,292 : INFO : discarding 1910258 tokens: [('abdelrahim', 49), ('abstention', 120), ('anarcha', 101), ('anarchica', 40), ('anarchosyndicalist', 20), ('antimilitar', 68), ('arbet', 194), ('archo', 100), ('arkhē', 5), ('autonomedia', 118)]...\n", "2019-01-15 19:31:06,293 : INFO : keeping 100000 tokens which were in no less than 5 and no more than 2462447 (=50.0%) documents\n", "2019-01-15 19:31:06,645 : INFO : resulting dictionary: Dictionary(100000 unique tokens: ['abandon', 'abil', 'abl', 'abolit', 'abstent']...)\n" ] } ], "source": [ "dictionary = Dictionary.load('wiki.dict')\n", "dictionary.filter_extremes()\n", "dictionary.compactify()" ]
}, { "cell_type": "markdown", "metadata": {}, "source": [ "### MmCorpus wrapper\n", "This wrapper lets us:\n", "\n", "- Make sure that documents are shuffled\n", "- Split the corpus into train and test sets without rewriting it" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "class RandomCorpus(MmCorpus):\n", "    \"\"\"MmCorpus that iterates over a seeded random permutation of its documents.\"\"\"\n", "\n", "    def __init__(self, random_seed=42, testset=False, testsize=1000, *args,\n", "                 **kwargs):\n", "        super().__init__(*args, **kwargs)\n", "\n", "        random_state = np.random.RandomState(random_seed)\n", "        self.indices = random_state.permutation(range(self.num_docs))\n", "        # The first `testsize` shuffled documents form the test set, the rest the train set\n", "        if testset:\n", "            self.indices = self.indices[:testsize]\n", "        else:\n", "            self.indices = self.indices[testsize:]\n", "\n", "    def __iter__(self):\n", "        for doc_id in self.indices:\n", "            yield self[doc_id]\n", "\n", "    def __len__(self):\n", "        return len(self.indices)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create and save corpus" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": true }, "outputs": [], "source": [ "SAVE_CORPUS = False\n", "\n", "if SAVE_CORPUS:\n", "    corpus = (\n", "        dictionary.doc2bow(article)\n", "        for article\n", "        in get_preprocessed_articles('wiki_articles.jsonlines')\n", "    )\n", "\n", "    RandomCorpus.serialize('wiki.mm', corpus)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load train and test corpus\n", "Using the `RandomCorpus` wrapper" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-01-15 19:31:07,323 : INFO : loaded corpus index from wiki.mm.index\n", "2019-01-15 19:31:07,324 : INFO : initializing cython corpus reader from wiki.mm\n", "2019-01-15 19:31:07,325 : INFO : accepted corpus with 4924894 documents, 100000 features, 683375728 non-zero entries\n", "2019-01-15 19:31:08,544 : INFO : loaded corpus index from wiki.mm.index\n",
"2019-01-15 19:31:08,544 : INFO : initializing cython corpus reader from wiki.mm\n", "2019-01-15 19:31:08,545 : INFO : accepted corpus with 4924894 documents, 100000 features, 683375728 non-zero entries\n" ] } ], "source": [ "train_corpus = RandomCorpus(\n", " random_seed=42, testset=False, testsize=2000, fname='wiki.mm'\n", ")\n", "test_corpus = RandomCorpus(\n", " random_seed=42, testset=True, testsize=2000, fname='wiki.mm'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Metrics" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def get_execution_time(func):\n", " start = time.time()\n", "\n", " result = func()\n", "\n", " return (time.time() - start), result\n", "\n", "\n", "def get_tm_metrics(model, test_corpus):\n", " W = model.get_topics().T\n", " H = np.zeros((model.num_topics, len(test_corpus)))\n", " for bow_id, bow in enumerate(test_corpus):\n", " for topic_id, word_count in model.get_document_topics(bow):\n", " H[topic_id, bow_id] = word_count\n", "\n", " pred_factors = W.dot(H)\n", " pred_factors /= pred_factors.sum(axis=0)\n", " \n", " dense_corpus = matutils.corpus2dense(test_corpus, pred_factors.shape[0])\n", "\n", " perplexity = get_tm_perplexity(pred_factors, dense_corpus)\n", "\n", " l2_norm = get_tm_l2_norm(pred_factors, dense_corpus)\n", "\n", " model.normalize = True\n", "\n", " coherence = CoherenceModel(\n", " model=model,\n", " corpus=test_corpus,\n", " coherence='u_mass'\n", " ).get_coherence()\n", "\n", " topics = model.show_topics()\n", "\n", " model.normalize = False\n", "\n", " return dict(\n", " perplexity=perplexity,\n", " coherence=coherence,\n", " topics=topics,\n", " l2_norm=l2_norm,\n", " )\n", "\n", "\n", "def get_tm_perplexity(pred_factors, dense_corpus):\n", " return np.exp(-(np.log(pred_factors, where=pred_factors > 0) * dense_corpus).sum() / dense_corpus.sum())\n", "\n", "\n", "def get_tm_l2_norm(pred_factors, dense_corpus):\n", " return 
np.linalg.norm(dense_corpus / dense_corpus.sum(axis=0) - pred_factors)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define dataframe in which we'll store metrics" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "tm_metrics = pd.DataFrame()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define common params for models" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "params = dict(\n", " corpus=train_corpus,\n", " chunksize=2000,\n", " num_topics=50,\n", " id2word=dictionary,\n", " passes=1,\n", " eval_every=10,\n", " minimum_probability=0,\n", " random_state=42,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train NMF and save it\n", "Normalization is turned off to compute metrics correctly" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-01-15 19:33:21,875 : INFO : Loss (no outliers): 2186.768444126956\tLoss (with outliers): 2186.768444126956\n", "2019-01-15 19:34:49,514 : INFO : Loss (no outliers): 2298.434152045061\tLoss (with outliers): 2298.434152045061\n", "==Truncated==\n", "2019-01-15 20:44:23,913 : INFO : Loss (no outliers): 1322.9664709183141\tLoss (with outliers): 1322.9664709183141\n", "2019-01-15 20:44:23,928 : INFO : saving Nmf object under nmf.model, separately None\n", "2019-01-15 20:44:24,625 : INFO : saved nmf.model\n" ] } ], "source": [ "row = dict()\n", "row['model'] = 'nmf'\n", "row['train_time'], nmf = get_execution_time(\n", " lambda: Nmf(\n", " use_r=False,\n", " normalize=False,\n", " **params\n", " )\n", ")\n", "nmf.save('nmf.model')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load NMF and store metrics" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", 
"output_type": "stream", "text": [ "2019-01-15 20:44:24,872 : INFO : loading Nmf object from nmf.model\n", "2019-01-15 20:44:25,150 : INFO : loading id2word recursively from nmf.model.id2word.* with mmap=None\n", "2019-01-15 20:44:25,151 : INFO : loaded nmf.model\n", "2019-01-15 20:44:54,148 : INFO : CorpusAccumulator accumulated stats from 1000 documents\n", "2019-01-15 20:44:54,336 : INFO : CorpusAccumulator accumulated stats from 2000 documents\n" ] }, { "data": { "text/plain": [ "[(0,\n", " '0.075*\"parti\" + 0.071*\"elect\" + 0.042*\"democrat\" + 0.029*\"republican\" + 0.022*\"vote\" + 0.018*\"conserv\" + 0.017*\"liber\" + 0.014*\"candid\" + 0.013*\"seat\" + 0.013*\"labour\"'),\n", " (1,\n", " '0.039*\"book\" + 0.038*\"centuri\" + 0.032*\"histori\" + 0.032*\"languag\" + 0.032*\"publish\" + 0.024*\"english\" + 0.023*\"world\" + 0.022*\"law\" + 0.022*\"govern\" + 0.021*\"nation\"'),\n", " (2,\n", " '0.050*\"war\" + 0.036*\"forc\" + 0.026*\"armi\" + 0.023*\"battl\" + 0.021*\"attack\" + 0.019*\"militari\" + 0.018*\"german\" + 0.016*\"british\" + 0.015*\"command\" + 0.014*\"kill\"'),\n", " (3,\n", " '0.119*\"race\" + 0.106*\"car\" + 0.073*\"engin\" + 0.035*\"model\" + 0.030*\"driver\" + 0.029*\"vehicl\" + 0.029*\"ford\" + 0.028*\"lap\" + 0.023*\"electr\" + 0.020*\"power\"'),\n", " (4,\n", " '0.102*\"leagu\" + 0.092*\"club\" + 0.049*\"footbal\" + 0.047*\"cup\" + 0.029*\"plai\" + 0.028*\"season\" + 0.028*\"divis\" + 0.028*\"goal\" + 0.022*\"team\" + 0.021*\"unit\"'),\n", " (5,\n", " '0.055*\"award\" + 0.041*\"best\" + 0.008*\"nomin\" + 0.008*\"year\" + 0.006*\"actress\" + 0.006*\"actor\" + 0.005*\"perform\" + 0.005*\"artist\" + 0.005*\"won\" + 0.005*\"outstand\"'),\n", " (6,\n", " '0.115*\"citi\" + 0.014*\"airport\" + 0.013*\"area\" + 0.011*\"popul\" + 0.010*\"san\" + 0.009*\"region\" + 0.008*\"center\" + 0.007*\"municip\" + 0.007*\"intern\" + 0.007*\"ukrainian\"'),\n", " (7,\n", " '0.316*\"act\" + 0.046*\"amend\" + 0.020*\"order\" + 0.018*\"ireland\" + 0.016*\"law\" 
+ 0.015*\"regul\" + 0.013*\"court\" + 0.011*\"scotland\" + 0.011*\"road\" + 0.009*\"public\"'),\n", " (8,\n", " '0.102*\"align\" + 0.084*\"left\" + 0.022*\"right\" + 0.012*\"text\" + 0.011*\"style\" + 0.007*\"center\" + 0.004*\"bar\" + 0.003*\"till\" + 0.003*\"bgcolor\" + 0.003*\"color\"'),\n", " (9,\n", " '0.092*\"team\" + 0.027*\"race\" + 0.025*\"ret\" + 0.014*\"championship\" + 0.007*\"nation\" + 0.006*\"time\" + 0.006*\"sport\" + 0.005*\"stage\" + 0.005*\"coach\" + 0.005*\"finish\"'),\n", " (10,\n", " '0.135*\"compani\" + 0.089*\"ship\" + 0.035*\"product\" + 0.028*\"oper\" + 0.024*\"navi\" + 0.022*\"corpor\" + 0.021*\"oil\" + 0.021*\"launch\" + 0.021*\"bank\" + 0.021*\"built\"'),\n", " (11,\n", " '0.053*\"new\" + 0.019*\"york\" + 0.004*\"zealand\" + 0.003*\"jersei\" + 0.003*\"american\" + 0.002*\"time\" + 0.002*\"australia\" + 0.002*\"radio\" + 0.002*\"press\" + 0.002*\"washington\"'),\n", " (12,\n", " '0.036*\"world\" + 0.034*\"championship\" + 0.032*\"final\" + 0.029*\"match\" + 0.026*\"win\" + 0.026*\"round\" + 0.019*\"open\" + 0.018*\"won\" + 0.015*\"defeat\" + 0.015*\"cup\"'),\n", " (13,\n", " '0.019*\"album\" + 0.017*\"record\" + 0.014*\"band\" + 0.008*\"releas\" + 0.005*\"tour\" + 0.005*\"guitar\" + 0.005*\"vocal\" + 0.004*\"rock\" + 0.004*\"track\" + 0.004*\"music\"'),\n", " (14,\n", " '0.100*\"church\" + 0.017*\"cathol\" + 0.014*\"christian\" + 0.012*\"centuri\" + 0.012*\"saint\" + 0.011*\"bishop\" + 0.011*\"built\" + 0.009*\"list\" + 0.009*\"build\" + 0.008*\"roman\"'),\n", " (15,\n", " '0.088*\"presid\" + 0.072*\"minist\" + 0.046*\"prime\" + 0.015*\"govern\" + 0.014*\"gener\" + 0.011*\"met\" + 0.011*\"governor\" + 0.010*\"foreign\" + 0.010*\"visit\" + 0.009*\"council\"'),\n", " (16,\n", " '0.182*\"speci\" + 0.112*\"famili\" + 0.101*\"nov\" + 0.092*\"valid\" + 0.066*\"genu\" + 0.045*\"format\" + 0.040*\"member\" + 0.037*\"gen\" + 0.036*\"bird\" + 0.034*\"type\"'),\n", " (17,\n", " '0.029*\"season\" + 0.013*\"yard\" + 0.013*\"game\" + 0.011*\"plai\" + 
0.008*\"team\" + 0.007*\"score\" + 0.007*\"win\" + 0.007*\"record\" + 0.006*\"run\" + 0.006*\"coach\"'),\n", " (18,\n", " '0.214*\"counti\" + 0.064*\"township\" + 0.017*\"area\" + 0.016*\"statist\" + 0.007*\"ohio\" + 0.006*\"metropolitan\" + 0.006*\"combin\" + 0.005*\"pennsylvania\" + 0.005*\"texa\" + 0.005*\"washington\"'),\n", " (19,\n", " '0.017*\"area\" + 0.016*\"river\" + 0.015*\"water\" + 0.006*\"larg\" + 0.006*\"region\" + 0.006*\"lake\" + 0.006*\"power\" + 0.006*\"high\" + 0.005*\"bar\" + 0.005*\"form\"'),\n", " (20,\n", " '0.031*\"us\" + 0.025*\"gener\" + 0.024*\"model\" + 0.022*\"data\" + 0.021*\"design\" + 0.020*\"time\" + 0.019*\"function\" + 0.019*\"number\" + 0.018*\"process\" + 0.017*\"exampl\"'),\n", " (21,\n", " '0.202*\"order\" + 0.098*\"group\" + 0.098*\"regul\" + 0.076*\"amend\" + 0.041*\"road\" + 0.034*\"traffic\" + 0.033*\"temporari\" + 0.032*\"prohibit\" + 0.027*\"trunk\" + 0.021*\"junction\"'),\n", " (22,\n", " '0.096*\"film\" + 0.010*\"product\" + 0.010*\"director\" + 0.010*\"festiv\" + 0.009*\"star\" + 0.009*\"produc\" + 0.009*\"movi\" + 0.008*\"direct\" + 0.007*\"releas\" + 0.007*\"actor\"'),\n", " (23,\n", " '0.163*\"music\" + 0.046*\"viola\" + 0.045*\"radio\" + 0.042*\"piano\" + 0.029*\"perform\" + 0.028*\"station\" + 0.027*\"orchestra\" + 0.026*\"compos\" + 0.025*\"song\" + 0.015*\"rock\"'),\n", " (24,\n", " '0.052*\"mount\" + 0.051*\"lemmon\" + 0.051*\"peak\" + 0.051*\"kitt\" + 0.051*\"spacewatch\" + 0.026*\"survei\" + 0.015*\"octob\" + 0.012*\"septemb\" + 0.009*\"css\" + 0.009*\"catalina\"'),\n", " (25,\n", " '0.075*\"air\" + 0.035*\"forc\" + 0.030*\"squadron\" + 0.029*\"aircraft\" + 0.028*\"oper\" + 0.023*\"unit\" + 0.018*\"flight\" + 0.017*\"airport\" + 0.017*\"wing\" + 0.017*\"base\"'),\n", " (26,\n", " '0.105*\"hous\" + 0.038*\"term\" + 0.020*\"march\" + 0.019*\"build\" + 0.019*\"member\" + 0.017*\"serv\" + 0.014*\"congress\" + 0.014*\"hall\" + 0.012*\"januari\" + 0.010*\"window\"'),\n", " (27,\n", " '0.129*\"district\" + 
0.019*\"pennsylvania\" + 0.016*\"grade\" + 0.012*\"fund\" + 0.012*\"educ\" + 0.012*\"basic\" + 0.011*\"level\" + 0.010*\"oblast\" + 0.010*\"rural\" + 0.009*\"tax\"'),\n", " (28,\n", " '0.042*\"year\" + 0.012*\"dai\" + 0.007*\"time\" + 0.005*\"ag\" + 0.004*\"month\" + 0.003*\"includ\" + 0.003*\"follow\" + 0.003*\"later\" + 0.003*\"old\" + 0.003*\"student\"'),\n", " (29,\n", " '0.113*\"station\" + 0.109*\"line\" + 0.076*\"road\" + 0.072*\"railwai\" + 0.048*\"rout\" + 0.035*\"oper\" + 0.034*\"train\" + 0.023*\"street\" + 0.020*\"cross\" + 0.020*\"railroad\"'),\n", " (30,\n", " '0.036*\"park\" + 0.029*\"town\" + 0.025*\"north\" + 0.020*\"south\" + 0.018*\"west\" + 0.017*\"east\" + 0.017*\"street\" + 0.015*\"nation\" + 0.014*\"build\" + 0.013*\"river\"'),\n", " (31,\n", " '0.066*\"women\" + 0.044*\"men\" + 0.030*\"nation\" + 0.024*\"right\" + 0.014*\"athlet\" + 0.013*\"intern\" + 0.013*\"rank\" + 0.013*\"countri\" + 0.012*\"advanc\" + 0.011*\"event\"'),\n", " (32,\n", " '0.127*\"linear\" + 0.126*\"socorro\" + 0.029*\"septemb\" + 0.026*\"neat\" + 0.023*\"palomar\" + 0.021*\"octob\" + 0.016*\"kitt\" + 0.016*\"peak\" + 0.015*\"spacewatch\" + 0.015*\"anderson\"'),\n", " (33,\n", " '0.152*\"univers\" + 0.055*\"colleg\" + 0.019*\"institut\" + 0.018*\"student\" + 0.018*\"scienc\" + 0.015*\"professor\" + 0.012*\"research\" + 0.011*\"campu\" + 0.011*\"educ\" + 0.011*\"technolog\"'),\n", " (34,\n", " '0.072*\"state\" + 0.032*\"unit\" + 0.005*\"court\" + 0.005*\"law\" + 0.004*\"feder\" + 0.004*\"american\" + 0.003*\"nation\" + 0.003*\"govern\" + 0.003*\"kingdom\" + 0.003*\"senat\"'),\n", " (35,\n", " '0.074*\"game\" + 0.017*\"player\" + 0.007*\"plai\" + 0.006*\"releas\" + 0.005*\"develop\" + 0.005*\"video\" + 0.005*\"charact\" + 0.004*\"playstat\" + 0.004*\"version\" + 0.004*\"world\"'),\n", " (36,\n", " '0.141*\"south\" + 0.098*\"american\" + 0.081*\"india\" + 0.059*\"commun\" + 0.053*\"west\" + 0.053*\"director\" + 0.053*\"africa\" + 0.049*\"usa\" + 0.049*\"indian\" + 
0.041*\"servic\"'),\n", " (37,\n", " '0.111*\"servic\" + 0.025*\"commun\" + 0.021*\"offic\" + 0.012*\"polic\" + 0.011*\"educ\" + 0.011*\"public\" + 0.010*\"chief\" + 0.009*\"late\" + 0.009*\"manag\" + 0.008*\"mr\"'),\n", " (38,\n", " '0.112*\"royal\" + 0.085*\"john\" + 0.083*\"william\" + 0.054*\"lieuten\" + 0.044*\"georg\" + 0.041*\"offic\" + 0.041*\"jame\" + 0.038*\"sergeant\" + 0.037*\"major\" + 0.035*\"charl\"'),\n", " (39,\n", " '0.051*\"song\" + 0.043*\"releas\" + 0.042*\"singl\" + 0.027*\"chart\" + 0.025*\"album\" + 0.017*\"number\" + 0.014*\"video\" + 0.013*\"version\" + 0.012*\"love\" + 0.011*\"featur\"'),\n", " (40,\n", " '0.031*\"time\" + 0.028*\"later\" + 0.026*\"appear\" + 0.025*\"man\" + 0.024*\"kill\" + 0.020*\"charact\" + 0.019*\"work\" + 0.018*\"father\" + 0.018*\"death\" + 0.018*\"famili\"'),\n", " (41,\n", " '0.126*\"seri\" + 0.064*\"episod\" + 0.026*\"season\" + 0.021*\"televis\" + 0.015*\"comic\" + 0.013*\"charact\" + 0.012*\"dvd\" + 0.012*\"anim\" + 0.012*\"star\" + 0.011*\"appear\"'),\n", " (42,\n", " '0.143*\"born\" + 0.073*\"american\" + 0.027*\"footbal\" + 0.024*\"player\" + 0.024*\"william\" + 0.023*\"singer\" + 0.019*\"actor\" + 0.017*\"politician\" + 0.015*\"actress\" + 0.013*\"english\"'),\n", " (43,\n", " '0.044*\"march\" + 0.042*\"septemb\" + 0.036*\"octob\" + 0.033*\"januari\" + 0.032*\"april\" + 0.031*\"august\" + 0.031*\"juli\" + 0.029*\"novemb\" + 0.029*\"june\" + 0.028*\"decemb\"'),\n", " (44,\n", " '0.149*\"island\" + 0.013*\"south\" + 0.013*\"australia\" + 0.009*\"sea\" + 0.008*\"north\" + 0.008*\"bai\" + 0.008*\"western\" + 0.008*\"airport\" + 0.007*\"coast\" + 0.006*\"pacif\"'),\n", " (45,\n", " '0.028*\"studi\" + 0.026*\"research\" + 0.023*\"health\" + 0.019*\"human\" + 0.019*\"term\" + 0.019*\"develop\" + 0.018*\"includ\" + 0.018*\"peopl\" + 0.017*\"report\" + 0.017*\"cell\"'),\n", " (46,\n", " '0.112*\"school\" + 0.028*\"high\" + 0.016*\"student\" + 0.012*\"educ\" + 0.009*\"grade\" + 0.008*\"primari\" + 0.007*\"public\" + 
0.006*\"colleg\" + 0.006*\"elementari\" + 0.006*\"pennsylvania\"'),\n", " (47,\n", " '0.137*\"royal\" + 0.121*\"capt\" + 0.103*\"armi\" + 0.090*\"maj\" + 0.089*\"corp\" + 0.075*\"col\" + 0.074*\"temp\" + 0.048*\"servic\" + 0.040*\"engin\" + 0.033*\"reg\"'),\n", " (48,\n", " '0.183*\"art\" + 0.117*\"museum\" + 0.071*\"paint\" + 0.062*\"work\" + 0.046*\"artist\" + 0.043*\"galleri\" + 0.040*\"exhibit\" + 0.034*\"collect\" + 0.027*\"histori\" + 0.022*\"jpg\"'),\n", " (49,\n", " '0.068*\"regiment\" + 0.062*\"divis\" + 0.049*\"battalion\" + 0.045*\"infantri\" + 0.036*\"brigad\" + 0.024*\"armi\" + 0.023*\"artilleri\" + 0.019*\"compani\" + 0.018*\"gener\" + 0.018*\"colonel\"')]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nmf = Nmf.load('nmf.model')\n", "row.update(get_tm_metrics(nmf, test_corpus))\n", "# `DataFrame.append` was removed in pandas 2.0; `pd.concat` adds the row instead\n", "tm_metrics = pd.concat([tm_metrics, pd.DataFrame([row])], ignore_index=True)\n", "\n", "nmf.show_topics(50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train NMF with residuals and save it\n", "Residuals add regularization to the model, which improves quality but slows down training" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-01-15 20:54:05,363 : INFO : Loss (no outliers): 2179.9524465227146\tLoss (with outliers): 2102.354108449905\n", "2019-01-15 20:57:12,821 : INFO : Loss (no outliers): 2268.3200929871823\tLoss (with outliers): 2110.928651253909\n", "==Truncated==\n", "2019-01-16 04:05:46,589 : INFO : Loss (no outliers): 1321.521323758918\tLoss (with outliers): 1282.9364495345592\n", "2019-01-16 04:05:46,599 : INFO : saving Nmf object under nmf_with_r.model, separately None\n", "2019-01-16 04:05:46,601 : INFO : storing scipy.sparse array '_r' under nmf_with_r.model._r.npy\n", "2019-01-16 04:05:47,781 : INFO : saved nmf_with_r.model\n" ] } ], "source": [ "row = dict()\n", "row['model'] =
'nmf_with_r'\n", "row['train_time'], nmf_with_r = get_execution_time(\n", " lambda: Nmf(\n", " use_r=True,\n", " lambda_=200,\n", " normalize=False,\n", " **params\n", " )\n", ")\n", "nmf_with_r.save('nmf_with_r.model')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load NMF with residuals and store metrics" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-01-16 04:05:48,017 : INFO : loading Nmf object from nmf_with_r.model\n", "2019-01-16 04:05:48,272 : INFO : loading id2word recursively from nmf_with_r.model.id2word.* with mmap=None\n", "2019-01-16 04:05:48,273 : INFO : loading _r from nmf_with_r.model._r.npy with mmap=None\n", "2019-01-16 04:05:48,304 : INFO : loaded nmf_with_r.model\n", "2019-01-16 04:06:27,119 : INFO : CorpusAccumulator accumulated stats from 1000 documents\n", "2019-01-16 04:06:27,253 : INFO : CorpusAccumulator accumulated stats from 2000 documents\n" ] }, { "data": { "text/plain": [ "[(0,\n", " '0.062*\"parti\" + 0.061*\"elect\" + 0.031*\"democrat\" + 0.020*\"republican\" + 0.020*\"vote\" + 0.013*\"liber\" + 0.012*\"candid\" + 0.012*\"conserv\" + 0.011*\"seat\" + 0.010*\"member\"'),\n", " (1,\n", " '0.052*\"book\" + 0.040*\"centuri\" + 0.039*\"publish\" + 0.031*\"languag\" + 0.027*\"histori\" + 0.025*\"work\" + 0.023*\"english\" + 0.022*\"king\" + 0.019*\"polit\" + 0.019*\"author\"'),\n", " (2,\n", " '0.031*\"armi\" + 0.028*\"divis\" + 0.025*\"regiment\" + 0.022*\"forc\" + 0.020*\"battalion\" + 0.019*\"infantri\" + 0.019*\"command\" + 0.017*\"brigad\" + 0.016*\"gener\" + 0.012*\"corp\"'),\n", " (3,\n", " '0.110*\"race\" + 0.059*\"car\" + 0.033*\"engin\" + 0.025*\"lap\" + 0.023*\"driver\" + 0.021*\"ret\" + 0.020*\"ford\" + 0.015*\"finish\" + 0.015*\"motorsport\" + 0.015*\"chevrolet\"'),\n", " (4,\n", " '0.130*\"club\" + 0.068*\"cup\" + 0.046*\"footbal\" + 0.044*\"goal\" + 0.032*\"leagu\" + 0.031*\"unit\" + 0.031*\"plai\" + 
0.030*\"match\" + 0.026*\"score\" + 0.021*\"player\"'),\n", " (5,\n", " '0.041*\"award\" + 0.030*\"best\" + 0.006*\"nomin\" + 0.005*\"actress\" + 0.005*\"year\" + 0.004*\"actor\" + 0.004*\"won\" + 0.004*\"perform\" + 0.003*\"outstand\" + 0.003*\"artist\"'),\n", " (6,\n", " '0.087*\"citi\" + 0.013*\"town\" + 0.009*\"popul\" + 0.008*\"area\" + 0.007*\"san\" + 0.006*\"center\" + 0.006*\"airport\" + 0.006*\"unit\" + 0.006*\"locat\" + 0.005*\"municip\"'),\n", " (7,\n", " '0.171*\"act\" + 0.021*\"amend\" + 0.018*\"order\" + 0.010*\"ireland\" + 0.009*\"law\" + 0.007*\"court\" + 0.007*\"regul\" + 0.006*\"road\" + 0.006*\"scotland\" + 0.006*\"nation\"'),\n", " (8,\n", " '0.064*\"leagu\" + 0.014*\"divis\" + 0.012*\"left\" + 0.011*\"align\" + 0.009*\"basebal\" + 0.008*\"footbal\" + 0.007*\"run\" + 0.007*\"major\" + 0.005*\"home\" + 0.005*\"hit\"'),\n", " (9,\n", " '0.086*\"team\" + 0.013*\"championship\" + 0.007*\"nation\" + 0.007*\"race\" + 0.007*\"coach\" + 0.005*\"time\" + 0.004*\"sport\" + 0.004*\"ret\" + 0.004*\"player\" + 0.004*\"match\"'),\n", " (10,\n", " '0.100*\"episod\" + 0.055*\"compani\" + 0.021*\"product\" + 0.011*\"produc\" + 0.011*\"televis\" + 0.009*\"role\" + 0.009*\"busi\" + 0.008*\"market\" + 0.008*\"corpor\" + 0.007*\"bank\"'),\n", " (11,\n", " '0.050*\"new\" + 0.017*\"york\" + 0.003*\"zealand\" + 0.003*\"jersei\" + 0.002*\"time\" + 0.002*\"radio\" + 0.002*\"broadcast\" + 0.002*\"station\" + 0.002*\"washington\" + 0.002*\"australia\"'),\n", " (12,\n", " '0.035*\"final\" + 0.033*\"world\" + 0.030*\"round\" + 0.030*\"championship\" + 0.025*\"win\" + 0.025*\"match\" + 0.021*\"open\" + 0.017*\"won\" + 0.016*\"tournament\" + 0.015*\"event\"'),\n", " (13,\n", " '0.020*\"record\" + 0.019*\"band\" + 0.015*\"album\" + 0.007*\"releas\" + 0.007*\"guitar\" + 0.006*\"tour\" + 0.005*\"rock\" + 0.005*\"vocal\" + 0.004*\"plai\" + 0.004*\"live\"'),\n", " (14,\n", " '0.096*\"church\" + 0.015*\"cathol\" + 0.012*\"christian\" + 0.010*\"saint\" + 0.010*\"bishop\" + 
0.009*\"centuri\" + 0.008*\"build\" + 0.007*\"parish\" + 0.007*\"built\" + 0.007*\"roman\"'),\n", " (15,\n", " '0.084*\"presid\" + 0.055*\"minist\" + 0.037*\"prime\" + 0.014*\"govern\" + 0.012*\"gener\" + 0.010*\"governor\" + 0.010*\"nation\" + 0.008*\"council\" + 0.008*\"secretari\" + 0.008*\"visit\"'),\n", " (16,\n", " '0.089*\"yard\" + 0.035*\"pass\" + 0.035*\"touchdown\" + 0.028*\"field\" + 0.025*\"run\" + 0.023*\"win\" + 0.022*\"score\" + 0.021*\"quarter\" + 0.017*\"record\" + 0.016*\"second\"'),\n", " (17,\n", " '0.042*\"season\" + 0.006*\"plai\" + 0.004*\"coach\" + 0.004*\"final\" + 0.004*\"second\" + 0.004*\"win\" + 0.004*\"record\" + 0.003*\"career\" + 0.003*\"finish\" + 0.003*\"point\"'),\n", " (18,\n", " '0.174*\"counti\" + 0.034*\"township\" + 0.014*\"area\" + 0.013*\"statist\" + 0.004*\"texa\" + 0.004*\"ohio\" + 0.004*\"virginia\" + 0.004*\"washington\" + 0.003*\"metropolitan\" + 0.003*\"pennsylvania\"'),\n", " (19,\n", " '0.012*\"water\" + 0.010*\"area\" + 0.010*\"speci\" + 0.007*\"larg\" + 0.006*\"order\" + 0.006*\"region\" + 0.006*\"includ\" + 0.005*\"black\" + 0.005*\"famili\" + 0.005*\"popul\"'),\n", " (20,\n", " '0.020*\"us\" + 0.015*\"gener\" + 0.014*\"design\" + 0.014*\"model\" + 0.012*\"develop\" + 0.012*\"time\" + 0.012*\"data\" + 0.011*\"number\" + 0.011*\"function\" + 0.011*\"process\"'),\n", " (21,\n", " '0.165*\"group\" + 0.023*\"left\" + 0.022*\"align\" + 0.021*\"member\" + 0.017*\"text\" + 0.015*\"bar\" + 0.011*\"order\" + 0.011*\"point\" + 0.010*\"till\" + 0.009*\"stage\"'),\n", " (22,\n", " '0.095*\"film\" + 0.009*\"director\" + 0.008*\"star\" + 0.008*\"movi\" + 0.008*\"product\" + 0.008*\"festiv\" + 0.008*\"releas\" + 0.008*\"produc\" + 0.007*\"direct\" + 0.006*\"featur\"'),\n", " (23,\n", " '0.107*\"music\" + 0.024*\"perform\" + 0.019*\"piano\" + 0.018*\"song\" + 0.017*\"compos\" + 0.017*\"orchestra\" + 0.017*\"viola\" + 0.012*\"plai\" + 0.011*\"radio\" + 0.011*\"danc\"'),\n", " (24,\n", " '0.023*\"septemb\" + 0.023*\"march\" + 
0.020*\"octob\" + 0.020*\"juli\" + 0.019*\"june\" + 0.019*\"april\" + 0.019*\"august\" + 0.018*\"januari\" + 0.018*\"novemb\" + 0.017*\"decemb\"'),\n", " (25,\n", " '0.078*\"air\" + 0.041*\"forc\" + 0.031*\"aircraft\" + 0.027*\"squadron\" + 0.026*\"oper\" + 0.021*\"unit\" + 0.016*\"base\" + 0.016*\"wing\" + 0.016*\"flight\" + 0.015*\"fighter\"'),\n", " (26,\n", " '0.101*\"hous\" + 0.023*\"build\" + 0.021*\"term\" + 0.015*\"member\" + 0.014*\"serv\" + 0.014*\"march\" + 0.014*\"left\" + 0.012*\"congress\" + 0.011*\"hall\" + 0.010*\"street\"'),\n", " (27,\n", " '0.123*\"district\" + 0.024*\"pennsylvania\" + 0.019*\"grade\" + 0.016*\"educ\" + 0.015*\"fund\" + 0.014*\"basic\" + 0.013*\"level\" + 0.011*\"student\" + 0.011*\"receiv\" + 0.010*\"tax\"'),\n", " (28,\n", " '0.048*\"year\" + 0.007*\"dai\" + 0.005*\"time\" + 0.005*\"ag\" + 0.003*\"month\" + 0.003*\"old\" + 0.003*\"student\" + 0.003*\"includ\" + 0.003*\"later\" + 0.002*\"million\"'),\n", " (29,\n", " '0.090*\"line\" + 0.083*\"station\" + 0.054*\"road\" + 0.053*\"railwai\" + 0.036*\"rout\" + 0.030*\"train\" + 0.027*\"oper\" + 0.020*\"street\" + 0.016*\"servic\" + 0.016*\"open\"'),\n", " (30,\n", " '0.031*\"park\" + 0.030*\"south\" + 0.030*\"north\" + 0.023*\"west\" + 0.020*\"river\" + 0.020*\"east\" + 0.015*\"area\" + 0.014*\"town\" + 0.013*\"lake\" + 0.013*\"nation\"'),\n", " (31,\n", " '0.071*\"women\" + 0.041*\"men\" + 0.027*\"nation\" + 0.023*\"right\" + 0.012*\"countri\" + 0.012*\"intern\" + 0.012*\"athlet\" + 0.011*\"advanc\" + 0.011*\"rank\" + 0.010*\"law\"'),\n", " (32,\n", " '0.104*\"linear\" + 0.104*\"socorro\" + 0.025*\"septemb\" + 0.020*\"neat\" + 0.018*\"palomar\" + 0.018*\"octob\" + 0.013*\"decemb\" + 0.013*\"august\" + 0.012*\"anderson\" + 0.012*\"mesa\"'),\n", " (33,\n", " '0.089*\"univers\" + 0.011*\"scienc\" + 0.009*\"institut\" + 0.008*\"research\" + 0.008*\"professor\" + 0.006*\"student\" + 0.005*\"technolog\" + 0.005*\"faculti\" + 0.005*\"studi\" + 0.005*\"engin\"'),\n", " (34,\n", " 
'0.064*\"state\" + 0.024*\"unit\" + 0.005*\"court\" + 0.005*\"law\" + 0.004*\"feder\" + 0.003*\"nation\" + 0.003*\"govern\" + 0.002*\"senat\" + 0.002*\"california\" + 0.002*\"constitut\"'),\n", " (35,\n", " '0.085*\"colleg\" + 0.019*\"univers\" + 0.014*\"student\" + 0.008*\"campu\" + 0.007*\"institut\" + 0.006*\"educ\" + 0.005*\"hall\" + 0.005*\"program\" + 0.005*\"commun\" + 0.005*\"state\"'),\n", " (36,\n", " '0.118*\"class\" + 0.079*\"director\" + 0.053*\"rifl\" + 0.050*\"south\" + 0.048*\"×mm\" + 0.046*\"action\" + 0.045*\"san\" + 0.044*\"actor\" + 0.041*\"angel\" + 0.037*\"lo\"'),\n", " (37,\n", " '0.092*\"servic\" + 0.025*\"offic\" + 0.023*\"commun\" + 0.013*\"john\" + 0.012*\"chief\" + 0.011*\"polic\" + 0.011*\"public\" + 0.011*\"british\" + 0.010*\"late\" + 0.010*\"director\"'),\n", " (38,\n", " '0.156*\"royal\" + 0.072*\"william\" + 0.068*\"john\" + 0.058*\"corp\" + 0.051*\"lieuten\" + 0.046*\"capt\" + 0.041*\"engin\" + 0.041*\"armi\" + 0.039*\"georg\" + 0.039*\"temp\"'),\n", " (39,\n", " '0.042*\"song\" + 0.039*\"album\" + 0.034*\"releas\" + 0.029*\"singl\" + 0.024*\"chart\" + 0.013*\"number\" + 0.011*\"video\" + 0.010*\"love\" + 0.010*\"featur\" + 0.010*\"track\"'),\n", " (40,\n", " '0.028*\"time\" + 0.025*\"later\" + 0.023*\"kill\" + 0.019*\"appear\" + 0.018*\"man\" + 0.016*\"death\" + 0.016*\"father\" + 0.015*\"return\" + 0.015*\"son\" + 0.014*\"charact\"'),\n", " (41,\n", " '0.110*\"seri\" + 0.016*\"charact\" + 0.016*\"episod\" + 0.015*\"comic\" + 0.013*\"televis\" + 0.012*\"anim\" + 0.011*\"appear\" + 0.009*\"stori\" + 0.009*\"origin\" + 0.009*\"featur\"'),\n", " (42,\n", " '0.091*\"born\" + 0.070*\"american\" + 0.022*\"player\" + 0.021*\"footbal\" + 0.020*\"william\" + 0.016*\"actor\" + 0.014*\"politician\" + 0.014*\"singer\" + 0.013*\"john\" + 0.012*\"actress\"'),\n", " (43,\n", " '0.072*\"game\" + 0.017*\"player\" + 0.011*\"plai\" + 0.004*\"releas\" + 0.004*\"point\" + 0.004*\"develop\" + 0.004*\"score\" + 0.003*\"video\" + 0.003*\"time\" + 
0.003*\"card\"'),\n", " (44,\n", " '0.110*\"island\" + 0.007*\"australia\" + 0.007*\"ship\" + 0.007*\"south\" + 0.007*\"sea\" + 0.006*\"bai\" + 0.005*\"coast\" + 0.004*\"pacif\" + 0.004*\"western\" + 0.004*\"british\"'),\n", " (45,\n", " '0.029*\"health\" + 0.028*\"studi\" + 0.027*\"research\" + 0.022*\"peopl\" + 0.020*\"human\" + 0.019*\"medic\" + 0.019*\"cell\" + 0.018*\"report\" + 0.018*\"ag\" + 0.017*\"includ\"'),\n", " (46,\n", " '0.113*\"school\" + 0.025*\"high\" + 0.014*\"student\" + 0.011*\"educ\" + 0.007*\"grade\" + 0.006*\"public\" + 0.005*\"elementari\" + 0.005*\"primari\" + 0.004*\"pennsylvania\" + 0.004*\"teacher\"'),\n", " (47,\n", " '0.050*\"war\" + 0.021*\"german\" + 0.017*\"american\" + 0.016*\"british\" + 0.016*\"world\" + 0.012*\"french\" + 0.010*\"battl\" + 0.010*\"germani\" + 0.009*\"ship\" + 0.009*\"soviet\"'),\n", " (48,\n", " '0.174*\"art\" + 0.099*\"museum\" + 0.058*\"paint\" + 0.057*\"work\" + 0.044*\"artist\" + 0.041*\"galleri\" + 0.038*\"exhibit\" + 0.031*\"collect\" + 0.023*\"histori\" + 0.021*\"design\"'),\n", " (49,\n", " '0.067*\"peak\" + 0.066*\"kitt\" + 0.066*\"mount\" + 0.066*\"spacewatch\" + 0.065*\"lemmon\" + 0.033*\"survei\" + 0.026*\"octob\" + 0.024*\"septemb\" + 0.015*\"novemb\" + 0.012*\"march\"')]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nmf_with_r = Nmf.load('nmf_with_r.model')\n", "row.update(get_tm_metrics(nmf_with_r, test_corpus))\n", "# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead\n", "tm_metrics = pd.concat([tm_metrics, pd.DataFrame([row])], ignore_index=True)\n", "\n", "nmf_with_r.show_topics(50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train LDA and save it\n", "LDA is a widely used topic model; we train it as a baseline to compare against NMF." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-01-16 04:06:27,576 : INFO : using symmetric alpha at 0.02\n", "2019-01-16 04:06:27,576 : INFO : using symmetric eta at 0.02\n", "2019-01-16 04:06:27,589 : INFO 
: using serial LDA version on this node\n", "2019-01-16 04:06:28,185 : INFO : running online (single-pass) LDA training, 50 topics, 1 passes over the supplied corpus of 4922894 documents, updating model once every 2000 documents, evaluating perplexity every 20000 documents, iterating 50x with a convergence threshold of 0.001000\n", "2019-01-16 04:06:28,910 : INFO : PROGRESS: pass 0, at document #2000/4922894\n", "==Truncated==\n", "2019-01-16 06:24:26,456 : INFO : topic diff=0.003897, rho=0.020154\n", "2019-01-16 06:24:26,465 : INFO : saving LdaState object under lda.model.state, separately None\n", "2019-01-16 06:24:26,680 : INFO : saved lda.model.state\n", "2019-01-16 06:24:26,732 : INFO : saving LdaModel object under lda.model, separately ['expElogbeta', 'sstats']\n", "2019-01-16 06:24:26,732 : INFO : storing np array 'expElogbeta' to lda.model.expElogbeta.npy\n", "2019-01-16 06:24:26,812 : INFO : not storing attribute dispatcher\n", "2019-01-16 06:24:26,814 : INFO : not storing attribute id2word\n", "2019-01-16 06:24:26,815 : INFO : not storing attribute state\n", "2019-01-16 06:24:26,828 : INFO : saved lda.model\n" ] } ], "source": [ "row = dict()\n", "row['model'] = 'lda'\n", "row['train_time'], lda = get_execution_time(\n", " lambda: LdaModel(**params)\n", ")\n", "lda.save('lda.model')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load LDA and store metrics" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-01-16 06:24:27,064 : INFO : loading LdaModel object from lda.model\n", "2019-01-16 06:24:27,070 : INFO : loading expElogbeta from lda.model.expElogbeta.npy with mmap=None\n", "2019-01-16 06:24:27,077 : INFO : setting ignored attribute dispatcher to None\n", "2019-01-16 06:24:27,078 : INFO : setting ignored attribute id2word to None\n", "2019-01-16 06:24:27,078 : INFO : setting ignored attribute state to None\n", "2019-01-16 06:24:27,079 : INFO : 
loaded lda.model\n", "2019-01-16 06:24:27,079 : INFO : loading LdaState object from lda.model.state\n", "2019-01-16 06:24:27,173 : INFO : loaded lda.model.state\n", "2019-01-16 06:24:41,257 : INFO : CorpusAccumulator accumulated stats from 1000 documents\n", "2019-01-16 06:24:41,452 : INFO : CorpusAccumulator accumulated stats from 2000 documents\n" ] }, { "data": { "text/plain": [ "[(0,\n", " '0.033*\"war\" + 0.028*\"armi\" + 0.021*\"forc\" + 0.020*\"command\" + 0.015*\"militari\" + 0.015*\"battl\" + 0.013*\"gener\" + 0.012*\"offic\" + 0.011*\"divis\" + 0.011*\"regiment\"'),\n", " (1,\n", " '0.038*\"album\" + 0.028*\"song\" + 0.026*\"releas\" + 0.026*\"record\" + 0.021*\"band\" + 0.016*\"singl\" + 0.015*\"music\" + 0.014*\"chart\" + 0.013*\"track\" + 0.010*\"guitar\"'),\n", " (2,\n", " '0.062*\"german\" + 0.039*\"germani\" + 0.025*\"van\" + 0.023*\"von\" + 0.020*\"der\" + 0.019*\"dutch\" + 0.019*\"berlin\" + 0.015*\"swedish\" + 0.014*\"netherland\" + 0.014*\"sweden\"'),\n", " (3,\n", " '0.032*\"john\" + 0.027*\"william\" + 0.019*\"british\" + 0.015*\"georg\" + 0.015*\"london\" + 0.014*\"thoma\" + 0.014*\"sir\" + 0.014*\"jame\" + 0.013*\"royal\" + 0.013*\"henri\"'),\n", " (4,\n", " '0.137*\"school\" + 0.040*\"colleg\" + 0.039*\"student\" + 0.033*\"univers\" + 0.030*\"high\" + 0.028*\"educ\" + 0.016*\"year\" + 0.011*\"graduat\" + 0.010*\"state\" + 0.009*\"campu\"'),\n", " (5,\n", " '0.030*\"game\" + 0.009*\"develop\" + 0.009*\"player\" + 0.008*\"releas\" + 0.008*\"us\" + 0.008*\"softwar\" + 0.008*\"version\" + 0.008*\"user\" + 0.007*\"data\" + 0.007*\"includ\"'),\n", " (6,\n", " '0.061*\"music\" + 0.030*\"perform\" + 0.019*\"theatr\" + 0.018*\"compos\" + 0.016*\"plai\" + 0.016*\"festiv\" + 0.015*\"danc\" + 0.014*\"orchestra\" + 0.012*\"opera\" + 0.011*\"piano\"'),\n", " (7,\n", " '0.013*\"number\" + 0.011*\"function\" + 0.010*\"model\" + 0.009*\"valu\" + 0.008*\"set\" + 0.008*\"exampl\" + 0.007*\"gener\" + 0.007*\"theori\" + 0.007*\"point\" + 0.006*\"method\"'),\n", 
" (8,\n", " '0.048*\"india\" + 0.037*\"indian\" + 0.020*\"http\" + 0.016*\"www\" + 0.015*\"pakistan\" + 0.015*\"iran\" + 0.013*\"sri\" + 0.012*\"khan\" + 0.012*\"islam\" + 0.012*\"tamil\"'),\n", " (9,\n", " '0.067*\"film\" + 0.025*\"award\" + 0.022*\"seri\" + 0.021*\"episod\" + 0.021*\"best\" + 0.015*\"star\" + 0.012*\"role\" + 0.012*\"actor\" + 0.011*\"televis\" + 0.011*\"produc\"'),\n", " (10,\n", " '0.020*\"engin\" + 0.013*\"power\" + 0.011*\"product\" + 0.011*\"design\" + 0.010*\"model\" + 0.009*\"produc\" + 0.008*\"us\" + 0.008*\"electr\" + 0.008*\"type\" + 0.007*\"vehicl\"'),\n", " (11,\n", " '0.024*\"law\" + 0.021*\"court\" + 0.016*\"state\" + 0.016*\"act\" + 0.011*\"polic\" + 0.010*\"case\" + 0.009*\"offic\" + 0.009*\"report\" + 0.009*\"right\" + 0.007*\"legal\"'),\n", " (12,\n", " '0.056*\"elect\" + 0.041*\"parti\" + 0.023*\"member\" + 0.020*\"vote\" + 0.020*\"presid\" + 0.017*\"democrat\" + 0.017*\"minist\" + 0.013*\"council\" + 0.013*\"repres\" + 0.012*\"polit\"'),\n", " (13,\n", " '0.057*\"state\" + 0.035*\"new\" + 0.029*\"american\" + 0.024*\"unit\" + 0.024*\"york\" + 0.020*\"counti\" + 0.015*\"citi\" + 0.014*\"california\" + 0.012*\"washington\" + 0.010*\"texa\"'),\n", " (14,\n", " '0.027*\"univers\" + 0.015*\"research\" + 0.014*\"institut\" + 0.012*\"nation\" + 0.012*\"scienc\" + 0.012*\"work\" + 0.012*\"intern\" + 0.011*\"award\" + 0.011*\"develop\" + 0.010*\"organ\"'),\n", " (15,\n", " '0.034*\"england\" + 0.024*\"unit\" + 0.021*\"london\" + 0.019*\"cricket\" + 0.019*\"town\" + 0.016*\"citi\" + 0.015*\"scotland\" + 0.013*\"manchest\" + 0.013*\"west\" + 0.012*\"scottish\"'),\n", " (16,\n", " '0.031*\"church\" + 0.017*\"famili\" + 0.017*\"di\" + 0.016*\"son\" + 0.015*\"marri\" + 0.014*\"year\" + 0.013*\"father\" + 0.013*\"life\" + 0.013*\"born\" + 0.012*\"daughter\"'),\n", " (17,\n", " '0.060*\"race\" + 0.020*\"car\" + 0.017*\"team\" + 0.012*\"finish\" + 0.012*\"tour\" + 0.012*\"driver\" + 0.011*\"ford\" + 0.011*\"time\" + 0.011*\"championship\" + 
0.011*\"year\"'),\n", " (18,\n", " '0.010*\"water\" + 0.007*\"light\" + 0.007*\"energi\" + 0.007*\"high\" + 0.006*\"surfac\" + 0.006*\"earth\" + 0.006*\"time\" + 0.005*\"effect\" + 0.005*\"temperatur\" + 0.005*\"materi\"'),\n", " (19,\n", " '0.022*\"radio\" + 0.020*\"new\" + 0.019*\"broadcast\" + 0.018*\"station\" + 0.014*\"televis\" + 0.013*\"channel\" + 0.013*\"dai\" + 0.011*\"program\" + 0.011*\"host\" + 0.011*\"air\"'),\n", " (20,\n", " '0.035*\"win\" + 0.018*\"contest\" + 0.017*\"wrestl\" + 0.017*\"fight\" + 0.016*\"match\" + 0.016*\"titl\" + 0.015*\"championship\" + 0.014*\"team\" + 0.012*\"world\" + 0.011*\"defeat\"'),\n", " (21,\n", " '0.011*\"languag\" + 0.007*\"word\" + 0.007*\"form\" + 0.006*\"peopl\" + 0.006*\"differ\" + 0.006*\"cultur\" + 0.006*\"us\" + 0.006*\"mean\" + 0.005*\"tradit\" + 0.005*\"term\"'),\n", " (22,\n", " '0.051*\"popul\" + 0.033*\"ag\" + 0.030*\"citi\" + 0.029*\"town\" + 0.027*\"famili\" + 0.026*\"censu\" + 0.023*\"household\" + 0.023*\"commun\" + 0.021*\"peopl\" + 0.021*\"counti\"'),\n", " (23,\n", " '0.016*\"medic\" + 0.014*\"health\" + 0.014*\"hospit\" + 0.013*\"cell\" + 0.011*\"diseas\" + 0.010*\"patient\" + 0.009*\"ret\" + 0.009*\"caus\" + 0.008*\"human\" + 0.008*\"treatment\"'),\n", " (24,\n", " '0.037*\"ship\" + 0.017*\"navi\" + 0.015*\"sea\" + 0.012*\"island\" + 0.012*\"boat\" + 0.011*\"port\" + 0.010*\"naval\" + 0.010*\"coast\" + 0.010*\"gun\" + 0.009*\"fleet\"'),\n", " (25,\n", " '0.044*\"round\" + 0.044*\"final\" + 0.025*\"tournament\" + 0.023*\"group\" + 0.020*\"point\" + 0.020*\"winner\" + 0.018*\"open\" + 0.015*\"place\" + 0.013*\"qualifi\" + 0.012*\"won\"'),\n", " (26,\n", " '0.032*\"world\" + 0.030*\"women\" + 0.028*\"championship\" + 0.026*\"olymp\" + 0.023*\"men\" + 0.022*\"event\" + 0.022*\"medal\" + 0.018*\"athlet\" + 0.017*\"gold\" + 0.017*\"nation\"'),\n", " (27,\n", " '0.056*\"born\" + 0.034*\"russian\" + 0.026*\"american\" + 0.020*\"russia\" + 0.020*\"soviet\" + 0.017*\"polish\" + 0.015*\"jewish\" + 
0.014*\"poland\" + 0.014*\"republ\" + 0.013*\"moscow\"'),\n", " (28,\n", " '0.029*\"build\" + 0.025*\"hous\" + 0.014*\"built\" + 0.012*\"locat\" + 0.012*\"street\" + 0.012*\"site\" + 0.011*\"histor\" + 0.009*\"park\" + 0.009*\"citi\" + 0.009*\"place\"'),\n", " (29,\n", " '0.039*\"leagu\" + 0.036*\"club\" + 0.035*\"plai\" + 0.031*\"team\" + 0.026*\"footbal\" + 0.026*\"season\" + 0.023*\"cup\" + 0.018*\"goal\" + 0.016*\"player\" + 0.016*\"match\"'),\n", " (30,\n", " '0.053*\"french\" + 0.041*\"franc\" + 0.027*\"italian\" + 0.025*\"pari\" + 0.022*\"saint\" + 0.020*\"itali\" + 0.018*\"jean\" + 0.014*\"de\" + 0.011*\"loui\" + 0.011*\"le\"'),\n", " (31,\n", " '0.067*\"australia\" + 0.058*\"australian\" + 0.051*\"new\" + 0.040*\"china\" + 0.033*\"zealand\" + 0.032*\"south\" + 0.027*\"chines\" + 0.021*\"sydnei\" + 0.015*\"melbourn\" + 0.013*\"queensland\"'),\n", " (32,\n", " '0.026*\"speci\" + 0.011*\"famili\" + 0.009*\"plant\" + 0.008*\"white\" + 0.008*\"bird\" + 0.007*\"genu\" + 0.007*\"red\" + 0.007*\"forest\" + 0.007*\"fish\" + 0.006*\"tree\"'),\n", " (33,\n", " '0.033*\"compani\" + 0.013*\"million\" + 0.012*\"busi\" + 0.012*\"market\" + 0.011*\"product\" + 0.010*\"bank\" + 0.010*\"year\" + 0.009*\"industri\" + 0.008*\"oper\" + 0.008*\"new\"'),\n", " (34,\n", " '0.085*\"island\" + 0.073*\"canada\" + 0.065*\"canadian\" + 0.026*\"toronto\" + 0.025*\"ontario\" + 0.017*\"korean\" + 0.017*\"korea\" + 0.016*\"quebec\" + 0.016*\"montreal\" + 0.016*\"british\"'),\n", " (35,\n", " '0.034*\"kong\" + 0.034*\"japanes\" + 0.033*\"hong\" + 0.023*\"lee\" + 0.021*\"singapor\" + 0.019*\"chines\" + 0.018*\"kim\" + 0.015*\"japan\" + 0.014*\"indonesia\" + 0.014*\"thailand\"'),\n", " (36,\n", " '0.054*\"art\" + 0.034*\"museum\" + 0.030*\"jpg\" + 0.027*\"file\" + 0.024*\"work\" + 0.022*\"paint\" + 0.020*\"artist\" + 0.019*\"design\" + 0.017*\"imag\" + 0.017*\"exhibit\"'),\n", " (37,\n", " '0.008*\"time\" + 0.007*\"man\" + 0.005*\"later\" + 0.005*\"appear\" + 0.005*\"charact\" + 
0.005*\"kill\" + 0.004*\"like\" + 0.004*\"friend\" + 0.004*\"return\" + 0.004*\"end\"'),\n", " (38,\n", " '0.014*\"govern\" + 0.012*\"state\" + 0.012*\"nation\" + 0.010*\"war\" + 0.009*\"polit\" + 0.008*\"countri\" + 0.008*\"peopl\" + 0.007*\"group\" + 0.007*\"unit\" + 0.007*\"support\"'),\n", " (39,\n", " '0.050*\"air\" + 0.026*\"aircraft\" + 0.026*\"oper\" + 0.025*\"airport\" + 0.017*\"forc\" + 0.017*\"flight\" + 0.015*\"squadron\" + 0.014*\"unit\" + 0.012*\"base\" + 0.011*\"wing\"'),\n", " (40,\n", " '0.052*\"bar\" + 0.038*\"africa\" + 0.033*\"text\" + 0.033*\"african\" + 0.031*\"till\" + 0.029*\"color\" + 0.026*\"south\" + 0.023*\"black\" + 0.013*\"tropic\" + 0.013*\"storm\"'),\n", " (41,\n", " '0.039*\"book\" + 0.033*\"publish\" + 0.021*\"work\" + 0.015*\"new\" + 0.013*\"press\" + 0.013*\"univers\" + 0.013*\"edit\" + 0.011*\"stori\" + 0.011*\"novel\" + 0.011*\"author\"'),\n", " (42,\n", " '0.026*\"king\" + 0.019*\"centuri\" + 0.010*\"princ\" + 0.009*\"empir\" + 0.009*\"kingdom\" + 0.009*\"emperor\" + 0.009*\"greek\" + 0.008*\"roman\" + 0.007*\"ancient\" + 0.006*\"year\"'),\n", " (43,\n", " '0.033*\"san\" + 0.022*\"spanish\" + 0.017*\"mexico\" + 0.016*\"del\" + 0.013*\"spain\" + 0.012*\"santa\" + 0.011*\"brazil\" + 0.011*\"juan\" + 0.010*\"josé\" + 0.009*\"francisco\"'),\n", " (44,\n", " '0.029*\"game\" + 0.027*\"season\" + 0.023*\"team\" + 0.015*\"plai\" + 0.014*\"coach\" + 0.014*\"player\" + 0.011*\"footbal\" + 0.010*\"year\" + 0.010*\"leagu\" + 0.009*\"record\"'),\n", " (45,\n", " '0.015*\"john\" + 0.011*\"david\" + 0.010*\"michael\" + 0.008*\"paul\" + 0.008*\"smith\" + 0.007*\"robert\" + 0.007*\"jame\" + 0.006*\"peter\" + 0.006*\"jack\" + 0.006*\"jone\"'),\n", " (46,\n", " '0.133*\"class\" + 0.062*\"align\" + 0.060*\"left\" + 0.056*\"wikit\" + 0.046*\"style\" + 0.043*\"center\" + 0.035*\"right\" + 0.032*\"philippin\" + 0.032*\"list\" + 0.026*\"text\"'),\n", " (47,\n", " '0.025*\"river\" + 0.024*\"station\" + 0.021*\"line\" + 0.020*\"road\" + 
0.017*\"railwai\" + 0.015*\"rout\" + 0.013*\"lake\" + 0.012*\"park\" + 0.011*\"bridg\" + 0.011*\"area\"'),\n", " (48,\n", " '0.072*\"octob\" + 0.070*\"septemb\" + 0.069*\"march\" + 0.062*\"decemb\" + 0.062*\"januari\" + 0.062*\"novemb\" + 0.061*\"juli\" + 0.061*\"august\" + 0.060*\"april\" + 0.058*\"june\"'),\n", " (49,\n", " '0.093*\"district\" + 0.066*\"villag\" + 0.047*\"region\" + 0.039*\"east\" + 0.039*\"west\" + 0.038*\"north\" + 0.036*\"counti\" + 0.033*\"south\" + 0.032*\"municip\" + 0.029*\"provinc\"')]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lda = LdaModel.load('lda.model')\n", "row.update(get_tm_metrics(lda, test_corpus))\n", "# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead\n", "tm_metrics = pd.concat([tm_metrics, pd.DataFrame([row])], ignore_index=True)\n", "\n", "lda.show_topics(50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Results" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | coherence | l2_norm | model | perplexity | topics | train_time |\n", "
|---|---|---|---|---|---|---|\n", "
| 0 | -2.814135 | 7.265412 | nmf | 975.740399 | [(24, 0.131*\"mount\" + 0.129*\"lemmon\" + 0.129*\"... | 4394.560518 |\n", "
| 1 | -2.436650 | 7.268837 | nmf_with_r | 985.570926 | [(49, 0.112*\"peak\" + 0.111*\"kitt\" + 0.111*\"mou... | 26451.927848 |\n", "
| 2 | -2.514469 | 7.371544 | lda | 4727.075546 | [(35, 0.034*\"kong\" + 0.034*\"japanes\" + 0.033*\"... | 8278.891060 |\n", "