{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Wikipedia training\n",
    "\n",
    "In this tutorial we will:\n",
    " - Learn how to train the NMF topic model on English Wikipedia corpus\n",
    " - Compare it with LDA model\n",
    " - Evaluate results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2\n",
    "\n",
    "import itertools\n",
    "import json\n",
    "import logging\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import scipy.sparse\n",
    "import smart_open\n",
    "import time\n",
    "from tqdm import tqdm, tqdm_notebook\n",
    "\n",
    "import gensim.downloader as api\n",
    "from gensim import matutils\n",
    "from gensim.corpora import MmCorpus, Dictionary\n",
    "from gensim.models import LdaModel, CoherenceModel\n",
    "from gensim.models.nmf import Nmf\n",
    "from gensim.parsing.preprocessing import preprocess_string\n",
    "\n",
    "tqdm.pandas()\n",
    "\n",
    "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\n",
    "                    level=logging.INFO)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load wikipedia dump\n",
    "Let's use `gensim.downloader.api` for that"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Section title: Introduction\n",
      "Section text: \n",
      "\n",
      "\n",
      "\n",
      "\n",
      "'''Anarchism''' is a political philosophy that advocates self-governed societies based on volun\n",
      "Section title: Etymology and terminology\n",
      "Section text: \n",
      "\n",
      "The word ''anarchism'' is composed from the word ''anarchy'' and the suffix ''-ism'', themselves d\n",
      "Section title: History\n",
      "Section text: \n",
      "\n",
      "===Origins===\n",
      "Woodcut from a Diggers document by William Everard\n",
      "\n",
      "The earliest anarchist themes ca\n",
      "Section title: Anarchist schools of thought\n",
      "Section text: \n",
      "Portrait of philosopher Pierre-Joseph Proudhon (1809–1865) by Gustave Courbet. Proudhon was the pri\n",
      "Section title: Internal issues and debates\n",
      "Section text: \n",
      "consistent with anarchist values is a controversial subject among anarchists.\n",
      "\n",
      "Anarchism is a philo\n",
      "Section title: Topics of interest\n",
      "Section text: Intersecting and overlapping between various schools of thought, certain topics of interest and inte\n",
      "Section title: Criticisms\n",
      "Section text: \n",
      "Criticisms of anarchism include moral criticisms and pragmatic criticisms. Anarchism is often evalu\n",
      "Section title: See also\n",
      "Section text: * Anarchism by country\n",
      "\n",
      "Section title: References\n",
      "Section text: \n",
      "\n",
      "Section title: Further reading\n",
      "Section text: * Barclay, Harold, ''People Without Government: An Anthropology of Anarchy'' (2nd ed.), Left Bank Bo\n",
      "Section title: External links\n",
      "Section text: \n",
      "* \n",
      "* \n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "data = api.load(\"wiki-english-20171001\")\n",
    "article = next(iter(data))\n",
    "\n",
    "for section_title, section_text in zip(\n",
    "    article['section_titles'],\n",
    "    article['section_texts']\n",
    "):\n",
    "    print(\"Section title: %s\" % section_title)\n",
    "    print(\"Section text: %s\" % section_text[:100])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Preprocess and save articles"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "def save_preprocessed_articles(filename, articles):\n",
    "    with smart_open(filename, 'w+', encoding=\"utf8\") as writer:\n",
    "        for article in tqdm_notebook(articles):\n",
    "            article_text = \" \".join(\n",
    "                \" \".join(section)\n",
    "                for section\n",
    "                in zip(\n",
    "                    article['section_titles'],\n",
    "                    article['section_texts']\n",
    "                )\n",
    "            )\n",
    "            article_text = preprocess_string(article_text)\n",
    "\n",
    "            writer.write(json.dumps(article_text) + '\\n')\n",
    "\n",
    "\n",
    "def get_preprocessed_articles(filename):\n",
    "    with smart_open(filename, 'r', encoding=\"utf8\") as reader:\n",
    "        for line in tqdm_notebook(reader):\n",
    "            yield json.loads(\n",
    "                line\n",
    "            )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "SAVE_ARTICLES = False\n",
    "\n",
    "if SAVE_ARTICLES:\n",
    "    save_preprocessed_articles('wiki_articles.jsonlines', data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create and save dictionary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "lines_to_next_cell": 2,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "SAVE_DICTIONARY = False\n",
    "\n",
    "if SAVE_DICTIONARY:\n",
    "    dictionary = Dictionary(get_preprocessed_articles('wiki_articles.jsonlines'))\n",
    "    dictionary.save('wiki.dict')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load and filter dictionary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2019-01-15 19:31:03,151 : INFO : loading Dictionary object from wiki.dict\n",
      "2019-01-15 19:31:04,024 : INFO : loaded wiki.dict\n",
      "2019-01-15 19:31:06,292 : INFO : discarding 1910258 tokens: [('abdelrahim', 49), ('abstention', 120), ('anarcha', 101), ('anarchica', 40), ('anarchosyndicalist', 20), ('antimilitar', 68), ('arbet', 194), ('archo', 100), ('arkhē', 5), ('autonomedia', 118)]...\n",
      "2019-01-15 19:31:06,293 : INFO : keeping 100000 tokens which were in no less than 5 and no more than 2462447 (=50.0%) documents\n",
      "2019-01-15 19:31:06,645 : INFO : resulting dictionary: Dictionary(100000 unique tokens: ['abandon', 'abil', 'abl', 'abolit', 'abstent']...)\n"
     ]
    }
   ],
   "source": [
    "dictionary = Dictionary.load('wiki.dict')\n",
    "dictionary.filter_extremes()\n",
    "dictionary.compactify()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### MmCorpus wrapper\n",
    "In this way we'll:\n",
    "\n",
    "- Make sure that documents are shuffled\n",
    "- Be able to train-test split corpus without rewriting it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "class RandomCorpus(MmCorpus):\n",
    "    def __init__(self, random_seed=42, testset=False, testsize=1000, *args,\n",
    "                 **kwargs):\n",
    "        super().__init__(*args, **kwargs)\n",
    "\n",
    "        random_state = np.random.RandomState(random_seed)\n",
    "        self.indices = random_state.permutation(range(self.num_docs))\n",
    "        if testset:\n",
    "            self.indices = self.indices[:testsize]\n",
    "        else:\n",
    "            self.indices = self.indices[testsize:]\n",
    "\n",
    "    def __iter__(self):\n",
    "        for doc_id in self.indices:\n",
    "            yield self[doc_id]\n",
    "            \n",
    "    def __len__(self):\n",
    "        return len(self.indices)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create and save corpus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "SAVE_CORPUS = False\n",
    "\n",
    "if SAVE_CORPUS:\n",
    "    corpus = (\n",
    "        dictionary.doc2bow(article)\n",
    "        for article\n",
    "        in get_preprocessed_articles('wiki_articles.jsonlines')\n",
    "    )\n",
    "    \n",
    "    RandomCorpus.serialize('wiki.mm', corpus)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load train and test corpus\n",
    "Using `RandomCorpus` wrapper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2019-01-15 19:31:07,323 : INFO : loaded corpus index from wiki.mm.index\n",
      "2019-01-15 19:31:07,324 : INFO : initializing cython corpus reader from wiki.mm\n",
      "2019-01-15 19:31:07,325 : INFO : accepted corpus with 4924894 documents, 100000 features, 683375728 non-zero entries\n",
      "2019-01-15 19:31:08,544 : INFO : loaded corpus index from wiki.mm.index\n",
      "2019-01-15 19:31:08,544 : INFO : initializing cython corpus reader from wiki.mm\n",
      "2019-01-15 19:31:08,545 : INFO : accepted corpus with 4924894 documents, 100000 features, 683375728 non-zero entries\n"
     ]
    }
   ],
   "source": [
    "train_corpus = RandomCorpus(\n",
    "    random_seed=42, testset=False, testsize=2000, fname='wiki.mm'\n",
    ")\n",
    "test_corpus = RandomCorpus(\n",
    "    random_seed=42, testset=True, testsize=2000, fname='wiki.mm'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_execution_time(func):\n",
    "    start = time.time()\n",
    "\n",
    "    result = func()\n",
    "\n",
    "    return (time.time() - start), result\n",
    "\n",
    "\n",
    "def get_tm_metrics(model, test_corpus):\n",
    "    W = model.get_topics().T\n",
    "    H = np.zeros((model.num_topics, len(test_corpus)))\n",
    "    for bow_id, bow in enumerate(test_corpus):\n",
    "        for topic_id, word_count in model.get_document_topics(bow):\n",
    "            H[topic_id, bow_id] = word_count\n",
    "\n",
    "    pred_factors = W.dot(H)\n",
    "    pred_factors /= pred_factors.sum(axis=0)\n",
    "    \n",
    "    dense_corpus = matutils.corpus2dense(test_corpus, pred_factors.shape[0])\n",
    "\n",
    "    perplexity = get_tm_perplexity(pred_factors, dense_corpus)\n",
    "\n",
    "    l2_norm = get_tm_l2_norm(pred_factors, dense_corpus)\n",
    "\n",
    "    model.normalize = True\n",
    "\n",
    "    coherence = CoherenceModel(\n",
    "        model=model,\n",
    "        corpus=test_corpus,\n",
    "        coherence='u_mass'\n",
    "    ).get_coherence()\n",
    "\n",
    "    topics = model.show_topics()\n",
    "\n",
    "    model.normalize = False\n",
    "\n",
    "    return dict(\n",
    "        perplexity=perplexity,\n",
    "        coherence=coherence,\n",
    "        topics=topics,\n",
    "        l2_norm=l2_norm,\n",
    "    )\n",
    "\n",
    "\n",
    "def get_tm_perplexity(pred_factors, dense_corpus):\n",
    "    return np.exp(-(np.log(pred_factors, where=pred_factors > 0) * dense_corpus).sum() / dense_corpus.sum())\n",
    "\n",
    "\n",
    "def get_tm_l2_norm(pred_factors, dense_corpus):\n",
    "    return np.linalg.norm(dense_corpus / dense_corpus.sum(axis=0) - pred_factors)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define dataframe in which we'll store metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "tm_metrics = pd.DataFrame()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Define common params for models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "params = dict(\n",
    "    corpus=train_corpus,\n",
    "    chunksize=2000,\n",
    "    num_topics=50,\n",
    "    id2word=dictionary,\n",
    "    passes=1,\n",
    "    eval_every=10,\n",
    "    minimum_probability=0,\n",
    "    random_state=42,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train NMF and save it\n",
    "Normalization is turned off to compute metrics correctly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2019-01-15 19:33:21,875 : INFO : Loss (no outliers): 2186.768444126956\tLoss (with outliers): 2186.768444126956\n",
      "2019-01-15 19:34:49,514 : INFO : Loss (no outliers): 2298.434152045061\tLoss (with outliers): 2298.434152045061\n",
      "==Truncated==\n",
      "2019-01-15 20:44:23,913 : INFO : Loss (no outliers): 1322.9664709183141\tLoss (with outliers): 1322.9664709183141\n",
      "2019-01-15 20:44:23,928 : INFO : saving Nmf object under nmf.model, separately None\n",
      "2019-01-15 20:44:24,625 : INFO : saved nmf.model\n"
     ]
    }
   ],
   "source": [
    "row = dict()\n",
    "row['model'] = 'nmf'\n",
    "row['train_time'], nmf = get_execution_time(\n",
    "    lambda: Nmf(\n",
    "        use_r=False,\n",
    "        normalize=False,\n",
    "        **params\n",
    "    )\n",
    ")\n",
    "nmf.save('nmf.model')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load NMF and store metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2019-01-15 20:44:24,872 : INFO : loading Nmf object from nmf.model\n",
      "2019-01-15 20:44:25,150 : INFO : loading id2word recursively from nmf.model.id2word.* with mmap=None\n",
      "2019-01-15 20:44:25,151 : INFO : loaded nmf.model\n",
      "2019-01-15 20:44:54,148 : INFO : CorpusAccumulator accumulated stats from 1000 documents\n",
      "2019-01-15 20:44:54,336 : INFO : CorpusAccumulator accumulated stats from 2000 documents\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[(0,\n",
       "  '0.075*\"parti\" + 0.071*\"elect\" + 0.042*\"democrat\" + 0.029*\"republican\" + 0.022*\"vote\" + 0.018*\"conserv\" + 0.017*\"liber\" + 0.014*\"candid\" + 0.013*\"seat\" + 0.013*\"labour\"'),\n",
       " (1,\n",
       "  '0.039*\"book\" + 0.038*\"centuri\" + 0.032*\"histori\" + 0.032*\"languag\" + 0.032*\"publish\" + 0.024*\"english\" + 0.023*\"world\" + 0.022*\"law\" + 0.022*\"govern\" + 0.021*\"nation\"'),\n",
       " (2,\n",
       "  '0.050*\"war\" + 0.036*\"forc\" + 0.026*\"armi\" + 0.023*\"battl\" + 0.021*\"attack\" + 0.019*\"militari\" + 0.018*\"german\" + 0.016*\"british\" + 0.015*\"command\" + 0.014*\"kill\"'),\n",
       " (3,\n",
       "  '0.119*\"race\" + 0.106*\"car\" + 0.073*\"engin\" + 0.035*\"model\" + 0.030*\"driver\" + 0.029*\"vehicl\" + 0.029*\"ford\" + 0.028*\"lap\" + 0.023*\"electr\" + 0.020*\"power\"'),\n",
       " (4,\n",
       "  '0.102*\"leagu\" + 0.092*\"club\" + 0.049*\"footbal\" + 0.047*\"cup\" + 0.029*\"plai\" + 0.028*\"season\" + 0.028*\"divis\" + 0.028*\"goal\" + 0.022*\"team\" + 0.021*\"unit\"'),\n",
       " (5,\n",
       "  '0.055*\"award\" + 0.041*\"best\" + 0.008*\"nomin\" + 0.008*\"year\" + 0.006*\"actress\" + 0.006*\"actor\" + 0.005*\"perform\" + 0.005*\"artist\" + 0.005*\"won\" + 0.005*\"outstand\"'),\n",
       " (6,\n",
       "  '0.115*\"citi\" + 0.014*\"airport\" + 0.013*\"area\" + 0.011*\"popul\" + 0.010*\"san\" + 0.009*\"region\" + 0.008*\"center\" + 0.007*\"municip\" + 0.007*\"intern\" + 0.007*\"ukrainian\"'),\n",
       " (7,\n",
       "  '0.316*\"act\" + 0.046*\"amend\" + 0.020*\"order\" + 0.018*\"ireland\" + 0.016*\"law\" + 0.015*\"regul\" + 0.013*\"court\" + 0.011*\"scotland\" + 0.011*\"road\" + 0.009*\"public\"'),\n",
       " (8,\n",
       "  '0.102*\"align\" + 0.084*\"left\" + 0.022*\"right\" + 0.012*\"text\" + 0.011*\"style\" + 0.007*\"center\" + 0.004*\"bar\" + 0.003*\"till\" + 0.003*\"bgcolor\" + 0.003*\"color\"'),\n",
       " (9,\n",
       "  '0.092*\"team\" + 0.027*\"race\" + 0.025*\"ret\" + 0.014*\"championship\" + 0.007*\"nation\" + 0.006*\"time\" + 0.006*\"sport\" + 0.005*\"stage\" + 0.005*\"coach\" + 0.005*\"finish\"'),\n",
       " (10,\n",
       "  '0.135*\"compani\" + 0.089*\"ship\" + 0.035*\"product\" + 0.028*\"oper\" + 0.024*\"navi\" + 0.022*\"corpor\" + 0.021*\"oil\" + 0.021*\"launch\" + 0.021*\"bank\" + 0.021*\"built\"'),\n",
       " (11,\n",
       "  '0.053*\"new\" + 0.019*\"york\" + 0.004*\"zealand\" + 0.003*\"jersei\" + 0.003*\"american\" + 0.002*\"time\" + 0.002*\"australia\" + 0.002*\"radio\" + 0.002*\"press\" + 0.002*\"washington\"'),\n",
       " (12,\n",
       "  '0.036*\"world\" + 0.034*\"championship\" + 0.032*\"final\" + 0.029*\"match\" + 0.026*\"win\" + 0.026*\"round\" + 0.019*\"open\" + 0.018*\"won\" + 0.015*\"defeat\" + 0.015*\"cup\"'),\n",
       " (13,\n",
       "  '0.019*\"album\" + 0.017*\"record\" + 0.014*\"band\" + 0.008*\"releas\" + 0.005*\"tour\" + 0.005*\"guitar\" + 0.005*\"vocal\" + 0.004*\"rock\" + 0.004*\"track\" + 0.004*\"music\"'),\n",
       " (14,\n",
       "  '0.100*\"church\" + 0.017*\"cathol\" + 0.014*\"christian\" + 0.012*\"centuri\" + 0.012*\"saint\" + 0.011*\"bishop\" + 0.011*\"built\" + 0.009*\"list\" + 0.009*\"build\" + 0.008*\"roman\"'),\n",
       " (15,\n",
       "  '0.088*\"presid\" + 0.072*\"minist\" + 0.046*\"prime\" + 0.015*\"govern\" + 0.014*\"gener\" + 0.011*\"met\" + 0.011*\"governor\" + 0.010*\"foreign\" + 0.010*\"visit\" + 0.009*\"council\"'),\n",
       " (16,\n",
       "  '0.182*\"speci\" + 0.112*\"famili\" + 0.101*\"nov\" + 0.092*\"valid\" + 0.066*\"genu\" + 0.045*\"format\" + 0.040*\"member\" + 0.037*\"gen\" + 0.036*\"bird\" + 0.034*\"type\"'),\n",
       " (17,\n",
       "  '0.029*\"season\" + 0.013*\"yard\" + 0.013*\"game\" + 0.011*\"plai\" + 0.008*\"team\" + 0.007*\"score\" + 0.007*\"win\" + 0.007*\"record\" + 0.006*\"run\" + 0.006*\"coach\"'),\n",
       " (18,\n",
       "  '0.214*\"counti\" + 0.064*\"township\" + 0.017*\"area\" + 0.016*\"statist\" + 0.007*\"ohio\" + 0.006*\"metropolitan\" + 0.006*\"combin\" + 0.005*\"pennsylvania\" + 0.005*\"texa\" + 0.005*\"washington\"'),\n",
       " (19,\n",
       "  '0.017*\"area\" + 0.016*\"river\" + 0.015*\"water\" + 0.006*\"larg\" + 0.006*\"region\" + 0.006*\"lake\" + 0.006*\"power\" + 0.006*\"high\" + 0.005*\"bar\" + 0.005*\"form\"'),\n",
       " (20,\n",
       "  '0.031*\"us\" + 0.025*\"gener\" + 0.024*\"model\" + 0.022*\"data\" + 0.021*\"design\" + 0.020*\"time\" + 0.019*\"function\" + 0.019*\"number\" + 0.018*\"process\" + 0.017*\"exampl\"'),\n",
       " (21,\n",
       "  '0.202*\"order\" + 0.098*\"group\" + 0.098*\"regul\" + 0.076*\"amend\" + 0.041*\"road\" + 0.034*\"traffic\" + 0.033*\"temporari\" + 0.032*\"prohibit\" + 0.027*\"trunk\" + 0.021*\"junction\"'),\n",
       " (22,\n",
       "  '0.096*\"film\" + 0.010*\"product\" + 0.010*\"director\" + 0.010*\"festiv\" + 0.009*\"star\" + 0.009*\"produc\" + 0.009*\"movi\" + 0.008*\"direct\" + 0.007*\"releas\" + 0.007*\"actor\"'),\n",
       " (23,\n",
       "  '0.163*\"music\" + 0.046*\"viola\" + 0.045*\"radio\" + 0.042*\"piano\" + 0.029*\"perform\" + 0.028*\"station\" + 0.027*\"orchestra\" + 0.026*\"compos\" + 0.025*\"song\" + 0.015*\"rock\"'),\n",
       " (24,\n",
       "  '0.052*\"mount\" + 0.051*\"lemmon\" + 0.051*\"peak\" + 0.051*\"kitt\" + 0.051*\"spacewatch\" + 0.026*\"survei\" + 0.015*\"octob\" + 0.012*\"septemb\" + 0.009*\"css\" + 0.009*\"catalina\"'),\n",
       " (25,\n",
       "  '0.075*\"air\" + 0.035*\"forc\" + 0.030*\"squadron\" + 0.029*\"aircraft\" + 0.028*\"oper\" + 0.023*\"unit\" + 0.018*\"flight\" + 0.017*\"airport\" + 0.017*\"wing\" + 0.017*\"base\"'),\n",
       " (26,\n",
       "  '0.105*\"hous\" + 0.038*\"term\" + 0.020*\"march\" + 0.019*\"build\" + 0.019*\"member\" + 0.017*\"serv\" + 0.014*\"congress\" + 0.014*\"hall\" + 0.012*\"januari\" + 0.010*\"window\"'),\n",
       " (27,\n",
       "  '0.129*\"district\" + 0.019*\"pennsylvania\" + 0.016*\"grade\" + 0.012*\"fund\" + 0.012*\"educ\" + 0.012*\"basic\" + 0.011*\"level\" + 0.010*\"oblast\" + 0.010*\"rural\" + 0.009*\"tax\"'),\n",
       " (28,\n",
       "  '0.042*\"year\" + 0.012*\"dai\" + 0.007*\"time\" + 0.005*\"ag\" + 0.004*\"month\" + 0.003*\"includ\" + 0.003*\"follow\" + 0.003*\"later\" + 0.003*\"old\" + 0.003*\"student\"'),\n",
       " (29,\n",
       "  '0.113*\"station\" + 0.109*\"line\" + 0.076*\"road\" + 0.072*\"railwai\" + 0.048*\"rout\" + 0.035*\"oper\" + 0.034*\"train\" + 0.023*\"street\" + 0.020*\"cross\" + 0.020*\"railroad\"'),\n",
       " (30,\n",
       "  '0.036*\"park\" + 0.029*\"town\" + 0.025*\"north\" + 0.020*\"south\" + 0.018*\"west\" + 0.017*\"east\" + 0.017*\"street\" + 0.015*\"nation\" + 0.014*\"build\" + 0.013*\"river\"'),\n",
       " (31,\n",
       "  '0.066*\"women\" + 0.044*\"men\" + 0.030*\"nation\" + 0.024*\"right\" + 0.014*\"athlet\" + 0.013*\"intern\" + 0.013*\"rank\" + 0.013*\"countri\" + 0.012*\"advanc\" + 0.011*\"event\"'),\n",
       " (32,\n",
       "  '0.127*\"linear\" + 0.126*\"socorro\" + 0.029*\"septemb\" + 0.026*\"neat\" + 0.023*\"palomar\" + 0.021*\"octob\" + 0.016*\"kitt\" + 0.016*\"peak\" + 0.015*\"spacewatch\" + 0.015*\"anderson\"'),\n",
       " (33,\n",
       "  '0.152*\"univers\" + 0.055*\"colleg\" + 0.019*\"institut\" + 0.018*\"student\" + 0.018*\"scienc\" + 0.015*\"professor\" + 0.012*\"research\" + 0.011*\"campu\" + 0.011*\"educ\" + 0.011*\"technolog\"'),\n",
       " (34,\n",
       "  '0.072*\"state\" + 0.032*\"unit\" + 0.005*\"court\" + 0.005*\"law\" + 0.004*\"feder\" + 0.004*\"american\" + 0.003*\"nation\" + 0.003*\"govern\" + 0.003*\"kingdom\" + 0.003*\"senat\"'),\n",
       " (35,\n",
       "  '0.074*\"game\" + 0.017*\"player\" + 0.007*\"plai\" + 0.006*\"releas\" + 0.005*\"develop\" + 0.005*\"video\" + 0.005*\"charact\" + 0.004*\"playstat\" + 0.004*\"version\" + 0.004*\"world\"'),\n",
       " (36,\n",
       "  '0.141*\"south\" + 0.098*\"american\" + 0.081*\"india\" + 0.059*\"commun\" + 0.053*\"west\" + 0.053*\"director\" + 0.053*\"africa\" + 0.049*\"usa\" + 0.049*\"indian\" + 0.041*\"servic\"'),\n",
       " (37,\n",
       "  '0.111*\"servic\" + 0.025*\"commun\" + 0.021*\"offic\" + 0.012*\"polic\" + 0.011*\"educ\" + 0.011*\"public\" + 0.010*\"chief\" + 0.009*\"late\" + 0.009*\"manag\" + 0.008*\"mr\"'),\n",
       " (38,\n",
       "  '0.112*\"royal\" + 0.085*\"john\" + 0.083*\"william\" + 0.054*\"lieuten\" + 0.044*\"georg\" + 0.041*\"offic\" + 0.041*\"jame\" + 0.038*\"sergeant\" + 0.037*\"major\" + 0.035*\"charl\"'),\n",
       " (39,\n",
       "  '0.051*\"song\" + 0.043*\"releas\" + 0.042*\"singl\" + 0.027*\"chart\" + 0.025*\"album\" + 0.017*\"number\" + 0.014*\"video\" + 0.013*\"version\" + 0.012*\"love\" + 0.011*\"featur\"'),\n",
       " (40,\n",
       "  '0.031*\"time\" + 0.028*\"later\" + 0.026*\"appear\" + 0.025*\"man\" + 0.024*\"kill\" + 0.020*\"charact\" + 0.019*\"work\" + 0.018*\"father\" + 0.018*\"death\" + 0.018*\"famili\"'),\n",
       " (41,\n",
       "  '0.126*\"seri\" + 0.064*\"episod\" + 0.026*\"season\" + 0.021*\"televis\" + 0.015*\"comic\" + 0.013*\"charact\" + 0.012*\"dvd\" + 0.012*\"anim\" + 0.012*\"star\" + 0.011*\"appear\"'),\n",
       " (42,\n",
       "  '0.143*\"born\" + 0.073*\"american\" + 0.027*\"footbal\" + 0.024*\"player\" + 0.024*\"william\" + 0.023*\"singer\" + 0.019*\"actor\" + 0.017*\"politician\" + 0.015*\"actress\" + 0.013*\"english\"'),\n",
       " (43,\n",
       "  '0.044*\"march\" + 0.042*\"septemb\" + 0.036*\"octob\" + 0.033*\"januari\" + 0.032*\"april\" + 0.031*\"august\" + 0.031*\"juli\" + 0.029*\"novemb\" + 0.029*\"june\" + 0.028*\"decemb\"'),\n",
       " (44,\n",
       "  '0.149*\"island\" + 0.013*\"south\" + 0.013*\"australia\" + 0.009*\"sea\" + 0.008*\"north\" + 0.008*\"bai\" + 0.008*\"western\" + 0.008*\"airport\" + 0.007*\"coast\" + 0.006*\"pacif\"'),\n",
       " (45,\n",
       "  '0.028*\"studi\" + 0.026*\"research\" + 0.023*\"health\" + 0.019*\"human\" + 0.019*\"term\" + 0.019*\"develop\" + 0.018*\"includ\" + 0.018*\"peopl\" + 0.017*\"report\" + 0.017*\"cell\"'),\n",
       " (46,\n",
       "  '0.112*\"school\" + 0.028*\"high\" + 0.016*\"student\" + 0.012*\"educ\" + 0.009*\"grade\" + 0.008*\"primari\" + 0.007*\"public\" + 0.006*\"colleg\" + 0.006*\"elementari\" + 0.006*\"pennsylvania\"'),\n",
       " (47,\n",
       "  '0.137*\"royal\" + 0.121*\"capt\" + 0.103*\"armi\" + 0.090*\"maj\" + 0.089*\"corp\" + 0.075*\"col\" + 0.074*\"temp\" + 0.048*\"servic\" + 0.040*\"engin\" + 0.033*\"reg\"'),\n",
       " (48,\n",
       "  '0.183*\"art\" + 0.117*\"museum\" + 0.071*\"paint\" + 0.062*\"work\" + 0.046*\"artist\" + 0.043*\"galleri\" + 0.040*\"exhibit\" + 0.034*\"collect\" + 0.027*\"histori\" + 0.022*\"jpg\"'),\n",
       " (49,\n",
       "  '0.068*\"regiment\" + 0.062*\"divis\" + 0.049*\"battalion\" + 0.045*\"infantri\" + 0.036*\"brigad\" + 0.024*\"armi\" + 0.023*\"artilleri\" + 0.019*\"compani\" + 0.018*\"gener\" + 0.018*\"colonel\"')]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nmf = Nmf.load('nmf.model')\n",
    "row.update(get_tm_metrics(nmf, test_corpus))\n",
    "tm_metrics = tm_metrics.append(pd.Series(row), ignore_index=True)\n",
    "\n",
    "nmf.show_topics(50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train NMF with residuals and save it\n",
    "Residuals add regularization to the model thus increasing quality, but slows down training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2019-01-15 20:54:05,363 : INFO : Loss (no outliers): 2179.9524465227146\tLoss (with outliers): 2102.354108449905\n",
      "2019-01-15 20:57:12,821 : INFO : Loss (no outliers): 2268.3200929871823\tLoss (with outliers): 2110.928651253909\n",
      "==Truncated==\n",
      "2019-01-16 04:05:46,589 : INFO : Loss (no outliers): 1321.521323758918\tLoss (with outliers): 1282.9364495345592\n",
      "2019-01-16 04:05:46,599 : INFO : saving Nmf object under nmf_with_r.model, separately None\n",
      "2019-01-16 04:05:46,601 : INFO : storing scipy.sparse array '_r' under nmf_with_r.model._r.npy\n",
      "2019-01-16 04:05:47,781 : INFO : saved nmf_with_r.model\n"
     ]
    }
   ],
   "source": [
    "row = dict()\n",
    "row['model'] = 'nmf_with_r'\n",
    "row['train_time'], nmf_with_r = get_execution_time(\n",
    "    lambda: Nmf(\n",
    "        use_r=True,\n",
    "        lambda_=200,\n",
    "        normalize=False,\n",
    "        **params\n",
    "    )\n",
    ")\n",
    "nmf_with_r.save('nmf_with_r.model')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load NMF with residuals and store metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2019-01-16 04:05:48,017 : INFO : loading Nmf object from nmf_with_r.model\n",
      "2019-01-16 04:05:48,272 : INFO : loading id2word recursively from nmf_with_r.model.id2word.* with mmap=None\n",
      "2019-01-16 04:05:48,273 : INFO : loading _r from nmf_with_r.model._r.npy with mmap=None\n",
      "2019-01-16 04:05:48,304 : INFO : loaded nmf_with_r.model\n",
      "2019-01-16 04:06:27,119 : INFO : CorpusAccumulator accumulated stats from 1000 documents\n",
      "2019-01-16 04:06:27,253 : INFO : CorpusAccumulator accumulated stats from 2000 documents\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[(0,\n",
       "  '0.062*\"parti\" + 0.061*\"elect\" + 0.031*\"democrat\" + 0.020*\"republican\" + 0.020*\"vote\" + 0.013*\"liber\" + 0.012*\"candid\" + 0.012*\"conserv\" + 0.011*\"seat\" + 0.010*\"member\"'),\n",
       " (1,\n",
       "  '0.052*\"book\" + 0.040*\"centuri\" + 0.039*\"publish\" + 0.031*\"languag\" + 0.027*\"histori\" + 0.025*\"work\" + 0.023*\"english\" + 0.022*\"king\" + 0.019*\"polit\" + 0.019*\"author\"'),\n",
       " (2,\n",
       "  '0.031*\"armi\" + 0.028*\"divis\" + 0.025*\"regiment\" + 0.022*\"forc\" + 0.020*\"battalion\" + 0.019*\"infantri\" + 0.019*\"command\" + 0.017*\"brigad\" + 0.016*\"gener\" + 0.012*\"corp\"'),\n",
       " (3,\n",
       "  '0.110*\"race\" + 0.059*\"car\" + 0.033*\"engin\" + 0.025*\"lap\" + 0.023*\"driver\" + 0.021*\"ret\" + 0.020*\"ford\" + 0.015*\"finish\" + 0.015*\"motorsport\" + 0.015*\"chevrolet\"'),\n",
       " (4,\n",
       "  '0.130*\"club\" + 0.068*\"cup\" + 0.046*\"footbal\" + 0.044*\"goal\" + 0.032*\"leagu\" + 0.031*\"unit\" + 0.031*\"plai\" + 0.030*\"match\" + 0.026*\"score\" + 0.021*\"player\"'),\n",
       " (5,\n",
       "  '0.041*\"award\" + 0.030*\"best\" + 0.006*\"nomin\" + 0.005*\"actress\" + 0.005*\"year\" + 0.004*\"actor\" + 0.004*\"won\" + 0.004*\"perform\" + 0.003*\"outstand\" + 0.003*\"artist\"'),\n",
       " (6,\n",
       "  '0.087*\"citi\" + 0.013*\"town\" + 0.009*\"popul\" + 0.008*\"area\" + 0.007*\"san\" + 0.006*\"center\" + 0.006*\"airport\" + 0.006*\"unit\" + 0.006*\"locat\" + 0.005*\"municip\"'),\n",
       " (7,\n",
       "  '0.171*\"act\" + 0.021*\"amend\" + 0.018*\"order\" + 0.010*\"ireland\" + 0.009*\"law\" + 0.007*\"court\" + 0.007*\"regul\" + 0.006*\"road\" + 0.006*\"scotland\" + 0.006*\"nation\"'),\n",
       " (8,\n",
       "  '0.064*\"leagu\" + 0.014*\"divis\" + 0.012*\"left\" + 0.011*\"align\" + 0.009*\"basebal\" + 0.008*\"footbal\" + 0.007*\"run\" + 0.007*\"major\" + 0.005*\"home\" + 0.005*\"hit\"'),\n",
       " (9,\n",
       "  '0.086*\"team\" + 0.013*\"championship\" + 0.007*\"nation\" + 0.007*\"race\" + 0.007*\"coach\" + 0.005*\"time\" + 0.004*\"sport\" + 0.004*\"ret\" + 0.004*\"player\" + 0.004*\"match\"'),\n",
       " (10,\n",
       "  '0.100*\"episod\" + 0.055*\"compani\" + 0.021*\"product\" + 0.011*\"produc\" + 0.011*\"televis\" + 0.009*\"role\" + 0.009*\"busi\" + 0.008*\"market\" + 0.008*\"corpor\" + 0.007*\"bank\"'),\n",
       " (11,\n",
       "  '0.050*\"new\" + 0.017*\"york\" + 0.003*\"zealand\" + 0.003*\"jersei\" + 0.002*\"time\" + 0.002*\"radio\" + 0.002*\"broadcast\" + 0.002*\"station\" + 0.002*\"washington\" + 0.002*\"australia\"'),\n",
       " (12,\n",
       "  '0.035*\"final\" + 0.033*\"world\" + 0.030*\"round\" + 0.030*\"championship\" + 0.025*\"win\" + 0.025*\"match\" + 0.021*\"open\" + 0.017*\"won\" + 0.016*\"tournament\" + 0.015*\"event\"'),\n",
       " (13,\n",
       "  '0.020*\"record\" + 0.019*\"band\" + 0.015*\"album\" + 0.007*\"releas\" + 0.007*\"guitar\" + 0.006*\"tour\" + 0.005*\"rock\" + 0.005*\"vocal\" + 0.004*\"plai\" + 0.004*\"live\"'),\n",
       " (14,\n",
       "  '0.096*\"church\" + 0.015*\"cathol\" + 0.012*\"christian\" + 0.010*\"saint\" + 0.010*\"bishop\" + 0.009*\"centuri\" + 0.008*\"build\" + 0.007*\"parish\" + 0.007*\"built\" + 0.007*\"roman\"'),\n",
       " (15,\n",
       "  '0.084*\"presid\" + 0.055*\"minist\" + 0.037*\"prime\" + 0.014*\"govern\" + 0.012*\"gener\" + 0.010*\"governor\" + 0.010*\"nation\" + 0.008*\"council\" + 0.008*\"secretari\" + 0.008*\"visit\"'),\n",
       " (16,\n",
       "  '0.089*\"yard\" + 0.035*\"pass\" + 0.035*\"touchdown\" + 0.028*\"field\" + 0.025*\"run\" + 0.023*\"win\" + 0.022*\"score\" + 0.021*\"quarter\" + 0.017*\"record\" + 0.016*\"second\"'),\n",
       " (17,\n",
       "  '0.042*\"season\" + 0.006*\"plai\" + 0.004*\"coach\" + 0.004*\"final\" + 0.004*\"second\" + 0.004*\"win\" + 0.004*\"record\" + 0.003*\"career\" + 0.003*\"finish\" + 0.003*\"point\"'),\n",
       " (18,\n",
       "  '0.174*\"counti\" + 0.034*\"township\" + 0.014*\"area\" + 0.013*\"statist\" + 0.004*\"texa\" + 0.004*\"ohio\" + 0.004*\"virginia\" + 0.004*\"washington\" + 0.003*\"metropolitan\" + 0.003*\"pennsylvania\"'),\n",
       " (19,\n",
       "  '0.012*\"water\" + 0.010*\"area\" + 0.010*\"speci\" + 0.007*\"larg\" + 0.006*\"order\" + 0.006*\"region\" + 0.006*\"includ\" + 0.005*\"black\" + 0.005*\"famili\" + 0.005*\"popul\"'),\n",
       " (20,\n",
       "  '0.020*\"us\" + 0.015*\"gener\" + 0.014*\"design\" + 0.014*\"model\" + 0.012*\"develop\" + 0.012*\"time\" + 0.012*\"data\" + 0.011*\"number\" + 0.011*\"function\" + 0.011*\"process\"'),\n",
       " (21,\n",
       "  '0.165*\"group\" + 0.023*\"left\" + 0.022*\"align\" + 0.021*\"member\" + 0.017*\"text\" + 0.015*\"bar\" + 0.011*\"order\" + 0.011*\"point\" + 0.010*\"till\" + 0.009*\"stage\"'),\n",
       " (22,\n",
       "  '0.095*\"film\" + 0.009*\"director\" + 0.008*\"star\" + 0.008*\"movi\" + 0.008*\"product\" + 0.008*\"festiv\" + 0.008*\"releas\" + 0.008*\"produc\" + 0.007*\"direct\" + 0.006*\"featur\"'),\n",
       " (23,\n",
       "  '0.107*\"music\" + 0.024*\"perform\" + 0.019*\"piano\" + 0.018*\"song\" + 0.017*\"compos\" + 0.017*\"orchestra\" + 0.017*\"viola\" + 0.012*\"plai\" + 0.011*\"radio\" + 0.011*\"danc\"'),\n",
       " (24,\n",
       "  '0.023*\"septemb\" + 0.023*\"march\" + 0.020*\"octob\" + 0.020*\"juli\" + 0.019*\"june\" + 0.019*\"april\" + 0.019*\"august\" + 0.018*\"januari\" + 0.018*\"novemb\" + 0.017*\"decemb\"'),\n",
       " (25,\n",
       "  '0.078*\"air\" + 0.041*\"forc\" + 0.031*\"aircraft\" + 0.027*\"squadron\" + 0.026*\"oper\" + 0.021*\"unit\" + 0.016*\"base\" + 0.016*\"wing\" + 0.016*\"flight\" + 0.015*\"fighter\"'),\n",
       " (26,\n",
       "  '0.101*\"hous\" + 0.023*\"build\" + 0.021*\"term\" + 0.015*\"member\" + 0.014*\"serv\" + 0.014*\"march\" + 0.014*\"left\" + 0.012*\"congress\" + 0.011*\"hall\" + 0.010*\"street\"'),\n",
       " (27,\n",
       "  '0.123*\"district\" + 0.024*\"pennsylvania\" + 0.019*\"grade\" + 0.016*\"educ\" + 0.015*\"fund\" + 0.014*\"basic\" + 0.013*\"level\" + 0.011*\"student\" + 0.011*\"receiv\" + 0.010*\"tax\"'),\n",
       " (28,\n",
       "  '0.048*\"year\" + 0.007*\"dai\" + 0.005*\"time\" + 0.005*\"ag\" + 0.003*\"month\" + 0.003*\"old\" + 0.003*\"student\" + 0.003*\"includ\" + 0.003*\"later\" + 0.002*\"million\"'),\n",
       " (29,\n",
       "  '0.090*\"line\" + 0.083*\"station\" + 0.054*\"road\" + 0.053*\"railwai\" + 0.036*\"rout\" + 0.030*\"train\" + 0.027*\"oper\" + 0.020*\"street\" + 0.016*\"servic\" + 0.016*\"open\"'),\n",
       " (30,\n",
       "  '0.031*\"park\" + 0.030*\"south\" + 0.030*\"north\" + 0.023*\"west\" + 0.020*\"river\" + 0.020*\"east\" + 0.015*\"area\" + 0.014*\"town\" + 0.013*\"lake\" + 0.013*\"nation\"'),\n",
       " (31,\n",
       "  '0.071*\"women\" + 0.041*\"men\" + 0.027*\"nation\" + 0.023*\"right\" + 0.012*\"countri\" + 0.012*\"intern\" + 0.012*\"athlet\" + 0.011*\"advanc\" + 0.011*\"rank\" + 0.010*\"law\"'),\n",
       " (32,\n",
       "  '0.104*\"linear\" + 0.104*\"socorro\" + 0.025*\"septemb\" + 0.020*\"neat\" + 0.018*\"palomar\" + 0.018*\"octob\" + 0.013*\"decemb\" + 0.013*\"august\" + 0.012*\"anderson\" + 0.012*\"mesa\"'),\n",
       " (33,\n",
       "  '0.089*\"univers\" + 0.011*\"scienc\" + 0.009*\"institut\" + 0.008*\"research\" + 0.008*\"professor\" + 0.006*\"student\" + 0.005*\"technolog\" + 0.005*\"faculti\" + 0.005*\"studi\" + 0.005*\"engin\"'),\n",
       " (34,\n",
       "  '0.064*\"state\" + 0.024*\"unit\" + 0.005*\"court\" + 0.005*\"law\" + 0.004*\"feder\" + 0.003*\"nation\" + 0.003*\"govern\" + 0.002*\"senat\" + 0.002*\"california\" + 0.002*\"constitut\"'),\n",
       " (35,\n",
       "  '0.085*\"colleg\" + 0.019*\"univers\" + 0.014*\"student\" + 0.008*\"campu\" + 0.007*\"institut\" + 0.006*\"educ\" + 0.005*\"hall\" + 0.005*\"program\" + 0.005*\"commun\" + 0.005*\"state\"'),\n",
       " (36,\n",
       "  '0.118*\"class\" + 0.079*\"director\" + 0.053*\"rifl\" + 0.050*\"south\" + 0.048*\"×mm\" + 0.046*\"action\" + 0.045*\"san\" + 0.044*\"actor\" + 0.041*\"angel\" + 0.037*\"lo\"'),\n",
       " (37,\n",
       "  '0.092*\"servic\" + 0.025*\"offic\" + 0.023*\"commun\" + 0.013*\"john\" + 0.012*\"chief\" + 0.011*\"polic\" + 0.011*\"public\" + 0.011*\"british\" + 0.010*\"late\" + 0.010*\"director\"'),\n",
       " (38,\n",
       "  '0.156*\"royal\" + 0.072*\"william\" + 0.068*\"john\" + 0.058*\"corp\" + 0.051*\"lieuten\" + 0.046*\"capt\" + 0.041*\"engin\" + 0.041*\"armi\" + 0.039*\"georg\" + 0.039*\"temp\"'),\n",
       " (39,\n",
       "  '0.042*\"song\" + 0.039*\"album\" + 0.034*\"releas\" + 0.029*\"singl\" + 0.024*\"chart\" + 0.013*\"number\" + 0.011*\"video\" + 0.010*\"love\" + 0.010*\"featur\" + 0.010*\"track\"'),\n",
       " (40,\n",
       "  '0.028*\"time\" + 0.025*\"later\" + 0.023*\"kill\" + 0.019*\"appear\" + 0.018*\"man\" + 0.016*\"death\" + 0.016*\"father\" + 0.015*\"return\" + 0.015*\"son\" + 0.014*\"charact\"'),\n",
       " (41,\n",
       "  '0.110*\"seri\" + 0.016*\"charact\" + 0.016*\"episod\" + 0.015*\"comic\" + 0.013*\"televis\" + 0.012*\"anim\" + 0.011*\"appear\" + 0.009*\"stori\" + 0.009*\"origin\" + 0.009*\"featur\"'),\n",
       " (42,\n",
       "  '0.091*\"born\" + 0.070*\"american\" + 0.022*\"player\" + 0.021*\"footbal\" + 0.020*\"william\" + 0.016*\"actor\" + 0.014*\"politician\" + 0.014*\"singer\" + 0.013*\"john\" + 0.012*\"actress\"'),\n",
       " (43,\n",
       "  '0.072*\"game\" + 0.017*\"player\" + 0.011*\"plai\" + 0.004*\"releas\" + 0.004*\"point\" + 0.004*\"develop\" + 0.004*\"score\" + 0.003*\"video\" + 0.003*\"time\" + 0.003*\"card\"'),\n",
       " (44,\n",
       "  '0.110*\"island\" + 0.007*\"australia\" + 0.007*\"ship\" + 0.007*\"south\" + 0.007*\"sea\" + 0.006*\"bai\" + 0.005*\"coast\" + 0.004*\"pacif\" + 0.004*\"western\" + 0.004*\"british\"'),\n",
       " (45,\n",
       "  '0.029*\"health\" + 0.028*\"studi\" + 0.027*\"research\" + 0.022*\"peopl\" + 0.020*\"human\" + 0.019*\"medic\" + 0.019*\"cell\" + 0.018*\"report\" + 0.018*\"ag\" + 0.017*\"includ\"'),\n",
       " (46,\n",
       "  '0.113*\"school\" + 0.025*\"high\" + 0.014*\"student\" + 0.011*\"educ\" + 0.007*\"grade\" + 0.006*\"public\" + 0.005*\"elementari\" + 0.005*\"primari\" + 0.004*\"pennsylvania\" + 0.004*\"teacher\"'),\n",
       " (47,\n",
       "  '0.050*\"war\" + 0.021*\"german\" + 0.017*\"american\" + 0.016*\"british\" + 0.016*\"world\" + 0.012*\"french\" + 0.010*\"battl\" + 0.010*\"germani\" + 0.009*\"ship\" + 0.009*\"soviet\"'),\n",
       " (48,\n",
       "  '0.174*\"art\" + 0.099*\"museum\" + 0.058*\"paint\" + 0.057*\"work\" + 0.044*\"artist\" + 0.041*\"galleri\" + 0.038*\"exhibit\" + 0.031*\"collect\" + 0.023*\"histori\" + 0.021*\"design\"'),\n",
       " (49,\n",
       "  '0.067*\"peak\" + 0.066*\"kitt\" + 0.066*\"mount\" + 0.066*\"spacewatch\" + 0.065*\"lemmon\" + 0.033*\"survei\" + 0.026*\"octob\" + 0.024*\"septemb\" + 0.015*\"novemb\" + 0.012*\"march\"')]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nmf_with_r = Nmf.load('nmf_with_r.model')\n",
    "row.update(get_tm_metrics(nmf_with_r, test_corpus))\n",
    "tm_metrics = tm_metrics.append(pd.Series(row), ignore_index=True)\n",
    "\n",
    "nmf_with_r.show_topics(50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train LDA and save it\n",
    "That's a common model to do Topic Modeling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2019-01-16 04:06:27,576 : INFO : using symmetric alpha at 0.02\n",
      "2019-01-16 04:06:27,576 : INFO : using symmetric eta at 0.02\n",
      "2019-01-16 04:06:27,589 : INFO : using serial LDA version on this node\n",
      "2019-01-16 04:06:28,185 : INFO : running online (single-pass) LDA training, 50 topics, 1 passes over the supplied corpus of 4922894 documents, updating model once every 2000 documents, evaluating perplexity every 20000 documents, iterating 50x with a convergence threshold of 0.001000\n",
      "2019-01-16 04:06:28,910 : INFO : PROGRESS: pass 0, at document #2000/4922894\n",
      "==Truncated==\n",
      "2019-01-16 06:24:26,456 : INFO : topic diff=0.003897, rho=0.020154\n",
      "2019-01-16 06:24:26,465 : INFO : saving LdaState object under lda.model.state, separately None\n",
      "2019-01-16 06:24:26,680 : INFO : saved lda.model.state\n",
      "2019-01-16 06:24:26,732 : INFO : saving LdaModel object under lda.model, separately ['expElogbeta', 'sstats']\n",
      "2019-01-16 06:24:26,732 : INFO : storing np array 'expElogbeta' to lda.model.expElogbeta.npy\n",
      "2019-01-16 06:24:26,812 : INFO : not storing attribute dispatcher\n",
      "2019-01-16 06:24:26,814 : INFO : not storing attribute id2word\n",
      "2019-01-16 06:24:26,815 : INFO : not storing attribute state\n",
      "2019-01-16 06:24:26,828 : INFO : saved lda.model\n"
     ]
    }
   ],
   "source": [
    "row = dict()\n",
    "row['model'] = 'lda'\n",
    "row['train_time'], lda = get_execution_time(\n",
    "    lambda: LdaModel(**params)\n",
    ")\n",
    "lda.save('lda.model')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load LDA and store metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2019-01-16 06:24:27,064 : INFO : loading LdaModel object from lda.model\n",
      "2019-01-16 06:24:27,070 : INFO : loading expElogbeta from lda.model.expElogbeta.npy with mmap=None\n",
      "2019-01-16 06:24:27,077 : INFO : setting ignored attribute dispatcher to None\n",
      "2019-01-16 06:24:27,078 : INFO : setting ignored attribute id2word to None\n",
      "2019-01-16 06:24:27,078 : INFO : setting ignored attribute state to None\n",
      "2019-01-16 06:24:27,079 : INFO : loaded lda.model\n",
      "2019-01-16 06:24:27,079 : INFO : loading LdaState object from lda.model.state\n",
      "2019-01-16 06:24:27,173 : INFO : loaded lda.model.state\n",
      "2019-01-16 06:24:41,257 : INFO : CorpusAccumulator accumulated stats from 1000 documents\n",
      "2019-01-16 06:24:41,452 : INFO : CorpusAccumulator accumulated stats from 2000 documents\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[(0,\n",
       "  '0.033*\"war\" + 0.028*\"armi\" + 0.021*\"forc\" + 0.020*\"command\" + 0.015*\"militari\" + 0.015*\"battl\" + 0.013*\"gener\" + 0.012*\"offic\" + 0.011*\"divis\" + 0.011*\"regiment\"'),\n",
       " (1,\n",
       "  '0.038*\"album\" + 0.028*\"song\" + 0.026*\"releas\" + 0.026*\"record\" + 0.021*\"band\" + 0.016*\"singl\" + 0.015*\"music\" + 0.014*\"chart\" + 0.013*\"track\" + 0.010*\"guitar\"'),\n",
       " (2,\n",
       "  '0.062*\"german\" + 0.039*\"germani\" + 0.025*\"van\" + 0.023*\"von\" + 0.020*\"der\" + 0.019*\"dutch\" + 0.019*\"berlin\" + 0.015*\"swedish\" + 0.014*\"netherland\" + 0.014*\"sweden\"'),\n",
       " (3,\n",
       "  '0.032*\"john\" + 0.027*\"william\" + 0.019*\"british\" + 0.015*\"georg\" + 0.015*\"london\" + 0.014*\"thoma\" + 0.014*\"sir\" + 0.014*\"jame\" + 0.013*\"royal\" + 0.013*\"henri\"'),\n",
       " (4,\n",
       "  '0.137*\"school\" + 0.040*\"colleg\" + 0.039*\"student\" + 0.033*\"univers\" + 0.030*\"high\" + 0.028*\"educ\" + 0.016*\"year\" + 0.011*\"graduat\" + 0.010*\"state\" + 0.009*\"campu\"'),\n",
       " (5,\n",
       "  '0.030*\"game\" + 0.009*\"develop\" + 0.009*\"player\" + 0.008*\"releas\" + 0.008*\"us\" + 0.008*\"softwar\" + 0.008*\"version\" + 0.008*\"user\" + 0.007*\"data\" + 0.007*\"includ\"'),\n",
       " (6,\n",
       "  '0.061*\"music\" + 0.030*\"perform\" + 0.019*\"theatr\" + 0.018*\"compos\" + 0.016*\"plai\" + 0.016*\"festiv\" + 0.015*\"danc\" + 0.014*\"orchestra\" + 0.012*\"opera\" + 0.011*\"piano\"'),\n",
       " (7,\n",
       "  '0.013*\"number\" + 0.011*\"function\" + 0.010*\"model\" + 0.009*\"valu\" + 0.008*\"set\" + 0.008*\"exampl\" + 0.007*\"gener\" + 0.007*\"theori\" + 0.007*\"point\" + 0.006*\"method\"'),\n",
       " (8,\n",
       "  '0.048*\"india\" + 0.037*\"indian\" + 0.020*\"http\" + 0.016*\"www\" + 0.015*\"pakistan\" + 0.015*\"iran\" + 0.013*\"sri\" + 0.012*\"khan\" + 0.012*\"islam\" + 0.012*\"tamil\"'),\n",
       " (9,\n",
       "  '0.067*\"film\" + 0.025*\"award\" + 0.022*\"seri\" + 0.021*\"episod\" + 0.021*\"best\" + 0.015*\"star\" + 0.012*\"role\" + 0.012*\"actor\" + 0.011*\"televis\" + 0.011*\"produc\"'),\n",
       " (10,\n",
       "  '0.020*\"engin\" + 0.013*\"power\" + 0.011*\"product\" + 0.011*\"design\" + 0.010*\"model\" + 0.009*\"produc\" + 0.008*\"us\" + 0.008*\"electr\" + 0.008*\"type\" + 0.007*\"vehicl\"'),\n",
       " (11,\n",
       "  '0.024*\"law\" + 0.021*\"court\" + 0.016*\"state\" + 0.016*\"act\" + 0.011*\"polic\" + 0.010*\"case\" + 0.009*\"offic\" + 0.009*\"report\" + 0.009*\"right\" + 0.007*\"legal\"'),\n",
       " (12,\n",
       "  '0.056*\"elect\" + 0.041*\"parti\" + 0.023*\"member\" + 0.020*\"vote\" + 0.020*\"presid\" + 0.017*\"democrat\" + 0.017*\"minist\" + 0.013*\"council\" + 0.013*\"repres\" + 0.012*\"polit\"'),\n",
       " (13,\n",
       "  '0.057*\"state\" + 0.035*\"new\" + 0.029*\"american\" + 0.024*\"unit\" + 0.024*\"york\" + 0.020*\"counti\" + 0.015*\"citi\" + 0.014*\"california\" + 0.012*\"washington\" + 0.010*\"texa\"'),\n",
       " (14,\n",
       "  '0.027*\"univers\" + 0.015*\"research\" + 0.014*\"institut\" + 0.012*\"nation\" + 0.012*\"scienc\" + 0.012*\"work\" + 0.012*\"intern\" + 0.011*\"award\" + 0.011*\"develop\" + 0.010*\"organ\"'),\n",
       " (15,\n",
       "  '0.034*\"england\" + 0.024*\"unit\" + 0.021*\"london\" + 0.019*\"cricket\" + 0.019*\"town\" + 0.016*\"citi\" + 0.015*\"scotland\" + 0.013*\"manchest\" + 0.013*\"west\" + 0.012*\"scottish\"'),\n",
       " (16,\n",
       "  '0.031*\"church\" + 0.017*\"famili\" + 0.017*\"di\" + 0.016*\"son\" + 0.015*\"marri\" + 0.014*\"year\" + 0.013*\"father\" + 0.013*\"life\" + 0.013*\"born\" + 0.012*\"daughter\"'),\n",
       " (17,\n",
       "  '0.060*\"race\" + 0.020*\"car\" + 0.017*\"team\" + 0.012*\"finish\" + 0.012*\"tour\" + 0.012*\"driver\" + 0.011*\"ford\" + 0.011*\"time\" + 0.011*\"championship\" + 0.011*\"year\"'),\n",
       " (18,\n",
       "  '0.010*\"water\" + 0.007*\"light\" + 0.007*\"energi\" + 0.007*\"high\" + 0.006*\"surfac\" + 0.006*\"earth\" + 0.006*\"time\" + 0.005*\"effect\" + 0.005*\"temperatur\" + 0.005*\"materi\"'),\n",
       " (19,\n",
       "  '0.022*\"radio\" + 0.020*\"new\" + 0.019*\"broadcast\" + 0.018*\"station\" + 0.014*\"televis\" + 0.013*\"channel\" + 0.013*\"dai\" + 0.011*\"program\" + 0.011*\"host\" + 0.011*\"air\"'),\n",
       " (20,\n",
       "  '0.035*\"win\" + 0.018*\"contest\" + 0.017*\"wrestl\" + 0.017*\"fight\" + 0.016*\"match\" + 0.016*\"titl\" + 0.015*\"championship\" + 0.014*\"team\" + 0.012*\"world\" + 0.011*\"defeat\"'),\n",
       " (21,\n",
       "  '0.011*\"languag\" + 0.007*\"word\" + 0.007*\"form\" + 0.006*\"peopl\" + 0.006*\"differ\" + 0.006*\"cultur\" + 0.006*\"us\" + 0.006*\"mean\" + 0.005*\"tradit\" + 0.005*\"term\"'),\n",
       " (22,\n",
       "  '0.051*\"popul\" + 0.033*\"ag\" + 0.030*\"citi\" + 0.029*\"town\" + 0.027*\"famili\" + 0.026*\"censu\" + 0.023*\"household\" + 0.023*\"commun\" + 0.021*\"peopl\" + 0.021*\"counti\"'),\n",
       " (23,\n",
       "  '0.016*\"medic\" + 0.014*\"health\" + 0.014*\"hospit\" + 0.013*\"cell\" + 0.011*\"diseas\" + 0.010*\"patient\" + 0.009*\"ret\" + 0.009*\"caus\" + 0.008*\"human\" + 0.008*\"treatment\"'),\n",
       " (24,\n",
       "  '0.037*\"ship\" + 0.017*\"navi\" + 0.015*\"sea\" + 0.012*\"island\" + 0.012*\"boat\" + 0.011*\"port\" + 0.010*\"naval\" + 0.010*\"coast\" + 0.010*\"gun\" + 0.009*\"fleet\"'),\n",
       " (25,\n",
       "  '0.044*\"round\" + 0.044*\"final\" + 0.025*\"tournament\" + 0.023*\"group\" + 0.020*\"point\" + 0.020*\"winner\" + 0.018*\"open\" + 0.015*\"place\" + 0.013*\"qualifi\" + 0.012*\"won\"'),\n",
       " (26,\n",
       "  '0.032*\"world\" + 0.030*\"women\" + 0.028*\"championship\" + 0.026*\"olymp\" + 0.023*\"men\" + 0.022*\"event\" + 0.022*\"medal\" + 0.018*\"athlet\" + 0.017*\"gold\" + 0.017*\"nation\"'),\n",
       " (27,\n",
       "  '0.056*\"born\" + 0.034*\"russian\" + 0.026*\"american\" + 0.020*\"russia\" + 0.020*\"soviet\" + 0.017*\"polish\" + 0.015*\"jewish\" + 0.014*\"poland\" + 0.014*\"republ\" + 0.013*\"moscow\"'),\n",
       " (28,\n",
       "  '0.029*\"build\" + 0.025*\"hous\" + 0.014*\"built\" + 0.012*\"locat\" + 0.012*\"street\" + 0.012*\"site\" + 0.011*\"histor\" + 0.009*\"park\" + 0.009*\"citi\" + 0.009*\"place\"'),\n",
       " (29,\n",
       "  '0.039*\"leagu\" + 0.036*\"club\" + 0.035*\"plai\" + 0.031*\"team\" + 0.026*\"footbal\" + 0.026*\"season\" + 0.023*\"cup\" + 0.018*\"goal\" + 0.016*\"player\" + 0.016*\"match\"'),\n",
       " (30,\n",
       "  '0.053*\"french\" + 0.041*\"franc\" + 0.027*\"italian\" + 0.025*\"pari\" + 0.022*\"saint\" + 0.020*\"itali\" + 0.018*\"jean\" + 0.014*\"de\" + 0.011*\"loui\" + 0.011*\"le\"'),\n",
       " (31,\n",
       "  '0.067*\"australia\" + 0.058*\"australian\" + 0.051*\"new\" + 0.040*\"china\" + 0.033*\"zealand\" + 0.032*\"south\" + 0.027*\"chines\" + 0.021*\"sydnei\" + 0.015*\"melbourn\" + 0.013*\"queensland\"'),\n",
       " (32,\n",
       "  '0.026*\"speci\" + 0.011*\"famili\" + 0.009*\"plant\" + 0.008*\"white\" + 0.008*\"bird\" + 0.007*\"genu\" + 0.007*\"red\" + 0.007*\"forest\" + 0.007*\"fish\" + 0.006*\"tree\"'),\n",
       " (33,\n",
       "  '0.033*\"compani\" + 0.013*\"million\" + 0.012*\"busi\" + 0.012*\"market\" + 0.011*\"product\" + 0.010*\"bank\" + 0.010*\"year\" + 0.009*\"industri\" + 0.008*\"oper\" + 0.008*\"new\"'),\n",
       " (34,\n",
       "  '0.085*\"island\" + 0.073*\"canada\" + 0.065*\"canadian\" + 0.026*\"toronto\" + 0.025*\"ontario\" + 0.017*\"korean\" + 0.017*\"korea\" + 0.016*\"quebec\" + 0.016*\"montreal\" + 0.016*\"british\"'),\n",
       " (35,\n",
       "  '0.034*\"kong\" + 0.034*\"japanes\" + 0.033*\"hong\" + 0.023*\"lee\" + 0.021*\"singapor\" + 0.019*\"chines\" + 0.018*\"kim\" + 0.015*\"japan\" + 0.014*\"indonesia\" + 0.014*\"thailand\"'),\n",
       " (36,\n",
       "  '0.054*\"art\" + 0.034*\"museum\" + 0.030*\"jpg\" + 0.027*\"file\" + 0.024*\"work\" + 0.022*\"paint\" + 0.020*\"artist\" + 0.019*\"design\" + 0.017*\"imag\" + 0.017*\"exhibit\"'),\n",
       " (37,\n",
       "  '0.008*\"time\" + 0.007*\"man\" + 0.005*\"later\" + 0.005*\"appear\" + 0.005*\"charact\" + 0.005*\"kill\" + 0.004*\"like\" + 0.004*\"friend\" + 0.004*\"return\" + 0.004*\"end\"'),\n",
       " (38,\n",
       "  '0.014*\"govern\" + 0.012*\"state\" + 0.012*\"nation\" + 0.010*\"war\" + 0.009*\"polit\" + 0.008*\"countri\" + 0.008*\"peopl\" + 0.007*\"group\" + 0.007*\"unit\" + 0.007*\"support\"'),\n",
       " (39,\n",
       "  '0.050*\"air\" + 0.026*\"aircraft\" + 0.026*\"oper\" + 0.025*\"airport\" + 0.017*\"forc\" + 0.017*\"flight\" + 0.015*\"squadron\" + 0.014*\"unit\" + 0.012*\"base\" + 0.011*\"wing\"'),\n",
       " (40,\n",
       "  '0.052*\"bar\" + 0.038*\"africa\" + 0.033*\"text\" + 0.033*\"african\" + 0.031*\"till\" + 0.029*\"color\" + 0.026*\"south\" + 0.023*\"black\" + 0.013*\"tropic\" + 0.013*\"storm\"'),\n",
       " (41,\n",
       "  '0.039*\"book\" + 0.033*\"publish\" + 0.021*\"work\" + 0.015*\"new\" + 0.013*\"press\" + 0.013*\"univers\" + 0.013*\"edit\" + 0.011*\"stori\" + 0.011*\"novel\" + 0.011*\"author\"'),\n",
       " (42,\n",
       "  '0.026*\"king\" + 0.019*\"centuri\" + 0.010*\"princ\" + 0.009*\"empir\" + 0.009*\"kingdom\" + 0.009*\"emperor\" + 0.009*\"greek\" + 0.008*\"roman\" + 0.007*\"ancient\" + 0.006*\"year\"'),\n",
       " (43,\n",
       "  '0.033*\"san\" + 0.022*\"spanish\" + 0.017*\"mexico\" + 0.016*\"del\" + 0.013*\"spain\" + 0.012*\"santa\" + 0.011*\"brazil\" + 0.011*\"juan\" + 0.010*\"josé\" + 0.009*\"francisco\"'),\n",
       " (44,\n",
       "  '0.029*\"game\" + 0.027*\"season\" + 0.023*\"team\" + 0.015*\"plai\" + 0.014*\"coach\" + 0.014*\"player\" + 0.011*\"footbal\" + 0.010*\"year\" + 0.010*\"leagu\" + 0.009*\"record\"'),\n",
       " (45,\n",
       "  '0.015*\"john\" + 0.011*\"david\" + 0.010*\"michael\" + 0.008*\"paul\" + 0.008*\"smith\" + 0.007*\"robert\" + 0.007*\"jame\" + 0.006*\"peter\" + 0.006*\"jack\" + 0.006*\"jone\"'),\n",
       " (46,\n",
       "  '0.133*\"class\" + 0.062*\"align\" + 0.060*\"left\" + 0.056*\"wikit\" + 0.046*\"style\" + 0.043*\"center\" + 0.035*\"right\" + 0.032*\"philippin\" + 0.032*\"list\" + 0.026*\"text\"'),\n",
       " (47,\n",
       "  '0.025*\"river\" + 0.024*\"station\" + 0.021*\"line\" + 0.020*\"road\" + 0.017*\"railwai\" + 0.015*\"rout\" + 0.013*\"lake\" + 0.012*\"park\" + 0.011*\"bridg\" + 0.011*\"area\"'),\n",
       " (48,\n",
       "  '0.072*\"octob\" + 0.070*\"septemb\" + 0.069*\"march\" + 0.062*\"decemb\" + 0.062*\"januari\" + 0.062*\"novemb\" + 0.061*\"juli\" + 0.061*\"august\" + 0.060*\"april\" + 0.058*\"june\"'),\n",
       " (49,\n",
       "  '0.093*\"district\" + 0.066*\"villag\" + 0.047*\"region\" + 0.039*\"east\" + 0.039*\"west\" + 0.038*\"north\" + 0.036*\"counti\" + 0.033*\"south\" + 0.032*\"municip\" + 0.029*\"provinc\"')]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lda = LdaModel.load('lda.model')\n",
    "row.update(get_tm_metrics(lda, test_corpus))\n",
    "tm_metrics = tm_metrics.append(pd.Series(row), ignore_index=True)\n",
    "\n",
    "lda.show_topics(50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>coherence</th>\n",
       "      <th>l2_norm</th>\n",
       "      <th>model</th>\n",
       "      <th>perplexity</th>\n",
       "      <th>topics</th>\n",
       "      <th>train_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-2.814135</td>\n",
       "      <td>7.265412</td>\n",
       "      <td>nmf</td>\n",
       "      <td>975.740399</td>\n",
       "      <td>[(24, 0.131*\"mount\" + 0.129*\"lemmon\" + 0.129*\"...</td>\n",
       "      <td>4394.560518</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-2.436650</td>\n",
       "      <td>7.268837</td>\n",
       "      <td>nmf_with_r</td>\n",
       "      <td>985.570926</td>\n",
       "      <td>[(49, 0.112*\"peak\" + 0.111*\"kitt\" + 0.111*\"mou...</td>\n",
       "      <td>26451.927848</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-2.514469</td>\n",
       "      <td>7.371544</td>\n",
       "      <td>lda</td>\n",
       "      <td>4727.075546</td>\n",
       "      <td>[(35, 0.034*\"kong\" + 0.034*\"japanes\" + 0.033*\"...</td>\n",
       "      <td>8278.891060</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   coherence   l2_norm       model   perplexity  \\\n",
       "0  -2.814135  7.265412         nmf   975.740399   \n",
       "1  -2.436650  7.268837  nmf_with_r   985.570926   \n",
       "2  -2.514469  7.371544         lda  4727.075546   \n",
       "\n",
       "                                              topics    train_time  \n",
       "0  [(24, 0.131*\"mount\" + 0.129*\"lemmon\" + 0.129*\"...   4394.560518  \n",
       "1  [(49, 0.112*\"peak\" + 0.111*\"kitt\" + 0.111*\"mou...  26451.927848  \n",
       "2  [(35, 0.034*\"kong\" + 0.034*\"japanes\" + 0.033*\"...   8278.891060  "
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tm_metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### RAM Usage:\n",
    "- nmf: 100-150Mb\n",
    "- nmf_with_r: 3-9Gb\n",
    "- lda: 100Mb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "====================\n",
      "nmf\n",
      "====================\n",
      "\n",
      "Topic: 24\n",
      "0.131*\"mount\" + 0.129*\"lemmon\" + 0.129*\"peak\" + 0.127*\"kitt\" + 0.127*\"spacewatch\" + 0.065*\"survei\" + 0.037*\"octob\" + 0.031*\"septemb\" + 0.023*\"css\" + 0.023*\"catalina\"\n",
      "\n",
      "Topic: 32\n",
      "0.196*\"linear\" + 0.195*\"socorro\" + 0.045*\"septemb\" + 0.039*\"neat\" + 0.035*\"palomar\" + 0.032*\"octob\" + 0.024*\"kitt\" + 0.024*\"peak\" + 0.024*\"spacewatch\" + 0.023*\"anderson\"\n",
      "\n",
      "Topic: 8\n",
      "0.331*\"align\" + 0.270*\"left\" + 0.071*\"right\" + 0.040*\"text\" + 0.035*\"style\" + 0.022*\"center\" + 0.013*\"bar\" + 0.009*\"till\" + 0.008*\"bgcolor\" + 0.008*\"color\"\n",
      "\n",
      "Topic: 27\n",
      "0.186*\"district\" + 0.027*\"pennsylvania\" + 0.022*\"grade\" + 0.017*\"fund\" + 0.017*\"educ\" + 0.017*\"basic\" + 0.016*\"level\" + 0.014*\"oblast\" + 0.014*\"rural\" + 0.013*\"tax\"\n",
      "\n",
      "Topic: 48\n",
      "0.103*\"art\" + 0.066*\"museum\" + 0.040*\"paint\" + 0.035*\"work\" + 0.026*\"artist\" + 0.024*\"galleri\" + 0.022*\"exhibit\" + 0.019*\"collect\" + 0.015*\"histori\" + 0.013*\"jpg\"\n",
      "\n",
      "Topic: 11\n",
      "0.122*\"new\" + 0.043*\"york\" + 0.009*\"zealand\" + 0.007*\"jersei\" + 0.006*\"american\" + 0.006*\"time\" + 0.006*\"australia\" + 0.005*\"radio\" + 0.005*\"press\" + 0.005*\"washington\"\n",
      "\n",
      "Topic: 20\n",
      "0.008*\"us\" + 0.006*\"gener\" + 0.006*\"model\" + 0.006*\"data\" + 0.006*\"design\" + 0.005*\"time\" + 0.005*\"function\" + 0.005*\"number\" + 0.005*\"process\" + 0.005*\"exampl\"\n",
      "\n",
      "Topic: 28\n",
      "0.074*\"year\" + 0.022*\"dai\" + 0.012*\"time\" + 0.008*\"ag\" + 0.006*\"month\" + 0.006*\"includ\" + 0.006*\"follow\" + 0.005*\"later\" + 0.005*\"old\" + 0.005*\"student\"\n",
      "\n",
      "Topic: 38\n",
      "0.033*\"royal\" + 0.025*\"john\" + 0.025*\"william\" + 0.016*\"lieuten\" + 0.013*\"georg\" + 0.012*\"offic\" + 0.012*\"jame\" + 0.011*\"sergeant\" + 0.011*\"major\" + 0.010*\"charl\"\n",
      "\n",
      "Topic: 19\n",
      "0.012*\"area\" + 0.011*\"river\" + 0.010*\"water\" + 0.004*\"larg\" + 0.004*\"region\" + 0.004*\"lake\" + 0.004*\"power\" + 0.004*\"high\" + 0.004*\"bar\" + 0.004*\"form\"\n",
      "\n",
      "\n",
      "====================\n",
      "nmf_with_r\n",
      "====================\n",
      "\n",
      "Topic: 49\n",
      "0.112*\"peak\" + 0.111*\"kitt\" + 0.111*\"mount\" + 0.111*\"spacewatch\" + 0.109*\"lemmon\" + 0.055*\"survei\" + 0.044*\"octob\" + 0.041*\"septemb\" + 0.026*\"novemb\" + 0.021*\"march\"\n",
      "\n",
      "Topic: 32\n",
      "0.194*\"linear\" + 0.193*\"socorro\" + 0.047*\"septemb\" + 0.038*\"neat\" + 0.034*\"palomar\" + 0.034*\"octob\" + 0.025*\"decemb\" + 0.024*\"august\" + 0.023*\"anderson\" + 0.023*\"mesa\"\n",
      "\n",
      "Topic: 48\n",
      "0.112*\"art\" + 0.063*\"museum\" + 0.037*\"paint\" + 0.036*\"work\" + 0.028*\"artist\" + 0.026*\"galleri\" + 0.025*\"exhibit\" + 0.020*\"collect\" + 0.015*\"histori\" + 0.014*\"design\"\n",
      "\n",
      "Topic: 4\n",
      "0.093*\"club\" + 0.049*\"cup\" + 0.033*\"footbal\" + 0.031*\"goal\" + 0.022*\"leagu\" + 0.022*\"unit\" + 0.022*\"plai\" + 0.022*\"match\" + 0.018*\"score\" + 0.015*\"player\"\n",
      "\n",
      "Topic: 27\n",
      "0.159*\"district\" + 0.031*\"pennsylvania\" + 0.025*\"grade\" + 0.021*\"educ\" + 0.019*\"fund\" + 0.018*\"basic\" + 0.017*\"level\" + 0.015*\"student\" + 0.014*\"receiv\" + 0.014*\"tax\"\n",
      "\n",
      "Topic: 17\n",
      "0.095*\"season\" + 0.014*\"plai\" + 0.010*\"coach\" + 0.009*\"final\" + 0.009*\"second\" + 0.008*\"win\" + 0.008*\"record\" + 0.008*\"career\" + 0.008*\"finish\" + 0.007*\"point\"\n",
      "\n",
      "Topic: 40\n",
      "0.009*\"time\" + 0.008*\"later\" + 0.007*\"kill\" + 0.006*\"appear\" + 0.005*\"man\" + 0.005*\"death\" + 0.005*\"father\" + 0.005*\"return\" + 0.005*\"son\" + 0.004*\"charact\"\n",
      "\n",
      "Topic: 20\n",
      "0.008*\"us\" + 0.006*\"gener\" + 0.005*\"design\" + 0.005*\"model\" + 0.005*\"develop\" + 0.005*\"time\" + 0.004*\"data\" + 0.004*\"number\" + 0.004*\"function\" + 0.004*\"process\"\n",
      "\n",
      "Topic: 19\n",
      "0.009*\"water\" + 0.008*\"area\" + 0.008*\"speci\" + 0.005*\"larg\" + 0.004*\"order\" + 0.004*\"region\" + 0.004*\"includ\" + 0.004*\"black\" + 0.004*\"famili\" + 0.004*\"popul\"\n",
      "\n",
      "Topic: 38\n",
      "0.044*\"royal\" + 0.020*\"william\" + 0.019*\"john\" + 0.016*\"corp\" + 0.014*\"lieuten\" + 0.013*\"capt\" + 0.012*\"engin\" + 0.011*\"armi\" + 0.011*\"georg\" + 0.011*\"temp\"\n",
      "\n",
      "\n",
      "====================\n",
      "lda\n",
      "====================\n",
      "\n",
      "Topic: 35\n",
      "0.034*\"kong\" + 0.034*\"japanes\" + 0.033*\"hong\" + 0.023*\"lee\" + 0.021*\"singapor\" + 0.019*\"chines\" + 0.018*\"kim\" + 0.015*\"japan\" + 0.014*\"indonesia\" + 0.014*\"thailand\"\n",
      "\n",
      "Topic: 23\n",
      "0.016*\"medic\" + 0.014*\"health\" + 0.014*\"hospit\" + 0.013*\"cell\" + 0.011*\"diseas\" + 0.010*\"patient\" + 0.009*\"ret\" + 0.009*\"caus\" + 0.008*\"human\" + 0.008*\"treatment\"\n",
      "\n",
      "Topic: 47\n",
      "0.025*\"river\" + 0.024*\"station\" + 0.021*\"line\" + 0.020*\"road\" + 0.017*\"railwai\" + 0.015*\"rout\" + 0.013*\"lake\" + 0.012*\"park\" + 0.011*\"bridg\" + 0.011*\"area\"\n",
      "\n",
      "Topic: 14\n",
      "0.027*\"univers\" + 0.015*\"research\" + 0.014*\"institut\" + 0.012*\"nation\" + 0.012*\"scienc\" + 0.012*\"work\" + 0.012*\"intern\" + 0.011*\"award\" + 0.011*\"develop\" + 0.010*\"organ\"\n",
      "\n",
      "Topic: 39\n",
      "0.050*\"air\" + 0.026*\"aircraft\" + 0.026*\"oper\" + 0.025*\"airport\" + 0.017*\"forc\" + 0.017*\"flight\" + 0.015*\"squadron\" + 0.014*\"unit\" + 0.012*\"base\" + 0.011*\"wing\"\n",
      "\n",
      "Topic: 17\n",
      "0.060*\"race\" + 0.020*\"car\" + 0.017*\"team\" + 0.012*\"finish\" + 0.012*\"tour\" + 0.012*\"driver\" + 0.011*\"ford\" + 0.011*\"time\" + 0.011*\"championship\" + 0.011*\"year\"\n",
      "\n",
      "Topic: 4\n",
      "0.137*\"school\" + 0.040*\"colleg\" + 0.039*\"student\" + 0.033*\"univers\" + 0.030*\"high\" + 0.028*\"educ\" + 0.016*\"year\" + 0.011*\"graduat\" + 0.010*\"state\" + 0.009*\"campu\"\n",
      "\n",
      "Topic: 8\n",
      "0.048*\"india\" + 0.037*\"indian\" + 0.020*\"http\" + 0.016*\"www\" + 0.015*\"pakistan\" + 0.015*\"iran\" + 0.013*\"sri\" + 0.012*\"khan\" + 0.012*\"islam\" + 0.012*\"tamil\"\n",
      "\n",
      "Topic: 2\n",
      "0.062*\"german\" + 0.039*\"germani\" + 0.025*\"van\" + 0.023*\"von\" + 0.020*\"der\" + 0.019*\"dutch\" + 0.019*\"berlin\" + 0.015*\"swedish\" + 0.014*\"netherland\" + 0.014*\"sweden\"\n",
      "\n",
      "Topic: 11\n",
      "0.024*\"law\" + 0.021*\"court\" + 0.016*\"state\" + 0.016*\"act\" + 0.011*\"polic\" + 0.010*\"case\" + 0.009*\"offic\" + 0.009*\"report\" + 0.009*\"right\" + 0.007*\"legal\"\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for row_idx, row in tm_metrics.iterrows():\n",
    "    print('='*20)\n",
    "    print(row['model'])\n",
    "    print('='*20)\n",
    "    print()\n",
    "    for topic_idx, tokens in row['topics']:\n",
    "        print('Topic: {}'.format(topic_idx))\n",
    "        print(tokens)\n",
    "        print()\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, NMF can be significantly faster than LDA without sacrificing quality of topics too much (or not sacrificing at all)\n",
    "\n",
    "Moreover, NMF can be very flexible on RAM usage due to sparsity option, which leaves only small amount of elements in inner matrices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "jupytext": {
   "text_representation": {
    "extension": ".py",
    "format_name": "percent",
    "format_version": "1.2",
    "jupytext_version": "0.8.6"
   }
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}