{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Gensim Tutorial on Online Non-Negative Matrix Factorization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook explains the basic ideas behind the open source NMF implementation in [Gensim](https://github.com/RaRe-Technologies/gensim), including code examples for applying NMF to text processing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What's in this tutorial?\n", "\n", "1. [Introduction: Why NMF?](#1.-Introduction-to-NMF)\n", "2. [Code example on 20 Newsgroups](#2.-Code-example:-NMF-on-20-Newsgroups)\n", "3. [Benchmarks against Sklearn's NMF and Gensim's LDA](#3.-Benchmarks)\n", "4. [Large-scale NMF training on the English Wikipedia (sparse text vectors)](#4.-NMF-on-English-Wikipedia)\n", "5. [NMF on face decomposition (dense image vectors)](#5.-And-now-for-something-completely-different:-Face-decomposition-from-images)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "# 1. Introduction to NMF" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "## What's in a name?\n", "\n", "Gensim's Online Non-Negative Matrix Factorization (NMF, NNMF, ONMF) implementation is based on [Renbo Zhao, Vincent Y. F. Tan: Online Nonnegative Matrix Factorization with Outliers, 2016](https://arxiv.org/abs/1604.02634) and is optimized for extremely large, sparse, streamed inputs. Such inputs are common in NLP, in **unsupervised training** on massive text corpora.\n", "\n", "* Why **Online**? Because corpora and datasets in modern ML can be very large, and RAM is limited. Unlike batch algorithms, online algorithms learn iteratively, streaming through the available training examples, without loading the entire dataset into RAM or requiring random access to the data examples.\n", "\n", "* Why **Non-Negative**? Because non-negativity leads to more interpretable, sparse \"human-friendly\" topics. 
This is in contrast to e.g. SVD (another popular matrix factorization method with a [super-efficient implementation in Gensim](https://radimrehurek.com/gensim/models/lsimodel.html)), which produces dense factors with negative values and thus harder-to-interpret topics.\n", "\n", "* **Matrix factorizations** are a cornerstone of modern machine learning. They can be used either directly (recommendation systems, bi-clustering, image compression, topic modeling…) or as internal routines in more complex deep learning algorithms." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "## How ONMF works\n", "\n", "Terminology:\n", "- `corpus` is a stream of input documents = training examples\n", "- `batch` is a chunk of input corpus, a word-document matrix mini-batch that fits in RAM\n", "- `W` is a word-topic matrix (to be learned; stored in the resulting model)\n", "- `h` is a topic-document matrix (to be learned; not stored, but rather inferred for documents on-the-fly)\n", "- `A`, `B` - matrices that accumulate information from consecutive chunks. `A = h.dot(ht)`, `B = v.dot(ht)`, where `v` is the current batch and `ht` is the transpose of `h`.\n", "\n", "The idea behind the algorithm is as follows:\n", "\n", "```\n", "    Initialize W, A and B matrices\n", "\n", "    for batch in input corpus batches:\n", "        infer h:\n", "            do coordinate gradient descent step to find h that minimizes ||batch - Wh|| in L2 norm\n", "\n", "            bound h so that it is non-negative\n", "\n", "        update A and B:\n", "            A = h.dot(ht)\n", "            B = batch.dot(ht)\n", "\n", "        update W:\n", "            do gradient descent step to find W that minimizes 0.5*trace(WtWA) - trace(WtB)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. 
Code example: NMF on 20 Newsgroups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's import the modules we'll be using throughout this tutorial (`numpy==1.14.2`, `matplotlib==3.0.2`, `pandas==0.24.1`, `sklearn==0.19.1`, `gensim==3.7.1`) and set up logging at INFO level.\n", "\n", "Gensim uses logging generously to inform users what's going on. Eyeballing the logs is a good sanity check, to make sure everything is working as expected.\n", "\n", "Only `numpy` and `gensim` are actually needed to train and use NMF. The other imports are used only to make our life a little easier in this tutorial." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import logging\n", "import time\n", "from contextlib import contextmanager\n", "import os\n", "from multiprocessing import Process\n", "import psutil\n", "\n", "import numpy as np\n", "import pandas as pd\n", "from numpy.random import RandomState\n", "from sklearn import decomposition\n", "from sklearn.cluster import MiniBatchKMeans\n", "from sklearn.datasets import fetch_olivetti_faces\n", "from sklearn.decomposition.nmf import NMF as SklearnNmf\n", "from sklearn.linear_model import LogisticRegressionCV\n", "from sklearn.metrics import f1_score\n", "\n", "import gensim.downloader\n", "from gensim import matutils, utils\n", "from gensim.corpora import Dictionary\n", "from gensim.models import CoherenceModel, LdaModel, TfidfModel, LsiModel\n", "from gensim.models.basemodel import BaseTopicModel\n", "from gensim.models.nmf import Nmf as GensimNmf\n", "from gensim.parsing.preprocessing import preprocess_string\n", "\n", "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset preparation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's load the notorious [20 
Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/) from Gensim's [repository of pre-trained models and corpora](https://github.com/RaRe-Technologies/gensim-data):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "newsgroups = gensim.downloader.load('20-newsgroups')\n", "\n", "categories = [\n", "    'alt.atheism',\n", "    'comp.graphics',\n", "    'rec.motorcycles',\n", "    'talk.politics.mideast',\n", "    'sci.space'\n", "]\n", "\n", "categories = {name: idx for idx, name in enumerate(categories)}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a train/test split:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "random_state = RandomState(42)\n", "\n", "trainset = np.array([\n", "    {\n", "        'data': doc['data'],\n", "        'target': categories[doc['topic']],\n", "    }\n", "    for doc in newsgroups\n", "    if doc['topic'] in categories and doc['set'] == 'train'\n", "])\n", "random_state.shuffle(trainset)\n", "\n", "testset = np.array([\n", "    {\n", "        'data': doc['data'],\n", "        'target': categories[doc['topic']],\n", "    }\n", "    for doc in newsgroups\n", "    if doc['topic'] in categories and doc['set'] == 'test'\n", "])\n", "random_state.shuffle(testset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll use very [simple preprocessing with stemming](https://radimrehurek.com/gensim/parsing/preprocessing.html#gensim.parsing.preprocessing.preprocess_string) to tokenize each document. YMMV; in your application, use whatever preprocessing makes sense in your domain. Correctly preparing the input has a [major impact](https://en.wikipedia.org/wiki/Garbage_in,_garbage_out) on any subsequent ML training." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "train_documents = [preprocess_string(doc['data']) for doc in trainset]\n", "test_documents = [preprocess_string(doc['data']) for doc in testset]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dictionary compilation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create a mapping between tokens and their ids. Another option would be a [HashDictionary](https://radimrehurek.com/gensim/corpora/hashdictionary.html), saving ourselves one pass over the training documents." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-05-06 15:57:16,471 : INFO : adding document #0 to Dictionary(0 unique tokens: [])\n", "2019-05-06 15:57:16,781 : INFO : built Dictionary(25279 unique tokens: ['sketch', 'addario', 'foyer', 'labratsat', 'reclaim']...) from 2819 documents (total 435328 corpus positions)\n", "2019-05-06 15:57:16,809 : INFO : discarding 18198 tokens: [('batka', 1), ('batkaj', 1), ('beatl', 1), ('ccmail', 3), ('dayton', 4), ('edu', 1785), ('inhibit', 1), ('jbatka', 1), ('line', 2748), ('organ', 2602)]...\n", "2019-05-06 15:57:16,810 : INFO : keeping 7081 tokens which were in no less than 5 and no more than 1409 (=50.0%) documents\n", "2019-05-06 15:57:16,821 : INFO : resulting dictionary: Dictionary(7081 unique tokens: ['colost', 'choke', 'editor', 'china', 'piss']...)\n" ] } ], "source": [ "dictionary = Dictionary(train_documents)\n", "dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=20000) # filter out too in/frequent tokens" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create training corpus" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's vectorize the training corpus into the bag-of-words format. 
We'll train LDA on the BOW corpus and NMF on a TF-IDF corpus:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "tfidf = TfidfModel(dictionary=dictionary)\n", "\n", "train_corpus = [\n", "    dictionary.doc2bow(document)\n", "    for document\n", "    in train_documents\n", "]\n", "\n", "test_corpus = [\n", "    dictionary.doc2bow(document)\n", "    for document\n", "    in test_documents\n", "]\n", "\n", "train_corpus_tfidf = list(tfidf[train_corpus])\n", "\n", "test_corpus_tfidf = list(tfidf[test_corpus])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we simply stored the bag-of-words vectors in a `list`, but Gensim accepts [any iterable](https://radimrehurek.com/gensim/tut1.html#corpus-streaming-one-document-at-a-time) as input, including streamed ones. To learn more about memory-efficient input iterables, see our [Data Streaming in Python: Generators, Iterators, Iterables](https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/) tutorial." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## NMF Model Training\n", "\n", "The API works in the same way as other Gensim models, such as [LdaModel](https://radimrehurek.com/gensim/models/ldamodel.html) or [LsiModel](https://radimrehurek.com/gensim/models/lsimodel.html)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notable model parameters:\n", "\n", "- `kappa` float, optional\n", "\n", " Gradient descent step size.\n", " Larger value makes the model train faster, but could lead to non-convergence if set too large.\n", " \n", "- `w_max_iter` int, optional\n", "\n", " Maximum number of iterations to train W per each batch.\n", " \n", "- `w_stop_condition` float, optional\n", "\n", " If the error difference gets smaller than this, training of ``W`` stops for the current batch.\n", " \n", "- `h_r_max_iter` int, optional\n", "\n", " Maximum number of iterations to train h per each batch.\n", " \n", "- `h_r_stop_condition` float, optional\n", "\n", " If the error difference gets smaller than this, training of ``h`` stops for the current batch." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Learn an NMF model with 5 topics:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-05-06 15:57:18,565 : INFO : running NMF training, 5 topics, 5 passes over the supplied corpus of 2819 documents, evaluating l2 norm every 2819 documents\n", "2019-05-06 15:57:18,581 : INFO : PROGRESS: pass 0, at document #1000/2819\n", "2019-05-06 15:57:18,604 : INFO : W error: -11.552583824232896\n", "2019-05-06 15:57:18,619 : INFO : PROGRESS: pass 0, at document #2000/2819\n", "2019-05-06 15:57:18,627 : INFO : W error: -13.74803744073488\n", "2019-05-06 15:57:18,639 : INFO : PROGRESS: pass 0, at document #2819/2819\n", "2019-05-06 15:57:18,735 : INFO : L2 norm: 28.141396382337142\n", "2019-05-06 15:57:18,770 : INFO : topic #0 (0.395): 0.011*\"isra\" + 0.010*\"israel\" + 0.006*\"arab\" + 0.006*\"jew\" + 0.004*\"palestinian\" + 0.004*\"henri\" + 0.003*\"toronto\" + 0.003*\"question\" + 0.003*\"kill\" + 0.003*\"hernlem\"\n", "2019-05-06 15:57:18,771 : INFO : topic #1 (0.352): 0.008*\"space\" + 0.005*\"access\" + 0.005*\"nasa\" + 
0.004*\"pat\" + 0.003*\"digex\" + 0.003*\"orbit\" + 0.003*\"shuttl\" + 0.003*\"data\" + 0.003*\"graphic\" + 0.003*\"com\"\n", "2019-05-06 15:57:18,772 : INFO : topic #2 (0.378): 0.012*\"armenian\" + 0.006*\"turkish\" + 0.004*\"greek\" + 0.004*\"peopl\" + 0.004*\"armenia\" + 0.004*\"turk\" + 0.004*\"argic\" + 0.004*\"bike\" + 0.003*\"serdar\" + 0.003*\"turkei\"\n", "2019-05-06 15:57:18,772 : INFO : topic #3 (0.412): 0.010*\"moral\" + 0.006*\"keith\" + 0.004*\"anim\" + 0.003*\"jake\" + 0.003*\"boni\" + 0.003*\"instinct\" + 0.003*\"act\" + 0.003*\"think\" + 0.003*\"object\" + 0.003*\"caltech\"\n", "2019-05-06 15:57:18,773 : INFO : topic #4 (0.428): 0.009*\"islam\" + 0.008*\"god\" + 0.006*\"livesei\" + 0.006*\"muslim\" + 0.005*\"imag\" + 0.005*\"sgi\" + 0.005*\"jaeger\" + 0.004*\"jon\" + 0.004*\"solntz\" + 0.004*\"wpd\"\n", "2019-05-06 15:57:18,776 : INFO : W error: -14.346700852834074\n", "2019-05-06 15:57:18,791 : INFO : PROGRESS: pass 1, at document #1000/2819\n", "2019-05-06 15:57:18,797 : INFO : W error: -15.92349117004869\n", "2019-05-06 15:57:18,811 : INFO : PROGRESS: pass 1, at document #2000/2819\n", "2019-05-06 15:57:18,816 : INFO : W error: -17.04437515427273\n", "2019-05-06 15:57:18,829 : INFO : PROGRESS: pass 1, at document #2819/2819\n", "2019-05-06 15:57:18,922 : INFO : L2 norm: 28.021861174968205\n", "2019-05-06 15:57:18,956 : INFO : topic #0 (0.341): 0.014*\"israel\" + 0.013*\"isra\" + 0.008*\"arab\" + 0.007*\"jew\" + 0.005*\"palestinian\" + 0.004*\"lebanes\" + 0.003*\"peac\" + 0.003*\"henri\" + 0.003*\"attack\" + 0.003*\"polici\"\n", "2019-05-06 15:57:18,956 : INFO : topic #1 (0.252): 0.008*\"space\" + 0.005*\"nasa\" + 0.004*\"access\" + 0.003*\"pat\" + 0.003*\"orbit\" + 0.003*\"digex\" + 0.003*\"launch\" + 0.003*\"shuttl\" + 0.003*\"graphic\" + 0.003*\"com\"\n", "2019-05-06 15:57:18,957 : INFO : topic #2 (0.295): 0.020*\"armenian\" + 0.010*\"turkish\" + 0.006*\"armenia\" + 0.006*\"turk\" + 0.006*\"argic\" + 0.006*\"serdar\" + 0.005*\"greek\" + 
0.005*\"turkei\" + 0.004*\"peopl\" + 0.004*\"genocid\"\n", "2019-05-06 15:57:18,958 : INFO : topic #3 (0.345): 0.013*\"moral\" + 0.011*\"keith\" + 0.006*\"object\" + 0.005*\"caltech\" + 0.004*\"schneider\" + 0.004*\"anim\" + 0.004*\"jake\" + 0.004*\"allan\" + 0.004*\"boni\" + 0.004*\"cco\"\n", "2019-05-06 15:57:18,959 : INFO : topic #4 (0.375): 0.011*\"islam\" + 0.011*\"god\" + 0.006*\"livesei\" + 0.006*\"sgi\" + 0.006*\"jaeger\" + 0.005*\"muslim\" + 0.005*\"jon\" + 0.005*\"imag\" + 0.005*\"religion\" + 0.005*\"solntz\"\n", "2019-05-06 15:57:18,961 : INFO : W error: -17.08829704968913\n", "2019-05-06 15:57:18,975 : INFO : PROGRESS: pass 2, at document #1000/2819\n", "2019-05-06 15:57:18,981 : INFO : W error: -17.73961065930116\n", "2019-05-06 15:57:18,996 : INFO : PROGRESS: pass 2, at document #2000/2819\n", "2019-05-06 15:57:19,001 : INFO : W error: -18.289085863153712\n", "2019-05-06 15:57:19,013 : INFO : PROGRESS: pass 2, at document #2819/2819\n", "2019-05-06 15:57:19,107 : INFO : L2 norm: 28.001900396967052\n", "2019-05-06 15:57:19,141 : INFO : topic #0 (0.341): 0.014*\"israel\" + 0.014*\"isra\" + 0.009*\"arab\" + 0.008*\"jew\" + 0.005*\"palestinian\" + 0.004*\"lebanes\" + 0.004*\"peac\" + 0.003*\"attack\" + 0.003*\"polici\" + 0.003*\"lebanon\"\n", "2019-05-06 15:57:19,142 : INFO : topic #1 (0.230): 0.007*\"space\" + 0.005*\"nasa\" + 0.004*\"access\" + 0.003*\"orbit\" + 0.003*\"pat\" + 0.003*\"launch\" + 0.003*\"digex\" + 0.003*\"gov\" + 0.003*\"com\" + 0.003*\"graphic\"\n", "2019-05-06 15:57:19,142 : INFO : topic #2 (0.284): 0.021*\"armenian\" + 0.011*\"turkish\" + 0.007*\"armenia\" + 0.007*\"turk\" + 0.007*\"argic\" + 0.006*\"serdar\" + 0.006*\"turkei\" + 0.005*\"greek\" + 0.005*\"genocid\" + 0.004*\"soviet\"\n", "2019-05-06 15:57:19,143 : INFO : topic #3 (0.346): 0.015*\"moral\" + 0.013*\"keith\" + 0.006*\"object\" + 0.006*\"caltech\" + 0.005*\"schneider\" + 0.005*\"allan\" + 0.005*\"cco\" + 0.004*\"anim\" + 0.004*\"jake\" + 0.003*\"boni\"\n", "2019-05-06 
15:57:19,144 : INFO : topic #4 (0.365): 0.012*\"islam\" + 0.011*\"god\" + 0.006*\"livesei\" + 0.006*\"sgi\" + 0.006*\"jaeger\" + 0.005*\"muslim\" + 0.005*\"jon\" + 0.005*\"religion\" + 0.004*\"solntz\" + 0.004*\"wpd\"\n", "2019-05-06 15:57:19,146 : INFO : W error: -18.220202313431095\n", "2019-05-06 15:57:19,160 : INFO : PROGRESS: pass 3, at document #1000/2819\n", "2019-05-06 15:57:19,165 : INFO : W error: -18.590446221955172\n", "2019-05-06 15:57:19,180 : INFO : PROGRESS: pass 3, at document #2000/2819\n", "2019-05-06 15:57:19,185 : INFO : W error: -18.936998738726114\n", "2019-05-06 15:57:19,197 : INFO : PROGRESS: pass 3, at document #2819/2819\n", "2019-05-06 15:57:19,291 : INFO : L2 norm: 27.993018072469805\n", "2019-05-06 15:57:19,324 : INFO : topic #0 (0.348): 0.015*\"israel\" + 0.014*\"isra\" + 0.009*\"arab\" + 0.008*\"jew\" + 0.005*\"palestinian\" + 0.004*\"lebanes\" + 0.004*\"peac\" + 0.003*\"attack\" + 0.003*\"lebanon\" + 0.003*\"polici\"\n", "2019-05-06 15:57:19,325 : INFO : topic #1 (0.221): 0.007*\"space\" + 0.005*\"nasa\" + 0.003*\"access\" + 0.003*\"orbit\" + 0.003*\"pat\" + 0.003*\"launch\" + 0.003*\"gov\" + 0.003*\"digex\" + 0.003*\"com\" + 0.002*\"graphic\"\n", "2019-05-06 15:57:19,325 : INFO : topic #2 (0.281): 0.022*\"armenian\" + 0.011*\"turkish\" + 0.007*\"armenia\" + 0.007*\"turk\" + 0.007*\"argic\" + 0.007*\"serdar\" + 0.006*\"turkei\" + 0.005*\"greek\" + 0.005*\"genocid\" + 0.005*\"soviet\"\n", "2019-05-06 15:57:19,326 : INFO : topic #3 (0.349): 0.016*\"moral\" + 0.014*\"keith\" + 0.007*\"object\" + 0.006*\"caltech\" + 0.006*\"schneider\" + 0.005*\"allan\" + 0.005*\"cco\" + 0.004*\"anim\" + 0.004*\"natur\" + 0.003*\"think\"\n", "2019-05-06 15:57:19,327 : INFO : topic #4 (0.365): 0.012*\"islam\" + 0.012*\"god\" + 0.006*\"sgi\" + 0.006*\"livesei\" + 0.006*\"jaeger\" + 0.005*\"muslim\" + 0.005*\"religion\" + 0.005*\"atheist\" + 0.005*\"jon\" + 0.005*\"atheism\"\n", "2019-05-06 15:57:19,328 : INFO : W error: -18.84541532652092\n", "2019-05-06 
15:57:19,342 : INFO : PROGRESS: pass 4, at document #1000/2819\n", "2019-05-06 15:57:19,347 : INFO : W error: -19.091058241700402\n", "2019-05-06 15:57:19,362 : INFO : PROGRESS: pass 4, at document #2000/2819\n", "2019-05-06 15:57:19,367 : INFO : W error: -19.338453391122066\n", "2019-05-06 15:57:19,378 : INFO : PROGRESS: pass 4, at document #2819/2819\n", "2019-05-06 15:57:19,473 : INFO : L2 norm: 27.988158841345246\n", "2019-05-06 15:57:19,506 : INFO : topic #0 (0.352): 0.015*\"israel\" + 0.014*\"isra\" + 0.009*\"arab\" + 0.008*\"jew\" + 0.005*\"palestinian\" + 0.004*\"lebanes\" + 0.004*\"peac\" + 0.003*\"attack\" + 0.003*\"lebanon\" + 0.003*\"polici\"\n", "2019-05-06 15:57:19,507 : INFO : topic #1 (0.210): 0.007*\"space\" + 0.005*\"nasa\" + 0.003*\"access\" + 0.003*\"orbit\" + 0.003*\"launch\" + 0.003*\"pat\" + 0.003*\"gov\" + 0.003*\"com\" + 0.002*\"alaska\" + 0.002*\"graphic\"\n", "2019-05-06 15:57:19,508 : INFO : topic #2 (0.282): 0.023*\"armenian\" + 0.011*\"turkish\" + 0.008*\"armenia\" + 0.007*\"argic\" + 0.007*\"turk\" + 0.007*\"serdar\" + 0.006*\"turkei\" + 0.005*\"greek\" + 0.005*\"genocid\" + 0.005*\"soviet\"\n", "2019-05-06 15:57:19,509 : INFO : topic #3 (0.353): 0.016*\"moral\" + 0.015*\"keith\" + 0.007*\"object\" + 0.007*\"caltech\" + 0.006*\"schneider\" + 0.005*\"allan\" + 0.005*\"cco\" + 0.004*\"anim\" + 0.004*\"natur\" + 0.004*\"goal\"\n", "2019-05-06 15:57:19,509 : INFO : topic #4 (0.367): 0.012*\"god\" + 0.012*\"islam\" + 0.006*\"jaeger\" + 0.006*\"sgi\" + 0.006*\"livesei\" + 0.005*\"muslim\" + 0.005*\"atheist\" + 0.005*\"religion\" + 0.005*\"atheism\" + 0.004*\"jon\"\n", "2019-05-06 15:57:19,511 : INFO : W error: -19.245120389312117\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.52 s, sys: 1.84 s, total: 3.36 s\n", "Wall time: 947 ms\n" ] } ], "source": [ "%%time\n", "\n", "nmf = GensimNmf(\n", " corpus=train_corpus_tfidf,\n", " num_topics=5,\n", " id2word=dictionary,\n", " chunksize=1000,\n", " passes=5,\n", 
" eval_every=10,\n", " minimum_probability=0,\n", " random_state=0,\n", " kappa=1,\n", ")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "W = nmf.get_topics().T\n", "\n", "dense_test_corpus = matutils.corpus2dense(\n", " test_corpus_tfidf,\n", " num_terms=W.shape[0],\n", ")\n", "\n", "if isinstance(nmf, SklearnNmf):\n", " H = nmf.transform(dense_test_corpus.T).T\n", "else:\n", " H = np.zeros((nmf.num_topics, len(test_corpus_tfidf)))\n", " for bow_id, bow in enumerate(test_corpus_tfidf):\n", " for topic_id, word_count in nmf[bow]:\n", " H[topic_id, bow_id] = word_count" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.105176733465657" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.linalg.norm(W.dot(H))" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "43.312817" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.linalg.norm(dense_test_corpus)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### View the learned topics" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.015*\"israel\" + 0.014*\"isra\" + 0.009*\"arab\" + 0.008*\"jew\" + 0.005*\"palestinian\" + 0.004*\"lebanes\" + 0.004*\"peac\" + 0.004*\"attack\" + 0.004*\"lebanon\" + 0.003*\"polici\"'),\n", " (1,\n", " '0.007*\"space\" + 0.005*\"nasa\" + 0.003*\"access\" + 0.003*\"orbit\" + 0.003*\"launch\" + 0.003*\"pat\" + 0.003*\"gov\" + 0.003*\"com\" + 0.002*\"alaska\" + 0.002*\"moon\"'),\n", " (2,\n", " '0.023*\"armenian\" + 0.011*\"turkish\" + 0.008*\"armenia\" + 0.007*\"argic\" + 0.007*\"turk\" + 0.007*\"serdar\" + 0.006*\"turkei\" + 0.005*\"greek\" + 0.005*\"genocid\" + 0.005*\"soviet\"'),\n", " (3,\n", " '0.016*\"moral\" + 0.015*\"keith\" + 0.007*\"object\" + 
0.007*\"caltech\" + 0.006*\"schneider\" + 0.005*\"allan\" + 0.005*\"cco\" + 0.004*\"anim\" + 0.004*\"natur\" + 0.004*\"goal\"'),\n", " (4,\n", " '0.012*\"god\" + 0.012*\"islam\" + 0.006*\"jaeger\" + 0.006*\"sgi\" + 0.005*\"livesei\" + 0.005*\"muslim\" + 0.005*\"atheist\" + 0.005*\"religion\" + 0.005*\"atheism\" + 0.004*\"rushdi\"')]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nmf.show_topics()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluation measure: Coherence\n", "\n", "[Topic coherence](http://qpleple.com/topic-coherence-to-evaluate-topic-models/) measures how often the most frequent tokens of each topic co-occur in the same document. Larger is better." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-05-06 15:57:20,590 : INFO : CorpusAccumulator accumulated stats from 1000 documents\n" ] }, { "data": { "text/plain": [ "-4.1310114675795875" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "CoherenceModel(\n", "    model=nmf,\n", "    corpus=test_corpus_tfidf,\n", "    coherence='u_mass'\n", ").get_coherence()" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "## Topic inference on new documents\n", "\n", "With the NMF model trained, let's fetch one news document not seen during training, and infer its topic vector." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "From: spl@ivem.ucsd.edu (Steve Lamont)\n", "Subject: Re: RGB to HVS, and back\n", "Organization: University of Calif., San Diego/Microscopy and Imaging Resource\n", "Lines: 18\n", "Distribution: world\n", "NNTP-Posting-Host: ivem.ucsd.edu\n", "\n", "In article zyeh@caspian.usc.edu (zhenghao yeh) writes:\n", ">|> See Foley, van Dam, Feiner, and Hughes, _Computer Graphics: Principles\n", ">|> and Practice, Second Edition_.\n", ">|> \n", ">|> [If people would *read* this book, 75 percent of the questions in this\n", ">|> froup would disappear overnight...]\n", ">|> \n", ">\tNot really. I think it is less than 10%.\n", "\n", "Nah... I figure most people would be so busy reading that they wouldn't\n", "have *time* to post. :-) :-) :-)\n", "\n", "\t\t\t\t\t\t\tspl\n", "-- \n", "Steve Lamont, SciViGuy -- (619) 534-7968 -- spl@szechuan.ucsd.edu\n", "San Diego Microscopy and Imaging Resource/UC San Diego/La Jolla, CA 92093-0608\n", "\"Until I meet you, then, in Upper Hell\n", "Convulsed, foaming immortal blood: farewell\" - J. 
Berryman, \"A Professor's Song\"\n", "\n", "====================================================================================================\n", "Topics: [(0, 0.10199317206513686), (1, 0.39976628221371285), (2, 0.1428263926167706), (3, 0.0333080734922002), (4, 0.3221060796121796)]\n" ] } ], "source": [ "print(testset[0]['data'])\n", "print('=' * 100)\n", "print(\"Topics: {}\".format(nmf[test_corpus[0]]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Word topic inference\n", "\n", "Similarly, we can inspect the topic distribution assigned to a vocabulary term:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Word: actual\n", "Topics: [(0, 0.15401782844659068), (1, 0.2829834256007429), (2, 0.04354905106817273), (3, 0.25783766798021135), (4, 0.26161202690428226)]\n" ] } ], "source": [ "word = dictionary[0]\n", "print(\"Word: {}\".format(word))\n", "print(\"Topics: {}\".format(nmf.get_term_topics(word)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Internal NMF state" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Density is the fraction of non-zero elements in a matrix." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def density(matrix):\n", "    return (matrix > 0).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Term-topic matrix of shape `(words, topics)`." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Density: 0.6872475639034035\n" ] } ], "source": [ "print(\"Density: {}\".format(density(nmf._W)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Topic-document matrix for the last batch, of shape `(topics, batch)`." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Density: 0.6615384615384615\n" ] } ], "source": [ "print(\"Density: {}\".format(density(nmf._h)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Benchmarks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gensim NMF vs Sklearn NMF vs Gensim LDA\n", "\n", "We'll run these three unsupervised models on the [20newsgroups](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) dataset.\n", "\n", "20 Newsgroups also contains labels for each document, which will allow us to evaluate the trained models on a \"downstream\" classification task, using the unsupervised document topics as input features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Metrics\n", "\n", "We'll track these metrics as we train and test the models on the 20-newsgroups corpus we created above:\n", "- `train_time` - time to train a model.\n", "- `mean_ram` - mean RAM consumption during training.\n", "- `max_ram` - maximum RAM consumption during training.\n", "- `coherence` - coherence score (larger is better).\n", "- `l2_norm` - L2 norm of `v - Wh` (less is better, not defined for LDA).\n", "- `f1` - [F1 score](https://en.wikipedia.org/wiki/F1_score) on the task of news topic classification (larger is better)." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "fixed_params = dict(\n", "    chunksize=1000,\n", "    num_topics=5,\n", "    id2word=dictionary,\n", "    passes=5,\n", "    eval_every=10,\n", "    minimum_probability=0,\n", "    random_state=0,\n", ")" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "@contextmanager\n", "def measure_ram(output, tick=5):\n", "    def _measure_ram(pid, output, tick=tick):\n", "        py = psutil.Process(pid)\n", "        with open(output, 'w') as outfile:\n", "            while True:\n", "                memory = py.memory_info().rss\n", "                outfile.write(\"{}\\n\".format(memory))\n", "                outfile.flush()\n", "                time.sleep(tick)\n", "\n", "    pid = os.getpid()\n", "    p = Process(target=_measure_ram, args=(pid, output, tick))\n", "    p.start()\n", "    yield\n", "    p.terminate()\n", "\n", "\n", "def get_train_time_and_ram(func, name, tick=5):\n", "    memprof_filename = \"{}.memprof\".format(name)\n", "\n", "    start = time.time()\n", "\n", "    with measure_ram(memprof_filename, tick=tick):\n", "        result = func()\n", "\n", "    elapsed_time = pd.to_timedelta(time.time() - start, unit='s').round('ms')\n", "\n", "    memprof_df = pd.read_csv(memprof_filename, squeeze=True)\n", "\n", "    mean_ram = \"{} MB\".format(\n", "        int(memprof_df.mean() // 2 ** 20),\n", "    )\n", "\n", "    max_ram = \"{} MB\".format(int(memprof_df.max() // 2 ** 20))\n", "\n", "    return elapsed_time, mean_ram, max_ram, result\n", "\n", "\n", "def get_f1(model, train_corpus, X_test, y_train, y_test):\n", "    if isinstance(model, SklearnNmf):\n", "        dense_train_corpus = matutils.corpus2dense(\n", "            train_corpus,\n", "            num_terms=model.components_.shape[1],\n", "        )\n", "        X_train = model.transform(dense_train_corpus.T)\n", "    else:\n", "        X_train = np.zeros((len(train_corpus), model.num_topics))\n", "        for bow_id, bow in enumerate(train_corpus):\n", "            for topic_id, word_count in model[bow]:\n", "                X_train[bow_id, topic_id] = word_count\n", "\n", "    log_reg = 
LogisticRegressionCV(multi_class='multinomial', cv=5)\n", "    log_reg.fit(X_train, y_train)\n", "\n", "    pred_labels = log_reg.predict(X_test)\n", "\n", "    return f1_score(y_test, pred_labels, average='micro')\n", "\n", "def get_sklearn_topics(model, top_n=5):\n", "    topic_probas = model.components_.T\n", "    topic_probas = topic_probas / topic_probas.sum(axis=0)\n", "\n", "    sparsity = np.zeros(topic_probas.shape[1])\n", "\n", "    for row in topic_probas:\n", "        sparsity += (row == 0)\n", "\n", "    sparsity /= topic_probas.shape[1]\n", "\n", "    topic_probas = topic_probas[:, sparsity.argsort()[::-1]][:, :top_n]\n", "\n", "    token_indices = topic_probas.argsort(axis=0)[:-11:-1, :]\n", "    topic_probas.sort(axis=0)\n", "    topic_probas = topic_probas[:-11:-1, :]\n", "\n", "    topics = []\n", "\n", "    for topic_idx in range(topic_probas.shape[1]):\n", "        tokens = [\n", "            model.id2word[token_idx]\n", "            for token_idx\n", "            in token_indices[:, topic_idx]\n", "        ]\n", "        topic = (\n", "            '{}*\"{}\"'.format(round(proba, 3), token)\n", "            for proba, token\n", "            in zip(topic_probas[:, topic_idx], tokens)\n", "        )\n", "        topic = \" + \".join(topic)\n", "        topics.append((topic_idx, topic))\n", "\n", "    return topics\n", "\n", "def get_metrics(model, test_corpus, train_corpus=None, y_train=None, y_test=None, dictionary=None):\n", "    if isinstance(model, SklearnNmf):\n", "        model.get_topics = lambda: model.components_\n", "        model.show_topics = lambda top_n: get_sklearn_topics(model, top_n)\n", "        model.id2word = dictionary\n", "\n", "    W = model.get_topics().T\n", "\n", "    dense_test_corpus = matutils.corpus2dense(\n", "        test_corpus,\n", "        num_terms=W.shape[0],\n", "    )\n", "\n", "    if isinstance(model, SklearnNmf):\n", "        H = model.transform(dense_test_corpus.T).T\n", "    else:\n", "        H = np.zeros((model.num_topics, len(test_corpus)))\n", "        for bow_id, bow in enumerate(test_corpus):\n", "            for topic_id, word_count in model[bow]:\n", "                H[topic_id, bow_id] = word_count\n", "\n", "    l2_norm = None\n", "\n", "    if not 
isinstance(model, LdaModel):\n", "        pred_factors = W.dot(H)\n", "\n", "        l2_norm = np.linalg.norm(pred_factors - dense_test_corpus)\n", "        l2_norm = round(l2_norm, 4)\n", "\n", "    f1 = None\n", "\n", "    if train_corpus and y_train and y_test:\n", "        f1 = get_f1(model, train_corpus, H.T, y_train, y_test)\n", "        f1 = round(f1, 4)\n", "\n", "    model.normalize = True\n", "\n", "    coherence = CoherenceModel(\n", "        model=model,\n", "        corpus=test_corpus,\n", "        coherence='u_mass'\n", "    ).get_coherence()\n", "    coherence = round(coherence, 4)\n", "\n", "    topics = model.show_topics(5)\n", "\n", "    model.normalize = False\n", "\n", "    return dict(\n", "        coherence=coherence,\n", "        l2_norm=l2_norm,\n", "        f1=f1,\n", "        topics=topics,\n", "    )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run the models" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "tm_metrics = pd.DataFrame(columns=['model', 'train_time', 'coherence', 'l2_norm', 'f1', 'topics'])\n", "\n", "y_train = [doc['target'] for doc in trainset]\n", "y_test = [doc['target'] for doc in testset]\n", "\n", "# LDA metrics\n", "row = {}\n", "row['model'] = 'lda'\n", "row['train_time'], row['mean_ram'], row['max_ram'], lda = get_train_time_and_ram(\n", "    lambda: LdaModel(\n", "        corpus=train_corpus,\n", "        **fixed_params,\n", "    ),\n", "    'lda',\n", "    0.1,\n", ")\n", "row.update(get_metrics(\n", "    lda, test_corpus, train_corpus, y_train, y_test,\n", "))\n", "tm_metrics = tm_metrics.append(pd.Series(row), ignore_index=True)\n", "\n", "# LSI metrics\n", "row = {}\n", "row['model'] = 'lsi'\n", "row['train_time'], row['mean_ram'], row['max_ram'], lsi = get_train_time_and_ram(\n", "    lambda: LsiModel(\n", "        corpus=train_corpus_tfidf,\n", "        num_topics=5,\n", "        id2word=dictionary,\n", "        chunksize=2000,\n", "    ),\n", "    'lsi',\n", "    0.1,\n", ")\n", "row.update(get_metrics(\n", "    lsi, test_corpus_tfidf, train_corpus_tfidf, y_train, y_test,\n", "))\n", "tm_metrics = 
tm_metrics.append(pd.Series(row), ignore_index=True)\n", "\n", "# Sklearn NMF metrics\n", "row = {}\n", "row['model'] = 'sklearn_nmf'\n", "train_csc_corpus_tfidf = matutils.corpus2csc(train_corpus_tfidf, len(dictionary)).T\n", "row['train_time'], row['mean_ram'], row['max_ram'], sklearn_nmf = get_train_time_and_ram(\n", "    lambda: SklearnNmf(n_components=5, random_state=42).fit(train_csc_corpus_tfidf),\n", "    'sklearn_nmf',\n", "    0.1,\n", ")\n", "row.update(get_metrics(\n", "    sklearn_nmf, test_corpus_tfidf, train_corpus_tfidf, y_train, y_test, dictionary,\n", "))\n", "tm_metrics = tm_metrics.append(pd.Series(row), ignore_index=True)\n", "\n", "# Gensim NMF metrics\n", "row = {}\n", "row['model'] = 'gensim_nmf'\n", "row['train_time'], row['mean_ram'], row['max_ram'], gensim_nmf = get_train_time_and_ram(\n", "    lambda: GensimNmf(\n", "        normalize=False,\n", "        corpus=train_corpus_tfidf,\n", "        **fixed_params\n", "    ),\n", "    'gensim_nmf',\n", "    0.1,\n", ")\n", "row.update(get_metrics(\n", "    gensim_nmf, test_corpus_tfidf, train_corpus_tfidf, y_train, y_test,\n", "))\n", "tm_metrics = tm_metrics.append(pd.Series(row), ignore_index=True)\n", "tm_metrics.replace(np.nan, '-', inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Benchmark results" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>model</th><th>train_time</th><th>coherence</th><th>l2_norm</th><th>f1</th><th>max_ram</th><th>mean_ram</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>lda</td><td>00:00:08.862000</td><td>-2.1054</td><td>-</td><td>0.7511</td><td>366 MB</td><td>366 MB</td></tr>\n", "    <tr><th>1</th><td>lsi</td><td>00:00:00.332000</td><td>-5.7010</td><td>42.4642</td><td>0.8587</td><td>381 MB</td><td>379 MB</td></tr>\n", "    <tr><th>2</th><td>sklearn_nmf</td><td>00:00:00.166000</td><td>-3.1835</td><td>42.4759</td><td>0.7889</td><td>378 MB</td><td>378 MB</td></tr>\n", "    <tr><th>3</th><td>gensim_nmf</td><td>00:00:00.954000</td><td>-4.1310</td><td>42.5487</td><td>0.8065</td><td>379 MB</td><td>379 MB</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " model train_time coherence l2_norm f1 max_ram mean_ram\n", "0 lda 00:00:08.862000 -2.1054 - 0.7511 366 MB 366 MB\n", "1 lsi 00:00:00.332000 -5.7010 42.4642 0.8587 381 MB 379 MB\n", "2 sklearn_nmf 00:00:00.166000 -3.1835 42.4759 0.7889 378 MB 378 MB\n", "3 gensim_nmf 00:00:00.954000 -4.1310 42.5487 0.8065 379 MB 379 MB" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm_metrics.drop('topics', axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Main insights\n", "\n", "- LDA has the best coherence of all models.\n", "- LSI has the best l2 norm and f1 performance on downstream task (it's factors aren't non-negative though).\n", "- Gensim NMF, Sklearn NMF and LSI has a bit larger memory footprint than that of LDA.\n", "- Gensim NMF, Sklearn NMF and LSI are much faster than LDA." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Learned topics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's inspect the 5 topics learned by each of the three models:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "lda:\n", "(0, '0.013*\"space\" + 0.008*\"imag\" + 0.007*\"nasa\" + 0.006*\"graphic\" + 0.006*\"program\" + 0.005*\"launch\" + 0.005*\"file\" + 0.005*\"com\" + 0.005*\"new\" + 0.004*\"orbit\"')\n", "(1, '0.015*\"com\" + 0.007*\"like\" + 0.007*\"nntp\" + 0.007*\"host\" + 0.006*\"know\" + 0.006*\"univers\" + 0.005*\"henri\" + 0.005*\"work\" + 0.005*\"bit\" + 0.005*\"think\"')\n", "(2, '0.014*\"armenian\" + 0.011*\"peopl\" + 0.009*\"turkish\" + 0.007*\"jew\" + 0.007*\"said\" + 0.006*\"right\" + 0.005*\"know\" + 0.005*\"kill\" + 0.005*\"isra\" + 0.005*\"turkei\"')\n", "(3, '0.012*\"com\" + 0.010*\"israel\" + 0.009*\"bike\" + 0.006*\"isra\" + 0.006*\"dod\" + 0.005*\"like\" + 0.005*\"ride\" + 0.005*\"host\" + 0.005*\"nntp\" + 0.005*\"motorcycl\"')\n", "(4, 
'0.011*\"god\" + 0.008*\"peopl\" + 0.007*\"think\" + 0.006*\"exist\" + 0.006*\"univers\" + 0.005*\"com\" + 0.005*\"believ\" + 0.005*\"islam\" + 0.005*\"moral\" + 0.005*\"christian\"')\n", "\n", "lsi:\n", "(0, '0.157*\"armenian\" + 0.118*\"israel\" + 0.110*\"peopl\" + 0.110*\"isra\" + 0.105*\"space\" + 0.098*\"com\" + 0.097*\"god\" + 0.084*\"jew\" + 0.082*\"think\" + 0.081*\"turkish\"')\n", "(1, '-0.476*\"armenian\" + -0.231*\"turkish\" + -0.157*\"armenia\" + -0.151*\"argic\" + -0.149*\"serdar\" + -0.145*\"turk\" + 0.128*\"space\" + -0.117*\"turkei\" + -0.111*\"genocid\" + 0.107*\"nasa\"')\n", "(2, '0.295*\"israel\" + -0.291*\"armenian\" + 0.273*\"isra\" + 0.182*\"arab\" + 0.157*\"jew\" + -0.143*\"space\" + -0.134*\"turkish\" + -0.112*\"nasa\" + 0.106*\"jake\" + 0.103*\"boni\"')\n", "(3, '0.274*\"keith\" + 0.252*\"moral\" + -0.235*\"israel\" + 0.213*\"god\" + -0.213*\"isra\" + 0.165*\"livesei\" + -0.143*\"arab\" + 0.123*\"sgi\" + 0.118*\"caltech\" + 0.114*\"islam\"')\n", "(4, '0.240*\"henri\" + -0.215*\"bike\" + 0.210*\"space\" + 0.167*\"toronto\" + 0.158*\"nasa\" + 0.148*\"moral\" + 0.143*\"keith\" + -0.142*\"graphic\" + 0.128*\"alaska\" + 0.125*\"orbit\"')\n", "\n", "sklearn_nmf:\n", "(0, '0.027*\"armenian\" + 0.013*\"turkish\" + 0.009*\"armenia\" + 0.009*\"argic\" + 0.009*\"serdar\" + 0.008*\"turk\" + 0.007*\"turkei\" + 0.006*\"genocid\" + 0.006*\"soviet\" + 0.006*\"zuma\"')\n", "(1, '0.015*\"israel\" + 0.014*\"isra\" + 0.01*\"arab\" + 0.008*\"jew\" + 0.005*\"palestinian\" + 0.005*\"jake\" + 0.005*\"boni\" + 0.004*\"lebanes\" + 0.004*\"peac\" + 0.004*\"adam\"')\n", "(2, '0.011*\"god\" + 0.01*\"keith\" + 0.01*\"moral\" + 0.006*\"islam\" + 0.006*\"livesei\" + 0.006*\"atheist\" + 0.005*\"atheism\" + 0.005*\"caltech\" + 0.004*\"religion\" + 0.004*\"object\"')\n", "(3, '0.011*\"space\" + 0.008*\"nasa\" + 0.008*\"henri\" + 0.006*\"orbit\" + 0.005*\"toronto\" + 0.005*\"alaska\" + 0.005*\"launch\" + 0.005*\"moon\" + 0.004*\"gov\" + 0.004*\"access\"')\n", "(4, 
'0.005*\"bike\" + 0.005*\"graphic\" + 0.004*\"file\" + 0.004*\"imag\" + 0.003*\"com\" + 0.003*\"ride\" + 0.003*\"thank\" + 0.003*\"program\" + 0.003*\"motorcycl\" + 0.002*\"look\"')\n", "\n", "gensim_nmf:\n", "(0, '0.015*\"israel\" + 0.014*\"isra\" + 0.009*\"arab\" + 0.008*\"jew\" + 0.005*\"palestinian\" + 0.004*\"lebanes\" + 0.004*\"peac\" + 0.004*\"attack\" + 0.004*\"lebanon\" + 0.003*\"polici\"')\n", "(1, '0.007*\"space\" + 0.005*\"nasa\" + 0.003*\"access\" + 0.003*\"orbit\" + 0.003*\"launch\" + 0.003*\"pat\" + 0.003*\"gov\" + 0.003*\"com\" + 0.002*\"alaska\" + 0.002*\"moon\"')\n", "(2, '0.023*\"armenian\" + 0.011*\"turkish\" + 0.008*\"armenia\" + 0.007*\"argic\" + 0.007*\"turk\" + 0.007*\"serdar\" + 0.006*\"turkei\" + 0.005*\"greek\" + 0.005*\"genocid\" + 0.005*\"soviet\"')\n", "(3, '0.016*\"moral\" + 0.015*\"keith\" + 0.007*\"object\" + 0.007*\"caltech\" + 0.006*\"schneider\" + 0.005*\"allan\" + 0.005*\"cco\" + 0.004*\"anim\" + 0.004*\"natur\" + 0.004*\"goal\"')\n", "(4, '0.012*\"god\" + 0.012*\"islam\" + 0.006*\"jaeger\" + 0.006*\"sgi\" + 0.005*\"livesei\" + 0.005*\"muslim\" + 0.005*\"atheist\" + 0.005*\"religion\" + 0.005*\"atheism\" + 0.004*\"rushdi\"')\n" ] } ], "source": [ "def compare_topics(tm_metrics):\n", "    for _, row in tm_metrics.iterrows():\n", "        print('\\n{}:'.format(row.model))\n", "        print(\"\\n\".join(str(topic) for topic in row.topics))\n", "\n", "compare_topics(tm_metrics)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Subjectively, Gensim and Sklearn NMFs are on par with each other, while LDA and LSI look a bit worse." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. 
NMF on English Wikipedia" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section shows how to train an NMF model on a large text corpus, the entire English Wikipedia: **2.6 billion words, in 23.1 million article sections across 5 million Wikipedia articles**.\n", "\n", "The data preprocessing takes a while, and we'll be comparing multiple models, so **reserve about 3 hours** and some **20 GB of disk space** to go through the following notebook cells in full. You'll need `gensim>=3.7.1`, `numpy`, `tqdm`, `pandas`, `psutil`, `joblib` and `sklearn`." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Re-import modules from scratch, so that this Section doesn't rely on any previous cells.\n", "import itertools\n", "import json\n", "import logging\n", "import time\n", "import os\n", "\n", "from smart_open import smart_open\n", "import psutil\n", "import numpy as np\n", "import scipy.sparse\n", "from contextlib import contextmanager\n", "from multiprocessing import Process\n", "from tqdm import tqdm, tqdm_notebook\n", "import joblib\n", "import pandas as pd\n", "from sklearn.decomposition import NMF as SklearnNmf\n", "\n", "import gensim.downloader\n", "from gensim import matutils\n", "from gensim.corpora import MmCorpus, Dictionary\n", "from gensim.models import LdaModel, LdaMulticore, LsiModel, CoherenceModel\n", "from gensim.models.nmf import Nmf as GensimNmf\n", "from gensim.utils import simple_preprocess\n", "\n", "tqdm.pandas()\n", "\n", "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the Wikipedia dump" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll use the [gensim.downloader](https://github.com/RaRe-Technologies/gensim-data) to download a parsed Wikipedia dump (6.1 GB disk space):" ] }, { "cell_type": "code", "execution_count": 24, 
"metadata": {}, "outputs": [], "source": [ "data = gensim.downloader.load(\"wiki-english-20171001\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print the titles and sections of the first Wikipedia article, as a little sanity check:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Article: 'Anarchism'\n", "\n", "Section title: 'Introduction'\n", "Section text: '''Anarchism''' is a political philosophy that advocates self-governed societies based on volun…\n", "\n", "Section title: 'Etymology and terminology'\n", "Section text: The word ''anarchism'' is composed from the word ''anarchy'' and the suffix ''-ism'', themselves d…\n", "\n", "Section title: 'History'\n", "Section text: ===Origins=== Woodcut from a Diggers document by William Everard The earliest anarchist themes ca…\n", "\n", "Section title: 'Anarchist schools of thought'\n", "Section text: Portrait of philosopher Pierre-Joseph Proudhon (1809–1865) by Gustave Courbet. Proudhon was the pri…\n", "\n", "Section title: 'Internal issues and debates'\n", "Section text: consistent with anarchist values is a controversial subject among anarchists. Anarchism is a philo…\n", "\n", "Section title: 'Topics of interest'\n", "Section text: Intersecting and overlapping between various schools of thought, certain topics of interest and inte…\n", "\n", "Section title: 'Criticisms'\n", "Section text: Criticisms of anarchism include moral criticisms and pragmatic criticisms. 
Anarchism is often evalu…\n", "\n", "Section title: 'See also'\n", "Section text: * Anarchism by country…\n", "\n", "Section title: 'References'\n", "Section text: …\n", "\n", "Section title: 'Further reading'\n", "Section text: * Barclay, Harold, ''People Without Government: An Anthropology of Anarchy'' (2nd ed.), Left Bank Bo…\n", "\n", "Section title: 'External links'\n", "Section text: * *…\n", "\n" ] } ], "source": [ "data = gensim.downloader.load(\"wiki-english-20171001\")\n", "article = next(iter(data))\n", "\n", "print(\"Article: %r\\n\" % article['title'])\n", "for section_title, section_text in zip(article['section_titles'], article['section_texts']):\n", "    print(\"Section title: %r\" % section_title)\n", "    print(\"Section text: %s…\\n\" % section_text[:100].replace('\\n', ' ').strip())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create a Python generator function that streams through the downloaded Wikipedia dump and preprocesses (tokenizes, lower-cases) each article:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "def wikidump2tokens(articles):\n", "    \"\"\"Stream through the Wikipedia dump, yielding a list of tokens for each article.\"\"\"\n", "    for article in articles:\n", "        article_section_texts = [\n", "            \" \".join([title, text])\n", "            for title, text\n", "            in zip(article['section_titles'], article['section_texts'])\n", "        ]\n", "        article_tokens = simple_preprocess(\" \".join(article_section_texts))\n", "        yield article_tokens" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a word-to-id mapping, in order to vectorize texts. 
This makes a full pass over the Wikipedia corpus and takes **~3.5 hours**:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-05-06 15:57:45,000 : INFO : loading Dictionary object from wiki.dict\n", "2019-05-06 15:57:45,031 : INFO : loaded wiki.dict\n" ] } ], "source": [ "if os.path.exists('wiki.dict'):\n", "    # If we already stored the Dictionary in a previous run, simply load it, to save time.\n", "    dictionary = Dictionary.load('wiki.dict')\n", "else:\n", "    dictionary = Dictionary(wikidump2tokens(data))\n", "    # Keep only the 30,000 most frequent vocabulary terms, after filtering away terms\n", "    # that are too frequent/too infrequent.\n", "    dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=30000)\n", "    dictionary.save('wiki.dict')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Store preprocessed Wikipedia as bag-of-words sparse matrix in MatrixMarket format" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When training NMF with a single pass over the input corpus (\"online\"), we simply vectorize each raw text straight from the input storage:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "vector_stream = (dictionary.doc2bow(article) for article in wikidump2tokens(data))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the purposes of this tutorial, though, we'll serialize (\"cache\") the bag-of-words vectors to disk, in a `wiki.mm` file in MatrixMarket format. We'll be re-using the vectorized articles multiple times, for the different models in our benchmarks, and also shuffling them, so it makes sense to amortize the vectorization time by persisting the resulting vectors to disk." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, let's stream through the preprocessed sparse Wikipedia bag-of-words matrix while storing it to disk. **This step takes about 3 hours** and needs **38 GB of disk space**:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "class RandomSplitCorpus(MmCorpus):\n", "    \"\"\"\n", "    Use the fact that MmCorpus supports random indexing, and create a streamed\n", "    corpus in shuffled order, including a train/test split for evaluation.\n", "    \"\"\"\n", "    def __init__(self, random_seed=42, testset=False, testsize=1000, *args, **kwargs):\n", "        super().__init__(*args, **kwargs)\n", "\n", "        random_state = np.random.RandomState(random_seed)\n", "\n", "        self.indices = random_state.permutation(range(self.num_docs))\n", "        test_nnz = sum(len(self[doc_idx]) for doc_idx in self.indices[:testsize])\n", "\n", "        if testset:\n", "            self.indices = self.indices[:testsize]\n", "            self.num_docs = testsize\n", "            self.num_nnz = test_nnz\n", "        else:\n", "            self.indices = self.indices[testsize:]\n", "            self.num_docs -= testsize\n", "            self.num_nnz -= test_nnz\n", "\n", "    def __iter__(self):\n", "        for doc_id in self.indices:\n", "            yield self[doc_id]" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "scrolled": true }, "outputs": [], "source": [ "if not os.path.exists('wiki.mm'):\n", "    MmCorpus.serialize('wiki.mm', vector_stream, progress_cnt=100000)\n", "\n", "if not os.path.exists('wiki_tfidf.mm'):\n", "    MmCorpus.serialize('wiki_tfidf.mm', tfidf[MmCorpus('wiki.mm')], progress_cnt=100000)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-05-06 15:57:45,584 : INFO : loaded corpus index from wiki.mm.index\n", "2019-05-06 15:57:45,585 : INFO : initializing cython corpus reader from wiki.mm\n", "2019-05-06 15:57:45,586 : INFO : accepted corpus with 4924894 documents, 30000 features, 820242695 
non-zero entries\n", "2019-05-06 15:57:49,102 : INFO : loaded corpus index from wiki.mm.index\n", "2019-05-06 15:57:49,103 : INFO : initializing cython corpus reader from wiki.mm\n", "2019-05-06 15:57:49,103 : INFO : accepted corpus with 4924894 documents, 30000 features, 820242695 non-zero entries\n", "2019-05-06 15:57:51,552 : INFO : loaded corpus index from wiki_tfidf.mm.index\n", "2019-05-06 15:57:51,553 : INFO : initializing cython corpus reader from wiki_tfidf.mm\n", "2019-05-06 15:57:51,554 : INFO : accepted corpus with 4924661 documents, 30000 features, 820007548 non-zero entries\n", "2019-05-06 15:57:55,680 : INFO : loaded corpus index from wiki_tfidf.mm.index\n", "2019-05-06 15:57:55,681 : INFO : initializing cython corpus reader from wiki_tfidf.mm\n", "2019-05-06 15:57:55,682 : INFO : accepted corpus with 4924661 documents, 30000 features, 820007548 non-zero entries\n" ] } ], "source": [ "# Load back the vectors as two lazily-streamed train/test iterables.\n", "train_corpus = RandomSplitCorpus(\n", " random_seed=42, testset=False, testsize=10000, fname='wiki.mm',\n", ")\n", "test_corpus = RandomSplitCorpus(\n", " random_seed=42, testset=True, testsize=10000, fname='wiki.mm',\n", ")\n", "\n", "train_corpus_tfidf = RandomSplitCorpus(\n", " random_seed=42, testset=False, testsize=10000, fname='wiki_tfidf.mm',\n", ")\n", "test_corpus_tfidf = RandomSplitCorpus(\n", " random_seed=42, testset=True, testsize=10000, fname='wiki_tfidf.mm',\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Save preprocessed Wikipedia in scipy.sparse format" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is only needed to run the Sklearn NMF on Wikipedia, for comparison in the benchmarks below. Sklearn expects in-memory scipy sparse input, not on-the-fly vector streams. 
It needs an additional ~2 GB of disk space.\n", "\n", "\n", "**Skip this step if you don't need the Sklearn NMF benchmark, and only want to run Gensim's NMF.**" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "if not os.path.exists('wiki_train_csr.npz'):\n", "    scipy.sparse.save_npz(\n", "        'wiki_train_csr.npz',\n", "        matutils.corpus2csc(train_corpus_tfidf, len(dictionary)).T,\n", "    )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Metrics\n", "\n", "We'll track these metrics as we train and test NMF on the Wikipedia corpus we created above:\n", "- `train_time` - time to train a model.\n", "- `mean_ram` - mean RAM consumption during training.\n", "- `max_ram` - maximum RAM consumption during training.\n", "- `coherence` - coherence score (larger is better).\n", "- `l2_norm` - L2 norm of `v - Wh` (lower is better; not defined for LDA)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define a dataframe in which we'll store the recorded metrics:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "tm_metrics = pd.DataFrame(columns=[\n", "    'model', 'train_time', 'mean_ram', 'max_ram', 'coherence', 'l2_norm', 'topics',\n", "])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define common parameters, to be shared by all evaluated models:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "params = dict(\n", "    chunksize=2000,\n", "    num_topics=50,\n", "    id2word=dictionary,\n", "    passes=1,\n", "    eval_every=10,\n", "    minimum_probability=0,\n", "    random_state=42,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wikipedia training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train Gensim NMF model and record its 
metrics" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "row = {}\n", "row['model'] = 'gensim_nmf'\n", "row['train_time'], row['mean_ram'], row['max_ram'], nmf = get_train_time_and_ram(\n", " lambda: GensimNmf(normalize=False, corpus=train_corpus_tfidf, **params),\n", " 'gensim_nmf',\n", " 1,\n", ")" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-05-06 16:23:44,657 : INFO : saving Nmf object under gensim_nmf.model, separately None\n", "2019-05-06 16:23:44,767 : INFO : saved gensim_nmf.model\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "{'max_ram': '774 MB', 'train_time': Timedelta('0 days 00:25:46.617000'), 'model': 'gensim_nmf', 'mean_ram': '771 MB'}\n" ] } ], "source": [ "print(row)\n", "nmf.save('gensim_nmf.model')" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-05-06 16:23:44,772 : INFO : loading Nmf object from gensim_nmf.model\n", "2019-05-06 16:23:44,871 : INFO : loading id2word recursively from gensim_nmf.model.id2word.* with mmap=None\n", "2019-05-06 16:23:44,872 : INFO : loaded gensim_nmf.model\n", "2019-05-06 16:24:47,126 : INFO : CorpusAccumulator accumulated stats from 1000 documents\n", "2019-05-06 16:24:47,272 : INFO : CorpusAccumulator accumulated stats from 2000 documents\n", "2019-05-06 16:24:47,424 : INFO : CorpusAccumulator accumulated stats from 3000 documents\n", "2019-05-06 16:24:47,573 : INFO : CorpusAccumulator accumulated stats from 4000 documents\n", "2019-05-06 16:24:47,726 : INFO : CorpusAccumulator accumulated stats from 5000 documents\n", "2019-05-06 16:24:47,880 : INFO : CorpusAccumulator accumulated stats from 6000 documents\n", "2019-05-06 16:24:48,027 : INFO : CorpusAccumulator accumulated stats from 7000 documents\n", "2019-05-06 16:24:48,168 : INFO : 
CorpusAccumulator accumulated stats from 8000 documents\n", "2019-05-06 16:24:48,319 : INFO : CorpusAccumulator accumulated stats from 9000 documents\n", "2019-05-06 16:24:48,472 : INFO : CorpusAccumulator accumulated stats from 10000 documents\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "{'mean_ram': '771 MB', 'topics': [(21, '0.009*\"his\" + 0.005*\"that\" + 0.005*\"him\" + 0.004*\"had\" + 0.003*\"they\" + 0.003*\"who\" + 0.003*\"her\" + 0.003*\"but\" + 0.003*\"king\" + 0.003*\"were\"'), (39, '0.005*\"are\" + 0.005*\"or\" + 0.004*\"be\" + 0.004*\"that\" + 0.003*\"can\" + 0.003*\"used\" + 0.003*\"this\" + 0.002*\"have\" + 0.002*\"such\" + 0.002*\"which\"'), (45, '0.092*\"apelor\" + 0.087*\"bucurești\" + 0.050*\"river\" + 0.046*\"cadastrul\" + 0.046*\"hidrologie\" + 0.046*\"meteorologie\" + 0.046*\"institutul\" + 0.046*\"române\" + 0.045*\"româniei\" + 0.045*\"rîurile\"'), (28, '0.066*\"gmina\" + 0.065*\"poland\" + 0.065*\"voivodeship\" + 0.046*\"village\" + 0.045*\"administrative\" + 0.042*\"lies\" + 0.037*\"approximately\" + 0.036*\"east\" + 0.031*\"west\" + 0.030*\"county\"'), (34, '0.087*\"romanized\" + 0.084*\"iran\" + 0.067*\"rural\" + 0.067*\"province\" + 0.066*\"census\" + 0.060*\"families\" + 0.055*\"village\" + 0.049*\"county\" + 0.047*\"population\" + 0.043*\"district\"')], 'model': 'gensim_nmf', 'coherence': -2.1071, 'l2_norm': 94.9686, 'train_time': Timedelta('0 days 00:25:46.617000'), 'max_ram': '774 MB', 'f1': None}\n" ] } ], "source": [ "nmf = GensimNmf.load('gensim_nmf.model')\n", "row.update(get_metrics(nmf, test_corpus_tfidf))\n", "print(row)\n", "tm_metrics = tm_metrics.append(pd.Series(row), ignore_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train Gensim LSI model and record its metrics" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "row = {}\n", "row['model'] = 'lsi'\n", "row['train_time'], row['mean_ram'], row['max_ram'], 
lsi = get_train_time_and_ram(\n", " lambda: LsiModel(\n", " corpus=train_corpus_tfidf,\n", " chunksize=2000,\n", " num_topics=50,\n", " id2word=dictionary,\n", " ),\n", " 'lsi',\n", " 1,\n", ")" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-05-06 17:34:51,929 : INFO : saving Projection object under lsi.model.projection, separately None\n", "2019-05-06 17:34:52,010 : INFO : saved lsi.model.projection\n", "2019-05-06 17:34:52,011 : INFO : saving LsiModel object under lsi.model, separately None\n", "2019-05-06 17:34:52,012 : INFO : not storing attribute projection\n", "2019-05-06 17:34:52,012 : INFO : not storing attribute dispatcher\n", "2019-05-06 17:34:52,022 : INFO : saved lsi.model\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "{'max_ram': '896 MB', 'train_time': Timedelta('0 days 01:10:03.206000'), 'model': 'lsi', 'mean_ram': '882 MB'}\n" ] } ], "source": [ "print(row)\n", "lsi.save('lsi.model')" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-05-06 17:34:52,027 : INFO : loading LsiModel object from lsi.model\n", "2019-05-06 17:34:52,039 : INFO : loading id2word recursively from lsi.model.id2word.* with mmap=None\n", "2019-05-06 17:34:52,039 : INFO : setting ignored attribute projection to None\n", "2019-05-06 17:34:52,040 : INFO : setting ignored attribute dispatcher to None\n", "2019-05-06 17:34:52,040 : INFO : loaded lsi.model\n", "2019-05-06 17:34:52,041 : INFO : loading LsiModel object from lsi.model.projection\n", "2019-05-06 17:34:52,115 : INFO : loaded lsi.model.projection\n", "2019-05-06 17:35:03,515 : INFO : CorpusAccumulator accumulated stats from 1000 documents\n", "2019-05-06 17:35:03,650 : INFO : CorpusAccumulator accumulated stats from 2000 documents\n", "2019-05-06 17:35:03,791 : INFO : CorpusAccumulator accumulated stats from 3000 documents\n", 
"2019-05-06 17:35:03,929 : INFO : CorpusAccumulator accumulated stats from 4000 documents\n", "2019-05-06 17:35:04,071 : INFO : CorpusAccumulator accumulated stats from 5000 documents\n", "2019-05-06 17:35:04,211 : INFO : CorpusAccumulator accumulated stats from 6000 documents\n", "2019-05-06 17:35:04,347 : INFO : CorpusAccumulator accumulated stats from 7000 documents\n", "2019-05-06 17:35:04,478 : INFO : CorpusAccumulator accumulated stats from 8000 documents\n", "2019-05-06 17:35:04,622 : INFO : CorpusAccumulator accumulated stats from 9000 documents\n", "2019-05-06 17:35:04,758 : INFO : CorpusAccumulator accumulated stats from 10000 documents\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "{'mean_ram': '882 MB', 'topics': [(0, '0.260*\"he\" + 0.172*\"his\" + 0.131*\"she\" + 0.111*\"her\" + 0.105*\"that\" + 0.094*\"district\" + 0.092*\"were\" + 0.087*\"school\" + 0.077*\"had\" + 0.077*\"film\"'), (1, '-0.430*\"district\" + -0.308*\"village\" + -0.270*\"population\" + -0.264*\"census\" + -0.263*\"romanized\" + -0.255*\"iran\" + -0.247*\"rural\" + -0.232*\"county\" + -0.221*\"province\" + -0.204*\"families\"'), (2, '-0.287*\"league\" + -0.275*\"he\" + -0.215*\"football\" + 0.212*\"album\" + -0.183*\"season\" + -0.180*\"team\" + -0.169*\"club\" + -0.161*\"cup\" + -0.136*\"played\" + 0.128*\"song\"'), (3, '-0.216*\"poland\" + -0.214*\"gmina\" + 0.213*\"romanized\" + -0.213*\"voivodeship\" + 0.209*\"iran\" + 0.208*\"album\" + 0.180*\"rural\" + -0.156*\"administrative\" + -0.155*\"east\" + -0.154*\"lies\"'), (4, '-0.260*\"album\" + -0.252*\"gmina\" + -0.252*\"poland\" + -0.250*\"voivodeship\" + -0.171*\"administrative\" + -0.161*\"lies\" + -0.152*\"song\" + -0.136*\"chart\" + -0.135*\"approximately\" + -0.129*\"village\"')], 'model': 'lsi', 'coherence': -3.9198, 'l2_norm': 94.5983, 'train_time': Timedelta('0 days 01:10:03.206000'), 'max_ram': '896 MB', 'f1': None}\n" ] } ], "source": [ "lsi = LsiModel.load('lsi.model')\n", "row.update(get_metrics(lsi, 
test_corpus_tfidf))\n", "print(row)\n", "tm_metrics = tm_metrics.append(pd.Series(row), ignore_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train Gensim LDA and record its metrics" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "row = {}\n", "row['model'] = 'lda'\n", "row['train_time'], row['mean_ram'], row['max_ram'], lda = get_train_time_and_ram(\n", " lambda: LdaModel(corpus=train_corpus, **params),\n", " 'lda',\n", " 1,\n", ")" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-05-06 18:56:04,893 : INFO : saving LdaState object under lda.model.state, separately None\n", "2019-05-06 18:56:04,956 : INFO : saved lda.model.state\n", "2019-05-06 18:56:04,977 : INFO : saving LdaModel object under lda.model, separately ['expElogbeta', 'sstats']\n", "2019-05-06 18:56:04,979 : INFO : not storing attribute state\n", "2019-05-06 18:56:04,980 : INFO : not storing attribute id2word\n", "2019-05-06 18:56:04,981 : INFO : storing np array 'expElogbeta' to lda.model.expElogbeta.npy\n", "2019-05-06 18:56:04,992 : INFO : not storing attribute dispatcher\n", "2019-05-06 18:56:04,995 : INFO : saved lda.model\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "{'max_ram': '864 MB', 'train_time': Timedelta('0 days 01:20:59.941000'), 'model': 'lda', 'mean_ram': '862 MB'}\n" ] } ], "source": [ "print(row)\n", "lda.save('lda.model')" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-05-06 18:56:05,002 : INFO : loading LdaModel object from lda.model\n", "2019-05-06 18:56:05,005 : INFO : loading expElogbeta from lda.model.expElogbeta.npy with mmap=None\n", "2019-05-06 18:56:05,007 : INFO : setting ignored attribute state to None\n", "2019-05-06 18:56:05,008 : INFO : setting ignored attribute id2word 
to None\n", "2019-05-06 18:56:05,008 : INFO : setting ignored attribute dispatcher to None\n", "2019-05-06 18:56:05,009 : INFO : loaded lda.model\n", "2019-05-06 18:56:05,009 : INFO : loading LdaState object from lda.model.state\n", "2019-05-06 18:56:05,039 : INFO : loaded lda.model.state\n", "2019-05-06 18:56:20,341 : INFO : CorpusAccumulator accumulated stats from 1000 documents\n", "2019-05-06 18:56:20,466 : INFO : CorpusAccumulator accumulated stats from 2000 documents\n", "2019-05-06 18:56:20,595 : INFO : CorpusAccumulator accumulated stats from 3000 documents\n", "2019-05-06 18:56:20,715 : INFO : CorpusAccumulator accumulated stats from 4000 documents\n", "2019-05-06 18:56:20,848 : INFO : CorpusAccumulator accumulated stats from 5000 documents\n", "2019-05-06 18:56:20,973 : INFO : CorpusAccumulator accumulated stats from 6000 documents\n", "2019-05-06 18:56:21,100 : INFO : CorpusAccumulator accumulated stats from 7000 documents\n", "2019-05-06 18:56:21,226 : INFO : CorpusAccumulator accumulated stats from 8000 documents\n", "2019-05-06 18:56:21,351 : INFO : CorpusAccumulator accumulated stats from 9000 documents\n", "2019-05-06 18:56:21,483 : INFO : CorpusAccumulator accumulated stats from 10000 documents\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "{'mean_ram': '862 MB', 'topics': [(11, '0.066*\"de\" + 0.034*\"art\" + 0.030*\"french\" + 0.028*\"la\" + 0.022*\"france\" + 0.019*\"paris\" + 0.017*\"le\" + 0.016*\"museum\" + 0.013*\"van\" + 0.013*\"saint\"'), (45, '0.033*\"new\" + 0.027*\"states\" + 0.025*\"united\" + 0.023*\"york\" + 0.023*\"american\" + 0.023*\"county\" + 0.021*\"state\" + 0.017*\"city\" + 0.014*\"california\" + 0.012*\"washington\"'), (40, '0.028*\"radio\" + 0.025*\"show\" + 0.021*\"tv\" + 0.020*\"television\" + 0.016*\"news\" + 0.015*\"station\" + 0.014*\"channel\" + 0.012*\"fm\" + 0.012*\"network\" + 0.011*\"media\"'), (28, '0.064*\"university\" + 0.018*\"research\" + 0.015*\"college\" + 0.014*\"institute\" + 
0.013*\"science\" + 0.011*\"professor\" + 0.010*\"has\" + 0.010*\"international\" + 0.009*\"national\" + 0.009*\"society\"'), (20, '0.179*\"he\" + 0.123*\"his\" + 0.015*\"born\" + 0.014*\"after\" + 0.013*\"him\" + 0.011*\"who\" + 0.011*\"career\" + 0.010*\"had\" + 0.010*\"later\" + 0.009*\"where\"')], 'model': 'lda', 'coherence': -1.7641, 'l2_norm': None, 'train_time': Timedelta('0 days 01:20:59.941000'), 'max_ram': '864 MB', 'f1': None}\n" ] } ], "source": [ "lda = LdaModel.load('lda.model')\n", "row.update(get_metrics(lda, test_corpus))\n", "print(row)\n", "tm_metrics = tm_metrics.append(pd.Series(row), ignore_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train Sklearn NMF and record its metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Careful!** Sklearn loads the entire input Wikipedia matrix into RAM. Even though the matrix is sparse, **you'll need roughly 18 GB of free RAM to run the cell below** (the run recorded here peaked at 18177 MB)." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'max_ram': '18177 MB', 'train_time': Timedelta('0 days 00:47:41.026000'), 'model': 'sklearn_nmf', 'mean_ram': '12872 MB'}\n" ] }, { "data": { "text/plain": [ "['sklearn_nmf.joblib']" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "row = {}\n", "row['model'] = 'sklearn_nmf'\n", "sklearn_nmf = SklearnNmf(n_components=50, tol=1e-2, random_state=42)\n", "row['train_time'], row['mean_ram'], row['max_ram'], sklearn_nmf = get_train_time_and_ram(\n", "    lambda: sklearn_nmf.fit(scipy.sparse.load_npz('wiki_train_csr.npz')),\n", "    'sklearn_nmf',\n", "    10,\n", ")\n", "print(row)\n", "joblib.dump(sklearn_nmf, 'sklearn_nmf.joblib')" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-05-06 19:44:22,696 : INFO : CorpusAccumulator accumulated stats from 1000 
documents\n", "2019-05-06 19:44:22,841 : INFO : CorpusAccumulator accumulated stats from 2000 documents\n", "2019-05-06 19:44:22,994 : INFO : CorpusAccumulator accumulated stats from 3000 documents\n", "2019-05-06 19:44:23,142 : INFO : CorpusAccumulator accumulated stats from 4000 documents\n", "2019-05-06 19:44:23,297 : INFO : CorpusAccumulator accumulated stats from 5000 documents\n", "2019-05-06 19:44:23,448 : INFO : CorpusAccumulator accumulated stats from 6000 documents\n", "2019-05-06 19:44:23,595 : INFO : CorpusAccumulator accumulated stats from 7000 documents\n", "2019-05-06 19:44:23,737 : INFO : CorpusAccumulator accumulated stats from 8000 documents\n", "2019-05-06 19:44:23,889 : INFO : CorpusAccumulator accumulated stats from 9000 documents\n", "2019-05-06 19:44:24,038 : INFO : CorpusAccumulator accumulated stats from 10000 documents\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "{'mean_ram': '12872 MB', 'topics': [(0, '0.067*\"gmina\" + 0.067*\"poland\" + 0.066*\"voivodeship\" + 0.047*\"administrative\" + 0.043*\"lies\" + 0.038*\"approximately\" + 0.037*\"east\" + 0.032*\"west\" + 0.031*\"county\" + 0.03*\"regional\"'), (1, '0.098*\"district\" + 0.077*\"romanized\" + 0.075*\"iran\" + 0.071*\"rural\" + 0.062*\"census\" + 0.061*\"province\" + 0.054*\"families\" + 0.048*\"population\" + 0.043*\"county\" + 0.031*\"village\"'), (2, '0.095*\"apelor\" + 0.09*\"bucurești\" + 0.047*\"cadastrul\" + 0.047*\"hidrologie\" + 0.047*\"meteorologie\" + 0.047*\"institutul\" + 0.047*\"române\" + 0.047*\"româniei\" + 0.047*\"rîurile\" + 0.046*\"river\"'), (3, '0.097*\"commune\" + 0.05*\"department\" + 0.045*\"communes\" + 0.03*\"insee\" + 0.029*\"france\" + 0.018*\"population\" + 0.015*\"saint\" + 0.014*\"region\" + 0.01*\"town\" + 0.01*\"french\"'), (4, '0.148*\"township\" + 0.05*\"county\" + 0.018*\"townships\" + 0.018*\"unincorporated\" + 0.015*\"community\" + 0.011*\"indiana\" + 0.01*\"census\" + 0.01*\"creek\" + 0.009*\"pennsylvania\" + 
0.009*\"illinois\"')], 'model': 'sklearn_nmf', 'coherence': -2.0476, 'l2_norm': 94.8459, 'train_time': Timedelta('0 days 00:47:41.026000'), 'max_ram': '18177 MB', 'f1': None}\n" ] } ], "source": [ "sklearn_nmf = joblib.load('sklearn_nmf.joblib')\n", "row.update(get_metrics(\n", "    sklearn_nmf, test_corpus_tfidf, dictionary=dictionary,\n", "))\n", "print(row)\n", "tm_metrics = tm_metrics.append(pd.Series(row), ignore_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wikipedia results" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [
"<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>model</th>\n", "      <th>train_time</th>\n", "      <th>mean_ram</th>\n", "      <th>max_ram</th>\n", "      <th>coherence</th>\n", "      <th>l2_norm</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>gensim_nmf</td>\n", "      <td>00:25:46.617000</td>\n", "      <td>771 MB</td>\n", "      <td>774 MB</td>\n", "      <td>-2.1071</td>\n", "      <td>94.9686</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>lsi</td>\n", "      <td>01:10:03.206000</td>\n", "      <td>882 MB</td>\n", "      <td>896 MB</td>\n", "      <td>-3.9198</td>\n", "      <td>94.5983</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>lda</td>\n", "      <td>01:20:59.941000</td>\n", "      <td>862 MB</td>\n", "      <td>864 MB</td>\n", "      <td>-1.7641</td>\n", "      <td>-</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>sklearn_nmf</td>\n", "      <td>00:47:41.026000</td>\n", "      <td>12872 MB</td>\n", "      <td>18177 MB</td>\n", "      <td>-2.0476</td>\n", "      <td>94.8459</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>"
], "text/plain": [ "         model      train_time  mean_ram   max_ram  coherence  l2_norm\n", "0   gensim_nmf 00:25:46.617000    771 MB    774 MB    -2.1071  94.9686\n", "1          lsi 01:10:03.206000    882 MB    896 MB    -3.9198  94.5983\n", "2          lda 01:20:59.941000    862 MB    864 MB    -1.7641        -\n", "3  sklearn_nmf 00:47:41.026000  12872 MB  18177 MB    -2.0476  94.8459" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm_metrics.replace(np.nan, '-', inplace=True)\n", "tm_metrics.drop(['topics', 'f1'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Insights\n", "\n", "Gensim's online NMF is the fastest of the four models and has the smallest memory footprint.\n", "\n", "Compared to Sklearn's NMF:\n", "\n", "- About **2x** faster (26 min vs. 48 min).\n", "\n", "- Uses **~20x** less memory (774 MB vs. 18177 MB peak).\n", "\n", "    About **8GB** of Sklearn's RAM comes from the in-memory input matrices, which, in contrast to Gensim NMF, cannot be streamed iteratively. But even if we forget about the huge input size, Sklearn NMF uses about **2-8 GB** of RAM, significantly more than Gensim NMF or LDA.\n", "\n", "- L2 norm (94.97 vs. 94.85) and coherence (-2.11 vs. -2.05) are slightly worse.\n", "\n", "Compared to Gensim's LSI:\n", "\n", "- About **3x** faster (26 min vs. 70 min).\n", "- Better coherence (-2.11 vs. -3.92) but slightly worse L2 norm (94.97 vs. 94.60).\n", "\n", "Compared to Gensim's LDA:\n", "\n", "- About **3x** faster (26 min vs. 81 min).\n", "- Coherence is worse than LDA's (-2.11 vs. -1.76)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Learned Wikipedia topics" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "gensim_nmf:\n", "(21, '0.009*\"his\" + 0.005*\"that\" + 0.005*\"him\" + 0.004*\"had\" + 0.003*\"they\" + 0.003*\"who\" + 0.003*\"her\" + 0.003*\"but\" + 0.003*\"king\" + 0.003*\"were\"')\n", "(39, '0.005*\"are\" + 0.005*\"or\" + 0.004*\"be\" + 0.004*\"that\" + 0.003*\"can\" + 0.003*\"used\" + 0.003*\"this\" + 0.002*\"have\" + 0.002*\"such\" + 0.002*\"which\"')\n", "(45, '0.092*\"apelor\" + 0.087*\"bucurești\" + 0.050*\"river\" + 0.046*\"cadastrul\" + 0.046*\"hidrologie\" + 0.046*\"meteorologie\" + 0.046*\"institutul\" + 0.046*\"române\" + 0.045*\"româniei\" + 0.045*\"rîurile\"')\n", "(28, '0.066*\"gmina\" + 0.065*\"poland\" + 0.065*\"voivodeship\" + 0.046*\"village\" + 0.045*\"administrative\" + 0.042*\"lies\" + 0.037*\"approximately\" + 0.036*\"east\" + 0.031*\"west\" + 0.030*\"county\"')\n", "(34, '0.087*\"romanized\" + 0.084*\"iran\" + 0.067*\"rural\" + 0.067*\"province\" + 0.066*\"census\" + 0.060*\"families\" + 0.055*\"village\" + 0.049*\"county\" + 0.047*\"population\" + 0.043*\"district\"')\n", "\n", "lsi:\n", "(0, '0.260*\"he\" + 0.172*\"his\" + 0.131*\"she\" + 0.111*\"her\" + 0.105*\"that\" + 0.094*\"district\" + 0.092*\"were\" + 0.087*\"school\" + 0.077*\"had\" + 0.077*\"film\"')\n", "(1, '-0.430*\"district\" + -0.308*\"village\" + -0.270*\"population\" + -0.264*\"census\" + -0.263*\"romanized\" + -0.255*\"iran\" + -0.247*\"rural\" + -0.232*\"county\" + -0.221*\"province\" + -0.204*\"families\"')\n", "(2, '-0.287*\"league\" + -0.275*\"he\" + -0.215*\"football\" + 0.212*\"album\" + -0.183*\"season\" + -0.180*\"team\" + -0.169*\"club\" + -0.161*\"cup\" + -0.136*\"played\" + 0.128*\"song\"')\n", "(3, '-0.216*\"poland\" + -0.214*\"gmina\" + 0.213*\"romanized\" + -0.213*\"voivodeship\" + 0.209*\"iran\" + 0.208*\"album\" + 
0.180*\"rural\" + -0.156*\"administrative\" + -0.155*\"east\" + -0.154*\"lies\"')\n", "(4, '-0.260*\"album\" + -0.252*\"gmina\" + -0.252*\"poland\" + -0.250*\"voivodeship\" + -0.171*\"administrative\" + -0.161*\"lies\" + -0.152*\"song\" + -0.136*\"chart\" + -0.135*\"approximately\" + -0.129*\"village\"')\n", "\n", "lda:\n", "(11, '0.066*\"de\" + 0.034*\"art\" + 0.030*\"french\" + 0.028*\"la\" + 0.022*\"france\" + 0.019*\"paris\" + 0.017*\"le\" + 0.016*\"museum\" + 0.013*\"van\" + 0.013*\"saint\"')\n", "(45, '0.033*\"new\" + 0.027*\"states\" + 0.025*\"united\" + 0.023*\"york\" + 0.023*\"american\" + 0.023*\"county\" + 0.021*\"state\" + 0.017*\"city\" + 0.014*\"california\" + 0.012*\"washington\"')\n", "(40, '0.028*\"radio\" + 0.025*\"show\" + 0.021*\"tv\" + 0.020*\"television\" + 0.016*\"news\" + 0.015*\"station\" + 0.014*\"channel\" + 0.012*\"fm\" + 0.012*\"network\" + 0.011*\"media\"')\n", "(28, '0.064*\"university\" + 0.018*\"research\" + 0.015*\"college\" + 0.014*\"institute\" + 0.013*\"science\" + 0.011*\"professor\" + 0.010*\"has\" + 0.010*\"international\" + 0.009*\"national\" + 0.009*\"society\"')\n", "(20, '0.179*\"he\" + 0.123*\"his\" + 0.015*\"born\" + 0.014*\"after\" + 0.013*\"him\" + 0.011*\"who\" + 0.011*\"career\" + 0.010*\"had\" + 0.010*\"later\" + 0.009*\"where\"')\n", "\n", "sklearn_nmf:\n", "(0, '0.067*\"gmina\" + 0.067*\"poland\" + 0.066*\"voivodeship\" + 0.047*\"administrative\" + 0.043*\"lies\" + 0.038*\"approximately\" + 0.037*\"east\" + 0.032*\"west\" + 0.031*\"county\" + 0.03*\"regional\"')\n", "(1, '0.098*\"district\" + 0.077*\"romanized\" + 0.075*\"iran\" + 0.071*\"rural\" + 0.062*\"census\" + 0.061*\"province\" + 0.054*\"families\" + 0.048*\"population\" + 0.043*\"county\" + 0.031*\"village\"')\n", "(2, '0.095*\"apelor\" + 0.09*\"bucurești\" + 0.047*\"cadastrul\" + 0.047*\"hidrologie\" + 0.047*\"meteorologie\" + 0.047*\"institutul\" + 0.047*\"române\" + 0.047*\"româniei\" + 0.047*\"rîurile\" + 0.046*\"river\"')\n", "(3, '0.097*\"commune\" 
+ 0.05*\"department\" + 0.045*\"communes\" + 0.03*\"insee\" + 0.029*\"france\" + 0.018*\"population\" + 0.015*\"saint\" + 0.014*\"region\" + 0.01*\"town\" + 0.01*\"french\"')\n", "(4, '0.148*\"township\" + 0.05*\"county\" + 0.018*\"townships\" + 0.018*\"unincorporated\" + 0.015*\"community\" + 0.011*\"indiana\" + 0.01*\"census\" + 0.01*\"creek\" + 0.009*\"pennsylvania\" + 0.009*\"illinois\"')\n" ] } ], "source": [ "def compare_topics(tm_metrics):\n", "    for _, row in tm_metrics.iterrows():\n", "        print('\\n{}:'.format(row.model))\n", "        print(\"\\n\".join(str(topic) for topic in row.topics))\n", "\n", "compare_topics(tm_metrics)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All four models learned recognizable topics from the Wikipedia corpus, although a few of the topics shown above (Gensim NMF topics 21 and 39, LDA topic 20) consist mostly of function words." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5. And now for something completely different: Face decomposition from images" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The NMF algorithm in Gensim is optimized for extremely large (sparse) text corpora, but it will also work on vectors from other domains!\n", "\n", "Let's compare our model to other factorization algorithms on dense image vectors and check out the results.\n", "\n", "To do that we'll patch sklearn's [Faces Dataset Decomposition](https://scikit-learn.org/stable/auto_examples/decomposition/plot_faces_decomposition.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sklearn wrapper\n", "Let's create a Scikit-learn wrapper so that Gensim NMF can be run on images." 
] }, { "cell_type": "code", "execution_count": 48, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "import logging\n", "import time\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "from numpy.random import RandomState\n", "from sklearn import decomposition\n", "from sklearn.cluster import MiniBatchKMeans\n", "from sklearn.datasets import fetch_olivetti_faces\n", "from sklearn.decomposition import NMF as SklearnNmf\n", "from sklearn.linear_model import LogisticRegressionCV\n", "from sklearn.metrics import f1_score\n", "from sklearn.model_selection import ParameterGrid\n", "\n", "import gensim.downloader\n", "from gensim import matutils\n", "from gensim.corpora import Dictionary\n", "from gensim.models import CoherenceModel, LdaModel, LdaMulticore\n", "from gensim.models.nmf import Nmf as GensimNmf\n", "from gensim.parsing.preprocessing import preprocess_string\n", "\n", "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "from sklearn.base import BaseEstimator, TransformerMixin\n", "import scipy.sparse as sparse\n", "\n", "\n", "class NmfWrapper(BaseEstimator, TransformerMixin):\n", "    \"\"\"Thin adapter that exposes Gensim NMF through the Sklearn estimator interface.\"\"\"\n", "\n", "    def __init__(self, bow_matrix, **kwargs):\n", "        # Stored as a sparse CSC matrix; the caller passes faces.T, i.e. one column per image\n", "        self.corpus = sparse.csc_matrix(bow_matrix)\n", "        self.nmf = GensimNmf(**kwargs)\n", "\n", "    def fit(self, X, y=None):\n", "        # X is ignored: the model trains on the matrix given to the constructor\n", "        self.nmf.update(self.corpus)\n", "        return self\n", "\n", "    @property\n", "    def components_(self):\n", "        return self.nmf.get_topics()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Modified face decomposition notebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Adapted from the excellent [Scikit-learn tutorial](https://github.com/scikit-learn/scikit-learn/blob/master/examples/decomposition/plot_faces_decomposition.py) (BSD license):" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "Turn off the logger due to large number of info messages during training" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "gensim.models.nmf.logger.propagate = False" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "============================\n", "Faces dataset decompositions\n", "============================\n", "\n", "This example applies to :ref:`olivetti_faces` different unsupervised\n", "matrix decomposition (dimension reduction) methods from the module\n", ":py:mod:`sklearn.decomposition` (see the documentation chapter\n", ":ref:`decompositions`) .\n", "\n", "\n", "Dataset consists of 400 faces\n", "Extracting the top 6 Eigenfaces - PCA using randomized SVD...\n", "done in 0.025s\n", "Extracting the top 6 Non-negative components - NMF (Sklearn)...\n", "done in 0.171s\n", "Extracting the top 6 Non-negative components - NMF (Gensim)...\n", "done in 0.582s\n", "Extracting the top 6 Independent components - FastICA...\n", "done in 0.097s\n", "Extracting the top 6 Sparse comp. - MiniBatchSparsePCA...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/anotherbugmaster/.local/lib/python3.5/site-packages/sklearn/decomposition/sparse_pca.py:405: DeprecationWarning: normalize_components=False is a backward-compatible setting that implements a non-standard definition of sparse PCA. 
This compatibility mode will be removed in 0.22.\n", " DeprecationWarning)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "done in 0.694s\n", "Extracting the top 6 MiniBatchDictionaryLearning...\n", "done in 0.479s\n", "Extracting the top 6 Cluster centers - MiniBatchKMeans...\n", "done in 0.087s\n", "Extracting the top 6 Factor Analysis components - FA...\n", "done in 0.048s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/anotherbugmaster/.local/lib/python3.5/site-packages/sklearn/decomposition/factor_analysis.py:238: ConvergenceWarning: FactorAnalysis did not converge. You might want to increase the number of iterations.\n", " ConvergenceWarning)\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\"\"\"\n", "============================\n", "Faces dataset decompositions\n", "============================\n", "\n", "This example applies to :ref:`olivetti_faces` different unsupervised\n", "matrix decomposition (dimension reduction) methods from the module\n", ":py:mod:`sklearn.decomposition` (see the documentation chapter\n", ":ref:`decompositions`) .\n", "\n", "\"\"\"\n", "print(__doc__)\n", "\n", "# Authors: Vlad Niculae, Alexandre Gramfort\n", "# License: BSD 3 clause\n", "\n", "n_row, n_col = 2, 3\n", "n_components = n_row * n_col\n", "image_shape = (64, 64)\n", "rng = RandomState(0)\n", "\n", "# #############################################################################\n", "# Load faces data\n", "dataset = fetch_olivetti_faces(shuffle=True, random_state=rng)\n", "faces = dataset.data\n", "\n", "n_samples, n_features = faces.shape\n", "\n", "# global centering\n", "faces_centered = faces - faces.mean(axis=0)\n", "\n", "# local centering\n", "faces_centered -= faces_centered.mean(axis=1).reshape(n_samples, -1)\n", "\n", "print(\"Dataset consists of %d faces\" % n_samples)\n", "\n", "\n", "def plot_gallery(title, images, n_col=n_col, n_row=n_row):\n", "    plt.figure(figsize=(2. 
* n_col, 2.26 * n_row))\n", "    plt.suptitle(title, size=16)\n", "    for i, comp in enumerate(images):\n", "        plt.subplot(n_row, n_col, i + 1)\n", "        vmax = max(comp.max(), -comp.min())\n", "        plt.imshow(comp.reshape(image_shape), cmap=plt.cm.gray,\n", "                   interpolation='nearest',\n", "                   vmin=-vmax, vmax=vmax)\n", "        plt.xticks(())\n", "        plt.yticks(())\n", "    plt.subplots_adjust(0.01, 0.05, 0.99, 0.93, 0.04, 0.)\n", "\n", "\n", "# #############################################################################\n", "# List of the different estimators, whether to center and transpose the\n", "# problem, and whether the transformer uses the clustering API.\n", "estimators = [\n", "    ('Eigenfaces - PCA using randomized SVD',\n", "     decomposition.PCA(n_components=n_components, svd_solver='randomized',\n", "                       whiten=True),\n", "     True),\n", "\n", "    ('Non-negative components - NMF (Sklearn)',\n", "     decomposition.NMF(n_components=n_components, init='nndsvda', tol=5e-3),\n", "     False),\n", "\n", "    ('Non-negative components - NMF (Gensim)',\n", "     NmfWrapper(\n", "         bow_matrix=faces.T,\n", "         chunksize=3,\n", "         eval_every=400,\n", "         passes=2,\n", "         id2word={idx: idx for idx in range(faces.shape[1])},\n", "         num_topics=n_components,\n", "         minimum_probability=0,\n", "         random_state=42,\n", "     ),\n", "     False),\n", "\n", "    ('Independent components - FastICA',\n", "     decomposition.FastICA(n_components=n_components, whiten=True),\n", "     True),\n", "\n", "    ('Sparse comp. 
- MiniBatchSparsePCA',\n", "     decomposition.MiniBatchSparsePCA(n_components=n_components, alpha=0.8,\n", "                                      n_iter=100, batch_size=3,\n", "                                      random_state=rng),\n", "     True),\n", "\n", "    ('MiniBatchDictionaryLearning',\n", "     decomposition.MiniBatchDictionaryLearning(n_components=15, alpha=0.1,\n", "                                               n_iter=50, batch_size=3,\n", "                                               random_state=rng),\n", "     True),\n", "\n", "    ('Cluster centers - MiniBatchKMeans',\n", "     MiniBatchKMeans(n_clusters=n_components, tol=1e-3, batch_size=20,\n", "                     max_iter=50, random_state=rng),\n", "     True),\n", "\n", "    ('Factor Analysis components - FA',\n", "     decomposition.FactorAnalysis(n_components=n_components, max_iter=2),\n", "     True),\n", "]\n", "\n", "# #############################################################################\n", "# Plot a sample of the input data\n", "\n", "plot_gallery(\"First centered Olivetti faces\", faces_centered[:n_components])\n", "\n", "# #############################################################################\n", "# Do the estimation and plot it\n", "\n", "for name, estimator, center in estimators:\n", "    print(\"Extracting the top %d %s...\" % (n_components, name))\n", "    t0 = time.time()\n", "    data = faces\n", "    if center:\n", "        data = faces_centered\n", "    estimator.fit(data)\n", "    train_time = (time.time() - t0)\n", "    print(\"done in %0.3fs\" % train_time)\n", "    if hasattr(estimator, 'cluster_centers_'):\n", "        components_ = estimator.cluster_centers_\n", "    else:\n", "        components_ = estimator.components_\n", "\n", "    # Plot an image representing the pixelwise variance provided by the\n", "    # estimator, e.g. its noise_variance_ attribute. 
The Eigenfaces estimator,\n", "    # via the PCA decomposition, also provides a scalar noise_variance_\n", "    # (the mean of pixelwise variance) that cannot be displayed as an image\n", "    # so we skip it.\n", "    if (hasattr(estimator, 'noise_variance_') and\n", "            estimator.noise_variance_.ndim > 0):  # Skip the Eigenfaces case\n", "        plot_gallery(\"Pixelwise variance\",\n", "                     estimator.noise_variance_.reshape(1, -1), n_col=1,\n", "                     n_row=1)\n", "    plot_gallery('%s - Train time %.1fs' % (name, train_time),\n", "                 components_[:n_components])\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, Gensim's NMF implementation is slower than Sklearn's on **dense** vectors (0.582s vs. 0.171s in the run above), while achieving comparable quality." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conclusion\n", "\n", "Gensim NMF is a fast, memory-efficient model for large, sparse, streamed inputs. Use it to obtain interpretable topics, as an alternative to SVD / LDA.\n", "\n", "---\n", "\n", "The NMF implementation in Gensim was created by [Timofey Yefimov](https://github.com/anotherbugmaster/) as part of his [RARE Technologies Student Incubator](https://rare-technologies.com/incubator/) graduation project." ] } ], "metadata": { "jupytext": { "text_representation": { "extension": ".py", "format_name": "percent", "format_version": "1.1", "jupytext_version": "0.8.3" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }