{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\nEnsemble LDA\n============\n\nIntroduces Gensim's EnsembleLda model\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial will explain how to use the EnsembleLDA model class.\n\nEnsembleLda is a method of finding and generating stable topics from the results of multiple topic models,\nit can be used to remove topics from your results that are noise and are not reproducible.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Corpus\n------\nWe will use the gensim downloader api to get a small corpus for training our ensemble.\n\nThe preprocessing is similar to `sphx_glr_auto_examples_tutorials_run_word2vec.py`,\nso it won't be explained again in detail.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import gensim.downloader as api\nfrom gensim.corpora import Dictionary\nfrom nltk.stem.wordnet import WordNetLemmatizer\n\nlemmatizer = WordNetLemmatizer()\ndocs = api.load('text8')\n\ndictionary = Dictionary()\nfor doc in docs:\n dictionary.add_documents([[lemmatizer.lemmatize(token) for token in doc]])\ndictionary.filter_extremes(no_below=20, no_above=0.5)\n\ncorpus = [dictionary.doc2bow(doc) for doc in docs]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Training\n--------\n\nTraining the ensemble works very similar to training a single model,\n\nYou can use any model that is based on LdaModel, such as LdaMulticore, to train the Ensemble.\nIn experiments, LdaMulticore showed better results.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from gensim.models import LdaModel\ntopic_model_class = LdaModel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Any arbitrary number of models can be used, but it should be a multiple of your workers so that the\nload can be distributed properly. In this example, 4 processes will train 8 models each.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ensemble_workers = 4\nnum_models = 8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After training all the models, some distance computations are required which can take quite some\ntime as well. You can speed this up by using workers for that as well.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "distance_workers = 4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All other parameters that are unknown to EnsembleLda are forwarded to each LDA Model, such as\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "num_topics = 20\npasses = 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now start the training\n\nSince 20 topics were trained on each of the 8 models, we expect there to be 160 different topics.\nThe number of stable topics which are clustered from all those topics is smaller.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from gensim.models import EnsembleLda\nensemble = EnsembleLda(\n corpus=corpus,\n id2word=dictionary,\n num_topics=num_topics,\n passes=passes,\n num_models=num_models,\n topic_model_class=LdaModel,\n ensemble_workers=ensemble_workers,\n distance_workers=distance_workers\n)\n\nprint(len(ensemble.ttda))\nprint(len(ensemble.get_topics()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tuning\n------\n\nDifferent from LdaModel, the number of resulting topics varies greatly depending on the clustering parameters.\n\nYou can provide those in the ``recluster()`` function or the ``EnsembleLda`` constructor.\n\nPlay around until you get as many topics as you desire, which however may reduce their quality.\nIf your ensemble doesn't have enough topics to begin with, you should make sure to make it large enough.\n\nHaving an epsilon that is smaller than the smallest distance doesn't make sense.\nMake sure to chose one that is within the range of values in ``asymmetric_distance_matrix``.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\nshape = ensemble.asymmetric_distance_matrix.shape\nwithout_diagonal = ensemble.asymmetric_distance_matrix[~np.eye(shape[0], dtype=bool)].reshape(shape[0], -1)\nprint(without_diagonal.min(), without_diagonal.mean(), without_diagonal.max())\n\nensemble.recluster(eps=0.09, min_samples=2, min_cores=2)\n\nprint(len(ensemble.get_topics()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Increasing the Size\n-------------------\n\nIf you have some models lying around that were trained on a corpus based on the same dictionary,\nthey are compatible and you can add them to the ensemble.\n\nBy setting num_models of the EnsembleLda constructor to 0 you can also create an ensemble that is\nentirely made out of your existing topic models with the following method.\n\nAfterwards the number and quality of stable topics might be different depending on your added topics and parameters.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from gensim.models import LdaMulticore\n\nmodel1 = LdaMulticore(\n corpus=corpus,\n id2word=dictionary,\n num_topics=9,\n passes=4,\n)\n\nmodel2 = LdaModel(\n corpus=corpus,\n id2word=dictionary,\n num_topics=11,\n passes=2,\n)\n\n# add_model supports various types of input, check out its docstring\nensemble.add_model(model1)\nensemble.add_model(model2)\n\nensemble.recluster()\n\nprint(len(ensemble.ttda))\nprint(len(ensemble.get_topics()))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 0 }