{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# artm.LDA class usage (BigARTM Python API)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Author - **Murat Apishev** (great-mel@yandex.ru)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "artm.LDA was designed for non-advanced users with minimal knowledge about topic modeling and ARTM. It is a cutted version of artm.ARTM model with pre-defined scores and regularizers. artm.LDA has enough abilities for fitting the LDA with regularizers of smoothing/sparsing of $\\Phi$ and $\\Theta$ matrices with offline or online algorithms. Also it can compute scores of perplexity, matrices sparsities and most probable tokens in each topic, and return the whole resulting matrices." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You cna find the information about artm.LDA here: http://bigartm.readthedocs.org/en/master/python_interface/lda.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's simple model experiment with kos collection in UCI format. You can read about BatchVectorizer and dictionaries in the introduction of this paper: http://nbviewer.ipython.org/github/bigartm/bigartm-book/blob/master/ARTM_tutorial_RU.ipynb\n", "\n", "The collection can be downloaded from here: https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/\n", "\n", "In link above you also can read about the UCI Bag-Of-Words format. You will need two files: docword.kos.txt и vocab.kos.txt.\n", "Both these files should be put into the same directory with this notebook.\n", "\n", "Let's import the artm module, create BatchVectorizer and run dictionary gathering inside of it (if you are interested in the detailes, you need to read information from given links):" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1 . The most basic part." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.8.0\n" ] } ], "source": [ "import artm\n", "\n", "print artm.version()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "batch_vectorizer = artm.BatchVectorizer(data_path='.', data_format='bow_uci',\n", " collection_name='kos', target_folder='kos_batches')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's create the model by defining the number of topics, number of passes through each document, hyperparameters of smoothing of $\\Phi$ and $\\Theta$ and dictionary to use. Also let's ask thr model to store $\\Theta$ matrix to have an ability to look at it in the future.\n", "\n", "Also you can set here the num_processors parameter, which defines the number of threads to be used on your machine for parallelizing the computing:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "lda = artm.LDA(num_topics=15, alpha=0.01, beta=0.001,\n", " num_document_passes=5, dictionary=batch_vectorizer.dictionary,\n", " cache_theta=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Well, let's run the learning process using offline algorithm:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "lda.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's all, the fitting is over. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "That's all, the fitting is over. You can repeat this process, edit the model parameters, etc.\n", "\n", "Now you can look at the results of modeling. For instance, let's look at the final values of the sparsity of both matrices:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print lda.sparsity_phi_last_value\n", "print lda.sparsity_theta_last_value" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or at all the values of perplexity (one value per collection pass):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "lda.perplexity_value" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, you can look at the most probable tokens in each topic. They are returned as a list of lists of strings (each inner list corresponds to one topic, in order). Let's output them with some formatting:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "top_tokens = lda.get_top_tokens(num_tokens=10)\n", "for i, token_list in enumerate(top_tokens):\n", "    print 'Topic #{0}: {1}'.format(i, token_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get the matrices themselves you can use the following calls:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "phi = lda.phi_\n", "theta = lda.get_theta()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Some more details." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are two more abilities of artm.LDA.\n", "\n", "The first one is the ability to compute the $\\Theta$ matrix for new documents after the model has been fitted:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 'kos_batches_test' is a folder with batches for the new documents\n", "test_batch_vectorizer = artm.BatchVectorizer(data_path='kos_batches_test')\n", "theta_test = lda.transform(batch_vectorizer=test_batch_vectorizer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The second one: in case you need custom regularization of each topic in the $\\Phi$ matrix, you can set beta to a list instead of a scalar value. The list should have length equal to the number of topics, and then each topic will be regularized with its corresponding coefficient:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "num_topics = 15\n", "beta = [0.1] * num_topics  # change as you need\n", "lda = artm.LDA(num_topics=num_topics, alpha=0.01, beta=beta, num_document_passes=5,\n", "               dictionary=batch_vectorizer.dictionary, cache_theta=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conclusions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The described model is the simplest way to study topic modeling in BigARTM. You need nothing except a BatchVectorizer and the couple of lines of code described above.\n", "\n", "If you need more advanced abilities of the library, you should use the artm.ARTM model. A description of this model is given in the main tutorial." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }