{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# BigARTM. Python API users tutorial." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Author - **Murat Apishev** (great-mel@yandex.ru)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is a tutorial for BigARTM library Python API usage in basic cases. If you have a special case6 that isn't covered by this paper, refer it to the bigartm-users@googlegroups.com community." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I assume you have proceeded all steps from the installation instruction (http://bigartm.readthedocs.org/en/master/installation/index.html) and tuned the Python API (you can check it by importing artm module)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's import this module and start our work:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.8.0\n" ] } ], "source": [ "import artm\n", "\n", "print artm.version()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each described case is a separate block of code6 which is not depending on other ones (except the first one, about dictionaries and batches - you need to use it every time). The code can be used in your scripts (of course, if you had prepared data and put them in necessary places)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dictionaries and batches in BigARTM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before starting modeling we need to convert you data in the library format. At first you need to read about supporting formats for source data (http://bigartm.readthedocs.org/en/master/formats.html). It's your task to prepare your data in one of these formats. As you had transformed your data into one of source formats, you can convert them in the BigARTM internal format (batches of documents) using BatchVectorizer class object.\n", "\n", "Really you have one more simple way to process your collection, if it is not too big and you don't need to store it in batches. To use it you need to archive two variables: numpy.ndarray with $n_{wd}$ counters and corresponding Python dict with vocabulary (key - index of numpy.ndarray, value - corresponding token). The simpliest way to get these data is sklearn CountVectorizer usage (or some similar class from sklearn).\n", "\n", "If you have archived described variables run following code:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Well, if you have data in UCI format (e.g. vocab.my_collection.txt and docword.my_collection.txt files), that were put into the same directory with your script or notebook (in our case - with this notebook), you can create batches using next code:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "batch_vectorizer = artm.BatchVectorizer(data_path='',\n", " data_format='bow_uci',\n", " collection_name='my_collection',\n", " target_folder='my_collection_batches')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The built-in library parser converted your data into batches and covered them with the BatchVectorizer class object, that is a general input data type for all methods of Python API. You can read about it here http://bigartm.readthedocs.org/en/master/python_interface/batches_utils.html. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "If you have the source file in the Vowpal Wabbit data format, you can use the following command:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "batch_vectorizer = artm.BatchVectorizer(data_path='',\n", "                                        data_format='vowpal_wabbit',\n", "                                        target_folder='my_collection_batches')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is exactly the same as described above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Important note**: once you have created the batches, you shouldn't launch this process again, because it takes a lot of time on a large collection. You can run the following code instead. It will create the BatchVectorizer object using the existing batches (this operation is very quick):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "batch_vectorizer = artm.BatchVectorizer(data_path='my_collection_batches',\n", "                                        data_format='batches')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next step is to create the dictionary. This is a data structure containing information about all the unique tokens in the collection. The dictionary is generated outside of the model, and this operation can be done in different ways (see more here: http://bigartm.readthedocs.org/en/master/python_interface/dictionary.html). The most basic way is to gather the dictionary from the batches directory. You need to do this operation only once when starting to work with a new collection. Use the following code:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "dictionary = artm.Dictionary()\n", "dictionary.gather(data_path='my_collection_batches')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case the token order in the dictionary (and later in the $\\Phi$ matrix) will be random. If you'd like to specify some order, you need to create a vocab file (see the UCI format) containing all the unique tokens of the collection in the necessary order, and run the code below (assuming your file is named vocab.txt and is located in the same directory as this notebook):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "dictionary = artm.Dictionary()\n", "dictionary.gather(data_path='my_collection_batches',\n", "                  vocab_file_path='vocab.txt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take into consideration the fact that, if you use a vocab file, the library will ignore any token from the batches that is not present in it. The dictionary contains a lot of useful information about the collection. For example, each unique token in it has a corresponding variable - value. When BigARTM gathers the dictionary, it puts the relative frequency of the token into this variable. You can read about the use-cases of this variable in further sections.\n", "\n", "Well, now you have a dictionary. It can be saved to disk to avoid re-creating it. 
You can save it in the binary format:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dictionary.save(dictionary_path='my_collection_batches/my_dictionary')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or in the textual one (if you'd like to look at the gathered data, for example):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dictionary.save_text(dictionary_path='my_collection_batches/my_dictionary.txt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A saved dictionary can be loaded back. The code for the binary file looks like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dictionary.load(dictionary_path='my_collection_batches/my_dictionary.dict')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the textual dictionary you can run the following code:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "dictionary.load_text(dictionary_path='my_collection_batches/my_dictionary.txt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Besides looking at the content of the textual dictionary, you can also edit it (for example, change the value field). After you load the dictionary back, these changes will be used." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Last note: all the described ways of generating batches also generate a dictionary automatically. You can access it by typing:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "batch_vectorizer.dictionary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you don't want this dictionary to be created, set the gather_dictionary parameter in the constructor of BatchVectorizer to False. This flag will be ignored if data_format == bow_n\\_wd, as it is the only possible way to generate the dictionary in that case." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Section 1: learn a basic PLSA model with perplexity computation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this moment you need to have the following objects:\n", "\n", "- a directory named my_collection_batches, containing batches and the dictionary in the binary file my_dictionary.dict; the directory should be located next to this notebook;\n", "- a Dictionary variable my_dictionary, containing this dictionary (gathered or loaded);\n", "- a BatchVectorizer variable batch_vectorizer (the same one we created earlier).\n", "\n", "If everything is OK, let's start creating the model. First you need to read the specification of the ARTM class, which represents the model (http://bigartm.readthedocs.org/en/master/python_interface/artm.html). Then you can use the following code to create the model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model = artm.ARTM(num_topics=20, dictionary=my_dictionary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you have created a model containing a $\\Phi$ matrix with the size \"number of words in your dictionary\" $\\times$ number of topics (20). This matrix was randomly initialized. Note that by default the random seed for the initialization is fixed, to make it possible to re-run the experiments and get the same results. If you want a different random start, use the seed parameter of the ARTM class (different non-negative integer values lead to different initializations), as in the sketch below." ] },
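{ "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch of such a seeded initialization (the seed value 123 is arbitrary and the variable name model_seeded is just a placeholder; everything else repeats the constructor call above):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# The same model as above, but with an explicit random seed,\n", "# so a non-default yet reproducible initialization is used.\n", "model_seeded = artm.ARTM(num_topics=20, dictionary=my_dictionary, seed=123)" ] },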
{ "cell_type": "markdown", "metadata": {}, "source": [ "From this moment we can start learning the model. But typically it is useful to enable some scores for monitoring the quality of the model first. Let's use the perplexity now.\n", "\n", "You can deal with scores using the scores field of the ARTM class. The perplexity score can be added in the following way:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model.scores.add(artm.PerplexityScore(name='my_first_perplexity_score',\n", "                                      dictionary=my_dictionary))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can find the description of the parameters of this and other scores here: http://bigartm.readthedocs.org/en/master/python_interface/scores.html. Note that the perplexity score has to be enabled exactly in the described way (you can change the other parameters we didn't use here).\n", "\n", "**Important note**: if you try to create a second score with the same name, the add() call will be ignored.\n", "\n", "Now let's get to the main act - the learning of the model. We can do that in two ways: using the online algorithm or the offline one. The corresponding methods are fit_online() and fit_offline(). It is assumed that you know the features of these algorithms, but here is a brief reminder:\n", "\n", "- Offline algorithm: many passes through the collection, one pass through a single document (optional), only one update of the $\\Phi$ matrix per collection pass (at the end of the pass). Use this algorithm when processing a small collection.\n", "\n", "- Online algorithm: a single pass through the collection (optional), many passes through a single document, several updates of the $\\Phi$ matrix during one pass through the collection. Use this one when you deal with large collections, or with collections with quickly changing topics.\n", "\n", "The parameters of these methods can be found on this documentation page: http://bigartm.readthedocs.org/en/master/python_interface/artm.html. We will use offline learning here and in all further examples in this notebook (correct usage of the online algorithm is a skill of its own; you can find information about it in a separate tutorial).\n", "\n", "Well, let's start training:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This code chunk runs slower than any previous one. Now that we have completed the first step of the learning, it will be useful to look at the perplexity. We need to use the score_tracker field of the ARTM class for this. It remembers all the values of all the scores at each $\\Phi$ matrix update. 
These data can be retrieved using the names of the scores.\n", "\n", "You can extract only the last value:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print model.score_tracker['my_first_perplexity_score'].last_value" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or you can extract the list of all values:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print model.score_tracker['my_first_perplexity_score'].value" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here you can read about all the fields of the score results and the ways of extracting them correctly: http://bigartm.readthedocs.org/en/master/python_interface/score_tracker.html.\n", "\n", "If the perplexity has converged, you can finish the learning process. Otherwise you need to continue. As was noted above, having only one pass over a single document is optional. Both the fit_offline() and fit_online() methods support any number of document passes you want to have. To change this number you need to change the corresponding parameter of the model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model.num_document_passes = 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All following calls of the learning methods will use this new value. Let's continue fitting:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We continued learning the previous model by making 15 more collection passes with 5 document passes each.\n", "\n", "You can keep working with this model in the described way. One more note: if at some point you realize that your model has degenerated and you don't want to create a new one, use the initialize() method, which will fill the $\\Phi$ matrix with random numbers and won't change anything else (neither your settings of the regularizers/scores, nor the history in score_tracker):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model.initialize(dictionary=my_dictionary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "FYI, this method is called in the ARTM constructor if you pass the dictionary parameter to it. Note that a change of the seed field will affect the result of initialize().\n", "\n", "Also note that you can pass the name of the dictionary instead of the dictionary object wherever a dictionary is used:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model.initialize(dictionary=my_dictionary.name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Section 2: regularized PLSA and new scores." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "BigARTM is a project that efficiently implements the theory of additive regularization of topic models (K. V. Vorontsov). ARTM is a more flexible alternative to the existing Bayesian approach. The theory is based on regularizers, and it is assumed that you are familiar with them."
] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "The library has a pre-defined set of regularizers (you can create new ones if necessary; you can read about that in the corresponding notes). Now we'll learn how to use them.\n", "\n", "I assume that all the requirements from the beginning of the first section are met. Let's create the model and enable the perplexity score in it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model = artm.ARTM(num_topics=20, dictionary=my_dictionary, cache_theta=False)\n", "model.scores.add(artm.PerplexityScore(name='perplexity_score',\n", "                                      use_unigram_document_model=False,\n", "                                      dictionary=my_dictionary))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I should comment on the cache_theta flag: it controls whether your $\\Theta$ matrix is kept in memory. If you have a large collection, it can be impossible to store its $\\Theta$ in memory, while for a small collection it can be useful to look at it. The default value is True. In cases when you need to use the $\\Theta$ matrix, but it is too big, you can use the ARTM.transform() method (it will be discussed later)." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Now let's try to add other scores, because perplexity is not the only useful one. I refer you to the scores documentation again for a more detailed description (http://bigartm.readthedocs.org/en/master/python_interface/scores.html).\n", "\n", "Let's add the scores of sparsity of the $\\Phi$ and $\\Theta$ matrices and the information about the most probable tokens in each topic (top-tokens):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model.scores.add(artm.SparsityPhiScore(name='sparsity_phi_score'))\n", "model.scores.add(artm.SparsityThetaScore(name='sparsity_theta_score'))\n", "model.scores.add(artm.TopTokensScore(name='top_tokens_score'))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Scores have many useful parameters. For instance, they can be calculated on subsets of topics. Let's separately count the sparsity of the first ten topics in $\\Phi$. But there's a problem: topics are identified by their names, and we didn't specify them. If we had used the topic_names parameter in the constructor (instead of num_topics), we wouldn't have this problem. But the solution is very easy: BigARTM has generated the names and put them into the topic_names field, so you can use it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model.scores.add(artm.SparsityPhiScore(name='sparsity_phi_score_10_topics', topic_names=model.topic_names[0: 10]))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Certainly, we could have modified the previous score instead of creating a new one, if the overall model sparsity were not of interest to us:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model.scores['sparsity_phi_score'].topic_names = model.topic_names[0: 10]" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "But let's assume that we are also interested in the overall sparsity and keep everything as is. 
You should remember that all the parameters of scores, the model and the regularizers (we will talk about them soon) can be set and reset by directly changing the corresponding field, as was demonstrated in the code above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, let's ask the top-tokens score to show us the 12 most probable tokens in each topic:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model.scores['top_tokens_score'].num_tokens = 12" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Well, we have got a model equipped with the necessary scores and can start the fitting process:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "We saw this code in the first section. But now we can look at the values of the newly added scores:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print model.score_tracker['perplexity_score'].value       # .last_value\n", "print model.score_tracker['sparsity_phi_score'].value     # .last_value\n", "print model.score_tracker['sparsity_theta_score'].value   # .last_value" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the way of retrieving these scores hasn't changed. But we forgot about the top-tokens. Here we need to act more carefully: the score stores its data at each moment of $\\Phi$ update. Let's assume that we only need the last data. Then we need to use the last_tokens field. It is a Python dict, where the key is a topic name, and the value is a list of the top-tokens of this topic.\n", "\n", "**Important note**: the scores are loaded from the core on each call, so for big scores such as top-tokens (or the topic kernel score) it is strongly recommended to store the whole score in a local variable first and then work with it. So, let's look through all the top-tokens in a loop:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "saved_top_tokens = model.score_tracker['top_tokens_score'].last_tokens\n", "\n", "for topic_name in model.topic_names:\n", "    print saved_top_tokens[topic_name]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Probably the topics are not very good yet. To increase their quality you can use regularizers.\n", "\n", "The list of regularizers and their parameters can be found here: http://bigartm.readthedocs.org/en/master/python_interface/regularizers.html. The code for dealing with regularizers is very similar to the one for scores. Let's add three regularizers to our model: sparsing of the $\\Phi$ matrix, sparsing of the $\\Theta$ matrix and topic decorrelation. The last one is needed to make the topics more distinct." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model.regularizers.add(artm.SmoothSparsePhiRegularizer(name='sparse_phi_regularizer'))\n", "model.regularizers.add(artm.SmoothSparseThetaRegularizer(name='sparse_theta_regularizer'))\n", "model.regularizers.add(artm.DecorrelatorPhiRegularizer(name='decorrelator_phi_regularizer'))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Maybe you have a question about the names of the SmoothSparsePhi and SmoothSparseTheta regularizers. 
Yes, each of them can perform both smoothing and sparsing. Its action depends on the value of the corresponding regularization coefficient $\\tau$ (I assume that you know what it is). $\\tau$ > 0 leads to smoothing, $\\tau$ < 0 - to sparsing. By default every regularizer has $\\tau$ = 1, which is usually not what you want. Choosing a good $\\tau$ is a heuristic; sometimes you need to run dozens of experiments to pick good values. This is experimental work, and we won't discuss it here. Let's look at the technical details instead:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model.regularizers['sparse_phi_regularizer'].tau = -1.0\n", "model.regularizers['sparse_theta_regularizer'].tau = -0.5\n", "model.regularizers['decorrelator_phi_regularizer'].tau = 1e+5" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "We set typical values, but in an unlucky case they can be useless or even harmful for the model.\n", "\n", "Let me draw your attention again to the fact that setting and changing the values of the regularizer parameters works exactly the same way as for scores.\n", "\n", "Let's start the learning process:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Afterwards you can look at the scores, change the $\\tau$ coefficients of the regularizers, etc. As with scores, you can ask a regularizer to deal only with given topics, using the topic_names parameter." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's return to the dictionaries, starting with a short discussion. Consider the principle of work of the SmoothSparsePhi regularizer. It simply adds the same value $\\tau$ to all counters. Such a strategy can be unsuitable for us. A probable case: the need to sparse one part of the words, smooth another part and ignore the rest of the tokens. For example, let's sparse the tokens about magic, smooth the tokens about cats and ignore all the other ones.\n", "\n", "In this situation we need dictionaries.\n", "\n", "Remember the value field that corresponds to each unique token, and also the fact that the SmoothSparsePhi regularizer has a dictionary field. If you set this field, the regularizer will add $\\tau$ * value to the counters of each token, instead of $\\tau$. In this way we can set $\\tau$ to 1, for instance, and set the value variable in the dictionary to -1.0 for the tokens about magic, to 1.0 for the tokens about cats, and to 0.0 for all the other tokens. And we'll get what we need.\n", "\n", "The last question is how to change these value variables. 
It was discussed in the introduction section: remember the Dictionary.save_text() and Dictionary.load_text() methods.\n", "\n", "You need to perform the following steps:\n", "\n", "- save the dictionary in the textual format;\n", "- open it; each line corresponds to one unique token and contains 5 values: token - modality - value variable - token_tf - token_df;\n", "- don't pay attention to anything except the token and the value; find all the tokens you are interested in and change their value parameters;\n", "- load the dictionary back into the library.\n", "\n", "Your file can look like this after editing (conceptually):\n", "\n", "cat | smth | 1.0 | smth | smth\n", "\n", "shower | smth | 0.0 | smth | smth\n", "\n", "magic | smth | -1.0 | smth | smth\n", "\n", "kitten | smth | 1.0 | smth | smth\n", "\n", "merlin | smth | -1.0 | smth | smth\n", "\n", "moscow | smth | 0.0 | smth | smth" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All the code you need for the discussed operation was shown above. Here is an example of creating the regularizer with a dictionary:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model.regularizers.add(artm.SmoothSparsePhiRegularizer(name='smooth_sparse_phi_regularizer', dictionary=my_dictionary))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section is over, let's move on." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Section 3: fitting a multimodal topic model with regularization and quality scoring; the ARTM.transform() method." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Now let's move on to more complex cases. In the last section I mentioned the term 'modality'. It's something that corresponds to each token. I prefer to think about it as the type of a token. For instance, some tokens form the main text of the document, some form the title, some - the names of the authors, some - tags, etc.\n", "\n", "In BigARTM each unique token has a modality. It is denoted as class_id (don't confuse it with a class in the classification task). You can specify the class_id of a token, or the library will set it to '@default_class'. This class id denotes the type of usual tokens, the default type." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "In most cases you don't need to use modalities, but there are situations when they are indispensable - for example, the task of document classification. That is exactly what we will talk about now." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You need to re-create all the data, taking the presence of modalities into account. Your task is to create a file in the Vowpal Wabbit format, where each line is a document, and each document contains tokens of two modalities - the usual tokens and the tokens-labels of the classes the document belongs to.\n", "\n", "Example:\n", "doc_100500 |@default_class aaa:2 bbb:4 ccc ddd:6 |@labels_class class_1 class_6\n", "\n", "All the information about this format is described here: http://bigartm.readthedocs.org/en/master/formats.html.\n", "\n", "Now follow the instructions from the introduction part again, applying them to your new Vowpal Wabbit file to obtain batches and the dictionary (a short reminder is sketched below)." ] },
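{ "cell_type": "markdown", "metadata": {}, "source": [ "For instance, assuming your multimodal Vowpal Wabbit file is named vw.multimodal.txt and lies next to this notebook (the file and folder names here are just placeholders), the batches and the dictionary can be obtained exactly as in the introduction:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# A reminder sketch: build batches from the multimodal Vowpal Wabbit file\n", "# and reuse the automatically gathered dictionary.\n", "batch_vectorizer = artm.BatchVectorizer(data_path='vw.multimodal.txt',\n", "                                        data_format='vowpal_wabbit',\n", "                                        target_folder='my_multimodal_batches')\n", "my_dictionary = batch_vectorizer.dictionary" ] },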
{ "cell_type": "markdown", "metadata": {}, "source": [ "The next step is to tell your model about your modalities and their weights in the model. The weight of a modality is its coefficient $\\tau_m$ (I assume you know about it). By default the model uses only the tokens of the '@default_class' modality, with $\\tau_m$ = 1.0. If you need other modalities, you have to specify them and their weights in the constructor of the model, using the following code:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model = artm.ARTM(num_topics=20, class_ids={'@default_class': 1.0, '@labels_class': 5.0})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Well, we asked the model to take these two modalities into consideration, and the class labels will have more influence in this model than the tokens of the '@default_class' modality. Note that if your file contained tokens of some other modality, they wouldn't be taken into consideration. Similarly, if you specified a modality in the constructor that doesn't exist in the data, it would be skipped.\n", "\n", "Of course, the class_ids field, like all the others, can be reset. You can always change the weights of the modalities:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model.class_ids = {'@default_class': 1.0, '@labels_class': 50.0}  # model.class_ids['@labels_class'] = 50.0 --- NO!!!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You need to update the weights exactly in this way; don't try to refer to a modality by its key directly: class_ids can be updated using a Python dict, but it is not a dict itself." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "The next launch of fit_offline() or fit_online() will take this new information into consideration.\n", "\n", "Now we need to enable scores and regularizers in the model. This process was described earlier, except for one case. All the scores of the $\\Phi$ matrix (and perplexity) and the $\\Phi$ regularizers have fields for dealing with modalities. Through these fields you can define the modalities the score or regularizer should deal with; the other ones will be ignored (in full analogy with the topic_names field).\n", "\n", "The modality field can be class_id or class_ids. The first one is a string containing the name of the modality to deal with, the second one is a list of such strings.\n", "\n", "**Important note**: a missing value of class_id means class_id = '@default_class', a missing value of class_ids means usage of all the existing modalities.\n", "\n", "You can see the information about any score or regularizer by following the links I gave earlier: http://bigartm.readthedocs.org/en/master/python_interface/regularizers.html and http://bigartm.readthedocs.org/en/master/python_interface/scores.html."
] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Let's add the $\\Phi$ sparsity score for the class labels modality and topic decorrelation regularizers for each modality, and start fitting:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model.scores.add(artm.SparsityPhiScore(name='sparsity_phi_score', class_id='@labels_class'))\n", "\n", "model.regularizers.add(artm.DecorrelatorPhiRegularizer(name='decorrelator_phi_def', class_ids=['@default_class']))\n", "model.regularizers.add(artm.DecorrelatorPhiRegularizer(name='decorrelator_phi_lab', class_ids=['@labels_class']))\n", "\n", "model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Well, I will leave the rest of the work to you (tuning the $\\tau$ and $\\tau_m$ coefficients, looking at the score results, etc.). Now let's move on to using the fitted model for the classification of test data.\n", "\n", "Let me remind you that in the classification task you have train data (the collection you used to train your model, where the model knew the true class labels of each document) and test data. For the test data the true labels are known to you, but unknown to the model. The model needs to predict these labels using the test documents, and your task is to assess the quality of the predictions by computing some metric, AUC for instance.\n", "\n", "Computation of AUC or any other quality measure is your task, we won't do it here (though one possible sketch is given below). Instead, we will learn how to get the p(c|d) vectors for the documents, where each value is the probability of class c in the given document d.\n", "\n", "Well, we have a model. I assume you have put the test documents into a separate file in the Vowpal Wabbit format and created batches from it, wrapped in the variable batch_vectorizer_test (see the introduction section). I also assume you have saved your test batches into a separate directory (not into the one containing the train batches).\n", "\n", "Your test documents shouldn't contain information about the true labels (i.e. the Vowpal Wabbit file shouldn't contain the '|@labels_class' part); also the test documents shouldn't contain tokens that don't appear in the train set - such tokens will be ignored.\n", "\n", "If all these conditions are met, we can use the ARTM.transform() method (see http://bigartm.readthedocs.org/en/master/python_interface/artm.html), which allows you to get the p(t|d) (i.e. $\\Theta$) or p(c|d) matrix for all the documents from your BatchVectorizer object.\n", "\n", "Run this code to get $\\Theta$:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "theta_test = model.transform(batch_vectorizer=batch_vectorizer_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And this one to obtain p(c|d):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "p_cd_test = model.transform(batch_vectorizer=batch_vectorizer_test, predict_class_id='@labels_class')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this way you get the predictions of the model as a pandas.DataFrame. Now you can assess the quality of your model's predictions in any way you need." ] },
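{ "cell_type": "markdown", "metadata": {}, "source": [ "As a hedged illustration of such an assessment (not part of the original pipeline): assuming that p_cd_test has the class labels as rows and the documents as columns (transpose it if your version returns the opposite orientation), and assuming you have collected the true test labels yourself into a hypothetical list y_true in the same document order, a one-vs-rest ROC AUC could be sketched with sklearn like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# A sketch only: y_true is a hypothetical list of true label strings,\n", "# one per test document, in the same order as the columns of p_cd_test.\n", "from sklearn.metrics import roc_auc_score\n", "\n", "for label in p_cd_test.index:                 # one-vs-rest AUC per class label\n", "    scores = p_cd_test.loc[label].values      # p(c|d) for this class over all documents\n", "    truth = [1 if t == label else 0 for t in y_true]\n", "    print label, roc_auc_score(truth, scores)" ] },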
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Section 4: $\\Phi$ and $\\Theta$ extraction, model saving and loading, dictionary filtering." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "This section covers mostly technical matters that are easy to learn.\n", "\n", "Let's assume that you have data and a model fitted on this data, with all the necessary regularizers tuned and scores used. But the set of quality measures of the library wasn't enough for you, and you need to compute your own scores using the $\\Phi$ and $\\Theta$ matrices. In this case you can extract these matrices using the following code:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "phi = model.get_phi()\n", "theta = model.get_theta()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Note that the cache_theta flag needs to be set to True if you plan to extract $\\Theta$ later without using transform(). You can also extract not the whole matrices, but only the parts corresponding to given topics (using the same topic_names parameter of the methods as in the previous sections). Also you can extract only the necessary modalities of the $\\Phi$ matrix, if you want.\n", "\n", "Both methods return a pandas.DataFrame." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's look at saving the model to disk.\n", "\n", "It's important to understand that the model contains two matrices - $\\Phi$ (or $p_{wt}$) and $n_{wt}$. To make the model loadable without losses you need to save both of these matrices. The current library version can save only one matrix per method call, so you need two calls:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model.save(filename='saved_p_wt', model_name='p_wt')\n", "model.save(filename='saved_n_wt', model_name='n_wt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model will be saved in binary format. To use it later you need to load its matrices back:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model.load(filename='saved_p_wt', model_name='p_wt')\n", "model.load(filename='saved_n_wt', model_name='n_wt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note** that after loading the model will only contain the $\\Phi$ and $n_{wt}$ matrices and some associated information (like the number of topics, their names, the names of the modalities (without weights!) and some other data). So you need to restore all the necessary scores, regularizers, modality weights and other parameters important to you, like cache_theta.\n", "\n", "You can use the save/load pair of methods in the case of a long fitting, when restoring the parameters is much easier than re-fitting the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last thing we'll discuss in this section is the dictionary - namely, its self-filtering ability. Let's recall the structure of the dictionary saved in textual format. There were many lines, one per unique token, and each line contained 5 values: the token (a string), its class_id (a string), its value (a double) and two more integer parameters, called token_tf and token_df. token_tf is the absolute frequency of the token in the whole collection, and token_df is the number of documents in the collection where the token appeared at least once. These values are generated by the library while gathering the dictionary. 
They differ from value in that you can't use them in the regularizers and scores, so you shouldn't change them.\n", "\n", "They are needed for filtering the dictionary. You likely don't need very rare or too frequent tokens in your model. Or you may simply want to reduce your dictionary to keep your model in memory. In both cases the solution is to use the Dictionary.filter() method. See its parameters here: http://bigartm.readthedocs.org/en/master/python_interface/dictionary.html. Now let's filter the modality of usual tokens:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dictionary.filter(min_tf=10, max_tf=2000, min_df_rate=0.01)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: if a parameter has the \\_rate suffix, it denotes a relative value (i.e. from 0 to 1), otherwise an absolute value.\n", "\n", "This call has one feature - it overwrites the old dictionary with the new one. So if you don't want to lose your full dictionary, you need to save it to disk first and then filter the copy located in memory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Section 5: everything related to coherency and co-occurrence dictionaries." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ToDo." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Section 6: attach_model and custom Phi-like matrices initialization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The library supports the ability to access all $\\Phi$-like matrices directly from Python. This is low-level functionality, so it wasn't included in the ARTM class, and it can be used via the low-level master\\_component interface. The user can attach a matrix, i.e. get a reference to it in Python, and change its content between the iterations. The changes will be written into the native C++ memory.\n", "\n", "The most obvious use-case of this feature is a custom initialization of the $\\Phi$ matrix. The library initializes it with random numbers by default. But there are several more complex and useful initialization methods that the library doesn't support yet. 
In this case the attach\\_model method can help you.\n", "\n", "So let's attach to the $\\Phi$ matrix of our model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "(_, phi_ref) = model.master.attach_model(model=model.model_pwt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this moment you can print the $\\Phi$ matrix to see its content:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model.get_phi(model_name=model.model_pwt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code can be used to check whether the attaching was successful:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "for model_description in model.info.model:\n", "    print model_description" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output will be similar to the following:\n", "\n", "-----\n", "\n", "name: \"nwt\"\n", "\n", "type: \"class artm::core::DensePhiMatrix\"\n", "\n", "num_topics: 50\n", "\n", "num_tokens: 2500\n", "\n", "-----\n", "\n", "name: \"pwt\"\n", "\n", "type: \"class __artm::core::AttachedPhiMatrix__\"\n", "\n", "num_topics: 50\n", "\n", "num_tokens: 2500\n", "\n", "-----\n", "\n", "You can see that the type of the $\\Phi$ matrix has changed from DensePhiMatrix to AttachedPhiMatrix." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's assume that you have created a pwt\\_new matrix of the same size, filled with custom values. Let's write these values into our $\\Phi$ matrix. __Important__: you need to write the values by accessing the phi_ref variable element-wise; you are not allowed to assign the whole pwt_new matrix to it, as this operation will lead to errors in further work." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# num_tokens and num_topics are the dimensions of the Phi matrix (and of pwt_new)\n", "for tok in xrange(num_tokens):\n", "    for top in xrange(num_topics):\n", "        phi_ref[tok, top] = pwt_new[tok, top]  # CORRECT!\n", "\n", "phi_ref = pwt_new  # NO!!!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After that you can print the $\\Phi$ matrix again and check that its values have changed. From this moment you can continue your work." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 0 }