# Visualizing topic models using tm-navigator

This notebook walks through the path from a raw text collection to its visualization: BigARTM is used to fit a topic model, and tm-navigator to visualize it.

## Set up connection to a tm-navigator server

Currently the navigator runs on a remote server, which you need SSH access to over the internet. A Dockerfile is also available, so the navigator can be deployed virtually anywhere.

In [1]:
TMNAV_SERVER='root@ks.plav.in'
TMNAV_PORT=22223 # or 21, for those who have trouble accessing port 22223
TMNAV_PATH='/root/tm_navigator/'

Run these commands with default options in your shell once to allow passwordless ssh to the server:
```
ssh-keygen
ssh-copy-id -p {TMNAV_PORT} -i ~/.ssh/id_rsa.pub {TMNAV_SERVER}
```

Now you can easily connect to the server to see the model in your browser: run

```
ssh -p {TMNAV_PORT} -L 5000:localhost:5000 {TMNAV_SERVER}
```
in your terminal, and open http://localhost:5000 in a browser. This will show a list of all the datasets and models uploaded before.
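
If the page does not load, it can help to check whether anything is listening on the local end of the tunnel. The helper below is a minimal sketch using only the standard library; the name `port_is_open` is made up for this example and is not part of tm-navigator.

```python
import socket

def port_is_open(host="localhost", port=5000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # connection refused, timed out, or host not resolvable
        return False
```

Calling `port_is_open()` while the `ssh -L` session is running should return `True`; `False` usually means the tunnel is not up.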

## Get the collection

First we need to get the collection in the bag-of-words format. In this example the MMRO conference (Russian) articles are used:

In [2]:
!mkdir -p mmro
!rm -rf mmro/*


In [3]:
%cd mmro

/root/work/tm_navigator/dev/mmro


In [4]:
!wget https://s3-eu-west-1.amazonaws.com/artm/vocab.mmro.txt
!wget https://s3-eu-west-1.amazonaws.com/artm/docword.mmro.txt.7z
!7zr e docword.mmro.txt.7z

--2015-11-18 11:42:34-- https://s3-eu-west-1.amazonaws.com/artm/vocab.mmro.txt
Resolving s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)... 54.231.133.140
Connecting to s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)|54.231.133.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 155766 (152K) [text/plain]
Saving to: ‘vocab.mmro.txt’


2015-11-18 11:42:34 (2.43 MB/s) - ‘vocab.mmro.txt’ saved [155766/155766]

--2015-11-18 11:42:34-- https://s3-eu-west-1.amazonaws.com/artm/docword.mmro.txt.7z
Resolving s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)... 54.231.133.132
Connecting to s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)|54.231.133.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 490147 (479K) [application/octet-stream]
Saving to: ‘docword.mmro.txt.7z’


2015-11-18 11:42:34 (4.27 MB/s) - ‘docword.mmro.txt.7z’ saved [490147/490147]


7-Zip (A) [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-

## Load the collection to tm-navigator

Install a convenience wrapper for creating CSV files:

In [None]:
!pip install csvwriter

In [5]:
from csvwriter import CsvWriter
import numpy as np
from glob import glob

This collection now has to be added to tm-navigator. The data available here supports only the simplest visualization, without even document names or authors.

The native input format of tm-navigator is a set of CSV files, each corresponding to a database table. At a minimum, a dataset (text collection) is described by the following tables:

In [6]:
# list of all modalities
with CsvWriter(open('modalities.csv', 'w')) as out:
 out << [dict(id=1, name='words')] # this one is required

In [7]:
# read the ndw counts
with open('docword.mmro.txt') as f:
 D = int(f.readline())
 W = int(f.readline())
 n = int(f.readline())
 ndw_s = [map(int, line.split()) for line in f.readlines()]
 ndw_s = [(d - 1, w - 1, cnt) for d, w, cnt in ndw_s] # use 0-based indexing
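
The cell above relies on the layout of `docword.mmro.txt`, which appears to be the UCI bag-of-words format (BigARTM is later given `data_format='bow_uci'`): three header lines with the number of documents, the vocabulary size and the number of non-zero counts, followed by one `doc_id word_id count` triple per line with 1-based ids. The sketch below parses a made-up three-entry collection the same way; `read_uci_docword` is a hypothetical helper name, and the check against the third header line is an addition that the original cell does not perform.

```python
import io

def read_uci_docword(f):
    """Parse a UCI bag-of-words stream into (D, W, triples) with 0-based ids."""
    D = int(f.readline())    # number of documents
    W = int(f.readline())    # vocabulary size
    nnz = int(f.readline())  # number of (doc, word, count) lines that follow
    triples = []
    for line in f:
        if not line.strip():
            continue  # tolerate a trailing blank line
        d, w, cnt = map(int, line.split())
        triples.append((d - 1, w - 1, cnt))  # shift ids to 0-based
    if len(triples) != nnz:
        raise ValueError("expected {} entries, found {}".format(nnz, len(triples)))
    return D, W, triples

# a made-up collection: 2 documents, 3 vocabulary words, 3 non-zero counts
toy = io.StringIO("2\n3\n3\n1 1 4\n1 3 1\n2 2 7\n")
D, W, triples = read_uci_docword(toy)
print(D, W, triples)  # 2 3 [(0, 0, 4), (0, 2, 1), (1, 1, 7)]
```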

In [8]:
# all the documents data
with CsvWriter(open('documents.csv', 'w')) as out:
 out << (
 dict(id=d,
 title='Document #{}'.format(d),
 slug='document-{}'.format(d), # any unique string, identifying the document - appears in short lists and URLs
 file_name='.../{}'.format(d), # if applicable, a relative filename of the document
 # source='MMRO', # optional, is displayed as-is, e.g. conference name with year
 # html=..., # optional, the full HTML content of the document
 )
 for d in range(D)
 )

In [9]:
# terms (in this case, words only)
with open('vocab.mmro.txt') as f, \
 CsvWriter(open('terms.csv', 'w')) as out:
 out << (
 dict(id=i,
 modality_id=1, # matches the id in modalities table
 text=line.strip()
 )
 for i, line in enumerate(f)
 )

In [10]:
# occurrences of terms in documents
with CsvWriter(open('document_terms.csv', 'w')) as out:
 out << (
 dict(document_id=d,
 modality_id=1,
 term_id=w,
 count=cnt)
 for d, w, cnt in ndw_s
 )
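
The tables refer to each other by id: every row of `document_terms.csv` names a document from `documents.csv` and a term from `terms.csv`. In this notebook those ids are simply `0..D-1` and `0..W-1`, so a quick range check before uploading can catch an off-by-one early. The function below is a sketch under that assumption; `check_document_terms` is a made-up name, not a tm-navigator utility.

```python
def check_document_terms(triples, num_docs, num_terms):
    """Return a list of problems found in (document_id, term_id, count) triples."""
    problems = []
    for d, w, cnt in triples:
        if not 0 <= d < num_docs:
            problems.append("unknown document id {}".format(d))
        if not 0 <= w < num_terms:
            problems.append("unknown term id {}".format(w))
    return problems

# toy data: the last triple points at a term id that does not exist
print(check_document_terms([(0, 0, 2), (1, 2, 1), (1, 3, 5)], num_docs=2, num_terms=3))
# ['unknown term id 3']
```

In the notebook itself the call would be `check_document_terms(ndw_s, D, W)`, which should return an empty list.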

The required CSV files are now ready to be loaded into tm-navigator. Upload them to the server:

In [11]:
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'mkdir {TMNAV_PATH}data_mmro'
for csv in glob('*.csv'):
 !scp -P {TMNAV_PORT} {csv} {TMNAV_SERVER}:{TMNAV_PATH}data_mmro/{csv}

documents.csv 100% 41KB 41.3KB/s 00:00 
modalities.csv 100% 18 0.0KB/s 00:00 
document_terms.csv 100% 4091KB 4.0MB/s 00:00 
terms.csv 100% 212KB 212.0KB/s 00:00 


Now the files are on the server, and we need to load them into the database. All database interactions are meant to go through the `db_manage.py` script, which has several commands. A list of parameters for each command can be obtained by adding `--help`:

In [12]:
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && ./db_manage.py'
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && ./db_manage.py load_dataset --help'

Usage: db_manage.py [OPTIONS] COMMAND [ARGS]...

Options:
 --help Show this message and exit.

Commands:
 add_dataset
 add_topicmodel
 describe
 load_dataset
 load_topicmodel
Usage: db_manage.py load_dataset [OPTIONS]

Options:
 -d, --dataset-id INTEGER [required]
 -t, --title TEXT
 -dir, --directory DIRECTORY [required]
 --help Show this message and exit.


First we add a new dataset and note the given id - it will be used later:

In [13]:
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && ./db_manage.py add_dataset'

Added Dataset #1


And now load the data from CSV files (note the `dataset-id`, it is the number which was given by the previous command):

In [15]:
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && yes | ./db_manage.py load_dataset --dataset-id 1 --title "Simplest MMRO dataset" -dir data_mmro'

Found files "document_terms.csv", "documents.csv", "modalities.csv", "terms.csv".
Not found files "document_contents.csv".
Will try to continue with the files present.
Proceeding will overwrite the corresponding data in the database. Continue? [Y/n]: Deleting data
Deleting data
Deleting data
Deleting data
Deleting data
Loading data
Loading data
Loading data
Loading data
Loading data


You can check that the dataset was loaded to the DB using the `describe` command:

In [16]:
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && ./db_manage.py describe'

- Dataset #1: Simplest MMRO dataset, 0 models
 Documents: 1061
 Terms: 7805 words with 314081 occurrences



Or just go to the front page at http://localhost:5000 - the last dataset there should be your newly added one.

## Build a topic model

Next we build a simple ARTM model of this collection using BigARTM. Of course, you can use other tools for this, if you want.

In [17]:
import artm

In [18]:
batch_vectorizer = artm.BatchVectorizer(data_path='', data_format='bow_uci', collection_name='mmro', target_folder='.')

In [19]:
model_artm = artm.ARTM(num_topics=15,
 scores=[artm.PerplexityScore(name='PerplexityScore',
 use_unigram_document_model=False,
 dictionary_name='dictionary')],
 regularizers=[artm.SmoothSparseThetaRegularizer(name='SparseTheta', tau=-0.15)])

In [20]:
model_artm.load_dictionary(dictionary_name='dictionary', dictionary_path='dictionary')
model_artm.initialize(dictionary_name='dictionary')

In [21]:
model_artm.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=15, num_document_passes=1)

In the simplest possible case, the model is completely described by its matrices $\Phi$ and $\Theta$:

In [22]:
phi = model_artm.get_phi()
theta = model_artm.fit_transform()

Some other required probabilities are naively computed below. You can use other ways to calculate them.

In [23]:
pwt = phi.values # plain NumPy array; `as_matrix()` has been removed from recent pandas versions
ptd = theta.values
pd = 1.0 / theta.shape[1]
pt = (ptd * pd).sum(1)
pw = (pwt * pt).sum(1)
ptw = pwt * pt / pw[:, np.newaxis]
pdt = ptd * pd / pt[:, np.newaxis]
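
Written out, the cell above is Bayes' rule under the assumption that all documents are equally likely a priori, $p(d) = 1/D$ (this is what `pd = 1.0 / theta.shape[1]` encodes). With $\Phi = \bigl(p(w \mid t)\bigr)$ and $\Theta = \bigl(p(t \mid d)\bigr)$:

$$
\begin{aligned}
p(t) &= \sum_d p(t \mid d)\, p(d), \\
p(w) &= \sum_t p(w \mid t)\, p(t), \\
p(t \mid w) &= \frac{p(w \mid t)\, p(t)}{p(w)}, \\
p(d \mid t) &= \frac{p(t \mid d)\, p(d)}{p(t)}.
\end{aligned}
$$

The four lines correspond to `pt`, `pw`, `ptw` and `pdt` respectively. A different prior, for example $p(d)$ proportional to document length, would change only the first and last lines; that variant is not used in this notebook.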

## Load the model to tm-navigator

After the model is built, it has to be converted to CSV files, like the dataset was. The minimal required set of files is the following:

In [24]:
# the model topics
with CsvWriter(open('topics.csv', 'w')) as out:
 out << [dict(id=0,
 level=0,
 id_in_level=0,
 is_background=False,
 probability=1)] # the single zero-level topic with id=0 is required
 out << (dict(id=1 + t, # any unique ids
 level=1, # for a flat non-hierarchical model just leave 1 here
 id_in_level=t,
 is_background=False, # if you have background topics, they should have True here
 probability=p)
 for t, p in enumerate(pt))

In [25]:
# probabilities of terms in topics
with CsvWriter(open('topic_terms.csv', 'w')) as out:
 out << (dict(topic_id=1 + t, # same ids as above
 modality_id=1,
 term_id=w,
 prob_wt=pwt[w, t],
 prob_tw=ptw[w, t])
 for w, t in zip(*np.nonzero(pwt)))

In [26]:
# probabilities of topics in documents
with CsvWriter(open('document_topics.csv', 'w')) as out:
 out << (dict(topic_id=1 + t, # same ids as above
 document_id=d,
 prob_td=ptd[t, d],
 prob_dt=pdt[t, d])
 for t, d in zip(*np.nonzero(ptd)))

In [27]:
# graph of topics, mostly useful for hierarchical topic models
# the navigator assumes that all topics are reachable by edges from the root topic #0
with CsvWriter(open('topic_edges.csv', 'w')) as out:
 out << (dict(parent_id=0,
 child_id=1 + t,
 probability=p)
 for t, p in enumerate(pt))

Now, same as with the dataset before, upload the CSV files:

In [28]:
for csv in glob('*.csv'):
 !scp -P {TMNAV_PORT} {csv} {TMNAV_SERVER}:{TMNAV_PATH}data_mmro/{csv}

topic_terms.csv 100% 3492KB 3.4MB/s 00:00 
documents.csv 100% 41KB 41.3KB/s 00:00 
modalities.csv 100% 18 0.0KB/s 00:00 
topic_edges.csv 100% 260 0.3KB/s 00:00 
document_terms.csv 100% 4091KB 4.0MB/s 00:00 
terms.csv 100% 212KB 212.0KB/s 00:00 
topics.csv 100% 416 0.4KB/s 00:00 
document_topics.csv 100% 420KB 420.4KB/s 00:00 


Create a new topic model (same `dataset-id` as in commands above):

In [29]:
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && ./db_manage.py add_topicmodel --dataset-id 1'

Added Topic Model #1 for Dataset #1


And load CSVs to the database (here note the `topicmodel-id`):

In [30]:
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && yes | ./db_manage.py load_topicmodel --topicmodel-id 1 --title "Simplest model" -dir data_mmro'

Found files "document_topics.csv", "topic_edges.csv", "topic_terms.csv", "topics.csv".
Not found files "document_content_topics.csv", "document_similarities.csv", "term_similarities.csv", "topic_similarities.csv".
Will try to continue with the files present.
Proceeding will overwrite the corresponding data in the database. Continue? [Y/n]: Deleting data
Deleting data
Deleting data
Deleting data
Deleting data
Loading data
Loading data
Loading data
Loading data
Loading data


That's all, basically - now visit http://localhost:5000 and follow the link to browse your model! The link should look like http://1.localhost:5000, just with another number at the beginning. If you don't like the debug panel at the side, just click `Hide` there once.

If you build several (any number of) topic models for the same collection, you don't have to add a new dataset each time, just run `add_topicmodel` with the same dataset id.

## Optional data for richer visualization

If that's not enough and you need other features, you can feed the navigator with more data. Some useful cases, with descriptions of how to handle each:

- Add titles, slugs and HTML content to the documents: just fill the corresponding fields in `documents.csv`

- Add document authors: authors are another modality (like `words` in the given example), so just add `authors` modality to `modalities.csv`, the corresponding terms (individual authors) to `terms.csv`, and their relation to documents to `document_terms.csv`. Of course, authors can be used in topic models also.

- Highlight documents HTML content: this is a two-step process, one step is related to the dataset and another to the topic model. Basically, you need the start and end positions of each term (word) in HTML, and the top topics for them. If you have these, this is how to generate the corresponding CSVs:

In [None]:
import itertools

# this is for dataset
with CsvWriter(open('document_contents.csv', 'w')) as out:
 id_cnt = itertools.count() # it's just one way to generate ids so that they match in both cases
 out << (dict(id=next(id_cnt), # must correspond to the ids in document_content_topics below
 document_id=d,
 modality_id=1, # 1 for words
 term_id=w,
 start_pos=s, end_pos=e # the start and end positions in the HTML content
 )
 for d, w, s, e in ...)

# and this is for topicmodel
with CsvWriter(open('document_content_topics.csv', 'w')) as out:
 id_cnt = itertools.count()
 out << (dict(document_content_id=next(id_cnt), # same ids as above
 topic_id=1 + t # the top topic id, determines the color
 )
 for d, t in ...)
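
The `(d, w, s, e)` tuples are left open above because they depend on how your texts were tokenized. As a rough illustration only, the sketch below finds word positions in a plain string with a regular expression. Everything in it is an assumption: the helper name `term_positions`, the lower-casing, the `\w+` tokenizer, and the choice of an exclusive end position (the source does not say whether `end_pos` is inclusive). Real HTML would also need its tags skipped, and a lemmatized vocabulary would need the same lemmatizer applied here.

```python
import re

def term_positions(text, vocabulary):
    """Yield (term_id, start, end) for every vocabulary word found in text.

    `vocabulary` maps a lower-cased word to its term id; `end` is exclusive.
    """
    for match in re.finditer(r"\w+", text):
        term_id = vocabulary.get(match.group().lower())
        if term_id is not None:
            yield term_id, match.start(), match.end()

vocab = {"topic": 0, "model": 1}  # made-up vocabulary
text = "A topic model is a Model."
print(list(term_positions(text, vocab)))
# [(0, 2, 7), (1, 8, 13), (1, 19, 24)]
```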

- Show lists of similar documents, topics, or terms on the corresponding pages: the navigator doesn't restrict you in how the similarity is determined, so it must be computed beforehand. Similarities are internally related to topic models, not datasets, because they are typically computed using the data from models. Multiple different similarities are supported for each entity, see below:

In [None]:
with CsvWriter(open('document_similarities.csv', 'w')) as out:
 out << (dict(a_id=i, # first document id
 b_id=sim_i, # second document id
 similarity=row[sim_i], # similarity from [0, 1]
 similarity_type='Topics' # free-form short name of this similarity type, common choices probably are Topics and Words
 )
 for i, row in enumerate(distances) # the precomputed distance matrix
 # tip: don't write the whole n^2 entries to the CSV table not to bloat it,
 # here we limit to 30 similar entities for each row
 for sim_i in row.argsort()[:31]
 if sim_i != i)

with CsvWriter(open('topic_similarities.csv', 'w')) as out:
 out << (dict(a_id=1 + i,
 b_id=1 + sim_i,
 similarity=row[sim_i],
 similarity_type='Words')
 for i, row in enumerate(distances)
 for sim_i in row.argsort()[:] # if you have hundreds or more topics, limit to first 50 or so here
 if sim_i != i)

with CsvWriter(open('term_similarities.csv', 'w')) as out:
 out << (dict(a_modality_id=1,
 a_id=i,
 b_modality_id=1,
 b_id=sim_i,
 similarity=row[sim_i],
 similarity_type='Topics')
 for i, row in enumerate(distances)
 for sim_i in row.argsort()[:21] # first 20 similar terms
 if sim_i != i)
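
The cells above take a precomputed matrix as given. Note that `argsort()` sorts in ascending order, so it picks the right neighbours for a distance matrix, while the `similarity` column expects values where larger means closer; one of the two has to be converted. The sketch below shows one possible way to produce the rows directly: cosine similarity between vectors, keeping the `k` most similar items for each. Cosine similarity is a common choice rather than something the navigator prescribes, and `cosine` and `top_similar` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_similar(vectors, k, similarity_type):
    """Return rows with the k most similar other items for each item."""
    rows = []
    for i, u in enumerate(vectors):
        scored = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties by id
        for sim, j in scored[:k]:
            rows.append(dict(a_id=i, b_id=j, similarity=sim,
                             similarity_type=similarity_type))
    return rows

# toy p(t|d) vectors for three documents over two topics
docs = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
for row in top_similar(docs, k=1, similarity_type='Topics'):
    print(row)
```

For documents in this notebook the input would be the columns of `ptd`, for example `ptd.T.tolist()`. The pure-Python loop is quadratic in the number of items, so for large vocabularies a vectorized NumPy version would be preferable.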

- Hierarchical topic models can easily be represented using several levels of topics and adding edges between them, see above. Actually, even if your topic model isn't hierarchical you can add another middle level with topics to act as groups of real topics, and name them correspondingly.
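
To make the middle-level idea concrete, the sketch below builds the rows for `topics.csv` and `topic_edges.csv` as plain dicts for a root, a level of groups, and the real topics beneath them. The column names are the ones used earlier in this notebook; everything else is an assumption made for illustration: the helper name `two_level_topics`, putting real topics at `level=2`, taking a group's probability as the sum of its members, and using the child's probability on each edge (the flat example cannot distinguish this from a conditional probability, since its only parent is the root).

```python
def two_level_topics(pt, groups):
    """Build topics and edges rows for root -> groups -> real topics.

    pt     -- list of topic probabilities p(t), one per real topic
    groups -- list of lists of topic indices, each topic in exactly one group
    """
    topics = [dict(id=0, level=0, id_in_level=0, is_background=False, probability=1)]
    edges = []
    first_leaf_id = 1 + len(groups)  # ids 1..G are groups, real topics follow
    for g, members in enumerate(groups):
        p_group = sum(pt[t] for t in members)
        topics.append(dict(id=1 + g, level=1, id_in_level=g,
                           is_background=False, probability=p_group))
        edges.append(dict(parent_id=0, child_id=1 + g, probability=p_group))
        for t in members:
            edges.append(dict(parent_id=1 + g, child_id=first_leaf_id + t,
                              probability=pt[t]))
    for t, p in enumerate(pt):
        topics.append(dict(id=first_leaf_id + t, level=2, id_in_level=t,
                           is_background=False, probability=p))
    return topics, edges

topics, edges = two_level_topics(pt=[0.5, 0.3, 0.2], groups=[[0, 2], [1]])
```

Both lists can be passed to `CsvWriter` with `out << topics` and `out << edges`. Because the real topics no longer have ids `1 + t`, `topic_terms.csv` and `document_topics.csv` would have to use the shifted ids as well.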

Remember to upload the CSVs after each change, and load them into the database using `load_dataset` or `load_topicmodel`! To overwrite an existing dataset or model, use the same id as before; otherwise create a new one with the corresponding `add_*` command first.

Datasets which have topic models can be changed only by adding some data, like `document_contents.csv`, not removing anything. The database will give an error if you try to do anything which makes the data inconsistent.

## Notes for the artm-dev team, who can use the remote server mentioned above

No permissions system is currently implemented (nor is one planned), so each user on the same server can view or modify all the data. Please treat the server as disposable storage and keep a copy of any data you cannot easily regenerate on your own computer. If you want to use the assessment features of the navigator, please contact me beforehand to make sure you don't lose the responses!

This tutorial uses a directory named `data_mmro` on the server; for convenience, please replace it with a name that is unique within our group.