{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Visualizing topic models using tm-navigator" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook describes a simple way from a raw text collection to its visualization, and uses BigARTM for fitting a topic model and tm-navigator to visualize it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up connection to a tm-navigator server" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Currently the navigator is running on a remote server and you need ssh access through internet to it, but there is a Dockerfile available, so the navigator can be deployed virtually anywhere." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "TMNAV_SERVER='root@ks.plav.in'\n", "TMNAV_PORT=22223 # or 21, for those who have troubles accessing port 22223\n", "TMNAV_PATH='/root/tm_navigator/'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run these commands with default options in your shell once to allow passwordless ssh to the server:\n", "```\n", "ssh-keygen\n", "ssh-copy-id -p {TMNAV_PORT} -i ~/.ssh/id_rsa.pub {TMNAV_SERVER}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you can easily connect to the server to see the model in your browser: run\n", "\n", "```\n", "ssh -p {TMNAV_PORT} -L 5000:localhost:5000 {TMNAV_SERVER}\n", "```\n", "in your terminal, and open http://localhost:5000 in a browser. This will show a list of all the datasets and models uploaded before." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get the collection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we need to get the collection in the bag-of-words format. In this example the MMRO conference (Russian) articles are used:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mkdir: cannot create directory ‘mmro’: File exists\r\n" ] } ], "source": [ "!mkdir mmro\n", "!rm -rf mmro/*" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/root/work/tm_navigator/dev/mmro\n" ] } ], "source": [ "%cd mmro" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2015-11-18 11:42:34-- https://s3-eu-west-1.amazonaws.com/artm/vocab.mmro.txt\n", "Resolving s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)... 54.231.133.140\n", "Connecting to s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)|54.231.133.140|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 155766 (152K) [text/plain]\n", "Saving to: ‘vocab.mmro.txt’\n", "\n", "vocab.mmro.txt 100%[=====================>] 152.12K --.-KB/s in 0.06s \n", "\n", "2015-11-18 11:42:34 (2.43 MB/s) - ‘vocab.mmro.txt’ saved [155766/155766]\n", "\n", "--2015-11-18 11:42:34-- https://s3-eu-west-1.amazonaws.com/artm/docword.mmro.txt.7z\n", "Resolving s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)... 54.231.133.132\n", "Connecting to s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)|54.231.133.132|:443... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: 490147 (479K) [application/octet-stream]\n", "Saving to: ‘docword.mmro.txt.7z’\n", "\n", "docword.mmro.txt.7z 100%[=====================>] 478.66K --.-KB/s in 0.1s \n", "\n", "2015-11-18 11:42:34 (4.27 MB/s) - ‘docword.mmro.txt.7z’ saved [490147/490147]\n", "\n", "\n", "7-Zip (A) [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18\n", "p7zip Version 9.20 (locale=C.UTF-8,Utf16=on,HugeFiles=on,4 CPUs)\n", "\n", "Processing archive: docword.mmro.txt.7z\n", "\n", "Extracting docword.mmro.txt\n", "\n", "Everything is Ok\n", "\n", "Size: 3248571\n", "Compressed: 490147\n" ] } ], "source": [ "!wget https://s3-eu-west-1.amazonaws.com/artm/vocab.mmro.txt\n", "!wget https://s3-eu-west-1.amazonaws.com/artm/docword.mmro.txt.7z\n", "!7zr e docword.mmro.txt.7z" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the collection to tm-navigator" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Install a convenience wrapper for creating csv's:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install csvwriter" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from csvwriter import CsvWriter\n", "import numpy as np\n", "from glob import glob" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This collection should be added to tm-navigator. Based on the data available in this case, only the simplest visualization can be built, without even documents names and their authors.\n", "\n", "tm-navigator native input format is a bunch of csv files, each corresponding to a database table. Minimally, a dataset (text collection) is described with the following tables:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# list of all modalities\n", "with CsvWriter(open('modalities.csv', 'w')) as out:\n", " out << [dict(id=1, name='words')] # this one is required" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# read the ndw counts\n", "with open('docword.mmro.txt') as f:\n", " D = int(f.readline())\n", " W = int(f.readline())\n", " n = int(f.readline())\n", " ndw_s = [map(int, line.split()) for line in f.readlines()]\n", " ndw_s = [(d - 1, w - 1, cnt) for d, w, cnt in ndw_s] # use 0-based indexing" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# all the documents data\n", "with CsvWriter(open('documents.csv', 'w')) as out:\n", " out << (\n", " dict(id=d,\n", " title='Document #{}'.format(d),\n", " slug='document-{}'.format(d), # any unique string, identifying the document - appears in short lists and URLs\n", " file_name='.../{}'.format(d), # if applicable, a relative filename of the document\n", " # source='MMRO', # optional, is displayed as-is, e.g. 
conference name with year\n", " # html=..., # optional, the full HTML content of the document\n", " )\n", " for d in range(D)\n", " )" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# terms (in this case, words only)\n", "with open('vocab.mmro.txt') as f, \\\n", " CsvWriter(open('terms.csv', 'w')) as out:\n", " out << (\n", " dict(id=i,\n", " modality_id=1, # matches the id in modalities table\n", " text=line.strip()\n", " )\n", " for i, line in enumerate(f)\n", " )" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# occurrences of terms in documents\n", "with CsvWriter(open('document_terms.csv', 'w')) as out:\n", " out << (\n", " dict(document_id=d,\n", " modality_id=1,\n", " term_id=w,\n", " count=cnt)\n", " for d, w, cnt in ndw_s\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, the required csv's are ready to be loaded into tm-navigator. Upload them to the server:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "documents.csv 100% 41KB 41.3KB/s 00:00 \n", "modalities.csv 100% 18 0.0KB/s 00:00 \n", "document_terms.csv 100% 4091KB 4.0MB/s 00:00 \n", "terms.csv 100% 212KB 212.0KB/s 00:00 \n" ] } ], "source": [ "!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'mkdir {TMNAV_PATH}data_mmro'\n", "for csv in glob('*.csv'):\n", " !scp -P {TMNAV_PORT} {csv} {TMNAV_SERVER}:{TMNAV_PATH}data_mmro/{csv}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the files are on the server, and we need to load them to the database. All database interactions are supposed to be done with the `db_manage.py` script, which has several commands. A list of parameters for each command can be obtained by adding `--help`:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: db_manage.py [OPTIONS] COMMAND [ARGS]...\n", "\n", "Options:\n", " --help Show this message and exit.\n", "\n", "Commands:\n", " add_dataset\n", " add_topicmodel\n", " describe\n", " load_dataset\n", " load_topicmodel\n", "Usage: db_manage.py load_dataset [OPTIONS]\n", "\n", "Options:\n", " -d, --dataset-id INTEGER [required]\n", " -t, --title TEXT\n", " -dir, --directory DIRECTORY [required]\n", " --help Show this message and exit.\n" ] } ], "source": [ "!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && ./db_manage.py'\n", "!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && ./db_manage.py load_dataset --help'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we add a new dataset and note the given id - it will be used later:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Added Dataset #1\r\n" ] } ], "source": [ "!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && ./db_manage.py add_dataset'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now load the data from CSV files (note the `dataset-id`, it is the number which was given by the previous command):" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found files \"document_terms.csv\", \"documents.csv\", \"modalities.csv\", \"terms.csv\".\n", "Not found files \"document_contents.csv\".\n", "Will try to continue with the files present.\n", "Proceeding will overwrite the 
corresponding data in the database. Continue? [Y/n]: Deleting data\n", "Deleting data\n", "Deleting data\n", "Deleting data\n", "Deleting data\n", "Loading data\n", "Loading data\n", "Loading data\n", "Loading data\n", "Loading data\n" ] } ], "source": [ "!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && yes | ./db_manage.py load_dataset --dataset-id 1 --title \"Simplest MMRO dataset\" -dir data_mmro'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can check that the dataset was loaded to the DB using the `describe` command:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "- Dataset #1: Simplest MMRO dataset, 0 models\n", " Documents: 1061\n", " Terms: 7805 words with 314081 occurrences\n", "\n" ] } ], "source": [ "!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && ./db_manage.py describe'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or just go to the front page at http://localhost:5000 - the last dataset there should be your newly added one." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Build a topic model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we build a simple ARTM model of this collection using BigARTM. Of course, you can use other tools for this if you want." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "import artm" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "batch_vectorizer = artm.BatchVectorizer(data_path='', data_format='bow_uci', collection_name='mmro', target_folder='.')" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "model_artm = artm.ARTM(num_topics=15,\n", " scores=[artm.PerplexityScore(name='PerplexityScore',\n", " use_unigram_document_model=False,\n", " dictionary_name='dictionary')],\n", " regularizers=[artm.SmoothSparseThetaRegularizer(name='SparseTheta', tau=-0.15)])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "model_artm.load_dictionary(dictionary_name='dictionary', dictionary_path='dictionary')\n", "model_artm.initialize(dictionary_name='dictionary')" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "model_artm.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=15, num_document_passes=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the simplest possible case, the model is completely described by its matrices $\\Phi$ and $\\Theta$:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "phi = model_artm.get_phi()\n", "theta = model_artm.fit_transform()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some other required probabilities are computed naively below, assuming a uniform distribution over documents. You can calculate them in other ways if you prefer." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "pwt = phi.as_matrix() # p(w|t), shape (W, T)\n", "ptd = theta.as_matrix() # p(t|d), shape (T, D)\n", "pd = 1.0 / theta.shape[1] # p(d): uniform over documents\n", "pt = (ptd * pd).sum(1) # p(t) = sum_d p(t|d) p(d)\n", "pw = (pwt * pt).sum(1) # p(w) = sum_t p(w|t) p(t)\n", "ptw = pwt * pt / pw[:, np.newaxis] # p(t|w) by Bayes' rule\n", "pdt = ptd * pd / pt[:, np.newaxis] # p(d|t) by Bayes' rule" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the model to tm-navigator" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the model is built, it has to be converted to CSV files, like the dataset was. 
The minimal required set of files is the following:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# the model topics\n", "with CsvWriter(open('topics.csv', 'w')) as out:\n", " out << [dict(id=0,\n", " level=0,\n", " id_in_level=0,\n", " is_background=False,\n", " probability=1)] # the single zero-level topic with id=0 is required\n", " out << (dict(id=1 + t, # any unique ids\n", " level=1, # for a flat non-hierarchical model just leave 1 here\n", " id_in_level=t,\n", " is_background=False, # if you have background topics, they should have True here\n", " probability=p)\n", " for t, p in enumerate(pt))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# probabilities of terms in topics\n", "with CsvWriter(open('topic_terms.csv', 'w')) as out:\n", " out << (dict(topic_id=1 + t, # same ids as above\n", " modality_id=1,\n", " term_id=w,\n", " prob_wt=pwt[w, t],\n", " prob_tw=ptw[w, t])\n", " for w, t in zip(*np.nonzero(pwt)))" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "# probabilities of topics in documents\n", "with CsvWriter(open('document_topics.csv', 'w')) as out:\n", " out << (dict(topic_id=1 + t, # same ids as above\n", " document_id=d,\n", " prob_td=ptd[t, d],\n", " prob_dt=pdt[t, d])\n", " for t, d in zip(*np.nonzero(ptd)))" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# graph of topics, mostly useful for hierarchical topic models\n", "# the navigator assumes that all topics are reachable by edges from the root topic #0\n", "with CsvWriter(open('topic_edges.csv', 'w')) as out:\n", " out << (dict(parent_id=0,\n", " child_id=1 + t,\n", " probability=p)\n", " for t, p in enumerate(pt))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, same as with the dataset before, upload the CSV files:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "topic_terms.csv 100% 3492KB 3.4MB/s 00:00 \n", "documents.csv 100% 41KB 41.3KB/s 00:00 \n", "modalities.csv 100% 18 0.0KB/s 00:00 \n", "topic_edges.csv 100% 260 0.3KB/s 00:00 \n", "document_terms.csv 100% 4091KB 4.0MB/s 00:00 \n", "terms.csv 100% 212KB 212.0KB/s 00:00 \n", "topics.csv 100% 416 0.4KB/s 00:00 \n", "document_topics.csv 100% 420KB 420.4KB/s 00:00 \n" ] } ], "source": [ "for csv in glob('*.csv'):\n", " !scp -P {TMNAV_PORT} {csv} {TMNAV_SERVER}:{TMNAV_PATH}data_mmro/{csv}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a new topic model (same `dataset-id` as in commands above):" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Added Topic Model #1 for Dataset #1\r\n" ] } ], "source": [ "!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && ./db_manage.py add_topicmodel --dataset-id 1'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And load CSVs to the database (here note the `topicmodel-id`):" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found files \"document_topics.csv\", \"topic_edges.csv\", \"topic_terms.csv\", \"topics.csv\".\n", "Not found files \"document_content_topics.csv\", \"document_similarities.csv\", \"term_similarities.csv\", \"topic_similarities.csv\".\n", "Will try to continue with the files present.\n", "Proceeding 
will overwrite the corresponding data in the database. Continue? [Y/n]: Deleting data\n", "Deleting data\n", "Deleting data\n", "Deleting data\n", "Deleting data\n", "Loading data\n", "Loading data\n", "Loading data\n", "Loading data\n", "Loading data\n" ] } ], "source": [ "!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && yes | ./db_manage.py load_topicmodel --topicmodel-id 1 --title \"Simplest model\" -dir data_mmro'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's all, basically - now visit http://localhost:5000 and follow the link to browse your model! The link should look like http://1.localhost:5000, just with another number at the beginning. If you don't like the debug panel at the side, just click `Hide` there once." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you build several (any number of) topic models for the same collection, you don't have to add a new dataset each time: just run `add_topicmodel` with the same dataset id." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Optional data for richer visualization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If that's not enough and you need other features, you can feed the navigator more data. Some useful cases, with descriptions of how to handle them:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Add titles, slugs and HTML content to the documents: just fill the corresponding fields in `documents.csv`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Add document authors: authors are another modality (like `words` in the given example), so just add an `authors` modality to `modalities.csv`, the corresponding terms (individual authors) to `terms.csv`, and their relation to documents to `document_terms.csv`. Of course, authors can be used in topic models as well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Highlight documents' HTML content: this is a two-step process; one step is related to the dataset and another to the topic model. Basically, you need the start and end positions of each term (word) in the HTML, and the top topics for them. If you have these, this is how to generate the corresponding CSVs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import itertools\n", "\n", "# this is for the dataset\n", "with CsvWriter(open('document_contents.csv', 'w')) as out:\n", " id_cnt = itertools.count() # it's just one way to generate ids so that they match in both cases\n", " out << (dict(id=next(id_cnt), # must correspond to the ids in document_content_topics below\n", " document_id=d,\n", " modality_id=1, # 1 for words\n", " term_id=w,\n", " start_pos=s, end_pos=e # the start and end positions in the HTML content\n", " )\n", " for d, w, s, e in ...)\n", "\n", "# and this is for the topicmodel\n", "with CsvWriter(open('document_content_topics.csv', 'w')) as out:\n", " id_cnt = itertools.count() # restart the counter so the ids match the ones above\n", " out << (dict(document_content_id=next(id_cnt), # same ids as above\n", " topic_id=1 + t # the top topic id, determines the color\n", " )\n", " for d, t in ...)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Show lists of similar documents, topics, or terms on the corresponding pages: the navigator doesn't restrict how the similarity is determined, so it must be computed beforehand. Similarities are internally related to topicmodels, not datasets, because they are typically computed using the data from models. Multiple different similarities are supported for each entity, see below:" ] },
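{ "cell_type": "markdown", "metadata": {}, "source": [ "For instance, one naive way to obtain such a `distances` matrix - here for topics - is to take pairwise Hellinger distances between the columns of the `pwt` matrix computed above (this particular measure is just an illustration, not something tm-navigator requires). Analogous matrices for documents or terms can be built from the columns of $\\Theta$ or the rows of $\\Phi$, and any other measure works as long as smaller values mean more similar entities:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# one possible (naive) way to get the precomputed distance matrix used below:\n", "# pairwise Hellinger distances between topics, computed from p(w|t) = pwt\n", "sqrt_pwt = np.sqrt(pwt) # shape (W, T)\n", "# d(t1, t2) = sqrt(0.5 * sum_w (sqrt(p(w|t1)) - sqrt(p(w|t2)))^2), lies in [0, 1]\n", "distances = np.sqrt(0.5 * ((sqrt_pwt[:, :, np.newaxis] - sqrt_pwt[:, np.newaxis, :]) ** 2).sum(0))\n", "# since the distances lie in [0, 1], 1 - distances can serve as the similarity value written below" ] },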
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with CsvWriter(open('document_similarities.csv', 'w')) as out:\n", " out << (dict(a_id=i, # first document id\n", " b_id=sim_i, # second document id\n", " similarity=row[sim_i], # similarity from [0, 1]\n", " similarity_type='Topics' # free-form short name of this similarity type, common choices probably are Topics and Words\n", " )\n", " for i, row in enumerate(distances) # the precomputed distance matrix\n", " # tip: don't write all n^2 entries to the CSV table, to avoid bloating it;\n", " # here we keep only the 30 most similar entities for each row\n", " for sim_i in row.argsort()[:31]\n", " if sim_i != i)\n", "\n", "with CsvWriter(open('topic_similarities.csv', 'w')) as out:\n", " out << (dict(a_id=1 + i,\n", " b_id=1 + sim_i,\n", " similarity=row[sim_i],\n", " similarity_type='Words')\n", " for i, row in enumerate(distances)\n", " for sim_i in row.argsort()[:] # if you have hundreds or more topics, limit to the first 50 or so here\n", " if sim_i != i)\n", "\n", "with CsvWriter(open('term_similarities.csv', 'w')) as out:\n", " out << (dict(a_modality_id=1,\n", " a_id=i,\n", " b_modality_id=1,\n", " b_id=sim_i,\n", " similarity=row[sim_i],\n", " similarity_type='Topics')\n", " for i, row in enumerate(distances)\n", " for sim_i in row.argsort()[:21] # the first 20 similar terms\n", " if sim_i != i)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Hierarchical topic models can easily be represented by using several levels of topics and adding edges between them, as shown above. Actually, even if your topic model isn't hierarchical, you can add a middle level of topics to act as groups of the real topics, and name them accordingly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember to upload the CSVs after each change and load them into the database using `load_dataset` or `load_topicmodel`! Use the same dataset or model id as before if you want to overwrite an existing dataset or model, or add a new one with `add_*` first.\n", "\n", "Datasets which already have topic models can only be changed by adding data, like `document_contents.csv`, not by removing anything. The database will give an error if you try to do anything that makes the data inconsistent." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notes for the artm-dev team, who can use the remote server mentioned above" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No permissions system is currently implemented (nor is one planned), so every user on the same server can view or modify all the data. Please treat the server as disposable storage and keep any data you cannot easily regenerate on your own computer. If you want to use the assessment features of the navigator, please contact me beforehand to make sure you don't lose the responses!\n", "\n", "This tutorial uses a directory named `data_mmro` on the server; please replace it with something unique within our group for convenience." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2.0 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }