{
 "metadata": {
  "name": "",
  "signature": "sha256:b168a88dfb9b98cbb60a61faf7512ce272120ee1ca8d93f8d4551bf8e109f929"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Hierarchical topic modelling of Kiva loan descriptions"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Goals"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "*  Create a visual explorer for a static set of 775K English loan descriptions from Kiva.org\n",
      "*  Use (hierarchical) topic modelling\n",
      "*  Publish the explorer"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Approach"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "* Use gensim to derive flat topic models over (part of) the Kiva corpus, taking the [tutorial](https://radimrehurek.com/gensim/tutorial.html) as a guideline\n",
      "* Organize the resulting topic models into a hierarchy\n",
      "* Convert that hierarchy into a [JSON data file](https://github.com/mlvl/Hierarchie/tree/gh-pages/app/data) compliant with Hierarchie\n",
      "* Visualize everything with Hierarchie"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Data preprocessing"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Step 1: download the [Kiva data dump (JSON format)](http://s3.kiva.org/snapshots/kiva_ds_json.zip) and extract it into `data/static`"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Step 2: the 'description' fields in the Kiva data dump often mix multiple languages (due to manual translations), so their language codes are not reliable. Therefore, we:\n",
      "\n",
      "* split the descriptions into paragraphs\n",
      "* run language detection on each paragraph (using the [langid](https://github.com/saffsd/langid.py) library)\n",
      "* store the recombined paragraphs and their language code in a new `processed_description` field\n",
      "\n",
      "The data are written to a locally installed MongoDB database 'kiva', collection 'loans'."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# The next line is commented out because the import only needs to run once\n",
      "# !nohup python src/load_kiva_loans_to_mongodb.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 10
    },
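    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The paragraph-level language detection of Step 2 could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in `src/load_kiva_loans_to_mongodb.py`: the function name `split_by_language`, the blank-line paragraph split and the returned layout (language code mapped to the recombined text) are all hypothetical. Only `langid.classify`, which returns a `(language, score)` pair, comes from the langid library."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import re\n",
      "\n",
      "\n",
      "def split_by_language(description, classify=None):\n",
      "    # Group the paragraphs of one description by detected language\n",
      "    if classify is None:\n",
      "        import langid  # third-party library, imported lazily\n",
      "\n",
      "        def classify(text):\n",
      "            return langid.classify(text)[0]\n",
      "    # Assumption: paragraphs are separated by one or more blank lines\n",
      "    paragraphs = [p.strip() for p in re.split(r'\\n\\s*\\n', description)]\n",
      "    grouped = {}\n",
      "    for paragraph in paragraphs:\n",
      "        if paragraph:\n",
      "            grouped.setdefault(classify(paragraph), []).append(paragraph)\n",
      "    # Recombine the paragraphs of each language into a single text\n",
      "    return dict((lang, '\\n\\n'.join(parts)) for lang, parts in grouped.items())\n",
      "\n",
      "\n",
      "# Hypothetical mixed-language description\n",
      "text = 'Maria sells vegetables at the local market.\\n\\nMaria vende verduras en el mercado local.'\n",
      "print(split_by_language(text))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },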
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Step 3: load (a subset of) the English data from MongoDB and convert it to the [Blei LDA-C format](http://www.cs.princeton.edu/~blei/lda-c/), using a [gensim utility function](https://radimrehurek.com/gensim/tut1.html#corpus-formats)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!python src/convert_mongodb_to_blei_ldac.py --dataDir data/topicmodelling \\\n",
      "                                            --corpusBaseName kiva \\\n",
      "                                            --stopwordFile=data/topicmodelling/kiva_stopwords.tsv \\\n",
      "                                            --startYear 1990 \\\n",
      "                                            --maxNrDocs 800000 \\\n",
      "                                            --filterBelow 10 \\\n",
      "                                            --filterAbove 0.5 \\\n",
      "                                            --filterKeepN 1000"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Creating MongoDB cursor ... done\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Number of loans in 'en' since 1990: 774952\r\n",
        "First pass: streaming from MongoDB ...\r\n",
        "creating the dictionary ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "filtering the dictionary ..."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " done\r\n",
        "wrote data/topicmodelling/kiva_dict.bin ... and data/topicmodelling/kiva_dict.txt ... done\r\n",
        "Second pass: streaming from MongoDB ... saving into data/topicmodelling/kiva.lda-c (Blei corpus format) ..."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " done\r\n",
        "Number of documents converted: 774952\r\n",
        "Vocabulary size: 1000\r\n"
       ]
      }
     ],
     "prompt_number": 6
    },
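    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The conversion in Step 3 relies on gensim's corpus utilities. The sketch below shows the same two-pass idea on a toy corpus; it is not the code in `src/convert_mongodb_to_blei_ldac.py`. That the `--filterBelow`, `--filterAbove` and `--filterKeepN` options correspond to the `no_below`, `no_above` and `keep_n` arguments of `Dictionary.filter_extremes` is an assumption, and the toy documents and the output path are made up."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import os\n",
      "import tempfile\n",
      "\n",
      "from gensim import corpora\n",
      "\n",
      "# Toy stand-in for tokenised, stopword-filtered loan descriptions\n",
      "texts = [\n",
      "    ['loan', 'buy', 'rice', 'sell', 'market'],\n",
      "    ['loan', 'buy', 'fertilizer', 'farm'],\n",
      "    ['sell', 'clothes', 'market', 'shop'],\n",
      "]\n",
      "\n",
      "# First pass: build the vocabulary, then prune rare and very common tokens\n",
      "dictionary = corpora.Dictionary(texts)\n",
      "# Loose thresholds because the toy corpus is tiny; the script call above passes 10, 0.5 and 1000\n",
      "dictionary.filter_extremes(no_below=1, no_above=1.0, keep_n=1000)\n",
      "\n",
      "# Second pass: convert each document to a bag of words and write Blei's LDA-C format\n",
      "bow_corpus = [dictionary.doc2bow(tokens) for tokens in texts]\n",
      "out_file = os.path.join(tempfile.mkdtemp(), 'toy.lda-c')\n",
      "corpora.BleiCorpus.serialize(out_file, bow_corpus, id2word=dictionary)\n",
      "print(sorted(os.listdir(os.path.dirname(out_file))))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },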
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Topic modelling"
     ]
    },
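    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "`src/model_topics.py` is not shown in this notebook. The `weight*word` lines it prints resemble gensim's formatted topic output, so the sketch below assumes an LDA model (`gensim.models.LdaModel`). That choice, the toy corpus and the parameter values are assumptions made for illustration only; the actual run uses 64 topics and 8 words per topic."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from gensim import corpora, models\n",
      "\n",
      "# Toy stand-in for the preprocessed Kiva corpus\n",
      "texts = [\n",
      "    ['loan', 'buy', 'rice', 'sell', 'market'],\n",
      "    ['loan', 'buy', 'fertilizer', 'farm'],\n",
      "    ['sell', 'clothes', 'market', 'shop'],\n",
      "    ['farm', 'rice', 'harvest', 'fertilizer'],\n",
      "]\n",
      "dictionary = corpora.Dictionary(texts)\n",
      "bow_corpus = [dictionary.doc2bow(tokens) for tokens in texts]\n",
      "\n",
      "# Assumption: a flat LDA model; 2 topics are enough for the toy data\n",
      "lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2, passes=10)\n",
      "\n",
      "# Show the 3 highest-weighted words of each topic\n",
      "for topic in lda.show_topics(2, 3):\n",
      "    print(topic)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },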
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!python src/model_topics.py --dataDir data/topicmodelling \\\n",
      "                            --modelDir data/topicmodelling \\\n",
      "                            --corpusBaseName kiva \\\n",
      "                            --nrTopics 64 \\\n",
      "                            --nrWords 8"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Loading Blei corpus file data/topicmodelling/kiva.lda-c ..."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " done\r\n",
        "<gensim.corpora.bleicorpus.BleiCorpus object at 0x108ba6a90>\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Dictionary(1000 unique tokens: [u'neighbors', u'sector', u'managed', u'lack', u'eldest']...)\r\n",
        "Making topic model ..."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " done\r\n",
        "0.217*region + 0.179*applied + 0.129*borrowed + 0.123*pakistan + 0.082*sheep + 0.070*times + 0.045*develop + 0.036*rented\r\n",
        "0.039*started + 0.031*ago + 0.024*start + 0.024*support + 0.022*selling + 0.020*decided + 0.018*thanks + 0.018*increase\r\n",
        "0.179*college + 0.174*hardworking + 0.109*studies + 0.071*word + 0.065*degree + 0.046*teacher + 0.044*continues + 0.036*profession\r\n",
        "0.687*food + 0.091*beverages + 0.089*foods + 0.045*25,000 + 0.028*serves + 0.012*prepare + 0.011*selling + 0.007*cooking\r\n",
        "0.082*living + 0.073*earn + 0.055*income + 0.048*hopes + 0.048*rice + 0.038*village + 0.038*support + 0.028*family\r\n",
        "0.124*poultry + 0.114*fellowship + 0.108*god + 0.092*chickens + 0.072*attain + 0.069*chicken + 0.069*eggs + 0.068*build\r\n",
        "0.060*meet + 0.053*customers + 0.049*needs + 0.041*increase + 0.031*demand + 0.027*family + 0.024*sales + 0.021*good\r\n",
        "0.190*member + 0.147*program + 0.086*joined + 0.065*pmpc + 0.057*help + 0.056*foundation + 0.041*recently + 0.037*paglaum\r\n",
        "0.075*money + 0.074*save + 0.072*requested + 0.067*family + 0.066*enough + 0.052*works + 0.050*hard + 0.040*old\r\n",
        "0.173*community + 0.106*services + 0.081*costs + 0.066*providing + 0.046*service + 0.044*new + 0.044*provides + 0.042*access\r\n",
        "0.191*higher + 0.176*low + 0.175*price + 0.170*prices + 0.046*competition + 0.040*al + 0.037*wholesale + 0.034*commercial\r\n",
        "0.349*clothing + 0.096*shoes + 0.075*tuition + 0.062*merchandise + 0.061*sales + 0.058*cosmetics + 0.046*selling + 0.038*baby\r\n",
        "0.198*get + 0.151*ahead + 0.114*maria + 0.083*forward + 0.070*desire + 0.066*greatest + 0.053*borrower + 0.046*spouse\r\n",
        "0.321*nwtf + 0.067*dream + 0.054*sell + 0.042*past + 0.042*charcoal + 0.039*selling + 0.036*expand + 0.034*also\r\n",
        "0.179*pigs + 0.175*raising + 0.076*earns + 0.058*healthy + 0.057*old + 0.055*living + 0.046*raise + 0.041*activities\r\n",
        "0.177*products + 0.109*goods + 0.079*etc + 0.070*canned + 0.067*sells + 0.044*sell + 0.044*like + 0.043*shampoo\r\n",
        "0.177*electricity + 0.090*funds + 0.089*soon + 0.079*ensure + 0.078*including + 0.055*onions + 0.052*mainly + 0.041*along\r\n",
        "0.155*php + 0.134*philippines + 0.106*additional + 0.056*future + 0.053*earns + 0.047*secure + 0.046*income + 0.039*hard\r\n",
        "0.286*fish + 0.189*fishing + 0.100*pig + 0.089*blessed + 0.073*mary + 0.055*firewood + 0.046*sell + 0.039*cassava\r\n",
        "0.063*happy + 0.059*well + 0.049*says + 0.037*likes + 0.031*hand + 0.028*good + 0.028*time + 0.027*gets\r\n",
        "0.320*lenders + 0.235*entrepreneur + 0.200*village + 0.127*engaged + 0.076*northern + 0.017*english + 0.013*kind + 0.003*eight\r\n",
        "0.366*like + 0.320*would + 0.117*partner + 0.054*future + 0.046*describes + 0.033*aspires + 0.019*domestic + 0.016*committed\r\n",
        "0.221*dreams + 0.212*clothes + 0.096*selling + 0.063*goes + 0.041*sell + 0.041*woman + 0.041*sells + 0.036*shirts\r\n",
        "0.344*man + 0.190*photo + 0.095*wife + 0.077*father + 0.071*equipment + 0.050*furniture + 0.042*right + 0.032*pesos\r\n",
        "0.063*lives + 0.049*city + 0.048*works + 0.045*house + 0.037*located + 0.035*old + 0.035*area + 0.034*home\r\n",
        "0.315*borrowing + 0.211*institution + 0.205*communal + 0.090*raised + 0.088*often + 0.059*economy + 0.032*rest + 0.000*recently\r\n",
        "0.151*bank + 0.109*members + 0.095*member + 0.070*time + 0.065*cycle + 0.056*health + 0.036*community + 0.030*payments\r\n",
        "0.126*farming + 0.073*harvest + 0.060*land + 0.060*farm + 0.040*fertilizer + 0.039*crops + 0.037*farmers + 0.030*seeds\r\n",
        "0.092*income + 0.086*husband + 0.065*family + 0.062*expenses + 0.042*help + 0.039*needs + 0.038*cover + 0.036*household\r\n",
        "0.185*materials + 0.110*home + 0.106*making + 0.083*cement + 0.078*sustaining + 0.055*wood + 0.039*make + 0.037*raw\r\n",
        "0.237*district + 0.143*province + 0.101*lives + 0.086*requested + 0.068*inputs + 0.055*cambodia + 0.044*old + 0.042*anticipated\r\n",
        "0.188*son + 0.111*university + 0.085*pay + 0.070*studying + 0.058*noodles + 0.053*parts + 0.047*year + 0.042*education\r\n",
        "0.241*living + 0.182*improve + 0.122*family + 0.109*conditions + 0.027*better + 0.026*income + 0.025*situation + 0.024*new\r\n",
        "0.207*previous + 0.108*back + 0.106*new + 0.094*total + 0.082*paid + 0.077*loans + 0.070*used + 0.059*third\r\n",
        "0.073*sewing + 0.055*machine + 0.049*training + 0.041*hair + 0.033*tools + 0.033*beauty + 0.033*salon + 0.030*tailoring\r\n",
        "0.092*local + 0.082*small + 0.080*successful + 0.078*capital + 0.056*working + 0.052*assistance + 0.043*stable + 0.042*financially\r\n",
        "0.264*group + 0.102*women + 0.081*members + 0.050*one + 0.037*fund + 0.032*leader + 0.029*first + 0.019*use\r\n",
        "0.206*rural + 0.180*field + 0.105*biggest + 0.075*kiva\u2019s + 0.068*benefit + 0.059*sector + 0.055*cooperative + 0.049*areas\r\n",
        "0.263*repaid + 0.252*successfully + 0.218*involved + 0.072*loans + 0.043*individual + 0.041*56 + 0.038*adult + 0.026*completed\r\n",
        "0.092*school + 0.075*daughters + 0.068*students + 0.066*restaurant + 0.065*study + 0.063*two + 0.051*sons + 0.043*stories\r\n",
        "0.316*water + 0.119*monthly + 0.083*week + 0.054*aged + 0.049*days + 0.048*family + 0.042*piped + 0.035*every\r\n",
        "0.106*shop + 0.071*use + 0.062*hopes + 0.050*kes + 0.047*future + 0.043*operates + 0.041*retail + 0.039*stock\r\n",
        "0.040*work + 0.027*able + 0.026*family + 0.021*help + 0.019*works + 0.017*wants + 0.015*lives + 0.014*day\r\n",
        "0.292*store + 0.124*general + 0.081*items + 0.054*grocery + 0.051*groceries + 0.050*sell + 0.048*runs + 0.030*variety\r\n",
        "0.193*build + 0.184*house + 0.057*expanding + 0.056*building + 0.048*sand + 0.045*basis + 0.044*construction + 0.043*labor\r\n",
        "0.081*maize + 0.061*farmer + 0.059*income + 0.056*milk + 0.042*dairy + 0.040*cows + 0.037*cattle + 0.037*animals\r\n",
        "0.104*rice + 0.101*sugar + 0.070*oil + 0.062*flour + 0.050*cooking + 0.045*bread + 0.040*meat + 0.035*beans\r\n",
        "0.269*livestock + 0.172*feed + 0.125*agriculture + 0.110*fattening + 0.070*primarily + 0.068*gain + 0.064*agricultural + 0.056*shown\r\n",
        "0.158*day + 0.113*average + 0.083*per + 0.073*credit + 0.072*every + 0.071*within + 0.069*usd + 0.046*month\r\n",
        "0.073*daughter + 0.069*school + 0.054*lives + 0.051*house + 0.049*old + 0.049*husband + 0.045*applying + 0.039*faces\r\n",
        "0.211*vegetables + 0.103*vending + 0.083*fruits + 0.074*fruit + 0.069*stall + 0.053*bananas + 0.049*vegetable + 0.049*standing\r\n",
        "0.081*grade + 0.074*manage + 0.073*born + 0.069*due + 0.069*supports + 0.059*became + 0.054*honest + 0.054*financial\r\n",
        "0.108*school + 0.057*fees + 0.044*challenge + 0.042*kenya + 0.027*uganda + 0.025*major + 0.024*profits + 0.023*pay\r\n",
        "0.308*coffee + 0.258*drinks + 0.195*soft + 0.093*weather + 0.088*peru + 0.037*mr. + 0.013*wife + 0.003*thus\r\n",
        "0.099*repair + 0.087*fertilizers + 0.084*transportation + 0.077*motorcycle + 0.073*transport + 0.058*maintenance + 0.057*driver + 0.054*resell\r\n",
        "0.283*access + 0.117*loans + 0.093*! + 0.092*stocks + 0.069*taken + 0.067*institutions + 0.057*said + 0.055*cloth\r\n",
        "0.416*one + 0.245*child + 0.136*year + 0.063*old + 0.050*wheat + 0.042*fourth + 0.028*south + 0.007*youngest\r\n",
        "0.072*traditional + 0.070*ingredients + 0.066*profit + 0.056*bags + 0.046*live + 0.042*plans + 0.038*soro + 0.037*yiriwaso\r\n",
        "0.102*businesses + 0.084*community + 0.069*families + 0.052*microfinance + 0.049*groups + 0.044*intends + 0.041*repay + 0.039*share\r\n",
        "0.168*poor + 0.110*production + 0.102*poverty + 0.073*financial + 0.067*development + 0.066*francs + 0.045*country + 0.043*organization\r\n",
        "0.240*cash + 0.134*hours + 0.104*hoping + 0.099*manages + 0.068*expects + 0.066*housewife + 0.065*assist + 0.059*net\r\n",
        "0.251*brac + 0.205*ages + 0.159*generate + 0.104*rosa + 0.101*renovate + 0.062*2011 + 0.031*currently + 0.027*small\r\n",
        "0.126*amount + 0.099*old + 0.059*past + 0.056*purchase + 0.045*hopes + 0.045*lives + 0.043*requesting + 0.041*two\r\n",
        "0.055*life + 0.040*quality + 0.038*products + 0.038*better + 0.036*able + 0.027*customers + 0.026*improve + 0.025*good\r\n",
        "Writing model file data/topicmodelling/kiva.lda_model ... done\r\n",
        "Creating complete topic/word matrix in memory:\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 1/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 2/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 3/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 4/64 ...\r\n",
        "topic 5/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 6/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 7/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 8/64 ...\r\n",
        "topic 9/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 10/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 11/64 ...\r\n",
        "topic 12/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 13/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 14/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 15/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 16/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 17/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 18/64 ...\r\n",
        "topic 19/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 20/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 21/64 ...\r\n",
        "topic 22/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 23/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 24/64 ...\r\n",
        "topic 25/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 26/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 27/64 ...\r\n",
        "topic 28/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 29/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 30/64 ...\r\n",
        "topic 31/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 32/64 ...\r\n",
        "topic 33/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 34/64 ...\r\n",
        "topic 35/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 36/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 37/64 ...\r\n",
        "topic 38/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 39/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 40/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 41/64 ...\r\n",
        "topic 42/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 43/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 44/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 45/64 ...\r\n",
        "topic 46/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 47/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 48/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 49/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 50/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 51/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 52/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 53/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 54/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 55/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 56/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 57/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 58/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 59/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 60/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 61/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 62/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 63/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "topic 64/64 ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "done\r\n",
        "Writing topic/word matrix file data/topicmodelling/kiva_topic_words_matrix.h5 ..."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "/usr/local/lib/python2.7/site-packages/pandas/io/pytables.py:2486: PerformanceWarning: \r\n",
        "your performance may suffer as PyTables will pickle object types that it cannot\r\n",
        "map directly to c-types [inferred_type->unicode,key->axis0] [items->None]\r\n",
        "\r\n",
        "  warnings.warn(ws, PerformanceWarning)\r\n",
        "/usr/local/lib/python2.7/site-packages/pandas/io/pytables.py:2486: PerformanceWarning: \r\n",
        "your performance may suffer as PyTables will pickle object types that it cannot\r\n",
        "map directly to c-types [inferred_type->unicode,key->block0_items] [items->None]\r\n",
        "\r\n",
        "  warnings.warn(ws, PerformanceWarning)\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "done\r\n"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Infer topic distributions over the document set"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!python src/infer_document_topic_distributions.py --modelDir data/topicmodelling \\\n",
      "                                                  --modelBaseName kiva \\\n",
      "                                                  --maxNrDocs 250000"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Reading topic/word matrix file data/topicmodelling/kiva_topic_words_matrix.h5 ..."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " done\r\n",
        "Loading model file data/topicmodelling/kiva.lda_model ... done\r\n",
        "Loading Blei corpus file data/topicmodelling/kiva.lda-c ..."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " done\r\n",
        "<gensim.corpora.bleicorpus.BleiCorpus object at 0x1117a8790>\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 5000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 10000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 15000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 20000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 25000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 30000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 35000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 40000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 45000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 50000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 55000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 60000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 65000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 70000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 75000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 80000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 85000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 90000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 95000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 100000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 105000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 110000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 115000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 120000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 125000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 130000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 135000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 140000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 145000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 150000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 155000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 160000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 165000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 170000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 175000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 180000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 185000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 190000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 195000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 200000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 205000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 210000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 215000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 220000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 225000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 230000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 235000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 240000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 245000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 250000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 255000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 260000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 265000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 270000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 275000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 280000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 285000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 290000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 295000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 300000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 305000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 310000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 315000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 320000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 325000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 330000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 335000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 340000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 345000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 350000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 355000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 360000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 365000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 370000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 375000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 380000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 385000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 390000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 395000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 400000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 405000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 410000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 415000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 420000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 425000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 430000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 435000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 440000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 445000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 450000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 455000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 460000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 465000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 470000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 475000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 480000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 485000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 490000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 495000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 500000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 505000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 510000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 515000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 520000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 525000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 530000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 535000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 540000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 545000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 550000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 555000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 560000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 565000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 570000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 575000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 580000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 585000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 590000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 595000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 600000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 605000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 610000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 615000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 620000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 625000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 630000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 635000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 640000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 645000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 650000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 655000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 660000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 665000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 670000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 675000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 680000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 685000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 690000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 695000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 700000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 705000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 710000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 715000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 720000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 725000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 730000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 735000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 740000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 745000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 750000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 755000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 760000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 765000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Processed 770000 documents ...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "475 winning (0.06%); 0.67% weight in [region applied borrowed pakistan sheep times develop rented]\r\n",
        "73127 winning (9.44%); 5.46% weight in [started ago start support selling decided thanks increase]\r\n",
        "136 winning (0.02%); 0.72% weight in [college hardworking studies word degree teacher continues profession]\r\n",
        "63 winning (0.01%); 0.56% weight in [food beverages foods 25,000 serves prepare selling cooking]\r\n",
        "33504 winning (4.32%); 2.47% weight in [living earn income hopes rice village support family]\r\n",
        "273 winning (0.04%); 0.59% weight in [poultry fellowship god chickens attain chicken eggs build]\r\n",
        "31849 winning (4.11%); 3.10% weight in [meet customers needs increase demand family sales good]\r\n",
        "5630 winning (0.73%); 1.23% weight in [member program joined pmpc help foundation recently paglaum]\r\n",
        "34069 winning (4.40%); 2.74% weight in [money save requested family enough works hard old]\r\n",
        "24826 winning (3.20%); 2.21% weight in [community services costs providing service new provides access]\r\n",
        "26 winning (0.00%); 0.49% weight in [higher low price prices competition al wholesale commercial]\r\n",
        "1551 winning (0.20%); 0.86% weight in [clothing shoes tuition merchandise sales cosmetics selling baby]\r\n",
        "77 winning (0.01%); 0.75% weight in [get ahead maria forward desire greatest borrower spouse]\r\n",
        "1582 winning (0.20%); 0.92% weight in [nwtf dream sell past charcoal selling expand also]\r\n",
        "4406 winning (0.57%); 1.26% weight in [pigs raising earns healthy old living raise activities]\r\n",
        "4342 winning (0.56%); 1.39% weight in [products goods etc canned sells sell like shampoo]\r\n",
        "245 winning (0.03%); 0.74% weight in [electricity funds soon ensure including onions mainly along]\r\n",
        "17558 winning (2.27%); 2.56% weight in [php philippines additional future earns secure income hard]\r\n",
        "1370 winning (0.18%); 0.55% weight in [fish fishing pig blessed mary firewood sell cassava]\r\n",
        "4230 winning (0.55%); 1.88% weight in [happy well says likes hand good time gets]\r\n",
        "30 winning (0.00%); 0.69% weight in [lenders entrepreneur village engaged northern english kind eight]\r\n",
        "334 winning (0.04%); 1.27% weight in [like would partner future describes aspires domestic committed]\r\n",
        "1882 winning (0.24%); 0.75% weight in [dreams clothes selling goes sell woman sells shirts]\r\n",
        "308 winning (0.04%); 0.60% weight in [man photo wife father equipment furniture right pesos]\r\n",
        "46233 winning (5.97%); 4.16% weight in [lives city works house located old area home]\r\n",
        "0 winning (0.00%); 0.39% weight in [borrowing institution communal raised often economy rest recently]\r\n",
        "5308 winning (0.68%); 1.44% weight in [bank members member time cycle health community payments]\r\n",
        "21246 winning (2.74%); 2.63% weight in [farming harvest land farm fertilizer crops farmers seeds]\r\n",
        "26007 winning (3.36%); 3.05% weight in [income husband family expenses help needs cover household]\r\n",
        "1560 winning (0.20%); 0.88% weight in [materials home making cement sustaining wood make raw]\r\n",
        "563 winning (0.07%); 0.84% weight in [district province lives requested inputs cambodia old anticipated]\r\n",
        "1676 winning (0.22%); 0.89% weight in [son university pay studying noodles parts year education]\r\n",
        "6894 winning (0.89%); 1.96% weight in [living improve family conditions better income situation new]\r\n",
        "2276 winning (0.29%); 0.95% weight in [previous back new total paid loans used third]\r\n",
        "6750 winning (0.87%); 1.10% weight in [sewing machine training hair tools beauty salon tailoring]\r\n",
        "1696 winning (0.22%); 1.36% weight in [local small successful capital working assistance stable financially]\r\n",
        "21624 winning (2.79%); 2.39% weight in [group women members one fund leader first use]\r\n",
        "0 winning (0.00%); 0.81% weight in [rural field biggest kiva\u2019s benefit sector cooperative areas]\r\n",
        "14 winning (0.00%); 0.60% weight in [repaid successfully involved loans individual 56 adult completed]\r\n",
        "1235 winning (0.16%); 0.88% weight in [school daughters students restaurant study two sons stories]\r\n",
        "1715 winning (0.22%); 1.04% weight in [water monthly week aged days family piped every]\r\n",
        "38299 winning (4.94%); 2.56% weight in [shop use hopes kes future operates retail stock]\r\n",
        "149146 winning (19.25%); 9.51% weight in [work able family help works wants lives day]\r\n",
        "10037 winning (1.30%); 1.73% weight in [store general items grocery groceries sell runs variety]\r\n",
        "1413 winning (0.18%); 1.04% weight in [build house expanding building sand basis construction labor]\r\n",
        "23737 winning (3.06%); 2.06% weight in [maize farmer income milk dairy cows cattle animals]\r\n",
        "8970 winning (1.16%); 1.57% weight in [rice sugar oil flour cooking bread meat beans]\r\n",
        "66 winning (0.01%); 0.58% weight in [livestock feed agriculture fattening primarily gain agricultural shown]\r\n",
        "4903 winning (0.63%); 1.21% weight in [day average per credit every within usd month]\r\n",
        "24381 winning (3.15%); 2.54% weight in [daughter school lives house old husband applying faces]\r\n",
        "1886 winning (0.24%); 0.86% weight in [vegetables vending fruits fruit stall bananas vegetable standing]\r\n",
        "46 winning (0.01%); 0.65% weight in [grade manage born due supports became honest financial]\r\n",
        "41496 winning (5.35%); 2.60% weight in [school fees challenge kenya uganda major profits pay]\r\n",
        "125 winning (0.02%); 0.40% weight in [coffee drinks soft weather peru mr. wife thus]\r\n",
        "2053 winning (0.26%); 0.87% weight in [repair fertilizers transportation motorcycle transport maintenance driver resell]\r\n",
        "104 winning (0.01%); 0.63% weight in [access loans ! stocks taken institutions said cloth]\r\n",
        "422 winning (0.05%); 0.85% weight in [one child year old wheat fourth south youngest]\r\n",
        "5011 winning (0.65%); 1.12% weight in [traditional ingredients profit bags live plans soro yiriwaso]\r\n",
        "7876 winning (1.02%); 1.29% weight in [businesses community families microfinance groups intends repay share]\r\n",
        "442 winning (0.06%); 0.89% weight in [poor production poverty financial development francs country organization]\r\n",
        "14 winning (0.00%); 0.38% weight in [cash hours hoping manages expects housewife assist net]\r\n",
        "246 winning (0.03%); 0.45% weight in [brac ages generate rosa renovate 2011 currently small]\r\n",
        "17439 winning (2.25%); 1.83% weight in [amount old past purchase hopes lives requesting two]\r\n",
        "50150 winning (6.47%); 4.53% weight in [life quality products better able customers improve good]\r\n"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Organize the flat topic list into a hierarchy for visualization"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!python src/build_topic_hierarchy.py --modelDir data/topicmodelling \\\n",
      "                                     --modelBaseName kiva \\\n",
      "                                     --nrClusters 16 \\\n",
      "                                     --nrWords 7"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Reading topic/word matrix file data/topicmodelling/kiva_topic_words_matrix.h5 ..."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " done\r\n",
        "Hierarchically clustering topics ..."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "recursive hierarchy:\r\n",
        "[[[[1, 42, 63], 30, 24, 4, 41, 46, 34, 58, 9, 28, 52, 44, 49, 6, 19, 26],\r\n",
        "  37,\r\n",
        "  2,\r\n",
        "  17,\r\n",
        "  14,\r\n",
        "  32,\r\n",
        "  8,\r\n",
        "  33,\r\n",
        "  20,\r\n",
        "  31,\r\n",
        "  29,\r\n",
        "  0,\r\n",
        "  36,\r\n",
        "  39,\r\n",
        "  62,\r\n",
        "  56],\r\n",
        " [[45, 57], 15, 5, 47, 27, 18, 59, 43, 11, 23, 12, 7, 50, 54, 40, 16],\r\n",
        " 13,\r\n",
        " 55,\r\n",
        " 21,\r\n",
        " 25,\r\n",
        " 48,\r\n",
        " 38,\r\n",
        " 3,\r\n",
        " 22,\r\n",
        " 53,\r\n",
        " 10,\r\n",
        " 51,\r\n",
        " 35,\r\n",
        " 60,\r\n",
        " 61]\r\n",
        " done\r\n",
        "Loading model file data/topicmodelling/kiva.lda_model ... done\r\n",
        "Building nested hierarchy in memory ... done\r\n",
        "{u'topic_data': [{u'a_words': [u'family',\r\n",
        "                               u'group',\r\n",
        "                               u'income',\r\n",
        "                               u'lives',\r\n",
        "                               u'community',\r\n",
        "                               u'living',\r\n",
        "                               u'school'],\r\n",
        "                  u'b_name': u'topic_1_42_63_30_24_4_41_46_34_58_9_28_52_44_49_6_19_26_37_2_17_14_32_8_33_20_31_29_0_36_39_62_56',\r\n",
        "                  u'b_size': 553701,\r\n",
        "                  u'children': [{u'a_words': [u'community',\r\n",
        "                                              u'family',\r\n",
        "                                              u'house',\r\n",
        "                                              u'lives',\r\n",
        "                                              u'school',\r\n",
        "                                              u'income',\r\n",
        "                                              u'able'],\r\n",
        "                                 u'b_name': u'topic_1_42_63_30_24_4_41_46_34_58_9_28_52_44_49_6_19_26',\r\n",
        "                                 u'b_size': 398051,\r\n",
        "                                 u'children': [{u'a_words': [u'able',\r\n",
        "                                                             u'work',\r\n",
        "                                                             u'life',\r\n",
        "                                                             u'family',\r\n",
        "                                                             u'started',\r\n",
        "                                                             u'help',\r\n",
        "                                                             u'quality'],\r\n",
        "                                                u'b_name': u'topic_1_42_63',\r\n",
        "                                                u'b_size': 151103,\r\n",
        "                                                u'children': [{u'a_words': [u'started',\r\n",
        "                                                                            u'ago',\r\n",
        "                                                                            u'start',\r\n",
        "                                                                            u'support',\r\n",
        "                                                                            u'selling',\r\n",
        "                                                                            u'decided',\r\n",
        "                                                                            u'thanks'],\r\n",
        "                                                               u'b_name': u'topic_1',\r\n",
        "                                                               u'b_size': 42309},\r\n",
        "                                                              {u'a_words': [u'work',\r\n",
        "                                                                            u'able',\r\n",
        "                                                                            u'family',\r\n",
        "                                                                            u'help',\r\n",
        "                                                                            u'works',\r\n",
        "                                                                            u'wants',\r\n",
        "                                                                            u'lives'],\r\n",
        "                                                               u'b_name': u'topic_42',\r\n",
        "                                                               u'b_size': 73725},\r\n",
        "                                                              {u'a_words': [u'life',\r\n",
        "                                                                            u'quality',\r\n",
        "                                                                            u'products',\r\n",
        "                                                                            u'better',\r\n",
        "                                                                            u'able',\r\n",
        "                                                                            u'customers',\r\n",
        "                                                                            u'improve'],\r\n",
        "                                                               u'b_name': u'topic_63',\r\n",
        "                                                               u'b_size': 35069}]},\r\n",
        "                                               {u'a_words': [u'district',\r\n",
        "                                                             u'province',\r\n",
        "                                                             u'lives',\r\n",
        "                                                             u'requested',\r\n",
        "                                                             u'inputs',\r\n",
        "                                                             u'cambodia',\r\n",
        "                                                             u'old'],\r\n",
        "                                                u'b_name': u'topic_30',\r\n",
        "                                                u'b_size': 6478},\r\n",
        "                                               {u'a_words': [u'lives',\r\n",
        "                                                             u'city',\r\n",
        "                                                             u'works',\r\n",
        "                                                             u'house',\r\n",
        "                                                             u'located',\r\n",
        "                                                             u'old',\r\n",
        "                                                             u'area'],\r\n",
        "                                                u'b_name': u'topic_24',\r\n",
        "                                                u'b_size': 32244},\r\n",
        "                                               {u'a_words': [u'living',\r\n",
        "                                                             u'earn',\r\n",
        "                                                             u'income',\r\n",
        "                                                             u'hopes',\r\n",
        "                                                             u'rice',\r\n",
        "                                                             u'village',\r\n",
        "                                                             u'support'],\r\n",
        "                                                u'b_name': u'topic_4',\r\n",
        "                                                u'b_size': 19162},\r\n",
        "                                               {u'a_words': [u'shop',\r\n",
        "                                                             u'use',\r\n",
        "                                                             u'hopes',\r\n",
        "                                                             u'kes',\r\n",
        "                                                             u'future',\r\n",
        "                                                             u'operates',\r\n",
        "                                                             u'retail'],\r\n",
        "                                                u'b_name': u'topic_41',\r\n",
        "                                                u'b_size': 19875},\r\n",
        "                                               {u'a_words': [u'rice',\r\n",
        "                                                             u'sugar',\r\n",
        "                                                             u'oil',\r\n",
        "                                                             u'flour',\r\n",
        "                                                             u'cooking',\r\n",
        "                                                             u'bread',\r\n",
        "                                                             u'meat'],\r\n",
        "                                                u'b_name': u'topic_46',\r\n",
        "                                                u'b_size': 12137},\r\n",
        "                                               {u'a_words': [u'sewing',\r\n",
        "                                                             u'machine',\r\n",
        "                                                             u'training',\r\n",
        "                                                             u'hair',\r\n",
        "                                                             u'tools',\r\n",
        "                                                             u'beauty',\r\n",
        "                                                             u'salon'],\r\n",
        "                                                u'b_name': u'topic_34',\r\n",
        "                                                u'b_size': 8531},\r\n",
        "                                               {u'a_words': [u'businesses',\r\n",
        "                                                             u'community',\r\n",
        "                                                             u'families',\r\n",
        "                                                             u'microfinance',\r\n",
        "                                                             u'groups',\r\n",
        "                                                             u'intends',\r\n",
        "                                                             u'repay'],\r\n",
        "                                                u'b_name': u'topic_58',\r\n",
        "                                                u'b_size': 9961},\r\n",
        "                                               {u'a_words': [u'community',\r\n",
        "                                                             u'services',\r\n",
        "                                                             u'costs',\r\n",
        "                                                             u'providing',\r\n",
        "                                                             u'service',\r\n",
        "                                                             u'new',\r\n",
        "                                                             u'provides'],\r\n",
        "                                                u'b_name': u'topic_9',\r\n",
        "                                                u'b_size': 17156},\r\n",
        "                                               {u'a_words': [u'income',\r\n",
        "                                                             u'husband',\r\n",
        "                                                             u'family',\r\n",
        "                                                             u'expenses',\r\n",
        "                                                             u'help',\r\n",
        "                                                             u'needs',\r\n",
        "                                                             u'cover'],\r\n",
        "                                                u'b_name': u'topic_28',\r\n",
        "                                                u'b_size': 23664},\r\n",
        "                                               {u'a_words': [u'school',\r\n",
        "                                                             u'fees',\r\n",
        "                                                             u'challenge',\r\n",
        "                                                             u'kenya',\r\n",
        "                                                             u'uganda',\r\n",
        "                                                             u'major',\r\n",
        "                                                             u'profits'],\r\n",
        "                                                u'b_name': u'topic_52',\r\n",
        "                                                u'b_size': 20147},\r\n",
        "                                               {u'a_words': [u'build',\r\n",
        "                                                             u'house',\r\n",
        "                                                             u'expanding',\r\n",
        "                                                             u'building',\r\n",
        "                                                             u'sand',\r\n",
        "                                                             u'basis',\r\n",
        "                                                             u'construction'],\r\n",
        "                                                u'b_name': u'topic_44',\r\n",
        "                                                u'b_size': 8058},\r\n",
        "                                               {u'a_words': [u'daughter',\r\n",
        "                                                             u'school',\r\n",
        "                                                             u'lives',\r\n",
        "                                                             u'house',\r\n",
        "                                                             u'old',\r\n",
        "                                                             u'husband',\r\n",
        "                                                             u'applying'],\r\n",
        "                                                u'b_name': u'topic_49',\r\n",
        "                                                u'b_size': 19714},\r\n",
        "                                               {u'a_words': [u'meet',\r\n",
        "                                                             u'customers',\r\n",
        "                                                             u'needs',\r\n",
        "                                                             u'increase',\r\n",
        "                                                             u'demand',\r\n",
        "                                                             u'family',\r\n",
        "                                                             u'sales'],\r\n",
        "                                                u'b_name': u'topic_6',\r\n",
        "                                                u'b_size': 24043},\r\n",
        "                                               {u'a_words': [u'happy',\r\n",
        "                                                             u'well',\r\n",
        "                                                             u'says',\r\n",
        "                                                             u'likes',\r\n",
        "                                                             u'hand',\r\n",
        "                                                             u'good',\r\n",
        "                                                             u'time'],\r\n",
        "                                                u'b_name': u'topic_19',\r\n",
        "                                                u'b_size': 14593},\r\n",
        "                                               {u'a_words': [u'bank',\r\n",
        "                                                             u'members',\r\n",
        "                                                             u'member',\r\n",
        "                                                             u'time',\r\n",
        "                                                             u'cycle',\r\n",
        "                                                             u'health',\r\n",
        "                                                             u'community'],\r\n",
        "                                                u'b_name': u'topic_26',\r\n",
        "                                                u'b_size': 11185}]},\r\n",
        "                                {u'a_words': [u'rural',\r\n",
        "                                              u'field',\r\n",
        "                                              u'biggest',\r\n",
        "                                              u'kiva\\u2019s',\r\n",
        "                                              u'benefit',\r\n",
        "                                              u'sector',\r\n",
        "                                              u'cooperative'],\r\n",
        "                                 u'b_name': u'topic_37',\r\n",
        "                                 u'b_size': 6275},\r\n",
        "                                {u'a_words': [u'college',\r\n",
        "                                              u'hardworking',\r\n",
        "                                              u'studies',\r\n",
        "                                              u'word',\r\n",
        "                                              u'degree',\r\n",
        "                                              u'teacher',\r\n",
        "                                              u'continues'],\r\n",
        "                                 u'b_name': u'topic_2',\r\n",
        "                                 u'b_size': 5595},\r\n",
        "                                {u'a_words': [u'php',\r\n",
        "                                              u'philippines',\r\n",
        "                                              u'additional',\r\n",
        "                                              u'future',\r\n",
        "                                              u'earns',\r\n",
        "                                              u'secure',\r\n",
        "                                              u'income'],\r\n",
        "                                 u'b_name': u'topic_17',\r\n",
        "                                 u'b_size': 19862},\r\n",
        "                                {u'a_words': [u'pigs',\r\n",
        "                                              u'raising',\r\n",
        "                                              u'earns',\r\n",
        "                                              u'healthy',\r\n",
        "                                              u'old',\r\n",
        "                                              u'living',\r\n",
        "                                              u'raise'],\r\n",
        "                                 u'b_name': u'topic_14',\r\n",
        "                                 u'b_size': 9802},\r\n",
        "                                {u'a_words': [u'living',\r\n",
        "                                              u'improve',\r\n",
        "                                              u'family',\r\n",
        "                                              u'conditions',\r\n",
        "                                              u'better',\r\n",
        "                                              u'income',\r\n",
        "                                              u'situation'],\r\n",
        "                                 u'b_name': u'topic_32',\r\n",
        "                                 u'b_size': 15200},\r\n",
        "                                {u'a_words': [u'money',\r\n",
        "                                              u'save',\r\n",
        "                                              u'requested',\r\n",
        "                                              u'family',\r\n",
        "                                              u'enough',\r\n",
        "                                              u'works',\r\n",
        "                                              u'hard'],\r\n",
        "                                 u'b_name': u'topic_8',\r\n",
        "                                 u'b_size': 21213},\r\n",
        "                                {u'a_words': [u'previous',\r\n",
        "                                              u'back',\r\n",
        "                                              u'new',\r\n",
        "                                              u'total',\r\n",
        "                                              u'paid',\r\n",
        "                                              u'loans',\r\n",
        "                                              u'used'],\r\n",
        "                                 u'b_name': u'topic_33',\r\n",
        "                                 u'b_size': 7391},\r\n",
        "                                {u'a_words': [u'lenders',\r\n",
        "                                              u'entrepreneur',\r\n",
        "                                              u'village',\r\n",
        "                                              u'engaged',\r\n",
        "                                              u'northern',\r\n",
        "                                              u'english',\r\n",
        "                                              u'kind'],\r\n",
        "                                 u'b_name': u'topic_20',\r\n",
        "                                 u'b_size': 5316},\r\n",
        "                                {u'a_words': [u'son',\r\n",
        "                                              u'university',\r\n",
        "                                              u'pay',\r\n",
        "                                              u'studying',\r\n",
        "                                              u'noodles',\r\n",
        "                                              u'parts',\r\n",
        "                                              u'year'],\r\n",
        "                                 u'b_name': u'topic_31',\r\n",
        "                                 u'b_size': 6903},\r\n",
        "                                {u'a_words': [u'materials',\r\n",
        "                                              u'home',\r\n",
        "                                              u'making',\r\n",
        "                                              u'cement',\r\n",
        "                                              u'sustaining',\r\n",
        "                                              u'wood',\r\n",
        "                                              u'make'],\r\n",
        "                                 u'b_name': u'topic_29',\r\n",
        "                                 u'b_size': 6798},\r\n",
        "                                {u'a_words': [u'region',\r\n",
        "                                              u'applied',\r\n",
        "                                              u'borrowed',\r\n",
        "                                              u'pakistan',\r\n",
        "                                              u'sheep',\r\n",
        "                                              u'times',\r\n",
        "                                              u'develop'],\r\n",
        "                                 u'b_name': u'topic_0',\r\n",
        "                                 u'b_size': 5167},\r\n",
        "                                {u'a_words': [u'group',\r\n",
        "                                              u'women',\r\n",
        "                                              u'members',\r\n",
        "                                              u'one',\r\n",
        "                                              u'fund',\r\n",
        "                                              u'leader',\r\n",
        "                                              u'first'],\r\n",
        "                                 u'b_name': u'topic_36',\r\n",
        "                                 u'b_size': 18547},\r\n",
        "                                {u'a_words': [u'school',\r\n",
        "                                              u'daughters',\r\n",
        "                                              u'students',\r\n",
        "                                              u'restaurant',\r\n",
        "                                              u'study',\r\n",
        "                                              u'two',\r\n",
        "                                              u'sons'],\r\n",
        "                                 u'b_name': u'topic_39',\r\n",
        "                                 u'b_size': 6816},\r\n",
        "                                {u'a_words': [u'amount',\r\n",
        "                                              u'old',\r\n",
        "                                              u'past',\r\n",
        "                                              u'purchase',\r\n",
        "                                              u'hopes',\r\n",
        "                                              u'lives',\r\n",
        "                                              u'requesting'],\r\n",
        "                                 u'b_name': u'topic_62',\r\n",
        "                                 u'b_size': 14207},\r\n",
        "                                {u'a_words': [u'one',\r\n",
        "                                              u'child',\r\n",
        "                                              u'year',\r\n",
        "                                              u'old',\r\n",
        "                                              u'wheat',\r\n",
        "                                              u'fourth',\r\n",
        "                                              u'south'],\r\n",
        "                                 u'b_name': u'topic_56',\r\n",
        "                                 u'b_size': 6558}]},\r\n",
        "                 {u'a_words': [u'store',\r\n",
        "                               u'farming',\r\n",
        "                               u'water',\r\n",
        "                               u'clothing',\r\n",
        "                               u'products',\r\n",
        "                               u'member',\r\n",
        "                               u'general'],\r\n",
        "                  u'b_name': u'topic_45_57_15_5_47_27_18_59_43_11_23_12_7_50_54_40_16',\r\n",
        "                  u'b_size': 143246,\r\n",
        "                  u'children': [{u'a_words': [u'maize',\r\n",
        "                                              u'farmer',\r\n",
        "                                              u'income',\r\n",
        "                                              u'milk',\r\n",
        "                                              u'dairy',\r\n",
        "                                              u'cows',\r\n",
        "                                              u'traditional'],\r\n",
        "                                 u'b_name': u'topic_45_57',\r\n",
        "                                 u'b_size': 24628,\r\n",
        "                                 u'children': [{u'a_words': [u'maize',\r\n",
        "                                                             u'farmer',\r\n",
        "                                                             u'income',\r\n",
        "                                                             u'milk',\r\n",
        "                                                             u'dairy',\r\n",
        "                                                             u'cows',\r\n",
        "                                                             u'cattle'],\r\n",
        "                                                u'b_name': u'topic_45',\r\n",
        "                                                u'b_size': 15951},\r\n",
        "                                               {u'a_words': [u'traditional',\r\n",
        "                                                             u'ingredients',\r\n",
        "                                                             u'profit',\r\n",
        "                                                             u'bags',\r\n",
        "                                                             u'live',\r\n",
        "                                                             u'plans',\r\n",
        "                                                             u'soro'],\r\n",
        "                                                u'b_name': u'topic_57',\r\n",
        "                                                u'b_size': 8677}]},\r\n",
        "                                {u'a_words': [u'products',\r\n",
        "                                              u'goods',\r\n",
        "                                              u'etc',\r\n",
        "                                              u'canned',\r\n",
        "                                              u'sells',\r\n",
        "                                              u'sell',\r\n",
        "                                              u'like'],\r\n",
        "                                 u'b_name': u'topic_15',\r\n",
        "                                 u'b_size': 10764},\r\n",
        "                                {u'a_words': [u'poultry',\r\n",
        "                                              u'fellowship',\r\n",
        "                                              u'god',\r\n",
        "                                              u'chickens',\r\n",
        "                                              u'attain',\r\n",
        "                                              u'chicken',\r\n",
        "                                              u'eggs'],\r\n",
        "                                 u'b_name': u'topic_5',\r\n",
        "                                 u'b_size': 4570},\r\n",
        "                                {u'a_words': [u'livestock',\r\n",
        "                                              u'feed',\r\n",
        "                                              u'agriculture',\r\n",
        "                                              u'fattening',\r\n",
        "                                              u'primarily',\r\n",
        "                                              u'gain',\r\n",
        "                                              u'agricultural'],\r\n",
        "                                 u'b_name': u'topic_47',\r\n",
        "                                 u'b_size': 4485},\r\n",
        "                                {u'a_words': [u'farming',\r\n",
        "                                              u'harvest',\r\n",
        "                                              u'land',\r\n",
        "                                              u'farm',\r\n",
        "                                              u'fertilizer',\r\n",
        "                                              u'crops',\r\n",
        "                                              u'farmers'],\r\n",
        "                                 u'b_name': u'topic_27',\r\n",
        "                                 u'b_size': 20389},\r\n",
        "                                {u'a_words': [u'fish',\r\n",
        "                                              u'fishing',\r\n",
        "                                              u'pig',\r\n",
        "                                              u'blessed',\r\n",
        "                                              u'mary',\r\n",
        "                                              u'firewood',\r\n",
        "                                              u'sell'],\r\n",
        "                                 u'b_name': u'topic_18',\r\n",
        "                                 u'b_size': 4286},\r\n",
        "                                {u'a_words': [u'poor',\r\n",
        "                                              u'production',\r\n",
        "                                              u'poverty',\r\n",
        "                                              u'financial',\r\n",
        "                                              u'development',\r\n",
        "                                              u'francs',\r\n",
        "                                              u'country'],\r\n",
        "                                 u'b_name': u'topic_59',\r\n",
        "                                 u'b_size': 6911},\r\n",
        "                                {u'a_words': [u'store',\r\n",
        "                                              u'general',\r\n",
        "                                              u'items',\r\n",
        "                                              u'grocery',\r\n",
        "                                              u'groceries',\r\n",
        "                                              u'sell',\r\n",
        "                                              u'runs'],\r\n",
        "                                 u'b_name': u'topic_43',\r\n",
        "                                 u'b_size': 13387},\r\n",
        "                                {u'a_words': [u'clothing',\r\n",
        "                                              u'shoes',\r\n",
        "                                              u'tuition',\r\n",
        "                                              u'merchandise',\r\n",
        "                                              u'sales',\r\n",
        "                                              u'cosmetics',\r\n",
        "                                              u'selling'],\r\n",
        "                                 u'b_name': u'topic_11',\r\n",
        "                                 u'b_size': 6693},\r\n",
        "                                {u'a_words': [u'man',\r\n",
        "                                              u'photo',\r\n",
        "                                              u'wife',\r\n",
        "                                              u'father',\r\n",
        "                                              u'equipment',\r\n",
        "                                              u'furniture',\r\n",
        "                                              u'right'],\r\n",
        "                                 u'b_name': u'topic_23',\r\n",
        "                                 u'b_size': 4631},\r\n",
        "                                {u'a_words': [u'get',\r\n",
        "                                              u'ahead',\r\n",
        "                                              u'maria',\r\n",
        "                                              u'forward',\r\n",
        "                                              u'desire',\r\n",
        "                                              u'greatest',\r\n",
        "                                              u'borrower'],\r\n",
        "                                 u'b_name': u'topic_12',\r\n",
        "                                 u'b_size': 5809},\r\n",
        "                                {u'a_words': [u'member',\r\n",
        "                                              u'program',\r\n",
        "                                              u'joined',\r\n",
        "                                              u'pmpc',\r\n",
        "                                              u'help',\r\n",
        "                                              u'foundation',\r\n",
        "                                              u'recently'],\r\n",
        "                                 u'b_name': u'topic_7',\r\n",
        "                                 u'b_size': 9541},\r\n",
        "                                {u'a_words': [u'vegetables',\r\n",
        "                                              u'vending',\r\n",
        "                                              u'fruits',\r\n",
        "                                              u'fruit',\r\n",
        "                                              u'stall',\r\n",
        "                                              u'bananas',\r\n",
        "                                              u'vegetable'],\r\n",
        "                                 u'b_name': u'topic_50',\r\n",
        "                                 u'b_size': 6640},\r\n",
        "                                {u'a_words': [u'repair',\r\n",
        "                                              u'fertilizers',\r\n",
        "                                              u'transportation',\r\n",
        "                                              u'motorcycle',\r\n",
        "                                              u'transport',\r\n",
        "                                              u'maintenance',\r\n",
        "                                              u'driver'],\r\n",
        "                                 u'b_name': u'topic_54',\r\n",
        "                                 u'b_size': 6741},\r\n",
        "                                {u'a_words': [u'water',\r\n",
        "                                              u'monthly',\r\n",
        "                                              u'week',\r\n",
        "                                              u'aged',\r\n",
        "                                              u'days',\r\n",
        "                                              u'family',\r\n",
        "                                              u'piped'],\r\n",
        "                                 u'b_name': u'topic_40',\r\n",
        "                                 u'b_size': 8058},\r\n",
        "                                {u'a_words': [u'electricity',\r\n",
        "                                              u'funds',\r\n",
        "                                              u'soon',\r\n",
        "                                              u'ensure',\r\n",
        "                                              u'including',\r\n",
        "                                              u'onions',\r\n",
        "                                              u'mainly'],\r\n",
        "                                 u'b_name': u'topic_16',\r\n",
        "                                 u'b_size': 5713}]},\r\n",
        "                 {u'a_words': [u'nwtf',\r\n",
        "                               u'dream',\r\n",
        "                               u'sell',\r\n",
        "                               u'past',\r\n",
        "                               u'charcoal',\r\n",
        "                               u'selling',\r\n",
        "                               u'expand'],\r\n",
        "                  u'b_name': u'topic_13',\r\n",
        "                  u'b_size': 7160},\r\n",
        "                 {u'a_words': [u'access',\r\n",
        "                               u'loans',\r\n",
        "                               u'!',\r\n",
        "                               u'stocks',\r\n",
        "                               u'taken',\r\n",
        "                               u'institutions',\r\n",
        "                               u'said'],\r\n",
        "                  u'b_name': u'topic_55',\r\n",
        "                  u'b_size': 4890},\r\n",
        "                 {u'a_words': [u'like',\r\n",
        "                               u'would',\r\n",
        "                               u'partner',\r\n",
        "                               u'future',\r\n",
        "                               u'describes',\r\n",
        "                               u'aspires',\r\n",
        "                               u'domestic'],\r\n",
        "                  u'b_name': u'topic_21',\r\n",
        "                  u'b_size': 9813},\r\n",
        "                 {u'a_words': [u'borrowing',\r\n",
        "                               u'institution',\r\n",
        "                               u'communal',\r\n",
        "                               u'raised',\r\n",
        "                               u'often',\r\n",
        "                               u'economy',\r\n",
        "                               u'rest'],\r\n",
        "                  u'b_name': u'topic_25',\r\n",
        "                  u'b_size': 3045},\r\n",
        "                 {u'a_words': [u'day',\r\n",
        "                               u'average',\r\n",
        "                               u'per',\r\n",
        "                               u'credit',\r\n",
        "                               u'every',\r\n",
        "                               u'within',\r\n",
        "                               u'usd'],\r\n",
        "                  u'b_name': u'topic_48',\r\n",
        "                  u'b_size': 9393},\r\n",
        "                 {u'a_words': [u'repaid',\r\n",
        "                               u'successfully',\r\n",
        "                               u'involved',\r\n",
        "                               u'loans',\r\n",
        "                               u'individual',\r\n",
        "                               u'56',\r\n",
        "                               u'adult'],\r\n",
        "                  u'b_name': u'topic_38',\r\n",
        "                  u'b_size': 4680},\r\n",
        "                 {u'a_words': [u'food',\r\n",
        "                               u'beverages',\r\n",
        "                               u'foods',\r\n",
        "                               u'25,000',\r\n",
        "                               u'serves',\r\n",
        "                               u'prepare',\r\n",
        "                               u'selling'],\r\n",
        "                  u'b_name': u'topic_3',\r\n",
        "                  u'b_size': 4324},\r\n",
        "                 {u'a_words': [u'dreams',\r\n",
        "                               u'clothes',\r\n",
        "                               u'selling',\r\n",
        "                               u'goes',\r\n",
        "                               u'sell',\r\n",
        "                               u'woman',\r\n",
        "                               u'sells'],\r\n",
        "                  u'b_name': u'topic_22',\r\n",
        "                  u'b_size': 5775},\r\n",
        "                 {u'a_words': [u'coffee',\r\n",
        "                               u'drinks',\r\n",
        "                               u'soft',\r\n",
        "                               u'weather',\r\n",
        "                               u'peru',\r\n",
        "                               u'mr.',\r\n",
        "                               u'wife'],\r\n",
        "                  u'b_name': u'topic_53',\r\n",
        "                  u'b_size': 3098},\r\n",
        "                 {u'a_words': [u'higher',\r\n",
        "                               u'low',\r\n",
        "                               u'price',\r\n",
        "                               u'prices',\r\n",
        "                               u'competition',\r\n",
        "                               u'al',\r\n",
        "                               u'wholesale'],\r\n",
        "                  u'b_name': u'topic_10',\r\n",
        "                  u'b_size': 3787},\r\n",
        "                 {u'a_words': [u'grade',\r\n",
        "                               u'manage',\r\n",
        "                               u'born',\r\n",
        "                               u'due',\r\n",
        "                               u'supports',\r\n",
        "                               u'became',\r\n",
        "                               u'honest'],\r\n",
        "                  u'b_name': u'topic_51',\r\n",
        "                  u'b_size': 5044},\r\n",
        "                 {u'a_words': [u'local',\r\n",
        "                               u'small',\r\n",
        "                               u'successful',\r\n",
        "                               u'capital',\r\n",
        "                               u'working',\r\n",
        "                               u'assistance',\r\n",
        "                               u'stable'],\r\n",
        "                  u'b_name': u'topic_35',\r\n",
        "                  u'b_size': 10551},\r\n",
        "                 {u'a_words': [u'cash',\r\n",
        "                               u'hours',\r\n",
        "                               u'hoping',\r\n",
        "                               u'manages',\r\n",
        "                               u'expects',\r\n",
        "                               u'housewife',\r\n",
        "                               u'assist'],\r\n",
        "                  u'b_name': u'topic_60',\r\n",
        "                  u'b_size': 2963},\r\n",
        "                 {u'a_words': [u'brac',\r\n",
        "                               u'ages',\r\n",
        "                               u'generate',\r\n",
        "                               u'rosa',\r\n",
        "                               u'renovate',\r\n",
        "                               u'2011',\r\n",
        "                               u'currently'],\r\n",
        "                  u'b_name': u'topic_61',\r\n",
        "                  u'b_size': 3450}]}\r\n",
        " Dumping object hierarchy into JSON file data/topicmodelling/kivadata.json ... done\r\n"
       ]
      }
     ],
     "prompt_number": 36
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "References"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "* David Mimno and Andrew MacCallum: [Organizing the OCA: Learning Faceted Subjects from a Library of Digital Books](http://mimno.infosci.cornell.edu/papers/f129-mimno.pdf)\n",
      "* Jay Pujara and Peter Skomoroch: [Large-Scale Hierarchical Topic Models](http://linqs.cs.umd.edu/basilic/web/Publications/2012/pujara:nips12/pujara_biglearn12.pdf)\n",
      "* Alison Smith, Timothy Hawes, and Meredith Myers: [Hi&eacute;rarchie: Interactive Visualization for Hierarchical Topic Models](http://nlp.stanford.edu/events/illvi2014/papers/smith-illvi2014b.pdf)\n",
      "* Hierarchie [Readme page at GitHub](http://mlvl.github.io/Hierarchie/#/about)\n",
      "* [Gensim - topic modelling for humans](https://radimrehurek.com/gensim/index.html)\n",
      "* Nikolaos Aletras and Mark Stevenson: [Measuring the Similarity between Automatically Generated Topics](http://staffwww.dcs.shef.ac.uk/people/N.Aletras/resources/2014_eacl_topicSim_short.pdf)\n",
      "* [Stand-alone language identification system](https://github.com/saffsd/langid.py) (in Python)"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}