{ "metadata": { "name": "", "signature": "sha256:b168a88dfb9b98cbb60a61faf7512ce272120ee1ca8d93f8d4551bf8e109f929" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Hierarchical topic modelling of Kiva loan descriptions" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Goals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Create a visual explorer for a static set of 775K English loan descriptions from Kiva.org\n", "* Use (hierarchical) topic modelling\n", "* Publish the explorer" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Approach" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Use gensim to derive flat topic models over (part of) the Kiva corpus, taking the [tutorial](https://radimrehurek.com/gensim/tutorial.html) as a guideline\n", "* Organize the resulting topic models into a hierarchy\n", "* Convert that hierarchy into a [JSON data file](https://github.com/mlvl/Hierarchie/tree/gh-pages/app/data) compliant with Hierarchie\n", "* Visualize everything with Hierarchie" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Data preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Step 1: download the [Kiva data dump (JSON format)](http://s3.kiva.org/snapshots/kiva_ds_json.zip) and extract it into `data/static`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Step 2: since the 'description' fields in the Kiva data dump often mix multiple languages (due to manual translations), the language codes are not reliable. 
Therefore, we:\n", "\n", "* split the descriptions into paragraphs\n", "* do language detection on the paragraphs (using the [langid](https://github.com/saffsd/langid.py) library)\n", "* store the recombined paragraphs and their language code in a new `processed_description` field\n", "\n", "The data are written to the 'loans' collection of a locally installed MongoDB database 'kiva'" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Next line commented out because we only want to run this once\n", "# nohup python src/load_kiva_loans_to_mongodb.py" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Step 3: load (a subset of) the English data from MongoDB, and convert it to the [Blei LDA-C format](http://www.cs.princeton.edu/~blei/lda-c/), using a [gensim utility function](https://radimrehurek.com/gensim/tut1.html#corpus-formats)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!python src/convert_mongodb_to_blei_ldac.py --dataDir data/topicmodelling\\\n", " --corpusBaseName kiva \\\n", " --stopwordFile=data/topicmodelling/kiva_stopwords.tsv \\\n", " --startYear 1990 \\\n", " --maxNrDocs 800000 \\\n", " --filterBelow 10 \\\n", " --filterAbove 0.5 \\\n", " --filterKeepN 1000" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Creating MongoDB cursor ... 
done\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Number of loans in 'en' since 1990: 774952\r\n", "First pass: streaming from MongoDB ...\r\n", "creating the dictionary ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 5000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 10000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 15000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 20000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 25000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 30000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 35000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 40000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 45000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 50000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 55000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 60000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 65000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 70000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 75000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 80000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 85000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 90000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 95000 documents ...\r\n" ] }, { "output_type": "stream", 
"stream": "stdout", "text": [ "read 100000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 105000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 110000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 115000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 120000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 125000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 130000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 135000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 140000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 145000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 150000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 155000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 160000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 165000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 170000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 175000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 180000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 185000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 190000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 195000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 200000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", 
"text": [ "read 205000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 210000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 215000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 220000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 225000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 230000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 235000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 240000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 245000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 250000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 255000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 260000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 265000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 270000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 275000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 280000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 285000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 290000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 295000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 300000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 305000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 310000 
documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 315000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 320000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 325000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 330000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 335000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 340000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 345000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 350000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 355000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 360000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 365000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 370000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 375000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 380000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 385000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 390000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 395000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 400000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 405000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 410000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 415000 documents ...\r\n" ] }, { 
"output_type": "stream", "stream": "stdout", "text": [ "read 420000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 425000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 430000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 435000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 440000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 445000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 450000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 455000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 460000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 465000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 470000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 475000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 480000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 485000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 490000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 495000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 500000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 505000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 510000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 515000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 520000 documents ...\r\n" ] }, { "output_type": "stream", 
"stream": "stdout", "text": [ "read 525000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 530000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 535000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 540000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 545000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 550000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 555000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 560000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 565000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 570000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 575000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 580000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 585000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 590000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 595000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 600000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 605000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 610000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 615000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 620000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 625000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", 
"text": [ "read 630000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 635000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 640000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 645000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 650000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 655000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 660000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 665000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 670000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 675000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 680000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 685000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 690000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 695000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 700000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 705000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 710000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 715000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 720000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 725000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 730000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 735000 
documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 740000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 745000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 750000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 755000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 760000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 765000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "read 770000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "filtering the dictionary ..." ] }, { "output_type": "stream", "stream": "stdout", "text": [ " done\r\n", "wrote data/topicmodelling/kiva_dict.bin ... and data/topicmodelling/kiva_dict.txt ... done\r\n", "Second pass: streaming from MongoDB ... saving into data/topicmodelling/kiva.lda-c (Blei corpus format) ..." ] }, { "output_type": "stream", "stream": "stdout", "text": [ " done\r\n", "Number of documents converted: 774952\r\n", "Vocabulary size: 1000\r\n" ] } ], "prompt_number": 6 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Topic modelling" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!python src/model_topics.py --dataDir data/topicmodelling \\\n", " --modelDir data/topicmodelling \\\n", " --corpusBaseName kiva \\\n", " --nrTopics 64 \\\n", " --nrWords 8" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Loading Blei corpus file data/topicmodelling/kiva.lda-c ..." ] }, { "output_type": "stream", "stream": "stdout", "text": [ " done\r\n", "\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Dictionary(1000 unique tokens: [u'neighbors', u'sector', u'managed', u'lack', u'eldest']...)\r\n", "Making topic model ..." 
] }, { "output_type": "stream", "stream": "stdout", "text": [ " done\r\n", "0.217*region + 0.179*applied + 0.129*borrowed + 0.123*pakistan + 0.082*sheep + 0.070*times + 0.045*develop + 0.036*rented\r\n", "0.039*started + 0.031*ago + 0.024*start + 0.024*support + 0.022*selling + 0.020*decided + 0.018*thanks + 0.018*increase\r\n", "0.179*college + 0.174*hardworking + 0.109*studies + 0.071*word + 0.065*degree + 0.046*teacher + 0.044*continues + 0.036*profession\r\n", "0.687*food + 0.091*beverages + 0.089*foods + 0.045*25,000 + 0.028*serves + 0.012*prepare + 0.011*selling + 0.007*cooking\r\n", "0.082*living + 0.073*earn + 0.055*income + 0.048*hopes + 0.048*rice + 0.038*village + 0.038*support + 0.028*family\r\n", "0.124*poultry + 0.114*fellowship + 0.108*god + 0.092*chickens + 0.072*attain + 0.069*chicken + 0.069*eggs + 0.068*build\r\n", "0.060*meet + 0.053*customers + 0.049*needs + 0.041*increase + 0.031*demand + 0.027*family + 0.024*sales + 0.021*good\r\n", "0.190*member + 0.147*program + 0.086*joined + 0.065*pmpc + 0.057*help + 0.056*foundation + 0.041*recently + 0.037*paglaum\r\n", "0.075*money + 0.074*save + 0.072*requested + 0.067*family + 0.066*enough + 0.052*works + 0.050*hard + 0.040*old\r\n", "0.173*community + 0.106*services + 0.081*costs + 0.066*providing + 0.046*service + 0.044*new + 0.044*provides + 0.042*access\r\n", "0.191*higher + 0.176*low + 0.175*price + 0.170*prices + 0.046*competition + 0.040*al + 0.037*wholesale + 0.034*commercial\r\n", "0.349*clothing + 0.096*shoes + 0.075*tuition + 0.062*merchandise + 0.061*sales + 0.058*cosmetics + 0.046*selling + 0.038*baby\r\n", "0.198*get + 0.151*ahead + 0.114*maria + 0.083*forward + 0.070*desire + 0.066*greatest + 0.053*borrower + 0.046*spouse\r\n", "0.321*nwtf + 0.067*dream + 0.054*sell + 0.042*past + 0.042*charcoal + 0.039*selling + 0.036*expand + 0.034*also\r\n", "0.179*pigs + 0.175*raising + 0.076*earns + 0.058*healthy + 0.057*old + 0.055*living + 0.046*raise + 0.041*activities\r\n", "0.177*products + 
0.109*goods + 0.079*etc + 0.070*canned + 0.067*sells + 0.044*sell + 0.044*like + 0.043*shampoo\r\n", "0.177*electricity + 0.090*funds + 0.089*soon + 0.079*ensure + 0.078*including + 0.055*onions + 0.052*mainly + 0.041*along\r\n", "0.155*php + 0.134*philippines + 0.106*additional + 0.056*future + 0.053*earns + 0.047*secure + 0.046*income + 0.039*hard\r\n", "0.286*fish + 0.189*fishing + 0.100*pig + 0.089*blessed + 0.073*mary + 0.055*firewood + 0.046*sell + 0.039*cassava\r\n", "0.063*happy + 0.059*well + 0.049*says + 0.037*likes + 0.031*hand + 0.028*good + 0.028*time + 0.027*gets\r\n", "0.320*lenders + 0.235*entrepreneur + 0.200*village + 0.127*engaged + 0.076*northern + 0.017*english + 0.013*kind + 0.003*eight\r\n", "0.366*like + 0.320*would + 0.117*partner + 0.054*future + 0.046*describes + 0.033*aspires + 0.019*domestic + 0.016*committed\r\n", "0.221*dreams + 0.212*clothes + 0.096*selling + 0.063*goes + 0.041*sell + 0.041*woman + 0.041*sells + 0.036*shirts\r\n", "0.344*man + 0.190*photo + 0.095*wife + 0.077*father + 0.071*equipment + 0.050*furniture + 0.042*right + 0.032*pesos\r\n", "0.063*lives + 0.049*city + 0.048*works + 0.045*house + 0.037*located + 0.035*old + 0.035*area + 0.034*home\r\n", "0.315*borrowing + 0.211*institution + 0.205*communal + 0.090*raised + 0.088*often + 0.059*economy + 0.032*rest + 0.000*recently\r\n", "0.151*bank + 0.109*members + 0.095*member + 0.070*time + 0.065*cycle + 0.056*health + 0.036*community + 0.030*payments\r\n", "0.126*farming + 0.073*harvest + 0.060*land + 0.060*farm + 0.040*fertilizer + 0.039*crops + 0.037*farmers + 0.030*seeds\r\n", "0.092*income + 0.086*husband + 0.065*family + 0.062*expenses + 0.042*help + 0.039*needs + 0.038*cover + 0.036*household\r\n", "0.185*materials + 0.110*home + 0.106*making + 0.083*cement + 0.078*sustaining + 0.055*wood + 0.039*make + 0.037*raw\r\n", "0.237*district + 0.143*province + 0.101*lives + 0.086*requested + 0.068*inputs + 0.055*cambodia + 0.044*old + 0.042*anticipated\r\n", "0.188*son + 
0.111*university + 0.085*pay + 0.070*studying + 0.058*noodles + 0.053*parts + 0.047*year + 0.042*education\r\n", "0.241*living + 0.182*improve + 0.122*family + 0.109*conditions + 0.027*better + 0.026*income + 0.025*situation + 0.024*new\r\n", "0.207*previous + 0.108*back + 0.106*new + 0.094*total + 0.082*paid + 0.077*loans + 0.070*used + 0.059*third\r\n", "0.073*sewing + 0.055*machine + 0.049*training + 0.041*hair + 0.033*tools + 0.033*beauty + 0.033*salon + 0.030*tailoring\r\n", "0.092*local + 0.082*small + 0.080*successful + 0.078*capital + 0.056*working + 0.052*assistance + 0.043*stable + 0.042*financially\r\n", "0.264*group + 0.102*women + 0.081*members + 0.050*one + 0.037*fund + 0.032*leader + 0.029*first + 0.019*use\r\n", "0.206*rural + 0.180*field + 0.105*biggest + 0.075*kiva\u2019s + 0.068*benefit + 0.059*sector + 0.055*cooperative + 0.049*areas\r\n", "0.263*repaid + 0.252*successfully + 0.218*involved + 0.072*loans + 0.043*individual + 0.041*56 + 0.038*adult + 0.026*completed\r\n", "0.092*school + 0.075*daughters + 0.068*students + 0.066*restaurant + 0.065*study + 0.063*two + 0.051*sons + 0.043*stories\r\n", "0.316*water + 0.119*monthly + 0.083*week + 0.054*aged + 0.049*days + 0.048*family + 0.042*piped + 0.035*every\r\n", "0.106*shop + 0.071*use + 0.062*hopes + 0.050*kes + 0.047*future + 0.043*operates + 0.041*retail + 0.039*stock\r\n", "0.040*work + 0.027*able + 0.026*family + 0.021*help + 0.019*works + 0.017*wants + 0.015*lives + 0.014*day\r\n", "0.292*store + 0.124*general + 0.081*items + 0.054*grocery + 0.051*groceries + 0.050*sell + 0.048*runs + 0.030*variety\r\n", "0.193*build + 0.184*house + 0.057*expanding + 0.056*building + 0.048*sand + 0.045*basis + 0.044*construction + 0.043*labor\r\n", "0.081*maize + 0.061*farmer + 0.059*income + 0.056*milk + 0.042*dairy + 0.040*cows + 0.037*cattle + 0.037*animals\r\n", "0.104*rice + 0.101*sugar + 0.070*oil + 0.062*flour + 0.050*cooking + 0.045*bread + 0.040*meat + 0.035*beans\r\n", "0.269*livestock + 
0.172*feed + 0.125*agriculture + 0.110*fattening + 0.070*primarily + 0.068*gain + 0.064*agricultural + 0.056*shown\r\n", "0.158*day + 0.113*average + 0.083*per + 0.073*credit + 0.072*every + 0.071*within + 0.069*usd + 0.046*month\r\n", "0.073*daughter + 0.069*school + 0.054*lives + 0.051*house + 0.049*old + 0.049*husband + 0.045*applying + 0.039*faces\r\n", "0.211*vegetables + 0.103*vending + 0.083*fruits + 0.074*fruit + 0.069*stall + 0.053*bananas + 0.049*vegetable + 0.049*standing\r\n", "0.081*grade + 0.074*manage + 0.073*born + 0.069*due + 0.069*supports + 0.059*became + 0.054*honest + 0.054*financial\r\n", "0.108*school + 0.057*fees + 0.044*challenge + 0.042*kenya + 0.027*uganda + 0.025*major + 0.024*profits + 0.023*pay\r\n", "0.308*coffee + 0.258*drinks + 0.195*soft + 0.093*weather + 0.088*peru + 0.037*mr. + 0.013*wife + 0.003*thus\r\n", "0.099*repair + 0.087*fertilizers + 0.084*transportation + 0.077*motorcycle + 0.073*transport + 0.058*maintenance + 0.057*driver + 0.054*resell\r\n", "0.283*access + 0.117*loans + 0.093*! 
+ 0.092*stocks + 0.069*taken + 0.067*institutions + 0.057*said + 0.055*cloth\r\n", "0.416*one + 0.245*child + 0.136*year + 0.063*old + 0.050*wheat + 0.042*fourth + 0.028*south + 0.007*youngest\r\n", "0.072*traditional + 0.070*ingredients + 0.066*profit + 0.056*bags + 0.046*live + 0.042*plans + 0.038*soro + 0.037*yiriwaso\r\n", "0.102*businesses + 0.084*community + 0.069*families + 0.052*microfinance + 0.049*groups + 0.044*intends + 0.041*repay + 0.039*share\r\n", "0.168*poor + 0.110*production + 0.102*poverty + 0.073*financial + 0.067*development + 0.066*francs + 0.045*country + 0.043*organization\r\n", "0.240*cash + 0.134*hours + 0.104*hoping + 0.099*manages + 0.068*expects + 0.066*housewife + 0.065*assist + 0.059*net\r\n", "0.251*brac + 0.205*ages + 0.159*generate + 0.104*rosa + 0.101*renovate + 0.062*2011 + 0.031*currently + 0.027*small\r\n", "0.126*amount + 0.099*old + 0.059*past + 0.056*purchase + 0.045*hopes + 0.045*lives + 0.043*requesting + 0.041*two\r\n", "0.055*life + 0.040*quality + 0.038*products + 0.038*better + 0.036*able + 0.027*customers + 0.026*improve + 0.025*good\r\n", "Writing model file data/topicmodelling/kiva.lda_model ... 
done\r\n", "Creating complete topic/word matrix in memory:\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 1/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 2/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 3/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 4/64 ...\r\n", "topic 5/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 6/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 7/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 8/64 ...\r\n", "topic 9/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 10/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 11/64 ...\r\n", "topic 12/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 13/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 14/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 15/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 16/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 17/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 18/64 ...\r\n", "topic 19/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 20/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 21/64 ...\r\n", "topic 22/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 23/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 24/64 ...\r\n", "topic 25/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 26/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 27/64 ...\r\n", "topic 28/64 ...\r\n" ] }, { 
"output_type": "stream", "stream": "stdout", "text": [ "topic 29/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 30/64 ...\r\n", "topic 31/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 32/64 ...\r\n", "topic 33/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 34/64 ...\r\n", "topic 35/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 36/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 37/64 ...\r\n", "topic 38/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 39/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 40/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 41/64 ...\r\n", "topic 42/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 43/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 44/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 45/64 ...\r\n", "topic 46/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 47/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 48/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 49/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 50/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 51/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 52/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 53/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 54/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 55/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 56/64 ...\r\n" ] }, { 
"output_type": "stream", "stream": "stdout", "text": [ "topic 57/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 58/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 59/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 60/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 61/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 62/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 63/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "topic 64/64 ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "done\r\n", "Writing topic/word matrix file data/topicmodelling/kiva_topic_words_matrix.h5 ..." ] }, { "output_type": "stream", "stream": "stdout", "text": [ "/usr/local/lib/python2.7/site-packages/pandas/io/pytables.py:2486: PerformanceWarning: \r\n", "your performance may suffer as PyTables will pickle object types that it cannot\r\n", "map directly to c-types [inferred_type->unicode,key->axis0] [items->None]\r\n", "\r\n", " warnings.warn(ws, PerformanceWarning)\r\n", "/usr/local/lib/python2.7/site-packages/pandas/io/pytables.py:2486: PerformanceWarning: \r\n", "your performance may suffer as PyTables will pickle object types that it cannot\r\n", "map directly to c-types [inferred_type->unicode,key->block0_items] [items->None]\r\n", "\r\n", " warnings.warn(ws, PerformanceWarning)\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "done\r\n" ] } ], "prompt_number": 8 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Infer topic distribution over the document set" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!python src/infer_document_topic_distributions.py --modelDir data/topicmodelling \\\n", " --modelBaseName kiva \\\n", " --maxNrDocs 250000" ], "language": "python", "metadata": {}, "outputs": [ { 
"output_type": "stream", "stream": "stdout", "text": [ "Reading topic/word matrix file data/topicmodelling/kiva_topic_words_matrix.h5 ..." ] }, { "output_type": "stream", "stream": "stdout", "text": [ " done\r\n", "Loading model file data/topicmodelling/kiva.lda_model ... done\r\n", "Loading Blei corpus file data/topicmodelling/kiva.lda-c ..." ] }, { "output_type": "stream", "stream": "stdout", "text": [ " done\r\n", "\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 5000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 10000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 15000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 20000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 25000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 30000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 35000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 40000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 45000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 50000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 55000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 60000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 65000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 70000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 75000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 80000 documents ...\r\n" ] }, 
{ "output_type": "stream", "stream": "stdout", "text": [ "Processed 85000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 90000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 95000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 100000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 105000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 110000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 115000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 120000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 125000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 130000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 135000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 140000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 145000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 150000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 155000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 160000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 165000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 170000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 175000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 180000 documents ...\r\n" ] }, { "output_type": 
"stream", "stream": "stdout", "text": [ "Processed 185000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 190000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 195000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 200000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 205000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 210000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 215000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 220000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 225000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 230000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 235000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 240000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 245000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 250000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 255000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 260000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 265000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 270000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 275000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 280000 documents ...\r\n" ] }, { "output_type": "stream", "stream": 
"stdout", "text": [ "Processed 285000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 290000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 295000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 300000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 305000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 310000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 315000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 320000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 325000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 330000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 335000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 340000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 345000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 350000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 355000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 360000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 365000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 370000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 375000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 380000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ 
"Processed 385000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 390000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 395000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 400000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 405000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 410000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 415000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 420000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 425000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 430000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 435000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 440000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 445000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 450000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 455000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 460000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 465000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 470000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 475000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 480000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 485000 
documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 490000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 495000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 500000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 505000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 510000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 515000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 520000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 525000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 530000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 535000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 540000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 545000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 550000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 555000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 560000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 565000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 570000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 575000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 580000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 585000 documents ...\r\n" 
] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 590000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 595000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 600000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 605000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 610000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 615000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 620000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 625000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 630000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 635000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 640000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 645000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 650000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 655000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 660000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 665000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 670000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 675000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 680000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 685000 documents ...\r\n" ] }, { 
"output_type": "stream", "stream": "stdout", "text": [ "Processed 690000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 695000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 700000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 705000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 710000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 715000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 720000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 725000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 730000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 735000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 740000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 745000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 750000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 755000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 760000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 765000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Processed 770000 documents ...\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "475 winning (0.06%); 0.67% weight in [region applied borrowed pakistan sheep times develop rented]\r\n", "73127 winning (9.44%); 5.46% weight in [started ago start support selling decided thanks increase]\r\n", "136 winning (0.02%); 0.72% weight in [college 
hardworking studies word degree teacher continues profession]\r\n", "63 winning (0.01%); 0.56% weight in [food beverages foods 25,000 serves prepare selling cooking]\r\n", "33504 winning (4.32%); 2.47% weight in [living earn income hopes rice village support family]\r\n", "273 winning (0.04%); 0.59% weight in [poultry fellowship god chickens attain chicken eggs build]\r\n", "31849 winning (4.11%); 3.10% weight in [meet customers needs increase demand family sales good]\r\n", "5630 winning (0.73%); 1.23% weight in [member program joined pmpc help foundation recently paglaum]\r\n", "34069 winning (4.40%); 2.74% weight in [money save requested family enough works hard old]\r\n", "24826 winning (3.20%); 2.21% weight in [community services costs providing service new provides access]\r\n", "26 winning (0.00%); 0.49% weight in [higher low price prices competition al wholesale commercial]\r\n", "1551 winning (0.20%); 0.86% weight in [clothing shoes tuition merchandise sales cosmetics selling baby]\r\n", "77 winning (0.01%); 0.75% weight in [get ahead maria forward desire greatest borrower spouse]\r\n", "1582 winning (0.20%); 0.92% weight in [nwtf dream sell past charcoal selling expand also]\r\n", "4406 winning (0.57%); 1.26% weight in [pigs raising earns healthy old living raise activities]\r\n", "4342 winning (0.56%); 1.39% weight in [products goods etc canned sells sell like shampoo]\r\n", "245 winning (0.03%); 0.74% weight in [electricity funds soon ensure including onions mainly along]\r\n", "17558 winning (2.27%); 2.56% weight in [php philippines additional future earns secure income hard]\r\n", "1370 winning (0.18%); 0.55% weight in [fish fishing pig blessed mary firewood sell cassava]\r\n", "4230 winning (0.55%); 1.88% weight in [happy well says likes hand good time gets]\r\n", "30 winning (0.00%); 0.69% weight in [lenders entrepreneur village engaged northern english kind eight]\r\n", "334 winning (0.04%); 1.27% weight in [like would partner future describes 
aspires domestic committed]\r\n", "1882 winning (0.24%); 0.75% weight in [dreams clothes selling goes sell woman sells shirts]\r\n", "308 winning (0.04%); 0.60% weight in [man photo wife father equipment furniture right pesos]\r\n", "46233 winning (5.97%); 4.16% weight in [lives city works house located old area home]\r\n", "0 winning (0.00%); 0.39% weight in [borrowing institution communal raised often economy rest recently]\r\n", "5308 winning (0.68%); 1.44% weight in [bank members member time cycle health community payments]\r\n", "21246 winning (2.74%); 2.63% weight in [farming harvest land farm fertilizer crops farmers seeds]\r\n", "26007 winning (3.36%); 3.05% weight in [income husband family expenses help needs cover household]\r\n", "1560 winning (0.20%); 0.88% weight in [materials home making cement sustaining wood make raw]\r\n", "563 winning (0.07%); 0.84% weight in [district province lives requested inputs cambodia old anticipated]\r\n", "1676 winning (0.22%); 0.89% weight in [son university pay studying noodles parts year education]\r\n", "6894 winning (0.89%); 1.96% weight in [living improve family conditions better income situation new]\r\n", "2276 winning (0.29%); 0.95% weight in [previous back new total paid loans used third]\r\n", "6750 winning (0.87%); 1.10% weight in [sewing machine training hair tools beauty salon tailoring]\r\n", "1696 winning (0.22%); 1.36% weight in [local small successful capital working assistance stable financially]\r\n", "21624 winning (2.79%); 2.39% weight in [group women members one fund leader first use]\r\n", "0 winning (0.00%); 0.81% weight in [rural field biggest kiva\u2019s benefit sector cooperative areas]\r\n", "14 winning (0.00%); 0.60% weight in [repaid successfully involved loans individual 56 adult completed]\r\n", "1235 winning (0.16%); 0.88% weight in [school daughters students restaurant study two sons stories]\r\n", "1715 winning (0.22%); 1.04% weight in [water monthly week aged days family piped 
every]\r\n", "38299 winning (4.94%); 2.56% weight in [shop use hopes kes future operates retail stock]\r\n", "149146 winning (19.25%); 9.51% weight in [work able family help works wants lives day]\r\n", "10037 winning (1.30%); 1.73% weight in [store general items grocery groceries sell runs variety]\r\n", "1413 winning (0.18%); 1.04% weight in [build house expanding building sand basis construction labor]\r\n", "23737 winning (3.06%); 2.06% weight in [maize farmer income milk dairy cows cattle animals]\r\n", "8970 winning (1.16%); 1.57% weight in [rice sugar oil flour cooking bread meat beans]\r\n", "66 winning (0.01%); 0.58% weight in [livestock feed agriculture fattening primarily gain agricultural shown]\r\n", "4903 winning (0.63%); 1.21% weight in [day average per credit every within usd month]\r\n", "24381 winning (3.15%); 2.54% weight in [daughter school lives house old husband applying faces]\r\n", "1886 winning (0.24%); 0.86% weight in [vegetables vending fruits fruit stall bananas vegetable standing]\r\n", "46 winning (0.01%); 0.65% weight in [grade manage born due supports became honest financial]\r\n", "41496 winning (5.35%); 2.60% weight in [school fees challenge kenya uganda major profits pay]\r\n", "125 winning (0.02%); 0.40% weight in [coffee drinks soft weather peru mr. wife thus]\r\n", "2053 winning (0.26%); 0.87% weight in [repair fertilizers transportation motorcycle transport maintenance driver resell]\r\n", "104 winning (0.01%); 0.63% weight in [access loans ! 
stocks taken institutions said cloth]\r\n", "422 winning (0.05%); 0.85% weight in [one child year old wheat fourth south youngest]\r\n", "5011 winning (0.65%); 1.12% weight in [traditional ingredients profit bags live plans soro yiriwaso]\r\n", "7876 winning (1.02%); 1.29% weight in [businesses community families microfinance groups intends repay share]\r\n", "442 winning (0.06%); 0.89% weight in [poor production poverty financial development francs country organization]\r\n", "14 winning (0.00%); 0.38% weight in [cash hours hoping manages expects housewife assist net]\r\n", "246 winning (0.03%); 0.45% weight in [brac ages generate rosa renovate 2011 currently small]\r\n", "17439 winning (2.25%); 1.83% weight in [amount old past purchase hopes lives requesting two]\r\n", "50150 winning (6.47%); 4.53% weight in [life quality products better able customers improve good]\r\n" ] } ], "prompt_number": 9 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Organize the \"flat\" topic list into a hierarchy for visualization purposes" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!python src/build_topic_hierarchy.py --modelDir data/topicmodelling \\\n", " --modelBaseName kiva \\\n", " --nrClusters 16 \\\n", " --nrWords 7" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Reading topic/word matrix file data/topicmodelling/kiva_topic_words_matrix.h5 ..." ] }, { "output_type": "stream", "stream": "stdout", "text": [ " done\r\n", "Hierarchically clustering topics ..." 
] }, { "output_type": "stream", "stream": "stdout", "text": [ "recursive hierarchy:\r\n", "[[[[1, 42, 63], 30, 24, 4, 41, 46, 34, 58, 9, 28, 52, 44, 49, 6, 19, 26],\r\n", " 37,\r\n", " 2,\r\n", " 17,\r\n", " 14,\r\n", " 32,\r\n", " 8,\r\n", " 33,\r\n", " 20,\r\n", " 31,\r\n", " 29,\r\n", " 0,\r\n", " 36,\r\n", " 39,\r\n", " 62,\r\n", " 56],\r\n", " [[45, 57], 15, 5, 47, 27, 18, 59, 43, 11, 23, 12, 7, 50, 54, 40, 16],\r\n", " 13,\r\n", " 55,\r\n", " 21,\r\n", " 25,\r\n", " 48,\r\n", " 38,\r\n", " 3,\r\n", " 22,\r\n", " 53,\r\n", " 10,\r\n", " 51,\r\n", " 35,\r\n", " 60,\r\n", " 61]\r\n", " done\r\n", "Loading model file data/topicmodelling/kiva.lda_model ... done\r\n", "Building nested hierarchy in memory ... done\r\n", "{u'topic_data': [{u'a_words': [u'family',\r\n", " u'group',\r\n", " u'income',\r\n", " u'lives',\r\n", " u'community',\r\n", " u'living',\r\n", " u'school'],\r\n", " u'b_name': u'topic_1_42_63_30_24_4_41_46_34_58_9_28_52_44_49_6_19_26_37_2_17_14_32_8_33_20_31_29_0_36_39_62_56',\r\n", " u'b_size': 553701,\r\n", " u'children': [{u'a_words': [u'community',\r\n", " u'family',\r\n", " u'house',\r\n", " u'lives',\r\n", " u'school',\r\n", " u'income',\r\n", " u'able'],\r\n", " u'b_name': u'topic_1_42_63_30_24_4_41_46_34_58_9_28_52_44_49_6_19_26',\r\n", " u'b_size': 398051,\r\n", " u'children': [{u'a_words': [u'able',\r\n", " u'work',\r\n", " u'life',\r\n", " u'family',\r\n", " u'started',\r\n", " u'help',\r\n", " u'quality'],\r\n", " u'b_name': u'topic_1_42_63',\r\n", " u'b_size': 151103,\r\n", " u'children': [{u'a_words': [u'started',\r\n", " u'ago',\r\n", " u'start',\r\n", " u'support',\r\n", " u'selling',\r\n", " u'decided',\r\n", " u'thanks'],\r\n", " u'b_name': u'topic_1',\r\n", " u'b_size': 42309},\r\n", " {u'a_words': [u'work',\r\n", " u'able',\r\n", " u'family',\r\n", " u'help',\r\n", " u'works',\r\n", " u'wants',\r\n", " u'lives'],\r\n", " u'b_name': u'topic_42',\r\n", " u'b_size': 73725},\r\n", " {u'a_words': [u'life',\r\n", " u'quality',\r\n", " 
u'products',\r\n", " u'better',\r\n", " u'able',\r\n", " u'customers',\r\n", " u'improve'],\r\n", " u'b_name': u'topic_63',\r\n", " u'b_size': 35069}]},\r\n", " {u'a_words': [u'district',\r\n", " u'province',\r\n", " u'lives',\r\n", " u'requested',\r\n", " u'inputs',\r\n", " u'cambodia',\r\n", " u'old'],\r\n", " u'b_name': u'topic_30',\r\n", " u'b_size': 6478},\r\n", " {u'a_words': [u'lives',\r\n", " u'city',\r\n", " u'works',\r\n", " u'house',\r\n", " u'located',\r\n", " u'old',\r\n", " u'area'],\r\n", " u'b_name': u'topic_24',\r\n", " u'b_size': 32244},\r\n", " {u'a_words': [u'living',\r\n", " u'earn',\r\n", " u'income',\r\n", " u'hopes',\r\n", " u'rice',\r\n", " u'village',\r\n", " u'support'],\r\n", " u'b_name': u'topic_4',\r\n", " u'b_size': 19162},\r\n", " {u'a_words': [u'shop',\r\n", " u'use',\r\n", " u'hopes',\r\n", " u'kes',\r\n", " u'future',\r\n", " u'operates',\r\n", " u'retail'],\r\n", " u'b_name': u'topic_41',\r\n", " u'b_size': 19875},\r\n", " {u'a_words': [u'rice',\r\n", " u'sugar',\r\n", " u'oil',\r\n", " u'flour',\r\n", " u'cooking',\r\n", " u'bread',\r\n", " u'meat'],\r\n", " u'b_name': u'topic_46',\r\n", " u'b_size': 12137},\r\n", " {u'a_words': [u'sewing',\r\n", " u'machine',\r\n", " u'training',\r\n", " u'hair',\r\n", " u'tools',\r\n", " u'beauty',\r\n", " u'salon'],\r\n", " u'b_name': u'topic_34',\r\n", " u'b_size': 8531},\r\n", " {u'a_words': [u'businesses',\r\n", " u'community',\r\n", " u'families',\r\n", " u'microfinance',\r\n", " u'groups',\r\n", " u'intends',\r\n", " u'repay'],\r\n", " u'b_name': u'topic_58',\r\n", " u'b_size': 9961},\r\n", " {u'a_words': [u'community',\r\n", " u'services',\r\n", " u'costs',\r\n", " u'providing',\r\n", " u'service',\r\n", " u'new',\r\n", " u'provides'],\r\n", " u'b_name': u'topic_9',\r\n", " u'b_size': 17156},\r\n", " {u'a_words': [u'income',\r\n", " u'husband',\r\n", " u'family',\r\n", " u'expenses',\r\n", " u'help',\r\n", " u'needs',\r\n", " u'cover'],\r\n", " u'b_name': u'topic_28',\r\n", " u'b_size': 
23664},\r\n", " {u'a_words': [u'school',\r\n", " u'fees',\r\n", " u'challenge',\r\n", " u'kenya',\r\n", " u'uganda',\r\n", " u'major',\r\n", " u'profits'],\r\n", " u'b_name': u'topic_52',\r\n", " u'b_size': 20147},\r\n", " {u'a_words': [u'build',\r\n", " u'house',\r\n", " u'expanding',\r\n", " u'building',\r\n", " u'sand',\r\n", " u'basis',\r\n", " u'construction'],\r\n", " u'b_name': u'topic_44',\r\n", " u'b_size': 8058},\r\n", " {u'a_words': [u'daughter',\r\n", " u'school',\r\n", " u'lives',\r\n", " u'house',\r\n", " u'old',\r\n", " u'husband',\r\n", " u'applying'],\r\n", " u'b_name': u'topic_49',\r\n", " u'b_size': 19714},\r\n", " {u'a_words': [u'meet',\r\n", " u'customers',\r\n", " u'needs',\r\n", " u'increase',\r\n", " u'demand',\r\n", " u'family',\r\n", " u'sales'],\r\n", " u'b_name': u'topic_6',\r\n", " u'b_size': 24043},\r\n", " {u'a_words': [u'happy',\r\n", " u'well',\r\n", " u'says',\r\n", " u'likes',\r\n", " u'hand',\r\n", " u'good',\r\n", " u'time'],\r\n", " u'b_name': u'topic_19',\r\n", " u'b_size': 14593},\r\n", " {u'a_words': [u'bank',\r\n", " u'members',\r\n", " u'member',\r\n", " u'time',\r\n", " u'cycle',\r\n", " u'health',\r\n", " u'community'],\r\n", " u'b_name': u'topic_26',\r\n", " u'b_size': 11185}]},\r\n", " {u'a_words': [u'rural',\r\n", " u'field',\r\n", " u'biggest',\r\n", " u'kiva\\u2019s',\r\n", " u'benefit',\r\n", " u'sector',\r\n", " u'cooperative'],\r\n", " u'b_name': u'topic_37',\r\n", " u'b_size': 6275},\r\n", " {u'a_words': [u'college',\r\n", " u'hardworking',\r\n", " u'studies',\r\n", " u'word',\r\n", " u'degree',\r\n", " u'teacher',\r\n", " u'continues'],\r\n", " u'b_name': u'topic_2',\r\n", " u'b_size': 5595},\r\n", " {u'a_words': [u'php',\r\n", " u'philippines',\r\n", " u'additional',\r\n", " u'future',\r\n", " u'earns',\r\n", " u'secure',\r\n", " u'income'],\r\n", " u'b_name': u'topic_17',\r\n", " u'b_size': 19862},\r\n", " {u'a_words': [u'pigs',\r\n", " u'raising',\r\n", " u'earns',\r\n", " u'healthy',\r\n", " u'old',\r\n", " 
u'living',\r\n", " u'raise'],\r\n", " u'b_name': u'topic_14',\r\n", " u'b_size': 9802},\r\n", " {u'a_words': [u'living',\r\n", " u'improve',\r\n", " u'family',\r\n", " u'conditions',\r\n", " u'better',\r\n", " u'income',\r\n", " u'situation'],\r\n", " u'b_name': u'topic_32',\r\n", " u'b_size': 15200},\r\n", " {u'a_words': [u'money',\r\n", " u'save',\r\n", " u'requested',\r\n", " u'family',\r\n", " u'enough',\r\n", " u'works',\r\n", " u'hard'],\r\n", " u'b_name': u'topic_8',\r\n", " u'b_size': 21213},\r\n", " {u'a_words': [u'previous',\r\n", " u'back',\r\n", " u'new',\r\n", " u'total',\r\n", " u'paid',\r\n", " u'loans',\r\n", " u'used'],\r\n", " u'b_name': u'topic_33',\r\n", " u'b_size': 7391},\r\n", " {u'a_words': [u'lenders',\r\n", " u'entrepreneur',\r\n", " u'village',\r\n", " u'engaged',\r\n", " u'northern',\r\n", " u'english',\r\n", " u'kind'],\r\n", " u'b_name': u'topic_20',\r\n", " u'b_size': 5316},\r\n", " {u'a_words': [u'son',\r\n", " u'university',\r\n", " u'pay',\r\n", " u'studying',\r\n", " u'noodles',\r\n", " u'parts',\r\n", " u'year'],\r\n", " u'b_name': u'topic_31',\r\n", " u'b_size': 6903},\r\n", " {u'a_words': [u'materials',\r\n", " u'home',\r\n", " u'making',\r\n", " u'cement',\r\n", " u'sustaining',\r\n", " u'wood',\r\n", " u'make'],\r\n", " u'b_name': u'topic_29',\r\n", " u'b_size': 6798},\r\n", " {u'a_words': [u'region',\r\n", " u'applied',\r\n", " u'borrowed',\r\n", " u'pakistan',\r\n", " u'sheep',\r\n", " u'times',\r\n", " u'develop'],\r\n", " u'b_name': u'topic_0',\r\n", " u'b_size': 5167},\r\n", " {u'a_words': [u'group',\r\n", " u'women',\r\n", " u'members',\r\n", " u'one',\r\n", " u'fund',\r\n", " u'leader',\r\n", " u'first'],\r\n", " u'b_name': u'topic_36',\r\n", " u'b_size': 18547},\r\n", " {u'a_words': [u'school',\r\n", " u'daughters',\r\n", " u'students',\r\n", " u'restaurant',\r\n", " u'study',\r\n", " u'two',\r\n", " u'sons'],\r\n", " u'b_name': u'topic_39',\r\n", " u'b_size': 6816},\r\n", " {u'a_words': [u'amount',\r\n", " 
u'old',\r\n", " u'past',\r\n", " u'purchase',\r\n", " u'hopes',\r\n", " u'lives',\r\n", " u'requesting'],\r\n", " u'b_name': u'topic_62',\r\n", " u'b_size': 14207},\r\n", " {u'a_words': [u'one',\r\n", " u'child',\r\n", " u'year',\r\n", " u'old',\r\n", " u'wheat',\r\n", " u'fourth',\r\n", " u'south'],\r\n", " u'b_name': u'topic_56',\r\n", " u'b_size': 6558}]},\r\n", " {u'a_words': [u'store',\r\n", " u'farming',\r\n", " u'water',\r\n", " u'clothing',\r\n", " u'products',\r\n", " u'member',\r\n", " u'general'],\r\n", " u'b_name': u'topic_45_57_15_5_47_27_18_59_43_11_23_12_7_50_54_40_16',\r\n", " u'b_size': 143246,\r\n", " u'children': [{u'a_words': [u'maize',\r\n", " u'farmer',\r\n", " u'income',\r\n", " u'milk',\r\n", " u'dairy',\r\n", " u'cows',\r\n", " u'traditional'],\r\n", " u'b_name': u'topic_45_57',\r\n", " u'b_size': 24628,\r\n", " u'children': [{u'a_words': [u'maize',\r\n", " u'farmer',\r\n", " u'income',\r\n", " u'milk',\r\n", " u'dairy',\r\n", " u'cows',\r\n", " u'cattle'],\r\n", " u'b_name': u'topic_45',\r\n", " u'b_size': 15951},\r\n", " {u'a_words': [u'traditional',\r\n", " u'ingredients',\r\n", " u'profit',\r\n", " u'bags',\r\n", " u'live',\r\n", " u'plans',\r\n", " u'soro'],\r\n", " u'b_name': u'topic_57',\r\n", " u'b_size': 8677}]},\r\n", " {u'a_words': [u'products',\r\n", " u'goods',\r\n", " u'etc',\r\n", " u'canned',\r\n", " u'sells',\r\n", " u'sell',\r\n", " u'like'],\r\n", " u'b_name': u'topic_15',\r\n", " u'b_size': 10764},\r\n", " {u'a_words': [u'poultry',\r\n", " u'fellowship',\r\n", " u'god',\r\n", " u'chickens',\r\n", " u'attain',\r\n", " u'chicken',\r\n", " u'eggs'],\r\n", " u'b_name': u'topic_5',\r\n", " u'b_size': 4570},\r\n", " {u'a_words': [u'livestock',\r\n", " u'feed',\r\n", " u'agriculture',\r\n", " u'fattening',\r\n", " u'primarily',\r\n", " u'gain',\r\n", " u'agricultural'],\r\n", " u'b_name': u'topic_47',\r\n", " u'b_size': 4485},\r\n", " {u'a_words': [u'farming',\r\n", " u'harvest',\r\n", " u'land',\r\n", " u'farm',\r\n", " 
u'fertilizer',\r\n", " u'crops',\r\n", " u'farmers'],\r\n", " u'b_name': u'topic_27',\r\n", " u'b_size': 20389},\r\n", " {u'a_words': [u'fish',\r\n", " u'fishing',\r\n", " u'pig',\r\n", " u'blessed',\r\n", " u'mary',\r\n", " u'firewood',\r\n", " u'sell'],\r\n", " u'b_name': u'topic_18',\r\n", " u'b_size': 4286},\r\n", " {u'a_words': [u'poor',\r\n", " u'production',\r\n", " u'poverty',\r\n", " u'financial',\r\n", " u'development',\r\n", " u'francs',\r\n", " u'country'],\r\n", " u'b_name': u'topic_59',\r\n", " u'b_size': 6911},\r\n", " {u'a_words': [u'store',\r\n", " u'general',\r\n", " u'items',\r\n", " u'grocery',\r\n", " u'groceries',\r\n", " u'sell',\r\n", " u'runs'],\r\n", " u'b_name': u'topic_43',\r\n", " u'b_size': 13387},\r\n", " {u'a_words': [u'clothing',\r\n", " u'shoes',\r\n", " u'tuition',\r\n", " u'merchandise',\r\n", " u'sales',\r\n", " u'cosmetics',\r\n", " u'selling'],\r\n", " u'b_name': u'topic_11',\r\n", " u'b_size': 6693},\r\n", " {u'a_words': [u'man',\r\n", " u'photo',\r\n", " u'wife',\r\n", " u'father',\r\n", " u'equipment',\r\n", " u'furniture',\r\n", " u'right'],\r\n", " u'b_name': u'topic_23',\r\n", " u'b_size': 4631},\r\n", " {u'a_words': [u'get',\r\n", " u'ahead',\r\n", " u'maria',\r\n", " u'forward',\r\n", " u'desire',\r\n", " u'greatest',\r\n", " u'borrower'],\r\n", " u'b_name': u'topic_12',\r\n", " u'b_size': 5809},\r\n", " {u'a_words': [u'member',\r\n", " u'program',\r\n", " u'joined',\r\n", " u'pmpc',\r\n", " u'help',\r\n", " u'foundation',\r\n", " u'recently'],\r\n", " u'b_name': u'topic_7',\r\n", " u'b_size': 9541},\r\n", " {u'a_words': [u'vegetables',\r\n", " u'vending',\r\n", " u'fruits',\r\n", " u'fruit',\r\n", " u'stall',\r\n", " u'bananas',\r\n", " u'vegetable'],\r\n", " u'b_name': u'topic_50',\r\n", " u'b_size': 6640},\r\n", " {u'a_words': [u'repair',\r\n", " u'fertilizers',\r\n", " u'transportation',\r\n", " u'motorcycle',\r\n", " u'transport',\r\n", " u'maintenance',\r\n", " u'driver'],\r\n", " u'b_name': u'topic_54',\r\n", " 
u'b_size': 6741},\r\n", " {u'a_words': [u'water',\r\n", " u'monthly',\r\n", " u'week',\r\n", " u'aged',\r\n", " u'days',\r\n", " u'family',\r\n", " u'piped'],\r\n", " u'b_name': u'topic_40',\r\n", " u'b_size': 8058},\r\n", " {u'a_words': [u'electricity',\r\n", " u'funds',\r\n", " u'soon',\r\n", " u'ensure',\r\n", " u'including',\r\n", " u'onions',\r\n", " u'mainly'],\r\n", " u'b_name': u'topic_16',\r\n", " u'b_size': 5713}]},\r\n", " {u'a_words': [u'nwtf',\r\n", " u'dream',\r\n", " u'sell',\r\n", " u'past',\r\n", " u'charcoal',\r\n", " u'selling',\r\n", " u'expand'],\r\n", " u'b_name': u'topic_13',\r\n", " u'b_size': 7160},\r\n", " {u'a_words': [u'access',\r\n", " u'loans',\r\n", " u'!',\r\n", " u'stocks',\r\n", " u'taken',\r\n", " u'institutions',\r\n", " u'said'],\r\n", " u'b_name': u'topic_55',\r\n", " u'b_size': 4890},\r\n", " {u'a_words': [u'like',\r\n", " u'would',\r\n", " u'partner',\r\n", " u'future',\r\n", " u'describes',\r\n", " u'aspires',\r\n", " u'domestic'],\r\n", " u'b_name': u'topic_21',\r\n", " u'b_size': 9813},\r\n", " {u'a_words': [u'borrowing',\r\n", " u'institution',\r\n", " u'communal',\r\n", " u'raised',\r\n", " u'often',\r\n", " u'economy',\r\n", " u'rest'],\r\n", " u'b_name': u'topic_25',\r\n", " u'b_size': 3045},\r\n", " {u'a_words': [u'day',\r\n", " u'average',\r\n", " u'per',\r\n", " u'credit',\r\n", " u'every',\r\n", " u'within',\r\n", " u'usd'],\r\n", " u'b_name': u'topic_48',\r\n", " u'b_size': 9393},\r\n", " {u'a_words': [u'repaid',\r\n", " u'successfully',\r\n", " u'involved',\r\n", " u'loans',\r\n", " u'individual',\r\n", " u'56',\r\n", " u'adult'],\r\n", " u'b_name': u'topic_38',\r\n", " u'b_size': 4680},\r\n", " {u'a_words': [u'food',\r\n", " u'beverages',\r\n", " u'foods',\r\n", " u'25,000',\r\n", " u'serves',\r\n", " u'prepare',\r\n", " u'selling'],\r\n", " u'b_name': u'topic_3',\r\n", " u'b_size': 4324},\r\n", " {u'a_words': [u'dreams',\r\n", " u'clothes',\r\n", " u'selling',\r\n", " u'goes',\r\n", " u'sell',\r\n", " 
u'woman',\r\n", " u'sells'],\r\n", " u'b_name': u'topic_22',\r\n", " u'b_size': 5775},\r\n", " {u'a_words': [u'coffee',\r\n", " u'drinks',\r\n", " u'soft',\r\n", " u'weather',\r\n", " u'peru',\r\n", " u'mr.',\r\n", " u'wife'],\r\n", " u'b_name': u'topic_53',\r\n", " u'b_size': 3098},\r\n", " {u'a_words': [u'higher',\r\n", " u'low',\r\n", " u'price',\r\n", " u'prices',\r\n", " u'competition',\r\n", " u'al',\r\n", " u'wholesale'],\r\n", " u'b_name': u'topic_10',\r\n", " u'b_size': 3787},\r\n", " {u'a_words': [u'grade',\r\n", " u'manage',\r\n", " u'born',\r\n", " u'due',\r\n", " u'supports',\r\n", " u'became',\r\n", " u'honest'],\r\n", " u'b_name': u'topic_51',\r\n", " u'b_size': 5044},\r\n", " {u'a_words': [u'local',\r\n", " u'small',\r\n", " u'successful',\r\n", " u'capital',\r\n", " u'working',\r\n", " u'assistance',\r\n", " u'stable'],\r\n", " u'b_name': u'topic_35',\r\n", " u'b_size': 10551},\r\n", " {u'a_words': [u'cash',\r\n", " u'hours',\r\n", " u'hoping',\r\n", " u'manages',\r\n", " u'expects',\r\n", " u'housewife',\r\n", " u'assist'],\r\n", " u'b_name': u'topic_60',\r\n", " u'b_size': 2963},\r\n", " {u'a_words': [u'brac',\r\n", " u'ages',\r\n", " u'generate',\r\n", " u'rosa',\r\n", " u'renovate',\r\n", " u'2011',\r\n", " u'currently'],\r\n", " u'b_name': u'topic_61',\r\n", " u'b_size': 3450}]}\r\n", " Dumping object hierarchy into JSON file data/topicmodelling/kivadata.json ... 
done\r\n" ] } ], "prompt_number": 36 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* David Mimno and Andrew McCallum: [Organizing the OCA: Learning Faceted Subjects from a Library of Digital Books](http://mimno.infosci.cornell.edu/papers/f129-mimno.pdf)\n", "* Jay Pujara and Peter Skomoroch: [Large-Scale Hierarchical Topic Models](http://linqs.cs.umd.edu/basilic/web/Publications/2012/pujara:nips12/pujara_biglearn12.pdf)\n", "* Alison Smith, Timothy Hawes, and Meredith Myers: [Hiérarchie: Interactive Visualization for Hierarchical Topic Models](http://nlp.stanford.edu/events/illvi2014/papers/smith-illvi2014b.pdf)\n", "* Hierarchie [Readme page at GitHub](http://mlvl.github.io/Hierarchie/#/about)\n", "* [Gensim - topic modelling for humans](https://radimrehurek.com/gensim/index.html)\n", "* Nikolaos Aletras and Mark Stevenson: [Measuring the Similarity between Automatically Generated Topics](http://staffwww.dcs.shef.ac.uk/people/N.Aletras/resources/2014_eacl_topicSim_short.pdf)\n", "* [Stand-alone language identification system](https://github.com/saffsd/langid.py) (in Python)" ] } ], "metadata": {} } ] }