In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated in pandas >= 1.0

In [2]:
import ktrain

## STEP 1:  Get Raw Document Data

In [3]:
# 20newsgroups
from sklearn.datasets import fetch_20newsgroups
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', remove=remove)
texts = newsgroups_train.data + newsgroups_test.data

## STEP 2:  Represent Documents as Semantically Meaningful Vectors With LDA

In [4]:
%%time
tm = ktrain.text.get_topic_model(texts, n_features=10000)

n_topics automatically set to 97
preprocessing texts...
fitting model...
iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5
done.
CPU times: user 16min 17s, sys: 41min 59s, total: 58min 17s
Wall time: 1min 56s
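
The log line `n_topics automatically set to 97` is consistent with a square-root heuristic on corpus size. The sketch below is an assumption about how such a default could be computed, not a confirmed copy of ktrain's code: `estimate_n_topics` and its `cap` parameter are hypothetical names. The standard 20 Newsgroups splits contain 11,314 training and 7,532 test documents.

```python
import math

def estimate_n_topics(n_docs, cap=400):
    """Hypothetical heuristic: floor(sqrt(n_docs / 2)), clamped to [1, cap]."""
    return min(cap, max(1, math.floor(math.sqrt(n_docs / 2))))

# 20 Newsgroups: 11,314 training + 7,532 test documents = 18,846 in total.
n_docs = 11314 + 7532
print(estimate_n_topics(n_docs))  # sqrt(9423) is about 97.07, so this prints 97
```

To override the automatic choice, pass `n_topics` explicitly to `get_topic_model`.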


In [5]:
%%time
tm.build(texts, threshold=0.25)

done.
CPU times: user 1min 26s, sys: 3min 15s, total: 4min 41s
Wall time: 12.2 s
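
The `threshold=0.25` argument is understood here to drop documents whose highest topic probability falls below 0.25, so that only documents with a reasonably dominant topic remain. The following is a minimal pure-Python sketch of that filtering rule on a toy document-topic matrix. The helper `filter_by_max_topic_prob` and the toy data are illustrative assumptions, not part of ktrain.

```python
def filter_by_max_topic_prob(doc_topics, threshold):
    """Return indices of rows whose largest topic probability is >= threshold."""
    return [i for i, row in enumerate(doc_topics) if max(row) >= threshold]

# Toy document-topic matrix: 4 documents x 5 topics, each row sums to 1.
doc_topics = [
    [0.60, 0.10, 0.10, 0.10, 0.10],  # clear dominant topic -> kept
    [0.20, 0.20, 0.20, 0.20, 0.20],  # no dominant topic    -> dropped
    [0.05, 0.05, 0.30, 0.30, 0.30],  # max is 0.30          -> kept
    [0.24, 0.19, 0.19, 0.19, 0.19],  # max is 0.24          -> dropped
]

kept = filter_by_max_topic_prob(doc_topics, threshold=0.25)
print(kept)  # [0, 2]
```

A higher threshold keeps fewer, more topically focused documents; a lower one keeps more of the corpus at the cost of noisier matches.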


## STEP 3:  Train a Document Recommender

In [6]:
tm.train_recommender()
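
Conceptually, a recommender over topic vectors can be treated as a nearest-neighbor search: the query text is mapped to a topic distribution, and the documents with the closest distributions are returned. The sketch below illustrates that idea with Euclidean distance on toy vectors. It assumes this is roughly how the recommender behaves and is not ktrain's implementation; `euclidean`, `recommend`, and the data are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recommend(query_vec, doc_vecs, n=2):
    """Return indices of the n documents closest to query_vec."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: euclidean(query_vec, doc_vecs[i]))
    return ranked[:n]

# Toy topic distributions for 4 documents over 3 topics.
doc_vecs = [
    [0.90, 0.05, 0.05],  # mostly topic 0
    [0.10, 0.80, 0.10],  # mostly topic 1
    [0.70, 0.20, 0.10],  # mostly topic 0
    [0.05, 0.05, 0.90],  # mostly topic 2
]
query = [0.80, 0.10, 0.10]  # a query dominated by topic 0

print(recommend(query, doc_vecs, n=2))  # [0, 2]
```

Because the comparison happens in topic space rather than on raw words, a query can match documents that share its themes even when they share few exact terms.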

## STEP 4: Generate Recommendations


Given some text, recommend documents that are semantically relevant to it.

In [8]:
rawtext = """
            Elon Musk leads Space Exploration Technologies (SpaceX), where he oversees
            the development and manufacturing of advanced rockets and spacecraft for missions
            to and beyond Earth orbit.
            """

In [9]:
for i, doc in enumerate(tm.recommend(text=rawtext, n=5)):
    # Collapse whitespace and show at most the first 500 tokens of each hit.
    snippet = " ".join(doc['text'].split()[:500])
    print(f'RESULT #{i + 1}')
    print(f'TEXT:\n\t{snippet}')
    print()

RESULT #1
TEXT:
	Archive-name: space/new_probes Last-modified: $Date: 93/04/01 14:39:17 $ UPCOMING PLANETARY PROBES - MISSIONS AND SCHEDULES Information on upcoming or currently active missions not mentioned below would be welcome. Sources: NASA fact sheets, Cassini Mission Design team, ISAS/NASDA launch schedules, press kits. ASUKA (ASTRO-D) - ISAS (Japan) X-ray astronomy satellite, launched into Earth orbit on 2/20/93. Equipped with large-area wide-wavelength (1-20 Angstrom) X-ray telescope, X-ray CCD cameras, and imaging gas scintillation proportional counters. CASSINI - Saturn orbiter and Titan atmosphere probe. Cassini is a joint NASA/ESA project designed to accomplish an exploration of the Saturnian system with its Cassini Saturn Orbiter and Huygens Titan Probe. Cassini is scheduled for launch aboard a Titan IV/Centaur in October of 1997. After gravity assists of Venus, Earth and Jupiter in a VVEJGA trajectory, the spacecraft will arrive at Saturn in June of 2004. Upon arrival, t

### Saving and Restoring the Topic Model

The topic model can be saved and restored as follows.

**Save the Topic Model:**

In [10]:
tm.save('/tmp/tm')

**Restore the Topic Model and Rebuild the Document-Topic Matrix:**

In [11]:
tm = ktrain.text.load_topic_model('/tmp/tm')

done.


In [12]:
tm.build(texts, threshold=0.25)

done.


Note that only the LDA topic model is saved; the scorer and recommender are not. Both must be retrained before use. The recommender is retrained as follows:

In [13]:
tm.train_recommender()

In [14]:
rawtext = """
            Elon Musk leads Space Exploration Technologies (SpaceX), where he oversees
            the development and manufacturing of advanced rockets and spacecraft for missions
            to and beyond Earth orbit.
            """

In [15]:
print(tm.recommend(text=rawtext, n=1)[0]['text'])

Archive-name: space/new_probes
Last-modified: $Date: 93/04/01 14:39:17 $

UPCOMING PLANETARY PROBES - MISSIONS AND SCHEDULES

    Information on upcoming or currently active missions not mentioned below
    would be welcome. Sources: NASA fact sheets, Cassini Mission Design
    team, ISAS/NASDA launch schedules, press kits.


    ASUKA (ASTRO-D) - ISAS (Japan) X-ray astronomy satellite, launched into
    Earth orbit on 2/20/93. Equipped with large-area wide-wavelength (1-20
    Angstrom) X-ray telescope, X-ray CCD cameras, and imaging gas
    scintillation proportional counters.


    CASSINI - Saturn orbiter and Titan atmosphere probe. Cassini is a joint
    NASA/ESA project designed to accomplish an exploration of the Saturnian
    system with its Cassini Saturn Orbiter and Huygens Titan Probe. Cassini
    is scheduled for launch aboard a Titan IV/Centaur in October of 1997.
    After gravity assists of Venus, Earth and Jupiter in a VVEJGA
    trajectory, the spacecraft will arrive a