In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', -1)

In [2]:
import ktrain
ktrain.__version__

Using TensorFlow backend.


using Keras version: 2.2.4


'0.6.0'

## STEP 1: Get Raw Document Data

In [3]:
# 20newsgroups
from sklearn.datasets import fetch_20newsgroups

# we only want to keep the body of the documents!
remove = ('headers', 'footers', 'quotes')

# fetch train and test data
newsgroups_train = fetch_20newsgroups(subset='train', remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', remove=remove)

# compile the texts
texts = newsgroups_train.data +  newsgroups_test.data

# let's also store the newsgroup category associated with each document
# we can display this information in visualizations
targets = list(newsgroups_train.target) + list(newsgroups_test.target)
categories = [newsgroups_train.target_names[target] for target in targets]

## STEP 2: Train an LDA Topic Model to Discover Topics

The `get_topic_model` function learns a [topic model](https://en.wikipedia.org/wiki/Topic_model) using [Latent Dirichlet Allocation (LDA)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation).

In [4]:
%%time
tm = ktrain.text.get_topic_model(texts, n_features=10000)

n_topics automatically set to 97
preprocessing texts...
fitting model...
iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5
done.
CPU times: user 16min 18s, sys: 42min 45s, total: 59min 3s
Wall time: 1min 58s
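As a rough illustration of what fitting an LDA topic model involves, here is a minimal scikit-learn sketch on a toy corpus. It is not meant to reproduce what `get_topic_model` does internally; the toy documents, `n_components=2`, and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not the 20newsgroups data).
toy_docs = [
    "nasa launch orbit moon mission",
    "rocket launch orbit nasa space",
    "hockey team season players game",
    "baseball team game season league",
]

# Bag-of-words counts: one row per document, one column per vocabulary word.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_docs)

# Fit LDA with two topics; each row of doc_topic is a distribution over topics.
lda = LatentDirichletAllocation(n_components=2, max_iter=5, random_state=0)
doc_topic = lda.fit_transform(counts)

# Recover the vocabulary in column order, then list the top words per topic.
vocab = np.array(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))
for topic_id, weights in enumerate(lda.components_):
    top_words = vocab[np.argsort(weights)[::-1][:5]]
    print(f"topic {topic_id} | {' '.join(top_words)}")
```

Each topic is summarized by its highest-weighted words, which is the same kind of label that the topic listings in this notebook display.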


We can examine the discovered topics using `print_topics`, `get_topics`, or `topics`. Here, we will use `print_topics`:

In [5]:
tm.print_topics()

topic 0 | tape adam tim case moved bag quote mass marked zionism
topic 1 | image jpeg images format programs tiff files jfif save lossless
topic 2 | alternative movie film static cycles films philips dynamic hou phi
topic 3 | hell humans poster frank reality kent gerard gant eternal bell
topic 4 | air phd chz kit cbc ups w-s rus w47 mot
topic 5 | dog math great figure poster couldn don trying rushdie fatwa
topic 6 | collaboration nazi fact end expression germany philly world certified moore
topic 7 | gif points scale postscript mirror plane rendering algorithm polygon rayshade
topic 8 | fonts font shell converted iii characters slight composite breaks compress
topic 9 | power station supply options option led light tank plastic wall
topic 10 | transmission rider bmw driver automatic shift gear japanese stick highway
topic 11 | tyre ezekiel ruler hernia appeared appointed supreme man land power
topic 12 | space nasa earth data launch surface solar moon mission planet
topic 13 | israel j

From the above, we can immediately get a feel for the kinds of subjects discussed in this dataset. For instance, Topic \#13 appears to be about the Middle East, with the label "*israel jews jewish israeli arab peace*".

## STEP 3: Compute the Document-Topic Matrix


In [6]:
%%time
tm.build(texts, threshold=0.25)

done.
CPU times: user 1min 27s, sys: 3min 26s, total: 4min 53s
Wall time: 12.6 s


Since the `build` method prunes documents whose topic probabilities fall below `threshold`, we should prune the original data and any metadata in the same way for consistency. This can be accomplished with the `filter` method.

In [7]:
texts = tm.filter(texts)
categories = tm.filter(categories)

This ensures that all data and metadata stay aligned on the same array indices in case we want to use them later (e.g., in visualizations).
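The pruning-and-alignment idea can be sketched with a boolean mask in NumPy. This is only an illustration under the assumption that a document is kept when its highest topic probability reaches the threshold; the toy matrix, texts, and category names are hypothetical.

```python
import numpy as np

threshold = 0.25

# Hypothetical toy data: 4 documents, 5 topics, rows sum to 1.
toy_texts = ["doc a", "doc b", "doc c", "doc d"]
toy_categories = ["sci.space", "misc.forsale", "rec.autos", "talk.politics.misc"]
toy_doc_topic = np.array([
    [0.60, 0.10, 0.10, 0.10, 0.10],
    [0.20, 0.20, 0.20, 0.20, 0.20],
    [0.05, 0.05, 0.80, 0.05, 0.05],
    [0.24, 0.24, 0.22, 0.15, 0.15],
])

# Assumed rule: keep a document if its strongest topic reaches the threshold.
keep = toy_doc_topic.max(axis=1) >= threshold

# Apply the same mask to the matrix and to every parallel list.
kept_doc_topic = toy_doc_topic[keep]
kept_texts = [text for text, k in zip(toy_texts, keep) if k]
kept_categories = [cat for cat, k in zip(toy_categories, keep) if k]

print(kept_texts, kept_categories)
```

Because one mask drives every filter, row `i` of the pruned matrix still corresponds to element `i` of each pruned list.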

## STEP 4: Inspect and Visualize Topics

Let's list the topics by document count:

In [8]:
tm.print_topics(show_counts=True)

topic:79 | count:3782 | like know does use don just good thanks need want
topic:96 | count:3643 | just don think know like time did going didn people
topic:43 | count:1599 | god people does say believe bible true think evidence religion
topic:42 | count:1246 | people government right think rights law make public fbi don
topic:51 | count:900 | card memory windows board ram bus drivers driver cpu problem
topic:46 | count:782 | game team games year hockey season players player baseball league
topic:92 | count:597 | files file edu ftp available version server data use sun
topic:29 | count:399 | edu university information send new computer research mail internet address
topic:82 | count:371 | price new sale offer sell condition shipping interested asking prices
topic:84 | count:312 | armenian armenians people turkish war said killed children russian turkey
topic:12 | count:296 | space nasa earth data launch surface solar moon mission planet
topic:22 | count:283 | key encryption chip keys cl

The topic with the most documents appears to consist of conversational questions, replies, and comments that aren't focused on a particular subject. Other topics are focused on specific domains (e.g., topic \#27 with label "*jews israel jewish israeli arab muslims palestinian peace arabs land*").

Notice that some topics contain only a few documents (e.g., topic \#48 about sex, marriage, and relationships).  This is typically an indication that this topic is mentioned within documents that also mention other topics prominently (e.g., topics about government policy vs. individual rights).
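To see how a topic can be widely mentioned yet have few documents, consider the toy sketch below. It assumes each document is counted under its single highest-probability topic, which may differ from how the counts above are computed; the matrix values are hypothetical.

```python
import numpy as np

# Hypothetical document-topic matrix: 5 documents, 3 topics.
toy_doc_topic = np.array([
    [0.70, 0.20, 0.10],
    [0.55, 0.40, 0.05],
    [0.50, 0.45, 0.05],
    [0.10, 0.30, 0.60],
    [0.45, 0.35, 0.20],
])

# Assign every document to its dominant topic, then count per topic.
dominant = toy_doc_topic.argmax(axis=1)
counts = np.bincount(dominant, minlength=toy_doc_topic.shape[1])

# Average weight of each topic across all documents.
mean_weight = toy_doc_topic.mean(axis=0)

# List topics by document count, largest first.
for topic_id in np.argsort(-counts):
    print(f"topic:{topic_id} | count:{counts[topic_id]} | mean weight:{mean_weight[topic_id]:.2f}")
```

In this toy matrix, topic 1 is never the dominant topic of any document even though its average weight exceeds that of topic 2, because it always appears alongside a stronger topic.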

Let's visualize the corpus:

In [9]:
tm.visualize_documents(doc_topics=tm.get_doctopics())

reducing to 2 dimensions...[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 15644 samples in 0.048s...
[t-SNE] Computed neighbors for 15644 samples in 36.032s...
[t-SNE] Computed conditional probabilities for sample 1000 / 15644
[t-SNE] Computed conditional probabilities for sample 2000 / 15644
[t-SNE] Computed conditional probabilities for sample 3000 / 15644
[t-SNE] Computed conditional probabilities for sample 4000 / 15644
[t-SNE] Computed conditional probabilities for sample 5000 / 15644
[t-SNE] Computed conditional probabilities for sample 6000 / 15644
[t-SNE] Computed conditional probabilities for sample 7000 / 15644
[t-SNE] Computed conditional probabilities for sample 8000 / 15644
[t-SNE] Computed conditional probabilities for sample 9000 / 15644
[t-SNE] Computed conditional probabilities for sample 10000 / 15644
[t-SNE] Computed conditional probabilities for sample 11000 / 15644
[t-SNE] Computed conditional probabilities for sample 12000 / 15644
[t-SNE] Computed condi

Here is the top-ranked document for topic \#74, which is about Christianity:

In [10]:
print(tm.get_docs(topic_ids=[74], rank=True)[0]['text'])

For the Lord Himself will descend from Heaven with a shout, with the voice
of an archangel, and with the trumpet of God. And the dead in Christ will
rise first. Then we who are alive and remain will be caught up together
to meet the Lord in the air. And thus we shall always be with the Lord.
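Ranking documents within a topic can be sketched as follows. This assumes that ranking orders a topic's documents by descending probability for that topic, which is an interpretation rather than a confirmed description of `get_docs`; the toy texts and matrix are hypothetical.

```python
import numpy as np

# Hypothetical toy data: 4 documents, 2 topics.
toy_texts = ["first doc", "second doc", "third doc", "fourth doc"]
toy_doc_topic = np.array([
    [0.60, 0.40],
    [0.90, 0.10],
    [0.20, 0.80],
    [0.75, 0.25],
])

topic_id = 0

# Documents whose dominant topic is topic_id.
dominant = toy_doc_topic.argmax(axis=1)
members = np.flatnonzero(dominant == topic_id)

# Order those documents by their probability for topic_id, highest first.
ranked = members[np.argsort(-toy_doc_topic[members, topic_id])]

top_text = toy_texts[ranked[0]]
print(top_text)
```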


Let's visualize the "Christianity" topic (`topic_id=74`) and the "Medical" topic (`topic_id=15`):

In [11]:
doc_topics = tm.get_doctopics(topic_ids=[15, 74])
tm.visualize_documents(doc_topics=doc_topics)

reducing to 2 dimensions...[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 303 samples in 0.001s...
[t-SNE] Computed neighbors for 303 samples in 0.014s...
[t-SNE] Computed conditional probabilities for sample 303 / 303
[t-SNE] Mean sigma: 0.116946
[t-SNE] KL divergence after 250 iterations with early exaggeration: 57.464523
[t-SNE] KL divergence after 1000 iterations: 0.429532
done.
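The log above indicates that the document-topic vectors are reduced to two dimensions with t-SNE before plotting. A minimal sketch of that reduction using scikit-learn's `TSNE` is shown below; the random toy matrix, `perplexity=10`, and variable names are assumptions for illustration and are not taken from ktrain.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical document-topic matrix: 60 documents, 5 topics.
# Dirichlet samples give non-negative rows that sum to 1.
rng = np.random.default_rng(0)
toy_doc_topic = rng.dirichlet(alpha=np.full(5, 0.2), size=60)

# Reduce each 5-dimensional topic vector to a 2-D point.
# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, init="random", random_state=0)
coords = tsne.fit_transform(toy_doc_topic)

# The dominant topic of each document can be used to color the points.
dominant = toy_doc_topic.argmax(axis=1)
print(coords[:3], dominant[:3])
```

The resulting `coords` array can be passed to any scatter-plot routine, with `dominant` supplying one color per topic.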


## STEP 5: Predicting the Topics of New Documents

The `predict` method computes the topic probability distribution for an arbitrary document directly from raw text:

In [12]:
tm.predict(['Elon Musk leads Space Exploration Technologies (SpaceX), where he oversees '  +
            'the development and manufacturing of advanced rockets and spacecraft for missions ' +
            'to and beyond Earth orbit.'])

array([[0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.65009096, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.06185567, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00

As expected, the highest topic probability for this sentence belongs to topic \#12 (third row, third column of the printed array), which is about space exploration:

In [13]:
tm.topics[ np.argmax(tm.predict(['Elon Musk leads Space Exploration Technologies (SpaceX), where he oversees '  +
            'the development and manufacturing of advanced rockets and spacecraft for missions ' +
            'to and beyond Earth orbit.']))]

'space nasa earth data launch surface solar moon mission planet'
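The lookup from a probability vector to topic indices can be sketched with NumPy alone. The vector below is a hypothetical stand-in whose values loosely mirror the printed output; it also shows how a flat index maps to a row and column when the array is printed five values per row.

```python
import numpy as np

# Hypothetical probability vector over 97 topics (toy values, renormalized).
n_topics = 97
toy_probs = np.full(n_topics, 0.003)
toy_probs[12] = 0.65
toy_probs[53] = 0.062
toy_probs /= toy_probs.sum()

# Index of the most probable topic.
best = int(np.argmax(toy_probs))

# Indices of the two most probable topics, highest first.
top2 = np.argsort(toy_probs)[::-1][:2]

# Position of the best topic in a printout with 5 values per row
# (zero-based, so (2, 2) means third row, third column).
row, col = divmod(best, 5)

print(best, top2, (row, col))
```

Using `np.argsort` rather than `np.argmax` is useful when a document plausibly belongs to more than one topic.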

## Saving and Restoring the Topic Model

The topic model can be saved and restored as follows.

**Save the Topic Model:**

In [14]:
tm.save('/tmp/tm')

**Restore the Topic Model and Rebuild the Document-Topic Matrix:**

In [15]:
tm = ktrain.text.load_topic_model('/tmp/tm')

done.


In [16]:
tm.build(texts, threshold=0.25)

done.


In [17]:
tm.topics[ np.argmax(tm.predict(['Elon Musk leads Space Exploration Technologies (SpaceX), where he oversees '  +
            'the development and manufacturing of advanced rockets and spacecraft for missions ' +
            'to and beyond Earth orbit.']))]

'space nasa earth data launch surface solar moon mission planet'