## NMF topic modeling on 20 newsgroups

This notebook is an expanded version of [this example](http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py) from the scikit-learn documentation.

In [1]:
from __future__ import print_function

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF
from sklearn.datasets import fetch_20newsgroups

n_samples = 8000
n_features = 1000
n_components = 10
n_top_words = 20


def kl_loss(x, y, eps=1e-10):
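    # Cross term of the KL divergence, averaged over documents (rows).
    # x is a sparse matrix, y a dense reconstruction; eps guards against log(0).
    return -(x.toarray() * np.log(y + eps)).sum() / x.shape[0]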


def frobenius_loss(x, y):
    # Squared Frobenius norm of the residual, averaged over documents (rows).
    return np.square(x - y).sum() / x.shape[0]


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d: " % topic_idx)
        topic_words = " ".join([feature_names[i]
                                for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(topic_words)
    print()
 
 
def score_model(model, data):
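    if model.beta_loss == 'kullback-leibler':
        loss_function = kl_loss
    elif model.beta_loss == 'frobenius':
        loss_function = frobenius_loss
    else:
        # Without this branch, loss_function would be undefined below.
        raise ValueError("Unsupported beta_loss: %r" % model.beta_loss)

    # Project the data onto the topics, then map back to feature space.
    reduced_data = model.transform(data)
    reconstructed_data = model.inverse_transform(reduced_data)

    return loss_function(data, reconstructed_data)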
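
The slice `topic.argsort()[:-n_top_words - 1:-1]` in `print_top_words` is compact but easy to misread. A minimal sketch on made-up values (the `weights` and `words` below are illustrative and not taken from the dataset) shows that it returns the indices of the largest entries, in descending order of weight:

```python
import numpy as np

# Hypothetical topic weights and vocabulary, for illustration only.
weights = np.array([0.1, 0.7, 0.3, 0.9])
words = ["alpha", "beta", "gamma", "delta"]
n_top = 2

# argsort() sorts ascending; the reversed slice walks back from the end
# and stops after n_top elements.
top_indices = weights.argsort()[:-n_top - 1:-1]
top_words = [words[i] for i in top_indices]
print(top_words)
```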

In [2]:
%%time
print("Loading dataset...")
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))

data_train = dataset.data[:n_samples]
data_test = dataset.data[n_samples:]

Loading dataset...
CPU times: user 1.86 s, sys: 69.9 ms, total: 1.93 s
Wall time: 1.97 s


### Use tf-idf features for NMF.


In [3]:
%%time
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')

tfidf_train = tfidf_vectorizer.fit_transform(data_train)
tfidf_test = tfidf_vectorizer.transform(data_test)

Extracting tf-idf features for NMF...
CPU times: user 2.42 s, sys: 8.18 ms, total: 2.43 s
Wall time: 2.43 s
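
To make the effect of `max_df` and `min_df` concrete, here is a small sketch on a three-document toy corpus. The corpus and the variable names (`toy_docs`, `toy_vectorizer`) are assumptions made for illustration; only the parameter values mirror the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, for illustration only: "pet" occurs in every document,
# "cat" and "ate" in two documents each, every other word in just one.
toy_docs = [
    "pet cat sat mat",
    "pet cat ate fish",
    "pet dog ate bone",
]

# max_df=0.95 drops terms found in more than 95% of documents ("pet");
# min_df=2 drops terms found in fewer than two documents.
toy_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2)
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)

kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)
print(toy_tfidf.shape)
```

Only terms that survive both filters become columns of the tf-idf matrix; `max_features` then caps the number of columns by keeping the most frequent of the remaining terms.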


### NMF model with Frobenius loss

In [4]:
%%time
print("Fitting the NMF model (Frobenius norm) with tf-idf features, "
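      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
# Note: `alpha` was removed in scikit-learn 1.2; newer versions use
# `alpha_W` and `alpha_H` instead.
frobenius_nmf = NMF(n_components=n_components, random_state=1,
                    alpha=.1, l1_ratio=.5).fit(tfidf_train)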

Fitting the NMF model (Frobenius norm) with tf-idf features, n_samples=8000 and n_features=1000...
CPU times: user 1.39 s, sys: 40.4 ms, total: 1.43 s
Wall time: 828 ms


In [5]:
print('train reconstruction error:', score_model(frobenius_nmf, tfidf_train))
print('test reconstruction error:', score_model(frobenius_nmf, tfidf_test))

train reconstruction error: 0.890941957403
test reconstruction error: 0.892431321223
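
As a cross-check on `score_model`, a fitted `NMF` object also exposes `reconstruction_err_`, which for the Frobenius loss is the unsquared, unaveraged Frobenius norm of the training residual. The sketch below uses a small random matrix rather than the newsgroups data, so `X_toy` and the model settings are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.decomposition import NMF

# Small non-negative random matrix standing in for a tf-idf matrix.
rng = np.random.RandomState(0)
X_toy = rng.rand(20, 8)

toy_nmf = NMF(n_components=3, random_state=0, max_iter=500)
W = toy_nmf.fit_transform(X_toy)   # document-topic weights
H = toy_nmf.components_            # topic-term weights

# Frobenius norm of the residual X - WH, computed by hand.
manual_err = np.linalg.norm(X_toy - np.dot(W, H))
print(manual_err, toy_nmf.reconstruction_err_)
```

Squaring this value and dividing by the number of rows gives a quantity comparable to `frobenius_loss` on the training set, up to small differences between the `W` found during fitting and the one returned by `transform`.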


#### Topics

In [6]:
# Note: on scikit-learn >= 1.2, use get_feature_names_out() instead.
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(frobenius_nmf, tfidf_feature_names, n_top_words)

Topic #0: 
just don people think like know good time right ve make say did way really want going said ll thing
Topic #1: 
card video monitor drivers cards vga bus driver color ram graphics mode bit board memory pc 16 speed performance controller
Topic #2: 
god jesus bible christ faith believe christians christian church sin lord does life man hell truth belief say love father
Topic #3: 
key chip clipper encryption keys government escrow use algorithm public nsa security phone secure law chips des data bit enforcement
Topic #4: 
new 00 car sale 10 price shipping offer 50 20 15 condition 12 interested 11 used 30 25 sell old
Topic #5: 
thanks does know mail advance hi info looking help anybody address appreciated email information post interested reply send like need
Topic #6: 
windows file use dos files program using window problem running run version pc server application screen software ms ftp help
Topic #7: 
edu soon cs university com internet ftp article pub send email mit david mail
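
Beyond listing top words, the same factorization can assign each document to its strongest topic. The following is a minimal sketch on a toy corpus; `toy_docs`, `toy_model`, and the choice of two components are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, for illustration only.
toy_docs = [
    "graphics card driver video memory",
    "video card monitor driver",
    "god jesus bible faith church",
    "church faith bible belief",
]

toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)

toy_model = NMF(n_components=2, random_state=1, max_iter=500)
# Rows are documents, columns are topics; all entries are non-negative.
doc_topic = toy_model.fit_transform(toy_tfidf)

# Index of the topic with the largest weight for each document.
dominant_topic = doc_topic.argmax(axis=1)
print(dominant_topic)
```

On the real data, the equivalent call would be `frobenius_nmf.transform(tfidf_train).argmax(axis=1)`.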

### NMF model with KL-divergence loss

In [7]:
%%time
print("Fitting the NMF model (generalized Kullback-Leibler divergence) with "
      "tf-idf features, n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
kl_nmf = NMF(n_components=n_components, random_state=1,
             beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=.1,
             l1_ratio=0.9).fit(tfidf_train)

Fitting the NMF model (generalized Kullback-Leibler divergence) with tf-idf features, n_samples=8000 and n_features=1000...
CPU times: user 12.1 s, sys: 380 ms, total: 12.5 s
Wall time: 6.25 s


In [8]:
print('train reconstruction error:', score_model(kl_nmf, tfidf_train))
print('test reconstruction error:', score_model(kl_nmf, tfidf_test))

train reconstruction error: 18.355714861
test reconstruction error: 18.2931233004
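
Note that `kl_loss` keeps only the cross term `-sum(x * log(y))`, so the values above are not the generalized KL divergence itself and are not directly comparable to the Frobenius scores. The generalized KL divergence is `sum(x * log(x / y) - x + y)`. A minimal sketch of that full quantity for dense arrays, averaged over rows to match the convention used in this notebook, might look like this (the function name `generalized_kl` is hypothetical):

```python
import numpy as np


def generalized_kl(x, y, eps=1e-10):
    """Generalized KL divergence sum(x*log(x/y) - x + y), averaged over rows.

    Expects dense arrays; pass x.toarray() for a sparse matrix.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Entries with x == 0 contribute only y, since x*log(x/y) -> 0 as x -> 0.
    mask = x > 0
    log_term = np.zeros_like(x)
    log_term[mask] = x[mask] * np.log(x[mask] / (y[mask] + eps))
    return (log_term - x + y).sum() / x.shape[0]


x_demo = np.array([[1.0, 2.0], [0.0, 3.0]])
y_demo = np.array([[2.0, 1.0], [1.0, 3.0]])
print(generalized_kl(x_demo, y_demo))
```

Unlike the cross term alone, this quantity is zero when the reconstruction matches the data exactly.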


#### Topics

In [9]:
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(kl_nmf, tfidf_feature_names, n_top_words)

Topic #0: 
time like way right really did years good said make just think don long thing going new say want know
Topic #1: 
use thanks need used using software work help does card hi drive video pc mac computer problem new like speed
Topic #2: 
god question does say people believe true read word jesus says point religion bible life christian claim christians mean faith
Topic #3: 
use government people public make state law used key number fact chip using rights note case legal war keys large
Topic #4: 
new sale 10 year 20 15 shipping offer 12 50 following 16 1993 11 price years 30 00 condition 25
Topic #5: 
thanks know mail post does information looking like com send interested email list address info reply net group advance help
Topic #6: 
windows program file problem using run use version running files like window sun ftp try look available code image server
Topic #7: 
just edu like don want try ve soon thing think things stuff sure oh case car deleted tell people bike
Topic #8: 
goo