# Topic Models on WikiHow

[WikiHow dataset page](https://github.com/mahnazkoupaee/WikiHow-Dataset)

[Automatic Evaluation of Topic Coherence](https://www.aclweb.org/anthology/N10-1012)

[Evaluation of Topic Modeling: Topic Coherence](https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/)

[Topic Coherence in gensim](https://radimrehurek.com/gensim/models/coherencemodel.html)

In [1]:
!wget -nc -O wikihowAll.csv https://query.data.world/s/lult233wfonljfadtexn2t5x5rb7is

File ‘wikihowAll.csv’ already there; not retrieving.


In [2]:
!pip install git+https://github.com/lambdaofgod/mlutil
!pip install tqdm

Collecting git+https://github.com/lambdaofgod/mlutil
  Cloning https://github.com/lambdaofgod/mlutil to /tmp/pip-0itfu2mr-build


In [3]:
from __future__ import print_function
from time import time

import numpy as np
import pandas as pd

import seaborn as sns

import tqdm


from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora import Dictionary

from IPython.display import display, Image

import nltk
nltk.download('wordnet')
nltk.download('wordnet_ic')

import mlutil
from mlutil.textmining import get_wordnet_similarity


import pyLDAvis
import pyLDAvis.sklearn

paramiko missing, opening SSH/SCP/SFTP paths will be disabled.  `pip install paramiko` to suppress
[nltk_data] Downloading package wordnet to /home/kuba/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package wordnet_ic to /home/kuba/nltk_data...
[nltk_data]   Package wordnet_ic is already up-to-date!


In [4]:
pyLDAvis.enable_notebook()

In [5]:
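def plot_correlations(m):
  # Gram matrix of the rows of m; column j is divided by the squared norm
  # of row j, so the diagonal equals 1.
  m_corr = m @ m.T / (m ** 2).sum(axis=1)
  sns.heatmap(m_corr)  # plot the normalized matrix, not the raw input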

In [6]:
n_features = 5000
n_components = 10
n_top_words = 10

## Loading WikiHow

In [7]:
wikihow_df = pd.read_csv('wikihowAll.csv')
print('wikihow size', wikihow_df.shape)
wikihow_df = wikihow_df[~wikihow_df['text'].isna()]
print('valid wikihow size (removed empty text)', wikihow_df.shape)

wikihow size (215365, 3)
valid wikihow size (removed empty text) (214294, 3)


In [8]:
data_samples = wikihow_df['text']
n_samples = len(data_samples)

In [9]:
# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=5,
                                   max_features=n_features,
                                   stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))

# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=5,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
print()

Extracting tf-idf features for NMF...
done in 92.840s.
Extracting tf features for LDA...
done in 90.236s.
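The two vectorizers above produce the same vocabulary but weight it differently: raw counts (tf) for LDA, idf-weighted and length-normalised counts (tf-idf) for NMF. A minimal pure-Python sketch of that difference on a toy corpus follows. It assumes pre-tokenised documents and skips `max_df`, `min_df`, `max_features` and stop-word filtering; the function name `tf_and_tfidf` and the toy data are made up for illustration. The idf formula is the smoothed one scikit-learn applies by default.

```python
import math
from collections import Counter

def tf_and_tfidf(docs):
    """Return (vocab, tf rows, L2-normalised tf-idf rows) for tokenised docs."""
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(w for doc in docs for w in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

    tf_rows, tfidf_rows = [], []
    for doc in docs:
        counts = Counter(doc)
        tf = [counts[w] for w in vocab]
        weighted = [c * idf[w] for c, w in zip(tf, vocab)]
        # L2-normalise each document vector (guard against empty documents).
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        tf_rows.append(tf)
        tfidf_rows.append([x / norm for x in weighted])
    return vocab, tf_rows, tfidf_rows

# Hypothetical toy corpus, not taken from WikiHow.
toy_docs = [["add", "water", "stir"], ["add", "oil"], ["click", "menu", "click"]]
vocab, tf_rows, tfidf_rows = tf_and_tfidf(toy_docs)
```

In the second toy document, `oil` appears in fewer documents than `add`, so it ends up with the larger tf-idf weight even though both have a raw count of 1.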



In [9]:
# Fit the NMF model
print("Fitting the NMF model (Frobenius norm) with tf-idf features "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_components, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model (Frobenius norm):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

Fitting the NMF model (Frobenius norm) with tf-idf features n_samples=214294 and n_features=5000...
done in 193.698s.

Topics in NMF model (Frobenius norm):


In [10]:
nmf_keywords_per_topic = mlutil.topic_modeling.top_topic_words(nmf, tfidf_feature_names, 100)
display(nmf_keywords_per_topic.iloc[:,:10])

| topic | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| topic_0 | people | don | person | like | feel | time | make | things | say | know |
| topic_1 | add | water | mixture | minutes | oil | heat | stir | bowl | pan | mix |
| topic_2 | click | screen | button | select | menu | tap | app | icon | file | page |
| topic_3 | hair | shampoo | comb | dry | conditioner | look | skin | scalp | brush | oil |
| topic_4 | dog | dogs | vet | pet | puppy | food | training | leash | treat | breed |
| topic_5 | skin | doctor | body | help | blood | foods | pain | symptoms | day | exercise |
| topic_6 | use | make | water | paper | paint | cut | sure | color | place | glue |
| topic_7 | business | information | need | state | company | card | number | credit | money | online |
| topic_8 | cat | cats | vet | food | pet | litter | veterinarian | toys | kitten | box |
| topic_9 | child | children | kids | parents | parent | baby | school | help | behavior | toddler |

## Topic coherence

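Below, each topic is scored by the average Resnik similarity (computed over WordNet) between its top keywords. The printed values are per-topic scores, one row per topic, even though the print label says "mean coherence".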
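To make the averaging step concrete, here is a small sketch that averages a similarity function over all unordered keyword pairs. The similarity is passed in as a parameter: in this notebook it would be Resnik similarity, which scores two words by the information content of their most specific shared WordNet ancestor, but the toy function below is only a stand-in so the example runs without WordNet data. Skipping pairs with no defined similarity is my assumption; I have not checked how `mlutil` handles words missing from WordNet.

```python
from itertools import combinations

def mean_pairwise_similarity(words, similarity):
    """Average similarity(a, b) over all unordered pairs, skipping undefined (None) pairs."""
    scores = [similarity(a, b) for a, b in combinations(words, 2)]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else float("nan")

# Hypothetical stand-in for a WordNet-based similarity: 1.0 when two words
# share a hand-assigned group, 0.0 otherwise, None for unknown words.
TOY_GROUPS = {"dog": "animal", "cat": "animal", "vet": "animal", "click": "computer"}

def toy_similarity(a, b):
    if a not in TOY_GROUPS or b not in TOY_GROUPS:
        return None
    return 1.0 if TOY_GROUPS[a] == TOY_GROUPS[b] else 0.0

toy_score = mean_pairwise_similarity(["dog", "cat", "vet", "click"], toy_similarity)
```

A topic whose keywords all fall in one group scores 1.0 under the toy similarity, while a mixed topic scores lower, which is the behaviour the coherence measure is meant to capture.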


In [11]:
nmf_mean_coherence = mlutil.topic_modeling.get_topic_coherences(nmf_keywords_per_topic)
print('NMF-based topic model mean coherence:', nmf_mean_coherence)

100%|██████████| 10/10 [03:01<00:00, 18.13s/it]

NMF-based topic model mean coherence: 0    1.037505
1    1.003931
2    0.958536
3    1.273350
4    1.448881
5    0.864943
6    1.378219
7    0.715856
8    1.833986
9    1.831830
dtype: float64





In [12]:
# Fit the KL divergence NMF model
print("Fitting the NMF model (generalized Kullback-Leibler divergence) with "
      "tf-idf features, n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
kl_nmf = NMF(n_components=n_components, random_state=1,
          beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=.1,
          l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))

tfidf_feature_names = tfidf_vectorizer.get_feature_names()

Fitting the NMF model (generalized Kullback-Leibler divergence) with tf-idf features, n_samples=214294 and n_features=5000...
done in 1760.113s.
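The two NMF fits differ in the reconstruction loss they minimise: squared Frobenius distance in the first, generalized Kullback-Leibler divergence in the second. The sketch below writes out both losses for small dense arrays. It leaves out the regularisation terms controlled by `alpha` and `l1_ratio`, and it assumes dense inputs (the notebook's `tfidf` matrix is sparse); the function names and the `eps` guard are mine.

```python
import numpy as np

def frobenius_loss(X, W, H):
    """0.5 * ||X - WH||_F^2."""
    return 0.5 * np.sum((X - W @ H) ** 2)

def generalized_kl_loss(X, W, H, eps=1e-12):
    """sum(X * log(X / WH) - X + WH), with 0 * log(0) taken as 0."""
    WH = W @ H
    mask = X > 0  # entries with X == 0 contribute nothing to the log term
    log_term = np.sum(X[mask] * np.log(X[mask] / (WH[mask] + eps)))
    return log_term - X.sum() + WH.sum()

# Toy factorisation with made-up numbers.
W_toy = np.array([[1.0, 0.0], [0.0, 2.0]])
H_toy = np.array([[1.0, 2.0], [3.0, 0.0]])
X_exact = W_toy @ H_toy   # perfectly reconstructed input
X_off = X_exact + 1.0     # input that the factors do not reproduce
```

Both losses are zero (up to the `eps` guard) when `W @ H` reproduces `X` exactly and positive otherwise. The KL variant penalises ratios rather than differences, so it is especially severe where the reconstruction is near zero but the input is not.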


### Topics in NMF model (generalized Kullback-Leibler divergence)

In [14]:
kl_nmf_keywords_per_topic = mlutil.topic_modeling.top_topic_words(kl_nmf, tfidf_feature_names, 100)
display(kl_nmf_keywords_per_topic.iloc[:,:10])

| topic | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| topic_0 | time | try | make | like | way | want | don | people | help | just |
| topic_1 | water | use | using | remove | sure | make | warm | dry | small | minutes |
| topic_2 | click | select | screen | right | open | use | want | type | menu | window |
| topic_3 | look | wear | hair | like | try | want | don | just | style | make |
| topic_4 | pet | need | dog | sure | possible | prevent | safe | provide | likely | vet |
| topic_5 | help | weight | include | reduce | doctor | body | health | treatment | need | increase |
| topic_6 | use | need | work | way | sure | make | want | right | start | using |
| topic_7 | use | information | online | number | website | need | year | example | provide | work |
| topic_8 | stir | minutes | mix | add | mixture | serve | sugar | place | salt | time |
| topic_9 | use | make | sure | place | small | want | using | paper | cut | shape |


In [16]:
kl_nmf_mean_coherence = mlutil.topic_modeling.get_topic_coherences(kl_nmf_keywords_per_topic)
print('KL-NMF-based topic model mean coherence:', kl_nmf_mean_coherence)

100%|██████████| 10/10 [02:57<00:00, 19.10s/it]

KL-NMF-based topic model mean coherence: 0    1.218143
1    0.635964
2    0.715969
3    0.829468
4    0.707291
5    0.734174
6    1.303825
7    0.778983
8    1.028046
9    0.870015
dtype: float64





In [10]:
print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0,
                                n_jobs=-1)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()

Fitting LDA models with tf features, n_samples=214294 and n_features=5000...
done in 840.556s.

Topics in LDA model:


In [11]:
lda_keywords_per_topic = mlutil.topic_modeling.top_topic_words(lda, tf_feature_names, 100)
display(lda_keywords_per_topic.iloc[:,:10])

| topic | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| topic_0 | food | foods | blood | skin | eat | help | like | doctor | day | meat |
| topic_1 | use | make | cut | place | end | need | right | hand | sure | paper |
| topic_2 | time | make | work | need | good | like | want | help | ll | people |
| topic_3 | help | child | time | feel | try | body | children | exercise | day | sleep |
| topic_4 | click | button | screen | select | right | open | computer | use | tap | window |
| topic_5 | don | make | people | like | person | want | time | just | know | try |
| topic_6 | water | use | add | dry | remove | make | oil | place | minutes | clean |
| topic_7 | information | need | business | state | number | file | example | use | court | credit |
| topic_8 | paint | look | color | hair | make | use | like | want | colors | wear |
| topic_9 | dog | cat | water | make | need | sure | soil | plant | plants | home |


Warning: the LDA results may not be directly comparable with the NMF ones. I have not verified that extracting topic keywords from LDA works the same way as for NMF: the NMF keywords are ranked by weights learned from tf-idf features, whereas LDA is fit on raw term counts (tf).
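One partial check on this concern: scikit-learn's NMF and LatentDirichletAllocation both expose a `components_` array of shape (n_components, n_features), and for LDA each row can be read as unnormalised topic-word pseudo-counts. The sketch below shows that turning such rows into probability distributions does not change the keyword ranking, so ranking by raw `components_` is at least mechanically valid for both models. Whether `top_topic_words` ranks by `components_` is my assumption, and the different input weighting (tf versus tf-idf) remains a real difference. The helper name and toy numbers are hypothetical.

```python
import numpy as np

def to_topic_word_distribution(components):
    """Normalise each topic row so that its weights sum to 1."""
    components = np.asarray(components, dtype=float)
    return components / components.sum(axis=1, keepdims=True)

# Toy pseudo-count matrix standing in for `lda.components_`.
toy_counts = np.array([[5.0, 1.0, 4.0],
                       [0.5, 3.0, 1.5]])
toy_dist = to_topic_word_distribution(toy_counts)

# Dividing a row by a positive constant preserves the order of its entries.
same_ranking = np.array_equal(np.argsort(toy_counts, axis=1),
                              np.argsort(toy_dist, axis=1))
```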

In [23]:
lda_mean_coherence = mlutil.topic_modeling.get_topic_coherences(lda_keywords_per_topic)
print('LDA-based topic model mean coherence:', lda_mean_coherence)

100%|██████████| 10/10 [02:57<00:00, 18.68s/it]

LDA-based topic model mean coherence: 0    0.963005
1    0.679626
2    1.135926
3    0.927911
4    1.046612
5    1.159201
6    0.965963
7    0.917089
8    1.009249
9    0.886062
dtype: float64
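The three outputs above are per-topic scores, so comparing the models needs one more step. The snippet below copies the printed values from this notebook and averages them per model, using only the standard library; the dictionary keys are labels I chose.

```python
from statistics import mean

# Per-topic coherence values copied from the three outputs above.
coherences = {
    "NMF (Frobenius)": [1.037505, 1.003931, 0.958536, 1.273350, 1.448881,
                        0.864943, 1.378219, 0.715856, 1.833986, 1.831830],
    "NMF (KL)": [1.218143, 0.635964, 0.715969, 0.829468, 0.707291,
                 0.734174, 1.303825, 0.778983, 1.028046, 0.870015],
    "LDA": [0.963005, 0.679626, 1.135926, 0.927911, 1.046612,
            1.159201, 0.965963, 0.917089, 1.009249, 0.886062],
}

mean_coherence = {name: mean(scores) for name, scores in coherences.items()}
# Model names ordered from highest to lowest average coherence.
ranking = sorted(mean_coherence, key=mean_coherence.get, reverse=True)

for name in ranking:
    print("%-16s %.3f" % (name, mean_coherence[name]))
```

On these numbers the Frobenius NMF model averages about 1.235, LDA about 0.969 and KL NMF about 0.882. This is a single run per model, and the caveat about LDA keyword extraction above still applies.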





In [12]:
%%time
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

CPU times: user 1min 45s, sys: 1.17 s, total: 1min 46s
Wall time: 7min 53s
