reference:http://blog.csdn.net/tiffanyrabbit/article/details/76445909
http://blog.csdn.net/tiffanyrabbit/article/details/76445909

In [9]:
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='train')
data = news.data[:1000]

## Preprocessing
- tokenize
- stemmerize
- remove stopwords

In [10]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

data_tokenized = []
for text in data:
 text = text.lower()
 #tokenize
 tokens = word_tokenize(text)
 #remove stopwords
 filtered = [word for word in tokens if word not in stopwords.words('English')]
 #stemmerize
 ps = PorterStemmer()
 filtered = [ps.stem(w) for w in tokens]
 data_tokenized.append(' '.join(filtered))
# show a sample result
print data_tokenized[:1]

[u"from : lerxst @ wam.umd.edu ( where 's my thing ) subject : what car is thi ! ? nntp-posting-host : rac3.wam.umd.edu organ : univers of maryland , colleg park line : 15 i wa wonder if anyon out there could enlighten me on thi car i saw the other day . it wa a 2-door sport car , look to be from the late 60s/ earli 70 . it wa call a bricklin . the door were realli small . in addit , the front bumper wa separ from the rest of the bodi . thi is all i know . if anyon can tellm a model name , engin spec , year of product , where thi car is made , histori , or whatev info you have on thi funki look car , pleas e-mail . thank , - il -- -- brought to you by your neighborhood lerxst -- --"]


## CountVectorizer
- 注意由于LDA是基于词频统计的,因此一般不用TF-IDF来做特征提取
- LDA模型学习时的训练数据并不是一篇篇文本,而是Document-word matrix,它可以是array也可以是稀疏矩阵,维数是n_samples*n_features,其中n_features为词(term)的个数。因此在训练LDA主题模型前,需要先利用CountVectorizer统计词频并保存
- CountVectorizer parameters:
 - max_df : float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
 - min_df : float in range [0.0, 1.0] or int, default=1
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
 - max_features : int or None, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
This parameter is ignored if vocabulary is not None.

In [11]:
#vectorize text

from sklearn.feature_extraction.text import CountVectorizer 

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
 stop_words='english')
tf = tf_vectorizer.fit_transform(data_tokenized)

# store the Count Vectoerizer with joblib, so when run second time, the code above can be commented away.

from sklearn.externals import joblib #也可以选择pickle等保存模型,请随意
joblib.dump(tf_vectorizer,'model.pkl' )

# #得到存储的tf_vectorizer,节省预处理时间

# tf_vectorizer = joblib.load(tf_ModelPath)
# tf = tf_vectorizer.fit_transform(docLst)

['model.pkl']

## LDA modeling training
- 测试时max_iter设置为几十次通常很快就会结束,当然如果实际应用的话,建议至少上千次吧。
- 怎么调参数???

In [14]:
from sklearn.decomposition import LatentDirichletAllocation
n_topic = 20
lda = LatentDirichletAllocation(n_topics=n_topic, 
 max_iter=50,
 learning_method='batch')
lda.fit(tf) #tf即为Document_word Sparse Matrix 



LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
 evaluate_every=-1, learning_decay=0.7,
 learning_method='batch', learning_offset=10.0,
 max_doc_update_iter=100, max_iter=50, mean_change_tol=0.001,
 n_components=10, n_jobs=1, n_topics=20, perp_tol=0.1,
 random_state=None, topic_word_prior=None,
 total_samples=1000000.0, verbose=0)

## Show topics
- 怎么检测分类结果???

In [13]:
def print_top_words(model, feature_names, n_top_words):
 #打印每个主题下权重较高的term
 for topic_idx, topic in enumerate(model.components_):
 print "Topic #%d:" % topic_idx
 print " ".join([feature_names[i]
 for i in topic.argsort()[:-n_top_words - 1:-1]])
 print
 #打印主题-词语分布矩阵
 print model.components_

n_top_words=20
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

Topic #0:
edu com thi jake indiana new ini write univers ha use comput articl doe host secur duo opinion depart york
Topic #1:
edu wa columbia com cc land jew cunixa posting nntp ani host write uoknor callison arab did dyer hi articl
Topic #2:
wa peopl hi say thi know come becaus want brian ha man look did time way ve happen day whi
Topic #3:
edu toronto henri thi just zoo reserv work state spencer wa write adam alaska use ohio like colorado posting nntp
Topic #4:
netcom com 408 guest servic commun 241 ca 9760 list request drug clipper use electron chip lin thi wa harley
Topic #5:
com thi window weapon israel write attack articl stratus ani civilian arab say right know doe ha edu manag onli
Topic #6:
edu game ca cs team season write articl pitt pittsburgh play player cmu new nntp posting playoff host univers comput
Topic #7:
wa thi health use tobacco like ani 1993 year diseas smokeless report com david medic case age state person articl
Topic #8:
thi wa edu think peopl write articl hum