{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "reference:http://blog.csdn.net/tiffanyrabbit/article/details/76445909\n", "http://blog.csdn.net/tiffanyrabbit/article/details/76445909" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.datasets import fetch_20newsgroups\n", "news = fetch_20newsgroups(subset='train')\n", "data = news.data[:1000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocessing\n", "- tokenize\n", "- stemmerize\n", "- remove stopwords" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[u\"from : lerxst @ wam.umd.edu ( where 's my thing ) subject : what car is thi ! ? nntp-posting-host : rac3.wam.umd.edu organ : univers of maryland , colleg park line : 15 i wa wonder if anyon out there could enlighten me on thi car i saw the other day . it wa a 2-door sport car , look to be from the late 60s/ earli 70 . it wa call a bricklin . the door were realli small . in addit , the front bumper wa separ from the rest of the bodi . thi is all i know . if anyon can tellm a model name , engin spec , year of product , where thi car is made , histori , or whatev info you have on thi funki look car , pleas e-mail . thank , - il -- -- brought to you by your neighborhood lerxst -- --\"]\n" ] } ], "source": [ "from nltk.tokenize import word_tokenize\n", "from nltk.corpus import stopwords\n", "from nltk.stem.porter import PorterStemmer\n", "\n", "data_tokenized = []\n", "for text in data:\n", " text = text.lower()\n", " #tokenize\n", " tokens = word_tokenize(text)\n", " #remove stopwords\n", " filtered = [word for word in tokens if word not in stopwords.words('English')]\n", " #stemmerize\n", " ps = PorterStemmer()\n", " filtered = [ps.stem(w) for w in tokens]\n", " data_tokenized.append(' '.join(filtered))\n", "# show a sample result\n", "print data_tokenized[:1]" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## CountVectorizer\n", "- 注意由于LDA是基于词频统计的,因此一般不用TF-IDF来做特征提取\n", "- LDA模型学习时的训练数据并不是一篇篇文本,而是Document-word matrix,它可以是array也可以是稀疏矩阵,维数是n_samples*n_features,其中n_features为词(term)的个数。因此在训练LDA主题模型前,需要先利用CountVectorizer统计词频并保存\n", "- CountVectorizer parameters:\n", " - max_df : float in range [0.0, 1.0] or int, default=1.0\n", "When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.\n", " - min_df : float in range [0.0, 1.0] or int, default=1\n", "When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.\n", " - max_features : int or None, default=None\n", "If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.\n", "This parameter is ignored if vocabulary is not None." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['model.pkl']" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#vectorize text\n", "\n", "from sklearn.feature_extraction.text import CountVectorizer \n", "\n", "tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,\n", " stop_words='english')\n", "tf = tf_vectorizer.fit_transform(data_tokenized)\n", "\n", "# store the Count Vectoerizer with joblib, so when run second time, the code above can be commented away.\n", "\n", "from sklearn.externals import joblib #也可以选择pickle等保存模型,请随意\n", "joblib.dump(tf_vectorizer,'model.pkl' )\n", "\n", "# #得到存储的tf_vectorizer,节省预处理时间\n", "\n", "# tf_vectorizer = joblib.load(tf_ModelPath)\n", "# tf = tf_vectorizer.fit_transform(docLst)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## LDA modeling training\n", "- 测试时max_iter设置为几十次通常很快就会结束,当然如果实际应用的话,建议至少上千次吧。\n", "- 怎么调参数???" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/zjm/anaconda/lib/python2.7/site-packages/sklearn/decomposition/online_lda.py:294: DeprecationWarning: n_topics has been renamed to n_components in version 0.19 and will be removed in 0.21\n", " DeprecationWarning)\n" ] }, { "data": { "text/plain": [ "LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,\n", " evaluate_every=-1, learning_decay=0.7,\n", " learning_method='batch', learning_offset=10.0,\n", " max_doc_update_iter=100, max_iter=50, mean_change_tol=0.001,\n", " n_components=10, n_jobs=1, n_topics=20, perp_tol=0.1,\n", " random_state=None, topic_word_prior=None,\n", " total_samples=1000000.0, verbose=0)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.decomposition import LatentDirichletAllocation\n", "n_topic = 20\n", "lda = LatentDirichletAllocation(n_topics=n_topic, \n", " max_iter=50,\n", " learning_method='batch')\n", "lda.fit(tf) #tf即为Document_word Sparse Matrix " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Show topics\n", "- 怎么检测分类结果???" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Topic #0:\n", "edu com thi jake indiana new ini write univers ha use comput articl doe host secur duo opinion depart york\n", "Topic #1:\n", "edu wa columbia com cc land jew cunixa posting nntp ani host write uoknor callison arab did dyer hi articl\n", "Topic #2:\n", "wa peopl hi say thi know come becaus want brian ha man look did time way ve happen day whi\n", "Topic #3:\n", "edu toronto henri thi just zoo reserv work state spencer wa write adam alaska use ohio like colorado posting nntp\n", "Topic #4:\n", "netcom com 408 guest servic commun 241 ca 9760 list request drug clipper use electron chip lin thi wa harley\n", "Topic #5:\n", "com thi window weapon israel write attack articl stratus ani civilian arab say right know doe ha edu manag onli\n", "Topic #6:\n", "edu game ca cs team season write articl pitt pittsburgh play player cmu new nntp posting playoff host univers comput\n", "Topic #7:\n", "wa thi health use tobacco like ani 1993 year diseas smokeless report com david medic case age state person articl\n", "Topic #8:\n", "thi wa edu think peopl write articl human say god believ natur whi becaus time make like did just ha\n", "Topic #9:\n", "com key thi nasa space astronaut encrypt research chip use shuttl govern ha bit clipper applic select mission need want\n", "Topic #10:\n", "wa edu weaver com right sandvik ca thi cooper arm write apple draw object trial spenc articl kent peopl posting\n", "Topic #11:\n", "edu scsi sale host simm posting nntp control ide disk mac drive com univers thi distribut comput work new card\n", "Topic #12:\n", "period 10 12 11 pp power play 14 19 20 15 orbit 28 18 13 scorer pt 17 93 23\n", "Topic #13:\n", "edu thi use window problem file ani com univers pleas ca thank anyon program write need comput time imag help\n", "Topic #14:\n", "edu com thi wa write articl like good thing state make onli know att want ani way think number game\n", "Topic #15:\n", "armenian turkish wa genocid peopl turk soviet argic serdar muslim greek govern armenia russian kill war 000 thi popul world\n", "Topic #16:\n", "moral edu thi father keith wa write spirit son doe say ha use know think caltech com articl engin nntp\n", "Topic #17:\n", "edu com wa articl valu write scienc gay optilink homosexu use thi cramer did ca think univers men post uchicago\n", "Topic #18:\n", "edu thi write ca articl com wa motif power univers option uiuc doe revolv rushdi look host posting nntp like\n", "Topic #19:\n", "thi write ___ univers articl com edu ca gov __ nasa ibm jpl did like just professor wa know ha\n", "Topic #20:\n", "edu com thi write cs wa ha posting nntp host year articl univers morri team did run usa ibm distribut\n", "Topic #21:\n", "ax max 145 1t 04 ql 0d 3t wm cx tm 0t 0m sl gy bj gk 34 p2 m_\n", "Topic #22:\n", "mu mv m0 m9 mt mp __ mh m8 mw mi mz md 22 mj 1x odomet lm mf h0\n", "Topic #23:\n", "thi know edu point com just ani like write think peopl murder ha possibl object doe onli articl univers person\n", "Topic #24:\n", "wa jesu hi thi matthew said time propheci gun peopl day ha psalm john king messiah prophet onli hear gospel\n", "Topic #25:\n", "com wa hp edu thi ani softwar write peopl use just articl ti realli process level dseg quack post veri\n", "Topic #26:\n", "good veri 50 edu excel colorado miss tn geoffrey fair thi 00 com 75 mane modul cover homicid gun wa\n", "Topic #27:\n", "thi god christian edu argument say peopl know wa believ write bibl true truth ha doe exampl becaus hi think\n", "Topic #28:\n", "den p2 p3 p1 wa com cool nuclear p4 doubl radiu thi water edu approv plant tower know au n2\n", "Topic #29:\n", "car 00 year new edu game wa card speed state insur drive driver use 15 rate color modem price buy\n", "\n", "[[ 3.33333333e-02 3.33333333e-02 3.33333333e-02 ..., 3.33333333e-02\n", " 3.33333333e-02 3.33333333e-02]\n", " [ 3.74148502e+00 2.55934619e+00 3.33333333e-02 ..., 3.33333333e-02\n", " 3.33333333e-02 3.33333333e-02]\n", " [ 3.33333333e-02 3.33333333e-02 3.33333333e-02 ..., 3.33333333e-02\n", " 3.33333333e-02 3.33333333e-02]\n", " ..., \n", " [ 9.97698755e+00 1.26471312e+01 3.33333333e-02 ..., 3.33333333e-02\n", " 3.33333333e-02 3.33333333e-02]\n", " [ 3.33333333e-02 3.33333333e-02 3.33333333e-02 ..., 3.33333333e-02\n", " 3.33333333e-02 1.03333333e+00]\n", " [ 7.70723979e+01 1.16162932e+01 3.33333333e-02 ..., 3.33333333e-02\n", " 3.33333333e-02 3.33333333e-02]]\n" ] } ], "source": [ "def print_top_words(model, feature_names, n_top_words):\n", " #打印每个主题下权重较高的term\n", " for topic_idx, topic in enumerate(model.components_):\n", " print \"Topic #%d:\" % topic_idx\n", " print \" \".join([feature_names[i]\n", " for i in topic.argsort()[:-n_top_words - 1:-1]])\n", " print\n", " #打印主题-词语分布矩阵\n", " print model.components_\n", "\n", "n_top_words=20\n", "tf_feature_names = tf_vectorizer.get_feature_names()\n", "print_top_words(lda, tf_feature_names, n_top_words)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 2 }