{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "reference：http://blog.csdn.net/tiffanyrabbit/article/details/76445909\n",
    "http://blog.csdn.net/tiffanyrabbit/article/details/76445909"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_20newsgroups\n",
    "news = fetch_20newsgroups(subset='train')\n",
    "data = news.data[:1000]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preprocessing\n",
    "- tokenize\n",
    "- stemmerize\n",
    "- remove stopwords"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[u\"from : lerxst @ wam.umd.edu ( where 's my thing ) subject : what car is thi ! ? nntp-posting-host : rac3.wam.umd.edu organ : univers of maryland , colleg park line : 15 i wa wonder if anyon out there could enlighten me on thi car i saw the other day . it wa a 2-door sport car , look to be from the late 60s/ earli 70 . it wa call a bricklin . the door were realli small . in addit , the front bumper wa separ from the rest of the bodi . thi is all i know . if anyon can tellm a model name , engin spec , year of product , where thi car is made , histori , or whatev info you have on thi funki look car , pleas e-mail . thank , - il -- -- brought to you by your neighborhood lerxst -- --\"]\n"
     ]
    }
   ],
   "source": [
    "from nltk.tokenize import word_tokenize\n",
    "from nltk.corpus import stopwords\n",
    "from nltk.stem.porter import PorterStemmer\n",
    "\n",
    "data_tokenized = []\n",
    "for text in data:\n",
    "    text = text.lower()\n",
    "    #tokenize\n",
    "    tokens = word_tokenize(text)\n",
    "    #remove stopwords\n",
    "    filtered = [word for word in tokens if word not in stopwords.words('English')]\n",
    "    #stemmerize\n",
    "    ps = PorterStemmer()\n",
    "    filtered = [ps.stem(w) for w in tokens]\n",
    "    data_tokenized.append(' '.join(filtered))\n",
    "# show a sample result\n",
    "print data_tokenized[:1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## CountVectorizer\n",
    "- 注意由于LDA是基于词频统计的，因此一般不用TF-IDF来做特征提取\n",
    "- LDA模型学习时的训练数据并不是一篇篇文本，而是Document-word matrix，它可以是array也可以是稀疏矩阵，维数是n_samples*n_features，其中n_features为词(term)的个数。因此在训练LDA主题模型前，需要先利用CountVectorizer统计词频并保存\n",
    "- CountVectorizer parameters:\n",
    "    - max_df : float in range [0.0, 1.0] or int, default=1.0\n",
    "When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.\n",
    "    - min_df : float in range [0.0, 1.0] or int, default=1\n",
    "When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.\n",
    "    - max_features : int or None, default=None\n",
    "If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.\n",
    "This parameter is ignored if vocabulary is not None."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['model.pkl']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#vectorize text\n",
    "\n",
    "from sklearn.feature_extraction.text import CountVectorizer  \n",
    "\n",
    "tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,\n",
    "                                stop_words='english')\n",
    "tf = tf_vectorizer.fit_transform(data_tokenized)\n",
    "\n",
    "# store the Count Vectoerizer with joblib, so when run second time, the code above can be commented away.\n",
    "\n",
    "from sklearn.externals import joblib  #也可以选择pickle等保存模型，请随意\n",
    "joblib.dump(tf_vectorizer,'model.pkl' )\n",
    "\n",
    "# #得到存储的tf_vectorizer,节省预处理时间\n",
    "\n",
    "# tf_vectorizer = joblib.load(tf_ModelPath)\n",
    "# tf = tf_vectorizer.fit_transform(docLst)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## LDA modeling training\n",
    "- 测试时max_iter设置为几十次通常很快就会结束，当然如果实际应用的话，建议至少上千次吧。\n",
    "- 怎么调参数？？？"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/zjm/anaconda/lib/python2.7/site-packages/sklearn/decomposition/online_lda.py:294: DeprecationWarning: n_topics has been renamed to n_components in version 0.19 and will be removed in 0.21\n",
      "  DeprecationWarning)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,\n",
       "             evaluate_every=-1, learning_decay=0.7,\n",
       "             learning_method='batch', learning_offset=10.0,\n",
       "             max_doc_update_iter=100, max_iter=50, mean_change_tol=0.001,\n",
       "             n_components=10, n_jobs=1, n_topics=20, perp_tol=0.1,\n",
       "             random_state=None, topic_word_prior=None,\n",
       "             total_samples=1000000.0, verbose=0)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.decomposition import LatentDirichletAllocation\n",
    "n_topic = 20\n",
    "lda = LatentDirichletAllocation(n_topics=n_topic, \n",
    "                                max_iter=50,\n",
    "                                learning_method='batch')\n",
    "lda.fit(tf) #tf即为Document_word Sparse Matrix  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Show topics\n",
    "- 怎么检测分类结果？？？"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Topic #0:\n",
      "edu com thi jake indiana new ini write univers ha use comput articl doe host secur duo opinion depart york\n",
      "Topic #1:\n",
      "edu wa columbia com cc land jew cunixa posting nntp ani host write uoknor callison arab did dyer hi articl\n",
      "Topic #2:\n",
      "wa peopl hi say thi know come becaus want brian ha man look did time way ve happen day whi\n",
      "Topic #3:\n",
      "edu toronto henri thi just zoo reserv work state spencer wa write adam alaska use ohio like colorado posting nntp\n",
      "Topic #4:\n",
      "netcom com 408 guest servic commun 241 ca 9760 list request drug clipper use electron chip lin thi wa harley\n",
      "Topic #5:\n",
      "com thi window weapon israel write attack articl stratus ani civilian arab say right know doe ha edu manag onli\n",
      "Topic #6:\n",
      "edu game ca cs team season write articl pitt pittsburgh play player cmu new nntp posting playoff host univers comput\n",
      "Topic #7:\n",
      "wa thi health use tobacco like ani 1993 year diseas smokeless report com david medic case age state person articl\n",
      "Topic #8:\n",
      "thi wa edu think peopl write articl human say god believ natur whi becaus time make like did just ha\n",
      "Topic #9:\n",
      "com key thi nasa space astronaut encrypt research chip use shuttl govern ha bit clipper applic select mission need want\n",
      "Topic #10:\n",
      "wa edu weaver com right sandvik ca thi cooper arm write apple draw object trial spenc articl kent peopl posting\n",
      "Topic #11:\n",
      "edu scsi sale host simm posting nntp control ide disk mac drive com univers thi distribut comput work new card\n",
      "Topic #12:\n",
      "period 10 12 11 pp power play 14 19 20 15 orbit 28 18 13 scorer pt 17 93 23\n",
      "Topic #13:\n",
      "edu thi use window problem file ani com univers pleas ca thank anyon program write need comput time imag help\n",
      "Topic #14:\n",
      "edu com thi wa write articl like good thing state make onli know att want ani way think number game\n",
      "Topic #15:\n",
      "armenian turkish wa genocid peopl turk soviet argic serdar muslim greek govern armenia russian kill war 000 thi popul world\n",
      "Topic #16:\n",
      "moral edu thi father keith wa write spirit son doe say ha use know think caltech com articl engin nntp\n",
      "Topic #17:\n",
      "edu com wa articl valu write scienc gay optilink homosexu use thi cramer did ca think univers men post uchicago\n",
      "Topic #18:\n",
      "edu thi write ca articl com wa motif power univers option uiuc doe revolv rushdi look host posting nntp like\n",
      "Topic #19:\n",
      "thi write ___ univers articl com edu ca gov __ nasa ibm jpl did like just professor wa know ha\n",
      "Topic #20:\n",
      "edu com thi write cs wa ha posting nntp host year articl univers morri team did run usa ibm distribut\n",
      "Topic #21:\n",
      "ax max 145 1t 04 ql 0d 3t wm cx tm 0t 0m sl gy bj gk 34 p2 m_\n",
      "Topic #22:\n",
      "mu mv m0 m9 mt mp __ mh m8 mw mi mz md 22 mj 1x odomet lm mf h0\n",
      "Topic #23:\n",
      "thi know edu point com just ani like write think peopl murder ha possibl object doe onli articl univers person\n",
      "Topic #24:\n",
      "wa jesu hi thi matthew said time propheci gun peopl day ha psalm john king messiah prophet onli hear gospel\n",
      "Topic #25:\n",
      "com wa hp edu thi ani softwar write peopl use just articl ti realli process level dseg quack post veri\n",
      "Topic #26:\n",
      "good veri 50 edu excel colorado miss tn geoffrey fair thi 00 com 75 mane modul cover homicid gun wa\n",
      "Topic #27:\n",
      "thi god christian edu argument say peopl know wa believ write bibl true truth ha doe exampl becaus hi think\n",
      "Topic #28:\n",
      "den p2 p3 p1 wa com cool nuclear p4 doubl radiu thi water edu approv plant tower know au n2\n",
      "Topic #29:\n",
      "car 00 year new edu game wa card speed state insur drive driver use 15 rate color modem price buy\n",
      "\n",
      "[[  3.33333333e-02   3.33333333e-02   3.33333333e-02 ...,   3.33333333e-02\n",
      "    3.33333333e-02   3.33333333e-02]\n",
      " [  3.74148502e+00   2.55934619e+00   3.33333333e-02 ...,   3.33333333e-02\n",
      "    3.33333333e-02   3.33333333e-02]\n",
      " [  3.33333333e-02   3.33333333e-02   3.33333333e-02 ...,   3.33333333e-02\n",
      "    3.33333333e-02   3.33333333e-02]\n",
      " ..., \n",
      " [  9.97698755e+00   1.26471312e+01   3.33333333e-02 ...,   3.33333333e-02\n",
      "    3.33333333e-02   3.33333333e-02]\n",
      " [  3.33333333e-02   3.33333333e-02   3.33333333e-02 ...,   3.33333333e-02\n",
      "    3.33333333e-02   1.03333333e+00]\n",
      " [  7.70723979e+01   1.16162932e+01   3.33333333e-02 ...,   3.33333333e-02\n",
      "    3.33333333e-02   3.33333333e-02]]\n"
     ]
    }
   ],
   "source": [
    "def print_top_words(model, feature_names, n_top_words):\n",
    "    #打印每个主题下权重较高的term\n",
    "    for topic_idx, topic in enumerate(model.components_):\n",
    "        print \"Topic #%d:\" % topic_idx\n",
    "        print \" \".join([feature_names[i]\n",
    "                        for i in topic.argsort()[:-n_top_words - 1:-1]])\n",
    "    print\n",
    "    #打印主题-词语分布矩阵\n",
    "    print model.components_\n",
    "\n",
    "n_top_words=20\n",
    "tf_feature_names = tf_vectorizer.get_feature_names()\n",
    "print_top_words(lda, tf_feature_names, n_top_words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}