{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# sklearn-LDA"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"代码示例:https://mp.weixin.qq.com/s/hMcJtB3Lss1NBalXRTGZlQ (玉树芝兰)
\n",
"可视化:https://blog.csdn.net/qq_39496504/article/details/107125284
\n",
"sklearn lda参数解读:https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html\n",
"
中文版参数解读:https://blog.csdn.net/TiffanyRabbit/article/details/76445909\n",
"
LDA原理-视频版:https://www.bilibili.com/video/BV1t54y127U8\n",
"
LDA原理-文字版:https://www.jianshu.com/p/5c510694c07e\n",
"
score的计算方法:https://github.com/scikit-learn/scikit-learn/blob/844b4be24d20fc42cc13b957374c718956a0db39/sklearn/decomposition/_lda.py#L729\n",
"
主题困惑度1:https://blog.csdn.net/weixin_43343486/article/details/109255165\n",
"
主题困惑度2:https://blog.csdn.net/weixin_39676021/article/details/112187210"
]
},
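{
"cell_type": "markdown",
"metadata": {},
"source": [
"The links above cover the `LatentDirichletAllocation` parameters, `score`, and perplexity. As a rough preview of where the preprocessing below leads, the next cell is a minimal sketch (not this notebook's final pipeline; the toy corpus, `n_components=2`, and the other settings are illustrative assumptions) of fitting sklearn's LDA on space-separated documents and reading off `score` and `perplexity`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.decomposition import LatentDirichletAllocation\n",
"\n",
"# toy corpus: documents already segmented into space-separated tokens (illustrative only)\n",
"toy_docs = [\"元宇宙 游戏 虚拟 现实\", \"经济 全球 宇宙 国家\", \"周深 音乐 先生 时尚\"]\n",
"\n",
"# bag-of-words matrix (default CountVectorizer settings, not tuned)\n",
"toy_vectorizer = CountVectorizer()\n",
"toy_X = toy_vectorizer.fit_transform(toy_docs)\n",
"\n",
"# n_components (number of topics) is a placeholder; the real value needs model selection\n",
"toy_lda = LatentDirichletAllocation(n_components=2, max_iter=50,\n",
"                                    learning_method='online', random_state=0)\n",
"toy_lda.fit(toy_X)\n",
"\n",
"print(toy_lda.score(toy_X))       # approximate log-likelihood (higher is better)\n",
"print(toy_lda.perplexity(toy_X))  # perplexity (lower is better)"
]
},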
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.预处理"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import pandas as pd\n",
"import re\n",
"import jieba\n",
"import jieba.posseg as psg"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"output_path = r'G:\\Desktop\\social\\BigHomework\\result'\n",
"file_path = r'G:\\Desktop\\AI\\LDA\\weibo_data\\weibo_data_preprocessed.csv'\n",
"data=pd.read_csv(file_path).astype(str)#content type\n",
"stop_file = r\"G:\\Desktop\\AI\\LDA\\data\\stop_words.txt\""
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"# 中文分词,使用jieba分词,只保留名词/动名词\n",
"def chinese_word_cut(mytext):\n",
" try:\n",
" stopword_list = open(stop_file,encoding ='utf-8')\n",
" except:\n",
" stopword_list = []\n",
" print(\"error in stop_file\")\n",
" stop_list = []\n",
" flag_list = ['n','nz','vn']\n",
" for line in stopword_list:\n",
" line = re.sub(u'\\n|\\\\r', '', line)\n",
" stop_list.append(line)\n",
" \n",
" word_list = []\n",
" #jieba分词\n",
" seg_list = psg.cut(mytext)\n",
" for seg_word in seg_list:\n",
" # 只保留中文\n",
" word = re.sub(u'[^\\u4e00-\\u9fa5]','',seg_word.word)\n",
" find = 0\n",
" for stop_word in stop_list:\n",
" if stop_word == word or len(word)<2: #this word is stopword\n",
" find = 1\n",
" break\n",
" if find == 0 and seg_word.flag in flag_list:\n",
" word_list.append(word) \n",
" return (\" \").join(word_list)"
]
},
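{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check of `chinese_word_cut` on a single made-up sentence (illustrative only, not a row from the dataset): the function should return the kept nouns joined by single spaces."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustrative one-off check; the sample sentence is made up, not taken from the corpus\n",
"sample_text = \"元宇宙的虚拟现实技术正在改变游戏产业\"\n",
"print(chinese_word_cut(sample_text))"
]
},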
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
| \n", " | 博主昵称 | \n", "微博认证 | \n", "content | \n", "发布时间 | \n", "转发 | \n", "评论 | \n", "赞 | \n", "
|---|---|---|---|---|---|---|---|
| 0 | \n", "辣子鸡谁做的好吃 | \n", "nan | \n", "好像可以在自己幻想的元宇宙里过一辈子好像已经过完了一辈子双鱼座的脑子要不得 | \n", "01月10日 23:59 | \n", "0 | \n", "0 | \n", "0 | \n", "
| 1 | \n", "远古的刀 | \n", "nan | \n", "反正闭关锁宇宙我们将会面临下一次的闭关锁国融入不了全球经济王峻涛6688跟你们讲我相信这个元... | \n", "01月10日 23:58 | \n", "32 | \n", "3 | \n", "0 | \n", "
| 2 | \n", "暮景烟_深浅 | \n", "nan | \n", "周深先生之夜元宇宙周深拥有了生米就像拥有了梦的翅膀卡布叻_周深放心飞吧生米永相随cp时尚先生... | \n", "01月10日 23:58 | \n", "0 | \n", "0 | \n", "0 | \n", "
| 3 | \n", "东辉毅恒传媒 | \n", "nan | \n", "王峻涛6688其实吧你有空可以再看看这个视频跟你们讲我相信这个元宇宙真的会来虽然不是一下子就... | \n", "01月10日 23:57 | \n", "0 | \n", "0 | \n", "0 | \n", "
| 4 | \n", "在寒蝉鸣泣中等待夏日重现 | \n", "nan | \n", "敬元宇宙L让基尔希斯坦的女朋友的微博视频 | \n", "01月10日 23:57 | \n", "0 | \n", "0 | \n", "5 | \n", "