{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 用Python做情感分析\n", "

文本情感分析(也称为意见挖掘)是指用自然语言处理、文本挖掘以及计算机语言学等方法来识别和提取原素材中的主观信息。(维基百科)

\n", "

简单的文本情感分析可借助已有工具包,以黑箱式操作完成。

" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I am happy today. Sentiment(polarity=0.8, subjectivity=1.0)\n", "I feel sad today. Sentiment(polarity=-0.5, subjectivity=1.0)\n" ] } ], "source": [ "from textblob import TextBlob\n", "text=\"I am happy today. I feel sad today.\"\n", "blob=TextBlob(text)\n", "for sentence in blob.sentences:\n", " print(sentence,sentence.sentiment)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TextBlob的情感极性取值范围是[-1, 1],-1代表完全负面,1代表完全正面。\n", "
\n", "The underlying training corpus is movie reviews." ] },
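{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick check of the two ends of the scale (a minimal sketch; exact scores depend on the installed TextBlob version):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from textblob import TextBlob\n", "# polarity near 1 marks strongly positive wording, near -1 strongly negative\n", "print(TextBlob('I love this wonderful movie').sentiment)\n", "print(TextBlob('I hate this terrible movie').sentiment)" ] },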
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "我今天很快乐 0.971889316039116\n", "我今天很愤怒 0.07763913772213482\n" ] } ], "source": [ "from snownlp import SnowNLP\n", "text=\"我今天很快乐。我今天很愤怒。\"\n", "s=SnowNLP(text)\n", "for sentence in s.sentences:\n", " print(sentence,SnowNLP(sentence).sentiments)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
SnowNLP's sentiment score ranges over [0, 1]: it is the probability that the sentence expresses positive sentiment.\n", "\n", "The underlying training corpus is online shopping reviews." ] },
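{ "cell_type": "markdown", "metadata": {}, "source": [ "Since the default model is trained on shopping reviews, it is most reliable in that domain (a minimal sketch; exact scores depend on the installed SnowNLP version):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from snownlp import SnowNLP\n", "# .sentiments is the probability in [0, 1] that the review is positive\n", "print(SnowNLP('质量很好,物流很快,非常满意').sentiments)\n", "print(SnowNLP('质量太差了,再也不会买').sentiments)" ] },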
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 0 预处理\n", "#### 0.1 分词" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Building prefix dict from the default dictionary ...\n", "Dumping model to file cache C:\\Users\\ZZJASW~1\\AppData\\Local\\Temp\\jieba.cache\n", "Loading model cost 3.007 seconds.\n", "Prefix dict has been built succesfully.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "这样\n", "的\n", "酒店\n", "配\n", "这样\n", "的\n", "价格\n", "还\n", "算\n", "不错\n" ] } ], "source": [ "import jieba\n", "sentence='这样的酒店配这样的价格还算不错'\n", "wordList=jieba.cut(sentence)\n", "for word in wordList:\n", " print (word)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 0.2 移除停用词\n", "中科院计算所中文自然语言处理开放平台的[中文停用词表](http://www.datatang.com/data/43894)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "酒店\n", "配\n", "价格\n", "算\n", "不错\n" ] } ], "source": [ "import jieba\n", "sentence='这样的酒店配这样的价格还算不错'\n", "wordList=jieba.cut(sentence)\n", "#####read in the stop word file\n", "file='ChineseStopWord.txt'\n", "f=open(file,'r',encoding='utf-8')\n", "stopList=[]\n", "for line in f:\n", " line=line.strip()\n", " stopList.append(line)\n", "f.close()\n", "#####remove stop words from the wordList\n", "newWordList=[]\n", "for word in wordList:\n", " if word not in stopList:\n", " newWordList.append(word)\n", "#####examine the results\n", "for word in newWordList:\n", " print (word)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1,基于词典的方法\n", "情感词典:[BosonNLP数据](http://bosonnlp.com/dev/resource)\n", "#### 算法设计\n", "“假设情感值满足线性叠加原理;然后我们将句子进行分词,如果句子分词后的词语向量包含相应的词语,就加上向前的权值,其中,否定词和程度副词会有特殊的判别规则,否定词会导致权值反号,而程度副词则让权值加倍。最后,根据总权值的正负性来判断句子的情感。”\n", "
(Source: http://spaces.ac.cn/archives/3360/)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from collections import defaultdict\n", "\n", "def readLines(filename):\n", "    # helper (not shown in the source post): read a UTF-8 file into stripped lines\n", "    with open(filename, 'r', encoding='utf-8') as f:\n", "        return [line.strip() for line in f]\n", "\n", "def classifyWords(wordDict):\n", "    # (1) sentiment words\n", "    senList = readLines('BosonNLP_sentiment_score.txt')\n", "    senDict = defaultdict()\n", "    for s in senList:\n", "        senDict[s.split(' ')[0]] = s.split(' ')[1]\n", "    # (2) negation words\n", "    notList = readLines('notDict.txt')\n", "    # (3) degree adverbs\n", "    degreeList = readLines('degreeDict.txt')\n", "    degreeDict = defaultdict()\n", "    for d in degreeList:\n", "        degreeDict[d.split(',')[0]] = d.split(',')[1]\n", "\n", "    senWord = defaultdict()\n", "    notWord = defaultdict()\n", "    degreeWord = defaultdict()\n", "\n", "    # wordDict maps each word to its position in the segmented sentence\n", "    for word in wordDict.keys():\n", "        if word in senDict.keys() and word not in notList and word not in degreeDict.keys():\n", "            senWord[wordDict[word]] = senDict[word]\n", "        elif word in notList and word not in degreeDict.keys():\n", "            notWord[wordDict[word]] = -1\n", "        elif word in degreeDict.keys():\n", "            degreeWord[wordDict[word]] = degreeDict[word]\n", "    return senWord, notWord, degreeWord\n", "\n", "def scoreSent(senWord, notWord, degreeWord, segResult):\n", "    W = 1\n", "    score = 0\n", "    # positions of all sentiment words, sorted so they can be indexed\n", "    senLoc = sorted(senWord.keys())\n", "    notLoc = notWord.keys()\n", "    degreeLoc = degreeWord.keys()\n", "    senloc = -1\n", "\n", "    # walk through all words in segResult; i is the absolute position\n", "    for i in range(0, len(segResult)):\n", "        # if the current word is a sentiment word\n", "        if i in senLoc:\n", "            # senloc indexes the list of sentiment-word positions\n", "            senloc += 1\n", "            # add this sentiment word's score, weighted by W\n", "            score += W * float(senWord[i])\n", "            if senloc < len(senLoc) - 1:\n", "                # check for negation words or degree adverbs between this\n", "                # sentiment word and the next one (j is an absolute position)\n", "                for j in range(senLoc[senloc], senLoc[senloc + 1]):\n", "                    # a negation word flips the sign of the weight\n", "                    if j in notLoc:\n", "                        W *= -1\n", "                    # a degree adverb scales the weight\n", "                    elif j in degreeLoc:\n", "                        W *= float(degreeWord[j])\n", "    return score\n", "\n", "### Source: http://www.jianshu.com/p/4cfcf1610a73" ] },
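{ "cell_type": "markdown", "metadata": {}, "source": [ "A minimal usage sketch, assuming the three dictionary files above are available; `listToDict` is a hypothetical helper that maps each segmented word to its position, which is the `wordDict` shape `classifyWords` expects:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import jieba\n", "\n", "def listToDict(wordList):\n", "    # hypothetical helper: map each word to its position in the sentence\n", "    return {word: i for i, word in enumerate(wordList)}\n", "\n", "segResult = list(jieba.cut('这样的酒店配这样的价格还算不错'))\n", "wordDict = listToDict(segResult)\n", "senWord, notWord, degreeWord = classifyWords(wordDict)\n", "# a positive total score suggests positive sentiment, a negative one negative\n", "print(scoreSent(senWord, notWord, degreeWord, segResult))" ] },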
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Naive Bayes classification\n", "1. Prepare a training set, i.e., texts whose labels are already known\n", "\n", "2. From the training set, compute each word's contribution to each label (see the formula below)\n", "\n", "3. For a text whose label is unknown, decide the label from the words it contains" ] },
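{ "cell_type": "markdown", "metadata": {}, "source": [ "Step 2 rests on Bayes' rule with the naive word-independence assumption: for a text $d$ containing words $w_1, \\dots, w_n$ and a label $c$,\n", "\n", "$$P(c \\mid d) \\propto P(c) \\prod_{i=1}^{n} P(w_i \\mid c),$$\n", "\n", "and the predicted label is the $c$ that maximizes this quantity." ] },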
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "feel happy this morning : neg\n", "Oh I love my friend : pos\n", "not like that man : neg\n", "this hourse not great : neg\n", "your song annoying : neg\n", "accuracy is: 0.8\n" ] } ], "source": [ "from textblob.classifiers import NaiveBayesClassifier\n", "\n", "train = [\n", " ('I love this car', 'pos'),\n", " ('This view is amazing', 'pos'),\n", " ('I feel great', 'pos'),\n", " ('I am so excited about the concert', 'pos'),\n", " (\"He is my best friend\", 'pos'),\n", " ('I do not like this car', 'neg'),\n", " ('This view is horrible', 'neg'),\n", " (\"I feel tired this morning\", 'neg'),\n", " ('I am not looking forward to the concert', 'neg'),\n", " ('He is an annoying enemy', 'neg')\n", "]\n", "\n", "test = [\n", " ('feel happy this morning', 'pos'),\n", " ('Oh I love my friend', 'pos'),\n", " ('not like that man', 'neg'),\n", " (\"this hourse not great\", 'neg'),\n", " ('your song annoying', 'neg')\n", "]\n", "\n", "cl = NaiveBayesClassifier(train)\n", "\n", "for sentence in test:\n", " print (sentence[0],':',cl.classify(sentence[0]))\n", "\n", "print ('accuracy is:', cl.accuracy(test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3,获取情感语料" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.1 获取网页" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "b'\\n\\nA Useful Page\\n\\n\\n
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Obtaining a sentiment corpus" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.1 Fetching a web page" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "b'<html>\\n<head>\\n<title>A Useful Page</title>\\n</head>\\n<body>\\n<h1>An Interesting Title</h1>\\n<div>\\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\\n</div>\\n</body>\\n</html>\\n'\n" ] } ], "source": [ "from urllib.request import urlopen\n", "html=urlopen(\"http://pythonscraping.com/pages/page1.html\")\n", "print(html.read())" ] },
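{ "cell_type": "markdown", "metadata": {}, "source": [ "Real pages may be missing or unreachable, so fetches are usually wrapped in error handling (a minimal sketch using only the standard library):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from urllib.request import urlopen\n", "from urllib.error import HTTPError, URLError\n", "\n", "def fetch(url):\n", "    # return the page bytes, or None if the page cannot be retrieved\n", "    try:\n", "        return urlopen(url).read()\n", "    except (HTTPError, URLError) as e:\n", "        print('could not fetch', url, ':', e)\n", "        return None\n", "\n", "print(fetch('http://pythonscraping.com/pages/page1.html') is not None)" ] },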
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2 Parsing the page" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<h1>An Interesting Title</h1>\n", "<h1>An Interesting Title</h1>\n", "<h1>An Interesting Title</h1>
\n" ] } ], "source": [ "from bs4 import BeautifulSoup\n", "html=urlopen(\"http://pythonscraping.com/pages/page1.html\")\n", "bsobj=BeautifulSoup(html.read(),\"html.parser\")\n", "print (bsobj.html.body.h1)\n", "print (bsobj.body.h1)\n", "print (bsobj.html.h1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.3 保存网页" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "fileName=\"D://example.txt\"\n", "p=open(fileName,\"w\") #open for writing, truncating the file first\n", "print(\"hello\",file=p)\n", "print(\"world\",file=p)\n", "p.close()\n", "\n", "p=open(fileName,\"a\") #open for writing, appending to the end of the file if it exists\n", "print(\"hello world again\",file=p)\n", "p.close()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" } }, "nbformat": 4, "nbformat_minor": 2 }