{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Sentiment Analysis with Python\n", "
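Text sentiment analysis (also known as opinion mining) uses natural language processing, text mining, and computational linguistics to identify and extract subjective information from source material. (Wikipedia)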
\n", "Simple text sentiment analysis can be done with existing toolkits, treated as black boxes.
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I am happy today. Sentiment(polarity=0.8, subjectivity=1.0)\n", "I feel sad today. Sentiment(polarity=-0.5, subjectivity=1.0)\n" ] } ], "source": [ "from textblob import TextBlob\n", "text=\"I am happy today. I feel sad today.\"\n", "blob=TextBlob(text)\n", "for sentence in blob.sentences:\n", " print(sentence,sentence.sentiment)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TextBlob的情感极性取值范围是[-1, 1],-1代表完全负面,1代表完全正面。\n", "训练集为影评。
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "我今天很快乐 0.971889316039116\n", "我今天很愤怒 0.07763913772213482\n" ] } ], "source": [ "from snownlp import SnowNLP\n", "text=\"我今天很快乐。我今天很愤怒。\"\n", "s=SnowNLP(text)\n", "for sentence in s.sentences:\n", " print(sentence,SnowNLP(sentence).sentiments)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SnowNLP的情感分析取值范围是[0,1],表达的是“这句话代表正面情感的概率”。
\n", "The training set is shopping reviews.
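
Because SnowNLP returns a probability in [0, 1] while TextBlob returns a polarity in [-1, 1], the two scores are not directly comparable. A minimal sketch of one way to line them up is shown here. The helper names, the 0.5 threshold, and the 2p - 1 rescaling are assumptions made for illustration; neither library provides them, and the rescaled value only shares TextBlob's range, not its meaning.

```python
def to_label(p, threshold=0.5):
    # Hypothetical helper: turn a positive-sentiment probability into a label.
    return 'pos' if p >= threshold else 'neg'


def to_polarity(p):
    # Hypothetical helper: linearly rescale [0, 1] onto [-1, 1].
    return 2.0 * p - 1.0


# The two scores printed by the SnowNLP cell above.
for p in (0.971889316039116, 0.07763913772213482):
    print(to_label(p), round(to_polarity(p), 3))
```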
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 0 预处理\n", "#### 0.1 分词" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Building prefix dict from the default dictionary ...\n", "Dumping model to file cache C:\\Users\\ZZJASW~1\\AppData\\Local\\Temp\\jieba.cache\n", "Loading model cost 3.007 seconds.\n", "Prefix dict has been built succesfully.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "这样\n", "的\n", "酒店\n", "配\n", "这样\n", "的\n", "价格\n", "还\n", "算\n", "不错\n" ] } ], "source": [ "import jieba\n", "sentence='这样的酒店配这样的价格还算不错'\n", "wordList=jieba.cut(sentence)\n", "for word in wordList:\n", " print (word)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 0.2 移除停用词\n", "中科院计算所中文自然语言处理开放平台的[中文停用词表](http://www.datatang.com/data/43894)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "酒店\n", "配\n", "价格\n", "算\n", "不错\n" ] } ], "source": [ "import jieba\n", "sentence='这样的酒店配这样的价格还算不错'\n", "wordList=jieba.cut(sentence)\n", "#####read in the stop word file\n", "file='ChineseStopWord.txt'\n", "f=open(file,'r',encoding='utf-8')\n", "stopList=[]\n", "for line in f:\n", " line=line.strip()\n", " stopList.append(line)\n", "f.close()\n", "#####remove stop words from the wordList\n", "newWordList=[]\n", "for word in wordList:\n", " if word not in stopList:\n", " newWordList.append(word)\n", "#####examine the results\n", "for word in newWordList:\n", " print (word)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1,基于词典的方法\n", "情感词典:[BosonNLP数据](http://bosonnlp.com/dev/resource)\n", "#### 算法设计\n", "“假设情感值满足线性叠加原理;然后我们将句子进行分词,如果句子分词后的词语向量包含相应的词语,就加上向前的权值,其中,否定词和程度副词会有特殊的判别规则,否定词会导致权值反号,而程度副词则让权值加倍。最后,根据总权值的正负性来判断句子的情感。”\n", "1,准备训练集,即分类已知的文本
\n", "2. Using the training set, compute each word's contribution to each class
\n", "3. For a text whose class is unknown, decide its class from the words it contains
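
The lexicon rule quoted under the algorithm-design heading (additive weights, a negation word flips the sign, a degree adverb doubles the weight) can be sketched roughly as follows. This is a minimal sketch, not the quoted author's implementation: the toy English lexicon, the word lists, and the choice to reset modifiers after each sentiment word are all assumptions made for illustration.

```python
# Hypothetical toy lexicon; a real run would load weights from a sentiment dictionary.
SENTIMENT = {'good': 1.0, 'great': 2.0, 'bad': -1.0, 'terrible': -2.0}
# Assumed negation words and degree adverbs (each adverb doubles the weight).
NEGATIONS = {'not', 'never', 'no'}
INTENSIFIERS = {'very': 2.0, 'extremely': 2.0}


def score_tokens(tokens):
    # Sum the weights of sentiment words (linear superposition).
    total = 0.0
    sign = 1.0   # flipped by each preceding negation word
    scale = 1.0  # multiplied by each preceding degree adverb
    for tok in tokens:
        if tok in NEGATIONS:
            sign = -sign
        elif tok in INTENSIFIERS:
            scale *= INTENSIFIERS[tok]
        elif tok in SENTIMENT:
            total += sign * scale * SENTIMENT[tok]
            # Assumption: modifiers only affect the next sentiment word.
            sign, scale = 1.0, 1.0
    return total


def classify(tokens):
    # The sign of the total weight decides the sentiment.
    total = score_tokens(tokens)
    if total > 0:
        return 'pos'
    if total < 0:
        return 'neg'
    return 'neutral'


print(classify(['this', 'hotel', 'is', 'not', 'very', 'good']))
```

For Chinese text the tokens would come from `jieba.cut` and the weights from a sentiment lexicon such as the BosonNLP data; the English entries here only keep the example self-contained.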
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "feel happy this morning : neg\n", "Oh I love my friend : pos\n", "not like that man : neg\n", "this hourse not great : neg\n", "your song annoying : neg\n", "accuracy is: 0.8\n" ] } ], "source": [ "from textblob.classifiers import NaiveBayesClassifier\n", "\n", "train = [\n", " ('I love this car', 'pos'),\n", " ('This view is amazing', 'pos'),\n", " ('I feel great', 'pos'),\n", " ('I am so excited about the concert', 'pos'),\n", " (\"He is my best friend\", 'pos'),\n", " ('I do not like this car', 'neg'),\n", " ('This view is horrible', 'neg'),\n", " (\"I feel tired this morning\", 'neg'),\n", " ('I am not looking forward to the concert', 'neg'),\n", " ('He is an annoying enemy', 'neg')\n", "]\n", "\n", "test = [\n", " ('feel happy this morning', 'pos'),\n", " ('Oh I love my friend', 'pos'),\n", " ('not like that man', 'neg'),\n", " (\"this hourse not great\", 'neg'),\n", " ('your song annoying', 'neg')\n", "]\n", "\n", "cl = NaiveBayesClassifier(train)\n", "\n", "for sentence in test:\n", " print (sentence[0],':',cl.classify(sentence[0]))\n", "\n", "print ('accuracy is:', cl.accuracy(test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3,获取情感语料" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.1 获取网页" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "b'\\n\\n