{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "***\n", "# Data Cleaning: Twitter Data\n", "***\n", "***\n", "\n", "王成军\n", "\n", "wangchengjun@nju.edu.cn\n", "\n", "计算传播网 http://computational-communication.com" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Data cleaning\n", "Data cleaning is an important step in data analysis. Its main goal is to turn messy raw data into data that can be analyzed directly, which usually means converting it into a data frame.\n", "\n", "This chapter uses the cleaning of tweet text as an example to introduce the basic logic of data cleaning.\n", "\n", "- Cleaning wrongly broken lines\n", "- Splitting the columns correctly\n", "- Extracting the content to be analyzed\n", "- Preprocessing large-scale data line by line or chunk by chunk\n", "\n", "\n" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 1. Drawing a sample of tweets for the experiments\n", "Students can skip this section." ] },
{ "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2752\n" ] } ], "source": [ "bigfile = open('/Users/chengjun/百度云同步盘/Writing/OWS/ows-raw.txt', 'r')\n",
"# size hint: readlines stops once the lines read so far total about chunkSize characters\n",
"chunkSize = 1000000\n", "chunk = bigfile.readlines(chunkSize)\n", "print(len(chunk))\n",
"with open(\"/Users/chengjun/GitHub/cjc/data/ows_tweets_sample.txt\", 'w') as f:\n", "    for i in chunk:\n", "        f.write(i) " ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Lazy Method for Reading Big File in Python?" 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:31:51.644484Z", "start_time": "2019-06-08T07:30:56.170308Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 262665\n", "1 525130\n", "2 787344\n", "3 1049351\n", "4 1312571\n", "5 1574666\n", "6 1835628\n", "7 2097136\n", "8 2358494\n", "9 2619723\n", "10 2880857\n", "11 3140945\n", "12 3404775\n", "13 3665565\n", "14 3927996\n", "15 4189419\n", "16 4449078\n", "17 4709001\n", "18 4969877\n", "19 5230937\n", "20 5492578\n", "21 5756613\n", "22 6022478\n", "23 6286119\n", "24 6549476\n", "25 6602141\n" ] } ], "source": [ "# https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python?lq=1\n", "import csv\n", "bigfile = open('/Users/datalab/bigdata/cjc/ows-raw.txt', 'r')\n", "\n", "chunkSize = 10**8\n", "chunk = bigfile.readlines(chunkSize)\n", "num, num_lines = 0, 0\n", "while chunk:\n",
"    lines = csv.reader((line.replace('\\x00','') for line in chunk), \n",
"                       delimiter=',', quotechar='\"')\n",
"    # do something with the parsed rows\n",
"    num_lines += len(list(lines))\n", "    print(num, num_lines)\n", "    num += 1\n", "    chunk = bigfile.readlines(chunkSize) # read another chunk" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Byte (/baɪt/)\n", "\n", "A byte is a unit for measuring storage capacity in information technology. One byte usually equals eight bits. The term also denotes a data type and a character unit in some programming languages.\n", "- 1 B (byte) = 8 bit;\n", "- 1 KB = 1000 B; 1 MB = 1000 KB = 1000 × 1000 B, where 1000 = 10^3;\n", "- 1 KB (kilobyte) = 1000 B = 10^3 B;\n", "- 1 MB (megabyte) = 1000 KB = 10^6 B;\n", "- 1 GB (gigabyte) = 1000 MB = 10^9 B." ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Processing hundreds of millions of rows with pandas get_chunk\n", "\n", "> Hadoop is a reasonable technical choice only when the data volume exceeds 5 TB." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [],
"source": [ "import pandas as pd\n", "\n", "f = open('../bigdata/OWS/ows-raw.txt',encoding='utf-8')\n",
"# skip malformed lines; in pandas >= 1.3 this option is spelled on_bad_lines='skip'\n",
"reader = pd.read_table(f, sep=',', iterator=True, error_bad_lines=False)\n",
"loop = True\n", "chunkSize = 100000\n", "data = []\n", "\n", "while loop:\n",
"    try:\n", "        chunk = reader.get_chunk(chunkSize)\n",
"        # data_cleaning_function is a placeholder for your own cleaning step\n",
"        dat = data_cleaning_function(chunk)\n", "        data.append(dat) \n",
"    except StopIteration:\n", "        loop = False\n", "        print(\"Iteration is stopped.\")\n", "\n", "df = pd.concat(data, ignore_index=True)" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 2. Cleaning wrongly broken lines" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:42:24.661108Z", "start_time": "2019-06-08T07:42:24.648304Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "with open(\"../data/ows_tweets_sample.txt\", 'r') as f:\n", "    lines = f.readlines() " ] },
{ "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:42:28.452634Z", "start_time": "2019-06-08T07:42:28.441018Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "2753" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# total number of lines\n", "len(lines)" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:42:32.821269Z", "start_time": "2019-06-08T07:42:32.816918Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'121813245488140288,\"@HumanityCritic i\\'m worried that the #ows sells out to the hamsher-norquist spitefuck, and tries to unite with the teahad.\",http://a2.twimg.com/profile_images/627683576/flytits_normal.jpg,2011-10-06,5,5,\"2011-10-06 05:05:15\",N;,fucentarmal,27480502,en,HumanityCritic,230431,\"&lt;a href=&quot;http://www.tweetdeck.com&quot; rel=&quot;nofollow&quot;&gt;TweetDeck&lt;/a&gt;\"\\n'" ] },
"execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# inspect one data line (index 15)\n", "lines[15]" ] },
{ "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on built-in function split:\n", "\n", "split(...) method of builtins.str instance\n", "    S.split(sep=None, maxsplit=-1) -> list of strings\n", "    \n", "    Return a list of the words in S, using sep as the\n", "    delimiter string. If maxsplit is given, at most maxsplit\n", "    splits are done. If sep is not specified or is None, any\n", "    whitespace string is a separator and empty strings are\n", "    removed from the result.\n", "\n" ] } ], "source": [ "help(lines[1].split)" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Question: the first line holds the variable names\n", "> ## 1. How do we remove the newline character?\n", "> ## 2. How do we get each variable name?\n" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:43:39.363547Z", "start_time": "2019-06-08T07:43:39.358317Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['\"Twitter ID\"',\n", " 'Text',\n", " '\"Profile Image URL\"',\n", " 'Day',\n", " 'Hour',\n", " 'Minute',\n", " '\"Created At\"',\n", " 'Geo',\n", " '\"From User\"',\n", " '\"From User ID\"',\n", " 'Language',\n", " '\"To User\"',\n", " '\"To User ID\"',\n", " 'Source']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "varNames = lines[0].replace('\\n', '').split(',')\n", "varNames" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:43:49.131388Z", "start_time": "2019-06-08T07:43:49.127319Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "14" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(varNames)" ] },
{ "cell_type": "code",
"execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:43:53.979866Z", "start_time": "2019-06-08T07:43:53.975920Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'121818600490283009,\"RT @chachiTHEgr8: RT @TheNewDeal: First they ignore you, then they laugh at you, then they fight you, then you win. - Gandhi #OccupyWallStreet #OWS #p2\",http://a0.twimg.com/profile_images/326662126/Photo_233_normal.jpg,2011-10-06,5,26,\"2011-10-06 05:26:32\",N;,k_l_h_j,382233343,en,,0,\"&lt;a href=&quot;http://twitter.com/#!/download/iphone&quot; rel=&quot;nofollow&quot;&gt;Twitter for iPhone&lt;/a&gt;\"\\n'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lines[1344]" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# How do we handle wrongly broken lines?" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T10:57:03.746530Z", "start_time": "2018-04-28T10:57:03.727339Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "with open(\"../data/ows_tweets_sample_clean.txt\", 'w') as f:\n",
"    right_line = '' # the record being assembled; starts as an empty string\n",
"    blocks = [] # records confirmed as complete are appended to blocks\n",
"    for line in lines:\n",
"        right_line += line.replace('\\n', ' ')\n",
"        line_length = len(right_line.split(','))\n",
"        if line_length >= 14:\n",
"            blocks.append(right_line)\n",
"            right_line = '' \n",
"    for i in blocks:\n",
"        f.write(i + '\\n')" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T10:57:07.915900Z", "start_time": "2018-04-28T10:57:07.911441Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "2627" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(blocks)" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T10:57:16.586149Z",
"start_time": "2018-04-28T10:57:16.582151Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'121818879105310720,\"RT @Min_Reyes: RT @The99Percenters: New video to go viral. From We Are Change http://t.co/6Ff718jk Listen to the guy begging... #ows #cdnpoli\",http://a3.twimg.com/sticky/default_profile_images/default_profile_0_normal.png,2011-10-06,5,27,\"2011-10-06 05:27:38\",N;,MiyazakiMegu,260948518,en,,0,\"&lt;a href=&quot;http://www.tweetdeck.com&quot; rel=&quot;nofollow&quot;&gt;TweetDeck&lt;/a&gt;\" '" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "blocks[1344]" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Considering both the delimiter and the quote character\n", "\n", "- Delimiter (column separator) 🔥: sep, delimiter\n", "- Quote character ☁️: quotechar\n" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:51:20.459071Z", "start_time": "2019-06-08T07:51:20.453871Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['121813245488140288',\n", " \"@HumanityCritic i'm worried that the #ows sells out to the hamsher-norquist spitefuck, and tries to unite with the teahad.\",\n", " 'http://a2.twimg.com/profile_images/627683576/flytits_normal.jpg,2011-10-06,5,5',\n", " '2011-10-06 05:05:15',\n", " 'N;,fucentarmal,27480502,en,HumanityCritic,230431',\n", " '&lt;a href=&quot;http://www.tweetdeck.com&quot; rel=&quot;nofollow&quot;&gt;TweetDeck&lt;/a&gt;\"\\n']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "re.split(',\"|\",', lines[15])" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:52:33.453629Z", "start_time": "2019-06-08T07:52:33.441462Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "line = 35  length = 6\n", "line = 36  length = 6\n", "line = 37  length = 6\n", "line = 38  length = 6\n", "line = 39  length = 6\n", "line = 40  length = 6\n", "line = 41  length = 2\n", "line = 42  length = 5\n", "line = 43  length = 6\n", "line = 44  length = 6\n", "line = 45  length = 6\n", "line = 46  length = 6\n", "line = 47  length = 6\n", "line = 48  length = 2\n", "line = 49  length = 5\n" ] } ], "source": [ "import re\n", "\n", "with open(\"../data/ows_tweets_sample.txt\",'r') as f:\n", "    lines = f.readlines()\n", "    \n", "for i in range(35,50):\n", "    i_ = re.split(',\"|\",', lines[i])\n", "    print('line =',i,' length =', len(i_))\n" ] },
{ "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:54:54.976462Z", "start_time": "2019-06-08T07:54:54.944533Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "with open(\"../data/ows_tweets_sample_clean4.txt\", 'w') as f:\n",
"    right_line = '' # the record being assembled; starts as an empty string\n",
"    blocks = [] # records confirmed as complete are appended to blocks\n",
"    for line in lines:\n",
"        right_line += line.replace('\\n', ' ').replace('\\r', ' ')\n",
"        #line_length = len(right_line.split(','))\n",
"        i_ = re.split(',\"|\",', right_line)\n",
"        line_length = len(i_)\n",
"        if line_length >= 6:\n",
"            blocks.append(right_line)\n",
"            right_line = ''\n",
"#     for i in blocks:\n",
"#         f.write(i + '\\n')" ] },
{ "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:54:59.860355Z", "start_time": "2019-06-08T07:54:59.856381Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "2626" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(blocks)" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 3. Reading the data and splitting the columns correctly" ] },
{ "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:55:54.719495Z", "start_time": "2019-06-08T07:55:54.712843Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Hint: you may need to change the path below\n", "with open(\"../data/ows_tweets_sample.txt\", 'r') as f:\n", "    chunk = f.readlines()" ] },
{ "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:55:57.501462Z", "start_time": "2019-06-08T07:55:57.497278Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "2753" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(chunk)" ] },
{ "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:56:00.549021Z", "start_time": "2019-06-08T07:56:00.544656Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['\"Twitter ID\",Text,\"Profile Image URL\",Day,Hour,Minute,\"Created At\",Geo,\"From User\",\"From User ID\",Language,\"To User\",\"To User ID\",Source\\n',\n", " '121813144174727168,\"RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! #OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE RT !!HELP!!!!\",http://a2.twimg.com/profile_images/1539375713/Twitter_normal.jpg,2011-10-06,5,4,\"2011-10-06 05:04:51\",N;,Anonops_Cop,401240477,en,,0,\"&lt;a href=&quot;http://twitter.com/&quot;&gt;web&lt;/a&gt;\"\\n',\n", " '121813146137657344,\"@jamiekilstein @allisonkilkenny Interesting interview (never aired, wonder why??) 
by Fox with #ows protester http://t.co/Fte55Kh7\",http://a2.twimg.com/profile_images/1574715503/Kate6_normal.jpg,2011-10-06,5,4,\"2011-10-06 05:04:51\",N;,KittyHybrid,34532053,en,jamiekilstein,2149053,\"&lt;a href=&quot;http://twitter.com/&quot;&gt;web&lt;/a&gt;\"\\n']" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chunk[:3]" ] },
{ "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:56:05.677057Z", "start_time": "2019-06-08T07:56:05.656929Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2627\n" ] } ], "source": [ "import csv\n", "lines_csv = csv.reader(chunk, delimiter=',', quotechar='\"') \n", "print(len(list(lines_csv)))\n", "# next(lines_csv)\n", "# next(lines_csv)" ] },
{ "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T01:12:38.678653Z", "start_time": "2018-04-29T01:12:38.611535Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import re\n", "import csv\n", "\n", "from collections import defaultdict\n", "\n",
"def extract_rt_user(tweet):\n",
"    rt_patterns = re.compile(r\"(RT|via)((?:\\b\\W*@\\w+)+)\", re.IGNORECASE)\n",
"    rt_user_name = rt_patterns.findall(tweet)\n",
"    if rt_user_name:\n",
"        rt_user_name = rt_user_name[0][1].strip(' @')\n",
"    else:\n",
"        rt_user_name = None\n",
"    return rt_user_name\n", "\n",
"rt_network = defaultdict(int)\n", "f = open(\"../data/ows_tweets_sample.txt\", 'r')\n", "chunk = f.readlines(100000)\n", "while chunk: \n",
"    #lines = csv.reader(chunk, delimiter=',', quotechar='\"') \n",
"    lines = csv.reader((line.replace('\\x00','') for line in chunk), delimiter=',', quotechar='\"')\n",
"    for line in lines:\n",
"        tweet = line[1]\n",
"        from_user = line[8]\n",
"        rt_user = extract_rt_user(tweet)\n",
"        rt_network[(from_user, rt_user)] += 1 \n",
"    chunk = f.readlines(100000)" ] },
{ "cell_type": "code", "execution_count": 22, 
"metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:56:22.886245Z", "start_time": "2019-06-08T07:56:22.198448Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>Twitter ID</th>\n", "      <th>Text</th>\n", "      <th>Profile Image URL</th>\n", "      <th>Day</th>\n", "      <th>Hour</th>\n", "      <th>Minute</th>\n", "      <th>Created At</th>\n", "      <th>Geo</th>\n", "      <th>From User</th>\n", "      <th>From User ID</th>\n", "      <th>Language</th>\n", "      <th>To User</th>\n", "      <th>To User ID</th>\n", "      <th>Source</th>\n", "    </tr>\n", "  </thead>\n",
"  <tbody>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>121813144174727168</td>\n", "      <td>RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLIN...</td>\n", "      <td>http://a2.twimg.com/profile_images/1539375713/...</td>\n", "      <td>2011-10-06</td>\n", "      <td>5</td>\n", "      <td>4</td>\n", "      <td>2011-10-06 05:04:51</td>\n", "      <td>N;</td>\n", "      <td>Anonops_Cop</td>\n", "      <td>401240477</td>\n", "      <td>en</td>\n", "      <td>NaN</td>\n", "      <td>0</td>\n", "      <td>&amp;lt;a href=&amp;quot;http://twitter.com/&amp;quot;&amp;gt;...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>1</th>\n", "      <td>121813146137657344</td>\n", "      <td>@jamiekilstein @allisonkilkenny Interesting in...</td>\n", "      <td>http://a2.twimg.com/profile_images/1574715503/...</td>\n", "      <td>2011-10-06</td>\n", "      <td>5</td>\n", "      <td>4</td>\n", "      <td>2011-10-06 05:04:51</td>\n", "      <td>N;</td>\n", "      <td>KittyHybrid</td>\n", "      <td>34532053</td>\n", "      <td>en</td>\n", "      <td>jamiekilstein</td>\n", "      <td>2149053</td>\n", "      <td>&amp;lt;a href=&amp;quot;http://twitter.com/&amp;quot;&amp;gt;...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2</th>\n", "      <td>121813150000619521</td>\n", "      <td>@Seductivpancake Right! Those guys have a vict...</td>\n", "      <td>http://a1.twimg.com/profile_images/1241412831/...</td>\n", "      <td>2011-10-06</td>\n", "      <td>5</td>\n", "      <td>4</td>\n", "      <td>2011-10-06 05:04:52</td>\n", "      <td>N;</td>\n", "      <td>nerdsherpa</td>\n", "      <td>95067344</td>\n", "      <td>en</td>\n", "      <td>Seductivpancake</td>\n", "      <td>19695580</td>\n", "      <td>&amp;lt;a href=&amp;quot;http://www.echofon.com/&amp;quot;...</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Twitter ID Text \\\n", "0 121813144174727168 RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLIN... \n", "1 121813146137657344 @jamiekilstein @allisonkilkenny Interesting in... \n", "2 121813150000619521 @Seductivpancake Right! Those guys have a vict... \n", "\n", " Profile Image URL Day Hour \\\n", "0 http://a2.twimg.com/profile_images/1539375713/... 2011-10-06 5 \n", "1 http://a2.twimg.com/profile_images/1574715503/... 2011-10-06 5 \n", "2 http://a1.twimg.com/profile_images/1241412831/... 2011-10-06 5 \n", "\n", " Minute Created At Geo From User From User ID Language \\\n", "0 4 2011-10-06 05:04:51 N; Anonops_Cop 401240477 en \n", "1 4 2011-10-06 05:04:51 N; KittyHybrid 34532053 en \n", "2 4 2011-10-06 05:04:52 N; nerdsherpa 95067344 en \n", "\n", " To User To User ID \\\n", "0 NaN 0 \n", "1 jamiekilstein 2149053 \n", "2 Seductivpancake 19695580 \n", "\n", " Source \n", "0 &lt;a href=&quot;http://twitter.com/&quot;&gt;... \n", "1 &lt;a href=&quot;http://twitter.com/&quot;&gt;... \n", "2 &lt;a href=&quot;http://www.echofon.com/&quot;... " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df = pd.read_csv(\"../data/ows_tweets_sample.txt\",\n", "                 sep = ',', quotechar='\"')\n", "df[:3]" ] },
{ "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:57:21.705488Z", "start_time": "2019-06-08T07:57:21.701307Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "2626" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] },
{ "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:57:25.595447Z", "start_time": "2019-06-08T07:57:25.588512Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! 
#OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE RT !!HELP!!!!'" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.Text[0]" ] },
{ "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T11:06:31.097919Z", "start_time": "2018-04-28T11:06:31.092038Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "0 Anonops_Cop\n", "1 KittyHybrid\n", "2 nerdsherpa\n", "3 hamudistan\n", "4 kl_knox\n", "5 vickycrampton\n", "6 burgerbuilders\n", "7 neverfox\n", "8 davidgaliel\n", "9 AnonOws\n", "Name: From User, dtype: object" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['From User'][:10]" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 4. Counting\n", "### Distribution of users by number of posts\n", "> How the number of users varies with the number of tweets they posted" ] },
{ "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:59:11.081963Z", "start_time": "2019-06-08T07:59:11.076747Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from collections import defaultdict\n", "data_dict = defaultdict(int)\n", "for i in df['From User']:\n", "    data_dict[i] +=1 " ] },
{ "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:59:11.737607Z", "start_time": "2019-06-08T07:59:11.706495Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[('MiranHosny', 1),\n", " ('BradMarston', 1),\n", " ('Sir_Richard_311', 1),\n", " ('elChepi', 1),\n", " ('jboy', 1)]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(data_dict.items())[:5]\n", "#data_dict" ] },
{ "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:59:23.202541Z", "start_time": "2019-06-08T07:59:22.945172Z" }, "slideshow": { "slide_type": 
"slide" } }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 安装微软雅黑字体\n", "为了在绘图时正确显示中文,需要安装/data/文件夹中的微软雅黑字体(msyh.ttf)\n", "\n", "详见[common questions](0.common_questions.ipynb)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:59:28.182593Z", "start_time": "2019-06-08T07:59:27.976785Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZcAAAETCAYAAAD6R0vDAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4xLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvDW2N/gAAErRJREFUeJzt3X+sX/V93/HnazhkS1oFA1eM2G5NF1aJVtqCrghdugiVjgCJYlqlEVHUOAmSFYlsybIpcRqpVJsqwbo1S7aKyQssZkL5sTQZVus0cUmiaH9AYygh/EjGhUKxZ/BtoKQd6lLDe398PyZfLvfaF/vj7znXfj6kr77nfD6f77nve/y9evl8zvmeb6oKSZJ6+jtDFyBJOvkYLpKk7gwXSVJ3hoskqTvDRZLUneEiSerOcJEkdWe4SJK6M1wkSd2tG7qAoZx99tm1efPmocuQpDXlrrvu+ouqmjvauFM2XDZv3szevXuHLkOS1pQkj61mnNNikqTuDBdJUneGiySpO8NFktSd4SJJ6s5wkSR1Z7hIkrozXCRJ3Q0aLkluTnIwyX1Tbb+T5HtJ7k3y5SRnTPV9LMlCku8nefNU++WtbSHJ9ln/HpKkFxv6E/qfAf4zcMtU2x7gY1V1KMkNwMeAjya5ALga+DngtcAfJ/mH7TW/B/wzYB/w7SS7quqBE1n45u1/eCI3v6JHr3/LID9Xkl6OQY9cqupbwFNL2r5WVYfa6h3Axra8BfhcVf2/qvozYAG4qD0WquqRqvoR8Lk2VpI0kLGfc3kf8JW2vAF4fKpvX2tbqV2SNJDRhkuSjwOHgFs7bnNbkr1J9i4uLvbarCRpiVGGS5L3AG8F3lVV1Zr3A5umhm1sbSu1v0RV7aiq+aqan5s76h2jJUnHaHThkuRy4CPA26rq2amuXcDVSV6Z5DzgfOBPgG8D5yc5L8npTE7675p13ZKkHxv0arEknwUuAc5Osg+4jsnVYa8E9iQBuKOq3l9V9yf5AvAAk+mya6vqubadDwBfBU4Dbq6q+2f+y0iSXjBouFTVO5dpvukI438b+O1l2ncDuzuWJkk6DqObFpMkrX2GiySpO8NFktSd4SJJ6s5wkSR1Z7hIkrozXCRJ3RkukqTuDBdJUneGiySpO8NFktSd4SJJ6s5wkSR1Z7hIkrozXCRJ3RkukqTuDBdJUneGiySpO8NFktSd4SJJ6s5wkSR1Z7hIkrozXCRJ3RkukqTuDBdJUneDhkuSm5McTHLfVNuZSfYkeag9r2/tSfKpJAtJ7k1y4dRrtrbxDyXZOsTvIkn6saGPXD4DXL6kbTtwe1WdD9ze1gGuAM5vj23AjTAJI+A64A3ARcB1hwNJkjSMQcOlqr4FPLWkeQuwsy3vBK6aar+lJu4AzkhyLvBmYE9
VPVVVTwN7eGlgSZJmaOgjl+WcU1UH2vITwDlteQPw+NS4fa1tpXZJ0kDGGC4vqKoCqtf2kmxLsjfJ3sXFxV6blSQtMcZwebJNd9GeD7b2/cCmqXEbW9tK7S9RVTuqar6q5ufm5roXLkmaGGO47AIOX/G1Fbhtqv3d7aqxi4Fn2vTZV4HLkqxvJ/Iva22SpIGsG/KHJ/kscAlwdpJ9TK76uh74QpJrgMeAd7Thu4ErgQXgWeC9AFX1VJJ/C3y7jfs3VbX0IgFJ0gwNGi5V9c4Vui5dZmwB166wnZuBmzuWJkk6DmOcFpMkrXGGiySpO8NFktSd4SJJ6s5wkSR1Z7hIkrozXCRJ3RkukqTuDBdJUneGiySpO8NFktSd4SJJ6s5wkSR1Z7hIkrozXCRJ3RkukqTuDBdJUneGiySpO8NFktSd4SJJ6s5wkSR1Z7hIkrozXCRJ3RkukqTuDBdJUnejDZck/zLJ/UnuS/LZJH83yXlJ7kyykOTzSU5vY1/Z1hda/+Zhq5ekU9sowyXJBuBfAPNV9fPAacDVwA3AJ6rqdcDTwDXtJdcAT7f2T7RxkqSBjDJcmnXA30uyDngVcAD4JeCLrX8ncFVb3tLWaf2XJskMa5UkTRlluFTVfuDfA3/OJFSeAe4C/rKqDrVh+4ANbXkD8Hh77aE2/qxZ1ixJ+rFRhkuS9UyORs4DXgu8Gri8w3a3JdmbZO/i4uLxbk6StIJRhgvwy8CfVdViVf0t8CXgjcAZbZoMYCOwvy3vBzYBtP7XAD9YutGq2lFV81U1Pzc3d6J/B0k6ZY01XP4cuDjJq9q5k0uBB4BvAG9vY7YCt7XlXW2d1v/1qqoZ1itJmjLKcKmqO5mcmL8b+C6TOncAHwU+nGSByTmVm9pLbgLOau0fBrbPvGhJ0gvWHX3IMKrqOuC6Jc2PABctM/ZvgF+bRV2SpKMb5ZGLJGltM1wkSd0ZLpKk7gwXSVJ3hoskqTvDRZLUneEiSerOcJEkdWe4SJK6M1wkSd0ZLpKk7gwXSVJ3hoskqbsj3hU5yQYm361yrAI8D/xUVf2f49iOJGkNWc0t9wNsBp47hu2H4wsnSdIatJpwKWBfVT1/LD9g8kWSkqRTiedcJEndGS6SpO4MF0lSd4aLJKk7w0WS1N1qrhYD+Kkkx3S1GJOrzSRJp5DVfs7l4fYsSdJRHTFcqmo/Tp1Jkl4mg0OS1N0RwyXJhiTPHcfj+SSHkrz25RaW5IwkX0zyvSQPJvmFJGcm2ZPkofa8vo1Nkk8lWUhyb5ILj3WHSJKO35jvLfZJ4I+q6u1JTgdeBfwGcHtVXZ9kO7Ad+ChwBXB+e7wBuLE9S5IGMMp7iyV5DfAm4D0AVfUj4EdJtgCXtGE7gW8yCZctwC1VVcAd7ajn3Ko6cCw1S5KOz1jPuZwHLAL/LcmfJvl0klcD50wFxhPAOW15A/D41Ov3tTZJ0gDGGi7rgAuBG6vq9cD/ZTIF9oJ2lPKyPkOTZFuSvUn2Li4uditWkvRiYw2XfUym4u5s619kEjZPJjkXoD0fbP37gU1Tr9/Y2l6kqnZU1XxVzc/NzZ2w4iXpVDfKcKmqJ4DHk/xsa7oUeADYBWxtbVuB29ryLuDd7aqxi4FnPN8iScNZ7e1fhvDPgVvblWKPAO9lEoZfSHIN8BjwjjZ2N3AlsAA828ZKkgYy2nuLVdU9wPwyXZcuM7aAa4/l50iS+vPeYpKk7ry3mCSpO4NDktSd4SJJ6s5wkSR1Z7hIkrozXCRJ3RkukqTuDBdJUneGiySpO8NFktSd4SJJ6s5wkSR1Z7hIkrozXCRJ3RkukqTuDBdJUneGiySpO8NFktSd4SJJ6s5wkSR1Z7hIkrozXCRJ3RkukqTuDBdJUnejDpckpyX50yR/0NbPS3JnkoUkn09yemt/ZVtfaP2bh6xbkk51ow4X4IPAg1PrNwCfqKrXAU8D17T2a4CnW/sn2jhJ0kBGGy5JNgJvAT7d1gP8EvDFNmQncFV
b3tLWaf2XtvGSpAGMNlyA/wh8BHi+rZ8F/GVVHWrr+4ANbXkD8DhA63+mjZckDWCU4ZLkrcDBqrqr83a3JdmbZO/i4mLPTUuSpowyXIA3Am9L8ijwOSbTYZ8Ezkiyro3ZCOxvy/uBTQCt/zXAD5ZutKp2VNV8Vc3Pzc2d2N9Akk5howyXqvpYVW2sqs3A1cDXq+pdwDeAt7dhW4Hb2vKutk7r/3pV1QxLliRNGWW4HMFHgQ8nWWByTuWm1n4TcFZr/zCwfaD6JEnAuqMPGVZVfRP4Zlt+BLhomTF/A/zaTAuTJK1orR25SJLWAMNFktSd4SJJ6s5wkSR1Z7hIkrozXCRJ3RkukqTuDBdJUneGiySpO8NFktSd4SJJ6s5wkSR1Z7hIkrozXCRJ3RkukqTuDBdJUneGiySpO8NFktSd4SJJ6s5wkSR1Z7hIkrozXCRJ3RkukqTuDBdJUneGiySpu1GGS5JNSb6R5IEk9yf5YGs/M8meJA+15/WtPUk+lWQhyb1JLhz2N5CkU9sowwU4BPyrqroAuBi4NskFwHbg9qo6H7i9rQNcAZzfHtuAG2dfsiTpsFGGS1UdqKq72/JfAQ8CG4AtwM42bCdwVVveAtxSE3cAZyQ5d8ZlS5KaUYbLtCSbgdcDdwLnVNWB1vUEcE5b3gA8PvWyfa1NkjSAUYdLkp8Afh/4UFX9cLqvqgqol7m9bUn2Jtm7uLjYsVJJ0rTRhkuSVzAJllur6kut+cnD013t+WBr3w9smnr5xtb2IlW1o6rmq2p+bm7uxBUvSae4UYZLkgA3AQ9W1e9Ode0CtrblrcBtU+3vbleNXQw8MzV9JkmasXVDF7CCNwK/Dnw3yT2t7TeA64EvJLkGeAx4R+vbDVwJLADPAu+dbbmSpGmjDJeq+l9AVui+dJnxBVx7QouSJK3aKKfFJElrm+EiSerOcJEkdWe4SJK6M1wkSd0ZLpKk7gwXSVJ3o/yci1a2efsfDvazH73+LYP9bElri0cukqTuDBdJUneGiySpO8NFktSd4SJJ6s5wkSR1Z7hIkrozXCRJ3RkukqTuDBdJUneGiySpO8NFktSdN67Uqg1100xvmCmtPR65SJK6M1wkSd0ZLpKk7gwXSVJ3J9UJ/SSXA58ETgM+XVXXD1yS1jC/9VM6difNkUuS04DfA64ALgDemeSCYauSpFPTyXTkchGwUFWPACT5HLAFeGDQqnTchjyCkHRsTqZw2QA8PrW+D3jDQLVIx8VA1Yk0i2nXkylcjirJNmBbW/3rJN8fsp6X4WzgL4Yu4mVYa/WCNc/KWqt5rdULq6g5NxzX9n96NYNOpnDZD2yaWt/Y2l5QVTuAHbMsqocke6tqfug6Vmut1QvWPCtrrea1Vi+Mp+aT5oQ+8G3g/CTnJTkduBrYNXBNknRKOmmOXKrqUJIPAF9lcinyzVV1/8BlSdIp6aQJF4Cq2g3sHrqOE2CtTeWttXrBmmdlrdW81uqFkdScqhq6BknSSeZkOuciSRoJw2UEkmxK8o0kDyS5P8kHlxlzSZJnktzTHr85RK1Lano0yXdbPXuX6U+STyVZSHJvkguHqHOqnp+d2n/3JPlhkg8tGTP4fk5yc5KDSe6bajszyZ4kD7Xn9Su8dmsb81CSrQPW+ztJvtf+3b+c5IwVXnvE99CMa/6tJPun/u2vXOG1lyf5fntfbx+45s9P1ftokntWeO3s93NV+Rj4AZwLXNiWfxL438AFS8ZcAvzB0LUuqelR4Owj9F8JfAUIcDFw59A1T9V2GvAE8NNj28/Am4ALgfum2v4dsL0tbwduWOZ1ZwKPtOf1bXn9QPVeBqxryzcsV+9q3kMzrvm3gH+9ivfNw8DPAKcD31n6tzrLmpf0/wfgN8eynz1yGYGqOlBVd7flvwIeZHLHgbVuC3BLTdwBnJHk3KGLai4FHq6qx4YuZKmq+hbw1JLmLcDOtrwTuGqZl74Z2FNVT1XV08Ae4PITVmizXL1V9bWqOtR
W72DyubPRWGEfr8YLt5mqqh8Bh28zdcIdqeYkAd4BfHYWtayG4TIySTYDrwfuXKb7F5J8J8lXkvzcTAtbXgFfS3JXu/vBUsvdkmcsoXk1K/8hjm0/A5xTVQfa8hPAOcuMGev+fh+TI9jlHO09NGsfaFN5N68w9TjWffxPgSer6qEV+me+nw2XEUnyE8DvAx+qqh8u6b6byRTOPwL+E/A/Z13fMn6xqi5kcifqa5O8aeiCVqN9yPZtwP9YpnuM+/lFajLPsSYu80zyceAQcOsKQ8b0HroR+AfAPwYOMJlmWiveyZGPWma+nw2XkUjyCibBcmtVfWlpf1X9sKr+ui3vBl6R5OwZl7m0pv3t+SDwZSZTBtOOekuegVwB3F1VTy7tGON+bp48PKXYng8uM2ZU+zvJe4C3Au9qgfgSq3gPzUxVPVlVz1XV88B/XaGWUe1jgCTrgF8FPr/SmCH2s+EyAm2+9Cbgwar63RXG/P02jiQXMfm3+8HsqnxJPa9O8pOHl5mcwL1vybBdwLvbVWMXA89MTe0MacX/5Y1tP0/ZBRy++msrcNsyY74KXJZkfZvSuay1zVwmX9z3EeBtVfXsCmNW8x6amSXnA39lhVrGeJupXwa+V1X7luscbD/P8uoBH8s/gF9kMs1xL3BPe1wJvB94fxvzAeB+Jlen3AH8k4Fr/plWy3daXR9v7dM1h8kXuD0MfBeYH8G+fjWTsHjNVNuo9jOT4DsA/C2TOf1rgLOA24GHgD8Gzmxj55l86+rh174PWGiP9w5Y7wKTcxOH38//pY19LbD7SO+hAWv+7+19ei+TwDh3ac1t/UomV3Q+PHTNrf0zh9+/U2MH389+Ql+S1J3TYpKk7gwXSVJ3hoskqTvDRZLUneEiSerupPqyMGkMkqzqPlrVPpdwosdLQ/BSZKmzJKv6o6qqwx/WPKHjpSE4LSadGK8DXrHCY/MA46WZclpMOjGeqx/fcv5Fkjw3wHhppjxykSR1Z7hIkrozXCRJ3RkukqTuDBdJUneGiySpO8NFktSd4SJJ6s5wkSR1Z7hIkrozXCRJ3RkukqTuDBdJUnd+n4vUmd/nInnLfelE2DSy8dLMeeQiSerOcy6SpO4MF0lSd4aLJKk7w0WS1J3hIknqznCRJHX3/wFKs/AiOmGx0gAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.hist(data_dict.values())\n", "#plt.yscale('log')\n", "#plt.xscale('log')\n", "plt.xlabel(u'发帖数', fontsize = 20)\n", "plt.ylabel(u'人数', fontsize = 20)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:59:53.302817Z", "start_time": "2019-06-08T07:59:52.510526Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZEAAAEXCAYAAABsyHmSAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4xLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvDW2N/gAAEDhJREFUeJzt3V+IpFV+xvHn6TEGeiF9seNNZqa7XEYkk71ZKJSQGy8SMu4yGowkDk0uFmPhgrlXOrC56YvcLbImUsFhEiwUkRBGM8FciTdeWBMIGRHJrEyPPYF1NoYObF+I7i8Xbw3T3dt/6j1dp963Tn0/UBTvef/Uz94tH8976j3HESEAAFIsNF0AAGB2ESIAgGSECAAgGSECAEhGiAAAkhEiAIBkhAgAIBkhAgBIRogAAJIRIgCAZPc1XUBuJ0+ejE6n03QZADBTrl279ouIeOCo44oPkU6no+Fw2HQZADBTbG+Mcxy3swAAyQgRAEAyQgQAkGymQsT279h+1fbbtn/UdD0AMO8aDxHbl2x/Yfv6nvbztj+1fcP2i5IUEZ9ExPOS/lTS72crajCQOh1pYaF6HwyyfRQAzLLGQ0TSZUnndzbYPiHpFUmPSzon6aLtc6N9T0j6F0lXs1QzGEi9nrSxIUVU770eQQIA+2g8RCLiA0lf7ml+RNKNiPgsIr6S9KakJ0fHX4mIxyWtZilobU3a3t7dtr1dtQMAdmnrcyKnJH2+Y3tT0qO2H5P0lKTf1CE9Eds9ST1JWl5ervfJt27VaweAOdbWENlXRLwv6f0xjutL6ktSt9uNWh+yvFzdwtqvHQCwS+O3sw5wW9KZHdunR235ra9Li4u72xYXq3YAwC5tDZGPJD1k+0Hb90t6RtKVOhewfcF2f2trq94nr65K/b60siLZ1Xu/X7UDAHZxRL27PRMvwH5D0mOSTkr6uaQfR8Rrtr8v6SeSTki6FBFJXYFutxvMnQUA9di+FhHdo45rfEwkIi4e0H5VuX7GCwCYiLbezgIAzIBiQyR5TAQAMLZiQyQi3omI3tLSUtOlAECxig0RAEB+hAgAIFmxIcKYCADkV2yIMCYCAPkVGyIAgPwIEQBAsmJDhDERAMiv2BBhTAQA8is2RAAA+REiAIBkhAgAIFmxIcLAOgDkV2yIMLAOAPkVGyIAgPwIEQBAMkIEAJCMEAEAJCNEAADJig0RfuILAPkVGyL8xBcA8is2RAAA+REiAIBkhAgAIBkhAgBIRogAAJIRIgCAZMWGCM+JAEB+xYYIz4kAQH7FhggAID9CBACQjBABACQjRJBmMJA6HWlhoXofDJquCEAD7mu6AMygwUDq9aTt7Wp7Y6PalqTV1ebqAjB19ERQ39ravQC5a3u7agcwVwgR1HfrVr12AMUiRFDf8nK9dgDFIkRQ3/q6tLi4u21xsWoHMFeKDRGmPclodVXq96WVFcmu3vt9BtWBOeSIaLqGrLrdbgyHw6bLAICZY
vtaRHSPOq7YnggAID9CBACQjBABACQjRAAAyQgRAEAyQgQAkIwQAQAkI0QAAMkIEQBAMkIEAJCMEAEAJCNEAADJCBEAQDJCBACQ7L6mC6jL9h9L+oGk35L0WkT8W8MlAcDcakVPxPYl21/Yvr6n/bztT23fsP2iJEXEP0fEc5Kel/RnTdQLAKi0IkQkXZZ0fmeD7ROSXpH0uKRzki7aPrfjkL8a7QcANKQVIRIRH0j6ck/zI5JuRMRnEfGVpDclPenK30j614j492nXCgC4pxUhcoBTkj7fsb05avtLSX8g6Wnbz+93ou2e7aHt4Z07d/JXCgBzauYG1iPiZUkvH3FMX1JfqtZYn0ZdADCP2twTuS3pzI7t06M2oJ7BQOp0pIWF6n0waLoioBhtDpGPJD1k+0Hb90t6RtKVcU+2fcF2f2trK1uBmAGDgdTrSRsbUkT13usRJMCEtCJEbL8h6UNJD9vetP1sRHwt6QVJ70n6RNJbEfHxuNeMiHciore0tJSnaMyGtTVpe3t32/Z21Q7g2FoxJhIRFw9ovyrp6pTLQUlu3arXDqCWVvREgGyWl+u1A6il2BBhTASSpPV1aXFxd9viYtUO4NiKDRHGRCBJWl2V+n1pZUWyq/d+v2oHcGytGBMBslpdJTSATIrtiXA7CwDyKzZEuJ0FAPkVGyIAgPwIEQBAMkIEAJCs2BBhYB0A8is2RBhYB4D8ig0RAEB+hAgAIBkhAuTCYliYA8VOe2L7gqQLZ8+ebboUzKO7i2HdXcvk7mJYElOwoCiOKHsJ8m63G8PhsOkyMG86nSo49lpZkW7enHY1QG22r0VE96jjuJ0F5MBiWJgThAiQA4thYU4cGiK2T9n+5hivX9n+2vZvT+sfCGgFFsPCnBhnYN2SOpK+Sbi+JdF/x/y5O3i+tlbdwlpergKEQXUUZpwQCUmbEfGrlA+wnXLasfHrLDSOxbAwB4odE2HaEwDIr9gQAQDkR4gAAJIRIgCAZIQIACDZuHNnLdtO+nWWql93AQAKNE5PxJJ+Julm4quZ3/gCpWOWYLTAoT2RiLgtbnkB7cMswWiJYgOCNdZRtLW1ewFy1/Z21Q5MUbFzZ/GwIYrGLMFoCebOAmbR8vL+65UwSzCmrNi5s4Cira/vHhORmCUYjSh2TAQo2uqq1O9XKyXa1Xu/z6A6pq7YNdaB4jFLMFqAnggAnjlBMnoiwLzjmRMcAz0RYN7xzAmOgbmzgHnHMyc4hnGfE/mZmAMLKBPPnOAYDr2dFRG3I2IhIk6M3lNf/z2tf6C7mPYEGNP6evWMyU48c4IxFTsmwrQnwJh45gTHwK+zAPDMCZIV2xMBAORHiAAAkhEiAIBkhAgAIBkhAgBIRogAAJIRIgCAZIQIACAZIQIASEaIAACSESIAgGSECIDjS11eN+U8lvJtFSZgBHA8qcvrppzHUr6t44jZWXjQ9nckrUlaioinxzmn2+3GcDjMWxgwzzqd/Re1WlmRbt6c7Hmpn4XabF+LiO5RxzV+O8v2Jdtf2L6+p/287U9t37D9oiRFxGcR8WwzlQLYV+ryuinnsZRv6zQeIpIuSzq/s8H2CUmvSHpc0jlJF22fm35pAI500DK6Ry2vm3Je6mchm8ZDJCI+kPTlnuZHJN0Y9Ty+kvSmpCenXhyAo6Uur5tyHkv5tk7jIXKAU5I+37G9KemU7W/bflXS92y/dNDJtnu2h7aHd+7cyV0rMN9Sl9dNOY+lfFunFQPrtjuS3o2I7462n5Z0PiL+YrT955IejYgX6l6bgXUAqG9mBtYPcFvSmR3bp0dtAIAWaWuIfCTpIdsP2r5f0jOSrtS5gO0LtvtbW1tZCgQAtCBEbL8h6UNJD9vetP1sRHwt6QVJ70n6RNJbEfFxnetGxDsR0VtaWpp80QAASS14Yj0iLh7QflXS1SmXAwCoofGeCABgdhUbIoyJAEB+xYYIYyIAkF+xIQIAyK/YEOF2FgDkV2yIcDsLAPIrNkQAA
PkRIgCAZIQIACBZsSHCwDoA5FdsiDCwDgD5FRsiAKDBQOp0pIWF6n0waLqi4jQ+ASMAZDEYSL2etL1dbW9sVNsSKyFOED0RAGVaW7sXIHdtb1ftmJhiQ4SBdWDO3bpVrx1Jig0RBtaBObe8XK8dSYoNEQBzbn1dWlzc3ba4WLVjYggRAGVaXZX6fWllRbKr936fQfUJ49dZAMq1ukpoZEZPBACQrNgQ4ddZAJBfsSHCr7MAIL9iQwQAkB8hAgBIRogAAJIRIgCAZIQIACAZIQIASEaIAACSFRsiPGwIAPkVGyI8bAgA+RUbIgCA/AgRAEAyQgQAkIwQAQAkI0QAAMkIEQBAMkIEAAYDqdORFhaq98Eg7Zi2mGKtrLEOYL4NBlKvJ21vV9sbG9W2dG999nGOaYsp1+qImPhF26Tb7cZwOGy6DABt1elU/6Lda2VFunlz/GPaYkK12r4WEd2jjiv2dhbTngAYy61bR7ePc0xbTLnWYkOEaU8AjGV5+ej2cY5piynXWmyIAMBY1telxcXdbYuLVXudY9piyrUSIgDm2+qq1O9XYwZ29d7v7x6EHueYtphyrQysAwB+zdwPrAMA8iNEAADJCBEAQDJCBACQjBABACQjRAAAyQgRAEAyQgQAkIwQAQAkI0QAAMkIEQBAspla2dD2tyT9raSvJL0fES1enxIAytd4T8T2Jdtf2L6+p/287U9t37D94qj5KUlvR8Rzkp6YerEAcJCj1jU/bH+OfVPShp7IZUk/lfSPdxtsn5D0iqQ/lLQp6SPbVySdlvSfo8O+mW6ZAHCAo9Y1P2y/NPl9U5yivhVTwdvuSHo3Ir472v49SX8dEX802n5pdOimpP+NiHdtvxkRzxx1baaCB5DdUeuaH7Zfmvy+Caz7Pu5U8G3oieznlKTPd2xvSnpU0suSfmr7B5LeOehk2z1JPUlabuPylQDKctS65inrnufYl0HjYyJ1RMQvI+KHEfGjwwbVI6IfEd2I6D7wwAPTLBHAPDpqXfPD9ufYN0VtDZHbks7s2D49agOA9jlqXfPD9ufYN00R0fhLUkfS9R3b90n6TNKDku6X9B+SfrfmNS9I6p89ezYAILvXX49YWYmwq/fXXx9/f459xyRpGGP8u7bxgXXbb0h6TNJJST+X9OOIeM329yX9RNIJSZciIileGVgHgPpmZmA9Ii4e0H5V0tUplwMAqKGtYyIAgBlQbIjYvmC7v7W11XQpAFCsYkMkIt6JiN7S0lLTpQBAsYoNEQBAfo0PrOdi+4Kqn/n+n+3/2rFrSdK497hOSvrFpGsrTJ2/Z9OaqjX3507y+se9Vur5KefxXZ6svX/PlbHOGud3wCW9JPVrHDvW76Tn+VXn79n0q6lac3/uJK9/3Gulnp9yHt/lyb5S/7ebx9tZB865hSSz9PdsqtbcnzvJ6x/3Wqnnp5w3S//fmwVJf8/GHzZsM9vDGONhGwDtxnc5n3nsidTRb7oAABPBdzkTeiIAgGT0RAAAyQgRAEAyQgQAkIwQqcH2t2z/g+2/t73adD0A6rP9Hduv2X676VpKMPchYvuS7S9sX9/Tft72p7Zv2H5x1PyUpLcj4jlJT0y9WAD7qvM9jojPIuLZZiotz9yHiKTLks7vbLB9QtIrkh6XdE7SRdvnVC3T+/nosG+mWCOAw13W+N9jTNDch0hEfCDpyz3Nj0i6Mfovlq8kvSnpSUmbqoJE4m8HtEbN7zEmiH8R7u+U7vU4pCo8Tkn6J0l/YvvvxJQLQNvt+z22/W3br0r6nu2XmimtHMXO4ptDRPxS0g+brgNAuoj4H0nPN11HKeiJ7O+2pDM7tk+P2gDMDr7HU0CI7O8jSQ/ZftD2/ZKekXSl4ZoA1MP3eArmPkRsvyHpQ0kP2960/WxEfC3pBUnvSfpE0lsR8XGTdQI4GN/j5jABIwAg2dz3RAAA6QgRAEAyQgQAkIwQAQAkI0QAAMkIEQBAMqY9ARLYPn30UVJEbE7jeKApP
CcCJLA91hcnIjyN44GmcDsLSHdW0m8c8Oo0cDwwddzOAtJ9M5pa49fY3m/RstzHA1NHTwQAkIwQAQAkI0QAAMkIEQBAMkIEAJCMEAEAJCNEAADJCBEAQDJCBACQjBABACQjRAAAyQgRAEAyQgQAkIz1RIAErCcCVJgKHkhzpmXHA42gJwIASMaYCAAgGSECAEhGiAAAkhEiAIBkhAgAIBkhAgBI9v8o6Btow2NUUwAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tweet_dict = defaultdict(int)\n", "for i in data_dict.values():\n", " tweet_dict[i] += 1\n", " \n", "plt.loglog(tweet_dict.keys(), tweet_dict.values(), 'ro')#linewidth=2) \n", "plt.xlabel(u'推特数', fontsize=20)\n", "plt.ylabel(u'人数', fontsize=20 )\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T08:00:01.767550Z", "start_time": "2019-06-08T08:00:00.760854Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import numpy as np\n", "import statsmodels.api as sm\n", "\n", "def powerPlot(d_value, d_freq, color, marker):\n", " d_freq = [i + 1 for i in d_freq]\n", " d_prob = [float(i)/sum(d_freq) for i in d_freq]\n", " #d_rank = ss.rankdata(d_value).astype(int)\n", " x = np.log(d_value)\n", " y = np.log(d_prob)\n", " xx = sm.add_constant(x, prepend=True)\n", " res = sm.OLS(y,xx).fit()\n", " constant,beta = res.params\n", " r2 = res.rsquared\n", " plt.plot(d_value, d_prob, linestyle = '',\\\n", " color = color, marker = marker)\n", " plt.plot(d_value, np.exp(constant+x*beta),\"red\")\n", " plt.xscale('log'); plt.yscale('log')\n", " plt.text(max(d_value)/2,max(d_prob)/10,\n", " r'$\\beta$ = ' + str(round(beta,2)) +'\\n' + r'$R^2$ = ' + str(round(r2, 2)), fontsize = 20)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T11:13:27.176717Z", "start_time": "2018-04-28T11:13:26.420464Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "histo, bin_edges = np.histogram(list(data_dict.values()), 15)\n", "bin_center = 0.5*(bin_edges[1:] + bin_edges[:-1])\n", "powerPlot(bin_center,histo, 'r', '^')\n", "#lg=plt.legend(labels = [u'Tweets', u'Fit'], loc=3, fontsize=20)\n", "plt.ylabel(u'概率', fontsize=20)\n", "plt.xlabel(u'推特数', fontsize=20) \n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T11:14:19.219105Z", "start_time": "2018-04-28T11:14:19.171044Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import statsmodels.api as sm\n", "from collections import defaultdict\n", "import numpy as np\n", "\n", "def powerPlot2(data):\n", " d = sorted(data, reverse = True )\n", " d_table = defaultdict(int)\n", " for k in d:\n", " d_table[k] += 1\n", " d_value = sorted(d_table)\n", " d_value = [i+1 for i in d_value]\n", " d_freq = [d_table[i]+1 for i in d_value]\n", " d_prob = [float(i)/sum(d_freq) for i in d_freq]\n", " x = np.log(d_value)\n", " y = np.log(d_prob)\n", " xx = sm.add_constant(x, prepend=True)\n", " res = sm.OLS(y,xx).fit()\n", " constant,beta = res.params\n", " r2 = res.rsquared\n", " plt.plot(d_value, d_prob, 'ro')\n", " plt.plot(d_value, np.exp(constant+x*beta),\"red\")\n", " plt.xscale('log'); plt.yscale('log')\n", " plt.text(max(d_value)/2,max(d_prob)/5,\n", " 'Beta = ' + str(round(beta,2)) +'\\n' + 'R squared = ' + str(round(r2, 2)))\n", " plt.title('Distribution')\n", " plt.ylabel('P(K)')\n", " plt.xlabel('K')\n", " plt.show()\n", " " ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T11:14:26.818914Z", "start_time": "2018-04-28T11:14:26.499569Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "powerPlot2(data_dict.values())" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T11:14:50.947967Z", "start_time": "2018-04-28T11:14:50.933308Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import powerlaw\n", "def plotPowerlaw(data,ax,col,xlab):\n", " fit = powerlaw.Fit(data,xmin=2)\n", " #fit = powerlaw.Fit(data)\n", " fit.plot_pdf(color = col, linewidth = 2)\n", " a,x = (fit.power_law.alpha,fit.power_law.xmin)\n", " fit.power_law.plot_pdf(color = col, linestyle = 'dotted', ax = ax, \\\n", " label = r\"$\\alpha = %d \\:\\:, x_{min} = %d$\" % (a,x))\n", " ax.set_xlabel(xlab, fontsize = 20)\n", " ax.set_ylabel('$Probability$', fontsize = 20)\n", " plt.legend(loc = 0, frameon = False)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T11:14:53.968210Z", "start_time": "2018-04-28T11:14:53.962880Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from collections import defaultdict\n", "data_dict = defaultdict(int)\n", "\n", "for i in df['From User']:\n", " data_dict[i] += 1" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T11:14:57.469192Z", "start_time": "2018-04-28T11:14:56.922983Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.cm as cm\n", "cmap = cm.get_cmap('rainbow_r',6)\n", "\n", "fig = plt.figure(figsize=(6, 4),facecolor='white')\n", "ax = fig.add_subplot(1, 1, 1)\n", "plotPowerlaw(list(data_dict.values()), ax,cmap(1), \n", " '$Tweets$')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 5. 清洗tweets文本" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2019-04-01T03:46:46.498846Z", "start_time": "2019-04-01T03:46:46.496236Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "tweet = '''RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! \n", " #OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE @chengjun @mili http://computational-communication.com \n", " http://ccc.nju.edu.cn RT !!HELP!!!!'''" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T08:02:19.500334Z", "start_time": "2019-06-08T08:02:19.259536Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import re\n", "\n", "import twitter_text\n", "# https://github.com/dryan/twitter-text-py/issues/21\n", "#Macintosh HD ▸ 用户 ▸ datalab ▸ 应用程序 ▸ anaconda ▸ lib ▸ python3.5 ▸ site-packages" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# 安装twitter_text\n", "\n", "[twitter-text-py](https://github.com/dryan/twitter-text-py/issues/21) could not be used for python 3\n", "\n", "\n", "> ### pip install twitter-text\n", "\n", "Glyph debug the problem, and make [a new repo of twitter-text-py3](https://github.com/glyph/twitter-text-py).\n", "\n", "> ## pip install twitter-text\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# 无法正常安装的同学\n", "## 可以在spyder中打开terminal安装\n", "\n", "pip install twitter-text" ] }, 
{ "cell_type": "code", "execution_count": 35, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T08:04:37.675542Z", "start_time": "2019-06-08T08:04:37.668241Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'AnonKitsu: @who'" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "\n", "tweet = '''RT @AnonKitsu: @who ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! \n", " #OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE @chengjun @mili http://computational-communication.com \n", " http://ccc.nju.edu.cn RT !!HELP!!!!'''\n", "\n", "rt_patterns = re.compile(r\"(RT|via)((?:\\b\\W*@\\w+)+)\", \\\n", " re.IGNORECASE)\n", "rt_user_name = rt_patterns.findall(tweet)[0][1].strip(' @')#.split(':')[0]\n", "rt_user_name" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2019-04-01T03:59:45.727956Z", "start_time": "2019-04-01T03:59:45.720369Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'AnonKitsu'" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "\n", "tweet = '''RT @AnonKitsu: @who ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! 
\n", " #OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE @chengjun @mili http://computational-communication.com \n", " http://ccc.nju.edu.cn RT !!HELP!!!!'''\n", "\n", "rt_patterns = re.compile(r\"(RT|via)((?:\\b\\W*@\\w+)+)\", \\\n", "                         re.IGNORECASE)\n", "rt_user_name = rt_patterns.findall(tweet)[0][1].strip(' @').split(':')[0]\n", "rt_user_name" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T08:05:00.196880Z", "start_time": "2019-06-08T08:05:00.188010Z" }, "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[]\n", "None\n" ] } ], "source": [ "import re\n", "\n", "tweet = '''@chengjun:@who ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! \n", " #OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE @chengjun @mili http://computational-communication.com \n", " http://ccc.nju.edu.cn RT !!HELP!!!!'''\n", "\n", "rt_patterns = re.compile(r\"(RT|via)((?:\\b\\W*@\\w+)+)\", re.IGNORECASE)\n", "rt_user_name = rt_patterns.findall(tweet)\n", "print(rt_user_name)\n", "\n", "if rt_user_name:\n", "    print('it exists.')\n", "else:\n", "    print('None')" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T08:05:27.804540Z", "start_time": "2019-06-08T08:05:27.795572Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import re\n", "\n", "def extract_rt_user(tweet):\n", "    # Return the first retweeted user name, or None if the tweet is not a retweet\n", "    rt_patterns = re.compile(r\"(RT|via)((?:\\b\\W*@\\w+)+)\", re.IGNORECASE)\n", "    rt_user_name = rt_patterns.findall(tweet)\n", "    if rt_user_name:\n", "        rt_user_name = rt_user_name[0][1].strip(' @').split(':')[0]\n", "    else:\n", "        rt_user_name = None\n", "    return rt_user_name" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T08:05:31.592897Z", "start_time": "2019-06-08T08:05:31.587624Z" }, "slideshow": { "slide_type": "subslide" } }, 
"outputs": [ { "data": { "text/plain": [ "'chengjun'" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweet = '''RT @chengjun: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! \n", " #OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE @chengjun @mili http://computational-communication.com \n", " http://ccc.nju.edu.cn RT !!HELP!!!!'''\n", "\n", "extract_rt_user(tweet) " ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T08:05:42.978825Z", "start_time": "2019-06-08T08:05:42.975151Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "None\n" ] } ], "source": [ "tweet = '''@chengjun: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! \n", " #OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE @chengjun @mili http://computational-communication.com \n", " http://ccc.nju.edu.cn RT !!HELP!!!!'''\n", "\n", "print(extract_rt_user(tweet) )" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T08:06:01.060683Z", "start_time": "2019-06-08T08:06:01.032491Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "[('RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! #OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE RT !!HELP!!!!',\n", " 'Anonops_Cop'),\n", " ('@jamiekilstein @allisonkilkenny Interesting interview (never aired, wonder why??) by Fox with #ows protester http://t.co/Fte55Kh7',\n", " 'KittyHybrid'),\n", " (\"@Seductivpancake Right! Those guys have a victory condition: regime change. 
#ows doesn't seem to have a goal I can figure out.\",\n", " 'nerdsherpa')]" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import csv\n", "\n", "with open(\"../data/ows_tweets_sample.txt\", 'r') as f:\n", "    chunk = f.readlines()\n", "    \n", "rt_network = []\n", "lines = csv.reader(chunk[1:], delimiter=',', quotechar='\"')\n", "tweet_user_data = [(i[1], i[8]) for i in lines]\n", "tweet_user_data[:3]" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T08:07:37.624179Z", "start_time": "2019-06-08T08:07:37.588574Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[(('OccupyNCGBORO', 'angela0328'), 1),\n", " (('evlance', 'KeithOlbermann'), 1),\n", " (('Lusho0487', 'anonops'), 1)]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from collections import defaultdict\n", "\n", "rt_network = []\n", "rt_dict = defaultdict(int)\n", "for k, i in enumerate(tweet_user_data):\n", "    tweet,user = i\n", "    rt_user = extract_rt_user(tweet)\n", "    if rt_user:\n", "        rt_network.append((user, rt_user)) #(rt_user,' ', user, end = '\\n')\n", "        rt_dict[(user, rt_user)] += 1\n", "#rt_network[:5]\n", "list(rt_dict.items())[:3]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Getting the cleaned tweet text\n", "\n", "Without user names, URLs, or symbols such as `RT @`." ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T08:08:42.317807Z", "start_time": "2019-06-08T08:08:42.309193Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def extract_tweet_text(tweet, at_names, urls):\n", "    for i in at_names:\n", "        tweet = tweet.replace(i, '')\n", "    for j in urls:\n", "        tweet = tweet.replace(j, '')\n", "    marks = ['RT @', '@', '\"', '#', '\\n', '\\t', ' ']\n", "    for k in marks:\n", "        tweet = tweet.replace(k, '')\n", "    return tweet" ] 
}, { "cell_type": "code", "execution_count": 43, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T08:09:07.984224Z", "start_time": "2019-06-08T08:09:07.973948Z" }, "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['AnonKitsu', 'chengjun', 'mili'] ['http://computational-communication.com', 'http://ccc.nju.edu.cn'] ['OCCUPYWALLSTREET', 'OWS', 'OCCUPYNY'] AnonKitsu -------->\n" ] } ], "source": [ "import twitter_text\n", "\n", "tweet = '''RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! \n", " #OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE @chengjun @mili http://computational-communication.com \n", " http://ccc.nju.edu.cn RT !!HELP!!!!'''\n", "\n", "ex = twitter_text.Extractor(tweet)\n", "at_names = ex.extract_mentioned_screen_names()\n", "urls = ex.extract_urls()\n", "hashtags = ex.extract_hashtags()\n", "rt_user = extract_rt_user(tweet)\n", "#tweet_text = extract_tweet_text(tweet, at_names, urls)\n", "\n", "print(at_names, urls, hashtags, rt_user,'-------->')#, tweet_text)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T08:10:11.740636Z", "start_time": "2019-06-08T08:10:11.722855Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import csv\n", "\n", "lines = csv.reader(chunk,delimiter=',', quotechar='\"')\n", "tweets = [i[1] for i in lines]" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T08:10:16.517097Z", "start_time": "2019-06-08T08:10:16.506944Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[] [] [] None\n", "['AnonKitsu'] [] ['OCCUPYWALLSTREET', 'OWS', 'OCCUPYNY'] AnonKitsu\n", "['jamiekilstein', 'allisonkilkenny'] ['http://t.co/Fte55Kh7'] ['ows'] None\n", "['Seductivpancake'] [] ['ows'] None\n", "['bembel'] 
['http://j.mp/rhHavq'] ['OccupyWallStreet', 'OWS'] bembel\n" ] } ], "source": [ "for tweet in tweets[:5]:\n", "    ex = twitter_text.Extractor(tweet)\n", "    at_names = ex.extract_mentioned_screen_names()\n", "    urls = ex.extract_urls()\n", "    hashtags = ex.extract_hashtags()\n", "    rt_user = extract_rt_user(tweet)\n", "    #tweet_text = extract_tweet_text(tweet, at_names, urls)\n", "\n", "    print(at_names, urls, hashtags, rt_user)\n", "    #print(tweet_text)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "slide" } }, "source": [ "# Exercise:\n", "\n", "### Extract the retweet network between rt_user and user from the raw tweets\n", "\n", "## Format:\n", "rt_user1, user1, 3\n", "\n", "rt_user2, user3, 2\n", "\n", "rt_user2, user4, 1\n", "\n", "...\n", "\n", "Save the data in CSV format." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Readings" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 0, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "1260px", "left": "1835px", "top": 
"224px", "width": "512px" }, "toc_section_display": false, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 1 }