{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "***\n", "# 数据清洗之推特数据\n", "***\n", "***\n", "\n", "王成军\n", "\n", "wangchengjun@nju.edu.cn\n", "\n", "计算传播网 http://computational-communication.com" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 数据清洗(data cleaning)\n", "是数据分析的重要步骤,其主要目标是将混杂的数据清洗为可以被直接分析的数据,一般需要将数据转化为数据框(data frame)的样式。\n", "\n", "本章将以推特文本的清洗作为例子,介绍数据清洗的基本逻辑。\n", "\n", "- 清洗错误行\n", "- 正确分列\n", "- 提取所要分析的内容\n", "- 介绍通过按行、chunk的方式对大规模数据进行预处理\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 1. 抽取tweets样本做实验\n", "此节学生略过" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2752\n" ] } ], "source": [ "bigfile = open('/Users/chengjun/百度云同步盘/Writing/OWS/ows-raw.txt', 'r')\n", "chunkSize = 1000000\n", "chunk = bigfile.readlines(chunkSize)\n", "print(len(chunk))\n", "with open(\"/Users/chengjun/GitHub/cjc/data/ows_tweets_sample.txt\", 'w') as f:\n", " for i in chunk:\n", " f.write(i) " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Lazy Method for Reading Big File in Python?" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:31:51.644484Z", "start_time": "2019-06-08T07:30:56.170308Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 262665\n", "1 525130\n", "2 787344\n", "3 1049351\n", "4 1312571\n", "5 1574666\n", "6 1835628\n", "7 2097136\n", "8 2358494\n", "9 2619723\n", "10 2880857\n", "11 3140945\n", "12 3404775\n", "13 3665565\n", "14 3927996\n", "15 4189419\n", "16 4449078\n", "17 4709001\n", "18 4969877\n", "19 5230937\n", "20 5492578\n", "21 5756613\n", "22 6022478\n", "23 6286119\n", "24 6549476\n", "25 6602141\n" ] } ], "source": [ "# https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python?lq=1\n", "import csv\n", "bigfile = open('/Users/datalab/bigdata/cjc/ows-raw.txt', 'r')\n", "\n", "chunkSize = 10**8\n", "chunk = bigfile.readlines(chunkSize)\n", "num, num_lines = 0, 0\n", "while chunk:\n", " lines = csv.reader((line.replace('\\x00','') for line in chunk), \n", " delimiter=',', quotechar='\"')\n", " #do sth.\n", " num_lines += len(list(lines))\n", " print(num, num_lines)\n", " num += 1\n", " chunk = bigfile.readlines(chunkSize) # read another chunk" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# 字节(Byte /bait/)\n", "\n", "计算机信息技术用于计量存储容量的一种计量单位,通常情况下一字节等于有八位, [1] 也表示一些计算机编程语言中的数据类型和语言字符。\n", "- 1B(byte,字节)= 8 bit;\n", "- 1KB=1000B;1MB=1000KB=1000×1000B。其中1000=10^3。\n", "- 1KB(kilobyte,千字节)=1000B= 10^3 B;\n", "- 1MB(Megabyte,兆字节,百万字节,简称“兆”)=1000KB= 10^6 B;\n", "- 1GB(Gigabyte,吉字节,十亿字节,又称“千兆”)=1000MB= 10^9 B;" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 用Pandas的get_chunk功能来处理亿级数据\n", "\n", "> 只有在超过5TB数据量的规模下,Hadoop才是一个合理的技术选择。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "f = open('../bigdata/OWS/ows-raw.txt',encoding='utf-8')\n", "reader = pd.read_table(f, sep=',', iterator=True, error_bad_lines=False) #跳过报错行\n", "loop = True\n", "chunkSize = 100000\n", "data = []\n", "\n", "while loop:\n", " try:\n", " chunk = reader.get_chunk(chunkSize)\n", " dat = data_cleaning_funtion(chunk) # do sth.\n", " data.append(dat) \n", " except StopIteration:\n", " loop = False\n", " print(\"Iteration is stopped.\")\n", "\n", "df = pd.concat(data, ignore_index=True)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 2. 清洗错行的情况" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:42:24.661108Z", "start_time": "2019-06-08T07:42:24.648304Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "with open(\"../data/ows_tweets_sample.txt\", 'r') as f:\n", " lines = f.readlines() " ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:42:28.452634Z", "start_time": "2019-06-08T07:42:28.441018Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "2753" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 总行数\n", "len(lines)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:42:32.821269Z", "start_time": "2019-06-08T07:42:32.816918Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'121813245488140288,\"@HumanityCritic i\\'m worried that the #ows sells out to the hamsher-norquist spitefuck, and tries to unite with the teahad.\",http://a2.twimg.com/profile_images/627683576/flytits_normal.jpg,2011-10-06,5,5,\"2011-10-06 05:05:15\",N;,fucentarmal,27480502,en,HumanityCritic,230431,\"<a href="http://www.tweetdeck.com" rel="nofollow">TweetDeck</a>\"\\n'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 查看第一行\n", "lines[15]" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on built-in function split:\n", "\n", "split(...) method of builtins.str instance\n", " S.split(sep=None, maxsplit=-1) -> list of strings\n", " \n", " Return a list of the words in S, using sep as the\n", " delimiter string. If maxsplit is given, at most maxsplit\n", " splits are done. If sep is not specified or is None, any\n", " whitespace string is a separator and empty strings are\n", " removed from the result.\n", "\n" ] } ], "source": [ "help(lines[1].split)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# 问题: 第一行是变量名\n", "> ## 1. 如何去掉换行符?\n", "> ## 2. 如何获取每一个变量名?\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:43:39.363547Z", "start_time": "2019-06-08T07:43:39.358317Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['\"Twitter ID\"',\n", " 'Text',\n", " '\"Profile Image URL\"',\n", " 'Day',\n", " 'Hour',\n", " 'Minute',\n", " '\"Created At\"',\n", " 'Geo',\n", " '\"From User\"',\n", " '\"From User ID\"',\n", " 'Language',\n", " '\"To User\"',\n", " '\"To User ID\"',\n", " 'Source']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "varNames = lines[0].replace('\\n', '').split(',')\n", "varNames" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:43:49.131388Z", "start_time": "2019-06-08T07:43:49.127319Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "14" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(varNames)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:43:53.979866Z", "start_time": "2019-06-08T07:43:53.975920Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'121818600490283009,\"RT @chachiTHEgr8: RT @TheNewDeal: First they ignore you, then they laugh at you, then they fight you, then you win. - Gandhi #OccupyWallStreet #OWS #p2\",http://a0.twimg.com/profile_images/326662126/Photo_233_normal.jpg,2011-10-06,5,26,\"2011-10-06 05:26:32\",N;,k_l_h_j,382233343,en,,0,\"<a href="http://twitter.com/#!/download/iphone" rel="nofollow">Twitter for iPhone</a>\"\\n'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lines[1344]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# 如何来处理错误换行情况?" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T10:57:03.746530Z", "start_time": "2018-04-28T10:57:03.727339Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "with open(\"../data/ows_tweets_sample_clean.txt\", 'w') as f:\n", " right_line = '' # 正确的行,它是一个空字符串\n", " blocks = [] # 确认为正确的行会被添加到blocks里面\n", " for line in lines:\n", " right_line += line.replace('\\n', ' ')\n", " line_length = len(right_line.split(','))\n", " if line_length >= 14:\n", " blocks.append(right_line)\n", " right_line = '' \n", " for i in blocks:\n", " f.write(i + '\\n')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T10:57:07.915900Z", "start_time": "2018-04-28T10:57:07.911441Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "2627" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(blocks)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T10:57:16.586149Z", "start_time": "2018-04-28T10:57:16.582151Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'121818879105310720,\"RT @Min_Reyes: RT @The99Percenters: New video to go viral. From We Are Change http://t.co/6Ff718jk Listen to the guy begging... #ows #cdnpoli\",http://a3.twimg.com/sticky/default_profile_images/default_profile_0_normal.png,2011-10-06,5,27,\"2011-10-06 05:27:38\",N;,MiyazakiMegu,260948518,en,,0,\"<a href="http://www.tweetdeck.com" rel="nofollow">TweetDeck</a>\" '" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "blocks[1344]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# 同时考虑分列符和引用符\n", "\n", "- 分列符🔥分隔符:sep, delimiter\n", "- 引用符☁️:quotechar\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:51:20.459071Z", "start_time": "2019-06-08T07:51:20.453871Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['121813245488140288',\n", " \"@HumanityCritic i'm worried that the #ows sells out to the hamsher-norquist spitefuck, and tries to unite with the teahad.\",\n", " 'http://a2.twimg.com/profile_images/627683576/flytits_normal.jpg,2011-10-06,5,5',\n", " '2011-10-06 05:05:15',\n", " 'N;,fucentarmal,27480502,en,HumanityCritic,230431',\n", " '<a href="http://www.tweetdeck.com" rel="nofollow">TweetDeck</a>\"\\n']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "re.split(',\"|\",', lines[15])" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:52:33.453629Z", "start_time": "2019-06-08T07:52:33.441462Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "line = 35 length = 6\n", "line = 36 length = 6\n", "line = 37 length = 6\n", "line = 38 length = 6\n", "line = 39 length = 6\n", "line = 40 length = 6\n", "line = 41 length = 2\n", "line = 42 length = 5\n", "line = 43 length = 6\n", "line = 44 length = 6\n", "line = 45 length = 6\n", "line = 46 length = 6\n", "line = 47 length = 6\n", "line = 48 length = 2\n", "line = 49 length = 5\n" ] } ], "source": [ "import re\n", "\n", "with open(\"../data/ows_tweets_sample.txt\",'r') as f:\n", " lines = f.readlines()\n", " \n", "for i in range(35,50):\n", " i_ = re.split(',\"|\",', lines[i])\n", " print('line =',i,' length =', len(i_))\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:54:54.976462Z", "start_time": "2019-06-08T07:54:54.944533Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "with open(\"../data/ows_tweets_sample_clean4.txt\", 'w') as f:\n", " right_line = '' # 正确的行,它是一个空字符串\n", " blocks = [] # 确认为正确的行会被添加到blocks里面\n", " for line in lines:\n", " right_line += line.replace('\\n', ' ').replace('\\r', ' ')\n", " #line_length = len(right_line.split(','))\n", " i_ = re.split(',\"|\",', right_line)\n", " line_length = len(i_)\n", " if line_length >= 6:\n", " blocks.append(right_line)\n", " right_line = ''\n", "# for i in blocks:\n", "# f.write(i + '\\n')" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:54:59.860355Z", "start_time": "2019-06-08T07:54:59.856381Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "2626" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(blocks)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 3. 读取数据、正确分列" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:55:54.719495Z", "start_time": "2019-06-08T07:55:54.712843Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# 提示:你可能需要修改以下路径名\n", "with open(\"../data/ows_tweets_sample.txt\", 'r') as f:\n", " chunk = f.readlines()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:55:57.501462Z", "start_time": "2019-06-08T07:55:57.497278Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "2753" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(chunk)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:56:00.549021Z", "start_time": "2019-06-08T07:56:00.544656Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['\"Twitter ID\",Text,\"Profile Image URL\",Day,Hour,Minute,\"Created At\",Geo,\"From User\",\"From User ID\",Language,\"To User\",\"To User ID\",Source\\n',\n", " '121813144174727168,\"RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! #OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE RT !!HELP!!!!\",http://a2.twimg.com/profile_images/1539375713/Twitter_normal.jpg,2011-10-06,5,4,\"2011-10-06 05:04:51\",N;,Anonops_Cop,401240477,en,,0,\"<a href="http://twitter.com/">web</a>\"\\n',\n", " '121813146137657344,\"@jamiekilstein @allisonkilkenny Interesting interview (never aired, wonder why??) by Fox with #ows protester http://t.co/Fte55Kh7\",http://a2.twimg.com/profile_images/1574715503/Kate6_normal.jpg,2011-10-06,5,4,\"2011-10-06 05:04:51\",N;,KittyHybrid,34532053,en,jamiekilstein,2149053,\"<a href="http://twitter.com/">web</a>\"\\n']" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chunk[:3]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:56:05.677057Z", "start_time": "2019-06-08T07:56:05.656929Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2627\n" ] } ], "source": [ "import csv\n", "lines_csv = csv.reader(chunk, delimiter=',', quotechar='\"') \n", "print(len(list(lines_csv)))\n", "# next(lines_csv)\n", "# next(lines_csv)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T01:12:38.678653Z", "start_time": "2018-04-29T01:12:38.611535Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import re\n", "import csv\n", "\n", "from collections import defaultdict\n", "\n", "def extract_rt_user(tweet):\n", " rt_patterns = re.compile(r\"(RT|via)((?:\\b\\W*@\\w+)+)\", re.IGNORECASE)\n", " rt_user_name = rt_patterns.findall(tweet)\n", " if rt_user_name:\n", " rt_user_name = rt_user_name[0][1].strip(' @')\n", " else:\n", " rt_user_name = None\n", " return rt_user_name\n", "\n", "rt_network = defaultdict(int)\n", "f = open(\"../data/ows_tweets_sample.txt\", 'r')\n", "chunk = f.readlines(100000)\n", "while chunk: \n", " #lines = csv.reader(chunk, delimiter=',', quotechar='\"') \n", " lines = csv.reader((line.replace('\\x00','') for line in chunk), delimiter=',', quotechar='\"')\n", " for line in lines:\n", " tweet = line[1]\n", " from_user = line[8]\n", " rt_user = extract_rt_user(tweet)\n", " rt_network[(from_user, rt_user)] += 1 \n", " chunk = f.readlines(100000)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T07:56:22.886245Z", "start_time": "2019-06-08T07:56:22.198448Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", " | Twitter ID | \n", "Text | \n", "Profile Image URL | \n", "Day | \n", "Hour | \n", "Minute | \n", "Created At | \n", "Geo | \n", "From User | \n", "From User ID | \n", "Language | \n", "To User | \n", "To User ID | \n", "Source | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "121813144174727168 | \n", "RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLIN... | \n", "http://a2.twimg.com/profile_images/1539375713/... | \n", "2011-10-06 | \n", "5 | \n", "4 | \n", "2011-10-06 05:04:51 | \n", "N; | \n", "Anonops_Cop | \n", "401240477 | \n", "en | \n", "NaN | \n", "0 | \n", "<a href="http://twitter.com/">... | \n", "
1 | \n", "121813146137657344 | \n", "@jamiekilstein @allisonkilkenny Interesting in... | \n", "http://a2.twimg.com/profile_images/1574715503/... | \n", "2011-10-06 | \n", "5 | \n", "4 | \n", "2011-10-06 05:04:51 | \n", "N; | \n", "KittyHybrid | \n", "34532053 | \n", "en | \n", "jamiekilstein | \n", "2149053 | \n", "<a href="http://twitter.com/">... | \n", "
2 | \n", "121813150000619521 | \n", "@Seductivpancake Right! Those guys have a vict... | \n", "http://a1.twimg.com/profile_images/1241412831/... | \n", "2011-10-06 | \n", "5 | \n", "4 | \n", "2011-10-06 05:04:52 | \n", "N; | \n", "nerdsherpa | \n", "95067344 | \n", "en | \n", "Seductivpancake | \n", "19695580 | \n", "<a href="http://www.echofon.com/"... | \n", "