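{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "\n", "***\n", "***\n", "# Data Scraping\n", " > # Scraping the Government Work Reports of Past Years\n", "***\n", "***\n", "\n", "Wang Chengjun (王成军) \n", "\n", "wangchengjun@nju.edu.cn\n", "\n", "Computational Communication (计算传播网) http://computational-communication.com" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2019-03-09T03:05:07.010558Z", "start_time": "2019-03-09T03:05:06.758870Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import requests\n", "from bs4 import BeautifulSoup" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2019-03-09T03:05:09.172450Z", "start_time": "2019-03-09T03:05:09.160153Z" }, "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import display_html, HTML\n", "HTML('')\n", "# the webpage we would like to crawl" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Inspect\n", "\n", "On the index page, each report is a link inside a table cell with class `bl`, for example:\n", "\n", "# · 2016年政府工作报告\n", "\n", "    · 2016年政府工作报告\n", "\n" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T07:46:12.420891Z", "start_time": "2018-04-28T07:46:11.907222Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "'ISO-8859-1'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# get the link for each year\n", "url = \"http://www.hprc.org.cn/wxzl/wxysl/lczf/\" \n", "content = requests.get(url)\n", "content.encoding" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Encoding\n", "\n", "- ASCII\n", "    - A 7-bit character set.\n", "    - Short for American Standard Code for Information Interchange; it was designed for American English communication.\n", "    - It consists of 128 characters: 95 printable characters (upper- and lowercase letters, the digits 0-9, punctuation, and the space) and 33 control characters (line feed, tab, backspace, bell, and so on).\n", "- iso8859-1, usually called Latin-1.\n", "    - Similar to ASCII: its first 128 code points are identical to ASCII.\n", "    - A single-byte encoding covering the range 0-255, used for English and other Western European languages. For example, the letter a is encoded as 0x61 = 97.\n", "    - It cannot represent Chinese characters.\n", "    - Because it is a single-byte encoding in which every byte value maps to a character, any byte sequence can be decoded with it, so it is still widely used and is the default in many protocols. This is likely why `requests` reported 'ISO-8859-1' above for a Chinese page." ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- gb2312/gbk/gb18030\n", "    - The Chinese national standard (GB) encodings for Chinese characters. In gb2312 and gbk a Chinese character takes two bytes, while English letters take one byte and are encoded as in ASCII (these encodings are compatible with ASCII, not with the upper half of iso8859-1).\n", "    - gbk can represent both traditional and simplified characters; the K is the initial of 'Kuo' in the pinyin Kuo Zhan (扩展, 'extension').\n", "    - gb2312 can represent only simplified characters; gbk is backward compatible with gb2312.\n", "    - gb18030, in full the national standard GB 18030-2005 *Information technology - Chinese coded character set*, is the newest of the three. It is backward compatible with gbk and uses one, two, or four bytes per character." ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- unicode\n", "    - A single character set intended to cover the characters of all languages; it assigns every character a numeric code point.\n", "    - Its fixed-width encodings use two bytes per character (UCS-2) or four bytes (UTF-32), for English letters as well, so they take more space.\n", "    - These encodings are not byte-compatible with iso8859-1 or other encodings, although the first 256 code points are the same: compared with iso8859-1, two-byte unicode simply adds a zero byte in front, so the letter a becomes \"00 61\".\n", "    - Fixed-width encodings are convenient for programs to process (note that GB2312/GBK are not fixed-width), and unicode can represent all characters, so much software handles text as unicode internally, for example Java.\n", "- UTF\n", "    - Fixed-width unicode is inconvenient for transmission and storage, which led to the UTF (Unicode Transformation Format) encodings.\n", "    - utf8 is compatible with ASCII (not with the upper half of iso8859-1) and can represent the characters of all languages.\n", "    - utf8 is a variable-length encoding: each character takes 1 to 4 bytes (the original design allowed up to 6, but RFC 3629 limits it to 4).\n", "    - utf8 stands for 8-bit Unicode Transformation Format.\n", "    - It was created by Ken Thompson in 1992 and is now standardized as RFC 3629." ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# decode\n", "With `urllib` (Python 3; this was `urllib2` in Python 2):\n", "\n", "    urllib.request.urlopen(url).read().decode('gb18030')\n", "\n", "With `requests`, set the encoding before reading the text:\n", "\n", "    content.encoding = 'gb18030'\n", "\n", "    content = content.text\n", "\n", "Or\n", "\n", "    content = content.text.encode(content.encoding).decode('gb18030')\n", "\n", "\n", "\n", "# html.parser\n", "\n", "    BeautifulSoup(content, 'html.parser')" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T07:46:20.172629Z", "start_time": "2018-04-28T07:46:20.162479Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Specify the encoding\n", "content.encoding = 'gb18030'\n", "content = content.text" ] }, 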
{ "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T07:46:22.928321Z", "start_time": "2018-04-28T07:46:22.883038Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "2017年政府工作报告" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup = BeautifulSoup(content, 'html.parser') \n", "# alternative: soup.find_all('td', {'class': 'bl'}) returns the table cells rather than the links\n", "links = soup.select('.bl a')\n", "links[0]" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T07:46:26.263394Z", "start_time": "2018-04-28T07:46:26.258925Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "48" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(links)" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T07:46:47.885519Z", "start_time": "2018-04-28T07:46:47.881368Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'./dishiyijie_10/200908/t20090818_27558.html'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "links[-1]['href']" ] },
{ "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2017-05-13T14:11:38.009711", "start_time": "2017-05-13T14:11:38.006207" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'d12qgrdzfbg/201703/t20170317_389845.html'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "links[0]['href'].split('./')[1]" ] },
{ "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2017-05-13T14:02:21.666779", "start_time": "2017-05-13T14:02:21.663290" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'http://www.hprc.org.cn/wxzl/wxysl/lczf/d12qgrdzfbg/201703/t20170317_389845.html'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url + links[0]['href'].split('./')[1]" ] },
{ "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T08:07:35.408712Z", "start_time": "2018-04-28T08:07:35.402844Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['http://www.hprc.org.cn/wxzl/wxysl/lczf/d12qgrdzfbg/201703/t20170317_389845.html',\n", " 'http://www.hprc.org.cn/wxzl/wxysl/lczf/d12qgrdzfbg/201603/t20160318_369509.html',\n", " 'http://www.hprc.org.cn/wxzl/wxysl/lczf/d12qgrdzfbg/201503/t20150318_319434.html',\n", " 'http://www.hprc.org.cn/wxzl/wxysl/lczf/d12qgrdzfbg/201403/t20140315_270863.html',\n", " 'http://www.hprc.org.cn/wxzl/wxysl/lczf/d12qgrdzfbg/201402/t20140214_266528.html']" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# build the absolute URL of every report from its relative href\n", "hyperlinks = [url + i['href'].split('./')[1] for i in links]\n", "hyperlinks[:5]" ] },
{ "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T08:07:37.437271Z", "start_time": "2018-04-28T08:07:37.432692Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "['http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_9/200908/t20090818_27563.html',\n", " 'http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_10/200908/t20090818_27561.html',\n", " 'http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_10/200908/t20090818_27560.html',\n", " 'http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_10/200908/t20090818_27559.html',\n", " 'http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_10/200908/t20090818_27558.html']" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hyperlinks[-5:]" ] },
{ "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T08:07:39.588619Z", "start_time": "2018-04-28T08:07:39.584121Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "'http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_1/200908/t20090818_27775.html'" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hyperlinks[10] # the 2007 report is split across several pages" ] },
{ "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2017-05-13T14:04:02.006511", "start_time": "2017-05-13T14:04:02.001754" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import display_html, HTML\n", "HTML('')\n", "# the 2007 report is split across several pages" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Inspect 下一页 (next page)\n", "\n", "下一页\n", "\n", "    下一页\n", "\n", "The link is an `a` element written by a `script` that sits inside a `td`:\n", "\n", "- a\n", "    - script\n", "        - td" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T07:58:29.845982Z", "start_time": "2018-04-28T07:58:29.533895Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "url_i = 'http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_1/200908/t20090818_27775.html'\n", "content = requests.get(url_i)\n", "content.encoding = 'gb18030'\n", "content = content.text\n", "#content = content.text.encode(content.encoding).decode('gb18030')\n", "soup = BeautifulSoup(content, 'html.parser') \n", "#scripts = soup.find_all('script')\n", "#scripts[0]\n", "# the pagination script is the first script inside a table cell\n", "scripts = soup.select('td script')[0]" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T07:58:31.954109Z", "start_time": "2018-04-28T07:58:31.949467Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scripts" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T07:58:35.217698Z", "start_time": "2018-04-28T07:58:35.213425Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'\\n\\tvar currentPage = 0;//所在页从0开始\\n\\tvar prevPage = currentPage-1//上一页\\n\\tvar nextPage = currentPage+1//下一页\\n\\tvar countPage = 4//共多少页\\n\\t//document.write(\"共\"+countPage+\"页  \");\\n\\t\\n\\t//循环\\n\\tvar num = 17;\\n\\tfor(var i=0+(currentPage-1-(currentPage-1)%num) ; i<=(num+(currentPage-1-(currentPage-1)%num))&&(i1){\\n\\t\\t\\tif(currentPage==i)\\n\\t\\t\\t\\tdocument.write(\"【\"+(i+1)+\"】 \");\\n\\t\\t\\telse if(i==0)\\n\\t\\t\\t\\tdocument.write(\"【\"+(i+1)+\"】 \");\\n\\t\\t\\telse\\n\\t\\t\\t\\tdocument.write(\"【\"+(i+1)+\"】 \");\\n\\t\\t}\\t\\n\\t}\\n\\t\\n\\tdocument.write(\"\");\\n\\t//设置上一页代码\\n\\tif(countPage>1&&currentPage!=0&&currentPage!=1)\\n\\t\\tdocument.write(\"上一页 \");\\n\\telse if(countPage>1&&currentPage!=0&&currentPage==1)\\n\\t\\tdocument.write(\"上一页 \");\\n\\t//else\\n\\t//\\tdocument.write(\"上一页  \");\\n\\t\\n\\t\\n\\t//设置下一页代码 \\n\\tif(countPage>1&&currentPage!=(countPage-1))\\n\\t\\tdocument.write(\"下一页  \");\\n\\t//else\\n\\t//\\tdocument.write(\"下一页  \");\\n\\t\\t\\t\\t\\t \\n\\t'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scripts.text" ] },
{ "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T08:00:21.956859Z", "start_time": "2018-04-28T08:00:21.951133Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# countPage = int(''.join(scripts).split('countPage = ')\\\n", "# [1].split('//')[0])\n", "# countPage\n", "\n", "# take the number between 'countPage = ' and the trailing '//' comment\n", "countPage = int(scripts.text.split('countPage = ')[1].split('//')[0])\n", "countPage" ] },
{ "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T08:06:17.723344Z", "start_time": "2018-04-28T08:06:17.676681Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import sys\n", "\n", "def flushPrint(s):\n", "    # overwrite the current output line to show progress\n", "    sys.stdout.write('\\r')\n", "    sys.stdout.write('%s' % s)\n", "    sys.stdout.flush()\n", "\n", "def crawler(url_i):\n", "    content = requests.get(url_i)\n", "    content.encoding = 'gb18030'\n", "    content = content.text\n", "    soup = BeautifulSoup(content, 'html.parser')\n", "    # the title span starts with the four-digit year\n", "    year = soup.find('span', {'class': 'huang16c'}).text[:4]\n", "    year = int(year)\n", "    report = ''.join(s.text for s in soup('p'))\n", "    # find the pagination info\n", "    scripts = soup.find_all('script')\n", "    countPage = int(''.join(scripts[1]).split('countPage = ')[1].split('//')[0])\n", "    # pages after the first are named <name>_1.html, <name>_2.html, ...\n", "    for i in range(1, countPage):\n", "        url_child = url_i.split('.html')[0] + '_' + str(i) + '.html'\n", "        content = requests.get(url_child)\n", "        content.encoding = 'gb18030'\n", "        content = content.text\n", "        soup = BeautifulSoup(content, 'html.parser')\n", "        report_child = ''.join(s.text for s in soup('p'))\n", "        report = report + report_child\n", "    return year, report\n" ] },
{ "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T08:08:39.112172Z", "start_time": "2018-04-28T08:07:49.043562Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1954" ] } ], "source": [ "# crawl the text of all 48 government work reports\n", "reports = {}\n", "for link in hyperlinks:\n", "    year, report = crawler(link)\n", "    flushPrint(year)\n", "    reports[year] = report" ] },
{ "cell_type": "code", "execution_count": 33, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T08:17:30.258116Z", "start_time": "2018-04-28T08:17:30.246809Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# write one report per line: year, a tab, then the text with newlines replaced by tabs\n", "with open('../data/gov_reports1954-2017.txt', 'w', encoding='utf8') as f:\n", "    for r in reports:\n", "        line = str(r) + '\\t' + reports[r].replace('\\n', '\\t') + '\\n'\n", "        f.write(line)" ] },
{ "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T08:11:31.959353Z", "start_time": "2018-04-28T08:11:30.751124Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_table('../data/gov_reports1954-2017.txt', names = ['year', 'report'])" ] },
{ "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T08:11:34.898876Z", "start_time": "2018-04-28T08:11:34.888410Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
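<table>\n", "  <thead>\n", "    <tr><th></th><th>year</th><th>report</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>1954</td><td>1954年政府工作报告——1954年5月23日在中华人民共和国第一届全国人民代表大会第一次会...</td></tr>\n", "    <tr><th>1</th><td>1955</td><td>1955年国务院政府工作报告关于发展国民经济的第一个五年计划的报告 ——1955年7月5日至...</td></tr>\n", "    <tr><th>2</th><td>1956</td><td>1956年国务院政府工作报告关于1955年国家决算和1956年国家预算的报告——1956年6...</td></tr>\n", "  </tbody>\n", "</table>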
" ], "text/plain": [ " year report\n", "0 1954 1954年政府工作报告——1954年5月23日在中华人民共和国第一届全国人民代表大会第一次会...\n", "1 1955 1955年国务院政府工作报告关于发展国民经济的第一个五年计划的报告 ——1955年7月5日至...\n", "2 1956 1956年国务院政府工作报告关于1955年国家决算和1956年国家预算的报告——1956年6..." ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[:3]" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "slide" } }, "source": [ "# This is the end.\n", "> ## Thank you for your attention." ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python [conda env:anaconda]", "language": "python", "name": "conda-env-anaconda-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 0, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "48px", "left": "1539px", "top": "303px", "width": "168px" }, "toc_section_display": false, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 1 }