{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "\n", "***\n", "***\n", "# 数据抓取\n", " > # 抓取历届政府工作报告\n", "***\n", "***\n", "\n", "王成军 \n", "\n", "wangchengjun@nju.edu.cn\n", "\n", "计算传播网 http://computational-communication.com" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:05:29.909682Z", "start_time": "2019-06-08T03:05:29.699307Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import requests\n", "from bs4 import BeautifulSoup" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:05:44.431623Z", "start_time": "2019-06-08T03:05:44.425640Z" }, "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import display_html, HTML\n", "HTML('')\n", "# the webpage we would like to crawl" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Inspect\n", "\n", "# · 2016年政府工作报告\n", "\n", " · 2016年政府工作报告\n", "\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:12:30.997590Z", "start_time": "2019-06-08T03:12:30.914046Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "'ISO-8859-1'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# get the link for each year\n", "url = \"http://www.hprc.org.cn/wxzl/wxysl/lczf/\" \n", "content = requests.get(url)\n", "content.encoding" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Encoding\n", "\n", "- ASCII\n", " - 7位字符集\n", " - 美国标准信息交换代码(American Standard Code for Information Interchange)的缩写, 为美国英语通信所设计。\n", " - 它由128个字符组成,包括大小写字母、数字0-9、标点符号、非打印字符(换行符、制表符等4个)以及控制字符(退格、响铃等)组成。\n", "- iso8859-1 通常叫做Latin-1。\n", " - 和ascii编码相似。\n", " - 属于单字节编码,最多能表示的字符范围是0-255,应用于英文系列。比如,字母a的编码为0x61=97。 \n", " - 无法表示中文字符。\n", " - 单字节编码,和计算机最基础的表示单位一致,所以很多时候,仍旧使用iso8859-1编码来表示。在很多协议上,默认使用该编码。" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- gb2312/gbk/gb18030\n", " - 是汉字的国标码,专门用来表示汉字,是双字节编码,而英文字母和iso8859-1一致(兼容iso8859-1编码)。\n", " - 其中gbk编码能够用来同时表示繁体字和简体字,K 为汉语拼音 Kuo Zhan(扩展)中“扩”字的声母\n", " - gb2312只能表示简体字,gbk是兼容gb2312编码的。 \n", " - gb18030,全称:国家标准 GB 18030-2005《信息技术中文编码字符集》,是中华人民共和国现时最新的内码字集" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- unicode \n", " - 最统一的编码,用来表示所有语言的字符。\n", " - 占用更多的空间,定长双字节(也有四字节的)编码,包括英文字母在内。\n", " - 不兼容iso8859-1等其它编码。相对于iso8859-1编码来说,uniocode编码只是在前面增加了一个0字节,比如字母a为\"00 61\"。 \n", " - 定长编码便于计算机处理(注意GB2312/GBK不是定长编码),unicode又可以用来表示所有字符,所以在很多软件内部是使用unicode编码来处理的,比如java。 \n", "- UTF \n", " - unicode不便于传输和存储,产生了utf编码\n", " - utf编码兼容iso8859-1编码,同时也可以用来表示所有语言的字符\n", " - utf编码是不定长编码,每一个字符的长度从1-6个字节不等。\n", " - 其中,utf8(8-bit Unicode Transformation Format)是一种针对Unicode的可变长度字符编码,又称万国码。\n", " - 由Ken Thompson于1992年创建。现在已经标准化为RFC 3629。" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# decode\n", "urllib2.urlopen(url).read().decode('gb18030') \n", "\n", " content.encoding = 'gb18030'\n", " \n", " content = content.text\n", " \n", "Or\n", "\n", " content = content.text.encode(content.encoding).decode('gb18030')\n", "\n", "\n", "\n", "# html.parser\n", "BeautifulSoup(content, 'html.parser')" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:12:48.004262Z", "start_time": "2019-06-08T03:12:48.000482Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Specify the encoding\n", "content.encoding = 'utf8' # 'gb18030'\n", "content = content.text" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:12:59.097620Z", "start_time": "2019-06-08T03:12:59.051598Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2019年政府工作报告\n" ] } ], "source": [ "soup = BeautifulSoup(content, 'html.parser') \n", "# links = soup.find_all('td', {'class', 'bl'}) \n", "links = soup.select('.bl a')\n", "print(links[0])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:13:13.681631Z", "start_time": "2019-06-08T03:13:13.677227Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "50" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(links)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:13:21.227981Z", "start_time": "2019-06-08T03:13:21.223396Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'./dishiyijie_10/200908/t20090818_3955459.html'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "links[-1]['href']" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:13:32.195933Z", "start_time": "2019-06-08T03:13:32.191166Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'dssjqgrmdbdh_1/201903/t20190318_4849567.html'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "links[0]['href'].split('./')[1]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:13:44.082726Z", "start_time": "2019-06-08T03:13:44.077900Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'http://www.hprc.org.cn/wxzl/wxysl/lczf/dssjqgrmdbdh_1/201903/t20190318_4849567.html'" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url + links[0]['href'].split('./')[1]" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:13:58.725075Z", "start_time": "2019-06-08T03:13:58.719510Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['http://www.hprc.org.cn/wxzl/wxysl/lczf/dssjqgrmdbdh_1/201903/t20190318_4849567.html',\n", " 'http://www.hprc.org.cn/wxzl/wxysl/lczf/dssjqgrmdbdh_1/201803/t20180323_4240852.html',\n", " 'http://www.hprc.org.cn/wxzl/wxysl/lczf/d12qgrdzfbg/201703/t20170317_4144138.html',\n", " 'http://www.hprc.org.cn/wxzl/wxysl/lczf/d12qgrdzfbg/201603/t20160318_4135203.html',\n", " 'http://www.hprc.org.cn/wxzl/wxysl/lczf/d12qgrdzfbg/201503/t20150318_4106347.html']" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hyperlinks = [url + i['href'].split('./')[1] for i in links]\n", "hyperlinks[:5]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:14:09.815393Z", "start_time": "2019-06-08T03:14:09.810879Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "['http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_9/200908/t20090818_3955464.html',\n", " 'http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_10/200908/t20090818_3955462.html',\n", " 'http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_10/200908/t20090818_3955461.html',\n", " 'http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_10/200908/t20090818_3955460.html',\n", " 'http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_10/200908/t20090818_3955459.html']" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hyperlinks[-5:]" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:15:01.286399Z", "start_time": "2019-06-08T03:15:01.282256Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "'http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_1/200908/t20090818_3955570.html'" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hyperlinks[12] # 2007年有分页" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:17:15.842395Z", "start_time": "2019-06-08T03:17:15.837032Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import display_html, HTML\n", "\n", "HTML('')\n", "# 2007年有分页" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Inspect 下一页\n", "\n", "下一页\n", "\n", " 下一页\n", " \n", "- a\n", " - script\n", " - td" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:19:53.397306Z", "start_time": "2019-06-08T03:19:53.264239Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "url_i = 'http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_1/200908/t20090818_3955570.html'\n", "content = requests.get(url_i)\n", "content.encoding = 'utf8'\n", "content = content.text\n", "#content = content.text.encode(content.encoding).decode('gb18030')\n", "soup = BeautifulSoup(content, 'html.parser') \n", "#scripts = soup.find_all('script')\n", "#scripts[0]\n", "scripts = soup.select('td script')[0]" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:19:54.692037Z", "start_time": "2019-06-08T03:19:54.687023Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scripts" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:20:04.358252Z", "start_time": "2019-06-08T03:20:04.353874Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'\\n\\tvar currentPage = 0;//所在页从0开始\\n\\tvar prevPage = currentPage-1//上一页\\n\\tvar 下一页Page = currentPage+1//下一页\\n\\tvar countPage = 4//共多少页\\n\\t//document.write(\"共\"+countPage+\"页  \");\\n\\t\\n\\t//循环\\n\\tvar num = 17;\\n\\tfor(var i=0+(currentPage-1-(currentPage-1)%num) ; i<=(num+(currentPage-1-(currentPage-1)%num))&&(i1){\\n\\t\\t\\tif(currentPage==i)\\n\\t\\t\\t\\tdocument.write(\"【\"+(i+1)+\"】 \");\\n\\t\\t\\telse if(i==0)\\n\\t\\t\\t\\tdocument.write(\"【\"+(i+1)+\"】 \");\\n\\t\\t\\telse\\n\\t\\t\\t\\tdocument.write(\"【\"+(i+1)+\"】 \");\\n\\t\\t}\\t\\n\\t}\\n\\t\\n\\tdocument.write(\"

\");\\n\\t//设置上一页代码\\n\\tif(countPage>1&¤tPage!=0&¤tPage!=1)\\n\\t\\tdocument.write(\"上一页 \");\\n\\telse if(countPage>1&¤tPage!=0&¤tPage==1)\\n\\t\\tdocument.write(\"上一页 \");\\n\\t//else\\n\\t//\\tdocument.write(\"上一页  \");\\n\\t\\n\\t\\n\\t//设置下一页代码 \\n\\tif(countPage>1&¤tPage!=(countPage-1))\\n\\t\\tdocument.write(\"下一页  \");\\n\\t//else\\n\\t//\\tdocument.write(\"下一页  \");\\n\\t\\t\\t\\t\\t \\n\\t'" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scripts.text" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:20:31.027898Z", "start_time": "2019-06-08T03:20:31.022062Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# countPage = int(''.join(scripts).split('countPage = ')\\\n", "# [1].split('//')[0])\n", "# countPage\n", "\n", "countPage = int(scripts.text.split('countPage = ')[1].split('//')[0])\n", "countPage" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:22:16.877815Z", "start_time": "2019-06-08T03:22:16.830695Z" }, "code_folding": [ 1 ], "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import sys\n", "def flushPrint(s):\n", " sys.stdout.write('\\r')\n", " sys.stdout.write('%s' % s)\n", " sys.stdout.flush()\n", " \n", "def crawler(url_i):\n", " content = requests.get(url_i)\n", " content.encoding = 'utf8' \n", " content = content.text\n", " soup = BeautifulSoup(content, 'html.parser') \n", " year = soup.find('span', {'class', 'huang16c'}).text[:4]\n", " year = int(year)\n", " report = ''.join(s.text for s in soup('p'))\n", " # 找到分页信息\n", " scripts = soup.find_all('script')\n", " countPage = int(''.join(scripts[1]).split('countPage = ')[1].split('//')[0])\n", " if countPage == 1:\n", " pass\n", " else:\n", " for i in range(1, countPage):\n", " url_child = url_i.split('.html')[0] +'_'+str(i)+'.html'\n", " content = requests.get(url_child)\n", " content.encoding = 'gb18030'\n", " content = content.text\n", " soup = BeautifulSoup(content, 'html.parser') \n", " report_child = ''.join(s.text for s in soup('p'))\n", " report = report + report_child\n", " return year, report\n" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:22:48.582670Z", "start_time": "2019-06-08T03:22:27.192142Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1954" ] } ], "source": [ "# 抓取50年政府工作报告内容\n", "reports = {}\n", "for link in hyperlinks:\n", " year, report = crawler(link)\n", " flushPrint(year)\n", " reports[year] = report " ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:23:02.154355Z", "start_time": "2019-06-08T03:23:02.140302Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "with open('../data/gov_reports1954-2019.txt', 'w', encoding = 'utf8') as f:\n", " for r in reports:\n", " line = str(r)+'\\t'+reports[r].replace('\\n', '\\t') +'\\n'\n", " f.write(line)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:23:10.813183Z", "start_time": "2019-06-08T03:23:10.401123Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_table('../data/gov_reports1954-2019.txt', names = ['year', 'report'])" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T03:23:23.270233Z", "start_time": "2019-06-08T03:23:23.261529Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearreport
452015国务院总理 李克强  各位代表:  现在,我代表国务院,向大会报告政府工作,请予审议,并...
462016政府工作报告
472017各位代表:  现在,我代表国务院,向大会报告政府工作,请予审议,并请全国政协各位委员提出...
482018各位代表:  现在,我代表国务院,向大会报告过去五年政府工作,对今年工作提出建议,请予审...
492019新华社北京3月16日电   政府工作报告  ——2019年3月5日在第十三届全国人民代表...
\n", "
" ], "text/plain": [ " year report\n", "45 2015   国务院总理 李克强  各位代表:  现在,我代表国务院,向大会报告政府工作,请予审议,并...\n", "46 2016 政府工作报告\n", "47 2017   各位代表:  现在,我代表国务院,向大会报告政府工作,请予审议,并请全国政协各位委员提出...\n", "48 2018   各位代表:  现在,我代表国务院,向大会报告过去五年政府工作,对今年工作提出建议,请予审...\n", "49 2019   新华社北京3月16日电   政府工作报告  ——2019年3月5日在第十三届全国人民代表..." ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[-5:]" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "slide" } }, "source": [ "# This is the end.\n", "> ## Thank you for your attention." ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python [conda env:anaconda]", "language": "python", "name": "conda-env-anaconda-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 0, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "48px", "left": "1539px", "top": "303px", "width": "168px" }, "toc_section_display": false, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 1 }