{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "# Data Scraping:\n", "\n", "> # Writing a Web Crawler in Python\n", "***\n", "\n", "Wang Chengjun (王成军)\n", "\n", "wangchengjun@nju.edu.cn\n", "\n", "计算传播网 (Computational Communication) http://computational-communication.com" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Problems to Solve\n", "\n", "- Parsing pages\n", "- Retrieving source data hidden behind JavaScript\n", "- Automatic pagination\n", "- Automatic login\n", "- Connecting to APIs\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import urllib2\n", "from bs4 import BeautifulSoup" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- For ordinary scraping tasks, urllib2 combined with BeautifulSoup is enough.\n", "- In particular, when a site's URL changes in a regular pattern from page to page, we only need to generate those URLs.\n", "- A simple example is scraping the posts on the Tianya forum that match a given keyword.\n", " - On the Tianya forum, the first page of posts about smog (雾霾) is:\n", "http://bbs.tianya.cn/list.jsp?item=free&nextid=0&order=8&k=雾霾\n", " - The second page is:\n", "http://bbs.tianya.cn/list.jsp?item=free&nextid=1&order=8&k=雾霾\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "# Data Scraping:\n", "> # Crawling the Tianya Reply Network\n", "***\n", "\n", "Wang Chengjun (王成军)\n", "\n", "wangchengjun@nju.edu.cn\n", "\n", "计算传播网 (Computational Communication) http://computational-communication.com" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import display_html, HTML\n", "HTML('')\n", "# the webpage we would like to crawl" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "page_num = 0\n", "url = \"http://bbs.tianya.cn/list.jsp?item=free&nextid=%d&order=8&k=PX\" % page_num\n", 
"content = urllib2.urlopen(url).read() # fetch the page's HTML text\n", "soup = BeautifulSoup(content, \"lxml\") \n", "articles = soup.find_all('tr')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " 标题\n", "作者\n", "点击\n", "回复\n", "发表时间\n", "\n" ] } ], "source": [ "print articles[0]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "\n", "\r\n", "\t\t\t\t\t\t\t【民间语文第161期】宁波px启示:船进港湾人应上岸\n", "\n", "\n", "贾也\n", "194699\n", "2703\n", "10-29 07:59\n", "\n" ] } ], "source": [ "print articles[1]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "50" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(articles[1:])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![](./img/inspect.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "http://bbs.tianya.cn/list.jsp?item=free&nextid=0&order=8&k=PX\n", " \n", "# Inspecting the source of the thread list with the browser's inspect tool shows that everything we need to parse sits at the 'td' level\n", " " ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "\r\n", "\t\t\t\t\t\t\t【民间语文第161期】宁波px启示:船进港湾人应上岸\n", "\n", "\n", "贾也\n", "194699\n", "2703\n", "10-29 07:59\n" ] } ], "source": [ "for t in articles[1].find_all('td'): print t" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ 
"td = articles[1].find_all('td')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "\r\n", "\t\t\t\t\t\t\t【民间语文第161期】宁波px启示:船进港湾人应上岸\n", "\n", "\n" ] } ], "source": [ "print td[0]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "\r\n", "\t\t\t\t\t\t\t【民间语文第161期】宁波px启示:船进港湾人应上岸\n", "\n", "\n" ] } ], "source": [ "print td[0].text" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "【民间语文第161期】宁波px启示:船进港湾人应上岸\n" ] } ], "source": [ "print td[0].text.strip()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/post-free-2849477-1.shtml\n" ] } ], "source": [ "print td[0].a['href']" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "贾也\n" ] } ], "source": [ "print td[1]" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "194699\n" ] } ], "source": [ 
"print td[2]" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2703\n" ] } ], "source": [ "print td[3]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10-29 07:59\n" ] } ], "source": [ "print td[4]" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "records = []\n", "for i in articles[1:]:\n", "    td = i.find_all('td')\n", "    title = td[0].text.strip()\n", "    title_url = td[0].a['href']\n", "    author = td[1].text\n", "    author_url = td[1].a['href']\n", "    views = td[2].text\n", "    replies = td[3].text\n", "    date = td[4]['title']\n", "    record = title + '\\t' + title_url+ '\\t' + author + '\\t'+ author_url + '\\t' + views+ '\\t' + replies+ '\\t'+ date\n", "    records.append(record)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "宁波准备停止PX项目了,元芳,你怎么看?\t/post-free-2848797-1.shtml\t牧阳光\thttp://www.tianya.cn/36535656\t82888\t625\t2012-10-28 19:11\n" ] } ], "source": [ "print records[2]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Crawling the List of PX Threads on the Tianya Forum\n", "\n", "Structure of the reply network (thread network)\n", "- Thread list\n", "- Original post\n", "- Replies" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "def crawler(page_num, file_name):\n", "    try:\n", "        # build the URL of the list page\n", "        url = \"http://bbs.tianya.cn/list.jsp?item=free&nextid=%d&order=8&k=PX\" % page_num\n", "        content = urllib2.urlopen(url).read() # fetch the page's HTML text\n", 
"        soup = BeautifulSoup(content, \"lxml\") \n", "        articles = soup.find_all('tr')\n", "        # write down info, skipping the header row\n", "        for i in articles[1:]:\n", "            td = i.find_all('td')\n", "            title = td[0].text.strip()\n", "            title_url = td[0].a['href']\n", "            author = td[1].text\n", "            author_url = td[1].a['href']\n", "            views = td[2].text\n", "            replies = td[3].text\n", "            date = td[4]['title']\n", "            record = title + '\\t' + title_url+ '\\t' + author + '\\t'+ \\\n", "                     author_url + '\\t' + views+ '\\t' + replies+ '\\t'+ date\n", "            with open(file_name,'a') as p: # Note: append mode, run only once!\n", "                p.write(record.encode('utf-8')+\"\\n\") # encode to utf-8 here to avoid encoding errors\n", "\n", "    except Exception, e:\n", "        print e\n", "        pass" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "1\n", "2\n", "3\n", "4\n", "5\n", "6\n", "7\n", "8\n", "9\n" ] } ], "source": [ "# crawl all pages\n", "for page_num in range(10):\n", "    print (page_num)\n", "    crawler(page_num, '/Users/chengjun/bigdata/tianya_bbs_threads_list.txt') " ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", 
"<table border=\"1\" class=\"dataframe\">\n", 
"  <thead>\n", 
"    <tr style=\"text-align: right;\"><th></th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th></tr>\n", 
"  </thead>\n", 
"  <tbody>\n", 
"    <tr><th>0</th><td>【民间语文第161期】宁波px启示:船进港湾人应上岸</td><td>/post-free-2849477-1.shtml</td><td>贾也</td><td>http://www.tianya.cn/50499450</td><td>194675</td><td>2703</td><td>2012-10-29 07:59</td></tr>\n", 
"    <tr><th>1</th><td>宁波镇海PX项目引发群体上访 当地政府发布说明(转载)</td><td>/post-free-2839539-1.shtml</td><td>无上卫士ABC</td><td>http://www.tianya.cn/74341835</td><td>88244</td><td>1041</td><td>2012-10-24 12:41</td></tr>\n", 
"  </tbody>\n", 
"</table>\n", 
"</div>
" ], "text/plain": [ " 0 1 2 \\\n", "0 【民间语文第161期】宁波px启示:船进港湾人应上岸 /post-free-2849477-1.shtml 贾也 \n", "1 宁波镇海PX项目引发群体上访 当地政府发布说明(转载) /post-free-2839539-1.shtml 无上卫士ABC \n", "\n", " 3 4 5 6 \n", "0 http://www.tianya.cn/50499450 194675 2703 2012-10-29 07:59 \n", "1 http://www.tianya.cn/74341835 88244 1041 2012-10-24 12:41 " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('/Users/chengjun/github/cjc2016/data/tianya_bbs_threads_list.txt', sep = \"\\t\", header=None)\n", "df[:2]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "467" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", 
"<table border=\"1\" class=\"dataframe\">\n", 
"  <thead>\n", 
"    <tr style=\"text-align: right;\"><th></th><th>title</th><th>link</th><th>author</th><th>author_page</th><th>click</th><th>reply</th><th>time</th></tr>\n", 
"  </thead>\n", 
"  <tbody>\n", 
"    <tr><th>0</th><td>【民间语文第161期】宁波px启示:船进港湾人应上岸</td><td>/post-free-2849477-1.shtml</td><td>贾也</td><td>http://www.tianya.cn/50499450</td><td>194675</td><td>2703</td><td>2012-10-29 07:59</td></tr>\n", 
"    <tr><th>1</th><td>宁波镇海PX项目引发群体上访 当地政府发布说明(转载)</td><td>/post-free-2839539-1.shtml</td><td>无上卫士ABC</td><td>http://www.tianya.cn/74341835</td><td>88244</td><td>1041</td><td>2012-10-24 12:41</td></tr>\n", 
"  </tbody>\n", 
"</table>\n", 
"</div>" ], "text/plain": [ " title link author \\\n", "0 【民间语文第161期】宁波px启示:船进港湾人应上岸 /post-free-2849477-1.shtml 贾也 \n", "1 宁波镇海PX项目引发群体上访 当地政府发布说明(转载) /post-free-2839539-1.shtml 无上卫士ABC \n", "\n", " author_page click reply time \n", "0 http://www.tianya.cn/50499450 194675 2703 2012-10-29 07:59 \n", "1 http://www.tianya.cn/74341835 88244 1041 2012-10-24 12:41 " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df=df.rename(columns = {0:'title', 1:'link', 2:'author',3:'author_page', 4:'click', 5:'reply', 6:'time'})\n", "df[:2]" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "467" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df.link)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Crawling Author Information" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 http://www.tianya.cn/50499450\n", "1 http://www.tianya.cn/74341835\n", "2 http://www.tianya.cn/36535656\n", "3 http://www.tianya.cn/36959960\n", "4 http://www.tianya.cn/53134970\n", "Name: author_page, dtype: object" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.author_page[:5]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "http://www.tianya.cn/62237033\n", " \n", "http://www.tianya.cn/67896263\n", " " ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# user_info" ] }, { "cell_type": "code", "execution_count": 357, "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "url = 
df.author_page[1]\n", "content = urllib2.urlopen(url).read() # fetch the page's HTML text\n", "soup1 = BeautifulSoup(content, \"lxml\") " ] }, { "cell_type": "code", "execution_count": 359, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "浙江杭州市 259643 5832 2016-04-16 16:53:46 2011-04-14 20:49:00\n" ] } ], "source": [ "user_info = soup.find('div', {'class': 'userinfo'})('p')\n", "area, nid, freq_use, last_login_time, reg_time = [i.get_text()[4:] for i in user_info]\n", "print area, nid, freq_use, last_login_time, reg_time \n", "\n", "link_info = soup1.find_all('div', {'class': 'link-box'})\n", "followed_num, fans_num = [i.a.text for i in link_info]\n", "print followed_num, fans_num" ] }, { "cell_type": "code", "execution_count": 393, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2 5\n" ] } ], "source": [ "activity = soup1.find_all('span', {'class': 'subtitle'})\n", "post_num, reply_num = [j.text[2:] for i in activity[:1] for j in i('a')]\n", "print post_num, reply_num" ] }, { "cell_type": "code", "execution_count": 386, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "贾也的博客 \r\n", "\t\t\t\r\n", "\r\n", "\t\t\t\n" ] } ], "source": [ "print activity[2]" ] }, { "cell_type": "code", "execution_count": 370, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "152 27451\n" ] } ], "source": [ "link_info = soup.find_all('div', {'class': 'link-box'})\n", "followed_num, fans_num = [i.a.text for i in link_info]\n", "print followed_num, fans_num" ] }, { "cell_type": "code", "execution_count": 369, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { 
"text/plain": [ "u'152'" ] }, "execution_count": 369, "metadata": {}, "output_type": "execute_result" } ], "source": [ "link_info[0].a.text" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# user_info = soup.find('div', {'class': 'userinfo'})('p')\n", "# user_infos = [i.get_text()[4:] for i in user_info]\n", " \n", "def author_crawler(url, file_name):\n", "    try:\n", "        content = urllib2.urlopen(url).read() # fetch the page's HTML text\n", "        soup = BeautifulSoup(content, \"lxml\")\n", "        link_info = soup.find_all('div', {'class': 'link-box'})\n", "        followed_num, fans_num = [i.a.text for i in link_info]\n", "        try:\n", "            activity = soup.find_all('span', {'class': 'subtitle'})\n", "            post_num, reply_num = [j.text[2:] for i in activity[:1] for j in i('a')]\n", "        except Exception:\n", "            # string defaults, so that '\\t'.join() below does not fail on ints\n", "            post_num, reply_num = '1', '0'\n", "        record = '\\t'.join([url, followed_num, fans_num, post_num, reply_num])\n", "        with open(file_name,'a') as p: # Note: append mode, run only once!\n", "            p.write(record.encode('utf-8')+\"\\n\") # encode to utf-8 here to avoid encoding errors\n", "\n", "    except Exception, e:\n", "        print e, url\n", "        # record the failed page with 'na' placeholders\n", "        record = '\\t'.join([url, 'na', 'na', 'na', 'na'])\n", "        with open(file_name,'a') as p: # Note: append mode, run only once!\n", "            p.write(record.encode('utf-8')+\"\\n\") # encode to utf-8 here to avoid encoding errors\n", "        pass" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "10\n", "20\n", "30\n", "40\n", "need more than 0 values to unpack http://www.tianya.cn/67896263\n", "need more than 0 values to unpack http://www.tianya.cn/42330613\n", "sequence item 3: expected string or Unicode, int found http://www.tianya.cn/26517664\n", "50\n", "need more than 0 values to unpack http://www.tianya.cn/75591747\n", "60\n", "need more than 0 
values to unpack http://www.tianya.cn/24068399\n", "70\n", "80\n", "90\n", "need more than 0 values to unpack http://www.tianya.cn/67896263\n", "sequence item 3: expected string or Unicode, int found http://www.tianya.cn/62237033\n", "100\n", "110\n", "120\n", "130\n", "140\n", "150\n", "160\n", "170\n", "180\n", "190\n", "need more than 0 values to unpack http://www.tianya.cn/67896263\n", "200\n", "need more than 0 values to unpack http://www.tianya.cn/85353911\n", "210\n", "220\n", "230\n", "240\n", "250\n", "260\n", "270\n", "280\n", "need more than 0 values to unpack http://www.tianya.cn/67896263\n", "290\n", "need more than 0 values to unpack http://www.tianya.cn/67896263\n", "300\n", "310\n", "320\n", "need more than 0 values to unpack http://www.tianya.cn/67896263\n", "330\n", "340\n", "350\n", "360\n", "370\n", "need more than 0 values to unpack http://www.tianya.cn/67896263\n", "380\n", "390\n", "400\n", "410\n", "420\n", "430\n", "440\n", "450\n", "460\n" ] } ], "source": [ "for k, url in enumerate(df.author_page):\n", "    if k % 10==0:\n", "        print k\n", "    author_crawler(url, '/Users/chengjun/github/cjc2016/data/tianya_bbs_threads_author_info2.txt') " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "http://www.tianya.cn/50499450/follow \n", "\n", "We can also crawl their following lists and follower lists" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "***\n", "# Data Scraping:\n", "> # Crawling Replies with Python\n", "***\n", "***\n", "\n", "Wang Chengjun (王成军)\n", "\n", "wangchengjun@nju.edu.cn\n", "\n", "计算传播网 (Computational Communication) http://computational-communication.com" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "'/post-free-2849477-1.shtml'" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.link[0]" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { 
"collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'http://bbs.tianya.cn/post-free-2848797-1.shtml'" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = 'http://bbs.tianya.cn' + df.link[2]\n", "url" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false, "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import display_html, HTML\n", "HTML('')\n", "# the webpage we would like to crawl" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "post = urllib2.urlopen(url).read() #获取网页的html文本\n", "post_soup = BeautifulSoup(post, \"lxml\") \n", "#articles = soup.find_all('tr')" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false, "scrolled": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", " \n", " \n", " \n", " 宁波准备停止PX项目了,元芳,你怎么看?_天涯杂谈_天涯论坛\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " 36535656 ---> 牧阳光 ---> 从宁波市政府新闻发言人处获悉,宁波市经与项目投资方研究决定:(1)坚决不上PX项目;(2)炼化一体化项目前期工作停止推进,再作科学论证。 \n", "\n", "2012-10-28 19:17:56 ---> 73157063 ---> 怨魂鬼 ---> 图片分享 \n", "\n", "2012-10-28 19:18:17 ---> 73157063 ---> 怨魂鬼 ---> @怨魂鬼 2012-10-28 19:17:56  图片分享   [发自掌中天涯客户端 ]  -----------------------------  2楼我的天下! \n", "\n", "2012-10-28 19:18:46 ---> 36535656 ---> 牧阳光 ---> 。。。沙发板凳这么快就被坐了~~ \n", "\n", "2012-10-28 19:19:04 ---> 41774471 ---> zgh0213 ---> 元芳你怎么看 \n", "\n", "2012-10-28 19:19:37 ---> 73157063 ---> 怨魂鬼 ---> @牧阳光 2012-10-28 19:18:46  。。。沙发板凳这么快就被坐了~~  -----------------------------  运气、 \n", "\n", "2012-10-28 19:20:04 ---> 36535656 ---> 牧阳光 ---> @怨魂鬼 5楼   运气、  -----------------------------  哈哈。。。 \n", "\n", "2012-10-28 19:20:07 ---> 54060837 ---> 帆小叶 ---> 八卦的被和谐了。帖个链接http://api.pwnz.org/0/?url=bG10aC4  wOTIyNzQvNzIvMDEvMjEvc3dlbi9tb2MuYW5  paGN0ZXJjZXMud3d3Ly9BMyVwdHRo \n", "\n", "2012-10-28 19:20:33 ---> 36535656 ---> 牧阳光 ---> @怨魂鬼 2楼   2楼我的天下!  -----------------------------  。。。还是掌中天涯,NB的~~ \n", "\n", "2012-10-28 19:25:22 ---> 36535656 ---> 牧阳光 ---> 消息来源,官方微博@宁波发布 \n", "\n" ] } ], "source": [ "for i in pa[:10]:\n", "    p_info = i.find('a', class_ = 'reportme a-link')\n", "    p_time = p_info['replytime']\n", "    p_author_id = p_info['authorid']\n", "    p_author_name = p_info['author']\n", "    p_content = i.find('div', {'class': 'bbs-content'}).text.strip()\n", "    p_content = p_content.replace('\\t', '')\n", "    print p_time, '--->', p_author_id, '--->', p_author_name,'--->', p_content, '\\n'" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# How to Paginate\n", "\n", "http://bbs.tianya.cn/post-free-2848797-1.shtml\n", " \n", " http://bbs.tianya.cn/post-free-2848797-2.shtml\n", " \n", " http://bbs.tianya.cn/post-free-2848797-3.shtml" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "
    \\n\\u4e0a\\u9875\\n1\\n2\\n3\\n4\\n\\u2026\\n7\\n\\u4e0b\\u9875\\n\\xa0\\u5230\\u9875\\xa0
    " ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "post_soup.find('div', {'class', 'atl-pages'})#.['onsubmit']" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'7'" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "post_pages = post_soup.find('div', {'class', 'atl-pages'})\n", "post_pages = post_pages.form['onsubmit'].split(',')[-1].split(')')[0]\n", "post_pages\n", "#post_soup.select('.atl-pages')[0].select('form')[0].select('onsubmit')\n" ] }, { "cell_type": "code", "execution_count": 144, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'http://bbs.tianya.cn/postfree2848797-%d.shtml'" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = 'http://bbs.tianya.cn' + df.link[2]\n", "url_base = ''.join(url.split('-')[:-1]) + '-%d.shtml'\n", "url_base" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "def parsePage(pa):\n", " records = []\n", " for i in pa:\n", " p_info = i.find('a', class_ = 'reportme a-link')\n", " p_time = p_info['replytime']\n", " p_author_id = p_info['authorid']\n", " p_author_name = p_info['author']\n", " p_content = i.find('div', {'class', 'bbs-content'}).text.strip()\n", " p_content = p_content.replace('\\t', '').replace('\\n', '')#.replace(' ', '')\n", " record = p_time + '\\t' + p_author_id+ '\\t' + p_author_name + '\\t'+ p_content\n", " records.append(record)\n", " return records\n", "\n", "import sys\n", "def flushPrint(s):\n", " sys.stdout.write('\\r')\n", " sys.stdout.write('%s' % s)\n", " sys.stdout.flush()" ] }, { "cell_type": "code", "execution_count": 246, "metadata": { "collapsed": false, "slideshow": 
{ "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "
    " ] }, "execution_count": 246, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url_1 = 'http://bbs.tianya.cn' + df.link[10]\n", "content = urllib2.urlopen(url_1).read() # fetch the page's HTML text\n", "post_soup = BeautifulSoup(content, \"lxml\") \n", "pa = post_soup.find_all('div', {'class': 'atl-item'})\n", "b = post_soup.find('div', class_= 'atl-pages')\n", "b" ] }, { "cell_type": "code", "execution_count": 247, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "
    \\n\\u4e0a\\u9875\\n1\\n2\\n3\\n4\\n\\u2026\\n28\\n\\u4e0b\\u9875\\n\\xa0\\u5230\\u9875\\xa0
    " ] }, "execution_count": 247, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url_1 = 'http://bbs.tianya.cn' + df.link[0]\n", "content = urllib2.urlopen(url_1).read() #获取网页的html文本\n", "post_soup = BeautifulSoup(content, \"lxml\") \n", "pa = post_soup.find_all('div', {'class', 'atl-item'})\n", "a = post_soup.find('div', {'class', 'atl-pages'})\n", "a" ] }, { "cell_type": "code", "execution_count": 251, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "
    \\n\\u4e0a\\u9875\\n1\\n2\\n3\\n4\\n\\u2026\\n28\\n\\u4e0b\\u9875\\n\\xa0\\u5230\\u9875\\xa0
    " ] }, "execution_count": 251, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.form" ] }, { "cell_type": "code", "execution_count": 254, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "false\n" ] } ], "source": [ "if b.form:\n", " print 'true'\n", "else:\n", " print 'false'" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import random\n", "import time\n", "\n", "def crawler(url, file_name):\n", " try:\n", " # open the browser\n", " url_1 = 'http://bbs.tianya.cn' + url\n", " content = urllib2.urlopen(url_1).read() #获取网页的html文本\n", " post_soup = BeautifulSoup(content, \"lxml\") \n", " # how many pages in a post\n", " post_form = post_soup.find('div', {'class', 'atl-pages'})\n", " if post_form.form:\n", " post_pages = post_form.form['onsubmit'].split(',')[-1].split(')')[0]\n", " post_pages = int(post_pages)\n", " url_base = '-'.join(url_1.split('-')[:-1]) + '-%d.shtml'\n", " else:\n", " post_pages = 1\n", " # for the first page\n", " pa = post_soup.find_all('div', {'class', 'atl-item'})\n", " records = parsePage(pa)\n", " with open(file_name,'a') as p: # '''Note''':Append mode, run only once!\n", " for record in records: \n", " p.write('1'+ '\\t' + url + '\\t' + record.encode('utf-8')+\"\\n\") \n", " # for the 2nd+ pages\n", " if post_pages > 1:\n", " for page_num in range(2, post_pages+1):\n", " time.sleep(random.random())\n", " flushPrint(page_num)\n", " url2 =url_base % page_num\n", " content = urllib2.urlopen(url2).read() #获取网页的html文本\n", " post_soup = BeautifulSoup(content, \"lxml\") \n", " pa = post_soup.find_all('div', {'class', 'atl-item'})\n", " records = parsePage(pa)\n", " with open(file_name,'a') as p: # '''Note''':Append mode, run only once!\n", " for record in records: \n", " p.write(str(page_num) + '\\t' +url + '\\t' + 
record.encode('utf-8')+\"\\n\") \n", " else:\n", " pass\n", " except Exception, e:\n", " print e\n", " pass" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 测试" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "7" ] } ], "source": [ "url = df.link[2]\n", "file_name = '/Users/chengjun/github/cjc2016/data/tianya_bbs_threads_2test.txt'\n", "crawler(url, file_name)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 正式抓取!" ] }, { "cell_type": "code", "execution_count": 417, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/post-free-2849477-1.shtmlThis it the post of : 0\n", "/post-free-2842180-1.shtmlThis it the post of : 10\n", "/post-free-3316698-1.shtmlThis it the post of : 20\n", "/post-free-923387-1.shtmlThis it the post of : 30\n", "/post-free-4236026-1.shtmlThis it the post of : 40\n", "/post-free-2850721-1.shtmlThis it the post of : 50\n", "/post-free-5054821-1.shtmlThis it the post of : 60\n", "/post-free-3326274-1.shtmlThis it the post of : 70\n", "/post-free-4236793-1.shtmlThis it the post of : 80\n", "/post-free-4239792-1.shtmlThis it the post of : 90\n", "/post-free-5042110-1.shtmlThis it the post of : 100\n", "/post-free-2241144-1.shtmlThis it the post of : 110\n", "/post-free-3324561-1.shtmlThis it the post of : 120\n", "/post-free-3835452-1.shtmlThis it the post of : 130\n", "/post-free-5045950-1.shtmlThis it the post of : 140\n", "/post-free-2848818-1.shtmlThis it the post of : 150\n", "/post-free-3281916-1.shtmlThis it the post of : 160\n", "/post-free-949151-1.shtmlThis it the post of : 170\n", "/post-free-2848839-1.shtmlThis it the post of : 180\n", "/post-free-3228423-1.shtmlThis it the post of : 
190\n", "/post-free-2852970-1.shtmlThis it the post of : 200\n", "/post-free-3325388-1.shtmlThis it the post of : 210\n", "/post-free-3835748-1.shtmlThis it the post of : 220\n", "/post-free-3833431-1.shtmlThis it the post of : 230\n", "/post-free-3378998-1.shtmlThis it the post of : 240\n", "/post-free-3359022-1.shtmlThis it the post of : 250\n", "/post-free-3365791-1.shtmlThis it the post of : 260\n", "/post-free-3396378-1.shtmlThis it the post of : 270\n", "/post-free-3835212-1.shtmlThis it the post of : 280\n", "/post-free-4248593-1.shtmlThis it the post of : 290\n", "/post-free-3833373-1.shtmlThis it the post of : 300\n", "/post-free-3847600-1.shtmlThis it the post of : 310\n", "/post-free-3832970-1.shtmlThis it the post of : 320\n", "/post-free-4076130-1.shtmlThis it the post of : 330\n", "/post-free-3835673-1.shtmlThis it the post of : 340\n", "/post-free-3835434-1.shtmlThis it the post of : 350\n", "/post-free-3368554-1.shtmlThis it the post of : 360\n", "/post-free-3832938-1.shtmlThis it the post of : 370\n", "/post-free-3835075-1.shtmlThis it the post of : 380\n", "/post-free-3832963-1.shtmlThis it the post of : 390\n", "/post-free-4250604-1.shtmlThis it the post of : 400\n", "/post-free-3834828-1.shtmlThis it the post of : 410\n", "/post-free-3835007-1.shtmlThis it the post of : 420\n", "/post-free-3838253-1.shtmlThis it the post of : 430\n", "/post-free-3835167-1.shtmlThis it the post of : 440\n", "/post-free-3835898-1.shtmlThis it the post of : 450\n", "/post-free-3835123-1.shtmlThis it the post of : 460\n", "/post-free-3835031-1.shtml" ] } ], "source": [ "for k, link in enumerate(df.link):\n", " flushPrint(link)\n", " if k % 10== 0:\n", " print 'This it the post of : ' + str(k)\n", " file_name = '/Users/chengjun/github/cjc2016/data/tianya_bbs_threads_network.txt'\n", " crawler(link, file_name)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 读取数据" ] }, { "cell_type": "code", "execution_count": 72, 
"metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "8079" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read the tab-separated crawl results back in, one reply per line.\n", "dtt = []\n", "with open('/Users/chengjun/github/cjc2016/data/tianya_bbs_threads_network.txt', 'r') as f:\n", "    for line in f:\n", "        # Fields: page number, thread link, post time, author id, author name, reply text\n", "        pnum, link, time, author_id, author, content = line.replace('\\n', '').split('\\t')\n", "        dtt.append([pnum, link, time, author_id, author, content])\n", "len(dtt)" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
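<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>0</th>\n",
"      <th>1</th>\n",
"      <th>2</th>\n",
"      <th>3</th>\n",
"      <th>4</th>\n",
"      <th>5</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>1</td>\n",
"      <td>/post-free-2849477-1.shtml</td>\n",
"      <td>2012-10-29 07:59:00</td>\n",
"      <td>50499450</td>\n",
"      <td>贾也</td>\n",
"      <td>先生是一位真爷们!第161期导语:人人宁波,面朝大海,春暖花开!  宁波的事,怎谈?无从谈,...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>1</td>\n",
"      <td>/post-free-2849477-1.shtml</td>\n",
"      <td>2012-10-29 08:13:54</td>\n",
"      <td>22122799</td>\n",
"      <td>三平67</td>\n",
"      <td>我们中国人都在一条船,颠簸已久,我们都想做宁波人,希望有一个风平浪静的港湾,面朝大海,春暖花...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>1</td>\n",
"      <td>/post-free-2849477-1.shtml</td>\n",
"      <td>2012-10-29 08:27:02</td>\n",
"      <td>39027950</td>\n",
"      <td>赶浪头</td>\n",
"      <td>默默围观~</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>1</td>\n",
"      <td>/post-free-2849477-1.shtml</td>\n",
"      <td>2012-10-29 08:43:15</td>\n",
"      <td>53986501</td>\n",
"      <td>m408833176</td>\n",
"      <td>不能收藏?</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>1</td>\n",
"      <td>/post-free-2849477-1.shtml</td>\n",
"      <td>2012-10-29 08:55:52</td>\n",
"      <td>39073643</td>\n",
"      <td>兰质薰心</td>\n",
"      <td>楼主好文!  相信政府一定有能力解决好这些问题.</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>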
    " ], "text/plain": [ " 0 1 2 3 4 \\\n", "0 1 /post-free-2849477-1.shtml 2012-10-29 07:59:00 50499450 贾也 \n", "1 1 /post-free-2849477-1.shtml 2012-10-29 08:13:54 22122799 三平67 \n", "2 1 /post-free-2849477-1.shtml 2012-10-29 08:27:02 39027950 赶浪头 \n", "3 1 /post-free-2849477-1.shtml 2012-10-29 08:43:15 53986501 m408833176 \n", "4 1 /post-free-2849477-1.shtml 2012-10-29 08:55:52 39073643 兰质薰心 \n", "\n", " 5 \n", "0 先生是一位真爷们!第161期导语:人人宁波,面朝大海,春暖花开!  宁波的事,怎谈?无从谈,... \n", "1 我们中国人都在一条船,颠簸已久,我们都想做宁波人,希望有一个风平浪静的港湾,面朝大海,春暖花... \n", "2 默默围观~ \n", "3 不能收藏? \n", "4 楼主好文!  相信政府一定有能力解决好这些问题. " ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dt = pd.DataFrame(dtt)\n", "dt[:5]" ] }, { "cell_type": "code", "execution_count": 420, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
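<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>page_num</th>\n",
"      <th>link</th>\n",
"      <th>time</th>\n",
"      <th>author</th>\n",
"      <th>author_name</th>\n",
"      <th>reply</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>1</td>\n",
"      <td>/post-free-2849477-1.shtml</td>\n",
"      <td>2012-10-29 07:59:00</td>\n",
"      <td>50499450</td>\n",
"      <td>贾也</td>\n",
"      <td>先生是一位真爷们!第161期导语:人人宁波,面朝大海,春暖花开!  宁波的事,怎谈?无从谈,...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>1</td>\n",
"      <td>/post-free-2849477-1.shtml</td>\n",
"      <td>2012-10-29 08:13:54</td>\n",
"      <td>22122799</td>\n",
"      <td>三平67</td>\n",
"      <td>我们中国人都在一条船,颠簸已久,我们都想做宁波人,希望有一个风平浪静的港湾,面朝大海,春暖花...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>1</td>\n",
"      <td>/post-free-2849477-1.shtml</td>\n",
"      <td>2012-10-29 08:27:02</td>\n",
"      <td>39027950</td>\n",
"      <td>赶浪头</td>\n",
"      <td>默默围观~</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>1</td>\n",
"      <td>/post-free-2849477-1.shtml</td>\n",
"      <td>2012-10-29 08:43:15</td>\n",
"      <td>53986501</td>\n",
"      <td>m408833176</td>\n",
"      <td>不能收藏?</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>1</td>\n",
"      <td>/post-free-2849477-1.shtml</td>\n",
"      <td>2012-10-29 08:55:52</td>\n",
"      <td>39073643</td>\n",
"      <td>兰质薰心</td>\n",
"      <td>楼主好文!  相信政府一定有能力解决好这些问题.</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>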
    " ], "text/plain": [ " page_num link time author \\\n", "0 1 /post-free-2849477-1.shtml 2012-10-29 07:59:00 50499450 \n", "1 1 /post-free-2849477-1.shtml 2012-10-29 08:13:54 22122799 \n", "2 1 /post-free-2849477-1.shtml 2012-10-29 08:27:02 39027950 \n", "3 1 /post-free-2849477-1.shtml 2012-10-29 08:43:15 53986501 \n", "4 1 /post-free-2849477-1.shtml 2012-10-29 08:55:52 39073643 \n", "\n", " author_name reply \n", "0 贾也 先生是一位真爷们!第161期导语:人人宁波,面朝大海,春暖花开!  宁波的事,怎谈?无从谈,... \n", "1 三平67 我们中国人都在一条船,颠簸已久,我们都想做宁波人,希望有一个风平浪静的港湾,面朝大海,春暖花... \n", "2 赶浪头 默默围观~ \n", "3 m408833176 不能收藏? \n", "4 兰质薰心 楼主好文!  相信政府一定有能力解决好这些问题. " ] }, "execution_count": 420, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dt=dt.rename(columns = {0:'page_num', 1:'link', 2:'time', 3:'author',4:'author_name', 5:'reply'})\n", "dt[:5]" ] }, { "cell_type": "code", "execution_count": 421, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 先生是一位真爷们!第161期导语:人人宁波,面朝大海,春暖花开!  宁波的事,怎谈?无从谈,...\n", "1 我们中国人都在一条船,颠簸已久,我们都想做宁波人,希望有一个风平浪静的港湾,面朝大海,春暖花...\n", "2 默默围观~\n", "3 不能收藏?\n", "4 楼主好文!  相信政府一定有能力解决好这些问题.\n", "5 人民在觉醒。\n", "6 理性的文字,向楼主致敬!\n", "7 呼唤变革,人民需要的是服务型政府!\n", "8 顶贾兄!让我们携手努力保卫家园!\n", "9 围观就是力量,顶起就有希望.\n", "10 文章写得太有力量了,支持你!\n", "11 @贾也 2012-10-29 7:59:00  导语:人人宁波,面朝大海,春暖花开  ......\n", "12 中国人从文盲走向民粹,实在是太快了。\n", "13 杀死娘胎里的,毒死已出生的,这个社会怎么了?\n", "14 3\n", "15 环境比什么都可贵,每一次呼吸,每一顿粮食,都息息相关,若任其恶化,而无从改观,那遑谈国家之未...\n", "16 写的很好\n", "17 未来这里将是全球最大的垃圾场,而他们早已放浪西方。苟活的将面临数不清的癌症,无助的死亡。悲哀...\n", "18 媒体失声,高压维稳,只保留微博和论坛可以说这件事!因为那些人知道,网上的人和事就只能热几天,...\n", "19 说的太好了,看的我泪流满面!\n", "20 “我相信官场中,许多官员应该葆有社会正能量”  通篇好文,顶!唯这句,不说也罢....\n", "21 先占一环,然后看帖\n", "22 说的太好了\n", "23 我上的小学,隔壁就是一家水泥厂,到处飞扬的水泥灰是我最熟悉的颜色;坐一站地车,就是一家造纸厂...\n", "24 我们中国人都在一条船,颠簸已久,我们都想做宁波人,希望有一个风平浪静的港湾,面朝大海,春暖花开!\n", "25 前排占座~~\n", "26 贾也先生是一位真爷们!\n", "27 \n", "28 为什么我的眼里常含着泪水?因为我对这片土地爱得深沉!\n", "29 又是因为环保的群体事件,影响面大,危害严重,理由叫的响,取得阶段性胜利。  那些拆迁的、城管...\n", " ... 
\n", "70 这是我近几年看过的写的最好的文章,!!!不多说了,危险\n", "71 @shdsb 2012-10-29 10:17:43  媒体失声,高压维稳,只保留微博和论坛...\n", "72 @pals2009 48楼   每次看到这样的消息,都很痛心,很堵很堵。为什么在经济发展的同...\n", "73 成都啊成都\n", "74 是不得人心呀\n", "75 真爷们。。顶  我是宁。波人,楼主说的是我们的心声。。。  现在看到人民警察,不是有安全感,...\n", "76 作者:弱水三千chen 回复日期:2012-10-29 11:43:18  回复   @兰...\n", "77 泣血之作!  谢。\n", "78 作者:文蛮子 时间:2012-10-29 11:42:58   摆明了,带路党们难以煽动民众...\n", "79 字字真切!\n", "80 @曾开贵 2012-10-29 11:40:09  没有ZF,哪来新ZG,没有新ZG,你们吃...\n", "81 好文,顶一下,为了我的故乡\n", "82 0\n", "83 好文章,顶一个!\n", "84 作者:文蛮子 时间:2012-10-29 11:42:58   摆明了,带路党们难以煽动民众...\n", "85 一定要顶。在被和谐前让多点人知道吧\n", "86 围观也是一种力量  天涯,也是好样的\n", "87 很沉重\n", "88 生不逢国\n", "89 很好的文章。\n", "90 我们中国人都在一条船,颠簸已久,我们都想做宁波人,希望有一个风平浪静的港湾,面朝大海,春暖花开!\n", "91 路过\n", "92 民间语文,狗屁有点通。  似是而非,点风扇火,实是不该。  排污排毒,环境污染,确实严重。 ...\n", "93 @横冲节度使 2012-10-29 12:11:50  楼主这种汉奸、带路党,成天就做梦盼着...\n", "94 @赶浪头 2楼   默默围观~    ---------------------------...\n", "95 午休时间静静看完了,心中莫名地压抑。\n", "96 好文!必须顶起!\n", "97 扎口了。\n", "98 谢谢分享 楼主辛苦了\n", "99 看不到我的回复哦\n", "Name: reply, dtype: object" ] }, "execution_count": 421, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dt.reply[:100]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## How many posts are there in total?" 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "A search at http://search.tianya.cn/bbs?q=PX returns 18459 results in total." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "369" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "18459/50" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "In practice the listing ends at page 10: http://bbs.tianya.cn/list.jsp?item=free&order=1&nextid=9&k=PX, because that listing covers only the Tianya BBS. There are various other sections as well, such as Tianya Focus (天涯聚焦): http://focus.tianya.cn/ \n", "\n", "- 娱乐八卦 (Entertainment Gossip) 512\n", "- 股市论坛 (Stock Market Forum) 187\n", "- 情感天地 (Relationships) 242\n", "- 天涯杂谈 (Tianya Zatan, general discussion) 1768 " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Searching 天涯杂谈 for 雾霾 (smog) returns 41 pages: http://bbs.tianya.cn/list.jsp?item=free&order=20&nextid=40&k=%E9%9B%BE%E9%9C%BE" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Tianya SDK" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "http://open.tianya.cn/wiki/index.php?title=SDK%E4%B8%8B%E8%BD%BD" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" }, "latex_envs": { "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 0 }, "toc": { "toc_cell": false, "toc_number_sections": true, "toc_threshold": 6, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 0 }