{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "***\n",
    "# 数据抓取：\n",
    "\n",
    "> # 使用Python编写网络爬虫\n",
    "***\n",
    "\n",
    "王成军\n",
    "\n",
    "wangchengjun@nju.edu.cn\n",
    "\n",
    "计算传播网 http://computational-communication.com"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# 需要解决的问题 \n",
    "\n",
    "- 页面解析\n",
    "- 获取Javascript隐藏源数据\n",
    "- 自动翻页\n",
    "- 自动登录\n",
    "- 连接API接口\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "import urllib2\n",
    "from bs4 import BeautifulSoup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- 一般的数据抓取，使用urllib2和beautifulsoup配合就可以了。\n",
    "- 尤其是对于翻页时url出现规则变化的网页，只需要处理规则化的url就可以了。\n",
    "- 以简单的例子是抓取天涯论坛上关于某一个关键词的帖子。\n",
    "    - 在天涯论坛，关于雾霾的帖子的第一页是：\n",
    "http://bbs.tianya.cn/list.jsp?item=free&nextid=0&order=8&k=雾霾\n",
    "    - 第二页是：\n",
    "http://bbs.tianya.cn/list.jsp?item=free&nextid=1&order=8&k=雾霾\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "***\n",
    "# 数据抓取：\n",
    "> # 抓取天涯回帖网络\n",
    "***\n",
    "\n",
    "王成军\n",
    "\n",
    "wangchengjun@nju.edu.cn\n",
    "\n",
    "计算传播网 http://computational-communication.com"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "scrolled": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe src=http://bbs.tianya.cn/list.jsp?item=free&nextid=%d&order=8&k=PX width=1000 height=500></iframe>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from IPython.display import display_html, HTML\n",
    "HTML('<iframe src=http://bbs.tianya.cn/list.jsp?item=free&nextid=%d&order=8&k=PX width=1000 height=500></iframe>')\n",
    "# the webpage we would like to crawl"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "page_num = 0\n",
    "url = \"http://bbs.tianya.cn/list.jsp?item=free&nextid=%d&order=8&k=PX\" % page_num\n",
    "content = urllib2.urlopen(url).read() #获取网页的html文本\n",
    "soup = BeautifulSoup(content, \"lxml\") \n",
    "articles = soup.find_all('tr')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<tr>\n",
      "<th scope=\"col\"> 标题</th>\n",
      "<th scope=\"col\">作者</th>\n",
      "<th scope=\"col\">点击</th>\n",
      "<th scope=\"col\">回复</th>\n",
      "<th scope=\"col\">发表时间</th>\n",
      "</tr>\n"
     ]
    }
   ],
   "source": [
    "print articles[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<tr class=\"bg\">\n",
      "<td class=\"td-title \">\n",
      "<span class=\"face\" title=\"\">\n",
      "</span>\n",
      "<a href=\"/post-free-2849477-1.shtml\" target=\"_blank\">\r\n",
      "\t\t\t\t\t\t\t【民间语文第161期】宁波px启示:船进港湾人应上岸<span class=\"art-ico art-ico-3\" title=\"内有0张图片\"></span>\n",
      "</a>\n",
      "</td>\n",
      "<td><a class=\"author\" href=\"http://www.tianya.cn/50499450\" target=\"_blank\">贾也</a></td>\n",
      "<td>194699</td>\n",
      "<td>2703</td>\n",
      "<td title=\"2012-10-29 07:59\">10-29 07:59</td>\n",
      "</tr>\n"
     ]
    }
   ],
   "source": [
    "print articles[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "50"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(articles[1:])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "![](./img/inspect.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "http://bbs.tianya.cn/list.jsp?item=free&nextid=0&order=8&k=PX\n",
    "    \n",
    "# 通过分析帖子列表的源代码，使用inspect方法，会发现所有要解析的内容都在‘td’这个层级下\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<td class=\"td-title \">\n",
      "<span class=\"face\" title=\"\">\n",
      "</span>\n",
      "<a href=\"/post-free-2849477-1.shtml\" target=\"_blank\">\r\n",
      "\t\t\t\t\t\t\t【民间语文第161期】宁波px启示:船进港湾人应上岸<span class=\"art-ico art-ico-3\" title=\"内有0张图片\"></span>\n",
      "</a>\n",
      "</td>\n",
      "<td><a class=\"author\" href=\"http://www.tianya.cn/50499450\" target=\"_blank\">贾也</a></td>\n",
      "<td>194699</td>\n",
      "<td>2703</td>\n",
      "<td title=\"2012-10-29 07:59\">10-29 07:59</td>\n"
     ]
    }
   ],
   "source": [
    "for t in articles[1].find_all('td'): print t"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "td = articles[1].find_all('td')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<td class=\"td-title \">\n",
      "<span class=\"face\" title=\"\">\n",
      "</span>\n",
      "<a href=\"/post-free-2849477-1.shtml\" target=\"_blank\">\r\n",
      "\t\t\t\t\t\t\t【民间语文第161期】宁波px启示:船进港湾人应上岸<span class=\"art-ico art-ico-3\" title=\"内有0张图片\"></span>\n",
      "</a>\n",
      "</td>\n"
     ]
    }
   ],
   "source": [
    "print td[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<td class=\"td-title \">\n",
      "<span class=\"face\" title=\"\">\n",
      "</span>\n",
      "<a href=\"/post-free-2849477-1.shtml\" target=\"_blank\">\r\n",
      "\t\t\t\t\t\t\t【民间语文第161期】宁波px启示:船进港湾人应上岸<span class=\"art-ico art-ico-3\" title=\"内有0张图片\"></span>\n",
      "</a>\n",
      "</td>\n"
     ]
    }
   ],
   "source": [
    "print td[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\n",
      "\r\n",
      "\t\t\t\t\t\t\t【民间语文第161期】宁波px启示:船进港湾人应上岸\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print td[0].text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "【民间语文第161期】宁波px启示:船进港湾人应上岸\n"
     ]
    }
   ],
   "source": [
    "print td[0].text.strip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/post-free-2849477-1.shtml\n"
     ]
    }
   ],
   "source": [
    "print td[0].a['href']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<td><a class=\"author\" href=\"http://www.tianya.cn/50499450\" target=\"_blank\">贾也</a></td>\n"
     ]
    }
   ],
   "source": [
    "print td[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<td>194699</td>\n"
     ]
    }
   ],
   "source": [
    "print td[2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<td>2703</td>\n"
     ]
    }
   ],
   "source": [
    "print td[3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<td title=\"2012-10-29 07:59\">10-29 07:59</td>\n"
     ]
    }
   ],
   "source": [
    "print td[4]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "records = []\n",
    "for i in articles[1:]:\n",
    "    td = i.find_all('td')\n",
    "    title = td[0].text.strip()\n",
    "    title_url = td[0].a['href']\n",
    "    author = td[1].text\n",
    "    author_url = td[1].a['href']\n",
    "    views = td[2].text\n",
    "    replies = td[3].text\n",
    "    date = td[4]['title']\n",
    "    record = title + '\\t' + title_url+ '\\t' + author + '\\t'+ author_url + '\\t' + views+ '\\t'  + replies+ '\\t'+ date\n",
    "    records.append(record)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "宁波准备停止PX项目了,元芳,你怎么看?\t/post-free-2848797-1.shtml\t牧阳光\thttp://www.tianya.cn/36535656\t82888\t625\t2012-10-28 19:11\n"
     ]
    }
   ],
   "source": [
    "print records[2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# 抓取天涯论坛PX帖子列表\n",
    "\n",
    "回帖网络（Thread network）的结构\n",
    "- 列表\n",
    "- 主帖\n",
    "- 回帖"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "def crawler(page_num, file_name):\n",
    "    try:\n",
    "        # open the browser\n",
    "        url = \"http://bbs.tianya.cn/list.jsp?item=free&nextid=%d&order=8&k=PX\" % page_num\n",
    "        content = urllib2.urlopen(url).read() #获取网页的html文本\n",
    "        soup = BeautifulSoup(content, \"lxml\") \n",
    "        articles = soup.find_all('tr')\n",
    "        # write down info\n",
    "        for i in articles[1:]:\n",
    "            td = i.find_all('td')\n",
    "            title = td[0].text.strip()\n",
    "            title_url = td[0].a['href']\n",
    "            author = td[1].text\n",
    "            author_url = td[1].a['href']\n",
    "            views = td[2].text\n",
    "            replies = td[3].text\n",
    "            date = td[4]['title']\n",
    "            record = title + '\\t' + title_url+ '\\t' + author + '\\t'+ \\\n",
    "                        author_url + '\\t' + views+ '\\t'  + replies+ '\\t'+ date\n",
    "            with open(file_name,'a') as p: # '''Note'''：Ａppend mode, run only once!\n",
    "                        p.write(record.encode('utf-8')+\"\\n\") ##!!encode here to utf-8 to avoid encoding\n",
    "\n",
    "    except Exception, e:\n",
    "        print e\n",
    "        pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0\n",
      "1\n",
      "2\n",
      "3\n",
      "4\n",
      "5\n",
      "6\n",
      "7\n",
      "8\n",
      "9\n"
     ]
    }
   ],
   "source": [
    "# crawl all pages\n",
    "for page_num in range(10):\n",
    "    print (page_num)\n",
    "    crawler(page_num, '/Users/chengjun/bigdata/tianya_bbs_threads_list.txt') "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>【民间语文第161期】宁波px启示:船进港湾人应上岸</td>\n",
       "      <td>/post-free-2849477-1.shtml</td>\n",
       "      <td>贾也</td>\n",
       "      <td>http://www.tianya.cn/50499450</td>\n",
       "      <td>194675</td>\n",
       "      <td>2703</td>\n",
       "      <td>2012-10-29 07:59</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>宁波镇海PX项目引发群体上访 当地政府发布说明(转载)</td>\n",
       "      <td>/post-free-2839539-1.shtml</td>\n",
       "      <td>无上卫士ABC</td>\n",
       "      <td>http://www.tianya.cn/74341835</td>\n",
       "      <td>88244</td>\n",
       "      <td>1041</td>\n",
       "      <td>2012-10-24 12:41</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                             0                           1        2  \\\n",
       "0   【民间语文第161期】宁波px启示:船进港湾人应上岸  /post-free-2849477-1.shtml       贾也   \n",
       "1  宁波镇海PX项目引发群体上访 当地政府发布说明(转载)  /post-free-2839539-1.shtml  无上卫士ABC   \n",
       "\n",
       "                               3       4     5                 6  \n",
       "0  http://www.tianya.cn/50499450  194675  2703  2012-10-29 07:59  \n",
       "1  http://www.tianya.cn/74341835   88244  1041  2012-10-24 12:41  "
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv('/Users/chengjun/github/cjc2016/data/tianya_bbs_threads_list.txt', sep = \"\\t\", header=None)\n",
    "df[:2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "467"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>link</th>\n",
       "      <th>author</th>\n",
       "      <th>author_page</th>\n",
       "      <th>click</th>\n",
       "      <th>reply</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>【民间语文第161期】宁波px启示:船进港湾人应上岸</td>\n",
       "      <td>/post-free-2849477-1.shtml</td>\n",
       "      <td>贾也</td>\n",
       "      <td>http://www.tianya.cn/50499450</td>\n",
       "      <td>194675</td>\n",
       "      <td>2703</td>\n",
       "      <td>2012-10-29 07:59</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>宁波镇海PX项目引发群体上访 当地政府发布说明(转载)</td>\n",
       "      <td>/post-free-2839539-1.shtml</td>\n",
       "      <td>无上卫士ABC</td>\n",
       "      <td>http://www.tianya.cn/74341835</td>\n",
       "      <td>88244</td>\n",
       "      <td>1041</td>\n",
       "      <td>2012-10-24 12:41</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                         title                        link   author  \\\n",
       "0   【民间语文第161期】宁波px启示:船进港湾人应上岸  /post-free-2849477-1.shtml       贾也   \n",
       "1  宁波镇海PX项目引发群体上访 当地政府发布说明(转载)  /post-free-2839539-1.shtml  无上卫士ABC   \n",
       "\n",
       "                     author_page   click  reply              time  \n",
       "0  http://www.tianya.cn/50499450  194675   2703  2012-10-29 07:59  \n",
       "1  http://www.tianya.cn/74341835   88244   1041  2012-10-24 12:41  "
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df=df.rename(columns = {0:'title', 1:'link', 2:'author',3:'author_page', 4:'click', 5:'reply', 6:'time'})\n",
    "df[:2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "467"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(df.link)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# 抓取作者信息"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    http://www.tianya.cn/50499450\n",
       "1    http://www.tianya.cn/74341835\n",
       "2    http://www.tianya.cn/36535656\n",
       "3    http://www.tianya.cn/36959960\n",
       "4    http://www.tianya.cn/53134970\n",
       "Name: author_page, dtype: object"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.author_page[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "http://www.tianya.cn/62237033\n",
    "    \n",
    "http://www.tianya.cn/67896263\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# user_info"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 357,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "url = df.author_page[1]\n",
    "content = urllib2.urlopen(url).read() #获取网页的html文本\n",
    "soup1 = BeautifulSoup(content, \"lxml\") "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 359,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "浙江杭州市 259643 5832 2016-04-16 16:53:46 2011-04-14 20:49:00\n"
     ]
    }
   ],
   "source": [
    "user_info = soup.find('div',  {'class', 'userinfo'})('p')\n",
    "area, nid, freq_use, last_login_time, reg_time = [i.get_text()[4:] for i in user_info]\n",
    "print area, nid, freq_use, last_login_time, reg_time \n",
    "\n",
    "link_info = soup1.find_all('div', {'class', 'link-box'})\n",
    "followed_num, fans_num = [i.a.text for i in link_info]\n",
    "print followed_num, fans_num"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 393,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2 5\n"
     ]
    }
   ],
   "source": [
    "activity = soup1.find_all('span', {'class', 'subtitle'})\n",
    "post_num, reply_num = [j.text[2:] for i in activity[:1] for j in i('a')]\n",
    "print post_num, reply_num"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 386,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<span class=\"subtitle\">\n",
      "<a href=\"http://blog.tianya.cn/blog-3644295-1.shtml\" target=\"_blank\">贾也的博客</a>　\r\n",
      "\t\t\t\r\n",
      "\r\n",
      "\t\t\t</span>\n"
     ]
    }
   ],
   "source": [
    "print activity[2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 370,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "152 27451\n"
     ]
    }
   ],
   "source": [
    "link_info = soup.find_all('div', {'class', 'link-box'})\n",
    "followed_num, fans_num = [i.a.text for i in link_info]\n",
    "print followed_num, fans_num"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 369,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "u'152'"
      ]
     },
     "execution_count": 369,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "link_info[0].a.text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# user_info = soup.find('div',  {'class', 'userinfo'})('p')\n",
    "# user_infos = [i.get_text()[4:] for i in user_info]\n",
    "            \n",
    "def author_crawler(url, file_name):\n",
    "    try:\n",
    "        content = urllib2.urlopen(url).read() #获取网页的html文本\n",
    "        soup = BeautifulSoup(content, \"lxml\")\n",
    "        link_info = soup.find_all('div', {'class', 'link-box'})\n",
    "        followed_num, fans_num = [i.a.text for i in link_info]\n",
    "        try:\n",
    "            activity = soup.find_all('span', {'class', 'subtitle'})\n",
    "            post_num, reply_num = [j.text[2:] for i in activity[:1] for j in i('a')]\n",
    "        except:\n",
    "            post_num, reply_num = 1, 0\n",
    "        record =  '\\t'.join([url, followed_num, fans_num, post_num, reply_num])\n",
    "        with open(file_name,'a') as p: # '''Note'''：Ａppend mode, run only once!\n",
    "                    p.write(record.encode('utf-8')+\"\\n\") ##!!encode here to utf-8 to avoid encoding\n",
    "\n",
    "    except Exception, e:\n",
    "        print e, url\n",
    "        record =  '\\t'.join([url, 'na', 'na', 'na', 'na'])\n",
    "        with open(file_name,'a') as p: # '''Note'''：Ａppend mode, run only once!\n",
    "                    p.write(record.encode('utf-8')+\"\\n\") ##!!encode here to utf-8 to avoid encoding\n",
    "        pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0\n",
      "10\n",
      "20\n",
      "30\n",
      "40\n",
      "need more than 0 values to unpack http://www.tianya.cn/67896263\n",
      "need more than 0 values to unpack http://www.tianya.cn/42330613\n",
      "sequence item 3: expected string or Unicode, int found http://www.tianya.cn/26517664\n",
      "50\n",
      "need more than 0 values to unpack http://www.tianya.cn/75591747\n",
      "60\n",
      "need more than 0 values to unpack http://www.tianya.cn/24068399\n",
      "70\n",
      "80\n",
      "90\n",
      "need more than 0 values to unpack http://www.tianya.cn/67896263\n",
      "sequence item 3: expected string or Unicode, int found http://www.tianya.cn/62237033\n",
      "100\n",
      "110\n",
      "120\n",
      "130\n",
      "140\n",
      "150\n",
      "160\n",
      "170\n",
      "180\n",
      "190\n",
      "need more than 0 values to unpack http://www.tianya.cn/67896263\n",
      "200\n",
      "need more than 0 values to unpack http://www.tianya.cn/85353911\n",
      "210\n",
      "220\n",
      "230\n",
      "240\n",
      "250\n",
      "260\n",
      "270\n",
      "280\n",
      "need more than 0 values to unpack http://www.tianya.cn/67896263\n",
      "290\n",
      "need more than 0 values to unpack http://www.tianya.cn/67896263\n",
      "300\n",
      "310\n",
      "320\n",
      "need more than 0 values to unpack http://www.tianya.cn/67896263\n",
      "330\n",
      "340\n",
      "350\n",
      "360\n",
      "370\n",
      "need more than 0 values to unpack http://www.tianya.cn/67896263\n",
      "380\n",
      "390\n",
      "400\n",
      "410\n",
      "420\n",
      "430\n",
      "440\n",
      "450\n",
      "460\n"
     ]
    }
   ],
   "source": [
    "for k, url in enumerate(df.author_page):\n",
    "    if k % 10==0:\n",
    "        print k\n",
    "    author_crawler(url, '/Users/chengjun/github/cjc2016/data/tianya_bbs_threads_author_info2.txt') "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "http://www.tianya.cn/50499450/follow  \n",
    "\n",
    "还可抓取他们的关注列表和粉丝列表"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "***\n",
    "***\n",
    "# 数据抓取：\n",
    "> # 使用Python抓取回帖\n",
    "***\n",
    "***\n",
    "\n",
    "王成军\n",
    "\n",
    "wangchengjun@nju.edu.cn\n",
    "\n",
    "计算传播网 http://computational-communication.com"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'/post-free-2849477-1.shtml'"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.link[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'http://bbs.tianya.cn/post-free-2848797-1.shtml'"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "url = 'http://bbs.tianya.cn' + df.link[2]\n",
    "url"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false,
    "scrolled": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe src=http://bbs.tianya.cn/post-free-2848797-1.shtml width=1000 height=500></iframe>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from IPython.display import display_html, HTML\n",
    "HTML('<iframe src=http://bbs.tianya.cn/post-free-2848797-1.shtml width=1000 height=500></iframe>')\n",
    "# the webpage we would like to crawl"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "post = urllib2.urlopen(url).read() #获取网页的html文本\n",
    "post_soup = BeautifulSoup(post, \"lxml\") \n",
    "#articles = soup.find_all('tr')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "collapsed": false,
    "scrolled": true,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<!DOCTYPE HTML>\n",
      "<html>\n",
      " <head>\n",
      "  <meta charset=\"utf-8\"/>\n",
      "  <title>\n",
      "   宁波准备停止PX项目了，元芳，你怎么看？_天涯杂谈_天涯论坛\n",
      "  </title>\n",
      "  <meta content=\"宁波准备停止PX项目了，元芳，你怎么看？　　从宁波市政府新闻发言人处获悉，宁波市经与项目投资方研究决定：（1）坚决不上PX项目；（2）炼化一体化项目前期工作停止推进，再作科学论证。...\" name=\"description\"/>\n",
      "  <meta content=\"IE=EmulateIE9\" http-equiv=\"X-UA-Compatible\"/>\n",
      "  <meta content=\"牧阳光\" name=\"author\"/>\n",
      "  <meta content=\"format=xhtml; url=http://bbs.tianya.cn/m/post-free-2848797-1.shtml\" http-equiv=\"mobile-agent\"/>\n",
      "  <link href=\"http://static.tianyaui.com/global/ty/TY.css\" rel=\"stylesheet\" type=\"text/css\"/>\n",
      "  <link href=\"http://static.tianyaui.com/global/bbs/web/static/css/bbs_article_c55fffc.css\" rel=\"stylesheet\" type=\"text/css\"/>\n",
      "  <link href=\"http://static.tianyaui.com/favicon.ico\" rel=\"shortcut icon\" type=\"image/vnd.microsoft.icon\"/>\n",
      "  <link href=\"http://bbs.tianya.cn/post-free-2848797-2.shtml\" rel=\"next\"/>\n",
      "  <script type=\"text/javascript\">\n",
      "   var bbsGlobal = {\r\n",
      "\tisEhomeItem : false,\r\n",
      "\tisNewArticle : false,\r\n",
      "\tauthorId : \"36535656\",\r\n",
      "\tauthorName : \"牧阳光\", \r\n",
      "\tblocktype : \"主版\",\r\n",
      "\titemType : \"主版\",\r\n",
      "\tlaibaType : \"null\",  \r\n",
      "\tpage : \"1\",\r\n",
      "\tpermission : true,\r\n",
      "\tpermissionStatus : \"开放\",\r\n",
      "\titemPermission : 1,\r\n",
      "\titemCategory : \"社会\",\r\n",
      "\tisWeiLun : false,\r\n",
      "\tisSealItem : false,\r\n",
      "\titem : \"free\",\r\n",
      "\titemName : \"天涯杂谈\",\r\n",
      "\tartId : 2848797,\r\n",
      "\tmedia : 0,\r\n",
      "\tsubType : \"\",\r\n",
      "\tad : 0,\r\n",
      "\tadshow : 0,\r\n",
      "\tadblock : 0,\r\n",
      "\twords : [{\"link\":\"http://ebook.tianya.cn/html2/chapter.aspx?bookid=75204&comefrom=tianya \",\"word\":\"养鬼\"},{\"link\":\"http://ebook.tianya.cn/html2/chapter.aspx?bookid=78822&comefrom=tianya \",\"word\":\"爱情\"},{\"link\":\"http://ebook.tianya.cn/html2/chapter.aspx?bookid=78829&comefrom=tianya \",\"word\":\"离婚\"},{\"link\":\"http://zan.tianya.cn/\",\"word\":\"原创\"},{\"link\":\"http://bbs.tianya.cn/list-410-1.shtml\",\"word\":\"主播\"},{\"link\":\"http://groups.tianya.cn/list-163029-1.shtml\",\"word\":\"日记\"},{\"link\":\"http://groups.tianya.cn/list-163907-1.shtml\",\"word\":\"吐糟\"},{\"link\":\"http://groups.tianya.cn/list-65695-1.shtml\",\"word\":\"青春\"},{\"link\":\"http://groups.tianya.cn/list-86723-1.shtml\",\"word\":\"EXO\"},{\"link\":\"http://groups.tianya.cn/list-81007-1.shtml\",\"word\":\"李易峰\"},{\"link\":\"http://groups.tianya.cn/list-9999-1.shtml\",\"word\":\"乔振宇\"},{\"link\":\"http://groups.tianya.cn/list-9619-1.shtml\",\"word\":\"陈晓\"},{\"link\":\"http://groups.tianya.cn/list-9214-1.shtml\",\"word\":\"历史\"},{\"link\":\"http://groups.tianya.cn/list-163222-1.shtml\",\"word\":\"哲学\"},{\"link\":\"http://groups.tianya.cn/list-163503-1.shtml\",\"word\":\"感动\"},{\"link\":\"http://bbs.tianya.cn/list-730-1.shtml\",\"word\":\"中山大学\"},{\"link\":\"http://groups.tianya.cn/list-14270-1.shtml\",\"word\":\"雨后\"},{\"link\":\"http://groups.tianya.cn/list-9797-1.shtml\",\"word\":\"钟汉良\"}],\r\n",
      "\tpageCount : 7,\r\n",
      "\tdashang : {\r\n",
      "\t\tmerId : \"%E5%A4%A9%E6%B6%AF%E8%AE%BA%E5%9D%9B\",\r\n",
      "\t\tmerNum : \"free-2848797\",\r\n",
      "\t\tgetName : \"%E7%89%A7%E9%98%B3%E5%85%89\",\r\n",
      "\t\ttime : \"1463208857581\",\r\n",
      "\t\text1 : \"2848797\",\r\n",
      "\t\text2 : \"free\",\r\n",
      "\t\text4 : \"%E5%AE%81%E6%B3%A2%E5%87%86%E5%A4%87%E5%81%9C%E6%AD%A2PX%E9%A1%B9%E7%9B%AE%E4%BA%86%EF%BC%8C%E5%85%83%E8%8A%B3%EF%BC%8C%E4%BD%A0%E6%80%8E%E4%B9%88%E7%9C%8B%EF%BC%9F\",\r\n",
      "\t\tsign : \"4fd8fdddad1703e34ab83df926976edb\",\r\n",
      "\t\tamount : 0,\r\n",
      "\t\tzhiyinCount : 0,\r\n",
      "\t\tnewestRecords : null\r\n",
      "\t},\r\n",
      "\tisWenda : false,\r\n",
      "\t\r\n",
      "\tisSlide : true,\r\n",
      "\ttrueName : \"\",\r\n",
      "\tartProtectedTips : \"\"\r\n",
      "};\r\n",
      "\r\n",
      "var adsGlobal = {\r\n",
      "\t\titemId : \"free\",\r\n",
      "\t\tpageType : \"02\",\r\n",
      "\t\tpopWinId: \"16\"\r\n",
      "};\n",
      "  </script>\n",
      "  <script charset=\"utf-8\" src=\"http://static.tianyaui.com/global/ty/TY.js\" type=\"text/javascript\">\n",
      "  </script>\n",
      "  <script charset=\"utf-8\" src=\"http://static.tianyaui.com/global/bbs/web/static/js/main_e88627e.js\" type=\"text/javascript\">\n",
      "  </script>\n",
      " </head>\n",
      " <body>\n",
      "  <div id=\"top_nav_wrap\">\n",
      "   <div class=\"clearfix\" id=\"top_nav\">\n",
      "    <div class=\"top-nav-logo\">\n",
      "     <a _tystat=\"新版顶导航/Logo\" href=\"http://focus.tianya.cn/\">\n",
      "     </a>\n",
      "    </div>\n",
      "    <div class=\"top-nav-main clearfix\">\n",
      "     <div class=\"top-nav-menu clearfix\">\n",
      "      <div class=\"top-nav-fl clearfix\">\n",
      "       <ul class=\"top-nav-menu-list clearfix\">\n",
      "        <li class=\"top-nav-menu-li top-nav-menu-li-first\">\n",
      "         <a _checklocation=\"1\" _tystat=\"新版顶导航/论坛\" appstr=\"bbs\" class=\"top-nav-main-menu\" href=\"http://bbs.tianya.cn/\">\n",
      "          论坛\n",
      "         </a>\n",
      "        </li>\n",
      "        <li class=\"top-nav-menu-li\">\n",
      "         <a _checklocation=\"1\" _tystat=\"新版顶导航/分社区/聚焦\" appstr=\"focus\" class=\"top-nav-main-menu\" href=\"http://focus.tianya.cn/\">\n",
      "          聚焦\n",
      "         </a>\n",
      "        </li>\n",
      "        <li class=\"top-nav-menu-li\">\n",
      "         <a _tystat=\"新版顶导航/部落\" class=\"top-nav-main-menu\" href=\"http://groups.tianya.cn\">\n",
      "          部落\n",
      "         </a>\n",
      "        </li>\n",
      "        <li class=\"top-nav-menu-li\">\n",
      "         <a _checklocation=\"1\" _tystat=\"新版顶导航/博客\" appstr=\"blog\" class=\"top-nav-main-menu\" href=\"http://blog.tianya.cn/\">\n",
      "          博客\n",
      "         </a>\n",
      "        </li>\n",
      "        <li class=\"top-nav-menu-li\">\n",
      "         <a _checklocation=\"1\" _tystat=\"新版顶导航/问答\" appstr=\"wenda\" class=\"top-nav-main-menu\" href=\"http://wenda.tianya.cn/\">\n",
      "          问答\n",
      "         </a>\n",
      "        </li>\n",
      "        <!-- <li class=\"top-nav-menu-li\"><a _tystat=\"新版顶导航/天涯客\" href=\"http://travel.tianya.cn/\" cl\n"
     ]
    }
   ],
   "source": [
    "print (post_soup.prettify())[:5000]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "90"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pa = post_soup.find_all('div', {'class', 'atl-item'})\n",
    "len(pa)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<div _host=\"%E7%89%A7%E9%98%B3%E5%85%89\" class=\"atl-item host-item\">\n",
      "<div class=\"atl-content\">\n",
      "<div class=\"atl-con-hd clearfix\">\n",
      "<div class=\"atl-con-hd-l\"></div>\n",
      "<div class=\"atl-con-hd-r\"></div>\n",
      "</div>\n",
      "<div class=\"atl-con-bd clearfix\">\n",
      "<div class=\"bbs-content clearfix\">\n",
      "<br/>\n",
      "　　从宁波市政府新闻发言人处获悉，宁波市经与项目投资方研究决定：（1）坚决不上PX项目；（2）炼化一体化项目前期工作停止推进，再作科学论证。<br/>\n",
      "<br/>\n",
      "</div>\n",
      "<div class=\"clearfix\" id=\"alt_action\"></div>\n",
      "<div class=\"clearfix\">\n",
      "<div class=\"host-data\">\n",
      "<span>楼主发言：11次</span> <span>发图：0张</span>\n",
      "</div>\n",
      "<div class=\"atl-reply\" id=\"alt_reply\">\n",
      "<a author=\"牧阳光\" authorid=\"36535656\" class=\"reportme a-link\" href=\"javascript:void(0);\" replyid=\"0\" replytime=\"2012-10-28 19:11:00\"> 举报</a> | \r\n",
      "\t\t\t\t\t\t\t<a class=\"a-link acl-share\" href=\"javascript:void(0);\">分享</a> |  \r\n",
      "\t\t\t               \t<a class=\"a-link acl-more\" href=\"javascript:void(0);\">更多</a> |\r\n",
      "\t\t\t\t\t\t\t<span><a class=\"a-link\" name=\"0\">楼主</a></span>\n",
      "<a _name=\"牧阳光\" _time=\"2012-10-28 19:11:00\" class=\"a-link2 replytop\" href=\"#fabu_anchor\">回复</a>\n",
      "</div>\n",
      "</div>\n",
      "<div id=\"ds-quick-box\" style=\"display:none;\"></div>\n",
      "</div>\n",
      "<div class=\"atl-con-ft clearfix\">\n",
      "<div class=\"atl-con-ft-l\"></div>\n",
      "<div class=\"atl-con-ft-r\"></div>\n",
      "<div id=\"niuren_ifm\"></div>\n",
      "</div>\n",
      "</div>\n",
      "</div>\n"
     ]
    }
   ],
   "source": [
    "print pa[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": false,
    "scrolled": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<div _host=\"%E6%80%A8%E9%AD%82%E9%AC%BC\" class=\"atl-item\" id=\"1\" js_restime=\"2012-10-28 19:17:56\" js_username=\"%E6%80%A8%E9%AD%82%E9%AC%BC\" replyid=\"92725226\">\n",
      "<div class=\"atl-head\" id=\"ea93038aa568ef2bf7a8cf6b6853b744\">\n",
      "<div class=\"atl-head-reply\"></div>\n",
      "<div class=\"atl-info\">\n",
      "<span>作者：<a class=\"js-vip-check\" href=\"http://www.tianya.cn/73157063\" target=\"_blank\" uid=\"73157063\" uname=\"怨魂鬼\">怨魂鬼</a> </span>\n",
      "<span>时间：2012-10-28 19:17:56</span>\n",
      "</div>\n",
      "</div>\n",
      "<div class=\"atl-content\">\n",
      "<div class=\"atl-con-hd clearfix\">\n",
      "<div class=\"atl-con-hd-l\"></div>\n",
      "<div class=\"atl-con-hd-r\"></div>\n",
      "</div>\n",
      "<div class=\"atl-con-bd clearfix\">\n",
      "<div class=\"bbs-content\">\r\n",
      "\t\t\t\t\t\t\t　　图片分享<img original=\"http://img3.laibafile.cn/p/m/122161321.jpg\" src=\"http://static.tianyaui.com/img/static/2011/imgloading.gif\"/><br/><br/>　　\r\n",
      "\t\t\t\t\t\t\t\r\n",
      "\t\t\t\t\t\t</div>\n",
      "<div class=\"atl-reply\">\r\n",
      "\t\t\t\t\t\t\t来自 <a _stat=\"/stat/bbs/post/来自\" class=\"a-link\" href=\"http://www.tianya.cn/mobile/\" rel=\"nofollow\" target=\"_blank\">天涯社区（微论）客户端</a> |\r\n",
      "\t\t\t\t\t\t\t<a author=\"怨魂鬼\" authorid=\"73157063\" class=\"reportme a-link\" href=\"javascript:void(0);\" replyid=\"92725226\" replytime=\"2012-10-28 19:17:56\">举报</a> |\r\n",
      "\t\t\t\t\t\t\t\t\t\t\t\t\t\t\r\n",
      "\t\t\t\t\t\t\t<span>1楼</span> |\r\n",
      "\t\t\t\t\t\t\t<a class=\"a-link-2 ir-shang\" floor=\"1\" href=\"javascript:void(0);\" title=\"打赏层主\">\r\n",
      "\t\t\t\t\t\t\t\t打赏\r\n",
      "\t\t\t\t\t\t\t</a> |\r\n",
      "\t\t\t\t\t\t\t<a class=\"a-link-2 reply\" href=\"#fabu_anchor\" title=\"引用回复\">回复</a> |\r\n",
      "\t\t\t\t\t\t\t<a _stat=\"/stat/bbs/post/评论\" class=\"a-link-2 ir-remark\" floor=\"1\" href=\"javascript:void(0);\" title=\"插入评论\">\r\n",
      "\t\t\t\t\t\t\t\t评论\r\n",
      "\t\t\t\t\t\t\t</a>\n",
      "</div>\n",
      "</div>\n",
      "<div class=\"atl-con-ft clearfix\">\n",
      "<div class=\"atl-con-ft-l\"></div>\n",
      "<div class=\"atl-con-ft-r\"></div>\n",
      "</div>\n",
      "</div>\n",
      "</div>\n"
     ]
    }
   ],
   "source": [
    "print pa[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<div _host=\"%E5%B1%B1%E9%9B%A8%E6%AC%B2%E6%BB%A1%E6%A5%BC\" class=\"atl-item\" id=\"100\" js_restime=\"2012-10-28 21:34:21\" js_username=\"%E5%B1%B1%E9%9B%A8%E6%AC%B2%E6%BB%A1%E6%A5%BC\" replyid=\"92725457\">\n",
      "<div class=\"atl-head\" id=\"8cc4c81381b126cc08dd65759233d6de\">\n",
      "<div class=\"atl-head-reply\"></div>\n",
      "<div class=\"atl-info\">\n",
      "<span>作者：<a class=\"js-vip-check\" href=\"http://www.tianya.cn/74980506\" target=\"_blank\" uid=\"74980506\" uname=\"山雨欲满楼\">山雨欲满楼</a> </span>\n",
      "<span>时间：2012-10-28 21:34:21</span>\n",
      "</div>\n",
      "</div>\n",
      "<div class=\"atl-content\">\n",
      "<div class=\"atl-con-hd clearfix\">\n",
      "<div class=\"atl-con-hd-l\"></div>\n",
      "<div class=\"atl-con-hd-r\"></div>\n",
      "</div>\n",
      "<div class=\"atl-con-bd clearfix\">\n",
      "<div class=\"bbs-content\">\r\n",
      "\t\t\t\t\t\t\t　　围观也是力量，得道多助失道寡助\r\n",
      "\t\t\t\t\t\t\t\r\n",
      "\t\t\t\t\t\t</div>\n",
      "<div class=\"atl-reply\">\n",
      "<a author=\"山雨欲满楼\" authorid=\"74980506\" class=\"reportme a-link\" href=\"javascript:void(0);\" replyid=\"92725457\" replytime=\"2012-10-28 21:34:21\">举报</a> |\r\n",
      "\t\t\t\t\t\t\t\t\t\t\t\t\t\t\r\n",
      "\t\t\t\t\t\t\t<span>100楼</span> |\r\n",
      "\t\t\t\t\t\t\t<a class=\"a-link-2 ir-shang\" floor=\"100\" href=\"javascript:void(0);\" title=\"打赏层主\">\r\n",
      "\t\t\t\t\t\t\t\t打赏\r\n",
      "\t\t\t\t\t\t\t</a> |\r\n",
      "\t\t\t\t\t\t\t<a class=\"a-link-2 reply\" href=\"#fabu_anchor\" title=\"引用回复\">回复</a> |\r\n",
      "\t\t\t\t\t\t\t<a _stat=\"/stat/bbs/post/评论\" class=\"a-link-2 ir-remark\" floor=\"100\" href=\"javascript:void(0);\" title=\"插入评论\">\r\n",
      "\t\t\t\t\t\t\t\t评论\r\n",
      "\t\t\t\t\t\t\t</a>\n",
      "</div>\n",
      "</div>\n",
      "<div class=\"atl-con-ft clearfix\">\n",
      "<div class=\"atl-con-ft-l\"></div>\n",
      "<div class=\"atl-con-ft-r\"></div>\n",
      "</div>\n",
      "</div>\n",
      "</div>\n"
     ]
    }
   ],
   "source": [
    "print pa[89]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "作者：柠檬在追逐 时间：2012-10-28 21:33:55\n",
    "\n",
    "　　@lice5 2012-10-28 20:37:17\n",
    "  \n",
    "　　作为宁波人 还是说一句：革命尚未成功 同志仍需努力 \n",
    "  \n",
    "　　-----------------------------\n",
    "  \n",
    "　　对 现在说成功还太乐观，就怕说一套做一套"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "作者：lice5 时间：2012-10-28 20:37:17\n",
    "        \n",
    "　　作为宁波人 还是说一句：革命尚未成功 同志仍需努力"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "4\t/post-free-4242156-1.shtml\t2014-04-09 15:55:35\t61943225\t野渡自渡人\t@Y雷政府34楼2014-04-0422:30:34　　野渡君雄文！支持是必须的。　　-----------------------------　　@清坪过客16楼2014-04-0804:09:48　　绝对的权力导致绝对的腐败！　　-----------------------------　　@T大漠鱼T35楼2014-04-0810:17:27　　@周丕东@普欣@拾月霜寒2012@小摸包@姚文嚼字@四號@凌宸@乔志峰@野渡自渡人@曾兵2010@缠绕夜色@曾颖@风青扬请关注"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 118,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "从宁波市政府新闻发言人处获悉，宁波市经与项目投资方研究决定：（1）坚决不上PX项目；（2）炼化一体化项目前期工作停止推进，再作科学论证。\n"
     ]
    }
   ],
   "source": [
    "print pa[0].find('div', {'class', 'bbs-content'}).text.strip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 119,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "@lice5 2012-10-28 20:37:17　　作为宁波人 还是说一句：革命尚未成功 同志仍需努力 　　-----------------------------　　对 现在说成功还太乐观，就怕说一套做一套\n"
     ]
    }
   ],
   "source": [
    "print pa[87].find('div', {'class', 'bbs-content'}).text.strip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<a class=\"js-vip-check\" href=\"http://www.tianya.cn/73157063\" target=\"_blank\" uid=\"73157063\" uname=\"\\u6028\\u9b42\\u9b3c\">\\u6028\\u9b42\\u9b3c</a>"
      ]
     },
     "execution_count": 104,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pa[1].a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 113,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<a author=\"牧阳光\" authorid=\"36535656\" class=\"reportme a-link\" href=\"javascript:void(0);\" replyid=\"0\" replytime=\"2012-10-28 19:11:00\"> 举报</a>\n"
     ]
    }
   ],
   "source": [
    "print pa[0].find('a', class_ = 'reportme a-link')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 115,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2012-10-28 19:11:00\n"
     ]
    }
   ],
   "source": [
    "print pa[0].find('a', class_ = 'reportme a-link')['replytime']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 114,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "牧阳光\n"
     ]
    }
   ],
   "source": [
    "print pa[0].find('a', class_ = 'reportme a-link')['author']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2012-10-28 19:11:00 ---> 36535656 ---> 牧阳光 ---> 从宁波市政府新闻发言人处获悉，宁波市经与项目投资方研究决定：（1）坚决不上PX项目；（2）炼化一体化项目前期工作停止推进，再作科学论证。 \n",
      "\n",
      "2012-10-28 19:17:56 ---> 73157063 ---> 怨魂鬼 ---> 图片分享 \n",
      "\n",
      "2012-10-28 19:18:17 ---> 73157063 ---> 怨魂鬼 ---> @怨魂鬼 2012-10-28 19:17:56　　图片分享   　　[发自掌中天涯客户端 ]　　-----------------------------　　2楼我的天下！ \n",
      "\n",
      "2012-10-28 19:18:46 ---> 36535656 ---> 牧阳光 ---> 。。。沙发板凳这么快就被坐了~~ \n",
      "\n",
      "2012-10-28 19:19:04 ---> 41774471 ---> zgh0213 ---> 元芳你怎么看 \n",
      "\n",
      "2012-10-28 19:19:37 ---> 73157063 ---> 怨魂鬼 ---> @牧阳光 2012-10-28 19:18:46　　。。。沙发板凳这么快就被坐了~~　　-----------------------------　　运气、 \n",
      "\n",
      "2012-10-28 19:20:04 ---> 36535656 ---> 牧阳光 ---> @怨魂鬼 5楼 　　运气、　　-----------------------------　　哈哈。。。 \n",
      "\n",
      "2012-10-28 19:20:07 ---> 54060837 ---> 帆小叶 ---> 八卦的被和谐了。帖个链接http://api.pwnz.org/0/?url=bG10aC4　　wOTIyNzQvNzIvMDEvMjEvc3dlbi9tb2MuYW5　　paGN0ZXJjZXMud3d3Ly9BMyVwdHRo \n",
      "\n",
      "2012-10-28 19:20:33 ---> 36535656 ---> 牧阳光 ---> @怨魂鬼 2楼 　　2楼我的天下！　　-----------------------------　　。。。还是掌中天涯，NB的~~ \n",
      "\n",
      "2012-10-28 19:25:22 ---> 36535656 ---> 牧阳光 ---> 消息来源，官方微博@宁波发布 \n",
      "\n"
     ]
    }
   ],
   "source": [
    "for i in pa[:10]:\n",
    "    p_info = i.find('a', class_ = 'reportme a-link')\n",
    "    p_time = p_info['replytime']\n",
    "    p_author_id = p_info['authorid']\n",
    "    p_author_name = p_info['author']\n",
    "    p_content = i.find('div', {'class', 'bbs-content'}).text.strip()\n",
    "    p_content = p_content.replace('\\t', '')\n",
    "    print p_time, '--->', p_author_id, '--->', p_author_name,'--->', p_content, '\\n'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# 如何翻页\n",
    "\n",
    "http://bbs.tianya.cn/post-free-2848797-1.shtml\n",
    "    \n",
    "    http://bbs.tianya.cn/post-free-2848797-2.shtml\n",
    "        \n",
    "        http://bbs.tianya.cn/post-free-2848797-3.shtml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<div class=\"atl-pages\"><form action=\"\" method=\"get\" onsubmit=\"return goPage(this,'free',2848797,7);\">\\n<span>\\u4e0a\\u9875</span>\\n<strong>1</strong>\\n<a href=\"/post-free-2848797-2.shtml\">2</a>\\n<a href=\"/post-free-2848797-3.shtml\">3</a>\\n<a href=\"/post-free-2848797-4.shtml\">4</a>\\n\\u2026\\n<a href=\"/post-free-2848797-7.shtml\">7</a>\\n<a class=\"js-keyboard-next\" href=\"/post-free-2848797-2.shtml\">\\u4e0b\\u9875</a>\\n\\xa0\\u5230<input class=\"pagetxt\" name=\"page\" type=\"text\"/>\\u9875\\xa0<input class=\"pagebtn\" maxlength=\"6\" name=\"gopage\" type=\"submit\" value=\"\\u786e\\u5b9a\"/></form></div>"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "post_soup.find('div', {'class', 'atl-pages'})#.['onsubmit']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'7'"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "post_pages = post_soup.find('div', {'class', 'atl-pages'})\n",
    "post_pages = post_pages.form['onsubmit'].split(',')[-1].split(')')[0]\n",
    "post_pages\n",
    "#post_soup.select('.atl-pages')[0].select('form')[0].select('onsubmit')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 144,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'http://bbs.tianya.cn/postfree2848797-%d.shtml'"
      ]
     },
     "execution_count": 144,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "url = 'http://bbs.tianya.cn' + df.link[2]\n",
    "url_base = ''.join(url.split('-')[:-1]) + '-%d.shtml'\n",
    "url_base"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "def parsePage(pa):\n",
    "    records = []\n",
    "    for i in pa:\n",
    "        p_info = i.find('a', class_ = 'reportme a-link')\n",
    "        p_time = p_info['replytime']\n",
    "        p_author_id = p_info['authorid']\n",
    "        p_author_name = p_info['author']\n",
    "        p_content = i.find('div', {'class', 'bbs-content'}).text.strip()\n",
    "        p_content = p_content.replace('\\t', '').replace('\\n', '')#.replace(' ', '')\n",
    "        record = p_time + '\\t' + p_author_id+ '\\t' + p_author_name + '\\t'+ p_content\n",
    "        records.append(record)\n",
    "    return records\n",
    "\n",
    "import sys\n",
    "def flushPrint(s):\n",
    "    sys.stdout.write('\\r')\n",
    "    sys.stdout.write('%s' % s)\n",
    "    sys.stdout.flush()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 246,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<div class=\"atl-pages host-pages\"></div>"
      ]
     },
     "execution_count": 246,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "url_1 = 'http://bbs.tianya.cn' + df.link[10]\n",
    "content = urllib2.urlopen(url_1).read() #获取网页的html文本\n",
    "post_soup = BeautifulSoup(content, \"lxml\") \n",
    "pa = post_soup.find_all('div', {'class', 'atl-item'})\n",
    "b = post_soup.find('div', class_= 'atl-pages')\n",
    "b"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 247,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<div class=\"atl-pages\"><form action=\"\" method=\"get\" onsubmit=\"return goPage(this,'free',2849477,28);\">\\n<span>\\u4e0a\\u9875</span>\\n<strong>1</strong>\\n<a href=\"/post-free-2849477-2.shtml\">2</a>\\n<a href=\"/post-free-2849477-3.shtml\">3</a>\\n<a href=\"/post-free-2849477-4.shtml\">4</a>\\n\\u2026\\n<a href=\"/post-free-2849477-28.shtml\">28</a>\\n<a class=\"js-keyboard-next\" href=\"/post-free-2849477-2.shtml\">\\u4e0b\\u9875</a>\\n\\xa0\\u5230<input class=\"pagetxt\" name=\"page\" type=\"text\"/>\\u9875\\xa0<input class=\"pagebtn\" maxlength=\"6\" name=\"gopage\" type=\"submit\" value=\"\\u786e\\u5b9a\"/></form></div>"
      ]
     },
     "execution_count": 247,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "url_1 = 'http://bbs.tianya.cn' + df.link[0]\n",
    "content = urllib2.urlopen(url_1).read() #获取网页的html文本\n",
    "post_soup = BeautifulSoup(content, \"lxml\") \n",
    "pa = post_soup.find_all('div', {'class', 'atl-item'})\n",
    "a = post_soup.find('div', {'class', 'atl-pages'})\n",
    "a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 251,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<form action=\"\" method=\"get\" onsubmit=\"return goPage(this,'free',2849477,28);\">\\n<span>\\u4e0a\\u9875</span>\\n<strong>1</strong>\\n<a href=\"/post-free-2849477-2.shtml\">2</a>\\n<a href=\"/post-free-2849477-3.shtml\">3</a>\\n<a href=\"/post-free-2849477-4.shtml\">4</a>\\n\\u2026\\n<a href=\"/post-free-2849477-28.shtml\">28</a>\\n<a class=\"js-keyboard-next\" href=\"/post-free-2849477-2.shtml\">\\u4e0b\\u9875</a>\\n\\xa0\\u5230<input class=\"pagetxt\" name=\"page\" type=\"text\"/>\\u9875\\xa0<input class=\"pagebtn\" maxlength=\"6\" name=\"gopage\" type=\"submit\" value=\"\\u786e\\u5b9a\"/></form>"
      ]
     },
     "execution_count": 251,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a.form"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 254,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "false\n"
     ]
    }
   ],
   "source": [
    "if b.form:\n",
    "    print 'true'\n",
    "else:\n",
    "    print 'false'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "import random\n",
    "import time\n",
    "\n",
    "def crawler(url, file_name):\n",
    "    try:\n",
    "        # open the browser\n",
    "        url_1 = 'http://bbs.tianya.cn' + url\n",
    "        content = urllib2.urlopen(url_1).read() #获取网页的html文本\n",
    "        post_soup = BeautifulSoup(content, \"lxml\") \n",
    "        # how many pages in a post\n",
    "        post_form = post_soup.find('div', {'class', 'atl-pages'})\n",
    "        if post_form.form:\n",
    "            post_pages = post_form.form['onsubmit'].split(',')[-1].split(')')[0]\n",
    "            post_pages = int(post_pages)\n",
    "            url_base = '-'.join(url_1.split('-')[:-1]) + '-%d.shtml'\n",
    "        else:\n",
    "            post_pages = 1\n",
    "        # for the first page\n",
    "        pa = post_soup.find_all('div', {'class', 'atl-item'})\n",
    "        records = parsePage(pa)\n",
    "        with open(file_name,'a') as p: # '''Note'''：Ａppend mode, run only once!\n",
    "            for record in records:    \n",
    "                p.write('1'+ '\\t' + url + '\\t' + record.encode('utf-8')+\"\\n\") \n",
    "        # for the 2nd+ pages\n",
    "        if post_pages > 1:\n",
    "            for page_num in range(2, post_pages+1):\n",
    "                time.sleep(random.random())\n",
    "                flushPrint(page_num)\n",
    "                url2 =url_base  % page_num\n",
    "                content = urllib2.urlopen(url2).read() #获取网页的html文本\n",
    "                post_soup = BeautifulSoup(content, \"lxml\") \n",
    "                pa = post_soup.find_all('div', {'class', 'atl-item'})\n",
    "                records = parsePage(pa)\n",
    "                with open(file_name,'a') as p: # '''Note'''：Ａppend mode, run only once!\n",
    "                    for record in records:    \n",
    "                        p.write(str(page_num) + '\\t' +url + '\\t' + record.encode('utf-8')+\"\\n\") \n",
    "        else:\n",
    "            pass\n",
    "    except Exception, e:\n",
    "        print e\n",
    "        pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# 测试"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7"
     ]
    }
   ],
   "source": [
    "url = df.link[2]\n",
    "file_name = '/Users/chengjun/github/cjc2016/data/tianya_bbs_threads_2test.txt'\n",
    "crawler(url, file_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# 正式抓取！"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 417,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/post-free-2849477-1.shtmlThis it the post of : 0\n",
      "/post-free-2842180-1.shtmlThis it the post of : 10\n",
      "/post-free-3316698-1.shtmlThis it the post of : 20\n",
      "/post-free-923387-1.shtmlThis it the post of : 30\n",
      "/post-free-4236026-1.shtmlThis it the post of : 40\n",
      "/post-free-2850721-1.shtmlThis it the post of : 50\n",
      "/post-free-5054821-1.shtmlThis it the post of : 60\n",
      "/post-free-3326274-1.shtmlThis it the post of : 70\n",
      "/post-free-4236793-1.shtmlThis it the post of : 80\n",
      "/post-free-4239792-1.shtmlThis it the post of : 90\n",
      "/post-free-5042110-1.shtmlThis it the post of : 100\n",
      "/post-free-2241144-1.shtmlThis it the post of : 110\n",
      "/post-free-3324561-1.shtmlThis it the post of : 120\n",
      "/post-free-3835452-1.shtmlThis it the post of : 130\n",
      "/post-free-5045950-1.shtmlThis it the post of : 140\n",
      "/post-free-2848818-1.shtmlThis it the post of : 150\n",
      "/post-free-3281916-1.shtmlThis it the post of : 160\n",
      "/post-free-949151-1.shtmlThis it the post of : 170\n",
      "/post-free-2848839-1.shtmlThis it the post of : 180\n",
      "/post-free-3228423-1.shtmlThis it the post of : 190\n",
      "/post-free-2852970-1.shtmlThis it the post of : 200\n",
      "/post-free-3325388-1.shtmlThis it the post of : 210\n",
      "/post-free-3835748-1.shtmlThis it the post of : 220\n",
      "/post-free-3833431-1.shtmlThis it the post of : 230\n",
      "/post-free-3378998-1.shtmlThis it the post of : 240\n",
      "/post-free-3359022-1.shtmlThis it the post of : 250\n",
      "/post-free-3365791-1.shtmlThis it the post of : 260\n",
      "/post-free-3396378-1.shtmlThis it the post of : 270\n",
      "/post-free-3835212-1.shtmlThis it the post of : 280\n",
      "/post-free-4248593-1.shtmlThis it the post of : 290\n",
      "/post-free-3833373-1.shtmlThis it the post of : 300\n",
      "/post-free-3847600-1.shtmlThis it the post of : 310\n",
      "/post-free-3832970-1.shtmlThis it the post of : 320\n",
      "/post-free-4076130-1.shtmlThis it the post of : 330\n",
      "/post-free-3835673-1.shtmlThis it the post of : 340\n",
      "/post-free-3835434-1.shtmlThis it the post of : 350\n",
      "/post-free-3368554-1.shtmlThis it the post of : 360\n",
      "/post-free-3832938-1.shtmlThis it the post of : 370\n",
      "/post-free-3835075-1.shtmlThis it the post of : 380\n",
      "/post-free-3832963-1.shtmlThis it the post of : 390\n",
      "/post-free-4250604-1.shtmlThis it the post of : 400\n",
      "/post-free-3834828-1.shtmlThis it the post of : 410\n",
      "/post-free-3835007-1.shtmlThis it the post of : 420\n",
      "/post-free-3838253-1.shtmlThis it the post of : 430\n",
      "/post-free-3835167-1.shtmlThis it the post of : 440\n",
      "/post-free-3835898-1.shtmlThis it the post of : 450\n",
      "/post-free-3835123-1.shtmlThis it the post of : 460\n",
      "/post-free-3835031-1.shtml"
     ]
    }
   ],
   "source": [
    "for k, link in enumerate(df.link):\n",
    "    flushPrint(link)\n",
    "    if k % 10== 0:\n",
    "        print 'This it the post of : ' + str(k)\n",
    "    file_name = '/Users/chengjun/github/cjc2016/data/tianya_bbs_threads_network.txt'\n",
    "    crawler(link, file_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# 读取数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "8079"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dtt = []\n",
    "with open('/Users/chengjun/github/cjc2016/data/tianya_bbs_threads_network.txt', 'r') as f:\n",
    "    for line in f:\n",
    "        pnum, link, time, author_id, author, content = line.replace('\\n', '').split('\\t')\n",
    "        dtt.append([pnum, link, time, author_id, author, content])\n",
    "len(dtt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>/post-free-2849477-1.shtml</td>\n",
       "      <td>2012-10-29 07:59:00</td>\n",
       "      <td>50499450</td>\n",
       "      <td>贾也</td>\n",
       "      <td>先生是一位真爷们！第161期导语：人人宁波，面朝大海，春暖花开!　　宁波的事，怎谈？无从谈，...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>/post-free-2849477-1.shtml</td>\n",
       "      <td>2012-10-29 08:13:54</td>\n",
       "      <td>22122799</td>\n",
       "      <td>三平67</td>\n",
       "      <td>我们中国人都在一条船，颠簸已久，我们都想做宁波人，希望有一个风平浪静的港湾，面朝大海，春暖花...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>/post-free-2849477-1.shtml</td>\n",
       "      <td>2012-10-29 08:27:02</td>\n",
       "      <td>39027950</td>\n",
       "      <td>赶浪头</td>\n",
       "      <td>默默围观~</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>/post-free-2849477-1.shtml</td>\n",
       "      <td>2012-10-29 08:43:15</td>\n",
       "      <td>53986501</td>\n",
       "      <td>m408833176</td>\n",
       "      <td>不能收藏？</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>/post-free-2849477-1.shtml</td>\n",
       "      <td>2012-10-29 08:55:52</td>\n",
       "      <td>39073643</td>\n",
       "      <td>兰质薰心</td>\n",
       "      <td>楼主好文！　　相信政府一定有能力解决好这些问题.</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   0                           1                    2         3           4  \\\n",
       "0  1  /post-free-2849477-1.shtml  2012-10-29 07:59:00  50499450          贾也   \n",
       "1  1  /post-free-2849477-1.shtml  2012-10-29 08:13:54  22122799        三平67   \n",
       "2  1  /post-free-2849477-1.shtml  2012-10-29 08:27:02  39027950         赶浪头   \n",
       "3  1  /post-free-2849477-1.shtml  2012-10-29 08:43:15  53986501  m408833176   \n",
       "4  1  /post-free-2849477-1.shtml  2012-10-29 08:55:52  39073643        兰质薰心   \n",
       "\n",
       "                                                   5  \n",
       "0  先生是一位真爷们！第161期导语：人人宁波，面朝大海，春暖花开!　　宁波的事，怎谈？无从谈，...  \n",
       "1  我们中国人都在一条船，颠簸已久，我们都想做宁波人，希望有一个风平浪静的港湾，面朝大海，春暖花...  \n",
       "2                                              默默围观~  \n",
       "3                                              不能收藏？  \n",
       "4                           楼主好文！　　相信政府一定有能力解决好这些问题.  "
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dt = pd.DataFrame(dtt)\n",
    "dt[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 420,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>page_num</th>\n",
       "      <th>link</th>\n",
       "      <th>time</th>\n",
       "      <th>author</th>\n",
       "      <th>author_name</th>\n",
       "      <th>reply</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>/post-free-2849477-1.shtml</td>\n",
       "      <td>2012-10-29 07:59:00</td>\n",
       "      <td>50499450</td>\n",
       "      <td>贾也</td>\n",
       "      <td>先生是一位真爷们！第161期导语：人人宁波，面朝大海，春暖花开!　　宁波的事，怎谈？无从谈，...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>/post-free-2849477-1.shtml</td>\n",
       "      <td>2012-10-29 08:13:54</td>\n",
       "      <td>22122799</td>\n",
       "      <td>三平67</td>\n",
       "      <td>我们中国人都在一条船，颠簸已久，我们都想做宁波人，希望有一个风平浪静的港湾，面朝大海，春暖花...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>/post-free-2849477-1.shtml</td>\n",
       "      <td>2012-10-29 08:27:02</td>\n",
       "      <td>39027950</td>\n",
       "      <td>赶浪头</td>\n",
       "      <td>默默围观~</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>/post-free-2849477-1.shtml</td>\n",
       "      <td>2012-10-29 08:43:15</td>\n",
       "      <td>53986501</td>\n",
       "      <td>m408833176</td>\n",
       "      <td>不能收藏？</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>/post-free-2849477-1.shtml</td>\n",
       "      <td>2012-10-29 08:55:52</td>\n",
       "      <td>39073643</td>\n",
       "      <td>兰质薰心</td>\n",
       "      <td>楼主好文！　　相信政府一定有能力解决好这些问题.</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  page_num                        link                 time    author  \\\n",
       "0        1  /post-free-2849477-1.shtml  2012-10-29 07:59:00  50499450   \n",
       "1        1  /post-free-2849477-1.shtml  2012-10-29 08:13:54  22122799   \n",
       "2        1  /post-free-2849477-1.shtml  2012-10-29 08:27:02  39027950   \n",
       "3        1  /post-free-2849477-1.shtml  2012-10-29 08:43:15  53986501   \n",
       "4        1  /post-free-2849477-1.shtml  2012-10-29 08:55:52  39073643   \n",
       "\n",
       "  author_name                                              reply  \n",
       "0          贾也  先生是一位真爷们！第161期导语：人人宁波，面朝大海，春暖花开!　　宁波的事，怎谈？无从谈，...  \n",
       "1        三平67  我们中国人都在一条船，颠簸已久，我们都想做宁波人，希望有一个风平浪静的港湾，面朝大海，春暖花...  \n",
       "2         赶浪头                                              默默围观~  \n",
       "3  m408833176                                              不能收藏？  \n",
       "4        兰质薰心                           楼主好文！　　相信政府一定有能力解决好这些问题.  "
      ]
     },
     "execution_count": 420,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dt=dt.rename(columns = {0:'page_num', 1:'link', 2:'time', 3:'author',4:'author_name', 5:'reply'})\n",
    "dt[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 421,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     先生是一位真爷们！第161期导语：人人宁波，面朝大海，春暖花开!　　宁波的事，怎谈？无从谈，...\n",
       "1     我们中国人都在一条船，颠簸已久，我们都想做宁波人，希望有一个风平浪静的港湾，面朝大海，春暖花...\n",
       "2                                                 默默围观~\n",
       "3                                                 不能收藏？\n",
       "4                              楼主好文！　　相信政府一定有能力解决好这些问题.\n",
       "5                                                人民在觉醒。\n",
       "6                                          理性的文字，向楼主致敬！\n",
       "7                                     呼唤变革,人民需要的是服务型政府！\n",
       "8                                      顶贾兄！让我们携手努力保卫家园！\n",
       "9                                        围观就是力量,顶起就有希望.\n",
       "10                                       文章写得太有力量了，支持你！\n",
       "11    @贾也 2012-10-29 7:59:00　　导语：人人宁波，面朝大海，春暖花开　　......\n",
       "12                                   中国人从文盲走向民粹，实在是太快了。\n",
       "13                               杀死娘胎里的，毒死已出生的，这个社会怎么了？\n",
       "14                                                    3\n",
       "15    环境比什么都可贵，每一次呼吸，每一顿粮食，都息息相关，若任其恶化，而无从改观，那遑谈国家之未...\n",
       "16                                                 写的很好\n",
       "17    未来这里将是全球最大的垃圾场，而他们早已放浪西方。苟活的将面临数不清的癌症，无助的死亡。悲哀...\n",
       "18    媒体失声，高压维稳，只保留微博和论坛可以说这件事！因为那些人知道，网上的人和事就只能热几天，...\n",
       "19                                       说的太好了，看的我泪流满面！\n",
       "20          “我相信官场中，许多官员应该葆有社会正能量”　　通篇好文，顶！唯这句，不说也罢....\n",
       "21                                            先占一环，然后看帖\n",
       "22                                                说的太好了\n",
       "23    我上的小学，隔壁就是一家水泥厂，到处飞扬的水泥灰是我最熟悉的颜色;坐一站地车，就是一家造纸厂...\n",
       "24     我们中国人都在一条船，颠簸已久，我们都想做宁波人，希望有一个风平浪静的港湾，面朝大海，春暖花开！\n",
       "25                                               前排占座~~\n",
       "26                                          贾也先生是一位真爷们！\n",
       "27                                                     \n",
       "28                           为什么我的眼里常含着泪水?因为我对这片土地爱得深沉!\n",
       "29    又是因为环保的群体事件，影响面大，危害严重，理由叫的响，取得阶段性胜利。　　那些拆迁的、城管...\n",
       "                            ...                        \n",
       "70                          这是我近几年看过的写的最好的文章，！！！不多说了，危险\n",
       "71    @shdsb 2012-10-29 10:17:43　　媒体失声，高压维稳，只保留微博和论坛...\n",
       "72    @pals2009 48楼 　　每次看到这样的消息，都很痛心，很堵很堵。为什么在经济发展的同...\n",
       "73                                                成都啊成都\n",
       "74                                               是不得人心呀\n",
       "75    真爷们。。顶　　我是宁。波人，楼主说的是我们的心声。。。　　现在看到人民警察，不是有安全感，...\n",
       "76    作者：弱水三千chen　回复日期：2012-10-29 11:43:18　 回复  　　@兰...\n",
       "77                                            泣血之作！　　谢。\n",
       "78    作者：文蛮子 时间：2012-10-29 11:42:58 　　摆明了，带路党们难以煽动民众...\n",
       "79                                                字字真切!\n",
       "80    @曾开贵 2012-10-29 11:40:09　　没有ZF，哪来新ZG，没有新ZG，你们吃...\n",
       "81                                        好文，顶一下，为了我的故乡\n",
       "82                                                    0\n",
       "83                                             好文章，顶一个！\n",
       "84    作者：文蛮子 时间：2012-10-29 11:42:58 　　摆明了，带路党们难以煽动民众...\n",
       "85                                    一定要顶。在被和谐前让多点人知道吧\n",
       "86                                   围观也是一种力量　　天涯，也是好样的\n",
       "87                                                  很沉重\n",
       "88                                                 生不逢国\n",
       "89                                               很好的文章。\n",
       "90     我们中国人都在一条船，颠簸已久，我们都想做宁波人，希望有一个风平浪静的港湾，面朝大海，春暖花开！\n",
       "91                                                   路过\n",
       "92    民间语文，狗屁有点通。　　似是而非，点风扇火，实是不该。　　排污排毒，环境污染，确实严重。　...\n",
       "93    @横冲节度使 2012-10-29 12:11:50　　楼主这种汉奸、带路党，成天就做梦盼着...\n",
       "94    @赶浪头 2楼 　　默默围观~　　　　---------------------------...\n",
       "95                                   午休时间静静看完了，心中莫名地压抑。\n",
       "96                                             好文！必须顶起！\n",
       "97                                                 扎口了。\n",
       "98                                           谢谢分享 楼主辛苦了\n",
       "99                                             看不到我的回复哦\n",
       "Name: reply, dtype: object"
      ]
     },
     "execution_count": 421,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dt.reply[:100]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## 总帖数是多少？"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "http://search.tianya.cn/bbs?q=PX  共有18459 条内容"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "369"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "18459/50"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "实际上到第10页就没有了 http://bbs.tianya.cn/list.jsp?item=free&order=1&nextid=9&k=PX, 原来那只是天涯论坛，还有其它各种版块，如天涯聚焦：  http://focus.tianya.cn/  等等。\n",
    "\n",
    "- 娱乐八卦 512\n",
    "- 股市论坛 187\n",
    "- 情感天地 242\n",
    "- 天涯杂谈 1768 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "在天涯杂谈搜索雾霾，有41页 http://bbs.tianya.cn/list.jsp?item=free&order=20&nextid=40&k=%E9%9B%BE%E9%9C%BE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# 天涯SDK"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "http://open.tianya.cn/wiki/index.php?title=SDK%E4%B8%8B%E8%BD%BD"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  },
  "latex_envs": {
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 0
  },
  "toc": {
   "toc_cell": false,
   "toc_number_sections": true,
   "toc_threshold": 6,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}