{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 轻型微博爬虫\n", "\n", "\n", "- https://github.com/hidadeng/weibo_crawler\n", "- https://pypi.org/project/weibo-crawler/\n", "\n", "weibo_crawler参考【nghuyong/WeiboSpider】 对代码用法进行了简化,可以做轻度的微博数据采集。\n", "\n", "- 用户信息抓取\n", "- 用户微博抓取(全量/指定时间段)\n", "- 用户社交关系抓取(粉丝/关注)\n", "- 微博评论抓取\n", "- 基于关键词和时间段(粒度到小时)的微博抓取\n", "- 微博转发抓取\n", "\n", "\n", "使用简介:https://www.douban.com/group/topic/247718378/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 安装" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2021-11-09T05:26:52.360424Z", "start_time": "2021-11-09T05:26:49.338728Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: weibo-crawler in /opt/anaconda3/lib/python3.7/site-packages (1.0)\n", "Requirement already satisfied: pyquery in /opt/anaconda3/lib/python3.7/site-packages (from weibo-crawler) (1.4.3)\n", "Requirement already satisfied: requests in /opt/anaconda3/lib/python3.7/site-packages (from weibo-crawler) (2.24.0)\n", "Requirement already satisfied: cssselect>0.7.9 in /opt/anaconda3/lib/python3.7/site-packages (from pyquery->weibo-crawler) (1.1.0)\n", "Requirement already satisfied: lxml>=2.1 in /opt/anaconda3/lib/python3.7/site-packages (from pyquery->weibo-crawler) (4.6.1)\n", "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/anaconda3/lib/python3.7/site-packages (from requests->weibo-crawler) (1.25.11)\n", "Requirement already satisfied: idna<3,>=2.5 in /opt/anaconda3/lib/python3.7/site-packages (from requests->weibo-crawler) (2.8)\n", "Requirement already satisfied: certifi>=2017.4.17 in /opt/anaconda3/lib/python3.7/site-packages (from requests->weibo-crawler) (2019.11.28)\n", "Requirement already satisfied: chardet<4,>=3.0.2 in /opt/anaconda3/lib/python3.7/site-packages (from requests->weibo-crawler) (3.0.4)\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "pip install weibo-crawler" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2021-11-09T05:34:28.529523Z", "start_time": "2021-11-09T05:34:28.526110Z" } }, "source": [ "## 获取cookie\n", "\n", "- 使用chrome浏览器打开手机微博 https://weibo.cn 登录\n", "- 右键inspect(即打开开发者模式)\n", "- 查看network内容\n", "- 获取html文件header中的cookie信息\n", " - 其中可能需要SSOLoginState字段\n", " \n", "![image.png](img/weibo-crawler.png)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2021-11-09T05:26:58.614456Z", "start_time": "2021-11-09T05:26:58.461231Z" } }, "outputs": [], "source": [ "from weibo_crawler import Profile\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2021-11-09T05:28:50.517815Z", "start_time": "2021-11-09T05:28:50.079391Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'userid': '1087770692', 'nickname': '陈坤', 'gender': '男', 'province': '重庆', 'introduction': '莫失己道,莫扰他心。', 'birthday': '0001-00-00', 'vip_level': '7级送Ta会员', 'authentication': '演员,代表作《龙门飞甲》《画皮》等,行走的力量发起者', 'labels': '演员'}\n", "采集完毕,请查看 ./data/weibo-chenkun-intro.csv 内的数据\n" ] } ], "source": [ "# 如果程序失败,需要传入你的微博cookies\n", "cookies='_T_WM=9b80727fa0cc3b6b6c374b9262ff084d; SUB=_2A25MjZmZDeRhGeNI4lYX-S7FwjWIHXVscSfRrDV6PUJbktAKLXSmkW1NSAJ40ykLq1lxtFqpHJ4BRMiY1XKHNT6g; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9WhkPe1HSir85xF8hwHpTZa75NHD95QfSo.XSo.71K.4Ws4DqcjT9s8Xqgpyqoz7eK-t; SSOLoginState=1636428233'\n", "\n", "# csv文件路径\n", "prof=Profile(csvfile='./data/weibo-chenkun-intro.csv', delay=1, cookies=cookies)\n", "\n", "prof.get_profile(userid='1087770692') # 陈坤微博的id" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2021-11-09T05:43:18.157239Z", "start_time": "2021-11-09T05:43:18.154103Z" } }, "source": [ "更多操作见:\n", "\n", "https://github.com/hidadeng/weibo_crawler" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }