{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 使用Pyppeteer实现异步抓取!\n", "\n", "https://mp.weixin.qq.com/s/cWDbLcB_eYBDqBg11Jof3g\n", "\n", "Selenium 库是一个自动化测试工具,很多人可能对它并不陌生,不过在使用 Selenium 过程中,会遇到一些麻烦的事情,如要提前准备好环境配置、驱动等,而且在大规模部署中也会与遇到让我们头疼的事情,那有什么解决方法呢?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- **Pu**ppeteer 是 Google 基于 Node.js 开发的一个工具,有了它我们可以通过 JavaScript 来控制 Chrome 浏览器的一些操作,当然也可以用作网络爬虫上,其 API 极其完善,功能非常强大。\n", "\n", "- **Py**ppeteer 是 Puppeteer 的 Python 版本的实现。是一位来自于日本的工程师依据 Puppeteer 的一些功能开发出来的非官方版本。\n", " - 它背后也是有一个类似 Chrome 浏览器的 Chromium 浏览器在执行一些动作进行网页渲染" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2023-11-10T07:37:10.580652Z", "start_time": "2023-11-10T07:37:06.822664Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: pyppeteer in /opt/anaconda3/lib/python3.9/site-packages (0.2.6)\n", "Requirement already satisfied: importlib-metadata>=1.4 in /opt/anaconda3/lib/python3.9/site-packages (from pyppeteer) (4.8.1)\n", "Requirement already satisfied: pyee<9.0.0,>=8.1.0 in /opt/anaconda3/lib/python3.9/site-packages (from pyppeteer) (8.2.2)\n", "Requirement already satisfied: websockets<10.0,>=9.1 in /opt/anaconda3/lib/python3.9/site-packages (from pyppeteer) (9.1)\n", "Requirement already satisfied: urllib3<2.0.0,>=1.25.8 in /opt/anaconda3/lib/python3.9/site-packages (from pyppeteer) (1.26.7)\n", "Requirement already satisfied: appdirs<2.0.0,>=1.4.3 in /opt/anaconda3/lib/python3.9/site-packages (from pyppeteer) (1.4.4)\n", "Requirement already satisfied: tqdm<5.0.0,>=4.42.1 in /opt/anaconda3/lib/python3.9/site-packages (from pyppeteer) (4.62.3)\n", "Requirement already satisfied: zipp>=0.5 in /opt/anaconda3/lib/python3.9/site-packages (from importlib-metadata>=1.4->pyppeteer) (3.6.0)\n" ] } ], "source": [ "!pip3 install pyppeteer" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2023-11-10T07:37:19.492209Z", "start_time": "2023-11-10T07:37:18.849892Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import pyppeteer" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2023-11-10T07:37:31.500773Z", "start_time": "2023-11-10T07:37:27.383711Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting pyquery\n", " Downloading pyquery-2.0.0-py3-none-any.whl (22 kB)\n", "Collecting cssselect>=1.2.0\n", " Downloading cssselect-1.2.0-py2.py3-none-any.whl (18 kB)\n", "Requirement already satisfied: lxml>=2.1 in /opt/anaconda3/lib/python3.9/site-packages (from pyquery) (4.6.3)\n", "Installing collected packages: cssselect, pyquery\n", "Successfully installed cssselect-1.2.0 pyquery-2.0.0\n" ] } ], "source": [ "!pip install pyquery" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2021-11-01T08:31:40.626887Z", "start_time": "2021-11-01T08:31:39.466388Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Quotes: 0\n" ] } ], "source": [ "import requests\n", "from pyquery import PyQuery as pq\n", "\n", "url = 'http://quotes.toscrape.com/js/'\n", "response = requests.get(url)\n", "doc = pq(response.text)\n", "print('Quotes:', doc('.quote').length)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "结果是 0,这就证明使用 requests 是无法正常抓取到相关数据的。因为什么?因为这个页面是 JavaScript 渲染而成的,我们所看到的内容都是网页加载后又执行了 JavaScript 之后才呈现出来的,因此这些条目数据并不存在于原始 HTML 代码中,而 requests 仅仅抓取的是原始 HTML 代码。\n", "\n", "\n", "- 分析网页源代码数据,如果数据是隐藏在 HTML 中的其他地方,以 JavaScript 变量的形式存在,直接提取就好了。\n", "- 分析 Ajax,很多数据可能是经过 Ajax 请求时候获取的,所以可以分析其接口。\n", "- 模拟 JavaScript 渲染过程,直接抓取渲染后的结果。\n", " - 而 Pyppeteer 和 Selenium 就是用的第三种方法" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2023-11-10T07:41:27.744695Z", "start_time": "2023-11-10T07:41:27.736764Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import nest_asyncio\n", "nest_asyncio.apply()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2023-11-10T07:42:03.418609Z", "start_time": "2023-11-10T07:41:56.628319Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Quotes: 10\n" ] } ], "source": [ "import asyncio\n", "from pyppeteer import launch\n", "from pyquery import PyQuery as pq\n", "\n", "async def main():\n", " browser = await launch()\n", " page = await browser.newPage()\n", " await page.goto('http://quotes.toscrape.com/js/')\n", " doc = pq(await page.content())\n", " print('Quotes:', doc('.quote').length)\n", " await browser.close()\n", "\n", "asyncio.get_event_loop().run_until_complete(main())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Pyppeteer完成了浏览器的开启、新建页面、页面加载等操作。另外 Pyppeteer进行了异步操作,需要配合 async/await 关键词来实现。\n", "\n", "- 首先, launch 方法会新建一个 Browser 对象,\n", "- 然后赋值给 browser,\n", "- 然后调用 newPage \n", " - 方法相当于浏览器中新建了一个选项卡,同时新建了一个 Page 对象。\n", "- 然后 Page 对象调用了 goto 方法就相当于在浏览器中输入了这个 URL,浏览器跳转到了对应的页面进行加载,加载完成之后再调用 content 方法,返回当前浏览器页面的源代码。\n", "- 然后进一步地,用 pyquery 进行同样地解析,就可以得到 JavaScript 渲染的结果了。\n", "\n", "另外其他的一些方法如调用 asyncio 的 get_event_loop 等方法的相关操作则属于 Python 异步 async 相关的内容了,大家如果不熟悉可以了解下 Python 的 async/await 的相关知识。" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2021-11-01T05:53:46.602755Z", "start_time": "2021-11-01T05:53:44.531019Z" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "通过上面的代码,我们就可以完成 JavaScript 渲染页面的爬取了。\n", "\n", "- 我们没有配置 Chrome 浏览器,\n", "- 没有配置浏览器驱动,\n", "\n", "免去了一些繁琐的步骤,同样达到了 Selenium 的效果,还实现了异步抓取!\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2021-11-01T09:10:28.931683Z", "start_time": "2021-11-01T09:10:26.637163Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'width': 800, 'height': 600, 'deviceScaleFactor': 1}\n" ] } ], "source": [ "# 模拟网页截图,保存 PDF\n", "import asyncio\n", "from pyppeteer import launch\n", "\n", "async def main():\n", " browser = await launch()\n", " page = await browser.newPage()\n", " await page.goto('http://quotes.toscrape.com/js/')\n", " await page.screenshot(path='example.png')\n", " await page.pdf(path='./data/example.pdf')\n", " dimensions = await page.evaluate('''() => {\n", " return {\n", " width: document.documentElement.clientWidth,\n", " height: document.documentElement.clientHeight,\n", " deviceScaleFactor: window.devicePixelRatio,\n", " }\n", " }''')\n", "\n", " print(dimensions)\n", " # >>> {'width': 800, 'height': 600, 'deviceScaleFactor': 1}\n", " await browser.close()\n", "\n", "asyncio.get_event_loop().run_until_complete(main())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 详细用法\n", "\n", "\n", "了解了基本的实例之后,我们再来梳理一下 Pyppeteer 的一些基本和常用操作。\n", "\n", "- Pyppeteer 的几乎所有功能都能在其官方文档的 API Reference 里面找到https://miyakogi.github.io/pyppeteer/reference.html,\n", "- 用到哪个方法就来这里查询就好了,参数不必死记硬背,即用即查就好。" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 开启浏览器\n", "\n", "\n", "使用 Pyppeteer 的第一步便是启动浏览器,首先我们看下怎样启动一个浏览器,其实就相当于我们点击桌面上的浏览器图标一样,把它开起来。用 Pyppeteer 完成同样的操作,只需要调用 launch 方法即可。\n", "\n", "\n", "我们先看下 launch 方法的 API,链接为:\n", "\n", "https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.launcher.launch" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "首先可以试用下最常用的参数 headless,\n", "- 如果我们将它设置为 True 或者默认不设置它,在启动的时候我们是看不到任何界面的,\n", "- 如果把它设置为 False,那么在启动的时候就可以看到界面了\n", "- 一般我们在调试的时候会把它设置为 False,在生产环境上就可以设置为 True" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2021-11-01T09:19:48.338369Z", "start_time": "2021-11-01T09:19:37.346196Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "\n", "import asyncio\n", "from pyppeteer import launch\n", "\n", "async def main():\n", " await launch(headless=False)\n", " await asyncio.sleep(10)\n", "\n", "asyncio.get_event_loop().run_until_complete(main())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "开启调试模式\n", "\n", "比如在写爬虫的时候会经常需要分析网页结构还有网络请求,所以开启调试工具还是很有必要的,\n", "- 我们可以将 devtools 参数设置为 True,\n", "- 这样每开启一个界面就会弹出一个调试窗口,非常方便,示例如下:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2021-11-01T09:22:47.090076Z", "start_time": "2021-11-01T09:21:04.867759Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "\n", "import asyncio\n", "from pyppeteer import launch\n", "\n", "async def main():\n", " browser = await launch(devtools=True)\n", " page = await browser.newPage()\n", " await page.goto('https://www.baidu.com')\n", " await asyncio.sleep(10)\n", "\n", "asyncio.get_event_loop().run_until_complete(main())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "args 参数禁用操作:\n", "\n", "我们可以看到上面的一条提示:\"Chrome 正受到自动测试软件的控制\",如何关闭呢?\n", "\n", " \n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2021-11-01T09:23:59.134440Z", "start_time": "2021-11-01T09:23:58.225619Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "browser = await launch(headless=False, args=['--disable-infobars'])\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "这里你只是把提示关闭了,有些网站还是会检测到是 webdriver 吧,比如淘宝检测到是 webdriver 就会禁止登录了,我们可以试试:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2021-11-01T09:26:01.491647Z", "start_time": "2021-11-01T09:24:18.778506Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "\n", "import asyncio\n", "from pyppeteer import launch\n", "\n", "async def main():\n", " browser = await launch(headless=False)\n", " page = await browser.newPage()\n", " await page.goto('https://www.taobao.com')\n", " await asyncio.sleep(100)\n", "\n", "asyncio.get_event_loop().run_until_complete(main())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "爬虫的时候看到这界面是很让人崩溃的吧,而且这时候我们还发现了页面的 bug,整个浏览器窗口比显示的内容窗口要大,这个是某些页面会出现的情况,让人看起来很不爽。\n", "\n", "\n", "我们可以先解决一下这个显示的 bug,需要设置下 window-size 还有 viewport,代码如下:\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2021-11-01T09:27:44.857206Z", "start_time": "2021-11-01T09:26:01.494849Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "\n", "import asyncio\n", "from pyppeteer import launch\n", "\n", "width, height = 1366, 768\n", "\n", "async def main():\n", " browser = await launch(headless=False,\n", " args=[f'--window-size={width},{height}'])\n", " page = await browser.newPage()\n", " await page.setViewport({'width': width, 'height': height})\n", " await page.goto('https://www.taobao.com')\n", " await asyncio.sleep(10)\n", "\n", "asyncio.get_event_loop().run_until_complete(main())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "OK,那刚才所说的 webdriver 检测问题怎样来解决呢?其实淘宝主要通过 window.navigator.webdriver 来对 webdriver 进行检测,所以我们只需要使用 JavaScript 将它设置为 false 即可,代码如下:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2021-11-01T09:43:08.925151Z", "start_time": "2021-11-01T09:41:19.291522Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import asyncio\n", "from pyppeteer import launch\n", "\n", "\n", "async def main():\n", " browser = await launch(headless=False, args=['--disable-infobars'])\n", " page = await browser.newPage()\n", " await page.goto('https://login.taobao.com/member/login.jhtml?redirectURL=https://www.taobao.com/')\n", " await page.evaluate(\n", " '''() =>{ Object.defineProperties(navigator,{ webdriver:{ get: () => false } }) }''')\n", " await asyncio.sleep(100)\n", "\n", "asyncio.get_event_loop().run_until_complete(main())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "这里没加输入用户名密码的代码,当然后面可以自行添加,下面打开之后,我们点击输入用户名密码,然后这时候会出现一个滑动条,这里滑动的话,就可以通过了\n", "\n", "OK,这样的话我们就成功规避了 webdriver 的检测,使用鼠标拖动模拟就可以完成淘宝的登录了。\n", "\n", "![image.png](img/taobao2021.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "另一种方法设置用户目录\n", "\n", "可以进一步免去淘宝登录的烦恼。平时我们已经注意到,当我们登录淘宝之后,如果下次再次打开浏览器发现还是登录的状态。这是因为淘宝的一些关键 Cookies 已经保存到本地了,下次登录的时候可以直接读取并保持登录状态。\n", "\n", "- 这些信息保存在用户目录下了\n", " - 里面不仅包含了浏览器的基本配置信息,\n", " - 还有一些 Cache、Cookies 等各种信息都在里面\n", "- 如果我们能在浏览器启动的时候读取这些信息,那么启动的时候就可以恢复一些历史记录甚至一些登录状态信息了。\n", "\n", "\n", "\n", "这也就解决了一个问题:很多朋友在每次启动 Selenium 或 Pyppeteer 的时候总是是一个全新的浏览器,那就是没有设置用户目录,如果设置了它,每次打开就不再是一个全新的浏览器了,它可以恢复之前的历史记录,也可以恢复很多网站的登录信息。\n", "\n", "\n", "那么这个怎么来做呢?很简单,在启动的时候设置 userDataDir 就好了,示例如下:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2023-11-10T07:50:50.757092Z", "start_time": "2023-11-10T07:49:06.761981Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import asyncio\n", "from pyppeteer import launch\n", "\n", "\n", "async def main():\n", " browser = await launch(headless=False, userDataDir='../userdata', args=['--disable-infobars'])\n", " page = await browser.newPage()\n", " await page.goto('https://login.taobao.com/member/login.jhtml?redirectURL=https://www.taobao.com/')\n", " await page.evaluate(\n", " '''() =>{ Object.defineProperties(navigator,{ webdriver:{ get: () => false } }) }''')\n", " await asyncio.sleep(100)\n", "\n", "asyncio.get_event_loop().run_until_complete(main())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- 上面👆代码加了一个 userDataDir 的属性,值为 userdata,即当前目录外面的 userdata 文件夹。\n", " - 然后登录一次淘宝,这时候我们同时可以观察到在当前运行目录下又多了一个 userdata 的文件夹\n", "- 下面👇代码再次运行,这时候可以发现现在就已经是登录状态了,不需要再次登录了,这样就成功跳过了登录的流程。\n", " - 当然可能时间太久了,Cookies 都过期了,那还是需要登录的\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2023-11-10T07:53:13.275552Z", "start_time": "2023-11-10T07:51:29.343728Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "\n", "import asyncio\n", "from pyppeteer import launch\n", "\n", "async def main():\n", " browser = await launch(headless=False, userDataDir='../userdata', args=['--disable-infobars'])\n", " page = await browser.newPage()\n", " await page.goto('https://www.taobao.com')\n", " await asyncio.sleep(100)\n", "\n", "asyncio.get_event_loop().run_until_complete(main())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "具体的介绍可以看官方的一些说明,如:\n", "\n", "https://chromium.googlesource.com/chromium/src/+/master/docs/user_data_dir.md\n", "\n", "这里面介绍了 userdatadir 的相关内容。\n", "\n", "\n", "\n", "再次运行上面的代码,这时候可以发现现在就已经是登录状态了,不需要再次登录了,这样就成功跳过了登录的流程。当然可能时间太久了,Cookies 都过期了,那还是需要登录的。" ] } ], "metadata": { "celltoolbar": "幻灯片", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }