{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "***\n", "***\n", "# Data Scraping\n", " > # Driving a Browser with Selenium\n", "\n", "***\n", "***\n", "\n", "Wang Chengjun (王成军) \n", "\n", "wangchengjun@nju.edu.cn\n", "\n", "Computational Communication (计算传播网) http://computational-communication.com\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Selenium is a complete testing system for web applications. It covers:\n", "- recording tests (Selenium IDE)\n", "- writing and running tests (Selenium Remote Control)\n", "- running tests in parallel (Selenium Grid)\n", "\n", "Selenium Core, the heart of Selenium, is based on JsUnit and written entirely in JavaScript, so it can run in any browser that supports JavaScript. As an automated testing tool, Selenium drives a real browser and supports many different browsers; in web scraping it is mainly used to handle pages that are rendered by JavaScript. Source: https://www.cnblogs.com/zhaof/p/6953241.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selenium supports many browsers, but before you can declare and launch one you need to install the `selenium` package:\n", "https://pypi.org/project/selenium/" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2019-10-17T00:57:02.726390Z", "start_time": "2019-10-17T00:56:56.947418Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting selenium\n", " Downloading https://files.pythonhosted.org/packages/80/d6/4294f0b4bce4de0abf13e17190289f9d0613b0a44e5dd6a7f5ca98459853/selenium-3.141.0-py2.py3-none-any.whl (904kB)\n", "Requirement already satisfied: urllib3 in /Users/datalab/anaconda3/lib/python3.7/site-packages (from selenium) (1.24.1)\n", "Installing collected packages: selenium\n", "Successfully installed selenium-3.141.0\n" ] } ], "source": [ "!pip install selenium" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Webdriver\n", "- We mainly use Selenium's WebDriver.\n", "- The cells below show which browsers `selenium.webdriver` supports.\n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2019-10-17T00:57:07.111400Z", 
"start_time": "2019-10-17T00:57:07.067485Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from selenium import webdriver" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2019-10-17T00:57:10.624675Z", "start_time": "2019-10-17T00:57:10.619107Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on package selenium.webdriver in selenium:\n", "\n", "NAME\n", " selenium.webdriver\n", "\n", "DESCRIPTION\n", " # Licensed to the Software Freedom Conservancy (SFC) under one\n", " # or more contributor license agreements. See the NOTICE file\n", " # distributed with this work for additional information\n", " # regarding copyright ownership. The SFC licenses this file\n", " # to you under the Apache License, Version 2.0 (the\n", " # \"License\"); you may not use this file except in compliance\n", " # with the License. You may obtain a copy of the License at\n", " #\n", " # http://www.apache.org/licenses/LICENSE-2.0\n", " #\n", " # Unless required by applicable law or agreed to in writing,\n", " # software distributed under the License is distributed on an\n", " # \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n", " # KIND, either express or implied. 
See the License for the\n", " # specific language governing permissions and limitations\n", " # under the License.\n", "\n", "PACKAGE CONTENTS\n", " android (package)\n", " blackberry (package)\n", " chrome (package)\n", " common (package)\n", " edge (package)\n", " firefox (package)\n", " ie (package)\n", " opera (package)\n", " phantomjs (package)\n", " remote (package)\n", " safari (package)\n", " support (package)\n", " webkitgtk (package)\n", "\n", "VERSION\n", " 3.14.1\n", "\n", "FILE\n", " /Users/datalab/anaconda3/lib/python3.7/site-packages/selenium/webdriver/__init__.py\n", "\n", "\n" ] } ], "source": [ "help(webdriver) " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Downloading and Setting Up the WebDriver\n", "\n", "The WebDriver that Chrome needs (ChromeDriver) can be downloaded from:\n", "\n", "http://chromedriver.storage.googleapis.com/index.html\n", "\n", "The WebDriver executable must be on the system path:\n", "- make sure Anaconda is on the system path\n", "- put the downloaded WebDriver in Anaconda's `bin` folder" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### PhantomJS\n", "\n", "PhantomJS is a WebKit-based, server-side JavaScript API that works with web pages without needing a browser. It is fast and natively supports web standards such as DOM handling, CSS selectors, and JSON. PhantomJS can be used for page automation, network monitoring, screen capture, and headless testing." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2019-10-17T00:57:17.147546Z", "start_time": "2019-10-17T00:57:14.749313Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "#browser = webdriver.Firefox() # open a Firefox browser\n", "browser = webdriver.Chrome() # open a Chrome browser" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Visiting a Page" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2019-10-17T03:39:01.788430Z", "start_time": "2019-10-17T03:38:58.474675Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "\n", 
"网易云音乐\n", "网易云音乐\n", "登录\n", "创作者中心\n", "现在支持搜索MV啦~\n", "上一首\n", "播放/暂停\n", "下一首\n", "00:00 / 00:00\n", "收藏\n", "分享\n", "已添加到播放列表\n", "0\n", "循环\n", "
\n" ] } ], "source": [ "from selenium import webdriver\n", "\n", "browser = webdriver.Chrome()\n", " \n", "browser.get(\"http://music.163.com\") \n", "print(browser.page_source)\n", "#browser.close() " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Finding Elements\n", "Finding a single element" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2019-10-17T03:39:32.451564Z", "start_time": "2019-10-17T03:39:28.999764Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n" ] } ], "source": [ "from selenium import webdriver\n", "\n", "browser = webdriver.Chrome()\n", "\n", "browser.get(\"http://music.163.com\")\n", "input_first = browser.find_element_by_id(\"g_search\")\n", "input_second = browser.find_element_by_css_selector(\"#g_search\")\n", "input_third = browser.find_element_by_xpath('//*[@id=\"g_search\"]')\n", "print(input_first)\n", "print(input_second)\n", "print(input_third)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Here we locate the same element in three different ways: by id, by CSS selector, and by XPath. All three return the same element.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Common methods for finding an element:\n", "\n", "- find_element_by_name\n", "- find_element_by_id\n", "- find_element_by_xpath\n", "- find_element_by_link_text\n", "- find_element_by_partial_link_text\n", "- find_element_by_tag_name\n", "- find_element_by_class_name\n", "- find_element_by_css_selector\n", "\n", "Note: these `find_element_by_*` helpers belong to Selenium 3 (version 3.141.0 is installed above). They were deprecated in Selenium 4 and removed in 4.3, where `find_element(By.<strategy>, value)` is used instead." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2019-10-17T03:40:12.762331Z", "start_time": "2019-10-17T03:40:12.759968Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# A more general approach is find_element(By.<strategy>, value); it requires importing By\n", "from selenium.webdriver.common.by import By" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { 
"ExecuteTime": { "end_time": "2019-10-17T03:40:18.546410Z", "start_time": "2019-10-17T03:40:14.277771Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "browser = webdriver.Chrome()\n", "browser.get(\"http://music.163.com\")\n", "input_first = browser.find_element(By.ID,\"g_search\")\n", "print(input_first)\n", "browser.close()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Finding Multiple Elements\n", "\n", "The only difference between the two is the method name: `find_elements` returns a list of all matching elements, while `find_element` returns a single element. Otherwise they are used the same way, as this example shows:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2019-10-17T03:40:36.622392Z", "start_time": "2019-10-17T03:40:32.247072Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[]\n" ] } ], "source": [ "browser = webdriver.Chrome()\n", "browser.get(\"http://music.163.com\")\n", "lis = browser.find_elements_by_css_selector('body')\n", "print(lis)\n", "browser.close() " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The same lookup can also be written with `By`, after `from selenium.webdriver.common.by import By`:\n", "\n", "> lis = browser.find_elements(By.CSS_SELECTOR, 'body')\n", "\n", "Every single-element method has a multiple-element counterpart:\n", "- find_elements_by_name\n", "- find_elements_by_id\n", "- find_elements_by_xpath\n", "- find_elements_by_link_text\n", "- find_elements_by_partial_link_text\n", "- find_elements_by_tag_name\n", "- find_elements_by_class_name\n", "- find_elements_by_css_selector" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Interacting with Elements\n", "Call interaction methods on the elements you have located." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2019-10-17T03:40:57.466649Z", "start_time": "2019-10-17T03:40:51.101641Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from 
selenium import webdriver\n", "import time\n", "browser = webdriver.Chrome()\n", "\n", "browser.get(\"https://music.163.com/\")\n", "input_str = browser.find_element_by_id('srch')\n", "input_str.send_keys(\"周杰伦\")\n", "time.sleep(3) # pause to mimic a person typing a search\n", "input_str.clear()\n", "input_str.send_keys(\"林俊杰\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "When this runs, the program opens Chrome, loads NetEase Cloud Music, types 周杰伦 into the search box, waits three seconds, clears the box, and types 林俊杰.\n", "\n", "Full Selenium API documentation: http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.common.action_chains" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Executing JavaScript\n", "`execute_script` is a very useful method: it lets you call JavaScript directly to perform actions on the page.\n", "The example below opens the Zhihu explore page, uses JavaScript to scroll to the bottom of the page, and then shows an alert box." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2019-10-17T01:16:20.950284Z", "start_time": "2019-10-17T01:16:17.156296Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from selenium import webdriver\n", "browser = webdriver.Chrome()\n", "browser.get(\"https://www.zhihu.com/explore/\")\n", "browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')\n", "browser.execute_script('alert(\"To Bottom\")')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# An Example" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-06-08T06:32:02.234295Z", "start_time": "2019-06-08T06:30:56.716427Z" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "```python\n", "from selenium import webdriver\n", "\n", "browser = webdriver.Chrome()\n", "browser.get(\"https://www.privco.com/home/login\") # the site may need a VPN or proxy to be reachable\n", "username = 'fake_username'\n", "password = 'fake_password'\n", "browser.find_element_by_id(\"username\").clear()\n", "browser.find_element_by_id(\"username\").send_keys(username) \n", "browser.find_element_by_id(\"password\").clear()\n", 
"browser.find_element_by_id(\"password\").send_keys(password)\n", "browser.find_element_by_css_selector(\"#login-form > div:nth-child(5) > div > button\").click()\n", "```" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T06:33:11.197128Z", "start_time": "2019-06-08T06:33:11.169229Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "# url = \"https://www.privco.com/private-company/329463\"\n", "def download_excel(url):\n", "    # Save the company page as HTML, then request its Excel export.\n", "    browser.get(url)\n", "    name = url.split('/')[-1]  # company id, used as the file name\n", "    source = browser.page_source\n", "    with open(name + '.html', 'w', encoding='utf-8') as f:\n", "        f.write(source)\n", "    try:\n", "        soup = BeautifulSoup(source, 'html.parser')\n", "        # the profile link sits inside <span class=\"profile-name\">\n", "        url_new = soup.find('span', {'class': 'profile-name'}).a['href']\n", "        url_excel = url_new + '/export'\n", "        browser.get(url_excel)\n", "    except Exception:\n", "        # no profile link found, or the export page failed to load\n", "        print(url, 'no excel')" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T06:32:13.789332Z", "start_time": "2019-06-08T06:32:13.785931Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "urls = ['https://www.privco.com/private-company/1135789',\n", "        'https://www.privco.com/private-company/542756',\n", "        'https://www.privco.com/private-company/137908',\n", "        'https://www.privco.com/private-company/137138']" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T06:33:19.547094Z", "start_time": "2019-06-08T06:33:15.569463Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "https://www.privco.com/private-company/1135789 no excel\n", "1\n", "https://www.privco.com/private-company/542756 no excel\n", "2\n", "https://www.privco.com/private-company/137908 no excel\n", "3\n", "https://www.privco.com/private-company/137138 no excel\n" ] } ], 
"source": [ "for k, url in enumerate(urls):\n", "    print(k)\n", "    try:\n", "        download_excel(url)\n", "    except Exception as e:\n", "        print(url, e)" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 0, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "647px", "left": "1361px", "top": "123px", "width": "340px" }, "toc_section_display": false, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }