{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "***\n", "***\n", "# 数据抓取\n", " > # 使用Selenium操纵浏览器\n", "\n", "***\n", "***\n", "\n", "王成军 \n", "\n", "wangchengjun@nju.edu.cn\n", "\n", "计算传播网 http://computational-communication.com\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "selenium 是一套完整的web应用程序测试系统,包含了\n", "- 测试的录制(selenium IDE)\n", "- 编写及运行(Selenium Remote Control)\n", "- 测试的并行处理(Selenium Grid)。\n", "\n", "Selenium的核心Selenium Core基于JsUnit,完全由JavaScript编写,因此可以用于任何支持JavaScript的浏览器上。selenium可以模拟真实浏览器,自动化测试工具,支持多种浏览器,爬虫中主要用来解决JavaScript渲染问题。https://www.cnblogs.com/zhaof/p/6953241.html" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Webdriver\n", "用python写爬虫的时候,主要用的是selenium的Webdriver,我们可以通过下面的方式先看看Selenium.Webdriver支持哪些浏览器\n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T06:07:48.331420Z", "start_time": "2019-06-08T06:07:48.328503Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from selenium import webdriver" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T06:07:53.221670Z", "start_time": "2019-06-08T06:07:53.215623Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on package selenium.webdriver in selenium:\n", "\n", "NAME\n", " selenium.webdriver\n", "\n", "DESCRIPTION\n", " # Licensed to the Software Freedom Conservancy (SFC) under one\n", " # or more contributor license agreements. See the NOTICE file\n", " # distributed with this work for additional information\n", " # regarding copyright ownership. 
The SFC licenses this file\n", " # to you under the Apache License, Version 2.0 (the\n", " # \"License\"); you may not use this file except in compliance\n", " # with the License. You may obtain a copy of the License at\n", " #\n", " # http://www.apache.org/licenses/LICENSE-2.0\n", " #\n", " # Unless required by applicable law or agreed to in writing,\n", " # software distributed under the License is distributed on an\n", " # \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n", " # KIND, either express or implied. See the License for the\n", " # specific language governing permissions and limitations\n", " # under the License.\n", "\n", "PACKAGE CONTENTS\n", " android (package)\n", " blackberry (package)\n", " chrome (package)\n", " common (package)\n", " edge (package)\n", " firefox (package)\n", " ie (package)\n", " opera (package)\n", " phantomjs (package)\n", " remote (package)\n", " safari (package)\n", " support (package)\n", " webkitgtk (package)\n", "\n", "VERSION\n", " 3.9.0\n", "\n", "FILE\n", " /Users/datalab/Applications/anaconda/lib/python3.5/site-packages/selenium/webdriver/__init__.py\n", "\n", "\n" ] } ], "source": [ "help(webdriver)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### PhantomJS\n", "\n", "PhantomJS is a WebKit-based, server-side JavaScript API. It works with the web without needing a browser, and it is fast, with native support for web standards such as DOM handling, CSS selectors, and JSON. PhantomJS can be used for page automation, network monitoring, screen capture, and headless testing." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Declaring a browser object\n", "As shown above, Selenium supports many browsers. To declare and drive one, the matching browser driver must also be installed; see:\n", "https://pypi.org/project/selenium/" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T06:09:56.452841Z", "start_time": "2019-06-08T06:09:54.512065Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# browser = webdriver.Firefox()  # open the Firefox browser\n", "browser = webdriver.Chrome()  # open the Chrome browser" ] }, { 
"cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 访问页面" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T06:11:19.448418Z", "start_time": "2019-06-08T06:11:12.334976Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " \n", " \n", " \n", "\t\n", " \n", " \n", " \n", " \n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", " \n", " 百度一下,你就知道\n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", " \n", "\n", "\n", "\n", "\n", " \n", "\n", "\n", "\n", "\t\n", "
\n", " \n", "\n", "\n", " \n", "\n", "\n", "
\"到百度首页\"\"到百度首页\"
输入法
\n", "\n", "\n", "\n", "
\n", "
\n", " 网页\n", " 资讯\n", " 贴吧\n", " 知道\n", " 音乐\n", " 图片\n", " 视频\n", " 地图\n", " 文库\n", " 更多»\n", "
\n", "
\n", "\n", " \n", "\n", "
\n", "\t
\n", "\t\t
\n", "\t\t\t
\n", "\t\t\t
\n", "\t\t\t\t

百度

\n", "\t\t\t
\n", "\t\t
\n", "\t
\n", "
\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ] } ], "source": [ "from selenium import webdriver\n", "\n", "browser = webdriver.Chrome()\n", " \n", "browser.get(\"http://www.baidu.com\") \n", "print(browser.page_source)\n", "#browser.close() " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 查找元素\n", "单个元素查找" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T06:14:23.325473Z", "start_time": "2019-06-08T06:14:19.930498Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n" ] } ], "source": [ "from selenium import webdriver\n", "\n", "browser = webdriver.Chrome()\n", "\n", "browser.get(\"http://www.taobao.com\")\n", "input_first = browser.find_element_by_id(\"q\")\n", "input_second = browser.find_element_by_css_selector(\"#q\")\n", "input_third = browser.find_element_by_xpath('//*[@id=\"q\"]')\n", "print(input_first)\n", "print(input_second)\n", "print(input_third)\n", "browser.close()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "这里我们通过三种不同的方式去获取响应的元素,第一种是通过id的方式,第二个中是CSS选择器,第三种是xpath选择器,结果都是相同的。\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## 常用的查找元素方法:\n", "\n", "- find_element_by_name\n", "- find_element_by_id\n", "- find_element_by_xpath\n", "- find_element_by_link_text\n", "- find_element_by_partial_link_text\n", "- find_element_by_tag_name\n", "- find_element_by_class_name\n", "- find_element_by_css_selector" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T03:38:58.488192Z", "start_time": "2018-04-28T03:38:58.484902Z" }, "slideshow": { 
"slide_type": "subslide" } }, "outputs": [], "source": [ "# 下面这种方式是比较通用的一种方式:这里需要记住By模块所以需要导入\n", "from selenium.webdriver.common.by import By" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T03:39:11.442664Z", "start_time": "2018-04-28T03:39:07.749475Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "browser = webdriver.Chrome()\n", "browser.get(\"http://www.taobao.com\")\n", "input_first = browser.find_element(By.ID,\"q\")\n", "print(input_first)\n", "browser.close()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## 多个元素查找\n", "\n", "其实多个元素和单个元素的区别,举个例子:find_elements,单个元素是find_element,其他使用上没什么区别,通过其中的一个例子演示:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T03:39:46.230829Z", "start_time": "2018-04-28T03:39:43.100998Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[, , , , , , , , , , , , , , , ]\n" ] } ], "source": [ "browser = webdriver.Chrome()\n", "browser.get(\"http://www.taobao.com\")\n", "lis = browser.find_elements_by_css_selector('.service-bd li')\n", "print(lis)\n", "browser.close() " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "当然上面的方式也是可以通过导入`from selenium.webdriver.common.by import By` 这种方式实现\n", "\n", "> lis = browser.find_elements(By.CSS_SELECTOR,'.service-bd li')\n", "\n", "同样的在单个元素中查找的方法在多个元素查找中同样存在:\n", "- find_elements_by_name\n", "- find_elements_by_id\n", "- find_elements_by_xpath\n", "- find_elements_by_link_text\n", "- find_elements_by_partial_link_text\n", "- find_elements_by_tag_name\n", "- find_elements_by_class_name\n", "- find_elements_by_css_selector" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 元素交互操作\n", "对于获取的元素调用交互方法" ] }, { 
"cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T06:16:56.125011Z", "start_time": "2019-06-08T06:16:53.085069Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from selenium import webdriver\n", "import time\n", "browser = webdriver.Chrome()\n", "browser.get(\"http://www.taobao.com\")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T06:17:06.773136Z", "start_time": "2019-06-08T06:17:02.552367Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "#browser.get(\"http://www.taobao.com\")\n", "input_str = browser.find_element_by_id('q')\n", "input_str.send_keys(\"ipad\")\n", "time.sleep(3)\n", "input_str.clear()\n", "input_str.send_keys(\"MacBook pro\")\n", "button = browser.find_element_by_class_name('btn-search')\n", "button.click()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "运行的结果可以看出程序会自动打开Chrome浏览器并打开淘宝输入ipad,然后删除,重新输入MacBook pro,并点击搜索\n", "\n", "Selenium所有的api文档:http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.common.action_chains" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 执行JavaScript\n", "这是一个非常有用的方法,这里就可以直接调用js方法来实现一些操作,\n", "下面的例子是通过登录知乎然后通过js翻到页面底部,并弹框提示" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T06:25:53.591510Z", "start_time": "2019-06-08T06:24:34.585820Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from selenium import webdriver\n", "browser = webdriver.Chrome()\n", "browser.get(\"https://www.zhihu.com/explore\")\n", "browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')\n", "browser.execute_script('alert(\"To Bottom\")')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### 获取元素属性\n", 
"get_attribute('class')" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T06:27:22.220466Z", "start_time": "2019-06-08T06:26:42.843021Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "ename": "NoSuchWindowException", "evalue": "Message: no such window: target window already closed\nfrom unknown error: web view not found\n (Session info: chrome=74.0.3729.169)\n (Driver info: chromedriver=2.37.544337 (8c0344a12e552148c185f7d5117db1f28d6c9e85),platform=Mac OS X 10.14.5 x86_64)\n", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mNoSuchWindowException\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mbrowser\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"http://www.zhihu.com/explore\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mlogo\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbrowser\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfind_element_by_id\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'zh-top-link-logo'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlogo\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlogo\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_attribute\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'class'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/Applications/anaconda/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py\u001b[0m in \u001b[0;36mfind_element_by_id\u001b[0;34m(self, id_)\u001b[0m\n\u001b[1;32m 349\u001b[0m \u001b[0melement\u001b[0m 
\u001b[0;34m=\u001b[0m \u001b[0mdriver\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfind_element_by_id\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'foo'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 350\u001b[0m \"\"\"\n\u001b[0;32m--> 351\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfind_element\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mby\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mBy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mID\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mid_\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 352\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 353\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mfind_elements_by_id\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mid_\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/Applications/anaconda/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py\u001b[0m in \u001b[0;36mfind_element\u001b[0;34m(self, by, value)\u001b[0m\n\u001b[1;32m 953\u001b[0m return self.execute(Command.FIND_ELEMENT, {\n\u001b[1;32m 954\u001b[0m \u001b[0;34m'using'\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mby\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 955\u001b[0;31m 'value': value})['value']\n\u001b[0m\u001b[1;32m 956\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 957\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mfind_elements\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mby\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mBy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mID\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 
"\u001b[0;32m~/Applications/anaconda/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py\u001b[0m in \u001b[0;36mexecute\u001b[0;34m(self, driver_command, params)\u001b[0m\n\u001b[1;32m 310\u001b[0m \u001b[0mresponse\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcommand_executor\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexecute\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdriver_command\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mparams\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 311\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mresponse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 312\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror_handler\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcheck_response\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresponse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 313\u001b[0m response['value'] = self._unwrap_value(\n\u001b[1;32m 314\u001b[0m response.get('value', None))\n", "\u001b[0;32m~/Applications/anaconda/lib/python3.5/site-packages/selenium/webdriver/remote/errorhandler.py\u001b[0m in \u001b[0;36mcheck_response\u001b[0;34m(self, response)\u001b[0m\n\u001b[1;32m 240\u001b[0m \u001b[0malert_text\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'alert'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'text'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 241\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mexception_class\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmessage\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mscreen\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstacktrace\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0malert_text\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 242\u001b[0;31m \u001b[0;32mraise\u001b[0m 
\u001b[0mexception_class\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmessage\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mscreen\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstacktrace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 243\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 244\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_value_or_default\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mobj\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdefault\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mNoSuchWindowException\u001b[0m: Message: no such window: target window already closed\nfrom unknown error: web view not found\n (Session info: chrome=74.0.3729.169)\n (Driver info: chromedriver=2.37.544337 (8c0344a12e552148c185f7d5117db1f28d6c9e85),platform=Mac OS X 10.14.5 x86_64)\n" ] } ], "source": [ "browser.get(\"http://www.zhihu.com/explore\")\n", "logo = browser.find_element_by_id('zh-top-link-logo')\n", "print(logo)\n", "print(logo.get_attribute('class'))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### 获取文本值\n", "text" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "提问\n" ] } ], "source": [ "browser.get(\"http://www.zhihu.com/explore\")\n", "input = browser.find_element_by_class_name('zu-top-add-question')\n", "print(input.text)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### 获取ID,位置,标签名\n", "id\n", "location\n", "tag_name\n", "size" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.14881650366503973-1\n", "{'x': 849, 'y': 7}\n", "button\n", 
"{'width': 66, 'height': 32}\n" ] } ], "source": [ "browser.get(\"http://www.zhihu.com/explore\")\n", "input = browser.find_element_by_class_name('zu-top-add-question')\n", "print(input.id)\n", "print(input.location)\n", "print(input.tag_name)\n", "print(input.size)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 一个例子" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T06:32:02.234295Z", "start_time": "2019-06-08T06:30:56.716427Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "ename": "NoSuchElementException", "evalue": "Message: no such element: Unable to locate element: {\"method\":\"id\",\"selector\":\"username\"}\n (Session info: chrome=74.0.3729.169)\n (Driver info: chromedriver=2.37.544337 (8c0344a12e552148c185f7d5117db1f28d6c9e85),platform=Mac OS X 10.14.5 x86_64)\n", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mNoSuchElementException\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0musername\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'wangchj04@126.com'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0mpassword\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'CityUniversityHK'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0mbrowser\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfind_element_by_id\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"username\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mclear\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 8\u001b[0m 
\u001b[0mbrowser\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfind_element_by_id\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"username\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msend_keys\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0musername\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0mbrowser\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfind_element_by_id\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"password\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mclear\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/Applications/anaconda/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py\u001b[0m in \u001b[0;36mfind_element_by_id\u001b[0;34m(self, id_)\u001b[0m\n\u001b[1;32m 349\u001b[0m \u001b[0melement\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdriver\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfind_element_by_id\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'foo'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 350\u001b[0m \"\"\"\n\u001b[0;32m--> 351\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfind_element\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mby\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mBy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mID\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mid_\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 352\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 353\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mfind_elements_by_id\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mid_\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/Applications/anaconda/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py\u001b[0m in \u001b[0;36mfind_element\u001b[0;34m(self, by, 
value)\u001b[0m\n\u001b[1;32m 953\u001b[0m return self.execute(Command.FIND_ELEMENT, {\n\u001b[1;32m 954\u001b[0m \u001b[0;34m'using'\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mby\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 955\u001b[0;31m 'value': value})['value']\n\u001b[0m\u001b[1;32m 956\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 957\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mfind_elements\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mby\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mBy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mID\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/Applications/anaconda/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py\u001b[0m in \u001b[0;36mexecute\u001b[0;34m(self, driver_command, params)\u001b[0m\n\u001b[1;32m 310\u001b[0m \u001b[0mresponse\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcommand_executor\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexecute\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdriver_command\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mparams\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 311\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mresponse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 312\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror_handler\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcheck_response\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresponse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 313\u001b[0m response['value'] = self._unwrap_value(\n\u001b[1;32m 314\u001b[0m response.get('value', None))\n", "\u001b[0;32m~/Applications/anaconda/lib/python3.5/site-packages/selenium/webdriver/remote/errorhandler.py\u001b[0m in 
\u001b[0;36mcheck_response\u001b[0;34m(self, response)\u001b[0m\n\u001b[1;32m 240\u001b[0m \u001b[0malert_text\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'alert'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'text'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 241\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mexception_class\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmessage\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mscreen\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstacktrace\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0malert_text\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 242\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mexception_class\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmessage\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mscreen\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstacktrace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 243\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 244\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_value_or_default\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mobj\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdefault\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mNoSuchElementException\u001b[0m: Message: no such element: Unable to locate element: {\"method\":\"id\",\"selector\":\"username\"}\n (Session info: chrome=74.0.3729.169)\n (Driver info: chromedriver=2.37.544337 (8c0344a12e552148c185f7d5117db1f28d6c9e85),platform=Mac OS X 10.14.5 x86_64)\n" ] } ], "source": [ "from selenium import webdriver\n", "# import selenium.webdriver.support.ui as ui\n", "browser = webdriver.Chrome()\n", "browser.get(\"https://www.privco.com/home/login\") #需要翻墙打开网址\n", "username = 'wangchj04@126.com'\n", "password = 
'CityUniversityHK'\n", "browser.find_element_by_id(\"username\").clear()\n", "browser.find_element_by_id(\"username\").send_keys(username) \n", "browser.find_element_by_id(\"password\").clear()\n", "browser.find_element_by_id(\"password\").send_keys(password)\n", "browser.find_element_by_css_selector(\"#login-form > div:nth-child(5) > div > button\").click()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T06:33:11.197128Z", "start_time": "2019-06-08T06:33:11.169229Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# url = \"https://www.privco.com/private-company/329463\"\n", "def download_excel(url):\n", " browser.get(url)\n", " name = url.split('/')[-1]\n", " title = browser.title\n", " source = browser.page_source\n", " with open(name+'.html', 'w') as f:\n", " f.write(source)\n", " try:\n", " soup = BeautifulSoup(source, 'html.parser')\n", " url_new = soup.find('span', {'class', 'profile-name'}).a['href']\n", " url_excel = url_new + '/export'\n", " browser.get(url_excel)\n", " except Exception as e:\n", " print(url, 'no excel')\n", " pass\n", " \n", " " ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T06:32:13.789332Z", "start_time": "2019-06-08T06:32:13.785931Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "urls = [ 'https://www.privco.com/private-company/1135789',\n", " 'https://www.privco.com/private-company/542756',\n", " 'https://www.privco.com/private-company/137908',\n", " 'https://www.privco.com/private-company/137138']" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T06:33:19.547094Z", "start_time": "2019-06-08T06:33:15.569463Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "https://www.privco.com/private-company/1135789 no excel\n", "1\n", 
"https://www.privco.com/private-company/542756 no excel\n", "2\n", "https://www.privco.com/private-company/137908 no excel\n", "3\n", "https://www.privco.com/private-company/137138 no excel\n" ] } ], "source": [ "for k, url in enumerate(urls):\n", " print(k)\n", " try:\n", " download_excel(url)\n", " except Exception as e:\n", " print(url, e)" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python [conda env:anaconda]", "language": "python", "name": "conda-env-anaconda-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 0, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "647px", "left": "1361px", "top": "123px", "width": "340px" }, "toc_section_display": false, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }