{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "***\n", "***\n", "# Data Scraping\n", " > # Controlling the Browser with Selenium\n", "\n", "***\n", "***\n", "\n", "Wang Chengjun (王成军) \n", "\n", "wangchengjun@nju.edu.cn\n", "\n", "Computational Communication (计算传播网) http://computational-communication.com\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Selenium is a complete testing system for web applications. It includes:\n", "- test recording (Selenium IDE)\n", "- writing and running tests (Selenium Remote Control)\n", "- parallel test execution (Selenium Grid)\n", "\n", "Selenium Core, the heart of Selenium, is based on JsUnit and written entirely in JavaScript, so it can run in any browser that supports JavaScript. Selenium is an automated testing tool that drives a real browser and supports many browsers; in web scraping it is mainly used to handle pages rendered by JavaScript. https://www.cnblogs.com/zhaof/p/6953241.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As noted above, Selenium supports many browsers. To declare and launch a browser from Python, you first need the selenium package:\n", "https://pypi.org/project/selenium/" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2019-10-17T00:57:02.726390Z", "start_time": "2019-10-17T00:56:56.947418Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting selenium\n", "\u001b[?25l Downloading https://files.pythonhosted.org/packages/80/d6/4294f0b4bce4de0abf13e17190289f9d0613b0a44e5dd6a7f5ca98459853/selenium-3.141.0-py2.py3-none-any.whl (904kB)\n", "\u001b[K 100% |████████████████████████████████| 911kB 9.3MB/s ta 0:00:011\n", "\u001b[?25hRequirement already satisfied: urllib3 in /Users/datalab/anaconda3/lib/python3.7/site-packages (from selenium) (1.24.1)\n", "Installing collected packages: selenium\n", "Successfully installed selenium-3.141.0\n" ] } ], "source": [ "!pip install selenium" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Webdriver\n", "- We mainly use Selenium's WebDriver\n", "- The cells below show which browsers selenium.webdriver supports\n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2019-10-17T00:57:07.111400Z", 
"start_time": "2019-10-17T00:57:07.067485Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from selenium import webdriver" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2019-10-17T00:57:10.624675Z", "start_time": "2019-10-17T00:57:10.619107Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on package selenium.webdriver in selenium:\n", "\n", "NAME\n", " selenium.webdriver\n", "\n", "DESCRIPTION\n", " # Licensed to the Software Freedom Conservancy (SFC) under one\n", " # or more contributor license agreements. See the NOTICE file\n", " # distributed with this work for additional information\n", " # regarding copyright ownership. The SFC licenses this file\n", " # to you under the Apache License, Version 2.0 (the\n", " # \"License\"); you may not use this file except in compliance\n", " # with the License. You may obtain a copy of the License at\n", " #\n", " # http://www.apache.org/licenses/LICENSE-2.0\n", " #\n", " # Unless required by applicable law or agreed to in writing,\n", " # software distributed under the License is distributed on an\n", " # \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n", " # KIND, either express or implied. 
See the License for the\n", " # specific language governing permissions and limitations\n", " # under the License.\n", "\n", "PACKAGE CONTENTS\n", " android (package)\n", " blackberry (package)\n", " chrome (package)\n", " common (package)\n", " edge (package)\n", " firefox (package)\n", " ie (package)\n", " opera (package)\n", " phantomjs (package)\n", " remote (package)\n", " safari (package)\n", " support (package)\n", " webkitgtk (package)\n", "\n", "VERSION\n", " 3.14.1\n", "\n", "FILE\n", " /Users/datalab/anaconda3/lib/python3.7/site-packages/selenium/webdriver/__init__.py\n", "\n", "\n" ] } ], "source": [ "help(webdriver) " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Downloading and Setting Up the WebDriver\n", "\n", "The WebDriver that Chrome requires can be downloaded from:\n", "\n", "http://chromedriver.storage.googleapis.com/index.html\n", "\n", "The WebDriver must be placed on the system path:\n", "- Make sure Anaconda is on the system path\n", "- Put the downloaded WebDriver in `Anaconda's bin folder`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### PhantomJS\n", "\n", "PhantomJS is a server-side JavaScript API based on WebKit. It supports the web without requiring a browser, and it is fast, with native support for many web standards: DOM handling, CSS selectors, JSON, and so on. PhantomJS can be used for page automation, network monitoring, web page screenshots, and headless testing." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2019-10-17T00:57:17.147546Z", "start_time": "2019-10-17T00:57:14.749313Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "#browser = webdriver.Firefox() # open the Firefox browser\n", "browser = webdriver.Chrome() # open the Chrome browser" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Visiting a Page" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2019-10-17T03:39:01.788430Z", "start_time": "2019-10-17T03:38:58.474675Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "