{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "***\n", "***\n", "# Data Scraping\n", " > # Controlling the Browser with Selenium\n", "\n", "***\n", "***\n", "\n", "Wang Chengjun \n", "\n", "wangchengjun@nju.edu.cn\n", "\n", "Computational Communication website http://computational-communication.com\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Selenium is a complete testing system for web applications, covering:\n", "- test recording (Selenium IDE)\n", "- test authoring and execution (Selenium Remote Control)\n", "- parallel test execution (Selenium Grid)\n", "\n", "Selenium's core, Selenium Core, is built on JsUnit and written entirely in JavaScript, so it can run in any browser that supports JavaScript. Selenium is an automated testing tool that drives real browsers and supports many of them; in web scraping it is mainly used to handle pages rendered by JavaScript. Source: https://www.cnblogs.com/zhaof/p/6953241.html" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Webdriver\n", "When writing a crawler in Python, the part of Selenium used most is WebDriver. The `help` call below lists the browsers that `selenium.webdriver` supports.\n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T06:07:48.331420Z", "start_time": "2019-06-08T06:07:48.328503Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from selenium import webdriver" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T06:07:53.221670Z", "start_time": "2019-06-08T06:07:53.215623Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on package selenium.webdriver in selenium:\n", "\n", "NAME\n", " selenium.webdriver\n", "\n", "DESCRIPTION\n", " # Licensed to the Software Freedom Conservancy (SFC) under one\n", " # or more contributor license agreements. See the NOTICE file\n", " # distributed with this work for additional information\n", " # regarding copyright ownership. 
The SFC licenses this file\n", " # to you under the Apache License, Version 2.0 (the\n", " # \"License\"); you may not use this file except in compliance\n", " # with the License. You may obtain a copy of the License at\n", " #\n", " # http://www.apache.org/licenses/LICENSE-2.0\n", " #\n", " # Unless required by applicable law or agreed to in writing,\n", " # software distributed under the License is distributed on an\n", " # \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n", " # KIND, either express or implied. See the License for the\n", " # specific language governing permissions and limitations\n", " # under the License.\n", "\n", "PACKAGE CONTENTS\n", " android (package)\n", " blackberry (package)\n", " chrome (package)\n", " common (package)\n", " edge (package)\n", " firefox (package)\n", " ie (package)\n", " opera (package)\n", " phantomjs (package)\n", " remote (package)\n", " safari (package)\n", " support (package)\n", " webkitgtk (package)\n", "\n", "VERSION\n", " 3.9.0\n", "\n", "FILE\n", " /Users/datalab/Applications/anaconda/lib/python3.5/site-packages/selenium/webdriver/__init__.py\n", "\n", "\n" ] } ], "source": [ "help(webdriver) " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### PhantomJS\n", "\n", "PhantomJS is a WebKit-based, server-side JavaScript API. It supports the web without needing a browser, is fast, and natively supports web standards such as DOM handling, CSS selectors, and JSON. PhantomJS can be used for page automation, network monitoring, page screenshots, and headless testing." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Declaring a browser object\n", "As shown above, Selenium supports many browsers. To declare and launch one, however, you need the setup described at:\n", "https://pypi.org/project/selenium/" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T06:09:56.452841Z", "start_time": "2019-06-08T06:09:54.512065Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# browser = webdriver.Firefox()  # open the Firefox browser\n", "browser = webdriver.Chrome()  # open the Chrome browser" ] }, { 
"cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 访问页面" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2019-06-08T06:11:19.448418Z", "start_time": "2019-06-08T06:11:12.334976Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "
\n", " \n", " \n", " \n", "\t\n", " \n", " \n", " \n", " \n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", " \n", "百度
\n", "\t\t\t©2019 Baidu 使用百度前必读 意见反馈 京ICP证030173号 京公网安备11000002000001号