{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", " Scraping websites with help of Selenium\n", " \n", "
\n", "\n", " \n", " Vadim Voskresenskii (slack: Vadimvoskresenskiy)\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Today we will study how to work with one very useful and impressive framework which will help us to scrape websites having dynamic data requests. This framework is called **Selenium** and we can efficiently work with it on Python. The idea laying behind Selenium is very simple -- it allows web developers test their applications before launching them. With help of Selenium, they can emulate the work of browser and check how different elements of their application work from the side of a user.\n", "\n", "But, apart from giving web developers possibility to check their applications, Selenium can be useful also for data analysts who want to get data from websites with sophisticated internal strucutres. Probably, you faced such situations when you try to collect data with help of Beatiful Soup in Python or any other package and cannot get it because you need to wait some time until data is uploaded to a website from a server. Unfortunately, your script does not know about this feature of a website and tries to ge it at once. Finally, instead of getting desirable data you get blank list. Also, I suppose, sometimes, data analysts want to collect data from websites where you need firstly put some information into text fields or click some buttons. Certainly, you cannot do such actions with help of Beuatiful Soup. My tutorial will show you how to tackle with such issues with help of Selenium.\n", "\n", "The plan of the workshop is following:\n", "\n", "1. We will know how to install Selenium and will cover briefly main terminology \n", "2. I will introduce you to the case we are going to solve in the framework of current tutorial\n", "2. We will write script with special Selenium functions allowing to interact with a browser\n", "3. We will add very simple code collecting data we need with help of BeatifulSoup\n", "4. We will write final function combining all our previous steps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Installation and main functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to know how to launch Selenium. That's very simple!\n", "\n", "With help of pip, you can install selenium.\n", "\n", "`pip install selenium`\n", "\n", "After that, you need to install driver on you computer which will allow you to interact with a browser.\n", "\n", "*My advise*: choose Firefox for work with Selenium. Initially, I started working with Chrome and found that Chrome sometimes cannot find some elements on webpage which definitely exist. At the same time, Firefox had no any issues with finding these elements. I did not check other browsers though.\n", "\n", "Regardless of a browser you selected, the algorithm of working with drivers is very similar. First, you download driver (geckodriver for Mozilla Firefox can be found [here](https://github.com/mozilla/geckodriver/releases)). Then, you set executable file (*geckodriver.exe*) as an environment variable on your computer (on Windows, you need to add the path to executable file to PATH). That's it. Now, you can work with Selenium. \n", "\n", "If we installed everything correctly we can check how Selenium works. For that, let's import needed modules and try to get to the website." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import datetime\n", "import re\n", "import time\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import requests\n", "from bs4 import BeautifulSoup\n", "from dateutil import relativedelta\n", "from selenium import webdriver\n", "from selenium.webdriver.common.keys import Keys" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "driver = webdriver.Firefox()\n", "driver.get(\"https://mlcourse.ai/roadmap\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If everything is fine, you will see how magically Firefox browser opens the webpage of our course.\n", "\n", "Before scraping I offer you to get started with looking at main functions we are going to use in the tutorial.\n", "\n", "Our approach to collect data is very simple. First, we need to find HTML element with which we want to interact and ,second, interact with it by sending keys (browser thinks that a real user presses buttons on her keyboard) or clicking buttons. \n", "\n", "HTML element can be identified by different ways. Here are the most important functions for us:\n", "\n", "`driver.find_element_by_id`\n", "\n", "With help of this function we can find element by it's id. All elements on a webpage have their own unique ids.\n", "\n", "`driver.find_element_by_xpath`\n", "\n", "Xpath is a path to html element we need. Sometimes, elements on one page can have the same paths. So, we need to be very careful with this approach. But in most cases, Xpath is the easiest way how to get very specific element on the webpage.\n", "\n", "`driver.find_element_by_link_text`\n", "\n", "The most dangerous function is searching for element on the base of text (the text you see on a webpage). As you can understand, it can be used only in the case if only one element is represented by this text.\n", "\n", "Ok, we found element. How to interact with it?\n", "\n", "For this, there are some other functions.\n", "\n", "`element.send_keys(\"text\")`\n", "\n", "With help of this function, we can send some text to the website. For instance, we can sign up or write name of a book we want to buy on Amazon.\n", "\n", "`click()`\n", "\n", "If we work with a button, we can click on it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Scraping Airbnb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the current workshop, we will be scraping data from [Airbnb](https://ru.airbnb.com/). Airbnb is the website for travelers which sometimes allows you to find cheaper place for living than websites like Booking.com. On Airbnb you are searching not for hotels or hostels but for apartment offered by hosts living in a city you want to visit. Airbnb is based on principles of sharing economy where trust between hosts and guests is supported by reviews technology.\n", "\n", "Our task is following. Let's imagine that you and your friend want to travel to London and live there from 15th of March of 2018 to 23rd of May (completely random dates). We do not want to go to the website each day in a hope to find the best offer. Instead of it, we want to write function which will be collecting regularly for us offers from hosts, some characteristics of the apartments and their prices. But as you can see Airbnb website is created well and it has a lot of interactive elements: apart from search fields we have calendars, special buttons for choosing number of guests, children. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "# Scraping Airbnb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this workshop, we will be scraping data from [Airbnb](https://ru.airbnb.com/). Airbnb is a website for travelers which sometimes allows you to find a cheaper place to stay than websites like Booking.com. On Airbnb you search not for hotels or hostels but for apartments offered by hosts living in the city you want to visit. Airbnb is based on the principles of the sharing economy, where trust between hosts and guests is supported by a review system.\n", "\n", "Our task is the following. Let's imagine that you and your friend want to travel to London and stay there from the 15th of March 2018 to the 23rd of May (completely random dates). We do not want to visit the website every day in the hope of finding the best offer. Instead, we want to write a function which will regularly collect offers from hosts for us, along with some characteristics of the apartments and their prices. But as you can see, the Airbnb website is well designed and has a lot of interactive elements: apart from search fields, there are calendars and special buttons for choosing the number of guests and children. Using Beautiful Soup alone, we cannot collect all the data we need. Therefore, we certainly need Selenium. Let's start!\n", "\n", "So, we are now on the main page of Airbnb, and we need to choose the city, country, dates and number of guests. Obviously, we start with the place we are going to visit. The best way is to put the city and country into the search field. To identify the HTML element of the search field, we use the browser function called \"Inspect Element\": right-click the element we want and choose \"Inspect Element\" from the context menu (in Firefox, the Q key is its shortcut), as shown in the picture below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "