{ "cells": [ { "cell_type": "markdown", "id": "d162b058", "metadata": {}, "source": [ "--- \n", " \n", "\n", "

Department of Data Science

\n", "

Course: Tools and Techniques for Data Science

\n", "\n", "---\n", "

Instructor: Muhammad Arif Butt, Ph.D.

\n" ] }, { "cell_type": "markdown", "id": "3184edc1", "metadata": {}, "source": [ "

Lecture 5.3 (Web Scraping using Selenium - I)

" ] }, { "cell_type": "markdown", "id": "35b552c1", "metadata": {}, "source": [ "\"Open" ] }, { "cell_type": "markdown", "id": "d3009e3b", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "id": "7fd50ab2", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "bca7407c", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "id": "90ab85fc", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "76918a38", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "c365f3b3", "metadata": {}, "source": [ "\n", "\n", "## Learning agenda of this notebook\n", "\n", "**Recap of Previous Session**\n", "\n", "\n", "\n", "1. **Overview of Selenium (Why, What and How)**\n", " - Why use Selenium?\n", " - What is Selenium? (Selenium Architecture)\n", " - How to use Selenium?\n", " - Download and Install Selenium\n", " - Download Selenium WebDriver for your browser (Chrome, Safari, Firefox, Internet Explorer)\n", " - Setting options of Chrome Driver (Headless mode)\n", " \n", " \n", "2. **A Step-by-Step Hello World with Selenium**\n", " - Create an instance of Browser\n", " - Load a Web page in the browser window\n", " - Access browser information\n", " - Perform Different operations on the browser\n", " - Create new tab in the browser window and shift between tabs\n", " - Close browser tab or close the entire session\n", "\n", "\n", "\n", "3. **Example 1:** Scraping a JavaScript Driven WebSite (https://arifpucit.github.io/bss2/js/)\n", " - What is JavaScript Driven Website?\n", " - What happens when we use Requests and BeautifulSoup to scrape JS websites?\n", " - Using Selenium and BeautifulSoup to scrape JS websites\n", "\n", "\n", "4. **Example 2:** Scraping Dynamic WebSites (https://arifpucit.github.io/bss2/login/)\n", " - Different Ways to Locate Web elements using Selenium\n", " - Selenium `find_element()` and `find_elements()` methods\n", " - Selenium Locators\n", " - ID\n", " - NAME\n", " - TAG_NAME\n", " - CLASS_NAME\n", " - LINK_TEXT\n", " - PARTIAL_LINK_TEXT\n", " - CSS_SELECTOR\n", " - XPATH\n", " - Entrying text in a Text Box on a Web Page\n", " - Clicking a Button element on a web page\n", " - Consolidated Script to Login and Scrape Books Data\n", " \n", " \n", "5. **Example 3:** Scraping Web Pages that Employ Infinite Scrolling: https://arifpucit.github.io/bss2/scrolling/\n", " \n", "\n", "6. **Example 4:** Scraping Web Pages that Employ Pagination: https://arifpucit.github.io/bss2/pagination/\n", "\n", "\n", "7. **Example 5:** Scraping Web Pages that use Pop-ups: https://arifpucit.github.io/bss2/popup/\n", " \n", " \n", "8. **Bonus:**\n", " - Email Scraped CSV file from Python \n", " \n", "\n", "### To Be Continued...\n", " - Web Scraping Best Practices and Scraping of Real Websites" ] }, { "cell_type": "code", "execution_count": null, "id": "dff7c2cb", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "e6d009c1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "7f38a73e", "metadata": {}, "source": [ "\n", "\n", "## 1. Overview of Selenium (Why, What and How)\n", "- Selenium: https://www.selenium.dev/\n" ] }, { "cell_type": "code", "execution_count": null, "id": "258c697d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "f83c85d4", "metadata": {}, "source": [ "### a. 
"### a. Architecture of Selenium" ] }, { "cell_type": "markdown", "id": "8806eefb", "metadata": {}, "source": [ "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "26fc5f3b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "5e47cb0c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "ad8dd339", "metadata": {}, "source": [ "\n", "\n", "### b. Download and Install Selenium\n", "- Download and Install Selenium Client Library for Python: https://www.selenium.dev\n", "- Read Selenium Documentation for Python: https://selenium-python.readthedocs.io/ " ] }, { "cell_type": "code", "execution_count": null, "id": "1a1f8b8b", "metadata": {}, "outputs": [], "source": [ "import sys\n", "!{sys.executable} -m pip install --upgrade pip -q\n", "!{sys.executable} -m pip install --upgrade selenium -q" ] }, { "cell_type": "code", "execution_count": null, "id": "8df00024", "metadata": {}, "outputs": [], "source": [ "import selenium\n", "selenium.__version__ , selenium.__path__" ] }, { "cell_type": "code", "execution_count": null, "id": "3a7d033c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "f4fab41d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "b207f8bb", "metadata": {}, "source": [ "\n", "\n", "### c. Download Selenium WebDriver\n", "- Download the Selenium WebDriver for your browser: https://www.selenium.dev/\n", "- Copy the ChromeDriver executable to a known location on your disk" ] }, { "cell_type": "code", "execution_count": null, "id": "bcb621a4", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "faf5c650", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "3e6331f3", "metadata": {}, "source": [ "## 2. A Step-by-Step Hello World with Selenium\n", "- Steps to follow while scraping websites:\n", "    - Create an instance of WebDriver\n", "    - Navigate to the desired Web page that you want to scrape\n", "    - Locate the Web element on the Web page\n", "    - Perform an action on that web element\n", "    - Write the scraped data into an appropriate format in a file\n", "    - Close the instance of WebDriver" ] }, { "cell_type": "code", "execution_count": null, "id": "da695071", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "cbbbcbbd", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "5cae738a", "metadata": {}, "source": [ "### a. Create an Instance of a WebDriver, Load a Webpage, Play and Quit" ] }, { "cell_type": "markdown", "id": "b5897013", "metadata": {}, "source": [ "> **Create an instance of Browser:**\n", ">- The `Service('path/to/chromedriver')` method is used to create a Service object that needs to be passed to the `Chrome()` method.\n", ">- The `Chrome(service, options)` method creates a new instance of the chrome driver, starts the service, and then launches a new instance of the chrome browser.\n", ">- ChromeOptions is a concept added in Selenium WebDriver starting from Selenium version 3.6.0, which is used for customizing the ChromeDriver session.\n", ">- The `Options()` method is used to change the default settings of the chrome driver. The object is then passed to the `webdriver.Chrome()` method."
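, "\n", ">- A minimal end-to-end sketch of how these pieces fit together (assuming the ChromeDriver executable sits at `/path/to/chromedriver`, a placeholder path):\n", "```\n", "from selenium.webdriver import Chrome\n", "from selenium.webdriver.chrome.service import Service\n", "from selenium.webdriver.chrome.options import Options\n", "\n", "service = Service('/path/to/chromedriver')  # wraps the driver executable\n", "options = Options()                         # customize the browser session\n", "options.add_argument('--headless')          # e.g., run without a visible window\n", "driver = Chrome(service=service, options=options)\n", "print(driver.title)                         # empty string; no page loaded yet\n", "driver.quit()                               # always release the driver process\n", "```"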
] }, { "cell_type": "code", "execution_count": null, "id": "ff6779c2", "metadata": {}, "outputs": [], "source": [ "from selenium.webdriver import Chrome\n", "from selenium.webdriver.chrome.service import Service\n", "s = Service('/Users/arif/Documents/chromedriver')\n", "driver = Chrome(service=s)" ] }, { "cell_type": "code", "execution_count": null, "id": "70437746", "metadata": {}, "outputs": [], "source": [ "driver.quit()" ] }, { "cell_type": "code", "execution_count": null, "id": "08bd89d4", "metadata": {}, "outputs": [], "source": [ "from selenium.webdriver.chrome.options import Options\n", "\n", "s = Service('/Users/arif/Documents/chromedriver')\n", "myoptions = Options()\n", "myoptions.headless = True\n", "\n", "driver = Chrome(service=s, options=myoptions)" ] }, { "cell_type": "code", "execution_count": null, "id": "33f6c400", "metadata": {}, "outputs": [], "source": [ "driver.quit()" ] }, { "cell_type": "code", "execution_count": null, "id": "d8d4178a", "metadata": {}, "outputs": [], "source": [ "s = Service('/Users/arif/Documents/chromedriver')\n", "myoptions = Options()\n", "\n", "driver = Chrome(service=s, options=myoptions)" ] }, { "cell_type": "code", "execution_count": null, "id": "e6f5f2bc", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "296ff28b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "2d100874", "metadata": {}, "source": [ "> **Load a Web page in the browser window:**\n", ">- The `driver.get('URL')` method is used to load a web page in the current browser session, after which you can access the browser and the HTML code using the driver object.\n", ">- This is similar to `resp = requests.get('URL')`, after which you simply get the response object." ] }, { "cell_type": "code", "execution_count": null, "id": "4b42dc2e", "metadata": {}, "outputs": [], "source": [ "driver.get('https://google.com')" ] }, { "cell_type": "code", "execution_count": null, "id": "1c82a9d2", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "402d055f", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "6498b1da", "metadata": {}, "source": [ "> **Access browser information:** \n", ">- There is a bunch information about the browser you can request, including window handles, browser size / position, cookies, alerts, etc." 
] }, { "cell_type": "code", "execution_count": null, "id": "cc4a36bc", "metadata": {}, "outputs": [], "source": [ "print(dir(driver))" ] }, { "cell_type": "code", "execution_count": null, "id": "1fa41d46", "metadata": {}, "outputs": [], "source": [ "driver.title" ] }, { "cell_type": "code", "execution_count": null, "id": "76a13717", "metadata": {}, "outputs": [], "source": [ "driver.current_url" ] }, { "cell_type": "code", "execution_count": null, "id": "ebad206a", "metadata": {}, "outputs": [], "source": [ "driver.current_window_handle" ] }, { "cell_type": "code", "execution_count": null, "id": "99f67e79", "metadata": {}, "outputs": [], "source": [ "driver.session_id" ] }, { "cell_type": "code", "execution_count": null, "id": "d5a502ff", "metadata": {}, "outputs": [], "source": [ "driver.page_source" ] }, { "cell_type": "code", "execution_count": null, "id": "535ab23a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "e1ffd461", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "5f2be3cd", "metadata": {}, "source": [ "> **Perform Different operations on the browser:**\n", ">- The `driver.refresh()` method is used to refresh the page contents.\n", ">- The `driver.set_window_position(x,y)` is used to set the positions of the top left corner of the browser window.\n", ">- The `driver.set_window_size(x,y)` is used to set the width and height of current window.\n", ">- The `driver.maximize_window()` is used to maximinize the size of the window.\n", ">- The `driver.minimize_window()` is used to minimize the browser in the taskbar." ] }, { "cell_type": "code", "execution_count": null, "id": "17e672b8", "metadata": {}, "outputs": [], "source": [ "driver.refresh()" ] }, { "cell_type": "code", "execution_count": null, "id": "aa2ba167", "metadata": {}, "outputs": [], "source": [ "driver.set_window_position(0,0)" ] }, { "cell_type": "code", "execution_count": null, "id": "ecbd3ef3", "metadata": {}, "outputs": [], "source": [ "driver.maximize_window()" ] }, { "cell_type": "code", "execution_count": null, "id": "e4843022", "metadata": {}, "outputs": [], "source": [ "driver.minimize_window()" ] }, { "cell_type": "code", "execution_count": null, "id": "4aec3162", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "70d8588f", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "d341b321", "metadata": {}, "source": [ "> **Create new tab in the browser window and shift between tabs:**\n", ">- Clicking a link may opens in a new browser tab\n", ">- You can also create a new browser tab programmatically using the `driver.switch_to.new_window('tab')`.\n", ">- All calls to the driver will now be interpreted as being directed to the current browser tab.\n", ">- WebDriver supports moving between windows using:\n", " - `driver.switch_to.window(\"windowname\")`\n", " - `driver.switch_to.frame('framename')`\n", " - `driver.switch_to.default_content()`\n", " - All calls to driver will now be interpreted as being directed to the particular window." 
] }, { "cell_type": "code", "execution_count": null, "id": "f640b999", "metadata": {}, "outputs": [], "source": [ "google_tab = driver.current_window_handle" ] }, { "cell_type": "code", "execution_count": null, "id": "e2403425", "metadata": {}, "outputs": [], "source": [ "driver.switch_to.new_window('tab')" ] }, { "cell_type": "code", "execution_count": null, "id": "c1c15b9b", "metadata": {}, "outputs": [], "source": [ "driver.get('https://www.yahoo.com')" ] }, { "cell_type": "code", "execution_count": null, "id": "79b4d4c9", "metadata": {}, "outputs": [], "source": [ "driver.switch_to.window(google_tab)" ] }, { "cell_type": "code", "execution_count": null, "id": "b97a7b43", "metadata": {}, "outputs": [], "source": [ "driver.close()" ] }, { "cell_type": "code", "execution_count": null, "id": "768b7e7a", "metadata": {}, "outputs": [], "source": [ "driver.quit()" ] }, { "cell_type": "code", "execution_count": null, "id": "2c64c5f4", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "85852d06", "metadata": {}, "source": [ "> **Close browser tab or close the entire session:**\n", ">- The `driver.close()` will simply closes the current tab of the browser and will not close the browser process.\n", ">- The `driver.quit()` will close all the browser tabs and the background driver process." ] }, { "cell_type": "code", "execution_count": null, "id": "e3d2e4e7", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "97fc51a3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "f75a13d2", "metadata": {}, "source": [ "## 3. Example 1: Scraping a JavaScript Driven WebSite (https://arifpucit.github.io/bss2/js/)\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "36a482f2", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "8a646c79", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "3aa3a9ee", "metadata": {}, "source": [ "### a. 
"### a. Using Requests and BeautifulSoup" ] }, { "cell_type": "code", "execution_count": null, "id": "4d329d49", "metadata": {}, "outputs": [], "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "import pandas as pd\n", "\n", "titles = []\n", "prices = []\n", "availability = []\n", "reviews = []\n", "links = []\n", "stars = []\n", "\n", "resp = requests.get('https://arifpucit.github.io/bss2/js')\n", "soup = BeautifulSoup(resp.text, 'lxml')   # resp.text does NOT contain the HTML for the books data (it is rendered by JavaScript)\n", "\n", "sp_titles = soup.find_all('p', class_='book_name')\n", "sp_prices = soup.find_all('p', class_='price green')\n", "sp_availability = soup.find_all('p', class_='stock')\n", "sp_reviews = soup.find_all('p', class_='review')\n", "sp_links = []\n", "for val in sp_titles:\n", "    sp_links.append(val.find('a').get('href'))\n", "books = soup.find_all('div', class_='book_container')\n", "for book in books:\n", "    stars.append(5 - len(book.find_all('span', class_='not_filled')))\n", "\n", "for i in range(len(sp_titles)):\n", "    titles.append(sp_titles[i].text)\n", "    prices.append(sp_prices[i].text)\n", "    availability.append(sp_availability[i].text)\n", "    reviews.append(sp_reviews[i].text)\n", "    links.append(sp_links[i])\n", "\n", "data = {'Title/Author':titles, 'Price':prices, 'Availability':availability, \n", "        'Reviews':reviews, 'Links':links, 'Stars':stars}\n", "df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])\n", "df.to_csv('books1.csv', index=False)\n", "df = pd.read_csv('books1.csv')\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "22eb114f", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "e8a504a2", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "fc9c8313", "metadata": {}, "source": [
"### b. Using Selenium and BeautifulSoup" ] }, { "cell_type": "code", "execution_count": null, "id": "102fe026", "metadata": {}, "outputs": [], "source": [ "from selenium.webdriver import Chrome\n", "from selenium.webdriver.chrome.service import Service\n", "from selenium.webdriver.chrome.options import Options\n", "from bs4 import BeautifulSoup\n", "import pandas as pd\n", "\n", "s = Service('/Users/arif/Documents/chromedriver')\n", "myoptions = Options()\n", "driver = Chrome(service=s, options=myoptions)\n", "driver.get('https://arifpucit.github.io/bss2/js/')\n", "\n", "titles = []\n", "prices = []\n", "availability = []\n", "reviews = []\n", "links = []\n", "stars = []\n", "soup = BeautifulSoup(driver.page_source, 'lxml')   # driver.page_source DOES contain the HTML for the books data\n", "sp_titles = soup.find_all('p', class_='book_name')\n", "sp_prices = soup.find_all('p', class_='price green')\n", "sp_availability = soup.find_all('p', class_='stock')\n", "sp_reviews = soup.find_all('p', class_='review')\n", "sp_links = []\n", "for val in sp_titles:\n", "    sp_links.append(val.find('a').get('href'))\n", "books = soup.find_all('div', class_='book_container')\n", "for book in books:\n", "    stars.append(5 - len(book.find_all('span', class_='not_filled')))\n", "for i in range(len(sp_titles)):\n", "    titles.append(sp_titles[i].text)\n", "    prices.append(sp_prices[i].text)\n", "    availability.append(sp_availability[i].text)\n", "    reviews.append(sp_reviews[i].text)\n", "    links.append(sp_links[i])\n", "data = {'Title/Author':titles, 'Price':prices, 'Availability':availability, \n", "        'Reviews':reviews, 'Links':links, 'Stars':stars}\n", "df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])\n", "df.to_csv('books1.csv', index=False)\n", "df = pd.read_csv('books1.csv')\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "11d5a570", "metadata": {}, "outputs": [], "source": [ "driver.quit()" ] }, { "cell_type": "code", "execution_count": null, "id": "a3e3aaf8", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "6f9ba098", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "dccd946f", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "9599030c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "db5af25a", "metadata": {}, "source": [ "## 4. Example 2: Scraping Dynamic WebSites (https://arifpucit.github.io/bss2/login/)\n", "\n", "" ] }, { "cell_type": "markdown", "id": "9d5194ae", "metadata": {}, "source": [ "### a. Different Ways to Locate Web Elements using Selenium\n", "- Once we have the webpage loaded inside our browser, the next task is to locate the web element(s) of our interest and later perform actions on them.\n", "- The two most commonly used methods to locate elements are:\n", "    - The `driver.find_element(By.LOCATOR, \"value\")` method, used to locate a single element.\n", "    - The `driver.find_elements(By.LOCATOR, \"value\")` method, used to locate multiple elements.\n", "- The first argument to these methods is the locator strategy; the second argument is the value of that locator.\n", "- In Selenium, there are eight different locator strategies with which we can locate a web element (a combined sketch follows the login cells below):\n", "    - ID, NAME, and CLASS_NAME attributes of a web element are called direct locators, as they are fast. Their limitation is that they may not always work in the case of dynamic websites.\n",
"    - XPATH and CSS_SELECTOR are called indirect locators, as they are comparatively slow, but they are really useful in the case of dynamic websites.\n", "    - LINK_TEXT and PARTIAL_LINK_TEXT\n", "    - TAG_NAME itself, which is seldom used.\n", "\n", "\n", "\n", "- **Locating Web Elements:** https://selenium-python.readthedocs.io/locating-elements.html\n", "- Interacting with Web Elements: https://www.selenium.dev/documentation/webdriver/elements/interactions/\n", "- Read about CSS_SELECTOR: https://www.w3schools.com/cssref/css_selectors.asp\n", "- Read about XPATH: https://www.guru99.com/xpath-selenium.html, https://www.browserstack.com/guide/find-element-by-xpath-in-selenium\n", "- **Install Chrome Extension (Selector Hub):** https://selectorshub.com/\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8ca518b4", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "3bb7b774", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "c483703b", "metadata": {}, "source": [ "> **Load the Login version of Book Scraping Site:** https://arifpucit.github.io/bss2/login/" ] }, { "cell_type": "code", "execution_count": null, "id": "98b5d606", "metadata": {}, "outputs": [], "source": [ "s = Service('/Users/arif/Documents/chromedriver')\n", "myoptions = Options()\n", "driver = Chrome(service=s, options=myoptions)\n", "driver.get('https://arifpucit.github.io/bss2/login/')\n", "driver.maximize_window()\n" ] }, { "cell_type": "code", "execution_count": null, "id": "591ee921", "metadata": {}, "outputs": [], "source": [ "driver.quit()" ] }, { "cell_type": "code", "execution_count": null, "id": "c638ba46", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "6eefca9e", "metadata": {}, "outputs": [], "source": [ "s = Service('/Users/arif/Documents/chromedriver')\n", "myoptions = Options()\n", "driver = Chrome(service=s, options=myoptions)\n", "driver.get('https://arifpucit.github.io/bss2/login/')\n", "driver.maximize_window()\n" ] }, { "cell_type": "code", "execution_count": null, "id": "70cfd52b", "metadata": {}, "outputs": [], "source": [ "from selenium.webdriver.common.by import By\n", "\n", "tbox = driver.find_element(By.ID, 'name')\n", "type(tbox)" ] }, { "cell_type": "code", "execution_count": null, "id": "07664486", "metadata": {}, "outputs": [], "source": [ "tbox.send_keys(\"arif\")" ] }, { "cell_type": "code", "execution_count": null, "id": "7326b098", "metadata": {}, "outputs": [], "source": [ "tbox.clear()" ] }, { "cell_type": "code", "execution_count": null, "id": "95403777", "metadata": {}, "outputs": [], "source": [ "mylink = driver.find_element(By.LINK_TEXT, 'Ask Google for Password')\n", "mylink.click()" ] }, { "cell_type": "code", "execution_count": null, "id": "bc1acfe3", "metadata": {}, "outputs": [], "source": [ "driver.back()" ] }, { "cell_type": "code", "execution_count": null, "id": "6ff1723e", "metadata": {}, "outputs": [], "source": [ "tbox2 = driver.find_element(By.CSS_SELECTOR, '#password')\n", "tbox2.send_keys('datascience')" ] }, { "cell_type": "code", "execution_count": null, "id": "33204838", "metadata": {}, "outputs": [], "source": [ "btn = driver.find_element(By.XPATH, '//*[@id=\"submit_button\"]')\n", "btn.click()" ] }, { "cell_type": "code", "execution_count": null, "id": "f8c024a4", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code",
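"execution_count": null, "id": "loc8demo1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "loc8demo2", "metadata": {}, "source": [ "> A quick sketch of all eight `By` strategies against this login page. The `name`, `password`, `submit_button`, and link-text values come from the cells above; the TAG_NAME and CLASS_NAME values are illustrative assumptions about the page's markup:\n", "```\n", "from selenium.webdriver.common.by import By\n", "\n", "driver.find_element(By.ID, 'name')\n", "driver.find_element(By.NAME, 'name')                 # assumes a name attribute on the text box\n", "driver.find_element(By.TAG_NAME, 'input')            # first input element on the page\n", "driver.find_element(By.CLASS_NAME, 'form-control')   # assumed CSS class\n", "driver.find_element(By.LINK_TEXT, 'Ask Google for Password')\n", "driver.find_element(By.PARTIAL_LINK_TEXT, 'Ask Google')\n", "driver.find_element(By.CSS_SELECTOR, '#password')\n", "driver.find_element(By.XPATH, '//*[@id=\"submit_button\"]')\n", "```" ] }, { "cell_type": "code",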
"execution_count": null, "id": "63e5f46e", "metadata": {}, "outputs": [], "source": [ "for i in range(1,10):\n", " price = driver.find_element(By.XPATH, '/html/body/section/div/div[2]/div[2]/div[' + str(i)+ ']/div/p[1]').text\n", " print(price)" ] }, { "cell_type": "code", "execution_count": null, "id": "c464956e", "metadata": {}, "outputs": [], "source": [ "driver.quit()" ] }, { "cell_type": "code", "execution_count": null, "id": "4c95b591", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "47ea3d18", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "c6ce8caa", "metadata": {}, "source": [ "### b. Consolidated Script to Login and Scrape Books Data: (https://arifpucit.github.io/bss2/login/)" ] }, { "cell_type": "code", "execution_count": null, "id": "5060fcb2", "metadata": {}, "outputs": [], "source": [ "from selenium.webdriver import Chrome\n", "from selenium.webdriver.chrome.service import Service\n", "from selenium.webdriver.common.by import By\n", "import time\n", "import pandas as pd\n", "\n", "\n", "s = Service('/Users/arif/Documents/chromedriver') \n", "driver = Chrome(service=s) \n", "driver.get('https://arifpucit.github.io/bss2/login') \n", "driver.maximize_window()\n", "\n", "\n", "driver.find_element(By.ID, \"name\").send_keys(\"arif\")\n", "driver.find_element(By.ID, \"password\").send_keys(\"datascience\")\n", "btn = driver.find_element(By.ID, \"submit_button\")\n", "time.sleep(2)\n", "btn.click()\n", "time.sleep(2)\n", "\n", "\n", "titles = []\n", "prices = []\n", "availability=[]\n", "reviews=[]\n", "links=[]\n", "\n", "for i in range(1,10):\n", " title = driver.find_element(By.XPATH,'/html/body/section/div/div[2]/div[2]/div['+str(i)+']/p').text\n", " titles.append(title)\n", " price = driver.find_element(By.XPATH,'/html/body/section/div/div[2]/div[2]/div['+str(i)+']/div/p[1]').text\n", " prices.append(price)\n", " avail = driver.find_element(By.XPATH,'/html/body/section/div/div[2]/div[2]/div['+str(i)+']/div/p[2]').text\n", " availability.append(avail)\n", " review = driver.find_element(By.XPATH,'/html/body/section/div/div[2]/div[2]/div['+str(i)+']/div/p[3]').text\n", " reviews.append(review)\n", " link = driver.find_element(By.XPATH,'/html/body/section/div/div[2]/div[2]/div['+str(i)+']/p/a').get_attribute('href')\n", " links.append(link)\n", "\n", "data = {'Title/Author':titles, 'Price':prices, 'Availability':availability, \n", " 'Reviews':reviews, 'Links':links}\n", "df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links'])\n", "df.to_csv('books2.csv', index=False)\n", "df = pd.read_csv('books2.csv')\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "b3f900d2", "metadata": {}, "outputs": [], "source": [ "driver.quit()" ] }, { "cell_type": "markdown", "id": "5d44f408", "metadata": {}, "source": [ ">- Above bot scrape the data of nine OS books only.\n", ">- Try extending above crawler to scrape the books data of SP and CA as an exercise." ] }, { "cell_type": "code", "execution_count": null, "id": "7694ce09", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "933d844b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "8614a9c6", "metadata": {}, "source": [ "## 5. 
"## 5. Example 3: Scraping Multiple Web Pages that Employ Infinite Scrolling\n", "- https://arifpucit.github.io/bss2/scroll/" ] }, { "cell_type": "code", "execution_count": null, "id": "80770ba3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "33833cd4", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "9cff458a", "metadata": {}, "source": [ "### a. Fetch Contents w/o Scrolling" ] }, { "cell_type": "code", "execution_count": null, "id": "1e798c32", "metadata": {}, "outputs": [], "source": [ "from selenium.webdriver import Chrome\n", "from selenium.webdriver.chrome.service import Service\n", "from selenium.webdriver.common.by import By\n", "import time\n", "import pandas as pd\n", "\n", "\n", "s = Service('/Users/arif/Documents/chromedriver') \n", "driver = Chrome(service=s) \n", "driver.get('https://arifpucit.github.io/bss2/scroll') \n", "driver.maximize_window()\n", "\n", "\n", "titles = []\n", "prices = []\n", "availability = []\n", "reviews = []\n", "links = []\n", "star_rates = []\n", "\n", "\n", "books = driver.find_elements(By.CLASS_NAME, 'col-sm-4')\n", "books_count = len(books)\n", "for i in range(1,books_count+1):\n", "    title = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/p[1]').text\n", "    titles.append(title)\n", "    price = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/p[1]').text\n", "    prices.append(price)\n", "    avail = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/p[2]').text\n", "    availability.append(avail)\n", "    review = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/p[3]').text\n", "    reviews.append(review)\n", "    link = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/p[1]/a').get_attribute('href')\n", "    links.append(link)\n", "    star_rate = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/div').get_attribute('data-rate-star')\n", "    star_rates.append(star_rate)\n", "    \n", "    \n", "data = {'Title/Author':titles, 'Price':prices, 'Availability':availability, \n", "        'Reviews':reviews, 'Links':links, 'StarRating': star_rates}\n", "df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'StarRating'])\n", "df.to_csv('books3.csv', index=False)\n", "df = pd.read_csv('books3.csv')\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "6ac9f7ef", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "16e3cf64", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "7b4613e8", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "117fc743", "metadata": {}, "source": [ "### b. How to Scroll an Infinite Scrolling Web Page using Selenium\n", "- The `driver.execute_script(JS)` method is used to synchronously execute JavaScript in the current window/frame.\n", "```\n", "driver.execute_script('alert(\"Hello JavaScript\")')\n", "```\n", "- The `window.scrollTo()` method is used to perform the scrolling operation. The pixels to be scrolled horizontally along the x-axis and the pixels to be scrolled vertically along the y-axis are passed as parameters to the method."
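, "\n", "- For example, a gentler variant (a sketch) scrolls in fixed steps with `window.scrollBy()` so lazily loaded content has time to appear:\n", "```\n", "import time\n", "for _ in range(5):                                    # five incremental hops\n", "    driver.execute_script('window.scrollBy(0, 800)')  # scroll down 800 px\n", "    time.sleep(1)                                     # let new items render\n", "```"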
] }, { "cell_type": "code", "execution_count": null, "id": "08a72bb1", "metadata": {}, "outputs": [], "source": [ "driver.execute_script('return document.body.scrollHeight')" ] }, { "cell_type": "code", "execution_count": null, "id": "c71eee10", "metadata": {}, "outputs": [], "source": [ "driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')" ] }, { "cell_type": "code", "execution_count": null, "id": "fa243e3f", "metadata": {}, "outputs": [], "source": [ "driver.quit()" ] }, { "cell_type": "code", "execution_count": null, "id": "e5aff7bb", "metadata": {}, "outputs": [], "source": [ "from selenium.webdriver import Chrome\n", "from selenium.webdriver.chrome.service import Service\n", "import time\n", "\n", "s = Service('/Users/arif/Documents/chromedriver') \n", "driver = Chrome(service=s) \n", "driver.get('https://arifpucit.github.io/bss2/scroll') \n", "driver.maximize_window()\n", "\n", "last_height =driver.execute_script('return document.body.scrollHeight')\n", "while True:\n", " driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')\n", " time.sleep(2)\n", " new_height =driver.execute_script('return document.body.scrollHeight')\n", " if (new_height == last_height):\n", " break\n", " last_height = new_height\n", " \n", "# Count of books in the entire page\n", "books_count = driver.find_elements(By.CLASS_NAME, 'col-sm-4')\n", "len(books_count)" ] }, { "cell_type": "code", "execution_count": null, "id": "f61eddaa", "metadata": {}, "outputs": [], "source": [ "driver.quit()" ] }, { "cell_type": "code", "execution_count": null, "id": "a01083e0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "8ade15ce", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "7f6ce2b1", "metadata": {}, "source": [ "### e. 
"### c. Scrape all the Books using Self-Scrolling" ] }, { "cell_type": "code", "execution_count": null, "id": "7b8d984d", "metadata": {}, "outputs": [], "source": [ "from selenium.webdriver import Chrome\n", "from selenium.webdriver.chrome.service import Service\n", "from selenium.webdriver.common.by import By\n", "import time\n", "import pandas as pd\n", "\n", "\n", "s = Service('/Users/arif/Documents/chromedriver') \n", "driver = Chrome(service=s) \n", "driver.get('https://arifpucit.github.io/bss2/scroll') \n", "driver.maximize_window()\n", "\n", "# Scroll the entire page and then start scraping\n", "last_height = driver.execute_script('return document.body.scrollHeight')\n", "while True:\n", "    driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')\n", "    time.sleep(2)\n", "    new_height = driver.execute_script('return document.body.scrollHeight')\n", "    if (new_height == last_height):\n", "        break\n", "    last_height = new_height\n", "\n", "\n", "titles = []\n", "prices = []\n", "availability = []\n", "reviews = []\n", "links = []\n", "star_rates = []\n", "\n", "books = driver.find_elements(By.CLASS_NAME, 'col-sm-4')\n", "books_count = len(books)\n", "for i in range(1,books_count+1):\n", "    title = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/p[1]').text\n", "    titles.append(title)\n", "    price = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/p[1]').text\n", "    prices.append(price)\n", "    avail = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/p[2]').text\n", "    availability.append(avail)\n", "    review = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/p[3]').text\n", "    reviews.append(review)\n", "    link = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/p[1]/a').get_attribute('href')\n", "    links.append(link)\n", "    star_rate = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/div').get_attribute('data-rate-star')\n", "    star_rate = round(float(star_rate), 2)\n", "    star_rates.append(star_rate)\n", "    \n", "    \n", "data = {'Title/Author':titles, 'Price':prices, 'Availability':availability, \n", "        'Reviews':reviews, 'Links':links, 'StarRating': star_rates}\n", "df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'StarRating'])\n", "df.to_csv('books3.csv', index=False)\n", "df = pd.read_csv('books3.csv')\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "0e9b6a02", "metadata": {}, "outputs": [], "source": [ "driver.quit()" ] }, { "cell_type": "code", "execution_count": null, "id": "9dc44e36", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "d6edb6b7", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "61ec589c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "67f21cce", "metadata": {}, "source": [ "## 6. Example 4: Scraping Multiple Web Pages that Employ Pagination\n", "- https://arifpucit.github.io/bss2/pagination/" ] }, { "cell_type": "code", "execution_count": null, "id": "823e83c8", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "7b819f0f", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "d680d530", "metadata": {}, "source": [
"### a. Fetch Contents of First Page" ] }, { "cell_type": "code", "execution_count": null, "id": "bcc02572", "metadata": {}, "outputs": [], "source": [ "from selenium.webdriver import Chrome\n", "from selenium.webdriver.chrome.service import Service\n", "from selenium.webdriver.common.by import By\n", "import time\n", "import pandas as pd\n", "\n", "titles = []\n", "prices = []\n", "availability = []\n", "reviews = []\n", "links = []\n", "star_rates = []\n", "\n", "def books():\n", "    time.sleep(2)\n", "    books = driver.find_elements(By.CLASS_NAME, 'col-sm-4')\n", "    books_count = len(books)\n", "    for i in range(1,books_count+1):\n", "        title = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/p[1]').text\n", "        titles.append(title)\n", "        price = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/p[1]').text\n", "        prices.append(price)\n", "        avail = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/p[2]').text\n", "        availability.append(avail)\n", "        review = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/p[3]').text\n", "        reviews.append(review)\n", "        link = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/p[1]/a').get_attribute('href')\n", "        links.append(link)\n", "        star_rate = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/div').get_attribute('data-rate-star')\n", "        star_rate = round(float(star_rate), 2)   # round the rating to 2 decimal places\n", "        star_rates.append(star_rate)\n", "    \n", "\n", "s = Service('/Users/arif/Documents/chromedriver') \n", "driver = Chrome(service=s) \n", "url = 'https://arifpucit.github.io/bss2/pagination'\n", "driver.get(url)\n", "driver.maximize_window()\n", "\n", "books()\n", "\n", "# Writing in the file\n", "data = {'Title/Author':titles, 'Price':prices, 'Availability':availability, \n", "        'Reviews':reviews, 'Links':links, 'StarRating': star_rates}\n", "df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'StarRating'])\n", "df.to_csv('books4.csv', index=False)\n", "df = pd.read_csv('books4.csv')\n", "df\n" ] }, { "cell_type": "code", "execution_count": null, "id": "aa0153d7", "metadata": {}, "outputs": [], "source": [ "driver.quit()" ] }, { "cell_type": "code", "execution_count": null, "id": "03ffbc05", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "c08078b3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "13103a81", "metadata": {}, "source": [ "### b. Logic to Iterate from First Page to Last Page in Pagination\n", "- At times we need to wait for all the web elements to be properly displayed on a web page. There are two ways to wait:\n", "    - Use the `time.sleep(ARBITRARY_TIME)` method.\n", "    - Use the `WebDriverWait()` method.\n", "- If you use `time.sleep()` you will probably pick an arbitrary value. The problem is that you end up waiting either too long or not long enough. Moreover, a website can load slowly over your local Wi-Fi connection but be ten times faster on a cloud server.
\n", "- So better solution is use the WebDriverWait() method, that will wait for the exact amount of time necessary for your element/data to be loaded.\n", "- There are many interesting expected conditions on which you can wait, like:\n", " - `presence_of_element_located`\n", " - `element_to_be_clickable`\n", " - `text_to_be_present_in_element`\n", " - `element_to_be_clickable`" ] }, { "cell_type": "code", "execution_count": null, "id": "5cd61e68", "metadata": {}, "outputs": [], "source": [ "from selenium.webdriver import Chrome\n", "from selenium.webdriver.chrome.service import Service\n", "from selenium.webdriver.common.by import By\n", "import time\n", "import pandas as pd\n", "# new header files\n", "from selenium.webdriver.support.ui import WebDriverWait\n", "from selenium.webdriver.support import expected_conditions as EC\n", "\n", "s = Service('/Users/arif/Documents/chromedriver') \n", "driver = Chrome(service=s) \n", "url = 'https://arifpucit.github.io/bss2/pagination'\n", "driver.get(url)\n", "driver.maximize_window()" ] }, { "cell_type": "code", "execution_count": null, "id": "d4b32ecd", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "de22b8d0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "09c2ea4d", "metadata": {}, "outputs": [], "source": [ "while(True): \n", " WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.page-item'))) \n", " try: \n", " driver.find_element(By.XPATH,'//*[@id=\"page_8\"]')\n", " driver.execute_script(\"arguments[0].scrollIntoView();\", driver.find_element(By.XPATH,'//*[@id=\"page_8\"]'))\n", " try:\n", " time.sleep(2)\n", " driver.find_element(By.CSS_SELECTOR,'.page-item.disabled') \n", " break\n", " except:\n", " driver.find_element(By.XPATH,'//*[@id=\"page_8\"]').click() \n", " except:\n", " break\n", "\n", "print(\"Successfully, reached on the last page\")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "6b2dd9a1", "metadata": {}, "outputs": [], "source": [ "driver.quit()" ] }, { "cell_type": "code", "execution_count": null, "id": "fbdc0056", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "94deca1d", "metadata": {}, "source": [ "### c. 
"### c. Scrape all the Books using Pagination" ] }, { "cell_type": "code", "execution_count": null, "id": "b2276e1e", "metadata": {}, "outputs": [], "source": [ "from selenium.webdriver import Chrome\n", "from selenium.webdriver.chrome.service import Service\n", "from selenium.webdriver.common.by import By\n", "import time\n", "import pandas as pd\n", "# new imports\n", "from selenium.webdriver.support.ui import WebDriverWait\n", "from selenium.webdriver.support import expected_conditions as EC\n", "\n", "titles = []\n", "prices = []\n", "availability = []\n", "reviews = []\n", "links = []\n", "star_rates = []\n", "\n", "def books():\n", "    time.sleep(2)\n", "    books = driver.find_elements(By.CLASS_NAME, 'col-sm-4')\n", "    books_count = len(books)\n", "    for i in range(1,books_count+1):\n", "        title = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/p[1]').text\n", "        titles.append(title)\n", "        price = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/p[1]').text\n", "        prices.append(price)\n", "        avail = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/p[2]').text\n", "        availability.append(avail)\n", "        review = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/p[3]').text\n", "        reviews.append(review)\n", "        link = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/p[1]/a').get_attribute('href')\n", "        links.append(link)\n", "        star_rate = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/div').get_attribute('data-rate-star')\n", "        star_rate = round(float(star_rate), 2)   # round the rating to 2 decimal places\n", "        star_rates.append(star_rate)\n", "    \n", "\n", "s = Service('/Users/arif/Documents/chromedriver') \n", "driver = Chrome(service=s) \n", "url = 'https://arifpucit.github.io/bss2/pagination'\n", "driver.get(url)\n", "driver.maximize_window()\n", "\n", "\n", "# Call the books() function, click the Next button, and repeat till it is disabled\n", "while(True):\n", "    books() \n", "    page = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.page-item')))\n", "    try: \n", "        driver.find_element(By.XPATH,'//*[@id=\"page_8\"]')\n", "        driver.execute_script(\"arguments[0].scrollIntoView();\", driver.find_element(By.XPATH,'//*[@id=\"page_8\"]'))\n", "        try:\n", "            time.sleep(2)\n", "            driver.find_element(By.CSS_SELECTOR,'.page-item.disabled')\n", "            break\n", "        except:\n", "            driver.find_element(By.XPATH,'//*[@id=\"page_8\"]').click()\n", "    except:\n", "        break\n", "\n", "\n", "\n", "# Writing in the file\n", "data = {'Title/Author':titles, 'Price':prices, 'Availability':availability, \n", "        'Reviews':reviews, 'Links':links, 'StarRating': star_rates}\n", "df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'StarRating'])\n", "df.to_csv('books4.csv', index=False)\n", "df = pd.read_csv('books4.csv')\n", "df\n" ] }, { "cell_type": "code", "execution_count": null, "id": "ebd8f064", "metadata": {}, "outputs": [], "source": [ "driver.quit()" ] }, { "cell_type": "code", "execution_count": null, "id": "9915611d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "e158290c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "9a735ea7", "metadata": {}, "source": [
"## 7. Example 5: Handling Popups (https://arifpucit.github.io/bss2/popup/)\n", "- Pop-ups are informational or promotional offers that display on top of your content.\n", "- They are designed to capture the user's attention quickly. " ] }, { "cell_type": "code", "execution_count": null, "id": "2658b7d4", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "eb0d00b6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "2a76eb3f", "metadata": {}, "source": [ "### a. Apply Above Pagination Bot on the Pop-Up Version" ] }, { "cell_type": "code", "execution_count": null, "id": "f093f743", "metadata": {}, "outputs": [], "source": [ "from selenium.webdriver import Chrome\n", "from selenium.webdriver.chrome.service import Service\n", "from selenium.webdriver.common.by import By\n", "import time\n", "import pandas as pd\n", "# new imports\n", "from selenium.webdriver.support.ui import WebDriverWait\n", "from selenium.webdriver.support import expected_conditions as EC\n", "\n", "titles = []\n", "prices = []\n", "availability = []\n", "reviews = []\n", "links = []\n", "star_rates = []\n", "\n", "def books():\n", "    time.sleep(2)\n", "    books = driver.find_elements(By.CLASS_NAME, 'col-sm-4')\n", "    books_count = len(books)\n", "    for i in range(1,books_count+1):\n", "        title = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/p[1]').text\n", "        titles.append(title)\n", "        price = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/p[1]').text\n", "        prices.append(price)\n", "        avail = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/p[2]').text\n", "        availability.append(avail)\n", "        review = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/p[3]').text\n", "        reviews.append(review)\n", "        link = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/p[1]/a').get_attribute('href')\n", "        links.append(link)\n", "        star_rate = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/div').get_attribute('data-rate-star')\n", "        star_rate = round(float(star_rate), 2)   # round the rating to 2 decimal places\n", "        star_rates.append(star_rate)\n", "    \n", "\n", "s = Service('/Users/arif/Documents/chromedriver') \n", "driver = Chrome(service=s) \n", "url = 'https://arifpucit.github.io/bss2/popup'\n", "driver.get(url)\n", "driver.maximize_window()\n", "\n", "# Call the books() function, click the Next button, and repeat till it is disabled\n", "while(True):\n", "    books() \n", "    # wait for the pagination bar; without this wait the script can race ahead of the page load\n", "    page = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.page-item')))\n", "    try: \n", "        driver.find_element(By.XPATH,'//*[@id=\"page_8\"]')\n", "        driver.execute_script(\"arguments[0].scrollIntoView();\", driver.find_element(By.XPATH,'//*[@id=\"page_8\"]'))\n", "        try:\n", "            time.sleep(2)\n", "            # on the last page the Next button carries the disabled class, so stop; otherwise an exception is raised\n", "            driver.find_element(By.CSS_SELECTOR,'.page-item.disabled')\n", "            break\n", "        except:\n", "            # Next is not disabled yet, so click it to load the following page\n", "            driver.find_element(By.XPATH,'//*[@id=\"page_8\"]').click()\n", "    except:\n", "        break\n", "\n", "\n", "\n", "# Writing in the file\n", "data = {'Title/Author':titles,
'Price':prices, 'Availability':availability, \n", "        'Reviews':reviews, 'Links':links, 'StarRating': star_rates}\n", "df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'StarRating'])\n", "df.to_csv('books5.csv', index=False)\n", "df = pd.read_csv('books5.csv')\n", "df\n" ] }, { "cell_type": "code", "execution_count": null, "id": "9d4ec2d8", "metadata": {}, "outputs": [], "source": [ "driver.quit()" ] }, { "cell_type": "code", "execution_count": null, "id": "a01d55ec", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "85e02e75", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "c74c415e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "9e9daa15", "metadata": {}, "source": [ "### b. Solution" ] }, { "cell_type": "code", "execution_count": null, "id": "a2895c05", "metadata": {}, "outputs": [], "source": [ "from selenium.webdriver import Chrome\n", "from selenium.webdriver.chrome.service import Service\n", "from selenium.webdriver.common.by import By\n", "import time\n", "import pandas as pd\n", "# new imports\n", "from selenium.webdriver.support.ui import WebDriverWait\n", "from selenium.webdriver.support import expected_conditions as EC\n", "\n", "titles = []\n", "prices = []\n", "availability = []\n", "reviews = []\n", "links = []\n", "star_rates = []\n", "\n", "def books():\n", "    time.sleep(2)\n", "    books = driver.find_elements(By.CLASS_NAME, 'col-sm-4')\n", "    books_count = len(books)\n", "    for i in range(1,books_count+1):\n", "        title = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/p[1]').text\n", "        titles.append(title)\n", "        price = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/p[1]').text\n", "        prices.append(price)\n", "        avail = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/p[2]').text\n", "        availability.append(avail)\n", "        review = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/p[3]').text\n", "        reviews.append(review)\n", "        link = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/p[1]/a').get_attribute('href')\n", "        links.append(link)\n", "        star_rate = driver.find_element(By.XPATH,'//*[@id=\"container\"]/div[' +str(i)+ ']/div/div').get_attribute('data-rate-star')\n", "        star_rate = round(float(star_rate), 2)   # round the rating to 2 decimal places\n", "        star_rates.append(star_rate)\n", "    \n", "\n", "s = Service('/Users/arif/Documents/chromedriver') \n", "driver = Chrome(service=s) \n", "url = 'https://arifpucit.github.io/bss2/popup'\n", "driver.get(url)\n", "driver.maximize_window()\n", "\n", "\n", "#### Close the pop-up: it lives inside an iframe, so switch into the frame, click Close, and switch back\n", "time.sleep(5)\n", "driver.switch_to.frame(driver.find_element(By.ID,'frame'))\n", "clos_button = driver.find_element(By.XPATH,'//*[@id=\"staticBackdrop\"]/div/div/div[1]/button')\n", "clos_button.click()\n", "driver.switch_to.default_content()\n", "\n", "\n", "\n", "\n", "# Call the books() function, click the Next button, and repeat till it is disabled\n", "while(True):\n", "    books() \n", "    # wait for the pagination bar; without this wait the script can race ahead of the page load\n", "    page = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.page-item')))\n", "    try: \n", "        driver.find_element(By.XPATH,'//*[@id=\"page_8\"]')\n",
"        driver.execute_script(\"arguments[0].scrollIntoView();\", driver.find_element(By.XPATH,'//*[@id=\"page_8\"]'))\n", "        try:\n", "            time.sleep(2)\n", "            # on the last page the Next button carries the disabled class, so stop; otherwise an exception is raised\n", "            driver.find_element(By.CSS_SELECTOR,'.page-item.disabled')\n", "            break\n", "        except:\n", "            # Next is not disabled yet, so click it to load the following page\n", "            driver.find_element(By.XPATH,'//*[@id=\"page_8\"]').click()\n", "    except:\n", "        break\n", "\n", "\n", "\n", "# Writing in the file\n", "data = {'Title/Author':titles, 'Price':prices, 'Availability':availability, \n", "        'Reviews':reviews, 'Links':links, 'StarRating': star_rates}\n", "df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'StarRating'])\n", "df.to_csv('books5.csv', index=False)\n", "df = pd.read_csv('books5.csv')\n", "df\n" ] }, { "cell_type": "code", "execution_count": null, "id": "a53c0648", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "447fe617", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "1042a34e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "cf9b750d", "metadata": {}, "source": [ "## 8. Bonus" ] }, { "cell_type": "markdown", "id": "53bfb8b6", "metadata": {}, "source": [ "**How to Generate App Passwords in Gmail**\n", "- Before 30 May 2022, one could enable the \"Less Secure Accounts\" option on his/her Google account and send emails via 3rd-party tools, e.g., from Python code. For security reasons, Google has disabled the \"Less Secure Accounts\" option.\n", "- The alternative solution is a \"Google App Password\", using which you can sign in to your Google account from applications on devices that do not support 2-Step Verification.\n", "- So we can use the App Passwords feature to create a separate password exclusively for the script, instead of the main password, without sacrificing two-factor authentication.\n", "- The steps to generate app passwords in Gmail are:\n", "    - Log in to your Google account, click the ::: icon at the top right, and click Account to open your account settings, or visit your Google account page at accounts.google.com\n", "    - Click the Security tab and make sure two-factor authentication is turned ON\n", "    - Click on App Passwords\n", "    - Generate an app password by selecting the app as ‘Mail’ and the device as Windows or Mac (whatever applies) and click Generate\n", "    - It will display a 16-character app password\n", "    - Copy and use the generated app password instead of the original Gmail password in Python scripts" ] }, { "cell_type": "markdown", "id": "07d0b296", "metadata": {}, "source": [ "- smtplib is a built-in Python library for sending emails using the Simple Mail Transfer Protocol (SMTP); we do not need to install it. It abstracts away all the complexities of SMTP.\n", "\n", "- MIMEBase is just a base class. As the specification says: \"Ordinarily you won’t create instances specifically of MIMEBase\".\n", "- MIMEText is for text (e.g. text/plain or text/html), if the whole message is in text format, or if a part of it is.\n", "- MIMEMultipart is for saying \"I have more than one part\", and then listing the parts - you do that if you have attachments, and also to provide alternative versions of the same content (e.g. a plain text version plus an HTML version)." ] },
{ "cell_type": "code", "execution_count": null, "id": "fe6e14d0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "98df5f69", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "9595e403", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "efc0fb27", "metadata": {}, "source": [ "### a. Sending Scraped Data via E-Mail from Python\n", "- Paste your own generated 16-character app password into the `passwd` variable below; never hard-code a real app password in a shared notebook." ] }, { "cell_type": "code", "execution_count": null, "id": "50d2fc49", "metadata": {}, "outputs": [], "source": [ "import smtplib, ssl\n", "from email.mime.text import MIMEText\n", "from email.mime.base import MIMEBase\n", "from email.mime.multipart import MIMEMultipart\n", "from email import encoders\n", "\n", "sender = 'arifbuttscraper@gmail.com'\n", "passwd = 'xxxxxxxxxxxxx'\n", "receiver = 'xxxxx@gmail.com'\n", "\n", "# create MIMEMultipart object, fill its header information and attach the body to it\n", "msg = MIMEMultipart()\n", "msg['From'] = sender\n", "msg['To'] = receiver\n", "msg['Subject'] = 'Books Data'\n", "msg.attach(MIMEText(\n", "'''\n", "AoA,\n", "Please see the attached file containing information about books.\n", "Best\n", "''', \n", "'plain'))\n", "\n", "\n", "# create MIMEBase object for creating an attachment and attach it with the MIMEMultipart object\n", "part = MIMEBase('application', 'octet-stream')\n", "fd = open('books4.csv', 'rb') \n", "file_contents = fd.read()\n", "fd.close()\n", "part.set_payload(file_contents)\n", "encoders.encode_base64(part)\n", "part.add_header('Content-Disposition', 'attachment; filename =\"books4.csv\"')\n", "msg.attach(part)\n", "\n", "# Send message object as email using smtplib by first creating an SMTP session\n", "s = smtplib.SMTP_SSL(host='smtp.gmail.com', port=465)\n", "s.login(user = sender, password = passwd)\n", "s.sendmail(sender, receiver, msg.as_string())\n", "s.quit()\n", "print('Done..!!')" ] }, { "cell_type": "code", "execution_count": null, "id": "0270309e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "a826f8c1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "46693b42", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "3f7c931c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "348b8cd5", "metadata": {}, "source": [
"### b. Schedule your E-Mail" ] }, { "cell_type": "code", "execution_count": null, "id": "f4258c5a", "metadata": {}, "outputs": [], "source": [ "import sys\n", "!{sys.executable} -m pip install schedule -q" ] }, { "cell_type": "code", "execution_count": null, "id": "5db53528", "metadata": {}, "outputs": [], "source": [ "import smtplib, ssl\n", "from email.mime.text import MIMEText\n", "from email.mime.base import MIMEBase\n", "from email.mime.multipart import MIMEMultipart\n", "from email import encoders\n", "import os\n", "sender = 'xxxxx@gmail.com'\n", "receiver = 'yyyyy@gmail.com'\n", "passwd = 'xxxxxxxxxxxxx'\n", "msg = MIMEMultipart()\n", "msg['From'] = sender\n", "msg['To'] = receiver\n", "msg['Subject'] = 'Scheduled Email'\n", "msg.attach(MIMEText('This is an email scheduled to be sent at a specific date and time', 'plain'))\n", "part = MIMEBase('application', 'octet-stream')\n", "fd = open('books4.csv', 'rb') \n", "file_contents = fd.read()\n", "fd.close()\n", "part.set_payload(file_contents)\n", "encoders.encode_base64(part)\n", "part.add_header('Content-Disposition', 'attachment; filename =\"books.csv\"')\n", "msg.attach(part)" ] }, { "cell_type": "code", "execution_count": null, "id": "f6fa33c1", "metadata": {}, "outputs": [], "source": [ "def mail():\n", "    s = smtplib.SMTP_SSL(host='smtp.gmail.com', port=465)\n", "    s.login(user = sender, password = passwd)\n", "    s.sendmail(sender, receiver, msg.as_string())\n", "    s.quit()   # close the SMTP session after sending" ] }, { "cell_type": "code", "execution_count": null, "id": "ac79195b", "metadata": {}, "outputs": [], "source": [ "!date" ] }, { "cell_type": "code", "execution_count": null, "id": "96e5adbc", "metadata": {}, "outputs": [], "source": [ "import schedule\n", "import time\n", "schedule.every().day.at(\"16:27\").do(mail)\n", "while (True):\n", "    schedule.run_pending()\n", "    time.sleep(1)" ] }, { "cell_type": "code", "execution_count": null, "id": "310f661f", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "1383377b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "0d23666b", "metadata": {}, "source": [ "# To Be Continued...\n", "- Web Scraping Best Practices and Scraping of Real Websites" ] }, { "cell_type": "code", "execution_count": null, "id": "e272beb6", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }