{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lesson 8—Web Automation with Selenium\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Version 1.1. Prepared by [Makzan](https://makzan.net). Updated at 2021 Janurary." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this series, we will use 3 lectures to learn fetching data online. This includes:\n", "\n", "- Finding patterns in URL\n", "- Open web URL\n", "- Downloading files in Python\n", "- Fetch data with API\n", "- Web scraping with Requests and BeautifulSoup\n", "- **Web automation with Selenium**\n", "- **Converting Wikipedia tabular data into CSV**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use Selenium when:\n", "- When Requests and BeautifulSoup does not work.\n", "- When page requires JavaScript to render the data.\n", "\n", "Pros:\n", "- It launches real browser and automate browser.\n", "- Better compatibility .\n", "\n", "Cons:\n", "- Slow because it launches real browser.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Downloading browser driver\n", "\n", "We need web browser driver to use Selenium. 
\n", "\n", "- [Gecko Driver for Firefox](https://github.com/mozilla/geckodriver/releases)\n", "- [Chrome Driver](https://chromedriver.chromium.org/)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: selenium in c:\\users\\thomas\\appdata\\local\\continuum\\anaconda3\\lib\\site-packages (3.141.0)\n", "Requirement already satisfied: urllib3 in c:\\users\\thomas\\appdata\\local\\continuum\\anaconda3\\lib\\site-packages (from selenium) (1.24.1)\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "pip install --user selenium" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: webdriver-manager in c:\\users\\thomas\\appdata\\local\\continuum\\anaconda3\\lib\\site-packages (3.5.4)\n", "Requirement already satisfied: requests in c:\\users\\thomas\\appdata\\local\\continuum\\anaconda3\\lib\\site-packages (from webdriver-manager) (2.21.0)\n", "Requirement already satisfied: idna<2.9,>=2.5 in c:\\users\\thomas\\appdata\\local\\continuum\\anaconda3\\lib\\site-packages (from requests->webdriver-manager) (2.8)\n", "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\\users\\thomas\\appdata\\local\\continuum\\anaconda3\\lib\\site-packages (from requests->webdriver-manager) (3.0.4)\n", "Requirement already satisfied: certifi>=2017.4.17 in c:\\users\\thomas\\appdata\\local\\continuum\\anaconda3\\lib\\site-packages (from requests->webdriver-manager) (2019.3.9)\n", "Requirement already satisfied: urllib3<1.25,>=1.21.1 in c:\\users\\thomas\\appdata\\local\\continuum\\anaconda3\\lib\\site-packages (from requests->webdriver-manager) (1.24.1)\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "pip install --user webdriver-manager" ] }, { "cell_type": 
"code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from selenium import webdriver\n", "from selenium.webdriver.chrome.options import Options\n", "\n", "from webdriver_manager.chrome import ChromeDriverManager" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Common library to import\n", "from selenium.webdriver.common.by import By\n", "from selenium.webdriver.support.ui import Select" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "\n", "====== WebDriver manager ======\n", "Current google-chrome version is 104.0.5112\n", "Get LATEST chromedriver version for 104.0.5112 google-chrome\n", "Driver [C:\\Users\\thomas\\.wdm\\drivers\\chromedriver\\win32\\104.0.5112.79\\chromedriver.exe] found in cache\n" ] } ], "source": [ "options = Options()\n", "# options.add_argument('-headless')\n", "\n", "browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)\n", "browser.quit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selenium Cheat Sheet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "https://codoid.com/selenium-webdriver-python-cheat-sheet/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are some essential commands to control web browser through Selenium:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "\n", "====== WebDriver manager ======\n", "Current google-chrome version is 104.0.5112\n", "Get LATEST chromedriver version for 104.0.5112 google-chrome\n", "Driver [C:\\Users\\thomas\\.wdm\\drivers\\chromedriver\\win32\\104.0.5112.79\\chromedriver.exe] found in cache\n" ] } ], "source": [ "browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)\n", "browser.maximize_window()\n", "browser.get('https://example.com')\n", 
"browser.find_element(By.CSS_SELECTOR, 'a')\n", "browser.find_elements(By.CSS_SELECTOR, 'a')\n", "browser.quit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Taking screenshot" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "\n", "====== WebDriver manager ======\n", "Current google-chrome version is 104.0.5112\n", "Get LATEST chromedriver version for 104.0.5112 google-chrome\n", "Driver [C:\\Users\\thomas\\.wdm\\drivers\\chromedriver\\win32\\104.0.5112.79\\chromedriver.exe] found in cache\n" ] } ], "source": [ "'''Capture the screenshot of a website via Headless Browser.'''\n", "\n", "from selenium import webdriver\n", "from selenium.webdriver.chrome.options import Options\n", "from webdriver_manager.chrome import ChromeDriverManager\n", "\n", "options = Options()\n", "options.add_argument('-headless')\n", "\n", "browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)\n", "browser.maximize_window()\n", "browser.get('http://macaodaily.com')\n", "browser.save_screenshot('MacaoDaily.png')\n", "browser.quit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example: Fetching stock data from aastock" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to fetch stock quote from aastock.com. If we try to directly access the stock page, the data may not load. We can load any one page from aastock and then simulate inputting the stock number and press enter. By using this automation, we can simulate a normal web browser browsing behavior." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "\n", "====== WebDriver manager ======\n", "Current google-chrome version is 104.0.5112\n", "Get LATEST chromedriver version for 104.0.5112 google-chrome\n", "Driver [C:\\Users\\thomas\\.wdm\\drivers\\chromedriver\\win32\\104.0.5112.79\\chromedriver.exe] found in cache\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "收市價(港元)\n", "(指數|行業)\n", "波幅\n", "121.800 - 123.300\n", "123.000\n" ] } ], "source": [ "'''Fetch the current stock quote from aastocks.com.'''\n", "\n", "from selenium import webdriver\n", "from selenium.webdriver.chrome.options import Options\n", "from selenium.webdriver.common.keys import Keys\n", "from webdriver_manager.chrome import ChromeDriverManager\n", "from selenium.webdriver.common.by import By\n", "import time\n", "\n", "stock_number = '0011'\n", "\n", "options = Options()\n", "# options.add_argument('-headless')\n", "\n", "browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)\n", "browser.maximize_window()\n", "\n", "# Load a page on the site, then type the stock number into the symbol box and press Enter\n", "browser.get('http://www.aastocks.com/tc/stocks/aboutus/companyinfo.aspx')\n", "element = browser.find_element(By.CSS_SELECTOR, '#sb-txtSymbol-aa')\n", "element.send_keys(stock_number)\n", "element.send_keys(Keys.RETURN)\n", "\n", "# Wait a few seconds for the quote page to load\n", "time.sleep(3)\n", "\n", "# Read the quote text from the .lastBox element\n", "element = browser.find_element(By.CSS_SELECTOR, '.lastBox')\n", "print(element.text)\n", "\n", "\n", "browser.quit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example: Fetch DICJ data with Selenium" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We previously used an API to fetch DICJ data. This example shows an alternative way to fetch the same data with Selenium." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "\n", "====== WebDriver manager ======\n", "Current google-chrome version is 104.0.5112\n", "Get LATEST chromedriver version for 104.0.5112 google-chrome\n", "Driver [C:\\Users\\thomas\\.wdm\\drivers\\chromedriver\\win32\\104.0.5112.79\\chromedriver.exe] found in cache\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "2020年及2019年每月幸運博彩毛收入\n", "一月份 22,126 24,942 -11.3% 22,126 24,942 -11.3%\n", "二月份 3,104 25,370 -87.8% 25,229 50,312 -49.9%\n", "三月份 5,257 25,840 -79.7% 30,486 76,152 -60.0%\n", "四月份 754 23,588 -96.8% 31,240 99,739 -68.7%\n", "五月份 1,764 25,952 -93.2% 33,004 125,691 -73.7%\n", "六月份 716 23,812 -97.0% 33,720 149,503 -77.4%\n", "七月份 1,344 24,453 -94.5% 35,064 173,956 -79.8%\n", "八月份 1,330 24,262 -94.5% 36,394 198,218 -81.6%\n", "九月份 2,211 22,079 -90.0% 38,605 220,297 -82.5%\n", "十月份 7,270 26,443 -72.5% 45,875 246,740 -81.4%\n", "十一月份 6,748 22,877 -70.5% 52,623 269,617 -80.5%\n", "十二月份 7,818 22,838 -65.8% 60,441 292,455 -79.3%\n" ] } ], "source": [ "from selenium import webdriver\n", "from selenium.webdriver.chrome.options import Options\n", "from webdriver_manager.chrome import ChromeDriverManager\n", "from selenium.webdriver.common.by import By\n", "import time\n", "\n", "options = Options()\n", "options.add_argument('-headless')\n", "\n", "browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)\n", "\n", "browser.get('http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/2020/index.html')\n", "\n", "# Wait for the report table to render\n", "time.sleep(5)\n", "\n", "element = browser.find_element(By.CSS_SELECTOR, \"#report #table1\")\n", "\n", "# Print the title row, then the monthly rows (rows 1 and 2 are skipped)\n", "rows = element.find_elements(By.CSS_SELECTOR, \"tr\")\n", "print(rows[0].text)\n", "for row in rows[3:]:\n", "    print(row.text)\n", "\n", "# Close the headless browser so it does not keep running in the background\n", "browser.quit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example: Fetch flight prices from Ctrip" ] }, { "cell_type": "markdown", "metadata": 
{}, "source": [ "In this example, we fetch flight prices by querying flights.ctrip.com with 4 parameters: departure date, return date, departure airport, and arrival airport." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from selenium import webdriver\n", "from selenium.webdriver.chrome.options import Options\n", "import time\n", "import datetime\n", "from webdriver_manager.chrome import ChromeDriverManager\n", "from selenium.webdriver.common.by import By" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022-08-31\n", "2022-09-05\n" ] } ], "source": [ "# Depart today and return five days later\n", "today = datetime.date.today()\n", "five_days_later = today + datetime.timedelta(days=5)\n", "\n", "print(today.isoformat())\n", "print(five_days_later.isoformat())\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "\n", "====== WebDriver manager ======\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "https://flights.ctrip.com/international/search/round-hkg-hel?depdate=2022-08-31_2022-09-05&cabin=y_s&adult=1&child=0&infant=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Current google-chrome version is 104.0.5112\n", "Get LATEST chromedriver version for 104.0.5112 google-chrome\n", "Driver [C:\\Users\\thomas\\.wdm\\drivers\\chromedriver\\win32\\104.0.5112.79\\chromedriver.exe] found in cache\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Found 7 results.\n", "HKG\n", "HEL\n", "土耳其航空\n", "¥17846起\n", "国泰航空\n", "¥18400起\n", "汉莎航空\n", "¥18896起\n", "国泰航空\n", "¥18400起\n", "汉莎航空\n", "¥18896起\n", "\n", "\n", "\n", "\n" ] } ], "source": [ "options = Options()\n", "# options.add_argument('-headless')\n", "\n", "# Airport codes: Hong Kong (HKG) to Helsinki (HEL)\n", "from_city = \"hkg\"\n", "to_city = \"hel\"\n", "\n", "# The dates are formatted as YYYY-MM-DD when placed in the URL\n", "url = 
f\"https://flights.ctrip.com/international/search/round-{from_city}-{to_city}?depdate={today}_{five_days_later}&cabin=y_s&adult=1&child=0&infant=0\"\n", "\n", "print(url)\n", "\n", "browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)\n", "browser.maximize_window()\n", "browser.get(url)\n", "\n", "time.sleep(3)\n", "\n", "elements = browser.find_elements(By.CSS_SELECTOR, \".flight-item\")\n", "\n", "print(f\"Found {len(elements)} results.\")\n", "\n", "print(from_city.upper())\n", "print(to_city.upper())\n", "for row in elements:\n", " airline = row.find_element(By.CSS_SELECTOR, \".airline-name\")\n", " print(airline.text)\n", " price = row.find_element(By.CSS_SELECTOR, \".price\")\n", " print(price.text)\n", " \n", " \n", "browser.quit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example: Use MailGun to send result to yourself" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "DOMAIN = None\n", "API_KEY= None\n", "FROM = \"mak@makzan.net\"\n", "TO = [\"mak@makzan.net\"]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "今日有0篇新聞您可能感興趣\n", "\n" ] } ], "source": [ "from bs4 import BeautifulSoup\n", "import requests\n", "import datetime\n", "\n", "def send_simple_message(content, subject=\"Yeah\"):\n", " return requests.post(\n", " f\"https://api.mailgun.net/v3/{DOMAIN}/messages\",\n", " auth=(\"api\", API_KEY),\n", " data={\"from\": FROM,\n", " \"to\": TO,\n", " \"subject\": subject,\n", " \"text\": content})\n", "\n", "# keywords\n", "keywords = [\"創業\", \"科技\"]\n", "\n", "# today\n", "today = datetime.datetime.today()\n", "year = str(today.year).zfill(2)\n", "month = str(today.month).zfill(2)\n", "day = str(today.day).zfill(2)\n", "\n", "res = requests.get(f\"http://www.macaodaily.com/html/{year}-{month}/{day}/node_1.htm\")\n", "\n", "res.encoding = \"utf-8\"\n", "\n", "soup = 
BeautifulSoup(res.text, \"html5lib\")\n", "\n", "results = []\n", "\n", "links = soup.select(\"#all_article_list a\")\n", "for link in links:\n", "    news_title = link.getText()\n", "\n", "    # Record each title once, even if it matches several keywords\n", "    if any(keyword in news_title for keyword in keywords):\n", "        results.append(f\"{year}-{month}-{day}: {news_title}\")\n", "\n", "content = \"\\n\".join(results)\n", "# The subject reads: \"Today there are N news articles you may be interested in\"\n", "subject = f\"今日有{len(results)}篇新聞您可能感興趣\"\n", "# Uncomment to send the email once DOMAIN and API_KEY are set\n", "# send_simple_message(content, subject=subject)\n", "print(subject)\n", "print(content)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }