# Lesson 8—Web Automation with Selenium





Version 1.1. Prepared by [Makzan](https://makzan.net). Updated January 2021.

This series uses three lectures to cover fetching data online. The topics are:

- Finding patterns in URLs
- Opening web URLs
- Downloading files in Python
- Fetching data with APIs
- Web scraping with Requests and BeautifulSoup
- **Web automation with Selenium**
- **Converting Wikipedia tabular data into CSV**

We use Selenium when:
- Requests and BeautifulSoup do not work.
- The page requires JavaScript to render the data.

Pros:
- It launches and automates a real browser.
- Better compatibility, because pages load the same way they do for a human visitor.

Cons:
- It is slow, because it launches a real browser.


## Downloading browser driver

Selenium needs a browser driver to control the web browser. Drivers are available here:

- [Gecko Driver for Firefox](https://github.com/mozilla/geckodriver/releases)
- [Chrome Driver](https://chromedriver.chromium.org/)

In [1]:
pip install --user selenium

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install --user webdriver-manager

Note: you may need to restart the kernel to use updated packages.


In [3]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

from webdriver_manager.chrome import ChromeDriverManager

In [4]:
# Commonly used imports
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

In [5]:
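from selenium.webdriver.chrome.service import Service

options = Options()
# options.add_argument('--headless')  # uncomment to run without a visible window

# Selenium 4 takes the driver path through a Service object.
browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)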
browser.quit()



Current google-chrome version is 104.0.5112
Get LATEST chromedriver version for 104.0.5112 google-chrome
Driver [C:\Users\thomas\.wdm\drivers\chromedriver\win32\104.0.5112.79\chromedriver.exe] found in cache


## Selenium Cheat Sheet

https://codoid.com/selenium-webdriver-python-cheat-sheet/

Here are some essential commands for controlling a web browser through Selenium:

In [6]:
from selenium.webdriver.chrome.service import Service
browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
browser.maximize_window()
browser.get('https://example.com')
browser.find_element(By.CSS_SELECTOR, 'a')
browser.find_elements(By.CSS_SELECTOR, 'a')
browser.quit()



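The elements returned by `find_elements` can be read like ordinary Python objects. As a minimal sketch, the hypothetical helper `describe_links` below (not part of Selenium) collects the visible text and the `href` of each link. It relies only on the `text` property and the `get_attribute` method of Selenium's `WebElement`. The `FakeLink` class and its sample values are made up so that the example runs without a browser.

```python
def describe_links(elements):
    """Return (text, href) pairs for a list of link elements."""
    links = []
    for element in elements:
        # .text is the visible text; get_attribute reads an HTML attribute.
        links.append((element.text.strip(), element.get_attribute('href')))
    return links


class FakeLink:
    """Stand-in for a Selenium WebElement, used only for this demonstration."""

    def __init__(self, text, href):
        self.text = text
        self._href = href

    def get_attribute(self, name):
        return self._href if name == 'href' else None


sample = [FakeLink(' Home ', 'https://example.com/')]
print(describe_links(sample))
```

With a live browser, the call would be `describe_links(browser.find_elements(By.CSS_SELECTOR, 'a'))`, made before `browser.quit()`.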


## Taking a screenshot

In [7]:
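'''Capture a screenshot of a website using a headless browser.'''

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument('--headless')  # run Chrome without a visible window

# Selenium 4 takes the driver path through a Service object.
browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)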
browser.maximize_window()
browser.get('http://macaodaily.com')
browser.save_screenshot('MacaoDaily.png')
browser.quit()





## Example: Fetching stock data from aastocks.com

Let's fetch a stock quote from aastocks.com. If we open the stock page directly, the data may not load. Instead, we load another page on the site, type the stock number into the search box, and press Enter. This mimics how a person browses the site in a normal web browser.

In [8]:
'''Fetch the current stock quote from aastocks.com.'''

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
import time

stock_number = '0011'

options = Options()
# options.add_argument('--headless')

# Selenium 4 takes the driver path through a Service object.
browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
browser.maximize_window()

browser.get('http://www.aastocks.com/tc/stocks/aboutus/companyinfo.aspx')
element = browser.find_element(By.CSS_SELECTOR, '#sb-txtSymbol-aa')
element.send_keys(stock_number)
element.send_keys(Keys.RETURN)

time.sleep(3)

element = browser.find_element(By.CSS_SELECTOR, '.lastBox')
print(element.text)


browser.quit()





收市價(港元)
(指數|行業)
波幅
121.800 - 123.300
123.000


## Example: Fetch DICJ data with Selenium

We previously used an API to fetch DICJ data. This example shows an alternative way to fetch the same data with Selenium.

In [9]:
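from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import time

options = Options()
options.add_argument('--headless')  # run Chrome without a visible window

# Selenium 4 takes the driver path through a Service object.
browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

browser.get('http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/2020/index.html')

# Give the page time to render the table with JavaScript.
time.sleep(5)

element = browser.find_element(By.CSS_SELECTOR, "#report #table1")

rows = element.find_elements(By.CSS_SELECTOR, "tr")
print(rows[0].text)  # table title
for row in rows[3:]:  # skip the header rows
    print(row.text)

# Close the headless browser so it does not keep running in the background.
browser.quit()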






2020年及2019年每月幸運博彩毛收入
一月份 22,126 24,942 -11.3% 22,126 24,942 -11.3%
二月份 3,104 25,370 -87.8% 25,229 50,312 -49.9%
三月份 5,257 25,840 -79.7% 30,486 76,152 -60.0%
四月份 754 23,588 -96.8% 31,240 99,739 -68.7%
五月份 1,764 25,952 -93.2% 33,004 125,691 -73.7%
六月份 716 23,812 -97.0% 33,720 149,503 -77.4%
七月份 1,344 24,453 -94.5% 35,064 173,956 -79.8%
八月份 1,330 24,262 -94.5% 36,394 198,218 -81.6%
九月份 2,211 22,079 -90.0% 38,605 220,297 -82.5%
十月份 7,270 26,443 -72.5% 45,875 246,740 -81.4%
十一月份 6,748 22,877 -70.5% 52,623 269,617 -80.5%
十二月份 7,818 22,838 -65.8% 60,441 292,455 -79.3%
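`row.text` returns each table row as a single space-separated string. To work with the figures, the text has to be converted to numbers. A minimal sketch, assuming every data row has the shape seen in the output above (a month label followed by numeric cells); `parse_row` is a hypothetical helper name.

```python
def parse_row(text):
    """Split one table row into its label and a list of numbers."""
    label, *cells = text.split()
    numbers = []
    for cell in cells:
        # Drop thousands separators and percent signs before converting.
        numbers.append(float(cell.replace(',', '').rstrip('%')))
    return label, numbers


print(parse_row('一月份 22,126 24,942 -11.3% 22,126 24,942 -11.3%'))
# ('一月份', [22126.0, 24942.0, -11.3, 22126.0, 24942.0, -11.3])
```

The parsed rows could then be written out with the standard `csv` module.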


## Example: Fetch flight prices from Ctrip

In this example, we query flights.ctrip.com for round-trip flights using four parameters: departure date, return date, departure airport, and arrival airport.
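The four parameters all end up in the search URL. As a sketch, the hypothetical helper `build_search_url` below assembles the same URL pattern that this example uses. The remaining query parameters (`cabin`, `adult`, `child`, `infant`) are copied unchanged from that URL, and their meaning is only assumed from their names.

```python
import datetime


def build_search_url(from_city, to_city, depart, back):
    """Build a Ctrip round-trip search URL from airport codes and dates."""
    return (
        'https://flights.ctrip.com/international/search/'
        f'round-{from_city}-{to_city}'
        f'?depdate={depart.isoformat()}_{back.isoformat()}'
        '&cabin=y_s&adult=1&child=0&infant=0'
    )


depart = datetime.date(2022, 8, 31)
back = depart + datetime.timedelta(days=5)
print(build_search_url('hkg', 'hel', depart, back))
```

Keeping the URL construction in one function makes it easy to loop over several city pairs or dates.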

In [10]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
import datetime
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

In [11]:
today = datetime.date.today()
five_days_later = today + datetime.timedelta(days=5)

print(today.isoformat())
print(five_days_later.isoformat())


2022-08-31
2022-09-05


In [12]:
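from selenium.webdriver.chrome.service import Service

options = Options()
# options.add_argument('--headless')

from_city = "hkg"
to_city = "hel"

# The date objects are formatted as YYYY-MM-DD inside the f-string.
url = f"https://flights.ctrip.com/international/search/round-{from_city}-{to_city}?depdate={today}_{five_days_later}&cabin=y_s&adult=1&child=0&infant=0"

print(url)

# Selenium 4 takes the driver path through a Service object.
browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)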
browser.maximize_window()
browser.get(url)

time.sleep(3)

elements = browser.find_elements(By.CSS_SELECTOR, ".flight-item")

print(f"Found {len(elements)} results.")

print(from_city.upper())
print(to_city.upper())
for row in elements:
    airline = row.find_element(By.CSS_SELECTOR, ".airline-name")
    print(airline.text)
    price = row.find_element(By.CSS_SELECTOR, ".price")
    print(price.text)
 
 
browser.quit()





https://flights.ctrip.com/international/search/round-hkg-hel?depdate=2022-08-31_2022-09-05&cabin=y_s&adult=1&child=0&infant=0




Found 7 results.
HKG
HEL
土耳其航空
¥17846起
国泰航空
¥18400起
汉莎航空
¥18896起
国泰航空
¥18400起
汉莎航空
¥18896起
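The prices come back as text such as `¥17846起`, where 起 means "from". To compare fares, they need to be numbers. A minimal sketch using the standard `re` module; `parse_price` and `cheapest` are hypothetical names, and the sample pairs are taken from the output above.

```python
import re


def parse_price(text):
    """Extract the first run of digits from a price label as an int."""
    match = re.search(r'\d+', text.replace(',', ''))
    if match is None:
        raise ValueError(f'no digits found in {text!r}')
    return int(match.group())


def cheapest(fares):
    """Return the (airline, price_text) pair with the lowest price."""
    return min(fares, key=lambda fare: parse_price(fare[1]))


fares = [('土耳其航空', '¥17846起'), ('国泰航空', '¥18400起'), ('汉莎航空', '¥18896起')]
print(cheapest(fares))
# ('土耳其航空', '¥17846起')
```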






## Example: Use Mailgun to send the results to yourself

In [13]:
# Fill in your own Mailgun domain and API key before sending.
DOMAIN = None
API_KEY = None
FROM = "mak@makzan.net"
TO = ["mak@makzan.net"]

In [14]:
from bs4 import BeautifulSoup
import requests
import datetime

def send_simple_message(content, subject="Yeah"):
    return requests.post(
        f"https://api.mailgun.net/v3/{DOMAIN}/messages",
        auth=("api", API_KEY),
        data={"from": FROM,
              "to": TO,
              "subject": subject,
              "text": content})

# Keywords to look for in the news titles
keywords = ["創業", "科技"]

# Today's date; month and day are zero-padded to match the site's URL pattern
today = datetime.datetime.today()
year = str(today.year)
month = str(today.month).zfill(2)
day = str(today.day).zfill(2)

res = requests.get(f"http://www.macaodaily.com/html/{year}-{month}/{day}/node_1.htm")

res.encoding = "utf-8"

soup = BeautifulSoup(res.text, "html5lib")

results = []

links = soup.select("#all_article_list a")
for link in links:
    news_title = link.getText()

    for keyword in keywords:
        if keyword in news_title:
            results.append(f"{year}-{month}-{day}: {news_title}")

content = "\n".join(results)
subject = f"今日有{len(results)}篇新聞您可能感興趣"
# send_simple_message(content, subject=subject)
print(subject)
print(content)

今日有0篇新聞您可能感興趣
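In the loop above, a title that contains several keywords is appended once per keyword. If each matching title should appear only once, the check can be written with `any`. A sketch, with `filter_titles` as a hypothetical helper name and made-up sample titles:

```python
def filter_titles(titles, keywords):
    """Keep each title that contains at least one keyword (one entry per title)."""
    return [title for title in titles if any(k in title for k in keywords)]


sample_titles = ['科技創業比賽接受報名', '天氣預報', '青年創業講座']
print(filter_titles(sample_titles, ['創業', '科技']))
# ['科技創業比賽接受報名', '青年創業講座']
```

The first sample title contains both keywords but is listed once.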

