# Python 101 
## Part IX.

---

## Web Scraping - Part III.

### I. [SelectorGadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)

__SelectorGadget makes it easier to select exactly the content you need from a website, and nothing else.__

1. Click on the SelectorGadget icon to activate it. It is located in the upper right corner.
2. Right after you click it, a bar will appear in the bottom right corner of your Chrome window. You will also notice that elements get framed as you move the cursor over them. Do not panic, this is normal!
![frame](pics/selector_gadget_2_bar.png)
3. You will probably want to get multiple instances of the same type of content (e.g. pictures from the main page of telex.hu). This program will help you select what they have in common.
4. Rules for selection:
 - First, click an instance of the type of content you want. It will turn green, and everything else the program considers similar will turn yellow.
  ![example](pics/selector_gadget_4_example_selector.png) <br></br>
 - Again, the same type of content will also be framed. If there is something you want to exclude (e.g. the telex logo at the top or the tiny weather icon), click on one of them. Starting with the second click, you may exclude anything. The program is smart enough to figure out that if you did not want the telex logo, it is likely that you will want to exclude the weather icon as well. Therefore, it is going to be removed automatically.<br></br>
   ![example](pics/selector_gadget_4_good_state.png)
 - In the bottom right corner, you will see the magic command (`.article_title img`) you should use to select all the content you want. Pass it to `soup.select()` as a string to get a list of matching elements.
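
As a rough sketch of how such a selector is used, the snippet below runs `soup.select()` on a small hand-written HTML fragment. The fragment, its `logo` class and the image paths are made up for illustration; only the `.article_title img` selector comes from the SelectorGadget example above.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a news page; NOT the real telex.hu markup.
SAMPLE_HTML = """
<div class="logo"><img src="/img/logo.png"></div>
<div class="article_title"><a href="/a1"><img src="/img/first.jpg"></a></div>
<div class="article_title"><a href="/a2"><img src="/img/second.jpg"></a></div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "img" alone matches every image, including the logo
all_images = [img.get("src") for img in sample_soup.select("img")]

# ".article_title img" keeps only images nested inside class="article_title"
article_images = [img.get("src") for img in sample_soup.select(".article_title img")]

print(all_images)
print(article_images)
```

The second list leaves out the logo, which is the same effect as the exclusion click in SelectorGadget.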

Like most sites, `telex.hu` updates its code constantly. The screenshot above is a good example of that, as you may have discovered for yourself: today's site has a different style, layout and CSS selectors, so the `img` tag alone is enough to find the images we are looking for.

Let's try it!

In [None]:
import requests
from bs4 import BeautifulSoup


In [None]:
url = "https://telex.hu"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")  # name the parser explicitly


So far, this is business as usual. Let's get the pictures!

In [None]:
image_list = []
for image in soup.select("img"): # select will always return a list
    image_list.append(image.get("src"))
image_list


Ooooor the way cool kids do it. List comprehension:

In [None]:
[image.get("src") for image in soup.select("img")]


![coolkids](https://a.wattpad.com/cover/163492905-352-k572763.jpg)
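
One practical wrinkle: `src` values are often relative paths, and some `img` tags have no `src` at all, in which case `get` returns `None`. Below is a minimal sketch of cleaning such a list with the standard library's `urljoin`. The values in `raw_sources` are made up and stand in for whatever the page actually returns.

```python
from urllib.parse import urljoin

BASE_URL = "https://telex.hu"

# Made-up examples of what image.get("src") might return
raw_sources = [
    "/assets/a.png",             # relative to the site root
    None,                        # <img> tag without a src attribute
    "https://example.com/b.jpg", # already absolute
    "//cdn.example.com/c.jpg",   # protocol-relative
]

# skip missing values and resolve the rest against the page URL
clean_sources = [urljoin(BASE_URL, src) for src in raw_sources if src]
print(clean_sources)
```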

#### Exercise I: Used cars
Search for a specific brand of car in [hasznaltauto.hu](https://www.hasznaltauto.hu) and list the car urls from the __first page__.

In [None]:
USER_AGENTS = [
    # Chrome OS-based laptop using Chrome browser (Chromebook)
    'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
    # Windows 7-based PC using a Chrome browser
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36',
    # Linux-based PC using a Firefox browser
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1',
    # Mac OS X-based computer using a Safari browser
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',
    # Windows 10-based PC using Edge browser
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246',
    # Playstation 4 Browser
    'Mozilla/5.0 (PlayStation 4 3.11) AppleWebKit/537.73 (KHTML, like Gecko)',
]

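import random

def get_header(agents):
    # pick a random User-Agent for each request
    return {'User-agent': random.choice(agents)}

# ingatlan.com listing page (see Exercise II), requested with a random User-Agent
page = requests.get("https://ingatlan.com/lista/elado+lakas+budapest", headers=get_header(USER_AGENTS))
soup = BeautifulSoup(page.content, "html.parser")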


#### Exercise II: Real estate market
Get some pieces of information on the real estate market of Budapest. Check out all the houses on [ingatlan.com](https://ingatlan.com/lista/elado+lakas+budapest) and get the following content for the __first page__.
- Price
- Unit price (displayed in _Ft/m2_)
- Number of rooms
- Area (displayed in _m2_)

Make sure you choose a proper format for storing these variables! Printing them is not enough, save them!
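
One possible storage format, sketched under assumptions: a list of dictionaries, one per listing, which can then be written out as CSV. The key names and the numbers below are invented placeholders, not values scraped from ingatlan.com, and `to_csv` is a hypothetical helper.

```python
import csv
import io

# Placeholder records: the numbers are invented, not scraped from ingatlan.com
listings = [
    {"price_million_huf": 45.9, "unit_price_huf_per_m2": 850000, "rooms": 2, "area_m2": 54},
    {"price_million_huf": 79.0, "unit_price_huf_per_m2": 987500, "rooms": 3, "area_m2": 80},
]


def to_csv(records):
    """Serialise a list of dictionaries to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(listings)
print(csv_text)
```

Storing numbers as numbers (rather than strings such as "54 m2") makes later calculations much easier.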

---

### II. Dynamically generated pages

Dynamically generated pages cannot be scraped by simply downloading them, since the generated content won't be present in the downloaded HTML. For this case there is another library called Selenium. It needs a browser to operate: a browser is started and every operation is executed inside it. We have to download the browser driver first (e.g. from [here](https://sites.google.com/chromium.org/driver/)), then the path to the downloaded executable must be set in order to use it.

In [None]:
%conda install selenium -y


In [None]:
import os
from helpers import get_download_dir, chromedriver_download

#chromedriver_download(version="95.0.4638.54")  # check the latest version number
os.environ['PATH'] = get_download_dir() + os.pathsep + os.environ['PATH']  # os.pathsep is ';' on Windows, ':' elsewhere


In [None]:
os.environ['PATH']
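
The separator between `PATH` entries depends on the operating system, so a portable version is worth sketching. The helper `prepend_to_path` and the folder `/tmp/drivers` are hypothetical and only serve the demonstration; `shutil.which` is a quick way to check whether an executable can be found.

```python
import os
import shutil


def prepend_to_path(directory, current_path):
    """Return a PATH string with `directory` placed first."""
    # os.pathsep is ';' on Windows and ':' on Linux/macOS
    return directory + os.pathsep + current_path


# hypothetical download folder, used only for the demonstration
new_path = prepend_to_path("/tmp/drivers", os.environ.get("PATH", ""))
print(new_path.split(os.pathsep)[0])

# shutil.which returns the full path of an executable found on PATH, or None
print(shutil.which("chromedriver"))
```

If the last line prints `None`, the driver is not visible to Selenium through `PATH` yet.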


In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException


#### a) Simple lookup
- initialize the browser which will be used by the library

In [None]:
driver = webdriver.Chrome()


- request a page

In [None]:
driver.get('https://bit.ly/ShuffleNav')


- find items

In [None]:
help(driver.find_element)  # show the signature: find_element(by, value)


In [None]:
try:
    media = (
        driver
        .find_element(By.CSS_SELECTOR, '.post-container')
        .find_element(By.TAG_NAME, 'img')
        .get_attribute('src')
    )
except NoSuchElementException:
    media = (
        driver
        .find_element(By.CSS_SELECTOR, '.post-container')
        .find_element(By.TAG_NAME, 'video')
        .find_element(By.TAG_NAME, 'source')
        .get_attribute('src')
    )

print(media)


Available locator strategies on the `By` class:
- `find_element(By.ID, "id")`
- `find_element(By.NAME, "name")`
- `find_element(By.XPATH, "xpath")`
- `find_element(By.LINK_TEXT, "link text")`
- `find_element(By.PARTIAL_LINK_TEXT, "partial link text")`
- `find_element(By.TAG_NAME, "tag name")`
- `find_element(By.CLASS_NAME, "class name")`
- `find_element(By.CSS_SELECTOR, "css selector")`

#### CSS selectors
- `tagname`
- `.classname`
- `#id`
- `[attribute=value]`

In [None]:
try:
    media = (driver
             .find_element(By.CSS_SELECTOR, '#individual-post .post-container img')
             .get_attribute('src'))
except NoSuchElementException:
    media = (driver
             .find_element(By.CSS_SELECTOR, '#individual-post .post-container video source')
             .get_attribute('src'))

media


#### b) Interaction with the site
- request the page

In [None]:
driver.get('https://444.hu/kereses')


- find search field

In [None]:
search_field = driver.find_element(By.CSS_SELECTOR, 'input.ember-text-field')


- fill in search query

In [None]:
search_field.send_keys('migráns')


- find submit button and click on it

In [None]:
submit_button = driver.find_element(By.CSS_SELECTOR, 'button[type=submit]')
submit_button.click()


- find related content

In [None]:
urls = []
for article in driver.find_elements(By.TAG_NAME, 'article'):
    urls.append(article.find_element(By.TAG_NAME, 'a').get_attribute('href'))
len(urls)


- solution for infinite scrolldown

In [None]:
import time


def scrolldown():
    lastHeight = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)
        newHeight = driver.execute_script("return document.body.scrollHeight")
        if newHeight == lastHeight:
            break
        lastHeight = newHeight
    return True
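
A variation on the function above, as a sketch rather than a replacement: it takes the driver as an argument and adds an upper bound on the number of scrolls, so that a page which keeps growing cannot trap the loop forever. The name `scroll_to_bottom` and the `pause` and `max_rounds` parameters are assumptions introduced here.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls that loaded new content.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    loaded = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load new items
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new appeared, we reached the end
        last_height = new_height
        loaded += 1
    return loaded
```

It would be called as `scroll_to_bottom(driver)`. A fixed `time.sleep` is the simplest option; on slow connections a longer `pause` may be needed.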


In [None]:
urls = []
button = True
page_counter, page_limit = 0, 10

while button:
    # report progress
    print('.', end='')

    # scroll to the end of the page
    scrolldown()

    # check if we have continue button at the end of the page and click on it
    try:
        # we are searching for a button tag which does not have its type attribute set to `submit`
        button = driver.find_element(By.CSS_SELECTOR, 'button:not([type=submit])')
        button.click()
        page_counter += 1

    # otherwise stop the iteration
    except NoSuchElementException:
        button = False

    # stop the iteration in case we hit the page limit
    if page_counter >= page_limit:
        button = False


# after loading the infinite scroll page,
# find all articles and collect them into the urls list
for article in driver.find_elements(By.TAG_NAME, 'article'):
    urls.append(article.find_element(By.TAG_NAME, 'a').get_attribute('href'))


In [None]:
len(set(urls))


---

### III. Querying APIs

![REST API](pics/RESTAPI.png)
**Figure:** REST API - Author: Seobility - License: [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)

Websites sometimes take a different approach to serving their content: instead of generating and returning a complete page, they send a skeleton of the page together with JavaScript snippets that query the server for content and dynamically populate that skeleton. This is a widespread solution, and many sites use it to provide their content. The server-side interface these snippets talk to is typically a **[REST API](https://en.wikipedia.org/wiki/Representational_state_transfer)**. The architecture has three main components:
- Client (the javascript code running in the webbrowser)
- API (the software running on a server)
- Database (the storage solution)

The client communicates with the API (but it has no direct access to the database itself) through different commands:
- GET: to receive data
- POST: to send (and possibly receive) data
- PUT: to add or replace content
- DELETE: to remove content

Throughout the communication the data is sent in a structured format, generally in [JSON](https://en.wikipedia.org/wiki/JSON) or [XML](https://en.wikipedia.org/wiki/XML). The client-side code is responsible for transforming the received data and populating the site.
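
To make the JSON part concrete, here is a minimal sketch using the standard library's `json` module. The payload is made up in the style an API might return; its field names are hypothetical.

```python
import json

# Made-up payload; the field names are hypothetical
raw_text = '[{"flight": "XY123", "destination": "London", "delayed": false}]'

records = json.loads(raw_text)     # JSON text -> Python objects
print(type(records))               # the JSON array becomes a list
print(records[0]["destination"])   # JSON objects become dictionaries

# and back again: Python objects -> JSON text
print(json.dumps(records[0], indent=2))
```

Note how the JSON literal `false` turns into Python's `False` on the way in.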

The API waits for incoming commands at a so-called [endpoint](https://en.wikipedia.org/wiki/Service-oriented_architecture). Some sites tell you about (expose) their endpoints directly; in this case you are encouraged to use them to gather information. Other sites don't, but that doesn't mean they are not using one. We are going to use this to our advantage.

####  General algorithm to uncover and exploit REST APIs:

__Warning #1:__ Sometimes, the direct usage of APIs is forbidden for commercial purposes. Before you start building a business on it, you might want to read the related terms and conditions of the website. Occasional, non-commercial use usually does not lead to any action against you.

__Warning #2:__ Not every website uses a REST API (or the API is restricted in some way). Therefore, this method will __not__ work in every single case. Sometimes, parsing HTML is just not something you can avoid. However, it is surely worth checking, as you may retrieve the whole dataset without having to parse and clean anything.

__Task__: Say you want to scrape the departing flights for a given day from [Budapest Liszt Ferenc Airport](https://www.bud.hu/indulo_jaratok). You need every detail that is accessible.

1. Open the [website](https://www.bud.hu/indulo_jaratok), right click and choose `Inspect`. On the top bar, switch from the `Elements` tab to `Network`. If nothing is displayed there, refresh the page. This shows you the network traffic that happens under the hood. There are pictures here, JavaScript files and a bunch of scary-looking entries that we will avoid, don't worry. You will want to order the requests by `Type`. In most cases, the `xhr` and `document` types are the ones we care about. If you click on one of the `xhr` entries, this is what should pop up.

![micro0](pics/RESTAPI_1_check_downloads.png)

2. The `Headers` tab shows you the input details of the request that was sent out to retrieve this specific content. If you switch to the `Preview` or the `Response` tab, the result of the request is shown. The former gives you a nicer, rendered look, while the latter returns the raw version.

3. Now, the task is to find the entry that returns the flight data we need. Let's go through the ones with `Type` = `xhr` first and check their `Preview` tabs to find the right one. I think we have a winner here, this looks great:

 ![micro2](pics/RESTAPI_3_find_entry.png)
 
4. Click on the "play button looking" triangle to expand an entry. Okay, this is very cool, we have it.

5. Next, we need to find a way to replicate it so that we can get the data programmatically. If only there was a way to retrieve the input data for this very request. Oh wait! This is what the `Headers` tab is there for, isn't it? It is!

6. Now, the `Headers` tab contains details in a non-Python format (this is not entirely true, but at this point you are not assumed to have the skills needed to transform it manually).

7. We are going to transform it with a third party service: https://curlconverter.com/

8. We need to copy the [curl](https://en.wikipedia.org/wiki/CURL) equivalent of the request by right clicking on it -> `Copy` -> `Copy as cURL`. At this point, the curl command is on the clipboard. Go to https://curlconverter.com/ and paste it into the curl command box. This will generate Python code we can use.

![micro3](pics/RESTAPI_8_copy_command.png)

9. You are all done :) From now on, success depends only on your Python skills.

In [None]:
# This is the code snippet curlconverter.com generated for me
import requests

cookies = {
    '_ga': 'GA1.2.141944687.1700439071',
    '_gid': 'GA1.2.252220149.1700439071',
    '_gat': '1',
    'cookie_bar': 'enabled',
    'XSRF-TOKEN': 'eyJpdiI6Ik9CY2dGMHc4YkhwSmRPdm1kMzFwUGc9PSIsInZhbHVlIjoid200RUhXNXpHZ1hpdU5MdEFiXC80QVdNeG5TYk02dm5sOTMxN01EbDQ4UnJ2ak9Ld0gzdCtWUGp2Y0NQVGJiM1JwNGloXC9hcWZ4dUFPcDk0NkRnYjBlQT09IiwibWFjIjoiNGM5M2QwZTg3ZTAyMGQ5NmUzYjFjNWZjNzNlMGEzNWYxNzg3ZjBmOTBjZTk5OTE0MzFkMDY2OTVmZTI5OWJjMiJ9',
    'budhu_session': 'eyJpdiI6IlBTY0J4ZWdjd1Z0c25SR1wvU053YXZ3PT0iLCJ2YWx1ZSI6InNqbHZJZHk3ckJubjFIZTBJbzFLTFJWN1hhVjI0aXNJZXVMbFwvRlhOa01qTytDTFczTnJua2RSSk1FTWFcL0xoTEJJdkttSXBZekJOdkQxTmo1TkoxQVE9PSIsIm1hYyI6IjU1YWZmYWVmZjRkNDg3ZmY3ZDUyMzJhMWNjODAzZGM4ZDgwMTM5NDMzMmRlOWMyMGU4Y2QxNjIzZmYwYjE3OWEifQ%3D%3D',
}

headers = {
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.9,hu;q=0.8',
    'Connection': 'keep-alive',
    # 'Cookie': '_ga=GA1.2.141944687.1700439071; _gid=GA1.2.252220149.1700439071; _gat=1; cookie_bar=enabled; XSRF-TOKEN=eyJpdiI6Ik9CY2dGMHc4YkhwSmRPdm1kMzFwUGc9PSIsInZhbHVlIjoid200RUhXNXpHZ1hpdU5MdEFiXC80QVdNeG5TYk02dm5sOTMxN01EbDQ4UnJ2ak9Ld0gzdCtWUGp2Y0NQVGJiM1JwNGloXC9hcWZ4dUFPcDk0NkRnYjBlQT09IiwibWFjIjoiNGM5M2QwZTg3ZTAyMGQ5NmUzYjFjNWZjNzNlMGEzNWYxNzg3ZjBmOTBjZTk5OTE0MzFkMDY2OTVmZTI5OWJjMiJ9; budhu_session=eyJpdiI6IlBTY0J4ZWdjd1Z0c25SR1wvU053YXZ3PT0iLCJ2YWx1ZSI6InNqbHZJZHk3ckJubjFIZTBJbzFLTFJWN1hhVjI0aXNJZXVMbFwvRlhOa01qTytDTFczTnJua2RSSk1FTWFcL0xoTEJJdkttSXBZekJOdkQxTmo1TkoxQVE9PSIsIm1hYyI6IjU1YWZmYWVmZjRkNDg3ZmY3ZDUyMzJhMWNjODAzZGM4ZDgwMTM5NDMzMmRlOWMyMGU4Y2QxNjIzZmYwYjE3OWEifQ%3D%3D',
    'Referer': 'https://www.bud.hu/indulo_jaratok',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'sec-ch-ua': '"Google Chrome";v="119", "Chromium";v="119", "Not?A_Brand";v="24"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
}

response = requests.get(
    'https://www.bud.hu/api/ajaxFlights/?mode=list&lang=hun&dir=1&flightdate_custom_from_date=today&flightdate_custom_from_time=01:00',
    cookies=cookies,
    headers=headers,
)


Always check the status code!

In [None]:
response.status_code
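
If a bare number is hard to read, the standard library can translate it. The helper below is a small sketch (the name `describe_status` is made up here) that attaches the official phrase and the broad family of a status code.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '200 OK (success)'."""
    if 200 <= code < 300:
        family = "success"
    elif 300 <= code < 400:
        family = "redirection"
    elif 400 <= code < 500:
        family = "client error"
    elif 500 <= code < 600:
        family = "server error"
    else:
        family = "informational or non-standard"
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # code is not in the standard registry
    return f"{code} {phrase} ({family})"


print(describe_status(200))
print(describe_status(404))
```

With a real response it would be called as `describe_status(response.status_code)`. `requests` also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx codes.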


Remember, 200 is great, it means success. Usually you do not need to include the cookies in the request, so you can try leaving them out. It is up to you.

In [None]:
response = requests.get('https://www.bud.hu/api/ajaxFlights/'
                        '?mode=list'
                        '&lang=hun'
                        '&dir=1'
                        '&flightdate_custom_from_date=today'
                        '&flightdate_custom_from_time=01:00',
                        headers=headers)  # deleted cookies from here
response.status_code
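
The long URL above is really an endpoint plus a handful of parameters. As a sketch, the standard library can split it apart so that the parameters become an ordinary dictionary; the variable names `flights_url`, `endpoint` and `params` are chosen here for illustration.

```python
from urllib.parse import urlsplit, parse_qsl

flights_url = ('https://www.bud.hu/api/ajaxFlights/'
               '?mode=list&lang=hun&dir=1'
               '&flightdate_custom_from_date=today'
               '&flightdate_custom_from_time=01:00')

parts = urlsplit(flights_url)
endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"  # everything before the '?'
params = dict(parse_qsl(parts.query))                      # key=value pairs after the '?'

print(endpoint)
print(params)
```

`requests` accepts such a dictionary through its `params` argument, e.g. `requests.get(endpoint, headers=headers, params=params)`, which is usually easier to modify than a hand-written query string.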


Now, as the response is JSON, we don't need to parse it with `BeautifulSoup`, we can simply convert it to a Python object. If you are not familiar with the JSON format, just think of it as a Python dictionary or a list of dictionaries.

In [None]:
data = response.json()  # interpreting it as JSON
type(data)              # result object is a list this time


As there is no documentation on the format of the data, we need to uncover the pattern ourselves. But relax, it is usually not very hard. First, have a look at the first item of the list.

In [None]:
data[0]  # First item of the list, a dictionary


This will probably be a list of dictionaries, each item containing pieces of information on one specific departing flight. Hurray!!
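
When the format is undocumented, a quick way to get an overview is to count which keys occur in the records and how often. The sketch below does this for a list of dictionaries. The `sample` records are invented, and the real bud.hu field names may well differ.

```python
from collections import Counter


def summarize_keys(records):
    """Count how many records contain each key."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts


# Invented records; the real field names may well differ
sample = [
    {"flight": "XY123", "city": "London", "gate": "B2"},
    {"flight": "XY456", "city": "Rome"},
]
print(summarize_keys(sample))
```

Running `summarize_keys(data)` on the real response shows which fields are always present and which are optional.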

---

## Let's do some...

<img align="left" width=150 src="pics/magic.gif">
<br style="clear:left;"/>

### Cool library of the week: tqdm

#### A progressbar to follow the progress of your computation

- import and try it!

In [None]:
%conda install tqdm -y


In [None]:
import time
from tqdm import tqdm


In [None]:
for i in tqdm(range(100)):
    time.sleep(.01)


- Let's use it with the LOTR example

In [None]:
import requests
from bs4 import BeautifulSoup


In [None]:
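import nltk

# assumes the NLTK tokenizer and stopword data have already been downloaded
url = 'http://ae-lib.org.ua/texts-c/tolkien__the_lord_of_the_rings_{book}__en.htm'
LOTR = requests.get(url.format(book=1)).content
LOTR = BeautifulSoup(LOTR, "html.parser").getText()

# build the stopword set once instead of reloading the list for every token
STOPWORDS = set(nltk.corpus.stopwords.words('english'))

def needed(token):
    not_stopword = token not in STOPWORDS
    not_number = not token.isnumeric()
    long_enough = len(token) > 1 # can be 2 as well
    return not_stopword and not_number and long_enough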


In [None]:
tokens = filter(needed, tqdm(nltk.word_tokenize(LOTR.lower())))

wordcount = nltk.FreqDist(tokens)
wordcount.most_common(25)


----

Back to more...

#### Exercise III: Let's hack the system!
![hackerman](https://wompampsupport.azureedge.net/fetchimage?siteId=7575&v=2&jpgQuality=100&width=700&url=https%3A%2F%2Fi.kym-cdn.com%2Fentries%2Ficons%2Ffacebook%2F000%2F021%2F807%2Fig9OoyenpxqdCQyABmOQBZDI0duHk2QZZmWg2Hxd4ro.jpg) <br></br>
 Change the parameters so that:
 
 - Instead of today, it will return flights from the day before (that is, yesterday). 
 - Instead of departing flights, it will return the arrivals.
 - Instead of showing flights after a given time (`01:00` in the request above), it will return all the flights that day.
 
__Warning #3:__ Note that every website has a different API and hence different parameters. What we are doing here is specific to [bud.hu](https://www.bud.hu/). When scraping another website, you need to explore its parameter space and find out what possibilities you have.

In [None]:
custom_params = {} # FILL this out

response = requests.get('https://www.bud.hu/api/ajaxFlights/', headers=headers, params=custom_params)


In [None]:
# Check your result here


#### Exercise IV:  More flying with Wizz

- Go to the [fare finder](https://wizzair.com/en-gb/flights/fare-finder#/) page of Wizz Air.
- Pick an origin and a destination (make sure they actually operate flights on that route). Budapest/London surely works.
- Get the dates and prices for a given month.
- Start messing with the input parameters to find out their meanings.

__Extra:__ Wrap it in a function!

#### Exercise V: Vote counting

- Go to [this](https://www.valasztas.hu/ogy2018) page which contains data on the 2018 parliamentary elections in Hungary. Wait for the regional map to load, it takes a few seconds.
- Then scrape all the data for a given sub-region (e.g _Veszprém megye 3. számú OEVK (székhely: Tapolca)_).

__Extra:__ Iterate over every single sub-region to collect all the pieces of data for the whole of the country. This way you would get the whole dataset of the election in just a couple of lines of code. Cool, huhh?