# Python 101 
## Part VIII.

---

## Web Scraping - Part II.

### Act I: Let's scrape!

But first, import the necessary libraries.

In [None]:
import requests
from bs4 import BeautifulSoup

BASE_URI = './data/'


#### Exercise 1. Collect the articles about Soros from portfolio.hu

This requires using the site's search function.
In the upper-left corner, there is a search icon. Use it, and observe the resulting URL:

`https://portfolio.hu/kereses?q=Soros&df=1999-02-10&dt=2020-11-26&page=1`

It has multiple parts:
- `https://` - protocol
- `portfolio.hu` - base url
- `/kereses` - sub url
- `?q=Soros&df=1999-02-10&dt=2020-11-26&page=1` - query

Let's investigate the query part a little more!  
Every query starts with a __`?`__ character followed by one or more key-value pairs. The key-value pairs are separated by the __`&`__ character. Based on this information, we can extract the query parameters:
- `df` - `date from`
- `dt` - `date to`
- `q` - stands for query (the word we are looking for)
- `page` - page number
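The same breakdown can be reproduced with the standard library's `urllib.parse` module. This is only a sketch to check the reading of the URL above; note that `urlsplit` uses its own names for the parts (`scheme`, `netloc`, `path`) rather than the ones used in this notebook.

```python
from urllib.parse import urlsplit, parse_qs

url = 'https://portfolio.hu/kereses?q=Soros&df=1999-02-10&dt=2020-11-26&page=1'

parts = urlsplit(url)
print(parts.scheme)   # protocol
print(parts.netloc)   # base url
print(parts.path)     # sub url
print(parts.query)    # query string without the leading '?'

# parse_qs splits the query on '&' and '=' and returns a dict of lists,
# because a key may legally appear more than once in a query string.
params = parse_qs(parts.query)
print(params)
```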

Use these values to construct our own request:

In [None]:
base_url = 'https://portfolio.hu'
sub_url = '/kereses'
query = {
    'q': 'Soros',
    'df': '1999-02-09',
    'dt': '2021-10-25',
    'page': 1
}


We can use the requests library to send the query:

In [None]:
resp = requests.get(url=base_url+sub_url, params=query) # some pages require `data` instead of `params`
resp
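To see what kind of URL such a request ends up targeting, the query dictionary can be encoded by hand with the standard library. This is a sketch for illustration, not what `requests` runs internally, although the result should look the same for simple values like these. On a real response, `resp.url` shows the URL that was actually requested.

```python
from urllib.parse import urlencode

base_url = 'https://portfolio.hu'
sub_url = '/kereses'
query = {
    'q': 'Soros',
    'df': '1999-02-09',
    'dt': '2021-10-25',
    'page': 1
}

# urlencode joins the key-value pairs with '&' and converts values to strings.
full_url = base_url + sub_url + '?' + urlencode(query)
print(full_url)
```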


Using the response, extract the URLs of the articles! Be careful: you may find `<article>` tags in unexpected places that you do not want to include.
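One possible approach is to narrow the search to the container that holds the results before looking for `<article>` tags. The sketch below runs on made-up markup: `sample_html` and the `results` id are assumptions for illustration, so inspect the real page to find the right container.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for resp.text; the real page structure may differ.
sample_html = """
<header><article><a href="/promo">Promo box</a></article></header>
<main id="results">
  <article><h3><a href="/gazdasag/cikk-1">First hit</a></h3></article>
  <article><h3><a href="/gazdasag/cikk-2">Second hit</a></h3></article>
</main>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Restrict the search to the (hypothetical) results container,
# so the <article> in the header is ignored.
results = soup.find('main', id='results')
links = [
    a['href']
    for article in results.find_all('article')
    for a in article.find_all('a', href=True)
]
print(links)
```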

You can see that only 20 results showed up. We can customize our query to cover a shorter time span by replacing the __`df`__ and __`dt`__ parameters with a formattable string: __`'{year}-{month:0>2}-{day:0>2}'`__. This string can be formatted by providing the required parameters:
- year
- month
- day

like this:

In [None]:
'{year}-{month:0>2}-{day:0>2}'.format(year=2016, month=1, day=1)
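The `0>2` part of the format spec can be read as: fill with `0`, right-align (`>`), minimum width 2. A short sketch comparing one-digit and two-digit values, plus the equivalent f-string (the variable names are only for illustration):

```python
template = '{year}-{month:0>2}-{day:0>2}'

# One-digit values get padded with a leading zero, two-digit values stay as they are.
single_digit = template.format(year=2016, month=1, day=1)
double_digit = template.format(year=2016, month=12, day=31)

# The same format spec also works inside an f-string.
year, month, day = 2016, 1, 1
with_fstring = f'{year}-{month:0>2}-{day:0>2}'

print(single_digit)
print(double_digit)
print(with_fstring)
```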


__`datetime`__ is a useful module from the standard library. You can use it to generate dates automatically.

In [None]:
import datetime

date = datetime.date(1999, 1, 1)
day_after_date = date + datetime.timedelta(days=1)
day_before_date = date - datetime.timedelta(days=1)
today = datetime.date.today()

print(day_before_date)
print(date)
print(day_after_date)
print(today)

print(today.year, today.month, today.day)


Create a loop which iterates through every day from 1999-01-01 till today and executes the same procedure you created previously. (Pro tip: create a function!) Observe the number of results!
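The date iteration itself could be sketched as below. The helper names `daterange` and `build_query` are made up for this example, and setting `df` and `dt` to the same day is an assumption about how the site interprets the range. The actual request and parsing are left out.

```python
import datetime


def daterange(start, end):
    """Yield every date from start to end, both inclusive."""
    current = start
    while current <= end:
        yield current
        current += datetime.timedelta(days=1)


def build_query(word, day):
    """Build search parameters restricted to a single day."""
    stamp = day.isoformat()  # 'YYYY-MM-DD', the same shape as the format string
    return {'q': word, 'df': stamp, 'dt': stamp, 'page': 1}


start = datetime.date(1999, 1, 1)
end = datetime.date(1999, 1, 5)  # short demo range; use datetime.date.today() for the full run

queries = [build_query('Soros', day) for day in daterange(start, end)]
print(len(queries))
print(queries[0])
```

Each dictionary in `queries` can then be passed as `params` to the request function you wrote earlier.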

---

### Act II: Disguise yourself!

![Disguise](pics/batman-superman-disguise.gif)

Let's pretend to be a browser instead of a script:

In [None]:
USER_AGENTS = [
    # Chrome OS-based laptop using Chrome browser (Chromebook)
    'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
    # Windows 7-based PC using a Chrome browser
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36',
    # Linux-based PC using a Firefox browser
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1',
    # Mac OS X-based computer using a Safari browser
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',
    # Windows 10-based PC using Edge browser
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246',
    # Playstation 4 Browser
    'Mozilla/5.0 (PlayStation 4 3.11) AppleWebKit/537.73 (KHTML, like Gecko)',
]


You can find more user agents [here](https://deviceatlas.com/blog/list-of-user-agent-strings).

Let's write a wrapper function to handle the user-agent string.

In [None]:
import random
def get_header(agents):
    return {'User-agent': random.choice(agents)}


#### Get the main articles from telex.hu
Write a function that extracts the current main articles! For each article, it should return:
- the title
- the article text
- the url
- every picture from the article

In [None]:
url = 'http://telex.hu'
telex_response = requests.get(url, headers=get_header(USER_AGENTS))


---

#### Exercise 2. Check out discounts on isthereanydeal.com!

List the names, prices and discount values for the top 100 games list (based on metacritic scores: http://www.metacritic.com/browse/games/score/metascore/all/pc/filtered )!

Extra questions:  
- How much does it cost to buy every (available) game?
- How much money would I save if I bought them at their lowest price?
- How much money do I save if we compare their current price to their initial price? (Let's assume that every game's initial price was \$60.)
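Once the prices are scraped, the extra questions reduce to a few sums. The sketch below uses made-up records; the field names (`price`, `lowest`) and the use of `None` for unavailable games are assumptions, so adapt them to whatever your scraper returns.

```python
INITIAL_PRICE = 60.0  # assumed initial price from the exercise

# Made-up sample records; real values would come from the scraped pages.
games = [
    {'name': 'Game A', 'price': 9.99, 'lowest': 2.49},
    {'name': 'Game B', 'price': 19.99, 'lowest': 4.99},
    {'name': 'Game C', 'price': None, 'lowest': None},  # not available
]

# Only games with a known price can be bought.
available = [g for g in games if g['price'] is not None]

total_now = sum(g['price'] for g in available)
total_lowest = sum(g['lowest'] for g in available)
saved_vs_lowest = total_now - total_lowest
saved_vs_initial = INITIAL_PRICE * len(available) - total_now

print(f'Buying every available game now: ${total_now:.2f}')
print(f'Saved by buying at the lowest price: ${saved_vs_lowest:.2f}')
print(f'Saved compared to the initial price: ${saved_vs_initial:.2f}')
```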

#### Exercise 3. Functionize!

##### a. Create a function to check a game's price

In [None]:
def check_price(game):
    pass


##### b. Create a function to get the top100 games

In [None]:
def get_top100():
    pass


##### c. Write a function with the same functionality as the 2nd exercise!

In [None]:
def main():
    pass


---

### Intermission: Creating a standalone script

Create a new text file with a `.py` extension; you can choose the filename.
Start it with:  
    `# encoding: utf-8`  
then copy-paste:
- the imports,
- the global variables,
- the three functions,

and insert the following two lines at the end of the file:  
`if __name__ == '__main__':`  
`    main()`  
Save it; now you can execute the script by invoking:  
    __`python your_specified_filename.py`__
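Put together, the file might be laid out like the sketch below. The function bodies are placeholders, not solutions: replace them with your own code from Exercise 3, and add the `requests` and `bs4` imports that your functions need.

```python
# encoding: utf-8

# imports
import random

# global variables
base_url = 'https://portfolio.hu'
sub_url = '/kereses'
query = {'q': 'Soros', 'df': '1999-02-09', 'dt': '2021-10-25', 'page': 1}


# the three functions (placeholder bodies)
def check_price(game):
    return None


def get_top100():
    return []


def main():
    for game in get_top100():
        print(game, check_price(game))


# Runs only when the file is executed directly, not when it is imported.
if __name__ == '__main__':
    main()
```

The `if __name__ == '__main__':` guard is what lets the same file be both run as a script and imported as a module without `main()` being called on import.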


You can even:
- import your newly created script:

In [None]:
import myscript # use your filename


- list its contents

In [None]:
dir(myscript)


- print its variables

In [None]:
print(myscript.base_url, myscript.sub_url, myscript.query)


- use its functions

In [None]:
hl2 = myscript.check_price('half-life 2')
print(hl2)


In [None]:
myscript.get_top100()


---

## Let's do some...

<img align="left" width=150 src="http://www.reactiongifs.com/r/mgc.gif">

### Act III: Cool library of the week: Tkinter
#### Create graphical user interfaces!
All you have to do is:
- Import it

In [None]:
import random
import tkinter


- Create a class:
    - with window layout
    - with function bindings 

In [None]:
class Dice(tkinter.Tk):

    def __init__(self, parent):
        # init main window (parent is the parent window)
        tkinter.Tk.__init__(self, parent)
        self.parent = parent
        self.initialize()

    def initialize(self):
        self.grid()

        # add label
        self.labelVariable = tkinter.StringVar()
        label = tkinter.Label(self, textvariable=self.labelVariable)
        label.grid(column=0, row=0, sticky='EW')
        self.labelVariable.set(0)

        # add button
        button = tkinter.Button(self, text=u"Throw!", command=self.throw)
        button.grid(column=1, row=0)

    def throw(self):
        self.labelVariable.set(random.randint(1, 6))


- Instantiate and use it

In [None]:
app = Dice(None)
app.title('Throw a dice!')
app.mainloop()


An easy-to-follow tutorial can be found <a href="http://sebsauvage.net/python/gui/">here</a>.

---

## Final Act:  It's your turn - write the missing code snippets!

Write a script called `articles.py` in which you create a class called `ArticleTags`.
It has one attribute: `base_url` (telex.hu's base URL).
It has three methods: `init` (the constructor, `__init__`), `get`, and `set`.

Init:
- Arguments: (`self` and) `telex_article_suburl`
- Output: -
- Workflow: set the `self.article` to `telex_article_suburl`

Get:
- Arguments: `self`
- Output: the list of article tags
- Workflow: 
    * get the `self.article` page
    * parse it for the related article tags
    * return them in a list

Set:
- Arguments: (`self` and) `telex_article_suburl`
- Output: -
- Workflow: set the `self.article` to `telex_article_suburl`

Don't forget about the disguise!
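As a starting point, the class could be laid out as in the skeleton below. It assumes that `init` means the constructor `__init__` and that `base_url` is a class attribute; `get` is deliberately left unimplemented, since fetching and parsing the page is the part you are asked to write.

```python
class ArticleTags:
    # Class attribute shared by every instance.
    base_url = 'http://telex.hu'

    def __init__(self, telex_article_suburl):
        self.article = telex_article_suburl

    def set(self, telex_article_suburl):
        self.article = telex_article_suburl

    def get(self):
        # Your code goes here: request base_url + self.article (with a
        # disguised header), parse the tags and return them in a list.
        raise NotImplementedError('fetch and parse the article page here')


tags = ArticleTags('/first-article')   # made-up sub url
tags.set('/second-article')            # made-up sub url
print(tags.base_url + tags.article)
```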

In [None]:
# test the script
import articles


In [None]:
related_tags = articles.ArticleTags('')


In [None]:
for tag in related_tags.get():
    print(tag)
