# Python 101 
## Part VII.

---

## Web scraping

### Prerequisites

We'll use the __`requests`__ and __`BeautifulSoup`__ libraries for web scraping. Let's install them:
```bash
pip install -U requests beautifulsoup4
```

### 0. Easy file sharing
Start your own web server:
- in a command prompt, change to the notebook directory
- start the server with the `python -m http.server` command (it listens on port 8000 by default)
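
The same server can also be started from Python itself. The block below is a minimal sketch using only the standard library; `serve_directory` is a hypothetical helper name, and `urllib` is used for the check only to keep the sketch free of third-party dependencies.

```python
import threading
import urllib.request
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def serve_directory(directory=".", port=8000):
    """Serve `directory` over HTTP in a background thread and return the server."""
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# port=0 asks the operating system for any free port (handy for a quick check).
server = serve_directory(".", port=0)
port = server.server_address[1]
with urllib.request.urlopen(f"http://127.0.0.1:{port}/") as reply:
    status = reply.status  # status code of the served folder's index or listing
print(status)
server.shutdown()      # stop serve_forever()
server.server_close()  # release the socket
```

Running the server in a daemon thread keeps the notebook usable while the files are being served.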

### 1. Obtain a webpage

The easiest way is to use a third-party library called __`requests`__. Let's import it right away!

In [None]:
import requests

Then we simply ask a server to give us an HTML document by requesting it through a URL.

In [None]:
existing_url = 'http://localhost:8000/data/test.html'
response = requests.get(existing_url)
print(response.status_code) # hopefully 200 -> successful download

In [None]:
not_existing_url = 'http://localhost:8000/test1.html'
response = requests.get(not_existing_url)
print(response.status_code) # unfortunately 404 -> not found

__Common status codes:__
- 200: success
- 301: permanent redirect
- 303: see other (redirect)
- 400: bad request
- 401: unauthorized
- 404: not found
- 500: internal server error
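
The official names of these codes can be looked up with the standard library's `http.HTTPStatus`. This is only a small sketch; `describe` is a hypothetical helper name.

```python
from http import HTTPStatus


def describe(code):
    """Return 'code phrase' for a known HTTP status code, e.g. '404 Not Found'."""
    status = HTTPStatus(code)  # raises ValueError for codes the module does not know
    return f"{status.value} {status.phrase}"


for code in (200, 301, 303, 400, 401, 404, 500):
    print(describe(code))
```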

In [None]:
response = requests.get(existing_url)
print(response.content.decode('utf-8'))
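
The cell above calls `.decode('utf-8')` because `response.content` holds raw bytes, not text (`response.text` is the already decoded counterpart). The sketch below shows the same round trip without any network access; the sample string is made up.

```python
text = "Szia, világ!"           # a str with one non-ASCII character
raw = text.encode("utf-8")      # bytes, comparable to what response.content holds
print(type(raw), len(raw))      # the byte count exceeds the character count
decoded = raw.decode("utf-8")   # back to str
print(decoded == text)
```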

Jupyter can render the page if it was successfully downloaded.

In [None]:
from IPython.display import HTML
if response.status_code == 200:
    result = HTML(response.content.decode('utf-8'))
else:
    result = 'Nah, let\'s have a beer instead!'

In [None]:
result

### 2. Process HTML

#### Story time: The skeleton of an HTML document

__HTML__ is a markup language; its basic building blocks are the `<tag>`s.<br>
(Almost) every `<tag>` has two parts:

- Opening `<tag>` 
- Closing `</tag>` 

Important HTML `<tag>`s:

- `<html></html>`
- `<head></head>`
- `<body></body>`
- `<h1></h1>`, ..., `<h6></h6>`
- `<div></div>`
- `<p></p>`
- `<span></span>`
- `<section></section>`
- `<a href=""></a>`
- `<img src="">`
- `<br>`
- ```
  <table>
    <thead>
        <tr>
            <th></th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td></td>
            ...
        </tr>
    </tbody>
  </table>
  ```
- `<ul></ul>` / `<ol></ol>` + `<li></li>`
    
Tags can have different attributes:
- `<a>`: `href`
- `<img>`: `src`
- `id`
- `class`
- custom attributes: any name that is not a predefined HTML attribute
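
To see opening tags, closing tags and attributes in action, here is a minimal sketch built on the standard library's `html.parser` (the same parser BeautifulSoup is asked to use below). The `TagLogger` class and the HTML snippet are made up for this illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every opening and closing tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("open", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("close", tag, {}))


# Hypothetical snippet, not taken from the demo page.
snippet = (
    '<p id="intro" class="important" data-note="demo">'
    'Hi <a href="page.html">link</a><br></p>'
)
logger = TagLogger()
logger.feed(snippet)
logger.close()
for event in logger.events:
    print(event)
```

Note that `<br>` shows up only as an opening event: it is one of the tags without a closing part.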
    

#### Let's parse it!

There is a third-party module for this purpose as well: __`BeautifulSoup`__.  
Let's import it!

In [None]:
from bs4 import BeautifulSoup

Then create a soup from the downloaded document.

In [None]:
document = response.content
soup = BeautifulSoup(document, 'html.parser')

In [None]:
print(soup.prettify())

With the created soup (which is a parsed document) we can easily access any part of the document.  
Let's try to:
- get the title of the document

In [None]:
print(soup.title)
print(type(soup.title))

- get the title text

In [None]:
print(soup.title.get_text())
print(type(soup.title.get_text()))

- get the text-only version of the page

In [None]:
print(soup.get_text())

- get all the links from the document

In [None]:
soup.find_all('a')

- get the actual urls from the tags

In [None]:
for url in soup.find_all('a'):
    print(url.get('href'))

During scraping, many different tasks must be solved to get the data we need.  
This demo document has important and unimportant parts, and we only need the important ones.

#### a) Let's find the important links!

In [None]:
important_urls = []
for url in soup.find_all('a'):
    href = url.get('href')  # None when the <a> tag has no href attribute
    if href and 'important_part' in href:
        important_urls.append(href)
print(important_urls)

#### b) Let's find the important text in the document
- select every paragraph which has "important" class

In [None]:
soup.find_all('p', {'class': 'important'})
# or:
soup.find_all('p', class_='important')

- Whooops, something's going on! Let's investigate!

In [None]:
important_paragraphs = soup.find_all('p', class_='important')

- print the text of each tag and the id attribute of its parent

In [None]:
for p in important_paragraphs:
    print(p.get_text(), '>', p.parent.get('id'))

- We can see that the "fake" result comes from somewhere else

In [None]:
print(soup.find(id='not_main_section'))

- We have a hidden fake section! Let's modify our search!

In [None]:
soup.find(id='main_content').find_all('p', class_='important')
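
The effect of narrowing the search can be reproduced without the demo page. The markup below is hypothetical and only mimics the situation above; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the situation above.
html = """
<section id="main_content">
  <p class="important">Real message</p>
  <p>Filler text</p>
</section>
<section id="not_main_section" style="display:none">
  <p class="important">Fake message</p>
</section>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# Searching the whole document also finds the paragraph in the hidden section.
everywhere = [p.get_text() for p in soup_demo.find_all("p", class_="important")]
# Searching inside one element only looks at that element's descendants.
scoped = [
    p.get_text()
    for p in soup_demo.find(id="main_content").find_all("p", class_="important")
]
print(everywhere)
print(scoped)
```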

#### c) Let's find the pictures of our interest
- Let's have the "nice" pictures from the div with random_images_1 class!

In [None]:
(
    soup
    .find(id='main_content')
    .find('div', class_='random_images_1')
    .find_all('img', class_='nice')
)

- Whoops again. Filter out the result we don't like.

In [None]:
imgs = (
    soup
    .find(id='main_content')
    .find('div', class_='random_images_1')
    .find_all('img', class_='nice')
)
nice_imgs = []
for img in imgs:
    if 'not' not in img.get('class'):
        nice_imgs.append(img.get('src'))
print(nice_imgs)


We have one more tool that simplifies situations like this: the `select` method, which accepts CSS selectors.

In [None]:
(
    soup
    .find(id='main_content')
    .find('div', class_='random_images_1')
    .select('img[class="nice"]')
)
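
Why do the two approaches differ? `class_='nice'` matches any tag that has `nice` among its classes, while the selector `img[class="nice"]` compares the whole attribute value. The sketch below shows this on hypothetical markup; the file names and classes are invented for the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; file names and classes are made up.
html = """
<div class="random_images_1">
  <img class="nice" src="a.png">
  <img class="not nice" src="b.png">
  <img class="ugly" src="c.png">
</div>
"""
gallery = BeautifulSoup(html, "html.parser")

# Matches every <img> that has "nice" among its classes.
by_class = [img["src"] for img in gallery.find_all("img", class_="nice")]
# Matches only when the class attribute is exactly "nice".
exact = [img["src"] for img in gallery.select('img[class="nice"]')]
# Has the class "nice" but lacks the class "not".
css_class = [img["src"] for img in gallery.select("img.nice:not(.not)")]
print(by_class, exact, css_class)
```

The last selector keeps working even if a "nice" image later gets additional classes, which the exact attribute comparison would not tolerate.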

Most important methods:
- `.find(tag, id, class_, attrs)`
- `.find_all(tag, id, class_, attrs)`
- `.select(selector expression string)`
- `.get(attribute)`
- `.get_text()`

#### Exercise:
- Find the text of every **visible** headline (`h1`...`h6`) and subtitle

---

## Let's do some...

<img align="left" width=150 src="pics/magic.gif">
<br style="clear:left;"/>

### Cool library of the week, part I: gTTS
#### Create your own audiobook

- install gTTS with:
    ```bash
    pip install gtts
    ```

- make it talk

In [None]:
from gtts import gTTS
en_hello = gTTS('Hello!', lang='en')
hu_hello = gTTS('Szia!', lang='hu')

en_hello.save('./data/en_hello.mp3')
hu_hello.save('./data/hu_hello.mp3')

- play it within the notebook

In [None]:
import IPython

IPython.display.Audio("./data/en_hello.mp3")

In [None]:
IPython.display.Audio("./data/hu_hello.mp3")


### Cool library of the week, part II: NLTK
#### Analyze texts in a few lines

- install it with:
    ```bash
    pip install nltk
    ```
- download required assets

In [None]:
import nltk
nltk.download(['punkt', 'punkt_tab', 'stopwords'])  # recent NLTK releases load the tokenizer data from 'punkt_tab'

- download and extract the first LOTR book

In [None]:
import requests

base_url = 'https://github.com/ganesh-k13/shell/raw/refs/heads/master/test_search/www.glozman.com/TextPages/{book}'
book_names = {
    1: '01%20-%20The%20Fellowship%20Of%20The%20Ring.txt',
    2: '02%20-%20The%20Two%20Towers.txt',
    3: '03%20-%20The%20Return%20Of%20The%20King.txt',
} 
LOTR = requests.get(base_url.format(book=book_names[1])).text

- write a stopword and punctuation filter

In [None]:
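# Build the stopword set once; the original lookup reloaded the list for every token.
STOPWORDS = set(nltk.corpus.stopwords.words('english'))

def needed(token):
    not_stopword = token not in STOPWORDS
    not_number = not token.isnumeric()
    long_enough = len(token) > 1 # can be 2 as well; also drops single punctuation marks
    return not_stopword and not_number and long_enough

list(filter(needed, 'I am the number 1 Elephant in the world'.split()))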

- tokenize words
- filter out stopwords and punctuation

In [None]:
tokens = nltk.word_tokenize(LOTR.lower())
tokens = filter(needed, tokens)

- compute word frequencies
- show the top 25 words

In [None]:
wordcount = nltk.FreqDist(tokens)
wordcount.most_common(25)
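
`nltk.FreqDist` builds on Python's `collections.Counter`, so the counting step can be tried out without NLTK. The tokens below are hypothetical stand-ins for the filtered book tokens. The second half illustrates a pitfall of the cells above: `filter` returns a lazy iterator that can be consumed only once.

```python
from collections import Counter

# Hypothetical tokens standing in for the filtered book tokens.
sample_tokens = ["frodo", "ring", "sam", "frodo", "ring", "frodo"]
counts = Counter(sample_tokens)
print(counts.most_common(2))

# A filter object is exhausted after the first pass over it.
lazy = filter(lambda token: len(token) > 3, sample_tokens)
first = Counter(lazy)    # consumes the iterator
second = Counter(lazy)   # nothing left to count
print(len(first), len(second))
```

If the tokens are needed more than once, wrapping the filter in `list(...)` keeps them around.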

#### Let's play a little!  
Check how the top 25 words change through the trilogy!

In [None]:
wordcounts = []
for book in range(1, 4):
    print('Processing book {}'.format(book), end='')
    LOTR = requests.get(base_url.format(book=book_names[book])).text
    print('.', end='')
    tokens = filter(needed, nltk.word_tokenize(LOTR.lower()))
    print('.', end=' ')
    wordcounts.append(nltk.FreqDist(tokens).most_common(25))
    print('Done.')

In [None]:
wordcounts[0]

In [None]:
wordcounts[1]

In [None]:
wordcounts[2]
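
Comparing three long lists by eye is tedious, so set operations can help. The sketch below uses invented, shortened data in the same shape as `wordcounts` (one list of `(word, count)` pairs per book); the words and numbers are placeholders, not real results.

```python
# Hypothetical data in the shape of `wordcounts`; the counts are invented.
sample_wordcounts = [
    [("frodo", 1100), ("sam", 500), ("gandalf", 450)],
    [("frodo", 400), ("sam", 420), ("aragorn", 300)],
    [("frodo", 380), ("sam", 410), ("gandalf", 250)],
]

# One set of top words per book.
top_words = [{word for word, _ in book} for book in sample_wordcounts]
# Words that appear in the top list of every book.
in_every_book = set.intersection(*top_words)
# Words that appear in the top list of exactly one book.
only_in_one = {
    word
    for i, words in enumerate(top_words)
    for word in words
    if all(word not in other for j, other in enumerate(top_words) if j != i)
}
print(sorted(in_every_book))
print(sorted(only_in_one))
```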

## It's your turn - write the missing code snippets!

#### 1. Save every important link to a file from the example page

In [None]:
BASE_URI = './data/'
filename = 'important_urls.txt'
# your code goes here


#### 2. Let's get a random post from bash.hu!
- get the page from http://bash.hu/random
- posts are contained in __`div`__ tags with __`qtxt`__ class
- print the text

In [None]:
URI = "http://bash.hu/random"
# your code goes here


#### 3. Put the previous code into a function with two arguments: number of posts, and output filename

In [None]:
def i_want_fun(output, times=5):
    pass # your code goes here


In [None]:
i_want_fun(BASE_URI+'fun.txt')

#### 4. Create a class from the previous function. 
The class should store all of the post texts.  
The class should have the following methods:
 - called `crawl` which crawls one random bash.hu post
 - called `crawl_multiple` which crawls a number (given as argument) of bash.hu posts
 - called `show_posts` which prints out the crawled posts
 - called `export` which saves the posts into a file (filename is given as argument)
 - called `reset` which empties the posts

I already created the class' skeleton for you. Write your code in place of the `pass` statements.

In [None]:
class IWantFun(object):

    URI = "http://bash.hu/random"

    def __init__(self):
        pass

    def crawl(self):
        pass

    def crawl_multiple(self, times=5):
        pass

    def show_posts(self):
        pass

    def export(self, output):
        pass

    def reset(self):
        pass


In [None]:
nine = IWantFun()
nine.crawl()
nine.show_posts()
nine.crawl_multiple(5)
nine.show_posts()
nine.export(BASE_URI + 'fun.txt')
nine.reset()
nine.show_posts()
