{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python 101 \n", "## Part VII.\n", "\n", "---\n", "\n", "## Web scraping\n", "\n", "### Prerequisites\n", "\n", "We'll use the __`requests`__ and the __`BeautfulSoup`__ libraries for web scraping, let's install them:\n", "```bash\n", "pip install -U requests beautifulsoup4\n", "```\n", "\n", "### 0. Easy file sharing\n", "Start your own web-server:\n", "- in command prompt change your directory to the notebook directory\n", "- start the server with the `python -m http.server` command\n", "\n", "### 1. Obtain a webpage\n", "\n", "The easiest way is to use a third party library called __`requests`__. Let's import it right away!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And then we simply ask a server to give us an html document by requesting it through an url." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "existing_url = 'http://localhost:8000/data/test.html'\n", "response = requests.get(existing_url)\n", "print(response.status_code) # hopefully 200 -> successful download" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "not_existing_url = 'http://localhost:8000/test1.html'\n", "response = requests.get(not_existing_url)\n", "print(response.status_code) # unfortunately 404 -> not exists" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Common status codes:__\n", "- 200: success\n", "- 301: permanent redirect\n", "- 303: redirect\n", "- 400: bad request\n", "- 401: unauthorized\n", "- 404: not exists\n", "- 500: internal server error" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = requests.get(existing_url)\n", "print(response.content.decode('utf-8'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Jupyter can render the page if it was successfully downloaded." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import HTML\n", "if response.status_code == 200:\n", " result = HTML(response.content.decode('utf-8'))\n", "else:\n", " result = 'Nah, let\\'s have a beer instead!'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Process HTML\n", "\n", "#### Story time: The skeleton of a html document\n", "\n", "__HTML__ is a markup language, its basic build blocks are the ``s.
\n", "(Almost) every `` has two parts:\n", "\n", "- Opening `` \n", "- Closing `` \n", "\n", "Important html ``s:\n", "\n", "- ``\n", "- ``\n", "- ``\n", "- `

`, ..., `
`\n", "- `
`\n", "- `

`\n", "- ``\n", "- `
`\n", "- ``\n", "- ``\n", "- `
`\n", "- ```\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " ...\n", " \n", " \n", "
\n", " ```\n", "- `
    ` / `
      ` + `
    1. `\n", " \n", "Tags can have different attributes:\n", "- ``: href\n", "- ``: src\n", "- id\n", "- class\n", "- anything that is not a html keyword\n", " \n", "\n", "#### Let's parse it!\n", "\n", "We have a third party module for this purpose as well, the __`BeautifulSoup`__. \n", "Let's import it!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then create a soup from the downloaded document." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "document = response.content\n", "soup = BeautifulSoup(document, 'html.parser')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(soup.prettify())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the created soup (which is a parsed document) we can easily access any part of the document. \n", "Let's try to:\n", "- get the title of the document" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(soup.title)\n", "print(type(soup.title))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- get the title text" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(soup.title.get_text())\n", "print(type(soup.title.get_text()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- get the text-only version of the page" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(soup.get_text())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- get all the links from the document" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "soup.find_all('a')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- get the actual urls from the tags" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for url in soup.find_all('a'):\n", " print(url.get('href'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "During scraping, there are a lot of different tasks that must be solve in order to get the data we need. \n", "In this case this demo document has important and unimportant parts. We only need the important parts. \n", "#### a) Let's find the important links!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "important_urls = []\n", "for url in soup.find_all('a'):\n", " if 'important_part' in url.get('href'):\n", " important_urls.append(url.get('href'))\n", "print(important_urls)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### b) Let's find the important text in the document\n", "- select every paragraph which has \"important\" class" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "soup.find_all('p', {'class': 'important'})\n", "# or:\n", "soup.find_all('p', class_='important')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Whooops, something's going on! Let's investigate!" 
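, { "cell_type": "markdown", "metadata": {}, "source": [ "The same filter can also be written as a list comprehension, and relative links can be resolved against the page URL with the standard library's `urllib.parse.urljoin`. A small sketch (the `absolute_urls` name is our own):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from urllib.parse import urljoin\n", "\n", "# Same filter as above, as a list comprehension;\n", "# href can be None, hence the extra guard.\n", "absolute_urls = [\n", "    urljoin(existing_url, a.get('href'))\n", "    for a in soup.find_all('a')\n", "    if a.get('href') and 'important_part' in a.get('href')\n", "]\n", "print(absolute_urls)" ] }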
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "important_paragraphs = soup.find_all('p', class_='important')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- print the text in the tags, and tags' parent's id attribute" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for p in important_paragraphs:\n", " print(p.get_text(), '>', p.parent.get('id'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- We can see, that the \"fake\" result is from somewhere else" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(soup.find(id='not_main_section'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- We have a hidden fake section! Let's modify our search!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "soup.find(id='main_content').find_all('p', class_='important')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### c) Let's find the pictures of our interest\n", "- Let's have the \"nice\" pictures from the div with random_images_1 class!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " soup\n", " .find(id='main_content')\n", " .find('div', class_='random_images_1')\n", " .find_all('img', class_='nice')\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Whoops again. Filter out the result we don't like." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "imgs = (\n", " soup\n", " .find(id='main_content')\n", " .find('div', class_='random_images_1')\n", " .find_all('img', class_='nice')\n", ")\n", "nice_imgs = []\n", "for img in imgs:\n", " if 'not' not in img.get('class'):\n", " nice_imgs.append(img.get('src'))\n", "print(nice_imgs)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have one more tool we can use to simplify this situations: the `select` function, which allows the usage of the CSS selectors." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " soup\n", " .find(id='main_content')\n", " .find('div', class_='random_images_1')\n", " .select('img[class=\"nice\"]')\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most important methods:\n", "- `.find(tag, id, class_, attrs)`\n", "- `.find_all(tag, id, class_, attrs)`\n", "- `.select(selector expression string)`\n", "- `.get(attribute)`\n", "- `.get_text()`\n", "\n", "#### Exercise:\n", "- Find every **visible** headlines (`h1`...`h6`) texts and subtitles" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Let's do some...\n", "\n", "\n", "
      \n", "\n", "### Cool library of the week, part I: gTTS\n", "#### Create your own audiobook\n", "\n", "- install gTTS with:\n", " ```bash\n", " pip install gtts\n", " ```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- make it talk" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gtts import gTTS\n", "en_hello = gTTS('Hello!', lang='en')\n", "hu_hello = gTTS('Szia!', lang='hu')\n", "\n", "en_hello.save('./data/en_hello.mp3')\n", "hu_hello.save('./data/hu_hello.mp3')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- play it within the notebook" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import IPython\n", "\n", "IPython.display.Audio(\"./data/en_hello.mp3\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "IPython.display.Audio(\"./data/hu_hello.mp3\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Cool library of the week, part II: NLTK\n", "#### Analyze texts in a few lines\n", "\n", "- install it with:\n", " ```bash\n", " pip install nltk\n", " ```\n", "- download required assets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "nltk.download(['punkt', 'stopwords'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- download and extract the first LOTR book" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "\n", "base_url ='https://github.com/ganesh-k13/shell/raw/refs/heads/master/test_search/www.glozman.com/TextPages/{book}'\n", "book_names = {\n", " 1: '01%20-%20The%20Fellowship%20Of%20The%20Ring.txt',\n", " 2: '02%20-%20The%20Two%20Towers.txt',\n", " 3: '03%20-%20The%20Return%20Of%20The%20King.txt',\n", "} \n", "LOTR = requests.get(base_url.format(book=book_names[1])).text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- write a stopword and punctuation filter" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def needed(token):\n", " stopword = token not in nltk.corpus.stopwords.words('english')\n", " number = not token.isnumeric()\n", " length = len(token) > 1 # can be 2 as well\n", " return stopword and number and length\n", "\n", "list(filter(needed, u'I am the number 1 Elephant in the world'.split()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- tokenize words\n", "- filter out stopwords and punctuations" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tokens = nltk.word_tokenize(LOTR.lower())\n", "tokens = filter(needed, tokens)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- compute word frequencies\n", "- show the top 25 words" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wordcount = nltk.FreqDist(tokens)\n", "wordcount.most_common(25)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Let's play a little! \n", "Check how did the top25 words change through the trilogy!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wordcounts = []\n", "for book in range(1, 4):\n", " print('Processing book {}'.format(book), end='')\n", " LOTR = requests.get(base_url.format(book=book_names[book])).text\n", " print('.', end='')\n", " tokens = filter(needed, nltk.word_tokenize(LOTR.lower()))\n", " print('.', end=' ')\n", " wordcounts.append(nltk.FreqDist(tokens).most_common(25))\n", " print('Done.')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wordcounts[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wordcounts[1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wordcounts[2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## It's your turn - write the missing code snippets!\n", "\n", "#### 1. Save every important link to a file from the example page" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "BASE_URI = './data/'\n", "filename = 'important_urls.txt'\n", "# your code goes here\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. Let's get a random post from bash.hu!\n", "- get the page from http://bash.hu/random\n", "- posts are contained in __`div`__ tags with __`qtxt`__ class\n", "- print the text" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "URI = \"http://bash.hu/random\"\n", "# your code goes here\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3. Put the previous code into a function with two arguments: number of posts, and output filename" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def i_want_fun(output, times=5):\n", " pass # your code goes here\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "i_want_fun(BASE_URI+'fun.txt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4. Create a class from the previous function. \n", "The class should store all of the post texts.\n", "The class should have a method:\n", " - called `crawl` which crawls one random bash.hu post\n", " - called `crawl_multiple` which crawls a number (given as argument) of bash.hu posts\n", " - called `show_posts` which prints out the crawled posts\n", " - called `export` which saves the posts into a file (filename is given as argument)\n", " - called `reset` which empties the posts\n", "\n", "I already created the class' skeleton for you. Write your code in place of the `pass` statements." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class IWantFun(object):\n", "\n", " URI = \"http://bash.hu/random\"\n", "\n", " def __init__(self):\n", " pass\n", "\n", " def crawl(self):\n", " pass\n", "\n", " def crawl_multiple(self, times=5):\n", " pass\n", "\n", " def show_urls(self):\n", " pass\n", "\n", " def export(self, output):\n", " pass\n", "\n", " def reset(self):\n", " pass\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nine = IWantFun()\n", "nine.crawl()\n", "nine.show_urls()\n", "nine.crawl_multiple(5)\n", "nine.show_posts()\n", "nine.export(BASE_URI + 'fun.txt')\n", "nine.reset()\n", "nine.show_urls()\n" ] } ], "metadata": { "kernelspec": { "display_name": "szisz_python_2025_automn", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.7" } }, "nbformat": 4, "nbformat_minor": 1 }