{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python 101 \n", "## Part VIII.\n", "\n", "---\n", "\n", "## Web Scraping - Part II.\n", "\n", "### Act I: Let's scrape!\n", "\n", "But first, import the necessary libraries." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "\n", "BASE_URI = './data/'\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise 1. Collect the articles about Soros from portfolio.hu\n", "\n", "This will require to search in the site.\n", "On the upper-left corner, there is a search icon. Use it, and observe the resulting url:\n", "\n", "`https://portfolio.hu/kereses?q=Soros&df=1999-02-10&dt=2020-11-26&page=1`\n", "\n", "It has multiple parts:\n", "- `http://` - protocol\n", "- `portfolio.hu` - base url\n", "- `/kereses` - sub url\n", "- `?q=Soros&df=1999-02-10&dt=2020-11-26&page=1` - query\n", "\n", "Let's investigate the query part a little more! \n", "Every query starts with a __`?`__ charater followed by one or more key-value pairs. The key-value pairs are separated with the __`&`__ character. Based on this information, we can extract the query parameters:\n", "- `df` - `date from`\n", "- `dt` - `date to`\n", "- `q` - stands for query (the word we are looking for)\n", "- `page` - page number\n", "\n", "Use these values to construct our own request:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "base_url = 'https://portfolio.hu'\n", "sub_url = '/kereses'\n", "query = {\n", " 'q': 'Soros',\n", " 'df': '1999-02-09',\n", " 'dt': '2021-10-25',\n", " 'page': 1\n", "}\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the requests library to send the query:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "resp = requests.get(url=base_url+sub_url, params=query) # some pages requires `data` instead of `params`\n", "resp\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the response, extract the urls from the articles! Pay attention, you may find `
` tags in weird places that you do not want to include." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that only 20 results showed up. We can customize our query to cover shorter amount of timed by replacing __`df`__ and __`dt`__ parameters with a formattable string: __`'{year}-{month:0>2}-{day:0>2}'`__. This string can be formatted by providing the required parameters:\n", "- year\n", "- month\n", "- day\n", "\n", "like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'{year}-{month:0>2}-{day:0>2}'.format(year=2016, month=1, day=1)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a useful library called __`datetime`__. You can use it to generate dates automatically." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import datetime\n", "\n", "date = datetime.date(1999, 1, 1)\n", "day_after_date = date + datetime.timedelta(days=1)\n", "day_before_date = date - datetime.timedelta(days=1)\n", "today = datetime.date.today()\n", "\n", "print(day_before_date)\n", "print(date)\n", "print(day_after_date)\n", "print(today)\n", "\n", "print(today.year, today.month, today.day)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a loop which iterates through every day from 1999-01-01 till today and executes the same procedure you created previously. (Pro tip: create a function!) Observe the number of results!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "### Act II: Disguise yourself!\n", "\n", "![Disguise](pics/batman-superman-disguise.gif)\n", "\n", "Let's pretend to be a browser instead of a script:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "USER_AGENTS = [\n", " # Chrome OS-based laptop using Chrome browser (Chromebook)\n", " 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',\n", " # Windows 7-based PC using a Chrome browser\n", " 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36',\n", " # Linux-based PC using a Firefox browser\n", " 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1',\n", " # Mac OS X-based computer using a Safari browser\n", " 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',\n", " # Windows 10-based PC using Edge browser\n", " 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246',\n", " # Playstation 4 Browser\n", " 'Mozilla/5.0 (PlayStation 4 3.11) AppleWebKit/537.73 (KHTML, like Gecko)',\n", "]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can find more user agents [here](https://deviceatlas.com/blog/list-of-user-agent-strings).\n", "\n", "Let's write a wrapper function to handle the user-agent string." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import random\n", "def get_header(agents):\n", " return {'User-agent': random.choice(agents)}\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Get the main articles from telex.hu\n", "Write a function that extracts the current main articles! It should contain:\n", "- the title\n", "- the article text\n", "- the url\n", "- every picture from the article" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = 'http://telex.hu'\n", "telex_response = requests.get(url, headers=get_header(USER_AGENTS))\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "#### Exercise 2. Check out discounts on isthereanydeal.com!\n", "\n", "List the names, prices and discount values for the top 100 games list (based on metacritic scores: http://www.metacritic.com/browse/games/score/metascore/all/pc/filtered )!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Extra questions: \n", "- How much does it cost to buy every (available) games?\n", "- How much money would I save if I'd bought them at their lowest price?\n", "- How much money do I save if we compare their price to their initial price? (Let's assume that every game initial price was \\$60) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise 3. Functionize!\n", "\n", "##### a. Create a function to check a game's price" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def check_price(game):\n", " pass\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### b. Create a function to get the top100 games" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_top100():\n", " pass\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### c. Write a function with the same functionality as the 2nd exercise!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def main():\n", " pass\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "### Intermission: Creating a standalone script\n", "\n", "Create a new text file with .py extension! You can specify the filename.\n", "Start it with: \n", " `# encoding: utf-8` \n", "then copy-paste:\n", " - the imports, \n", " - the global variables \n", " - the three functions\n", "and insert the following two lines into the end of the file: \n", "`if __name__ == '__main__': \n", " main()` \n", "Save it, and now you can execute this script by invoking: \n", " __`python your_specified_filename.py`__\n", "\n", "\n", "You can even:\n", "- import your newly created script:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import myscript # use your filename\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- get it's contents" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dir(myscript)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- print its variables" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(myscript.base_url, myscript.sub_url, myscript.query)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- use its functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hl2 = myscript.check_price('half-life 2')\n", "print(hl2)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myscript.get_top100()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Let's do some...\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Act III: Cool library of the week: Tkinter\n", "#### Create graphical user interfaces!\n", "All you have to do, is:\n", "- Import it" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import random\n", "import tkinter\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Create a class:\n", " - with window layout\n", " - with function bindings " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Dice(tkinter.Tk):\n", "\n", " def __init__(self, parent):\n", " # init main window (parent is the parent window)\n", " tkinter.Tk.__init__(self, parent)\n", " self.parent = parent\n", " self.initialize()\n", "\n", " def initialize(self):\n", " self.grid()\n", "\n", " # add label\n", " self.labelVariable = tkinter.StringVar()\n", " label = tkinter.Label(self, textvariable=self.labelVariable)\n", " label.grid(column=0, row=0, sticky='EW')\n", " self.labelVariable.set(0)\n", "\n", " # add button\n", " button = tkinter.Button(self, text=u\"Throw!\", command=self.throw)\n", " button.grid(column=1, row=0)\n", "\n", " def throw(self):\n", " self.labelVariable.set(random.randint(1, 6))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Initiate and use it" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "app = Dice(None)\n", "app.title('Throw a dice!')\n", "app.mainloop()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An easy to follow tutorial can be found here.\n", "\n", "---\n", "\n", "## Final Act: It's your turn - write the missing code snippets!\n", "\n", "Write a script called `articles.py`, in which you create an object called ArticleTags.\n", "It has an attribute: `base_url` (telex.hu's base url)\n", "It has three functions: `init`, `get`, and `set`\n", "\n", "Init:\n", "- Arguments: (`self` and) `telex_article_suburl`\n", "- Output: -\n", "- Workflow: set the `self.article` to `telex_article_suburl`\n", "\n", "Get:\n", "- Arguments: `self`\n", "- Output: the list of article tags\n", "- Workflow: \n", " * get the `self.article` page\n", " * parse it for the related article tags\n", " * return them in a list\n", "\n", "Set:\n", "- Arguments: (`self` and) `telex_article_suburl`\n", "- Output: -\n", "- Workflow: set the `self.article` to `telex_article_suburl`\n", "\n", "Don't forget about the diguise!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# test the script\n", "import articles\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "related_tags = articles.ArticleTags('')\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for tag in related_tags.get():\n", " print(tag)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 1 }