{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Python 101 \n",
    "## Part VIII.\n",
    "\n",
    "---\n",
    "\n",
    "## Web Scraping - Part II.\n",
    "\n",
    "### Act I: Let's scrape!\n",
    "\n",
    "But first, import the necessary libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import datetime\n",
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "BASE_URI = './data/'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise 1. Collect the articles about Soros from portfolio.hu\n",
    "\n",
    "This will require to search in the site.\n",
    "On the upper-left corner, there is a search icon. Use it, and observe the resulting url:\n",
    "\n",
    "`https://portfolio.hu/kereses?q=Soros&df=1999-02-10&dt=2025-10-27&page=1`\n",
    "\n",
    "It has multiple parts:\n",
    "- `http://` - protocol\n",
    "- `portfolio.hu` - base url\n",
    "- `/kereses` - sub url\n",
    "- `?q=Soros&df=1999-02-10&dt=2025-10-27&page=1` - query\n",
    "\n",
    "Let's investigate the query part a little more!  \n",
    "Every query starts with a __`?`__ charater followed by one or more key-value pairs. The key-value pairs are separated with the __`&`__ character. Based on this information, we can extract the query parameters:\n",
    "- `df` - `date from`\n",
    "- `dt` - `date to`\n",
    "- `q` - stands for query (the word we are looking for)\n",
    "- `page` - page number\n",
    "\n",
    "Use these values to construct our own request:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "today = datetime.date.today().isoformat()\n",
    "base_url = 'https://portfolio.hu'\n",
    "sub_url = '/kereses'\n",
    "query = {\n",
    "    'q': 'Soros',\n",
    "    'df': '1999-02-09',\n",
    "    'dt': today,\n",
    "    'page': 1\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use the requests library to send the query:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = requests.get(url=base_url+sub_url, params=query) # some pages requires `data` instead of `params`\n",
    "response"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the response, extract the urls from the articles! Pay attention, you may find `<article>` tags in weird places that you do not want to include."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see that only 20 results showed up. We can customize our query to cover shorter amount of timed by replacing __`df`__ and __`dt`__ parameters with a formattable string: __`'{year}-{month:0>2}-{day:0>2}'`__. This string can be formatted by providing the required parameters:\n",
    "- year\n",
    "- month\n",
    "- day\n",
    "\n",
    "like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'{year}-{month:0>2}-{day:0>2}'.format(year=2016, month=1, day=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is a useful library called __`datetime`__. You can use it to generate dates automatically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "date = datetime.date(1999, 1, 1)\n",
    "one_day = datetime.timedelta(days=1)\n",
    "day_after_date = date + one_day\n",
    "day_before_date = date - one_day\n",
    "today = datetime.date.today()\n",
    "\n",
    "print(day_before_date)\n",
    "print(date)\n",
    "print(day_after_date)\n",
    "print(today)\n",
    "\n",
    "print(today.year, today.month, today.day)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a loop which iterates through every day from 1999-01-01 till today and executes the same procedure you created previously. (Pro tip: create a function!) Observe the number of results!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### Act II: Disguise yourself!\n",
    "\n",
    "![Disguise](pics/batman-superman-disguise.gif)\n",
    "\n",
    "Let's pretend to be a browser instead of a script:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "USER_AGENTS = [\n",
    "    # Chrome OS-based laptop using Chrome browser (Chromebook)\n",
    "    'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',\n",
    "    # Windows 7-based PC using a Chrome browser\n",
    "    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36',\n",
    "    # Linux-based PC using a Firefox browser\n",
    "    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1',\n",
    "    # Mac OS X-based computer using a Safari browser\n",
    "    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',\n",
    "    # Windows 10-based PC using Edge browser\n",
    "    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246',\n",
    "    # Playstation 4 Browser\n",
    "    'Mozilla/5.0 (PlayStation 4 3.11) AppleWebKit/537.73 (KHTML, like Gecko)',\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can find more user agents [here](https://deviceatlas.com/blog/list-of-user-agent-strings).\n",
    "\n",
    "Let's write a wrapper function to handle the user-agent string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "def get_header(agents):\n",
    "    return {'User-agent': random.choice(agents)}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Get the main articles from telex.hu\n",
    "Write a function that extracts the current main articles! It should contain:\n",
    "- the title\n",
    "- the article text\n",
    "- the url\n",
    "- every picture from the article"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = 'http://telex.hu'\n",
    "telex_response = requests.get(url, headers=get_header(USER_AGENTS))\n",
    "\n",
    "# raise an error in case the download was unsuccessful\n",
    "telex_response.raise_for_status()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "#### Exercise 2. Check out discounts on isthereanydeal.com!\n",
    "\n",
    "List the names, prices and discount values for the top 100 games list (based on metacritic scores: http://www.metacritic.com/browse/games/score/metascore/all/pc/filtered )!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Extra questions:  \n",
    "- How much does it cost to buy every (available) games?\n",
    "- How much money would I save if I'd bought them at their lowest price?\n",
    "- How much money do I save if we compare their price to their initial price? (Let's assume that every game initial price was \\$60) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise 3. Functionize!\n",
    "\n",
    "##### a. Create a function to check a game's price"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def check_price(game):\n",
    "    pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### b. Create a function to get the top100 games"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_top100():\n",
    "    pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### c. Write a function with the same functionality as the 2nd exercise!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def main():\n",
    "    pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### Intermission: Creating a standalone script\n",
    "\n",
    "Create a new text file with .py extension! You can specify the filename.\n",
    "Start it with:  \n",
    "    `# encoding: utf-8`  \n",
    "then copy-paste:\n",
    "    - the imports, \n",
    "    - the global variables \n",
    "    - the three functions\n",
    "and insert the following two lines into the end of the file:  \n",
    "`if __name__ == '__main__':  \n",
    "     main()`  \n",
    "Save it, and now you can execute this script by invoking:  \n",
    "    __`python your_specified_filename.py`__\n",
    "\n",
    "\n",
    "You can even:\n",
    "- import your newly created script:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import myscript # use your filename"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- get it's contents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dir(myscript)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- print its variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(myscript.base_url, myscript.sub_url, myscript.query)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- use its functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hl2 = myscript.check_price('half-life 2')\n",
    "print(hl2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "myscript.get_top100()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Let's do some...\n",
    "\n",
    "<img align=\"left\" width=150 src=\"http://www.reactiongifs.com/r/mgc.gif\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Act III: Cool library of the week: Tkinter\n",
    "#### Create graphical user interfaces!\n",
    "All you have to do, is:\n",
    "- Import it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "import tkinter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Create a class:\n",
    "    - with window layout\n",
    "    - with function bindings "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Dice(tkinter.Tk):\n",
    "\n",
    "    def __init__(self, parent):\n",
    "        # init main window (parent is the parent window)\n",
    "        tkinter.Tk.__init__(self, parent)\n",
    "        self.parent = parent\n",
    "        self.initialize()\n",
    "\n",
    "    def initialize(self):\n",
    "        self.grid()\n",
    "\n",
    "        # add label\n",
    "        self.labelVariable = tkinter.StringVar()\n",
    "        label = tkinter.Label(self, textvariable=self.labelVariable)\n",
    "        label.grid(column=0, row=0, sticky='EW')\n",
    "        self.labelVariable.set(0)\n",
    "\n",
    "        # add button\n",
    "        button = tkinter.Button(self, text=u\"Throw!\", command=self.throw)\n",
    "        button.grid(column=1, row=0)\n",
    "\n",
    "    def throw(self):\n",
    "        self.labelVariable.set(random.randint(1, 6))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Initiate and use it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "app = Dice(None)\n",
    "app.title('Throw a dice!')\n",
    "app.mainloop()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An easy to follow tutorial can be found <a href=\"http://sebsauvage.net/python/gui/\">here</a>.\n",
    "\n",
    "---\n",
    "\n",
    "## Final Act:  It's your turn - write the missing code snippets!\n",
    "\n",
    "Write a script called `articles.py`, in which you create an function called `get_article_tags` and a global variable called `BASE_URL` (telex.hu's base url).\n",
    "\n",
    "\n",
    "`get_article_tags(telex_article_suburl)`:\n",
    "- Arguments: telex_article_suburl\n",
    "- Output: a list of article tags\n",
    "- Workflow:\n",
    "    1. Combine BASE_URL and the article sub-URL to form the full URL.\n",
    "    1. Send a disguised HTTP request (use a User-Agent header).\n",
    "    1. Parse the page with BeautifulSoup.\n",
    "    1. Extract all article tags and return them as a list.\n",
    "\n",
    "Don't forget about the diguise!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# test the script\n",
    "import articles"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "related_tags = articles.get_article_tags('')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for tag in related_tags:\n",
    "    print(tag)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "szisz_python_2025_automn",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}