{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d162b058",
   "metadata": {},
   "source": [
    "---   \n",
    " <img align=\"left\" width=\"75\" height=\"75\"  src=\"https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png\"> \n",
    "\n",
    "<h1 align=\"center\">Department of Data Science</h1>\n",
    "<h1 align=\"center\">Course: Tools and Techniques for Data Science</h1>\n",
    "\n",
    "---\n",
    "<h3><div align=\"right\">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    \n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3184edc1",
   "metadata": {},
   "source": [
    "<h1 align=\"center\">Lecture 5.2 (Web Scraping using BeautifulSoup)</h1><br>\n",
    "<a href=\"https://colab.research.google.com/github/arifpucit/data-science/blob/master/Section-5-(Data-Acquisition)/Lec-5.2(Web-Scraping-using-BeautifulSoup).ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e64b25e",
   "metadata": {},
   "source": [
    "<img align=\"center\" width=\"900\" height=\"650\"  src=\"images/scrap.PNG\"  >"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7e99711f",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b160b1e8",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "65ca04bb",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "19a21dbb",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7e9e502b",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "38dc9b19",
   "metadata": {},
   "source": [
    "<img align=\"right\" width=\"400\" src=\"images/webscraping.png\"  >\n",
    "\n",
    "## Learning agenda of this notebook\n",
    "1. **Overview of BeautifulSoup**\n",
    "    - What is BeautifulSoup and how it works?\n",
    "    - Download and Install BeautifulSoup\n",
    "\n",
    "\n",
    "2. **Playing with BeautifulSoup**<br>\n",
    "    - Reviewing the Books Scraping Website\n",
    "    - Fetching HTML Contents Using `requests` Library\n",
    "    - Creating the Soup Object using `BeautifulSoup` Library\n",
    "    - Accessing Attributes of `Soup` Object\n",
    "    - Using the `soup.find()` Method\n",
    "    - Using the `soup.find_all()` Method\n",
    "    - Iterating Through the List returned by `soup.find_all()` Method\n",
    "\n",
    "\n",
    "3. **Example 1: Scraping Information from a Single Web Page** https://arifpucit.github.io/bss2/ <br>\n",
    "    - Extracting Book Titles/Authors\n",
    "    - Extracting Book Prices\n",
    "    - Extracting Book Availability (In-Stock)\n",
    "    - Extracting Book Review Count\n",
    "    - Extracting Book Star Ratings\n",
    "    - Extracting Book Links\n",
    "    - Saving data into CSV file on disk\n",
    "\n",
    "\n",
    "4. **Example 1 (cont): Scraping Information from a Multiple Web Pages** https://arifpucit.github.io/bss2/ <br>\n",
    "    - Extracting Book Titles/Authors, Prices, Availability, Review Count, Star Ratings and Links from multiple pages\n",
    "    - Saving data into CSV file on disk\n",
    "\n",
    "\n",
    "5. **Example 2: Scraping Information from a Multiple Web Pages (Pagination)** http://www.arifbutt.me/category/sp-with-linux/ <br>\n",
    "    - Extracting required information\n",
    "    - The Concept of **Pagination**\n",
    "    - How to extract information from Multiple Web Pages using **Pagination**?\n",
    "    - Saving data into CSV file on disk\n",
    "    \n",
    "\n",
    "6. **Limitations of BeautifulSoup** <br>\n",
    "\n",
    "\n",
    "7. **Some Coding Exercises** <br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4ea32583",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e020b9da",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "cd860180",
   "metadata": {},
   "source": [
    "## 1. Overview of BeautifulSoup\n",
    "- URLLIB3 Library: https://pypi.org/project/urllib3/\n",
    "- Requests Library: https://requests.readthedocs.io/en/latest/\n",
    "- Requests-html Library: https://requests.readthedocs.io/projects/requests-html/en/latest/\n",
    "- Beautifulsoup4 Download: https://pypi.org/project/beautifulsoup4/\n",
    "- Beautifulsoup4: https://www.crummy.com/software/BeautifulSoup/\n",
    "- Beautifulsoup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/\n",
    "- LXML Parser: https://lxml.de/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7830f00",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0f72c42c",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "b5f4319f",
   "metadata": {},
   "source": [
    "### a. What is BeautifulSoup and How it Works?\n",
    "- Beautiful Soup is a Python library for pulling data out of `HTML` and `XML` files. \n",
    "- BeutifulSoup cannot fetch HTML contents from a web site. To pull HTML we will use `requests` library and then pass the HTML to BeautifulSoup constructor.\n",
    "- The three main features of BeautifulSoup are:\n",
    "    - It generates a parse tree of the HTML and offers simple methods for navigating, searching and modifying that parse tree.\n",
    "    - It automatically converts incoming documents to Unicode and outgoing documents to UTF-8. So you don't have to worry about encodings.\n",
    "    - It has support of different parsers, using which BeautifulSoup parse the HTML documents. Some example parsers are: lxml, html5lib, html.parser.\n",
    "  \n",
    "- Different parsers may create different parse trees and could return different results depending on the HTML that you are trying to parse. If your are trying to parse perfectly formed HTML, then the different parsers will give almost the same output, but if there are mistakes in the html then different parsers will try to fill in missing information differently."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "86702012",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33857c32",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "7831ea98",
   "metadata": {},
   "source": [
    "### b. Download and Install BeautifulSoup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ff5e67a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "!{sys.executable} -m pip install --upgrade pip -q\n",
    "!{sys.executable} -m pip install requests -q\n",
    "!{sys.executable} -m pip install beautifulsoup4 -q\n",
    "!{sys.executable} -m pip install --upgrade lxml -q\n",
    "!{sys.executable} -m pip install html5lib -q"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "703c27ae",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "import bs4 # bs4 is a dummy package managed by the developer of Beautiful Soup to prevent name squatting\n",
    "from bs4 import BeautifulSoup\n",
    "import lxml\n",
    "import html5lib\n",
    "\n",
    "requests.__version__, bs4.__version__ , lxml.__version__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0b42ab93",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5ccf9530",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2b6d298",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9a35e12d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "1649115d",
   "metadata": {},
   "source": [
    "## 2. Playing with BeautifulSoup"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7011a218",
   "metadata": {},
   "source": [
    "### a. Reviewing the Books Scraping Website\n",
    "https://arifpucit.github.io/bss2/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aafe1225",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f9bbc70",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "2153eec8",
   "metadata": {},
   "source": [
    "### b. Fetching HTML Contents Using `requests` Library\n",
    "- A good practical tutorial on using Requests Library: https://www.jcchouinard.com/python-requests/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1e5e4251",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "print(dir(requests))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd123e4d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4ecf5b4a",
   "metadata": {},
   "outputs": [],
   "source": [
    "resp = requests.get(\"https://arifpucit.github.io/bss2\")\n",
    "resp.status_code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "af4a071b",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "175aaa65",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(dir(resp))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9bbef380",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d56ca9d",
   "metadata": {},
   "outputs": [],
   "source": [
    "resp.url"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "950ca4f7",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc4b2ed2",
   "metadata": {},
   "outputs": [],
   "source": [
    "resp.headers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f25c612",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a8f3d4f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "resp.content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "478bba9a",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8f2132b6",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(resp.text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ee25b05a",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3ee20672",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "2a0c9b7f",
   "metadata": {},
   "source": [
    "### c. Creating the Soup Object using `BeautifulSoup` Library\n",
    "- The `BeautifulSoup()` method is used to create a BeautifulSoup object.\n",
    "\n",
    "##### <center> `BeautifulSoup(markup, \"lxml\")` </center>\n",
    "\n",
    "- The first argument to the BeautifulSoup constructor is a string or an open filehandle containing the markup you want to be parsed. \n",
    "- The second argument is how you’d like the markup parsed. If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.\n",
    "\n",
    "- The method returns a BeautifulSoup object which represents the parsed document and knows how to navigate through the DOM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d3b76aec",
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(resp.text, 'lxml')\n",
    "print(type(soup))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "65846f75",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b61b1db4",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(dir(soup))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6fdeae18",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24e538c9",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(soup)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5799c8b8",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1dd88827",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(soup.prettify())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "795aa49e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "730c8257",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "0f4f468c",
   "metadata": {},
   "source": [
    "> **Tag Objects**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "61bfb99f",
   "metadata": {},
   "outputs": [],
   "source": [
    "soup.header"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a5be41f",
   "metadata": {},
   "outputs": [],
   "source": [
    "soup.p"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23bdfc25",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1fe641db",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "08d797b0",
   "metadata": {},
   "source": [
    "> **Name Objects**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30ace3fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "soup.header.name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e20b9bd8",
   "metadata": {},
   "outputs": [],
   "source": [
    "soup.img.name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4da66e92",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(type(soup.header.name))\n",
    "print(type(soup.img.name))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30213ef9",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38accc38",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "269d9013",
   "metadata": {},
   "source": [
    "> **Attribute Objects**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63dab9ff",
   "metadata": {},
   "outputs": [],
   "source": [
    "soup.p.attrs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f7b0a37a",
   "metadata": {},
   "outputs": [],
   "source": [
    "soup.img.attrs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "456579d3",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "15ebbebd",
   "metadata": {},
   "source": [
    "> **Navigatable String Object**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0fe485a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "soup.title.string"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "533c0695",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "df3e0339",
   "metadata": {},
   "source": [
    "> **You can Navigate the Entire Tree of Soup Object**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "21f0c6ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "soup.body.a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "04bd56cb",
   "metadata": {},
   "outputs": [],
   "source": [
    "soup.body.a.parent"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "94f17001",
   "metadata": {},
   "outputs": [],
   "source": [
    "soup.body.a.parent.parent"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4362aec",
   "metadata": {},
   "outputs": [],
   "source": [
    "soup.body.ul"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8332c201",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "47b136fc",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be146f69",
   "metadata": {},
   "outputs": [],
   "source": [
    "soup.body.ul.children"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2a84fc9",
   "metadata": {},
   "outputs": [],
   "source": [
    "for tag in soup.body.ul.children:\n",
    "    print(tag)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "85d0128f",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "929e65fd",
   "metadata": {},
   "source": [
    "### d. Using the `soup.find()` Method\n",
    "- The `soup.find()` method returns the first tag that matches the search criteria:\n",
    "\n",
    "`soup.find(name=None, attrs={}, recursive=True, text=None, **kwargs)`\n",
    "\n",
    "- Where\n",
    "    - `name` is the tag name to search.\n",
    "    - `attrs={}`, A dictionary of filters on attribute values.\n",
    "    - `recursive=True`, If this is `True`, find() will perform a recursive search of this PageElement's children. Otherwise, only the direct children will be considered.\n",
    "    - `text=None`, St\n",
    "    - `limit`, Stop looking after finding this many results.\n",
    "    \n",
    "**The `find()` method can be called on the entire soup object or you can call `find()` method from a specific tag from within a soup object**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11042964",
   "metadata": {},
   "outputs": [],
   "source": [
    "soup.find('div', {'class':'navbar'})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "56248435",
   "metadata": {},
   "outputs": [],
   "source": [
    "soup.find('div', class_='navbar')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a2ce69b7",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26930806",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "84e8d7ec",
   "metadata": {},
   "source": [
    "### e. Using the `soup.find_all()` Method\n",
    "- The `soup.find_all()` method returns a list of all the tags or strings that match a particular criteria.\n",
    "\n",
    "`soup.find(name=None, attrs={}, limit, string=None, recursive=True, text=None, **kwargs)`\n",
    "\n",
    "- Where\n",
    "    - `name` is  the name of the tag to return.\n",
    "    - `attrs={}`, A dictionary of filters on attribute values.\n",
    "    - `string=None`, is used if you want to search for a text string rather than tagname\n",
    "    - `recursive=True`, If this is `True`, will perform a recursive search of all the descendents. Otherwise, only the direct children will be considered.\n",
    "    - `string=None`, is used if you want to search for a text string rather than tagname\n",
    "    - `limit`, is the number of elements to return. Defaults to all matching (`find()` method is similar to find_all() by passing the limit=1\n",
    "\n",
    "\n",
    "**Note:** The class attribute having space separated string means multiple classes, while an id attribute having space separated string means a single id whose name is having spaces in between"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f3f85925",
   "metadata": {},
   "outputs": [],
   "source": [
    "prices = soup.find_all('p', class_='price green')\n",
    "prices"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae588d9a",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "90570583",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Since `soup.find_all()` method returns a list of tags we can iterate through all the list values\n",
    "for price in prices:\n",
    "    print(price.text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "54784056",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e053ca4b",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8accea2",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "5561d1ec",
   "metadata": {},
   "source": [
    "## 3. Example 1: Scraping Information from a Single Web Page:\n",
    "<h3 align=\"center\" style=\"color:green\">https://arifpucit.github.io/bss2/</h3>\n",
    "<br>\n",
    "\n",
    "- Visit above web page and scrap following six items of the nine books from the index page:\n",
    "    - Titles/Authors of the Book\n",
    "    - Links of the Book\n",
    "    - Price of the Book\n",
    "    - Availability of the Book (In-Stock or Not in Stock)\n",
    "    - Count of Reviews\n",
    "    - Star ratings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "202ac961",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "import lxml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dba21c99",
   "metadata": {},
   "outputs": [],
   "source": [
    "resp = requests.get(\"https://arifpucit.github.io/bss2\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f88d2284",
   "metadata": {},
   "outputs": [],
   "source": [
    "soup = BeautifulSoup(resp.text, 'lxml')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "afa9feb4",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c97dc06c",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "bb8f9d6e",
   "metadata": {},
   "source": [
    "### a. Extract Book Title and Author Name\n",
    "- Suppose you want to get the book titles and author names of all the books. \n",
    "- Start by getting the information about the first book, once you are satisfied, then try finding information of all the books"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a5366590",
   "metadata": {},
   "outputs": [],
   "source": [
    "sp_titles = soup.find_all('p', class_=\"book_name\")\n",
    "sp_titles"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a49742d6",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8357ed0e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "61d50a58",
   "metadata": {},
   "outputs": [],
   "source": [
    "titles = []\n",
    "for title in sp_titles:\n",
    "    titles.append(title.text)\n",
    "print(titles)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "85adbdea",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa4f5997",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "d35bfec1",
   "metadata": {},
   "source": [
    "### b. Extract Links of Books"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c0d07aa5",
   "metadata": {},
   "outputs": [],
   "source": [
    "sp_titles = soup.find_all('p', class_=\"book_name\")\n",
    "sp_titles"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7de85b7e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a4640f3",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7832a602",
   "metadata": {},
   "outputs": [],
   "source": [
    "for item in sp_titles:\n",
    "    print(item.find('a'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "592fb8a0",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "77e774f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "for item in sp_titles:\n",
    "    print(item.find('a').get('href')) # print(item.find('a')['href'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e9ec290d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4e328f01",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e39a607b",
   "metadata": {},
   "outputs": [],
   "source": [
    "links=[]\n",
    "for item in sp_titles:\n",
    "    links.append(item.find('a').get('href'))\n",
    "links"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5c80284",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "64079b57",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "4007a6a8",
   "metadata": {},
   "source": [
    "### c. Extract Price"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "47898a4c",
   "metadata": {},
   "outputs": [],
   "source": [
    "sp_prices = soup.find_all('p', class_=\"price green\")\n",
    "sp_prices"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5915f8cc",
   "metadata": {},
   "outputs": [],
   "source": [
    "prices = []\n",
    "for price in sp_prices:\n",
    "    prices.append(price.text)\n",
    "print(prices)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4f39ea7f",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c511efd3",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "23683ade",
   "metadata": {},
   "source": [
    "### d. Extract Availability of Books (In-Stock or Not in Stock)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dbae312e",
   "metadata": {},
   "outputs": [],
   "source": [
    "sp_availability =  soup.find_all('p', class_='stock')\n",
    "sp_availability"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6dbfc3e6",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a5056f7",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67a4029c",
   "metadata": {},
   "outputs": [],
   "source": [
    "availability=[]\n",
    "for aval in sp_availability:\n",
    "    availability.append(aval.text)\n",
    "print(availability)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "91fb3d87",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46adedd6",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "90ed7957",
   "metadata": {},
   "source": [
    "### e. Extract Count of Reviews"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "311205d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "sp_reviews = soup.find_all('p', class_='review')\n",
    "sp_reviews"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "815655bf",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "58250d22",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "656a3f49",
   "metadata": {},
   "outputs": [],
   "source": [
    "reviews = []\n",
    "for review in sp_reviews:\n",
    "    reviews.append(review.get('data-rating')) ## get() method is passwed an attribute and it returns its value\n",
    "print(reviews)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7db1a2bc",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5e1303b4",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "84ceb8af",
   "metadata": {},
   "source": [
    "### f. Extract Star Ratings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f106443e",
   "metadata": {},
   "outputs": [],
   "source": [
    "book = soup.find('div', class_ = 'book_container')\n",
    "print(book.prettify())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9b401ada",
   "metadata": {},
   "outputs": [],
   "source": [
    "book.find_all('span', class_ = 'not_filled')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14d7d7bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "len(book.find_all('span', class_ = 'not_filled'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d3999c37",
   "metadata": {},
   "outputs": [],
   "source": [
    "5-len(book.find_all('span',{'class','not_filled'}))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5a39acfd",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d8be20ab",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dfe7cef8",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e50e49bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "stars = list()\n",
    "books = soup.find_all('div',{'class','book_container'})\n",
    "for book in books:\n",
    "    stars.append(5 - len(book.find_all('span',{'class','not_filled'})))\n",
    "print(stars) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35df29ad",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c686acff",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be8260d5",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "438c1c95",
   "metadata": {},
   "source": [
    "### g. Display output on screen"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa51ea28",
   "metadata": {},
   "outputs": [],
   "source": [
    "for i in range(9):\n",
    "    print(\"\",titles[i])\n",
    "    print(\"      Link: \",links[i])\n",
    "    print(\"      Price: \",prices[i])\n",
    "    print(\"      Stock: \",availability[i])\n",
    "    print(\"      Reviews: \",reviews[i])\n",
    "    print(\"      Stars: \",stars[i])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fe189068",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "50fbd2bc",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "1b697c9f",
   "metadata": {},
   "source": [
    "### h. Saving Data into a CSV File"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f0bdc61",
   "metadata": {},
   "source": [
    "#### Option 1:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70a70cb3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "data = {'Title/Author':titles, 'Price':prices, 'Availability':availability, \n",
    "        'Reviews':reviews, 'Links':links, 'Stars':stars}\n",
    "\n",
    "df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])\n",
    "df.to_csv('books1.csv', index=False)\n",
    "df = pd.read_csv('books1.csv')\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "587409f2",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ee23f093",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0110a75b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "help(csv)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a94121e",
   "metadata": {},
   "source": [
    "#### Option 2:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9aca138f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "import pandas as pd\n",
    "\n",
    "fd = open('books2.csv', 'wt')\n",
    "csv_writer = csv.writer(fd)\n",
    "\n",
    "csv_writer.writerow(['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])\n",
    "\n",
    "for i in range(len(titles)):\n",
    "    csv_writer.writerow([titles[i], prices[i], availability[i], reviews[i], links[i], stars[i]])\n",
    "\n",
    "fd.close()\n",
    "df = pd.read_csv('books2.csv')\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "995b231a",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "03cc1898",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "0774e21c",
   "metadata": {},
   "source": [
    "### i. Consolidating in a Single Script"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "02707ba1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "import pandas as pd\n",
    "\n",
    "titles = []\n",
    "prices = []\n",
    "availability=[]\n",
    "reviews=[]\n",
    "links=[]\n",
    "stars=[]\n",
    "\n",
    "def books(soup):\n",
    "    sp_titles = soup.find_all('p', class_=\"book_name\")\n",
    "    sp_prices = soup.find_all('p', class_=\"price green\")\n",
    "    sp_availability = data = soup.find_all('p', class_='stock')\n",
    "    sp_reviews = soup.find_all('p',{'class','review'})\n",
    "    data = soup.find_all('p', class_=\"book_name\")\n",
    "    sp_links=[]\n",
    "    for val in data:\n",
    "        sp_links.append(val.find('a').get('href'))\n",
    "    books = soup.find_all('div',{'class','book_container'})\n",
    "    for book in books:\n",
    "        stars.append(5 - len(book.find_all('span',{'class','not_filled'})))\n",
    "    \n",
    "    for i in range(len(sp_titles)):\n",
    "        titles.append(sp_titles[i].text)\n",
    "        prices.append(sp_prices[i].text)\n",
    "        availability.append(sp_availability[i].text)\n",
    "        reviews.append(sp_reviews[i].text)\n",
    "        links.append(sp_links[i])\n",
    "\n",
    "        \n",
    "resp = requests.get(\"https://arifpucit.github.io/bss2\")\n",
    "soup = BeautifulSoup(resp.text, 'lxml')\n",
    "books(soup)\n",
    "\n",
    "\n",
    "data = {'Title/Author':titles, 'Price':prices, 'Availability':availability, \n",
    "        'Reviews':reviews, 'Links':links, 'Stars':stars}\n",
    "df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])\n",
    "df.to_csv('books3.csv', index=False)\n",
    "df = pd.read_csv('books3.csv')\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "807df2d3",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f2d8ce9f",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "1b42c36d",
   "metadata": {},
   "source": [
    "## 4. Example 1 (cont): Scraping Information from a Multiple Web Pages:\n",
    "<h3 align=\"center\" style=\"color:green\">https://arifpucit.github.io/bss2/</h3>\n",
    "<br>\n",
    "\n",
    "- Visit above web page and scrap following six items of the 27 books on all the three web pages:\n",
    "    - Titles/Authors of the Book\n",
    "    - Links of the Book\n",
    "    - Price of the Book\n",
    "    - Availability of the Book (In-Stock or Not in Stock)\n",
    "    - Count of Reviews\n",
    "    - Star ratings"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30502423",
   "metadata": {},
   "source": [
    "- Note that the HTML structure of all the three pages of our Book Scraping Site is same\n",
    "- We have already written the code to scrap the information from the first page\n",
    "- Now we need to find a way to go to multiple pages and use the same code in a loop for all those pages to grab data of our interest.\n",
    "- Generally when a website runs into multiple pages it usually add some extra elements into its URL and keep rest of the URL same. \n",
    "- After closely observing the structure of the URL, and the changes that occurs when we go from page to page. One can devise the way to generate the URLs from the base URL by some sort of appending strings to the base URL."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b9a513da",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0efa0098",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "import pandas as pd\n",
    "\n",
    "titles = []\n",
    "prices = []\n",
    "availability=[]\n",
    "reviews=[]\n",
    "links=[]\n",
    "stars=[]\n",
    "\n",
    "def books(soup):\n",
    "    sp_titles = soup.find_all('p', class_=\"book_name\")\n",
    "    sp_prices = soup.find_all('p', class_=\"price green\")\n",
    "    sp_availability = data = soup.find_all('p', class_='stock')\n",
    "    sp_reviews = soup.find_all('p',{'class','review'})\n",
    "    # for links\n",
    "    data = soup.find_all('p', class_=\"book_name\")\n",
    "    sp_links=[]\n",
    "    for val in data:\n",
    "        sp_links.append(val.find('a').get('href'))\n",
    "    books = soup.find_all('div',{'class','book_container'})\n",
    "    for book in books:\n",
    "        stars.append(5 - len(book.find_all('span',{'class','not_filled'})))\n",
    "    \n",
    "    for i in range(len(sp_titles)):\n",
    "        titles.append(sp_titles[i].text)\n",
    "        prices.append(sp_prices[i].text)\n",
    "        availability.append(sp_availability[i].text)\n",
    "        reviews.append(sp_reviews[i].text)\n",
    "        links.append(sp_links[i])\n",
    "\n",
    "\n",
    "urls = ['https://arifpucit.github.io/bss2/index.html', \n",
    "        'https://arifpucit.github.io/bss2/SP.html', \n",
    "        'https://arifpucit.github.io/bss2/CA.html']                  \n",
    "for url in urls:\n",
    "    resp = requests.get(url)\n",
    "    soup = BeautifulSoup(resp.text, 'lxml')\n",
    "    books(soup)\n",
    "\n",
    "# Creating a dataframe and saving data in a csv file\n",
    "data = {'Title/Author':titles, 'Price':prices, 'Availability':availability, \n",
    "        'Reviews':reviews, 'Links':links, 'Stars':stars}\n",
    "df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])\n",
    "df.to_csv('books3.csv', index=False)\n",
    "df = pd.read_csv('books3.csv')\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "00836028",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5e0e8cc",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "5723f207",
   "metadata": {},
   "source": [
    "## 5. Example 2: Scraping Information from a Multiple Web Pages (Pagination):\n",
    "<h3 align=\"center\" style=\"color:green\">https://arifbutt.me/category/sp-with-linux</h3>\n",
    "<br>\n",
    "\n",
    "- Visit above web page and scrap following three items of System Programming videos on the first page:\n",
    "    - Video Lecture Title\n",
    "    - Description\n",
    "    - YouTube Video Link"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "37bc1f91",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b7321757",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "e80a4fed",
   "metadata": {},
   "source": [
    "### a. Scraping Data from the First Page: \n",
    "- http://www.arifbutt.me/category/sp-with-linux/page/1/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e08b6b08",
   "metadata": {},
   "outputs": [],
   "source": [
    "url = 'http://www.arifbutt.me/category/sp-with-linux/'\n",
    "resp = requests.get(url)\n",
    "soup = BeautifulSoup(resp.text, 'lxml')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f9161a78",
   "metadata": {},
   "outputs": [],
   "source": [
    "articles = soup.find_all('div', class_='media-body')\n",
    "articles    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb62b753",
   "metadata": {},
   "outputs": [],
   "source": [
    "article = soup.find('div', class_='media-body')\n",
    "article"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4f0f283d",
   "metadata": {},
   "outputs": [],
   "source": [
    "article.find('h4', class_='media-heading1').text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2d3c8f74",
   "metadata": {},
   "outputs": [],
   "source": [
    "article.find('p', align=\"justify\").text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2966af1f",
   "metadata": {},
   "outputs": [],
   "source": [
    "article.find('iframe').get('src')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "78feb896",
   "metadata": {},
   "outputs": [],
   "source": [
    "article.find('iframe').get('src').split('/')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6d666a96",
   "metadata": {},
   "outputs": [],
   "source": [
    "article.find('iframe').get('src').split('/')[4]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a8f2cdff",
   "metadata": {},
   "outputs": [],
   "source": [
    "article.find('iframe').get('src').split('/')[4].split('?')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6e988eca",
   "metadata": {},
   "outputs": [],
   "source": [
    "video_id = article.find('iframe').get('src').split('/')[4].split('?')[0]\n",
    "video_id"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2861761a",
   "metadata": {},
   "outputs": [],
   "source": [
    "f'https://youtube.com/watch?v={video_id}'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a65e8fa",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a5e9296",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "e9a6197d",
   "metadata": {},
   "source": [
    "### b. Scraping Data from the All the Pages of System Programming: \n",
    "- http://www.arifbutt.me/category/sp-with-linux/page/1/\n",
    "- http://www.arifbutt.me/category/sp-with-linux/page/2/ \n",
    "- http://www.arifbutt.me/category/sp-with-linux/page/3/\n",
    "- http://www.arifbutt.me/category/sp-with-linux/page/4/\n",
    "- http://www.arifbutt.me/category/sp-with-linux/page/5/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "99e5848b",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f10a264",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "801318bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "788ba68e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "07c34f18",
   "metadata": {},
   "outputs": [],
   "source": [
    "def videos(soup):\n",
    "    articles = soup.find_all('div', class_='media-body')\n",
    "    for article in articles:\n",
    "        title = article.find('h4', class_=\"media-heading1\").text\n",
    "        titles.append(title)\n",
    "        \n",
    "        descr = article.find('p', align='justify').text\n",
    "        descriptions.append(descr)\n",
    "\n",
    "        video_id = article.find('iframe')['src'].split('/')[4].split('?')[0]\n",
    "        youtube_link = f'https://youtube.com/watch?v={video_id}'\n",
    "        links.append(youtube_link)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4898f368",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e2029e75",
   "metadata": {},
   "outputs": [],
   "source": [
    "titles = []\n",
    "descriptions = []\n",
    "links=[]\n",
    "\n",
    "first_page = requests.get(\"http://www.arifbutt.me/category/sp-with-linux/\")\n",
    "soup = BeautifulSoup(first_page.text,'lxml')\n",
    "videos(soup)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d04b0449",
   "metadata": {},
   "outputs": [],
   "source": [
    "titles"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c9a45a7",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "78a9a67b",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5bf8d4f",
   "metadata": {},
   "outputs": [],
   "source": [
    "pegination_code = soup.find('div',class_=\"navigation_pegination\")\n",
    "pegination_code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2504563",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "013eae8e",
   "metadata": {},
   "outputs": [],
   "source": [
    "pegination_code = soup.find('div',class_=\"navigation_pegination\") \n",
    "all_links= pegination_code.find_all('li')\n",
    "\n",
    "last_link = None \n",
    "for last_link in all_links:\n",
    "    pass \n",
    "\n",
    "next_url = last_link.find('a').get('href')\n",
    "\n",
    "resp = requests.get(next_url)\n",
    "soup = BeautifulSoup(resp.text,'lxml')\n",
    "videos(soup)\n",
    "print(next_url)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e621617d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ebed7769",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7188a43e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "eb4e5e44",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Title</th>\n",
       "      <th>YouTube Link</th>\n",
       "      <th>Description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Lec01 Introduction to System Programming (Arif...</td>\n",
       "      <td>https://youtube.com/watch?v=qThI-U34KYs</td>\n",
       "      <td>This is the first session on the subject of Sy...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Lec02 C Compilation: A System Programmer Persp...</td>\n",
       "      <td>https://youtube.com/watch?v=a7GhFL0Gh6Y</td>\n",
       "      <td>This session starts with the C-Compilation pro...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Lec03 Working of Linkers: Creating your own Li...</td>\n",
       "      <td>https://youtube.com/watch?v=A67t7X2LUsA</td>\n",
       "      <td>Linking and loading a process (Behind the curt...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Lec04 UNIX make utility (Arif Butt @ PUCIT)</td>\n",
       "      <td>https://youtube.com/watch?v=8hG0MTyyxMI</td>\n",
       "      <td>This session deals with the famous UNIX make u...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Lec05 GNU autotools and cmake (Arif Butt @ PUCIT)</td>\n",
       "      <td>https://youtube.com/watch?v=Ncb_xzjGAwM</td>\n",
       "      <td>This session starts with a brief comparison be...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Lec06 Versioning Systems git-I (Arif Butt @ PU...</td>\n",
       "      <td>https://youtube.com/watch?v=TBqLJg6PmWQ</td>\n",
       "      <td>This session gives an overview of different mo...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Lec07 Versioning Systems git-II (Arif Butt @ P...</td>\n",
       "      <td>https://youtube.com/watch?v=3akXFcBDYc0</td>\n",
       "      <td>This is a continuity of previous session and s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Lec08 Exit Handlers and Resource Limits (Arif ...</td>\n",
       "      <td>https://youtube.com/watch?v=ujzom1OyPMY</td>\n",
       "      <td>This session describes as to how a C program s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Lec09 Stack Behind the Curtain (Arif Butt @ PU...</td>\n",
       "      <td>https://youtube.com/watch?v=1XbTmmWxHzo</td>\n",
       "      <td>This session describes how a process is laid o...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Lec10 Heap Behind the Curtain (Arif Butt @ PUCIT)</td>\n",
       "      <td>https://youtube.com/watch?v=zpcPS27ZQr0</td>\n",
       "      <td>This session start with a discussion on types ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Lec11 Design and Code of UNIX more utility (Ar...</td>\n",
       "      <td>https://youtube.com/watch?v=epefPagPgvk</td>\n",
       "      <td>This session deals with the design and develop...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Lec12 UNIX File System Architecture (Arif Butt...</td>\n",
       "      <td>https://youtube.com/watch?v=x_bu6De71KY</td>\n",
       "      <td>In this session will start with a quick recap ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Lec13 UNIX File Management (Arif Butt @ PUCIT)</td>\n",
       "      <td>https://youtube.com/watch?v=DZQkyoXgkMs</td>\n",
       "      <td>This session will deal with various file relat...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>Lec14 Design and Code of UNIX ls Utility (Arif...</td>\n",
       "      <td>https://youtube.com/watch?v=24WNjxn4asY</td>\n",
       "      <td>This session deals with the designing the ls p...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>Lec15 Design and Code Of UNIX who Utility (Ari...</td>\n",
       "      <td>https://youtube.com/watch?v=96EcaPZo90U</td>\n",
       "      <td>This session deals with the different categori...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>Lec16 Programming Terminal Devices (Arif Butt ...</td>\n",
       "      <td>https://youtube.com/watch?v=t5sC6G73oo4</td>\n",
       "      <td>This session starts with character and block s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>Lec17 Process Management-I (Arif Butt @ PUCIT)</td>\n",
       "      <td>https://youtube.com/watch?v=R_01xGLp0ZQ</td>\n",
       "      <td>This session starts with a quick recap of proc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>Lec18 Process Management-II (Arif Butt @ PUCIT)</td>\n",
       "      <td>https://youtube.com/watch?v=91qzstPN1p8</td>\n",
       "      <td>This session starts with a comparison between ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>Lec19 Process Management-III (Arif Butt @ PUCIT)</td>\n",
       "      <td>https://youtube.com/watch?v=QWfeh1bFvs0</td>\n",
       "      <td>This is a continuation of previous two session...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>Lec20 Design and Code Of Daemon Processes (Ari...</td>\n",
       "      <td>https://youtube.com/watch?v=p0ccoTM7v8I</td>\n",
       "      <td>This session gives an overview of daemon proce...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>Lec21 Process Scheduling Algorithms (Arif Butt...</td>\n",
       "      <td>https://youtube.com/watch?v=Y86pa2nrT_k</td>\n",
       "      <td>This session gives an overview of  process sch...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>Lec22 Design And Code Of UNIX Shell Utility (A...</td>\n",
       "      <td>https://youtube.com/watch?v=F7oAWvh5J_o</td>\n",
       "      <td>This session gives an overview of working of U...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>Lec23 Multi Threaded Programming (Arif Butt @ ...</td>\n",
       "      <td>https://youtube.com/watch?v=OgnLaXwLC8Y</td>\n",
       "      <td>This session gives an overview of concurrent p...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>Lec24 Overview Of UNIX IPC And Signals On The ...</td>\n",
       "      <td>https://youtube.com/watch?v=EX7EWSX8-qM</td>\n",
       "      <td>This session gives an overview of taxonomy of ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>Lec25 Design and Code Of Signal Handlers (Arif...</td>\n",
       "      <td>https://youtube.com/watch?v=YBg9sWw4qbU</td>\n",
       "      <td>This session is a continuation of previous ses...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>Lec26 Programming UNIX Pipes (Arif Butt @ PUCIT)</td>\n",
       "      <td>https://youtube.com/watch?v=VA8FEgahi1Y</td>\n",
       "      <td>This session deals with the concept and use of...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>Lec27 Programming UNIX Named Pipes (Arif Butt ...</td>\n",
       "      <td>https://youtube.com/watch?v=jowB4nuf55c</td>\n",
       "      <td>This session deals with the concept and use of...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>Lec28 Message Queues (Arif Butt @ PUCIT)</td>\n",
       "      <td>https://youtube.com/watch?v=UAbMS3kYV5s</td>\n",
       "      <td>This session deals with the concept and use of...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>Lec29 Programming With Shared Memory (Arif But...</td>\n",
       "      <td>https://youtube.com/watch?v=IzhnAW8u1iQ</td>\n",
       "      <td>This session deals with the concept and use of...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>Lec30 Memory Mapped Files (Arif Butt @ PUCIT)</td>\n",
       "      <td>https://youtube.com/watch?v=z0I1TlqDi50</td>\n",
       "      <td>This session deals with the concept and use of...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <td>Lec31 Synchronization among Threads (Arif Butt...</td>\n",
       "      <td>https://youtube.com/watch?v=SvFr7rPWI3g</td>\n",
       "      <td>This session starts with a quick recap of POSI...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31</th>\n",
       "      <td>Lec32 Programming with POSIX Semaphores (Arif ...</td>\n",
       "      <td>https://youtube.com/watch?v=KupTFYvxRnE</td>\n",
       "      <td>This session starts with introduction to POSIX...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32</th>\n",
       "      <td>Lec33 Overview Of TCPIP Architecture and Servi...</td>\n",
       "      <td>https://youtube.com/watch?v=p5SrRob-bWg</td>\n",
       "      <td>This session starts with introduction to TCP/I...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33</th>\n",
       "      <td>Lec34 Socket Programming Part-I (Arif Butt @ P...</td>\n",
       "      <td>https://youtube.com/watch?v=tk_RpIVbOMQ</td>\n",
       "      <td>This session starts with introduction to Clien...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>Lec35 Socket Programming Part-II (Arif Butt @ ...</td>\n",
       "      <td>https://youtube.com/watch?v=yNUFQaSclmM</td>\n",
       "      <td>This session starts with introduction to Datag...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <td>Lec36 Socket Programming Part-III (Arif Butt @...</td>\n",
       "      <td>https://youtube.com/watch?v=TDRIweWXHe4</td>\n",
       "      <td>This session starts with introduction to UNIX ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36</th>\n",
       "      <td>Lec37 Socket Programming Part-IV (Arif Butt @ ...</td>\n",
       "      <td>https://youtube.com/watch?v=irRkNrruwxc</td>\n",
       "      <td>This session starts with a discussion on concu...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>37</th>\n",
       "      <td>Lec39 Exploiting Buffer Overflow Vulnerability...</td>\n",
       "      <td>https://youtube.com/watch?v=eAzCm0Ncnhg</td>\n",
       "      <td>This is a continuation of Video Session 38. In...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38</th>\n",
       "      <td>Lec38 Exploiting Buffer Overflow Vulnerability...</td>\n",
       "      <td>https://youtube.com/watch?v=3hDNvlIZFQ8</td>\n",
       "      <td>This is a series of three videos, which gives ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39</th>\n",
       "      <td>Lec40 Exploiting Buffer Overflow Vulnerability...</td>\n",
       "      <td>https://youtube.com/watch?v=DayRrBYZRRk</td>\n",
       "      <td>This is a continuation of Video Session 39. In...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                Title  \\\n",
       "0   Lec01 Introduction to System Programming (Arif...   \n",
       "1   Lec02 C Compilation: A System Programmer Persp...   \n",
       "2   Lec03 Working of Linkers: Creating your own Li...   \n",
       "3         Lec04 UNIX make utility (Arif Butt @ PUCIT)   \n",
       "4   Lec05 GNU autotools and cmake (Arif Butt @ PUCIT)   \n",
       "5   Lec06 Versioning Systems git-I (Arif Butt @ PU...   \n",
       "6   Lec07 Versioning Systems git-II (Arif Butt @ P...   \n",
       "7   Lec08 Exit Handlers and Resource Limits (Arif ...   \n",
       "8   Lec09 Stack Behind the Curtain (Arif Butt @ PU...   \n",
       "9   Lec10 Heap Behind the Curtain (Arif Butt @ PUCIT)   \n",
       "10  Lec11 Design and Code of UNIX more utility (Ar...   \n",
       "11  Lec12 UNIX File System Architecture (Arif Butt...   \n",
       "12     Lec13 UNIX File Management (Arif Butt @ PUCIT)   \n",
       "13  Lec14 Design and Code of UNIX ls Utility (Arif...   \n",
       "14  Lec15 Design and Code Of UNIX who Utility (Ari...   \n",
       "15  Lec16 Programming Terminal Devices (Arif Butt ...   \n",
       "16     Lec17 Process Management-I (Arif Butt @ PUCIT)   \n",
       "17    Lec18 Process Management-II (Arif Butt @ PUCIT)   \n",
       "18   Lec19 Process Management-III (Arif Butt @ PUCIT)   \n",
       "19  Lec20 Design and Code Of Daemon Processes (Ari...   \n",
       "20  Lec21 Process Scheduling Algorithms (Arif Butt...   \n",
       "21  Lec22 Design And Code Of UNIX Shell Utility (A...   \n",
       "22  Lec23 Multi Threaded Programming (Arif Butt @ ...   \n",
       "23  Lec24 Overview Of UNIX IPC And Signals On The ...   \n",
       "24  Lec25 Design and Code Of Signal Handlers (Arif...   \n",
       "25   Lec26 Programming UNIX Pipes (Arif Butt @ PUCIT)   \n",
       "26  Lec27 Programming UNIX Named Pipes (Arif Butt ...   \n",
       "27           Lec28 Message Queues (Arif Butt @ PUCIT)   \n",
       "28  Lec29 Programming With Shared Memory (Arif But...   \n",
       "29      Lec30 Memory Mapped Files (Arif Butt @ PUCIT)   \n",
       "30  Lec31 Synchronization among Threads (Arif Butt...   \n",
       "31  Lec32 Programming with POSIX Semaphores (Arif ...   \n",
       "32  Lec33 Overview Of TCPIP Architecture and Servi...   \n",
       "33  Lec34 Socket Programming Part-I (Arif Butt @ P...   \n",
       "34  Lec35 Socket Programming Part-II (Arif Butt @ ...   \n",
       "35  Lec36 Socket Programming Part-III (Arif Butt @...   \n",
       "36  Lec37 Socket Programming Part-IV (Arif Butt @ ...   \n",
       "37  Lec39 Exploiting Buffer Overflow Vulnerability...   \n",
       "38  Lec38 Exploiting Buffer Overflow Vulnerability...   \n",
       "39  Lec40 Exploiting Buffer Overflow Vulnerability...   \n",
       "\n",
       "                               YouTube Link  \\\n",
       "0   https://youtube.com/watch?v=qThI-U34KYs   \n",
       "1   https://youtube.com/watch?v=a7GhFL0Gh6Y   \n",
       "2   https://youtube.com/watch?v=A67t7X2LUsA   \n",
       "3   https://youtube.com/watch?v=8hG0MTyyxMI   \n",
       "4   https://youtube.com/watch?v=Ncb_xzjGAwM   \n",
       "5   https://youtube.com/watch?v=TBqLJg6PmWQ   \n",
       "6   https://youtube.com/watch?v=3akXFcBDYc0   \n",
       "7   https://youtube.com/watch?v=ujzom1OyPMY   \n",
       "8   https://youtube.com/watch?v=1XbTmmWxHzo   \n",
       "9   https://youtube.com/watch?v=zpcPS27ZQr0   \n",
       "10  https://youtube.com/watch?v=epefPagPgvk   \n",
       "11  https://youtube.com/watch?v=x_bu6De71KY   \n",
       "12  https://youtube.com/watch?v=DZQkyoXgkMs   \n",
       "13  https://youtube.com/watch?v=24WNjxn4asY   \n",
       "14  https://youtube.com/watch?v=96EcaPZo90U   \n",
       "15  https://youtube.com/watch?v=t5sC6G73oo4   \n",
       "16  https://youtube.com/watch?v=R_01xGLp0ZQ   \n",
       "17  https://youtube.com/watch?v=91qzstPN1p8   \n",
       "18  https://youtube.com/watch?v=QWfeh1bFvs0   \n",
       "19  https://youtube.com/watch?v=p0ccoTM7v8I   \n",
       "20  https://youtube.com/watch?v=Y86pa2nrT_k   \n",
       "21  https://youtube.com/watch?v=F7oAWvh5J_o   \n",
       "22  https://youtube.com/watch?v=OgnLaXwLC8Y   \n",
       "23  https://youtube.com/watch?v=EX7EWSX8-qM   \n",
       "24  https://youtube.com/watch?v=YBg9sWw4qbU   \n",
       "25  https://youtube.com/watch?v=VA8FEgahi1Y   \n",
       "26  https://youtube.com/watch?v=jowB4nuf55c   \n",
       "27  https://youtube.com/watch?v=UAbMS3kYV5s   \n",
       "28  https://youtube.com/watch?v=IzhnAW8u1iQ   \n",
       "29  https://youtube.com/watch?v=z0I1TlqDi50   \n",
       "30  https://youtube.com/watch?v=SvFr7rPWI3g   \n",
       "31  https://youtube.com/watch?v=KupTFYvxRnE   \n",
       "32  https://youtube.com/watch?v=p5SrRob-bWg   \n",
       "33  https://youtube.com/watch?v=tk_RpIVbOMQ   \n",
       "34  https://youtube.com/watch?v=yNUFQaSclmM   \n",
       "35  https://youtube.com/watch?v=TDRIweWXHe4   \n",
       "36  https://youtube.com/watch?v=irRkNrruwxc   \n",
       "37  https://youtube.com/watch?v=eAzCm0Ncnhg   \n",
       "38  https://youtube.com/watch?v=3hDNvlIZFQ8   \n",
       "39  https://youtube.com/watch?v=DayRrBYZRRk   \n",
       "\n",
       "                                          Description  \n",
       "0   This is the first session on the subject of Sy...  \n",
       "1   This session starts with the C-Compilation pro...  \n",
       "2   Linking and loading a process (Behind the curt...  \n",
       "3   This session deals with the famous UNIX make u...  \n",
       "4   This session starts with a brief comparison be...  \n",
       "5   This session gives an overview of different mo...  \n",
       "6   This is a continuity of previous session and s...  \n",
       "7   This session describes as to how a C program s...  \n",
       "8   This session describes how a process is laid o...  \n",
       "9   This session start with a discussion on types ...  \n",
       "10  This session deals with the design and develop...  \n",
       "11  In this session will start with a quick recap ...  \n",
       "12  This session will deal with various file relat...  \n",
       "13  This session deals with the designing the ls p...  \n",
       "14  This session deals with the different categori...  \n",
       "15  This session starts with character and block s...  \n",
       "16  This session starts with a quick recap of proc...  \n",
       "17  This session starts with a comparison between ...  \n",
       "18  This is a continuation of previous two session...  \n",
       "19  This session gives an overview of daemon proce...  \n",
       "20  This session gives an overview of  process sch...  \n",
       "21  This session gives an overview of working of U...  \n",
       "22  This session gives an overview of concurrent p...  \n",
       "23  This session gives an overview of taxonomy of ...  \n",
       "24  This session is a continuation of previous ses...  \n",
       "25  This session deals with the concept and use of...  \n",
       "26  This session deals with the concept and use of...  \n",
       "27  This session deals with the concept and use of...  \n",
       "28  This session deals with the concept and use of...  \n",
       "29  This session deals with the concept and use of...  \n",
       "30  This session starts with a quick recap of POSI...  \n",
       "31  This session starts with introduction to POSIX...  \n",
       "32  This session starts with introduction to TCP/I...  \n",
       "33  This session starts with introduction to Clien...  \n",
       "34  This session starts with introduction to Datag...  \n",
       "35  This session starts with introduction to UNIX ...  \n",
       "36  This session starts with a discussion on concu...  \n",
       "37  This is a continuation of Video Session 38. In...  \n",
       "38  This is a series of three videos, which gives ...  \n",
       "39  This is a continuation of Video Session 39. In...  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "import pandas as pd\n",
    "\n",
    "titles = []\n",
    "descriptions = []\n",
    "links=[]\n",
    "\n",
    "first_page = requests.get(\"http://www.arifbutt.me/category/sp-with-linux/\")\n",
    "soup = BeautifulSoup(first_page.text,'lxml')\n",
    "videos(soup)\n",
    "\n",
    "\n",
    "while True:\n",
    "    pegination_code = soup.find('div',class_=\"navigation_pegination\") \n",
    "    all_links= pegination_code.find_all('li')\n",
    "\n",
    "    last_link = None \n",
    "    for last_link in all_links:\n",
    "        pass \n",
    "    if(last_link.find('a').text == \"Next Page »\"):\n",
    "        next_url = last_link.find('a').get('href')\n",
    "        resp = requests.get(next_url)\n",
    "        soup = BeautifulSoup(resp.text,'lxml')\n",
    "        videos(soup)\n",
    "    else:\n",
    "        break;    \n",
    "\n",
    "\n",
    "# Creating a dataframe and saving data in a csv file\n",
    "data = {'Title':titles, 'YouTube Link':links, 'Description':descriptions}\n",
    "df = pd.DataFrame(data, columns=['Title', 'YouTube Link', 'Description'])\n",
    "df.to_csv('spvideos.csv', index=False)\n",
    "df = pd.read_csv('spvideos.csv')\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e54adf16",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ceb4b945",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7289ce9",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "ebdc6439",
   "metadata": {},
   "source": [
    "## 6. Limitations of Requests and BeautifulSoup Library\n",
    "- When you load up a website you want to scrape using your browser, the browser will make a request to the page's server to retrieve the page content. That's usually some HTML code, some CSS, and some JavaScript.\n",
    "- A key difference between loading the page using your browser and getting the page contents using requests is that your browser executes any JavaScript code that the page comes with. Sometimes you will see the initial page content (before the JavaScript runs) for a few moments, and then the JavaScript kicks in.\n",
    "- It's a very frequent problem in my courses to see this happen. Unfortunately, the only way to get the page after JavaScript has ran is, well, running the JavaScript. You need a JavaScript engine in order to do that. That means you need a browser or browser-like program in order to get the final page.\n",
    "- Solution:\n",
    "    - Selenium is a browser automation tool, which means you can use Selenium to control a browser. You can make Selenium load the page you're interested in, evaluate the JavaScript, and then get the page content.\n",
    "    - requests-html is another library that will let you evaluate the JavaScript after you've retrieved the page. It uses requests to get the page content, and then runs the page through the Chrome browser engine (Chromium) in order to \"calculate\" the final page. However, it's still very much under active development and I've had a few problems with it."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "276b986f",
   "metadata": {},
   "source": [
    "### a. You cannot Scrap JavaScript Driven Websites using BeautifulSoup (https://arifpucit.github.io/bss2/js)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63ef9ff7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "resp = requests.get(\"https://arifpucit.github.io/bss2/\")\n",
    "soup = BeautifulSoup(resp.text,'lxml')\n",
    "price = soup.find_all('p', class_='price green')\n",
    "price"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a0791a75",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a38f2153",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "resp = requests.get(\"https://arifpucit.github.io/bss2/js\")\n",
    "soup = BeautifulSoup(resp.text,'lxml')\n",
    "prices = soup.find_all('p', class_='price green')\n",
    "prices"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "31149cce",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5731a097",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24384460",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "0bb75658",
   "metadata": {},
   "source": [
    "### b. You cannot enter Text and click buttons using BeautifulSoup (https://arifpucit.github.io/bss2/login)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d61a2408",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "resp = requests.get(\"https://arifpucit.github.io/bss2/login/\")\n",
    "soup = BeautifulSoup(resp.text,'lxml')\n",
    "prices = soup.find_all('p', class_='green')\n",
    "prices"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eb14fd81",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ccc5a38e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dfcb2b9f",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}