"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.113870Z",
"start_time": "2021-04-21T21:10:52.703140Z"
}
},
"outputs": [],
"source": [
"response = requests.get(url)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3\n",
"> **We collect the html from the website by adding `.text` to the end of our `response` variable.** "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.371878Z",
"start_time": "2021-04-21T21:10:53.368038Z"
}
},
"outputs": [
],
"source": [
"html = response.text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We use `BeautifulSoup` to turn the html into something we can manipulate.**"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.408867Z",
"start_time": "2021-04-21T21:10:53.373516Z"
}
},
"outputs": [
],
"source": [
"soup = BeautifulSoup(html, 'lxml')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Fortunately for us,** we do not have to read through every line of the html in order to scrape it. \n",
"\n",
"Modern-day web browsers come equipped with tools that let us easily sift through this soupy text.\n",
"\n",
"\n",
"## Step 4\n",
">**We open up the page we're trying to scrape in a new tab.** [Click Here!](http://quotes.toscrape.com) 👀"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 5\n",
" > **We create a list of every `div` that has a `class` of \"quote\"**\n",
"\n",
"**In this instance,** every item we want to collect is divided into identically labeled containers.\n",
"- A div with a class of 'quote'.\n",
"\n",
"Not all HTML is as well organized as this page, but HTML is basically just a bunch of different organizational containers that we use to divide up text and other forms of media. Figuring out a website's organizational structure is most of the work of web scraping. "
]
},
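To make the idea of "identically labeled containers" concrete, here is a sketch using only the standard library's `html.parser` on hand-written sample markup (not the site's exact HTML): it counts every `div` whose `class` is `quote`, just like `find_all` will below.

```python
from html.parser import HTMLParser

# A hand-written miniature of the page's structure -- not the site's
# exact markup -- showing quotes as identically labeled containers.
HTML = """
<div class="quote"><span class="text">Quote one</span></div>
<div class="quote"><span class="text">Quote two</span></div>
"""

class QuoteCounter(HTMLParser):
    """Counts every <div class="quote"> it encounters."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == 'div' and ('class', 'quote') in attrs:
            self.count += 1

parser = QuoteCounter()
parser.feed(HTML)
print(parser.count)  # 2
```

`BeautifulSoup` does the same walk for us, which is why `find_all('div', {'class': 'quote'})` is all we need.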
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.415596Z",
"start_time": "2021-04-21T21:10:53.410245Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"10"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"quote_divs = soup.find_all('div', {'class': 'quote'})\n",
"len(quote_divs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 6\n",
"> **To figure out how to grab all the datapoints from a quote, we isolate a single quote, and work through the code for a single `div`.**"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.419751Z",
"start_time": "2021-04-21T21:10:53.417109Z"
}
},
"outputs": [],
"source": [
"first_quote = quote_divs[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### First we grab the text of the quote"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.426181Z",
"start_time": "2021-04-21T21:10:53.422305Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text = first_quote.find('span', {'class':'text'})\n",
"text = text.text\n",
"text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Next we grab the author"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.431920Z",
"start_time": "2021-04-21T21:10:53.428068Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'Albert Einstein'"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"author = first_quote.find('small', {'class': 'author'})\n",
"author_name = author.text\n",
"author_name"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Let's also grab the link that points to the author's bio!"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.437215Z",
"start_time": "2021-04-21T21:10:53.433319Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'http://quotes.toscrape.com/author/Albert-Einstein'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"author_link = author.findNextSibling().attrs.get('href')\n",
"author_link = url + author_link\n",
"author_link"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### And finally, let's grab all of the tags for the quote"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.443153Z",
"start_time": "2021-04-21T21:10:53.438609Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['change', 'deep-thoughts', 'thinking', 'world']"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tag_container = first_quote.find('div', {'class': 'tags'})\n",
"tag_links = tag_container.find_all('a')\n",
"\n",
"tags = []\n",
"for tag in tag_links:\n",
" tags.append(tag.text)\n",
" \n",
"tags"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Our data:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.448574Z",
"start_time": "2021-04-21T21:10:53.444499Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"text: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”\n",
"author name: Albert Einstein\n",
"author link: http://quotes.toscrape.com/author/Albert-Einstein\n",
"tags: ['change', 'deep-thoughts', 'thinking', 'world']\n"
]
}
],
"source": [
"print('text:', text)\n",
"print('author name: ', author_name)\n",
"print('author link: ', author_link)\n",
"print('tags: ', tags)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 7\n",
"> **We create a function to make our code reusable.**\n",
"\n",
"Now that we know how to collect this data from a quote div, we can compile the code into a [function](https://www.geeksforgeeks.org/functions-in-python/) called `quote_data`. This allows us to grab a quote div, feed it into the function like so...\n",
"> `quote_data(quote_div)`\n",
"\n",
"...and receive all of the data from that div."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.455650Z",
"start_time": "2021-04-21T21:10:53.450078Z"
}
},
"outputs": [],
"source": [
"def quote_data(quote_div):\n",
" # Collect the quote\n",
" text = quote_div.find('span', {'class':'text'})\n",
" text = text.text\n",
" \n",
" # Collect the author name\n",
" author = quote_div.find('small', {'class': 'author'})\n",
" author_name = author.text\n",
" \n",
" # Collect author link\n",
" author_link = author.findNextSibling().attrs.get('href')\n",
" author_link = url + author_link\n",
" \n",
" # Collect tags\n",
" tag_container = quote_div.find('div', {'class': 'tags'})\n",
"\n",
" tag_links = tag_container.find_all('a')\n",
"\n",
" tags = []\n",
" for tag in tag_links:\n",
" tags.append(tag.text)\n",
" \n",
" # Return data as a dictionary\n",
" return {'author': author_name,\n",
" 'text': text,\n",
" 'author_link': author_link,\n",
" 'tags': tags}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Let's test our function."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.460722Z",
"start_time": "2021-04-21T21:10:53.457107Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'author': 'Thomas A. Edison',\n",
" 'text': \"“I have not failed. I've just found 10,000 ways that won't work.”\",\n",
" 'author_link': 'http://quotes.toscrape.com/author/Thomas-A-Edison',\n",
" 'tags': ['edison', 'failure', 'inspirational', 'paraphrased']}"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"quote_data(quote_divs[7])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can collect the data from every quote on the first page with a simple [```for loop```](https://www.w3schools.com/python/python_for_loops.asp)!"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.468852Z",
"start_time": "2021-04-21T21:10:53.462096Z"
},
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10 quotes scraped!\n"
]
},
{
"data": {
"text/plain": [
"[{'author': 'Albert Einstein',\n",
" 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',\n",
" 'author_link': 'http://quotes.toscrape.com/author/Albert-Einstein',\n",
" 'tags': ['change', 'deep-thoughts', 'thinking', 'world']},\n",
" {'author': 'J.K. Rowling',\n",
" 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',\n",
" 'author_link': 'http://quotes.toscrape.com/author/J-K-Rowling',\n",
" 'tags': ['abilities', 'choices']}]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"page_one_data = []\n",
"for div in quote_divs:\n",
" # Apply our function on each quote\n",
" data_from_div = quote_data(div)\n",
" page_one_data.append(data_from_div)\n",
" \n",
"print(len(page_one_data), 'quotes scraped!')\n",
"page_one_data[:2]"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2020-07-03T18:55:14.762708Z",
"start_time": "2020-07-03T18:55:14.758667Z"
}
},
"source": [
"# We just scraped an entire webpage!\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Level Up: What if we wanted to scrape the quotes from *every* page?\n",
"\n",
"**Step 1:** The first thing we do is take the code from above that scraped the data for all of the quotes on page one, and move it into a function called `scrape_page`."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.473642Z",
"start_time": "2021-04-21T21:10:53.470305Z"
}
},
"outputs": [],
"source": [
"def scrape_page(quote_divs):\n",
" data = []\n",
" for div in quote_divs:\n",
" div_data = quote_data(div)\n",
" data.append(div_data)\n",
" \n",
" return data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Step 2:** We grab the code we used at the very beginning to collect the html and the list of divs from a web page."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.601521Z",
"start_time": "2021-04-21T21:10:53.475074Z"
}
},
"outputs": [],
"source": [
"url = 'http://quotes.toscrape.com'\n",
"response = requests.get(url)\n",
"html = response.text\n",
"soup = BeautifulSoup(html, 'lxml')\n",
"quote_divs = soup.find_all('div', {'class': 'quote'})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Step 3:** We feed all of the `quote_divs` into our newly made `scrape_page` function to grab all of the data from that page."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.606735Z",
"start_time": "2021-04-21T21:10:53.602884Z"
}
},
"outputs": [],
"source": [
"data = scrape_page(quote_divs)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.611866Z",
"start_time": "2021-04-21T21:10:53.608230Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[{'author': 'Albert Einstein',\n",
" 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',\n",
" 'author_link': 'http://quotes.toscrape.com/author/Albert-Einstein',\n",
" 'tags': ['change', 'deep-thoughts', 'thinking', 'world']},\n",
" {'author': 'J.K. Rowling',\n",
" 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',\n",
" 'author_link': 'http://quotes.toscrape.com/author/J-K-Rowling',\n",
" 'tags': ['abilities', 'choices']}]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[:2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Step 4:** We check to see if there is a `Next Page` button at the bottom of the page.\n",
"\n",
"*This requires multiple steps.*\n",
"\n",
"1. We grab the outer container that has a class of `pager`."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.616414Z",
"start_time": "2021-04-21T21:10:53.613232Z"
}
},
"outputs": [],
"source": [
"pager = soup.find('ul', {'class': 'pager'})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If there is no pager element on the webpage, `pager` will be set to `None`.\n",
"\n",
"2. We use an `if` check to make sure the pager exists:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.620316Z",
"start_time": "2021-04-21T21:10:53.617767Z"
}
},
"outputs": [],
"source": [
"if pager:\n",
" next_page = pager.find('li', {'class': 'next'})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. We then check to see if a `Next Page` button exists on the page. \n",
"\n",
"    - Every page has a next button except the *last* page, which only has a `Previous Page` button. If the `Next` button exists, we \"click\" it."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.624558Z",
"start_time": "2021-04-21T21:10:53.621819Z"
}
},
"outputs": [],
"source": [
"if next_page:\n",
" next_page = next_page.findChild('a')\\\n",
" .attrs\\\n",
" .get('href')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With most web scraping tools, \"clicking a button\" means collecting the link inside the button and making a new request.\n",
"\n",
"If a link points to a page on the same website, it is usually just a path that needs to be appended to the site's base url. This is called a `relative` link.\n",
"\n",
"**Step 5:** Collect the relative link that points to the next page, and add it to our base `url`"
]
},
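As an aside, the standard library's `urllib.parse.urljoin` performs this same join and is more forgiving about stray or missing slashes (a sketch; the notebook itself simply concatenates the strings):

```python
from urllib.parse import urljoin

base_url = 'http://quotes.toscrape.com'

# urljoin resolves a relative href against the base url,
# normalizing the slash between the two parts either way.
print(urljoin(base_url, '/page/2/'))        # http://quotes.toscrape.com/page/2/
print(urljoin(base_url + '/', 'page/2/'))   # http://quotes.toscrape.com/page/2/
```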
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.630467Z",
"start_time": "2021-04-21T21:10:53.627259Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'http://quotes.toscrape.com/page/2/'"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"next_page = url + next_page\n",
"next_page"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Step 6:** We repeat the exact same process for this new link!\n",
"\n",
"ie:\n",
"1. Make request using a url that points to the next page.\n",
"2. Scrape quote divs\n",
"3. Collect data from every quote div on that page\n",
"4. Find the `Next page` button.\n",
"5. Collect the url from the button\n",
"6. Repeat\n",
"\n",
"So how do we do this over and over again without repeating ourselves?\n",
"\n",
"The first step is to compile all of these steps into a new function called `scrape_quotes`.\n",
"\n",
"The second step uses something called `recursion`. \n",
"\n",
"| \n", " | author | \n", "text | \n", "author_link | \n", "tags | \n", "
|---|---|---|---|---|
| 0 | \n", "Albert Einstein | \n", "“The world as we have created it is a process ... | \n", "http://quotes.toscrape.com/author/Albert-Einstein | \n", "[change, deep-thoughts, thinking, world] | \n", "
| 1 | \n", "J.K. Rowling | \n", "“It is our choices, Harry, that show what we t... | \n", "http://quotes.toscrape.com/author/J-K-Rowling | \n", "[abilities, choices] | \n", "
| 2 | \n", "Albert Einstein | \n", "“There are only two ways to live your life. On... | \n", "http://quotes.toscrape.com/author/Albert-Einstein | \n", "[inspirational, life, live, miracle, miracles] | \n", "
| 3 | \n", "Jane Austen | \n", "“The person, be it gentleman or lady, who has ... | \n", "http://quotes.toscrape.com/author/Jane-Austen | \n", "[aliteracy, books, classic, humor] | \n", "
| 4 | \n", "Marilyn Monroe | \n", "“Imperfection is beauty, madness is genius and... | \n", "http://quotes.toscrape.com/author/Marilyn-Monroe | \n", "[be-yourself, inspirational] | \n", "
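The six looping steps above can be sketched as a recursive function. This is only an outline of the control flow: `fetch_page` is a hypothetical stand-in for the real request-and-parse work (`requests`, `BeautifulSoup`, `scrape_page`), reading from an in-memory dict so the recursion itself is easy to follow.

```python
# Stand-in "site": each page lists its quotes and the relative
# link behind its Next Page button (None on the last page).
PAGES = {
    '/page/1/': {'quotes': ['q1', 'q2'], 'next': '/page/2/'},
    '/page/2/': {'quotes': ['q3'], 'next': None},
}

def fetch_page(path):
    # In the real notebook this would be requests.get + BeautifulSoup.
    return PAGES[path]

def scrape_quotes(path, data=None):
    if data is None:
        data = []
    page = fetch_page(path)            # 1. request the page
    data.extend(page['quotes'])        # 2-3. collect every quote's data
    next_path = page['next']           # 4-5. look for a Next Page link
    if next_path:
        return scrape_quotes(next_path, data)  # 6. repeat on the next page
    return data

print(scrape_quotes('/page/1/'))  # ['q1', 'q2', 'q3']
```

Each call scrapes one page and then calls itself on the next page's url, stopping when no `Next Page` button is found.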