{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Demo: How to scrape multiple things from multiple pages\n",
    "\n",
    "The goal is to scrape info about the **five top-grossing movies** for each year, for 10 years. I want the title and rank of the movie, and also, how much money did it gross at the box office. In the end I will put the scraped data into a CSV file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "import requests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = 'https://www.boxofficemojo.com/year/2018/'\n",
    "page = requests.get(url)\n",
    "soup = BeautifulSoup(page.text, 'html.parser')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using [Developer Tools](https://developers.google.com/web/tools/chrome-devtools#elements), I discover the data I want is in an HTML **table.** I also discover that it is the only table on the page.\n",
    "\n",
    "I store it in a variable named `table`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "table = soup.find( 'table' )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I use trial-and-error testing with `print()` to discover whether I can get row and cell data cleanly from the table. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<tr><td class=\"a-text-right mojo-header-column mojo-truncate mojo-field-type-rank mojo-sort-column\">1</td><td class=\"a-text-left mojo-field-type-release mojo-cell-wide\"><a class=\"a-link-normal\" href=\"/release/rl2992866817/?ref_=bo_yld_table_1\">Black Panther</a></td><td class=\"a-text-left mojo-field-type-genre hidden\">-</td><td class=\"a-text-right mojo-field-type-money hidden\">-</td><td class=\"a-text-right mojo-field-type-duration hidden\">-</td><td class=\"a-text-right mojo-field-type-money mojo-estimatable\">$700,059,566</td><td class=\"a-text-right mojo-field-type-positive_integer\">4,084</td><td class=\"a-text-right mojo-field-type-money mojo-estimatable\">$700,059,566</td><td class=\"a-text-left mojo-field-type-date a-nowrap\">Feb 16</td><td class=\"a-text-left mojo-field-type-studio\"><a class=\"a-link-normal\" href=\"https://pro.imdb.com/company/co0226183/boxoffice/?view=releases&amp;ref_=mojo_yld_table_1&amp;rf=mojo_yld_table_1\" rel=\"noopener\" target=\"_blank\">Walt Disney Studios Motion Pictures<svg class=\"mojo-new-window-svg\" viewbox=\"0 0 32 32\" xmlns=\"http://www.w3.org/2000/svg\">\n",
      "<path d=\"M24,15.57251l3,3V23.5A3.50424,3.50424,0,0,1,23.5,27H8.5A3.50424,3.50424,0,0,1,5,23.5V8.5A3.50424,3.50424,0,0,1,8.5,5h4.92755l3,3H8.5a.50641.50641,0,0,0-.5.5v15a.50641.50641,0,0,0,.5.5h15a.50641.50641,0,0,0,.5-.5ZM19.81952,8.56372,12.8844,17.75a.49989.49989,0,0,0,.04547.65479l.66534.66528a.49983.49983,0,0,0,.65479.04553l9.18628-6.93518,2.12579,2.12585a.5.5,0,0,0,.84741-.27526l1.48273-9.35108a.50006.50006,0,0,0-.57214-.57214L17.969,5.59058a.5.5,0,0,0-.27526.84741Z\"></path>\n",
      "</svg></a></td><td class=\"a-text-right mojo-field-type-boolean hidden\">false</td></tr>\n",
      "Black Panther\n"
     ]
    }
   ],
   "source": [
    "# get all the rows from that one table\n",
    "rows = table.find_all('tr')\n",
    "# some more trial-and-error testing to find out which row holds the first movie\n",
    "print(rows[1])\n",
    "# now that I have the right row, get all the cells in that row\n",
    "cells = rows[1].find_all('td')\n",
    "# see whether I can print the movie title cleanly\n",
    "title = cells[1].text\n",
    "print(title)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next I try a for-loop to see if I can cleanly get the first five movies in the table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Black Panther\n",
      "Avengers: Infinity War\n",
      "Incredibles 2\n",
      "Jurassic World: Fallen Kingdom\n",
      "Deadpool 2\n"
     ]
    }
   ],
   "source": [
    "# get top 5 movies on this page - I know the first row is [1]\n",
    "for i in range(1, 6):\n",
    "    cells = rows[i].find_all('td')\n",
    "    title = cells[1].text\n",
    "    print(title)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Try a similar for-loop to get total gross for the top five movies. Developer Tools show me this value is in the eighth cell in each row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "$700,059,566\n",
      "$678,815,482\n",
      "$608,581,744\n",
      "$417,719,760\n",
      "$318,491,426\n"
     ]
    }
   ],
   "source": [
    "# I would like to get the total gross number also\n",
    "for i in range(1, 6):\n",
    "    cells = rows[i].find_all('td')\n",
    "    gross = cells[7].text\n",
    "    print(gross)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now I test getting all the values I want from each row, and it works!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 Black Panther $700,059,566\n",
      "2 Avengers: Infinity War $678,815,482\n",
      "3 Incredibles 2 $608,581,744\n",
      "4 Jurassic World: Fallen Kingdom $417,719,760\n",
      "5 Deadpool 2 $318,491,426\n"
     ]
    }
   ],
   "source": [
    "# next I want to get rank (1-5), title and gross all on one line\n",
    "for i in range(1, 6):\n",
    "    cells = rows[i].find_all('td')\n",
    "    print(cells[0].text, cells[1].text, cells[7].text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I want this same data for each of 10 years, so first I will create list of the years I want."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010]\n"
     ]
    }
   ],
   "source": [
    "# create a list of the 10 years I want\n",
    "years = []\n",
    "start = 2019\n",
    "for i in range(0, 10):\n",
    "    years.append(start - i)\n",
    "print(years)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Still prepping for the 10 years, I create a base URL to use when I open each year's page."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://www.boxofficemojo.com/year/2019/\n"
     ]
    }
   ],
   "source": [
    "# create base url\n",
    "base_url = 'https://www.boxofficemojo.com/year/'\n",
    "# test it\n",
    "# print(base_url + years[0] + '/') -- ERROR!\n",
    "print( base_url + str(years[0]) + '/')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now I *should* have all the pieces I need ... I will test the code with a print statement --"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 Avengers: Endgame $858,373,000\n",
      "2 The Lion King $543,638,043\n",
      "3 Toy Story 4 $434,038,008\n",
      "4 Frozen II $470,089,732\n",
      "5 Captain Marvel $426,829,839\n",
      "1 Black Panther $700,059,566\n",
      "2 Avengers: Infinity War $678,815,482\n",
      "3 Incredibles 2 $608,581,744\n",
      "4 Jurassic World: Fallen Kingdom $417,719,760\n",
      "5 Deadpool 2 $318,491,426\n",
      "1 Star Wars: Episode VIII - The Last Jedi $620,181,382\n",
      "2 Beauty and the Beast $504,014,165\n",
      "3 Wonder Woman $412,563,408\n",
      "4 Guardians of the Galaxy Vol. 2 $389,813,101\n",
      "5 Spider-Man: Homecoming $334,201,140\n",
      "1 Finding Dory $486,295,561\n",
      "2 Rogue One: A Star Wars Story $532,177,324\n",
      "3 Captain America: Civil War $408,084,349\n",
      "4 The Secret Life of Pets $368,384,330\n",
      "5 The Jungle Book $364,001,123\n",
      "1 Jurassic World $652,270,625\n",
      "2 Star Wars: Episode VII - The Force Awakens $936,662,225\n",
      "3 Avengers: Age of Ultron $459,005,868\n",
      "4 Inside Out $356,461,711\n",
      "5 Furious 7 $353,007,020\n",
      "1 Guardians of the Galaxy $333,176,600\n",
      "2 The Hunger Games: Mockingjay - Part 1 $337,135,885\n",
      "3 Captain America: The Winter Soldier $259,766,572\n",
      "4 The Lego Movie $257,760,692\n",
      "5 Transformers: Age of Extinction $245,439,076\n",
      "1 Iron Man 3 $409,013,994\n",
      "2 The Hunger Games: Catching Fire $424,668,047\n",
      "3 Despicable Me 2 $368,065,385\n",
      "4 Man of Steel $291,045,518\n",
      "5 Monsters University $268,492,764\n",
      "1 The Avengers $623,357,910\n",
      "2 The Dark Knight Rises $448,139,099\n",
      "3 The Hunger Games $408,010,692\n",
      "4 Skyfall $304,360,277\n",
      "5 The Twilight Saga: Breaking Dawn - Part 2 $292,324,737\n",
      "1 Harry Potter and the Deathly Hallows: Part 2 $381,011,219\n",
      "2 Transformers: Dark of the Moon $352,390,543\n",
      "3 The Twilight Saga: Breaking Dawn - Part 1 $281,287,133\n",
      "4 The Hangover Part II $254,464,305\n",
      "5 Pirates of the Caribbean: On Stranger Tides $241,071,802\n",
      "1 Avatar $749,766,139\n",
      "2 Toy Story 3 $415,004,880\n",
      "3 Alice in Wonderland $334,191,110\n",
      "4 Iron Man 2 $312,433,331\n",
      "5 The Twilight Saga: Eclipse $300,531,751\n"
     ]
    }
   ],
   "source": [
    "# collect all necessary pieces (tested above) to make a loop that gets \n",
    "# top 5 movies for each of the 10 years in my list of years\n",
    "\n",
    "for year in years:\n",
    "    url = base_url + str(year) + '/'\n",
    "    page = requests.get(url)\n",
    "    soup = BeautifulSoup(page.text, 'html.parser')\n",
    "    table = soup.find( 'table' )\n",
    "    rows = table.find_all('tr')\n",
    "    for i in range(1, 6):\n",
    "        cells = rows[i].find_all('td')\n",
    "        print(cells[0].text, cells[1].text, cells[7].text)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When I see the result, I realize I need to make two adjustments.\n",
    "\n",
    "1. Each line needs to have the year also\n",
    "2. Maybe I should clean the gross so it's a pure integer\n",
    "\n",
    "I can get rid of the dollar sign and the commas with a combination of two string methods -- \n",
    "`.strip()` and `.replace()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "293004164\n"
     ]
    }
   ],
   "source": [
    "# testing the clean-up code\n",
    "\n",
    "num = '$293,004,164'\n",
    "print(num.strip('$').replace(',', ''))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2017 1 Star Wars: Episode VIII - The Last Jedi 620181382\n",
      "2017 2 Beauty and the Beast 504014165\n",
      "2017 3 Wonder Woman 412563408\n",
      "2017 4 Guardians of the Galaxy Vol. 2 389813101\n",
      "2017 5 Spider-Man: Homecoming 334201140\n",
      "2014 1 Guardians of the Galaxy 333176600\n",
      "2014 2 The Hunger Games: Mockingjay - Part 1 337135885\n",
      "2014 3 Captain America: The Winter Soldier 259766572\n",
      "2014 4 The Lego Movie 257760692\n",
      "2014 5 Transformers: Age of Extinction 245439076\n"
     ]
    }
   ],
   "source": [
    "# testing a way to add the year to each line, using a list with only two years in it to save time\n",
    "\n",
    "miniyears = [2017, 2014]\n",
    "for year in miniyears:\n",
    "    url = base_url + str(year) + '/'\n",
    "    page = requests.get(url)\n",
    "    soup = BeautifulSoup(page.text, 'html.parser')\n",
    "    table = soup.find( 'table' )\n",
    "    rows = table.find_all('tr')\n",
    "    for i in range(1, 6):\n",
    "        cells = rows[i].find_all('td')\n",
    "        gross = cells[7].text.strip('$').replace(',', '')\n",
    "        print(year, cells[0].text, cells[1].text, gross)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that I know it all works, I want to save the data in a CSV file. \n",
    "\n",
    "Python has a handy **built-in module** for reading and writing CSVs. We need to import it before we can use it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The CSV is done!\n"
     ]
    }
   ],
   "source": [
    "import csv\n",
    "\n",
    "# open new file for writing - this creates the file\n",
    "csvfile = open(\"movies.csv\", 'w', newline='', encoding='utf-8')\n",
    "\n",
    "# make a new variable, c, for Python's CSV writer object -\n",
    "c = csv.writer(csvfile)\n",
    "\n",
    "# write a header row to the csv\n",
    "c.writerow( ['year', 'rank', 'title', 'gross'] )\n",
    "\n",
    "# modified code from above\n",
    "for year in years:\n",
    "    url = base_url + str(year) + '/'\n",
    "    page = requests.get(url)\n",
    "    soup = BeautifulSoup(page.text, 'html.parser')\n",
    "    table = soup.find( 'table' )\n",
    "    rows = table.find_all('tr')\n",
    "    for i in range(1, 6):\n",
    "        cells = rows[i].find_all('td')\n",
    "        gross = cells[7].text.strip('$').replace(',', '')\n",
    "        # print(year, cells[0].text, cells[1].text, gross)\n",
    "        # instead of printing, I need to make a LIST and write that list to the CSV as one row\n",
    "        # I use the same cells that I had printed before \n",
    "        c.writerow( [year, cells[0].text, cells[1].text, gross] )\n",
    "\n",
    "# close the file\n",
    "csvfile.close()\n",
    "\n",
    "print(\"The CSV is done!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result is a CSV file, named movies.csv, that has 51 rows: the header row plus 5 movies for each year from 2010 through 2019. It has four columns: year, rank, title, and gross.\n",
    "\n",
    "Note that **only the final cell above** is needed to create this CSV, by scraping 10 separate web pages. Everything *above* the final cell above is just instruction, demonstration. It is intended to show the problem-solving you need to go through to get to a desired scraping result.\n",
    "\n",
    "You would not need to keep all the other work. Those cells could be deleted."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}