{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Demo: How to scrape multiple things from multiple pages\n", "\n", "The goal is to scrape info about the **five top-grossing movies** for each year, for 10 years. I want the title and rank of each movie, and also how much money it grossed at the box office. In the end I will put the scraped data into a CSV file." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "import requests" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "url = 'https://www.boxofficemojo.com/year/2018/'\n", "page = requests.get(url)\n", "soup = BeautifulSoup(page.text, 'html.parser')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using [Developer Tools](https://developers.google.com/web/tools/chrome-devtools#elements), I discover the data I want is in an HTML **table**. I also discover that it is the only table on the page.\n", "\n", "I store it in a variable named `table`." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "table = soup.find( 'table' )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I use trial-and-error testing with `print()` to discover whether I can get row and cell data cleanly from the table." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1Black Panther---$700,059,5664,084$700,059,566Feb 16Walt Disney Studios Motion Pictures\n", "\n", "false\n", "Black Panther\n" ] } ], "source": [ "# get all the rows from that one table\n", "rows = table.find_all('tr')\n", "# some more trial-and-error testing to find out which row holds the first movie\n", "print(rows[1])\n", "# now that I have the right row, get all the cells in that row\n", "cells = rows[1].find_all('td')\n", "# see whether I can print the movie title cleanly\n", "title = cells[1].text\n", "print(title)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next I try a for-loop to see if I can cleanly get the first five movies in the table." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Black Panther\n", "Avengers: Infinity War\n", "Incredibles 2\n", "Jurassic World: Fallen Kingdom\n", "Deadpool 2\n" ] } ], "source": [ "# get top 5 movies on this page - I know the first movie is in rows[1]\n", "for i in range(1, 6):\n", "    cells = rows[i].find_all('td')\n", "    title = cells[1].text\n", "    print(title)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try a similar for-loop to get total gross for the top five movies. Developer Tools show me this value is in the eighth cell in each row, which is index 7 in the list of cells." 
] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "$700,059,566\n", "$678,815,482\n", "$608,581,744\n", "$417,719,760\n", "$318,491,426\n" ] } ], "source": [ "# I would like to get the total gross number also\n", "for i in range(1, 6):\n", "    cells = rows[i].find_all('td')\n", "    gross = cells[7].text\n", "    print(gross)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now I test getting all the values I want from each row, and it works!" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 Black Panther $700,059,566\n", "2 Avengers: Infinity War $678,815,482\n", "3 Incredibles 2 $608,581,744\n", "4 Jurassic World: Fallen Kingdom $417,719,760\n", "5 Deadpool 2 $318,491,426\n" ] } ], "source": [ "# next I want to get rank (1-5), title and gross all on one line\n", "for i in range(1, 6):\n", "    cells = rows[i].find_all('td')\n", "    print(cells[0].text, cells[1].text, cells[7].text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I want this same data for each of 10 years, so first I will create a list of the years I want." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010]\n" ] } ], "source": [ "# create a list of the 10 years I want\n", "years = []\n", "start = 2019\n", "for i in range(0, 10):\n", "    years.append(start - i)\n", "print(years)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Still prepping for the 10 years, I create a base URL to use when I open each year's page." 
] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://www.boxofficemojo.com/year/2019/\n" ] } ], "source": [ "# create base url\n", "base_url = 'https://www.boxofficemojo.com/year/'\n", "# test it\n", "# print(base_url + years[0] + '/') -- ERROR!\n", "print( base_url + str(years[0]) + '/')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now I *should* have all the pieces I need ... I will test the code with a print statement --" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 Avengers: Endgame $858,373,000\n", "2 The Lion King $543,638,043\n", "3 Toy Story 4 $434,038,008\n", "4 Frozen II $470,089,732\n", "5 Captain Marvel $426,829,839\n", "1 Black Panther $700,059,566\n", "2 Avengers: Infinity War $678,815,482\n", "3 Incredibles 2 $608,581,744\n", "4 Jurassic World: Fallen Kingdom $417,719,760\n", "5 Deadpool 2 $318,491,426\n", "1 Star Wars: Episode VIII - The Last Jedi $620,181,382\n", "2 Beauty and the Beast $504,014,165\n", "3 Wonder Woman $412,563,408\n", "4 Guardians of the Galaxy Vol. 
2 $389,813,101\n", "5 Spider-Man: Homecoming $334,201,140\n", "1 Finding Dory $486,295,561\n", "2 Rogue One: A Star Wars Story $532,177,324\n", "3 Captain America: Civil War $408,084,349\n", "4 The Secret Life of Pets $368,384,330\n", "5 The Jungle Book $364,001,123\n", "1 Jurassic World $652,270,625\n", "2 Star Wars: Episode VII - The Force Awakens $936,662,225\n", "3 Avengers: Age of Ultron $459,005,868\n", "4 Inside Out $356,461,711\n", "5 Furious 7 $353,007,020\n", "1 Guardians of the Galaxy $333,176,600\n", "2 The Hunger Games: Mockingjay - Part 1 $337,135,885\n", "3 Captain America: The Winter Soldier $259,766,572\n", "4 The Lego Movie $257,760,692\n", "5 Transformers: Age of Extinction $245,439,076\n", "1 Iron Man 3 $409,013,994\n", "2 The Hunger Games: Catching Fire $424,668,047\n", "3 Despicable Me 2 $368,065,385\n", "4 Man of Steel $291,045,518\n", "5 Monsters University $268,492,764\n", "1 The Avengers $623,357,910\n", "2 The Dark Knight Rises $448,139,099\n", "3 The Hunger Games $408,010,692\n", "4 Skyfall $304,360,277\n", "5 The Twilight Saga: Breaking Dawn - Part 2 $292,324,737\n", "1 Harry Potter and the Deathly Hallows: Part 2 $381,011,219\n", "2 Transformers: Dark of the Moon $352,390,543\n", "3 The Twilight Saga: Breaking Dawn - Part 1 $281,287,133\n", "4 The Hangover Part II $254,464,305\n", "5 Pirates of the Caribbean: On Stranger Tides $241,071,802\n", "1 Avatar $749,766,139\n", "2 Toy Story 3 $415,004,880\n", "3 Alice in Wonderland $334,191,110\n", "4 Iron Man 2 $312,433,331\n", "5 The Twilight Saga: Eclipse $300,531,751\n" ] } ], "source": [ "# collect all necessary pieces (tested above) to make a loop that gets \n", "# top 5 movies for each of the 10 years in my list of years\n", "\n", "for year in years:\n", "    url = base_url + str(year) + '/'\n", "    page = requests.get(url)\n", "    soup = BeautifulSoup(page.text, 'html.parser')\n", "    table = soup.find( 'table' )\n", "    rows = table.find_all('tr')\n", "    for i in range(1, 6):\n", "        cells = rows[i].find_all('td')\n", "        print(cells[0].text, cells[1].text, cells[7].text)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When I see the result, I realize I need to make two adjustments.\n", "\n", "1. Each line also needs to include the year\n", "2. Maybe I should clean the gross so it contains only digits (no dollar sign, no commas)\n", "\n", "I can get rid of the dollar sign and the commas with a combination of two string methods -- \n", "`.strip()` and `.replace()`" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "293004164\n" ] } ], "source": [ "# testing the clean-up code\n", "\n", "num = '$293,004,164'\n", "print(num.strip('$').replace(',', ''))" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2017 1 Star Wars: Episode VIII - The Last Jedi 620181382\n", "2017 2 Beauty and the Beast 504014165\n", "2017 3 Wonder Woman 412563408\n", "2017 4 Guardians of the Galaxy Vol. 
2 389813101\n", "2017 5 Spider-Man: Homecoming 334201140\n", "2014 1 Guardians of the Galaxy 333176600\n", "2014 2 The Hunger Games: Mockingjay - Part 1 337135885\n", "2014 3 Captain America: The Winter Soldier 259766572\n", "2014 4 The Lego Movie 257760692\n", "2014 5 Transformers: Age of Extinction 245439076\n" ] } ], "source": [ "# testing a way to add the year to each line, using a list with only two years in it to save time\n", "\n", "miniyears = [2017, 2014]\n", "for year in miniyears:\n", "    url = base_url + str(year) + '/'\n", "    page = requests.get(url)\n", "    soup = BeautifulSoup(page.text, 'html.parser')\n", "    table = soup.find( 'table' )\n", "    rows = table.find_all('tr')\n", "    for i in range(1, 6):\n", "        cells = rows[i].find_all('td')\n", "        gross = cells[7].text.strip('$').replace(',', '')\n", "        print(year, cells[0].text, cells[1].text, gross)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that I know it all works, I want to save the data in a CSV file. \n", "\n", "Python has a handy **built-in module** for reading and writing CSVs. We need to import it before we can use it." 
] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The CSV is done!\n" ] } ], "source": [ "import csv\n", "\n", "# open new file for writing - this creates the file\n", "csvfile = open(\"movies.csv\", 'w', newline='', encoding='utf-8')\n", "\n", "# make a new variable, c, for Python's CSV writer object\n", "c = csv.writer(csvfile)\n", "\n", "# write a header row to the csv\n", "c.writerow( ['year', 'rank', 'title', 'gross'] )\n", "\n", "# modified code from above\n", "for year in years:\n", "    url = base_url + str(year) + '/'\n", "    page = requests.get(url)\n", "    soup = BeautifulSoup(page.text, 'html.parser')\n", "    table = soup.find( 'table' )\n", "    rows = table.find_all('tr')\n", "    for i in range(1, 6):\n", "        cells = rows[i].find_all('td')\n", "        gross = cells[7].text.strip('$').replace(',', '')\n", "        # print(year, cells[0].text, cells[1].text, gross)\n", "        # instead of printing, I need to make a LIST and write that list to the CSV as one row\n", "        # I use the same cells that I had printed before \n", "        c.writerow( [year, cells[0].text, cells[1].text, gross] )\n", "\n", "# close the file\n", "csvfile.close()\n", "\n", "print(\"The CSV is done!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a CSV file, named movies.csv, that has 51 rows: the header row plus 5 movies for each year from 2010 through 2019. It has four columns: year, rank, title, and gross.\n", "\n", "Note that **only the final code cell above**, together with the imports and the `years` and `base_url` variables it relies on, is needed to create this CSV by scraping 10 separate web pages. Everything else above that cell is just instruction and demonstration. It is intended to show the problem-solving you need to go through to get to a desired scraping result.\n", "\n", "You would not need to keep all the other work. Those exploratory cells could be deleted." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.1" } }, "nbformat": 4, "nbformat_minor": 2 }