{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Demo: How to scrape multiple things from multiple pages\n", "\n", "The goal is to scrape information about the **five top-grossing movies** of each year, for 10 years. For each movie I want the title, the rank, and its box-office gross. At the end I will write the scraped data to a CSV file." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "import requests" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "url = 'https://www.boxofficemojo.com/year/2018/'\n", "page = requests.get(url)\n", "soup = BeautifulSoup(page.text, 'html.parser')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using [Developer Tools](https://developers.google.com/web/tools/chrome-devtools#elements), I discover that the data I want is in an HTML **table**, and that it is the only table on the page.\n", "\n", "I store it in a variable named `table`." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "table = soup.find('table')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I use trial-and-error testing with `print()` to find out whether I can get row and cell data cleanly from the table." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "