"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.113870Z",
"start_time": "2021-04-21T21:10:52.703140Z"
}
},
"outputs": [],
"source": [
"response = requests.get(url)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3\n",
"> **We collect the html from the website by adding `.text` to the end of our `response` variable.** "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.371878Z",
"start_time": "2021-04-21T21:10:53.368038Z"
}
},
"outputs": [
],
"source": [
"html = response.text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We use `BeautifulSoup` to turn the html into something we can manipulate.**"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.408867Z",
"start_time": "2021-04-21T21:10:53.373516Z"
}
},
"outputs": [
],
"source": [
"soup = BeautifulSoup(html, 'lxml')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Fortunately for us,** we do not have to read through every line of the html in order to scrape it. \n",
"\n",
"Modern-day web browsers come equipped with tools that let us easily sift through this soupy text.\n",
"\n",
"\n",
"## Step 4\n",
">**We open up the page we're trying to scrape in a new tab.** [Click Here!](http://quotes.toscrape.com) 👀"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 5\n",
" > **We create a list of every `div` that has a `class` of \"quote\"**\n",
"\n",
"**In this instance,** every item we want to collect is divided into identically labeled containers.\n",
"- A div with a class of 'quote'.\n",
"\n",
"Not all HTML is as well organized as this page, but HTML is basically just a bunch of different organizational containers that we use to divide up text and other forms of media. Figuring out a website's organizational structure is most of the work of web scraping. "
]
},
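To make the idea of "identically labeled containers" concrete, here is a sketch using only the standard library's `html.parser` on hand-written sample markup (not the site's exact HTML): it counts every `div` whose `class` is `quote`, just like `find_all` will below.

```python
from html.parser import HTMLParser

# A hand-written miniature of the page's structure -- not the site's
# exact markup -- showing quotes as identically labeled containers.
HTML = """
<div class="quote"><span class="text">Quote one</span></div>
<div class="quote"><span class="text">Quote two</span></div>
"""

class QuoteCounter(HTMLParser):
    """Counts every <div class="quote"> it encounters."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == 'div' and ('class', 'quote') in attrs:
            self.count += 1

parser = QuoteCounter()
parser.feed(HTML)
print(parser.count)  # 2
```

`BeautifulSoup` does the same walk for us, which is why `find_all('div', {'class': 'quote'})` is all we need.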
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.415596Z",
"start_time": "2021-04-21T21:10:53.410245Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"10"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"quote_divs = soup.find_all('div', {'class': 'quote'})\n",
"len(quote_divs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 6\n",
"> **To figure out how to grab all the datapoints from a quote, we isolate a single quote, and work through the code for a single `div`.**"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.419751Z",
"start_time": "2021-04-21T21:10:53.417109Z"
}
},
"outputs": [],
"source": [
"first_quote = quote_divs[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### First we grab the text of the quote"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.426181Z",
"start_time": "2021-04-21T21:10:53.422305Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text = first_quote.find('span', {'class':'text'})\n",
"text = text.text\n",
"text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Next we grab the author"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.431920Z",
"start_time": "2021-04-21T21:10:53.428068Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'Albert Einstein'"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"author = first_quote.find('small', {'class': 'author'})\n",
"author_name = author.text\n",
"author_name"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Let's also grab the link that points to the author's bio!"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.437215Z",
"start_time": "2021-04-21T21:10:53.433319Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'http://quotes.toscrape.com/author/Albert-Einstein'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"author_link = author.findNextSibling().attrs.get('href')\n",
"author_link = url + author_link\n",
"author_link"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### And finally, let's grab all of the tags for the quote"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.443153Z",
"start_time": "2021-04-21T21:10:53.438609Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['change', 'deep-thoughts', 'thinking', 'world']"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tag_container = first_quote.find('div', {'class': 'tags'})\n",
"tag_links = tag_container.find_all('a')\n",
"\n",
"tags = []\n",
"for tag in tag_links:\n",
" tags.append(tag.text)\n",
" \n",
"tags"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Our data:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.448574Z",
"start_time": "2021-04-21T21:10:53.444499Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"text: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”\n",
"author name: Albert Einstein\n",
"author link: http://quotes.toscrape.com/author/Albert-Einstein\n",
"tags: ['change', 'deep-thoughts', 'thinking', 'world']\n"
]
}
],
"source": [
"print('text:', text)\n",
"print('author name: ', author_name)\n",
"print('author link: ', author_link)\n",
"print('tags: ', tags)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 7\n",
"> **We create a function to make our code reusable.**\n",
"\n",
"Now that we know how to collect this data from a quote div, we can compile the code into a [function](https://www.geeksforgeeks.org/functions-in-python/) called `quote_data`. This allows us to grab a quote div, feed it into the function like so...\n",
"> `quote_data(quote_div)`\n",
"\n",
"...and receive all of the data from that div."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.455650Z",
"start_time": "2021-04-21T21:10:53.450078Z"
}
},
"outputs": [],
"source": [
"def quote_data(quote_div):\n",
" # Collect the quote\n",
" text = quote_div.find('span', {'class':'text'})\n",
" text = text.text\n",
" \n",
" # Collect the author name\n",
" author = quote_div.find('small', {'class': 'author'})\n",
" author_name = author.text\n",
" \n",
" # Collect author link\n",
" author_link = author.findNextSibling().attrs.get('href')\n",
" author_link = url + author_link\n",
" \n",
" # Collect tags\n",
" tag_container = quote_div.find('div', {'class': 'tags'})\n",
"\n",
" tag_links = tag_container.find_all('a')\n",
"\n",
" tags = []\n",
" for tag in tag_links:\n",
" tags.append(tag.text)\n",
" \n",
" # Return data as a dictionary\n",
" return {'author': author_name,\n",
" 'text': text,\n",
" 'author_link': author_link,\n",
" 'tags': tags}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Let's test our function."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.460722Z",
"start_time": "2021-04-21T21:10:53.457107Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'author': 'Thomas A. Edison',\n",
" 'text': \"“I have not failed. I've just found 10,000 ways that won't work.”\",\n",
" 'author_link': 'http://quotes.toscrape.com/author/Thomas-A-Edison',\n",
" 'tags': ['edison', 'failure', 'inspirational', 'paraphrased']}"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"quote_data(quote_divs[7])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can collect the data from every quote on the first page with a simple [```for loop```](https://www.w3schools.com/python/python_for_loops.asp)!"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.468852Z",
"start_time": "2021-04-21T21:10:53.462096Z"
},
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10 quotes scraped!\n"
]
},
{
"data": {
"text/plain": [
"[{'author': 'Albert Einstein',\n",
" 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',\n",
" 'author_link': 'http://quotes.toscrape.com/author/Albert-Einstein',\n",
" 'tags': ['change', 'deep-thoughts', 'thinking', 'world']},\n",
" {'author': 'J.K. Rowling',\n",
" 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',\n",
" 'author_link': 'http://quotes.toscrape.com/author/J-K-Rowling',\n",
" 'tags': ['abilities', 'choices']}]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"page_one_data = []\n",
"for div in quote_divs:\n",
" # Apply our function on each quote\n",
" data_from_div = quote_data(div)\n",
" page_one_data.append(data_from_div)\n",
" \n",
"print(len(page_one_data), 'quotes scraped!')\n",
"page_one_data[:2]"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2020-07-03T18:55:14.762708Z",
"start_time": "2020-07-03T18:55:14.758667Z"
}
},
"source": [
"# We just scraped an entire webpage!\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Level Up: What if we wanted to scrape the quotes from *every* page?\n",
"\n",
"**Step 1:** The first thing we do is take the code from above that scraped the data for all of the quotes on page one, and move it into a function called `scrape_page`."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.473642Z",
"start_time": "2021-04-21T21:10:53.470305Z"
}
},
"outputs": [],
"source": [
"def scrape_page(quote_divs):\n",
" data = []\n",
" for div in quote_divs:\n",
" div_data = quote_data(div)\n",
" data.append(div_data)\n",
" \n",
" return data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Step 2:** We grab the code we used at the very beginning to collect the html and the list of divs from a web page."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.601521Z",
"start_time": "2021-04-21T21:10:53.475074Z"
}
},
"outputs": [],
"source": [
"url = 'http://quotes.toscrape.com'\n",
"response = requests.get(url)\n",
"html = response.text\n",
"soup = BeautifulSoup(html, 'lxml')\n",
"quote_divs = soup.find_all('div', {'class': 'quote'})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Step 3:** We feed all of the `quote_divs` into our newly made `scrape_page` function to grab all of the data from that page."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.606735Z",
"start_time": "2021-04-21T21:10:53.602884Z"
}
},
"outputs": [],
"source": [
"data = scrape_page(quote_divs)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.611866Z",
"start_time": "2021-04-21T21:10:53.608230Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[{'author': 'Albert Einstein',\n",
" 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',\n",
" 'author_link': 'http://quotes.toscrape.com/author/Albert-Einstein',\n",
" 'tags': ['change', 'deep-thoughts', 'thinking', 'world']},\n",
" {'author': 'J.K. Rowling',\n",
" 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',\n",
" 'author_link': 'http://quotes.toscrape.com/author/J-K-Rowling',\n",
" 'tags': ['abilities', 'choices']}]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[:2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Step 4:** We check to see if there is a `Next Page` button at the bottom of the page.\n",
"\n",
"*This requires multiple steps.*\n",
"\n",
"1. We grab the outer container that has a class of `pager`."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.616414Z",
"start_time": "2021-04-21T21:10:53.613232Z"
}
},
"outputs": [],
"source": [
"pager = soup.find('ul', {'class': 'pager'})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If there is no pager element on the webpage, `pager` will be set to `None`.\n",
"\n",
"2. We use an `if` check to make sure the pager exists:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.620316Z",
"start_time": "2021-04-21T21:10:53.617767Z"
}
},
"outputs": [],
"source": [
"if pager:\n",
" next_page = pager.find('li', {'class': 'next'})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. We then check to see if a `Next Page` button exists on the page. \n",
"\n",
"    - Every page has a next button except the *last* page, which only has a `Previous Page` button. If the `Next` button exists, we \"click\" it."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.624558Z",
"start_time": "2021-04-21T21:10:53.621819Z"
}
},
"outputs": [],
"source": [
"if next_page:\n",
" next_page = next_page.findChild('a')\\\n",
" .attrs\\\n",
" .get('href')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With most web scraping tools, \"clicking a button\" means collecting the link inside the button and making a new request.\n",
"\n",
"If a link points to a page on the same website, it is usually just a path that needs to be appended to the site's base url. This is called a `relative` link.\n",
"\n",
"**Step 5:** Collect the relative link that points to the next page, and add it to our base `url`"
]
},
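As an aside, the standard library's `urllib.parse.urljoin` performs this same join and is more forgiving about stray or missing slashes (a sketch; the notebook itself simply concatenates the strings):

```python
from urllib.parse import urljoin

base_url = 'http://quotes.toscrape.com'

# urljoin resolves a relative href against the base url,
# normalizing the slash between the two parts either way.
print(urljoin(base_url, '/page/2/'))        # http://quotes.toscrape.com/page/2/
print(urljoin(base_url + '/', 'page/2/'))   # http://quotes.toscrape.com/page/2/
```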
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-21T21:10:53.630467Z",
"start_time": "2021-04-21T21:10:53.627259Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'http://quotes.toscrape.com/page/2/'"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"next_page = url + next_page\n",
"next_page"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Step 6:** We repeat the exact same process for this new link!\n",
"\n",
"ie:\n",
"1. Make request using a url that points to the next page.\n",
"2. Scrape quote divs\n",
"3. Collect data from every quote div on that page\n",
"4. Find the `Next page` button.\n",
"5. Collect the url from the button\n",
"6. Repeat\n",
"\n",
"So how do we do this over and over again without repeating ourselves?\n",
"\n",
"The first step is to compile all of these steps into a new function called `scrape_quotes`.\n",
"\n",
"The second step uses something called `recursion`. \n",
"\n",
"| \n", " | author | \n", "text | \n", "author_link | \n", "tags | \n", "
|---|---|---|---|---|
| 0 | \n", "Albert Einstein | \n", "“The world as we have created it is a process ... | \n", "http://quotes.toscrape.com/author/Albert-Einstein | \n", "[change, deep-thoughts, thinking, world] | \n", "
| 1 | \n", "J.K. Rowling | \n", "“It is our choices, Harry, that show what we t... | \n", "http://quotes.toscrape.com/author/J-K-Rowling | \n", "[abilities, choices] | \n", "
| 2 | \n", "Albert Einstein | \n", "“There are only two ways to live your life. On... | \n", "http://quotes.toscrape.com/author/Albert-Einstein | \n", "[inspirational, life, live, miracle, miracles] | \n", "
| 3 | \n", "Jane Austen | \n", "“The person, be it gentleman or lady, who has ... | \n", "http://quotes.toscrape.com/author/Jane-Austen | \n", "[aliteracy, books, classic, humor] | \n", "
| 4 | \n", "Marilyn Monroe | \n", "“Imperfection is beauty, madness is genius and... | \n", "http://quotes.toscrape.com/author/Marilyn-Monroe | \n", "[be-yourself, inspirational] | \n", "
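The six looping steps above can be sketched as a recursive function. This is only an outline of the control flow: `fetch_page` is a hypothetical stand-in for the real request-and-parse work (`requests`, `BeautifulSoup`, `scrape_page`), reading from an in-memory dict so the recursion itself is easy to follow.

```python
# Stand-in "site": each page lists its quotes and the relative
# link behind its Next Page button (None on the last page).
PAGES = {
    '/page/1/': {'quotes': ['q1', 'q2'], 'next': '/page/2/'},
    '/page/2/': {'quotes': ['q3'], 'next': None},
}

def fetch_page(path):
    # In the real notebook this would be requests.get + BeautifulSoup.
    return PAGES[path]

def scrape_quotes(path, data=None):
    if data is None:
        data = []
    page = fetch_page(path)            # 1. request the page
    data.extend(page['quotes'])        # 2-3. collect every quote's data
    next_path = page['next']           # 4-5. look for a Next Page link
    if next_path:
        return scrape_quotes(next_path, data)  # 6. repeat on the next page
    return data

print(scrape_quotes('/page/1/'))  # ['q1', 'q2', 'q3']
```

Each call scrapes one page and then calls itself on the next page's url, stopping when no `Next Page` button is found.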