{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's say we want to save a CSV of data from H&M.\n", "\n", "Before we get started we'll do all the normal imports. Notice we're also importing pandas! We're going to use pandas to save our content as a CSV once we're done scraping. Why else would we scrape anything, if not to save it?" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import pandas as pd\n", "from bs4 import BeautifulSoup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we'll just visit the page as usual. In this case H&M is trying to protect itself from bots, so we're pretending we're a totally normal human being." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "headers = {\n", " 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'\n", "}\n", "\n", "response = requests.get('https://www2.hm.com/en_us/sale/home/view-all.html', headers=headers)\n", "doc = BeautifulSoup(response.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then use the inspector to find out that the class of each product is `item-heading` and the price of each product is `item-price`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](identify.jpg)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Knit Throw with Fringe\n", "Patterned Duvet Cover\n", "Cotton Pillowcase\n" ] } ], "source": [ "# Using [:3] to only go through the first 3\n", "names = doc.find_all(class_=\"item-heading\")\n", "for name in names[:3]:\n", " print(name.text.strip())" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "$29.99\n", "$59.99\n", "$44.99\n", "$119.00\n", "$9.99\n", "$17.99\n" ] } ], "source": [ "# Using [:3] to only go through the first 3\n", "# (it looks like more because 2 prices per item)\n", "prices = doc.find_all(class_=\"item-price\")\n", "for price in prices[:3]:\n", " print(price.text.strip())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Converting to a list of dictionaries\n", "\n", "The problem is that we want to keep the first name attached to the first price, and the second name attached to the second price, and the third name attached to the third price. Right now they're in two separate lists, when want we *really* want is one list, where each element has a `name` and a `price`. Like a list of dictionaries, right?\n", "\n", "First, let's work on building our dictionaries. Instead of selecting all of the names and all of the prices, we need to figure out thing container that has the name *and* the price inside." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](identify-block.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Basically \"find the thing that surrounds every item\". Now, instead of finding each name or each price or whatever, we're going to find each one of these blocks." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "----this is an item------\n", "SAVE AS FAVORITE\n", "\n", "\n", "\n", "Knit Throw with Fringe\n", "\n", "$29.99\n", "$59.99\n", "\n", "\n", "\r\n", "\t\t\t\tDark gray\n", "----this is an item------\n", "SAVE AS FAVORITE\n", "\n", "\n", "\r\n", "\t\tCLASSIC COLLECTION\n", "\n", "Patterned Duvet Cover\n", "\n", "$44.99\n", "$119.00\n", "\n", "\n", "\r\n", "\t\t\t\tWhite/striped\n" ] } ], "source": [ "# Using [:3] to only go through the first 3\n", "items = doc.find_all(class_=\"hm-product-item\")\n", "for item in items[:2]:\n", " print(\"----this is an item------\")\n", " print(item.text.strip())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See? It has all of the information inside of it! Name, price, even the collection and the colorways. But we need it **organized**, not just in a weird random string.\n", "\n", "We're going to change what we do in the loop. Right now we just **print out everything inside of the block.** Instead, we're going to **just find the name**, and then **just find the price**. It's just like what we were doing before when we found _all_ of the names, but we're only looking for the one inside of each block, _not_ across the whole page." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "----this is an item------\n", "Knit Throw with Fringe $29.99\n", "$59.99\n", "----this is an item------\n", "Patterned Duvet Cover $44.99\n", "$119.00\n", "----this is an item------\n", "Cotton Pillowcase $9.99\n", "$17.99\n", "----this is an item------\n", "Pillowcase with Pin-tucks $9.99\n", "$17.99\n", "----this is an item------\n", "Linen-blend Bedspread $54.99\n", "$99.00\n" ] } ], "source": [ "# Using [:5] to only go through the first 5\n", "items = doc.find_all(class_=\"hm-product-item\")\n", "for item in items[:5]:\n", " print(\"----this is an item------\")\n", " name = item.find(class_='item-heading').text.strip()\n", " price = item.find(class_='item-price').text.strip()\n", " print(name, price)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice we're doing `item.find`, not `doc.find`! Just like we usually use `.text` to get the text of an element, `.find` will only find the pieces inside of it.\n", "\n", "If that doesn't make sense, it's ok to just memorize it! Use `.find_all` to find the big blocks, then use `.find` to find the individual pieces inside.\n", "\n", "Now, we're looking to put together some dictionaries. Each product will be a row in the CSV we want to create. What is each column? Oh, name and price - the same as the things we're printing out! We're going to make a dictionary out of them, where each key ends up being a column in our CSV." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "----this is an item------\n", "{'name': 'Knit Throw with Fringe', 'price': '$29.99\\n$59.99'}\n", "----this is an item------\n", "{'name': 'Patterned Duvet Cover', 'price': '$44.99\\n$119.00'}\n", "----this is an item------\n", "{'name': 'Cotton Pillowcase', 'price': '$9.99\\n$17.99'}\n", "----this is an item------\n", "{'name': 'Pillowcase with Pin-tucks', 'price': '$9.99\\n$17.99'}\n", "----this is an item------\n", "{'name': 'Linen-blend Bedspread', 'price': '$54.99\\n$99.00'}\n" ] } ], "source": [ "# Find each product block\n", "items = doc.find_all(class_=\"hm-product-item\")\n", "\n", "# Go through each of the blocks... (well, [:5] means the first 5)\n", "for item in items[:5]:\n", " print(\"----this is an item------\")\n", "\n", " # Create an empty row for our CSV file \n", " row = {}\n", " \n", " # Fill in the 'name' and 'price' headers\n", " row['name'] = item.find(class_='item-heading').text.strip()\n", " row['price'] = item.find(class_='item-price').text.strip()\n", "\n", " # Print it out to double-check\n", " print(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've got these dictionaries, we need to save them as we go along. Let's make an empty list, and every time we look at a new product we can save it to the list." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "----this is an item------\n", "{'name': 'Knit Throw with Fringe', 'price': '$29.99\\n$59.99'}\n", "----this is an item------\n", "{'name': 'Patterned Duvet Cover', 'price': '$44.99\\n$119.00'}\n", "----this is an item------\n", "{'name': 'Cotton Pillowcase', 'price': '$9.99\\n$17.99'}\n", "----this is an item------\n", "{'name': 'Pillowcase with Pin-tucks', 'price': '$9.99\\n$17.99'}\n", "----this is an item------\n", "{'name': 'Linen-blend Bedspread', 'price': '$54.99\\n$99.00'}\n", "------\n", "Final list: [{'name': 'Knit Throw with Fringe', 'price': '$29.99\\n$59.99'}, {'name': 'Patterned Duvet Cover', 'price': '$44.99\\n$119.00'}, {'name': 'Cotton Pillowcase', 'price': '$9.99\\n$17.99'}, {'name': 'Pillowcase with Pin-tucks', 'price': '$9.99\\n$17.99'}, {'name': 'Linen-blend Bedspread', 'price': '$54.99\\n$99.00'}]\n" ] } ], "source": [ "# Find each product block\n", "items = doc.find_all(class_=\"hm-product-item\")\n", "\n", "# A list of rows. Each row will be a row in our final CSV\n", "# We start without any!\n", "rows = []\n", "\n", "# Go through each of the blocks... (well, [:5] means the first 5)\n", "for item in items[:5]:\n", " print(\"----this is an item------\")\n", "\n", " # Create an empty row for our CSV file \n", " row = {}\n", " \n", " # Fill in the 'name' and 'price' headers\n", " row['name'] = item.find(class_='item-heading').text.strip()\n", " row['price'] = item.find(class_='item-price').text.strip()\n", "\n", " # Now that we've filled in our row, add it to our list\n", " rows.append(row)\n", " \n", " # Print it out to double-check\n", " print(row)\n", "\n", "print(\"------\")\n", "print(\"Final list:\",rows)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, cool, a list of dictionaries. But what we are going to do with it?\n", "\n", "Convert it into a dataframe with pandas, of course! Pandas will easily take a list of dictionaries and save it right into a dataframe." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameprice
0Knit Throw with Fringe$29.99\\n$59.99
1Patterned Duvet Cover$44.99\\n$119.00
2Cotton Pillowcase$9.99\\n$17.99
3Pillowcase with Pin-tucks$9.99\\n$17.99
4Linen-blend Bedspread$54.99\\n$99.00
\n", "
" ], "text/plain": [ " name price\n", "0 Knit Throw with Fringe $29.99\\n$59.99\n", "1 Patterned Duvet Cover $44.99\\n$119.00\n", "2 Cotton Pillowcase $9.99\\n$17.99\n", "3 Pillowcase with Pin-tucks $9.99\\n$17.99\n", "4 Linen-blend Bedspread $54.99\\n$99.00" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find each product block\n", "items = doc.find_all(class_=\"hm-product-item\")\n", "\n", "# A list of rows. Each row will be a row in our final CSV\n", "# We start without any!\n", "rows = []\n", "\n", "# Go through each of the blocks... (well, [:5] means the first 5)\n", "for item in items:\n", " # Create an empty row for our CSV file \n", " row = {}\n", " \n", " # Fill in the 'name' and 'price' headers\n", " row['name'] = item.find(class_='item-heading').text.strip()\n", " row['price'] = item.find(class_='item-price').text.strip()\n", "\n", " # Now that we've filled in our row, add it to our list\n", " rows.append(row)\n", "\n", "df = pd.DataFrame(rows)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we just need to save it to a CSV. Just remember to do `index=False` so that it gets saved without the weird nameless index column!" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "df.to_csv(\"scraped.csv\", index=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }