{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# CS5481 - Tutorial 2\n",
"\n",
"## Introduction to Web Crawling\n",
"\n",
"Welcome to the CS5481 tutorial. In this tutorial, you will learn how to crawl data from the web with Python."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1: Introduction to HTML (20 minutes)\n",
"\n",
"### What is HTML?\n",
"HTML (HyperText Markup Language) is the standard language used for creating web pages. It structures content on the web and allows browsers to interpret and display it.\n",
"\n",
"### Key Features of HTML\n",
"- **Markup Language**: HTML is a markup language that uses tags to define elements within a document.\n",
"- **Browser Compatibility**: HTML is supported by all web browsers, making it a foundational technology for web development.\n",
"\n",
"### Common HTML Tags\n",
"- `<html>`: The root element that wraps all other HTML content.\n",
"- `<head>`: Contains meta-information about the document, such as the title and links to stylesheets.\n",
"- `<p>`: Defines a paragraph of text.\n",
"- `<b>`: Makes text bold.\n",
"- `<i>`: Italicizes text.\n",
"- `<br>`: Inserts a line break.\n",
"\n",
"### Link and Image Tags\n",
"- `<a>`: Anchor tag used to create hyperlinks. Example: `<a href=\"...\">Visit Example</a>`.\n",
"- `<img>`: Embeds an image. Example: `<img src=\"...\">`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example HTML Structure\n",
"```\n",
"<!DOCTYPE html>\n",
"<html>\n",
"<body>\n",
"    <h1>Welcome to My Web Page</h1>\n",
"    <p>This is a sample paragraph.</p>\n",
"    <a href=\"...\">Visit Example</a>\n",
"</body>\n",
"</html>\n",
"```\n",
"\n",
"You can learn more about HTML at https://cn.w3schools.com/html/html_elements.asp :)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 2: Introduction to Web Scraping (30 minutes)\n",
"\n",
"### What is Web Scraping?\n",
"Web scraping is the process of extracting data from websites.\n",
"\n",
"Python provides powerful libraries like `requests` and `Beautiful Soup` for this purpose.\n",
"\n",
"### Installing Libraries\n",
"To get started, ensure you have the required libraries installed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install requests\n",
"! pip install beautifulsoup4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Import Libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests as r\n",
"from bs4 import BeautifulSoup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Find the URL of the Target HTML Page"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = 'https://stackoverflow.com/'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Obtain the HTML Structure and Contents"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Send an HTTP GET request; the timeout keeps the call from hanging on a slow server\n",
"res = r.get(url, timeout=10)\n",
"html = res.text  # response body decoded as text\n",
"print(html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Reformat and Parse the HTML"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Name the parser explicitly so Beautiful Soup does not have to guess one\n",
"bf = BeautifulSoup(html, 'html.parser')\n",
"print(bf.prettify())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Obtain the Information We Need"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# obtain title according to