{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CS5481 - Tutorial 2\n", "\n", "## Introduction to Web Crawling\n", "\n", "Welcome to the CS5481 tutorial. In this tutorial, you will learn how to crawl data from the web with Python." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Introduction to HTML (20 minutes)\n", "\n", "### What is HTML?\n", "HTML (HyperText Markup Language) is the standard language used for creating web pages. It structures content on the web and allows browsers to interpret and display it.\n", "\n", "### Key Features of HTML\n", "- **Markup Language**: HTML is a markup language that uses tags to define elements within a document.\n", "- **Browser Compatibility**: HTML is supported by all web browsers, making it a foundational technology for web development.\n", "\n", "### Common HTML Tags\n", "- `<html>`: The root element that wraps all other HTML content.\n", "- `<head>`: Contains meta-information about the document, such as the title and links to stylesheets.\n", "- `<title>`: Sets the title of the web page that appears in the browser tab.\n", "- `<body>`: Contains the main content of the page, including text, images, and other media.\n", "\n", "### Heading Tags\n", "- `<h1>`: Represents the main heading of the page (largest).\n", "- `<h2>`, `<h3>`, etc.: Subheadings, with decreasing size and importance.\n", "\n", "### Text Content Tags\n", "- `<p>`: Defines a paragraph of text.\n", "- `<b>`: Makes text bold.\n", "- `<i>`: Italicizes text.\n", "- `<br>`: Inserts a line break.\n", "\n", "### Link and Image Tags\n", "- `<a>`: Anchor tag used to create hyperlinks. Example: `<a href=\"https://example.com\">Visit Example</a>`.\n", "- `<img>`: Embeds an image. Example: `<img src=\"image.jpg\" alt=\"Description\">`." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example HTML Structure\n", "```\n", "<!DOCTYPE html>\n", "<html lang=\"en\">\n", "<head>\n", "    <meta charset=\"UTF-8\">\n", "    <title>Sample Page</title>\n", "</head>\n", "<body>\n", "    <h1>Welcome to My Web Page</h1>\n", "    <p>This is a sample paragraph.</p>\n", "    <a href=\"https://example.com\">Visit Example</a>\n", "</body>\n", "</html>\n", "```\n", "\n", "Source: https://cn.w3schools.com/html/html_elements.asp, where you can learn more about HTML :)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Introduction to Web Scraping (30 minutes)\n", "\n", "### What is Web Scraping?\n", "Web scraping is the process of extracting data from websites. \n", "\n", "Python provides powerful libraries like `requests` and `Beautiful Soup` for this purpose.\n", "\n", "### Installing Libraries\n", "To get started, ensure you have the required libraries installed:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! pip install requests\n", "! pip install beautifulsoup4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Import Libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests as r\n", "from bs4 import BeautifulSoup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Find the URL of the Target HTML Page" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = 'https://stackoverflow.com/'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Obtain the HTML Structure and Contents" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# send a GET request and read the response body as text\n", "res = r.get(url)\n", "html = res.text\n", "print(html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Parse and Reformat the HTML" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# name the parser explicitly so the result does not depend on which parsers are installed\n", "bf = BeautifulSoup(html, \"html.parser\")\n", "print(bf.prettify())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. 
Obtain Information We Need" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# obtain the <title> tag\n", "print(bf.title)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# obtain the title string\n", "print(bf.title.string)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# obtain all <a> tags\n", "for item in bf.find_all(\"a\"):\n", "    print(item)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# obtain the text content of the document (get_text is a method and must be called)\n", "print(bf.get_text())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# find <a> tags that have an \"id\" attribute\n", "for item in bf.find_all(\"a\", id=True):\n", "    print(item)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# find <a> tags whose id is \"nav-tags\"\n", "for item in bf.find_all(\"a\", id=\"nav-tags\"):\n", "    print(item)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**More use cases can be found at** https://beautiful-soup-4.readthedocs.io/en/latest/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Practice:\n", "\n", "Try to print the title, source, editor, and full text of the target HTML page:\n", "\n", "https://english.news.cn/20220904/b1955558af1c4179a355fab10b1ee28f/c.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# insert your code\n", "import requests\n", "from bs4 import BeautifulSoup\n", "\n", "# URL of the news article\n", "url = \"https://english.news.cn/20220904/b1955558af1c4179a355fab10b1ee28f/c.html\"\n", "\n", "# Fetch the page\n", "response = requests.get(url)\n", "response.encoding = 'utf-8'  # Ensure proper encoding\n", "\n", "# Create BeautifulSoup object with an explicit parser\n", "soup = BeautifulSoup(response.text, \"html.parser\")\n", "print(soup)" ] }, { "cell_type": 
"code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(soup.prettify())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract title\n", "title = soup.find('title').text.strip() if soup.find('title') else 'Title not found'\n", "\n", "# Extract source\n", "source = soup.find('p', class_='source').text.strip() if soup.find('p', class_='source') else 'Source not found'\n", "\n", "# Extract editor\n", "editor = soup.find('p', class_='editor').text.strip() if soup.find('p', class_='editor') else 'Editor not found'\n", "\n", "# Extract full text\n", "full_text = soup.find('div', id='detailContent').text.strip() if soup.find('div', id='detailContent') else 'Full text not found'\n", "\n", "# Print the extracted information\n", "print(f\"Title: {title}\")\n", "print(f\"Source: {source}\")\n", "print(f\"Editor: {editor}\")\n", "print(f\"Full Text: {full_text}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" }, "vscode": { "interpreter": { "hash": "88279d2366fe020547cde40dd65aa0e3aa662a6ec1f3ca12d88834876c85e1a6" } } }, "nbformat": 4, "nbformat_minor": 4 }