# CS5481 - Tutorial 2

## Introduction to Web Crawling

Welcome to the CS5481 tutorial. In this tutorial, you will learn how to crawl data from the web with Python.

## Part 1: Introduction to HTML (20 minutes)

### What is HTML?
HTML (HyperText Markup Language) is the standard language used for creating web pages. It structures content on the web and allows browsers to interpret and display it.

### Key Features of HTML
- **Markup Language**: HTML is a markup language that uses tags to define elements within a document.
- **Browser Compatibility**: HTML is supported by every web browser, making it a foundational technology for web development.

### Common HTML Tags
- `<html>`: The root element that wraps all other HTML content.
- `<head>`: Contains meta-information about the document, such as the title and links to stylesheets.
- `<title>`: Sets the title of the web page that appears in the browser tab.
- `<body>`: Contains the main content of the page, including text, images, and other media.

### Header Tags
- `<h1>`: Represents the main heading of the page (largest).
- `<h2>`, `<h3>`, etc.: Subheadings, with decreasing size and importance.

### Text Content Tags
- `<p>`: Defines a paragraph of text.
- `<b>`: Makes text bold.
- `<i>`: Italicizes text.
- `<br>`: Inserts a line break.

### Link and Image Tags
- `<a>`: Anchor tag used to create hyperlinks. Example: `<a href="https://example.com">Visit Example</a>`.
- `<img>`: Embeds an image. Example: `<img src="image.jpg" alt="Description">`.

### Example HTML Structure
```
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Sample Page</title>
</head>
<body>
    <h1>Welcome to My Web Page</h1>
    <p>This is a sample paragraph.</p>
    <a href="https://example.com">Visit Example</a>
</body>
</html>
```

Source: https://cn.w3schools.com/html/html_elements.asp (you can learn more about HTML there).
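To make the nesting of these tags concrete, here is a minimal sketch that walks the sample page with Python's standard-library `html.parser` and prints each opening tag indented by its depth. The names `TagOutline`, `SAMPLE`, and `VOID_TAGS` are illustrative only and are not used elsewhere in this tutorial.

```python
from html.parser import HTMLParser

# The sample page from this section, written out with every tag closed.
SAMPLE = """<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Sample Page</title>
</head>
<body>
    <h1>Welcome to My Web Page</h1>
    <p>This is a sample paragraph.</p>
    <a href="https://example.com">Visit Example</a>
</body>
</html>"""

# Void elements have no closing tag, so they must not change the depth.
VOID_TAGS = {"meta", "br", "img", "link", "input", "hr"}


class TagOutline(HTMLParser):
    """Collect (depth, tag) pairs for every opening tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


outline_parser = TagOutline()
outline_parser.feed(SAMPLE)
for depth, tag in outline_parser.outline:
    print("  " * depth + tag)
```

The printed outline shows `head` and `body` one level inside `html`, with the remaining tags one level deeper. This tree is what a parser such as Beautiful Soup builds in Part 2.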

## Part 2: Introduction to Web Scraping (30 minutes)

### What is Web Scraping?
Web scraping is the process of extracting data from websites. 

The Python ecosystem offers third-party libraries for this purpose, such as `requests` (for fetching pages over HTTP) and `Beautiful Soup` (for parsing the HTML that comes back).

### Installing Libraries
To get started, ensure you have the required libraries installed:

In [None]:
! pip install requests
! pip install beautifulsoup4

## 1. Import Libraries

In [None]:
import requests as r
from bs4 import BeautifulSoup

## 2. Find the URL of the Target HTML

In [None]:
# a plain string is enough here; the raw-string prefix is not needed for a URL
url = 'https://stackoverflow.com/'

## 3. Obtain the HTML Structure and Contents

In [None]:
res = r.get(url)  # send an HTTP GET request for the page
html = res.text   # response body decoded to a string
print(html)
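A plain `get` call waits indefinitely if the server never answers and silently returns error pages. As a hedged sketch of a more defensive fetch, the hypothetical helper `fetch_html` below adds a timeout and a status check. The `User-Agent` value is an assumption, and the `get` parameter exists only so the function can be tried without network access.

```python
import requests


def fetch_html(url, timeout=10.0, get=requests.get):
    """Return the page body as text, raising on HTTP errors (hypothetical helper).

    `get` defaults to requests.get; it is a parameter only so the function
    can be exercised with a stand-in when no network is available.
    """
    # Assumed header value: some sites reject the library's default User-Agent.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; cs5481-tutorial)"}
    res = get(url, headers=headers, timeout=timeout)
    res.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return res.text
```

With this helper, the cell above could be written as `html = fetch_html(url)`.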

## 4. Parse and Reformat the HTML

In [None]:
# name the parser explicitly; otherwise bs4 picks one and emits a warning
bf = BeautifulSoup(html, "html.parser")
print(bf.prettify())

## 5. Obtain Information We Need

In [None]:
# obtain title according to <title> tag
print(bf.title) 

In [None]:
# obtain title string
print(bf.title.string)
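The difference between a tag and its string is easier to see on a tiny document. The sketch below uses a made-up HTML string (`doc`) and a separate `demo` object so it does not touch `bf`.

```python
from bs4 import BeautifulSoup

# A made-up two-element page, used only for this illustration.
doc = (
    "<html><head><title>Sample Page</title></head>"
    "<body><p>Hello, <b>world</b>!</p></body></html>"
)
demo = BeautifulSoup(doc, "html.parser")

print(demo.title)         # the whole Tag object, markup included
print(demo.title.name)    # the tag's name
print(demo.title.string)  # the single text child of <title>

# .string is None when a tag has more than one child (text plus <b> here),
# so get_text() is the safer way to read mixed content.
print(demo.p.string)
print(demo.p.get_text())
```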

In [None]:
# obtain all <a> tags
for item in bf.find_all("a"):
    print(item)
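When crawling, the link target usually matters more than the whole `<a>` tag. Here is a minimal sketch, on a made-up snippet and an assumed page address (`base_url`), that reads each `href` attribute and resolves relative links with the standard-library `urljoin`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://example.com/docs/"  # assumed address of the page being parsed
snippet = """
<a href="/questions">Questions</a>
<a href="tags.html">Tags</a>
<a href="https://example.org/about">About</a>
<a name="top">No link here</a>
"""
demo = BeautifulSoup(snippet, "html.parser")

links = []
for a in demo.find_all("a", href=True):  # href=True skips anchors without a link
    absolute = urljoin(base_url, a["href"])  # relative paths become full URLs
    links.append((a.get_text(strip=True), absolute))

for text, link in links:
    print(text, "->", link)
```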

In [None]:
# obtain text content from document
print(bf.get_text())  # get_text is a method, so it must be called
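By default the text of neighbouring tags is joined with nothing in between. As a small sketch on a made-up snippet, the `separator` and `strip` arguments of `get_text()` give more readable output.

```python
from bs4 import BeautifulSoup

snippet = "<div><h1>Title</h1><p>First paragraph.</p><p>Second.</p></div>"
demo = BeautifulSoup(snippet, "html.parser")

# Default: text pieces are concatenated directly.
plain = demo.get_text()

# One piece per line, with surrounding whitespace removed from each piece.
lines = demo.get_text(separator="\n", strip=True)

print(repr(plain))
print(lines)
```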

In [None]:
# find <a> tags that have an "id" attribute
for item in bf.find_all("a", id=True):
    print(item)

In [None]:
# find <a> tags whose id is "nav-tags"
for item in bf.find_all("a", id="nav-tags"):
    print(item)
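The same kinds of filter can also be written as CSS selectors. The sketch below runs on a made-up navigation snippet; only the id `nav-tags` comes from the example above, while the other ids and class names are hypothetical.

```python
from bs4 import BeautifulSoup

snippet = """
<nav>
  <a id="nav-questions" class="nav-link" href="/questions">Questions</a>
  <a id="nav-tags" class="nav-link" href="/tags">Tags</a>
  <a class="footer-link" href="/help">Help</a>
</nav>
"""
demo = BeautifulSoup(snippet, "html.parser")

# Keyword filters, as in the cells above.
ids = [a["id"] for a in demo.find_all("a", id=True)]
tags_link = demo.find("a", id="nav-tags")  # find() returns the first match or None

# class is a Python keyword, so bs4 uses class_ for filtering by CSS class.
nav_links = demo.find_all("a", class_="nav-link")

# CSS selectors: select_one() returns the first match, select() returns a list.
via_css = demo.select_one("a#nav-tags")
nav_hrefs = [a["href"] for a in demo.select("nav a.nav-link")]

print(ids, tags_link["href"], nav_hrefs)
```

`find_all` with keyword arguments is convenient for a single condition, while `select` tends to read better once a query combines nesting, classes, and ids.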

**More use cases can be found at** https://beautiful-soup-4.readthedocs.io/en/latest/

# Practice:

Try to print the title, source, editor, and full text of the target HTML page:

https://english.news.cn/20220904/b1955558af1c4179a355fab10b1ee28f/c.html

In [None]:
# insert your code
import requests
from bs4 import BeautifulSoup

# URL of the news article
url = "https://english.news.cn/20220904/b1955558af1c4179a355fab10b1ee28f/c.html"

# Fetch the page
response = requests.get(url)
response.encoding = 'utf-8'  # Ensure proper encoding

# Create BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")
print(soup)

In [None]:
print(soup.prettify())

In [None]:
def text_or(tag, default):
    # Return the tag's stripped text, or the default when find() returned None
    return tag.text.strip() if tag is not None else default

# The class names and id below depend on the page's markup;
# check them against the prettify() output above.

# Extract title
title = text_or(soup.find('title'), 'Title not found')

# Extract source
source = text_or(soup.find('p', class_='source'), 'Source not found')

# Extract editor
editor = text_or(soup.find('p', class_='editor'), 'Editor not found')

# Extract full text
full_text = text_or(soup.find('div', id='detailContent'), 'Full text not found')

# Print the extracted information
print(f"Title: {title}")
print(f"Source: {source}")
print(f"Editor: {editor}")
print(f"Full Text: {full_text}")