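{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Beautiful Soup 4.4.0 Documentation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying a document. Beautiful Soup can save you hours or even days of work.\n", "\n", "This document covers all the major features of Beautiful Soup 4, with short examples. It shows what the library is good for, how it works, how to use it, how to get the result you want, and how to handle unexpected cases.\n", "\n", "The examples in this document produce the same results in Python 2.7 and Python 3.2." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Quick Start" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following HTML snippet is used as an example throughout this document. It is an excerpt from Alice in Wonderland (referred to below as the Alice document):" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "html_doc = \"\"\"\n", "The Dormouse's story\n", "\n", "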

The Dormouse's story

\n", "\n", "

Once upon a time there were three little sisters; and their names were\n", "Elsie,\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.

\n", "\n", "

...

\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Parsing this snippet with Beautiful Soup yields a BeautifulSoup object, which can print the document as a nested structure with standard indentation:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " \n", " \n", " The Dormouse's story\n", " \n", " \n", " \n", "

\n", " \n", " The Dormouse's story\n", " \n", "

\n", "

\n", " Once upon a time there were three little sisters; and their names were\n", " \n", " Elsie\n", " \n", " ,\n", " \n", " Lacie\n", " \n", " and\n", " \n", " Tillie\n", " \n", " ;\n", "and they lived at the bottom of a well.\n", "

\n", "

\n", " ...\n", "

\n", " \n", "\n" ] } ], "source": [ "from bs4 import BeautifulSoup\n", "soup = BeautifulSoup(html_doc, 'html.parser')\n", "\n", "print(soup.prettify())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are a few simple ways to navigate the parsed data structure:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<title>The Dormouse's story</title>" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.title" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'title'" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.title.name" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"The Dormouse's story\"" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.title.string" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'head'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.title.parent.name" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "

The Dormouse's story

Sacré bleu!

Extremely bold