{ "cells": [
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Beautiful Soup 4.4.0 Documentation" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.\n", "\n", "This document covers all major features of Beautiful Soup 4, with short examples. It shows what the library is good for, how it works, how to use it, how to make it do what you want, and what to do when it violates your expectations.\n", "\n", "The examples in this document produce the same results in Python 2.7 and Python 3.2." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Quick Start" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The following HTML document is used as an example throughout. It is an excerpt from Alice in Wonderland (referred to below as the Alice document):" ] },
{ "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "html_doc = \"\"\"\n", "<html><head><title>The Dormouse's story</title></head>\n", "<body>\n", "<p class=\"title\"><b>The Dormouse's story</b></p>\n", "\n", "<p class=\"story\">Once upon a time there were three little sisters; and their names were\n", "<a href=\"http://example.com/elsie\" class=\"sister\" id=\"link1\">Elsie</a>,\n", "<a href=\"http://example.com/lacie\" class=\"sister\" id=\"link2\">Lacie</a> and\n", "<a href=\"http://example.com/tillie\" class=\"sister\" id=\"link3\">Tillie</a>;\n", "and they lived at the bottom of a well.</p>\n", "\n", "<p class=\"story\">...</p>\n", "\"\"\"" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Running the Alice document through Beautiful Soup gives a BeautifulSoup object, which can print the document as a nested, conventionally indented structure:" ] },
{ "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<html>\n", " <head>\n", "  <title>\n", "   The Dormouse's story\n", "  </title>\n", " </head>\n", " <body>\n", "  <p class=\"title\">\n", "   <b>\n", "    The Dormouse's story\n", "   </b>\n", "  </p>\n", "  <p class=\"story\">\n", "   Once upon a time there were three little sisters; and their names were\n", "   <a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">\n", "    Elsie\n", "   </a>\n", "   ,\n", "   <a class=\"sister\" href=\"http://example.com/lacie\" id=\"link2\">\n", "    Lacie\n", "   </a>\n", "   and\n", "   <a class=\"sister\" href=\"http://example.com/tillie\" id=\"link3\">\n", "    Tillie\n", "   </a>\n", "   ;\n", "and they lived at the bottom of a well.\n", "  </p>\n", "  <p class=\"story\">\n", "   ...\n", "  </p>\n", " </body>\n", "</html>\n" ] } ], "source": [ "from bs4 import BeautifulSoup\n", "soup = BeautifulSoup(html_doc, 'html.parser')\n", "\n", "print(soup.prettify())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Here are some simple ways to navigate that data structure:" ] },
{ "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<title>The Dormouse's story</title>" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.title" ] },
{ "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'title'" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.title.name" ] },
{ "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"The Dormouse's story\"" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.title.string" ] },
{ "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'head'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.title.parent.name" ] },
{ "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<p class=\"title\"><b>The Dormouse's story</b></p>" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.p" ] },
{ "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['title']" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.p['class']" ] },
{ "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">Elsie</a>" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.a" ] },
{ "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[<a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">Elsie</a>,\n", " <a class=\"sister\" href=\"http://example.com/lacie\" id=\"link2\">Lacie</a>,\n", " <a class=\"sister\" href=\"http://example.com/tillie\" id=\"link3\">Tillie</a>]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find_all('a')" ] },
{ "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<a class=\"sister\" href=\"http://example.com/tillie\" id=\"link3\">Tillie</a>" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find(id=\"link3\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Extract the URLs from all `<a>` tags in the document:" ] },
{ "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "http://example.com/elsie\n", "http://example.com/lacie\n", "http://example.com/tillie\n" ] } ], "source": [ "for link in soup.find_all('a'):\n", "    print(link.get('href'))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Extract all the text from the document:" ] },
{ "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "The Dormouse's story\n", "\n", "The Dormouse's story\n", "Once upon a time there were three little sisters; and their names were\n", "Elsie,\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.\n", "...\n", "\n" ] } ], "source": [ "print(soup.get_text())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"Does this look like what you need? If so, read on; there is plenty more." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Installing Beautiful Soup" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "If you are using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:\n", "\n", "$ apt-get install python-bs4\n", "\n", "Beautiful Soup 4 is published through PyPI, so if you cannot install it with the system package manager, you can install it with easy_install or pip. The package name is beautifulsoup4, and the same package works on Python 2 and Python 3.\n", "\n", "$ easy_install beautifulsoup4\n", "\n", "$ pip install beautifulsoup4\n", "\n", "(PyPI also hosts a package named BeautifulSoup, but it is probably not what you want. It is the previous major release, Beautiful Soup 3. Many projects still use BS3, so the BeautifulSoup package remains available. If you are writing new code, install beautifulsoup4.)\n", "\n", "If you have neither easy_install nor pip installed, you can download the Beautiful Soup 4 source and install it with setup.py.\n", "\n", "$ python setup.py install\n", "\n", "If all else fails, the license for Beautiful Soup allows you to package the BS4 code with your project, so that it can be used without being installed.\n", "\n", "The author develops Beautiful Soup under Python 2.7 and Python 3.2; in principle it should work with all current Python versions." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Problems after installation" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Beautiful Soup is packaged as Python 2 code. When you install it for use with Python 3, it is automatically converted to Python 3 code. If you do not install the package, the code is not converted.\n", "\n", "If you get the ImportError `No module named HTMLParser`, you are running the Python 2 version of the code under Python 3.\n", "\n", "If you get the ImportError `No module named html.parser`, you are running the Python 3 version of the code under Python 2.\n", "\n", "In both cases, the best fix is to reinstall beautifulsoup4.\n", "\n", "If you get the SyntaxError `Invalid syntax` on the line `ROOT_TAG_NAME = u'[document]'`, you need to convert the BS4 code from Python 2 to Python 3. You can do this by reinstalling BS4:\n", "\n", "$ python3 setup.py install\n", "\n", "or by running Python's conversion script on the bs4 directory:\n", "\n", "$ 2to3-3.2 -w bs4" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Installing a parser" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Beautiful Soup supports the HTML parser included in Python's standard library, and it also supports several third-party parsers. One of them is lxml. Depending on your setup, you can install lxml with one of these commands:\n", "\n", "$ apt-get install python-lxml\n", "\n", "$ easy_install lxml\n", "\n", "$ pip install lxml\n", "\n", "Another option is the pure-Python html5lib, which parses HTML the way a web browser does. Depending on your setup, you can install html5lib with one of these commands:\n", "\n", "$ apt-get install python-html5lib\n", "\n", "$ easy_install html5lib\n", "\n", "$ pip install html5lib\n", "\n", "The table below lists the main parsers:\n", "\n", "| Parser | Usage |\n", "| --- | --- |\n", "| Python standard library | `BeautifulSoup(markup, \"html.parser\")` |\n", "| lxml HTML parser | `BeautifulSoup(markup, \"lxml\")` |\n", "| lxml XML parser | `BeautifulSoup(markup, \"lxml-xml\")` or `BeautifulSoup(markup, \"xml\")` |\n", "| html5lib | `BeautifulSoup(markup, \"html5lib\")` |\n", "\n", "lxml is the recommended parser because it is faster. If you are using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, you must install lxml or html5lib, because the HTML parser built into the standard library of those versions is not reliable.\n", "\n", "Note: if a piece of HTML or XML is not well formed, different parsers may return different results. See Differences between parsers for details." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# How to use it" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "To parse a document, pass it to the BeautifulSoup constructor. You can pass in a string or an open file handle." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "    from bs4 import BeautifulSoup\n", "\n", "    soup = BeautifulSoup(open(\"index.html\"))\n", "\n", "    soup = BeautifulSoup(\"<html>data</html>\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:" ] },
{ "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<html><body><p>Sacré bleu!</p></body></html>" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "BeautifulSoup(\"Sacr&eacute; bleu!\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Beautiful Soup then parses the document with the best parser available. If you specify a parser explicitly, Beautiful Soup uses that parser instead." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Kinds of objects" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. Every node is a Python object, and all of them fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Tag" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "A Tag object corresponds to an XML or HTML tag in the original document:" ] },
{ "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bs4.element.Tag" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup = BeautifulSoup('<b class=\"boldest\">Extremely bold</b>')\n", "tag = soup.b\n", "type(tag)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Tags have many methods and attributes, which are covered in detail in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and its attributes." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Name" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Every tag has a name, accessible as .name:" ] },
{ "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'b'" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tag.name" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "If you change a tag's name, the change is reflected in any HTML markup generated from the current Beautiful Soup object:" ] },
{ "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<blockquote class=\"boldest\">Extremely bold</blockquote>" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tag.name = \"blockquote\"\n", "tag" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Attributes" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "A tag may have any number of attributes. The tag `<b class=\"boldest\">` has an attribute `class` whose value is `boldest`. You can access a tag's attributes by treating the tag like a dictionary:" ] },
{ "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['boldest']" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tag['class']" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "You can also access the attribute dictionary directly as .attrs:" ] },
{ "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'class': ['boldest']}" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tag.attrs" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "You can add, remove, and modify a tag's attributes. Again, this is done by treating the tag like a dictionary:" ] },
{ "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<blockquote class=\"verybold\" id=\"1\">Extremely bold</blockquote>" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tag['class'] = 'verybold'\n", "tag['id'] = 1\n", "tag" ] },
{ "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<blockquote>Extremely bold</blockquote>" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "del tag['class']\n", "del tag['id']\n", "tag" ] },
{ "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "ename": "KeyError", "evalue": "'class'", "output_type": "error", "traceback": [ "---------------------------------------------------------------------------", "KeyError                                  Traceback (most recent call last)", "----> 1 tag['class']\n", "D:\\ProgramData\\Anaconda3\\lib\\site-packages\\bs4\\element.py in __getitem__(self, key)\n   1069         \"\"\"tag[key] returns the value of the 'key' attribute for the tag,\n   1070         and throws an exception if it's not there.\"\"\"\n-> 1071         return self.attrs[key]\n   1072 \n   1073     def __iter__(self):\n", "KeyError: 'class'" ] } ], "source": [ "tag['class']\n", "# KeyError: 'class'" ] },
{ "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "None\n" ] } ], "source": [ "print(tag.get('class'))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Multi-valued attributes" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"HTML 4 defines a few attributes that can have multiple values. HTML5 removes a couple of them but defines a few more. The most common multi-valued attribute is class (a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup returns the value of a multi-valued attribute as a list:" ] },
{ "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['body', 'strikeout']" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "css_soup = BeautifulSoup('<p class=\"body strikeout\"></p>')\n", "css_soup.p['class']" ] },
{ "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['body']" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "css_soup = BeautifulSoup('<p class=\"body\"></p>')\n", "css_soup.p['class']" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "If an attribute looks like it has more than one value, but no version of the HTML standard defines it as a multi-valued attribute, Beautiful Soup returns the attribute as a single string:" ] },
{ "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'my id'" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "id_soup = BeautifulSoup('<p id=\"my id\"></p>')\n", "id_soup.p['id']" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "When you turn a tag back into a string, the values of a multi-valued attribute are merged into one:" ] },
{ "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['index']" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rel_soup = BeautifulSoup('<p>Back to the <a rel=\"index\">homepage</a></p>')\n", "rel_soup.a['rel']" ] },
{ "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<p>Back to the <a rel=\"index contents\">homepage</a></p>\n" ] } ], "source": [ "rel_soup.a['rel'] = ['index', 'contents']\n", "print(rel_soup.p)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "If you parse a document as XML, there are no multi-valued attributes:" ] },
{ "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'body strikeout'" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xml_soup = BeautifulSoup('<p class=\"body strikeout\"></p>', 'xml')\n", "xml_soup.p['class']" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# NavigableString" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "A string corresponds to a piece of text inside a tag. Beautiful Soup uses the NavigableString class to wrap these pieces of text:" ] },
{ "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Extremely bold'" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tag.string" ] },
{ "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bs4.element.NavigableString" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(tag.string)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "A NavigableString behaves like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a NavigableString to a plain Unicode string with str() (unicode() in Python 2):" ] },
{ "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "str" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# In Python 2 this would be unicode(tag.string)\n", "unicode_string = str(tag.string)\n", "type(unicode_string)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "You cannot edit a string in place, but you can replace one string with another by using replace_with():" ] },
{ "cell_type": "code", "execution_count": 54, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "<blockquote>No longer bold</blockquote>" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tag.string.replace_with(\"No longer bold\")\n", "tag" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "NavigableString supports most of the features described in Navigating the tree and Searching the tree, but not all of them. In particular, a string cannot contain anything (whereas a tag may contain a string or another tag), so strings do not support the .contents or .string attributes, or the find() method.\n", "\n", "If you want to use a NavigableString outside of Beautiful Soup, call str() on it (unicode() in Python 2) to turn it into a plain Unicode string. Otherwise the string keeps a reference to the entire Beautiful Soup parse tree even after you have finished using Beautiful Soup, which wastes memory." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }
], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }