The moving parts ================ html5lib consists of a number of components, which are responsible for handling its features. Parsing uses a *tree builder* to generate a *tree*, the in-memory representation of the document. Several tree representations are supported, as are translations to other formats via *tree adapters*. The tree may be translated to a token stream with a *tree walker*, from which :class:`~html5lib.serializer.HTMLSerializer` produces a stream of bytes. The token stream may also be transformed by use of *filters* to accomplish tasks like sanitization. Tree builders ------------- The parser reads HTML by tokenizing the content and building a tree that the user can later access. html5lib can build three types of trees: * ``etree`` - this is the default; builds a tree based on :mod:`xml.etree`, which can be found in the standard library. Whenever possible, the accelerated ``ElementTree`` implementation (i.e. ``xml.etree.cElementTree`` on Python 2.x) is used. * ``dom`` - builds a tree based on :mod:`xml.dom.minidom`. * ``lxml`` - uses the :mod:`lxml.etree` implementation of the ``ElementTree`` API. The performance gains are relatively small compared to using the accelerated ``ElementTree`` module. You can specify the builder by name when using the shorthand API: .. code-block:: python import html5lib with open("mydocument.html", "rb") as f: lxml_etree_document = html5lib.parse(f, treebuilder="lxml") To get a builder class by name, use the :func:`~html5lib.treebuilders.getTreeBuilder` function. When instantiating a :class:`~html5lib.html5parser.HTMLParser` object, you must pass a tree builder class via the ``tree`` keyword attribute: .. code-block:: python import html5lib TreeBuilder = html5lib.getTreeBuilder("dom") parser = html5lib.HTMLParser(tree=TreeBuilder) minidom_document = parser.parse("
Hello World!")
The implementation of builders can be found in `html5lib/treebuilders/
Witam wszystkich')
>>> walker = html5lib.getTreeWalker("etree")
>>> stream = walker(element)
>>> s = html5lib.serializer.HTMLSerializer()
>>> output = s.serialize(stream)
>>> for item in output:
... print("%r" % item)
' '
'Witam wszystkich'
You can customize the serializer behaviour in a variety of ways. Consult
the :class:`~html5lib.serializer.HTMLSerializer` documentation.
Filters
~~~~~~~
html5lib provides several filters:
* :class:`alphabeticalattributes.Filter