The moving parts
================

html5lib consists of a number of components, which are responsible for
handling its features.

Parsing uses a *tree builder* to generate a *tree*, the in-memory
representation of the document. Several tree representations are supported,
as are translations to other formats via *tree adapters*. The tree may be
translated to a token stream with a *tree walker*, from which
:class:`~html5lib.serializer.HTMLSerializer` produces a stream of bytes.
The token stream may also be transformed by use of *filters* to accomplish
tasks like sanitization.

Tree builders
-------------

The parser reads HTML by tokenizing the content and building a tree that
the user can later access. html5lib can build three types of trees:

* ``etree`` - this is the default; builds a tree based on :mod:`xml.etree`,
  which can be found in the standard library. Whenever possible, the
  accelerated ``ElementTree`` implementation (i.e.
  ``xml.etree.cElementTree`` on Python 2.x) is used.

* ``dom`` - builds a tree based on :mod:`xml.dom.minidom`.

* ``lxml`` - uses the :mod:`lxml.etree` implementation of the
  ``ElementTree`` API. The performance gains are relatively small compared
  to using the accelerated ``ElementTree`` module.

You can specify the builder by name when using the shorthand API:

.. code-block:: python

  import html5lib

  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

To get a builder class by name, use the
:func:`~html5lib.treebuilders.getTreeBuilder` function.

When instantiating a :class:`~html5lib.html5parser.HTMLParser` object, you
can pass a tree builder class via the ``tree`` keyword argument:

.. code-block:: python

  import html5lib

  TreeBuilder = html5lib.getTreeBuilder("dom")
  parser = html5lib.HTMLParser(tree=TreeBuilder)
  minidom_document = parser.parse("<p>Hello World!")

The implementation of builders can be found in ``html5lib/treebuilders/``.

Tree walkers
------------

In addition to manipulating a tree directly, you can use a tree walker to
generate a streaming view of it. html5lib provides walkers for ``etree``,
``dom``, and ``lxml`` trees, as well as ``genshi`` markup streams.

The implementation of walkers can be found in ``html5lib/treewalkers/``.

html5lib provides :class:`~html5lib.serializer.HTMLSerializer` for
generating a stream of bytes from a token stream, and several filters which
manipulate the stream.

HTMLSerializer
~~~~~~~~~~~~~~

The serializer lets you write HTML back as a stream of bytes.

.. code-block:: pycon

  >>> import html5lib
  >>> element = html5lib.parse('<p>Witam wszystkich')
  >>> walker = html5lib.getTreeWalker("etree")
  >>> stream = walker(element)
  >>> s = html5lib.serializer.HTMLSerializer()
  >>> output = s.serialize(stream)
  >>> for item in output:
  ...   print("%r" % item)
  '<p'
  '>'
  'Witam wszystkich'

You can customize the serializer behaviour in a variety of ways. Consult
the :class:`~html5lib.serializer.HTMLSerializer` documentation.

Filters
~~~~~~~

html5lib provides several filters:

* :class:`alphabeticalattributes.Filter
  <html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
  tags to be in alphabetical order

* :class:`inject_meta_charset.Filter
  <html5lib.filters.inject_meta_charset.Filter>` sets a user-specified
  encoding in the correct ``<meta>`` tag in the ``<head>`` section of
  the document

* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
  :exc:`AssertionError` exceptions on invalid tag and attribute names,
  invalid PCDATA, etc.

* :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
  removes tags from the token stream which are not necessary to produce
  valid HTML

* :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
  unsafe markup and CSS. Elements that are known to be safe are passed
  through and the rest is converted to visible text. The default
  configuration of the sanitizer follows the WHATWG Sanitization Rules.

* :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
  collapses all whitespace characters to single spaces unless they're in
  ``<pre/>`` or ``