JustHTML: agent usage notes (copy/paste)

Official docs: https://emilstenstrom.github.io/justhtml/
Examples migrating code to JustHTML: https://emilstenstrom.github.io/justhtml/migration-examples.html

Goal
- Parse HTML5 into a DOM that is easy to query and serialize.
- Safe by default: sanitization runs during construction unless explicitly disabled.

Core model
- HTML (str/bytes) -> parse -> transforms -> sanitize -> query/serialize.
- Use `fragment=True` for snippets; use document mode for full pages with html/head/body.

Note on sanitization
- When `sanitize=True`, JustHTML ensures exactly one sanitize pass runs by auto-appending `Sanitize(...)` to the end of `transforms` (unless you already included `Sanitize(...)`).

Safety
- `sanitize=True` by default.
- Use `sanitize=False` when you're only extracting data (querying, `to_text()`, etc.) or you will sanitize later.
- Avoid emitting HTML from an unsanitized tree unless the input is trusted.
- Optional: pass `policy=SanitizationPolicy(...)` to customize allowlists.

Transform ordering (most common pitfall)
- If `sanitize=True` and you DO NOT include `Sanitize(...)` in `transforms`, JustHTML auto-appends a final sanitize step.
- If you DO include `Sanitize(...)` in `transforms`, JustHTML will NOT auto-append another.
- If a transform must run AFTER sanitization, include `Sanitize(...)` explicitly and put your later transforms after it.

Output (pick one)
- Readable text with line breaks: `to_markdown()` (handles `<br>`, paragraphs/lists, and normalizes whitespace).
- Plain text for indexing/search: `to_text(separator=" ", strip=True)` (fast; no layout-aware newlines).

Common patterns (from the migration examples)
- Query: `doc.root.query(".product.special")` (or `doc.query("...")`) returns a list of nodes.
- Attributes: `a.attrs.get("href")`.
- Pretty HTML: `node.to_html(pretty=True)`.
- Comments: use the `:comment` selector.
- If you allow links, keep URL handling explicit with `UrlPolicy(...)` / `UrlRule(...)`. The same applies to inline CSS, which must also be explicitly allowed.

Diagnostics (when debugging)
- `strict=True` to raise on parse errors; `collect_errors=True` to inspect errors without raising.
- `track_node_locations=True` to get `origin_offset`, `origin_line`, and `origin_col` on nodes.

Transforms (keep it simple)
- Prefer selector-based transforms: `Drop`, `Unwrap`, `Empty`, `Edit`.
- Treat `EditDocument(...)` as a last resort.
- Prefer multiple smaller transforms over big, complex ones.
- Keep transforms stateless (no `nonlocal`/flags). Use ordering + DOM re-checks instead.
- Avoid regex post-processing unless there's no better option.

Built-in transforms (quick reference)
All transforms are importable from `justhtml` (they also live in `justhtml.transforms`).
Full docs: https://emilstenstrom.github.io/justhtml/transforms.html

- Core selector transforms (exported from `justhtml`)
  - `SetAttrs(selector, attributes=None, **attrs)`: set/overwrite attributes on matching elements.
  - `Drop(selector)`: remove matching nodes.
  - `Unwrap(selector)`: remove the element but keep its children.
  - `Escape(selector)`: escape the element's tags (as text) but keep its children.
  - `Empty(selector)`: remove all children of matching elements.
  - `Edit(selector, func)`: run custom logic for matching elements.
- Attribute-only transforms
  - `EditAttrs(selector, func)`: rewrite attributes based on a callback (`RewriteAttrs` is an alias).
  - `DropAttrs(selector, patterns=())`: drop attributes matching glob-like patterns (ex: `("data-*", "on*")`).
  - `AllowlistAttrs(selector, allowed_attributes=...)`: keep only allowlisted attributes.
- Text/cleanup
  - `CollapseWhitespace(skip_tags=(...))`: collapse HTML whitespace runs in text nodes.
  - `PruneEmpty(selector, strip_whitespace=True)`: recursively drop empty elements.
- Sanitization and policy building blocks
  - `Sanitize(policy=None)`: sanitize the in-memory tree (same sanitizer as `sanitize=True`).
  - `DropComments()`: drop `#comment` nodes.
  - `DropDoctype()`: drop `!doctype` nodes.
  - `DropForeignNamespaces()`: drop elements in foreign namespaces (SVG/MathML).
  - `DropUrlAttrs(selector, url_policy=...)`: validate/rewrite/drop URL-valued attributes.
  - `AllowStyleAttrs(selector, allowed_css_properties=...)`: sanitize inline `style` attributes.
  - `MergeAttrs(tag, attr=..., tokens=...)`: merge tokens into a whitespace-delimited attribute (ex: enforce `rel`).
- Advanced building blocks
  - `Decide(selector, func)`: keep/drop/unwrap/empty/escape based on a callback.
  - `EditDocument(func)`: run once on the root container.
  - `Stage([...])`: group transforms into explicit passes (advanced).
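
Worked example: transform ordering
The auto-append rule from "Transform ordering" is easy to get wrong, so here is a toy model of it in plain Python. It does not import or call JustHTML. `Step`, `effective_pipeline`, and `names` are hypothetical names made up for this sketch, and it only mirrors the rule as stated in these notes, not the library's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Hypothetical stand-in for a transform; not a JustHTML class."""
    name: str


SANITIZE = Step("Sanitize")


def effective_pipeline(transforms, sanitize=True):
    """Return the steps that would run, following the rule in these notes."""
    steps = list(transforms)
    has_sanitize = any(step.name == "Sanitize" for step in steps)
    # Auto-append only when sanitization is on and no explicit Sanitize exists.
    if sanitize and not has_sanitize:
        steps.append(SANITIZE)
    return steps


def names(steps):
    """Flatten a list of steps to their names for easy comparison."""
    return [step.name for step in steps]


# No explicit Sanitize: one is appended as the final step.
implicit = names(effective_pipeline([Step("Drop"), Step("Unwrap")]))

# Explicit Sanitize: nothing is appended, so SetAttrs runs after sanitization.
explicit = names(effective_pipeline([Step("Drop"), SANITIZE, Step("SetAttrs")]))

# sanitize=False with no Sanitize step: the list is left as given.
disabled = names(effective_pipeline([Step("Drop")], sanitize=False))

print(implicit)  # ['Drop', 'Unwrap', 'Sanitize']
print(explicit)  # ['Drop', 'Sanitize', 'SetAttrs']
print(disabled)  # ['Drop']
```

The second case is the one that matters in practice: a transform that adds attributes or markup after sanitization is not re-checked, so only place trusted, deterministic transforms after an explicit `Sanitize(...)`.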