Python and PDF

Python PDF Generation from HTML with WeasyPrint

While there are numerous ways to handle PDF documents with Python, I find generating or editing HTML far easier and more reliable than trying to figure out the intricacies of the PDF format. Sure, there is the venerable ReportLab, and if HTML is not your cup of tea, I encourage you to look into that option. There is also PyPDF2. Or maybe PyPDF3? No, perhaps PyPDF4! Hmmm... see the problem? My best guess is PyPDF3, for what that is worth.

So many choices...


But there is an easy choice if you are comfortable with HTML.

Enter WeasyPrint. It takes HTML and CSS, and converts them to a usable and potentially beautiful PDF document.

The code samples in this article can be accessed in the associated GitHub repo. Feel free to clone and adapt.

Installation

To install WeasyPrint, I recommend you first set up a virtual environment with the tool of your choice.

Then, installation is as simple as performing something like the following in an activated virtual environment:

pip install weasyprint

Alternatives to the above, depending on your tooling:

You get the idea.

If you only want the weasyprint command-line tool, you could even use pipx and install with pipx install weasyprint. While that would not make it very convenient to access as a Python library, if you just want to convert web pages to PDFs, that may be all you need.

A command line tool (Python usage optional)

Once installed, the weasyprint command line tool is available. You can convert an HTML file or a web page to PDF. For instance, you could try the following:

weasyprint \
"https://en.wikipedia.org/wiki/Python_(programming_language)" \
python.pdf

The above command saves a file python.pdf in the current working directory, converted from the English Wikipedia article on the Python programming language. It ain't perfect, but hopefully it gives you an idea.

You don't have to specify a web address, of course. Local HTML files work fine, and they provide necessary control over content and styling.

weasyprint sample.html out/sample.pdf

Feel free to download a sample.html and an associated sample.css stylesheet with the contents of this article.

See the WeasyPrint docs for further examples and instructions regarding the standalone weasyprint command line tool.

Utilizing WeasyPrint as a Python library

The Python API for WeasyPrint is quite versatile. It can load HTML from a file name, an open file object, or the text of the HTML itself.

Here is an example of a simple makepdf() function that accepts an HTML string, and returns the binary PDF data.

from weasyprint import HTML


def makepdf(html):
    """Generate a PDF file from a string of HTML."""
    htmldoc = HTML(string=html, base_url="")
    return htmldoc.write_pdf()

The main workhorse here is the HTML class. When instantiating it, I found I needed to pass a base_url parameter in order for it to load images and other assets from relative urls, as in <img src="somefile.png">.
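To see why base_url matters, here is a small standard-library sketch of how a relative URL such as somefile.png resolves against a base location. The /srv/site path is a made-up example, and WeasyPrint's own resolution may differ in detail; this only illustrates the general idea that a relative URL is meaningless without a base to join it to.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin


def resolve_asset(html_path, relative_url):
    """Resolve a relative asset URL against the location of an HTML file."""
    # Turn the (absolute) file path into a file:// URL to act as the base.
    base = PurePosixPath(html_path).as_uri()
    return urljoin(base, relative_url)


# Hypothetical location of the HTML file; any absolute POSIX path works.
print(resolve_asset("/srv/site/sample.html", "somefile.png"))
```

With no base at all, there is nothing to join somefile.png to, which matches the behavior I ran into.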

When you use HTML and write_pdf(), WeasyPrint parses not only the HTML but also the associated CSS, whether it is embedded in the head of the HTML (in a <style> tag) or included as a stylesheet (with a <link href="sample.css" rel="stylesheet"> tag).

I should note that HTML can load straight from files, and write_pdf() can write to a file, by specifying filenames or file pointers. See the docs for more detail.
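As a rough sketch of that file-to-file route, the following uses the filename argument of HTML and passes a target path to write_pdf(). The helper names pdf_path_for and convert_file, and the default out directory, are my own inventions for illustration.

```python
from pathlib import Path


def pdf_path_for(infile, outdir):
    """Return the PDF path inside outdir that mirrors the HTML file name."""
    return Path(outdir) / Path(infile).with_suffix(".pdf").name


def convert_file(infile, outdir="out"):
    """Convert an HTML file on disk to a PDF file, returning the PDF path."""
    # Imported here so pdf_path_for() stays usable without WeasyPrint installed.
    from weasyprint import HTML

    outfile = pdf_path_for(infile, outdir)
    # Create the output directory if it does not exist yet.
    outfile.parent.mkdir(parents=True, exist_ok=True)
    # Load straight from the file and write straight to the target path.
    HTML(filename=str(infile)).write_pdf(str(outfile))
    return outfile
```

Calling convert_file("sample.html") should then leave the result at out/sample.pdf, assuming sample.html exists in the working directory.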

Here is a more full-fledged example of the above, with primitive command line handling capability added:

from pathlib import Path
import sys

from weasyprint import HTML


def makepdf(html):
    """Generate a PDF file from a string of HTML."""
    htmldoc = HTML(string=html, base_url="")
    return htmldoc.write_pdf()


def run():
    """Command runner."""
    infile = sys.argv[1]
    outfile = sys.argv[2]
    html = Path(infile).read_text()
    pdf = makepdf(html)
    Path(outfile).write_bytes(pdf)


if __name__ == "__main__":
    run()

You may download the above file directly, or browse the GitHub repo.

A note about Python types: the string parameter when instantiating HTML is a normal (Unicode) str, but makepdf() outputs bytes.

Assuming the above file is in your working directory as weasyprintdemo.py and that a sample.html and an out directory are also there, the following should work well:

python weasyprintdemo.py sample.html out/sample.pdf

Try it out, then open out/sample.pdf with your PDF reader. Are we close?
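If the bare sys.argv handling above feels too primitive, one possible upgrade is argparse from the standard library, which gives you usage messages and error handling for free. This is only a sketch; the parse_args name and the help texts are my own choices, not part of the demo file.

```python
import argparse


def parse_args(argv=None):
    """Parse the input HTML path and output PDF path from the command line."""
    parser = argparse.ArgumentParser(description="Convert an HTML file to PDF.")
    parser.add_argument("infile", help="path to the source HTML file")
    parser.add_argument("outfile", help="path of the PDF file to write")
    # With argv=None, argparse falls back to sys.argv[1:].
    return parser.parse_args(argv)


# Passing a list explicitly makes the parser easy to try without a shell.
args = parse_args(["sample.html", "out/sample.pdf"])
print(args.infile, args.outfile)
```

Inside run(), args.infile and args.outfile would then stand in for sys.argv[1] and sys.argv[2].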

Styling HTML for print

As is probably apparent, using WeasyPrint is easy. The real work with HTML to PDF conversion, however, is in the styling. Thankfully, CSS has pretty good support for printing.

Some useful CSS print resources:

This simple stylesheet demonstrates a few basic tricks:

body {
  font-family: sans-serif;
}
@media print {
  a::after {
    content: " (" attr(href) ") ";
  }
  pre {
    white-space: pre-wrap;
  }
  @page {
    margin: 0.75in;
    size: Letter;
    @top-right {
      content: counter(page);
    }
  }
  @page :first {
    @top-right {
      content: "";
    }
  }
}

First, use media queries. This allows you to use the same stylesheet for both print and screen, using @media print and @media screen respectively. In the example stylesheet, I assume that the defaults (such as seen in the body declaration) apply to all formats, and that @media print provides overrides. Alternatively, you could include separate stylesheets for print and screen, using the media attribute of the <link> tag, as in <link rel="stylesheet" href="print.css" media="print" />.

Second, use @page CSS rules. While browser support is pretty abysmal in 2020, WeasyPrint does a pretty good job of supporting what you need. Note the margin and size adjustments above, and the page numbering: we first display the page counter in the top-right margin box, then override it with :first to make it blank on the first page only. In other words, page numbers only show from page 2 onward.

Also note the a::after trick to explicitly display the href attribute when printing. This is either clever or annoying, depending on your goals.

Another hint, not demonstrated above: within the @media print block, set display: none on any elements that don't need to be printed, and set background: none where you don't want backgrounds printed.
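If you cannot (or would rather not) edit the page's own stylesheet, one option is to hand WeasyPrint an extra stylesheet at render time through the stylesheets argument of write_pdf(). The sketch below assumes hypothetical nav and .no-print selectors that are not part of sample.css. As I understand it, WeasyPrint treats these as user stylesheets, so the document's own rules can win over them in the cascade; marking the declarations !important is one way around that.

```python
# Hypothetical print-only rules; the selectors are examples, not from sample.css.
PRINT_EXTRAS = """
@media print {
  nav, .no-print { display: none !important; }
  body { background: none !important; }
}
"""


def makepdf_with_extras(html, css=PRINT_EXTRAS):
    """Render HTML to PDF bytes, adding an extra stylesheet at render time."""
    # Imported lazily so this module loads even where WeasyPrint is missing.
    from weasyprint import CSS, HTML

    extra = CSS(string=css)
    return HTML(string=html, base_url="").write_pdf(stylesheets=[extra])
```

Like makepdf(), this returns bytes, so the result can be written out with Path.write_bytes().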

Django and Flask support

If you write Django or Flask apps, you may benefit from the convenience of the respective libraries for generating PDFs within these frameworks:

Generate HTML the way you like

WeasyPrint encourages the developer to make HTML and CSS, and the PDF just happens. If that fits your skill set, then you may enjoy experimenting with and utilizing this library.

How you generate HTML is entirely up to you. You might:

Then generate the PDF using WeasyPrint.
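As one illustration of the "generate HTML however you like" idea, here is a minimal sketch using only the standard library: string.Template for the page skeleton and html.escape for the data. The PAGE template, the render_page name, and the grocery data are all made up for the example; a real project might well prefer a full template engine.

```python
from html import escape
from string import Template

# A tiny page skeleton; $title and $items are filled in by render_page().
PAGE = Template("""<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>$title</title></head>
<body>
<h1>$title</h1>
<ul>
$items
</ul>
</body>
</html>
""")


def render_page(title, items):
    """Build an HTML page from plain Python data, escaping the text."""
    rows = "\n".join(f"<li>{escape(item)}</li>" for item in items)
    return PAGE.substitute(title=escape(title), items=rows)


html = render_page("Groceries", ["Milk & eggs", "Cereal"])
print(html)
```

The resulting string can be passed straight to the makepdf() function from earlier in the article.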

Anything I missed? Feel free to leave comments!