While there are numerous ways to handle PDF documents with Python, I find generating or editing HTML far easier and more reliable than trying to figure out the intricacies of the PDF format. Sure, there is the venerable ReportLab, and if HTML is not your cup of tea, I encourage you to look into that option. There is also PyPDF2. Or maybe PyPDF3? No, perhaps PyPDF4! Hmmm... see the problem? My best guess is PyPDF3, for what that is worth.
So many choices...
But there is an easy choice if you are comfortable with HTML.
Enter WeasyPrint. It takes HTML and CSS, and converts them to a usable and potentially beautiful PDF document.
The code samples in this article can be accessed in the associated GitHub repo. Feel free to clone and adapt.
To install WeasyPrint, I recommend you first set up a virtual environment with the tool of your choice.
Then, installation is as simple as performing something like the following in an activated virtual environment:
pip install weasyprint
Alternatives to the above, depending on your tooling:
poetry add weasyprint
conda install -c conda-forge weasyprint
pipenv install weasyprint
You get the idea.
If you only want the weasyprint command-line tool, you could even use pipx
and install with pipx install weasyprint. While that would not make it very
convenient to access as a Python library, if you just want to convert web
pages to PDFs, that may be all you need.
Once installed, the weasyprint command line tool is
available. You can convert an HTML file or a web page to PDF. For
instance, you could try the following:
weasyprint \
"https://en.wikipedia.org/wiki/Python_(programming_language)" \
python.pdf
The above command will save a file python.pdf in the current
working directory, converted from the HTML from the
Python programming language article in English on Wikipedia. It ain't perfect, but it gives you an idea, hopefully.
You don't have to specify a web address, of course. Local HTML files work fine, and they provide necessary control over content and styling.
weasyprint sample.html out/sample.pdf
Feel free to download a sample.html and an associated sample.css stylesheet
with the contents of this article.
See the WeasyPrint docs for further examples and instructions regarding the
standalone weasyprint command line tool.
The Python API for WeasyPrint is quite versatile. It can be used to load HTML when passed appropriate file pointers, file names, or the text of the HTML itself.
Here is an example of a simple makepdf() function that
accepts an HTML string, and returns the binary PDF data.
from weasyprint import HTML


def makepdf(html):
    """Generate a PDF file from a string of HTML."""
    htmldoc = HTML(string=html, base_url="")
    return htmldoc.write_pdf()
The main workhorse here is the HTML class. When instantiating
it, I found I needed to pass a base_url parameter in order
for it to load images and other assets from relative urls, as in
<img src="somefile.png">.
With HTML and write_pdf(), not only is the HTML parsed, but the associated
CSS is applied as well, whether it is embedded in the head of the HTML (in a
<style> tag) or included in a stylesheet (with a
<link href="sample.css" rel="stylesheet"> tag).
I should note that HTML can load straight from files, and
write_pdf() can write to a file, by specifying filenames or
file pointers. See
the docs for more detail.
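As a minimal sketch of that file-based variant, assuming WeasyPrint is installed: the helper name convert_file and its early existence check are my own illustration, not part of WeasyPrint. It relies on the filename argument of HTML and on passing a target path to write_pdf().

```python
from pathlib import Path


def convert_file(html_path, pdf_path):
    """Convert an HTML file on disk to a PDF file on disk (hypothetical helper)."""
    html_path = Path(html_path)
    pdf_path = Path(pdf_path)

    # Fail early with a clear error if the source file is missing.
    if not html_path.is_file():
        raise FileNotFoundError(html_path)

    # Imported here so the check above works even where WeasyPrint is absent.
    from weasyprint import HTML

    # Create the output directory if it does not exist yet.
    pdf_path.parent.mkdir(parents=True, exist_ok=True)

    # Load straight from the file and write straight to the target path.
    HTML(filename=str(html_path)).write_pdf(str(pdf_path))
    return pdf_path
```

Something like convert_file("sample.html", "out/sample.pdf") would then do the whole job in one call. The string-based makepdf() remains handy when the HTML is generated in memory.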
Here is a more full-fledged example of the above, with primitive command line handling capability added:
from pathlib import Path
import sys

from weasyprint import HTML


def makepdf(html):
    """Generate a PDF file from a string of HTML."""
    htmldoc = HTML(string=html, base_url="")
    return htmldoc.write_pdf()


def run():
    """Command runner."""
    infile = sys.argv[1]
    outfile = sys.argv[2]
    html = Path(infile).read_text()
    pdf = makepdf(html)
    Path(outfile).write_bytes(pdf)


if __name__ == "__main__":
    run()
You may download the above file directly, or browse the GitHub repo.
A note about Python types: the string parameter when instantiating HTML is a normal (Unicode) str, but makepdf() outputs bytes.
Assuming the above file is in your working directory as
weasyprintdemo.py and that a sample.html and an
out directory are also there, the following should work well:
python weasyprintdemo.py sample.html out/sample.pdf
Try it out, then open out/sample.pdf with your PDF reader.
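If the bare sys.argv handling feels too primitive, here is one possible variation using argparse from the standard library. This is only a sketch under a few assumptions of mine: the function names parse_args and run, the UTF-8 encoding, and the choice to pass the input file as base_url (so relative asset URLs resolve next to it) are illustrative, not something the script above does.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the input and output paths from the command line."""
    parser = argparse.ArgumentParser(description="Convert an HTML file to PDF.")
    parser.add_argument("infile", type=Path, help="path to the source HTML file")
    parser.add_argument("outfile", type=Path, help="path of the PDF to write")
    return parser.parse_args(argv)


def run(argv=None):
    """Command runner: read the HTML, render it, write the PDF bytes."""
    args = parse_args(argv)

    # Deferred import, so --help and argument errors do not need WeasyPrint.
    from weasyprint import HTML

    # Assumption: the input file is UTF-8 encoded.
    html = args.infile.read_text(encoding="utf-8")

    # Assumption: use the input file as base_url so relative URLs
    # (images, stylesheets) are looked up beside it.
    pdf = HTML(string=html, base_url=str(args.infile)).write_pdf()
    args.outfile.write_bytes(pdf)
```

Call run() from an if __name__ == "__main__": guard, as in the script above. Running the file with --help then prints usage text, and missing arguments give a readable error instead of an IndexError.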
Are we close?
As is probably apparent, using WeasyPrint is easy. The real work with HTML to PDF conversion, however, is in the styling. Thankfully, CSS has pretty good support for printing.
Some useful CSS print resources:
This simple stylesheet demonstrates a few basic tricks:
body {
    font-family: sans-serif;
}

@media print {
    a::after {
        content: " (" attr(href) ") ";
    }

    pre {
        white-space: pre-wrap;
    }

    @page {
        margin: 0.75in;
        size: Letter;

        @top-right {
            content: counter(page);
        }
    }

    @page :first {
        @top-right {
            content: "";
        }
    }
}
First, use media queries. This allows you to use the same stylesheet for
both print and screen, using @media print and @media screen respectively.
In the example stylesheet, I assume that the defaults (such as seen in the
body declaration) apply to all formats, and that @media print provides
overrides. Alternatively, you could include separate stylesheets for print
and screen, using the media attribute of the <link> tag, as in
<link rel="stylesheet" href="print.css" media="print" />.
Second, use @page CSS rules. While browser support is pretty abysmal in
2020, WeasyPrint does a pretty good job of supporting what you need. Note
the margin and size adjustments above, and the page numbering, in which we
first display the page counter in the top-right margin box, then override
it with :first to make it blank on the first page only. In other words,
page numbers only show from page 2 onward.
Also note the a::after trick to explicitly display the
href attribute when printing. This is either clever or
annoying, depending on your goals.
Another hint, not demonstrated above: within the
@media print block, set display: none on any
elements that don't need to be printed, and set
background: none where you don't want backgrounds printed.
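To illustrate that hint from the Python side, here is a rough sketch that passes an extra print stylesheet at render time, using the stylesheets argument of write_pdf() and WeasyPrint's CSS class. The selectors nav and .no-print, the PRINT_CSS constant, and the function name are placeholders I made up; adapt them to your own markup.

```python
# Hypothetical print-only overrides; the selectors are placeholders.
PRINT_CSS = """
@media print {
    nav, .no-print {
        display: none;
    }
    body {
        background: none;
    }
}
"""


def makepdf_with_print_css(html, extra_css=PRINT_CSS):
    """Render HTML to PDF bytes, applying an additional stylesheet."""
    # Deferred import so the constant above is usable without WeasyPrint.
    from weasyprint import CSS, HTML

    htmldoc = HTML(string=html, base_url="")
    # Stylesheets passed here are applied in addition to those the HTML links to.
    return htmldoc.write_pdf(stylesheets=[CSS(string=extra_css)])
```

This keeps print-only tweaks out of the HTML itself, which can be convenient when you do not control the source document.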
If you write Django or Flask apps, you may benefit from the convenience of the respective libraries for generating PDFs within these frameworks:
For Django, django-weasyprint provides a WeasyTemplateView view base class
or a WeasyTemplateResponseMixin mixin on a TemplateView.
For Flask, Flask-WeasyPrint provides a special HTML class that works just
like WeasyPrint's, but respects Flask routes and WSGI. Also provided is a
render_pdf function that can be called on a template or on the url_for() of
another view, setting the correct mimetype.
WeasyPrint encourages the developer to make HTML and CSS, and the PDF just happens. If that fits your skill set, then you may enjoy experimenting with and utilizing this library.
How you generate HTML is entirely up to you. You might:
Then generate the PDF using WeasyPrint.
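As one small illustration of that workflow, the sketch below builds HTML from plain Python data with the standard library (string.Template and html.escape) and then hands it to WeasyPrint. The page layout and the names PAGE, build_html, and build_pdf are assumptions of mine for the example; a real project might well use a template engine instead.

```python
from html import escape
from string import Template

# A deliberately tiny page template; $title and $items are filled in below.
PAGE = Template("""<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>$title</title></head>
<body>
<h1>$title</h1>
<ul>
$items
</ul>
</body>
</html>
""")


def build_html(title, items):
    """Return an HTML page listing the items, with user data escaped."""
    rows = "\n".join(f"<li>{escape(str(item))}</li>" for item in items)
    return PAGE.substitute(title=escape(title), items=rows)


def build_pdf(title, items):
    """Render the generated page to PDF bytes with WeasyPrint."""
    # Deferred import so build_html() can be used and tested on its own.
    from weasyprint import HTML

    return HTML(string=build_html(title, items)).write_pdf()
```

Keeping the HTML generation separate from the PDF step makes it easy to preview the page in a browser before converting it.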
Anything I missed? Feel free to leave comments!