While there are numerous ways to handle PDF documents with Python, I find generating or editing HTML far easier and more reliable than trying to figure out the intricacies of the PDF format. Sure, there is the venerable ReportLab, and if HTML is not your cup of tea, I encourage you to look into that option. There is also PyPDF2. Or maybe PyPDF3? No, perhaps PyPDF4! Hmmm... see the problem? My best guess is PyPDF3, for what that is worth.
So many choices...
But there is an easy choice if you are comfortable with HTML.
Enter WeasyPrint. It takes HTML and CSS, and converts it to a usable and potentially beautiful PDF document.
The code samples in this article can be accessed in the associated GitHub repo. Feel free to clone and adapt.
To install WeasyPrint, I recommend you first set up a virtual environment with the tool of your choice.
Then, installation is as simple as performing something like the following in an activated virtual environment:
pip install weasyprint
Alternatives to the above, depending on your tooling:
poetry add weasyprint
conda install -c conda-forge weasyprint
pipenv install weasyprint
You get the idea.
If you only want the weasyprint command-line tool, you could even use pipx and install with pipx install weasyprint. While that would not make it very convenient to access as a Python library, if you just want to convert web pages to PDFs, that may be all you need.
Once installed, the weasyprint command-line tool is available. You can convert an HTML file or a web page to PDF. For instance, you could try the following:
weasyprint \
"https://en.wikipedia.org/wiki/Python_(programming_language)" \
python.pdf
The above command will save a file python.pdf in the current working directory, converted from the HTML of the Python programming language article in English on Wikipedia. It ain't perfect, but it gives you an idea, hopefully.
You don't have to specify a web address, of course. Local HTML files work fine, and they provide necessary control over content and styling.
weasyprint sample.html out/sample.pdf
Feel free to download a sample.html and an associated sample.css stylesheet with the contents of this article.
See the WeasyPrint docs for further examples and instructions regarding the standalone weasyprint command-line tool.
The Python API for WeasyPrint is quite versatile. It can be used to load HTML when passed appropriate file pointers, file names, or the text of the HTML itself.
Here is an example of a simple makepdf() function that accepts an HTML string and returns the binary PDF data.
from weasyprint import HTML


def makepdf(html):
    """Generate a PDF file from a string of HTML."""
    htmldoc = HTML(string=html, base_url="")
    return htmldoc.write_pdf()
The main workhorse here is the HTML class. When instantiating it, I found I needed to pass a base_url parameter in order for it to load images and other assets from relative URLs, as in <img src="somefile.png">.
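To make the role of base_url more concrete, here is a minimal sketch that derives it from the location of the HTML file rather than from the working directory. The helper names base_url_for() and makepdf_from_file() are made up for illustration, and the docs/sample.html path is only an example. The urljoin() call is plain standard-library URL resolution, used here to show where a relative src would point.

```python
from pathlib import Path
from urllib.parse import urljoin


def base_url_for(html_path):
    """Return a file:// URL for the HTML file, suitable as a base_url."""
    return Path(html_path).resolve().as_uri()


def makepdf_from_file(html_path):
    """Render an HTML file's text, resolving assets relative to that file."""
    # Imported here so base_url_for() can be tried without WeasyPrint installed.
    from weasyprint import HTML

    html = Path(html_path).read_text(encoding="utf-8")
    return HTML(string=html, base_url=base_url_for(html_path)).write_pdf()


# Show where a relative URL such as "somefile.png" resolves to.
base = base_url_for("docs/sample.html")
print(urljoin(base, "somefile.png"))
```

With this approach, an <img src="somefile.png"> in docs/sample.html should be looked up next to the HTML file, regardless of where the script is run from.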
When you use HTML and write_pdf(), not only is the HTML parsed, but so is any associated CSS, whether it is embedded in the head of the HTML (in a <style> tag) or included as a stylesheet (with a <link href="sample.css" rel="stylesheet"> tag).
I should note that HTML can load straight from files, and write_pdf() can write to a file, by specifying filenames or file pointers. See the docs for more detail.
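As a rough sketch of that file-to-file route, the following passes a filename to HTML and a target path to write_pdf(). The function names html_file_to_pdf() and default_pdf_path() are hypothetical helpers, and the choice to default the output name to the input name with a .pdf suffix is my own assumption, not something WeasyPrint does for you.

```python
from pathlib import Path


def default_pdf_path(html_path):
    """Return the HTML path with its suffix swapped for .pdf."""
    return Path(html_path).with_suffix(".pdf")


def html_file_to_pdf(html_path, pdf_path=None):
    """Convert an HTML file on disk to a PDF file on disk."""
    # Imported here so default_pdf_path() works without WeasyPrint installed.
    from weasyprint import HTML

    target = Path(pdf_path) if pdf_path else default_pdf_path(html_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # filename= lets WeasyPrint read the file itself; passing a target
    # makes write_pdf() write to disk instead of returning bytes.
    HTML(filename=str(html_path)).write_pdf(str(target))
    return target
```

Calling html_file_to_pdf("sample.html", "out/sample.pdf") should then behave much like the command-line example earlier. When loading by filename, relative asset URLs are resolved against the file's own location, so no explicit base_url should be needed.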
Here is a more full-fledged example of the above, with primitive command line handling capability added:
from pathlib import Path
import sys

from weasyprint import HTML


def makepdf(html):
    """Generate a PDF file from a string of HTML."""
    htmldoc = HTML(string=html, base_url="")
    return htmldoc.write_pdf()


def run():
    """Command runner."""
    infile = sys.argv[1]
    outfile = sys.argv[2]
    html = Path(infile).read_text()
    pdf = makepdf(html)
    Path(outfile).write_bytes(pdf)


if __name__ == "__main__":
    run()
You may download the above file directly, or browse the GitHub repo.
A note about Python types: the string parameter when instantiating HTML is a normal (Unicode) str, but makepdf() outputs bytes.
Assuming the above file is in your working directory as weasyprintdemo.py and that a sample.html and an out directory are also there, the following should work well:
python weasyprintdemo.py sample.html out/sample.pdf
Try it out, then open out/sample.pdf with your PDF reader.
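If the bare sys.argv handling feels too primitive, one possible variation uses argparse from the standard library, which gives you usage messages and argument checking for free. This is only a sketch: parse_args() and the help texts are my own additions, and creating the output directory automatically is an assumption about what you might want.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the input and output paths from the command line."""
    parser = argparse.ArgumentParser(description="Convert an HTML file to PDF.")
    parser.add_argument("infile", type=Path, help="path to the source HTML file")
    parser.add_argument("outfile", type=Path, help="path of the PDF to write")
    return parser.parse_args(argv)


def run(argv=None):
    """Command runner with argument checking."""
    # Imported here so parse_args() can be tried without WeasyPrint installed.
    from weasyprint import HTML

    args = parse_args(argv)
    html = args.infile.read_text(encoding="utf-8")
    pdf = HTML(string=html, base_url="").write_pdf()
    # Create the output directory if it does not exist yet.
    args.outfile.parent.mkdir(parents=True, exist_ok=True)
    args.outfile.write_bytes(pdf)
```

To use it as a script, call run() from the usual if __name__ == "__main__": guard. Running it with no arguments should then print a usage message instead of raising an IndexError.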
Are we close?
As is probably apparent, using WeasyPrint is easy. The real work with HTML to PDF conversion, however, is in the styling. Thankfully, CSS has pretty good support for printing.
Some useful CSS print resources:
This simple stylesheet demonstrates a few basic tricks:
body {
  font-family: sans-serif;
}
@media print {
  a::after {
    content: " (" attr(href) ") ";
  }
  pre {
    white-space: pre-wrap;
  }
  @page {
    margin: 0.75in;
    size: Letter;
    @top-right {
      content: counter(page);
    }
  }
  @page :first {
    @top-right {
      content: "";
    }
  }
}
First, use media queries. This allows you to use the same stylesheet for both print and screen, using @media print and @media screen respectively. In the example stylesheet, I assume that the defaults (such as seen in the body declaration) apply to all formats, and that @media print provides overrides. Alternatively, you could include separate stylesheets for print and screen, using the media attribute of the <link> tag, as in <link rel="stylesheet" href="print.css" media="print" />.
Second, use @page CSS rules. While browser support is pretty abysmal in 2020, WeasyPrint does a pretty good job of supporting what you need. Note the margin and size adjustments above, and the page numbering, in which we first display the page counter in the top-right margin box, then override it with :first to make it blank on the first page only. In other words, page numbers only show from page 2 onward.
Also note the a::after trick to explicitly display the href attribute when printing. This is either clever or annoying, depending on your goals.
Another hint, not demonstrated above: within the @media print block, set display: none on any elements that don't need to be printed, and set background: none where you don't want backgrounds printed.
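Here is one way that hint might look when driven from Python: a small helper builds the @media print overrides, and the result is handed to write_pdf() as an extra stylesheet. The function names and the selectors (nav, .no-print, body) are placeholders for illustration. Keep in mind that stylesheets passed this way are treated as user stylesheets, so rules in the document's own CSS may take precedence; putting the same rules directly in your stylesheet is the simpler option when you control it.

```python
def print_overrides(hidden=(), no_background=()):
    """Build an @media print block hiding elements and removing backgrounds."""
    rules = []
    if hidden:
        rules.append(f"  {', '.join(hidden)} {{ display: none; }}")
    if no_background:
        rules.append(f"  {', '.join(no_background)} {{ background: none; }}")
    return "@media print {\n" + "\n".join(rules) + "\n}\n"


def makepdf_with_overrides(html, extra_css):
    """Generate a PDF from an HTML string, applying extra print CSS."""
    # Imported here so print_overrides() works without WeasyPrint installed.
    from weasyprint import CSS, HTML

    htmldoc = HTML(string=html, base_url="")
    return htmldoc.write_pdf(stylesheets=[CSS(string=extra_css)])


# Placeholder selectors; substitute whatever your own markup uses.
css = print_overrides(hidden=["nav", ".no-print"], no_background=["body"])
print(css)
```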
If you write Django or Flask apps, you may benefit from the convenience of the respective libraries for generating PDFs within these frameworks:
django-weasyprint provides a WeasyTemplateView view base class or a WeasyTemplateResponseMixin mixin on a TemplateView.
Flask-WeasyPrint provides an HTML class that works just like WeasyPrint's, but respects Flask routes and WSGI. Also provided is a render_pdf function that can be called on a template or on the url_for() of another view, setting the correct mimetype.
WeasyPrint encourages the developer to make HTML and CSS, and the PDF just happens. If that fits your skill set, then you may enjoy experimenting with and utilizing this library.
How you generate HTML is entirely up to you. You might:
Then generate the PDF using WeasyPrint.
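As one illustration of that workflow, the sketch below builds a small HTML page with string.Template from the standard library and then hands it to WeasyPrint. The template, build_html(), and build_pdf() are invented for this example; a full template engine would do the same job with more features.

```python
from html import escape
from string import Template

PAGE = Template("""<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>$title</title>
</head>
<body>
  <h1>$title</h1>
  <ul>
$items
  </ul>
</body>
</html>
""")


def build_html(title, items):
    """Fill the page template, escaping text so it is safe to embed."""
    rows = "\n".join(f"    <li>{escape(item)}</li>" for item in items)
    return PAGE.substitute(title=escape(title), items=rows)


def build_pdf(title, items):
    """Generate the HTML, then convert it to PDF bytes."""
    # Imported here so build_html() can be tried without WeasyPrint installed.
    from weasyprint import HTML

    return HTML(string=build_html(title, items), base_url="").write_pdf()
```

Keeping the HTML generation separate from the PDF step means you can open the intermediate HTML in a browser to debug layout before converting.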
Anything I missed? Feel free to leave comments!