{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# \"Use Python to Fill PDF Files!\"\n", "> \"Use Python to populate form fields in a PDF file.\"\n", "\n", "- toc: true\n", "- branch: master\n", "- badges: true\n", "- comments: true\n", "- categories: [python]\n", "\n", "- hide: false\n", "- search_exclude: false\n", "- metadata_key1: metadata_value1\n", "- metadata_key2: metadata_value2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Use Python to Fill PDF Files!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PDFs are hard to work with. Over the years I've tried several approaches to filling them out in an automated way. It's amazing my job has so many manual tasks that require filling out PDFs. It's fairly routine for me to be manually filling out PDF files to process transactions. Needless to say I've either created or borrowed several solutions. First let me say I'm no VBA expert but I have experimented with solutions here as well.\n", "\n", "I've wrote a VBA script to fill out a PDF using \"send keys\". Think \"how could I do everything by use of just the keyboard shortcuts?\" Once you know how to open a PDF with shortcuts, tab through the form fields and use shortcuts to save, you can automate this in VBA. The problem here is that it's painstaking to set up. Plus if the form changes or you want to add a new form it's basically like starting from scratch.\n", "\n", "Next I used VBA and the Acrobat reference to access and manipulate PDFs. This works much better because you can access the PDF form fields using VBA and Java Script. I would highly recommend this route if you're going to use VBA. I still felt as if every PDF template had to be setup completely separate. Some of this is likely due to my experience level with VBA. Either way there was a lot of copying a pasting code.\n", "\n", "Then came my experience with Python, PyPDF2 and reportlab. I won't go into too much detail about exactly how I did this. In short you create your PDF template, create blank PDF with just your data fields, and paste the new PDF as a watermark on top of your PDF template. Again, this is painstaking because you're using grid coordinates to position where text should be placed on the page. This worked, it was fast, but it wasn't great if the PDF template changed or if you wanted to manipulate the PDF file afterward.\n", "\n", "It was great when I found you could fill PDF form fields with python using PyPDF2 and pdfrw. Both of these libraries look to be able to do similar tasks but I chose pdfrw because it appears to be maintained better. PyPDF2 actually is no longer maintained. There is a PyPDF3 and PyPDF4; however, I already settled on pdfrw. The only issue I ran into is that you could fill in the fields but those values wouldn't show until you refreshed the field in Acrobat. I found two ways around this; one was to click into every field and hit `Enter`. This option isn't doable if you have several PDFs. The next was to open the PDFs in a web browser which causes a refresh of the fields. \n", "\n", "Because of these challenges I gave up for a while... However, while digging into Python and PDFs again I found the solution that refreshes the fields!\n", "\n", "So now I have a working solution I can pass around the office easily. A basic macro reference a Python exe file located on a shared network drive. Meaning there is no python install! And we can populate PDF forms with a simple excel macro while still getting all the flexibility and functionality of Python. The rest of this post will be going through an example of how to fill out a PDF using python." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# PDF Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I'm using Adobe Acrobat DC. I'm going to create a sample PDF file for this example. If you have an existing PDF you want to use just open, click on `Tools` > `Prepare Form`. This action will create a fillable PDF form.\n", "\n", "Now let's create a simple PDF for this example. We have the following fields.\n", "\n", "- name\n", "- phone\n", "- date\n", "- account_number\n", "- cb_1 (check box \"Yes\")\n", "- cb_2 (check box \"No\")\n", "\n", "Now that we have a sample PDF we will get started with a little Python." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "![image003.png](https://github.com/akdux1937/dux/blob/master/images/pdfrwtut/image003.png?raw=true)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# pdfrw Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> First thing to do is install pdfrw using `!pip install pdfrw`" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "!pip3 install pdfrw" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'0.4'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pdfrw\n", "pdfrw.__version__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Accessing our PDF" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Let's first set some variable to reference our PDF template and output.pdf\n", "pdf_template = \"template.pdf\"\n", "pdf_output = \"output.pdf\"" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "template_pdf = pdfrw.PdfReader(pdf_template) # create a pdfrw object from our template.pdf\n", "# template_pdf # uncomment to see all the data captured from this PDF." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should print out `template_pdf` to see everything availabe in the PDF. There is a lot so for ease of reading I'll comment out.\n", "\n", "For now let's just try to get the form fields of the PDF we created. To do this we will set some of the variable we find important. I grabbed this code from a random snippet online but you can find several similar setups on stack overflow." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "ANNOT_KEY = '/Annots'\n", "ANNOT_FIELD_KEY = '/T'\n", "ANNOT_VAL_KEY = '/V'\n", "ANNOT_RECT_KEY = '/Rect'\n", "SUBTYPE_KEY = '/Subtype'\n", "WIDGET_SUBTYPE_KEY = '/Widget'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we can loop through the page(s). Here we only have one but you it's a good idea to prepare for future functionality. We grab all the annotations to grab just the form field keys." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "name\n", "phone\n", "date\n", "account_number\n", "cb_1\n", "cb_2\n" ] } ], "source": [ "for page in template_pdf.pages:\n", " annotations = page[ANNOT_KEY]\n", " for annotation in annotations:\n", " if annotation[SUBTYPE_KEY] == WIDGET_SUBTYPE_KEY:\n", " if annotation[ANNOT_FIELD_KEY]:\n", " key = annotation[ANNOT_FIELD_KEY][1:-1]\n", " print(key)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There you can see we were able to grab our form field names!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Filling a PDF" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To fill a PDF we can create a dictionary of what we want to populate the PDF. The dictionary `keys` will be the form field names and the `values` will be what we want to fill into the PDF. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from datetime import date\n", "\n", "data_dict = {\n", " 'name': 'Andrew Krcatovich',\n", " 'phone': '(123) 123-1234',\n", " 'date': date.today(),\n", " 'account_number': '123123123',\n", " 'cb_1': True,\n", " 'cb_2': False,\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's setup a function to handle grabbing the keys, populating the values, and saving out the `output.pdf` file" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def fill_pdf(input_pdf_path, output_pdf_path, data_dict):\n", " template_pdf = pdfrw.PdfReader(input_pdf_path)\n", " for page in template_pdf.pages:\n", " annotations = page[ANNOT_KEY]\n", " for annotation in annotations:\n", " if annotation[SUBTYPE_KEY] == WIDGET_SUBTYPE_KEY:\n", " if annotation[ANNOT_FIELD_KEY]:\n", " key = annotation[ANNOT_FIELD_KEY][1:-1]\n", " if key in data_dict.keys():\n", " if type(data_dict[key]) == bool:\n", " if data_dict[key] == True:\n", " annotation.update(pdfrw.PdfDict(\n", " AS=pdfrw.PdfName('Yes')))\n", " else:\n", " annotation.update(\n", " pdfrw.PdfDict(V='{}'.format(data_dict[key]))\n", " )\n", " annotation.update(pdfrw.PdfDict(AP=''))\n", " pdfrw.PdfWriter().write(output_pdf_path, template_pdf)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "fill_pdf(pdf_template, pdf_output, data_dict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay! That just filled out a PDF. Opening in preview on my Mac shows." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "![image004.png](https://github.com/akdux1937/dux/blob/master/images/pdfrwtut/image004.png?raw=true)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, opening the very same PDF in Acrobat doesn't show the values of the form fields. If you click into the field you can see it did fill but for some reason the field isn't refreshed to show the value. Printing the PDF here won't help either as it will print blank. After a long while searching for an answer I found the following solution. Worked like a charm and the form fields are now showing in Acrobat as well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Tip: add Root.AcroForm.update(pdfrw.PdfDict(NeedAppearances=pdfrw.PdfObject(\"true\")))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Honestly, I don't know why this isn't the default setting. It seems like everyone online runs into the same issue and this solution seems hidden away to where there are several hard work-arounds that are being used. Either way just add the above reference line to the `fill_pdf` function like so." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def fill_pdf(input_pdf_path, output_pdf_path, data_dict):\n", " template_pdf = pdfrw.PdfReader(input_pdf_path)\n", " for page in template_pdf.pages:\n", " annotations = page[ANNOT_KEY]\n", " for annotation in annotations:\n", " if annotation[SUBTYPE_KEY] == WIDGET_SUBTYPE_KEY:\n", " if annotation[ANNOT_FIELD_KEY]:\n", " key = annotation[ANNOT_FIELD_KEY][1:-1]\n", " if key in data_dict.keys():\n", " if type(data_dict[key]) == bool:\n", " if data_dict[key] == True:\n", " annotation.update(pdfrw.PdfDict(\n", " AS=pdfrw.PdfName('Yes')))\n", " else:\n", " annotation.update(\n", " pdfrw.PdfDict(V='{}'.format(data_dict[key]))\n", " )\n", " annotation.update(pdfrw.PdfDict(AP=''))\n", " template_pdf.Root.AcroForm.update(pdfrw.PdfDict(NeedAppearances=pdfrw.PdfObject('true'))) # NEW\n", " pdfrw.PdfWriter().write(output_pdf_path, template_pdf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I added one additional function `fill_simple_pdf_file` as I found it very useful to manipulate a data dictionary, especially if it came from an excel file, first before populating the data. This way you can create many fillable forms from the same data source, do formating on the fields and set default values if nothing was supplied." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bringing it all together" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import pdfrw\n", "from datetime import date\n", "\n", "\n", "ANNOT_KEY = '/Annots'\n", "ANNOT_FIELD_KEY = '/T'\n", "ANNOT_VAL_KEY = '/V'\n", "ANNOT_RECT_KEY = '/Rect'\n", "SUBTYPE_KEY = '/Subtype'\n", "WIDGET_SUBTYPE_KEY = '/Widget'\n", "\n", "\n", "def fill_pdf(input_pdf_path, output_pdf_path, data_dict):\n", " template_pdf = pdfrw.PdfReader(input_pdf_path)\n", " for page in template_pdf.pages:\n", " annotations = page[ANNOT_KEY]\n", " for annotation in annotations:\n", " if annotation[SUBTYPE_KEY] == WIDGET_SUBTYPE_KEY:\n", " if annotation[ANNOT_FIELD_KEY]:\n", " key = annotation[ANNOT_FIELD_KEY][1:-1]\n", " if key in data_dict.keys():\n", " if type(data_dict[key]) == bool:\n", " if data_dict[key] == True:\n", " annotation.update(pdfrw.PdfDict(\n", " AS=pdfrw.PdfName('Yes')))\n", " else:\n", " annotation.update(\n", " pdfrw.PdfDict(V='{}'.format(data_dict[key]))\n", " )\n", " annotation.update(pdfrw.PdfDict(AP=''))\n", " template_pdf.Root.AcroForm.update(pdfrw.PdfDict(NeedAppearances=pdfrw.PdfObject('true')))\n", " pdfrw.PdfWriter().write(output_pdf_path, template_pdf)\n", " \n", "\n", "# NEW\n", "def fill_simple_pdf_file(data, template_input, template_output):\n", " some_date = date.today()\n", " data_dict = {\n", " 'name': data.get('name', ''),\n", " 'phone': data.get('phone', ''),\n", " 'date': some_date,\n", " 'account_number': data.get('account_number', ''),\n", " 'cb_1': data.get('cb_1', False),\n", " 'cb_2': data.get('cb_2', False),\n", " }\n", " return fill_pdf(template_input, template_output, data_dict)\n", "\n", "\n", "if __name__ == '__main__':\n", " pdf_template = \"template.pdf\"\n", " pdf_output = \"output.pdf\"\n", " \n", " sample_data_dict = {\n", " 'name': 'Andrew Krcatovich',\n", " 'phone': '(123) 123-1234',\n", "# 'date': date.today(), # Removed date so we can dynamically set in python.\n", " 'account_number': '123123123',\n", " 'cb_1': True,\n", " 'cb_2': False,\n", " }\n", " fill_simple_pdf_file(sample_data_dict, pdf_template, pdf_output)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thanks for reading! Hope this can help someone else!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }