PDF bursting on the cheap

I have always been intrigued by PDF bursting programs, programs that take a PDF containing, say, a bunch of invoices, and based on a hidden field on each invoice, split them up into one file per customer, to be sent to them by email.

(I am also tickled by the name “PDF burster”, because I am old enough to remember when every data center had a back room where very noisy bursting machines would break up six-part greenbar listings into individual pages for manual distribution.)

Most of the commercial PDF burster programs are very expensive. If you just want to try out the workflow, or if you are short on budget, they may be out of range. I got to thinking of how one might do roll-your-own PDF bursting.

I will assuming that you are starting with XML data, which you will run through an XML stylesheet to produce an XSL-FO object, to be processed by something like Apache FOP to produce the PDF.

The key is to split up the XML document before you make it into a PDF. That way you do not have to get involved in PDF internals. Consider, for example, this XML file to generate invoices. (Needless to say, a real invoice file would have more fields.)

<?xml version="1.0" encoding="UTF-8"?>
<invoices>
   <invoice dest="joe.blow@example.com">
      <firstName>Joe</firstName>
      <lastName>Blow</lastName>
   </invoice>
   <invoice dest="joe.coder@example.net">
      <firstName>Joe</firstName>
      <lastName>Coder</lastName>
   </invoice>
   <invoice dest="john.q.public@example.org">
      <firstName>John</firstName>
      <lastName>Public</lastName>
   </invoice>
   <invoice dest="joe.doakes@example.org">
      <firstName>Joe</firstName>
      <lastName>Doakes</lastName>
   </invoice>
</invoices>

We use a Python program to create a new XML file for each <invoice> element. The file name is specified in the dest attribute, and would usually be the email address that the invoice will be sent to.

The program is run like this, with the name of the input file passed as a parameter:

python burster.py invoices.xml

For each invoice, it will create a file in the emails subdirectory, with data for that invoice. For example, for the first invoice, it will create a file named joe.blow@example.com.xml that looks like this:

<?xml version="1.0" encoding="utf-8"?>
<invoice dest="joe.blow@example.com">
  <firstName>Joe</firstName>
  <lastName>Blow</lastName>
</invoice>

Before writing the file, the program runs it through HTML Tidy (which, in spite of its name, can also process XML) to nicely indent the XML. Or course, XML parsers do not care whether the XML is nicely indented, but I find it makes debugging easier.

Once you have an XML file per destination, you can use a batch script to take each file, apply an XSLT to convert it to an XSL-FO document (you can use Apache Xalan to do this), then feed the XSL-FO document to a Formatting Objects Processor like Apache FO, producing a PDF with the same name as the input document, but with the .xml extension changed to .pdf.

A second script can take each PDF file and, based on its file name, email it to the appropriate destination.

In order to run the burster program, you need to have tidylib installed, like this:

pip install tidylib

You also need to have HTML Tidy installed, which you can get here.

Here is the burster program, burster.py:

#
# This program will split an XML file up into
# separate files, based on the "dest" attribute
# of each child of the root element. 
#
# The first and only parameter to this program
# is the name of the input file. 
#
# Consider an XML file of invoices, like this: 
#
# <?xml version="1.0" encoding="UTF-8"?>
# <invoices>
#    <invoice dest="joe.blow@example.com">
#       <firstName>Joe</firstName>
#       <lastName>Blow</lastName>
#    </invoice>
#    <invoice dest="joe.coder@example.net">
#       <firstName>Joe</firstName>
#       <lastName>Coder</lastName>
#    </invoice>
#    <invoice dest="john.q.public@example.org">
#       <firstName>John</firstName>
#       <lastName>Public</lastName>
#    </invoice>
#    <invoice dest="joe.doakes@example.org">
#       <firstName>Joe</firstName>
#       <lastName>Doakes</lastName>
#    </invoice>
# </invoices>
# 
# A file will be created in the "emails" subdirectory for
# each <invoice> element, with a file name equal to the
# value of the "dest" attribute. 
# 
# So, for example, for the first <invoice> element, we
# will get a file named "joe.blow@example.com.xml" that
# looks like this: 
#
# <?xml version="1.0" encoding="utf-8"?>
# <invoice dest="joe.blow@example.com">
#   <firstName>Joe</firstName>
#   <lastName>Blow</lastName>
# </invoice>
#
# Note that there are no restrictions on the naming of the
# root element or its children. The only requirement is
# that each child of the root has the dest attribute.
#
# License: Public Domain
#
# To the extent possible under law, Lynn Grant has waived 
# all copyright and related or neighboring rights 
# to this program.
# This work is published from: United States.
#

import xml.etree.ElementTree as ET
from tidylib import tidy_document
import sys

tidyOptions = {"input-xml": 1, "indent": 1}

inFile = sys.argv[1]
tree = ET.parse(inFile)
packets = tree.getroot()

for packet in packets:
    dest = packet.attrib["dest"]
    df = open("emails/" + dest + ".xml", "wt")
    xml_data = '<?xml version="1.0" encoding="UTF-8"?>\n' + 
       ET.tostring(packet).decode('UTF-8')
    document, errors = tidy_document(xml_data, tidyOptions)
    if (errors):
        print("Errors in " + dest + ":\n", errors)
    df.write(document)
    df.close()

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *