Hacking PDF forms with iText, jython, perl, and emacs

27 Feb 2004

Update: This page seems to draw a fair bit of traffic from Google. In addition to the body of this post, there are some Java samples in the comments. Have a look there too.

For one problem last week I had two tricks to figure out: how to concatenate PDF forms and how to fill in some PDF form fields. With Acrobat people can create PDF forms which you can complete with Reader. In our case these are multi-page tax forms. The IRS defined the forms -- they're not under our control. iText was the tool of choice, but I didn't know the API. The concat_pdf tool put the forms together well enough, but it trashed the data in the forms.

I used jython to experiment with the API and diagnose the problem. It turns out that the names of the form fields on several of the forms were the same. It was a simple problem of name collisions. Jython was entirely great for diagnosing the problem. I could interrogate the forms before and after concatenation to find out their field names and values. I tried and tried and failed and failed to get Jython and iText to change the names of those fields. I spent entirely too much time in trial and error (and error) failing to bend the iText API to my task. Attempts to create subclasses or delegates around the API met with various limits -- crucial methods that were protected or whatever. There's a separate story here about recognizing when you're on the wrong path or using the wrong tools. I find myself down that dead end more often than I'd like to admit. But this is a different story, so I won't go there now.

At some point I remembered Rob's story about a colleague who spent a long time implementing the PDF spec to generate correct PDFs that nevertheless wouldn't work with Acrobat. It seems the spec and the implementations differ. (When has that ever happened?) The point of the story was that they eventually threw out the carefully crafted tool and used perl string replacement on existing files created with Acrobat. So I turned my attention to seeing if I could find a useful pattern in the field names that would yield to perl's regular expression prowess.

Jython again came in handy for extracting all the field names. All the time walking down dead ends had left me well enough acquainted with PDF internals to see the boundaries of the pattern. Emacs had been in the background of all of these tasks, but it came front and center as I tested my theory about the name collisions and about the pattern. Sure enough, once I ensured that all the fields were uniquely named, the concatenation worked quite smoothly. Quickly enough I had a perl solution to renaming the fields that was really fast.

PDFs are pretty on the display and printing side of things, but pretty ugly on the inside. Paul ended up throwing out my solution too. He found things in the beta versions of iText that allow PDF forms to be "flattened". Then the form field names aren't an issue and the files are smaller too. So all I have to show for my work is a little unwanted knowledge about PDF internals and a story for my blog about technical-pot-luck problem solving. That said, I'll include a little code here in case the string replacement trick for enforcing unique field names helps someone else avoid a few dead ends.

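    # assumes @pdf_files holds the raw contents of each PDF, not the paths
    my $ax = 'aa';
    foreach my $file (@pdf_files) {
      # prefix every field name: (f1-04) becomes (aa-f1-04) in the first file
      $file =~ s{\(([cf]\d-[a-z0-9]+)\)}{($ax-$1)}g;
      $ax++;  # next file gets the next prefix
      # save the files to disk
    }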

The key part is that field names are delimited with parentheses. In my case the field names themselves were fairly predictable. They looked like this: (f1-04) or in some cases (c4-alpha). I don't think you can just count on finding parentheses -- PDFs are more complex than that. (The $ax = 'aa'; $ax++ thing is a fun perl trick. Perl will increment the string alphanumerically thusly: aa, ab, ac ...)
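For readers who would rather stay in Python, here is a minimal sketch of the same renaming idea. It assumes each PDF has already been read into a bytes object, and it reuses the pattern from the Perl snippet. The helper names (`prefixes`, `rename_fields`, `rename_all`) are made up for illustration. Python has no string increment, so a small generator stands in for Perl's `$ax++`, and it only covers aa through zz.

```python
import itertools
import re
import string

# Same pattern as the Perl snippet: matches names such as (f1-04) or (c4-alpha).
FIELD_NAME = re.compile(rb"\(([cf]\d-[a-z0-9]+)\)")


def prefixes():
    """Yield aa, ab, ac, ... zz as bytes, one prefix per document."""
    for pair in itertools.product(string.ascii_lowercase, repeat=2):
        yield "".join(pair).encode("ascii")


def rename_fields(pdf_bytes, prefix):
    """Put the prefix in front of every matching field name in one document."""
    return FIELD_NAME.sub(
        lambda m: b"(" + prefix + b"-" + m.group(1) + b")", pdf_bytes
    )


def rename_all(documents):
    """Give each document its own prefix so field names no longer collide."""
    return [rename_fields(doc, p) for doc, p in zip(documents, prefixes())]


# Tiny stand-ins for real PDF contents, just to show the effect.
docs = [b"<< /T (f1-04) /V (x) >>", b"<< /T (f1-04) /T (c4-alpha) >>"]
renamed = rename_all(docs)
```

The same caveat applies as with the Perl version: this only works when the field names follow a pattern you can match safely.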

iText and Jython make it easy to get the field names from a PDF (assuming you're not in control of those field names). Here's how:

    % env CLASSPATH=./iText.jar jython
    >>> from com.lowagie.text.pdf import PdfReader
    >>> reader = PdfReader('path/to/your.pdf')
    >>> [f.name for f in reader.acroForm.fields]

Then you can analyze the results and figure out your own replacement pattern.
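One way to do that analysis, sketched in plain Python: given the field names collected from each file (for example with the Jython session above), list the names that show up in more than one form. The file names and field names below are invented sample data, and `find_collisions` is a hypothetical helper, not part of iText.

```python
from collections import defaultdict


def find_collisions(names_by_file):
    """Return {field name: sorted list of files} for names used in 2+ files."""
    seen = defaultdict(set)
    for filename, names in names_by_file.items():
        for name in names:
            seen[name].add(filename)
    return {
        name: sorted(files) for name, files in seen.items() if len(files) > 1
    }


# Invented sample data standing in for the lists Jython would print.
names_by_file = {
    "form-a.pdf": ["f1-01", "f1-02", "c1-1"],
    "form-b.pdf": ["f1-01", "f2-01"],
}
collisions = find_collisions(names_by_file)
```

Any name that appears in the result is one that would need renaming before the forms are concatenated.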

seth commented

30 March 2004 at 14:45

Is there a way to populate the form fields with iText and write out a PDF with the fields filled in?

Thanks!

eric commented

06 April 2004 at 10:08

Yes, iText does let you fill in form fields and write out a completed form. Here's a simple example in Java which puts a string of Y's into every field in the PDF form. I found the example code in the iText library to be too complicated for what I was trying to do. Hope that helps.

-Eric

    import com.lowagie.text.pdf.PdfReader;
    import com.lowagie.text.pdf.PRAcroForm;
    import com.lowagie.text.pdf.PdfStamper;
    import com.lowagie.text.pdf.AcroFields;
    import com.lowagie.text.DocumentException;

    import java.util.Iterator;
    import java.io.IOException;
    import java.io.FileOutputStream;

    public class PdfForm {

        public static void main(String[] args) throws IOException, DocumentException {
            PdfReader reader = new PdfReader("/full/path/to/source.pdf");
            // The stamper writes a modified copy of the source document.
            PdfStamper stamp = new PdfStamper(reader, new FileOutputStream("/full/path/to/modified.pdf"));
            AcroFields form = stamp.getAcroFields();
            // Walk every field in the original form and overwrite its value.
            for (Iterator i = reader.getAcroForm().getFields().iterator(); i.hasNext();) {
                PRAcroForm.FieldInformation field = (PRAcroForm.FieldInformation) i.next();
                form.setField(field.getName(), "YYYYY");
            }
            // Closing the stamper finishes writing the output file.
            stamp.close();
        }
    }

seth commented

06 April 2004 at 11:57

Thanks! Luckily I discovered this a little while ago. This handles check boxes and radio buttons, too!

Mihai commented

06 April 2004 at 21:28

Hello,

Is there a way with itext to set the checkboxes? I couldn't find a way to do it.

Thanks!

eric commented

07 April 2004 at 13:37

I haven't had to fill in check boxes, so I can't speak from experience. Maybe one of these:

    form.setField(field.getName(), "1")
    form.setField(field.getName(), "on")

This method might also be helpful -- the javadocs specifically refer to checkboxes:

    form.getAppearanceStates(field.getName())

http://itext.sourceforge.net/docs/com/lowagie/text/pdf/AcroFields.html#getAppearanceStates(java.lang.String)

-Eric

Mihai commented

08 April 2004 at 08:37

I figured it out after I posted the question.
Checkboxes and radio buttons are "special" - getAppearanceStates() returns a non-empty array of possible values that you can set. Setting the field to one of those values does the trick.

Thanks!

eric commented

08 April 2004 at 12:38

Mihai, thanks for confirming how to work with PDF checkboxes.
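A small sketch of the approach Mihai describes, in plain Python so it runs without iText. It assumes you have already fetched the list of appearance states for a field (for instance from `getAppearanceStates`) and that the unchecked state is named `Off`, which is the usual name in PDF forms. `pick_checked_state` is a hypothetical helper.

```python
def pick_checked_state(states):
    """Return the first appearance state that is not 'Off', or None."""
    for state in states:
        if state != "Off":
            return state
    return None


# A checkbox typically reports an off state plus one "on" state.
checked = pick_checked_state(["Off", "Yes"])
# A plain text field reports no appearance states at all.
plain_text_field = pick_checked_state([])
```

The returned value is what you would pass as the second argument to `form.setField(...)` to tick the box.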

ingmar commented

22 April 2004 at 04:51

thank you for these samples, they helped me a lot!

Prashant Nirmal commented

10 July 2004 at 13:02

Hi
Is it possible to extract data from a PDF document into a text document using Perl or PHP?
If so, please guide me. It would be a great help.

Kevin Baker commented

17 October 2004 at 18:27

So you mention that Paul used PDF flattening in the beta rather than your solution. Do you know if this allows for populating forms in existing PDFs? Examples? If not I will likely explore your solution above.

Thanks

eric commented

18 October 2004 at 18:10

Prashant,

Apologies for taking so long to reply. Paul said he had looked at the PDF options available in perl and wasn't satisfied with what he found. That's why we ended up using the iText java stuff. I almost never work in PHP anymore, so I can't help you there either.

Kevin,

Paul's tricks are cool, so I'm glad you asked. He exports the field data using Acrobat (not Acrobat Reader) into an FDF file. Then he uses the iText library to populate the form with data and flatten it.

The Java code looks something like this (keep in mind that this is for the beta version of iText):

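    PdfStamper stamper = new PdfStamper(
        new PdfReader(pdf_in),
        new FileOutputStream(pdf_out));
    AcroFields form = stamper.getAcroFields();
    // load the field values from the FDF file exported from Acrobat
    form.setFields(new FdfReader(fdf_file));
    // ask the stamper to flatten the form when it closes
    stamper.setFormFlattening(true);
    stamper.close();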

The tricky bit is getting the data in the FDF file figured out. The FDF internals are just as cryptic as PDF's, but there's much less in the way. Fields look like this:

    << /V (some value) /T (f1-04)>>

The 'f1-04' corresponds to the field name in the PDF file, and 'some value' is the part you probably want to replace with your data. So there's another way to get the field names out of the PDF file, provided you have Acrobat.
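As a rough illustration of working with those entries, here is a Python sketch that reads and replaces values in FDF text. It assumes every field is written exactly in the shape shown above, with /V before /T and no escaped parentheses inside the existing values. Real FDF files exported from Acrobat may not be that tidy. `read_fields`, `fill_fields`, and `escape` are hypothetical helpers, and the sample data is invented.

```python
import re

# Matches dictionaries shaped like the example: << /V (value) /T (name)>>
FDF_FIELD = re.compile(rb"<<\s*/V\s*\(([^)]*)\)\s*/T\s*\(([^)]*)\)\s*>>")


def read_fields(fdf_bytes):
    """Return {field name: current value} for every matching entry."""
    return {
        m.group(2).decode("latin-1"): m.group(1).decode("latin-1")
        for m in FDF_FIELD.finditer(fdf_bytes)
    }


def escape(value):
    """Backslash-escape the characters that are special in a PDF string."""
    return value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def fill_fields(fdf_bytes, values):
    """Replace the value of each named field, leaving other entries alone."""
    def swap(match):
        name = match.group(2).decode("latin-1")
        if name not in values:
            return match.group(0)
        new_value = escape(values[name]).encode("latin-1")
        return b"<< /V (" + new_value + b") /T (" + match.group(2) + b")>>"

    return FDF_FIELD.sub(swap, fdf_bytes)


sample = b"<< /V (some value) /T (f1-04)>>\n<< /V () /T (c4-alpha)>>"
fields = read_fields(sample)
filled = fill_fields(sample, {"f1-04": "Jane Doe"})
```

The filled FDF text could then be saved and handed to `FdfReader` as in the Java snippet above.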