Perl state machine to scrape HTML

29 Aug 2004

I finally had an occasion to roll my own state machine earlier this month.

My problem was to integrate a little piece of our application with a foreign web application. Jon Udell taught me to look at all web applications as web services in Practical Internet Groupware. The short story is I needed to scrape some lightly-tag-soup-style HTML. It's a parsing problem. Bob Martin's example of using a state machine to parse code comments came to mind. HTTP::Parser gets me part of the way. It'll scrape the angle brackets and give me a stream of four specific events: start-tag, end-tag, text, and comment. If the other app were using validated XHTML I might be able to do this more simply with XPath. But these apps don't live in the ideal web-services bubble. They live on today's web, warts and all.

<FONT FACE="Arial">
<H3>FLINTSTONE , FRED</H3>
</FONT>
<DIR><DIR>
<TABLE  BORDER=0 CELLPADDING=0>
<TR>
<TD ALIGN="LEFT" VALIGN="TOP"><STRONG>Business Address</STRONG></TD>
<TD>&nbsp;</TD>
<TD>12 W SLATE QUARRY RD<BR>BRONTOSAURUS 3<BR>BEDROCK, PG 00004</TD>
</TR>
<TR>
<TD><STRONG>Phone Number</STRONG></TD>
<TD>&nbsp;</TD>
<TD>(888) 555-1212</TD>
</TR>
...

While I was thinking through the states needed to digest this html, I remembered a couple other useful ideas. Bob's state machine compiler lets you work on your state machine through the more concise structure of a state transition table. You let his compiler transform the table into the moving parts of a state machine. Then I remembered the beautiful example where Paul Graham transforms a list of data and a set of functions into a network of closures which can be directly executed for the same effect. [On Lisp pp. 76-81]

I ended up with something not quite as elegant as Paul's example, but more satisfying than what I could have done in Java. My state transition table is a perl hash of hashes where the values of the inner hashes are closures. Most of the magic is actually triggered by the _notify() function which uses the current state and the current event to look up the appropriate closure to execute. Here's most of the code:

my($_STATE_TABLE) = {
    begin => {
	start => sub {_change_state(@_, 'h3', 'save_name')},
    },
    save_name => {
	text => sub {_save($_[0], {_parse_name(@_)})},
	end => sub {_change_state(@_, undef, 'street1')},
    },
    street1 => {
	text => sub {_change_state(@_, 'Business Address',
				   'empty_address_cell')},
    },
    empty_address_cell => {
	start => sub {_change_state(@_, 'td', 'skip_empty_address_cell')},
    },
    skip_empty_address_cell => {
	start => sub {_change_state(@_, 'td', 'save_address')},
    },
    save_address => {
	# /(skip_)?empty_address_cell/ get us inside the 'td' that holds the
	# address.  The only start tag we should see is 'br' which will cause
	# address chunks to be sent in separate text events depending on the
	# number of 'br' tags.  Save whatever part of the address we can match
	# until we hit the closing 'td' tag.
	text => sub {
    	    # if we can find "city, state zip", then save each of them 
	    if ($_[1] =~ /^([^,]+)?,\s+([A-Z]{2})\s+(\d{5})?$/) {
		_save($_[0], {map(defined($_->[1]) && $_->[1]
				      ? ('Address.'.$_->[0] => $_->[1])
				      : (),
				  [city => ucfirst(lc($1))],
				  [state => $2],
				  [zip => $3])});
	    }
    	    # otherwise, we're either in street1 or street2
	    else {
		_save($_[0], {(!exists($_[0]->[$_IDI]->{'Address.street1'})
				   ? 'Address.street1'
				   : 'Address.street2'), $_[1]});
	    }
	},
	end => sub {_change_state(@_, undef, 'phone')},
    },
    phone => {
	text => sub {_change_state(@_, 'Phone Number', 'save_phone')},
    },
    save_phone => {
	text => sub {_save($_[0], {'Phone_1.phone', Bivio::Type::Phone
				       ->from_literal($_[1])})
			 unless _is_empty(@_)},
	end => sub {_change_state(@_, 'tr', 'end')},
    },
    end => {},
};


# _change_state(self, string actual, string expect, string state)
# change to 'state' unless 'actual' ne 'expect', or change to 'state'
# unconditionally if 'expect' is undef
sub _change_state {
    my($self, $actual, $expect, $state) = @_;
    return if defined($expect) && $expect ne $actual;
    my($fields) = $self->[$_IDI];
    $fields->{state} = $state;
    return;
}

# _is_empty(self, string) : boolean
sub _is_empty {
    my($self, $value) = @_;
    return $value =~ /^(\s*|&nbsp;)$/;
}

# _notify(self, string event, string data)
# 'event' must be one of 'start', 'text', 'end', or 'comment'
# 'data' is the tag for 'start' and 'end', and the text for 'text'
# and the comment for 'comment'
# Looks up the next action from $_STATE_TABLE using event and data,
# and if the action exists, it is called with 'self' and 'data'.
# The action will either _change_state(), _save(), or ignore the
# event
sub _notify {
    my($self, $event, $data) = @_;
    my($fields) = $self->[$_IDI];
    my($fn) = $_STATE_TABLE->{$fields->{'state'}}->{$event};
    return $fn->($self, $data) if defined($fn);
}

The actual states and transitions are modeled on Bob's comment parser. The comment parser changes state when special characters are found. It starts in the OutsideCommentState. In that state a '/' will change to MaybeCommentState. There another '/' or '*' will change to InsideCommentState. Anything else would change back to OutsideComment. Or something to that effect -- don't have his book at hand as I write this. My html parser changes states when special tags or text are encountered.

Here's what this state machine does in English. The parser starts out in the 'begin' state. HTML::Parser sends start, end, text, and comment events which are forwarded to _notify(). In the 'begin' state all events are ignored until we run into the first <H3> start tag. That will trigger a change to the 'save_name' state. In the save_name state, the first block of text will be _save()'d as the name. Then the first end tag will change the state to 'street1'. In the 'street1' state we ignore everything until we see the text 'Business Address' which triggers a change to 'empty_address_cell'. 'street1', 'empty_address_cell', and 'skip_empty_address_cell' work together to skip past the table cell containing the   and into the cell that actually contains the address. The 'save_address' state is the most complicated of the bunch and skipping that   makes it somewhat less complicated. Addresses are complicated because in some examples we get a street1, city, state, and zip whereas in other cases we get street1, street2, city, state and zip. In a few cases we only get city, state, and zip and one case gives only the state. I won't go into the details about the regular expression for parsing the city, state, and zip. Nor will I explain the map(). Those are both worthy of discussion, but not central to the mechanics of the state machine. The 'phone' and 'save_phone' states work similarly to the 'begin' and 'save_name' states. 'phone' ignores all tags until it finds the text 'Phone Number' when it changes to 'save_phone'. The closing row tag changes to 'end'. 'end' ignores all remaining events. What the parser ultimately returns is a hash containing the name, address and phone number.

What I find most satisfying about this solution is the concise and declarative structure of the $_STATE_TABLE. There's some effort required to understand the mechanics of the notification and the start, text, and end events. But with that understood, the $_STATE_TABLE is considerably less cluttered than the pile of if .. elsif .. else blocks it could have been. The bulk of the state machine is also fits in a single window in emacs whereas in Java I would have a pile of objects in different files to express the same machine. I think it also provides a nice example of the kind of power and expressiveness you get from having anonymous functions and arrays and hashes built into the language. Python, Ruby, and Groovy offer similar expressive power, but the proof is left as an exercise for the reader. :-) For what it's worth, In Java, I think the Command pattern with a little Dependency Injection could be used to provide something similar to the anonymous functions.

This technique for parsing angle-bracketed data would be equally applicable with SAX and an XML data source.