True genius is knowing when to stop

09 May 2004

I call it flailing. Intuition tells me the problem on which I'm working has an easy solution, but it evades my every turn. An elusive solution teases me along dead end after dead end.

I have to import data from twelve years of backups. I don't expect that to be easy. But some steps should be. Our client's original system was built in Access. The files were archived annually with names like DB1990.mdb, DB1991.mdb, DB1992.mdb and so on through 2002. There is a Win32 laptop at work, but Access isn't installed: DBI and DBD::ODBC to the rescue. Though I haven't used DBI before, JDBC translates quite well. I work with a bunch of unix geeks, so cygwin is installed. I thankfully didn't have to re-traumatize my fingers with DOS.

This "one"-liner prints out the tables from one data source.

% perl -MDBI -e'$dbh=DBI->connect("dbi:ODBC:DB1990"); \
    print join("\n", @{$dbh->tables}), "\n";'

I can ignore most of the tables -- reports and summaries. But I can't ignore BROKER and BROKER90 nor POLICY and POLICY90 and so on. The other backups include tables like FOO and FOO91 and FOO and FOO92. The challenge is to weed out the redundant data from all those backups and to clean the inevitable data inconsistencies. A few more perl scripts and the mighty diff revealed some columns were added over time but the schema is mostly stable for the first 10 years with some bigger changes in the last two.

What's the simplest thing that could possibly work?

I thought about dumping the data into tab delimited tables and groveling over it with regular expressions. But it was already in a structured form and I could use SQL instead of regular expressions. Leaving the data where it was seemed like less work, so I started banging out some one-off scripts using DBI. Simple. The repetition quickly got tedious, and Rob routinely advocates once-and-only-once, so I made a slightly interactive script and started reusing.

1 while(dispatch_command());

sub dispatch_command {
    print $prompt;
    my $cmd = <STDIN>;
    chomp($cmd);
    $cmd =~ s/;\s*$//;
    $cmd =~ m/^([^ ]+) /;
    my $method = lc($1); 
    return $dispatch->{$meth}->($cmd);
}

Subroutines referenced in the $dispatch table handled specific commands: connecting to a data source, displaying metadata, or an SQL select. I didn't get there immediately. I just added a simple thing here and there. I removed a little repitition. I ended up with a rudamentary perl SQL shell. That turns out not to be the simplest thing that could possibly work. But it was a chain of simplest things that lead me there. And I was kinda proud of dispatch_command.

The title of this post comes from one of Dr. Veltman's good friends.

True genius is knowing when to stop. Many people have ideas, but the genius quickly abandons the bad ones, whereas the rest of us press on long after an idea has revealed itself as a failure.

For the time I spent I should have just bought a copy of Access and installed it. I talked with Rob about it later. His first suggestion was looking for something on CPAN -- except that our internet connection was down that day (more telecom frustration). His second suggestion was also more simple: import the raw data into postgresql and use psql.

The next day I decided to abaondon the interactive approach and just see what I could get out of the first database. I thought it should be pretty easy, and by this time I needed a small success to build back some momentum. What were the differences between POLICY and POLICY90? I dumped the two tables into tab-delimited text files. perl warned me about uninitialized values in string concatination, so I replaced undefined values with --null--. But diff didn't tell me much. It compares line-by-line and each line was different -- more columns in one than the other. While I was talking with Rob about something else he asked about the --null-- things. "If you just write out perl, you can eval it later and just leave those as undef. Don't invent your own syntax." Sometimes it's the little things. Actually, um... I did that on purpose. See, I knew I'd learn another cool trick from Rob if I put --null-- in there. ;-)

Switching to Data::Dumper cut my code down dramatically. The lesson from Rob is if you just use perl you don't have to write a parser. Data::Dumper is my new friend. It also put each 'cell' on its own line. That made diff -ub much more helpful. But I was still visually inspecting the diffs. I wanted a mechanical way of confirming that the extra columns contained all null values.

I wrote an emacs macro to repeatedly search over the diff file. I thought it might stop when it couldn't find the pattern in marked region, but instead it just expanded the region. Took a while to follow that dead end. I tried using emacs diff-mode to see if it would make the macro any easier. That lead me to ediff and emerge. But the files I was diffing were 12MB each and emacs kept crashing -- "don't know my own strength". I tried various other ways of bending emacs to the task thinking it would be simpler than writing a perl script to do the same. Finally after repeated failures I asked David if he new an emacs incantation that would help. "Why not use perl to look at the table with the extra columns and just grep for nulls?"

That would be simple.

$result = do 'DB1990.dump';
print "hooray, they're all null\n"
    if (int(@$result) == grep(!defined($_->[21]), @$result));

How do you you learn to recognize the dead-ends before you get to the end? Is it just a matter of experience? Or does it actually require genius?