Sunday, February 8, 2009

The importance of trim()... or how much I hate whitespace!!

So... One of my colleagues went on leave last friday, and Murphys law dictates that that is the perfect time for the application he maintains to suddenly and precipitously cease working in production. Of course this problem fell into my lap to resolve so I set to it around about lunchtime...

The situation is that a third party regularly sends us a variety of data in plain text or spreadsheet format, in this particular case it's spreadsheet. Once we get this, we read it off the filesystem, look up data from the spreadsheet and compare it to data in our systems and perform various actions based on it. This has been in production for about 8months and has worked perfectly the whole time. Until the maintainer went on leave... :-)

Digging in, the first problem was that it wasn't reading dates properly. Hmmm, ok, so we grab the last spreadsheet that worked and compare with that, ah, they'd changed the date format from YYYY-MM-DD to DD/MM/YYYY for no reason at all. Ok, not a biggie, one excel macro later it looks fine, reload but still no workie... what tha?

After more digging we were being bitten by Excel date handling. A cell may look like a date, but if the cell is formatted as a date then internally it's actually just an integer, which needs to be converted. In my checked-out version of code this was handled transparently, but in prod and QUAL this broke as numeric dates were being missed, resulting in nulls. Ok, another excel macro and we have columns of strings that look like dates instead of numbers that look like dates. Now loading this file into QUAL the dates are picked up fine, at which point I noticed that a lookup between the file and our system wasn't working. WTF mate? Each row has several cells that contain codes that need to be converted into rich data structures from a lookup, this data structure then being used for calculations later.

The lookup was broken, why? Well, after several hours of checking differences in environment, checking datasources, checking versions of code and finally in desperation stepping through the lookup code, it dawned on me... The spreadsheet had whitespace in it.

Normally when I read tabular data from disk, or really any untrusted data at all, trim()'ing and validation is the order of the day. If the field is an enumeration, then validate that. If it's a number, check for alphas, dollar signs, commas etc and so on. It's the sort of paranoia that maintenance programming gives you. I had ASSumed that my colleague would have done the same by default, oops no. Which is fine, we all make mistakes.

So, a quick google, this handy KB article and some VBA later, and the spreadsheet was all clean again. Re-running the load it came up roses, which is to say it started working like billy-o where it had previously just cruised through, and all was well with the world.

Moral of the story is, Excel is a pain to work with. Not because it's the spawn of pure evil, but rather because although it *looks* like a datasource with strong data rules, it isn't, and you can never, ever assume that anything inside it is what you think it is until you've checked and made sure. Columns will move around, dates will randomly be strings or numbers, any and every field will randomly be space/tab/zero padded. You simply have to treat it as a hostile datasource.

And never, ever fail silently when something is wrong with the data!! Die loudly and with great gusto!! Please for my sanity!!!

// No record exists may be something wrong with **** in the input file

1 comment:

Sometimes I write stuff. said...

Wow. Some of this actually made sense to me.