Sunday, March 28, 2010

Bad data and the wonder of RegEx

Regular expressions aren't something that I use regularly but I think that's about the change. I came up a programming world where C, Cobol and Fortran were just the best thing considering that the alternative was writing assembler. And in that world, regular expressions don't really exist. The basic toolkit I developed internally for solving programming problems didn't include it. To this day, I sometimes still think about iterating over a character sequence in memory until a NULL is found. But in the modern programming world, languages like Python have tools built into the language and standard libraries that provide a lot of help to a developer. One of those tools is regular expression matching.

I'm currently working on a project to convert a FileMaker 5.5 database to SalesForce. Inside the FileMaker database, the data is in a very strange and odd states because there was no data validation done. So a date field can contain strings like '1/1/01', '6/24/2007' or '9-1-2005' or '2/27' or 'Jan 1.' These different formats cause a world of grief in trying to move the data to a database that expects a date to be structured. Something like 1/1/2001. It doesn't know about the various format of our dates and it rejects these as invalid. So the dates had to be filtered into something more standard.

I started to write some code that searches the strings for '/' and '-' and the text names of months. And naturally, it quickly became a rats nest of nested if and conditions that made understanding the code very cumbersome. So I went looking for another way to solve the problem. I remembered that Python had this module called re that provides regular expression processing but I hadn't really used it and wasn't sure about how it would work. So I start searching the web to find some help on using regular expressions and the re module. What I found was just wonderful. A.M. Kuchling wrote up the fantastic Regular Expression HOWTO. In almost no time flat I was starting to put together a regular expression that would match most the permutations of dates (those based on '1/1/01') that I've seen in our database and a second that would match the ones that had the months written out. The quickest way that I found to test the regular expressions was using Python's unittest module. I would edit the regular expression, run the unittests, re-editing, re-run, etc until it worked and all the tests past.

The code that I developed looks like this:

import unittest
import re

class TestDateRegEx(unittest.TestCase):
def test_slashes(self):
slashes = re.compile(r'(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})$')

self.assertEqual(slashes.match('1/1/10').group(), '1/1/10')
self.assertEqual(slashes.match('1/1/10').group(1), '1')
self.assertEqual(slashes.match('1/1/10').group(2), '1')
self.assertEqual(slashes.match('1/1/10').group(3), '10')
self.assertEqual(slashes.match('01/1/10').group(), '01/1/10')
self.assertEqual(slashes.match('11/1/10').group(), '11/1/10')
self.assertEqual(slashes.match('11/13/10').group(), '11/13/10')
self.assertEqual(slashes.match('1/1/2010').group(), '1/1/2010')
self.assertEqual(slashes.match('12/1/2010').group(), '12/1/2010')
self.assertEqual(slashes.match('1/21/2010').group(), '1/21/2010')
self.assertEqual(slashes.match('1/21/2010').group(1), '1')
self.assertEqual(slashes.match('1/21/2010').group(2), '21')
self.assertEqual(slashes.match('1/21/2010').group(3), '2010')

self.assertEqual(slashes.match('111/1/10'), None)
self.assertEqual(slashes.match('11/111/10'), None)
self.assertEqual(slashes.match('11/11/10100'), None)

self.assertEqual(slashes.match('1-1-10').group(), '1-1-10')
self.assertEqual(slashes.match('01-1-10').group(), '01-1-10')
self.assertEqual(slashes.match('11-1-10').group(), '11-1-10')
self.assertEqual(slashes.match('11-13-10').group(), '11-13-10')
self.assertEqual(slashes.match('1-1-2010').group(), '1-1-2010')
self.assertEqual(slashes.match('12-1-2010').group(), '12-1-2010')
self.assertEqual(slashes.match('1-21-2010').group(), '1-21-2010')
self.assertEqual(slashes.match('1-21-2010').group(1), '1')
self.assertEqual(slashes.match('1-21-2010').group(2), '21')
self.assertEqual(slashes.match('1-21-2010').group(3), '2010')

self.assertEqual(slashes.match('111-1-10'), None)
self.assertEqual(slashes.match('11-111-10'), None)
self.assertEqual(slashes.match('11-11-10100'), None)

self.assertEqual(slashes.match('1'), None)
self.assertEqual(slashes.match('1/'), None)
self.assertEqual(slashes.match('1/1'), None)
self.assertEqual(slashes.match('1'), None)
self.assertEqual(slashes.match('1/'), None)
self.assertEqual(slashes.match('1/1'), None)
self.assertEqual(slashes.match('Jan 1'), None)
self.assertEqual(slashes.match('Jan 1, 2010'), None)

def test_names(self):
names = re.compile(r'(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\s(\d{1,2})(,\s(\d{2,4}))?$', re.IGNORECASE)

self.assertEqual(names.match('Jan 1').group(), 'Jan 1')
self.assertEqual(names.match('Jan 1').group(1), 'Jan')
self.assertEqual(names.match('Jan 1').group(2), '1')
self.assertEqual(names.match('feb 21').group(), 'feb 21')
self.assertEqual(names.match('feb 21').group(1), 'feb')
self.assertEqual(names.match('feb 21').group(2), '21')
self.assertEqual(names.match('Jan 1, 2010').group(), 'Jan 1, 2010')
self.assertEqual(names.match('Jan 1, 2010').group(1), 'Jan')
self.assertEqual(names.match('Jan 1, 2010').group(2), '1')
self.assertEqual(names.match('Jan 1, 2010').group(4), '2010')
self.assertEqual(names.match('MAR 14, 2010').group(), 'MAR 14, 2010')
self.assertEqual(names.match('MAR 14, 2010').group(1), 'MAR')
self.assertEqual(names.match('MAR 14, 2010').group(2), '14')
self.assertEqual(names.match('MAR 14, 2010').group(4), '2010')

self.assertEqual(names.match('MA 20, 2010'), None)
self.assertEqual(names.match('xyz 2,'), None)
self.assertEqual(names.match('jan 2,'), None)

if __name__ == '__main__':

The use of regular expressions and the ability to match and group the matches makes the code needed to clean up the dates much simpler and a lot easier to maintain. It might take a moment next time I need to parse strings to think about using regular expressions and I'm pretty sure I'll have to refer to the howto a couple more times but I'm really happy with how powerful and how much simpler text processing can be by using them.

1 comment:

JamesK said...

Two moments in my life when I realized programming wasn't my future: (1) when they dropped the first Windows SDK box on my desk; and (2) coding regex's in bash scripts for termcap lookups. Sitting on the couch with the boys and watching Toy Story is a lot more fun, and just as rewarding. ;o)