Thursday, May 13, 2010

OS X and the mystery bits

Sometimes I have to learn a lesson over and over again before I get a glimmer of what's going on. I spent some time again today trying to get the MySQL client library module for Python working. I need it to be able to talk w/ a MySQL database for an online store. I'd rather use PostgreSQL but in this case, I don't have a choice.

I downloaded MySQL OS X package and ran the installation program. It did what it was supposed to - install the software under /usr/local. The MySQL client runs and all seems right with the world. I download and install the MySQL Python client module and it builds as expected. But when I try to run Django, I get this error:

django.core.exceptions.ImproperlyConfigured: Error loading MySQLdb module: dynamic module does not define init function (init_mysql)


Grumble, grumble, grumble.

I try to load the MySQL Python module directly:

peter@drag:~> python
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
port>>> import _mysql
Traceback (most recent call last):
File "", line 1, in
File "build/bdist.macosx-10.6-universal/egg/_mysql.py", line 7, in
File "build/bdist.macosx-10.6-universal/egg/_mysql.py", line 6, in __bootstrap__
ImportError: dynamic module does not define init function (init_mysql)


I checked the build output. Checked the install output. Everything looks fine. This should work. Searching the web doesn't help much since the references I find are from 2007 and 2008. The patches that they describe have already been incorporated into the MySQL Python module.

About this time, I realize that I've solved this problem before. It has to do with bits. I run:

peter@drag:~> python
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform
>>> platform.architecture()
('64bit', '')


I look at the original MySQL disk image that I downloaded and see that its a 32bit application. Mixing a 32bit dynamic library with a 64bit executable doesn't work.

After downloading the 64bit MySQL disk image, installing it and rebuilding the MySQL Python module, everything works. Now its on to writing code for the online store. Yea.

Sunday, March 28, 2010

Bad data and the wonder of RegEx

Regular expressions aren't something that I use regularly but I think that's about the change. I came up a programming world where C, Cobol and Fortran were just the best thing considering that the alternative was writing assembler. And in that world, regular expressions don't really exist. The basic toolkit I developed internally for solving programming problems didn't include it. To this day, I sometimes still think about iterating over a character sequence in memory until a NULL is found. But in the modern programming world, languages like Python have tools built into the language and standard libraries that provide a lot of help to a developer. One of those tools is regular expression matching.

I'm currently working on a project to convert a FileMaker 5.5 database to SalesForce. Inside the FileMaker database, the data is in a very strange and odd states because there was no data validation done. So a date field can contain strings like '1/1/01', '6/24/2007' or '9-1-2005' or '2/27' or 'Jan 1.' These different formats cause a world of grief in trying to move the data to a database that expects a date to be structured. Something like 1/1/2001. It doesn't know about the various format of our dates and it rejects these as invalid. So the dates had to be filtered into something more standard.

I started to write some code that searches the strings for '/' and '-' and the text names of months. And naturally, it quickly became a rats nest of nested if and conditions that made understanding the code very cumbersome. So I went looking for another way to solve the problem. I remembered that Python had this module called re that provides regular expression processing but I hadn't really used it and wasn't sure about how it would work. So I start searching the web to find some help on using regular expressions and the re module. What I found was just wonderful. A.M. Kuchling wrote up the fantastic Regular Expression HOWTO. In almost no time flat I was starting to put together a regular expression that would match most the permutations of dates (those based on '1/1/01') that I've seen in our database and a second that would match the ones that had the months written out. The quickest way that I found to test the regular expressions was using Python's unittest module. I would edit the regular expression, run the unittests, re-editing, re-run, etc until it worked and all the tests past.

The code that I developed looks like this:

import unittest
import re


class TestDateRegEx(unittest.TestCase):
def test_slashes(self):
slashes = re.compile(r'(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})$')

self.assertEqual(slashes.match('1/1/10').group(), '1/1/10')
self.assertEqual(slashes.match('1/1/10').group(1), '1')
self.assertEqual(slashes.match('1/1/10').group(2), '1')
self.assertEqual(slashes.match('1/1/10').group(3), '10')
self.assertEqual(slashes.match('01/1/10').group(), '01/1/10')
self.assertEqual(slashes.match('11/1/10').group(), '11/1/10')
self.assertEqual(slashes.match('11/13/10').group(), '11/13/10')
self.assertEqual(slashes.match('1/1/2010').group(), '1/1/2010')
self.assertEqual(slashes.match('12/1/2010').group(), '12/1/2010')
self.assertEqual(slashes.match('1/21/2010').group(), '1/21/2010')
self.assertEqual(slashes.match('1/21/2010').group(1), '1')
self.assertEqual(slashes.match('1/21/2010').group(2), '21')
self.assertEqual(slashes.match('1/21/2010').group(3), '2010')

self.assertEqual(slashes.match('111/1/10'), None)
self.assertEqual(slashes.match('11/111/10'), None)
self.assertEqual(slashes.match('11/11/10100'), None)

self.assertEqual(slashes.match('1-1-10').group(), '1-1-10')
self.assertEqual(slashes.match('01-1-10').group(), '01-1-10')
self.assertEqual(slashes.match('11-1-10').group(), '11-1-10')
self.assertEqual(slashes.match('11-13-10').group(), '11-13-10')
self.assertEqual(slashes.match('1-1-2010').group(), '1-1-2010')
self.assertEqual(slashes.match('12-1-2010').group(), '12-1-2010')
self.assertEqual(slashes.match('1-21-2010').group(), '1-21-2010')
self.assertEqual(slashes.match('1-21-2010').group(1), '1')
self.assertEqual(slashes.match('1-21-2010').group(2), '21')
self.assertEqual(slashes.match('1-21-2010').group(3), '2010')

self.assertEqual(slashes.match('111-1-10'), None)
self.assertEqual(slashes.match('11-111-10'), None)
self.assertEqual(slashes.match('11-11-10100'), None)

self.assertEqual(slashes.match('1'), None)
self.assertEqual(slashes.match('1/'), None)
self.assertEqual(slashes.match('1/1'), None)
self.assertEqual(slashes.match('1'), None)
self.assertEqual(slashes.match('1/'), None)
self.assertEqual(slashes.match('1/1'), None)
self.assertEqual(slashes.match('Jan 1'), None)
self.assertEqual(slashes.match('Jan 1, 2010'), None)


def test_names(self):
names = re.compile(r'(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\s(\d{1,2})(,\s(\d{2,4}))?$', re.IGNORECASE)

self.assertEqual(names.match('Jan 1').group(), 'Jan 1')
self.assertEqual(names.match('Jan 1').group(1), 'Jan')
self.assertEqual(names.match('Jan 1').group(2), '1')
self.assertEqual(names.match('feb 21').group(), 'feb 21')
self.assertEqual(names.match('feb 21').group(1), 'feb')
self.assertEqual(names.match('feb 21').group(2), '21')
self.assertEqual(names.match('Jan 1, 2010').group(), 'Jan 1, 2010')
self.assertEqual(names.match('Jan 1, 2010').group(1), 'Jan')
self.assertEqual(names.match('Jan 1, 2010').group(2), '1')
self.assertEqual(names.match('Jan 1, 2010').group(4), '2010')
self.assertEqual(names.match('MAR 14, 2010').group(), 'MAR 14, 2010')
self.assertEqual(names.match('MAR 14, 2010').group(1), 'MAR')
self.assertEqual(names.match('MAR 14, 2010').group(2), '14')
self.assertEqual(names.match('MAR 14, 2010').group(4), '2010')

self.assertEqual(names.match('MA 20, 2010'), None)
self.assertEqual(names.match('xyz 2,'), None)
self.assertEqual(names.match('jan 2,'), None)


if __name__ == '__main__':
unittest.main()


The use of regular expressions and the ability to match and group the matches makes the code needed to clean up the dates much simpler and a lot easier to maintain. It might take a moment next time I need to parse strings to think about using regular expressions and I'm pretty sure I'll have to refer to the howto a couple more times but I'm really happy with how powerful and how much simpler text processing can be by using them.