Friday, October 16, 2009

Web service monitoring w/ Nagios and JSON

I'm using Nagios to act as a watchdog for my network and the various services that live on it. Nagios does the job pretty well. It lets me know when there's a problem, when things are back to normal and generally keeps an eye on things for me.

The checks that Nagios performs are done through a series of check commands. These commands are typical Unix-style programs, with the exception that each one prints a single line of text describing the state of the item being checked and uses its exit value to let Nagios know what's going on.

So for instance, to check the health of the web service on the localhost:

peter@sybil:~$ /usr/lib/nagios/plugins/check_http -H localhost
HTTP OK HTTP/1.1 200 OK - 361 bytes in 0.001 seconds |time=0.001021s;;;0.000000 size=361B;;;0
peter@sybil:~$ echo $?
0
peter@sybil:~$

The single line of text that is displayed follows a specific format. It starts with a prefix naming what's being tested, HTTP. Next is the status, OK. The status can be OK, WARNING, CRITICAL or UNKNOWN, which correspond to exit values 0, 1, 2 and 3. Everything after the status is eye candy that provides details specific to the test being done. Nagios doesn't really care about it, but it does provide important details when looking into problems that may be occurring.
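To make the format concrete, here is a minimal sketch in Python 3 of how a plugin might build its status line and pick its exit value. The names STATUS_CODES and format_status are made up for illustration and are not part of Nagios; the "PREFIX STATUS - detail" layout follows the check_json plugin in this post rather than any official rule.

```python
# Exit values Nagios expects from a plugin, keyed by status text.
STATUS_CODES = {'OK': 0, 'WARNING': 1, 'CRITICAL': 2, 'UNKNOWN': 3}


def format_status(prefix, status, detail):
    """Return (exit_code, line) for a plugin result.

    format_status is a hypothetical helper, not a Nagios API.
    """
    if status not in STATUS_CODES:
        # Anything unrecognised is reported as UNKNOWN.
        detail = 'unrecognised status %r' % status
        status = 'UNKNOWN'
    line = '%s %s - %s' % (prefix, status, detail)
    return STATUS_CODES[status], line
```

A plugin would print the returned line and pass the returned code to sys.exit().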

Writing these check programs in Python is pretty straightforward.

I recently had a situation where our ISP moved our web servers from one physical machine to another. This caused credit card processing for our online store to fail. The payment provider uses the IP address of the server as part of the authentication process when credit cards are submitted for processing. Since the server changed, the IP address changed. Things went around in circles for a while until we figured out the problem and gave the new IP address to the payment provider.

I thought it would be a good additional Nagios check for the store web site to check on the IP address of the physical server. Unfortunately, the ISP doesn't provide access to the IP address. But they do provide access to the hostname.

To get the hostname, I added a simple CGI program that determines the hostname and then packages it up as a JSON data structure.

#!/usr/bin/env python

"""
Bundle the hostname up as a JSON data structure.

Copyright (c) 2009 Peter Kropf. All rights reserved.
"""

import popen2
import sys
# The ISP's environment doesn't have simplejson on the default path.
sys.path.insert(1, '/home/crucible/tools/lib/python2.4/site-packages')
sys.path.insert(1, '/home/crucible/tools/lib/python2.4/site-packages/simplejson-2.0.9-py2.4-linux-x86_64.egg')

import simplejson as json

# print adds the second newline, giving the blank line that ends the headers.
print "Content-Type: application/json\n"

# Run the hostname command and read the first line of its output.
r, w, e = popen2.popen3('hostname')
host = r.readline()
r.close()
w.close()
e.close()

fields = {'hostname': host.strip()}

print json.dumps(fields)

One thing to note: since the ISP provides a very restrictive environment, I have to add the location of the simplejson module to sys.path before it can be imported. It's a bit annoying but it does work.
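On a host with a newer Python, the same endpoint would need no extra modules. The following is a rough sketch in Python 3, assuming the standard json and socket modules are available and that socket.gethostname() reports the same name as the hostname command; hostname_response is a name chosen here for illustration.

```python
import json
import socket
import sys


def hostname_response():
    """Build the full CGI response: header, blank line, JSON body."""
    body = json.dumps({'hostname': socket.gethostname()})
    return 'Content-Type: application/json\r\n\r\n' + body + '\n'


if __name__ == '__main__':
    # A CGI script writes its response to standard output.
    sys.stdout.write(hostname_response())
```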

On the Nagios server side, I created a new check program called check_json. It takes the name of a field, the expected value and the URI from which to pull the JSON data.

#! /usr/bin/env python

"""
Nagios plugin to check a value returned from a uri in json format.

Copyright (c) 2009 Peter Kropf. All rights reserved.

Example:

Compare the "hostname" field in the json structure returned from
http://store.example.com/hostname.py against a known value.

./check_json hostname buenosaires http://store.example.com/hostname.py
"""


import urllib2
import simplejson
import sys
from optparse import OptionParser

prefix = 'JSON'

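class nagios:
    # Each status pairs the exit value Nagios expects with its display text.
    ok = (0, 'OK')
    warning = (1, 'WARNING')
    critical = (2, 'CRITICAL')
    unknown = (3, 'UNKNOWN')


def exit(status, message):
    # Print the single status line, then exit with the matching value.
    print prefix + ' ' + status[1] + ' - ' + message
    sys.exit(status[0])


parser = OptionParser(usage='usage: %prog field_name expected_value uri')
options, args = parser.parse_args()


# args excludes the program name, so all three positional values must be present.
if len(args) < 3:
    exit(nagios.unknown, 'missing command line arguments')

field = args[0]
value = args[1]
uri = args[2]

# URLError covers HTTPError; ValueError covers a malformed uri.
try:
    response = urllib2.urlopen(uri)
except (urllib2.URLError, ValueError), ex:
    exit(nagios.unknown, 'invalid uri')

try:
    j = simplejson.load(response)
except ValueError, ex:
    exit(nagios.unknown, 'invalid json')

if field not in j:
    exit(nagios.unknown, 'field: ' + field + ' not present')

if j[field] != value:
    exit(nagios.critical, '%s != %s' % (j[field], value))

exit(nagios.ok, '%s == %s' % (j[field], value))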


Some checking is done to ensure that the JSON data can be retrieved, that the needed field is in the data and then that the field's value matches what's expected.
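The field comparison can also be pulled out into a function that is easy to test without a network connection. This is only a sketch in Python 3, not the plugin itself: check_field and the status constants are names invented here, and unlike the plugin it converts the field's value to a string before comparing.

```python
# Nagios exit values for the four plugin states.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_field(data, field, expected):
    """Return (exit_code, message) for one field of decoded JSON data."""
    if not isinstance(data, dict) or field not in data:
        return UNKNOWN, 'field: %s not present' % field
    # Compare as text so numeric JSON values can be checked too.
    actual = str(data[field])
    if actual != expected:
        return CRITICAL, '%s != %s' % (actual, expected)
    return OK, '%s == %s' % (actual, expected)
```

Keeping the decision separate from the fetching and the exiting means each outcome can be exercised with a plain dictionary.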

These examples show the basic testing that's done and the return values:

peter@sybil:~$ /usr/lib/nagios/plugins/check_json hostname buenosaires http://store.thecrucible.org/hostname.py
JSON OK - buenosaires == buenosaires
peter@sybil:~$ echo $?
0
peter@sybil:~$ /usr/lib/nagios/plugins/check_json hostname buenosaires http://store.thecrucible.org/hostname.p
JSON UNKNOWN - invalid uri
peter@sybil:~$ echo $?
3
peter@sybil:~$ /usr/lib/nagios/plugins/check_json hostname buenosairs http://store.thecrucible.org/hostname.py
JSON CRITICAL - buenosaires != buenosairs
peter@sybil:~$ echo $?
2
peter@sybil:~$ /usr/lib/nagios/plugins/check_json ostname buenosaires http://store.thecrucible.org/hostname.py
JSON UNKNOWN - field: ostname not present
peter@sybil:~$ echo $?
3
peter@sybil:~$
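The same status line and exit value can be read from a script, which is roughly what Nagios does with each plugin. Here is a small Python 3 sketch using the standard subprocess module; run_plugin and STATUS_NAMES are illustrative names, and the command in the example is a stand-in child process rather than a real plugin.

```python
import subprocess
import sys

# Map plugin exit values back to their status names.
STATUS_NAMES = {0: 'OK', 1: 'WARNING', 2: 'CRITICAL', 3: 'UNKNOWN'}


def run_plugin(argv):
    """Run a plugin command and return (status_name, first_output_line)."""
    proc = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          universal_newlines=True)
    lines = proc.stdout.splitlines()
    first_line = lines[0] if lines else ''
    # Any exit value outside 0-3 is treated as UNKNOWN.
    return STATUS_NAMES.get(proc.returncode, 'UNKNOWN'), first_line


if __name__ == '__main__':
    # Stand-in for a real plugin: prints a status line and exits with 2.
    fake = [sys.executable, '-c',
            "print('JSON CRITICAL - a != b'); raise SystemExit(2)"]
    print(run_plugin(fake))
```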

Once the Nagios server is configured with the new command, the hostname on the server can be monitored, which will hopefully ease any problems that occur the next time things change at the ISP.

More details on Nagios can be found at http://nagios.org and on developing check programs at http://nagiosplug.sourceforge.net/developer-guidelines.html.