Tuesday, September 22, 2009

Molniya 0.2

Molniya 0.2 is now available. New in this release:
  • Molniya will track notifications that would have been sent while you were away or offline, and summarize them for you when you come back.
  • When you ask Molniya to re-check a service with check host/service or @3 check, it issues a check command to Nagios, then polls until the freshly checked status is available and reports it back to you (there's a rough sketch of this flow below).
  • Problem hosts are included in status reporting.
There are also a variety of internal improvements to the code: command handling has been modularized, and a lot of the Nagios data representations have been cleaned up.
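
For the curious, the re-check flow is essentially what you'd do by hand against the Nagios external command file. Here's a rough sketch in shell; the paths, host, and service names are examples, not Molniya's actual code:

#!/bin/sh
# Hypothetical illustration of the re-check flow; Molniya does the
# equivalent internally. Paths and names below are examples.
HOST=web1
SVC=HTTP
CMDFILE=/usr/local/nagios/var/rw/nagios.cmd
STATUSDAT=/usr/local/nagios/var/status.dat

# Ask Nagios to force a check of the service right now.
NOW=$(date +%s)
printf '[%s] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%s\n' "$NOW" "$HOST" "$SVC" "$NOW" \
  > "$CMDFILE"

# Poll status.dat until the service's last_check is newer than the
# time we issued the command, then report the fresh state.
until awk -F= -v h="$HOST" -v s="$SVC" -v t="$NOW" '
  /^servicestatus/ { insvc = 1 }
  /^hoststatus/ { insvc = 0 }
  $1 ~ /host_name/ { hst = $2 }
  $1 ~ /service_description/ { svc = $2 }
  insvc && $1 ~ /last_check/ && hst == h && svc == s && $2 + 0 >= t + 0 { found = 1 }
  END { exit !found }
' "$STATUSDAT"; do
  sleep 2
done
echo "$HOST/$SVC has a fresh check result"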

It's been downloaded a whopping three times so far: two for 0.1 and one for 0.2. Running it out of Subversion is a pretty reasonable choice as well; whatever is in Subversion is running on my own network, so it can't be too broken.

Next up: downtime scheduling, custom notifications, listing services and things, and possibly broadcast messages.

Suggestions and patches are welcome; use Google Code's issue tracker.

Wednesday, September 16, 2009

Initial release of Molniya

I am happy to announce the first-ever release of my Nagios IM gateway software, Molniya. First and foremost, it lets you receive Nagios problem notifications as instant messages instead of inbox-clogging floods of email (or worse, SMS messages). Beyond that, you can also ask it for a status report on any problems Nagios currently knows about, force service checks, and acknowledge problems. From the command summary:

Nagios switchboard commands:
status: get a status report
check <host | host/svc>: force a check of the named host or service
You can respond to a notification with its @ number, like so:
@N ack [message]: acknowledge a host or service problem, with optional message
@N check: force a check of the host or service referred to
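
For example, a session might go like this (the annotations are mine; the problems are whatever Nagios is complaining about at the time):

status
  → a summary of current problems, each tagged with an @ number
@1 ack restarting apache
  → acknowledges problem @1 in Nagios, with that message attached
@1 check
  → forces a re-check of @1's host or service and reports the result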

I've been running this code and its predecessors at work for quite a while now, and it works well. I'm actively adding features; problem acknowledgement just went in this afternoon, for instance.

The code has plenty of room for improvement; expect major revisions to the way commands are handled and messages are formatted, among other things. Patches, feature requests, and especially "it doesn't work on my machine" reports are all welcome.

You can download version 0.1.

Thursday, June 18, 2009

HOWTO monitor HP DL3x0 G5 hardware with Nagios

HP ProLiant servers (G5 hardware, anyway, such as the DL360 G5 and DL380 G5) have good hardware monitoring capabilities, but they can be a bit complex to monitor with Nagios. The lights-out management hardware only provides a WS-Management interface; there is no remote IPMI capability, and SNMP monitoring requires installing a rather heavyweight set of OS-level management agents. With a little setup work, though, it's entirely possible to get good Nagios-based monitoring. I've identified a number of SNMP OIDs that cover the important general and subsystem-specific health indications.

(It's also possible to do this with HP's Systems Insight Manager software, and it might well be simpler to set that up, but the last thing I need is another monitoring system to administer in parallel with Nagios.)

Step 1: install the HP management agents on your servers. They can be downloaded here; choose your OS and then look under "Software - System Management." For instance, the ESX 3.5 agents are here.

Step 2: download the SNMP MIBs from here (choose the HP-UX/Linux download link to get a tarball of MIBs). These should be installed in the usual SNMP MIB location on your Nagios server; on my Solaris boxes this is /etc/sma/snmp/mibs/.
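
Before touching Nagios, it's worth sanity-checking both steps with the Net-SNMP command-line tools (the hostname and community string below are examples):

# Confirm the MIBs are loadable by translating a name to its numeric OID
snmptranslate -m ALL -On CPQHLTH-MIB::cpqHeMibCondition.0

# Confirm the management agents on the server actually answer
snmpget -m ALL -v 2c -c public myserver CPQHLTH-MIB::cpqHeMibCondition.0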

Step 3: define an appropriate SNMP monitoring command in Nagios, so that you can specify OK and warning conditions. Note that the SNMP community string is $USER3$ in my environment; change this as appropriate. The first argument is the OID, the second is the OK value, and the third is the range spanning the OK and warning values. The Compaq MIBs use a common convention for condition values: other(1), ok(2), degraded(3), failed(4), which is why every service below passes 2 as the OK value and 2:3 as the OK-to-warning range. (Look through the MIBs if you want to confirm the OIDs and values.)


define command {
command_name check-snmp-okwarn
command_line $USER1$/check_snmp -H $HOSTADDRESS$ -P 2c -C $USER3$ -o $ARG1$ -w $ARG2$ -c $ARG3$
}
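
It's also worth testing the plugin by hand before wiring it into Nagios (the plugin path, hostname, and community string will vary):

# 2 (ok) is inside both ranges, so it returns OK; 3 (degraded) is outside
# the warning value but inside the critical range, so it returns WARNING;
# anything else, e.g. 4 (failed), returns CRITICAL.
/usr/local/nagios/libexec/check_snmp -H myserver -P 2c -C public \
  -o CPQHLTH-MIB::cpqHeMibCondition.0 -w 2 -c 2:3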


Step 4: define the Nagios services to monitor. You will probably need to change the 'use important-service' and 'hostgroup_name proliant' directives to suit your own service template and hostgroup names; a minimal hostgroup definition is sketched below. My HP servers use both Ethernet ports, so I have services defined for both; if yours use more or fewer, add or remove services accordingly, changing the instance number (the last element of the OID) to match.
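
If you don't already have a suitable hostgroup, a minimal definition looks like this (the member names are placeholders):

define hostgroup {
hostgroup_name proliant
alias HP ProLiant G5 servers
members dl360-1,dl380-1
}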


define service {
use important-service
hostgroup_name proliant
service_description General system health
servicegroups system
check_command check-snmp-okwarn!CPQSTDEQ-MIB::cpqSeMibCondition.0!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description RAID controller health
servicegroups system
check_command check-snmp-okwarn!CPQIDA-MIB::cpqDaCntlrCondition.0!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description Server health
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeMibCondition.0!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description Thermal condition
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeThermalCondition.0!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description System fans
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeThermalSystemFanStatus.0!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description CPU fans
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeThermalCpuFanStatus.0!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description Power supplies
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeFltTolPwrSupplyCondition.0!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description Ethernet 0
servicegroups system
check_command check-snmp-okwarn!CPQNIC-MIB::cpqNicIfPhysAdapterCondition.1!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description Ethernet 1
servicegroups system
check_command check-snmp-okwarn!CPQNIC-MIB::cpqNicIfPhysAdapterCondition.2!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description Advanced Memory Protection
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeResilientMemCondition.0!2!2:3
}


This configuration works well for me, but there's undoubtedly room for improvement by monitoring additional OIDs; one way to hunt for candidates is shown below. Feel free to leave a comment with your suggestions.
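
To hunt for candidates, walk the HP/Compaq enterprise subtree (.1.3.6.1.4.1.232) and grep for more of the condition OIDs (hostname and community string are examples):

snmpwalk -m ALL -v 2c -c public myserver .1.3.6.1.4.1.232 | grep -i condition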

EDITED 8/9/2009: added Advanced Memory Protection service to watch for DIMM ECC errors. Guess where I got the idea for that. :)