Thursday, June 18, 2009

HOWTO monitor HP DL3x0 G5 hardware with Nagios

HP ProLiant servers (G5 hardware, anyway, such as the DL360 G5 and DL380 G5) have good hardware monitoring capabilities, but they can be a bit complex to monitor with Nagios. The lights-out management hardware only provides a WS-Management interface; there is no remote IPMI capability, and SNMP management requires the installation of a rather heavyweight set of OS-level management agents. With a little setup work, though, it is entirely possible to get good Nagios-based monitoring. I've identified a number of SNMP OIDs that cover the important general and subsystem-specific health indications.

(It's also possible to do this with HP's Systems Insight Manager software, and it might well be simpler to set that up, but the last thing I need is another monitoring system to administer in parallel with Nagios.)

Step 1: install the HP management agents on your servers. They can be downloaded here; choose your OS and then look under "Software - System Management." For instance, the ESX 3.5 agents are here.

Step 2: download the SNMP MIBs from here (choose the HP-UX/Linux download link to get a tarball of MIBs). These should be installed in the usual SNMP MIB location on your Nagios server; on my Solaris boxes this is /etc/sma/snmp/mibs/.
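
Before moving on, it can help to confirm that the MIB modules named in the service checks later in this post actually landed in that directory. The following is a minimal sketch, not part of the original setup: it scans every file in a directory for the standard `NAME DEFINITIONS ::= BEGIN` module header instead of guessing at file names. The function names and the list of required modules are my own assumptions, taken from the OIDs used in this post.

```python
import re
from pathlib import Path

# Modules referenced by the service definitions in this post.
REQUIRED_MODULES = {"CPQSTDEQ-MIB", "CPQIDA-MIB", "CPQHLTH-MIB", "CPQNIC-MIB"}

# An SMI module starts with "<MODULE-NAME> DEFINITIONS ::= BEGIN".
MODULE_HEADER = re.compile(
    r"^\s*([A-Za-z][A-Za-z0-9-]*)\s+DEFINITIONS\s*::=\s*BEGIN", re.MULTILINE
)


def modules_in(mib_dir):
    """Return the set of MIB module names declared by files in mib_dir."""
    found = set()
    for path in Path(mib_dir).iterdir():
        if not path.is_file():
            continue
        # MIB files are plain text; tolerate stray non-UTF-8 bytes.
        text = path.read_text(errors="replace")
        found.update(MODULE_HEADER.findall(text))
    return found


def missing_modules(mib_dir, required=REQUIRED_MODULES):
    """Return a sorted list of required modules not found in mib_dir."""
    return sorted(set(required) - modules_in(mib_dir))


# Example (path as on my Solaris boxes; adjust for your system):
# print(missing_modules("/etc/sma/snmp/mibs"))
```

An empty list means all four modules were found; anything else names the modules that still need to be copied into place.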

Step 3: define an appropriate SNMP monitoring command in Nagios, so that you can specify OK and warning conditions. Note that the SNMP community string is $USER3$ in my environment; change this as appropriate. The first argument is the OID, the second is the warning threshold (the OK value), and the third is the critical range, which spans the OK and warning values; anything outside that range is reported as critical. (You'll need to look through the MIBs to find these OIDs and values.)


# Arguments: $ARG1$ = OID, $ARG2$ = -w threshold, $ARG3$ = -c threshold
define command {
command_name check-snmp-okwarn
command_line $USER1$/check_snmp -H $HOSTADDRESS$ -P 2c -C $USER3$ -o $ARG1$ -w $ARG2$ -c $ARG3$
}
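
To see why the thresholds `2` and `2:3` give the desired result, here is a minimal sketch of the threshold logic. It assumes the standard Nagios plugin range semantics (a bare `N` means "alert outside 0..N", and `LOW:HIGH` means "alert outside LOW..HIGH"); older check_snmp releases may word this differently in their help text. It also assumes the usual encoding of HP condition objects (1 = other, 2 = ok, 3 = degraded, 4 = failed), so check the MIB text for each OID you use. The function names are hypothetical and this is not how check_snmp itself is implemented.

```python
import math


def parse_range(spec):
    """Parse "N" or "LOW:HIGH" into inclusive (low, high) bounds.

    Only the forms used in this post are handled; the "@" (inverted)
    and "~" (negative infinity) forms are deliberately left out.
    """
    if ":" in spec:
        low, high = spec.split(":", 1)
        return (float(low) if low else 0.0,
                float(high) if high else math.inf)
    return (0.0, float(spec))


def nagios_state(value, warn_spec, crit_spec):
    """Return the state a threshold-based plugin would report for value."""
    w_low, w_high = parse_range(warn_spec)
    c_low, c_high = parse_range(crit_spec)
    # Critical is evaluated first: outside the critical range wins.
    if not (c_low <= value <= c_high):
        return "CRITICAL"
    if not (w_low <= value <= w_high):
        return "WARNING"
    return "OK"


# Assumption: the usual HP condition encoding; verify against the MIBs.
HP_CONDITIONS = {1: "other", 2: "ok", 3: "degraded", 4: "failed"}

for code, name in HP_CONDITIONS.items():
    print(code, name, nagios_state(code, "2", "2:3"))
```

Under these assumptions, a value of 2 is OK, 3 is a warning, and both 4 and 1 are critical, so an agent that reports "other" will raise an alert instead of passing silently.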


Step 4: define the Nagios services to monitor. You will probably need to change the 'use important-service' and 'hostgroup_name proliant' directives to suit your own service template and hostgroup names. My HP servers use both Ethernet ports, so I have services defined for both; if you need to monitor more or fewer ports, add or remove service definitions, changing the instance number component of the OID (the last element) as appropriate.


define service {
use important-service
hostgroup_name proliant
service_description General system health
servicegroups system
check_command check-snmp-okwarn!CPQSTDEQ-MIB::cpqSeMibCondition.0!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description RAID controller health
servicegroups system
check_command check-snmp-okwarn!CPQIDA-MIB::cpqDaCntlrCondition.0!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description Server health
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeMibCondition.0!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description Thermal condition
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeThermalCondition.0!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description System fans
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeThermalSystemFanStatus.0!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description CPU fans
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeThermalCpuFanStatus.0!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description Power supplies
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeFltTolPwrSupplyCondition.0!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description Ethernet 0
servicegroups system
check_command check-snmp-okwarn!CPQNIC-MIB::cpqNicIfPhysAdapterCondition.1!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description Ethernet 1
servicegroups system
check_command check-snmp-okwarn!CPQNIC-MIB::cpqNicIfPhysAdapterCondition.2!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description Advanced Memory Protection
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeResilientMemCondition.0!2!2:3
}
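
If you have servers with more than two Ethernet ports, writing the extra service blocks by hand gets tedious. The sketch below generates them in the same shape as the two Ethernet services above. It assumes the adapter instances are numbered consecutively from 1, as they are on my servers; that may not hold on yours, so walk the cpqNicIfPhysAdapterCondition column first to see which instances exist. The function names and default arguments are hypothetical.

```python
def nic_service(instance, template="important-service", hostgroup="proliant"):
    """Return one Nagios service definition for a CPQNIC-MIB adapter instance.

    Instance 1 is labelled "Ethernet 0", matching the definitions in this post.
    """
    oid = f"CPQNIC-MIB::cpqNicIfPhysAdapterCondition.{instance}"
    lines = [
        "define service {",
        f"use {template}",
        f"hostgroup_name {hostgroup}",
        f"service_description Ethernet {instance - 1}",
        "servicegroups system",
        f"check_command check-snmp-okwarn!{oid}!2!2:3",
        "}",
    ]
    return "\n".join(lines)


def nic_services(count):
    """Return definitions for instances 1..count, separated by blank lines."""
    return "\n\n".join(nic_service(i) for i in range(1, count + 1))


# Example: definitions for a server with four ports.
print(nic_services(4))
```

Paste the output into your services file in place of the hand-written Ethernet blocks, adjusting the template and hostgroup arguments to match your own names.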


This configuration works well for me, but there's undoubtedly room for improvement by monitoring additional OIDs. Feel free to leave a comment with your suggestions.

EDITED 8/9/2009: added Advanced Memory Protection service to watch for DIMM ECC errors. Guess where I got the idea for that. :)