(It's also possible to do this with HP's Systems Insight Manager software, and it might well be simpler to set that up, but the last thing I need is another monitoring system to administer in parallel with Nagios.)
Step 1: install the HP management agents on your servers. They can be downloaded here; choose your OS and then look under "Software - System Management." For instance, the ESX 3.5 agents are here.
Step 2: download the SNMP MIBs from here (choose the HP-UX/Linux download link to get a tarball of MIBs). These should be installed in the usual SNMP MIB location on your Nagios server; on my Solaris boxes this is /etc/sma/snmp/mibs/.
Step 3: define an appropriate SNMP monitoring command in Nagios, so that you can specify OK and warning conditions. Note that the SNMP community string is $USER3$ in my environment; change this as appropriate. The first argument is the OID, the second is the OK value (passed as the warning threshold), and the third is the range spanning the OK and warning values (passed as the critical threshold, so anything outside it is critical). (You'll need to look through the MIBs to find these OIDs and values.)
define command {
command_name check-snmp-okwarn
command_line $USER1$/check_snmp -H $HOSTADDRESS$ -P 2c -C $USER3$ -o $ARG1$ -w $ARG2$ -c $ARG3$
}
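To see why "2" and "2:3" give the OK/warning/critical split I want, here is a rough Python sketch of how the thresholds are evaluated. It assumes your check_snmp follows the standard Nagios plugin range syntax (older plugin releases may behave differently) and that the HP condition values are other(1), ok(2), degraded(3), failed(4), which is how I read the MIBs. The function names are mine, not anything from the plugin.

```python
import math

def parse_range(spec):
    """Parse a Nagios-plugin range such as '2', '2:3', '10:', '~:10' or '@2:3'."""
    inside = spec.startswith("@")  # '@' inverts the test: alert when INSIDE the range
    if inside:
        spec = spec[1:]
    if ":" in spec:
        low_s, high_s = spec.split(":", 1)
    else:
        low_s, high_s = "0", spec  # a bare 'N' means the range 0..N
    low = -math.inf if low_s == "~" else float(low_s or 0)
    high = math.inf if high_s == "" else float(high_s)
    return low, high, inside

def alerts(value, spec):
    """True if the value should raise an alert for this threshold."""
    low, high, inside = parse_range(spec)
    in_range = low <= value <= high
    return in_range if inside else not in_range

def state(value, warn, crit):
    """Critical is tested first, then warning, as Nagios plugins do."""
    if alerts(value, crit):
        return "CRITICAL"
    if alerts(value, warn):
        return "WARNING"
    return "OK"

# Assumed meaning of the HP condition integers (check your MIBs).
CONDITIONS = {1: "other", 2: "ok", 3: "degraded", 4: "failed"}

for value, name in CONDITIONS.items():
    print(value, name, state(value, "2", "2:3"))
```

Under those assumptions, only 2 comes back OK, 3 is a warning, and both 1 and 4 are critical, because they fall outside 2:3.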
Step 4: define the Nagios services to monitor. You will probably need to change the 'use important-service' and 'hostgroup_name proliant' directives to suit your own service template and hostgroup names. My HP servers use both Ethernet ports, so I have services defined for both; if you need to monitor more or fewer ports, add or remove services accordingly, changing the instance number component of the OID (the last element) as appropriate.
define service {
use important-service
hostgroup_name proliant
service_description General system health
servicegroups system
check_command check-snmp-okwarn!CPQSTDEQ-MIB::cpqSeMibCondition.0!2!2:3
}
define service {
use important-service
hostgroup_name proliant
service_description RAID controller health
servicegroups system
check_command check-snmp-okwarn!CPQIDA-MIB::cpqDaCntlrCondition.0!2!2:3
}
define service {
use important-service
hostgroup_name proliant
service_description Server health
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeMibCondition.0!2!2:3
}
define service {
use important-service
hostgroup_name proliant
service_description Thermal condition
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeThermalCondition.0!2!2:3
}
define service {
use important-service
hostgroup_name proliant
service_description System fans
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeThermalSystemFanStatus.0!2!2:3
}
define service {
use important-service
hostgroup_name proliant
service_description CPU fans
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeThermalCpuFanStatus.0!2!2:3
}
define service {
use important-service
hostgroup_name proliant
service_description Power supplies
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeFltTolPwrSupplyCondition.0!2!2:3
}
define service {
use important-service
hostgroup_name proliant
service_description Ethernet 0
servicegroups system
check_command check-snmp-okwarn!CPQNIC-MIB::cpqNicIfPhysAdapterCondition.1!2!2:3
}
define service {
use important-service
hostgroup_name proliant
service_description Ethernet 1
servicegroups system
check_command check-snmp-okwarn!CPQNIC-MIB::cpqNicIfPhysAdapterCondition.2!2!2:3
}
define service {
use important-service
hostgroup_name proliant
service_description Advanced Memory Protection
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeResilientMemCondition.0!2!2:3
}
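The Ethernet services above differ only in their description and the last element of the OID, so if you have more than two ports you could generate them rather than copy and paste. Here is a small Python sketch. It assumes the adapter instance numbers simply continue as 1, 2, 3, ... which matches my two-port servers but is worth confirming with an snmpwalk on your own hardware. The ethernet_services helper is hypothetical, just for illustration.

```python
# Template copied from the hand-written Ethernet services; doubled braces
# are literal braces for str.format().
SERVICE_TEMPLATE = """define service {{
use important-service
hostgroup_name proliant
service_description Ethernet {port}
servicegroups system
check_command check-snmp-okwarn!CPQNIC-MIB::cpqNicIfPhysAdapterCondition.{instance}!2!2:3
}}"""

def ethernet_services(port_count):
    """Return Nagios service definitions for ports 0..port_count-1.

    Port N is assumed to map to SNMP instance N + 1, as in the two
    hand-written services.
    """
    return "\n".join(
        SERVICE_TEMPLATE.format(port=port, instance=port + 1)
        for port in range(port_count)
    )

# Example: a server with four Ethernet ports.
print(ethernet_services(4))
```

Paste the output into your services file in place of the two Ethernet definitions, and adjust the template and hostgroup names as before.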
This configuration works well for me, but there's undoubtedly room for improvement by monitoring additional OIDs. Feel free to leave a comment with your suggestions.
EDITED 8/9/2009: added Advanced Memory Protection service to watch for DIMM ECC errors. Guess where I got the idea for that. :)
1 comment:
Very interesting. Will certainly be trying this out soon. Thanks for getting me started.