Tuesday, September 22, 2009

Molniya 0.2

Molniya 0.2 is now available. New in this release:
  • Molniya will track notifications that would have been sent while you were away or offline, and summarize them for you when you come back.
  • When you ask Molniya to re-check a service with check host/service or @3 check, it issues a check command to Nagios and then polls until the freshly checked status is available, at which point it reports the result back to you.
  • Problem hosts are included in status reporting.
There are also a variety of internal improvements to the code: command handling is modularized, and a lot of the Nagios data representations are cleaned up.

It's been downloaded a whopping three times so far; two for 0.1 and one for 0.2. Running it out of Subversion should be a pretty reasonable choice as well; whatever is in Subversion is running on my own network, so it can't be too broken.

Next up: downtime scheduling, custom notifications, listing services and things, and possibly broadcast messages.

Suggestions and patches are welcomed; use Google Code's issue reporting.

Wednesday, September 16, 2009

Initial release of Molniya

I am happy to announce the first-ever release of my Nagios IM gateway software, Molniya. First and foremost, it lets you receive Nagios problem notifications as instant messages instead of inbox-clogging floods of email (or worse, SMS messages). Beyond that, you can also ask it for a status report on any problems Nagios currently knows about, force service checks, and acknowledge problems. From the command summary:

Nagios switchboard commands:
  status: get a status report
  check <host | host/svc>: force a check of the named host or service
You can respond to a notification with its @ number, like so:
  @N ack [message]: acknowledge a host or service problem, with optional message
  @N check: force a check of the host or service referred to
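
To give a sense of how this looks in practice, here's a made-up exchange; I haven't reproduced Molniya's exact message wording here, but the commands are as above.

Molniya: @1 PROBLEM: web1/HTTP is CRITICAL
me:      @1 ack restarting apache
me:      @1 check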

I've been running this code and its predecessors at work for quite a while now, and it works well. I'm actively adding features; problem acknowledgement just went in this afternoon, for instance.

The code has plenty of room for improvement; expect major revisions to the way commands are handled and messages are formatted, among other things. Patches, feature requests, and especially "it doesn't work on my machine" reports are all welcome.

You can download version 0.1.

Thursday, June 18, 2009

HOWTO monitor HP DL3x0 G5 hardware with Nagios

HP Proliant servers (G5 hardware, anyway, such as the DL360 G5 and DL380 G5) have good hardware monitoring capabilities, but they can be a bit complex to monitor with Nagios. The lights-out management hardware only provides a WS-Management interface; there is no remote IPMI capability, and SNMP management requires the installation of a rather heavyweight set of OS-level management agents. With a little bit of setup work, though, it is completely possible to get good Nagios-based monitoring. I've identified a number of SNMP OIDs that cover the important general and subsystem-specific health indications.

(It's also possible to do this with HP's Systems Insight Manager software, and it might well be simpler to set that up, but the last thing I need is another monitoring system to administer in parallel with Nagios.)

Step 1: install the HP management agents on your servers. They can be downloaded here; choose your OS and then look under "Software - System Management." For instance, the ESX 3.5 agents are here.

Step 2: download the SNMP MIBs from here (choose the HP-UX/Linux download link to get a tarball of MIBs). These should be installed in the usual SNMP MIB location on your Nagios server; on my Solaris boxes this is /etc/sma/snmp/mibs/.
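
As a quick sanity check that the MIBs ended up where net-snmp (and therefore check_snmp) can find them, try resolving one of the Compaq OIDs by name. The tarball name below is just a placeholder for whatever HP's download is actually called:

# cd /etc/sma/snmp/mibs/
# tar xf /tmp/hp-mibs.tar
# snmptranslate -m ALL -On CPQHLTH-MIB::cpqHeMibCondition.0

If snmptranslate prints a numeric OID instead of an error, the MIBs are in place.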

Step 3: define an appropriate SNMP monitoring command in Nagios, so that you can specify OK and warning conditions. Note that the SNMP community string is $USER3$ in my environment; change this as appropriate. The first argument is the OID, the second is the OK value, and the third is the range spanning the OK and warning values. (You'll need to look through the MIBs to find these OIDs and values.)


define command {
command_name check-snmp-okwarn
command_line $USER1$/check_snmp -H $HOSTADDRESS$ -P 2c -C $USER3$ -o $ARG1$ -w $ARG2$ -c $ARG3$
}
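
Before wiring this into services, it's worth testing the plugin by hand against one server. The plugin path, hostname, and community string below are examples; substitute your own:

# /opt/nagios/libexec/check_snmp -H dl380-01 -P 2c -C public -o CPQHLTH-MIB::cpqHeMibCondition.0 -w 2 -c 2:3

A returned value of 2 is OK; 3 falls outside the -w value but inside the -c range, so it comes back WARNING; anything worse goes CRITICAL.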


Step 4: define the Nagios services to monitor. You will probably need to change the 'use important-service' and 'hostgroup_name proliant' directives to suit your own service template and hostgroup names. My HP servers use both Ethernet ports, so I have services defined for both; if you have more or fewer interfaces to watch, add or drop services accordingly, changing the instance number component of the OID (the last element) to match.
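
If you aren't sure how many physical adapter instances a given box exposes, an snmpwalk of the adapter condition column will show you; the hostname and community string here are placeholders:

# snmpwalk -m ALL -v 2c -c public dl380-01 CPQNIC-MIB::cpqNicIfPhysAdapterCondition

Each row that comes back corresponds to an instance number you can monitor.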


define service {
use important-service
hostgroup_name proliant
service_description General system health
servicegroups system
check_command check-snmp-okwarn!CPQSTDEQ-MIB::cpqSeMibCondition.0!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description RAID controller health
servicegroups system
check_command check-snmp-okwarn!CPQIDA-MIB::cpqDaCntlrCondition.0!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description Server health
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeMibCondition.0!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description Thermal condition
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeThermalCondition.0!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description System fans
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeThermalSystemFanStatus.0!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description CPU fans
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeThermalCpuFanStatus.0!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description Power supplies
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeFltTolPwrSupplyCondition.0!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description Ethernet 0
servicegroups system
check_command check-snmp-okwarn!CPQNIC-MIB::cpqNicIfPhysAdapterCondition.1!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description Ethernet 1
servicegroups system
check_command check-snmp-okwarn!CPQNIC-MIB::cpqNicIfPhysAdapterCondition.2!2!2:3
}

define service {
use important-service
hostgroup_name proliant
service_description Advanced Memory Protection
servicegroups system
check_command check-snmp-okwarn!CPQHLTH-MIB::cpqHeResilientMemCondition.0!2!2:3
}


This configuration works well for me, but there's undoubtedly room for improvement by monitoring additional OIDs. Feel free to leave a comment with your suggestions.

EDITED 8/9/2009: added Advanced Memory Protection service to watch for DIMM ECC errors. Guess where I got the idea for that. :)

Tuesday, November 11, 2008

Fishworks!

So Sun has finally rolled out their Fishworks systems. I've been waiting for these to show up for a long time. The idea is perfect and long overdue: combining Solaris storage features like ZFS and the new CIFS stack with commodity x86 hardware, and of course Solaris itself, and packaging that all into a real storage system. As a Solaris enthusiast, I'm happy to see all this stuff in a mature form. As a programmer with an interest in performance analysis and visualization, I'm insanely jealous of the analytics interface. :) But most of all, as a system administrator, I'm squarely in the target audience for this thing as a product. Not only do I run a heterogeneous NAS-centric environment, I run it on a NetApp filer cluster. And although the Fishworks team has avoided saying so in so many words (on their blogs, anyway), these boxes are designed to be NetApp killers.

As it happens, at $WORK, our current storage infrastructure is about due for a refresh, and we've been looking very hard at the NetApp FAS2050. This has occasioned a lot of performance analysis (or attempts at it) on the existing filers, and a lot of thinking about what I'd like to see in a storage system and how NetApp's lineup relates to that. The new Sun gear isn't quite a no-brainer, but it means I'll be doing a close evaluation of both systems and seeing how they stack up. There are a lot of questions to answer, on both sides. For instance: how much of a difference does it make having all the software features be part of the appliance? What kind of performance difference is there between a storage box with a 2.2 GHz Mobile Celeron and one with quad-core Opterons? (Sorry NetApp.) On the other side of the coin, how hard will it be to give up SnapDrive, SnapManager for SQL, and all the other nice NetApp integration bits? And how much pain will there be, going from a nice mature platform like Data ONTAP to Version 1.0 of the Fishworks stack?

I've already got a VMware image of the Fishworks system up and running, so I'll use that for an initial point of comparison. Obviously it won't tell me much about performance, but I'll be able to see what the software stack is like, and test just about everything but clustering. Stay tuned.

Friday, November 9, 2007

Solaris Notes I: serial console

My wiki at work has a very long SolarisNotes page. We have three X4100s running Solaris that I've already got a number of secondary apps running on, and I'm preparing to move some serious production apps onto them. I've done a lot with Solaris before, but never on x86, never with Solaris 10, and never with stuff like Live Upgrade, (organized) patching, etc. So I've learned a lot doing all this, and have made all kinds of notes.

The Solaris documentation is mostly pretty good, so it's a little surprising to me that I've had to assemble such a collection of notes. But some things just aren't brought together in an organized, task-centric way. If you want to find out how to perform a specific operation, that's no problem. But if you don't know what operation to perform in the first place, it might take a while.

A case in point is setting up a Solaris x86 install for serial console management. I'm used to Solaris on SPARC, where the console is always ttya unless somebody's been silly enough to plug in a keyboard. Want to manage the boot process? Halt the machine? ttya is it. Whatever else is wrong, you can always send a BREAK and drop into the PROM monitor. Surely buying a Sun box running Solaris would reproduce that experience, even if it did happen to have an AMD processor instead of SPARC, right?

Wrong. With the first X4100 I unpacked, I ended up hard-resetting it a couple of times before I realized that it actually was booting; it just wasn't printing anything on the serial console, and I needed to load up all the Java KVM redirection gunk just to log in and configure the network. (Never mind my surprise when I realized that the first Galaxy boxes we bought actually had regular PC BIOSes; I had really been hoping for a nice OpenBoot environment or something...)

My preference for a serial console isn't just my curmudgeon side showing through, either. It's actually a substantial pain to navigate through the ILOM interface and launch the Java KVM thing, and its bandwidth requirements are obscene. It's one thing when I've got a DS3 between my MacBook and the servers, and quite another when I'm on flaky coffee-shop WiFi. In the latter case, the ILOM redirection is Not Gonna Happen. But a serial console is only 9600 bps at source, and even the slowest of connections can handle that. Plus a single SSH hop is a lot quicker to establish.

In practice, you have to change an "eeprom" setting (which, as far as I can tell, doesn't actually go into an EEPROM at all, but into some file in the mysterious "boot archive"), the SMF console-login service configuration, and the GRUB configuration, like so:


# eeprom console=ttya
# svccfg -s console-login setprop ttymon/terminal_type = "vt102"


And then in /boot/grub/menu.lst:

Uncomment these lines:


serial --unit=0 --speed=9600
terminal serial


Comment this out:

#splashimage /boot/grub/splash.xpm.gz


And change the Solaris failsafe entry too:

kernel /boot/multiboot kernel/unix -s -B console=ttya
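
Put together, the serial-console-related parts of menu.lst end up looking roughly like this; the failsafe entry's title and module line will vary between releases, so treat this as the shape of the file rather than something to paste verbatim:

serial --unit=0 --speed=9600
terminal serial
#splashimage /boot/grub/splash.xpm.gz

title Solaris failsafe
kernel /boot/multiboot kernel/unix -s -B console=ttya
module /boot/x86.miniroot-safe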


What a mess! And, of course, this wasn't documented in any organized way under "how to make your shiny new Sun box have a serial console like god intended;" I had to rely on comments in the GRUB menu file, scattered bits of documentation, and other people's blog posts (long-forgotten, I'm afraid).

I can appreciate that the idea of a single console device is probably baked pretty deep into Unix, but it really would be good if we could have active consoles on both the graphical display AND the serial port.

Moreover, a separate chapter in the System Administration Guide or something on setting up your system for remote access and management would be incredibly helpful. This was a piece of cake compared to the things I had to kludge up for monitoring the built-in RAID and hardware sensors, and the documentation is basically useless for solving these kinds of problems.

Browser windows as reading list considered harmful

Well, for the second time this month, Safari has crashed out from under me. For some people, this wouldn't be so bad. Unfortunately, I tend to keep 20 or 30 browser windows open at a time; these include my reading queue, all the articles I've found interesting but not actually gotten around to reading yet. Losing my assortment of Ruby documentation windows, last week's leftover Google search result pages, and the odd YouTube video is no big deal. Losing my reading queue is profoundly irritating.

I solved this once, in a previous job doing a lot of Apple Event automation, with Hamish Sanderson's excellent appscript library. It only took a little Python cron job to dump out the URLs of all my browser windows into a file, or if Safari wasn't running, launch it and load them in. In practice, though, it meant that I'd end up with two-month-old articles stuffed into the end of the Dock, and Safari eating half of my PowerBook's memory.
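
A minimal sketch of that kind of job, assuming appscript's usual app()/make() calls; this isn't the original script, and the queue path is made up:

#!/usr/bin/env python
# Sketch of a reading-queue saver/restorer for Safari using appscript.
# Not the original script; the appscript calls are from memory, so adjust as needed.
from appscript import app, k

QUEUE = '/Users/me/.reading-queue'   # made-up location for the URL dump

safari = app('Safari')

if safari.isrunning():
    # Safari is up: dump every open tab's URL to the queue file.
    urls = []
    for window_urls in safari.windows.tabs.URL():   # one URL list per window
        urls.extend(u for u in window_urls if u)
    f = open(QUEUE, 'w')
    f.write('\n'.join(urls) + '\n')
    f.close()
else:
    # Safari isn't running: launch it and reopen everything we saved.
    try:
        saved = [line.strip() for line in open(QUEUE) if line.strip()]
    except IOError:
        saved = []
    for url in saved:
        safari.make(new=k.document, with_properties={k.URL: url})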

I've periodically tried using del.icio.us more aggressively, adding stuff I mean to read as well as stuff I've read and found valuable, but it just seems unnatural; I can't categorize things well if I haven't read them, and they go straight out of sight, out of mind.

So then what? Ideally, this would be integrated with my feed reader; maybe Google Reader's starred item scheme would work? But Google Reader is too slow, and I sure like NetNewsWire Lite. Maybe it's time for a little side project, to invent a reading-list manager?

About Me

Your humble correspondent is a programmer and system administrator in Seattle. If I can be bothered to post regularly, I'll try to write about interesting technology-related topics. Interesting CS papers, project ideas, Solaris trivia, Ruby notes, and other things could pop up. Who knows?