All entries for Wednesday 15 August 2007

August 15, 2007

Monitoring Solaris FMD with nagios

Solaris 10 has a very cool subsystem called FMD, the Fault Management Daemon. It monitors your server, looks for failed hardware, and reconfigures the system to avoid it. So if a CPU fails, FMD will offline it; if a memory DIMM reports an unacceptable number of errors, FMD will mark the bad segments off-limits. This is great, but it’s all done terribly quietly, and I wanted FMD to tell me when it spotted a problem. This blog entry describes using fmadm to report on any known faults and email the details, which is handy, but I wanted something that would hook into our nagios management server. Here’s how I did it:

1) Add a crontab entry to run fmadm every 10 minutes and dump the output into a file:

0,10,20,30,40,50 * * * * /usr/sbin/fmadm faulty > /tmp/fmadm.out

This needs to run as root, or as a user who has the SYS_CONFIG privilege. SYS_CONFIG appears to be fairly wide-ranging, so I didn’t want to grant it to the nagios user. That’s a bit of a shame really: running fmadm inside the nagios check itself would have been much simpler, and more timely too.

Next step, the nagios check. I’m using a local script, which is invoked by nrpe (available from blastwave). I’ve written mine in Ruby, but it would be easy to port to Perl or even sh:

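#!/usr/bin/env ruby
# Nagios check for the output of "fmadm faulty", as written to a file by cron.
# Usage: check_fmadm.rb /tmp/fmadm.out

# Each helper prints the status line and exits with the matching Nagios code.
def exit_unknown(message)
  puts message
  exit 3
end

def exit_critical(message)
  puts message
  exit 2
end

def exit_warning(message)
  puts message
  exit 1
end

def exit_ok(message)
  puts message
  exit 0
end

# The status file is passed as the first argument (see the nrpe command below).
fname = ARGV[0]
exit_unknown("Usage: #{$0} <status file>") if fname.nil?

if !File.exist?(fname)
  exit_warning("Status file #{fname} not found")
end

now = Time.now
mtime = File.mtime(fname)

# How many minutes old can the check file be?
# We're running the cron job every 10 mins, so allow no more than 11.
warn_mins = 11
crit_mins = 15
warn_threshold = now - (warn_mins * 60)
crit_threshold = now - (crit_mins * 60)
if mtime < crit_threshold
  exit_critical("Marker file #{fname} more than #{crit_mins} mins old")
end
if mtime < warn_threshold
  exit_warning("Marker file #{fname} more than #{warn_mins} mins old")
end

# With no faults, the file is expected to hold exactly two (header) lines.
text = File.readlines(fname)
if text.length < 2
  exit_warning("Status file does not appear valid")
elsif text.length == 2
  exit_ok("Marker file #{fname} is up to date")
else
  exit_critical("Hardware faults found: check #{fname} for details")
end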
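
If you want to try the thresholds without waiting for cron, the decision logic can be pulled out into a small function. This is only a sketch: fmadm_status is a hypothetical helper that is not part of the script above, and the sample lines are placeholders rather than real fmadm output. It assumes, as the check does, that a fault-free status file holds exactly two lines.

```ruby
# Sketch only: fmadm_status is a hypothetical helper, not part of the original check.
# Returns [nagios_exit_code, message]; 0 = OK, 1 = WARNING, 2 = CRITICAL.
def fmadm_status(lines, mtime, now, warn_mins = 11, crit_mins = 15)
  age = now - mtime # Time - Time gives the age in seconds
  return [2, "Status file more than #{crit_mins} mins old"] if age > crit_mins * 60
  return [1, "Status file more than #{warn_mins} mins old"] if age > warn_mins * 60
  return [1, "Status file does not appear valid"] if lines.length < 2
  # Assumption: exactly two lines means "header only, no faults".
  return [0, "No faults reported"] if lines.length == 2
  [2, "Hardware faults found"]
end

now = Time.now
header = ["placeholder header line 1\n", "placeholder header line 2\n"]

p fmadm_status(header, now - 60, now)                   # fresh file, header only
p fmadm_status(header + ["a fault\n"], now - 60, now)   # fresh file, extra lines
p fmadm_status(header, now - 12 * 60, now)              # file is 12 minutes old
```

Keeping the staleness tests ahead of the content test means a dead cron job is reported before an out-of-date "no faults" result can be trusted.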

Now just configure a check in nrpe.conf:

command[check_fmadm]=/usr/local/nagios-plugins/check_fmadm.rb /tmp/fmadm.out

and you’re good to go! Add an nrpe check on your nagios server, and you’ll get notified whenever fmadm detects a hardware failure.
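
For the server side, something along these lines should work. It is a sketch that assumes the usual check_nrpe plugin is installed; the host name solaris-box and the generic-service template are placeholders, so substitute whatever your own configuration uses (and skip the command definition if you already have one).

```
# On the nagios server. host_name and the "use" template are placeholders.
define command {
    command_name    check_nrpe
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}

define service {
    use                     generic-service
    host_name               solaris-box
    service_description     FMD hardware faults
    check_command           check_nrpe!check_fmadm
}
```

The argument after the `!` has to match the name inside `command[...]` on the Solaris side, which is check_fmadm here.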
