All entries for Wednesday 15 August 2007
August 15, 2007
Monitoring Solaris FMD with Nagios
Solaris 10 has a very cool subsystem called FMD, the Fault Management Daemon. It monitors your server, looks for failed hardware, and reconfigures the system to work around it. So if a CPU fails, FMD will offline it; if a memory DIMM reports an unacceptable number of errors, FMD will mark the bad segments off-limits. This is very cool, but it’s all done terribly quietly, and I wanted FMD to tell me when it spotted a problem. This blog entry describes using fmadm to report on any known faults and email the details, which is handy, but I wanted something that would hook into our Nagios management server. Here’s how I did it:
1) Add a crontab entry to run fmadm every 10 minutes and dump the output into a file:
0,10,20,30,40,50 * * * * /usr/sbin/fmadm faulty > /tmp/fmadm.out
This needs to run as root, or as a user who has the SYS_CONFIG privilege. SYS_CONFIG appears to be fairly wide-ranging, so I didn’t want to grant it to the nagios user. That’s a bit of a shame really: running fmadm inside the Nagios check itself would have made things much simpler and also more timely.
2) Next, the Nagios check. I’m using a local script, which is invoked by NRPE (available from Blastwave). I’ve written mine in Ruby, but it would be easy to port to Perl or even sh:
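One thing to watch with a plain `>` redirect is that the shell truncates /tmp/fmadm.out before fmadm starts, so a check that runs at just that moment can see an empty or half-written file. A minimal sketch of one way around this, assuming a small wrapper script run from root's crontab in place of the bare command: write the output to a temporary file next to the target, then rename it into place. The helper names `write_atomically` and `refresh_status` are made up for illustration and are not part of my setup.

```ruby
# Hypothetical helper: write content to a temporary file in the same
# directory, then rename it over the target. A rename within one
# filesystem is atomic, so readers see either the old or the new file.
def write_atomically(path, content)
  tmp = "#{path}.#{Process.pid}.tmp"
  File.open(tmp, 'w') { |f| f.write(content) }
  File.rename(tmp, path)
end

# Hypothetical helper: run a command, capture its standard output,
# and store it atomically at path.
def refresh_status(command, path)
  output = `#{command}`
  write_atomically(path, output)
end
```

A wrapper built on this would end with a call such as `refresh_status('/usr/sbin/fmadm faulty', '/tmp/fmadm.out')`, and the crontab entry would run the wrapper instead of redirecting fmadm directly.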
#!/opt/csw/bin/ruby
# Nagios check for the output of "fmadm faulty", as written to a file by cron.
# Usage: check_fmadm.rb /path/to/fmadm.out

# Each helper prints a status line and exits with the matching Nagios code.
def exit_unknown(message)
  puts message
  exit 3
end

def exit_critical(message)
  puts message
  exit 2
end

def exit_warning(message)
  puts message
  exit 1
end

def exit_ok(message)
  puts message
  exit 0
end

fname = ARGV[0]
exit_unknown("Usage: #{$0} <status file>") if fname.nil?
exit_warning("Status file #{fname} not found") unless File.exist?(fname)

now = Time.now
mtime = File.mtime(fname)

# How many minutes old can the status file be?
# cron refreshes it every 10 minutes, so warn once it is more than 11 old.
warn_mins = 11
crit_mins = 15
warn_threshold = now - (warn_mins * 60)
crit_threshold = now - (crit_mins * 60)

if mtime < crit_threshold
  exit_critical("Status file #{fname} more than #{crit_mins} mins old")
end
if mtime < warn_threshold
  exit_warning("Status file #{fname} more than #{warn_mins} mins old")
end

# This check treats exactly two lines of output as "no faults":
# fewer means the file is not valid, more means faults are listed.
text = File.readlines(fname)
if text.length < 2
  exit_warning("Status file #{fname} does not appear valid")
elsif text.length == 2
  exit_ok("Status file #{fname} is up to date")
else
  exit_critical("Hardware faults found: check #{fname} for details")
end
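To make the decision logic easier to see (and to test without touching the filesystem), here is a rough sketch of the same rules as a pure function. The name `classify` and its parameters are hypothetical, not part of the script above; it takes the lines of the status file and the file's age in seconds, and returns a Nagios exit code with a message.

```ruby
# Hypothetical helper mirroring the script's rules.
# Nagios exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL.
def classify(lines, age_secs, warn_mins = 11, crit_mins = 15)
  # Staleness is checked first, most severe threshold first.
  return [2, "status file more than #{crit_mins} mins old"] if age_secs > crit_mins * 60
  return [1, "status file more than #{warn_mins} mins old"] if age_secs > warn_mins * 60

  # Exactly two lines are treated as "no faults", as in the script.
  return [1, 'status file does not appear valid'] if lines.length < 2
  return [0, 'no faults reported'] if lines.length == 2

  [2, 'hardware faults found']
end
```

For example, `classify(lines, Time.now - File.mtime(fname))` reproduces the script's verdict for a given file. Note that the two-line rule is an assumption baked into my check; if your fmadm prints something different when there are no faults, adjust the comparison accordingly.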
Now just configure a check in nrpe.conf:
command[check_fmadm]=/usr/local/nagios-plugins/check_fmadm.rb /tmp/fmadm.out
And you’re good to go! Add an NRPE check on your Nagios server, and you’ll get notified whenever fmadm reports a hardware fault.
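For the server side, something along these lines should work. This is only a sketch: it assumes the check_nrpe plugin is installed under the directory your `$USER1$` macro points to, that you have a `generic-service` template as in the sample Nagios configuration, and that the host is defined under the placeholder name `solaris-host`. If you already have a `check_nrpe` command defined, skip the first block.

```
define command {
    command_name    check_nrpe
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}

define service {
    use                     generic-service
    host_name               solaris-host
    service_description     FMD hardware faults
    check_command           check_nrpe!check_fmadm
}
```

The `check_fmadm` argument must match the command name configured in nrpe.conf on the Solaris box.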
Chris May