August 15, 2007

Monitoring Solaris FMD with Nagios

Solaris 10 has a very cool subsystem called FMD, the Fault Management Daemon. It monitors your server for failed hardware and reconfigures the system to work around it. So if a CPU fails, FMD will offline it; if a memory DIMM reports an unacceptable number of errors, FMD will mark the bad segments off-limits. This is very cool, but it’s all done terribly quietly, and I wanted FMD to tell me when it spotted a problem. This blog entry describes using fmadm to report on any known faults and email the details, which is cool, but I wanted something that would hook into our Nagios management server. Here’s how I did it:

1) Add a crontab entry to run fmadm every 10 minutes and dump the output into a file:

0,10,20,30,40,50 * * * * /usr/sbin/fmadm faulty > /tmp/fmadm.out

This needs to run as root, or as a user who has the SYS_CONFIG privilege. SYS_CONFIG appears to be fairly wide-ranging, so I didn’t want to grant it to the nagios user. That is a bit of a shame really: it would have made things much simpler, and also more timely, if I could have run fmadm inside the nagios check.
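
One thing to watch with a plain > redirect: the shell truncates /tmp/fmadm.out before fmadm has written anything, so a check that runs at just that moment can see an empty file and raise a spurious warning. A minimal sketch of one way around that, assuming a small Ruby wrapper run from cron instead of the bare command. The script name fmadm_dump.rb and the helper dump_atomically are hypothetical, not part of the original setup:

```ruby
#!/opt/csw/bin/ruby
# Hypothetical cron wrapper: run a command and replace the status file in one
# step, so a reader never sees a truncated or half-written file.

# Runs `command`, writes its stdout to a temporary file in the same directory
# as `target`, then renames the temporary file over `target`.
def dump_atomically(command, target)
  output = `#{command}`
  tmp = "#{target}.#{Process.pid}.tmp"
  File.open(tmp, 'w') { |f| f.write(output) }
  # rename is atomic as long as tmp and target are on the same filesystem
  File.rename(tmp, target)
end

# Only act when called with both arguments, e.g. from cron:
#   fmadm_dump.rb '/usr/sbin/fmadm faulty' /tmp/fmadm.out
dump_atomically(ARGV[0], ARGV[1]) if ARGV.length == 2
```

The crontab entry would then call the wrapper with the fmadm command line and the output path as its two arguments. The sketch does not look at fmadm's exit status; if fmadm fails and prints nothing, the file ends up empty, just as it would with the plain redirect.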

Next step, the Nagios check. I’m using a local script, which is invoked by NRPE (available from Blastwave). I’ve written mine in Ruby, but it would be easy to port to Perl or even sh:

#!/opt/csw/bin/ruby
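# No extra libraries are needed: everything used below is core Ruby.
# Nagios plugin exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.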
def exit_unknown(message)
  puts message
  exit 3
end
def exit_critical(message)
  puts message
  exit 2
end
def exit_warning(message)
  puts message
  exit 1
end
def exit_ok(message)
  puts message
  exit 0
end

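fname=ARGV[0]
# Without this guard, File.exist?(nil) raises a TypeError
if fname.nil?
  exit_unknown("Usage: check_fmadm.rb <status file>")
end
if (!File.exist?(fname)) then
  exit_warning("Status file #{fname} not found")
end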
file=File.new(fname)
now = Time.new
mtime = file.mtime
# How many minutes old can the status file be?
# cron rewrites it every 10 mins, so warn once it is more than 11 mins old
warn_mins = 11
crit_mins = 15
warn_threshold = now - (warn_mins*60)
crit_threshold = now - (crit_mins * 60)
if (mtime < crit_threshold)
  exit_critical("Marker file #{fname} more than #{crit_mins} mins old")
end
if ( mtime < warn_threshold)
  exit_warning("Marker file #{fname} more than #{warn_mins} mins old")
end
text=file.readlines
# A fault-free report is expected to be just the two header lines;
# any further lines describe faults
if text.length < 2
  exit_warning("Status file does not appear valid")
elsif text.length == 2
  exit_ok("Marker file #{fname} is up to date")
else
  exit_critical("Hardware faults found: check #{fname} for details")
end
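
If you want to try the thresholds without touching real files, the decision logic can be pulled out into a function that takes the file contents and its age and returns an exit code and a message. This is only a sketch of that idea, not the script I actually run: the names classify and HEADER_LINES are made up for illustration, and it keeps the same assumption as above that a fault-free report is exactly two header lines. It also quotes the first line after the header in the critical message; what that line contains depends on your fmadm version.

```ruby
#!/opt/csw/bin/ruby
# Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumption carried over from the script above: a fault-free
# `fmadm faulty` report consists of exactly this many header lines.
HEADER_LINES = 2

# Returns [exit_code, message] for the given file lines and file age in seconds.
def classify(lines, age_secs, warn_mins = 11, crit_mins = 15)
  return [CRITICAL, "Status file more than #{crit_mins} mins old"] if age_secs > crit_mins * 60
  return [WARNING, "Status file more than #{warn_mins} mins old"] if age_secs > warn_mins * 60
  return [WARNING, 'Status file does not appear valid'] if lines.length < HEADER_LINES
  return [OK, 'No faults reported'] if lines.length == HEADER_LINES
  # Quote the first line after the header as a hint for whoever gets paged
  [CRITICAL, "Hardware faults found: #{lines[HEADER_LINES].strip}"]
end

# Wiring it up; only runs when a file name is given on the command line.
if ARGV[0]
  path = ARGV[0]
  unless File.exist?(path)
    puts "Status file #{path} not found"
    exit WARNING
  end
  code, message = classify(File.readlines(path), Time.now - File.mtime(path))
  puts message
  exit code
end
```

Keeping the exit calls out of classify means the same function can be exercised from irb or a unit test with hand-built input.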

Now just configure a check in nrpe.conf:

command[check_fmadm]=/usr/local/nagios-plugins/check_fmadm.rb /tmp/fmadm.out

and you’re good to go! Add an nrpe check on your Nagios server, and you’ll be notified whenever fmadm reports a hardware fault.
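
Before wiring the check into Nagios, it can help to run the plugin by hand and confirm that it prints a single line and exits with the status you expect. A rough sketch of a helper for that, assuming the standard Nagios meaning of exit codes 0 to 3; the names run_plugin and STATUS_NAMES are hypothetical:

```ruby
#!/opt/csw/bin/ruby
# Standard Nagios plugin exit codes and their meanings
STATUS_NAMES = { 0 => 'OK', 1 => 'WARNING', 2 => 'CRITICAL', 3 => 'UNKNOWN' }

# Runs a plugin command line and returns [status_name, first_line_of_output].
def run_plugin(command)
  output = `#{command}`
  code = $?.exitstatus
  [STATUS_NAMES.fetch(code, "INVALID(#{code})"), output.lines.first.to_s.chomp]
end

# Only runs when a command is given, e.g.:
#   try_plugin.rb /usr/local/nagios-plugins/check_fmadm.rb /tmp/fmadm.out
unless ARGV.empty?
  status, line = run_plugin(ARGV.join(' '))
  puts "#{status}: #{line}"
end
```

Anything reported as INVALID means the plugin exited with a code outside 0 to 3, which usually points at a crash in the script rather than a real check result.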


4 comments

  1. Nickus

    I’m the author of the script you linked to, and I’m actually using a solution very similar to yours in SNMP environments. But I run fmadm directly from the SNMP server, and I haven’t seen any performance impact from it. Or is the reason for the Ruby script purely cosmetic, to get nicer output?

    cheers,
    Nickus

    16 Aug 2007, 05:50

  2. Chris May

    Hi Nickus, thanks for the inspiration that your blog post provided!

    Nagios wants plugin output in the form of one line of information and an exit code, so I’d have to have this script (or something like it) either on the nagios server, or on the server being monitored. So I guess that it is kind of cosmetic, but for the benefit of Nagios rather than end users!

    I suppose I could run fmadm via snmp and then have this script use snmpget to get the results, but it seemed simpler to just dump the data into a file and read that. :-)

    16 Aug 2007, 09:22

  3. Keith Wesolowski

    If you have the ability to use SNMP, you actually should never need to run fmadm on any host. Instead you can use the FMA MIB and the agent module provided with Solaris. It supports traps on fault diagnosis as well as polling of faulted components, fault manager status, and diagnoses. See http://blogs.sun.com/wesolows/entry/a_louder_voice_for_the.

    23 Aug 2007, 17:26

  4. Chris May

    That looks pretty cool, but your blog entry suggests it’s in OpenSolaris only. Has it made it back into Solaris 10 yet?

    23 Aug 2007, 17:29

