Informatics uses a Nagios monitoring system to keep track of the health and current status of many of its services and servers. One of the components of the Nagios environment is lcfg-hwmon. This periodically performs some routine health checks on servers and services, then sends the results to Nagios, which alerts administrators if necessary.
lcfg-hwmon checks several things:
- It warns if any disks are mounted read-only. The SL6 version excluded device names starting with /dev/loop. The SL7 version also ignores anything mounted on /sys/fs/cgroup. This check can be disabled by giving the hwmon.readonlydisk resource a false value.
- If it finds RAID controller software it uses this to get the current status of the machine’s RAID arrays, then it reports any problems found. It knows about MegaRAID SAS, HP P410, Dell H200 and SAS 5i/R RAID types. Note that the software does not attempt to find out what sort of RAID controller the machine actually has, so the administrator has to be sure to use the correct RAID header when configuring the machine.
- It warns if any of the machine’s power supply units has failed or is indicating a problem.
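To illustrate the read-only disk check described in the list above, here is a minimal Python sketch of the idea. It is not the lcfg-hwmon implementation: the function name `find_readonly_mounts`, its parameters and the sample data are all hypothetical, and it assumes input in the `/proc/mounts` format (device, mount point, filesystem type, options). Treating "mounted on /sys/fs/cgroup" as covering both that directory and anything beneath it is also an assumption.

```python
def find_readonly_mounts(mounts_text,
                         skip_device_prefixes=("/dev/loop",),
                         skip_mount_prefixes=("/sys/fs/cgroup",)):
    """Return (device, mount_point) pairs for read-only mounts.

    Hypothetical helper for illustration only; the exclusion lists
    mirror the SL7 behaviour described in the text.
    """
    readonly = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fstype, options = fields[:4]
        # "ro" must be a whole option, so "errors=remount-ro" does not match
        if "ro" not in options.split(","):
            continue
        # str.startswith accepts a tuple of prefixes
        if device.startswith(skip_device_prefixes):
            continue
        # Assumption: skip the excluded directory itself and anything below it
        if any(mount_point == prefix or mount_point.startswith(prefix + "/")
               for prefix in skip_mount_prefixes):
            continue
        readonly.append((device, mount_point))
    return readonly


# Made-up sample in /proc/mounts format
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb1 /data ext4 ro,relatime 0 0
/dev/loop0 /mnt/iso iso9660 ro,relatime 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
"""

for device, mount_point in find_readonly_mounts(SAMPLE):
    print(f"WARNING: {device} is mounted read-only on {mount_point}")
```

On a real machine the same function could be given the contents of `/proc/mounts` instead of the sample string. Passing empty tuples for the two skip parameters would report every read-only mount.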
As well as the periodic checks from cron, a manual status check can be done from the shell using the --stdout option. If the --stdout option is omitted, the result is sent to Nagios rather than displayed on the shell output.
Version 0.21.2-1 of lcfg-hwmon functions properly on SL7 servers. In Informatics, any server using
lcfg-hwmon. Other LCFG servers can get it like this:
In related news, the RAID controller software for the RAID types listed above is now installed on SL7 servers by the same headers as on SL6. The HP P410 RAID software has been renamed to hpssacli but seems otherwise identical. The Dell H200 software sas2ircu has gained a few extra commands (SETOFFLINE, SETONLINE, ALTBOOTIR, ALTBOOTENC) but the existing commands seem unchanged. The other varieties of RAID software are much as they were on SL6.