Hardware monitoring and RAID on SL7

Informatics uses a Nagios monitoring system to keep track of the health and current status of many of its services and servers. One of the components of the Nagios environment is lcfg-hwmon. This periodically performs some routine health checks on servers and services, then sends the results to Nagios, which alerts administrators if necessary. lcfg-hwmon checks several things:

  • It warns if any disks are mounted read-only. The SL6 version excluded device names starting with /media/ and /dev/loop. The SL7 version also ignores anything mounted on /sys/fs/cgroup. This check can be disabled by giving the hwmon.readonlydisk resource a false value (see the example after this list).
  • If it finds RAID controller software it uses this to get the current status of the machine’s RAID arrays, then it reports any problems found. It knows about MegaRAID SAS, HP P410, Dell H200 and SAS 5i/R RAID types. Note that the software does not attempt to find out what sort of RAID controller the machine actually has, so the administrator has to be sure to use the correct RAID header when configuring the machine.
  • It warns if any of the machine’s power supply units has failed or is indicating a problem.
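As an example of the first check above, the read-only disk warning can be switched off for a particular machine by overriding the resource in its profile or in a suitable header. This is a minimal sketch assuming the usual LCFG mutation syntax:

!hwmon.readonlydisk mSET(false)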

As well as the periodic checks from cron, a manual status check can be done with:

/usr/sbin/check_hwmon --stdout

If the --stdout option is omitted, the result is sent to Nagios rather than displayed on the shell output.

Version 0.21.2-1 of lcfg-hwmon functions properly on SL7 servers. In Informatics, any server using dice/options/server*.h gets lcfg-hwmon. Other LCFG servers can get it like this:

#include <lcfg/options/hwmon.h>

In related news, the RAID controller software for the RAID types listed above is now installed on SL7 servers by the same headers as on SL6. The HP P410 RAID software has changed its name from hpacucli to hpssacli but seems otherwise identical. The Dell H200 software sas2ircu has gained a few extra commands (SETOFFLINE, SETONLINE, ALTBOOTIR, ALTBOOTENC) but the existing commands seem unchanged. The other varieties of RAID software are much as they were on SL6.
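For anyone wanting to query the arrays by hand, the renamed HP tool and sas2ircu can be used to check controller status. These invocations are purely illustrative and the exact options may vary between tool versions:

hpssacli ctrl all show status
sas2ircu 0 STATUS

(The 0 in the second command is the controller index, as reported by sas2ircu LIST.)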

Network device naming

During our original project to port LCFG to SL7 we were only really considering desktops, which typically have a single network interface. To get things working quickly we decided to stick with the “legacy” network device naming scheme, which gives us interfaces named eth0, eth1, eth2, etc. This works just fine with a single interface, since we only ever need access to eth0, but as we have moved on to adding network interface bonding for servers we have discovered some problems.

Many of our servers have two controllers, each of which has two devices; for maximum reliability we wish to bond over one device from each controller. Traditionally we have done this by naming the first device on the first controller eth0 and the first device on the second controller eth1. We have found that with the legacy support on SL7 this is not possible: the devices always come out as eth0 and eth2 (eth1 being the second device on the first controller), because the ability to rename interfaces based on MAC address does not appear to be working correctly. Due to the way we have configured bonding in LCFG, for simplicity we really would like the two bonded interfaces to continue to be named eth0 and eth1.

To resolve this problem we have decided that it is now time to convert to the “modern” naming scheme as described in the Red Hat network guide. The interfaces can then be aliased as eth0 and eth1 after they have been configured with their “consistent” names. This appears to work as desired but requires some changes to the LCFG headers, and we will be working through this transition over the next few weeks. The complete change to the default approach will probably have to wait until the SL7.2 upgrade, to ensure we don’t break anything.

The first step will be to move the “legacy” support out of the lcfg-level header (lcfg/defaults/network.h) into the ed-level header. This will not have any impact on most users but makes it possible to easily enable and disable the naming schemes for testing purposes. New headers have been provided – lcfg/options/network-legacy-names.h and lcfg/options/network-modern-names.h – to make it easy to swap between the two naming schemes. Once we are confident that the modern approach is reliable we will update the various hardware support headers at the lcfg level so that it works for the various server models we have in Informatics.
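For example, an individual machine can be switched to the new scheme for testing by pulling in the relevant option header, in the same way as the hwmon header above; swapping in network-legacy-names.h reverts to the old behaviour:

#include <lcfg/options/network-modern-names.h>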

LCFG apacheconf component

As discussed at the LCFG Annual Review meeting held in December, we are planning to start work fairly soon on updating the apacheconf component for Apache 2.4. We will also be generally refactoring the whole component. There is a wiki page which holds a collection of ideas for new features that would be nice to have and bugs that should be fixed. I’m currently doing some exploratory work to decide how to approach this, so this week is the last chance to make suggestions. Please either add them to the wiki page (tagged with your name, please) or email them to me directly.