Moving services to SL7

Although we don’t always blog about it, the MPU has been busy lately. One project which has been taking a great deal of our time is the SL7 server upgrade project, the effort to move all of the various services we run from the now-outdated version 6 of Scientific Linux to version 7. The MPU, one of the School’s five computing units, has its fair share of services to port to SL7, and here’s a summary of the work we did for this project during the last third of 2016 (that is, September to December):

We upgraded these MPU services to SL7:

  • The virtual machine hosting service. This covers eight servers on three sites, hosting some 180 guest VMs, most of which were kept running seamlessly through the upgrades.
  • The PXE service and the package cache service. These share two servers in the Forum and the Tower.
  • The PackageForge package-building service, covering two build servers and a master server. The build servers’ performance was improved by moving them to VMs. Before the master server could be upgraded, the PackageForge software needed enhancement: an upgrade to PostgreSQL 9.6; a move of the package data from YAML files in the filesystem to JSON stored in the database, which opens the way for a future version to present the build results far better in the user interface; and various code updates, which have made the web interface noticeably more responsive.
  • The export packages server. This was moved to a new VM.
  • The LCFG slave servers – the two main slaves, one test slave, one DIY DICE slave and two inf-level release-testing slaves, one more than before (we now monitor the inf level on both SL6 and SL7). The two main slave servers were substantially sped up by increasing their memory to 8GB, so that all LCFG profile information can be held in memory at once.
  • The site mirrors packages server, where we keep our own copies of various software repositories covering Scientific Linux, EPEL, PostgreSQL and others.
  • The LCFG website and the LCFG wiki. We installed and configured a substantially updated version of the TWiki software.
  • BuzzSaw and LogCabin (which organise and serve the login logs) were moved to the new SL7 loghost. This work included the update of Django packages and the building of some dependencies.
  • The LCFG disaster relief server, which will take over our configuration infrastructure should some calamity befall the Forum. This server hosts a complex mix of services, so sorting out its Apache config for SL7 helped to prepare the way for the LCFG master upgrade to come.

In addition, substantial work was done towards the upgrade of these services:

  • The computing help service.
  • The LCFG bug tracking service.
  • The LCFG master:
    • Replacement of Apache mod_auth_kerb with mod_auth_gssapi;
    • Porting of mod_user_rewrite to the new LCFG build tools;
    • Reworking of the rfe packaging to produce a separate rfe-server sub-package and to introduce systemd support;
    • A complete rewrite of the rfe component in Perl with Template Toolkit;
    • A move of the web view of the LCFG repositories from the outdated websvn to the more capable viewvc, with a new LCFG component to manage its configuration;
    • The updating of all components’ defaults packages to their current SL7 versions.

Work on this project has continued into 2017, but more of that in a future post.

Hardware monitoring and RAID on SL7

Informatics uses a Nagios monitoring system to keep track of the health and current status of many of its services and servers. One of the components of the Nagios environment is lcfg-hwmon. This periodically performs some routine health checks on servers and services, then sends the results to Nagios, which alerts administrators if necessary. lcfg-hwmon checks several things:

  • It warns if any disks are mounted read-only. The SL6 version excluded device names starting /media/ and /dev/loop. The SL7 version also ignores anything mounted on /sys/fs/cgroup. This check can be disabled by giving the hwmon.readonlydisk resource a false value.
  • If it finds RAID controller software it uses this to get the current status of the machine’s RAID arrays, then it reports any problems found. It knows about MegaRAID SAS, HP P410, Dell H200 and SAS 5i/R RAID types. Note that the software does not attempt to find out what sort of RAID controller the machine actually has, so the administrator has to be sure to use the correct RAID header when configuring the machine.
  • It warns if any of the machine’s power supply units has failed or is indicating a problem.
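
As a rough illustration of the kind of checks involved, the fragments below use standard tools. This is a sketch only, not the lcfg-hwmon code itself, and the exact commands it runs may differ; the MegaCli path in particular varies between installations.

# Report read-only mounts, ignoring /dev/loop devices and mounts under
# /media/ and /sys/fs/cgroup (sketch of the first check above).
awk '$4 ~ /(^|,)ro(,|$)/ {print $1, $2}' /proc/mounts |
  grep -Ev '^/dev/loop| /media/| /sys/fs/cgroup'

# Ask the RAID controller software for array status (MegaRAID SAS example).
/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL | grep -i '^State'

# One way of checking power supply sensors on IPMI-capable hardware.
ipmitool sdr type 'Power Supply'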

As well as the periodic checks from cron, a manual status check can be done with:

/usr/sbin/check_hwmon --stdout

If the --stdout option is omitted, the result is sent to Nagios rather than being displayed on the terminal.
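
For the periodic runs, a cron entry of this general shape would do the job. This is purely illustrative: the real schedule (and any wrapper script) is managed by LCFG and may well differ.

# Hypothetical /etc/cron.d entry: run the checks every 30 minutes and send
# the results to Nagios (no --stdout, so nothing appears on the terminal).
*/30 * * * *  root  /usr/sbin/check_hwmon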

Version 0.21.2-1 of lcfg-hwmon functions properly on SL7 servers. In Informatics, any server using dice/options/server*.h gets lcfg-hwmon. Other LCFG servers can get it like this:

#include <lcfg/options/hwmon.h>

In related news, the RAID controller software for the RAID types listed above is now installed on SL7 servers by the same headers as on SL6. The HP P410 RAID software has changed its name from hpacucli to hpssacli but seems otherwise identical. The Dell H200 software sas2ircu has gained a few extra commands (SETOFFLINE, SETONLINE, ALTBOOTIR, ALTBOOTENC) but the existing commands seem unchanged. The other varieties of RAID software are much as they were on SL6.
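
For illustration, the status queries look much the same under the new names. These are standard invocations of the vendor tools rather than anything specific to our configuration.

# HP P410: hpssacli accepts the same commands that hpacucli did.
hpssacli controller all show status

# Dell H200: list the controllers, then query the status of the first one.
sas2ircu LIST
sas2ircu 0 STATUS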

Progress so far on SL7 server base

Every year or two we migrate all of DICE to a newer operating system version, so that we can keep up with technology advances and security fixes. Most recently we’ve been moving it from Scientific Linux 6 to Scientific Linux 7.

When migrating DICE to a new platform, we make the move in several stages. First we need our configuration environment, LCFG, fully working on the new OS (see for instance Work involved in porting DICE to SL7); then we work on the desktop computing environment and the research and teaching software it needs; after that come the tools and environment for servers. We’re tackling the last of those stages now, the SL7 server platform project. We have several hundred servers, hosting both a variety of services and a range of behind-the-scenes support functions.

So far we’ve tested and passed these things:

  • Server networking features. Setting NM_CONTROLLED=no in the network interface config files allows us to use the old networking scripts to set up bonding, bridging and VLANs (a sketch of this sort of config follows this list). We’ll take a look at doing this with NetworkManager later on, since the old networking scripts will probably be removed at some point, but in the meantime we have access to the networking functionality which our servers need.
  • IPMI. We use it for our monitoring needs and for Serial Over LAN (remote consoles and remote power control).
  • Our standard SL7 disk partition layout.
  • The basic active checks for our Nagios monitoring setup.
  • We’ve installed the software needed by the Nagios passive check which monitors network bonding, and it’s now working correctly.
  • The hwmon passive check does a variety of hardware health tests. These ones have been tested and work on SL7: read-only disk mounts; MegaSAS RAID; dual power supply redundancy; LSI SAS 5i/R RAID.
  • RAID controller software and LCFG configuration headers for MegaSAS RAID and for LSI SAS 5i/R RAID.
  • The toohot overheating emergency shutdown tool.
  • Fibre Channel Multipath. The ability to use multiple paths through the FC fabric increases the dependability of our storage area network facilities.
  • LVM. This storage abstraction layer is used for storage space for the VMs on our virtualisation servers.
  • We have rethought the DNS configuration for SL7. Instead of using only localhost for DNS lookups, SL7 servers will be configured to query the full set of DNS servers.
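
To illustrate the networking item above, here is a minimal sketch of the sort of interface config involved, with NM_CONTROLLED=no so that the old network scripts manage a bond. The device names, addresses and bonding options are examples only; on DICE machines the equivalent settings are generated from LCFG resources rather than edited by hand.

# /etc/sysconfig/network-scripts/ifcfg-bond0 (sketch)
DEVICE=bond0
TYPE=Bond
BONDING_OPTS="mode=active-backup miimon=100"
BOOTPROTO=none
IPADDR=192.168.1.10
PREFIX=24
ONBOOT=yes
NM_CONTROLLED=no

# /etc/sysconfig/network-scripts/ifcfg-em1 (sketch) - one of the bond's slaves
DEVICE=em1
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes
NM_CONTROLLED=no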
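
A few of the other items above (IPMI, Fibre Channel multipath and the revised DNS setup) can be spot-checked from a shell. These are standard commands rather than anything DICE-specific, and the host and user names are placeholders.

# IPMI: query power status, or open a Serial Over LAN console, via a BMC.
ipmitool -I lanplus -H bmc.example.org -U admin chassis power status
ipmitool -I lanplus -H bmc.example.org -U admin sol activate

# Fibre Channel multipath: list the paths currently known to multipathd.
multipath -ll

# DNS: check that the resolver now lists the full set of DNS servers
# rather than just localhost.
cat /etc/resolv.conf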

We’re currently working on support for other RAID types, on LCFG apacheconf and on other aspects of Fibre Channel functionality.

A new MPU blog

This is the new blog of the Managed Platform Unit. The MPU is one of the organisational units of the computing staff in the School of Informatics; it’s responsible for the Linux platform which forms the basis of DICE. It also maintains the tools needed by that platform, principally LCFG. See here for links to all of the units.

We’ll use this blog to keep you up to date on work which is shared between the MPU members. Initially that’ll include our work to develop an SL7 server platform.

We’ll still be blogging individually too (Alastair’s ramblings, Stephen’s work ramblings, cc:) and of course we’ll make announcements on the Computing Systems blog.