LCFG autoreboot

November 10, 2017

One of the tools which saves us an enormous amount of effort is our LCFG autoreboot component. This watches for reboot requests from other LCFG components and then schedules the reboot for the required date/time.

One nice feature is that it can automatically choose a reboot time from within a specified range. This means that when many similarly configured machines schedule a reboot they don’t all go at the same time, which could overload services that are accessed at boot time.

Recently it was reported that the component has problems parsing single-digit times, which results in the reboot not being scheduled. Amazingly this bug has lain undetected for approximately 4 years, during which time a significant chunk of machines have presumably been failing to reboot on time. As well as resolving that bug I also took the chance to fix a minor issue related to a misunderstanding of the shutdown command options, which resulted in the default delay time being set to 3600 minutes instead of 3600 seconds. Thankfully we change that delay locally so it never had any direct impact on our machines.
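To make the idea concrete, here is a minimal sketch of picking a random reboot time from a window while tolerating single-digit times. It is written in Python purely for illustration (the component itself is Perl), and the "H:MM" input format and the names parse_clock_time and pick_reboot_time are assumptions of mine, not the component's actual resource format or API.

```python
import random
import re
from datetime import datetime, timedelta

# Assumed input format: "H:MM" or "HH:MM". One or two digits are accepted
# for each field so that "2:05" parses just like "02:05".
_TIME_RE = re.compile(r"^(\d{1,2}):(\d{1,2})$")


def parse_clock_time(text):
    """Return (hour, minute) for a clock time such as '2:05' or '02:05'."""
    match = _TIME_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unparseable time: {text!r}")
    hour, minute = int(match.group(1)), int(match.group(2))
    if hour > 23 or minute > 59:
        raise ValueError(f"time out of range: {text!r}")
    return hour, minute


def pick_reboot_time(day, earliest, latest, rng=random):
    """Choose a random minute between two clock times on the given day."""
    start_h, start_m = parse_clock_time(earliest)
    end_h, end_m = parse_clock_time(latest)
    start = day.replace(hour=start_h, minute=start_m, second=0, microsecond=0)
    end = day.replace(hour=end_h, minute=end_m, second=0, microsecond=0)
    if end < start:
        end += timedelta(days=1)  # the window crosses midnight
    window_minutes = int((end - start).total_seconds() // 60)
    # Each machine draws its own offset, so reboots spread across the window.
    return start + timedelta(minutes=rng.randint(0, window_minutes))
```

Calling pick_reboot_time(datetime(2017, 11, 10), "2:00", "4:30") on each machine would give each one its own time somewhere in that two and a half hour window, rather than all of them rebooting at 2:00.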

Whilst fixing those two bugs I discovered another issue related to sending reboot notifications via email: if that failed for any reason the reboot would not be scheduled. The component will now report the error but continue. This is a common problem we see in LCFG components, where problems are handled with the Fail method (which logs and then exits) instead of just logging with Error. This is particularly a problem since an exit with a non-zero code is not the same as dying, which can be caught with the use of the eval function. Since a call to Fail ends the current process immediately, this can lead to a particularly annoying situation where a failure in a Configure method results in a failure in the Start method. This means that a component might never reach the started state, a situation from which it is difficult to recover. We are slowly working our way through eradicating this issue from core components but it’s going to take a while.
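The "report the error but continue" behaviour can be sketched roughly as follows. This is a Python analogue rather than the component's Perl code, and notify_then_schedule and its two callback arguments are hypothetical names used only for illustration; the log.error call plays the role that the Error method plays in an LCFG component.

```python
import logging

log = logging.getLogger("autoreboot-sketch")


def notify_then_schedule(send_notification, schedule_reboot):
    """Attempt the notification, but schedule the reboot regardless.

    Both arguments are hypothetical callables standing in for the
    email step and the scheduling step.
    """
    notified = True
    try:
        send_notification()
    except Exception as exc:
        # Log and carry on: a failed email must not block the reboot.
        log.error("reboot notification failed: %s", exc)
        notified = False
    schedule_reboot()
    return notified
```

The important design choice is that the notification step can only ever raise something catchable. If it instead terminated the process outright, as Fail does, the caller would have no opportunity to recover and the scheduling step would never run.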

Recently we have had feedback from some of our users that the reboot notification message was not especially informative. The issue is related to us incorporating the message into the message of the day, which sometimes leads to it being left lying around out of date for some time. The message would typically say something like “A reboot has been scheduled for 2am on Thursday”, which is fine as long as the message goes away once the reboot has been completed. To resolve this I took advantage of a feature I added some years ago which passes the reboot time as a Perl DateTime object (named shutdown_dt) into the message template. With a little bit of thought I came up with the following, which uses the Template Toolkit Date plugin:

[%- USE date -%]
[%- USE wrap -%]
[%- FILTER head = wrap(70, '*** ', '*** ') -%]
This machine ([% host.VALUE %]) requires a reboot as important updates are available.
[%- END %]

[% IF enforcing.VALUE -%]
[%- FILTER body = wrap(70, ' ', ' ') -%]
It will be unavailable for approximately 15 minutes beginning at
[% date.format( time = shutdown_dt.VALUE.epoch,
                format = '%H:%M %A %e %B %Y',
                locale = 'en_GB') %].
Connected users will be warned [% shutdown_delay.VALUE %] minutes beforehand.
[%- END %]

[% END -%]

This also uses the wrap plugin to ensure that the lines are neatly arranged and the header section has a “*** ” prefix for each line to help grab the attention of the users.
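For readers who don't have Template Toolkit to hand, the following Python sketch approximates what the template produces. It is only an analogue: render_notice and its parameters are names I have made up, textwrap treats the width of 70 slightly differently from the wrap plugin, the day of the month is not space-padded as %e would do, and the weekday and month names rely on Python's default C locale rather than en_GB.

```python
import textwrap
from datetime import datetime


def render_notice(host, shutdown_dt, delay_minutes, enforcing=True):
    """Build a reboot notice similar in shape to the template's output."""
    head = textwrap.fill(
        f"This machine ({host}) requires a reboot as important updates "
        "are available.",
        width=70,
        initial_indent="*** ",     # prefix for the first header line
        subsequent_indent="*** ",  # and for every wrapped header line
    )
    if not enforcing:
        return head
    # Full date, so a stale message is obviously stale.
    when = f"{shutdown_dt:%H:%M %A} {shutdown_dt.day} {shutdown_dt:%B %Y}"
    body = textwrap.fill(
        "It will be unavailable for approximately 15 minutes beginning at "
        f"{when}. Connected users will be warned {delay_minutes} minutes "
        "beforehand.",
        width=70,
        initial_indent=" ",
        subsequent_indent=" ",
    )
    return head + "\n\n" + body


if __name__ == "__main__":
    # Hypothetical host name and reboot time, for illustration only.
    print(render_notice("example.host", datetime(2017, 11, 16, 2, 0), 5))
```

Because the message now carries the full date rather than just “2am on Thursday”, a copy left behind in the message of the day after the reboot is easy to recognise as out of date.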