LCFG Profile Security Project

April 18, 2018

Having completed the work to add support for GSSAPI auth to the client for fetching profiles, I’ve now moved on to the LCFG installer. Currently the installer attempts to fetch the LCFG profile for the machine just prior to the (I)nstall, (D)ebug, (S)hell, (P)atchup, (R)eboot prompt. That fetching is done by calling the client component install method, which in turn calls rdxprof in one-shot mode. Having previously ported the client component to the Perl LCFG::Component framework I had hoped this would “just work”, but it turned out that a number of bootstrapping issues were only being avoided previously due to many hardwired paths in the shell ngeneric code. The Perl framework takes a different approach and prefers to use the LCFG sysinfo resources wherever possible. This improves platform independence and maintainability but presents a bootstrapping problem at the first stage of the install, when we have not yet downloaded any profile and thus have no sysinfo resources. I wasn’t keen on performing major surgery on the Perl component framework so I decided that the simplest solution to this problem was to get the installer to call rdxprof directly. With this change the installer worked again but still required support for Kerberos authentication.

Adding support for Kerberos authentication has been done in a fairly simple way. I’ve added support for two new install kernel command line options: lcfg.kauth and lcfg.realm. When the lcfg.kauth option is specified the user is prompted to enter their principal name and the kinit program is run to do the authentication. The user may specify the full principal name. If the realm is not specified then either the value of the lcfg.realm option or the upper-cased domain name is used (e.g. @LCFG.ORG). If the authentication fails then the user is prompted to re-enter the principal name (which defaults to the previously entered string) and password. Once the Kerberos authentication has succeeded the credentials will be automatically used by rdxprof when required for fetching the LCFG profile.
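
The realm defaulting described above can be sketched in a few lines. This is only an illustration in Python, not the installer code: the function name resolve_principal and its arguments are hypothetical, and the assumption that lcfg.realm takes precedence over the upper-cased domain name is one reading of the description.

```python
def resolve_principal(entered, realm_option=None, domain=None):
    """Return a full Kerberos principal for the name the user typed.

    realm_option stands in for the lcfg.realm kernel option and domain
    for the DNS domain of the machine; both names are hypothetical.
    """
    name = entered.strip()
    if not name:
        raise ValueError("empty principal name")
    # A realm typed by the user is always kept as given.
    if "@" in name:
        return name
    # Assumption: prefer lcfg.realm, then fall back to the upper-cased domain.
    realm = realm_option or (domain.upper() if domain else None)
    if realm is None:
        raise ValueError("no realm available for principal")
    return f"{name}@{realm}"
```

With a domain of lcfg.org and no lcfg.realm option, a bare name such as alice would become alice@LCFG.ORG, which is the string that would then be handed to kinit.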


LCFG Profile Security Project

March 28, 2018

This week I have been working on providing a way to configure the LCFG client profile fetcher via client component resources. In particular some sites will need to be able to specify SSL options (e.g. ca_path, verify_hostname) and also be able to set parameters for the authentication modules (e.g. gssapi might need the keytab file path to be specified). By default profile fetching will work for most sites without any additional configuration. Furthermore, as these parameters are most easily expressed in terms of list and hash data structures, I’ve decided to only support setting them via a configuration file. Although it is currently configured entirely through the command line, the rdxprof daemon supports loading configuration from a YAML file. I’ve altered the SetOptions method so that when it encounters a fetch entry in the configuration data hash it will pass this through to the LCFG::Client::ProfileFetcher instance via a configure method which knows how to handle the various options.

The current LCFG client component is written in bash which makes generating a config file in YAML more tricky than I would like. As we have a longstanding plan to rewrite all the core LCFG components into Perl this seemed like a good opportunity to get on with that job. I’ve previously been putting off this particular rewrite since the component is rather old and very complex. It manages the starting, stopping and signalling of the rdxprof daemon and as such it has a lot of code for handling PID files and checking for the liveness of processes. Given that we no longer support platforms such as SL6 and older this situation can be massively improved by switching to systemd for the management of rdxprof. I’ve introduced /usr/lib/systemd/system/rdxprof.service and /etc/sysconfig/rdxprof files which can be used by the component to control the daemon. To properly verify that the rdxprof daemon has successfully started the component creates a null callback and waits for it to be processed. I’ve moved the implementation of that into the LCFG::Client module itself so that the details are nicely hidden behind an API.

This is all implemented in perl-LCFG-Client version 4.3.4 and lcfg-client version 4.0.3. To make it easier to test I’ve added a dice/options/lcfg-client.h header. If the DICE_OPTIONS_LCFG_CLIENT_GSSAPI macro is defined then a new keytab will be created and the LCFG client will use it for authentication. The LCFG server is not yet quite ready for me to enable the use of gssapi but hopefully will be in the next couple of days.

Enabling gssapi for an LCFG client will be done something like this:


!kerberos.keys mADD(lcfg)
kerberos.keytab_lcfg /etc/lcfg/client.keytab
kerberos.keytabuid_lcfg root
kerberos.keytabgid_lcfg lcfg

!client.url mSET(https://lcfg1.inf.ed.ac.uk/profiles https://lcfg2.inf.ed.ac.uk/profiles)

!client.fetch_auth mSET(gssapi)
!client.fetch_params_gssapi mSET(keytab)
!client.fetch_param_gssapi_keytab mSET(<%kerberos.keytab_lcfg%>)


LCFG Profile Security Project

March 21, 2018

After improving support for Apache authentication in the LCFG server I have moved onto the client this week. The bulk of the work has been focused on the creation of a new LCFG::Client::Fetcher module which encapsulates all the details associated with fetching XML profiles from various sources. As well as improving the authentication support I am taking the chance to overhaul a chunk of code which has not seen much love in either of the v3 or v4 projects. One particular issue is that currently the handling of the list of profile sources is spread around the client libraries. This means that even a small change can involve locating and altering many separate small pieces of code. This general work also includes adding support for IPv6, enhancing SSL security and making the code much more maintainable.

One big change in approach I’ve made is that the lists of local file and remote web server sources are now handled in a unified way, where previously they were dealt with completely separately. The new Fetcher module has a single list of source objects (either LCFG::Client::Fetch::Source::File or LCFG::Client::Fetch::Source::Remote) which come from the value of the client.url resource. One advantage here is that it is now trivial to add an entirely new type of source (e.g. rsync or ldap); anything with an LWP::Protocol module is a possibility. When configured to use both local files and remote sources the client has always preferred local files where possible. This behaviour is retained by using a priority system, with file sources being guaranteed to have a higher default priority than any remote source.
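
To illustrate the priority idea, here is a minimal sketch in Python rather than the Perl of the real modules. The Source class, the numeric priority values and the rule used to recognise a file source are all assumptions made for the example; only the ordering guarantee (file sources before remote ones) comes from the description above.

```python
from dataclasses import dataclass

# Hypothetical default priorities: any file source outranks any remote one.
FILE_PRIORITY = 100
REMOTE_PRIORITY = 0


@dataclass
class Source:
    url: str
    priority: int


def make_source(url):
    """Classify one entry from a client.url style list."""
    is_file = url.startswith("file:") or url.startswith("/")
    return Source(url, FILE_PRIORITY if is_file else REMOTE_PRIORITY)


def ordered_sources(urls):
    """Return sources with the highest priority first.

    sorted() is stable, so sources with equal priority keep the order
    in which they were configured.
    """
    sources = [make_source(u) for u in urls]
    return sorted(sources, key=lambda s: s.priority, reverse=True)
```

A fetcher built this way would simply walk the ordered list and stop at the first source that yields a profile.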

The other part of recent development work is the addition of support for different authentication mechanisms. This is supported via modules in the LCFG::Client::Fetch::Auth namespace, currently we have modules for basic (username/password) and gssapi authentication. As with the new source modules this approach means it is easy to support alternative mechanisms, including site-specific needs which might not be appropriate for merging into the upstream code base. Before making a request the Fetcher will call the relevant authentication module to initialise the environment. I am also working on supporting multiple mechanisms so that if one fails the next will be tried until one succeeds.
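
The "try each mechanism until one succeeds" behaviour could look roughly like the following. This is a Python sketch under stated assumptions, not the LCFG::Client::Fetch::Auth code: AuthError, first_working_auth and the two stand-in initialisers are hypothetical names.

```python
class AuthError(Exception):
    """Raised by a mechanism that cannot initialise its environment."""


def first_working_auth(mechanisms):
    """Try each (name, initialise) pair in order and return the first
    name whose initialise() call completes without an AuthError."""
    errors = []
    for name, initialise in mechanisms:
        try:
            initialise()
        except AuthError as exc:
            # Remember why this mechanism failed and move on to the next.
            errors.append(f"{name}: {exc}")
            continue
        return name
    raise AuthError("all mechanisms failed: " + "; ".join(errors))


# Stand-ins for real mechanisms, used only to demonstrate the fallback.
def gssapi_without_keytab():
    raise AuthError("no keytab")


def basic_ok():
    return None
```

Collecting the individual failure messages means that when nothing works the final error still says why each mechanism was rejected.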

Most of the code for the client is now in place and I am working on documentation for the various new modules. Once that is done I need to consider how the necessary authentication information can make it from LCFG resources into the rdxprof application via the LCFG client component. Although I would rather not make such a big change it might be that I finally need to bite the bullet and rewrite the client component from bash into Perl.


LCFG Profile Security Project

March 13, 2018

I have recently begun work on the Review Security of LCFG Profile Access project. So far I have mostly been considering the various aspects of the project with the aim of producing a list of ideas which can be discussed at some future Development Meeting.

The first aspect of the project I have looked at in more depth is the LCFG server, which has support for generating Apache .htaccess files. These can be used to limit access to each individual LCFG profile when fetched over http/https. We have traditionally supported both http and https protocols and relied on IP addresses to limit access but would like to move over to https-only along with using GSSAPI authentication. The LCFG client would then use a keytab to get the necessary credentials. To help with this change I have introduced a new schema (4) for the profile component and made some modifications to the LCFG server code which make it easier to use the Apache mod_auth_gssapi module. In particular there is a new auth_tmpl_$ resource which allows the selection of a different template (e.g. the apache_gssapi.tt template which is provided in the package) which more closely meets local requirements. There are also auth_vars_$ and auth_val_$_$ resources which can be used to specify any additional information that is required. For example:

!profile.version_profile mSET(4) /* not yet the default */
!profile.auth          mADD(ssl)
!profile.auth_tmpl_ssl mSET(apache_gssapi.tt)
!profile.acl_ssl 
   mSET(host/<%profile.node%>.<%profile.domain%>@<%kerberos.realm%>)
!profile.acl_ssl       mADD(@admin)
!profile.auth_vars_ssl mADD(groupfile)
!profile.auth_val_ssl_groupfile mSET(/etc/httpd/conf.d/lcfgadmins.group)

which results in the LCFG server generating the following .htaccess file:

AuthType GSSAPI
AuthName "lcfg@foo.inf.ed.ac.uk"
GssapiBasicAuth Off
GssapiBasicAuthMech krb5
GssapiSSLonly On
GssapiCredStore "keytab:/etc/httpd.keytab"
AuthGroupFile "/etc/httpd/conf.d/lcfgadmins.group"
<RequireAny>
  Require user "host/foo.inf.ed.ac.uk@INF.ED.AC.UK"
  Require group "admin"
</RequireAny>

The profile.acl_ssl resource holds a list of users and groups (which have an ‘@’ prefix). In a real deployment it might make more sense to use an lcfg/ principal rather than host/. The groupfile support is provided by the mod_authz_groupfile module which needs to be loaded.
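
The mapping from the access list to the Require lines is simple enough to show directly. The real server does this in a Template Toolkit template; the Python function below (require_lines is a hypothetical name) is only a sketch of the same rule, where an ‘@’ prefix marks a group and everything else is treated as a user.

```python
def require_lines(acl):
    """Turn a list of ACL entries into Apache Require directives."""
    lines = []
    for entry in acl:
        if entry.startswith("@"):
            # Strip the '@' marker to get the group name.
            lines.append(f'Require group "{entry[1:]}"')
        else:
            lines.append(f'Require user "{entry}"')
    return lines
```

Given the two entries from the example above, this yields the same pair of Require lines that appear inside the RequireAny block.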

I have tested this with curl and it works as required. The LCFG client doesn’t currently have support for doing a kinit (or launching something like k5start in the background) prior to fetching the profile so it isn’t yet possible to actively use this authentication method.


User management improvements

November 23, 2017

Management of local users and groups (i.e. those in /etc/passwd and /etc/group) is done using the LCFG auth component. One feature that has always been lacking is the ability to create a home directory where necessary and populate it from a skeleton directory (typically this is /etc/skel). The result of this feature being missing is that it is necessary to add a whole bunch of additional file component resources to create the home directory and that still doesn’t provide support for a skeleton directory.

Recently I needed something along those lines so I’ve taken the chance to add a couple of new resources: create_home_$ and skel_dir_$. When the create_home resource is set to true for a user the home directory will be created by the component and the permissions set appropriately. By default the directory will be populated from /etc/skel but the skeleton directory could be anything. This means it is now possible to set up a machine with a set of identically initialised local users.

For example:

auth.pw_name_cephadmin           cephadmin
auth.pw_uid_cephadmin            755
auth.pw_gid_cephadmin            755
auth.pw_gecos_cephadmin          Ceph Admin User
auth.pw_dir_cephadmin            /var/lib/cephadmin
auth.pw_shell_cephadmin          /bin/bash
auth.create_home_cephadmin       yes /* Ensure home directory exists */

auth.gr_name_cephadmin           cephadmin
auth.gr_gid_cephadmin            755
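
As a rough picture of what creating a home directory from a skeleton involves, here is a Python sketch. It is not the auth component code: the function name, the 0700 mode and the decision to leave an existing directory untouched are assumptions for illustration.

```python
import os
import shutil


def create_home(home, skel="/etc/skel", uid=None, gid=None, mode=0o700):
    """Create home from skel if it does not already exist.

    Returns True if the directory was created, False if it was left alone.
    The mode default is an assumption, not taken from the component.
    """
    if os.path.isdir(home):
        return False  # never touch an existing home directory
    if os.path.isdir(skel):
        # copytree creates any missing parent directories itself.
        shutil.copytree(skel, home, symlinks=True)
    else:
        os.makedirs(home)
    os.chmod(home, mode)
    if uid is not None and gid is not None:
        # Changing ownership normally requires root privileges.
        for root, dirs, files in os.walk(home):
            for name in dirs + files:
                os.chown(os.path.join(root, name), uid, gid,
                         follow_symlinks=False)
        os.chown(home, uid, gid)
    return True
```

For the cephadmin example this would correspond to something like create_home("/var/lib/cephadmin", uid=755, gid=755) run as root.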

LCFG Core: resource types

November 21, 2017

The recent round of LCFG client testing using real LCFG profiles from both Informatics and the wider community has shown that the code is now in very good shape and we’re close to being able to deploy to a larger group of machines. One issue that this testing has uncovered is related to how the type of a resource is specified in a schema. A type in the LCFG world really just controls what regular expression is used to validate the resource value. Various type annotations can be used (e.g. %integer, %boolean or %string) to limit the permitted values. If there is no annotation the resource is assumed to be a tag list, and this has clearly caught out a few component authors. For example:

@foo %integer
foo

@bar %boolean
bar

@baz
baz

@quux sub1_$ sub2_$
quux
sub1_$
sub2_$

Both of the last two examples (baz and quux) are tag lists, the first just does not have any associated sub-resources.

The compiler should not allow anything but valid tag names (which match /^[a-zA-Z0-9_]+$/) in a tag list resource but due to some inadequacies it currently permits pretty much anything. The new core code is a lot stricter and thus the v4 client will refuse to accept a profile if it contains invalid tag lists. Bugs have been filed against a few components (bug#1016 and bug#1017). It’s very satisfying to see the new code helping us improve the quality of our configurations.
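
The stricter check amounts to applying that regular expression to every tag in the list. A minimal Python sketch follows; it assumes a tag list is a whitespace-separated string, and invalid_tags is a hypothetical helper rather than part of the core libraries.

```python
import re

# The tag name pattern quoted above.
TAG_RE = re.compile(r"[a-zA-Z0-9_]+")


def invalid_tags(value):
    """Return the tags in a tag list value that are not valid tag names."""
    return [tag for tag in value.split() if not TAG_RE.fullmatch(tag)]
```

A client applying this rule would reject any profile where the returned list is not empty, for example a tag list containing eth-1 or a.b.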


yum cache and disk space

November 15, 2017

At a recent LCFG Deployers meeting we discussed a problem with yum not fully cleaning the cache directory even when the yum clean all command is used. This turns out to be related to how the cache directory path is defined in /etc/yum.conf as /var/cache/yum/$basearch/$releasever. As the release version changes with each minor platform release (e.g. 7.3, 7.4) the old directories can become abandoned. At first this might seem like a trivial problem but these cache directories can be huge. We have seen instances where gigabytes of disk space have been used and cannot be simply reclaimed. To help fix this problem I’ve added a new purgecache method to the LCFG yum component. This takes a sledgehammer approach of just deleting everything in the /var/cache/yum/ directory. This can be run manually whenever required or called regularly using something like cron. In Informatics it is now configured to run weekly on a Sunday like this:

!cron.objects             mADD(yum_purge)
cron.object_yum_purge     yum
cron.method_yum_purge     purgecache
cron.run_yum_purge        AUTOMINS AUTOHOUR * * sun
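
The sledgehammer approach is easy to picture as code. The following is a Python sketch only, not the purgecache method of the yum component; the function name is hypothetical and the directory must be passed explicitly so the example cannot delete anything by accident.

```python
import os
import shutil


def purge_cache(cache_dir):
    """Delete everything inside cache_dir but keep the directory itself.

    Returns the sorted names of the top-level entries that were removed.
    """
    removed = []
    for entry in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, entry)
        if os.path.isdir(path) and not os.path.islink(path):
            # Removes abandoned per-release trees such as x86_64/7.3.
            shutil.rmtree(path)
        else:
            os.remove(path)
        removed.append(entry)
    return removed
```

Because it works on whatever is found in the directory, it does not matter which $releasever values have accumulated over time.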

LCFG autoreboot

November 10, 2017

One of the tools which saves us an enormous amount of effort is our LCFG autoreboot component. This watches for reboot requests from other LCFG components and then schedules the reboot for the required date/time.

One nice feature is that it can automatically choose a reboot time from within a specified range. This means that when many similarly configured machines schedule a reboot they don’t all go at the same time, which could result in the overloading of services that are accessed at boot time. Recently it was reported that the component has problems parsing single-digit times, which results in the reboot not being scheduled. Amazingly this bug has lain undetected for approximately 4 years, during which time a significant chunk of machines have presumably been failing to reboot on time. As well as resolving that bug I also took the chance to fix a minor issue related to a misunderstanding of the shutdown command options which resulted in the default delay time being set for 3600 minutes instead of 3600 seconds. Thankfully we change that delay locally so it never had any direct impact on our machines.
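
Choosing a time within a range, while accepting single-digit hours and minutes, might be sketched like this in Python. The H:MM style time strings, the function names and the handling of a window that crosses midnight are assumptions for the example and are not taken from the autoreboot component.

```python
import random
from datetime import datetime, timedelta


def parse_hhmm(text):
    """Parse times such as '2:5', '02:05' or '2' into (hour, minute)."""
    hours, _, minutes = text.strip().partition(":")
    h, m = int(hours), int(minutes or 0)
    if not (0 <= h <= 23 and 0 <= m <= 59):
        raise ValueError(f"invalid time: {text!r}")
    return h, m


def pick_reboot_time(day, start, end, rng=random):
    """Pick a random minute between start and end (inclusive) on day."""
    sh, sm = parse_hhmm(start)
    eh, em = parse_hhmm(end)
    begin = day.replace(hour=sh, minute=sm, second=0, microsecond=0)
    finish = day.replace(hour=eh, minute=em, second=0, microsecond=0)
    if finish < begin:
        finish += timedelta(days=1)  # the window crosses midnight
    span = int((finish - begin).total_seconds() // 60)
    return begin + timedelta(minutes=rng.randint(0, span))
```

Spreading machines across the window this way is what keeps a large group of similarly configured hosts from all hitting boot-time services at once.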

Whilst fixing those two bugs I discovered another issue related to sending reboot notifications via email: if that failed for any reason the reboot would not be scheduled. The component will now report the error but continue. This is a common problem we see in LCFG components where problems are handled with the Fail method (which logs and then exits) instead of just logging with Error. This is particularly a problem since an exit with non-zero code is not the same as dying, which can be caught with the use of the eval function. Since a call to Fail ends the current process immediately this can lead to a particularly annoying situation where a failure in a Configure method results in a failure in the Start method. This means that a component might never reach the started state, a situation from which it is difficult to recover. We are slowly working our way through eradicating this issue from core components but it’s going to take a while.

Recently we have had feedback from some of our users that the reboot notification message was not especially informative. The issue is related to us incorporating the message into the message of the day, which sometimes leads to it being left lying around out-of-date for some time. The message would typically say something like “A reboot has been scheduled for 2am on Thursday”, which is fine as long as the message goes away once the reboot has been completed. To resolve this I took advantage of a feature I added some years ago which passes the reboot time as a Perl DateTime object (named shutdown_dt) into the message template. With a little bit of thought I came up with the following which uses the Template Toolkit Date plugin:


[%- USE date -%]
[%- USE wrap -%]
[%- FILTER head = wrap(70, '*** ', '*** ') -%]
This machine ([% host.VALUE %]) requires a reboot as important updates are available.
[%- END %]

[% IF enforcing.VALUE -%]
[%- FILTER body = wrap(70, ' ', ' ') -%]
It will be unavailable for approximately 15 minutes beginning at
[% date.format( time = shutdown_dt.VALUE.epoch,
format = '%H:%M %A %e %B %Y',
locale = 'en_GB') %].
Connected users will be warned [% shutdown_delay.VALUE %] minutes beforehand.
[%- END %]

[% END -%]

This also uses the wrap plugin to ensure that the lines are neatly arranged and the header section has a “*** ” prefix for each line to help grab the attention of the users.


LCFG Core: Resource import and export

November 7, 2017

As part of porting the LCFG client to the new core libraries the qxprof and sxprof utilities have been updated. This has led to the development of a new high-level LCFG::Client::Resources Perl library which can be used to import, merge and export resources in all the various required forms. The intention is that eventually all code which uses the LCFG::Resources Perl library (in particular the LCFG::Component framework) will be updated to use this new library. The new library provides a very similar set of functionality and will appear familiar but I’ve taken the opportunity to improve some of the more awkward parts. Here’s a simple example taken from the perldoc:

# Load client resources from DB
my $res1 = LCFG::Client::Resources::LoadProfile("mynode","client");

# Import client resources from environment variables
my $res2 = LCFG::Client::Resources::Import("client");

# Merge two sets of resources
my $res3 = LCFG::Client::Resources::Merge( $res1, $res2 );

# Save the result as a status file
LCFG::Client::Resources::SaveState( "client", $res3 );

The library can import resources from: Berkeley DB, status files, override files, shell environment and explicit resource specification strings. It can export resources as status files, in a form that can be evaluated in the shell environment and also in various terse and verbose forms (e.g. the output styles for qxprof).

The LCFG::Resources library provides access to resources via a reference to a hash which is structured something like:

{
   'sysinfo' => {
                 'os_id_full' => {
                                  'DERIVE' => '/var/lcfg/conf/server/releases/develop/core/include/lcfg/defaults/sysinfo.h:42',
                                  'VALUE' => 'sl74',
                                  'TYPE' => undef,
                                  'CONTEXT' => undef
                                 },
                 'path_lcfgconf' => {
                                  'DERIVE' => '/var/lcfg/conf/server/releases/develop/core/include/lcfg/defaults/sysinfo.h:100',
                                  'VALUE' => '/var/lcfg/conf',
                                  'TYPE' => undef,
                                  'CONTEXT' => undef
                                 },
                }
}

The top level key is the component name, the second level is the resource name and the third level is the name of the resource attribute (e.g. VALUE or TYPE).

The new LCFG::Client::Resources library takes a similar approach with the top level key being the component name but the value for that key is a reference to a LCFG::Profile::Component object. Resource objects can then be accessed by using the find_resource method which returns a reference to a LCFG::Resource object. For example:

my $res = LCFG::Client::Resources::LoadProfile("mynode","sysinfo");

my $sysinfo = $res->{sysinfo};

my $os_id_full = $sysinfo->find_resource('os_id_full');

say $os_id_full->value;

Users of the qxprof and sxprof utilities should not notice any differences but hopefully the changes will be appreciated by those developing new code.


Testing the new LCFG core : Part 1

May 17, 2017

The project to rework the core LCFG code is rattling along and has reached the point where some full-scale testing is needed. The first step is to check whether the new XML parser can actually just parse all of our LCFG profiles. At this stage I’m not interested in whether it can do anything useful with the data once loaded; I just want to see how it handles a large number of different profiles.

Firstly a source of XML profiles is needed, so I grabbed a complete local copy from our lcfg server:


rsync -av -e ssh lcfg:/var/lcfg/conf/server/web/profiles/ /disk/scratch/profiles/

I then ran the XML parser on every profile I could find:


find /disk/scratch/profiles/ -name '*.xml' | xargs \
perl -MLCFG::Profile -wE \
'for (@ARGV) { eval { LCFG::Profile->new_from_xml($_) }; print $@ if $@ }'

Initially I hit upon bug#971 which is a genuine bug in the schema for the gridengine component. As noted previously, this was found because the new libraries are much stricter about what is considered to be valid data. With that bug resolved I can now parse all 1525 LCFG XML profiles for Informatics.