LCFG Profile Security Project

March 13, 2018

I have recently begun work on the Review Security of LCFG Profile Access project. So far I have mostly been considering the various aspects of the project with the aim being to produce a list of ideas which can be discussed at some future Development Meeting.

The first aspect of the project I have looked at in more depth is the LCFG server which has support for generating Apache .htaccess files. These can be used to limit access to each individual LCFG profile when fetched over http/https. We have traditionally supported both http and https protocols and relied on IP addresses to limit access but would like to move over to https-only along with using GSSAPI authentication, the LCFG client would then use a keytab to get the necessary credentials. To help with this change I have introduced a new schema (4) for the profile component and made some modifications to the LCFG server code which makes it easier to use the Apache mod_auth_gssapi module. In particular there is new auth_tmpl_$ resource which allows the selection of a different template (e.g. the apache_gssapi.tt template which is provided in the package) which more closely meets local requirements. There are also auth_vars_$ and auth_val_$_$ resources which can be used to specify any additional information that is required. For example:

!profile.version_profile mSET(4) /* not yet the default */
!profile.auth          mADD(ssl)
!profile.auth_tmpl_ssl mSET(apache_gssapi.tt)
!profile.acl_ssl 
   mSET(host/<%profile.node%>.<%profile.domain%>@<%kerberos.realm%>)
!profile.acl_ssl       mADD(@admin)
!profile.auth_vars_ssl mADD(groupfile)
!profile.auth_val_ssl_groupfile mSET(/etc/httpd/conf.d/lcfgadmins.group)

which results in the the LCFG server generating the following .htaccess file:

AuthType GSSAPI
AuthName "lcfg@foo.inf.ed.ac.uk"
GssapiBasicAuth Off
GssapiBasicAuthMech krb5
GssapiSSLonly On
GssapiCredStore "keytab:/etc/httpd.keytab"
AuthGroupFile "/etc/httpd/conf.d/lcfgadmins.group"
<RequireAny>
  Require user "host/foo.inf.ed.ac.uk@INF.ED.AC.UK"
  Require group "admin"
</RequireAny>

The profile.acl_ssl resource holds a list of users and groups (which have an ‘@’ prefix). In a real deployment it might make more sense to use an lcfg/ principal rather host/. The groupfile support is provided by the mod_authz_groupfile module which needs to be loaded.

I have tested this with curl and it works as required. The LCFG client doesn’t currently have support for doing a kinit (or launching something like k5start in the background) prior to fetching the profile so it isn’t yet possible to actively use this authentication method.


Remote Desktop Project

February 28, 2018

This week I’ve been preparing the new staff XRDP service for user testing. It now has a quovadis SSL certificate and I’ve been attempting to resolve an issue with some clients presenting a warning dialogue about not trusting the certificate. According to this bug report it is necessary to include the whole trust chain in the certificate file. I’ve tried appending the contents of the .chain file without success, it’s not clear if I am missing a part of the chain, I’ll continue investigating but if we can’t easily resolve the issue we could just document what users should expect to see.

As Chris had access to a Windows machine he has managed to generate a .bmp image file for the login screen logo which actually displays correctly. I have no idea why the various Linux applications generated bad images but I’m not going to worry too much. This gives us a much more official-looking Informatics login screen which should reassure users. The image has been packaged up in an xrdp-logo-inf RPM.

I’ve also been investigating rate-limiting new connections using iptables. The standard dice iptables configuration is rather complicated so I need to speak to George about the best way to go about this.

To ensure the xrdp service only gets started once the machine is ready to handle connections I’ve modified the systemd config so that it waits for the LCFG stable target to be reached.

I’ve noticed that all the xrdp logs are being sent to the serial console. Even with just a single user that’s flooding our console logs so I’d like to get that stopped. It’s already going to local file and syslog so no more logging is really required. SEE don’t see the same problem so I wonder if it’s related to our Informatics syslog configuration.

The user documentation is now close to being complete, we even have some information on how to access the XRDP service from Android devices.


Remote Desktop Project

February 21, 2018

This week I’ve been working on the configuration for an XRDP server for Informatics staff. This will be publicised as a prototype service, the plan being to hold off replacing the NX service until Semester 2 is completed at the end of May, that avoids the potential for any disruption to teaching. The prototype service will be installed on some spare hardware which has 2 x 2.6GHz CPU, 36GB RAM and 146GB disk space, that’s not huge but should be sufficient for multiple users to be logged in simultaneously. As the staff service is likely to only ever be based on a single server I’ve decided to simplify the config by dropping the haproxy frontend, that will now only be used on the multi-host general service. To protect from DoS attacks iptables will be used to do rate-limiting. If I can work out how to get the xrdp software to log the IP address for failed logins I will also investigate using fail2ban to add firewall rules. Most of the user documentation on computing.help is now ready, I just need to add some instructions and screenshots for the Remmina client on Linux.


User management improvements

November 23, 2017

Management of local users and groups (i.e. those in /etc/passwd and /etc/group) is done using the LCFG auth component. One feature that has always been lacking is the ability to create a home directory where necessary and populate it from a skeleton directory (typically this is /etc/skel). The result of this feature being missing is that it is necessary to add a whole bunch of additional file component resources to create the home directory and that still doesn’t provide support for a skeleton directory.

Recently I needed something along those lines so I’ve taken the chance to add a couple of new resources – create_home_$ and skel_dir_$. When the create_home resource is set to true for a user the home directory will be created by the component and the permissions set appropriately. By default the directory will be populated from /etc/skel but it could be anything. This means it is now possible to setup a machine with a set of identically initialised local users.

For example:

auth.pw_name_cephadmin           cephadmin
auth.pw_uid_cephadmin            755
auth.pw_gid_cephadmin            755
auth.pw_gecos_cephadmin          Ceph Admin User
auth.pw_dir_cephadmin            /var/lib/cephadmin
auth.pw_shell_cephadmin          /bin/bash
auth.create_home_cephadmin       yes /* Ensure home directory exists */

auth.gr_name_cephadmin           cephadmin
auth.gr_gid_cephadmin            755

LCFG Core: resource types

November 21, 2017

The recent round of LCFG client testing using real LCFG profiles from both Informatics and the wider community has shown that the code is now in very good shape and we’re close to being able to deploy to a larger group of machines. One issue that this testing has uncovered is related to how the type of a resource is specified in a schema. A type in the LCFG world really just controls what regular expression is used to validate the resource value. Various type annotations can be used (e.g. %integer, %boolean or %string) to limit the permitted values, if there is no annotation it is assumed to be a tag list and this has clearly caught out a few component authors. For example:

@foo %integer
foo

@bar %boolean
bar

@baz
baz

@quux sub1_$ sub2_$
quux
sub1_$
sub2_$

Both of the last two examples (baz and quux) are tag lists, the first just does not have any associated sub-resources.

The compiler should not allow anything but valid tag names (which match /^[a-zA-Z0-9_]+$/) in a tag list resource but due to some inadequacies it currently permits pretty much anything. The new core code is a lot stricter and thus the v4 client will refuse to accept a profile if it contains invalid tag lists. Bugs have been filed against a few components (bug#1016 and bug#1017). It’s very satisfying to see the new code helping us improve the quality of our configurations.


yum cache and disk space

November 15, 2017

At a recent LCFG Deployers meeting we discussed a problem with yum not fully cleaning the cache directory even when the yum clean all command is used. This turns out to be related to how the cache directory path is defined in /etc/yum.conf as /var/cache/yum/$basearch/$releasever. As the release version changes with each minor platform release (e.g. 7.3, 7.4) the old directories can become abandoned. At first this might seem like a trivial problem but these cache directories can be huge, we have seen instances where gigabytes of disk space have been used and cannot be simply reclaimed. To help fix this problem I’ve added a new purgecache method to the LCFG yum component. This takes a sledgehammer approach of just deleting everything in the /var/cache/yum/ directory. This can be run manually whenever required or called regularly using something like cron. In Informatics it is now configured to run weekly on a Sunday like this:

!cron.objects             mADD(yum_purge)
cron.object_yum_purge     yum
cron.method_yum_purge     purgecache
cron.run_yum_purge        AUTOMINS AUTOHOUR * * sun

LCFG autoreboot

November 10, 2017

One of the tools which saves us an enormous amount of effort is our LCFG autoreboot component. This watches for reboot requests from other LCFG components and then schedules the reboot for the required date/time.

One nice feature is that it can automatically choose a reboot time from within a specified range. This means that when many similarly configured machines schedule a reboot they don’t all go at the same time which could result in the overloading of services that are accessed at boot time. Recently it was reported that the component has problems parsing single-digit times which results in the reboot not being scheduled. Amazingly this bug has lain undetected for approximately 4 years during which time a significant chunk of machines have presumably been failing to reboot on time. As well as resolving that bug I also took the chance to fix a minor issue related to a misunderstanding of the shutdown command options which resulted in the default delay time being set for 3600 minutes instead of 3600 seconds, thankfully we change that delay locally so it never had any direct impact on our machines.

Whilst fixing those two bugs I discovered another issue related to sending reboot notifications via email, if that failed for any reason the reboot would not be scheduled, the component will now report the error but continue. This is a common problem we see in LCFG components where problems are handled with the Fail method (which logs and then exits) instead of just logging with Error. This is particularly a problem since an exit with non-zero code is not the same as dieing which can be caught with the use of the eval function. Since a call to Fail ends the current process immediately this can lead to a particularly annoying situation where a failure in a Configure method results in a failure in the Start method. This means that a component might never reach the started state, a situation from which it is difficult to recover. We are slowly working our way through eradicating this issue from core components but it’s going to take a while.

Recently we have had feedback from some of our users that the reboot notification message was not especially informative. The issue is related to us incorporating the message into the message of the day which sometimes leads to it being left lieing around out-of-date for some time. The message would typically say something like “A reboot has been scheduled for 2am on Thursday”, which is fine as long as the message goes away once the reboot has been completed. To resolve this I took advantage of a feature I added some years ago which passes the reboot time as a Perl DateTime object (named shutdown_dt) into the message template. With a little bit of thought I came up with the following which uses the Template Toolkit Date plugin:


[%- USE date -%]
[%- USE wrap -%]
[%- FILTER head = wrap(70, ‘*** ‘, ‘*** ‘) -%]
This machine ([% host.VALUE %]) requires a reboot as important updates are available.
[%- END %]

[% IF enforcing.VALUE -%]
[%- FILTER body = wrap(70, ‘ ‘, ‘ ‘) -%]
It will be unavailable for approximately 15 minutes beginning at
[% date.format( time = shutdown_dt.VALUE.epoch,
format = ‘%H:%M %A %e %B %Y’,
locale = ‘en_GB’) %].
Connected users will be warned [% shutdown_delay.VALUE %] minutes beforehand.
[%- END %]

[% END -%]

This also uses the wrap plugin to ensure that the lines are neatly arranged and the header section has a “*** ” prefix for each line to help grab the attention of the users.


LCFG Core: Resource import and export

November 7, 2017

As part of porting the LCFG client to the new core libraries the qxprof and sxprof utilities have been updated. This has led to the development of a new high-level LCFG::Client::Resources Perl library which can be used to import, merge and export resources in all the various required forms. The intention is that eventually all code which uses the LCFG::Resources Perl library (in particular the LCFG::Component framework) will be updated to use this new library. The new library provides a very similar set of functionality and will appear familiar but I’ve taken the opportunity to improve some of the more awkward parts. Here’s a simple example taken from the perldoc:

# Load client resources from DB
my $res1 = LCFG::Client::Resources::LoadProfile("mynode","client");

# Import client resources from environment variables
my $res2 = LCFG::Client::Resources::Import("client");

# Merge two sets of resources
my $res3 = LCFG::Client::Resources::Merge( $res1, $res2 );

# Save the result as a status file
LCFG::Client::Resources::SaveState( "client", $res3 );

The library can import resources from: Berkeley DB, status files, override files, shell environment and explicit resource specification strings. It can export resources as status files, in a form that can be evaluated in the shell environment and also in various terse and verbose forms (e.g. the output styles for qxprof).

The LCFG::Resources library provides access to resources via a reference to a hash which is structured something like:

{
   'sysinfo' => {
                 'os_id_full' => {
                                  'DERIVE' => '/var/lcfg/conf/server/releases/develop/core/include/lcfg/defaults/sysinfo.h:42',
                                  'VALUE' => 'sl74',
                                  'TYPE' => undef,
                                  'CONTEXT' => undef
                                 },
                 'path_lcfgconf' => {
                                  'DERIVE' => '/var/lcfg/conf/server/releases/develop/core/include/lcfg/defaults/sysinfo.h:100',
                                  'VALUE' => '/var/lcfg/conf',
                                  'TYPE' => undef,
                                  'CONTEXT' => undef
                                 },
                }
}

The top level key is the component name, the second level is the resource name and the third level is the name of the resource attribute (e.g. VALUE or TYPE ).

The new LCFG::Client::Resources library takes a similar approach with the top level key being the component name but the value for that key is a reference to a LCFG::Profile::Component object. Resource objects can then be accessed by using the find_resource method which returns a reference to a LCFG::Resource object. For example:

my $res = LCFG::Client::Resources::LoadProfile("mynode","sysinfo");

my $sysinfo = $res->{sysinfo};

my $os_id_full = $sysinfo->find_resource('os_id_full');

say $os_id_full->value;

Users of the qxprof and sxprof utilities should not notice any differences but hopefully the changes will be appreciated by those developing new code.


Testing the new LCFG core : Part 2

May 18, 2017

Following on from the basic tests for the new XML parser the next step is to check if the new core libs can be used to correctly store the profile state into a Berkeley DB file. This process is particularly interesting because it involves evaluating any context information and selecting the correct resource values based on the contexts. Effectively the XML profile represents all possible configuration states whereas only a single state is stored in the DB.

The aim was to compare the contents of the old and new DBs for each Informatics LCFG profile. Firstly I used rdxprof to generate DB files using the current libs:

cd /disk/scratch/profiles/inf.ed.ac.uk/
for i in $(find -maxdepth 1 -type d -printf '%f\n' | grep -v '^\.');\
do \
 echo $i; \
 /usr/sbin/rdxprof  -v -u file:///disk/scratch/profiles/ $i; \
done

This creates a DB file for each profile in the /var/lcfg/conf/profile/dbm directory. For 1500-ish profiles this takes a long time…

The next step is to do the same with the new libs:

find /disk/scratch/profiles/ -name '*.xml' | xargs \
perl -MLCFG::Profile -wE \
'for (@ARGV) { eval { $p = LCFG::Profile->new_from_xml($_); \
$n = $p->nodename; \
$p->to_bdb( "/disk/scratch/results/dbm/$n.DB2.db" ) }; \
print $@ if $@ }'

This creates a DB file for each profile in the /disk/scratch/results/dbm directory. This is much faster than using rdxprof.

The final step was to compare each DB. This was done simply using the perl DB_File module to tie each DB to a hash and then comparing the keys and values. Pleasingly this has shown that the new code is generating identical DBs for all the Informatics profiles.

Now I need to hack this together into a test script which other sites can use to similarly verify the code on their sets of profiles.


Testing the new LCFG core : Part 1

May 17, 2017

The project to rework the core LCFG code is rattling along and has reached the point where some full scale testing is needed. The first step is to check whether the new XML parser can actually just parse all of our LCFG profiles. At this stage I’m not interested in whether it can do anything useful with the data once loaded, I just want to see how it handles a large number of different profiles.

Firstly a source of XML profiles is needed, I grabbed a complete local copy from our lcfg server:


rsync -av -e ssh lcfg:/var/lcfg/conf/server/web/profiles/ /disk/scratch/profiles/

I then ran the XML parser on every profile I could find:


find /disk/scratch/profiles/ -name ‘*.xml’ | xargs \
perl -MLCFG::Profile -wE \
‘for (@ARGV) { eval { LCFG::Profile->new_from_xml($_) }; print $@ if $@ }’

Initially I hit upon bug#971 which is a genuine bug in the schema for the gridengine component. As noted previously, this was found because the new libraries are much stricter about what is considered to be valid data. With that bug resolved I can now parse all 1525 LCFG XML profiles for Informatics.