LCFG XML Profile changes

August 20, 2015

As part of the LCFG v4 client project I am working on converting the XML profile parsing over to using the libxml2 library. Recent testing has revealed a number of shortcomings in the way the LCFG XML profiles are generated which break parsers which are stricter than the old W3C code upon which the current client is based. In particular the encoding of entities has always been done in a style which is more suitable for HTML than XML. There is really only a small set of characters that must be encoded for XML, those are: single-quote, double-quote, left-angle-bracket, right-angle-bracket and ampersand (in some contexts the set can be even smaller). The new XML parser was barfing on unknown named entities which would be supported by a typical web browser. It is possible to educate an XML parser about these entities but it’s not really necessary. A better solution is to emit XML which is utf-8 compliant which avoids the needs for additional encoding. Alongside this problem of encoding more than was necessary the server was not encoding significant whitespace, e.g. newlines, carriage returns and tabs. By default a standards compliant XML parser will ignore such whitespace. An LCFG resource might well contain such whitespace so it was necessary to add encoding support to the server. In the process of making these changes to the LCFG::Server::Profile::XML module I merged all the calls to the encoder into a call to a single new EncodeData subroutine so that it is now trivial to tweak the encoding as required. These changes will be going out in version 3.3.0 of the LCFG-Compiler package in the next stable release. As always, please let us know if these changes break anything.

MooX::HandlesVia and roles

August 20, 2015

I’ve been using Moo for Perl object-oriented programming for a while now. It’s really quite nice, it certainly does everything I need and it’s much lighter than Moose.

Whilst working on the LCFG v4 client project I recently came across a problem with the MooX::HandlesVia module when used in conjunction with roles. I thought it worth blogging about if only to save some other pour soul from a lot of head scratching (probably me in 6 months time).

If a class is composed of more than one role and each role uses the MooX::HandlesVia module, for example:

    package SJQ::Role::Foo;
    use Moo::Role;
    use MooX::HandlesVia;

    package SJQ::Role::Bar;
    use Moo::Role;
    use MooX::HandlesVia;

    package SJQ::Baz;
    use Moo;

    with 'SJQ::Role::Foo','SJQ::Role::Bar';

    use namespace::clean;

my $test = SJQ::Baz->new();

It fails and the following error message is generated:

Due to a method name conflict between roles 'SJQ::Role::Bar and
SJQ::Role::Foo', the method 'has' must be implemented by 'SJQ::Baz'
at /usr/share/perl5/vendor_perl/Role/ line 215.

It appears that MooX::HandlesVia provides its own replacement has method and this causes a problem when namespace::clean is also used.

The solution is to apply the roles separately, it’s perfectly allowable to call the with method several times. For example:

    package SJQ::Baz;
    use Moo;

    with 'SJQ::Role::Foo';
    with 'SJQ::Role::Bar';

    use namespace::clean;

PostgreSQL new features

June 10, 2015

It looks like PostgreSQL 9.4 has some really interesting new features. Today I came across a blog post by 2ndquadrant demonstrating the WITHIN GROUP and FILTER clauses. I don’t think I’ve entirely got my head round the purpose of WITHIN GROUP yet, I suspect I need a couple of good real-world examples. The FILTER clause looks very handy though, I’m sure I’ll be using that when I get the chance.

LCFG::Component environment plugins

January 5, 2015

Version 1.13.0 the Perl version of the ngeneric framework (LCFG::Component) provides an all-new environment initialisation system for component methods. This has support for plugins which mean it is fully extensible.

There is a new InitializeEnvironment method which is called for most standard methods which are accessible via om (including configure, start, restart, stop, run, and logrotate). The method can also be called from any additional methods you have added to your own components, the method needs access to the resources so it must be called after a call to LoadProfile or LoadStatus.

There are currently two plugins – a very simple one which can be used to set values for environment variables before the method is called and a more complex one that can do the equivalent of kinit and aklog to acquire Kerberos credentials and AFS tokens.

For full details see the LCFG wiki.

Moo and Type::Tiny

December 14, 2014

At the start of 2014 I was working on a project to further improve the LCFG client. When I hit problems with Moose and its memory usage I discovered the excellent Moo framework which provides all the nice bits but is much less heavyweight. As part of the Perl Advent Calendar for 2014 someone has written a great introductory article on using Moo along with Type::Tiny. I’ve learnt a few things, I particularly like the idea of a “type library” as a good way to organize all the local types.

LCFG::Build::Skeleton changes

December 8, 2014

At the LCFG Annual Review meeting held last week one topic which was discussed was the idea of all Perl based LCFG components being implemented as modules with the component script just being a very thin wrapper which loads the module and calls the dispatch method. This has been our recommended coding style for quite a while now and we use this approach for many of the core components.

During the discussion I realised that the lcfg-skeleton tool which is used to create the outline directory structure for new projects does not support this way of working. To make it really easy to create new Perl-based components which follow recommended best-practice I have consequently updated LCFG-Build-Skeleton. The new version 0.4.1 creates a module file (e.g. lib/LCFG/Component/, the necessary CMake file and also tweaks the specfile appropriately. This will be in the stable release on Thursday 18th December or you can grab it from CPAN now.

LCFG authorization

December 3, 2014

The authorization of LCFG component methods (which are called using the om command) is typically done using the LCFG::Authorize module. This is limited to checking usernames and membership of groups managed in LCFG.

In Informatics we have for a long-time used a different module – DICE::Authorize – which extends this to also checking membership of a netgroup. Recently we discovered some problems with our implementation of this functionality which make it very inflexible. We have been connecting directly to the LDAP server and doing the lookup based on hardcoded information in the module. As this really just boils down to checking membership of a netgroup this can clearly be done more simply by calling the innetgr function. This will work via the standard NS framework so will handle LDAP, NIS or whatever is required. The necessary details are then only stored in the standard location and not embedded into the code.

Rather than just rewrite the DICE::Authorize module I took the chance to move the functionality to the LCFG layer, so we now have LCFG::Authorize::NetGroups. This nicely sub-classes the standard module so that if the user is not a member of a netgroup the other checks are then carried out. This is much better code reuse, previously we had two distinct implementations of the basic checks.

Having a new implementation of the authorization module is also handy for dealing with the transition stage. We can keep the old one around so that if a problem is discovered with the new code we can quickly switch back to the old code.

I also took the chance to improve the documentation of the authorization framework so if you’re still running om as root now is a good time to improve things!

Sub-classing LCFG components

December 3, 2014

One topic that often comes up in discussions about how to make things easier for LCFG component authors is the idea of sub-classing.

Although I had never tried it I had always assumed this was possible. Recently whilst looking through the LCFG::Component code I noticed that the list of methods are looked up in the symbol table for the module:

    my $mtable = {};
    for my $method ( ( keys %LCFG::Component:: ),
        ( eval 'keys %' . ref($self) . q{::} ) )
        if ( $method =~ m/^method_(.*)/i ) {
            $mtable->{ lc $1 } = $method;
    $self->{_METHOD} = lc $_METHOD;
    my $_FUNCTION = $mtable->{ $self->{_METHOD} };

So, this will work if your method comes from LCFG::Component or LCFG::Component::Foo but it wouldn’t work if you have a sub-class of Foo. You would potentially miss out on methods which are only in Foo (or have to copy/paste them into your sub-class).

Not only does this make sub-classing tricky it also involves a horrid string eval. There had to be a better way. Thankfully I was already aware of the Class::Inspector module which can do the necessary. This module is widely used by projects such as DBIx::Class and Catalyst so is likely to be reliable. It has a handy methods method which does what we need:

    my $_FUNCTION;
    my $public_methods = Class::Inspector->methods( ref($self), 'public' );
    for my $method (@{$public_methods}) {
        if ( $method =~ m/^Method_(\Q$_METHOD\E)$/i ) {
            $_FUNCTION = $method;
            $_METHOD = $1;

Much nicer code and a tad more efficient. Now the LCFG component Perl modules are properly sub-classable.

Usenix LISA 2014

November 18, 2014

Last week I attended the Usenix LISA conference in Seattle. There was a very strong “DevOps” theme to this year’s conference with a particular focus on configuration management, monitoring (the cool term seems to be “metrics”) and managing large scale infrastructure. As always, this conference offers a strong hallway track, there is the opportunity to pick the brains of some of the best sysadmins in the business. I had a lot of interesting discussions with people who work in other universities as well as those who work at the very largest scale such as Google.

There were lots of good talks this year, annoyingly quite a few of those which seemed likely to be most interesting had been scheduled against each other. Thankfully most of them were recorded so they can be viewed later. There is no doubt that this conference delivers real value for money in terms of the knowledge and inspiration gained. I had conversations with several people where we commented that the cost of the entire conference, including travel and accommodation, equals just a few days of “professional training” in the UK. A few of the highlights for me were:

Radical Ideas from the Practice of Cloud Computing

This talk by Tom Limoncelli’s was based on some of the topics in his new book – The Practice of Cloud System Administration: Volume 2: Designing and Operating Large Distributed Systems. He proposed the idea that it is better to use lots of cheaper, less reliable, hardware rather than a few very expensive machines. He explained how this can be achieved by focussing on resilience of a service rather than reliability of individual hardware, this becomes cheaper as a portion of the total capital expenditure as you scale up.

He moved on to showing that when you have a risky business process you should not avoid it but rather should choose to do it more frequently, a “practice makes perfect” approach. With practice your procedures will become better understood and they will be more reliable and more efficient. Admins are unlikely to have good knowledge of a process which is only done rarely. Doing risky processes often also helps reveal single points of failure in your infrastructure.

An advantage of doing updates regularly is that the changes can be applied in small batches. The changes are thus easier to debug because they are recent and fresh in the minds of developers. Also, the environment changes less so it’s easier to spot the origin of a problem if one occurs. The frequent application of changes also keeps developers happy, they get faster feedback and have the warm, fuzzy feeling of success on a regular basis. This idea of keeping the feedback loop short and tight was something that kept cropping up throughout the conference and it’s clear to me that this is one of the main factors in the success of the DevOps strategy.

Clearly doing risky changes frequently does mean that bad things will happen. Tom recommended avoiding punishing people for outages, any problem should be seen as a failure of the procedures, one quote was “there is no root cause, only contributing factors“. The best way to handle outages is to be well prepared, this means anticipating likely problems, having practice drills and ensuring there is a thorough post-mortem. A post-mortem should consider what went right/wrong and propose actions which can be done in the short and long-term. This is something we have been doing in Informatics for several years, it’s always nice to be told you’re doing the right thing!

His closing remarks were “We run services not servers” and “We are hired to be awesome in the face of failure“. Clearly he is working at a different scale to what we do in Informatics but these sentiments are still both very applicable to how we manage our systems in Informatics.

I’m definitely interested in getting a copy of his book to learn more. Impressively, many people at the conference queued up to get Tom to sign their copy.

Building a One-Time Password Token Authentication Infrastructure

This was an excellent talk which covered a subject we have been investigating in Informatics. This talk was given by two admins from the LIGO project. They had identified user credentials theft as a critical risk to their project. The data generated by the project is eventually published publically so they are not worried about data theft, rather they are concerned about loss of access to scientific data which is not replayable. If their systems are down when an important astronomical event occurs they will lose valuable data. They were particularly focussing on avoiding problems which can occur because users reuse passwords on multiple services.

Their plan was to use a separate credential that is not replayable, this is important, they didn’t just want a second authentication factor. This credential would be used to gain access to the most critical parts of their infrastructure. As well as increasing security this has an important psychological benefit in that it makes users aware whenever they are accessing the most important systems. For services such as email they would not be required to use a second factor, the inconvenience would annoy users too much for the small benefit gained. They noted that it is still necessary to beware that either end of an active session could be hijacked after authentication has been successfully completed.

They examined various options, they required a token-based – “something you have” – approach, preferably it should be highly tamper resistant. They wanted a separate physical device to avoid the opportunity for remote compromise, as could occur with software based systems in mobile devices. They gave an example of a virus which infected MacOSX computers and then deliberately targetted iPhones when they were plugged into the machine. I hadn’t really considered this downside of using mobile devices before, it definitely makes me strongly in favour of a solely hardware token approach.

They did note some limitations of token-based systems. In particular they only have a limited lifetime which seems to be in the range of 2 to 3 years, depending on usage. This created some problems for the project, how do you securely deliver a token to a very remote user? Particularly if they have lost one and need a replacement quickly. Many tokens are time-based, this can introduce synchronisation problems for remote users who cannot return to base to get it fixed. Also, many time-based systems avoid replays by only allowing one login within a time window (e.g. 1 minute), this could be frustrating for users.

They went on to discuss how any 2-factor system is going to introduce additional overheads. There will be issues with failures occurring at any point in the system. It needs to integrate well with existing infrastructure and preferably avoid the need to replace software.

They did not wish to trust 3rd parties or rely on a proprietary blackbox solution that could be compromised and lose secrets. To achieve total ownership of the system they created their own custom authentication server. This supports a multi-site approach with secure replication of data. They selected the yubikey device which we have looked at in Informatics. This is used via PAM as a second factor to Kerberos authentication.

This talk gave a very good coverage of the whole 2-factor authentication problem. I look forwards to reviewing the recording and the slides. I will have to find out if we can get the code for their custom authentication system and try it out in Informatics.

One Year After the Meltdown: Now What?

This talk was given by Mikey Dickerson who was originally seconded from Google to the White House to help fix the website when it so spectacularly failed to deal with demand last year. Due to the very imminent deadline for the website to be ready for renewals he had to do the talk via video link from the White House. This worked much better than I feared it would and thankfully the network didn’t collapse. The main thing I took from this was how a DevOps approach can be applied to failing projects no matter how huge and weighed down with bureaucracy. There was a clear determination to save the project without resorting to a complete rewrite, the success came from restructuring teams and using better procedures. It was interesting to hear that they had been in contact with the GOV.UK people and considered the UK government to have better public facing IT services. They are now moving onto applying the same strategy to other US government IT services, in particular the Veterans Association. The team are clearly very determined and driven, they are working stupid numbers of hours each week. Many of them have given up well paid private sector jobs so that they can make a real difference to the country. It will be interesting to see if they manage to achieve real permanent change which can cope with a change of president.

Gauges, Counters and Ratios, Oh My!

The aim of this talk was to explain how to design useful metrics which can be used for service monitoring and problem diagnosis. It started off with quite a technical discussion of the definition of “metric”. The definition given was “a named value at some specific time“. Having discussed these 3 important points (name, value and time) the discussion moved onto using high-dimension databases which can handle high-resolution time series data. The recommended Open-Source software for this purpose is OpenTSDB which works on hadoop.

There was also discussion about why gathering metrics is useful. In particular 4 broad themes were identified: operational health monitoring, quality assurance, capacity planning and product management. Currently we do health monitoring fairly well but we’re not really doing the others. I think it would certainly be very useful to have better monitoring of resources when planning for future capacity requirements.

The recommended software suite to cover all requirements is nagios (or equivalent) plus Graphite plus Sensu plus logstash plus ganglia.

Although an interesting talk I think I would have benefited more from the talk the speaker gave at LISA 2013 titled “A Working Theory of Monitoring” which he referenced a couple of times. The slides and video of that previous talk are now available online.

The Top 5 Things I Learned While Building Anomaly Detection Algorithms for IT Ops

This talk was given by Toufic Boubez who is clearly a smart chap who really knows his stuff. He gave lots of useful advice on how to analyse the metrics you have collected to detect anomalies.

His main point was that your data is almost certainly NOT gaussian. This is a problem because most analytic tools assume that parametric techniques are applicable.

There is also the issue that “yesterday’s anomaly is today’s normal“. He talked about how stationarity (sic) is not a realistic assumption with large complex systems. The term for this is “Concept Drift“.

He went on to discuss non-parametric techniques (such as the Kolmogorov-Smirnov (KS) test) which can be used to compare probability distributions.

As well as using the right statistical techniques it is very important to have good domain knowledge. You need to know your data and the general patterns. This will allow you to customise alerts appropriately so you don’t get paged unnecessarily.

He also noted that some data channels are inherently very quiet. It’s hard to deal with this type of data using time-series techniques. Sparse data is very hard to analyse but will still contain very important information.

The speaker posts interesting stuff on his twitter account.

LCFG Client Guide

March 31, 2014

As part of my work on updating the LCFG client I’ve written a guide to the inner workings of the LCFG client. This is intended to be fairly high-level so it doesn’t go into the details of which subroutine calls which subroutine. The aim is that this should cover all the main functionality and provide the information necessary to get started with altering and extending the client code base.