Last week I attended the USENIX LISA conference in Seattle. There was a very strong “DevOps” theme to this year’s conference, with a particular focus on configuration management, monitoring (the cool term seems to be “metrics”) and managing large-scale infrastructure. As always, the conference offered a strong hallway track, with the opportunity to pick the brains of some of the best sysadmins in the business. I had a lot of interesting discussions with people who work in other universities as well as those who work at the very largest scale, such as at Google.
There were lots of good talks this year. Annoyingly, quite a few of those which seemed likely to be most interesting had been scheduled against each other. Thankfully most of them were recorded so they can be viewed later. There is no doubt that this conference delivers real value for money in terms of the knowledge and inspiration gained. I had conversations with several people where we commented that the cost of the entire conference, including travel and accommodation, equals just a few days of “professional training” in the UK. A few of the highlights for me were:
This talk by Tom Limoncelli was based on some of the topics in his new book – The Practice of Cloud System Administration: Volume 2: Designing and Operating Large Distributed Systems. He proposed the idea that it is better to use lots of cheaper, less reliable hardware rather than a few very expensive machines. He explained how this can be achieved by focussing on the resilience of a service rather than the reliability of individual hardware. The cost of providing that resilience becomes smaller as a proportion of the total capital expenditure as you scale up.
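To make the "cheap but redundant" argument concrete, here is a rough back-of-the-envelope sketch. It is my own illustration rather than anything from the talk, and it assumes machine failures are independent and that the service stays up as long as at least one replica is up. The function name and the 99% figure are made up for the example.

```python
def service_availability(machine_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` machines is up.

    Assumes failures are independent, which real infrastructure
    (shared power, network, config pushes) often violates.
    """
    if not 0.0 <= machine_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    # The service is down only if every replica is down at once.
    return 1.0 - (1.0 - machine_availability) ** replicas


if __name__ == "__main__":
    # Hypothetical "cheap" machines that are each up 99% of the time.
    for n in (1, 2, 3):
        print(n, round(service_availability(0.99, n), 6))
```

Under those assumptions each extra cheap replica cuts the expected downtime by a factor of a hundred, which is the intuition behind designing for service resilience rather than buying ever more reliable individual machines.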
He moved on to showing that when you have a risky business process you should not avoid it but rather should choose to do it more frequently, a “practice makes perfect” approach. With practice your procedures will become better understood and they will be more reliable and more efficient. Admins are unlikely to have good knowledge of a process which is only done rarely. Doing risky processes often also helps reveal single points of failure in your infrastructure.
An advantage of doing updates regularly is that the changes can be applied in small batches. The changes are thus easier to debug because they are recent and fresh in the minds of developers. Also, the environment changes less, so it’s easier to spot the origin of a problem if one occurs. The frequent application of changes also keeps developers happy: they get faster feedback and have the warm, fuzzy feeling of success on a regular basis. This idea of keeping the feedback loop short and tight was something that kept cropping up throughout the conference and it’s clear to me that this is one of the main factors in the success of the DevOps strategy.
Clearly, doing risky changes frequently does mean that bad things will happen. Tom recommended avoiding punishing people for outages; any problem should be seen as a failure of the procedures. One quote was “there is no root cause, only contributing factors”. The best way to handle outages is to be well prepared, which means anticipating likely problems, having practice drills and ensuring there is a thorough post-mortem. A post-mortem should consider what went right and wrong and propose actions which can be done in the short and long term. This is something we have been doing in Informatics for several years. It’s always nice to be told you’re doing the right thing!
His closing remarks were “We run services not servers” and “We are hired to be awesome in the face of failure”. Clearly he is working at a different scale to what we do in Informatics, but both sentiments are still very applicable to how we manage our systems.
I’m definitely interested in getting a copy of his book to learn more. Impressively, many people at the conference queued up to get Tom to sign their copy.
This was an excellent talk which covered a subject we have been investigating in Informatics. It was given by two admins from the LIGO project. They had identified theft of user credentials as a critical risk to their project. The data generated by the project is eventually published publicly so they are not worried about data theft. Rather, they are concerned about loss of access to scientific data which is not replayable. If their systems are down when an important astronomical event occurs they will lose valuable data. They were particularly focussing on avoiding problems which can occur because users reuse passwords on multiple services.
Their plan was to use a separate credential that is not replayable. This is important: they didn’t just want a second authentication factor. This credential would be used to gain access to the most critical parts of their infrastructure. As well as increasing security, this has an important psychological benefit in that it makes users aware whenever they are accessing the most important systems. For services such as email they would not be required to use a second factor, as the inconvenience would annoy users too much for the small benefit gained. They noted that it is still necessary to be aware that either end of an active session could be hijacked after authentication has been successfully completed.
They examined various options. They required a token-based (“something you have”) approach, which should preferably be highly tamper resistant. They wanted a separate physical device to avoid the opportunity for remote compromise, as could occur with software-based systems on mobile devices. They gave an example of a virus which infected Mac OS X computers and then deliberately targeted iPhones when they were plugged into the machine. I hadn’t really considered this downside of using mobile devices before, and it definitely makes me strongly in favour of a solely hardware token approach.
They did note some limitations of token-based systems. In particular, tokens only have a limited lifetime, which seems to be in the range of 2 to 3 years, depending on usage. This created some problems for the project: how do you securely deliver a token to a very remote user, particularly if they have lost one and need a replacement quickly? Many tokens are time-based, which can introduce synchronisation problems for remote users who cannot return to base to get the token fixed. Also, many time-based systems avoid replays by only allowing one login within a time window (e.g. 1 minute), which could be frustrating for users.
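To illustrate why the "one login per time window" rule can be frustrating, here is a minimal sketch of a time-based code check with replay protection. This is purely my own illustration: it is not the LIGO system, nor how any particular token product works. The `time_code` and `ReplayGuard` names, the 6-digit code and the 60 second window are all assumptions made for the example.

```python
import hashlib
import hmac
import struct

WINDOW_SECONDS = 60  # hypothetical window, matching the "1 minute" example


def time_code(secret: bytes, timestamp: float, window: int = WINDOW_SECONDS) -> str:
    """Derive a 6-digit code from a shared secret and the current time window."""
    counter = int(timestamp // window)
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # pick 4 bytes based on the last nibble
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return f"{value % 1_000_000:06d}"


class ReplayGuard:
    """Accept at most one successful login per user per time window."""

    def __init__(self, secrets):
        self._secrets = dict(secrets)
        self._last_window = {}  # user -> last window with a successful login

    def verify(self, user: str, code: str, timestamp: float) -> bool:
        secret = self._secrets.get(user)
        if secret is None:
            return False
        window = int(timestamp // WINDOW_SECONDS)
        # Reject anything in a window that has already been used (or older).
        if self._last_window.get(user, -1) >= window:
            return False
        expected = time_code(secret, timestamp)
        if not hmac.compare_digest(code.encode(), expected.encode()):
            return False
        self._last_window[user] = window
        return True
```

In this sketch a user who logs in and then immediately needs a second session has to wait for the next window, and a token whose clock has drifted into a different window simply produces codes that never match. Both are the kinds of annoyance the speakers described.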
They went on to discuss how any 2-factor system is going to introduce additional overheads. There will be issues with failures occurring at any point in the system. It needs to integrate well with existing infrastructure and preferably avoid the need to replace software.
They did not wish to trust 3rd parties or rely on a proprietary black-box solution that could be compromised and lose secrets. To achieve total ownership of the system they created their own custom authentication server. This supports a multi-site approach with secure replication of data. They selected the YubiKey device, which we have looked at in Informatics. This is used via PAM as a second factor alongside Kerberos authentication.
This talk gave very good coverage of the whole 2-factor authentication problem. I look forward to reviewing the recording and the slides. I will have to find out if we can get the code for their custom authentication system and try it out in Informatics.
One Year After the healthcare.gov Meltdown: Now What?
This talk was given by Mikey Dickerson, who was originally seconded from Google to the White House to help fix the healthcare.gov website when it so spectacularly failed to deal with demand last year. Due to the very imminent deadline for the website to be ready for renewals he had to do the talk via video link from the White House. This worked much better than I feared it would and thankfully the network didn’t collapse.
The main thing I took from this was how a DevOps approach can be applied to failing projects no matter how huge and weighed down with bureaucracy. There was a clear determination to save the project without resorting to a complete rewrite. The success came from restructuring teams and using better procedures. It was interesting to hear that they had been in contact with the GOV.UK people and considered the UK government to have better public-facing IT services.
They are now moving on to applying the same strategy to other US government IT services, in particular the Department of Veterans Affairs. The team are clearly very determined and driven, and they are working stupid numbers of hours each week. Many of them have given up well-paid private sector jobs so that they can make a real difference to the country. It will be interesting to see if they manage to achieve real permanent change which can cope with a change of president.
The aim of this talk was to explain how to design useful metrics which can be used for service monitoring and problem diagnosis. It started off with quite a technical discussion of the definition of “metric”. The definition given was “a named value at some specific time”. Having discussed these 3 important points (name, value and time), the discussion moved on to using high-dimension databases which can handle high-resolution time series data. The recommended Open Source software for this purpose is OpenTSDB, which runs on top of Hadoop.
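The "named value at some specific time" definition maps quite naturally onto a small data structure. The sketch below is my own illustration, not something shown in the talk. The `Metric` class, the `tags` field (my guess at what "high-dimension" means in practice) and the `to_put_line` helper are all assumptions. The output is loosely modelled on the line-based `put` format OpenTSDB accepts, but check the OpenTSDB documentation before relying on the exact layout.

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    """A named value at some specific time, plus optional dimensions."""
    name: str
    value: float
    timestamp: int  # seconds since the Unix epoch
    tags: dict = field(default_factory=dict)  # e.g. {"host": "web01"}


def to_put_line(metric: Metric) -> str:
    """Render a metric as a single text line (format is an assumption)."""
    tag_text = " ".join(f"{k}={v}" for k, v in sorted(metric.tags.items()))
    line = f"put {metric.name} {metric.timestamp} {metric.value}"
    return f"{line} {tag_text}" if tag_text else line


if __name__ == "__main__":
    sample = Metric("sys.cpu.user", 42.5, 1356998400, {"host": "web01", "cpu": "0"})
    print(to_put_line(sample))
```

Keeping the dimensions as tags rather than baking them into the metric name is what makes it possible to slice the same series by host, service or anything else later on.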
There was also discussion about why gathering metrics is useful. In particular 4 broad themes were identified: operational health monitoring, quality assurance, capacity planning and product management. Currently we do health monitoring fairly well but we’re not really doing the others. I think it would certainly be very useful to have better monitoring of resources when planning for future capacity requirements.
The recommended software suite to cover all requirements is Nagios (or equivalent) plus Graphite plus Sensu plus Logstash plus Ganglia.
Although this was an interesting talk, I think I would have benefited more from the talk the speaker gave at LISA 2013 titled “A Working Theory of Monitoring”, which he referenced a couple of times. The slides and video of that previous talk are now available online.
This talk was given by Toufic Boubez who is clearly a smart chap who really knows his stuff. He gave lots of useful advice on how to analyse the metrics you have collected to detect anomalies.
His main point was that your data is almost certainly NOT Gaussian. This is a problem because most analytic tools rely on parametric techniques which assume that it is.
There is also the issue that “yesterday’s anomaly is today’s normal”. He talked about how stationarity (sic) is not a realistic assumption with large complex systems. The term for this is “Concept Drift”.
He went on to discuss non-parametric techniques (such as the Kolmogorov-Smirnov (KS) test) which can be used to compare probability distributions.
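For my own notes, here is a minimal sketch of the two-sample KS statistic, which is simply the largest vertical gap between the empirical cumulative distribution functions of two samples. This is my own plain-Python illustration, not code from the talk, and the function name is made up. Turning the statistic into a p-value is better left to a proper statistics library (for example SciPy provides `scipy.stats.ks_2samp`).

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b) -> float:
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the maximum absolute difference between the empirical CDFs
    of the two samples. No distribution (Gaussian or otherwise) is assumed.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # Both empirical CDFs are step functions that only change at data
    # points, so checking every observed value is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


if __name__ == "__main__":
    last_week = [12, 15, 14, 13, 90, 16, 14]   # hypothetical latency samples
    this_week = [30, 35, 33, 31, 120, 34, 32]
    print(ks_statistic(last_week, this_week))
```

A value near 0 means the two windows of data look like they come from the same distribution, while a value near 1 means they barely overlap. Comparing a recent window against an older one in this way is one option for spotting a change without assuming anything about the shape of the data.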
As well as using the right statistical techniques it is very important to have good domain knowledge. You need to know your data and the general patterns. This will allow you to customise alerts appropriately so you don’t get paged unnecessarily.
He also noted that some data channels are inherently very quiet. It’s hard to deal with this type of data using time-series techniques. Sparse data is very hard to analyse but will still contain very important information.
The speaker posts interesting stuff on his Twitter account.