The machine hosting the staff.ssh service (rydell) crashed on the evening of Tuesday 12th April. This was caused by a runaway process which consumed all available memory. The machine was rebooted at 6:30am on Wednesday 13th April and is now working normally.
For those interested in the details, the Linux kernel Out-Of-Memory (OOM) killer did kick in and did kill the runaway process. As is often the case, though, this didn’t free up enough memory quickly enough, so the OOM killer then went on the rampage and started killing other processes across the system, which left the machine running but non-functional.
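For anyone curious which processes the OOM killer would be likely to pick, the kernel exposes a per-process score in /proc/<pid>/oom_score, where a higher score means a more likely victim. The following is a minimal sketch for illustration only, not something we ran during the incident. The function name oom_scores and the proc_root parameter are made up for this example.

```python
import os


def oom_scores(proc_root="/proc"):
    """Return [(score, pid)] with the likeliest OOM-killer victims first.

    proc_root is a hypothetical parameter so the function can be pointed
    at a test directory instead of the real /proc.
    """
    scores = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():  # only numeric directories are processes
            continue
        try:
            with open(os.path.join(proc_root, entry, "oom_score")) as fh:
                scores.append((int(fh.read().strip()), int(entry)))
        except (OSError, ValueError):
            # the process exited, or the file was unreadable or empty
            continue
    # highest score first; ties broken by lowest pid
    return sorted(scores, key=lambda item: (-item[0], item[1]))
```

On a live Linux machine, oom_scores()[:5] gives the five processes the kernel currently considers the best candidates to kill.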
On both the ssh login machines there is a limit on the number of processes a user is permitted to have running. This limit had not previously been changed from the default value of 1024. The default was clearly too high in this case, as the OOM killer kicked in before the limit was reached. There is unlikely to be a good reason for anyone to run that many processes on an ssh login server, so to help prevent this problem recurring we are going to drop the limit to 200. This is a lot more than anyone is currently running but significantly lower than the default.
This does not, of course, prevent a small number of large processes from consuming all the memory, but it’s unclear whether it is currently possible to prevent that. When we upgrade to SL6 later in the year we will review the situation and see whether newer features of the Linux kernel will allow us to do anything else to prevent total resource consumption and the subsequent crashes.
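To see how close users are to a limit of 200, the number of processes per user can be counted from /proc. The sketch below is illustrative only and assumes a Linux /proc layout. The names processes_per_uid, users_over and proc_root are hypothetical. It counts processes but not their individual threads, whereas the kernel's per-user process limit on Linux also counts threads, so the real figure may be higher.

```python
import os
from collections import Counter

PROPOSED_LIMIT = 200  # the new per-user process limit mentioned above


def processes_per_uid(proc_root="/proc"):
    """Count processes per owning UID by scanning a /proc-style tree."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():  # only numeric directories are processes
            continue
        try:
            uid = os.stat(os.path.join(proc_root, entry)).st_uid
        except OSError:
            # the process exited between listdir() and stat()
            continue
        counts[uid] += 1
    return counts


def users_over(limit, counts):
    """Return {uid: count} for users at or above the given limit."""
    return {uid: n for uid, n in counts.items() if n >= limit}


if __name__ == "__main__" and os.path.isdir("/proc"):
    counts = processes_per_uid()
    for uid, n in counts.most_common(5):
        print(f"uid {uid}: {n} processes")
    print("at or above limit:", users_over(PROPOSED_LIMIT, counts))
```

Run as a script on a login server, this prints the five busiest UIDs and any user already at or above the proposed limit.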