The machine hosting the staff.ssh service (rydell) crashed on the evening of Tuesday 12th April. This was caused by a runaway process which consumed all available memory. The machine was rebooted at 6:30am on Wednesday 13th April and is now working normally.
For those interested in the details, the Linux kernel Out-Of-Memory (OOM) killer did kick in and did kill the runaway process. As is often the case, though, this did not regain sufficient memory quickly enough, so the OOM killer then went on the rampage and started killing processes all over the place, which left the system running but non-functional.
On both the ssh login machines there is a limit on the number of processes a user is permitted to have running. This had not previously been changed from the default value of 1024. That default was clearly too high in this case, as the OOM killer had kicked in before the limit was reached, and there is unlikely to be a good reason for anyone to run that many processes on an ssh login server. To help prevent this problem recurring we are therefore going to drop the limit to 200. This is still far more than anyone is currently running, but significantly lower than the default. It does not, of course, prevent a small number of large processes from consuming all available memory, but it is unclear whether that can be prevented with the current kernels. When we upgrade to SL6 later in the year we will review the situation and see if newer features of the Linux kernel will allow us to do anything else to prevent total resource consumption and the subsequent crashes.
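For reference, per-user process limits of this kind are usually applied via pam_limits. The entry below is a sketch of how the lowered limit might look; it is an assumption for illustration, not the site's actual configuration:

```shell
# Hypothetical /etc/security/limits.conf entry dropping the per-user
# process limit from the default 1024 to 200:
#
#   *    hard    nproc    200
#
# After the next login, a user can check the limit in effect on
# their own session with the shell builtin:
ulimit -u
```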
Maybe encourage users to set a “ulimit -v” in their shell configuration to prevent them accidentally trying to allocate stupid amounts of memory? One form of encouragement might be to set a default in the system-wide shell configuration.
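The suggestion above might look like the following line in a user's shell startup file (e.g. ~/.bashrc); the 1 GiB figure is an assumption for illustration, not a recommended site value:

```shell
# Cap this shell's virtual memory at 1 GiB; bash's ulimit -v takes
# the value in KiB, so 1 GiB = 1048576 KiB.
ulimit -v 1048576
# Confirm the limit now in effect:
ulimit -v
```

Note that a limit can only be lowered, not raised again, within the same session, which is part of the hassle mentioned in the reply below.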
The problem is that what is an appropriate amount of memory on one machine (e.g. a compute server) is completely wrong on another machine (e.g. an ssh login server with 1GB of RAM). I suspect most people would not want the hassle of altering it depending on their current server and task. I’m hopeful that the cgroups support in newer kernels will allow us to set up some sensible policies for different types of machines.
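A per-machine-type policy of the sort hoped for above could be expressed with libcgroup (as shipped with SL6) roughly as follows; the group name and the 512 MB cap are assumptions for illustration only, not a tested configuration:

```
# Hypothetical /etc/cgconfig.conf fragment: a memory cgroup capping
# the total memory used by interactive login sessions.
group sshusers {
    memory {
        memory.limit_in_bytes = 536870912;   # 512 MB
    }
}

# Matching /etc/cgrules.conf line placing ordinary users' processes
# into that group (fields: user, controller, destination group):
#   *    memory    sshusers/
```

The attraction over per-process ulimits is that the cap applies to a whole group of processes at once, so it would also catch the many-small-processes case that triggered this crash.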
I assume if someone had a really good reason for a higher limit then this could be granted on a one-off basis.
Yes, we can make exceptions to the resource limits on a per-user basis where necessary.