Last weekend (Saturday 19.1.19), the UPS system which supplies power to the various server rooms located in the Informatics Forum developed a fault which took one of the two units that make up the UPS off-line. As a result, the remaining unit became overloaded: we now have so many machines installed in the Forum server rooms that we are at the limit of what the UPS is designed to cope with.
In order to deal with the overloading, we have since shut down various machines, and we have also moved some machines to different electrical phases: the idea is to try to have the overall load balanced as well as we can between the three electrical phases in use. Owners of self-managed machines, as well as owners of machines which the computing staff manage on their behalf, have been helpful to us in this overall effort – for which, many thanks.
An engineer has attended and identified a fault in the rectifier of the failed UPS unit. Unfortunately, there are currently insufficient spare parts available locally, so parts are coming from abroad, and the current estimate is that the repair will be made on Monday or Tuesday next week (i.e. 28.1.19 or 29.1.19). We will keep people informed of progress via postings to the relevant mailing lists.
On the general issue of the overloading of the Forum server room UPS system: our current UPS system is old, and is no longer powerful enough for the collection of machines which we have installed. We are therefore in the process of organizing a complete replacement of the UPS system, which will be specified to allow us plenty of spare power capacity for future growth. Getting this system installed and commissioned will take significant time and effort (as well as money, of course), and the installation of the new system will necessarily involve some downtime of all of our server rooms. But the final system should provide us with a much more reliable and future-proof solution.
Our current expectation is that the new UPS system will be operational before the end of July 2019. Until the new system becomes operational, we will be keeping a close eye on the overall power consumption of our server rooms, and might need to do further rebalancing across phases in order to maintain a reliable service. In any event, we will keep you informed of progress.
Thanks for your patience and cooperation during the current power supply problems.