We suffered a power failure in the Informatics Forum on Monday 11th November 2013, starting at about 11am and ending at about 1pm. We still have no information from Estates and Buildings as to why this failure occurred.
Many users were surprised to find that, whilst power to their desktops and the network wasn’t interrupted, many of the School’s servers shut down shortly after 11am.
Emergency power for servers based in the Forum is provided by a pair of UPSes. These are primarily intended to let us ride out short power interruptions (e.g. a few minutes) and to shut servers down cleanly during longer ones. When both UPSes are fully functional, we have around 45 minutes of runtime on battery (given our current power load). Unfortunately, one of the UPSes has been out of action for a number of months, reducing our runtime on battery to 20 minutes.
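To illustrate why a shorter battery runtime leads to servers being shut down soon after an outage begins, here is a minimal sketch of the kind of decision involved. It is not our actual shutdown tooling: the class name `UpsPolicy`, the 5 minute grace period and the 10 minutes allowed for a clean shutdown are all assumed values chosen for illustration. Only the 45 and 20 minute runtimes come from the figures above.

```python
from dataclasses import dataclass


@dataclass
class UpsPolicy:
    """Hypothetical policy for shutting servers down while on battery."""

    runtime_min: float          # estimated battery runtime at current load
    grace_min: float = 5.0      # assumed: how long to ride out a short outage
    shutdown_min: float = 10.0  # assumed: time needed for a clean shutdown

    def latest_shutdown_start(self) -> float:
        """Minutes into the outage by which a clean shutdown must have begun."""
        return self.runtime_min - self.shutdown_min

    def should_shut_down(self, outage_min: float) -> bool:
        """True once the grace period has passed or the battery margin is reached."""
        return outage_min >= min(self.grace_min, self.latest_shutdown_start())


both_upses = UpsPolicy(runtime_min=45)  # both UPSes working
one_ups = UpsPolicy(runtime_min=20)     # one UPS out of action

print(both_upses.latest_shutdown_start())  # margin with both UPSes
print(one_ups.latest_shutdown_start())     # margin with a single UPS
print(one_ups.should_shut_down(6))         # six minutes into an outage
```

Under these assumed numbers, losing one UPS cuts the latest point at which a clean shutdown can start from 35 minutes into an outage to 10 minutes, which leaves far less room to wait and see whether power returns.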
Emergency power for offices and the network is provided by a single building UPS. This has a runtime of around 3 hours on battery, given our current power load. It is worth noting that the energy overhead of the building UPS is quite high, and consideration is being given to withdrawing it from service. No other University building has this level of cover for offices.
When power was reinstated at 1pm, the majority of services resumed reasonably quickly. However, the hardware failure of a disk controller in one of the storage arrays had a knock-on effect for a number of services, such as AFS. Power to some less critical services (e.g. the Hadoop cluster) wasn’t restored immediately, in case the power dropped again.
You can read our post mortem here.