Last weekend (Saturday 19.1.19), the UPS system which supplies power to the various server rooms located in the Informatics Forum developed a fault which meant that one half of the pair of units which comprise the UPS went off-line. As a result, the other half of the pair became overloaded: we now have so many machines installed in the Forum server rooms that we are really at the limit of what the UPS is designed to cope with.
In order to deal with the overloading, we have since shut down various machines, and we have also moved some machines to different electrical phases: the idea is to try to have the overall load balanced as well as we can between the three electrical phases in use. Owners of self-managed machines, as well as owners of machines which the computing staff manage on their behalf, have been helpful to us in this overall effort – for which, many thanks.
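For readers curious what balancing load across three phases can look like, here is a minimal sketch. It is not the procedure the computing staff actually followed. It assumes each machine has a known steady power draw, and the machine names, wattages and phase labels (`L1`, `L2`, `L3`) are all hypothetical. It uses a simple greedy heuristic: take machines from the largest draw to the smallest and put each on whichever phase is currently least loaded.

```python
import heapq


def balance_phases(loads_watts, phases=("L1", "L2", "L3")):
    """Greedily assign machines to phases, largest draw first.

    loads_watts maps a (hypothetical) machine name to its power draw in watts.
    Returns (assignment, totals): which machines sit on each phase, and the
    resulting total load per phase.
    """
    # Min-heap of (current total load, phase name), so the least-loaded
    # phase is always popped first.
    heap = [(0.0, name) for name in phases]
    heapq.heapify(heap)
    assignment = {name: [] for name in phases}

    by_draw = sorted(loads_watts.items(), key=lambda kv: kv[1], reverse=True)
    for machine, watts in by_draw:
        total, name = heapq.heappop(heap)
        assignment[name].append(machine)
        heapq.heappush(heap, (total + watts, name))

    totals = {name: total for total, name in heap}
    return assignment, totals


if __name__ == "__main__":
    # Made-up figures, purely for illustration.
    example = {"a": 500, "b": 400, "c": 300, "d": 300, "e": 200, "f": 100}
    assignment, totals = balance_phases(example)
    print(assignment)
    print(totals)
```

A greedy pass like this does not guarantee the best possible split, but it is usually close, and it is easy to re-run whenever a machine is added or shut down.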
An engineer has attended and identified a fault in the rectifier of the failed UPS. Unfortunately, there are currently insufficient spare parts locally, so parts are coming from abroad and the current estimate is that the repair will be made on Monday or Tuesday next week (i.e. 28.1.19 or 29.1.19). We will keep people informed of progress via postings to the various relevant mailing lists.
Turning to the general issue of the overloading of the Forum server room UPS system: our current UPS system is old, and is no longer powerful enough for the collection of machines which we have installed. We are therefore in the process of organizing a complete replacement of the UPS system, which will be specified to allow us plenty of spare power capacity for future growth. Getting this system installed and commissioned will take significant time and effort (as well as money, of course), and the installation of the new system will necessarily involve some downtime of all of our server rooms. But the final system should provide us with a much more reliable and future-proof solution.
Our current expectation is that the new UPS system will be operational before the end of July of the current year, 2019. Until the new system does become operational, we will be keeping a close eye on the overall power consumption of our server rooms, and might need to do further rebalancing across phases in order to try to maintain a reliable service. In any event, we will keep you informed of progress.
Thanks for your patience and cooperation during the current power supply problems.
We will shortly be upgrading the remaining Forum core network switch. With this one there will be some (expected) visible effects, as follows:
- This switch is normally the “root bridge” for the entire Informatics network. While we do have extensive redundant connectivity, the active paths are usually based around this switch being at the centre. To minimise the effect of removing this switch from the network we will deprioritise it in advance, at an off-peak time, which will have the effect of rebalancing the spanning-tree to use a different root. After the replacement switch is installed and configured, and we are happy with the way it is running, we will again rebalance the spanning-tree back to using this switch as its root. On each of these occasions we anticipate there may be a few seconds of network disruption while the spanning-tree is recalculated.
- As this switch is at the centre of the Forum network, we normally use it as the primary router for most of our subnets. To allow time for DHCP-configured hosts to pick up the change, we will move this function to a different core switch a couple of days ahead of the upgrade. Other than a visible change in path and a little extra load on our intra-core links, there should be no effect from this change.
- While the upgrade is being carried out the link to the Forum’s external router will fall back from 10Gbps to 1Gbps. There is a chance that external connectivity will appear slower while the upgrade is taking place as a result of congestion on this slower link. We will try to minimise this by making this router’s link one of the last to be removed from the old switch and one of the first to be made to the replacement.
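As an aside on the root-bridge point above: spanning tree elects as root the bridge with the lowest bridge ID, comparing the configured priority first and using the MAC address as a tie-break. That is why raising a switch's priority value ("deprioritising" it) hands the root role to another switch. The sketch below illustrates only that election rule. The switch names, MAC addresses and priority values are hypothetical and are not our actual configuration, and the sketch ignores details such as per-VLAN system ID extensions.

```python
def bridge_id(priority, mac):
    """Return a comparable bridge ID: priority first, then MAC as tie-break."""
    return (priority, int(mac.replace(":", ""), 16))


def elect_root(bridges):
    """bridges maps a switch name to (priority, mac); lowest bridge ID wins."""
    return min(bridges, key=lambda name: bridge_id(*bridges[name]))


if __name__ == "__main__":
    # Hypothetical switches; priorities are multiples of 4096.
    bridges = {
        "core-a": (4096, "00:00:5e:00:53:01"),
        "core-b": (8192, "00:00:5e:00:53:02"),
        "edge-1": (32768, "00:00:5e:00:53:03"),
    }
    print(elect_root(bridges))  # core-a has the lowest priority value

    # Deprioritise core-a ahead of its replacement: the root role moves.
    bridges["core-a"] = (61440, "00:00:5e:00:53:01")
    print(elect_root(bridges))
```

Restoring the original priority on the replacement switch moves the root back again, which is the second of the two brief recalculations mentioned above.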
Apologies in advance for any disruption this work causes. This is the last of the old Forum core switches to be replaced. They were bought new over ten years ago when we moved into the Forum, firmware is no longer being released for them, and the replacements are considerably more powerful and so better able to handle the additional load from Bayes as well as the faster Forum edge connectivity.
Technical network documentation is here.
The computing help elves have been busy. In the past few weeks they’ve overhauled the pages on Printing and on Audio-visual facilities. In particular there are now pages on AV in Appleton Tower (formerly not covered at all) and AV in the Informatics Forum (formerly split over many pages).
If you’re looking for help or information on the computer systems in the School of Informatics, take a look at computing.help.inf.ed.ac.uk – you might learn something.
With over 300 pages of technical advice, keeping our computing help pages accurate is something of a Sisyphean task, so perhaps it’s inevitable that some get missed. If you find any inaccurate or outdated information there, please let us know. (Here’s where to find us.)
Apologies for the short network disruption yesterday afternoon. It was caused by a 10Gbps forwarding loop, which was created as the second-last fibre was being connected as part of our core switch upgrade programme. As soon as we realised there was a problem the fibre was disconnected again and the port configuration corrected.
Background: we (Informatics, and the constituent Departments before that) set up our network with redundant paths for resilience, using Rapid Spanning Tree Protocol to manage the links and prevent loops. EdLAN as a whole has different constraints, and they run a different STP variant across the core and no STP at the edge. Over the years there have been incompatibilities between the way these variants operate, and we have seen some instability as a result of STP-related events elsewhere. We have therefore for some time filtered BPDUs at all of our interfaces to EdLAN. This has generally operated well for many years.
So what went wrong yesterday? The cards in the new switch which was being installed yesterday are slightly different from the ones in the old switch, and the port involved in yesterday’s problems was previously set up as a hot-spare EdLAN link. (We keep some links pre-configured so that they can be quickly swapped into operation should there be a fault with our principal link.) As part of the upgrade process that port became one of our “normal” infrastructure links and the hot-spare EdLAN link was moved to a different port. The VLAN configurations were moved correctly, but the BPDU filtering was accidentally left applied to the wrong port. When that port was patched in, therefore, STP did not know to block one of the downstream links, and so a loop was set up. Unicast traffic would still have been operating normally, but we have enough multicast traffic that, once it was looping around, it completely saturated our infrastructure links.
The fix was to disconnect the problem link, so breaking the loop. The BPDU filter was then applied to the correct link, and everything connected up again.
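To illustrate why a single unblocked loop is so damaging, here is a toy model. It is not a simulation of our network. It assumes a frame that is flooded (as multicast and broadcast frames are) is forwarded by every switch on every link except the one it arrived on, and it counts how many transmissions one such frame causes. The switch names `A`, `B` and `C` are hypothetical.

```python
def flood(links, source, max_rounds):
    """Count transmissions caused by flooding one frame from `source`.

    links is a list of (switch, switch) pairs. Each round, every switch
    holding a copy forwards it to all neighbours except the one it came from.
    Stops early if no copies remain in flight.
    """
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    in_flight = [(source, None)]  # (switch holding a copy, where it came from)
    transmissions = 0
    for _ in range(max_rounds):
        next_round = []
        for here, came_from in in_flight:
            for peer in neighbours.get(here, ()):
                if peer != came_from:
                    next_round.append((peer, here))
                    transmissions += 1
        in_flight = next_round
        if not in_flight:
            break
    return transmissions


if __name__ == "__main__":
    looped = [("A", "B"), ("B", "C"), ("C", "A")]  # no link blocked
    blocked = [("A", "B"), ("B", "C")]             # one link blocked, as STP would do
    print(flood(looped, "A", 100))
    print(flood(blocked, "A", 100))
```

In the loop-free case the frame dies out after reaching every switch once. In the looped case copies keep circulating for as long as the model runs, and in a real network every new flooded frame adds to them, which is how the links end up saturated.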
As usual, our technical network documentation is here.
We will shortly begin the process of upgrading the switches which form the core of the Informatics network in the Forum and Appleton Tower. The network has enough resilience built into it that this can happen mostly transparently to users. Where this is not possible, an announcement will be posted in advance, though actual downtime is expected to be only a few seconds while fibre is re-patched.
The switches which form the existing network core in the Forum date back to when we occupied the building, and those in the Appleton Tower core are nearly as old. With the connection of the Informatics floors of the Bayes Centre to the Forum core, it was decided that the time had come to replace that core with more modern, powerful models, which also offer better possibilities for interaction with the new EdLAN which the University is currently procuring. At the same time, HPE’s range and pricing structure made it advantageous to replace the AT core as part of the same process.
We’ll work through the six switches over the next few weeks, taking one down at a time, transferring cards from the old to the new switch where possible, racking up the new switch, and setting it up with (almost) the same configuration as the old one. The process will, of course, be spread out to allow adequate testing of each replacement before moving on to the next one.
Technical documentation on the Informatics network can be found here.
We’ve made a new version of Virtual DICE for the new session. Virtual DICE is the lightweight DICE-like virtual machine which you can install and run on your own computer (here’s how to install it).
We release a new version of Virtual DICE twice a year. This version has the hostname tiepolo. It has most of the same software as DICE machines.
If you have an earlier version of Virtual DICE you should upgrade to the new version. To do that, make backup copies of whatever files you want to keep (for example, copy them to your AFS home directory – and here’s how to access AFS from Virtual DICE) then shut down and delete your Virtual DICE version, then install the new tiepolo version instead.
To find out more read the Virtual DICE help pages.
Next Tuesday (4th September) the remote desktop service which uses the NX technology – nx.inf.ed.ac.uk – will be decommissioned. It will be replaced by a new service which is based on RDP – xrdp.inf.ed.ac.uk – see the computing help page for details.
The machine hosting this service – hammersmith – also needs to be reinstalled, and we need to apply some important firmware updates, so we expect the service to be unavailable for the whole morning.
If you have any queries about this please contact the Computing Team via the usual support form.
The 5th minor update for ScientificLinux 7 (which is based on RHEL7) is now ready for deployment to the Informatics DICE office and student lab machines. A minor update like this provides us with the opportunity to update important software and fix any bugs which are not security issues (we apply security updates as soon as they are available) in a controlled manner.
Notable updates include a switch from VirtualBox 5.1 to 5.2, and upgrades of LibreOffice from 188.8.131.52 to 184.108.40.206, Qt5 from 5.6.2 to 5.9.2 and R from 3.4.2 to 3.5.0. Along with the general updates, this platform upgrade also provides a fully supported Python 3 environment based on version 3.4.8, which includes the full “scientific python” stack of scipy, numpy, matplotlib, ipython, pandas, etc.
We plan to make this change on the evening of Monday 20th August. This has been scheduled so that it is after the exam resits have finished but still well before the start of Semester 1, so that teaching staff have sufficient time to test their course work. If you would like to have any of your DICE machines upgraded sooner, that can be arranged: please get in touch.
SL7.5 was released on May 10th 2018. Since then it has been thoroughly tested in our DICE environment, so we are confident that this update will not cause any issues for users.
Full details of the package updates are available on the LCFG wiki. For further, in-depth information, there are also release notes from ScientificLinux and Red Hat.
If you have any questions or problems with the upgrade please contact our User Support team using the support form.
Our remote wake-up service will be down this weekend, 14-15 July, because of planned work to the electricity supply in the James Clerk Maxwell Building. Sorry for the inconvenience. It should be back to normal on Monday morning.
For the last 5 years we have provided a remote DICE desktop service which is based on the NX technology. Although this system has some nice benefits, particularly being light on bandwidth requirements, the technology is beginning to show its age and there is a serious shortage of good client software for many platforms.
With this in mind we have made the decision to switch to providing an RDP based service. This has very good client support for all platforms (Windows, MacOSX and Linux). Currently we are at the stage of being able to offer staff users early access to a test version of the service at staff.xrdp.inf.ed.ac.uk. We’re hoping that people will try it out and report back any problems they experience. Full information on how to use the service is available on our Computing Help site.
Once we’re confident there are no major issues we intend to replace the two NX services with equivalents based on RDP. The aim is to get the new service entirely rolled out before the start of Semester 1 in September.