As previously advised, the new Forum server room UPS (Uninterruptible Power Supply) will be connected to the server rooms on Saturday the 20th of July. This will necessitate the powering down of all machines in the Forum server rooms, and therefore the loss of the Informatics services provided by those machines.
It is probably safest to assume that all services will be affected all day Saturday, but some AFS file space and some other services should remain available. Below is a list of services that will definitely be unavailable, and some which should remain working (by virtue of being housed in either Appleton Tower or KB). If a service is not listed below, that does not tell you whether it will or will not be available!
The self-managed server rooms will lose power at 7:30am, with the main server rooms following at about 11am (though machines will be shut down from 10am). Services will be brought back online once the power work is complete. There may be false starts and extra reboots as machines install updates. We hope all will be restored by 6pm.
List of services that will definitely be unavailable:
- pp.inf.ed.ac.uk – password changes will not be possible
- half of xrdp.inf.ed.ac.uk
- staff.compute.inf.ed.ac.uk – shut down on Friday
- student.login.inf.ed.ac.uk – shut down on Friday
- gresley, huldra, kraken, peppercorn, riddles
Use “fs whereis <path>” to see which server a particular path is on, e.g.:
> fs whereis ~neilb
File /afs/inf.ed.ac.uk/user/n/neilb is on host kraken.inf.ed.ac.uk
So my home directory would be unavailable during the power work.
Other websites that serve data from affected AFS servers will also be down.
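The check above can also be scripted. What follows is only a minimal Python sketch: it assumes that the five hosts named in the list above (gresley, huldra, kraken, peppercorn, riddles) are the affected AFS file servers, and that “fs whereis” prints lines in the format shown in the example. The function name and the host set are illustrative, not part of any DICE tool.

```python
import re

# Assumption: the hosts listed above as unavailable are the affected AFS servers.
AFFECTED_HOSTS = {"gresley", "huldra", "kraken", "peppercorn", "riddles"}

# Matches lines such as:
#   File /afs/inf.ed.ac.uk/user/n/neilb is on host kraken.inf.ed.ac.uk
# "hosts" (plural) is accepted too, in case a path is reported on several servers.
_WHEREIS_LINE = re.compile(r"^File (.+?) is on hosts? (.+)$")


def affected_hosts(whereis_line):
    """Return the sorted short names of affected hosts named in one output line."""
    match = _WHEREIS_LINE.match(whereis_line.strip())
    if match is None:
        return []  # not a line we recognise
    short_names = {host.split(".")[0] for host in match.group(2).split()}
    return sorted(short_names & AFFECTED_HOSTS)
```

Passing the example output above to affected_hosts() returns a non-empty list, meaning that path lives on one of the servers due to lose power.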
List of notable services that should remain up:
- mail.inf.ed.ac.uk (lists and forwarding to staffmail/Office 365)
- ssh.inf.ed.ac.uk aka student.ssh.inf.ed.ac.uk (though if your home directory is affected, this will be of little use)
- www.inf.ed.ac.uk (though in read-only mode; publishing will stop working around 9am)
The 6th minor update for ScientificLinux 7 (which is based on RHEL7) is now ready for deployment to the Informatics DICE office and student lab machines. A minor update like this provides us with the opportunity to update important software and fix any bugs which are not security issues (we apply security updates as soon as they are available) in a controlled manner.
Along with the general updates for this platform upgrade, we are pleased to announce that our Python 3 environment has been updated to version 3.6.8, which includes the full “scientific python” stack of scipy, numpy, matplotlib, ipython, pandas, etc. We have attempted to provide the most commonly required modules for Python 3.6. If there are any additional modules you require for teaching next year, please let us know as soon as possible.
We plan to make this change on the evening of next Monday (17th June). To complete the upgrade a reboot is required. This will happen overnight for the student lab machines. For all other DICE office desktops a delayed reboot has been scheduled, with a delay of 5 days. Although the reboots are delayed, it would be greatly appreciated if people could manually reboot their machines at their earliest convenience; the delayed reboot would then be cancelled.
SL7.6 was released on 3rd December 2018 and since then it has been thoroughly tested in our DICE environment so we are confident that this update will not cause any issues for users.
Full details of the package updates are available on the LCFG wiki. For further, in-depth information, there are also release notes from ScientificLinux and Red Hat.
If you have any questions or problems with the upgrade please contact our User Support team using the support form.
Earlier this year, we mentioned that we intended to replace the Uninterruptible Power Supply (UPS) system which supplies power to all of the server rooms located in the Informatics Forum – see our blog post Forum server room UPS from January 24, 2019.
Since then, we’ve been doing lots of preparation, and we are now about to commence the actual replacement programme. The new UPS we’ve chosen will be able to sustain a power load of 200kW – more-or-less twice the load which the existing UPS can supply – and it will also be far more resilient than the existing system. We currently expect the new UPS will be fully in operation by Wednesday July 24, 2019 but, between then and now, there is a great deal of electrical infrastructure work to be completed, and some of that work will cause unavoidable disruption.
Some key events (and dates/times) for your attention are as follows:
Isolation of the building-wide Forum UPS: Tuesday 28th May, 2019; 7:00am
Explanation: As well as the UPS which supplies our server rooms, the Informatics Forum also has a completely separate UPS system which currently supplies all offices, and all IT closets. As part of the current programme, that UPS system will be decommissioned and permanently removed. Arranging that will require a brief (we expect no more than five minutes) power cut to all Forum offices and all IT closets at 7:00am on Tuesday 28th May, 2019.
Load-shedding from the Forum server rooms: Thursday 20th June, 2019 – Monday 22nd July, 2019
Explanation: During the replacement of the server room UPS, we will need to operate our server rooms for the above four-week period using only one half of the existing UPS system. In order to make that feasible, we will need to reduce the combined power load currently being used by all of the servers located in the Informatics Forum by about 20%.
Shutdown of Forum server rooms: Saturday 20th July, 2019; all day
Explanation: In order to bring the new server room UPS fully into service, we will need a total shutdown of all Forum server rooms on Saturday 20th July, 2019.
We’ll be in touch with owners of self-managed servers regarding the second and third items above (the load-shedding and the shutdown) closer to the date. Meanwhile, if you have any questions about this work, please submit a support ticket in the usual way.
With the introduction of the University’s centrally provided blogging service (blogs.ed.ac.uk), no new blogs will be created on the Informatics blog.inf.ed.ac.uk and wp.inf.ed.ac.uk services.
The central service is based on a current version of WordPress, and has a selection of modern themes, and plugins, including an EdGEL one to match the University’s branding.
Using the central service, and phasing out the use of our own blogging services, will save us from duplicating the effort needed to keep the services GDPR and accessibility compliant, and to apply the regular WordPress updates.
Neil – Services Unit
We are about to upgrade all rack power bars in the Informatics Forum self-managed server room IF-B.Z14. We will be replacing the existing metered power bars (APC models AP7853 and AP8853) with switched versions (APC model AP8953).
The new power bars will allow us fine-grained control of individual power outlets; they’ll also help us prevent power surges (which have in the past tripped circuit breakers in our distribution boards) should the incoming power supply fail and then recover for any reason.
If all servers in all racks in IF-B.Z14 had dual power supplies, we could do this work with no interruption to any users of those servers. However, as things stand, only Rack 3 is fully populated by such servers: all other racks contain machines which have single supplies only. So arranging the power bar replacements on those racks will inevitably mean that machines will temporarily lose power.
We will start the replacement next week (i.e. week commencing 18th Feb 2019) and will replace both power bars on Rack 3. As mentioned above: this should not cause any interruptions for users of servers in that rack.
Provided that these initial replacements go to plan, we will then be in touch with users involved (via the selfmanaged-sr mailing list – see http://computing.help.inf.ed.ac.uk/smserver-rooms) to schedule downtime for the seven other racks in the room. We would like to get this work completely finished in as short a time as possible.
Thanks in advance for your patience and cooperation.
Last weekend (Saturday 19.1.19), the UPS system which supplies power to the various server rooms located in the Informatics Forum developed a fault which meant that one half of the pair of units which comprise the UPS went off-line. As a result, the other half of the pair became overloaded: we now have so many machines installed in the Forum server rooms that we are really at the limit of what the UPS is designed to cope with.
In order to deal with the overloading, we have since shut down various machines, and we have also moved some machines to different electrical phases: the idea is to try to have the overall load balanced as well as we can between the three electrical phases in use. Owners of self-managed machines, as well as owners of machines which the computing staff manage on their behalf, have been helpful to us in this overall effort – for which, many thanks.
An engineer has attended and identified a fault in the rectifier of the failed UPS. Unfortunately, there are currently insufficient spare parts locally, so parts are coming from abroad and the current estimate is that the repair will be made on Monday or Tuesday next week (i.e. 28.1.19 or 29.1.19). We will keep those involved informed of progress via postings to the various relevant mailing lists.
On the general issue of the overloading of the Forum server room UPS system: our current UPS system is old, and is now insufficiently powerful for the collection of machines which we have installed. We are therefore in the process of organizing a complete replacement of the UPS system, which will be specified to allow us plenty of spare power capacity for future growth. Getting this system installed and commissioned will take significant time and effort (as well as money of course), and the installation of the new system will necessarily involve some downtime of all of our server rooms. But the final system should provide us with a much more reliable and future-proof solution.
Our current expectation is that the new UPS system will be operational before the end of July of the current year, 2019. Until the new system does become operational, we will be keeping a close eye on the overall power consumption of our server rooms, and might need to do further rebalancing across phases in order to try to maintain a reliable service. In any event, we will keep you informed of progress.
Thanks for your patience and cooperation during the current power supply problems.
We will shortly be upgrading the remaining Forum core network switch. With this one there will be some (expected) visible effects, as follows:
- This switch is normally the “root bridge” for the entire Informatics network. While we do have extensive redundant connectivity, the active paths are usually based around this switch being at the centre. To minimise the effect of removing this switch from the network we will deprioritise it in advance, at an off-peak time, which will have the effect of rebalancing the spanning-tree to use a different root. After the replacement switch is installed and configured, and we are happy with the way it is running, we will again rebalance the spanning-tree back to using this switch as its root. On each of these occasions we anticipate there may be a few seconds of network disruption while the spanning-tree is recalculated.
- Being at the centre of the Forum network, we normally use this switch as the primary router for most of our subnets. To allow time for DHCP-configured hosts to pick up the change we will move this function to a different core switch a couple of days ahead of the upgrade. Other than a visible change in path and a little extra load on our intra-core links there should be no effect from this change.
- While the upgrade is being carried out the link to the Forum’s external router will fall back from 10Gbps to 1Gbps. There is a chance that external connectivity will appear slower while the upgrade is taking place as a result of congestion on this slower link. We will try to minimise this by making this router’s link one of the last to be removed from the old switch and one of the first to be made to the replacement.
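For anyone curious about the root bridge change in the first point: spanning tree elects as root the switch with the lowest bridge ID, which is compared first on a configurable priority and then on MAC address. Deprioritising a switch therefore means giving it a numerically higher priority. The following is only a rough Python sketch of that election rule; the switch names, priorities and MAC addresses are made up for illustration and are not our real configuration.

```python
def bridge_id(priority, mac):
    """Bridge ID as a comparable tuple: priority first, then MAC address."""
    return (priority, int(mac.replace(":", ""), 16))


def elect_root(bridges):
    """Return the name of the switch with the lowest bridge ID."""
    return min(bridges, key=lambda name: bridge_id(*bridges[name]))


# Hypothetical switches: name -> (priority, MAC address)
bridges = {
    "core-a": (4096, "00:00:5e:00:53:01"),  # the usual root
    "core-b": (8192, "00:00:5e:00:53:02"),
}
root_before = elect_root(bridges)

# Deprioritise core-a ahead of its replacement: a higher number loses.
bridges["core-a"] = (32768, "00:00:5e:00:53:01")
root_during = elect_root(bridges)

# Once the replacement is bedded in, restore the low priority.
bridges["core-a"] = (4096, "00:00:5e:00:53:01")
root_after = elect_root(bridges)
```

Each change of root forces the tree to be recalculated, which is where the few seconds of disruption mentioned above would come from.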
Apologies in advance for any disruption this work causes. This is the last of the old Forum core switches to be replaced. They were bought new over ten years ago when we moved into the Forum, firmware is no longer being released for them, and the replacements are considerably more powerful and so better able to handle the additional load from Bayes as well as the faster Forum edge connectivity.
Technical network documentation is here.
The computing help elves have been busy. In the past few weeks they’ve overhauled the pages on Printing and on Audio-visual facilities. In particular there are now pages on AV in Appleton Tower (formerly not covered at all) and AV in the Informatics Forum (formerly split over many pages).
If you’re looking for help or information on the computer systems in the School of Informatics, take a look at computing.help.inf.ed.ac.uk – you might learn something.
With over 300 pages of technical advice, keeping our computing help pages accurate is something of a Sisyphean task, so perhaps it’s inevitable that some get missed. If you find any inaccurate or outdated information there, please let us know. (Here’s where to find us.)
Apologies for the short network disruption yesterday afternoon. It was caused by a 10Gbps forwarding loop, which was created as the second-last fibre was being connected as part of our core switch upgrade programme. As soon as we realised there was a problem the fibre was disconnected again and the port configuration corrected.
Background: we (Informatics, and the constituent Departments before that) set up our network with redundant paths for resilience, using Rapid Spanning Tree Protocol to manage the links and prevent loops. EdLAN as a whole has different constraints, and they run a different STP variant across the core and no STP at the edge. Over the years there have been incompatibilities between the way these variants operate, and we have seen some instability as a result of STP-related events elsewhere. We have therefore for some time filtered BPDUs at all of our interfaces to EdLAN. This has generally operated well for many years.
So what went wrong yesterday? The cards in the new switch which was being installed yesterday are slightly different from the ones in the old switch, and the port involved in yesterday’s problems was previously set up as a hot-spare EdLAN link. (We keep some links pre-configured so that they can be quickly swapped into operation should there be a fault with our principal link.) As part of the upgrade process that port became one of our “normal” infrastructure links and the hot-spare EdLAN link was moved to a different port. The VLAN configurations were moved correctly, but the BPDU filtering was accidentally left applied to the wrong port. When that port was patched in, therefore, STP did not know to block one of the downstream links, and so a loop was set up. Unicast traffic would still have been operating normally, but we have enough multicast traffic that was looped around to completely saturate our infrastructure links.
The fix was to disconnect the problem link, so breaking the loop. The BPDU filter was then applied to the correct link, and everything connected up again.
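To illustrate why a misplaced BPDU filter matters, here is a deliberately simplified Python model, not a description of how our switches actually implement STP. It assumes that a link with BPDU filtering is invisible to spanning tree and so is never blocked, and that spanning tree blocks just enough of the links it can see to leave no cycles. The three-switch triangle is hypothetical.

```python
def _find(parent, node):
    """Union-find lookup with path halving."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def forwarding_links(links, filtered):
    """Links left forwarding: a loop-free subset of the links STP can see,
    plus every BPDU-filtered link (STP cannot see those, so never blocks them)."""
    parent = {node: node for link in links for node in link}
    forwarding = []
    for link in links:
        if link in filtered:
            continue  # invisible to STP in this model
        root_a, root_b = _find(parent, link[0]), _find(parent, link[1])
        if root_a != root_b:  # link does not close a cycle, so keep it
            parent[root_a] = root_b
            forwarding.append(link)
    forwarding.extend(link for link in links if link in filtered)
    return forwarding


def has_loop(links):
    """True if the given set of forwarding links contains a cycle."""
    parent = {node: node for link in links for node in link}
    for a, b in links:
        root_a, root_b = _find(parent, a), _find(parent, b)
        if root_a == root_b:
            return True
        parent[root_a] = root_b
    return False


# Hypothetical triangle of switches with one redundant link.
triangle = [("A", "B"), ("B", "C"), ("A", "C")]
loop_without_filter = has_loop(forwarding_links(triangle, set()))
loop_with_misplaced_filter = has_loop(forwarding_links(triangle, {("A", "C")}))
```

In this model the triangle is loop-free when spanning tree can see all three links, but a loop appears as soon as one internal link is filtered. Removing that link, or moving the filter back to where it belongs, breaks the loop again.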
As usual, our technical network documentation is here.
We will shortly begin the process of upgrading the switches which form the core of the Informatics network in the Forum and Appleton Tower. The network has enough resilience built into it that this can happen mostly transparently to users. Where this is not possible, an announcement will be posted in advance, though actual downtime is expected to be only a few seconds while fibre is re-patched.
The switches which form the existing network core in the Forum date back to when we occupied the building, and those in the Appleton Tower core are nearly as old. With the connection of the Informatics floors of the Bayes Centre to the Forum core, it was decided that the time had come to replace that core with more modern, powerful models, which also offer better possibilities for interaction with the new EdLAN which the University is currently procuring. At the same time, HPE’s range and pricing structure made it advantageous to replace the AT core as part of the same process.
We’ll work through the six switches over the next few weeks, taking one down at a time, transferring cards from the old to the new switch where possible, racking up the new switch, and setting it up with (almost) the same configuration as the old one. The process will, of course, be spread out to allow adequate testing of each replacement before moving on to the next one.
Technical documentation on the Informatics network can be found here.