Network upgrades

As part of our rolling programme of network upgrades and replacement of old kit, the following have either happened or are planned to happen soon:

  1. The remaining “gigabit” switches in the Forum will be replaced with current models, completing the upgrade begun last year.  Ports with labels beginning 4/B, 5/B and x/C (for all x in 0..3) may experience a short outage as the old switches are removed and the new ones installed.  We have not yet scheduled this and, as last time, we expect the availability of the relevant computing and technical staff to be the tightest constraint.  Email warnings will, of course, be issued nearer the time.
  2. At the beginning of January our link to EdLAN via Appleton Tower was upgraded to 10Gbps (previously 2 x 1Gbps).  This link carries the bulk of our external routed traffic, as well as VoIP phones and wireless.  We are now in the process of installing a new primary external router for the Forum, also with a 10Gbps link.  This should alleviate the traffic bottleneck which has affected us several times recently.  (We have a second link to EdLAN via Old College, for resilience and load-sharing.)
  3. Both the Forum and Appleton Tower “network services” servers will also shortly be upgraded.  These run our OpenVPN endpoints, as well as providing DNS service for self-managed machines.
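
For anyone wanting to check a list of port labels against item 1, the affected prefixes can be written down as a small Python sketch. It assumes that "0..3" is inclusive and that a plain prefix match on the label is enough; the function name and the example labels are hypothetical, not taken from our port database.

```python
# Prefixes from item 1: 4/B, 5/B, and x/C for x in 0..3 (assumed inclusive).
AFFECTED_PREFIXES = ("4/B", "5/B") + tuple(f"{x}/C" for x in range(4))


def port_may_be_affected(label: str) -> bool:
    """Return True if a port label begins with one of the affected prefixes."""
    return label.strip().upper().startswith(AFFECTED_PREFIXES)


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    for label in ("4/B/12", "2/C/07", "4/C/01"):
        print(label, port_may_be_affected(label))
```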

For those interested, network documentation with diagrams can be found here.

Posted in Uncategorized | Leave a comment

Home directory quota problem on Tuesday

On Tuesday 4th of March there was a period of a few minutes, from 11:15am to about 11:25am, when most people’s home directory quotas were incorrectly shrunk to 2MB. Anything trying to write to home directories during that time will have failed.

This happened because the script that calculates people’s quotas had not been updated to take into account a change in our account management system. This change had been flagged by our colleagues weeks in advance, but the dependency of this script on the change had not been spotted. When the change happened on the Tuesday morning, the quotas script was no longer able to determine the roles a user had, and so could not allocate the correct quota, eg 10GB for staff, 2GB for UG1 and UG2. So it defaulted to the minimal quota of 2MB.

The quick fix was to raise that minimal quota to 20GB for everyone; later we updated the quotas script to use the new location for the user role information. The previous, correct role-based quotas were then reapplied, eg 10GB for staff, 2GB for UG1 etc.
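
The failure mode can be illustrated with a minimal Python sketch. This is not the real quotas script: the role names, the table layout and the rule of taking the largest quota among a user's roles are all assumptions made for illustration. Only the figures (10GB, 2GB, 2MB) come from the description above.

```python
GB = 1024 ** 3
MB = 1024 ** 2

# Example figures from the post; the role names are hypothetical labels.
ROLE_QUOTAS = {"staff": 10 * GB, "ug1": 2 * GB, "ug2": 2 * GB}
MINIMAL_QUOTA = 2 * MB  # the fallback that was applied during the incident


def quota_for(roles, default=MINIMAL_QUOTA):
    """Return the largest quota among the user's known roles, else the default."""
    known = [ROLE_QUOTAS[role] for role in roles if role in ROLE_QUOTAS]
    return max(known, default=default)


if __name__ == "__main__":
    print(quota_for(["staff"]) // GB, "GB")  # roles found: normal quota
    print(quota_for([]) // MB, "MB")         # role lookup empty: minimal quota
```

With a design like this, an empty role list is indistinguishable from a user who genuinely has no roles, which is how a broken lookup quietly turns into a 2MB quota.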

As this user data (like roles) is now retrieved from a central source, it will be easier in future to see what queries are being made for that data, and so what would be affected by any changes to that central source.

Sorry for this break in operation.

Services Unit


Gas explosion in AT basement (not for real!)

Every now and then, we test our preparedness for disasters by holding a mock disaster exercise.

On the 13th January, the computing staff were told that over the preceding weekend, as a result of a gas explosion in the basement of Appleton Tower, all of our IT equipment in the AT basement had been destroyed.

Each computing unit was asked to produce a report on what services would have been affected and the state of the backups for those services. They were also asked to test reinstall one service just from those backups. The reports are available here.

In summary, the only data lost was scratch data on a small number of servers. A config file for the Plone service was lost, but could easily have been restored from a number of external web sites.



Upgrade of DICE desktops to Scientific Linux 7

Red Hat, who provide the Linux platform on which DICE is based, have recently released a beta version of their latest release, RHEL 7.

We have started work on porting DICE to this platform, with a view to upgrading DICE desktops to a RHEL 7 based platform this summer. This will result in many core applications being upgraded.

For further information, see the project home page.


Scanning for vulnerable systems

This article describes a couple of security enhancements which the Computing Team will be developing over the next few months.

As I mentioned last time, we have recently started scanning all our externally-visible machines for security vulnerabilities using the JANET ESISS penetration-testing service.  In order to use the service as effectively as possible we need an up-to-date list of the URLs of web sites to be tested.  For managed servers, our configuration database contains the necessary information.  For self-managed machines we propose extracting URLs from the traffic going to the servers on those machines, which we expect should keep the list automatically current.
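
As a rough illustration of the idea only (the extraction mechanism has not been decided, and this is not it), the Python sketch below rebuilds a URL from the request line and Host header of a single unencrypted HTTP/1.x request. The function name and the example request are hypothetical. Encrypted (HTTPS) traffic could not be read this way.

```python
def url_from_request(raw: bytes, scheme: str = "http"):
    """Return the URL named by an HTTP/1.x request head, or None if unparseable."""
    # Keep only the header section and decode it byte-for-byte.
    head = raw.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    lines = head.split("\r\n")

    # Request line: METHOD SP request-target SP HTTP-version
    parts = lines[0].split(" ")
    if len(parts) != 3 or not parts[2].startswith("HTTP/"):
        return None
    target = parts[1]

    # Find the Host header (header names are case-insensitive).
    host = None
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "host":
            host = value.strip()
            break

    # Only handle the common origin-form target ("/path?query").
    if not host or not target.startswith("/"):
        return None
    return f"{scheme}://{host}{target}"


if __name__ == "__main__":
    sample = b"GET /index.html HTTP/1.1\r\nHost: www.example.org\r\n\r\n"
    print(url_from_request(sample))
```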

We are also evaluating the use of the Snort intrusion detection system, in the hope that it might be able to alert us to the presence of compromised machines or services on our network.  It looks promising, but we are still at the initial stages with it, and it is not yet clear whether the load it would place on our edge routers would stop us running it as we would like.

Both of these will require the automated inspection of traffic passing through our edge routers, with the Head of School’s permission under the terms of the Lawful Business Practice regulations.  This will, of course, be kept to the absolute minimum necessary for the purpose.


Self-managed machines, particularly with firewall holes

Users of self-managed machines are reminded that School policy requires that they should make all reasonable efforts to secure those machines.  This applies particularly to those which have firewall holes.

Machines must be running a current OS version, and patching must be kept up-to-date.  If you have any services running, please make sure that you have turned off unnecessary options, and have changed all default passwords.  For example, in one recent compromise of a self-managed machine a default Tomcat manager account was used to install botnet modules which were then used to attack other systems.

You should not assume that your system won’t be found just because it is not actively advertised (e.g. in the DNS or through links on the web).  On the contrary, scanning is widespread.  Our own logs show that any IP address, even one which has never been used for externally-visible machines, is likely to be probed several dozen times per day.

The University has subscribed to the JANET ESISS penetration-testing service. We now use it to scan all managed and self-managed machines with external firewall holes, and will be following up its warnings with machines’ managers.  However, it won’t catch everything, so you should still take care with your configurations.

Please contact Support in the usual way if you would like to discuss your self-managed machine.


Christmas closure – saving energy

Please help to reduce the University’s energy bill by switching off any equipment, including computers and monitors, that you are unlikely to use over the Xmas break.

If you are responsible for research group servers, please consider powering these down over the vacation. Contact support if you need computing staff to do this for you.

You can power off a DICE box either by briefly pressing the power button on the front of the machine or by choosing the Shutdown option from the menu at the bottom of the DICE login screen.

If you think that you may want to remotely access your desktop over the holidays, just let your machine sleep as normal. You can wake the machine again by going to Self-managed machines can also be awoken using this mechanism; see this computing help page for details.


Replacement of failed SAN controller

We have received a replacement for the SAN controller that failed following the power cut of the 11th of November. We had to abort our plan to replace it yesterday (Monday 18th) after we discovered there’s not enough slack in the cables to manoeuvre the failed controller out and the replacement controller in. Unfortunately this now means we will have to turn off the SAN box to do the replacement.

This SAN box (ifevo4 as we call it) currently serves about 50TB of data, all of which will be unavailable while it is powered down to replace the controller. Given the disruption this may cause, we plan to do the work starting at 10:30am on Sunday 24th November.

To minimise the disruption, for the servers that have both local disk storage and SAN mounted storage, we will unmount the SAN storage and leave the local disk data available. This means that, apart from a couple of short breaks (a couple of minutes each), most home directories will remain available.

The rest of the SAN mounted data (mostly group space) will be unavailable for the duration of the controller swap, which should take between 30 minutes and an hour.

To check if your home directory is on local disk, run the “homedir” command. If it says your home directory is on either a /vicepa, /vicepb or /vicepc partition, then you will be fine (apart from the brief interruptions). eg in my case:

neilb> homedir
neilb (Neil Brown) : nessie/vicepc : /afs/ : free
162.2G (used 64%)

So I’m on server “nessie” and partition “/vicepc”, so should be fine. We realise that some users are still on SAN mounted space, and between now and Sunday, we’ll be moving who we can to local disk.
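
If you want to script this check, a minimal Python sketch follows. It assumes that the first line of “homedir” output always has the " : "-separated layout shown in the example above, which is inferred from that single example; the function names are hypothetical.

```python
# Partitions on local disk, as listed above.
LOCAL_PARTITIONS = {"vicepa", "vicepb", "vicepc"}


def parse_homedir_line(line: str):
    """Split 'user (Name) : server/partition : ...' into (server, partition)."""
    fields = [field.strip() for field in line.split(" : ")]
    if len(fields) < 2 or "/" not in fields[1]:
        raise ValueError(f"unrecognised homedir output: {line!r}")
    server, partition = fields[1].split("/", 1)
    return server, partition.lstrip("/")


def on_local_disk(line: str) -> bool:
    """Return True if the partition named in the line is a local one."""
    _, partition = parse_homedir_line(line)
    return partition in LOCAL_PARTITIONS


if __name__ == "__main__":
    example = "neilb (Neil Brown) : nessie/vicepc : /afs/ : free"
    print(parse_homedir_line(example), on_local_disk(example))
```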

All other networked file space will be unavailable during the replacement, eg everything under /group or /afs/

If there are any major problems with the planned date and time, please get in touch as soon as possible, but the longer we run on a single controller the bigger the risk of further unplanned failures.

Services Unit


Forum power failure on 11th November

We suffered a power failure in the Informatics Forum on Monday 11th November 2013, starting at about 11am and ending at about 1pm. We still have no information from Estates and Buildings as to why this failure occurred.

Many users were surprised to find that whilst power to their desktops and the network wasn’t interrupted, many of the School’s servers shut down shortly after 11am.

Emergency power for servers based in the Forum is provided by a pair of UPSes. These are primarily intended to allow us to weather short power interruptions (eg a few minutes) and to shut servers down cleanly during longer ones. When both UPSes are fully functional, we have around 45 minutes of runtime on battery (given our current power load). Unfortunately, one of the UPSes has been out of action for a number of months, reducing our runtime on battery to 20 minutes.

Emergency power for offices and the network is provided by a single building UPS. This has a runtime of around 3 hours on battery, given our current power load. It is worth noting that the energy overhead of the building UPS is quite high, and consideration is being given to withdrawing it from service. No other University building has this level of cover for offices.

When power was reinstated at 1pm, the majority of services resumed reasonably quickly. However, the hardware failure of a disk controller in one of the storage arrays had a knock-on effect on a number of services (eg AFS). Power to some less critical services (eg the Hadoop cluster) wasn’t immediately restored, in case the power dropped again.

You can read our post mortem here.


New look EASE login page

As people using the University’s EASE web login page will hopefully have noticed, the look of the page will be changing at 8am on the 5th of November.

The image below shows you what to expect. There may be small differences, but generally it will look like the following:

[Image: New look EASE login page from November 5th 2013]

So rest assured this change is legitimate and intended.

