Windows XP – end of life

Windows XP was first launched in 2001, and Microsoft withdrew support for it on 8th April 2014. There is a useful page giving full details:

http://windows.microsoft.com/en-GB/windows/end-support-help

That page recommends upgrading to Windows 8.1, but you may find that your machine is not capable of running Windows 8, in which case you may prefer to try Windows 7 or even invest in some new kit! Provided you are eligible to register with Dreamspark, you can download either Windows 7 or 8 from there. Details on Dreamspark can be found here:

https://computing.help.inf.ed.ac.uk/self-managed-windows

As Microsoft will no longer be providing updates to protect your machine, it will become more vulnerable to new security risks and viruses, which will not be fixed for XP. If you have a Windows XP machine within Informatics which currently has firewall holes opened, we will need to close these holes to reduce the risks to other users. Our self-managed and personal machines policy also states that:

“Users must ensure that the software and systems on their machines are kept fully patched and appropriately configured against security vulnerabilities. If they use a system which is prone to viruses, they must ensure that they have adequate and current protection installed.”

The full page can be found at:

https://computing.help.inf.ed.ac.uk/self-managed-policy

Posted in Uncategorized | Leave a comment

Virtual DICE

DICE is also available as a virtual machine. We call it Virtual DICE. It can run in a variety of environments – Windows, Linux, Mac, and so on. The computing help pages give details on how to download, install and use it.

Virtual DICE was previously in testing. It’s now a supported service, so please get in touch with computing support if you have problems with it which can’t be solved with the help of the documentation.

Here’s a screenshot:

Virtual DICE running on a DICE machine

SAN problems of 27th March 2014

Following the unplanned power cut on Tuesday, one of our SAN machines (ifevo4) started reporting a problem with the flash cache memory in one of its controllers. The machine has two controllers, A and B, for redundancy, each with two Fibre Channel (FC) connections, one to each of our fabrics (networks). This means that should one controller fail, the other will take over its duties and service will remain uninterrupted.

After we reported the fault in controller A, our supplier shipped a replacement controller. To minimise the length of time ifevo4 was running in a degraded state, we decided to replace the controller on Thursday after 5pm. Because of the redundancy, this should not have caused any problems for the running service.

The redundancy only fully works if the client machines (our file servers) are configured to use the multiple paths to the FC connections on both controllers. I assumed they were (but didn’t actually check), as that’s how it should be, and when we’d had a separate fabric failure a week previously, all the servers had continued to work via the one remaining path/fabric without any issue. Unfortunately, not all the volumes on ifevo4 were as fully redundant as they should have been. In some cases the volumes were only accessible via controller A, rather than via both A and B. This was a configuration error that had probably gone unnoticed since November 2013.

So when I removed controller A, the volumes that were only accessible via controller A became inaccessible to the servers mounting them, causing problems for anyone trying to access data on those volumes. As it is generally group file space that is mounted from the SAN (home volumes are on disks local to the servers), not many people noticed at this point.

Unfortunately, reattaching the failed volumes (once controller A had been replaced) typically means checking the consistency of (salvaging) all the data on the server, during which time the file server will not serve any files, even those unaffected by the loss of controller A. As our file servers have several terabytes of data to check, this means no access to any files for a couple of hours.

To give people a chance to finish anything they were working on, I sent a mail explaining that I would reattach and salvage the affected volumes at 8pm. As it turned out, after rebooting the servers at 8pm, I was able to salvage the volumes individually, without affecting the availability of the working volumes. So apart from a 5 minute break at about 8pm, file access kept working. Over the next couple of hours the volumes affected by the controller A replacement gradually came back online. Most files were back by 10:30pm.

The reason that some of the volumes were configured to use only one controller is unknown. The most likely explanation is that they were all on the JBOD part of ifevo4. The JBOD is an expansion unit containing just extra disks. It was previously attached to an older version of the SAN hardware (ifevo2), which also had dual controllers and multiple FC connections to our fabrics. Back in November 2013 we shut down ifevo2, disconnected the JBOD, and attached it to the new ifevo4. At that point everything seemed to be working fine: the file servers just continued to access the volumes from their new location, and multiple paths to the data were available, so we appeared to have redundancy. I suspect this is where the problem was introduced, and that had we looked more closely we would have seen that, though we had multiple paths via our two different fabrics, they all led to a single controller.

Since the problem on Thursday, all the paths have been checked and, where necessary, updated to make sure there are multiple paths to both controllers on ifevo3 and ifevo4. In future, should we need to change a controller again, we will double-check that those paths are still in place before replacing it.
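The check described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used: the `(volume, fabric, controller)` tuples, the volume names and the function name are all hypothetical, and gathering the real path list from the file servers is left out. The point it shows is that counting paths is not enough; each volume must have a path to every controller.

```python
from collections import defaultdict

# Assumption: the SAN has exactly two controllers, labelled A and B.
REQUIRED_CONTROLLERS = {"A", "B"}


def volumes_missing_redundancy(paths):
    """Return {volume: [missing controllers]} for volumes lacking a path to both.

    `paths` is an iterable of (volume, fabric, controller) tuples
    (a hypothetical format chosen for this sketch).
    """
    seen = defaultdict(set)
    for volume, _fabric, controller in paths:
        seen[volume].add(controller)
    return {
        volume: sorted(REQUIRED_CONTROLLERS - controllers)
        for volume, controllers in seen.items()
        if not REQUIRED_CONTROLLERS <= controllers
    }


# Hypothetical data: jbod_vol1 has two paths over two fabrics,
# but both paths end at controller A.
example_paths = [
    ("group_vol1", "fabric1", "A"),
    ("group_vol1", "fabric2", "B"),
    ("jbod_vol1", "fabric1", "A"),
    ("jbod_vol1", "fabric2", "A"),
]

print(volumes_missing_redundancy(example_paths))  # {'jbod_vol1': ['B']}
```

In this made-up data, `jbod_vol1` would survive a fabric failure but not the removal of controller A, which matches the situation described in this post.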

Neil

Network upgrades

As part of our rolling programme of network upgrades and replacement of old kit, the following have either happened or are planned to happen soon:

  1. The remaining “gigabit” switches in the Forum will be replaced with current models, completing the process of upgrading these switches which was begun last year.  Ports with labels beginning 4/B, 5/B and x/C (for all x in 0..3) may experience a short outage as the old switches are removed and the new ones installed.  We have not yet scheduled this, and as last time we expect the availability of the relevant computing and technical staff to be the tightest constraint.  Email warnings will, of course, be issued nearer the time.
  2. At the beginning of January our link to EdLAN via Appleton Tower was upgraded to 10Gbps (previously 2 x 1Gbps).  This link carries the bulk of our external routed traffic, as well as VoIP phones and wireless.  We are now in the process of installing a new primary external router for the Forum, also with a 10Gbps link.  This should alleviate the traffic bottleneck which has affected us several times recently.  (We have a second link to EdLAN via Old College, for resilience and load-sharing.)
  3. Both the Forum and Appleton Tower “network services” servers will also shortly be upgraded.  These run our OpenVPN endpoints, as well as providing DNS service for self-managed machines.
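For anyone wanting to check a list of port labels against item 1 above, here is a minimal Python sketch. It assumes only what the list states, namely that affected labels begin with 4/B, 5/B or x/C for x from 0 to 3; the example labels and the function name are hypothetical.

```python
# Prefixes taken from item 1: 4/B, 5/B, and x/C for all x in 0..3.
AFFECTED_PREFIXES = ("4/B", "5/B") + tuple(f"{x}/C" for x in range(4))


def may_see_outage(port_label):
    """Return True if the label begins with one of the affected prefixes."""
    return port_label.strip().startswith(AFFECTED_PREFIXES)


# Hypothetical labels, purely for illustration.
for label in ["4/B12", "2/C07", "4/C01", "4/A03"]:
    print(label, may_see_outage(label))
```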

For those interested, network documentation with diagrams can be found here.

Home directory quota problem on Tuesday

On Tuesday 4th of March there was a period of a few minutes, from 11:15am to about 11:25am, when most people’s home directory quota was incorrectly shrunk to 2MB. Anything trying to write to home directories during that time will have failed.

This happened because the script that calculates people’s quotas had not been updated to take into account a change in our account management system. This change had been flagged by our colleagues weeks in advance, but the dependency of this script on the change had not been spotted. When the change happened on the Tuesday morning, the quotas script was no longer able to determine the roles a user had, and so could not allocate the correct quota (e.g. 10GB for staff, 2GB for UG1 and UG2). It therefore defaulted to the minimal quota of 2MB.
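The failure mode can be illustrated with a small Python sketch. This is not the real quotas script: the role names used as dictionary keys, the function name, and the rule of taking the largest quota among a user's roles are all assumptions made for the example. Only the sizes (10GB for staff, 2GB for UG1 and UG2, 2MB minimum) come from the description above.

```python
MB = 1024 ** 2
GB = 1024 ** 3

# Sizes from the post; the role keys are hypothetical spellings.
ROLE_QUOTAS = {"staff": 10 * GB, "ug1": 2 * GB, "ug2": 2 * GB}
MINIMAL_QUOTA = 2 * MB


def quota_for(roles):
    """Pick the largest quota among the user's known roles.

    If no role is recognised (for example because the role lookup
    returned nothing), fall back to the minimal quota.
    """
    known = [ROLE_QUOTAS[role] for role in roles if role in ROLE_QUOTAS]
    return max(known, default=MINIMAL_QUOTA)


print(quota_for(["staff"]) // GB)  # 10
print(quota_for([]) // MB)         # 2
```

With a working role lookup, a member of staff gets 10GB. When the lookup silently returns no roles, the same code quietly assigns the 2MB minimum, which is the behaviour seen on the Tuesday morning.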

The quick fix was to change that minimal quota to 20GB for everyone, and then later to update the quotas script to use the new location for the user role information. The previous, correct role-based quotas were then applied (e.g. 10GB for staff, 2GB for UG1).

As this user data (like roles) is now retrieved from a central source, it will be easier in future to see what queries are being made for that data, and so what would be affected by any changes to that central source.

Sorry for this break in operation.

Neil
Services Unit

Gas explosion in AT basement (not for real!)

Every now and then, we test our preparedness for disasters by holding a mock disaster exercise.

On the 13th January, the computing staff were told that over the preceding weekend, as a result of a gas explosion in the basement of Appleton Tower, all of our IT equipment in the AT basement had been destroyed.

Each computing unit was asked to produce a report on what services would have been affected and the state of the backups for those services. They were also asked to test reinstall one service just from those backups. The reports are available here.

In summary, the only data lost was scratch data on a small number of servers. A config file for the plone service was lost, but could have been easily restored from a number of external web sites.

Upgrade of DICE desktops to Scientific Linux 7

Red Hat, who provide the Linux platform on which DICE is based, have recently released a beta of their latest version, RHEL 7.

We have started work on porting DICE to this platform, with a view to upgrading DICE desktops to a RHEL 7 based platform this summer. This will result in many core applications being upgraded.

For further information, see the project home page.

Scanning for vulnerable systems

This article describes a couple of security enhancements which the Computing Team will be developing over the next few months.

As I mentioned last time, we have recently started scanning all our externally-visible machines for security vulnerabilities using the JANET ESISS penetration-testing service.  In order to use the service as effectively as possible we need an up-to-date list of the URLs of web sites to be tested.  For managed servers, our configuration database contains the necessary information.  For self-managed machines we propose extracting URLs from the traffic going to the servers on those machines, which we expect will keep the list automatically current.
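As a rough idea of what extracting a URL from observed traffic could involve, here is a minimal Python sketch. It is an assumption-laden illustration and not the system being developed: it handles only a single unencrypted HTTP/1.1 request whose bytes have already been captured, and the function name and example hostname are hypothetical. It combines the request line with the Host header to rebuild the URL.

```python
def url_from_request(raw_request, scheme="http"):
    """Rebuild a URL from the request line and Host header of one HTTP request.

    Returns None if the request line is malformed, the Host header is
    missing, or the request target is not a path starting with "/".
    """
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    lines = head.split("\r\n")
    parts = lines[0].split(" ")
    if len(parts) != 3:
        return None
    _method, target, _version = parts

    host = None
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "host":
            host = value.strip()
            break

    if host is None or not target.startswith("/"):
        return None
    return f"{scheme}://{host}{target}"


# Hypothetical captured request, for illustration only.
request = b"GET /projects/ HTTP/1.1\r\nHost: www.example.org\r\nAccept: */*\r\n\r\n"
print(url_from_request(request))  # http://www.example.org/projects/
```

Collecting the results into a set per server would then give a list of URLs that tracks whatever is actually being requested.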

We are also evaluating the use of the snort intrusion detection system, in the hope that it might be able to alert us to the presence of compromised machines or services on our network.  This does sound a promising system, but we are still at the initial stages with it and it is not yet clear whether it would have too much of an effect on our edge routers to be able to run it as we would like.

Both of these will require the automated inspection of traffic passing through our edge routers, with the Head of School’s permission under the terms of the Lawful Business Practice regulations.  This will, of course, be kept to the absolute minimum necessary for the purpose.

Self-managed machines, particularly with firewall holes

Users of self-managed machines are reminded that School policy requires that they should make all reasonable efforts to secure those machines.  This applies particularly to those which have firewall holes.

Machines must be running a current OS version, and patching must be kept up-to-date.  If you have any services running, please make sure that you have turned off unnecessary options, and have changed all default passwords.  For example, in one recent compromise of a self-managed machine, a default Tomcat manager account was used to install botnet modules which were then used to attack other systems.

You should not assume that your system won’t be found just because it is not actively advertised (e.g. in the DNS or through links on the web).  On the contrary, scanning is widespread.  Our own logs show that any IP address, even one which has never been used for externally-visible machines, is likely to be probed several dozen times per day.

The University has subscribed to the JANET ESISS penetration-testing service. We now use it to scan all managed and self-managed machines with external firewall holes, and will be following up its warnings with machines’ managers.  However, it won’t catch everything, so you should still take care with your configurations.

Please contact Support in the usual way if you would like to discuss your self-managed machine.

Christmas closure – saving energy

Please help to reduce the University’s energy bill by switching off any equipment, including computers and monitors, that you are unlikely to use over the Xmas break.

If you are responsible for research group servers, please consider powering these down over the vacation. Contact support if you need computing staff to do this for you.

You can power off a DICE box either by briefly pressing the power button on the front of the machine or choosing the Shutdown option from the menu at the bottom of the DICE login screen.

If you think that you may want to remotely access your desktop over the holidays, just let your machine sleep as normal. You can wake the machine again by going to wake.inf.ed.ac.uk. Self-managed machines can also be awoken using this mechanism – see this computing help page for details.