## A second NX server: staff.nx.inf.ed.ac.uk

We recently added a second NX server, staff.nx.inf.ed.ac.uk. The first NX server, nx.inf.ed.ac.uk, is unchanged.

NX is the remote graphical login facility, offering a DICE desktop from a Windows, Mac or Linux machine anywhere on the internet.

staff.nx.inf.ed.ac.uk is for staff, research postgraduates and visitors. nx.inf.ed.ac.uk is still accessible to all Informatics users.

If you want to use either NX server, your computer will first need a little configuration.

As ever, if you have any problems please take a look at the documentation, and if that doesn’t help, just ask User Support for help.

## Forum network stability

Following the changes noted in my recent article, the Forum network appears to be stable again.  If you’re still experiencing problems, please report them through the support form in the usual way and we’ll look into them.

As a separate issue, a number of central University services were affected by a firewall problem on Monday 16th.  Please see IS’s service alert for details.  This appears to have been caused by a contractor mis-patching something in 50 George Square, though it’s not clear why the problem propagated as it did.  If we do hear any more details we’ll pass them on.

## DICE teaching platform upgrade – postponed

I’m sorry to report that we shall not be upgrading the DICE teaching desktop platform this summer. This means that the DICE teaching platform will be SL6 based for another academic year.

We had hoped to upgrade to a free derivative of Red Hat Enterprise Linux 7 (RHEL 7) this summer, but unfortunately RHEL 7 has only just been released. A free version is unlikely to ship until mid to late August, leaving us with insufficient time to deploy in the teaching labs for the start of the next academic year.

From October, we expect to be able to upgrade individual staff and research postgrad desktops to whichever RHEL 7 derivative we choose to use.

## Recent Forum network instability

You may have noticed some periods of instability with our Forum network recently.  This has proved hard to diagnose as one of the first things to be hit would be the management interfaces to our network switches, so that obtaining useful data (indeed, any data at all) during an event was problematic.  However, we believe we have now accumulated enough evidence to be able to come to a tentative conclusion.

The first thing to say is that there hasn’t been one single cause.  Rather, instability has been due to several things happening to occur at the same time.  Ultimately, though, the effect has been to overwhelm the CPUs in our older-model switches, which have then started to miss various important pieces of housekeeping.  In particular:

• As mentioned, the management interfaces stopped responding.  This resulted in us losing logging and traffic data, making it much harder to look back afterwards to see what had happened.
• The older switches missed some DHCP exchanges, which they normally track to help manage IP address use on the network for security reasons.  As a result, self-managed machines which were trying to acquire or renew their leases would be blocked, though existing leases appeared to be honoured correctly.
• Ultimately spanning-tree packets would be lost.  Loops in the network formed, and some links were swamped as a result.  This appeared to affect wireless users more than wired users.  These high traffic levels then meant that it would take longer for the network to converge again afterwards.

As for the various causes, we have identified the following.  They are all somewhat variable in effect, and generally aren’t an issue individually.  It’s when several happen to combine that problems occur.

• The older-model switches are underpowered.  In the medium term they are due to be upgraded to newer, more powerful models.  Meanwhile we have removed as much “unnecessary” processing as we can.
• In particular, we now don’t process multicast traffic specially, though as a result we do now have to propagate it more widely across the network and to end-systems which don’t particularly want it and will have to process it to throw it away.
• We identified unexpectedly high levels of IPv6 multicast traffic coming in from outside.  This was on a subnet which we carried for E&B, and as it was no longer required by them we have removed it completely.  We queried this with IS, and it turns out that they were also investigating poor performance on the same subnet, so it appears that whatever was generating this traffic was affecting more than just Informatics.  We now believe that this outside traffic was what finally tipped our older switches over the edge, and that this also explains the peculiar timing of some of the events.
• However, along the way we made quite a few configuration changes in order to remove potential sources of instability, and unfortunately it looks as though we were running into bugs, or at least features, in the way that the older switches’ management interfaces were implemented.  Thus, some of the instability was self-induced, for which we apologise!  It also took a bit longer than we would have hoped to identify this, due to the effects being confounded with the other ongoing instability.  At least we now know better what to expect and how to try to minimise the effects.
• There is still one issue which we believe only affects older Macs with wired network connections, and which is resolved by a reboot.  It’s still not at all clear where the fault lies.

We still have some network configuration changes to make, some of which may result in a few more short glitches.  We’ll try to keep those to a minimum, and may be able to reduce the impact further with some code changes to our management tools.

As usual, our technical network documentation is available to browse for those of you who would like a more detailed picture of the Informatics network.

## Mobile Friendly Informatics Survey Results

A big thank you to all of you who took the time to complete the recent survey on the areas of the Informatics Computing Service where mobile (mobile in this context meaning a smartphone or tablet rather than a conventional laptop) support could be improved. 99 people responded and the results make interesting reading.

We had identified 6 areas in which mobile support could be improved:

• printing
• connection to AV equipment
• improving web content on mobile devices
• allowing mobile users to use authentication mechanisms such as kerberos to access secure School services
• allowing mobile users to connect to the School’s network infrastructure via a VPN
• providing the facility for users to access their DICE desktop machines on their mobile devices via a virtual desktop

There was also the option for respondents to suggest any areas we hadn’t thought of. The results were:

• improving web content – 45 votes
• accessing AV equipment – 41 votes
• virtual desktop – 26 votes

Of the ‘other’ responses, two users felt that no improvement in mobile support was necessary, one user wanted to use their mobile device to access the Forum, and another provided a link to the Android Microsoft Remote Desktop app. This might prove useful for another user who wanted remote access to self-managed machines from Android.

Another respondent simply replied ‘Secure Shell’. In fact, Secure Shell clients are already available for iOS and Android at least and can be used to connect to the School’s SSH servers from outside. For some general information about this, see this computing.help page. Access to the room booking service was another request, presumably via a dedicated app rather than the normal web interface.

Finally, there were two requests concerning services the School doesn’t actually manage. One user wanted Wifi access through WPA authentication without the need for proxy auth, and another wanted streamlined access to electronic versions of research articles from the University’s ejournals service. We will pass these on to the providers of these services.

We also asked respondents which mobile operating system they used on their devices. Android was the clear winner here with 67 users followed by iOS with 39 and Windows Mobile with 7.
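As a rough illustration of how these figures relate to the 99 responses, the short Python sketch below turns the quoted counts into percentages of respondents. The variable and function names are purely illustrative, and the sketch assumes that every count is a number of respondents out of the same 99.

```python
# Figures quoted in the survey write-up; all names here are illustrative.
RESPONDENTS = 99
AREA_VOTES = {
    "improving web content": 45,
    "accessing AV equipment": 41,
    "virtual desktop": 26,
}
OS_USERS = {"Android": 67, "iOS": 39, "Windows Mobile": 7}


def share_of_respondents(counts, respondents=RESPONDENTS):
    """Express each count as a percentage of respondents, to one decimal place."""
    return {name: round(100 * n / respondents, 1) for name, n in counts.items()}


area_share = share_of_respondents(AREA_VOTES)
os_share = share_of_respondents(OS_USERS)

# The operating system counts add up to more than the number of respondents.
os_total = sum(OS_USERS.values())
extra_selections = os_total - RESPONDENTS

for name, pct in {**area_share, **os_share}.items():
    print(f"{name}: {pct}%")
print(f"OS selections: {os_total} from {RESPONDENTS} respondents")
```

Under that assumption the operating system counts sum to more than 99, which suggests that some respondents use more than one mobile operating system.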

So what now? Well, we’d like to discuss that with you and to that end, we would like to extend an invitation to the next CO technical discussion meeting on the 21st of May at 10am in IF-1.15 where we will chew over these results and decide what to do next. These discussion meetings are intended to be a forum where any technical topics likely to be of interest to Informatics users can be discussed so if you have a topic you would like to discuss at the meeting, related to mobile computing or not, please let me know. See you on the 21st!

Craig Strachan, cms@inf.ed.ac.uk

## Kaspersky Anti-Virus Software

The University’s license for the Kaspersky Anti-Virus software expires on 30th April 2014. If you are running a copy of Kaspersky on a self-managed machine using the University License, you must remove it manually and replace it. The replacement anti-virus software is System Center Endpoint Protection (SCEP) which is currently licensed for use on University-owned computers ONLY.

The University are recommending that you use Microsoft Security Essentials on non-University owned machines.

Full details on removal of Kaspersky and installation of either of the 2 options above can be found at:

http://www.ed.ac.uk/schools-departments/information-services/computing/desktop-personal/security/anti-virus/win-anti-virus

## Informatics Computing Plan 2014

The School produces an annual computing plan which:

• states our current strategic objectives
• reports on goals set in the previous year’s plan
• lists goals for the current calendar year
• provides feedback and input to Information Services

The plan is available, along with plans for previous years, on the Computing Strategy Group wiki.

## Windows XP – end of life

Windows XP was first launched in 2001 and Microsoft have now withdrawn support for XP as of 8th April. There is a useful page giving full details:

http://windows.microsoft.com/en-GB/windows/end-support-help

It does recommend upgrading to Windows 8.1 but you may find that your machine is not capable of running Windows 8 in which case you may prefer to try Windows 7 or even invest in some new kit! Provided you are eligible to register with Dreamspark, you can download either Windows 7 or 8 from there. Details on Dreamspark can be found here:

https://computing.help.inf.ed.ac.uk/self-managed-windows

As Microsoft will no longer be providing updates to protect your machine, it will become more vulnerable to new security risks and viruses, which will not be fixed for XP. If you have a Windows XP machine within Informatics which currently has firewall holes opened, we will need to close these holes to reduce the risks to other users. Our self-managed and personal machines policy also states that:

“Users must ensure that the software and systems on their machines are kept fully patched and appropriately configured against security vulnerabilities. If they use a system which is prone to viruses, they must ensure that they have adequate and current protection installed.”

The full page can be found at:

https://computing.help.inf.ed.ac.uk/self-managed-policy

## Virtual DICE

DICE is also available as a virtual machine. We call it Virtual DICE. It can run in a variety of environments – Windows, Linux, Mac, and so on. The computing help pages give details on how to download, install and use it.

Virtual DICE was previously in testing. It’s now a supported service, so please get in touch with computing support if you have problems with it which can’t be solved with the help of the documentation.

[Screenshot: Virtual DICE running on a DICE machine]

## SAN problems of 27th March 2014

Following the unplanned power cut on Tuesday, one of our SAN machines (ifevo4) started reporting a problem with the flash cache memory in one of its controllers. The machine has two controllers, A and B, for redundancy, each with two Fibre Channel (FC) connections, one to each of our fabrics (networks). This means that should one controller fail, the other will take over its duties and service will remain uninterrupted.

After we reported the fault in controller A, our supplier shipped a replacement controller. To minimise the length of time that ifevo4 was running in a degraded state, we decided to replace the controller on Thursday after 5pm. Thanks to the redundancy, this should not have caused any problems for the running service.

The redundancy only fully works if the client machines (our file servers) are configured to use the multiple paths to the FC connections on both controllers. I assumed they were (but didn’t actually check), as that’s how it should be; we’d also had a separate fabric failure a week previously, and all the servers continued to work via the one remaining path/fabric without any issue. Unfortunately, not all the volumes on ifevo4 were as fully redundant as they should have been. In some cases the volumes were only accessible via controller A, not via both A and B. This was a configuration error that had probably gone unnoticed since November 2013.

So when I removed controller A, the volumes that were only accessible via controller A became inaccessible to the servers that mounted them, causing problems for anyone trying to access data on those volumes. As it is generally group file space that is mounted from the SAN (home volumes are on disks local to the servers), not many people noticed at this point.

Unfortunately, reattaching the failed volumes (once controller A had been replaced) typically means checking the consistency of (salvaging) all the data on the server, during which time the file server will not serve any files, even those unaffected by the loss of controller A. As our file servers have several terabytes of data to check, this means no access to any files for a couple of hours.

To give people a chance to finish anything they were working on, I mailed out to explain that I’d reattach, and salvage, the affected volumes at 8pm. As it turned out, after rebooting the servers at 8pm, I was able to salvage the volumes individually, without affecting the availability of the working volumes. So apart from a 5 minute break at about 8pm, file access remained working. Over the next couple of hours the volumes affected by the controller A replacement gradually came back on-line. Most files were back by 10:30pm.

The reason that some of the volumes were incorrectly configured to only use one controller is unknown. The most likely explanation is that they were all on the JBOD part of ifevo4. The JBOD is an expansion unit containing just extra disks. It was previously attached to an older version of the SAN hardware (ifevo2), which also had dual controllers and multiple FC connections to our fabric. Back in November 2013 we shut down ifevo2, disconnected the JBOD, and attached it to the new ifevo4. At that point everything seemed to be working fine. The file servers just continued to access the volumes from their new location, and multiple paths were available to the data, so we appeared to have redundancy. I suspect this is where the problem was introduced. Had we looked more closely, we would probably have found that, though we had multiple paths via our two different fabrics, they were only to a single controller.

Since the problem on Thursday, all the paths have been checked and updated, where necessary, to make sure there are multiple paths to both controllers on our ifevo3 and ifevo4. In future, should we need to change a controller again, we will double-check that those paths are still in place before doing so.
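The kind of path check described above can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the `Path` record, the volume names and the fabric names are hypothetical, and it does not reflect the actual SAN tooling or how path information is really obtained from the file servers.

```python
from collections import defaultdict
from typing import NamedTuple


class Path(NamedTuple):
    """One route from a file server to a volume (hypothetical record)."""
    volume: str
    controller: str  # "A" or "B"
    fabric: str      # illustrative fabric name


REQUIRED_CONTROLLERS = frozenset({"A", "B"})


def volumes_missing_controllers(paths, required=REQUIRED_CONTROLLERS):
    """Map each under-protected volume to the controllers it cannot be reached through."""
    seen = defaultdict(set)
    for path in paths:
        seen[path.volume].add(path.controller)
    return {
        volume: sorted(required - controllers)
        for volume, controllers in seen.items()
        if not required <= controllers
    }


# Made-up data: vol_group1 has two paths over two fabrics, but both go to
# controller A, which is the situation that went unnoticed here.
example_paths = [
    Path("vol_group1", "A", "fabric1"),
    Path("vol_group1", "A", "fabric2"),
    Path("vol_group2", "A", "fabric1"),
    Path("vol_group2", "B", "fabric2"),
]

at_risk = volumes_missing_controllers(example_paths)
print(at_risk)
```

The point of grouping by controller rather than simply counting paths is that a volume can have two working paths over two fabrics and still depend entirely on one controller.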

Neil