## SATABeast2 disk failures

Yesterday evening our SAN storage device, satabeast2, suffered two disk failures. We can cope with a single disk failure (RAID5), but after that first failure there is a period, while redundancy is being restored, during which a second disk failure will cause data loss. This is what happened yesterday.
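For those curious why one failure is survivable and two are not, here is a toy sketch (not the satabeast's actual firmware logic): RAID5 stores the XOR of the data blocks as parity, so any single missing block can be rebuilt from the survivors, but two missing blocks leave two unknowns and only one equation.

```python
# Toy illustration of RAID5 parity: one parity block per stripe,
# computed as the XOR of the data blocks.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data = [b"disk0", b"disk1", b"disk2"]  # one toy stripe per data disk
parity = xor_blocks(data)              # what the parity disk stores

# One disk lost: XOR the surviving data blocks with parity to rebuild it.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]

# Two disks lost: two unknown blocks but only one parity equation,
# so the stripe cannot be reconstructed -- hence the data loss above.
```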

There is a recovery mode feature of the satabeast that will attempt to recover data from marginal disk failures, but this requires rebooting the array, which will affect access to all data (whether currently affected by the disk failures or not) on that array.

We plan to start this process at 11am today. It may take several hours to complete, but we will try to make it as short as possible.

Below is a list of the affected areas.

/group/project/statmt5     Currently affected by disk failures
/group/project/statmt6                "
/group/project/statmt7                "
/group/project/statmt8                "
/group/project/statmt9                "
/group/project/statmt10               "
/group/project/statmt11               "
/group/project/statmt12               "
/group/project/statmt13               "
/group/project/ami9      Currently working, but will be affected by the shutdown
/group/project/ami10                  "
/group/project/cstr1                  "
/group/project/cstr2                  "
/group/project/cstr3                  "
/afs/inf.ed.ac.uk/group/project/imr/data    "


We will update this post as work progresses.

12:30pm The rebuild is currently at 8% complete. It’s a 9TB array BTW.
2:15pm 24% complete
4:25pm 45% complete. The areas not directly affected by the disk failures should be back, e.g. /group/project/ami9.
5:30pm 55% complete. Will continue to monitor it over the weekend.
10:30pm 99% complete.

Saturday 8:15am: Unfortunately that last 1% was too much for it, and one of the previously failed disks (that it was trying to recover from) failed again. So it’s looking unlikely we’ll be able to recover the data from statmt5 to statmt13.

Sunday: I have contacted the suppliers of the satabeast (by email) in case they have any other tricks up their sleeves.

Monday – 6/10/2014: Having analysed the logs and details that I sent them, the suppliers are planning on replacing the controller in the satabeast. This is scheduled for 11am onwards tomorrow – Tuesday. During this time the ami, cstr and imr-data areas will be unavailable.

Tuesday – Unfortunately the wrong part was shipped, so we’ll try the controller replacement again, but on Wednesday, same sort of time – 11am.

Wednesday – The controller has been replaced, and we are trying another recovery.

Thursday – The first recovery attempt failed; we are trying a different one, which is currently 66% complete.

Friday 10th October

Rather surprisingly, that last recovery attempt seems to have worked so far, and at least a couple of the group areas seem to be intact. I am running fsck on them and making them available (read only) as they pass the checks. The statmt areas back so far are:

/group/project/statmt5
/group/project/statmt6
/group/project/statmt7
/group/project/statmt8
/group/project/statmt9
/group/project/statmt10
/group/project/statmt11
/group/project/statmt12
/group/project/statmt13


Though the file-systems pass consistency checks, it is possible that the files themselves contain corruption; we can’t know for sure one way or the other. Only the authors of those files would know whether their contents are as expected.

5pm That’s all the affected group areas back now. I’ve left them read only for now so that any possible corruption isn’t made worse by processes opening the files for writing. If there’s some agreement that things do look normal, we can make them writeable again.

Neil

PS If you are actually affected by the loss of data, or use any of the group areas listed above, please let me know (neilb@inf) or leave a comment below.

You can now follow us on twitter at @infalerts! We will post details of planned and unplanned outages, information on new services, changes to existing services and anything else that we think may be helpful to you as users of the IT services provided by Informatics.

Information Services (IS) also post information about services they provide – see @isalerts.

## Scientific Linux 6.5 Update

The 5th minor update to Scientific Linux 6 (which is based on RHEL6) is now ready for deployment to the Informatics SL6 DICE office and student lab machines. A minor update like this provides us with the opportunity to update important software and fix, in a controlled manner, any bugs which are not security issues (we apply security updates as soon as they are available).

To complete this upgrade a reboot is required. The student lab machines will be rebooted during the night of Thursday 28th August. A delayed reboot, with a delay of 5 days, will be scheduled for all DICE office desktops. Although the reboots are delayed, it would be greatly appreciated if people could manually reboot their machines at their earliest convenience; the delayed reboot will then be cancelled. Upgrades for individual servers will be scheduled over the next few weeks and affected users will be contacted as necessary.

SL6.5 was released on 6th December 2013 and since then it has been thoroughly tested in our DICE environment so we are confident that this update will not cause any issues for users.

Details of the package updates are available on the LCFG wiki. For further, in-depth information, there are also release notes from Scientific Linux and RHEL.

## New staff SSH server

On Tuesday 26th August we plan to replace the SSH server named hogwood which hosts staff.ssh.inf.ed.ac.uk with a machine named brendel.

All that will happen is that at about 09:00 on Tuesday we will change the DNS aliases to point to the new machine. This change can take some time to propagate so we will not switch off access to hogwood immediately. It will be left running as normal until 12:00 Friday 29th August. This should also allow sufficient time for users logged in to finish their existing sessions and move to the new server.

The IP address for the service will change from 129.215.33.85 to 129.215.33.112; your SSH client may warn you about this change and request verification. For reference, the new RSA host key fingerprint is:

e3:0e:ed:f9:3a:9c:a5:1e:2e:3a:26:3a:2c:1e:0b:d1
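For the curious, an MD5 fingerprint in this format is simply the MD5 digest of the server's raw public-key blob, printed as colon-separated hex pairs. A minimal sketch of the derivation (the base64 blob below is a made-up placeholder, not brendel's real key):

```python
# Sketch: how an OpenSSH-style MD5 host-key fingerprint is derived.
# The blob here is a dummy placeholder, NOT the real key for brendel.
import base64
import hashlib

def md5_fingerprint(b64_blob):
    """MD5 of the decoded key blob, as colon-separated hex pairs."""
    raw = base64.b64decode(b64_blob)
    digest = hashlib.md5(raw).hexdigest()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))

dummy_blob = base64.b64encode(b"not a real ssh key").decode()
print(md5_fingerprint(dummy_blob))  # 16 colon-separated hex pairs
```

On the server itself the same value is produced by `ssh-keygen -l -f /etc/ssh/ssh_host_rsa_key.pub` (MD5 output was the default format in OpenSSH releases of this era), which is the easiest way to confirm you are talking to the right machine.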


If you have any problems with the change, please contact us via the Support Form.

## Local addition to mailman (lists.inf.ed.ac.uk)

We’ve recently made a small local change to mailman that gives users with the appropriate entitlement automatic access to a list admin web page. Similarly, entitlements can now also be used to limit the visibility of the mailing list archives.

Entitlements are attributed to groups of users (or individuals), typically according to their status in the School Database. For example, all staff are given the “login/staffssh/remote” entitlement, and that allows them to remote login to staff.ssh.inf.ed.ac.uk.

So if you have a group of people (for example, ITO admin staff) who need to administer the various teaching-related mailing lists, and they already have the entitlement “role/ito-admin”, that entitlement can be added to the new “admin_entitlement” field on the “General Options” page:

Example of admin_entitlements option from the General Options page

In this example, the computing staff – role/sysman – would also have access.

Similarly, the new field “read_archive_entitlement” has been added to the “Archiving Options” page. This gives users with the specified role permission to view otherwise private web archives. Normally only the list members can see private archives. This could be useful for taught course student lists, where normally only the students on that particular course can see the web archives, but perhaps we’d like students from the whole year to see them.

Example of read_archive_entitlement option on the Archiving Options page.

In this example any user with the “roles/student” entitlement would be able to view the web archives.
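The logic behind both new fields amounts to a simple membership test. A hypothetical sketch (the actual mailman patch is not shown in this post, and the function name is illustrative): access is granted when any of a user's entitlements matches one configured on the list.

```python
# Hypothetical sketch of the access check the local mailman change
# performs; the real implementation may differ.

def has_access(user_entitlements, configured_entitlements):
    """True if the user holds at least one configured entitlement."""
    return bool(set(user_entitlements) & set(configured_entitlements))

# Examples mirroring the post: ITO admins and computing staff can
# administer the list; any student can read the archives.
assert has_access({"role/ito-admin"}, {"role/ito-admin", "role/sysman"})
assert has_access({"role/sysman"}, {"role/ito-admin", "role/sysman"})
assert has_access({"roles/student"}, {"roles/student"})
assert not has_access({"roles/student"}, {"role/ito-admin", "role/sysman"})
```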

These entitlement changes require the use of our Cosign service, which means people must view the pages on lists.inf.ed.ac.uk via the HTTPS version of the URL, eg
https://lists.inf.ed.ac.uk/mailman/private/inf-general/ to view the inf-general web archive.

If this sounds like it may be of use to you, but you are not sure about “entitlements”, submit a support ticket and we’ll give you a hand.

Neil

## A second NX server: staff.nx.inf.ed.ac.uk

We recently added a second NX server, staff.nx.inf.ed.ac.uk. The first NX server, nx.inf.ed.ac.uk, is unchanged.

NX is the remote graphical login facility, offering a DICE desktop from a Windows, Mac or Linux machine anywhere on the internet.

staff.nx.inf.ed.ac.uk is for staff, research postgraduates and visitors. nx.inf.ed.ac.uk is still accessible to all Informatics users.

If you want to use either NX server your computer will first need a little configuration.

As ever, if you have any problems please take a look at the documentation, and if that doesn’t help, just ask User Support for help.

## Forum network stability

Following the changes noted in my recent article, the Forum network appears to be stable again.  If you’re still experiencing problems, please report them through the support form in the usual way and we’ll look into them.

As a separate issue, a number of central University services were affected by a firewall problem on Monday 16th.  Please see IS’s service alert for details.  This appears to have been caused by a contractor mis-patching something in 50 George Square, though it’s not clear why the problem propagated as it did.  If we do hear any more details we’ll pass them on.

## DICE teaching platform upgrade – postponed

I’m sorry to report that we shall not be upgrading the DICE teaching desktop platform this summer. This means that the DICE teaching platform will be SL6 based for another academic year.

We had hoped to upgrade to a free derivative of Red Hat Enterprise Linux 7 (RHEL 7) this summer, but unfortunately RHEL 7 has only just been released. A free version is unlikely to ship until mid to late August, leaving us with insufficient time to deploy it in the teaching labs for the start of the next academic year.

We expect to be able to upgrade individual staff and research postgrad desktops to whichever RHEL 7 derivative we choose to use from October.

## Recent Forum network instability

You may have noticed some periods of instability with our Forum network recently.  This has proved hard to diagnose as one of the first things to be hit would be the management interfaces to our network switches, so that obtaining useful data (indeed, any data at all) during an event was problematic.  However, we believe we have now accumulated enough evidence to be able to come to a tentative conclusion.

The first thing to say is that there hasn’t been one single cause.  Rather, instability has been due to several things happening to occur at the same time.  Ultimately, though, the effect has been to overwhelm the CPUs in our older-model switches, which have then started to miss various important pieces of housekeeping.  In particular:

• As mentioned, the management interfaces stopped responding.  This resulted in us losing logging and traffic data, making it much harder to look back afterwards to see what had happened.
• The older switches missed some DHCP exchanges, which they normally track to help manage IP address use on the network for security reasons.  As a result, self-managed machines which were trying to acquire or renew their leases would be blocked, though existing leases appeared to be honoured correctly.
• Ultimately spanning-tree packets would be lost.  Loops in the network formed, and some links were swamped as a result.  This appeared to affect wireless users more than wired users.  These high traffic levels then meant that it would take longer for the network to converge again afterwards.

As for the various causes, we have identified the following.  They are all somewhat variable in effect, and generally aren’t an issue individually.  It’s when several happen to combine that problems occur.

• The older-model switches are underpowered.  In the medium term they are due to be upgraded to newer, more powerful models.  Meanwhile we have removed as much “unnecessary” processing as we can.
• In particular, we no longer process multicast traffic specially. As a result we now have to propagate it more widely across the network, including to end-systems which don’t particularly want it and will have to process it just to throw it away.
• We identified unexpectedly high levels of IPv6 multicast traffic coming in from outside.  This was on a subnet which we carried for E&B, and as it was no longer required by them we have removed it completely.  We queried this with IS, and it turns out that they were also investigating poor performance on the same subnet, so it appears that whatever the machine generating this traffic was doing was affecting more than just Informatics.  We now believe that this outside traffic was what finally tipped our older switches over the edge, and that this also explains the peculiar timing of some of the events.
• However, along the way we made quite a few configuration changes in order to remove potential sources of instability, and unfortunately it looks as though we were running into bugs, or at least features, in the way that the older switches’ management interfaces were implemented.  Thus, some of the instability was self-induced, for which we apologise!  It also took a bit longer than we would have hoped to identify this, due to the effects being confounded with the other ongoing instability.  At least we now know better what to expect and how to try to minimise the effects.
• There is still one issue which we believe only affects older Macs with wired network connections, and which is resolved by a reboot.  It’s still not at all clear where the fault lies.

We still have some network configuration changes to make, some of which may result in a few more short glitches.  We’ll try to keep those to a minimum, and may be able to reduce the impact further with some code changes to our management tools.

As usual, our technical network documentation is available to browse for those of you who would like a more detailed picture of the Informatics network.

## Mobile Friendly Informatics Survey Results

A big thank you to all of you who took the time to complete the recent survey on the areas of the Informatics Computing Service where mobile (mobile in this context meaning a smartphone or tablet rather than a conventional laptop) support could be improved. 99 people responded and the results make interesting reading.

We had identified 6 areas in which mobile support could be improved:

• printing
• connection to AV equipment
• improving web content on mobile devices
• allowing mobile users to use authentication mechanisms such as kerberos to access secure School services
• allowing mobile users to connect to the School’s network infrastructure via a VPN
• providing the facility for users to access their DICE desktop machines on their mobile devices via a virtual desktop

There was also the option for respondents to suggest any areas we hadn’t thought of. The top results were:

• improving web content – 45 votes
• accessing AV equipment – 41 votes
• virtual desktop – 26 votes

Of the ‘other’ responses, two users felt that no improvement in mobile support was necessary, one user wanted to use their mobile device to access the Forum, and another provided a link to the Android Microsoft Remote Desktop app. This might prove useful for another user who wanted remote access to self-managed machines from Android.

Another respondent simply replied ‘Secure Shell’. In fact, Secure Shell clients are already available for iOS and Android at least and can be used to connect to the School’s SSH servers from outside. For some general information about this, see this computing.help page. Access to the room booking service was another request, presumably via a dedicated app rather than the normal web interface.

Finally, there were two requests concerning services the School doesn’t actually manage. One user wanted WiFi access through WPA authentication without the need for proxy auth, and another wanted streamlined access to electronic versions of research articles from the University’s ejournals service. We will pass these on to the providers of those services.

We also asked respondents which mobile operating system they used on their devices. Android was the clear winner here with 67 users followed by iOS with 39 and Windows Mobile with 7.

So what now? Well, we’d like to discuss that with you and to that end, we would like to extend an invitation to the next CO technical discussion meeting on the 21st of May at 10am in IF-1.15 where we will chew over these results and decide what to do next. These discussion meetings are intended to be a forum where any technical topics likely to be of interest to Informatics users can be discussed so if you have a topic you would like to discuss at the meeting, related to mobile computing or not, please let me know. See you on the 21st!

Craig Strachan, cms@inf.ed.ac.uk