## Cloud survey results

Earlier this year we conducted a survey of VM/cloud usage. The project has now concluded, and a formal report on the project can be found on the Rat Unit wiki; this includes a link to the survey report itself. Copies of all the documentation relating to the project can be found in AFS, in the /afs/inf.ed.ac.uk/group/rat-unit/projects/vm_cloud_survey directory.

## OpenVPN changes

It’s now over ten years since we first set up our OpenVPN service, and things have moved on quite a bit since then.  So far we have managed to maintain compatibility for existing users of the service, but we would now like to make a couple of enhancements which unfortunately do require incompatible changes:

1. We would like to offer a service to users of Android and iOS mobile devices.  However, the way we set up the service access keys (back at the beginning, when that was the only way to do it) is not compatible with the way these devices now require things to be done.
2. Due to the ever-growing popularity of the service we need to expand the IP address space used so as to avoid unexpected glitches for users.

As this is necessarily an incompatible change, we’ll arrange things as follows:

• We’ll set up new endpoints to provide the new service for testing. (This has actually already been done.)
• We’ll create suitable new configuration files for beta-testers.
• Once we’re happy that things are running as expected, we’ll tidy up, document and advertise the new configurations.
• Some time later (probably around Easter next year) we’ll close down the old-style service.

We’ll also take the chance to remove some now-deprecated options from the configurations, and we’ll add some platform-specific enhancements where these appear to be generally useful.

We do have to turn off the old service in due course, rather than just leaving it running, as this will allow us to recycle the IP address range it uses.  Globally-routed IPv4 addresses are now a scarce resource, and we simply can’t justify keeping these for what will be an ever-decreasing number of users.

Look out for announcements regarding the introduction of the new service, support for specific mobile devices, and in particular the schedule for the retiral of the old service.

Meantime, if anyone would like to beta-test the new service, please get in touch through the support form in the usual way.

## Seven.

You may remember that DICE Linux is not getting the major upgrade this summer (DICE teaching platform upgrade – postponed) that we optimistically forecast in February (Upgrade of DICE desktops to Scientific Linux 7). This post explains what’s been going on.

DICE Linux is based on Scientific Linux, which is based on Red Hat Enterprise Linux. To make a Linux distro into DICE, we port our configuration technology LCFG to the new system so that we can configure it appropriately. We use LCFG to, for instance, add software; make the network, printers and mail behave appropriately; control who can do what and where; and defend our systems and data against (constant) attack.

So why isn’t the latest greatest DICE ready yet? There have been two main problems.

The first was the later than expected release of RHEL 7. We had hoped for it to appear by February at the latest. Judging by previous releases this would have given the Scientific Linux team enough time to produce the corresponding SL release by April, which would have given us just about enough time to get LCFG and DICE ported and tested in time for the next session. Unfortunately RHEL 7 wasn’t released until June, and SL 7 was released this month (October).

The second major problem has been the sheer amount of new technology in RHEL 7. In a word, systemd. This ambitious replacement for init has introduced major changes to Linux. It abandons the old approach of starting services one at a time in a predetermined order, in favour of a dependency-based system. In principle this is a great idea, and it’s the approach taken by launchd, which does the same job rather successfully on Apple Macs. Some great advantages come with this approach – better control of processes for instance, and faster booting – but the scale of the changes has meant a great deal of work for us.
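To illustrate the difference in approach, here is a purely hypothetical unit file sketch (the service name and paths are made up, not one of our actual units): instead of a numbered init script slot, the unit declares what it depends on and systemd works out the start order itself.

```ini
# Hypothetical example unit -- names and paths are illustrative only.
[Unit]
Description=Example report daemon
# Declare dependencies rather than relying on a fixed boot sequence:
# start after the network is up, and pull the network target in.
After=network.target
Wants=network.target

[Service]
ExecStart=/usr/sbin/exampled --foreground
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Because ordering is derived from these declarations, independent services can start in parallel, which is where much of the faster booting comes from.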

To cope with the changes some of our core software has had to be redesigned or replaced (rather than just recompiled and tested, as we would hope on a new system), and the required effort has been substantial. Read the SL7 LCFG port diary to get some idea of what we’ve been up against (and to).

Another way to grasp the enormity of the change from init to systemd is to look at the opposition it’s stirred up. In Linux as in life, when wide-ranging revolutionary change is imposed, there will be rebellion. In the case of systemd the perceived preference of the design team for unrelenting major change over (say) consolidation and bug fixing, and the project’s absorption of more and more formerly independent Linux services, hasn’t helped. Searching the web for systemd controversy throws up some interesting responses – Systemd: Harbinger of the Linux apocalypse, Boycott systemd, Debian fork and uselessd among them.

We think we can now cope with systemd, at least to the extent of being able to configure it to produce a basically working DICE system. Along the way we have also been tackling a number of other challenges, such as the not much less controversial GNOME 3 and the replacement of the grub bootloader with the very different grub 2. Nevertheless, we hope to have a DICE SL7 desktop option available for “early adopter” staff and researchers within a month or two. Thanks to careful redesign of our software infrastructure, we also hope in time to be able to offer DICE variants based on related flavours of Linux, such as CentOS 7, RHEL 7 itself and Oracle Linux, if the demand is there. Are we still too optimistic? Time will tell.

## New staff NX server

On Tuesday 4th November we plan to replace the staff NX server named central which hosts staff.nx.inf.ed.ac.uk with a machine named northern.

All that will happen is that at about 09:00 on Tuesday we will change the DNS alias to point to the new machine. This change can take some time to propagate, so we will not switch off access to central immediately. It will be left running as normal until 12:00 on Friday 7th November. This should allow sufficient time for users who are logged in to finish their existing sessions and move to the new server.

The IP address for the service will change from 129.215.33.56 to 129.215.33.85; your NX client may warn you about this change and request verification. For reference, the new RSA host key fingerprint is c3:46:f4:e5:13:d2:cb:6c:df:a1:d9:24:79:68:15:d6.
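If you want to check the fingerprint your client reports against the published one programmatically, a minimal sketch is below. The helper name is ours, not part of any NX or SSH tooling; it simply normalises two colon-separated hex fingerprints before comparing them.

```python
# Sketch: compare a client-reported host key fingerprint against the
# published one, ignoring case and surrounding whitespace.
EXPECTED = "c3:46:f4:e5:13:d2:cb:6c:df:a1:d9:24:79:68:15:d6"

def fingerprints_match(reported: str, expected: str = EXPECTED) -> bool:
    """Return True if two colon-separated hex fingerprints are identical."""
    def norm(fp: str) -> list[str]:
        return fp.strip().lower().split(":")
    return norm(reported) == norm(expected)

# Only accept the new host key when this returns True.
print(fingerprints_match("C3:46:F4:E5:13:D2:CB:6C:DF:A1:D9:24:79:68:15:D6"))
```

A mismatch means the key your client saw is not the one we published, and you should not accept it.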

More information regarding the NX service can be found on our help pages.

## New NX server

On Tuesday 28th October we plan to replace the NX server named bakerloo which hosts nx.inf.ed.ac.uk with a machine named piccadilly.

All that will happen is that at about 09:00 on Tuesday we will change the DNS alias to point to the new machine. This change can take some time to propagate, so we will not switch off access to bakerloo immediately. It will be left running as normal until 12:00 on Friday 31st October. This should allow sufficient time for users who are logged in to finish their existing sessions and move to the new server.

The IP address for the service will change from 129.215.33.55 to 129.215.202.105; your NX client may warn you about this change and request verification. For reference, the new RSA host key fingerprint is cf:91:ba:2e:0d:76:0e:7b:8d:5a:79:77:69:68:63:4d.

## SATABeast2 disk failures

Yesterday evening our SAN storage device satabeast2 suffered two disk failures. The array uses RAID5, so it can cope with a single disk failure; however, after that first failure there is a window, while the redundancy is being restored, during which the failure of a second disk means data will be lost. This is what happened yesterday.
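As a toy illustration of why RAID5 survives one failure but not two (this is just the XOR parity idea, nothing to do with the satabeast's actual firmware): the parity block is the XOR of the data blocks, so any single missing block can be rebuilt from the survivors, but two missing blocks cannot.

```python
from functools import reduce

def parity(blocks):
    """XOR a list of equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Three data blocks plus one parity block, as on a 4-disk RAID5 stripe.
data = [b"AAAA", b"BBBB", b"CCCC"]
p = parity(data)

# Lose any single data block: XOR of the survivors and parity rebuilds it.
rebuilt = parity([data[0], data[2], p])
assert rebuilt == data[1]
# Lose two blocks, and the remaining information no longer determines either.
```

During a rebuild the array is effectively running with one block already missing, which is exactly the window described above.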

There is a recovery mode feature of the satabeast that will attempt to recover data from marginal disk failures, but this requires rebooting the array, which will affect access to all data (whether currently affected by the disk failures or not) on that array.

We plan to start this process at 11am today. It may take several hours to complete, but we will try to make it as short as possible.

Below is a list of the affected areas.

/group/project/statmt5     Currently affected by disk failures
/group/project/statmt6                "
/group/project/statmt7                "
/group/project/statmt8                "
/group/project/statmt9                "
/group/project/statmt10               "
/group/project/statmt11               "
/group/project/statmt12               "
/group/project/statmt13               "
/group/project/ami9      Currently working, but will be affected by shutdown
/group/project/ami10                  "
/group/project/cstr1                  "
/group/project/cstr2                  "
/group/project/cstr3                  "
/afs/inf.ed.ac.uk/group/project/imr/data    "


We will update this post as work progresses.

12:30pm The rebuild is currently at 8% complete. It’s a 9TB array BTW.
2:15pm 24% complete
4:25pm 45% complete. The areas not affected by the disk failures should now be back, eg /group/project/ami9
5:30pm 55% complete. Will continue to monitor it over the weekend.
10:30pm 99% complete.

Saturday 8:15am: Unfortunately that last 1% was too much for it: one of the previously failed disks (which it was trying to recover from) failed again. So it’s looking unlikely we’ll be able to recover the data from statmt5 to statmt13.

Sunday: I have contacted the suppliers of the satabeast (by email) in case they have any other tricks up their sleeves.

Monday – 6/10/2014: Having analysed the logs and details that I sent them, the suppliers are planning on replacing the controller in the satabeast. This is scheduled for 11am onwards tomorrow – Tuesday. During this time the ami, cstr and imr-data areas will be unavailable.

Tuesday – Unfortunately the wrong part was shipped, so we’ll try the controller replacement again, but on Wednesday, same sort of time – 11am.

Wednesday – The controller has been replaced, and we are trying another recovery.

Thursday – The first attempt failed; we are trying a different one, which is currently 66% complete.

Friday 10th October

Rather surprisingly, that last recovery attempt seems to have worked so far, and at least a couple of the group areas seem to be intact. I am running fsck on them and making them available (read only) as they pass the checks. The statmt areas currently back are:

/group/project/statmt5
/group/project/statmt6
/group/project/statmt7
/group/project/statmt8
/group/project/statmt9
/group/project/statmt10
/group/project/statmt11
/group/project/statmt12
/group/project/statmt13


Though the file-systems pass consistency checks, it is possible that the files themselves contain corruption; we can’t know for sure one way or the other. Only the authors of those files would know whether their contents are as expected.

5pm That’s all the affected group areas back now. I’ve left them read only for now so that any possible corruption isn’t made worse by things trying to open the files. If there’s some agreement that things do look normal, we can make them writeable again.

Neil

PS If you are actually affected by the loss of data, or use any of the group areas listed above, please let me know (neilb@inf) or leave a comment below.

## Follow us on Twitter

You can now follow us on Twitter at @infalerts! We will post details of planned and unplanned outages, information on new services, changes to existing services and anything else that we think may be helpful to you as users of the IT services provided by Informatics.

Information Services (IS) also post information about services they provide – see @isalerts.

## Scientific Linux 6.5 Update

The 5th minor update to Scientific Linux 6 (which is based on RHEL6) is now ready for deployment to the Informatics SL6 DICE office and student lab machines. A minor update like this gives us the opportunity to update important software and, in a controlled manner, fix any bugs which are not security issues (we apply security updates as soon as they are available).

To complete this upgrade a reboot is required. The student lab machines will be rebooted during the night of Thursday 28th August. A delayed reboot will be scheduled for all DICE office desktops, with a delay of 5 days. Although the reboots are delayed, it would be greatly appreciated if people could manually reboot their machines at their earliest convenience; the delayed reboot would then be cancelled. Upgrades for individual servers will be scheduled over the next few weeks, and affected users will be contacted as necessary.

SL6.5 was released on 6th December 2013 and since then it has been thoroughly tested in our DICE environment so we are confident that this update will not cause any issues for users.

Details of the package updates are available on the LCFG wiki. For further, in-depth information, there are also release notes from Scientific Linux and RHEL.

## New staff SSH server

On Tuesday 26th August we plan to replace the SSH server named hogwood which hosts staff.ssh.inf.ed.ac.uk with a machine named brendel.

All that will happen is that at about 09:00 on Tuesday we will change the DNS alias to point to the new machine. This change can take some time to propagate, so we will not switch off access to hogwood immediately. It will be left running as normal until 12:00 on Friday 29th August. This should allow sufficient time for users who are logged in to finish their existing sessions and move to the new server.

The IP address for the service will change from 129.215.33.85 to 129.215.33.112; your SSH client may warn you about this change and request verification. For reference, the new RSA host key fingerprint is:

e3:0e:ed:f9:3a:9c:a5:1e:2e:3a:26:3a:2c:1e:0b:d1


If you have any questions, please contact us via the Support Form.

## Local addition to mailman (lists.inf.ed.ac.uk)

We’ve recently made a small local change to mailman that gives users with the appropriate entitlement automatic access to a list admin web page. Similarly, entitlements can now also be used to limit the visibility of the mailing list archives.

Entitlements are attributed to groups of users (or individuals), typically according to their status in the School Database. For example, all staff are given the “login/staffssh/remote” entitlement, which allows them to log in remotely to staff.ssh.inf.ed.ac.uk.
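The access decision itself boils down to a simple set-membership test. Here is a rough sketch in Python with made-up names (the real mailman patch will of course differ): a user may administer a list if they are already a list owner, or if they hold the entitlement named in the list's new field.

```python
def may_admin(user_entitlements, admin_entitlement, list_owners, user):
    """Sketch of the check: owners always get in; otherwise the user
    needs the entitlement configured on the list (if one is set)."""
    if user in list_owners:
        return True
    return admin_entitlement is not None and admin_entitlement in user_entitlements

# An ITO admin holding "role/ito-admin" gets access to a list configured
# with admin_entitlement = "role/ito-admin", without being a list owner.
print(may_admin({"role/ito-admin", "login/staffssh/remote"},
                "role/ito-admin", {"listowner@inf"}, "itoperson@inf"))
```

The archive-visibility check described below works the same way, just against the list's read_archive_entitlement field instead.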

So if you have a group of people (for example, ITO admin staff) who need to administer the various teaching-related mailing lists, and who already have the entitlement “role/ito-admin”, that entitlement can be added to the new “admin_entitlement” field on the “General Options” page:

Example of admin_entitlements option from the General Options page

In this example, the computing staff – role/sysman – would also have access.

Similarly, a new field, “read_archive_entitlement”, has been added to the “Archiving Options” page. This gives users with the specified entitlement permission to view otherwise private web archives; normally only the list members can see private archives. This could be useful for taught course student lists, where normally only the students on that particular course can see the web archives, but we might like students from the whole year to see them.

Example of read_archive_entitlement option on the Archiving Options page.

In this example any user with the “roles/student” entitlement would be able to view the web archives.

These entitlement changes require the use of our Cosign service, which means people must view the pages on lists.inf.ed.ac.uk via the HTTPS version of the URL, eg
https://lists.inf.ed.ac.uk/mailman/private/inf-general/ to view the inf-general web archive.

If this sounds like it may be of use to you, but you are not sure about “entitlements”, submit a support ticket and we’ll give you a hand.

Neil