DICE Software Collections

As many of you are aware, the standard versions of various developer tools provided in Scientific Linux 6 (e.g. gcc) have now become quite old. To gain access to newer versions you can now have various software collections added to your system. These are extra packages provided by Red Hat; details are available on the computing help site.

As part of the work necessary to provide access to more of these software collections we have upgraded the devtoolset collection from version 2 to 3. If you are currently using this collection to get a newer version of gcc you must change your scripts after your system has applied updates overnight (Wednesday 7th to Thursday 8th January). The specific devtoolset name has changed so it will now be activated like:

        scl enable devtoolset-3 bash

(note the change from devtoolset-2 to devtoolset-3). Apologies for forcing this incompatible change at short notice; we hope that the changes we’ve made will allow us to avoid this pain in the future. The benefit of this change is that gcc will be upgraded to version 4.9.1.
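For scripts that currently hard-code devtoolset-2, one option is to look up the collection name instead of embedding it. The following is only a sketch: the helper name newest_devtoolset is hypothetical, and it assumes that `scl --list` prints one installed collection name per line on your machine.

```shell
#!/bin/bash
# Hypothetical helper: read collection names on stdin (one per line) and
# print the devtoolset-N entry with the highest N.
newest_devtoolset() {
    grep -E '^devtoolset-[0-9]+$' | sort -t- -k2,2n | tail -n 1
}

# Example use on a machine with scl installed (assumption: `scl --list`
# prints one collection name per line):
#   collection=$(scl --list | newest_devtoolset)
#   scl enable "$collection" 'gcc --version'
```

Simply replacing devtoolset-2 with devtoolset-3 in your scripts works just as well; the lookup only helps if you would rather not edit them again after a future version change.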

As usual, if you have any queries about this please contact the Computing Team via the support form.

Posted in Uncategorized | 2 Comments

New staff NX server (revisited)

We have now resolved the hardware problems with the new staff NX server, so we can reschedule the planned upgrade.

On Tuesday 6th January we plan to replace the staff NX server named central which hosts staff.nx.inf.ed.ac.uk with a machine named northern.

All that will happen is that at about 09:00 on Tuesday we will change the DNS aliases to point to the new machine. This change can take some time to propagate so we will not switch off access to central immediately. It will be left running as normal until 12:00 Friday 9th January. This should allow sufficient time for users logged in to finish their existing sessions and move to the new server.

The IP address for the service will change from 129.215.33.56 to 129.215.33.85, so your NX client may warn you about this change and request verification. For reference, the new RSA host key fingerprint is: c3:46:f4:e5:13:d2:cb:6c:df:a1:d9:24:79:68:15:d6
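A fingerprint in this colon-separated form is the MD5 digest of the server's public key, so it can be recomputed from the key and compared with the published value. The sketch below is illustrative only: the function name md5_fingerprint is hypothetical, and it assumes the key is available as a standard OpenSSH "ssh-rsa <base64>" line and that GNU base64 and md5sum are installed.

```shell
#!/bin/bash
# Hypothetical helper: read an OpenSSH public key line on stdin and print
# its MD5 fingerprint as 16 colon-separated hex bytes.
md5_fingerprint() {
    # Take the base64 blob that follows the "ssh-rsa" token, decode it,
    # hash it, then insert a colon after every two hex digits.
    awk '{ for (i = 1; i < NF; i++) if ($i == "ssh-rsa") { print $(i + 1); exit } }' \
        | base64 -d | md5sum | cut -d' ' -f1 | sed 's/../&:/g; s/:$//'
}

# Example use (assumption: the server's key can be fetched with ssh-keyscan):
#   ssh-keyscan -t rsa staff.nx.inf.ed.ac.uk 2>/dev/null | md5_fingerprint
```

If the output matches the fingerprint quoted above, it is safe to accept the new key when your client asks.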

More information regarding the NX service can be found on our help pages. If you encounter any problems accessing the NX service please contact us via the Support Form.

Posted in Uncategorized | Leave a comment

cron problems

We have recently discovered a problem with the cron daemon (cronie) supplied with SL6 and SL7, which has prevented users with home directories stored in AFS from using the service on DICE machines. We have now patched the code and are satisfied that normal service has resumed. This bug was introduced as part of the SL6.5 upgrade which occurred during September and October. It wasn’t spotted quickly because normally when a cron job fails the user gets an error report via email, but due to the nature of the bug all user cron jobs were failing silently.

Apologies for any inconvenience caused. We have now put in place better monitoring of this service, so we should catch any future problems a lot more quickly.

For those interested, this problem was introduced when a security hole was fixed. The change in question is recorded in the Red Hat Bugzilla as #697485. The change was to drop privileges (i.e. go from root to the user who owns the crontab) before reading the crontab, which is clearly a sensible thing to do. A piece of code used elsewhere in cronie was reused to drop privileges. Rather annoyingly, it has an unnecessary secondary function: it insists on being able to change into the user’s home directory. With AFS the home directory is usually inaccessible (even to the user who owns the directory) as there are no Kerberos tickets or AFS tokens available at this stage in the session. There are later checks on the ability to access the home directory, which can be worked around by setting the HOME environment variable to a directory in the local filesystem, but that doesn’t work in this case since the failure happens before the crontab has been parsed.

Posted in Uncategorized | Leave a comment

Cloud survey results

Earlier this year we conducted a survey of VM/cloud usage. The project has now concluded and a formal report on the project can be found on the Rat Unit wiki. This includes a link to the survey report itself. Copies of all the documentation relating to the project can be found in AFS in the /afs/inf.ed.ac.uk/group/rat-unit/projects/vm_cloud_survey directory.

Posted in News, Project Reports | Leave a comment

OpenVPN changes

It’s now over ten years since we first set up our OpenVPN service, and things have moved on quite a bit since then.  So far we have managed to maintain compatibility for existing users of the service, but we would now like to make a couple of enhancements which unfortunately do require incompatible changes:

  1. We would like to offer a service to users of Android and iOS mobile devices.  However, the way we set up the service access keys (back at the beginning, when that was the only way to do it) is not compatible with the way these devices now require things to be done.
  2. Due to the ever-growing popularity of the service we need to expand the IP address space used so as to avoid unexpected glitches for users.

As this is necessarily an incompatible change, we’ll arrange things as follows:

  • We’ll set up new endpoints to provide the new service for testing. (This has actually already been done.)
  • We’ll create suitable new configuration files for beta-testers.
  • Once we’re happy that things are running as expected, we’ll tidy up, document and advertise the new configurations.
  • Some time later (probably around Easter next year) we’ll close down the old-style service.

We’ll also take the chance to remove some now-deprecated options from the configurations, and we’ll add some platform-specific enhancements where these appear to be generally useful.

We do have to turn off the old service in due course, rather than just leaving it running, as this will allow us to recycle the IP address range it uses.  Globally-routed IPv4 addresses are now a scarce resource, and we simply can’t justify keeping these for what will be an ever-decreasing number of users.

Look out for announcements regarding the introduction of the new service, specific mobile devices, and in particular the schedule for the retiral of the old service.

Meantime, if anyone would like to beta-test the new service, please get in touch through the support form in the usual way.

Posted in Uncategorized | Leave a comment

Seven.

You may remember that DICE Linux is not getting the major upgrade this summer (DICE teaching platform upgrade – postponed) that we optimistically forecast in February (Upgrade of DICE desktops to Scientific Linux 7). This post explains what’s been going on.

DICE Linux is based on Scientific Linux, which is based on Red Hat Enterprise Linux. To make a Linux distro into DICE, we port our configuration technology LCFG to the new system so that we can configure it appropriately. We use LCFG to, for instance, add software; make the network, printers and mail behave appropriately; control who can do what and where; and defend our systems and data against (constant) attack.

So why isn’t the latest greatest DICE ready yet? There have been two main problems.

The first was the later than expected release of RHEL 7. We had hoped for it to appear by February at the latest. Judging by previous releases this would have given the Scientific Linux team enough time to produce the corresponding SL release by April, which would have given us just about enough time to get LCFG and DICE ported and tested in time for the next session. Unfortunately RHEL 7 wasn’t released until June, and SL 7 was released this month (October).

The second major problem has been the sheer amount of new technology in RHEL 7. In a word, systemd. This ambitious replacement for init has introduced major changes to Linux. It abandons the old approach of starting services one at a time in a predetermined order, in favour of a dependency-based system. In principle this is a great idea, and it’s the approach taken by launchd, which does the same job rather successfully on Apple Macs. Some great advantages come with this approach – better control of processes for instance, and faster booting – but the scale of the changes has meant a great deal of work for us.

To cope with the changes some of our core software has had to be redesigned or replaced (rather than just recompiled and tested, as we would hope on a new system), and the required effort has been substantial. Read the SL7 LCFG port diary to get some idea of what we’ve been up against (and to).

Another way to grasp the enormity of the change from init to systemd is to look at the opposition it’s stirred up. In Linux as in life, when wide-ranging revolutionary change is imposed, there will be rebellion. In the case of systemd the perceived preference of the design team for unrelenting major change over (say) consolidation and bug fixing, and the project’s absorption of more and more formerly independent Linux services, hasn’t helped. Searching the web for systemd controversy throws up some interesting responses – Systemd: Harbinger of the Linux apocalypse, Boycott systemd, Debian fork and uselessd among them.

We think we can now cope with systemd, at least to the extent of being able to configure it to produce a basically working DICE system. Along the way we have also been tackling a number of other challenges, such as the not much less controversial GNOME 3, and the replacement of the grub bootloader with a very different grub 2, but nevertheless we hope to have a DICE SL7 desktop option available for “early adopter” staff and researchers within a month or two. Thanks to careful redesign of our software infrastructure we also hope to be able in time to offer DICE variants based on related flavours of Linux such as CentOS 7, RHEL 7 itself and Oracle Linux, if the demand is there. Are we still too optimistic? Time will tell.

Posted in Uncategorized | Leave a comment

New staff NX server

On Tuesday 4th November we plan to replace the staff NX server named central which hosts staff.nx.inf.ed.ac.uk with a machine named northern.

All that will happen is that at about 09:00 on Tuesday we will change the DNS aliases to point to the new machine. This change can take some time to propagate so we will not switch off access to central immediately. It will be left running as normal until 12:00 Friday 7th November. This should allow sufficient time for users logged in to finish their existing sessions and move to the new server.

The IP address for the service will change from 129.215.33.56 to 129.215.33.85, so your NX client may warn you about this change and request verification. For reference, the new RSA host key fingerprint is: c3:46:f4:e5:13:d2:cb:6c:df:a1:d9:24:79:68:15:d6

More information regarding the NX service can be found on our help pages.

If you encounter any problems accessing the NX service please contact us via the Support Form.

Posted in Uncategorized | Leave a comment

New NX server

On Tuesday 28th October we plan to replace the NX server named bakerloo which hosts nx.inf.ed.ac.uk with a machine named piccadilly.

All that will happen is that at about 09:00 on Tuesday we will change the DNS alias to point to the new machine. This change can take some time to propagate so we will not switch off access to bakerloo immediately. It will be left running as normal until 12:00 Friday 31st October. This should allow sufficient time for users logged in to finish their existing sessions and move to the new server.

The IP address for the service will change from 129.215.33.55 to 129.215.202.105, so your NX client may warn you about this change and request verification. For reference, the new RSA host key fingerprint is: cf:91:ba:2e:0d:76:0e:7b:8d:5a:79:77:69:68:63:4d

More information regarding the NX service can be found on our help pages. If you encounter any problems accessing the NX service please contact us via the Support Form.

Posted in Uncategorized | Leave a comment

SATABeast2 disk failures

Yesterday evening our SAN storage device, satabeast2, suffered two disk failures. We can cope with one disk failure (RAID5), but after a first failure there is a period, while the redundancy is being restored, during which a second disk failure will cause data loss. This is what happened yesterday.
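For anyone wondering why one failure is survivable but two are not: RAID5 stores an XOR parity block alongside the data blocks in each stripe. The toy sketch below uses made-up one-byte "blocks" (d0, d1, d2 are purely illustrative values, not anything from the satabeast) to show the arithmetic.

```shell
#!/bin/bash
# Toy RAID5 stripe: three one-byte data blocks plus one XOR parity block.
# The values are arbitrary and only for illustration.
d0=0x3c; d1=0xa5; d2=0x0f
parity=$(( d0 ^ d1 ^ d2 ))

# One disk lost (say the one holding d1): XOR of the surviving data
# blocks with the parity block gives d1 back.
recovered=$(( d0 ^ d2 ^ parity ))
printf 'recovered d1 = 0x%02x\n' "$recovered"

# Two disks lost (d1 and d2): the survivors only yield d1 XOR d2,
# which is not enough to determine either block on its own.
combined=$(( d0 ^ parity ))
printf 'd1 xor d2    = 0x%02x\n' "$combined"
```

While the array is rebuilding onto a replacement disk it is in the "one disk lost" state for every stripe not yet rebuilt, which is the window in which a second failure loses data.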

There is a recovery mode feature of the satabeast that will attempt to recover data from marginal disk failures, but this requires rebooting the array, which will affect access to all data (whether currently affected by the disk failures or not) on that array.

We plan to start this process at 11am today. It may take several hours to complete, but we will try to make it as short as possible.

Below is a list of the affected areas.

/group/project/statmt5     Currently affected by disk failures
/group/project/statmt6                "
/group/project/statmt7                "
/group/project/statmt8                "
/group/project/statmt9                "
/group/project/statmt10               "
/group/project/statmt11               "
/group/project/statmt12               "
/group/project/statmt13               "
/group/project/ami9      Currently working, but will be affected by shutdown
/group/project/ami10                  "
/group/project/cstr1                  "
/group/project/cstr2                  "
/group/project/cstr3                  "
/afs/inf.ed.ac.uk/group/project/imr/data    "

We will update this post as work progresses.

12:30pm The rebuild is currently at 8% complete. It’s a 9TB array BTW.
2:15pm 24% complete
4:25pm 45% complete. The areas not affected by the disk failures should be back, e.g. /group/project/ami9
5:30pm 55% complete. Will continue to monitor it over the weekend.
10:30pm 99% complete.

Saturday 8:15am: Unfortunately that last 1% was too much for it, and one of the previously failed disks (that it was trying to recover from) failed again. So it’s looking unlikely we’ll be able to recover the data from statmt5 to statmt13.

Sunday: I have contacted the suppliers of the satabeast (by email) in case they have any other tricks up their sleeves.

Monday – 6/10/2014: Having analysed the logs and details that I sent them, the suppliers are planning on replacing the controller in the satabeast. This is scheduled for 11am onwards tomorrow – Tuesday. During this time the ami, cstr and imr-data areas will be unavailable.

Tuesday – Unfortunately the wrong part was shipped, so we’ll try the controller replacement again, but on Wednesday, same sort of time – 11am.

Wednesday – The controller has been replaced, and we are trying another recovery.

Thursday – The first attempt failed. We are trying a different one and currently it is 66% complete.

Friday 10th October

Rather surprisingly that last recovery attempt seems to have worked so far, and at least a couple of the group areas seem to be intact. I am running fsck on them and making them available (read only) as they pass the checks. The current statmt areas back are:

/group/project/statmt5
/group/project/statmt6
/group/project/statmt7
/group/project/statmt8
/group/project/statmt9
/group/project/statmt10
/group/project/statmt11
/group/project/statmt12
/group/project/statmt13

Though the file-systems pass consistency checks, it is possible that the files themselves contain corruption; we can’t know for sure one way or the other. Only the authors of those files would know if their contents are as expected.

5pm That’s all the affected group areas back now. I’ve left them as read only for now so that any possible corruption isn’t made worse by things trying to write to the files. If there’s some agreement that things do look normal, we can make them writeable again.

Neil

PS If you are actually affected by the loss of data, or use any of the group areas listed above, please let me know (neilb@inf) or leave a comment below.

Posted in system event | Leave a comment

Follow us on twitter

You can now follow us on twitter at @infalerts! We will post details of planned and unplanned outages, information on new services, changes to existing services and anything else that we think may be helpful to you as users of the IT services provided by Informatics.

Information Services (IS) also post information about services they provide – see @isalerts.

Posted in Uncategorized | Leave a comment