What's Chris been doing?

Successes and failures at inf.ed.ac.uk

Posts Tagged ‘LCFG’

A bug fix for the sleep component

A new version of the LCFG sleep component, 0.30.0, is out and installed on the sleep beta test machines. It fixes bug 653.

The problem was with the code which checked keyboard/mouse idleness. I was so delighted to be able to do this at last that it went to my head and I forgot that keyboard/mouse idleness is only relevant when somebody is logged in. After logout these idleness figures can be ignored: other tests will pick up things like remote shells.

The result was that although the component correctly checked keyboard/mouse idleness and politely waited until the machine had been idle for a while before authorising sleep, it kept doing that after the user had logged out and gone away. Before the idleness check was added, my machine would fall asleep a minute or two after I logged out; with the check in place, it would wait several hours before sleeping.
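A minimal sketch of the corrected decision, in Python for illustration only (it is not the component's own code). The function name, its parameters and the threshold value are all assumptions; the logic only restates what the post describes: keyboard/mouse idleness matters while somebody is logged in, and is ignored after logout.

```python
from typing import Optional


def idleness_permits_sleep(user_logged_in: bool,
                           session_idle_s: Optional[float],
                           idle_threshold_s: float) -> bool:
    """Return True if keyboard/mouse idleness does not block sleep.

    All names here are hypothetical. Other tests (remote shells and
    so on) are assumed to be applied separately by the caller.
    """
    if not user_logged_in:
        # Nobody is logged in, so the idleness figures are irrelevant.
        # This is the case that the buggy version got wrong.
        return True
    if session_idle_s is None:
        # Idle time could not be determined: stay awake to be safe.
        return False
    # Somebody is logged in: wait until the session has been idle
    # for at least the configured delay.
    return session_idle_s >= idle_threshold_s
```

For example, `idleness_permits_sleep(False, 30.0, 3600.0)` returns `True`: a recently touched keyboard no longer holds the machine awake once the user has logged out.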

So, all fixed in lcfg-sleep 0.30.0. The intrepid beta test team has reported no problems so far, so I’m hoping that this version will hit other DICE desktops within a month or so.

Written by Chris Cooke

April 26, 2013 at 10:37 am

Posted in Uncategorized

Spring Sleep Roundup

I’ve been asked for the latest news on the LCFG sleep component (which sends our desktops to sleep when they’re idle, but ensures that they wake up to run important cron jobs). Your wish is my command. Here are the major developments since its last mention here:

  • Since 0.21.0 it’s enabled all USB devices for wake-up, rather than trying to guess which ones might have a keyboard attached to them. Simpler, more reliable. (This is so that a sleeping machine can be woken by a press of a key.)
  • Since 0.22.0 the component has tried harder to create a wake alarm to wake the machine in good time for cron jobs, even when it knows that it’s never going to send the machine to sleep. (Because machines can go to sleep by other means too.)
  • In 0.23.0 the “nosleep” command was introduced – “man nosleep” for details. Thanks to Sharon Goldwater for this idea.
  • From 0.24.0 the wake alarm is cleared when the component stops. So machines which have been shut down for the Christmas holidays will stay shut down :-)
  • Versions 0.25.0 to 0.29.0 added the holy grail: a test for login session idleness. I found the right bit of D-Bus at last! Provided the user is using GNOME, the component can find out whether, and for how long, her login session has been idle. If the user uses some other environment, the component will refrain from sleeping the machine while the login session remains; a solution for that is in development and will appear when we get time. The new resource xidletime specifies the delay between the session becoming idle and sleep being permitted.
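The "wake in good time for cron jobs" idea from the 0.22.0 item can be sketched as below. This is an illustration in Python, not the component's implementation: the function name, the five-minute lead time and the way cron times are supplied are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def wake_alarm_time(now: datetime,
                    cron_times: Iterable[datetime],
                    lead: timedelta = timedelta(minutes=5)) -> Optional[datetime]:
    """Pick a wake-up time shortly before the next cron job.

    Hypothetical helper: `lead` is an assumed safety margin so that
    the machine is fully awake before the job is due.
    """
    # Keep only jobs whose wake-up time is still in the future. A job
    # due sooner than `lead` from now needs no alarm, because the
    # machine is awake at this moment anyway.
    candidates = [t - lead for t in cron_times if t - lead > now]
    if not candidates:
        return None  # nothing to wake up for
    return min(candidates)
```

Setting the alarm even when the component never intends to sleep the machine itself matches the reasoning in the list: the machine may be put to sleep by other means.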

More recently I’ve moved a bunch of generally useful settings from the Informatics-only sleep header into lcfg/options/sleep.h.

The “login session idleness” functionality is currently being beta-tested (which got a mention on the Informatics Energy blog) so it will be installed only if you put the following in a machine’s profile:

#define LCFG_SLEEP_BETA_TEST
#include <lcfg/options/sleep.h>

I’ve been asked whether I have data on whether sleeping shortens the life of desktop machines. I’m afraid I don’t, but if you do, please get in touch. I do have a few thoughts on the matter though.

  • Modern desktop hardware and operating systems are all designed to support sleep, and they do it well. Our managed Windows desktop machines enforce sleep and this seems to work well. Ten or fifteen years ago frequent sleep might have been bad for the hardware, but nowadays I really doubt that it would cause problems.
  • We’ve now been using sleep for several years and we haven’t seen an epidemic of premature death in our desktop machines.
  • We’ve come across problems caused by hard disks being kept permanently running 24/7. “Never switch off” isn’t always the right idea.

Written by Chris Cooke

April 11, 2013 at 10:46 am

Posted in Uncategorized

TiBS is now under LCFG control

with 3 comments

Yesterday we deployed the lcfg-tibs component on our main TiBS backup server. Things seem to have gone smoothly; the software is now installed via RPM packages; the config files are now mostly generated from LCFG resources; and configuration changes are held back until TiBS is idle.

This is phase 1 of the LCFG TiBS component. Phase 2 will automatically generate the list of non-AFS backups from “someone please back me up” resources in the LCFG profiles of DICE machines. Some Nagios monitoring is also desirable! However, further development may have to wait a while: other, more urgent projects are elbowing their way ahead of this one in the development queue.

The LCFG TiBS component and its accompanying RPMs are not available for distribution outside the School of Informatics because TiBS is proprietary commercial software; but if you’ve also bought this software and you want to use the LCFG component to automate its configuration, let us know, maybe we can share the work.

Some docs:

One thing that did not go smoothly was my attempt to get the component to stop TiBS when the component stopped. TiBS is stopped with the command stoptibs, which can be found on our backup server in /usr/tibs/bin. It’s a shell script. It’s short but I won’t post it here as it’s not freely redistributable. All of my attempts to call it, with backticks or system and/or eval or whatever other wacky way I came across on Google, result in the component terminating as soon as stoptibs has run, so the component never officially stops. So far I’m baffled as to what’s wrong here. Is this an elementary Perl boob on my part? A bug somewhere in LCFG?
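One untested guess, and no more than that: if a stop script signals its whole process group, any caller running in the same group dies along with it. Nothing in the post confirms that stoptibs does this. The sketch below shows the general isolation technique in Python rather than the component's Perl; `run_isolated` is a hypothetical helper, and the Perl equivalent would be to fork and call `POSIX::setsid` in the child before exec.

```python
import subprocess
from typing import Sequence


def run_isolated(argv: Sequence[str]) -> int:
    """Run a command in its own session and return its exit status.

    Hypothetical helper. start_new_session=True makes the child call
    setsid(), so it gets a fresh process group; a signal the command
    sends to "its own group" can then no longer reach the caller.
    """
    completed = subprocess.run(list(argv), start_new_session=True)
    return completed.returncode
```

If the real cause lies elsewhere (for instance in how the script treats its parent process), this isolation would not help, so it is a diagnostic experiment rather than a fix.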

Written by Chris Cooke

November 18, 2009 at 2:36 pm

Posted in Uncategorized

TiBS LCFGification plans

The LCFGification (© Craig) of the TiBS backup software is under way. To help things along I’ve added some comments to Craig’s plan.

Written by Chris Cooke

June 8, 2009 at 3:48 pm

Posted in Uncategorized

Upgrading the LCFG master server

Yesterday I upgraded the master LCFG server tobermory to SL5. Here’s the plan I worked to, with some comments in italics saying what went wrong.

Tobermory SL5 Upgrade Plan

* No need to alter any fstab resources or to #define anything to prepare for the SL5 install

* Alter lcfg/tobermory to include os/sl5.h, and add the following to lcfg/tobermory too:
/* Restrict access to subversion */
!subversion.authzallow_lcfg_coreinc mSET(mpu)
!subversion.authzallow_lcfg_corepack mSET(mpu)
!subversion.authzallow_lcfg_liveinc mSET(mpu)
!subversion.authzallow_lcfg_livepack mSET(mpu)
/* Stop rsync and rfe from starting automatically */
!boot.services mREMOVE(lcfg_rsync)
!boot.services mREMOVE(lcfg_rfe)

And wait for tobermory’s new profile to reach it

When the machine gets its new profile with the new OS version’s resources, the resource changes are enough to screw up logins; you’re then locked out of the machine unless you already have a window open or you reverse the changes. Luckily I had a window open on tobermory.

———— CO access to svn stops ———-

* “om rfe stop” on tobermory

———— access to rfe lcfg/foo stops ——-

* “om rsync stop” on tobermory

———— no more LCFG changes ————-

* “om server stop” on all slaves (mousa, trondra, bressay, illustrious, dresden) – but keep apacheconf running

———— profile building stops —————

———— slaves serve unchanging profiles ——

* “om subversion dumpdb -- -r lcfg -g -k 30” on tobermory

* scp ALL of /var/lcfg/svndump/lcfg/ to another machine

I should have stopped the server process on all the slave servers but kept rsync on tobermory running to help me in copying off all the data. As it was, with rsync down, I pushed the data from tobermory rather than pulling it from the other machine. That meant pushing it to my own account on the other machine rather than to root, so I lost the owner and group information for each file. It would have been better to pull the data from root to root and keep the owner/group information.

———— subversion repository data saved ——

* rsync /var/rfedata to another machine

———— lcfg source files now saved ———–

* rsync /var/lcfg/releases to another machine

———— stable and testing releases saved ——-

* copy /var/svn/lcfg/hooks/* to another machine (these are not dumped by svnadmin!)

———— autocheckout mechanism saved ——–

* copy /var/lcfg/lcfgrelease.sh to another machine

———— release script saved —————–

This is the point at which I should have shut down the rsync component on tobermory.
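A sketch of the root-to-root pull recommended above, covering the five locations saved so far. It only builds command strings and runs nothing. The assumptions are considerable: it uses rsync over SSH as root (the post does not say whether root SSH between these machines is allowed, and the original plan may have meant the rsync daemon), and the destination directory and function name are made up for the example.

```python
import shlex
from typing import List, Sequence

# Paths taken from the plan above.
SAVED_PATHS = (
    "/var/lcfg/svndump/lcfg/",
    "/var/rfedata/",
    "/var/lcfg/releases/",
    "/var/svn/lcfg/hooks/",
    "/var/lcfg/lcfgrelease.sh",
)


def pull_commands(source_host: str,
                  dest_root: str,
                  paths: Sequence[str] = SAVED_PATHS) -> List[str]:
    """Build rsync commands to run as root on the backup machine."""
    commands = []
    for path in paths:
        # -a (archive) keeps owner, group and permissions when the
        # receiving rsync runs as root; -R (--relative) recreates the
        # full source path underneath dest_root.
        argv = ["rsync", "-aR", f"root@{source_host}:{path}", dest_root]
        commands.append(" ".join(shlex.quote(arg) for arg in argv))
    return commands
```

Calling `pull_commands("tobermory", "/backup/tobermory/")` yields one command per path, for example `rsync -aR root@tobermory:/var/rfedata/ /backup/tobermory/`.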

* install SL5 on tobermory

* Restore all of /var/lcfg/svndump from elsewhere

———— svn dumps now present ————–

* Restore /var/rfedata from elsewhere

This is where I started noticing the file group and ownership problem. All the LCFG files were owned by me – whoops!

———— lcfg source files exist again ———–

* Start the subversion component if not already started

————- repository exists once more ———-

* Reload the repository: “svnadmin load” the dump file

On the test machine it had taken about a minute to load all 11000 revisions, but on the real server it took 15-20 minutes.

————- repository now contains our data ——

* Restore the post-commit and pre-commit hooks from elsewhere (*after* doing the load!)

————- autocheckout now enabled ———–

* Check out something from the repository; change it; commit

Access denied! It turned out that my subversion.authzallow resource changes – intended to restrict read/write access to the repository to just the MPU – hadn’t worked as intended. Stephen explains:
Normally the access to these areas is done in terms of higher-level
groups (e.g. cos, csos). The problem was caused because the access
permissions weren’t set for the mpu group specifically, this meant
they defaulted to “read-only”. Note:

authzallowperms_lcfg_coreinc_mpu=r
authzallowperms_lcfg_corepack_mpu=r
authzallowperms_lcfg_liveinc_mpu=r
authzallowperms_lcfg_livepack_mpu=r
authzallowperms_lcfg_root_all=r

Each of those resources would need to be set to ‘rw’.

Stephen solved the problem by editing /var/svn/lcfg/conf/authz directly.
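For the record, the resource route would presumably have meant adding four more lines to lcfg/tobermory. The sketch below generates them. The resource names come from Stephen's note; the `!subversion.` prefix and the `mSET(...)` form simply mirror the authzallow lines earlier in the plan, so treat the exact syntax as an assumption rather than tested configuration. The `authzallowperms_lcfg_root_all` resource is deliberately left out, since it is not specific to the mpu group.

```python
from typing import List, Sequence

# Repository areas named in the plan's authzallow lines.
AREAS = ("coreinc", "corepack", "liveinc", "livepack")


def authz_perm_lines(group: str = "mpu",
                     perms: str = "rw",
                     areas: Sequence[str] = AREAS) -> List[str]:
    """Generate candidate LCFG lines granting `perms` to `group`."""
    return [
        f"!subversion.authzallowperms_lcfg_{area}_{group} mSET({perms})"
        for area in areas
    ]


if __name__ == "__main__":
    print("\n".join(authz_perm_lines()))
```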

After this was solved there was another problem: my svn commit would succeed, but nothing would appear in the autocheckout directory. After some digging and experimenting I found that the permissions and ownership of the autocheckout directory needed to match those of the repository exactly, or nothing would be checked out. When I compared them I found that the group and the permissions didn’t match. Once I had adjusted them and tried another test commit, the autocheckout succeeded.
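A comparison like that is quick to script. The following is a small sketch, not part of the LCFG tooling; the function name is made up, and which two directories to compare is left to the caller.

```python
import os
import stat
from typing import Dict, Tuple


def ownership_mismatches(reference: str, target: str) -> Dict[str, Tuple]:
    """Report where target's owner, group or mode differs from reference.

    Returns an empty dict when the two paths match on all three.
    """
    ref = os.stat(reference)
    tgt = os.stat(target)
    diffs: Dict[str, Tuple] = {}
    if ref.st_uid != tgt.st_uid:
        diffs["owner"] = (ref.st_uid, tgt.st_uid)
    if ref.st_gid != tgt.st_gid:
        diffs["group"] = (ref.st_gid, tgt.st_gid)
    # S_IMODE strips the file-type bits, leaving only the permission
    # bits (including setuid/setgid/sticky).
    ref_mode = stat.S_IMODE(ref.st_mode)
    tgt_mode = stat.S_IMODE(tgt.st_mode)
    if ref_mode != tgt_mode:
        diffs["mode"] = (oct(ref_mode), oct(tgt_mode))
    return diffs
```

An empty result means owner, group and mode all agree; anything else names the attribute to fix, with the reference value first.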

————- /var/lib/autocheckout now populated —
————- develop and default releases now there –

* Restore /var/lcfg/releases from elsewhere

And correct the file ownership

————- stable and testing releases now there —-

* restore /var/lcfg/lcfgrelease.sh

And correct the file ownership

————- release script restored —————-

* restore /var/cache/lcfgreleases from rsync.lcfg.org::lcfgreleases

And correct the file ownership

————- stable releases cache restored ———

* Start the rfe and rsync components on tobermory

————- CO access to lcfg/foo restored ———
————- MPU access to svn restored ————

* Start the rsync component on tobermory

————- changes now available to slaves again —-

* Remove the boot.services alterations from lcfg/tobermory

* Remove subversion.authzallow restrictions from lcfg/tobermory

* om bressay.server start

* om illustrious.server start

* om dresden.server start

* om mousa.apacheconf stop

* om mousa.server start

————– mammoth rebuilds now hopefully start —

* Mail a progress report to cos

I told the COs that they could now once again use rfe and also use the subversion repository. They could use rfe but subversion wasn’t actually going to be available to them until tobermory had got its new profile, which it hadn’t yet. Oops.

————– mousa rebuild finishes —————-

A complete rebuild has recently been taking an enormous amount of time, two to three hours, and crippling the slave server involved. That’s why I shut down apacheconf – to minimise the load on the machine. However with no existing spanning maps this complete rebuild took precisely 35 minutes! Normally it takes 30 minutes to build 1000 profiles when the stable release is installed. 35 minutes is on a par with that – we currently have a total of 1143 profiles on the slave servers. The total number of hosts to process was 2813. The difference between the two figures is from hosts which have no XML profile associated with them.
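The arithmetic behind "on a par" can be checked directly from the figures quoted above; the function names below are just for illustration.

```python
def expected_build_minutes(profiles: int,
                           minutes_per_thousand: float = 30.0) -> float:
    """Scale the usual rate (30 minutes per 1000 profiles) linearly."""
    return profiles * minutes_per_thousand / 1000


def hosts_without_profile(total_hosts: int, profiles: int) -> int:
    """Hosts that were processed but have no XML profile."""
    return total_hosts - profiles


if __name__ == "__main__":
    # Figures from the post: 1143 profiles, 2813 hosts processed.
    print(round(expected_build_minutes(1143), 2))  # 34.29
    print(hosts_without_profile(2813, 1143))       # 1670
```

At the usual rate, 1143 profiles would take about 34.3 minutes, so the observed 35 minutes is right in line, and 1670 of the 2813 hosts processed had no XML profile.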

* rfe dns/inf, point lcfg alias at mousa

* om mousa.apacheconf start

————– CO access to svn restored ————-

————– LCFG service is now functional ———-

* om trondra.apacheconf stop

* om trondra.server start

————– mammoth trondra rebuild starts ———

————– mammoth trondra rebuild finishes ——-

* om trondra.apacheconf start

————– LCFG service is back to normal ———-

* Announce to cos

Written by Chris Cooke

March 18, 2008 at 9:54 am

When reinstalling an LCFG server

Note to self and colleagues for future.

When reinstalling one of the LCFG servers, things would go more smoothly if the following were done:

  • At least a day before, shift the lcfg.inf.ed.ac.uk alias from the machine you’re going to reinstall to the other LCFG server. Machines which are starting their reinstall contact this server for their profiles, and you don’t want to break installs for several hours – besides which the machine’s own install will stall with “failed to download LCFG profile”.
  • Also beforehand, remove lcfg_server and lcfg_apacheconf from boot.services. Keep them out until the machine has finished its reinstall and is up and running OK as a normal machine.
  • Then reinstate lcfg_server, start it up, and wait until it has finished remaking all of the profiles it needs to. This can take an hour or two.
  • Then reinstate lcfg_apacheconf, start it up, and check in the apacheconf.lcfghost log that it’s serving profiles correctly.
  • If you’re using server.keepmaps, then go to the other LCFG server and zap its server cache. This should synchronise the content of the spanning maps with the newly reinstalled server. If you don’t do this you’ll get instability in the spanning maps and machines will go in and out of them.

Written by Chris Cooke

February 19, 2008 at 12:35 pm

Posted in Uncategorized

DICE disk partitioning

with 5 comments

I’ve finally documented how disk partitioning on DICE works at the LCFG header file level, and how to change the default disk partitions. I think access to the wiki is restricted – if you’re not in Informatics I think you’ll need to use iFriend to get in. (It’s also handy for commenting on this blog.)

Written by Chris Cooke

February 19, 2008 at 12:04 pm