# What's Chris been doing?

Successes and failures at inf.ed.ac.uk

## Upgrading the LCFG master server

Yesterday I upgraded the master LCFG server, tobermory, to SL5. Here’s the plan I worked to, with some comments in italics noting what went wrong.

* No need to alter any fstab resources or to #define anything to prepare for the SL5 install

* Alter lcfg/tobermory to use os/sl5.h, and add the following to lcfg/tobermory as well:
```
!subversion.authzallow_lcfg_coreinc mSET(mpu)
!subversion.authzallow_lcfg_corepack mSET(mpu)
!subversion.authzallow_lcfg_liveinc mSET(mpu)
!subversion.authzallow_lcfg_livepack mSET(mpu)
/* Stop rsync and rfe from starting automatically */
!boot.services mREMOVE(lcfg_rsync)
!boot.services mREMOVE(lcfg_rfe)
```

Then wait for tobermory’s new profile to reach it.

When the machine gets its new profile, with the new OS version’s resources, the resource changes are enough to break logins: you’re then locked out of the machine unless you already have a window open or you reverse the changes. Luckily I had a window open on tobermory.

* “om rfe stop” on tobermory

* “om rsync stop” on tobermory

———— no more LCFG changes ————-

* “om server stop” on all slaves (mousa, trondra, bressay, illustrious, dresden) – but keep apacheconf running
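Stopping the server component on each slave is repetitive, so it can be scripted; a minimal sketch, assuming `om` can be invoked over ssh (shown as a dry run that only prints the commands, since the hostnames are site-specific):

```shell
#!/bin/sh
# Stop the LCFG server component on every slave, leaving apacheconf
# running so the slaves keep serving their (now unchanging) profiles.
# Dry run: echo the commands instead of executing them over ssh.
SLAVES="mousa trondra bressay illustrious dresden"
for h in $SLAVES; do
    echo "ssh $h om server stop"
done
```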

———— profile building stops —————

———— slaves serve unchanging profiles ——

* “om subversion dumpdb — -r lcfg -g -k 30” on tobermory

* scp ALL of /var/lcfg/svndump/lcfg/ to another machine

I should have stopped the server process on all the slave servers but kept rsync on tobermory running to help me copy off all the data. As it was, with rsync down, I pushed the data from tobermory rather than pulling it from the other machine, and I pushed it to my own account on the other machine rather than to root, so I lost the owner and group information for each file. It would have been better to pull root-to-root and keep the owner/group information.
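That lesson can be captured as a single command: pulling root-to-root with rsync in archive mode preserves ownership, which a push to a non-root account cannot. A hedged sketch (the source path is from the text; the destination path is illustrative, and the command is only echoed rather than run):

```shell
#!/bin/sh
# -a (archive) implies -o and -g, so owner and group survive the copy,
# but only when the receiving side runs as root.
SRC="root@tobermory:/var/lcfg/svndump/lcfg/"
DEST="/backup/lcfg/svndump/lcfg/"   # illustrative destination path
echo "rsync -a $SRC $DEST"
```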

———— subversion repository data saved ——

* rsync /var/rfedata to another machine

———— lcfg source files now saved ———–

* rsync /var/lcfg/releases to another machine

———— stable and testing releases saved ——-

* copy /var/svn/lcfg/hooks/* to another machine (these are not dumped by svnadmin!)

———— autocheckout mechanism saved ——–

* copy /var/lcfg/lcfgrelease.sh to another machine

———— release script saved —————–

This is the point at which I should have shut down the rsync component on tobermory.

* install SL5 on tobermory

* Restore all of /var/lcfg/svndump from elsewhere

———— svn dumps now present ————–

* Restore /var/rfedata from elsewhere

This is where I started noticing the file group and ownership problem. All the LCFG files were owned by me – whoops!

———— lcfg source files exist again ———–

* Start the subversion component if not already started

————- repository exists once more ———-

On the test machine it had taken about a minute to load all 11000 revisions, but on the real server it took 15-20 minutes.
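Behind the scenes the load amounts to replaying the dump files with svnadmin; a sketch of doing it by hand, assuming the dumps are gzipped (the repository path is from the text, the dump filename is illustrative, and the command is only echoed rather than run):

```shell
#!/bin/sh
# Replay a dump into the repository; with incremental dumps, load them
# in revision order. Dry run only.
REPO=/var/svn/lcfg
DUMP=/var/lcfg/svndump/lcfg/lcfg.dump.gz   # illustrative filename
echo "gunzip -c $DUMP | svnadmin load $REPO"
```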

————- repository now contains our data ——

* Restore the post-commit and pre-commit hooks from elsewhere (*after* doing the load!)
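Hook scripts live alongside the repository database but are not part of it, which is why svnadmin dump/load ignores them; restoring them only after the load presumably stops the post-commit autocheckout from firing on every replayed revision. A dry-run sketch (the backup host name and path are illustrative):

```shell
#!/bin/sh
# Copy the hook scripts back into place after the load has finished;
# hooks must be executable to run. Dry run: commands are only echoed.
echo "scp backuphost:/backup/lcfg-hooks/* /var/svn/lcfg/hooks/"
echo "chmod +x /var/svn/lcfg/hooks/pre-commit /var/svn/lcfg/hooks/post-commit"
```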

————- autocheckout now enabled ———–

* Check out something from the repository; change it; commit

Access denied! It turned out that my subversion.authzallow resource changes – intended to restrict read/write access to the repository to just the MPU – hadn’t worked as intended. Stephen explains:
Normally the access to these areas is done in terms of higher-level groups (e.g. cos, csos). The problem was caused because the access permissions weren’t set for the mpu group specifically, which meant:

```
authzallowperms_lcfg_coreinc_mpu=r
authzallowperms_lcfg_corepack_mpu=r
authzallowperms_lcfg_liveinc_mpu=r
authzallowperms_lcfg_livepack_mpu=r
authzallowperms_lcfg_root_all=r
```

Each of those resources would need to be set to ‘rw’.
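In LCFG resource terms the fix would presumably look like this, mirroring the mSET lines added to lcfg/tobermory earlier (the resource names are taken from the listing above; the exact syntax is an assumption):

```
!subversion.authzallowperms_lcfg_coreinc_mpu mSET(rw)
!subversion.authzallowperms_lcfg_corepack_mpu mSET(rw)
!subversion.authzallowperms_lcfg_liveinc_mpu mSET(rw)
!subversion.authzallowperms_lcfg_livepack_mpu mSET(rw)
```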

Stephen solved the problem by editing /var/svn/lcfg/conf/authz directly.

After this was solved there was another problem; my svn commit would succeed, but nothing would appear in the autocheckout directory. After some digging and experimenting I found that the permissions and ownership on the autocheckout directory needed to exactly match that of the repository or nothing would be checked out. When I compared them I found that the group and the permissions didn’t match. Once I had adjusted them and tried another test commit, the autocheckout succeeded.
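The permission comparison can be done mechanically; a small sketch using GNU stat, demonstrated on throwaway directories (on the real server the two paths would be the repository and the autocheckout directory, e.g. /var/svn/lcfg and /var/lib/autocheckout):

```shell
#!/bin/sh
# The autocheckout directory's owner, group and mode must exactly match
# the repository's, or the post-commit checkout silently does nothing.
repo=$(mktemp -d)       # stand-in for /var/svn/lcfg
checkout=$(mktemp -d)   # stand-in for /var/lib/autocheckout
chmod 2775 "$repo" "$checkout"   # give both the same mode
p1=$(stat -c '%U %G %a' "$repo")
p2=$(stat -c '%U %G %a' "$checkout")
if [ "$p1" = "$p2" ]; then
    echo "permissions match: $p1"
else
    echo "MISMATCH: repo=$p1 checkout=$p2"
fi
rm -rf "$repo" "$checkout"
```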

————- /var/lib/autocheckout now populated —
————- develop and default releases now there –

* Restore /var/lcfg/releases from elsewhere

And correct the file ownership

————- stable and testing releases now there —-

* restore /var/lcfg/lcfgrelease.sh

And correct the file ownership

————- release script restored —————-

* restore /var/cache/lcfgreleases from rsync.lcfg.org::lcfgreleases

And correct the file ownership

————- stable releases cache restored ———
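The double-colon in rsync.lcfg.org::lcfgreleases addresses an rsync daemon module rather than a path over ssh; a dry-run sketch of the restore (the destination path is from the text, and the command is only echoed rather than run):

```shell
#!/bin/sh
# Restore the stable releases cache from the rsync module.
# Dry run: print the command instead of contacting rsync.lcfg.org.
echo "rsync -a rsync.lcfg.org::lcfgreleases/ /var/cache/lcfgreleases/"
```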

* Start the rfe component on tobermory

* Start the rsync component on tobermory

————- changes now available to slaves again —-

* Remove the boot.services alterations from lcfg/tobermory

* Remove subversion.authzallow restrictions from lcfg/tobermory

* om bressay.server start

* om illustrious.server start

* om dresden.server start

* om mousa.apacheconf stop

* om mousa.server start

————– mammoth rebuilds now hopefully start —

* Mail a progress report to cos

I told the COs that they could now once again use rfe and the subversion repository. In fact they could use rfe, but subversion wasn’t actually going to be available to them until tobermory had received its new profile, which it hadn’t yet. Oops.

————– mousa rebuild finishes —————-

A complete rebuild had recently been taking an enormous amount of time (two to three hours) and crippling the slave server involved. That’s why I shut down apacheconf: to minimise the load on the machine. However, with no existing spanning maps this complete rebuild took precisely 35 minutes! Normally it takes 30 minutes to build 1000 profiles when the stable release is installed, and 35 minutes is on a par with that: we currently have a total of 1143 profiles on the slave servers. The total number of hosts to process was 2813; the difference between the two figures comes from hosts which have no XML profile associated with them.

* rfe dns/inf, point lcfg alias at mousa

* om mousa.apacheconf start

————– LCFG service is now functional ———-

* om trondra.apacheconf stop

* om trondra.server start

————– mammoth trondra rebuild starts ———

————– mammoth trondra rebuild finishes ——-

* om trondra.apacheconf start

————– LCFG service is back to normal ———-

* Announce to cos

Written by Chris Cooke

March 18, 2008 at 9:54 am