Yesterday I upgraded the master LCFG server tobermory to SL5. Here’s the plan I worked to, with some comments in italics saying what went wrong.
Tobermory SL5 Upgrade Plan
* No need to alter any fstab resources or to #define anything to prepare for the SL5 install
* Alter lcfg/tobermory to use os/sl5.h, and add the following to lcfg/tobermory too:
/* Restrict access to subversion */
/* Stop rsync and rfe from starting automatically */
And wait for tobermory’s new profile to reach it
When the machine gets its new profile with the new OS version’s resources, the resource changes are enough to screw up logins; you’re then locked out of the machine unless you already have a window open or you reverse the changes. Luckily I had a window open on tobermory.
———— CO access to svn stops ———-
* “om rfe stop” on tobermory
———— access to rfe lcfg/foo stops ——-
* “om rsync stop” on tobermory
———— no more LCFG changes ————-
* “om server stop” on all slaves (mousa, trondra, bressay, illustrious, dresden) – but keep apacheconf running
———— profile building stops —————
———— slaves serve unchanging profiles ——
* "om subversion dumpdb -- -r lcfg -g -k 30" on tobermory
* scp ALL of /var/lcfg/svndump/lcfg/ to another machine
I should have stopped the server process on all the slave servers but kept rsync on tobermory running to help with copying off all the data. As it was, with rsync down, I pushed the data from tobermory rather than pulling it from the other machine, and I pushed it to my own account on the other machine rather than to root, so I lost the owner and group information for each file. It would have been better to pull the data from root to root, keeping the owner and group info.
———— subversion repository data saved ——
* rsync /var/rfedata to another machine
———— lcfg source files now saved ———–
* rsync /var/lcfg/releases to another machine
———— stable and testing releases saved ——-
* copy /var/svn/lcfg/hooks/* to another machine (these are not dumped by svnadmin!)
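The warning in that step matters: svnadmin dump captures only the revision data, while hooks (and the conf directory, including authz) are plain files inside the repository directory and must be archived separately. A minimal sketch, using a mock repository layout in place of /var/svn/lcfg:

```shell
#!/bin/sh
# Mock repository layout standing in for /var/svn/lcfg; a
# dump/load cycle recreates revisions but not these plain files.
set -e
tmp=$(mktemp -d)
mkdir -p "$tmp/repo/hooks" "$tmp/repo/conf"
printf '#!/bin/sh\n' > "$tmp/repo/hooks/post-commit"
chmod 755 "$tmp/repo/hooks/post-commit"
printf '[groups]\n' > "$tmp/repo/conf/authz"

# Archive hooks and conf alongside the svnadmin dump output:
tar -C "$tmp/repo" -czf "$tmp/hooks-and-conf.tar.gz" hooks conf

tar -tzf "$tmp/hooks-and-conf.tar.gz"
# hooks/
# hooks/post-commit
# conf/
# conf/authz
```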
———— autocheckout mechanism saved ——–
* copy /var/lcfg/lcfgrelease.sh to another machine
———— release script saved —————–
This is the point at which I should have shut down the rsync component on tobermory.
* install SL5 on tobermory
* Restore all of /var/lcfg/svndump from elsewhere
———— svn dumps now present ————–
* Restore /var/rfedata from elsewhere
This is where I started noticing the file group and ownership problem. All the LCFG files were owned by me – whoops!
———— lcfg source files exist again ———–
* Start the subversion component if not already started
————- repository exists once more ———-
* Reload the repository: “svnadmin load” the dump file
On the test machine it had taken about a minute to load all 11000 revisions, but on the real server it took 15-20 minutes.
————- repository now contains our data ——
* Restore the post-commit and pre-commit hooks from elsewhere (*after* doing the load!)
————- autocheckout now enabled ———–
* Check out something from the repository; change it; commit
Access denied! It turned out that my subversion.authzallow resource changes – intended to restrict read/write access to the repository to just the MPU – hadn’t worked as intended. Stephen explains:
Normally the access to these areas is done in terms of higher-level
groups (e.g. cos, csos). The problem was caused because the access
permissions weren’t set for the mpu group specifically, so they
defaulted to “read-only”. Note that each of those resources would
need to be set to ‘rw’.
Stephen solved the problem by editing /var/svn/lcfg/conf/authz directly.
After this was solved there was another problem; my svn commit would succeed, but nothing would appear in the autocheckout directory. After some digging and experimenting I found that the permissions and ownership on the autocheckout directory needed to exactly match that of the repository or nothing would be checked out. When I compared them I found that the group and the permissions didn’t match. Once I had adjusted them and tried another test commit, the autocheckout succeeded.
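That matching requirement can be checked mechanically. A sketch, with mock directories standing in for the repository and the autocheckout directory (the real fix also needed the group corrected, which requires root; directory names here are illustrative):

```shell
#!/bin/sh
# Mock directories; in the real case these were /var/svn/lcfg and
# the autocheckout directory, and owner/group had to match too.
set -e
tmp=$(mktemp -d)
mkdir "$tmp/repo" "$tmp/autocheckout"
chmod 2775 "$tmp/repo"          # setgid, group-writable
chmod 755 "$tmp/autocheckout"   # mismatched

# Show mode, owner and group side by side to spot the mismatch:
stat -c '%a %U %G %n' "$tmp/repo" "$tmp/autocheckout"

# Copy the repository's mode across (root would also use
# chown --reference to bring owner and group into line):
chmod --reference="$tmp/repo" "$tmp/autocheckout"

[ "$(stat -c '%a' "$tmp/autocheckout")" = 2775 ] && echo "modes match"
```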
————- /var/lib/autocheckout now populated —
————- develop and default releases now there –
* Restore /var/lcfg/releases from elsewhere
And correct the file ownership
————- stable and testing releases now there —-
* Restore /var/lcfg/lcfgrelease.sh
And correct the file ownership
————- release script restored —————-
* Restore /var/cache/lcfgreleases from rsync.lcfg.org::lcfgreleases
And correct the file ownership
————- stable releases cache restored ———
* Start the rfe component on tobermory
————- CO access to lcfg/foo restored ———
————- MPU access to svn restored ————
* Start the rsync component on tobermory
————- changes now available to slaves again —-
* Remove the boot.services alterations from lcfg/tobermory
* Remove subversion.authzallow restrictions from lcfg/tobermory
* om bressay.server start
* om illustrious.server start
* om dresden.server start
* om mousa.apacheconf stop
* om mousa.server start
————– mammoth rebuilds now hopefully start —
* Mail a progress report to cos
I told the COs that they could now once again use rfe and the subversion repository. They could indeed use rfe, but subversion wasn’t actually going to be available to them until tobermory had got its new profile, which it hadn’t yet. Oops.
————– mousa rebuild finishes —————-
A complete rebuild has recently been taking an enormous amount of time – two to three hours – and crippling the slave server involved. That’s why I shut down apacheconf: to minimise the load on the machine. However, with no existing spanning maps this complete rebuild took just 35 minutes! Normally it takes 30 minutes to build 1000 profiles when the stable release is installed, and 35 minutes is on a par with that – we currently have a total of 1143 profiles on the slave servers. The total number of hosts to process was 2813; the difference between the two figures (1670 hosts) is hosts which have no XML profile associated with them.
* rfe dns/inf, point lcfg alias at mousa
* om mousa.apacheconf start
————– CO access to svn restored ————-
————– LCFG service is now functional ———-
* om trondra.apacheconf stop
* om trondra.server start
————– mammoth trondra rebuild starts ———
————– mammoth trondra rebuild finishes ——-
* om trondra.apacheconf start
————– LCFG service is back to normal ———-
* Announce to cos
Note to self and colleagues for the future.
When reinstalling one of the LCFG servers, things would go more smoothly if the following were done:
- At least a day before, shift the lcfg.inf.ed.ac.uk alias from the machine you’re going to reinstall to the other LCFG server. Machines which are starting their reinstall contact this server for their profiles, and you don’t want to break installs for several hours – besides which the machine’s own install will stall with “failed to download LCFG profile”.
- Also beforehand, remove lcfg_server and lcfg_apacheconf from boot.services. Keep them out until the machine has finished its reinstall and is up and running OK as a normal machine.
- Then reinstate lcfg_server, start it up, and wait until it has finished remaking all of the profiles it needs to. This can take an hour or two.
- Then reinstate lcfg_apacheconf, start it up, and check in the apacheconf.lcfghost log that it’s serving profiles correctly.
- If you’re using server.keepmaps, then go to the other LCFG server and zap its server cache. This should synchronise the content of the spanning maps with the newly reinstalled server. If you don’t do this you’ll get instability in the spanning maps and machines will go in and out of them.