# What's Chris been doing?

Successes and failures at inf.ed.ac.uk

## A bug fix for the sleep component

A new version of the LCFG sleep component, 0.30.0, is out and installed on the sleep beta test machines. It fixes bug 653.

The problem was with the code which checked keyboard/mouse idleness. I was so delighted to be able to do this at last that it went to my head and I forgot that keyboard/mouse idleness is only relevant when somebody is logged in. After logout these idleness figures can be ignored: other tests will pick up things like remote shells.

The result was that although the component correctly checked keyboard/mouse idleness and politely waited until the machine had been idle for a while before authorising sleep, it kept doing that after the user had logged out and gone away. Before this my machine would fall asleep a minute or two after I logged out; with this it would wait several hours before sleeping.
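The corrected behaviour can be sketched roughly as below. This is only an illustration in shell, not the component's actual code: the function name `session_allows_sleep` and its three arguments are made up for the example, and the threshold simply stands for "idle for a while".

```shell
# Hypothetical helper, not part of lcfg-sleep itself.
# Usage: session_allows_sleep USERS_LOGGED_IN IDLE_SECONDS THRESHOLD_SECONDS
# Returns 0 (true) if the keyboard/mouse idleness test permits sleep.
session_allows_sleep() {
    users_logged_in=$1
    idle_seconds=$2
    threshold_seconds=$3

    # Nobody is logged in: keyboard/mouse idleness is irrelevant, so this
    # test raises no objection. Other tests still catch remote shells etc.
    if [ "$users_logged_in" -eq 0 ]; then
        return 0
    fi

    # Somebody is logged in: permit sleep only once the session has been
    # idle for at least the threshold.
    [ "$idle_seconds" -ge "$threshold_seconds" ]
}
```

The bug amounted to skipping the first check, so the idle timer was still consulted after logout.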

So, all fixed in lcfg-sleep 0.30.0. Everything seems OK so far from the intrepid beta test team so I’m hoping that this version will hit other DICE desktops within a month or so.

Written by Chris Cooke

April 26, 2013 at 10:37 am

Posted in Uncategorized


## Spring Sleep Roundup

I’ve been asked for the latest news on the LCFG sleep component (which sends our desktops to sleep when they’re idle, but ensures that they wake up to run important cron jobs). Your wish is my command. Here are the major developments since its last mention here:

• Since 0.21.0 it’s enabled all USB devices for wake-up, rather than trying to guess which ones might have a keyboard attached to them. Simpler, more reliable. (This is so that a sleeping machine can be woken by a press of a key.)
• Since 0.22.0 the component has tried harder to create a wake alarm to wake the machine in good time for cron jobs, even when it knows that it’s never going to send the machine to sleep. (Because machines can go to sleep by other means too.)
• In 0.23.0 the “nosleep” command was introduced – “man nosleep” for details. Thanks to Sharon Goldwater for this idea.
• From 0.24.0 the wake alarm is cleared when the component stops. So machines which have been shut down for the Christmas holidays will stay shut down.
• Versions 0.25.0 to 0.29.0 added the holy grail: a test for login session idleness. I found the right bit of D-Bus at last! Provided the user is using GNOME, the component can find out whether, and for how long, her login session has been idle. If the user uses some other environment the component will refrain from sleeping the machine while the login session remains; a solution for that is in development and will appear when we get time. A new resource, xidletime, specifies the delay between the session becoming idle and sleep being permitted.

More recently I’ve moved a bunch of generally useful settings from the Informatics-only sleep header into lcfg/options/sleep.h.

The “login session idleness” functionality is currently being beta-tested (which got a mention on the Informatics Energy blog) so it will be installed only if you put the following in a machine’s profile:

#define LCFG_SLEEP_BETA_TEST
#include <lcfg/options/sleep.h>


I’ve been asked whether I have data on whether sleeping shortens the life of desktop machines. I’m afraid I don’t, but if you do, please get in touch. I do have a few thoughts on the matter though.

• Modern desktop hardware and operating systems are all designed to support sleep, and they do it well. Our managed Windows desktop machines enforce sleep and this seems to work well. Ten or fifteen years ago frequent sleep might have been bad for the hardware, but nowadays I really doubt that it would cause problems.
• We’ve now been using sleep for several years and we haven’t seen an epidemic of premature death in our desktop machines.
• We’ve come across problems caused by hard disks being kept permanently running 24/7. “Never switch off” isn’t always the right idea.

Written by Chris Cooke

April 11, 2013 at 10:46 am


## You have the lamp.

Things have moved on since my last post. The Dell page of Documentation for System Administration Software is proving very helpful.

In particular the Dell OpenManage Software link, once you click through, offers a Dell Systems Software Support Matrix which answers my questions about what software is available, for what platforms, and what it does. Dell Update Packages is also being very helpful. There’s lots of other useful stuff too, enough to keep me busy learning for a while.

Written by Chris Cooke

April 20, 2012 at 2:34 pm


## You are in a maze of twisty little passages, all alike.

As part of project 171 I’m trying to work out how we might more easily upgrade the firmware and BIOS on our servers. This post is an attempt to make sense of what I’ve done so far and to work out what to do next.

Ideally a firmware upgrade should just be a matter of using handy Dell or HP software to identify and install any required upgrades. However until now that hasn’t really worked out very well for us, and we’ve ended up with some fairly convoluted workarounds, for example this.

Since most of our servers are currently Dells, I’ll deal with them first.

Dell’s big bundle of software for monitoring and maintaining servers is called OMSA, short for OpenManage Server Administrator. We were somewhat frustrated with it for years because although it looked quite useful, it was installed in a way which we judged to be incompatible with our management standards: instead of being installed cleanly and removably using a conventional Linux RPM the OMSA files seemed to be dropped all over the filesystem in unpredictable and sometimes disruptive places, so without a lot of work we couldn’t tell precisely what the consequences of OMSA installation were and how to properly control it.

The package updates also seemed a little more unpredictable than what we were used to dealing with: instead of RPMs they came in the form of executable scripts with built-in payloads.

The web downloader, the main alternative way of obtaining Dell updates, also had its own charming idiosyncrasies, such as assuming that the user was using a Windows computer.

The good news is that Dell has seen the RPM light, and in a big way.

OMSA, from version 6.5 onwards, is now packaged in a collection of several dozen RPMs, with proper dependencies, making it possible to install just the required parts of it, to easily find out what has been installed, and to have some confidence that OMSA installation and removal can be done without disrupting other software.

Dell Firmware and BIOS updates for Linux are now packaged in RPMs.

There’s also Dell software which identifies what firmware versions your server has, what versions it should have, and offers to find the necessary RPMs for you and perform the upgrades.

All of these wonderful things are available in a Dell RPM repository which is compatible with package tools such as yum and which can even be mirrored using rsync. The repository is documented here and situated here.

On my test server I’ve enabled yum access to the Dell repository (not yet using LCFG, though that should be easily doable), installed OMSA, started it running and tried a few sample commands. So far it seems fairly straightforward to get it running on a server. We may find it useful. A summary of what it offers would be helpful; I shall have another look at it and at the documentation, which can be got from here.

There are a few worries and niggles.

One is that although the documentation site offers information on OMSA version 7.0, the RPM repository only goes up to OMSA version 6.5.3. Is it just lagging behind or has it been abandoned in some way? Will OMSA 7.0 appear in the RPM repository at some point or should I believe the OMSA 7.0 readme which claims that Dell OpenManage System Management software, including Server Administrator, is available only on the “Dell Systems Management Tools and Documentation” DVD? Or is the current version of OMSA still officially 6.5, as claimed on the Dell techcenter wiki? In which case what is the status of OMSA 7.0?

Another area with questions over it is firmware updates, my current main interest. These were once handled by OMSA, but at some point the functionality was split out into separate software.

The Dell tool of choice for handling firmware updates seems to be OpenManage Essentials, but this appears to be available only for Windows operating systems. At the moment firmware-tools seems to be the tool of choice for Linux firmware updates. I have no idea if it is still supported for Linux, if OME is one day going to replace it for Linux, or whether there is official or dependable Dell support for it.

Anyway, I installed firmware-tools following Dell’s instructions and like OMSA I found it easy to install (thanks Dell). It inventoried my system (told me what firmware and BIOS versions my server was running) and assessed it for updates (told me what firmware and BIOS versions it should be running). I then tried running it in “yes, do the updates for me” mode, but this was where it went wrong. Here’s the output from the update_firmware program.

% update_firmware --yes

Running system inventory...

Searching storage directory for available BIOS updates...
Checking BIOS - a00
Available: dell_dup_componentid_00159 - a05
Found Update: dell_dup_componentid_00159 - a05
Checking SAS5ira Controller 0 Firmware - 00.06.50.00.06.06.00.02
Available: pci_firmware(ven_0x1000_dev_0x0054_subven_0x1028_subdev_0x1f09) - 00.10.51.00.06.12.05.00
Found Update: pci_firmware(ven_0x1000_dev_0x0054_subven_0x1028_subdev_0x1f09) - 00.10.51.00.06.12.05.00
Checking ATLAS10K5_073SAS Firmware - bp00
Did not find a newer package to install that meets all installation checks.
Checking System BIOS for PowerEdge 860 - A00
Did not find a newer package to install that meets all installation checks.

Found firmware which needs to be updated.

Installing dell_dup_componentid_00159 - a05
Installation failed for package: dell_dup_componentid_00159 - a05
aborting update...

The error message from the low-level command was:

Enough contiguous physical memory is not available to perform the BIOS update running under this Operating System. Reboot and try again.


The server in question has 2G of memory. The same error message was produced whether update_firmware was run in multi-user mode or in single-user mode, and whether or not it was run just after rebooting.

I haven’t tried locating and installing the RPMs which it mentioned.
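For anyone who does want to chase up those packages by hand, the names can be pulled out of saved update_firmware output with a small filter. This is a rough sketch: the function name `found_updates` is invented here, and it assumes the output has the same “Found Update: name - version” shape as the run shown above.

```shell
# Print "name version" for every "Found Update:" line read from standard
# input. Assumes lines of the form "Found Update: NAME - VERSION".
found_updates() {
    sed -n 's/^[[:space:]]*Found Update: \(.*\) - \([^ ]*\)$/\1 \2/p'
}
```

Given the output saved in a file (say update_firmware.log, a made-up name), `found_updates < update_firmware.log` would list each component that has a pending update alongside the version on offer.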

I have come across an interesting blog post, though, which suggests that I may be on a hiding to nothing here and which points to some other things to look at.

Now for HP. It’s been supporting well-behaved RPMs for rather longer than Dell has, I think. Its repository is called the HP Software Delivery Repository or SDR for short. It’s written up here.

The SDR doesn’t seem to support Scientific Linux. It does support RHEL and (here and there?) CentOS.

HP’s management software is called ProLiant Support Pack or PSP for short. Or has that been replaced now with Service Pack for ProLiant, or SPP for short? It would appear that SPP supports at least RHEL and SUSE so it’s worth taking a closer look at.

Disappointingly, there doesn’t seem to be a mailing list which deals with HP PSP or SPP issues.

The HP repository can be mirrored by rsync.

After finding out this lot I’m left with lots of uncertainty and lots of possible avenues to explore, and I haven’t even mentioned other possibilities such as getting our servers to automatically look for and report memory errors in the IPMI log.

Do you have any experience in this area, or any observations or knowledge or clarifications to pass on? Any hints about which area might be the most promising to look at? Please get in touch if you do.

Finally, to reward you for reading this far here’s a magnificent picture of a purple heron, similar to the ones I saw on holiday in India last winter. Apparently they’re very rare indeed here but in India we saw them every day, lurking in the fields near where we were staying. They have extraordinarily long and flexible necks and tend to stand for long periods totally immobile, resembling weirdly twisted lengths of wood, until they suddenly strike at prey. The picture and its attribution and more information are available from here.

Written by Chris Cooke

April 18, 2012 at 3:46 pm


## Round-up of sleep news

It’s been a while since I blogged about the sleep component. There’s been a lot of activity on that front lately, so here’s a roundup of the news.

• You can now wake a sleeping machine by pressing a key on the keyboard.
• In theory you can also wake a sleeping machine by clicking a mouse button. However RHEL6.0 / SL6.0 seems to have a kernel bug which makes this not work any more. As far as I can gather, the kernel bug was fixed ages ago but RHEL deliberately removes the fix when building its kernel. Hopefully this will all be better in 6.1.
• The component now detects running cron jobs. If it finds one, and that job isn’t in the cronignores list, it keeps the machine awake.
• There are new disable and enable methods. These disable sleep, and undo the disable method, respectively. The idea is that they can be run by a machine’s user, for example by typing om sleep disable.
• The component should work on 64 bit architectures too.
• A bug which broke the execution of extra commands at suspend and resume has been fixed.
• A new blacklist resource allows the selective disabling of sleep for particular hardware models.
• The sleep component’s LCFG Wiki page has been thoroughly tidied and brought up to date. Take a look at that page for an introduction to the component.
• Sleep mostly seems to work reliably on SL6. One or two models are currently presenting problems (the Dell Optiplex 755 in particular) but solutions have been identified and I expect those models, too, to be sleeping successfully soon. Certainly my test 755 sleeps like a baby. (It wakes in the middle of the night to perform important functions…)
• Edit: I forgot to mention that the lcfg-level resources have now been beefed up so that the component can now be run out of the box with lcfg level headers: no extra configuration should be necessary. (More configuration is possible of course – see the LCFG Wiki page for some config ideas.)
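
The cron job check in the list above boils down to filtering the running jobs against an ignore list. Here is a minimal sketch of that filtering step only. How the component actually finds running cron jobs is not shown, and the function name and argument format are assumptions made for illustration; the second argument merely stands in for the cronignores resource.

```shell
# Hypothetical sketch, not the component's code.
# $1: newline-separated names of running cron jobs
# $2: space-separated names to ignore (standing in for cronignores)
# Prints the first job that should keep the machine awake and returns 0;
# returns 1 if every running job is ignored or none is running.
cron_job_blocks_sleep() {
    running_jobs=$1
    ignores=$2

    # A here-document (not a pipe) keeps the loop in the current shell,
    # so "return" leaves the function.
    while IFS= read -r job; do
        [ -n "$job" ] || continue
        ignored=no
        for name in $ignores; do
            if [ "$job" = "$name" ]; then
                ignored=yes
                break
            fi
        done
        if [ "$ignored" = no ]; then
            echo "$job"
            return 0
        fi
    done <<EOF
$running_jobs
EOF
    return 1
}
```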

With all these developments out of the way it’s looking likely that we’ll soon be rolling out the sleep component onto all the staff and postgrad DICE Linux desktops in Informatics. In addition the introduction of the blacklist resource clears the way for the possible adoption of lcfg-sleep by other schools and units too. I’m looking forward to that challenge; it’ll be great to see more power-saving from the Linux desktops across the University.

Written by Chris Cooke

June 10, 2011 at 11:19 am


## Linux Sleep: a new hoop to jump through

This is a follow-up to an earlier post, Linux sleep: how to wake with a key press or mouse click.

Shortly after I discovered how to wake a sleeping machine this way (something of a Holy Grail of mine for several years), a new kernel version came along and broke the mechanism. At least, you now seem to have to jump through an extra hoop to enable it, in addition to the one described in the earlier post.

It’s now also necessary to find the relevant USB devices’ files under the /sys/devices/pci0000:00 tree and echo "enabled" to them. Version 0.12.0 of the LCFG sleep component has been updated to do this. It should be installed on DICE machines by 14 April 2011. LCFG bug 408 is the bug report associated with the change.
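
As a rough illustration of that extra hoop, here is a shell sketch. It is not the component's code. It assumes the files in question are the power/wakeup files that sysfs exposes for each device, and it touches every such file at or below a usbN directory; the function name is invented for the example.

```shell
# Sketch only: write "enabled" to every power/wakeup file found at or
# below a usb* directory in the given sysfs tree. The root defaults to
# the tree named in the post; pass another directory to try it safely.
enable_usb_wakeup() {
    sysfs_root=${1:-/sys/devices/pci0000:00}

    find "$sysfs_root" -path '*/usb*/power/wakeup' -type f 2>/dev/null |
    while IFS= read -r wakeup_file; do
        # Leave files alone if they already say "enabled".
        if [ "$(cat "$wakeup_file")" != "enabled" ]; then
            echo enabled > "$wakeup_file" && echo "enabled $wakeup_file"
        fi
    done
}
```

Unlike /proc/acpi/wakeup, these files take the desired state as a word rather than toggling, so writing “enabled” twice does no harm.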

Written by Chris Cooke

April 7, 2011 at 1:07 pm


## Printing man pages

If you hadn’t realised how easy it is at the moment to print a ‘man’ page, try this:

man -t man | lpr

The -t switch makes man output the page in PostScript. The man page suggests that you can have the page output in any of a number of other handy formats instead if you dig a little deeper.

Why print a man page? Reading from paper is easier on my eyes than from a screen and the prettier typefaces are also clearer. The paper version is more portable, too.

Written by Chris Cooke

March 31, 2011 at 4:15 pm


## Linux sleep: how to wake with a key press or mouse click


Several years ago we started sending the Linux machines in our student labs to sleep when idle, to save power. We configured them to check carefully before deciding whether or not they were idle enough to sleep, and also to wake themselves up in time to run important cron jobs. Machines could also be woken manually when needed.

This was fine, except for one problem: the only way to wake the machine manually was to press its power button. That’s not how most people try to wake a sleeping machine: it’s far more natural to press a key on its keyboard, or click one of its mouse buttons.

We’ve had a user education campaign which seems to have successfully taught most users of the labs how to wake a machine up, but there’s still a persistent minority of people who don’t understand, or maybe get impatient, and who sometimes end up doing something rash such as forcing a sleeping machine to reboot; so we get a steady flow of broken machines.

To solve this problem I’ve been trying for a long time to find out how to enable wake from sleep with a key press or mouse click. I’ve even been trying to find out if it was actually possible with Linux.

I have finally succeeded! It is possible, I’ve done it, and the solution will shortly be rolled out to our student lab machines. Here’s how:

The key file to manipulate is called /proc/acpi/wakeup. This file is a list of devices which can be used to wake the machine from sleep – and whether or not they’re currently allowed to. A status of “disabled” against a device means that it won’t wake the machine, while “enabled” means that it will. Here are the default contents of /proc/acpi/wakeup on my desktop HP dc7900 running Fedora 13:

Device  S-state   Status   Sysfs node
PCI0      S4    *disabled  no-bus:pci0000:00
COM1      S4    *disabled  pnp:00:07
PEG1      S4    *disabled  pci:0000:00:01.0
PEG2      S4    *disabled
IGBE      S4    *disabled  pci:0000:00:19.0
PCX1      S4    *disabled  pci:0000:00:1c.0
PCX2      S4    *disabled
PCX5      S4    *disabled  pci:0000:00:1c.4
PCX6      S4    *disabled
HUB       S4    *disabled  pci:0000:00:1e.0
USB1      S3    *disabled  pci:0000:00:1d.0
USB2      S3    *disabled  pci:0000:00:1d.1
USB3      S3    *disabled  pci:0000:00:1d.2
USB4      S3    *disabled  pci:0000:00:1a.0
USB5      S3    *disabled  pci:0000:00:1a.1
USB6      S3    *disabled  pci:0000:00:1a.2
EUS1      S3    *disabled  pci:0000:00:1d.7
EUS2      S3    *disabled  pci:0000:00:1a.7
PBTN      S4    *enabled


The only device that’s allowed to wake the machine is PBTN – the power button.

To enable a device, just echo its device code to the file, like this:

# echo USB3 > /proc/acpi/wakeup


A quick look at /proc/acpi/wakeup confirms that USB3 is now enabled for wakeup:

USB1      S3    *disabled  pci:0000:00:1d.0
USB2      S3    *disabled  pci:0000:00:1d.1
USB3      S3    *enabled   pci:0000:00:1d.2
USB4      S3    *disabled  pci:0000:00:1a.0
USB5      S3    *disabled  pci:0000:00:1a.1


I wanted to make it possible for the keyboard and the mouse to wake the machine, so I used this method to “enable” all of the USB devices.

Note that echoing the device code to the file toggles the device’s status: a disabled device is enabled, and an enabled one is disabled.

Note also that if you write a Perl script to do this, you’ll have to open /proc/acpi/wakeup for writing, write a device code, then close the file, separately for each device you want to enable.

Here’s a bit of Perl which will enable wakeup on all USB devices, if you run it from an account which has permission to write to /proc/acpi/wakeup:

#!/usr/bin/perl

use strict;
use warnings;

my $wakeup = "/proc/acpi/wakeup";
my @disabled;
my $device;

# Let's take a look at the wakeup file
open(INPUT, "< $wakeup")
    or die "Couldn't open $wakeup for reading: $!\n";

# Remember the names of each disabled USB device
while (<INPUT>) {
    if (/^(USB\d+).*disabled/) {
        push(@disabled, $1);
        print "Added $1 to disabled list\n";
    }
}

# We've finished reading from the file
close(INPUT);

# Enable each device on our list
foreach $device (@disabled) {
    print "$device is disabled! Enabling it now.\n";
    open(OUTPUT, "> $wakeup")
        or die "Couldn't open $wakeup for writing: $!\n";
    print OUTPUT $device
        or die "Couldn't echo $device to $wakeup: $!\n";
    close(OUTPUT);
}
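
The same idea also fits in a few lines of shell. This is an untested-on-real-hardware sketch rather than what will be rolled out: both function names are made up, and the file path is a parameter so that the listing half can be tried on a saved copy of the file.

```shell
# List the USB devices that the wakeup file reports as disabled.
list_disabled_usb() {
    awk '$1 ~ /^USB[0-9]+$/ && $3 ~ /disabled/ { print $1 }' "${1:-/proc/acpi/wakeup}"
}

# Enable each disabled USB device, one write per device. Writing a device
# name toggles its state, so only the disabled ones are written.
enable_usb_wake_devices() {
    wakeup_file=${1:-/proc/acpi/wakeup}
    for device in $(list_disabled_usb "$wakeup_file"); do
        echo "$device" > "$wakeup_file"
    done
}
```

Running `list_disabled_usb` first, without arguments, shows what `enable_usb_wake_devices` would change; the latter needs the same write permission as the Perl version.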


Written by Chris Cooke

March 4, 2011 at 3:08 pm


## Online books

This is a short post in praise of Edinburgh University Library.

I’ve been working on Hadoop recently, setting up clusters for our students. Faced with some configuration complications which I won’t go into right now, I decided to consult some books on Hadoop to see what advice they could give me. My local branch of Blackwells had two books that looked suitable: Pro Hadoop by Jason Venner and Hadoop: The Definitive Guide by Tom White. Both looked good and were about £30; I went back to the office to think about which one to get.

Back at the office I remembered that the university library had made some books available free online to university users. I tried this once years ago, and while I did in the end manage to view some pages of an O’Reilly book online, the process seemed awfully clunky and arcane. Had it improved since I last tried it? I went to its site at http://www.lib.ed.ac.uk to have a look.

The short answer is yes, online access to books has definitely improved. It’s so easy that I just clicked through straight away without any particular effort at all. This is roughly what I did:

• Pointed my web browser at www.lib.ed.ac.uk
• Clicked “Library Catalogue”
• Searched the catalogue for Hadoop. The results included clickable links with names like “Pro Hadoop [electronic resource] / Jason Venner”. I clicked on one.
• The EASE authentication screen came up; I authenticated.
• The front page of the book, and a list of its contents, appeared in my browser.
• I could then either read the book online or print it off a section at a time. “Printing” nowadays lets you save to PDF too.

It’s really quite a useable system. I’m impressed.

Edit: Apparently the library also deals in real, physical books made of paper. I’ve just been there, for the first time in donkey’s years, and managed to borrow a book with the greatest of ease, with only a little help needed from a very helpful library staff member. (The newly refurbished building is a big improvement too.)

Written by Chris Cooke

September 8, 2010 at 4:05 pm


## Systemd

As you probably know by now we’ve been porting LCFG to Fedora 12. Like other recent distro releases F12 uses Upstart instead of the venerable SysV init we’re used to. Upstart’s event-driven so it sounded great – until Alastair had a closer look at it and found that it was catastrophically difficult to configure, given that an Upstart job’s configuration information is jumbled up with its code.

Lennart Poettering of Red Hat has recently written a fascinating blog post in which he announces a new startup mechanism called systemd. In his post he analyses Upstart and SysV init, explains what he doesn’t like about them and why (for example, that they try to do far too much and so end up far too slow), and sets out his rethink.

He goes into a lot of detail about systemd. There are a lot of interesting ideas there: the page is well worth a read.

The Upstart author has been big enough to respond fairly positively on his own blog. He also points out the apposite meaning of the slang term System D.

Via H Online.

Written by Chris Cooke

May 4, 2010 at 7:31 pm


## VirtualBox success (sort of, eventually)

VirtualBox and I fought all day. Eventually I scored a victory on points. I made it work by trying out a new virtual machine on a new real machine, abandoning everything I’d tried last week. It seems happier running on a Dell 755 than on last week’s HP 7900. However I’ve still somehow ended up with a VM which refuses to boot from its hard disk unless I press F12 during the boot process and instruct the thing to boot from its hard disk. Compared to the problems I’ve had with VirtualBox over the past couple of days this is a small price to pay for actually getting the dratted thing up and running. Anyway, I got my f12 virtual machine installed and booted, and even rebooted, and managed to add a few test results to the test results table.
With this success I’ve finally been able to send an invitation out to colleagues to try F12, kick its tyres and take it for a spin.

Written by Chris Cooke

April 26, 2010 at 3:55 pm


## VirtualBox failure, HP 7100 success

• I have my f12 VirtualBox client installing to the point where the fstab component attempts to make filesystems. The installation fails at that point every time I try it, telling me that the disk doesn’t appear to exist. When I try the same mke2fs command that the fstab component just got a failure from, it succeeds for me perfectly. Alastair saw the same thing several weeks ago but with him it was only an intermittent problem, whereas with me it’s happening every time. Alastair suggested setting fstab.dopartition_sda to false then trying again. This does get the install past the mke2fs stage and on to updaterpms.
• Next problem with the install: updaterpms installs a few dozen RPMs then fails with multiple disk errors. This happens repeatably: I’ve now tried three installs in a row and all fail after installing a few dozen RPMs. Alastair suggested changing the virtual machine disk from IDE to SATA. No joy. The install fails at the same point each time: during the installation of the glibc-common RPM.
• I tried turning off the vm’s use of hardware virtualisation features. This makes it slower, and the installation procedure actually manages to install the glibc-common RPM this way, but it still fails a few packages later on: “Buffer I/O error on device sda1”, etc.
• There’s a sense of relief in the MPU today: Stephen has tried the Dell Optiplex 780, the stop-gap new PC choice that’s been forced on us, with our standard SL5 installation and has found no apparent problems. Phew.
• I’ve been given an old HP 7100 to test f12 on. Apparently they’re proving to be more durable than our (newer) Dell GX620s which are dropping like flies, so the plan is for our 7100s to have a productive old age. Anyway, it tests out OK; F12 works fine; the beast even suspends and resumes successfully. It now has an entry in the test results table.

Written by Chris Cooke

April 23, 2010 at 3:56 pm


## Multiple session types

• My GDM login screen is now rather more civilised than before. It now offers a choice of session type, language and keyboard layout, though oddly the choice only appears after you’ve typed in your username. I’ve also installed KDE, Xfce and LXDE so we have a number of session types to choose between. I haven’t been able to add these to the base list yet, though, as some of the packages depend on versions which, though already available in our mirror, are too recent for our f12 updates list. I’ll add them when it becomes possible.
• About the network failure when switching from dhcp to our normal style of networking: Alastair’s machine has a similar setup but doesn’t suffer from this. The difference seems to be that his machine doesn’t use the openldap component. This bears further investigation.
• I’m experimenting with VirtualBox. At the moment my LCFG-controlled f12 installation isn’t cooperating but it’s early days.

Written by Chris Cooke

April 22, 2010 at 4:25 pm


• I spent the morning up to my elbows in network scripts trying to establish why my machine’s network is totally cut off when I switch from DHCP to our normal networking setup then reboot the machine. The resulting /etc/sysconfig/network-scripts/ifcfg-eth0 looks pretty similar to the one you get on sl5 DICE machines and also on Stephen’s f12_64 inf machine:
DEVICE=eth0
ONBOOT=yes
NETWORK=(the first three parts of the IP address followed by .0, which looks good to me)


I must be missing something but so far I don’t see what.

• Spent the rest of the day looking into gdm and kdm, playing with GConf, but again not getting anywhere.

Written by Chris Cooke

April 21, 2010 at 4:04 pm


## 755s now boot

• Alastair has fixed the keyboard-input-on-boot problem with a new version of lcfg-upstarthooks.
• I’ve got my Dell Optiplex 755 rebooting instead of hanging on reboot. You have to add reboot=bios to the kernel arguments. I finally got the workaround from this Ubuntu bug. It’d be nice to track down a more exactly relevant bug report.
• On closer inspection, the Dells whose internal speakers were so silent in my hardware tests actually don’t have any internal speakers. So that’s that cleared up then!
• lcfg-updaterpms-0.100.50 now officially supports F12 and is the default version. (I thought this had been done ages ago but apparently not. Some kind of oversight maybe.)
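
To confirm that a machine really did boot with the reboot=bios argument mentioned above, the kernel command line can be checked. A small sketch follows; the function name is invented, and the optional second argument exists only so the check can be run against a saved copy of the command line.

```shell
# Return 0 if the given argument appears on the kernel command line.
# Reads /proc/cmdline unless another file is named.
has_kernel_arg() {
    wanted=$1
    cmdline_file=${2:-/proc/cmdline}
    for arg in $(cat "$cmdline_file"); do
        if [ "$arg" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}
```

For example, `has_kernel_arg reboot=bios && echo "workaround active"`.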

Written by Chris Cooke

April 20, 2010 at 4:09 pm


## hardware test results

Whoops: I’ve neglected this for the last few days. This post therefore has to be something of a catch-up. Sorry for the length.

• On the 15th I built lcfg-openafs and lcfg-openldap for f12_64. (Stephen has been enthusing about Mock for a long time now, and I can see why – it’s extremely useful to be able to build packages without having to install all the Requires and BuildRequires packages on the build machine.) lcfg-openafs-0.0.32 is now officially ported to F12.
• I also confirmed that switching my F12 machine from Kerberos configuration by the file component, to configuration by the kerberos component, kills my keyboard stone dead on reboot. I get a prompt for my admin principal and the keyboard totally fails to work. I’ve gone back to the file component method and reinstalled…
• Most of the rest of the 15th and 16th was taken up with hardware tests: results here. The problems I found were:
755 doesn’t reboot: whenever the 755 tries to reboot it announces “Rebooting system.” then hangs.
745 doesn’t mount CDs: if you insert a CD into a 745 it whirrs but nothing appears on the desktop.
HP 7900 CD support is dodgy: sometimes an inserted CD doesn’t mount on the 7900’s desktop, sometimes it does.
Dell sound is dodgy: there’s no audio output from speakers on some Dells.
Audio or sleep troubles on 755: the X login screen disappeared from a 755 after it had undergone an intensive programme of frequent suspends and resumes. On checking the logs it seemed that rtkit-daemon was logging to syslog a lot at resume time. On the very first resume it logged

rtkit-daemon[4569]: Sucessfully made thread 4567 of process 4567 (/usr/bin/pulseaudio) owned by '42' high priority at nice level -11.
rtkit-daemon[4569]: Sucessfully made thread 4573 of process 4567 (/usr/bin/pulseaudio) owned by '42' RT at priority 5.
rtkit-daemon[4569]: Sucessfully made thread 4574 of process 4567 (/usr/bin/pulseaudio) owned by '42' RT at priority 5.


Then on the second resume it logged:

rtkit-daemon[4569]: The poor little canary died! Taking action.
rtkit-daemon[4569]: Rampaging.
rtkit-daemon[4569]: Successfully demoted thread 4573 of process 4567 (/usr/bin/pulseaudio).
rtkit-daemon[4569]: Successfully demoted thread 4574 of process 4567 (/usr/bin/pulseaudio).


and on subsequent resumes:

rtkit-daemon[4569]: The poor little canary died! Taking action.
rtkit-daemon[4569]: Rampaging.


Some hours later this eventually became

rtkit-daemon[4569]: Rampaging.
gdm-simple-slave[4497]: WARNING: Unable to kill D-Bus daemon
console-kit-daemon[1811]: WARNING: Couldn't read /proc/4556/environ: Failed to open file '/proc/4556/environ': No such file or directory
gdm-binary[4464]: WARNING: GdmDisplay: display lasted 0.844015 seconds
gdm-binary[4464]: WARNING: GdmDisplay: display lasted 0.834487 seconds
gdm-binary[4464]: WARNING: GdmDisplay: display lasted 0.829573 seconds
gdm-binary[4464]: WARNING: GdmDisplay: display lasted 0.829621 seconds
gdm-binary[4464]: WARNING: GdmDisplay: display lasted 0.830660 seconds
gdm-binary[4464]: WARNING: GdmDisplay: display lasted 0.835599 seconds
gdm-binary[4464]: WARNING: GdmLocalDisplayFactory: maximum number of X display failures reached: check X server log for errors
init: prefdm main process (4464) terminated with status 1
init: prefdm main process ended, respawning
gdm-binary[13695]: WARNING: GdmDisplay: display lasted 0.833754 seconds


and so on. Judging by these two posts on Ubuntu forums it may be the case that PulseAudio should be stopped on suspend and started on resume. I’ve checked our pm-utils sleep.d scripts and that’s not happening on our F12 machines.

“rtkit”, by the way, is “real time kit”; it’s required by PulseAudio but not yet by anything else.
I spent some time today debugging a pm-utils sleep.d hook script which would suspend PulseAudio on system suspend and resume it on resume, but without success. I think I’ve spent enough time on this; for now we’ll just have to have lcfg-sleep disabled on 755s. I’m modifying lcfg/defaults/sleep.h accordingly.
A note for the future: my failed sleep hook script needed to run nsu or sudo so it could run pactl as the user running pulseaudio. Root doesn’t have permission to nsu, and I subsequently noticed some console messages saying “root: sorry, you must have a tty to run sudo”. So that explains those failures, anyway.

• Alastair has got us round the keyboard/kerberos problem. Apparently scripts called from Upstart can’t get interactive input! See this Ubuntu support thread. Setting kerberos.hostkeyless to true gets us round the problem for now, at the cost of not having any automatically generated host keys. I’ve changed inf/options/kerberos-client.h accordingly. But what a pain. We really don’t like Upstart. Edit: it may be plymouth rather than Upstart. Hopefully we’ll be able to chuck or disable plymouth as a workaround.

Written by Chris Cooke

April 19, 2010 at 4:34 pm

## openldap and mock

• For my own machine I’ve backed out the lcfg-kerberos changes we tried yesterday and reverted to configuring Kerberos with the file component:
#define INF_OPTIONS_KERBEROS_CLIENT
#include <inf/os/f12.h>
#include <inf/options/kerberos-client-by-file.h>


Stephen reckons I was just unlucky yesterday. By tomorrow I may have enough courage to give lcfg-kerberos another try.

• F12 now uses the openldap component if you #define USE_OPENLDAP_COMPONENT at the top of the profile. The component has so far only been built and submitted on 32 bit though as a build dep (lsb) is missing on the 64 bit machine.
• I’ve built and submitted the last dependencies for lcfg-sleep on f12_64. I used mock to do this. Initial mock failures were solved by adding a load of BuildRequires to the specfile of the package I had to build.
• Today I’ve been able to strike lcfg-pam, lcfg-grub and lcfg-sleep from the not-yet-done list on the project plan.
• An attempt to build lcfg-openafs on f12_64 failed due to dependencies not being installed. I’ll tackle this tomorrow with mock. Mock has the distinct advantage that you can have it install dependencies for you, and it does it just in its chroot: so you don’t have to muck about installing packages on a build machine, possibly annoying another person who may be using the same machine for some other purpose at the same time.
• Looking back on this I now realise that tomorrow I can also use mock to tackle lcfg-openldap on 64 bit.

Written by Chris Cooke

April 14, 2010 at 4:10 pm

## Bork bork bork

• Yesterday I brought the F12Upgrade page up to date and we had an F12 progress meeting.
• I also did some testing of my 745: sound works, although only audio files in free formats play successfully: attempting to play a non-free format such as mp3 triggers a PackageKit codec install attempt, which fails, saying that all available codecs have been blacklisted. I suspect some kind of lack of PackageKit authorisation. I need to look at PackageKit to find out how to cleanly disable it while preserving the functionality we do want such as manual use of yum.
• Today I reinstalled the machine, this time with Alastair’s new PXE setup. It worked smoothly. One oddness, which Stephen also sees and which I’ll add to the F12Upgrade page, is that GDM comes up very early. GDM would come up, I’d have enough time to enter my username and password, then be told “authorisation failed” or similar, then the machine would reboot. I had this twice during the install process. If I’d been paying attention to the screen instead of to another screen I’d have seen that the install wasn’t yet finished of course, but I’m not used to having login prompts come up too early. Also after the install was finished I still couldn’t log in: GDM came up, I attempted to log in, and got another authorisation failure. The auth log said this for every failed login attempt:
pam_krb5_allbery(gdm-password:auth): authentication failure; logname=cc uid=0 euid=0 tty=:0 ruser= rhost=


By the time Stephen tried a few minutes later it was working and we could both log in perfectly well; so it seems that GDM just comes up early. It gets started by Upstart – see /etc/event.d/prefdm – so we’ll have to look there to investigate how to delay it until booting has finished.

• I’ve removed all the PackageKit RPMs from the base package lists. yum is still as functional as ever. Attempting to play an mp3 now gives the message “The playback of this movie requires a MPEG-1 Layer 3 (MP3) decoder plugin which is not installed.” rather than attempting to install the missing codec then failing. We’ll need to install more codecs though – for a start our speech researchers will need the audio support. (We could substitute our own cod-PackageKit functionality which instead of attempting to install missing packages does something more appropriate to our circumstances.) We’ll install the non-free codecs at the DICE level though, so that will presumably be done as part of the later DICE-on-F12 project.
• I’ve added a sample profile and (minimal!) install instructions to the F12Upgrade page.
• Then I spent much of the afternoon trying to get lcfg-kerberos working. This exposed problems with lcfg-network settings, which turned out not to be problems really, but in the meantime misguided attempts to fix them borked the system so thoroughly that not only was the network stone dead but also the keyboard, though only at multi-user level. I’m onto my third or fourth reinstall of the day – I’ve lost count.

Written by Chris Cooke

April 13, 2010 at 4:15 pm

## sleep and gdm on F12

lcfg-sleep is OK and working on my F12 machine with dice/options/sleep.h included. My machine is a GX745 and is sleeping far more reliably than it did with SL5. Based on this I haven’t added any model-based exclusions for F12.

lcfg-gdm on the other hand isn’t looking so good. The gdm on F12 has been totally rewritten compared to the version on SL5 and the configuration isn’t fully compatible. It looks like we’re going to need a new lcfg-gdm, or possibly we could configure gdm with lcfg-gconf.

Written by Chris Cooke

April 9, 2010 at 9:17 pm

## A reinstall brings RTC wake alarm confusion

Late yesterday I bogged up my F12 machine completely. Today I took the opportunity to reinstall it using Alastair’s shiny new F12 install process. This worked, albeit with a few hiccups, so I now have a new F12 installation.

With the new installation, the wake alarm no longer works as it did. This is how it worked until yesterday:

# echo 0 >/sys/class/rtc/rtc0/wakealarm
# date "+%s" -d "+ 5 minutes" >/sys/class/rtc/rtc0/wakealarm


That is, you zero the alarm then you set it with the number of seconds between the epoch and the date/time you want the machine to wake up. Now though, with the same kernel RPM as before, the above doesn’t work. Instead it works like this:

# echo 0 >/sys/class/rtc/rtc0/wakealarm
# echo +300 >/sys/class/rtc/rtc0/wakealarm


That is, you put in a + followed by the number of seconds between now and your alarm time. But the kernel version is the same, I think, from yesterday to today. How can the alarm behaviour have changed…?! And will it change again tomorrow? How do I write software when the kernel behaviour changes arbitrarily from one day to the next with no change of kernel RPM version?

I got the solution from here – which seems to be about Asus boards so I’m still confused as I’m using a Dell:
http://www.mail-archive.com/acpi-bugzilla@lists.sourceforge.net/msg24296.html
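
The two ways of setting the alarm described above differ only in the string written to the wakealarm file. Here is a minimal Python sketch of both, for illustration only: the function names are made up for this example, and it is not the sleep component's code (which is Perl).

```python
import time

WAKEALARM = "/sys/class/rtc/rtc0/wakealarm"


def absolute_alarm(seconds_from_now, now=None):
    """Absolute form: seconds between the epoch and the wake-up time."""
    if now is None:
        now = time.time()
    return str(int(now) + seconds_from_now)


def relative_alarm(seconds_from_now):
    """Relative form: a '+' followed by seconds between now and the wake-up time."""
    return "+" + str(seconds_from_now)


def set_wake_alarm(value, path=WAKEALARM):
    """Zero the alarm, then set it, as in the shell commands above (needs root)."""
    for text in ("0", value):
        with open(path, "w") as alarm_file:
            alarm_file.write(text + "\n")
```

Calling `set_wake_alarm(relative_alarm(300))` would correspond to the `echo +300` form, and `set_wake_alarm(absolute_alarm(300))` to the `date "+%s"` form.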

Written by Chris Cooke

April 7, 2010 at 4:06 pm

## Fixed problems with lcfg-sleep on F12

The day was mostly taken up with making the sleep component behave itself properly on Fedora 12. The OS’s power management facilities certainly seem to have matured: my test Dell Optiplex 745 suspends and resumes far more quickly than it did with SL5, and it seems to be doing it far more reliably as well so far. I’ve left it on an intensive suspend/resume cycle though (awake for 3 minutes, then suspend if appropriate, then wake 2 minutes later and start again) so we’ll see if that brings out any misbehaviour over the next few days.

An apparent bug whereby the pm-utils hook scripts weren’t being called was solved when I noticed that for F12 I’d switched the suspend command for my machine from /usr/sbin/pm-suspend to some other fancy suspend command. Doh. I switched it back and the pm-utils hooks were called again as they should be.

I also rewrote the part of the component’s code which sets the wake alarm to make it cope properly with either the old kernel alarm system used on SL5 (/proc/acpi/alarm) or the newer one found on F12 (/sys/class/rtc/rtc0/wakealarm).
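
As a rough illustration of coping with either alarm system, the Python sketch below picks whichever file exists and formats the wake time to suit it. It is not the component's actual Perl code. The function name is hypothetical, and two things are assumptions: that /proc/acpi/alarm takes a "YYYY-MM-DD HH:MM:SS" timestamp, and that the hardware clock runs in UTC.

```python
import os
import time

PROC_ALARM = "/proc/acpi/alarm"              # older interface (SL5)
SYS_ALARM = "/sys/class/rtc/rtc0/wakealarm"  # newer interface (F12)


def alarm_write(wake_epoch, exists=os.path.exists):
    """Return (path, text) for whichever alarm file is present, else None.

    wake_epoch is the wake-up time in seconds since the epoch. The exists
    argument is only there so the choice can be exercised without real files.
    """
    if exists(SYS_ALARM):
        # The newer file accepts seconds since the epoch directly.
        return SYS_ALARM, str(int(wake_epoch))
    if exists(PROC_ALARM):
        # Assumption: the older file wants a timestamp in hardware-clock time.
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(wake_epoch))
        return PROC_ALARM, stamp
    return None
```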

I also discovered and fixed an edge case problem whereby the component’s shell idle time test would happily approve sleep in the case where all interactive shells had an idle time of zero seconds. Repeat twenty times: I must not confuse zero with undefined in my perl scripts. Still to do: check that other sleep tests are behaving themselves (I think they are though) and check that the new code still does the right thing on SL5.
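
The zero-versus-undefined trap looks much the same in any language with truthiness tests. Here is a small Python sketch of the general shape of the mistake; it is not the component's Perl. The function names are invented, and skipping shells whose idle time is unknown is an assumption made for the example.

```python
def shells_allow_sleep_buggy(idle_times, threshold):
    """Wrong: a truthiness test treats an idle time of 0 like 'unknown'."""
    for idle in idle_times:
        if not idle:          # skips None, but also skips 0
            continue
        if idle < threshold:
            return False
    return True


def shells_allow_sleep(idle_times, threshold):
    """Right: only a genuinely unknown idle time (None) is skipped."""
    for idle in idle_times:
        if idle is None:      # assumption: unknown idle times are ignored
            continue
        if idle < threshold:  # 0 seconds idle means the shell is busy
            return False
    return True
```

With two shells both idle for zero seconds and a ten-minute threshold, the buggy version approves sleep and the corrected one refuses it.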

Written by Chris Cooke

April 5, 2010 at 4:24 pm

## The end is in sight

• I’ve cleaned loads of testing crud out of my F12 machine’s LCFG file. There’s now far less potential for confusion between local overrides and the default settings.
• Idea from Alastair and Stephen: I’ve added fstab resources for the current partitioning arrangement my machine has. The lack of those messed up Stephen’s machine on reboot.
• The end is in sight: most of what matters now seems to be being managed by LCFG one way or another on my, Stephen’s and Alastair’s machines; Stephen’s has had a successful LCFG-controlled reboot, with LCFG-configured grub; and Alastair has managed a successful LCFG-controlled reinstall!
• The F12 Upgrade page has been licked further into shape. It now sports a Known Issues section with a couple of problems added already.

Written by Chris Cooke

April 3, 2010 at 9:15 am

## mostly box-ticking

• The test sleep machine didn’t wake up on schedule. This could be because yesterday I ignored the advice in the MythTV ACPI Wakeup page to stop the suspend-time scripts from syncing the hardware clock from the system clock, since any changes to the hardware clock after the wakeup alarm is set will make the alarm ineffective.
• Wrong guess, or insufficient; the machine wakes up when I set the alarm from the shell (good), but not when it’s set from the hacked sleep component. Need to clean up the component’s alarm setting code.
• lcfg-autoreboot now builds and installs on F12. Submitted to devel bucket. Dependency added to lcfg/autoreboot.h.
• lcfg-bugzilla now builds and installs on F12. Submitted to devel bucket.
• lcfg-cups now builds and installs on F12. Submitted to devel bucket.
• lcfg-gbios now builds and installs on F12. Submitted to devel bucket. I incremented the micro release number to create the subversion tags that the build tools need in order to work.
• lcfg-localhome now builds and installs on F12. Submitted to devel bucket. I incremented the micro release number to create the subversion tags that the build tools need in order to work.
• lcfg-postgresql now builds and installs on F12. Submitted to devel bucket.
• The other entries in lcfg_f12_extras.rpms are either already done, are being tackled or aren’t in LCFG subversion.
• lcfg-syslog-2.0.1 is ready to use on F12 (though not SL5 which stays with 1.1.15) and has had its platform list updated. I’ve built and submitted it for f12, adjusted lcfg_f12_lcfg.rpms to install it and adjusted inf/os/f12.h to enable it.
• On Stephen’s advice I made an fstab template in /var/lcfg/conf/fstab/fstab.sda for the existing anaconda-derived partitioning of my test machine’s disk. This should stop lcfg-fstab from blootering the current partition setup.
• Built and submitted lcfg-hardware-0.100.22 for f12; updated the version number in lcfg_f12_lcfg_installroot; enabled lcfg-hardware in inf/os/f12.h.
• Built and submitted lcfg-kernel-0.102.3 for f12; updated the version number in lcfg_f12_lcfg_installroot; enabled lcfg-kernel in inf/os/f12.h. SL5 is staying on lcfg-kernel-0.101.18 for now.
• Built and submitted everything else in subversion that would build first time and seemed even possibly relevant to F12:
• lcfg-perlex
• lcfg-pgluser
• lcfg-postfix
• lcfg-procmailrc
• lcfg-pxeserver
• lcfg-remctld
• lcfg-status-checks
• lcfg-subversion
• lcfg-sudo
• lcfg-tibsconf
• lcfg-websvn
• lcfg-xen

Written by Chris Cooke

March 31, 2010 at 4:14 pm

• Components don’t seem to be reconfiguring when they get a change of resources. I had to “om boot configure” to get it to pick up changes to boot.services that were made days ago, despite several reboots since the change. I also just had to “om cron configure” to get the cron component to process a new cron job I’d added to its resources.
• inf/options/ntp.h was configuring /etc/ntp.conf but that was all. It now starts ntpd too by adding rc_ntpd to boot.services.
• Have been having difficulties with the cron component. Then spotted that inf/os/f12.h was still removing it from profile.components and boot.services. Oops. Now removed those removals; lcfg-cron and crond now start correctly on reboot.
• I’ve worked out some things that lcfg-sleep needs in order to work on F12:
• F12 versions of perl packages, including two which aren’t in the Fedora distribution and had to be got from CPAN (perl-DateTime-Event-Cron and perl-DateTime-Format-Epoch)
• The newer kernel means a change to using the newer kernel alarm file /sys/class/rtc/rtc0/wakealarm instead of /proc/acpi/alarm. This is what needs perl-DateTime-Format-Epoch.
• Removal of suspend-time and resume-time actions for LCFG components not currently in use on F12 (lcfg-ntp and lcfg-amd).
I’ve slept the test F12 machine and woken it several times and so far it’s behaving wonderfully: nice quick wake-ups and no video problems at all. We’ll see tomorrow how the sleep component performed overnight.
• I spotted that xemacs didn’t have a full complement of packages by the fact that it was missing a perl mode. I’ve added them to lcfg_f12_base.rpms where the other xemacs packages were put a week or two ago.

Written by Chris Cooke

March 25, 2010 at 5:06 pm

• My machine now allows logins from inf.ed.ac.uk DICE users and gives them their AFS home directories by default. The final piece of this jigsaw was to add the following line to /etc/ldap.conf:
nss_map_attribute homeDirectory afsHomeDirectory


This is best done via the file component – this line is nicked from Stephen’s lcfg/bowmore file:

!file.tmpl_ldapconf    mEXTRA(\nnss_map_attribute homeDirectory afsHomeDirectory\n)

• I’ve extensively hacked the F12 Upgrade page but there’s still a lot more to do.

Written by Chris Cooke

March 24, 2010 at 5:05 pm

## 23/3

• I’ve created an lcfg_f12_extras.rpms list.
• lcfg-openssh seems to be lacking just a submit to the lcfg bucket, so I’ve done that.
• lcfg-auth was still being disabled in inf/os/f12.h; I’ve changed this, so it’s now started and in use by default. The auth resources inherit from the fstab resources so I’ve added fstab to profile.components though not yet to boot.services.
• I’ve created and started work on the local F12 Upgrade information page.

Written by Chris Cooke

March 23, 2010 at 5:05 pm

• lcfg-prelink-1.0.5 makes its debut and supports F12 too. Support for -c lines has been added as the F12 /etc/prelink.conf has this:
# -c ' is used to source additional config file snippets.


Doing this has shown up the lack of a lcfg_f12_extras.rpms list. I’ll look at that tomorrow. It doesn’t do much except drive the lcfg.org website automatic lists if I remember rightly.

• I’m sure I did other stuff too, since I spent most of the day elbows deep in Fedora and LCFG, but I didn’t note it down at the time. Mostly investigating and correcting minor errors in the port so far, I think.

Written by Chris Cooke

March 22, 2010 at 5:50 pm

## Friday

• We’ve got an mpu chatroom where we’re coordinating the f12 port. Feel free to join if you have access to our jabber service and you want to see what we’re doing.
• I’ve committed my changes to the lcfg_f12_kernel list and the inf/options/openafs-client.h header to upgrade us to the latest kernel and openafs versions.
• So now I have openafs-devel, so maybe now perl-AFS will build… Nope. It’s still not happy:
gcc: /usr/lib/libafsrpc.a: No such file or directory
gcc: /usr/lib/libafsauthent.a: No such file or directory


A google search on perl-AFS dependencies returns this blog in first place. I hate it when that happens.
We do have:

/usr/lib/libafsrpc.so.1


which is in openafs-authlibs. Aha, we probably need to install openafs-authlibs-devel. Yes, that did it… perl-AFS is now built and submitted.

• lcfg-updaterpms-0.100.49 has been built and submitted and is now the default.
• pkgsubmit-0.0.6 has been built and submitted and is now the default.
• nsu allows and denies me correctly when I start lcfg-nsu and adjust nsu resources accordingly, so I’ll say it’s working on F12. I encountered a build error when trying to build the subsequent new version on sl5:
error: Installed (but unpackaged) file(s) found:
/usr/lib/debug/usr/bin/nsu.debug
/usr/src/debug/lcfg-nsu-2.5.10/nsu.c
/usr/src/debug/lcfg-nsu-2.5.10/permit.c
/usr/src/debug/lcfg-nsu-2.5.10/propagate_auth.c


This was due to the presence of a “BuildArch: noarch” line in a sub-package. You can’t have a noarch subpackage when the main package isn’t noarch. Thanks Stephen for sorting it out. That out of the way, lcfg-nsu-2.5.11 makes its debut as the default on all supported platforms.

• Stephen said yesterday that lcfg-tcpwrappers tested out OK for him and it seems to work fine for me too, so lcfg-tcpwrappers-0.99.11 is now the default on all platforms.
• Ditto ditto lcfg-nsswitch-0.100.10.
• My machine’s clock is now properly adjusted thanks to Stephen’s blog post on ntp on f12.

Written by Chris Cooke

March 19, 2010 at 5:00 pm

## progress meeting

• I’ve solved last night’s updaterpms conflict. My machine is now running with the PAE kernel version 2.6.32.9-70.fc12 and with openafs 1.4.12.
The conflict was caused by two versions of kernel-firmware being listed in profile.packages, the old one and the new one. I had thought that the new one would overwrite the old one since I’d put a + in front of the new one. Then when that didn’t work I’d thought I could take out the old one by inserting a -kernel-firmware-*-* line between old and new. That didn’t work either, and today I realised why: the old kernel-firmware line specifies a context [install!=true] – and it seems you can only override a package specification that uses a context by overriding it in the same context. Once I’d added [install!=true] to the -kernel-firmware-*-* line, all was well: I rebooted and the new kernel and openafs installed happily.

• Following Stephen’s fixes to sxprof, I’ve issued lcfg-etcservices-0.100.13 to mark F12 support.
• Now that openafs is configured with inf/options/openafs-client.h I’ve removed openafs packages from lcfg_f12_override.rpms.
• I’ve downloaded and submitted openafs 1.4.12 for f12 and f12_64 into the world bucket.
• This afternoon we had an LCFG Fedora 12 Port Meeting.

Written by Chris Cooke

March 18, 2010 at 5:00 pm

## bumper Wednesday

• I started the day by doing a google search on rpmReadPackageFile Unknown system – and result number one was from this blog. I hate it when that happens. Altering the search slightly, the only other site on the entire internet (google claims) which has the text “Unknown system: (null)” anywhere in it is this one, an archive of a mail message from eleven years ago, which references an RPMfind FAQ which unsurprisingly no longer exists. When I look at one which does exist it doesn’t mention that phrase anywhere.
The actual warning message comes from the RPM 4.7.1 source file lib/rpmrc.c, specifically from the function getMachineInfo based on what it gets back after calling lookupInCanonTable. I’ve updated bug 224 with the info.

• lcfg-cron seems to be doing the right thing on f12 and f12_64 so I’ve issued lcfg-cron-2.0.13 to mark official f12 support.
• lcfg-openafs is installed but it needs perl-AFS which won’t build because:
Path to the AFS installation (libraries, binaries,
header files) to be used for this installation? 	  [/usr] /usr
/usr/bin/ld: cannot find -lubik
collect2: ld returned 1 exit status
ERROR from evaluation of /home/cc/RPMbuild/BUILD/AFS-2.6.1/src/Makefile.PL:
Could not compile test code to retrieve the version of AFS system libraries...
error: Bad exit status from /var/tmp/rpm-tmp.5yyfci (%build)


Stephen is guessing that this might be caused by the lack of (e.g.) openafs-devel. There is an openafs-devel for openafs 1.4.11 but yum won’t install it because:

Transaction Check Error:
file /usr/share/man/man1/compile_et.1.gz from install of openafs-devel-1.4.11-fc12.1.1.i386 conflicts with file from package libcom_err-devel-1.41.9-7.fc12.i686


This may well be fixed in openafs 1.4.12 which is now out. However the OpenAFS Fedora 12 download page doesn’t mention a kmod-openafs package for my kernel version. I could possibly build my own but even if I did I’d need to reboot the machine (scary) afterwards, so I might as well upgrade the kernel and openafs when I reboot. Whatever I do it looks like a machine reboot is called for, which means it’s time to deploy Alastair’s lcfg-boot for Fedora 12.

• Speaking of which: Alastair has now deployed his F12-compatible, Upstart-compatible lcfg-boot on my machine! It was a pretty painless process, too – this is about all that’s needed:
!profile.packages	mEXTRA(+initscripts-9.02.1-1.lcfg.1/i386 \
lcfg-upstarthooks-0.0.4-1/noarch )



Two reboots later it’s looking as if the machine is booting under LCFG control. This feels good. I hadn’t dared to boot the thing for the past couple of weeks as I didn’t know what might break.

• Spurred on by this success, I thought I’d try upgrading openafs to 1.4.12 and the kernel to something compatible with it. I decided on the most recent kernel that openafs supports for this platform, and here’s what I’m asking updaterpms to do:
/* upgrade to newer kernel and openafs */
!profile.packages mEXTRA(+kernel-doc-2.6.32.9-70.fc12/noarch:b)
!profile.packages mEXTRA(+kernel-2.6.32.9-70.fc12:br)
!profile.packages mEXTRA(+kernel-PAE-devel-2.6.32.9-70.fc12:b)
!profile.packages mEXTRA(+kernel-PAE-2.6.32.9-70.fc12:br)
!profile.packages mEXTRA(+kernel-devel-2.6.32.9-70.fc12:b)
!profile.packages mEXTRA(+kmod-openafs-PAE-1.4.12-1.1.2.6.32.9_70.fc12:b)
!profile.packages mEXTRA(+kmod-openafs-1.4.12-1.1.2.6.32.9_70.fc12:b)
!profile.packages mEXTRA(+openafs-client-1.4.12-fc12.1.1/i386:b)
!profile.packages mEXTRA(+openafs-docs-1.4.12-fc12.1.1/i386:b)
!profile.packages mEXTRA(+openafs-kernel-source-1.4.12-fc12.1.1/i386:b)
!profile.packages mEXTRA(+openafs-1.4.12-fc12.1.1/i386:b)
!profile.packages mEXTRA(+openafs-authlibs-1.4.12-fc12.1.1/i386:b)
!profile.packages mEXTRA(+openafs-krb5-1.4.12-fc12.1.1/i386:b)
!profile.packages mEXTRA(+openafs-server-1.4.12-fc12.1.1/i386:b)
!profile.packages mEXTRA(+openafs-devel-1.4.12-fc12.1.1/i386:b)
!profile.packages mEXTRA(+kernel-firmware-2.6.32.9-70.fc12/noarch:b)
!profile.packages mEXTRA(xorg-x11-drv-ati-firmware-6.13.0-0.21.20100219gite68d3a389.fc12)


This looks OK to me, but updaterpms is finding two conflicts:

kernel-firmware >= 2.6.32.9-70.fc12 is needed by kernel-PAE-2.6.32.9-70.fc12.i686
kernel-firmware >= 2.6.32.9-70.fc12 is needed by kernel-2.6.32.9-70.fc12.i686


This is odd, because:

rpm -qp --provides kernel-firmware-2.6.32.9-70.fc12.noarch.rpm
warning: kernel-firmware-2.6.32.9-70.fc12.noarch.rpm: Header V3 RSA/SHA256 signature: NOKEY, key ID 57bbccba
kernel-firmware = 2.6.32.9-70.fc12


Furthermore if I install kernel-firmware-2.6.32.9-70.fc12.noarch.rpm using the rpm command, then run updaterpms, it wants to delete the package then reinstall it then complain that what it provides isn’t there. I’m probably missing something obvious here; I’ll attack this again tomorrow.

• I’ve started looking at lcfg-mailcap. It’s just an incarnation of the file component so it should be fine, but I don’t think the default resources are sensible any more. Our mailcap resources specify particular applications to be used for different types of content, but the mailcap which comes with Fedora just gives everything straight to xdg-open which finds out what the desktop’s default application is for that mime type and hands the data to it to display. This latter course seems less error-prone, more standard and less in need of maintenance by us. Here’s an /etc/mailcap from a DICE SL5 machine:
#
# LCFG generated /etc/mailcap - do not edit
#

audio/mod;/usr/bin/mikmod %s

audio/wav;/usr/bin/mplayer %s

image/*;/usr/bin/display %s

application/msword;/usr/bin/openoffice.org3 %s

application/postscript;/usr/bin/gv %s

text/html;/usr/bin/htmlview %s ; copiousoutput

video/quicktime;/usr/bin/mplayer %s

application/x-java-jnlp-file;/etc/alternatives/javaws

application/vnd.oasis.opendocument.text;/usr/bin/openoffice.org3 %s

application/vnd.oasis.opendocument.presentation;/usr/bin/openoffice.org3 %s

video/*;/usr/bin/mplayer %s

Application/VND.MS-EXCEL;/usr/bin/openoffice.org3 %s

Application/VND.MS-POWERPOINT;/usr/bin/openoffice.org3 %s


and here’s one from a Fedora 12 machine:

###
### Begin Red Hat Mailcap
###

audio/*; /usr/bin/xdg-open %s

image/*; /usr/bin/xdg-open %s

application/msword; /usr/bin/xdg-open %s
application/pdf; /usr/bin/xdg-open %s
application/postscript ; /usr/bin/xdg-open %s

text/html; /usr/bin/xdg-open %s ; copiousoutput


Presumably we could for F12 get lcfg-mailcap to make an /etc/mailcap which hands everything in sight to xdg-open.

• Iain has now finished finding and where necessary building and submitting Fedora 12 versions of all of the packages mentioned in lcfg_f12_lcfg.rpms. Where he hasn’t been able to do that he’s submitted a bug.
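
The idea above of an /etc/mailcap which hands everything to xdg-open could be sketched like this. It is a hypothetical Python illustration, not lcfg-mailcap itself (which is an incarnation of the file component); the function name and the shape of its input are assumptions made for the example.

```python
def xdg_mailcap(entries, opener="/usr/bin/xdg-open"):
    """Build mailcap text that passes every listed type to one opener.

    entries is a list of (mime_type, extra_flags) pairs; extra_flags may be
    an empty string, or something like "copiousoutput".
    """
    lines = []
    for mime_type, flags in entries:
        # "%s" in the finished mailcap line stands for the file to display.
        line = "%s; %s %%s" % (mime_type, opener)
        if flags:
            line += " ; " + flags
        lines.append(line)
    return "\n".join(lines) + "\n"
```

For example, `xdg_mailcap([("audio/*", ""), ("text/html", "copiousoutput")])` gives two lines in the same style as the Fedora 12 mailcap shown above.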

Written by Chris Cooke

March 17, 2010 at 5:30 pm

## Friday & Monday

• I’ve moved untested packages from the lcfg bucket to the devel bucket. Alastair has added the devel bucket to the default updaterpms path for f12. The idea is to use the devel bucket as the temporary dumping ground for our untested packages. Tested versions will later be submitted to lcfg as normal. I’m judging a package to be tested if “Fedora12” is in its platforms list in lcfg.yml in lcfg subversion. These packages have been moved from lcfg to devel: lcfg-auth-*, lcfg-cron-*, lcfg-fstab-*, lcfg-gdm-*, lcfg-grub-*, lcfg-hardware-*, lcfg-init-*, lcfg-kernel-*, lcfg-lcfginit-*, lcfg-mailcap-*, lcfg-network-*, lcfg-nsswitch-*, lcfg-nsu-*, lcfg-pam-*, lcfg-tcpwrappers-*, netgroup-*, perl-String-*, perl-W3C-*, pkgsubmit-*.
• After eliminating some more package conflicts I have bravely run updaterpms for real, with no visible bad consequences so far.
• Stephen has done an inf/options/openafs-client.h. This exposed the lack of a non-PAE kernel in lcfg_f12_lcfg.rpms (Alastair and I both had an updaterpms conflict) so I’ve added one.
• Checking out lcfg-auth. Our normal auth resources depend on netgroup entries which depend on openldap which I haven’t tackled yet, so I’m overriding them with entries which should recreate basic authorization files more or less as they currently are on the test machine. After an adventure or two (in particular, if you start getting errors like Can't call method "name" on an undefined value at /usr/bin/om line 300 then the account you’re logged in as has just been deleted from the passwd file, OK?) the auth component seems to be doing the right thing, and the files it’s created look as they should. I declare it OK for Fedora 12 so I’ve changed the supported platforms list and built and submitted lcfg-auth-0.102.12 for all supported platforms.
• Spent some time debugging genhdfile to find where the warning messages (bug 224) were coming from. They’re being produced by this line in genhdfile.c:
  rc = rpmReadPackageFile(ts, fd, fullrpmfile, &h);


Written by Chris Cooke

March 16, 2010 at 5:09 pm

## Miscellany from a long day

updaterpms still wants to delete several dozen packages. I’m wondering what useful stuff I might have missed out of the package lists.

• I’ve added cpanspec and dependencies to the devel list. I know Fedora comes with every Perl module imaginable but I can’t rule its usefulness out entirely so we might as well have it now as later.
• I’ve added xemacs and emacs and dependencies to the base list.
• We’re now down to 39 deletions, 15 of which are multiple kernel packages, 8 lcfg packages I haven’t yet added to the lists, 10 perl and 6 stragglers.
• Oops, lcfg/options/buildtools.h already incorporated lcfg_sl5_devel.rpms into the rpmpath. I’ve changed it so it now includes lcfg/options/devel.h, which if you’ve been paying attention you’ll know incorporates lcfg_sl5_devel.rpms (or lcfg_f12_devel.rpms) into the rpmpath.
• More package list adjustments: move redhat-lsb and deps to base; add perl-Module-Build and deps to base; some extra perl test modules added to devel.
• I think we’ve sorted out updaterpms bug 235. Gory details can be found there.
• Just made my first attempt to build lcfg-server on f12. Failed pretty quickly. bug 236. Later – bug closed: we don’t need lcfg-server on f12 right now; an upcoming version is in preparation and that should work fine on f12. (Good; that simplifies things.)
• Alastair has beaten me! He now has two f12 machines managed by updaterpms and boot.
• lcfg-nsu builds as i386 rather than i686 which messes up our lcfg package list which is shared between 32 and 64 bit. Reported as bug 237. Fixed in lcfg-nsu-2.5.9 by removing the BuildArch line from the specfile; BuildArch is only really needed for a “noarch” RPM.
• hidden/f12vars.h now defines ARCH_I386 as well as ARCH_I686 to save hassle.
• Iain has done a massive build and pkgsubmit of lots of components!
• The f12 kernel package now loads after the updates and overrides packages.

Written by Chris Cooke

March 11, 2010 at 6:15 pm

Posted in Uncategorized

Tagged with

## updaterpms

Still working on getting updaterpms to run cleanly.

• Submitted kmod-openafs-PAE (I’d missed this out before); built and submitted lcfg-pkgtools 1.0.9, lcfg-utils 1.3.3, submitted perl-W3C-SAX-XmlParser and perl-W3C-Util-Basekit.
• lcfg-afs is being added to the packages list by inf/options/afs-client.h. lcfg-afs has been replaced by lcfg-openafs. In the dice layer afs-client.h has recently been replaced by openafs-client.h. To eliminate an updaterpms error, and because my AFS is currently working, for the time being I’ll disable inf/options/afs-client.h on my machine. Later: just spoken to Stephen. He’s going to create inf/options/openafs-client.h to more or less mirror the way we set it up in DICE, using lcfg-openafs.
• corrected a typo in inf/options/packages.h. The inf_f12_env.rpms list is now loaded. It’s empty though.
• Came back from lunch to find new F12 updates and updaterpms reporting several conflicts. Some were due to my having previously added packages to lcfg_f12_overrides (which is for non-Fedora packages we want to install) when I should have added them to lcfg_f12_postship (which is for Fedora packages we want to install which have only been shipped as updates not as part of the Everything bucket). Others were new packages or new dependencies of updated versions of existing packages. Where a version of these new dependencies was shipped in Everything, I’ve added the Everything version to lcfg_f12_base. Where it was only ever shipped in updates, I’ve added it to lcfg_f12_postship.
• Created and populated lcfg_f12_devel.rpms with pkgsubmit, cmake, redhat-lsb and their dependencies. Added six waves of perl dependencies. Created empty inf_f12_devel.rpms, inf_sl5_devel.rpms, dice_sl5_devel.rpms and dice_f12_devel.rpms. Created lcfg/options/devel.h, inf/options/devel.h, dice/options/devel.h.
• updaterpms is behaving oddly – for instance it wants to delete something called kernel-PAE/i686-2.6.31.12-174.2.3.fc12/i686 when it should be recognising it as kernel-PAE-2.6.31.12-174.2.3.fc12/i686 which it’s been told not to delete. It separately reports on the latter package name, reporting that it’ll deal with it at boot time (since the package is specified with :br flags). Reported as bug 235.
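
To make the bug 235 symptom concrete, here is a minimal shell sketch of how a package spec of the form name-version-release/arch (the form used in the examples above) splits into its parts. It is an illustration only, not the parser that updaterpms uses, and the function name parse_pkgspec is made up. Run on the mangled string, the arch ends up embedded in the package name, which would be consistent with updaterpms failing to recognise the package it has been told to keep.

```shell
# Illustrative only: split a package spec of the form
#   name-version-release/arch
# into its parts. This is NOT the parser used by updaterpms.
# It assumes the spec contains a "/" and at least two "-" separators.
parse_pkgspec() {
    spec=$1
    arch=${spec##*/}        # text after the last "/"
    nvr=${spec%/*}          # everything before the last "/"
    release=${nvr##*-}      # text after the last "-"
    nv=${nvr%-*}
    version=${nv##*-}       # text after the second-to-last "-"
    name=${nv%-*}           # whatever remains is treated as the name
    printf 'name=%s version=%s release=%s arch=%s\n' \
        "$name" "$version" "$release" "$arch"
}

# The spec as it should be read, then the mangled form from the error.
parse_pkgspec 'kernel-PAE-2.6.31.12-174.2.3.fc12/i686'
parse_pkgspec 'kernel-PAE/i686-2.6.31.12-174.2.3.fc12/i686'
```

The second call reports the name as kernel-PAE/i686, which matches no real package.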

Written by Chris Cooke

March 10, 2010 at 4:53 pm

Posted in Uncategorized

Tagged with

## package lists cleanup

• Stephen has regenerated all the F12 rpmlist and package header files, so updaterpms is now finding most of the RPMs it’s looking for.
• Judicious population of the new lcfg_f12_overrides.rpms list has eliminated all package conflicts.
• updaterpms isn’t yet entirely happy however… it still can’t find several dozen RPMs and it wants to delete too much. To fix this I need to do at least these things:
  • Create, populate and use lcfg_f12_devel.rpms (cmake, lsb, etc.)
  • Autobuild and submit a load of as yet unchecked LCFG components. Iain will be doing this.
  • Populate lcfg_f12_kernel.rpms. (Done. The beginnings of one, anyway.)
  • Submit and find a package list for the openafs packages. (Done. I used the overrides list, since openafs doesn’t seem to be part of Fedora in any way at all.)
  • Fix inf/options/packages.h so that for F12 it defines rpmlist #ifdef ARCH_I686 rather than ARCH_I386. (Done.)
  • Fix inf/options/packages.h so that for F12 it defines rpmlist in terms of layers. (Done.)
• Some of my package conflicts and overrides may be caused by our updates mirror being out of date? This is a reminder to me to look through and find packages we don’t seem to have recent enough updates for, and tell Stephen about them.

Written by Chris Cooke

March 9, 2010 at 5:54 pm

Posted in Uncategorized

Tagged with

## rpm conflicts galore

• updaterpms wants to delete an epic number of packages from my machine and is flagging 383 conflicts.
• Adding -a i686 to updaterpms.flags reduces this to 195 conflicts but the enormous number of would-be deletions remains.
• There are no “noarch” indications in the lcfg_f12_base package list. Whoops! Adding those in reduces the number of conflicts to 18. However it still wants to delete a lot of RPMs and there are also still a huge number of “can’t find original RPM for” and “couldn’t find RPM header file for” messages.
• By judicious use of yum and rpm and even a few adjustments in package lists I’ve reduced the number of conflicts reported on my machine by updaterpms to 4. A lot of the problems seem to stem from my having run yum upgrade at some point. Some of the package versions on my machine are higher than those on our update package lists as well as on the base list. These higher package versions sometimes pull in extra dependencies that weren’t needed before. One updated package imposes a new dependency on ruby-libs which needs readline-5, but the machine already has readline-6 installed. I can’t help thinking that Fedora updates seem to be more of a chaotic mess than RHEL/SL updates and will need a lot of policing.
• News on bug 231: Alastair has found two problems with our packages infrastructure on F12. Firstly, the attempt to write all those thousands of RPM header files to our F12 sites mirror directories has bumped us up against some maximum size limit for an AFS directory. He’s solved that by quickly making it possible to have the RPM header files in a subdir instead. Secondly, the rpmlist files in each of these dirs contain full paths rather than just the names of the files, and updaterpms can’t cope with this. Fixing the rpmlist files by hand makes my updaterpms far happier. However the script which regenerates the rpmlist files will need to be fixed to make this permanent.
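
The hand fix to the rpmlist files amounts to reducing each entry to its bare file name. A minimal sketch of that step, assuming one entry per line; the function name and the sample paths are hypothetical, and this is not the regeneration script itself:

```shell
# Reduce every line of an rpmlist-style file to its final path component.
# Reads the list on stdin and writes the cleaned list on stdout.
# Lines that are already bare file names pass through unchanged.
strip_rpmlist_paths() {
    sed 's|^.*/||'
}

# Hypothetical sample input: one full path and one bare file name.
printf '%s\n' \
    '/some/mirror/dir/foo-1.0-1.fc12.i686.rpm' \
    'bar-2.0-1.fc12.noarch.rpm' | strip_rpmlist_paths
```

Writing the result to a new file and moving it into place, rather than editing the list in place, keeps the original around until the output has been checked.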

Written by Chris Cooke

March 8, 2010 at 4:58 pm

Posted in Uncategorized

Tagged with

## updaterpms and the default arch

• perl-LCFG-PkgTools-1.0.30 is now the default on F12 and SL5 and has been built and submitted for f12, sl5 and sl5_64.
• It looks like perl-LCFG-PkgUtils is used by perl-LCFG-PkgTools, and that’s working fine, so I’ll declare perl-LCFG-PkgUtils to be OK too. perl-LCFG-PkgUtils-1.0.1 has now been built, submitted and made the default version on SL5 and F12.
• I’ve just enabled lcfg-updaterpms and tried running updaterpms for the first time. Before starting the component I set updaterpms.methods to run to prevent it from running updaterpms at “start” time – because it’ll inevitably want to do something silly the very first time it’s run so I want to run it in test mode. The component then started up successfully without attempting to run updaterpms. I then ran it myself in test mode with
om updaterpms run -- -t


and got a whole load of errors which culminated with

[ERROR] updaterpms: There were 383 conflicts


This might take some time to sort out.
First error:

couldn't find RPM header file for gnome-desktop-devel-2.28.2-3.fc12/i386


There is a header file for that package:

/afs/inf.ed.ac.uk/pkgs/sites/f12/updates/i386/.gnome-desktop-devel-2.28.2-3.fc12.i686.rpm


but the RPM and its header file are both “i686” not “i386”. The package is specified in the package list file like so:

?gnome-desktop-devel-2.28.2-3.fc12


… because it’s using the default arch rather than another one such as “noarch”.
Have I set the default arch wrongly somewhere…?
Edit: now reported as bug 231.
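
The mismatch can be sketched in a few lines of shell. The header file naming pattern below (a leading dot, then name-version-release, then the arch, then .rpm) is inferred from the single path shown above, and the function name is made up; updaterpms itself may well build the name differently.

```shell
# Sketch: derive the hidden header file name for a package list entry,
# filling in a default arch when the entry does not name one.
# The naming pattern is inferred from one example, not from updaterpms.
header_file_for() {
    entry=$1
    default_arch=$2
    entry=${entry#\?}                 # drop a leading "?" prefix, if any
    case $entry in
        */*) arch=${entry##*/}; nvr=${entry%/*} ;;   # explicit arch given
        *)   arch=$default_arch; nvr=$entry ;;       # fall back to default
    esac
    printf '.%s.%s.rpm\n' "$nvr" "$arch"
}

# The same entry resolved with each candidate default arch.
header_file_for '?gnome-desktop-devel-2.28.2-3.fc12' i386
header_file_for '?gnome-desktop-devel-2.28.2-3.fc12' i686
```

Only the i686 result matches the file that actually exists in the updates directory, which is why the default arch matters here.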

Written by Chris Cooke

March 4, 2010 at 4:50 pm

Posted in Uncategorized

Tagged with

## Mac subversion problem

I used to keep subversion checkouts on my Mac. This was very handy. Now I’ve upgraded the Mac to 10.5 (not 10.6 yet – I’m a “late adopter” by nature!) these don’t work: any attempt to do a subversion checkout fails with a “207 Multi-Status” error.

Graham tells me that this is a problem with a 10.5 library. In rough order of increasing consumer friendliness I can either replace the library with one I compile myself, or install MacPorts and download its Subversion, or simply upgrade the Mac to 10.6, in which this bug has been fixed.

Written by Chris Cooke

March 2, 2010 at 5:10 pm

Posted in Uncategorized

Tagged with , ,

## More on perl-LCFG-PkgTools & perl-LCFG-Utils

• On the question of whether perl-Template-Toolkit should or shouldn’t be a dependency of lcfg-utils: No, it shouldn’t. However it would be far more on target to have it as a dependency of perl-LCFG-Utils. Strictly speaking it’s not actually required, but its presence does enhance perl-LCFG-Utils (it lets it handle new-style perl templates as well as the old-style LCFG templates) and we should have it installed for that reason. Also, perl modules which are dependencies of perl software generally aren’t mentioned specifically in the specfile as dependencies anyway, Simon tells me; instead the build system notices the “use” or “require” and adds a dependency automatically. This is better than specifying such dependencies in the specfile, as the dependencies then stay up to date by themselves if a module moves to a new package, for instance. So: while I’ve left it to the build mechanism to add a formal RPM dependency on perl-Template-Toolkit if it wants to (and it doesn’t seem to want to in this case), I have kept perl-Template-Toolkit and dependencies in the core-prereq section of @lcfg_f12_lcfg.rpms.
• Now I’ve got some sense into the package lists I can test qxpack, and it seems to be working. In fact it alerted me to a typo I’d made last week which I’ve now fixed. All the qxpack results I’m getting back seem sensible, so I’ll declare perl-LCFG-PkgTools to be ported and release a new version. (Later.) Hmm: I’ve made the new version but when I try and build it I get:
% rpmbuild -ba LCFG-PkgTools-Perl-1.0.30.spec
error: Failed build dependencies:
perl(Test::Differences) is needed by perl-LCFG-PkgTools-1.0.30-1.i386
perl(Test::Exception) is needed by perl-LCFG-PkgTools-1.0.30-1.i386


These aren’t installed anywhere by default, but if they’re only build-time dependencies then maybe they don’t need to be. (Later: I’ve checked with Stephen and he agrees: just install these build dependencies for the build then get rid of them again.)
For now I’ll install them on the hosts I use to rebuild this package and we’ll see what happens. (Later.) One dependency led to another so my SL5 machine now has:

/* BuildReq by perl-LCFG-PkgTools */
!profile.packages  mEXTRA(perl-Test-Differences-0.47-2.el5/noarch)
/* Req by perl-Test-Differences */
!profile.packages  mEXTRA(perl-Text-Diff-0.35-3.el5/noarch)
/* Req by perl-Text-Diff */
!profile.packages  mEXTRA(perl-Algorithm-Diff-1.1902-2.el5/noarch)
/* BuildReq by perl-LCFG-PkgTools */
!profile.packages  mEXTRA(perl-Test-Exception-0.29-1.inf/noarch)
/* Req by perl-Test-Exception */
!profile.packages  mEXTRA(perl-Sub-Uplevel-0.18-2.1.inf/noarch)


Some of these had several versions: I chose the highest numbered version in each case. Having done that, perl-LCFG-PkgTools-1.0.30 builds and installs on SL5, after which qxpack appears to behave just as the previous version did. It continues to work after I remove the five build-time requirements mentioned above.
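
Picking the highest numbered version by eye is easy to get wrong once releases reach two digits. A rough sketch using GNU sort’s version ordering follows; this approximates RPM’s own version comparison but is not identical to it. The function name is made up, and two of the three candidate names are invented for the example.

```shell
# Print the highest-versioned entry from a list on stdin, using the
# version ordering (-V) of GNU sort. This approximates, but is not
# identical to, the version comparison that RPM itself performs.
highest_version() {
    sort -V | tail -n 1
}

# Candidate builds of one package; releases 1 and 10 are hypothetical.
printf '%s\n' \
    'perl-Text-Diff-0.35-3.el5' \
    'perl-Text-Diff-0.35-10.el5' \
    'perl-Text-Diff-0.35-1.el5' | highest_version
```

A plain lexical sort would rank release 3 above release 10; version ordering compares the digit runs as numbers.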

Written by Chris Cooke

March 2, 2010 at 5:04 pm

Posted in Uncategorized

Tagged with

## om, core-prereq cleanup

• Tried installing lcfg-om and running om.
Got the message

Can't do setuid (cannot exec sperl)


This was fixed by installing the perl-suidperl package. I’ve now added that to the specfile of lcfg-om as a Requirement. The next attempt to run “om” gave the error:

Component group does not exist


Google knows of precisely three websites in the world containing this exact phrase. (Now there will be four.) The only Linux-related site was my own SL5 LCFG port diary (!) which describes my encounter with the same problem a couple of years ago. The fix is to add the “lcfg” group to /etc/group.
Having done this, the error message changes to the more conventional:

No permission to run this method


which is fixed by adding my username to the resource authorize.users_superusers.
After having done that, various random “om” commands work.
I’d say that constitutes an adequate test of both lcfg-om and lcfg-authorize so I’ll mark the port by issuing updated versions of both: lcfg-om-0.4.8 and lcfg-authorize-1.0.13.

• I’ve added perl-suidperl as a dependency of lcfg-om in the LCFG package lists. I also corrected a couple of dependencies of lcfg-authorize in the F12 LCFG package lists and made a mental note to go back and correct the package list dependencies of other packages already done for F12 – I realise I’ve neglected them.
• Change of plan: it turns out that perl-suidperl is in the base list for SL5 so isn’t in the SL5 lcfg list; and since perl-suidperl is going to be needed by all sorts of things, we should add it to the base list for F12 too. So I’ve taken it back out of the LCFG lists and added it to the F12 base list.
• In the package lists perl-Template-Toolkit is listed as a dependency of lcfg-utils. In the lcfg-utils specfile it’s not.
My guess is that this is deliberate because Template Toolkit support is an optional feature of lcfg-utils, but that nevertheless we want to have perl-Template-Toolkit installed. I’d like to get this confirmed though. I’ve gone ahead and updated perl-Template-Toolkit in the LCFG package lists, and also updated its dependencies (which are a different list of packages this time than on SL5).

• Ditto perl-Tk and perl-LCFG-Utils: perl-Tk seems to be an optional dependency of perl-LCFG-Utils.
• In case I need to know later, these packages on SL5 require one of the targets provided by perl-Template-Toolkit:
perl-LCFG-Build-Skeleton-0.0.12-1.noarch
eqe-1.3.0-1.dice.1.noarch
lcfg-autoreboot-1.0.11-1.noarch
perl-LCFG-Build-Tools-0.0.61-1.noarch


No package on SL5 requires a target provided by perl-Tk.
This is how I found that out:

for i in $(rpm -q --provides perl-Template-Toolkit|awk -F\= '{print $1}'); \
do echo perl-Template-Toolkit provides $i; rpm -q --whatrequires $i; \
done | grep -v perl-Template-Toolkit|grep -v 'no package requires'


• I’ll probably go back and move perl-Template-Toolkit (and dependencies) to a package list where they’re definitely needed, rather than in LCFG core. Not today though, it’s been complicated enough, and besides it is Friday afternoon: prime time for breaking things for the weekend with an ill-considered bit of hacking of core package lists.
• Hopefully a lot of the dependency problems will be sorted out quickly and easily when I get updaterpms working and can run through its list of moans and groans. Next week I hope.

Written by Chris Cooke

February 26, 2010 at 5:04 pm

Posted in Uncategorized

Tagged with

## First stages now complete!

• Belatedly built and submitted recently updated LCFG components for sl5_64 too.
• lcfg-pkgtools-1.0.9 supports F12 and is the default on F12 and SL5.
• lcfg-sysinfo-1.0.2 supports F12 and is the default on F12 and SL5.
• pkgsubmit seems to submit packages correctly but when run genhdfile issues warnings:
Creating dependency file for lcfg-logserver-defaults-s1-1.2.20-1.noarch.rpm
warning: Unknown system: (null)
warning: Please contact rpm-maint@lists.rpm.org
warning: Unknown system: (null)
warning: Please contact rpm-maint@lists.rpm.org


This has been reported as bug 224 (in the lcfg-updaterpms category, which is where genhdfile lives).
• logserver seems to serve logs properly when tested so logserver-1.2.20 which supports F12 and SL5 has been made and is the new default version.
• qxprof and sxprof seem to work OK on F12 so a new default of perl-LCFG-Utils-1.3.4 commemorates the fact.
• Default repository locations are now set for F12 machines in lcfg/options/installroot.h & inf/defaults.h. These are used to set default F12 package repository locations in inf/options/packages.h & lcfg/defaults/updaterpms.h.
The _REPOSITORY macro (which defines the root of the local RPM repository) is joined by a new _SITES_REPOSITORY macro which defines the root of the site mirror RPM repository. Both are used for updaterpms.rpmpath on F12, so F12 RPMs will be taken straight from our Fedora mirror.
• This means that stages 2 (RPM Repositories), 3 (Package Lists) and 4 (Essential Headers) of the project are now complete.

Written by Chris Cooke

February 25, 2010 at 5:19 pm

Posted in Uncategorized

Tagged with

## client, ngeneric, file

• Bug 222 is resolved. Alastair figured it out: Fedora 12 comes with a firewall which seems to be enabled by default. After I disabled it then made a change to the machine’s LCFG file, the new profile was picked up by the client component pretty much immediately. Take a bow lcfg-client-2.2.38 with official support for Fedora 12: now the default version on SL5 and F12.
• lcfg-ngeneric-1.2.35 now supports F12 and is the default version on F12 and SL5.
• lcfg-file-1.1.19 now supports F12 and is the default version on F12 and SL5.
• A bunch of milestones were created on the devproj page.

Written by Chris Cooke

February 23, 2010 at 5:10 pm

Posted in Uncategorized

Tagged with

## Slow progress

This week got off to a bad start. I couldn’t log in to the Fedora machine. I eventually traced this to a DHCP problem: the machine’s minimal new LCFG profile didn’t have the dhclient component included. This meant that the machine’s MAC address didn’t get added to the right spanning map, so it didn’t make it on to the DHCP servers, so DHCP for my machine was broken. Oops. I fixed it by adding dhclient to profile.components and uncommenting dhclient.mac.

It took a while to reinstall openafs which got broken by an ill-judged “OK” to a PackageManager popup which offered to install a load of security fixes.
Among these was a new kernel; when I rebooted, this new kernel was selected, so the openafs kernel module didn’t match, as the kmod-openafs RPM is specific to the kernel version you’re using. While trying to fix the problem I ended up in a situation where the openafs yum magic which automatically selects the right version of kmod-openafs for you was guessing at a kernel version which wasn’t on my machine. I reported that problem but got round it by specifying the kmod-openafs version to yum myself. In the end the magic commands were:

yum install kmod-openafs-PAE-1.4.11-1.1.2.6.31.12_174.2.3.fc12 openafs-client openafs-krb5
cp /usr/vice/etc/ThisCell.rpmsave /usr/vice/etc/ThisCell
/etc/init.d/openafs-client start


Also,

• The LCFG client sees and downloads new kernels but isn’t seeing the UDP notification, so it only downloads a new kernel at most once every ten minutes or so (client.poll being set to 10m+30s). Bug 222.
• There’s now a tracking bug for the LCFG port to Fedora 12. All bugs related to this project block it. Bug 223.
• lcfg-utils seems to work – at least lcfgmsg seems to behave properly when tested – so I’ve added “Fedora12” to its list of supported platforms and issued a new version (1.3.3) which has been built, submitted and made the default version on SL5, and made the default version on F12.
• I’ve created LCFG-level package lists lcfg_f12_lcfg.rpms, lcfg_f12_lcfg_installroot.rpms, lcfg_f12_installroot.rpms, lcfg_f12_installbase.rpms, lcfg_f12_testupdates.rpms, lcfg_f12_kernel.rpms.

Written by Chris Cooke

February 22, 2010 at 5:42 pm

Posted in Uncategorized

Tagged with

## We have a profile!

• Created inf/os/f12.h.
For now this overrides profile.components and boot.services with a fairly minimal setup:
!profile.components mSET(sysinfo client file om_defaults)
!boot.services mSET(lcfg_client lcfg_file)


• Then give the test machine a fairly small LCFG file:
#define INF_FLAVOUR_DICE
#include <inf/os/f12.h>
#include <inf/hw/dell_optiplex_gx745.h>
#include <live/wire_forum.h>

!profile.release mSET(develop)


This makes an XML profile for the machine.
• Then install the profile:
# /usr/lib/lcfg/components/client install http://lcfg2.inf.ed.ac.uk/profiles/inf.ed.ac.uk/tarragona/XML/profile.xml
[OK] client: install
#


• Then we can use qxprof to see the resources!
# qxprof sysinfo
arch=i686
display=model location Serial~No=sno allocated manager owner OS=os_id Release~Version=release_version
domain=inf.ed.ac.uk
manager=root@inf.ed.ac.uk
model=Dell Optiplex GX745
etc.


Written by Chris Cooke

February 19, 2010 at 3:35 pm

Posted in Uncategorized

Tagged with

## more package builds

• lcfg-etcservices doesn’t build (bug 218) because the lcfg.cmake.tt template doesn’t know how to recognise Fedora (bug 219).
• lcfg-init-1.0.12 builds and installs.
• lcfg-lcfginit-0.100.11 builds but doesn’t install because
error: Failed dependencies:
lcfg-boot >= 1.2.0 is needed by lcfg-lcfginit-0.100.11-1.noarch


• lcfg-nsu-2.5.8 builds and installs.
• lcfg-pam-1.0.11 builds and installs.
• lcfg-syslog-1.1.15 doesn’t build because
error: Failed build dependencies:
sysklogd is needed by lcfg-syslog-1.1.15-1.src


On Fedora 12 sysklogd has been replaced by rsyslog, which claims that “It is quite compatible to stock sysklogd and can be used as a drop-in replacement”. I’ll go back and tackle this later.
• lcfg-tcpwrappers-0.99.10 builds and installs.
• lcfg-defetc-f12-0.1.2 created, built and installed.
• lcfg-dns-6.1.69 doesn’t build due to its use of obsolete build tools: bug 220.
• lcfg-kerberos-2.1.35 doesn’t build due to its use of obsolete build tools: bug 221.

Written by Chris Cooke

February 18, 2010 at 3:37 pm

Posted in Uncategorized

Tagged with

## auth, boot, cron

Various items on the plan can’t be done yet so I’m pressing on with test builds and installs of various LCFG packages:

• lcfg-auth-0.102.11 builds and installs.
• lcfg-boot-1.2.20 doesn’t build: it appears to use old-style buildtools. The version in subversion is 1.2.22 but that doesn’t build either because of a missing tag. Bug 217.
• lcfg-cron-2.0.12 wouldn’t build because:
error: Failed build dependencies:
netgroup is needed by lcfg-cron-2.0.12-1.src


netgroup-1.1 builds and installs.
lcfg-cron-2.0.12 then builds, but won’t install because:

error: Failed dependencies:
perl(IPC::Run) is needed by lcfg-cron-2.0.12-1.noarch
perl(String::CRC::Cksum) is needed by lcfg-cron-2.0.12-1.noarch


“yum install perl-IPC-Run” succeeds and installs perl-IPC-Run-0.84-1.fc12.noarch.
perl-String-CRC-Cksum is not in CPAN and can’t be installed from the Fedora yum repositories – it appears to be a local RPM. A rebuild from perl-String-CRC-Cksum-0.03-1.inf.src.rpm succeeds and the resulting RPM installs.
lcfg-cron-2.0.12 then installs.

Written by Chris Cooke

February 17, 2010 at 6:07 pm

Posted in Uncategorized

Tagged with

## Package building

I’d suspected that something was wrong, and after checking with Stephen it seems I overcomplicated the package building process. The thing to do is to just rebuild the LCFG source RPMs; any build errors should be reported in Bugzilla. So, some package builds:

• I’ve done that with lcfg-utils and reported the bug (LCFG bug 213). Stephen has fixed the bug in lcfg-utils-1.3.2 and this builds and installs on F12.
• lcfg-pkgtools-1.0.8 builds and installs on F12.
• lcfg-pkgtools-perl-1.0.25 needs these additional packages to be installed:
  • perl-Class-Accessor-0.34-1.fc12.noarch
  • perl-IO-Zlib-1.07-87.fc12.i686
  • perl-TimeDate-1.16-11.fc12.noarch
  • perl-Test-Differences-0.4801-3.fc12.noarch which pulls in these packages:
    • perl-Algorithm-Diff-1.1902-8.fc12.noarch
    • perl-Text-Diff-1.37-2.fc12.noarch
  • perl-Test-Exception-0.27-4.fc12.noarch which pulls in these packages:
    • perl-Test-Simple-0.92-87.fc12.i686
    • perl-Sub-Uplevel-0.2002-3.fc12.noarch
It then builds. However it doesn’t install because:

error: Failed dependencies:
lcfg-sysinfo is needed by lcfg-pkgtools-perl-1.0.25-1.i686
perl(LCFG::PkgUtils) is needed by lcfg-pkgtools-perl-1.0.25-1.i686
perl(LCFG::SysInfo) is needed by lcfg-pkgtools-perl-1.0.25-1.i686
perl(Readonly) is needed by lcfg-pkgtools-perl-1.0.25-1.i686
updaterpms is needed by lcfg-pkgtools-perl-1.0.25-1.i686


• updaterpms-3.2.1 builds and installs.
• lcfg-updaterpms-0.100.48 builds but doesn’t yet install:
error: Failed dependencies:
lcfg-ngeneric is needed by lcfg-updaterpms-0.100.48-1.noarch
lcfg-om is needed by lcfg-updaterpms-0.100.48-1.noarch


• lcfg-ngeneric-1.2.34 doesn’t build because:
error: Failed build dependencies:
perl(LCFG::Utils) is needed by lcfg-ngeneric-1.2.34-1.src
perl(LCFG::Template) is needed by lcfg-ngeneric-1.2.34-1.src
perl(LCFG::Resources) is needed by lcfg-ngeneric-1.2.34-1.src
perl(LCFG::SysInfo) is needed by lcfg-ngeneric-1.2.34-1.src


• perl-LCFG-Utils-1.3.3 doesn’t build because:
error: Failed build dependencies:
perl(Module::Build) is needed by perl-LCFG-Utils-1.3.3-1.src
perl(ExtUtils::CBuilder) is needed by perl-LCFG-Utils-1.3.3-1.src


Installed perl-Module-Build-0.3200-87.fc12.i686
This also pulls in:
perl-ExtUtils-CBuilder-0.24-87.fc12.i686
perl-Archive-Tar-1.46-87.fc12.i686
perl-Package-Constants-0.01-87.fc12.i686
perl-LCFG-Utils-1.3.3 then builds, but doesn’t install because:

error: Failed dependencies:
perl(Readonly) is needed by perl-LCFG-Utils-1.3.3-1.i686


Installed perl-Readonly-1.03-10.fc12.noarch which also pulls in:
perl-Readonly-XS-1.05-2.fc12.i686
perl-LCFG-Utils-1.3.3 then installs.
• lcfg-sysinfo-1.0.0 builds and installs.
• lcfg-ngeneric-1.2.34 now builds and installs.
• lcfg-om-0.4.7 builds but doesn’t install because:
error: Failed dependencies:
perl(UNIVERSAL::require) is needed by lcfg-om-0.4.7-1.noarch


Installed perl-UNIVERSAL-require-0.13-1.fc12.noarch
lcfg-om-0.4.7 now installs.
• lcfg-updaterpms-0.100.48 now installs.
• perl-LCFG-PkgUtils-1.0.0 builds and installs.
• lcfg-pkgtools-perl-1.0.25 now installs.
• perl-LCFG-PkgTools-1.0.29 builds but doesn’t install because:
file /usr/bin/qxpack from install of perl-LCFG-PkgTools-1.0.29-1.i686 conflicts with file from package lcfg-pkgtools-perl-1.0.25-1.i686
file /usr/lib/perl5/vendor_perl/5.10.0/i386-linux-thread-multi/LCFG/PkgList.pm from install of perl-LCFG-PkgTools-1.0.29-1.i686 conflicts with file from package lcfg-pkgtools-perl-1.0.25-1.i686
file /usr/lib/perl5/vendor_perl/5.10.0/i386-linux-thread-multi/LCFG/PkgSpec.pm from install of perl-LCFG-PkgTools-1.0.29-1.i686 conflicts with file from package lcfg-pkgtools-perl-1.0.25-1.i686
file /usr/lib/perl5/vendor_perl/5.10.0/i386-linux-thread-multi/LCFG/PkgTools.pm from install of perl-LCFG-PkgTools-1.0.29-1.i686 conflicts with file from package lcfg-pkgtools-perl-1.0.25-1.i686
file /usr/lib/perl5/vendor_perl/5.10.0/i386-linux-thread-multi/auto/LCFG/PkgTools/PkgTools.so from install of perl-LCFG-PkgTools-1.0.29-1.i686 conflicts with file from package lcfg-pkgtools-perl-1.0.25-1.i686
file /usr/share/man/man1/qxpack.1.gz from install of perl-LCFG-PkgTools-1.0.29-1.i686 conflicts with file from package lcfg-pkgtools-perl-1.0.25-1.i686
file /usr/share/man/man3/LCFG::PkgList.3pm.gz from install of perl-LCFG-PkgTools-1.0.29-1.i686 conflicts with file from package lcfg-pkgtools-perl-1.0.25-1.i686
file /usr/share/man/man3/LCFG::PkgSpec.3pm.gz from install of perl-LCFG-PkgTools-1.0.29-1.i686 conflicts with file from package lcfg-pkgtools-perl-1.0.25-1.i686
file /usr/share/man/man3/LCFG::PkgTools.3pm.gz from install of perl-LCFG-PkgTools-1.0.29-1.i686 conflicts with file from package lcfg-pkgtools-perl-1.0.25-1.i686


Whoops! Looks like I shouldn’t have installed lcfg-pkgtools-perl! A quick ‘rpm -e’ later and perl-LCFG-PkgTools-1.0.29 installs successfully.
• lcfg-client-2.2.37 doesn’t build because:
error: Failed build dependencies:
perl-W3C-SAX-XmlParser is needed by lcfg-client-2.2.37-1.src
perl-W3C-Util-Basekit is needed by lcfg-client-2.2.37-1.src


These were both installed on SL5 three years ago at LCFG port time (!) and the packages were made using cpan2rpm which no longer exists. I’ll try cpanspec instead.
Installing cpanspec-1.78-3.fc12.noarch pulls in these dependencies:

perl-Algorithm-C3.noarch 0:0.08-2.fc12
perl-Archive-Zip.noarch 0:1.30-1.fc12
perl-CPAN-DistnameInfo.noarch 0:0.08-2.fc12
perl-Class-C3.noarch 0:0.21-2.fc12
perl-Class-C3-XS.i686 0:0.13-1.fc12
perl-Class-MOP.i686 0:0.94-1.fc12
perl-Compress-Raw-Bzip2.i686 0:2.020-1.fc12
perl-Data-OptList.noarch 0:0.104-3.fc12
perl-Devel-GlobalDestruction.i686 0:0.02-7.fc12
perl-IO-Compress-Bzip2.noarch 0:2.005-6.fc12
perl-List-MoreUtils.i686 0:0.22-9.fc12
perl-MRO-Compat.noarch 0:0.11-2.fc12
perl-Moose.noarch 0:0.92-1.fc12
perl-Params-Util.i686 0:1.00-2.fc12
perl-Parse-CPAN-Packages.noarch 0:2.31-2.fc12
perl-Sub-Exporter.noarch 0:0.982-3.fc12
perl-Sub-Identify.i686 0:0.04-6.fc12
perl-Sub-Install.noarch 0:0.925-3.fc12
perl-Sub-Name.i686 0:0.04-4.fc12
perl-Task-Weaken.noarch 0:1.02-5.fc12
perl-TeX-Hyphen.noarch 0:0.140-9.fc12
perl-Text-Autoformat.noarch 0:1.14.0-5.fc12
perl-Text-Reform.noarch 0:1.12.2-6.fc12
perl-Try-Tiny.noarch 0:0.02-1.fc12
perl-YAML.noarch 0:0.70-2.fc12


[cc@tarragona SPECS]$ cpanspec perl-W3C-SAX-XmlParser
Failed to parse 'perl::W3C::SAX::XmlParser' or find a module by that name, skipping...
[cc@tarragona SPECS]$ cpanspec perl-W3C-Util-Basekit
Failed to parse 'perl::W3C::Util::Basekit' or find a module by that name, skipping...


Neither of these modules seems to exist any more. Reported in LCFG bug 215.
Apparently it’s OK for us to maintain our own copies of these modules and just rebuild from SRPMs from one port to the next. So:
perl-W3C-Util-Basekit-0.91 builds and installs.
perl-W3C-SAX-XmlParser-0.99 builds and installs.
lcfg-client-2.2.37 then builds and installs.
• lcfg-file-1.1.18 builds and installs.
• lcfg-logserver-1.2.19 builds and installs.
• lcfg-authorize-1.0.11 builds and installs.
• pkgsubmit-0.0.4 fails to build because:
/usr/lib/ccache/gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i686 -mtune=atom -fasynchronous-unwind-tables -fPIC CMakeFiles/pkgsubmit.dir/pkgsubmit.c.o -o pkgsubmit -rdynamic -lrpm -lrpmio -lz
CMakeFiles/pkgsubmit.dir/pkgsubmit.c.o: In function `PrintErr':
/home/cc/RPMbuild/BUILD/pkgsubmit-0.0.4/pkgsubmit.c:72: undefined reference to `rpmErrorString'
collect2: ld returned 1 exit status
make[2]: *** [pkgsubmit] Error 1
make[2]: Leaving directory `/home/cc/RPMbuild/BUILD/pkgsubmit-0.0.4'
make[1]: *** [CMakeFiles/pkgsubmit.dir/all] Error 2
make[1]: Leaving directory `/home/cc/RPMbuild/BUILD/pkgsubmit-0.0.4'
make: *** [all] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.OY7FUA (%build)


Reported in LCFG bug 216.

Written by Chris Cooke

February 16, 2010 at 6:07 pm

Posted in Uncategorized

Tagged with

## lcfg-pkgtools, hidden vars headers

Installing lcfg-pkgtools:
The SRPM installed OK.
‘cmake .’ failed with:

CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
PCRE_LIBRARY
linked by target "lcfg_pkgtools" in directory /root/rpmbuild/SOURCES/lcfg-pkgtools-1.0.8/src

-- Configuring incomplete, errors occurred!


A search for ‘pcre’ in the package’s files found this:

ChangeLog: * specfile (BuildRequires): Added pcre-devel


My SL5 machine has pcre-devel but the F12 machine doesn’t:

[ayre]cc: rpm -qa|grep -i pcre
pcre-6.6-2.el5_1.7.i386
pcre-ocaml-5.11.1-0.dice.3.i386
pcre-devel-6.6-2.el5_1.7.i386
[cc@tarragona srpms]$ rpm -qa|grep -i pcre
pcre-7.8-3.fc12.i686


A ‘yum install pcre-devel’ succeeds without pulling in any other dependencies.
The ‘cmake .’ now succeeds, as does the rest of the build and install.
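
A missing build dependency like this can be spotted before running cmake by reading the BuildRequires lines out of the specfile. The sketch below assumes the usual “BuildRequires: name [op version], name ...” layout; the function name and the sample specfile contents are hypothetical and not taken from the real lcfg-pkgtools spec.

```shell
# List the package names found on the BuildRequires lines of a specfile.
# Version constraints such as ">= 2.6.0" are dropped.
list_buildrequires() {
    awk 'tolower($1) == "buildrequires:" {
             for (i = 2; i <= NF; i++) {
                 w = $i
                 sub(/,$/, "", w)                       # strip a trailing comma
                 if (w ~ /^[<>=]+$/) { i++; continue }  # skip operator and version
                 print w
             }
         }' "$1"
}

# Hypothetical specfile fragment, written to a temporary file.
spec=$(mktemp)
cat > "$spec" <<'EOF'
Name: example
BuildRequires: cmake >= 2.6.0, pcre-devel
BuildRequires: lsb
EOF
list_buildrequires "$spec"
rm -f "$spec"
```

Each name printed can then be passed to rpm -q to see whether it is installed. Requirements that are file paths or capabilities such as perl(Foo) need rpm -q --whatprovides instead.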

Also made hidden/f12vars.h and hidden/f12_64vars.h. The concept of minor OS version number doesn’t seem to be meaningful with Fedora so I’ve left these variables undefined:

OS_RELEASE_MAJOR
OS_RELEASE_MINOR
OS_RELEASE_FULL
OS_ID_FULL


Written by Chris Cooke

February 15, 2010 at 4:27 pm

## First LCFG package installed: lcfg-utils

Installing lcfg-utils:

• Copied across lcfg-utils-1.3.1-1.src.rpm
• rpm -i lcfg-utils-1.3.1-1.src.rpm

cmake >= 2.6.0
/etc/rpm/macros.cmake
lsb


“yum install cmake” was straightforward:

Installing:
cmake           i686           2.6.4-5.fc12            updates           5.2 M


“yum install redhat-lsb” pulled in some dependencies of its own:

Installing:
redhat-lsb                i686      3.2-7.fc12                fedora      26 k
Installing for dependencies:
foomatic                  i686      4.0.3-8.fc12              updates    241 k
foomatic-db               noarch    4.0-8.20091126.fc12       updates    1.0 M
foomatic-db-filesystem    noarch    4.0-8.20091126.fc12       updates    4.4 k
foomatic-db-ppds          noarch    4.0-8.20091126.fc12       updates     19 M
libmodplug                i686      1:0.8.7-2.fc12            fedora     150 k
libmpcdec                 i686      1.2.6-6.fc12              fedora      24 k
pax                       i686      3.4-10.fc12               fedora      67 k
phonon                    i686      4.3.80-5.fc12             updates    152 k
phonon-backend-xine       i686      4.3.80-5.fc12             updates    153 k
qt                        i686      1:4.5.3-9.fc12            updates    3.1 M
qt-sqlite                 i686      1:4.5.3-9.fc12            updates     46 k
qt-x11                    i686      1:4.5.3-9.fc12            updates     13 M
xine-lib                  i686      1.1.16.3-5.fc12           updates    2.2 M

I now have lcfg-utils (and presumably also lcfg-utils-devel?) installed – though RPM doesn’t know this. Hopefully this won’t cause problems later.

Written by Chris Cooke

February 15, 2010 at 3:48 pm

## OpenAFS: how it should [and shouldn't] be installed

(Or, a scary glimpse into the mind of Chris.)
I’m trying in this blog to keep a complete record of the port of LCFG to Fedora 12, blunders and all, because cataloguing mistakes, problems, bugs and their solutions can be really useful later on. However sometimes it does get a bit painful, and this is one of those times.
Remember the problems I had last week installing OpenAFS? Well, it turns out that I was being, let’s not mince words, really stupid. Whatever most of my brain was doing it wasn’t concentrating on the screen in front of me. As an illustration here’s what happened with my first problem.
I point my browser at www.openafs.org and click on 1.4.x Maintenance Release then Fedora. This shows me a page of handy links something like this:

• Fedora
• Version 10
• Version 11
• Version 12
• Version 8
• Version 9

But in my haste to get along I read something like this:

• Fedora
• Version 10
• blah
• blah
• some old Fedora
• some older Fedora

… and thought “that’s funny, I can’t see any download links for Fedora 12 or Fedora 11”. And this happened every time I checked that page: I so expected the links to be in order of Fedora release date, most recent first, that it overrode my perceptions completely.

Secondly: once I had (somehow) found the correct download area anyway, and downloaded the packages, an examination of my command history shows me managing to try just about every possible combination of RPM installs except the correct ones. I installed the dkms version of the kernel module RPM, but Fedora 12 doesn’t come with dkms by default so I got the dkms dependency error. Then I tried installing the correct kind of kernel module package (the “kmod-openafs” one), managed to get the correct kernel version number (the kernel module packages are provided for a number of kernels, very handy) but picked the package for non-PAE kernels. Guess what, my Fedora 12 machine has a PAE kernel. So that explains the dependency problems I saw there. All I needed to do at that point was install the PAE version of the kmod-openafs package instead; but no, instead I took a bizarre detour into the past and started compiling an old-style “openafs-kernel” package, in the hope that that would work. It did, eventually, but here’s what I should have done:

• Scroll down, download and install the openafs-repository-1.4.11-1.noarch.rpm. This provides a repository definition to let yum use the OpenAFS.org binary distribution provided from http://dl.openafs.org.
• The command yum install openafs-client kmod-openafs then automagically finds the correct RPMs for your kernel version and installs them for you. Very neat.
• Then start the openafs-client service, do an aklog and Bob’s your uncle. Simple. Five minutes.

Coming up this week: more speed, less haste. Or at least less haste.

Written by Chris Cooke

February 15, 2010 at 11:39 am

## openafs finally working

My goodness. Throwing caution to the wind,
rpm -i --nodeps
works a treat. I install the openafs-kernel RPM; start up the openafs-client service; aklog; and hey presto, I have read-write access to our AFS directories.

Written by Chris Cooke

February 11, 2010 at 3:44 pm

## openafs kernel module

Good news: I’ve made some progress. Bad news: not much progress, and it still feels like I’m wading through treacle.

An rpm -qi openafs gives you the rebuilding instructions which tell you among other things how to rebuild the kernel module. I rebuilt the kernel module like so:
rpmbuild -ba --define "build_modules 1" --target=i686 openafs.spec 
This produced a package file named kmod-openafs-1.4.11-1.1.2.6.31.12_174.2.3.fc12.i686.rpm.
However this doesn’t install: the machine doesn’t use dkms, so you get dependency problems. And the openafs instructions say to install an RPM called openafs-kernel, not one called kmod-openafs, anyway. It turns out you can tell it to build an old-style kernel module instead like so:
rpmbuild -ba --define "build_modules 1" --define "fedorakmod 0" --target=i686 openafs.spec
This produces a package file named openafs-kernel-1.4.11-2.6.31.12_174.2.3.fc12.i686.PAE_1.1.i686.rpm
Hooray! This is what I’ve needed all along.
So I install it:

# rpm -i /root/rpmbuild/RPMS/i686/openafs-kernel-1.4.11-2.6.31.12_174.2.3.fc12.i686.PAE_1.1.i686.rpm
error: Failed dependencies:
kernel-i686 = 2.6.31.12-174.2.3.fc12.i686.PAE is needed by openafs-kernel-1.4.11-2.6.31.12_174.2.3.fc12.i686.PAE_1.1.i686


The installed kernel package provides these dependency targets:

# rpm -q --provides kernel-PAE-2.6.31.12-174.2.3.fc12.i686
kernel = 2.6.31.12-174.2.3.fc12
kernel-drm = 4.3.0
kernel-drm-nouveau = 15
kernel-i686 = 2.6.31.12-174.2.3.fc12.PAE
kernel-modeset = 1
kernel-uname-r = 2.6.31.12-174.2.3.fc12.i686.PAE
kernel-xen = 2.6.31.12-174.2.3.fc12
linux-gate.so.1
linux-gate.so.1(LINUX_2.5)
kernel-PAE = 2.6.31.12-174.2.3.fc12
kernel-PAE(x86-32) = 2.6.31.12-174.2.3.fc12


So the openafs-kernel RPM wants kernel-i686 = 2.6.31.12-174.2.3.fc12.i686.PAE but the kernel package doesn’t provide this. It does provide both kernel-i686 = 2.6.31.12-174.2.3.fc12.PAE and kernel-PAE = 2.6.31.12-174.2.3.fc12 but neither of these is good enough.
I suppose I could hack the openafs spec file for now to get me going, and fix the problem properly later. Or maybe I’ve missed something in the openafs RPM rebuild instructions.
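
To make the mismatch easier to see, here is a small Python sketch that checks the requirement from the error message against some of the capabilities listed above. It is a simplified model, not how rpm itself works: real RPM compares epoch, version and release fields, whereas this sketch assumes plain string equality is close enough for this case. The function names are made up for the illustration.

```python
# Simplified model of an RPM "name = version" dependency check.
# Assumption: exact string equality stands in for RPM's real
# epoch/version/release comparison.

def parse_capability(text):
    """Split 'name = version' into (name, version); version may be None."""
    name, sep, version = text.partition(" = ")
    return name.strip(), (version.strip() if sep else None)

def find_provider(required, provides):
    """Return (satisfied, near_misses) for a single requirement."""
    req_name, req_version = parse_capability(required)
    near_misses = []
    for line in provides:
        name, version = parse_capability(line)
        if name != req_name:
            continue
        if req_version is None or version == req_version:
            return True, []
        near_misses.append(line)  # right name, wrong version string
    return False, near_misses

# The requirement and a few of the provides, copied from the output above.
required = "kernel-i686 = 2.6.31.12-174.2.3.fc12.i686.PAE"
provides = [
    "kernel = 2.6.31.12-174.2.3.fc12",
    "kernel-i686 = 2.6.31.12-174.2.3.fc12.PAE",
    "kernel-uname-r = 2.6.31.12-174.2.3.fc12.i686.PAE",
    "kernel-PAE = 2.6.31.12-174.2.3.fc12",
]

satisfied, near_misses = find_provider(required, provides)
print(satisfied)
print(near_misses)
```

The only near miss is the kernel-i686 capability, whose version string lacks the extra “.i686” that the rebuilt package asks for.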

Written by Chris Cooke

February 11, 2010 at 12:52 pm

## openafs kernel module

I’ve found the instructions…

Written by Chris Cooke

February 10, 2010 at 4:45 pm

## openafs kernel module error

Starting the openafs client shows an error:

# /etc/rc.d/init.d/openafs-client start
Updating CellServDB:
Starting openafs-client: WARNING: Deprecated config file /etc/modprobe.conf, all config files belong into /etc/modprobe.d/.
failed to load openafs kernel module.                  [FAILED]


There was no openafs-kernel package, only an openafs-kernel-source package, which I had installed. I suppose this means that my vague hope that it would automagically compile somehow has been dashed, then…

Written by Chris Cooke

February 10, 2010 at 4:26 pm

## aklog

The AFS wiki seems helpful – with its help I now at least have aklog.

Written by Chris Cooke

February 10, 2010 at 4:20 pm

## kerberos and LDAP

Kerberos is working – I copied Graham’s suggested /etc/krb5.conf.

LDAP lookups are also working:

• I added the openldap-clients package (to get ldapsearch)
• ignored /etc/ldap.conf
• threw away the supplied /etc/openldap/ldap.conf and replaced it with
URI ldap://dir.inf.ed.ac.uk
BASE dc=inf,dc=ed,dc=ac,dc=uk

• Typed ldapsearch -x to use simple authentication instead of SASL. Otherwise ldapsearch fails.

I’m now wondering:

• Do I need to make the SASL GSSAPI stuff work? I gather that in this context this is some way of connecting LDAP lookups to Kerberos but I’m hazy on the details.
• What do I need LDAP for anyway?
• How to tackle OpenAFS. The OpenAFS docs only go up to Fedora 10. By hacking URLs I converted a Fedora 10 OpenAFS download location into a Fedora 12 one and got some packages, but my attempts to install and configure them haven’t met with success. I haven’t even managed to install an aklog command – though there does seem to be a klog (which fails with Unable to authenticate to AFS because Authentication Server was unavailable). The openafs-info list seems to have some hints about what to do but I don’t currently know enough to understand how they fit in to the jigsaw. Perhaps there’s a wiki with useful info on it? I’ll try that next.

Written by Chris Cooke

February 10, 2010 at 3:18 pm

## Falling at the first hurdle

The Fedora 12 LCFG port starts not with a bang but with a whimper.

I’ve installed Fedora 12. That’s easy enough. But getting the machine onto the network, and getting Kerberos and LDAP working, is like wading through treacle. It’s fighting me every step of the way. This should be easy, shouldn’t it? These modern Linux installers are meant to be idiot-proof.

Once I eventually got the machine onto the correct subnet, with the correct IP address, with a network wire which definitely worked and did provide a working connection, I hit a DHCP problem – the machine wasn’t being served by the DHCP server. I’d forgotten to specify the machine’s MAC address in its stub LCFG profile (which contributes to the spanning map which drives the DHCP servers). Doh.

This accomplished, I reinstall again (I’ve already been through lots of reinstalls up to this point – the Fedora installer is getting to look very familiar) but continually hit a problem with the system config screens which come up immediately post installation. I enter my account details, a root password, network details, then configure Network Login. This lets me configure Kerberos and LDAP amongst other things. But with Network Login, however I try to configure LDAP I end up with a machine which just sits there dumbly when I try to log in to it.

After another reinstall I try configuring just Kerberos, leaving LDAP off, and this time I can at least log in. However a klist reveals nothing, I can’t ping any other machines, I’m not even on the network it seems. After some more experimentation it seems that I have to configure the network some more after login: System -> Administration -> Network, double click “eth0”, then enable “Activate device when computer starts”. I find it bizarre that this is not enabled by default, and more bizarre that the post-install screens don’t offer you a way to turn the network on. You’d think it would be kind of necessary to LDAP and Kerberos and the like to have a working network? Anyway, another reboot, and hey presto, I can log in and after doing so I see something meaningful from a klist. I can even ping other computers.

For my next trick I shall try to re-enable LDAP. (Watch me reduce a Dell 745 to an unresponsive heap of junk once more.)

Written by Chris Cooke

February 9, 2010 at 11:47 am

Today the media is full of Apple’s new iPad. It looks very pretty, but I must say I don’t quite see it setting the world on fire; it’s already been dubbed the why-Pad.

It’s an iPhone, except you can’t make phone calls. And it doesn’t have a camera. And it’s too big for your pocket. And you can’t hold it comfortably in one hand (or I couldn’t, I’d drop it – I’d need a stout handle to hold it with).

It’s an iPod Touch, except it’s too big for your pocket.

It’s a laptop, except it won’t stand up, not even on a table, you’ll have to hold it constantly. And it doesn’t multi-task. And it doesn’t do web pages with Flash or Java or the like. And you have to get your software from the Apple Store. And of course there’s no handy keyboard or mouse, just the iPod’s pop-up screen keyboard.

It’s an e-book reader without the main benefit of an e-book reader, which is the easy-on-the-eyes e-paper display. Oh and it doesn’t fit in your pocket.

It’s a portable DVD player that you’ll have to hold in your hands all the way through the film.

Any way you look at it it seems limited. I suppose it’ll find some sort of niche market, especially with the iBookstore and Apple’s software design and marketing power behind it, but I can’t see me buying one.

Written by Chris Cooke

January 28, 2010 at 10:17 am

## Projects and India

It’s change time for my development projects. The TiBS LCFG project is now complete, at least complete enough to be going on with; further developments will be tackled at some point in the Further Improvements to TiBS Component project. The next two weeks of my development time will be spent on the Server Hardware Interaction project then after that I’ll concentrate on the port of LCFG to Fedora 12. (These links are to our devproj project development site, for which you’ll need to be an authenticated School of Informatics user; if you don’t have an Informatics account you can make your own using iFriend.)

Server Hardware Interaction is a rag-bag of things which our servers could do with. First on the list is some sort of monitoring of the ambient temperature, so the machines can shut themselves down cleanly when it gets too hot (we still haven’t got the bugs fully out of our shiny new air conditioning plant). Currently the machines carry on running until each server reaches the pre-set critical point for its motherboard, at which point power is cut, which saves the hardware from harm but doesn’t do the data much good. A clean shutdown would be preferable. This’ll give us a safe fall-back procedure; we can then put cleverer stuff in front of that if we like, for example to shut down less important servers in a sacrificial manner when the temperature starts to rise too much. Next on the list is RAID monitoring – we need our Nagios monitoring system to alert us when a machine’s RAID disk has gone. Lower down the list is the issue of automatically (or otherwise) keeping the firmware and BIOS versions of our controllers and hardware in general up to date to help avoid problems.

Our servers are mostly Dells and OMSA is designed to deal with a lot of this stuff automatically, but our understanding is that it takes rather more control of the machine than we’re comfortable with; we’ll probably take a look at it and see if we can use bits of it, though.

Finally, it’s been a long time since my last entry here. That time was mostly spent in a long winter holiday in southern India. I’ve started putting photos up on the web from that holiday. You can see them at my photo page. At the time of writing that has pictures of Mysore, of rural Karnataka and Tamil Nadu from the train, and of rangoli or devotional decorations which are drawn on the ground every morning in front of houses.

Written by Chris Cooke

January 27, 2010 at 1:10 pm

## Slides from the UKUUG Perl courses

Dave Cross has shared the slides from the three UKUUG Perl courses he ran in November, including the Intermediate Perl one which I attended. They’re in OpenOffice and PDF formats on his site and in Flash on slideshare.net.

Written by Chris Cooke

December 7, 2009 at 10:14 am

## tibsconf

The lcfg-tibs component now has a twin sibling called lcfg-tibsconf. The latter will replace the former.

The reason is bizarre: when the LCFG tibs component stops, it tries to stop the TiBS software. TiBS software is stopped by calling a shell script called “stoptibs”. One of the things which “stoptibs” does is to “kill” every process on the system called tibs. Including my tibs LCFG component. Top marks for style! Thanks to Stephen Quinney for working out why my component was mysteriously disappearing instead of stopping. So, anyway, I’ve got round the problem for now by renaming lcfg-tibs to lcfg-tibsconf. The stoptibs script doesn’t currently try to kill anything called tibsconf – I’ve checked…

Written by Chris Cooke

December 3, 2009 at 5:50 pm

## UKUUG’s Intermediate Perl day

I’m currently travelling home from a one day “Intermediate Perl” training course. It was organised by the UKUUG and O’Reilly and written and presented by Dave Cross of Magnum Solutions.

He did a very good job. He’s very personable, and a good teacher. He got through a lot of material at what felt like a clear and lucid but also gentle and unhurried amble through the slides. He judged the technical level almost perfectly for me, introducing us to plenty of new concepts and areas of Perl to explore without bamboozling us. Altogether I’d give him and the course an enthusiastic thumbs up. He’s planning on running more Perl training days in March next year, and if you want to learn more (or anything) about Perl, go, you won’t regret it.

“Intermediate Perl” was the middle of three separate Perl days running back to back. The others were “Beginners” and “Advanced”. The “Intermediate” day covered these areas:

• Types of Variable – not scalar vs list, or string vs array vs hash, but lexical variables vs package variables. It also covered packages and use of “local”.
• Strict and Warnings – what use strict and use warnings do; why to use them; what can happen if you don’t.
• References – creating them, how to use them, why to use them; using them to pass parameters; using them to make complex data structures; some useful examples of complex data structures. The syntax of complex data structures in perl can be rather difficult, not to mention messy; here it was presented so clearly that it really didn’t seem too bad.
• Sorting – you thought you could just type sort and forget about the detail? You can in some circumstances, but sometimes you need to go more deeply into it, for instance to change the default sort order, or to invent your own sorting order for your own data structures (use sorting subroutines); how then to make your sorting more efficient, and further, how to chain a sequence of efficient sorting operations together into a Schwartzian Transform. The class more or less universally drew a sharp intake of breath when it saw the Schwartzian Transform, thinking it unnecessarily unclear, but the tutor defended it valiantly: don’t be afraid to use Perl in a way that goes beyond simple, clear baby steps; don’t be afraid to assume that whoever has to understand your code will have a decent knowledge of Perl. Personally I disagree with this one, having frequently been in the position of trying to understand/maintain/change code I didn’t understand in a language I didn’t know. I’m a firm believer in making everything as simple and clear and copiously annotated as possible, and I’ll happily prioritise that over elegance and compactness of code. Each to his own.
• Reusable Code – writing modules, why you might want to and how to go about it. The boilerplate code you have to put in your module file, what it all means and why you need it; exporting things from your module, different ways to export, when not to export at all.
• Object Oriented Perl – objects, classes, methods; what, when, why, how; Moose makes it all a lot easier.
• Testing – you can write simple tests for your Perl program to pass; this can hugely help your code development; when, why, how to write tests; how very easy it looks; how very useful it looks. This section of the course was a revelation.
• Dates and Times – All about the wonderful, the magnificent, the magical DateTime family of modules, which do every sort of time-related manipulation you could ever want to do, and which all fit together utterly seamlessly. I knew about DateTime already, having used it to do calculations on the dates and times of upcoming cron jobs for the LCFG sleep component.
• Templates – when, how and why to use them; how you can use the same data and different templates to produce radically different output – think for instance of generating web pages and their RSS feeds from the same source data, and maybe email messages too. Template Toolkit.
• Databases – various and increasingly niftier ways of interacting with databases using Perl. This was the only section that was a bit wasted on me, since (shameful admission) I know very little about such things as SQL and haven’t used a database in anger since Edquse or maybe Astrid more years ago than I’d care to admit. Still, if I need to learn, I now know where to turn.
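
For anyone who hasn’t met the Schwartzian Transform from the Sorting section, the idea is to compute each sort key once, sort on the cached keys, then throw the keys away. The course showed it in Perl; the sketch below renders the same decorate, sort, undecorate pattern in Python purely as an illustration. The key function is a made-up stand-in for something expensive.

```python
def basename_length(path):
    """Stand-in for an expensive key computation (hypothetical example)."""
    return len(path.rsplit("/", 1)[-1])

def schwartzian_sort(items, key_func):
    """Sort items by key_func, calling key_func exactly once per item."""
    decorated = [(key_func(item), item) for item in items]  # decorate
    decorated.sort(key=lambda pair: pair[0])                # sort on the cached key
    return [item for _key, item in decorated]               # undecorate

paths = ["/usr/sbin/sendmail", "/usr/bin/perl", "/bin/ls"]
by_length = schwartzian_sort(paths, basename_length)
print(by_length)
```

In everyday Python you would just write sorted(paths, key=basename_length), which also calls the key function once per item; the long-hand version is only there to show the three steps.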

All in all the Intermediate Perl course was a big success.

Written by Chris Cooke

November 25, 2009 at 9:21 pm

## LCFG Users Day talk

Yesterday was the 2009 LCFG Users Day afternoon session. The talks were all pretty interesting I thought; it was very encouraging to see how useful people are finding LCFG, and how its use has grown compared to last year. The developments at ACE seemed particularly impressive: LCFG has become a very useful and powerful Mac configuration management tool.

My own wee talk on the sleep component went OK (to my relief). Since I have it all written down anyway, here it is, more or less verbatim, for those that missed it.

I’m Chris Cooke and I’ve written a component called “sleep”.

I did it because we want to save the environment – that’s one of the
University’s corporate goals, more or less – and we also wanted to
save money off our electricity bill.

So, the sleep component.

The idea is that it runs on our desktop Linux computers.

When it runs it decides whether or not it would be appropriate for the
computer to sleep. If it would be appropriate, it sends the computer
to sleep.

However, just as importantly, before sending the computer to sleep,
the sleep component also decides a good time for the computer to wake
up again, and it sets a wake alarm which will wake the computer up
at that time.

So, when is it appropriate for a computer to sleep?

Well, cron jobs are one thing to look out for.

The component makes sure that a computer will be awake in time to run
every important cron job.

(And by the way it takes “important” to mean “every cron job except
the ones you’ve told it to ignore”.)

It also looks at the load average – if that’s higher than a level you
set in a resource, the computer will stay awake.

It also looks at the idle time of shells, at X sessions, and it also
runs any arbitrary command you tell it to, and takes the return value
of that command to be an approval or veto of sleep for the machine.

So, for instance, I realised quite late on that although I’d dealt
nicely with cron jobs, I’d totally forgotten “at” jobs, so I was able
to add on a call to an external command which has the effect of
vetoing sleep if there’s anything in the “at” queue.

I also gave the component other things to look out for:

You can set a minimum awake duration, that is, a minimum time between
sleeps.

You can also set a minimum and maximum duration for sleep.

So basically it runs this whole battery of tests, and if any of them
vetoes sleep, the machine stays awake.

Before sleeping, the component looks ahead to when the next important
cron job will run – or when the maximum sleep duration will be up, if
that comes sooner – and it sets a wake alarm which will wake the
computer up for that time.

And finally there are also resources which run things for you when the
machine is falling asleep and when it’s waking up again.

Like, we use one or two daemons which react badly to sleep, so we shut
them down before sleep, then start them up again when the machine
wakes.

So, this is all great, it’s shiny and wonderful, but there is some bad
news: I found out the hard way that the power management on our Dell
SelectPCs does not seem reliable with Scientific Linux; when the
machine tries to sleep you get crashes and freezes now and then.
In the end we gave up trying to use it with our Dells.

However it is perfectly reliable on the current SelectPC, the HP 7900,
and in fact we are using the sleep component on the 7900s in our
student labs across the road there.

So, the component is called lcfg-sleep, you’ll find it in
svn.lcfg.org, and if you look on wiki.lcfg.org you’ll find a page

That’s it. Thank you.
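
The decision logic described in the talk can be sketched roughly as follows. This is an illustration in Python, not the component’s actual code: the function names are invented, each test is modelled as a callable that returns True to approve sleep, and treating a too-short sleep as one more veto is my assumption about how the minimum sleep duration fits in.

```python
from datetime import datetime, timedelta

def vetoes(checks):
    """Return the names of all checks that veto sleep."""
    return [name for name, approves in checks if not approves()]

def choose_wake_time(now, next_cron_job, max_sleep):
    """Wake for the next important cron job, or when max_sleep is up if sooner."""
    latest = now + max_sleep
    if next_cron_job is None or next_cron_job > latest:
        return latest
    return next_cron_job

def decide(now, checks, next_cron_job, min_sleep, max_sleep):
    """Return (may_sleep, wake_time_or_None, reasons_for_staying_awake)."""
    reasons = vetoes(checks)
    wake = choose_wake_time(now, next_cron_job, max_sleep)
    if wake - now < min_sleep:  # assumption: a too-short sleep counts as a veto
        reasons.append("sleep would be shorter than the minimum")
    if reasons:
        return False, None, reasons
    return True, wake, []

now = datetime(2009, 11, 20, 18, 0)
next_job = datetime(2009, 11, 21, 2, 30)
quiet = [("load average", lambda: True), ("at queue", lambda: True)]
busy = [("load average", lambda: True), ("at queue", lambda: False)]

quiet_result = decide(now, quiet, next_job, timedelta(minutes=30), timedelta(hours=12))
busy_result = decide(now, busy, next_job, timedelta(minutes=30), timedelta(hours=12))
print(quiet_result)
print(busy_result)
```

Any single veto is enough to keep the machine awake, which is why adding the “at” queue test late on was just a matter of adding one more entry to the list.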

Written by Chris Cooke

November 21, 2009 at 10:29 am

## TiBS is now under LCFG control

Yesterday we deployed the lcfg-tibs component on our main TiBS backup server. Things seem to have gone smoothly; the software is now installed via RPM packages; the config files are now mostly generated from LCFG resources; and configuration changes are held back until TiBS is idle.

This is phase 1 of the LCFG TiBS component. Phase 2 will automatically generate the list of non-AFS backups from “someone please back me up” resources in the LCFG profiles of DICE machines. Some Nagios monitoring is also desirable! However more development may have to wait a while: other more urgent projects are elbowing their way ahead of this one in the development queue.

The LCFG TiBS component and its accompanying RPMs are not available for distribution outside the School of Informatics because TiBS is proprietary commercial software; but if you’ve also bought this software and you want to use the LCFG component to automate its configuration, let us know, maybe we can share the work.

Some docs:

One thing that did not go smoothly was my attempt to get the component to stop TiBS when the component stopped. TiBS is stopped with the command stoptibs which can be found on our backup server in /usr/tibs/bin. It’s a shell script. It’s short but I won’t post it here as it’s not freely redistributable. All of my attempts to call it, with backticks or system and/or eval or whatever other wacky way I came across on Google, result in the component terminating as soon as stoptibs has run, so the component doesn’t ever officially stop. So far I’m baffled as to what’s wrong here. Is this an elementary Perl boob on my part? A bug somewhere in LCFG?

Written by Chris Cooke

November 18, 2009 at 2:36 pm

## lcfg-tibs 1.1.0

lcfg-tibs 1.1.0 is now out. Not much change: it now makes the symlink /etc/tibs.conf when it configures on a TiBS server, with the symlink pointing to the tibs.conf configuration file.

Written by Chris Cooke

October 29, 2009 at 4:35 pm

I always end up tearing my hair out (see picture) at Bugzilla upgrade time, so since I have less hair to spare these days, here are a few notes for Future Me to read and profit by. So, in no particular order:

• Informatics Bugzilla and LCFG Bugzilla use the same lcfg-bugzilla component.
• Local prefs can be added to the LCFG template files in lcfg-bugzilla.
• Be careful what version of lcfg-mysql you use: it should be at least 1.1.14 as earlier versions don’t make new databases well enough.
• You may need to run checksetup.pl with the --make-admin option to give yourself Bugzilla admin privilege, which you’ll need for the post-install pointy-clicky configuration phase.

The general procedure for moving from an old bugzilla server to a new one goes something like this:

1. take down the old bugzilla.
2. take down the new bugzilla too if you have one up for testing.
3. change the dns.
4. “om dns update” on nautilus (x509 master), berlin & osprey (cosign masters).
5. change the lcfg files.
6. touch nautilus, berlin & osprey lcfg files to push/get new spanning maps.
7. do a mysql dump of the old bugzilla database. (Look at the cron job to get the exact command.)
8. make the new database. (Use at least lcfg-mysql-1.1.14 as earlier versions fail to make new databases properly. Stop lcfg-mysql, delete the contents of /var/lib/mysql then start lcfg-mysql.)
9. load the dump into the new database. (Something like “mysql < dumpfile” or “mysql -p$(cat /var/lib/mysql/rootpwd) < dumpfile”.)
10. edit the $db_pass variable in the new server’s /etc/bugzilla/localconfig so that it has the value it had in the same file on the old server.
11. stop then start the mysql component.
12. run checksetup.pl to convert the database.
13. om bugzilla start

Written by Chris Cooke

October 23, 2009 at 11:45 am

## A simpler TiBS configure

The configure idea in my last post could be made simpler.

There’s no need for a two stage configure process at all. All we need is for the configure method to be called regularly. (I’m assuming that lcfg locking will automatically prevent simultaneous calls but I’d better check. The documentation says that locking is automatic for “certain methods”.)

When the configure method is called, it first cooks up some new config files from its current resources and diffs those with the existing config files – in other words it finds out whether it needs to do anything. If not, it exits.
If config files do need to be changed, it can then look to see if TiBS is currently busy (tibstat) – if it is, it exits. If not, it stops TiBS, changes the files then starts TiBS again as discussed previously.

The two checks could equally well happen the other way round; whichever is quicker.

We could then call the configure method from cron, say every few minutes between 9am and 9pm.

Much simpler than the previous idea; no need for custom methods or for passing information from one run of the component to another.
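
As a rough sketch of the single-stage idea, here is how one pass of the configure method might look. It is written in Python for illustration only and is not the component’s code: config files are modelled as a dictionary of name to contents, and the three callables are hypothetical stand-ins for tibstat, stoptibs and runtibs.

```python
def configure(new_files, live_files, tibs_is_busy, stop_tibs, start_tibs):
    """One pass of the single-stage configure idea.

    new_files and live_files map a config file name to its contents;
    the three callables stand in for tibstat, stoptibs and runtibs.
    """
    changed = {name: text for name, text in new_files.items()
               if live_files.get(name) != text}
    if not changed:
        return "nothing to do"
    if tibs_is_busy():
        return "busy, try again later"
    stop_tibs()
    try:
        live_files.update(changed)  # stands in for installing the new files
    finally:
        start_tibs()                # restart TiBS even if installing failed
    return "updated"

calls = []
live = {"tibs.conf": "old setting"}
wanted = {"tibs.conf": "new setting"}

def stop():
    calls.append("stoptibs")

def start():
    calls.append("runtibs")

while_busy = configure(wanted, live, lambda: True, stop, start)
when_idle = configure(wanted, live, lambda: False, stop, start)
afterwards = configure(wanted, live, lambda: False, stop, start)
print(while_busy, when_idle, afterwards, calls)
```

Because each pass works everything out afresh from the current resources, a run that finds TiBS busy can simply give up and leave the job to the next cron-driven call.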

Written by Chris Cooke

September 25, 2009 at 3:18 pm

## Why we can’t deploy lcfg-tibs in its current state

lcfg-tibs in its current state will manage most TiBS config
files. It’ll manage all the files that are normally managed by
editing. There are other bits of config it doesn’t handle – those bits
managed by running TiBS commands like hostadd and hostdel.

The contents of the config files are managed with LCFG resources. When
a tibs LCFG resource changes, the tibs component will get a configure
call and it will change the relevant config file.

This is the problem with the current lcfg-tibs: it changes the config
file as soon as it sees the resource value change. This is bad because
TiBS config files should be changed only when there isn’t a backup
running – in fact for the config files you change by editing, preferably
when TiBS is completely stopped, too.

There’s another problem with the current lcfg-tibs: it doesn’t start
TiBS when it starts and doesn’t stop TiBS when it stops. The TiBS
software has to be started and stopped manually. This is different
from normal LCFG procedure and not really advisable.

# What we can do to fix it

At the moment when lcfg-tibs gets a configure call it makes new versions of the
config files without installing them into the right place. It then
diffs them with the in-service config files, and if there’s a
difference then it replaces the in-service file with the new version.

To ensure that config files are only changed when TiBS isn’t running a
backup we can make this a bit more sophisticated.

We’ll need a two stage process. This is the first stage:

• lcfg-tibs gets a configure call
• it makes new versions of the config files.
• diff these new versions with the in-service config files.
• if there’s a config file change, raise a flag to say so.
• make and save somewhere a list of which config files have to change.

That’s the end of the first stage of the process.

The second stage of the process runs independently of the first. It
can be kicked off either by a human or from a regular cron job
(running say every few minutes from 9am to 9pm), but either way the
component runs with a custom method, call it changeconfig.

When changeconfig runs:

• it looks to see if there’s a flag raised to say that there’s a
config file change (or more than one, the number of changes doesn’t
matter).
• if there isn’t it exits.
• There’s a config change pending. Now the component checks to see
whether TiBS is currently running a backup or not. It can do that with
tibstat.
• If there’s a backup running, the component exits from the
changeconfig run. (Another changeconfig will be along in a
few minutes and that one might have better luck.)
• If TiBS is quiescent, we stoptibs. The TiBS manual says that this
is preferable but not necessary when changing config files, but in
this case we’re doing it to get a lock on TiBS, to prevent a backup from starting while we’re
changing the config files.
• make the config file changes.
• Start TiBS again with runtibs. If a backup has attempted to start
in the meantime while TiBS has been stopped, it will automagically
start after a runtibs.
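The two stages above can be sketched roughly as follows. This is Python for illustration only, not the component's real code: the JSON "pending" file standing in for the flag and the list of changed files, and the callables standing in for tibstat, stoptibs and runtibs, are all assumptions.

```python
import json
from pathlib import Path
from typing import Callable, Dict


def stage_one(new_configs: Dict[str, str], live_dir: Path, pending_file: Path) -> None:
    """Configure call: note which new config files differ from the in-service ones."""
    changed = {}
    for name, content in new_configs.items():
        live = live_dir / name
        if not live.exists() or live.read_text() != content:
            changed[name] = content
    if changed:
        # The pending file doubles as the flag and the list of files to change.
        pending_file.write_text(json.dumps(changed))
    elif pending_file.exists():
        pending_file.unlink()


def changeconfig(live_dir: Path, pending_file: Path,
                 backup_running: Callable[[], bool],
                 stop_tibs: Callable[[], None],
                 start_tibs: Callable[[], None]) -> str:
    """Second stage: install pending changes only when no backup is running."""
    if not pending_file.exists():
        return "no change pending"
    if backup_running():
        return "backup running"      # another run will be along in a few minutes
    changed = json.loads(pending_file.read_text())
    stop_tibs()                      # acts as the lock against a backup starting
    try:
        for name, content in changed.items():
            (live_dir / name).write_text(content)
        pending_file.unlink()
    finally:
        start_tibs()
    return "changed"
```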

### However this raises another question

Giving the component the power to stoptibs and runtibs raises the
question, why don’t we get the component to runtibs when the
component starts, and stoptibs when the component stops?

There doesn’t seem to be a reason not to do this and it does seem to
be the intuitive behaviour.

### When and how to hostadd and hostdel

The maintenance of non-AFS backup clients is done with hostadd and
hostdel commands. A later version of lcfg-tibs will do this for you.
For now the human has to do it. hostadd and hostdel have to be run
when TiBS is quiescent but running. They cannot run once TiBS has been
stopped with stoptibs.

In an environment where the component stops and starts TiBS by itself,
when will it be safe to run hostadd or hostdel?

1. Check with tibstat that there isn’t a backup running.
2. Do it during the day when a backup isn’t likely to start.
3. To ensure safety you could possibly run om client stop to prevent new
tibs resource values from getting to the tibs component; then do your
resource changes; then run om client start.

A better form of locking would be desirable though; we don’t want
to run the risk of someone forgetting to restart the client component
after having done a backup client change. This needs more thought.

Written by Chris Cooke

September 23, 2009 at 9:23 am

Posted in Uncategorized


## lcfg-sleep gets deployed! *shock*

lcfg-sleep has finally been deployed! I’ve put it on all our Student lab machines. It’ll only actually do anything on the new HP 7900s of course. There are several dozen of them in our student labs, so the sleep component will soon begin to earn its keep at last. There’s a big autoreboot due to happen tonight to install a new kernel, and the sleep component should start working after that reboot.

I realised this morning that the component still had an alarming flaw related to autoreboot: although it wakes the machine automatically when it’s time to run the autoreboot component (as it’s kicked off by a cron job), and although it prevents the machine from sleeping once the autoreboot component has started a shutdown command to reboot the machine (I added a sleep test for this), I had forgotten the middle step: autoreboot kicks off the shutdown command from an at job, and I had neglected to tell sleep to refrain from suspending the machine if there was anything in the “at” queue. I’ve never really used “at”, so I don’t tend to think of it much, but the “at” queue is so fundamental that I should really go back and make the sleep component examine the queue properly when assessing a machine’s sleep-worthiness. For the moment I’ve put in a stop-gap extra sleep test which simply vetoes sleep when there’s anything in the “at” queue.
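A stop-gap test of that kind could be as simple as the sketch below. It is written in Python for illustration and is not the component's actual sleep test; the function names are made up. It relies only on the standard atq command, which prints one line per queued job and nothing when the queue is empty.

```python
import subprocess


def at_queue_is_empty(atq_output: str) -> bool:
    """True if the output of atq lists no jobs at all."""
    return not any(line.strip() for line in atq_output.splitlines())


def sleep_allowed() -> bool:
    """Veto sleep while anything at all is waiting in the at queue."""
    # Run as root so that atq lists every user's jobs, not just our own.
    result = subprocess.run(["atq"], capture_output=True, text=True, check=False)
    return at_queue_is_empty(result.stdout)
```

A more careful version would look at when each job is due rather than vetoing sleep for any queued job, however distant.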

Written by Chris Cooke

September 11, 2009 at 4:52 pm

Posted in Uncategorized


## lcfg-tibs version 1

lcfg-tibs has gone from “development only” to version 1. Version 1.0.1 in fact. This is to mark its upgrade to supporting TiBS 2402, the version we use on our backup server. It now seems to generate a load of configuration files identical to the hand-maintained ones currently on the backup server. Except for comments, anyway. So, this version is ready for deployment on the backup server. Pleasantly enough I finished this the day before I’d said I would on the project plan. Hooray!

Next: start on version 2, which will have a list of non-AFS backup clients and will compare this with TiBS’ own list and will order TiBS to add or remove backup clients appropriately to match this list.

Written by Chris Cooke

September 11, 2009 at 4:39 pm

Posted in Uncategorized


## Sleep: HP in, Dell out

While I was away on a week’s holiday I left a small lab full of Dell GX745 machines testing lcfg-sleep. When I came back I found that half of them had frozen at suspend-time. This has been happening on and off for months and really I’m no nearer to curing the problem. By contrast, the few new HP dc7900s that have been running lcfg-sleep have been as good as gold, suspending and resuming several times a day quite happily, waking up perfectly well both automatically and on a press of the power button – no problems there.

Since effort is short and I’ve been bashing my head against the Dell problems for far too long now, and I’m utterly sick of it, I’ve now disabled sleep on the final two remaining Dell models it was operating on, the GX745 and the 755. The only supported model is therefore the HP dc7900. Happily this is the PC du jour and we’re deploying a lot of them in the student labs right now, so if things go well in my wee HP sleep test lab we can fairly rapidly spread lcfg-sleep to the rest of the student lab HPs. A few months after that, if things seem as reliable as I hope they will, we could perhaps think of spreading lcfg-sleep to HPs in offices too. After all the original point of the project was to cut the Forum’s electricity bill, and the Forum has no student labs.

This isn’t necessarily the end for the hopes of automatic sleep on the Dells. I may well go back and test sleep on a Dell from time to time, or even fix the problem if inspiration strikes, but I won’t spend much more effort on it at the moment: there are more important things to be getting on with.

Written by Chris Cooke

August 26, 2009 at 9:11 am

Posted in Uncategorized


## lcfg-sleep 0.7.2

lcfg-sleep got its first new version in 3 months this morning. Just one wee change: it now touches its waketime file – the one it uses to know when it woke up, so it can figure out if it’s been awake for long enough yet – after running other resume-time commands, rather than before. I’ve made the change because while I was on holiday a few machines picked up timestamps on their waketime files which were a couple of weeks in the future; one machine’s waketime file even managed to leap twenty years into the future. I’m hoping that touching the file after restarting ntp, rather than before, will mean that the file picks up a rather more sensible timestamp. The effect of the future timestamp wasn’t catastrophic – it just prevented the machine from sleeping, because the component simply reckoned that the machine had been awake for a negative amount of time, which it decided was less than the minimum acceptable time period for being awake. If the problem occurs again I’ll add a test for negative numbers in there somewhere.
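The possible test for negative numbers might look something like this sketch. It is Python for illustration, not the component's code, and the decision to ignore a future timestamp (rather than, say, resetting the waketime file) is an assumption of mine.

```python
def awake_long_enough(now: float, waketime: float, min_awake: float) -> bool:
    """Has the machine been awake for at least min_awake seconds?

    now and waketime are Unix timestamps; waketime is the modification
    time of the waketime file.
    """
    awake_for = now - waketime
    if awake_for < 0:
        # The waketime file is stamped in the future, so the clock was wrong
        # when the file was touched. Assumption: treat the figure as
        # meaningless and don't let it block sleep.
        return True
    return awake_for >= min_awake
```

Without the negative check, a waketime file dated twenty years ahead keeps awake_for below the minimum until the real clock catches up, which is exactly the "never sleeps" behaviour described above.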

Written by Chris Cooke

August 25, 2009 at 9:59 am

Posted in Uncategorized


## TiBS development meeting

Now I’m back from my holiday I’d better try to decipher my scribbled notes from the TiBS meeting we had just before I left:

Linux and Solaris group files: everything, including defaults, is managed using “hostadd” (and “hostdel”).

The “vicep” group is managed by hand, so I’ll make the component generate/manage this file.

The list of TiBS configuration files is therefore now finished! **confetti and party squeaks**

Next:

• We’ll get the 2402 tarball from Teradactyl.
• Then I’ll adapt the TiBS RPMs and component for it. This will probably be fiddly but not difficult.
• Also I’ll split the TiBS headers somehow to make both stable and development versions – possibly just with a #define.
• Once all that’s done we’ll be able to deploy the component on the server. It will take over the management of hand-edited config files, and the software will be owned by RPMs. Other aspects of TiBS will be managed as at present.

Note 1: one thing I didn’t think to mention is licences – I’d better do a TiBS licence RPM. This should be easy enough and could be installed independently of anything else.
Note 2: when installing the TiBS RPMs we’d better be ultra-careful about preserving the various TiBS state files!

After making the headers support both development and stable use I’ll be able to continue development on the component.

• First up is getting it to maintain the group files – a.k.a. the lists of non-AFS backups – getting the component to take over the “hostadd”/”hostdel” duties. I was wondering about the possibility or advisability of cutting corners with the comparison of TiBS LCFG resources and the current state of TiBS, specifically for specifying non-AFS backups. We reckoned that there was no reasonable alternative to doing a thorough and clean job of looking at TiBS’ group files and figuring out what’s actually in TiBS’ current list of non-AFS backups and comparing them with what the LCFG resources say should be in the list.
• Next, we’ll introduce a spanning map and a client backup component of some kind. The backup component will inherit details of local partitions from the fstab component and feed information into a spanning map which the TiBS server will subscribe to and use to build its group files.
• For safety’s sake the backup component should default to backing up all [local] partitions [with a filesystem!], on the grounds that “opt out” is much safer backup practice than “opt in”.
• The backup component will need to specify a type of backup for each partition – perhaps by choosing a keyword from a definable list of them.
• The partitions and backup type keywords will be published to a spanning map.
• The TiBS server will take information from the spanning map and translate the backup type keywords into TiBS settings.
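The "thorough and clean" comparison in the first item of the list above boils down to a set difference between what the LCFG resources ask for and what TiBS currently has. A minimal Python sketch, with made-up host names and a made-up function name, and leaving out the real work of parsing TiBS' group files:

```python
from typing import Iterable, List, Tuple


def plan_client_changes(wanted: Iterable[str],
                        current: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (clients to hostadd, clients to hostdel)."""
    wanted_set, current_set = set(wanted), set(current)
    to_add = sorted(wanted_set - current_set)      # in LCFG, not yet in TiBS
    to_remove = sorted(current_set - wanted_set)   # in TiBS, no longer in LCFG
    return to_add, to_remove
```

For example, `plan_client_changes(["alpha", "beta"], ["beta", "gamma"])` gives `(["alpha"], ["gamma"])`: add alpha, delete gamma, leave beta alone.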

Written by Chris Cooke

August 25, 2009 at 8:52 am

Posted in Uncategorized


## no sleep for 755s

Just a quick note. Following the discovery of changes needed to support sleep on 745s, this morning I tested some 755s with various sleep quirks. With the i810 driver, which is what the machines are currently using, no single sleep quirk does the job. The best I can do is get the machine to resume with a totally dead screen. At some point perhaps I should test them to see if the intel driver suffers from the same nastiness as on 745s. In the meantime no sleep for 755s.

Written by Chris Cooke

August 14, 2009 at 11:57 am

Posted in Uncategorized


Less than 24 hours after I enabled lcfg-sleep on six Dell 745s, three of them have given up the ghost. Not permanently I suppose (I haven’t physically checked them yet) but they’ve certainly failed to wake up: all three went to sleep just after 1pm yesterday and have failed to communicate with the profile server since then. The other three machines seem fine – they woke up yesterday evening and downloaded new profiles, then last night, then most recently this morning between 8am and 9am.

Also, my own test 745, which went through an accelerated test of hundreds of sleep cycles thanks to my rather cruelly giving it a maximum sleep period of three minutes, yesterday hung rather than waking up. After I’d manually rebooted it and it had slept perfectly happily a number of times more, it then repeated the trick last night shortly after 8pm.

So, to do:

• boot the benighted machines single-user and save copies of whatever they managed to log before freezing
• remove lcfg-sleep from the lab machines (edit: I’m leaving it to run for longer. Nobody uses those machines anyway.)
• rethink the supported models (currently Dell 745, 755 and new HP 7900).

On the last point I’m now wondering whether Dells are just too crappy and unreliable to be trusted with power management at all? The HP has behaved flawlessly – but then, it hasn’t gone through as many sleep cycles as my own test Dell 745, so who knows how it’ll work out. IS seem to manage, but not with Linux, they have to boot the machines into Windows to sleep them – and Windows has traditionally had its hardware support designed specifically around the shortcomings and unreliability of whatever’s provided by the hardware vendors.

I think I’m going to have to arrange a mass HP 7900 sleep test somehow.
There’s still the possibility of getting 755s to behave reliably, and I’ll work on that, but it seems unlikely to succeed.

Written by Chris Cooke

August 13, 2009 at 9:11 am

Posted in Uncategorized


## Sleep is live!

The sleep component has gone live! I’m very excited. To start with it’s being tested in one very small student lab. Six machines are now fast asleep, and all being well they’ll wake up in a few hours and be fresh and ready for use. I’d better check tomorrow to see what sort of state they’re in. I’ve written a page of documentation for the machines’ users. I still need to add some stuff to the Support FAQ.

Written by Chris Cooke

August 11, 2009 at 5:30 pm

Posted in Uncategorized


## GX745 gdm hangs: it’s the intel driver

We’ve found out, to some extent anyway, why our Dell GX745s have been freezing sometimes when gdm starts the login screen, but mysteriously only the ones using the DICE develop release; those using the stable release have been unaffected. It has to do with the X video driver: Stephen noticed that lcfg/hw/dell_optiplex_gx745.h still contained:

#ifndef LCFG_RELEASE_DEVELOP
#include <lcfg/options/video_i810.h>
#endif


The machines on the develop release were using the intel driver, and others were using i810. It seems that this intel driver is bad news on 745s, at least with DVI cables anyway. The develop 745s I know about are all on DVI cables because of a previous problem with VGA cables – and they were on the intel driver because of a previous problem with the i810 driver.

This would leave us in a bit of a fix, but a retest of sleep on the GX745 has revealed that

• with i810 we now need a completely different sleep quirk from before, and
• so far it seems to work perfectly.

I’ve therefore changed the sleep defaults for the GX745 to make it sleep only if the i810 driver is in use and to use the new quirk.

I’d better also retest at least a 755 and a 7900, and possibly also explore how the intel driver on a 745 reacts to VGA cables.

Written by Chris Cooke

August 10, 2009 at 7:17 pm

Posted in Uncategorized


## lcfg-tibs’ crash diet

The code of the tibs component is now 29% of its former size, and it’s a whole lot healthier looking too. Instead of having a separate function to make each of a dozen (and counting) configuration files, we now just reuse the one function a dozen times. Which is what functions are for, after all. I’d been meaning to do this for a while. Feeling pleased with self. The function also now only replaces an existing config file where it needs to – it does a diff of the new and (any) existing config files to see if anything’s changed – and it produces more readable debug output too.
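The "only replace where it needs to" idea is small enough to sketch. This is a Python illustration rather than the component's own code, and the function name is made up:

```python
from pathlib import Path


def install_if_changed(path: Path, new_content: str) -> bool:
    """Write new_content to path only if it differs; return True if we wrote."""
    if path.exists() and path.read_text() == new_content:
        return False          # identical: leave the in-service file untouched
    path.write_text(new_content)
    return True
```

One function like this, called once per config file, replaces a dozen near-identical ones, and an unchanged file keeps its old timestamp.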

Written by Chris Cooke

August 7, 2009 at 2:16 pm

Posted in Uncategorized


## autoreboot and sleep

Just a quick note: as envisaged in my last post I’ve rejigged the autoreboot and sleep support. Extra sleep settings for machines with both autoreboot and sleep are now in a new autoreboot-and-sleep.h header. At the moment that just contains an extra sleep test which will veto sleep whenever root is running an instance of whatever command is mentioned in the autoreboot.shutdown_command resource.
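A test along those lines might be sketched like this in Python. The function names are made up, and the real test in the header may well work differently; the process list is assumed to come from something like `ps -eo user=,args=`, which prints the owner and command line of each process without a header.

```python
import os
from typing import List, Tuple


def parse_ps(output: str) -> List[Tuple[str, str]]:
    """Split 'user command-line' rows into (user, command line) pairs."""
    processes = []
    for line in output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            processes.append((parts[0], parts[1]))
    return processes


def root_is_running(shutdown_command: str, processes: List[Tuple[str, str]]) -> bool:
    """True if root owns a process whose program matches the shutdown command."""
    words = shutdown_command.split()
    if not words:
        return False
    wanted = os.path.basename(words[0])   # compare program names, ignoring paths
    for user, args in processes:
        if user == "root" and os.path.basename(args.split()[0]) == wanted:
            return True
    return False
```

Sleep would be vetoed whenever `root_is_running` returns True for the value of the autoreboot.shutdown_command resource.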

Written by Chris Cooke

August 6, 2009 at 9:19 am

Posted in Uncategorized


## some sleep integration developments


It’s been a while since I did anything with the power management for DICE desktops (aka sleep) project, so here goes. I need to get it installed in a student lab to see what happens. Before I can do that the sleep component needs to be integrated safely with several things:

• exam lockdown: now done, exam lockdown disables sleep.
• condor: now done, we have a new condor_and_sleep.h header which tweaks sleep resources to reduce the maximum sleep time and also stop condor at suspend time and start it again at resume time. The latter is necessary because Condor detects imminent sleep and uses some non-standard mechanism to react to it, and doesn’t finish reacting to sleep until the machine has slept and woken up again, which is a bit useless. The header is (or will be) included from both condor and sleep headers in such a way that its contents won’t be multiply included but it won’t matter which order the condor and sleep headers appear in. Which I’m quite proud of.
• autoreboot: sleep now won’t sleep when shutdown is running. I probably need to rejig and complicate this along the lines of condor above though so that I can test for whatever the autoreboot component is currently using as its shutdown command, rather than just assuming that it’s still shutdown.

Written by Chris Cooke

August 4, 2009 at 5:26 pm

Posted in Uncategorized


## NB when moving a web service

This morning I moved a web service from machine A to machine B, leaving some other web services still running on machine A. It’s worth noting the following for next time:

When changing the address which the web service’s DNS alias points to, do make sure that this change goes to every machine concerned. This includes:

• machine A
• machine B
• the nagios server(s)
• the cosign server(s)
• the certificate-issuing server(s)
• the LCFG servers, master and slaves

Having done that, look out for any services using spanning maps – they may well not generate a new spanning map as they ought to unless you provoke it by tweaking the LCFG file of the relevant server.

Even after getting past all that we’re still left with the discovery of some unhelpful behaviour in our apacheconf nagios tests – where multiple https services are on one machine the tests just test for a response from whatever the machine reckons is its “main” https service, rather than from the particular service which the test ought to be interested in. In today’s case this “main” service is the one that’s moved to another machine, leaving the nagios monitoring of machine A looking rather unhappy, though the services themselves all seem to be OK.

Written by Chris Cooke

July 31, 2009 at 3:54 pm

## HP dc7900 now OK for sleep

TiBS LCFG development has temporarily stopped while I take care of some operational matters:

• moving the DIY DICE service to a dedicated machine to simplify admin – and configuring and installing the machine
• decommissioning the fc5 build host (not that this took a great deal of time)
• tidying up a pile of machines that were lurking under my desk
• looking into ways of making the office habitable
• satisfying various bureaucracies
• Investigating a bizarre case of intermittent narcolepsy in my main desktop. I don’t think it’s LCFG-related; I think I’ve misconfigured Gnome power management somehow.

I’ve also set up a new HP dc7900 and tested it for compatibility with lcfg-sleep. Good news: it’s now supported, or it will be when today’s changes hit the stable release. For the record it appeared to suspend and resume happily when no quirks were used, but a subsequent gnome login would pop up a “something went wrong with your resume” warning, and on logout the whole machine would freeze solid – lovely. Thankfully this behaviour goes away and the machine seems as good as gold when slept with the VBE Post sleep quirk, so that’s what LCFG will now do.

Written by Chris Cooke

July 30, 2009 at 5:32 pm

Posted in Uncategorized


## Four more TiBS config files now supported

lcfg-tibs now makes a bumper bundle of config files: the latest that can be made on TiBS servers are labels.txt, ThisFull.txt, ThisDaily.txt and the AFS group’s omit file.

This completes the harvest of the low-hanging config file fruit; the remaining config files will have to be tackled with something more complex than LCFG::Template::Substitute. Here’s the updated list of TiBS config files showing what’s been done and what’s still to be tackled.

Judging by a quick shufti round the LCFG server, this component has smashed the previous record for the number of resources used by a single component: from 44 (ffox) to 120 or so (tibs). This could with some difficulty be looked on proudly as a worthy achievement.

Written by Chris Cooke

July 27, 2009 at 4:30 pm

Posted in Uncategorized


## more TiBS config files now supported

lcfg-tibs can now also maintain groups.txt and afsgroups.txt TiBS config files. This feels like slow progress, but at least I am working my way through the list.

Written by Chris Cooke

July 24, 2009 at 3:59 pm

Posted in Uncategorized


## TiBS config files

I’m still trying to finalise the list of which TiBS configuration files are currently hand-maintained, which are automatically generated, and which we want to have maintained (somehow or other) via LCFG. Here’s the latest effort.

Written by Chris Cooke

July 21, 2009 at 4:41 pm

Posted in Uncategorized


## TiBS server/client and tibs.ini

After an afternoon of furious hacking lcfg-tibs now draws a distinction between (TiBS) servers and clients, and for clients it now generates a tibs.ini file, for which the settings can be changed using LCFG resources. By default it creates exactly the same tibs.ini file as the install.sh script does.

Written by Chris Cooke

July 17, 2009 at 5:17 pm

Posted in Uncategorized


## tibs RPM rejig

I’ve replaced the tibs RPM with the tibs-server RPM. The former put files in /opt from where it was intended that the tibs LCFG component would install them into /usr/tibs. However it was pointed out to me that it would be helpful for the “production” TiBS files installed by the RPM to be “owned” by the RPM, so that for instance one could use rpm -qf to find out which files belonged to which RPM. Good point, so that’s how it’s now done. The tibs component needed to be rejigged slightly to match: the configuration files which it rebuilds have been excluded from the %files list of the tibs-server RPM, so the RPM doesn’t install dummy copies any more, and it therefore no longer makes sense for the component to only regenerate files where they already exist. It now just goes ahead and generates each of the required configuration files – the ones I’ve done so far anyway.

Oh, and now that we’re going to need separate RPMs for server and clients, it no longer makes sense for a profile to include tibs.h directly, so some CPP directives now throw an error unless tibs-server.h or tibs-client.h has been included.

Written by Chris Cooke

July 15, 2009 at 5:21 pm

Posted in Uncategorized


## TiBS type checking

Today I made the type-checking stricter for those lcfg-tibs resources which configure settings in tibs.conf – so that for instance tibs.tibsatliinitialize will only accept a value of 0 or 1.
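To show the sort of check meant here, this is a small Python sketch. It is not LCFG's own type-checking syntax, and the function name is made up; only the resource name and the 0-or-1 rule come from the example above.

```python
def check_flag_resource(name: str, value: str) -> str:
    """Accept only "0" or "1" for an on/off resource; reject anything else."""
    if value not in ("0", "1"):
        raise ValueError(f"{name}: expected 0 or 1, got {value!r}")
    return value
```

So `check_flag_resource("tibs.tibsatliinitialize", "1")` passes the value through, while a value such as "yes" is rejected before it can reach tibs.conf.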

But that didn’t take long; most of the day was spent crawling back and forth through the TiBS Manual trying to get to grips with TiBS groups, media pools, classes, and how they all relate to each other, how they’re all maintained and in what ways they all change when AFS comes into the equation; and how our own TiBS groups, media pools, classes (etc.) are configured.

Written by Chris Cooke

July 13, 2009 at 6:18 pm

Posted in Uncategorized


## TiBS progress

I’ve been working on an LCFG component to control our use of TiBS. This is a quick note of how I’ve been getting on.

The idea is that the component will automate the configuration and running of TiBS as far as possible. One aim is the standard LCFG one of making it possible to, within reason, throw the main backup server off the roof of the building, then substitute a new machine and have LCFG configure it up to the state of its predecessor. Another is to automate the day to day running of TiBS to enable us humans to get on with something more productive than nannying the backup system all day.

As I see it there are three main things that the TiBS LCFG component needs to do:

• control the initial full installation of the software.
• control the software’s day to day configuration files post-installation
• run TiBS commands for the system administrator – some of the commands have ludicrous numbers of options and we can surely make life simpler here, either with further automation or by at least providing a component method which supplies most of the options itself.

The software is installed using a vendor supplied shell script. Although this uses a template configuration file and substitutes a couple of important values into it to form the real configuration file, which sounds promising from an LCFG point of view, it doesn’t follow through on this method very much, preferring instead to just dump the resulting file in place then instruct the system admin to edit it appropriately for the site. I want the whole configuration file to be configurable via LCFG, so what to do here? The first approach was a long delicate detour around the TiBS install script: I made a template for the template, which my component then substituted LCFG resources into to make the template file for the install process, which then produced the real file. For post-installation resource changes the component painstakingly mirrored the install script’s own actions: substituting in the same values and adding the same extra lines which were constructed in the same way.

I implemented this, and it all worked. Except – it then occurred to me that my reasons for doing this rather than taking the other approach I’d considered (using the standard LCFG resource substitution method to transform my own template into my own version of the configuration file, which would completely replace the file generated by the TiBS install process) weren’t really valid after all. I had decided that closely mirroring the install script would produce results which were more likely to be exactly like those produced by the install script – so we wouldn’t have problems caused by odd configuration file entries. I also thought it was best to let the software’s own install script do as much of the work as possible. But once I’d finished my code which accomplished this, and looked at the several screensful, I realised two things: firstly the LCFG template approach is perfectly valid if you just take the time to get the template right in the first place. Secondly almost all the code I’d written and got working could be thrown away and replaced by one call to LCFG::Template::Substitute and a few debug messages. Much simpler! And importantly, far more maintainable for whoever might take this software over from me in the future.

So that’s the path I eventually took: I threw away my carefully crafted eccentric code and let the LCFG::Template perl module take the strain. This seems to have been a success: I haven’t yet found anything wrong with the resulting config file. It also arguably has another advantage: when making the template config file I took the opportunity to strip out all the helpful comments, so that future sys admins will be less tempted to edit LCFG-controlled config files with a text editor rather than by changing LCFG resources! Don’t worry, the helpful comments are all now in the man/pod file.
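For readers who haven't met the template approach, here is the general idea sketched with Python's standard string.Template. This is an analogue only: it is not the LCFG::Template Perl module or its template syntax, and the config keys and values are invented for the example rather than taken from any real TiBS file.

```python
from string import Template
from typing import Mapping

# A template for the whole config file, with one placeholder per resource.
# The keys below are made up; they are not real TiBS settings.
CONFIG_TEMPLATE = Template(
    "server = $server\n"
    "port = $port\n"
)


def render_config(resources: Mapping[str, str]) -> str:
    """Fill every placeholder from the resources; a missing one raises KeyError."""
    return CONFIG_TEMPLATE.substitute(resources)
```

The whole file is regenerated from the template each time, so the only code to maintain is the one substitution call plus the template itself.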

Having conquered the main TiBS configuration file I’ve been picking off the other much smaller ones one by one in the same way, and I should soon have them all under LCFG control, although not yet the versions on our actual TiBS server. After that, automatically running the install script should be easy enough – just supply a few options and a bit of input to it then run Configure afterwards to remake the configuration files as above.

We also want to have the component deal correctly with (non-AFS) Linux TiBS clients, as well as with the server. We don’t have any such clients yet – unless you count the main server, which is a client of itself – so I’ll be setting one up then figuring out what to get the component to do to reproduce the setup. Since by this time the component will be able to both automatically run the install process and make configuration files I’m not expecting that to be too hard. Then after that I expect to work on the method(s) which will provide a simplified interface to any TiBS commands which can’t be automated.

And then, I imagine, we’ll stand back and reassess.

Lastly, one thing I’m not clear about at the moment is when to introduce the component to the actual TiBS server and actually start using it. That may become clearer later?

Written by Chris Cooke

July 12, 2009 at 11:03 am

Posted in Uncategorized


## too hot

Tried going into the Forum to rescue the sleep test machine but it was just far too hot and sweaty in there. Further sleep work will have to wait until cooler weather. Not that what is laughingly called my “office” is ever much less than 27C at the best of times, but at least in cooler weather I won’t already be hot, sweaty and pissed off before going in.

I’ve just looked up 27C: it’s over 80F in old money.

Written by Chris Cooke

July 3, 2009 at 3:04 pm

Posted in Uncategorized

## instability :-(

Sigh. Spoke too soon. Yesterday’s reboot seems to have unsettled something: the test machine slept twice and woke twice last night, as normal, but the third time was a failure: I got the “going to sleep” message at 04:21 but it failed to wake up on schedule at 07:44.

The machine now needs some personal attention to examine its state and to get it up single user to copy all the logs before booting into normal running. Oh well; it can wait until I’m next in the building.

I’m wondering if there was something special happening at that time of the morning to make the machine unhappy. I’ll take a look when the machine’s up again.

Meanwhile back to TiBS, whose config I’ve been automating with LCFG and which will no doubt be mentioned more fully here soon.

Written by Chris Cooke

July 2, 2009 at 9:05 am

Posted in Uncategorized


## More stability

My test 755 had been up for 17 days before I had to reboot it today for a software update. In those 17 days it had been sleeping three times each night and several times during the day at weekends too. And no hangs! And no failures to wake up! All of which is very good. I’m now willing to let the sleep component loose on a test student lab. This’ll have to wait until they’re upgraded to SL5.3 but that’s due to happen this month sometime I think.

Written by Chris Cooke

June 30, 2009 at 3:06 pm

Posted in Uncategorized


## Stability

I’ve come back today from a week’s holiday to find that – to my amazement – the sleep test machine has successfully suspended and resumed a full 48 times without any problems at all. This is incredible considering that this is the machine that can hardly go a night or two without getting into some sort of sticky hang-type situation. Ten nights, three sleeps a night and more at the weekends. Am I just lucky or was this due to something being different? OK, this is what was different:

• I wasn’t here. I normally use the machine remotely for shell access during the day. I can’t see a remote login session causing suspend problems, though, particularly as there’s no session running at night, which is when the machine does its sleeping anyway (I have sleep disabled during working hours).
• The client component wasn’t running. This follows the problem we spotted a week or two ago whereby something else would grab the client component’s port before the client component got to it. (Understandably, as Simon pointed out.) I had started the machine up without the client component as I looked into this problem and had forgotten to start the component before going on holiday. So the machine has been running in ultra-stable mode with no profile changes and no RPM changes. This seems to suit the power management stuff down to the ground. I wonder why…?

I think I’ll leave it another week or so with client running this time, and we’ll see if the stability continues.

The run of good luck means that I haven’t yet had a chance to try out Simon’s diagnostic suggestions in his comment on my previous entry, but no doubt I shall get to try them out soon enough.

In other news, I’ve solved a couple of problems that were plaguing me before the holiday. Both solutions were really stupid and probably show how much I was needing the holiday:

1. I’d been trying to get a new multipath SAN partition up on one of the web servers. When I tried to make a filesystem on the new partition I’d get “this partition is busy” errors. The solution: I was using the partition’s sd entry in /dev when I should have been using its entry with a big long name in /dev/mpath. It was the system’s own multipath code which was keeping the partition “busy”. The new partition is now happily mounted and filling up with data.
2. The sleep test machines had been configured to mail me the sleep log files every week when logrotate ran. They did this, but they mailed me stuff which was a month out of date. Who wants that…? Turns out that this is the rather odd default behaviour of logrotate: it mails you the logs it’s about to delete – the ones that fall off the end of the weekly conveyor belt – rather than the most recent logs. Adding the logrotate keyword mailfirst to the logrotate recipe has hopefully cured this.
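
The “conveyor belt” behaviour in the second item can be modelled in a few lines of Python. This is only a sketch of the idea, not logrotate’s own code: the `rotate` function, the week labels and the retention count of four are all assumptions made for illustration.

```python
def rotate(history, new_log, keep, mail_first=False):
    """Model one logrotate run.

    history:    rotated logs, newest first (history[0] plays the role of log.1)
    new_log:    the live log that is being rotated on this run
    keep:       how many rotated logs are retained (an assumed 'rotate' count)
    mail_first: True models 'mailfirst'; False models the default behaviour

    Returns (kept_logs, mailed_log); mailed_log is None if nothing is mailed.
    """
    rotated = [new_log] + history
    kept = rotated[:keep]
    expired = rotated[keep:]          # logs falling off the end of the belt
    if mail_first:
        mailed = new_log              # the log that has just been rotated
    else:
        mailed = expired[0] if expired else None  # the log about to be deleted
    return kept, mailed


# Six weekly runs with the default behaviour and four logs retained.
history = []
for week in range(1, 7):
    history, mailed = rotate(history, f"week{week}", keep=4)
    print(f"run {week}: mailed {mailed}")
```

With four logs retained, the default mode only starts mailing on the fifth run, and what it mails is always four rotations old, which fits the “month out of date” symptom. Passing `mail_first=True` mails the log that has just been rotated instead.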

Written by Chris Cooke

June 22, 2009 at 5:06 pm

Posted in Uncategorized


## SMP alternatives: switching to UP code


Sleep failed again on my test 755 last night. It’s been happily sleeping and waking for several days, four times per night, but last night it failed to wake from its second nap. This time it looks as if the machine hung at the point of powering off, or on again.

The power management suspend log (/var/log/pm/suspend.log) has all the suspend messages one could expect to see in it and none of the resume messages:

Fri Jun 12 00:03:03 BST 2009: running suspend hooks.
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/00clear =====
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/05led =====
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/15_915resolution =====
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/20video =====
kernel.acpi_video_flags = 0
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/49bluetooth =====
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/50modules =====
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/55battery =====
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/60sysfont =====
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/65alsa =====
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/94cpufreq =====
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/95led =====
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/98lcfgsleep =====
===== Fri Jun 12 00:03:08 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/99video =====
Fri Jun 12 00:03:08 BST 2009: done running suspend hooks.


No problem there. But there’s a possible clue in *syslog*. This is what it looks like when the machine sleeps – this is from the previous, successful sleep last night, from 18:00 to 23:52:

Jun 11 18:00:08 orkney kernel: Disabling non-boot CPUs ...
Jun 11 18:00:09 orkney kernel: CPU 1 is now offline
Jun 11 18:00:09 orkney kernel: SMP alternatives: switching to UP code
Jun 11 23:52:32 orkney kernel: CPU1 is down


I’ve checked through the logs (grep -A 2 -B 2 'now offline' /var/lcfg/log/syslog*) and every time the machine successfully suspended and resumed, it managed to write CPU 1 is now offline and then SMP alternatives: switching to UP code to syslog before suspending. But at last night’s unsuccessful sleep it didn’t – these are the last entries in syslog before the hang:

Jun 12 00:03:08 orkney kernel: Disabling non-boot CPUs ...
Jun 12 00:03:09 orkney kernel: CPU 1 is now offline


SMP alternatives: switching to UP code is missing. Kernel bug? Time to do some internet searching perhaps.
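
The same check can be automated rather than read off the grep output by eye. The following is a rough Python sketch: the function name `incomplete_suspends` is made up for this example, and it assumes that the two kernel messages always appear on consecutive kernel lines, as they do in the excerpts above.

```python
OFFLINE = "CPU 1 is now offline"
SWITCHED = "SMP alternatives: switching to UP code"


def incomplete_suspends(lines):
    """Return the 'now offline' lines that are not immediately followed
    (among kernel messages) by the 'switching to UP code' message."""
    kernel = [ln.rstrip("\n") for ln in lines if " kernel: " in ln]
    suspects = []
    for i, ln in enumerate(kernel):
        if OFFLINE in ln:
            following = kernel[i + 1] if i + 1 < len(kernel) else ""
            if SWITCHED not in following:
                suspects.append(ln)
    return suspects


# The two syslog excerpts quoted in this post.
good = [
    "Jun 11 18:00:08 orkney kernel: Disabling non-boot CPUs ...",
    "Jun 11 18:00:09 orkney kernel: CPU 1 is now offline",
    "Jun 11 18:00:09 orkney kernel: SMP alternatives: switching to UP code",
    "Jun 11 23:52:32 orkney kernel: CPU1 is down",
]
bad = [
    "Jun 12 00:03:08 orkney kernel: Disabling non-boot CPUs ...",
    "Jun 12 00:03:09 orkney kernel: CPU 1 is now offline",
]

print(incomplete_suspends(good))
print(incomplete_suspends(bad))
```

To run it over a real log, pass the open file object in place of the list. Rotated log files would need to be read oldest first for the ordering to make sense.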

Written by Chris Cooke

June 12, 2009 at 12:02 pm

Posted in Uncategorized


## less

I’ve finally sorted out some irritating little niggles in my environment. I’ll note them here for reference. I usually log in from my 10.4 Mac to a Linux machine.

The first problem was that lines would run off the top of the screen: Linux didn’t seem to know what size my terminal window was. I solved that by changing my default terminal type to ansi – the default on the Mac seemed to be xterm-color, whatever that is.

The second problem was that less wouldn’t show man pages properly; bold text would mess up the formatting. That went away when I added less’s r option to my LESS variable. The “r” option allows escape sequences to pass straight through less instead of being altered by less to remove their effect – so things like bold text would finally show up bold rather than normal but surrounded by a weird jumble of characters put in by less to replace vanished escape sequences.

So I could finally view man pages, which was great; but then the first problem started up again: less would sometimes run some lines off the top of the screen, apparently trying to show me too many lines. A hunt through the less man page provided the answer: I shouldn’t have used less’s r option. I should instead have used the R option, which allows ANSI escape sequences through to set colour, bold text and so on, but blocks all other escape sequences. That seems to have been the final piece of the puzzle – less is now a well-behaved citizen. So I can now stop footering about with this sort of rubbish and get on with some work…
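
To make the difference between the two options concrete, here is a small Python sketch of the filtering idea behind R. It is not how less is implemented, and the way it marks a blocked escape character (as `^[`) is an assumption chosen for readability; the function name `keep_sgr_only` is invented for this example.

```python
import re

# SGR ("Select Graphic Rendition") sequences have the form ESC [ params m.
# These are the ones that set colour, bold, underline and so on.
SGR = re.compile(r"\x1b\[[0-9;]*m")


def keep_sgr_only(text):
    """Pass SGR sequences through untouched; make any other ESC visible
    as the two characters ^[ so that it cannot affect the terminal."""
    out = []
    pos = 0
    for match in SGR.finditer(text):
        out.append(text[pos:match.start()].replace("\x1b", "^["))
        out.append(match.group())
        pos = match.end()
    out.append(text[pos:].replace("\x1b", "^["))
    return "".join(out)


bold = "\x1b[1mNAME\x1b[0m"      # bold on, text, attributes off
clear = "\x1b[2Jsome text"       # "erase display": not an SGR sequence

print(repr(keep_sgr_only(bold)))
print(repr(keep_sgr_only(clear)))
```

The bold sequence survives, so the text still renders bold, while the screen-clearing sequence is defused. Letting that second kind of sequence through is what makes it hard for a pager to keep track of what is on the screen.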

Written by Chris Cooke

June 11, 2009 at 10:38 am

Posted in Uncategorized


## some investigation of the client hang

(I’ve also posted this entry as a comment to LCFG bug 146.)

I’ve just tried

[orkney]root: lsof -i | grep 732


and this is what I see:

rpc.statd 3094 rpcuser    7u  IPv4   7031       TCP *:732 (LISTEN)


This was after a couple of reboots. Before the reboot nothing was using port
732 (client hadn’t started – because before *that* I’d mREMOVEd lcfg_client
from boot.services and rebooted until I got the machine coming up without
client starting on boot.)

Here’s what ‘man rpc.statd’ has to say about ports:

-o, --outgoing-port port
specify a port for rpc.statd to send outgoing status requests
from. By default, rpc.statd will ask portmap(8) to assign it a
port number. As of this writing, there is not a standard port
number that portmap always or usually assigns. Specifying a
port may be useful when implementing a firewall.

-p, --port port
specify a port for rpc.statd to listen on. By default,
rpc.statd will ask portmap(8) to assign it a port number. As of
this writing, there is not a standard port number that portmap
always or usually assigns. Specifying a port may be useful when
implementing a firewall.

No LCFG components mention rpcstatd. Two /etc/init.d scripts do, rstatd and
nfslock.
rstatd isn’t in boot.services but nfslock is. nfslock also starts up earlier
than client does/did.
Looking at nfslock, it starts statd and *optionally* specifies ports according
to variables defined in /etc/sysconfig/nfs:

[orkney]root: grep -i stat /etc/sysconfig/nfs
# Optional arguments passed to rpc.statd. See rpc.statd(8)
#STATDARG=""
# Port rpc.statd should listen on.
#STATD_PORT=662
# Outgoing port statd should used. The default is port
#STATD_OUTGOING_PORT=2020
#STATD_HA_CALLOUT="/usr/local/bin/foo"


Since those are commented out just now it doesn’t ask for particular ports.

Looking at /etc/services, both of those ports suggested in the commented-out
entries in /etc/sysconfig/nfs are already being used for other things.

Written by Chris Cooke

June 9, 2009 at 4:40 pm

Posted in Uncategorized


## TiBS LCFGification plans

The LCFGification (© Craig) of the TiBS backup software is under way. To help things along I’ve added some comments to Craig’s plan.

Written by Chris Cooke

June 8, 2009 at 3:48 pm

Posted in Uncategorized


## another client hang, and some sleep debugging hints

The test 755 slept happily last night, three times, and was fine when it woke up this morning.

However on reboot it again got stuck starting the client component. I’ve entered this as bug 146 in the LCFG bug tracker.

I’ve written some guidelines on how to go about debugging sleep-related problems.

Written by Chris Cooke

June 4, 2009 at 1:50 pm

Posted in Uncategorized


## the client component hangs

The test 755’s most recent difficulty turned out to be not a crash or a total machine hang, but something that happened the last time I rebooted it. I can’t now remember why it was necessary to reboot the machine, but anyway, I had started the reboot then departed for pleasanter climes. When I got back to it today the machine was half-booted and showing “LCFG mailng [OK]” at the bottom of a list of startup messages. Which means that it had completed mailng and was starting the next thing on the list: the client component. I later retrieved the client log and at the hang it was saying:

01/06/09 16:42:53: >> context
01/06/09 16:43:16: >> start
01/06/09 16:43:16: configuration changed
01/06/09 16:43:16: starting daemon [4283/732] version 2.2.37 04/17/09 13:53:56
01/06/09 16:43:16: warnings: conflict,context,dirs,error,localprofile,notify,parse,rpms,server
01/06/09 16:43:16: ** can’t bind UDP socket
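
A bind failure like the one on the last line usually means that something else already holds the port. The Python sketch below reproduces that situation in miniature. It is purely illustrative: `try_bind_udp` is a made-up helper, and it uses the loopback address with a port chosen by the operating system rather than port 732, which would need root privileges to bind.

```python
import errno
import socket


def try_bind_udp(port, host="127.0.0.1"):
    """Try to bind a UDP socket to (host, port).

    Returns the bound socket, or None if the address is already in use.
    Any other error (for example, lack of permission) is re-raised.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None
        raise
    return sock


first = try_bind_udp(0)             # port 0 asks the OS for any free port
port = first.getsockname()[1]       # find out which port we were given
second = try_bind_udp(port)         # try the same port while 'first' holds it

print("port", port, "second bind succeeded:", second is not None)
first.close()
```

The second attempt fails for as long as the first socket is open, which is the position a daemon is in if another service has grabbed its port earlier in the boot sequence.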

When I restarted the machine today the client component started OK and the machine booted successfully. This time the client log said:

03/06/09 12:31:57: >> context
03/06/09 12:32:19: >> start
03/06/09 12:32:19: configuration changed
03/06/09 12:32:19: starting daemon [4291/732] version 2.2.37 04/17/09 13:53:56
03/06/09 12:32:19: warnings: conflict,context,dirs,error,localprofile,notify,parse,rpms,server
03/06/09 12:32:19: context check requested
03/06/09 12:32:19: new context: booting=true
03/06/09 12:32:23: profile accepted: 2921cae88f21209cf706dec15bafd4b5
03/06/09 12:33:27: >> context
03/06/09 12:33:27: context check requested
03/06/09 12:33:27: new context: default
03/06/09 12:33:30: profile accepted: 2921cae88f21209cf706dec15bafd4b5

This morning’s development meeting went well, a big improvement on previous ones I think: it helps far more to talk briefly about the real progress on each project than to go on about deadlines.

Written by Chris Cooke

June 3, 2009 at 4:26 pm

Posted in Uncategorized


## mpu minutes, preparing for dev meeting

I haven’t had much time for the sleep project today. Much of the morning was spent writing up the minutes of this week’s MPU meeting. I see the test 755 didn’t recover from its first sleep attempt again last night; I’ll rescue it and investigate tomorrow.

For this month’s development meeting we’ve been asked to provide a brief summary of what has been achieved in the month since the last meeting and what is intended to be achieved by the next meeting. Here’s what I’ve come up with for the sleep project:

Over the past month the project has concentrated on analysing various
problems related to suspend and resume on the test machines. Some of
these have been solved with changes to the component or to the
resources. As a result of these problems the scope of the project has
progressively narrowed from “look for sleep opportunities on all DICE
desktop machines, all the time” to “look for sleep opportunities on
Dell 745s and 755s, running SL5.3, when no X session is running”.

In the next month I plan to arrange to test the component in at least
one student lab, with a view to deploying it in the student labs for
the next session. I also plan to check that lcfg-sleep cooperates
properly with lcfg-condor on machines that use both.

In the longer term, as OS and hardware support for sleep gradually
improves, as surely it must, we may be able to widen the scope of our
power management once again and spread automatic sleep to more
machines for more of the time. That may not happen as part of this
project however, but more as small incremental developments over the
coming years. For example: once the pm-utils package (which
implements power management) supports the concept of timed sleep with
automatic wakeup it will be possible to integrate it safely with
lcfg-sleep.

Written by Chris Cooke

June 2, 2009 at 11:43 pm

Posted in Uncategorized


## another hang, and a bios upgrade attempt

Following George’s comment that it looked to him like amd that was hanging rather than ntp, I’ve switched round the order of the resume hooks so that ntp should start up before amd. That way hopefully amd will start up with a sensible date/time. I’ve not seen the amd component failing to start up before and I haven’t seen the clock going haywire before either, so I’m hoping that the one was the cause of the other. We’ll see if it happens again.

Meanwhile, the very next time the test 755 went to sleep it also didn’t wake up. This time it didn’t seem to be hung in the same way: the last time it happened the client component was managing to report back to the LCFG server, and I could ping the machine, but this time the machine’s LCFG status page reported

Last acknowledgement: 29/05/09 17:50:24

and ping returned “destination unreachable”.

Saving the log files and rebooting, this is what /var/log/pm/suspend.log says this time:

Fri May 29 18:00:03 BST 2009: running suspend hooks.
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/00clear =====
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/05led =====
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/15_915resolution =====
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/20video =====
kernel.acpi_video_flags = 0
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/49bluetooth =====
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/50modules =====
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/55battery =====
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/60sysfont =====
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/65alsa =====
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/94cpufreq =====
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/95led =====
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/98lcfgsleep =====
===== Fri May 29 18:00:09 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/99video =====
Fri May 29 18:00:09 BST 2009: done running suspend hooks.

So all the suspend hooks ran successfully, but then – it hung? When I got to it this morning the machine wasn’t sleeping, as the power button wasn’t flashing, so it’s not simply a case of it not waking up on cue. The power button was solidly on, which is the case when the machine is either hung or running normally. And it wasn’t running normally…
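
When comparing suspend logs from several nights it can help to pull the hook timings out automatically. The Python sketch below does this for the line format shown above. The function names are made up for this example, the month names are assumed to be in English, and the “gap” is simply the time until the next hook started, which only roughly corresponds to how long a hook took.

```python
import re
from datetime import datetime

# Matches lines such as:
# ===== Fri May 29 18:00:03 BST 2009: running hook: /path/to/hook =====
HOOK = re.compile(
    r"^===== \w{3} (\w{3} +\d+ \d\d:\d\d:\d\d) \w+ (\d{4}): "
    r"running hook: (\S+) =====$"
)


def parse_hooks(lines):
    """Return (hook_path, start_time) pairs in the order they were logged."""
    hooks = []
    for line in lines:
        match = HOOK.match(line.strip())
        if match:
            # The weekday and timezone name are skipped; they are not needed
            # for measuring gaps within a single log.
            stamp = f"{match.group(1)} {match.group(2)}"
            when = datetime.strptime(stamp, "%b %d %H:%M:%S %Y")
            hooks.append((match.group(3), when))
    return hooks


def slow_hooks(hooks, threshold_seconds=1):
    """Return (hook_path, gap) for hooks followed by a noticeable gap."""
    slow = []
    for (path, start), (_, next_start) in zip(hooks, hooks[1:]):
        gap = (next_start - start).total_seconds()
        if gap >= threshold_seconds:
            slow.append((path, gap))
    return slow


# Three lines taken from the log above.
sample = [
    "===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/95led =====",
    "===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/98lcfgsleep =====",
    "===== Fri May 29 18:00:09 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/99video =====",
]
print(slow_hooks(parse_hooks(sample)))
```

In this log the only visible gap is the six seconds between the start of 98lcfgsleep and the start of 99video; every other hook starts within the same second as the one before it.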

Meanwhile I’ve been trying to upgrade the bios on the test 755. I’ve been trying the method described in the Dell Linux wiki but when I reboot it says that it can’t find the bios.hdr file I’ve loaded in memory.

I think I need to do a “warm boot” rather than the usual (apparently) “cold” one. The instructions tell you to add “reboot=bios” to the end of your kernel command line, reboot, load your bios.hdr again, and then reboot once more (which is when the actual upgrade happens). I tried that and still got the “can’t find your bios.hdr in memory” message.

Perhaps I should have edited the kernel command line rather than appended to it? Maybe I’ll try that tomorrow.

Written by Chris Cooke

June 1, 2009 at 4:58 pm

Posted in Uncategorized
