# What's Chris been doing?

Successes and failures at inf.ed.ac.uk

## A bug fix for the sleep component

A new version of the LCFG sleep component, 0.30.0, is out and installed on the sleep beta test machines. It fixes bug 653.

The problem was with the code which checked keyboard/mouse idleness. I was so delighted to be able to do this at last that it went to my head and I forgot that keyboard/mouse idleness is only relevant when somebody is logged in. After logout these idleness figures can be ignored: other tests will pick up things like remote shells.

The result was that although the component correctly checked keyboard/mouse idleness and politely waited until the machine had been idle for a while before authorising sleep, it kept doing that after the user had logged out and gone away. Before the idleness check was added my machine would fall asleep a minute or two after I logged out; with the bug it would wait several hours before sleeping.

So, all fixed in lcfg-sleep 0.30.0. Everything seems OK so far from the intrepid beta test team so I’m hoping that this version will hit other DICE desktops within a month or so.
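
The corrected rule can be sketched in a few lines of shell. This is only an illustration of the logic, not the component's own code: the function name and its three arguments are made up for the example, and the idle times are assumed to be in seconds.

```shell
# Hypothetical helper, not part of lcfg-sleep.
#   $1  number of login sessions currently present
#   $2  seconds for which the keyboard and mouse have been idle
#   $3  idle time required before sleep is permitted
# Returns 0 if this test approves sleep, 1 if it vetoes it.
idle_test_approves_sleep() {
    sessions="$1"; idle="$2"; required="$3"
    # Nobody is logged in: keyboard/mouse idleness is irrelevant, so don't veto.
    if [ "$sessions" -eq 0 ]; then
        return 0
    fi
    # Somebody is logged in: approve only once they have been idle long enough.
    [ "$idle" -ge "$required" ]
}
```

The bug amounted to skipping the first check, so a stale idle time could go on vetoing sleep after logout.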

## Spring Sleep Roundup

I’ve been asked for the latest news on the LCFG sleep component (which sends our desktops to sleep when they’re idle, but ensures that they wake up to run important cron jobs). Your wish is my command. Here are the major developments since its last mention here:

• Since 0.21.0 it’s enabled all USB devices for wake-up, rather than trying to guess which ones might have a keyboard attached to them. Simpler, more reliable. (This is so that a sleeping machine can be woken by a press of a key.)
• Since 0.22.0 the component has tried harder to create a wake alarm to wake the machine in good time for cron jobs, even when it knows that it’s never going to send the machine to sleep. (Because machines can go to sleep by other means too.)
• In 0.23.0 the “nosleep” command was introduced – “man nosleep” for details. Thanks to Sharon Goldwater for this idea.
• From 0.24.0 the wake alarm is cleared when the component stops. So machines which have been shut down for the Christmas holidays will stay shut down.
• Versions 0.25.0 to 0.29.0 added the holy grail: a test for login session idleness. I found the right bit of DBUS at last! Provided the user is using GNOME, the component can find out if and for how long her login session has been idle. If the user uses some other environment the component will refrain from sleeping the machine while the login session remains; a solution for that is in development and will appear when we get time. New resource xidletime specifies the time delay between the session becoming idle and sleep being permitted.

More recently I’ve moved a bunch of generally useful settings from the Informatics-only sleep header into lcfg/options/sleep.h.

The “login session idleness” functionality is currently being beta-tested (which got a mention on the Informatics Energy blog) so it will be installed only if you put the following in a machine’s profile:

#define LCFG_SLEEP_BETA_TEST
#include <lcfg/options/sleep.h>


I’ve been asked whether I have data on whether sleeping shortens the life of desktop machines. I’m afraid I don’t, but if you do, please get in touch. I do have a few thoughts on the matter though.

• Modern desktop hardware and operating systems are all designed to support sleep, and they do it well. Our managed Windows desktop machines enforce sleep and this seems to work well. Ten or fifteen years ago frequent sleep might have been bad for the hardware, but nowadays I really doubt that it would cause problems.
• We’ve now been using sleep for several years and we haven’t seen an epidemic of premature death in our desktop machines.
• We’ve come across problems caused by hard disks being kept permanently running 24/7. “Never switch off” isn’t always the right idea.

## Round-up of sleep news

It’s been a while since I blogged about the sleep component. There’s been a lot of activity on that front lately, so here’s a roundup of the news.

• You can now wake a sleeping machine by pressing a key on the keyboard.
• In theory you can also wake a sleeping machine by clicking a mouse button. However RHEL6.0 / SL6.0 seems to have a kernel bug which makes this not work any more. As far as I can gather, the kernel bug was fixed ages ago but RHEL deliberately removes the fix when building its kernel. Hopefully this will all be better in 6.1.
• The component now detects running cron jobs. If it finds one, and that job isn’t in the cronignores list, it keeps the machine awake.
• There are new disable and enable methods. These disable sleep, and undo the disable method, respectively. The idea is that they can be run by a machine’s user, for example by typing om sleep disable.
• The component should work on 64 bit architectures too.
• A bug which broke the execution of extra commands at suspend and resume has been fixed.
• A new blacklist resource allows the selective disabling of sleep for particular hardware models.
• The sleep component’s LCFG Wiki page has been thoroughly tidied and brought up to date. Take a look at that page for an introduction to the component.
• Sleep mostly seems to work reliably on SL6. One or two models are currently presenting problems (the Dell Optiplex 755 in particular) but solutions have been identified and I expect those models to be sleeping successfully soon too. Certainly my test 755 sleeps like a baby. (It wakes in the middle of the night to perform important functions…)
• Edit: I forgot to mention that the lcfg-level resources have now been beefed up so that the component can now be run out of the box with lcfg level headers: no extra configuration should be necessary. (More configuration is possible of course – see the LCFG Wiki page for some config ideas.)

With all these developments out of the way it’s looking likely that we’ll soon be rolling out the sleep component onto all the staff and postgrad DICE Linux desktops in Informatics. In addition the introduction of the blacklist resource clears the way for the possible adoption of lcfg-sleep by other schools and units too. I’m looking forward to that challenge; it’ll be great to see more power-saving from the Linux desktops across the University.

## Linux Sleep: a new hoop to jump through

This is a follow-up to an earlier post, Linux sleep: how to wake with a key press or mouse click.

Shortly after discovering how to wake a sleeping machine this way – something of a Holy Grail of mine for several years – a new kernel version came along and broke the mechanism. At least, you now seem to have to jump through an extra hoop to enable it, in addition to the one described in the earlier post.

It’s now also necessary to find the relevant USB devices’ files under the /sys/devices/pci0000:00 tree and echo "enabled" to them. Version 0.12.0 of the LCFG sleep component has been updated to do this. It should be installed on DICE machines by 14 April 2011. LCFG bug 408 is the bug report associated with the change.
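
For anyone doing this by hand, here is a rough sketch of the extra step. It assumes that the files in question are the standard sysfs `power/wakeup` attributes of the USB devices (the post doesn't name them), and it is not the component's own code; the function name and the optional root argument are invented for the example.

```shell
# Hypothetical helper: enable wake-up for every USB device found under a
# sysfs PCI tree (root hubs and anything plugged into them).
# Assumption: the relevant files are the power/wakeup attributes, e.g.
#   /sys/devices/pci0000:00/0000:00:1d.0/usb2/power/wakeup
enable_usb_wakeup() {
    root="${1:-/sys/devices/pci0000:00}"
    find "$root" -path '*/usb[0-9]*/power/wakeup' 2>/dev/null |
    while IFS= read -r f; do
        # Unlike /proc/acpi/wakeup this is not a toggle: writing "enabled"
        # to a file that is already enabled would do no harm.
        if [ "$(cat "$f")" != "enabled" ]; then
            echo enabled > "$f" && echo "enabled $f"
        fi
    done
}
```

Run as root with no arguments it works on the real tree; passing a different directory lets you try it on a copy first.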

## Linux sleep: how to wake with a key press or mouse click

Several years ago we started sending the Linux machines in our student labs to sleep when idle, to save power. We configured them to check carefully before deciding whether or not they were idle enough to sleep, and also to wake themselves up in time to run important cron jobs. Machines could also be woken manually when needed.

This was fine, except for one problem: the only way to wake the machine manually was to press its power button. That’s not how most people try to wake a sleeping machine: it’s far more natural to press a key on its keyboard, or click one of its mouse buttons.

We’ve had a user education campaign which seems to have successfully taught most users of the labs how to wake a machine up, but there’s still a persistent minority of people who don’t understand, or maybe get impatient, and who sometimes end up doing something rash such as forcing a sleeping machine to reboot; so we get a steady flow of broken machines.

To solve this problem I’ve been trying for a long time to find out how to enable wake from sleep with a key press or mouse click. I’ve even been trying to find out if it was actually possible with Linux.

I have finally succeeded! It is possible, I’ve done it, and the solution will shortly be rolled out to our student lab machines. Here’s how:

The key file to manipulate is called /proc/acpi/wakeup. This file is a list of devices which can be used to wake the machine from sleep – and whether or not they’re currently allowed to. A status of “disabled” against a device means that it won’t wake the machine, while “enabled” means that it will. Here are the default contents of /proc/acpi/wakeup on my desktop HP dc7900 running Fedora 13:

Device  S-state   Status   Sysfs node
PCI0      S4    *disabled  no-bus:pci0000:00
COM1      S4    *disabled  pnp:00:07
PEG1      S4    *disabled  pci:0000:00:01.0
PEG2      S4    *disabled
IGBE      S4    *disabled  pci:0000:00:19.0
PCX1      S4    *disabled  pci:0000:00:1c.0
PCX2      S4    *disabled
PCX5      S4    *disabled  pci:0000:00:1c.4
PCX6      S4    *disabled
HUB       S4    *disabled  pci:0000:00:1e.0
USB1      S3    *disabled  pci:0000:00:1d.0
USB2      S3    *disabled  pci:0000:00:1d.1
USB3      S3    *disabled  pci:0000:00:1d.2
USB4      S3    *disabled  pci:0000:00:1a.0
USB5      S3    *disabled  pci:0000:00:1a.1
USB6      S3    *disabled  pci:0000:00:1a.2
EUS1      S3    *disabled  pci:0000:00:1d.7
EUS2      S3    *disabled  pci:0000:00:1a.7
PBTN      S4    *enabled


The only device that’s allowed to wake the machine is PBTN – the power button.

To enable a device, just echo its device code to the file, like this:

# echo USB3 > /proc/acpi/wakeup


A quick look at /proc/acpi/wakeup confirms that USB3 is now enabled for wakeup:

USB1      S3    *disabled  pci:0000:00:1d.0
USB2      S3    *disabled  pci:0000:00:1d.1
USB3      S3    *enabled   pci:0000:00:1d.2
USB4      S3    *disabled  pci:0000:00:1a.0
USB5      S3    *disabled  pci:0000:00:1a.1


I wanted to make it possible for the keyboard and the mouse to wake the machine, so I used this method to “enable” all of the USB devices.

Note that echoing the device code to the file toggles the device’s status: a disabled device is enabled, and an enabled one is disabled.

Note also that if writing a Perl script to do this, you’ll have to open /proc/acpi/wakeup for writing, echo a device code, then close the file, separately for each device you want to enable.

Here’s a bit of Perl which will enable wakeup on all USB devices, if you run it from an account which has permission to write to /proc/acpi/wakeup:

#!/usr/bin/perl

my $wakeup = "/proc/acpi/wakeup";
my @disabled;
my $device;

# Let's take a look at the wakeup file
open(INPUT, "< $wakeup")
  or die "Couldn't open $wakeup for reading: $!\n";

# Remember the names of each disabled USB device
while (<INPUT>) {
  if (/^(USB\d+).*disabled/) {
    push(@disabled, $1);
    print "Added $1 to disabled list\n";
  }
}

# We've finished reading from the file
close(INPUT);

# Enable each device on our list
foreach $device (@disabled) {
  print "$device is disabled! Enabling it now.\n";
  # The file has to be opened and closed separately for each device
  open(OUTPUT, "> $wakeup")
    or die "Couldn't open $wakeup for writing: $!\n";
  print OUTPUT $device
    or die "Couldn't echo $device to $wakeup: $!\n";
  close(OUTPUT);
}


## hardware test results

Whoops: I’ve neglected this for the last few days. This post therefore has to be something of a catch-up. Sorry for the length.

• On the 15th I built lcfg-openafs and lcfg-openldap for f12_64. (Stephen has been enthusing about Mock for a long time now, and I can see why – it’s extremely useful to be able to build packages without having to install all the Requires and BuildRequires packages on the build machine.) lcfg-openafs-0.0.32 is now officially ported to F12.
• I also confirmed that switching my F12 machine from Kerberos configuration by the file component, to configuration by the kerberos component, kills my keyboard stone dead on reboot. I get a prompt for my admin principal and the keyboard totally fails to work. I’ve gone back to the file component method and reinstalled…
• Most of the rest of the 15th and 16th was taken up with hardware tests: results here. The problems I found were:
755 doesn’t reboot: whenever the 755 tries to reboot it announces “Rebooting system.” then hangs.

745 doesn’t mount CDs: if you insert a CD into a 745 it whirrs but nothing appears on the desktop.

HP 7900 CD support is dodgy: sometimes an inserted CD doesn’t mount on the 7900’s desktop, sometimes it does.

Dell sound is dodgy: there’s no audio output from speakers on some Dells.

Audio or sleep troubles on 755: the X login screen disappeared from a 755 after it had undergone an intensive programme of frequent suspends and resumes. On checking the logs it seemed that rtkit-daemon was logging to syslog a lot at resume time. On the very first resume it logged

rtkit-daemon[4569]: Sucessfully made thread 4567 of process 4567 (/usr/bin/pulseaudio) owned by '42' high priority at nice level -11.
rtkit-daemon[4569]: Sucessfully made thread 4573 of process 4567 (/usr/bin/pulseaudio) owned by '42' RT at priority 5.
rtkit-daemon[4569]: Sucessfully made thread 4574 of process 4567 (/usr/bin/pulseaudio) owned by '42' RT at priority 5.


Then on the second resume it logged:

rtkit-daemon[4569]: The poor little canary died! Taking action.
rtkit-daemon[4569]: Rampaging.
rtkit-daemon[4569]: Successfully demoted thread 4573 of process 4567 (/usr/bin/pulseaudio).
rtkit-daemon[4569]: Successfully demoted thread 4574 of process 4567 (/usr/bin/pulseaudio).
rtkit-daemon[4569]: Demoted 2 threads.


and on subsequent resumes:

rtkit-daemon[4569]: The poor little canary died! Taking action.
rtkit-daemon[4569]: Rampaging.
rtkit-daemon[4569]: Demoted 0 threads.


Some hours later this eventually became

 rtkit-daemon[4569]: Rampaging.
rtkit-daemon[4569]: Demoted 0 threads.
gdm-simple-slave[4497]: WARNING: Child process -4519 was already dead.
gdm-simple-slave[4497]: WARNING: Unable to kill D-Bus daemon
console-kit-daemon[1811]: WARNING: Couldn't read /proc/4556/environ: Failed to open file '/proc/4556/environ': No such file or directory
gdm-binary[4464]: WARNING: GdmDisplay: display lasted 0.844015 seconds
gdm-binary[4464]: WARNING: GdmDisplay: display lasted 0.834487 seconds
gdm-binary[4464]: WARNING: GdmDisplay: display lasted 0.829573 seconds
gdm-binary[4464]: WARNING: GdmDisplay: display lasted 0.829621 seconds
gdm-binary[4464]: WARNING: GdmDisplay: display lasted 0.830660 seconds
gdm-binary[4464]: WARNING: GdmDisplay: display lasted 0.835599 seconds
gdm-binary[4464]: WARNING: GdmLocalDisplayFactory: maximum number of X display failures reached: check X server log for errors
init: prefdm main process (4464) terminated with status 1
init: prefdm main process ended, respawning
gdm-binary[13695]: WARNING: GdmDisplay: display lasted 0.833754 seconds


and so on. Judging by these two posts on Ubuntu forums it may be the case that PulseAudio should be stopped on suspend and started on resume. I’ve checked our pm-utils sleep.d scripts and that’s not happening on our F12 machines.

“rtkit”, by the way, is “real time kit”; it’s required by PulseAudio but not yet by anything else.

I spent some time today debugging a pm-utils sleep.d hook script which would suspend PulseAudio on system suspend and resume it on resume, but without success. I think I’ve spent enough time on this; for now we’ll just have to have lcfg-sleep disabled on 755s. I’m modifying lcfg/defaults/sleep.h accordingly.

A note for the future: my failed sleep hook script needed to run nsu or sudo so it could run pactl as the user running pulseaudio. Root doesn’t have permission to nsu, and I subsequently noticed some console messages saying “root: sorry, you must have a tty to run sudo”. So that explains those failures, anyway.

• Alastair has got us round the keyboard/kerberos problem. Apparently scripts called from Upstart can’t get interactive input! See this Ubuntu support thread. Setting kerberos.hostkeyless to true gets us round the problem for now, at the cost of not having any automatically generated host keys. I’ve changed inf/options/kerberos-client.h accordingly. But what a pain. We really don’t like Upstart. Edit: it may be plymouth rather than Upstart. Hopefully we’ll be able to chuck or disable plymouth as a workaround.

## A reinstall brings RTC wake alarm confusion

Late yesterday I bogged up my F12 machine completely. Today I took the opportunity to reinstall it using Alastair’s shiny new F12 install process. This worked, albeit with a few hiccups, so I now have a new F12 installation.

With the new installation, the wake alarm no longer works as it did. This is how it worked until yesterday:

# echo 0 >/sys/class/rtc/rtc0/wakealarm
# date "+%s" -d "+ 5 minutes" >/sys/class/rtc/rtc0/wakealarm


That is, you zero the alarm then you set it with the number of seconds between the epoch and the date/time you want the machine to wake up. Now though, with the same kernel RPM as before, the above doesn’t work. Instead it works like this:

# echo 0 >/sys/class/rtc/rtc0/wakealarm
# echo +300 >/sys/class/rtc/rtc0/wakealarm


That is, you put in a + followed by the number of seconds between now and your alarm time. But the kernel version is the same, I think, from yesterday to today. How can the alarm behaviour have changed…?! And will it change again tomorrow? How do I write software when the kernel behaviour changes arbitrarily from one day to the next with no change of kernel RPM version?

I got the solution from here – which seems to be about Asus boards so I’m still confused as I’m using a Dell:
http://www.mail-archive.com/acpi-bugzilla@lists.sourceforge.net/msg24296.html
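
One way to cope with both behaviours is to try each form in turn. The sketch below is only that, a sketch: the function names are invented, and it can only notice a write that the kernel rejects outright. If a write is accepted but the alarm then never fires, which may be what happened here, it won't spot the problem.

```shell
# Hypothetical helpers, not the component's code.

# Print the string to write to wakealarm for an alarm $2 seconds after
# time $1 (seconds since the epoch), in the style named by $3.
wakealarm_value() {
    case "$3" in
        absolute) echo "$(( $1 + $2 ))" ;;   # seconds since the epoch
        relative) echo "+$2" ;;              # "+" then seconds from now
        *) return 1 ;;
    esac
}

# Try to set an alarm $1 seconds from now, trying each style in turn.
# Prints the style whose write was accepted; returns 1 if neither was.
set_wake_alarm() {
    delay="$1"
    alarm="${2:-/sys/class/rtc/rtc0/wakealarm}"
    for style in relative absolute; do
        # Zero the alarm first, as in the commands above.
        echo 0 > "$alarm" || continue
        if wakealarm_value "$(date +%s)" "$delay" "$style" > "$alarm"; then
            echo "$style"
            return 0
        fi
    done
    return 1
}
```

For example, `set_wake_alarm 300` (as root) asks for a wake-up five minutes from now.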

## Fixed problems with lcfg-sleep on F12

The day was mostly taken up with making the sleep component behave itself properly on Fedora 12. The OS’s power management facilities certainly seem to have matured: my test Dell Optiplex 745 suspends and resumes far more quickly than it did with SL5, and it seems to be doing it far more reliably as well so far. I’ve left it on an intensive suspend/resume cycle though (awake for 3 minutes, then suspend if appropriate, then wake 2 minutes later and start again) so we’ll see if that brings out any misbehaviour over the next few days.

An apparent bug whereby the pm-utils hook scripts weren’t being called was solved when I noticed that for F12 I’d switched the suspend command for my machine from /usr/sbin/pm-suspend to some other fancy suspend command. Doh. I switched it back and the pm-utils hooks were called again as they should be.

I also rewrote the part of the component’s code which sets the wake alarm to make it cope properly with either the old kernel alarm system used on SL5 (/proc/acpi/alarm) or the newer one found on F12 (/sys/class/rtc/rtc0/wakealarm).
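
The idea is simple enough to sketch in shell. What follows is an illustration rather than the component's actual code; the function name and the optional root argument are invented, and the date format for /proc/acpi/alarm is an assumption worth checking on a real SL5 machine, as is the time zone that interface expects.

```shell
# Hypothetical helper: set a wake alarm for the absolute time $1
# (seconds since the epoch), using whichever interface is present.
# $2 is an optional filesystem root, there only so that the function
# can be tried out on a scratch directory.
set_alarm_at() {
    when="$1"
    root="${2:-}"
    new="$root/sys/class/rtc/rtc0/wakealarm"
    old="$root/proc/acpi/alarm"
    if [ -e "$new" ]; then
        # Newer kernels (F12): zero the alarm, then write the epoch time.
        echo 0 > "$new"
        echo "$when" > "$new"
    elif [ -e "$old" ]; then
        # Older kernels (SL5). Assumption: this file takes a timestamp of
        # the form "YYYY-MM-DD HH:MM:SS". Needs a date that accepts -d @N.
        date -d "@$when" "+%Y-%m-%d %H:%M:%S" > "$old"
    else
        echo "no wake alarm interface found" >&2
        return 1
    fi
}
```

Checking for the newer file first means a machine which has both will use the newer one.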

I also discovered and fixed an edge case problem whereby the component’s shell idle time test would happily approve sleep in the case where all interactive shells had an idle time of zero seconds. Repeat twenty times: I must not confuse zero with undefined in my Perl scripts. Still to do: check that other sleep tests are behaving themselves (I think they are though) and check that the new code still does the right thing on SL5.

## LCFG Users Day talk

Yesterday was the 2009 LCFG Users Day afternoon session. The talks were all pretty interesting I thought; it was very encouraging to see how useful people are finding LCFG, and how its use has grown compared to last year. The developments at ACE seemed particularly impressive: LCFG has become a very useful and powerful Mac configuration management tool.

My own wee talk on the sleep component went OK (to my relief). Since I have it all written down anyway, here it is, more or less verbatim, for those that missed it.

I’m Chris Cooke and I’ve written a component called “sleep”.

I did it because we want to save the environment – that’s one of the
University’s corporate goals, more or less – and we also wanted to
save money off our electricity bill.

So, the sleep component.

The idea is that it runs on our desktop Linux computers.

When it runs it decides whether or not it would be appropriate for the
computer to sleep. If it would be appropriate, it sends the computer
to sleep.

However, just as importantly, before sending the computer to sleep,
the sleep component also decides a good time for the computer to wake
up again, and it sets a wake alarm which will wake the computer up
at that time.

So, when is it appropriate for a computer to sleep?

Well, cron jobs are one thing to look out for.

The component makes sure that a computer will be awake in time to run
every important cron job.

(And by the way it takes “important” to mean “every cron job except
the ones you’ve told it to ignore”.)

It also looks at the load average – if that’s higher than a level you
set in a resource, the computer will stay awake.

It also looks at the idle time of shells, at X sessions, and it also
runs any arbitrary command you tell it to, and takes the return value
of that command to be an approval or veto of sleep for the machine.

So, for instance, I realised quite late on that although I’d dealt
nicely with cron jobs, I’d totally forgotten “at” jobs, so I was able
to add on a call to an external command which has the effect of
vetoing sleep if there’s anything in the “at” queue.

I also gave the component other things to look out for:

You can set a minimum awake duration, that is, a minimum time between
sleeps.

You can also set a minimum and maximum duration for sleep.

So basically it runs this whole battery of tests, and if any of them
vetoes sleep, the machine stays awake.

Before sleeping, the component looks ahead to when the next important
cron job will run – or when the maximum sleep duration will be up, if
that comes sooner – and it sets a wake alarm which will wake the
computer up for that time.

And finally there are also resources which run things for you when the
machine is falling asleep and when it’s waking up again.

Like, we use one or two daemons which react badly to sleep, so we shut
them down before sleep, then start them up again when the machine
wakes.

So, this is all great, it’s shiny and wonderful, but there is some bad
news: I found out the hard way that the power management on our Dell
SelectPCs does not seem reliable with Scientific Linux; when the
machine tries to sleep you get crashes and freezes now and then.
In the end we gave up trying to use it with our Dells.

However it is perfectly reliable on the current SelectPC, the HP 7900,
and in fact we are using the sleep component on the 7900s in our
student labs across the road there.

So, the component is called lcfg-sleep, you’ll find it in
svn.lcfg.org, and if you look on wiki.lcfg.org you’ll find a page
there about the sleep component.

That’s it. Thank you.
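
For readers who prefer code to slides, the two central ideas in the talk (any one test can veto sleep, and the wake alarm is set for whichever comes first of the next important cron job and the maximum sleep duration) can be sketched in shell. This illustrates the logic only. It isn't the component's code, all three function names are invented, and the use of `atq` is just one plausible way of checking the “at” queue.

```shell
# Hypothetical helpers, not the component's code.

# Run each named test command in turn. A test approves sleep by
# returning 0; any other return value is a veto and the machine stays awake.
may_sleep() {
    for t in "$@"; do
        "$t" || { echo "sleep vetoed by $t"; return 1; }
    done
    return 0
}

# An example test: veto sleep if anything is in the "at" queue.
# (If atq isn't installed the queue is treated as empty.)
at_queue_empty() {
    [ -z "$(atq 2>/dev/null)" ]
}

# Work out when to wake up. All times are in seconds.
#   $1  the time now
#   $2  the time of the next important cron job
#   $3  the maximum sleep duration
wake_time() {
    latest=$(( $1 + $3 ))
    if [ "$2" -lt "$latest" ]; then
        echo "$2"
    else
        echo "$latest"
    fi
}
```

So `may_sleep at_queue_empty` approves sleep only when the “at” queue is empty, and further tests can be added to the list without changing the function.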

## instability :-(

Sigh. Spoke too soon. Yesterday’s reboot seems to have unsettled something: the test machine slept twice and woke twice last night, as normal, but the third time was a failure: I got the “going to sleep” message at 04:21 but it failed to wake up on schedule at 07:44.

The machine now needs some personal attention to examine its state and to get it up single user to copy all the logs before booting into normal running. Oh well; it can wait until I’m next in the building.

I’m wondering if there was something special happening at that time of the morning to make the machine unhappy. I’ll take a look when the machine’s up again.

Meanwhile back to TiBS, whose config I’ve been automating with LCFG and which will no doubt be mentioned more fully here soon.

## More stability

My test 755 had been up for 17 days before I had to reboot it today for a software update. In those 17 days it had been sleeping three times each night and several times during the day at weekends too. And no hangs! And no failures to wake up! All of which is very good. I’m now willing to let the sleep component loose on a test student lab. This’ll have to wait until they’re upgraded to SL5.3 but that’s due to happen this month sometime I think.

## Stability

I’ve come back today from a week’s holiday to find that – to my amazement – the sleep test machine has successfully suspended and resumed a full 48 times without any problems at all. This is incredible considering that this is the machine that can hardly go a night or two without getting into some sort of sticky hang-type situation. Ten nights, three sleeps a night and more at the weekends. Am I just lucky or was this due to something being different? OK, this is what was different:

• I wasn’t here. I normally use the machine remotely for shell access during the day. I can’t see a remote user login session being a cause of suspend problems, particularly when there’s no session on the go during the night which is when the machine does its sleep thing anyway (I have sleep disabled during working hours)?
• The client component wasn’t running. This follows the problem we spotted a week or two ago whereby something else would grab the client component’s port before the client component got to it. (Understandably, as Simon pointed out.) I had started the machine up without the client component as I looked into this problem and had forgotten to start the component before going on holiday. So the machine has been running in ultra-stable mode with no profile changes and no RPM changes. This seems to suit the power management stuff down to the ground. I wonder why…?

I think I’ll leave it another week or so with client running this time, and we’ll see if the stability continues.

The run of good luck meant that I don’t currently have a chance of trying out Simon’s diagnostic suggestions in his comment on my previous entry, but no doubt I shall get to try them out soon enough.

In other news, I’ve solved a couple of problems that were plaguing me before the holiday. Both solutions were really stupid and probably show how much I was needing the holiday:

1. I’d been trying to get a new multipath SAN partition up on one of the web servers. When I tried to make a filesystem on the new partition I’d get “this partition is busy” errors. The solution: I was using the partition’s sd entry in /dev when I should have been using its entry with a big long name in /dev/mpath. It was the system’s own multipath code which was keeping the partition “busy”. The new partition is now happily mounted and filling up with data.
2. The sleep test machines had been configured to mail me the sleep log files every week when logrotate ran. They did this, but they mailed me stuff which was a month out of date. Who wants that…? Turns out that this is the rather odd default behaviour of logrotate: it mails you the logs it’s about to delete – the ones that fall off the end of the weekly conveyor belt – rather than the most recent logs. Adding the logrotate keyword mailfirst to the logrotate recipe has hopefully cured this.

## SMP alternatives: switching to UP code

Sleep failed again on my test 755 last night. It’s been happily sleeping and waking for several days, four times per night, but last night it failed to wake from its second nap. This time it looks as if the machine hung at the point of powering off, or on again.

The power management suspend log (/var/log/pm/suspend.log) has all the suspend messages one could expect to see in it and none of the resume messages:

Fri Jun 12 00:03:03 BST 2009: running suspend hooks.
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/00clear =====
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/05led =====
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/15_915resolution =====
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/20video =====
kernel.acpi_video_flags = 0
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/49bluetooth =====
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/50modules =====
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/55battery =====
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/60sysfont =====
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/65alsa =====
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/94cpufreq =====
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/95led =====
===== Fri Jun 12 00:03:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/98lcfgsleep =====
===== Fri Jun 12 00:03:08 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/99video =====
Fri Jun 12 00:03:08 BST 2009: done running suspend hooks.


No problem there. But there’s a possible clue in *syslog*. This is what it looks like when the machine sleeps – this is from the previous, successful sleep last night, from 18:00 to 23:52:

Jun 11 18:00:08 orkney kernel: Disabling non-boot CPUs ...
Jun 11 18:00:09 orkney kernel: CPU 1 is now offline
Jun 11 18:00:09 orkney kernel: SMP alternatives: switching to UP code
Jun 11 23:52:32 orkney kernel: CPU1 is down


I’ve checked through the logs (grep -A 2 -B 2 'now offline' /var/lcfg/log/syslog*) and every time the machine successfully suspended and resumed, it managed to write CPU 1 is now offline and then SMP alternatives: switching to UP code to syslog before suspending. But at last night’s unsuccessful sleep it didn’t – these are the last entries in syslog before the hang:

Jun 12 00:03:08 orkney kernel: Disabling non-boot CPUs ...
Jun 12 00:03:09 orkney kernel: CPU 1 is now offline


SMP alternatives: switching to UP code is missing. Kernel bug? Time to do some internet searching perhaps.
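
Eyeballing grep output gets tedious, so here is a small sketch which picks out the suspicious suspends automatically. It assumes the two messages always appear on consecutive lines, as they do in the excerpts above; the function name is invented.

```shell
# Hypothetical helper: print every "CPU 1 is now offline" line which is
# NOT immediately followed by "SMP alternatives: switching to UP code".
# Arguments are syslog files, oldest first.
find_incomplete_suspends() {
    awk '
        # The previous line was "now offline" and this one is not the follow-up.
        pending != "" && !/SMP alternatives: switching to UP code/ { print pending }
        { pending = "" }
        /CPU 1 is now offline/ { pending = $0 }
        # The log ended straight after a "now offline" line.
        END { if (pending != "") print pending }
    ' "$@"
}
```

Each line it prints marks a suspend where the kernel never logged the switch to UP code.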

## another client hang, and some sleep debugging hints

The test 755 slept happily last night, three times, and was fine when it woke up this morning.

However on reboot it again got stuck starting the client component. I’ve entered this as bug 146 in the LCFG bug tracker.

I’ve written some guidelines on how to go about debugging sleep-related problems.

## mpu minutes, preparing for dev meeting

I haven’t had much time for the sleep project today. Much of the morning was spent writing up the minutes of this week’s MPU meeting. I see the test 755 didn’t recover from its first sleep attempt again last night; I’ll rescue it and investigate tomorrow.

For this month’s development meeting we’ve been asked to provide a brief summary of what has been achieved in the month since the last meeting and what is intended to be achieved by the next meeting. Here’s what I’ve come up with for the sleep project:

Over the past month the project has concentrated on analysing various
problems related to suspend and resume on the test machines. Some of
these have been solved with changes to the component or to the
resources. As a result of these problems the scope of the project has
progressively narrowed from “look for sleep opportunities on all DICE
desktop machines, all the time” to “look for sleep opportunities on
Dell 745s and 755s, running SL5.3, when no X session is running”.

In the next month I plan to arrange to test the component in at least
one student lab, with a view to deploying it in the student labs for
the next session. I also plan to check that lcfg-sleep cooperates
properly with lcfg-condor on machines that use both.

In the longer term, as OS and hardware support for sleep gradually
improves, as surely it must, we may be able to widen the scope of our
power management once again and spread automatic sleep to more
machines for more of the time. That may not happen as part of this
project however, but more as small incremental developments over the
coming years. For example: once the pm-utils package (which
implements power management) supports the concept of timed sleep with
automatic wakeup it will be possible to integrate it safely with
lcfg-sleep.

Written by Chris Cooke

## another hang, and a bios upgrade attempt

Following George’s comment that it looked to him like amd that was hanging rather than ntp, I’ve switched round the order of the resume hooks so that ntp should start up before amd. That way hopefully amd will start up with a sensible date/time. I’ve not seen the amd component failing to start up before and I haven’t seen the clock going haywire before either, so I’m hoping that the one was the cause of the other. We’ll see if it happens again.

Meanwhile, the very next time the test 755 went to sleep it also didn’t wake up. This time it didn’t seem to be hung in the same way: the last time it happened the client component was managing to report back to the LCFG server, and I could ping the machine, but this time the machine’s LCFG status page reported

Last acknowledgement: 29/05/09 17:50:24

and ping returned “destination unreachable”.

Saving the log files and rebooting, this is what /var/log/pm/suspend.log says this time:

Fri May 29 18:00:03 BST 2009: running suspend hooks.
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/00clear =====
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/05led =====
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/15_915resolution =====
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/20video =====
kernel.acpi_video_flags = 0
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/49bluetooth =====
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/50modules =====
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/55battery =====
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/60sysfont =====
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/65alsa =====
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/94cpufreq =====
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/95led =====
===== Fri May 29 18:00:03 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/98lcfgsleep =====
===== Fri May 29 18:00:09 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/99video =====
Fri May 29 18:00:09 BST 2009: done running suspend hooks.


So all the suspend hooks ran successfully, but then – it hung? When I got to it this morning the machine wasn’t sleeping, as the power button wasn’t flashing, so it’s not simply a case of it not waking up on cue. The power button was solidly on, which is the case when the machine is either hung or running normally. And it wasn’t running normally…

Meanwhile I’ve been trying to upgrade the bios on the test 755. I’ve been trying the method described in the Dell Linux wiki but when I reboot it says that it can’t find the bios.hdr file I’ve loaded in memory.

I think I need to do a “warm boot” rather than the usual (apparently) “cold” one.
The instructions tell you to add “reboot=bios” to the end of your kernel command line, then reboot, then try loading your bios.hdr again then rebooting (when the actual upgrade will then happen). Tried that, get the “can’t find your bios.hdr in memory” message.

Perhaps I should have edited the kernel command line rather than appended to it? Maybe I’ll try that tomorrow.

## intel at last

Quick recap: automatic sleep is working happily with SL5.3 on 745s and 755s if they use the intel video driver. I want to get the 745 working also if it uses the i810 video driver. (The 755 with i810 doesn’t resume reliably.)

So, this morning’s experiments so far:

• Revive a 745 from near death and get it up and running as a healthy-looking DICE machine.
• With intel video driver, sleep it with what I’ve found to be the correct sleep command: /usr/sbin/pm-suspend --quirk-vbemode-restore
• Yes, it resumes cleanly
• Login, and I don’t see the “Resume Problem” error. Good, this is as expected. Logout again.
• Switch to i810
• Sleep with the same command
• It doesn’t resume. Reboot the machine.
• Try again without quirks: /usr/sbin/pm-suspend
• This resumes cleanly.
• Login, and as expected I see the “Resume Problem” error message. Good.

If this behaviour – different quirks for different video drivers for the same model – is representative, it leaves me with the irritating problem of using different sleep commands for the same model depending on which video driver it’s using. It’s irritating because it doesn’t fit the current idea of setting the exact suspend command on a per-model basis in the sleep.h defaults header; also, that header is currently included before the header which sets the video driver so the information about which video driver is in use isn’t available to the sleep defaults header. So I’ll have to do what I did with the business of checking the video driver and somehow get the component itself to decide on the fly what command to use in which circumstance – things will have to be reorganised *again* and further complicated. Gah. I’m getting a bit fed up of redesigning the software to get round bugs in other people’s code.
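To make the “decide on the fly” idea concrete, here is a minimal sketch of a lookup keyed on both model and video driver. It is in Python purely for illustration (the component itself is Perl); the table and function names are hypothetical, and the only two entries are the 745 commands from this morning’s experiments.

```python
# Hypothetical lookup table: (model, video driver) -> suspend command.
# Only these two entries come from the experiments listed above.
SUSPEND_COMMANDS = {
    ("745", "intel"): ["/usr/sbin/pm-suspend", "--quirk-vbemode-restore"],
    ("745", "i810"): ["/usr/sbin/pm-suspend"],
}


def suspend_command(model, driver):
    """Return the suspend command for this model/driver pair, or None
    if no command is known for the combination."""
    return SUSPEND_COMMANDS.get((model, driver))


print(suspend_command("745", "intel"))
print(suspend_command("755", "i810"))
```

Returning None for an unlisted combination gives the caller an obvious reason to refuse to sleep rather than guess at a command.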

But anyway, I can at least now test the suppression of the error message. This can be done with gconf. One way to alter gconf settings is with the command line tool gconftool-2. The gconftool-2 man page mentions the --type or -t option – to specify the type of the data you’re setting a preference key to – but then doesn’t mention it in its list of options. It has some similar looking options though – --list-type, --car-type and --cd-type – but none of them work with -s or --set, the option you use to set a value. And if you use -s without setting a type it tells you “Must specify a type when setting a value”. Luckily --type does turn out to exist, it’s just not listed on the man page. So this is the first non-error-producing command to try to stop the error message you get after you login after what the system thinks is an imperfect suspend and resume:

gconftool-2 -s /apps/gnome-power-manager/notify_hal_error -t bool false

You can check that you’ve changed the value by examining it before and afterwards using -g or --get:

gconftool-2 -g /apps/gnome-power-manager/notify_hal_error

In this case it’ll print out “true” or “false”.

So, after doing this on the test machine, I repeat the suspend (with pm-suspend) and resume. This time it doesn’t resume cleanly.
Blast.
Is this because I’ve changed that gnome setting? Surely not. I’m assuming that sleep and resume on 745/i810/5.3 is just unreliable, it sometimes works and sometimes doesn’t. Maybe I’ll go back later and undo things and try again but for now I’ll have to limit sleep support to the machines using the intel driver.

Later. I switched the same machine back to the intel driver and then left it. When I went back to it an hour or two later it had gone to sleep, but had hung. So it hangs when using previously reliable resume commands with both the i810 and intel drivers. I’d say there’s something wrong with that machine. Right enough when I revived it earlier today it had had filesystem damage; I repaired that with fsck but perhaps that wasn’t enough. I’ve now initiated a complete reinstall, with fresh filesystems, to see if that changes it back to the expected behaviour.

In the meantime I tried a different tack: to find out why we can’t move to the intel driver and try to shift that barrier. My memory was that we had stuck with the i810 driver because that was the only one which worked with our old and creaky version of Webots which is needed for teaching. Stephen confirms this memory. I talked to Graham, Mr. Webots, and it turns out that he now has authorisation to move us up to webots version 6 which doesn’t exhibit any of the bad behaviour of the elderly version we have. Hooray! He’s optimistic that webots v6 will work with the intel driver on 5.3 745s and 755s, but he’ll test it to check. In the meantime, he points out, we can change the 5.3 machines to the intel driver anyway as no 5.3 machines are yet used for teaching, and Stephen adds that webots isn’t going to be needed anyway until at least September. Excellent! So I’ve altered the dell_optiplex_gx745.h and dell_optiplex_755.h headers to exclude develop machines from the inclusion of lcfg/options/video_i810.h. This seems to have the desired effect on a test 5.3 745: /etc/X11/xorg.conf is rebuilt with no mention of “i810” and one mention of “intel” drivers. Thus my lcfg-sleep test pool has gone up from 3-4 machines to 30-40 at least. Excellent. Perhaps it’s about time I figured out how to monitor their sleep patterns then. For now I’ve changed the sleep.ng_logrotate resource in dice/options/sleep.h to have them mailed to me until I figure out something more satisfactory. 30-40 machines mailing me two log files once a week, shouldn’t be too bad.

A quick inspection of the sleep log on a random test machine revealed that the component still hadn’t started, so I’ve gone round all of the test machines and started it. On most it hadn’t started but started successfully at my command. Some were down or unavailable, half a dozen or so were already running it.

## What I’ve been doing in May 09

We had a technical Strategy Meeting this morning, an unusual event. I’d been dreading it, fearing a lack of stargazing technical foresight on my part, but as it happened the meeting was excellent, and I came away from it with two things: a massive boost to my morale (which had been flagging a bit since learning that I’m about to be chucked out of my bearable temporary office) and a determination to revive the habit of writing down at least every day pretty much everything I’ve been up to at work that day: the successes, the failures, the error messages, the things I found out and the things that I do and don’t understand. Documenting the successes is great for me (and for others) as an aide memoire. Documenting the failures is even more important though in two ways: firstly it can often help to clear my thoughts, but secondly and maybe more importantly other people can read about my problems and chip in with their own ideas, they can rally round and help. Far better than just suffering in silence, I think.

In other words this amounts to the usual “I’m going to revive this blog” post.

Next week the sleep project is going to have to face the music as regards deadlines and the missing thereof at the monthly development meeting. This time Tim has asked us to produce a short summary of what happened in the last month with our projects and what might happen in the next month. To try to get it straight I’ve written down here what happened in the last month using the MPU weekly minutes as my guide. It’s probably rather longer than Tim asked for but if necessary I’ll summarise later. For now, I need to get this all down on paper dots.

Worked out how to successfully suspend and resume a Dell Optiplex 755 (the exact command is different for each model).

The test 745 is hanging sporadically. This is related to the latest kernel: it doesn’t happen on a stable 5.2 machine, but does happen on both a develop 5.3 machine and on a stable 5.2 machine with the latest test kernel (the one on develop 5.3 machines). This turned out to be a bug with VGA support, it happens only if you use a VGA cable, machines with DVI-connected displays are fine. It appears that our 745s and 755s are pretty much universally connected with DVI, but all other models have at least some machines with VGA cables, so I’ve disabled lcfg-sleep for all models earlier than a 745.

Timed wake-up occasionally fails to happen with 5.3. Very occasionally. Haven’t reproduced this again so will ignore it for now, not a big problem.

Alastair is trying out lcfg-sleep on his T3400. It seems to sleep and resume perfectly well but gnome pops up an error message when you login after a resume. We’re going to try using gconf (and if that’s successful, lcfg-gconf) to suppress this error message.

Tim tried out lcfg-sleep and found a gap in its idleness checking. It was checking for low load average and for all interactive shells being idle, but a long session on a text editor can fulfill both of these, and Tim’s machine went to sleep while he was typing. So I had a rethink and (to cut a long story short) now reckon that we should let X session managers decide for themselves when the machine is idle. They turn out to be much better at it. In Gnome, gnome-power-manager does this perfectly well, and can be configured to suspend the machine a number of minutes after the machine becomes idle. We can use gconf and James Jarvis’s lcfg-gconf to set this as default behaviour for all gnome users. (User preferences can override default behaviour. If necessary gconf can also set mandatory settings which cannot be overridden by user preferences.) This takes care of idleness detection and suspend; resume will still be handled by the sleep component, which will continue to run every minute or two and will ensure that the machine always has a suitable wake time set. The machine will therefore not miss any important cron jobs no matter what puts it to sleep. (Coping with user-initiated suspend was always part of the sleep component design anyway.)

This led me to alter the component so that by default it looks for the presence of a user X session on the machine, and vetoes sleep if it finds one. This works. You can use the sleep.overridesessionmanager resource to override this behaviour and sleep even when somebody is logged in, but given Tim’s experience I don’t recommend it, unless you do it in conjunction with idleness timeouts long enough that they’re likely to be triggered only at night and weekends.

While working on the above I explored a sizeable blind alley or two. One was DBus: gnome-power-manager sends out signals using DBus when for instance it decides that the session is idle, and other apps can subscribe to those signals and react, or send messages back to gnome-power-manager asking it to change its behaviour. So for instance a movie player application could use DBus to inhibit the screen saver while a film was playing. But DBus has more than one bus per machine: it has one system bus but it also has a separate session bus for each user session. An LCFG component could probably get access to the system bus, being a thing running as root, but not to any of the session buses. And guess what, gnome-power-manager and other gnome stuff all uses the session bus: the idea is that only the user’s own gnome (and other DBus-using) apps will be legitimately interested in what the user is or isn’t doing. So the effect is that the lcfg-sleep component can’t possibly ever see any signals from gnome-power-manager – the “Hey, I’m idle now” signal for instance – so can’t use that info or send any influence back to gnome-power-manager either. Sigh. I briefly wondered about having each user launch some sort of home-cooked program which could subscribe to the DBus session bus and have that act as a gateway to/from lcfg-sleep but quickly gave that up as a ghastly idea and probably a horrible security risk.

Another blind alley was monitoring USB keyboards or mice for signs of activity – you’d need to have something running constantly rather than having something you could just query now and then to find out what the situation was, which is the model that the sleep component has been using. And having something monitoring keystrokes doesn’t sound great from a security point of view.

I had another partial failure: you can configure gnome-power-manager to suspend the machine after a certain period of inactivity. This is easily done using the gnome app (select it on the menu – More Preferences I think? – to put it onto the top menu bar in gnome) which lets you control how long until idle, how long from then until suspend, that sort of thing. This works very nicely. Since lcfg-sleep runs every few minutes, and writes a suitable wake-up time to the machine every time it runs whether it decides to sleep the machine or not, you can have gnome-power-manager send the machine to sleep and then have the machine wake up automatically in time to run the next cron job. But here’s where it gets complicated: when the machine wakes up again, it hasn’t been woken by a button press or in fact by anything which gnome-power-manager is looking out for, so gnome-power-manager doesn’t go through the process of realising “Oh, I’m not idle any more” – so it never becomes idle, never suspends the machine again, and so you only ever get one period of sleep per period of idleness. The machine won’t be slept again until some sort of manual intervention happens – the user comes back and plays with the mouse or whatever, triggering the machine to realise that it’s not idle any more. Until that happens, the machine just sits awake. I don’t know if there might be some way of triggering gnome-power-manager to realise that the machine has woken up and isn’t idle any more?

Incidentally, when letting gnome-power-manager initiate the suspend, rather than the sleep component, you need to be careful to add the desired video quirks for each model to the sleep quirk database, since they can’t be invoked on the command line as lcfg-sleep does it. The sleep quirk database is a bunch of .fdi files under the /usr/share/hal/fdi directory. The existing files there don’t mention many of our models and are quite out of date; we’d have to add model info for our models, perhaps by upgrading to a more recent HAL version or perhaps just by tweaking the files in the existing version.

Another thing I looked at was the possibility of getting gnome-power-manager to send the machine to sleep with an instruction to wake up after a certain period of time. This is referred to in one of the gnome-power-manager docs or web pages I came across. I looked at the source for the version we’re using, 2.16, and found a stub where a future developer might add code to do the timed wake-up – but no actual existing facility to do it. Blast. Checking the source of the latest version, 2.26, the stub and all the code around it appears to have been swept away in a significant rewrite, and I couldn’t see any sign of automatic/timed wake-up anywhere in the new version.

Another limitation I had to place on sleep was to mandate the use of the intel video driver, rather than the i810 driver we’re currently using on 745s and 755s. Suspend and resume with i810 was just too unreliable. Thus the only machines getting an active sleep component are those 745s and 755s which are both on the develop release and happen to have been specially configured to override the default video driver by including dice/options/video_intel.h. Doing this turned out to be awkward with dice/options/video_intel.h being included in profiles *after* dice/options/sleep.h but I got round it by giving the component the capability to find out for itself what X video driver is in use (new resource sleep.actualvideo, which inherits from xfree.device_main) and compare that to a list of acceptable drivers (sleep.approvedvideo) when evaluating whether or not it’s safe to send the machine to sleep.
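Put together, the X session veto and the approved-driver check amount to something like the following. This is only a sketch, in Python rather than the component’s Perl; the function name and argument shapes are made up, and the arguments merely stand in for the sleep.actualvideo, sleep.approvedvideo and sleep.overridesessionmanager resources.

```python
def sleep_vetoes(x_session_present, actual_video, approved_video,
                 override_session_manager=False):
    """Return the reasons not to sleep; an empty list means no veto."""
    reasons = []
    # A user X session vetoes sleep unless the override is switched on.
    if x_session_present and not override_session_manager:
        reasons.append("user X session present")
    # The X video driver in use must be on the approved list.
    if actual_video not in approved_video:
        reasons.append("video driver %r not approved" % actual_video)
    return reasons


print(sleep_vetoes(False, "intel", ["intel"]))
print(sleep_vetoes(True, "i810", ["intel"]))
```

Collecting reasons rather than returning a bare yes or no makes it easy to log why a machine stayed awake.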

There’s a possibility that we can safely re-enable lcfg-sleep for 745s with i810 since the problem there was simply Gnome’s popup error message which we now know how to suppress using gconf.

I’ve tried i810 with the 755 with 5.3 and that’s not a possibility: the 755 doesn’t resume properly whatever I try when using i810 – it only works with a combination of the intel driver and the right video quirk option.

What I intend to do next is try to get the 5.3 develop 745 with i810 and DVI cables resuming happily with the same sleep quirk as with the same machine with the intel driver; and if that works, re-enable lcfg-sleep for i810-using 745s. Then I’ll monitor each machine actively running lcfg-sleep and leave things for a while to see how they go. With any luck I should be able to get on with other work while that’s happening.

There are two other areas of upcoming work: bugzilla and TiBS. The bugzilla work will be to revive the move of bugzilla.inf.ed.ac.uk to bugzilla version 3. This is the version the LCFG bugzilla uses so we’ve already done the work of upgrading the local infrastructure to cope. It’ll just be a matter of getting the new server up and running, copying over the data and reproducing the configuration. TiBS is the new all-singing all-dancing backup software we’re running and the task will be to automate appropriate bits of its configuration and use using LCFG. Craig’s driving that project, so it’ll be good to do it in conjunction with him (projects with other folk involved are both easier and more fun than doing it all by yourself) but so far not much has happened except that I’ve looked a bit at things and found (and been told) that the software is rather messy, our configuration is rather messy, the documentation we’ve been given is incomplete and out of date, the software has foibles and problems that we seem to only discover by running full tilt into them with our live backup service, and that our primary expert in all of this left us several months ago. Aside from that the TiBS project is going really well and it’s going to be hunky-dory and fab.

## Go with the flow

I’ve been thinking about and exploring possible solutions to the problem I mentioned in my last post.

The more I read about DBus, Gnome and HAL the more I realise that I should be working with them rather than subverting them with lcfg-sleep. They detect the idleness of a user’s X session far more easily than I can, they have sophisticated tie-ins with other software via DBus (a common example is that a DVD player application can ask the screensaver to refrain from screen-saving while a film is playing), and gnome-power-manager has facilities I could use for setting default (overridden by user preferences) and mandatory (overrides user preferences) behaviour in matters such as automatically suspending a machine after the session has become idle. They also support use of the same display quirks which I’ve had to use with lcfg-sleep to get displays up and running again at resume time – and today I’ve tried enabling quirks in HAL then suspending with Gnome’s suspend method and they work, they bring the display back to a working state after resume, which isn’t what happens when the right quirk is not enabled.

Anyway, enough about quirks and stuff; the main idea is this. When it runs, lcfg-sleep should detect whether or not there’s a user X session, and if there is, lcfg-sleep will refrain from suspending the machine. Instead it’ll back off to fallback behaviour of just calculating a suitable wake time for the machine and writing that to the kernel wake time file. Gnome-power-manager can be trusted to do the idleness detection and suspension. We can set default preferences for it to suspend the machine a suitable number of minutes after it judges the session to have become idle. Resume will happen thanks to the wake time which lcfg-sleep will have written to the kernel alarm file before suspend. lcfg-sleep doesn’t get triggered by gnome-power-manager or anything, it just runs regularly, every minute or two, and writes a suitable time to the kernel alarm file every time it runs; so there should always be a correct wake time (for example, in the future rather than the past) in the kernel alarm file. I’ve tested using gnome-power-manager’s suspension in conjunction with lcfg-sleep’s waketime and it works. (Well, it works as long as you set the proper display quirks in a HAL quirk database file anyway.)
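The “always a correct wake time” part could be sketched like this. It is Python for illustration only, not the component’s Perl. The function name is hypothetical, and writing the time as seconds since the epoch is an assumption, since the real file and its format depend on the kernel version. The demonstration writes to a temporary file rather than a real kernel alarm file.

```python
import os
import tempfile


def write_wake_time(alarm_file, wake_epoch, now_epoch):
    """Write wake_epoch to alarm_file, refusing a time that is not
    in the future. Epoch seconds are an assumed format."""
    if wake_epoch <= now_epoch:
        raise ValueError("wake time is not in the future")
    with open(alarm_file, "w") as handle:
        handle.write("%d\n" % wake_epoch)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wakealarm")  # stand-in for the real file
    write_wake_time(path, 1243616400, 1243612800)
    with open(path) as handle:
        written = handle.read()
    try:
        write_wake_time(path, 1243612800, 1243616400)
        rejected = False
    except ValueError:
        rejected = True

print(repr(written), rejected)
```

Because the component runs every minute or two, each run simply overwrites whatever wake time the previous run left behind.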

When there’s no user X session detected, lcfg-sleep can revert to its fuller behaviour of triggering suspend as well as setting wakeup time.

We could even perhaps base relevant gnome-power-manager default values on similar lcfg-sleep resource values. The gnome default values are set using gconf, and James has a handy lcfg-gconf component in the repository…

May 12, 2009 at 4:04 pm

## checking for idleness via DBus

As you may know I’ve been working on the lcfg-sleep component.
It’s in a testing phase at the moment. I had thought that development work on it had finished, and that the main thing still to sort out was various OS bugs. However Tim has been trying it out and has encountered a problem I hadn’t – in retrospect I suppose this is unsurprising as I’d just been using it on remote machines, whereas Tim tried it out on his main work desktop!

What he found was that the idle detection was inadequate. He’d be sitting there happily editing a document with a text editor and the machine would go to sleep on him. Currently the component checks to see that all interactive login shells have an idle time of more than a specified duration (our default is 10 minutes) and that all three load average figures are below a certain level (I’ve set it to 0.1). A machine whose user is just doing light text editing can easily fulfil both of these criteria. A text editor running in the terminal window updates the interactive shell’s idle time, but one running in a separate window of its own doesn’t.
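The current test, and the hole in it, can be shown in a few lines. This is an illustrative Python sketch rather than the component’s Perl; the function name and the way the idle times and load figures are passed in are assumptions, while the 10 minute and 0.1 thresholds are the defaults mentioned above.

```python
def looks_idle(shell_idle_seconds, load_averages,
               min_idle_seconds=600, max_load=0.1):
    """True if every interactive shell has been idle for longer than
    min_idle_seconds and every load average is below max_load."""
    shells_idle = all(idle > min_idle_seconds for idle in shell_idle_seconds)
    load_low = all(load < max_load for load in load_averages)
    return shells_idle and load_low


# Somebody typing in an editor window of its own: the shell has not been
# touched for an hour and the load is tiny, so the test says "idle".
print(looks_idle([3600], [0.02, 0.03, 0.05]))
# Somebody typing in a terminal: the shell's idle time is short.
print(looks_idle([30], [0.02, 0.03, 0.05]))
```

Neither input says anything about keyboard or mouse activity outside a shell, which is exactly the gap Tim ran into.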

Something needs to be added. After casting about for some method of checking for keyboard or mouse activity I’ve been thinking of checking the Gnome Screensaver status – it can tell you whether the session is currently idle or not, for instance. This turns out to be doable via DBus, the interprocess communication bus used by Gnome and (in the next major version) by KDE. For instance I can use DBus to simply ask the screensaver whether it currently considers the session to be idle or not. Great. Gnome is the default at our site so hopefully we’d still get a big win on our electricity bill even if we refused to sleep on machines with a current X session ongoing but which weren’t running Gnome.

There’s a problem though. DBus has more than one bus: the machine has one system bus and each user login session has its own session bus. Gnome Screensaver uses the session bus. You can therefore talk to the screensaver from a process in the same session, but it seems that you can’t from a process which isn’t part of that session – like an LCFG component. So how to get the information (whether or not the screensaver considers the session idle) to the LCFG component?

Here is the (unstable!) Gnome Screensaver DBus API, and here’s the homepage.

Presumably you could (somehow) get every user session to run a wee process which queried gnome screensaver, or looked for “idle” announcements for it, then (say) wrote the current status to a file somewhere; but it all sounds terribly yucky and fraught with problems.

## Sleep does cron

It’s been a while since I said what was happening with the Sleep project, if I ever did, so here’s a catch-up.

It’s an LCFG component which will run on DICE desktop machines. The idea is to save money by saving electricity. The component will decide whether or not it’s safe to put the machine into a sleep state, and if it is, send the machine to sleep.

The component will attempt to make sure that the machine doesn’t miss any cron jobs, so before sleeping the machine it’ll calculate when the machine should wake for the next cron job, and set the machine’s wake time appropriately.
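As a rough illustration of the calculation, here is a sketch that handles only jobs run once a day at a fixed hour and minute. It is Python rather than the component’s Perl, the function names and the five minute lead are assumptions, and it uses naive datetimes, so it knows nothing about summer time.

```python
from datetime import datetime, timedelta


def next_daily_run(hour, minute, now):
    """Next occurrence of a job that runs every day at hour:minute."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def wake_time(jobs, now, lead=timedelta(minutes=5)):
    """Wake `lead` before the earliest upcoming job in `jobs`,
    a list of (hour, minute) pairs."""
    earliest = min(next_daily_run(hour, minute, now) for hour, minute in jobs)
    return earliest - lead


now = datetime(2009, 5, 29, 18, 0)
print(wake_time([(4, 2), (22, 30)], now))
```

Real crontab entries have five time fields with ranges, lists and steps, so the actual calculation is a good deal more involved than this.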

At the moment the sleep component successfully handles the cron aspect of the job, sets the wake time, and puts the machine to sleep. It doesn’t yet judge whether or when putting the machine to sleep might be a good idea.

It has these resources: cronfiles (a list of files where cron jobs might be lurking – for instance /etc/crontab); crondirs (a list of directories in which to look for more cron files – for instance /var/spool/cron); suspendcommand; waketimefile (where the system looks for a wake time – this varies according to kernel version).
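The cronfiles and crondirs resources boil down to a list of files to read. A minimal sketch of gathering that list, in Python for illustration rather than the component’s Perl, with a hypothetical function name and a temporary directory standing in for places like /etc/crontab and /var/spool/cron:

```python
import os
import tempfile


def cron_sources(cronfiles, crondirs):
    """Return every existing file named in cronfiles, followed by every
    regular file found directly inside the directories in crondirs."""
    found = [path for path in cronfiles if os.path.isfile(path)]
    for directory in crondirs:
        if not os.path.isdir(directory):
            continue
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if os.path.isfile(path):
                found.append(path)
    return found


with tempfile.TemporaryDirectory() as tmp:
    crontab = os.path.join(tmp, "crontab")
    spool = os.path.join(tmp, "spool")
    os.mkdir(spool)
    for path in (crontab, os.path.join(spool, "alice"), os.path.join(spool, "bob")):
        with open(path, "w") as handle:
            handle.write("0 4 * * * /bin/true\n")
    found = cron_sources([crontab, os.path.join(tmp, "missing")], [spool])
    names = [os.path.basename(path) for path in found]

print(names)
```

Missing files and directories are skipped quietly, so the same resource values can be shared by machines that do not have every location.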

It’s written in Perl, which I’ve studiously avoided for years but which turns out to be far more manageable than before now that I know about O’Reilly’s Perl Cookbook – so much more clear and helpful for the impatient would-be Perl programmer than the desperately irritating Programming Perl, with its acres of intrusive jokes and its hundreds of irrelevant and outlandish clever edge cases which I really don’t want to know about when I just want a reminder and clear explanation of how to do something, dammit.

Hmm. I’ve just realised that although it handles cron jobs successfully, the component doesn’t know about at jobs at all.

Drat!

## DateTime and summer time

I’ve been looking for a way of comparing the current time with the times of upcoming cron jobs. It looks as if you can do this using modules from the Perl DateTime project. This fabulous collection of modules lets you represent times and dates in pretty much any way you want (including “the little hand is pointing to the twelve and the big hand is pointing to the nine” using DateTime::Format::Baby) and manipulate them in all sorts of ways. You can declare durations, you can do arithmetical operations on dates and durations to get other dates or durations, you can declare time spans and find out whether other dates are in them. One thing which really impressed me is that you can declare sets of dates or timespans then use set operations (union, complement, intersection) between the sets of dates or timespans.

Amazing.

Anyway, it has a handy module called DateTime::Event::Cron which understands times/dates expressed in crontab format, which sounds perfect.

However, after reading the docs I’ve started worrying about details.

Cron isn’t aware of summer time changes – that is, it doesn’t know about them in advance. Instead it reacts when it spots that the time has changed under it. man cron says:

Local time changes of less than three hours, such as those caused by
the start or end of Daylight Saving Time, are handled specially. This
only applies to jobs that run at a specific time and jobs that are run
with a granularity greater than one hour. Jobs that run more
frequently are scheduled normally.

If time has moved forward, those jobs that would have run in the
interval that has been skipped will be run immediately. Conversely, if time
has moved backward, care is taken to avoid running jobs twice.

Time changes of more than 3 hours are considered to be corrections to
the clock or timezone, and the new time is used immediately.

DateTime can cope with summertime clock changes when it’s told to use a particular timezone. However, if you try to specify a date that doesn’t exist in that timezone, such as during the lost hour in spring, DateTime will fall over with a fatal error.

Since cron is quite happy to have jobs scheduled during the missing hour, we won’t be able to simply take all the cron times and shove them into DateTime. We’ll have to filter them first. (Unless I’m just not familiar enough with Perl: is a fatal error a big deal? Could my code easily trap it and carry on?)

The DateTime man page recommends getting round summer time problems by using UTC for all the calculations and using the local time zone just for input and presentation. But this wouldn’t help in our case, as the cron times are effectively expressed in the local time zone to start with and may be invalid, in which case the attempt to convert them to UTC will make DateTime fall over with a fatal error.

I’d rather not have a fatal error every year.

One option is simply to look out for the days on which summer time changes. Wikipedia says that since 1996 European Summer Time has been observed from the last Sunday in March to the last Sunday in October. One could put special checks for those dates in the code. DateTime makes it easy to represent dates like “the last Sunday in March”.
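Finding those two dates is a small calculation. Here it is sketched in Python with the standard library, purely as an illustration of the arithmetic; it is not DateTime code and the function name is made up.

```python
import calendar
from datetime import date, timedelta


def last_sunday(year, month):
    """Return the date of the last Sunday in the given month."""
    last_day = calendar.monthrange(year, month)[1]
    day = date(year, month, last_day)
    # weekday() counts Monday as 0 and Sunday as 6, so step back
    # however many days separate the month's last day from a Sunday.
    return day - timedelta(days=(day.weekday() - 6) % 7)


print(last_sunday(2009, 3))
print(last_sunday(2009, 10))
```

A check against these two dates would tell the component when its predictions about the next cron job need extra care.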

All this assumes specific use of the local time zone. I’m not yet clear on how using the default “floating” time zone might change things. However, I have a suspicion that it’ll just behave as cron does, and be taken by surprise by summer time changes. That would be ideal, except that it’d mean that its predictions about how long there is to go until the next cron job is due to run could be totally wrong a couple of times a year.
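
To show the size of that error, here is a small sketch in Python, not Perl's DateTime. Python's zone-less ("naive") datetimes stand in for floating times purely as an analogy, and the fixed GMT and BST offsets are written out by hand for a UK machine.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for the UK either side of the spring change (assumed here).
GMT = timezone(timedelta(hours=0))
BST = timezone(timedelta(hours=1))

# Wall-clock times on the night the clocks go forward in 2024.
now_wall = datetime(2024, 3, 31, 0, 30)   # 00:30, still GMT
job_wall = datetime(2024, 3, 31, 3, 30)   # 03:30, already BST

# Zone-less arithmetic just subtracts the wall-clock readings.
naive_wait = job_wall - now_wall

# Attaching the real offsets gives the time that actually elapses.
real_wait = job_wall.replace(tzinfo=BST) - now_wall.replace(tzinfo=GMT)

print(naive_wait, real_wait)
```

The zone-less answer is three hours, but only two hours really pass, so a wake alarm set from the zone-less figure would fire an hour after the job was due.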

## Sleep & cron brainstorm notes

Today I talked over the problem of sleeping and cron jobs with Alastair and Stephen. Some helpful points were raised, and we came up with a basic behaviour which the system should have.

Points:

• Don’t forget to take users’ cron jobs into account. Users might want things run at particular times.
• More power cycles (especially disk spin-ups) will shorten the machine’s life, so don’t do it too often. (So, for instance, if we were waking up periodically to check for Condor jobs to run, we could perhaps wake up say every half dozen hours instead of hourly.)
• If running things at wake-up time: if the user has woken us up, back off and wait a bit, don’t start an avalanche of maintenance jobs going immediately, let the user be able to use the machine’s resources.
• Surely current distributions used on laptops must take some account of missed cron jobs? Check the PM hook scripts on e.g. Ubuntu to see what happens there. (I’ve just had another look on Ubuntu and I see that I missed the ACPI hook scripts: there are a lot of them! Ubuntu starts anacron when it wakes and the machine’s on AC power; but there’s nothing which checks your actual cron.)
• An incidental point: does the SL5 kernel support Condor’s checking for recent USB keyboard/mouse activity?
• The cron component could spot (e.g. at wake-up) what jobs it had missed and decide to rerun some.
• Alternatively we could get the machine to wake up in time to run cron jobs.
• Remember to allow for “at” too. For instance, the autoreboot component uses it!
• We could check cron.hourly/daily/etc. at wake-up time.
• Could parse crontabs to find out what times things will be run.

At this point we decided that the simplest and cleanest thing to do seemed to be:

• Something needs to parse the crontabs to find out when things are due to run, so that we can calculate for instance how soon from now something is due to run. It’s difficult to see how we could do without this. But there must be a Perl module somewhere which does this for you – it can’t be that hard.
• “Is anything due to run in the next X minutes” could then be one of the questions we ask when deciding whether it’s currently a good time to put the machine to sleep.
• We could also then simply wake up in time to run every cron job. Don’t bother to distinguish between “important” and “unimportant” jobs. Just wake up for every one, and if there are so many that the machine never gets to sleep, let’s tidy the cron jobs – rather than trying to tip-toe through the forest of jobs in some complicated selective way.
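
The first two points can be sketched roughly as follows. This is an illustration in Python, not the Perl module hoped for above, and all the names (`parse_field`, `next_run`, `due_within`) are invented for it. It assumes plain five-field cron expressions using only numbers, `*`, lists, ranges and steps. Names such as `mon`, `@daily`-style shortcuts and time zone changes are not handled.

```python
from datetime import datetime, timedelta

# (lowest, highest) value for minute, hour, day of month, month, day of week.
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 7)]


def parse_field(text, lo, hi):
    """Expand one cron field such as '*/15' or '1-5,9' into a set of ints."""
    values = set()
    for part in text.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            start, end = lo, hi
        elif "-" in part:
            start, end = (int(x) for x in part.split("-"))
        else:
            start = end = int(part)
        values.update(range(start, end + 1, step))
    return values


def parse_schedule(expr):
    """Parse the five time fields of a crontab line."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected five time fields: %r" % expr)
    sets = [parse_field(f, lo, hi) for f, (lo, hi) in zip(fields, FIELD_RANGES)]
    sets[4] = {v % 7 for v in sets[4]}  # cron accepts both 0 and 7 for Sunday
    # Remember whether the two day fields were restricted (see matches()).
    return sets, fields[2] != "*", fields[4] != "*"


def matches(schedule, when):
    """True if the parsed schedule fires at the minute given by 'when'."""
    (minute, hour, dom, month, dow), dom_set, dow_set = schedule
    cron_dow = (when.weekday() + 1) % 7  # Python: Monday=0; cron: Sunday=0
    if dom_set and dow_set:
        # cron runs the job if either day field matches when both are given.
        day_ok = when.day in dom or cron_dow in dow
    else:
        day_ok = when.day in dom and cron_dow in dow
    return (when.minute in minute and when.hour in hour
            and when.month in month and day_ok)


def next_run(expr, after, limit=timedelta(days=366)):
    """First minute strictly after 'after' at which expr fires, or None."""
    schedule = parse_schedule(expr)
    t = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    end = after + limit
    while t <= end:
        if matches(schedule, t):
            return t
        t += timedelta(minutes=1)
    return None


def due_within(exprs, now, minutes):
    """Is anything due to run in the next 'minutes' minutes?"""
    window = timedelta(minutes=minutes)
    return any(next_run(e, now, limit=window) is not None for e in exprs)
```

Stepping minute by minute is slow but easy to check by eye. `due_within` would answer the "is it a good time to sleep?" question, and the earliest `next_run` across all crontab lines would give the time for the wake alarm.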

Previously I had vaguely envisaged some sort of semi-intelligent free-wheeling behaviour whereby a machine might for instance wake up to check Condor, find that it was night time, and decide to run its nightly maintenance jobs then, rather than having them run at the same particular time every night.
However we don’t need to have this sort of behaviour, so let’s get the sleep system up and running without it for the moment.

## cron and periodic etc. etc.

Just a brief follow-up to yesterday’s post. After I wrote it I was reminded of some advice of Simon’s: don’t add LCFG bells and whistles unnecessarily. Use existing standard mechanisms where you can.

In this case that would mean requiring our components and software which wanted to run at wake-up or sleepy-time (there must be more technical terms for these, but I like these) to drop a wee script of their own into the /etc/pm/hooks directory. Forget, for instance, all this stuff about a special component running at wake-up time; it’s an attempt to simplify things but it’ll just complicate them instead.

## cron and periodic maintenance tasks on a possibly sleeping machine

When the sleep component is up and running, a DICE machine might spend most of its time sleeping, only waking up every few hours to do essential things for a few minutes before going back to sleep again. Then again, it might only go to sleep at night. Or it might be used so heavily (for example by Condor) that it rarely gets a chance to sleep from one week to the next.

I’ve been trying to figure out how best to manage cron jobs on such a machine.

We have to manage these things:

1. The scripts in the distro-provided directories /etc/cron.hourly, /etc/cron.daily, /etc/cron.weekly, /etc/cron.monthly need to be run. At the moment these directories are run by cron at specific times, during which our machine might be asleep.
2. Also, those scripts need to be run the right number of times. We don’t want to run them too often.
3. LCFG-configured periodic jobs (the current cron.* resources) need to be run too. For example the boot component is run once per night, and the openldap component runs once per hour.

You can see what a typical DICE desktop runs from cron on a typical day here:

http://wiki.inf.ed.ac.uk/view/DICE/PowerManagementReport#Cron_and_Anacron

It’d be nice to have clean, simple code managing all of this, so we’d make use of standard ways of doing things, and we’d have more small, simple components. It’d also be nice to have as little duplication as possible in the LCFG resources (it seems confusing to need both cron resources and when-I'm-awake resources for doing periodic tasks). And it’d be nice to have one software system just take care of all of this for us and give us a fairly simple interface so we didn’t need to worry about the interaction between (say) maximum sleep periods before wake, and minimum times between successive runs of a periodic task. It’d be nice to minimise the change required from the current system too, as we have hundreds of cron resources in the header files. And it’d be nice to have clear, simple, defined roles for the different components of the system, with clear and obvious interactions between them.

But all this seems horribly contradictory. If you simplify things in one way you seem to complicate them in others. For example if you try to keep the existing cron resources, you force the cron component to do things grossly different from what it was designed for: it’d need to monitor whether or not cron had done things, or how long the machine had been sleeping, rather than just making a few config files. And since you’d also need to add some extra resources to cover asynchronous operation (for instance whether or not it was acceptable to run a missed-due-to-sleep job an hour after it was missed), the resources get complicated and need changing anyway. The cron component turns into a hideous monster.

Perhaps it could be replaced by one unified “tasks runner” which Just Takes Care Of It for you – you tell it to run things so many times per time period and the component then figures out what to do, and runs as a cron-type daemon, keeping timestamps and goodness knows what. This sounds a bit like the hideous monster cron component above, except possibly cleaner and simpler as it would be designed from scratch. But this seems to chuck “do it the standard way” and “make use of standard software and components” right out of the window.

Maybe we could keep the cron resources, but have a second “asynchronous cron” component inherit their values? It could peek to see when or whether things have been run or not; it could deduce when to run things based on the times given in the cron resources; and so on. Sounds complicated to write. It’d have to understand cron times and be able to judge when or whether to run them. Presumably it’d have to wait until cron had failed to run something then it’d step in afterwards and run it instead. But how long afterwards would be acceptable? The more you think about this the more disgusting it sounds.

Should the component which manages sleep also be in charge of kicking off periodic jobs? If it knows when it’s going to send the machine to sleep and when it’s going to tell the machine to wake up again, it’ll be able to use that information to figure out when to run periodic tasks. But it sounds as if this would be rather confusing to write.

Should we have a component completely separate from cron and from the sleep component? Call it awake for instance. awake would run when the machine woke up. I envisage a machine with Condor waking up at least every few hours, and a machine without Condor waking up at least once a day, so awake would be able to run things reliably say once a day. But that’d be dependent on whatever component manages sleep being told to make sure that the machine woke up often enough. Is it OK to leave that sort of coordination task to the sys admin to manage by hand? Sensible defaults could be set anyway.

But wouldn’t that need to run things at some other time as well as whenever the machine woke? Say we have a machine that’s so popular with Condor that it never gets a chance to sleep. If it never sleeps, it’ll never wake, and if it never wakes, it’ll never run our stuff, like the nightly run of the boot component for instance.

We need some sort of coordination between running things at wake-up time and running things regularly from cron; something which makes sure that things are run often enough but not too often.
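
One way to picture "often enough but not too often" is a pair of limits per task, checked both at wake-up and on a regular tick. The sketch below is only an illustration in Python. `PeriodicTask`, `min_gap`, `max_gap` and `should_run` are hypothetical names, not LCFG resources or anything the existing components provide.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class PeriodicTask:
    name: str
    min_gap: timedelta   # never run more often than this
    max_gap: timedelta   # never leave it longer than this
    last_run: Optional[datetime] = None


def should_run(task: PeriodicTask, now: datetime, just_woke: bool) -> bool:
    """Decide whether to run a task now.

    Called both at wake-up (just_woke=True) and from a regular tick
    (just_woke=False), so a machine that never sleeps is still covered.
    """
    if task.last_run is None:
        return True                      # never run before
    elapsed = now - task.last_run
    if elapsed >= task.max_gap:
        return True                      # overdue, whether or not we slept
    if just_woke and elapsed >= task.min_gap:
        return True                      # opportunistic run at wake-up
    return False
```

With a nightly task given a `min_gap` of 20 hours and a `max_gap` of 28 hours, a machine that wakes every few hours would run it about once a day, and a machine kept permanently busy by Condor would still run it from the regular tick once the 28 hours were up.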

I’ve spent the last day or so dreaming up ever more baroque ways in which this shouldn’t be done, and failing to come up with a simple-sounding solution. I’m sure there were several other ideas more ghastly than the ones above.

The last couple do seem to have the most potential, though.

Help? Any ideas or observations?

Oh, and PS: I forgot to say earlier. anacron would be really useful, except that its greatest frequency is daily. It has no hourly or twice-a-day option, for instance.

