Posts Tagged ‘power management’
A bug fix for the sleep component
A new version of the LCFG sleep component, 0.30.0, is out and installed on the sleep beta test machines. It fixes bug 653.
The problem was with the code which checked keyboard/mouse idleness. I was so delighted to be able to do this at last that it went to my head and I forgot that keyboard/mouse idleness is only relevant when somebody is logged in. After logout these idleness figures can be ignored: other tests will pick up things like remote shells.
The result was that although the component correctly checked keyboard/mouse idleness and politely waited until the machine had been idle for a while before authorising sleep, it kept doing that after the user had logged out and gone away. Before this my machine would fall asleep a minute or two after I logged out; with this it would wait several hours before sleeping.
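For illustration, the corrected decision might look something like the Python sketch below. It is not the component's own code, which isn't shown in this post; the function and parameter names are made up, and the choice to stay awake when the idle time is unknown is an assumption.

```python
from typing import Optional


def input_idleness_allows_sleep(user_logged_in: bool,
                                idle_seconds: Optional[float],
                                required_idle_seconds: float) -> bool:
    """Return True if the keyboard/mouse idleness test permits sleep.

    All names here are hypothetical; this only illustrates the fix.
    """
    # Nobody logged in: keyboard/mouse idleness is irrelevant, so this test
    # must not hold the machine awake. Other tests cover remote shells etc.
    if not user_logged_in:
        return True
    # Logged in but idleness unknown (assumption): be cautious, stay awake.
    if idle_seconds is None:
        return False
    # Logged in: wait politely until the session has been idle long enough.
    return idle_seconds >= required_idle_seconds
```

With a ten-minute threshold, a logged-out machine passes this test straight away, while a logged-in machine that has been idle for thirty seconds does not.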
So, all fixed in lcfg-sleep 0.30.0. Everything seems OK so far from the intrepid beta test team so I’m hoping that this version will hit other DICE desktops within a month or so.
Spring Sleep Roundup
I’ve been asked for the latest news on the LCFG sleep component (which sends our desktops to sleep when they’re idle, but ensures that they wake up to run important cron jobs). Your wish is my command. Here are the major developments since its last mention here:
- Since 0.21.0 it’s enabled all USB devices for wake-up, rather than trying to guess which ones might have a keyboard attached to them. Simpler, more reliable. (This is so that a sleeping machine can be woken by a press of a key.)
- Since 0.22.0 the component has tried harder to create a wake alarm to wake the machine in good time for cron jobs, even when it knows that it’s never going to send the machine to sleep. (Because machines can go to sleep by other means too.)
- In 0.23.0 the “nosleep” command was introduced – “man nosleep” for details. Thanks to Sharon Goldwater for this idea.
- From 0.24.0 the wake alarm is cleared when the component stops. So machines which have been shut down for the Christmas holidays will stay shut down.
- Versions 0.25.0 to 0.29.0 added the holy grail: a test for login session idleness. I found the right bit of DBUS at last! Provided the user is using GNOME, the component can find out if and for how long her login session has been idle. If the user uses some other environment the component will refrain from sleeping the machine while the login session remains; a solution for that is in development and will appear when we get time. New resource xidletime specifies the time delay between the session becoming idle and sleep being permitted.
More recently I’ve moved a bunch of generally useful settings from the Informatics-only sleep header into lcfg/options/sleep.h.
The “login session idleness” functionality is currently being beta-tested (which got a mention on the Informatics Energy blog) so it will be installed only if you put the following in a machine’s profile:
#define LCFG_SLEEP_BETA_TEST
#include <lcfg/options/sleep.h>
I’ve been asked whether I have data on whether sleeping shortens the life of desktop machines. I’m afraid I don’t, but if you do, please get in touch. I do have a few thoughts on the matter though.
- Modern desktop hardware and operating systems are all designed to support sleep, and they do it well. Our managed Windows desktop machines enforce sleep and this seems to work well. Ten or fifteen years ago frequent sleep might have been bad for the hardware, but nowadays I really doubt that it would cause problems.
- We’ve now been using sleep for several years and we haven’t seen an epidemic of premature death in our desktop machines.
- We’ve come across problems caused by hard disks being kept permanently running 24/7. “Never switch off” isn’t always the right idea.
Round-up of sleep news
It’s been a while since I blogged about the sleep component. There’s been a lot of activity on that front lately, so here’s a roundup of the news.
- You can now wake a sleeping machine by pressing a key on the keyboard.
- In theory you can also wake a sleeping machine by clicking a mouse button. However RHEL6.0 / SL6.0 seems to have a kernel bug which makes this not work any more. As far as I can gather, the kernel bug was fixed ages ago but RHEL deliberately removes the fix when building its kernel. Hopefully this will all be better in 6.1.
- The component now detects running cron jobs. If it finds one, and that job isn’t in the cronignores list, it keeps the machine awake.
- There are new disable and enable methods. These disable sleep, and undo the disable method, respectively. The idea is that they can be run by a machine’s user, for example by typing om sleep disable.
- The component should work on 64 bit architectures too.
- A bug which broke the execution of extra commands at suspend and resume has been fixed.
- A new blacklist resource allows the selective disabling of sleep for particular hardware models.
- The sleep component’s LCFG Wiki page has been thoroughly tidied and brought up to date. Take a look at that page for an introduction to the component.
- Sleep mostly seems to work reliably on SL6. One or two models are currently presenting problems (the Dell Optiplex 755 in particular) but solutions have been identified and I expect those models to sleep successfully soon too. Certainly my test 755 sleeps like a baby. (It wakes in the middle of the night to perform important functions…)
- Edit: I forgot to mention that the lcfg-level resources have now been beefed up so that the component can now be run out of the box with lcfg level headers: no extra configuration should be necessary. (More configuration is possible of course – see the LCFG Wiki page for some config ideas.)
With all these developments out of the way it’s looking likely that we’ll soon be rolling out the sleep component onto all the staff and postgrad DICE Linux desktops in Informatics. In addition the introduction of the blacklist resource clears the way for the possible adoption of lcfg-sleep by other schools and units too. I’m looking forward to that challenge; it’ll be great to see more power-saving from the Linux desktops across the University.
Linux Sleep: a new hoop to jump through
This is a follow-up to an earlier post, Linux sleep: how to wake with a key press or mouse click.
Shortly after discovering how to wake a sleeping machine this way – something of a Holy Grail of mine for several years – a new kernel version came along and broke the mechanism. At least, you now seem to have to jump through an extra hoop to enable it, in addition to the one described in the earlier post.
It’s now also necessary to find the relevant USB devices’ files under the /sys/devices/pci0000:00 tree and echo "enabled" to them. Version 0.12.0 of the LCFG sleep component has been updated to do this. It should be installed on DICE machines by 14 April 2011. LCFG bug 408 is the bug report associated with the change.
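As a rough illustration of that extra hoop, here is a Python sketch which walks the tree and enables wake-up for USB devices. It is not the component's code. It assumes that the relevant files are the power/wakeup attributes of directories named usb1, usb2 and so on, which is how sysfs usually presents USB root hubs; the function name is made up. It needs to run as root on a real machine.

```python
import os
from typing import List


def enable_usb_wakeup(sys_root: str = "/sys/devices/pci0000:00") -> List[str]:
    """Write "enabled" to every USB device's power/wakeup file under sys_root.

    Assumption: the relevant files are the power/wakeup attributes of
    directories named usbN (USB root hubs). Returns the paths changed.
    """
    changed = []
    for dirpath, _dirnames, _filenames in os.walk(sys_root):
        # Only look at directories named like "usb1", "usb2", ...
        name = os.path.basename(dirpath)
        if not (name.startswith("usb") and name[3:].isdigit()):
            continue
        wakeup = os.path.join(dirpath, "power", "wakeup")
        if not os.path.isfile(wakeup):
            continue
        with open(wakeup) as handle:
            state = handle.read().strip()
        # Unlike /proc/acpi/wakeup, this file takes the desired state
        # rather than toggling, so writing "enabled" is safe to repeat.
        if state != "enabled":
            with open(wakeup, "w") as handle:
                handle.write("enabled")
            changed.append(wakeup)
    return sorted(changed)
```

Taking the root directory as a parameter makes it possible to try the function out on a scratch directory before letting it loose on /sys.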
Linux sleep: how to wake with a key press or mouse click

Several years ago we started sending the Linux machines in our student labs to sleep when idle, to save power. We configured them to check carefully before deciding whether or not they were idle enough to sleep, and also to wake themselves up in time to run important cron jobs. Machines could also be woken manually when needed.
This was fine, except for one problem: the only way to wake the machine manually was to press its power button. That’s not how most people try to wake a sleeping machine: it’s far more natural to press a key on its keyboard, or click one of its mouse buttons.
We’ve had a user education campaign which seems to have successfully taught most users of the labs how to wake a machine up, but there’s still a persistent minority of people who don’t understand, or maybe get impatient, and who sometimes end up doing something rash such as forcing a sleeping machine to reboot; so we get a steady flow of broken machines.
To solve this problem I’ve been trying for a long time to find out how to enable wake from sleep with a key press or mouse click. I’ve even been trying to find out if it was actually possible with Linux.
I have finally succeeded! It is possible, I’ve done it, and the solution will shortly be rolled out to our student lab machines. Here’s how:
The key file to manipulate is called /proc/acpi/wakeup. This file is a list of devices which can be used to wake the machine from sleep – and whether or not they’re currently allowed to. A status of “disabled” against a device means that it won’t wake the machine, while “enabled” means that it will. Here are the default contents of /proc/acpi/wakeup on my desktop HP dc7900 running Fedora 13:
Device  S-state  Status     Sysfs node
PCI0    S4       *disabled  no-bus:pci0000:00
COM1    S4       *disabled  pnp:00:07
PEG1    S4       *disabled  pci:0000:00:01.0
PEG2    S4       *disabled
IGBE    S4       *disabled  pci:0000:00:19.0
PCX1    S4       *disabled  pci:0000:00:1c.0
PCX2    S4       *disabled
PCX5    S4       *disabled  pci:0000:00:1c.4
PCX6    S4       *disabled
HUB     S4       *disabled  pci:0000:00:1e.0
USB1    S3       *disabled  pci:0000:00:1d.0
USB2    S3       *disabled  pci:0000:00:1d.1
USB3    S3       *disabled  pci:0000:00:1d.2
USB4    S3       *disabled  pci:0000:00:1a.0
USB5    S3       *disabled  pci:0000:00:1a.1
USB6    S3       *disabled  pci:0000:00:1a.2
EUS1    S3       *disabled  pci:0000:00:1d.7
EUS2    S3       *disabled  pci:0000:00:1a.7
PBTN    S4       *enabled
The only device that’s allowed to wake the machine is PBTN – the power button.
To enable a device, just echo its device code to the file, like this:
# echo USB3 > /proc/acpi/wakeup
A quick look at /proc/acpi/wakeup confirms that USB3 is now enabled for wakeup:
USB1    S3       *disabled  pci:0000:00:1d.0
USB2    S3       *disabled  pci:0000:00:1d.1
USB3    S3       *enabled   pci:0000:00:1d.2
USB4    S3       *disabled  pci:0000:00:1a.0
USB5    S3       *disabled  pci:0000:00:1a.1
I wanted to make it possible for the keyboard and the mouse to wake the machine, so I used this method to “enable” all of the USB devices.
Note that echoing the device code to the file toggles the device’s status: a disabled device is enabled, and an enabled one is disabled.
Note also that if writing a Perl script to do this, you’ll have to open /proc/acpi/wakeup for writing, echo a device code, then close the file, separately for each device you want to enable.
Here’s a bit of Perl which will enable wakeup on all USB devices, if you run it from an account which has permission to write to /proc/acpi/wakeup:
#!/usr/bin/perl
use strict;
use warnings;

my $wakeup = "/proc/acpi/wakeup";
my @disabled;
my $device;

# Let's take a look at the wakeup file
open(INPUT, "< $wakeup")
    or die "Couldn't open $wakeup for reading: $!\n";

# Remember the names of each disabled USB device
while (<INPUT>) {
    if (/^(USB\d+).*disabled/) {
        push(@disabled, $1);
        print "Added $1 to disabled list\n";
    }
}

# We've finished reading from the file
close(INPUT);

# Enable each device on our list. Each write toggles one device, so the
# file is opened and closed separately for every device.
foreach $device (@disabled) {
    print "$device is disabled! Enabling it now.\n";
    open(OUTPUT, "> $wakeup")
        or die "Couldn't open $wakeup for writing: $!\n";
    print OUTPUT $device
        or die "Couldn't echo $device to $wakeup: $!\n";
    close(OUTPUT);
}
hardware test results
Whoops: I’ve neglected this for the last few days. This post therefore has to be something of a catch-up. Sorry for the length.
- On the 15th I built lcfg-openafs and lcfg-openldap for f12_64. (Stephen has been enthusing about Mock for a long time now, and I can see why – it’s extremely useful to be able to build packages without having to install all the Requires and BuildRequires packages on the build machine.) lcfg-openafs-0.0.32 is now officially ported to F12.
- I also confirmed that switching my F12 machine from Kerberos configuration by the file component, to configuration by the kerberos component, kills my keyboard stone dead on reboot. I get a prompt for my admin principal and the keyboard totally fails to work. I’ve gone back to the file component method and reinstalled…
- Most of the rest of the 15th and 16th was taken up with hardware tests: results here. The problems I found were:
- 755 doesn’t reboot
- Whenever the 755 tries to reboot it announces “Rebooting system.” then hangs.
- 745 doesn’t mount CDs
- If you insert a CD into a 745 it whirrs but nothing appears on the desktop.
- HP 7900 CD support is dodgy
- Sometimes an inserted CD doesn’t mount on the 7900’s desktop, sometimes it does.
- Dell sound is dodgy
- There’s no audio output from speakers on some Dells.
- Audio or sleep troubles on 755
- The X login screen disappeared from a 755 after it had undergone an intensive programme of frequent suspends and resumes. On checking the logs it seemed that rtkit-daemon was logging to syslog a lot at resume time. On the very first resume it logged
rtkit-daemon[4569]: Sucessfully made thread 4567 of process 4567 (/usr/bin/pulseaudio) owned by '42' high priority at nice level -11.
rtkit-daemon[4569]: Sucessfully made thread 4573 of process 4567 (/usr/bin/pulseaudio) owned by '42' RT at priority 5.
rtkit-daemon[4569]: Sucessfully made thread 4574 of process 4567 (/usr/bin/pulseaudio) owned by '42' RT at priority 5.
Then on the second resume it logged:
rtkit-daemon[4569]: The poor little canary died! Taking action.
rtkit-daemon[4569]: Rampaging.
rtkit-daemon[4569]: Successfully demoted thread 4573 of process 4567 (/usr/bin/pulseaudio).
rtkit-daemon[4569]: Successfully demoted thread 4574 of process 4567 (/usr/bin/pulseaudio).
rtkit-daemon[4569]: Demoted 2 threads.
and on subsequent resumes:
rtkit-daemon[4569]: The poor little canary died! Taking action.
rtkit-daemon[4569]: Rampaging.
rtkit-daemon[4569]: Demoted 0 threads.
Some hours later this eventually became
rtkit-daemon[4569]: Rampaging.
rtkit-daemon[4569]: Demoted 0 threads.
gdm-simple-slave[4497]: WARNING: Child process -4519 was already dead.
gdm-simple-slave[4497]: WARNING: Unable to kill D-Bus daemon
console-kit-daemon[1811]: WARNING: Couldn't read /proc/4556/environ: Failed to open file '/proc/4556/environ': No such file or directory
gdm-binary[4464]: WARNING: GdmDisplay: display lasted 0.844015 seconds
gdm-binary[4464]: WARNING: GdmDisplay: display lasted 0.834487 seconds
gdm-binary[4464]: WARNING: GdmDisplay: display lasted 0.829573 seconds
gdm-binary[4464]: WARNING: GdmDisplay: display lasted 0.829621 seconds
gdm-binary[4464]: WARNING: GdmDisplay: display lasted 0.830660 seconds
gdm-binary[4464]: WARNING: GdmDisplay: display lasted 0.835599 seconds
gdm-binary[4464]: WARNING: GdmLocalDisplayFactory: maximum number of X display failures reached: check X server log for errors
init: prefdm main process (4464) terminated with status 1
init: prefdm main process ended, respawning
gdm-binary[13695]: WARNING: GdmDisplay: display lasted 0.833754 seconds
and so on. Judging by these two posts on Ubuntu forums it may be the case that PulseAudio should be stopped on suspend and started on resume. I’ve checked our pm-utils sleep.d scripts and that’s not happening on our F12 machines.
“rtkit” by the way is “real time kit”, it’s required by PulseAudio but not yet by anything else.
I spent some time today debugging a pm-utils sleep.d hook script which would suspend PulseAudio on system suspend and resume it on resume, but without success. I think I’ve spent enough time on this; for now we’ll just have to have lcfg-sleep disabled on 755s. I’m modifying lcfg/defaults/sleep.h accordingly.
A note for the future: my failed sleep hook script needed to run nsu or sudo so it could run pactl as the user running pulseaudio. Root doesn’t have permission to nsu, and I subsequently noticed some console messages saying “root: sorry, you must have a tty to run sudo”. So that explains those failures, anyway.
- Alastair has got us round the keyboard/kerberos problem. Apparently scripts called from Upstart can’t get interactive input! See this Ubuntu support thread. Setting kerberos.hostkeyless to true gets us round the problem for now, at the cost of not having any automatically generated host keys. I’ve changed inf/options/kerberos-client.h accordingly. But what a pain. We really don’t like Upstart. Edit: it may be plymouth rather than Upstart. Hopefully we’ll be able to chuck or disable plymouth as a workaround.
A reinstall brings RTC wake alarm confusion
Late yesterday I bogged up my F12 machine completely. Today I took the opportunity to reinstall it using Alastair’s shiny new F12 install process. This worked, albeit with a few hiccups, so I now have a new F12 installation.
With the new installation, the wake alarm no longer works as it did. This is how it worked until yesterday:
# echo 0 >/sys/class/rtc/rtc0/wakealarm
# date "+%s" -d "+ 5 minutes" >/sys/class/rtc/rtc0/wakealarm
That is, you zero the alarm then you set it with the number of seconds between the epoch and the date/time you want the machine to wake up. Now though, with the same kernel RPM as before, the above doesn’t work. Instead it works like this:
# echo 0 >/sys/class/rtc/rtc0/wakealarm
# echo +300 >/sys/class/rtc/rtc0/wakealarm
That is, you put in a + followed by the number of seconds between now and your alarm time. But the kernel version is the same, I think, from yesterday to today. How can the alarm behaviour have changed…?! And will it change again tomorrow? How do I write software when the kernel behaviour changes arbitrarily from one day to the next with no change of kernel RPM version?
I got the solution from here – which seems to be about Asus boards so I’m still confused as I’m using a Dell:
http://www.mail-archive.com/acpi-bugzilla@lists.sourceforge.net/msg24296.html
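One way to cope with both behaviours is to keep the choice of format in one small function. The Python sketch below is only an illustration with made-up names, not the sleep component's code: it builds either form of the value and then writes it in the same clear-then-set order as the shell commands above.

```python
import time
from typing import Optional


def wakealarm_value(seconds_from_now: int, relative: bool,
                    now: Optional[float] = None) -> str:
    """Return the string to write to the wakealarm file.

    relative=False gives seconds since the epoch (the first form above);
    relative=True gives "+N" seconds from now (the second form).
    """
    if seconds_from_now <= 0:
        raise ValueError("wake time must be in the future")
    if relative:
        return "+%d" % seconds_from_now
    if now is None:
        now = time.time()
    return "%d" % (int(now) + seconds_from_now)


def set_wakealarm(value: str,
                  path: str = "/sys/class/rtc/rtc0/wakealarm") -> None:
    """Clear the alarm, then set it, as in the shell commands above."""
    for text in ("0", value):
        # Open and close the file for each write so that the kernel
        # sees the two writes separately.
        with open(path, "w") as handle:
            handle.write(text + "\n")
```

For a five-minute alarm, wakealarm_value(300, relative=True) gives the second form and wakealarm_value(300, relative=False) gives the first. Which one a given machine accepts still has to be found out by trying it.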
Fixed problems with lcfg-sleep on F12
The day was mostly taken up with making the sleep component behave itself properly on Fedora 12. The OS’s power management facilities certainly seem to have matured: my test Dell Optiplex 745 suspends and resumes far more quickly than it did with SL5, and it seems to be doing it far more reliably as well so far. I’ve left it on an intensive suspend/resume cycle though (awake for 3 minutes, then suspend if appropriate, then wake 2 minutes later and start again) so we’ll see if that brings out any misbehaviour over the next few days.
An apparent bug whereby the pm-utils hook scripts weren’t being called was solved when I noticed that for F12 I’d switched the suspend command for my machine from /usr/sbin/pm-suspend to some other fancy suspend command. Doh. I switched it back and the pm-utils hooks were called again as they should be.
I also rewrote the part of the component’s code which sets the wake alarm to make it cope properly with either the old kernel alarm system used on SL5 (/proc/acpi/alarm) or the newer one found on F12 (/sys/class/rtc/rtc0/wakealarm).
I also discovered and fixed an edge case problem whereby the component’s shell idle time test would happily approve sleep in cases where all interactive shells had an idle time of zero seconds. Repeat twenty times: I must not confuse zero with undefined in my Perl scripts. Still to do: check that other sleep tests are behaving themselves (I think they are though) and check that the new code still does the right thing on SL5.
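The zero-versus-undefined trap exists in most scripting languages. Here is a Python analogue of the fixed test, offered as a sketch with made-up names rather than a translation of the component's Perl. It assumes that an empty list means there are no interactive shells at all.

```python
from typing import Optional, Sequence


def shell_idle_test_approves(idle_times: Sequence[float],
                             required_idle_seconds: float) -> bool:
    """Approve sleep only if every interactive shell has been idle long enough."""
    # None stands for "undefined": there are no interactive shells.
    least_idle: Optional[float] = min(idle_times) if idle_times else None
    # Test for "undefined" explicitly. Writing "if not least_idle" here
    # would also be true for 0, which is exactly the bug described above.
    if least_idle is None:
        return True
    return least_idle >= required_idle_seconds
```

With this version a machine whose shells are all in active use (idle time zero) stays awake, while a machine with no shells at all is free to sleep.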
LCFG Users Day talk
Yesterday was the 2009 LCFG Users Day afternoon session. The talks were all pretty interesting I thought; it was very encouraging to see how useful people are finding LCFG, and how its use has grown compared to last year. The developments at ACE seemed particularly impressive: LCFG has become a very useful and powerful Mac configuration management tool.
My own wee talk on the sleep component went OK (to my relief). Since I have it all written down anyway, here it is, more or less verbatim, for those that missed it.
I’m Chris Cooke and I’ve written a component called “sleep”.
I did it because we want to save the environment – that’s one of the
University’s corporate goals, more or less – and we also wanted to
save money off our electricity bill.
So, the sleep component.
The idea is that it runs on our desktop Linux computers.
When it runs it decides whether or not it would be appropriate for the
computer to sleep. If it would be appropriate, it sends the computer
to sleep.
However, just as importantly, before sending the computer to sleep,
the sleep component also decides a good time for the computer to wake
up again, and it sets a wake alarm which will wake the computer up
at that time.
So, when is it appropriate for a computer to sleep?
Well, cron jobs are one thing to look out for.
The component makes sure that a computer will be awake in time to run
every important cron job.
(And by the way it takes “important” to mean “every cron job except
the ones you’ve told it to ignore”.)
It also looks at the load average – if that’s higher than a level you
set in a resource, the computer will stay awake.
It also looks at the idle time of shells, at X sessions, and it also
runs any arbitrary command you tell it to, and takes the return value
of that command to be an approval or veto of sleep for the machine.
So, for instance, I realised quite late on that although I’d dealt
nicely with cron jobs, I’d totally forgotten “at” jobs, so I was able
to add on a call to an external command which has the effect of
vetoing sleep if there’s anything in the “at” queue.
I also gave the component other things to look out for:
You can set a minimum awake duration, that is, a minimum time between
sleeps.
You can also set a minimum and maximum duration for sleep.
So basically it runs this whole battery of tests, and if any of them
vetoes sleep, the machine stays awake.
Before sleeping, the component looks ahead to when the next important
cron job will run – or when the maximum sleep duration will be up, if
that comes sooner – and it sets a wake alarm which will wake the
computer up for that time.
And finally there are also resources which run things for you when the
machine is falling asleep and when it’s waking up again.
Like, we use one or two daemons which react badly to sleep, so we shut
them down before sleep, then start them up again when the machine
wakes.
So, this is all great, it’s shiny and wonderful, but there is some bad
news: I found out the hard way that the power management on our Dell
SelectPCs does not seem reliable with Scientific Linux; when the
machine tries to sleep you get crashes and freezes now and then.
In the end we gave up trying to use it with our Dells.
However it is perfectly reliable on the current SelectPC, the HP 7900,
and in fact we are using the sleep component on the 7900s in our
student labs across the road there.
So, the component is called lcfg-sleep, you’ll find it in
svn.lcfg.org, and if you look on wiki.lcfg.org you’ll find a page
there about the sleep component.
That’s it. Thank you.
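For anyone who prefers code to talks, the decision described above boils down to two steps: run the tests, then pick a wake time. The Python sketch below illustrates that outline under stated assumptions. The names are invented, the real component is not written this way, and treating a too-short sleep as "don't sleep" is a guess at how the minimum sleep duration is used.

```python
from typing import Callable, Iterable, Optional


def sleep_is_approved(tests: Iterable[Callable[[], bool]]) -> bool:
    """Run the tests in order; the first veto (False) keeps the machine awake."""
    return all(test() for test in tests)


def choose_wake_time(now: float, next_cron_time: Optional[float],
                     min_sleep: float, max_sleep: float) -> Optional[float]:
    """Return the wake alarm time, or None if the sleep would be too short.

    The wake time is the next important cron job, or the end of the
    maximum sleep duration if that comes sooner.
    """
    latest = now + max_sleep
    wake = latest if next_cron_time is None else min(next_cron_time, latest)
    # Assumption: a sleep shorter than the minimum is not worth taking.
    if wake - now < min_sleep:
        return None
    return wake
```

Each test is just a function returning True to approve or False to veto, so load average, shell idle time and external commands can all be plugged in the same way.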
lcfg-sleep gets deployed! *shock*
lcfg-sleep has finally been deployed! I’ve put it on all our Student lab machines. It’ll only actually do anything on the new HP 7900s of course. There are several dozen of them in our student labs, so the sleep component will soon begin to earn its keep at last. There’s a big autoreboot due to happen tonight to install a new kernel, and the sleep component should start working after that reboot.
I realised this morning that the component still had an alarming flaw related to autoreboot: although it wakes the machine automatically when it’s time to run the autoreboot component (as it’s kicked off by a cron job), and although it prevents the machine from sleeping once the autoreboot component has started a shutdown command to reboot the machine (I added a sleep test for this), I had forgotten the middle step: autoreboot kicks off the shutdown command from an at job, and I had neglected to tell sleep to refrain from suspending the machine if there was anything in the “at” queue. I’ve never really used “at”, so I don’t tend to think of it much, but the “at” queue is so fundamental that I should really go back and make the sleep component examine the queue properly when assessing a machine’s sleep-worthiness. For the moment I’ve put in a stop-gap extra sleep test which simply vetoes sleep when there’s anything in the “at” queue.
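A stop-gap test of this kind can be very small. The Python sketch below shows the idea and is not the actual test used: it treats any non-blank line of atq output as a pending job and vetoes sleep. The function names are invented, and it assumes atq is run as root so that every user's jobs are listed.

```python
import subprocess


def at_queue_vetoes_sleep(atq_output: str) -> bool:
    """Return True (veto sleep) if atq listed any pending job."""
    return any(line.strip() for line in atq_output.splitlines())


def at_queue_is_busy() -> bool:
    """Run atq and report whether anything is waiting in the queue."""
    result = subprocess.run(["atq"], capture_output=True, text=True,
                            check=False)
    return at_queue_vetoes_sleep(result.stdout)
```

Keeping the parsing separate from the call to atq means the veto logic can be checked without touching a real "at" queue.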
Sleep: HP in, Dell out
While I was away on a week’s holiday I left a small lab full of Dell GX745 machines testing lcfg-sleep. When I came back I found that half of them had frozen at suspend-time. This has been happening on and off for months and really I’m no nearer to curing the problem. By contrast, the few new HP dc7900s that have been running lcfg-sleep have been as good as gold, suspending and resuming several times a day quite happily, waking up perfectly well both automatically and on a press of the power button – no problems there.
Since effort is short and I’ve been bashing my head against the Dell problems for far too long now, and I’m utterly sick of it, I’ve now disabled sleep on the final two remaining Dell models it was operating on, the GX745 and the 755. The only supported model is therefore the HP dc7900. Happily this is the PC du jour and we’re deploying a lot of them in the student labs right now, so if things go well in my wee HP sleep test lab we can fairly rapidly spread lcfg-sleep to the rest of the student lab HPs. A few months after that, if things seem as reliable as I hope they will, we could perhaps think of spreading lcfg-sleep to HPs in offices too. After all the original point of the project was to cut the Forum’s electricity bill, and the Forum has no student labs.
This isn’t necessarily the end for the hopes of automatic sleep on the Dells. I may well go back and test sleep on a Dell from time to time, or even fix the problem if inspiration strikes, but I won’t spend much more effort on it at the moment: there are more important things to be getting on with.
lcfg-sleep 0.7.2
lcfg-sleep got its first new version in 3 months this morning. Just one wee change: it now touches its waketime file – the one it uses to know when it woke up, so it can figure out if it’s been awake for long enough yet – after running other resume-time commands, rather than before. I’ve made the change because while I was on holiday a few machines picked up timestamps on their waketime files which were a couple of weeks in the future; one machine’s waketime file even managed to leap twenty years into the future. I’m hoping that touching the file after restarting ntp, rather than before, will mean that the file picks up a rather more sensible timestamp. The effect of the future timestamp wasn’t catastrophic – it just prevented the machine from sleeping, because the component simply reckoned that the machine had been awake for a negative amount of time, which it decided was less than the minimum acceptable time period for being awake. If the problem occurs again I’ll add a test for negative numbers in there somewhere.
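If that test for negative numbers is ever needed, it might look something like this Python sketch. The names are hypothetical, and the policy of ignoring a future timestamp rather than letting it block sleep is an assumption, not something the component does today.

```python
def awake_long_enough(now: float, waketime_mtime: float,
                      min_awake_seconds: float) -> bool:
    """Decide whether the machine has been awake for the minimum time."""
    awake_for = now - waketime_mtime
    if awake_for < 0:
        # The waketime file is stamped in the future, so it cannot be
        # trusted. Assumption: don't let it block sleep indefinitely.
        return True
    return awake_for >= min_awake_seconds
```

Without the guard, a waketime file stamped twenty years ahead would keep the machine awake until the clock caught up with it.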
no sleep for 755s
Just a quick note. Following the discovery of changes needed to support sleep on 745s, this morning I tested some 755s with various sleep quirks. With the i810 driver, which is what the machines are currently using, no single sleep quirk does the job. The best I can do is get the machine to resume with a totally dead screen. At some point perhaps I should test them to see if the intel driver suffers from the same nastiness as on 745s. In the meantime no sleep for 755s.
Bad news for lcfg-sleep
Less than 24 hours after I enabled it on six Dell 745s, three of them have given up the ghost. Not permanently I suppose (I haven’t physically checked them yet) but they’ve certainly failed to wake up: all three went to sleep just after 1pm yesterday and have failed to communicate with the profile server since then. The other three machines seem fine – they woke up yesterday evening and downloaded new profiles, then last night, then most recently this morning between 8am and 9am.
Also, my own test 745, which went through an accelerated test of hundreds of sleep cycles thanks to my rather cruelly giving it a maximum sleep period of three minutes, yesterday hung rather than wake up. After I’d manually rebooted it and it had slept perfectly happily a number of times more, it then repeated the trick last night shortly after 8pm.
So, to do:
- boot the benighted machines single-user and save copies of whatever they managed to log before freezing
- remove lcfg-sleep from the lab machines (edit: I’m leaving it to run for longer. Nobody uses those machines anyway.)
- rethink the supported models (currently Dell 745, 755 and new HP 7900).
On the last point I’m now wondering whether Dells are just too crappy and unreliable to be trusted with power management at all? The HP has behaved flawlessly – but then, it hasn’t gone through as many sleep cycles as my own test Dell 745, so who knows how it’ll work out. IS seem to manage, but not with Linux, they have to boot the machines into Windows to sleep them – and Windows has traditionally had its hardware support designed specifically around the shortcomings and unreliability of whatever’s provided by the hardware vendors.
I think I’m going to have to arrange a mass HP 7900 sleep test somehow.
There’s still the possibility of getting 755s to behave reliably, and I’ll work on that, but it seems unlikely to succeed.
Sleep is live!
The sleep component has gone live! I’m very excited
To start with it’s being tested in one very small student lab. Six machines are now fast asleep, and all being well they’ll wake up in a few hours and be fresh and ready for use. I’d better check tomorrow to see what sort of state they’re in. I’ve written a page of documentation for the machines’ users. I still need to add some stuff to the Support FAQ.
GX745 gdm hangs: it’s the intel driver
We’ve found out, to some extent anyway, why our Dell GX745s have been freezing sometimes when gdm starts the login screen, but mysteriously only the ones using the DICE develop release; those using the stable release have been unaffected. It has to do with the X video driver: Stephen noticed that lcfg/hw/dell_optiplex_gx745.h still contained:
#ifndef LCFG_RELEASE_DEVELOP
#include <lcfg/options/video_i810.h>
#endif
The machines on the develop release were using the intel driver, and others were using i810. It seems that this intel driver is bad news on 745s, at least with DVI cables anyway. The develop 745s I know about are all on DVI cables because of a previous problem with VGA cables – and they were on the intel driver because of a previous problem with the i810 driver.
This would leave us in a bit of a fix, but a retest of sleep on the GX745 has revealed that
- with i810 we now need a completely different sleep quirk from before, and
- so far it seems to work perfectly.
I’ve therefore changed the sleep defaults for the GX745 to make it sleep only if the i810 driver is in use and to use the new quirk.
I’d better also retest at least a 755 and a 7900, and possibly also explore how the intel driver on a 745 reacts to VGA cables.
autoreboot and sleep
Just a quick note: as envisaged in my last post I’ve rejigged the autoreboot and sleep support. Extra sleep settings for machines with both autoreboot and sleep are now in a new autoreboot-and-sleep.h header. At the moment that just contains an extra sleep test which will veto sleep whenever root is running an instance of whatever command is mentioned in the autoreboot.shutdown_command resource.
some sleep integration developments
It’s been a while since I did anything with the power management for DICE desktops, aka sleep, project so here goes. I need to get it installed in a student lab to see what happens. Before I can do that the sleep component needs to be integrated safely with several things:
- exam lockdown: now done, exam lockdown disables sleep.
- condor: now done, we have a new condor_and_sleep.h header which tweaks sleep resources to reduce the maximum sleep time and also stop condor at suspend time and start it again at resume time. The latter is necessary because Condor detects imminent sleep and uses some non-standard mechanism to react to it, and doesn’t finish reacting to sleep until the machine has slept and woken up again, which is a bit useless. The header is (or will be) included from both condor and sleep headers in such a way that its contents won’t be multiply included but it won’t matter which order the condor and sleep headers appear in. Which I’m quite proud of.
- autoreboot: sleep now won’t sleep when shutdown is running. I probably need to rejig and complicate this along the lines of condor above though so that I can test for whatever the autoreboot component is currently using as its shutdown command, rather than just assuming that it’s still shutdown.
HP dc7900 now OK for sleep
TiBS LCFG development has temporarily stopped while I take care of some operational matters:
- moving the DIY DICE service to a dedicated machine to simplify admin – and configuring and installing the machine
- decommissioning the fc5 build host (not that this took a great deal of time)
- tidying up a pile of machines that were lurking under my desk
- looking into ways of making the office habitable
- satisfying various bureaucracies
- investigating a bizarre case of intermittent narcolepsy in my main desktop. I don’t think it’s LCFG-related; more likely I’ve misconfigured Gnome power management somehow.
I’ve also set up a new HP dc7900 and tested it for compatibility with lcfg-sleep. Good news: it’s now supported, or it will be when today’s changes hit the stable release. For the record it appeared to suspend and resume happily when no quirks were used, but a subsequent gnome login would pop up a “something went wrong with your resume” warning, and on logout the whole machine would freeze solid – lovely. Thankfully this behaviour goes away and the machine seems as good as gold when slept with the VBE Post sleep quirk, so that’s what LCFG will now do.
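As a rough sketch of how a per-model quirk choice might be expressed, assuming the quirk is passed to pm-suspend on the command line: the function name and the model key are hypothetical, and the only mapping taken from this post is the dc7900 one. --quirk-vbe-post is the pm-utils option for the VBE Post quirk.

```shell
#!/bin/bash
# Hypothetical sketch: choose pm-suspend quirk options by machine model.
# Only the dc7900 entry comes from the testing described above; quirks for
# other models are deliberately not filled in here.

# quirk_flags_for_model MODEL
# Prints the quirk options to use, or nothing if none are known.
quirk_flags_for_model() {
    case "$1" in
        dc7900) echo "--quirk-vbe-post" ;;
        *)      echo "" ;;
    esac
}

# Example use (commented out: this would really suspend the machine):
# pm-suspend $(quirk_flags_for_model dc7900)
```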
mpu minutes, preparing for dev meeting
I haven’t had much time for the sleep project today. Much of the morning was spent writing up the minutes of this week’s MPU meeting. I see the test 755 didn’t recover from its first sleep attempt again last night; I’ll rescue it and investigate tomorrow.
For this month’s development meeting we’ve been asked to provide a brief summary
of what has been achieved in the month since the last meeting and what
is intended to be achieved by the next meeting. Here’s what I’ve come up with for the sleep project:
Over the past month the project has concentrated on analysing various
problems related to suspend and resume on the test machines. Some of
these have been solved with changes to the component or to the
resources. As a result of these problems the scope of the project has
progressively narrowed from “look for sleep opportunities on all DICE
desktop machines, all the time” to “look for sleep opportunities on
Dell 745s and 755s, running SL5.3, when no X session is running”.

In the next month I plan to arrange to test the component in at least
one student lab, with a view to deploying it in the student labs for
the next session. I also plan to check that lcfg-sleep cooperates
properly with lcfg-condor on machines that use both.

In the longer term, as OS and hardware support for sleep gradually
improves, as surely it must, we may be able to widen the scope of our
power management once again and spread automatic sleep to more
machines for more of the time. That may not happen as part of this
project however, but more as small incremental developments over the
coming years. For example: once the pm-utils package (which
implements power management) supports the concept of timed sleep with
automatic wakeup it will be possible to integrate it safely with
lcfg-sleep.
Dr. Who eat your heart out
More suspend weirdness – but this looks like another new case. (This project begins to seem like an epidemic of power management bugs.)
This time it’s the test 755 that’s affected. It was hung when I came in this morning: it went to sleep at 00:21 but failed to wake up again at 03:54. It wasn’t totally hung; it seemed to be stuck in a sort of partially up mode where some things were working OK but you couldn’t log in to it. I’ve had a stunningly dozy day (my brain seems to have spent most of it in low power mode, possibly because of the hot weather), so I’ve only just thought to consult the machine’s /var/log/pm/suspend.log. (I got the machine going again this morning.)
And it looks interesting. Firstly here’s what you get when you do ls -l on it:
-rw-r--r-- 1 root lcfg 1565 Jun 27 1922 suspend.log
1922? And here are the contents:
Fri May 29 00:21:02 BST 2009: running suspend hooks.
===== Fri May 29 00:21:02 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/00clear =====
===== Fri May 29 00:21:02 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/05led =====
===== Fri May 29 00:21:02 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/15_915resolution =====
===== Fri May 29 00:21:02 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/20video =====
kernel.acpi_video_flags = 0
===== Fri May 29 00:21:02 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/49bluetooth =====
===== Fri May 29 00:21:02 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/50modules =====
===== Fri May 29 00:21:02 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/55battery =====
===== Fri May 29 00:21:02 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/60sysfont =====
===== Fri May 29 00:21:02 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/65alsa =====
===== Fri May 29 00:21:02 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/94cpufreq =====
===== Fri May 29 00:21:02 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/95led =====
===== Fri May 29 00:21:02 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/98lcfgsleep =====
===== Fri May 29 00:21:08 BST 2009: running hook: /usr/lib/pm-utils/sleep.d/99video =====
Fri May 29 00:21:08 BST 2009: done running suspend hooks.
Tue Jun 27 01:26:46 BST 1922: running resume hooks.
===== Tue Jun 27 01:26:46 BST 1922: running hook: /usr/lib/pm-utils/sleep.d/99video =====
===== Tue Jun 27 01:26:46 BST 1922: running hook: /usr/lib/pm-utils/sleep.d/98lcfgsleep =====
Here’s that 98lcfgsleep script in full:
#!/bin/bash
. /usr/lib/pm-utils/functions
case "$1" in
hibernate|suspend)
/usr/bin/om ntp stop
/bin/echo "`/bin/date` : Suspend" >> /var/lcfg/log/sleep.activity
/bin/grep `/bin/date +%d/%m/%y` /var/lcfg/log/sleep | /bin/grep `/bin/date +%H:%M` | /bin/mail -s "Falling asleep" [Chris's mail address]
sleep 5
;;
thaw|resume)
/bin/touch /var/tmp/lcfg-sleep-waketime
/usr/bin/om amd restart
/usr/bin/om ntp start
/bin/echo "`/bin/date` : Resume" >> /var/lcfg/log/sleep.activity
/bin/mail -s "Waking up" [Chris's mail address] < /dev/null
;;
*)
;;
esac
exit $?
I got the falling asleep mail, and the ntp and sleep.activity log files were both updated at the right time, so all of the suspend-time stuff appears to have happened properly.
The resume-time stuff didn’t, though, and somewhere in it seems to be where the machine got stuck for seven hours until I power-cycled it. The “touch” command seems to have happened – but look, here’s a familiar date:
-rw-r--r-- 1 root root 0 Jun 27 1922 /var/tmp/lcfg-sleep-waketime
Here are the last three entries in the amd log:
28/05/09 23:52:32: >> restart
27/06/22 01:26:46: >> restart
29/05/09 10:33:56: >> start
I suppose that just means that amd did in fact restart at resume time, at some point between 23:52 yesterday and 10:33 today, with the clock reading 1922.
There’s no sign of any entry for the resume-time start in the ntp log. The most recent entry in the sleep.activity log is from the suspend half of the script, and the file is dated 00:21 today, when the machine suspended, so the resume-time echo doesn’t appear to have run. I also received no mail at resume time.
So it looks like the machine went down in 2009 and came up in 1922, which made ntp hang indefinitely on startup.
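A simple way to spot this sort of time travel would be to record the time at suspend and compare it with the time at resume. The sketch below is a hypothetical illustration, not something the hook script does: the function name and the state file path are made up, and it only detects the problem, it doesn’t fix the clock.

```shell
#!/bin/bash
# Hypothetical sketch: detect a clock that has jumped backwards across a
# suspend/resume cycle by comparing two epoch timestamps.

# clock_went_backwards SUSPEND_EPOCH RESUME_EPOCH
# Succeeds (exit 0) if the resume time is earlier than the suspend time.
# Dates before 1970 (such as 1922) give negative epoch values, which the
# integer comparison handles without trouble.
clock_went_backwards() {
    local suspend_epoch="$1" resume_epoch="$2"
    [ "$resume_epoch" -lt "$suspend_epoch" ]
}

# Example use inside a hook (commented out; /var/tmp/suspend-epoch is a
# made-up path):
#   hibernate|suspend)  date +%s > /var/tmp/suspend-epoch ;;
#   thaw|resume)
#       if clock_went_backwards "$(cat /var/tmp/suspend-epoch)" "$(date +%s)"; then
#           logger "clock went backwards across suspend"
#       fi ;;
```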
What I’ve been doing in May 09
We had a technical Strategy Meeting this morning, an unusual event. I’d been dreading it, fearing a lack of stargazing technical foresight on my part, but as it happened the meeting was excellent, and I came away from it with two things: a massive boost to my morale (which had been flagging a bit since learning that I’m about to be chucked out of my bearable temporary office) and a determination to revive the habit of writing down at least every day pretty much everything I’ve been up to at work that day: the successes, the failures, the error messages, the things I found out and the things that I do and don’t understand. Documenting the successes is great for me (and for others) as an aide memoire. Documenting the failures is even more important though in two ways: firstly it can often help to clear my thoughts, but secondly and maybe more importantly other people can read about my problems and chip in with their own ideas, they can rally round and help. Far better than just suffering in silence, I think.
In other words this amounts to the usual “I’m going to revive this blog” post.
Next week the sleep project is going to have to face the music as regards deadlines and the missing thereof at the monthly development meeting. This time Tim has asked us to produce a short summary of what happened in the last month with our projects and what might happen in the next month. To try to get it straight I’ve written down here what happened in the last month using the MPU weekly minutes as my guide. It’s probably rather longer than Tim asked for but if necessary I’ll summarise later. For now, I need to get this all down on paper dots.
Worked out how to successfully suspend and resume a Dell Optiplex 755 (the exact command is different for each model).
The test 745 is hanging sporadically. This is related to the latest kernel: it doesn’t happen on a stable 5.2 machine, but does happen on both a develop 5.3 machine and on a stable 5.2 machine with the latest test kernel (the one on develop 5.3 machines). This turned out to be a bug with VGA support, it happens only if you use a VGA cable, machines with DVI-connected displays are fine. It appears that our 745s and 755s are pretty much universally connected with DVI, but all other models have at least some machines with VGA cables, so I’ve disabled lcfg-sleep for all models earlier than a 745.
Timed wake-up occasionally fails to happen with 5.3. Very occasionally. Haven’t reproduced this again so will ignore it for now, not a big problem.
Alastair is trying out lcfg-sleep on his T3400. It seems to sleep and resume perfectly well but gnome pops up an error message when you log in after a resume. We’re going to try using gconf (and if that’s successful, lcfg-gconf) to suppress this error message.
Tim tried out lcfg-sleep and found a gap in its idleness checking. It was checking for low load average and for all interactive shells being idle, but a long session on a text editor can fulfill both of these, and Tim’s machine went to sleep while he was typing. So I had a rethink and (to cut a long story short) now reckon that we should let X session managers decide for themselves when the machine is idle. They turn out to be much better at it. In Gnome, gnome-power-manager does this perfectly well, and can be configured to suspend the machine a number of minutes after the machine becomes idle. We can use gconf and James Jarvis’s lcfg-gconf to set this as default behaviour for all gnome users. (User preferences can override default behaviour. If necessary gconf can also set mandatory settings which cannot be overridden by user preferences.) This takes care of idleness detection and suspend; resume will still be handled by the sleep component, which will continue to run every minute or two and will ensure that the machine always has a suitable wake time set. The machine will therefore not miss any important cron jobs no matter what puts it to sleep. (Coping with user-initiated suspend was always part of the sleep component design anyway.)
This led me to alter the component so that by default it looks for the presence of a user X session on the machine, and vetoes sleep if it finds one. This works. You can use the sleep.overridesessionmanager resource to override this behaviour and sleep even when somebody is logged in, but given Tim’s experience I don’t recommend it, unless you do it in conjunction with idleness timeouts long enough that they’re likely to be triggered only at night and weekends.
While working on the above I explored a sizeable blind alley or two. One was DBus: gnome-power-manager sends out signals using DBus when for instance it decides that the session is idle, and other apps can subscribe to those signals and react, or send messages back to gnome-power-manager asking it to change its behaviour. So for instance a movie player application could use DBus to inhibit the screen saver while a film was playing. But DBus has more than one bus per machine: it has one system bus but it also has a separate session bus for each user session. An LCFG component could probably get access to the system bus, being a thing running as root, but not to any of the session buses. And guess what, gnome-power-manager and other gnome stuff all uses the session bus: the idea is that only the user’s own gnome (and other DBus-using) apps will be legitimately interested in what the user is or isn’t doing. So the effect is that the lcfg-sleep component can’t possibly ever see any signals from gnome-power-manager – the “Hey, I’m idle now” signal for instance – so can’t use that info or send any influence back to gnome-power-manager either. Sigh. I briefly wondered about having each user launch some sort of home-cooked program which could subscribe to the DBus session bus and have that act as a gateway to/from lcfg-sleep but quickly gave that up as a ghastly idea and probably a horrible security risk.
Another blind alley was monitoring USB keyboards or mice for signs of activity – you’d need to have something running constantly rather than having something you could just query now and then to find out what the situation was, which is the model that the sleep component has been using. And having something monitoring keystrokes doesn’t sound great from a security point of view.
I had another partial failure: you can configure gnome-power-manager to suspend the machine after a certain period of inactivity. This is easily done using the gnome app (select it on the menu – More Preferences I think? – to put it onto the top menu bar in gnome), which lets you control how long until idle, how long from then until suspend, that sort of thing. This works very nicely. Since lcfg-sleep runs every few minutes, and writes a suitable wake-up time to the machine every time it runs whether it decides to sleep the machine or not, you can have gnome-power-manager send the machine to sleep and then have the machine wake up automatically in time to run the next cron job. But here’s where it gets complicated: when the machine wakes up again, it hasn’t been woken by a button press or in fact by anything which gnome-power-manager is looking out for, so gnome-power-manager doesn’t go through the process of realising “Oh, I’m not idle any more” – so it never becomes idle again, never suspends the machine again, and so you only ever get one period of sleep per period of idleness. The machine won’t be slept again until some sort of manual intervention happens – the user comes back and plays with the mouse or whatever, triggering the machine to realise that it’s not idle any more. Until that happens, the machine just sits awake. I don’t know if there might be some way of triggering gnome-power-manager to realise that the machine has woken up and isn’t idle any more?
Incidentally, when letting gnome-power-manager initiate the suspend, rather than the sleep component, you need to be careful to add the desired video quirks for each model to the sleep quirk database, since they can’t be invoked on the command line as lcfg-sleep does it. The sleep quirk database is a bunch of .fdi files under the /usr/share/hal/fdi directory. The existing files there don’t mention many of our models and are quite out of date; we’d have to add model info for our models, perhaps by upgrading to a more recent HAL version or perhaps just by tweaking the files in the existing version.
Another thing I looked at was the possibility of getting gnome-power-manager to send the machine to sleep with an instruction to wake up after a certain period of time. This is referred to in one of the gnome-power-manager docs or web pages I came across. I looked at the source for the version we’re using, 2.16, and found a stub where a future developer might add code to do the timed wake-up – but no actual existing facility to do it. Blast. Checking the source of the latest version, 2.26, the stub and all the code around it appears to have been swept away in a significant rewrite, and I couldn’t see any sign of automatic/timed wake-up anywhere in the new version.
Another limitation I had to place on sleep was to mandate the use of the intel video driver, rather than the i810 driver we’re currently using on 745s and 755s. Suspend and resume with i810 was just too unreliable. Thus the only machines getting an active sleep component are those 745s and 755s which are both on the develop release and happen to have been specially configured to override the default video driver by including dice/options/video_intel.h. Doing this turned out to be awkward with dice/options/video_intel.h being included in profiles *after* dice/options/sleep.h, but I got round it by giving the component the capability to find out for itself what X video driver is in use (new resource sleep.actualvideo, which inherits from xfree.device_main) and compare that to a list of acceptable drivers (sleep.approvedvideo) when evaluating whether or not it’s safe to send the machine to sleep.
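The driver comparison amounts to a membership test. Here is a minimal sketch of the idea, assuming the approved list is a space-separated string; the function name is hypothetical and this is not the component’s own code.

```shell
#!/bin/bash
# Hypothetical sketch: is the X video driver in use on the approved list?
# The two arguments stand in for the values of the sleep.actualvideo and
# sleep.approvedvideo resources; a space-separated list is an assumption.

# driver_is_approved ACTUAL_DRIVER "APPROVED DRIVER LIST"
# Succeeds (exit 0) only if ACTUAL_DRIVER appears in the list.
driver_is_approved() {
    local actual="$1" approved="$2" driver
    for driver in $approved; do          # unquoted on purpose: split on spaces
        [ "$driver" = "$actual" ] && return 0
    done
    return 1
}

# Example use (commented out):
# driver_is_approved "i810" "intel" || echo "video driver not approved: no sleep"
```

Checking at run time like this sidesteps the header ordering problem, because the test sees whatever driver value finally ends up in the profile.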
There’s a possibility that we can safely re-enable lcfg-sleep for 745s with i810 since the problem there was simply Gnome’s popup error message which we now know how to suppress using gconf.
I’ve tried i810 with the 755 with 5.3 and that’s not a possibility: the 755 doesn’t resume properly whatever I try when using i810 – it only works with a combination of the intel driver and the right video quirk option.
What I intend to do next is try to get the 5.3 develop 745 with i810 and DVI cables resuming happily with the same sleep quirk that works on that machine with the intel driver; and if that works, re-enable lcfg-sleep for i810-using 745s. Then I’ll monitor each machine actively running lcfg-sleep and leave things for a while to see how they go. With any luck I should be able to get on with other work while that’s happening.
There are two other areas of upcoming work: bugzilla and TiBS. The bugzilla work will be to revive the move of bugzilla.inf.ed.ac.uk to bugzilla version 3. This is the version the LCFG bugzilla uses so we’ve already done the work of upgrading the local infrastructure to cope. It’ll just be a matter of getting the new server up and running, copying over the data and reproducing the configuration. TiBS is the new all-singing all-dancing backup software we’re running and the task will be to automate appropriate bits of its configuration and use using LCFG. Craig’s driving that project, so it’ll be good to do it in conjunction with him (projects with other folk involved are both easier and more fun than doing it all by yourself) but so far not much has happened except that I’ve looked a bit at things and found (and been told) that the software is rather messy, our configuration is rather messy, the documentation we’ve been given is incomplete and out of date, the software has foibles and problems that we seem to only discover by running full tilt into them with our live backup service, and that our primary expert in all of this left us several months ago. Aside from that the TiBS project is going really well and it’s going to be hunky-dory and fab.
Go with the flow
I’ve been thinking about and exploring possible solutions to the problem I mentioned in my last post.
The more I read about DBus, Gnome and HAL the more I realise that I should be working with them rather than subverting them with lcfg-sleep. They detect the idleness of a user’s X session far more easily than I can, they have sophisticated tie-ins with other software via DBus (a common example is that a DVD player application can ask the screensaver to refrain from screen-saving while a film is playing), and gnome-power-manager has facilities I could use for setting default (overridden by user preferences) and mandatory (overrides user preferences) behaviour in matters such as automatically suspending a machine after the session has become idle. They also support use of the same display quirks which I’ve had to use with lcfg-sleep to get displays up and running again at resume time – and today I’ve tried enabling quirks in HAL then suspending with Gnome’s suspend method and they work, they bring the display back to a working state after resume, which isn’t what happens when the right quirk is not enabled.
Anyway, enough about quirks and stuff; the main idea is this. When it runs lcfg-sleep should detect whether or not there’s a user X session, and if there is, lcfg-sleep will refrain from suspending the machine. Instead it’ll back off to fallback behaviour of just calculating a suitable wake time for the machine and writing that to the kernel wake time file. Gnome-power-manager can be trusted to do the idleness detection and suspension. We can set default preferences for it to suspend the machine a suitable number of minutes after it judges the session to have become idle. Resume will happen thanks to the wake time which lcfg-sleep will have written to the kernel alarm file before suspend. lcfg-sleep doesn’t get triggered by gnome-power-manager or anything, it just runs regularly, every minute or two, and writes a suitable time to the kernel alarm file every time it runs; so there should always be a correct wake time (for example, in the future rather than the past) in the kernel alarm file. I’ve tested using gnome-power-manager’s suspension in conjunction with lcfg-sleep’s waketime and it works. (Well, it works as long as you set the proper display quirks in a HAL quirk database file anyway.)
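The “always in the future” requirement for the wake time can be sketched as below. This is an illustration under assumptions rather than the component’s logic: the function name and the lead-time parameter are hypothetical, times are epoch seconds, and the commented example assumes the alarm file is /sys/class/rtc/rtc0/wakealarm, which may not be the file in use here.

```shell
#!/bin/bash
# Hypothetical sketch: pick a wake time shortly before the next cron job,
# refusing any wake time that is not in the future.

# wake_epoch NOW NEXT_JOB LEAD_SECONDS
# Prints NEXT_JOB minus LEAD_SECONDS if that is later than NOW;
# otherwise prints nothing and fails.
wake_epoch() {
    local now="$1" next_job="$2" lead="$3"
    local wake=$(( next_job - lead ))
    if [ "$wake" -le "$now" ]; then
        return 1          # wake time would be now or in the past
    fi
    echo "$wake"
}

# Example use (commented out; needs root and writes to the RTC alarm):
# if wake=$(wake_epoch "$(date +%s)" "$next_job_epoch" 300); then
#     echo "$wake" > /sys/class/rtc/rtc0/wakealarm
# fi
```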
When there’s no user X session detected, lcfg-sleep can revert to its fuller behaviour of triggering suspend as well as setting wakeup time.
We could even perhaps base relevant gnome-power-manager default values on similar lcfg-sleep resource values. The gnome default values are set using gconf, and James has a handy lcfg-gconf component in the repository…
At last
Here’s the power management report.
wake on LAN revisited
I mailed discuss@lesswatts.org (list admin page) with a description of what I was trying to achieve and what I’ve found out so far. I’ve had some very helpful replies, mostly focusing on my not being able to get wake-on-LAN working for sleeping machines although I can get it working for powered off machines. Alas I still can’t get it working, but I feel better about it now, having tried a lot of clever ideas.
Look for the saving power on Linux desktops thread in the February 2008 list archives.
Thinking aloud
I’ve been having a few thoughts about the practicalities of sleeping our machines when idle.
The question of idle machines. How to tell when a machine is idle and it’s time to sleep?
– who’s logged in? “w”?
– remote logins?
– jobs?
– load average?
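As one example of the checks listed above, a load average test might be sketched like this. The function name and the threshold are hypothetical; awk is used because the shell itself can’t compare decimal fractions.

```shell
#!/bin/bash
# Hypothetical sketch: treat the machine as a sleep candidate only when the
# load average is below a threshold.

# load_is_low LOAD THRESHOLD
# Succeeds (exit 0) if LOAD is numerically less than THRESHOLD.
load_is_low() {
    awk -v load="$1" -v max="$2" 'BEGIN { exit !(load + 0 < max + 0) }'
}

# Example use (commented out): the first field of /proc/loadavg is the
# one-minute load average.
# read -r one_minute _ < /proc/loadavg
# load_is_low "$one_minute" 0.2 && echo "load is low"
```

On its own this would not be enough to call a machine idle; it is only one of the tests in the list.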
How to know when it’s time to wake up?
– LCFG resource, presumably.
– default value: when the next nightly maintenance period is due? Machines can be woken up manually by pressing the power button so the default would be to sleep right through the day until the next maintenance was due at night.
(Note: this would favour use of suspend rather than hibernate because of the faster resume.)
(Note 2: I wonder how accessible the power buttons are on machines installed in lab flip desks.)
What if an LCFG-managed thing is doing something which shouldn’t be interrupted?
– is there anything like this? LDAP updates for instance? updaterpms?
– In general we shouldn’t sleep a machine when it’s in an inconsistent state – a software package half-installed for instance. It’s one thing to have a machine in such a state for a few seconds but it’s quite another to suspend it in that state for several hours, during which the power might for all we know be pulled.
– could LCFG components set some sort of “stay awake” flag whenever they need to protect something?
– a power management component could check the stay awake flag, if it’s set the machine stays awake
– components could set the flag before embarking on a non-interruptible task and clear it afterwards
– also clear it on reboot and on resume/thaw
– a context would work well for this perhaps?
– the context could be set to various values – e.g. “normal” (it’s OK to sleep), “stayawake” (don’t sleep), “nap” (sleep in short bursts), …
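The simplest version of the stay awake idea is a flag file. Here is a hypothetical sketch (the path and function names are made up, and a real version would more likely use an LCFG context as suggested above):

```shell
#!/bin/bash
# Hypothetical sketch: a "stay awake" flag implemented as a file.
# STAYAWAKE_FLAG is a made-up path; override it in the environment if needed.
STAYAWAKE_FLAG="${STAYAWAKE_FLAG:-/var/run/stayawake}"

# A component sets the flag before a task that must not be interrupted...
set_stayawake()   { touch "$STAYAWAKE_FLAG"; }

# ...and clears it afterwards (also to be done on reboot and resume/thaw).
clear_stayawake() { rm -f "$STAYAWAKE_FLAG"; }

# The power management component checks it: succeeds if sleeping is allowed.
may_sleep()       { [ ! -e "$STAYAWAKE_FLAG" ]; }

# Example use (commented out):
# set_stayawake; run_the_uninterruptible_task; clear_stayawake
```

A single file can’t tell two components apart, so if one clears the flag while another still needs it the protection is lost; a per-component flag or a counter would be needed for that.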
What about Condor?
– the “nap” idea above comes from thinking about Condor. At first it might seem that Condor and sleeping machines aren’t great bedfellows. However this assumes that machines are sleeping when they could instead be productively running Condor jobs – or vice versa, that Condor uses up lots of electricity that could be better saved.
(My view: deciding how much people are allowed to use the computers is NOT our job. We should only be trying to save power on machines which are NOT in use. Condor is as legitimate as anything else.)
If instead we assume that a machine can be available to Condor more or less whenever Condor needs it, and sleeps when Condor doesn’t need it, the situation changes and Condor and sleeping-when-idle become natural partners.
Imagine that a Condor pool gets bursts of use – nothing much for a while then suddenly someone submits several thousand jobs. A machine could be set to go to sleep when idle (we still have to figure out what exactly we mean by “idle”, but leave that aside for now) but be told to sleep only for (say) half an hour. After that time the machine would wake up and stay awake for long enough for Condor to register the machine’s presence and give it a job if it has jobs to give it. We’ll have to measure how long that is, but say it’s five minutes.
After that time, if the machine is busy running a job it can stay awake. If it’s not it can go back to sleep for another half hour.
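The nap cycle described above boils down to a three-way decision each time the machine is checked. A minimal sketch, with hypothetical names and with the job count and time awake supplied by the caller:

```shell
#!/bin/bash
# Hypothetical sketch of the "nap" decision for a Condor machine.

# nap_action JOBS_RUNNING SECONDS_AWAKE GRACE_SECONDS
# Prints one of:
#   stay-awake  a job is running, so keep the machine up
#   wait        no job yet, but Condor hasn't had its grace period
#   sleep       no job and the grace period is over: nap again
nap_action() {
    local jobs="$1" awake="$2" grace="$3"
    if [ "$jobs" -gt 0 ]; then
        echo stay-awake
    elif [ "$awake" -lt "$grace" ]; then
        echo wait
    else
        echo sleep
    fi
}

# Example use (commented out), with the five minute grace period guessed above:
# nap_action 0 120 300     # two minutes after waking, no job yet
```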
This way we’d still get most of the savings from sleeping machines overnight whenever Condor wasn’t being heavily used, and also the high throughput from the Condor pool whenever it was needed.
The beauty of this is that it needs no expensive modifications to Condor, just an awareness of our sleep/awake context on the part of the Condor LCFG component. Equally it also needs our “is the machine idle” logic to be aware of Condor jobs (and maybe Condor’s current status on the machine?) but that would be necessary anyway.
Cooperation between Condor and sleeping-when-idle would appear to be a lot more straightforward than I had feared.
Of course it might be a bit odd to be working in a student lab late at night surrounded by machines periodically waking up and going to sleep…
overnight sleep success
The test machine was suspended yesterday with a wake alarm to make it wake up at 9:30am this morning. Success!
The screensaver prompted me for a password.
Then there was a pop-up kerberos renewal password prompt underneath that.
A klist revealed that I now just had a ticket-granting ticket:
% klist
Ticket cache: FILE:/tmp/krb5cc_28267_2bhpXS
Default principal: cc@INF.ED.AC.UK
Valid starting Expires Service principal
02/01/08 11:10:32 02/02/08 03:31:59 krbtgt/INF.ED.AC.UK@INF.ED.AC.UK
Kerberos 4 ticket cache: /tmp/tkt28267
klist: You have no tickets cached
I did a renc to get the rest of the tickets renewed properly, the AFS ticket for instance, and got a (bright red) winscard_clnt.c message:
% renc
Password:
winscard_clnt.c:320:SCardEstablishContextTH() Cannot open public shared file: /var/run/pcscd.pub
A subsequent klist listed all the expected tickets though (as far as I know):
% klist
Ticket cache: FILE:/tmp/krb5cc_28267_2bhpXS
Default principal: cc@INF.ED.AC.UK
Valid starting Expires Service principal
02/01/08 11:10:47 02/02/08 05:10:47 krbtgt/INF.ED.AC.UK@INF.ED.AC.UK
renew until 02/01/08 11:10:47
02/01/08 11:10:47 02/02/08 05:10:47 kca_service/kingsmen.inf.ed.ac.uk@INF.ED.AC.UK
renew until 02/01/08 11:10:47
01/31/08 11:10:47 02/02/08 05:10:47 kx509/certificate@INF.ED.AC.UK
02/01/08 11:10:47 02/02/08 05:10:47 afs/inf.ed.ac.uk@INF.ED.AC.UK
renew until 02/01/08 11:10:47
Kerberos 4 ticket cache: /tmp/tkt28267
klist: You have no tickets cached
amd was fine.
SL5 wake alarm success
The swsusp wake alarms work on SL5 as they did on FC6 (documented somewhere in the power diary). Good!
# echo "+00-00-00 00:05:00" > /proc/acpi/alarm
… then suspend the machine, and it’ll wake up 5 minutes later. (With a working amd! And with kerberos tickets refreshed, if you happened to be logged in.)
Stop press: it also works when you hibernate the machine.
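For other delays the relative alarm string can be built from a number of seconds. A small sketch, assuming delays of less than a day so that the date part stays at +00-00-00 as in the command above; the function name is hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch: build a relative alarm string in the same
# "+00-00-00 HH:MM:SS" form as the echo command above.

# acpi_alarm_offset SECONDS
# Prints the offset string; fails for negative delays or a day or more.
acpi_alarm_offset() {
    local secs="$1"
    if [ "$secs" -lt 0 ] || [ "$secs" -ge 86400 ]; then
        return 1
    fi
    printf '+00-00-00 %02d:%02d:%02d\n' \
        $(( secs / 3600 )) $(( secs % 3600 / 60 )) $(( secs % 60 ))
}

# Example use (commented out; needs root and sets a real wake alarm):
# acpi_alarm_offset 300 > /proc/acpi/alarm
```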
amd on SL5
Now that my test machine is running SL5 rather than FC6 I’ve retried hibernating it overnight, without the hook script, to see if amd crashes as it does on FC6. It doesn’t! It seems to survive the sleep period perfectly happily; after the machine thaws amd is still running afterwards and works perfectly well (it happily takes me to /pkgs/master/srpms/ when I ask it to). Which is nice! Perhaps there’ll be no need for the hook script on SL5 then.
hook script change
I’ve amended the hook script so that it tests for the amd component’s run file before restarting the component – because the run file should only exist while the component is running and we only want to restart the component if it was already running.
#!/bin/bash
. /etc/pm/functions
# On resume/thaw, restart the amd component if it was running, because
# amd crashes on resume if the machine has been suspended for more than
# a few minutes.
case "$1" in
hibernate|suspend)
;;
thaw|resume)
[ -f /var/lcfg/tmp/amd.run ] && /usr/bin/om amd restart
;;
*)
;;
esac
exit $?
TuxOnIce feedback!
Nigel Cunningham (a.k.a. Mr. TuxOnIce) has been in touch with a really helpful message, which he says it’s fine for me to post here:
Hi.
I just discovered your wiki entries regarding power management. Interesting stuff!
I noticed your comment that I hadn't documented the wake alarm support in TuxOnIce very well. I'll seek to fix that. In the meantime, here's a brief description:
/sys/power/tuxonice has five files that are relevant to this: lid_file, wake_alarm_dir, wake_delay, powerdown_method and post_wake_state.
Wake_alarm_dir says which rtc alarm to use. If you want to use /sys/class/rtc/rtc0/wakealarm, put rtc0 in this file.
Wake delay is the delay (in seconds) after going to sleep until we should wake again.
Powerdown_method says what we should do after writing the image and setting the alarm (if we set one). The numbers are based on ACPI state numbers, so 0 = non acpi poweroff, 3 = suspend to ram, 4 = enter ACPI platform suspend-to-disk state and 5 = acpi power off.
Post_wake_state is like powerdown_method, but described what to do after we wake.
Lid_file says which file to use to check the lid state (/proc/acpi/button/$Lid_file/state. If you want to use /proc/acpi/button/lid/LID/state, you'd put "lid/LID" in here. If the lid is open when we resume, the post-wake-state is ignored.
Using this combination of files (and assuming the wake events work on your computer), you can:
* Write a hibernation image, then suspend to ram. Wake (say) 20 minutes later and power off completely, unless the lid is opened and the computer woken in the meantime (in which case we just resume).
* Write a hibernation image and powerdown completely. Wake at 6am (set an absolute time by setting the wake_alarm prior to starting the cycle and not using the wake_alarm_dir/wake_delay entries), resume (ie reload the hibernation image) and suspend to ram until you're ready to use the computer.
Hope that helps!
Nigel
Thanks a lot, much appreciated! (And I’m slightly alarmed and quite pleased to find that people are actually reading this page!) In subsequent email Nigel also said
All I'd really ask is that if you have some feature you'd like to see, let me know please. I know from your comments that you guys want to minimise the diff against vanilla kernels. That said, I'm far more focussed on providing hibernation features than the guys who work on the mainline kernel, so I'll be more responsive to issues and feature requests. (Of course if you end up not using TuxOnIce, I won't mind either!)
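To make Nigel’s first example concrete, here is an untested sketch of how those files might be filled in. The file names and the meaning of the numbers come from his description; the function name is hypothetical, and the directory is a parameter so that the function can be tried out against a scratch directory instead of /sys/power/tuxonice.

```shell
#!/bin/bash
# Hypothetical sketch: write TuxOnIce wake settings into a settings directory.

# configure_toi_wake DIR RTC DELAY_SECONDS POWERDOWN_METHOD POST_WAKE_STATE
configure_toi_wake() {
    local dir="$1"
    echo "$2" > "$dir/wake_alarm_dir"    &&   # which rtc alarm, e.g. rtc0
    echo "$3" > "$dir/wake_delay"        &&   # seconds asleep before waking
    echo "$4" > "$dir/powerdown_method"  &&   # 3 = suspend to ram
    echo "$5" > "$dir/post_wake_state"        # 5 = acpi power off
}

# Example use (commented out; needs root and a TuxOnIce kernel):
# write the image, suspend to ram, wake 20 minutes later and power off.
# configure_toi_wake /sys/power/tuxonice rtc0 1200 3 5
```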
the amd hook script
(Also posted to the Power Diary.)
I’ve at last got round to trying out a pm hook script to get us round the problem of the amd automounter crashing when the machine resumes from a sleep state. As noted before the hook scripts go in /etc/pm/hooks. At our site the amd automounter is stopped, started and configured using the corresponding LCFG component, and to restart the daemon we do om amd restart. So here’s the hook script /etc/pm/hooks/25amd:
#!/bin/bash
. /etc/pm/functions
case "$1" in
hibernate|suspend)
;;
thaw|resume)
/usr/bin/om amd restart
;;
*)
;;
esac
exit $?
25 was just an arbitrary choice, putting it somewhere in the middle of the running order for the scripts.
Anyway, I’ve tried suspending the machine and leaving it for several hours, then resuming, which is normally enough to make amd give up completely at resume time. This time when the machine resumed amd simply restarted. Magic. Yes, it was a new process with a different pid, and amd was doing its job properly.
Actually I had better go back and check the amd crashing situation.
Confession time: since we’re thinking of possibly moving our desktops from FC6 to SL5 in the summer, I’ve tried out SL5 on my power management test machine to see what differences I could see. (I couldn’t see any – the software seems to look and behave pretty much identically – the only difference I could see was that the SL5 version of gnome power manager puts a prettier icon on your menu bar…). Anyway, so I’d better go back and double check that long suspends and hibernates do indeed provoke an amd crash on resume on SL5 just as they do on FC6.