What's Chris been doing?

Successes and failures at inf.ed.ac.uk


What I’ve been doing in May 09


We had a technical Strategy Meeting this morning, an unusual event. I’d been dreading it, fearing a lack of stargazing technical foresight on my part, but as it happened the meeting was excellent, and I came away from it with two things: a massive boost to my morale (which had been flagging a bit since learning that I’m about to be chucked out of my bearable temporary office) and a determination to revive the habit of writing down, at least once a day, pretty much everything I’ve been up to at work that day: the successes, the failures, the error messages, the things I found out and the things that I do and don’t understand. Documenting the successes is great for me (and for others) as an aide-mémoire. Documenting the failures is even more important, though, in two ways: firstly it can often help to clear my thoughts, but secondly and maybe more importantly other people can read about my problems and chip in with their own ideas; they can rally round and help. Far better than just suffering in silence, I think.

In other words this amounts to the usual “I’m going to revive this blog” post.

Next week the sleep project is going to have to face the music as regards deadlines and the missing thereof at the monthly development meeting. This time Tim has asked us to produce a short summary of what happened in the last month with our projects and what might happen in the next month. To try to get it straight I’ve written down here what happened in the last month using the MPU weekly minutes as my guide. It’s probably rather longer than Tim asked for but if necessary I’ll summarise later. For now, I need to get this all down on paper dots.

Worked out how to successfully suspend and resume a Dell Optiplex 755 (the exact command is different for each model).

The test 745 is hanging sporadically. This is related to the latest kernel: it doesn’t happen on a stable 5.2 machine, but does happen on both a develop 5.3 machine and on a stable 5.2 machine with the latest test kernel (the one on develop 5.3 machines). This turned out to be a bug with VGA support: it happens only if you use a VGA cable, and machines with DVI-connected displays are fine. It appears that our 745s and 755s are pretty much universally connected with DVI, but all other models have at least some machines with VGA cables, so I’ve disabled lcfg-sleep for all models earlier than a 745.

Timed wake-up occasionally fails to happen with 5.3. Very occasionally. Haven’t reproduced this again so will ignore it for now, not a big problem.

Alastair is trying out lcfg-sleep on his T3400. It seems to sleep and resume perfectly well but gnome pops up an error message when you login after a resume. We’re going to try using gconf (and if that’s successful, lcfg-gconf) to suppress this error message.

Tim tried out lcfg-sleep and found a gap in its idleness checking. It was checking for low load average and for all interactive shells being idle, but a long session on a text editor can fulfill both of these, and Tim’s machine went to sleep while he was typing. So I had a rethink and (to cut a long story short) now reckon that we should let X session managers decide for themselves when the machine is idle. They turn out to be much better at it. In Gnome, gnome-power-manager does this perfectly well, and can be configured to suspend the machine a number of minutes after the machine becomes idle. We can use gconf and James Jarvis’s lcfg-gconf to set this as default behaviour for all gnome users. (User preferences can override default behaviour. If necessary gconf can also set mandatory settings which cannot be overridden by user preferences.) This takes care of idleness detection and suspend; resume will still be handled by the sleep component, which will continue to run every minute or two and will ensure that the machine always has a suitable wake time set. The machine will therefore not miss any important cron jobs no matter what puts it to sleep. (Coping with user-initiated suspend was always part of the sleep component design anyway.)
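To make the default-versus-mandatory distinction concrete, here is a minimal sketch of how such a setting could be written system-wide with gconftool-2. The two configuration sources are the standard GConf defaults and mandatory trees; the key name and the assumption that the value is in seconds are hypothetical placeholders, so check the schema shipped with your gnome-power-manager version. This is not how lcfg-gconf does it internally, just an illustration of the underlying idea.

```python
import shlex

# System-wide GConf sources: users can override "default" values,
# but not "mandatory" ones.
GCONF_SOURCES = {
    "default": "xml:readwrite:/etc/gconf/gconf.xml.defaults",
    "mandatory": "xml:readwrite:/etc/gconf/gconf.xml.mandatory",
}


def gconf_set_command(key, value, value_type="int", level="default"):
    """Build a gconftool-2 command line that sets a system-wide value."""
    return [
        "gconftool-2", "--direct",
        "--config-source", GCONF_SOURCES[level],
        "--type", value_type,
        "--set", key, str(value),
    ]


# Hypothetical key name and units (assumed seconds); not taken from the post.
SLEEP_KEY = "/apps/gnome-power-manager/timeout/sleep_computer_ac"

cmd = gconf_set_command(SLEEP_KEY, 30 * 60)
# Print the command rather than running it, so nothing is changed by accident.
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing level="mandatory" gives the variant that user preferences cannot override.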

This led me to alter the component so that by default it looks for the presence of a user X session on the machine, and vetoes sleep if it finds one. This works. You can use the sleep.overridesessionmanager resource to override this behaviour and sleep even when somebody is logged in, but given Tim’s experience I don’t recommend it, unless you do it in conjunction with idleness timeouts long enough that they’re likely to be triggered only at night and weekends.
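The veto rule described above boils down to a very small decision. The sketch below is only an illustration: the function and parameter names are made up, and how the component actually detects an X session is not shown here.

```python
def session_vetoes_sleep(x_session_present, override_session_manager=False):
    """Return True if a user X session should stop the component sleeping the machine.

    override_session_manager stands in for the sleep.overridesessionmanager
    resource; the boolean form used here is an assumption.
    """
    return x_session_present and not override_session_manager


# Default behaviour: somebody is logged in under X, so sleep is vetoed.
print(session_vetoes_sleep(True))
# With the override set, the session no longer blocks sleep.
print(session_vetoes_sleep(True, override_session_manager=True))
```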

While working on the above I explored a sizeable blind alley or two. One was DBus: gnome-power-manager sends out signals using DBus when for instance it decides that the session is idle, and other apps can subscribe to those signals and react, or send messages back to gnome-power-manager asking it to change its behaviour. So for instance a movie player application could use DBus to inhibit the screen saver while a film was playing. But DBus has more than one bus per machine: it has one system bus but it also has a separate session bus for each user session. An LCFG component could probably get access to the system bus, being a thing running as root, but not to any of the session buses. And guess what, gnome-power-manager and other gnome stuff all uses the session bus: the idea is that only the user’s own gnome (and other DBus-using) apps will be legitimately interested in what the user is or isn’t doing. So the effect is that the lcfg-sleep component can’t possibly ever see any signals from gnome-power-manager – the “Hey, I’m idle now” signal for instance – so can’t use that info or send any influence back to gnome-power-manager either. Sigh. I briefly wondered about having each user launch some sort of home-cooked program which could subscribe to the DBus session bus and have that act as a gateway to/from lcfg-sleep but quickly gave that up as a ghastly idea and probably a horrible security risk.

Another blind alley was monitoring USB keyboards or mice for signs of activity – you’d need to have something running constantly rather than having something you could just query now and then to find out what the situation was, which is the model that the sleep component has been using. And having something monitoring keystrokes doesn’t sound great from a security point of view.

I had another partial failure: you can configure gnome-power-manager to suspend the machine after a certain period of inactivity. This is easily done using the gnome app (select it on the menu – More Preferences I think? – to put it onto the top menu bar in gnome), which lets you control how long until idle, how long from then until suspend, that sort of thing. This works very nicely. Since lcfg-sleep runs every few minutes, and writes a suitable wake-up time to the machine every time it runs whether it decides to sleep the machine or not, you can have gnome-power-manager send the machine to sleep and then have the machine wake up automatically in time to run the next cron job. But here’s where it gets complicated: when the machine wakes up again, it hasn’t been woken by a button press or in fact by anything which gnome-power-manager is looking out for, so gnome-power-manager doesn’t go through the process of realising “Oh, I’m not idle any more” – so it never leaves its idle state, never notices a fresh onset of idleness, never suspends the machine again, and so you only ever get one period of sleep per period of idleness. The machine won’t be slept again until some sort of manual intervention happens – the user comes back and plays with the mouse or whatever, triggering the machine to realise that it’s not idle any more. Until that happens, the machine just sits awake. I don’t know if there might be some way of triggering gnome-power-manager to realise that the machine has woken up and isn’t idle any more?
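The one-sleep-per-idle-period effect can be reproduced with a toy model. This is a guess at the behaviour as observed from outside, not gnome-power-manager’s actual code: the class, its method names and the timings are all invented for illustration. The assumption is that suspend fires once when the idle timeout is crossed, and only user input re-arms it.

```python
class IdleWatcher:
    """Toy model: suspend is requested at most once per idle period."""

    def __init__(self, idle_after, suspend_after):
        self.idle_after = idle_after        # seconds of no input until "idle"
        self.suspend_after = suspend_after  # further seconds until suspend
        self.last_input = 0
        self.suspend_requested = False      # already fired for this idle period?

    def user_input(self, now):
        """Key press or mouse movement: leave the idle state and re-arm."""
        self.last_input = now
        self.suspend_requested = False

    def tick(self, now):
        """Return True if a suspend should be triggered at time `now`."""
        idle_for = now - self.last_input
        due = idle_for >= self.idle_after + self.suspend_after
        if due and not self.suspend_requested:
            self.suspend_requested = True
            return True
        return False


watcher = IdleWatcher(idle_after=10, suspend_after=20)

# Nobody touches the machine. A timed wake-up (say at t=60) produces no
# input event, so nothing re-arms the watcher after the first suspend.
first_period = [t for t in range(0, 200) if watcher.tick(t)]

# The user comes back at t=200 and wiggles the mouse, then leaves again.
watcher.user_input(200)
second_period = [t for t in range(200, 300) if watcher.tick(t)]

print(first_period, second_period)
```

In this model the fix would be anything that calls user_input() (or an equivalent reset) on resume, which is exactly the hook that seems to be missing.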

Incidentally, when letting gnome-power-manager initiate the suspend, rather than the sleep component, you need to be careful to add the desired video quirks for each model to the sleep quirk database, since they can’t be invoked on the command line as lcfg-sleep does it. The sleep quirk database is a bunch of .fdi files under the /usr/share/hal/fdi directory. The existing files there don’t mention many of our models and are quite out of date; we’d have to add model info for our models, perhaps by upgrading to a more recent HAL version or perhaps just by tweaking the files in the existing version.
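As a rough idea of what adding a model to the quirk database involves, the sketch below generates a small .fdi fragment. Everything specific in it is an assumption: the match key, the product string and the quirk name are placeholders that differ between HAL versions and models, so compare with the existing files under /usr/share/hal/fdi before using anything like this. It is not the quirk that lcfg-sleep actually passes for any of our machines.

```python
import xml.etree.ElementTree as ET


def quirk_fdi(match_key, product_substring, quirks):
    """Build an .fdi document that switches on suspend quirks for one model.

    match_key:          HAL property holding the model name (assumed; it
                        varies between HAL versions)
    product_substring:  text to look for in that property
    quirks:             quirk names to merge in as boolean true values
    """
    root = ET.Element("deviceinfo", version="0.2")
    device = ET.SubElement(root, "device")
    match = ET.SubElement(device, "match",
                          key=match_key, contains=product_substring)
    for quirk in quirks:
        merge = ET.SubElement(match, "merge",
                              key="power_management.quirk." + quirk,
                              type="bool")
        merge.text = "true"
    return ET.tostring(root, encoding="unicode")


# Placeholder values only; the real key, model string and quirk will differ.
fdi_text = quirk_fdi("system.hardware.product", "OptiPlex 755", ["vbe_post"])
print(fdi_text)
```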

Another thing I looked at was the possibility of getting gnome-power-manager to send the machine to sleep with an instruction to wake up after a certain period of time. This is referred to in one of the gnome-power-manager docs or web pages I came across. I looked at the source for the version we’re using, 2.16, and found a stub where a future developer might add code to do the timed wake-up – but no actual existing facility to do it. Blast. Checking the source of the latest version, 2.26, the stub and all the code around it appears to have been swept away in a significant rewrite, and I couldn’t see any sign of automatic/timed wake-up anywhere in the new version.

Another limitation I had to place on sleep was to mandate the use of the intel video driver, rather than the i810 driver we’re currently using on 745s and 755s. Suspend and resume with i810 was just too unreliable. Thus the only machines getting an active sleep component are those 745s and 755s which are both on the develop release and happen to have been specially configured to override the default video driver by including dice/options/video_intel.h. Doing this turned out to be awkward, with dice/options/video_intel.h being included in profiles *after* dice/options/sleep.h, but I got round it by giving the component the capability to find out for itself what X video driver is in use (new resource sleep.actualvideo, which inherits from xfree.device_main) and compare that to a list of acceptable drivers (sleep.approvedvideo) when evaluating whether or not it’s safe to send the machine to sleep.
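The driver check itself is simple. Here is a minimal sketch, assuming that sleep.approvedvideo is a space-separated list (the usual shape of an LCFG list resource, but an assumption here) and that sleep.actualvideo holds a single driver name; the function name is made up.

```python
def video_driver_approved(actual_video, approved_video):
    """Return True if the X driver in use is on the approved list.

    actual_video:   value of sleep.actualvideo, e.g. "intel"
    approved_video: value of sleep.approvedvideo, assumed space-separated
    """
    return actual_video in approved_video.split()


print(video_driver_approved("intel", "intel"))  # approved driver
print(video_driver_approved("i810", "intel"))   # driver not on the list
```

Because the comparison happens when the component runs rather than when the profile is compiled, the order in which the header files are included no longer matters.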

There’s a possibility that we can safely re-enable lcfg-sleep for 745s with i810 since the problem there was simply Gnome’s popup error message which we now know how to suppress using gconf.

I’ve tried i810 with the 755 with 5.3 and that’s not a possibility: the 755 doesn’t resume properly whatever I try when using i810 – it only works with a combination of the intel driver and the right video quirk option.

What I intend to do next is try to get the 5.3 develop 745 with i810 and DVI cables resuming happily with the same sleep quirk as with the same machine with the intel driver; and if that works, re-enable lcfg-sleep for i810-using 745s. Then I’ll monitor each machine actively running lcfg-sleep and leave things for a while to see how they go. With any luck I should be able to get on with other work while that’s happening.

There are two other areas of upcoming work: bugzilla and TiBS. The bugzilla work will be to revive the move of bugzilla.inf.ed.ac.uk to bugzilla version 3. This is the version the LCFG bugzilla uses so we’ve already done the work of upgrading the local infrastructure to cope. It’ll just be a matter of getting the new server up and running, copying over the data and reproducing the configuration. TiBS is the new all-singing all-dancing backup software we’re running and the task will be to automate appropriate bits of its configuration and operation using LCFG. Craig’s driving that project, so it’ll be good to do it in conjunction with him (projects with other folk involved are both easier and more fun than doing it all by yourself) but so far not much has happened except that I’ve looked a bit at things and found (and been told) that the software is rather messy, our configuration is rather messy, the documentation we’ve been given is incomplete and out of date, the software has foibles and problems that we seem to only discover by running full tilt into them with our live backup service, and that our primary expert in all of this left us several months ago. Aside from that the TiBS project is going really well and it’s going to be hunky-dory and fab.

Written by Chris Cooke

May 27, 2009 at 4:07 pm