# What's Chris been doing?

Successes and failures at inf.ed.ac.uk

## cron and periodic maintenance tasks on a possibly sleeping machine

When the sleep component is up and running, a DICE machine might spend most of its time sleeping, only waking up every few hours to do essential things for a few minutes before going back to sleep again. Then again, it might only go to sleep at night. Or it might be used so heavily (for example by Condor) that it rarely gets a chance to sleep from one week to the next.

I’ve been trying to figure out how best to manage cron jobs on such a machine.

We have to manage these things:

1. The scripts in the distro-provided directories /etc/cron.hourly, /etc/cron.daily, /etc/cron.weekly, /etc/cron.monthly need to be run. At the moment these directories are run by cron at specific times, during which our machine might be asleep.
2. Also, those scripts need to be run the right number of times. We don’t want to run them too often.
3. LCFG-configured periodic jobs (the current cron.* resources) need to be run too. For example the boot component is run once per night, and the openldap component runs once per hour.

You can see what a typical DICE desktop runs from cron on a typical day here:

http://wiki.inf.ed.ac.uk/view/DICE/PowerManagementReport#Cron_and_Anacron

It’d be nice to have clean, simple code managing all of this, so we’d make use of standard ways of doing things, and we’d have more small, simple components. It’d also be nice to have as little duplication as possible in the LCFG resources – it seems confusing to need both cron resources and when-I'm-awake resources for doing periodic tasks. And it’d be nice to have one software system just take care of all of this for us and give us a fairly simple interface so we didn’t need to worry about the interaction between (say) maximum sleep periods before wake, and minimum times between successive runs of a periodic task. It’d be nice to minimise the change required from the current system too – we have hundreds of cron resources in the header files. And it’d be nice to have clear, simple, defined roles for the different components of the system, with clear and obvious interactions between them.

But all this seems horribly contradictory. If you simplify things in one way you seem to complicate them in others. For example, if you try to keep the existing cron resources, you force the cron component to do things grossly different from what it was designed for: it’d need to monitor whether or not cron had done things, or how long the machine had been sleeping, rather than just making a few config files. And since you’d also need to add some extra resources to cover asynchronous operation (for instance, whether or not it was acceptable to run a missed-due-to-sleep job an hour after it was missed), the resources get complicated and need changing anyway. The cron component turns into a hideous monster.

Perhaps it could be replaced by one unified “tasks runner” which Just Takes Care Of It for you – you tell it to run things so many times per time period and the component then figures out what to do, and runs as a cron-type daemon, keeping timestamps and goodness knows what. This sounds a bit like the hideous monster cron component above, except possibly cleaner and simpler as it would be designed from scratch. But this seems to chuck “do it the standard way” and “make use of standard software and components” right out of the window.
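To make the “so many times per time period” idea a bit more concrete, here’s a minimal Python sketch of the bookkeeping such a runner might do. It is only an illustration under my own assumptions: the `Task` and `TaskRunner` names, the task names, and the idea of turning “N times per period” into a minimum gap between runs are all hypothetical, and nothing here is an existing LCFG component or resource.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """A hypothetical periodic task: run `times` times per `period` seconds."""
    name: str
    times: int
    period: float

    @property
    def min_gap(self):
        """Minimum spacing between successive runs, in seconds."""
        return self.period / self.times


@dataclass
class TaskRunner:
    """Hypothetical runner that only tracks when each task last ran."""
    tasks: dict = field(default_factory=dict)
    last_run: dict = field(default_factory=dict)  # task name -> epoch seconds

    def add(self, name, times, period):
        self.tasks[name] = Task(name, times, period)

    def due(self, now):
        """Names of tasks that have never run, or whose minimum gap has elapsed."""
        return sorted(
            name for name, task in self.tasks.items()
            if name not in self.last_run
            or now - self.last_run[name] >= task.min_gap
        )

    def mark_run(self, name, now):
        """Record that a task ran at time `now`."""
        self.last_run[name] = now
```

A real daemon would also have to persist `last_run` across reboots and actually execute the jobs; the sketch only shows the “is it due yet?” decision, which is the part that doesn’t care whether the machine was asleep in the meantime.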

Maybe we could keep the cron resources, but have a second “asynchronous cron” component inherit their values? It could peek to see when or whether things have been run or not; it could deduce when to run things based on the times given in the cron resources; and so on. Sounds complicated to write. It’d have to understand cron times and be able to judge when or whether to run them. Presumably it’d have to wait until cron had failed to run something then it’d step in afterwards and run it instead. But how long afterwards would be acceptable? The more you think about this the more disgusting it sounds.
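The “how long afterwards?” question can at least be written down as a rule. The following Python sketch assumes a deliberately restricted case: a job scheduled once a day at a fixed hour and minute, plus a per-job grace period. The function names and the grace period are hypothetical, it uses naive datetimes (so it ignores daylight-saving changes), and it does not attempt to parse real cron time specifications.

```python
from datetime import datetime, timedelta


def latest_missed_daily(hour, minute, slept_from, woke_at):
    """Most recent daily HH:MM occurrence inside (slept_from, woke_at], or None."""
    candidate = woke_at.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate > woke_at:
        # Today's slot is still in the future, so look at yesterday's.
        candidate -= timedelta(days=1)
    return candidate if candidate > slept_from else None


def should_catch_up(missed, woke_at, grace):
    """Run the job late only if it was missed by no more than `grace`."""
    return missed is not None and woke_at - missed <= grace
```

For example, with a machine asleep from 01:00 to 04:30, a 03:15 job would count as missed by 1 hour 15 minutes, so it would be run on waking with a two-hour grace period but skipped with a one-hour one. Even this toy version needs an extra per-job setting, which is exactly the sort of resource creep I was worried about.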

Should the component which manages sleep also be in charge of kicking off periodic jobs? If it knows when it’s going to send the machine to sleep and when it’s going to tell the machine to wake up again, it’ll be able to use that information to figure out when to run periodic tasks. But it sounds as if this would be rather confusing to write.

Should we have a component completely separate from cron and from the sleep component? Call it awake for instance. awake would run when the machine woke up. I envisage a machine with Condor waking up at least every few hours, and a machine without Condor waking up at least once a day, so awake would be able to run things reliably say once a day. But that’d depend on whichever component manages sleep being told to make sure that the machine woke up often enough. Is it OK to leave that sort of coordination task to the sys admin to manage by hand? Sensible defaults could be set anyway.

But wouldn’t that need to run things at some other time as well as whenever the machine woke? Say we have a machine that’s so popular with Condor that it never gets a chance to sleep. If it never sleeps, it’ll never wake, and if it never wakes, it’ll never run our stuff, like the nightly run of the boot component for instance.

We need some sort of coordination between running things at wake-up time and running things regularly from cron; something which makes sure that things are run often enough but not too often.
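One way to picture that coordination is to have both triggers, the ordinary cron entry and whatever runs at wake-up, call the same small wrapper, which consults a shared timestamp file under a lock. Here’s a minimal Python sketch of that idea. The `run_if_due` function, the stamp file, and the minimum interval are all hypothetical names of my own; it assumes a Unix-like system (it uses `fcntl.flock`), and it’s an illustration rather than a proposal for how the component should actually be written.

```python
import fcntl
import os
import time


def run_if_due(stamp_path, min_interval, job, now=None):
    """Run `job` unless it already ran within the last `min_interval` seconds.

    Returns True if the job ran. Intended to be called from both a cron
    entry and a wake-up hook, so whichever fires first does the work.
    """
    now = time.time() if now is None else now
    # Open (creating if needed) without truncating, then take an exclusive
    # lock so two simultaneous callers cannot both decide the job is due.
    fd = os.open(stamp_path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as stamp:
        fcntl.flock(stamp, fcntl.LOCK_EX)
        text = stamp.read().strip()
        last = float(text) if text else 0.0
        if now - last < min_interval:
            return False  # ran recently enough; not too often
        job()
        # Overwrite the stamp with the time of this run.
        stamp.seek(0)
        stamp.truncate()
        stamp.write(repr(now))
        return True
    # The lock is released when the file is closed at the end of the block.
```

With this shape, a machine that never sleeps still gets its jobs from cron, a machine that sleeps through the cron slot gets them at wake-up, and a machine that does both doesn’t run them twice. It doesn’t answer the question of who makes sure the machine wakes often enough in the first place.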

I’ve spent the last day or so dreaming up ever more baroque ways in which this shouldn’t be done, and failing to come up with a simple-sounding solution. I’m sure there were several other ideas more ghastly than the ones above.

The last couple do seem to have the most potential, though.

Help? Any ideas or observations?

Oh, and PS: I forgot to say earlier. anacron would be really useful, except that the most frequent period it supports is daily. It can’t run things hourly or twice a day, for instance.

Written by Chris Cooke

July 10, 2008 at 4:21 pm

Posted in Uncategorized
