toohot – shut down a machine if the cooling fails

One of the tasks of a system admin is to take care of computers in server rooms. You might be in your office, or you might be at home in bed, but the computers in the server rooms are in use 24 hours a day, 7 days a week.

This computing activity can generate a lot of heat, especially on powerful servers, so server rooms have powerful cooling systems.

That’s great – until the cooling system fails. When that happens, you’re going to want your servers to somehow know that the room is quickly becoming very hot, and maybe to shut themselves down before they suffer temperature-related hardware damage.

For some years now, our DICE servers have been running a script called toohot which shuts the server down if it detects a temperature which is too high.

The script uses the ipmi-sensors utility, part of the freeipmi package, to query one of the server’s built-in IPMI temperature sensors. Generally there’s one marked “Inlet Temp”, “Ambient” or the like – look for the sensor which reports the temperature of the air that’s being taken into the machine.

On a server running Linux you can list the machine’s temperature sensors and their current readings:

# ipmi-sensors -t temperature

Here’s what a Dell PowerEdge R740 has in the way of temperature sensors:

# ipmi-sensors -t temperature
ID  | Name         | Type        | Reading    | Units | Event
1   | Temp         | Temperature | 39.00      | C     | 'OK'
2   | Temp         | Temperature | 51.00      | C     | 'OK'
3   | Inlet Temp   | Temperature | 20.00      | C     | 'OK'
187 | GPU1 Temp    | Temperature | N/A        | C     | N/A
188 | GPU2 Temp    | Temperature | N/A        | C     | N/A
192 | GPU3 Temp    | Temperature | N/A        | C     | N/A
193 | GPU4 Temp    | Temperature | N/A        | C     | N/A
194 | GPU5 Temp    | Temperature | N/A        | C     | N/A
195 | GPU6 Temp    | Temperature | N/A        | C     | N/A
196 | GPU7 Temp    | Temperature | N/A        | C     | N/A
197 | GPU8 Temp    | Temperature | N/A        | C     | N/A
210 | Exhaust Temp | Temperature | 28.00      | C     | 'OK'

In this case toohot would use the sensor with ID 3, “Inlet Temp”.
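As an illustration of how that choice could be automated, here is a minimal Python sketch that parses a listing like the one above and picks the first sensor whose name looks like an inlet or ambient sensor. It is not the toohot code: the function names, the `INLET_NAMES` preference list and the trimmed `SAMPLE` text are assumptions made for this example.

```python
def parse_sensor_table(text):
    """Parse pipe-separated `ipmi-sensors -t temperature` output into dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip prompts, blank lines and anything that isn't a table row
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # the first table line names the columns
            continue
        rows.append(dict(zip(header, cells)))
    return rows


# Hypothetical preference list, based on the names mentioned in the text.
INLET_NAMES = ("inlet temp", "ambient")


def find_inlet_sensor(rows, names=INLET_NAMES):
    """Return the ID of the first sensor matching a preferred name, else None."""
    for wanted in names:
        for row in rows:
            if wanted in row["Name"].lower():
                return int(row["ID"])
    return None


# A shortened copy of the R740 listing, used here as test input.
SAMPLE = """\
ID  | Name         | Type        | Reading    | Units | Event
1   | Temp         | Temperature | 39.00      | C     | 'OK'
2   | Temp         | Temperature | 51.00      | C     | 'OK'
3   | Inlet Temp   | Temperature | 20.00      | C     | 'OK'
210 | Exhaust Temp | Temperature | 28.00      | C     | 'OK'
"""

print(find_inlet_sensor(parse_sensor_table(SAMPLE)))  # prints 3
```

On a machine whose sensors have different names you would need to adjust the preference list, or simply configure the ID by hand.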

To get the information it needs from the sensor, the script runs ipmi-sensors with “-v” (verbose) and “-s” (select a particular sensor by its ID). In this example:

# ipmi-sensors -v  -s 3

This gives, on the example machine:

Record ID: 3
ID String: Inlet Temp
Sensor Type: Temperature (1h)
Sensor Number: 5
IPMB Slave Address: 10h
Sensor Owner ID: 20h
Sensor Owner LUN: 0h
Channel Number: 0h
Entity ID: system board (7)
Entity Instance: 1
Entity Instance Type: Physical Entity
Event/Reading Type Code: 1h
Lower Critical Threshold: -7.000000 C
Upper Critical Threshold: 47.000000 C
Lower Non-Critical Threshold: 3.000000 C
Upper Non-Critical Threshold: 43.000000 C
Lower Non-Recoverable Threshold: N/A
Upper Non-Recoverable Threshold: N/A
Sensor Min. Reading: -128.000000 C
Sensor Max. Reading: 127.000000 C
Normal Min.: 11.000000 C
Normal Max.: 69.000000 C
Nominal Reading: 23.000000 C
Sensor Reading: 20.000000 C
Sensor Event: 'OK'

From this the script extracts the “Sensor Reading” value and the upper threshold values.
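The extraction step might look something like the following Python sketch. It assumes the verbose output is a series of “Key: value” lines as shown above, and that temperatures are printed as a number followed by “C”. The helper names `parse_verbose` and `celsius` are invented for this example and are not taken from toohot.

```python
def parse_verbose(text):
    """Turn the `Key: value` lines of `ipmi-sensors -v` output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines with no colon
            fields[key.strip()] = value.strip()
    return fields


def celsius(fields, key):
    """Return a field as a float in degrees C, or None if missing, N/A or not in C."""
    parts = fields.get(key, "").split()
    if len(parts) != 2 or parts[1] != "C":
        return None
    try:
        return float(parts[0])
    except ValueError:
        return None


# A few lines copied from the example output above.
SAMPLE = """\
Record ID: 3
ID String: Inlet Temp
Upper Critical Threshold: 47.000000 C
Upper Non-Critical Threshold: 43.000000 C
Upper Non-Recoverable Threshold: N/A
Sensor Reading: 20.000000 C
"""

fields = parse_verbose(SAMPLE)
print(celsius(fields, "Sensor Reading"))                   # prints 20.0
print(celsius(fields, "Upper Non-Critical Threshold"))     # prints 43.0
print(celsius(fields, "Upper Non-Recoverable Threshold"))  # prints None
```

Returning None for “N/A” keeps the “no meaningful value” case explicit, so the caller can decide what to do about it.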

It then uses a bit of simple logic:
If the “Sensor Reading” value is greater than or equal to the “Upper Non-Critical Threshold” value, shut down the machine.
In this case 20 is less than 43, so there is no need to shut down the machine.

Sometimes “Upper Non-Critical Threshold” won’t have a meaningful value but “Upper Critical Threshold” or “Upper Non-Recoverable Threshold” will. In these cases the script subtracts 5 °C from the “Upper Critical Threshold” value or 10 °C from the “Upper Non-Recoverable Threshold” value, then uses the result in place of the “Upper Non-Critical Threshold” value in the comparison above.
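The threshold logic, including the two fallbacks, can be sketched in Python as below. The function names are hypothetical, and one detail is an assumption: the text doesn’t say which fallback wins when both the critical and the non-recoverable thresholds are available, so this sketch prefers the critical one.

```python
def shutdown_limit(non_critical=None, critical=None, non_recoverable=None):
    """Pick the temperature (degrees C) at or above which to shut down.

    Each argument is a threshold as a float, or None if the sensor
    reported no meaningful value for it.
    """
    if non_critical is not None:
        return non_critical
    if critical is not None:
        # Assumption: the critical threshold is preferred over non-recoverable.
        return critical - 5.0
    if non_recoverable is not None:
        return non_recoverable - 10.0
    return None  # nothing to compare against


def too_hot(reading, limit):
    """True only if both values are known and the reading has reached the limit."""
    return reading is not None and limit is not None and reading >= limit


# The example machine: reading 20, non-critical 43, critical 47.
limit = shutdown_limit(non_critical=43.0, critical=47.0)
print(limit, too_hot(20.0, limit))   # prints 43.0 False

# A machine that only reports an upper critical threshold of 47.
print(shutdown_limit(critical=47.0))  # prints 42.0
```

Note that this sketch never reports “too hot” when the reading or the limit is unknown, which is a cautious choice made for the example rather than something the text specifies.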

Occasionally, also, the sensor won’t give a sensible value – perhaps some other software is using IPMI at the time. If this happens the script waits a few seconds then tries again. It does this a few times before giving up.
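A retry loop of that kind could be written along these lines in Python. The number of attempts and the delay are placeholder values, not the ones toohot uses, and `read_once` stands in for whatever function runs ipmi-sensors and returns a temperature, or None when the reading isn’t usable.

```python
import time


def read_with_retries(read_once, attempts=3, delay_seconds=5.0, sleep=time.sleep):
    """Call read_once() until it returns a value, pausing between tries.

    Returns the first non-None result, or None after `attempts` failures.
    The `sleep` parameter exists so the delay can be replaced in tests.
    """
    for attempt in range(attempts):
        value = read_once()
        if value is not None:
            return value
        if attempt < attempts - 1:
            sleep(delay_seconds)  # no point waiting after the final attempt
    return None


# Simulate a sensor that fails twice and then answers.
fake_results = iter([None, None, 21.0])
reading = read_with_retries(lambda: next(fake_results), sleep=lambda seconds: None)
print(reading)  # prints 21.0
```

If every attempt fails the function returns None, leaving it to the caller to decide whether an unreadable sensor should be logged, ignored or treated as a fault.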

The script is then run every few minutes using a “cron” job, to give some very basic protection against damage caused by a cooling failure.

The script used on the DICE machines is an LCFG component – LCFG is our automated machine configuration system – so it’s not easily separable from the rest of LCFG for use in servers managed by other means. But for what it’s worth, you can see it here:

If you look at the script you’ll see that it also checks the temperatures of any GPUs it finds.
