Cooling an overheated server room...with fans: A true sysadmin story

 Originally published on September 22, 2021 by Shaun Behrens
Last updated on September 22, 2021 • 7 minute read

How do you cool a server room when the AC unit isn’t working? Well, what about good old ventilation fans? While this might not sound like a workable solution, it’s exactly the idea our colleague Michael had when he faced an overheated server room in a previous job. 

“It’s just one of those days…”

We all know those days that start off badly – and then proceed to get worse. That’s exactly how one particular day started for Michael when he arrived at work to discover that the security siren of his office building was going off. And then he found out that his server room was the cause of it.

Turns out, the server room had overheated overnight, and one of his early bird colleagues had opened the window in the server room in an attempt to cool it down. The problem was that the windows were equipped with contact sensors that would set off the security alarm when opened. So that explained why the siren was going off.

But why did the server room overheat in the first place? The answer is a combination of old hardware and bad luck. As is often the case, there was a main AC unit and a backup one. Both AC units were old and had no network interface, and so the IT team couldn't directly monitor them. Instead, they monitored the temperature of the server room itself using an IoT temperature sensor and PRTG.

On the previous day, PRTG had indicated that the temperature in the server room had been rising and dropping. After some investigation, they realized the main AC unit was malfunctioning. They shut it down, switched over to the backup AC unit, and called the AC technician to come the next day. What they did not know was that the backup AC unit was almost out of coolant (something they could have known if they had been monitoring it).

Overnight, the backup AC unit ran out of coolant and stopped working, causing the temperature in the server room to rise. And rise.

Several of the servers were configured to shut down when a specific temperature was reached to prevent them from overheating, and that's exactly what they did. So the next morning, the IT team discovered a server room that was around 140 degrees Fahrenheit (60 degrees Celsius) with many of their systems down. 
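To make the temperature-based shutdown a little more concrete, here is a minimal Python sketch of that kind of threshold check. The warning and shutdown values and the function names are assumptions for illustration only; they are not the values Michael's servers were configured with.

```python
# Hypothetical thresholds in degrees Celsius (illustrative only).
WARNING_C = 27.0
SHUTDOWN_C = 40.0


def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def classify(temp_c: float,
             warning_c: float = WARNING_C,
             shutdown_c: float = SHUTDOWN_C) -> str:
    """Return 'ok', 'warning', or 'shutdown' for a room temperature reading."""
    if temp_c >= shutdown_c:
        return "shutdown"
    if temp_c >= warning_c:
        return "warning"
    return "ok"


if __name__ == "__main__":
    room_c = fahrenheit_to_celsius(140.0)  # the temperature from the story
    print(room_c, classify(room_c))
```

With these assumed thresholds, the 140 degrees Fahrenheit from the story converts to 60 degrees Celsius and lands well inside the shutdown range.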

I’d like to thank the fans

Priority number one for Michael was turning the siren off, which required entering a keycode on a panel in the tropical climate of the server room. The next problem was that the AC technician was not going to show up for a few more hours, and Michael and the team needed to get the systems up and running as soon as possible. After all, you know users: they don't care if your server room is overheated (or even if your whole IT team is trapped in a volcano, really). They only want their systems up and running ASAP.

And getting the systems up meant cooling the server room down.

As already mentioned, there was a window in the server room, but it did not seem to be cooling the room fast enough. They needed to speed up the process – at which point Michael came up with the idea of using standard, everyday, garden-variety ventilation fans. The thinking was to open the building’s main entrance and then place five or six fans in a row to direct the cool winter air to the server room. Sounds a little crazy, doesn't it?

As even Michael says: “It was such a lame idea…but it worked!”

The team was able to lower the temperature of the server room enough to be able to turn on some servers and get the most important systems and services up and running again.

Lessons learned

Crises like this one are a fast track to new knowledge and insights, and it was no different for Michael. Here's what he learned from his experience:

1. Monitor your AC units!

These days, most AC units offer a network interface that lets you monitor them using software like PRTG. And even if you have an older unit, you can still monitor it using MQTT or Modbus TCP by connecting it to a gateway. Knowing the status of your AC units can alert you to potential problems before they occur and help you keep track of important metrics.
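For a feel of what a Modbus TCP poll looks like on the wire, here is a minimal Python sketch that builds a "Read Holding Registers" request (function code 0x03) and decodes the reply. It does no network I/O, and the unit ID, register address, and the idea that the gateway exposes the AC values as holding registers are all assumptions for illustration; the real register map depends on your gateway and AC unit.

```python
import struct
from typing import List

READ_HOLDING_REGISTERS = 0x03  # standard Modbus function code


def build_read_request(transaction_id: int, unit_id: int,
                       start_address: int, quantity: int) -> bytes:
    """Build a Modbus TCP frame asking for `quantity` holding registers."""
    pdu = struct.pack(">BHH", READ_HOLDING_REGISTERS, start_address, quantity)
    # MBAP header: transaction ID, protocol ID (always 0),
    # length of the remaining bytes (unit ID + PDU), unit ID.
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


def parse_read_response(frame: bytes) -> List[int]:
    """Return the 16-bit register values carried in a read response frame."""
    if len(frame) < 9:
        raise ValueError("frame too short for a Modbus TCP response")
    _tid, _proto, _length, _unit, function, byte_count = struct.unpack(
        ">HHHBBB", frame[:9])
    if function & 0x80:
        # In an exception response, the byte after the function code
        # is the exception code rather than a byte count.
        raise ValueError(f"Modbus exception code {byte_count}")
    if function != READ_HOLDING_REGISTERS:
        raise ValueError(f"unexpected function code {function}")
    data = frame[9:9 + byte_count]
    if len(data) != byte_count or byte_count % 2:
        raise ValueError("register data is truncated or malformed")
    return list(struct.unpack(f">{byte_count // 2}H", data))


if __name__ == "__main__":
    # Hypothetical: unit 1, two registers starting at address 0.
    request = build_read_request(transaction_id=1, unit_id=1,
                                 start_address=0, quantity=2)
    print(request.hex())
```

In practice you would send the request over a TCP socket to the gateway (Modbus TCP conventionally uses port 502) and pass the reply to `parse_read_response`, or simply let a monitoring tool or an existing Modbus library handle the protocol for you.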

2. Conduct regular AC system tests

AC units should not be a case of “set and forget” – test them regularly to ensure that they work as expected, including the failover process in the case of redundant ACs.

3. These are the days you’ll remember your whole life

One day, when your career as a sysadmin is over and you’re sitting staring out at the ocean, sipping a cocktail and enjoying your retirement, you might not remember installing the newest version of Windows for your users, or cleaning dirty keyboards. What you will remember are the moments of crisis and times you had to find an unusual solution.

Do you have your own sysadmin horror stories? Tell your story in the comments below!