How Secure is Your Server Cooling System?

November 29, 2004

Network Computing

The ability to restore data in minutes is moot if your server room's air conditioning system goes down for days. Is your cooling system up to its task? Do you have an emergency plan if it isn't? Have you tested that plan lately? Here are some cool ways to protect servers from heatstroke.

Most comedy is tragedy that happens to someone else. Indeed, we take comfort from sharing our woes with others for their amusement and instruction. This anonymous tale of serial tragedy is a good example:

An IT manager carried a phone everywhere, secure in the knowledge that "intelligent" monitors would alert him to any server problems via email. But following an unusually hot weekend, he was greeted by server room temperatures of 113°F (45°C). No bits were moving, naturally. Things went from bad to absurd during the next several days.

The "redundant" air conditioning unit shut down when its twin's power supply overheated and died. The UPS unit on the email alert system lacked such overload protection, so it fried. Water vapor condensed all over the server room floor. The functional AC unit was restarted and a repairman was called for the deceased unit, but the AC engineer could not arrive in less than three days. The one working unit got temperatures down to tolerable levels, but only until nightfall.

When the office building's AC system went off for the night, the lone server room cooler overloaded and shut down again. A junior AC engineer arrived the next day, only to learn that access to the rooftop unit required an hour of safety training, a "method statement" from the tenant ("Exactly what are you proposing to do on our roof?"), and 24 hours' notice. Portable coolers were rented, with thick vent hoses poking through the open server room door. Those, along with the one working unit, kept the servers running intermittently.

Four days of crashes later, a senior repairman arrived and declared the AC unit's motor kaput. A lone replacement for the obsolete motor was found, but it was dropped during shipment and damaged beyond repair. A whole new rooftop unit was needed, with attendant method statement and 24 hours' notice; the latter was not a problem because the new unit would not arrive for several weeks. Meanwhile, the working AC unit collapsed under the strain of running non-stop for a week and a half. The portable coolers and their security-risk hoses had been returned, so the server room became very quiet once more.

Eventually, the new AC unit was installed and things simmered down. But this server room will have to move before its equipment is upgraded or expanded, for the building owner and power company are unable to supply additional electricity to support cooling for more and hotter servers.

This story gets funnier the longer it doesn't happen to you. So, stop laughing and learn from it!

Fatal assumptions and complacency caused much of our hapless IT manager's dismay. He thought each cooling unit could handle the full load, but he never updated his calculations as the server heat load grew. He thought the units were wired in parallel, but in fact they were in series, so when one shut down both did. He thought all of his UPS units had thermal overload protection, but the one on the email alert system didn't. He thought parts would be available on demand, but they weren't. He thought he would be the first customer on his repairman's list, but he wasn't. He thought rooftop access was just a stair climb away, but it sure wasn't! A smart server manager checks and rechecks his assumptions and takes preventive measures before stuff hits the fan, says Steve Satchell of Reno, Nevada-based Web host American Internet.
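
The first of those assumptions, that one cooling unit can carry the full load alone, is easy to recheck whenever equipment is added. The following is a minimal Python sketch, not a sizing method from the article. It relies on the standard conversion of 1 watt to about 3.412 BTU per hour and assumes that all electrical power drawn by the equipment ends up as heat. The equipment wattages and unit ratings are hypothetical, and the sketch ignores lighting, people, and heat gain through walls, so treat its answer as a lower bound on the real load.

```python
WATTS_TO_BTU_PER_HOUR = 3.412  # 1 watt of load is roughly 3.412 BTU/hr of heat


def heat_load_btu_per_hour(equipment_watts):
    """Total heat output, assuming all electrical power ends up as heat."""
    return sum(equipment_watts) * WATTS_TO_BTU_PER_HOUR


def survives_single_failure(unit_capacities_btu, load_btu):
    """True if the remaining units can carry the load when the largest unit fails."""
    if len(unit_capacities_btu) < 2:
        return False  # a single unit has no backup at all
    remaining = sum(unit_capacities_btu) - max(unit_capacities_btu)
    return remaining >= load_btu


# Hypothetical inventory: power draw of each device in the room, in watts.
equipment_watts = [450, 450, 600, 300, 1200]
# Hypothetical cooling units, rated in BTU/hr.
unit_capacities_btu = [12000, 12000]

load = heat_load_btu_per_hour(equipment_watts)
print(f"Heat load: {load:.0f} BTU/hr")
print("Survives loss of one unit:",
      survives_single_failure(unit_capacities_btu, load))
```

Rerunning a check like this each time a server is added would flag the moment the "redundant" unit stops being redundant.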

"American Internet went through an issue where the primary AC unit would freeze up on a regular basis," Satchell says "We didn't get a handle on it until we installed a Liebert AC unit (www.liebert.com), where we could get to the evaporator coils if there was a chance of freeze-up. Actually, there isn't a chance of freeze-up now because our Liebert has heaters to prevent that sort of thing!"American Internet took foresighted steps to avert delays and power outages before they happened.

"It's written into our lease that we and our contractors have full and uninhibited access to the roof over our space. The owner is a licensed electrician so he made sure we have enough power for the foreseeable future. We also have an 80-kw generator standing by to power our three AC units, although we haven't needed it yet." Nonetheless, Satchell's servers were laid low once, as he recounts:

"Our AC failure stemmed not from a blowout but from a part failure: A fan belt broke on the primary AC unit. The secondary unit wasn't able to get rid of enough BTUs, so the room rose to 95 degrees. Our morning person opened the server room door and was blasted by the heat. He ran around opening every door possible, and scrounging every fan he could find to get air exchange set up. We lost a couple of servers and two UPS units to the heat.

"When we discovered the broken belt, we went to Home Depot to find the closest match, and then ordered three spares from our AC supplier. We still have one of them -- one of the replacements was eaten up before we got the motor properly aligned, which was the source of the problem."

A part in the hand is worth two in the catalog, which may well be out of date. You do keep spare CPUs, motherboards, and other IT parts handy, don't you? What about fan belts and compressor motors? And where is your email alert server located?

"We have two thermal probes, one at 3 feet and one at the ceiling," says Satchell. "Now, when the room gets above 76 degrees we get a page. We also have an active off-site monitor in my mail server at home, so that a failure in our server room or in our connectivity doesn't go unnoticed either. Finally, the active monitoring system that I wrote also monitors the temperature and flashes 'condition red' when the temperature is too high or there is no temperature report.

"Since that one episode, we've had a couple of failures of the AC with no ill effects."

May you be so "lucky." Luck is when preparation meets opportunity.
