Halloween comes but once a year, yet network and IT managers face horror stories and nightmares every day. It has been a while since we last visited this topic, and unfortunately little has changed: network and IT managers continue to encounter bizarre incidents on a regular basis.
Some are the result of poor planning outside the domain of the IT department. A great example of this is the case of the poorly placed data center kill switch, also known as the haunted red button. As we reported, one network manager noted:
“We had a large data center with a few AS/400s running various applications across Canada. The server room was pretty state-of-the-art at that time, with central cooling, proper cable management, alarms, fire suppression, cameras, and even an emergency power shutdown button.
“Unfortunately, this button was at ‘butt’ height and didn’t have a protective cover on it. One administrator bent over, hit the button with his behind, and killed power to the entire room, including the AS/400s, shutting down all the enterprise applications across the country. The individual had a very embarrassing week trying to recover all the data and get them back up and running.”
Other horror stories are the result of bad luck. That was the case years ago in an incident InformationWeek dubbed “Outage by SUV.” As reported then:
“The driver of a large four-wheel drive vehicle, a diabetes sufferer, passed out behind the wheel. Instead of swerving to the edge of the street, the vehicle accelerated straight ahead, failed to turn at a T-intersection, and jumped the curb to climb a grass berm on the far side. The berm served as a ramp that allowed the SUV to launch itself into the air over a row of parked cars.” As it came down, it slammed into a building housing a power transformer for a managed hosting provider’s facility, knocking it out as a power source.
Untested software and systems
Another power-related problem that probably occurs more frequently than any of us want to admit is a UPS that fails to come on when power is cut. This is usually the result of haste: the UPS is deployed but never properly tested.
Such problems are not limited to hardware. Carl Sverre, Senior Director, Engineering, Launch Pad at SingleStore, shares an incident from years ago that shows how important it is to test before putting something into production.
“A ‘horror story’ we experienced happened very early in SingleStore's journey. Someone at the company who was working on our test infrastructure wrote a script to replicate our analytics database from a legacy MySQL instance into SingleStore. We were excited to start eating our own dog food, and in our rush didn't audit the script enough before running it on production. Unfortunately, the script started with a drop database command which, rather than executing on the destination, ran on the source. Moments later, we realized that over a year’s worth of important test data had been eliminated. Needless to say, it was a bit of an intense moment. Fortunately, we had a backup, and we were able to recover most of the important data by the end of the day.”
He noted that the incident was a great lesson learned. “Now, years later, it’s exciting to see our product develop new features that can recover from this kind of horror story within seconds. Maybe, without this event in our past, we wouldn’t have prioritized our innovative separation of storage and compute design as much as we have.”
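One inexpensive safeguard against this class of mistake is to refuse destructive statements aimed at a host that has been marked as a source of record. The following is a minimal sketch of that idea, not SingleStore's actual tooling. The function name `check_statement`, the host names, and the choice to treat only `DROP` and `TRUNCATE` as destructive are all assumptions made for illustration.

```python
import re

# Statements treated as destructive in this sketch (an assumption, not an exhaustive list).
DESTRUCTIVE = re.compile(r"^\s*(DROP|TRUNCATE)\b", re.IGNORECASE)


def check_statement(statement: str, target_host: str, protected_hosts: set[str]) -> None:
    """Raise PermissionError if a destructive statement targets a protected host."""
    if DESTRUCTIVE.match(statement) and target_host in protected_hosts:
        raise PermissionError(
            f"refusing to run {statement.strip()!r} on protected host {target_host!r}"
        )


# Hypothetical host names, for illustration only.
PROTECTED = {"legacy-mysql.internal"}

# Dropping on the destination passes the check.
check_statement("DROP DATABASE analytics", "new-db.internal", PROTECTED)

# The same statement aimed at the source is rejected before it can run.
try:
    check_statement("DROP DATABASE analytics", "legacy-mysql.internal", PROTECTED)
except PermissionError as err:
    print(err)
```

A migration script would call a check like this immediately before executing each statement, so a swapped connection fails loudly instead of silently deleting the source.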
Same issue with cloud outages
Most of the top cloud outages of the last two years have been the result of updates or other software changes gone wrong. As we reported earlier this year:
- Cloudflare suffered a roughly one-hour outage impacting many companies and sites due to a change to the network configuration.
- Google Cloud had a two-hour outage due to a change to the Traffic Director code that processes configuration updates. The code change assumed that the configuration data format migration was fully completed. In fact, the data migration had not been completed.
- Amazon Web Services experienced a five-hour outage on the East Coast due to a glitch in some automated software that led to “unexpected behavior” that then “overwhelmed” AWS networking devices.
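The second incident in the list above turned on code that assumed a data migration had finished. A common defensive pattern is to check migration status explicitly and fall back to the old code path until it is complete. The sketch below illustrates that pattern only. It is not Google's code, and `MigrationStatus`, `choose_config_parser`, and the record counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MigrationStatus:
    """Progress of a hypothetical configuration-format migration."""
    total_records: int
    migrated_records: int

    @property
    def complete(self) -> bool:
        return self.migrated_records >= self.total_records


def choose_config_parser(status: MigrationStatus) -> str:
    """Select the new-format parser only once every record has been migrated."""
    return "new_format" if status.complete else "legacy_format"


# A partially migrated data set keeps using the legacy parser.
print(choose_config_parser(MigrationStatus(total_records=100, migrated_records=80)))
# A fully migrated data set switches to the new parser.
print(choose_config_parser(MigrationStatus(total_records=100, migrated_records=100)))
```

The design choice is to make the precondition something the code verifies at run time instead of something the change author assumes at deploy time.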
More horror stories from the field
Carrie Goetz, D.MCO and Principal/CTO at StrategITcom, LLC, and a frequent speaker at Network Computing events, offered up some spooky incidents she has encountered over the years.
“There was the case of cleaning people plugging vacuum cleaners into UPS outlets when cleaning the data center and shutting them down. Happened every night sometime between 2 and 4 AM. The only way we caught it was to sit up there at that time.”
“Or how about doing an audit of the gear in a data center? The customer thought they had about 2,600 servers, and we found over 3,000 physical machines. Some had not passed a bit of traffic in years.” Talk about a nightmare. She noted, “decommissioning was not in their vocabulary until after the audit.”
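An audit like this can be partly automated once each machine's last observed traffic is recorded somewhere. The sketch below assumes a simple mapping of host name to last-seen timestamp, with `None` meaning no traffic was ever recorded. The function name, the one-year threshold, and the sample inventory are illustrative assumptions, not details from the audit Goetz describes.

```python
from datetime import datetime, timedelta


def decommission_candidates(last_traffic, now, max_idle=timedelta(days=365)):
    """Return sorted host names idle longer than max_idle, or never seen at all."""
    return sorted(
        host
        for host, seen in last_traffic.items()
        if seen is None or now - seen > max_idle
    )


# Hypothetical inventory: host name -> time of last observed traffic.
inventory = {
    "app-01": datetime(2024, 5, 1),
    "old-batch-07": datetime(2020, 1, 15),
    "mystery-box": None,
}

print(decommission_candidates(inventory, now=datetime(2024, 6, 1)))
```

A list like this is only a starting point for review, since a quiet machine may still hold data someone needs.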
Another example should send chills down any IT manager’s spine. “We took over a contract for a prison health care provider. They had previously hired another company. When all of the deliveries were late, the customer started investigating and found out that the company was staging their servers in a shed with a dirt floor and no AC running. They kept going up and down, and two failed for dirt and moisture.”
It should be noted that Goetz recently came out with a book, “Jumpstart Your Career in Data Centers, Featuring Careers for Women, Trades, and Vets in Tech and Data Centers.” The book takes a holistic approach to explaining mission-critical data centers, from site selection to the cloud and all things in between.
A final word
Managing systems, data centers, and infrastructures keeps getting more challenging as complexity grows. As a result, network, data center, and IT managers face a constant stream of problems that lead to outages, disruptions, and unhappy users.
Rather than getting spooked, the best way to move forward is to try to minimize incidents that can cause nightmares. While there is no one thing that can guarantee perfect uptime and performance, the industry is increasingly using more sophisticated monitoring, management, and observability tools and services to spot anomalies, identify issues before they cause problems, and speed the resolution of problems after they happen.