If you’ve worked in IT for more than 10 minutes, you know that stuff goes wrong. In fact, it should be obvious that we have jobs in IT specifically because things go wrong.
And that’s what IT monitoring and automation are all about -- building systems that automatically mind the shop, raise a flag when things start to go south, and capture what happened and when it happened so you can avoid it in the future.
After more than a decade implementing monitoring systems at companies large and small, I’ve become all too familiar with what might be called monitoring grief. This is what often occurs when you are tasked with monitoring something -- anything -- for someone else, which is almost inevitable, and they ask you to do things you know are going to cause problems. It involves a series of behaviors I’ve grouped into five stages. Get it -- the five stages of (IT monitoring) grief?
While companies often go through these stages when rolling out monitoring for the first time, they can also occur when a group or department starts to seriously implement an existing solution, when new capabilities are added to a current monitoring suite or simply when it’s Tuesday.
Spoiler alert: If you’re at all familiar with the standard Kübler-Ross five stages of grief model, acceptance is not on this list.
Stage one: Monitor everything
This is the initial monitoring non-decision, a response to the simple and innocent question, “What do I need to monitor?” The favorite choice of managers and teams who won’t actually get the ticket is to open the fire hose wide and ask you to monitor “everything.” This choice is also frequently made by admins with a hair-on-fire problem in progress. It assumes that all information is good information and that the noise can be “tuned up” later.
Stage two: The Prozac moment
This stage follows closely on the heels of the first, when the recipient of 734 monitoring alert emails in five minutes comes to you and exclaims, “All these things can’t possibly be going wrong!” While this may be correct in principle, it ignores the fact that a computer only defines “going wrong” as specifically as the humans who requested the monitors in the first place. So, you ratchet things down to reasonable levels, but “too much” is still showing red and the reaction remains the same.
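What “ratcheting things down” looks like in practice is adding thresholds and hold-down timers so a monitor only alerts on sustained problems, not momentary blips. As one illustrative sketch, here is a Prometheus-style alerting rule (the metric name `disk_free_percent` is hypothetical; your monitoring tool will have its own equivalents) where the `for:` clause requires the condition to persist for 15 minutes before an alert fires:

```yaml
groups:
  - name: capacity
    rules:
      - alert: DiskNearlyFull
        # Hypothetical metric: fire only when free space drops below 10%...
        expr: disk_free_percent < 10
        # ...and stays there for 15 minutes, so transient dips don't page anyone.
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} has been under 10% free for 15m"
```

The point of a rule like this is that it reduces the 734-emails-in-five-minutes problem without hiding real failures -- which is the line the later stages of grief cross.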
Worse, because monitoring was considered “wrong” before (thanks to the monitor-everything request), monitoring must be wrong again. Except this time it isn’t. It’s catching all the stuff that’s been going up and down for weeks, months, or years, but which nobody noticed. Either the failures self-corrected quickly enough, users never complained, or someone somewhere was jumping in and fixing things before anybody knew about them.
It’s at this moment you wish you could give the system owner Prozac so they will chill out and realize that knowing about outages is the first step to avoiding them in the future.
Stage three: Painting the roses green
The next stage occurs when too many things are still showing as “down” and no amount of tweaking is making them show “up” because, ahem, they are down.
In a fit of stubborn pride, the system owner often insists something like, “They’re not down-down, they’re just, you know, a little down-ish right now.” And so they demand that you do whatever it takes to show the systems as up/good/green.
And I mean anything: changing alert thresholds to impossible levels (“Only alert if it’s been down for 30 hours. No, make that a full week.”) or disabling alerts entirely. In one case, I was told -- under threat of losing my job -- to create a completely false page, with GIFs recolored from red to green, to show senior management.
What makes this stage even more embarrassing for all concerned is that the work involved in faking the status is often greater than the work required to actually fix the issue.
Stage four: An inconvenient truth
And so the web of deceit and lies goes on, sometimes for weeks or months, until there’s a critical error that can’t be glossed (or Photoshopped) over. At that point, you and the system owner find yourselves on a service restoration call with about a dozen other engineers and a few high-ranking IT staffers, where everything is analyzed, checked and restarted in real time.
This is about the time someone asks to see the performance data for the system -- the one that’s been down for a month and a half, but showed as “up” on the reports. For a system owner who has developed a habit of buying their green paint by the tanker-full, there is nowhere left to run or hide.
Stage five: Gaming the system
Assuming the system owner has made it through stage four with his or her job intact, stage five involves keeping you at arm’s length. The less sophisticated folks will find ways to have monitoring without all the messy inconvenience of actually having monitoring. People who’ve been around a while will instead request detailed information on exactly what permissions you need in order to monitor -- information that is then passed along to an inevitably brand-new security audit team, which denies the access request out of hand because the permissions are too risky to give out.
At this point, you have a choice: Pull out all your documentation and insist you be given the permissions that have already been agreed upon, or go find another group that actually wants monitoring.
And what of the system owners who started off by demanding, “Monitor everything”? Don’t worry, they’ll be back after the next system outage -- to give you more grief.