It’s now commonly understood that monitoring is only a subset of observability. Monitoring shows that something’s wrong with your IT infrastructure and applications, while observability helps you understand why, typically through analysis of logs, metrics, and traces. In today's environment, determining the "root cause" of a performance issue, the holy grail of observability, requires several data streams: availability data, performance metrics, custom metrics, events, logs/traces, and incidents. An observability framework is built from these data sources and allows the operations team to navigate the data with confidence.
Observability can also determine what prescriptive actions to take, with or without human intervention, to respond to or even prevent critical, business-disrupting scenarios. Getting to that advanced level of observability requires a monitoring evolution from reactive to proactive (or predictive) and, finally, prescriptive monitoring. Let’s discuss what that evolution includes.
No easy feat
First, it pays to look at the current state of federated IT operations to see the challenge. Infrastructure and applications are scattered across staging, pre-production, and production environments, both on-premises and in the cloud, and IT operations teams are constantly engaged to make sure those environments are always available to meet business requirements. The operations team has to deal with multiple tools, teams, and processes. There is often confusion about how many data streams are required to implement an observability platform, and about how to get business and IT operations teams within an organization to follow a framework that improves operations over time.
In order for monitoring to mature past a metrics dashboard into this observability posture, it typically evolves in three stages: Reactive, Proactive (Predictive), and Prescriptive. Let’s dig into what these are.
Phase 1: Reactive Monitoring: These are monitoring platforms, tools, or frameworks that set performance baselines or norms, derive thresholds from them, then detect when those thresholds are breached and alert accordingly. They help in determining the optimized configurations required to prevent performance thresholds from being reached. Over time, the pre-defined baseline might shift as more hybrid infrastructure is called on or deployed to support a growing number of business services and expanded enterprise reach. This can result in poor performance becoming normalized and no longer triggering alerts, which can eventually lead to a system crashing altogether. Organizations then look to proactive and predictive monitoring to alert them in advance of performance anomalies that may indicate an impending incident.
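To make the baseline-drift problem concrete, here is a minimal Python sketch of static threshold alerting. The function names, the 1.5x headroom factor, and the response-time samples are all hypothetical and chosen only for illustration; real monitoring tools set thresholds in their own ways.

```python
from statistics import mean

def static_threshold(baseline_samples, headroom=1.5):
    """Derive a fixed alert threshold as a multiple of the baseline mean.

    The headroom factor is an assumed value for this sketch.
    """
    return mean(baseline_samples) * headroom

def breaches(samples, threshold):
    """Return the indices of samples that exceed the threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]

# Hypothetical response times in milliseconds.
original_baseline = [100, 110, 90, 105, 95]
degraded = [180, 190, 200, 185, 195]

# Threshold derived from the original, healthy baseline.
threshold = static_threshold(original_baseline)
alerts_before = breaches(degraded, threshold)

# If the baseline is later recalculated from the degraded samples,
# the same poor performance no longer triggers any alert.
drifted_threshold = static_threshold(degraded)
alerts_after = breaches(degraded, drifted_threshold)

print(threshold, alerts_before)
print(drifted_threshold, alerts_after)
```

In this sketch every degraded sample breaches the original threshold, but none breaches the threshold recomputed from the drifted baseline, which is how poor performance becomes normalized.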
Phase 2: Proactive/Predictive Monitoring: Though the words sound different, predictive monitoring can be considered a subset of proactive monitoring. Proactive monitoring enables organizations to view signals from the environment, which may or may not be the cause of a disruption to business services. This allows organizations to prepare remediation solutions or standard operating procedures (SOPs) to overcome priority-zero incidents. One common approach to implementing proactive monitoring is to provide a "manager of managers" with a unified UI, where operations teams have access to all the alerts from multiple monitoring domains and can gain an understanding of how their systems behave both under "normal" conditions and when they hit "performance bottlenecks." When a pattern of behavior matches existing machine-learned patterns that indicate a potential problem, the monitoring system triggers an alert.
Predictive monitoring uses dynamic thresholding, which is especially useful for newer technologies where teams have no first-hand experience of how they should perform. These tools learn metric behavior over a period of time and alert when a metric deviates from its learned norm, typically measured in standard deviations, in a way that could lead to outages or performance degradations that end users would notice. Actions can then be taken in response to these alerts to prevent business-impacting events.
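Dynamic thresholding of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rolling window of 5 samples, the multiplier k = 3, the function name, and the CPU readings are all hypothetical, and production tools generally use more sophisticated models.

```python
from collections import deque
from statistics import mean, stdev

def dynamic_anomalies(samples, window=5, k=3.0):
    """Flag samples that deviate from a rolling baseline by more than k sigma.

    The baseline is learned from the last `window` non-anomalous samples,
    so no threshold has to be known in advance.
    """
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(samples):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # A perfectly flat history (sigma == 0) is not evaluated here.
            if sigma > 0 and abs(value - mu) > k * sigma:
                anomalies.append(i)
                continue  # keep the anomaly out of the learned baseline
        history.append(value)
    return anomalies

# Hypothetical CPU utilization readings (percent).
cpu = [50, 52, 49, 51, 50, 51, 95, 50, 52]
print(dynamic_anomalies(cpu))  # [6]
```

Excluding flagged samples from the history is a design choice in this sketch: it keeps a single spike from widening the learned band and masking the next one.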
Phase 3: Prescriptive Monitoring: This is the final stage of the observability framework, where the monitoring system can learn from events and remedial/automation packs in the environment and understand the following:
- Which alerts are most frequently occurring, and what remediation actions are executed from automation packs in response to them?
- Whether the resources that triggered alerts belong to the same data center, or whether the same issues are seen across multiple data centers, which might point to faulty configuration baselines.
- Whether an alert is seasonal and can be ignored at a later stage without executing unnecessary automation.
- What remediation actions to execute on new resources that are introduced as part of vertical or horizontal scaling.
IT ops teams need proper algorithms to associate and formulate these scenarios. Those algorithms can draw on a combination of feeds from ITOM and ITSM systems into the IT operations analytics engine to build prescriptive models.
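As a rough illustration of the questions listed above, the following Python sketch summarizes an alert history by frequency, most-used remediation, data center spread, and a crude seasonality check. The event tuples, field names, and the seasonality heuristic (at least 3 occurrences, 80% of them in the same hour of day) are assumptions made for this example and do not describe any particular product's algorithm.

```python
from collections import Counter, defaultdict

# Hypothetical alert history: (alert, data_center, hour_of_day, remediation).
events = [
    ("disk_full", "dc1", 2, "purge_tmp"),
    ("disk_full", "dc2", 2, "purge_tmp"),
    ("disk_full", "dc1", 2, "expand_volume"),
    ("high_cpu", "dc1", 14, "restart_service"),
    ("disk_full", "dc2", 2, "purge_tmp"),
]

def summarize(events, min_occurrences=3, seasonal_share=0.8):
    """Build a per-alert summary that a prescriptive model could start from."""
    freq = Counter()
    actions = defaultdict(Counter)
    sites = defaultdict(set)
    hours = defaultdict(Counter)
    for name, data_center, hour, action in events:
        freq[name] += 1
        actions[name][action] += 1
        sites[name].add(data_center)
        hours[name][hour] += 1

    summary = {}
    for name, count in freq.most_common():
        _, top_hour_count = hours[name].most_common(1)[0]
        summary[name] = {
            "count": count,
            # Most frequently executed remediation for this alert.
            "recommended_action": actions[name].most_common(1)[0][0],
            # Seen in more than one data center?
            "multi_site": len(sites[name]) > 1,
            # Assumed heuristic: enough occurrences, clustered in one hour.
            "seasonal": count >= min_occurrences
            and top_hour_count / count >= seasonal_share,
        }
    return summary

report = summarize(events)
for alert, details in report.items():
    print(alert, details)
```

The recommended action for a frequently seen alert could also serve as a default for new resources added through scaling, though a real system would need richer context from ITOM and ITSM feeds than this sketch uses.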
Seeing the future is the new monitoring
Monitoring is not observability but a key part of it. It starts with reactive monitoring, which tells you when a pre-defined performance threshold has been breached. As you bring more infrastructure and application services online, monitoring needs to shift to proactive and predictive models, which analyze larger sets of monitoring data and detect anomalies that could indicate a potential problem before service levels and user experience are impacted.
An observability framework then analyzes a series of data points to identify the most probable cause of a performance issue or outage within the first few minutes of anomaly detection, so that remediation can begin before the team has to move to war room or situation analysis calls. The end result is a better user experience, an always-available system, and improved business operations.
Finally, you close the observability loop with prescriptive monitoring, which filters for frequency and seasonality and recommends remedial actions to take.
Prasad Dronamraju is a solution architect and technical product marketing manager at OpsRamp.