As the old saying goes, a problem well-stated is a problem half-solved. A thorough understanding of any issue is the obvious first step in solving it, but when software is involved, it’s not always that simple. “Known unknowns,” when a problem is well-defined but more information is needed for the solution, are usually remedied by further analysis. “Unknown unknowns,” where the problem and solution are a mystery, are the most difficult to solve and most likely to keep systems down for an extended period of time.
For example, if a single device you’re monitoring suddenly begins misbehaving, even in a way that is not immediately obvious, you can often just replace or upgrade it. The problem is known, and the remedy is a matter of increasing capacity. If multiple, seemingly unrelated devices go down at the same time, both the problem and the solution are unknown, and the issue requires a deeper dive beyond existing analytics.
In networking, solving any unknown unknown requires the right data from multiple sources across the environment. By nature, these issues arise unexpectedly, often with serious consequences for the business. There is no one-size-fits-all solution, but establishing protocols in advance is crucial to mitigating unknown issues as they arise. Here are three key steps to prepare for unknown unknowns and keep operations running smoothly.
Step 1: Preparing the data
Data without metadata is meaningless. Metadata is the data that contextualizes network telemetry, logs, events, flows, routes, alerts, configuration changes, etc. Without holistic management of metadata, network and IT telemetry are just raw, disconnected points.
While network and IT operations engineers focus primarily on their infrastructure telemetry and its semantics, there is a set of foundational building blocks that must also be present:
- A single source of truth for the infrastructure data model
- A well-defined data model that describes the infrastructure entities and their relationships
- A process for keeping the source of truth regularly updated, as it represents the backbone connecting all data
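To make these building blocks concrete, here is a minimal sketch in Python of a data model, a source of truth, and the enrichment of a raw telemetry point with metadata. Everything in it is an assumption made for illustration: the `Entity` fields, the device names, and the `enrich` helper are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    """One infrastructure entity in a hypothetical data model."""
    name: str
    kind: str                       # e.g. "switch", "router", "server"
    site: str
    upstream: Optional[str] = None  # relationship: the entity this one depends on


# Hypothetical single source of truth: every known entity, keyed by name.
SOURCE_OF_TRUTH = {
    e.name: e
    for e in [
        Entity("core-sw-1", "switch", "nyc"),
        Entity("edge-rtr-1", "router", "nyc", upstream="core-sw-1"),
        Entity("app-srv-7", "server", "nyc", upstream="core-sw-1"),
    ]
}


def enrich(point: dict) -> dict:
    """Attach metadata from the source of truth to a raw telemetry point."""
    entity = SOURCE_OF_TRUTH.get(point["device"])
    if entity is None:
        # Unknown device: flag it instead of guessing, so stale records surface.
        return {**point, "known": False}
    return {
        **point,
        "known": True,
        "kind": entity.kind,
        "site": entity.site,
        "upstream": entity.upstream,
    }


raw = {"device": "edge-rtr-1", "metric": "cpu_pct", "value": 97}
print(enrich(raw))
```

A point from a device missing from the source of truth comes back with `known` set to `False`, which is one simple way to notice that the source of truth has fallen out of date.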
Step 2: Connecting the dots (data sources)
A data-centric approach with a focus on quality is non-negotiable in complex environments. The good news is that for most IT teams, there’s no shortage of data or monitoring tools. Every vendor naturally provides alert monitoring for its own product, but each tool is often focused on one specific type of data. Unfortunately, this leaves IT teams with a multitude of dashboards that fail to provide a deeper analysis. In the earlier example of multiple devices going down at once, manually identifying the underlying issue would be nearly impossible with the data from those devices alone. For a clear view of operations, it’s crucial to use a semantic model with metadata that connects data points from the widest variety of available sources.
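As a rough illustration of connecting data points across tools, the sketch below merges events from three hypothetical sources into one timeline per shared dependency. The `UPSTREAM` map stands in for relationship metadata from a source of truth, and the event lists, field names, and `connect` function are all invented for this example.

```python
from collections import defaultdict

# Hypothetical relationship metadata: device -> the device it depends on.
UPSTREAM = {
    "core-sw-1": None,
    "core-sw-2": None,
    "edge-rtr-1": "core-sw-1",
    "app-srv-7": "core-sw-1",
    "db-srv-2": "core-sw-2",
}

# Hypothetical events from three separate tools (timestamps in seconds).
SYSLOG = [{"ts": 100, "device": "edge-rtr-1", "msg": "link down"}]
METRICS = [{"ts": 102, "device": "app-srv-7", "msg": "packet loss 100%"}]
CHANGES = [{"ts": 95, "device": "core-sw-1", "msg": "config change pushed"}]


def connect(sources: dict) -> dict:
    """Group events from every source by the entity they depend on."""
    grouped = defaultdict(list)
    for source_name, events in sources.items():
        for event in events:
            device = event["device"]
            # Only one level of dependency is followed in this sketch.
            root = UPSTREAM.get(device) or device
            grouped[root].append({**event, "source": source_name})
    for events in grouped.values():
        events.sort(key=lambda e: e["ts"])  # one ordered timeline per group
    return dict(grouped)


timeline = connect({"syslog": SYSLOG, "metrics": METRICS, "changes": CHANGES})
for event in timeline["core-sw-1"]:
    print(event["ts"], event["source"], event["device"], event["msg"])
```

Viewed separately, the router alert and the server alert look unrelated. Grouped by their shared upstream switch, they sit right after a configuration change on that switch, which is the kind of connection a single-vendor dashboard would not show.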
Step 3: Finding the root cause
With the data connected, the next challenge is extracting and rapidly analyzing information across the system. This is where machine learning gives operations teams a real advantage in addressing unknown issues when the stakes are high. By shortening the investigation phase, machine learning lets operations move faster toward fixing the issue, but only with the proper architecture in place.
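The sketch below is not machine learning. It is a deliberately simple statistical stand-in that shows why connected metadata shortens an investigation: it ranks the attributes that failing devices share more often than healthy ones. The inventory, attribute names, and `rank_candidates` function are hypothetical.

```python
from collections import Counter

# Hypothetical inventory with metadata for each device.
DEVICES = {
    "edge-rtr-1": {"site": "nyc", "upstream": "core-sw-1", "firmware": "9.2"},
    "app-srv-7": {"site": "nyc", "upstream": "core-sw-1", "firmware": "4.1"},
    "edge-rtr-2": {"site": "nyc", "upstream": "core-sw-2", "firmware": "9.2"},
    "db-srv-2": {"site": "sfo", "upstream": "core-sw-3", "firmware": "4.1"},
}


def rank_candidates(failing: set) -> list:
    """Rank (attribute, value) pairs by how much more common they are
    among failing devices than among healthy ones."""

    def shares(names):
        if not names:
            return {}
        counts = Counter(
            (attr, value) for n in names for attr, value in DEVICES[n].items()
        )
        return {key: count / len(names) for key, count in counts.items()}

    healthy = set(DEVICES) - failing
    fail_share = shares(failing)
    ok_share = shares(healthy)
    scores = {key: s - ok_share.get(key, 0.0) for key, s in fail_share.items()}
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


for (attr, value), score in rank_candidates({"edge-rtr-1", "app-srv-7"}):
    print(f"{attr}={value}: {score:+.2f}")
```

In this toy inventory the shared upstream switch ranks first, because both failing devices depend on it and no healthy device does. A production system would need far richer signals and models; the point is only that candidates can be ranked automatically once the metadata is in place.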
The longer teams spend scrambling to connect the dots, the worse the impact. While unknown unknowns will forever be impossible to predict, especially in an industry as complex as networking, maintaining a well-defined data model and a holistic view of the network is essential to keeping operations running smoothly.
Kannan Kothandaraman is co-founder and CEO of Selector AI.