It may sound like a cliché, but it’s really more of a truism: enterprise networks are more complex than ever before. They need to support more users, devices, device types, applications, and data traffic than at any point in the past. There's no going back: for medium-to-large networks, the trend toward greater complexity will inevitably extend into the future.
How network complexity creates service assurance challenges
Increasing network complexity makes it harder for IT to meet both service level agreements and user expectations. Without the proper tools in place, IT organizations cannot detect network problems until those problems begin to degrade the user experience. Helpdesk tickets related to user connectivity pile up. IT spends hours, days, or even weeks troubleshooting network problems.
Manual processes cannot keep pace with service assurance in modern enterprise networks. Senior IT management ends up assigning too many people to routine network service assurance, at the expense of higher-value strategic projects.
AI and ML for incident analytics: Steps toward a self-healing network
Fortunately, as in so many areas of human endeavor, new technologies are coming online to automate many routine service assurance tasks. That means humans can focus on activities that they are uniquely able to perform. Recent advances in incident analytics that leverage machine learning (ML) and artificial intelligence (AI) help IT work much more efficiently to make sure end users get the connectivity experience they expect. The networking industry is working toward a future in which networks become self-healing.
Incident analytics today has four main elements, which unfold sequentially from the inception of the service incident: identifying the incident, classifying it by severity, tracing the root causes, and recommending steps for remediation. Let’s consider each of these in turn.
Identifying network service incidents and classifying them by severity
Addressing service-affecting, or potentially service-affecting, issues starts with identifying that a problem exists. A user can't connect to the network, gets disconnected, or takes too long to connect. Or they are not getting the kind of client throughput needed to support data-intensive applications.
Helpdesk tickets are a poor proxy for identifying network service issues. Not every user reports every service issue, but that does not stop them from complaining to others in the organization. Even if they do report a service disruption, the network has already failed them.
Advanced incident analytics for service assurance automatically identifies network anomalies that can give rise to service incidents, often before users even notice a problem. Machine learning lets the system do this without IT involvement. The most advanced tools also classify incidents by severity, based on factors like the duration of the incident and the number of users affected. That way, IT knows which issues to prioritize for remediation. This is where artificial intelligence comes in: the system automates the severity classification as well.
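To make the severity idea concrete, here is a minimal Python sketch. It is not how any particular product works: the `Incident` fields, the "user-minutes" impact score, and the thresholds are all assumptions chosen for illustration, and a simple fixed rule stands in for whatever model a real tool would learn from data.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """A detected service incident (fields are illustrative assumptions)."""
    name: str
    duration_minutes: float
    affected_users: int


def classify_severity(incident: Incident) -> str:
    """Return 'high', 'medium', or 'low' using hypothetical thresholds."""
    # Impact grows with both duration and user count; "user-minutes"
    # is one simple way to combine the two factors.
    impact = incident.duration_minutes * incident.affected_users
    if impact >= 1000:
        return "high"
    if impact >= 100:
        return "medium"
    return "low"


incidents = [
    Incident("brief connection timeouts", duration_minutes=2, affected_users=5),
    Incident("clients slow to connect", duration_minutes=45, affected_users=60),
    Incident("low client throughput", duration_minutes=30, affected_users=8),
]

# Sort so the most severe incidents come first for prioritization.
order = {"high": 0, "medium": 1, "low": 2}
ranked = sorted(incidents, key=lambda i: order[classify_severity(i)])
for incident in ranked:
    print(classify_severity(incident), "-", incident.name)
```

The point of the sketch is the ordering step at the end: once each incident carries a severity label, IT can work the list from the top instead of from the helpdesk queue.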
Tracing root causes and recommending steps for remediation
Advanced incident analytics can also automatically trace the root causes underlying the service incident. For example, if clients are experiencing poor received signal strength (RSS), one possible root cause is that the network has too few access points (APs) or that they are poorly placed. Another possible cause is sticky client behavior, where clients make unduly conservative roaming decisions: they don’t roam until the signal strength is very low.
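One way to picture how a system might tell these two causes apart is sketched below in Python. This is a hand-written heuristic under stated assumptions, not the method any analytics product is known to use: the function name, the -75 dBm "poor signal" threshold, and the 10 dB roaming margin are all hypothetical. The idea is that a client with a weak signal is "sticky" if a noticeably stronger AP is within reach, and is in a coverage gap if no such AP exists.

```python
from typing import Optional


def diagnose_poor_rss(
    current_rss_dbm: float,
    best_neighbor_rss_dbm: Optional[float],
    poor_threshold_dbm: float = -75.0,  # assumed cutoff for "poor" signal
    roam_margin_db: float = 10.0,       # assumed gain that should trigger a roam
) -> str:
    """Return 'ok', 'sticky_client', or 'coverage_gap' for one client."""
    # Signal is acceptable: nothing to diagnose.
    if current_rss_dbm > poor_threshold_dbm:
        return "ok"
    # A much stronger AP is available but the client has not roamed to it.
    if (
        best_neighbor_rss_dbm is not None
        and best_neighbor_rss_dbm >= current_rss_dbm + roam_margin_db
    ):
        return "sticky_client"
    # No better AP is in range, which points to AP count or placement.
    return "coverage_gap"


print(diagnose_poor_rss(-60, -70))   # healthy signal
print(diagnose_poor_rss(-82, -58))   # stronger AP nearby, client did not roam
print(diagnose_poor_rss(-82, None))  # no other AP heard at all
```

A production system would look at many clients over time rather than a single reading, but the same comparison (current signal versus the best alternative) is what separates the two explanations.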
Advanced network analytics can automatically recommend steps that IT can take to remediate these and other root causes. These recommended steps are the fourth element of incident analytics. For the signal strength issue above, the recommendations could include adding more APs, repositioning existing ones, or both. For sticky client behavior, the recommendation would be to activate features that help clients make more appropriate roaming decisions.
Because one root cause can give rise to multiple service incidents, addressing it can prevent additional users from being affected at all. Moreover, some incidents can be identified before they affect service for users. That’s another way that analytics can help to avoid many of the service issues that result in helpdesk tickets.
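In its simplest form, the recommendation step can be thought of as a lookup from root cause to suggested actions. The Python sketch below assumes hypothetical root-cause labels (`coverage_gap`, `sticky_client`) and a hypothetical fallback message; the actions themselves are the ones described above.

```python
# Hypothetical root-cause labels mapped to the remediation steps
# described in the text. The labels are assumptions for illustration.
RECOMMENDATIONS = {
    "coverage_gap": [
        "Add access points in the affected area.",
        "Reposition existing access points for better coverage.",
    ],
    "sticky_client": [
        "Activate features that help clients make better roaming decisions.",
    ],
}

# Assumed fallback for causes the table does not cover.
DEFAULT_RECOMMENDATION = ["Escalate for manual investigation."]


def recommend(root_cause: str) -> list:
    """Return the suggested remediation steps for a root-cause label."""
    return RECOMMENDATIONS.get(root_cause, DEFAULT_RECOMMENDATION)


for cause in ("coverage_gap", "sticky_client", "unknown"):
    print(cause, "->", recommend(cause))
```

Real tools would presumably tailor the advice to the specific site and devices involved; the table only shows the shape of the mapping.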
How all of this makes life easier for IT
By now, it's probably clear how this makes IT more efficient and reduces the pressure of routine service assurance. Automation is the key. By automating tasks that were formerly done by humans, it frees IT teams to focus on the strategic projects that they are uniquely able to address. IT management no longer has to assign too many people to routine service assurance and can better help the organization achieve its core mission. In the future, even remediation will be performed automatically as enterprise networks become self-healing.