IT outages can consume large amounts of time and effort in incident management, especially when most of those incidents must be handled manually. According to ITIC’s 2021 Hourly Cost of Downtime survey, a typical outage can cost an enterprise more than $300,000 per hour, with some outages costing between $1 million and $5 million per hour.
AIOps adoption is on the rise to meet this challenge, automating IT event management with a closed-loop approach that analyzes events, determines probable cause, recommends fixes and heals incidents autonomously. But what about cases where the incidents cannot be fully resolved autonomously – where issues must be identified and handed off to relevant stakeholders across disciplines who need to collaborate to fix the problem? All the while, the downtime clock is ticking.
Let's examine how, in such situations, closed-loop AIOps can be augmented with conversation-driven collaborative triaging, a digitally supported process for connecting the right enterprise experts and bringing them quickly to the table to resolve issues. We’ll see how this addition to AIOps can further enhance the enterprise’s ability to reduce outages and boost uptime.
The Challenge of Connecting the Right Stakeholders for Event Management
IT outages can significantly impact the productivity and profitability of an enterprise, and long outages also bring heavy reputational damage and lost consumer trust. Industry research shows more than a third of companies need up to 12 hours to fix an infrastructure outage, and 17% need two to seven days to do so.
To put a dent in these downtime figures, companies are increasingly turning to AIOps – a burgeoning $13 billion market on track to hit $40 billion by 2026. AIOps allows you to leverage automation in separating the signal from the noise in what may be thousands of events per second in a typical enterprise IT system.
With AIOps, machine learning algorithms govern closed-loop systems to automatically isolate the most important issues, the ones having the largest business impacts, and then prioritize them for autonomous resolution. In cases where this auto-resolution is possible, the AIOps solution can leverage pre-built knowledge that’s informed by business context and real-time health checks to self-heal the incidents and facilitate corrective actions.
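The prioritization step described above can be illustrated with a minimal Python sketch. It is not any vendor's implementation: the `Event` fields, the 0-to-1 `business_impact` score, the `auto_resolvable` flag and the threshold value are all hypothetical, and a real AIOps system would derive them from learned models and live health checks rather than hard-coded values.

```python
from dataclasses import dataclass


@dataclass
class Event:
    event_id: str
    business_impact: float  # hypothetical score in [0, 1], assumed to come from a model
    auto_resolvable: bool   # hypothetical flag: True if a known automated fix exists


def triage(events, impact_threshold=0.5):
    """Drop low-impact noise, rank what remains, and split it by resolution path."""
    significant = [e for e in events if e.business_impact >= impact_threshold]
    ranked = sorted(significant, key=lambda e: e.business_impact, reverse=True)
    auto = [e.event_id for e in ranked if e.auto_resolvable]
    manual = [e.event_id for e in ranked if not e.auto_resolvable]
    return auto, manual


events = [
    Event("disk-full-db01", 0.9, True),
    Event("debug-log-rotated", 0.1, True),      # noise: below the threshold
    Event("payment-api-errors", 0.95, False),   # high impact, needs people
    Event("cache-miss-spike", 0.6, True),
]

auto_queue, manual_queue = triage(events)
print(auto_queue)    # ['disk-full-db01', 'cache-miss-spike']
print(manual_queue)  # ['payment-api-errors']
```

In this sketch, everything in `manual_queue` is what would be handed to a collaborative triaging process.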
But where manual intervention is required, AIOps needs an assist – ideally from companion systems that can efficiently orchestrate how, where, and by whom the human element is brought into the picture for event resolution. In such cases, a conversation-driven collaborative triaging system can be implemented to get the right experts on the case as fast as possible.
The Power of Machine Learning for Collaborative Triaging
Collaborative triaging is the process of quickly bringing the appropriate stakeholders together around incidents and bottlenecks. Connecting closed-loop AIOps capabilities with collaborative triaging integrates auto-resolution and manual resolution into a single, unified approach to downtime.
In a well-designed integration, AIOps systems for IT incident response can leverage machine learning to identify the correct experts for incident assignment in cases where problems cannot be fully auto-triaged. In this scenario, each incident is matched and assigned to an expert resolver, and the incident's context and history are made readily accessible to facilitate manual triaging and resolution.
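As a rough illustration of expert matching, the sketch below ranks resolver teams by how closely a new incident's description overlaps with incidents they have handled before. It deliberately uses plain token overlap (Jaccard similarity) as a stand-in for a trained machine learning model; the team names, incident texts and function names are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Share of tokens two sets have in common (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_resolvers(incident_text, history):
    """Rank resolvers by their best-matching past incident.

    history is a list of (resolver, past_incident_text) pairs.
    """
    new_tokens = set(incident_text.lower().split())
    scores = defaultdict(float)
    for resolver, past_text in history:
        past_tokens = set(past_text.lower().split())
        scores[resolver] = max(scores[resolver], jaccard(new_tokens, past_tokens))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical resolution history.
history = [
    ("dba-team", "database connection pool exhausted"),
    ("dba-team", "slow query on orders database"),
    ("network-team", "packet loss on core switch"),
    ("app-team", "checkout service timeout"),
]

ranking = rank_resolvers("orders database connection timeout", history)
best_resolver = ranking[0][0]
print(best_resolver)  # dba-team
```

A production system would likely replace the similarity function with a learned classifier and factor in availability and on-call schedules, but the shape of the problem is the same: score candidates against incident context, then assign.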
Triaging experts and resolvers remain connected to ensure transparency in the incident resolution process. Real-time triaging details are provided in context, showing which automated fixes have been applied and where manual support is still required. Throughout the process, automated updates are sent to front-line stakeholders, keeping these business users posted on activities and the status of the fix.
The benefits of augmenting AIOps with these collaborative triaging capabilities can be significant – especially in light of past research showing some 60 percent of companies currently need 15 minutes or more just to identify the right team members to work on an issue. The resulting delay can cause a loss of $25,000 just to get a single ticket claimed and acknowledged. With AIOps, organizations now have powerful options to avoid such revenue loss.
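To put these figures side by side, here is a small worked calculation. It assumes a simple linear cost model (loss scales with minutes of delay), which the cited research does not necessarily use, and the function name is illustrative.

```python
def delay_cost(hourly_cost_usd, delay_minutes):
    """Cost of a delay under an assumed linear cost-per-minute model."""
    return hourly_cost_usd * delay_minutes / 60


# Hourly rate implied by a $25,000 loss over a 15-minute delay.
implied_hourly = 25_000 * 60 / 15
print(implied_hourly)  # 100000.0

# The same 15-minute delay at the $300,000-per-hour figure from the ITIC survey.
cost_at_300k = delay_cost(300_000, 15)
print(cost_at_300k)  # 75000.0
```

Under that assumption, the $25,000 figure corresponds to about $100,000 per hour of downtime, and the same delay at the ITIC survey's $300,000-per-hour level would cost about $75,000.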
With the soaring costs of downtime, system managers dealing with events and issues at scale need the ability to deploy both automation and human expertise strategically, where each can do the most good. True agility in proactively addressing issues comes from the combined power of closed-loop AIOps systems and collaborative triaging processes, which accelerate the right conversations between the right experts to maximize uptime and value for the enterprise.
VS Joshi is Global Head of Product and Solution Marketing at Digitate.