Riverbed’s AIOps Vision: Toward a Self-Healing Future
After decades of theory, hype, and a fair few memorable movies, artificial intelligence, deep learning and big data are finally ready for real-world applications. CIOs are keen to harness these burgeoning technologies to better manage and protect their IT infrastructures, but how much of the AIOps vision can feasibly be delivered today?
Dr. Vincent Berk, Chief Security Architect at Riverbed, is uniquely positioned to cut through the hype and give a sense of what AIOps can really offer. He believes that Riverbed has better data than any other company on the market, putting SteelCentral in the best position to deliver revolutionary AIOps-driven IT quality and performance enhancements.
An AIOps scenario
To understand what's possible, consider this scenario: Some sort of event seems to be occurring, filling the wire to 90% capacity. VOIP calls are dropping. After some digging, an operator might conclude that a DDOS attack is underway.
"Naturally, the operator can work with all of Riverbed's extensive suite of monitoring and security tools to arrive at that conclusion," says Berk. "But my premise is that by applying the appropriate analytical techniques, the Riverbed software can make that inference for the operator. So when they log in, Riverbed reports that a DDOS attack is underway and then the human can directly proceed to dealing with it."
But VOIP disruptions can have other root causes, such as badly timed backups or a broken router interface, all of which produce varying types of evidence. If the monitoring system's smart enough to analyze all these factors and give the operator an informed suggestion as to the root cause, the path to resolution becomes that much quicker.
Do this well enough, and the system can even become predictive. It could realize that when certain conditions are met, such as the infrastructure being at 90% capacity, packet drops and retransmissions are soon to come. It might also notice this seems to happen on Tuesday afternoons, and that several applications tend to slow down as a result.
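As a rough illustration of that kind of early warning, here is a minimal Python sketch. It is not how SteelCentral works internally. The function names, the (weekday, hour) bucketing, and the 10-point warning margin are all assumptions made for this example. The sketch learns which time slots have repeatedly reached the 90% mark, then flags when utilization starts approaching that level in one of those slots.

```python
from datetime import datetime

UTILIZATION_THRESHOLD = 0.90  # the 90% capacity figure from the scenario


def recurring_hot_slots(samples, min_occurrences=2):
    """Find (weekday, hour) slots where utilization hit the threshold on
    at least `min_occurrences` different dates.

    `samples` is an iterable of (datetime, utilization) pairs, with
    utilization expressed as a fraction between 0 and 1.
    Weekdays follow datetime.weekday(): Monday is 0, Tuesday is 1.
    """
    dates_per_slot = {}
    for ts, utilization in samples:
        if utilization >= UTILIZATION_THRESHOLD:
            slot = (ts.weekday(), ts.hour)
            dates_per_slot.setdefault(slot, set()).add(ts.date())
    return {
        slot
        for slot, dates in dates_per_slot.items()
        if len(dates) >= min_occurrences
    }


def should_warn(now, utilization, hot_slots, margin=0.10):
    """Warn early: we are in a known hot slot and utilization is already
    within `margin` of the threshold (hypothetical default: 10 points)."""
    in_hot_slot = (now.weekday(), now.hour) in hot_slots
    return in_hot_slot and utilization >= UTILIZATION_THRESHOLD - margin
```

Fed a few weeks of history in which the link saturates around 2 p.m. on Tuesdays, `recurring_hot_slots` would return that slot, and `should_warn` would fire the following Tuesday once utilization climbs past roughly 80%, before drops and retransmissions begin. A real product would presumably use far richer signals than a single threshold.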
"It should warn the operator as these conditions start developing: Take action, or user experience will suffer," says Berk. "By discovering the right kinds of relationships, we're even able to become predictive in our capabilities."
First you need data
How much of that AIOps potential can today's performance monitoring products deliver? Quite a lot, says Berk. The key, he explains, is that machine learning requires high-quality data to deliver great results.
"Peter Norvig, Google's research director, wrote a classic article for IEEE, called 'The Unreasonable Effectiveness of Data,'" says Berk. "And in that article he just came right out and said Google doesn't have a magic algorithm. They just have more data."
A comprehensive dataset helps machine learning understand the specifics of a given IT infrastructure. Compared to other vendors, Riverbed collects data of unparalleled quality at unparalleled scale, thanks to the advancements it has made in big-data technology for APM and NPM. Riverbed can capture every user transaction, tracing it from the user’s device all the way through the back-end code and infrastructure, preserving all relevant metadata and complete call-stack detail, and even tracking high-definition system metrics at one-second intervals.
"We can capture and store every detail for billions of transactions a day for our enterprise customers who experience extremely high transaction volumes, such as credit card companies," says Berk. "With the clustered architecture introduced last year, we saw an order of magnitude [10x] increase in our ability to scale."
This rich dataset sets the stage for what's next.
Figure 1. Riverbed SteelCentral collects data across many domains
Correlate and draw conclusions
Once the data's collected, an AIOps system can apply analytics to start to derive useful, potentially actionable conclusions.
"We believe machine learning and neural networks can help with these problems because, essentially, they're a classifier system," says Berk. "We're trying to classify, based on the available evidence, which root cause best explains the evidence that we're collecting. That can be very powerful."
It's difficult to train classifiers appropriately unless they have all the data. And often, Berk points out, having absolutely all the data in a complex, ever-changing environment is simply not a feasible reality. But hold that thought a moment.
Let's say you do collect a lot of pretty good data. Even so, correlation does not imply causation. The system may employ a variety of techniques to find meaningful correlations and derive actionable conclusions. Every infrastructure and every problem is different, and training neural networks to recognize general failure scenarios is very difficult, so there's no one magic set of tricks that always works. This is where purely algorithmic approaches can be bolstered with expert knowledge, embedded into the algorithms themselves, to improve outcomes for specific sets of problems.
"Deep-learning neural networks is one technology we use, in the right places. But there exist various other types of classifiers and correlation algorithms that are also very powerful," says Berk. "Whatever specific technique it uses, when pieces of data are related, an AIOps system should be equipped to make that correlation, as opposed to waiting for the operator to come to that conclusion. That's where the real power comes in."
Figure 2. Riverbed AppInternals features a cluster operator that uses pattern recognition to automatically find sets of related transactions
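The idea of classifying which root cause best explains the evidence can be sketched with one of the simplest classifiers there is, a nearest-centroid classifier. This is an illustration only, not Riverbed's algorithm. The four features, the labels, and every number in the training table are invented for the example, loosely following the VOIP scenario described earlier.

```python
import math

# Feature order (all hypothetical, for illustration only):
#   link utilization (0..1), share of traffic from unfamiliar source IPs (0..1),
#   interface error rate (0..1), backup job running (0 or 1).
TRAINING = {
    "ddos_attack": [(0.95, 0.90, 0.01, 0), (0.92, 0.85, 0.02, 0)],
    "badly_timed_backup": [(0.90, 0.05, 0.01, 1), (0.88, 0.10, 0.00, 1)],
    "broken_router_interface": [(0.40, 0.05, 0.60, 0), (0.35, 0.10, 0.70, 0)],
}


def centroid(vectors):
    """Average the example vectors of one root cause, feature by feature."""
    return tuple(sum(column) / len(column) for column in zip(*vectors))


CENTROIDS = {label: centroid(vectors) for label, vectors in TRAINING.items()}


def rank_root_causes(evidence):
    """Return root-cause labels ordered from best to worst match.

    The best match is the label whose centroid lies closest to the
    observed evidence vector (Euclidean distance).
    """
    return sorted(
        CENTROIDS,
        key=lambda label: math.dist(evidence, CENTROIDS[label]),
    )
```

Calling `rank_root_causes((0.93, 0.80, 0.02, 0))` puts `ddos_attack` first, because a saturated link with mostly unfamiliar sources and no backup running sits closest to the DDOS examples. Returning a ranked list rather than a single answer leaves room for the operator to weigh the alternatives.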
The final link: Human intelligence
So the AIOps implementation examines a rich and deep dataset, recognizes correlations and draws conclusions, and presents them to the operator. Why not have the system go a step further, and immediately remediate any problems it discovers?
Ultimately the goal of AIOps is a self-healing system that addresses issues before anyone is ever aware they existed. This is feasible for some scenarios, but there are also other situations in which additional intelligence is needed to decide on appropriate and effective interventions.
As Berk noted, it's realistically impossible to have a truly complete dataset, a full understanding of a complex, ever-changing infrastructure that is inevitably affected by factors outside the bounds of what can be monitored. That missing piece often comes from the operator. After all, a human might be the only entity who understands that the cleaning service works in the server room on Tuesdays (that comes from a true story, by the way).
"Some missing piece of information in the operator's head could instantly make that causal link even if the AIOps product cannot," says Berk. "And that's one reason why you won’t always want to let the system decide to execute remedial actions on its own."
AIOps can have the built-in intelligence to suggest specific remedial actions that are often appropriate, which takes a lot of cognitive burden off operators. Even so, having a human in the loop to make that final call is just a wise thing to do.
Figure 3. Riverbed Aternity can initiate automated remedial actions, but the system will always confirm with an operator before executing them
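A confirm-before-execute gate of this kind is simple to express in code. The Python sketch below is a generic illustration, not Aternity's API. `Suggestion`, `remediate`, and the `confirm` callback are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Suggestion:
    """A remedial action proposed by the system (hypothetical structure)."""
    root_cause: str
    action_name: str
    action: Callable[[], str]  # performs the fix and returns a status string


def remediate(suggestion: Suggestion, confirm: Callable[[str], bool]) -> str:
    """Run the suggested action only if the operator approves it.

    `confirm` receives a human-readable prompt and returns True or False.
    The action is never invoked when the operator declines.
    """
    prompt = (
        f"Suspected cause: {suggestion.root_cause}. "
        f"Run '{suggestion.action_name}'?"
    )
    if confirm(prompt):
        return suggestion.action()
    return "skipped: operator declined"
```

In an interactive tool, `confirm` could be as simple as `lambda prompt: input(prompt + " [y/N] ").strip().lower() == "y"`. Passing the confirmation step in as a callback keeps the decision with the operator while leaving the system free to do the diagnosis and prepare the fix.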
"The product is learning your IT environment and making correlations so it can be predictive," says Berk. "It can arm the operator with insights that go beyond the surface issues, suggest causes beyond the usual suspects, surface things they may not have thought to look for. [Our first-generation] AIOps offering can point you in the right direction, and we'll be building on that intelligence in the future."
Coming very soon
All of this intelligence is either rolled into Riverbed's offerings today or will be available by the end of the year. We're excited for what the future holds.
For more on how AIOps and machine learning can provide better outcomes for IT and business, watch the recent webinar with Gartner Research Director Charley Rich and Riverbed APM Marketing Director Amena Siddiqi, "Prepare for the Future of APM in 2019." Some recent Riverbed blog posts will also be of interest, specifically "AI & Data Analytics: Finding the Needle in the Haystack" and "4 Ways Automated Remediation Drives Up Service Desk KPIs."