What Network Management Could Learn From Google Maps
I’ve been enjoying working with a fairly large company with some excellent staff, reviewing network and UC management tools, processes, and gaps. All that network management tool immersion has triggered some high-level questions in my brain, and I’d like to share some of those with you.
The immersion means I’ll likely have more to say on network management tools in future posts.
Those of you who have been following my blog posts know that I find Network Management tools intensely frustrating, basically because in 25+ years they seem to have evolved so little. I comprehend the economics behind network management products; it is a small market compared to Servers and now Security, so budgets cap R&D. Even so, I’m left feeling that surely we could do better, a lot better!
I’d like to go beyond the basic topics covered in Succeeding with Network Management Tools, which is about getting the most out of your tools. Today I’m asking: Are the tools even doing the right things?
Most organizations I’ve worked with lately, large or small, pretty much match a pattern: little spending on network management tools, existing tools under-maintained, and/or no or relatively few people dedicated to network management.
The organization I’m working with now has several sharp people tuning alerts and getting them mastered, surmounting discovery challenges, putting revamped processes in place, considering adding event correlation rules, etc. All of that is refreshing to see. A lot of effort has gone into the tool tuning, and it is all paying off for the customer.
That doesn’t seem to be something that many other organizations can achieve, whether for lack of willpower, budget, staffing, or time. ROI or perceived value may play a part as well – that’s a theme for a follow-on blog post: what are network management tools really good for? Where do they provide value, and where is the value less apparent, or not there?
I have become very leery of the word “correlation” with regard to network management tools. It is certainly possible and useful, but what comes built-in tends to be rather minimal. A toolkit is not a solution. Watch out for vendors who say their tool “can” do something, when what they mean is, “It wasn’t built to do that, but with a lot of work you might be able to get it to do that.” You have to ask sharp questions sometimes to find out what the product includes out of the box, compared to what you can build with it.
Let’s now turn to the big philosophical question I want to pose:
Should we be concerned about:
- Manual processes for provisioning devices into network management (for control compared to auto-discovery)?
- Manual enabling of polling, data retention, thresholds, and alert criticality?
- Manual correlation rules-building, especially topology-aware correlation (which is something the Operations side of the house would love)? Doing so means building many dependency rules.
- Manual configuration of back-end alerting processing and display (e.g. NetCool and/or Splunk) versus paging/emailing?
- Manual building of NetCool and Splunk dashboards?
- Maintaining the value of all that labor investment as the topology changes, since modifying correlation rules is likely to also be rather time-consuming?
Bottom line: Is all this labor sustainable?
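To make the topology-aware correlation item concrete, here is a minimal Python sketch of what one such dependency rule amounts to when it is derived from the topology instead of written by hand. Everything in it is an assumption for illustration: the device names, the adjacency-list format, and the function names `reachable` and `classify_unreachable` are hypothetical and not taken from any product. The idea is that a device that stops responding is a probable root cause only if it sits next to something the management station can still reach; everything behind it is treated as a symptom.

```python
from collections import deque

def reachable(topology, station, down):
    """Breadth-first search from the management station, skipping down devices."""
    if station in down:
        return set()
    seen = {station}
    queue = deque([station])
    while queue:
        node = queue.popleft()
        for neighbor in topology.get(node, ()):
            if neighbor not in seen and neighbor not in down:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

def classify_unreachable(topology, station, unreachable):
    """Split unresponsive devices into probable root causes and symptoms.

    A device counts as a probable root cause if at least one of its
    neighbors is still reachable from the station; otherwise it is
    assumed to be hidden behind some other failure.
    """
    up = reachable(topology, station, unreachable)
    roots = {d for d in unreachable
             if any(n in up for n in topology.get(d, ()))}
    return roots, set(unreachable) - roots

# Hypothetical topology: undirected links stored as adjacency lists.
topology = {
    "nms": ["core1"],
    "core1": ["nms", "dist1", "dist2"],
    "dist1": ["core1", "acc1", "acc2"],
    "dist2": ["core1", "acc3"],
    "acc1": ["dist1"],
    "acc2": ["dist1"],
    "acc3": ["dist2"],
}

roots, symptoms = classify_unreachable(topology, "nms", {"dist1", "acc1", "acc2"})
print("probable root cause:", sorted(roots))
print("suppressed symptoms:", sorted(symptoms))
```

The logic itself is small. The labor is in keeping the `topology` data accurate as the network changes, which is why I would want a tool to build it from discovery data and not from hand-maintained rules.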
Thinking about the manual items above, the whole approach is a rather ironic situation. We turn on more and more device traps, process syslog, and send some of it as alerts, then poll and apply thresholds to get more alerts, all so we can be aware of problems. Except that we’re drowning in alerts, which hides the problems. So then we end up going down the alert tuning path (do I need this? Is it actionable?) and/or the correlation path – making more and more work for ourselves. At least, that’s how it looks when I step back from the details to gain some perspective.
For the most part, I think network management customers have “voted with their feet”, i.e. abandoned labor-intensive tools in a loud “NO WAY” response. Surely there has to be a better way to do things?
My answer in a previous blog post is that the vendor must do it for you – automate everything. Yet a lot of the above items could be hard for them to program. They are also hard to shop for: in comparing tools, no vendor is going to give you the details of what SNMP variables they poll, their thresholds and alert criticalities, what their correlation secret sauce does and doesn’t do, etc. So how do you detect shallow versus deep correlation when looking at tools – particularly when trying to review tools without an inordinate expenditure of time and effort? In short, even if you’re trying to buy a tool that automates things well, how would you recognize it?
There are at least two answers appearing on the horizon.
One is machine learning. For example, the co-founder of NetCool is behind a company called Moogsoft, which claims to do rule-less correlation and alerting via machine intelligence. Automatic – good! I’d like to see an objective trial to see how well it does. I can believe it might find unique combinations of things to alarm about; that it might spot failure trends early; and/or might cut a lot of the noise down. Their estimates of reduction in alert counts seem to vary from 50% to 90%. Does it also alarm about important events as well as the current tools do? Of course, if you’re having trouble maintaining and tuning to keep your current tools working well, that may not be a high bar to surmount.
The other answer is for applications to display events on a network diagram (or part of one) with changes and alert counts (or auto-selected alert types and counts) superimposed on the diagram.
Doing this might look like a Google road map that uses colors and icons to indicate real-time traffic conditions, accident and police activity, etc. For the network, I’d actually omit the traffic, and use colors to show alert counts and detected performance problems. If you have packet capture probes, you could even detect things like retransmissions, high latency, slow server responses, and more.
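As a rough sketch of the data side of such a map, the Python below rolls alerts up into one color per device, the way a traffic map rolls individual incidents up into a colored road segment. The severity weights, the thresholds, the alert record format, and the device names are all assumptions I made up for illustration; a real tool would need to tune them and would still have to draw the map.

```python
from collections import Counter

# Hypothetical weights: how much each alert severity adds to a device's score.
SEVERITY_WEIGHT = {"info": 0, "warning": 1, "critical": 5}

def device_scores(alerts):
    """Sum severity weights per device; unknown severities count as 0."""
    scores = Counter()
    for alert in alerts:
        scores[alert["device"]] += SEVERITY_WEIGHT.get(alert["severity"], 0)
    return scores

def color_for(score):
    """Map a score to a traffic-map style color (thresholds are assumptions)."""
    if score >= 5:
        return "red"
    if score >= 1:
        return "yellow"
    return "green"

def map_overlay(devices, alerts):
    """Return {device: color} for every device on the diagram."""
    scores = device_scores(alerts)
    return {device: color_for(scores[device]) for device in devices}

devices = ["core1", "dist1", "acc1"]
alerts = [
    {"device": "dist1", "severity": "critical"},
    {"device": "acc1", "severity": "warning"},
    {"device": "acc1", "severity": "info"},
]

overlay = map_overlay(devices, alerts)
for device in devices:
    print(device, overlay[device])
```

The viewer then does the correlation visually: a red device with yellow neighbors tells a different story than one isolated yellow icon.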
Which of these is a better way to convey lots of information and let the viewer correlate it easily: a Google traffic map, or a radio listing of traffic problems? With radio traffic reports, you have to think about where each one is and do more mental processing. I do have to note, radio is probably safer when driving, unless you look away from the road trying to mentally visualize the road map.
Multiple sources of information integrated via a map: What’s not to like?
Some vendors seem to be going down this path:
- NetBrain has the idea of paths and show commands down cold. What’s less clear is network and server polling, thresholding, alerting.
- Riverbed seems to have the right big picture here and to be moving fast, leveraging the OPNET acquisition. They have annotated paths already, and display of metrics on links and devices on paths. Their Portal product shows SNMP-based network alerts as well as recent changes to the network. If there’s a weakness, it might be on the server side in the SNMP-based NetCollector. Once their NetProfiler NetFlow-based server business policy/application monitoring capabilities get tied in, that should help. It’s all still evolving. Possible gap: manually mapping each application, even if easier than in the old OPNET tool, and possibly easier than the APM/ADDM class of tools – or at least a lot more network-focused than the ones I’m aware of.
- HP NNMi seems to understand the display part. For that matter, they’ve sort of had that for years, but more on the up/down front than the “performance brownout” side of things. Their web map GUI is intolerably slow for me. A bigger drawback for me is that NNMi is apparently driven by heavyweight alerting and correlation, combined with intense dependency mapping. Possible gap: how much manual labor is involved, what with all the available add-ons and plug-ins, correlation engines, etc. in the various products you need to back up NNMi with server/application visibility.
Getting applications into the picture
There is another side to this. We really need to tie applications to the network, if for no other reason than eliminating the network as a possible cause. Having maps and displaying network issues and changes, and server problems – to me, that’s the Holy Grail right now. Hopefully not quite as unreachable a goal! Seeing a given application’s server endpoints and flows on top of a colorized/iconized network map – bonus!
There are a lot of server/application centric tools, which I’ve been exploring but have less experience with. And there are network-centric tools. They all need to be tied together, and made easier to use.
The other thing I’ve been noticing is how non-graphical many products have become, or had become before the current crop of dashboards and graphical elements started appearing (it is springtime, after all). A linear listing of alerts seems to be a common, rather boring, and uninformative GUI. It is easy to code, apparently far easier than responsive smart maps in a web GUI.
Dashboards with drill-down capabilities also seem to be appearing across various products. Sometimes they’re just event-oriented. I’ll grant that heat map matrices represent another display paradigm that potentially is relatively easy to code and fairly useful to the user. So while I personally like network maps, dashboards may well be part of the answer as well.
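To illustrate why I say a heat map matrix is relatively easy to code, here is a minimal Python sketch that turns a flat alert list into a device-by-category count grid. The `category` field, the sample alerts, and the plain-text rendering are assumptions for illustration only; a real dashboard would color the cells instead of printing numbers.

```python
from collections import Counter

def heat_matrix(alerts, devices, categories):
    """Count alerts per (device, category) cell, one row per device."""
    counts = Counter((a["device"], a["category"]) for a in alerts)
    return [[counts[(d, c)] for c in categories] for d in devices]

def render(matrix, devices, categories):
    """Render the matrix as a fixed-width text grid."""
    width = max(len(d) for d in devices)
    header = " " * width + " " + " ".join(f"{c:>8}" for c in categories)
    lines = [header]
    for device, row in zip(devices, matrix):
        cells = " ".join(f"{n:>8}" for n in row)
        lines.append(f"{device:<{width}} {cells}")
    return "\n".join(lines)

# Hypothetical sample data.
devices = ["core1", "dist1"]
categories = ["link", "cpu"]
alerts = [
    {"device": "core1", "category": "link"},
    {"device": "core1", "category": "link"},
    {"device": "dist1", "category": "cpu"},
]

matrix = heat_matrix(alerts, devices, categories)
print(render(matrix, devices, categories))
```

Compare that handful of lines with what a responsive, topology-aware map requires, and it is easy to see why vendors reach for the matrix first.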
I get the impression Cisco is really rethinking their network management products, given how they’re pushing automation and Enterprise Network Virtualization. With the focus on rapid configuration at large scale via the APIC-EM controller, can rapid and increased information collection be far behind?
This article originally appeared on the NetCraftsmen blog.