NPM dashboards fail to provide real-time insight into trouble spots.
Modern dashboards for network performance monitoring do a great job of reporting status and statistics. In general, these dashboards provide views of aggregated information that make them useful when looking for historical data or trends. But when there is a problem on your network, these dashboards often fall short in providing real-time, actionable information about individual network trouble areas.
What no company in the industry has done well is create dashboards that help network engineers troubleshoot. Not only are better dashboards absolutely necessary, I believe they’re completely within reach. As an industry, we need more discussion about what data is truly meaningful in this context and how we can obtain it.
One of the limiting factors is that meaningful, real-time analytics from individual data points cap out at around 2 Gbps. At that point, we reach the limits of our computational processing power. Real troubleshooting requires analysis of network data at the speed of the network itself.
Six characteristics are common to all of our dashboards today. They excel at baselining, but they miss the mark when it comes to helping network engineers quickly identify problems.
- Too much consolidated data. Look at any dashboard today and you’ll probably see an attractive, color-coded bar chart, graph or number that represents a highly consolidated summary of network data. When a single graphic represents a large set of data, an individual problem can get lost in the aggregate, so these graphs are not actionable.
- Too many dashboards. Imagine yourself sitting in a NOC, surrounded by 15 or even 20 dashboards. Which data is useful, and which is not?
- Not displaying what we need. Dashboards today display data based on what can be shown, not necessarily the deeper, more individualized data that engineers want to see. We need better indicators of where the problems are -- right now.
- Problems not immediately obvious. UI design has improved dramatically, and it’s great to see attractive dashboards. But NPM dashboards still don’t show whether there are problems or not. And if a problem is suspected, there is little indication of how to begin or continue an investigation.
- Not real time or integrated. For many products, network troubleshooting relies on queries against a database that has been created after the fact from stored packet data. This is not an integrated approach, and it is definitely not proactive.
- Where to start? Even if you can spot a problem from a dashboard, what do you do next? Dashboards don’t always give a clear indication of what to investigate; they offer little help to drill down into the data to resolve issues.
It would be more useful to see a list of the network flows together with a list of specific problems. For example, where is the network encountering too many TCP retransmissions, and who’s being affected by them? If you can immediately see where the problem areas are, you’ll be able to resolve them much more efficiently.
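As a rough illustration of that kind of view, here is a minimal Python sketch that takes per-flow counters and lists the flows with too many TCP retransmissions, along with the clients affected. The `FlowStats` record, the 2% threshold and the sample addresses are all assumptions made for the example; they do not come from any particular NPM product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowStats:
    """Hypothetical per-flow counters; field names are illustrative only."""
    client: str
    server: str
    server_port: int
    packets: int
    retransmits: int

    @property
    def retransmit_rate(self) -> float:
        # Guard against flows that have not carried any packets yet.
        return self.retransmits / self.packets if self.packets else 0.0


def flows_with_retransmit_problems(flows, max_rate=0.02):
    """Return flows whose retransmission rate exceeds max_rate, worst first.

    The 2% default is an assumed threshold, not an industry standard.
    """
    problems = [f for f in flows if f.retransmit_rate > max_rate]
    return sorted(problems, key=lambda f: f.retransmit_rate, reverse=True)


def affected_clients(problem_flows):
    """Return the distinct client addresses behind the problem flows."""
    return sorted({f.client for f in problem_flows})


# Made-up sample data standing in for live per-flow measurements.
SAMPLE_FLOWS = [
    FlowStats("10.0.1.15", "10.0.9.20", 443, 12000, 60),
    FlowStats("10.0.1.22", "10.0.9.20", 443, 8000, 640),
    FlowStats("10.0.2.7", "10.0.9.31", 21, 5000, 150),
    FlowStats("10.0.1.22", "10.0.9.31", 21, 0, 0),
]

if __name__ == "__main__":
    bad = flows_with_retransmit_problems(SAMPLE_FLOWS)
    for f in bad:
        print(f"{f.client} -> {f.server}:{f.server_port} "
              f"retransmit rate {f.retransmit_rate:.1%}")
    print("Affected clients:", ", ".join(affected_clients(bad)))
```

The point of the sketch is the shape of the output: a short list of specific flows and specific users, sorted by severity, rather than a single aggregate retransmission figure.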
Most dashboards get their results from NetFlow data or similar data collected by network devices. The weakness of this approach is that it doesn’t provide access to the data in real time, so we can’t pivot on the data to see what’s actually going on in the network as it happens.
Here are a few things our dashboards should be capable of:
- Enter the IP address of any one of 100,000 or more nodes, and see all traffic in real time, for as long as needed.
- Specify a port/protocol combination and see 375 (or just two) flows within seconds.
- Monitor TCP quality, application and network latency, and MOS for every flow in real time. Not an aggregate, but every flow. Report the worst offenders in any desired geographical area or network location.
- Provide a granular view that allows network engineers to drill down to a single desktop easily and fluidly.
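The capabilities above amount to a handful of queries over per-flow measurements. The Python sketch below shows what those queries might look like against an in-memory snapshot of flows. It is an assumption-laden illustration: the `FlowMetrics` record, its field names, the site labels and the sample values are all hypothetical, and a real product would have to answer the same questions against live traffic rather than a static list.

```python
import heapq
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlowMetrics:
    """Hypothetical per-flow measurements; field names are illustrative."""
    src: str
    dst: str
    dst_port: int
    protocol: str
    site: str
    network_latency_ms: float
    app_latency_ms: float
    mos: Optional[float] = None  # only set for voice flows


def flows_for_host(flows, ip):
    """All flows in which the given IP address is either endpoint."""
    return [f for f in flows if ip in (f.src, f.dst)]


def flows_for_service(flows, port, protocol):
    """All flows matching a port/protocol combination."""
    return [f for f in flows
            if f.dst_port == port and f.protocol == protocol]


def worst_by_latency(flows, n, site=None):
    """The n flows with the highest network latency, optionally per site."""
    scoped = [f for f in flows if site is None or f.site == site]
    return heapq.nlargest(n, scoped, key=lambda f: f.network_latency_ms)


def worst_by_mos(flows, n, site=None):
    """The n voice flows with the lowest MOS, optionally per site."""
    voice = [f for f in flows
             if f.mos is not None and (site is None or f.site == site)]
    return heapq.nsmallest(n, voice, key=lambda f: f.mos)


# Made-up snapshot standing in for live per-flow measurements.
SAMPLE = [
    FlowMetrics("10.0.1.15", "10.0.9.20", 443, "tcp", "boston", 12.0, 80.0),
    FlowMetrics("10.0.1.22", "10.0.9.20", 443, "tcp", "boston", 95.0, 400.0),
    FlowMetrics("10.0.2.7", "10.0.9.40", 16384, "udp", "austin", 30.0, 5.0, 3.1),
    FlowMetrics("10.0.2.9", "10.0.9.40", 16384, "udp", "austin", 45.0, 6.0, 4.2),
    FlowMetrics("10.0.1.22", "10.0.9.31", 21, "tcp", "boston", 20.0, 50.0),
]

if __name__ == "__main__":
    print("Flows for 10.0.1.22:", len(flows_for_host(SAMPLE, "10.0.1.22")))
    print("Flows on 443/tcp:", len(flows_for_service(SAMPLE, 443, "tcp")))
    for f in worst_by_latency(SAMPLE, 2):
        print(f"High latency: {f.src} -> {f.dst} {f.network_latency_ms} ms")
    for f in worst_by_mos(SAMPLE, 1, site="austin"):
        print(f"Low MOS: {f.src} -> {f.dst} MOS {f.mos}")
```

Each function returns individual flows, not summaries, so a result can be followed straight down to a single desktop. Doing this at line rate for 100,000 or more nodes is the hard part, and the sketch does not attempt it.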
The best way to sum it up is that we need a better way to see the worst things happening on our network, right now: the worst TCP quality, the worst network or application latency, the worst VoIP MOS scores, and where FTP traffic policies are being broken. NPM dashboards have their place, but the industry seems ripe for change. We need a "crashboard" that provides actionable data on the worst parts of our network, not just another dashboard.