
Reporting and Alarms
For performance-management purposes, you want to be able to produce meaningful reports that describe how a metric is trending relative to a baseline. For the most critical service elements you may examine such reports every day or once a week. But you don't have the time to check a report on every measured service element with such frequency -- so you need some mechanism by which you are alerted when a particular metric has changed in a significant manner. This is achieved by means of thresholds and alarms. A threshold is a baseline set to a level of the metric at which you want to become aware of trends in that metric.
I use the term baseline in the generic sense of "yardstick" or "standard for comparison." While the term is often used in this sense, I have heard several capacity-planning people argue over the semantics of this use. Many prefer to reserve the term "baseline" for the current state at the time of the original measurement and use the term "threshold" only for the level at which awareness becomes necessary. Unfortunately, the term "baseline" is already used to mean something much more similar to "threshold" in the related field of service-level management -- the target level of performance.
When a threshold is exceeded you want to be notified by means of an alarm, e-mail, page or other "pushed" indicator. As we discussed previously, there is a capability in SNMP to send traps from devices in a network to a network management system. This approach is used to report faults such as a line down or an interface not responding, but it can also be used to send alerts when certain thresholds are exceeded. For example, this mechanism is frequently used by the SNMP agents in Ethernet hubs to report when preset error thresholds have been exceeded. However, such mechanisms tend to be focused on real-time changes in the operating environment rather than trends which develop over time. The threshold/alarm functions used in performance management are, in fact, usually provided by the reporting tools.
Let's take a quick look at some of the actual tools that can be used to collect and summarize the data and generate these alarms.
There are two classes of reporting tool that we're interested in. The first class of tools is used to collect and report on data from SNMP agents. The second class is the polling systems, which usually combine data collection and reporting capabilities.
Examples of the first class of tools are Kaspia Network Audit Technology from Kaspia Systems and NetworkHealth from Concord Communications. These tools interrogate MIB data from a wide variety of SNMP agents and can provide a large variety of summary reports showing trends in the collected data over time. They also support various threshold mechanisms so that a network manager can be notified when a particular service element requires attention.
The second class of tools is quite diverse. If you have identified a number of different services that require this type of approach, you may be best served by a tool that provides a wide range of polling alternatives. For example, IP.Check from Baranoff Software addresses a wide range of IP-based applications, as well as providing simple TCP/IP network-level polling using ICMP (Internet Control Message Protocol). On the other hand, if you are more focused on a particular type of service element, then you may want to investigate tools that offer more depth in a particular area. For example, AlertPage from Geneva Software is focused on response and availability of networks and servers (at the network level), while MailCheck, also from Baranoff, is focused purely on messaging systems.
It is beyond our scope to provide an in-depth analysis of all the tools available in each category. Similarly, it is not practical to offer exact advice on what threshold values should be used for each type of network technology. In many cases the tools that offer threshold capabilities will have default values already set, and those are a good place to start. Otherwise, since utilization seems to be the metric which causes the most confusion, here are some guidelines based on my own experience for four common classes of service element. I have assumed in each case that you will set thresholds against metrics for both average utilization and peak utilization.
- Leased lines. Average utilization: 45% of line speed, peak utilization: 70%. Measurement period of one day.
- Frame relay. Since frame relay allows burst rates above the committed information rate (CIR), you can afford a smaller margin of error. Average utilization: 55% of CIR, peak utilization: 80%. Measurement period of one day.
- Ethernet LANs. The way Ethernet works is that a device on a shared LAN which needs to send data simply waits for the wire to go quiet, then places its data on the LAN segment. If another device on the same logical segment attempts to do this at the same time, both devices detect a 'collision' and back off for a short random (yes random!) period of time before waiting for another quiet period to try again. In practice this actually works very well while the utilization remains low. However, as the utilization increases, the performance gets exponentially worse. At around 40% utilization all that's happening is collisions and no actual data is getting sent. Therefore you never want to get anywhere near 40%. Set thresholds at 15% for average utilization and 25% for peak utilization. Those numbers can be increased for pure switched networks (since collisions are no longer a consideration) to 25% and 40%. Measurement period of 15 minutes.
- Other LAN technologies. No such problem for token ring and ATM. Average utilization: 50%, peak utilization: 70%. Measurement period of 15 minutes.
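To make the guidelines above concrete, here is a minimal sketch in Python of how the average and peak thresholds might be held in a table and checked against one measurement period's worth of utilization samples. The names (UtilizationThreshold, THRESHOLDS, check_utilization) and the class labels are hypothetical and chosen only for illustration; they do not come from any of the tools mentioned, and the code assumes the samples have already been collected as percentages.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class UtilizationThreshold:
    """Hypothetical container for one class of service element."""
    average_pct: float      # alarm if the period average exceeds this
    peak_pct: float         # alarm if the period peak exceeds this
    period_minutes: int     # length of the measurement period


# Guideline values taken from the list above; the keys are illustrative labels.
THRESHOLDS = {
    "leased_line": UtilizationThreshold(45.0, 70.0, 24 * 60),   # % of line speed
    "frame_relay": UtilizationThreshold(55.0, 80.0, 24 * 60),   # % of CIR
    "ethernet_shared": UtilizationThreshold(15.0, 25.0, 15),
    "ethernet_switched": UtilizationThreshold(25.0, 40.0, 15),
    "other_lan": UtilizationThreshold(50.0, 70.0, 15),          # token ring, ATM
}


def check_utilization(element_class, samples):
    """Return alarm messages for one measurement period.

    `samples` is a sequence of utilization readings (in percent) that
    together cover one measurement period for the given class.
    """
    if not samples:
        raise ValueError("at least one utilization sample is required")
    limit = THRESHOLDS[element_class]
    average = mean(samples)
    peak = max(samples)
    alarms = []
    if average > limit.average_pct:
        alarms.append(
            f"average utilization {average:.1f}% exceeds {limit.average_pct:.0f}%"
        )
    if peak > limit.peak_pct:
        alarms.append(
            f"peak utilization {peak:.1f}% exceeds {limit.peak_pct:.0f}%"
        )
    return alarms


if __name__ == "__main__":
    # One 15-minute period on a shared Ethernet segment (made-up readings).
    for message in check_utilization("ethernet_shared", [10, 12, 30, 8]):
        print(message)
```

In this sketch a reading that merely equals a threshold does not raise an alarm; whether "exceeded" should include equality is a policy choice you would make to match your own reporting tool.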
Control
Tools that report on performance are just that -- tools. They are of no value unless it is clearly understood how the output from those tools will be used. For each key service and measured service element, you must define who is responsible for generating and analyzing reports, who receives alarms, who tunes thresholds and (most importantly) the mechanisms by which exceeded thresholds lead to network capacity or topology changes.