A popular new measurement methodology, APM (application performance measurement), aims to reverse this trend by supplying metrics that measure the productivity, efficiency and quality of distributed computing infrastructures. The central tenets of APM are that what matters most is what end users experience (for example, how long it takes to download a Web page), and that because these users interact directly with the application, looking at the application will give the proper perspective. APM is one of the most important management technologies to emerge in the past 10 years because it helps IT personnel provide a higher quality of service to their customers. For instance, by using APM technology, an engineer can determine which glitches affect customers (and which don't).
Dozens of network- and systems-management software vendors are selling APM software, and most of the rest have a road map for how APM fits into their strategies. Common terminology has emerged, and several distinct collection architectures have become successful. In addition, a standard is being developed in the IETF for the collection of APM information. This standard, the APM MIB, is an important milestone as the APM industry matures.
Defining APM
Despite all the technically sophisticated ways in which network and systems resources can be measured, end users perceive only two things about an application's performance: availability and responsiveness. These terms can be defined as follows:
» Availability: The percentage of the time that the application is ready to give a user service.
» Responsiveness: The speed or performance of the service.
A transaction is an action initiated by a user that starts and completes a distributed-processing function. A transaction begins when a user initiates a request for service (such as pushing a "submit" button) and ends when the work is completed (information is provided or a confirmation is delivered). A transaction is the fundamental item measured by APM software.
An example of an availability metric is "99.8 percent of all Web transactions were successful." An example of a response-time metric is "This Web page took 10 seconds to download" or "95 percent of all Web hits were downloaded in less than five seconds."
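As a rough illustration of how metrics like these could be derived from raw measurements, the sketch below assumes each transaction is recorded as a success flag plus an elapsed time in seconds. The `Transaction` record and both function names are hypothetical and are not taken from any APM product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transaction:
    """One measured transaction (hypothetical record layout)."""
    succeeded: bool
    elapsed_seconds: float


def availability_percent(transactions: List[Transaction]) -> float:
    """Percentage of all transactions that completed successfully."""
    if not transactions:
        raise ValueError("no transactions measured")
    ok = sum(1 for t in transactions if t.succeeded)
    return 100.0 * ok / len(transactions)


def percent_within(transactions: List[Transaction], limit_seconds: float) -> float:
    """Percentage of successful transactions that finished in under the limit."""
    done = [t for t in transactions if t.succeeded]
    if not done:
        raise ValueError("no successful transactions measured")
    fast = sum(1 for t in done if t.elapsed_seconds < limit_seconds)
    return 100.0 * fast / len(done)
```

With 998 successes out of 1,000 recorded transactions, `availability_percent` would report 99.8 percent; `percent_within(transactions, 5.0)` gives the share of successful hits that completed in less than five seconds.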
These metrics are profoundly different from previous metrics because end users understand them and can directly see and feel them. Further, the metrics roll up the combined effects of the complicated systems and subsystems involved among clients, networks and servers. No longer will end users hear explanations like, "Excessive page faults on the database server severely impacted performance on the payroll run" or "Route flapping between the ISP and the DMZ router caused the call-center application to shut down every 180 seconds." Instead, because customers care most about availability and responsiveness, reporting to them in APM terms tells them what matters, in terms they understand, and makes clear when performance is good and when it needs improvement.
Different Types of APM
Although APM vendors agree broadly on these measurements, they have created several different methodologies to accomplish the task. The two main types of APM software are active and observational.
Active agents simulate a desktop and continuously issue "synthetic transactions" to servers, recording the elapsed time and success of these transactions. These agents can run on desktops or probes, or be embedded in network equipment. An example of synthetic transaction software is NetIQ Corp.'s Pegasus, which has agents that can run on most desktops or servers. The customer configures which transactions to send and how frequently; Pegasus tests the performance, regularly reporting to a management station.
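The general idea of an active agent can be sketched in a few lines. This is a minimal illustration and does not show how Pegasus or any other product works: the `action` callable stands in for whatever request the customer configures, and the names `run_synthetic_transaction` and `probe` are hypothetical.

```python
import time
from typing import Callable, List, Tuple


def run_synthetic_transaction(
    action: Callable[[], None],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[bool, float]:
    """Run one synthetic transaction; return (succeeded, elapsed seconds)."""
    start = clock()
    try:
        action()
        succeeded = True
    except Exception:
        # Any exception raised by the action counts as a failed transaction.
        succeeded = False
    return succeeded, clock() - start


def probe(
    action: Callable[[], None],
    repetitions: int,
    interval_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[Tuple[bool, float]]:
    """Issue the same transaction repeatedly, pausing between attempts."""
    results = []
    for i in range(repetitions):
        results.append(run_synthetic_transaction(action, clock))
        if i < repetitions - 1:
            sleep(interval_seconds)
    return results
```

A real agent would forward each result to a management station rather than return a list. The `sleep` and `clock` parameters are injectable here only so the loop can be exercised without waiting.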
Observational agents run on desktops or probes and observe actual application activity, recording transaction performance. They may do this by decoding the network traffic, by monitoring desktop software activity or by taking a hybrid approach. Lucent Technologies' VitalSuite, for example, includes a lightweight desktop agent for Windows. This agent observes the desktop's network activity as well as user interaction with the application, and forwards the measurements to VitalSuite for reporting and analysis.
This diversity of techniques provides customers with closer fits to their needs. In fact, as customers gain experience with these methodologies, they are realizing the need for more than one technique. For example, a business might need synthetic transactions to monitor its Web sites and desktop-based observational tools to monitor all business applications. The trend is to integrate these approaches into one suite so customers can have one common reporting infrastructure and user interface no matter the methodology.
A Standard Is Born
As the APM market matured, standardization became necessary. Standardization provides interoperability between tools from different vendors and also makes APM tools easier to integrate into management frameworks. This lets customers build systems by combining best-of-breed collection software with the best reporting software. Standards also ensure accuracy and consistency among products and reduce training costs, because multiple products produce the same metrics. An effort began to create a MIB to be used for transporting APM information via SNMP.
The APM MIB first identifies the standard metrics that every APM-compliant system will collect. Nearly every description of APM mentions application response time. However, application response time is useful only for some applications, and other responsiveness metrics must be considered. For example, consider a streaming video that is displaying a movie: Watching a two-hour movie in five minutes is not what is intended -- another metric is needed. Similarly, it is reasonable to expect the transfer of a 10-MB file to take more time than the transfer of a 10-KB file, even on systems of "identical performance." Different kinds of application transactions require different measurements.
Application protocols implement one of three types of transactions: transaction oriented, throughput oriented or streaming oriented. While the availability metric is the same for all three, the responsiveness metric varies.
» Transaction oriented: These request-response transactions have a fairly constant amount of data to transfer. The responsiveness metric for transaction-oriented applications is application response time: the elapsed time between the user's request for service and the completion of the request. This is measured in milliseconds.
» Throughput oriented: These transactions have larger and varying amounts of data to transfer. The responsiveness metric is the data rate.
» Streaming oriented: These transactions deliver data at a constant metered rate, even if extra bandwidth is available. However, when the infrastructure cannot deliver data at this speed, interruption of service or degradation of service can result. The responsiveness metric is the signal quality expressed as the ratio of time the service is degraded or interrupted to total service time. This metric is measured in parts per million: for example, a five-minute phone call with three seconds of dropouts (1 percent, or 10,000 ppm) or a two-hour video with a two-second glitch of frozen or black screen (about 0.028 percent, or 278 ppm).
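The three responsiveness metrics in the list above reduce to simple arithmetic. The following sketch uses hypothetical function names, and it assumes bytes per second as the unit for data rate, which the text does not specify.

```python
def response_time_ms(request_time_s: float, completion_time_s: float) -> float:
    """Transaction oriented: elapsed time from request to completion, in ms."""
    return (completion_time_s - request_time_s) * 1000.0


def data_rate_bytes_per_second(bytes_transferred: int, elapsed_s: float) -> float:
    """Throughput oriented: data rate (bytes per second is an assumed unit)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return bytes_transferred / elapsed_s


def degradation_ppm(degraded_s: float, total_service_s: float) -> float:
    """Streaming oriented: degraded or interrupted time per million of service time."""
    if total_service_s <= 0:
        raise ValueError("total service time must be positive")
    return 1_000_000 * degraded_s / total_service_s
```

For the phone call above, `degradation_ppm(3, 300)` evaluates to 10,000 ppm; for the two-hour video, `degradation_ppm(2, 7200)` comes to about 277.8 ppm.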
An application might act differently at different times. An e-mail login is transaction oriented, while an e-mail download is throughput oriented because the amount of data to transfer depends on the amount of e-mail in the mailbox. The APM MIB lets vendors, customers or both define which responsiveness metric to track for each application.
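One way to picture such a per-application choice is a lookup table keyed by application and operation. This is only an illustration of the idea and does not reflect the actual structure of the APM MIB; the enum, the table and the `metric_for` helper are all hypothetical.

```python
from enum import Enum


class Responsiveness(Enum):
    """Which responsiveness metric applies to a kind of transaction."""
    RESPONSE_TIME_MS = "transaction oriented"
    DATA_RATE = "throughput oriented"
    DEGRADATION_PPM = "streaming oriented"


# Hypothetical configuration: the same application can map to different
# metrics depending on the operation being performed.
METRIC_FOR = {
    ("e-mail", "login"): Responsiveness.RESPONSE_TIME_MS,
    ("e-mail", "download"): Responsiveness.DATA_RATE,
}


def metric_for(application: str, operation: str) -> Responsiveness:
    """Look up the configured responsiveness metric; raises KeyError if unset."""
    return METRIC_FOR[(application, operation)]
```

Here `metric_for("e-mail", "login")` selects response time, while `metric_for("e-mail", "download")` selects data rate.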