The key business application is slow. It must be the network! The application worked fine in development and early deployment. The network team says that everything is running normally. No alerts on the network management system (NMS) and no performance bottlenecks. The IT team says that everything on the servers is also running normally, with low CPU utilization, reasonable free memory, and a reasonable number of disk I/Os.
So why is the application slow? No one can say for sure. It is a vendor-provided application, so there is no information within the organization about its internal functions. Sure, it’s a modern web-based app, so there is the web front-end, the application server itself, and a back-end database system. The problem could be hidden anywhere in or between these systems. And don’t forget that there are load balancers and firewalls in the mix. Let’s see, how many components are involved? There are a bunch of network interfaces on multiple switches, there are two load balancers, two firewalls (you do have redundant load balancers and firewalls, don’t you?), and at least six servers, each with disk, CPU, and memory. Oh, and the disk is really a redundant storage system.
Each piece of the puzzle seems to be functioning correctly, but the collective system is having problems from time to time. To make matters worse, it is an intermittent problem that happens only a few times a day, but when it does, the app is very slow. How does your IT staff find the problem?
Where to look
The problem could be in the network. There could be a link with errors or interface drops that occur only when the network is congested. Duplex mismatches could be causing errors when the traffic load increases. Are hosts still configured with old DNS servers, so that address lookups time out before a valid DNS server finally answers?
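One way to test the DNS theory is to time name lookups from an affected host. The sketch below is only an illustration: the function names, the one-second threshold, and the injectable resolver argument are assumptions, not part of any particular tool.

```python
import socket
import time

def time_lookup(hostname, resolver=socket.getaddrinfo):
    """Return (elapsed_seconds, succeeded) for a single name lookup."""
    start = time.perf_counter()
    try:
        resolver(hostname, None)
        ok = True
    except OSError:  # socket.gaierror is a subclass of OSError
        ok = False
    return time.perf_counter() - start, ok

def slow_lookups(hostnames, threshold_s=1.0, resolver=socket.getaddrinfo):
    """Return (name, seconds, succeeded) for lookups that failed or were slow."""
    flagged = []
    for name in hostnames:
        elapsed, ok = time_lookup(name, resolver)
        if not ok or elapsed > threshold_s:
            flagged.append((name, round(elapsed, 3), ok))
    return flagged
```

Calling slow_lookups with the names the application actually resolves, at intervals during the day, can show whether lookups stall. Keep in mind that local caching may hide a slow or dead DNS server on repeated lookups.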
Middleware boxes like load balancers and firewalls could be the problem. Are firewall connection limits being exceeded at busy times of the day? Do the load balancers have the right set of servers configured?
The servers or applications themselves could be the source of problems. There could be a database query that worked fine with a small test and development database, but is not efficient as the production database grows.
Do the servers have sufficient CPU, memory, and I/O capacity for the application? The application could have been created in a way that uses a lot of disk I/O, which works fine in a test and development environment, but fails to scale to the requirements of a production load. The healthcare.gov application is a good example of an application that worked in development, but didn’t scale up when presented with the production load.
Finding the problem manually
Yes, you can use the manual method to try to identify the source of an application performance problem. The problem with the manual approach is that it takes a lot of time. Collecting the necessary data is easier if you develop some scripts to pull statistics from the NMS or directly from the devices (using the CLI and screen scraping). It is very helpful to collect the data every week so that changes are easily identified. For one-shot troubleshooting, this is an acceptable approach, but it does not scale as the number and complexity of applications grow.
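As a rough illustration of the comparison step, the sketch below assumes the collection scripts have already parsed each run into a dictionary of per-interface counters. The interface and counter names are hypothetical, and the code ignores counter wraps and resets.

```python
def counter_deltas(previous, current):
    """Return {interface: {counter: increase}} for counters that grew between runs."""
    changes = {}
    for ifname, counters in current.items():
        before = previous.get(ifname, {})
        grew = {
            name: value - before.get(name, 0)
            for name, value in counters.items()
            if value > before.get(name, 0)
        }
        if grew:
            changes[ifname] = grew
    return changes

# Hypothetical snapshots from two collection runs.
last_run = {"Gi0/1": {"in_errors": 10, "out_drops": 0}}
this_run = {"Gi0/1": {"in_errors": 10, "out_drops": 7}}
print(counter_deltas(last_run, this_run))  # {'Gi0/1': {'out_drops': 7}}
```

Reporting only the counters that increased keeps the output short enough to read, which matters when the same comparison runs across many devices.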
Using automated application performance monitoring
A better long-term solution is to deploy an automated Application Performance Monitoring (APM) system. These systems can easily identify whether an application system is not performing properly and what part of the system is having problems. They allow you to focus on the applications instead of doing script development or manual investigation.
Some APM systems require capturing packets from multiple places in the application infrastructure. They then perform complex analysis to correlate the packets and identify retransmitted packets, congestion, and slow servers. Capturing traffic between servers allows these systems to identify when a specific server or function is the cause of the slow application. We’ve seen cases where this type of system has identified a problem within a cloud provider’s service (application as a service) simply by identifying that the reply from the cloud-based service to an enterprise client was taking too long.
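To give a feel for one piece of that analysis, here is a deliberately simplified sketch of spotting retransmissions. It assumes the captured packets have already been reduced to (source, destination, sequence number, payload length) tuples, and it treats any repeated data segment as a retransmission. Commercial products apply far more sophisticated logic than this.

```python
from collections import Counter

def count_retransmissions(packets):
    """Count repeated data segments per (source, destination) pair."""
    seen = set()
    retransmissions = Counter()
    for src, dst, seq, length in packets:
        if length == 0:
            continue  # pure ACKs carry no data, so repeats are expected
        key = (src, dst, seq)
        if key in seen:
            retransmissions[(src, dst)] += 1
        else:
            seen.add(key)
    return retransmissions
```

A direction with a high count relative to its traffic points at loss or congestion somewhere on that path, which narrows down where to look next.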
Other APM systems rely on hardware and software monitoring probes to collect data and to perform point-to-point tests across the network. Probes typically have the ability to generate synthetic transactions against applications — something we call an application-level ping. Alerts can be generated whenever application-level ping response times exceed specified thresholds. Software probes can be installed on end user computers, allowing easy monitoring of the end user experience for clients who are using both fixed and mobile devices.
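The idea behind an application-level ping can be sketched in a few lines. This is a minimal illustration, not how any specific probe works: the function name, the result fields, and the threshold are assumptions, and the transaction is passed in as a callable so it can be anything from a page fetch to a login sequence.

```python
import time

def app_ping(transaction, threshold_s):
    """Run one synthetic transaction and report whether it should raise an alert."""
    start = time.perf_counter()
    try:
        transaction()
        ok = True
    except Exception:
        ok = False  # a failed transaction counts as an alert condition
    elapsed = time.perf_counter() - start
    return {
        "ok": ok,
        "elapsed_s": elapsed,
        "alert": (not ok) or elapsed > threshold_s,
    }
```

A probe would run something like this on a schedule, for example with a transaction that fetches a known page using urllib.request.urlopen, and forward any result with alert set to the monitoring system.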
The downside to APM products is their cost. While they are expensive, they perform a lot of analysis and provide real answers on a regular basis. They are much more scalable than manual methods and can quickly provide detailed insights into an application and whether it is functioning correctly. Ready access to application performance information means that updates to applications can be quickly checked to make sure that performance has not suffered. If performance has degraded, the APM system can help identify the cause.
The decision often comes down to a tradeoff between staff time and money. A good automated application performance monitoring system, run by well-trained staff (don’t forget training), will outperform the manual method every time. The IT staff can then focus on the business, not on collecting application statistics.
This article originally appeared on the NetCraftsmen blog.