Playbook: Staying One Step Ahead of Performance
Eliminating system or network failures is critical in today's fast-paced environment. Contributing Editor Eric A. Hall walks you through how to analyze and manage your applications' performance requirements to help
September 27, 2003
Analyze This
Predicting demand for your application load, network traffic, and disk or any other system resources is the toughest part of performance planning. Not only do you need to know the intimate details of the underlying technologies, you need to be familiar with the inner workings of your organization's business and to understand how those aspects affect demand.
In your performance analysis, first examine how application logic is distributed across your endpoints, then determine the minimum bandwidth and latency requirements for each user session, as well as the expected peak-processing load. Because these factors vary from application to application, you'll need to scrutinize them on a case-by-case basis. With Web applications, for example, the processing load typically falls on the server, and processing time is more important than network latency. VoIP (voice over IP), meanwhile, relies heavily on the network, since the technology usually is implemented as a peer-to-peer system.
You can learn a lot from tracking your system's usage patterns. Short-term usage patterns, for instance, affect the demands on your system's resources: When a user fires up his or her application, there's usually a flurry of initial traffic as the client authenticates to the system and navigates to its destination. Traffic dies down after login, as does the demand on system resources. You can take advantage of this ebb and flow by off-loading certain tasks. For example, you can run authentication on a dedicated server rather than cramming everything onto one server.
Another trend you can glean from your usage data is peak traffic. With business applications, production workloads usually peak midmorning and midafternoon, while staff-related traffic, such as data entry, typically remains steady around the clock. Of course, if your users are spread across the nation or the world, working hours will vary by time zone. You should design your servers and network to accommodate spikes in usage before and during busy seasons--for example, at holiday time if you're a retailer, or in April if you're an accounting firm.
Beware of changes in usage patterns after an upgrade. If your new, enhanced e-mail server supports remote folders better than your old one did, for instance, look out for more demand on the system as your users begin filing away their e-mail messages on the server rather than locally on their own machines.
Bottom line: Don't just guess at your usage patterns and trends; study them closely and regularly, and make adjustments as necessary. The more accurate your usage information, the better your performance planning.
Planning for growth, however, is tricky because it varies by application. While demand for a task-specific application, such as an online expense tool, grows incrementally with the number of employees, the resource demands on e-mail can grow exponentially with the influx of spam, for instance. And when you add spam filters to clean up the unwanted mail traffic, your server-processing overhead increases, too.
Another factor that can increase traffic volume is the so-called flash-crowd effect: A sharp increase in the number of users trying to access a Web site or intranet server at the same time because of a change in your company's ranking in a search engine or a news flash in a corporate newsletter. How do you plan for potential growth? The best practice is to design for 300 percent to 500 percent extra capacity on external-facing hosts, such as your Web server, and about 50 percent extra overhead for your internal server. That includes overbuilding your network capacity as well.
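The headroom guideline above can be turned into a quick calculation. The sketch below is only illustrative: it reads "extra capacity" as capacity added on top of the expected peak, which is an assumption, and the function name and the peak figure of 200 requests per second are hypothetical.

```python
def provisioned_capacity(expected_peak, extra_fraction):
    """Capacity to build for, given an expected peak and extra headroom.

    extra_fraction is the extra capacity as a fraction of the peak
    (3.0 means 300 percent extra, i.e. four times the expected peak).
    """
    if expected_peak < 0 or extra_fraction < 0:
        raise ValueError("inputs must be non-negative")
    return expected_peak * (1 + extra_fraction)


# Hypothetical peak of 200 requests per second on each kind of host.
web_low = provisioned_capacity(200, 3.0)    # external host, 300 percent extra
web_high = provisioned_capacity(200, 5.0)   # external host, 500 percent extra
internal = provisioned_capacity(200, 0.5)   # internal server, 50 percent extra
```

Under that reading, an external-facing host would be sized for four to six times its expected peak, while an internal server would be sized for one and a half times its peak.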
Once you've identified variables like these that can affect your system, it helps to use the Monte Carlo simulation model against your projections. It will give you a series of outcome scenarios: Rather than planning on a likelihood of a fixed number of simultaneous users, for instance, you can determine the possible ranges of users, which will make your growth projections more comprehensive. Then use the results of this simulation to estimate your traffic patterns. Although the Monte Carlo simulation is typically used for testing purposes, you can use its range of growth numbers to build a solid model for both the planning and testing phases.
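A minimal sketch of the idea, assuming just two uncertain inputs: head count and the share of staff online at peak. The ranges, the uniform distributions and the function names are all hypothetical; a real model would use distributions fitted to your own usage data.

```python
import random


def simulate_peak_users(trials, employees_range, active_share_range, seed=0):
    """Monte Carlo sketch: sample uncertain inputs, return sorted outcomes.

    Each trial draws a head count and the share of staff online at peak
    from uniform ranges (both ranges are illustrative assumptions).
    """
    rng = random.Random(seed)
    outcomes = []
    for _ in range(trials):
        employees = rng.randint(*employees_range)
        share = rng.uniform(*active_share_range)
        outcomes.append(round(employees * share))
    outcomes.sort()
    return outcomes


def percentile(sorted_values, pct):
    """Simple index-based percentile of an already sorted list."""
    index = max(0, min(len(sorted_values) - 1,
                       round(pct / 100 * (len(sorted_values) - 1))))
    return sorted_values[index]


users = simulate_peak_users(10_000, (800, 1200), (0.10, 0.35))
low, typical, high = (percentile(users, p) for p in (5, 50, 95))
```

Instead of a single figure for simultaneous users, the result is a range (here the 5th, 50th and 95th percentiles) that can feed both capacity planning and the load levels used in testing.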
When you've completed the analysis phase, it's time to build or rebuild your application or system. The requirements you identified in your original performance analysis will dictate your design, so it may entail building a storage system, for instance, that focuses on high throughput or fast seek times. The application's latency and bandwidth requirements, too, may determine whether the servers or application are distributed or centralized. Off-loading authentication and logging functions onto separate systems, for example, lets you better scale the architecture. That's easier and cheaper than trying to fix capacity problems in one monolithic system.
Similarly, distributing a system geographically can be less expensive than trying to build a massive system at a central location. Groupware applications, for instance, usually are cheaper to operate if they're distributed geographically because traffic is then contained within a region. Scheduling in groupware typically occurs within a department or workgroup, so it's not necessary to have all the traffic go to a central server. Distributing these systems also lets you offer more bandwidth-sensitive features like remote e-mail folders because there aren't any bandwidth constraints. This architecture is not for all applications, though. Web messaging environments, in contrast, work best with centralized servers.
With an overall distributed architecture, it's best to sign up with multiple WAN service providers. If your system will be accessed by the general public, for instance, you should buy connectivity from multiple providers to ensure you're creating the shortest and cleanest path to the largest number of end users. This tactic also limits your exposure to ISP outages because you won't have all your users in one basket--as long as you build in redundancy, that is.
Keep your service providers and their partners informed about the changing demands of your system. Remember that they have a supplier chain of their own: If you need an additional circuit, for instance, your ISP may have to go through the phone company, which in turn needs to upgrade some infrastructure equipment, and so on. Maintaining close ties to your service provider will prevent you from having to scramble for additional resources when there's a spike in your system's usage.
Change management is another key element in the buildout phase. Make sure all the related components of your system are running the same software versions and configuration settings and that you can upgrade them in sync. Testing might reveal some software version discrepancies, but it's easier to take care of these details from the beginning using change-management and replication tools.
And keep in mind that latency is cumulative, and too much segmentation can increase latency on the overall system. Say your system is split into 10 different components with each requiring 500 milliseconds to set up, process and tear down connections. That's five seconds of overall latency. You can reduce that latency time significantly with a centralized or less distributed architecture, but at the expense of scalability and, in some cases, efficiency.
Regardless of your initial design criteria, you'll probably end up rebuilding the system at least once. Testing--which we tackle in the next section--will almost certainly reveal flaws in your specifications, and deployment will uncover weaknesses in your testing methodology. So be prepared to adjust your design and build your secondary systems for the unexpected, with items like graphics-free Web pages for those spikes in traffic and resources. If a Web page with SSL (Secure Sockets Layer) has heavy graphic files that each require a new connection, performance can suffer miserably. Instead of forcing users to turn off image loading in their browsers to get around this kludge, build alternate pages without GIF images. That way, you can support more users during peak usage times.
Ironically, testing is the most error-prone part of the performance-planning process. Each component is analyzed for utilization, and the entire system is stress-tested. Trouble is, you have to test against your assumptions and biases, which are likely to be at least partially wrong. To catch these kinds of errors, make sure each of the discrete and holistic tests represents the actual usage patterns you expect. You should also test separately for the possibility of higher loads because of long-term growth, marketing promotions or seasonal demands. This will ensure that you are prepared for these projected changes, and that preparation may even provide you with alternative buildout scenarios. Short-term, off-site support systems may be adequate for spikes in growth in some cases, for instance.
For the routine usage tests, follow the behavioral patterns you pinpointed in your performance planning analysis. If an application exhibits a flurry of login activity followed by a leisurely pace of queries, mimic that in your tests. That real traffic pattern is more likely to expose the problems you'll encounter than staged frequent bursts of short-lived sessions.
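One way to mimic that pattern in a test driver is to generate a per-session schedule of request times. The sketch below is an assumption-laden illustration, not a feature of any particular load-testing tool: the function name, the gaps between login requests and the exponentially distributed think time are all hypothetical choices.

```python
import random


def session_schedule(login_requests, queries, think_time_mean, seed=0):
    """Return request offsets (seconds from session start) for one user.

    The session opens with a tight burst of login traffic, then settles
    into queries separated by random "think time" pauses.
    """
    rng = random.Random(seed)
    offsets = []
    clock = 0.0
    for _ in range(login_requests):
        clock += rng.uniform(0.05, 0.2)   # assumed gap between login requests
        offsets.append(clock)
    for _ in range(queries):
        clock += rng.expovariate(1.0 / think_time_mean)  # leisurely pace
        offsets.append(clock)
    return offsets


schedule = session_schedule(login_requests=8, queries=20, think_time_mean=30.0)
```

A test client would replay each offset against the system under test; starting many such sessions at staggered times reproduces the login flurry followed by slower query traffic.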
Conduct your tests from both ends of the connection simultaneously so you can get a full picture of problems in your design. Testing must be performed from a user's location, using his or her equipment and network connections. If you want to roll out a system that uses handheld devices on a cellular network, test performance using the same handhelds and network rather than relying on a PC-based simulator attached to the server's local Ethernet LAN segment.
You should also monitor the performance of the server and its local network segment during these same tests, though--this will reveal the source of any performance problems. The handheld devices may be doing too much query preprocessing, or perhaps the cellular network is dropping too many packets. Or maybe the server's back-end database is causing trouble. The point is you can better identify these problems with holistic testing practices that mirror real-world usage as much as possible.
Run your tests for relatively long periods before taking any initial measurements--at least a few hours for a simple application or several weeks for a complex database. And don't introduce anomalies or increased volume until the simple stuff in the initial tests is working. Test static Web page fetches before CGI scripts, for instance, and test open connections before searches in an e-mail server. Once your tests are running smoothly, add these extra elements and simultaneously ramp up the volume. Then you'll be running a fully loaded test bed that represents all the diverse scenarios you predicted in your initial analysis. Adding layers to your tests makes isolating problems simpler: If your static Web pages operated smoothly but a new layer of tests of the CGI database searches shows sudden delays, you can see where the problem lies.
Be on the lookout for unusual resource utilization during the testing phase. Say you add a set of test clients and the test shows an unexpected flatlining of processor use. That may mean that a limitation in the network's bandwidth or frame rate, or in one of the back-end components, is preventing the server from processing the additional requests efficiently.
The rule of thumb is that no subsystem should operate at more than 75 percent of its capacity for a sustained time period. (Add more resources if any piece of your system is operating at that level of contention or higher.) Even the 75 percent rate may be too high if there is any significant contention for a particular resource, like the network. TCP, for example, has built-in congestion-avoidance algorithms that kick in whenever a single packet is dropped. That can generate excessive retransmissions even at extremely low levels of utilization. The solution is to monitor your network and make the necessary tweaks until the retransmissions are eliminated, and then add at least another 25 percent capacity to allow for spikes. Proper testing will reveal the appropriate thresholds for your system.
Meantime, don't be surprised by short-term spikes in utilization. Applications typically make full use of the available CPU time or network resources. Your main concern instead should be any sustained utilization. Temporary spikes are a problem only if they become common or expose weaknesses in your overall system design, like when your network temporarily jumps to 100 percent usage and starves your other applications.
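The difference between a temporary spike and sustained utilization can be checked mechanically against monitoring samples. This is a minimal sketch: the 75 percent threshold comes from the rule of thumb above, but the function name and the choice of five consecutive samples as the definition of "sustained" are assumptions to tune for your own system.

```python
def sustained_overload(samples, threshold=0.75, window=5):
    """True if `window` consecutive samples all meet or exceed `threshold`.

    samples are utilization readings between 0.0 and 1.0 taken at a
    fixed interval; the window length is an assumed definition of
    "sustained" and should be tuned per system.
    """
    run = 0
    for value in samples:
        run = run + 1 if value >= threshold else 0
        if run >= window:
            return True
    return False


# Hypothetical readings: isolated spikes versus a long busy stretch.
spiky = [0.40, 1.00, 0.45, 0.98, 0.50, 0.42, 1.00, 0.48]
busy = [0.60, 0.80, 0.82, 0.79, 0.85, 0.90, 0.70, 0.65]
```

Here `sustained_overload(spiky)` is false even though the resource briefly hits 100 percent, while `sustained_overload(busy)` is true because five readings in a row sit at or above the threshold.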
Finally, make sure you conduct simple validation tests of things like software versions. Two servers from the same manufacturer may be running different software or firmware on an embedded component, which means they can each exhibit very different performance or utilization rates. It's best to have configuration and change-management tools in place that detect these differences so you can avoid running resource-hungry validation tests.
Most networked applications today, of course, use TCP for their underlying transport service. Although TCP is reliable and capable of very high levels of throughput when properly tuned, it's also highly sensitive to packet loss and timing delays. Unfortunately, most complex network topologies for large-scale applications suffer from both packet loss and timing delays, so applications don't get optimal TCP performance.
There are several ways to resolve this. You can optimize the TCP stacks on each of the network nodes, smooth out the network or move your servers (or their proxies) closer to the users. Most organizations choose the latter two strategies, which are the easiest ways to remedy TCP performance problems. Most loss and delay problems occur at boundary points between high- and low-capacity networks, such as a WAN connection between two offices. If multiple users are running bursty applications across a WAN, some packets will be delayed or dropped when there's an overload. You can increase the queue size of the junction router so the router caches rather than drops packets during spikes, eliminating the need for packets to be retransmitted.
A longer-term solution to smoothing out the network, however, is the use of a traffic-shaper that imposes limits on the individual traffic flows or a higher-capacity WAN connection. You can also distribute the servers closer to your users so you bypass the WAN link, though this is obviously not an option for every application.
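To make the per-flow limiting idea concrete, here is a sketch of a token bucket, a widely used rate-limiting algorithm. The article does not say which algorithm any given traffic shaper uses, so treat this as one plausible illustration; the class name, the rate and the burst size are hypothetical.

```python
class TokenBucket:
    """Per-flow rate limiter: tokens refill at `rate` per second up to `burst`."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate          # sustained rate, e.g. bytes per second
        self.burst = burst        # largest burst the flow may send at once
        self.tokens = burst
        self.last = now

    def allow(self, size, now):
        """Return True and spend tokens if `size` units may be sent at `now`."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False


flow = TokenBucket(rate=1000, burst=1500)
results = [
    flow.allow(1500, now=0.0),   # initial burst fits
    flow.allow(1500, now=0.5),   # only 500 tokens refilled: held back
    flow.allow(1500, now=1.5),   # bucket full again
]
```

A shaper built this way would queue or delay the packet that is held back rather than drop it, which smooths bursts before they reach the slower WAN link.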
Another trade-off with TCP is that its sessions linger in a dormant TIME-WAIT state after they close to prevent laggard packets from being mistaken for part of a separate session. The TCP specification sets this wait at twice the maximum segment lifetime, or four minutes, though many software vendors use shorter timers. That's a long time for a request that might take less than one second to process. And each of these "dead" connections also requires memory and processing capacity, so it can burden systems with extremely high volumes of short-lived sessions.
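The burden of those dormant sessions can be estimated with simple arithmetic: the number held at any moment is roughly the rate at which sessions close multiplied by how long each one is held. The sketch below assumes a steady close rate; the function names, the hold time and the per-entry memory figure are illustrative assumptions, since both vary by operating system.

```python
def dormant_connections(closes_per_second, hold_seconds):
    """Steady-state count of closed sessions still held in memory.

    By Little's law, items in a system = arrival rate x time in system.
    """
    return closes_per_second * hold_seconds


def dormant_memory_mb(closes_per_second, hold_seconds, bytes_per_entry):
    """Approximate memory those dormant sessions occupy, in megabytes."""
    count = dormant_connections(closes_per_second, hold_seconds)
    return count * bytes_per_entry / 1_000_000


# Illustrative inputs only: 500 closes/s, a 60-second hold, 200 bytes each.
count = dormant_connections(500, 60)
memory = dormant_memory_mb(500, 60, 200)
```

With these assumed inputs the server carries 30,000 dormant sessions at any moment; a longer hold time scales that count proportionally.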
Sometimes you can alleviate this by tweaking the protocol. Enabling HTTP 1.1 persistent connections, with or without pipelining, for example, lets the client reuse a single session for all requests or fetches. Rather than opening 20 sessions for all the embedded objects, it opens just one, so it consumes fewer resources on the server. HTTP 1.1 also lets the client (rather than the server, as with HTTP 1.0) close the session. Or you can instead institute a proxy agent that reuses sessions on behalf of the clients. This is a common approach with Web interfaces to an IMAP server, for instance: The Web interface requires each message-access operation, such as deleting a message, to establish a new IMAP session. A popular solution to this problem is to implement an IMAP proxy that multiplexes the sessions on behalf of the Web interface. That limits the load on the real e-mail server.
Although most networked applications use TCP, a handful of high-profile applications, including VoIP, use UDP (User Datagram Protocol). UDP doesn't provide the reliability and flow-control that TCP does, so it's a popular protocol for streaming media signals, such as video and voice, where some small amount of packet loss is better than forcing the datastream to stop and resynchronize every time there's a minor hiccup in the network.
But UDP can cause other problems. Because UDP doesn't adjust the traffic rate according to the loss and delay characteristics of the underlying network, UDP traffic won't slow down in the face of congestion. Meanwhile, TCP sessions sharing the same pipe as UDP will slow down whenever packet loss is detected. That means TCP sessions will shrink to accommodate UDP. That's no big deal if all you care about is streaming media, but if you're trying to run mission-critical applications over the same network as your multimedia flows, knocking your TCP sessions off the network is not acceptable.
There's no easy solution to the UDP-TCP performance dilemma. You can force the UDP application endpoints into using a lower frame rate or less bandwidth, which gives your TCP applications a fighting chance. The long-term solution is to make sure you have enough bandwidth to accommodate all your traffic.
Determining the appropriate server architecture, meanwhile, is a major piece of the performance-planning puzzle. Servers can be designed as multiprocessor systems, multiserver clusters or distributed hosts in a mesh. Each of these configurations has its own distinct benefits in different environments. Although you can't do much about the architecture a specific vendor endorses, understanding the benefits of its approach will help you decide whether to go with that vendor. Microsoft, for instance, doesn't provide mechanisms for multiple Windows DHCP servers across the network to communicate with one another, so you have to bundle everything into a multi-CPU host or create a Windows server cluster. Bottom line: Your application may drive your server processing architecture.
Multiprocessor systems generally are best for multithreaded applications, because passing threads to a local processor provides the fastest turnaround. However, if a system generates an extremely large number of threads, the overhead can kill the performance benefits. There, a cluster of distinct server hosts with locally contained processes is more scalable.
At the other end of the spectrum, distributing the workload into manageable server domains, such as departmental Web and e-mail, is efficient. The trade-off is higher management costs and more systems. Adding more hosts can also increase the demands on back-end systems, such as a database management system, if there's a lot of contention for it.
Off-loading some of the server processing to an add-on card can help. Network adapters with dedicated TCP processors or SSL add-on cards, for example, can reduce significantly the processing demands on a host or cluster by freeing up tasks the main processor would otherwise have to manage.
Another performance issue with servers is disk capacity. Generally, disk choice is driven by a need for high throughput of very large data sets or fast seek times for random data. This decision may not always be obvious. For example, servers that do nothing but serve a few very large files over relatively slow networks may be better equipped with very large RAM disks and a single disk, and may have no requirement for RAID setups beyond a simple mirror. Applications that benefit the most from a focus on throughput are those that host databases and multimedia, where uninterrupted reads and writes are commonplace and crucial to success.
System uptime is obviously a prerequisite to speed and performance. Several technologies can improve overall fault-tolerance, the simplest being spot solutions such as using a RAID array instead of a single disk drive, which is standard operating practice for most organizations today. Load distribution and load balancing are two other popular options for enhancing server reliability. Load-distribution technologies such as round-robin DNS (Domain Name System) direct each new request to the next host in the target list, without knowing the host's load or even checking if the host is up and running. This approach helps lighten the burden on individual systems, but it doesn't do anything to guarantee system availability.
Load balancing, meanwhile, relies on active performance monitors to measure the load of each server and then redirect queries to the host with the least load. This is important if you have a mix of slow and fast computers and want to distribute the load according to capability, or if you want to avoid directing traffic to a failed system.
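The contrast between the two approaches can be sketched in a few lines. This is an illustration of the selection logic only, not of how any DNS server or load-balancing product is implemented; the host names, load figures and function names are hypothetical.

```python
from itertools import cycle


def round_robin(hosts):
    """Load distribution: hand out hosts in turn, blind to load or health."""
    return cycle(hosts)


def least_loaded(load_by_host, healthy):
    """Load balancing: pick the healthy host reporting the lowest load."""
    candidates = {h: load for h, load in load_by_host.items() if h in healthy}
    if not candidates:
        raise RuntimeError("no healthy hosts available")
    return min(candidates, key=candidates.get)


# Hypothetical hosts; web2 is down and web3 is nearly idle.
hosts = ["web1", "web2", "web3"]
rotation = round_robin(hosts)
blind_picks = [next(rotation) for _ in range(4)]

load = {"web1": 0.70, "web2": 0.05, "web3": 0.20}
smart_pick = least_loaded(load, healthy={"web1", "web3"})
```

The rotation keeps sending every third request to the failed host, while the monitored selection skips it and favors the host with the most spare capacity.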
Operating systems, meanwhile, are beginning to incorporate native clustering solutions. These technologies essentially mimic the load-distribution model, with enhancements to the native operating system, rather than relying on external systems. Windows 2000 and later versions come with clustering, as does the latest Linux kernel, for instance. Although native clustering provides redundancy and fault-tolerance, one potential trade-off is that it's typically designed for servers running physically close together. That can sometimes make deployment across WANs impractical or impossible. If you don't need carbon copies of each system, these clusters are probably overkill and load balancing is a better option.
Be Prepared
Ensuring that your applications, servers and network perform optimally depends primarily on how well you stay on top of your resources. That entails performing a comprehensive audit of your existing systems that takes into consideration future use. After your performance analysis and subsequent buildout come the comprehensive testing and management of the system. Performance planning, including getting to know the underlying technology and business your applications support, can help you avoid major system failures and outages. It pays to be prepared.
Eric A. Hall is president of Network Technology Research Group, a Nashville-based network consultancy, and author of Internet Core Protocols: The Definitive Guide, from O'Reilly & Associates. Write to him at [email protected].
Hot Spots
Getting to the source of your system-performance problems sometimes takes a little investigative work. Start monitoring the usual suspects on the client and server sides during testing. And if you experience any dips in performance when you go operational, check these hot spots:
• User-side applications. Your performance woes may be caused by an underpowered client conducting complex algorithms before the user even queries the application. Or the client may be generating complex response data after the query: A client receiving XML data in response to a query, for example, parses it and uses the data for generating secondary requests. Bottom line, you can't just monitor the application query.
Another problem area may be the client application. If the client application performs multiple transactions, such as DNS lookups and follow-up queries, the rest of the application can suffer from blocking delays. Run tests using typical end-user equipment to expose these problems before you roll out your app.
• User-access segment. If the client isn't on the same segment as the application servers, the user network connection will likely cause trouble. In particular, traffic from a high-speed LAN to a slow WAN link typically gets congested by excessive retransmissions as the fat LAN pipe tries to squeeze data through the thin WAN pipe.
An emerging problem is retransmission with inline VPNs. When the host-generated packets are too large for the encrypted channel, the host has to retransmit the original data using smaller packets.
Increasing bandwidth and frame-rate demands exacerbate both of these access problems. The only fix is to change the characteristics of the network--by throwing more bandwidth at the problem, for instance--or the application, by using a lighter-weight encoding algorithm with lower frame-rate utilization or one that uses less bandwidth. Either way, the trade-off is a decrease in the quality of your voice and video traffic.
• Network-access equipment on the server segment. Although user-side devices are likely to drop some traffic at the WAN boundary, network-access equipment on the server side can drop a lot more if the network isn't tightly managed.
For example, a VPN or SSL concentrator on the server side of the network usually exhibits performance problems long before the end user's equipment starts to hiccup, while a router handling transmission flows for a few thousand remote users has major queue-management demands and can get clogged with traffic (unless you increase your available WAN bandwidth).
• Server. That's where most IT pros look first when performance degrades. Many server functions--excessive task switching, database performance, disk contention and disk swapping--can cause problems.
How To Implement Performance Management
1. Analyze your requirements -- This seems obvious enough, but expect the unexpected. Aside from technical requirements, such as processor and disk utilization, it's easy to make unrealistic assumptions about the size of your user community. Make sure you communicate with your marketing staff to get projected usage numbers for an online promotion, for instance, or with your CFO's office staff when they launch a new billing system to outside contractors.
2. Build--or better yet, overbuild--the system -- Don't just build your system--from the application endpoints to the hosts and network--to meet minimum requirements. The architecture should take both planned and unexpected demand into consideration. And leave some headroom for the natural inefficiencies of underlying technologies such as TCP (dropping packets after a spike and then retransmitting them). This may entail splitting the application into pieces for the database, authentication and logging functions, for instance.
3. Test it -- Testing reveals the flaws in your initial performance analysis, highlights any blatant problems with the technologies you have chosen and provides an opportunity to address the deficiencies. But beware: If you don't find many problems, either your testing or your initial analysis is probably flawed.
4. Deploy it -- Obscure flaws in your initial analysis and the underlying technology typically arise in the deployment stage, as do problems with your testing methodology. Be ready to deal with these issues immediately. Too many organizations push out a system prematurely and then send the development team on holiday or on to another project.
Even after your system has gone live, you'll still need to monitor it. Think of the deployment/operational phase as live testing with real users banging on the system and giving you vital performance data. You'll find when you go operational that some assumptions you made during the initial analysis and testing were inaccurate. Be prepared to deal with that fallout.
There are two main rules of deployment. First, your original system-development team should be an integral part of the initial support team. That way, the hands-on experts are available to quickly address problems that crop up. It's almost always cheaper and faster to have the original development group fix problems than it is to hire hot-shot repair specialists who have to learn the entire system. Also, with the original experts performing initial monitoring and analysis, you can often detect problems before they occur.
Second, schedule your deployments for slow times, but avoid doing installations immediately before or during holidays. That may seem obvious, but unfortunately the practice is alive and well in some organizations. Your IT group would surely not appreciate being dragged away from Thanksgiving dinner to fix a problem that could have been caught the week before or after.
And perform the same type of monitoring in deployment that you performed during testing, scrutinizing resource utilization and contention levels. In some cases, it might expose a critical weakness in the system that went undetected during testing. You may need to roll back to the previous software release of your server or network device to fix any bugs or performance flaws you find during deployment. Be prepared to yank the rollout and retrench if things start to go south.
• Computer Measurement Group www.cmg.org. A nonprofit professional association that conducts research on topics such as queuing theory, which helps IT develop strategies for managing high-volume systems. CMG holds annual conferences and publishes member-provided papers and newsletters.
• The Cooperative Association for Internet Data Analysis www.caida.org. A nonprofit organization that provides reports and tools on performance management in the Internet and in private IP networks.
• Microsoft "Duwamish Online" papers www.msdn.microsoft.com. Although these papers focus on Microsoft-related implementation issues, several of them apply to almost any performance-planning project.
• Product-specific implementation guides. Most software vendors provide planning guides and white papers for their high-end database, groupware, Web and other server products. IBM, Microsoft and Novell, for instance, all have numerous deployment guides for their high-end products.
IBM www.ibm.com; Microsoft www.microsoft.com; Novell
Centralized
Pros: Server-to-server traffic can take advantage of high-bandwidth, low-latency local connections. Application logic and other components can be segmented, and system maintenance and management can be simplified. This approach can use clusters and load balancing as well.
Cons: Application performance is susceptible to hiccups, and traffic management can be costly and difficult.
Best For: Organizationwide hosted applications and integrated applications.
Distributed
Pros: Local traffic has high-bandwidth, low-latency access. Server-to-server traffic can be limited to only mandatory data, which lowers WAN costs.
Cons: All application data has to be either partitioned or replicated across sites, which isn't always feasible. System outages in this architecture can be killers.
Best For: Latency- or bandwidth-sensitive applications in which data has local relevance, such as groupware, configuration management and departmental applications.
Hybrid
Pros: Enterprisewide or infrequently used data is centrally managed, and application- or site-specific traffic stays local.
Cons: Highly susceptible to network outages, so it requires redundant WAN links or data replication schemes.
Best For: Applications that use replication as an integral design feature, such as daily batch transfers; applications that rely heavily on computational power; and graphics, distributed directories and batch-oriented applications.