Playbook: Staying One Step Ahead of Performance

Be on the lookout for unusual resource utilization during the testing phase. Say you add a set of test clients and the test shows an unexpected flatlining of processor use. That may mean that a limitation in the network's bandwidth or frame rate, or in one of the back-end components, is preventing the server from processing the additional requests efficiently.
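
As a rough illustration, here's a minimal Python sketch of that kind of ramp test, assuming psutil is available on the server under test; start_clients() and stop_clients() are hypothetical hooks into whatever load-testing tool you use.

```python
# Ramp up test clients and check whether server CPU scales with the load.
import psutil

def start_clients(n):
    """Hypothetical hook: begin driving n test clients against the server."""
    raise NotImplementedError("wire this up to your load-testing tool")

def stop_clients():
    """Hypothetical hook: stop the clients and return the requests/sec they achieved."""
    raise NotImplementedError("wire this up to your load-testing tool")

def ramp_and_watch(steps=(10, 20, 40, 80), window=30):
    prev_cpu = prev_rps = 0.0
    for n in steps:
        start_clients(n)
        cpu = psutil.cpu_percent(interval=window)  # average CPU while the load runs
        rps = stop_clients()
        # Doubling the clients should move CPU and throughput; if neither budges,
        # suspect the network or a back-end component rather than the server itself.
        flatlined = prev_rps and rps < prev_rps * 1.1 and abs(cpu - prev_cpu) < 5
        print(f"{n:>3} clients: {rps:6.0f} req/s, {cpu:4.0f}% CPU"
              + ("  <-- CPU flatlined, look upstream" if flatlined else ""))
        prev_cpu, prev_rps = cpu, rps
```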

The rule of thumb is that no subsystem should operate at more than 75 percent of its capacity for a sustained period. (Add more resources if any piece of your system is operating at that level of contention or higher.) Even the 75 percent rate may be too high if there is significant contention for a particular resource, such as the network. TCP, for example, has built-in congestion-avoidance algorithms that kick in whenever a single packet is dropped, so excessive retransmissions can appear at extremely low levels of utilization. The solution is to monitor your network and make the necessary tweaks until the retransmissions are eliminated, then add at least another 25 percent of capacity to allow for spikes. Proper testing will reveal the appropriate thresholds for your system.
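
As a back-of-the-envelope check, the Python sketch below applies the 75 percent ceiling and reads the Linux kernel's TCP counters from /proc/net/snmp to compute a retransmission ratio; the thresholds are illustrative, not prescriptive.

```python
# Check sustained utilization against the 75 percent rule of thumb and
# look for TCP retransmissions in the kernel counters (Linux only).
HEADROOM_CEILING = 75.0  # sustained utilization above this means: add capacity

def tcp_retransmit_ratio():
    """Return retransmitted segments as a fraction of all TCP segments sent."""
    with open("/proc/net/snmp") as f:
        tcp_lines = [line.split() for line in f if line.startswith("Tcp:")]
    header, values = tcp_lines[0], tcp_lines[1]
    stats = dict(zip(header[1:], (int(v) for v in values[1:])))
    return stats["RetransSegs"] / max(stats["OutSegs"], 1)

def check(sustained_utilization_pct):
    if sustained_utilization_pct > HEADROOM_CEILING:
        print(f"{sustained_utilization_pct:.0f}% sustained utilization: add capacity")
    ratio = tcp_retransmit_ratio()
    if ratio > 0.01:  # even ~1% retransmissions can throttle TCP throughput
        print(f"retransmission ratio {ratio:.2%}: tune the network before adding load")
```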

Meantime, don't be surprised by short-term spikes in utilization. Applications typically make full use of whatever CPU time or network resources are available, so your main concern should be sustained utilization. Temporary spikes are a problem only if they become common or expose weaknesses in your overall system design, such as when your network temporarily jumps to 100 percent usage and starves your other applications.
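
One way to make that distinction concrete is a rolling window over utilization samples, as in this Python sketch; the 60-sample window and 75 percent ceiling are illustrative values.

```python
# Flag sustained high utilization while letting short spikes pass.
from collections import deque

class SustainedLoadDetector:
    def __init__(self, window=60, ceiling=75.0):
        self.samples = deque(maxlen=window)  # e.g. 60 one-second samples
        self.ceiling = ceiling

    def add(self, utilization_pct):
        self.samples.append(utilization_pct)
        window_full = len(self.samples) == self.samples.maxlen
        # A momentary 100 percent reading isn't alarming; a full window
        # averaging above the ceiling is.
        if window_full and sum(self.samples) / len(self.samples) > self.ceiling:
            return "sustained"
        return "spike" if utilization_pct > self.ceiling else "normal"
```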

Finally, make sure you conduct simple validation tests of things like software versions. Two servers from the same manufacturer may be running different software or firmware on an embedded component, which means they can each exhibit very different performance or utilization rates. It's best to have configuration and change-management tools in place that detect these differences so you can avoid running resource-hungry validation tests.
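
A validation pass like that can be as simple as the Python sketch below, which compares the versions each server reports and flags outliers; get_inventory() is a hypothetical hook into whatever configuration-management or out-of-band management interface you already have.

```python
# Compare software/firmware versions across servers and flag any that drift.
from collections import Counter

def get_inventory(host):
    """Hypothetical hook: return e.g. {'os': ..., 'nic_firmware': ..., 'raid_firmware': ...}."""
    raise NotImplementedError("wire this up to your config-management tool")

def report_drift(hosts):
    inventories = {h: get_inventory(h) for h in hosts}
    keys = set().union(*(inv.keys() for inv in inventories.values()))
    for key in sorted(keys):
        versions = Counter(inv.get(key) for inv in inventories.values())
        if len(versions) > 1:
            expected, _ = versions.most_common(1)[0]
            outliers = [h for h, inv in inventories.items() if inv.get(key) != expected]
            print(f"{key}: most servers run {expected}, but {outliers} differ")
```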

Most networked applications today, of course, use TCP for their underlying transport service. Although TCP is reliable and capable of very high levels of throughput when properly tuned, it's also highly sensitive to packet loss and timing delays. Unfortunately, most complex network topologies for large-scale applications suffer from both packet loss and timing delays, so applications don't get optimal TCP performance.
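
To see just how sensitive TCP is, the widely cited Mathis et al. approximation bounds steady-state throughput at roughly MSS x C / (RTT x sqrt(loss)), with C around 1.22. The Python sketch below plugs in illustrative numbers for a clean LAN path versus a lossy WAN path.

```python
# Approximate the TCP throughput ceiling for a given RTT and loss rate.
import math

def tcp_throughput_bps(mss_bytes=1460, rtt_s=0.050, loss_rate=0.001, c=1.22):
    """Mathis-style estimate of achievable throughput, in bits per second."""
    return (mss_bytes * 8 * c) / (rtt_s * math.sqrt(loss_rate))

# Even a fraction of a percent of loss on a 50 ms path caps a single
# connection far below what the raw link could carry.
print(f"{tcp_throughput_bps(rtt_s=0.001, loss_rate=0.0001) / 1e6:8.1f} Mbps  (1 ms RTT, 0.01% loss)")
print(f"{tcp_throughput_bps(rtt_s=0.050, loss_rate=0.001) / 1e6:8.1f} Mbps  (50 ms RTT, 0.1% loss)")
```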

There are several ways to resolve this. You can optimize the TCP stacks on each of the network nodes, smooth out the network, or move your servers (or their proxies) closer to the users. Most organizations choose the latter two strategies, which are the easiest ways to remedy TCP performance problems. Most loss and delay problems occur at boundary points between high- and low-capacity networks, such as a WAN connection between two offices. If multiple users are running bursty applications across a WAN, some packets will be delayed or dropped when there's an overload. You can increase the queue size of the junction router so it caches packets rather than dropping them during spikes, eliminating the need for retransmissions.
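
For the queue-size adjustment, the classic rule of thumb is to buffer roughly one bandwidth-delay product at the junction; the Python sketch below works that out for an illustrative WAN hop. Keep in mind that a deeper queue trades drops for added queuing delay, so testing should confirm the right size for your traffic.

```python
# Size a router queue to roughly one bandwidth-delay product (BDP).
def queue_size_packets(link_bps, rtt_s, avg_packet_bytes=1500):
    """Packets the junction router should hold to absorb a burst without drops."""
    bdp_bytes = (link_bps / 8) * rtt_s  # bandwidth-delay product in bytes
    return int(bdp_bytes / avg_packet_bytes)

# Example: a 100 Mbps WAN link with a 40 ms round-trip time.
print(queue_size_packets(100_000_000, 0.040), "packets of buffering to ride out a spike")
```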