|
Q: Our network infrastructure supports a major banking system and consists of more than 150 routers supporting more than 300 segments and thousands of nodes. The majority of the segments are Ethernet with a few Token-Ring segments in the data center and an FDDI backbone. Both IP and IPX are routed across the entire infrastructure. The FDDI backbone interconnects several buildings in downtown Houston, forming a core campus network. Our remote WAN sites are connected with full or fractional T1 leased lines.
In a nutshell, our biggest problem is response time. At two or three particular sites, response time is noticeably slower, despite a full T1 circuit. Analysis of our router queues, as well as of traffic to and from the remote site, shows plenty of bandwidth available on the T1.
We especially notice a degradation in response time when running an automated data collection application (running over IPX) that retrieves data from the remote sites. In particular, collecting data from our Midland, Texas, site takes much longer than the others, considering the amount of data involved. The application is a client/server-type application that doesn't benefit from the IPX packet burst protocol suitable for file transfer. Help!
Bill:
Before we even unpacked our analyzers, our first step was to check out the network documentation to get the big picture.
Scott:
We checked out the documentation that an engineer had scrawled on the back of a napkin; no, wait a second, it was on a whiteboard with "DO NOT ERASE" written across the top; no, it was real documentation on "E" sized paper!
Bill:
Not only that, but it seemed up to date, so we felt we were off to a good start.
Scott:
Analyzing a trace of the IPX-based application in question, taken at the downtown site, gave us our first clue. There was a delay of approximately 60 milliseconds between a packet sent to the remote site requesting data and the first packet returned.
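The measurement described here, the gap between a request and the first reply, can be sketched in a few lines. This is only an illustration: the `Frame` record, the station names and the timestamps are hypothetical stand-ins for whatever an analyzer exports, not data from the actual trace.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    timestamp: float  # seconds since the start of the capture
    src: str          # sending station (hypothetical names)
    dst: str          # receiving station


def first_response_delays(frames, client, server):
    """Return the delay in ms from each request to the first reply after it."""
    delays = []
    pending = None  # timestamp of the most recent unanswered request
    for frame in frames:
        if frame.src == client and frame.dst == server:
            pending = frame.timestamp
        elif frame.src == server and frame.dst == client and pending is not None:
            delays.append((frame.timestamp - pending) * 1000.0)
            pending = None  # later reply frames belong to the same answer
    return delays


# Hypothetical capture: two request/response exchanges.
trace = [
    Frame(0.000, "client", "server"),
    Frame(0.060, "server", "client"),
    Frame(0.061, "server", "client"),  # second reply frame, ignored
    Frame(0.100, "client", "server"),
    Frame(0.161, "server", "client"),
]

delays = first_response_delays(trace, "client", "server")
print([round(d, 1) for d in delays])
```

Only the first reply frame after each request is counted, so a multi-frame answer does not skew the figure.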
Bill:
Further response time analysis using a different protocol (ping) with several packets showed a consistent delay, even when the T1 was lightly loaded.
Scott:
According to the network documentation, we were on the same LAN segment as the router connected to the T1 going to the remote router, and there was only one segment at the remote site.
Bill:
Our local segment was error free with little traffic, so more than likely, something was going on with the routers, the T1, the remote segment or the remote server.
Scott:
Another interesting piece of information on the network documentation was a backup link that paralleled the T1.
Bill:
This could be similar to a problem we described in one of our columns where the backup circuit was taking packets when the primary circuit got busy. This caused packets to arrive out of order, and connections subsequently dropped.
Scott:
In our current situation, however, we noted that there were no transport retransmissions, dropped sessions or attempted reconnects of any sort. It appeared that we had a different problem on our hands.
Bill:
We tried a little experiment.
Scott:
Breaking out the liquid nitrogen, we decided to see if we could freeze the Cat 5 wire, thus lowering its resistance and decreasing the bit delay.
Bill:
Unfortunately, we dropped the wire after pulling it from the liquid nitrogen, shattering it into a thousand little pieces.
Scott:
Seriously, we decided to drop the T1 line and analyze the response time across the backup circuit.
Bill:
The backup circuit was a fractional T1, with a data rate of 112 Kbps. We called the carrier to see what it could tell us about both circuits. Soon after, several loop-back tests were conducted on the primary T1 at various points in the circuit, with no errors detected.
Scott:
Two days later (talk about latency), the carrier got back to us and noted that the T1 circuit was routed "differently" than the backup circuit.
Bill:
How different was it?
Scott:
It turned out the backup circuit only had one or two circuit hops from Houston to Midland while...
Bill:
...the primary circuit went from Houston to Birmingham, Ala., to Atlanta, to Kansas City, to Denver, to Albuquerque, N.M., then back to Midland.
Scott:
The carrier told us that it selected this route to provide redundancy for another T1 circuit to Midland.
Bill:
Unfortunately, the primary circuit ended up on a rather outrageous path, adding a large amount of propagation delay in both directions.
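To get a feel for how much delay a detour like this can add, here is a rough back-of-the-envelope sketch. Everything in it is an assumption: the city coordinates are approximate, great-circle distances stand in for the real cable paths (which are longer), the signal is assumed to travel at about 200,000 km/s as in fiber, and the "direct" route is a hypothetical straight Houston-to-Midland path, not the actual backup circuit. Equipment delay at each hop is ignored.

```python
import math

EARTH_RADIUS_KM = 6371.0
SIGNAL_KM_PER_S = 200_000.0  # assumption: roughly two-thirds the speed of light

# Approximate (latitude, longitude) in degrees; illustrative values only.
CITIES = {
    "Houston": (29.76, -95.37),
    "Birmingham": (33.52, -86.80),
    "Atlanta": (33.75, -84.39),
    "Kansas City": (39.10, -94.58),
    "Denver": (39.74, -104.99),
    "Albuquerque": (35.08, -106.65),
    "Midland": (32.00, -102.08),
}


def great_circle_km(a, b):
    """Haversine distance between two named cities."""
    lat1, lon1 = map(math.radians, CITIES[a])
    lat2, lon2 = map(math.radians, CITIES[b])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def route_km(route):
    """Total length of a route given as an ordered list of city names."""
    return sum(great_circle_km(a, b) for a, b in zip(route, route[1:]))


def round_trip_ms(km):
    """Propagation delay out and back, in milliseconds."""
    return 2 * km / SIGNAL_KM_PER_S * 1000.0


primary = ["Houston", "Birmingham", "Atlanta", "Kansas City",
           "Denver", "Albuquerque", "Midland"]
direct = ["Houston", "Midland"]  # hypothetical straight-line path

for name, route in (("primary", primary), ("direct", direct)):
    km = route_km(route)
    print(f"{name}: {km:.0f} km, about {round_trip_ms(km):.1f} ms round trip")
```

Under these assumptions the detour comes out at several times the straight-line distance, with round-trip propagation delay in the tens of milliseconds rather than single digits.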
Scott:
The original 60-ms round-trip delay was within the carrier's spec and usable for voice, but it was unacceptable for our client's application.
Bill:
After this experience, the carrier found a far more direct primary circuit for our customer. The round-trip delay was reduced to 20 ms.
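Why does latency matter so much when the T1 has bandwidth to spare? An application that waits for each reply before sending the next request can move only one reply per round trip. The sketch below models that under stated assumptions: the 512-byte reply size and the 1 MB collection size are hypothetical, the T1 payload rate is taken as 1.536 Mbps, and the measured round-trip figures are treated as pure latency with serialization time added on top.

```python
import math

T1_PAYLOAD_BPS = 1_536_000  # 24 channels x 64 Kbps


def exchange_seconds(payload_bytes, rtt_s, line_bps=T1_PAYLOAD_BPS):
    """Time for one request/response exchange: latency plus serialization."""
    return rtt_s + payload_bytes * 8 / line_bps


def effective_bps(payload_bytes, rtt_s, line_bps=T1_PAYLOAD_BPS):
    """Throughput when only one request is outstanding at a time."""
    return payload_bytes * 8 / exchange_seconds(payload_bytes, rtt_s, line_bps)


def transfer_seconds(total_bytes, payload_bytes, rtt_s, line_bps=T1_PAYLOAD_BPS):
    """Total time to collect total_bytes, one reply per round trip."""
    exchanges = math.ceil(total_bytes / payload_bytes)
    return exchanges * exchange_seconds(payload_bytes, rtt_s, line_bps)


PAYLOAD = 512        # hypothetical reply size in bytes
TOTAL = 1_000_000    # hypothetical amount of data to collect

for rtt_ms in (60, 20):
    rtt = rtt_ms / 1000.0
    rate = effective_bps(PAYLOAD, rtt)
    print(f"{rtt_ms} ms round trip: {rate / 1000:.0f} Kbps effective, "
          f"{rate / T1_PAYLOAD_BPS:.1%} of the T1, "
          f"{transfer_seconds(TOTAL, PAYLOAD, rtt):.0f} s to collect")
```

In this model the circuit sits mostly idle at either delay, which fits the observation that the router queues and the T1 showed plenty of spare bandwidth while the application still crawled.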
Scott:
Proving, once again, faster is not always better.
Bill:
And if not, you need to analyze to see why.
Bill and Scott can be reached at otw@pmg.com. Portions of trace files from selected columns are available via Pine Mountain Group's Home Page (http://www.pmg.com).
|