Network Computing


ON THE WIRE

Big Bad Watchdog Bites Network

by Bill Alderson and J. Scott Haugdahl

Nearly every protocol in networking has some form of "keep alive" or polling mechanism during periods of inactivity between a host and workstation or a server and client. These protocols are designed to drop connections and free resources whenever a workstation drops off the net. Unfortunately, if the protocol doesn't work properly, workstations can drop off the network for no apparent reason.

Scott: Such was the case during our recent investigation of a customer's campus local area network where users' NetWare connections were constantly being dropped by their file servers.

Bill: The problem appeared to be random and unrelated to any significant event.

Scott: Trying to troubleshoot a random problem with no consistent repeatability is a network manager's worst nightmare.

Bill: Some users appeared to lose their connection more than others. Some would be fine for a few weeks before having the problem again. Others wouldn't report the problem, since having to reboot your machine a few times a day while running Windows became an accepted practice.

Scott: What? Having to reboot Windows? We never have to do that, right?

Bill: This problem had been going on for several months, as previous attempts to solve the problem had been unsuccessful.

Scott: One such previous solution was to segment the LAN further using a switch, thinking that it was a load-related problem.

Bill: I guess it's time for a little forensic analysis, right Dr. Quincy?

Scott: Right, Sam.

Bill: After observing the network for several hours and not seeing the usual "bad things" that we see in networks, we decided to focus on dropped connections due to the NetWare watchdog process.

Scott: The watchdog process begins whenever a server hasn't seen a packet from a workstation in a given amount of time. A server will send a keep alive packet to a workstation it hasn't heard from in five minutes (by default). The workstation then replies with a keep alive response packet back to the server.

Bill: If the server sees this packet, then it won't bother the workstation again for another five minutes. But if it doesn't see the packet...

Scott: ...then it will try again in one minute. This process continues until a reply is seen or a fixed number of keep alives are sent with no replies (nine more by default). At that point, the server will drop the connection.
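
The timing described above can be sketched as a small schedule calculation. This is only an illustration using the rounded defaults quoted in the column (five minutes, one minute, nine more packets); the names WatchdogConfig, watchdog_send_times and run_watchdog are hypothetical, and the moment of the final drop (one retry interval after the last unanswered packet) is an assumption, not something the trace confirmed.

```python
from dataclasses import dataclass


@dataclass
class WatchdogConfig:
    first_delay_min: int = 5     # idle time before the first keep alive
    retry_interval_min: int = 1  # wait between unanswered keep alives
    extra_packets: int = 9       # further keep alives after the first one


def watchdog_send_times(cfg):
    """Minutes of idle time at which the server sends each keep alive."""
    return [cfg.first_delay_min + i * cfg.retry_interval_min
            for i in range(cfg.extra_packets + 1)]


def run_watchdog(cfg, reply_seen):
    """Walk the schedule for one idle connection.

    reply_seen(attempt) reports whether the server actually received the
    reply to that attempt (0-based). A reply that is sent but never
    buffered by the server counts as not seen.
    """
    send_times = watchdog_send_times(cfg)
    for attempt, minute in enumerate(send_times):
        if reply_seen(attempt):
            return ("alive", minute)
    # Assumption: the drop happens one retry interval after the last packet.
    return ("dropped", send_times[-1] + cfg.retry_interval_min)


if __name__ == "__main__":
    cfg = WatchdogConfig()
    print(watchdog_send_times(cfg))
    print(run_watchdog(cfg, lambda attempt: False))
```

With these figures, a workstation whose replies are never seen is dropped roughly a quarter of an hour after its last real traffic.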

Scott: It wasn't known for sure if the server always dropped a connection due to the watchdog process. For instance, changing the watchdog parameters didn't seem to have much effect on the problem. The user also showed us a trace where the workstation did reply, and the reply reached the server segment properly, so how could it have been a watchdog problem? Could it have been a NetWare bug?

Bill: We proceeded to capture packets, watched the server console for timeouts, and contacted users whenever the server reported a watchdog timeout.

Scott: By correlating theory, actual packet transactions and user feedback, we found the answer.

Bill: One thing that became clear was that on servers with hundreds of connections, users who had not accessed those servers for an extended period of time (such as when accessing local data, doing terminal emulation or even going to lunch) would get the boot.

Scott: Normally this shouldn't happen, because of the watchdog process.

Bill: Another observation, seen early in our analysis, was occasional receiver congestion being reported by servers on the 16-Mbps server ring.

Scott: Normally, a few receiver congestion errors now and then are not a big deal, and we usually just blow them off. After all, the server ring was extremely busy, and there just weren't enough receiver congestion reports to be overly concerned about.

Bill: But we knew that a receiver congestion error meant a server was unable to receive a packet because of a full receive buffer.

Scott: By plugging our analyzer into the concentrator downstream from one of the servers, we could see packets addressed to that server. We could then observe when the address recognized bit was set, but not the frame copied bit, indicating receiver congestion.
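
The check on the two bits can be sketched as a decode of the Token-Ring Frame Status byte. The sketch assumes the usual IEEE 802.5 layout, in which the address recognized (A) and frame copied (C) bits each appear twice in the byte (A C r r A C r r); the masks and the function name classify_frame_status are illustrative, and an analyzer's own decode should be treated as the authority.

```python
# Assumed Frame Status layout (IEEE 802.5): A C r r A C r r
A_BITS = 0x88  # address recognized, both copies
C_BITS = 0x44  # frame copied, both copies


def classify_frame_status(fs):
    """Classify a Frame Status byte as seen downstream of the destination."""
    a = fs & A_BITS
    c = fs & C_BITS
    # The two copies of each bit are expected to agree.
    if a not in (0, A_BITS) or c not in (0, C_BITS):
        return "inconsistent"
    if a and c:
        return "copied"
    if a:
        # The station saw its address but could not copy the frame.
        return "recognized, not copied"
    if c:
        return "inconsistent"
    return "not recognized"


if __name__ == "__main__":
    for value in (0xCC, 0x88, 0x00):
        print(hex(value), classify_frame_status(value))
```

A frame classified as "recognized, not copied" is the case the column associates with receiver congestion.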

Bill: As it turned out, there was a high correlation between the frame copied bit not being set on watchdog packets returned from workstations and the server reporting receiver congestion two seconds (by default) later.
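
One way to put a number on that kind of correlation is to match timestamps from the trace. The sketch below assumes two lists of times in seconds, one for watchdog replies seen without the frame copied bit and one for receiver congestion reports, and counts a match when a report follows within a short window. The function name and the 2.5-second window (the two-second default plus some slack) are assumptions; the timestamps in the example are made up.

```python
def fraction_followed_by_report(missed_s, reports_s, window_s=2.5):
    """Fraction of uncopied replies followed by a congestion report.

    missed_s:  times (seconds) of replies seen without frame copied set
    reports_s: times (seconds) of receiver congestion reports
    window_s:  how long after a miss a report still counts as related
    """
    if not missed_s:
        return 0.0
    hits = 0
    for t in missed_s:
        if any(t <= r <= t + window_s for r in reports_s):
            hits += 1
    return hits / len(missed_s)


if __name__ == "__main__":
    # Made-up timestamps, only to show the calculation.
    missed = [10.0, 20.0, 30.0]
    reports = [12.1, 22.4, 40.0]
    print(fraction_followed_by_report(missed, reports))
```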

Scott: The problem was that these watchdog return packets were small and returned from a hundred or so workstations simultaneously, at a rate faster than the server's adapter and driver could process.
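
The effect of a burst arriving faster than the adapter can drain it can be sketched with a simple queue model. Every number here is hypothetical (frame spacing, per-frame processing time and buffer slots were not given in the column), and a real adapter and driver are more complicated. The point is only that a fixed-size receive buffer overflows once replies arrive back-to-back faster than they are serviced.

```python
from collections import deque


def simulate_burst(n_frames, arrival_interval_us, service_time_us, buffer_slots):
    """Count frames that arrive while every receive buffer slot is in use.

    Frames arrive evenly spaced. Each buffered frame occupies a slot until
    the adapter and driver have finished processing it.
    """
    finish_times = deque()  # completion time of each frame still in the buffer
    not_copied = 0
    for i in range(n_frames):
        now = i * arrival_interval_us
        # Free the slots of frames already processed by this time.
        while finish_times and finish_times[0] <= now:
            finish_times.popleft()
        if len(finish_times) >= buffer_slots:
            not_copied += 1  # buffer full: frame is recognized but not copied
            continue
        start = max(now, finish_times[-1]) if finish_times else now
        finish_times.append(start + service_time_us)
    return not_copied


if __name__ == "__main__":
    # Hypothetical figures: 100 replies, adapter slower than the arrival rate.
    print(simulate_burst(100, arrival_interval_us=50,
                         service_time_us=100, buffer_slots=8))
```

When the service time is no longer than the arrival spacing, the model loses nothing, which is consistent with the longer-term fix of an adapter that handles back-to-back small packets more efficiently.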

Bill: It's kind of like loading a bow with 100 arrows and shooting them all at the target at the same time.

Scott: If I were a server, I'd quiver, too. The next correlation, as evidenced by our packet trace, was that whenever a returned watchdog packet couldn't be buffered, the server would begin the one-minute timeout cycle.

Bill: Now and then, a few unlucky workstations never got their replies buffered, even with multiple tries, and the server thought it never heard from them and dropped their connection!

Scott: There are many potential solutions to this problem. To name a few: increase the receive buffer space in the server's adapter (you couldn't do that in this case); increase the comm buffer space in NetWare (it was already at several hundred); get a more powerful adapter in the server; or keep the server happy some other way, so it doesn't have to go through the watchdog process for all those workstations.

Bill: The short term solution was to install a small TSR in each workstation that would periodically (within five minutes) send an unsolicited keep alive packet back to the server. The longer-term solution is to try a different Token-Ring adapter in the server that can process back-to-back small packets more efficiently.
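
The timing logic of such a resident program can be sketched as follows. This is not the TSR itself, which would have been a DOS program; it only shows the idea of sending an unsolicited keep alive before the server's five-minute idle timer expires. The class name KeepAliveSender, the send callback and the 240-second interval are all assumptions chosen for illustration.

```python
class KeepAliveSender:
    """Send an unsolicited keep alive at least every interval_s seconds."""

    def __init__(self, send, interval_s=240.0):
        self.send = send              # hypothetical callback that emits the packet
        self.interval_s = interval_s  # assumed margin under the five-minute default
        self.last_sent = None

    def tick(self, now_s):
        """Call periodically with the current time; returns True if it sent."""
        if self.last_sent is None or now_s - self.last_sent >= self.interval_s:
            self.send()
            self.last_sent = now_s
            return True
        return False


if __name__ == "__main__":
    sent_at = []
    clock = {"now": 0.0}
    sender = KeepAliveSender(lambda: sent_at.append(clock["now"]))
    for second in range(0, 601, 60):  # ten simulated minutes, one tick a minute
        clock["now"] = float(second)
        sender.tick(clock["now"])
    print(sent_at)
```

Because each workstation sends on its own clock, the keep alives are spread out in time rather than arriving at the server as one burst of replies.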

Scott: Of course, when the customer upgrades to NetWare 4.1, they'll have to check for this problem all over again.

Bill and Scott are principals at Pine Mountain Group and can be reached at otw@pmg.com. Portions of the actual trace file from selected columns are available via Pine Mountain Group's Home Page on the Web (http://www.pmg.com).

November 1, 1995






