
02/06/2017 7:00 AM

Troubleshooting: Devising A Testing Methodology

Tony Fortunato describes the steps he took to track down the source of an application performance problem.

Troubleshooting network performance issues is one of the more challenging jobs since there are many variables to consider. Then there are all the ways people measure performance: ping times, jitter, packet loss, throughput, and basic time to complete a task. In this blog, I'll describe the steps I took to help a client troubleshoot a user's complaint about "everything" being slow.

"Everything" turned out to be a human-resources application, but figuring out the "slow" part was a bit trickier. I suggested we go old school, using the stopwatch on my watch, and have the user walk through her current workflow.

The steps were pretty simple:

  • Launch HR application
  • Login
  • Query for employee name
  • Wait for the screen to populate with data and the hourglass to disappear
  • Add notes, make any changes and save

It took 90 seconds to complete this task, with most of the time spent waiting. I asked the user how long it typically takes, and she said about half that. I then wondered if the response time was consistent, which it was. Since consistent problems are easier to document, I had her do the same task five times with these results: 1:30, 1:25, 1:33, 1:32, and 1:31.
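For what it's worth, those five stopwatch readings are easy to sanity-check with a few lines of Python. This is just an illustration of the consistency check, not part of the original workflow; the helper name is made up:

```python
# Convert the five stopwatch readings (mm:ss) to seconds and check
# how consistent they are. Illustrative sketch only.
from statistics import mean, pstdev

readings = ["1:30", "1:25", "1:33", "1:32", "1:31"]

def to_seconds(mmss: str) -> int:
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

secs = [to_seconds(r) for r in readings]
avg = mean(secs)       # 90.2 seconds
spread = pstdev(secs)  # about 2.8 seconds -- very consistent
print(f"avg={avg:.1f}s spread={spread:.1f}s")
```

A spread of under three seconds on a 90-second task is about as consistent as user-driven timing gets, which is what made the problem worth documenting.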

I asked if anyone had an application baseline to use as a reference. No baseline had been performed, but one of the network analysts, Vince, had some emails documenting how long it took to perform similar tasks, which verified that the typical response time was in the 40- to 50-second range. Ideally, I would have preferred some trace files, but hey, this was a good start.

I then asked Vince if anything had changed since last year. He said there had been server upgrades, switch and router replacements, server consolidations, and a new data center brought online.

[Image: analysis (geralt/Pixabay)]

I suggested we start from scratch and build an application baseline, or profile, as part of our troubleshooting. Vince admitted that he had never performed a baseline and thought it would take too much time. I explained that a baseline can be a series of small snapshots and numerical values that can literally take minutes to collect. He then asked what value the baseline has since we have nothing to compare it to. This is a common myth; as we document an issue, something is bound to look strange, which leads us to investigate further.

The first step in the baselining process was removing any unnecessary protocols. This streamlines data analysis and in some cases can resolve the issue. We unbound IPv6, LLDP, and similar link-layer protocols.

I proposed creating a list of the server names or IP addresses that the application uses. There are several ways to collect this data. The first I demonstrated was simple: Make sure no other applications are running, and from the Windows command prompt, type netstat -a -n > netstat_before_HR.txt. Then launch the application, and from the command prompt type netstat -a -n > netstat_after_HR.txt. If you aren't comfortable with the command prompt, another method is to use a GUI utility like NirSoft's CurrPorts.
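The point of the two snapshots is the difference between them: anything that appears only in the "after" file belongs to the application. Here is a minimal Python sketch of that diff; the helper name and the sample netstat lines are my own illustration, not output from the actual client environment:

```python
# Diff two "netstat -a -n" snapshots to isolate the remote endpoints
# the application opened. Hypothetical helper for illustration.

def remote_endpoints(netstat_text: str) -> set:
    """Collect remote address:port pairs from 'netstat -a -n' output."""
    endpoints = set()
    for line in netstat_text.splitlines():
        parts = line.split()
        # Typical connection line: TCP  <local addr>  <remote addr>  <state>
        if len(parts) >= 3 and parts[0] in ("TCP", "UDP"):
            endpoints.add(parts[2])
    return endpoints

# Illustrative snapshots; in practice, read netstat_before_HR.txt and
# netstat_after_HR.txt captured as described above.
before = "TCP    10.0.0.5:1025   10.0.0.9:445    ESTABLISHED"
after = before + "\nTCP    10.0.0.5:1026   10.1.1.7:1433   ESTABLISHED"

new = remote_endpoints(after) - remote_endpoints(before)
print(sorted(new))  # endpoints opened by the application
```

The resulting list of addresses is exactly the server inventory the rest of the testing works from.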

Next, we captured some TCP statistics with the command netstat -s -p tcp, which provides information regarding retransmissions; the output looks something like this:

[Screenshot: netstat -s -p tcp output]

To capture before and after snapshots, we used netstat -s -p tcp > netstat_tcp_before_HR.txt and netstat -s -p tcp > netstat_tcp_after_HR.txt.
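With before and after snapshots, the interesting number is the retransmission delta during the test. A small Python sketch of that comparison follows; the counter label matches the Windows netstat output format, but treat the exact wording and sample values as assumptions:

```python
# Compare the "Segments Retransmitted" counter between before/after
# snapshots of "netstat -s -p tcp". Sample values are illustrative.
import re

def retransmits(stats_text: str) -> int:
    m = re.search(r"Segments Retransmitted\s*=\s*(\d+)", stats_text)
    return int(m.group(1)) if m else 0

before = "Segments Retransmitted    = 120"
after  = "Segments Retransmitted    = 185"

delta = retransmits(after) - retransmits(before)
print(f"Retransmissions during the test: {delta}")
```

A large jump in this counter over a single application run is a strong hint that packet loss, not the application itself, is behind the slowdown.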

With our list of servers, I asked Vince if ICMP is blocked to any of these servers, and he wasn't sure. So I asked him to ping the servers 50 times with a packet size of 1,111 bytes and no fragmentation. Why 50 times? To get more than the four pings Microsoft sends by default. Why 1,111 bytes? To get a packet larger than Microsoft's approximately 74-byte default. Finally, I didn't want any devices along the path fragmenting our ping packet; I would rather have the pings fail if 1,111 bytes is too large. Our ping command looked like this: ping Wilma -l 1111 -n 50 -f > ping_wilma.txt. I added the > ping_wilma.txt to capture the output and write it to a file.
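Once a run like ping_wilma.txt is saved, the loss percentage can be pulled out by counting replies against requests sent. This is a sketch of that tally, assuming the Windows "Reply from ... bytes= ... time=" output format; the helper name and sample lines are made up:

```python
# Summarize a saved Windows ping log: count successful replies against
# the number of echo requests sent to derive the loss percentage.
def ping_loss(log_text: str, sent: int = 50) -> float:
    replies = sum(1 for line in log_text.splitlines()
                  if "Reply from" in line and "bytes=" in line)
    return 100.0 * (sent - replies) / sent

# Illustrative log: 44 replies and 6 timeouts out of 50 pings.
sample = "\n".join(
    ["Reply from 10.0.0.9: bytes=1111 time=12ms TTL=62"] * 44
    + ["Request timed out."] * 6)
print(f"{ping_loss(sample):.0f}% loss")
```

Windows prints its own summary line at the end of each run, but tallying the raw replies this way keeps the method consistent when you collect logs from many servers.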

I know many of you are thinking that ICMP can be treated differently and is unreliable, among other issues.  I'm aware of all those points, but we have to start somewhere. In some cases, ICMP cannot be used as a test tool, but in this case I got lucky.

These names are simplified for obvious reasons, but here are the results of our pings:

Server              Min (ms)   Max (ms)   Avg (ms)   Loss %
DNS – Wilma               12         20         17       11
WEB – Harry               11         13         12        1
WEB – Dirty               11         13         12        1
Microsoft – Fred          13         15         14        0
SQL 1 – Barney            19         65         36       22
SQL 2 – Betty             18         66         35       22
Both of us were surprised that the DNS and SQL servers showed high packet loss. Vince said the server admins should have noticed that. However, even if they ran the same pings, they would get different results depending on where they're located; plus, ICMP might be handled differently on those server networks or on the servers themselves. In other words, this is not a definitive test, but it's worth noting as we move on.
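To make the escalation decision mechanical rather than eyeballed, the table above can be recapped in a few lines of Python. The 1% threshold here is my own illustrative choice, not a number from the engagement:

```python
# Recap of the ping results table: flag any server whose loss exceeds
# a chosen threshold. Data is taken from the table above.
results = {  # server: (min ms, max ms, avg ms, loss %)
    "DNS - Wilma":      (12, 20, 17, 11),
    "WEB - Harry":      (11, 13, 12, 1),
    "WEB - Dirty":      (11, 13, 12, 1),
    "Microsoft - Fred": (13, 15, 14, 0),
    "SQL 1 - Barney":   (19, 65, 36, 22),
    "SQL 2 - Betty":    (18, 66, 35, 22),
}

THRESHOLD = 1  # percent; pick a value appropriate to your environment
suspects = [name for name, (_, _, _, loss) in results.items()
            if loss > THRESHOLD]
print("Investigate:", suspects)
```

Against that threshold, the DNS server and both SQL servers stand out, which matches what jumped out at us from the raw table.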

We then looked at a packet trace and saw the same latencies to the SQL servers, which showed up as retransmission, out-of-order, and other warnings from the packet analyzer. Unfortunately, DNS is UDP-based, so we had to manually inspect those packets to confirm there was packet loss.

Since he's not a server guy, Vince wanted to know what to do next. I advised him to attach the results, along with his testing methodology, to the ticket and pass it back to the help desk. The help desk reassigned the ticket to the respective server groups.

In review, I cannot stress enough the importance of taking measurements and documenting your methodology, tools, and the location of your test points as the framework of your baselines. Application baselining can be more complicated and involved if the troubleshooting scenario requires more testing, which I will describe in future articles. But I would suggest you gather what you need first. As the requirements grow, so will your baseline.


Comments

Lovely troubleshooting

I enjoyed reading this. I have always said that if you got an excellent troubleshooting methodology and asked the RIGHT questions, you can troubleshoot anything. Well done.

Re: Lovely troubleshooting

you are welcome

I totally agree with you!!

and thank you for taking the time to share some feedback

Re: Lovely troubleshooting

thanks for your feedback

Worth a bookmark now

Hi Tony sir, this post is very useful for me right now and in future too, just gonna bookmark this page, thx again for the awesome troubleshooting guide.