• 12/07/2015
    7:00 AM

Troubleshooting Video Applications

Peter Welcher explains how he diagnosed performance problems with a TCP-based video application used at a hospital.

Maybe it’s another sign of experience, aging, cynicism, and/or ego. Lately, I seem to be doing a lot of troubleshooting of application performance problems. “What on earth was the developer thinking?” is becoming a recurrent refrain, at least in my head. This is (I hope) not ego, just a real question about what appears to be a sub-optimal approach.

Here is a case in point that I hope you’ll find entertaining and informative. I worked with a hospital using a medical application that provides steerable video for intensive-care patients. Medical and network diagnoses should both start with the symptoms. Hence:

Symptoms: There were problems such as the video client locking up, loss of ability to steer the camera, and erratic camera steering.

After inquiring about the application, I was told that the application used a TCP-based remote control channel and TCP-based video. This set off some alarm bells in my head.

Technical review

There’s a reason most real-time media, VoIP, and video use UDP. With video and voice, there’s a trade-off between timeliness of the data and reliability. Lost UDP video frames result in those pixelated artifacts on screen. It’s real-time and ugly.

Reliability with TCP requires retransmission of lost packets, and retransmission takes time. That means the application needs a buffer to pre-fetch video frames, allowing seconds to minutes of lead time for retransmissions to catch up. The value of buffering is that play-out is much smoother, and may show artifacts or pauses only for longer-term hiccups in the traffic stream. Without a buffer, you would have to pause the video while waiting for a missing frame, and that would play out in rather choppy fashion. That is why TCP-based video is used mostly for streaming media, such as YouTube and video commercials on web pages, while UDP-based video is usually used for real-time viewing.
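To make the buffering trade-off concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the 25 frames-per-second rate, the arrival times, and the helper name `late_frames` are hypothetical and not taken from the hospital's application. It models one lost packet whose retransmission holds back the frames queued behind it, then counts how many frames miss their play-out time for a small and a large pre-fetch buffer.

```python
def late_frames(arrival_ms, frame_interval_ms, prefetch_ms):
    """Count frames that arrive after their scheduled play-out time."""
    late = 0
    for index, arrived in enumerate(arrival_ms):
        playout = prefetch_ms + index * frame_interval_ms
        if arrived > playout:
            late += 1
    return late


# Assumed trace: frames sent every 40 ms (25 fps) with 20 ms of network delay.
# Frame 3 is lost and its retransmission lands at 440 ms; because TCP delivers
# in order, frames 4 through 10 are held back until then as well.
arrival_ms = [20, 60, 100] + [440] * 8 + [460, 500]

small_buffer_late = late_frames(arrival_ms, frame_interval_ms=40, prefetch_ms=50)
large_buffer_late = late_frames(arrival_ms, frame_interval_ms=40, prefetch_ms=500)
print("late frames with 50 ms pre-fetch:", small_buffer_late)
print("late frames with 500 ms pre-fetch:", large_buffer_late)
```

With these assumed numbers, the 50 ms buffer leaves seven frames late, while the 500 ms buffer absorbs the retransmission entirely. The price is half a second of added delay, which is acceptable for streaming a movie but not for steering a camera.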

There are apparently lots of proprietary ways of streaming video over TCP. If you Google “streaming video protocols,” you’ll find that people have come up with clever ways to leverage the congestion avoidance and flow control of TCP and provide adaptive video quality based on estimated bandwidth available. I just did a quick refresh on the topic (not an area of deep expertise for me) and apparently some of the protocols treat video delivery as a stream of small file transfers. Interesting stuff, but my need to know more is low right now. Just be aware, TCP video may not be one big stream.

Diagnosis continued

Given that the video was TCP-based, I suspected packet loss. There was a WAN involved for some sites, and for the remaining remote sites there was shared LAN media in a fiber-driven loop path through daisy-chained sites (geography and cost-driven design).

There was some QoS in place, somewhat inconsistent in how it was configured, and missing in places.

This was a concern, because any congestion due to micro-bursts of traffic would cause dropped TCP packets and retransmission. Queuing delays might also trigger retransmissions. Retransmissions might make the congestion worse, stepping on newer video to deliver packets that might no longer even be relevant video data (note the emphasis on “might”).

Prescription: Remediate two sites daily until all sites and paths have the proper QoS.

Side note: The hospital in question had a lovely “network weather map,” which often showed rather low utilization. I like having such real-time data readily visible. The challenge is that such data is likely an average over some period of time, and the averaging period is rarely shown. Averaged data does not provide information about “micro-bursts.” I think of IP video as operating in micro-burst fashion: It sends frequent I-frames (think full-screen image in all detail) followed by frames indicating changes to that background. On the receiving end, a garbled picture will likely stay that way until the next I-frame arrives in a second or three.
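A small Python sketch shows how averaging hides micro-bursts. The link speed, the 10 ms slot size, and the traffic pattern are all assumed values chosen for illustration, and `utilization` is a hypothetical helper, not something from the hospital's monitoring tool.

```python
LINK_BPS = 100_000_000                               # assumed 100 Mbps link
SLOT_MS = 10                                         # measure in 10 ms slots
SLOT_CAPACITY_BITS = LINK_BPS * SLOT_MS // 1000      # bits the link carries per slot


def utilization(bits_per_slot, window_slots):
    """Return link utilization (0..1) for each consecutive window of slots."""
    result = []
    for start in range(0, len(bits_per_slot), window_slots):
        window = bits_per_slot[start:start + window_slots]
        result.append(sum(window) / (SLOT_CAPACITY_BITS * len(window)))
    return result


# Assumed one-second trace: an I-frame burst fills two slots at line rate,
# followed by 98 slots of small change frames.
trace = [1_000_000, 1_000_000] + [50_000] * 98

one_second_avg = utilization(trace, 100)[0]
per_slot_peak = max(utilization(trace, 1))
print(f"1-second average: {one_second_avg:.1%}")
print(f"10 ms peak:       {per_slot_peak:.1%}")
```

In this made-up trace the one-second average is under 7 percent, yet the link runs at 100 percent for 20 ms. A weather map polling every few minutes would smooth that out even further.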

Digging deeper

There was another item of concern, one that I’ve been encountering a lot lately. Some of the links were sub-rate links, where the physical media speed was higher than the contracted data rate, and where the carrier was likely policing excess traffic to enforce the contractual data rate -- e.g., Fast Ethernet physical link with 20 or 40 Mbps contracted data rate. This is becoming a very common WAN approach, because the carrier can provision it once, and the customer can use a web portal to adjust the contracted rate upward or downward. More revenue for little effort; providers have to love that!

Sub-rate links are a red flag for me. My analogy is the famous "I Love Lucy" chocolate factory episode, where Lucy cannot keep up with the conveyor belt. Think of the conveyor belt as a 1 Gbps link. Say the contracted rate is 200 Mbps. That means Lucy can keep up with every fifth chocolate position on the conveyor belt. Send more chocolate than that, and the extras end up on the floor (policing).

The solution is for the sending router to shape the traffic, buffering it. In terms of Lucy, take micro-bursts and pace the chocolate transmission, only occupying every fifth time slot or conveyor belt position. Then none of it ends up on the floor (or in Lucy’s hair or mouth).
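The difference between policing and shaping can be sketched with a toy packet-count model in Python. This is only an illustration under stated assumptions: one tick is the time in which the contract allows one packet, the physical link can deliver five packets per tick (the 1 Gbps versus 200 Mbps ratio above), and the burst size and queue limit are made-up values. The function names `police` and `shape` are hypothetical and do not correspond to any router's implementation.

```python
def police(arrivals, rate_per_tick, burst):
    """Token-bucket policer: a packet with no token available is dropped."""
    tokens = burst
    sent = dropped = 0
    for arriving in arrivals:
        tokens = min(burst, tokens + rate_per_tick)  # refill, capped at burst
        passed = min(arriving, tokens)
        tokens -= passed
        sent += passed
        dropped += arriving - passed
    return sent, dropped


def shape(arrivals, rate_per_tick, queue_limit):
    """Shaper: excess packets wait in a queue and leave at the contracted rate."""
    queued = sent = dropped = 0
    tick = 0
    while tick < len(arrivals) or queued > 0:
        arriving = arrivals[tick] if tick < len(arrivals) else 0
        accepted = min(arriving, queue_limit - queued)
        dropped += arriving - accepted               # only dropped if queue is full
        queued += accepted
        released = min(queued, rate_per_tick)
        queued -= released
        sent += released
        tick += 1
    return sent, dropped, tick


# Four micro-bursts of five packets each; the average is exactly one per tick.
bursty = [5, 0, 0, 0, 0] * 4
policed = police(bursty, rate_per_tick=1, burst=2)
shaped = shape(bursty, rate_per_tick=1, queue_limit=10)
print("policed (sent, dropped):", policed)
print("shaped (sent, dropped, ticks):", shaped)
```

Even though the average offered load equals the contracted rate, the policer in this model drops 12 of 20 packets because they arrive in bursts. The shaper delivers all 20 in the same 20 ticks by spacing them out, which is Lucy filling only every fifth position on the belt.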

The challenge lately is cost and technology. When a site gets a 1 Gbps Ethernet physical WAN/MAN link in, the site staff often wants to connect it to a LAN switch, rather than buying a router as well. Unfortunately, if you wish to do traffic shaping to the contracted rate, you need a router.

Prescription #2: Add traffic shaping to sub-rate links, both to and from the main hospital.

In fairness to the unnamed video application vendors, I should note that the hospital in question deployed the application over the WAN, which may not have been an intended use case by the vendor.


Life Saving Packets

Network resources that save lives make a good case for overprovisioning. However, cost constraints will still apply.

To make the best possible case for allocating appropriate network resources, utilization data is key. For instance, 40 GB of downstream data in 30 days averages out to around 55 MB per hour, but 1 GB can be received in a single hour of 720p video (depending on the quality of the video). If these deviations from the average are known, it becomes easier to provision, and to overprovision slightly.
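The arithmetic behind those figures can be checked in a few lines of Python. This sketch assumes decimal units (1 GB = 1000 MB) and takes the 1 GB per hour figure for 720p video from the text; the function and variable names are illustrative.

```python
def average_mb_per_hour(total_gb, days):
    """Average hourly volume in MB, assuming 1 GB = 1000 MB."""
    return total_gb * 1000 / (days * 24)


monthly_avg = average_mb_per_hour(total_gb=40, days=30)   # about 55.6 MB per hour
peak_mb_per_hour = 1000                                   # 1 GB in one hour of 720p
peak_to_average = peak_mb_per_hour / monthly_avg

# Convert the peak hour to a sustained bit rate: MB -> megabits -> per second.
peak_mbps = peak_mb_per_hour * 8 / 3600

print(f"average: {monthly_avg:.1f} MB/hour")
print(f"peak hour is {peak_to_average:.0f}x the average")
print(f"peak hour needs about {peak_mbps:.2f} Mbps sustained")
```

Under these assumptions the busy hour is 18 times the monthly average, so sizing a link from the monthly total alone would understate the need considerably.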