Assuring VoIP Quality: Not There Yet

Voice quality measurements set by the telecom industry are making their way to VoIP networks, but managing quality remains an elusive goal.

March 1, 2005

It's a simple concept that we've all come to accept: VoIP will eventually be the way enterprise users (and the rest of us for that matter) make phone calls. The problem is that most users have a penchant for applications and services that actually work, and VoIP isn't quite there yet. The art of delivering high-quality voice--so well-perfected in the TDM world--remains immature in the IP world.

Vendors know it and for two years have been working on providing the standards and tools necessary to deliver TDM-grade voice quality. Over the next six months, those efforts will start to materialize through a variety of product introductions and enhancements, in endpoint devices and gateways as well as in test and measurement equipment. Much of this new technology will borrow from decades-old work done by the telecom industry.

The bad news is that the resulting data from telco-style voice quality tests is so different from that for our existing IP networks that the industry is just now figuring out how to reconcile the two sets of performance metrics into actionable network management information. This means that managing voice quality, particularly on a heterogeneous IP network, isn't possible today. Enterprise network architects must instead rely on good network design and hope that VoIP trouble spots don't appear. The good news is that help may be on the way. Voice quality management standards are beginning to gain acceptance among IP network and VoIP equipment vendors, moving IP voice a step in the right direction.


Before jumping into the emerging standards for VoIP quality management, it's important to understand how the telecom industry measures quality. Telecom architects long ago realized that they needed a universal way to assess the voice quality of phone calls. For that purpose, they came up with the Mean Opinion Score (MOS). In its original form, MOS called for gathering a diverse group of people and asking them to listen to and rate a number of speech samples played over a phone circuit. Each listener rated the quality on a scale from one to five, with one being unacceptable and five being excellent. A MOS rating of 4.0 is generally considered "toll quality." The ITU Telecommunication Standardization Sector (ITU-T) defines rules for conducting these tests (ITU-T Recommendation P.800) so that when SBC runs a MOS test, it gets results comparable to what British Telecom might find.

Because MOS so directly measures the end product of telecom service, MOS ratings are used and abused extensively. They're abused in the sense that over time, engineers have observed that various objectively measured network parameters (on both TDM and IP networks) have very particular effects on MOS ratings. So rather than gather up 50 or so people and ask them to listen to voice samples every time there's a need to generate a MOS rating, these engineers calculate the likely MOS rating from observed conditions.

One way they do this is through a computational analysis of how known waveforms degrade when played across a voice link. The ITU-T set out to define these variations in the original MOS by way of Recommendation P.800.1. Among the definitions are the MOS Listening Quality Subjective (MOS-LQS) and MOS Listening Quality Objective (MOS-LQO) tests. P.800.1 doesn't actually explain how to run an objective test. Rather, it specifies that if you're running an objective test, you have to label it as such.

Objective measures of voice quality can be either passive or intrusive. If you aren't of a mind to run a subjective test, but you're still willing to grab a voice channel for testing, consider applying some computing horsepower to sound-file analysis. In an intrusive test, a known sound file is placed onto a VoIP or TDM network at point A, then extracted at point B. The two files are then compared, with the difference between them representing the deviation from the ideal. The ITU-T defines this approach in Recommendation P.862, better known as the Perceptual Evaluation of Speech Quality (PESQ) test.
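The intrusive approach can be sketched in a few lines of code. What follows is a crude stand-in for PESQ, assuming the reference file and the signal extracted at point B are available as lists of PCM samples: it skips PESQ's perceptual modeling, time alignment, and level normalization, and simply scores the signal-to-distortion ratio in decibels.

```python
import math

def crude_distortion_score(reference, degraded):
    """Compare a known reference signal against the copy extracted at
    the far end and score the deviation. Real PESQ (ITU-T P.862) does
    perceptual modeling; this sketch only computes a signal-to-
    distortion ratio in dB -- higher means closer to the reference."""
    n = min(len(reference), len(degraded))   # naive alignment: truncate
    signal = sum(s * s for s in reference[:n])
    error = sum((s - d) ** 2 for s, d in zip(reference[:n], degraded[:n]))
    if error == 0:
        return float("inf")                  # identical: no distortion
    return 10 * math.log10(signal / error)

# A 1 kHz test tone at 8 kHz sampling, and a slightly noisy copy of it
ref = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(8000)]
deg = [s + 0.01 * ((i * 37 % 100) / 100 - 0.5) for i, s in enumerate(ref)]
print(round(crude_distortion_score(ref, deg), 1))
```

A real test instrument would then map a distortion measure like this onto the MOS scale through a perceptual model; the mapping, not the raw comparison, is where PESQ's value lies.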

The latest changes to PESQ pay particular attention to the peculiarities of IP networks. For example, since IP networks may lose packets, a resultant speech file may end up slightly compressed compared to the source file. Its sound volume might also be higher or lower than the original file. PESQ calls for the resultant file to be normalized to account for these anomalies.

PESQ is fairly new and still evolving, with recent enhancements made to account for variations in noise cancellation and level matching. It's somewhat computationally intensive and primarily suited to probes and test instruments, though it's also implemented on carrier-side gateways. And while PESQ has largely supplanted subjective testing, MOS is still king, so the ITU-T has defined conversions from PESQ scores to MOS.

PASSIVE AGGRESSIVE

While intrusive tests are certainly a straightforward way to assess voice quality, it's neither practical nor desirable to conduct such tests on an ongoing and widespread basis. Enter ITU-T Recommendation P.563. This recommendation is the result of collaborative work between Opticom, SwissQual, and Psytechnics, three European voice quality software and hardware vendors that, previous to P.563, had their own proprietary versions of passive voice quality analysis software. Released in May 2004, P.563 does extensive waveform analysis and is therefore computationally intensive. It's typically deployed on carrier-side gateways and probes.

Opticom hosts live demos that let you test the quality of any phone connection. When I tested my cell phone, Opticom gave it a MOS rating of 2.06--pretty darn bad. My office phone got a rating of 3.74--almost toll quality. As I learned, these algorithms are good, but not perfect. In my one-off test, I'd guess that a listener on the other end would have rated the quality higher than what Opticom's demo site determined. That isn't unusual for P.563 tests.


While MOS and the newer objective measurement techniques are a great starting place, they alone don't define a user's satisfaction with a particular call--at least not in the incarnations we've described so far. First, these tests are listening-only tests, so factors such as network delay aren't measured. We've all been on calls where the voice quality is fine, but the delay on the network makes it awkward to carry on a conversation. We essentially can't tell if the other party is pausing to think or waiting for us to speak, or if it's just the network introducing delay.

Then there's echo. This phenomenon can be broken down into near-side echo, which is good, and round-trip echo, which is bad. When we talk on our home telephones, we hear ourselves in the receiver. That's because consumer telephone service uses only two wires, so those two wires must carry all sound in both directions of the conversation. We've come to expect this phenomenon, and when we don't hear it, we tend to assume the call has been dropped. Round-trip echo occurs when we hear ourselves repeated from the other end of the call. If the delay in the echo is much over 150 ms, the echo becomes annoying--sometimes to the point where we can't carry on a conversation. Long-distance circuits usually contain echo cancellation technology to avoid round-trip echo. The ITU defines standards for this and for measuring any echo that might actually be heard. Generally, detecting echo with non-intrusive tests is quite difficult, which remains one of the reasons intrusive testing is still widely used.


What's needed in both the TDM and packet world is a method for considering all the parameters that go into a good voice connection so that the voice quality produced by any given network design can be predicted. Those parameters have been gathered together to form what's commonly known as the E-Model, standardized in the ITU-T's G.107 Recommendation. At first glance, the E-Model might seem irrelevant to data network architects. The variables used in its calculations don't generally correlate to anything that's typically measured on a data network. However, each of the variables in the equation can be derived from measured or known quantities. Still, since this is an established ITU specification, its variables are described in terms of voice network conditions such as echo, delay, and noise. Vendors are developing ways to characterize packet networks in terms of the E-Model just as they did with TDM networks.
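To make the E-Model concrete, here's a minimal sketch in Python. It uses the standard R-to-MOS conversion from G.107 Annex B and a commonly cited simplification of the delay impairment (the Cole/Rosenbluth approximation); the constant 93.2 is the default R value with all other G.107 parameters at their defaults. This is illustrative, not a full G.107 implementation--echo and noise terms are omitted.

```python
def simplified_r(one_way_delay_ms, ie_eff):
    """Simplified E-Model rating: R = 93.2 - Id - Ie_eff.
    93.2 is the default R with all other G.107 inputs at their
    defaults; Id is the Cole/Rosenbluth approximation of the delay
    impairment, which kicks in sharply past ~177 ms one-way delay."""
    d = one_way_delay_ms
    delay_impairment = 0.024 * d + (0.11 * (d - 177.3) if d > 177.3 else 0.0)
    return 93.2 - delay_impairment - ie_eff

def r_to_mos(r):
    """Map an R factor to an estimated MOS (G.107 Annex B)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6

# A clean narrowband call tops out near MOS 4.4; 300 ms of one-way
# delay plus loss/codec impairments drags it well below toll quality.
print(round(r_to_mos(simplified_r(0, 0)), 2))      # ~4.41
print(round(r_to_mos(simplified_r(300, 10)), 2))
```

This is exactly the kind of calculation a network architect can run at design time: plug in the expected delay budget and codec impairment, and read off the MOS the design should deliver.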

The E-Model is particularly valuable for both TDM and VoIP networks during their design phase. Telecom equipment vendors have gone to great lengths to characterize the effects of various network elements on the E-Model (and MOS as well). In so doing, they've created a methodology for assessing what kind of voice quality will be delivered based on the design of the network. While this practice is highly mature in the TDM world, it's just now picking up steam in the VoIP market. For instance, each of the encoding methods has been evaluated, and their effect on MOS and the E-Model is well-documented.

FROM SIGNAL TO TEST

Compared to TDM networks, packet networks present a unique problem in that problem conditions tend to be highly transient. When you make a call on a TDM network, the quality you experience in the first few seconds of the call is likely to be the quality throughout the call. As we've all experienced, that's not true for VoIP. So in order to manage VoIP quality, both the quality of the call and the network conditions at the time need to be constantly measured.

Telchemy and Psytechnics are two pioneers in developing computationally lightweight methods for calculating call quality based on data available from VoIP endpoints (such as phones, soft clients, and messaging centers) and gateways (such as IP PBXs and network gateways). Nortel Networks, for instance, has been using Telchemy's software agents in its phones since 2001. These agents use information gathered from the phone's Digital Signal Processing (DSP) chip, as well as from its protocol stack and jitter buffer (regarding packet loss and discards). MOS ratings and the E-Model can be evaluated every few seconds of a call in order to track the user's experience throughout the call. Telchemy is working with DSP and VoIP chipset manufacturers to build such data collection right into the VoIP chipset.
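Here's a sketch of what such an agent computes each reporting interval, assuming only counters a phone's RTP stack and jitter buffer already keep: packets expected, received, and discarded. The loss-to-impairment formula is the Ie-eff equation from G.107; the Bpl (packet-loss robustness) value of 25.1 is an illustrative figure for G.711 with loss concealment, not a vendor-calibrated constant.

```python
def interval_ie_eff(expected, received, discarded, ie=0.0, bpl=25.1):
    """Effective equipment impairment for one reporting interval.
    Jitter-buffer discards count as loss, since those packets never
    reach the decoder. G.107 (random-loss case, BurstR = 1):
        Ie_eff = Ie + (95 - Ie) * Ppl / (Ppl + Bpl)"""
    lost = max(expected - received, 0) + discarded
    ppl = 100.0 * lost / expected if expected else 0.0   # loss percent
    return ie + (95.0 - ie) * ppl / (ppl + bpl)

# A 10-second interval at 50 packets/s: 500 expected, 6 lost in the
# network, 4 more arrived too late and were discarded -- 2% total loss
impairment = interval_ie_eff(expected=500, received=494, discarded=4)
print(round(impairment, 2))
```

Feed each interval's Ie-eff into the E-Model and you get a MOS estimate that tracks the call as conditions change, which is precisely what the transient nature of packet networks demands.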

This data not only tells us the call quality experienced, but also provides information on the network conditions at the endpoint that led to the perceived call quality. For those who have followed data networking management standards for a while, this should sound a lot like the goals of SNMP's RMON MIB. The idea is to allow the endpoint device (or its associated switch port in the RMON world) to monitor its own performance and then report that information back to some centralized device for event correlation. If you tend to view RMON as an abysmal failure, take heart--the similarity (hopefully) starts and ends at the goal of data collection.

Those who championed collection of VoIP data at the endpoint realized that SNMP was too heavy for many devices. So instead of defining an SNMP MIB for the capture of call quality data, they decided to extend the RTP Control Protocol (RTCP), the companion to the Real-time Transport Protocol (RTP) that carries bearer channel data on a VoIP network. Dubbed RTP Control Protocol Extended Reports (RTCP XR) and standardized as RFC 3611, the protocol calls for capturing and reporting some 20 data network and voice quality statistics. Telchemy's CEO is one of the RFC's primary authors.
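RFC 3611's VoIP Metrics Report Block is compact enough to sketch directly. The following Python builds a minimal 36-byte block (block type 7), zeroing or flagging as unavailable the fields a real endpoint would fill in from its DSP and jitter buffer; the field layout follows the RFC, but the helper name and sample values are ours.

```python
import struct

def pack_voip_metrics(ssrc, loss_rate_pct, discard_rate_pct,
                      rtt_ms, end_sys_delay_ms, r_factor, mos_lq, mos_cq):
    """Pack a minimal RTCP XR VoIP Metrics Report Block (RFC 3611,
    block type 7). Loss and discard rates are carried as fraction*256;
    MOS scores as MOS*10; 127 flags a value as unavailable."""
    loss = int(loss_rate_pct / 100 * 256) & 0xFF
    disc = int(discard_rate_pct / 100 * 256) & 0xFF
    return struct.pack(
        "!BBH I BBBB HH HH BBBB BBBB BBH HH",
        7, 0, 8,              # block type, reserved, length in words - 1
        ssrc,                 # SSRC of the stream being reported on
        loss, disc, 0, 0,     # loss rate, discard rate, burst/gap density
        0, 0,                 # burst duration, gap duration (ms)
        rtt_ms, end_sys_delay_ms,
        127, 127, 127, 16,    # signal, noise, RERL unavailable; Gmin = 16
        r_factor, 127,        # R factor; external R factor unavailable
        int(round(mos_lq * 10)), int(round(mos_cq * 10)),
        0, 0, 0,              # RX config, reserved, JB nominal (ms)
        0, 0)                 # JB maximum, JB absolute max (ms)

blk = pack_voip_metrics(0x1234, 2.0, 0.5, 80, 40, 85, 4.1, 3.9)
print(len(blk))  # 36 bytes
```

Note how the block carries both camps' numbers side by side--loss, delay, and jitter-buffer statistics for the data network people, R factor and MOS for the voice people--which is the whole point of the protocol.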

On the voice quality side, the MOS-LQ and MOS-CQ (Conversational Quality) scores are reported, as is the transmission rating ("R") factor of the E-Model described above. Also included are noise level, signal level, and echo return. On the data network side, packet loss, delay, and various jitter buffer statistics are reported. Some of these values are broken down further. For example, delay is measured both for the RTP round trip and for the endpoint itself. Endpoint delay includes such parameters as the depth of the jitter buffer and the codec being used.

WHO'S YOUR MANAGER?

One might think that with the interest in convergence and the concern for voice quality, management framework vendors such as IBM Tivoli, HP, and Computer Associates would be all over event correlation-based statistics like those gathered by RTCP XR. Unfortunately, that's not what's happening so far. In fact, on the enterprise side of the VoIP world, management framework vendors are almost ignoring the network-specific requirements of VoIP, concentrating instead on higher-level functions such as tracking the performance and health of call centers and gateways. There are, however, a number of network-focused management vendors, typically with roots in carrier network management, that are creating tools for managing enterprise networks.

Part of the problem is that RTCP XR has yet to win broad acceptance. Apart from those vendors that have licensed Telchemy-embedded software and more recently Psytechnics software, RTCP XR support is very limited. Cisco Systems, for example, doesn't currently implement RTCP XR in its endpoints.

Since RTCP XR isn't yet pervasive, network management vendors can't rely on it for event notification. To address this, Telchemy is working to create a combination of endpoint implementations and probes that will correlate events based on data gathered with RTCP XR throughout a VoIP implementation. Psytechnics is also creating reference designs for gathering and correlating network and voice quality events. This makes the job for network management vendors extremely easy: All their systems need to do is report the root cause event as diagnosed by either a Telchemy- or Psytechnics-style embedded probe.

The next six to 12 months should mark a turning point for instrumenting VoIP networks. Both Telchemy and Psytechnics have made headway with equipment manufacturers, so finding RTCP XR-compliant equipment should become easier. As endpoints increasingly adopt monitoring standards, network management vendors can get more serious about developing fault isolation and root cause analysis algorithms. At this point, however, building a VoIP network with pervasive voice quality monitoring is highly dependent on the vendors involved. Currently, RTCP XR is the only standardized game in town. But whether it's RTCP XR or something else, pervasive standardized voice quality measurements are needed as VoIP moves toward widespread adoption.

Editor-in-Chief Art Wittmann can be reached at [email protected].
