

Measuring Voice Quality: Listening by the Numbers
May 31, 1999
By David Willis
For as long as telephones have been ringing, vendors have made competing claims about the superior sound quality of their voice networks. Scientists, engineers and academics have long labored to produce a quantifiable measure of sound quality and system usability. After decades of head scratching and head butting, we've finally arrived at international standards that attempt to quantify sound quality. You'll soon start seeing benchmark results in test reports and vendor marketing. But as a buyer, be aware that the benchmark standardization effort is not complete and that existing measurements are limited.
Don't chalk up this slow progress to a lack of effort; quantifying voice service is a thorny, multidimensional problem. By comparison, testing performance on a data network is a breeze. We can all relate to response time, file-transfer rates and measurements of packet-processing throughput. We generally recognize that when one system operates twice as fast as another, it is the superior solution. But it's not so easy with voice.
Everyone's Got an Opinion
Traditionally, voice assessments are made by gathering opinions from a group of participants operating in a sterile test environment. This is the approach we've taken most often in Network Computing's testing of voice products and services. It's an intense and time-consuming process that requires a large sample of people because different people can perceive the same system as having radically different quality. For example, women often seem to hear things differently from men (Venus-Mars interpretations aside). Perceptions also vary based on age and language. Indeed, the same person's scoring can change from test to test and vary based on the expected quality: When a cell phone sounds good, it's mostly because of low expectations.
In the mid-1990s, the ITU-T (International Telecommunication Union, Telecommunication Standardization Sector) completed the P.800 specification, "Methods for Subjective Determination of Transmission Quality," the most recognized methodology for evaluating voice systems. It describes test-lab conditions, the content of audio samples, the scoring procedure and how the data should be analyzed. Most often, P.800 methods are used to create a MOS (Mean Opinion Score), which rates quality on a five-point scale.
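To make the arithmetic concrete, here is a minimal Python sketch of how a MOS might be computed from one panel's votes. It assumes the common convention that 5 is the best rating and 1 the worst; the function name, the panel data and the normal-approximation confidence interval are illustrative assumptions, not anything prescribed by P.800.

```python
from math import sqrt
from statistics import mean, stdev

def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    ratings: integer votes on a five-point scale (assumed 5 = best, 1 = worst).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    mos = mean(ratings)
    # Normal approximation; a small panel would call for a t-distribution.
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width

# Hypothetical panel of eight listeners rating the same call.
panel = [4, 3, 4, 5, 3, 2, 4, 3]
mos, ci = mean_opinion_score(panel)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

The width of the interval is the reason a large panel matters: with only a handful of listeners, two systems whose scores differ by a few tenths of a point may be statistically indistinguishable.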
Because P.800 tests can provide ambiguous results, substantial dissension has arisen among scientists and engineers regarding their use. First, even the specification's authors warn against comparing MOS results taken under different conditions. Second, the language used in P.800 is itself open to subjective interpretation. English-speaking testers usually ask participants to assign the values of "excellent," "good," "fair," "poor" and "bad" to a call. The problem is that the difference between "bad" and "poor" is typically much smaller than the difference between "poor" and "fair," so it's not an evenly distributed scale. Change the vocabulary and the results change. Further, results don't translate between languages. Studies have found that when an Italian says a call is OK, he or she generally means "good," but when an American says OK, he or she generally means something closer to "fair."
But the biggest problem with using the P.800 approach is that it's simply impractical: Vendors tend to measure their systems in ideal, controlled settings, and the real world is neither ideal nor controlled. We've all seen marketers at trade shows breathlessly hawking their voice-over-data miracles as if they'd just introduced nuclear fusion. But when these systems grapple with a real network, all the rules change and quality can be quite different. I've run into a few network managers who can't get users to accept their voice-over-frame relay system, even when a controlled pilot worked quite well.
Service quality varies in every data network. One moment the network can reliably pass a real-time stream, the next moment it can't--especially when no attempt is made to separate traffic into classes and manage it accordingly. Recognizing this, equipment vendors claim their buffer-management and packet-prioritization schemes eliminate the variations. For example, 3Com has been claiming that its NBX virtual PBX system can outperform the competition under degraded network conditions, even when using the same codecs. Yet there is no way anyone is going to use P.800 techniques to prove it.
Training a Tin Ear
Clearly, the solution is to automate the testing process. But even the most advanced telco test sets aren't designed to assess quality the way a human being does. The typical tester may tell you whether a sent tone has been received properly, what the delay is or whether bit errors have occurred. But if a received tone is 10 Hz different from the sent tone, will a user be annoyed? It's difficult to tell.
The ITU-T created the P.861 standard in an attempt to estimate a MOS using quantifiable low-level measurements that can be automated. The ITU-T advocates a technique known as PSQM (Perceptual Speech Quality Measure), developed in Holland by KPN Research. Under certain conditions, PSQM scores correlate fairly closely to MOS.
Yet PSQM isn't a complete solution for use in voice-over-packet networks. By its authors' own admission, P.861's approach doesn't account for several key factors that may critically affect perception--such as cell and packet loss, the clipping effect of bad voice-activity-detection mechanisms or the impact of bit errors. These problems are commonly found in voice-over-frame relay, ATM and IP networks. So when a vendor claims its PSQM scores are better than the competition's, it's not really telling you much.
I don't want to imply that a PSQM score is useless. It certainly can give you some indication as to whether quality is terrible or acceptable. But these scores shouldn't be used to rate one system against another, and it was not the ITU-T's intention for them to facilitate this type of comparison. The primary value of these numbers is for tuning a single system to get optimal quality.
Top of the Sound Charts
The reigning king of voice-over-packet test systems is Hammer Technologies' Hammer VON/VoIP Test System. This suite runs atop the Hammer IT platform and flaunts the most comprehensive range of voice-system quality tests of any device available today. It can generate calls in volume, interact with voice-response systems, verify that DTMF (Dual-Tone MultiFrequency) tones pass properly and provide PSQM scores. If you must assess voice quality and understand the limitations of PSQM, this tops the list.
Germany's OPTICOM is shipping PA&SQM, an MS-DOS-based suite that supports PSQM and PSQM+, in which the vendor claims to address some of the noted limitations of PSQM. And Sage Instruments will soon release its PSQM Voice Quality Assessment option for the 930A and 950 testers, which can give a PSQM score using an artificial voice in a matter of seconds. Sage is reportedly porting the test to a handheld device for easy pass/fail measurements in the field, which will be essential for voice-over-packet installers.
Recognizing the failure of the ITU-T specs, Ameritec has taken a radically different approach with its Voice Over Packet application test suite on its existing (and highly useful) call generators. The software measures dropouts, round-trip delay and signaling errors, in addition to its normal call-loading capabilities. This strategy is not as comprehensive as the PSQM testers, but it tends to produce quantifiable, reproducible and comparable output.
It's What's Inside That Counts
As flawed as they are, automated measurements such as PSQM fill a real need--but it's not in external test equipment. Instead, we need embedded, real-time performance measurement inside next-generation voice-over-packet products. Imagine an RMON standard for voice services, with internal probes generating PSQM-like scores between critical points in the network and issuing alerts when the quality falls below a service-level threshold. The alerts might trigger an automatic failover to the circuit-switched network.
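The threshold-and-alert logic described above might look something like the following Python sketch. Everything here is hypothetical: no such standard exists, the class and parameter names are invented, the scores are made up, and the sketch assumes a MOS-like scale where higher is better. It averages the last few probe results for one path so that a single bad sample doesn't trip the alarm.

```python
from collections import deque

class QualityMonitor:
    """Tracks recent quality scores for one network path (hypothetical design)."""

    def __init__(self, threshold, window=5):
        self.threshold = threshold          # minimum acceptable average score
        self.scores = deque(maxlen=window)  # most recent probe results only
        self.failed_over = False

    def record(self, score):
        """Add one probe result; return True if this sample triggers an alert."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        average = sum(self.scores) / len(self.scores)
        if window_full and average < self.threshold and not self.failed_over:
            # Stand-in for rerouting calls to the circuit-switched network.
            self.failed_over = True
            return True
        return False

# Made-up MOS-like scores from successive probes, higher is better.
monitor = QualityMonitor(threshold=3.5, window=3)
samples = [4.1, 3.9, 3.6, 3.2, 3.0, 2.8]
alerts = [monitor.record(s) for s in samples]
print(alerts)
```

A real product would also need some way to fail back once quality recovers, and a window long enough to ride out brief congestion without masking a sustained problem.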
It will likely be some time before embedded voice-quality measurement emerges. PSQM calculations are highly processor-intensive, requiring powerful DSPs (Digital Signal Processors) and custom ASICs to be practical and affordable. But real-time measurement is in the best interest of the voice-over-packet vendors, who don't want you to blame their products for the subpar quality of your network. And if your job performance is based on how happy users are with the voice network, you'll want it, too.
Send your comments on this column to David Willis at dwillis@nwc.com.