Microsoft Talks Up Speech Server 2004


April 6, 2004

Network Computing

MSS is based on SALT (Speech Application Language Tags) rather than VoiceXML, the current standard for integrating text and speech web application interfaces. Microsoft claims that SALT will simplify the development of interactive applications that share common application code while supporting the differing voice/visual interface requirements of multi-modal user devices.

Like other disruptive technology announcements, this one will cause enterprise organizations to think twice about migrating from their present speech-oriented application tools to accommodate converged application interfaces. Not only do we see this announcement breaking price barriers for speech-enabled applications in the relatively "greenfield" SMB market, but its practical exploitation of multi-modal SALT will help push the market toward multi-modal user interfaces for many online applications.

Your Father's IVR

Interactive Voice Response (IVR) has long been the bastion of telephone-based applications, primarily supporting call center activities with "front-ends" that identify a caller for selective call routing, generate "screen-pop" information for a live agent, and, perhaps more importantly, provide self-service applications via a Telephone User Interface (TUI).

The notorious TUI IVR speech menus ("press one for...") left much to be desired in terms of flexibility and time efficiency, and the proprietary platforms, speech cards, and complex design tools made IVR an expensive proposition for enterprise application implementation and ongoing maintenance. That's why it found some degree of success primarily in conjunction with larger enterprise call centers.

When it came to non-customer callers, the limited facilities of voicemail systems were exploited to interact with the caller: a simple answering machine for caller voice messaging, an auto-attendant to re-direct calls from the main business number to specific user extensions, and some really mickey-mouse ways of using a group of special mailboxes to emulate an application call flow, with voice menus (mailbox greetings) and branching logic to other mailboxes. What voicemail could not do was directly access application databases; that required the power of IVR programming.

Your father's IVR also had problems with creating the speech prompts and responses, because it originally required laborious pre-recording with voice artists. God forbid a small script change was needed and the original person who did the recording was no longer available! Although having mixed voices is not a terrible thing, it could be "unnatural" and disconcerting to a caller.


Everyone acknowledges that speech recognition and text-to-speech have now become mature and cost-effective enough for practical use in controlling applications and informational content. This applies to both person-to-person "communication applications" (voice calls, messaging) and service applications, where speech is used for application input and/or output to end-user contact devices, such as:

  • Desktop voice-only telephones;

  • Handheld wireless phones;

  • Multi-modal, handheld devices.

But let's get realistic about the value of speech as an interface medium. Speech interfaces make sense for mobile use, where they replace a large screen and keyboard, and for bite-size pieces of information like messages and information alerts.

Speech is not useful for scanning documents or digging around in databases. The value of speech control at the desktop, such as with a PC-based softphone, may be somewhat limited: screen displays are often a faster way to interact than speech output, and informational privacy sometimes demands non-audible input or output. Finally, speech will be almost useless in a really noisy environment.

It is only recently, however, that the benefits of improved speech recognition have been brought to market by leading technology providers, such as Avaya's ready-to-use Speech Access product, which can speech-enable its other communication applications software. Microsoft is aiming to exploit those kinds of benefits further with an integrated platform that lets third-party developers speech-enable any kind of online web application.

What About VoiceXML?

Microsoft's speech application development offering is based on its Speech Application Language Tags (SALT) markup, which competes directly against the current leading open standard for developing speech-enabled applications, VoiceXML.
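To make the contrast concrete, here is a rough sketch of how SALT embeds speech tags in an ordinary HTML page so the same form can be filled by keyboard or by voice. The element names follow the SALT 1.0 specification, but the grammar file, form, and field names here are hypothetical:

```xml
<!-- Hypothetical multi-modal page: the HTML text field can be filled
     by typing, or by speech via the salt:listen element. -->
<html xmlns:salt="http://www.saltforum.org/2002/SALT">
  <body>
    <form id="travel">
      <input name="txtCity" type="text" />
    </form>

    <!-- Speak a prompt, then listen for a city name. -->
    <salt:prompt id="askCity">Which city are you flying to?</salt:prompt>
    <salt:listen id="getCity">
      <salt:grammar src="city.grxml" />  <!-- hypothetical grammar file -->
      <!-- Copy the recognized city into the visual form field. -->
      <salt:bind targetelement="txtCity" value="//city" />
    </salt:listen>
  </body>
</html>
```

The point of the design is that SALT adds a handful of speech primitives to an existing visual markup page, rather than defining a complete, separate dialog document the way VoiceXML does.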

However, VXML has its limitations in enabling flexible convergence between the visual interfaces of online web applications and the speech interfaces of telephone and multi-modal applications. In particular, VXML lacks call-control functionality, which led the W3C to develop yet another set of supplementary markup tags, Call Control XML (CCXML).
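For comparison, a minimal VoiceXML 2.0 dialog for the classic "press one for..." menu looks roughly like this (the destination form names and prompt wording are hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- A simple TUI menu: the platform speaks the prompt and matches
       the caller's spoken response or DTMF key press. -->
  <menu dtmf="true">
    <prompt>Say sales, or press one. Say support, or press two.</prompt>
    <choice next="#sales">sales</choice>
    <choice next="#support">support</choice>
  </menu>

  <form id="sales">
    <block><prompt>Transferring you to sales.</prompt></block>
  </form>
  <form id="support">
    <block><prompt>Transferring you to support.</prompt></block>
  </form>
</vxml>
```

Note that everything here describes the dialog itself; actually transferring or conferencing the call is exactly the kind of call-control task that falls to the supplementary CCXML layer.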

Although VXML-based applications are projected to handle over 10 billion calls for large North American enterprises in 2004, according to Yankee Group analyst Art Schoeller, that doesn't necessarily mean it will remain the only game in town.

In addition to the added convergence flexibility promised by SALT, the large, under-served SMB market will be the initial target of opportunity for MS Speech Server and its application developers. As multi-modal handheld devices continue their penetration of the consumer market, and VoIP replaces the TDM infrastructure of the PSTN, the practicality of converged (self-service) interfaces for both business and communication applications will also increase.

At least, that's the future that Microsoft and its application development tool partners are gambling on.

Talking to some of the leaders in the telecommunications industry, we hear a realization that "times are a-changing," but they want to migrate cautiously into the unproven world of converged user interfaces.

More Questions - What Do You Think?

How difficult will it be to switch from existing IVR applications to a VXML- or SALT-based platform? How important will it be to support mobile users carrying multi-modal handheld devices with converged self-service applications and services? Will "combined" speech interfaces (e.g., speech input, visual output) be practical for handheld devices? Do we really need the same speech-enabled interfaces at the desktop?

How will traditional IVR applications benefit most from multimedia interfaces? Will multi-modal applications reside only on the web? Will the communication application providers continue to supply standardized converged user interfaces for their phone systems, or will enterprise customers want to customize their own versions of everything? Who will be doing the interface design and programming for enterprise applications? Let me know your thoughts at [email protected].
