Spoken Word Search Analyzes Audio Content

Innumerable audio files available on the Web make searching based on keywords or metadata tricky. But spoken-word search technology is making it easier for enterprises to find the right audio.

October 19, 2006

Spoken-word search is expected to reduce the burden placed on audio and video content creators to make their work easier to find on the Internet. Spoken-word search should raise the usefulness of podcasting and thereby help sustain the growth of the podcast medium.

TVEyes, with its Podscope service, is a major player for searching within audio content. Podzinger and Blinkx offer similar services. AOL is beta testing an audio search engine based on TVEyes' technology and Lycos has signed on with Blinkx to build multimedia search capabilities. Vendors such as Google, MSN and Yahoo do not currently offer spoken-word search capabilities.



The ever-increasing number of audio files available on the Web makes searching based on keywords or metadata unreliable and difficult. Spoken-word search technology will greatly assist podcasts that would otherwise be buried on the Web. AOL's licensing of spoken-word search legitimizes the field; expect Google, MSN and Yahoo to adopt this technology in some fashion.

The popularity of podcasts and online audio is growing by leaps and bounds. In fact, 7.7 million people will be listening to podcasts weekly by 2010, versus an estimated 1.5 million listeners in 2006, according to Bridge Ratings, a provider of radio-audience trend information. With an increase in the quantity of available audio content, however, comes the question of how to find relevant content within audio files.

To address this, a few companies have developed a technology that applies spoken-word search techniques to audio content, as well as to the audio portion of video files. TVEyes' Podscope search engine, for instance, ferrets out audio content on the Web, then uses speech-recognition algorithms on that content, generating an index that can be searched by consumers and business users alike.

The use of spoken-word search is bound to make inroads into the consumer space first, but it holds potential for use in the enterprise as well, just as instant messaging, blogging and wikis have. Although three big search vendors, Google, MSN and Yahoo, have yet to take outwardly visible action surrounding spoken-word technology, AOL has partnered with TVEyes and launched a beta of TVEyes' search engine on its site this summer. We expect momentum around spoken-word search to continue to build, whether based on TVEyes' technology or a competing one.

One big potential enterprise benefit of spoken-word search is that the content creators--those users producing the podcasts--could bypass the metadata creation and manual transcription of audio files, which have been the conventions followed by companies requiring text-searchable audio files. So the technology represents a significant advancement for a niche enterprise need. It should join the menu of standard Web searches over a relatively short period. TVEyes thinks this will happen within 18 to 24 months, and we concur.

Search Techniques

Searching multimedia content today is primarily done with the equivalent of 1970s technology. The search requires keywords or metadata, or it relies on extrapolating information from a Web page. If a topic or phrase is mentioned in a podcast but doesn't appear on an associated Web page or within metadata, a standard search on that topic will not produce the podcast as a result.

With the emerging spoken-word search, the audio portion of a multimedia file is "listened to" by the search engine. An index--not necessarily a word-for-word transcript--is built by converting the spoken words to text using one or more voice-recognition algorithms. TVEyes uses at least eight engines; the algorithms can look at vocal inflection and signatures to guess an unclear phrase. The resulting index is text-searchable data. Unlike conventional voice-recognition software for PCs and telephone systems, these spoken-word search engines don't attempt to learn speech patterns, and they require an extensive library of words, phrases and accents. Background noise and music are ignored, though overlap between these sounds and spoken words will reduce accuracy.

Searching of video files is possible with a search engine such as Podscope, but it can search only the audio portion of video files. True video search--that is, using face- and body-recognition techniques--is still an experimental technology and is not ready for widespread consumer use. There are also audio search engines that can find music based on a few notes or that can locate a snippet of audio within a collection of sound files, but these do not perform voice recognition.
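To make the indexing idea concrete, here is a minimal Python sketch of a text-searchable index built from speech-recognition output. It is an illustration only, not a description of how Podscope or any vendor works: the file names, the hard-coded `RECOGNIZED` word lists and the `build_index` and `search` helpers are all hypothetical, and a real engine would produce the word lists from audio rather than from a literal.

```python
from collections import defaultdict

# Hypothetical recognizer output: for each audio file, a list of
# (recognized word, start time in seconds). Hard-coded for illustration.
RECOGNIZED = {
    "episode-12.mp3": [
        ("welcome", 0.0), ("to", 0.4), ("the", 0.5),
        ("network", 0.7), ("security", 1.2), ("podcast", 1.8),
    ],
    "episode-13.mp3": [
        ("today", 0.0), ("we", 0.3), ("discuss", 0.5),
        ("podcast", 1.1), ("search", 1.6),
    ],
}


def build_index(recognized):
    """Map each recognized word to the (file, offset) pairs where it was heard."""
    index = defaultdict(list)
    for filename, words in recognized.items():
        for word, start in words:
            index[word.lower()].append((filename, start))
    return index


def search(index, term):
    """Return every (file, offset) hit for a single search term."""
    return index.get(term.lower(), [])


index = build_index(RECOGNIZED)
print(search(index, "podcast"))
```

Note that the index stores words and offsets rather than a readable transcript, which matches the distinction the article draws between an index and a word-for-word transcription.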


Podscope Media Indexing

Enterprise Value

For consumers and enterprise users, spoken-word search promises not just to find an audio file containing relevant content, but also to pinpoint the location of the relevant content within the audio file using a time index. This means that, after finding the correct audio file to listen to, a user could skip ahead to the appropriate portion of the audio file.
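As a rough sketch of how a time index could let a player skip ahead, the Python below looks up a phrase in a list of timed words and formats each hit as a seek position. The `TIMED_WORDS` data and the `find_phrase` and `as_timestamp` helpers are assumptions made for this example; they do not reflect any vendor's actual format or API.

```python
# Hypothetical timed word list for one audio file:
# (recognized word, start time in seconds).
TIMED_WORDS = [
    ("our", 0.0), ("quarterly", 0.5), ("results", 1.1),
    ("are", 1.6), ("strong", 1.9),
    ("back", 753.0), ("to", 753.4), ("the", 753.6),
    ("quarterly", 754.2), ("results", 754.9),
]


def find_phrase(words, phrase):
    """Return the start time of every occurrence of the phrase."""
    target = phrase.lower().split()
    if not target:
        return []
    tokens = [word.lower() for word, _ in words]
    hits = []
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            hits.append(words[i][1])
    return hits


def as_timestamp(seconds):
    """Format an offset in seconds as MM:SS for display in a player."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


for offset in find_phrase(TIMED_WORDS, "quarterly results"):
    print(as_timestamp(offset))
```

A player that accepts a start offset could use the second hit to jump directly to roughly the 12-and-a-half-minute mark instead of making the listener scan the whole file.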

Besides the added functionality that accurate search of multimedia content brings to both consumers and enterprise users, spoken-word search could ease the burden placed on enterprise users creating multimedia content. Without spoken-word search, creating the necessary search data is no small challenge for content creators, who rely on metadata or text transcripts to make the content searchable.

For businesses, there is another considerable benefit of spoken-word search for external-facing content: As the use of spoken-word search increases, potential customers will be more likely to find podcasts and videocasts--which should decrease a business' dependence on sheer popularity or marketing efforts to draw users to a site.

An enterprise considering spoken-word search would not need to modify existing audio files. However, the technology isn't perfect. Audio content is more easily indexed if it has a clear, attentive speaker with little background noise and no background music. Higher-quality recordings and encodings produce better search results. TVEyes claims to have an average accuracy rate of 80 percent. Higher accuracy can be achieved through improvements in speech-recognition engines, but progress in that field takes years.

Development Status

TVEyes' Podscope is leading the pack in spoken-word search, but it does have rivals. Podzinger, powered by technology from BBN Technologies, offers spoken-word searching; that tool will display a small text transcript around the search terms. Unfortunately, this can also make obvious how inaccurate voice recognition is, since the transcribed text doesn't always match what's been said. Podscope does not display any transcription, either in full or in snippets.

Audio/video search vendor Blinkx has inked a contract with Lycos to power searches on its broadband service. We're not aware of any other voice-recognition vendor or PBX manufacturer, such as those with speech-recognition IVRs, making headway into Internet search--and the big four search vendors have also lagged behind. Google, MSN and Yahoo do not offer spoken-word searching. Yahoo's audio search engine looks only for audio files based around keywords and typical Web searching, rather than performing spoken-word search. TVEyes partner AOL released a podcast search beta in July; the product is similar to Podscope but with a few user interface changes.

AOL's and Lycos' interest in spoken-word search has brought a sense of legitimacy to the technology. We expect the other big search vendors will soon partner with a spoken-word search vendor or develop their own audio search engine. Finally, TVEyes doesn't envision a corporate audio search appliance in the near future, as Podscope uses several evolving voice-recognition servers. The company said an appliance is on the long-term horizon. For now, the technology exists for external or public content.

Michael J. DeMaria is an associate technology editor based at Network Computing's Syracuse University Real-World Labs®. Write to him at [email protected].
