Filtering Knowledge On The Net Just Got Simpler
by Christine Hudgins-Bonafield
There's a French phrase that aptly describes today's Internet: embarras de richesse, an embarrassment of riches. The net is also saddled with embarrassing poverty.
In fact, there's so much wheat and so much chaff that it's difficult to find anything without the right cue. Filtering and refining information, however, is about to become a lot easier. Among the reasons:
· Friendlier, more inclusive second-generation search engines, like the HotBot service of HotWired Ventures and Inktomi Corp.;
· The unleashing of crawlers from companies like Open Text and Inmagic (using Lycos' spider) to index the desktop, the workgroup, the enterprise and the Internet;
· The individualization of searches, including Open Text's client index software to help users retrace searches and add context ("enterprise" might be associated more often with "networks" than "Klingons");
· The use of histories to build datamining tools; IBM plans to provide browser access to its infoMining tools by early next year and is considering returning Internet searches broken algorithmically into logical clusters;
· The onset of intranet tools, like Netscape's Catalog server (based on Harvest) and SRA's NetOwl (which identifies proper names and was developed for the intelligence community); both index information at the server, rather than relying on crawlers;
· The browser integration of search engines, workflow, collaboration, e-mail and document management suites for intranet/Internet searches;
· The evolution of parallel processing for Internet-wide search engines;
· The spread of search engine access to mirror sites to improve performance, provide backup and produce multilanguage access (watch Digital's Alta Vista);
· The advent of image search tools like those from SRA, Virage and IBM's emerging Query by Image Content (QBIC); Virage, for instance, creates a 1 KB index for an image based on primitives like color and texture; and
· New products based on the World Wide Web Consortium's flexible PICS specifications to filter objectionable content; PICS is also likely to be used to rate the content of servers for vertical market groups or even by those seeking pornography.
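To make the image-indexing item above concrete, here is a minimal sketch of a compact color signature. It is not Virage's or IBM's actual method, which the article does not detail; the function names, the 4-bins-per-channel histogram and the histogram-intersection score are all illustrative assumptions. Texture, the other primitive mentioned, is left out.

```python
def color_signature(pixels, bins_per_channel=4):
    """Return a normalized coarse RGB histogram for (r, g, b) tuples in 0-255.

    Hypothetical helper: with 4 bins per channel the signature has 64 entries,
    small enough to store as a compact per-image index.
    """
    step = 256 // bins_per_channel
    hist = [0] * bins_per_channel ** 3
    count = 0
    for r, g, b in pixels:
        idx = ((r // step) * bins_per_channel + g // step) * bins_per_channel + b // step
        hist[idx] += 1
        count += 1
    return [h / count for h in hist] if count else hist


def similarity(sig_a, sig_b):
    """Histogram intersection: 1.0 for identical color mixes, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(sig_a, sig_b))


# Tiny synthetic "images" given as lists of pixels (made-up data for illustration).
red_image = [(250, 10, 10)] * 100
mostly_red = [(240, 20, 5)] * 90 + [(10, 10, 250)] * 10
blue_image = [(5, 5, 245)] * 100

red_sig = color_signature(red_image)
print(similarity(red_sig, color_signature(mostly_red)))
print(similarity(red_sig, color_signature(blue_image)))
```

A search tool built this way would compute the signature once per image and compare signatures at query time, rather than comparing the images themselves.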
The Hot Little Engine That Could
One of the summer's significant advances is apt to be the HotBot engine from HotWired and Inktomi, founded by 29-year-old University of California at Berkeley assistant professor Eric Brewer and his star pupil, Paul Gauthier (http://www.hotbot.com). This duo also uncovered the security weaknesses in the Netscape browser.
HotWired and Inktomi (the chaos-destroying spider of Indian legend) were expected last week to announce the Internet's most expansive and feature-rich search engine. According to The Yankee Group's Greg Wester, the engine's functionality is "more than one or two steps ahead" of the pack. And Meta Group's David Folger believes it could "take a big (traffic) slice" away from other engines.
The service will also be the first industrial-strength application of Network of Workstations (NOW), or hive computing, says Brewer. It's one reason Inktomi expects to be able to do full text indexing of 50 million pages initially. Brewer believes that search engines are limited today by the number of processors that can fit in a single SMP box. The NOW approach means there is "no pre-defined limit," and it lowers the overall cost.
Inktomi COO Rob Guyon says the entire fault-tolerant system runs on eight UltraSPARC stations, which Inktomi estimates will scale to handle 70 million pages. (Digital's Alta Vista reported 22 million full text pages indexed in May and InfoSeek's UltraSeek expected 25 million pages this month.)
But analysts believe HotBot's approach may be even more important than its size. Researchers will be able to search within a range of Internet posting dates, and HotBot also promises very fresh information. While Alta Vista sends out its crawler once a month, Inktomi is promising to refresh its 50 million documents weekly and some of the most popular sites daily. Open Text is also working on updating news-based and other sites on a short timeline.
One feature, though, that continues to distinguish Open Text is its ability to show the line of text in which a search term appears without requiring the user to leave its site. HotBot requires leaving the site through a window, while maintaining an ongoing search. HotBot users, however, will be able to search servers in a specific geographical region or even on a single Web site.
HotBot promises support for an expansive set of Boolean operators and proximity searches, but its user interface is in plain English.
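For readers unfamiliar with the terms, a Boolean AND search returns documents containing every term, and a proximity search additionally requires the terms to sit within a few words of each other. The sketch below illustrates both over a toy positional index. It is not HotBot's implementation; the function names and sample documents are invented for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [word positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def search_and(index, *terms):
    """Boolean AND: ids of documents that contain every term."""
    postings = [set(index.get(t.lower(), {})) for t in terms]
    return set.intersection(*postings) if postings else set()


def search_near(index, a, b, distance):
    """Proximity: ids of documents where a and b occur within `distance` words."""
    hits = set()
    for doc_id in search_and(index, a, b):
        pos_a = index[a.lower()][doc_id]
        pos_b = index[b.lower()][doc_id]
        if any(abs(i - j) <= distance for i in pos_a for j in pos_b):
            hits.add(doc_id)
    return hits


# Made-up sample documents for illustration.
docs = {
    1: "enterprise networks need fast search engines",
    2: "the starship enterprise met the klingons",
    3: "search tools for enterprise and workgroup networks",
}
index = build_index(docs)

print(search_and(index, "enterprise", "networks"))      # documents 1 and 3
print(search_near(index, "enterprise", "networks", 1))  # document 1 only
```

The proximity query narrows the Boolean result: document 3 contains both words, but they are three positions apart, so it drops out when the allowed distance is one.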
Inktomi also hopes to gain a foothold in sites that keep search engines at bay today because of their relentless drilling. David Pritchard, HotBot marketing director for HotWired, says the crawler merely nibbles at a Web site, leaves and later comes back for more.
Of course, how well HotBot does will depend on its marketing, says Wester. One thing HotBot is considering is selling ads based on keyword clusters, so that a user entering the keywords "car," "buy" and "Chrysler" might actually see an ad from Chrysler.
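The keyword-cluster idea can be pictured as a simple overlap test between the query and each advertiser's word set. The sketch below is only one way such matching could work, not a description of HotBot's plans; the cluster table, the pick_ad function and the two-word threshold are hypothetical.

```python
# Hypothetical advertiser clusters; a real service would load these from ad sales data.
AD_CLUSTERS = {
    "Chrysler ad": {"car", "buy", "chrysler"},
    "Airline ad": {"flight", "ticket", "travel"},
}


def pick_ad(query, clusters=AD_CLUSTERS, min_overlap=2):
    """Return the ad whose keyword cluster shares the most words with the query.

    Returns None when no cluster shares at least `min_overlap` words, so a
    single stray keyword does not trigger an ad.
    """
    words = set(query.lower().split())
    best_ad, best_score = None, 0
    for ad, keywords in clusters.items():
        score = len(words & keywords)
        if score > best_score:
            best_ad, best_score = ad, score
    return best_ad if best_score >= min_overlap else None


print(pick_ad("buy Chrysler car"))  # Chrysler ad
print(pick_ad("car wash"))          # None
```

Requiring more than one matching word is the point of a cluster: "car" alone says little about intent, while "car" plus "buy" suggests a shopper.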
What will HotBot do next? Extending the search engine to intranets is a long-term possibility, but it looks like individualizing searches will be the next priority. The HotBot planners don't intend in the near term to package the engine with collaboration software. Inktomi's Guyton says the Internet needs to see "the fundamental things done right" before moving into new space.
Spiders Aren't Just Hangin'
While there's a new player in the search engine market, this technology race is far from over. Open Text, for example, will be doubling the number of pages it visits by replacing its RISC-based Digital Alpha base with Intel-based Pentium Pro P6 servers running NT. David Weinberger, Open Text's vice president of strategic marketing and communications, says the new servers will provide 10 to 20 times the throughput of a very powerful RISC machine. But most of the improvement will come from unannounced server software enhancements.
IBM is also among those with aspirations of indexing the entire Internet. It is conducting research on parallel processing with supercomputers, and because its infoMarket service has the largest base of copyrighted material available (based on Cryptolopes, which provide encrypted access to copyrighted material using IBM's own clearinghouse), it may be able to deliver an extremely broad index (see "Selling Knowledge on the Internet," July 1995 and the H-Report, December 15). In May, CMP's Techweb was among those exploring online pricing by signing up as an infoMarket content provider.
Collaborating Around the Engine
Search engine providers are also coupling applications with their engines, producing the kind of streamlined multiuse products that Inktomi frowns upon but that challenge IBM's pricier and more customized offerings like Lotus Notes.
Digital is even rebranding its Internet software to match the Alta Vista name. In addition, Digital will be beta testing versions of its search product for personal, workgroup and enterprise searches this month (prices are expected to be under $100, under $5,000 and more than $10,000, respectively). It will also be introducing two product suites in July that integrate its engine with search, mail and collaboration capabilities. Digital's products are designed for 32-bit Windows desktops and NT and Unix servers. A toolkit to extend the packages' HTML search support to other formats will be provided to Digital partners this summer. The products require client software, but the interface can be any Internet browser.
Open Text shipped its Livelink intranet multiapplication package in May (sans e-mail), which extends beyond Digital's product to add workflow (with expected support for Sybase, Oracle and Informix by this month) and support for document management.
Christine Hudgins-Bonafield can be reached at firstname.lastname@example.org.
Updated May 31, 1996