Search Engines: The Hunt Is On
October 16, 2000
By Avi Rappoport

Web sites, e-commerce sites, portals and intranets never shrink--this is one of the truisms of our time. And this fact makes finding anything on them more difficult by the hour. Search engines, along with good information architecture and navigation systems, provide site visitors with access to the information they seek. Any self-respecting portal must include a search box, and large Web sites, Web stores and intranets should help people find information locally, rather than having them wander off to a search engine or--horrors!--to a competitor.

Trends in Searching

As the Web evolves, search engines are evolving with it. To accommodate international and multilingual Web sites, search engines must recognize extended characters, such as those in thé and daß, and identify the language of each page. They should let users search for terms with or without diacritical characters, and index double-byte characters and Unicode. For database-backed sites, direct database integration provides real-time updating of indexes.

Search engines often index irrelevant sections of Web pages, such as navigation. Instead, they should provide the means for publishers to mark text to be ignored, using the pseudo-tag <!-- noindex -->, for example. And by recognizing metadata, including authorship and categories, search engines can improve both searching and results display: With publication dates in the metadata, accurate date-range searching becomes possible. Hit highlighting (showing the word matches in context) is a great way to help searchers understand the results, and cached pages with the word matches marked are even better. We expect search engines to do more to cluster and interpret results in the future.

Software or Service?

Search engines are heavy-duty server programs, whether they're local or served by an ASP (application service provider). Search systems have two elements. The first is the indexer, which gathers the words from the documents--whether HTML pages, local files or database records--and puts those words into an index file for fast retrieval. The second is the search engine itself, which accepts queries, locates the relevant pages in the index and formats the results in an HTML page. As you can imagine, this calls for fast processors, significant hard-disk space for the index and a great deal of bandwidth for responding to many simultaneous search requests.
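
To make the two elements concrete, here is a minimal sketch of an indexer and a search function, written in Python. It is an illustration only, not how any of the products reviewed here work. The names fold, build_index and search are hypothetical, and the sketch assumes documents are already plain text. The fold step shows one way to let users search with or without diacritical characters.

```python
import re
import unicodedata
from collections import defaultdict


def fold(term):
    """Lowercase a term and strip diacritics, so 'thé' is stored as 'the'."""
    decomposed = unicodedata.normalize("NFKD", term.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(documents):
    """Indexer: map each folded word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text):
            index[fold(word)].add(doc_id)
    return index


def search(index, query):
    """Search engine: return the documents that contain every query term."""
    terms = [fold(word) for word in re.findall(r"\w+", query)]
    if not terms:
        return set()
    matches = set(index.get(terms[0], set()))
    for term in terms[1:]:
        matches &= index.get(term, set())
    return matches


# Hypothetical sample pages, assumed for illustration only.
PAGES = {
    "a.html": "Le thé est prêt",
    "b.html": "The search engine",
    "c.html": "Search for thé",
}
INDEX = build_index(PAGES)
```

Calling search(INDEX, "the search") looks up each folded term in the index and intersects the sets of matching pages. A production engine would also store word positions and weights for relevance ranking, which this sketch leaves out.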
The exact configuration depends on the number of pages, but most search engines require Intel Pentium or Sun hardware, Microsoft Windows NT/2000 or a Unix such as Solaris, and at least a T1 line. There are some special indexing features to consider, such as file-format compatibility and index robot control. Search engines also should offer many options for customizing everything from search forms through results pages and relevance rankings.

For those with constraints on server space, bandwidth or technical staff, specialized ASPs provide remote search services. These services have indexers that use robots to crawl links and locate pages. The indexer then stores the text and meta-information on the ASP's servers. When a user enters a search in a form on a site, the query is sent to the remote-service search engine, which locates matches within the index, sorts the results by relevance and sends back an HTML results page with links to the original pages. This removes the load from local servers. The browser-based administration interfaces let producers and designers configure the search service from any location.

In our tests, at SearchTools.com's facilities in Berkeley, Calif., the ASPs, as expected, were slightly less responsive in indexing than were internal servers, but the services were easier to configure and customize. Because these services are new, there are few examples of enormous portals using them, but they performed well in our tests with 150,000 pages.

Remote search services are not appropriate for intranets, as they can't traverse firewalls. They also can't be used for portals indexing millions of pages or for sites that would like to index databases directly. However, such services are good for outsourced and hosted Web sites, and for companies with limited resources or bandwidth.

How We Tested

We tested three leading search engines for servers: AltaVista Search Engine, the server version of the AltaVista Co. engine; Excalibur Technologies Corp.'s
Excalibur RetrievalWare WebExpress; and Inktomi Search Software. We also evaluated two leading remote-search-service ASPs: Atomz.com's Atomz Enterprise Search and Searchbutton.com's Searchbutton Corporate. In addition, we invited Verity and Fast Search to submit products and asked to evaluate Google's and Inktomi's other services. Verity declined to participate, as did Inktomi for its other services; Fast Search agreed but did not send software; and Google's search service lacked the engineering resources to take part.

All the search engines we tested can handle hundreds of thousands of documents and run on multiple servers. Each has a powerful indexing robot that follows links through Web sites and reads the pages for indexing. All honor the standard robots.txt file and robots metatags that developers use to indicate parts of the site that should not be indexed. All the local search-engine servers also can index local file systems and have code library/SDK versions for programmatic access to their functionality. And each is a capable and scalable search engine.

To evaluate these search engines, we installed the servers or enabled the services using the default configurations. Then we pointed the indexing robots at several sites on the Web, including www.networkcomputing.com, as well as e-commerce, education, government and corporate information sites, to get a variety of data. We indexed about 150,000 pages on each server, using the "exclude" functions to control which sections were indexed.

Within the administration interface, we experimented with the functions provided for changing indexing and retrieval. We tried out the scheduling, field definition, relevance ranking and customization options, comparing the functionality of each product. We also looked at each interface as a Web application, testing its design and usability.
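
As a small illustration of what honoring robots.txt means for an indexing robot, the following Python sketch uses the standard library's urllib.robotparser module to filter a list of URLs. The robots.txt contents, the www.example.com URLs, the "ExampleIndexer" user-agent name and the allowed_urls helper are all assumptions made for this example; they are not taken from any of the products tested.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, assumed for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return only the URLs that the robots.txt rules let this robot fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical candidate URLs a robot might have discovered by following links.
CANDIDATES = [
    "http://www.example.com/index.html",
    "http://www.example.com/private/report.html",
    "http://www.example.com/cgi-bin/search.pl",
    "http://www.example.com/products/list.html",
]
CRAWLABLE = allowed_urls(ROBOTS_TXT, "ExampleIndexer", CANDIDATES)
```

A well-behaved robot would run a check like this before requesting each page. Robots metatags work differently: they sit inside the page itself, so the robot has to fetch and parse the page before it can decide whether to index it or follow its links.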

By indexing the special test section of SearchTools.com at www.searchtools.com/test/, we could watch the kinds of links the robot (indexing spider) could crawl and see what kind of data was recognized. This testing covered everything a robot may encounter, including adherence to the robots.txt standard, JavaScript and image map links, frames, relative links, redirects, directory listings, file-name suffixes (such as .pl or .asp), and other file formats, such as PDF. For indexing, we tested whether the search engines recognized and stored text in alt attributes, comments and metatags, and how well they dealt with extended and diacritical characters. Some of the indexers also detected duplicate pages better than others did.

The SearchTools.com test section contains special relevance-ranking pages, with unique words in the title, meta description, keywords and heading tags. These let us evaluate the algorithms the search engines use when they sort large result sets, attempting to put the most relevant items at the top.

We tried out the customization features for search forms and results pages. We created pages with our own navigation and page layout, to ensure the results conformed to the look and feel of the site. And we tried to rearrange the elements in each result item: the title, URL, page description or extracted text, file size and modification date.

Finally, we evaluated the search logs and reports, looking for information that would be useful to site managers tracking the needs and interests of their visitors by their search terms.

Overall, we looked for quality and coverage of indexing, search-results sorting, customization and search-administration options. Although we list the vendor-posted prices, these prices can be negotiated. The companies were reluctant to give public pricing information specific to the configurations we requested.

Inktomi Search Software took our Editors' Choice award for power, capacity, customization and simplicity.
Although it is the most expensive product we tested, its excellent indexing and search features save technical staff time and frustration.
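
For readers curious about the relevance-ranking tests described under How We Tested, here is a rough Python sketch of field-weighted scoring, in which a match in the title counts for more than a match in the body. The weights, the field names and the score and rank helpers are hypothetical and chosen only for illustration. None of the vendors disclosed their ranking algorithms to us in this form, and real engines use many more signals.

```python
import re

# Hypothetical field weights, assumed for illustration; not any vendor's values.
FIELD_WEIGHTS = {
    "title": 5.0,
    "heading": 3.0,
    "keywords": 2.0,
    "description": 2.0,
    "body": 1.0,
}


def score(page, query):
    """Sum weighted counts of the query terms across a page's fields."""
    terms = re.findall(r"\w+", query.lower())
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = re.findall(r"\w+", page.get(field, "").lower())
        total += weight * sum(words.count(term) for term in terms)
    return total


def rank(pages, query):
    """Return URLs of matching pages, highest score first (ties sorted by URL)."""
    scored = [(score(page, query), url) for url, page in pages.items()]
    scored = [(s, url) for s, url in scored if s > 0]
    return [url for s, url in sorted(scored, key=lambda item: (-item[0], item[1]))]


# Hypothetical test pages, each placing the query word in a different field.
TEST_PAGES = {
    "/in-title.html": {"title": "Search tools", "body": "About engines"},
    "/in-body.html": {"title": "Home", "body": "Search here, search there"},
    "/no-match.html": {"title": "Contact", "body": "Send us e-mail"},
}
```

With these weights, rank(TEST_PAGES, "search") places the page with the word in its title ahead of the page that only mentions it in the body, and omits the page with no match. Test pages with a unique word in exactly one field make it easy to see which fields an engine favors.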











