Search Engines: The Hunt Is On

October 16, 2000
By Avi Rappoport

Web sites, e-commerce sites, portals and intranets never shrink--this is one of the truisms of our time. And this fact makes finding anything on them more difficult by the hour. Search engines, along with good information architecture and navigation systems, provide site visitors with access to the information they seek. Any self-respecting portal must include a search box, and large Web sites, Web stores and intranets should help people find information locally, rather than having them wander off to a search engine or--horrors!--to a competitor.

If you have a database-generated site, such as an e-commerce catalog, you might wonder why you would need another search engine. After all, can't you locate everything by using the database itself? Not exactly. Database search functions were designed to find out how many widgets are in the Muncie warehouse or which salesperson did the best last quarter--they were not designed to show the top five pages on firewalls or the toastiest ski socks. In fact, there are several excellent reasons to use a text search engine on database-generated data:
  • Database information is stored in separate fields, but searchers dislike choosing fields before searching. For example, a database music search may limit searches to album-title, song-title or lyrics fields. It's much more convenient to search all those fields at once. Many text search engines can index field information, so users can limit their searches to specific fields if they choose.

  • Database searches require complex Boolean or SQL search commands, whereas text search engines can find items when given a simple set of search terms, with no operators at all. The vast majority of Web searches use no special operators, and site visitors would likely rather not learn special search commands. For those who want more control, text search engines also offer advanced search forms and search operators.

  • Database response times can be extremely long when searching for multiple words.

  • Many database search engines require exact capitalization and diacritical character matches. They won't find Pokémon if the search is for pokemon. Text search engines generally perform intelligent conversions to match words even if some elements are different.

  • Database results are not sorted by relevance; they appear by date, size or price, or, worse yet, by internal ID number. It's extremely useful to see an item containing all the search terms, preferably as a phrase, listed before an item that has only one of the search terms. (Of course, while many text search engines can sort results by date as an option, none of them can sort by price, size or geographic location, for example. Databases are much better at these kinds of ordering.)
Adding a text search engine does increase the complexity of a Web site's infrastructure. It requires indexing the data--through a database gateway or via a Web crawl--keeping the index up to date and running another server or service to search the index. However, using a text search engine on a database-generated site provides better access to the same data and makes customers happy.
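
As a rough illustration of the first and last points in the list above, the following Python sketch searches several record fields at once with plain terms and lists records containing every term ahead of those containing only some. The record layout, the field names and the scoring rule are hypothetical, chosen for illustration; they are not taken from any product reviewed here.

```python
# Hypothetical catalog records; the field names are assumptions for illustration.
RECORDS = [
    {"id": 1, "album_title": "Blue Train", "song_title": "Locomotion", "lyrics": ""},
    {"id": 2, "album_title": "Kind of Blue", "song_title": "Blue in Green", "lyrics": ""},
    {"id": 3, "album_title": "Green Onions", "song_title": "Behave Yourself", "lyrics": ""},
]

SEARCH_FIELDS = ("album_title", "song_title", "lyrics")


def search(records, query, fields=SEARCH_FIELDS):
    """Return record ids, ranked by how many distinct query terms each contains."""
    terms = set(query.lower().split())
    scored = []
    for record in records:
        # Pool the words from every searchable field so the user need not pick one.
        words = set()
        for field in fields:
            words.update(record[field].lower().split())
        matched = len(terms & words)
        if matched:
            scored.append((matched, record["id"]))
    # More matched terms first; ties fall back to ascending id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [record_id for _, record_id in scored]
```

Passing a narrower `fields` tuple restricts the search to specific fields, which mirrors the optional field limiting described above.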

Trends in Searching

As the Web evolves, search engines are evolving with it. To accommodate international and multilingual Web sites, search engines must recognize extended characters, such as those in thé and daß, and page languages in general. They should let users search for terms with or without diacritical characters, and index double-byte characters and Unicode.
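
One way to let users search with or without diacritical characters is to fold both the query and the indexed text to a common form before comparing them. The sketch below uses Python's standard unicodedata module; the function names are our own, and a production engine would probably keep the original forms as well, so that exact-accent searches remain possible.

```python
import unicodedata


def fold(text):
    """Lowercase text and strip combining accents so 'Pokémon' matches 'pokemon'."""
    # casefold() also expands the German sharp s to "ss".
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query, document_text):
    """True if every folded query term appears among the folded document words."""
    doc_words = set(fold(document_text).split())
    return all(term in doc_words for term in fold(query).split())
```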

For database-backed sites, direct database integration provides real-time updating of indexes. Search engines often index irrelevant sections of Web pages, such as navigation; they should instead give publishers a way to mark text to be ignored, for example with the pseudo-tag <!-- noindex -->. And by recognizing metadata, including authorship and categories, search engines can improve both searching and results display. With publication information in the metadata, for instance, accurate date-range searching is possible.
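
To make the noindex idea concrete, here is a minimal Python sketch of an indexer step that drops marked regions before extracting text. It assumes that an ignored region opens with <!-- noindex --> and closes with <!-- /noindex -->; the closing marker is our assumption, and the conventions of real search engines differ.

```python
import re

# Assumption: an ignored region starts with <!-- noindex --> and ends with
# <!-- /noindex -->. The closing marker is hypothetical; real engines differ.
NOINDEX_REGION = re.compile(
    r"<!--\s*noindex\s*-->.*?<!--\s*/noindex\s*-->",
    re.IGNORECASE | re.DOTALL,
)
TAG = re.compile(r"<[^>]+>")


def indexable_text(html):
    """Drop marked regions, then strip remaining tags and collapse whitespace."""
    without_ignored = NOINDEX_REGION.sub(" ", html)
    text = TAG.sub(" ", without_ignored)
    return " ".join(text.split())
```

The regular-expression tag stripping is deliberately crude; a real indexer would use a proper HTML parser.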

Hit highlighting (showing the word matches in context) is a great way to help searchers understand the results, and cached pages with the word matches marked are even better. We expect search engines to do more to cluster and interpret results in the future.
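
Hit highlighting can be sketched in a few lines of Python: find the first match, cut a window of context around it and mark every match inside that window. The function name, the window width and the ** marker are arbitrary choices for illustration, not any vendor's behavior.

```python
import re


def highlight(text, terms, width=30, marker="**"):
    """Return a snippet around the first match, with each term occurrence marked."""
    if not terms:
        return ""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    first = pattern.search(text)
    if first is None:
        return ""
    # Keep `width` characters of context on each side of the first match.
    start = max(0, first.start() - width)
    end = min(len(text), first.end() + width)
    snippet = text[start:end]
    return pattern.sub(lambda match: marker + match.group(0) + marker, snippet)
```

Because the window is cut by character count, it can begin or end in the middle of a word; a fuller version would trim the snippet to word boundaries.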

Software or Service?

Search engines are heavy-duty server programs, whether they're local or served by an ASP (application service provider). Search systems have two elements. The indexer gathers the words from the documents--whether HTML pages, local files or database records--and puts those words into an index file for fast retrieval. The second element is the search engine itself, which accepts queries, locates the relevant pages in the index and formats the results in an HTML page. As you can imagine, this calls for fast processors, significant hard-disk space for the index and a great deal of bandwidth for responding to many simultaneous search requests. The exact configuration depends on the number of pages, but most search engines require Intel Pentium or Sun SPARC processors, Microsoft Windows NT/2000 or Unix, and at least a T1 line.
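
The two elements described above can be illustrated with a toy inverted index in Python. This is a minimal in-memory sketch under our own naming; the products reviewed here store their indexes on disk and do far more processing of words and queries.

```python
from collections import defaultdict


def build_index(documents):
    """Indexer: map each word to the set of document URLs that contain it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def query(index, terms):
    """Search engine: return the URLs containing every query term, sorted."""
    postings = [index.get(term.lower(), set()) for term in terms.split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))
```

Looking up a word in the index is a single dictionary access, which is why retrieval stays fast even when the document collection is large.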

There are some special indexing features to consider, such as file-format compatibility and index robot control. Search engines should also offer many customization options, from search forms and results pages to relevance rankings.

For those with constraints on server space, bandwidth or technical staff, specialized ASPs provide remote search services. These services have indexers that use robots to crawl links and locate pages. Then the indexer stores the text and meta-information on the ASP's servers. When a user enters a search in a form on a site, the query is sent to the remote-service search engine, which locates matches within the index, sorts the results by relevance and sends back an HTML results page with links to the original pages. This removes the load from local servers. The Web browser administration interfaces let producers and designers configure the search service from any location.

In our tests at SearchTools.com's facilities in Berkeley, Calif., the ASPs were, as expected, slightly less responsive in indexing than internal servers, but the services were easier to configure and customize. Because these services are new, there are few examples of enormous portals using them, but they performed well in our tests with 150,000 pages. Remote search services are not appropriate for intranets, as they can't traverse firewalls. They also can't be used for portals indexing millions of pages or for sites that would like to index databases directly. However, such services are good for outsourced and hosted Web sites, and for companies with limited resources or bandwidth.

How We Tested

We tested three leading search engines for servers: AltaVista Search Engine, the server version of the AltaVista Co. engine; Excalibur Technologies Corp.'s Excalibur RetrievalWare WebExpress; and Inktomi Search Software. We also evaluated two leading remote-search-service ASPs: Atomz.com's Atomz Enterprise Search and Searchbutton.com's Searchbutton Corporate. In addition, we invited Verity and Fast Search to submit products and asked to evaluate Google's and Inktomi's other services. Verity and Inktomi declined to participate, Fast Search agreed but did not send software, and Google's search service lacked the engineering resources to take part.

All the search engines we tested can handle hundreds of thousands of documents and run on multiple servers. Each has a powerful indexing robot that follows links through Web sites and reads the pages for indexing. All honor the standard robots.txt and robot metatags that developers use to indicate parts of the site that should not be indexed. All the local search-engine servers also can index local file systems and have code library/SDK versions for programmatic access to their functionality. And each is a capable and scalable search engine.
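
For readers unfamiliar with robots.txt, the following Python sketch shows how a crawler might filter its URL list against a site's rules, using the standard library's urllib.robotparser module. The robots.txt content, the user-agent name and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return only the URLs that the robots.txt rules let this agent fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]
```

This covers only the robots.txt file; honoring robot metatags requires reading each page's HTML as well.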

To evaluate these search engines, we installed the servers or enabled the services using the default configurations. Then we pointed the indexing robots at several sites on the Web, including www.networkcomputing.com, as well as e-commerce, education, government and corporate information sites, to get a variety of data. We indexed about 150,000 pages on each server, using the "exclude" functions to control which sections were indexed.

Within the administration interface, we experimented with the functions provided for indexing and retrieval changes. We tried out the scheduling, field definition, relevance ranking and customization options, comparing the functionality of each product. We also treated the interface as a Web application in its own right, testing its design and usability.

By indexing the special test section of SearchTools.com at www.searchtools.com/test/, we could watch the kinds of links the robot (indexing spider) could crawl and see what kind of data was recognized. This testing covered everything a robot may do, including adherence to the robots.txt standard control, JavaScript and image map links, frames, relative links, redirects, directory listings, file-name suffixes (such as .pl or .asp), and other file formats, such as PDF. For indexing, we tested whether the search engines recognized and stored text in alt attributes, comments and metatags, and how well they dealt with extended and diacritical characters. Some of the indexers also detected duplicate pages better than others did.
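
Several of the link types listed above can be illustrated with a small Python link collector built on the standard html.parser and urllib.parse modules. It gathers targets from anchors, image-map areas and frames, and resolves relative links against the page URL. The class name and the example.com URLs are our own; JavaScript links, redirects and directory listings are outside this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect link targets from anchor, image-map, and frame tags."""

    # Tag name -> attribute that holds the link target.
    LINK_ATTRS = {"a": "href", "area": "href", "frame": "src", "iframe": "src"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = self.LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the absolute URLs of all links found in the page, in order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```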

The SearchTools.com test section contains special relevance-ranking pages, with unique words in the title, meta description, keywords and heading tags. These let us evaluate the algorithms used by the search engines when they sort large data sets, attempting to put the most relevant items at the top.
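
A simple way to picture field-based relevance ranking is a weighted count, in which a match in the title counts for more than a match in the body text. The Python sketch below uses weights we made up for illustration; the vendors' actual algorithms are proprietary and certainly more elaborate.

```python
# Hypothetical weights: a match in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 5, "heading": 3, "keywords": 2, "description": 2, "body": 1}


def score(page, terms):
    """Sum the weighted number of term occurrences across the page's fields."""
    total = 0
    for field, weight in FIELD_WEIGHTS.items():
        words = page.get(field, "").lower().split()
        for term in terms:
            total += weight * words.count(term.lower())
    return total


def rank(pages, query):
    """Return URLs of matching pages, highest score first."""
    terms = query.split()
    scored = [(score(page, terms), page["url"]) for page in pages]
    scored = [(points, url) for points, url in scored if points > 0]
    # Ties fall back to alphabetical URL order so the output is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored]
```

Test pages with a unique word in exactly one field, like those in the SearchTools.com test section, make it easy to see which fields an engine weights most heavily.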

We tried out the customization features for search forms and results pages. We created pages with our own navigation and page layout, to ensure the results conform to the look and feel of the site. And we tried to rearrange the elements in each result item: the title, URL, page description or extracted text, file size and modification date.

Finally, we evaluated the search logs and reports, looking for information that would be useful to site managers tracking the needs and interests of their visitors by their search terms.
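
The kind of report that helps a site manager can be approximated with a short Python script that tallies the most frequent queries and the queries that returned nothing. The log format here, a list of (query, result count) pairs, is an assumption for illustration; each product writes its own log format.

```python
from collections import Counter


def summarize(log_entries, top=3):
    """Return (most common queries, most common zero-result queries)."""
    queries = Counter()
    no_results = Counter()
    for query, result_count in log_entries:
        # Normalize case and spacing so "Ski  Socks" and "ski socks" count together.
        normalized = " ".join(query.lower().split())
        queries[normalized] += 1
        if result_count == 0:
            no_results[normalized] += 1
    return queries.most_common(top), no_results.most_common(top)
```

Frequent zero-result queries are a useful signal: they point to content visitors want but the site lacks, or to vocabulary the site does not use.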

Overall, we looked for quality and coverage of indexing, search-results sorting, customization and search-administration options. Although we list the vendor-posted prices, these prices can be negotiated. The companies were reluctant to give public pricing information specific to the configurations we requested.

Inktomi Search Software took our Editors' Choice award for power, capacity, customization and simplicity. While it is the most expensive product we tested, it has excellent indexing and search features, saving time and frustration for tech staff.



