
Panning for Gold

Retrieving lost data is like searching for gold nuggets. An enterprise search engine can spare you time and expense, and may even generate precious revenue. CSIRO's Panoptic sparkles with superior

September 15, 2003


These search engines also let you re-energize other systems. Processors driving file systems won't spend needless cycles looking for files or content in files. Databases won't have to crunch as many queries, and legacy systems will gain a new lease on life because they're not spending an inordinate amount of time in the search cycle. Better yet, you won't have to train your employees in SQL.

Search-engine software has two components: an indexer and the actual search engine. Indexers retrieve content, extract words and index them for fast retrieval. Engines interpret queries and locate words, concepts or phrases relevant to the question in the index, then format the output in HTML or XML and send it to the user or device that initiated the question.
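The split between indexer and engine can be illustrated with a minimal sketch. It assumes a simple in-memory inverted index and a query that must match every word; the function names and sample documents are hypothetical and do not reflect how any of the reviewed products work internally.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Indexer: map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Engine: return the IDs of documents that contain every query word."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


# Hypothetical sample content, for illustration only.
docs = {
    "review.html": "Enterprise search engines index content behind the firewall",
    "calendar.html": "Editorial calendar for the magazine",
}
index = build_index(docs)
print(search(index, "search firewall"))  # prints ['review.html']
```

A production engine would also rank the hits and format them as HTML or XML, which this sketch leaves out.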

We went looking for enterprise-class search engines--those that work behind a firewall or secure VPN. The vendors had to supply search-engine software or an appliance that supported it. We did not want it bundled with portal software or content-management software. Our contestants also had to be able to search both structured data in databases and unstructured data on Web servers and file stores. And we required support for a variety of document formats, including word processing and presentation and graphics editors.

We required indexers to retrieve content from secure Web pages (HTTPS) and standard HTTP servers and file systems, and to remove duplicate pages. We also required them to extract words from HTML, XML, Microsoft Office and PDF documents, and index the content. Finally, they had to support ODBC or JDBC (Java Database Connectivity) connectors or gateways.
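Duplicate removal can be sketched as follows. The article does not say how each product detects duplicates, so this example assumes the simplest approach: two pages count as duplicates when their text is identical after whitespace is collapsed. The function name and sample data are hypothetical.

```python
import hashlib


def remove_duplicates(pages):
    """Keep the first URL seen for each distinct page body."""
    seen = set()
    unique = {}
    for url, body in pages.items():
        # Collapse runs of whitespace so trivial formatting differences are ignored.
        normalized = " ".join(body.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[url] = body
    return unique


# Hypothetical pages: the second is a reformatted copy of the first.
pages = {
    "/article.html": "Panning for  Gold",
    "/print/article.html": "Panning for Gold",
    "/other.html": "How We Tested",
}
print(list(remove_duplicates(pages)))
```

Near-duplicate detection (pages that differ only in navigation or ads) needs fuzzier techniques than an exact hash.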

As for the search engines, we asked that they include a spellchecker and support for phrase searching and stemming (grammatical variations) in addition to keyword searching. We also required a prebuilt search form or user interface to test the indexers and search engines.

We sent invitations to 11 vendors. Four stepped up to the table: CSIRO (Commonwealth Scientific and Industrial Research Organisation), Kanisa, Mondosoft and dtSearch Corp. Each sent software products to our Syracuse University Real-World Labs®.

The companies that dropped out, declined or just didn't qualify ran the gamut from small to large. Copernic Technologies didn't qualify because its product doesn't support ODBC or JDBC. Autonomy Corp. and EasyAsk declined to participate but gave no reason. Convera, Dieselpoint and Fast Search & Transfer each said it is working on a new version of its software and declined. Both Verity and Google declined to participate on the basis of company policy, though Verity was changing its policy as this article went to press.

Navigational Searching


As for our four contestants, we tested their ability to satisfy navigational searches by using Network Computing's production Web site (www.nwc.com), which contains almost 35,000 pages (see "How We Tested"). We also tested indexing and searching capabilities using informational searches taken directly from the log files on www.nwc.com. Three of the four products we tested performed above average. Only dtSearch came in under par.

We judged the search engines on their ability to retrieve content using an indexer, also called a spider or crawler. We put a heavy emphasis on the search process, including how much control the administrator could exert over it, and on overall performance in navigational searches. We also looked at each vendor's management console and how it handled installation, configuration and customization of the search engine. And we considered log files and reporting capabilities. Prices were compared across the board.

Panoptic Enterprise Search Engine won our Editor's Choice award. Its secure and easy-to-use administrative interface, navigational deftness and indexing prowess put it on top.

Panoptic achieved the best performance in our navigational searches, proved to be the best indexer out of the box and offers the best price-to-feature ratio. And best of all, installation and configuration were a breeze compared with those of Kanisa Site Search, which requires a number of postinstallation steps to configure IIS and enable the file system for use.

Although Panoptic is not as full-featured as Kanisa and MondoSearch, it has the most intuitive administrative interface to manage the search process. Once the installation from CD-ROM finishes, the system is almost ready to use with a Red Hat version of Linux.

Panoptic requires an Intel PIII 1-GHz processor with 512 MB of RAM and at least two 40-GB disk drives. Panoptic is flexible when it comes to the OS: It is the only participant that supports Linux, Windows and Sun Solaris, and the only product that supports SSL out of the box. The admin interface and the sample user interface support Internet Explorer, Mozilla and Netscape Web browsers. By default, the admin interface is available from a secure port (HTTPS, 443), while the sample user interface is available under the default port (HTTP, 80).

Like those of MondoSearch and dtSearch, Panoptic's sample user interface can be configured from the administrative interface. But without any configuration, the advanced-search form contained entries that leverage author and title metatags. You can also refine your query to search within your results if you receive too many hits. Panoptic supports all the major standards for metadata, including the Dublin Core. Our other participants support metadata but do not detail their support. You can also limit your search by document type and date.

To begin the search process, you create a collection--a finite set of Web pages to index and search. If you have logical divisions in your Web content, you can distinguish them by collection to facilitate search and retrieval. For example, you can create separate "collections" distinguished by content type: news, sales support. This can narrow a user's search and increase the number of relevant documents returned.

We created a Web collection by giving it an external display name "Network Computing Magazine" and a unique internal name "nwcmag." Then we identified the collection as our Network Computing production site. As with Kanisa and MondoSearch, you can confine the content collection to specific pages such as those on the www.networkcomputing.com site or its alias www.nwc.com. That way, the Web crawler will not detour and follow off-site links. You can also limit the discovery depth from the starting URL. All four search engines in this review support deep link limitation.

Panoptic supports its own Java-based crawler, called FunnelBack. When you set up a Web collection, you define how a crawler will gather data for the search engine to index. In the advanced settings, you can directly edit a collection configuration file that contains the options for FunnelBack. For example, you can limit the length of time the crawler runs. You can also configure a maximum number of pages to store, limit the number of clicks (links) away from the home page and define many other settings. We excluded a file type to disregard Netgravity links. All the crawlers have a similar feature that excludes certain directories or files from a crawler's scrutiny. This is in addition to following the directives in a robots.txt file.
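The crawl restrictions described above (stay on the site and its alias, limit link depth, skip excluded file types, obey robots.txt) can be combined into one decision function. This is a generic sketch, not FunnelBack's configuration format; the depth limit, excluded suffixes, user-agent string and robots.txt rules below are all assumptions made for illustration.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

ALLOWED_HOSTS = {"www.networkcomputing.com", "www.nwc.com"}
MAX_DEPTH = 5                          # hypothetical click limit from the start URL
EXCLUDED_SUFFIXES = (".gif", ".jpg")   # hypothetical excluded file types


def make_robots(lines):
    """Build a robots.txt parser from lines already fetched from the site."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


def should_crawl(url, depth, robots, agent="example-crawler"):
    """Return True only if the URL passes every crawl restriction."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    if parts.hostname not in ALLOWED_HOSTS:
        return False                   # do not follow off-site links
    if depth > MAX_DEPTH:
        return False                   # too many clicks from the start URL
    if parts.path.lower().endswith(EXCLUDED_SUFFIXES):
        return False                   # excluded file type
    return robots.can_fetch(agent, url)


# Hypothetical robots.txt content.
robots = make_robots(["User-agent: *", "Disallow: /private/"])
print(should_crawl("http://www.nwc.com/index.html", 1, robots))
print(should_crawl("http://www.example.com/index.html", 1, robots))
```

Checking the cheap rules first means robots.txt is consulted only for URLs that are otherwise in scope.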

Search Engine Features


FunnelBack took just under nine hours to crawl our production Web site and index 34,720 documents--more than any other participant. Once it completed the crawl, Panoptic made the collection's results immediately available to the default search form.

Because Panoptic does not provide a preview or prepublishing database to test before going live--as Kanisa and MondoSearch do--it has two options that protect you from putting a partially collected database into production. A changeover-percentage option specifies the minimum size a newly gathered collection must reach, vis-à-vis the collection it is replacing, before it is made available. In addition, Panoptic has a "vital_servers" option, which prevents an update from overwriting your production database if a server is down during the collection process.
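The logic behind these two safeguards can be sketched in a few lines. This is one plausible reading of the options, not Panoptic's code: the function name, its parameters and the exact comparison are assumptions.

```python
def safe_to_publish(new_count, live_count, changeover_percent,
                    vital_servers, crawled_servers):
    """Decide whether a freshly gathered collection may replace the live one."""
    # Refuse the update if any vital server was unreachable during the crawl.
    if not set(vital_servers) <= set(crawled_servers):
        return False
    # With no live collection to protect, any new collection is acceptable.
    if live_count == 0:
        return True
    # Require the new collection to reach the given percentage of the live size.
    return new_count * 100 >= live_count * changeover_percent


# Hypothetical figures: a 50 percent threshold against a 34,720-document collection.
print(safe_to_publish(34000, 34720, 50, {"www.nwc.com"}, {"www.nwc.com"}))
print(safe_to_publish(10000, 34720, 50, {"www.nwc.com"}, {"www.nwc.com"}))
```

Integer arithmetic is used for the percentage test to avoid rounding surprises at the threshold.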

Panoptic's easy-to-use administrative interface set it apart from Kanisa and MondoSearch. In addition to setting parameters, you can use a form to update collections using the crontab file; this is a multistep process for Kanisa and MondoSearch. Panoptic also has extensive log files, but does not provide the reporting that Kanisa does.

Panoptic Enterprise Search Engine, CSIRO (Commonwealth Scientific and Industrial Research Organisation). +61-2-6216-7060. www.panopticsearch.com

Kanisa Site Search put its best foot forward as a question-and-answer system. Users can enter a question using a sentence or phrase they would use in natural speech. Kanisa answers that question with a question, guiding the user to the most appropriate answer by grouping relevant Web-site content with the guidance question. Compared with that of our other participants, Kanisa's method of getting the end user to the appropriate answer was the most complicated and costly process. Although the result can be rewarding, it did not justify the means and the cost.

Once Kanisa was installed, it made 15,800 documents of Network Computing content available within eight hours. In the end, users access a sample user interface that can be generated and configured using ASPs or JSPs (Active Server Pages or JavaServer Pages). This makes the user interface more difficult to configure than those of the other products tested. But as long as you have the programming resources, a developer's programming guide is available to configure it to your site's requirements.

Kanisa was competitive in finding answers to our navigational questions, but fell short of Panoptic's prowess. For navigational searches, such as "Where is the editorial calendar," it returned an exact location for our 2001 editorial calendar. Because that did not satisfy the search, Kanisa included a pop-up box asking if we would like to see the rest of the results. Once we confirmed the need for more documents, it ranked our 2003 editorial calendar third on the list. If Kanisa cannot anticipate one correct answer to a query, it returns a list of documents relevant to the query grouped by guidance questions (see our Online First article, "Kanisa's Site Search: A Guided Search With NLP").

Although the documents available from Kanisa were the smallest set retrieved in our roundup, we did not consider this a disadvantage. Kanisa has the most control over the content-retrieval and indexing process. It breaks up the process into four segments--retrieve, normalize, summarize and index. During its spidering process to retrieve Network Computing content, it retrieved 58,590 URLs. But not all the documents were indexed because of errors in the normalization, summarization and indexing process.

The application console includes extensive reports generated from log files. We could view an exception report on Web site content that was not accepted for processing by status code (HTTP error code) and URL. Kanisa identified every URL--not just objects--that was not found. For example, it included GIF and JPG files that were not found. It also included dynamic content generated to print articles. After the retrieval process, Kanisa normalizes or examines each document for valid content, language and document characteristics. You can configure and filter out unwanted document types and convert supported binary documents to HTML for indexing.

Moving beyond the normalization process, Kanisa summarizes documents by extracting unique titles and descriptions from metadata or from the document text. Like MondoSearch, Kanisa lets you view document URLs that failed the summarizing process. But MondoSearch makes it much easier to drill down a directory tree and ferret out an error. Kanisa forces you to wade through the report by URL in alphabetical order.

Kanisa lets you implement special rules for indexing, but these affect all the collections and cannot be applied to discrete collections. Kanisa analyzes words, phrases and concepts in a document to determine their frequency and priority to come up with a score. Indexed scores determine a document's relevancy to an end user's question based on the natural language found in the question. You can assign a weight to a word or concept in a document based on its location in the document (such as title or body) and/or whether the word or concept is underlined, bolded or italicized. You can also configure metatag indexing and filter unwanted tags.
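Location-based weighting of this kind can be illustrated with a toy scoring function. It is only a sketch of the general idea: the weights, the emphasis bonus and the document layout are hypothetical, and Kanisa's actual scoring also considers phrases and concepts, which this example does not attempt.

```python
import re

FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}  # hypothetical weights per location
EMPHASIS_BONUS = 0.5                          # hypothetical bonus for bold/italic/underlined words


def score(document, term):
    """Weight each occurrence of a term by where it appears in the document."""
    term = term.lower()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = re.findall(r"[a-z0-9]+", document.get(field, "").lower())
        total += weight * words.count(term)
    # Words the author emphasized earn an additional bonus per occurrence.
    emphasized = [word.lower() for word in document.get("emphasized", [])]
    total += EMPHASIS_BONUS * emphasized.count(term)
    return total


# Hypothetical document: one title hit, two body hits, one emphasized hit.
doc = {
    "title": "Editorial Calendar",
    "body": "The calendar lists each issue. See the calendar.",
    "emphasized": ["calendar"],
}
print(score(doc, "calendar"))  # prints 5.5
```

Raising the title weight relative to the body weight pushes documents whose titles match the query toward the top of the results.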

Document information such as title, description, URL location and index scores are combined in an answer matrix. The matrix is a semantic index driven by a dictionary of words and phrases called the Knowledge Layer. The Knowledge Layer interprets content in a Web site and matches it to queries from users. When you are ready to test the contents of your answer matrix, you can schedule Kanisa to generate a snapshot of it.

We installed most of Kanisa's components onto one server. There are six components: the answer matrix, the Knowledge Layer, the online query engine, the offline processing engine, the application console and the Data Mart. Once you become acquainted with how Kanisa works, you can move the components to separate servers that exchange data using XML. But note the minimum hardware requirements: Pentium 4 (1 GHz), 512 MB of RAM and 40 GB of hard-disk space. All told, Kanisa would make a scalable search engine if your enterprise had a high priority to answer customer-support questions using Web-site content and external resources not available to HTTP--and you had a large budget for such a project.

Kanisa Site Search 5.0. Kanisa, (408) 863-5800. www.kanisa.com

Mondosoft's MondoSearch markets itself as a Web-site search engine created for the ordinary Web user. It aims to be user-friendly and capable of generating clear, useful results. Although we found this to be true, it did not perform as well as Panoptic or Kanisa in our navigational tests and came in second to Panoptic in indexing Network Computing's production Web site by making available 20,407 documents in just over five hours. MondoSearch took less time than Panoptic and Kanisa, but did not match the speedy indexing done by dtSearch.

MondoSearch was the only search engine we tested that could categorize Web site content automatically. Categories can provide context to searching and viewing results according to a document's subject matter, content type or other criteria. When you first use the crawler to grab content from a site, default categorization is applied and all pages are put in an "other" category. Once the initial site is grabbed and saved to the database, you can create categories unique to your site automatically or manually.

To generate automatic categories, you can make use of any metatags that your site uses to classify documents. InSite provides administration pages to virtually map a metatag to "content." For our production site, the "article type" metatag distinguishes content by type: review, feature, column, sneak preview, and so on. Once we mapped the "article type" tag to "content," we took the initial database of 20,184 documents and applied the new category rules using the crawler. MondoSearch automatically generated categories that could be applied to searches and added those categories to the user interface.
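Deriving a category from a metatag can be sketched with Python's standard HTML parser. This is not MondoSearch's mechanism, only an illustration of the mapping idea; the class and function names and the sample markup are hypothetical, while the "article type" tag name and the "other" default come from the description above.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect name/content pairs from the <meta> tags of a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content:
                self.meta[name.lower()] = content


def categorize(html, tag_name="article type", default="other"):
    """Return the page's category from the mapped metatag, or the default."""
    parser = MetaExtractor()
    parser.feed(html)
    return parser.meta.get(tag_name.lower(), default)


# Hypothetical page markup.
page = '<html><head><meta name="article type" content="review"></head></html>'
print(categorize(page))             # prints review
print(categorize("<html></html>"))  # prints other
```

Pages without the mapped tag fall into the default category, which mirrors the initial "other" grouping applied on the first crawl.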

Categories can be created manually using a Web form in the InSite administration tool. Once you create them, you can access a site map (grab map) of content that was grabbed to apply your newly created categories to directories. The grab map feature is unique to MondoSearch. It lets you graphically view the directory schema of downloaded content and drill down to the actual document retrieved and indexed. You can also view content that was grabbed but not indexed because of an error, a duplicate, or exclusion by design or by the robots.txt file.

MondoSearch's installation on Windows 2000 is not as complex as Kanisa's but is more involved than those of Panoptic and dtSearch. You need to configure IIS to enable the search engine. Virtual directories are required for the Web server to execute files and scripts from a CGI-BIN directory.

After the installation, you can administer and manage the search engine from a Web browser, just like Kanisa and Panoptic. MondoSearch supports both Internet Explorer 5.0 and above and Netscape 4 and above. The first time you access the Web-browser administration pages (InSite), a wizard helps you create an initial user, define a target host to grab and set a default language.

You can start the crawler right from InSite. Although the default settings for the crawler may be sufficient, you want to make sure the Master database size is set to zero. That way, MondoSearch will control the size of the database as your Web site grows and updates are applied.

The user interface is implemented from design settings configured in InSite. InSite can also set a default search method that applies Boolean logic to multiple keywords entered on the search form. You can select a default of AND (search for all terms) or OR (search for any terms). In addition, you can apply categories to searches and enable multiple category selections per search.
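The difference between the two defaults is easy to see against a small inverted index. The sketch below is generic and does not represent MondoSearch's implementation; the function name, the index contents and the document IDs are hypothetical.

```python
def boolean_search(index, keywords, mode="and"):
    """Combine per-keyword document sets with a default Boolean operator."""
    sets = [index.get(word.lower(), set()) for word in keywords]
    if not sets:
        return set()
    if mode == "and":
        return set.intersection(*sets)  # documents containing all terms
    if mode == "or":
        return set.union(*sets)         # documents containing any term
    raise ValueError("mode must be 'and' or 'or'")


# Hypothetical index: word -> IDs of documents that contain it.
index = {"fluke": {1, 2}, "network": {2, 3}, "inspector": {2}}
print(boolean_search(index, ["fluke", "network"], mode="and"))  # prints {2}
print(boolean_search(index, ["fluke", "network"], mode="or"))   # prints {1, 2, 3}
```

A default of AND favors precision, while OR favors recall and leans harder on ranking to surface the best match.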

Unlike the consoles of Kanisa and Panoptic, MondoSearch's InSite does not provide a method to automate the grabber and initiate or update the indexing process. You need to schedule the activity separately using the Windows Task Scheduler. Although you can run the crawler with options, they are command-line parameters interpreted by their position on the command line. It's clumsy, but you should need to set it only once.

InSite's log reporting capabilities lag behind Kanisa's, and its log files are not as extensive as Panoptic's. However, as with Kanisa, you can obtain extensive analytical reports of search-engine usage with the Behavior Tracking module. MondoSearch provides excellent control and management of the search process. It also provides an easy-to-configure sample search form with category search and retrieval at a moderate price.

MondoSearch 5.1. Mondosoft, (650) 462-2140, (800) 625-1175. www.mondosoft.com

dtSearch's Web 6.20 is an easy-to-use search engine for Microsoft Internet Information Server 4 or later. It includes the dtSearch desktop version and an indexer application that satisfied our minimum requirements. Although dtSearch does not come with as many bells and whistles as Kanisa or MondoSearch, it was the easiest to set up and install on a Windows 2000 Server.

Installable via CD-ROM, the dtSearch desktop version is used to create and manage indexes for file systems or Web sites. It also launches the Web installer and creates a Web search form. We decided to index our production Web site before launching the installer. When you set up the Web version, it makes previously generated indexes accessible in the Web search form. If you add or remove indexes from service after installing the Web version, you need to regenerate the search form created during the installation for the changes to take effect. This is not true if you simply update an index manually or if you automate the update using the Windows Task Scheduler.

As with our other participants, creating a collection or index involves configuring a spider to crawl a Web site. With dtSearch's spider, we had the fewest options but the fastest performance. It made more than 20,000 documents available within an hour. Unfortunately, in this case, haste made waste. It was our worst performer in the navigational search tests when using an automatic Boolean search for the keywords entered in a search form.

Once we had a working index of Network Computing content, we launched the Web installer from within the desktop version. dtSearch Web installs with a wizard that identifies the default Web site and a common directory to run script files. During installation, the wizard provides options to customize the user interface. You can select default search strategies from an extensive list of options including Boolean, proximity, and fuzzy searching that lets you find misspelled words in documents.
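Fuzzy searching can be approximated by matching a query word against similar words in the index vocabulary. The sketch below uses Python's standard difflib module and is not dtSearch's algorithm; the function name, the similarity cutoff and the sample vocabulary are assumptions.

```python
import difflib


def fuzzy_terms(query_word, vocabulary, cutoff=0.8):
    """Return indexed words that closely resemble the query word."""
    return difflib.get_close_matches(query_word.lower(), vocabulary,
                                     n=5, cutoff=cutoff)


# Hypothetical vocabulary taken from indexed documents, one word misspelled.
vocabulary = ["calender", "firewall", "network"]
print(fuzzy_terms("calendar", vocabulary))
```

Each returned word can then be looked up in the index as if the user had typed it, so documents containing the misspelling are still found.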

Turning to the report logs, dtSearch only keeps a log of exceptions. It does not keep a detailed crawler log like Panoptic does. When we viewed the exception log, there were only seven errors in the indexing process (HTTP 404 errors), and none of the errors related to any of the pages used in our navigational testing.

dtSearch comes cheap, has a snappy indexer, integrates easily with IIS and includes a usable search form out of the box. However, it lacks the features and control of its rivals in spidering and indexing content.

dtSearch Web. dtSearch Corp., (800) IT-FINDS. www.dtsearch.com

Sean Doherty is a technology editor and lawyer based at our Syracuse University Real-World Labs®. A former project manager and IT engineer at Syracuse University, he helped develop centrally supported applications and storage systems. Write to him at [email protected].

As more network resources and services become available over the Web, more content is being created and stored in file systems and databases accessible to Web browsers. That means it's getting more difficult for enterprises to find content relevant to a problem or reuse it to generate income. The solution: enterprise search engines.

Enterprise search engines find content within a firewall or secure VPN for employees, customers and partners. These are not Internet search engines tuned for link analysis and popularity contests. These are engines designed to leverage metadata and analyze and index content in a variety of documents. In addition, these engines are designed to traverse secure Web sites, file systems and databases.

We tested enterprise search engines from CSIRO, dtSearch, Kanisa and Mondosoft, scrutinizing each product's ability to index and search Web content and to manage and report on the search process. CSIRO's Panoptic Enterprise Search Engine took our Editor's Choice award because of its superior performance, ease of use, exceptional management and low price.


How We Tested

We ran the Kanisa, Mondosoft and dtSearch products on Windows 2000 Server (SP3) with dual Intel Pentium III processors (1 GHz), 1 GB of RAM and gigabit network links. Panoptic provided its own Linux operating system and ran on the same hardware platform, but over a 100-Mbps network link. The search engines were not tested for speed, so the different network links had no impact on the test results.

The search engines were tested for performance on Network Computing's production Web site (www.networkcomputing.com). In addition, search engine features were tested on the magazine's production site, Syracuse University Web sites (www.syr.edu/*), a test server in our Syracuse University Real-World Labs® (Sunfire 280R, Solaris 9, Apache 1.3), and the author's secure intranet server.

Results


To test search performance in the real world, we reviewed the log files of real keyword searches made against our production site from January to August 2003. The keywords we repeatedly searched are shown in the table below.

We studied the documents returned to determine their relevance. We tested performance by determining each engine's ability to navigate to certain Web pages on NWC's production Web site. We used a few of the searches above and simulated searches designed to find identified Web pages.

For example, when the keywords fluke network inspector were searched with each of the engines, all of them returned the Sneak Preview of Fluke's Network Inspector. But not all the engines ranked that particular document first. One ranked it second and another ranked it third. In this case, the engines that ranked it first would receive one point. The engines that ranked it second would receive two points. And the engines that ranked it third would receive three points. The engines with lower scores win (see performance results).
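The scoring scheme described above can be written out as a short sketch. The engine names, result lists and target pages below are invented for illustration and are not our test data. The description covers only documents ranked first through third, so the penalty for an engine that misses the target entirely is an assumption.

```python
MISS_PENALTY = 10  # hypothetical score when the target page is not returned


def rank_points(results, target, miss_penalty=MISS_PENALTY):
    """Score one search: 1 point for rank one, 2 for rank two, and so on."""
    try:
        return results.index(target) + 1
    except ValueError:
        return miss_penalty


def total_scores(runs, targets):
    """Sum the points each engine earns across all test queries."""
    return {
        engine: sum(rank_points(by_query.get(query, []), target)
                    for query, target in targets.items())
        for engine, by_query in runs.items()
    }


# Hypothetical queries, target pages and ranked results.
targets = {"fluke network inspector": "/sneak-preview.html"}
runs = {
    "Engine A": {"fluke network inspector": ["/sneak-preview.html", "/news.html"]},
    "Engine B": {"fluke network inspector": ["/news.html", "/sneak-preview.html"]},
}
scores = total_scores(runs, targets)
print(min(scores, key=scores.get))  # prints Engine A
```

Because lower totals win, the engine that most often ranks the sought-after page first comes out ahead.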

