Analysis: Enterprise Search
New search appliances claim to be uniquely adapted to meet enterprise needs. We tested eight enterprise search products and analyzed the technology's security and architectural implications. Our take: The math just doesn't add up.
June 22, 2007
IBM-Yahoo, Microsoft, Oracle, SAP ... why are these big dogs scrapping with specialized vendors like dtSearch, Vivisimo and X1 Technologies for a market Gartner pegged at a measly $370 million in 2006 worldwide revenue?
One word: Mindshare. We've all heard end users ask, "Have you Googled it?" You can't buy that kind of name recognition, and incumbent application vendors want to make darn sure IT groups don't begin to equate enterprise search with those shiny yellow Google appliances.
Moreover, it's clear that profit margins are anorexic, mainly because Google and IBM-Yahoo are exerting downward pricing pressure. Not that we're complaining--undercutting competitors is a time-honored, and customer-friendly, tactic. And yet, even at relatively attractive prices, this isn't a fast-growing market. Why is that?
After testing search products from dtSearch, Google, IBM, ISYS Search Software, Mondosoft, Thunderstone Software, Vivisimo and X1 in our Green Bay, Wis., Real-World Labs®, we think we know: Current products do not do a good job providing relevant results given large amounts of typical enterprise data. That includes Google's killer PageRank algorithm, which transformed the Web into a truly useful source of information. PageRank simply does not translate well to the enterprise search market. Web search is different from desktop search is different from federated enterprise search, and so far, no one vendor has pulled it all together.
We also found security concerns. An enterprise-search engine goes to work when a user points it at a file share. The software opens every document on the share--even those with sensitive information. Text and metadata from the file are extracted and indexed in an inverted index; most of the products we tested also cache document text or entire documents. We trust you see how that's a problem.
Worth The Price?
Big questions from the CIO: Why do we need this? What do these products really give an enterprise, aside from making it easier for sales folks to pull together proposals? Are they just pricey insurance against inevitable subpoenas? And how does cost stack up against functionality?
In a nutshell, end users are demanding the ability to gather knowledge from all corners of the enterprise, and as more information becomes digitized, being without sophisticated search will hamper productivity.
In addition, while recent changes in the Federal Rules of Civil Procedure didn't in themselves make implementing enterprise search a priority, they certainly jolted many IT groups out of complacency. Companies from Bank of America Securities to Philip Morris have shelled out billions in penalties for failing to provide business records. On a smaller scale, as information stores--databases, CMSs (content-management systems), file and Web servers--overflow with information, employees waste precious time digging for files that a federated-search engine might return in a fraction of a second.
As to the question of how well they work, the eight products we tested have some big customer names, strong staying power ... and not a whole lot of differentiation. You'll find some variation when looking at advanced federated search features that typically involve indexing systems like databases and CMSs, but beware of paying a premium: The biggest bang for your buck is in indexing file and Web servers. The advanced features vendors will try to sell you yield only marginally better results.
How It Works
At the heart of most search systems are three core pieces: A crawling/indexing engine, a query engine and a ranking/relevancy engine. Search vendors have varying names and different ways of segmenting these three pieces, but the paradigm is the same in the end. We did find an exception: X1 does not provide advanced search algorithms, such as fuzzy search or word stemming, or relevancy ranking. It approaches enterprise search with a unique, and more easily secured, desktop client-server architecture.
The crawling/indexing engine is responsible for retrieving documents and data from a source, say a database, file server or CMS, and placing the information into a data structure that can be searched efficiently. In most cases, the data structure is an inverted index. The crawling/indexing engine is also responsible for creating document caches, which are used for creating document "summaries" that are displayed on search-result pages.
The query engine searches for occurrences of keywords in the index and creates a list of documents that contain them. The relevancy/ranking engine is responsible for ordering documents such that, hopefully, those most useful to the user are at the top of the list.
On top of these core pieces vendors add algorithms that help improve search accuracy. An advanced indexing engine, for example, might index document metadata as well as text. A "fuzzy search" capability may find keywords that are misspelled in documents, while an advanced relevancy algorithm would give more weight to documents with more occurrences of key words in a smaller area.
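The three core pieces described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the function names, the sample documents and the occurrence-count relevance score are all assumptions made for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Indexing step: map each term to {doc_id: [positions]} (an inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def query(index, keywords):
    """Query step: return the ids of documents that contain every keyword."""
    postings = [set(index.get(k.lower(), {})) for k in keywords]
    return set.intersection(*postings) if postings else set()

def rank(index, keywords, doc_ids):
    """Ranking step: order matches by total keyword occurrences (a toy score)."""
    def score(doc_id):
        return sum(len(index[k.lower()][doc_id]) for k in keywords)
    return sorted(doc_ids, key=lambda d: (-score(d), d))

# Hypothetical sample documents standing in for a crawled file share.
docs = {
    "memo.txt": "Quarterly sales proposal for the sales team",
    "plan.txt": "Sales plan and proposal outline",
    "menu.txt": "Cafeteria lunch menu",
}
index = build_index(docs)
terms = ["sales", "proposal"]
hits = rank(index, terms, query(index, terms))
print(hits)
```

Because the index stores word positions, a proximity-aware ranker of the kind mentioned above could be layered on by rewarding documents whose keyword positions sit close together.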
How It's Built
We noted three basic architectures among the tested products. Most popular were Web-based offerings that are accessed solely through a browser. The products from Google, IBM, Mondosoft, Thunderstone and Vivisimo all fall under this category, with the Google and Thunderstone products configured as appliances.
In a desktop-based architecture, indexes are shared through file shares. This architecture is easier to program because developers don't have to write a complicated client-server communication protocol. Search clients access the index directly. Of course, this can create security problems if you don't carefully control what documents are contained in a shared index. Products from dtSearch and ISYS fall into this category.
Finally, X1 submitted a desktop client-server architecture, where the interface for entering queries is separate from the search engine. That means indexes can be secured with more fine-grained control; in fact, the query interface need not have any file-level access to the index, because it never reads the index directly.
Which is best for you? If you plan to use the search product as an interface to your file servers, instead of using a file explorer, choose a product with good desktop integration, as opposed to a Web interface, or one that provides a powerful API for creating custom search clients. If desktop integration isn't important or you'd rather avoid desktop search clients, go with a Web-based offering.
Once you decide on architecture, choosing a search product is easy, right? The product that gives the best search results is the one you want. Going into this review, our premise was simple: ID the killer algorithm that would give one entry an edge in providing relevant results.
Problem is, we didn't find any killer algorithms. We spent hours running the products over various data sets, but didn't even find much variance with advanced algorithms among the products tested. Where we found the most differentiation: how security is handled and product architecture.
Articles Of Federation
Google was clearly the odd duck in our lineup, not because its offering lacks features or competitive pricing--in fact, enterprises can thank Google for the price wars that have made search a relative bargain.
What's different is its algorithm. The beauty of PageRank is that it enlists millions of humans to do what a program does poorly--make subjective decisions. Because Google's PageRank algorithm works by assigning a higher relevance to Web pages that have a higher number of pages linking to them, it deals in objective data. It's extremely easy to write a program that makes objective decisions. Humans, on the other hand, are better at making subjective decisions. So, if 100 people link to page A, which contains content Z, but only five people link to page B, which contains content Z, chances are, page A has more accurate information about content Z and should thus have a higher ranking than page B. Taking the example one step further, page B might have more occurrences of keywords, and keywords might be bold and in headings more often, but page A will still have a higher relevance.
Problem is, the Web and a typical enterprise's cache of data are totally different animals (and Google may be the only vendor in both games right now). Consider the task of indexing the Web and returning relevant results. The Internet consists mostly of structured documents. When parsing an HTML file, it's easy for the parser to determine which words are highlighted--bold, italicized, headers--and to give these terms higher relevance. The parsed text is placed in an inverted index, which is used by a query engine to find matching results.
HTML documents also are structured in the sense that they contain links to other HTML documents. This second form of structure, using link analysis--essentially peer review--is the basis of Google's PageRank algorithm.
Before link analysis became mainstream, strict keyword searches did an OK job of finding relevant documents. But keyword searches are easy to fool by adding hidden text to HTML pages, and simply counting the number of times certain keywords appear in a document does not give a good measure of relevance. Link-analysis algorithms help solve both problems.
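To make the link-analysis idea concrete, here is a minimal sketch of the textbook power-iteration form of PageRank, applied to the page A versus page B example above. It is not Google's production algorithm; the damping factor of 0.85, the iteration count and the `linker` page names are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power iteration. `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# 100 hypothetical pages link to A; only five link to B.
links = {f"linker{i}": ["A"] for i in range(100)}
links.update({f"linker{i}": ["B"] for i in range(100, 105)})
ranks = pagerank(links)
print(ranks["A"] > ranks["B"])
```

Note that nothing in the calculation looks at keywords: page A outranks page B purely on inbound links, which is why the approach depends on a densely linked corpus like the Web.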
Of course, there will be far fewer results returned in the enterprise. In addition, end users often will have a good idea about document names, and your file servers are most likely partitioned by department, further reducing the number of documents users must wade through. The search products we tested have a number of automatic and configurable features that will help improve search-result relevances. All were easy to install and configure, with even the most challenging taking less than an hour.
Don't Look Now
Indexing information across the enterprise can undo all the security controls you've put in place to keep attackers at bay and employees honest, not to mention compliance with regs like SOX and HIPAA.
The first problem has more to do with knowing what information is on your file servers than it does with security. File servers often contain sensitive information, say a document containing passwords or an offer of employment. Indexing a file server will dredge this up quickly.
Obviously, your search system shouldn't return documents that a user wouldn't normally have access to. Better still, don't even let users know that certain forbidden fruit exists--giving summary info will open a can of worms.
This type of problem would most commonly occur with Web-based products, such as IBM's OmniFind Yahoo Edition, where the client is a Web browser. If users aren't authenticated by the Web server or credentials aren't passed to the Web server in some manner, the search software won't be able to check if the user has rights to selected documents. Then, when the user actually selects a document, a file protocol is used to retrieve it from the server, at which point security will be enforced.
Once the user is authenticated, the search products use three methods for checking privileges: cache security information (ACLs and/or LDAP objects) on the search application and check privileges against the cache, check privileges against an LDAP server or ACL from the originating server, or use the vendor's security API.
All three methods provide the same results, but there's an extra gotcha: Caching security information can boost performance by giving the search solution a fast, local place to verify user privileges, eliminating the need to go to the originating or LDAP server to test credentials for each document. However, cached security information isn't updated in real time. Updates occur only when files are recrawled, creating a lag between when rights are granted and revoked on the originating server and when the cached security info is updated on the search app.
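The lag between the originating server and a cached copy of its security information can be shown with a small model. This is a sketch under stated assumptions: `OriginServer`, `SearchApp` and their methods are hypothetical names, and real products check ACLs or LDAP objects rather than Python sets.

```python
class OriginServer:
    """Stand-in for the file server that owns the documents and their ACLs."""
    def __init__(self):
        self.acl = {}  # doc_id -> set of users allowed to read it

    def can_read(self, user, doc_id):
        return user in self.acl.get(doc_id, set())

class SearchApp:
    """Stand-in for a search product that can filter results two ways."""
    def __init__(self, origin):
        self.origin = origin
        self.cached_acl = {}

    def recrawl(self):
        # The cache is refreshed only when content is recrawled.
        self.cached_acl = {doc: set(users) for doc, users in self.origin.acl.items()}

    def filter_cached(self, user, doc_ids):
        # Fast local check, but only as fresh as the last crawl.
        return [d for d in doc_ids if user in self.cached_acl.get(d, set())]

    def filter_live(self, user, doc_ids):
        # One round trip to the originating server per document.
        return [d for d in doc_ids if self.origin.can_read(user, d)]

origin = OriginServer()
origin.acl = {"offer.doc": {"alice", "bob"}, "menu.doc": {"alice", "bob"}}
app = SearchApp(origin)
app.recrawl()

origin.acl["offer.doc"].discard("bob")  # rights revoked after the crawl

results = ["offer.doc", "menu.doc"]
stale = app.filter_cached("bob", results)  # still includes offer.doc
live = app.filter_live("bob", results)     # offer.doc is filtered out
app.recrawl()
fresh = app.filter_cached("bob", results)  # cache has caught up
print(stale, live, fresh)
```

The window between the revocation and the next `recrawl()` is the exposure the cached approach trades for speed.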
If you need the performance that using cached security information gives but can't budge on the security implications, all is not lost. The products we reviewed, except IBM OmniFind, let IT create multiple indexes and assign rights to those indexes. Then, if the product allows, security checking at the document level can be turned off. Say you use single-sign-on to grant access to a search utility to only those employees who have access to all the indexed content, for instance. Then, the search appliance doesn't have to check user privileges on every document returned in a query. Some of the search products let you grant users and groups access to particular indexes.
Another option is to provide multiple search servers. Using one index to provide search access to all employees isn't good practice--one index means one super-user account that can access all indexed content.
Bottom line, ensure your search mechanism is in lockstep with the security of the originating systems. Never forget that the search product stores content, maybe even a copy of the entire document, on a server separate from the originating server--meaning it is no longer governed by the rules that govern the original content.
Another area of consideration is privacy, concerning two areas where content is stored: e-mail and users' desktops. As with all indexed content, ensure that proper authentication and privilege checking are in place when providing search capabilities to an e-mail database.
Users' desktops may also contain sensitive information, but a bigger issue is that to index a local desktop and let people other than the local user search the index, you'd have to create shares on every desktop--a potential security nightmare if your LAN gets hit with a virus that spreads through file shares. Desktops are better served with local indexing and search programs like Google Desktop, OmniFind Yahoo! Edition or X1 Enterprise Client.
In our reviews, we discuss how each product handles security, including whether it supports caching ACLs.
SEARCH BY THE NUMBERS
40: Percentage of unit licenses for enterprise search sold that Google will provide by 4Q07. Source: Gartner
Less than 5 percent: Global 2000 companies that will have selected Google as their primary information access software vendor by 4Q07. Source: Gartner
10: Percentage of unit licenses for enterprise search sold that Microsoft will provide by 4Q08. Source: Gartner
$30,000: Cost for Google Search Appliance capable of searching as many as 500,000 documents. Source: Google
$57,670: Estimated price for Microsoft SharePoint Server 2007 for Enterprise Search. Source: Microsoft
$0: Cost for the OmniFind Yahoo! Edition to index as many as 500,000 documents (download at omnifind.ibm.yahoo.com). Support can be purchased for $1,999 per year. Source: IBM
CACHE AND CARRY
TO DECIDE whether you need enterprise search now or can wait for offerings to mature, you need an idea of how much time your employees spend searching for content, the location of the content that employees are seeking and what the information is used for.
For salespeople who must pull results from e-mail, a file server and a Web server to build a proposal, federated search can bring a lot of value. For remote employees who keep a lot of data on their local drives, a search app that integrates tightly with the desktop, like X1's Enterprise client, is ideal. These have a multilevel architecture--desktop agents plus server-based indexing. The client can index the local computer and communicate with a "cluster" to search server file shares.
If you plan to turn on caching, you'll have to determine if you want just text cached, with no images, or the entire document. Obviously, the latter will greatly increase the amount of space needed. On the flip side, if full-document caching is enabled, users will still be able to query and view documents when the originating source is down, provided security information is also cached, or the infrastructure is such that the search engine doesn't have to verify rights against the originating source.
Another caveat of caching depends on how the software indexes a document. Some vendors don't index common words, which helps reduce the size of the index. The downside is, a separate document cache must be created, by caching the whole document or just the document text. The cache is needed to generate page summaries. Other vendors index every word at the cost of a larger index and the benefit of not needing to create a separate document cache. Because every word is contained in the index, summaries can be generated from the index, but this method of generating summaries can be slower than generating summaries from a document cache.
Having an API available to plug into the search engine also may be of benefit. All the products we tested provide APIs to modify the behavior of some aspect of the product, like indexing and querying the index, and depending on the functionality provided by the API, developers could hook the search product into an ERP or SFA app. Thunderstone provides an XSL interface, for example, while dtSearch offers an API available in C/C++, COM, .Net and Java.
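The index-versus-cache trade-off for page summaries described above can be sketched as follows. The stop-word list, function names and sample sentence are assumptions for illustration only; no tested product is claimed to work exactly this way.

```python
STOP_WORDS = {"the", "a", "of", "and", "to", "is"}

def build_positional_index(text, skip_stop_words):
    """Map each word to the positions where it occurs, optionally dropping common words."""
    index = {}
    for pos, word in enumerate(text.lower().split()):
        if skip_stop_words and word in STOP_WORDS:
            continue
        index.setdefault(word, []).append(pos)
    return index

def summary_from_index(index, keyword, width=2):
    """Rebuild the words around the first hit by inverting the whole index."""
    by_pos = {pos: word for word, positions in index.items() for pos in positions}
    hit = index[keyword][0]
    first, last = max(0, hit - width), min(max(by_pos), hit + width)
    # Positions missing from the index (dropped common words) show up as gaps.
    return " ".join(by_pos.get(p, "...") for p in range(first, last + 1))

def summary_from_cache(cached_text, keyword, width=2):
    """Slice the summary straight out of the cached document text."""
    words = cached_text.lower().split()
    hit = words.index(keyword)
    return " ".join(words[max(0, hit - width): hit + width + 1])

text = "the price of the appliance is the cost of entry"
full_index = build_positional_index(text, skip_stop_words=False)
small_index = build_positional_index(text, skip_stop_words=True)

print(summary_from_index(full_index, "appliance"))   # complete summary, no cache needed
print(summary_from_index(small_index, "appliance"))  # gaps where common words were dropped
print(summary_from_cache(text, "appliance"))         # complete summary from the cache
```

In this toy version, `summary_from_index` has to invert the entire index before it can read off a handful of words, which hints at why index-generated summaries can be slower than a simple slice of cached text.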
NWC REPORTS: ENTERPRISE SEARCH
We asked for enterprise search products, either standalone server appliances or software. Entries could not require separate licensing of a database. We rated the products on APIs, usability, security, features and caching.
PARTICIPATING VENDORS
dtSearch, Google, IBM, ISYS Search Software, Mondosoft, Thunderstone Software, Vivisimo, X1 Technologies
TESTING SCENARIO
After installing the products we set them to index about 2 GB of data on a file server, then ran test queries. We decided not to grade based on accuracy of search results because this is a highly subjective area--one user's relevant document is another's clutter. Instead we focused on security, APIs and price as well as what constitutes a "standard" search feature and any high-end features that differentiate the tested products.
RESULTS
The dtSearch query engine sports an extensive API and gives users a fairly flexible syntax for modifying the engine's behavior. Take term and field weighting--the query "apple: 1 and pear: 5" will search for documents with both words, but give more relevance to pages that have the word "pear." Of course, users must master the syntax.
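The effect of per-term weighting can be illustrated with a toy scorer. This sketch does not parse dtSearch's query syntax or reproduce its scoring formula; the `weighted_score` function, the weight dictionary and the sample documents are assumptions.

```python
def weighted_score(text, term_weights):
    """Return 0 unless every term is present; otherwise sum weight * occurrences."""
    words = text.lower().split()
    counts = {term: words.count(term) for term in term_weights}
    if any(count == 0 for count in counts.values()):
        return 0  # an AND query requires all terms to appear
    return sum(term_weights[term] * counts[term] for term in term_weights)

# Hypothetical documents and the weights from the example query.
docs = {
    "orchard.txt": "apple apple apple pear",
    "pantry.txt": "apple pear pear",
    "cider.txt": "apple apple",
}
weights = {"apple": 1, "pear": 5}

scores = {name: weighted_score(text, weights) for name, text in docs.items()}
ranked = sorted((n for n in scores if scores[n] > 0), key=lambda n: -scores[n])
print(scores, ranked)
```

Here `pantry.txt` outranks `orchard.txt` even though it contains fewer matching words overall, because each occurrence of "pear" counts five times as much as an occurrence of "apple".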
Google's GSA provides strong security and a simple interface for defining what keywords trigger a call to the OneBox API--for example, the query "employee Joe Sixpack" could be redirected to an LDAP database and return content, like a picture and title, set apart in a formatted box.
All the products tested are intuitive, but IBM's OmniFind Yahoo edition absolutely buries rivals in terms of usability. A solid basic search product, the only major features it lacks are connectivity to more-advanced content systems and granular security options, though the product does provide an API that can be used to add more security features. And, the price is right (it's free).
ISYS is a solid performer, with a particularly interesting feature that lets users add more weight to a keyword, and thus more relevance to a document, when it's found in a specific meta field within a document. A user could tell the query engine to give higher relevance to documents that were authored by our friend Joe Sixpack, for example.
With today's search technologies, tuning is inevitable. Mondosoft provides a feature called Behavior Tracking that helps you find out what employees are searching for, and how successful their searches are. Although it's lacking in security, the system can generate summary reports.
Keeping content in the search index up-to-date is crucial. Thunderstone provides Adaptive Indexing to help keep indexes in tune; it works by automatically recognizing content that changes frequently and revisiting those documents regularly. Adaptive Indexing also reduces bandwidth use because it doesn't rescan every document on the same set schedule.
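One plausible way to implement this kind of adaptive revisit schedule is sketched below. The article does not describe Thunderstone's actual algorithm, so everything here is an assumption: the halve-on-change, double-on-no-change rule, the interval bounds and the `CrawlEntry` class are illustrative only.

```python
import hashlib

MIN_INTERVAL_H = 1    # assumed floor: never revisit more than hourly
MAX_INTERVAL_H = 168  # assumed ceiling: revisit at least weekly

class CrawlEntry:
    """Tracks one document and adapts how often the crawler revisits it."""
    def __init__(self, interval_h=24):
        self.interval_h = interval_h
        self.last_digest = None

    def visit(self, content: bytes) -> bool:
        """Record a crawl of `content`; return True if it changed since the last visit."""
        digest = hashlib.sha256(content).hexdigest()
        first_visit = self.last_digest is None
        changed = not first_visit and digest != self.last_digest
        if changed:
            # Content that changes gets revisited sooner.
            self.interval_h = max(MIN_INTERVAL_H, self.interval_h // 2)
        elif not first_visit:
            # Stable content is revisited less often, saving bandwidth.
            self.interval_h = min(MAX_INTERVAL_H, self.interval_h * 2)
        self.last_digest = digest
        return changed

busy = CrawlEntry()
for version in (b"v1", b"v2", b"v3"):
    busy.visit(version)

static = CrawlEntry()
for _ in range(4):
    static.visit(b"unchanged")

print(busy.interval_h, static.interval_h)
```

After three visits to a frequently edited document the interval has dropped from 24 hours to 6, while the unchanged document has backed off to the weekly ceiling.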
Although Vivisimo's interface is confusing, it does provide extensive and extensible parsing capabilities for use with structured content. The crawler can be told to ignore things like headers and footers on HTML pages, for example, or users could supply specific tags that should be ignored when the page is indexed.
X1's Enterprise Client provides tight integration with the desktop, letting users simultaneously query multiple X1 Servers and their desktops. Previewing a document changes the functions available on the toolbar; previewing an e-mail changes the toolbar to common buttons available for e-mail messages, for example, and clicking on one of the buttons brings up the mail client. X1 uses its own preview generator, so the client doesn't need a reader for every document type that's indexed, including PDFs.
ANALYSIS CRITERIA
• APIs
• Usability, including interface and ease-of-use features
• Security features, including whether the product caches security info and integrates with eDirectory, ActiveDirectory or another user directory
• Features, such as word stemming, synonyms, fuzzy search, proximity search
• Document- and text-caching features
Review: Enterprise Search Applications
We tested eight enterprise search products and in the process analyzed the technology's security and architectural implications. Our take: The math just doesn't add up.
dtSearch Corp dtSearch Text Retrieval Engine for Win & .NET
The dtSearch product's strength is its extensible API, and we liked that it gives users a fairly flexible syntax for modifying the engine's behavior. Take term and field weighting--the query "apple: 1 and pear: 5" will search for documents with both words, but give more relevance to pages that have the word "pear." Of course, users will have to master the syntax to see improved search results.
The dtSearch Engine is available for Windows and Linux; we reviewed the Win & .NET version, which includes an extensive API for extending almost every aspect of the product. APIs, available for C++, the .NET family, Java and more, are intended to enable this product to provide text search and retrieval to third-party applications, but dtSearch told us that around 80 percent of its customers are using the API in-house.
The dtSearch Engine consists of desktop applications for indexing and searching. A Web interface for searching is also provided; however, the Web interface requires that indexed files be under an IIS virtual directory. None of the other products reviewed had this limitation. Instead of using the client-server model, indexes are shared through file shares. The index can be stored on the same share as the indexed files, or it can be on its own file share. Search clients point to the index on the file share.
Out-of-the-box, we found security a bit lacking, though the API does provide an interface for securing the product. Because indexes are shared through file shares, default security on the index is inherited from the file system: Anyone with read access to the index and a dtSearch client can search the index. The product does not check rights against documents returned from the cache, either, meaning a user who has read access to the index but not to particular files contained in the index will still be able to view the contents of those files from the document cache.
These security problems can be avoided, without using the API, if read rights for all documents contained in an index are the same, and the index is on a file share with identical read rights; in those circumstances, security was handled correctly by the file system, automatically. However, if an index contains files from multiple file shares with different read rights, you'll have a problem: Say user A has access to file shares X and Y, and he creates an index on file share X containing files from X and Y. User B has access to file share X, but not Y. Because user B has access to the index on file share X, he will be able to search files on file share Y.
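The user A and user B scenario can be modeled in a few lines. This is an illustration of the access-control logic only, not of dtSearch's API; the share names, file names and both search functions are hypothetical.

```python
# Who can read each file share, per the scenario above.
share_readers = {"X": {"userA", "userB"}, "Y": {"userA"}}

# Files crawled by userA, keyed by "share/filename".
files = {
    "X/budget.txt": "budget forecast",
    "Y/salaries.txt": "salary budget details",
}

# The index lives on share X but holds text from both shares.
index = {"location": "X", "entries": dict(files)}

def share_of(path):
    return path.split("/", 1)[0]

def search_index_only(user, index, keyword):
    """Checks rights only on the share that holds the index (the leaky default)."""
    if user not in share_readers[index["location"]]:
        return []
    return sorted(p for p, text in index["entries"].items() if keyword in text)

def search_with_doc_check(user, index, keyword):
    """Additionally checks the user's rights on each document's own share."""
    return [p for p in search_index_only(user, index, keyword)
            if user in share_readers[share_of(p)]]

leaky = search_index_only("userB", index, "budget")
safe = search_with_doc_check("userB", index, "budget")
print(leaky, safe)
```

With only the index-level check, user B sees a hit from share Y; the per-document check is the kind of logic the article says must be added through the API when an index spans shares with different read rights.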
From our perspective, dtSearch's strongest feature is the API, which is a differentiator among rivals and can allow the product to overcome weaknesses with its security features and Web interface. We found the API more than adequate for customizing these features to exactly what you need. However, be aware that dtSearch has special licensing restrictions on creating products that interface with a Web browser.
The product was extremely fast at indexing files, and the search client provides a simple interface for creating queries. Most of the products tested provide a fairly extensive query syntax, which helps filter results and modify ranking on a per-query basis. Of the products tested, the dtSearch client provides the easiest and most comprehensive point-and-click interface to those query features to help users create advanced queries without having to remember exact syntax. These search options are available on the main search screen.
As tested, dtSearch for Win & .NET costs $999 per server, or $2,500 for three servers and $833 for each additional server, including unlimited users and documents. The company also provides pricing for creating both non-competing and competing, stand-alone search products from the dtSearch Engine and API.
Google Search Appliance
The stylish yellow Linux-based Google Search Appliance is sized based on the number of documents you intend to index. Every aspect of the device, from configuring network settings to searching an index, is available through an HTTP interface. Google's GSA provided strong security and a simple interface for defining what keywords trigger a call to the OneBox API--for example, the query "employee Joe Sixpack" could be redirected to an LDAP database and return content, like a picture and job title, set apart in a nicely formatted box.
The GSA checks security against the originating system on each result returned in a query; no caching of security info is supported. The GSA uses NTLM to verify privileges against SMB file shares. For more advanced security needs, the GSA provides access to an SAML (Security Assertion Markup Language) API, enabling creation of custom security modules. The process of specifying shares to crawl on the GSA interface was a bit awkward. And, after choosing files to crawl, it didn't look like we had set this up correctly because the GSA didn't immediately start indexing. In fact, we gave up and moved to another product after waiting five minutes! Returning to the GSA later, we saw that we had in fact specified the files to crawl correctly and the appliance had indexed them. Crisis averted.
The GSA has one of the most aesthetically pleasing UIs of the products tested, and pages aren't overloaded with information. The appliance also provides graphs detailing different aspects of performance.
The GSA API lineup isn't as extensive as that in dtSearch, but the functionality available through the APIs is impressive nonetheless. The HTTP-based Search Protocol can be used for submitting queries to the GSA and receiving results. The Feeds Protocol can create "connectors" to data sources for feeding data to the GSA so it can be indexed and searched.
And then there's the Google OneBox API.
The OneBox API let us define keywords that trigger queries to external systems and provides an XSLT for transforming the results. OneBox results are displayed apart from other results at the top of the page. This API provides an alternative to trying to index a content system that provides dynamic data--instead, the system can be queried in real time.
Google offers two different models of the appliance: Google Mini and GSA. Google Mini starts at $1,995 and supports 50,000 to 300,000 documents. The GSA is aimed at larger customers. It comes in three models that support up to 30 million documents. Google submitted the entry-level GSA model, its GB-1001, for this review. This device starts at $30,000 to search as many as 500,000 documents.
IBM OmniFind Yahoo Edition
IBM submitted its OmniFind Yahoo edition. While not quite what we consider enterprise-class, the product has a number of features, such as word stemming and multi-language support, that make it viable as a basic search tool. And did we mention it's a breeze to use? While all the products tested were intuitive, IBM's OmniFind Yahoo edition absolutely buried rivals in terms of usability. A solid basic search product, the only major features it lacks are connectivity to more-advanced content systems and granular security options, though the product does provide an API that can be used to add more security features. And, the price is right (it's free).
OmniFind Yahoo is a software-based product, available for Windows and Linux, that is controlled through an HTTP interface. At its core is the open-source Apache Lucene index and retrieval engine, which is written in Java. The software is intended to be installed on a standalone server and accessed through a Web browser.
Security is lacking on this product: The Web site is secured with only basic authentication, and user rights are not checked against returned results. If document caching is turned on, any user with access to the search page will be able to view all indexed content, regardless of file system rights. However, if a user doesn't have access to a particular document and tries to open it, file system rights are enforced.
At first glance, this glaring hole seems to knock the product into the useless category, but it's saved by the fact that it's free software and you can provide a secure login in front of the search page, limiting use to only those with access to indexed files. Sure, you'll need an instance of OmniFind Yahoo Edition for each file server, but that's OK because the price is right. The software also includes APIs that can be used to implement more robust security. Also note that IBM offers more "enterprise ready" products with advanced security features.
The highlight of this product is its ease of use: It was by far the most intuitive entry in every aspect of installation and configuration. The Web pages are likewise aesthetically pleasing and uncluttered. Now, the flip side of "uncluttered" is that it lacked many tunable parameters that all of the other products provided. Where rivals offer, say, 20 to 30 options on a given page, OmniFind provides maybe five. But hey, they're arguably the five most useful settings.
Another area where the product is lacking is crawl scheduling. It does adaptive indexing for Web servers, but not for file servers. The file server indexer runs on startup, or it can be run manually from the Web site or the command-line, but it can't be scheduled within OmniFind. Because it can be started from the command-line, as a workaround you could use Windows task scheduler or cron on Linux to schedule file server indexing.
Even with these gripes, OmniFind Yahoo Edition is a fairly full-featured search product for the price. Note that its license restricts you from indexing more than 500,000 documents; support can be purchased for $1,999 per year.
ISYS Desktop 8
ISYS provides a number of search products to meet enterprise needs, including ISYS Web, ISYS Desktop and ISYS sdk. We reviewed ISYS Desktop 8 and found it a solid performer, with a particularly interesting feature that allows users to add more weight to a keyword, and thus more relevance to a document, when it's found in a specific meta field within a document. For example, a user could tell the query engine to give higher relevance to documents that were authored by a specific person.
In terms of architecture, ISYS Desktop is similar to dtSearch, but the ISYS search client is much more feature-rich. Indexes are shared with other desktop clients through file shares. ISYS Desktop runs on Windows only.
The UI, while aesthetically pleasing, is a bit cluttered. We had many different options for searching, such as natural language and menu assisted; sorting, for example, by document type and relevance; and viewing documents, such as WYSIWYG and speed. Besides search results, the UI also shows categories and entities related to the search, in separate panels. By default, categories are defined by directory, but user-defined categories can be created using metadata. Entities, such as e-mail addresses, Web sites and people, can also be user defined.
When performing a search, security is checked on each document returned, which can reduce performance significantly. The product supports caching ACLs, which will help improve performance, but you'll have to make sure the cached list is kept up-to-date to ensure proper security is enforced. As a side note, ISYS Web checks security only on items returned in a given Web page, not for every document found for a given search. Additional security methods can be implemented using the ISYS sdk, which is available as a separate product.
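The trade-off between per-document checks and cached ACLs can be sketched in a few lines. This is a generic illustration, not ISYS code: `lookup_acl` stands in for whatever call reads permissions from the file system, and the five-minute expiry is an assumed value.

```python
import time


class CachedAclChecker:
    """Trim search hits by user rights, caching each ACL for a limited time."""

    def __init__(self, lookup_acl, ttl_seconds=300.0, clock=time.monotonic):
        self._lookup = lookup_acl   # hypothetical: path -> iterable of allowed users
        self._ttl = ttl_seconds     # how long a cached ACL is trusted
        self._clock = clock         # injectable so expiry can be tested
        self._cache = {}            # path -> (expires_at, frozenset of users)

    def allowed(self, user, path):
        now = self._clock()
        entry = self._cache.get(path)
        if entry is None or entry[0] <= now:
            # Cache miss or expired entry: pay for the slow lookup again.
            entry = (now + self._ttl, frozenset(self._lookup(path)))
            self._cache[path] = entry
        return user in entry[1]

    def trim(self, user, hits):
        """Keep only the hits this user may see."""
        return [path for path in hits if self.allowed(user, path)]
```

The weakness the review warns about is visible in the design: if rights on a document are revoked, the user keeps seeing it in results until the cached entry expires. A shorter expiry narrows that window at the cost of more lookups.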
ISYS Desktop comes with a limited API. If the default product doesn't meet your needs, you'll have to purchase the ISYS sdk. The sdk can be used through a Windows DLL, COM, Java and .NET. Among other things, the sdk allows programmers to create custom connectors for data sources, like databases, and design custom security modules.
One of the coolest features of the ISYS Desktop client is the ISYS Browse file viewer. Double clicking a file returned in a search opens the document in ISYS Browse, with the search terms highlighted and entities underlined. Navigation buttons allow users to navigate forward and backward through the highlighted terms. Across the top of the ISYS Browse window is a row of 18 buttons that let us do things like run a query on selected text, or add annotations, such as notes, hyperlinks and graphics, to a document. Annotations are stored with the index and can be viewed only from ISYS Browse; they don't modify the original document.
Other top ISYS Desktop features are automatic categorization, entity extraction for classification and a menu-assisted query feature. The product provides an interface for selecting files to index that is almost as easy to use as the one in OmniFind Yahoo.
Pricing for ISYS Desktop 8 starts at $1,000 for a network license and $100 per seat. There's no limit on the number of documents that can be indexed. Yearly support is available for around 20 percent of the licensing cost. ISYS also provides a site license option.
Mondosoft MondoSearch v. 5.3; BehaviorTracking v. 5.2; InformationManager 5.3
With today's search technologies, tuning is inevitable. Mondosoft provides a feature called BehaviorTracking that will aid in this process by helping you find out just what employees are searching for, and how successful their searches are. Though it's lacking in security, the system can generate and e-mail summary reports. Mondosoft also augments its MondoSearch enterprise search product with InformationManager, a tool that provides a link between MondoSearch and BehaviorTracking. MondoSearch collects information for BehaviorTracking, then InformationManager can be used to modify the behavior of MondoSearch. All of the products are controlled through an HTTP interface. For this review, we looked at MondoSearch only.
The three products mentioned above are licensed separately but are all needed to realize the full potential of enterprise search that Mondosoft offers. MondoSearch and InformationManager together will give about 98 percent of the functionality that the other products in this review offer. MondoSearch by itself gives just basic indexing and searching options.
InformationManager adds features like SearchNames, which links a query to content (read: BestBets), and synonym handling, which can either suggest synonyms for queries that don't return results or automatically include synonyms in a query. BehaviorTracking provides the analysis needed to configure InformationManager to help MondoSearch improve relevancy of results.
Out of the box, MondoSearch integrates with Active Directory for security. For more advanced security features, like ACL caching or integration with eDirectory, you'll have to depend on a third-party consultant to provide the needed functionality--Mondosoft does not expose its APIs, with the exception of an extensive API for querying the search engine through a Web service.
Some of the other products in this review provide basic analysis of how users are using the system, but Mondosoft intends BehaviorTracking to go above and beyond. An e-commerce company, for example, could quickly find which products customers are unable to locate. It could then update product descriptions to include the keywords users are expecting, use synonyms to direct the query to more relevant keywords or use SearchNames to link queries to results.
Mondosoft also has a unique feature among search products called BehaviorMatch. BehaviorTracking will recognize unsuccessful queries and how users rephrase their queries to get relevant results. Based on those results, BehaviorTracking sends suggested synonyms to InformationManager. Synonyms can be automatically added to a search, or suggested as alternative searches at the top or bottom of the result page.
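The general idea of learning synonyms from rephrased queries can be illustrated with a short sketch. Mondosoft does not publish how BehaviorMatch works, so everything here is an assumption: the session log format, the rule that a zero-result query followed by a successful one forms a candidate pair, and the `min_support` threshold.

```python
from collections import Counter


def suggest_synonyms(sessions, min_support=2):
    """Map failed queries to the rephrasing that most often succeeded.

    sessions: list of sessions, each a list of (query, result_count) tuples
    in the order the user typed them (assumed log format).
    """
    pairs = Counter()
    for session in sessions:
        for (q1, n1), (q2, n2) in zip(session, session[1:]):
            failed, worked = q1.lower(), q2.lower()
            # A query with no results, immediately followed by one with results.
            if n1 == 0 and n2 > 0 and failed != worked:
                pairs[(failed, worked)] += 1

    suggestions = {}
    # most_common() yields the best-supported rephrasing for each query first.
    for (failed, worked), count in pairs.most_common():
        if count >= min_support and failed not in suggestions:
            suggestions[failed] = worked
    return suggestions


log = [
    [("laptop bag", 0), ("notebook case", 12)],
    [("laptop bag", 0), ("notebook case", 9)],
    [("laptop bag", 0), ("computer sleeve", 3)],
    [("vacation", 0), ("pto policy", 4)],
]
synonyms = suggest_synonyms(log)  # {"laptop bag": "notebook case"}
```

Requiring a rephrasing to be seen more than once keeps a single user's odd query from becoming a site-wide synonym, which is why "vacation" produces no suggestion in this sample log.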
MondoSearch, BehaviorTracking, and InformationManager purchased together are priced as follows: up to 50,000 documents for $10,000, up to 100,000 documents for $18,000, up to 500,000 documents for $60,000, or over 500,000 documents for $120,000. The product tops out at around 5 million documents. Support and upgrades run 20 percent of the purchase price per year.
Thunderstone Software Thunderstone Search Appliance 1000
Keeping content in the search index up-to-date is crucial. Thunderstone provides a feature called Adaptive Indexing to help indexes stay in tune; it works by automatically recognizing content that changes frequently and revisiting those documents on a more regular basis. Adaptive Indexing also reduces bandwidth usage because it doesn't rescan every document on the same set schedule.
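One simple way to get this behavior is to shorten a document's revisit interval each time it is found changed and lengthen it each time it is not. The sketch below shows that rule only; Thunderstone does not describe its actual algorithm here, and the halving, doubling and hour limits are assumed values.

```python
def next_interval(current_hours, changed, min_hours=1.0, max_hours=168.0):
    """Return the next revisit interval for one document.

    Halve the interval when the document changed since the last visit,
    double it when it did not, and clamp to [min_hours, max_hours].
    """
    proposed = current_hours / 2 if changed else current_hours * 2
    return max(min_hours, min(max_hours, proposed))


# A document that changes on every visit converges on the minimum interval,
# while a static one drifts out to a weekly check.
busy = 24.0
static = 24.0
for _ in range(6):
    busy = next_interval(busy, changed=True)
    static = next_interval(static, changed=False)
```

The bandwidth saving the review mentions follows from the second case: static documents are fetched less and less often instead of on every scheduled crawl.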
Like Google Search Appliance, the Thunderstone Search Appliance is controlled through an HTTP interface. Unlike the GSA, however, the Thunderstone UI is not aesthetically pleasing--some of the pages are jam-packed with settings. The Linux-based Thunderstone Search Appliance, which is sized based on the number of documents you intend to index, does not cache ACLs. Security is checked against every document that's returned in a search.
Thunderstone's internal architecture is unique among text search and retrieval products. Most vendors use either an inverted index, or a combination of an RDBMS and an inverted index to create a hybrid system. For over 10 years, Thunderstone has been developing an integrated platform that it calls Texis. Texis can store text documents of unlimited size within standard database tables. Thunderstone provides an HTML interface to Texis; it can also be accessed through a C interface, a Perl module and an ODBC driver, but Thunderstone recommends using the HTML interface.
So where's the benefit of a hybrid system? Because the storage engine is an RDBMS, storage and retrieval can be done using standard SQL syntax. Most other vendors are stuck creating proprietary interfaces to an inverted index that is likely not as full-featured as SQL. Given the power and flexibility of SQL, Thunderstone should have no problem adding new features to its core engine. Besides sitting at the core of the Thunderstone Search Appliance, Texis is sold as a standalone product with developer APIs.
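The appeal of putting document text in database tables is that text matching and ordinary metadata filters combine in one query. The sketch below illustrates that with SQLite from Python's standard library as a stand-in; it is not Texis, does not use Texis's SQL dialect, and the table layout and sample rows are invented for the example.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table holding document text alongside metadata."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE docs ("
        "id INTEGER PRIMARY KEY, author TEXT, modified TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO docs (author, modified, body) VALUES (?, ?, ?)",
        [
            ("ann", "2007-05-01", "Quarterly search appliance budget"),
            ("bob", "2007-06-01", "Search relevance tuning notes"),
            ("ann", "2007-04-15", "Holiday schedule"),
        ],
    )
    return conn


def search(conn, term, author=None):
    """Find documents whose body contains `term`, newest first."""
    sql = "SELECT id, author FROM docs WHERE body LIKE ?"
    params = [f"%{term}%"]  # SQLite's LIKE ignores ASCII case by default
    if author is not None:
        sql += " AND author = ?"  # metadata filter in the same statement
        params.append(author)
    sql += " ORDER BY modified DESC"
    return conn.execute(sql, params).fetchall()


conn = build_demo_db()
all_hits = search(conn, "search")          # [(2, 'bob'), (1, 'ann')]
ann_hits = search(conn, "search", "ann")   # [(1, 'ann')]
```

A `LIKE` scan is far cruder than a real text index, so this shows only the query-side convenience, not the performance of a purpose-built engine.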
Though Thunderstone offers all the main features that others in this space provide, its UI was by far the worst of the bunch. Fortunately, the product does offer two decent ways to configure how results are displayed: If you want to use the appliance as a standalone product, search results can be configured by modifying an XSL file. As mentioned, Thunderstone also offers a well-defined HTTP interface to the search engine, giving users the ability to integrate search and format results however they see fit. These options will fix the search results page, but they don't help the drab appearance of the UI for managing the appliance.
Thunderstone offers a Small Businesses Edition of its appliance that can index as many as 50,000 documents for $2,495. Appliances for larger organizations range from $10,000 for 250,000 documents to $300,000 for 15 million documents. The appliance designed for 500,000 documents is $15,000.
Vivisimo Velocity 5.1
Vivisimo's enterprise search platform is called Vivisimo Velocity. The Velocity platform consists of three modules: Search Engine, Content Integrator and Clustering Engine. Vivisimo Velocity is available for Windows, Linux and Solaris and includes a Web-based interface. Velocity provides the best out-of-the-box support for different security architectures among the products tested, supporting ACL caching as well as SSO, Integrated Auth, OASIS and PKI. Velocity also gave us the option for providing content-level security; when content is a virtual document consisting of pieces from several sources, security is checked against each piece of the virtual document. Out of the box, Velocity integrates with Active Directory and eDirectory. These distinctions earned it our Editor's Choice.
The Content Integrator provides federated search capabilities. Velocity goes beyond other federated search options in this roundup, which index content from disparate sources; Content Integrator will actually query external sources, combine results with the Search Engine, and create categories of all results through the Clustering Engine. The Clustering Engine then groups like documents.
Though Velocity's is one of the most aesthetically pleasing UIs of the products reviewed, it's also the most confusing: Terminology was quite different from the other products, and there were so many configuration options and pages, it was hard to discern what the options modified. Another complaint we had with the UI is that page updates are done through page refreshes, rather than using Ajax. This really is a small gripe, but given how polished the product is, we found it odd that Vivisimo didn't go the extra step to create a more interactive Web experience.
The Velocity platform also provides the most configurable search functionality in this roundup. Even so, Vivisimo still offers an API, available for most common programming languages, for extremely complex or specialized situations.
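The federated flow described for Content Integrator, querying several sources live, merging the hits and then grouping like documents, can be sketched roughly as follows. This is not Vivisimo's implementation: the source callables, the assumption that scores are comparable across sources, and the naive "most common shared word" labeling are all invented for illustration.

```python
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def federated_search(query, sources):
    """Query every source at search time and merge hits by score.

    sources: name -> callable(query) returning [(title, score)], with
    scores assumed to be on a shared 0..1 scale.
    """
    merged = []
    for name, search in sources.items():
        for title, score in search(query):
            merged.append({"source": name, "title": title, "score": score})
    merged.sort(key=lambda r: r["score"], reverse=True)
    return merged


def cluster_by_label(results, query):
    """Group results under the most widely shared word in their titles."""
    skip = STOPWORDS | set(query.lower().split())
    counts = Counter()
    for r in results:
        counts.update(set(r["title"].lower().split()) - skip)
    clusters = defaultdict(list)
    for r in results:
        words = set(r["title"].lower().split()) - skip
        # Ties on count are broken alphabetically so the output is stable.
        label = max(words, key=lambda w: (counts[w], w), default="other")
        clusters[label].append(r["title"])
    return dict(clusters)


sources = {
    "index": lambda q: [("Search appliance pricing", 0.9),
                        ("Search security audit", 0.6)],
    "wiki": lambda q: [("Appliance security checklist", 0.8)],
}
merged = federated_search("search", sources)
clusters = cluster_by_label(merged, "search")
```

In this toy run, the two security-related titles from different sources end up in one cluster, which is the user-visible payoff of clustering federated results rather than listing them per source.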
As an interesting side note, Velocity was the only entry to fully utilize the processors on our test server (dual dual-core 3GHz Xeons with 4GB of RAM). The processors were nearly pegged for the entire time that the software was crawling and indexing. But it was all for naught, as Velocity was not the fastest product in terms of crawling and indexing. We think part of the reason for its relatively slow performance is that it first creates a document cache of all documents selected to be indexed. After fetching all of the documents and creating the cache, the index is created, meaning that each document is basically opened and read twice. Vivisimo sent us tips for optimizing the product, which shaved ten minutes off the total time, but out of the three products that created similar caches, Velocity was still the slowest by about thirty seconds.
Velocity's strongest features are its Clustering Engine, extensive list of supported security infrastructures and Content Integrator. Though Velocity goes above and beyond what other products offer in terms of out-of-the-box security options, the security capabilities provided by other vendors in this roundup will suffice for most customers. Content Integrator is cool, but not earth shattering.
The feature that really sets Velocity apart is the Clustering Engine, a standalone entity that is not tied specifically to the Velocity Search Engine--it can be used with other search software as well. The Clustering Engine can be interfaced directly through an API or through an HTTP interface using XML. It automatically creates "labels" from the text of search results and groups like documents together in a cluster. By default, the clustering engine creates clusters from the top 200 documents returned to a query.
Vivisimo says the average selling price (per new customer license) of the complete Vivisimo Velocity search platform is over $200,000.
As a subset application, Velocity's annual price for 500,000 documents and federation of as many as five third-party content sources is $50,000 per deployment. The annual list price for 2 million documents and federation of as many as 10 third-party content sources is $125,000. All annual pricing is for a single deployment and includes support services and product updates for that year. Discounts are available with a multi-year commitment.
X1 Technologies X1 Enterprise Platform v5.6.3
X1's Enterprise Client provides very tight integration with the desktop, allowing users to simultaneously query multiple X1 Servers along with their desktops. Previewing a document changes the functions available on the toolbar; for example, previewing an e-mail changes the toolbar to common buttons available for e-mail messages, and clicking on one of the buttons brings up the mail client. X1 uses its own preview generator, so the client does not need a reader for every document type that's indexed, including PDFs.
X1 Enterprise Platform comprises the X1 Enterprise Client and the X1 Enterprise Server. Unlike dtSearch and ISYS, Enterprise Platform is a client-server architecture; indexes are not shared through file shares. The X1 Enterprise Platform is available for Windows only, and the Enterprise Server can be installed only on a server that's part of a Windows domain. X1 Enterprise Clients that want to connect to an X1 Enterprise Server must be part of the same Windows domain. X1 Enterprise Servers can also be accessed through a Web browser, but the HTTP interface lacks many of the features available to the desktop client.
The X1 Enterprise Server secures access to content through Active Directory. It will not integrate with eDirectory, but it does support ACL caching. Given its integration with Windows domains, your infrastructure may already be set up to handle the X1 Enterprise Platform.
X1 provides client and server SDKs. The client SDK can be used to create custom Enterprise Clients; it can also allow the client to more thoroughly integrate with other desktop apps. The server SDK can be used to create custom connectors for pulling data from disparate systems into the X1 Server for indexing and searching.
The X1 Enterprise Platform is unique among the products tested--not necessarily better or worse, just unique. For example, when no query is entered in the client, the results pane shows a list of all documents available from the selected indexes. As a user types in a query, results in the results pane are automatically filtered. The X1 Enterprise Platform lacks many features that most of the other products support, like word stemming and synonyms, but their absence reflects a different philosophy of enterprise search.
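The filter-as-you-type behavior can be pictured with a few lines of code. This is only a sketch of the interaction model, not X1's implementation; the document records and the plain substring matching are assumptions.

```python
def filter_as_you_type(documents, typed):
    """Return the documents that contain every term typed so far.

    An empty query keeps the whole list, and each added term can only
    narrow it. Matching is plain substring matching on names and
    metadata, with no stemming and no synonyms.
    """
    terms = typed.lower().split()
    kept = []
    for doc in documents:
        haystack = " ".join(str(value) for value in doc.values()).lower()
        if all(term in haystack for term in terms):
            kept.append(doc["name"])
    return kept


documents = [
    {"name": "budget-2007.xls", "author": "ann"},
    {"name": "budget-notes.doc", "author": "bob"},
    {"name": "holiday.doc", "author": "ann"},
]
everything = filter_as_you_type(documents, "")       # all three documents
narrowed = filter_as_you_type(documents, "bud")      # the two budget files
final = filter_as_you_type(documents, "bud ann")     # only ann's budget file
```

Because substring matching already catches partial words as the user types, the absence of stemming matters less in this model than it would in a conventional query box.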
X1's take is that users know a lot about the data they're searching and can provide everything that's needed to properly filter the list through keywords and metadata. Rather than give users a list of nothing, which most search engines do, X1 gives everything and lets the user whittle down and decide which documents are relevant.
X1's strongest feature is the X1 Enterprise Client's integration with the desktop. X1 also provides keyword highlighting in PDF documents when they're opened with Adobe.
The X1 professional version is priced at $50 per user, with support and volume discounts available.
Ben Dupont is a systems engineer for WPS Resources in Green Bay, Wis. He specializes in software development. Write to him at [email protected].