A Quick Look at Concept Search
With most meet-and-confer still concentrating on keyword and Boolean searches, is concept search worth the investment in supporting products and time? The answer is sure -- if you know what you're doing.
July 3, 2009
Let's talk concept, or conceptual, search. This type of search does not replace familiar keyword and Boolean searches but expands searching capabilities. Most eDiscovery collection and search products support concept search, although they differ in the specifics. Here's the question: with most meet-and-confer still concentrating on keyword searches, is concept search worth the investment in supporting products and time? The answer is sure -- if you know what you're doing.
Can it be useful? Yes, of course. The New Jersey Law Journal reported a dramatic example of an internal investigation into suspected embezzlement. The company had its suspicions but a keyword search for terms related to banks, accounts and deposits turned up nothing meaningful. Then the company ran a search for clustered and threaded terms. They came up with a large number of baseball-related discussions between two men who were not sports fans. The company matched the terms and email dates to bank transfers, exposing the embezzlers and their code words.
However, this entertaining example is not the major reason for using concept search. One of the real challenges of keyword and Boolean searches is that a party may insist that the opposing party use dozens, even hundreds, of keywords to search. The idea is that such searches are simple to carry out, but in practice the returned data sets are very large. This strains storage resources and processing cycles and, most importantly, adds huge burdens to the already expensive manual review process. Concept search ideally fixes this problem by significantly improving search accuracy, which results in smaller and more relevant data sets without lawyer involvement. When the lawyers do go to review results, they are dealing with a much smaller and far more accurate set of data.
As usual, Sedona weighs in with useful guidance.
"Alternative search tools are available to supplement simple keyword searching and Boolean search techniques. These include using fuzzy logic to capture variations on words; using conceptual searching, which makes use of taxonomies and ontologies assembled by linguists; and using other machine learning and text mining tools that employ mathematical probabilities."Sounds great! What's the issue? The issue is that people confuse two different concept evaluation technologies: 1) concept search and 2) concept categorization. They are both useful and I recommend you use both, but they are not the same thing. Know the difference - and make sure that your vendor knows it too.
Concept search, as opposed to categorization, takes the search string and expands it with related terms. For example, with traditional keyword search lawyers must input both "bank" and "account." They can use wildcards on "bank" to match variants such as "banking" and "banks," but "bank*" will never return "account" or "deposit." A concept search can work from a lexicon that returns not only "bank" and strings related to the keyword, but also "deposit," "account," "funds," "withdrawal," "transfer," and others depending on the search parameters. (Many lexicons are customizable for industry-related terms.)
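To make the difference concrete, here is a minimal sketch of lexicon-based query expansion next to a plain wildcard search. It is an illustration only, not how any particular eDiscovery product works: the lexicon contents, the sample documents and the function names (expand_query, wildcard_search, concept_search) are all assumptions made up for this example.

```python
import re

# Hypothetical lexicon: a keyword mapped to conceptually related terms.
# Real products ship far larger, often customizable, lexicons.
LEXICON = {
    "bank": ["deposit", "account", "funds", "withdrawal", "transfer"],
}

# Made-up sample documents for illustration.
DOCS = [
    "Please confirm the banking details by Friday.",
    "The deposit cleared and the funds are available.",
    "Lunch on Thursday?",
]


def expand_query(keyword, lexicon=LEXICON):
    """Return the keyword plus any related terms found in the lexicon."""
    return [keyword] + lexicon.get(keyword, [])


def wildcard_search(prefix, documents):
    """Mimic a 'bank*' search: match any word that starts with the prefix."""
    pattern = re.compile(r"\b" + re.escape(prefix) + r"\w*", re.IGNORECASE)
    return [doc for doc in documents if pattern.search(doc)]


def concept_search(keyword, documents, lexicon=LEXICON):
    """Expand the keyword through the lexicon, then match any expanded term."""
    terms = expand_query(keyword, lexicon)
    alternatives = "|".join(re.escape(term) for term in terms)
    pattern = re.compile(r"\b(?:" + alternatives + r")\w*", re.IGNORECASE)
    return [doc for doc in documents if pattern.search(doc)]


wildcard_hits = wildcard_search("bank", DOCS)
concept_hits = concept_search("bank", DOCS)
```

With these sample documents, the wildcard search finds only the "banking" message, while the concept search also picks up the message about the deposit and funds, even though it never mentions a bank.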
Concept categorization is different. Categorization views the whole content of individual documents and groups them according to the percentage of similar concepts and terms. This is a simple sentence describing a complex undertaking, since different base technologies provide the power under the hood. For example, one common if complex method uses Bayesian or other statistical algorithms to analyze keyword frequency, positioning and relationships across documents. Another common method is indexing data sets to produce ontologies of related content for prioritization and viewing.
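As a rough illustration of categorization, the sketch below groups whole documents by how much vocabulary they share. Note that it swaps in a deliberately simple technique (word-count vectors compared by cosine similarity, with greedy grouping) rather than the Bayesian or ontology-based methods mentioned above. The function names, the 0.3 threshold and the sample messages are assumptions for this example, not any vendor's implementation.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Count how often each lowercase word appears in a document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def categorize(documents, threshold=0.3):
    """Greedy grouping: a document joins the first group whose seed
    (first) document is similar enough; otherwise it starts a new group."""
    groups = []  # list of (seed_vector, member_documents)
    for doc in documents:
        vec = term_vector(doc)
        for seed, members in groups:
            if cosine(vec, seed) >= threshold:
                members.append(doc)
                break
        else:
            groups.append((vec, [doc]))
    return [members for _, members in groups]


# Made-up sample messages for illustration.
MESSAGES = [
    "wire transfer to the account cleared",
    "account transfer cleared this morning",
    "baseball game tickets for saturday",
    "saturday baseball game was great",
]

categories = categorize(MESSAGES)
```

Here the two money-related messages land in one group and the two baseball messages in another, without anyone supplying a keyword. A real system would at least weight terms and filter out common words, which this sketch does not do.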
Vendors with particularly good approaches to concept search and/or categorization include Clearwell, StoredIQ, Kazeon and Inference. They operate in different places along the EDRM model but all offer sophisticated conceptual features.