Can IBM Bring the Semantic Web to Notes and Outlook?

OmniFind Personal Email Search tries to extract useful information like addresses or phone numbers from inboxes, and lets organizations customize semantic tagging to avoid irrelevant results.

December 28, 2007

I've been trying out OmniFind Personal Email Search, a tool for Outlook and Notes that IBM posted for free download on its alphaWorks site last week. While email search itself isn't new (Google's Desktop Search will happily index your inbox along with the rest of your hard drive), the IBM software is slightly different.

Rather than finding a specific email message or thread, OmniFind is aimed at searching unstructured data: the information buried within an inbox. And it looks like one of the first genuinely useful desktop applications based on the Semantic Web, an idea that has been somewhat eclipsed by Web services and Web 2.0, but which could eventually unite them with SOA.
The key technology in the tool is UIMA (Unstructured Information Management Architecture), an IBM-led open-source framework for analyzing text and other unstructured data. This is essentially pattern recognition: a series of ten digits with hyphens, brackets or spaces in the right places is a phone number, two letters followed by five digits is a state and zip code, etc. The tool uses this to generate semantic XML tags automatically, overcoming what has been the biggest barrier to the Semantic Web: that people don't have the time or inclination to add metadata to documents manually.
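To make the idea concrete, here is a minimal Python sketch of pattern-based tagging. The two regular expressions and the tag names `phone` and `zip` are illustrative assumptions, not the rules that OmniFind or UIMA actually ship with, and real annotators are considerably more sophisticated.

```python
import re
from xml.sax.saxutils import escape

# Illustrative patterns only (assumptions, not OmniFind's real rules).
PATTERNS = {
    # Ten digits with optional brackets, hyphens, dots or spaces.
    "phone": re.compile(r"\(?\b\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    # Two capital letters, a space, then five digits (US state + zip).
    "zip": re.compile(r"\b[A-Z]{2} \d{5}\b"),
}

def tag_text(text):
    """Wrap every pattern match in an XML-style tag named after the pattern."""
    spans = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), name))
    spans.sort()

    out, pos = [], 0
    for start, end, name in spans:
        if start < pos:
            continue  # skip a match that overlaps one already tagged
        out.append(escape(text[pos:start]))
        out.append(f"<{name}>{escape(text[start:end])}</{name}>")
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)

print(tag_text("Call (415) 555-0123 or write to San Francisco, CA 94105."))
```

Run on that sample sentence, the sketch wraps the phone number in `<phone>` tags and the state and zip code in `<zip>` tags, with no human having added metadata by hand.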
Automated pattern recognition isn't new to users of Web-based email, so OmniFind could be viewed as yet another case of corporates slowly catching up with consumers. (Gmail won't just automatically recognize a street address; it will offer to display a map and give directions.) The same kinds of algorithms are used by search engines internally, both for ad-serving on the fly and for building the main search index itself.

IBM goes beyond Google in a couple of areas. First, the tags are fully exposed so that people can use them in searches. Type "John address" and the system will show every street address in an email from (or mentioning) someone called John – even if the word "address" doesn't appear. A large email archive could include hundreds of messages from people called John, of course, so there are lots of duplicates. But that's a good thing if the point is to find and map the street address buried in his email signature, as the frequency of messages helps the program guess which John you're looking for. Who needs a separate contact database when everyone's info is already in your inbox?
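A query like "John address" can be pictured as a filter plus a frequency count. The sketch below is only an assumption about how such ranking might work, not IBM's implementation; the `MESSAGES` structure, the sample addresses and the `search` function are all hypothetical.

```python
from collections import Counter

# Hypothetical pre-tagged inbox: each message keeps its text and the
# values an earlier tagging pass extracted for each tag name.
MESSAGES = [
    {"text": "See you Monday. John Smith",
     "tags": {"address": ["12 Elm St, Austin, TX 78701"]}},
    {"text": "Agenda attached. John Smith",
     "tags": {"address": ["12 Elm St, Austin, TX 78701"]}},
    {"text": "Lunch? John Doe",
     "tags": {"address": ["9 Oak Ave, Boston, MA 02110"]}},
    {"text": "Invoice enclosed. Mary Jones",
     "tags": {"address": ["5 Pine Rd, Denver, CO 80202"]}},
]

def search(term, tag, messages):
    """Return values of `tag` from messages mentioning `term`, most frequent first."""
    counts = Counter()
    for msg in messages:
        if term.lower() in msg["text"].lower():
            counts.update(msg["tags"].get(tag, []))
    return [value for value, _ in counts.most_common()]

print(search("John", "address", MESSAGES))
```

Here the address that appears in two messages from a John is listed ahead of the one that appears only once, which is how duplicates can help rather than hurt.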
The same principle applies to email addresses, URLs, and phone numbers, with the latter presenting an option to call the number using softphones like Skype if installed. By default, OmniFind also tags parts of each message as a date, time or "person", representing a human name. All of them can be used in the same kind of searches. For example, "person InformationWeek" will attempt to find all the people in an inbox who work at InformationWeek.
The algorithms underlying UIMA aren't perfect. Just as the ads on Google and Yahoo often miss their targets, OmniFind can throw up false positives: It thought that the person mentioned most in my inbox was named Colleague, presumably because so much spam begins "Dear Colleague." But this is mostly because the tags (and their corresponding patterns) are fairly generic.
OmniFind also lets users edit the default tags or create their own, using regular expressions to represent search patterns. IBM suggests that these be used to customize the search to a specific organization, finding information like employee IDs or package tracking numbers. It could also be used to weed out irrelevant search results, most of which are caused by the one-size-fits-all approach that public search engines must take.
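As a rough illustration of what organization-specific tags might look like, the Python sketch below defines two custom patterns. Both formats are assumptions made up for this example: an employee ID written as "EMP-" plus six digits, and a tracking number written as "1Z" plus 16 letters or digits. The tool's own tag editor will have its own syntax and options.

```python
import re

# Hypothetical organization-specific formats, invented for illustration.
CUSTOM_TAGS = {
    "employee_id": re.compile(r"\bEMP-\d{6}\b"),
    "tracking_number": re.compile(r"\b1Z[0-9A-Z]{16}\b"),
}

def extract(text):
    """Return every match for each custom tag found in the text."""
    return {name: pattern.findall(text) for name, pattern in CUSTOM_TAGS.items()}

print(extract("Badge EMP-004217 shipped as 1Z999AA10123456784."))
```

Because the patterns are this specific, they are far less likely to produce the "Dear Colleague" kind of false positive than a generic rule for human names.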

The main drawback is that devising and honing regular expressions can take a lot of work. IBM could help a lot by including optional tags for fields that aren't organization-specific, such as ISBNs, international phone numbers and credit card details (which probably shouldn't be flying around in emails at all, but that's a separate problem). IBM's SOA strategy is very focused on customizations for specific industries, so it will probably do something similar here if the technology makes it into a commercial product.
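Part of the work in honing such patterns is cutting false positives, and for some fields a regular expression alone isn't enough. The sketch below is one possible approach, not something the tool is known to do: it pairs a loose pattern for card-like digit runs with the Luhn checksum that payment card numbers carry. The names `CARD_CANDIDATE`, `luhn_ok` and `find_card_numbers` are hypothetical.

```python
import re

# Loose candidate pattern: 13 to 19 digits, optionally separated by
# single spaces or hyphens. An assumption for illustration only.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")

def luhn_ok(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text):
    """Return candidate numbers that also pass the checksum."""
    hits = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_ok(digits):
            hits.append(digits)
    return hits

print(find_card_numbers("Valid: 4111 1111 1111 1111. Typo: 4111 1111 1111 1112."))
```

The first number is a widely used test value that passes the checksum; the second differs by one digit and is rejected. ISBNs carry a check digit too, so a similar two-step approach could apply there.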
Another problem is that the indexing process needs to be repeated every time a new tag is added, a fairly processor-intensive task. The tool has five different settings that trade off indexing speed against impact on other apps, but even at the fastest it took three hours to apply the default tags to a 1 GB Notes archive (about 30,000 messages) on a dual-core machine. It has to run on a client PC, meaning it's actually better suited for Microsoft Outlook than IBM's own Lotus Notes.
The tool is entirely browser-based, which at first seems a little awkward: Actually reading a searched-for email requires launching a separate application (Notes or Outlook). However, the point is to avoid needing to read the email at all. When it works as intended, it simply extracts the necessary information and bypasses the email client.
The long-term goal of UIMA is to apply the same automated pattern recognition to other kinds of data, which will likely be harder. Email is in some senses the low-hanging fruit, as it isn't entirely unstructured: There are the formal fields like "To", plus the informal structure of salutations and signatures that it inherited from regular mail.
