Network Computing is part of the Informa Tech Division of Informa PLC

Can IBM Bring the Semantic Web to Notes and Outlook?

I've been trying out OmniFind Personal Email Search, a tool for Outlook
and Notes that IBM posted for free download on its alphaWorks site
last week. While email search itself isn't new (Google's
Desktop Search will happily index your inbox along with the rest of your
hard drive), the IBM software is slightly different.


Rather than finding a specific email message or thread, OmniFind is
aimed at searching for unstructured data: the information buried within
an inbox. And it looks like one of the first genuinely useful desktop
applications based on the Semantic Web – an idea that has been
somewhat eclipsed by Web services and Web 2.0, but which could
eventually unite them with SOA.

The key technology in the tool is UIMA (Unstructured Information
Management Architecture), an IBM-led open-source framework for
analyzing text and other unstructured data. This is essentially pattern
recognition: a series of ten digits with hyphens, brackets or spaces in
the right places is a phone number, two letters followed by five digits
is a zip code, etc. The tool uses this to generate semantic XML tags
automatically, overcoming what has been the biggest barrier to the
Semantic Web: that people don't have the time or inclination to add
metadata to documents manually.
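
To make the idea concrete, here is a minimal sketch of that kind of pattern-based tagging in Python. It is not UIMA's or OmniFind's actual code: the tag names, the two regular expressions and the tag_text function are all assumptions made for illustration, and they only cover US-style phone numbers and a state abbreviation followed by a five-digit zip code.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical tag names and patterns, chosen for illustration only.
PATTERNS = {
    # Ten digits with optional brackets, hyphens, dots or spaces.
    "phone": re.compile(r"\(?\b\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    # Two capital letters, a space, then five digits.
    "zip": re.compile(r"\b[A-Z]{2} \d{5}\b"),
}

def tag_text(text):
    """Wrap every pattern match in an XML element named after its tag."""
    spans = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), name))
    spans.sort()

    out, pos = [], 0
    for start, end, name in spans:
        if start < pos:
            continue  # skip a match that overlaps one already tagged
        out.append(escape(text[pos:start]))
        out.append(f"<{name}>{escape(text[start:end])}</{name}>")
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)

print(tag_text("Call (914) 555-0100 or write to Yorktown Heights, NY 10598."))
```

Even this toy version shows why the approach appeals: nobody had to mark up the message by hand, yet the output carries tags a search tool can index.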

Automated pattern-recognition isn't new to users of Web-based email, so
OmniFind could be viewed as yet another case of corporates slowly
catching up with consumers. (Gmail won't just automatically recognize a
street address; it will offer to display a map and give directions.) The
same kinds of algorithms are used by search engines internally, both for
ad-serving on the fly and for building the main search index itself.


IBM goes beyond Google in a couple of areas. First, the tags are fully
exposed so that people can use them in searches. Type "John address" and
the system will show every street address in an email from (or
mentioning) someone called John – even if the word "address"
doesn't appear. A large email archive could include hundreds of messages
from people called John, of course, so there are lots of duplicates. But
that's a good thing if the point is to find and map the street address
buried in his email signature, as the frequency of messages helps the
program guess which John you're looking for. Who needs a separate
contact database when everyone's info is already in your inbox?
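
IBM doesn't document how the ranking works, but a frequency count is one plausible way to turn duplicates into a signal. The sketch below assumes messages have already been tagged and are held as plain dictionaries; the field names, the sample inbox and the search_tag function are hypothetical, not part of OmniFind.

```python
from collections import Counter

def search_tag(messages, name, tag):
    """Rank values of `tag` found in messages from or mentioning `name`."""
    needle = name.lower()
    counts = Counter()
    for msg in messages:
        # Naive substring match: "john" would also match "Johnson".
        if needle in msg["sender"].lower() or needle in msg["body"].lower():
            counts.update(msg["tags"].get(tag, []))
    # Most frequent value first; ties keep first-seen order.
    return counts.most_common()

# Hypothetical pre-tagged inbox.
inbox = [
    {"sender": "John Smith", "body": "See you Monday.",
     "tags": {"address": ["1 Main St, Springfield"]}},
    {"sender": "John Smith", "body": "Minutes attached.",
     "tags": {"address": ["1 Main St, Springfield"]}},
    {"sender": "Ann Lee", "body": "John Doe has moved.",
     "tags": {"address": ["9 Elm Rd, Shelbyville"]}},
    {"sender": "Ann Lee", "body": "Lunch?", "tags": {}},
]

print(search_tag(inbox, "John", "address"))
```

The John who writes most often floats to the top, which is usually the one you were looking for.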

The same principle applies to email addresses, URLs, and phone numbers
– with the latter presenting an option to call the number using
softphones like Skype if installed. By default, OmniFind also tags parts
of each message as a date, time or "person", representing a human name.
All of them can be used in the same kind of searches. For example,
"person InformationWeek" will attempt to find all the people in an inbox
who work at InformationWeek.

The algorithms underlying UIMA aren't perfect. Just as the ads on Google
and Yahoo often miss their targets, OmniFind can throw up false
positives: It thought that the person mentioned most in my inbox was
named Colleague, presumably because so much spam begins "Dear
Colleague." But this is mostly because the tags (and their corresponding
patterns) are fairly generic.

OmniFind also lets users edit the default tags or create their own,
using regular expressions to represent search patterns. IBM suggests that these be
used to customize the search to a specific organization, finding
information like employee IDs or package tracking numbers. It could also
be used to weed out irrelevant search results, most of which are caused
by the one-size-fits-all approach that public search engines must take.
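
As a rough illustration of what such custom tags might look like, here is a Python sketch. The employee ID format (EMP- plus six digits) is invented, the tracking pattern assumes the common courier style of "1Z" followed by 16 letters or digits, and the find_custom_tags helper is hypothetical; OmniFind's own regular expression syntax may differ.

```python
import re

# Hypothetical organization-specific tags; both formats are assumptions.
CUSTOM_TAGS = {
    "employee_id": re.compile(r"\bEMP-\d{6}\b"),
    "tracking_number": re.compile(r"\b1Z[0-9A-Z]{16}\b"),
}

def find_custom_tags(text):
    """Return every match for each custom tag found in the text."""
    return {name: pattern.findall(text) for name, pattern in CUSTOM_TAGS.items()}

print(find_custom_tags("Badge EMP-004217 shipped under 1Z999AA10123456784."))
```

Because these patterns are much narrower than a generic "person" or "address" tag, they are far less likely to produce the kind of false positives described above.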


The main drawback is that devising and honing regular expressions can take
a lot of work. IBM could help a lot by including optional tags for
fields that aren't organization-specific, such as ISBNs, international
phone numbers and credit card details (which probably shouldn't be
flying around in emails at all, but that's a separate problem). IBM's
SOA strategy is very focused on customizations for specific industries,
so it will probably do something similar here if the technology makes it
into a commercial product.
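
Credit card numbers are a good example of why honing a pattern takes effort: a regular expression alone will happily match any long run of digits. The Python sketch below pairs a loose pattern with the standard Luhn checksum to filter candidates. The pattern and function names are assumptions for illustration, and nothing here reflects how OmniFind itself would handle it.

```python
import re

# Loose candidate pattern: 13 to 19 digits, optionally split by spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")

def luhn_ok(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text):
    """Return digit strings that match the pattern and pass the checksum."""
    hits = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_ok(digits):
            hits.append(digits)
    return hits

# 4111 1111 1111 1111 is a widely used test number, not a real card.
print(find_card_numbers("Card 4111 1111 1111 1111, order 1234 5678 9012 3456"))
```

The second number looks just like a card to the regular expression but fails the checksum, so only the first is tagged.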

Another problem is that the indexing process needs to be repeated every
time a new tag is added, a fairly processor-intensive task. The tool has
five different settings that trade off indexing speed against impact on
other apps, but even at the fastest it took three hours to apply the
default tags to a 1 GB Notes archive (about 30,000 messages) on a
dual-core machine. It has to run on a client PC, meaning it's actually
better suited for Microsoft Outlook than IBM's own Lotus Notes.
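
For a sense of scale, those figures work out to fewer than three messages per second. The back-of-the-envelope calculation below assumes that 1 GB means 2^30 bytes and that the three-hour run was continuous; the throughput function is just a helper for this example.

```python
def throughput(messages, archive_bytes, hours):
    """Return (messages per second, bytes per second) for an indexing run."""
    seconds = hours * 3600
    return messages / seconds, archive_bytes / seconds

# Figures from my test: about 30,000 messages, 1 GB archive, three hours.
msgs_per_sec, bytes_per_sec = throughput(30_000, 1024**3, 3)
print(f"{msgs_per_sec:.1f} messages/s, {bytes_per_sec / 1024:.0f} KiB/s")
```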

The tool is entirely browser-based, which at first seems a little
awkward: Actually reading a searched-for email requires launching a
separate application (Notes or Outlook). However, the point is to avoid
needing to read the email at all. When it works as intended, it
simply extracts the necessary information and bypasses the email client.

The long-term goal of UIMA is to apply the same automated pattern
recognition to other kinds of data, which will likely be harder. Email
is in some senses the low-hanging fruit, as it isn't entirely
unstructured: There are the formal fields like "To", plus the informal
structure of salutations and signatures that it inherited from regular
mail.
