SPAM Playbook: Spam-filtering Tools

There's more than one way to stop junk e-mail. We evaluate several of the latest spam-filtering mechanisms which prevent most undesired messages from ever reaching your end users.

September 2, 2004


Appropriate Filtering

The only way to eliminate the costs is to eliminate spam. You can accomplish this typically by implementing one or more e-mail filters. Many options exist, though not all are viable for every organization or user. For instance, a lawyer might be required to keep a copy of every e-mail he or she receives. Further, some filters operate best at the edge, while others require end-user manipulation.

Broad-based edge filters provide the biggest bang for the buck, in that they are designed to keep junk mail from ever being received by your server. The cost savings decrease as filters move closer to the user--more bandwidth, storage and processing capacity are needed when unfiltered mail travels further into your enterprise. Rejecting mail at the edge of the network also means you don't have to generate delivery-failure notification messages (the sending system is responsible), which alleviates problems associated with junk e-mail using a forged sender address.

On the other hand, a customized user-based filter that examines each message in context can keep mailboxes remarkably clean with a small number of false positives, but it uses the most system resources and causes the greatest productivity losses because it needs more administration.

Tiered Filtering

The best approach is to mix and match mechanisms. Edge filters reject obvious junk mail, while filters inside the messaging network act on user requirements. Layered installations can be fine-tuned to reduce false positives, while the elimination of obvious spam at the edge means fewer resources are needed to process the reduced number of messages inside.

Tiered Topology

In our "Tiered Topology" setup (left), the edge filters weed out obvious spam using connection- and session-layer tests. The messages that survive are passed to the internal delivery servers, which apply user-specific filters.

This model requires careful planning, however. Our lawyer who needs to keep a copy of every message sent or received may require exception handling at the edge server that defers processing until all the recipients have been itemized. Along these same lines, it's usually a good idea to allow messages addressed to the postmaster account to pass through the filters so that misidentified senders can get out of the filter jail. But you still need a way to reject spam to these accounts, since some miscreants are known to target them.

If your network is complex--and if you can monitor and adjust the filters as needed--you'll likely find that internally developed solutions will provide the most bang for your buck. But if you are operating under a tight budget or if your time commitments are stretched thin, you may be better off with a packaged system. Similarly, you may want to consider outsourcing some or all spam management to a service provider that will accept all incoming mail on your behalf and forward only the clean traffic to you.

Filters use weights to void e-mail, so you need to examine how the weights are assigned. In simple terms, probability scores are useful whenever two or more tests must be triggered before a message can be reliably rejected. This is typically needed when any single test is not strong enough to be used as a reject match in isolation, and can be useful with filters that are known to return false positives periodically. For example, an organization may decide that most of the e-mail from a specific domain is spam, but all of its mail cannot be refused. Similarly, you may know that e-mail messages with a certain string are probably spam, but this cannot be relied upon with absolute certainty (a human-resources employee may be looking for discount pharmaceutical products as part of his or her job). In these cases, you'll want to avoid absolute filters and stick with probability weights.

The most popular probability engine is SpamAssassin, which comes with customizable filters. SpamAssassin's built-in parsing tools assign a probabilistic score to a message based on the amount of uppercase text or colored HTML in the message body. It also calls upon external filters, such as DNS-based distributed blacklists. SpamAssassin can check for custom header fields inserted by your SMTP server, and it can call upon customized external tests. Once these passes are made, SpamAssassin adds the scores and compares the final value to user-defined thresholds. Depending on its assigned value, the message is discarded, quarantined for later examination or allowed to pass to the next point in the delivery path.
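The add-and-compare step described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not SpamAssassin's actual rule engine: the test names, the weights and the two thresholds are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


def uppercase_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are uppercase."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)


@dataclass
class Test:
    name: str
    score: float                     # weight added when the test matches
    matches: Callable[[dict], bool]  # predicate over a parsed message


# Hypothetical tests and weights, for illustration only.
TESTS = [
    Test("mostly_uppercase", 1.5, lambda m: uppercase_ratio(m["body"]) > 0.5),
    Test("hot_phrase", 2.0, lambda m: "discount pharmaceutical" in m["body"].lower()),
    Test("listed_on_dnsbl", 3.0, lambda m: m.get("dnsbl_listed", False)),
    Test("whitelisted_sender", -5.0, lambda m: m.get("whitelisted", False)),
]


def classify(message: dict, quarantine_at: float = 3.0, discard_at: float = 6.0):
    """Add the weights of every matching test, then compare to the thresholds."""
    total = sum(t.score for t in TESTS if t.matches(message))
    if total >= discard_at:
        return total, "discard"
    if total >= quarantine_at:
        return total, "quarantine"
    return total, "pass"
```

Note that no single test in this sketch reaches the discard threshold alone, which is the point of probability weights: a message is thrown away only when several independent indicators agree, and a negative whitelist weight can pull an otherwise suspicious message back under the line.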

Domain-Association Tests


In our sample messaging network, SpamAssassin runs twice: once at the edge, and again at the core of the messaging network prior to delivery. By limiting the tests that are called upon at each juncture--and by tweaking the scores of each test suite to reflect the targeted attributes--processing is minimized while each transit point gets the most appropriate benefits. In both cases, the most expensive tests are called upon only after the static filters have been used, reducing the load.

Some mail systems let you use external tools like SpamAssassin while the session is still active, which means the server can reject the mail outright based on the probability value returned. For example, Postfix 2.0 can be configured to pass an incoming message to SpamAssassin after the internal tests have been run, and to use the final probability score in deciding whether the mail should be accepted. Postfix tests are absolutes; SpamAssassin scores are stored in the message header, and Postfix can apply another absolute test against that header field value, such as rejecting mail with a spam score of 5 or higher. This lets Postfix refuse the mail while the session is still active, eliminating the need for out-of-band delivery-failure notifications.

Network Blacklists

Network blacklists flat out refuse traffic from specific IP addresses and networks. Given that most e-mail servers support these filters, many e-mail admins will try to block e-mail from known offenders with a blacklist. However, this is a practical strategy only in a handful of situations.

The number of virus-infected systems on the Internet lets spammers use almost any network for transmission--it is impossible to maintain a local list of addresses that accurately reflects every infected system. Similarly, open relays and other problematic hosts come and go; it's impossible to maintain a complete and accurate list of these systems.

Network blacklists can be useful with ISPs that host known spammers--or that don't respond to complaints--and when another kind of filter is not suitable. Keep in mind that the blocked organizations may get new addresses at any time, rendering the local list obsolete and potentially giving anybody that may be assigned the old addresses an enormous headache. These filters block all traffic from the affected networks, so it's not possible for an innocent bystander in those networks to e-mail the local postmaster account to discuss the problem (though the bystander can still send mail from another physical network).

Domain-based blacklists are widely supported in SMTP servers and are somewhat more effective than IP-based blacklists. In particular, these filters are useful with "professional marketing" organizations that do not fake e-mail addresses nor attempt to camouflage their connections and e-mail addresses behind random accounts.

Domain blacklists can trap senders at a variety of points in the transfer process, though the extent to which your filters work will depend on your server's filtering mechanisms. For example, Postfix lets domain filters be used against the domain name of a connecting client, the host-name parameter from the HELO and EHLO commands, and the domain name of the envelope sender. Postfix can even be used to block mail from domains that share common DNS and SMTP servers with known bad guys. It also lets these kinds of filters be defined and stored in LDAP directories, which simplifies sharing the blacklists across multiple servers.

Distributed DNS blacklists are a recent addition to the antispam arsenal, but they have been extremely useful already. These lists use name-to-value lookup services over DNS. The query identifies the suspicious host, and the answer indicates whether that host is listed in the queried blacklists. There are more than 200 public blacklists that describe almost every kind of network. There are blacklists for known spammers, open mail relays, dial-up clients that shouldn't be sending e-mail directly, systems that have been compromised by worms and viruses and even blacklists that itemize networks that have been delegated to specific service providers and countries. By combining and tweaking the local probability weights for each list, you can create explicit filtering rules. For example, you could specifically block known-infected systems on broadband networks in Brazil.
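As a rough illustration of the lookup described above, the sketch below builds a blacklist query and checks for an answer. It assumes the common convention for IPv4 blacklists, in which the octets of the client address are reversed and prepended to the list's DNS zone. The zone name in the usage note is a placeholder, not a real list, and the function names are made up for this example.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query: reversed IPv4 octets followed by the list's zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the blacklist returns an address record for this host."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No such name (or the lookup failed): treat the host as not listed.
        return False
```

For instance, `dnsbl_query_name("192.0.2.10", "dnsbl.example.org")` yields `10.2.0.192.dnsbl.example.org`. Treating a failed lookup as "not listed" is a deliberate choice in this sketch, so that an unreachable blacklist does not block legitimate mail.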

In addition, a handful of "right hand" blacklists operate against the domain name provided in the sender e-mail address rather than the IP address of the connecting client. If an e-mail arrives from "[email protected]," a query would be generated for the "example.net" domain name at the target server, and the response codes would indicate whether the sender's domain was listed in the queried blacklist. There are some right hand blacklists for tracking domain-related problems, such as whether the domain has an active and valid abuse mailbox, but these blacklists are not as common as the host-based blacklists.

DNS Lookups


In general, it's a good idea to make limited use of a small and highly trusted subset of these lookup services, without using too many of them. Performing one or two lookups against a couple of good blacklists can eliminate most incoming spam, and this step will free up significant network resources. Even if you cannot use these filters to reject all spam, you can use some of the blacklists for delayed probability tests. For instance, you can have SpamAssassin call on the blacklists rather than have your SMTP server do it alone (or in conjunction with the SMTP server, as shown in our setup). In that kind of model, the junk mail that isn't killed by the local filters can be eliminated by the secondary tests before the messages reach the internal servers.

As with local blacklists, distributed blacklists may be incomplete or outdated, and they might block legitimate e-mail. Furthermore, DNS-based blacklists have been known to vanish from the network or become overly paranoid and list the entire Internet as offensive. If you are going to use these tools, make sure to allocate the time and responsibility to maintain them.

Whitelists

Blacklists are great for keeping known junk off your network, but they are guaranteed to make mistakes. Therefore, you need some kind of whitelist to help valid e-mail get through your filtering minefield. Most e-mail systems that support whitelists can be used with the same range of filters as their blacklist counterparts. For example, the LDAP-based blacklists provided with Postfix can be used to return "accept" codes at the same junctures as they would return "reject" codes, so that a single database can serve double duty. There are also a handful of operators that run distributed DNS whitelists (similar in design to their blacklist counterparts), including commercial trust brokers such as Habeas and Bonded Sender.

It is extremely important to put your whitelist filters in front of your blacklists, and to let whitelisted e-mail completely bypass any other filters if possible. For example, Postfix allows certain kinds of whitelisted entries (such as "trusted networks") to completely bypass all additional local filtering, but the free ride comes to an end once the mail is handed off to any external tools like SpamAssassin. SpamAssassin does not provide a bypass feature for whitelisted mail, but instead simply assigns negative probabilities. These are usually high enough to offset any other matches.

While blacklists can be effective at any point in the transfer path--one reject is enough to keep the message from getting any further into the network--whitelists must be used at every transfer point to ensure that a valid message isn't killed.

Several technologies automate part of the tedious process of maintaining whitelists. For example, some simple systems track all outgoing e-mail and add all message recipients to the sender's local whitelist. This ensures that you'll still receive e-mail from Grandma's Yahoo account. Variations on this theme include systems that also add unknown addresses as long as a known good address is listed as a recipient. This is useful for automatically whitelisting users of a mailing list.

SpamAssassin's automatic whitelisting system tracks the historical average of a particular sender, with the current and long-term scores used to weight each message. For example, if a sender has a long-term average probability score of a solid -3.5, but the current message has some spam qualities, meriting it a probability score of 2.0, the average of the two scores is -0.75, which will keep the message from being falsely tagged.

Some systems incorporate a challenge-response model: Incoming mail from unknown senders is put into a hold queue, and a challenge message is returned to the sender. If the original sender responds to the challenge correctly (such as putting a key value into the subject header), the e-mail address is added to the whitelist database. Although these systems often work to guarantee that a human sent the original e-mail (or has at least read the challenge message), these systems do not work seamlessly with robotic mailers like mailing-list agents or virus-notification engines. Furthermore, these systems are often poorly designed and sometimes generate a flurry of challenges every time a message is sent to a mailing list. Because much junk mail uses forged e-mail addresses, some of these systems also can be responsible for generating challenges for e-mail addresses that didn't actually send any mail.

Greylisting, another popular mechanism, makes use of simple delivery deferrals to ensure that the sending SMTP client is not a bulk-spam agent. In this model, the first e-mail from a particular sender is rejected with a temporary failure, but any subsequent e-mails from that same sender and SMTP client can pass through. This method assumes a legitimate mail server will retry delivery but a bulk-spam agent won't. It's important to note that these systems don't validate the message sender or prevent undesirable content from entering the network. Instead, they only verify that the sending client conforms with SMTP specs. Also note that greylisting works only if you can defer the initial transfer (meaning that this filter must be used at the edge of the network), but several organizations also prefer using this tool only with mail that has a probability of being spam (thereby avoiding problems with broken SMTP clients). Cumulatively, this can mean that the filter is called after the edge-based probability scoring but before the transfer has been acknowledged, which can be difficult to implement.

As a relatively new trend, some SMTP servers are deploying "callback" systems that attempt to verify the message sender's e-mail address through a back-channel connection to the sending SMTP domain. For example, if a message arrives from the unknown sender "[email protected]," the SMTP server might attempt to open a connection with one of the mail servers for the example.net domain and see if it will accept e-mail for the "user" account. If the callback procedure shows that the original sender's address is valid, the account is added to the whitelist. However, there are problems with this approach. For one, the selected target server may not list all the e-mail addresses within its domain (this is a common problem with secondary mail servers), and may therefore verify all e-mail addresses, including invalid ones. In those cases where a junk mailer is using a harvested address as the sender address, these tests will only verify that the account is valid, and not that it is being used for legitimate purposes. As such, the usefulness of these tests is limited to eliminating obvious spam.
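To make the greylisting deferral discussed in this section more concrete, here is a minimal in-memory sketch. It keys on the SMTP client address and the envelope sender, as described in the text. The `Greylist` class, the five-minute `min_delay` and the injectable `clock` are assumptions made for the example; a real deployment would also need persistent storage and expiry of old entries.

```python
import time


class Greylist:
    def __init__(self, min_delay: float = 300.0, clock=time.time):
        self.min_delay = min_delay  # seconds a new pairing must wait before it is accepted
        self.clock = clock          # injectable so the logic can be exercised without waiting
        self.first_seen = {}        # (client_ip, sender) -> time of first attempt

    def check(self, client_ip: str, sender: str) -> str:
        key = (client_ip, sender.lower())
        now = self.clock()
        # Record the first attempt; later attempts reuse the stored timestamp.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "accept"
        # The SMTP server would answer with a temporary failure here.
        return "defer"
```

A legitimate server that queues the message and retries after the delay gets "accept" on the later attempt, while a client that never retries simply never gets through. The same sender arriving from a different client address starts over with a fresh deferral.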

Another recent trend in the fight against junk mail is the use of protocol validity tests. These tests, which attempt to determine if a particular sender or message conforms to well-known practices, can be effective. However, because of their dependence on letter-of-the-law conformance, the tests also can generate a tremendous number of false-positives. Use good judgment before putting them into place. They are best used for determining probabilities rather than for flatly rejecting mail.

A simple example of these tests can be found with mail servers that require an exact match between the forward and reverse DNS domain names of an SMTP client. In this scenario, the IP address of an incoming connection is queried in DNS to see if a domain name is associated with the IN-ADDR.ARPA entry for that address. A subsequent lookup for the resulting domain name is also issued to verify that the target domain name is associated with the original IP address. If this verification process fails, these servers will refuse to establish the SMTP session. Similarly, some systems will refuse to accept mail if the hostname provided in the HELO greeting command is different from the host name of the connecting node. And there are systems that will accept mail only from a host in the same domain as the originator.
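The forward and reverse DNS comparison described above could be sketched as follows, using Python's standard `socket` module. The function names are made up for this example, and the resolvers are passed in as parameters purely so the logic can be tried without live DNS. Given the false-positive risk discussed in this section, the result is probably better fed into a probability score than used as a hard reject.

```python
import socket
from typing import Callable, List


def reverse_lookup(ip: str) -> str:
    """Name associated with the address's reverse (IN-ADDR.ARPA) entry."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """IPv4 addresses associated with the host name."""
    return socket.gethostbyname_ex(hostname)[2]


def forward_confirmed(
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """True only if the reverse name resolves back to the original address."""
    try:
        hostname = reverse(ip)
        return ip in forward(hostname)
    except OSError:
        # Covers socket.herror and socket.gaierror: a missing record counts as a failure.
        return False
```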

The basic principle with these tests is that well-managed systems should have all their ducks in a row. If operational errors are detected, it is likely that the sender has other problems and it's just not worth the risk to accept the mail. However, this kind of brute enforcement can trigger a tremendous number of false positives, largely because there is no direct correlation between management of the domain name space and management of the e-mail infrastructure. There also isn't a correlation between the quality of the content and the quality of the software used to transfer said content. Many organizations have divisions within their own mail domains that relay outbound mail through a central server or send mail through an ISP that may not be under the control of the sending party. Meanwhile, many marketing organizations follow all these rules, and those messages, therefore, will fail to trip these filters.

Validity Tests


On the other hand, it is entirely reasonable for servers to check if the specified domain name exists--and to refuse the mail if it doesn't. Similarly, some mail servers will refuse to accept mail from hosts that masquerade as originating from the same network as the recipient, or that use a "local" user's e-mail address that has not been authenticated. Some large-scale Web mail providers are frequently used in forgeries; legitimate mail from those domains can be presumed to have originated on servers within those domains, and the hosts on that network will have the right domain name. These kinds of tests are valid and can be extremely effective with a minimum of effort, but they are best used as probability filters because of the potential for legitimate exceptions.

Content Analysis

Most of the mechanisms we've described are intended to be used while an incoming message transfer is being negotiated. Other filters inspect and validate message content. Typically, these tests can be performed only after the message has been transferred, though some high-end SMTP servers can keep the connection open during these tests.

At the simplest level, most SMTP servers allow message headers to be analyzed for basic indicators that the remainder of the message is likely to be spam. For example, most SMTP servers can be configured to refuse e-mail that appears to contain only a single HTML body part or GIF message (both of which are common signs of spam). However, these kinds of filters can have numerous problems, such as rejecting legitimate mailing lists that send HTML-only messages. As such, these kinds of tests should be used only with probability filters; they should not be used for absolute rejections.

Along the same lines, most SMTP servers also support basic filters for prohibited strings in the message body, such as looking for telltale markers of Nigerian scams, investment services, health products and the like. However, these offerings frequently are camouflaged through the use of noise text or with misspelled words. You need to use probabilistic tools that look for these markers in conjunction with the original hot-word filters.

A relatively new set of these filters looks for spam-related URLs in the message body, and then checks with the clearinghouse servers to see if the URLs are associated with well-known spammers. If the message also trips other high-probability filters (such as originating at a high-scoring SMTP client), it's usually safe to simply reject the mail outright.

The current king of text-analysis tools is Bayes filtering, which uses probabilistic algorithms to determine whether the text in a message is likely to be spam. Essentially, these tools look at the words in a message (and sometimes phrases and other associations) to see if the text most often occurs in spam or "ham" (valid e-mail).
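A toy version of this word-frequency approach is shown below. It is a plain naive Bayes classifier with add-one smoothing, written only to illustrate the principle. It is not the algorithm used by SpamAssassin or any other product, and the `TinyBayes` class and its method names are invented for this sketch.

```python
import math
import re
from collections import Counter


class TinyBayes:
    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}  # token counts per class
        self.messages = {"spam": 0, "ham": 0}               # messages seen per class

    @staticmethod
    def tokens(text: str):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text: str, label: str) -> None:
        self.words[label].update(self.tokens(text))
        self.messages[label] += 1

    def spam_probability(self, text: str) -> float:
        vocab = len(set(self.words["spam"]) | set(self.words["ham"]))
        total_msgs = sum(self.messages.values())
        log_score = {}
        for label in ("spam", "ham"):
            total_words = sum(self.words[label].values())
            # Smoothed class prior, then one smoothed likelihood term per token.
            score = math.log((self.messages[label] + 1) / (total_msgs + 2))
            for tok in self.tokens(text):
                score += math.log((self.words[label][tok] + 1) / (total_words + vocab + 1))
            log_score[label] = score
        # Convert the log-odds to a probability, clamped to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))
```

After training on a few labeled messages, calling `spam_probability()` on new text returns a value between 0 and 1 that can be mapped onto a probability weight. The smoothing keeps a word that has never been seen in one class from forcing the result to exactly 0 or 1.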

It's important to recognize that these databases are user-specific. Since each user probably deals with his or her own professional language, these tools must be trained according to each user's specific usage patterns.

The usual way to train these kinds of engines is to provide automated learning processes that periodically analyze mail that is specially marked and attempt to train themselves based on the inputs. This feedback processing can be handled on a nightly basis through automated scripts that pull new messages from the user's inbox and a special "spam" folder, and then feed all the returned messages into the Bayes engine for classification. If the engine makes an error, the user only has to move the confusing message to the appropriate folder, and the message will be reclassified on the next run.

Some standalone systems make use of "quarantine" folders or digests for the same basic purpose, storing all suspicious mail for the user to examine. Any messages that are abandoned or retrieved from the quarantine are piped into the autolearning process for reinforcement purposes.

Checksum Tools

Going beyond text analysis, tools like the Distributed Checksum Clearinghouse (DCC) and Vipul's Razor use message checksums and distributed databases to find bulk transfers. If an incoming message has been seen by many other servers, the message can be assumed to be spam, though this process must be handled with care.

Checksum Tests


In particular, DCC generates checksums from different parts of incoming messages, and the local DCC client submits the set of checksums to a DCC server that returns values indicating how often each checksum has been seen. Messages that have been seen by many participating systems will drive up the value, which can then be incorporated into probability scores. However, DCC looks only at the frequency of a message's occurrence and will therefore trigger against legitimate bulk mail, such as mailing lists and newsletters. To keep legitimate bulk mail from being aggressively scored, the senders must be whitelisted.
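The counting idea can be illustrated with a deliberately simplified, local stand-in for a clearinghouse. This is not DCC's checksum algorithm or its client/server protocol: the SHA-256 digest of a whitespace-normalized body, the `ToyClearinghouse` class, and the threshold and weight values are all assumptions made for the sketch.

```python
import hashlib
from collections import Counter


def body_checksum(body: str) -> str:
    """Checksum of the body after collapsing whitespace and case (a crude normalization)."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ToyClearinghouse:
    """In-memory stand-in for a shared server that counts checksum sightings."""

    def __init__(self):
        self.counts = Counter()

    def report(self, checksum: str) -> int:
        """Record one sighting and return how often the checksum has been seen."""
        self.counts[checksum] += 1
        return self.counts[checksum]


def bulk_score(count: int, threshold: int = 100, weight: float = 2.0) -> float:
    """Contribute a probability weight once a message looks like bulk mail."""
    return weight if count >= threshold else 0.0
```

Because the count says only that a message is bulk, not that it is unwanted, the returned weight would normally be offset by a whitelist entry for legitimate mailing lists and newsletters.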

Vipul's Razor is slightly different from DCC, in that it uses message checksums within a distributed network but has additional mechanisms that let accredited participants signify whether a message is spam. The credibility weights of each participant are keyed to the number of coinciding reports, so the assertions of frequent valid reporters have more weight than one-time reports. Vipul's Razor can be used at the edge of the network with some success, though the tool's distributed nature means that each message will incur more latency.

Other Tools

One of the most useful but underused tools in the spam-fighter arsenal is the spam-trap address. By publishing a particular e-mail address in several conspicuous places--such as making frequent posts to out-of-the-way newsgroups, signing up for known hostile mailing lists, and otherwise making the e-mail address widely available across the Internet--you can encourage spammers to send junk to a heat-sink that simply rejects or discards any e-mail that includes that address in the recipient list. Since most spam sent to an organization will have multiple recipients, discarding any e-mail that includes the spam-trap address will also serve to eliminate spam that would go to any of the other addresses. This is a very effective mechanism, and one that is relatively easy to use.
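The recipient-list check behind a spam trap is simple enough to sketch in a few lines. The trap address and the function name below are placeholders for this example.

```python
# Hypothetical trap address, published only where address harvesters will find it.
TRAP_ADDRESSES = {"trap@example.org"}


def hits_spam_trap(recipients, traps=TRAP_ADDRESSES) -> bool:
    """True if any envelope recipient is a spam-trap address."""
    return any(r.strip().lower() in traps for r in recipients)
```

A message for which `hits_spam_trap()` returns True would be rejected or discarded for every recipient on the list, not just for the trap address.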

Looking toward the future, a handful of sender-authorization technologies are being developed to tell a receiver system that a message was authorized to have been sent by the sending party. Although these technologies do not say if a message is spam, they do let a recipient reject forged mail. One such effort is the Sender Policy Framework spec, which lets domain owners itemize the hosts and networks that are authorized to send mail on their behalf. The DomainKeys proposal uses public-key technology so that legitimate e-mail can be signed by the sender or an authorized relay, and recipients can then validate the signature with a quick lookup.

There is also an IETF effort to make Whois data available via XML, which should allow for improved parsing of the administrative data associated with a domain or a network, such as allowing you to determine if a URL in a piece of suspicious mail is somehow linked to a known spammer. Once the tools become available to take advantage of this data, network operators will be able to determine if an embedded URL points to a known spam-friendly network (without having to query a separate list of fast-changing URLs), and to reject or weight the message accordingly.

Perhaps the most important tool in any arsenal is a virus checker that scans all incoming e-mail and discards infected messages immediately. Given the high number of infected and exposed systems on the Internet, virus checkers are critical and should be used at the edge of the network.

The scope of these filtering mechanisms may appear to be large and unwieldy, but this is the reality: Spammers and malware developers are constantly looking for new ways to circumvent filters, and new technologies must be developed to fill those gaps. On the plus side, the existing tools are extremely effective at fighting spam if an appropriate amount of computing and administrative resources are dedicated to the problem. As empirical proof, one of our small test domains rejects hundreds of attempted spam and worm messages daily, with only a handful of such messages getting through every week, and that domain uses only a few of the tests described here.

With a relatively minor level of resources and time, it is already possible to can spam.

Eric A. Hall is president of Network Technology Research Group, a Nashville-based network consultancy, and author of Internet Core Protocols: The Definitive Guide, from O'Reilly & Associates. Write to him at [email protected].

Unsolicited e-mail may seem like a slight irritation, but left unimpeded, it can clog a server and become a conduit for network-threatening viruses and worms. The best way to protect yourself is to mix and match several mail filters.

For instance, in a tiered topology, e-mail is checked at several layers, with specific rules employed at each step. A probability scoring engine ranks the e-mail based on independent spam indicators, such as the type of account it's coming from or a string within the text. Network blacklists reject e-mails from specific IP addresses and networks, while distributed DNS blacklists use name-to-value lookup services. Whitelists keep track of valid e-mail addresses. You can also rate well-known practices with validity tests or inspect the contents of a message using a Bayes filter.

We consider all these mechanisms and show you the best way to combine them to ensure your network remains closed against spam.


Comprehensive filtering systems often demand a great deal of processor power and introduce latency. The more tests you perform, the longer the filtering processes will run.

The processing capacity needed depends on the amount of mail received, the time available to process each message, the number of tests to be performed and the number of processors available. Unfortunately, time isn't variable, and you don't have much control over the number of incoming messages--so the only two controllable variables are the number of tests and the number of dedicated processors. Furthermore, if you want to perform more tests against a fixed number of messages but don't want to increase your message backlog, your only option is to increase processors.

For example, a series of static blacklist tests against incoming messages may require no more than a second to process (not including any subsequent processing, such as delivery handling). There are 86,400 seconds in a day, so the same number of messages could theoretically be processed with a single system at that rate. However, if you add multiple remote lookups to your filtering system, giving you another nine seconds of task latency, the overall throughput will drop to 8,640 messages per day. To get back to 86,400 messages, you'd need to add another nine processors, with all systems running in parallel.
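The arithmetic in this example can be checked with a short calculation. It assumes, as the text does, that each processor handles one message at a time; the function names are made up for the sketch.

```python
import math

SECONDS_PER_DAY = 86_400


def messages_per_day(seconds_per_message: float, processors: int = 1) -> float:
    """Daily throughput when each processor handles one message at a time."""
    return processors * SECONDS_PER_DAY / seconds_per_message


def processors_needed(target_per_day: float, seconds_per_message: float) -> int:
    """Smallest number of parallel processors that sustains the target volume."""
    return math.ceil(target_per_day * seconds_per_message / SECONDS_PER_DAY)
```

With one second per message, a single system handles 86,400 messages a day. At ten seconds per message (one second of static tests plus nine of remote lookups), `messages_per_day(10)` gives 8,640, and `processors_needed(86_400, 10)` gives 10 systems in total, which is the original system plus the nine additional ones.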

That may seem like a lot of systems, but the numbers usually come in somewhat lower if you use multithreading or multiprocessing systems. Furthermore, if you call your static filters before the probabilistic lookups, you can eliminate a significant number of the messages traversing the expensive lookups.

A couple of high-powered systems may be enough to handle the load, with only a marginal cost increase. And keeping down the cost of spam looks good no matter how you do it.
