The pure java Anti Spam ENgine

Tuning

Although the jASEN engine is highly configurable, not every setting has a meaningful impact on the results or accuracy of the engine.

This section outlines the most sensitive configurations which can be altered to "tune" the engine to meet your specific needs.

Tuning is classified into two distinct areas:
  1. Performance
  2. Accuracy

Performance Tuning

DNS Lookups

As mentioned in the FAQs, the main factor affecting the performance of the engine is DNS lookups. Currently, the two slowest plugins are the RBLScanner and the SenderAddressValidationScanner; however, any future DNS-reliant plugins will likely exhibit the same performance issues.

These plugins are slow because of a combination of the time taken to perform the DNS lookup itself (establish a network connection, handshake, request/response, close connection) and the timeout enforced by the DNS server when querying hosts which do not exist.

In most DNS servers, failed lookups are not cached. This means that if we do a lookup of ksdfksdhfks.com (assuming this doesn't exist), the DNS server will attempt to resolve the host until its timeout is reached, then return an "unknown host" response. Ideally, this "unknown host" response should be returned immediately if we have previously requested a lookup of the same host.

In order to overcome this problem, jASEN delegates the task of DNS lookups and name/address resolution to two delegate classes:
  • DNSResolver
  • InetAddressResolver
Whenever jASEN performs a lookup of a specific DNS record (e.g. an MX record), or merely attempts to resolve a hostname to an IP address, these delegate classes are used.

Thus, there are two main solutions to the problem of failed lookup caching:
  1. Configure your DNS server to cache failed host lookups (if this is possible)
  2. Implement your own caching system via the two resolver classes
Assuming the first option is not available, either because the DNS server does not support this feature or because you don't have control over it, we will look at the second option.

There are many ways to cache the results of a DNS or InetAddress lookup, from the simplest case of just holding the results in memory to a more sophisticated solution involving a formal caching policy and associated systems.

We recommend the latter of these, and have successfully used caching systems like OSCache to provide low-level control over DNS caching.

Please Note: If you intend to implement a caching solution behind the DNS resolver classes, we strongly recommend that you set an appropriate expiry time on cached records. Both successful and failed lookups should only be cached for a limited time, to ensure that changes to the domain name space are accurately reflected in the engine. An appropriate expiry time can be anywhere from one day to one week.

Refer to the javadoc for more information on the implementation requirements of the DNSResolver and InetAddressResolver.
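As an illustration of the expiry recommendation above, the following is a minimal sketch of an in-memory cache which remembers failed lookups as well as successful ones, and discards each entry after a fixed time. The class ExpiringLookupCache and its methods are hypothetical names invented for this example and are not part of jASEN; the actual methods you must implement are defined by the DNSResolver and InetAddressResolver javadoc. A production system would more likely sit on top of a caching library, as recommended above.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/**
 * Hypothetical helper, NOT part of jASEN: caches successful and failed
 * lookups alike, and ignores each entry once its time-to-live has passed.
 */
final class ExpiringLookupCache<K, V> {

    /** A cached outcome; an empty value records a failed lookup. */
    private static final class Entry<V> {
        final Optional<V> value;
        final long expiresAtMillis;

        Entry(Optional<V> value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, Optional<V>> lookup;
    private final long ttlMillis;
    private final LongSupplier clock;

    ExpiringLookupCache(Function<K, Optional<V>> lookup, long ttlMillis, LongSupplier clock) {
        this.lookup = lookup;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached outcome, running the real lookup only when needed. */
    Optional<V> resolve(K key) {
        long now = clock.getAsLong();
        Entry<V> cached = entries.get(key);
        if (cached != null && now < cached.expiresAtMillis) {
            return cached.value;
        }
        // The slow lookup runs without holding a lock. Concurrent callers may
        // occasionally repeat it, which is harmless for a cache.
        Optional<V> result = lookup.apply(key);
        entries.put(key, new Entry<>(result, now + ttlMillis));
        return result;
    }

    /** Example wiring: hostname resolution through the JDK, failures included. */
    static ExpiringLookupCache<String, InetAddress> forHostnames(long ttlMillis) {
        return new ExpiringLookupCache<String, InetAddress>(host -> {
            try {
                return Optional.of(InetAddress.getByName(host));
            } catch (UnknownHostException e) {
                return Optional.empty(); // remembered until the entry expires
            }
        }, ttlMillis, System::currentTimeMillis);
    }
}
```

With a one day expiry, a call such as ExpiringLookupCache.forHostnames(24L * 60 * 60 * 1000) would give a cache in which a second request for a non-existent host returns immediately instead of waiting for the DNS timeout. The clock is passed in as a parameter only so that expiry can be exercised in tests without waiting.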

General Recommendations

If jASEN is to be used in a production environment, we recommend that an appropriate threading model be put in place so that multiple emails are scanned concurrently. Failure to do this will result in very poor performance.

It is therefore important that any plugins created for use within jASEN are thread safe. Because jASEN is intended to be used in a multi-threaded environment, it makes concurrent calls to the registered plugins. Thus, plugins must not maintain state internally. Refer to the plugins section for detailed instructions on creating plugins.
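The two recommendations above can be sketched together as follows: a check which keeps all of its working data in local variables, called from a fixed pool of threads so that several messages are processed at once. The interface MessageCheck and the classes ShoutingCheck and ConcurrentScanSketch are hypothetical stand-ins invented for this example; they are not jASEN's plugin interface or scanner API, and the "proportion of upper-case letters" score is only a toy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a plugin; this is NOT jASEN's plugin interface. */
interface MessageCheck {
    double score(String messageBody);
}

/** Thread safe because it has no fields: all working data is local to the call. */
final class ShoutingCheck implements MessageCheck {
    @Override
    public double score(String messageBody) {
        int letters = 0;
        int upperCase = 0;
        for (char c : messageBody.toCharArray()) {
            if (Character.isLetter(c)) {
                letters++;
                if (Character.isUpperCase(c)) {
                    upperCase++;
                }
            }
        }
        return letters == 0 ? 0.0 : (double) upperCase / letters;
    }
}

final class ConcurrentScanSketch {
    /** Scores every message on a fixed pool, returning results in input order. */
    static List<Double> scoreAll(MessageCheck check, List<String> messages, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Double>> tasks = new ArrayList<>();
            for (String message : messages) {
                // The same check instance is shared by every task.
                tasks.add(() -> check.score(message));
            }
            List<Double> scores = new ArrayList<>();
            for (Future<Double> future : pool.invokeAll(tasks)) {
                scores.add(future.get());
            }
            return scores;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while scanning", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A scan task failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Had ShoutingCheck stored its counters in instance fields, concurrent calls would overwrite each other's counts and the scores would be unreliable. A long-running application would normally create the thread pool once and reuse it, rather than creating one per batch as this sketch does for brevity.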

Accuracy Tuning

Whilst improving performance is an important and ongoing task, perhaps the most significant issue is accuracy. This refers to how effective the engine is at detecting spam, and at not misclassifying ham as spam.

There are several configuration options available to help you tune the engine so that the accuracy of scans in your context is as high as possible.

Please be aware, however, that incorrect or nonsensical configuration may also result in very poor accuracy.

jASEN has already been tuned for general use; however, you may find that certain assumptions do not make sense in your context. If so, we recommend that the following configuration options be reviewed:

Engine Tuning

There are two configurations available in the <scanner> element of the jasen-config file which have a significant effect on accuracy:

  • tokenLimit
    This refers to the maximum number of tokens (words) returned from the tokenizer when parsing the MimeMessage. The default value for this attribute is 30, meaning that the tokenizer will return the first 30 words from the body of a message.

    Increasing the value of this attribute will increase the number of tokens returned, and therefore the amount of data available to the engine. However, this will also increase the amount of "noise" and may actually reduce accuracy.

    Conversely, reducing this value will reduce the amount of data available to the engine, and may reduce it to the point where statistical significance cannot be determined.

    In practice we have found that a balance between the two is best. Most spam will attempt to convey its message in the first few lines, and may also append "noise words" to the end of a message in an attempt to confuse filtering systems.

    Fortunately, jASEN caters for this both by limiting the number of tokens used in analysis and by including some simple linguistic analysis techniques to establish whether a token is a noise word or not.
  • boundary
    This refers to the upper and lower bound allowed for all scans.

    It is not uncommon for certain plugins to return a near zero, or near one, result. For example, the RobinsonScanner will rarely return an absolute zero result, but may return a result very near to zero. This can often skew the results for other plugins, such that a spam email which exhibits no overt spam words may not be detected because a single plugin result is so close to zero that all other results become meaningless.

    In order to overcome this, jASEN uses the boundary attribute to provide a hard limit to the results returned from any one plugin.

    Our testing indicates that a good value for the boundary is 0.01. This means that every plugin result is normalized so that it lies no closer than this value to 0.0 or 1.0, depending on the result.

    For example:

    If a plugin returned 0.00000001 as a probability result, this would be normalized to 0.01. Similarly, a result of 0.99999999 would be normalized to 0.99.

    Increasing the boundary will decrease the range of results returned from the execution of plugins within the engine. Decreasing the boundary will increase the range (and accuracy) of the results returned from plugins, but may also lead to abnormal skewing of results if/when a single plugin returns a result very near to zero or one.
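The effect of the two attributes described above can be sketched in a few lines. The class EngineTuningSketch and its method names are hypothetical and written only for illustration; they are not taken from the jASEN source. The token example also assumes that tokens are simply whitespace-separated words, and omits the noise word analysis which jASEN performs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of tokenLimit and boundary; NOT jASEN code. */
final class EngineTuningSketch {

    /** Keeps only the first tokenLimit whitespace-separated words of the body. */
    static List<String> firstTokens(String body, int tokenLimit) {
        List<String> tokens = new ArrayList<>();
        for (String word : body.trim().split("\\s+")) {
            if (tokens.size() == tokenLimit) {
                break;
            }
            if (!word.isEmpty()) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    /** Pulls a probability into the range [boundary, 1 - boundary]. */
    static double applyBoundary(double probability, double boundary) {
        return Math.max(boundary, Math.min(1.0 - boundary, probability));
    }
}
```

With a boundary of 0.01, applyBoundary(0.00000001, 0.01) yields 0.01, while a mid-range result such as 0.5 passes through unchanged. Only results in the extreme one percent at either end of the scale are altered.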

Plugin Tuning

Most of the native plugins provided with jASEN can be tuned to refine the accuracy of the engine. Typically a plugin will return a probability based on its analysis of the "spamminess" of an email. These probabilities (or more specifically, probability ranges) can be stipulated in the configuration for the appropriate plugin.

We recommend you read the configuration section for a detailed description of the configuration options available.