The pure java Anti Spam ENgine

FAQ

How does jASEN work?

jASEN combines probability-based analysis with intelligent message tokenization coupled with an extensive database of known spam heuristics to effectively identify spam whilst minimizing the occurrence of false-positive identifications.

In addition to this, jASEN provides a mechanism to incorporate custom and/or third-party plugins, which can be engineered to perform almost any additional filtering technique required, including sender verification systems like SPF and SenderID.
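
As an illustration of what "probability-based analysis" can mean in practice, the sketch below combines per-token spam probabilities into a single message-level probability using the naive Bayes product rule. This is a generic textbook technique, not jASEN's actual scoring code; the class name, method name, and clamping bounds are assumptions made for the example.

```java
import java.util.List;

/** Illustrative only: a generic naive Bayes combiner, not part of the jASEN API. */
final class TokenProbabilityCombiner {

    private TokenProbabilityCombiner() {}

    /**
     * Combines per-token spam probabilities (each between 0 and 1) into one
     * message-level spam probability: P = prod(p) / (prod(p) + prod(1 - p)).
     * The products are accumulated as sums of logarithms to avoid underflow.
     */
    static double combine(List<Double> tokenSpamProbabilities) {
        double logSpam = 0.0;
        double logHam = 0.0;
        for (double p : tokenSpamProbabilities) {
            // Assumed clamp: no single token may force the result to exactly 0 or 1.
            double clamped = Math.min(0.99, Math.max(0.01, p));
            logSpam += Math.log(clamped);
            logHam += Math.log(1.0 - clamped);
        }
        // prod(p) / (prod(p) + prod(1 - p)) rewritten as 1 / (1 + exp(logHam - logSpam))
        return 1.0 / (1.0 + Math.exp(logHam - logSpam));
    }
}
```

With this rule, a message whose tokens are evenly split between spammy and clean evidence lands near 0.5, which corresponds to the "not enough evidence either way" situation described later in this FAQ.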

Why does jASEN sometimes fail to detect spam?

This may be caused by many things. jASEN looks for certain patterns commonly found in spam email, but also looks for patterns commonly found in non-spam email so as to avoid false positives. If a spam email does not contain any major spam "markers" then it may be classified as ham. Fortunately, this does not happen very often.

Even the most innocuous spam email will usually contain some spammy words or markers. In these cases, jASEN may identify that the email is not completely clean, but may not have enough evidence to make a definitive claim either way.

So... how do you improve the situation? There are three main ways to make jASEN more accurate for your situation:
  1. Training

    jASEN is distributed with a reference engine configuration based on a training corpus of approximately 10,000 emails (spam and ham). Whilst this configuration works well, the engine is designed to be regularly updated and "re-trained" to maximize its detection capabilities. As new types of spam emerge, the engine must be updated to recognize these new varieties. See the Training section for more information.
  2. Tuning

    Almost all the features of jASEN used to detect spam are completely configurable. This means you have fine-grained control over how jASEN ranks and scores email messages. Be aware, however, that it is also possible to render jASEN useless if nonsensical configurations are set. See the Tuning section for more information.
  3. Plugins

    If the native features of jASEN still won't pick up the offending message(s) then you can always create your own plugin. This may be as simple as manually checking for certain spam markers not identified by jASEN (special keywords, lookups against your own database, etc.), or may involve more complex features like incorporating AI scanning systems. The choice is up to you. Please note, however, that the licence agreement under which jASEN is distributed requires that any work done to the engine itself must be fed back into the project.

How do I install jASEN into my mail server?

jASEN is currently an engine only and does not provide any out-of-the-box integration with major mail servers. It is our intention to provide these in the future; at present, however, integration is left up to you.

If you have developed an integration component for a major mail server let us know and we will add it to the project!

Can I use jASEN in Outlook?

Yes and no. jASEN is simply an anti-spam engine; it is not an anti-spam application. Thus jASEN can be used in almost any Java-based context, but does not provide any significant reference implementations for doing this.

If you want to integrate jASEN into Outlook you will have to create your own Outlook or Exchange add-in. We are, however, currently working on a desktop anti-spam product based on jASEN, but at this point it is unclear whether this will be part of the open source project or a stand-alone commercial application.

jASEN seems to take a long time to scan a message. How do I make it faster?

jASEN performs several string- and character-based tokenization and manipulation steps; however, these have a negligible effect on the total time taken for a scan. The single most expensive operations are DNS and reverse DNS lookups, in particular lookups of hosts which do not exist. Both the RBLScanner (Realtime Blackhole List) and the SenderVerificationScanner perform DNS lookups of domains and/or IP addresses. Unfortunately, in the case of spam, these addresses are often false and do not correspond to any valid DNS record. Thus there will be a DNS "timeout" (usually in the order of 1-2 seconds) whenever the requested host cannot be found.

Most DNS servers will cache successful DNS lookup results, but many will not cache DNS lookup failures. This means that every time an unknown host is requested, the DNS will attempt to resolve the host without looking into its cache.

To address this problem, jASEN uses two interfaces, DNSResolver and InetAddressResolver, to resolve DNS records and domains or IP addresses. jASEN also provides implementations of these two interfaces; however, the default implementations do not provide any caching. Thus they suffer from the same caching problem exhibited by the DNS itself.

We recommend you implement your own DNSResolver and InetAddressResolver if you are finding the performance of DNS lookups to be a problem. By adding a simple cache system such as OSCache to these resolvers you will be able to have the low level control over the caching of DNS lookups required to overcome the performance problems this presents.
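
A minimal sketch of the caching idea follows. It assumes a hypothetical single-method interface, HostLookup, standing in for jASEN's InetAddressResolver, whose real method signatures are not shown in this FAQ and may differ. It also uses a plain ConcurrentHashMap with a fixed time-to-live instead of OSCache, purely to keep the example self-contained. The important point is that failed lookups are cached as well as successful ones.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a resolver interface; jASEN's real interface may differ. */
interface HostLookup {
    InetAddress lookup(String host) throws UnknownHostException;
}

/** Wraps another lookup and remembers both successes and failures for a fixed time. */
final class CachingHostLookup implements HostLookup {

    private static final class Entry {
        final InetAddress address;   // null records a failed lookup
        final long expiresAtMillis;

        Entry(InetAddress address, long expiresAtMillis) {
            this.address = address;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final HostLookup delegate;
    private final long ttlMillis;
    private final Map<String, Entry> cache = new ConcurrentHashMap<>();

    CachingHostLookup(HostLookup delegate, long ttlMillis) {
        this.delegate = delegate;
        this.ttlMillis = ttlMillis;
    }

    @Override
    public InetAddress lookup(String host) throws UnknownHostException {
        String key = host.toLowerCase(Locale.ROOT);
        long now = System.currentTimeMillis();
        Entry entry = cache.get(key);
        if (entry == null || entry.expiresAtMillis <= now) {
            InetAddress resolved;
            try {
                resolved = delegate.lookup(host);
            } catch (UnknownHostException e) {
                // Negative result: remember it so the next scan does not wait for a timeout again.
                resolved = null;
            }
            entry = new Entry(resolved, now + ttlMillis);
            cache.put(key, entry);
        }
        if (entry.address == null) {
            throw new UnknownHostException(host);
        }
        return entry.address;
    }
}
```

A wrapper like this could be constructed as `new CachingHostLookup(InetAddress::getByName, 10 * 60 * 1000L)` to cache results for ten minutes. Note that the map in this sketch is unbounded; a production resolver would also need a size limit or eviction policy, which is one reason to prefer a dedicated cache library.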

Future versions of jASEN may include cached resolvers using similar caching products; at present, however, this is left up to you.

How do I train jASEN with my own email database?

Refer to the training section for detailed instructions on how to train jASEN.

Can I incrementally train the engine with single emails?

At present, no. The engine data file used by jASEN is loaded at start-up and is not referenced during normal operation.

We recognise that this is a desirable feature and are currently working towards an incremental training system; at present, however, this is not available.

Can I do a "live update" of the spam data files without stopping the engine?

Yes!

jASEN comes with an internal auto-update system. This works by downloading a small update parcel file from an update site and on the basis of the information therein, downloads and installs the relevant updates.

This update system can even apply code changes to plugins; at present, however, it is not able to dynamically install changes to the core engine.

Refer to the configuration section for more information on the auto-update engine.

I am getting false positives. How do I stop this happening?

We recognise that the worst aspect of any anti-spam product is falsely identifying ham email as spam, and have made every effort to ensure this does not happen.

However, whilst the likelihood of jASEN generating false positive scores is low, it does happen. There are two key situations where this may occur:
  1. Email newsletters
  2. Spammy ham
Email newsletters will often exhibit many spam characteristics. Things like HTML-only content (no TEXT part), spammy words like FREE and OFFER, and mail bugs are common in email newsletters and will often make them indistinguishable from spam. In the future, email newsletter providers may (we hope) begin to comply with systems like SPF, which will help to prevent false identification; at present, however, there is no elegant solution for the problem with these types of email.

Spammy ham messages are messages sent from a legitimate (usually human) sender but which contain spammy words and/or markers. If a legitimate sender sends an email containing a high portion of words like "free" or "mortgage" (etc) it may be identified as spam.

There are two solutions to this problem:
  1. White/black lists
  2. Sender verification
In the case of email newsletters, which are often sent by software rather than a real person, the simplest solution is to provide a white list of approved sender addresses or (even better) sending mail servers. jASEN does not provide a whitelist plugin by default; however, it is a simple plugin to create.
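
The core of such a whitelist could look roughly like the sketch below. It is deliberately independent of jASEN's plugin interfaces (see the plugins section for those); the class and method names are hypothetical, and the matching rules (case-insensitive exact addresses plus whole domains) are assumptions chosen for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative sender whitelist; not part of the jASEN distribution. */
final class SenderWhitelist {

    private final Set<String> addresses = new HashSet<>();
    private final Set<String> domains = new HashSet<>();

    /** Approves one exact sender address, e.g. a newsletter's From address. */
    void allowAddress(String address) {
        addresses.add(normalize(address));
    }

    /** Approves every sender at the given domain. */
    void allowDomain(String domain) {
        domains.add(normalize(domain));
    }

    /** Returns true if the address itself or its domain has been approved. */
    boolean isAllowed(String senderAddress) {
        if (senderAddress == null) {
            return false;
        }
        String sender = normalize(senderAddress);
        if (addresses.contains(sender)) {
            return true;
        }
        int at = sender.lastIndexOf('@');
        return at >= 0 && domains.contains(sender.substring(at + 1));
    }

    private static String normalize(String value) {
        return value.trim().toLowerCase(Locale.ROOT);
    }
}
```

Sender addresses are easy to forge, which is why whitelisting the sending mail server is described above as the better option. The same structure would apply with server host names or IP addresses as the keys.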

In the case of spammy ham, the best solution is a combination of white/black lists and sender verification. In almost all cases where jASEN falsely identifies a ham email as spam, it will return a "borderline" result indicating that it can't be sure of the legitimacy of the email. In these cases a separate verification process could be undertaken to determine definitively if the sender or their mail server should be white-listed.

Our tests indicate that less than 2% of all email (excluding email newsletters) is falsely identified as spam, and over 95% of these are identified as borderline cases. This means that jASEN in combination with a white/black list approach becomes 99.999% effective at not falsely identifying ham as spam.
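
The borderline handling described above can be sketched as a small triage step that runs after scanning. The sketch assumes the scan result is available as a spam probability between 0 and 1 and that "borderline" means a score between two thresholds. Both assumptions, the threshold values, and all names here are hypothetical and are not taken from the jASEN API.

```java
import java.util.function.Predicate;

/** Illustrative post-scan triage; thresholds and names are assumptions. */
final class BorderlineTriage {

    enum Verdict { HAM, SPAM, NEEDS_VERIFICATION }

    private final double hamBelow;        // scores under this are treated as ham
    private final double spamAtOrAbove;   // scores at or over this are treated as spam
    private final Predicate<String> whitelisted;
    private final Predicate<String> blacklisted;

    BorderlineTriage(double hamBelow, double spamAtOrAbove,
                     Predicate<String> whitelisted, Predicate<String> blacklisted) {
        if (hamBelow > spamAtOrAbove) {
            throw new IllegalArgumentException("hamBelow must not exceed spamAtOrAbove");
        }
        this.hamBelow = hamBelow;
        this.spamAtOrAbove = spamAtOrAbove;
        this.whitelisted = whitelisted;
        this.blacklisted = blacklisted;
    }

    /** List membership wins over the score; otherwise the score band decides. */
    Verdict decide(String sender, double spamProbability) {
        if (blacklisted.test(sender)) {
            return Verdict.SPAM;
        }
        if (whitelisted.test(sender)) {
            return Verdict.HAM;
        }
        if (spamProbability < hamBelow) {
            return Verdict.HAM;
        }
        if (spamProbability >= spamAtOrAbove) {
            return Verdict.SPAM;
        }
        // Borderline: hand over to a separate sender verification process.
        return Verdict.NEEDS_VERIFICATION;
    }
}
```

Messages that come back as NEEDS_VERIFICATION are the ones for which a verification step would decide whether the sender or their mail server should be added to the white list.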

I am getting a java.net.MalformedURLException when I try to run the sample programs

This is most likely a classpath issue. jASEN requires that the following folders be on the classpath:
  • jasen-conf
  • jasen-data
These folders are found in the root path of the distributable.
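
If it is unclear whether the folders are actually on the classpath, a small diagnostic along the following lines may help. It only inspects a classpath string for entries whose last path element matches the required folder names; the class and method names are hypothetical, and the check does not verify the folders' contents.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative diagnostic; not part of the jASEN distribution. */
final class ClasspathCheck {

    private ClasspathCheck() {}

    /**
     * Returns the required folder names that do not appear as the last
     * path element of any entry in the given classpath string.
     */
    static List<String> missingFolders(String classpath, List<String> requiredFolderNames) {
        List<String> missing = new ArrayList<>(requiredFolderNames);
        String separator = Pattern.quote(File.pathSeparator);
        for (String entry : classpath.split(separator)) {
            // File.getName() yields the last path element, ignoring a trailing separator.
            missing.remove(new File(entry).getName());
        }
        return missing;
    }
}
```

Calling `ClasspathCheck.missingFolders(System.getProperty("java.class.path"), Arrays.asList("jasen-conf", "jasen-data"))` from the sample program's JVM returns an empty list when both folders are present as classpath entries.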

How do I add my own filtering systems to jASEN?

jASEN provides a simple but effective mechanism to incorporate your own scanning logic into the engine by creating plugins.

See the plugins section for detailed instructions on how to create your own plugins.

Will jASEN work with multilingual spam emails?

No. jASEN uses several word and linguistic techniques to identify spam which rely on the premise that the spam is in English.

Whilst the broad techniques adopted by jASEN are transferable to non-English (single-byte) languages, we simply do not have an extensive enough database of non-English ham/spam with which to train the engine. It is unclear whether this assumption holds true for double-byte languages like Chinese, and the likelihood is that such languages are not amenable to the techniques used in jASEN.

Fortunately almost all spam is in English, so this limitation does not currently present a serious problem. The increase in the use of languages like Chinese in an online context however, does indicate that this may not always be the case.

At present there are no plans to provide support for multilingual scanning; however, we would welcome any thoughts or contributions on the matter.

Can I use jASEN in a client application?

Yes and no. See the "Can I use jASEN in Outlook?" topic above for more information.