java Anti Spam ENgine
The pure java Anti Spam ENgine

Training the Engine

The primary scanning system employed by jASEN performs a heuristic analysis of the contents of a MIME message, matching them against a library of known heuristics and their associated probabilities.

This library must be created via a "training" process in which the engine is shown a series of spam (and ham) emails from which the library is constructed.
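To make the idea of a trained library concrete, here is a toy sketch of deriving per-token spam probabilities from two small corpora. It is an illustration only: the class name ToyTrainer, the whitespace-style tokenisation and the probability formula are all assumptions made for this example, and they do not describe jASEN's actual heuristics or its data file format.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy illustration only: NOT jASEN's real heuristics, formula or data format.
final class ToyTrainer {

    /** Counts, for each token, how many messages in the corpus contain it. */
    static Map<String, Integer> countTokens(List<String> messages) {
        Map<String, Integer> counts = new HashMap<>();
        for (String message : messages) {
            // A set ensures a token is counted at most once per message.
            Set<String> tokens = new HashSet<>(Arrays.asList(message.toLowerCase().split("\\W+")));
            tokens.remove(""); // split can yield an empty leading token
            for (String token : tokens) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Builds a token -> spam probability map from a spam and a ham corpus. */
    static Map<String, Double> train(List<String> spam, List<String> ham) {
        if (spam.isEmpty() || ham.isEmpty()) {
            throw new IllegalArgumentException("both corpora must be non-empty");
        }
        Map<String, Integer> spamCounts = countTokens(spam);
        Map<String, Integer> hamCounts = countTokens(ham);

        Set<String> allTokens = new HashSet<>(spamCounts.keySet());
        allTokens.addAll(hamCounts.keySet());

        Map<String, Double> library = new HashMap<>();
        for (String token : allTokens) {
            // Fraction of messages in each corpus that contain the token.
            double spamRate = spamCounts.getOrDefault(token, 0) / (double) spam.size();
            double hamRate = hamCounts.getOrDefault(token, 0) / (double) ham.size();
            // The token occurs in at least one corpus, so the sum is never zero.
            library.put(token, spamRate / (spamRate + hamRate));
        }
        return library;
    }
}
```

In this sketch a token seen only in spam scores 1.0, a token seen only in ham scores 0.0, and a single misfiled message can shift those scores noticeably when the corpora are small.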

We recommend that the core data file (library) be regularly updated to reflect the most recent trends in spam emails, and this updating requires re-training the engine.

Fortunately, this task is simplified by a training tool provided in the distributable.

The JasenTrainer class (org.jasen.core.engine.JasenTrainer) provides all the functionality required to train the engine and generate the required data file.

Training the engine requires a training set consisting of two distinct training sources, or corpora: one containing spam and one containing ham (legitimate email).

Each corpus must consist of plain text, MIME formatted emails. If any message in the corpus is malformed, it will be ignored, and in some cases it may halt training.

We strongly recommend that the two corpora be of approximately equal size. Using corpora of significantly different sizes may lead to inaccurate scanning results.

We also recommend that each corpus consist of at least 2,500 emails, and preferably over 5,000. When collecting email for training purposes, exclude email newsletters from the training set, as these can often cause confusion within the engine and may lead to inaccurate scanning.

It is critical that neither corpus is polluted. That is, there MUST NOT be ANY spam in the ham corpus, and vice versa. Take extreme care when compiling each corpus to avoid pollution.
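The size recommendations above can be checked before training with a small pre-flight tool. The following is a minimal sketch, not part of jASEN: the class name CorpusCheck is hypothetical, it assumes each corpus folder holds one message per regular file, and the 0.8 ratio used to decide whether two corpora are "approximately equal" is an arbitrary assumption. It cannot detect corpus pollution, which still has to be checked by hand.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not shipped with jASEN.
final class CorpusCheck {

    static final long MIN_RECOMMENDED = 2500;  // minimum size recommended in this guide
    static final double MIN_SIZE_RATIO = 0.8;  // assumption: smaller/larger below this is "unbalanced"

    /** Counts regular files in a folder, assuming one message per file. */
    static long countMessages(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files.filter(Files::isRegularFile).count();
        }
    }

    /** Returns a list of warnings; an empty list means the sizes look reasonable. */
    static List<String> evaluate(long spamCount, long hamCount) {
        List<String> warnings = new ArrayList<>();
        if (spamCount < MIN_RECOMMENDED) {
            warnings.add("spam corpus has only " + spamCount + " messages");
        }
        if (hamCount < MIN_RECOMMENDED) {
            warnings.add("ham corpus has only " + hamCount + " messages");
        }
        long smaller = Math.min(spamCount, hamCount);
        long larger = Math.max(spamCount, hamCount);
        if (larger > 0 && (double) smaller / larger < MIN_SIZE_RATIO) {
            warnings.add("corpus sizes differ significantly: " + spamCount + " spam vs " + hamCount + " ham");
        }
        return warnings;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: CorpusCheck <spam corpus path> <ham corpus path>");
            return;
        }
        long spam = countMessages(Paths.get(args[0]));
        long ham = countMessages(Paths.get(args[1]));
        List<String> warnings = evaluate(spam, ham);
        if (warnings.isEmpty()) {
            System.out.println("Corpus sizes look reasonable: " + spam + " spam, " + ham + " ham");
        } else {
            warnings.forEach(w -> System.out.println("WARNING: " + w));
        }
    }
}
```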

JasenTrainer takes three required parameters and one optional parameter:

JasenTrainer <spam corpus path> <ham corpus path> <store path> <command> (optional)
| Parameter | Requisite | Description | Sample |
|---|---|---|---|
| spam corpus path | Required | The folder path containing the spam corpus | /usr/local/jasen/training/spam |
| ham corpus path | Required | The folder path containing the ham corpus | /usr/local/jasen/training/ham |
| store path | Required | The full path (including the file name) to the data file to be written | /usr/local/jasen-data/default/jasen.dat |
| command | Optional (experimental) | Optionally provides the ability to load an existing data file and append to it. MUST be one of 'new' or 'load' | new |
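As a sketch of how the parameters fit together, the helper below assembles a command line for the trainer and rejects any command other than 'new' or 'load'. It assumes that JasenTrainer is launched through its main method with positional arguments in the order shown in the usage line above. The class name TrainerCommand and the classpath placeholder are hypothetical; substitute the jASEN jar and its dependencies from your own installation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not shipped with jASEN.
final class TrainerCommand {

    /**
     * Builds the argument list for launching the trainer in a separate JVM.
     * Pass null for command to omit the optional fourth parameter.
     */
    static List<String> build(String classpath, String spamDir, String hamDir,
                              String storeFile, String command) {
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "java", "-cp", classpath,
                "org.jasen.core.engine.JasenTrainer", // class name as given in this guide
                spamDir, hamDir, storeFile));
        if (command != null) {
            // The guide states the command MUST be one of 'new' or 'load'.
            if (!command.equals("new") && !command.equals("load")) {
                throw new IllegalArgumentException("command must be 'new' or 'load': " + command);
            }
            cmd.add(command);
        }
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build(
                "<jasen jar and dependencies>",            // placeholder, not a real path
                "/usr/local/jasen/training/spam",
                "/usr/local/jasen/training/ham",
                "/usr/local/jasen-data/default/jasen.dat",
                "new");
        // Print the command rather than run it; the list could instead be
        // passed to new ProcessBuilder(cmd).inheritIO().start().
        System.out.println(String.join(" ", cmd));
    }
}
```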


Once the data file has been generated (and assuming you have not simply overwritten the existing one), you can instruct the engine to use the new data file by altering the map-path attribute in the RobinsonScanner plugin.