org.jasen.core.token
Class SpamTokenizer

java.lang.Object
  extended by org.jasen.core.token.SpamTokenizer

public class SpamTokenizer
extends Object

This class is used exclusively by the EmailTokenizer.

Author:
Jason Polites
See Also:
EmailTokenizer

Field Summary
static char[] DELIMITER_CHARS
          Characters which should always be treated as delimiters, except when within a URL. This array MUST be sorted to facilitate a binary search.
static int MAX_TOKEN_LENGTH
           
static int MIN_TOKEN_LENGTH
           
static char[] STOP_CHARS
          This list does NOT contain "$,@,?,!" as we want to retain these.
static String[] STOP_WORDS
           
static double TOKEN_RECOGNITION_THRESHOLD
           
 
Constructor Summary
SpamTokenizer()
           
 
Method Summary
 int getMaxTokens()
          Gets the maximum number of tokens to be extracted prior to aborting the tokenization process
 void setMaxTokens(int i)
           
 String[] tokenize(Reader reader, boolean onlyUrls, TokenErrorRecorder recorder)
           
 String[] tokenize(String str, boolean onlyUrls, TokenErrorRecorder recorder)
          Custom implementation which returns only URLs. This is used specifically for mail headers.
 String[] tokenize(String str, TokenErrorRecorder recorder)
           
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MIN_TOKEN_LENGTH

public static int MIN_TOKEN_LENGTH

MAX_TOKEN_LENGTH

public static int MAX_TOKEN_LENGTH

TOKEN_RECOGNITION_THRESHOLD

public static double TOKEN_RECOGNITION_THRESHOLD

STOP_WORDS

public static String[] STOP_WORDS

STOP_CHARS

public static char[] STOP_CHARS
This list does NOT contain "$,@,?,!" as we want to retain these. This array MUST be sorted to facilitate a binary search.


DELIMITER_CHARS

public static char[] DELIMITER_CHARS
Characters which should always be treated as delimiters, except when within a URL. This array MUST be sorted to facilitate a binary search.
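
The sorting requirement on STOP_CHARS and DELIMITER_CHARS can be illustrated with a minimal sketch. It assumes the lookup is done with java.util.Arrays.binarySearch, which this page does not confirm. The class name SortedCharLookupSketch, the method isDelimiter, and the array contents are hypothetical; the real values of DELIMITER_CHARS are not shown here.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of org.jasen.
class SortedCharLookupSketch {

    // Example values; the real contents of DELIMITER_CHARS are not documented on this page.
    static final char[] EXAMPLE_DELIMITERS = {' ', ',', ';', '\t', '\n'};

    static {
        // Arrays.binarySearch has undefined results on unsorted input,
        // so the array is sorted once before any lookup.
        Arrays.sort(EXAMPLE_DELIMITERS);
    }

    // Returns true if c is found in the sorted example array.
    static boolean isDelimiter(char c) {
        return Arrays.binarySearch(EXAMPLE_DELIMITERS, c) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isDelimiter(','));
        System.out.println(isDelimiter('a'));
    }
}
```

If a caller replaces one of these public static arrays with an unsorted one, a binary search may miss characters that are present, which is presumably why the sorting rule is stated as a MUST.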

Constructor Detail

SpamTokenizer

public SpamTokenizer()
Method Detail

tokenize

public String[] tokenize(String str,
                         boolean onlyUrls,
                         TokenErrorRecorder recorder)
                  throws IOException
Custom implementation which returns only URLs.

This is used specifically for mail headers.

Parameters:
str -
onlyUrls -
Returns:
The reduced set of tokens (words)
Throws:
IOException

tokenize

public String[] tokenize(String str,
                         TokenErrorRecorder recorder)
                  throws IOException
Throws:
IOException

tokenize

public String[] tokenize(Reader reader,
                         boolean onlyUrls,
                         TokenErrorRecorder recorder)
                  throws IOException
Throws:
IOException

getMaxTokens

public int getMaxTokens()
Gets the maximum number of tokens to be extracted prior to aborting the tokenization process

Returns:
The max number of tokens
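
As a rough sketch of what a token cap of this kind might look like, the following stand-in class stops collecting tokens once a limit is reached. It is not the SpamTokenizer implementation: the class MaxTokensSketch, its default limit, and the whitespace splitting via java.util.StringTokenizer are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical stand-in; not the org.jasen SpamTokenizer.
class MaxTokensSketch {

    // Assumed default; the real default is not documented on this page.
    private int maxTokens = 3;

    int getMaxTokens() {
        return maxTokens;
    }

    void setMaxTokens(int i) {
        maxTokens = i;
    }

    // Splits on whitespace and stops as soon as maxTokens tokens are collected.
    String[] tokenize(String str) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(str);
        while (st.hasMoreTokens() && tokens.size() < maxTokens) {
            tokens.add(st.nextToken());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        MaxTokensSketch sketch = new MaxTokensSketch();
        sketch.setMaxTokens(2);
        System.out.println(sketch.tokenize("buy cheap pills now").length);
    }
}
```

A cap like this bounds the work done on very large messages, at the cost of ignoring any tokens past the limit.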

setMaxTokens

public void setMaxTokens(int i)
Parameters:
i - The max number of tokens