java.lang.Object
  org.jasen.core.token.EmailTokenizer
Converts the subject, text and html parts of a MimeMessage into discrete String "tokens".
Each token represents either a word, or a specialized representation of certain key information.
For example:
Often the subject line of a message is all that is required to identify it as spam. The subject is a very good source of information because it is almost always free from obfuscation (notwithstanding the use of non-ASCII characters). Hence, tokens found in the subject are annotated with the word "Subject" and delimited with a question mark.
For example:
The subject line "Buy viagra!" would be tokenized as:
| Field Summary | |
static char | 
HEADER_TOKEN_DELIMITER
This is just a rare character used to identify mail header tokens. It looks like two pipes (||).  | 
static String[] | 
IGNORED_HEADERS
Deprecated. This should be done in config  | 
static String[] | 
INCLUDED_HEADERS
Deprecated. This should be done in config  | 
| Constructor Summary | |
EmailTokenizer()
 | 
|
| Method Summary | |
 int | 
getLinguisticLimit()
Gets the maximum number of linguistic errors tolerated before tokenization is aborted.  | 
 int | 
getTokenLimit()
Gets the maximum number of tokens extracted before tokenization is aborted.  | 
 boolean | 
isIgnoreHeaders()
Tells us whether the tokenizer is ignoring the list of IGNORED_HEADERS when tokenizing.  | 
static void | 
main(String[] args)
Internal test harness only.  | 
 void | 
setIgnoreHeaders(boolean b)
Flags the tokenizer to ignore the list of IGNORED_HEADERS when tokenizing.  | 
 void | 
setLinguisticLimit(int linguisticLimit)
Sets the maximum number of linguistic errors tolerated before tokenization is aborted.  | 
 void | 
setTokenLimit(int i)
Sets the maximum number of tokens extracted before tokenization is aborted.  | 
 String[] | 
tokenize(javax.mail.internet.MimeMessage mail,
         JasenMessage message,
         ParserData data)
Tokenizes the given message into meaningful string tokens.  | 
| Methods inherited from class java.lang.Object | 
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait | 
| Field Detail | 
public static final char HEADER_TOKEN_DELIMITER
public static String[] IGNORED_HEADERS
public static String[] INCLUDED_HEADERS
| Constructor Detail | 
public EmailTokenizer()
               throws IOException
| Method Detail | 
public String[] tokenize(javax.mail.internet.MimeMessage mail,
                         JasenMessage message,
                         ParserData data)
                  throws JasenException
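Description copied from interface: MimeMessageTokenizer
Tokenizes the given message into meaningful string tokens.
Specified by:
tokenize in interface MimeMessageTokenizer
Parameters:
mail -
message -
Throws:
JasenException
public int getLinguisticLimit()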
The tokenizer uses the LinguisticAnalyzer to determine if each token is a real word. After linguisticLimit tokens have successively failed, tokenization is aborted.
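The abort rule described above can be illustrated with a minimal, self-contained sketch. It is not the Jasen implementation: the class LinguisticLimitSketch, the method collectUntilLimit and the Predicate standing in for the LinguisticAnalyzer are all hypothetical names. The sketch also assumes that "successively" means consecutive failures, that a token recognized as a real word resets the failure count, and that aborting simply stops tokenization and returns the tokens accepted so far.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

public class LinguisticLimitSketch {

    /**
     * Walks the candidate tokens in order and stops as soon as
     * linguisticLimit tokens in a row fail the isRealWord check.
     * The predicate is a hypothetical stand-in for the LinguisticAnalyzer.
     */
    static List<String> collectUntilLimit(List<String> candidates,
                                          Predicate<String> isRealWord,
                                          int linguisticLimit) {
        List<String> accepted = new ArrayList<>();
        int consecutiveFailures = 0;
        for (String token : candidates) {
            if (isRealWord.test(token)) {
                // Assumption: a real word resets the failure counter.
                consecutiveFailures = 0;
                accepted.add(token);
            } else {
                consecutiveFailures++;
                if (consecutiveFailures >= linguisticLimit) {
                    // Assumption: "aborted" means we stop and keep what we have.
                    break;
                }
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Toy dictionary used only for this illustration.
        Set<String> dictionary = new HashSet<>(Arrays.asList("buy", "now", "cheap"));
        List<String> candidates =
                Arrays.asList("buy", "xqzt", "now", "zzkq", "qqxv", "cheap");

        // With a limit of 2, the run "zzkq", "qqxv" triggers the abort,
        // so "cheap" is never reached.
        System.out.println(collectUntilLimit(candidates, dictionary::contains, 2));
    }
}
```

In this sketch a lower limit gives up sooner on text that is mostly non-words, while a higher limit tolerates longer runs of unrecognized tokens before stopping.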
public void setLinguisticLimit(int linguisticLimit)
Parameters:
linguisticLimit - The linguisticLimit to set.
See Also:
getLinguisticLimit()
public boolean isIgnoreHeaders()
See Also:
IGNORED_HEADERS
public void setIgnoreHeaders(boolean b)
Parameters:
b -
public int getTokenLimit()
public void setTokenLimit(int i)
Specified by:
setTokenLimit in interface MimeMessageTokenizer
Parameters:
i -
public static void main(String[] args)
Parameters:
args -