java.lang.Object
  org.jasen.core.token.EmailTokenizer
Converts the subject, text and html parts of a MimeMessage into discrete String "tokens".
Each token represents either a word, or a specialized representation of certain key information.
For example:
Often the subject line of a message is all that is required to identify it as spam. It can be a very good source of information because it is almost always free from obfuscation (notwithstanding the use of non-ASCII characters). Hence, tokens found in the subject are annotated with the word "Subject" and delimited with a question mark.
For example:
The subject line "Buy viagra!" would be tokenized as:
Field Summary | |
static char |
HEADER_TOKEN_DELIMITER
This is just a rare character used to identify mail header tokens. It looks like two pipes (||). |
static String[] |
IGNORED_HEADERS
Deprecated. This should be done in config |
static String[] |
INCLUDED_HEADERS
Deprecated. This should be done in config |
Constructor Summary | |
EmailTokenizer()
|
Method Summary | |
int |
getLinguisticLimit()
Gets the maximum number of linguistic errors tolerated before tokenization is aborted. |
int |
getTokenLimit()
Gets the maximum number of tokens extracted before tokenization is aborted |
boolean |
isIgnoreHeaders()
Indicates whether the list of IGNORED_HEADERS is ignored when tokenizing. |
static void |
main(String[] args)
Internal test harness only. |
void |
setIgnoreHeaders(boolean b)
Flags the tokenizer to ignore the list of IGNORED_HEADERS when tokenizing. |
void |
setLinguisticLimit(int linguisticLimit)
Sets the maximum number of linguistic errors tolerated before tokenization is aborted. |
void |
setTokenLimit(int i)
Sets the maximum number of tokens extracted before tokenization is aborted |
String[] |
tokenize(javax.mail.internet.MimeMessage mail,
JasenMessage message,
ParserData data)
Tokenizes the given message into meaningful string tokens |
Methods inherited from class java.lang.Object |
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
public static final char HEADER_TOKEN_DELIMITER
public static String[] IGNORED_HEADERS
public static String[] INCLUDED_HEADERS
Constructor Detail |
public EmailTokenizer() throws IOException
Method Detail |
public String[] tokenize(javax.mail.internet.MimeMessage mail, JasenMessage message, ParserData data) throws JasenException
Specified by:
tokenize in interface MimeMessageTokenizer
Parameters:
mail -
message -
Throws:
JasenException
public int getLinguisticLimit()
The tokenizer uses the LinguisticAnalyzer to determine if each token is a real word. After linguisticLimit tokens have successively failed, tokenization is aborted.
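The abort rule described above can be illustrated with a minimal, self-contained sketch. It does not use the Jasen API: the class `LinguisticLimitSketch`, its `tokenize` method, and the dictionary-backed `isRealWord` check are hypothetical stand-ins for the tokenizer and the LinguisticAnalyzer. The sketch also assumes that "aborted" means the tokens collected so far are returned, that failing tokens are still kept, and that the token which reaches the limit is dropped; the actual implementation may behave differently on each point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical illustration only; not part of Jasen.
final class LinguisticLimitSketch {

    /**
     * Splits text on whitespace and collects tokens until linguisticLimit
     * consecutive tokens fail the isRealWord check (assumed abort behaviour).
     */
    static List<String> tokenize(String text, int linguisticLimit, Predicate<String> isRealWord) {
        List<String> tokens = new ArrayList<>();
        int consecutiveFailures = 0;
        for (String candidate : text.trim().split("\\s+")) {
            if (candidate.isEmpty()) {
                continue; // happens only for empty input
            }
            if (isRealWord.test(candidate)) {
                consecutiveFailures = 0; // a real word resets the run of failures
            } else {
                consecutiveFailures++;
                if (consecutiveFailures >= linguisticLimit) {
                    break; // limit reached: abort and return what was collected
                }
            }
            tokens.add(candidate);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Stand-in dictionary; the real LinguisticAnalyzer is not modelled here.
        Set<String> dictionary = Set.of("buy", "cheap", "pills", "now");
        Predicate<String> isRealWord = w -> dictionary.contains(w.toLowerCase());

        // With a limit of 3, tokenization stops at the third consecutive non-word.
        System.out.println(tokenize("buy xq cheap zz qq kk pills", 3, isRealWord));
    }
}
```

Because the counter resets on every recognised word, isolated non-words do not trigger the abort; only an unbroken run of `linguisticLimit` failures does.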
public void setLinguisticLimit(int linguisticLimit)
Parameters:
linguisticLimit - The linguisticLimit to set.
See Also:
getLinguisticLimit()
public boolean isIgnoreHeaders()
See Also:
IGNORED_HEADERS
public void setIgnoreHeaders(boolean b)
Parameters:
b -
public int getTokenLimit()
public void setTokenLimit(int i)
Specified by:
setTokenLimit in interface MimeMessageTokenizer
Parameters:
i -
public static void main(String[] args)
Parameters:
args -