org.jasen.core.parsers
Class SpamHTMLParser

java.lang.Object
  extended by javax.swing.text.html.HTMLEditorKit.ParserCallback
      extended by org.jasen.core.parsers.StandardHTMLParser
          extended by org.jasen.core.parsers.SpamHTMLParser
All Implemented Interfaces:
HTMLParser

public class SpamHTMLParser
extends StandardHTMLParser

Extracts plain text elements from an HTML document.

This implementation is specific to parsing the text out of spam emails.

Author:
Jason Polites

Field Summary
static String BGCOLOR_NAME
          The CSS name for background colors (background-color)
static String COLOR_NAME
          The CSS name for foreground colors (color)
static float COLOR_THRESHOLD
          Deprecated. Use getContrastThreshold
static int DEFAULT_BGCOLOR
          The default numerical background color (white)
static int DEFAULT_COLOR
          The default numerical foreground color (black)
static String DEFAULT_STR_BGCOLOR
          String (hex) value for the default background color (white)
static String DEFAULT_STR_COLOR
          String (hex) value for the default foreground color (black)
static int ELEMENT_THRESHOLD
          Deprecated. Use getMicroElementSize
static int FONTSIZE_THRESHOLD
          Deprecated. Use getMicroFontSize
static String[] HTML_COLOR_NAMES
           
static String[] HTML_COLOR_VALUES
           
static double TOKEN_RECOGNITION_THRESHOLD
          Deprecated. Not used
static String URL_REGEX
          Deprecated. Not used
 
Fields inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
IMPLIED
 
Constructor Summary
SpamHTMLParser()
           
 
Method Summary
 int getConcealedHtmlCount()
          Gets the number of times concealed HTML was found
 float getContrastThreshold()
          Gets the threshold for contrast between foreground and background content elements.
 int getFalseAnchorCount()
          Gets the number of occurrences of "false" anchor tags.
 int getImageCount()
          Gets the number of times images were found
 int getMicroElementSize()
          Gets the size (in pixels) of the minimum allowable element dimension (usually height).
 int getMicroFontSize()
          Gets the size (in points) of the minimum allowable font size.
 int getSrcCgiCount()
          Gets the number of times the source attribute of a tag referenced a remote CGI script
 int getSrcPortCount()
          Gets the number of URL ports found in tags with a src attribute
 List getUrlPorts()
          Gets the list of URL ports found in anchor tags in the message HTML part
 void handleEndTag(HTML.Tag t, int pos)
           
 void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
           
 void handleText(char[] text, int pos)
           
 ParserData parse(javax.mail.internet.MimeMessage mm, JasenMessage message, MimeMessageTokenizer tokenizer)
          Parses the given JasenMessage and returns the results of the parse as a ParserData object.
 void setContrastThreshold(float contrastThreshold)
          Sets the threshold for contrast between foreground and background content elements.
 void setMicroElementSize(int microElementSize)
          Sets the size (in pixels) of the minimum allowable element dimension (usually height).
 void setMicroFontSize(int microFontSize)
          Sets the size (in points) of the minimum allowable font size.
 
Methods inherited from class org.jasen.core.parsers.StandardHTMLParser
extractText, extractText, extractText, handleComment, handleSimpleTag, setEncoding
 
Methods inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
flush, handleEndOfLineString, handleError
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_BGCOLOR

public static final int DEFAULT_BGCOLOR
The default numerical background color (white)

See Also:
Constant Field Values

DEFAULT_COLOR

public static final int DEFAULT_COLOR
The default numerical foreground color (black)

See Also:
Constant Field Values

DEFAULT_STR_BGCOLOR

public static final String DEFAULT_STR_BGCOLOR
String (hex) value for the default background color (white)

See Also:
Constant Field Values

DEFAULT_STR_COLOR

public static final String DEFAULT_STR_COLOR
String (hex) value for the default foreground color (black)

See Also:
Constant Field Values

COLOR_THRESHOLD

public static final float COLOR_THRESHOLD
Deprecated. Use getContrastThreshold

The contrast threshold below which content is deemed concealed

See Also:
Constant Field Values

FONTSIZE_THRESHOLD

public static final int FONTSIZE_THRESHOLD
Deprecated. Use getMicroFontSize

The font size threshold below which content is deemed concealed

See Also:
Constant Field Values

ELEMENT_THRESHOLD

public static final int ELEMENT_THRESHOLD
Deprecated. Use getMicroElementSize

The size (in pixels) below which an element is considered concealed

See Also:
Constant Field Values

TOKEN_RECOGNITION_THRESHOLD

public static final double TOKEN_RECOGNITION_THRESHOLD
Deprecated. Not used

See Also:
Constant Field Values

BGCOLOR_NAME

public static final String BGCOLOR_NAME
The CSS name for background colors (background-color)

See Also:
Constant Field Values

COLOR_NAME

public static final String COLOR_NAME
The CSS name for foreground colors (color)

See Also:
Constant Field Values

URL_REGEX

public static final String URL_REGEX
Deprecated. Not used

See Also:
Constant Field Values

HTML_COLOR_NAMES

public static String[] HTML_COLOR_NAMES

HTML_COLOR_VALUES

public static String[] HTML_COLOR_VALUES

Constructor Detail

SpamHTMLParser

public SpamHTMLParser()

Method Detail

handleStartTag

public void handleStartTag(HTML.Tag t,
                           MutableAttributeSet a,
                           int pos)
Overrides:
handleStartTag in class StandardHTMLParser

handleText

public void handleText(char[] text,
                       int pos)
Overrides:
handleText in class StandardHTMLParser

handleEndTag

public void handleEndTag(HTML.Tag t,
                         int pos)
Overrides:
handleEndTag in class StandardHTMLParser
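
The three overrides above are the standard hooks of javax.swing.text.html.HTMLEditorKit.ParserCallback. As a rough illustration of how such hooks are driven, the sketch below feeds an HTML string through the JDK's ParserDelegator and collects the text. The class TextCollectingCallback and its methods are hypothetical names chosen for this example; they are not part of org.jasen and do not reproduce what SpamHTMLParser does internally.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback for illustration only; not the SpamHTMLParser source.
class TextCollectingCallback extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();
    private int anchorCount;

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        // Called for every opening tag; here we only count anchors.
        if (t == HTML.Tag.A) {
            anchorCount++;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Called for each run of character data between tags.
        if (text.length() > 0) {
            text.append(' ');
        }
        text.append(data);
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        // Nothing to do in this sketch; a real parser might pop style state here.
    }

    String getText() {
        return text.toString();
    }

    int getAnchorCount() {
        return anchorCount;
    }

    // Parses the given HTML and returns the populated callback.
    static TextCollectingCallback parseHtml(String html) {
        TextCollectingCallback callback = new TextCollectingCallback();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback;
    }
}
```

A caller would invoke TextCollectingCallback.parseHtml(html) and then read getText() and getAnchorCount() from the returned object.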

getConcealedHtmlCount

public int getConcealedHtmlCount()
Gets the number of times concealed HTML was found

Returns:
An integer representing the number of times a concealment was discovered

getImageCount

public int getImageCount()
Gets the number of times images were found

Returns:
The number of images in the document

getSrcCgiCount

public int getSrcCgiCount()
Gets the number of times the source attribute of a tag referenced a remote CGI script

Returns:

getSrcPortCount

public int getSrcPortCount()
Gets the number of URL ports found in tags with a src attribute

Returns:
The number of URL ports found in tags with a src attribute

getUrlPorts

public List getUrlPorts()
Gets the list of URL ports found in anchor tags in the message HTML part

Returns:
The list of URL ports found in anchor tags
getFalseAnchorCount

public int getFalseAnchorCount()
Gets the number of occurrences of "false" anchor tags.

These exist where an anchor tag displays a url as the text component,
but this url does not match the actual url of the href.

Returns:
The number of times a false anchor reference was discovered
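
The false-anchor idea can be sketched as follows. This is a minimal example under stated assumptions: it treats the anchor text as a URL only when it starts with http:// or https://, and it compares host names rather than full URLs. The class FalseAnchorSketch and its methods are hypothetical, and the matching rule actually used by SpamHTMLParser is not documented on this page.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration; not part of org.jasen.
class FalseAnchorSketch {

    // True when the visible anchor text looks like a URL whose host
    // differs from the host of the href attribute.
    static boolean isFalseAnchor(String href, String anchorText) {
        String shown = anchorText.trim();
        if (!looksLikeUrl(shown)) {
            return false; // ordinary link text such as "Click here"
        }
        String hrefHost = hostOf(href.trim());
        String shownHost = hostOf(shown);
        if (hrefHost == null || shownHost == null) {
            return false; // cannot compare; treated as not false in this sketch
        }
        return !hrefHost.equalsIgnoreCase(shownHost);
    }

    private static boolean looksLikeUrl(String s) {
        return s.regionMatches(true, 0, "http://", 0, 7)
            || s.regionMatches(true, 0, "https://", 0, 8);
    }

    private static String hostOf(String url) {
        try {
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

Comparing hosts instead of whole URLs is a deliberate simplification here, so that a link whose text omits the path of its own href is not flagged.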

getContrastThreshold

public float getContrastThreshold()
Gets the threshold for contrast between foreground and background content elements.
In HTML emails, and particularly in spam, content is often obscured by using low-contrast colors or tones for background and foreground elements. For example, white text on a white background has a contrast of 0.

Returns:
A value between 0.0 and 1.0 such that 0.0 indicates no contrast, and 1.0 indicates complete contrast (e.g. white on black)
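
To make the 0.0 to 1.0 scale concrete, here is one possible way to compute a contrast value from two packed RGB integers. It is a sketch only: the formula (difference of weighted brightness), the class ContrastSketch, and the example threshold of 0.1 are assumptions, since this page does not document how SpamHTMLParser measures contrast or what its default threshold is.

```java
// Hypothetical helper for illustration; not part of org.jasen.
class ContrastSketch {

    // Perceived brightness of a packed 0xRRGGBB color, scaled to 0.0..1.0.
    // Integer weights (299/587/114) keep black at exactly 0 and white at exactly 1.
    static float brightness(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (299 * r + 587 * g + 114 * b) / (1000 * 255f);
    }

    // Contrast as the absolute brightness difference: 0.0 means none, 1.0 means complete.
    static float contrast(int foreground, int background) {
        return Math.abs(brightness(foreground) - brightness(background));
    }

    // Content is treated as concealed when its contrast falls below the threshold.
    static boolean isConcealed(int foreground, int background, float threshold) {
        return contrast(foreground, background) < threshold;
    }
}
```

With an assumed threshold of 0.1f, near-white text (0xFEFEFE) on a white background would be flagged, while black on white would not. A brightness-only measure ignores hue, so two different colors of equal brightness would also score close to 0.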

setContrastThreshold

public void setContrastThreshold(float contrastThreshold)
Sets the threshold for contrast between foreground and background content elements.

Parameters:
contrastThreshold - A value between 0.0 and 1.0
See Also:
getContrastThreshold()

getMicroElementSize

public int getMicroElementSize()
Gets the size (in pixels) of the minimum allowable element dimension (usually height).
Content found inside elements smaller than this size is deemed concealed.

Returns:
The size in pixels of the smallest allowable element dimension
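
The size check described above might look like the following sketch. It assumes the dimension arrives as an HTML attribute value such as "3" or "3px" and that non-pixel values (for example percentages) are never treated as concealed; both assumptions, along with the class MicroElementSketch and its methods, are made up for this example.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of org.jasen.
class MicroElementSketch {

    // Parses "3" or "3px" into a pixel count; returns -1 for anything else.
    static int parsePixels(String value) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        if (v.endsWith("px")) {
            v = v.substring(0, v.length() - 2).trim();
        }
        try {
            return Integer.parseInt(v);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    // True when the dimension is a valid pixel size smaller than microElementSize.
    static boolean isMicroElement(String dimension, int microElementSize) {
        int px = parsePixels(dimension);
        return px >= 0 && px < microElementSize;
    }
}
```

For instance, with a micro element size of 5, a height of "2px" would count as concealed and a height of "40" would not.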

setMicroElementSize

public void setMicroElementSize(int microElementSize)
Sets the size (in pixels) of the minimum allowable element dimension (usually height).

Parameters:
microElementSize - The size in pixels. It is recommended that this be less than 10. Default is 5.

getMicroFontSize

public int getMicroFontSize()
Gets the size (in points) of the minimum allowable font size.
Content found inside font tags with a point size smaller than this is deemed concealed.

Returns:
The size in points of the smallest allowable font. Default is 1

setMicroFontSize

public void setMicroFontSize(int microFontSize)
Sets the size (in points) of the minimum allowable font size.

Parameters:
microFontSize - A size in points. Default is 1

parse

public ParserData parse(javax.mail.internet.MimeMessage mm,
                        JasenMessage message,
                        MimeMessageTokenizer tokenizer)
                 throws JasenException
Description copied from interface: HTMLParser
Parses the given JasenMessage and returns the results of the parse as a ParserData object.

Specified by:
parse in interface HTMLParser
Overrides:
parse in class StandardHTMLParser
Throws:
JasenException