org.jasen.core.linguistics
Class LinguisticAnalyzer

java.lang.Object
  extended by org.jasen.core.linguistics.LinguisticAnalyzer

public final class LinguisticAnalyzer
extends Object

Singleton linguistic analyzer class used to determine whether a word is valid.

Author:
Jason Polites

Field Summary
static double DEFAULT_THRESHOLD
           
static char[] EXTENDED_UNICODE_REPLACE
           
static char[] EXTENDED_UNICODE_SEARCH
           
static char[] STANDARD_UNICODE_REPLACE
           
static char[] STANDARD_UNICODE_SEARCH
           
 
Method Summary
static String clean(String word)
          Uses the replacement facilities of the analyzer to "estimate" the best character replacements to clean the word.
static char getExtendedReplacement(char chr)
          Gets the most logical standard ASCII replacement for the extended ASCII character passed
static char getFullReplacement(char chr)
          Looks for either standard, or extended replacements for the given character
static LinguisticAnalyzer getInstance()
          Returns the current instance, or creates and initialises the internal analyzer
static char getStandardReplacement(char chr)
          Performs a standard replacement of ASCII characters with other ASCII characters.
 double getWordScore(String word)
          Computes the probability that the given word is a "real" word
 double getWordScore(String word, boolean clean)
          Computes the probability that the given word is a "real" word
 boolean isWord(String word)
          Returns true if the word is valid according to the default threshold of 0.1.
 boolean isWord(String word, boolean clean)
          Returns true if the word is valid according to the default threshold of 0.1.
 boolean isWord(String word, double threshold, boolean clean)
          Returns true if the word is valid according to the given threshold.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_THRESHOLD

public static final double DEFAULT_THRESHOLD
See Also:
Constant Field Values

STANDARD_UNICODE_SEARCH

public static char[] STANDARD_UNICODE_SEARCH

STANDARD_UNICODE_REPLACE

public static char[] STANDARD_UNICODE_REPLACE

EXTENDED_UNICODE_SEARCH

public static char[] EXTENDED_UNICODE_SEARCH

EXTENDED_UNICODE_REPLACE

public static char[] EXTENDED_UNICODE_REPLACE
Method Detail

getInstance

public static final LinguisticAnalyzer getInstance()
Returns the current instance, or creates and initialises the internal analyzer

Returns:
The single analyzer instance
Throws:
IOException

getWordScore

public double getWordScore(String word,
                           boolean clean)
Computes the probability that the given word is a "real" word

Parameters:
word - The word to test
clean - If true, aberrant (non-alphabetical) characters are removed
Returns:
A value from 0.0 to 1.0

getWordScore

public double getWordScore(String word)
Computes the probability that the given word is a "real" word

Parameters:
word - The word to test
Returns:
A value between 0.0 and 1.0 indicating the probability that the word is a genuine English word

isWord

public boolean isWord(String word,
                      double threshold,
                      boolean clean)
Returns true if the word is valid according to the given threshold.

If the probability calculated is >= threshold then true is returned

Parameters:
word - The word to test
threshold - Should be a value between 0.0 and 1.0
Returns:
True if the String passed looks like a word, false otherwise
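
The comparison rule above can be illustrated with a minimal, self-contained sketch. It assumes the score has already been computed (for example by getWordScore); the class name ThresholdSketch, the method name passes, and the sample scores are hypothetical and not part of the library.

```java
// Hypothetical sketch of the documented rule "score >= threshold returns true".
// This is not the library's implementation.
final class ThresholdSketch {
    // Mirrors the documented default threshold of 0.1.
    static final double DEFAULT_THRESHOLD = 0.1;

    // True when the score meets or exceeds the threshold.
    static boolean passes(double score, double threshold) {
        return score >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(passes(0.10, DEFAULT_THRESHOLD)); // boundary value is accepted
        System.out.println(passes(0.05, DEFAULT_THRESHOLD)); // below the threshold is rejected
        System.out.println(passes(0.30, 0.5));               // a stricter caller-supplied threshold
    }
}
```

Because the comparison is inclusive, a score exactly equal to the threshold counts as a word; raising the threshold makes the check stricter.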

isWord

public boolean isWord(String word,
                      boolean clean)
Returns true if the word is valid according to the default threshold of 0.1.

Parameters:
word - The word to test
clean - If true, the word has extended ASCII characters replaced with ASCII equivalents.
Returns:
True if the String passed looks like a word, false otherwise.

isWord

public boolean isWord(String word)
Returns true if the word is valid according to the default threshold of 0.1.

Parameters:
word - The word to test
Returns:
True if the String passed looks like a word, false otherwise.

clean

public static String clean(String word)
Uses the replacement facilities of the analyzer to "estimate" the best character replacements to clean the word.

Parameters:
word - The word to investigate.
Returns:
The same word with aberrant characters replaced.

getExtendedReplacement

public static char getExtendedReplacement(char chr)
Gets the most logical standard ASCII replacement for the extended ASCII character passed

Parameters:
chr - The character to replace. Usually non ASCII
Returns:
The replacement character, chosen as the best match based on the character's visual appearance

getStandardReplacement

public static char getStandardReplacement(char chr)
Performs a standard replacement of ASCII characters with other ASCII characters.

This is used where a word has been deliberately obfuscated by substituting similar-looking characters for the intended ones.

For example, the word "he||0 w0r|d" should be interpreted as "hello world".

Parameters:
chr - The standard ASCII character to replace
Returns:
The replaced character
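
One plausible way to implement this kind of substitution is a pair of parallel arrays, in the spirit of the STANDARD_UNICODE_SEARCH and STANDARD_UNICODE_REPLACE fields. The sketch below is an assumption for illustration only: the actual contents of those fields are not documented here, and the class name, array contents, and helper names are hypothetical.

```java
// Hypothetical sketch of look-alike character replacement using parallel arrays.
// The mappings are illustrative; the library's real tables are not shown in this page.
final class StandardReplacementSketch {
    // SEARCH[i] is replaced by REPLACE[i].
    static final char[] SEARCH  = {'|', '0', '3', '@', '$'};
    static final char[] REPLACE = {'l', 'o', 'e', 'a', 's'};

    // Returns the mapped character, or the input unchanged when no mapping exists.
    static char replace(char chr) {
        for (int i = 0; i < SEARCH.length; i++) {
            if (SEARCH[i] == chr) {
                return REPLACE[i];
            }
        }
        return chr;
    }

    // Applies replace(char) to every character of the word.
    static String replaceAll(String word) {
        StringBuilder out = new StringBuilder(word.length());
        for (int i = 0; i < word.length(); i++) {
            out.append(replace(word.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceAll("he||0 w0r|d")); // prints "hello world"
    }
}
```

Characters without a mapping pass through unchanged, so ordinary words are unaffected.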

getFullReplacement

public static char getFullReplacement(char chr)
Looks for either standard, or extended replacements for the given character

Parameters:
chr - The character to replace
Returns:
The replaced character
See Also:
getExtendedReplacement(char), getStandardReplacement(char)
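
A combined lookup of this kind could be sketched as a fallback chain. The order used here (standard table first, then extended) is an assumption, as are the class name, the table contents, and the lookup helper; the page does not state how the library resolves a character present in both tables.

```java
// Hypothetical sketch: consult a standard table, then fall back to an extended table.
// The lookup order and all mappings are assumptions, not the library's documented behaviour.
final class FullReplacementSketch {
    static final char[] STANDARD_SEARCH  = {'|', '0'};
    static final char[] STANDARD_REPLACE = {'l', 'o'};
    // Extended examples: e-acute and o-umlaut, written as Unicode escapes.
    static final char[] EXTENDED_SEARCH  = {'\u00e9', '\u00f6'};
    static final char[] EXTENDED_REPLACE = {'e', 'o'};

    // Returns the mapped character from the given tables, or chr if it is not listed.
    static char lookup(char chr, char[] search, char[] replace) {
        for (int i = 0; i < search.length; i++) {
            if (search[i] == chr) {
                return replace[i];
            }
        }
        return chr;
    }

    // Tries the standard table first; if nothing changed, tries the extended table.
    static char fullReplacement(char chr) {
        char standard = lookup(chr, STANDARD_SEARCH, STANDARD_REPLACE);
        if (standard != chr) {
            return standard;
        }
        return lookup(chr, EXTENDED_SEARCH, EXTENDED_REPLACE);
    }

    public static void main(String[] args) {
        System.out.println(fullReplacement('|'));      // standard mapping
        System.out.println(fullReplacement('\u00f6')); // extended mapping
        System.out.println(fullReplacement('x'));      // no mapping, returned unchanged
    }
}
```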