LinguisticAnalyzer

org.jasen.core.linguistics
Class LinguisticAnalyzer

java.lang.Object
  org.jasen.core.linguistics.LinguisticAnalyzer

public final class LinguisticAnalyzer
extends Object

Singleton linguistic analyzer class used to determine whether a word is valid.

Author:: Jason Polites

Field Summary
`static double`	`DEFAULT_THRESHOLD`
`static char[]`	`EXTENDED_UNICODE_REPLACE`
`static char[]`	`EXTENDED_UNICODE_SEARCH`
`static char[]`	`STANDARD_UNICODE_REPLACE`
`static char[]`	`STANDARD_UNICODE_SEARCH`

Method Summary
`static String`	`clean(String word)` Uses the replacement facilities of the analyzer to "estimate" the best character replacements to clean the word.
`static char`	`getExtendedReplacement(char chr)` Gets the most logical standard ASCII replacement for the extended ASCII character passed
`static char`	`getFullReplacement(char chr)` Looks for either standard, or extended replacements for the given character
`static LinguisticAnalyzer`	`getInstance()` Returns the current instance, or creates and initialises the internal analyzer
`static char`	`getStandardReplacement(char chr)` Performs a standard replacement of ASCII characters with ASCII characters.
`double`	`getWordScore(String word)` Computes the probability that the given word is a "real" word
`double`	`getWordScore(String word, boolean clean)` Computes the probability that the given word is a "real" word
`boolean`	`isWord(String word)` Returns true if the word is valid according to the default threshold of 0.1.
`boolean`	`isWord(String word, boolean clean)` Returns true if the word is valid according to the default threshold of 0.1.
`boolean`	`isWord(String word, double threshold, boolean clean)` Returns true if the word is valid according to the given threshold.

Methods inherited from class java.lang.Object

equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

DEFAULT_THRESHOLD

public static final double DEFAULT_THRESHOLD

See Also:: Constant Field Values

STANDARD_UNICODE_SEARCH

public static char[] STANDARD_UNICODE_SEARCH

STANDARD_UNICODE_REPLACE

public static char[] STANDARD_UNICODE_REPLACE

EXTENDED_UNICODE_SEARCH

public static char[] EXTENDED_UNICODE_SEARCH

EXTENDED_UNICODE_REPLACE

public static char[] EXTENDED_UNICODE_REPLACE

Method Detail

getInstance

public static final LinguisticAnalyzer getInstance()

Returns the current instance, or creates and initialises the internal analyzer

Returns:: The single analyzer instance
Throws:: IOException
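
The "returns the current instance, or creates and initialises" behaviour described above is the usual lazy singleton pattern. The following is a minimal sketch of that pattern only, not the actual jASEN implementation: the class name `LazyAnalyzerSketch` is hypothetical, and the `throws IOException` clause and the `synchronized` modifier are assumptions made because the documentation lists IOException under Throws.

```java
import java.io.IOException;

// Hypothetical stand-in used only to illustrate lazy singleton creation.
final class LazyAnalyzerSketch {

    // The single shared instance; stays null until first requested.
    private static LazyAnalyzerSketch instance;

    // Private constructor so callers must go through getInstance().
    private LazyAnalyzerSketch() throws IOException {
        // Assumption: a real analyzer would load its data here,
        // which is where an IOException could originate.
    }

    // Creates the instance on first use and returns the same one afterwards.
    static synchronized LazyAnalyzerSketch getInstance() throws IOException {
        if (instance == null) {
            instance = new LazyAnalyzerSketch();
        }
        return instance;
    }
}
```

Repeated calls return the same object, so any initialisation cost is paid once.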

getWordScore

public double getWordScore(String word,
                           boolean clean)

Computes the probability that the given word is a "real" word

Parameters:: word - The word to test; clean - If true, aberrant (non-alphabetical) characters are removed
Returns:: A value from 0.0 to 1.0

getWordScore

public double getWordScore(String word)

Computes the probability that the given word is a "real" word

Parameters:: word - The word to test
Returns:: A value between 0.0 and 1.0 indicating the probability that the word is a genuine English word

isWord

public boolean isWord(String word,
                      double threshold,
                      boolean clean)

Returns true if the word is valid according to the given threshold.

If the probability calculated is >= threshold then true is returned

Parameters:: word - The word to test; threshold - Should be a value between 0.0 and 1.0; clean - If true, the word is cleaned before it is tested
Returns:: True if the String passed looks like a word, false otherwise
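
The threshold rule above is inclusive: a score equal to the threshold counts as a word. The sketch below shows only that comparison, using the documented default of 0.1. The class `ThresholdSketch` and its `passes` methods are hypothetical names, and the scores are supplied by the caller rather than computed by the analyzer.

```java
// Hypothetical helper illustrating the documented ">= threshold" rule.
final class ThresholdSketch {

    // Mirrors the default threshold of 0.1 stated in this documentation.
    static final double DEFAULT_THRESHOLD = 0.1;

    // True when the score meets or exceeds the given threshold.
    static boolean passes(double score, double threshold) {
        return score >= threshold;
    }

    // Convenience overload that applies the default threshold.
    static boolean passes(double score) {
        return passes(score, DEFAULT_THRESHOLD);
    }
}
```

Raising the threshold makes the check stricter, so fewer strings are accepted as words.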

isWord

public boolean isWord(String word,
                      boolean clean)

Returns true if the word is valid according to the default threshold of 0.1.

Parameters:: word - The word to test; clean - If true, the word has extended ASCII characters replaced with ASCII equivalents.
Returns:: True if the String passed looks like a word, false otherwise.

isWord

public boolean isWord(String word)

Returns true if the word is valid according to the default threshold of 0.1.

Parameters:: word - The word to test
Returns:: True if the String passed looks like a word, false otherwise.

clean

public static String clean(String word)

Uses the replacement facilities of the analyzer to "estimate" the best character replacements to clean the word.

Parameters:: word - The word to investigate.
Returns:: The same word with aberrant characters replaced.

getExtendedReplacement

public static char getExtendedReplacement(char chr)

Gets the most logical standard ASCII replacement for the extended ASCII character passed

Parameters:: chr - The character to replace. Usually non ASCII
Returns:: The replaced character. The best match based on physical appearance of the character is made

getStandardReplacement

public static char getStandardReplacement(char chr)

Performs a standard replacement of ASCII characters with ASCII characters.

This is used in situations where a word has been deliberately obfuscated by substituting similar-looking characters for the intended ones.

For example, the word "he||0 w0r|d" should be interpreted as "hello world".

Parameters:: chr - The standard ASCII character to replace
Returns:: The replaced character
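
One way to picture this kind of look-alike substitution is a pair of parallel search and replace arrays, similar in shape to the `STANDARD_UNICODE_SEARCH` and `STANDARD_UNICODE_REPLACE` fields. The sketch below is an assumption about the approach, not the library's code: the class name, the method names, and the array contents are all hypothetical, since the real tables are not listed in this documentation.

```java
// Hypothetical illustration of table-driven look-alike replacement.
final class StandardReplacementSketch {

    // Assumed parallel arrays: SEARCH[i] is replaced by REPLACE[i].
    // These values are illustrative, not the library's actual tables.
    static final char[] SEARCH  = {'|', '0', '1', '@', '$'};
    static final char[] REPLACE = {'l', 'o', 'l', 'a', 's'};

    // Returns the mapped character, or the input unchanged if no mapping exists.
    static char replace(char chr) {
        for (int i = 0; i < SEARCH.length; i++) {
            if (SEARCH[i] == chr) {
                return REPLACE[i];
            }
        }
        return chr;
    }

    // Applies the per-character replacement across a whole string.
    static String replaceAll(String word) {
        StringBuilder out = new StringBuilder(word.length());
        for (int i = 0; i < word.length(); i++) {
            out.append(replace(word.charAt(i)));
        }
        return out.toString();
    }
}
```

With these assumed tables, `StandardReplacementSketch.replaceAll("he||0 w0r|d")` yields "hello world", matching the example above.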

getFullReplacement

public static char getFullReplacement(char chr)

Looks for either standard, or extended replacements for the given character

Parameters:: chr - The character to replace
Returns:: The replaced character
See Also:: getExtendedReplacement(char), getStandardReplacement(char)
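
A combined lookup of this kind can be sketched as trying one table and falling back to the other. The order used here (standard first, then extended) is an assumption, as are the class name `FullReplacementSketch` and the handful of mappings shown. The documentation does not state which table takes precedence.

```java
// Hypothetical illustration of a standard-then-extended fallback lookup.
final class FullReplacementSketch {

    // Assumed standard mappings for look-alike ASCII characters.
    static char standard(char chr) {
        switch (chr) {
            case '0': return 'o';
            case '|': return 'l';
            default:  return chr;
        }
    }

    // Assumed extended mappings for accented characters.
    static char extended(char chr) {
        switch (chr) {
            case '\u00e9': return 'e'; // e with acute accent
            case '\u00fc': return 'u'; // u with diaeresis
            default:       return chr;
        }
    }

    // Tries the standard table first; if nothing changed, tries the extended one.
    static char full(char chr) {
        char replaced = standard(chr);
        if (replaced != chr) {
            return replaced;
        }
        return extended(chr);
    }
}
```

Characters with no entry in either table pass through unchanged.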
