java.lang.Object
  org.jasen.core.calculators.ChiSquaredCalculator
Performs all the chi-square probability calculations required by jASEN.
This is the core calculation class which ultimately determines the spam score for a message.
Most of the methods herein are a direct port from the Python implementation published by Gary Robinson.
Constructor Summary
ChiSquaredCalculator()
Method Summary
double calculateChi(double[] fws)
    Calculates the chi distribution of the word probabilities.
double calculateH(double[] probs)
    Calculates the probability, as a value between 0.0 and 1.0, that the tokens provided indicate a HAM message
double calculateInverseChiSquare(double fChi, int n)
    Calculates the inverse chi square for the given chi distribution.
double calculateReverseChi(double[] fws)
    Does the same as calculateChi, but does so on 1 - f(w).
double calculateS(double[] probs)
    Calculates the probability, as a value between 0.0 and 1.0, that the tokens provided indicate a SPAM message
double[] calculateWordProbabilities(String[] words, JasenMap map)
    Calculates the probability of each word indicating spam.
double confirmHypothesis(String[] words, JasenMap map)
    Confirms or rejects the null hypothesis that the message words indicate spam.
static void main(String[] args)
    Test harness only
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail
public ChiSquaredCalculator()
Method Detail
public double confirmHypothesis(String[] words, JasenMap map)
Specifically, this is defined as:
I = H / (H + S)
Where H and S are the ham and spam values (see calculateH and calculateS).
words
- The word tokens extracted from the message
map
- The token map
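The combination I = H / (H + S) can be sketched in a few lines. This is a minimal illustration and not the jASEN source: the class name IndicatorSketch, the method name indicator, and the choice to return 0.5 when both inputs are zero are all assumptions made for the example.

```java
/** Hypothetical helper for illustration only; not part of jASEN. */
final class IndicatorSketch {

    /**
     * Combines a ham value h and a spam value s into I = h / (h + s).
     */
    static double indicator(double h, double s) {
        double sum = h + s;
        if (sum == 0.0) {
            // Assumption: with no evidence either way, fall back to the
            // neutral midpoint rather than dividing by zero.
            return 0.5;
        }
        return h / sum;
    }

    public static void main(String[] args) {
        System.out.println(indicator(0.9, 0.1));
        System.out.println(indicator(0.5, 0.5));
    }
}
```

Equal inputs give 0.5, and the result moves toward 1.0 as h dominates and toward 0.0 as s dominates.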
public double calculateH(double[] probs)
probs
- The word probabilities computed by calculateWordProbabilities
calculateWordProbabilities(String[], JasenMap)
public double calculateS(double[] probs)
probs
- The word probabilities computed by calculateWordProbabilities
calculateWordProbabilities(String[], JasenMap)
public double[] calculateWordProbabilities(String[] words, JasenMap map)
Specifically, this method uses the following approach from Gary Robinson:
b(w) = (the number of spam e-mails containing the word w) / (the total number of spam e-mails)
g(w) = (the number of ham e-mails containing the word w) / (the total number of ham e-mails)
p(w) = b(w) / (b(w) + g(w))
Then we calculate:
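f(w) = ((s * x) + (m * p(w))) / (s + m)
Where, in Robinson's formulation, s is the strength given to the background information, x is the assumed probability for a word with no prior data, and m (n in Robinson's notation) is the number of e-mails containing the word w.
words
- The set of words for which the probabilities will be calculated
map
- The map of word probabilities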
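The formulas for b(w), g(w), p(w) and f(w) can be worked through for a single word as follows. This is a sketch under stated assumptions, not the jASEN implementation: the class and method names are hypothetical, the raw counts are passed in directly instead of being read from a JasenMap, and m is taken to be the number of e-mails (spam plus ham) containing the word.

```java
/** Hypothetical helper for illustration only; not part of jASEN. */
final class WordProbabilitySketch {

    /**
     * Computes f(w) for one word from raw counts.
     *
     * @param spamWithWord number of spam e-mails containing the word
     * @param totalSpam    total number of spam e-mails
     * @param hamWithWord  number of ham e-mails containing the word
     * @param totalHam     total number of ham e-mails
     * @param s            strength given to the background information
     * @param x            assumed probability for a word with no data
     */
    static double wordProbability(int spamWithWord, int totalSpam,
                                  int hamWithWord, int totalHam,
                                  double s, double x) {
        // b(w) and g(w): fraction of spam / ham e-mails containing the word.
        double b = totalSpam == 0 ? 0.0 : (double) spamWithWord / totalSpam;
        double g = totalHam == 0 ? 0.0 : (double) hamWithWord / totalHam;

        if (b + g == 0.0) {
            // Assumption: a word never seen before gets the prior x.
            return x;
        }

        double p = b / (b + g);

        // Assumption: m is the number of e-mails containing the word.
        int m = spamWithWord + hamWithWord;

        return ((s * x) + (m * p)) / (s + m);
    }

    public static void main(String[] args) {
        // s = 1.0 and x = 0.5 are used here only as example values.
        System.out.println(wordProbability(10, 10, 0, 10, 1.0, 0.5));
    }
}
```

The effect of the (s, x) terms is that a word seen only a few times stays close to x, while a word seen often converges on p(w).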
public double calculateChi(double[] fws)
This is defined as:
-2 ln ∏f(w)
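Where f(w) is the probability of word w as computed by calculateWordProbabilities.
fws
- The f(w) values, one per word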
public double calculateReverseChi(double[] fws)
-2 ln ∏(1 - f(w))
fws
- The f(w) values, one per word
calculateChi(double[])
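Both statistics, -2 ln ∏f(w) and -2 ln ∏(1 - f(w)), can be sketched as below. The class and method names are hypothetical and this is not the jASEN source. The sketch sums logarithms instead of multiplying the probabilities first, which is mathematically equivalent and avoids the product underflowing to zero for long messages. Whether jASEN does the same is not stated in this documentation.

```java
/** Hypothetical helper for illustration only; not part of jASEN. */
final class ChiSketch {

    /** Computes -2 ln ∏ f(w) as -2 times the sum of ln f(w). */
    static double chi(double[] fws) {
        double sumOfLogs = 0.0;
        for (double fw : fws) {
            sumOfLogs += Math.log(fw);
        }
        return -2.0 * sumOfLogs;
    }

    /** Computes -2 ln ∏ (1 - f(w)) in the same way. */
    static double reverseChi(double[] fws) {
        double sumOfLogs = 0.0;
        for (double fw : fws) {
            sumOfLogs += Math.log(1.0 - fw);
        }
        return -2.0 * sumOfLogs;
    }

    public static void main(String[] args) {
        double[] fws = {0.5, 0.5};
        System.out.println(chi(fws));
        System.out.println(reverseChi(fws));
    }
}
```

Note that an f(w) of exactly 0.0 (or 1.0 for the reverse form) yields an infinite result, so callers would normally keep probabilities strictly inside the open interval (0, 1).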
public double calculateInverseChiSquare(double fChi, int n)
Again taken from Gary Robinson's writings, this is defined as:
H = C⁻¹(-2 ln ∏ f(w), 2n)
Where C⁻¹ is the inverse chi-square function and n is the number of tokens.
fChi
- The chi distribution calculated from calculateChi()
n
- The number of tokens
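For an even number of degrees of freedom, the inverse chi-square has a closed-form series, which is the approach used by the widely circulated Python routine this class is described as porting. The following is a sketch of that series in Java, assuming 2n degrees of freedom for n tokens as in the formula above. The class and method names are hypothetical, and the actual jASEN code may differ.

```java
/** Hypothetical helper for illustration only; not part of jASEN. */
final class InverseChiSquareSketch {

    /**
     * Returns the probability that a chi-square variable with 2n degrees
     * of freedom is at least as large as chi.
     *
     * Uses the series exp(-m) * sum_{i=0}^{n-1} m^i / i!, with m = chi / 2.
     */
    static double inverseChiSquare(double chi, int n) {
        double m = chi / 2.0;
        double term = Math.exp(-m); // i = 0 term
        double sum = term;
        for (int i = 1; i < n; i++) {
            term *= m / i;          // builds m^i / i! incrementally
            sum += term;
        }
        // Rounding can push the sum slightly above 1.0, so clamp it.
        return Math.min(sum, 1.0);
    }

    public static void main(String[] args) {
        System.out.println(inverseChiSquare(2.0, 1));
        System.out.println(inverseChiSquare(2.0, 2));
    }
}
```

One limitation of this form: for very large chi values Math.exp(-m) underflows to 0.0 and the result collapses to 0.0, so a production implementation may need to work in log space.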
public static void main(String[] args)
args
-