org.jasen.core.calculators
Class ChiSquaredCalculator

java.lang.Object
  extended by org.jasen.core.calculators.ChiSquaredCalculator

public class ChiSquaredCalculator
extends Object

Performs all the chi-squared probability calculations required by jASEN.

This is the core calculation class which ultimately determines the spam score for a message.

Most of the methods herein are a direct port from the Python implementation published by Gary Robinson.

Author:
Jason Polites

Constructor Summary
ChiSquaredCalculator()
           
 
Method Summary
 double calculateChi(double[] fws)
          Calculates the chi distribution of the word probabilities.
 double calculateH(double[] probs)
          Calculates the probability, as a value between 0.0 and 1.0, that the tokens provided indicate a HAM message.
 double calculateInverseChiSquare(double fChi, int n)
          Calculates the inverse chi square for the given chi distribution.
 double calculateReverseChi(double[] fws)
          Does the same as calculateChi, but does so on 1 - f(w).
 double calculateS(double[] probs)
          Calculates the probability, as a value between 0.0 and 1.0, that the tokens provided indicate a SPAM message.
 double[] calculateWordProbabilities(String[] words, JasenMap map)
          Calculates the probability of each word indicating spam.
 double confirmHypothesis(String[] words, JasenMap map)
          Confirms or rejects the null hypothesis that the message words indicate spam.
static void main(String[] args)
          Test harness only
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ChiSquaredCalculator

public ChiSquaredCalculator()
Method Detail

confirmHypothesis

public double confirmHypothesis(String[] words,
                                JasenMap map)
Confirms or rejects the null hypothesis that the message words indicate spam.

Specifically, this is defined as

 	I = H / (H + S)
 
Where H and S are the values returned by calculateH and calculateS.

Parameters:
words - The word tokens extracted from the message
map - The token map
Returns:
The overall "spamminess" of the words
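
As an illustration of the ratio above, here is a minimal sketch. The class name IndicatorSketch and the neutral fallback of 0.5 when H and S are both zero are assumptions made for this example and are not taken from jASEN.

```java
// Hypothetical sketch of I = H / (H + S); not the jASEN implementation.
final class IndicatorSketch {

    /**
     * Combines the two values h and s into a single indicator.
     * Assumption: when h + s is zero the result is 0.5 (undecided),
     * because the ratio is undefined in that case.
     */
    static double indicator(double h, double s) {
        double total = h + s;
        if (total == 0.0) {
            return 0.5; // assumed neutral value, not from the source
        }
        return h / total;
    }
}
```

Guarding the zero denominator keeps the result inside the 0.0 to 1.0 range instead of producing NaN.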

calculateH

public double calculateH(double[] probs)
Calculates the probability, as a value between 0.0 and 1.0, that the tokens provided indicate a HAM message.

Parameters:
probs - The word probabilities computed by calculateWordProbabilities
Returns:
A value between 0.0 and 1.0, where 1.0 indicates a high probability of HAM (non-spam)
See Also:
calculateWordProbabilities(String[], JasenMap)

calculateS

public double calculateS(double[] probs)
Calculates the probability, as a value between 0.0 and 1.0, that the tokens provided indicate a SPAM message.

Parameters:
probs - The word probabilities computed by calculateWordProbabilities
Returns:
A value between 0.0 and 1.0, where 1.0 indicates a high probability of SPAM
See Also:
calculateWordProbabilities(String[], JasenMap)

calculateWordProbabilities

public double[] calculateWordProbabilities(String[] words,
                                           JasenMap map)
Calculates the probability of each word indicating spam.

Specifically, this method uses the following approach from Gary Robinson:

b(w) = (the number of spam e-mails containing the word w) / (the total number of spam e-mails)
g(w) = (the number of ham e-mails containing the word w) / (the total number of ham e-mails)
p(w) = b(w) / (b(w) + g(w))

Then we calculate:

f(w) = ((s * x) + (m * p(w))) / (s + m)

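Where, following Robinson's formulation:

s - the strength given to the background information
x - the assumed probability that a previously unseen word indicates spam
m - the number of e-mails received that contain the word w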

Parameters:
words - The set of words for which the probabilities will be calculated
map - The map of word probabilities
Returns:
An array of double values, between 0.0 and 1.0, indicating the probability that the word indicates spam
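
The formulas above can be sketched for a single word as follows. This is an illustrative example only: the class WordProbabilitySketch and its parameter names are hypothetical, and the interpretation of s (strength of the background information), x (assumed probability for an unseen word) and m (number of e-mails containing the word) follows Robinson's published description rather than anything confirmed by this page.

```java
// Hypothetical sketch of p(w) and f(w); not the jASEN implementation.
final class WordProbabilitySketch {

    /**
     * p(w) = b(w) / (b(w) + g(w)), where b(w) and g(w) are the fractions
     * of spam and ham e-mails containing the word.
     * The caller must ensure the word occurs at least once, otherwise
     * b + g is zero and the result is NaN.
     */
    static double p(int spamWithWord, int totalSpam, int hamWithWord, int totalHam) {
        double b = (double) spamWithWord / totalSpam;
        double g = (double) hamWithWord / totalHam;
        return b / (b + g);
    }

    /** f(w) = ((s * x) + (m * p(w))) / (s + m) */
    static double f(double s, double x, double m, double pw) {
        return ((s * x) + (m * pw)) / (s + m);
    }

    public static void main(String[] args) {
        // A word seen in 30 of 100 spam and 10 of 100 ham e-mails.
        double pw = p(30, 100, 10, 100);
        double fw = f(1.0, 0.5, 40, pw);
        System.out.println("p(w) = " + pw + ", f(w) = " + fw);
    }
}
```

Under these assumptions, f(w) stays close to x when m is small and approaches p(w) as m grows.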

calculateChi

public double calculateChi(double[] fws)
Calculates the chi distribution of the word probabilities.

This is defined as:

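 	 -2 ln ∏ f(w)
 

Where f(w) is the probability for word w computed by calculateWordProbabilities, and the product ∏ is taken over all words.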

Parameters:
fws - The f(w) values computed by calculateWordProbabilities
Returns:
The chi distribution value calculated
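
A minimal sketch of this statistic is shown here, assuming the f(w) values are combined as a product over all words (as in Fisher's method described by Robinson). The class name ChiSketch is hypothetical, and summing logarithms instead of multiplying is a choice made for this example, not a statement about how jASEN computes the value.

```java
// Hypothetical sketch of -2 ln (product of f(w)); not the jASEN implementation.
final class ChiSketch {

    /**
     * Returns -2 * sum(ln f(w)), which is mathematically equal to
     * -2 ln of the product of all f(w) values. Summing logs avoids
     * floating-point underflow when many small values are multiplied.
     */
    static double chi(double[] fws) {
        double sumOfLogs = 0.0;
        for (double fw : fws) {
            sumOfLogs += Math.log(fw);
        }
        return -2.0 * sumOfLogs;
    }
}
```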

calculateReverseChi

public double calculateReverseChi(double[] fws)
Does the same as calculateChi, but does so on 1 - f(w).

That is:

 	 -2 ln ∏ (1 - f(w))
 

Parameters:
fws - The f(w) values computed by calculateWordProbabilities
Returns:
The reverse chi computation as a double
See Also:
calculateChi(double[])
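
The reverse computation can be sketched in the same way, again assuming the terms are combined over all words. The class name ReverseChiSketch is hypothetical; Math.log1p(-f) is used here because it evaluates ln(1 - f) accurately when f is very small, which is a choice made for this example.

```java
// Hypothetical sketch of -2 ln (product of (1 - f(w))); not the jASEN implementation.
final class ReverseChiSketch {

    /** Returns -2 * sum(ln(1 - f(w))) over all supplied values. */
    static double reverseChi(double[] fws) {
        double sumOfLogs = 0.0;
        for (double fw : fws) {
            sumOfLogs += Math.log1p(-fw); // ln(1 - fw)
        }
        return -2.0 * sumOfLogs;
    }
}
```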

calculateInverseChiSquare

public double calculateInverseChiSquare(double fChi,
                                        int n)
Calculates the inverse chi square for the given chi distribution.

Again taken from Gary Robinson's writings, this is defined as:

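 	H = C^-1( -2 ln ∏ f(w), 2n )
 	

Where C^-1 is the inverse chi-square function, f(w) is the probability for word w, the product ∏ is taken over all words, and n is the number of tokens (giving 2n degrees of freedom).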

Parameters:
fChi - The chi distribution calculated from calculateChi()
n - The number of tokens
Returns:
The inverse chi squared probability
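
For an even number of degrees of freedom, the inverse chi-square (the probability that a chi-squared variable exceeds the given value) has a closed-form series. The sketch below uses that series with 2n degrees of freedom. The class name InverseChiSquareSketch is hypothetical, and whether jASEN uses exactly this series is an assumption.

```java
// Hypothetical sketch of the inverse chi-square with 2n degrees of freedom;
// not the jASEN implementation.
final class InverseChiSquareSketch {

    /**
     * Returns exp(-m) * sum_{i=0}^{n-1} (m^i / i!), with m = chi / 2.
     * This is the upper-tail probability of a chi-squared distribution
     * with 2n degrees of freedom, capped at 1.0 to absorb rounding error.
     */
    static double inverseChiSquare(double chi, int n) {
        double m = chi / 2.0;
        double term = Math.exp(-m); // i = 0 term
        double sum = term;
        for (int i = 1; i < n; i++) {
            term *= m / i;          // builds m^i / i! incrementally
            sum += term;
        }
        return Math.min(sum, 1.0);
    }

    public static void main(String[] args) {
        System.out.println(inverseChiSquare(4.0, 2));
    }
}
```

Note that for very large chi values Math.exp(-m) underflows to zero, so the result becomes 0.0.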

main

public static void main(String[] args)
Test harness only

Parameters:
args - Command-line arguments