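Introduction
While checking the behavior of a Robinson's Bayesian Spam Filter written in PHP, the most troublesome part was the chi-square calculation. After some digging, I came across a Python program written by Robinson himself. That looked handy, so I started adding code to it in order to verify the PHP calculations, and before I knew it I had written a Bayesian filter class.
Searching the web did not turn up many implementations of Robinson's Bayesian Spam Filter, so I decided to publish this one for now. I am a Python beginner, so apologies for any mistakes; corrections are welcome. I hope it is useful to someone as a reference. I do not intend to claim any copyright, so feel free to do whatever you like with it.
A Japanese-language explanation of Bayesian filters is available here: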
http://akademeia.info/index.php?%A5%D9%A5%A4%A5%B8%A5%A2%A5%F3%A5%D5%A5%A3%A5%EB%A5%BF
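For orientation, the quantities the program computes can be summarized as follows. This is a reading of the code and of the Linux Journal article it cites, not an authoritative statement. The symbols $b(w)$, $g(w)$ and $k$ are labels chosen here for illustration: $k$ is the number of words in the message being scored, and $n$ is, in the article's definition, the number of training documents that contain the word $w$.

```latex
\begin{align}
b(w) &= \frac{\text{spam documents containing } w}{\text{spam documents}}, \qquad
g(w) = \frac{\text{ham documents containing } w}{\text{ham documents}} \\
p(w) &= \frac{b(w)}{b(w) + g(w)} \\
f(w) &= \frac{s\,x + n\,p(w)}{s + n} \\
H &= C^{-1}\Bigl(-2 \ln \prod_{w} f(w),\; 2k\Bigr), \qquad
S = C^{-1}\Bigl(-2 \ln \prod_{w} \bigl(1 - f(w)\bigr),\; 2k\Bigr) \\
I &= \frac{1 + H - S}{2} \\
C^{-1}(c,\, 2k) &= P\bigl(\chi^2_{2k} \ge c\bigr) = e^{-c/2} \sum_{i=0}^{k-1} \frac{(c/2)^i}{i!}
\end{align}
```

The last line is the closed form that holds for an even number of degrees of freedom; the loop in chi2P accumulates exactly these terms, starting from $e^{-c/2}$.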
Program
""" Robinson's Spam filter program
This program is inspired by the following article
http://www.linuxjournal.com/article/6467?page=0,0
"""
import math


class RobinsonsBayes(object):
    """RobinsonsBayes

    This class only supports the scoring calculation; it assumes you
    already have counts from a training set.
    """
    x = 0.5  # assumed probability that a word appearing for the first time is spam
    s = 1.0  # strength given to the assumption x

    def __init__(self, spam_doc_num, ham_doc_num):
        self.spam_doc_num = spam_doc_num
        self.ham_doc_num = ham_doc_num
        self.total_doc_num = spam_doc_num + ham_doc_num
        self.possibility_list = []  # f(w) of every word added so far
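
    def CalcProbabilityToBeSpam(self, num_in_spam_docs, num_in_ham_docs):
        """Return f(w) for a word with the given spam and ham counts."""
        degree_of_spam = float(num_in_spam_docs) / self.spam_doc_num
        degree_of_ham = float(num_in_ham_docs) / self.ham_doc_num
        # n: number of training documents that contain the word
        n = num_in_spam_docs + num_in_ham_docs
        if n == 0:
            # p(w) is undefined for a word never seen in training;
            # f(w) reduces to the assumed probability x in that case.
            return self.x
        # p(w)
        probability = degree_of_spam / (degree_of_spam + degree_of_ham)
        # f(w) = (s*x + n*p(w)) / (s + n)
        robinson_probability = ((self.x * self.s) + (n * probability)) / (self.s + n)
        return robinson_probability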

    def AddWord(self, num_in_spam_docs, num_in_ham_docs):
        """Compute f(w) for one word, store it for CalcIndicator, and return it."""
        probability = self.CalcProbabilityToBeSpam(num_in_spam_docs, num_in_ham_docs)
        self.possibility_list.append(probability)
        return probability

    # Retrieved from
    # http://www.linuxjournal.com/files/linuxjournal.com/linuxjournal/articles/064/6467/6467s2.html
    def chi2P(self, chi, df):
        """Return prob(chisq >= chi, with df degrees of freedom).

        df must be even.
        """
        assert df & 1 == 0
        # XXX If chi is very large, exp(-m) will underflow to 0.
        m = chi / 2.0
        sum = term = math.exp(-m)
        for i in range(1, df // 2):
            term *= m / i
            sum += term
        # With small chi and large df, accumulated roundoff error, plus error
        # in the platform exp(), can cause this to spill a few ULP above 1.0.
        # For example, chi2P(100, 300) on my box has sum == 1.0 + 2.0**-52 at
        # this point. Returning a value even a teensy bit over 1.0 is no good.
        return min(sum, 1.0)

    def CalcNess(self, f, n):
        """Return chi2P(-2 ln f, 2n), where f is a product of n probabilities."""
        Ness = self.chi2P(-2 * math.log(f), 2 * n)
        return Ness
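
    def CalcIndicator(self):
        """Return the indicator I computed from all words added so far."""
        fwpi_h = fwpi_s = 1.0
        for fwi in self.possibility_list:
            fwpi_h *= fwi
            fwpi_s *= (1 - fwi)
        # n is the number of words being combined; the chi-square
        # calculation uses 2n degrees of freedom.
        n = len(self.possibility_list)
        H = self.CalcNess(fwpi_h, n)
        S = self.CalcNess(fwpi_s, n)
        # Note that a bigger H ("hamminess") indicates that the document
        # is more likely to be SPAM.
        I = (1 + H - S) / 2
        return I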


if __name__ == '__main__':
    """
    This is an example of checking whether "I have a pen" is spam.
    The program below assumes the following:
    - We have 10 spam documents and 10 ham documents on hand.
    - The count of "I" is 1 in the spam documents and 5 in the ham documents.
    - The count of "have" is 2 in the spam documents and 6 in the ham documents.
    - The count of "a" is 1 in the spam documents and 2 in the ham documents.
    - The count of "pen" is 5 in the spam documents and 1 in the ham documents.
    By the way, "I have a pen" is a sentence most Japanese people learn in
    their first English class.
    Enjoy!
    """
    # Initialize the class with the numbers of spam and ham documents
    robinsons_bayes = RobinsonsBayes(10, 10)
    # Add the training counts of the words one by one
    print("I   : " + str(robinsons_bayes.AddWord(1, 5)))
    print("have: " + str(robinsons_bayes.AddWord(2, 6)))
    print("a   : " + str(robinsons_bayes.AddWord(1, 2)))
    print("pen : " + str(robinsons_bayes.AddWord(5, 1)))
    # Calculate the indicator
    print("I (probability to be spam)")
    print(robinsons_bayes.CalcIndicator())
    print("")
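
One practical limitation of multiplying the f(w) values directly is that the product underflows to 0.0 for long messages, after which math.log raises a ValueError. A minimal sketch of a workaround is to sum logarithms instead of multiplying. It assumes every f(w) lies strictly between 0 and 1. The function name indicator_from_logs is hypothetical and not part of the class above; chi2P is repeated as a plain function only so that the sketch runs on its own.

```python
import math


def chi2P(chi, df):
    """Return prob(chisq >= chi) for an even number of degrees of freedom."""
    assert df & 1 == 0
    m = chi / 2.0
    total = term = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def indicator_from_logs(probabilities):
    """Combine f(w) values into I = (1 + H - S) / 2 using sums of logs.

    Assumption: every value is strictly between 0 and 1, so that both
    logarithms below are defined.
    """
    n = len(probabilities)
    log_h = sum(math.log(f) for f in probabilities)        # ln of prod f(w)
    log_s = sum(math.log(1.0 - f) for f in probabilities)  # ln of prod (1 - f(w))
    H = chi2P(-2.0 * log_h, 2 * n)
    S = chi2P(-2.0 * log_s, 2 * n)
    return (1.0 + H - S) / 2.0


if __name__ == '__main__':
    # 1000 mildly ham-leaning words: the plain product 0.3**1000 underflows
    # to 0.0, but the sum of logarithms stays finite.
    print(indicator_from_logs([0.3] * 1000))
```

As with I in the program above, values near 1 suggest spam and values near 0 suggest ham. The underflow of exp(-m) noted in the XXX comment of chi2P is not addressed by this sketch.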