More than 5 years have passed since last update.

Robinson's Bayesian Spam Filter をpythonで実装してみた

Posted at 2015-06-04

はじめに

PHPで書かれたRobinson's Bayesian Spam Filterの動作を確認している時に一番厄介だったのが、chi-squareの計算です。色々調べていたら、Robinsonさん自身が書いたpythonのプログラムが出てきました。これはいいやと思って、phpの計算を確認するために、処理を追加して、気が付いたらBayesian Filterのクラスを作っていました。

Robinson's Bayesian Spam Filterの実装はググってもそんなに見つからなかったので、ひとまず公開してみることにしました。pythonは初心者なので、色々間違ってたらすいません。つっこみ歓迎です。誰かの何かの参考になれば幸いです。ちなみに著作権を主張するつもりはないので、煮るなり焼くなり好きにして下さい。

なお、ベイジアンフィルタについての日本語の解説はこちらです。
http://akademeia.info/index.php?%A5%D9%A5%A4%A5%B8%A5%A2%A5%F3%A5%D5%A5%A3%A5%EB%A5%BF

プログラム

""" Robinson's Spam filter program
This program is inspired by the following article
http://www.linuxjournal.com/article/6467?page=0,0
"""
import math

class RobinsonsBayes(object):
    """RobinsonsBayes
        This class only support calculation assuming you already have training set.
    """
    x = float(0.5) #possibility that first appeard word would be spam
    s = float(1)   #intensity of x
    def __init__(self,spam_doc_num,ham_doc_num):
        self.spam_doc_num = spam_doc_num
        self.ham_doc_num = ham_doc_num
        self.total_doc_num = spam_doc_num+ham_doc_num
        self.possibility_list = []

    def CalcProbabilityToBeSpam(self,num_in_spam_docs,num_in_ham_docs):
        degree_of_spam = float(num_in_spam_docs)/self.spam_doc_num;
        degree_of_ham  = float(num_in_ham_docs)/self.ham_doc_num;

        #p(w)
        probability = degree_of_spam/(degree_of_spam+degree_of_ham);

        #f(w)
        robinson_probability = ((self.x*self.s) + (self.total_doc_num*probability))/(self.s+self.total_doc_num)
        return robinson_probability

    def AddWord(self,num_in_spam_docs,num_in_ham_docs):
        probability = self.CalcProbabilityToBeSpam(num_in_spam_docs,num_in_ham_docs)
        self.possibility_list.append(probability)
        return probability

    #retrieved from
    #http://www.linuxjournal.com/files/linuxjournal.com/linuxjournal/articles/064/6467/6467s2.html
    def chi2P(self,chi, df):
        """Return prob(chisq >= chi, with df degrees of freedom).
        df must be even.
        """
        assert df & 1 == 0

        # XXX If chi is very large, exp(-m) will underflow to 0.
        m = chi / 2.0
        sum = term = math.exp(-m)
        for i in range(1, df//2):
            term *= m / i
            sum += term
        # With small chi and large df, accumulated
        # roundoff error, plus error in
        # the platform exp(), can cause this to spill
        # a few ULP above 1.0. For
        # example, chi2P(100, 300) on my box
        # has sum == 1.0 + 2.0**-52 at this
        # point.  Returning a value even a teensy
        # bit over 1.0 is no good.
        return min(sum, 1.0)

    def CalcNess(self,f,n):
        Ness = self.chi2P(-2*math.log(f),2*n)
        return Ness

    def CalcIndicator(self):
        fwpi_h=fwpi_s=1
        for fwi in self.possibility_list:
            fwpi_h *= fwi
            fwpi_s *= (1-fwi)

        H = self.CalcNess(fwpi_h,3)
        S = self.CalcNess(fwpi_s,3)

        #Notice that the bigger H(Hamminess) indicates that the document is more likely to be SPAM.
        I = (1+H-S)/2
        return I

if __name__ == '__main__':
    """
        This is a exapmple of checking if "I have a pen" is a spam.

        Following program assuming like:
            - We have 10 spam documents and 10 ham documents in our hand.
            - Number of "I" in spam documents is 1 and that of ham documents is 5
            - Number of "have" in spam documents is 2 and that of ham documents is 6
            - Number of "a" in spam documents is 1 and that of ham documents is 2
            - Number of "pen" in spam documents is 5 and that of ham documents is 1

        By the way, "I have a pen" is an sentence the most of Japanese learn in the first English class.
        Enjoy!
    """

    #init class by giving the number of document
    RobinsonsBayes = RobinsonsBayes(10,10)

    #Add train data of words one by one
    print "I   : "+str(RobinsonsBayes.AddWord(1,5))
    print "have: "+str(RobinsonsBayes.AddWord(2,6))
    print "a   : "+str(RobinsonsBayes.AddWord(1,2))
    print "pen : "+str(RobinsonsBayes.AddWord(5,1))

    #calculate Indicater
    print "I (probability to be spam)"
    print RobinsonsBayes.CalcIndicator()
    print ""

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up