Overview

・A university assignment
・For details and to reproduce the results, see GitHub

Assignment

・Using the digitized text of Natsume Soseki's "I Am a Cat" (吾輩は猫である), estimate the probabilities of word bigram and trigram models.

Bigram

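・Using the training text neko.num, estimate the probability of each word appearing immediately after the word "て" (numeric ID 28), for every word that occurs in neko.num (13,938 word types).
・Probabilities for unknown words need not be estimated.
・In other words, estimate the probabilities so that the conditional probabilities of the 13,938 word types sum to exactly 1.0.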
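
As a minimal sketch of the estimate described above: under maximum likelihood, P(w | て) is the number of times w directly follows the context word divided by the number of times anything follows it, so the probabilities over the vocabulary sum to 1. The function name `bigram_mle` and the toy token list are assumptions made for illustration; they are not part of the assignment or of kadai.py. Python 3 is assumed.

```python
from collections import Counter

def bigram_mle(tokens, prev):
    """Return {w: P(w | prev)} for every word type in tokens (maximum likelihood)."""
    vocab = set(tokens)
    # Count each word that directly follows `prev`.
    following = Counter(b for a, b in zip(tokens, tokens[1:]) if a == prev)
    total = sum(following.values())
    if total == 0:
        raise ValueError("the context word is never followed by another word")
    # Counter returns 0 for words that never follow `prev`.
    return {w: following[w] / total for w in vocab}

# Hypothetical toy corpus in the same whitespace-separated numeric format as neko.num.
toy = "24 28 5 28 7 24 28 5".split()
probs = bigram_mle(toy, "28")
```

In this toy corpus "28" is followed by "5" twice and "7" once, so those two words share all of the probability and every other word gets 0. Zero probabilities of this kind are what the discounting in kadai.py is meant to avoid.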

Trigram

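・This task extends the bigram estimation above to trigrams.
・Estimate the probability of each word appearing immediately after the word "し" followed by the word "て" (numeric IDs 24 and 28), for every word that occurs in neko.num (13,938 word types).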
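
The trigram case can be sketched the same way, conditioning on a pair of words instead of a single word: P(w | し, て) is the count of the triple divided by the number of times the pair is followed by any word. `trigram_mle` and the toy corpus are hypothetical names and data used only for illustration. Python 3 is assumed.

```python
from collections import Counter

def trigram_mle(tokens, first, second):
    """Return {w: P(w | first, second)} over all word types in tokens."""
    vocab = set(tokens)
    # Slide a window of three tokens and keep the third when the first two match.
    following = Counter(
        w for a, b, w in zip(tokens, tokens[1:], tokens[2:])
        if a == first and b == second
    )
    context_count = sum(following.values())
    if context_count == 0:
        raise ValueError("the context pair is never followed by a word")
    return {w: following[w] / context_count for w in vocab}

# Hypothetical toy corpus; "24 28" occurs three times, followed by 5, 5 and 9.
toy = "24 28 5 28 7 24 28 5 24 28 9".split()
probs = trigram_mle(toy, "24", "28")
```

Note that "7" follows "28" in the toy data but never follows the pair "24 28", so it receives probability 0 under the trigram estimate.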

Evaluation

・The test-set perplexity of the resulting models is computed on an evaluation text: a set of sentences excerpted from Natsume Soseki's "Kokoro" (file: kokoro.num).
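
The post does not show how the perplexity is computed, so the following is only a sketch under stated assumptions: the evaluation text has already been reduced to the list of words observed after the context, and logarithms are taken in base 2. `perplexity` is a hypothetical helper, not the course's evaluation script.

```python
import math

def perplexity(probs, observed):
    """Perplexity of the observed words under the distribution `probs`."""
    if not observed:
        raise ValueError("no observations to evaluate")
    log_sum = 0.0
    for w in observed:
        p = probs.get(w, 0.0)
        if p <= 0.0:
            # A single zero-probability word makes the perplexity infinite.
            return float("inf")
        log_sum += math.log2(p)
    # 2 raised to the average negative log2 probability.
    return 2.0 ** (-log_sum / len(observed))

# A uniform distribution over 4 words has perplexity 4.
uniform = {"1": 0.25, "2": 0.25, "3": 0.25, "4": 0.25}
pp_uniform = perplexity(uniform, ["1", "3", "3", "2"])

# A model that assigns 0 to an observed word cannot be scored.
pp_unseen = perplexity({"1": 1.0, "2": 0.0}, ["1", "2"])
```

The second case shows why a pure maximum-likelihood model is a poor fit for this evaluation: any word that follows the context in kokoro.num but not in neko.num would drive the perplexity to infinity.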

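Method

・Build a dictionary from neko.num with each word as the key and its number of occurrences as the value.
・Build dictionaries of the same form for the conditional (bigram and trigram) counts of each word.
・Start from maximum likelihood estimation and tune it: subtract an absolute discount of 0.5 from each count and spread the freed probability mass evenly over all words.
・Every word is thereby assigned a probability; the results are sorted in descending order of probability and written to a file.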
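
The discounting step can be illustrated on a tiny example. This is a sketch that assumes the same scheme as kadai.py (a fixed discount of 0.5, clipped at zero, with the leftover mass shared equally); `discounted_probs` and the toy counts are hypothetical.

```python
def discounted_probs(follow_counts, vocab, context_count, discount=0.5):
    """Absolute discounting followed by uniform redistribution of the leftover mass."""
    raw = {}
    for w in vocab:
        # Subtract the discount from each count, never going below zero.
        raw[w] = max(follow_counts.get(w, 0) - discount, 0.0) / context_count
    # Mass removed by discounting, shared equally among all words.
    leftover = 1.0 - sum(raw.values())
    share = leftover / len(vocab)
    return {w: p + share for w, p in raw.items()}

# Toy example: the context occurs 4 times, followed by "1" three times and "2" once.
vocab = ["1", "2", "3", "4"]
probs = discounted_probs({"1": 3, "2": 1}, vocab, context_count=4)

# Rank the words in descending order of probability.
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
```

Here the two unseen words each receive 0.0625 instead of 0, while the total still sums to 1.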

kadai.py

# coding: utf-8
import math

text = []
# Read the training text line by line
with open('neko.num', 'r') as f:
    for line in f:
        text.append(line.rstrip())
# Join the lines and split on whitespace so that `text` is a flat list of tokens
text = ' '.join(text).split()
N = len(text)



# Build a dictionary with each word (its numeric ID, as a string) as the key
# and its number of occurrences as the value
dict_lib = {}
for num in text:
    dict_lib[num] = dict_lib.get(num, 0) + 1
Keys = dict_lib.keys()



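# Collect every index at which the word "て" (numeric ID '28') occurs
indexes = [i for i, x in enumerate(text) if x == '28']


# For each occurrence, record the index and the value of the word that follows it
next_indexes = []
next_words   = []
for num in indexes:
    num = num + 1
    if num >= len(text):  # '28' is the last token, so nothing follows it
        continue
    next_indexes.append(num)
    next_words.append(text[num])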

# Dictionary mapping each word to the number of times it occurs directly after "て"
dict_next_28 = {}
for num in Keys:
    dict_next_28[num] = 0
for num in next_words:
    dict_next_28[num] += 1


# Bigram: P(w | '28') with absolute discounting
sum_Pwb = 0.0
dict_bigram = {}
Cb = dict_lib['28']  # number of occurrences of the context word "て"
for key in Keys:
    # Absolute discounting: subtract 0.5 from each count, clipping at zero
    Cbw = max(dict_next_28[key] - 0.5, 0)
    Pwb = float(Cbw) / float(Cb)
    dict_bigram[key] = Pwb
    sum_Pwb = sum_Pwb + Pwb

# Probability mass removed by discounting, redistributed when the model is written
diff_bigram = abs(1 - sum_Pwb)

# Write the bigram model, one probability per line
count_keys = len(Keys)
# For now, spread the leftover probability mass evenly over all words
add_bigram = diff_bigram / count_keys
sum_after_added_bi = 0.0
with open('bigram.model', 'w') as b:
    # Assumes the word IDs are exactly the integers 1..count_keys
    for i in range(1, count_keys + 1):
        bigram_added = dict_bigram[str(i)] + add_bigram
        b.write("{0:20.17e}".format(bigram_added) + '\n')
        sum_after_added_bi = sum_after_added_bi + bigram_added



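# Given the word "し" (ID 24) directly followed by the word "て" (ID 28),
# collect the word that comes immediately after the pair
indexes = [i for i, x in enumerate(text) if x == '24']
next_24indexes = []
next_24words   = []
next_24_28words = []
for num in indexes:
    num = num + 1
    if num >= len(text):  # '24' is the last token, so nothing follows it
        continue
    next_24indexes.append(num)
    next_24words.append(text[num])
    # Check num + 1 so that a corpus ending in "24 28" does not raise IndexError
    if text[num] == '28' and num + 1 < len(text):
        next_24_28words.append(text[num+1])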


# Count the words that directly follow "し"; the entry for '28' is the number of "し て" pairs
dict_next_24 = {}
for num in Keys:
    dict_next_24[num] = 0
for num in next_24words:
    dict_next_24[num] += 1

# Count the words that directly follow the pair "し て"
dict_next24_28 = {}
for num in Keys:
    dict_next24_28[num] = 0
for num in next_24_28words:
    dict_next24_28[num] += 1


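# Trigram: P(w | '24', '28') with absolute discounting
sum_Pwab = 0.0
dict_trigram = {}
Cab = dict_next_24['28']  # number of times "し" is directly followed by "て"
for key in Keys:
    # Absolute discounting: subtract 0.5 from each count, clipping at zero
    Cabw = max(dict_next24_28[key] - 0.5, 0)
    Pwab = float(Cabw) / float(Cab)
    dict_trigram[key] = Pwab
    sum_Pwab = sum_Pwab + Pwab
# Probability mass removed by discounting, redistributed when the model is written
diff_trigram = abs(1 - sum_Pwab)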


# Write the trigram model, one probability per line
count_keys = len(Keys)
# Spread the leftover probability mass evenly over all words
add_trigram = diff_trigram / count_keys
sum_after_added_tri = 0.0
with open('trigram.model', 'w') as c:
    # Assumes the word IDs are exactly the integers 1..count_keys
    for i in range(1, count_keys + 1):
        trigram_added = dict_trigram[str(i)] + add_trigram
        c.write("{0:20.17e}".format(trigram_added) + '\n')
        sum_after_added_tri = sum_after_added_tri + trigram_added
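
Since the assignment requires the probabilities to sum to exactly 1.0, a quick check of the written files can be useful. The sketch below assumes the one-probability-per-line format produced by kadai.py; `model_sum` is a hypothetical helper, and the demonstration uses a throwaway file rather than the real bigram.model.

```python
import math
import os
import tempfile

def model_sum(path):
    """Return (number of probabilities, their sum) for a model file."""
    with open(path) as f:
        values = [float(line) for line in f if line.strip()]
    # math.fsum reduces floating-point rounding error when summing many small values.
    return len(values), math.fsum(values)

# Demonstration on a temporary file written in the same format as kadai.py.
with tempfile.NamedTemporaryFile("w", suffix=".model", delete=False) as tmp:
    for p in (0.5, 0.25, 0.25):
        tmp.write("{0:20.17e}".format(p) + "\n")
n_lines, total = model_sum(tmp.name)
os.remove(tmp.name)
```

In practice the same function would be called as `model_sum('bigram.model')` or `model_sum('trigram.model')`, with the expectation that the line count equals the vocabulary size and the sum is 1.0 up to rounding error.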

[Python] How I built statistical language models (bigram and trigram) for "I Am a Cat"
