
[Python] Building statistical language models (bigram and trigram) from "I Am a Cat"

Last updated at 2021-05-05. Posted at 2019-03-15.

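Overview

・A university assignment
・See GitHub for the details and for reproducing the results

Assignment

・Using the digitized text of Natsume Soseki's "I Am a Cat" (吾輩は猫である), estimate the probabilities of word bigram and trigram models.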

Bigram

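・Using the training text neko.num, estimate the probability of each word appearing immediately after the word 「て」 (numeric ID 28), for every word that occurs in neko.num (13,938 distinct words).
・Probabilities for unknown words do not need to be estimated.
・In other words, the conditional probabilities of the 13,938 words must sum to exactly 1.0.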
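The quantity being estimated can be sketched in a few lines of Python 3. This is a minimal illustration of plain maximum likelihood estimation on a made-up token list, not the script used for the assignment; the function name `bigram_mle` and the toy data are assumptions introduced here.

```python
from collections import Counter

def bigram_mle(tokens, context):
    """Maximum-likelihood estimate of P(w | context) over the training vocabulary."""
    # Count every word that directly follows `context`.
    followers = Counter(
        nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == context
    )
    total = sum(followers.values())
    if total == 0:
        raise ValueError("context is never followed by a word in the training text")
    # Words that never follow `context` get probability 0 under plain MLE.
    return {w: followers[w] / total for w in set(tokens)}

# Toy stand-in for neko.num: whitespace-separated numeric word IDs.
toy = "1 28 3 28 3 28 4 1".split()
probs = bigram_mle(toy, "28")
```

Because every count is divided by the same total, the probabilities over the vocabulary sum to 1.0, which is the constraint stated above. The zeros for unseen followers are what the discounting in the script later addresses.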

Trigram

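・This task extends the bigram estimation above to a trigram.
・Estimate the probability of each word appearing immediately after the word 「し」 followed by the word 「て」 (numeric IDs 24 and 28), for every word that occurs in neko.num (13,938 distinct words).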
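The trigram case differs only in that the context is a pair of words. A minimal Python 3 sketch under the same assumptions as before (plain maximum likelihood, toy data, and a hypothetical helper name `trigram_mle`):

```python
from collections import Counter

def trigram_mle(tokens, first, second):
    """Maximum-likelihood estimate of P(w | first, second)."""
    # Slide a window of three tokens and keep the third word
    # whenever the first two match the context.
    followers = Counter(
        c
        for a, b, c in zip(tokens, tokens[1:], tokens[2:])
        if a == first and b == second
    )
    total = sum(followers.values())
    if total == 0:
        raise ValueError("context pair never occurs in the training text")
    return {w: followers[w] / total for w in set(tokens)}

# Toy data: '28' preceded by '9' must not contribute to the (24, 28) context.
toy = "24 28 5 24 28 5 24 28 7 9 28 5".split()
probs = trigram_mle(toy, "24", "28")
```

Only the three occurrences of 24 followed by 28 contribute here, so the last `28 5` in the toy data is ignored.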

Evaluation

・The model is evaluated by computing its test-set perplexity on a set of sentences excerpted from Natsume Soseki's "Kokoro" (file: kokoro.num).
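The post does not spell out the evaluation procedure, so the following is only a sketch of how test-set perplexity could be computed for a single-context model. It assumes that perplexity is taken over the test words that directly follow the context word, and that every such word is in the training vocabulary; the function name `perplexity` and the toy numbers are made up for illustration.

```python
import math

def perplexity(model, test_tokens, context):
    """Perplexity of P(w | context) over the test words that follow `context`."""
    log_sum = 0.0
    n = 0
    for prev, word in zip(test_tokens, test_tokens[1:]):
        if prev != context:
            continue
        # A zero probability would make log2 fail, which is why the
        # model must assign some mass to every vocabulary word.
        log_sum += math.log2(model[word])
        n += 1
    if n == 0:
        raise ValueError("context does not occur in the test text")
    return 2 ** (-log_sum / n)

# Toy model and test text (hypothetical values).
model = {"3": 0.5, "4": 0.25, "1": 0.125, "28": 0.125}
test_tokens = "1 28 3 28 4".split()
pp = perplexity(model, test_tokens, "28")
```

A lower value means the model was less surprised by the test text. A single test word with probability zero makes the perplexity undefined, which motivates the smoothing step in the script.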

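Method

・Build a dictionary from neko.num that maps each word (key) to its occurrence count (value).
・Build dictionaries of the same form for the conditional (bigram and trigram) occurrence counts of each word.
・Start from maximum likelihood estimation and tune from there.
・Every word is assigned a probability, so the words are sorted by descending probability and written to a file.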
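The tuning step can be sketched as absolute discounting with a uniform redistribution of the removed mass, followed by a descending sort. This is a simplified Python 3 illustration on toy counts, not the script itself: the helper name `discounted_probs` is hypothetical, and it normalises by the sum of the follower counts, whereas the script divides by the count of the context word.

```python
from collections import Counter

def discounted_probs(follower_counts, vocab, discount=0.5):
    """Absolute discounting with the removed mass shared equally by all words."""
    total = sum(follower_counts.values())
    # Subtract the discount from every count, never going below zero.
    probs = {
        w: max(follower_counts.get(w, 0) - discount, 0.0) / total for w in vocab
    }
    # Whatever mass the discount removed is split evenly over the vocabulary.
    share = (1.0 - sum(probs.values())) / len(vocab)
    return {w: p + share for w, p in probs.items()}

counts = Counter({"3": 2, "4": 1})   # toy follower counts for one context
vocab = ["1", "3", "4", "28"]
probs = discounted_probs(counts, vocab)

# Rank words from most to least probable.
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
```

After redistribution every word has a non-zero probability and the total is still 1.0, so the perplexity of the model is defined for any in-vocabulary test word.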

kadai.py

# coding:utf-8
import math

text = []
# Read the training text
with open('neko.num','r') as a:
	for line in a:
		text.append(line.rstrip())
# Split on whitespace and store the tokens in the list 'text'
text = ' '.join(text).split()
N = len(text)



# Build a dictionary with each word (numeric ID) as key and its occurrence count as value
dict_lib = {}
for num in text:
	dict_lib[num] = dict_lib.get(num, 0) + 1
Keys = dict_lib.keys()



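# Collect every index of the word 「て」 (numeric ID '28')
indexes = [i for i, x in enumerate(text) if x == '28']


# Collect the word that immediately follows each occurrence
next_indexes = []
next_words   = []
for num in indexes:
	num = num + 1
	if num >= N: # '28' is the final token, so nothing follows it
		continue
	next_indexes.append(num)
	next_words.append(text[num])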

# Build a dictionary mapping each word to the number of times it directly follows 「て」
dict_next_28 = {}
for num in Keys:
	dict_next_28[num] = 0
for num in next_words:
	count = int(dict_next_28[num])
	count = count + 1
	dict_next_28[num] = count


# bigram
count = 0
sum_Pwb = 0.0
dict_bigram = {}
for key in Keys:
	count = count + 1
	Cbw = max(dict_next_28[key] - 0.5, 0) # absolute discount, floored at zero
	Cb  = dict_lib['28']
	Pwb = float(Cbw) / float(Cb)
	dict_bigram[key] = Pwb
	sum_Pwb = float(sum_Pwb) + float(Pwb)

diff_bigram = abs(1 - float(sum_Pwb))

# Write out the bigram model
b = open('bigram.model','w')
count_keys =  len(Keys)
# For now, give every word an equal share of the leftover probability mass
add_bigram = diff_bigram / count_keys
sum_after_added_bi = 0.0
for i in range(1,count_keys+1):
	bigram_added = float(dict_bigram[str(i)]) + float(add_bigram)
	b.write("{0:20.17e}".format(float(bigram_added)) + '\n')
	sum_after_added_bi = sum_after_added_bi + bigram_added
b.close()



# Collect the words that follow 「し」 (numeric ID 24) and, when that word is 「て」 (numeric ID 28), the word after it as well
indexes = [i for i, x in enumerate(text) if x == '24']
next_24indexes = []
next_24words   = []
next_24_28words = []
for num in indexes:
	num = num + 1
	if num >= N: # '24' is the final token, so nothing follows it
		continue
	next_24indexes.append(num)
	next_24words.append(text[num])
	if text[num] == '28' and num + 1 < N:
		next_24_28words.append(text[num+1])


dict_next_24 = {}
for num in Keys:
	dict_next_24[num] = 0
for num in next_24words:
	count = int(dict_next_24[num])
	count = count + 1
	dict_next_24[num] = count

dict_next24_28 = {}
for num in Keys:
	dict_next24_28[num] = 0
for num in next_24_28words:
	count = int(dict_next24_28[num])
	count = count + 1
	dict_next24_28[num] = count


# trigram
count = 0
sum_Pwab = 0.0
dict_trigram = {}
for key in Keys:
	count = count + 1
	Cabw = dict_next24_28[key] - 0.5 #absolute discount
	if Cabw < 0:
		Cabw = 0
	Cab  = dict_next_24['28']
	Pwab = float(Cabw) / float(Cab)
	dict_trigram[key] = Pwab
	sum_Pwab = float(sum_Pwab) + float(Pwab)
diff_trigram = abs(1 - float(sum_Pwab))


# Write out the trigram model
c = open('trigram.model','w')
count_keys =  len(Keys)
add_trigram = diff_trigram / count_keys
sum_after_added_tri = 0.0
for i in range(1,count_keys+1):
	trigram_added = float(dict_trigram[str(i)]) + float(add_trigram)
	c.write("{0:20.17e}".format(float(trigram_added)) + '\n')
	sum_after_added_tri = sum_after_added_tri + trigram_added
c.close()
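Since the assignment requires the probabilities to sum to 1.0, a quick sanity check on the written model files may be useful. The sketch below is Python 3 and uses a hypothetical helper `model_sum` that takes the lines of a model file; the demonstration feeds it lines produced with the same format string as the script rather than a real file.

```python
def model_sum(lines):
    """Sum the probabilities stored one per line, skipping blank lines."""
    return sum(float(line) for line in lines if line.strip())

# Lines formatted the same way the script writes them.
lines = ["{0:20.17e}".format(p) + "\n" for p in (0.5, 0.25, 0.25)]
total = model_sum(lines)
```

For an actual file, the same function can be applied to an open file object, for example `model_sum(open('bigram.model'))`, and the result compared with 1.0 within a small tolerance.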
