
An Amateur's 100 Language Processing Knocks: 76

Last updated at 2017-05-03, posted at 2017-02-20

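This is a record of my attempt at 100 Language Processing Knocks 2015 (言語処理100本ノック 2015). The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). A list of the previous knocks is available here.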

Chapter 8: Machine Learning

In this chapter, we use the sentence polarity dataset v1.0 from the Movie Review Data published by Bo Pang and Lillian Lee, and work on the task of classifying sentences as positive or negative (polarity analysis).

76. Labeling

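Apply the logistic regression model to the training data, and output the correct label, the predicted label, and the predicted probability in tab-separated format.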

The finished code:

main.py

# coding: utf-8
import codecs
import snowballstemmer
import numpy as np

fname_sentiment = 'sentiment.txt'
fname_features = 'features.txt'
fname_theta = 'theta.npy'
fname_result = 'result.txt'
fencoding = 'cp1252'		# apparently Windows-1252

stemmer = snowballstemmer.stemmer('english')

# List of stop words, from the CSV Format at http://xpo6.com/list-of-english-stop-words/
stop_words = (
	'a,able,about,across,after,all,almost,also,am,among,an,and,any,are,'
	'as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,'
	'either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,'
	'him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,'
	'likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,'
	'on,only,or,other,our,own,rather,said,say,says,she,should,since,so,'
	'some,than,that,the,their,them,then,there,these,they,this,tis,to,too,'
	'twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,'
	'will,with,would,yet,you,your').lower().split(',')


def is_stopword(word):
	'''Return whether a word is a stop word.
	The comparison is case-insensitive.

	Returns:
	True if the word is a stop word, False otherwise
	'''
	return word.lower() in stop_words


def hypothesis(data_x, theta):
	'''Hypothesis function.
	Predicts data_y for data_x using theta.

	Returns:
	Matrix of predicted values
	'''
	return 1.0 / (1.0 + np.exp(-data_x.dot(theta)))


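def extract_features(data, dict_features):
	'''Extract features from a sentence.
	Extracts the features contained in dict_features from the sentence and
	returns a matrix with the position dict_features['(feature)'] set to 1.
	The first element is always 1; it is for the weight that corresponds
	to no feature.

	Returns:
	Matrix with the first element and each matching feature's
	(1-based) index set to 1
	'''
	data_one_x = np.zeros(len(dict_features) + 1, dtype=np.float64)
	data_one_x[0] = 1		# first element is always 1, for the weight with no feature

	for word in data.split(' '):

		# Strip leading and trailing whitespace
		word = word.strip()

		# Remove stop words
		if is_stopword(word):
			continue

		# Stemming
		word = stemmer.stemWord(word)

		# Look up the feature index and set that position to 1
		try:
			data_one_x[dict_features[word]] = 1
		except KeyError:
			pass		# ignore features that are not in dict_features

	return data_one_x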


def load_dict_features():
	'''Read features.txt and build a dictionary that converts a feature to an index.
	Index values are 1-based and match the line numbers in features.txt.

	Returns:
	Dictionary that maps a feature to its index
	'''
	with codecs.open(fname_features, 'r', fencoding) as file_in:
		return {line.strip(): i for i, line in enumerate(file_in, start=1)}


# Load the feature dictionary
dict_features = load_dict_features()

# Load the learned parameters
theta = np.load(fname_theta)

# Read the training data and predict
with codecs.open(fname_sentiment, 'r', fencoding) as file_in, \
		open(fname_result, 'w') as file_out:

	for line in file_in:

		# Extract features (the first 3 characters are the label and a space)
		data_one_x = extract_features(line[3:], dict_features)

		# Predict and write the result
		h = hypothesis(data_one_x, theta)
		if h > 0.5:
			file_out.write('{}\t{}\t{}\n'.format(line[0:2], '+1', h))
		else:
			file_out.write('{}\t{}\t{}\n'.format(line[0:2], '-1', 1 - h))

Execution result:

The results are written to result.txt. The beginning of the file is shown below.

Beginning of result.txt

-1	-1	0.84128525307739
+1	+1	0.9092062807282129
+1	+1	0.553085519355556
+1	+1	0.8535668467933613
-1	-1	0.7992886809287588
+1	+1	0.9989116240762246
-1	+1	0.6208624063497488
+1	+1	0.9845368320643015
+1	+1	0.7906871750078216
+1	+1	0.8645613519028749
-1	-1	0.916795585155668
+1	+1	0.9261196491506768
-1	-1	0.9114578616603789
+1	+1	0.7902482704258449
+1	-1	0.6600533200938651
+1	-1	0.5726383205991274
-1	-1	0.9173556809882624
-1	-1	0.9770172038339648
+1	+1	0.9239412556453133
+1	-1	0.5792255523114858
(rest omitted)

A few of the predictions are wrong...
The complete file is uploaded to GitHub.
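
As a quick check of the excerpt above, the mismatches can be counted by comparing the first two columns of each line. The following is only a minimal sketch and is not part of the original main.py: the helper name count_matches is hypothetical, and the sample rows are the first eight lines of the excerpt rather than the whole file.

```python
# Hypothetical helper; not part of the original main.py.
def count_matches(lines):
    '''Return (matches, total) for tab-separated result lines.

    Each line is expected to hold three columns:
    correct label, predicted label, probability.
    '''
    matches = 0
    total = 0
    for line in lines:
        cols = line.rstrip('\n').split('\t')
        if len(cols) != 3:
            continue        # skip blank or malformed lines
        total += 1
        if cols[0] == cols[1]:
            matches += 1
    return matches, total


# First eight lines of the result.txt excerpt shown above.
sample = [
    '-1\t-1\t0.84128525307739',
    '+1\t+1\t0.9092062807282129',
    '+1\t+1\t0.553085519355556',
    '+1\t+1\t0.8535668467933613',
    '-1\t-1\t0.7992886809287588',
    '+1\t+1\t0.9989116240762246',
    '-1\t+1\t0.6208624063497488',
    '+1\t+1\t0.9845368320643015',
]

matches, total = count_matches(sample)
print('{} of {} labels match'.format(matches, total))
```

To run it on the whole file, an open file object for result.txt can be passed in place of the sample list, since the function only iterates over lines.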

Validation on the training data

This time we simply run the processing from problem 74 on the training data. The results are analyzed in the next problem.
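
For reference, the prediction step boils down to the logistic function applied to the inner product of the feature vector and theta, followed by a 0.5 threshold. The sketch below restates it in plain Python instead of NumPy so that it runs without dependencies. The three-element theta and the feature vectors are made-up values for illustration, not the learned weights, and predict is a hypothetical helper name.

```python
import math


def hypothesis(data_x, theta):
    '''Logistic function of the inner product of data_x and theta.'''
    z = sum(x * t for x, t in zip(data_x, theta))
    return 1.0 / (1.0 + math.exp(-z))


def predict(data_x, theta):
    '''Return (label, probability of that label), as main.py writes them.'''
    h = hypothesis(data_x, theta)
    if h > 0.5:
        return '+1', h
    return '-1', 1 - h


# Made-up weights: bias, one "positive" feature, one "negative" feature.
theta = [0.0, 2.0, -3.0]

# The first element is always 1 (bias term), as in extract_features().
x_pos = [1.0, 1.0, 0.0]
x_neg = [1.0, 0.0, 1.0]

for x in (x_pos, x_neg):
    label, prob = predict(x, theta)
    print('{}\t{}'.format(label, prob))
```

Because the reported probability is that of the predicted label, it is never below 0.5, which is consistent with the values in the result.txt excerpt.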

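Note that, as problem 78 also points out, validating with the same data that was used for training is not acceptable. Even a model that has merely memorized the textbook and cannot generalize (this is called overfitting) would score well on it.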
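
One common way around this is to set part of the data aside before training and evaluate only on that part. The following is a minimal sketch of such a hold-out split, shown only to illustrate the idea: the function name split_holdout, the 80/20 ratio, the fixed seed, and the dummy sentences are all assumptions and do not come from the original code.

```python
import random


# Hypothetical helper; the ratio and seed are arbitrary choices.
def split_holdout(lines, test_ratio=0.2, seed=0):
    '''Shuffle lines and split them into (train, test) lists.'''
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)   # local RNG, reproducible order
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Dummy labelled sentences standing in for the lines of sentiment.txt.
data = ['+1 sample sentence {}'.format(i) for i in range(10)]

train, test = split_holdout(data)
print(len(train), len(test))
```

The model would then be trained on train only, and the labels and probabilities computed for test give a fairer picture of how well it generalizes.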

　
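That is all for the 77th knock (problem numbers start at 00, so problem 76 is the 77th). If you notice any mistakes, I would appreciate it if you could point them out.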

The execution results include part of the data distributed as the corpus data used in the 100 knocks.
