An Amateur's Language Processing 100 Knocks: 74

Last updated at 2024-11-02, posted at 2017-02-14

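This is a record of my attempt at Language Processing 100 Knocks 2015 (言語処理100本ノック 2015). The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). A list of past knocks is available here.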

Chapter 8: Machine Learning

In this chapter, we use the sentence polarity dataset v1.0 from the Movie Review Data published by Bo Pang and Lillian Lee to work on the task of classifying sentences as positive or negative (polarity analysis).

74. Prediction

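Using the logistic regression model trained in 73, implement a program that computes the polarity label of a given sentence ("+1" for a positive example, "-1" for a negative example) and its prediction probability.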

The finished code:

main.py

# coding: utf-8
import codecs
import snowballstemmer
import numpy as np

fname_sentiment = 'sentiment.txt'
fname_features = 'features.txt'
fname_theta = 'theta.npy'
fencoding = 'cp1252'		# apparently Windows-1252

stemmer = snowballstemmer.stemmer('english')

# Stop word list, taken from the CSV Format at http://xpo6.com/list-of-english-stop-words/
stop_words = (
	'a,able,about,across,after,all,almost,also,am,among,an,and,any,are,'
	'as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,'
	'either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,'
	'him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,'
	'likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,'
	'on,only,or,other,our,own,rather,said,say,says,she,should,since,so,'
	'some,than,that,the,their,them,then,there,these,they,this,tis,to,too,'
	'twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,'
	'will,with,would,yet,you,your').lower().split(',')


def is_stopword(word):
	'''Return whether the given word is a stop word.
	The comparison is case-insensitive.

	Returns:
	True if the word is a stop word, False otherwise
	'''
	return word.lower() in stop_words


def hypothesis(data_x, theta):
	'''Hypothesis function (sigmoid).
	Predicts data_y for data_x using theta.

	Returns:
	Matrix of predicted values
	'''
	return 1.0 / (1.0 + np.exp(-data_x.dot(theta)))


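def extract_features(data, dict_features):
	'''Extract features from a sentence.
	Extracts the features contained in dict_features from the sentence and
	returns a vector with the position dict_features['(feature)'] set to 1.
	The first element is always 1; it is for the weight that does not
	correspond to any feature (the bias term).

	Returns:
	Vector with the first element and each matching feature's position set to 1
	'''
	data_one_x = np.zeros(len(dict_features) + 1, dtype=np.float64)
	data_one_x[0] = 1		# The first element is always 1, for the bias weight.

	for word in data.split(' '):

		# Strip leading and trailing whitespace
		word = word.strip()

		# Skip stop words
		if is_stopword(word):
			continue

		# Stemming
		word = stemmer.stemWord(word)

		# Look up the feature index and set that position to 1
		try:
			data_one_x[dict_features[word]] = 1
		except KeyError:
			pass		# Ignore features that are not in dict_features

	return data_one_x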


def load_dict_features():
	'''Read features.txt and build a dictionary that maps features to indices.
	Index values are 1-based and match the line numbers in features.txt.

	Returns:
	Dictionary that maps a feature to its index
	'''
	with codecs.open(fname_features, 'r', fencoding) as file_in:
		return {line.strip(): i for i, line in enumerate(file_in, start=1)}


# Load the feature dictionary
dict_features = load_dict_features()

# Load the training result
theta = np.load(fname_theta)

# Input (the prompt reads "Please enter a review")
review = input('レビューを入力してください--> ')

# Feature extraction
data_one_x = extract_features(review, dict_features)

# Prediction
h = hypothesis(data_one_x, theta)
if h > 0.5:
	print('label:+1 ({})'.format(h))
else:
	print('label:-1 ({})'.format(1 - h))

#### Execution results:

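I entered the first three reviews from the "sentiment.txt" created in problem 70. For reference, the correct label for the first and third reviews is positive (+1), and for the second it is negative (-1).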

Execution results

segavvy@ubuntu:~/ドキュメント/言語処理100本ノック2015/74$ python main.py 
レビューを入力してください--> deep intelligence and a warm , enveloping affection breathe out of every frame .
label:+1 (0.9881093733272299)
segavvy@ubuntu:~/ドキュメント/言語処理100本ノック2015/74$ python main.py 
レビューを入力してください--> before long , the film starts playing like general hospital crossed with a saturday night live spoof of dog day afternoon .
label:-1 (0.6713196688353891)
segavvy@ubuntu:~/ドキュメント/言語処理100本ノック2015/74$ python main.py 
レビューを入力してください--> by the time it ends in a rush of sequins , flashbulbs , blaring brass and back-stabbing babes , it has said plenty about how show business has infiltrated every corner of society -- and not always for the better .
label:-1 (0.6339673922580253)

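The first and second reviews were predicted correctly, but the third, a positive review, was predicted as negative.
The predicted probability for the first review is 98.8%, so the model appears quite confident in that prediction. The predicted probability for the incorrect third prediction is 63.4%, so the model appears to have been less sure there.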

### How prediction works
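To predict, extract features from the input review and pass the result to the hypothesis function.
The hypothesis function returns a value between 0 and 1. A value greater than 0.5 means a positive prediction, and a value less than 0.5 means a negative one. When the value is exactly 0.5, either label seems to be acceptable; here I treat it as negative.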
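
As a minimal sketch of this flow, the following uses a hypothetical three-feature dictionary and hand-picked weights. `toy_features`, `toy_theta`, and the `toy_*` functions are made up for illustration and are not the trained `theta.npy` or the real `features.txt`. Stop-word removal and stemming are omitted, and the dot product is written with the standard library only, rather than with NumPy as in `main.py`.

```python
import math

# Hypothetical feature dictionary: feature -> 1-based index.
# Index 0 is reserved for the bias term, as in main.py.
toy_features = {'warm': 1, 'spoof': 2, 'affect': 3}

# Hand-picked weights for illustration only (bias first).
# Real weights would come from the training done in problem 73.
toy_theta = [0.1, 1.5, -2.0, 0.8]


def toy_extract(sentence):
    """Build a 0/1 feature vector whose first element is fixed to 1."""
    x = [0.0] * (len(toy_features) + 1)
    x[0] = 1.0
    for word in sentence.split():
        index = toy_features.get(word)
        if index is not None:  # words outside the dictionary are ignored
            x[index] = 1.0
    return x


def toy_hypothesis(x, theta):
    """Sigmoid of the dot product of x and theta; the result lies in (0, 1)."""
    z = sum(xi * ti for xi, ti in zip(x, theta))
    return 1.0 / (1.0 + math.exp(-z))


def toy_predict(sentence):
    """Return the label ('+1' or '-1') and the raw hypothesis value."""
    h = toy_hypothesis(toy_extract(sentence), toy_theta)
    return ('+1' if h > 0.5 else '-1'), h


for sentence in ('a warm film', 'a lazy spoof'):
    print(sentence, toy_predict(sentence))
```

With these made-up weights, a sentence containing `warm` is pushed above 0.5 and one containing `spoof` is pushed below it. A sentence with no known features is decided by the bias weight alone.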

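The predicted probability is given by the value of the hypothesis function itself. For example, if the hypothesis function returns 0.8, the prediction is positive with a probability of 80%. For a negative prediction, however, values closer to 0 mean a higher probability, so the predicted probability is 1 minus the hypothesis value. For example, if the hypothesis function returns 0.3, the prediction is negative with a probability of 70% (= 1 - 0.3).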
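
The conversion from a hypothesis value to a label and its predicted probability can be sketched as a small function. `label_and_probability` is a hypothetical helper name introduced here for illustration; it mirrors the final `if`/`else` in `main.py`, including the choice to treat exactly 0.5 as negative.

```python
def label_and_probability(h):
    """Convert a hypothesis value h (between 0 and 1) into (label, probability).

    For a positive prediction the probability is h itself.
    For a negative prediction it is 1 - h, because values closer to 0
    mean a more confident negative prediction.
    """
    if h > 0.5:
        return '+1', h
    return '-1', 1.0 - h


# The two examples from the text, plus the boundary case.
for h in (0.8, 0.3, 0.5):
    label, probability = label_and_probability(h)
    print('h={} -> label:{} ({:.0%})'.format(h, label, probability))
```

Either way, the reported probability is never below 50%, since it always refers to the label that was actually chosen.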

　
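That concludes the 75th knock (the problems are numbered from 00, so problem 74 is the 75th). If you find any mistakes, I would appreciate it if you could point them out.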

The execution results include part of the data distributed as the corpus data used in the 100 knocks.
