
An Amateur's Language Processing 100 Knocks: 94

Last updated at 2024-12-31, posted at 2017-04-20

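This is a record of my attempt at 言語処理100本ノック 2015 (Language Processing 100 Knocks 2015). The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). See the list of past knocks for the earlier problems.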

Chapter 10: Vector Space Methods (II)

In Chapter 10, we continue from the previous chapter and work on learning word vectors.

94. Similarity computation with WordSimilarity-353

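Take the evaluation data of The WordSimilarity-353 Test Collection as input, compute the similarity between the words in the first and second columns, and write a program that appends the similarity value to the end of each line. Apply this program to the word vectors created in Problem 85 and to the word vectors created in Problem 90.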

The finished code:

main.py

# coding: utf-8
import pickle
from collections import OrderedDict
from scipy import io
import numpy as np

fname_dict_index_t = 'dict_index_t'
fname_matrix_x300 = 'matrix_x300'
fname_input = './wordsim353/combined.tab'
fname_output = 'combined_out.tab'


def cos_sim(vec_a, vec_b):
	'''Compute cosine similarity.
	Computes the cosine similarity of the vectors vec_a and vec_b.

	Returns:
	the cosine similarity
	'''
	norm_ab = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
	if norm_ab != 0:
		return np.dot(vec_a, vec_b) / norm_ab
	else:
		# With a zero norm we cannot even judge similarity, so return the minimum value
		return -1


# Load the dictionary
with open(fname_dict_index_t, 'rb') as data_file:
	dict_index_t = pickle.load(data_file)

# Load the matrix
matrix_x300 = io.loadmat(fname_matrix_x300)['matrix_x300']

# Read the evaluation data
with open(fname_input, 'rt') as data_file, \
		open(fname_output, 'wt') as out_file:

	header = True
	for line in data_file:

		# Skip the first line (header)
		if header is True:
			header = False
			continue

		cols = line.split('\t')

		try:
			# Compute the cosine similarity
			dist = cos_sim(matrix_x300[dict_index_t[cols[0]]],
					matrix_x300[dict_index_t[cols[1]]])

		except KeyError:

			# If a word is missing, output a cosine similarity of -1
			dist = -1

		# Output
		print('{}\t{}'.format(line.strip(), dist), file=out_file)
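
As a quick sanity check of cos_sim, here is a minimal sketch that runs the same function on toy vectors. The three vector pairs are made up for illustration and are not taken from matrix_x300; they are only meant to show the parallel, perpendicular, and zero-norm cases.

```python
import numpy as np


def cos_sim(vec_a, vec_b):
    """Cosine similarity; returns -1 when either vector has zero norm."""
    norm_ab = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    if norm_ab != 0:
        return np.dot(vec_a, vec_b) / norm_ab
    return -1


# Toy vectors (hypothetical, for illustration only)
same = cos_sim(np.array([1.0, 2.0]), np.array([2.0, 4.0]))        # parallel
orthogonal = cos_sim(np.array([1.0, 0.0]), np.array([0.0, 3.0]))  # perpendicular
zero = cos_sim(np.array([0.0, 0.0]), np.array([1.0, 1.0]))        # zero norm

print(same, orthogonal, zero)
```

Parallel vectors should come out at (approximately) 1, perpendicular ones at 0, and the zero-norm case falls back to the -1 placeholder.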

Execution result:

The result is written to "combined_out.tab".
Below is the beginning of the result for the word vectors from Problem 90.

Beginning of combined_out.tab for the Problem 90 word vectors

love	sex	6.77	0.558817427529164
tiger	cat	7.35	0.8104942364075417
tiger	tiger	10.00	1.0
book	paper	7.46	0.5373739037842621
computer	keyboard	7.62	0.6513348284085957
computer	internet	7.58	0.6853771864636458
plane	car	5.77	0.6047296940670726
train	car	6.31	0.6214349550041308
telephone	communication	7.50	0.5728658343918928
television	radio	6.77	0.8238737873165439
media	radio	7.42	0.6139114178674844
drug	abuse	6.85	0.6707394601769904
bread	butter	6.19	0.8784042813288622
cucumber	potato	5.92	0.7202129358391373
doctor	nurse	7.00	0.7376654700130043
professor	doctor	6.62	0.5768738999716276
student	professor	6.81	0.6515753887632422
smart	student	4.62	0.06724816967785505
smart	stupid	5.81	0.5897232807858769
company	stock	7.08	0.6406619907313633
(rest omitted)

The WordSimilarity-353 Test Collection

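The WordSimilarity-353 Test Collection is test data for training word similarity and for evaluating the accuracy of the results. It consists of similarity judgments actually made by a dozen or more human subjects.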

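The data comes in two sets, which differ in the number of subjects. This time I used "combined", in which the two sets are merged. The files are also provided in two formats, comma-separated and tab-separated; here I use the tab-separated "combined.tab". Its beginning is shown below.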

Beginning of combined.tab

Word 1	Word 2	Human (mean)
love	sex	6.77
tiger	cat	7.35
tiger	tiger	10.00
book	paper	7.46
computer	keyboard	7.62
computer	internet	7.58
plane	car	5.77
train	car	6.31
telephone	communication	7.50
television	radio	6.77
media	radio	7.42
drug	abuse	6.85
bread	butter	6.19
cucumber	potato	5.92
doctor	nurse	7.00
professor	doctor	6.62
student	professor	6.81
smart	student	4.62
smart	stupid	5.81
(rest omitted)
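
The tab-separated layout shown above can also be read with the standard csv module instead of splitting lines by hand. The following is only a sketch: the first rows of combined.tab are inlined as a string (an assumption made so the example runs without the file), and the name pairs is introduced here for illustration.

```python
import csv
import io

# First rows of combined.tab, inlined so the sketch runs without the file.
# With the real file, open('./wordsim353/combined.tab', newline='') would
# take the place of io.StringIO(sample).
sample = (
    "Word 1\tWord 2\tHuman (mean)\n"
    "love\tsex\t6.77\n"
    "tiger\tcat\t7.35\n"
    "tiger\ttiger\t10.00\n"
)

reader = csv.reader(io.StringIO(sample), delimiter='\t')
next(reader)  # skip the header row

# Each remaining row is (word 1, word 2, mean human score)
pairs = [(w1, w2, float(score)) for w1, w2, score in reader]
print(pairs)
```

This does the same job as the header flag and line.split('\t') in main.py; which one to use is a matter of taste.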

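What this problem asks for is the same as Problem 87: compute the similarity of the two words on each line and append it to the end of that line. For words that are not in the word vectors, I output -1.
These results are evaluated in Problem 95.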
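
Because -1 is used as a placeholder for missing words, it may be useful to separate those rows from the ones with a real similarity before doing anything else with the output. The sketch below assumes the combined_out.tab layout shown earlier; the "foo / bar" row is a made-up example of a missing pair, and found and missing are names introduced here. Note that -1 is also a possible cosine value in principle, so this split is only a heuristic.

```python
# Rows in the combined_out.tab layout. The last row is hypothetical and
# stands in for a pair whose word was not found in the word vectors.
rows = [
    "love\tsex\t6.77\t0.558817427529164",
    "tiger\tcat\t7.35\t0.8104942364075417",
    "foo\tbar\t5.00\t-1",
]

found, missing = [], []
for row in rows:
    word1, word2, human, sim = row.split('\t')
    record = (word1, word2, float(human), float(sim))
    # -1 is the placeholder written when a word has no vector
    if record[3] == -1:
        missing.append(record)
    else:
        found.append(record)

print(len(found), len(missing))
```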

That's all for the 95th knock. If you find any mistakes, I would appreciate it if you could point them out.

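The execution results include part of the data distributed as the corpus data used in the 100 Knocks. The license for the corpus data used in this Chapter 10 is Creative Commons Attribution-ShareAlike 3.0 Unported. The license of The WordSimilarity-353 Test Collection is Creative Commons Attribution 4.0 International.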
