An Amateur's Language Processing 100 Knocks: 96

Last updated at 2024-12-31, posted at 2017-04-26

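This is a record of my attempt at the Language Processing 100 Knocks 2015 (言語処理100本ノック 2015). The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). A list of the previous knocks is available here.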

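Chapter 10: Vector Space Methods (II)

In Chapter 10, we continue working on learning word vectors, following on from the previous chapter.

96. Extracting vectors related to country names

From the word2vec training results, extract only the vectors related to country names.

The finished code: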

main.py

# coding: utf-8
import pickle
from collections import OrderedDict
from scipy import io
import numpy as np

fname_dict_index_t = 'dict_index_t'
fname_matrix_x300 = 'matrix_x300'
fname_countries = 'countries.txt'

fname_dict_new = 'dict_index_country'
fname_matrix_new = 'matrix_x300_country'


# Load the dictionary (word -> row index)
with open(fname_dict_index_t, 'rb') as data_file:
    dict_index_t = pickle.load(data_file)

# Load the matrix of word vectors
matrix_x300 = io.loadmat(fname_matrix_x300)['matrix_x300']

# Build a matrix containing only the country names found in the dictionary
dict_new = OrderedDict()
matrix_new = np.empty([0, 300], dtype=np.float64)
count = 0

with open(fname_countries, 'rt') as data_file:
    for line in data_file:
        # The dictionary keys use '_' in place of spaces
        word = line.strip().replace(' ', '_')
        try:
            index = dict_index_t[word]
        except KeyError:
            # Skip names that have no word vector
            continue
        matrix_new = np.vstack([matrix_new, matrix_x300[index]])
        dict_new[word] = count
        count += 1

# Write out the results
io.savemat(fname_matrix_new, {'matrix_x300': matrix_new})
with open(fname_dict_new, 'wb') as data_file:
    pickle.dump(dict_new, data_file)

Execution result:

The word vectors are written to "matrix_x300_country.mat", and the list of words corresponding to each row is written to the file "dict_index_country".
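To make the relationship between the two output files concrete, here is a minimal sketch of reading them back and looking up one country's vector. The helper name load_country_vectors and the tiny 2x2 matrix are made up for illustration. The round trip runs in a temporary directory so that it does not depend on the real files. Note that io.savemat() appends ".mat" to the file name by default, and io.loadmat() likewise tries the name with ".mat" appended.

```python
import os
import pickle
import tempfile
from collections import OrderedDict

import numpy as np
from scipy import io


def load_country_vectors(fname_matrix, fname_dict):
    """Load the matrix and the word -> row index dictionary (hypothetical helper)."""
    matrix = io.loadmat(fname_matrix)['matrix_x300']
    with open(fname_dict, 'rb') as data_file:
        dict_index = pickle.load(data_file)
    return matrix, dict_index


# Round trip on made-up data, written the same way main.py writes its results
with tempfile.TemporaryDirectory() as tmp_dir:
    fname_matrix = os.path.join(tmp_dir, 'matrix_x300_country')
    fname_dict = os.path.join(tmp_dir, 'dict_index_country')

    io.savemat(fname_matrix, {'matrix_x300': np.array([[1.0, 2.0], [3.0, 4.0]])})
    with open(fname_dict, 'wb') as data_file:
        pickle.dump(OrderedDict([('Japan', 0), ('France', 1)]), data_file)

    matrix, dict_index = load_country_vectors(fname_matrix, fname_dict)

# The dictionary value is the row number of that word's vector
print(matrix[dict_index['France']])
```

With the real files, the call would presumably be load_country_vectors('matrix_x300_country', 'dict_index_country').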

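Preparing to cluster country names

The verification of the word vectors ended with the previous problem, and from this problem on we take on clustering. As preparation, this problem narrows the word vectors down to country names only.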

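Creating the list of country names

I had intended to use the "countries.txt" from problem 81 as-is, but that list consists strictly of formal names, whereas Wikipedia also uses many short names. So I added the country list from nationsonline.org's Countries and Regions of the World from A to Z to the problem 81 "countries.txt", removed the duplicates, and created a new "countries.txt".
Note that you cannot copy a single column from the table on the website, but if you copy the whole table and paste it into Excel, you can easily extract just the "English Name" column. I appended that column to the end of the problem 81 "countries.txt" and removed the duplicates with the UNIX commands I studied in problem 17.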
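I did the merge with UNIX commands, but for reference the same idea can be sketched in Python. This is only an illustration: the function name merge_country_names and the sample names are made up, and unlike a sort-based approach it keeps the first occurrence of each name in its original order.

```python
def merge_country_names(*name_lists):
    """Merge several iterables of country names, dropping blanks and duplicates.

    The first occurrence of each name is kept, so the input order survives.
    """
    seen = set()
    merged = []
    for names in name_lists:
        for name in names:
            name = name.strip()
            if name and name not in seen:
                seen.add(name)
                merged.append(name)
    return merged


# Hypothetical sample data standing in for the two sources
formal_names = ['Japan', 'United States of America', 'Republic of Korea']
short_names = ['Japan', 'United States', 'South Korea ', '']

merged = merge_country_names(formal_names, short_names)
print(merged)
```

Names are compared as exact strings after stripping whitespace, so spelling variants of the same country stay as separate entries. That is what we want here, since each variant may have its own word vector.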

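Extracting only the country-name vectors

To extract only the country-name vectors, I first create a result matrix with 0 rows using numpy.empty(). I then look up the word vector for each country name in the new "countries.txt", one line at a time, and append the ones that are found to the matrix with numpy.vstack().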
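numpy.vstack() builds a new array on every call, which is fine for a few hundred country names but means the matrix is copied repeatedly. As an alternative sketch (not what main.py does), the row indices can be collected first and the rows gathered in a single indexing step. The function name extract_rows and the small sample matrix are made up for illustration.

```python
from collections import OrderedDict

import numpy as np


def extract_rows(matrix, dict_index, names):
    """Return (sub_matrix, dict_new) for the names found in dict_index."""
    dict_new = OrderedDict()
    rows = []
    for name in names:
        # Same normalization as main.py: spaces become underscores
        word = name.strip().replace(' ', '_')
        index = dict_index.get(word)
        if index is None:
            continue  # no word vector for this name
        dict_new[word] = len(rows)
        rows.append(index)
    # Integer-array indexing copies the selected rows in one step
    return matrix[np.array(rows, dtype=np.intp)], dict_new


# Made-up 4-word vocabulary with 3-dimensional vectors
matrix = np.arange(12, dtype=np.float64).reshape(4, 3)
dict_index = {'Japan': 0, 'apple': 1, 'United_States': 2, 'France': 3}

sub, dict_new = extract_rows(
    matrix, dict_index, ['France', 'Atlantis', 'United States', 'Japan'])
print(sub.shape)
print(dict_new)
```

The resulting dictionary maps each word to its row in the new matrix, just like dict_new in main.py, so the output files would keep the same layout.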

　
That's all for the 97th knock. If you find any mistakes, I would appreciate it if you could point them out.

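The execution results include part of the data distributed as the corpus data used in the 100 Knocks. The license of the corpus data used in this Chapter 10 is Creative Commons Attribution-ShareAlike 3.0 Unported (Japanese translation). The list of country names was created by processing "KIDS外務省 - 世界の国々" (Ministry of Foreign Affairs of Japan) (http://www.mofa.go.jp/mofaj/kids/ichiran/index.html) and nationsonline.org's Countries and Regions of the World from A to Z.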
