100 Language Processing Knocks-92 (using Gensim): Application to Analogy Data

Posted at 2020-01-19

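This is a record of knock 92, "Application to Analogy Data", from 100 Language Processing Knocks 2015 (言語処理100本ノック 2015).
I compute word vectors and extract similar words in two ways: once with the homemade NumPy-format word vector data built in Chapter 9, and once with Gensim. Comparing the two, especially in computation speed, shows how good Gensim really is.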

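Reference links

Link	Notes
092.アナロジーデータへの適用_1.ipynb	GitHub link to the answer program (homemade version)
092.アナロジーデータへの適用_2.ipynb	GitHub link to the answer program (Gensim version)
素人の言語処理100本ノック:92	An article I constantly rely on for the 100 Language Processing Knocks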

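Environment

Type	Version	Notes
OS	Ubuntu18.04.01 LTS	Running in a virtual machine
pyenv	1.2.15	I use pyenv because I sometimes work with multiple Python environments
Python	3.6.9	Python 3.6.9 on pyenv. There is no deep reason for not using 3.7 or 3.8. Packages are managed with venv.

In the environment above, I use the following additional Python packages. They are simply installed with plain pip.

Type	Version
gensim	3.8.1
numpy	1.17.4
pandas	0.25.3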

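Task

Chapter 10: Vector Space Method (II)

In Chapter 10, we continue from the previous chapter and keep working on learning word vectors.

92. Application to Analogy Data

For each case in the evaluation data created in 91, compute vec(word in column 2) - vec(word in column 1) + vec(word in column 3), and find the word most similar to that vector together with its similarity. Append the word and the similarity to the end of each case. Apply this program to both the word vectors created in 85 and the word vectors created in 90.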
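
Written as a formula, the task can be read as follows. This is only a restatement under one assumption: the similarity measure is taken to be cosine similarity, which is what the homemade answer program uses. The symbols $w_1, w_2, w_3$ (the words in columns 1 to 3) and $V$ (the vocabulary) are notation introduced here for illustration.

```latex
\begin{aligned}
\mathbf{v} &= \mathrm{vec}(w_2) - \mathrm{vec}(w_1) + \mathrm{vec}(w_3) \\
\hat{w} &= \operatorname*{arg\,max}_{w \in V} \cos\bigl(\mathbf{v}, \mathrm{vec}(w)\bigr),
\qquad
\cos(\mathbf{v}, \mathbf{u}) = \frac{\mathbf{v} \cdot \mathbf{u}}{\lVert \mathbf{v} \rVert \, \lVert \mathbf{u} \rVert}
\end{aligned}
```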

Answer

Homemade answer program 092.アナロジーデータへの適用_1.ipynb

import csv

import numpy as np
import pandas as pd

# No keyword was given when saving, so the array is stored under 'arr_0'
matrix_x300 = np.load('./../09.ベクトル空間法 (I)/085.matrix_x300.npz')['arr_0']

print('matrix_x300 Shape:', matrix_x300.shape)

group_t = pd.read_pickle('./../09.ベクトル空間法 (I)/083_group_t.zip')

# Compute cosine similarity
def get_cos_similarity(v1, v1_norm, v2):
    
    # Return -1 if the vector is all zeros
    if np.count_nonzero(v2) == 0:
        return -1
    else:
        return np.dot(v1, v2) / (v1_norm * np.linalg.norm(v2))

# Get the most similar word
def get_similar_word(cols):
    
    try:        
        vec = matrix_x300[group_t.index.get_loc(cols[1])] \
              - matrix_x300[group_t.index.get_loc(cols[0])] \
              + matrix_x300[group_t.index.get_loc(cols[2])]
        vec_norm = np.linalg.norm(vec)
        
        # Exclude the three words used in the calculation itself
        cos_sim = [-1 if group_t.index[i] in cols[:3] else get_cos_similarity(vec, vec_norm, matrix_x300[i]) for i in range(len(group_t))]
        index = np.argmax(cos_sim)
        
        cols.extend([group_t.index[index], cos_sim[index]])
        
    except KeyError:
        cols.extend(['', -1])
    return cols

# Read the evaluation data
with open('./091.analogy_family.txt') as file_in:
    result = [get_similar_word(line.split()) for line in file_in]

with open('092.analogy_word2vec_1.txt', 'w') as file_out:
    writer = csv.writer(file_out, delimiter='\t', lineterminator='\n')
    writer.writerows(result)

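Answer Commentary

This is where the similar word is obtained.
The task does not say so, but I exclude the words used in the calculation. I am not sure this is the right approach, but excluding them does raise the accuracy.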

cos_sim = [-1 if group_t.index[i] in cols[:3] else get_cos_similarity(vec, vec_norm, matrix_x300[i]) for i in range(len(group_t))]

When a word is not in the corpus, the answer is left empty and the similarity is set to -1.

except KeyError:
    cols.extend(['', -1])

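The rest mostly reuses what I wrote in earlier knocks. The code does nothing special, so there is little else to explain.
If anything, because processing takes about 17 minutes, I used list comprehensions wherever I could.
The first 10 lines of the output file look like this. Some answers are right and some are not.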

092.analogy_word2vec_1.txt

boy	girl	brother	sister	son	0.8804225566858075
boy	girl	brothers	sisters	sisters	0.8426790631091488
boy	girl	dad	mom	mum	0.8922065515297802
boy	girl	father	mother	mother	0.847494164274725
boy	girl	grandfather	grandmother	grandmother	0.820584129035444
boy	girl	grandpa	grandma		-1
boy	girl	grandson	granddaughter	grandfather	0.6794604718339272
boy	girl	groom	bride	seduce	0.5951703092628703
boy	girl	he	she	she	0.8144501058726975
boy	girl	his	her	Mihailov	0.5752869854520882
(rest omitted)
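
The homemade program calls the cosine function once per vocabulary word inside a list comprehension, which is probably a large part of the 17-minute runtime mentioned above. As a rough sketch only, the same search could be expressed with a single matrix-vector product. The function name `most_similar_word` and the toy 2-dimensional vectors are made up for illustration, and no timing was measured on the real data.

```python
import numpy as np


def most_similar_word(matrix, words, a, b, c):
    """Return (word, cosine) closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    index = {w: i for i, w in enumerate(words)}
    # A missing word raises KeyError here, as Index.get_loc does in the original
    target = matrix[index[b]] - matrix[index[a]] + matrix[index[c]]

    # One matrix-vector product gives the dot product with every row at once
    dots = matrix @ target
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(target)

    # Rows with a zero denominator (all-zero vectors) keep similarity -1
    cos_sim = np.full(len(words), -1.0)
    valid = denom > 0
    cos_sim[valid] = dots[valid] / denom[valid]

    # Exclude the three words used in the calculation
    for w in (a, b, c):
        cos_sim[index[w]] = -1.0

    best = int(np.argmax(cos_sim))
    return words[best], float(cos_sim[best])


# Toy vectors, invented purely for illustration (not real word vectors)
toy_words = ['boy', 'girl', 'brother', 'sister', 'zero']
toy_matrix = np.array([
    [1.0, 0.0],   # boy
    [1.0, 1.0],   # girl
    [2.0, 0.0],   # brother
    [2.0, 1.0],   # sister
    [0.0, 0.0],   # zero (an all-zero row)
])

word, sim = most_similar_word(toy_matrix, toy_words, 'boy', 'girl', 'brother')
print(word, sim)
```

With the real data, `matrix` would correspond to `matrix_x300` and `words` to `list(group_t.index)`. The behavior for all-zero rows and for the three excluded words is meant to mirror the original, but this has not been checked against the actual output file.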

Answer program using Gensim 092.アナロジーデータへの適用_2.ipynb

import csv

from gensim.models import Word2Vec

model = Word2Vec.load('./090.word2vec.model')

print(model)

# Get the most similar word
def get_similar_word(cols):
    try:
        cos_sim = model.wv.most_similar(positive=[cols[1], cols[2]], negative=[cols[0]], topn=4)       
        for word, similarity in cos_sim:
            
            # Exclude the three words used in the calculation
            if word not in cols[:3]:
                cols.extend([word, similarity])
                break
                
    # When a word is not in the original corpus
    except KeyError:
        cols.extend(['', -1])
    
    return cols

# Read the evaluation data
with open('./091.analogy_family.txt') as file_in:
    result = [get_similar_word(line.split()) for line in file_in]

with open('./092.analogy_word2vec_2.txt', 'w') as file_out:
    writer = csv.writer(file_out, delimiter='\t', lineterminator='\n')
    writer.writerows(result)

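Answer Commentary

Because it relies on a package, this version is a little slimmer than the homemade one.
And as you will see when you run it, it is fast! It finishes in about 4 seconds, more than 200 times faster than the homemade version. Gensim is amazing.
Here is the output. The accuracy has gone up as well.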

092.analogy_word2vec_2.txt

boy	girl	brother	sister	sister	0.745887041091919
boy	girl	brothers	sisters	sisters	0.8522343039512634
boy	girl	dad	mom	mum	0.7720432281494141
boy	girl	father	mother	mother	0.8608728647232056
boy	girl	grandfather	grandmother	granddaughter	0.8341050148010254
boy	girl	grandpa	grandma		-1
boy	girl	grandson	granddaughter	granddaughter	0.8497666120529175
boy	girl	groom	bride	bride	0.7476662397384644
boy	girl	he	she	she	0.7702984809875488
boy	girl	his	her	her	0.6540039777755737
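
The difference in accuracy can be checked on the rows shown in this article by comparing column 4 (the expected word) with column 5 (the word found). The sketch below does that for the ten rows listed for each version. The helper name `accuracy_from_lines` is hypothetical, and the figures cover only these ten rows, not the whole output files.

```python
def accuracy_from_lines(lines):
    """Share of rows whose found word (column 5) equals the expected word (column 4)."""
    total = 0
    correct = 0
    for line in lines:
        cols = line.rstrip('\n').split('\t')
        if len(cols) < 5:
            continue  # skip blank or malformed rows
        total += 1
        if cols[3] == cols[4]:
            correct += 1
    return correct / total if total else 0.0


# The ten rows shown for the homemade version
homemade_head = [
    'boy\tgirl\tbrother\tsister\tson\t0.8804225566858075',
    'boy\tgirl\tbrothers\tsisters\tsisters\t0.8426790631091488',
    'boy\tgirl\tdad\tmom\tmum\t0.8922065515297802',
    'boy\tgirl\tfather\tmother\tmother\t0.847494164274725',
    'boy\tgirl\tgrandfather\tgrandmother\tgrandmother\t0.820584129035444',
    'boy\tgirl\tgrandpa\tgrandma\t\t-1',
    'boy\tgirl\tgrandson\tgranddaughter\tgrandfather\t0.6794604718339272',
    'boy\tgirl\tgroom\tbride\tseduce\t0.5951703092628703',
    'boy\tgirl\the\tshe\tshe\t0.8144501058726975',
    'boy\tgirl\this\ther\tMihailov\t0.5752869854520882',
]

# The ten rows shown for the Gensim version
gensim_head = [
    'boy\tgirl\tbrother\tsister\tsister\t0.745887041091919',
    'boy\tgirl\tbrothers\tsisters\tsisters\t0.8522343039512634',
    'boy\tgirl\tdad\tmom\tmum\t0.7720432281494141',
    'boy\tgirl\tfather\tmother\tmother\t0.8608728647232056',
    'boy\tgirl\tgrandfather\tgrandmother\tgranddaughter\t0.8341050148010254',
    'boy\tgirl\tgrandpa\tgrandma\t\t-1',
    'boy\tgirl\tgrandson\tgranddaughter\tgranddaughter\t0.8497666120529175',
    'boy\tgirl\tgroom\tbride\tbride\t0.7476662397384644',
    'boy\tgirl\the\tshe\tshe\t0.7702984809875488',
    'boy\tgirl\this\ther\ther\t0.6540039777755737',
]

print(accuracy_from_lines(homemade_head))  # 0.4
print(accuracy_from_lines(gensim_head))    # 0.7
```

On these ten rows the homemade vectors get 4 right and the Gensim vectors get 7 right. To score a whole file, the same function could be given an open file object, since it only iterates over lines.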
