
Expressing the result of fasttext's get_sentence_vector in terms of get_word_vector

fastText

Last updated at 2021-01-15. Posted at 2021-01-15.

The relationship between get_sentence_vector and get_word_vector

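The value returned by fasttext's get_sentence_vector differs from the plain average of get_word_vector over the words in the sentence. Specifically, each word vector is divided by its L2 norm and the normalized vectors are averaged; adding the vector looked up for the newline (end-of-line) token to that average then gives the output of get_sentence_vector.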

This behavior is described in the following issue and source code.
https://github.com/facebookresearch/fastText/issues/323
https://github.com/facebookresearch/fastText/blob/26bcbfc6b288396bd189691768b8c29086c0dab7/src/fasttext.cc#L474
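
To make the description concrete, here is a minimal NumPy-only sketch of the normalization step, using made-up 2-dimensional vectors in place of real fastText output. The function name `mean_of_unit_vectors` and the toy values are illustrative assumptions, not part of fastText, and the end-of-line term is left out to isolate the effect of the L2 normalization.

```python
import numpy as np


def mean_of_unit_vectors(vectors):
    """Divide each vector by its L2 norm, then average the results."""
    units = [v / np.linalg.norm(v) for v in vectors]
    return np.mean(units, axis=0)


# Toy stand-ins for word vectors (hypothetical values, not from a model).
word_vectors = [np.array([3.0, 4.0]), np.array([0.0, 0.5])]

plain_mean = np.mean(word_vectors, axis=0)
normalized_mean = mean_of_unit_vectors(word_vectors)

print(plain_mean)
print(normalized_mean)
```

After normalization every vector has length 1, so long vectors no longer dominate the average. The two results therefore generally differ, which is why a plain average of get_word_vector outputs does not reproduce get_sentence_vector.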

Checking with code

Let's confirm this with actual code. First, download the pretrained model.

import fasttext.util
fasttext.util.download_model("ja", if_exists='ignore')

Load the model and get the vector for an arbitrary sentence.

import fasttext
import numpy as np

ft = fasttext.load_model('cc.ja.300.bin')
sentence = ["月", "きれい", "です"]

sentence_vector = ft.get_sentence_vector(" ".join(sentence))
print(sentence_vector[:10])

The output looks like this.

[ 0.00897631  0.00557499  0.1755976   0.06277581  0.04944817  0.00894921
 -0.00580985 -0.03669543 -0.02011267 -0.03041974]

Next, build the same vector using get_word_vector. It is slightly convoluted, but it looks like this.

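# L2-normalize each word vector, average them, then add the vector looked up for "/n".
unit_vectors = [w / np.linalg.norm(w) for w in map(ft.get_word_vector, sentence)]
v = np.sum(unit_vectors, axis=0) / len(sentence) + ft.get_word_vector("/n")
print(v[:10])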

There is a small floating-point discrepancy, but the values are nearly identical.

[ 0.0089763   0.00557499  0.17559756  0.06277581  0.04944817  0.00894922
 -0.00580985 -0.03669542 -0.02011267 -0.03041972]
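
The reconstruction above can also be wrapped in a small helper, sketched here under a few assumptions. `rebuild_sentence_vector` is a hypothetical name. It accepts any word-to-vector lookup (for example `ft.get_word_vector`), so it can be tried without a model. Skipping zero-norm vectors and dividing by the number of vectors actually averaged are choices made for this sketch to avoid division by zero.

```python
import numpy as np


def rebuild_sentence_vector(get_word_vector, tokens, terminator=None):
    """Average the L2-normalized vectors of `tokens`.

    If `terminator` is given, the vector looked up for it is added to the
    average, mirroring the reconstruction shown above.
    """
    units = []
    for token in tokens:
        vec = np.asarray(get_word_vector(token), dtype=np.float32)
        norm = np.linalg.norm(vec)
        if norm > 0:  # assumption of this sketch: ignore zero vectors
            units.append(vec / norm)
    if not units:
        raise ValueError("no token produced a non-zero vector")
    result = np.mean(units, axis=0)
    if terminator is not None:
        result = result + np.asarray(get_word_vector(terminator), dtype=np.float32)
    return result


# Stand-in lookup table so the sketch runs without a model (hypothetical values).
toy_vectors = {
    "a": np.array([3.0, 4.0], dtype=np.float32),
    "b": np.array([0.0, 2.0], dtype=np.float32),
    "<end>": np.array([0.1, -0.1], dtype=np.float32),
}

without_terminator = rebuild_sentence_vector(toy_vectors.get, ["a", "b"])
with_terminator = rebuild_sentence_vector(toy_vectors.get, ["a", "b"], terminator="<end>")
print(without_terminator)
print(with_terminator)
```

With the model loaded, `rebuild_sentence_vector(ft.get_word_vector, sentence, terminator="/n")` corresponds to the reconstruction above. Comparing it with the same call without `terminator` shows how much the terminator term contributes for a given model.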

To be safe, check that the values match.

print(all(np.isclose(sentence_vector, v, rtol=1e-06, atol=1e-08)))

True
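
As a complement to the element-wise check, the sketch below reports the largest absolute difference alongside the `np.allclose` result, using the first ten components printed above as sample data. `np.isclose` and `np.allclose` treat two values as equal when `|a - b| <= atol + rtol * |b|`. The helper name `compare_vectors` is an assumption made for illustration.

```python
import numpy as np


def compare_vectors(a, b, rtol=1e-06, atol=1e-08):
    """Return whether a and b agree within tolerance, and their largest gap."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    is_close = bool(np.allclose(a, b, rtol=rtol, atol=atol))
    max_gap = float(np.max(np.abs(a - b)))
    return is_close, max_gap


# First ten components as printed above (rounded by print, so only a sample).
from_sentence_vector = [0.00897631, 0.00557499, 0.1755976, 0.06277581, 0.04944817,
                        0.00894921, -0.00580985, -0.03669543, -0.02011267, -0.03041974]
from_word_vectors = [0.0089763, 0.00557499, 0.17559756, 0.06277581, 0.04944817,
                     0.00894922, -0.00580985, -0.03669542, -0.02011267, -0.03041972]

close, max_gap = compare_vectors(from_sentence_vector, from_word_vectors)
print(close, max_gap)
```

On the full 300-dimensional vectors, `compare_vectors(sentence_vector, v)` can be used the same way.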
