Notes on Doc2Vec
Training
train.py
from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
# One word-segmented (wakati-gaki) document per line.
f = open([word-segmented documents (newline-delimited)], 'rt')
# Each line becomes one TaggedDocument; the line number serves as the tag.
trainings = [TaggedDocument(words=data.split(), tags=[i]) for i, data in enumerate(f)]
model = Doc2Vec(documents=trainings, dm=1, vector_size=100, window=2, min_count=1, workers=4)
model.save("model/doc2vec.model")
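The input here is expected to be a file with one word-segmented (wakati-gaki) document per line. A minimal sketch of that format and of the list comprehension above, using an in-memory string in place of the file (the sentences are the ones used later in confirm.py):

```python
import io

# Stand-in for the word-segmented input file: one document per line.
text = "i love you .\nthis is a pen .\nshe likes her pet .\n"
f = io.StringIO(text)

# Same shape as the TaggedDocument comprehension in train.py:
# each document becomes (list of words, [line number as tag]).
pairs = [(data.split(), [i]) for i, data in enumerate(f)]
print(pairs[0])  # (['i', 'love', 'you', '.'], [0])
```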
Load the model and check similarity
confirm.py
from gensim.models.doc2vec import Doc2Vec
model = Doc2Vec.load('model/doc2vec.model')
s1 = 'i love you .'
doc_words1 = s1.split()
s2 = 'this is a pen .'
doc_words2 = s2.split()
s3 = 'she likes her pet .'
doc_words3 = s3.split()
print(s1, s2)
sim_value = model.docvecs.similarity_unseen_docs(model, doc_words1, doc_words2, alpha=1, min_alpha=0.0001, steps=5)
print(sim_value)
print()
print(s1, s3)
sim_value = model.docvecs.similarity_unseen_docs(model, doc_words1, doc_words3, alpha=1, min_alpha=0.0001, steps=5)
print(sim_value)
$ python3 confirm.py
i love you . this is a pen .
0.35657609
i love you . she likes her pet .
0.61478287
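similarity_unseen_docs infers a vector for each word list with infer_vector and returns the cosine similarity of the two inferred vectors; since inference starts from a random initialization, the printed scores vary slightly between runs (in gensim 4.x this method is called directly on the model as model.similarity_unseen_docs, and steps is renamed epochs). The cosine similarity itself can be sketched with numpy alone, on made-up vectors:

```python
import numpy as np

def cosine_similarity(v1, v2):
    """Cosine similarity: dot product divided by the product of norms."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # parallel to a -> similarity 1.0
c = np.array([-2.0, 1.0, 0.0])  # orthogonal to a -> similarity 0.0

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```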
Obtaining a vector for an arbitrary word or sentence
get_vector.py
from gensim.models.doc2vec import Doc2Vec
model = Doc2Vec.load('model/doc2vec.model')
vector = model.infer_vector(["camera"])
print(vector)
$ python3 get_vector.py
[ 2.47751679e-02 -2.37015472e-03 -9.07190889e-03 3.14958654e-02
-9.36891476e-04 -9.46468487e-03 -1.64013170e-02 3.72726540e-03
-5.04465075e-03 -2.05997601e-02 7.66226742e-03 -1.31780906e-02
-2.17045713e-02 -1.72169674e-02 2.66727619e-03 2.15792600e-02
5.42917755e-04 1.73445195e-02 -1.22042364e-02 -5.53676719e-03
1.31262755e-02 -1.20242555e-02 -4.90346774e-02 -1.30925402e-02
-9.20513365e-03 -1.02104060e-02 -3.45910038e-03 -1.38879791e-02
1.24522485e-02 -8.98833107e-03 9.46434215e-03 1.60158724e-02
-1.45071670e-02 -1.17817409e-02 -1.32775353e-02 5.12571214e-03
2.63297558e-03 1.27882678e-02 1.84957087e-02 -9.61085316e-03
2.48025986e-03 3.07683889e-02 -2.19578166e-02 8.81655049e-03
2.28421725e-02 2.32696421e-02 7.16307247e-03 -1.04713291e-02
4.92677977e-03 -3.09784282e-02 1.34340376e-02 7.26914825e-03
4.67077689e-03 2.15533823e-02 -1.70422960e-02 -1.30527671e-02
-7.90362991e-03 -4.17791121e-03 1.10175610e-02 -7.73182791e-03
6.48175087e-03 6.38299622e-04 9.35730152e-03 -2.26445938e-03
1.46172876e-02 -1.04205897e-02 -2.16954977e-05 2.67737289e-03
1.72927193e-02 1.39058568e-02 7.09218113e-03 1.93058401e-02
1.14299208e-02 3.92317260e-03 -7.82044325e-03 -2.86504477e-02
8.82215053e-03 -1.86563854e-03 1.38469525e-02 -3.11182608e-04
-1.86214391e-02 -1.87536830e-03 2.80867293e-02 -9.54967982e-04
-1.23350583e-02 -1.43871717e-02 9.01202485e-03 -4.42029210e-03
-1.06303710e-02 -8.69653840e-03 -1.31274825e-02 -1.78468637e-02
-8.98253825e-03 -8.39732401e-03 -1.02942903e-02 4.17890493e-03
-3.77285830e-03 -8.58513173e-03 1.53906625e-02 1.35426852e-03]
Obtaining the vectors of the trained documents
print(len(model.docvecs))
(the number of trained documents)
print(model.docvecs[0])
[ 0.3984622 0.0011081 -0.34068212 0.2807926 -0.09965006 0.05219714
-0.07226691 -0.10903342 -0.3401149 0.00761536 0.14741096 -0.428729
-0.1831681 -0.18672702 0.30623436 0.09046402 0.17570521 0.59133667
-0.01409002 0.2635858 0.15392083 -0.03985418 -0.59720606 0.306037
0.18144156 -0.11081521 -0.00683758 -0.05085954 0.6232276 -0.45330688
0.09846549 0.29597676 0.0834657 -0.18919533 0.17512774 -0.22221033
0.30882886 -0.2151929 0.6124842 -0.43345436 0.5355878 0.00944662
-0.5844124 0.04926539 0.25019678 -0.02182007 0.16996674 -0.31945443
0.2030172 -0.38379008 0.49651483 0.2437395 0.41874662 0.43805206
-0.9164802 0.24835101 0.11624163 -0.1720545 0.42150754 -0.4028251
-0.09254187 0.03971908 -0.44850552 0.0768422 -0.10021693 0.47900409
-0.24413407 -0.09366472 0.0504733 0.19371264 -0.0647843 0.41226834
-0.07237258 -0.07221863 0.145306 -0.25470334 0.17476264 0.4592808
0.40632915 0.2842979 0.10939611 -0.10637764 -0.06468508 -0.48528668
-0.2735453 -0.32164642 0.20348737 -0.19311284 -0.11128855 0.17956318
0.13550924 0.17764378 0.03881073 -0.25717908 0.03879094 -0.16467957
-0.10741294 -0.17077914 0.74566174 -0.02592899]
References