Notes on Doc2Vec
Training
train.py
from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
# One word-segmented (wakati-gaki) document per line.
f = open([word-segmented documents (newline-delimited)], 'rt')
# Each line becomes one TaggedDocument; the line number serves as the tag.
trainings = [TaggedDocument(words=data.split(), tags=[i]) for i, data in enumerate(f)]
model = Doc2Vec(documents=trainings, dm=1, vector_size=100, window=2, min_count=1, workers=4)
model.save("model/doc2vec.model")
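The input here is expected to be a file with one word-segmented (wakati-gaki) document per line. A minimal sketch of that format and of the list comprehension above, using an in-memory string in place of the file (the sentences are the ones used later in confirm.py):

```python
import io

# Stand-in for the word-segmented input file: one document per line.
text = "i love you .\nthis is a pen .\nshe likes her pet .\n"
f = io.StringIO(text)

# Same shape as the TaggedDocument comprehension in train.py:
# each document becomes (list of words, [line number as tag]).
pairs = [(data.split(), [i]) for i, data in enumerate(f)]
print(pairs[0])  # (['i', 'love', 'you', '.'], [0])
```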
Load the model and check similarity
confirm.py
from gensim.models.doc2vec import Doc2Vec
model = Doc2Vec.load('model/doc2vec.model')
s1 = 'i love you .'
doc_words1 = s1.split()
s2 = 'this is a pen .'
doc_words2 = s2.split()
s3 = 'she likes her pet .'
doc_words3 = s3.split()
print(s1, s2)
sim_value = model.docvecs.similarity_unseen_docs(model, doc_words1, doc_words2, alpha=1, min_alpha=0.0001, steps=5)
print(sim_value)
print()
print(s1, s3)
sim_value = model.docvecs.similarity_unseen_docs(model, doc_words1, doc_words3, alpha=1, min_alpha=0.0001, steps=5)
print(sim_value)
$ python3 confirm.py
i love you . this is a pen .
0.35657609
i love you . she likes her pet .
0.61478287
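similarity_unseen_docs infers a vector for each word list with infer_vector and returns the cosine similarity of the two inferred vectors; since inference starts from a random initialization, the printed scores vary slightly between runs (in gensim 4.x this method is called directly on the model as model.similarity_unseen_docs, and steps is renamed epochs). The cosine similarity itself can be sketched with numpy alone, on made-up vectors:

```python
import numpy as np

def cosine_similarity(v1, v2):
    """Cosine similarity: dot product divided by the product of norms."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # parallel to a -> similarity 1.0
c = np.array([-2.0, 1.0, 0.0])  # orthogonal to a -> similarity 0.0

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```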
Obtaining a vector for an arbitrary word or sentence
get_vector.py
from gensim.models.doc2vec import Doc2Vec
model = Doc2Vec.load('model/doc2vec.model')
vector = model.infer_vector(["camera"])
print(vector)
$ python3 get_vector.py
[ 2.47751679e-02 -2.37015472e-03 -9.07190889e-03 3.14958654e-02
-9.36891476e-04 -9.46468487e-03 -1.64013170e-02 3.72726540e-03
-5.04465075e-03 -2.05997601e-02 7.66226742e-03 -1.31780906e-02
-2.17045713e-02 -1.72169674e-02 2.66727619e-03 2.15792600e-02
5.42917755e-04 1.73445195e-02 -1.22042364e-02 -5.53676719e-03
1.31262755e-02 -1.20242555e-02 -4.90346774e-02 -1.30925402e-02
-9.20513365e-03 -1.02104060e-02 -3.45910038e-03 -1.38879791e-02
1.24522485e-02 -8.98833107e-03 9.46434215e-03 1.60158724e-02
-1.45071670e-02 -1.17817409e-02 -1.32775353e-02 5.12571214e-03
2.63297558e-03 1.27882678e-02 1.84957087e-02 -9.61085316e-03
2.48025986e-03 3.07683889e-02 -2.19578166e-02 8.81655049e-03
2.28421725e-02 2.32696421e-02 7.16307247e-03 -1.04713291e-02
4.92677977e-03 -3.09784282e-02 1.34340376e-02 7.26914825e-03
4.67077689e-03 2.15533823e-02 -1.70422960e-02 -1.30527671e-02
-7.90362991e-03 -4.17791121e-03 1.10175610e-02 -7.73182791e-03
6.48175087e-03 6.38299622e-04 9.35730152e-03 -2.26445938e-03
1.46172876e-02 -1.04205897e-02 -2.16954977e-05 2.67737289e-03
1.72927193e-02 1.39058568e-02 7.09218113e-03 1.93058401e-02
1.14299208e-02 3.92317260e-03 -7.82044325e-03 -2.86504477e-02
8.82215053e-03 -1.86563854e-03 1.38469525e-02 -3.11182608e-04
-1.86214391e-02 -1.87536830e-03 2.80867293e-02 -9.54967982e-04
-1.23350583e-02 -1.43871717e-02 9.01202485e-03 -4.42029210e-03
-1.06303710e-02 -8.69653840e-03 -1.31274825e-02 -1.78468637e-02
-8.98253825e-03 -8.39732401e-03 -1.02942903e-02 4.17890493e-03
-3.77285830e-03 -8.58513173e-03 1.53906625e-02 1.35426852e-03]
Obtaining the vectors of the trained documents
print(len(model.docvecs))
(the number of trained documents)
print(model.docvecs[0])
[ 0.3984622 0.0011081 -0.34068212 0.2807926 -0.09965006 0.05219714
-0.07226691 -0.10903342 -0.3401149 0.00761536 0.14741096 -0.428729
-0.1831681 -0.18672702 0.30623436 0.09046402 0.17570521 0.59133667
-0.01409002 0.2635858 0.15392083 -0.03985418 -0.59720606 0.306037
0.18144156 -0.11081521 -0.00683758 -0.05085954 0.6232276 -0.45330688
0.09846549 0.29597676 0.0834657 -0.18919533 0.17512774 -0.22221033
0.30882886 -0.2151929 0.6124842 -0.43345436 0.5355878 0.00944662
-0.5844124 0.04926539 0.25019678 -0.02182007 0.16996674 -0.31945443
0.2030172 -0.38379008 0.49651483 0.2437395 0.41874662 0.43805206
-0.9164802 0.24835101 0.11624163 -0.1720545 0.42150754 -0.4028251
-0.09254187 0.03971908 -0.44850552 0.0768422 -0.10021693 0.47900409
-0.24413407 -0.09366472 0.0504733 0.19371264 -0.0647843 0.41226834
-0.07237258 -0.07221863 0.145306 -0.25470334 0.17476264 0.4592808
0.40632915 0.2842979 0.10939611 -0.10637764 -0.06468508 -0.48528668
-0.2735453 -0.32164642 0.20348737 -0.19311284 -0.11128855 0.17956318
0.13550924 0.17764378 0.03881073 -0.25717908 0.03879094 -0.16467957
-0.10741294 -0.17077914 0.74566174 -0.02592899]
References