Getting reproducible results from get_document_topics in gensim topic models

Posted at 2022-09-09 (last updated at 2022-09-09)

Environment

Python 3.7.13
gensim 3.6.0
Run on Google Colaboratory

Symptom

Following the gensim topic model sample code, let's build a throwaway model.

from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel

texts = [
    ['computer', 'time', 'graph'],
    ['survey', 'response', 'eps'],
    ['human', 'system', 'computer']
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
model = LdaModel(corpus, num_topics=2, random_state=0, id2word=dictionary)

Given some input, the model outputs the probability of the input belonging to each topic.

print(model.get_document_topics(dictionary.doc2bow(["computer"])))

Output

[(0, 0.67178893), (1, 0.32821107)]

Even with random_state=0, these probabilities change slightly on every call.

model = LdaModel(corpus, num_topics=2, random_state=0, id2word=dictionary)
print(model.get_document_topics(dictionary.doc2bow(["computer"])))
print(model.get_document_topics(dictionary.doc2bow(["computer"])))
print(model.get_document_topics(dictionary.doc2bow(["computer"])))

Output

[(0, 0.67178893), (1, 0.32821107)]
[(0, 0.6716176), (1, 0.32838237)]
[(0, 0.6716369), (1, 0.32836312)]

The cause appears to be that the inference() method, which get_document_topics() calls internally, advances the model's random number generator each time it runs.

Reference: https://github.com/RaRe-Technologies/gensim/blob/62669aef21ae8047c3105d89f0032df81e73b4fa/gensim/models/ldamodel.py#L677
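The effect can be sketched with NumPy alone. This is a minimal illustration, not gensim's actual code: rng stands in for model.random_state, and the gamma draw with these particular parameters and shape is only an assumption about what inference() does with the generator.

```python
import numpy as np

# Stand-in for model.random_state (assumed to be a numpy RandomState).
rng = np.random.RandomState(0)

# Two consecutive draws consume the random stream, so they differ.
# The gamma parameters and the (1, 2) shape are illustrative assumptions.
first = rng.gamma(100.0, 1.0 / 100.0, (1, 2))
second = rng.gamma(100.0, 1.0 / 100.0, (1, 2))
print(np.array_equal(first, second))  # False

# Re-creating the generator with the same seed replays the same draw.
rng = np.random.RandomState(0)
replayed = rng.gamma(100.0, 1.0 / 100.0, (1, 2))
print(np.array_equal(first, replayed))  # True
```

If the starting point of inference depends on such a draw, each call starts from a slightly different place, which would explain the small differences in the output.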

To get reproducible results, set model.random_state before every call, as follows.

from gensim.utils import get_random_state

model = LdaModel(corpus, num_topics=2, random_state=0, id2word=dictionary)
model.random_state = get_random_state(0)
print(model.get_document_topics(dictionary.doc2bow(["computer"])))
model.random_state = get_random_state(0)
print(model.get_document_topics(dictionary.doc2bow(["computer"])))
model.random_state = get_random_state(0)
print(model.get_document_topics(dictionary.doc2bow(["computer"])))

Output

[(0, 0.67151856), (1, 0.32848147)]
[(0, 0.67151856), (1, 0.32848147)]
[(0, 0.67151856), (1, 0.32848147)]
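The re-seeding step can be folded into a small helper so it is not forgotten before a call. The sketch below is one possible way to do it. The name get_document_topics_seeded is hypothetical, and it uses np.random.RandomState(seed) directly, on the assumption that this is equivalent to get_random_state(seed) for an integer seed.

```python
import numpy as np


def get_document_topics_seeded(model, bow, seed=0, **kwargs):
    """Hypothetical helper: reset the model's RNG, then infer topics.

    `model` is expected to expose `random_state` and
    `get_document_topics`, as gensim's LdaModel does.
    """
    # Replace the generator so every call starts from the same state.
    model.random_state = np.random.RandomState(seed)
    return model.get_document_topics(bow, **kwargs)
```

With the model above, get_document_topics_seeded(model, dictionary.doc2bow(["computer"])) should return the same probabilities on every call. Note that the helper overwrites model.random_state, so anything else that relies on the model's random stream is affected as well.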

Caveats

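The gensim FAQ discusses the case where word2vec and doc2vec results differ slightly between runs and cannot be reproduced exactly.
It says that evaluation should tolerate the small deviations caused by randomness ("evaluation processes should be tolerant of any shifts in vector positions, and of small "jitter" in the overall utility of models, that arises from the inherent algorithm randomness.").
There are ways to fix the random numbers, but doing so does not seem to be recommended.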

The FAQ does not mention topic models, but I think the same applies. Fixing the random state should probably be reserved for tasks such as isolating a problem.
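As a rough sketch of what tolerating the jitter could look like, the snippet below compares the three unseeded outputs shown earlier using an absolute tolerance instead of exact equality. The threshold ATOL = 1e-3 is an assumption chosen for this example, not a value recommended by gensim.

```python
import numpy as np

# Topic probabilities printed by the three unseeded calls in this article.
runs = np.array([
    [0.67178893, 0.32821107],
    [0.6716176, 0.32838237],
    [0.6716369, 0.32836312],
])

# Exact comparison fails because of the jitter.
print(np.array_equal(runs[0], runs[1]))  # False

# Every run agrees with the first one within an absolute tolerance.
ATOL = 1e-3  # assumed threshold; pick one that suits your application
within_tolerance = all(
    np.allclose(runs[0], run, rtol=0.0, atol=ATOL) for run in runs
)
print(within_tolerance)  # True
```

A check like this keeps tests stable without touching the model's random state.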
