
Let's use tomotopy instead of gensim

Last updated at 2020-08-07, posted at 2020-08-04

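What is tomotopy?

tomotopy is short for TOpic MOdeling TOol. It is a Python library that mainly handles LDA (Latent Dirichlet Allocation) and algorithms derived from it.

Compared with gensim, a library with similar functionality, it is easier to use, and because it is implemented in C++ it also runs fast.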

Installation

Just install it with pip.

pip install tomotopy

Usage

As an example, we use the following dataset from the gensim tutorial.

Human machine interface for lab abc computer applications
A survey of user opinion of computer system response time
The EPS user interface management system
System and human system engineering testing of EPS
Relation of user perceived response time to error measurement
The generation of random binary unordered trees
The intersection graph of paths in trees
Graph minors IV Widths of trees and well quasi ordering
Graph minors A survey

This is how LDA is used with tomotopy.

The dataset is used in its preprocessed form (the preprocessing is the same as in the gensim tutorial).
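For reference, here is a rough sketch of that preprocessing in plain Python, separate from the tomotopy code itself. The stop-word list is an assumption taken from my recollection of the gensim tutorial, not something stated in this post; words that appear only once in the whole dataset are dropped as well.

```python
from collections import Counter

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# Assumed stop-word list (not given in this post)
stoplist = set("for a of the and to in".split())

# Lowercase, split on whitespace, and drop stop words
tokenized = [
    [word for word in doc.lower().split() if word not in stoplist]
    for doc in documents
]

# Keep only words that occur more than once across the whole dataset
frequency = Counter(word for doc in tokenized for word in doc)
texts = [[word for word in doc if frequency[word] > 1] for doc in tokenized]

for text in texts:
    print(text)
```

The resulting `texts` list is what gets passed to tomotopy one document at a time.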

import tomotopy as tp
from pprint import pprint

texts = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey']
]

# Initialize the model
model = tp.LDAModel(k=2, seed=1)  # k is the number of topics

# Build the corpus
for text in texts:
    model.add_doc(text)

# Train
model.train(iter=100)

# Retrieve the word distribution of each topic
for k in range(model.k):
    print(f"Topic {k}")
    pprint(model.get_topic_words(k, top_n=5))

"""output
Topic 0
[('system', 0.20972803235054016),
 ('user', 0.15742677450180054),
 ('human', 0.10512551665306091),
 ('interface', 0.10512551665306091),
 ('computer', 0.10512551665306091)]
Topic 1
[('trees', 0.2974308431148529),
 ('graph', 0.2974308431148529),
 ('survey', 0.1986166089773178),
 ('minors', 0.1986166089773178),
 ('system', 0.0009881423320621252)]
"""

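Features of tomotopy

Pros

Easy to use.

Almost everything you typically want to do with LDA can be done simply by setting arguments on the model constructor or on the training function.

(Parallelization, TF-IDF, upper and lower limits on word frequency and document frequency, and so on.)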
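The training algorithm is sampling (collapsed Gibbs sampling).

gensim uses variational inference, but sampling is said to give better accuracy.

The usual drawback of sampling is that it is slow, but tomotopy is implemented in C++ and is easy to parallelize, so it is much faster than tools such as MALLET.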
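To give a feel for what collapsed Gibbs sampling does, here is a toy pure-Python sketch. It is not tomotopy's implementation and is far slower. It only shows the core idea: the topic of each token is resampled in turn from count tables, with the topic and word distributions integrated out. The function name and the hyperparameter values are my own choices.

```python
import random

def lda_collapsed_gibbs(docs, k, alpha=0.1, eta=0.01, iters=200, seed=1):
    """Toy collapsed Gibbs sampler for LDA. Returns (doc_topic, topic_word, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)

    n_dk = [[0] * k for _ in docs]      # tokens in document d assigned to topic t
    n_kw = [[0] * v for _ in range(k)]  # times word w is assigned to topic t
    n_k = [0] * k                       # total tokens assigned to topic t
    z = []                              # current topic of every token

    # Random initial assignment
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            t = rng.randrange(k)
            z[d].append(t)
            n_dk[d][t] += 1
            n_kw[t][word_id[w]] += 1
            n_k[t] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                # Remove the token's current assignment from the counts
                t = z[d][i]
                n_dk[d][t] -= 1
                n_kw[t][wid] -= 1
                n_k[t] -= 1
                # p(z = j | everything else), up to a constant
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][wid] + eta) / (n_k[j] + v * eta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                # Record the new assignment
                z[d][i] = t
                n_dk[d][t] += 1
                n_kw[t][wid] += 1
                n_k[t] += 1

    # Point estimates from the final counts
    topic_word = [
        [(n_kw[t][w] + eta) / (n_k[t] + v * eta) for w in range(v)]
        for t in range(k)
    ]
    doc_topic = [
        [(n_dk[d][t] + alpha) / (len(doc) + k * alpha) for t in range(k)]
        for d, doc in enumerate(docs)
    ]
    return doc_topic, topic_word, vocab

texts = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]

doc_topic, topic_word, vocab = lda_collapsed_gibbs(texts, k=2)
for t, dist in enumerate(topic_word):
    top = sorted(zip(vocab, dist), key=lambda pair: -pair[1])[:5]
    print(f"Topic {t}", top)
```

Every token is visited on every iteration, which is why a naive sampler is slow and why a C++ implementation with parallelization makes such a difference.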
Variants of LDA are available.

The following models can be used:
- Labeled LDA (LLDAModel)
- Partially Labeled LDA (PLDAModel)
- Supervised LDA (SLDAModel)
- Dirichlet Multinomial Regression (DMRModel)
- Generalized Dirichlet Multinomial Regression (GDMRModel)
- Hierarchical Dirichlet Process (HDPModel)
- Hierarchical LDA (HLDAModel)
- Multi Grain LDA (MGLDAModel)
- Pachinko Allocation (PAModel)
- Hierarchical PA (HPAModel)
- Correlated Topic Model (CTModel)
- Dynamic Topic Model (DTModel)

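Cons

It sometimes falls short on the finer details.

Perhaps because tomotopy is focused on ease of use, there are occasional "wait, it can't do this?" moments.

For example:
- ~~A processed corpus cannot be reused (you have to build the corpus again every time you train).~~
  On closer inspection, it looks like this can be done with the tomotopy.utils.Corpus class.
  Unfortunately, when I tried it, it turned out to be costly in both time and RAM.
- There is no way to save RAM.
(Well, neither of these matters much unless your dataset runs to tens of millions of documents.)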

Summary

With tomotopy, you can very easily train LDA-family models using sampling.

Honestly, I can't go back to gensim anymore.
