Summary of the formats gensim can serialize to

Posted at 2016-02-17

gensim

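gensim is a topic-modeling library implemented in Python. Its features are not covered in detail here.
This post summarizes the file formats a corpus can be serialized to once text has been converted to bag-of-words (BoW) form with gensim.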

Code

from gensim import corpora
from collections import defaultdict
from pprint import pprint

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
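# Lowercase, split on whitespace, and drop common stop words
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# Count how often each token occurs across all documents
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Keep only tokens that occur more than once
texts = [[token for token in text if frequency[token] > 1]
         for text in texts]
# Map tokens to integer ids, then convert each document to (id, count) pairs
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]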

# Write the same BoW corpus in each format
corpora.MmCorpus.serialize("./corpus.mm", corpus)
corpora.BleiCorpus.serialize("./corpus.blei", corpus)
corpora.LowCorpus.serialize("./corpus.low", corpus)
corpora.SvmLightCorpus.serialize("./corpus.svmlight", corpus)
corpora.UciCorpus.serialize("./corpus.uci", corpus)

pprint(texts)
print("\n")
pprint(dictionary.token2id)
print("\n")
pprint(corpus)

Output

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


{'computer': 1,
 'eps': 8,
 'graph': 10,
 'human': 2,
 'interface': 0,
 'minors': 11,
 'response': 6,
 'survey': 4,
 'system': 5,
 'time': 7,
 'trees': 9,
 'user': 3}


[[(0, 1), (1, 1), (2, 1)],
 [(1, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(0, 1), (3, 1), (5, 1), (8, 1)],
 [(2, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]

Matrix Market format

corpus.mm

%%MatrixMarket matrix coordinate real general
9 12 28                                           
1 1 1
1 2 1
1 3 1
2 2 1
2 4 1
2 5 1
2 6 1
2 7 1
2 8 1
3 1 1
3 4 1
3 6 1
3 9 1
4 3 1
4 6 2
4 9 1
5 4 1
5 7 1
5 8 1
6 10 1
7 10 1
7 11 1
8 10 1
8 11 1
8 12 1
9 5 1
9 11 1
9 12 1

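Comment lines are allowed and start with %.
Indices are 1-based.
The first non-comment line gives the number of rows, columns, and non-zero entries as integers.
When gensim writes this line, it is padded with many trailing spaces (apparently a fixed-width placeholder that is filled in after the entries are written).
The remaining lines have the form i j v, meaning that in the i-th document (row), the j-th word (column) occurs v times.
corpus.mm.index is also generated, but it is not a human-readable file.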
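
The layout described above can be read back with a few lines of plain Python. This is only a minimal sketch: `parse_mm` is a hypothetical helper, not part of gensim, and it assumes the coordinate layout shown in corpus.mm (comment lines starting with %, then the size line, then i j v triples). The sample is the first two documents of corpus.mm with the size line adjusted to match.

```python
from collections import defaultdict

def parse_mm(text):
    """Turn Matrix Market coordinate text into per-document BoW lists (0-based ids)."""
    # Drop blank lines and comment lines (those starting with %).
    lines = [ln.strip() for ln in text.splitlines()
             if ln.strip() and not ln.startswith("%")]
    # Size line: rows (documents), columns (words), non-zero entries.
    # split() also swallows the trailing padding gensim leaves on this line.
    num_docs, num_terms, num_nnz = (int(x) for x in lines[0].split())
    docs = defaultdict(list)
    for ln in lines[1:]:
        i, j, v = ln.split()
        # Indices in the file are 1-based; shift them to 0-based.
        docs[int(i) - 1].append((int(j) - 1, int(float(v))))
    return [docs[d] for d in range(num_docs)], num_terms, num_nnz

sample = """%%MatrixMarket matrix coordinate real general
2 8 9                    
1 1 1
1 2 1
1 3 1
2 2 1
2 4 1
2 5 1
2 6 1
2 7 1
2 8 1
"""
bow, num_terms, num_nnz = parse_mm(sample)
print(bow)
```

The result matches the first two entries of the BoW corpus printed in the Output section, which is a quick way to confirm the 1-based to 0-based shift.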

Blei format

corpus.blei

3 0:1 1:1 2:1
6 1:1 3:1 4:1 5:1 6:1 7:1
4 0:1 3:1 5:1 8:1
3 2:1 5:2 8:1
3 3:1 6:1 7:1
1 9:1
2 9:1 10:1
3 9:1 10:1 11:1
3 4:1 10:1 11:1

corpus.blei.vocab

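The format used by Blei, who published the original LDA paper.
Indices are 0-based.
Each line corresponds to one document.
Each line starts with the number of distinct words in that document.
Each j:v pair means that word j occurs v times in that document.
Serializing also generates corpus.blei.index and corpus.blei.vocab.
index is not a human-readable file.
vocab seems to indicate how many BoW features exist at most.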
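
To make the line layout concrete, here is a small sketch that parses a single line of corpus.blei. `parse_blei_line` is a hypothetical helper written for this post, not a gensim function, and it assumes integer counts as in the file above.

```python
def parse_blei_line(line):
    """Parse one Blei-format line into a list of (word_id, count) pairs."""
    parts = line.split()
    num_pairs = int(parts[0])           # leading field: number of j:v pairs that follow
    pairs = []
    for field in parts[1:]:
        j, v = field.split(":")
        pairs.append((int(j), int(v)))  # ids are already 0-based
    if len(pairs) != num_pairs:
        raise ValueError("leading count does not match the number of j:v pairs")
    return pairs

# Fourth line of corpus.blei: the document has 4 tokens but only 3 distinct words.
doc = parse_blei_line("3 2:1 5:2 8:1")
total_tokens = sum(v for _, v in doc)
print(doc, total_tokens)
```

The leading 3 matches the number of pairs, while the token total is 4 because word 5 occurs twice.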

UCI format

corpus.uci

corpus.uci.vocab

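The BoW format from the University of California, Irvine (UCI).
Indices are 1-based.
It is similar to the Matrix Market format, but differs in the following points:
There are no comment lines.
The summary that takes one line in Matrix Market is spread over multiple lines.
corpus.uci.index and corpus.uci.vocab are written at the same time.
index is not a human-readable file.
vocab seems to indicate how many BoW features exist at most, but for some reason its numbers are 0-based.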
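
As an illustration of the differences listed above, the sketch below renders a BoW corpus in a UCI-style layout. It assumes the usual UCI bag-of-words convention (document count, vocabulary size, and non-zero count on three separate lines, followed by docID wordID count triples). `to_uci_text` is a hypothetical helper, not gensim's writer, so the real corpus.uci may differ in details.

```python
def to_uci_text(corpus, num_terms):
    """Render a BoW corpus as UCI-style text with 1-based ids (illustrative only)."""
    entries = [(doc_no + 1, word_id + 1, count)      # shift both ids to 1-based
               for doc_no, doc in enumerate(corpus)
               for word_id, count in doc]
    # Three header lines instead of Matrix Market's single size line.
    header = [str(len(corpus)), str(num_terms), str(len(entries))]
    body = ["%d %d %d" % entry for entry in entries]
    return "\n".join(header + body) + "\n"

# Two short documents in gensim's (word_id, count) form, with 12 words in the vocabulary.
small_corpus = [[(0, 1), (1, 1), (2, 1)], [(9, 1)]]
uci_text = to_uci_text(small_corpus, num_terms=12)
print(uci_text)
```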

Low format

corpus.low

9
0 1 2
1 3 4 5 6 7
0 3 5 8
2 5 5 8
3 6 7
9
9 10
9 10 11
4 10 11

corpus.low.vocab

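The format used by Phan et al. for their C/C++ implementation of LDA (GibbsLDA++).
That implementation apparently performs LDA inference with MCMC via Gibbs sampling, rather than the variational Bayes used in Blei's original.
Indices are 0-based.
The first line gives the number of documents.
Each subsequent line corresponds to one document and lists j once for every occurrence of word j (for example, 5 appears twice on the fourth document's line).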
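
Because each occurrence is written out separately, recovering counts means tallying repeated tokens. The following is a minimal sketch with a hypothetical helper `parse_low`. It assumes the tokens are the integer ids seen in corpus.low above, whereas the format in general holds word strings.

```python
from collections import Counter

def parse_low(text):
    """Parse Low-format text into BoW lists by counting repeated ids."""
    lines = text.strip().splitlines()
    num_docs = int(lines[0])                      # first line: number of documents
    docs = []
    for ln in lines[1:1 + num_docs]:
        counts = Counter(int(tok) for tok in ln.split())
        docs.append(sorted(counts.items()))       # (word_id, count), ordered by id
    return docs

# First and fourth documents of corpus.low, with the document count set to 2.
low_docs = parse_low("2\n0 1 2\n2 5 5 8\n")
print(low_docs)
```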

SvmLight format

corpus.svmlight

0 1:1 2:1 3:1
0 2:1 4:1 5:1 6:1 7:1 8:1
0 1:1 4:1 6:1 9:1
0 3:1 6:2 9:1
0 4:1 7:1 8:1
0 10:1
0 10:1 11:1
0 10:1 11:1 12:1
0 5:1 11:1 12:1

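The format used by Joachims et al. for their C implementation of SVM (SVMlight).
Indices are 1-based.
The 0 at the start of each line is a holdover from SVM, where this field holds the label of the example (-1, 1, or 0).
These two points are where it differs from the Blei format.
Each line corresponds to one document.
Each j:v pair means that word j occurs v times in that document.
corpus.svmlight.index is also generated, but it is not a human-readable file.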
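
The two differences from the Blei format (a leading label instead of a count, and 1-based ids) can be seen in a short parsing sketch. `parse_svmlight_line` is a hypothetical helper, not a gensim function, and it assumes integer counts as in corpus.svmlight above.

```python
def parse_svmlight_line(line):
    """Parse one SvmLight-format line into (label, BoW list with 0-based ids)."""
    parts = line.split()
    label = int(parts[0])                          # leading field is the label, 0 here
    bow = []
    for field in parts[1:]:
        j, v = field.split(":")
        bow.append((int(j) - 1, int(float(v))))    # ids in the file are 1-based
    return label, bow

# Fourth line of corpus.svmlight.
label, bow = parse_svmlight_line("0 3:1 6:2 9:1")
print(label, bow)
```

After the shift, the pairs are the same as those on the fourth line of corpus.blei.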

References
