A follow-up check on the word2vec speed comparison

Posted at 2016-01-02

Happy New Year.
This may be an odd way to start the year, but I checked something that had been bothering me, so here is a write-up.

Background

The following article came across my Twitter timeline:
"word2vec の各種実装の速度比較" (a speed comparison of various word2vec implementations)

The experiment details at the end of that article list the target vocabulary for each implementation as follows.

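Implementation	Target words
word2vec	Words occurring 5 or more times in the corpus
word2vec_cbow	Words occurring 5 or more times in the corpus
gensim	Words occurring 5 or more times in the corpus
TensorFlow	Words occurring 5 or more times in the corpus
DMTK	Words occurring 5 or more times in the corpus
Chainer	All words in the corpus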

Hmm... all words...

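I wondered whether this difference in the frequency cutoff is small enough to ignore, so I checked.
The experiments ran on AWS using the Amazon Linux AMI with NVIDIA GRID GPU Driver, and I compared only the GPU-based implementations (word2vec_cbow / Chainer 1.5 / Chainer improve-word2vec).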

Preliminary check

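If the number of words occurring fewer than 5 times were small compared with the number occurring 5 or more times, we could conclude that there is no real impact without running anything, so I looked at the counts first.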

$ wget http://mattmahoney.net/dc/text8.zip
$ unzip text8.zip
$ cat text8 | sed "s| |\\n|g" | sort | uniq -c | sort -k1,1 -n | grep "^ \+[1-4] " | wc -l  # number of distinct words occurring 1 to 4 times
$ cat text8 | sed "s| |\\n|g" | sort | uniq -c | sort -k1,1 -n | wc -l  # total number of distinct words

The results are as follows.

Occurrences	Number of distinct words
1 to 4	182565
5 or more	71290

This looks like it could matter quite a bit, but let's see...
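The same kind of count can also be taken in Python. The following is a minimal sketch, assuming whitespace-separated tokens as in text8; the function name `count_vocab_by_threshold` and the tiny sample string are hypothetical and exist only for illustration.

```python
import collections


def count_vocab_by_threshold(text, min_count=5):
    """Return (rare, frequent): distinct words below / at or above min_count."""
    # str.split() with no argument splits on any whitespace and drops empty tokens.
    counts = collections.Counter(text.split())
    rare = sum(1 for n in counts.values() if n < min_count)
    frequent = len(counts) - rare
    return rare, frequent


if __name__ == "__main__":
    # Toy input: "a" occurs 5 times, "b" twice, "c" once.
    sample = "a a a a a b b c"
    print(count_vocab_by_threshold(sample))
```

To run it on the real corpus, read the `text8` file and pass its contents to the function. Because `str.split()` discards empty tokens, the totals may differ slightly from the shell pipeline if the file contains leading or repeated spaces.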

Results

Results first.

Implementation	Run time
word2vec_cbow	1m59.093s
Chainer 1.5	118m54.385s
Chainer improve-word2vec	79m36.320s
Chainer 1.5 (low-frequency words removed)	81m39.316s
Chainer improve-word2vec (low-frequency words removed)	71m3.429s

These results show the following.

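Including low-frequency words makes Chainer 1.5 nearly 1.5x slower (118m54s vs. 81m39s)
Removing low-frequency words in the speed-improved version gives a nearly 1.7x speedup over plain Chainer 1.5 (118m54s vs. 71m3s)
- However, going from the speed improvement alone to both changes combined yields only about a 12% speedup (79m36s vs. 71m3s)
- Perhaps because removing low-frequency words contributes less once the improvement is applied?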
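The ratios above can be recomputed from the timing table. The sketch below assumes the times are in the `NmS.SSSs` form printed by the shell's `time`; the helper `to_seconds` and the dictionary keys are names made up for this example.

```python
import re


def to_seconds(t):
    """Convert a time string such as '118m54.385s' to seconds."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", t)
    if m is None:
        raise ValueError("unexpected time format: %r" % t)
    return int(m.group(1)) * 60 + float(m.group(2))


# Wall-clock times copied from the results table.
times = {
    "chainer": "118m54.385s",
    "improved": "79m36.320s",
    "chainer_pruned": "81m39.316s",
    "improved_pruned": "71m3.429s",
}
sec = {name: to_seconds(t) for name, t in times.items()}

# Effect of pruning alone on Chainer 1.5.
print("pruning only:        %.2fx" % (sec["chainer"] / sec["chainer_pruned"]))
# Effect of pruning plus the speed improvement, relative to plain Chainer 1.5.
print("pruning + improved:  %.2fx" % (sec["chainer"] / sec["improved_pruned"]))
# Additional effect of pruning once the speed improvement is already applied.
print("pruning on improved: %.2fx" % (sec["improved"] / sec["improved_pruned"]))
```

The three printed ratios come out at roughly 1.46, 1.67, and 1.12, which matches the "nearly 1.5x", "nearly 1.7x", and "about 12%" figures.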

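In addition, while running the experiments I got the impression that saving the model and vocabulary at the end of the script is fairly slow.
I did not measure it, so I cannot say for sure, but I suspect this step also affects the run time.¹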

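As for why removing low-frequency words speeds things up, my guess is that cutting the input/output dimensionality to nearly 30% of its original size contributes, but I am not sure.
I would appreciate comments from anyone who knows the details.²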
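To get a feel for the scale involved, the sketch below estimates the size of one vocabulary-sized weight matrix before and after pruning. It assumes the embedding and output layers each hold `n_vocab * n_units` weights, which is an assumption about the model rather than something confirmed from the scripts, and it ignores the extra `<UNK>` entry. `layer_weights` is a hypothetical helper.

```python
def layer_weights(n_vocab, n_units):
    """Number of weights in one n_vocab x n_units matrix."""
    return n_vocab * n_units


n_units = 200            # `unit` in run_chainer_word2vec.sh
n_all = 182565 + 71290   # all distinct words, from the counts in the preliminary check
n_pruned = 71290         # words occurring 5 or more times (ignoring <UNK>)

ratio = n_pruned / n_all
print("all words:    %d weights per matrix" % layer_weights(n_all, n_units))
print("pruned vocab: %d weights per matrix" % layer_weights(n_pruned, n_units))
print("ratio: %.3f" % ratio)
```

The ratio is about 0.28, in line with the "nearly 30%" figure. Whether this reduction is what actually drives the speedup remains an open question here.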

Summary

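So overall the conclusion does not differ much from the original article: at least where word2vec training speed matters, using Chainer still looks tough.
(That said, considering that the implementation used for the comparison is only a sample...)
On the other hand, a speed difference of this size cannot be dismissed as negligible, so I think the original measurement was somewhat unfavorable to Chainer.³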

Appendix

Below are the steps and changes used for the verification.

Experiment procedure

$ git clone https://bitbucket.org/knzm/wordembedding-experiments.git
$ cd wordembedding-experiments
$ # Apply patches as needed...
$ ./setup.sh
$ time ./run_word2vec_cbow.sh
$ time ./run_chainer_word2vec.sh
# ... (rest omitted)

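hdf5, cuDNN, and so on were installed beforehand.
(Reference: "Amazon EC2のGPUインスタンスにChainer v1.5を3行で入れる", an article on installing Chainer v1.5 on an Amazon EC2 GPU instance in three lines)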

Changes

I modified the Chainer experiment scripts as follows.
(The changes are ugly, but they should not affect the overall picture... I hope.)

`word2vec_chainer.py`

This has apparently also been pointed out on GitHub, but the original sample code has an odd argument in calculate_loss, so I fixed that as well.

word2vec_chainer.py

diff --git a/word2vec_chainer.py b/word2vec_chainer.py
index fc827d4..137e2ac 100644
--- a/word2vec_chainer.py
+++ b/word2vec_chainer.py
@@ -96,7 +96,7 @@ class SoftmaxCrossEntropyLoss(chainer.Chain):
         return F.softmax_cross_entropy(self.W(x), t)
 
 
-def calculate_loss(model, dataset, offset):
+def calculate_loss(model, dataset, position):
     # use random window size in the same way as the original word2vec
     # implementation.
     w = np.random.randint(args.window - 1) + 1
@@ -118,22 +118,36 @@ if args.gpu >= 0:
 train_file = "text8/text8"
 result_dir = os.environ.get("result_dir", "result/text8_chainer")
 
-index2word = {}
-word2index = {}
 counts = collections.Counter()
-dataset = []
 with open(train_file) as f:
     for line in f:
         for word in line.split():
-            if word not in word2index:
-                ind = len(word2index)
-                word2index[word] = ind
-                index2word[ind] = word
-            counts[word2index[word]] += 1
-            dataset.append(word2index[word])
-
+            counts[word] += 1
+
+index2word = {0: '<UNK>'}
+word2index = {'<UNK>': 0}
+min_count = 5
+for word, n in counts.most_common():
+    if n < min_count:
+        continue
+    ind = len(word2index)
+    word2index[word] = ind
+    index2word[ind] = word
 n_vocab = len(word2index)
 
+del counts
+counts = collections.Counter()
+dataset = []
+with open(train_file) as f:
+    for line in f:
+        for word in line.split():
+            if word in word2index:
+                ind = word2index[word]
+            else:
+                ind = word2index['<UNK>']
+            dataset.append(ind)
+            counts[ind] += 1
+
 print('n_vocab: %d' % n_vocab)
 print('data length: %d' % len(dataset))
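The vocabulary handling in the patch boils down to two passes over the corpus. Below is a minimal self-contained sketch of that idea, working on an in-memory token list instead of the text8 file; the function name `build_dataset` and its return shape are hypothetical, not part of the experiment scripts.

```python
import collections


def build_dataset(tokens, min_count=5, unk="<UNK>"):
    """Map tokens to ids, sending words rarer than min_count to a shared unk id."""
    # Pass 1: count raw word frequencies.
    word_counts = collections.Counter(tokens)

    # Assign ids only to sufficiently frequent words; id 0 is reserved for unk.
    word2index = {unk: 0}
    for word, n in word_counts.most_common():
        if n < min_count:
            break  # most_common() is sorted by count, so the rest are rarer too
        word2index[word] = len(word2index)
    index2word = {i: w for w, i in word2index.items()}

    # Pass 2: encode the corpus and count occurrences per id.
    dataset = [word2index.get(w, word2index[unk]) for w in tokens]
    id_counts = collections.Counter(dataset)
    return dataset, word2index, index2word, id_counts


if __name__ == "__main__":
    toy = "a a a b b c".split()
    dataset, word2index, index2word, id_counts = build_dataset(toy, min_count=2)
    print(dataset)      # "c" is mapped to the <UNK> id
    print(word2index)
```

Note that rare words are not dropped from the dataset: they are replaced by `<UNK>`, so the data length stays the same while `n_vocab` shrinks.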

`word2vec_chainer_improved.py`

The model classes looked slightly wrong to me (for example, __call__ in class ContinuousBoW calls model.embed directly), so I fixed that here as well.

word2vec_chainer_improved.py

diff --git a/word2vec_chainer_improved.py b/word2vec_chainer_improved.py
index e9372d8..381225f 100644
--- a/word2vec_chainer_improved.py
+++ b/word2vec_chainer_improved.py
@@ -59,7 +59,7 @@ class ContinuousBoW(chainer.Chain):
         )
 
     def __call__(self, x, context):
-        e = model.embed(context)
+        e = self.embed(context)
         h = F.sum(e, axis=0) * (1. / context.data.shape[0])
         return self.loss_func(h, x)
 
@@ -73,7 +73,7 @@ class SkipGram(chainer.Chain):
         )
 
     def __call__(self, x, context):
-        e = model.embed(context)
+        e = self.embed(context)
         shape = e.data.shape
         dummy = chainer.Variable(
             xp.empty((shape[0], shape[1])))
@@ -112,22 +112,36 @@ if args.gpu >= 0:
 train_file = "text8/text8"
 result_dir = os.environ.get("result_dir", "result/text8_chainer_improved")
 
-index2word = {}
-word2index = {}
 counts = collections.Counter()
-dataset = []
 with open(train_file) as f:
     for line in f:
         for word in line.split():
-            if word not in word2index:
-                ind = len(word2index)
-                word2index[word] = ind
-                index2word[ind] = word
-            counts[word2index[word]] += 1
-            dataset.append(word2index[word])
-
+            counts[word] += 1
+
+index2word = {0: '<UNK>'}
+word2index = {'<UNK>': 0}
+min_count = 5
+for word, n in counts.most_common():
+    if n < min_count:
+        continue
+    ind = len(word2index)
+    word2index[word] = ind
+    index2word[ind] = word
 n_vocab = len(word2index)
 
+del counts
+counts = collections.Counter()
+dataset = []
+with open(train_file) as f:
+    for line in f:
+        for word in line.split():
+            if word in word2index:
+                ind = word2index[word]
+            else:
+                ind = word2index['<UNK>']
+            dataset.append(ind)
+            counts[ind] += 1
+
 print('n_vocab: %d' % n_vocab)
 print('data length: %d' % len(dataset))
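Why `model.embed` inside a method is fragile can be shown without Chainer. The sketch below is plain Python with a hypothetical `Embedder` class standing in for the model classes. It is not the Chainer code itself, and it assumes the original script happened to work only because a module-level variable named `model` existed.

```python
class Embedder:
    """Toy stand-in for a model class that owns a lookup table."""

    def __init__(self, table):
        self.table = table

    def lookup_buggy(self, key):
        # Reads the module-level name `model`, not this instance.
        return model.table[key]

    def lookup_fixed(self, key):
        # Reads this instance's own table.
        return self.table[key]


model = Embedder({"a": 1})   # the global that lookup_buggy silently depends on
other = Embedder({"a": 2})   # a second instance with different contents

print(other.lookup_buggy("a"))  # uses model's table, not other's
print(other.lookup_fixed("a"))  # uses other's table
```

With a single global instance both variants return the same value, which is presumably why the issue has little effect on the measurements. The `self` version simply stops depending on how the script names its model variable.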

`run_chainer_word2vec.sh`

The run script's parameters differed from those described in the article, so I rewrote them.
(The same changes were applied to run_chainer_word2vec_improved.sh.)

run_chainer_word2vec.sh

diff --git a/run_chainer_word2vec.sh b/run_chainer_word2vec.sh
index dc7926c..8913cd5 100755
--- a/run_chainer_word2vec.sh
+++ b/run_chainer_word2vec.sh
@@ -5,10 +5,10 @@ PYTHON=virtualenvs/chainer/bin/python
 result_root=${result_root:-result}
 result_dir=$result_root/text8_chainer
 
-gpu=-1
+gpu=0
 unit=200
 window=8
-batchsize=1000
+batchsize=50000
 epoch=${epoch:-15}
 model=cbow
 out_type=ns

Other

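In addition, installing DMTK and related pieces failed, so I made the following changes.
(This time I only measure GPU performance, so skipping the unneeded steps is fine.)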

install.sh

diff --git a/install.sh b/install.sh
index b52cde0..fd62fed 100755
--- a/install.sh
+++ b/install.sh
@@ -8,11 +8,11 @@ if [ -f build/word2vec_cbow/word2vec ]; then
   cp build/word2vec_cbow/word2vec bin/word2vec_cbow
 fi
 
-cp build/dmtk/distributed_word_embedding/bin/word_embedding bin/word_embedding
-cp build/dmtk/distributed_word_embedding/preprocess/word_count bin/word_count
-cp build/dmtk/distributed_word_embedding/preprocess/stopwords_simple.txt text8/stopwords_simple.txt
-
-cp build/dmtk/distributed_skipgram_mixture/bin/distributed_skipgram_mixture bin/distributed_skipgram_mixture
+#cp build/dmtk/distributed_word_embedding/bin/word_embedding bin/word_embedding
+#cp build/dmtk/distributed_word_embedding/preprocess/word_count bin/word_count
+#cp build/dmtk/distributed_word_embedding/preprocess/stopwords_simple.txt text8/stopwords_simple.txt
+#
+#cp build/dmtk/distributed_skipgram_mixture/bin/distributed_skipgram_mixture bin/distributed_skipgram_mixture
 
 cp build/word2vec/compute-accuracy bin/compute-accuracy
 cp build/word2vec/questions-words.txt data/questions-words.txt

build/build_all.sh

diff --git a/build/build_all.sh b/build/build_all.sh
index a94c1d4..fb8dc34 100755
--- a/build/build_all.sh
+++ b/build/build_all.sh
@@ -1,10 +1,10 @@
 #!/bin/sh -e
 
-if [ "$(uname)" = "Darwin" ]; then
-  ./build_dmtk_mac.sh
-else
-  ./build_dmtk_ubuntu.sh
-fi
+#if [ "$(uname)" = "Darwin" ]; then
+#  ./build_dmtk_mac.sh
+#else
+#  ./build_dmtk_ubuntu.sh
+#fi
 
 ./build_word2vec.sh

Things I was curious about but did not check

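Whether the speed changes when numpy's backend is switched to OpenBLAS or similar
- I left this untouched for these runs
- When running on the GPU, I expect it would act somewhere other than the bottleneck and so would not help much, but...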

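¹ After writing this far, I compared the experiment scripts on Bitbucket with the implementations on GitHub, and it looks to me as if at least the TensorFlow script has an advantage because it does not save intermediate models. Am I imagining it? Or is this one really within the margin of error?
² Comments along the lines of "your measurements are off too!" are also welcome. Sincerely.
³ Taking all of this into account, I would be very grateful if official benchmarks were published.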
