What is word2vec?

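word2vec is a natural language processing project that Google open-sourced in 2013.
It reads in text, learns what the words mean, and represents each word as a multi-dimensional vector... (I don't really get it yet)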

I figured I might be able to do something fun with it in a Slack bot, so I'm trying it out.

Installing it in my work Docker container, for starters

Clone it from GitHub and run make

command

[root@yuichi_bl word2vec]# git clone https://github.com/svn2github/word2vec.git
[root@yuichi_bl word2vec]# cd word2vec
[root@yuichi_bl word2vec]# make

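Googling turns up a lot of articles that fetch the source from svn and run make, but note that the svn repository no longer seems to exist.

The repository ships with some demo scripts, so I ran one.

It died with a segmentation fault...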

command

[root@yuichi_bl word2vec]# ./demo-word.sh 
make: Nothing to be done for `all'.
Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
./demo-word.sh: line 6:  2139 Segmentation fault      ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15

Apparently it ran out of memory. I edited the files below and ran make again.

word2vec.c

const int table_size = 1e7;  // line 49 of word2vec.c; reduced from the original 1e8
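To get a rough feel for why shrinking table_size helps, here is a small back-of-the-envelope sketch. As far as I can tell from word2vec.c, table_size is the number of int entries in the unigram table used for negative sampling. The 4-byte int is my assumption (typical for Linux x86_64), and the helper name is made up for this example.

```python
# Rough memory estimate for an int table with table_size entries.
# Assumption: sizeof(int) == 4 bytes, as on typical Linux x86_64 builds.
BYTES_PER_INT = 4


def table_megabytes(table_size: float) -> float:
    """Return the approximate size of the table in megabytes (10^6 bytes)."""
    return int(table_size) * BYTES_PER_INT / 1e6


original = table_megabytes(1e8)  # stock value in word2vec.c
reduced = table_megabytes(1e7)   # value used in this post

print(f"table_size=1e8: about {original:.0f} MB")
print(f"table_size=1e7: about {reduced:.0f} MB")
```

So the change cuts that one allocation to a tenth. If the larger allocation cannot be satisfied inside a memory-limited container, a crash like the one above seems like a plausible outcome.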

demo-word.sh

# Changed -threads to 1 (and -iter from 15 to 1)
time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 1 -binary 1 -iter 1

command

[root@yuichi_bl word2vec]# make clean
[root@yuichi_bl word2vec]# make

(Reference) https://ghost.geeek.red/gdb-trouble-shoot/

Running demo-word.sh and playing with it

Probably because I dropped the thread count to 1, training took about 35 minutes.

command

[root@yuichi_bl word2vec]# ./demo-word.sh 
make: Nothing to be done for `all'.
Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.008455  Progress: 83.09%  Words/thread/sec: 121.18k  
Alpha: 0.000005  Progress: 100.00%  Words/thread/sec: 121.19k  
real    35m54.997s
user    34m29.820s
sys     0m2.956s

If a file named vectors.bin has been generated, it worked.
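If you want to peek inside vectors.bin from a script, the sketch below reads the file into a dict. It is based on my understanding of the -binary 1 layout: an ASCII header line with the vocabulary size and dimension, then for each word the word itself, a space, the raw 32-bit floats, and a newline. Little-endian floats are an assumption (true on x86), and read_word2vec_bin is a name I made up. The sample file at the bottom is hand-made, not real training output.

```python
import os
import struct
import tempfile


def read_word2vec_bin(path):
    """Read a word2vec binary vector file into a dict of word -> list of floats."""
    vectors = {}
    with open(path, "rb") as f:
        # Header line: "<vocab_size> <dimensions>\n" in ASCII.
        vocab_size, dim = map(int, f.readline().split())
        for _ in range(vocab_size):
            # Read the word byte by byte until the separating space.
            chars = bytearray()
            while True:
                ch = f.read(1)
                if ch == b" " or ch == b"":
                    break
                if ch != b"\n":  # skip the newline that ends the previous entry
                    chars.extend(ch)
            word = chars.decode("utf-8", errors="replace")
            # The vector follows as `dim` little-endian 32-bit floats (assumption).
            values = struct.unpack(f"<{dim}f", f.read(4 * dim))
            vectors[word] = list(values)
    return vectors


# Self-check with a tiny hand-made file in the same layout.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"2 3\n")
    tmp.write(b"tokyo " + struct.pack("<3f", 1.0, 0.5, -2.0) + b"\n")
    tmp.write(b"osaka " + struct.pack("<3f", 0.25, 0.0, 4.0) + b"\n")
    sample_path = tmp.name

sample = read_word2vec_bin(sample_path)
os.remove(sample_path)
print(sample)
```

For the real file you would call read_word2vec_bin("vectors.bin"). With 71291 words of 200 dimensions each, a plain dict of lists is slow but workable for a quick look.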

■ Getting similar words

distance is a program that prints the words most similar to the word you enter. (It lives in the directory where you ran make.)
Let's try it with the word toyama.

command

[root@yuichi_bl word2vec]# ./distance vectors.bin
Enter word or sentence (EXIT to break): toyama

Word: toyama  Position in vocabulary: 67744

                                              Word       Cosine distance
------------------------------------------------------------------------
                                              oita              0.538174
                                        kitakyushu              0.535925
                                          yokohama              0.530349
                                            nagoya              0.516729
                                           fukuoka              0.508883
                                             tokyo              0.505655

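Place names came out: oita, kitakyushu, yokohama, and so on, sorted by score. The column is labeled "Cosine distance", but the values are cosine similarities, so higher means closer.
I used an English corpus, so I guess this is about what you'd expect for a Japanese place name...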
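For reference, the ranking that distance prints can be sketched in a few lines. This is only my illustration of the idea (cosine similarity between the query vector and every other vector), not the tool's actual code, and the 3-dimensional vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, topn=3):
    """Return the topn (word, similarity) pairs closest to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, invented for illustration only (real ones have 200 dimensions).
toy = {
    "toyama": [0.9, 0.1, 0.0],
    "oita": [0.8, 0.2, 0.1],
    "tokyo": [0.6, 0.6, 0.0],
    "apple": [0.0, 0.1, 0.9],
}

ranking = most_similar("toyama", toy)
for other, score in ranking:
    print(f"{other:>10}  {score:.6f}")
```

Words that point in nearly the same direction as the query score close to 1, and unrelated words score close to 0.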

■ Analyzing meaning

For this, use the program called word-analogy.
Enter three words separated by spaces.

It answers the question "A is to B as C is to what?"

command

[root@yuichi_bl word2vec]# ./word-analogy vectors.bin
Enter three words (EXIT to break): apple mac microsoft
Word: apple  Position in vocabulary: 1221
Word: mac  Position in vocabulary: 1722
Word: microsoft  Position in vocabulary: 1162
                                              Word              Distance
------------------------------------------------------------------------
                                           windows              0.454303
                                               dll              0.452877
                                               gtk              0.432725
                                               api              0.432079

So mac is to apple what windows is to microsoft.

B - A + C = result

Apparently you can add and subtract words like this, so

mac - apple + microsoft = windows

can be pictured as

mac - apple = OS
OS + microsoft = windows

or something along those lines, I suppose.
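The B - A + C idea can be tried with toy numbers. The sketch below is a simplified illustration, not word-analogy's actual code: the vectors are invented so that the third dimension means "is an OS", and the function name analogy is my own.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(a, b, c, vectors, topn=1):
    """Rank words by similarity to (b - a + c), skipping the three input words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    scored = [
        (word, cosine_similarity(target, vec))
        for word, vec in vectors.items()
        if word not in (a, b, c)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up axes: [apple-ness, microsoft-ness, OS-ness]. Not real model output.
toy = {
    "apple": [1.0, 0.0, 0.0],
    "mac": [1.0, 0.0, 1.0],
    "microsoft": [0.0, 1.0, 0.0],
    "windows": [0.0, 1.0, 1.0],
    "linux": [0.0, 0.0, 1.0],
    "banana": [1.0, 0.0, 0.1],
}

best = analogy("apple", "mac", "microsoft", toy)
print(best[0][0])
```

Here mac - apple leaves only the "OS" component, and adding microsoft lands exactly on the toy vector for windows. Real vectors are noisier, which is why the actual output also lists dll, gtk and api.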

(Reference) What happens when you run machine learning on pixiv novels (pretrained model data included)
http://inside.pixiv.net/entry/2016/09/13/161454

Trying it in Japanese

Work in progress.

Installing mecab first

(Reference) Steps for installing nkf on Amazon EC2 (Amazon Linux AMI)
http://tkuchiki.hatenablog.com/entry/2012/12/01/004833

(Reference) Installing MeCab with UTF-8
http://qiita.com/junpooooow/items/0a7d13addc0acad10606

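Building a corpus from Wikipedia data

Install ruby first (2.1 or later).
http://qiita.com/shinyashikis@github/items/3501c5f7f71a8e345c3d

(Reference) The road to building a Wikipedia corpus with wp2txt
http://qiita.com/wwacky/items/8a9eb543171afea90c0a

Following the URLs above, I installed ruby and wp2txt and downloaded the dump from Wikipedia. The steps below pick up from there.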

command

Convert the XML-format Wikipedia articles to plain text with wp2txt (takes quite a while)
# wp2txt --input-file jawiki-20170101-pages-articles-multistream.xml.bz2

wp2txt splits its output into multiple files, so merge them into one
# cat jawiki-20170101-pages-articles-multistream.xml-* > jawiki.txt

Split the text into space-separated words with mecab (takes quite a while)
# mecab -Owakati jawiki.txt > jawikisep.txt

Train with word2vec
# ./word2vec -train jawikisep.txt -output jawikisep.bin -size 200 -threads 4 -binary 1

The "Train with word2vec" step ran out of memory on bka-dev-docker.
I had no choice but to run it in Docker on my local PC.

(Reference) https://bookliveteam.qiita.com/yuki_iwamoto/items/e186884a0e7b67465752

The default mecab dictionary is weak, so redo it with the mecab-ipadic-neologd dictionary.

(Reference) neologd/mecab-ipadic-neologd
https://github.com/neologd/mecab-ipadic-neologd/blob/master/README.ja.md

Once the dictionary is installed, redo the word segmentation.

command

# mecab -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd -Owakati jawiki.txt > jawikisep_neologd.txt

Train again.

command

# ./word2vec -train jawikisep_neologd.txt -output jawikisep_neologd2.bin -size 200 -window 5 -sample 1e-3 -negative 5 -hs 0 -threads 20 -binary 1

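Options
  -window: how many words before and after each word are treated as its context
  -sample: threshold for randomly dropping frequent words. Words whose share of the corpus is above this value (1e-3 here) get down-sampled.
  -negative: how many randomly picked words are used as wrong answers (negative examples) per training step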
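To make -window and -sample a bit more concrete, here is a minimal sketch. The keep-probability formula follows my reading of word2vec.c, so treat it as an assumption rather than a spec, and both function names are made up. As far as I can tell the real tool also shrinks the window randomly at each position, which this sketch ignores.

```python
import math


def context_words(tokens, position, window):
    """Return up to `window` tokens on each side of tokens[position]."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right


def keep_probability(count, total_words, sample=1e-3):
    """Chance that one occurrence of a word survives -sample down-sampling.

    count: how often the word appears in the corpus
    total_words: total number of words in the corpus
    """
    threshold = sample * total_words
    p = (math.sqrt(count / threshold) + 1) * threshold / count
    return min(1.0, p)


tokens = ["a", "b", "c", "d", "e"]
print(context_words(tokens, 2, 1))  # neighbours of "c" with -window 1

# A word making up 10% of a 1,000,000-word corpus vs. a rare word.
frequent = keep_probability(count=100_000, total_words=1_000_000)
rare = keep_probability(count=10, total_words=1_000_000)
print(f"frequent word kept with p={frequent:.2f}, rare word with p={rare:.2f}")
```

With these numbers the very frequent word is kept only about 11% of the time, while the rare word is always kept, which matches the intent of thinning out words like particles.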

■ Getting similar words

command

[root@yuichi_bl word2vec]# ./distance jawikisep_neologd2.bin 
Enter word or sentence (EXIT to break): ポケモン
Word: ポケモン  Position in vocabulary: 9109

                                              Word       Cosine distance
------------------------------------------------------------------------
                                   ピカチュウ           0.607624
                                   モンスター           0.602709
                                         タマゴ         0.600524
                                   プチゲーム           0.599255
                                      スライム          0.596309
                                      アイテム          0.596196

The word closest to ポケモン (Pokémon) is ピカチュウ (Pikachu). Looking good.

command

Enter word or sentence (EXIT to break): 凸版印刷
Word: 凸版印刷  Position in vocabulary: 64924

                                              Word       Cosine distance
------------------------------------------------------------------------
                                   大日本印刷           0.819291
                                      住友商事          0.670218
                       JFEホールディングス              0.663548

凸版印刷 (Toppan Printing) and 大日本印刷 (Dai Nippon Printing) come out as close words. This looks good too.

command

Enter word or sentence (EXIT to break): BookLive
Word: BookLive  Position in vocabulary: 357455

                                              Word       Cosine distance
------------------------------------------------------------------------
                                            Bonbee              0.646494
                                      まぐまぐ          0.622223
                                土曜はダメよ            0.614109
                                              GYAO              0.602962
                                      たたかえ          0.591559
                                      ぱにっく          0.590639

!? What on earth is Bonbee... The word BookLive probably appears so rarely on Wikipedia that the accuracy is low.

■ Analyzing meaning

command

Enter three words (EXIT to break): ドラゴンボール 孫悟空 スラムダンク

Word: ドラゴンボール  Position in vocabulary: 11655
Word: 孫悟空  Position in vocabulary: 20369
Word: スラムダンク  Position in vocabulary: 61130

                                              Word              Distance
------------------------------------------------------------------------
                                      桜木花道          0.553936
                                   ハチ公物語           0.474670
                       トラブル・バスター               0.473985

For ドラゴンボール (Dragon Ball) → 孫悟空 (Son Goku), スラムダンク (Slam Dunk) → ?, it returned 桜木花道 (Hanamichi Sakuragi).
