
Word2Vec: From Installation to Building and Running a Model

Posted at 2019-06-15

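(Background to this article)

This is what I tried back in May 2018 on Ubuntu 16.04.
It had been sitting in my drafts, but I am publishing it in the hope that it helps someone. I seem to remember struggling with CaboCha...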

(Reference site)
https://m0t0k1ch1st0ry.com/blog/2016/08/28/word2vec/

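Creating a virtual environment

Packages are versioned per virtual environment, so each environment can use different module versions. If something goes wrong, you can simply delete the whole environment.

$ mkdir originaldir  # any directory name you like
$ cd originaldir
$ python3 -m venv originalenv  # any virtual environment name you like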

Activating the venv

$ source originalenv/bin/activate
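
To confirm that activation worked, one can check from Python whether the interpreter belongs to a virtual environment. This is a minimal sketch, assuming Python 3.3 or later; the helper name `in_virtualenv` is hypothetical and does not appear in the original procedure.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the system installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("prefix:", sys.prefix)
```

If this prints False after activation, the `activate` script was probably sourced from the wrong path.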

(Reference site)
https://m0t0k1ch1st0ry.com/blog/2016/07/30/nlp/

What I did, following this site

Installing MeCab

$ wget 'https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7cENtOXlicTFaRUE' -O mecab-0.996.tar.gz
$ tar zxvf mecab-0.996.tar.gz
$ cd mecab-0.996
$ ./configure --with-charset=utf8 --enable-utf8-only
$ make
$ sudo make install
$ mecab --version
mecab of 0.996

Installing the dictionary

$ wget 'https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7MWVlSDBCSXZMTXM' -O mecab-ipadic-2.7.0-20070801.tar.gz
$ tar zxvf mecab-ipadic-2.7.0-20070801.tar.gz
$ cd mecab-ipadic-2.7.0-20070801
$ ./configure --with-charset=utf8
$ make
$ sudo make install

Binding MeCab to Python 3

$ pip install mecab-python3
$ pip list | grep mecab-python3
mecab-python3 0.7

If you get an error such as "sudo: pip3: command not found" or "pip3: command not found", suspect a dependency problem.
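
One way to narrow down a "pip3: command not found" error is to check whether a `pip3` executable is on PATH at all, and to fall back to running pip as a module of the current interpreter. The sketch below uses only the standard library; the helper names `pip3_on_path` and `pip_command` are hypothetical. On Ubuntu the system-wide `pip3` normally comes from the `python3-pip` package, but whether that was the cause here is an assumption.

```python
import shutil
import sys


def pip3_on_path():
    """Return True if a pip3 executable can be found on PATH."""
    return shutil.which("pip3") is not None


def pip_command():
    """Build a pip invocation tied to the current interpreter."""
    # "python -m pip" runs the pip that belongs to this interpreter,
    # so it does not depend on a pip3 executable being on PATH.
    return [sys.executable, "-m", "pip"]


if __name__ == "__main__":
    print("pip3 on PATH:", pip3_on_path())
    print("suggested command:",
          " ".join(pip_command() + ["install", "mecab-python3"]))
```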

Installing CRF++

$ wget 'https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7QVR6VXJ5dWExSTQ' -O CRF++-0.58.tar.gz
$ tar zxvf CRF++-0.58.tar.gz
$ cd CRF++-0.58
$ ./configure
$ make
$ sudo make install

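Downloading CaboCha

Sticking point 1

It happened to me too, and from what I found while searching around, this seems to be where everyone gets stuck first.

From Google Drive, download "cabocha-0.69.tar.bz2" into the virtual environment's directory.
https://drive.google.com/drive/folders/0B4y35FiV1wh7cGRCUUJHVTNJRnM
Extract it in the directory where you downloaded it.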

Sticking point 2

make failed with an error.
Error message: "recipe for target 'install-libLTLIBRARIES' failed"
After a lot of searching, installing "libigraph0-dev" resolved it.

# https://qiita.com/manabu013/items/b368b4fd0965e7ee6eb9
# Fix: Failing to install python-igraph
$ sudo apt-get install -y libigraph0-dev

Installing CaboCha

$ tar jxvf cabocha-0.69.tar.bz2
$ cd cabocha-0.69
$ ./configure --with-charset=utf8 --enable-utf8-only
$ make
$ sudo make install
$ cabocha --version
cabocha of 0.69

Binding CaboCha to Python

$ cd cabocha-0.69/python
/cabocha-0.69/python$ python setup.py install
/cabocha-0.69/python$ pip list | grep cabocha-python
cabocha-python 0.69

Installing gensim

$ pip install gensim
$ pip list | grep gensim
gensim          3.4.0

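Getting "The Old Man and the Sea"

At the following URL, go to "ファイルのダウンロード" (file downloads), choose "テキストファイル(ルビあり)" (text file, with ruby), and download the zip file.
https://www.aozora.gr.jp/cards/001847/card57347.html#download

Extracting "The Old Man and the Sea"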

$ unzip 57347_ruby_57225.zip 
$ wc rojinto_umi.txt  # check the size
   726    807 122222 rojinto_umi.txt

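Removing the unneeded parts of the text file

Open "rojinto_umi.txt" in a text editor and delete the explanation of symbols at the beginning and the notes about the translation at the end.
The file size after the deletion is as follows.

$ wc rojinto_umi.txt  # check the size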
   634    643 174428 rojinto_umi.txt
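
The article does this deletion by hand in a text editor. As a rough sketch, the header part could also be removed with a few lines of Python. This assumes the symbol explanation at the top is enclosed by two lines made only of hyphens, which is the usual Aozora Bunko layout but should be checked against the actual file; the function name `strip_header_block` is hypothetical, and the notes at the end of the file are still left for manual removal. The text must already be decoded to a Python string.

```python
import re

# Assumption: the explanatory header is enclosed by two lines of hyphens.
SEPARATOR = re.compile(r"^-{10,}\s*$")


def strip_header_block(text):
    """Remove the first block enclosed by two separator lines, inclusive."""
    lines = text.splitlines(keepends=True)
    marks = [i for i, line in enumerate(lines) if SEPARATOR.match(line)]
    if len(marks) < 2:
        return text  # layout differs from the assumption: leave untouched
    start, end = marks[0], marks[1]
    return "".join(lines[:start] + lines[end + 1:])


if __name__ == "__main__":
    sample = "Title\n" + "-" * 20 + "\nnotes on symbols\n" + "-" * 20 + "\nBody\n"
    print(strip_header_block(sample), end="")
```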

$ sudo apt install nkf  # install nkf to check the character encoding
$ nkf -g rojinto_umi.txt  # check the character encoding
Shift_JIS
$ nkf -w --overwrite rojinto_umi.txt  # convert to UTF-8
$ nkf -g rojinto_umi.txt  # check the character encoding
UTF-8
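
If nkf is not available, the same conversion can be approximated in Python. This is a sketch rather than an exact equivalent of `nkf -w`: the function name `convert_to_utf8` is hypothetical, and it assumes the source file decodes cleanly as cp932, the Windows superset of Shift_JIS. It writes to a separate output file instead of overwriting the input.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding="cp932"):
    """Re-encode a text file as UTF-8 and return the number of characters."""
    # newline="" keeps the original line endings instead of translating them.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
    return len(text)


if __name__ == "__main__":
    # Small self-contained demonstration with a temporary Shift_JIS file.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "sjis.txt")
        dst = os.path.join(tmp, "utf8.txt")
        with open(src, "wb") as f:
            f.write("老人と海\r\n".encode("cp932"))
        convert_to_utf8(src, dst)
        with open(dst, "rb") as f:
            print(f.read().decode("utf-8"), end="")
```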

Word segmentation

Segment the text into words with MeCab.
Put the following "wakati.py" and "rojinto_umi.txt" in the same directory, and create "rojinto_umi_wakati.txt", the text segmented by MeCab.

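# wakati.py
# -*- coding: utf-8 -*-

import MeCab
import sys

# Raw string so that \s and \n reach MeCab as format escapes:
#   -F: space + base form (feature 6) for known words
#   -U: space + surface form for unknown words
#   -E: newline at the end of each parsed line
tagger = MeCab.Tagger(r'-F\s%f[6] -U\s%m -E\n')

# Open both files as UTF-8 explicitly; "with" closes them on exit.
with open(sys.argv[1], 'r', encoding='utf-8') as fi, \
        open(sys.argv[2], 'w', encoding='utf-8') as fo:
    for line in fi:
        result = tagger.parse(line)
        fo.write(result[1:])  # skip the leading space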

$ python wakati.py rojinto_umi.txt rojinto_umi_wakati.txt
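
To make clear what the output format passed to `MeCab.Tagger` produces, here is a pure-Python illustration that does not call MeCab. It assumes the IPA dictionary feature layout, in which index 6 holds the base (dictionary) form. The function name `to_base_forms` and the sample nodes are hypothetical and hand-written, so real MeCab output may differ.

```python
def to_base_forms(nodes):
    """Join words with spaces: base form if known, surface form otherwise."""
    words = []
    for surface, features, known in nodes:
        if known:
            # Mirrors -F\s%f[6]: write the base form of a known word.
            words.append(features[6])
        else:
            # Mirrors -U\s%m: write an unknown word as it appears.
            words.append(surface)
    return " ".join(words)


# Hand-written sample nodes for illustration, not real MeCab output.
sample = [
    ("魚", ["名詞", "一般", "*", "*", "*", "*", "魚", "サカナ", "サカナ"], True),
    ("を", ["助詞", "格助詞", "一般", "*", "*", "*", "を", "ヲ", "ヲ"], True),
    ("釣っ", ["動詞", "自立", "*", "*", "五段・ラ行", "連用タ接続", "釣る", "ツッ", "ツッ"], True),
    ("た", ["助動詞", "*", "*", "*", "特殊・タ", "基本形", "た", "タ", "タ"], True),
    ("サンチャゴ", ["名詞", "一般", "*", "*", "*", "*", "*"], False),
]

print(to_base_forms(sample))
```

Writing base forms means that inflected forms of the same verb end up as one vocabulary entry, which helps with a corpus as small as a single novel.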

Running Word2Vec

Build a model with Word2Vec.
Put the following "train.py" and "rojinto_umi_wakati.txt" in the same directory, and create "rojinto_umi.model".

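# train.py
# -*- coding: utf-8 -*-

from gensim.models import word2vec
import logging
import sys

# Log training progress to the console.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Read the segmented text: one sentence per line, words separated by spaces.
sentences = word2vec.LineSentence(sys.argv[1])
model = word2vec.Word2Vec(sentences,
                          sg=1,         # skip-gram (0 would be CBOW)
                          size=100,     # vector dimensions; renamed vector_size in gensim 4.x
                          min_count=1,  # keep every word, even if it appears only once
                          window=10,    # maximum distance between target and context word
                          hs=1,         # use hierarchical softmax
                          negative=0)   # disable negative sampling
model.save(sys.argv[2])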

$ python train.py rojinto_umi_wakati.txt rojinto_umi.model
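
The `sg=1` and `window` settings in train.py decide which word pairs the model learns from. The following is a simplified, standard-library-only sketch of skip-gram pair selection. It is not gensim's implementation: gensim additionally shrinks the window at random for each position and then trains with hierarchical softmax, both of which are omitted here. The function name `skipgram_pairs` is hypothetical.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs within `window` tokens of each center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


if __name__ == "__main__":
    for pair in skipgram_pairs(["老人", "は", "海", "へ", "出る"], window=1):
        print(pair)
```

With `window=10`, as in train.py, each word is paired with up to ten neighbours on either side within the same line.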

Checking the model

Using the model, extract the words with the highest cosine similarity to a given input word, together with their scores.

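# similars.py
# -*- coding: utf-8 -*-

from gensim.models import word2vec
import sys

model = word2vec.Word2Vec.load(sys.argv[1])
# Query through model.wv: calling most_similar on the model itself is
# deprecated in gensim 3.x and was removed in gensim 4.0.
results = model.wv.most_similar(positive=[sys.argv[2]], topn=10)

# Print each similar word with its cosine similarity.
for word, similarity in results:
    print(word, '\t', similarity)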
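
The score printed next to each word is a cosine similarity between two word vectors. As a reminder of what that number means, here is a minimal standard-library sketch. The function name `cosine_similarity` is hypothetical; gensim computes the same quantity internally with NumPy.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Values close to 1 mean the two vectors point in nearly the same direction, and values close to 0 mean they are nearly orthogonal.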

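Extracting the words with the highest cosine similarity to "仲間" (companion).
A word that does not occur in the segmented text "rojinto_umi_wakati.txt", and is therefore missing from the vocabulary, causes an error.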

$ python similars.py rojinto_umi.model 仲間

幸運 	 0.956790566444397
素晴らしい 	 0.9093707799911499
でも 	 0.9033694267272949
礼 	 0.903014600276947
欲しい 	 0.9010084271430969
のに 	 0.8994028568267822
やり方 	 0.8970744609832764
ぞ 	 0.8941289782524109
ん 	 0.8937863111495972
はず 	 0.8931283354759216

Extracting the words with the highest cosine similarity to "褐色" (brown).

$ python similars.py rojinto_umi.model 褐色

軽蔑 	 0.9831672310829163
ウミガメ 	 0.9824434518814087
アカ 	 0.978451132774353
ぴんと 	 0.9747155904769897
親しみ 	 0.9722762107849121
銅 	 0.9676532745361328
汁 	 0.9667260646820068
飛び散る 	 0.966325044631958
拡がる 	 0.9662150144577026
ダイオウ 	 0.9642153382301331
