
Trying Word2Vec on Aozora Bunko data (Dockerfile included)

Posted at 2018-07-16, last updated at 2018-07-16

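Introduction

I had some free time and tried out Word2Vec, so here is a summary.
I used the following article as a reference.

This article does not cover parameter tuning or other fine details.
I have prepared a Dockerfile, so if you just want to set up an environment and try things out, this should be the quickest way to get started.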

Trying out Word2Vec

The code is published on GitHub (masayuki5160/word2vec-test).
The README gives a rough outline; here I go through the steps in a little more detail.

$ git clone git@github.com:masayuki5160/word2vec-test.git
$ cd word2vec-test

# build docker image from Dockerfile
$ docker build -t masayuki5160/python3 .

After cloning the repository from GitHub, build a Docker image from the Dockerfile it contains.
I named the image masayuki5160/python3, but any name you like is fine.

The Dockerfile is shown below.
It installs gensim and janome, and uses wget to download Aozora Bunko data (I chose "Sanshiro").
The Aozora Bunko data is saved under /tmp for now.

Dockerfile

FROM python:3.6

VOLUME /home/word2vec-test

RUN apt-get update && \
    apt-get -y install sudo vim unzip && \
    pip install --upgrade gensim && \
    pip install janome && \
    wget -O /tmp/794_ruby_4237.zip http://www.aozora.gr.jp/cards/000148/files/794_ruby_4237.zip && \
    unzip /tmp/794_ruby_4237.zip -d /tmp
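
The archive name contains `ruby`, which suggests that the unzipped text carries Aozora Bunko markup: ruby readings in 《…》, the ｜ marker in front of a ruby base, and annotator notes in ［＃…］. The following is only a minimal sketch of how such markup could be stripped before tokenizing. The function name `strip_aozora_markup` and the sample string are assumptions made for illustration, not code from the repository.

```python
import re

# Patterns for common Aozora Bunko text markup (assumed to be what needs removing).
RUBY = re.compile(r"《[^》]*》")      # ruby readings, e.g. 三四郎《さんしろう》
NOTE = re.compile(r"［＃[^］]*］")    # annotator notes, e.g. ［＃改ページ］
RUBY_START = "｜"                    # marks where a ruby base begins


def strip_aozora_markup(text: str) -> str:
    """Return text with ruby readings, annotator notes, and ruby markers removed."""
    text = RUBY.sub("", text)
    text = NOTE.sub("", text)
    return text.replace(RUBY_START, "")


# Hypothetical sample line, not taken from the downloaded file.
sample = "｜三四郎《さんしろう》は東京《とうきょう》へ行った［＃改ページ］。"
print(strip_aozora_markup(sample))  # 三四郎は東京へ行った。
```

Aozora Bunko text files are distributed in Shift_JIS, so when reading the real file it would typically be opened with `encoding="shift_jis"` before passing its contents to a function like this.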

Once the build succeeds, start a container.
I want to use the Python scripts from the cloned repository, so I also map the current directory to a directory inside the container here.

# run docker image(masayuki5160/python3)
$ docker run -v $(pwd):/home/word2vec-test/ -p 5000:5000 -it masayuki5160/python3 /bin/bash

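That completes the setup. Inside the container, move to /home/word2vec-test (where the repository is mounted) and create the model.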

# generate model, and save it. 
$ python createModel.py
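
createModel.py itself is not shown in this article. To give a rough idea of what a script like it might do with the tools installed by the Dockerfile (janome for tokenizing, gensim for training), here is a minimal sketch. The input path, the use of base forms, the part-of-speech filter, and `min_count=1` are all assumptions, not the repository's actual settings. Only the output name `sanshiro.model` comes from the article.

```python
import re


def split_sentences(text: str) -> list:
    """Split raw text into non-empty sentences on the Japanese full stop and newlines."""
    return [s.strip() for s in re.split(r"[。\n]", text) if s.strip()]


def tokenize(sentences):
    """Turn each sentence into a list of base-form words using janome."""
    from janome.tokenizer import Tokenizer  # third-party; installed by the Dockerfile

    tokenizer = Tokenizer()
    keep = ("名詞", "動詞", "形容詞")  # assumed filter: nouns, verbs, adjectives
    corpus = []
    for sentence in sentences:
        words = [
            token.base_form
            for token in tokenizer.tokenize(sentence)
            if token.part_of_speech.split(",")[0] in keep
        ]
        if words:
            corpus.append(words)
    return corpus


def train_and_save(text_path: str, model_path: str = "sanshiro.model") -> None:
    """Read a Shift_JIS text file, train Word2Vec with default settings, and save it."""
    from gensim.models import Word2Vec  # third-party; installed by the Dockerfile

    with open(text_path, encoding="shift_jis") as f:
        corpus = tokenize(split_sentences(f.read()))
    model = Word2Vec(corpus, min_count=1)  # min_count=1 is an assumption
    model.save(model_path)
```

Calling `train_and_save` with the path of the unzipped text file would write `sanshiro.model` to the current directory, which is the file name that useModel.py loads.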

useModel.py loads the model created above.

useModel.py

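import gensim

model = gensim.models.Word2Vec.load('sanshiro.model')

# Get the vector for a word (word vectors live on model.wv)
print(model.wv["世間"])

# Find words similar to a given word
print(model.wv.most_similar("日本"))

# Add and subtract word vectors, then find words similar to the result
print(model.wv.most_similar(positive=['東京'], negative=['人']))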

Running useModel.py produces output like the following.
(Sorry that it is not very tidy.)

$ python useModel.py 
[ 0.59944725 -0.25200576  0.3852602   0.04730325  0.13314192  0.54461247
  0.0471745   0.391928   -0.7717757   0.2524366  -0.21390346 -0.5256151
 -0.34063724  0.1979729   0.701523    0.64141107  0.73465765  0.00901113
  0.96841425 -0.12650324  0.150137    0.26460552 -0.11457528 -0.6921538
 -0.50044954  0.2765003   0.27651525 -0.18274967 -0.0200913   0.10246506
  0.15906459 -1.0402849   0.5950975  -0.60760593 -0.8825854  -0.04164106
  0.23078144  0.11897992 -0.45213798  0.7351423   0.40106112  0.541665
  0.5919396   0.11623159 -0.36421928  1.0413207   0.3224229   0.7492864
  0.15403754 -0.39072967 -0.16116981 -0.15397906  0.21622561  0.08972786
  0.02676369  0.8212711   0.13864756 -0.26221758  0.2627808   0.30080497
  0.3675749  -0.46718222 -0.44438344  0.4253955  -0.13240519 -0.1371136
 -0.20774814  0.37195185  0.42730212 -0.7800508  -0.2636935  -0.33571747
  0.28241587 -0.30477676 -0.09428921 -0.27701733 -0.25254947  0.23744197
  0.16743279 -0.23830141  0.5384211   0.7084161  -0.6604415  -0.4665711
 -0.37388822  0.83966595 -0.37005633  0.42384714  0.22337277  0.08281619
 -0.7998488  -0.22002839 -0.05608908 -0.33207807  0.48652875  0.2967433
  0.3217622  -0.42405376  0.15407468  0.03133222]
/usr/local/lib/python3.6/site-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  if np.issubdtype(vec.dtype, np.int):
[('界', 0.6109025478363037), ('堪える', 0.5765180587768555), ('社会', 0.571060299873352), ('圧迫', 0.5427102446556091), ('決心', 0.5345108509063721), ('有す', 0.5281721949577332), ('公', 0.5189093947410583), ('人生', 0.5097593069076538), ('感謝', 0.4961530268192291), ('説く', 0.4923514723777771)]
[('度胸', 0.41439834237098694), ('心持ち', 0.3946040868759155), ('要領', 0.3866702914237976), ('恐れ', 0.34306836128234863), ('違う', 0.32507050037384033), ('感じ', 0.3220743238925934), ('はじめ', 0.3191494047641754), ('かた', 0.30798274278640747), ('性質', 0.2992369830608368), ('いなか者', 0.29894697666168213)]
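
The number next to each word is a cosine similarity between word vectors, which is what gensim's `most_similar` reports. To make that concrete, here is a small sketch that computes it by hand. The three-dimensional vectors are made up for illustration and are not taken from the trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, not from sanshiro.model)
v1 = [1.0, 2.0, 0.0]
v2 = [2.0, 4.0, 0.0]   # same direction as v1
v3 = [-2.0, 1.0, 0.0]  # orthogonal to v1

print(cosine_similarity(v1, v2))  # close to 1.0
print(cosine_similarity(v1, v3))  # 0.0
```

A value near 1 means the two vectors point in almost the same direction, and a value near 0 means they are unrelated. Scores of roughly 0.3 to 0.6, as in the output above, indicate only moderate similarity.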

Getting words similar to a given word via a web API

While I was at it, I also built a simple web API with Flask.

webapi.py

# -*- coding: utf-8 -*-
from flask import Flask, jsonify
import gensim

model = gensim.models.Word2Vec.load('sanshiro.model')
app = Flask(__name__)
# Return non-ASCII characters in JSON responses as-is instead of \u escapes
app.config['JSON_AS_ASCII'] = False

@app.route('/')
def index():
    return 'Hello flask'

# Return words similar to the given word
@app.route('/similar/<string:word>')
def getSimilar(word):
    similarWords = model.wv.most_similar(word)
    words = []
    for similarWord, score in similarWords:
        words.append({'word': similarWord, 'value': score})
    return jsonify({'words': words})

# Response returned on errors (404)
@app.errorhandler(404)
def not_found(error):
    return jsonify({'error': 'Not found'}), 404

# Start the server
if __name__ == '__main__':
    app.run(host="0.0.0.0", port=5000, debug=True)
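
One thing to be aware of: gensim raises `KeyError` when the requested word is not in the model's vocabulary, so the `/similar/<word>` route above would answer with a server error for such words. A minimal, framework-independent sketch of handling that case follows. `similar_payload`, `fake_lookup`, and the error message are hypothetical names introduced here; in the Flask route one would return `jsonify(body), status`.

```python
def similar_payload(most_similar, word):
    """Build a (body, status) pair for the similar-words endpoint.

    `most_similar` is any callable that behaves like gensim's
    most_similar: it returns (word, score) pairs or raises KeyError.
    """
    try:
        pairs = most_similar(word)
    except KeyError:
        return {"error": "word not in vocabulary"}, 404
    words = [{"word": w, "value": score} for w, score in pairs]
    return {"words": words}, 200


# Stand-in for the model lookup so the sketch runs without gensim (hypothetical data).
def fake_lookup(word):
    table = {"男": [("女", 0.3), ("連中", 0.2)]}
    return table[word]


print(similar_payload(fake_lookup, "男"))
print(similar_payload(fake_lookup, "未知語"))
```

Keeping the payload logic separate from the route makes it easy to check both the normal and the unknown-word case without starting the server or loading the model.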

Start the Flask server as follows.

# start api server
$ python webapi.py

Port 5000 was mapped when the container was started, so words similar to a given word can be fetched from the URL http://localhost:5000/similar/[word].
Below is the result of accessing the Docker container from my local PC with http://localhost:5000/similar/男.

{
  "words": [
    {
      "value": 0.41406938433647156, 
      "word": "じいさん"
    }, 
    {
      "value": 0.35695937275886536, 
      "word": "迷子"
    }, 
    {
      "value": 0.32412219047546387, 
      "word": "コート"
    }, 
    {
      "value": 0.31509068608283997, 
      "word": "女"
    }, 
    {
      "value": 0.31481245160102844, 
      "word": "ただ"
    }, 
    {
      "value": 0.3141239583492279, 
      "word": "あっち"
    }, 
    {
      "value": 0.3029976487159729, 
      "word": "連中"
    }, 
    {
      "value": 0.29525405168533325, 
      "word": "便所"
    }, 
    {
      "value": 0.2852158546447754, 
      "word": "はやす"
    }, 
    {
      "value": 0.274425208568573, 
      "word": "あぐら"
    }
  ]
}
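
When calling the API from a script rather than a browser, the non-ASCII word in the path has to be percent-encoded. The following sketch uses only the standard library. The helper names are assumptions, and the host and port are the ones mapped in the `docker run` command.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:5000"  # port mapped with -p 5000:5000


def similar_url(word: str) -> str:
    """Build the /similar/<word> URL with the word percent-encoded as UTF-8."""
    return "{}/similar/{}".format(BASE_URL, quote(word, safe=""))


def fetch_similar(word: str):
    """Call the API and return a list of (word, value) pairs. Needs the server running."""
    with urlopen(similar_url(word)) as response:
        body = json.loads(response.read().decode("utf-8"))
    return [(item["word"], item["value"]) for item in body["words"]]


print(similar_url("男"))  # http://localhost:5000/similar/%E7%94%B7
```

With the Flask server running in the container, `fetch_similar("男")` should return the same word and value pairs as the JSON shown above.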

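Closing

Just trying it out gave me a rough feel for how it works, so I think it was worthwhile.
I still do not understand Word2Vec or gensim well enough, but I now have a rough grasp of the workflow, so next I will try things such as adding more data and tuning the parameters.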
