
Trying kuromoji + Neologd in an Elasticsearch + Kibana environment using Docker

Last updated at 2017-02-22, posted at 2017-02-19

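Purpose

In "Building an Elasticsearch + Kibana environment using Docker" I set up an environment in which Elasticsearch runs. kuromoji is installed there, but its dictionary has to be maintained by hand, so this time I use mecab-ipadic-neologd instead. Fortunately an Elasticsearch plugin is available for it, so I use that plugin.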

Updating the Docker container

Environment

Elasticsearch 5.1
elasticsearch-analysis-kuromoji-neologd 5.1

Updating the Dockerfile

Add elasticsearch-plugin install org.codelibs:elasticsearch-analysis-kuromoji-neologd:5.1.0 to the Dockerfile from the previous article. The resulting Dockerfile is shown below. Update the container with this Dockerfile.

es/Dockerfile

FROM elasticsearch:5.1

# Install x-pack
RUN elasticsearch-plugin install --batch x-pack

# Install kuromoji
RUN elasticsearch-plugin install analysis-kuromoji

# Install Elasticsearch Analysis Kuromoji Neologd
RUN elasticsearch-plugin install org.codelibs:elasticsearch-analysis-kuromoji-neologd:5.1.0

docker-compose.yml

Nothing has changed since the previous article, but the file is reproduced here for reference.

docker-compose.yml

version: '2'
services:
  elasticsearch0:
    build: es
    volumes:
        - es-data0:/usr/share/elasticsearch/data 
        - ./es/config:/usr/share/elasticsearch/config 
    ports:
        - 9200:9200
    expose:
        - 9300
    environment:
        - NODE_NAME=node0
    hostname: elasticsearch0
    ulimits:
        nofile:
            soft: 65536
            hard: 65536
  kibana:
    build: kibana
    links:
        - elasticsearch0:elasticsearch 
    ports:
        - 5601:5601

volumes:
    es-data0:
        driver: local
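
The container update mentioned above can be sketched as follows. This is a minimal sketch, assuming the commands are run from the directory that holds docker-compose.yml and that the classic `docker-compose` binary is installed. The `rebuild_elasticsearch` function name and the `COMPOSE` variable are illustrative and not part of the original setup.

```shell
# Assumption: run from the directory that contains docker-compose.yml.
# COMPOSE can be overridden, e.g. COMPOSE="docker compose" on newer Docker installs.
COMPOSE="${COMPOSE:-docker-compose}"

rebuild_elasticsearch() {
  # Rebuild the image from es/Dockerfile so the newly added plugin is installed.
  $COMPOSE build elasticsearch0 || return 1
  # Recreate the container from the rebuilt image.
  $COMPOSE up -d elasticsearch0 || return 1
  # List the plugins installed inside the running container.
  $COMPOSE exec elasticsearch0 elasticsearch-plugin list
}
```

Calling `rebuild_elasticsearch` should finish by printing the plugin list, which is a quick way to confirm that the Neologd plugin was installed alongside x-pack and kuromoji.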

Trying Analysis Kuromoji Neologd

Template

curl -XPUT 'http://localhost:9200/_template/items_template?pretty' -d '
{
    "template": "items",
    "settings": {
        "index":{
            "analysis":{
                "tokenizer": {
                    "my_tokenizer": {
                        "type": "kuromoji_neologd_tokenizer",
                        "mode": "normal",
                        "discard_punctuation" : "false",
                        "user_dictionary" : "userdict_ja.txt"
                    }
                },
                "filter": {
                    "synonym_dict": {
                        "type": "synonym",
                        "synonyms_path" : "synonym.txt"
                    }
                },
                "analyzer" : {
                    "default" : {
                        "type": "custom",
                        "tokenizer": "my_tokenizer",
                        "filter": ["synonym_dict"]
                    }
                }
            }
        }
    },
    "mappings" : {
        "items":{
            "properties" : {
                "id" :{ "type" : "keyword" },
                "text" :{ "type" : "keyword" }
            }
        }
    }
}'
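
To confirm that the template was stored, it can be read back with a GET request. This is a minimal sketch, assuming Elasticsearch is reachable on the port published in docker-compose.yml. The `ES_URL` variable and the `show_items_template` function are names introduced here for illustration.

```shell
# Assumption: Elasticsearch is reachable at this URL (port 9200 is published in docker-compose.yml).
ES_URL="${ES_URL:-http://localhost:9200}"

show_items_template() {
  # GET returns the stored template, including the analysis settings and mappings.
  # The URL is quoted so that the shell does not treat "?" as a glob character.
  curl -sS "$ES_URL/_template/items_template?pretty"
}
```

Running `show_items_template` should print the template body that was just registered.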

Settings: settings -> index -> analysis -> tokenizer

Configured values

                    "my_tokenizer": {
                        "type": "kuromoji_neologd_tokenizer",
                        "mode": "normal",
                        "discard_punctuation" : "false",
                        "user_dictionary" : "userdict_ja.txt"
                    }

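Meaning

Setting	Description
Tokenizer name	Set to `my_tokenizer`. It is referenced later by the `analyzer`.
type	Uses the tokenizer provided by `elasticsearch-analysis-kuromoji-neologd`.
mode	Specifies the morphological analysis mode. `normal` is used here so that the mecab-ipadic-neologd results are easy to check. The default is `search`.
discard_punctuation	Controls whether punctuation and symbols are discarded. `true` is the usual choice; here it is deliberately set to `false` so that punctuation and symbols stay in the output and the analysis result is easier to inspect.
user_dictionary	File name of the user dictionary. Place the file under $ES_HOME(/usr/share/elasticsearch)/config. Because `volumes` in the `docker-compose.yml` above maps `./es/config` to `$ES_HOME/config`, create an empty file with `touch ./es/config/userdict_ja.txt`. Without this file, Elasticsearch fails to start.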

Settings: settings -> index -> analysis -> filter

Configured values

                    "synonym_dict": {
                        "type": "synonym",
                        "synonyms_path" : "synonym.txt"
                    }

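Meaning

Setting	Description
Filter name	Set to `synonym_dict`. It is referenced later by the `analyzer`.
type	Specifies `synonym`, the filter type that handles synonyms.
synonyms_path	Specifies the name of the file that defines the synonyms. As with `user_dictionary` for `my_tokenizer`, create an empty file beforehand.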
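
Both placeholder files can be created in one step. This is a minimal sketch, assuming the host directory `./es/config` is the one mounted at `/usr/share/elasticsearch/config` as in the docker-compose.yml above. The `CONFIG_DIR` variable is introduced here for illustration.

```shell
# Assumption: CONFIG_DIR is the host directory mounted at /usr/share/elasticsearch/config.
CONFIG_DIR="${CONFIG_DIR:-./es/config}"

# Create the directory if it does not exist yet.
mkdir -p "$CONFIG_DIR"
# touch creates a file when it is missing and leaves existing contents untouched.
touch "$CONFIG_DIR/userdict_ja.txt" "$CONFIG_DIR/synonym.txt"
# Show both files to confirm they are in place.
ls -l "$CONFIG_DIR/userdict_ja.txt" "$CONFIG_DIR/synonym.txt"
```

Because `touch` does not truncate existing files, the snippet is safe to re-run after entries have been added to either file.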

Settings: settings -> index -> analysis -> analyzer

Configured values

                "analyzer" : {
                    "default" : {
                        "type": "custom",
                        "tokenizer": "my_tokenizer",
                        "filter": ["synonym_dict"]
                    }
                }

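Meaning

Setting	Description
Analyzer name	Naming it `default` defines the default analyzer.
type	`custom` declares that the analyzer is customized.
tokenizer	Uses the `my_tokenizer` tokenizer defined above.
filter	Specifies the filters to apply.

About the filters

In practical use other filters would be needed as well, but the filters are limited here so that the result of the morphological analysis done by kuromoji + Neologd is easy to check.

Setting	Description
synonym_dict	A filter for handling synonyms; it uses the `synonym_dict` defined above. https://www.elastic.co/guide/en/elasticsearch/reference/5.1/analysis-synonym-tokenfilter.html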

Trying the analyzer

Loading data

Index a single document so that the template defined above is applied.

curl -XPUT 'localhost:9200/items/item/1?pretty' -d '
{
  "id" : "item-001",
  "text": "MeCab はオープンソースの形態素解析エンジンであり、自然言語処理の基礎となる形態素解析のデファクトとなるツールです。また各言語用バインディングを使うことで Ruby や Python をはじめ多くのさまざまなプログラミング言語から呼び出して利用することもでき大変便利です。"
}'
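
Index templates are applied when an index is created, so once the first document has created the `items` index, its settings should contain the analysis block from the template. The following is a minimal sketch for checking that, assuming Elasticsearch is reachable on localhost:9200. The `ES_URL` variable and the `show_items_settings` function are names introduced here for illustration.

```shell
# Assumption: Elasticsearch is reachable at this URL (port 9200 is published in docker-compose.yml).
ES_URL="${ES_URL:-http://localhost:9200}"

show_items_settings() {
  # GET the settings of the items index; the tokenizer, filter and analyzer
  # defined in the template should be listed under index.analysis.
  curl -sS "$ES_URL/items/_settings?pretty"
}
```

If the analysis block is missing from the output of `show_items_settings`, the index probably existed before the template was registered and would need to be recreated.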

Trying the mecab-ipadic-neologd sample

Run the analyzer on the sample sentence from https://github.com/neologd/mecab-ipadic-neologd: 10日放送の「中居正広のミになる図書館」（テレビ朝日系）で、SMAPの中居正広が、篠原信一の過去の勘違いを明かす一幕があった。

curl 'localhost:9200/items/_analyze?pretty' --data-binary '{
"explain":"false",
"text":"10日放送の「中居正広のミになる図書館」（テレビ朝日系）で、SMAPの中居正広が、篠原信一の過去の勘違いを 明かす一幕があった。"
}'

Result

{
  "tokens" : [
    {
      "token" : "10日",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "放送",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "の",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "「",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "中居正広のミになる図書館",
      "start_offset" : 7,
      "end_offset" : 19,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "」",
      "start_offset" : 19,
      "end_offset" : 20,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "（",
      "start_offset" : 20,
      "end_offset" : 21,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "テレビ朝日",
      "start_offset" : 21,
      "end_offset" : 26,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "系",
      "start_offset" : 26,
      "end_offset" : 27,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "）",
      "start_offset" : 27,
      "end_offset" : 28,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "で",
      "start_offset" : 28,
      "end_offset" : 29,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "、",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "SMAP",
      "start_offset" : 30,
      "end_offset" : 34,
      "type" : "word",
      "position" : 12
    },
    {
      "token" : "の",
      "start_offset" : 34,
      "end_offset" : 35,
      "type" : "word",
      "position" : 13
    },
    {
      "token" : "中居正広",
      "start_offset" : 35,
      "end_offset" : 39,
      "type" : "word",
      "position" : 14
    },
    {
      "token" : "が",
      "start_offset" : 39,
      "end_offset" : 40,
      "type" : "word",
      "position" : 15
    },
    {
      "token" : "、",
      "start_offset" : 40,
      "end_offset" : 41,
      "type" : "word",
      "position" : 16
    },
    {
      "token" : "篠原信一",
      "start_offset" : 41,
      "end_offset" : 45,
      "type" : "word",
      "position" : 17
    },
    {
      "token" : "の",
      "start_offset" : 45,
      "end_offset" : 46,
      "type" : "word",
      "position" : 18
    },
    {
      "token" : "過去",
      "start_offset" : 46,
      "end_offset" : 48,
      "type" : "word",
      "position" : 19
    },
    {
      "token" : "の",
      "start_offset" : 48,
      "end_offset" : 49,
      "type" : "word",
      "position" : 20
    },
    {
      "token" : "勘違い",
      "start_offset" : 49,
      "end_offset" : 52,
      "type" : "word",
      "position" : 21
    },
    {
      "token" : "を",
      "start_offset" : 52,
      "end_offset" : 53,
      "type" : "word",
      "position" : 22
    },
    {
      "token" : " ",
      "start_offset" : 53,
      "end_offset" : 54,
      "type" : "word",
      "position" : 23
    },
    {
      "token" : "明かす",
      "start_offset" : 54,
      "end_offset" : 57,
      "type" : "word",
      "position" : 24
    },
    {
      "token" : "一幕",
      "start_offset" : 57,
      "end_offset" : 59,
      "type" : "word",
      "position" : 25
    },
    {
      "token" : "が",
      "start_offset" : 59,
      "end_offset" : 60,
      "type" : "word",
      "position" : 26
    },
    {
      "token" : "あっ",
      "start_offset" : 60,
      "end_offset" : 62,
      "type" : "word",
      "position" : 27
    },
    {
      "token" : "た",
      "start_offset" : 62,
      "end_offset" : 63,
      "type" : "word",
      "position" : 28
    },
    {
      "token" : "。",
      "start_offset" : 63,
      "end_offset" : 64,
      "type" : "word",
      "position" : 29
    }
  ]
}

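The segmentation matches the mecab-ipadic-neologd sample (the single-space token at position 23 comes from the space in the request text). To get details such as part of speech and reading, set explain to true.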
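
A request with explain enabled can be sketched as follows. This is a minimal sketch, assuming Elasticsearch is reachable on localhost:9200 and that the text passed in contains no double quotes or backslashes. The `ES_URL` variable and the `analyze_explain` function are names introduced here for illustration. The Content-Type header is accepted by 5.x and becomes mandatory from Elasticsearch 6.0 on.

```shell
# Assumption: Elasticsearch is reachable at this URL (port 9200 is published in docker-compose.yml).
ES_URL="${ES_URL:-http://localhost:9200}"

analyze_explain() {
  # $1: text to analyze. Assumption: it contains no double quotes or backslashes,
  # because it is pasted into the JSON body without any escaping.
  curl -sS "$ES_URL/items/_analyze?pretty" \
    -H 'Content-Type: application/json' \
    --data-binary "{\"explain\": true, \"text\": \"$1\"}"
}
```

For example, `analyze_explain 'SMAPの中居正広'` sends the same kind of request as above but asks for the detailed per-token output.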

References

https://medium.com/hello-elasticsearch/elasticsearch-833a0704e44b
https://www.elastic.co/guide/en/elasticsearch/plugins/5.1/analysis-kuromoji.html (the documents under "Japanese (kuromoji) Analysis Plugin")
