
Setting up Elasticsearch and trying Japanese full-text search

Posted at 2016-01-28

This person is really enthusiastic about it, so I'm writing this up.

Version

elasticsearch 2.1.1

Installation and so on

The server is CentOS 7.
Java is installed with yum install java.
The yum repository is configured in the following file:
/etc/yum.repos.d/elasticsearch.repo

[elasticsearch-2.x]
name=Elasticsearch repository for 2.x packages
baseurl=http://packages.elastic.co/elasticsearch/2.x/centos
gpgcheck=1
gpgkey=http://packages.elastic.co/GPG-KEY-elasticsearch
enabled=1

With this in place, yum install elasticsearch installs the latest 2.x release.
If you want to install it another way, see the official documentation.

Install the plugin needed for Japanese full-text search

cd /usr/share/elasticsearch/
bin/plugin install analysis-kuromoji

Install something like an admin UI (kopf)

bin/plugin install lmenezes/elasticsearch-kopf/v2.1.1
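
+ For kopf, see its repository (lmenezes/elasticsearch-kopf).
+ Check which kopf version matches your Elasticsearch version before installing, and you will avoid trouble.

Check that the plugins are installed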

[root@xxx ~]# ll /usr/share/elasticsearch/plugins/
合計 8
drwxr-xr-x. 2 root root 4096 1月 12 00:08 analysis-kuromoji
drwxr-xr-x. 8 root root 4096 1月 12 00:33 kopf
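
How Elasticsearch works internally

The 'Analysis module' section of "Elasticsearch 日本語で全文検索 その1" (Japanese full-text search with Elasticsearch, part 1) explains this clearly, so refer to it.

1. When text is indexed or searched, the original text first passes through the char filters. Filtering is applied only as needed (it is optional).
2. The tokenizer defines how the text is tokenized (Kuromoji, n-gram, and so on).
3. The token filters then filter the tokenized text. Filtering is applied only as needed (it is optional).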
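
The three stages above can be sketched in plain Python. This is only an illustration of the order in which the stages run, not how Elasticsearch implements them: the function names are made up for this example, a whitespace split stands in for the real tokenizer, and NFKC normalization is used as a rough approximation of the cjk_width filter.

```python
import re
import unicodedata


def strip_html(text):
    """Char filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text):
    """Tokenizer: split on whitespace (a stand-in for kuromoji or n-gram)."""
    return text.split()


def fold_width(tokens):
    """Token filter: NFKC folds full-width ASCII to half-width (approximation of cjk_width)."""
    return [unicodedata.normalize("NFKC", token) for token in tokens]


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run char filters, then the tokenizer, then token filters, in that order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<b>Ｅｌａｓｔｉｃ Search</b>",
    [strip_html],
    whitespace_tokenizer,
    [fold_width, lowercase],
)
print(tokens)  # ['elastic', 'search']
```

Passing empty lists for the char filters or token filters skips those stages, which mirrors the "optional" note in the list.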

Elasticsearch config settings (index section only)

/etc/elasticsearch/elasticsearch.yml

# ---------------------------------- Index -----------------------------------
index :
    analysis :
        analyzer :
            ja :
                type : custom
                tokenizer : ja_tokenizer
                char_filter : [
                    html_strip,
                    kuromoji_iteration_mark
                ]
                filter : [
                    lowercase,
                    cjk_width,
                    katakana_stemmer,
                    kuromoji_part_of_speech
                ]
            ja_ngram :
                type : custom
                tokenizer : ngram_ja_tokenizer
                char_filter : [html_strip]
                filter : [
                    cjk_width,
                    lowercase
                ]
        tokenizer :
            ja_tokenizer :
                type : kuromoji_tokenizer
                mode : search
                user_dictionary : /etc/elasticsearch/userdict_ja.txt
            ngram_ja_tokenizer :
                type : nGram
                min_gram : 2
                max_gram : 3
                token_chars : [letter, digit]
        filter :
            katakana_stemmer :
                type : kuromoji_stemmer

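I wanted both a kuromoji-based index and an n-gram-based index, so I defined entries for each under "analyzer" and "tokenizer".
As an experiment, I also set a custom dictionary with user_dictionary : /etc/elasticsearch/userdict_ja.txt.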
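
To get a feel for what the ngram_ja_tokenizer settings (min_gram 2, max_gram 3, token_chars letter and digit) mean, here is a rough Python sketch. It is not the Elasticsearch implementation: the function name is hypothetical, Python's isalpha/isdigit are assumed to be close enough to the letter and digit character classes, and the order of the emitted tokens may differ from what Elasticsearch produces.

```python
def ngram_tokenize(text, min_gram=2, max_gram=3):
    """Split text into runs of letters/digits, then emit every n-gram of each run."""
    runs, current = [], []
    for ch in text:
        if ch.isalpha() or ch.isdigit():
            current.append(ch)
        elif current:
            # Any other character (space, punctuation) ends the current run.
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))

    tokens = []
    for run in runs:
        for start in range(len(run)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(run):
                    tokens.append(run[start:start + size])
    return tokens


print(ngram_tokenize("こんにちわ"))
# ['こん', 'こんに', 'んに', 'んにち', 'にち', 'にちわ', 'ちわ']
print(ngram_tokenize("ab 12"))
# ['ab', '12']
```

Because every 2- and 3-character substring becomes a term, partial words can be matched, at the cost of a larger index than a morphological tokenizer such as kuromoji would produce.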

Index template settings

curl -XPUT localhost:9200/_template/projects03 -d '

{
  "order": 0,
  "template": "projects03-*",
  "settings": {
    "index": {
      "number_of_shards": "1",
      "number_of_replicas": "0"
    }
  },
  "mappings": {
    "project": {
      "_source": {
        "enabled": true
      },
      "_all": {
        "analyzer": "ja",
        "enabled": true
      },
      "properties": {
        "update_time": {
          "format": "YYYY-MM-dd HH:mm:ss",
          "type": "date"
        },
        "project_id": {
          "index": "not_analyzed",
          "type": "string"
        },
        "detail": {
          "analyzer": "ja",
          "type": "string"
        },
        "suggest": {
          "search_analyzer": "ja",
          "analyzer": "ja",
          "type": "completion"
        },
        "detail_ngram": {
          "analyzer": "ja_ngram",
          "type": "string"
        },
        "title": {
          "analyzer": "ja",
          "type": "string"
        },
        "title_ngram": {
          "analyzer": "ja_ngram",
          "type": "string"
        }
      }
    }
  },
  "aliases": {

  }
}'

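You can also register the template from the kopf admin UI.
Since this was a test, number_of_shards and number_of_replicas are set to their minimum values.
I also wanted to try the suggest feature, so I added a field for it.
I referred to many other people's settings.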
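
The "template" value "projects03-*" decides which newly created indices pick up these settings and mappings. As a rough illustration, the wildcard match can be imitated with Python's fnmatch module. This is an approximation only: fnmatch also understands ? and [] patterns, which is more than the simple * wildcard used here, and the helper name is made up for this example.

```python
from fnmatch import fnmatchcase

TEMPLATE_PATTERN = "projects03-*"  # the "template" value from the request above


def template_applies(index_name, pattern=TEMPLATE_PATTERN):
    """Return True when the index name matches the template's wildcard pattern."""
    return fnmatchcase(index_name, pattern)


for name in ["projects03-20160111", "projects04-20160111"]:
    print(name, template_applies(name))
# projects03-20160111 True
# projects04-20160111 False
```

This is why an index with a dated name such as projects03-20160111 gets the ja and ja_ngram field mappings without any explicit mapping call.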

The Elasticsearch configuration is done, so apply it

[root@xxx ~]# /etc/init.d/elasticsearch restart
Restarting elasticsearch (via systemctl):                  [  OK  ]

Load some data

Load the data with curl, from a shell script or a simple program.

curl -X POST http://localhost:9200/projects03-20160111/project/<id>  -d '
{
    "project_id": 1,
    "title" : "川島さんこんにちわ",
    "title_ngram" : "川島さんこんにちわ",
    "detail" : "内容内容内容内容",
    "detail_ngram" : "内容内容内容内容",
    "update_time" : "2016-01-28 22:22:22"
}
'

If you put a unique number in the <id> part of the URL, it is stored as _id, which is efficient.
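
As one possible "simple program", the same request can be built with Python's standard library instead of curl. This is a minimal sketch under assumptions: the helper names are hypothetical, the node is assumed to listen on localhost:9200 as in the curl example, and the same text is copied into both the kuromoji and the n-gram fields.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node, as in the curl example
INDEX = "projects03-20160111"
DOC_TYPE = "project"


def build_index_request(doc_id, title, detail, update_time):
    """Build a POST request that indexes one document under the given id."""
    body = {
        "project_id": doc_id,
        "title": title,
        "title_ngram": title,    # same text, analyzed with ja_ngram
        "detail": detail,
        "detail_ngram": detail,  # same text, analyzed with ja_ngram
        "update_time": update_time,
    }
    url = "{}/{}/{}/{}".format(BASE_URL, INDEX, DOC_TYPE, doc_id)
    data = json.dumps(body, ensure_ascii=False).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def index_documents(rows):
    """Send one request per row; this part needs a running Elasticsearch node."""
    for doc_id, row in enumerate(rows, start=1):
        request = build_index_request(doc_id, *row)
        with urllib.request.urlopen(request) as response:
            print(doc_id, response.status)
```

With a node running, calling index_documents([("川島さんこんにちわ", "内容内容内容内容", "2016-01-28 22:22:22")]) would send the same document as the curl command, using 1 as the id.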

Now let's search

Searching the field indexed with kuromoji

[root@xxx ~]# curl -XGET 'localhost:9200/projects03-20160111/project/_search?pretty' -d'
> {
>  "query":{"match":{"title":"こんに"}}
> }'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

Searching for 「こんに」 returns no hits.

curl -XGET 'localhost:9200/projects03-20160111/project/_search?pretty' -d'
> {
>  "query":{"match":{"title":"こんにちわ"}}
> }'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.15342641,
    "hits" : [ {
      "_index" : "projects03-20160111",
      "_type" : "project",
      "_id" : "1",
      "_score" : 0.15342641,
      "_source":
{
    "project_id": 1,
    "title" : "川島さんこんにちわ",
    "title_ngram" : "川島さんこんにちわ",
    "detail" : "内容内容内容内容",
    "detail_ngram" : "内容内容内容内容",
    "update_time" : "2016-01-28 22:22:22"
}

    } ]
  }
}

Searching for 「こんにちわ」 returns a hit.

The reason can be seen in the analyzer output:

curl -XGET 'localhost:9200/projects03-20160111/_analyze?analyzer=ja&pretty' -d 'こんにちわ'
{
  "tokens" : [ {
    "token" : "こんにちわ",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "word",
    "position" : 0
  } ]
}

The text was indexed as the single term 「こんにちわ」, so only that whole term matches.

Searching the field indexed with n-gram

curl -XGET 'localhost:9200/projects03-20160111/project/_search?pretty' -d'
> {
>  "query":{"match":{"title_ngram":"こんに"}}
> }' 
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.13287117,
    "hits" : [ {
      "_index" : "projects03-20160111",
      "_type" : "project",
      "_id" : "1",
      "_score" : 0.13287117,
      "_source":
{
    "project_id": 1,
    "title" : "川島さんこんにちわ",
    "title_ngram" : "川島さんこんにちわ",
    "detail" : "内容内容内容内容",
    "detail_ngram" : "内容内容内容内容",
    "update_time" : "2016-01-28 22:22:22"
}

    } ]
  }
}

Because this field is indexed with n-grams, 「こんに」 returns a hit.
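
The difference between the two fields comes down to which terms are in the index. The following Python sketch imitates that with sets. It makes several simplifications that are assumptions, not Elasticsearch behavior: a match is modeled as "at least one query term is also an indexed term", the only kuromoji term used is the one shown in the _analyze output, and each query against the title field is treated as a single term.

```python
def grams(text, min_gram=2, max_gram=3):
    """Return the set of all substrings of length min_gram..max_gram."""
    return {
        text[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(text) - n + 1)
    }


def matches(query_terms, indexed_terms):
    """Simplified match: true when at least one query term is in the index."""
    return bool(set(query_terms) & set(indexed_terms))


title = "川島さんこんにちわ"

# title_ngram field: both the title and the query are broken into n-grams.
ngram_hit = matches(grams("こんに"), grams(title))

# title field: assumption, only the term seen in the _analyze output is modeled.
kuromoji_terms = {"こんにちわ"}
partial_hit = matches({"こんに"}, kuromoji_terms)
full_hit = matches({"こんにちわ"}, kuromoji_terms)

print(ngram_hit, partial_hit, full_hit)  # True False True
```

The three results line up with the three searches above: the n-gram field matches the fragment, while the kuromoji field matches only the whole word.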

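Notes, pitfalls, and impressions

+ CentOS 7 uses firewalld instead of iptables.
+ Elasticsearch is not exposed externally by default (2.x binds only to localhost), so log in to the server itself and call the API with curl, or first make it reachable from the same network.
+ kopf apparently has to be installed on the server where Elasticsearch runs, so putting only the admin UI on a separate server does not seem to be possible.
+ The options(?) already built into kuromoji seem to cover most needs.
+ _source can be dropped without any problem if your requirements allow it.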
