
ElasticSearch Sudachi Windows + Python

  • Work notes for Windows
  • Covers everything up to using ElasticSearch as a morphological analysis API server

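Installing Sudachi (Elasticsearch 5.6.x)

I installed it and studied it based on the author's information

Download or clone it to a suitable location

Get it from Git

Install Maven in order to run mvn package

I referred to the information here

Creating Sudachi.zip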

D:\>mvn package

Installing Sudachi.zip

D:\elasticsearch-5.6.5\bin>elasticsearch-plugin install file:/D:\elasticsearch-sudachi-develop\target\releases\analysis-sudachi-1.0.0-SNAPSHOT.zip
-> Downloading file:/D:\elasticsearch-sudachi-develop\target\releases\analysis-sudachi-1.0.0-SNAPSHOT.zip
[=================================================] 100%
-> Installed analysis-sudachi

Verifying Sudachi.zip

D:\elasticsearch-5.6.5\bin>elasticsearch-plugin list
analysis-sudachi

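Adding the Sudachi dictionary

Download it based on the author's information

Create the following folder and put the dictionary in it
D:\elasticsearch-5.6.5\config\sudachi_tokenizer\system_core.dic

  • I downloaded dictionary_full, but renamed the file to the name above
  • Note: I did this because I did not know how to change which DIC file is used...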
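
The copy-and-rename step above can also be scripted. The following is only a minimal sketch: the function name `install_dictionary` and its parameters are hypothetical, and the name of the downloaded file is left to the caller because it is not recorded in this memo.

```python
import shutil
from pathlib import Path


def install_dictionary(downloaded_dic, es_home, target_name="system_core.dic"):
    """Copy a downloaded Sudachi dictionary into <es_home>/config/sudachi_tokenizer.

    `downloaded_dic` is whatever file you downloaded (hypothetical parameter);
    it is stored under `target_name`, matching the rename described above.
    """
    target_dir = Path(es_home) / "config" / "sudachi_tokenizer"
    # Create the folder if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / target_name
    shutil.copyfile(str(downloaded_dic), str(target))
    return target
```

For the layout in this memo, `es_home` would be `r"D:\elasticsearch-5.6.5"`.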

Starting elasticsearch


D:\elasticsearch-5.6.5>bin\elasticsearch

Check the plugin with Postman or Fiddler

GET 'http://localhost:9200/_nodes/plugins?pretty'

      "plugins" : [
        {
          "name" : "analysis-sudachi",
          "version" : "1.0.0-SNAPSHOT",
          "description" : "The Japanese (Sudachi) Analysis plugin integrates Lucene Sudachi analysis module into elasticsearch.",
          "classname" : "com.worksap.nlp.elasticsearch.sudachi.plugin.AnalysisSudachiPlugin",
          "has_native_controller" : false
        }
      ],
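
The same check can be done from Python instead of Postman or Fiddler. This is a rough sketch using only the standard library; the helper names `fetch_nodes_info` and `installed_plugin_names` are made up for illustration, and it assumes the response nests each node's plugin list under `nodes.<node id>.plugins`.

```python
import json
from urllib.request import urlopen


def fetch_nodes_info(base_url="http://localhost:9200"):
    """GET /_nodes/plugins and return the parsed JSON (needs a running node)."""
    with urlopen(base_url + "/_nodes/plugins") as resp:
        return json.loads(resp.read().decode("utf-8"))


def installed_plugin_names(nodes_info):
    """Collect the plugin names reported by every node in the response."""
    names = set()
    for node in nodes_info.get("nodes", {}).values():
        for plugin in node.get("plugins", []):
            names.add(plugin["name"])
    return names


if __name__ == "__main__":
    info = fetch_nodes_info()
    print("analysis-sudachi" in installed_plugin_names(info))
```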

Creating the index

PUT http://localhost:9200/sudachi_test
{
    "settings": {
        "index": {
            "analysis": {
                "tokenizer": {
                    "sudachi_tokenizer": {
                        "type": "sudachi_tokenizer",
                        "mode": "search"
                    }
                },
                "analyzer": {
                    "sudachi_analyzer": {
                        "filter": [],
                        "tokenizer": "sudachi_tokenizer",
                        "type": "custom"
                    }
                }
            }
        }
    }
}
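
If you prefer to create the index from Python, the request above might look roughly like this. It is a sketch, not the plugin author's method: `build_sudachi_index_body` and `create_sudachi_index` are hypothetical helpers, and `es` is assumed to be an `Elasticsearch` client from the `elasticsearch` package whose `indices.create` accepts `index` and `body`, as in the 5.x-era client.

```python
def build_sudachi_index_body(mode="search"):
    """Return the same settings body as the PUT request above."""
    return {
        "settings": {
            "index": {
                "analysis": {
                    "tokenizer": {
                        "sudachi_tokenizer": {
                            "type": "sudachi_tokenizer",
                            "mode": mode,
                        }
                    },
                    "analyzer": {
                        "sudachi_analyzer": {
                            "filter": [],
                            "tokenizer": "sudachi_tokenizer",
                            "type": "custom",
                        }
                    },
                }
            }
        }
    }


def create_sudachi_index(es, index_name="sudachi_test"):
    """Create the index through an existing client object (assumed API)."""
    return es.indices.create(index=index_name, body=build_sudachi_index_body())
```

Usage would be along the lines of `create_sudachi_index(Elasticsearch(hosts=[{"host": "localhost", "port": 9200}]))`.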

I did not really understand how to specify the folder, so I modified the tokenizer part instead.
If you know a better way, please let me know.

Installing the elasticsearch client (conda)

D:\>conda install elasticsearch

Getting results with Python

els.py
from elasticsearch import Elasticsearch


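class ESAnalyzer(object):
    """Callable wrapper around the Elasticsearch _analyze API."""

    def __init__(self, host="localhost", port=9200, index=None, analyzer=None):
        if index is None:
            index = "test"
        if analyzer is None:
            # The original reassigned `index` here by mistake.
            # Falling back to the built-in "standard" analyzer is an assumption.
            analyzer = "standard"

        self.es = Elasticsearch(hosts=[{"host": host, "port": port}], send_get_body_as="POST")
        self.index = index
        self.analyzer = analyzer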

    def __call__(self, text):
        if not text:
            return []

        data = self.es.indices.analyze(index=self.index,
                                       body={"analyzer": self.analyzer, "text": text})
        # Return only the token strings, ordered by their position in the text.
        ordered = sorted(data["tokens"], key=lambda token: token["position"])
        return [token["token"] for token in ordered]

def main():
    E_HOST = "localhost"
    E_PORT = 9200
    E_INDEX = "sudachi_test"
    E_ANALYZER = "sudachi_analyzer"
    analyzer = ESAnalyzer(host=E_HOST, port=E_PORT, index=E_INDEX, analyzer=E_ANALYZER)

    # Another sample input to try: "医療品安全管理責任者"
    text = "打込む"
    print(analyzer(text))

if __name__ == "__main__":
    main()
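
To see what the post-processing in `__call__` does without a running server, the ordering step can be tried on a hand-written response. This is only an illustrative sketch: `tokens_in_order` is a hypothetical helper, and the sample below is made-up data in the shape of an `_analyze` response, not real Sudachi output.

```python
def tokens_in_order(analyze_response):
    """Return the token strings from an _analyze response, sorted by position."""
    ordered = sorted(analyze_response["tokens"], key=lambda token: token["position"])
    return [token["token"] for token in ordered]


if __name__ == "__main__":
    # Hand-written sample, deliberately out of order.
    sample = {
        "tokens": [
            {"token": "b", "position": 1},
            {"token": "a", "position": 0},
            {"token": "c", "position": 2},
        ]
    }
    print(tokens_in_order(sample))
```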