ElasticSearch Sudachi Windows + Python

Posted at 2017-12-14

Working notes for Windows.
They cover the steps up to using Elasticsearch as a morphological-analysis API server.

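Installing Sudachi (Elasticsearch 5.6.x)

I installed and studied the plugin based on the information from its authors.

Download or clone the repository to a convenient location.

Get it from Git

Install Maven in order to run mvn package

I used the information here as a reference.

Building Sudachi.zip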

D:\>mvn package

Installing Sudachi.zip

D:\elasticsearch-5.6.5\bin>elasticsearch-plugin install file:/D:\elasticsearch-sudachi-develop\target\releases\analysis-sudachi-1.0.0-SNAPSHOT.zip
-> Downloading file:/D:\elasticsearch-sudachi-develop\target\releases\analysis-sudachi-1.0.0-SNAPSHOT.zip
[=================================================] 100%
-> Installed analysis-sudachi
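
The install command above takes a file: URL that points at the built zip. As a minimal sketch, the fully qualified file:/// form of the same Windows path can be generated with Python's standard pathlib. The helper name plugin_file_uri is my own and is not part of Elasticsearch or Sudachi.

```python
from pathlib import PureWindowsPath


def plugin_file_uri(zip_path: str) -> str:
    """Return a file: URI for a Windows path to the plugin zip (hypothetical helper)."""
    # PureWindowsPath understands drive letters and backslashes on any OS.
    return PureWindowsPath(zip_path).as_uri()


zip_path = r"D:\elasticsearch-sudachi-develop\target\releases\analysis-sudachi-1.0.0-SNAPSHOT.zip"
print(plugin_file_uri(zip_path))
```

The printed URI can be passed to elasticsearch-plugin install in place of the hand-written file:/D:\... form.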

Verifying the Sudachi.zip installation

D:\elasticsearch-5.6.5\bin>elasticsearch-plugin list
analysis-sudachi

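Adding the Sudachi dictionary

Download it based on the information from the authors.

Create the following folder and place the dictionary in it: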
D:\elasticsearch-5.6.5\config\sudachi_tokenizer\system_core.dic

I downloaded dictionary_full, but renamed the file to the name above.
Note: I did this because I could not figure out how to change which DIC file is used.
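
The copy-and-rename step above can also be scripted. This is a rough sketch using only the standard library. The function name install_dictionary and the download path in the commented usage line are assumptions for illustration; only the target folder and file name come from the notes above.

```python
import shutil
from pathlib import Path


def install_dictionary(downloaded_dic, es_config_dir):
    """Copy a downloaded dictionary to <config>/sudachi_tokenizer/system_core.dic."""
    target_dir = Path(es_config_dir) / "sudachi_tokenizer"
    # Create the folder if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "system_core.dic"
    # Copy rather than move, so the original download is kept.
    shutil.copyfile(downloaded_dic, target)
    return target


# Example call (the source path is hypothetical; use wherever the file was saved):
# install_dictionary(r"D:\downloads\full.dic", r"D:\elasticsearch-5.6.5\config")
```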

Starting elasticsearch


D:\elasticsearch-5.6.5>bin\elasticsearch

Send the following request with Postman or Fiddler:

GET 'http://localhost:9200/_nodes/plugins?pretty'

      "plugins" : [
        {
          "name" : "analysis-sudachi",
          "version" : "1.0.0-SNAPSHOT",
          "description" : "The Japanese (Sudachi) Analysis plugin integrates Lucene Sudachi analysis module into elasticsearch.",
          "classname" : "com.worksap.nlp.elasticsearch.sudachi.plugin.AnalysisSudachiPlugin",
          "has_native_controller" : false
        }
      ],
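
The same check can be done from Python instead of Postman or Fiddler. This is a sketch that assumes the node is reachable at http://localhost:9200. The function names are my own, and the node id used in the test data is made up.

```python
import json
from urllib.request import urlopen


def plugins_by_node(nodes_info):
    """Map each node id to the plugin names listed in a _nodes/plugins response."""
    return {
        node_id: [plugin["name"] for plugin in node.get("plugins", [])]
        for node_id, node in nodes_info.get("nodes", {}).items()
    }


def has_plugin(nodes_info, name="analysis-sudachi"):
    """Return True if at least one node is present and every node reports the plugin."""
    per_node = plugins_by_node(nodes_info)
    return bool(per_node) and all(name in names for names in per_node.values())


def fetch_nodes_info(base_url="http://localhost:9200"):
    """Fetch _nodes/plugins from a running node (base_url is an assumption)."""
    with urlopen(base_url + "/_nodes/plugins") as resp:
        return json.load(resp)


# Against a running node:
# print(has_plugin(fetch_nodes_info()))
```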

Creating the index

PUT http://localhost:9200/sudachi_test
{
    "settings": {
        "index": {
            "analysis": {
                "tokenizer": {
                    "sudachi_tokenizer": {
                        "type": "sudachi_tokenizer",
                        "mode": "search"
                    }
                },
                "analyzer": {
                    "sudachi_analyzer": {
                        "filter": [],
                        "tokenizer": "sudachi_tokenizer",
                        "type": "custom"
                    }
                }
            }
        }
    }
}
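
For reference, the same request can be built with Python's standard library rather than a REST client. This is only a sketch: the function names are hypothetical, and the settings are copied from the body above.

```python
import json
from urllib.request import Request, urlopen


def sudachi_index_settings(mode="search"):
    """Return the index settings shown above as a Python dict."""
    return {
        "settings": {"index": {"analysis": {
            "tokenizer": {
                "sudachi_tokenizer": {"type": "sudachi_tokenizer", "mode": mode}
            },
            "analyzer": {
                "sudachi_analyzer": {
                    "filter": [],
                    "tokenizer": "sudachi_tokenizer",
                    "type": "custom",
                }
            },
        }}}
    }


def build_create_index_request(index, base_url="http://localhost:9200", mode="search"):
    """Build (but do not send) the PUT request that creates the index."""
    body = json.dumps(sudachi_index_settings(mode)).encode("utf-8")
    return Request(
        url=f"{base_url}/{index}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# To actually create the index against a running node:
# with urlopen(build_create_index_request("sudachi_test")) as resp:
#     print(json.load(resp))
```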

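I did not really understand how to specify the dictionary folder, so I modified the tokenizer part of the settings instead.
If there is a better way, please let me know.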

Installing the elasticsearch Python client (conda)

D:\>conda install elasticsearch

Getting the results from Python

els.py

from elasticsearch import Elasticsearch


class ESAnalyzer(object):
    """Callable wrapper around the Elasticsearch _analyze API."""

    def __init__(self, host="localhost", port=9200, index=None, analyzer=None):
        if index is None:
            index = "test"

        self.es = Elasticsearch(hosts=[{"host": host, "port": port}], send_get_body_as="POST")
        self.index = index
        # None means the analyzer is left out of the request,
        # so the index's default analyzer is used.
        self.analyzer = analyzer

    def __call__(self, text):
        if not text:
            return []

        body = {"text": text}
        if self.analyzer is not None:
            body["analyzer"] = self.analyzer
        data = self.es.indices.analyze(index=self.index, body=body)

        # Return the token strings ordered by their position in the text.
        tokens = sorted(data["tokens"], key=lambda t: t["position"])
        return [t["token"] for t in tokens]


def main():
    E_HOST = "localhost"
    E_PORT = 9200
    E_INDEX = "sudachi_test"
    E_ANALYZER = "sudachi_analyzer"
    analyzer = ESAnalyzer(host=E_HOST, port=E_PORT, index=E_INDEX, analyzer=E_ANALYZER)

    # Analyze each sample text in turn.
    for text in ("医療品安全管理責任者", "打込む"):
        print(analyzer(text))


if __name__ == "__main__":
    main()
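
The ordering step in __call__ can be tried without a running node. The sketch below applies the same sort-by-position logic to a hand-written dict in the shape of an _analyze response. The sample tokens are made up for illustration and are not real Sudachi output, and the function name is my own.

```python
def tokens_in_position_order(analyze_response):
    """Return token strings from an _analyze-style response, sorted by position."""
    tokens = sorted(analyze_response.get("tokens", []), key=lambda t: t["position"])
    return [t["token"] for t in tokens]


# Hand-written sample, deliberately out of order. NOT real Sudachi output.
sample = {
    "tokens": [
        {"token": "b", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1},
        {"token": "a", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0},
    ]
}
print(tokens_in_position_order(sample))  # ['a', 'b']
```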
