Elasticsearch Sudachi Windows + Python
- Working notes for Windows
- Covers the steps up to using Elasticsearch as a morphological analysis API server
Installing Sudachi (Elasticsearch 5.6.x)
Download or clone elasticsearch-sudachi to any convenient location
Install Maven so that mvn package can be run
Building Sudachi.zip
D:\elasticsearch-sudachi-develop>mvn package
Installing Sudachi.zip
D:\elasticsearch-5.6.5\bin>elasticsearch-plugin install file:/D:\elasticsearch-sudachi-develop\target\releases\analysis-sudachi-1.0.0-SNAPSHOT.zip
-> Downloading file:/D:\elasticsearch-sudachi-develop\target\releases\analysis-sudachi-1.0.0-SNAPSHOT.zip
[=================================================] 100%
-> Installed analysis-sudachi
Verifying the Sudachi.zip installation
D:\elasticsearch-5.6.5\bin>elasticsearch-plugin list
analysis-sudachi
Adding the Sudachi dictionary
Create the following folder and place the dictionary file in it
D:\elasticsearch-5.6.5\config\sudachi_tokenizer\system_core.dic
- I downloaded dictionary_full, but renamed the file to the name above
- Note: I could not work out how to configure a different dictionary file name
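The copy-and-rename step above can also be scripted. This is only a minimal sketch and not part of the original procedure: `install_dictionary` is a hypothetical helper name, and the source path has to be whatever file was actually downloaded. The target name `system_core.dic` and the `config\sudachi_tokenizer` folder are the ones used above.

```python
import shutil
from pathlib import Path


def install_dictionary(src_dic, es_home, target_name="system_core.dic"):
    """Copy a downloaded Sudachi dictionary into <es_home>/config/sudachi_tokenizer.

    The file is stored under target_name regardless of its original name,
    mirroring the manual rename described above.
    """
    dest_dir = Path(es_home) / "config" / "sudachi_tokenizer"
    dest_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    dest = dest_dir / target_name
    shutil.copyfile(str(src_dic), str(dest))
    return dest


# Usage (paths are assumptions, adjust them to the local setup):
# install_dictionary(r"<path to the downloaded .dic file>", r"D:\elasticsearch-5.6.5")
```

Copying instead of moving keeps the downloaded file under its original name, so the rename can be redone later if the configuration turns out to accept another file name.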
Starting Elasticsearch
D:\elasticsearch-5.6.5>bin\elasticsearch
Check the plugin with Postman or Fiddler
GET 'http://localhost:9200/_nodes/plugins?pretty'
"plugins" : [
  {
    "name" : "analysis-sudachi",
    "version" : "1.0.0-SNAPSHOT",
    "description" : "The Japanese (Sudachi) Analysis plugin integrates Lucene Sudachi analysis module into elasticsearch.",
    "classname" : "com.worksap.nlp.elasticsearch.sudachi.plugin.AnalysisSudachiPlugin",
    "has_native_controller" : false
  }
],
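The same check can be done from Python without Postman or Fiddler. A minimal sketch using only the standard library follows; the helper names `installed_plugins` and `fetch_nodes_info` are made up for this example, and it assumes the node is reachable at `http://localhost:9200` as above. The `_nodes/plugins` response lists the plugins of each node under `nodes.<node id>.plugins`, which is where the excerpt above comes from.

```python
import json
from urllib.request import urlopen


def installed_plugins(nodes_info):
    """Return the set of plugin names found in a /_nodes/plugins response."""
    names = set()
    for node in nodes_info.get("nodes", {}).values():
        for plugin in node.get("plugins", []):
            names.add(plugin["name"])
    return names


def fetch_nodes_info(base_url="http://localhost:9200"):
    """GET /_nodes/plugins and decode the JSON body (needs a running node)."""
    with urlopen(base_url + "/_nodes/plugins") as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage, with Elasticsearch running:
# print("analysis-sudachi" in installed_plugins(fetch_nodes_info()))
```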
Creating the index
PUT http://localhost:9200/sudachi_test
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "mode": "search"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}
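I did not really understand how to specify the dictionary folder, so I adjusted the tokenizer section instead.
If there is a better way, please let me know.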
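For repeatable setups, the PUT request above can be sent from a script instead of a GUI client. The following is a rough sketch with the standard library only; `sudachi_index_settings` and `create_index` are hypothetical helper names, and the settings body is simply the one shown above with the `mode` value exposed as a parameter.

```python
import json
from urllib.request import Request, urlopen


def sudachi_index_settings(mode="search"):
    """Build the same settings body as the PUT request above."""
    return {
        "settings": {
            "index": {
                "analysis": {
                    "tokenizer": {
                        "sudachi_tokenizer": {"type": "sudachi_tokenizer", "mode": mode}
                    },
                    "analyzer": {
                        "sudachi_analyzer": {
                            "type": "custom",
                            "tokenizer": "sudachi_tokenizer",
                            "filter": [],
                        }
                    },
                }
            }
        }
    }


def create_index(name, body, base_url="http://localhost:9200"):
    """PUT <base_url>/<name> with a JSON body (needs a running node)."""
    req = Request(
        "{}/{}".format(base_url, name),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage, with Elasticsearch running:
# print(create_index("sudachi_test", sudachi_index_settings()))
```

Keeping the body in a function makes it easy to recreate the index with a different `mode` while experimenting.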
Installing the elasticsearch Python client (conda)
D:\>conda install elasticsearch
Getting results from Python
els.py
from elasticsearch import Elasticsearch


class ESAnalyzer(object):
    """Callable wrapper around the Elasticsearch _analyze API."""

    def __init__(self, host="localhost", port=9200, index=None, analyzer=None):
        if index is None:
            index = "test"
        if analyzer is None:
            analyzer = "standard"  # fall back to the built-in standard analyzer
        # Request bodies are sent with POST instead of GET.
        self.es = Elasticsearch(hosts=[{"host": host, "port": port}], send_get_body_as="POST")
        self.index = index
        self.analyzer = analyzer

    def __call__(self, text):
        """Return the tokens of text as a list, ordered by position."""
        if not text:
            return []
        data = self.es.indices.analyze(index=self.index,
                                       body={"analyzer": self.analyzer, "text": text})
        # Sort by position, then keep only the surface form of each token.
        tokens = sorted(data["tokens"], key=lambda t: t["position"])
        return [t["token"] for t in tokens]


def main():
    E_HOST = "localhost"
    E_PORT = 9200
    E_INDEX = "sudachi_test"
    E_ANALYZER = "sudachi_analyzer"
    analyzer = ESAnalyzer(host=E_HOST, port=E_PORT, index=E_INDEX, analyzer=E_ANALYZER)
    # text = "医療品安全管理責任者"  # alternative sample input
    text = "打込む"
    print(analyzer(text))


if __name__ == "__main__":
    main()
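The script above keeps only the token strings, but each entry in the `_analyze` response also carries `start_offset`, `end_offset`, `type`, and `position`. If the offsets are needed, for example to map tokens back to the input text, something like the following could be used. This is only a sketch: `tokens_with_offsets` is a hypothetical helper, and it works on the decoded response dictionary rather than calling Elasticsearch itself.

```python
def tokens_with_offsets(response):
    """Return (token, start_offset, end_offset) tuples ordered by position.

    Tokens that share a position are ordered by their start offset.
    """
    ordered = sorted(
        response.get("tokens", []),
        key=lambda t: (t["position"], t["start_offset"]),
    )
    return [(t["token"], t["start_offset"], t["end_offset"]) for t in ordered]


# Usage with the ESAnalyzer class above (assumes a running node):
# data = analyzer.es.indices.analyze(index=analyzer.index,
#                                    body={"analyzer": analyzer.analyzer, "text": text})
# print(tokens_with_offsets(data))
```

Sorting on the pair of position and start offset keeps the order deterministic when an analyzer emits several tokens at the same position.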