Indexing attachments with the Ingest Attachment Processor Plugin on Elasticsearch 6.2.3

Last updated at 2018-04-19, posted at 2018-04-04

This article sets up Elasticsearch to index attachment files that are associated with documents.
It is a summary of the following official documentation:
Ingest Attachment Processor Plugin

My environment is as follows.

Ubuntu 16.04 LTS
Elasticsearch 6.2.3

First, install the ingest-attachment plugin.

$ sudo bin/elasticsearch-plugin install ingest-attachment
$ sudo /etc/init.d/elasticsearch restart

Next, define a pipeline, which is the processing applied to a document before it is indexed.

$ curl -X PUT 'localhost:9200/_ingest/pipeline/attachment' -H 'Content-Type: application/json' -d'
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1,
        "properties" : [
         "content",
         "content_type"
        ]
      }
    }
  ]
}
'
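
Before indexing real data, the pipeline can be tried out with the simulate API, which runs the processors and returns the result without storing anything. The following is a minimal sketch, assuming Elasticsearch is listening on localhost:9200 as in the commands above. The function names build_simulate_body and simulate_attachment are hypothetical helpers introduced only for illustration.

```shell
# Build a request body for the _simulate endpoint from a piece of text.
# build_simulate_body is a hypothetical helper, not part of Elasticsearch.
build_simulate_body() {
  # base64-encode the text and strip line breaks so it fits in one JSON string
  encoded=$(printf '%s' "$1" | base64 | tr -d '\n')
  printf '{ "docs" : [ { "_index" : "shop", "_type" : "goods", "_id" : "1", "_source" : { "data" : "%s" } } ] }\n' "$encoded"
}

# Send the body to the "attachment" pipeline without indexing anything.
# Assumes Elasticsearch is reachable at localhost:9200.
simulate_attachment() {
  build_simulate_body "$1" |
    curl -s -X POST 'localhost:9200/_ingest/pipeline/attachment/_simulate?pretty' \
      -H 'Content-Type: application/json' --data-binary @-
}
```

Calling simulate_attachment 'hello' should return a document whose attachment field holds the extracted content and content_type, provided the pipeline was registered as shown above.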

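The attachment processor accepts the following settings.

field
Required. Name of the field that holds the base64-encoded binary.
target_field
Optional. Name of the field that receives the extracted attachment information (default: attachment).
indexed_chars
Optional. Maximum number of characters to extract (default: 100000). Set to -1 for no limit.
properties
Optional. The properties to store. The following can be specified:
content, title, name, author, keywords, date, content_type, content_length, language
ignore_missing
Optional. If true and the field does not exist, the processor exits quietly without modifying the document (default: false).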
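
As an illustration of the optional settings, the sketch below defines a second pipeline that stores the result under a different field, caps extraction, and tolerates documents without the source field. The pipeline name attachment_limited, the field name file_info, the 5000-character limit, and both function names are arbitrary assumptions chosen for this example.

```shell
# Print a pipeline definition that uses the optional settings.
# "file_info" and the 5000-character limit are arbitrary illustrative values.
print_pipeline_definition() {
  cat <<'EOF'
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "target_field" : "file_info",
        "indexed_chars" : 5000,
        "properties" : [ "content", "title", "content_type", "content_length" ],
        "ignore_missing" : true
      }
    }
  ]
}
EOF
}

# Register the definition under the hypothetical pipeline name "attachment_limited".
# Assumes Elasticsearch is reachable at localhost:9200.
register_pipeline() {
  print_pipeline_definition |
    curl -s -X PUT 'localhost:9200/_ingest/pipeline/attachment_limited' \
      -H 'Content-Type: application/json' --data-binary @-
}
```

With this definition, a document without a data field would pass through unchanged, and extracted values would appear under file_info rather than attachment.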

To delete the pipeline, run the following.

$ curl -X DELETE 'localhost:9200/_ingest/pipeline/attachment'

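Because the bulk API is used, generate the JSON file before indexing.
The pipeline is specified in this file.
The attachment must be passed as base64-encoded binary, so convert it first. (Apparently a CBOR object can also be used.)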

$ file_path='/****/****/test.xlsx'
$ file=$(base64 $file_path | perl -pe 's/\n//g')
$ echo -e "{ \"index\" : { \"_index\" : \"shop\", \"_type\" : \"goods\", \"_id\" : \"1\", \"pipeline\": \"attachment\" }\n{ \"data\" : \"$file\" }" > request_bulk.json
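
When several files need to be indexed, the same two-line structure can be generated in a loop. This is a rough sketch under the same assumptions as the commands above (index shop, type goods, pipeline attachment). The helpers bulk_lines_for_file and build_bulk_file are hypothetical, and sequential numeric IDs are just one possible choice.

```shell
# Print the two bulk lines (action + document) for one file.
# bulk_lines_for_file is a hypothetical helper; arguments: document id, file path.
bulk_lines_for_file() {
  doc_id=$1
  src=$2
  # tr removes the line breaks that base64 inserts into long output
  encoded=$(base64 < "$src" | tr -d '\n')
  printf '{ "index" : { "_index" : "shop", "_type" : "goods", "_id" : "%s", "pipeline" : "attachment" } }\n' "$doc_id"
  printf '{ "data" : "%s" }\n' "$encoded"
}

# Build request_bulk.json from every file given as an argument,
# numbering the documents 1, 2, 3, ...
build_bulk_file() {
  id=1
  : > request_bulk.json
  for f in "$@"; do
    bulk_lines_for_file "$id" "$f" >> request_bulk.json
    id=$((id + 1))
  done
}
```

For example, build_bulk_file a.xlsx b.pdf would write four lines to request_bulk.json, each terminated by a newline as the bulk API requires.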

Index the data using the JSON file you created.

$ curl -H "Content-type: application/x-ndjson" -X POST 'http://localhost:9200/_bulk?refresh=false' --data-binary @request_bulk.json
$ curl -X POST 'localhost:9200/shop/_refresh'
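
The bulk API reports per-item failures in its response body through a top-level errors field, so it can be worth checking the response rather than only sending the request. The sketch below is a rough text match, not a JSON parser, and the function names bulk_has_errors and index_bulk_file are hypothetical.

```shell
# Succeeds (exit status 0) when a bulk response read from stdin reports failures.
# The bulk API sets the top-level "errors" field to true if any item failed.
bulk_has_errors() {
  grep -q '"errors"[[:space:]]*:[[:space:]]*true'
}

# Send request_bulk.json and report whether any item failed.
# Assumes Elasticsearch is reachable at localhost:9200.
index_bulk_file() {
  response=$(curl -sS -f -H 'Content-Type: application/x-ndjson' -X POST \
    'http://localhost:9200/_bulk' --data-binary @request_bulk.json) || {
    echo "bulk request failed" >&2
    return 1
  }
  if printf '%s' "$response" | bulk_has_errors; then
    echo "some documents failed:" >&2
    printf '%s\n' "$response" >&2
    return 1
  fi
  echo "all documents indexed"
}
```

A failure here often points at the encoded data, for instance line breaks left inside the base64 string.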

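The processing uses Apache Tika internally, and extraction puts a heavy load on the server, so it may be better to create a dedicated ingest node with settings like the following in elasticsearch.yml.

node.master: false
node.ingest: true
node.data: false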

Helpful articles

Indexing Attachment file to elastic search
'Illegal base64 character' exception when indexing pdf in elasticsearch from shell script
Elastic Search Bulk API, Pipeline and Geo IP
