

@ysd_marrrr (Masataka Yoshida) in 株式会社オープンストリーム

[With fix] How I ended up with UNASSIGNED shards in Elasticsearch after "No space left on device"

Elasticsearch

Posted at 2019-12-24, last updated at 2019-12-24

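This article is the day 20 entry of the "Elastic Stack (Elasticsearch)その2 Advent Calendar 2019".

I'm @ysd_marrrr. I set out to analyze data stored in Elasticsearch and ended up looking after Elasticsearch itself.

I maintain an Elasticsearch server whose design is, ~~in hindsight,~~ a bad one: its path.data includes the system partition, so the system partition comes under pressure as data accumulates. An alert about free space on the system partition fired on this server, and I had to investigate.

After resolving the free-space problem on the partition (a precondition for everything below), I checked the Elasticsearch indices and found the status was RED.
In this cluster an index turns RED when one of the nodes goes down, so I went looking for the dead node to restart it, only to find that every node in the cluster was alive 🤔

When I looked at the shards of the RED index, some of them were UNASSIGNED…😱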

$ curl "http://localhost:9200/_cat/shards/myindex1"
myindex1 1  p STARTED 4822406 33.5gb 10.127.110.1 elasticsearch-node1
myindex1 2  p STARTED 4818526 34.6gb 10.127.110.1 elasticsearch-node1
myindex1 3  p UNASSIGNED 4799590 33.3gb 10.127.110.2  elasticsearch-node2
myindex1 4  p STARTED 4824062 33.7gb 10.127.110.3  elasticsearch-node3
myindex1 5  p UNASSIGNED 4804203   34gb 10.127.110.2  elasticsearch-node2
myindex1 6  p STARTED 4824062 33.7gb 10.127.110.3  elasticsearch-node3
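On an index with many shards, the UNASSIGNED rows are easier to spot when everything else is filtered out. A minimal sketch, assuming the default `_cat/shards` column order shown above (state in the fourth column); `only_unassigned` is a hypothetical helper name, not part of Elasticsearch.

```shell
# Hypothetical helper: keep only rows whose 4th column (state) is UNASSIGNED.
# Assumes the default _cat/shards column order: index shard prirep state ...
only_unassigned() {
  awk '$4 == "UNASSIGNED"'
}

# Against a live cluster (assumed to listen on localhost:9200):
#   curl -s "http://localhost:9200/_cat/shards/myindex1" | only_unassigned
```

Piping the listing above through it would leave only the rows for shards 3 and 5.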

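This is the case where shards dropped out because the partition ran out of free space and the index went RED. I resolved it ~~in a pretty silly way~~ without deleting any data, so I'm sharing how.

Environment

I checked this in the following environment.
Note: this is about Elasticsearch 5.x, so please work out for yourself the conventions for sending JSON with curl on 6.x and later.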

$ curl "http://localhost:9200/"
{
  "version" : {
    "number" : "5.6.8",
    "lucene_version" : "6.6.1"
  },
  "tagline" : "You Know, for Search"
}
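For readers on 6.x or later, the curl convention mentioned in the note is that a request with a body must declare its content type: Elasticsearch 6.0 and later reject a JSON body sent without a `Content-Type: application/json` header. A minimal sketch follows; `es_put_json` and the `ES_CURL` override are hypothetical names introduced here so the command can be dry-run without a cluster.

```shell
# Hypothetical wrapper: PUT a JSON body with the header that 6.x+ requires.
# $1 = URL, $2 = JSON body.
# Set ES_CURL=echo to print the arguments instead of sending the request.
es_put_json() {
  ${ES_CURL:-curl} -sS -XPUT -H 'Content-Type: application/json' "$1" -d "$2"
}

# Example (assumes a node listening on localhost:9200):
#   es_put_json 'http://localhost:9200/_cluster/settings' '{"transient":{}}'
```

The header is also accepted on 5.x, so the same wrapper should work for the commands in this article.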

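⚠ Replication is not configured, but I'm assuming an environment where losing data would hurt a fair bit
⚠ I don't think this applies to production environments where data must never be lost
⚠ Free up storage space before performing this operation

Searching for "Elasticsearch UNASSIGNED" mostly turns up methods that sacrifice the shard

When you look up cases where UNASSIGNED was resolved, the solution that stands out is "it happened on a shard that was safe to delete, so we delete it and allocate an empty shard."
~~Yes, I should have had replicas or backups,~~ but deleting a shard would hurt, so I looked for another way.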

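Elasticsearch5系で、シャードunassignedによるstatus redを強引に修復する方法 (How to forcibly repair status red caused by unassigned shards on Elasticsearch 5.x) - Gunosy Tech Blog
https://tech.gunosy.io/entry/elasticsearch-force-green

elasticsearch unassignedになってしまったシャードの割り当て (Allocating shards that have become unassigned) - notebook
https://swfz.hatenablog.com/entry/2015/08/19/023956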

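Checking with `_cluster/allocation/explain` → the fix was right there!!

When the shards showed up as UNASSIGNED, the first thing I did was use this API to check the shard allocation status.
Sure enough, it reports No space left on device.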

$ curl "http://localhost:9200/_cluster/allocation/explain?pretty"
{
  "index" : "myindex1",
  "shard" : 6,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2019-12-01T19:50:00.027Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed to create shard, failure IOException[No space left on device]",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions" : [
    {
      "node_id" : "-CLqY8ecTdSkufWq0ba28w",
      "node_name" : "elasticsearch-node3",
      "transport_address" : "10.127.100.3:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "pMm2kOjhRHqJwGba2A7U3Q"
      },
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-12-01T19:50:00.027Z], failed_attempts[5], delayed=false, details[failed to create shard, failure IOException[No space left on device]], allocation_status[deciders_no]]]"
        }
      ]
    },
    {
      "node_id" : "7fhGQ7jQTTS0zTh5YI-GAg",
      "node_name" : "elasticsearch-node1",
      "transport_address" : "10.127.100.1:9300",
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "s5J-96kgRraq1E56jHTv2Q",
      "node_name" : "elasticsearch-node2",
      "transport_address" : "10.127.100.2:9300",
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    }
  ]
}

But the fix was in fact stated plainly in this very output.

"explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-12-01T19:50:00.027Z], failed_attempts[5], delayed=false, details[failed to create shard, failure IOException[No space left on device]], allocation_status[deciders_no]]]"

Send a POST request to the API it names.

$ curl -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true&pretty'

Solved ✌

$ curl "http://localhost:9200/_cat/shards/myindex1"
myindex1 1  p STARTED 4822406 33.5gb 10.127.110.1 elasticsearch-node1
myindex1 2  p STARTED 4818526 34.6gb 10.127.110.1 elasticsearch-node1
myindex1 3  p STARTED 4799590 33.3gb 10.127.110.2  elasticsearch-node2
myindex1 4  p STARTED 4824062 33.7gb 10.127.110.3  elasticsearch-node3
myindex1 5  p STARTED 4804203   34gb 10.127.110.2  elasticsearch-node2
myindex1 6  p STARTED 4824062 33.7gb 10.127.110.3  elasticsearch-node3
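The retry only helps when allocation is blocked by the retry limit itself, which is what the `max_retry` decider in the explain output reports (the `[5]` in its message matches the default of `index.allocation.max_retries`). A minimal sketch that checks for that decider before retrying; `blocked_by_max_retry` is a hypothetical helper, and the plain-text match is an assumption made to avoid depending on a JSON parser.

```shell
# Hypothetical helper: succeeds (exit 0) when the allocation-explain JSON
# read from stdin contains a max_retry decider.
blocked_by_max_retry() {
  grep -Eq '"decider"[[:space:]]*:[[:space:]]*"max_retry"'
}

# Against a live cluster (assumed to listen on localhost:9200):
#   if curl -s 'http://localhost:9200/_cluster/allocation/explain' | blocked_by_max_retry; then
#     curl -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true&pretty'
#   fi
```

If the underlying cause (here, the full disk) has not been fixed first, the retried allocations can be expected to fail again.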

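Methods that did not clear the `UNASSIGNED` state this time

Wondering whether deleting the UNASSIGNED shards was the only option, I tried other methods I could find, but none of them resolved it.

`allocate` in `/_cluster/reroute` does not work

There are reports where shard allocation failed, the shards became UNASSIGNED, and sending an allocate command to /_cluster/reroute fixed it.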

elasticsearch unassignedになってしまったシャードの割り当て (Allocating shards that have become unassigned) - notebook
https://swfz.hatenablog.com/entry/2015/08/19/023956

elasticsearch - what to do with unassigned shards - Stack Overflow
https://stackoverflow.com/questions/23656458/elasticsearch-what-to-do-with-unassigned-shards/23816954#23816954

Let's try it.

/_cluster/reroute

$ curl -XPOST 'http://localhost:9200/_cluster/reroute?pretty' -d '{
  "commands": [{
    "allocate": {
      "index": "myindex1",
      "shard": 5,
      "node": "elasticsearch-node3",
      "allow_primary": true
    }
  }]
}'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "unknown_named_object_exception",
        "reason" : "Unknown AllocationCommand [allocate]",
        "line" : 3,
        "col" : 17
      }
    ],
    "type" : "parsing_exception",
    "reason" : "[cluster_reroute] failed to parse field [commands]",
    "line" : 3,
    "col" : 17,
    "caused_by" : {
      "type" : "unknown_named_object_exception",
      "reason" : "Unknown AllocationCommand [allocate]",
      "line" : 3,
      "col" : 17
    }
  },
  "status" : 400
}

The response: never heard of allocate! 😥
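The error is consistent with the 5.0 breaking changes, in which the `allocate` command was split into `allocate_replica`, `allocate_stale_primary`, and `allocate_empty_primary`. This article did not try them, and both primary variants require `accept_data_loss: true` because they can discard data, so the following is only a sketch of what a 5.x request body looks like. `stale_primary_body` is a hypothetical helper; the node passed to it must be one that still holds a copy of the shard, and node2 is used here only because the listing at the top of this article shows shard 5 there.

```shell
# Hypothetical helper: print a reroute body using a 5.x command name.
# $1 = index, $2 = shard number, $3 = node name.
# allocate_stale_primary may lose data; Elasticsearch requires the explicit
# accept_data_loss flag before it will run the command.
stale_primary_body() {
  printf '{"commands":[{"allocate_stale_primary":{"index":"%s","shard":%s,"node":"%s","accept_data_loss":true}}]}\n' "$1" "$2" "$3"
}

stale_primary_body myindex1 5 elasticsearch-node2
```

In this article's situation the `retry_failed=true` call was enough, so a command like this would only be a last resort.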

`"index.routing.allocation.disable_allocation": false` does not work

One method, reported as "I asked Elasticsearch support and this fixed it!", is to set "index.routing.allocation.disable_allocation": false:

sharding - ElasticSearch: Unassigned Shards, how to fix? - Stack Overflow
https://stackoverflow.com/a/20010544

stackoverflow.com/a/20010544

curl -XPUT 'localhost:9200/<index>/_settings' \
    -d '{"index.routing.allocation.disable_allocation": false}'

This likewise fails, with "unknown setting". Replacing <index> with the index that actually went UNASSIGNED made no difference.

index.routing.allocation.disable_allocation

$ curl -XPUT 'http://localhost:9200/_settings?pretty' -d ' {"index.routing.allocation.disable_allocation": false}'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "unknown setting [index.routing.allocation.disable_allocation] please check that any required plugins are installed, or check the breaking changes documentation for removed settings"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "unknown setting [index.routing.allocation.disable_allocation] please check that any required plugins are installed, or check the breaking changes documentation for removed settings"
  },
  "status" : 400
}

Looking at the answers again, it does clearly say v0.90.x and earlier.
I also tried "cluster.routing.allocation.enable" : "all", but it had no effect.

stackoverflow.com/a/23781013

# v0.90.x and earlier
curl -XPUT 'localhost:9200/_settings' -d '{
    "index.routing.allocation.disable_allocation": false
}'

# v1.0+
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
    "transient" : {
        "cluster.routing.allocation.enable" : "all"
    }
}'

About watermarks

According to the sources below, shards may not be allocated when the watermark settings are too low, so I tried changing the watermark settings.
https://www.elastic.co/guide/en/elasticsearch/reference/current/disk-allocator.html

The master node may not be able to assign shards if there are not enough nodes with sufficient disk space (it will not assign shards to nodes that have over 85 percent disk in use).

Reason 5: Low disk watermark
https://www.datadoghq.com/ja/blog/elasticsearch-unassigned-shards/

Example of changing the watermark settings (note that the change is transient)

$ curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}'

As soon as I changed the watermarks, a message appeared to the effect that shards were being moved because disk usage exceeded the watermark.

[2019-12-02T15:16:49,472][WARN ][o.e.c.r.a.DiskThresholdMonitor] [elasticsearch-node1] high disk watermark [90%] exceeded on [-CLqY8ecTdSkufWq0ba28w][elasticsearch-node1][/mnt/elasticsearch/data/nodes/0] free: 8.4gb[8.4%], shards will be relocated away from this node
[2019-12-02T15:16:49,472][INFO ][o.e.c.r.a.DiskThresholdMonitor] [elasticsearch-node1] rerouting shards: [high disk watermark exceeded on one or more nodes]

No shards were moved when "No space left on device" occurred, which means that for some reason the watermark check was not carried out correctly.
These thresholds are set by default (low = 85%, high = 90%), and I don't understand why they did not work properly with the default values 🤔

And after this change, trying allocate on /_cluster/reroute and the like still left the shards UNASSIGNED.
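For reference, the watermarks are compared against each node's disk usage roughly as follows: above the low watermark Elasticsearch will not allocate shards to the node, and above the high watermark it tries to relocate shards away from it. One possible explanation for the behavior above, which this article did not verify, is that these thresholds only steer shard allocation and do not stop shards already on a node from growing as documents are indexed. A minimal sketch of the comparison, with the defaults (85 and 90) as fallback values; `classify_disk` is a hypothetical helper, not an Elasticsearch command.

```shell
# Hypothetical helper: classify a used-disk percentage against the watermarks.
# $1 = used percent, $2 = low watermark (default 85), $3 = high (default 90).
classify_disk() {
  awk -v used="$1" -v low="${2:-85}" -v high="${3:-90}" 'BEGIN {
    # "+ 0" forces numeric comparison
    if (used + 0 > high + 0)     print "high"
    else if (used + 0 > low + 0) print "low"
    else                         print "ok"
  }'
}

# The log line above reports free: 8.4%, i.e. 91.6% used.
classify_disk 91.6
```

With the default thresholds, 91.6% used is above the high watermark, which matches the relocation warning in the log.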

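Closing

I'm glad _cluster/allocation/explain came to the rescue.
When you cannot restore from a backup, look up a command, and get told "Unknown AllocationCommand [allocate]", things get quite tedious, so set up replicas or backups that let you recover easily 😇

(In my case, I'm told the verdict was "the data is too large to provision backup storage for" 😇😇😇)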
