
Aucfan Group Advent Calendar 2021

@iwatake255 (Aucfan Co., Ltd.)

[Elasticsearch] Go beyond the 32GB memory boundary

Posted at 2021-12-14, last updated at 2021-12-14

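Introduction

This is @iwatake255 from Aucfan.
I work as a lead engineer on the Data Solution team in the Development Department.
I am still a beginner with only about two liters of Elasticsearch experience, but I would like to write about Elasticsearch.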

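At Aucfan, we use Elasticsearch to store and search an enormous volume of auction data.

Elasticsearch lets you change the JVM heap size it uses, but a widely repeated myth says that this heap size should never be set to 32GB or more.

However, while tuning the performance of our own service, we hit a case where simply changing the JVM heap size from 32GB to 64GB improved search performance dramatically. In this article I examine how well the myth holds up.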

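About the Elasticsearch "32GB myth"

Official documentation and related sources

In the past (up to Elasticsearch 7.10?), the documentation explicitly said to set the JVM heap size (Xms and Xmx) so that it does not exceed roughly 32GB.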

Set Xmx and Xms to no more than the threshold that the JVM uses for compressed object pointers (compressed oops). The exact threshold varies but is near 32 GB.

(From the official Elasticsearch 7.10 documentation)

The current documentation (Elasticsearch 7.15) instead says that 26GB to 30GB is safe.

Set Xms and Xmx to no more than the threshold for compressed ordinary object pointers (oops). The exact threshold varies but 26GB is safe on most systems and can be as large as 30GB on some systems.

(From the official Elasticsearch 7.15 documentation)
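Since the exact threshold varies by system, it can help to check whether each node is actually running with compressed oops. The following is a minimal sketch that summarizes an already-parsed `GET _nodes/jvm` response. It assumes the response carries `jvm.mem.heap_max_in_bytes` and `jvm.using_compressed_ordinary_object_pointers` for each node; the function name and the sample data are hypothetical and written by hand for illustration, not taken from a real cluster.

```python
def compressed_oops_report(nodes_jvm_response):
    """Map node name -> (max heap in GiB, True if compressed oops are in use).

    Assumes the dict has the shape of a parsed `GET _nodes/jvm` response.
    """
    report = {}
    for node in nodes_jvm_response.get("nodes", {}).values():
        jvm = node.get("jvm", {})
        heap_bytes = jvm.get("mem", {}).get("heap_max_in_bytes", 0)
        # The flag is assumed to be reported as the string "true" or "false".
        flag = str(jvm.get("using_compressed_ordinary_object_pointers", "unknown"))
        name = node.get("name", "unknown")
        report[name] = (round(heap_bytes / 2**30, 1), flag.lower() == "true")
    return report


# Hypothetical, hand-written sample (not real cluster output).
sample = {
    "nodes": {
        "node-id-1": {
            "name": "data-1",
            "jvm": {
                "mem": {"heap_max_in_bytes": 30 * 2**30},
                "using_compressed_ordinary_object_pointers": "true",
            },
        },
        "node-id-2": {
            "name": "data-2",
            "jvm": {
                "mem": {"heap_max_in_bytes": 64 * 2**30},
                "using_compressed_ordinary_object_pointers": "false",
            },
        },
    }
}

for name, (heap_gib, compressed) in compressed_oops_report(sample).items():
    print(f"{name}: heap={heap_gib} GiB, compressed oops={compressed}")
```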

Separately, the documentation also says to set the heap to no more than 50% of the (node's) total memory.

Set Xms and Xmx to no more than 50% of your total memory.

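The reason for the 50% limit appears to be that leaving the other 50% to the OS filesystem cache improves performance when reading data.

What does this mean for Elasticsearch? It means data can be accessed quickly through the page cache instead of being read from disk. This is one of the reasons why Elasticsearch's memory recommendations generally say not to use more than half of the available memory: the other half should remain available for the page cache. It also means that memory is not wasted, because it is reused by the page cache.

(From the Elastic Blog)

If you do not specify the JVM heap size explicitly, Elasticsearch appears to choose it automatically based on the node's roles and total memory.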

By default, Elasticsearch automatically sets the JVM heap size based on a node’s roles and total memory. Using the default sizing is recommended for most production environments.

(From the official Elasticsearch 7.15 documentation)

I have not tried every pattern, but the default generally seems to be the smaller of 50% of total memory and 32GB.
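The observed rule of thumb can be written down as a short sketch. It only encodes the observation above ("the smaller of 50% of total memory and 32GB"); it is not Elasticsearch's actual sizing logic, which also depends on node roles, and the function name and the `cap_gb` parameter are hypothetical.

```python
def observed_default_heap_gb(total_memory_gb, cap_gb=32.0):
    """Approximate default heap: the smaller of half the RAM and a cap.

    This mirrors the observation in the text, not Elasticsearch's real code.
    """
    return min(total_memory_gb * 0.5, cap_gb)


for total_gb in (16, 64, 128):
    print(f"{total_gb}GB RAM -> about {observed_default_heap_gb(total_gb)}GB heap")
```

Under this rule a 128GB node would still end up with a heap of about 32GB unless the heap size is set explicitly.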

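Judging from the official documentation and Elasticsearch's default behavior, allocating 32GB or more looks like something to be avoided at all costs; it even looks as if something terrible happens once you go past 32GB.

What is the JVM heap allocated to Elasticsearch used for in the first place?

A certain proportion of it appears to be used for Elasticsearch's internal caches.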

The more heap available to Elasticsearch, the more memory it can use for its internal caches.

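(From the official Elasticsearch 7.15 documentation)

The internal caches appear to break down as follows.

What happens at 32GB

This is the boundary between being able and not being able to use the JVM feature called Compressed OOPs.

The following article explains Compressed OOPs very clearly.

Roughly speaking, once the JVM heap size is set to 32GB or more, memory addresses that could previously be represented in 32 bits become 64 bits wide. As a result, each memory address (≈ a pointer to a Java object) itself consumes twice as much memory as it does below 32GB.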
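The figure of roughly 32GB can be reproduced with a little arithmetic. The sketch below assumes HotSpot's default 8-byte object alignment: a 32-bit compressed reference then counts 8-byte units rather than single bytes. The constant names are mine, chosen for illustration.

```python
REFERENCE_BITS = 32          # width of a compressed reference
OBJECT_ALIGNMENT_BYTES = 8   # assumed HotSpot default object alignment

# Each of the 2**32 possible reference values points at an 8-byte-aligned slot.
addressable_bytes = (2 ** REFERENCE_BITS) * OBJECT_ALIGNMENT_BYTES
addressable_gib = addressable_bytes / 2**30

print(f"Addressable with compressed references: {addressable_gib} GiB")  # 32.0 GiB
```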

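Because of this, setting the heap size to 32GB or more may have the following negative effects:

Creating the same objects consumes more memory than with a heap below 32GB
In particular, with a heap set only slightly above 32GB, the usable memory can end up smaller than with a heap below 32GB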
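To get a feel for the second point, here is a deliberately simple model. It assumes that a fraction `ref_fraction` of the heap is occupied by object references when they are 4 bytes wide, and that each reference grows to 8 bytes without compressed oops. The 20% figure is an arbitrary assumption for illustration, not a measured value, and the model ignores object headers and alignment padding.

```python
def compressed_equivalent_gb(heap_gb, ref_fraction, compressed_oops):
    """Heap capacity expressed in 'compressed-oops-equivalent' GB.

    With 4-byte references, an object graph of size H spends ref_fraction * H
    on references. With 8-byte references the same graph needs
    (1 + ref_fraction) * H, so an uncompressed heap holds proportionally less.
    """
    if not 0.0 <= ref_fraction < 1.0:
        raise ValueError("ref_fraction must be in [0, 1)")
    if compressed_oops:
        return heap_gb
    return heap_gb / (1.0 + ref_fraction)


ASSUMED_REF_FRACTION = 0.2  # hypothetical share of heap used by references

for heap_gb, compressed in [(31, True), (33, False), (64, False)]:
    capacity = compressed_equivalent_gb(heap_gb, ASSUMED_REF_FRACTION, compressed)
    print(f"heap={heap_gb}GB compressed_oops={compressed} -> ~{capacity:.1f}GB equivalent")
```

Under these assumptions a 33GB heap holds less than a 31GB heap, while a 64GB heap still holds considerably more, which matches the intuition that going only slightly over the boundary is the worst choice.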

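[Main topic] Must we never go beyond 32GB?

Looking only at the above, it does seem that bad things happen beyond 32GB.

But I think we must not forget that caching data in memory is itself an enormous advantage.

As mentioned above, a certain proportion of the JVM heap allocated to Elasticsearch is used for caching. So, in principle, the larger the JVM heap, the higher the cache hit rate should be, and the more likely it is that data does not have to be read from the SSD.
However fast SSDs are, their read performance still lags memory by a factor of several to around 100. Raising the cache hit rate by enlarging the JVM heap should therefore be a very powerful way to improve Elasticsearch search response times.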
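The effect of the hit rate can be illustrated with a simple expected-cost model. This is only a sketch: the function name is hypothetical, costs are in arbitrary units relative to a memory read, and the default `ssd_slowdown=100.0` is just the upper end of the "several times to about 100 times" range mentioned above.

```python
def expected_read_cost(hit_rate, memory_cost=1.0, ssd_slowdown=100.0):
    """Average cost of one read, given the cache hit rate.

    Hits cost `memory_cost`; misses cost `memory_cost * ssd_slowdown`.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_rate = 1.0 - hit_rate
    return hit_rate * memory_cost + miss_rate * memory_cost * ssd_slowdown


for hit_rate in (0.5, 0.9, 0.99, 1.0):
    print(f"hit rate {hit_rate:.0%}: relative cost {expected_read_cost(hit_rate):.2f}")
```

In this model the average cost is dominated by the misses, so the last few percent of hit rate matter the most.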

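In a load test we actually ran in-house, there was a case where simply changing the JVM heap of each data node from 32GB to 64GB improved average latency by more than 90%.

The results of that test (Gatling) are shown below.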

[Figure: Gatling results with a 32GB heap]
[Figure: Gatling results with a 64GB heap]

This applies only to this particular case, but I hope it shows that allocating 32GB or more of JVM heap can sometimes work in your favor.

For reference, the preconditions of our load test were as follows.

Physical memory per data node: 128GB
Data volume per data node: 40GB to 50GB
Documents per data node: 70 million to 80 million
Replica shards: 1
Low write frequency, high read frequency

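From here on this is speculation, but I suspect the 64GB JVM heap strategy is effective when the following conditions are met.

Condition 1
- The data node has 128GB or more of physical memory (so that a 64GB JVM heap is still at most 50% of total memory)
Condition 2
- The data volume per data node is between 32GB and 64GB (a 64GB JVM heap may then cache most or all of the data)
Condition 3
- Writes are infrequent (caching is likely to be effective, and thrashing almost never occurs)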
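The three conditions can be expressed as a small check. This is a sketch of the speculation above rather than a validated rule; the function and parameter names are hypothetical, and "writes are infrequent" is reduced to a boolean that you would have to judge for your own workload.

```python
def heap_64gb_looks_promising(physical_memory_gb, data_per_node_gb,
                              write_heavy, heap_gb=64.0):
    """Return True if all three speculative conditions hold for a data node."""
    # Condition 1: the heap stays within 50% of physical memory.
    fits_half_rule = heap_gb <= physical_memory_gb * 0.5
    # Condition 2: the data is too large for a 32GB heap but fits in the larger one.
    data_in_range = 32.0 <= data_per_node_gb <= heap_gb
    # Condition 3: the workload is read-mostly.
    read_mostly = not write_heavy
    return fits_half_rule and data_in_range and read_mostly


# The in-house case described above: 128GB RAM, 40GB to 50GB of data, few writes.
print(heap_64gb_looks_promising(128, 45, write_heavy=False))
# A 64GB machine fails condition 1.
print(heap_64gb_looks_promising(64, 45, write_heavy=False))
```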

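[Bonus] Checking whether 64GB helps with Rally, the official benchmark

"It worked in our in-house load test!" felt weak as evidence, so to get something more objective I used Rally, the official Elasticsearch benchmark tool, to check whether a 64GB heap is effective.

Environment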

AWS EC2 r6g.4xlarge (16 vCPUs, 128GB memory, EBS gp3 100GB)
Amazon Linux 2 AMI (HVM) - Kernel 5.10, SSD Volume Type
Elasticsearch 7.15.2
Rally 2.3.0

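The environment was set up with the following steps.

Benchmark conditions

JVM heap memory
- 32GB, 64GB
  - Based on condition 1
track (benchmark scenario)
- nyc_taxis
  - Its Uncompressed Size is 74.3 GB, which is close to condition 2
task
- Compare only the document search tasks
  - Based on condition 3

Results

We compare the 50th percentile of service time.
(service time = response time excluding wait time)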

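| task                | 32GB (ms) | 64GB (ms) | Difference (ms) |
|--------------------:|----------:|----------:|----------------:|
|               range |   254.869 |   191.784 |        -63.0847 |
| distance_amount_agg |   2.93646 |   2.91249 |        -0.02397 |
|       autohisto_agg |   408.145 |   356.736 |        -51.4087 |
|  date_histogram_agg |   148.163 |   119.555 |        -28.6081 |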
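Absolute differences are hard to compare across tasks, so the sketch below converts the table into relative improvements. The numbers are copied from the table above; the helper name is hypothetical.

```python
# 50th percentile service time in ms: (32GB heap, 64GB heap), from the table above.
results_ms = {
    "range": (254.869, 191.784),
    "distance_amount_agg": (2.93646, 2.91249),
    "autohisto_agg": (408.145, 356.736),
    "date_histogram_agg": (148.163, 119.555),
}


def improvement_percent(baseline_ms, contender_ms):
    """Relative reduction in service time, as a percentage of the baseline."""
    return (baseline_ms - contender_ms) / baseline_ms * 100.0


for task, (baseline, contender) in results_ms.items():
    print(f"{task}: {improvement_percent(baseline, contender):.1f}% faster")
```

This works out to roughly 25% for range, 13% for autohisto_agg, 19% for date_histogram_agg, and under 1% for distance_amount_agg.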

Benchmark results (32GB vs. 64GB comparison)

    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/


Comparing baseline
  Race ID: 2716fc79-3d9e-4985-9061-832a78794231
  Race timestamp: 2021-12-07 01:23:39
  Challenge: append-no-conflicts
  Car: external
  User tags: jvm_memory_size=32GB(default)

with contender
  Race ID: b2836ba5-4b5d-4fb9-8c23-d6eb94112010
  Race timestamp: 2021-12-07 04:46:31
  Challenge: append-no-conflicts
  Car: external
  User tags: jvm_memory_size=64GB

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                                                        Metric |                Task |    Baseline |   Contender |     Diff |   Unit |
|--------------------------------------------------------------:|--------------------:|------------:|------------:|---------:|-------:|
|                    Cumulative indexing time of primary shards |                     |     46.2922 |     50.8722 |  4.57995 |    min |
|             Min cumulative indexing time across primary shard |                     |           0 |           0 |        0 |    min |
|          Median cumulative indexing time across primary shard |                     |     23.1461 |     25.4361 |  2.28998 |    min |
|             Max cumulative indexing time across primary shard |                     |     46.2922 |     50.8722 |  4.57995 |    min |
|           Cumulative indexing throttle time of primary shards |                     |           0 |           0 |        0 |    min |
|    Min cumulative indexing throttle time across primary shard |                     |           0 |           0 |        0 |    min |
| Median cumulative indexing throttle time across primary shard |                     |           0 |           0 |        0 |    min |
|    Max cumulative indexing throttle time across primary shard |                     |           0 |           0 |        0 |    min |
|                       Cumulative merge time of primary shards |                     |     15.9374 |     24.5854 |  8.64797 |    min |
|                      Cumulative merge count of primary shards |                     |          26 |          29 |        3 |        |
|                Min cumulative merge time across primary shard |                     |           0 |           0 |        0 |    min |
|             Median cumulative merge time across primary shard |                     |     7.96871 |     12.2927 |  4.32398 |    min |
|                Max cumulative merge time across primary shard |                     |     15.9374 |     24.5854 |  8.64797 |    min |
|              Cumulative merge throttle time of primary shards |                     |     4.42985 |     11.4517 |  7.02182 |    min |
|       Min cumulative merge throttle time across primary shard |                     |           0 |           0 |        0 |    min |
|    Median cumulative merge throttle time across primary shard |                     |     2.21493 |     5.72583 |  3.51091 |    min |
|       Max cumulative merge throttle time across primary shard |                     |     4.42985 |     11.4517 |  7.02182 |    min |
|                     Cumulative refresh time of primary shards |                     |     1.06192 |     1.20283 |  0.14092 |    min |
|                    Cumulative refresh count of primary shards |                     |          36 |          46 |       10 |        |
|              Min cumulative refresh time across primary shard |                     |           0 |           0 |        0 |    min |
|           Median cumulative refresh time across primary shard |                     |    0.530958 |    0.601417 |  0.07046 |    min |
|              Max cumulative refresh time across primary shard |                     |     1.06192 |     1.20283 |  0.14092 |    min |
|                       Cumulative flush time of primary shards |                     |       2.306 |     1.89908 | -0.40692 |    min |
|                      Cumulative flush count of primary shards |                     |          11 |          12 |        1 |        |
|                Min cumulative flush time across primary shard |                     |           0 |           0 |        0 |    min |
|             Median cumulative flush time across primary shard |                     |       1.153 |    0.949542 | -0.20346 |    min |
|                Max cumulative flush time across primary shard |                     |       2.306 |     1.89908 | -0.40692 |    min |
|                                            Total Young Gen GC |                     |        0.88 |       0.689 |   -0.191 |      s |
|                                              Total Old Gen GC |                     |           0 |           0 |        0 |      s |
|                                                    Store size |                     |     7.76016 |      8.6953 |  0.93514 |     GB |
|                                                 Translog size |                     | 1.02445e-07 | 1.02445e-07 |        0 |     GB |
|                                        Heap used for segments |                     |   0.0902176 |    0.108299 |  0.01808 |     MB |
|                                      Heap used for doc values |                     |   0.0259933 |   0.0264435 |  0.00045 |     MB |
|                                           Heap used for terms |                     |   0.0432129 |   0.0553589 |  0.01215 |     MB |
|                                           Heap used for norms |                     |           0 |           0 |        0 |     MB |
|                                          Heap used for points |                     |           0 |           0 |        0 |     MB |
|                                   Heap used for stored fields |                     |   0.0210114 |   0.0264969 |  0.00549 |     MB |
|                                                 Segment count |                     |          36 |          32 |       -4 |        |
|                                                Min Throughput |               index |     98282.7 |     92858.8 | -5423.87 | docs/s |
|                                             Median Throughput |               index |      100017 |      115730 |  15713.4 | docs/s |
|                                                Max Throughput |               index |      242632 |      156386 | -86246.3 | docs/s |
|                                       50th percentile latency |               index |     89.9578 |     92.2309 |  2.27314 |     ms |
|                                       90th percentile latency |               index |      649.19 |     652.938 |  3.74808 |     ms |
|                                       99th percentile latency |               index |     1412.24 |     1355.37 | -56.8697 |     ms |
|                                     99.9th percentile latency |               index |     5064.39 |     4593.95 | -470.443 |     ms |
|                                    99.99th percentile latency |               index |     5616.33 |      5507.4 |  -108.93 |     ms |
|                                      100th percentile latency |               index |     5921.07 |     6087.86 |  166.791 |     ms |
|                                  50th percentile service time |               index |     89.9578 |     92.2309 |  2.27314 |     ms |
|                                  90th percentile service time |               index |      649.19 |     652.938 |  3.74808 |     ms |
|                                  99th percentile service time |               index |     1412.24 |     1355.37 | -56.8697 |     ms |
|                                99.9th percentile service time |               index |     5064.39 |     4593.95 | -470.443 |     ms |
|                               99.99th percentile service time |               index |     5616.33 |      5507.4 |  -108.93 |     ms |
|                                 100th percentile service time |               index |     5921.07 |     6087.86 |  166.791 |     ms |
|                                                    error rate |               index |     80.1289 |     74.8857 | -5.24328 |      % |
|                                                Min Throughput |             default |     3.02016 |     3.02014 |   -2e-05 |  ops/s |
|                                             Median Throughput |             default |     3.02991 |     3.02987 |   -4e-05 |  ops/s |
|                                                Max Throughput |             default |     3.05805 |     3.05801 |   -4e-05 |  ops/s |
|                                       50th percentile latency |             default |     4.28662 |     4.48204 |  0.19542 |     ms |
|                                       90th percentile latency |             default |      4.4747 |     4.68339 |   0.2087 |     ms |
|                                       99th percentile latency |             default |     5.28046 |     5.59364 |  0.31318 |     ms |
|                                      100th percentile latency |             default |     5.45401 |     6.08082 |  0.62681 |     ms |
|                                  50th percentile service time |             default |       3.945 |     4.14404 |  0.19904 |     ms |
|                                  90th percentile service time |             default |     4.13767 |     4.34255 |  0.20488 |     ms |
|                                  99th percentile service time |             default |     4.94238 |     5.25597 |  0.31359 |     ms |
|                                 100th percentile service time |             default |      5.1136 |     5.74084 |  0.62724 |     ms |
|                                                    error rate |             default |           0 |           0 |        0 |      % |
|                                                Min Throughput |               range |    0.703838 |     0.70406 |  0.00022 |  ops/s |
|                                             Median Throughput |               range |    0.705763 |    0.706075 |  0.00031 |  ops/s |
|                                                Max Throughput |               range |     0.71209 |    0.712078 |   -1e-05 |  ops/s |
|                                       50th percentile latency |               range |     256.054 |     193.034 | -63.0194 |     ms |
|                                       90th percentile latency |               range |     270.379 |     193.655 | -76.7243 |     ms |
|                                       99th percentile latency |               range |      282.01 |     193.938 | -88.0715 |     ms |
|                                      100th percentile latency |               range |     290.636 |     193.996 | -96.6406 |     ms |
|                                  50th percentile service time |               range |     254.869 |     191.784 | -63.0847 |     ms |
|                                  90th percentile service time |               range |     269.212 |     192.401 | -76.8114 |     ms |
|                                  99th percentile service time |               range |      280.83 |     192.689 | -88.1409 |     ms |
|                                 100th percentile service time |               range |     289.466 |     192.745 |  -96.721 |     ms |
|                                                    error rate |               range |           0 |           0 |        0 |      % |
|                                                Min Throughput | distance_amount_agg |     2.01342 |     2.01342 |        0 |  ops/s |
|                                             Median Throughput | distance_amount_agg |     2.02006 |     2.02006 |       -0 |  ops/s |
|                                                Max Throughput | distance_amount_agg |      2.0397 |     2.03971 |    1e-05 |  ops/s |
|                                       50th percentile latency | distance_amount_agg |      3.4459 |     3.41906 | -0.02685 |     ms |
|                                       90th percentile latency | distance_amount_agg |     3.69374 |     3.59011 | -0.10363 |     ms |
|                                       99th percentile latency | distance_amount_agg |     4.33156 |     4.06222 | -0.26933 |     ms |
|                                      100th percentile latency | distance_amount_agg |     4.37477 |     4.36694 | -0.00784 |     ms |
|                                  50th percentile service time | distance_amount_agg |     2.93646 |     2.91249 | -0.02397 |     ms |
|                                  90th percentile service time | distance_amount_agg |     3.18146 |     3.08136 |  -0.1001 |     ms |
|                                  99th percentile service time | distance_amount_agg |     3.81911 |      3.5584 | -0.26071 |     ms |
|                                 100th percentile service time | distance_amount_agg |     3.87012 |     3.86308 | -0.00704 |     ms |
|                                                    error rate | distance_amount_agg |           0 |           0 |        0 |      % |
|                                                Min Throughput |       autohisto_agg |     1.50389 |     1.50469 |  0.00079 |  ops/s |
|                                             Median Throughput |       autohisto_agg |     1.50581 |     1.50698 |  0.00116 |  ops/s |
|                                                Max Throughput |       autohisto_agg |     1.51119 |     1.51353 |  0.00235 |  ops/s |
|                                       50th percentile latency |       autohisto_agg |     408.417 |     357.057 | -51.3594 |     ms |
|                                       90th percentile latency |       autohisto_agg |     409.489 |      358.31 | -51.1792 |     ms |
|                                       99th percentile latency |       autohisto_agg |     412.245 |      360.08 |  -52.165 |     ms |
|                                      100th percentile latency |       autohisto_agg |     455.638 |     360.609 | -95.0292 |     ms |
|                                  50th percentile service time |       autohisto_agg |     408.145 |     356.736 | -51.4087 |     ms |
|                                  90th percentile service time |       autohisto_agg |     409.221 |     357.989 | -51.2313 |     ms |
|                                  99th percentile service time |       autohisto_agg |     411.976 |     359.761 | -52.2145 |     ms |
|                                 100th percentile service time |       autohisto_agg |     455.367 |     360.288 | -95.0798 |     ms |
|                                                    error rate |       autohisto_agg |           0 |           0 |        0 |      % |
|                                                Min Throughput |  date_histogram_agg |     1.50117 |     1.50123 |    6e-05 |  ops/s |
|                                             Median Throughput |  date_histogram_agg |     1.50155 |     1.50164 |    9e-05 |  ops/s |
|                                                Max Throughput |  date_histogram_agg |     1.50232 |     1.50246 |  0.00014 |  ops/s |
|                                       50th percentile latency |  date_histogram_agg |     148.683 |     120.112 | -28.5717 |     ms |
|                                       90th percentile latency |  date_histogram_agg |     149.535 |     120.498 | -29.0374 |     ms |
|                                       99th percentile latency |  date_histogram_agg |     150.108 |     120.891 | -29.2163 |     ms |
|                                      100th percentile latency |  date_histogram_agg |     150.238 |     122.718 | -27.5205 |     ms |
|                                  50th percentile service time |  date_histogram_agg |     148.163 |     119.555 | -28.6081 |     ms |
|                                  90th percentile service time |  date_histogram_agg |     149.015 |     119.938 | -29.0775 |     ms |
|                                  99th percentile service time |  date_histogram_agg |     149.579 |     120.328 | -29.2505 |     ms |
|                                 100th percentile service time |  date_histogram_agg |     149.709 |      122.16 | -27.5492 |     ms |
|                                                    error rate |  date_histogram_agg |           0 |           0 |        0 |      % |


-------------------------------
[INFO] SUCCESS (took 0 seconds)
-------------------------------

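The difference was not as large as in our in-house case, but the results do show faster responses with the 64GB setting.

Summary

I explained the 32GB myth around the Elasticsearch JVM heap
Based on our own case, I considered whether it is acceptable to go beyond 32GB and, if so, under what conditions
Evaluating this with the official benchmark tool confirmed that response times improved with 64GB

Closing

The background to this article is a real experience. Our in-house load tests had been producing poor results for a long time, with nothing to show for the effort, when I suddenly had the revelation: "Why not reduce the amount of data in the index and cache all of it in memory?" When we actually tried it, performance improved at a stroke and we had our breakthrough.

I hope this article is of some help to anyone who takes on the 32GB myth in the future.

In reality, there is no guarantee that the results will justify the cost of machines with 128GB or more of memory, so this is not an approach I can recommend unconditionally. If you can provide an environment with plenty of memory, though, I think it is an option worth considering.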
