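Keystone is too slow
Every OpenStack service call triggers Keystone's authentication process behind the scenes. Speeding up Keystone should therefore make the whole OpenStack environment more pleasant to use, so I tried a few things.
I did not reach a solid conclusion.
Base environment
A Vagrantfile for running Ubuntu 20.04 on libvirt.
Simple preparation steps are handled by a provisioning script.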
# Provisioning script: update the system and install the packages used below.
$script = <<END
apt update
apt upgrade -y
apt install memcached keystone python3-openstackclient -y
END

Vagrant.configure("2") do |config|
  config.vm.define "master" do |host|
    host.vm.box = "generic/ubuntu2004"
    host.vm.provision :shell, inline: $script
    # VM resources for the libvirt provider
    host.vm.provider "libvirt" do |vb|
      vb.memory = 8192
      vb.cpus = 8
    end
  end
end
Software versions
vagrant@ubuntu2004:~$ dpkg -l | grep -P "(mysql)|(memcache)|(keystone)"
ii keystone 2:17.0.0-0ubuntu0.20.04.1 all OpenStack identity service - Daemons
ii mysql-server-8.0 8.0.22-0ubuntu0.20.04.2 amd64 MySQL database server binaries and system database setup
ii memcached 1.5.22-2ubuntu0.1 amd64 High-performance in-memory object caching system
First, I installed Keystone by following the official documentation.
https://docs.openstack.org/keystone/ussuri/install/keystone-install-ubuntu.html
Note: memcached is not used at this stage.
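Where does the time go?
Adding the --timing option to the openstack command shows the time spent on each API call.
The two requests to tokens take about 0.3 s each, so they alone account for roughly 0.65 s of the 0.73 s total.
(I wonder why tokens is called twice.)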
root@ubuntu2004:~# openstack project list --timing
+----------------------------------+-------+
| ID | Name |
+----------------------------------+-------+
| 973e028a13184bf585915d1dbcb8bd69 | admin |
+----------------------------------+-------+
+-------------------------------------------+--------------------+
| URL | Seconds |
+-------------------------------------------+--------------------+
| GET http://localhost:5000/v3 | 0.003235 |
| POST http://localhost:5000/v3/auth/tokens | 0.333022 |
| POST http://localhost:5000/v3/auth/tokens | 0.313014 |
| GET http://localhost:5000/v3/projects | 0.079824 |
| Total | 0.7290949999999999 |
+-------------------------------------------+--------------------+
root@ubuntu2004:~#
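To put a number on how much of the run is spent on token requests, the timing rows can be summed with a few lines of Python. This is only a sketch: the rows are copied from the table above, and the helper names (parse_timing, share_of) are my own.

```python
import re

# Rows copied from the --timing table above.
TIMING_OUTPUT = """
| GET http://localhost:5000/v3              | 0.003235 |
| POST http://localhost:5000/v3/auth/tokens | 0.333022 |
| POST http://localhost:5000/v3/auth/tokens | 0.313014 |
| GET http://localhost:5000/v3/projects     | 0.079824 |
"""

# One table row: "| METHOD URL | SECONDS |"
ROW = re.compile(r"^\|\s*(GET|POST|PUT|PATCH|DELETE|HEAD)\s+(\S+)\s*\|\s*([0-9.]+)\s*\|$")


def parse_timing(text):
    """Return a list of (method, url, seconds) tuples from --timing rows."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            method, url, seconds = match.groups()
            rows.append((method, url, float(seconds)))
    return rows


def share_of(rows, suffix):
    """Return (seconds, fraction of total) for URLs ending with suffix."""
    total = sum(seconds for _, _, seconds in rows)
    matched = sum(seconds for _, url, seconds in rows if url.endswith(suffix))
    return matched, matched / total


rows = parse_timing(TIMING_OUTPUT)
token_seconds, token_share = share_of(rows, "/auth/tokens")
print(f"tokens: {token_seconds:.3f}s ({token_share:.0%} of total)")
```

For this run the two token requests add up to about 0.646 s, close to 89% of the total, which is why the rest of this post focuses on token issuance.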
Using JMeter, I sent 100 requests to /tokens over 10 seconds.
The machine running JMeter is underpowered, so treat the numbers as a rough reference, but the average is 0.263 s per request.
Starting the test @ Sat Oct 31 12:26:14 JST 2020 (1604114774995)
Waiting for possible shutdown message on port 4445
summary = 100 in 12s = 8.6/s Avg: 263 Min: 249 Max: 313 Err: 0 (0.00%)
Tidying up ... @ Sat Oct 31 12:26:26 JST 2020 (1604114786675)
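If JMeter is not at hand, a rough equivalent can be sketched with the Python standard library. This is not the JMeter test plan I used: it sends requests sequentially from a single client, and the user name, password placeholder (ADMIN_PASS), project, and domain are assumptions that must match your own installation.

```python
import json
import statistics
import time
import urllib.request


def build_token_request(url, user, password, project, domain="Default"):
    """Build a Keystone v3 password-auth request scoped to a project."""
    body = {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": user,
                        "domain": {"name": domain},
                        "password": password,
                    }
                },
            },
            "scope": {"project": {"name": project, "domain": {"name": domain}}},
        }
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def measure(send, count):
    """Call send() count times and return per-call latencies in milliseconds."""
    latencies = []
    for _ in range(count):
        start = time.perf_counter()
        send()
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies


def summarize(latencies):
    """Return average, minimum, and maximum latency."""
    return {"avg": statistics.mean(latencies), "min": min(latencies), "max": max(latencies)}


def main():
    # Assumed endpoint and credentials; adjust before running.
    request = build_token_request(
        "http://localhost:5000/v3/auth/tokens", "admin", "ADMIN_PASS", "admin"
    )

    def send():
        with urllib.request.urlopen(request) as response:
            response.read()

    print(summarize(measure(send, 100)))
```

Calling main() on the Keystone host prints average, minimum, and maximum latency in milliseconds, which is comparable in spirit to the Avg/Min/Max columns of the JMeter summary line.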
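Trying memcached
I added settings so that keystone uses memcached.
I was not sure how much memcached would help with token issuance, but it does seem to improve things slightly (0.231 s on average).
The cache backend setting in keystone.conf was as follows.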
backend = dogpile.cache.memcached
Starting the test @ Sat Oct 31 12:31:27 JST 2020 (1604115087396)
Waiting for possible shutdown message on port 4445
summary + 19 in 3s = 7.2/s Avg: 232 Min: 222 Max: 284 Err: 0 (0.00%) Active: 2 Started: 3 Finished: 1
summary + 81 in 9s = 9.3/s Avg: 231 Min: 218 Max: 238 Err: 0 (0.00%) Active: 0 Started: 10 Finished: 10
summary = 100 in 11.3s = 8.8/s Avg: 231 Min: 218 Max: 284 Err: 0 (0.00%)
Tidying up ... @ Sat Oct 31 12:31:38 JST 2020 (1604115098736)
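Only the backend line is shown above. As a sketch of what the whole [cache] section might look like, the snippet below renders it with configparser. The enabled and memcache_servers values are assumptions (memcached on the same host, default port), not copied from my keystone.conf.

```python
import configparser
import io


def render_cache_section(backend, servers="localhost:11211"):
    """Render a [cache] section for keystone.conf as text."""
    config = configparser.ConfigParser()
    config["cache"] = {
        "enabled": "true",            # assumption: caching switched on globally
        "backend": backend,
        "memcache_servers": servers,  # assumption: local memcached, default port
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


print(render_cache_section("dogpile.cache.memcached"))
```

Passing a different backend name to render_cache_section produces the section for that backend, so the same helper covers any variant being compared.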
I also tried the following backend setting.
backend = oslo_cache.memcache_pool
The result looks almost the same as with dogpile.cache.memcached.
Starting the test @ Sat Oct 31 12:36:02 JST 2020 (1604115362429)
Waiting for possible shutdown message on port 4445
summary + 1 in 0.5s = 2.2/s Avg: 280 Min: 280 Max: 280 Err: 0 (0.00%) Active: 1 Started: 1 Finished: 0
summary + 99 in 11s = 9.1/s Avg: 230 Min: 219 Max: 239 Err: 0 (0.00%) Active: 0 Started: 10 Finished: 10
summary = 100 in 11.3s = 8.9/s Avg: 230 Min: 219 Max: 280 Err: 0 (0.00%)
Tidying up ... @ Sat Oct 31 12:36:13 JST 2020 (1604115373754)
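To compare the three runs side by side, the final "summary =" lines from the JMeter logs can be parsed and turned into a relative improvement. A small sketch; the regular expression and the run labels are my own.

```python
import re

# Final summary lines copied from the three JMeter runs above.
RUNS = {
    "no cache": "summary = 100 in 12s = 8.6/s Avg: 263 Min: 249 Max: 313 Err: 0 (0.00%)",
    "dogpile.cache.memcached": "summary = 100 in 11.3s = 8.8/s Avg: 231 Min: 218 Max: 284 Err: 0 (0.00%)",
    "oslo_cache.memcache_pool": "summary = 100 in 11.3s = 8.9/s Avg: 230 Min: 219 Max: 280 Err: 0 (0.00%)",
}

SUMMARY = re.compile(
    r"summary =\s+(\d+) in\s+([\d.]+)s =\s+([\d.]+)/s "
    r"Avg:\s+(\d+) Min:\s+(\d+) Max:\s+(\d+)"
)


def average_ms(line):
    """Extract the Avg value (milliseconds) from a JMeter summary line."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError(f"not a summary line: {line!r}")
    return int(match.group(4))


def improvement_percent(baseline_ms, new_ms):
    """Relative latency reduction compared with the baseline, in percent."""
    return (baseline_ms - new_ms) / baseline_ms * 100


baseline = average_ms(RUNS["no cache"])
for name, line in RUNS.items():
    avg = average_ms(line)
    print(f"{name}: {avg} ms ({improvement_percent(baseline, avg):.1f}% faster)")
```

Both cache backends come out roughly 12% faster than the run without memcached, and the gap between the two backends is within 1 ms, which is well inside the noise of this setup.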
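Difference between dogpile.cache.memcached and oslo_cache.memcache_pool
The configuration file documentation contains the passage quoted below.
oslo_cache.memcache_pool is recommended for eventlet-based deployments and for environments with hundreds of threaded servers; for smaller environments the non-pooled backends are recommended.
As the name suggests, the difference is presumably whether connections to memcached are pooled.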
Cache backend module. For eventlet-based or environments with hundreds of threaded servers,
Memcache with pooling (oslo_cache.memcache_pool) is recommended.
For environments with less than 100 threaded servers,
Memcached (dogpile.cache.memcached) or Redis (dogpile.cache.redis) is recommended.
Test environments with a single instance of the server can use the dogpile.cache.memory backend.
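Incidentally, the pool_maxsize setting (memcache_pool_maxsize in the [cache] section) appears to control the size of the connection pool, so adjusting it may help in some environments. (In this experiment, however, only keystone is running, so the goal is to speed up a single request rather than to gain from MySQL or memcached pool sizes.)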
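To illustrate what pooling buys in general, here is a toy connection pool. It is a conceptual sketch only, not how oslo.cache implements its pool: the ConnectionPool class is hypothetical, and unlike a real pool it only bounds the number of idle connections kept for reuse.

```python
import queue


class ConnectionPool:
    """Toy pool: reuse idle connections instead of opening a new one each time."""

    def __init__(self, factory, maxsize):
        self._factory = factory                   # callable that opens a connection
        self._idle = queue.LifoQueue(maxsize)     # at most maxsize idle connections
        self.created = 0                          # how many connections were opened

    def acquire(self):
        try:
            return self._idle.get_nowait()        # reuse an idle connection
        except queue.Empty:
            self.created += 1
            return self._factory()                # none idle: open a new one

    def release(self, connection):
        try:
            self._idle.put_nowait(connection)     # keep it for the next caller
        except queue.Full:
            pass                                  # pool already full: drop it


# object() stands in for a real memcached connection.
pool = ConnectionPool(factory=object, maxsize=2)
for _ in range(10):
    connection = pool.acquire()
    pool.release(connection)
print(pool.created)
```

With sequential requests like these, only one connection is ever opened, so the pool size does not matter. That matches the expectation that pool tuning helps under concurrency rather than for single-request latency.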
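Access to MySQL
I do not know which part of the token issuance code takes the time, so I started with MySQL, which looks the easiest to tune.
Capturing traffic with tcpdump and Wireshark shows that token issuance accesses MySQL, so I looked at the queries it sends. They are all SELECT statements, so improving read speed should make things faster. Several use ORDER BY, so there may also be room to improve sort performance.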
SELECT user.enabled AS user_enabled, user.id AS user_id, user.domain_id AS user_domain_id, user.extra AS user_extra, user.default_project_id AS user_default_project_id, user.created_at AS user_created_at, user.last_active_at AS user_last_active_at, password_1.created_at AS password_1_created_at, password_1.expires_at AS password_1_expires_at, password_1.id AS password_1_id, password_1.local_user_id AS password_1_local_user_id, password_1.password_hash AS password_1_password_hash, password_1.created_at_int AS password_1_created_at_int, password_1.expires_at_int AS password_1_expires_at_int, password_1.self_service AS password_1_self_service, local_user_1.id AS local_user_1_id, local_user_1.user_id AS local_user_1_user_id, local_user_1.domain_id AS local_user_1_domain_id, local_user_1.name AS local_user_1_name, local_user_1.failed_auth_count AS local_user_1_failed_auth_count, local_user_1.failed_auth_at AS local_user_1_failed_auth_at, federated_user_1.id AS federated_user_1_id, federated_user_1.user_id AS federated_user_1_user_id, federated_user_1.idp_id AS federated_user_1_idp_id, federated_user_1.protocol_id AS federated_user_1_protocol_id, federated_user_1.unique_id AS federated_user_1_unique_id, federated_user_1.display_name AS federated_user_1_display_name, nonlocal_user_1.domain_id AS nonlocal_user_1_domain_id, nonlocal_user_1.name AS nonlocal_user_1_name, nonlocal_user_1.user_id AS nonlocal_user_1_user_id
FROM user LEFT OUTER JOIN local_user AS local_user_1 ON user.id = local_user_1.user_id AND user.domain_id = local_user_1.domain_id LEFT OUTER JOIN password AS password_1 ON local_user_1.id = password_1.local_user_id LEFT OUTER JOIN federated_user AS federated_user_1 ON user.id = federated_user_1.user_id LEFT OUTER JOIN nonlocal_user AS nonlocal_user_1 ON user.domain_id = nonlocal_user_1.domain_id AND user.id = nonlocal_user_1.user_id
WHERE user.id = 'e901f4f3d5544817bf89c1946e4ed419' ORDER BY password_1.created_at_int
SELECT user_option.user_id AS user_option_user_id, user_option.option_id AS user_option_option_id, user_option.option_value AS user_option_option_value, anon_1.user_id AS anon_1_user_id
FROM (SELECT user.id AS user_id
FROM user
WHERE user.id = 'e901f4f3d5544817bf89c1946e4ed419') AS anon_1 INNER JOIN user_option ON anon_1.user_id = user_option.user_id ORDER BY anon_1.user_id
SELECT user.enabled AS user_enabled, user.id AS user_id, user.domain_id AS user_domain_id, user.extra AS user_extra, user.default_project_id AS user_default_project_id, user.created_at AS user_created_at, user.last_active_at AS user_last_active_at, password_1.created_at AS password_1_created_at, password_1.expires_at AS password_1_expires_at, password_1.id AS password_1_id, password_1.local_user_id AS password_1_local_user_id, password_1.password_hash AS password_1_password_hash, password_1.created_at_int AS password_1_created_at_int, password_1.expires_at_int AS password_1_expires_at_int, password_1.self_service AS password_1_self_service, local_user_1.id AS local_user_1_id, local_user_1.user_id AS local_user_1_user_id, local_user_1.domain_id AS local_user_1_domain_id, local_user_1.name AS local_user_1_name, local_user_1.failed_auth_count AS local_user_1_failed_auth_count, local_user_1.failed_auth_at AS local_user_1_failed_auth_at, federated_user_1.id AS federated_user_1_id, federated_user_1.user_id AS federated_user_1_user_id, federated_user_1.idp_id AS federated_user_1_idp_id, federated_user_1.protocol_id AS federated_user_1_protocol_id, federated_user_1.unique_id AS federated_user_1_unique_id, federated_user_1.display_name AS federated_user_1_display_name, nonlocal_user_1.domain_id AS nonlocal_user_1_domain_id, nonlocal_user_1.name AS nonlocal_user_1_name, nonlocal_user_1.user_id AS nonlocal_user_1_user_id
FROM user LEFT OUTER JOIN local_user AS local_user_1 ON user.id = local_user_1.user_id AND user.domain_id = local_user_1.domain_id LEFT OUTER JOIN password AS password_1 ON local_user_1.id = password_1.local_user_id LEFT OUTER JOIN federated_user AS federated_user_1 ON user.id = federated_user_1.user_id LEFT OUTER JOIN nonlocal_user AS nonlocal_user_1 ON user.domain_id = nonlocal_user_1.domain_id AND user.id = nonlocal_user_1.user_id
WHERE user.id = 'e901f4f3d5544817bf89c1946e4ed419' ORDER BY password_1.created_at_int
SELECT user_option.user_id AS user_option_user_id, user_option.option_id AS user_option_option_id, user_option.option_value AS user_option_option_value, anon_1.user_id AS anon_1_user_id
FROM (SELECT user.id AS user_id
FROM user
WHERE user.id = 'e901f4f3d5544817bf89c1946e4ed419') AS anon_1 INNER JOIN user_option ON anon_1.user_id = user_option.user_id ORDER BY anon_1.user_id
SELECT revocation_event.id AS revocation_event_id, revocation_event.domain_id AS revocation_event_domain_id, revocation_event.project_id AS revocation_event_project_id, revocation_event.user_id AS revocation_event_user_id, revocation_event.role_id AS revocation_event_role_id, revocation_event.trust_id AS revocation_event_trust_id, revocation_event.consumer_id AS revocation_event_consumer_id, revocation_event.access_token_id AS revocation_event_access_token_id, revocation_event.issued_before AS revocation_event_issued_before, revocation_event.expires_at AS revocation_event_expires_at, revocation_event.revoked_at AS revocation_event_revoked_at, revocation_event.audit_id AS revocation_event_audit_id, revocation_event.audit_chain_id AS revocation_event_audit_chain_id
FROM revocation_event
WHERE revocation_event.issued_before >= '2020-10-31 09:13:17' AND (revocation_event.user_id IS NULL OR revocation_event.user_id = 'e901f4f3d5544817bf89c1946e4ed419') AND (revocation_event.project_id IS NULL OR revocation_event.project_id = '973e028a13184bf585915d1dbcb8bd69') AND (revocation_event.audit_id IS NULL OR revocation_event.audit_id = '8QLRph58TVuNZ-DwS_m1Qw')
SELECT project.id AS project_id, project.name AS project_name, project.domain_id AS project_domain_id, project.description AS project_description, project.enabled AS project_enabled, project.extra AS project_extra, project.parent_id AS project_parent_id, project.is_domain AS project_is_domain
FROM project
WHERE project.id != '<<keystone.domain.root>>' AND project.is_domain = false
SELECT project_tag.project_id AS project_tag_project_id, project_tag.name AS project_tag_name, anon_1.project_id AS anon_1_project_id
FROM (SELECT project.id AS project_id
FROM project
WHERE project.id != '<<keystone.domain.root>>' AND project.is_domain = false) AS anon_1 INNER JOIN project_tag ON project_tag.project_id = anon_1.project_id ORDER BY anon_1.project_id
SELECT project_option.project_id AS project_option_project_id, project_option.option_id AS project_option_option_id, project_option.option_value AS project_option_option_value, anon_1.project_id AS anon_1_project_id
FROM (SELECT project.id AS project_id
FROM project
WHERE project.id != '<<keystone.domain.root>>' AND project.is_domain = false) AS anon_1 INNER JOIN project_option ON anon_1.project_id = project_option.project_id ORDER BY anon_1.project_id
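The capture is hard to read at a glance, so a quick way to summarize it is to count which tables are touched and which statements repeat. The sketch below uses abbreviated versions of the statements above (column lists shortened to "..." and literals replaced by placeholders); the helper names are my own.

```python
import re
from collections import Counter

# Abbreviated forms of some captured statements; "..." and <placeholders>
# stand in for the long column lists and literal values shown above.
CAPTURED = [
    "SELECT ... FROM user LEFT OUTER JOIN local_user AS local_user_1 ON ... "
    "LEFT OUTER JOIN password AS password_1 ON ... WHERE user.id = '<id>' "
    "ORDER BY password_1.created_at_int",
    "SELECT ... FROM user LEFT OUTER JOIN local_user AS local_user_1 ON ... "
    "LEFT OUTER JOIN password AS password_1 ON ... WHERE user.id = '<id>' "
    "ORDER BY password_1.created_at_int",
    "SELECT ... FROM revocation_event WHERE revocation_event.issued_before >= '<ts>'",
    "SELECT ... FROM project WHERE project.is_domain = false",
]

# Table names that directly follow FROM or JOIN (subqueries are skipped).
TABLE = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_]+)", re.IGNORECASE)


def normalize(statement):
    """Collapse whitespace so identical statements compare equal."""
    return " ".join(statement.split())


def repeated(statements):
    """Return statements that appear more than once, with their counts."""
    counts = Counter(normalize(statement) for statement in statements)
    return {statement: n for statement, n in counts.items() if n > 1}


def tables_touched(statements):
    """Count how often each table is referenced after FROM or JOIN."""
    tables = Counter()
    for statement in statements:
        tables.update(TABLE.findall(statement))
    return tables


print(tables_touched(CAPTURED).most_common())
print(len(repeated(CAPTURED)), "statement(s) issued more than once")
```

In the capture above, the user lookup and the user_option lookup each appear twice with identical text, so repeated statements are one thing a summary like this makes visible.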
MySQL performance
The query cache was removed in MySQL 8.0, and MySQL fails to start if query cache settings are present in the configuration.
https://yakst.com/ja/posts/4612
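As an alternative, there appear to be attempts to get caching back by placing ProxySQL between the client and MySQL. Given the extra software to manage and the new learning cost, I am not sure it is worth it. For now I will try a simpler approach with mysqltuner.
Performance diagnosis with mysqltuner
There is a lot of MySQL performance tuning advice on the internet, but the topic looks deep (and the learning cost looks high...). To keep things simple, I ran mysqltuner for a quick diagnosis.
It is in the Ubuntu repository, so installing it with apt was easy.
It printed a lot of output; the following are the points it flagged as having room for improvement.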
-------- Recommendations ---------------------------------------------------------------------------
General recommendations:
Control warning line(s) into /var/log/mysql/error.log file
Control error line(s) into /var/log/mysql/error.log file
MySQL was started within the last 24 hours - recommendations may be inaccurate
Configure your accounts with ip or subnets only, then update your configuration with skip-name-resolve=1
Before changing innodb_log_file_size and/or innodb_log_files_in_group read this: https://bit.ly/2TcGgtU
Variables to adjust:
innodb_log_file_size should be (=16M) if possible, so InnoDB total log files size equals to 25% of buffer pool size.
root@ubuntu2004:~#
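The 16M figure follows from the rule mysqltuner states: the total InnoDB log size should be 25% of the buffer pool. A short worked calculation, assuming the MySQL 8.0 defaults of a 128 MiB buffer pool and 2 log files in the group (I did not check these values on the test VM):

```python
MIB = 1024 * 1024


def recommended_log_file_size(buffer_pool_bytes, files_in_group, ratio=0.25):
    """Size of each log file so that all files together equal ratio * buffer pool."""
    return int(buffer_pool_bytes * ratio) // files_in_group


# Assumed defaults: innodb_buffer_pool_size = 128M, innodb_log_files_in_group = 2
size = recommended_log_file_size(128 * MIB, 2)
print(f"innodb_log_file_size = {size // MIB}M")
```

Under those assumptions, 25% of 128 MiB is 32 MiB, split over 2 files, which gives the 16M that mysqltuner recommends.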
First, the easy one: disabling name resolution (skip-name-resolve). I did not expect much, and indeed nothing changed.
Starting the test @ Sat Oct 31 18:55:19 JST 2020 (1604138119110)
Waiting for possible shutdown message on port 4445
summary + 98 in 11s = 9.0/s Avg: 233 Min: 223 Max: 286 Err: 0 (0.00%) Active: 1 Started: 10 Finished: 9
summary + 2 in 0.5s = 4.4/s Avg: 224 Min: 222 Max: 226 Err: 0 (0.00%) Active: 0 Started: 10 Finished: 10
summary = 100 in 11.3s = 8.8/s Avg: 232 Min: 222 Max: 286 Err: 0 (0.00%)
Tidying up ... @ Sat Oct 31 18:55:30 JST 2020 (1604138130464)
... end of run
Next I set innodb_log_file_size to 16 MB. No change here either.
Starting the test @ Sat Oct 31 19:19:25 JST 2020 (1604139565449)
Waiting for possible shutdown message on port 4445
summary + 39 in 5s = 8.4/s Avg: 235 Min: 227 Max: 279 Err: 0 (0.00%) Active: 2 Started: 5 Finished: 3
summary + 61 in 7s = 9.1/s Avg: 234 Min: 227 Max: 243 Err: 0 (0.00%) Active: 0 Started: 10 Finished: 10
summary = 100 in 11.4s = 8.8/s Avg: 235 Min: 227 Max: 279 Err: 0 (0.00%)
Tidying up ... @ Sat Oct 31 19:19:36 JST 2020 (1604139576840)
ProxySQL
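In the end I tried ProxySQL as well.
I followed the article below for the configuration. Incidentally, it seems better to delete the configuration directly with rm; I had trouble because the keystone user was not recognized.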
https://qiita.com/bringer1092/items/7f2729ac83df92541e29
After that, I changed the database access port in keystone.conf to 6033 (via ProxySQL) and measured again.
The result is below.
Starting the test @ Sat Oct 31 20:33:14 JST 2020 (1604143994240)
Waiting for possible shutdown message on port 4445
summary = 100 in 11.4s = 8.8/s Avg: 235 Min: 224 Max: 286 Err: 0 (0.00%)
Tidying up ... @ Sat Oct 31 20:33:25 JST 2020 (1604144005631)
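Since this simply adds one more hop, I did not expect it to get faster, but it did not get slower either. For a realistic setup with clustered MySQL servers, it may be a better fit than HAProxy?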
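Pointing keystone at ProxySQL only means changing the port in the database connection URL. As a sketch, the helper below rewrites the port of a SQLAlchemy-style URL. The URL shown, including the KEYSTONE_DBPASS placeholder and the localhost host name, is an assumed example in the style of the install guide rather than my actual setting, and IPv6 hosts are not handled.

```python
from urllib.parse import urlsplit, urlunsplit


def with_port(connection_url, port):
    """Return connection_url with its port replaced (or added)."""
    parts = urlsplit(connection_url)
    # netloc looks like "user:password@host[:port]"
    userinfo, _, hostport = parts.netloc.rpartition("@")
    host = hostport.split(":", 1)[0]
    netloc = f"{userinfo}@{host}:{port}" if userinfo else f"{host}:{port}"
    return urlunsplit(parts._replace(netloc=netloc))


# Assumed example URL; 6033 is the port used for ProxySQL in this post.
direct = "mysql+pymysql://keystone:KEYSTONE_DBPASS@localhost/keystone"
print(with_port(direct, 6033))
```

Switching back to direct access is the same call with the original MySQL port, which makes before/after measurements easy to repeat.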
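Summary
Among the things that are easy to try, memcached had a clear effect (average 0.263 s down to 0.231 s). memcached is probably already in place in most deployments, so tuning MySQL and memcached next seems likely to pay off. Also, depending on why request handling is slow, measures such as
- increasing the number of keystone processes
- increasing the connection pool size
- putting a proxy server such as ProxySQL in front of the database
may also have a clear effect.
Improving the latency of a single request will be difficult without adding debug logging inside the program to find where the time is spent in finer detail.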
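As a starting point for that kind of instrumentation, a small timing context manager can be wrapped around suspect sections of code. This is a generic sketch, not Keystone code: the step names and the sleeps that stand in for real work are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# label -> list of elapsed seconds, one entry per timed block
TIMINGS = defaultdict(list)


@contextmanager
def timed(label):
    """Record how long the wrapped block takes under the given label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        TIMINGS[label].append(time.perf_counter() - start)


def report():
    """Return (label, total seconds) pairs, slowest first."""
    totals = ((label, sum(values)) for label, values in TIMINGS.items())
    return sorted(totals, key=lambda item: item[1], reverse=True)


# Hypothetical steps; time.sleep stands in for the real work being measured.
with timed("step_a"):
    time.sleep(0.02)
with timed("step_b"):
    time.sleep(0.001)

for label, seconds in report():
    print(f"{label}: {seconds * 1000:.1f} ms")
```

Sorting by total time puts the slowest section first, which is usually enough to decide where to look more closely.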