俺です。
MHA on EC2を実装して以下のテストを行いました。
テスト結果を残しておきます。
内容
- MHAのフェイルオーバーにクラウドデザインパターンのRouting-Based HAパターンを利用
- アプリケーションサーバはMySQLマスタに対するエンドポイントとしてRouteTableのVIPを指定することを想定
- MHA管理下のMySQLサーバ(Master)がfailした時にSlaveが自動昇格され、RouteTableの書き換えが発生することを確認する
- MHA管理下のMySQLサーバ(Master)を手動スイッチした時にSlaveが自動昇格され、RouteTableの書き換えが発生することを確認する
結果
- MySQLサーバがfail/手動スイッチした際にVIPを持つRouteTableの宛先インスタンスがMaster昇格したEC2になることが確認できた
課題
- AWS EC2のネットワークパフォーマンス障害発生時の挙動は未確認
- EBS障害によるパフォーマンス低下時の挙動は未確認
- アプリケーションサーバに書いているslaveのエンドポイント書き換えは行われないため、MySQL Master昇格からエンドポイント書き換え完了までの間、アプリケーションサーバはslaveではなくマスタを参照になる
構成図
MySQLマスタサーバへ接続するアプリケーションサーバが利用するルートテーブルに、
VIPと宛先となるMySQLマスタのEC2インスタンスIDを登録しておく
検証環境
- MHA Manager
項目 | 内容 |
---|---|
instance type | t2.large |
OS | Amazon Linux 2015.09 |
Perl | Perl 5.10(plenv) |
- MySQLサーバ
項目 | 内容 |
---|---|
instance type | t2.large |
OS | Amazon Linux 2015.09 |
MySQL version | MySQL 5.6 |
- Cluster情報
クラスタ | VIP | 内部IP | ホスト名 |
---|---|---|---|
MySQL Cluster A | 192.168.0.10 | 172.16.11.88 | ip-172-16-11-88 |
MySQL Cluster A | 192.168.0.10 | 172.16.10.175 | ip-172-16-10-175 |
MySQL Cluster A | 192.168.0.10 | 172.16.10.248 | ip-172-16-10-248 |
MySQL Cluster B | 192.168.10.100 | 172.16.10.135 | ip-172-16-10-135 |
MySQL Cluster B | 192.168.10.100 | 172.16.10.12 | ip-172-16-10-12 |
MySQL Cluster B | 192.168.10.100 | 172.16.11.12 | ip-172-16-11-12 |
確認
MHA Managerの起動
構築手順はググると色々できるので端折ります。
supervisordで制御しているMHA Managerを起動します
supervisordの設定
- /etc/supervisord.d/supervisord_mha_cluster_b.conf
[program:mha_cluster_a]
command=bash /root/mha_start.sh
environment=PLENV_ROOT=/root/.plenv,PATH="/root/.plenv/shims:/root/.plenv/bin:/usr/local/sbin:/usr/local/bin:$PATH"
user=root
directory=/root
stdout_logfile=/var/log/mha/mha_cluster_a.stdout.log
stderr_logfile=/var/log/mha/mha_cluster_a.stderr.log
numprocs=1
autostart=false
autorestart=unexpected
stopsignal=TERM
# default 3
startretries=2
retries = 1
- /etc/supervisord.d/supervisord_mha_cluster_b.conf
[program:mha_cluster_b]
command=bash /root/mha_cluster_b_start.sh
environment=PLENV_ROOT=/root/.plenv,PATH="/root/.plenv/shims:/root/.plenv/bin:/usr/local/sbin:/usr/local/bin:$PATH"
user=root
directory=/root
stdout_logfile=/var/log/mha/mha_cluster_b.stdout.log
stderr_logfile=/var/log/mha/mha_cluster_b.stderr.log
numprocs=1
autostart=false
autorestart=unexpected
stopsignal=TERM
# default 3
startretries=2
retries = 1
MHAのconfig
MySQL Slaveが2台以上ある場合、 no_master=1
を設定しておくことで昇格対象外のノードに指定できます。
MySQL Backup用Slaveとして管理できるので便利です。
- /etc/mha_cluster_a.conf
[server default]
user=mha
password=mhapassword
manager_workdir=/var/lib/mha/cluster_a
remote_workdir=/var/lib/mha/cluster_a
repl_user=repl
repl_password=repl
ssh_user=root
master_ip_failover_script="/root/mysqlfailgover --mysql_master_vip=192.168.0.10/32"
master_ip_online_change_script="/root/mysqlfailgover --mysql_master_vip=192.168.0.10/32"
[server1]
hostname=172.16.11.88
[server2]
hostname=172.16.10.175
[server3]
hostname=172.16.10.248
- /etc/mha_cluster_b.conf
[server default]
user=mha
password=mhapassword
manager_workdir=/var/lib/mha/cluster_b
remote_workdir=/var/lib/mha/cluster_b
repl_user=repl
repl_password=repl
ssh_user=root
master_ip_failover_script="/root/mysqlfailgover --mysql_master_vip=192.168.10.100/32"
master_ip_online_change_script="/root/mysqlfailgover --mysql_master_vip=192.168.10.100/32"
[server1]
hostname=172.16.10.135
[server2]
hostname=172.16.10.12
[server3]
hostname=172.16.11.12
MHA Managerの起動スクリプト
- /root/mha_start.sh
#!/bin/bash -x
export PATH="$HOME/.plenv/bin/:/usr/bin:/bin:$PATH"
echo `date` $PATH
eval "$(plenv init -)"
exec plenv exec masterha_manager --conf=/etc/mha_cluster_a.conf
exit $?
RouteTable書き換えのフェイルオーバースクリプト
Goで書いた俺俺バイナリを使います。
- MHAのフェイルオーバースクリプト実行引数のorig_master_ipから、切替前のインスタンスが登録されているRouteTable IDを取得する
- new_master_ip引数に渡された内部IPを持つEC2のInstance IDを新しいRouteTableの宛先として登録する
MySQLマスタの自動フェイルオーバー(EC2のstop)
stop-instancesでMySQL Masterとして稼働しているEC2を停止します。
切替前確認
- MySQL cluster AのMySQLマスタ確認
$ mysql -u mha -pmhapassword -h 192.168.0.10 -e "show variables like 'hostname'"
Warning: Using a password on the command line interface can be insecure.
+---------------+------------------+
| Variable_name | Value |
+---------------+------------------+
| hostname | ip-172-16-10-248 |
+---------------+------------------+
- MySQL cluster BのMySQLマスタ確認
$ mysql -u mha -pmhapassword -h 192.168.10.100 -e "show variables like 'hostname'"
Warning: Using a password on the command line interface can be insecure.
+---------------+-----------------+
| Variable_name | Value |
+---------------+-----------------+
| hostname | ip-172-16-11-12 |
+---------------+-----------------+
EC2停止
- ip-172-16-10-248を停止します
マスタを停止したホストへの接続確認
- MHAフェイルオーバー中
$ mysql -u mha -pmhapassword -h 192.168.0.10 -e "show variables like 'hostname'"
Warning: Using a password on the command line interface can be insecure.
ERROR 2003 (HY000): Can't connect to MySQL server on '192.168.0.10' (110)
- MHAフェイルオーバー後
停止したMySQL Cluster AのRouteTableが書き換えられたことでホスト名が変わりました。
$ mysql -u mha -pmhapassword -h 192.168.0.10 -e "show variables like 'hostname'"
Warning: Using a password on the command line interface can be insecure.
+---------------+------------------+
| Variable_name | Value |
+---------------+------------------+
| hostname | ip-172-16-10-175 |
+---------------+------------------+
なお、MHAで管理するMySQLクラスタのプロセスは分けているのでMySQL Cluser Bにはなにも影響ありません。
$ mysql -u mha -pmhapassword -h 192.168.10.100 -e "show variables like 'hostname'"
Warning: Using a password on the command line interface can be insecure.
+---------------+-----------------+
| Variable_name | Value |
+---------------+-----------------+
| hostname | ip-172-16-11-12 |
+---------------+-----------------+
- MHA Managerのログ
ヘルスチェックタイムアウト後に、マスタ昇格対象のノードが選定され、マスタ切替に成功しています。
Thu Jan 21 20:09:29 2016 - [info] Checking master_ip_failover_script status:
Thu Jan 21 20:09:29 2016 - [info] /root/mysqlfailgover --mysql_master_vip=192.168.0.10/32 --command=status --ssh_user=root --orig_master_host=172.16.10.248 --orig_master_ip=172.16.10.248 --orig_master_port=3306
Thu Jan 21 20:09:30 2016 - [info] OK.
Thu Jan 21 20:09:30 2016 - [warning] shutdown_script is not defined.
Thu Jan 21 20:09:30 2016 - [info] Set master ping interval 3 seconds.
Thu Jan 21 20:09:30 2016 - [warning] secondary_check_script is not defined. It is highly recommended setting it to check master reachability from two or more routes.
Thu Jan 21 20:09:30 2016 - [info] Starting ping health check on 172.16.10.248(172.16.10.248:3306)..
Thu Jan 21 20:09:30 2016 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
Thu Jan 21 20:16:27 2016 - [warning] Got error on MySQL select ping: 2013 (Lost connection to MySQL server during query)
Thu Jan 21 20:16:27 2016 - [info] Executing SSH check script: exit 0
ssh: connect to host 172.16.10.248 port 22: Connection refused
Thu Jan 21 20:16:27 2016 - [warning] HealthCheck: SSH to 172.16.10.248 is NOT reachable.
Thu Jan 21 20:16:33 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.10.248' (110))
Thu Jan 21 20:16:33 2016 - [warning] Connection failed 2 time(s)..
Thu Jan 21 20:16:36 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.10.248' (110))
Thu Jan 21 20:16:36 2016 - [warning] Connection failed 3 time(s)..
Thu Jan 21 20:16:39 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.10.248' (110))
Thu Jan 21 20:16:39 2016 - [warning] Connection failed 4 time(s)..
Thu Jan 21 20:16:39 2016 - [warning] Master is not reachable from health checker!
Thu Jan 21 20:16:39 2016 - [warning] Master 172.16.10.248(172.16.10.248:3306) is not reachable!
Thu Jan 21 20:16:39 2016 - [warning] SSH is NOT reachable.
Thu Jan 21 20:16:39 2016 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/mha_cluster_a.conf again, and trying to connect to all servers to check server status..
Thu Jan 21 20:16:39 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Jan 21 20:16:39 2016 - [info] Reading application default configuration from /etc/mha_cluster_a.conf..
Thu Jan 21 20:16:39 2016 - [info] Reading server configuration from /etc/mha_cluster_a.conf..
Thu Jan 21 20:16:40 2016 - [info] GTID failover mode = 1
Thu Jan 21 20:16:40 2016 - [info] Dead Servers:
Thu Jan 21 20:16:40 2016 - [info] 172.16.10.248(172.16.10.248:3306)
Thu Jan 21 20:16:40 2016 - [info] Alive Servers:
Thu Jan 21 20:16:40 2016 - [info] 172.16.11.88(172.16.11.88:3306)
Thu Jan 21 20:16:40 2016 - [info] 172.16.10.175(172.16.10.175:3306)
Thu Jan 21 20:16:40 2016 - [info] Alive Slaves:
Thu Jan 21 20:16:40 2016 - [info] 172.16.11.88(172.16.11.88:3306) Version=5.6.27-log (oldest major version between slaves) log-bin:enabled
Thu Jan 21 20:16:40 2016 - [info] GTID ON
Thu Jan 21 20:16:40 2016 - [info] Replicating from 172.16.10.248(172.16.10.248:3306)
Thu Jan 21 20:16:40 2016 - [info] Not candidate for the new Master (no_master is set)
Thu Jan 21 20:16:40 2016 - [info] 172.16.10.175(172.16.10.175:3306) Version=5.6.27-log (oldest major version between slaves) log-bin:enabled
Thu Jan 21 20:16:40 2016 - [info] GTID ON
Thu Jan 21 20:16:40 2016 - [info] Replicating from 172.16.10.248(172.16.10.248:3306)
Thu Jan 21 20:16:40 2016 - [info] Checking slave configurations..
Thu Jan 21 20:16:40 2016 - [info] read_only=1 is not set on slave 172.16.11.88(172.16.11.88:3306).
Thu Jan 21 20:16:40 2016 - [info] read_only=1 is not set on slave 172.16.10.175(172.16.10.175:3306).
Thu Jan 21 20:16:40 2016 - [info] Checking replication filtering settings..
Thu Jan 21 20:16:40 2016 - [info] Replication filtering check ok.
Thu Jan 21 20:16:40 2016 - [info] Master is down!
Thu Jan 21 20:16:40 2016 - [info] Terminating monitoring script.
Thu Jan 21 20:16:40 2016 - [info] Got exit code 20 (Master dead).
Thu Jan 21 20:16:40 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Jan 21 20:16:40 2016 - [info] Reading application default configuration from /etc/mha_cluster_a.conf..
Thu Jan 21 20:16:40 2016 - [info] Reading server configuration from /etc/mha_cluster_a.conf..
Thu Jan 21 20:16:40 2016 - [info] MHA::MasterFailover version 0.56.
Thu Jan 21 20:16:40 2016 - [info] Starting master failover.
Thu Jan 21 20:16:40 2016 - [info]
Thu Jan 21 20:16:40 2016 - [info] * Phase 1: Configuration Check Phase..
Thu Jan 21 20:16:40 2016 - [info]
Thu Jan 21 20:16:41 2016 - [info] GTID failover mode = 1
Thu Jan 21 20:16:41 2016 - [info] Dead Servers:
Thu Jan 21 20:16:41 2016 - [info] 172.16.10.248(172.16.10.248:3306)
Thu Jan 21 20:16:41 2016 - [info] Checking master reachability via MySQL(double check)...
Thu Jan 21 20:16:42 2016 - [info] ok.
Thu Jan 21 20:16:42 2016 - [info] Alive Servers:
Thu Jan 21 20:16:42 2016 - [info] 172.16.11.88(172.16.11.88:3306)
Thu Jan 21 20:16:42 2016 - [info] 172.16.10.175(172.16.10.175:3306)
Thu Jan 21 20:16:42 2016 - [info] Alive Slaves:
Thu Jan 21 20:16:42 2016 - [info] 172.16.11.88(172.16.11.88:3306) Version=5.6.27-log (oldest major version between slaves) log-bin:enabled
Thu Jan 21 20:16:42 2016 - [info] GTID ON
Thu Jan 21 20:16:42 2016 - [info] Replicating from 172.16.10.248(172.16.10.248:3306)
Thu Jan 21 20:16:42 2016 - [info] Not candidate for the new Master (no_master is set)
Thu Jan 21 20:16:42 2016 - [info] 172.16.10.175(172.16.10.175:3306) Version=5.6.27-log (oldest major version between slaves) log-bin:enabled
Thu Jan 21 20:16:42 2016 - [info] GTID ON
Thu Jan 21 20:16:42 2016 - [info] Replicating from 172.16.10.248(172.16.10.248:3306)
Thu Jan 21 20:16:42 2016 - [info] Starting GTID based failover.
Thu Jan 21 20:16:42 2016 - [info]
Thu Jan 21 20:16:42 2016 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Jan 21 20:16:42 2016 - [info]
Thu Jan 21 20:16:42 2016 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Jan 21 20:16:42 2016 - [info]
Thu Jan 21 20:16:42 2016 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Jan 21 20:16:42 2016 - [info] Executing master IP deactivation script:
Thu Jan 21 20:16:42 2016 - [info] /root/mysqlfailgover --mysql_master_vip=192.168.0.10/32 --orig_master_host=172.16.10.248 --orig_master_ip=172.16.10.248 --orig_master_port=3306 --command=stop
Thu Jan 21 20:16:42 2016 - [info] done.
Thu Jan 21 20:16:42 2016 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Jan 21 20:16:43 2016 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Jan 21 20:16:43 2016 - [info]
Thu Jan 21 20:16:43 2016 - [info] * Phase 3: Master Recovery Phase..
Thu Jan 21 20:16:43 2016 - [info]
Thu Jan 21 20:16:43 2016 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Jan 21 20:16:43 2016 - [info]
Thu Jan 21 20:16:43 2016 - [info] The latest binary log file/position on all slaves is mysqld-bin.000004:271
Thu Jan 21 20:16:43 2016 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Jan 21 20:16:43 2016 - [info] 172.16.11.88(172.16.11.88:3306) Version=5.6.27-log (oldest major version between slaves) log-bin:enabled
Thu Jan 21 20:16:43 2016 - [info] GTID ON
Thu Jan 21 20:16:43 2016 - [info] Replicating from 172.16.10.248(172.16.10.248:3306)
Thu Jan 21 20:16:43 2016 - [info] Not candidate for the new Master (no_master is set)
Thu Jan 21 20:16:43 2016 - [info] 172.16.10.175(172.16.10.175:3306) Version=5.6.27-log (oldest major version between slaves) log-bin:enabled
Thu Jan 21 20:16:43 2016 - [info] GTID ON
Thu Jan 21 20:16:43 2016 - [info] Replicating from 172.16.10.248(172.16.10.248:3306)
Thu Jan 21 20:16:43 2016 - [info] The oldest binary log file/position on all slaves is mysqld-bin.000004:271
Thu Jan 21 20:16:43 2016 - [info] Oldest slaves:
Thu Jan 21 20:16:43 2016 - [info] 172.16.11.88(172.16.11.88:3306) Version=5.6.27-log (oldest major version between slaves) log-bin:enabled
Thu Jan 21 20:16:43 2016 - [info] GTID ON
Thu Jan 21 20:16:43 2016 - [info] Replicating from 172.16.10.248(172.16.10.248:3306)
Thu Jan 21 20:16:43 2016 - [info] Not candidate for the new Master (no_master is set)
Thu Jan 21 20:16:43 2016 - [info] 172.16.10.175(172.16.10.175:3306) Version=5.6.27-log (oldest major version between slaves) log-bin:enabled
Thu Jan 21 20:16:43 2016 - [info] GTID ON
Thu Jan 21 20:16:43 2016 - [info] Replicating from 172.16.10.248(172.16.10.248:3306)
Thu Jan 21 20:16:43 2016 - [info]
Thu Jan 21 20:16:43 2016 - [info] * Phase 3.3: Determining New Master Phase..
Thu Jan 21 20:16:43 2016 - [info]
Thu Jan 21 20:16:43 2016 - [info] Searching new master from slaves..
Thu Jan 21 20:16:43 2016 - [info] Candidate masters from the configuration file:
Thu Jan 21 20:16:43 2016 - [info] Non-candidate masters:
Thu Jan 21 20:16:43 2016 - [info] 172.16.11.88(172.16.11.88:3306) Version=5.6.27-log (oldest major version between slaves) log-bin:enabled
Thu Jan 21 20:16:43 2016 - [info] GTID ON
Thu Jan 21 20:16:43 2016 - [info] Replicating from 172.16.10.248(172.16.10.248:3306)
Thu Jan 21 20:16:43 2016 - [info] Not candidate for the new Master (no_master is set)
Thu Jan 21 20:16:43 2016 - [info] Searching from all slaves which have received the latest relay log events..
Thu Jan 21 20:16:43 2016 - [info] New master is 172.16.10.175(172.16.10.175:3306)
Thu Jan 21 20:16:43 2016 - [info] Starting master failover..
Thu Jan 21 20:16:43 2016 - [info]
From:
172.16.10.248(172.16.10.248:3306) (current master)
+--172.16.11.88(172.16.11.88:3306)
+--172.16.10.175(172.16.10.175:3306)
To:
172.16.10.175(172.16.10.175:3306) (new master)
+--172.16.11.88(172.16.11.88:3306)
Thu Jan 21 20:16:43 2016 - [info]
Thu Jan 21 20:16:43 2016 - [info] * Phase 3.3: New Master Recovery Phase..
Thu Jan 21 20:16:43 2016 - [info]
Thu Jan 21 20:16:43 2016 - [info] Waiting all logs to be applied..
Thu Jan 21 20:16:43 2016 - [info] done.
Thu Jan 21 20:16:43 2016 - [info] Replicating from the latest slave 172.16.11.88(172.16.11.88:3306) and waiting to apply..
Thu Jan 21 20:16:43 2016 - [info] Waiting all logs to be applied on the latest slave..
Thu Jan 21 20:16:43 2016 - [info] Resetting slave 172.16.10.175(172.16.10.175:3306) and starting replication from the new master 172.16.11.88(172.16.11.88:3306)..
Thu Jan 21 20:16:43 2016 - [info] Executed CHANGE MASTER.
Thu Jan 21 20:16:43 2016 - [info] Slave started.
Thu Jan 21 20:16:43 2016 - [info] Waiting to execute all relay logs on 172.16.10.175(172.16.10.175:3306)..
Thu Jan 21 20:16:43 2016 - [info] master_pos_wait(mysqld-bin.000004:271) completed on 172.16.10.175(172.16.10.175:3306). Executed 2 events.
Thu Jan 21 20:16:43 2016 - [info] done.
Thu Jan 21 20:16:43 2016 - [info] done.
Thu Jan 21 20:16:43 2016 - [info] Getting new master's binlog name and position..
Thu Jan 21 20:16:43 2016 - [info] mysqld-bin.000003:271
Thu Jan 21 20:16:43 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='172.16.10.175', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Thu Jan 21 20:16:43 2016 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysqld-bin.000003, 271, 82ff7e88-bb8d-11e5-9486-0a4d175d9f7d:1-5,
83058435-bb8d-11e5-9486-0afb1c3e988d:1-27789,
83106a40-bb8d-11e5-9486-0a9fdce3260d:1-931
Thu Jan 21 20:16:43 2016 - [info] Executing master IP activate script:
Thu Jan 21 20:16:43 2016 - [info] /root/mysqlfailgover --mysql_master_vip=192.168.0.10/32 --command=start --ssh_user=root --orig_master_host=172.16.10.248 --orig_master_ip=172.16.10.248 --orig_master_port=3306 --new_master_host=172.16.10.175 --new_master_ip=172.16.10.175 --new_master_port=3306 --new_master_user='mha' --new_master_password='mhapassword'
Thu Jan 21 20:16:44 2016 - [info] OK.
Thu Jan 21 20:16:44 2016 - [info] ** Finished master recovery successfully.
Thu Jan 21 20:16:44 2016 - [info] * Phase 3: Master Recovery Phase completed.
Thu Jan 21 20:16:44 2016 - [info]
Thu Jan 21 20:16:44 2016 - [info] * Phase 4: Slaves Recovery Phase..
Thu Jan 21 20:16:44 2016 - [info]
Thu Jan 21 20:16:44 2016 - [info]
Thu Jan 21 20:16:44 2016 - [info] * Phase 4.1: Starting Slaves in parallel..
Thu Jan 21 20:16:44 2016 - [info]
Thu Jan 21 20:16:44 2016 - [info] -- Slave recovery on host 172.16.11.88(172.16.11.88:3306) started, pid: 22736. Check tmp log /var/lib/mha/cluster_a/172.16.11.88_3306_20160121201640.log if it takes time..
Thu Jan 21 20:16:45 2016 - [info]
Thu Jan 21 20:16:45 2016 - [info] Log messages from 172.16.11.88 ...
Thu Jan 21 20:16:45 2016 - [info]
Thu Jan 21 20:16:44 2016 - [info] Resetting slave 172.16.11.88(172.16.11.88:3306) and starting replication from the new master 172.16.10.175(172.16.10.175:3306)..
Thu Jan 21 20:16:44 2016 - [info] Executed CHANGE MASTER.
Thu Jan 21 20:16:44 2016 - [info] Slave started.
Thu Jan 21 20:16:44 2016 - [info] gtid_wait(82ff7e88-bb8d-11e5-9486-0a4d175d9f7d:1-5,
83058435-bb8d-11e5-9486-0afb1c3e988d:1-27789,
83106a40-bb8d-11e5-9486-0a9fdce3260d:1-931) completed on 172.16.11.88(172.16.11.88:3306). Executed 0 events.
Thu Jan 21 20:16:45 2016 - [info] End of log messages from 172.16.11.88.
Thu Jan 21 20:16:45 2016 - [info] -- Slave on host 172.16.11.88(172.16.11.88:3306) started.
Thu Jan 21 20:16:45 2016 - [info] All new slave servers recovered successfully.
Thu Jan 21 20:16:45 2016 - [info]
Thu Jan 21 20:16:45 2016 - [info] * Phase 5: New master cleanup phase..
Thu Jan 21 20:16:45 2016 - [info]
Thu Jan 21 20:16:45 2016 - [info] Resetting slave info on the new master..
Thu Jan 21 20:16:45 2016 - [info] 172.16.10.175: Resetting slave info succeeded.
Thu Jan 21 20:16:45 2016 - [info] Master failover to 172.16.10.175(172.16.10.175:3306) completed successfully.
Thu Jan 21 20:16:45 2016 - [info]
----- Failover Report -----
mha_cluster_a: MySQL Master failover 172.16.10.248(172.16.10.248:3306) to 172.16.10.175(172.16.10.175:3306) succeeded
Master 172.16.10.248(172.16.10.248:3306) is down!
Check MHA Manager logs at ip-172-16-3-23 for details.
Started automated(non-interactive) failover.
Invalidated master IP address on 172.16.10.248(172.16.10.248:3306)
Selected 172.16.10.175(172.16.10.175:3306) as a new master.
172.16.10.175(172.16.10.175:3306): OK: Applying all logs succeeded.
172.16.10.175(172.16.10.175:3306): OK: Activated master IP address.
172.16.11.88(172.16.11.88:3306): OK: Slave started, replicating from 172.16.10.175(172.16.10.175:3306)
172.16.10.175(172.16.10.175:3306): Resetting slave info succeeded.
Master failover to 172.16.10.175(172.16.10.175:3306) completed successfully.
復旧方法
以下の流れで自動フェイルオーバーして停止されたMHA MHA Managerは復旧できます。
- MySQL スレーブを追加する
- mhaの定義ファイルを修正する
- manager_workdirにあるcompleteファイルを消す
- MHA Managerを起動する
MySQLマスタの手動切替
MySQL Cluster Bのマスタを手動で切り替えます。
利用ケースとしては下記が考えられます。
- MySQLマスタがEC2リタイアメント対象になった際にオンラインor最小限のダウンタイムで回避する
- EC2ホストパフォーマンス低下の煽りを食らって、MHAのしきい値に引っかからないけどMasterパフォーマンスが低下している場合
手動切替の実行
- 切替前のMySQLマスタ確認
# mysql -u mha -pmhapassword -h 192.168.10.100 mha_cluster_b -e 'show variables like '\''hostname'\'''
+---------------+-----------------+
| Variable_name | Value |
+---------------+-----------------+
| hostname | ip-172-16-11-12 |
+---------------+-----------------+
- masterha_managerの停止
# /usr/local/bin/supervisorctl status
consul RUNNING pid 30208, uptime 0:27:44
mha_cluster_a EXITED Jan 21 09:56 PM
mha_cluster_b RUNNING pid 30263, uptime 0:27:23
# /usr/local/bin/supervisorctl stop mha_cluster_b
mha_cluster_b: stopped
# /usr/local/bin/supervisorctl status
consul RUNNING pid 30208, uptime 0:28:47
mha_cluster_a EXITED Jan 21 09:56 PM
mha_cluster_b STOPPED Jan 21 10:24 PM
- 切替の実行
RouteTableのVIP192.168.10.100に設定されている宛先のインスタンスを172.16.11.12から172.16.10.12に変更します
# masterha_master_switch --master_state=alive --conf=/etc/mha_cluster_b.conf --new_master_host=172.16.10.12
Thu Jan 21 23:10:49 2016 - [info] MHA::MasterRotate version 0.56.
Thu Jan 21 23:10:49 2016 - [info] Starting online master switch..
Thu Jan 21 23:10:49 2016 - [info]
Thu Jan 21 23:10:49 2016 - [info] * Phase 1: Configuration Check Phase..
Thu Jan 21 23:10:49 2016 - [info]
Thu Jan 21 23:10:49 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Jan 21 23:10:49 2016 - [info] Reading application default configuration from /etc/mha_cluster_b.conf..
Thu Jan 21 23:10:49 2016 - [info] Reading server configuration from /etc/mha_cluster_b.conf..
Thu Jan 21 23:10:50 2016 - [info] GTID failover mode = 1
Thu Jan 21 23:10:50 2016 - [info] Current Alive Master: 172.16.11.12(172.16.11.12:3306)
Thu Jan 21 23:10:50 2016 - [info] Alive Slaves:
Thu Jan 21 23:10:50 2016 - [info] 172.16.10.135(172.16.10.135:3306) Version=5.6.27-log (oldest major version between slaves) log-bin:enabled
Thu Jan 21 23:10:50 2016 - [info] GTID ON
Thu Jan 21 23:10:50 2016 - [info] Replicating from 172.16.11.12(172.16.11.12:3306)
Thu Jan 21 23:10:50 2016 - [info] 172.16.10.12(172.16.10.12:3306) Version=5.6.27-log (oldest major version between slaves) log-bin:enabled
Thu Jan 21 23:10:50 2016 - [info] GTID ON
Thu Jan 21 23:10:50 2016 - [info] Replicating from 172.16.11.12(172.16.11.12:3306)
It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 172.16.11.12(172.16.11.12:3306)? (YES/no): yes
Thu Jan 21 23:10:58 2016 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
Thu Jan 21 23:10:58 2016 - [info] ok.
Thu Jan 21 23:10:58 2016 - [info] Checking MHA is not monitoring or doing failover..
Thu Jan 21 23:10:58 2016 - [info] Checking replication health on 172.16.10.135..
Thu Jan 21 23:10:58 2016 - [info] ok.
Thu Jan 21 23:10:58 2016 - [info] Checking replication health on 172.16.10.12..
Thu Jan 21 23:10:58 2016 - [info] ok.
Thu Jan 21 23:10:58 2016 - [info] 172.16.10.12 can be new master.
Thu Jan 21 23:10:58 2016 - [info]
From:
172.16.11.12(172.16.11.12:3306) (current master)
+--172.16.10.135(172.16.10.135:3306)
+--172.16.10.12(172.16.10.12:3306)
To:
172.16.10.12(172.16.10.12:3306) (new master)
+--172.16.10.135(172.16.10.135:3306)
Starting master switch from 172.16.11.12(172.16.11.12:3306) to 172.16.10.12(172.16.10.12:3306)? (yes/NO): yes
Thu Jan 21 23:11:03 2016 - [info] Checking whether 172.16.10.12(172.16.10.12:3306) is ok for the new master..
Thu Jan 21 23:11:03 2016 - [info] ok.
Thu Jan 21 23:11:03 2016 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Jan 21 23:11:03 2016 - [info]
Thu Jan 21 23:11:03 2016 - [info] * Phase 2: Rejecting updates Phase..
Thu Jan 21 23:11:03 2016 - [info]
Thu Jan 21 23:11:03 2016 - [info] Executing master ip online change script to disable write on the current master:
Thu Jan 21 23:11:03 2016 - [info] /root/mysqlfailgover --mysql_master_vip=192.168.10.100/32 --command=stop --orig_master_host=172.16.11.12 --orig_master_ip=172.16.11.12 --orig_master_port=3306 --orig_master_user='mha' --orig_master_password='mhapassword' --new_master_host=172.16.10.12 --new_master_ip=172.16.10.12 --new_master_port=3306 --new_master_user='mha' --new_master_password='mhapassword' --orig_master_ssh_user=root --new_master_ssh_user=root
16
Thu Jan 21 23:11:03 2016 - [info] ok.
Thu Jan 21 23:11:03 2016 - [info] Locking all tables on the orig master to reject updates from everybody (including root):
Thu Jan 21 23:11:03 2016 - [info] Executing FLUSH TABLES WITH READ LOCK..
Thu Jan 21 23:11:03 2016 - [info] ok.
Thu Jan 21 23:11:03 2016 - [info] Orig master binlog:pos is mysqld-bin.000001:941.
Thu Jan 21 23:11:03 2016 - [info] Waiting to execute all relay logs on 172.16.10.12(172.16.10.12:3306)..
Thu Jan 21 23:11:03 2016 - [info] master_pos_wait(mysqld-bin.000001:941) completed on 172.16.10.12(172.16.10.12:3306). Executed 0 events.
Thu Jan 21 23:11:03 2016 - [info] done.
Thu Jan 21 23:11:03 2016 - [info] Getting new master's binlog name and position..
Thu Jan 21 23:11:03 2016 - [info] mysqld-bin.000001:1556
Thu Jan 21 23:11:03 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='172.16.10.12', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Thu Jan 21 23:11:03 2016 - [info] Executing master ip online change script to allow write on the new master:
Thu Jan 21 23:11:03 2016 - [info] /root/mysqlfailgover --mysql_master_vip=192.168.10.100/32 --command=start --orig_master_host=172.16.11.12 --orig_master_ip=172.16.11.12 --orig_master_port=3306 --orig_master_user='mha' --orig_master_password='mhapassword' --new_master_host=172.16.10.12 --new_master_ip=172.16.10.12 --new_master_port=3306 --new_master_user='mha' --new_master_password='mhapassword' --orig_master_ssh_user=root --new_master_ssh_user=root
16
mysql master vip: 192.168.10.100/32
master_host: 172.16.11.12
master_host_ip:
172.16.11.12 = i-b5d9683a
172.16.10.12 = i-57d667d8
orig_master_ip registration route table id: rtb-764d3b13
route table id rtb-764d3b13 ,replace destination instance i-b5d9683a to i-57d667d8
route table, rtb-764d3b13 replaced.
Thu Jan 21 23:11:03 2016 - [info] ok.
Thu Jan 21 23:11:03 2016 - [info]
Thu Jan 21 23:11:03 2016 - [info] * Switching slaves in parallel..
Thu Jan 21 23:11:03 2016 - [info]
Thu Jan 21 23:11:03 2016 - [info] -- Slave switch on host 172.16.10.135(172.16.10.135:3306) started, pid: 415
Thu Jan 21 23:11:03 2016 - [info]
Thu Jan 21 23:11:04 2016 - [info] Log messages from 172.16.10.135 ...
Thu Jan 21 23:11:04 2016 - [info]
Thu Jan 21 23:11:03 2016 - [info] Waiting to execute all relay logs on 172.16.10.135(172.16.10.135:3306)..
Thu Jan 21 23:11:03 2016 - [info] master_pos_wait(mysqld-bin.000001:941) completed on 172.16.10.135(172.16.10.135:3306). Executed 0 events.
Thu Jan 21 23:11:03 2016 - [info] done.
Thu Jan 21 23:11:03 2016 - [info] Resetting slave 172.16.10.135(172.16.10.135:3306) and starting replication from the new master 172.16.10.12(172.16.10.12:3306)..
Thu Jan 21 23:11:03 2016 - [info] Executed CHANGE MASTER.
Thu Jan 21 23:11:03 2016 - [info] Slave started.
Thu Jan 21 23:11:04 2016 - [info] End of log messages from 172.16.10.135 ...
Thu Jan 21 23:11:04 2016 - [info]
Thu Jan 21 23:11:04 2016 - [info] -- Slave switch on host 172.16.10.135(172.16.10.135:3306) succeeded.
Thu Jan 21 23:11:04 2016 - [info] Unlocking all tables on the orig master:
Thu Jan 21 23:11:04 2016 - [info] Executing UNLOCK TABLES..
Thu Jan 21 23:11:04 2016 - [info] ok.
Thu Jan 21 23:11:04 2016 - [info] All new slave servers switched successfully.
Thu Jan 21 23:11:04 2016 - [info]
Thu Jan 21 23:11:04 2016 - [info] * Phase 5: New master cleanup phase..
Thu Jan 21 23:11:04 2016 - [info]
Thu Jan 21 23:11:04 2016 - [info] 172.16.10.12: Resetting slave info succeeded.
Thu Jan 21 23:11:04 2016 - [info] Switching master to 172.16.10.12(172.16.10.12:3306) completed successfully.
- 切替後確認
# mysql -u mha -pmhapassword -h 192.168.10.100 mha_cluster_b -e 'show variables like '\''hostname'\'''
+---------------+-----------------+
| Variable_name | Value |
+---------------+-----------------+
| hostname | ip-172-16-10-12 |
+---------------+-----------------+
MHA Managerの起動
手動切替した後MHA Managerの復旧方法
- MySQL スレーブを追加する
- mhaの定義ファイルを修正する
- MHA Managerを起動する
とても簡単ですね。
その他
masterha_**管理コマンドでMHA管理下にあるMySQLクラスタのヘルスチェックができます。
MHA管理下のクラスタ疎通確認
# masterha_master_monitor --conf=/etc/mha_cluster_b.conf
Thu Jan 21 23:09:28 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Jan 21 23:09:28 2016 - [info] Reading application default configuration from /etc/mha_cluster_b.conf..
Thu Jan 21 23:09:28 2016 - [info] Reading server configuration from /etc/mha_cluster_b.conf..
Thu Jan 21 23:09:28 2016 - [info] MHA::MasterMonitor version 0.56.
Thu Jan 21 23:09:29 2016 - [info] GTID failover mode = 1
Thu Jan 21 23:09:29 2016 - [info] Dead Servers:
Thu Jan 21 23:09:29 2016 - [info] Alive Servers:
Thu Jan 21 23:09:29 2016 - [info] 172.16.10.135(172.16.10.135:3306)
Thu Jan 21 23:09:29 2016 - [info] 172.16.10.12(172.16.10.12:3306)
Thu Jan 21 23:09:29 2016 - [info] 172.16.11.12(172.16.11.12:3306)
Thu Jan 21 23:09:29 2016 - [info] Alive Slaves:
Thu Jan 21 23:09:29 2016 - [info] 172.16.10.135(172.16.10.135:3306) Version=5.6.27-log (oldest major version between slaves) log-bin:enabled
Thu Jan 21 23:09:29 2016 - [info] GTID ON
Thu Jan 21 23:09:29 2016 - [info] Replicating from 172.16.11.12(172.16.11.12:3306)
Thu Jan 21 23:09:29 2016 - [info] 172.16.10.12(172.16.10.12:3306) Version=5.6.27-log (oldest major version between slaves) log-bin:enabled
Thu Jan 21 23:09:29 2016 - [info] GTID ON
Thu Jan 21 23:09:29 2016 - [info] Replicating from 172.16.11.12(172.16.11.12:3306)
Thu Jan 21 23:09:29 2016 - [info] Current Alive Master: 172.16.11.12(172.16.11.12:3306)
Thu Jan 21 23:09:29 2016 - [info] Checking slave configurations..
Thu Jan 21 23:09:29 2016 - [info] read_only=1 is not set on slave 172.16.10.135(172.16.10.135:3306).
Thu Jan 21 23:09:29 2016 - [info] read_only=1 is not set on slave 172.16.10.12(172.16.10.12:3306).
Thu Jan 21 23:09:29 2016 - [info] Checking replication filtering settings..
Thu Jan 21 23:09:29 2016 - [info] binlog_do_db= , binlog_ignore_db=
Thu Jan 21 23:09:29 2016 - [info] Replication filtering check ok.
Thu Jan 21 23:09:29 2016 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Thu Jan 21 23:09:29 2016 - [info] Checking SSH publickey authentication settings on the current master..
Warning: Permanently added '172.16.11.12' (ECDSA) to the list of known hosts.
Permission denied (publickey).
Thu Jan 21 23:09:29 2016 - [warning] HealthCheck: SSH to 172.16.11.12 is NOT reachable.
Thu Jan 21 23:09:29 2016 - [info]
172.16.11.12(172.16.11.12:3306) (current master)
+--172.16.10.135(172.16.10.135:3306)
+--172.16.10.12(172.16.10.12:3306)
Thu Jan 21 23:09:29 2016 - [info] Checking master_ip_failover_script status:
Thu Jan 21 23:09:29 2016 - [info] /root/mysqlfailgover --mysql_master_vip=192.168.10.100/32 --command=status --ssh_user=root --orig_master_host=172.16.11.12 --orig_master_ip=172.16.11.12 --orig_master_port=3306
8
orig_master_ip registration route table id: rtb-764d3b13
Thu Jan 21 23:09:30 2016 - [info] OK.
Thu Jan 21 23:09:30 2016 - [warning] shutdown_script is not defined.
Thu Jan 21 23:09:30 2016 - [info] Set master ping interval 3 seconds.
Thu Jan 21 23:09:30 2016 - [warning] secondary_check_script is not defined. It is highly recommended setting it to check master reachability from two or more routes.
Thu Jan 21 23:09:30 2016 - [info] Starting ping health check on 172.16.11.12(172.16.11.12:3306)..
Thu Jan 21 23:09:30 2016 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
MHA管理下にあるMySQLクラスタのレプリケーションチェック
# masterha_check_repl --conf=/etc/mha_cluster_b.conf
Thu Jan 21 22:57:14 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Jan 21 22:57:14 2016 - [info] Reading application default configuration from /etc/mha_cluster_b.conf..
Thu Jan 21 22:57:14 2016 - [info] Reading server configuration from /etc/mha_cluster_b.conf..
Thu Jan 21 22:57:14 2016 - [info] MHA::MasterMonitor version 0.56.
Thu Jan 21 22:57:15 2016 - [info] GTID failover mode = 1
Thu Jan 21 22:57:15 2016 - [info] Dead Servers:
Thu Jan 21 22:57:15 2016 - [info] Alive Servers:
Thu Jan 21 22:57:15 2016 - [info] 172.16.10.135(172.16.10.135:3306)
Thu Jan 21 22:57:15 2016 - [info] 172.16.10.12(172.16.10.12:3306)
Thu Jan 21 22:57:15 2016 - [info] 172.16.11.12(172.16.11.12:3306)
Thu Jan 21 22:57:15 2016 - [info] Alive Slaves:
Thu Jan 21 22:57:15 2016 - [info] 172.16.10.135(172.16.10.135:3306) Version=5.6.27-log (oldest major version between slaves) log-bin:enabled
Thu Jan 21 22:57:15 2016 - [info] GTID ON
Thu Jan 21 22:57:15 2016 - [info] Replicating from 172.16.10.12(172.16.10.12:3306)
Thu Jan 21 22:57:15 2016 - [info] 172.16.11.12(172.16.11.12:3306) Version=5.6.27-log (oldest major version between slaves) log-bin:enabled
Thu Jan 21 22:57:15 2016 - [info] GTID ON
Thu Jan 21 22:57:15 2016 - [info] Replicating from 172.16.10.12(172.16.10.12:3306)
Thu Jan 21 22:57:15 2016 - [info] Current Alive Master: 172.16.10.12(172.16.10.12:3306)
Thu Jan 21 22:57:15 2016 - [info] Checking slave configurations..
Thu Jan 21 22:57:15 2016 - [info] read_only=1 is not set on slave 172.16.10.135(172.16.10.135:3306).
Thu Jan 21 22:57:15 2016 - [info] read_only=1 is not set on slave 172.16.11.12(172.16.11.12:3306).
Thu Jan 21 22:57:15 2016 - [info] Checking replication filtering settings..
Thu Jan 21 22:57:15 2016 - [info] binlog_do_db= , binlog_ignore_db=
Thu Jan 21 22:57:15 2016 - [info] Replication filtering check ok.
Thu Jan 21 22:57:15 2016 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Thu Jan 21 22:57:15 2016 - [info] Checking SSH publickey authentication settings on the current master..
Permission denied (publickey).
Thu Jan 21 22:57:15 2016 - [warning] HealthCheck: SSH to 172.16.10.12 is NOT reachable.
Thu Jan 21 22:57:15 2016 - [info]
172.16.10.12(172.16.10.12:3306) (current master)
+--172.16.10.135(172.16.10.135:3306)
+--172.16.11.12(172.16.11.12:3306)
Thu Jan 21 22:57:15 2016 - [info] Checking replication health on 172.16.10.135..
Thu Jan 21 22:57:15 2016 - [info] ok.
Thu Jan 21 22:57:15 2016 - [info] Checking replication health on 172.16.11.12..
Thu Jan 21 22:57:15 2016 - [info] ok.
Thu Jan 21 22:57:15 2016 - [info] Checking master_ip_failover_script status:
Thu Jan 21 22:57:15 2016 - [info] /root/mysqlfailgover --mysql_master_vip=192.168.10.100/32 --command=status --ssh_user=root --orig_master_host=172.16.10.12 --orig_master_ip=172.16.10.12 --orig_master_port=3306
8
orig_master_ip registration route table id: rtb-764d3b13
Thu Jan 21 22:57:15 2016 - [info] OK.
Thu Jan 21 22:57:15 2016 - [warning] shutdown_script is not defined.
Thu Jan 21 22:57:15 2016 - [info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.