LoginSignup
1
0

More than 5 years have passed since last update.

PacemakerでSQL Server on Linux 2017を管理してみる。。。

Last updated at Posted at 2018-05-15

前文

歳のせいか、業界の動向に疎くなっているようで。。。いつの間にか、SQL Server on Linux 2017でした(涙)
- https://www.microsoft.com/ja-jp/sql-server/sql-server-2017-linux
正式にMSのPacemaker用のRA(fci)も用意されているようです。
- https://github.com/Microsoft/mssql-server-ha

今のバージョンがどの程度、安定性があるのか不明ですが、LinuxのDatabaseの選択子の一つになっていくかも知れません。

環境

  • RHEL7.4の単ノード
  • Pacemakerは2.0-rc3
  • SQL Server 2017
[root@test~]# rpm -qa | grep sql                                                                                                                                                                           
msodbcsql17-17.1.0.1-1.x86_64                                                                                                                                                                                    
mssql-server-ha-14.0.3025.34-3.x86_64                                                                                                                                                                            
mssql-tools-17.1.0.1-1.x86_64                                                                                                                                                                                    
mssql-server-14.0.3025.34-3.x86_64      

設定

「SQL Server on Linux 2017」のインストールや初期設定については、関連リンク先に詳しく書かれているので割愛します。
systemd管理と、fci管理の2つのパターンで、STONITH無で設定しています。単ノードで管理するRAは1つのみにしています。
実際には、クラスタ構成するのであれば、アクセス用のVIP(IPaddr2)や、DB領域用の共有(もしくは、DRBDなどの同期可能なディスク)が必要ですが、省いています。

動作確認

systemd管理

  • Pacemakerを起動してcrmファイルを流し込みます。
property no-quorum-policy="ignore" \
        stonith-enabled="false" \
        startup-fencing="false" \

rsc_defaults resource-stickiness="INFINITY" \
        migration-threshold="1"
#
primitive prmMssqlServer systemd:mssql-server \
        op start interval="0s" timeout="60s" on-fail="restart" \
        op monitor interval="10s" timeout="60s" on-fail="restart" \
        op stop interval="0s" timeout="60s" on-fail="block"
  • crm_monの様子
[root@rh74-test ~]# crm_mon -1 -Af                                                                                                                                                                               
Stack: corosync
Current DC: rh74-test (version 2.0.0-2c83c6872a) - partition WITHOUT quorum
Last updated: Tue May 15 15:26:25 2018
Last change: Tue May 15 15:26:19 2018 by root via cibadmin on rh74-test

1 node configured
1 resource configured

Online: [ rh74-test ]

Active resources:

 prmMssqlServer (systemd:mssql-server): Started rh74-test

Node Attributes:
* Node rh74-test:

Migration Summary:
* Node rh74-test:
  • psとstatusで確認してみます
[root@rh74-test ~]# ps -ef | grep mssql                                                                                                                                                                          
mssql     5853     1  0 15:26 ?        00:00:01 /opt/mssql/bin/sqlservr                                                                                                                                          
mssql     5855  5853  9 15:26 ?        00:00:10 /opt/mssql/bin/sqlservr                                                                                                                                          
root      6083 25955  0 15:28 pts/0    00:00:00 grep --color=auto mssql                                                                                                                                          
[root@rh74-test ~]# systemctl status mssql-server.service 
● mssql-server.service - Cluster Controlled mssql-server
   Loaded: loaded (/usr/lib/systemd/system/mssql-server.service; enabled; vendor preset: disabled)
  Drop-In: /run/systemd/system/mssql-server.service.d
           └─50-pacemaker.conf
   Active: active (running) since 火 2018-05-15 15:26:19 JST; 2min 9s ago
     Docs: https://docs.microsoft.com/en-us/sql/linux
 Main PID: 5853 (sqlservr)
   Memory: 693.7M
   CGroup: /system.slice/mssql-server.service
           ├─5853 /opt/mssql/bin/sqlservr
           └─5855 /opt/mssql/bin/sqlservr

 5月 15 15:26:28 rh74-test sqlservr[5853]: 2018-05-15 15:26:28.41 Server      2018-05-15 15:26:28.42 Server      Server is listening on [ ::1 <ipv6> 1434].
 5月 15 15:26:28 rh74-test sqlservr[5853]: 2018-05-15 15:26:28.42 Server      2018-05-15 15:26:28.42 Server      Server is listening on [ 127.0.0.1 <ipv4> 1434].
 5月 15 15:26:28 rh74-test sqlservr[5853]: 2018-05-15 15:26:28.42 Server      2018-05-15 15:26:28.42 Server      Dedicated admin connection support was established for listening locally on port 1434.
 5月 15 15:26:28 rh74-test sqlservr[5853]: 2018-05-15 15:26:28.44 spid19s     SQL Server 2018-05-15 15:26:28.44 spid19s     SQL Server is now ready for client connections. This is an informat...n is required.
 5月 15 15:26:28 rh74-test sqlservr[5853]: 2018-05-15 15:26:28.98 spid9s      2018-05-15 15:26:28.98 spid9s      Starting up database 'tempdb'.
 5月 15 15:26:29 rh74-test sqlservr[5853]: 2018-05-15 15:26:29.26 spid9s      tempdb 2018-05-15 15:26:29.26 spid9s      The tempdb database has 1 data file(s).
 5月 15 15:26:29 rh74-test sqlservr[5853]: 2018-05-15 15:26:29.27 spid24s     Service Broker 2018-05-15 15:26:29.27 spid24s     The Service Broker endpoint is in disabled or stopped state.
 5月 15 15:26:29 rh74-test sqlservr[5853]: 2018-05-15 15:26:29.27 spid24s     Database Mirroring 2018-05-15 15:26:29.27 spid24s     The Database Mirroring endpoint is in disabled or stopped state.
 5月 15 15:26:29 rh74-test sqlservr[5853]: 2018-05-15 15:26:29.29 spid24s     Service Broker 2018-05-15 15:26:29.29 spid24s     Service Broker manager has started.
 5月 15 15:26:29 rh74-test sqlservr[5853]: 2018-05-15 15:26:29.45 spid5s      2018-05-15 15:26:29.45 spid5s      Recovery is complete. This is an informational message only. No user action is required.
Hint: Some lines were ellipsized, use -l to show in full.
  • mssqlプロセスに疑似故障を起こしてみます。故障検知し単ノードの為、停止します。
[root@rh74-test opt]# ps -ef |grep mssql                                                                                                                                                                         
mssql     5853     1  0 15:26 ?        00:00:01 /opt/mssql/bin/sqlservr                                                                                                                                          
mssql     5855  5853  6 15:26 ?        00:00:12 /opt/mssql/bin/sqlservr                                                                                                                                          
root      6106 26259  0 15:29 pts/1    00:00:00 grep --color=auto mssql                                                                                                                                          
[root@rh74-test opt]# kill -9 5853
[root@rh74-test opt]# ps -ef |grep mssql
root      6109 26259  0 15:29 pts/1    00:00:00 grep --color=auto mssql   

[root@rh74-test ~]# crm_mon -1 -Af                                                                                                                                                                               
Stack: corosync
Current DC: rh74-test (version 2.0.0-2c83c6872a) - partition WITHOUT quorum
Last updated: Tue May 15 15:30:03 2018
Last change: Tue May 15 15:26:19 2018 by root via cibadmin on rh74-test

1 node configured
1 resource configured

Online: [ rh74-test ]

No active resources


Node Attributes:
* Node rh74-test:

Migration Summary:
* Node rh74-test:
   prmMssqlServer: migration-threshold=1 fail-count=1 last-failure='Tue May 15 15:29:52 2018'

Failed Actions:
* prmMssqlServer_monitor_10000 on rh74-test 'not running' (7): call=7, status=complete, exitreason='',
    last-rc-change='Tue May 15 15:29:52 2018', queued=0ms, exec=0ms

[root@rh74-test ~]# systemctl status mssql-server.service 
● mssql-server.service - Microsoft SQL Server Database Engine
   Loaded: loaded (/usr/lib/systemd/system/mssql-server.service; enabled; vendor preset: disabled)
   Active: failed (Result: signal) since 火 2018-05-15 15:29:44 JST; 1min 1s ago
     Docs: https://docs.microsoft.com/en-us/sql/linux
 Main PID: 5853 (code=killed, signal=KILL)

 5月 15 15:26:28 rh74-test sqlservr[5853]: 2018-05-15 15:26:28.44 spid19s     SQL Server 2018-05-15 15:26:28.44 spid19s     SQL Server is now ready for client connections. This is an informat...n is required.
 5月 15 15:26:28 rh74-test sqlservr[5853]: 2018-05-15 15:26:28.98 spid9s      2018-05-15 15:26:28.98 spid9s      Starting up database 'tempdb'.
 5月 15 15:26:29 rh74-test sqlservr[5853]: 2018-05-15 15:26:29.26 spid9s      tempdb 2018-05-15 15:26:29.26 spid9s      The tempdb database has 1 data file(s).
 5月 15 15:26:29 rh74-test sqlservr[5853]: 2018-05-15 15:26:29.27 spid24s     Service Broker 2018-05-15 15:26:29.27 spid24s     The Service Broker endpoint is in disabled or stopped state.
 5月 15 15:26:29 rh74-test sqlservr[5853]: 2018-05-15 15:26:29.27 spid24s     Database Mirroring 2018-05-15 15:26:29.27 spid24s     The Database Mirroring endpoint is in disabled or stopped state.
 5月 15 15:26:29 rh74-test sqlservr[5853]: 2018-05-15 15:26:29.29 spid24s     Service Broker 2018-05-15 15:26:29.29 spid24s     Service Broker manager has started.
 5月 15 15:26:29 rh74-test sqlservr[5853]: 2018-05-15 15:26:29.45 spid5s      2018-05-15 15:26:29.45 spid5s      Recovery is complete. This is an informational message only. No user action is required.
 5月 15 15:29:44 rh74-test systemd[1]: mssql-server.service: main process exited, code=killed, status=9/KILL
 5月 15 15:29:44 rh74-test systemd[1]: Unit mssql-server.service entered failed state.
 5月 15 15:29:44 rh74-test systemd[1]: mssql-server.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

fci管理

  • Pacemakerを起動してcrmファイルを流し込みます。
property no-quorum-policy="ignore" \
        stonith-enabled="false" \
        startup-fencing="false" \

rsc_defaults resource-stickiness="INFINITY" \
        migration-threshold="1"
#
primitive rh74-01 ocf:mssql:fci \
        op start interval="0s" timeout="60s" on-fail="restart" \
        op monitor interval="10s" timeout="60s" on-fail="restart" \
        op stop interval="0s" timeout="60s" on-fail="block"
  • crm_monの様子
[root@rh74-test ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh74-test (version 2.0.0-2c83c6872a) - partition WITHOUT quorum
Last updated: Tue May 15 15:39:50 2018
Last change: Tue May 15 15:39:31 2018 by root via cibadmin on rh74-test

1 node configured
1 resource configured

Online: [ rh74-test ]

Active resources:

 rh74-01        (ocf::mssql:fci):       Started rh74-test

Node Attributes:
* Node rh74-test:

Migration Summary:
* Node rh74-test:
  • psとstatusで確認してみます。systemd管理ではないので、systemdのstatusでは確認されません。
[root@rh74-test ~]# ps -ef |grep mssql                                                                                                                                                                           
mssql     9475     1  3 15:44 ?        00:00:00 /opt/mssql/bin/sqlservr                                                                                                                                          
mssql     9486  9475 73 15:44 ?        00:00:09 /opt/mssql/bin/sqlservr                                                                                                                                          
root      9730  9353  0 15:44 ?        00:00:00 /bin/bash /usr/lib/ocf/resource.d/mssql/fci monitor                                                                                                              
root      9757  9730  0 15:44 ?        00:00:00 /bin/bash /usr/lib/ocf/resource.d/mssql/fci monitor                                                                                                              
root      9758  9757  4 15:44 ?        00:00:00 /usr/lib/ocf/lib/mssql/fci-helper --port 1433 --credentials-file /var/opt/mssql/secrets/passwd --application-name monitor-rh74-01 --connection-timeout 20 --health-threshold 3 --action monitor --virtual-server-name rh74-01
root      9759  9757  0 15:44 ?        00:00:00 /bin/bash /usr/lib/ocf/resource.d/mssql/fci monitor                                                                                                              
root      9785 25955  0 15:44 pts/0    00:00:00 grep --color=auto mssql      


[root@rh74-test ~]# systemctl status mssql-server.service 
● mssql-server.service - Microsoft SQL Server Database Engine
   Loaded: loaded (/usr/lib/systemd/system/mssql-server.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since 火 2018-05-15 15:43:10 JST; 1min 20s ago
     Docs: https://docs.microsoft.com/en-us/sql/linux
  Process: 9110 ExecStart=/opt/mssql/bin/sqlservr (code=exited, status=0/SUCCESS)
 Main PID: 9110 (code=exited, status=0/SUCCESS)

 5月 15 15:43:00 rh74-test sqlservr[9110]: 2018-05-15 15:43:00.21 spid24s     Service Broker 2018-05-15 15:43:00.21 spid24s     The Service Broker endpoint is in disabled or stopped state.
 5月 15 15:43:00 rh74-test sqlservr[9110]: 2018-05-15 15:43:00.21 spid24s     Database Mirroring 2018-05-15 15:43:00.22 spid24s     The Database Mirroring endpoint is in disabled or stopped state.
 5月 15 15:43:00 rh74-test sqlservr[9110]: 2018-05-15 15:43:00.23 spid24s     Service Broker 2018-05-15 15:43:00.23 spid24s     Service Broker manager has started.
 5月 15 15:43:00 rh74-test sqlservr[9110]: 2018-05-15 15:43:00.28 spid6s      2018-05-15 15:43:00.28 spid6s      Recovery is complete. This is an informational message only. No user action is required.
 5月 15 15:43:00 rh74-test systemd[1]: Stopping Microsoft SQL Server Database Engine...
 5月 15 15:43:00 rh74-test sqlservr[9110]: 2018-05-15 15:43:00.47 spid6s      Always On: SQL Server is shutting down 2018-05-15 15:43:00.48 spid6s      Always On: The availability replica man...n is required.
 5月 15 15:43:00 rh74-test sqlservr[9110]: 2018-05-15 15:43:00.48 spid6s      2018-05-15 15:43:00.48 spid6s      SQL Server is terminating in response to a 'stop' request from Service Control...n is required.
 5月 15 15:43:00 rh74-test sqlservr[9110]: 2018-05-15 15:43:00.59 spid24s     Service Broker 2018-05-15 15:43:00.59 spid24s     Service Broker manager has shut down.
 5月 15 15:43:10 rh74-test sqlservr[9110]: 2018-05-15 15:43:10.30 spid6s      2018-05-15 15:43:10.30 spid6s      SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an infor...n is required.
 5月 15 15:43:10 rh74-test systemd[1]: Stopped Microsoft SQL Server Database Engine.
Hint: Some lines were ellipsized, use -l to show in full.

  • mssqlプロセスに疑似故障を起こしてみます。故障検知し単ノードの為、停止します。
[root@rh74-test ~]# ps -ef |grep mssql
mssql     9475     1  1 15:44 ?        00:00:00 /opt/mssql/bin/sqlservr                                                                                                                                          
mssql     9486  9475 25 15:44 ?        00:00:10 /opt/mssql/bin/sqlservr                                                                                                                                          
root      9876 25955  0 15:44 pts/0    00:00:00 grep --color=auto mssql                                                                                                                                          
[root@rh74-test ~]# kill -9 9475
[root@rh74-test ~]# ps -ef |grep mssql                                                                                                                                                                           
root      9995 25955  0 15:44 pts/0    00:00:00 grep --color=auto mssql          

[root@rh74-test ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh74-test (version 2.0.0-2c83c6872a) - partition WITHOUT quorum
Last updated: Tue May 15 15:45:05 2018
Last change: Tue May 15 15:44:01 2018 by root via cibadmin on rh74-test

1 node configured
1 resource configured

Online: [ rh74-test ]

No active resources


Node Attributes:
* Node rh74-test:

Migration Summary:
* Node rh74-test:
   rh74-01: migration-threshold=1 fail-count=1 last-failure='Tue May 15 15:44:48 2018'

Failed Actions:
* rh74-01_monitor_10000 on rh74-test 'unknown error' (1): call=7, status=complete, exitreason='2018/05/15 15:44:48 Unexpected error: Unable to open tcp connection with host 'localhost:1433': dial tcp 127.0.0.1:1433: getsock',
    last-rc-change='Tue May 15 15:44:48 2018', queued=0ms, exec=0ms

気になる点

  • systemd管理時に疑似故障時にcrm_monのexitreasonが何も入っていないので、何かエラー文字列を埋めた方が良いと思われます。(もしかしたら、systemdリソースの場合は、exitreasonは設定出来ないのかも??)
  • fci管理の場合、fciリソースの名称を変更する必要があるようです。(詳細をみていませんが、設定などで回避する方法があるのかも知れません。。。)
[root@rh74-test ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh74-test (version 2.0.0-2c83c6872a) - partition WITHOUT quorum
Last updated: Tue May 15 15:35:54 2018
Last change: Tue May 15 15:35:39 2018 by root via cibadmin on rh74-test

1 node configured
1 resource configured

Online: [ rh74-test ]

Active resources:

 prmMssqlServer (ocf::mssql:fci):       FAILED rh74-test

Node Attributes:
* Node rh74-test:

Migration Summary:
* Node rh74-test:
   prmMssqlServer: migration-threshold=1 fail-count=1000000 last-failure='Tue May 15 15:35:53 2018'

Failed Actions:
* prmMssqlServer_start_0 on rh74-test 'invalid parameter' (2): call=6, status=complete, exitreason='2018/05/15 15:35:53 Expected local server name to be prmMssqlServer but it was rh74-test',
    last-rc-change='Tue May 15 15:35:40 2018', queued=0ms, exec=12836ms

(fci-helperの該当エラー箇所)
(snip)
// Function: monitor
//
// Description:
//    Implements the OCF "monitor" action
//
func monitor(db *sql.DB, virtualServerName string, stdout *log.Logger) (mssqlcommon.OcfExitCode, error) {
        stdout.Println("Querying local server name...")
        currentServerName, err := mssqlcommon.GetLocalServerName(db)
        if err != nil {
                return mssqlcommon.OCF_ERR_GENERIC, fmt.Errorf("Could not query local server name: %s", err)
        }
        stdout.Printf("Local server name is %s\n", currentServerName)
        if !strings.EqualFold(currentServerName, virtualServerName) {
                return mssqlcommon.OCF_ERR_ARGS, fmt.Errorf("Expected local server name to be %s but it was %s", virtualServerName, currentServerName)
        }

        return mssqlcommon.OCF_SUCCESS, nil
}
(snip)
fciのRA内の$OCF_RESOURCE_INSTANCEを渡している模様。。。

  • fci管理で起動するとmonitorの都度、ログが大量に出て、かなり鬱陶しい感じです。実際にPacemakerでクラスタする場合には、改善(ログをdebugなどに落とす)が必要でしょう。
(snip)
May 15 15:06:44 rh74-test fci(rh74-test)[4149]: INFO: mssql_validate
May 15 15:06:44 rh74-test fci(rh74-test)[4149]: INFO: Resource agent invoked with: monitor
May 15 15:06:44 rh74-test fci(rh74-test)[4149]: INFO: mssql_monitor
May 15 15:06:44 rh74-test fci(rh74-test)[4149]: INFO: monitor: 2018/05/15 15:06:44 fci-helper invoked with hostname [localhost]; port [1433]; credentials-file [/var/opt/mssql/secrets/passwd]; application-name [monitor-rh74-test]; connection-timeout [20]; health-threshold [3]; action [monitor]
May 15 15:06:44 rh74-test fci(rh74-test)[4149]: INFO: monitor: 2018/05/15 15:06:44 fci-helper invoked with virtual-server-name [rh74-test]
May 15 15:06:44 rh74-test fci(rh74-test)[4149]: INFO: monitor: 2018/05/15 15:06:44 Attempt 1 to connect to the instance at localhost:1433 and run sp_server_diagnostics
May 15 15:06:45 rh74-test fci(rh74-test)[4149]: INFO: monitor: 2018/05/15 15:06:45 Connected to the instance at localhost:1433
May 15 15:06:50 rh74-test fci(rh74-test)[4149]: INFO: monitor: 2018/05/15 15:06:50 Querying local server name...
May 15 15:06:50 rh74-test fci(rh74-test)[4149]: INFO: monitor: 2018/05/15 15:06:50 Local server name is rh74-test
May 15 15:06:50 rh74-test fci(rh74-test)[4149]: INFO: rh74-test monitor : 0
May 15 15:07:00 rh74-test fci(rh74-test)[4250]: INFO: mssql_validate
May 15 15:07:00 rh74-test fci(rh74-test)[4250]: INFO: Resource agent invoked with: monitor
May 15 15:07:00 rh74-test fci(rh74-test)[4250]: INFO: mssql_monitor
May 15 15:07:00 rh74-test fci(rh74-test)[4250]: INFO: monitor: 2018/05/15 15:07:00 fci-helper invoked with hostname [localhost]; port [1433]; credentials-file [/var/opt/mssql/secrets/passwd]; application-name [monitor-rh74-test]; connection-timeout [20]; health-threshold [3]; action [monitor]
May 15 15:07:00 rh74-test fci(rh74-test)[4250]: INFO: monitor: 2018/05/15 15:07:00 fci-helper invoked with virtual-server-name [rh74-test]
May 15 15:07:00 rh74-test fci(rh74-test)[4250]: INFO: monitor: 2018/05/15 15:07:00 Attempt 1 to connect to the instance at localhost:1433 and run sp_server_diagnostics
May 15 15:07:00 rh74-test fci(rh74-test)[4250]: INFO: monitor: 2018/05/15 15:07:00 Connected to the instance at localhost:1433
May 15 15:07:05 rh74-test fci(rh74-test)[4250]: INFO: monitor: 2018/05/15 15:07:05 Querying local server name...
May 15 15:07:05 rh74-test fci(rh74-test)[4250]: INFO: monitor: 2018/05/15 15:07:05 Local server name is rh74-test
May 15 15:07:05 rh74-test fci(rh74-test)[4250]: INFO: rh74-test monitor : 0
(snip)

関連リンク

以下に既にPacemakerで設定されている記載がありました。参考にさせて頂きました。
- https://qiita.com/harukano/items/6488c15e0845402fafc3
- https://www.server-world.info/query?os=CentOS_7&p=mssql2017&f=1

1
0
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
1
0