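What is SLURM?
SLURM is a resource manager:
you submit a batch job, and it assigns the job to suitable CPUs.
PBS or PBS Pro used to be the usual choice, but these days SLURM is a perfectly respectable option.
Oh? SLURM is available as a package.
So let's install that package.
This part is dead simple: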
# dnf install 'slurm*'
And with that, SLURM is installed.
From here it gets a bit rough.
SLURM authenticates through a service called MUNGE, so that has to be running as well.
First, let's start MUNGE.
[root@localhost system]# systemctl start munge
Job for munge.service failed because the control process exited with error code.
See "systemctl status munge.service" and "journalctl -xe" for details.
Hm? It won't start. Let's look at the status.
[root@localhost system]# systemctl status munge
● munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Mon 2019-12-02 09:46:54 EST; 7s ago
Docs: man:munged(8)
Process: 27359 ExecStart=/usr/sbin/munged (code=exited, status=1/FAILURE)
CPU: 28ms
Dec 02 09:46:54 vpn.sc-magi.com systemd[1]: Starting MUNGE authentication service...
Dec 02 09:46:54 vpn.sc-magi.com munged[27359]: munged: Error: Failed to check keyfile "/etc/munge/munge.key": No such file or directory
Dec 02 09:46:54 vpn.sc-magi.com systemd[1]: munge.service: Control process exited, code=exited, status=1/FAILURE
Dec 02 09:46:54 vpn.sc-magi.com systemd[1]: munge.service: Failed with result 'exit-code'.
Dec 02 09:46:54 vpn.sc-magi.com systemd[1]: Failed to start MUNGE authentication service.
It complains that /etc/munge/munge.key does not exist.
So I ran
# man munge
and found the project's web page URL at the very bottom, so I followed it.
https://dun.github.io/munge/
Looking at the post-installation setup instructions there,
# dd if=/dev/random bs=1 count=1024 >/etc/munge/munge.key
is what they say to run, so I run exactly that.
[root@localhost munge]# dd if=/dev/random bs=1 count=1024 >/etc/munge/munge.key
1024+0 records in
1024+0 records out
1024 bytes (1.0 kB, 1.0 KiB) copied, 0.029465 s, 34.8 kB/s
Next, change the owner of the key.
[root@localhost munge]# chown munge:munge /etc/munge/munge.key
Right, surely it runs now. Let's try...
[root@localhost munge]# systemctl start munge
Job for munge.service failed because the control process exited with error code.
See "systemctl status munge.service" and "journalctl -xe" for details.
[root@localhost munge]# systemctl status munge
● munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Mon 2019-12-02 09:51:21 EST; 5s ago
Docs: man:munged(8)
Process: 27526 ExecStart=/usr/sbin/munged (code=exited, status=1/FAILURE)
CPU: 31ms
Dec 02 09:51:19 vpn.sc-magi.com systemd[1]: Starting MUNGE authentication service...
Dec 02 09:51:21 vpn.sc-magi.com munged[27526]: munged: Error: Keyfile is insecure: "/etc/munge/munge.key" should not be readable or writable by group or world
Dec 02 09:51:21 vpn.sc-magi.com systemd[1]: munge.service: Control process exited, code=exited, status=1/FAILURE
Dec 02 09:51:21 vpn.sc-magi.com systemd[1]: munge.service: Failed with result 'exit-code'.
Dec 02 09:51:21 vpn.sc-magi.com systemd[1]: Failed to start MUNGE authentication service.
Ouch, told off again: the key must not be readable or writable by group or world. So I change the permissions.
[root@localhost munge]# chmod 600 /etc/munge/munge.key
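Surely it will run this time. (I checked the status before restarting, so the first output below is still the old failure.)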
[root@localhost munge]# systemctl status munge
● munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Mon 2019-12-02 09:51:21 EST; 51s ago
Docs: man:munged(8)
Process: 27526 ExecStart=/usr/sbin/munged (code=exited, status=1/FAILURE)
CPU: 31ms
Dec 02 09:51:19 vpn.sc-magi.com systemd[1]: Starting MUNGE authentication service...
Dec 02 09:51:21 vpn.sc-magi.com munged[27526]: munged: Error: Keyfile is insecure: "/etc/munge/munge.key" should not be readable or writable by group or world
Dec 02 09:51:21 vpn.sc-magi.com systemd[1]: munge.service: Control process exited, code=exited, status=1/FAILURE
Dec 02 09:51:21 vpn.sc-magi.com systemd[1]: munge.service: Failed with result 'exit-code'.
Dec 02 09:51:21 vpn.sc-magi.com systemd[1]: Failed to start MUNGE authentication service.
[root@localhost munge]# systemctl start munge
[root@localhost munge]# systemctl status munge
● munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; disabled; vendor preset: disabled)
Active: active (running) since Mon 2019-12-02 09:52:16 EST; 3s ago
Docs: man:munged(8)
Process: 27563 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
Main PID: 27565 (munged)
Tasks: 4 (limit: 999)
Memory: 1.1M
CPU: 40ms
CGroup: /system.slice/munge.service
└─27565 /usr/sbin/munged
Dec 02 09:52:16 vpn.sc-magi.com systemd[1]: Starting MUNGE authentication service...
Dec 02 09:52:16 vpn.sc-magi.com systemd[1]: Started MUNGE authentication service.
And with that, it is running nicely.
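Looking back, the key needed three things: random content, the right owner, and mode 600. Here is a minimal sketch of creating it with tight permissions from the start, so the file is never world-readable even briefly. It is only an illustration under a few assumptions: KEY_DIR is a throwaway scratch directory (so it can be tried without root), /dev/urandom is swapped in for /dev/random, and the chown to munge:munge is left out because it needs root and the real path /etc/munge/munge.key.

```shell
#!/bin/sh
# Sketch only: KEY_DIR is a hypothetical scratch location, not /etc/munge.
KEY_DIR=$(mktemp -d)
KEY_FILE="$KEY_DIR/munge.key"

# Set umask 077 inside a subshell so the file is created as 0600
# and the caller's umask is left untouched.
(
    umask 077
    dd if=/dev/urandom of="$KEY_FILE" bs=1 count=1024 2>/dev/null
)

# Explicit chmod, mirroring the fix from the walkthrough.
chmod 600 "$KEY_FILE"

# Report the size in bytes and the permission string.
key_size=$(wc -c < "$KEY_FILE" | tr -d ' ')
key_mode=$(ls -l "$KEY_FILE" | cut -c1-10)
echo "size=$key_size mode=$key_mode"
```

On the real host the same steps would target /etc/munge/munge.key as root, followed by the chown shown earlier.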
Next, at last, let's start SLURM.
[root@localhost munge]# systemctl start slurmctld
[root@localhost munge]# systemctl start slurmdbd
[root@localhost munge]# systemctl start slurmd
[root@localhost munge]# systemctl sttus slurmd
Unknown operation sttus.
[root@localhost munge]# systemctl status slurmd
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled; vendor preset: disabled)
Active: active (running) since Mon 2019-12-02 09:44:48 EST; 7min ago
Process: 27305 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 27307 (slurmd)
Tasks: 1
Memory: 2.6M
CPU: 668ms
CGroup: /system.slice/slurmd.service
└─27307 /usr/sbin/slurmd
Dec 02 09:52:08 vpn.sc-magi.com slurmd[27307]: error: authentication: Invalid authentication credential
Dec 02 09:52:08 vpn.sc-magi.com slurmd[27307]: error: Unable to register: Protocol authentication error
Dec 02 09:52:11 vpn.sc-magi.com slurmd[27307]: error: If munged is up, restart with --num-threads=10
Dec 02 09:52:11 vpn.sc-magi.com slurmd[27307]: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
Dec 02 09:52:11 vpn.sc-magi.com slurmd[27307]: error: authentication: Invalid authentication credential
Dec 02 09:52:11 vpn.sc-magi.com slurmd[27307]: error: Unable to register: Protocol authentication error
Dec 02 09:52:14 vpn.sc-magi.com slurmd[27307]: error: If munged is up, restart with --num-threads=10
Dec 02 09:52:14 vpn.sc-magi.com slurmd[27307]: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
Dec 02 09:52:14 vpn.sc-magi.com slurmd[27307]: error: authentication: Invalid authentication credential
Dec 02 09:52:14 vpn.sc-magi.com slurmd[27307]: error: Unable to register: Protocol authentication error
It is up, but it does not seem to be working properly. So I restarted the process once more.
[root@localhost munge]# systemctl status slurmd
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled; vendor preset: disabled)
Active: active (running) since Mon 2019-12-02 10:07:18 EST; 2s ago
Process: 27641 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 27643 (slurmd)
Tasks: 1
Memory: 1.4M
CPU: 98ms
CGroup: /system.slice/slurmd.service
└─27643 /usr/sbin/slurmd
Dec 02 10:07:17 vpn.sc-magi.com systemd[1]: Starting Slurm node daemon...
Dec 02 10:07:17 vpn.sc-magi.com slurmd[27641]: error: Node configuration differs from hardware: Procs=1:4(hw) Boards=1:1(hw) SocketsPerBoard=1:1(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:1(hw)
Dec 02 10:07:17 vpn.sc-magi.com slurmd[27641]: Message aggregation disabled
Dec 02 10:07:17 vpn.sc-magi.com slurmd[27643]: slurmd version 19.05.4 started
Dec 02 10:07:17 vpn.sc-magi.com systemd[1]: slurmd.service: Can't open PID file /run/slurm/slurmd.pid (yet?) after start: No such file or directory
Dec 02 10:07:17 vpn.sc-magi.com slurmd[27643]: slurmd started on Mon, 02 Dec 2019 10:07:17 -0500
Dec 02 10:07:17 vpn.sc-magi.com slurmd[27643]: CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=951 TmpDisk=475 Uptime=9636 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
Dec 02 10:07:18 vpn.sc-magi.com systemd[1]: Started Slurm node daemon.
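Why the restart helped, as far as I can tell from the timestamps: the first slurmd status shows it running since 09:44:48, while munged only came up at 09:52:16, so the socket /var/run/munge/munge.socket.2 did not exist when slurmd tried to authenticate. Below is a small sketch of a check to run before (re)starting slurmd. The function name wait_for_socket and the MUNGE_SOCKET variable are my own inventions for illustration; the socket path is the one from the error message.

```shell
#!/bin/sh
# wait_for_socket PATH [TRIES]: succeed once PATH exists as a Unix socket,
# checking once per second for up to TRIES attempts (default 10).
# (Hypothetical helper, not part of MUNGE or Slurm.)
wait_for_socket() {
    sock=$1
    tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        [ -S "$sock" ] && return 0
        tries=$((tries - 1))
        # Sleep only if another attempt is still coming.
        [ "$tries" -gt 0 ] && sleep 1
    done
    return 1
}

# Socket path taken from the slurmd error message above.
MUNGE_SOCKET=${MUNGE_SOCKET:-/var/run/munge/munge.socket.2}

if wait_for_socket "$MUNGE_SOCKET" 1; then
    echo "munge socket present: safe to (re)start slurmd"
else
    echo "munge socket missing: start munge before slurmd"
fi
```

In practice this would sit in front of the restart, for example wait_for_socket "$MUNGE_SOCKET" && systemctl restart slurmd.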
Ah, what's left looks like a configuration file issue. I'll work through that later.
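As a starting point for that, the "Node configuration differs from hardware" line already says which settings are off. My reading, which is an assumption about the log format, is that each pair is configured:detected, with (hw) marking the hardware value. A rough sketch that pulls out only the mismatched fields from that line:

```shell
#!/bin/sh
# Log line copied from the slurmd output above.
line='error: Node configuration differs from hardware: Procs=1:4(hw) Boards=1:1(hw) SocketsPerBoard=1:1(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:1(hw)'

# Split into words, keep the "Name=A:B(hw)" ones, and print those where A != B.
# Assumption: A is the configured value and B is what slurmd detected.
mismatches=$(printf '%s\n' "$line" \
    | tr ' ' '\n' \
    | sed -n 's/^\([A-Za-z]*\)=\([0-9]*\):\([0-9]*\)(hw)$/\1 \2 \3/p' \
    | awk '$2 != $3 { printf "%s: configured=%s detected=%s\n", $1, $2, $3 }')

printf '%s\n' "$mismatches"
```

Here it flags Procs and CoresPerSocket (1 configured, 4 detected), so the node definition in the Slurm configuration is probably where to look. Running slurmd -C should print the detected hardware in configuration-file form, which can be compared against it.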