The Road to System Magi-chan, Once Again: Installing SLURM

Posted at 2019-12-02

What is SLURM?

SLURM is a resource management tool:
when you submit a batch job, it assigns the job to suitable CPUs.
PBS or PBSPro used to be the standard choice, but these days SLURM is a perfectly respectable option too.
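To make "submitting a batch job" concrete, here is a minimal sketch of a job script. The file name, its location, and the job name are hypothetical; the `#SBATCH` lines are ordinary comments to the shell and are only read as directives when the script is handed to `sbatch`.

```shell
# Hypothetical location and file name for the example job script.
job_script="${TMPDIR:-/tmp}/hello_job.sh"

# Write a tiny job script. The quoted 'EOF' keeps $(...) unexpanded until the job runs.
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --ntasks=1
#SBATCH --output=hello-%j.out
echo "hello from $(uname -n)"
EOF

# On a working cluster the script would be submitted with:
#   sbatch "$job_script"
```

Because the directives are comments, the same file can also be run directly with `bash` to check the payload before involving the scheduler.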

Oh? SLURM is available as a package.

So let's install that package.
This, too, is extremely simple:

# dnf install slurm*

And with that, SLURM is installed.

From here on, things get a bit tougher

With SLURM installed, authentication goes through a service called MUNGE.
So the first step is to start MUNGE.

[root@localhost system]# systemctl start munge
Job for munge.service failed because the control process exited with error code.
See "systemctl status munge.service" and "journalctl -xe" for details.

Hm? It won't start. Let's look at the status.

[root@localhost system]# systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2019-12-02 09:46:54 EST; 7s ago
     Docs: man:munged(8)
  Process: 27359 ExecStart=/usr/sbin/munged (code=exited, status=1/FAILURE)
      CPU: 28ms

Dec 02 09:46:54 vpn.sc-magi.com systemd[1]: Starting MUNGE authentication service...
Dec 02 09:46:54 vpn.sc-magi.com munged[27359]: munged: Error: Failed to check keyfile "/etc/munge/munge.key": No such file or directory
Dec 02 09:46:54 vpn.sc-magi.com systemd[1]: munge.service: Control process exited, code=exited, status=1/FAILURE
Dec 02 09:46:54 vpn.sc-magi.com systemd[1]: munge.service: Failed with result 'exit-code'.
Dec 02 09:46:54 vpn.sc-magi.com systemd[1]: Failed to start MUNGE authentication service.

It complains that /etc/munge/munge.key does not exist.

So, running

# man munge

shows the project's web page URL at the very bottom, so I clicked it.
https://dun.github.io/munge/

Looking at the post-installation configuration notes there, they say to run

# dd if=/dev/random bs=1 count=1024 >/etc/munge/munge.key

so I ran exactly that.

[root@localhost munge]# dd if=/dev/random bs=1 count=1024 >/etc/munge/munge.key
1024+0 records in
1024+0 records out
1024 bytes (1.0 kB, 1.0 KiB) copied, 0.029465 s, 34.8 kB/s

Next, change the owner of this key.

[root@localhost munge]# chown munge:munge /etc/munge/munge.key

Surely it will run now, I thought, and tried again...

[root@localhost munge]# systemctl start munge
Job for munge.service failed because the control process exited with error code.
See "systemctl status munge.service" and "journalctl -xe" for details.
[root@localhost munge]# systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2019-12-02 09:51:21 EST; 5s ago
     Docs: man:munged(8)
  Process: 27526 ExecStart=/usr/sbin/munged (code=exited, status=1/FAILURE)
      CPU: 31ms

Dec 02 09:51:19 vpn.sc-magi.com systemd[1]: Starting MUNGE authentication service...
Dec 02 09:51:21 vpn.sc-magi.com munged[27526]: munged: Error: Keyfile is insecure: "/etc/munge/munge.key" should not be readable or writable by group or world
Dec 02 09:51:21 vpn.sc-magi.com systemd[1]: munge.service: Control process exited, code=exited, status=1/FAILURE
Dec 02 09:51:21 vpn.sc-magi.com systemd[1]: munge.service: Failed with result 'exit-code'.
Dec 02 09:51:21 vpn.sc-magi.com systemd[1]: Failed to start MUNGE authentication service.

Ouch, scolded again. This time the key's permissions are too open, so I tighten them.

[root@localhost munge]# chmod 600 /etc/munge/munge.key
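The three steps above (generate, chown, chmod) can be folded into one helper so that the key is never readable by group or world, even briefly. This is only a sketch: `make_munge_key` is a hypothetical name, and it reads from /dev/urandom instead of the /dev/random used above, on the assumption that a non-blocking source is acceptable here.

```shell
# Hypothetical helper: create a 1024-byte random key with owner-only permissions.
make_munge_key() {
    key_path="$1"
    (
        # umask 077 makes the file mode 600 from the moment it is created.
        umask 077
        dd if=/dev/urandom of="$key_path" bs=1 count=1024 2>/dev/null
    )
    # Also tighten the mode in case the file already existed with looser permissions.
    chmod 600 "$key_path"
    # Hand the key to the munge user only when running as root and that user exists.
    if [ "$(id -u)" -eq 0 ] && id munge >/dev/null 2>&1; then
        chown munge:munge "$key_path"
    fi
}

# Usage, as root (path taken from the error message above):
#   make_munge_key /etc/munge/munge.key
```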

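Now it really should run, so I tried again. (The first status output below still shows the earlier failure, because I checked the status before starting the service.)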

[root@localhost munge]# systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2019-12-02 09:51:21 EST; 51s ago
     Docs: man:munged(8)
  Process: 27526 ExecStart=/usr/sbin/munged (code=exited, status=1/FAILURE)
      CPU: 31ms

Dec 02 09:51:19 vpn.sc-magi.com systemd[1]: Starting MUNGE authentication service...
Dec 02 09:51:21 vpn.sc-magi.com munged[27526]: munged: Error: Keyfile is insecure: "/etc/munge/munge.key" should not be readable or writable by group or world
Dec 02 09:51:21 vpn.sc-magi.com systemd[1]: munge.service: Control process exited, code=exited, status=1/FAILURE
Dec 02 09:51:21 vpn.sc-magi.com systemd[1]: munge.service: Failed with result 'exit-code'.
Dec 02 09:51:21 vpn.sc-magi.com systemd[1]: Failed to start MUNGE authentication service.
[root@localhost munge]# systemctl start munge
[root@localhost munge]# systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; disabled; vendor preset: disabled)
   Active: active (running) since Mon 2019-12-02 09:52:16 EST; 3s ago
     Docs: man:munged(8)
  Process: 27563 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
 Main PID: 27565 (munged)
    Tasks: 4 (limit: 999)
   Memory: 1.1M
      CPU: 40ms
   CGroup: /system.slice/munge.service
           └─27565 /usr/sbin/munged

Dec 02 09:52:16 vpn.sc-magi.com systemd[1]: Starting MUNGE authentication service...
Dec 02 09:52:16 vpn.sc-magi.com systemd[1]: Started MUNGE authentication service.

And with that, it is running nicely.
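Beyond the systemd status, a quick way to confirm that munged actually issues and validates credentials is a local round trip with `munge -n | unmunge`. The wrapper below is a sketch; `munge_roundtrip` is a hypothetical name, and return code 2 for "tools not installed" is my own convention.

```shell
# Hypothetical helper: encode an empty credential and decode it again locally.
munge_roundtrip() {
    if ! command -v munge >/dev/null 2>&1 || ! command -v unmunge >/dev/null 2>&1; then
        echo "munge/unmunge not found in PATH" >&2
        return 2
    fi
    # The pipeline's exit status is unmunge's: 0 means the credential decoded cleanly.
    munge -n | unmunge >/dev/null
}

# Usage:
#   munge_roundtrip && echo "MUNGE round trip OK"
```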
Next, at last, let's start SLURM.

[root@localhost munge]# systemctl start slurmctld
[root@localhost munge]# systemctl start slurmdbd
[root@localhost munge]# systemctl start slurmd
[root@localhost munge]# systemctl sttus slurmd
Unknown operation sttus.
[root@localhost munge]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled; vendor preset: disabled)
   Active: active (running) since Mon 2019-12-02 09:44:48 EST; 7min ago
  Process: 27305 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 27307 (slurmd)
    Tasks: 1
   Memory: 2.6M
      CPU: 668ms
   CGroup: /system.slice/slurmd.service
           └─27307 /usr/sbin/slurmd

Dec 02 09:52:08 vpn.sc-magi.com slurmd[27307]: error: authentication: Invalid authentication credential
Dec 02 09:52:08 vpn.sc-magi.com slurmd[27307]: error: Unable to register: Protocol authentication error
Dec 02 09:52:11 vpn.sc-magi.com slurmd[27307]: error: If munged is up, restart with --num-threads=10
Dec 02 09:52:11 vpn.sc-magi.com slurmd[27307]: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
Dec 02 09:52:11 vpn.sc-magi.com slurmd[27307]: error: authentication: Invalid authentication credential
Dec 02 09:52:11 vpn.sc-magi.com slurmd[27307]: error: Unable to register: Protocol authentication error
Dec 02 09:52:14 vpn.sc-magi.com slurmd[27307]: error: If munged is up, restart with --num-threads=10
Dec 02 09:52:14 vpn.sc-magi.com slurmd[27307]: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
Dec 02 09:52:14 vpn.sc-magi.com slurmd[27307]: error: authentication: Invalid authentication credential
Dec 02 09:52:14 vpn.sc-magi.com slurmd[27307]: error: Unable to register: Protocol authentication error

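It is up, but apparently not working properly: these errors date from before munged was running, so slurmd could not reach the MUNGE socket. So I restart this process once more.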
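Since the failure came from the MUNGE socket not existing yet, one way to avoid it is to wait for the socket before (re)starting slurmd. The following is a sketch under that assumption; `wait_for_socket` is a hypothetical helper, and the socket path is the one reported in the log above.

```shell
# Hypothetical helper: wait until a UNIX socket exists, up to a timeout in seconds.
wait_for_socket() {
    sock_path="$1"
    timeout_s="${2:-10}"
    waited=0
    while [ ! -S "$sock_path" ]; do
        # Give up once the timeout has elapsed.
        if [ "$waited" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Usage, as root:
#   wait_for_socket /var/run/munge/munge.socket.2 10 && systemctl restart slurmd
```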

[root@localhost munge]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled; vendor preset: disabled)
   Active: active (running) since Mon 2019-12-02 10:07:18 EST; 2s ago
  Process: 27641 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 27643 (slurmd)
    Tasks: 1
   Memory: 1.4M
      CPU: 98ms
   CGroup: /system.slice/slurmd.service
           └─27643 /usr/sbin/slurmd

Dec 02 10:07:17 vpn.sc-magi.com systemd[1]: Starting Slurm node daemon...
Dec 02 10:07:17 vpn.sc-magi.com slurmd[27641]: error: Node configuration differs from hardware: Procs=1:4(hw) Boards=1:1(hw) SocketsPerBoard=1:1(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:1(hw)
Dec 02 10:07:17 vpn.sc-magi.com slurmd[27641]: Message aggregation disabled
Dec 02 10:07:17 vpn.sc-magi.com slurmd[27643]: slurmd version 19.05.4 started
Dec 02 10:07:17 vpn.sc-magi.com systemd[1]: slurmd.service: Can't open PID file /run/slurm/slurmd.pid (yet?) after start: No such file or directory
Dec 02 10:07:17 vpn.sc-magi.com slurmd[27643]: slurmd started on Mon, 02 Dec 2019 10:07:17 -0500
Dec 02 10:07:17 vpn.sc-magi.com slurmd[27643]: CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=951 TmpDisk=475 Uptime=9636 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
Dec 02 10:07:18 vpn.sc-magi.com systemd[1]: Started Slurm node daemon.

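Ah, the rest looks like a configuration file problem (the log reports that the node configuration differs from the hardware). I'll work through that in due course.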
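As a starting point for that configuration work, the "configured:detected(hw)" pairs in the slurmd error can be reduced to just the detected values. This is a sketch with a hypothetical helper name, `hw_values`; it only reformats the log text and does not touch slurm.conf. Running `slurmd -C` should also print the hardware configuration slurmd detects.

```shell
# Hypothetical helper: keep only the detected (hw) value from each
# "Name=configured:detected(hw)" pair in slurmd's mismatch message.
hw_values() {
    printf '%s\n' "$1" \
        | tr ' ' '\n' \
        | sed -n 's/^\([A-Za-z]*\)=[0-9]*:\([0-9]*\)(hw)$/\1=\2/p' \
        | paste -sd' ' -
}

# Usage with the values from the log above:
hw_values 'Procs=1:4(hw) Boards=1:1(hw) SocketsPerBoard=1:1(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:1(hw)'
# prints: Procs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1
```

The detected values (4 processors, 4 cores per socket) are what the node definition in the configuration file would need to agree with.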
