
Using FrontISTR with AWS ParallelCluster Version 3 ②

Posted at 2022-03-23, last updated at 2022-03-23

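0. Introduction

In the previous article, we created a custom AMI and installed AWS ParallelCluster (the command-line tool). If you have not done these steps yet, see the first article under "Series overview" below.

This article covers everything up to actually launching ParallelCluster from that AMI: writing the configuration file, then reviewing and running the commands.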

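1. Series overview

Creating a custom AMI for ParallelCluster version 3
Creating the configuration file for ParallelCluster version 3 and launching the cluster (this article)
Running FrontISTR in parallel with Slurm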

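2. Creating the configuration file

2.1. Differences from version 2

The book's support repository provides a configuration file for AWS ParallelCluster version 2, v2.11.1-FrontISTR.config. The version 3 format differs substantially, so the file has to be written from scratch. Some of the differences:

Name of the node you log in to: changed from "Master" to "HeadNode"
File format: YAML in version 3
Grouping of the content: version 2 groups settings by function (cluster, VPC, and so on); version 3 groups them by component (head node, compute nodes)
The template-name declaration and the various checks at the top of the version 2 file have been removed

See the official documentation for details.

The rest of this section walks through, from top to bottom, what to write so that a version 3 configuration using Slurm as the job scheduler expresses the same content as the support repository's file. (Note that this is not a general guide to writing configuration files.) The correspondence with the support repository's file is given per item in the form "(→ support L. line number)". The overall mapping is also shown as an image at the end of this section.

(Note) The reference and examples in the official documentation are used without explicit citation. Whether each item is required or optional is stated only where it deserves emphasis; consult the reference for the rest.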

2.2. `Region` section

This is the first item in a version 3 configuration file.

Region: ap-northeast-1

Region (optional): the region to use. Specify the Tokyo region (ap-northeast-1). (→ support L. 24)

2.3. `Image` section

Image:
  Os: alinux2
  CustomAmi: ami-xxxxxxxxxxxxxxxxx

Os (required): the OS to use. Specify Amazon Linux 2 (alinux2). (→ support L. 27)
CustomAmi (optional): the base AMI. Specify the custom AMI you created. (→ support L. 29)

2.4. `HeadNode` section

In AWS ParallelCluster version 2, the EC2 instance you log in to first was called "Master"; in version 3 it was renamed "HeadNode". As with GitHub's branch naming, this reflects inclusive language. The section is explained in three parts below.

HeadNode:
  InstanceType: c5n.large

InstanceType (required): the instance type of the head node. Specify c5n.large. (→ support L. 33)

HeadNode:
...
  Networking:
    SubnetId: subnet-xxxxxxxxxxxxxxxxx
    ElasticIp: true

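SubnetId: the subnet to use. Specify the subnet you created. (→ support L. 54)
- There is no place to specify a VPC; the VPC associated with the subnet is used.
ElasticIp: whether to assign an Elastic IP (public IP) to the head node. Be sure to set it to true.
- (Lesson learned) The first time, I skipped this setting; no public IP was assigned, and although the head node started, I could never log in.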

HeadNode:
...
  Ssh:
    KeyName: FrontISTR-key
  LocalStorage:
    RootVolume:
      Size: 35

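Ssh > KeyName: the key pair used for SSH connections. Specify the key you created. (→ support L. 28)
LocalStorage > RootVolume > Size: the size of the root volume. Set it to 35 (GiB). (→ support L. 35)

2.5. `Scheduling` section

This section is long, so it is also explained in several parts.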

Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 15

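Scheduler: the job scheduler to use. Specify Slurm here. (→ support L. 30)
SlurmSettings > ScaledownIdletime: how long a compute node may sit idle before it is shut down. Set it to 15 (minutes), longer than in the support repository. (→ support L. 50)

2.5.1. `SlurmQueues` subsection

This part is the core of the Scheduling section; it holds the concrete settings for the compute nodes.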

Scheduling:
...
  SlurmQueues:
    - Name: queue
      ComputeSettings:
        LocalStorage:
          RootVolume:
            Size: 35
      CapacityType: SPOT

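Name: the name of this queue. Choose any name.
ComputeSettings > LocalStorage > RootVolume > Size: the size of the compute nodes' root volume. Set it to 35 (GiB). (→ support L. 36)
CapacityType: the purchase type of the compute nodes, On-Demand or Spot. Specify SPOT. (→ support L. 31)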

...
  SlurmQueues:
...
      Networking:
        SubnetIds:
          - subnet-xxxxxxxxxxxxxxxxx
        AssignPublicIp: true
        PlacementGroup:
          Enabled: true

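SubnetIds: the subnets to use. Specify the subnet you created. (→ support L. 54)
- In version 2, specifying only the Master's subnet meant the compute nodes inherited it; in version 3 it must be written in both places. Networking and SubnetIds are both required here, so make sure not to omit them.
- As with the head node, there is no place to specify a VPC.
AssignPublicIp: whether to assign public IPs to the compute nodes. Be sure to set it to true.
- (Lesson learned) Although I had set ElasticIp: true on the head node, I skipped this setting, and the nodes came up but the computation never ran.
PlacementGroup > Enabled: whether the cluster uses a placement group. Specify true. (→ support L. 43)
- In version 2, placement_group selected between automatic creation and an existing group. In version 3, the distinction is made by whether an existing group is given in PlacementGroup > Id. The support repository used DYNAMIC (automatic creation), so Id is omitted here.
- In version 2, placement selected whether the group applied to the whole cluster or to the compute nodes only. Version 3 does not appear to have an equivalent, so it is ignored. (→ support L. 42)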
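
Because the subnet is no longer inherited from the head node, it is easy to forget the queue-side entry when porting a version 2 file. The following is a minimal sketch of a pre-flight check, assuming the YAML file has already been loaded into a plain Python dict by some YAML parser. The function name find_missing_networking and the example dict are hypothetical and not part of ParallelCluster.

```python
from typing import Any, Dict, List


def find_missing_networking(config: Dict[str, Any]) -> List[str]:
    """Return the dotted paths of subnet settings that are absent from the config."""
    missing: List[str] = []

    # The head node takes a single subnet ID.
    head_net = config.get("HeadNode", {}).get("Networking", {})
    if not head_net.get("SubnetId"):
        missing.append("HeadNode.Networking.SubnetId")

    # Every Slurm queue needs its own list of subnet IDs; nothing is inherited.
    queues = config.get("Scheduling", {}).get("SlurmQueues", [])
    for index, queue in enumerate(queues):
        queue_net = queue.get("Networking", {})
        if not queue_net.get("SubnetIds"):
            missing.append(f"Scheduling.SlurmQueues[{index}].Networking.SubnetIds")

    return missing


# Hypothetical config in which the queue-side subnet was forgotten.
example = {
    "HeadNode": {"Networking": {"SubnetId": "subnet-xxxxxxxxxxxxxxxxx"}},
    "Scheduling": {"SlurmQueues": [{"Name": "queue", "Networking": {}}]},
}
print(find_missing_networking(example))
```

An empty list means both subnet entries are present; the check says nothing about whether the subnet IDs themselves are valid.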

...
  SlurmQueues:
...
      ComputeResources:
        - Name: compute-resource
          InstanceType: c5n.18xlarge
          MinCount: 0
          DisableSimultaneousMultithreading: true
          Efa:
            Enabled: true

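Name: the name of this compute resource. Choose any name.
InstanceType: the instance type of the compute nodes. Specify c5n.18xlarge. (→ support L. 34)
MinCount: the minimum number of compute nodes. Set it to 0. (→ support L. 32)
- It is written out because the version 2 default minimum was 2¹; in version 3 the default is 0, so it can be omitted.
DisableSimultaneousMultithreading: disables hyper-threading. Specify true. (→ support L. 37)
Efa > Enabled: enables EFA (Elastic Fabric Adapter)². Specify true. (→ support L. 40)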

...
  SlurmQueues:
...
      Iam:
        S3Access:
          - BucketName: xxxxxxxxxxxxxxxxx
            EnableWriteAccess: true

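S3Access: settings for S3 access. Included because the cluster will be connected to S3.
BucketName: the S3 bucket to use. Specify the name of the bucket you created. (→ support L. 41)
- Unlike version 2, this takes the bucket name itself rather than an ARN (Amazon Resource Name).
EnableWriteAccess: whether to allow writes. Specify true. (→ support L. 41)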
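
When porting a version 2 file, the bucket name has to be pulled out of the ARN that was written there. The sketch below shows one way to do that, assuming an ARN in the standard `aws` partition of the form arn:aws:s3:::bucket or arn:aws:s3:::bucket/key. The helper name bucket_name_from_arn and the bucket example-bucket are hypothetical.

```python
def bucket_name_from_arn(arn: str) -> str:
    """Extract the bucket name from an S3 ARN such as arn:aws:s3:::my-bucket/*."""
    prefix = "arn:aws:s3:::"
    if not arn.startswith(prefix):
        raise ValueError(f"not an S3 ARN: {arn!r}")
    resource = arn[len(prefix):]
    # Anything after the first "/" is an object key or wildcard, not the bucket.
    bucket = resource.split("/", 1)[0]
    if not bucket:
        raise ValueError(f"no bucket name in ARN: {arn!r}")
    return bucket


# Hypothetical bucket; the result is what would go under BucketName.
print(bucket_name_from_arn("arn:aws:s3:::example-bucket/*"))
```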

2.6. `SharedStorage` section

SharedStorage:
  - MountDir: shared
    Name: ebs
    StorageType: Ebs

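MountDir: the directory to share. Specify shared (that is, /shared). (→ support L. 57)
Name: the name of the shared storage. Choose any name.
StorageType: the type of shared storage. Specify EBS (Amazon Elastic Block Store)³. (→ support L. 47, 56)

2.7. The complete configuration file

The complete file is shown below. Because it is written in YAML, it was saved here as "v3.1.2-FrontISTR.yaml".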

v3.1.2-FrontISTR.yaml

Region: ap-northeast-1
Image:
  Os: alinux2
  CustomAmi: ami-xxxxxxxxxxxxxxxxx
HeadNode:
  InstanceType: c5n.large
  Networking:
    SubnetId: subnet-xxxxxxxxxxxxxxxxx
    ElasticIp: true
  Ssh:
    KeyName: xxxxxxxxxxxxxxxxx
  LocalStorage:
    RootVolume:
      Size: 35
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 15
  SlurmQueues:
    - Name: queue
      ComputeSettings:
        LocalStorage:
          RootVolume:
            Size: 35
      CapacityType: SPOT
      Networking:
        SubnetIds:
          - subnet-xxxxxxxxxxxxxxxxx
        AssignPublicIp: true
        PlacementGroup:
          Enabled: true
      ComputeResources:
        - Name: compute-resource
          InstanceType: c5n.18xlarge
          MinCount: 0
          DisableSimultaneousMultithreading: true
          Efa:
            Enabled: true
      Iam:
        S3Access:
          - BucketName: xxxxxxxxxxxxxxxxx
            EnableWriteAccess: true
SharedStorage:
  - MountDir: shared
    Name: ebs
    StorageType: Ebs

It is rather hard to read, but the correspondence with the version 2 configuration file provided in the support repository is as shown in the image.
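
The same correspondence can also be kept as a small lookup table. The sketch below only collects the per-item line references given in sections 2.2 to 2.6. The dotted key names are an informal shorthand chosen for this example (list levels such as SlurmQueues entries are flattened), and v3_settings_for is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# v3 setting -> line number(s) in the support repository's v2 config file,
# copied from the per-item references in sections 2.2 to 2.6.
V3_TO_V2_LINES: Dict[str, Tuple[int, ...]] = {
    "Region": (24,),
    "Image.Os": (27,),
    "Image.CustomAmi": (29,),
    "HeadNode.InstanceType": (33,),
    "HeadNode.Networking.SubnetId": (54,),
    "HeadNode.Ssh.KeyName": (28,),
    "HeadNode.LocalStorage.RootVolume.Size": (35,),
    "Scheduling.Scheduler": (30,),
    "Scheduling.SlurmSettings.ScaledownIdletime": (50,),
    "SlurmQueues.ComputeSettings.LocalStorage.RootVolume.Size": (36,),
    "SlurmQueues.CapacityType": (31,),
    "SlurmQueues.Networking.SubnetIds": (54,),
    "SlurmQueues.Networking.PlacementGroup.Enabled": (43,),
    "SlurmQueues.ComputeResources.InstanceType": (34,),
    "SlurmQueues.ComputeResources.MinCount": (32,),
    "SlurmQueues.ComputeResources.DisableSimultaneousMultithreading": (37,),
    "SlurmQueues.ComputeResources.Efa.Enabled": (40,),
    "SlurmQueues.Iam.S3Access.BucketName": (41,),
    "SlurmQueues.Iam.S3Access.EnableWriteAccess": (41,),
    "SharedStorage.MountDir": (57,),
    "SharedStorage.StorageType": (47, 56),
}


def v3_settings_for(v2_line: int) -> List[str]:
    """List the v3 settings that correspond to one line of the v2 file."""
    return sorted(key for key, lines in V3_TO_V2_LINES.items() if v2_line in lines)


# Line 54 of the v2 file (the subnet) feeds two places in the v3 file.
print(v3_settings_for(54))
# Line 42 (placement) has no v3 counterpart, so the result is empty.
print(v3_settings_for(42))
```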

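3. Commands used on the local PC

This section gives the bare minimum on the three commands used most. For options not covered here and for other commands, see the official documentation.

3.1. Creating a cluster

In version 3, a cluster is created with pcluster create-cluster. Note that this differs from pcluster create in version 2. Specifying a configuration file is also now mandatory.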

pcluster create-cluster --cluster-name CLUSTER_NAME --cluster-configuration CLUSTER_CONFIGURATION

The options --cluster-name and --cluster-configuration can be shortened to -n and -c respectively, so the following is equivalent.

pcluster create-cluster -n CLUSTER_NAME -c CLUSTER_CONFIGURATION

3.2. Deleting a cluster

In version 3, a cluster is deleted with pcluster delete-cluster, which likewise replaces pcluster delete from version 2. A configuration file can no longer be specified when deleting.

pcluster delete-cluster --cluster-name CLUSTER_NAME

As with creation, --cluster-name can be shortened to -n.

pcluster delete-cluster -n CLUSTER_NAME

3.3. Listing clusters

In version 3, clusters are listed with pcluster list-clusters, which likewise replaces pcluster list from version 2.

pcluster list-clusters

Note that list-clusters is plural.
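
If these commands are driven from a script, it can help to build the argument lists in one place so the version 3 subcommand names are not mistyped. The sketch below only assembles the three commands described in this section and prints one of them; it does not run pcluster. The function names are hypothetical, and the cluster and file names are just example values.

```python
import shlex
from typing import List


def create_cluster_cmd(name: str, config_path: str) -> List[str]:
    """Arguments for creating a cluster; the configuration file is mandatory in v3."""
    return ["pcluster", "create-cluster", "-n", name, "-c", config_path]


def delete_cluster_cmd(name: str) -> List[str]:
    """Arguments for deleting a cluster; no configuration file is given."""
    return ["pcluster", "delete-cluster", "-n", name]


def list_clusters_cmd() -> List[str]:
    """Arguments for listing clusters (note the plural subcommand)."""
    return ["pcluster", "list-clusters"]


# Print the command line instead of executing it.
print(shlex.join(create_cluster_cmd("FrontISTR-cluster", "v3.1.2-FrontISTR.yaml")))
```

Each list can be passed as-is to subprocess.run on a machine where pcluster is installed.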

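4. Launching ParallelCluster

The configuration file is now written and the commands reviewed, so let's launch ParallelCluster.

4.1. Launching the head node

Launch the cluster by specifying the configuration file created above. The cluster name was arbitrarily set to "FrontISTR-cluster".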

pcluster create-cluster -n FrontISTR-cluster -c v3.1.2-FrontISTR.yaml

Running the command returns the details in JSON format.

$ pcluster create-cluster -n FrontISTR-cluster -c v3.1.2-FrontISTR.yaml 
{
  "cluster": {
    "clusterName": "FrontISTR-cluster",
    "cloudformationStackStatus": "CREATE_IN_PROGRESS",
    "cloudformationStackArn": "arn:aws:cloudformation:ap-northeast-1:xxxxxxxxxxxx:stack/FrontISTR-cluster/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "region": "ap-northeast-1",
    "version": "3.1.2",
    "clusterStatus": "CREATE_IN_PROGRESS"
  },
  "validationMessages": [
    {
      "level": "WARNING",
      "type": "CustomAmiTagValidator",
      "message": "The custom AMI may not have been created by pcluster. You can ignore this warning if the AMI is shared or copied from another pcluster AMI. If the AMI is indeed not created by pcluster, cluster creation will fail. If the cluster creation fails, please go to https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting.html#troubleshooting-stack-creation-failures for troubleshooting."
    },
    {
      "level": "WARNING",
      "type": "AmiOsCompatibleValidator",
      "message": "Could not check node AMI ami-xxxxxxxxxxxxxxxxx OS and cluster OS alinux2 compatibility, please make sure they are compatible before cluster creation and update operations."
    }
  ]
}
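
The validationMessages array in the output above is easy to overlook in the full JSON. As a rough sketch, it can be reduced to a list of (level, type) pairs with the standard json module. The function name validation_summary is hypothetical, and the sample string is a trimmed copy of the output above with the message texts left out.

```python
import json
from typing import List, Tuple


def validation_summary(create_output: str) -> List[Tuple[str, str]]:
    """Return (level, type) for each validation message in create-cluster output."""
    data = json.loads(create_output)
    return [(msg["level"], msg["type"]) for msg in data.get("validationMessages", [])]


# Trimmed copy of the output shown above (message texts omitted).
sample = """
{
  "cluster": {"clusterName": "FrontISTR-cluster", "clusterStatus": "CREATE_IN_PROGRESS"},
  "validationMessages": [
    {"level": "WARNING", "type": "CustomAmiTagValidator"},
    {"level": "WARNING", "type": "AmiOsCompatibleValidator"}
  ]
}
"""
for level, kind in validation_summary(sample):
    print(f"{level}: {kind}")
```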

Listing the clusters with pcluster list-clusters also returns JSON.

$ pcluster list-clusters
{
  "clusters": [
    {
      "clusterName": "FrontISTR-cluster",
      "cloudformationStackStatus": "CREATE_IN_PROGRESS",
      "cloudformationStackArn": "arn:aws:cloudformation:ap-northeast-1:xxxxxxxxxxxx:stack/FrontISTR-cluster/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
      "region": "ap-northeast-1",
      "version": "3.1.2",
      "clusterStatus": "CREATE_IN_PROGRESS"
    }
  ]
}

When creation finishes, the status becomes "CREATE_COMPLETE".

$ pcluster list-clusters
{
  "clusters": [
    {
      "clusterName": "FrontISTR-cluster",
      "cloudformationStackStatus": "CREATE_COMPLETE",
      "cloudformationStackArn": "arn:aws:cloudformation:ap-northeast-1:xxxxxxxxxxxx:stack/FrontISTR-cluster/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
      "region": "ap-northeast-1",
      "version": "3.1.2",
      "clusterStatus": "CREATE_COMPLETE"
    }
  ]
}
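
Rather than reading the JSON by eye each time, the status of one cluster can be picked out of the list-clusters output. The following is a minimal sketch that works on the output text only; cluster_status is a hypothetical helper, and the sample strings are trimmed copies of the listings above.

```python
import json
from typing import Optional


def cluster_status(list_output: str, name: str) -> Optional[str]:
    """Return clusterStatus for the named cluster, or None if it is not listed."""
    clusters = json.loads(list_output).get("clusters", [])
    for cluster in clusters:
        if cluster.get("clusterName") == name:
            return cluster.get("clusterStatus")
    return None


# Trimmed copies of the two listings above.
in_progress = '{"clusters": [{"clusterName": "FrontISTR-cluster", "clusterStatus": "CREATE_IN_PROGRESS"}]}'
complete = '{"clusters": [{"clusterName": "FrontISTR-cluster", "clusterStatus": "CREATE_COMPLETE"}]}'

print(cluster_status(in_progress, "FrontISTR-cluster"))
print(cluster_status(complete, "FrontISTR-cluster"))
```

A polling script could call pcluster list-clusters periodically and stop once this returns "CREATE_COMPLETE"; a cluster that is not in the list yields None.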

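In the AWS Management Console, you can see that an EC2 instance named "HeadNode" has been created.

4.2. Logging in to the head node

In version 3, pcluster create-cluster does not return the public IP or similar details to the CLI when creation completes. Logging in with the pcluster ssh command therefore saves you from opening the AWS Management Console to look up the public IP.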

pcluster ssh -n FrontISTR-cluster -i PRIVATE_KEY

Of course, you can also get the IP from the console and log in with the ordinary ssh command.

ssh -i PRIVATE_KEY ec2-user@xxx.xxx.xxx.xxx

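An example of actually running FrontISTR after logging in, together with how to write job scripts, is covered in the next article.

4.3. Deleting the head node

A head node that is left running keeps accruing charges indefinitely, so delete the cluster after use. Specify the name of the ParallelCluster you created and run the delete command. As at launch, the details are returned in JSON format.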

$ pcluster delete-cluster -n FrontISTR-cluster
{
  "cluster": {
    "clusterName": "FrontISTR-cluster",
    "cloudformationStackStatus": "DELETE_IN_PROGRESS",
    "cloudformationStackArn": "arn:aws:cloudformation:ap-northeast-1:xxxxxxxxxxxx:stack/FrontISTR-cluster/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "region": "ap-northeast-1",
    "version": "3.1.2",
    "clusterStatus": "DELETE_IN_PROGRESS"
  }
}

Once deletion completes, the list returned by pcluster list-clusters is empty.

$ pcluster list-clusters
{
  "clusters": []
}

Previous article: Using FrontISTR with AWS ParallelCluster Version 3 ①
Next article: Using FrontISTR with AWS ParallelCluster Version 3 ③
