
Create a Ray Cluster on EC2 with Minimal IAM Permissions

Posted at 2022-01-07

Background / Goal

  • Using an IAM user with Admin permissions is the easy way to create a Ray cluster, but it is bad from a security standpoint.
  • We want to launch the Ray cluster with only the minimum permissions, so we create an IAM user/role with a narrowed-down policy and use it to create the cluster.
  • CloudFormation is used to create the IAM user/role.

References

Procedure Overview

  1. Create an IAM user and IAM roles with the minimum required permissions.
  2. Create instance profiles from the IAM roles.
  3. Reference the ARNs (Amazon Resource Names) of those instance profiles in the Ray cluster config file so that the Ray head node and Ray worker nodes are assigned only the minimum permissions.
  4. Launch the Ray cluster from that config file.

Detailed Steps

1. Create the IAM Policy / Role / IAM User / Access Key

Create a CloudFormation stack template and create everything in one go. The template CloudFormationForRay.json is listed in the appendix. The IAM policies, IAM roles, IAM user, and access key are summarized below.

  • IAM Policy

    • ray-ec2-launcher-policy
    • ray-s3-access-policy
  • IAM Role

    • ray-head-v1-role
      • Both ray-ec2-launcher-policy and ray-s3-access-policy attached
    • ray-worker-v1-role
      • Only ray-s3-access-policy attached
  • IAM InstanceProfile

    • ray-head-v1-instanceprofile
      • Uses ray-head-v1-role
    • ray-worker-v1-instanceprofile
      • Uses ray-worker-v1-role
  • IAM User

    • ray-launcher-user
      • Both ray-ec2-launcher-policy and ray-s3-access-policy attached
  • IAM AccessKey

  • SecretsManager Secret
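
The stack can be created from the console or from the CLI; a CLI invocation along these lines should work (the stack name ray-iam-stack is a placeholder of mine, not something from the template):

$ aws cloudformation create-stack \
    --stack-name ray-iam-stack \
    --template-body file://CloudFormationForRay.json \
    --parameters ParameterKey=ACCOUNTID,ParameterValue=<AccountID> \
                 ParameterKey=REGION,ParameterValue=us-west-2 \
    --capabilities CAPABILITY_NAMED_IAM

--capabilities CAPABILITY_NAMED_IAM is required because the template creates IAM resources with explicit names.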

Once CloudFormation finishes creating the stack, open ray-launcher-user-credentials under "Secrets" in AWS Secrets Manager and note down the accessKeyId and secretAccessKey values.

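If you prefer the CLI, the same values can be read with the following command (run it with the credentials you used to create the stack; ray-launcher-user itself has no Secrets Manager permissions):

$ aws secretsmanager get-secret-value \
    --secret-id ray-launcher-user-credentials \
    --query SecretString --output text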

Register the credentials of the IAM user ray-launcher-user with the following command. You will be prompted interactively for the AWS Access Key ID and AWS Secret Access Key; enter the accessKeyId and secretAccessKey you noted down above. Default region name and Default output format can be set to whatever you like.

$ aws configure
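
The interactive prompts look like this (the region and output format shown here are just example values):

AWS Access Key ID [None]: <accessKeyId noted above>
AWS Secret Access Key [None]: <secretAccessKey noted above>
Default region name [None]: us-west-2
Default output format [None]: json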

2. Launch the Ray Cluster

Create the Ray cluster launch config RayClusterConfig.yaml, referring to the notes below. See the appendix for the full file.

Launch the Ray cluster with the RayClusterConfig.yaml you created.

$ ray up RayClusterConfig.yaml
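
Once the cluster is up, ray monitor RayClusterConfig.yaml tails the autoscaler log, and when you no longer need the cluster it can be torn down from the same config file:

$ ray down RayClusterConfig.yaml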

3. Access Jupyter Notebook

Connect to the head node with the following command.

ray attach RayClusterConfig.yaml

Start Jupyter Notebook.

jupyter notebook --ip=0.0.0.0 --no-browser --port=8888

Access the Jupyter Notebook running on the head node from any browser such as Chrome. For the token, enter the one printed in the terminal when Jupyter Notebook started.

http://<cluster head IP>:8888/
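
If you would rather not open port 8888 in the security group at all, an SSH tunnel works as well; ray attach has a -p/--port-forward option that should cover this (a sketch, not the setup used above):

ray attach RayClusterConfig.yaml -p 8888

The notebook is then reachable at http://localhost:8888/.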

In the Jupyter notebook, connect to Ray with ray.init(). To make the dashboard reachable from outside the node, pass dashboard_host="0.0.0.0" so that it listens on all interfaces.

import ray
ray.init(dashboard_host="0.0.0.0")

The Jupyter notebook then prints the port on which the dashboard is running, like this:

2022-01-11 01:24:48,877	INFO services.py:1340 -- View the Ray dashboard at http://172.31.17.230:8266

Open http://<cluster IP>:8266 in a browser and the dashboard appears. Note: port 8266 must be allowed in the security group.

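As a quick check that tasks really run across the cluster, something like the following can be executed in the same notebook (square is just a throwaway example function; the ray.init() call from above is assumed to have run already):

import ray

@ray.remote
def square(x):
    return x * x

# Submit a handful of tasks; ray.get blocks until all results come back.
print(ray.get([square.remote(i) for i in range(8)]))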

That's it. Thanks for reading! ✨

Appendix

CloudFormation Stack Template

CloudFormationForRay.json
{
    "AWSTemplateFormatVersion" : "2010-09-09",
    
    "Parameters" : {
      "ACCOUNTID": {
        "Type": "String",
        "Default" : "AccountID"
      },
      "REGION": {
        "Type": "String",
        "Default" : "us-west-2"
      }
    },
  
    "Resources" : {
      "RayEc2LauncherPolicy" : {
        "Type" : "AWS::IAM::ManagedPolicy",
        "Properties" : {
          "ManagedPolicyName": "ray-ec2-launcher-policy",
          "PolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": "ec2:RunInstances",
                    "Resource": "*"
                },
                {
                    "Effect": "Allow",
                    "Action": [
                        "ec2:TerminateInstances",
                        "ec2:DeleteTags",
                        "ec2:StartInstances",
                        "ec2:CreateTags",
                        "ec2:StopInstances"
                    ],
                    "Resource": {"Fn::Sub": "arn:aws:ec2:${REGION}:${ACCOUNTID}:instance/*"}
                },
                {
                    "Effect": "Allow",
                    "Action": [
                        "ec2:Describe*",
                        "ec2:DescribeInstances",
                        "ec2:DescribeImages",
                        "ec2:DescribeKeyPairs",
                        "ec2:DescribeSecurityGroups",
                        "ec2:AuthorizeSecurityGroupIngress",
                        "iam:GetInstanceProfile",
                        "ec2:CreateSecurityGroup",
                        "ec2:CreateKeyPair"
                    ],
                    "Resource": "*"
                },
                {
                    "Effect": "Allow",
                    "Action": [
                        "iam:PassRole"
                    ],
                    "Resource": [
                        {"Fn::Sub": "arn:aws:iam::${ACCOUNTID}:role/ray-head-v1-role"},
                        {"Fn::Sub": "arn:aws:iam::${ACCOUNTID}:role/ray-worker-v1-role"}
                    ]
                }
            ]
          }
        }
      },
      "RayS3AccessPolicy" : {
        "Type" : "AWS::IAM::ManagedPolicy",
        "Properties" : {
          "ManagedPolicyName": "ray-s3-access-policy",
          "PolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": [
                        "s3:*"
                    ],
                    "Effect": "Allow",
                    "Resource": [
                        "arn:aws:s3:::ray-data/*",
                        "arn:aws:s3:::ray-data"
                    ]
                }
            ]
          }
        }
      },
      "RayHeadV1Role": {
        "Type" : "AWS::IAM::Role",
        "Properties" : {
            "AssumeRolePolicyDocument" : {
              "Version": "2012-10-17",
              "Statement": [
                {
                  "Effect": "Allow",
                  "Principal": {
                      "Service": [
                          "ec2.amazonaws.com"
                      ]
                  },
                  "Action": [
                      "sts:AssumeRole"
                  ]
              }
              ]
            },
            "ManagedPolicyArns" : [{"Ref": "RayEc2LauncherPolicy"}, {"Ref": "RayS3AccessPolicy"}],
            "RoleName" : "ray-head-v1-role"
          }
      },
      "RayWorkerV1Role": {
        "Type" : "AWS::IAM::Role",
        "Properties" : {
            "AssumeRolePolicyDocument" : {
              "Version": "2012-10-17",
              "Statement": [
                {
                  "Effect": "Allow",
                  "Principal": {
                      "Service": [
                          "ec2.amazonaws.com"
                      ]
                  },
                  "Action": [
                      "sts:AssumeRole"
                  ]
              }
              ]
            },
            "ManagedPolicyArns" : [{"Ref": "RayS3AccessPolicy"}],
            "RoleName" : "ray-worker-v1-role"
          }
      },
      "RayHeadV1InstanceProfile": {
        "Type" : "AWS::IAM::InstanceProfile",
        "Properties" : {
            "InstanceProfileName" : "ray-head-v1-instanceprofile",
            "Roles" : [{"Ref": "RayHeadV1Role"}]
          }
      },
      "RayWorkerV1InstanceProfile": {
        "Type" : "AWS::IAM::InstanceProfile",
        "Properties" : {
            "InstanceProfileName" : "ray-worker-v1-instanceprofile",
            "Roles" : [{"Ref": "RayWorkerV1Role"}]
          }
      },
      "RayLauncherUser": {
        "Type" : "AWS::IAM::User",
        "Properties" : {
            "ManagedPolicyArns" : [{"Ref": "RayEc2LauncherPolicy"}, {"Ref": "RayS3AccessPolicy"}],
            "UserName" : "ray-launcher-user"
        }
      },
      "RayLauncherUserAccessKey": {
        "Type": "AWS::IAM::AccessKey",
        "Properties": {
          "UserName": {"Ref": "RayLauncherUser"}
        }
      },
      "RayLauncherUserAccessKeySecret": {
        "Type": "AWS::SecretsManager::Secret",
        "Properties": {
          "Name": {"Fn::Sub": "${RayLauncherUser}-credentials"},
          "SecretString": {"Fn::Sub": "{\"accessKeyId\":\"${RayLauncherUserAccessKey}\",\"secretAccessKey\":\"${RayLauncherUserAccessKey.SecretAccessKey}\"}"}
        }
      }
    }
  }
  

Ray Cluster Configuration File

The most important parts of the Ray cluster config are highlighted below. The full config is listed at the end.

Specify the AWS region and security group

provider:
    region: us-west-2
    availability_zone: us-west-2a,us-west-2b
    security_group:
      GroupName: str
      IpPermissions:  
        - IpPermission

Note that port 22 is already allowed in the security group by default (the Ray launcher opens it so it can SSH into the nodes). If you add a rule that allows port 22 yourself, you get the error below.

If you allow port 22 yourself...
    security_group:
      GroupName: ray-cluster-sg
      IpPermissions:
        - FromPort: 22
          IpProtocol: tcp
          IpRanges: 
          - CidrIp: 0.0.0.0/0
          ToPort: 22
Error when the same port rule is granted twice
botocore.exceptions.ClientError: An error occurred (InvalidParameterValue) 
when calling the AuthorizeSecurityGroupIngress operation: 
The same permission must not appear multiple times

Specify the EC2 instance type and instance profile

available_node_types:
    ray.head.default:
        node_config:
            InstanceType: m5.large
            IamInstanceProfile:
                Arn: arn:aws:iam::<AccountID>:instance-profile/ray-head-v1-instanceprofile
    ray.worker.default:
        node_config:
            InstanceType: m5.large
            IamInstanceProfile:
                Arn: arn:aws:iam::<AccountID>:instance-profile/ray-worker-v1-instanceprofile

Copy local directories to the Ray cluster

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}
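
For example, to make a local ./notebooks directory available on every node, the mapping would look like this (the paths are hypothetical, not from my actual setup):

file_mounts: {
    "/home/ubuntu/notebooks": "./notebooks",
}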

Use spot instances

Using spot instances lowers the EC2 cost. Setting MarketType: spot is recommended. (In the full RayClusterConfig.yaml below this option sits under node_config of ray.worker.default; the snippet here uses the older worker_nodes form.)

# Provider-specific config for worker nodes, e.g. instance type.
worker_nodes:
    InstanceType: m5.large
    ImageId: ami-0b294f219d14e6a82 # Deep Learning AMI (Ubuntu) Version 21.0

    # Run workers on spot by default. Comment this out to use on-demand.
    InstanceMarketOptions:
        MarketType: spot
        SpotOptions:
            MaxPrice: 1.0  # Max Hourly Price

Full RayClusterConfig.yaml

RayClusterConfig.yaml
# A unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

    # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"
    # Allow Ray to automatically detect GPUs

    # worker_image: "rayproject/ray-ml:latest-cpu"
    # worker_run_options: []

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    # Availability zone(s), comma-separated, that nodes may be launched in.
    # Nodes are currently spread between zones by a round-robin approach,
    # however this implementation detail should not be relied upon.
    availability_zone: us-west-2a,us-west-2b
    # Whether to allow node reuse. If set to False, nodes will be terminated
    # instead of stopped.
    cache_stopped_nodes: True # If not present, the default is True.
    security_group:
        GroupName: ray-cluster-sg
        IpPermissions:
            - FromPort: 8888
              IpProtocol: tcp
              IpRanges: 
              - CidrIp: 0.0.0.0/0
              ToPort: 8888
            

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below.
#    ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see:
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
            ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
            # You can provision additional disk space with a conf as follows
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 100
            # Additional options in the boto docs.
            IamInstanceProfile:
                Arn: arn:aws:iam::<AccountID>:instance-profile/ray-head-v1-instanceprofile
    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see:
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
            ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
            # Run workers on spot by default. Comment this out to use on-demand.
            # NOTE: If relying on spot instances, it is best to specify multiple different instance
            # types to avoid interruption when one instance type is experiencing heightened demand.
            # Demand information can be found at https://aws.amazon.com/ec2/spot/instance-advisor/
            InstanceMarketOptions:
                MarketType: spot
                # Additional options can be found in the boto docs, e.g.
                #   SpotOptions:
                #       MaxPrice: MAX_HOURLY_PRICE
            # Additional options in the boto docs.
            IamInstanceProfile:
                Arn: arn:aws:iam::<AccountID>:instance-profile/ray-worker-v1-instanceprofile

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands: []
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}