Introduction to Deployment on SageMaker (2/2): Configuring Autoscaling for Asynchronous Inference

Posted at 2024-07-09

Background

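The previous article, "Deployment on SageMaker (1/2): About SageMaker Deployment", summarized how deployment works on SageMaker.
This article covers how to configure autoscaling for asynchronous inference.
With this in place, a GPU instance is started only while there are requests in the queue, and the instance count drops to 0 when the queue is empty, so no instance charges are incurred while the endpoint is idle.
Here we deploy an asynchronous inference endpoint whose instance count autoscales between 0 and 5.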

Implementation (notebook)

See the Gist below for the full implementation.

Creating the model and endpoint configuration

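import boto3

# model_name, container_image_uri, model_s3_uri, region, role, endpoint_config_name
# and bucket are assumed to be defined earlier in the notebook (see the Gist).
sm_client = boto3.client('sagemaker')  # assumed client setup

response = sm_client.create_model(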
    ModelName=model_name,
    PrimaryContainer={
        'Image': container_image_uri,
        'ModelDataUrl': model_s3_uri,
        'Environment': {
            'SAGEMAKER_CONTAINER_LOG_LEVEL': '20',
            'SAGEMAKER_PROGRAM': 'inference.py',
            'SAGEMAKER_REGION': region,
            'SAGEMAKER_SUBMIT_DIRECTORY': '/opt/ml/model/code',
            'TS_DEFAULT_RESPONSE_TIMEOUT': '3600',
            'TS_MAX_RESPONSE_SIZE': '2000000000'
        },
    },
    ExecutionRoleArn=role,
)
# Create the EndpointConfig
response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            'VariantName': 'AllTrafic',
            'ModelName': model_name,
            'InitialInstanceCount': 1,
            'InstanceType': 'ml.g4dn.2xlarge',
        },
    ],
    AsyncInferenceConfig={
        "OutputConfig": {
            "S3OutputPath": f"s3://{bucket}/async_sdxl/output"
        },
        "ClientConfig": {"MaxConcurrentInvocationsPerInstance": 1},
    }
)

TS_DEFAULT_RESPONSE_TIMEOUT: set to 3600 to raise the response timeout limit to 1 hour
TS_MAX_RESPONSE_SIZE: set to 2,000,000,000 to allow responses of up to 2 GB
MaxConcurrentInvocationsPerInstance: set to 1 so that each instance processes at most one request at a time
- When autoscaling on the ApproximateBacklogSizePerInstance metric, as in this article, setting MaxConcurrentInvocationsPerInstance matters. If it is left unset, the alarm on ApproximateBacklogSizePerInstance can stay in the "Insufficient data" state, and autoscaling may never trigger.
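One way to catch a missing MaxConcurrentInvocationsPerInstance before setting up autoscaling is to inspect the endpoint configuration. The following is a minimal sketch, assuming a dict shaped like the response of `describe_endpoint_config`; the helper name `get_max_concurrency` and the example bucket are hypothetical and not part of the original notebook.

```python
from typing import Optional


def get_max_concurrency(endpoint_config: dict) -> Optional[int]:
    """Return MaxConcurrentInvocationsPerInstance, or None if it is not set.

    `endpoint_config` is assumed to have the shape of the dict returned by
    sm_client.describe_endpoint_config(EndpointConfigName=...).
    """
    async_config = endpoint_config.get("AsyncInferenceConfig", {})
    client_config = async_config.get("ClientConfig", {})
    return client_config.get("MaxConcurrentInvocationsPerInstance")


# Hand-written dict mirroring the config created above (bucket name is made up).
example_config = {
    "AsyncInferenceConfig": {
        "OutputConfig": {"S3OutputPath": "s3://example-bucket/async_sdxl/output"},
        "ClientConfig": {"MaxConcurrentInvocationsPerInstance": 1},
    }
}

print(get_max_concurrency(example_config))               # 1
print(get_max_concurrency({"AsyncInferenceConfig": {}}))  # None
```

In a notebook, the same helper could be given the real response, for example `get_max_concurrency(sm_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name))`; a `None` result would be a hint to set the value explicitly.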

Configuring the autoscaling policy

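import boto3

session = boto3.Session()  # configure for the profile you use
sagemaker = session.client('sagemaker')
autoscaling = session.client('application-autoscaling')
# variant_name must match the VariantName used in the endpoint config ('AllTrafic' above)
resource_id = f'endpoint/{endpoint_name}/variant/{variant_name}'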


# Configure Autoscaling on asynchronous endpoint down to zero instances
response = autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId= resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=5,
)

response = autoscaling.put_scaling_policy(
    PolicyName="Invocations-ScalingPolicy",
    ServiceNamespace="sagemaker",  # The namespace of the AWS service that provides the resource.
    ResourceId=resource_id,  # Endpoint name
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # SageMaker supports only Instance Count
    PolicyType="TargetTrackingScaling",  # 'StepScaling'|'TargetTrackingScaling'
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 3.0,  # Target value for the metric; here the metric is ApproximateBacklogSizePerInstance
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 30,
        "ScaleOutCooldown": 30
    },
)

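The minimum instance count is set to 0 and the maximum to 5.
TargetValue is set to 3.0. This automatically creates the following two CloudWatch alarms:
- An alarm that scales out when the average number of queued requests per instance exceeds 3
- An alarm that scales in when the average number of queued requests per instance falls below 2.9
  - You can check these in the AWS console under SageMaker -> Endpoints -> the target endpoint -> Alarms tab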
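To get a feel for what a target of 3.0 means, the sketch below estimates how many instances would bring the backlog per instance down to the target, clamped to the registered minimum and maximum. This is a simplified mental model, not the actual Application Auto Scaling algorithm: it ignores cooldowns and alarm evaluation periods, and the function name `estimate_desired_instances` is hypothetical.

```python
import math


def estimate_desired_instances(backlog: int, target_per_instance: float,
                               min_capacity: int, max_capacity: int) -> int:
    """Estimate the capacity at which backlog / instances is at most the target.

    Simplified model only: real target tracking reacts to CloudWatch alarms
    and cooldowns rather than computing this value directly.
    """
    if backlog <= 0:
        # Nothing queued: fall back to the minimum (0 in this article).
        return min_capacity
    needed = math.ceil(backlog / target_per_instance)
    # Clamp to the range registered with register_scalable_target.
    return max(min_capacity, min(max_capacity, needed))


for backlog in (0, 2, 7, 40):
    print(backlog, estimate_desired_instances(backlog, 3.0, 0, 5))
# 0 0
# 2 1
# 7 3
# 40 5
```

Note that this estimate says nothing about how the endpoint leaves the 0-instance state; the article handles that case with the separate policy in the next section.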

Configuring the autoscaling policy for scaling from 0 to 1 instance

# Autoscaling configuration for 0 -> 1

response = autoscaling.put_scaling_policy(
    PolicyName="HasBacklogWithoutCapacity-ScalingPolicy",
    ServiceNamespace="sagemaker",  # The namespace of the service that provides the resource.
    ResourceId=resource_id,  # Endpoint name
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # SageMaker supports only Instance Count
    PolicyType="StepScaling",  # 'StepScaling' or 'TargetTrackingScaling'
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity", # Specifies whether the ScalingAdjustment value in the StepAdjustment property is an absolute number or a percentage of the current capacity. 
        "MetricAggregationType": "Average", # The aggregation type for the CloudWatch metrics.
        "Cooldown": 30, # The amount of time, in seconds, to wait for a previous scaling activity to take effect. 
        "StepAdjustments": # A set of adjustments that enable you to scale based on the size of the alarm breach.
        [
            {
              "MetricIntervalLowerBound": 0,
              "ScalingAdjustment": 1
            }
          ]
    },
)

cw_client = boto3.client('cloudwatch')

response = cw_client.put_metric_alarm(
    AlarmName="HasBacklogWithoutCapacity-ScalingPolicy",
    MetricName='HasBacklogWithoutCapacity',
    Namespace='AWS/SageMaker',
    Statistic='Average',
    EvaluationPeriods= 1,
    DatapointsToAlarm= 1,
    Threshold= 0.5,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    TreatMissingData='missing',
    Dimensions=[
        { 'Name':'EndpointName', 'Value':endpoint_name },
    ],
    Period= 60,
    AlarmActions=[response['PolicyARN']]  # ARN of the step scaling policy created above
)
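After submitting a request to the idle endpoint, it can be useful to confirm that the 0 -> 1 scale-out actually happens. The following is a minimal sketch that polls `describe_endpoint` and reads `CurrentInstanceCount` of the first production variant; the function name `wait_for_instance_count` and the timeout and polling defaults are assumptions, not part of the original notebook.

```python
import time


def wait_for_instance_count(sm_client, endpoint_name: str, expected: int,
                            timeout_s: float = 900.0, poll_s: float = 30.0) -> bool:
    """Poll describe_endpoint until the first variant reports `expected` instances.

    Returns True when the count matches, False if the timeout expires first.
    `sm_client` is assumed to be a boto3 SageMaker client.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
        # Treat a missing count as 0 instances.
        current = desc["ProductionVariants"][0].get("CurrentInstanceCount", 0)
        if current == expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Usage in the notebook (requires AWS credentials), for example:
#   sm_client = boto3.client("sagemaker")
#   wait_for_instance_count(sm_client, endpoint_name, expected=1)
```

The same helper can be called with `expected=0` once the queue is drained to check that the endpoint scales back in.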

Implementation details

References
