Trying out "Scale an endpoint to zero instances" on SageMaker

Last updated at 2024-12-07, posted at 2024-12-06

Scale an endpoint to zero instances

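SageMaker real-time inference now has a feature that scales an endpoint in to 0 instances.

GPU instances are expensive, so some of you probably shut endpoints down during application development or at times they go unused, such as overnight. A managed way to scale in to 0 instances has now been added for that use case.

Still, won't scaling to 0 inevitably cause a cold-start problem? With that question in mind, let's just try it.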

The developer documentation is here, but it has no examples and is hard to follow.

GitHub has an easy-to-follow example, so I adapt that and try it out.

The zero-scale flow

To keep the experiment simple, this time I host Llama-3.2-1b, an SLM, on g5.xlarge.

The steps to scale in to 0 instances, and then scale back out to 1 or more, are as follows.

1. Create an endpoint config in SageMaker
2. Create an endpoint in SageMaker
3. Create a model in SageMaker
4. Create an inference component in SageMaker
5. Register a Scalable Target in Application Auto Scaling
6. Set up a Scaling Policy in Application Auto Scaling
7. Set up a Metric Alarm as a CloudWatch Alarm

That is quite a long road. With the SageMaker SDK you can shortcut everything up to step 4, but here we follow the behavior step by step with boto3.

When standing up an endpoint in SageMaker, the usual steps used to be

create model
create endpoint config
create endpoint

but around re:Invent 2023 (last year), the concept of an inference component was introduced.

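With the previous approach, you decided which model to use in advance, and the model was loaded automatically when the endpoint started. (Multi-model endpoints let you add models later, but not remove them.)

With inference components, you can freely swap models on an endpoint that is already running.
By running create endpoint config (deciding the instance type, instance count, and so on) -> create endpoint -> (create model) -> create inference component (specifying the model), you can push a model into the endpoint after the fact. Without deleting the endpoint, you can also remove a model placed on it with delete inference component, then add a new model with another create inference component, run multiple models, and so on.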
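As a rough illustration of that swap, here is a minimal sketch. The helper name `swap_inference_component` and the resource values are assumptions made for this example, not part of the original walkthrough; `sm` is expected to be a boto3 SageMaker client, and the new model is assumed to have been created already with create model. The sketch does not wait for the deletion to finish, which a real workflow would probably need to do.

```python
def swap_inference_component(sm, endpoint_name, variant_name,
                             old_component_name, new_component_name,
                             new_model_name):
    """Remove one inference component and place another model on the same endpoint.

    `sm` is expected to be boto3.client('sagemaker'). Hypothetical helper.
    """
    # Remove the model currently placed on the endpoint; the endpoint itself stays up.
    sm.delete_inference_component(InferenceComponentName=old_component_name)

    # Place a different, already-created model on the same endpoint.
    # The resource values below are illustrative assumptions.
    return sm.create_inference_component(
        InferenceComponentName=new_component_name,
        EndpointName=endpoint_name,
        VariantName=variant_name,
        Specification={
            'ModelName': new_model_name,
            'ComputeResourceRequirements': {
                'NumberOfAcceleratorDevicesRequired': 1,
                'MinMemoryRequiredInMb': 1024,
            },
        },
        RuntimeConfig={'CopyCount': 1},
    )
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real endpoint.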

Autoscaling was already supported at that point, and zero scale has now been added on top of it.

From here on, I paste the code as-is.

Run the following command beforehand if needed.

pip install boto3 sagemaker -U

# Load modules
import boto3
from sagemaker.jumpstart.model import JumpStartModel
import sagemaker
import json
from pprint import pprint
from time import sleep

# Create clients
smr = boto3.client('sagemaker-runtime')
sm = boto3.client('sagemaker')
aas = boto3.client('application-autoscaling')
cw = boto3.client('cloudwatch')
waiter = sm.get_waiter('endpoint_in_service')
role = sagemaker.get_execution_role()

# setting constant param
model_name = 'llama-3-2-1b'
endpoint_config_name = f'{model_name}-config'
endpoint_name = f'{model_name}-endpoint'
variant_name = 'AllTraffic'
inference_component_name = 'llama-3-2-1b-component'
target_policy_name = f'{inference_component_name}-target'
step_scaling_policy_name = f'{inference_component_name}-step'
resource_id = f'inference-component/{inference_component_name}'
service_namespace = 'sagemaker'
scalable_dimension = 'sagemaker:inference-component:DesiredCopyCount'
alarm_name = 'ic-step-scaling-alarm'

# Create the endpoint config (MinInstanceCount 0 allows scaling in to zero instances)
response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "InstanceType": 'ml.g5.xlarge',
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 1200,
            "ContainerStartupHealthCheckTimeoutInSeconds": 1200,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": 0,
                "MaxInstanceCount": 1,
            },
            'RoutingConfig': {
                'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'
            },
        }
    ],
)

# Create the endpoint and wait until it is in service
sm.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)
waiter.wait(
    EndpointName=endpoint_name,
    WaiterConfig={'Delay':5}
)

# Create the model
model = {
    'ExecutionRoleArn': role,
    'ModelName': model_name,
    'PrimaryContainer': {
        'Environment': {
            'ENDPOINT_SERVER_TIMEOUT': '3600',
            'HF_MODEL_ID': '/opt/ml/model',
            'MODEL_CACHE_ROOT': '/opt/ml/model',
            'OPTION_ENABLE_CHUNKED_PREFILL': 'true',
            'SAGEMAKER_ENV': '1',
            'SAGEMAKER_MODEL_SERVER_WORKERS': '1',
            'SAGEMAKER_PROGRAM': 'inference.py'
        },
        'Image': '763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124',
        'Mode': 'SingleModel',
        'ModelDataSource': {
            'S3DataSource': {
                'CompressionType': 'None',
                'ModelAccessConfig': {
                    'AcceptEula': True
                },
                'S3DataType': 'S3Prefix',
                'S3Uri': 's3://jumpstart-private-cache-prod-us-east-1/meta-textgeneration/meta-textgeneration-llama-3-2-1b/artifacts/inference-prepack/v1.0.0/'
            }
        }
    },
}
sm.create_model(**model)

# Create the inference component
response = sm.create_inference_component(
    InferenceComponentName = inference_component_name,
    EndpointName = endpoint_name,
    VariantName = variant_name,
    Specification = {
        'ModelName': model_name,
        'ComputeResourceRequirements': {
            'NumberOfCpuCoresRequired': 1,
            'NumberOfAcceleratorDevicesRequired': 1,
            'MinMemoryRequiredInMb': 1024,
            'MaxMemoryRequiredInMb': 1024*12,
        }
    },
    RuntimeConfig = {
        'CopyCount': 1
    }
)

# Wait until the model specified in the inference component becomes available
while True:
    description = sm.describe_inference_component(InferenceComponentName=inference_component_name)
    status = description['InferenceComponentStatus']
    print(status)
    if status == 'Failed':
        print(description['FailureReason'])
        raise RuntimeError('inference component creation failed')
    if status != 'Creating':
        # Any status other than Creating or Failed ends the wait
        break
    sleep(30)

# Inference test
response = smr.invoke_endpoint(
    EndpointName = endpoint_name,
    ContentType = 'application/json',
    Accept = 'application/json',
    InferenceComponentName = inference_component_name,
    Body = json.dumps(
        {
            'inputs': 'Hello, who are you?', 
            'parameters': {
                'max_new_tokens': 64, 
                'top_p': 0.9, 
                'temperature': 0.6
            }
        }
    )
)
print(response['Body'].read().decode('utf-8'))

Output

{"generated_text": " I am a student of the University of California, Berkeley. I am a member of the Berkeley chapter of the American Association of University Professors. I am a member of the American Association of University Professors. I am a member of the American Association of University Professors. I am a member of the"}

So Llama-3.2-1b is a student at the University of California, Berkeley.
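The ComputeResourceRequirements passed to create_inference_component earlier determine how many copies of the model can share one instance. The following is a simplified back-of-the-envelope model, not SageMaker's actual placement logic (which also has to leave room for the system itself). The function name and the instance dictionary are assumptions for this sketch; the ml.g5.xlarge figures (4 vCPUs, 16 GiB of memory, 1 GPU) are the published instance specifications.

```python
def max_copies_per_instance(instance, requirements):
    """Rough upper bound on how many copies of one component fit on one instance."""
    # Memory is always part of the requirements.
    limits = [instance['memory_mb'] // requirements['MinMemoryRequiredInMb']]

    # CPU cores and accelerators only constrain placement when they are requested.
    cpu = requirements.get('NumberOfCpuCoresRequired')
    if cpu:
        limits.append(instance['vcpus'] // cpu)
    gpu = requirements.get('NumberOfAcceleratorDevicesRequired')
    if gpu:
        limits.append(instance['gpus'] // gpu)

    return int(min(limits))


# ml.g5.xlarge: 4 vCPUs, 16 GiB memory, 1 GPU
g5_xlarge = {'vcpus': 4, 'memory_mb': 16 * 1024, 'gpus': 1}

# The same requirements as in the walkthrough
requirements = {
    'NumberOfCpuCoresRequired': 1,
    'NumberOfAcceleratorDevicesRequired': 1,
    'MinMemoryRequiredInMb': 1024,
}

copies = max_copies_per_instance(g5_xlarge, requirements)
```

Under this model the single GPU is the binding constraint, so one copy fits per instance, which lines up with using a capacity of at most 1 on a single ml.g5.xlarge.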

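Now for the settings that scale in to zero and scale back out from there. Two policies are registered: a target tracking policy, which handles scaling in to 0, and a step scaling policy driven by a CloudWatch alarm on NoCapacityInvocationFailures, which brings the component back from 0 copies to 1.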

# Register the scalable target
response = aas.register_scalable_target(
    ServiceNamespace = service_namespace,
    ResourceId = resource_id,
    ScalableDimension = scalable_dimension,
    MinCapacity=0,
    MaxCapacity=1
)
# Set up the scaling policies (target tracking first, then step scaling)
aas.put_scaling_policy(
    PolicyName = target_policy_name,
    PolicyType = 'TargetTrackingScaling',
    ResourceId = resource_id,
    ServiceNamespace = service_namespace,
    ScalableDimension = scalable_dimension,
    TargetTrackingScalingPolicyConfiguration={
      "PredefinedMetricSpecification": {
          "PredefinedMetricType": "SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution",
      },
      "TargetValue": 1,
      "ScaleInCooldown": 60,
      "ScaleOutCooldown": 60,
    },
)
response = aas.put_scaling_policy(
    PolicyName=step_scaling_policy_name,
    PolicyType="StepScaling",
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "MetricAggregationType": "Maximum",
        "Cooldown": 60,
        "StepAdjustments":
          [
             {
               "MetricIntervalLowerBound": 0,
               "ScalingAdjustment": 1
             }
          ]
    },
)
step_scaling_policy_arn = response['PolicyARN']

# Set up the CloudWatch alarm that triggers the step scaling policy
response = cw.put_metric_alarm(
    AlarmName = alarm_name,
    AlarmActions = [step_scaling_policy_arn],
    MetricName = 'NoCapacityInvocationFailures',
    Namespace = 'AWS/SageMaker',
    Statistic = 'Maximum',
    Dimensions = [
        {
            'Name' : 'InferenceComponentName',
            'Value' : inference_component_name
        }
    ],
    Period = 30,
    EvaluationPeriods = 1,
    DatapointsToAlarm = 1,
    Threshold = 1,
    ComparisonOperator = 'GreaterThanOrEqualToThreshold',
    TreatMissingData='missing'
)

# Scale-in to 0 should happen after about 60 seconds; wait 120 seconds to be safe
sleep(120)
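
Instead of sleeping for a fixed time, one option is to poll until the copy count actually reaches 0. The sketch below is a hypothetical helper, not part of the original walkthrough. It assumes that the response of describe_inference_component carries a RuntimeConfig with a CurrentCopyCount field, which is worth checking against the boto3 documentation. Note that a copy count of 0 does not by itself guarantee that the instance has already been released.

```python
import time


def wait_for_copy_count(describe, component_name, target=0,
                        timeout=600, interval=10, sleep=time.sleep):
    """Poll until the inference component reports `target` running copies.

    `describe` is expected to be sm.describe_inference_component.
    """
    deadline = time.monotonic() + timeout
    while True:
        response = describe(InferenceComponentName=component_name)
        runtime = response.get('RuntimeConfig', {})
        if runtime.get('CurrentCopyCount') == target:
            return runtime
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f'{component_name} did not reach {target} copies within {timeout}s'
            )
        sleep(interval)
```

In the notebook this would be called as `wait_for_copy_count(sm.describe_inference_component, inference_component_name)`.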

From here, I measure how long it takes for the endpoint, now at 0 instances, to be able to serve inference again on 1 instance.
Incidentally, calling invoke_endpoint while at 0 instances raises the error An error occurred (ValidationError) when calling the InvokeEndpoint operation: Inference Component has no capacity to process this request. ApplicationAutoScaling may be in-progress (if configured) or try to increase the capacity by invoking UpdateInferenceComponentRuntimeConfig API., so the call is wrapped in try ~ except.
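
These rejected calls are what wake the endpoint up: the alarm configured earlier watches NoCapacityInvocationFailures and fires the step scaling policy. The following is a simplified model of how a ChangeInCapacity step adjustment is applied when the metric breaches the threshold from above. It is an illustration written for this article, not Application Auto Scaling's implementation, and the function name is an assumption.

```python
def apply_step_scaling(current, metric_value, threshold, step_adjustments,
                       min_capacity, max_capacity):
    """Return the new capacity after one alarm evaluation (simplified model)."""
    # Step bounds are expressed relative to the alarm threshold.
    breach = metric_value - threshold
    for step in step_adjustments:
        lower = step.get('MetricIntervalLowerBound')
        upper = step.get('MetricIntervalUpperBound')
        in_lower = lower is None or breach >= lower
        in_upper = upper is None or breach < upper
        if in_lower and in_upper:
            desired = current + step['ScalingAdjustment']
            # The scalable target's Min/MaxCapacity clamp the result.
            return max(min_capacity, min(max_capacity, desired))
    # No step matched: the capacity is left alone.
    return current


# The same step adjustment as in the walkthrough
steps = [{'MetricIntervalLowerBound': 0, 'ScalingAdjustment': 1}]

# One failed invocation while at 0 copies (Threshold = 1)
new_capacity = apply_step_scaling(0, 1, 1, steps, min_capacity=0, max_capacity=1)
```

With a threshold of 1 and a lower bound of 0, a single failed invocation is enough to request one copy, and MaxCapacity keeps repeated failures from requesting more than that.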

%%time
while True:
    try:
        response = smr.invoke_endpoint(
            EndpointName = endpoint_name,
            ContentType = 'application/json',
            Accept = 'application/json',
            InferenceComponentName = inference_component_name,
            Body = json.dumps(
                {
                    'inputs': 'Hello, who are you?', 
                    'parameters': {
                        'max_new_tokens': 64, 
                        'top_p': 0.9, 
                        'temperature': 0.6
                    }
                }
            )
        )
        print(response['Body'].read().decode('utf-8'))
        break
    except Exception as e:
        print(e)
        sleep(5)

Output

…
An error occurred (ValidationError) when calling the InvokeEndpoint operation: Inference Component has no capacity to process this request. ApplicationAutoScaling may be in-progress (if configured) or try to increase the capacity by invoking UpdateInferenceComponentRuntimeConfig API.
{"generated_text": " I am a student of the University of California, Berkeley. I am a member of the Berkeley chapter of the American Association of University Professors. I am a member of the American Association of University Professors. I am a member of the American Association of University Professors. I am a member of the"}
CPU times: user 244 ms, sys: 20.8 ms, total: 265 ms
Wall time: 6min 1s

It took about 6 minutes. Since that covers instance startup plus downloading and loading the container image and the model, the number is about what you would expect.
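
The measurement loop retries on every exception, which would also hide unrelated failures such as a wrong endpoint name. A slightly stricter sketch retries only on the "no capacity" error and gives up after a deadline. The helper names are hypothetical; the check relies on the error code and message shown in the output above and on the `response` dictionary that botocore's ClientError carries.

```python
import time

NO_CAPACITY_HINT = 'has no capacity to process this request'


def is_no_capacity_error(exc):
    """True if exc looks like the zero-capacity ValidationError from InvokeEndpoint."""
    error = getattr(exc, 'response', {}).get('Error', {})
    return (error.get('Code') == 'ValidationError'
            and NO_CAPACITY_HINT in error.get('Message', ''))


def invoke_with_cold_start_retry(invoke, timeout=900, interval=5, sleep=time.sleep):
    """Call invoke() until it succeeds, retrying only while capacity is missing."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return invoke()
        except Exception as exc:
            # Anything other than the no-capacity error is a real failure.
            if not is_no_capacity_error(exc) or time.monotonic() >= deadline:
                raise
            sleep(interval)
```

Here `invoke` would be a zero-argument callable, for example a lambda wrapping the `smr.invoke_endpoint(...)` call used in the measurement loop.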

Conclusion

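I did not expect invoke_endpoint to return an error in real-world use.
Perhaps the workflow is to hit invoke_endpoint once before business hours or before starting development, get an error, and let the endpoint come up in the background.
Even so, it may be easier than running create_endpoint again every time.
Since hitting invoke_endpoint is all it takes, end users themselves can wake the instance up (albeit along with an error).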
