
AI/ML on AWS Advent Calendar 2022

Playing with Notebook Jobs, the New Amazon SageMaker Feature Announced at AWS re:Invent 2022

Posted at 2022-12-25

Introduction

Hello, everyone. In this post I get hands-on with Notebook Jobs, a new Amazon SageMaker Studio feature announced at AWS re:Invent 2022, and summarize how it works and how it can be applied.

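I will not cover what Notebook Jobs is in the first place, so if you are not familiar with it, please refer to the official documentation, the official blog, or unofficial blog posts.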

Disclaimer

Before getting to the main topic, a disclaimer, just in case.

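SageMaker Notebook Jobs was announced in December 2022, and at the time of writing the official documentation and official blog posts are not yet complete. This post is based not only on what the official documentation states but also on undocumented behavior that can be inferred to some extent from logs and other sources. Please treat anything undocumented here as an engineer's playground experiment and do not adopt it in production (or confirm it with AWS technical support first).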

What I Learned by Playing Around

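1. An on-demand job in Notebook Jobs directly runs a SageMaker training job.
2. Creating a schedule job in Notebook Jobs creates, behind the scenes, a SageMaker pipeline whose only step is a SageMaker training job, plus an EventBridge schedule rule; the EventBridge schedule rule runs the pipeline periodically.
3. The SageMaker pipeline from point 2 can be used to run a notebook job from outside Studio.
4. When a schedule job is created, the notebook file is uploaded to the input S3 location at that moment; after that, for the same job, changing and overwriting the notebook in Studio is not reflected in the file on S3.
5. Combining points 2, 3, and 4 makes it possible to change the notebook dynamically from outside and run the job.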

The rest of this post takes each point (1 through 5) in order and explains why we can tell and how to implement it.

By the way, in my posts I capitalize service and feature names (for example, Notebook Jobs and SageMaker Pipelines) and write actual resource names in lowercase (for example, notebook job and SageMaker pipeline).

1. An on-demand job in Notebook Jobs directly runs a SageMaker training job

As the section title says, an on-demand job in Notebook Jobs is really a SageMaker training job. An on-demand job is what you run by selecting "Run now" when creating a notebook job in SageMaker Studio.

Let's check this in practice. First, before running a notebook job, list the SageMaker training jobs with the AWS CLI (the timestamps reveal when I am writing this post; ignoring that is good manners).

aws sagemaker list-training-jobs --creation-time-after 2022-12-25T23:00:11+09:00
# {
#     "TrainingJobSummaries": []
# }

The output is empty.

Now let's run an on-demand job. The content can be anything, but here I run code that prints the current time, followed by a Matplotlib sample. Concretely, it is a notebook with just one cell, like this:

%matplotlib inline

import datetime

print(datetime.datetime.now())

import matplotlib.pyplot as plt

fig, ax = plt.subplots()

fruits = ['apple', 'blueberry', 'cherry', 'orange']
counts = [40, 100, 30, 55]
bar_labels = ['red', 'blue', '_red', 'orange']
bar_colors = ['tab:red', 'tab:blue', 'tab:red', 'tab:orange']

ax.bar(fruits, counts, label=bar_labels, color=bar_colors)

ax.set_ylabel('fruit supply')
ax.set_title('Fruit supply by kind and color')
ax.legend(title='Fruit color')

plt.show()

Paste the above into a notebook in SageMaker Studio, rename the file to qiita_sample.ipynb (any name works), and click the notebook job creation icon in the toolbar at the top. The icon looks like a calendar.

Leave everything at the defaults, confirm that the Schedule field is set to "Run now", and click "Create".

A SageMaker training job should have been created at this point. Let's check with the AWS CLI.

aws sagemaker list-training-jobs --creation-time-after 2022-12-25T23:00:11+09:00
# {
#     "TrainingJobSummaries": [
#         {
#             "TrainingJobName": "qiitasampleipynb-qiitasample-00000000-2022-12-25-14-24-31",
#             "TrainingJobArn": "arn:aws:sagemaker:ap-northeast-1:000000000000:training-job/qiitasampleipynb-qiitasample-00000000-2022-12-25-14-24-31",
#             "CreationTime": "2022-12-25T23:24:32.836000+09:00",
#             "LastModifiedTime": "2022-12-25T23:26:01.191000+09:00",
#             "TrainingJobStatus": "InProgress"
#         }
#     ]
# }

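There it is, and its status is "InProgress" as expected. I will not go into detail in this section, but the aws sagemaker describe-training-job command lets you inspect this SageMaker training job's container image, inputs and outputs, and so on. You can also check its logs. As the official documentation also mentions, the logs show that Notebook Jobs uses the open-source Papermill behind the scenes. So its limitations and customization options will likely depend on Papermill.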
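As a rough Python counterpart to inspecting the training job, the sketch below picks the status, container image, input channels, and output path out of a DescribeTrainingJob response. The function names are my own, and the network call assumes boto3 and AWS credentials are available; treat it as an illustration rather than the way Notebook Jobs itself reads these values.

```python
from typing import Any, Dict


def summarize_training_job(desc: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the status, image, inputs, and output out of a DescribeTrainingJob response."""
    return {
        "status": desc.get("TrainingJobStatus"),
        "image": desc.get("AlgorithmSpecification", {}).get("TrainingImage"),
        # Map each S3-backed input channel name to its S3 URI.
        "inputs": {
            ch["ChannelName"]: ch["DataSource"]["S3DataSource"]["S3Uri"]
            for ch in desc.get("InputDataConfig", [])
            if "S3DataSource" in ch.get("DataSource", {})
        },
        "output": desc.get("OutputDataConfig", {}).get("S3OutputPath"),
    }


def describe_notebook_training_job(job_name: str) -> Dict[str, Any]:
    """Call DescribeTrainingJob for job_name (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the summarizer works without boto3 installed

    client = boto3.client("sagemaker")
    return summarize_training_job(
        client.describe_training_job(TrainingJobName=job_name)
    )
```

Calling describe_notebook_training_job with the TrainingJobName shown in the listing above would return a small dict that is easier to scan than the full response.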

Once the status becomes "Completed", you can view the output notebook of the notebook job either in Studio or in the output S3 location. Make sure the Matplotlib figure is displayed properly.

2. Creating a schedule job in Notebook Jobs creates an EventBridge rule and a SageMaker pipeline

Unlike an on-demand job, a schedule job runs its SageMaker training job by way of several services. As the official blog also states, those services are EventBridge and SageMaker Pipelines. The concrete resources are an EventBridge schedule rule (not EventBridge Scheduler) and a SageMaker pipeline.

Let's check in practice. First, list the SageMaker training jobs, the SageMaker pipelines, and the EventBridge rules.

aws sagemaker list-training-jobs --creation-time-after 2022-12-25T23:50:00+09:00
# {
#     "TrainingJobSummaries": []
# }

aws sagemaker list-pipelines --created-after 2022-12-25T23:50:00+09:00
# {
#     "PipelineSummaries": []
# }

aws events list-rules --query "Rules[].Name" --output text | wc -w
#       24

EventBridge rules cannot be filtered by creation time, so I simply check how many there are.
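Counting rules shows that something was added but not which rule. One possible alternative, sketched here in Python, is to snapshot the rule names before and after and take the difference. The helper names are hypothetical; list_rules is expected to behave like the list_rules method of a boto3 EventBridge ("events") client, including its NextToken pagination.

```python
from typing import Any, Callable, Dict, List, Optional, Set


def list_rule_names(list_rules: Callable[..., Dict[str, Any]]) -> Set[str]:
    """Collect every rule name, following NextToken across pages."""
    names: Set[str] = set()
    token: Optional[str] = None
    while True:
        kwargs = {"NextToken": token} if token else {}
        page = list_rules(**kwargs)
        names.update(rule["Name"] for rule in page.get("Rules", []))
        token = page.get("NextToken")
        if not token:
            return names


def new_rule_names(before: Set[str], after: Set[str]) -> List[str]:
    """Return the names present in after but not in before, sorted."""
    return sorted(after - before)
```

In practice you would call list_rule_names(boto3.client("events").list_rules) once before creating the schedule job and once after, then pass both sets to new_rule_names.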

Next, create the schedule job. Use the same notebook as in the previous section. Click the notebook job creation icon in the toolbar, this time change the Schedule field to "Run on a schedule", and click "Create". The default parameters are fine.

An EventBridge rule and a SageMaker pipeline should have been created at this point (no SageMaker training job is created until the scheduled time arrives).

aws sagemaker list-pipelines --created-after 2022-12-25T23:50:00+09:00
# {
#     "PipelineSummaries": [
#         {
#             "PipelineArn": "arn:aws:sagemaker:ap-northeast-1:000000000000:pipeline/qiitasampleipynb-qiitasample-00000000-2022-12-25-15-00-30",
#             "PipelineName": "qiitasampleipynb-qiitasample-00000000-2022-12-25-15-00-30",
#             "PipelineDisplayName": "qiitasampleipynb",
#             "PipelineDescription": "Created for Notebook execution from Studio",
#             "RoleArn": "arn:aws:iam::000000000000:role/SageMaker-Studio-Domain-v-DomainExecutionRoleXXXXX-XXXXXXXXXXXX",
#             "CreationTime": "2022-12-26T00:00:30.956000+09:00",
#             "LastModifiedTime": "2022-12-26T00:00:30.956000+09:00"
#         }
#     ]
# }

aws events list-rules --query "Rules[].Name" --output text | wc -w
#       25

Both are there. I am curious what steps this SageMaker pipeline consists of, so let's take a look.

aws sagemaker describe-pipeline \
    --pipeline-name qiitasampleipynb-qiitasample-00000000-2022-12-25-15-00-30 \
    --query 'PipelineDefinition' | \
    tr -d "\\" | \
    sed 's/^.\{1\}//' | \
    sed 's/.\{1\}$//' | \
    jq '.'
# ...

aws sagemaker describe-pipeline \
    --pipeline-name qiitasampleipynb-qiitasample-00000000-2022-12-25-15-00-30 \
    --query 'PipelineDefinition' | \
    tr -d "\\" | \
    sed 's/^.\{1\}//' | \
    sed 's/.\{1\}$//' | \
    jq '.Steps | length'
# 1

The steps are available under the PipelineDefinition key in the response of the sagemaker:DescribePipeline API, but it is hard to read as is, so I tidy it up with tr, sed, and jq. There is only one step, and its content turns out to be a SageMaker training job. Next, let's run this SageMaker pipeline from outside Studio.
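PipelineDefinition comes back as a JSON document encoded as a string, so in Python it can be decoded with json.loads instead of stripping backslashes by hand. The sketch below is one way to do that; the function names are my own, and the Name and Type keys on each step are an assumption based on the SageMaker pipeline definition format rather than something shown in this post.

```python
import json
from typing import Any, Dict, List, Optional, Tuple


def parse_pipeline_definition(describe_response: Dict[str, Any]) -> Dict[str, Any]:
    """Decode the JSON string stored under PipelineDefinition."""
    return json.loads(describe_response["PipelineDefinition"])


def step_types(definition: Dict[str, Any]) -> List[Tuple[Optional[str], Optional[str]]]:
    """Return (Name, Type) for every step in the decoded definition."""
    return [
        (step.get("Name"), step.get("Type"))
        for step in definition.get("Steps", [])
    ]
```

The describe_response argument would be the return value of describe_pipeline on a boto3 SageMaker client; len(definition["Steps"]) then corresponds to the jq '.Steps | length' check above.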

3. The SageMaker pipeline created by a schedule job in Notebook Jobs can be used to run the job from outside

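As confirmed in the previous section, creating a schedule job creates an EventBridge schedule rule and a SageMaker pipeline. Reading the official documentation and the official blog, it looks as if running a schedule job at a time of your choosing requires going through SageMaker Studio (this means "on demand", but that would be confused with an on-demand job, so I will keep using the clunky phrase "at a time of your choosing"). However, since there is a SageMaker pipeline, you can run the job from outside by targeting that pipeline.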

The SageMaker pipeline created by a schedule job does not need any parameters (looking at the EventBridge rule confirms that none are needed), so let's run it as is.

First, check the SageMaker pipeline executions before running.

aws sagemaker list-pipeline-executions --pipeline-name qiitasampleipynb-qiitasample-00000000-2022-12-25-15-00-30
# {
#     "PipelineExecutionSummaries": []
# }

As expected, there is no execution history. Let's also check the S3 location where results are stored. The S3 bucket/prefix can be found under Notebook Job Definitions in SageMaker Studio.

aws s3 ls s3://sagemaker-automated-execution-000000000000-ap-northeast-1/qiitasampleipynb-qiitasample-00000000-2022-12-25-15-00-30/
#                            PRE input/

There is only an input folder, as expected.

Now let's run the SageMaker pipeline.

aws sagemaker start-pipeline-execution --pipeline-name qiitasampleipynb-qiitasample-00000000-2022-12-25-15-00-30
# {
#     "PipelineExecutionArn": "arn:aws:sagemaker:ap-northeast-1:000000000000:pipeline/qiitasampleipynb-qiitasample-00000000-2022-12-25-15-00-30/execution/xxxxxxxxxxxx"
# }

If it ran correctly, a new SageMaker training job should have been created and the output saved to S3. Let's check.

aws sagemaker list-training-jobs --creation-time-after 2022-12-25T23:50:00+09:00
# {
#     "TrainingJobSummaries": [
#         {
#             "TrainingJobName": "pipelines-xxxxxxxxxxxx-qiitasampl-qiitasamp-xxxxxxxxxx",
#             "TrainingJobArn": "arn:aws:sagemaker:ap-northeast-1:000000000000:training-job/pipelines-xxxxxxxxxxxx-qiitasampl-qiitasamp-xxxxxxxxxx",
#             "CreationTime": "2022-12-26T00:23:06.491000+09:00",
#             "TrainingEndTime": "2022-12-26T00:26:08.594000+09:00",
#             "LastModifiedTime": "2022-12-26T00:26:09.223000+09:00",
#             "TrainingJobStatus": "Completed"
#         }
#     ]
# }

aws s3 ls s3://sagemaker-automated-execution-000000000000-ap-northeast-1/qiitasampleipynb-qiitasample-00000000-2022-12-25-15-00-30/
#                            PRE input/
#                            PRE pipelines-xxxxxxxxxxxx-qiitasampl-qiitasamp-xxxxxxxxxx/

As expected. Let's look inside the output folder on S3.

aws s3 ls s3://sagemaker-automated-execution-000000000000-ap-northeast-1/qiitasampleipynb-qiitasample-00000000-2022-12-25-15-00-30/pipelines-xxxxxxxxxxxx-qiitasampl-qiitasamp-xxxxxxxxxx/
#                            PRE output/

aws s3 ls s3://sagemaker-automated-execution-000000000000-ap-northeast-1/qiitasampleipynb-qiitasample-00000000-2022-12-25-15-00-30/pipelines-xxxxxxxxxxxx-qiitasampl-qiitasamp-xxxxxxxxxx/output/
# 2022-12-26 00:26:07      21590 output.tar.gz

It looks like the output is bundled into output.tar.gz. Let's download it, extract it, and look inside.

aws s3 cp s3://sagemaker-automated-execution-000000000000-ap-northeast-1/qiitasampleipynb-qiitasample-00000000-2022-12-25-15-00-30/pipelines-xxxxxxxxxxxx-qiitasampl-qiitasamp-xxxxxxxxxx/output/output.tar.gz .
tar fx output.tar.gz
ls -1
# qiita_sample.ipynb
# qiitasampl-qiitasampl-2022-12-25T15-23-04-640Z.ipynb

qiitasampl-qiitasampl-2022-12-25T15-23-04-640Z.ipynb is the output. Opening it in Jupyter (or JupyterLab) on a local PC shows the results properly. As expected.
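If you want to pick out the executed notebook programmatically, a small sketch with Python's standard tarfile module could look like the following. It assumes, based only on the listing above, that the archive holds the input notebook plus the executed copy under a different file name; the function name is hypothetical.

```python
import tarfile
from typing import List


def executed_notebooks(archive_path: str, input_name: str) -> List[str]:
    """Return the .ipynb members of the archive other than the input notebook."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = [member.name for member in tar.getmembers() if member.isfile()]
    return sorted(
        name
        for name in names
        # Compare on the base name so a leading "./" or folder does not matter.
        if name.endswith(".ipynb") and name.rsplit("/", 1)[-1] != input_name
    )
```

For the archive above, executed_notebooks("output.tar.gz", "qiita_sample.ipynb") should leave only the timestamped notebook, provided the archive layout matches that assumption.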

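We have confirmed that the job can be run from outside through the SageMaker pipeline. Because the pipeline can be invoked from a Lambda function and the like, you can integrate with other services and run the notebook job at a time of your choosing.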
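As an illustration of the Lambda idea, here is a minimal handler sketch that starts the pipeline. The PIPELINE_NAME environment variable and the pipeline_name event key are hypothetical names I made up for this example, and the function's execution role is assumed to allow sagemaker:StartPipelineExecution.

```python
import os
from typing import Any, Dict, Optional


def start_notebook_pipeline(pipeline_name: str, client: Optional[Any] = None) -> str:
    """Start the pipeline and return the execution ARN."""
    if client is None:
        import boto3  # available in the Lambda Python runtime

        client = boto3.client("sagemaker")
    response = client.start_pipeline_execution(PipelineName=pipeline_name)
    return response["PipelineExecutionArn"]


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, str]:
    # Hypothetical convention: take the name from the event if present,
    # otherwise fall back to the PIPELINE_NAME environment variable.
    pipeline_name = event.get("pipeline_name") or os.environ["PIPELINE_NAME"]
    return {"PipelineExecutionArn": start_notebook_pipeline(pipeline_name)}
```

The client parameter exists only so the function can be exercised with a stub; in Lambda the default boto3 client is used.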

As a side note, the output S3 bucket/key can be checked in SageMaker Studio, but that approach cannot be automated. It can in fact be read from the SageMaker pipeline's pipeline definition, so when integrating with other services:

Get the SageMaker pipeline name from Parameter Store
Call the sagemaker:DescribePipeline API with that SageMaker pipeline name
Expand the PipelineDefinition key in the returned JSON and read the output S3 bucket/key

This is a reasonable way to implement it.
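The three steps above can be sketched in Python roughly as follows. The parameter name is whatever you stored the pipeline name under, and the OutputDataConfig.S3OutputPath location is an assumption: it presumes the step's Arguments mirror a CreateTrainingJob request, which this post has not verified for the output side.

```python
import json
from typing import Any


def output_s3_path(ssm: Any, sagemaker: Any, parameter_name: str) -> str:
    """Resolve the pipeline name from Parameter Store and read its output path."""
    # Step 1: the pipeline name is stored as a plain Parameter Store value.
    pipeline_name = ssm.get_parameter(Name=parameter_name)["Parameter"]["Value"]
    # Step 2: fetch the pipeline, whose definition is a JSON string.
    response = sagemaker.describe_pipeline(PipelineName=pipeline_name)
    definition = json.loads(response["PipelineDefinition"])
    # Step 3 (assumption): the single training step carries OutputDataConfig.
    return definition["Steps"][0]["Arguments"]["OutputDataConfig"]["S3OutputPath"]
```

Here ssm and sagemaker would be boto3.client("ssm") and boto3.client("sagemaker"). The executed notebook itself sits under a per-execution folder below that location, as the S3 listings earlier in this section show.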

4. Once a schedule job in Notebook Jobs is created, its input notebook file is fixed

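One question that comes up while playing with schedule jobs: if you modify and overwrite the notebook file after creating the schedule job, does the job's input change too? You can find out by trying it, and the answer is that it does not. Concretely, when a schedule job is created, the notebook file as it exists at that moment is uploaded to S3. From then on, as long as the notebook job definition is the same, the notebook file on S3 does not change, and neither does the S3 bucket/key.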

Let's check in practice. First, modify the notebook file. Add a cell with the following code to qiita_sample.ipynb, the file saved in section 1 (do not delete the existing cell).

for i in range(10):
    print(f'Hello {i}')

The code has no particular meaning; it simply prints Hello 0, Hello 1, ..., Hello 9 in order. Add it and save the file.

Now let's run the notebook job. In SageMaker Studio, open the Notebook Job Definitions page.

Click the job definition name qiita_sample.ipynb to go to the detail screen, then click "Run Job" at the top of that screen.

On the next screen, just click "Create" (there are no parameters to set). You should then be taken to the notebook job list page.

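The topmost entry is the notebook job that just started. Wait until its Status becomes Completed. As a slight digression, this page shows three notebook jobs, but following the steps so far we should have run only two from SageMaker Studio: the on-demand job from section 1 and the one just started. The remaining one can only be the job run through the SageMaker pipeline in section 3. From this I infer that the "Reload" button at the top right of the notebook job list page fetches its information from SageMaker training jobs or from S3 (I will not investigate further).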

Once the Status is Completed, download the output and open the resulting notebook file. The cell added earlier is missing, and naturally so is its output. This is the expected result.

That raises the question of where this notebook file is stored. This can be found in the SageMaker pipeline definition.

aws sagemaker describe-pipeline \
    --pipeline-name qiitasampleipynb-qiitasample-00000000-2022-12-25-15-00-30 \
    --query 'PipelineDefinition' | \
    tr -d "\\" | \
    sed 's/^.\{1\}//' | \
    sed 's/.\{1\}$//' | \
    jq '.'
# ...
# More specifically...
aws sagemaker describe-pipeline \
    --pipeline-name qiitasampleipynb-qiitasample-00000000-2022-12-25-15-00-30 \
    --query 'PipelineDefinition' | \
    tr -d "\\" | \
    sed 's/^.\{1\}//' | \
    sed 's/.\{1\}$//' | \
    jq '.Steps[0].Arguments.InputDataConfig | map(select(.ChannelName == "sagemaker_headless_execution"))[0].DataSource.S3DataSource.S3Uri'
# "s3://sagemaker-automated-execution-000000000000-ap-northeast-1/qiitasampleipynb-qiitasample-00000000-2022-12-25-15-00-30/input"

So the file is under the s3://sagemaker-automated-execution-{AccountId}-{Region}/{SageMakerPipelineName}/input folder. Checking that folder:

aws s3 ls s3://sagemaker-automated-execution-000000000000-ap-northeast-1/qiitasampleipynb-qiitasample-00000000-2022-12-25-15-00-30/input/
# 2022-12-26 00:00:31       1786 qiita_sample.ipynb

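You can see that the notebook file's timestamp is almost the same as the schedule job's creation time. In the next section, I turn this behavior around and use it to change the notebook file dynamically from outside and run the notebook job.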
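For reference, the jq filter used above can be mirrored in Python once the pipeline definition has been decoded into a dict. This is only a sketch: the function names are mine, and the channel name sagemaker_headless_execution is taken from the jq filter in this section.

```python
from typing import Any, Dict, Optional, Tuple
from urllib.parse import urlparse


def input_notebook_prefix(
    definition: Dict[str, Any],
    channel: str = "sagemaker_headless_execution",
) -> Optional[str]:
    """Return the S3 URI of the given input channel of the first step, if any."""
    channels = definition["Steps"][0]["Arguments"]["InputDataConfig"]
    for ch in channels:
        if ch.get("ChannelName") == channel:
            return ch["DataSource"]["S3DataSource"]["S3Uri"]
    return None


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split s3://bucket/key into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")
```

Splitting the URI into bucket and key prefix is handy because the boto3 S3 APIs take those two values separately.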

5. The notebook file of a schedule job in Notebook Jobs can be changed dynamically from outside

Because the S3 path of a schedule job's notebook file is fixed, placing any notebook file at that S3 path lets you run whatever notebook job you like. Let's try it.

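First, write some notebook file. You can write it in SageMaker Studio, on a local PC, or pull one from GitHub; anything works. I will write mine in SageMaker Studio. As before, the content can be anything, but again I run code that prints the current time, followed by a different Matplotlib sample. Concretely, it is a notebook with just one cell, like this: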

%matplotlib inline

import datetime

print(datetime.datetime.now())

import numpy as np
import matplotlib.pyplot as plt


fig = plt.figure()
x = np.arange(10)
y = 2.5 * np.sin(x / 20 * np.pi)
yerr = np.linspace(0.05, 0.2, 10)

plt.errorbar(x, y + 3, yerr=yerr, label='both limits (default)')

plt.errorbar(x, y + 2, yerr=yerr, uplims=True, label='uplims=True')

plt.errorbar(x, y + 1, yerr=yerr, uplims=True, lolims=True,
             label='uplims=True, lolims=True')

upperlimits = [True, False] * 5
lowerlimits = [False, True] * 5
plt.errorbar(x, y, yerr=yerr, uplims=upperlimits, lolims=lowerlimits,
             label='subsets of uplims and lolims')

plt.legend(loc='lower right')

Save this as replace_sample.ipynb (any name works).

Next, open a terminal (in SageMaker Studio, the system terminal is recommended over the image terminal) and copy this file to the path that holds the schedule job's input notebook file, overwriting the existing file.

aws s3 ls s3://sagemaker-automated-execution-000000000000-ap-northeast-1/qiitasampleipynb-qiitasample-00000000-2022-12-25-15-00-30/input/
# 2022-12-26 00:00:31       1786 qiita_sample.ipynb

aws s3 cp replace_sample.ipynb s3://sagemaker-automated-execution-000000000000-ap-northeast-1/qiitasampleipynb-qiitasample-00000000-2022-12-25-15-00-30/input/qiita_sample.ipynb
# upload: ./replace_sample.ipynb to s3://sagemaker-automated-execution-000000000000-ap-northeast-1/qiitasampleipynb-qiitasample-00000000-2022-12-25-15-00-30/input/qiita_sample.ipynb

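In this state, run the schedule job from SageMaker Studio with the "Run Job" button, as in the previous section. Once the Status is Completed, download the output and open the resulting notebook file. You will see that the results come not from the notebook file specified when the schedule job was created but from the notebook file just uploaded over it. As expected. Combined with section 3, this makes it possible to change the notebook file at any time and then run the notebook job automatically from outside. This considerably broadens what notebook jobs can be used for.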
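Putting sections 3 and 5 together, a rough Python sketch of "overwrite the input notebook, then start the pipeline" might look like this. The function name is hypothetical, and the key layout {pipeline_name}/input/{notebook_name} is inferred from the S3 listings in this post, so it carries the same undocumented-behavior caveat as the rest of the article.

```python
from typing import Any


def replace_notebook_and_run(
    s3: Any,
    sagemaker: Any,
    local_path: str,
    bucket: str,
    pipeline_name: str,
    notebook_name: str,
) -> str:
    """Overwrite the schedule job's input notebook on S3, then start the pipeline."""
    # Inferred layout: the input notebook lives under <pipeline name>/input/.
    key = f"{pipeline_name}/input/{notebook_name}"
    s3.upload_file(local_path, bucket, key)
    response = sagemaker.start_pipeline_execution(PipelineName=pipeline_name)
    return response["PipelineExecutionArn"]
```

With boto3.client("s3") and boto3.client("sagemaker") passed in, calling it with replace_sample.ipynb, the sagemaker-automated-execution bucket, the pipeline name, and qiita_sample.ipynb corresponds to the aws s3 cp command above followed by a pipeline start.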

Summary

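I played around with Notebook Jobs, a new feature of SageMaker Studio. Since SageMaker Pipelines is running behind the scenes, my impression is that it can be integrated with other services quite flexibly. In this post I only swapped the notebook file dynamically, but the options described in the documentation look like they would allow further customization, such as automatic library installation. If I find the time, I would like to try those too. That's all.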
