
SageMaker for Beginners: A Collection of Resource Links

Posted at 2020-05-11, last updated at 2023-02-07

With SageMaker, AWS's managed machine learning service, a search rarely turns up exactly the material you need, so I am collecting the useful links here.

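Key characteristics of Amazon SageMaker

The managed features show their real value when you operate an ML system in production.
SageMaker is a set of components for implementing MLOps and the MDLC (Model Development Life Cycle).
PoC model development alone does not deliver the real benefit.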

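The shortcut from SageMaker beginner to intermediate (personal experience)

If you can build training, preprocessing, and inference to the following requirements, I think you have reached a level where the knowledge transfers to other cases.

Training: understand the case of a custom container whose training script is split across several files (for example, with your own util.py).
・Understanding Docker
・Understanding SageMaker's conventions (how data is expanded under /opt/ml/, where the entry script lives, and so on)

(Optional) Preprocessing: run preprocessing with SageMaker Processing and the Step Functions Data Science SDK.
・Understanding pipelines
・Understanding the SDK

Inference: as with training, use a custom container, pass data in through API Gateway, and return the result to the client.
・Understanding the API
・Understanding the SageMaker endpoint specification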
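As a rough illustration of the /opt/ml/ conventions mentioned for custom training containers, the sketch below reads hyperparameters and one input channel and writes a model artifact. The directory layout follows the documented container contract; the function name `run_training`, the channel name `train`, and the line-counting "training" are placeholders made up for this example.

```python
import json
from pathlib import Path


def run_training(base_dir="/opt/ml", channel="train"):
    """Read inputs the way a SageMaker training container receives them."""
    base = Path(base_dir)

    # Hyperparameters arrive as a JSON object whose values are all strings.
    hp_path = base / "input" / "config" / "hyperparameters.json"
    hyperparameters = json.loads(hp_path.read_text()) if hp_path.exists() else {}

    # Each input channel is expanded under input/data/<channel name>.
    data_dir = base / "input" / "data" / channel
    files = sorted(p for p in data_dir.glob("*") if p.is_file())

    # Placeholder for real training: count the lines across the input files.
    n_rows = sum(len(p.read_text().splitlines()) for p in files)

    # Files written under model/ are archived and uploaded to S3 after the job.
    model_dir = base / "model"
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "model.json").write_text(
        json.dumps({"n_rows": n_rows, "hyperparameters": hyperparameters})
    )
    return n_rows
```

Taking `base_dir` as a parameter is only a convenience so the same script can be tried against a local directory before it is packaged into an image.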

Official references

Machine Learning Lens, AWS Well-Architected Framework
https://d1.awsstatic.com/whitepapers/architecture/wellarchitected-Machine-Learning-Lens.pdf
Guidance for ML systems built on AWS
https://d1.awsstatic.com/whitepapers/ja_JP/architecture/wellarchitected-Machine-Learning-Lens.pdf
Japanese version

Developer Guide, including the SDK references
https://docs.aws.amazon.com/ja_jp/sagemaker/index.html

GitHub (English)
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/README.md

GitHub (Japanese)
https://github.com/aws-samples/amazon-sagemaker-examples-jp

Amazon SageMaker Discussion Forums
https://forums.aws.amazon.com/forum.jspa?forumID=285

AWS サービス別資料
https://aws.amazon.com/jp/aws-jp-introduction/aws-jp-webinar-service-cut/#ai-wn

Amazon SageMaker の体験ハンズオン動画とQAを公開しました
https://aws.amazon.com/jp/blogs/news/amazon-sagemaker-handson-20190517/
The Q&A is very helpful.

MDLC : Model Development Life Cycle

「Amazon.comにおけるスケール可能なモデルデプロイと自動化」
https://aws.amazon.com/jp/blogs/news/aws-aiml-tokyo2/

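This session introduced the concept of the machine learning Model Development Life Cycle (MDLC), which customers should consider when adopting machine learning, and the workflow for carrying out the MDLC. The MDLC is the approach actually used at Amazon Consumer Payments and was also presented in a session at re:Invent 2019.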

Code4兄弟(3兄弟)を使ってリポジトリにPushしたらDockerイメージをECRにPushする
https://qiita.com/is_ryo/items/7c28616b0e25013ec121

The section "CodeBuildでDockerイメージをPushした方がもっと楽ですよ？" ("Pushing the Docker image from CodeBuild is even easier, you know?") is a useful reference.

GitHub/CodeBuild/CodePipelineを利用してCloudFormationのCI/CDパイプラインを構築する
https://dev.classmethod.jp/articles/developing-cloudformation-ci-cd-pipeline-with-github-codebuild-codepipeline/
Covers everything from pushing source code to running CloudFormation, with samples.

Overview of SageMaker features

Amazon SageMaker の基礎
https://pages.awscloud.com/rs/112-TZM-766/images/20191128_Amazon%20SageMaker_Basic.pdf

[AWS Black Belt Online Seminar] Amazon SageMaker Advanced Session 資料及び QA 公開
https://aws.amazon.com/jp/blogs/news/webinar-bb-amazon-sagemaker-advanced-session-2019/?nc1=f_ls

SageMakerとxGBoostで機械学習の全体像をつかむ
https://qiita.com/suzukihi724/items/3792f395fb22cf7fb311

How data is passed to the container

AWS SageMaker 組み込み手法を使う際の投げ方
https://qiita.com/kazuhisa-nagashima/items/21d80271da733d15f4b0

Docker で環境変数をホスト OS 側からゲスト OS （コンテナ）に渡す方法（各種）
https://qiita.com/KEINOS/items/518610bc2fdf5999acf2
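I used to wonder why environment variables are used to hand over parameters, but passing information to a container through environment variables appears to be the standard practice.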

To use several files in a single S3 bucket as training data, list them in a manifest file.
https://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/augmented-manifest.html
[Note] Augmented manifest files are supported only for channels that use Pipe input mode.

Example: a manifest file that lists several CSV files

{"train-ref": "s3://<S3のURI>/train1.csv"}
{"train-ref": "s3://<S3のURI>/train2.csv"}
{"train-ref": "s3://<S3のURI>/train3.csv"}
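A file in this one-JSON-object-per-line form can be generated rather than written by hand. The following is a minimal sketch; the helper name `build_manifest_lines` and the bucket `example-bucket` are hypothetical, and only the `train-ref` attribute name is taken from the example above.

```python
import json


def build_manifest_lines(s3_uris, attribute="train-ref"):
    """Return one JSON object per line, each pointing at one S3 object."""
    return [json.dumps({attribute: uri}) for uri in s3_uris]


# Hypothetical URIs used only for illustration.
lines = build_manifest_lines(
    [f"s3://example-bucket/data/train{i}.csv" for i in (1, 2, 3)]
)
manifest_text = "\n".join(lines) + "\n"
```

`manifest_text` would then be uploaded to S3 as a single object and referenced as the channel's data source.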

https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html
If ‘ManifestFile’ or ‘AugmentedManifestFile’, then s3_data defines a single S3 manifest file or augmented manifest file (respectively), listing the S3 data to train on.

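[As of 2020/11/14] If you want to train on only some of the files in an S3 folder, you can list them in an augmented manifest, but the algorithm must support Pipe mode.

The built-in XGBoost algorithm supports File mode only (no Pipe mode), so it cannot train on a subset of files selected this way. You have to specify either a single training file or a whole folder.
So if, for example, daily POS data is stored in one folder as per-date files such as YYYYMMDD.csv and you want to train on part of it (say 10 days), you either create a separate train/ folder, copy the relevant files into it, and specify that directory, or merge the 10 days into one file and specify that file. Either way, preparing for training consumes extra storage on top of the existing data.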
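The copy-into-train/ workaround comes down to listing the per-date keys for the chosen window and pairing each with a destination key. A small sketch, assuming the YYYYMMDD.csv naming from the text; the prefixes `pos/` and `train/` and both function names are hypothetical.

```python
from datetime import date, timedelta


def daily_keys(prefix, start, days):
    """List the YYYYMMDD.csv keys for `days` consecutive days from `start`."""
    return [
        f"{prefix}{(start + timedelta(days=i)).strftime('%Y%m%d')}.csv"
        for i in range(days)
    ]


def copy_plan(src_prefix, dst_prefix, start, days):
    """Pair each source key with its destination under the training prefix."""
    src = daily_keys(src_prefix, start, days)
    dst = daily_keys(dst_prefix, start, days)
    return list(zip(src, dst))


plan = copy_plan("pos/", "train/", date(2020, 11, 1), 10)
```

Each `(source, destination)` pair in `plan` would be handed to whatever S3 copy mechanism you already use; the training job then points at the `train/` prefix.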

The input modes supported by each built-in algorithm are listed here:
https://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html

Training jobs

Things only training jobs can do
・Wider choice of instance types (g4, etc.)
・Lustre
・Managed Spot Training

Amazon SageMakerを活用した推論パイプライン運用　ディー・エヌ・エーのエンジニアが語る構成とツール検討の試行錯誤
https://logmi.jp/tech/articles/324734
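Running inference as a training job lets you use Spot Instances and reduce costs.
The article does not mention it, but you can use boto3 to save output files to S3 as you go (they are not compressed either).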

[Official blog] Choose the best data source for your Amazon SageMaker training job
https://aws.amazon.com/jp/blogs/machine-learning/choose-the-best-data-source-for-your-amazon-sagemaker-training-job/
A blog post that examines how speed differs depending on how data is supplied to a training job.

Training and inference with your own container

Amazon SageMakerでLightGBMが使えるコンテナイメージを作ってみた
https://dev.classmethod.jp/articles/sagemaker-container-image-lightgbm/

Amazon SageMakerで独自の学習/推論用コンテナイメージを作ってみる
https://dev.classmethod.jp/articles/sagemaker-container-image-custom/

Amazon SageMakerでRを使った独自コンテナを作成してみる
https://dev.classmethod.jp/articles/sagemaker-r-container/
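In that article,
args <- commandArgs()
is better written as
args <- commandArgs(trailingOnly = TRUE)
Without trailingOnly, the vector also contains the R executable and the script path, so the script stops working correctly when the file name contains something like train_serve.R.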

GitHub (Japanese)
https://github.com/aws-samples/amazon-sagemaker-examples-jp/tree/master/workshop/lab_bring-your-own-containers

[Developer Guide] ホスティングサービスでの独自の推論コードの使用
https://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/your-algorithms-inference-code.html

【Techの道も一歩から】第22回「AWS SageMakerで任意のコンテナをデプロイする」
https://buildersbox.corp-sansan.com/entry/2019/10/16/110000
Builds a custom inference endpoint that serves an app written in Flask behind gunicorn and Nginx.

It also explains how SageMaker works internally.

Inference: serving through an API

Amazon SageMaker 推論エンドポイントを利用したアプリケーション開発
https://www.slideshare.net/AmazonWebServicesJapan/amazon-sagemaker-122749918

Amazon SageMaker エンドポイント用のサーバーレスフロントエンドを構築する
https://aws.amazon.com/jp/blogs/news/build-a-serverless-frontend-for-an-amazon-sagemaker-endpoint/

Amazon Sagemaker 推論モデルを構築、テストし、AWS Lambda にデプロイする
https://aws.amazon.com/jp/blogs/news/build-test-and-deploy-your-amazon-sagemaker-inference-models-to-aws-lambda/

AWSでAPI Gatewayから非同期でLambdaを起動してS3にファイルアップロードしようとしたらハマった話。
https://www.slideshare.net/takehirosuemitsu/awsapi-gatewaylambdas3
Note that Lambda behaves differently when it is invoked from API Gateway.

When Lambda is executed over HTTPS through API Gateway, it always uses the RequestResponse (synchronous) invocation type.
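Because the call is synchronous, the handler's return value is what the client receives. The sketch below assumes the Lambda proxy integration, where the request body arrives as a string and the response needs `statusCode` and a string `body`. The `make_handler` helper and the injected `predict` callable are hypothetical; in practice `predict` would wrap a call to the SageMaker endpoint.

```python
import json


def make_handler(predict):
    """Build a Lambda handler around `predict`, a callable taking and returning a dict."""

    def lambda_handler(event, context):
        # With the proxy integration the HTTP body arrives as a string.
        try:
            payload = json.loads(event.get("body") or "{}")
        except json.JSONDecodeError:
            return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}

        result = predict(payload)

        # The invocation is synchronous, so this goes straight back to the client.
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(result),
        }

    return lambda_handler
```

Injecting `predict` keeps the HTTP handling testable without an endpoint.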

What is the SageMaker "SDK socket timeout"?

If real-time inference takes 60 seconds or more and times out, setting the SDK socket timeout to 70 seconds, as described at the link below, may resolve it.
https://aws.amazon.com/jp/premiumsupport/knowledge-center/sagemaker-upstream-timed-out-header/?nc1=h_ls

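If you are running inference code on the hosting service: the model container must respond to requests within 60 seconds. The model itself can have a maximum processing time of 60 seconds. If you know the model needs 50 to 60 seconds of processing time, set the SDK socket timeout to 70 seconds.

You may wonder what this "SDK socket timeout" is. When you invoke the endpoint (send data to it) with boto3, it is botocore's read_timeout setting.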

read_timeout (float or int) -- The time in seconds till a timeout exception is thrown when attempting to read from a connection. The default is 60 seconds.

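For reference,
the socket timeout is the maximum time the HTTP client waits to read data from an already established TCP connection, that is, the time from the end of the HTTP POST until the full response to the request has been received.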
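With boto3, the setting is passed through a botocore `Config` object when the client is created. A minimal sketch, assuming a 10-second margin on top of the model's worst-case time; the helper names `read_timeout_for` and `make_runtime_client` are made up for this example.

```python
def read_timeout_for(max_model_seconds, margin_seconds=10):
    """Client read timeout: the slowest expected response plus a margin."""
    return max_model_seconds + margin_seconds


def make_runtime_client(region_name, max_model_seconds=60):
    """Create a SageMaker runtime client whose read timeout exceeds 60 s."""
    # Imported lazily so the helper above stays usable without boto3 installed.
    import boto3
    from botocore.config import Config

    config = Config(read_timeout=read_timeout_for(max_model_seconds))
    return boto3.client("sagemaker-runtime", region_name=region_name, config=config)
```

The returned client is used as usual with `invoke_endpoint`. This only raises the client-side limit; the 60-second limit on the model container itself is unchanged.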

Serverless inference with SageMaker and Lambda

Introduces from sagemaker.serverless import LambdaModel
https://sagemaker.readthedocs.io/en/stable/overview.html#serverless-inference

Logs

20190206 AWS Black Belt Online Seminar Amazon SageMaker Basic Session
https://www.slideshare.net/AmazonWebServicesJapan/20190206-aws-black-belt-online-seminar-amazon-sagemaker-basic-session-130777850
Covered in the references at the end of the slides.

Security

ML Security on AWS
https://pages.awscloud.com/rs/112-TZM-766/images/A2-07.pdf

[YouTube] ML Security on AWS | AWS Summit Tokyo 2019
https://www.youtube.com/watch?v=ypUrtcRZQmU&feature=youtu.be
Video of the session above

Enable secure ML Deployments in Financial Services
https://d1.awsstatic.com/events/reinvent/2019/Enable_secure_ML_deployments_in_Financial_Services_FSI404.pdf
Security best practices for financial services

AWSで「インターネットに出てはいけない」要件を解決する方法～PrivateLink対応～
https://bcblog.sios.jp/aws-privatelink/

AWS PrivateLinkの使い方を解説する
https://blog.mmmcorp.co.jp/blog/2017/11/15/aws_privatelink/

AWS PrivateLinkの使い方と注意点～VPCピアリングとの使い分け～
https://devlog.arksystems.co.jp/2018/05/11/4896/

CodeArtifact:Working with Amazon VPC endpoints
https://docs.aws.amazon.com/codeartifact/latest/ug/vpc-endpoints.html
A library management solution for environments that cannot reach the internet
For example, running pip over PrivateLink

閉域網で Amazon SageMaker を利用する際のポイントと手順
https://aws.amazon.com/jp/blogs/news/internet-free-sagemaker/

[Official AWS blog] Building secure machine learning environments with Amazon SageMaker
https://aws.amazon.com/blogs/machine-learning/building-secure-machine-learning-environments-with-amazon-sagemaker/

Includes a GitHub link to the workshop.

AWS re:Invent 2019: Security for ML environments w/ Amazon SageMaker, featuring Vanguard (AIM327-R1)
https://www.youtube.com/watch?v=asuYSd5OmAE
Lists the services that require a private VPC.

Hands-on and Q&A

Design and operations

AWS SageMakerでの機械学習モデル開発フロー（PyTorch）
https://qiita.com/noko_qii/items/41130f66afbb8e451f23

Preprocessing: SageMaker Processing

Amazon SageMaker Processingで独自コンテナを使ってみた
https://dev.classmethod.jp/articles/amazon-sagemaker-processing-original-container/
Apart from preparing a container with the required libraries installed, it is almost the same as SKLearnProcessor.

Using a built-in container: SKLearnProcessor
https://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/use-scikit-learn-processing-container.html

Using your own container: ScriptProcessor

Amazon SageMaker Processing – フルマネージドなデータ加工とモデル評価
https://aws.amazon.com/jp/blogs/news/amazon-sagemaker-processin-fully-managed-data-processing-and-model-evaluation/

How SageMaker Processing differs from training jobs

・g4 instances are not available
https://aws.amazon.com/jp/sagemaker/pricing/?nc1=h_ls
See the "Processing" tab.

・Lustre is not available
https://sagemaker.readthedocs.io/en/stable/overview.html

https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_processing.html
・Managed Spot Training is not available

Running an ipynb file as is

https://github.com/aws-samples/sagemaker-run-notebook
Provides a mechanism that runs ipynb files on a schedule using papermill (batch execution of ipynb) and SageMaker Processing.

Building pipelines

Amazon SageMaker を利用したMLのためのCI/CDパイプライン
https://pages.awscloud.com/rs/112-TZM-766/images/E-3.pdf

Incremental training

Some algorithms support incremental training, which loads an existing model and continues training from it.
https://docs.aws.amazon.com/sagemaker/latest/dg/incremental-training.html

Only three built-in algorithms currently support incremental training: Object Detection Algorithm, Image Classification Algorithm, and Semantic Segmentation Algorithm.

That is, object detection, image classification, and semantic segmentation [as of 2020/12/9].

Amazon SageMaker Object Detection Incremental Training
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/object_detection_pascalvoc_coco/object_detection_incremental_training.ipynb
A useful reference on incremental training for object detection.

Amazon SageMaker と Amazon Augmented AI によるオブジェクトの検出とモデルの再トレーニング
https://aws.amazon.com/jp/blogs/news/object-detection-and-model-retraining-with-amazon-sagemaker-and-amazon-augmented-ai/

【小ネタ】[Amazon SageMaker] 既存のモデルを使用した増分学習をJupyter Notebookでやってみました
https://dev.classmethod.jp/articles/amazon-sagemaker-incrementaltraining/

Deployment

ステップ 6: モデルを Amazon SageMaker にデプロイする
https://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/ex1-model-deployment.html

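To get predictions, deploy your model. The method you use depends on how you want to generate inferences.
・To get one inference at a time in real time, set up a persistent endpoint using Amazon SageMaker hosting services.
・To get inferences for an entire dataset, use Amazon SageMaker batch transform.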

【小ネタ】[Amazon SageMaker] 既存のモデルのデプロイをJupyter Notebookでやってみました
https://dev.classmethod.jp/articles/amazon-sagemaker-deploy_existing_model/

Amazon SageMakerでコードを書かずにモデルのデプロイをする
https://dev.classmethod.jp/articles/amazon-sagemaker-non-coding-deploy/
It is also possible to go from training to deployment using only the console.

Endpoints can also be scaled automatically.

Amazon SageMaker のモデルを自動的にスケーリングする
https://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/endpoint-auto-scaling.html

デプロイメントのベストプラクティス
https://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/best-practices.html
If an outage occurs or an instance fails, SageMaker automatically attempts to distribute your instances across multiple Availability Zones.

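Inference on Lambda

Instead of using a SageMaker endpoint, you can load the model in Lambda and run inference there.
The advantage is pay-per-use pricing, but there are points to check, such as whether cold starts are acceptable and whether the model fits within the memory limit (10 GB).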
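One common way to limit the cold-start cost is to load the model once per execution environment and reuse it on warm invocations. The sketch below illustrates that pattern only; `load_model` is a stand-in (a short sleep and a function that sums its input), not a real model loader.

```python
import time

_MODEL = None  # survives between invocations while the environment stays warm


def load_model():
    """Stand-in for an expensive load, e.g. reading weights from the image."""
    time.sleep(0.01)
    return lambda features: sum(features)


def get_model():
    global _MODEL
    if _MODEL is None:  # only true on a cold start
        _MODEL = load_model()
    return _MODEL


def lambda_handler(event, context):
    model = get_model()
    return {"prediction": model(event["features"])}
```

Only the first request in a new environment pays for `load_model`; whether that one-off delay is acceptable is the cold-start question raised above.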

AWS Lambda がコンテナイメージをサポートしたので YOLOv5 を hosting してみた
https://zenn.dev/gokauz/articles/72e543796a6423

Provisioned Concurrency as a countermeasure for Lambda cold starts
Lambdaのコールドスタートを改めて整理する
https://qiita.com/ny7760/items/700ae917da2c5b5e3f8a
However, it is billed while enabled, so leaving it on permanently removes the cost advantage.

Lambda memory limits
[アップデート]Lambdaのメモリ上限が10G、vCPUの上限が6に拡張されました！！ #reinvent
https://dev.classmethod.jp/articles/lambda-memory-limit-inclease-to-10g/

A/B testing, canary releases, and Blue/Green deployment

Amazon SageMaker を使用して本番稼働で ML モデルの A/B テストを行う
https://aws.amazon.com/jp/blogs/news/a-b-testing-ml-models-in-production-using-amazon-sagemaker/

[GitHub] AB testing
https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker_endpoints/a_b_testing/a_b_testing.ipynb

When using CodeDeploy's Blue/Green deployment feature
[Official blog] Safely deploying and monitoring Amazon SageMaker endpoints with AWS CodePipeline and AWS CodeDeploy
https://aws.amazon.com/blogs/machine-learning/safely-deploying-and-monitoring-amazon-sagemaker-endpoints-with-aws-codepipeline-and-aws-codedeploy/

[GitHub] Amazon SageMaker Safe Deployment Pipeline
https://github.com/aws-samples/amazon-sagemaker-safe-deployment-pipeline

[Official] Amazon SageMaker Safe Deployment Pipeline hands-on
https://mlops-safe-deployment-pipeline.workshop.aws/

Batch inference

One trick is to run inference as a training job.
・Spot Instances are available
・More instance types are available

バッチ変換の使用
https://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/batch-transform.html

Real-time inference

Amazon SageMaker の推論パイプラインで、独自コンテナを組み合わせる方法
https://qiita.com/yaiwase/items/79f99d2c38ed66729a47

Inference pipelines

推論パイプラインのデプロイ
https://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/inference-pipelines.html

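An inference pipeline is an Amazon SageMaker model composed of a linear sequence of two to five containers that process requests for inferences on data.

When you deploy the pipeline model, Amazon SageMaker installs and runs all of the containers on each Amazon Elastic Compute Cloud (Amazon EC2) instance in the endpoint or transform job. Feature processing and inference run with low latency because the containers are co-located on the same EC2 instances.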

SageMakerの推論パイプラインについて調べてみた
https://dev.classmethod.jp/articles/yoshim-sagemakerinference-pipeline/

The model from "3" can be used for both real-time inference and batch transform.

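Use an inference pipeline when you want pre-processing and post-processing around real-time inference.
Can an inference pipeline also be used for batch inference?
-> Yes. As shown below, the inference pipeline (a kind of meta-model) is called from batch transform.
This also suits the case where you normally serve real-time inference but batch is good enough at night.
On the other hand, if you do not need real-time inference and already run pre/post-processing with SageMaker Processing at training time,
another option is Step Functions running pre-processing (Processing) -> inference (batch transform) -> post-processing (Processing).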
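The Processing -> batch transform -> Processing chain can be written as a three-state Step Functions definition. The sketch below builds only the skeleton in Amazon States Language as a Python dict. The state names are made up, and `Parameters` is deliberately left empty because the real job settings (images, roles, S3 paths) depend on your environment.

```python
import json


def task(resource, next_state=None):
    """Build a Task state; job parameters are left empty in this sketch."""
    state = {"Type": "Task", "Resource": resource, "Parameters": {}}
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


# Step Functions service integrations that wait for the job to finish (.sync).
PROCESSING = "arn:aws:states:::sagemaker:createProcessingJob.sync"
TRANSFORM = "arn:aws:states:::sagemaker:createTransformJob.sync"

definition = {
    "StartAt": "Preprocess",
    "States": {
        "Preprocess": task(PROCESSING, "BatchTransform"),
        "BatchTransform": task(TRANSFORM, "Postprocess"),
        "Postprocess": task(PROCESSING),
    },
}

print(json.dumps(definition, indent=2))
```

The `.sync` suffix makes each state wait for its job to complete before the next one starts, which is what a pre-process, infer, post-process sequence needs.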

Amazon SageMaker の推論パイプラインで、独自コンテナを組み合わせる方法
https://qiita.com/yaiwase/items/79f99d2c38ed66729a47

[Official] GitHub (English)
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/inference_pipeline_sparkml_xgboost_abalone/inference_pipeline_sparkml_xgboost_abalone.ipynb

[DeNA case study] SageMakerで試行錯誤する推論パイプライン
https://speakerdeck.com/moajo/sagemakerdeshi-xing-cuo-wu-surutui-lun-paipurain

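Failure cases

Inference can gradually become slower as time passes. (Or was it training?)
EFS throttling may be occurring.
EFS throttling: there are credits for absorbing bursts, but once they are used up,
throughput drops to the baseline rate, so things slow down.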

from sagemaker.transformer import Transformer

# Create the Transformer object for the batch transform job.
# model_name, output_data_path, sess and CONTENT_TYPE_CSV are assumed to be defined earlier, as in the source notebook.
transformer = Transformer(
    # This was the model created using PipelineModel and it contains feature processing and XGBoost
    model_name=model_name,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    strategy='SingleRecord',
    assemble_with='Line',
    output_path=output_data_path,
    base_transform_job_name='serial-inference-batch',
    sagemaker_session=sess,
    accept=CONTENT_TYPE_CSV
)

Passing arguments along when using Lambda in Step Functions

The official guide did not make this clear to me, so here are some articles that explain it well.

Stepで覚えるStepFunctions基本のキ
https://dev.classmethod.jp/articles/step_stepfunctions_tutorial/
Includes a sample of passing arguments with a Python Lambda function.

lambda-test.py

def lambda_handler(event, context):
    event["Result"] = True  # You can add any fields you like to event.
    event["LoopCount"] = 10 
    return event

At first I did not understand what this event variable was; the article below helped.

AWS Lambdaメモ -ハンドラの引数eventの中身-
http://tk5-21.hatenablog.com/entry/2018/01/12/003751

For the context variable, see the official documentation below.

[Official] Python の AWS Lambda context オブジェクト
https://docs.aws.amazon.com/ja_jp/lambda/latest/dg/python-context.html
In short, it appears to hold information about the running Lambda function.

Variables passed to Step Functions at execution time are specified as JSON.
When you create a test case in the GUI, it looks like this:

{
  "Comment": "Insert your JSON here",
  "Var1": 1,
  "Var2": 2,
  "Var3": 3
}
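A handler receiving the input above could look like the following sketch. It assumes the Task state points directly at the function and uses the default input/output paths, so the execution input arrives as `event` and the return value becomes the input of the next state. The `Total` field is made up for illustration.

```python
def lambda_handler(event, context):
    # `event` is the execution input, already parsed from JSON into a dict.
    total = event["Var1"] + event["Var2"] + event["Var3"]
    # Return the enriched dict so the next state can read the new field.
    return {**event, "Total": total}


execution_input = {"Comment": "Insert your JSON here", "Var1": 1, "Var2": 2, "Var3": 3}
output = lambda_handler(execution_input, None)
```

Returning the whole dict, rather than only the new value, keeps the original variables available to later states.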

"Comment": "Insert your JSON here", is filled in by default.

AWS Step Functions

AWS Step Functions
https://docs.aws.amazon.com/ja_jp/step-functions/latest/dg/connect-sagemaker.html

Pattern 1: Step Functions -> SageMaker API
(Possible when the SageMaker API you need is integrated into Step Functions)

API Gateway を使用して Step Functions API を作成する
https://docs.aws.amazon.com/ja_jp/step-functions/latest/dg/tutorial-api-gateway.html
Use this when you want to accept requests through API Gateway.

AWS Step Functions Data Science SDK for Python
https://docs.amazonaws.cn/en_us/step-functions/latest/dg/concepts-python-sdk.html

ステートマシーンを簡単に作るための「AWS Step Functions Data Science SDK」の紹介
https://dev.classmethod.jp/articles/yoshim_step_functions_datascience_sdk/

Step Functionsで扱えるSageMakerのAPIが増えました
https://dev.classmethod.jp/articles/stepfunctions-sagemaker-api-update/

Pattern 2: Step Functions -> Lambda -> SageMaker API
独自の処理コンテナを使用したスクリプトの実行
https://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/processing-container-run-scripts.html

MLワークフローを「AWS Step Functions」でサーバレスに管理する:Amazon SageMaker Advent Calendar 2018
https://dev.classmethod.jp/articles/2018advent-calendar-sagemaker-20181210/

AWS StepFunctions を使った機械学習ワークフローの管理
https://qiita.com/kurakura0916/items/5e89cb86e86d22fdc5d8

AWS Step FunctionsでLambdaを組み合わせたバッチ処理を作る
https://dev.classmethod.jp/articles/aws-step-functions-batch-service/

[GitHub] AWS Step Functions Data Science SDK - Hello World
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/step-functions-data-science-sdk/hello_world_workflow/hello_world_workflow.ipynb

Automating Amazon SageMaker workflows with AWS Step Functions
https://www.youtube.com/watch?v=0kMdOi69tjQ

Orchestrate Machine Learning Workflows with Amazon SageMaker and AWS Step Functions
https://www.youtube.com/watch?v=dNb5jVffzPs

20190522 AWS Black Belt Online Seminar AWS Step Functions
https://www.slideshare.net/AmazonWebServicesJapan/20190522-aws-black-belt-online-seminar-aws-step-functions

Building an AWS Serverless ML Pipeline with Step Functions
https://tech.olx.com/building-an-aws-serverless-ml-pipeline-with-step-functions-b39feed12bab

Step Functionsを使って初めてループや分岐をやってみた！
https://dev.classmethod.jp/articles/first-aws-step-functions/
A hands-on for Step Functions + Lambda

AWS Lambda

AWS Lambdaのアプリケーション作成を使ってCI/CDパイプラインを一気に構築
https://qiita.com/shonansurvivors/items/b223fbb362aed3c1c536

AWS Glue

AWS GLUE の概念
https://docs.aws.amazon.com/ja_jp/glue/latest/dg/components-key-concepts.html

Glueの使い方的な
https://qiita.com/pioho07/items/32f76a16cbf49f9f712f

5TB/日のデータをAWS Glueでさばくためにやったこと（概要編
https://future-architect.github.io/articles/20180828/
The comparison between EMR and Glue is very helpful.

MLOps

ML Ops on AWS
https://d1.awsstatic.com/events/jp/2018/summit/tokyo/aws/44.pdf

[YouTube] ML Ops on AWS | AWS Summit Tokyo 2018
https://www.youtube.com/watch?v=RJodLEo-SuE&list=PLzWGOASvSx6Gm88FXmryU-T1WKC3-SOl2&index=103
Video of the session above

Lambdaと下げメーカーで機械学習モデルの構築を自動化
https://medium.com/@yuyasugano/lambda%E3%81%A8%E4%B8%8B%E3%81%92%E3%83%A1%E3%83%BC%E3%82%AB%E3%83%BC%E3%81%A7%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%83%A2%E3%83%87%E3%83%AB%E3%81%AE%E6%A7%8B%E7%AF%89%E3%82%92%E8%87%AA%E5%8B%95%E5%8C%96-73161d316c0e

Step functionsとaws batchでオーケストレートするイベントドリブンな機械学習基盤
https://www.slideshare.net/yuyamada777/step-functionsaws-batch

機械学習基盤/MLOpsまわりの勉強をしたときに参考にさせていただいた記事（2018.12時点）
https://qiita.com/noko_qii/items/f31901817dbed86f2b25

Hyperparameter tuning

ハイパーパラメータ調整の仕組み
https://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html

Optuna(TPE)のアルゴリズム理解ー Part 1 ー
https://qiita.com/nabenabe0928/items/708d221dbccebf31f01c

GitHub(日本語)
https://github.com/aws-samples/amazon-sagemaker-examples-jp/blob/master/hpo_pytorch_mnist/pytorch_mnist.ipynb

Optuna ハイパーパラメータ最適化フレームワーク
https://www.slideshare.net/pfi/pydatatokyo-meetup-21-optuna
Define-by-Run: the search space is defined while the objective function runs

Using Optuna on SageMaker
https://aws.amazon.com/jp/blogs/machine-learning/implementing-hyperparameter-optimization-with-optuna-on-amazon-sagemaker/

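SageMaker HPO: the search space is fixed up front.
Optuna: parameters are chosen inside a for loop, so the search space is defined dynamically.
・Multiple trials run within one training job.
・When a trial finishes, its parameters are written to Aurora. The next parameters are chosen by looking at the past ones.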
pytorch-simple
https://github.com/aws-samples/amazon-sagemaker-optuna-hpo-blog/blob/master/examples/pytorch_simple/src/pytorch_simple.py

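・objective(trial): each evaluation is called a trial (or an experiment).
・The number of units per layer is tuned, and so is the number of layers (line 46).
n_layers = trial.suggest_int("n_layers", 1, 3)
・Optuna uses a relational database as its storage backend here, so RDS or Aurora is needed.
・Line 51 tunes the number of units.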
out_features = trial.suggest_int("n_units_l{}".format(i), 4, 128)
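The define-by-run idea in the two `suggest_int` lines can be seen without installing Optuna. The sketch below uses a toy stand-in class, `ToyTrial`, which samples at random and records what was asked for. It is not Optuna's implementation or its TPE sampler, and the score is a placeholder.

```python
import random


class ToyTrial:
    """Tiny stand-in for an Optuna trial: samples and records parameters."""

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self.params = {}

    def suggest_int(self, name, low, high):
        value = self._rng.randint(low, high)  # inclusive on both ends
        self.params[name] = value
        return value


def objective(trial):
    # The number of layers is sampled first...
    n_layers = trial.suggest_int("n_layers", 1, 3)
    # ...and decides how many per-layer parameters exist in this trial.
    units = [trial.suggest_int("n_units_l{}".format(i), 4, 128) for i in range(n_layers)]
    return sum(units)  # placeholder score, not a real validation metric


trials = [ToyTrial(seed) for seed in range(5)]
scores = [objective(t) for t in trials]
best = trials[scores.index(max(scores))]
```

Each trial ends up with a different set of parameter names depending on `n_layers`. A search space fixed up front, as in SageMaker HPO, cannot express that directly.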

[Official YouTube] Tune Your ML Models to the Highest Accuracy with Amazon SageMaker Automatic Model Tuning
https://www.youtube.com/watch?v=xpZFNIOaQns
Explains, among other things, that the first phase runs the same model with three different sets of hyperparameters. (2:20-)

AutoML

その機械学習プロセス、自動化できませんか？
https://qiita.com/Hironsan/items/30fe09c85da8a28ebd63

【開催報告 & 資料公開】 AI/ML@Tokyo #6 AutoGluon 開催報告
https://aws.amazon.com/jp/blogs/news/aws-aiml-tokyo6/

AutoGluon (English)
https://aws.amazon.com/jp/blogs/opensource/machine-learning-with-autogluon-an-open-source-automl-library/

Explainability

SHAP値で解釈する前にPermutation ImportanceとPDPを知る
https://research.miidas.jp/2019/11/shap%E5%80%A4%E3%81%A7%E8%A7%A3%E9%87%88%E3%81%99%E3%82%8B%E5%89%8D%E3%81%ABpermutation-importance%E3%82%92%E7%9F%A5%E3%82%8B/
A mathematical explanation of SHAP and related methods

Visualization: Amazon QuickSight

Amazon QuickSight とは
https://docs.aws.amazon.com/ja_jp/quicksight/latest/user/welcome.html

AWSマンガ第 8 話：全てのデータを可視化しろ！ ( 1 / 8 )
https://aws.amazon.com/jp/campaigns/manga/vol8-1/
Hands-on material is at the bottom of the page.

サンプルデータを使用して Amazon QuickSight で試用する 10 の視覚化
https://aws.amazon.com/jp/blogs/news/10-visualizations-to-try-in-amazon-quicksight-with-sample-data/

[Official] ML インサイトの使用
https://docs.aws.amazon.com/ja_jp/quicksight/latest/user/making-data-driven-decisions-with-ml-in-quicksight.html
Both anomaly detection and forecasting use Random Cut Forest.

Visualizing Amazon SageMaker machine learning predictions with Amazon QuickSight
https://aws.amazon.com/jp/blogs/machine-learning/making-machine-learning-predictions-in-amazon-quicksight-and-amazon-sagemaker/
Integration between SageMaker and QuickSight

Running existing R code

Use a custom container.
Amazon SageMaker で独自のアルゴリズムやモデルを使用する
https://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/your-algorithms.html

GitHub (English)
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/r_bring_your_own

For Kubernetes/Kubeflow users

Amazon SageMaker Operators for Kubernetes のご紹介
https://aws.amazon.com/jp/blogs/news/introducing-amazon-sagemaker-operators-for-kubernetes/
Usable from EKS, and from plain Kubernetes as well.

Amazon SageMaker Components for Kubeflow Pipelines
https://aws.amazon.com/about-aws/whats-new/2020/06/amazon-sagemaker-components-kubeflow-pipelines/
SageMaker can be plugged into Kubeflow.

機械学習基盤を作るのに Kubernetes か SageMaker で迷っている人へ / ML Platform
https://speakerdeck.com/hariby/ml-platform

Distributed processing

Amazon SageMaker で実現する大規模データのための分散学習とワークフロー | AWS Summit Tokyo 2019
https://www.youtube.com/watch?v=NUnIiYD-PEU&t=714s
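AWS CodeCommit
EFS: sharing data between instances and between users
File mode / Pipe mode
Parameters for large data (100 GB)
HPO
Amazon Elastic Inference
Batch transform
Best practices for data preparation (Glue/Athena)
Pack datasets into RecordIO / TFRecord files
Use batch transform when online prediction is not required
Use an inference pipeline instead of multiple endpoints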

[Slides for the session above] Amazon SageMaker で実現する大規模データのための分散学習とワークフロー
https://pages.awscloud.com/rs/112-TZM-766/images/D3-04.pdf

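Above 100 GB, consider EFS.
Above 1 TB, consider FSx for Lustre (the minimum size you can provision for FSx for Lustre is 1.2 TB, at 200 MB/s/TB). Note: this is not stated in the slides above.

Distributed processing at inference time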
https://docs.aws.amazon.com/ja_jp/sagemaker/latest/APIReference/API_S3DataSource.html

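・When reading from S3, SageMaker distributes the data automatically based on the S3 object keys.
・With Lustre, does the distribution have to be written explicitly?

<Cost comparison of Lustre and EFS>
EFS can be cheaper than Lustre in some cases. Lustre is billed for the capacity you provision, while EFS is billed for the size you actually use. Lustre is often over-provisioned to leave headroom, so it can end up more expensive.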

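# Value for the InputDataConfig argument of create_training_job
input_data_config = [
    {
        'ChannelName': 'training',
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': 's3://dena-vinay/input/',
                # Each instance receives a different subset of the objects under the prefix
                'S3DataDistributionType': 'ShardedByS3Key',
            }
        },
        'InputMode': 'File'
    },
]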

How SageMaker's simple data distribution differs from processing with Horovod
https://aws.amazon.com/jp/blogs/news/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/

Sample code for SageMaker's simple distribution (#1 in the blog post above)
https://github.com/aws-samples/amazon-sagemaker-examples-jp/tree/master/tensorflow_script_mode_training_and_serving

[Official] Distributed training workshop
https://sagemaker-workshop.com/builtin/parallelized.html

[Official] Blog post comparing the input modes used by SageMaker training jobs
https://aws.amazon.com/jp/blogs/machine-learning/choose-the-best-data-source-for-your-amazon-sagemaker-training-job/
Lustre is strongest in use cases with a large number of small files.

Reinforcement learning

「機械学習の実運用でよくある課題と， AWS を使った解決方法・事例紹介」 | AWS Summit Tokyo 2019
https://www.youtube.com/watch?v=a3smIzBC6BQ

AWS DeepRacerリーグで世界一になるには
https://www.dnp-ds.co.jp/tech_fun/pdf/tech_fun_007_1.pdf

[Official blog] Amazon SageMaker で分散強化学習を使用して AI 搭載の Battlesnake をスケーリングする
https://aws.amazon.com/jp/blogs/news/scaling-your-ai-powered-battlesnake-with-distributed-reinforcement-learning-in-amazon-sagemaker/

[Official blog] Amazon SageMaker の強化学習を使用して AI で駆動する Battlesnake を構築する
https://aws.amazon.com/jp/blogs/news/building-an-ai-powered-battlesnake-with-reinforcement-learning-on-amazon-sagemaker/

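Difference between reinforcement learning and genetic algorithms

Given an evaluation function f of the state, reinforcement learning is a method for maximizing the cumulative reward f(x₁) + f(x₂) + ... + f(xₙ) obtained after acting many times, whereas the other two are methods for finding the value argmaxₓ f(x) that maximizes a single reward.

Very clear. Reinforcement learning optimizes the cumulative reward. It adapts to changing conditions and optimizes the next action.
A genetic algorithm is one of the techniques for solving optimization problems.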
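The distinction between maximizing a single f(x) and maximizing the sum along a trajectory can be made concrete with a toy example. Everything below (the reward function, the candidate range, and the two trajectories) is invented for illustration.

```python
def f(x):
    """Toy reward for a state x (an assumption made for this example)."""
    return -(x - 3) ** 2


candidates = range(7)

# One-shot optimization: pick the single x with the highest f(x).
best_x = max(candidates, key=f)


def cumulative_reward(states):
    """What reinforcement learning scores: the sum f(x1) + ... + f(xn)."""
    return sum(f(x) for x in states)


# Two trajectories starting at 0, where each step moves by at most one.
direct = [0, 1, 2, 3, 3]
detour = [0, 0, 1, 2, 3]

print(best_x, cumulative_reward(direct), cumulative_reward(detour))
```

Both trajectories end in the state that `argmax` selects, yet their cumulative rewards differ. What a reinforcement learning agent is scored on is the path taken, not only the final value.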

Getting started with machine learning

[Beginner] AWS の機械学習サービス入門 | AWS Summit Tokyo 2019
https://www.youtube.com/watch?v=1gC46ODyudE

Containers (ECS/EKS/ECR/Fargate)

SageMaker overview

【AWS Black Belt Online Seminar】Amazon SageMaker Advanced Session
https://www.youtube.com/watch?v=G-s67PmTCjo&t=2496s

AI services

【AWS Black Belt Online Seminar】AWS AI Language Services
https://www.youtube.com/watch?v=Q0Ety9Z7oWM

【AWS Black Belt Online Seminar】AWS AI Services
https://www.youtube.com/watch?v=xvUyKjuv-Z4&t=1183s
Includes the inference architecture of Rekognition.

Amazon API Gateway

【AWS Black Belt Online Seminar】Amazon API Gateway
https://www.youtube.com/watch?v=EpEETIox03s&list=RDCMUCnjKWUK2t5QJYfeqqilhJhQ&index=8

Integration with other services

Amazon SageMaker と Amazon Redshift を利用した、高速・柔軟・セキュアな機械学習基盤の構築
https://aws.amazon.com/jp/blogs/news/build-fast-flexible-secure-machine-learning-platform-using-amazon-sagemaker-and-amazon-redshift/
Loading data from Redshift

Machine Learning (ML) with Amazon Athena (Preview) を使用する
https://docs.aws.amazon.com/ja_jp/athena/latest/ug/querying-mlmodel.html
Runs SageMaker inference from Athena.

SageMaker Python SDK

PyTorch: using PyTorch on SageMaker
https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html
Multiple files (*.py) are supported; specify the directory with source_dir.

Distributed training

Use XGBoost with the SageMaker Python SDK
https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/using_xgboost.html
The page says, roughly, that XGBoost supports this as well.

The XGBoost open source algorithm provides the following benefits over the built-in algorithm:

Latest version - The open source XGBoost algorithm typically supports a more recent version of XGBoost. To see the XGBoost version that is currently supported, see XGBoost SageMaker Estimators and Models.

Flexibility - Take advantage of the full range of XGBoost functionality, such as cross-validation support. You can add custom pre- and post-processing logic and run additional code after training.

Scalability - The XGBoost open source algorithm has a more efficient implementation of distributed training, which enables it to scale out to more instances and reduce out-of-memory errors.

Extensibility - Because the open source XGBoost container is open source, you can extend the container to install additional libraries and change the version of XGBoost that the container uses. For an example notebook that shows how to extend SageMaker containers, see Extending our PyTorch containers.

PyTorch の学習と推論を Amazon SageMaker で行う
https://github.com/harusametime/sagemaker-notebooks/blob/master/pytorch_mnist/pytorch_mnist.ipynb
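With multiple nodes, the entry point must be implemented so that training runs in a distributed fashion. In PyTorch you can choose gloo for distributed CPU training and nccl for distributed GPU training.

When using a framework, the distribution setting described here decides whether the data is split across instances.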
https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html

Speeding up data loading from S3

[Official] Amazon SageMaker で Amazon FSx For Lustre と Amazon EFS のサポートを開始、モデルのトレーニングを一層すばやく簡単に実行可能に
https://aws.amazon.com/jp/about-aws/whats-new/2019/08/amazon-sagemaker-works-with-amazon-fsx-lustre-amazon-efs-model-training/
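For large datasets, besides S3 you can place the data on EFS or Lustre (training only).
The minimum Lustre file system size is 1.2 TB.
(Rule of thumb) EFS for 100 GB or more of training data, Lustre for 1 TB or more.

Annotation and labeling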

CUS-94：FiNC Technologies における食事画像アノテーションのための Ground Truth への移行と活用事例
https://resources.awscloud.com/aws-summit-online-japan-2020-on-demand-industry-2-55513/cus-94-aws-summit-online-2020-finc-technologies-inc

Inference at the edge (SageMaker Neo)

SageMaker Neoの可能性について- 第3回 Amazon SageMaker 事例祭り＋体験ハンズオン
https://www.slideshare.net/tkatojp/sagemaker-neo-3-amazon-sagemaker
iOS uses a framework called Core ML.

Using SageMaker from an on-premises environment through the SDK

オンプレミス環境から Amazon SageMaker を利用する
https://aws.amazon.com/jp/blogs/news/sagemaker_from_onpremises/
Requires IAM configuration and credential setup.

When you want to solve optimization problems

[AWS Blog] Solving numerical optimization problems like scheduling, routing, and allocation with Amazon SageMaker Processing
https://aws.amazon.com/blogs/machine-learning/solving-numerical-optimization-problems-like-scheduling-routing-and-allocation-with-amazon-sagemaker-processing/
Uses optimization libraries inside SageMaker Processing.

[GitHub] FarOpt
https://github.com/aws-samples/faropt
FarOpt is a serverless optimization architecture.

今度こそ？使い物になるフリーの数理最適化（混合整数最適化）ソルバー（付きインターフェース） Python-MIP
https://qiita.com/keisukesato-ac/items/f2fb63140b80226ba687
Describes using SCIP from PuLP.

[GitHub] Pythonではじめる数理最適化〜ケーススタディでモデリングのスキルを身につけよう〜
https://github.com/ohmsha/PyOptBook
Support code for the book

Other tips

From Learn Amazon SageMaker: A guide to building, training, and deploying machine learning models for developers and data scientists (English)
https://www.amazon.co.jp/gp/product/180020891X

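・Managed Spot Training is available only for built-in algorithms and TensorFlow.
There is a 60-minute constraint. With a custom container you have to write the handling yourself.

・With CloudFormation you can create notebook instances and create endpoints from existing models.

・A model built in SageMaker can run on Fargate, but note that Fargate does not support GPUs.

・Local mode cannot be used with built-in algorithms.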
https://github.com/aws-samples/amazon-sagemaker-local-mode/issues/2

All SageMaker built in algorithms can be trained only with SageMaker managed instances.
SageMaker local mode is for Script Mode and for Bring Your Own Container training.
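For the Managed Spot Training tip, the handling a custom container has to write itself is usually checkpointing: save progress regularly and resume from the newest checkpoint after an interruption. The sketch below assumes checkpoints are kept under /opt/ml/checkpoints (the SDK's default local checkpoint path). The file naming and the JSON state are invented for illustration.

```python
import json
from pathlib import Path


def latest_checkpoint(checkpoint_dir):
    """Return (epoch, state) from the newest checkpoint, or (0, None) if none exist."""
    files = sorted(Path(checkpoint_dir).glob("epoch-*.json"))
    if not files:
        return 0, None
    state = json.loads(files[-1].read_text())
    return state["epoch"], state


def train(checkpoint_dir="/opt/ml/checkpoints", total_epochs=5):
    ckpt = Path(checkpoint_dir)
    ckpt.mkdir(parents=True, exist_ok=True)
    start_epoch, _ = latest_checkpoint(ckpt)
    for epoch in range(start_epoch + 1, total_epochs + 1):
        # ... one epoch of real training would run here ...
        state = {"epoch": epoch}
        # Zero-padded names keep lexical order equal to numeric order.
        (ckpt / f"epoch-{epoch:04d}.json").write_text(json.dumps(state))
    return start_epoch
```

`train` returns the epoch it resumed from, so a restarted job that finds earlier checkpoints continues after them instead of starting from epoch 1.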

[Official] SageMaker SLA
https://aws.amazon.com/sagemaker/sla/

Hands-on

(Game）機械学習を用いたプレイヤーの異常行動検知
http://educationhub-old-b73182c0-0b3c-11ea-97bc-252c1e0244c6.s3-website-us-east-1.amazonaws.com/

SageMaker JumpStart
https://aws.amazon.com/jp/builders-flash/202105/hajimete-sagemaker/
