
MLOps on AWS

Last updated at 2023-03-15, posted at 2020-09-04

Reference links for implementing MLOps on AWS

Case studies

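[Official case study: DENSO Corporation] Automating machine learning model development for advanced driver assistance systems with managed services, including Amazon SageMaker
Data management effort reduced to 55% and ML engineer effort reduced to 66%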
https://aws.amazon.com/jp/solutions/case-studies/denso/

[AWS Summit 2021] Machine Learning Production Line for image recognition in advanced driver assistance
https://d1.awsstatic.com/events/jp/2021/summit-online/CUS-17_AWS_Summit_Online_2021_DENSO.pdf

The technology behind Kanmu: machine learning edition
https://tech.kanmu.co.jp/entry/2021/06/11/120953
Training and inference are implemented on AWS.

[Official blog] Building a CI/CD pipeline for machine learning with managed services
https://aws.amazon.com/jp/blogs/news/brainpad-ml-cicd-pipeline/
Japan Tobacco Inc. (JT)

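[Event report] "AWS AdTech Summer Festival 2021: learning from case studies of advanced initiatives in the ad tech industry" seminar
https://aws.amazon.com/jp/blogs/news/aws-adtech-natsumatsuri-2021/
Solving programmatic advertising problems with machine learning: AutoGluon-Tabular on SageMaker as practiced by beginners
(Adways Inc.)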

[AWS Summit 2019] Building a deep learning training pipeline for autonomous driving at TRI-AD
https://pages.awscloud.com/rs/112-TZM-766/images/J3-05.pdf

How is everyone else doing it? MLOps on AWS
https://qiita.com/mariohcat/items/b7dce44dc3d386bcb5a7

[Event report & materials] AI/ML@Tokyo #11: Using AWS machine learning for business operations in manufacturing – Amazon SageMaker
https://aws.amazon.com/jp/blogs/news/aws-aiml-tokyo11/
Includes concrete architecture examples.

MLOps in general

1st MLOps Study Group Tokyo (Online)
https://mlops.connpass.com/event/184958/presentation/

Publishing our design patterns for machine learning systems
https://engineering.mercari.com/blog/entry/ml-system-design/

MLOps – Putting machine learning models to use and the challenges beyond, Part 2
https://www.datarobot.com/jp/blog/mlops-ml-model-usage-future-issues-part2/

15th MLOps Study Group (Online)
https://www.slideshare.net/MariOhbuchi/awsmlops
Slides on AWS MLOps solutions are published here.

Concrete MLOps solutions on AWS

AWS re:Invent 2020: Implementing MLOps practices with Amazon SageMaker
https://www.youtube.com/watch?v=8ZpE-9LnaJk&t=778s

[Whitepaper] Build a Secure Enterprise Machine Learning Platform on AWS
https://docs.aws.amazon.com/whitepapers/latest/build-secure-enterprise-ml-platform/build-secure-enterprise-ml-platform.html?did=wp_card&trk=wp_card

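[Event report & materials] AI/ML@Tokyo #7: MLOps on AWS
https://aws.amazon.com/jp/blogs/news/aws-aiml-tokyo7/
Defines how the requirements for a machine learning platform differ between the research/experimentation phase and the operations phase.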

Challenges facing machine learning engineers and their solutions: what emerged from more than 20 case studies discussed at ML@Loft
https://pages.awscloud.com/rs/112-TZM-766/images/F-1.pdf

Machine Learning Lens, AWS Well-Architected Framework
https://d1.awsstatic.com/whitepapers/architecture/wellarchitected-Machine-Learning-Lens.pdf

Japanese version of the above
https://d1.awsstatic.com/whitepapers/ja_JP/architecture/wellarchitected-Machine-Learning-Lens.pdf

ML Ops on AWS
https://d1.awsstatic.com/events/jp/2018/summit/tokyo/aws/44.pdf

Build An Automated Machine Learning Pipeline On AWS
https://www.cloudreach.com/en/resources/blog/machine-learning-model-aws/

[DeNA] CUS-90: Making full use of Amazon SageMaker in production
https://resources.awscloud.com/vidyard-all-players/cus-90-aws-summit-online-2020-dena

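[AWS] Building a CI/CD pipeline (CodePipeline) with AWS CloudFormation
https://qiita.com/is_ryo/items/0382d183f514e0d06f4d
When a data scientist pushes a Dockerfile to CodeCommit, the image is stored in ECR.
Data engineers prepare the data.
Data scientists push the training code and the inference code to CodeCommit. Pushing training code runs training, and pushing inference code runs inference. (To run inference, register the model and specify the model path.)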

[Report] Building a CI/CD pipeline for machine learning with SageMaker #reinvent #DVC303
https://dev.classmethod.jp/articles/reinvent-dvc303/

Building a continuous delivery pipeline for container images with Amazon ECR as the source
https://aws.amazon.com/jp/blogs/news/build-a-continuous-delivery-pipeline-for-your-container-images-with-amazon-ecr-as-source/
CodePipeline/CodeDeploy/ECR

[Official] AWS MLOps Framework (2021.1)
https://aws.amazon.com/jp/solutions/implementations/aws-mlops-framework/?nc1=h_ls

[Official blog] Architect and build the full machine learning lifecycle with AWS: An end-to-end Amazon SageMaker demo
https://aws.amazon.com/blogs/machine-learning/architect-and-build-the-full-machine-learning-lifecycle-with-amazon-sagemaker/
Lays out the end-to-end set of machine learning solutions on AWS.

MLOps for financial services

Machine learning best practices in financial services
https://aws.amazon.com/jp/blogs/machine-learning/machine-learning-best-practices-in-financial-services/

Whitepaper introduced in the blog post above (MLOps)
https://d1.awsstatic.com/whitepapers/machine-learning-in-financial-services-on-aws.pdf
[Japanese version]
https://d1.awsstatic.com/whitepapers/ja_JP/machine-learning-in-financial-services-on-aws.pdf

Whitepaper introduced in the blog post above (systems in general)
https://d1.awsstatic.com/whitepapers/architecture/wellarchitected-Financial-Services-Industry-Lens.pdf

How to run CloudFormation (CFn) and Step Functions (SFn) from CodeBuild, as described in the financial services whitepaper

How we moved our ECS CI/CD environment to CodePipeline
https://cam-inc.co.jp/p/techblog/405624087101047961

1. MLOps for supporting data scientists

Distributed processing

The parameter server approach and the Horovod approach
https://aws.amazon.com/jp/blogs/news/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/

The post notes the following further down: Sharding data: in this example we used one large TFRecord file containing the entire CIFAR-10 dataset, which is relatively small. However, larger datasets might require that you shard the data into multiple files, particularly if Pipe Mode is used (see below). Sharding may be accomplished by specifying an Amazon S3 data source as a manifest file or ShardedByS3Key.

https://docs.aws.amazon.com/ja_jp/sagemaker/latest/APIReference/API_S3DataSource.html
The ShardedByS3Key approach
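
As a minimal sketch of the ShardedByS3Key approach, the snippet below builds one input channel whose S3DataSource sets S3DataDistributionType to ShardedByS3Key. The helper name sharded_channel, the channel name, and the bucket path are hypothetical and only for illustration; the resulting dict is the kind of entry that would go into the InputDataConfig of a training job request.

```python
def sharded_channel(channel_name, s3_uri):
    """Build one training input channel that shards S3 objects across instances."""
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                # Each training instance receives a subset of the objects under
                # the prefix; "FullyReplicated" would copy everything to each one.
                "S3DataDistributionType": "ShardedByS3Key",
            }
        },
    }


# Hypothetical bucket and prefix, for illustration only.
channel = sharded_channel("train", "s3://example-bucket/cifar10/train/")
```

With sharding, the dataset needs to be split into multiple files under the prefix, which matches the note quoted above.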

2. MLOps for production operations

Building a simple demo site

Creating a simple web application with AWS Lambda + API Gateway + S3
https://www.datastadium.co.jp/engineer/e-reports/5245

Hands-on

[Official] AWS MLOps Framework (English)
https://aws.amazon.com/jp/solutions/implementations/aws-mlops-framework/?nc1=h_ls

Introductory hands-on for the Code series released! - Monthly AWS Hands-on for Beginners, August 2020 issue
https://aws.amazon.com/jp/blogs/news/aws-hands-on-for-beginners-10/

CI/CD environment setup hands-on
https://classmethod.github.io/ci-cd-hands-on-ecs/

ContainerImage:
    Description: Please enter container image for ECS.
    Type: String
    Default: 738965884675.dkr.ecr.ap-northeast-1.amazonaws.com/ci-cd-hands-on:node

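The container image referenced above in the CFn template no longer exists, so the first CFn stack creation stays stuck in the Creating state and never completes.
Even so, the walkthrough of the overall flow is worth reading.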

Article by the author of the above
https://dev.classmethod.jp/articles/understanding-codebuild/

A first step toward getting comfortable with CodeBuild
https://www.slideshare.net/ssuser8125c5/codebuild-82151957

Amazon Sagemaker MLops (with classic CI/CD tools) Workshop
https://github.com/awslabs/amazon-sagemaker-mlops-workshop

[Official] AWS CI/CD WORKSHOP
https://aws-ci-cd.workshop.aws/

[GitHub] Amazon SageMaker MLOps
https://github.com/aws-samples/mlops-amazon-sagemaker/blob/master/README.md

MLOps on edge devices

[Report] CUS-91: MLOps in Edge Deep Learning #AWSSummit
https://dev.classmethod.jp/articles/aws-summit-online-2020-session-report-cus-91/

ML project management

Managing Machine Learning Projects
https://d1.awsstatic.com/whitepapers/aws-managing-ml-projects.pdf

Designing the inference option

[YouTube] AWS Summit Brussels 2022 - Optimize Amazon SageMaker deployment strategies | AWS Events
https://www.youtube.com/watch?v=eN9-aEZhJ78&t=883s
Explains the criteria for choosing among four inference options:
・Real-time inference
・Asynchronous inference
・Batch inference
・Serverless inference

Monitoring model accuracy

SageMaker Model Monitor
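Monitors whether the data currently used for inference has the same characteristics as the (past) data used when the model was trained.
It uses Deequ to generate the data statistics.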
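
The idea of comparing training-time data with current inference data can be sketched in plain Python. This is not how Model Monitor or Deequ is implemented; it is only an illustration under simple assumptions: a single numeric feature, a baseline made of mean and standard deviation, and a hypothetical threshold max_z. The function names and sample values are made up for the example.

```python
import statistics


def baseline_stats(values):
    """Summarize the training-time distribution of one numeric feature."""
    return {"mean": statistics.fmean(values), "std": statistics.pstdev(values)}


def has_drifted(baseline, current_values, max_z=3.0):
    """Flag drift when the current mean lies more than max_z baseline
    standard deviations away from the baseline mean."""
    current_mean = statistics.fmean(current_values)
    if baseline["std"] == 0:
        return current_mean != baseline["mean"]
    z = abs(current_mean - baseline["mean"]) / baseline["std"]
    return z > max_z


# Hypothetical feature values, for illustration only.
baseline = baseline_stats([10.0, 11.0, 9.0, 10.5, 9.5])
similar = has_drifted(baseline, [10.2, 9.8, 10.1])
shifted = has_drifted(baseline, [20.0, 21.0, 19.5])
```

A real monitoring setup would track many features and richer statistics, but the baseline-versus-current comparison is the same basic pattern.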

[Official blog] Data drift detection with Amazon SageMaker Model Monitor explained
https://aws.amazon.com/jp/blogs/news/detect-data-drift-with-amazon-sagemaker-model-monitor/

GitHub
https://github.com/awslabs/deequ

Training programs

[AWS blog] Learn how to operationalize ML models with a new AWS course
https://aws.amazon.com/jp/blogs/news/learn-how-to-operationalize-ml-models-with-new-aws-course/

"
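MLOps is more challenging than DevOps for three reasons. 1) Unlike DevOps practices, work must be coordinated across four roles: software developers and operators, plus two more (data engineers and data scientists). 2) MLOps deploys both models and data in addition to code. 3) Because ML predictions are probabilistic, prediction quality must be monitored continuously to keep it aligned with business metrics.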
"

Experiment pipeline

Operations pipeline

Deployment strategies

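Understand the concept of production variants.
-> A variant is the combination of a cloud instance, which provides compute, memory, and storage, and a model container application. It is defined by three things: instance type, instance count, and model.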

A variant can also be defined with production_variant().

[Official] SageMaker Python SDK doc
https://sagemaker.readthedocs.io/en/stable/api/utility/session.html#sagemaker.session.production_variant
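
To make the three defining elements concrete, here is a minimal sketch that builds a variant definition as a plain dict, using the field names of the ProductionVariant structure in an endpoint configuration. The helper make_variant, the model name, and the instance type are hypothetical choices for illustration; the SDK's production_variant() helper serves a similar purpose, but this sketch does not call it.

```python
def make_variant(variant_name, model_name, instance_type,
                 instance_count=1, weight=1.0):
    """Describe one production variant: model + instance type + instance count."""
    return {
        "VariantName": variant_name,
        "ModelName": model_name,                 # which model to serve
        "InstanceType": instance_type,           # which instance type hosts it
        "InitialInstanceCount": instance_count,  # how many instances
        "InitialVariantWeight": weight,          # relative share of traffic
    }


# Hypothetical names, for illustration only.
variant = make_variant("VariantA", "model-a", "ml.m5.large")
```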

Canary rollout

Safely release a new model to a small fraction of users and use it for initial testing in production.
In endpoint_config, set InitialInstanceCount and InitialVariantWeight for VariantA and VariantB, then apply the config with update_endpoint().
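
A minimal sketch of such a two-variant configuration follows. The model names, instance type, and the 9:1 weights are hypothetical. Each variant's share of traffic is its weight divided by the sum of all variant weights, which the helper traffic_shares (also hypothetical) computes. No AWS call is made here; the comments only indicate where the list would be used.

```python
# Hypothetical variants: VariantA is the current model, VariantB the canary.
canary_variants = [
    {
        "VariantName": "VariantA",
        "ModelName": "model-current",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 2,
        "InitialVariantWeight": 9.0,
    },
    {
        "VariantName": "VariantB",
        "ModelName": "model-new",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1.0,
    },
]


def traffic_shares(variants):
    """Return each variant's fraction of traffic: its weight over the total weight."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


shares = traffic_shares(canary_variants)

# With boto3, this list would be passed as ProductionVariants to
# create_endpoint_config(), and update_endpoint() would then point the
# endpoint at the new config. Those calls are omitted in this sketch.
```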

Blue/Green deployment

Once the canary rollout has confirmed that the new model performs well, proceed with a blue/green deployment that shifts all traffic to the new model.
Use update_endpoint_weights_and_capacities() to change the traffic split.
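
As a sketch of the traffic shift, the helper below builds the list of desired weights that update_endpoint_weights_and_capacities() takes as its DesiredWeightsAndCapacities argument. The helper name shift_payload and the variant names are assumptions for illustration, and the AWS call itself is left as a comment.

```python
def shift_payload(old_variant, new_variant, new_share):
    """Build desired weights that send new_share of traffic to the new variant."""
    if not 0.0 <= new_share <= 1.0:
        raise ValueError("new_share must be between 0 and 1")
    return [
        {"VariantName": old_variant, "DesiredWeight": 1.0 - new_share},
        {"VariantName": new_variant, "DesiredWeight": new_share},
    ]


# Move all traffic from VariantA (old) to VariantB (new).
payload = shift_payload("VariantA", "VariantB", 1.0)

# With boto3, the payload would be used roughly like this (not executed here):
# client.update_endpoint_weights_and_capacities(
#     EndpointName="my-endpoint",  # hypothetical endpoint name
#     DesiredWeightsAndCapacities=payload,
# )
```

Passing intermediate values of new_share would allow the shift to be made in steps rather than all at once.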

[GitHub] Amazon SageMaker Safe Deployment Pipeline
https://github.com/aws-samples/amazon-sagemaker-safe-deployment-pipeline

A/B testing

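A/B testing looks similar to a canary rollout, but it focuses on collecting data about the different variants of a model. It therefore targets a larger user group than a canary rollout, gathers more traffic, and runs over a longer period.

A canary rollout focuses on reducing risk and upgrading smoothly.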

[GitHub] Amazon SageMaker A/B Testing Pipeline
https://github.com/aws-samples/amazon-sagemaker-ab-testing-pipeline

[Official blog] A/B testing ML models in production using Amazon SageMaker
https://aws.amazon.com/jp/blogs/news/a-b-testing-ml-models-in-production-using-amazon-sagemaker/
Includes an Amazon Alexa use case for SageMaker.

[YouTube] AWS Summit ANZ 2021 - A/B testing machine learning models with Amazon SageMaker MLOps
https://www.youtube.com/watch?v=5WysxAYDH1k

Code used in the demo
https://github.com/aws-samples/amazon-sagemaker-ab-testing-pipeline/blob/main/notebook/mab-reviews-helpfulness.ipynb

[YouTube] Amazon re:MARS 2022 - A/B testing with real-world data to evaluate ML model updates (MLR207)
https://www.youtube.com/watch?v=sk2DcwGBRO4
No code is provided.

[Official doc] SageMaker example
https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_endpoints/a_b_testing/a_b_testing.html

Multi-armed bandit testing (MAB testing)

A/B testing is a static test, and it can take weeks to months before a result can be judged statistically significant.
MAB is a more dynamic way to test different model variants.
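
To illustrate what "statistically significant" can mean for an A/B test, the sketch below runs a two-proportion z-test on success counts from two variants. The choice of this particular test, the function name, and the sample counts are assumptions for illustration; none of the linked resources is claimed to use it.

```python
import math


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (z, two-sided p-value) for the difference in success rates B - A.

    Assumes the pooled success rate is strictly between 0 and 1.
    """
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: variant A converts 100 of 1000, variant B 130 of 1000.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
significant = p_value < 0.05
```

The smaller the true difference between variants, the more traffic is needed before the p-value drops below the chosen threshold, which is why such tests can run for a long time.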

MAB exploration strategies include the following:
・Epsilon greedy
・Thompson sampling
・Bagging
・Online cover
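
The first two strategies in the list can be sketched in plain Python as follows. This is a toy simulation under stated assumptions: each variant is a Bernoulli arm with a hypothetical fixed success rate, and the function names are made up for the example. Bagging and online cover are not shown, and this is not the code used by any of the libraries or repositories linked in this article.

```python
import random


def epsilon_greedy(successes, trials, epsilon, rng):
    """Explore a random variant with probability epsilon, else pick the best observed rate."""
    if rng.random() < epsilon:
        return rng.randrange(len(trials))
    rates = [s / t if t else 0.0 for s, t in zip(successes, trials)]
    return max(range(len(rates)), key=rates.__getitem__)


def thompson_sampling(successes, trials, rng):
    """Draw each variant's rate from a Beta(successes + 1, failures + 1) posterior
    and pick the variant with the largest draw."""
    draws = [rng.betavariate(s + 1, t - s + 1) for s, t in zip(successes, trials)]
    return max(range(len(draws)), key=draws.__getitem__)


def simulate(choose, true_rates, rounds, seed=0):
    """Route `rounds` requests with the given strategy; return requests per variant."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    trials = [0] * len(true_rates)
    for _ in range(rounds):
        arm = choose(successes, trials, rng)
        trials[arm] += 1
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
    return trials


true_rates = [0.05, 0.15]  # hypothetical success rates of two model variants
eg_trials = simulate(lambda s, t, rng: epsilon_greedy(s, t, 0.1, rng), true_rates, 5000)
ts_trials = simulate(thompson_sampling, true_rates, 5000)
```

Unlike a fixed A/B split, both strategies shift traffic toward the variant that appears to perform better while the test is still running.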

SageMaker natively supports popular RL libraries such as Vowpal Wabbit, Ray, Coach, and Unity.

References

MLOps: Continuous delivery and automation pipelines in machine learning
https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning?hl=ja

MLSE: knowledge-sharing session on case studies of putting ML platforms into production and operating them
https://www.youtube.com/watch?v=nNFCc3nowfg
