
Training on AWS SageMaker with a Docker Container

Posted at 2019-01-20

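What this article covers

Putting the training data in S3
Pushing a Docker image you built yourself to AWS ECR
Starting a training job through the SageMaker training API

The following are not covered here, so please refer to other articles for them.

Creating the IAM role used to access SageMaker, ECR, and so on
The detailed procedure for pushing a Docker image to AWS ECR

Intended readers

People who want to get AWS SageMaker running for the first time
People who have basic knowledge of Docker

Language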

Python 3.6.3

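Files and directory layout

For the files and directories under sagemaker/, a sample is provided in the official git repository (awslabs/amazon-sagemaker-examples), so git clone it and use that as the starting point.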

tree

.
├── sagemaker_client.py
└── sagemaker
    ├── Dockerfile
    └── program
        ├── nginx.conf
        ├── predictor.py
        ├── serve
        ├── train
        └── wsgi.py

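By writing the machine learning algorithm you want to use in the sagemaker/program/train file, you can use algorithms other than SageMaker's built-in ones.
Since this article only does training, nginx.conf, predictor.py, serve, and wsgi.py are not explained.

Contents of the `train` file for using a custom algorithm

The goal here is simply to get SageMaker running, so the Decision Tree algorithm is used, as in the official sample. The contents of the train file are as follows.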

train

#!/usr/bin/env python

# A sample training component that trains a simple scikit-learn decision tree model.
# This implementation works in File mode and makes no assumptions about the input file names.
# Input is specified as CSV with a data point in each row and the labels in the last column.

from __future__ import print_function

import os
import json
import pickle
import sys
import traceback

import pandas as pd

from sklearn import tree

# These are the paths to where SageMaker mounts interesting things in your container.

prefix = '/opt/ml/'

input_path = prefix + 'input/data'
output_path = os.path.join(prefix, 'output')
model_path = os.path.join(prefix, 'model')
param_path = os.path.join(prefix, 'input/config/hyperparameters.json')

# This algorithm has a single channel of input data called 'training'. Since we run in
# File mode, the input files are copied to the directory specified here.
channel_name='training'
training_path = os.path.join(input_path, channel_name)

# The function to execute the training.
def train():
    print('Starting the training.')
    try:
        # Read in any hyperparameters that the user passed with the training job
        with open(param_path, 'r') as tc:
            trainingParams = json.load(tc)

        # Take the set of files and read them all into a single pandas dataframe
        input_files = [ os.path.join(training_path, file) for file in os.listdir(training_path) ]
        if len(input_files) == 0:
            raise ValueError(('There are no files in {}.\n' +
                              'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                              'the data specification in S3 was incorrectly specified or the role specified\n' +
                              'does not have permission to access the data.').format(training_path, channel_name))
        raw_data = [ pd.read_csv(file, header=None) for file in input_files ]
        train_data = pd.concat(raw_data)

        # labels are in the last column
        train_y = train_data.iloc[:,-1]
        train_X = train_data.iloc[:,:-1]

        # Here we only support a single hyperparameter. Note that hyperparameters are always passed in as
        # strings, so we need to do any necessary conversions.
        max_leaf_nodes = trainingParams.get('max_leaf_nodes', None)
        if max_leaf_nodes is not None:
            max_leaf_nodes = int(max_leaf_nodes)

        # Now use scikit-learn's decision tree classifier to train the model.
        clf = tree.DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)
        clf = clf.fit(train_X, train_y)

        # save the model
        with open(os.path.join(model_path, 'decision-tree-model.pkl'), mode='wb') as out:
            pickle.dump(clf, out)
        print('Training complete.')
    except Exception as e:
        # Write out an error file. This will be returned as the failureReason in the
        # DescribeTrainingJob result.
        trc = traceback.format_exc()
        with open(os.path.join(output_path, 'failure'), 'w') as s:
            s.write('Exception during training: ' + str(e) + '\n' + trc)
        # Printing this causes the exception to be in the training job logs, as well.
        print('Exception during training: ' + str(e) + '\n' + trc, file=sys.stderr)
        # A non-zero exit code causes the training job to be marked as Failed.
        sys.exit(255)

if __name__ == '__main__':
    train()

    # A zero exit code causes the job to be marked a Succeeded.
    sys.exit(0)
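
The paths at the top of the train file (input/data/training, input/config/hyperparameters.json, model, output) can be reproduced on a local machine to try the script before submitting a job. The following is a minimal sketch under that assumption; `make_local_ml_dir` and the file name sample.csv are hypothetical names introduced here, not part of the official sample.

```python
import json
import os
import tempfile


def make_local_ml_dir(root, hyperparameters, csv_text, channel="training"):
    """Create, under `root`, the layout the train file expects under /opt/ml."""
    data_dir = os.path.join(root, "input", "data", channel)
    config_dir = os.path.join(root, "input", "config")
    model_dir = os.path.join(root, "model")
    output_dir = os.path.join(root, "output")
    for path in (data_dir, config_dir, model_dir, output_dir):
        os.makedirs(path, exist_ok=True)

    # Hyperparameters reach the container as strings, so mimic that here.
    with open(os.path.join(config_dir, "hyperparameters.json"), "w") as f:
        json.dump({key: str(value) for key, value in hyperparameters.items()}, f)

    # One CSV file in the channel directory (hypothetical file name).
    with open(os.path.join(data_dir, "sample.csv"), "w") as f:
        f.write(csv_text)
    return data_dir


root = tempfile.mkdtemp()
make_local_ml_dir(root, {"max_leaf_nodes": 5}, "5.1,3.5,1.4,0.2,setosa\n")
print(sorted(os.listdir(root)))
```

Mounting such a directory at /opt/ml when running the container locally should let the train file read it unchanged, although that step depends on your Docker setup and is not shown here.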

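There are two differences from the official train file.

The first is how the labels and the input data are separated. Because iris.csv is used here, the last column of the dataset is taken as the label and all remaining columns as the input data. (Note that the official sample expects the label in the first column.)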

train

# labels are in the last column
train_y = train_data.iloc[:,-1]
train_X = train_data.iloc[:,:-1]
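
To make the difference between the two layouts concrete, here is a small sketch of both splits on two made-up rows. The rows and the helper names `split_label_last` and `split_label_first` are illustrative assumptions, not taken from the article's iris.csv.

```python
import io

import pandas as pd

# Two made-up rows in the layout assumed here: features first, label last.
SAMPLE_CSV = "5.1,3.5,1.4,0.2,setosa\n6.2,2.9,4.3,1.3,versicolor\n"


def split_label_last(frame):
    """Return (features, labels) when the label is the last column."""
    return frame.iloc[:, :-1], frame.iloc[:, -1]


def split_label_first(frame):
    """Return (features, labels) when the label is the first column."""
    return frame.iloc[:, 1:], frame.iloc[:, 0]


data = pd.read_csv(io.StringIO(SAMPLE_CSV), header=None)
features, labels = split_label_last(data)
print(features.shape)
print(list(labels))
```

Applying `split_label_first` to the same data would treat the first measurement as the label, which is why the split has to match the file layout.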

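The second is that the model file is opened in binary mode. The official sample uses mode="w", but here it is mode="wb", because in Python 3 pickle.dump writes bytes and fails on a file opened in text mode.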

train

# save the model
with open(os.path.join(model_path, 'decision-tree-model.pkl'), mode='wb') as out:
  pickle.dump(clf, out)
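
The effect of the file mode can be checked without SageMaker. The sketch below pickles a stand-in dictionary (not a real model) with both modes; `try_dump` is a hypothetical helper written for this illustration.

```python
import os
import pickle
import tempfile


def try_dump(obj, path, mode):
    """Pickle `obj` to `path` using `mode`; return the error type name, or None on success."""
    try:
        with open(path, mode) as out:
            pickle.dump(obj, out)
    except TypeError as err:
        return type(err).__name__
    return None


model_stub = {"max_leaf_nodes": 5}  # stand-in for the trained classifier
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "decision-tree-model.pkl")
    text_result = try_dump(model_stub, path, "w")     # text mode
    binary_result = try_dump(model_stub, path, "wb")  # binary mode
    with open(path, "rb") as src:
        restored = pickle.load(src)

print(text_result, binary_result, restored == model_stub)
```

Under Python 3 the text-mode attempt raises a TypeError, while the binary-mode attempt succeeds and the object can be read back with mode "rb".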

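Putting the training data in S3

As shown in the figure above and described on the official page, the input data for training is placed in a specific S3 bucket. (As described later, the S3 path of the input data is passed as a parameter when the job is submitted to SageMaker.)
iris.csv is used here, so download it locally, create a bucket with a name of your choice, and put the file there.

Also prepare a separate folder as the output destination for the trained model. In the figure below, this corresponds to output-data.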
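
If you prefer to upload the file from Python instead of the console, something like the following may work. It is a sketch that assumes a boto3 S3 client and reuses the bucket name hoge and the key input-data/iris.csv that appear later in this article; `upload_training_data` and `s3_uri` are hypothetical helpers.

```python
def s3_uri(bucket, key):
    """Build the s3:// URI that is later passed to SageMaker as S3Uri."""
    return "s3://{}/{}".format(bucket, key)


def upload_training_data(s3_client, local_path, bucket, key):
    """Upload one local file with the client's upload_file method and return its S3 URI."""
    s3_client.upload_file(local_path, bucket, key)
    return s3_uri(bucket, key)


# Usage with real credentials (not executed here; the profile name is an assumption):
#   from boto3.session import Session
#   s3 = Session(profile_name="hoge").client("s3")
#   print(upload_training_data(s3, "iris.csv", "hoge", "input-data/iris.csv"))
```

Passing the client in as an argument keeps the helper easy to exercise with a stub object before touching a real bucket.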

Pushing a Docker image you built yourself to AWS ECR

Using sagemaker/Dockerfile, build a Docker image that contains the files under program/.

tree

.
├── sagemaker_client.py
└── sagemaker
    ├── Dockerfile
    └── program
        ├── nginx.conf
        ├── predictor.py
        ├── serve
        ├── train
        └── wsgi.py

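This is also almost the same as the official sample, but there is one difference.

When training, SageMaker runs docker run image train, so the following command was added to give the train file execute permission.
RUN chmod +x /opt/program/train

When a job was submitted to SageMaker without this command, the following error occurred. If you see the same error, give the train file execute permission.

Error: exec: "train": executable file not found in $PATH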
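
Before building the image, the execute bit can also be checked on the local copy of the file. The sketch below shows one way to do that on a Unix-like system; `is_executable` and `make_executable` are hypothetical helpers, and the demo uses a throwaway file rather than the real sagemaker/program/train.

```python
import os
import stat
import tempfile


def is_executable(path):
    """Return True if `path` is a regular file the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


def make_executable(path):
    """Add the execute bit for user, group, and others (like chmod +x)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


# Demo on a throwaway file standing in for the train script.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "train")
with open(demo_path, "w") as f:
    f.write("#!/usr/bin/env python\n")
os.chmod(demo_path, 0o644)

before = is_executable(demo_path)
make_executable(demo_path)
after = is_executable(demo_path)
print(before, after)
```

Keeping the RUN chmod line in the Dockerfile is still the more robust choice, since it does not depend on the permissions of whoever builds the image.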

Dockerfile

# reference:
# https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/container/Dockerfile

FROM python:3.6

RUN apt-get -y update && apt-get install -y --no-install-recommends \
         wget \
         nginx \
         ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Here we get all python packages.
# There's substantial overlap between scipy and numpy that we eliminate by
# linking them together. Likewise, pip leaves the install caches populated which uses
# a significant amount of space. These optimizations save a fair amount of space in the
# image, which reduces start up time.
# The PyPI package name is scikit-learn; the sklearn alias is deprecated.
RUN pip3 install pandas flask gevent gunicorn scikit-learn

# Set some environment variables. PYTHONUNBUFFERED keeps Python from buffering our standard
# output stream, which means that logs can be delivered to the user quickly. PYTHONDONTWRITEBYTECODE
# keeps Python from writing the .pyc files which are unnecessary in this case. We also update
# PATH so that the train and serve programs are found when the container is invoked.

ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"


COPY program /opt/program

# Give the train file execute permission.
RUN chmod +x /opt/program/train

WORKDIR /opt/program

After that, build the image with this Dockerfile and push it to AWS ECR.

Starting a training job through the SageMaker training API

Here, a file named sagemaker_client.py is created and used to send the training job to SageMaker.

tree

.
├── sagemaker_client.py
└── sagemaker
    ├── Dockerfile
    └── program
        ├── nginx.conf
        ├── predictor.py
        ├── serve
        ├── train
        └── wsgi.py

As shown below, to send a training job to SageMaker you build parameters such as the Docker image to use and the S3 path of the input data, and pass them as arguments to create_training_job.

sagemaker_client.py

from boto3.session import Session


class SagemakerClient:

    def __init__(self):
        self.client = Session(profile_name="hoge").client("sagemaker", region_name="ap-northeast-1")

    def submit_training_job(self):
        training_params = {
            "TrainingJobName": "sample-training",  # Name of the training job
            # Hyperparameters for training, passed to the container as strings.
            # Note: the train file above only reads max_leaf_nodes.
            "HyperParameters": {
                'objective': 'multiclass',
                'num_class': '3'
            },
            "AlgorithmSpecification": {
                # Docker image used for training
                'TrainingImage': "123.dkr.ecr.ap-northeast-1.amazonaws.com/sagemaker-repo:latest",
                'TrainingInputMode': 'File'
            },
            "RoleArn": "arn:aws:iam::123:role/dev-sagemaker",  # Role attached to SageMaker
            "InputDataConfig": [
                {
                    'ChannelName': 'training',
                    'DataSource': {
                        'S3DataSource': {
                            'S3DataType': 'S3Prefix',  # Specify this when using files on S3
                            'S3Uri': "s3://hoge/input-data/iris.csv"  # S3 path of the training data
                        }
                    }
                }
            ],
            "OutputDataConfig": {
                'S3OutputPath': "s3://hoge/output-data/"  # S3 path where the trained model is written
            },
            "ResourceConfig": {
                'InstanceType': 'ml.m4.xlarge',  # Instance type
                'InstanceCount': 1,  # Number of training instances
                'VolumeSizeInGB': 10  # Volume size of the training instance
            },
            "StoppingCondition": {
                'MaxRuntimeInSeconds': 60 * 60
            }
        }

        response = self.client.create_training_job(**training_params)
        print(response)


if __name__ == '__main__':
    SagemakerClient().submit_training_job()
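
One practical point when re-running the script: a fixed TrainingJobName such as "sample-training" can only be used once per account and region, so a second submission with the same name is rejected. A common workaround is to append a timestamp. The sketch below is one way to do that; `unique_job_name` is a hypothetical helper, and the name rule it checks (letters, digits, and hyphens, at most 63 characters) is my reading of the API constraint rather than something stated in this article.

```python
import re
import time

# Assumed naming rule: alphanumerics separated by hyphens, at most 63 characters.
_JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")


def unique_job_name(base, now=None):
    """Append a UTC timestamp to `base` so repeated submissions get distinct names."""
    stamp = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime(now))
    name = "{}-{}".format(base, stamp)
    if len(name) > 63 or not _JOB_NAME_PATTERN.fullmatch(name):
        raise ValueError("invalid training job name: " + name)
    return name


print(unique_job_name("sample-training"))
```

The result could be used as the value of "TrainingJobName" in training_params.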

After running sagemaker_client.py, check the AWS SageMaker console.
As shown below, the job is done when its status becomes Completed. If the training ends with the status Failed, check the CloudWatch logs and fix the problem.
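
The same status can also be read from code with describe_training_job instead of the console. The following is a minimal polling sketch, assuming a boto3 SageMaker client like the one created in sagemaker_client.py; `wait_for_training_job` and its parameters are hypothetical names.

```python
import time

# Statuses after which the job will not change any more.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(client, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll describe_training_job until the job ends; return (status, failure_reason)."""
    while True:
        description = client.describe_training_job(TrainingJobName=job_name)
        status = description["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            # FailureReason is only present when the job failed.
            return status, description.get("FailureReason")
        sleep(poll_seconds)


# Usage with real credentials (not executed here):
#   status, reason = wait_for_training_job(SagemakerClient().client, "sample-training")
#   print(status, reason)
```

When the status is Failed, the returned reason contains what the train file wrote to /opt/ml/output/failure, which is often a quicker first look than the CloudWatch logs.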

Once the SageMaker training job has completed, check S3 to confirm that the model was output.
As shown below, if the model is output as model.tar.gz under (the folder prepared in advance) > (training job name) > output, you are done.
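
The location described above can be written as a small helper, and the downloaded archive can be opened to get the pickled model back. This is a sketch under two assumptions: that the archive stores the file under the name decision-tree-model.pkl used in the train file, and that the archive is already on local disk. `model_artifact_uri` and `load_model` are hypothetical helpers, and the demo builds its own stand-in archive.

```python
import os
import pickle
import tarfile
import tempfile


def model_artifact_uri(output_path, job_name):
    """Build <S3OutputPath>/<job name>/output/model.tar.gz."""
    return "{}/{}/output/model.tar.gz".format(output_path.rstrip("/"), job_name)


def load_model(tar_path, member="decision-tree-model.pkl"):
    """Unpickle `member` from a local model.tar.gz. Only use archives you trust."""
    with tarfile.open(tar_path, "r:gz") as archive:
        return pickle.load(archive.extractfile(member))


print(model_artifact_uri("s3://hoge/output-data/", "sample-training"))

# Demo with a stand-in archive instead of a real SageMaker output.
with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "decision-tree-model.pkl")
    with open(pkl_path, "wb") as out:
        pickle.dump({"stub": True}, out)
    tar_path = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(tar_path, "w:gz") as archive:
        archive.add(pkl_path, arcname="decision-tree-model.pkl")
    restored_model = load_model(tar_path)
```

Loading a real model this way also requires scikit-learn to be installed, ideally in the same version that was used in the container.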

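Conclusion

The goal this time was simply to get SageMaker running with a custom algorithm, so the official sample was reused almost as is.
Having actually run it, my impression is that, as another article also describes, it would be better to create a folder under sagemaker/program/, write the algorithm inside that folder, and have the sagemaker/program/train file just call into it. I would also like to load the job parameters from a file rather than hard-coding them.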
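
As one possible shape for that last idea, the job parameters could live in a JSON file and be loaded before calling create_training_job. The sketch below assumes such a file; `load_training_params`, the file name training_params.json, and the key check (which simply mirrors the keys used in sagemaker_client.py, not an authoritative list of required fields) are all hypothetical.

```python
import json
import os
import tempfile

# Keys that sagemaker_client.py in this article sets; used only as a sanity check.
EXPECTED_KEYS = (
    "TrainingJobName", "HyperParameters", "AlgorithmSpecification", "RoleArn",
    "InputDataConfig", "OutputDataConfig", "ResourceConfig", "StoppingCondition",
)


def load_training_params(path):
    """Read job parameters from a JSON file and check that the expected keys exist."""
    with open(path) as f:
        params = json.load(f)
    missing = [key for key in EXPECTED_KEYS if key not in params]
    if missing:
        raise KeyError("missing keys: " + ", ".join(missing))
    return params


# Demo: write a parameter file with placeholder values and read it back.
demo_params = {key: {} for key in EXPECTED_KEYS}
demo_params["TrainingJobName"] = "sample-training"
params_path = os.path.join(tempfile.mkdtemp(), "training_params.json")
with open(params_path, "w") as f:
    json.dump(demo_params, f)
loaded_params = load_training_params(params_path)
print(loaded_params["TrainingJobName"])
```

The loaded dictionary could then be passed as create_training_job(**loaded_params), keeping account-specific values out of the source code.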
