
Getting Started with Amazon SageMaker Training from an Engineer's Perspective, Part 2: Questions an Engineer Has After Reading Part 1

Last updated at 2022-04-04, posted at 2022-02-16

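Article list

Picking up what the previous article left out

In the previous article I offered my personal take: "Amazon SageMaker Training is a batch processing service that runs (1) the code you prepared, on (2) the data you prepared, in (3) the environment you prepared, and (4) automatically saves the results."

That article was a tutorial focused on how to run your own code, use your own data, and have the results saved automatically, with the single goal of getting readers to run SageMaker Training through the SageMaker SDK.

I tried to answer the questions I hear most often, but I could not fit in every question I had when I was first learning, so this article picks up the rest.

All the code used here is on GitHub, so clone it and try it yourself.

Execution environment

How much storage is available?

You can set it with the volume_size argument when defining the Estimator. The default is 30 GB. That alone does not tell us where the volume is mounted, so let's check.

Check code and execution

Excerpt from ./src/2-1/check.py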

import subprocess
def exec_cmd(cmd):
    res = subprocess.run(cmd.split(' '),stdout=subprocess.PIPE)
    print(f'result : {cmd}')
    print(res.stdout.decode('utf-8'))
exec_cmd('df -h')
exec_cmd('mount -l')

./2_QA.ipynb

import sagemaker
from sagemaker.tensorflow import TensorFlow
estimator = TensorFlow(
    entry_point='check.py',
    source_dir = './src/2-1',
    py_version='py38', 
    framework_version='2.6.0',
    instance_count=1,
    instance_type='ml.m5.xlarge',
    role=sagemaker.get_execution_role(),
    hyperparameters={ # these hyperparameters are dummies
        'first-num':5,
        'second-num':2,
        'operator':'m'
    },
    volume_size=50, # request 50 GB
)
estimator.fit()

Excerpt of the output

result : df -h
Filesystem                                                                                        Size  Used Avail Use% Mounted on
/dev/mapper/docker-259:5-135494-42513eb0bbc3c7941b6cf364d9f5240c677c0a04ba106cd2b24015859e9a2f57   26G  3.7G   21G  15% /
tmpfs                                                                                              64M     0   64M   0% /dev
tmpfs                                                                                             7.7G     0  7.7G   0% /sys/fs/cgroup
/dev/mapper/xvdf_crypt                                                                             49G   53M   47G   1% /tmp
/dev/nvme0n1p1                                                                                     40G  8.9G   31G  23% /etc/hosts
shm                                                                                               7.3G     0  7.3G   0% /dev/shm
tmpfs                                                                                             7.7G     0  7.7G   0% /proc/acpi
tmpfs                                                                                             7.7G     0  7.7G   0% /sys/firmware
result : mount -l
/dev/mapper/docker-259:5-135494-42513eb0bbc3c7941b6cf364d9f5240c677c0a04ba106cd2b24015859e9a2f57 on / type ext4 (rw,relatime,stripe=128,data=ordered)
(omitted)
/dev/mapper/xvdf_crypt on /tmp type ext4 (rw,relatime,data=ordered)
(omitted)
/dev/mapper/xvdf_crypt on /opt/ml/model type ext4 (rw,relatime,data=ordered)
/dev/mapper/xvdf_crypt on /opt/ml/output type ext4 (rw,relatime,data=ordered)
/dev/mapper/xvdf_crypt on /opt/ml/input type ext4 (rw,relatime,data=ordered)
/dev/mapper/xvdf_crypt on /opt/ml/errors type ext4 (rw,relatime,data=ordered)
/dev/mapper/xvdf_crypt on /opt/ml/output/tensors type ext4 (rw,relatime,data=ordered)
/dev/mapper/xvdf_crypt on /opt/ml/input/data type ext4 (rw,relatime,data=ordered)
/dev/mapper/xvdf_crypt on /opt/ml/output/data type ext4 (rw,relatime,data=ordered)
/dev/mapper/xvdf_crypt on /opt/ml/input/config type ext4 (rw,relatime,data=ordered)
/dev/mapper/xvdf_crypt on /opt/ml/output/metrics/sagemaker type ext4 (rw,relatime,data=ordered)
/dev/mapper/xvdf_crypt on /opt/ml/output/profiler/framework type ext4 (rw,relatime,data=ordered)
(omitted)

We can see that about 50 GB is mounted from the device /dev/mapper/xvdf_crypt. In the df -h output it looks as if it is mounted only on /tmp, but the mount -l output shows that each directory under /opt/ml/ also uses /dev/mapper/xvdf_crypt. Set volume_size to cover the total of input data (training data) + output data (model) + intermediate files, and write intermediate files under one of the /opt/ml/ directories or under /tmp.
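To make the sizing advice above easier to verify from inside a job, here is a minimal sketch that reports the free space of the filesystem behind a directory using only the standard library. The helper name free_gib and the list of paths are illustrative assumptions and are not part of the original check.py.

```python
import os
import shutil


def free_gib(path):
    """Return the free space, in GiB, of the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free / (1024 ** 3)


# Inside a training job these directories share the training volume;
# directories that do not exist (for example on a laptop) are skipped.
for path in ("/opt/ml/input", "/opt/ml/model", "/tmp"):
    if os.path.isdir(path):
        print(f"{path}: {free_gib(path):.1f} GiB free")
```

Calling this at the start of the training script gives an early warning when volume_size is too small for the intermediate files you plan to write.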

volume_size doesn't seem to take effect?

Let's run a job like this.

./2_QA.ipynb

from sagemaker.tensorflow import TensorFlow
estimator = TensorFlow(
    entry_point='check.py',
    source_dir = './src/2-1',
    py_version='py38', 
    framework_version='2.6.0',
    instance_count=1,
    instance_type='ml.g4dn.xlarge',
    role=sagemaker.get_execution_role(),
    hyperparameters={
        'first-num':5,
        'second-num':2,
        'operator':'m'
    },
    volume_size=50, # request 50 GB
)
estimator.fit()

Output

(omitted)
Filesystem                                                                                        Size  Used Avail Use% Mounted on
/dev/mapper/docker-259:5-135494-00dff3abf47479beb12d7eeb686ed039e573080b212f9bbbed84c3b9947652e6   26G  9.5G   15G  39% /
tmpfs                                                                                              64M     0   64M   0% /dev
tmpfs                                                                                             7.7G     0  7.7G   0% /sys/fs/cgroup
/dev/nvme3n1                                                                                      115G   61M  109G   1% /tmp
/dev/nvme0n1p1                                                                                     40G  8.9G   31G  23% /etc/hosts
shm                                                                                               7.3G     0  7.3G   0% /dev/shm
tmpfs                                                                                             7.7G   12K  7.7G   1% /proc/driver/nvidia
tmpfs                                                                                             7.7G  4.0K  7.7G   1% /etc/nvidia/nvidia-application-profiles-rc.d
devtmpfs                                                                                          7.7G  144K  7.7G   1% /dev/nvidia0
tmpfs                                                                                             7.7G     0  7.7G   0% /proc/acpi
tmpfs                                                                                             7.7G     0  7.7G   0% /sys/firmware
(omitted)
result : mount -l
(omitted)
/dev/nvme3n1 on /tmp type ext4 (rw,relatime,data=ordered)
(omitted)
/dev/nvme3n1 on /opt/ml/errors type ext4 (rw,relatime,data=ordered)
/dev/nvme3n1 on /opt/ml/model type ext4 (rw,relatime,data=ordered)
/dev/nvme3n1 on /opt/ml/input type ext4 (rw,relatime,data=ordered)
/dev/nvme3n1 on /opt/ml/output type ext4 (rw,relatime,data=ordered)
/dev/nvme3n1 on /opt/ml/input/data type ext4 (rw,relatime,data=ordered)
/dev/nvme3n1 on /opt/ml/output/tensors type ext4 (rw,relatime,data=ordered)
/dev/nvme3n1 on /opt/ml/output/data type ext4 (rw,relatime,data=ordered)
/dev/nvme3n1 on /opt/ml/input/config type ext4 (rw,relatime,data=ordered)
/dev/nvme3n1 on /opt/ml/output/metrics/sagemaker type ext4 (rw,relatime,data=ordered)
/dev/nvme3n1 on /opt/ml/output/profiler/framework type ext4 (rw,relatime,data=ordered)

Hmm!? volume_size=50 has no effect! More than 100 GB is mounted on /tmp? Even the device name is different!? You may well think so, but this is by design.
Here is a quote from the ResourceConfig documentation.

Certain Nitro-based instances include local storage with a fixed total size, dependent on the instance type. When using these instances for training, Amazon SageMaker mounts the local instance storage instead of Amazon EBS gp2 storage. You can't request a VolumeSizeInGB greater than the total size of the local instance storage.

For a list of instance types that support local instance storage, including the total size per instance type, see Instance Store Volumes.

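Decoding this with a translation app (my TOEIC score is 390), it says that certain Nitro-based instances come with local instance storage, and on those instances the local instance storage is mounted instead of EBS. The ml.g4dn.xlarge specified here is one of those instance types, and its instance storage is 125 GB, so the job ended up with that size instead.
As you can probably guess from this result, on such instances a job can use more than the size given in volume_size (50 GB here), as long as it stays within the capacity of the instance storage.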
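The rule in the quoted documentation can be written down as a tiny function. This is only a sketch of the rule as I read it: effective_storage_gb and its instance_store_gb argument are hypothetical names, and the instance storage size has to be looked up by you in the Instance Store Volumes list.

```python
from typing import Optional


def effective_storage_gb(volume_size_gb: int, instance_store_gb: Optional[int] = None) -> int:
    """Return the storage size (GB) a job sees, following the quoted rule.

    instance_store_gb is None for instance types without local instance storage.
    """
    if instance_store_gb is None:
        # EBS-backed instance type: volume_size is what gets mounted.
        return volume_size_gb
    if volume_size_gb > instance_store_gb:
        # The documentation says such a request is not allowed.
        raise ValueError("volume_size exceeds the local instance storage")
    # Local instance storage is mounted at its fixed size instead of EBS.
    return instance_store_gb


print(effective_storage_gb(50))       # the ml.m5.xlarge case in this article -> 50
print(effective_storage_gb(50, 125))  # the ml.g4dn.xlarge case in this article -> 125
```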

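What is the directory layout of the training instance?

As we saw in the storage section, you basically work under /opt/ml/. The directories under /opt/ml/ are listed below.
The layout is the same as that of the SageMaker Training Toolkit.
(The SageMaker Training Toolkit provides a way for a container image you bring yourself to do the same things as the container images managed by AWS.)
Anything not written in the documentation above was picked up from the behavior I observed when running jobs, so please note that it is not guaranteed to be correct and may change in the future.

Directory layout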

/opt/ml
├── input/ # holds configuration and training data
│   ├── config/ # holds configuration
│   │   ├── debughookconfig.json # settings used when SageMaker Debugger is enabled
│   │   ├── hyperparameters.json # hyperparameters passed to the Estimator, plus values handled internally as hyperparameters
│   │   ├── init-config.json # settings for the initial behavior of the training instance
│   │   ├── inputdataconfig.json # settings for the data brought into the training instance
│   │   ├── metric-definition-regex.json # regular expressions used to capture metrics
│   │   ├── profilerconfig.json
│   │   ├── resourceconfig.json # host name of this instance, host names of the other instances when instance_count is 2 or more,
│   │   │                       # and the name of the network interface used by the instance
│   │   ├── trainingjobconfig.json # settings the training job was launched with
│   │   └── upstreamoutputdataconfig.json
│   └── data/ # holds the training data
│       └── <channel_name>/ # one directory per configured channel name; the default when none is set is training
│           └── <input data> # the files themselves
├── model/ # directory intended for the trained machine learning model;
│          # when training completes it is packed into a tar.gz and transferred to S3
├── code/ # the whole directory given as the Estimator's source_dir argument is extracted here.
│         # In practice source_dir is packed into a tar.gz, transferred to S3, then downloaded and extracted
├── output/
│   ├── success # empty file written when the job succeeds
│   ├── failure # file with the failure reason, written when the job fails; viewable in the job details after the job ends
│   ├── data/ # holds other outputs; packed into output.tar.gz and placed on S3 when the job completes,
│   │         # under the same S3 prefix as model.tar.gz
│   └── intermediate/ # files placed here while the job is running are synced to S3 right away;
│                     # requires the destination S3 URI under the 'sagemaker_s3_output' key in hyperparameters
└── errors
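If you want to look at the files under /opt/ml/input/config yourself, a sketch like the following dumps them from inside the training script. The helper name load_job_config is hypothetical, and the code assumes nothing about the keys inside each file; files that are not valid JSON are stored as None.

```python
import json
from pathlib import Path


def load_job_config(config_dir="/opt/ml/input/config"):
    """Load every *.json file in config_dir into a dict keyed by file stem."""
    config_path = Path(config_dir)
    if not config_path.is_dir():
        return {}  # not running inside a training job
    configs = {}
    for path in sorted(config_path.glob("*.json")):
        try:
            with path.open() as f:
                configs[path.stem] = json.load(f)
        except json.JSONDecodeError:
            configs[path.stem] = None  # empty or non-JSON file
    return configs


for name, content in load_job_config().items():
    print(name, json.dumps(content, indent=2))
```

Outside a training job the directory does not exist, so the loop prints nothing.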

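What about environment variables?

There are many environment variables that give directories and host names. Among them, the ones specific to SageMaker start with SM_.
For details, see the linked reference; here I pick out only the ones I use often.
As a refresher on reading environment variables, in Python you get them as follows.

Example for SM_MODEL_DIR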

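import os
# Pattern 1 (raises KeyError if the environment variable is not set)
model_dir = os.environ['SM_MODEL_DIR']
# Pattern 2 (returns None if the environment variable is not set;
# a default value can be passed as the second argument)
model_dir = os.environ.get('SM_MODEL_DIR')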

SM_MODEL_DIR
The model output directory (value: /opt/ml/model). Files written here are automatically packed into model.tar.gz and transferred to S3.
SM_CHANNELS
The channel names of the data used for training.
SM_CHANNEL_{channel_name}
The directory of each channel of the data used for training.
One variable is provided per channel (e.g. SM_CHANNEL_TRAINING).
The value is /opt/ml/input/data/{channel_name}.
SM_HPS
The keys and values of the configured hyperparameters, in dictionary (JSON) form.
model_dir is included by default.
Rather than argparse or SM_HP_{hyperparameter_name} below, I find hps = json.loads(os.environ.get('SM_HPS')) the easiest way to read hyperparameters.
SM_HP_{hyperparameter_name}
One variable per configured hyperparameter.
SM_NUM_GPUS
The number of GPUs on the instance.
SM_NUM_CPUS
The number of CPUs on the instance.
SM_OUTPUT_INTERMEDIATE_DIR
The directory whose contents can be synced to S3 while the job is running.
It only works if sagemaker_s3_output is set in the hyperparameters.
Currently set to /opt/ml/output/intermediate.
SM_OUTPUT_DATA_DIR
The directory whose contents are packed into output.tar.gz and transferred to S3 after the training job completes.
Currently set to /opt/ml/output/data.

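There are other environment variables as well (for example, ones holding the contents of /opt/ml/input/config/*.json).
The standard output of a training job lists the environment variables that start with SM_, so use them as needed.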
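Putting the variables above together, the following sketch reads the hyperparameters and channel directories in one place. The function name read_sm_env and the fake_env values are assumptions for illustration; the fallbacks let the same script run outside SageMaker, where the SM_ variables are not set.

```python
import json
import os


def read_sm_env(environ=os.environ):
    """Collect hyperparameters and channel directories from SM_* variables."""
    # SM_HPS holds a JSON object; fall back to "{}" outside SageMaker.
    hps = json.loads(environ.get("SM_HPS", "{}"))
    # SM_CHANNELS holds a JSON list of channel names.
    channels = json.loads(environ.get("SM_CHANNELS", "[]"))
    # Each channel has its own SM_CHANNEL_<NAME> variable (name upper-cased).
    channel_dirs = {
        name: environ.get(f"SM_CHANNEL_{name.upper()}") for name in channels
    }
    return hps, channel_dirs


# Hypothetical values that imitate what a training job would provide.
fake_env = {
    "SM_HPS": '{"first-num": 5, "second-num": 2, "operator": "m"}',
    "SM_CHANNELS": '["training"]',
    "SM_CHANNEL_TRAINING": "/opt/ml/input/data/training",
}
hps, channel_dirs = read_sm_env(fake_env)
print(hps["first-num"], channel_dirs["training"])
```

Passing the environment as an argument keeps the function easy to test locally with a plain dict.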

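What command is executed?

It was already visible in the previous article, in the part about hyperparameters as command-line arguments: the command used to run the code you bring can be found in the standard output.

Excerpt from the training job's standard output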

/usr/local/bin/python3.8 -m pip install -r requirements.txt
/usr/local/bin/python3.8 check.py --first-num 5 --model_dir s3://sagemaker-{REGION}-{ACCOUNT_ID}/{output_path}-{YYYY}-{MM}-{DD}-{HH}-{MI}-{SS}-{mmm}/model --operator m --second-num 2

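The first line runs only when requirements.txt sits directly under the directory given as source_dir (otherwise the command is not executed at all). The /usr/local/bin/python3.8 part depends on the container.

As for command-line arguments, --model_dir is always passed. Its value is the S3 location for the training job's artifact, that is, the tar.gz of the files saved in the directory given by the SM_MODEL_DIR environment variable (/opt/ml/model). The remaining arguments carry the values set in hyperparameters.
As mentioned above, hyperparameters can also be read from environment variables.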
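For readers who prefer to receive the arguments shown in the command above, here is a minimal argparse sketch. The argument names mirror the dummy hyperparameters used in this article; the defaults, the function name parse_args, and the example S3 URI are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments SageMaker appends to the entry point command."""
    parser = argparse.ArgumentParser()
    # Hyperparameter keys arrive verbatim as option names.
    parser.add_argument("--first-num", type=int, default=0)
    parser.add_argument("--second-num", type=int, default=0)
    parser.add_argument("--operator", type=str, default=None)
    # --model_dir is always passed and holds an S3 URI, not a local path.
    parser.add_argument("--model_dir", type=str, default=None)
    # parse_known_args tolerates options this script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--first-num", "5", "--model_dir", "s3://bucket/prefix/model",
                   "--operator", "m", "--second-num", "2"])
print(args.first_num, args.second_num, args.operator, args.model_dir)
```

Using parse_known_args instead of parse_args means that adding a new key to hyperparameters does not break a script that has not declared it yet.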

What is the current directory?

It is /opt/ml/code.
The environment variable PWD is set to /opt/ml/code, and you can also confirm it by running the following in your own code.

Excerpt from ./src/2-1/check.py

import subprocess
def exec_cmd(cmd):
    res = subprocess.run(cmd.split(' '),stdout=subprocess.PIPE)
    print(f'result : {cmd}')
    print(res.stdout.decode('utf-8'))
exec_cmd('pwd')

Excerpt of the output

result : pwd
/opt/ml/code

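Where is the code I provided placed?

As mentioned in the directory layout, it is /opt/ml/code, the same as the current directory. When source_dir is specified, the whole directory is placed there.
(The flow: source_dir is packed into sourcedir.tar.gz and transferred to S3, then transferred to the training instance and extracted.)
The check code is below.

Excerpt from ./src/2-1/check.py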

import subprocess
def exec_cmd(cmd):
    res = subprocess.run(cmd.split(' '),stdout=subprocess.PIPE)
    print(f'result : {cmd}')
    print(res.stdout.decode('utf-8'))
exec_cmd('ls -la /opt/ml/code')

Excerpt of the output

result : ls -la /opt/ml/code/
total 20
drwxr-xr-x 3 root root 4096 Feb  7 08:21 .
drwxr-xr-x 7 root root 4096 Feb  7 08:21 ..
-rw-rw-r-- 1 1000 1000  785 Feb  7 08:16 check.py
-rw-rw-r-- 1 1000 1000   14 Feb  5 07:19 requirements.txt

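Which user runs the code?

With AWS-managed containers it is root, the Docker default.
This causes no particular problem because it is batch processing, but if you bring your own Docker image you can change it, for example by specifying the user in the Dockerfile.
The check script is below.

Excerpt from ./src/2-1/check.py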

import subprocess
def exec_cmd(cmd):
    res = subprocess.run(cmd.split(' '),stdout=subprocess.PIPE)
    print(f'result : {cmd}')
    print(res.stdout.decode('utf-8'))
exec_cmd('whoami')

Excerpt of the output

result : whoami
root

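Which libraries are installed?

It differs by container. Read the container's source code.
For the AWS-managed DL containers, the source code is in the linked repository: for CPU see {framework}/training/docker/{version}/{python version}/Dockerfile.cpu, and for GPU see {cuda version}/Dockerfile.gpu one level further down.

Alternatively, you can check by running the following in your training script.

Excerpt from ./src/2-1/check.py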

import subprocess
def exec_cmd(cmd):
    res = subprocess.run(cmd.split(' '),stdout=subprocess.PIPE)
    print(f'result : {cmd}')
    print(res.stdout.decode('utf-8'))
exec_cmd('pip freeze')

For the TensorFlow 2.6.0 py38 container (the list is long)

Excerpt of the output

absl-py==0.10.0
argon2-cffi==21.1.0
astunparse==1.6.3
attrs==21.2.0
autovizwidget==0.19.1
awscli==1.21.1
backcall==0.2.0
bcrypt==3.2.0
beautifulsoup4==4.10.0
bleach==4.1.0
bokeh==2.3.3
boto3==1.19.1
botocore==1.22.1
cachetools==4.2.4
certifi==2021.10.8
cffi==1.15.0
chardet==3.0.4
clang==5.0
cloudpickle==2.0.0
cmake==3.18.2.post1
colorama==0.4.3
cryptography==35.0.0
cycler==0.10.0
decorator==5.1.0
defusedxml==0.7.1
dill==0.3.4
docutils==0.15.2
entrypoints==0.3
filelock==3.3.1
flatbuffers==1.12
fsspec==2021.10.1
future==0.18.2
gast==0.4.0
gevent==21.8.0
google-auth==2.3.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
greenlet==1.1.2
grpcio==1.41.0
h5py==3.1.0
hdijupyterutils==0.19.1
horovod==0.22.1
idna==2.10
imageio==2.9.0
importlib-metadata==4.8.1
inotify-simple==1.2.1
ipykernel==5.5.6
ipython==7.28.0
ipython-genutils==0.2.0
ipywidgets==7.6.5
jedi==0.18.0
Jinja2==3.0.2
jmespath==0.10.0
joblib==1.1.0
jsonschema==4.1.2
jupyter==1.0.0
jupyter-client==7.0.6
jupyter-console==6.4.0
jupyter-core==4.8.1
jupyterlab-pygments==0.1.2
jupyterlab-widgets==1.0.2
keras==2.6.0
Keras-Preprocessing==1.1.2
kiwisolver==1.3.2
llvmlite==0.37.0
Markdown==3.3.4
MarkupSafe==2.0.1
matplotlib==3.4.3
matplotlib-inline==0.1.3
mistune==0.8.4
mock==4.0.3
mpi4py==3.0.3
multiprocess==0.70.12.2
nbclient==0.5.4
nbconvert==6.2.0
nbformat==5.1.3
nest-asyncio==1.5.1
nose==1.3.7
notebook==6.4.5
numba==0.54.1
numpy==1.19.5
oauthlib==3.1.1
opencv-python==4.5.4.58
opt-einsum==3.3.0
packaging==21.0
pandas==1.2.5
pandocfilters==1.5.0
paramiko==2.8.0
parso==0.8.2
pathos==0.2.8
pexpect==4.8.0
pickleshare==0.7.5
Pillow==8.3.2
plotly==5.3.1
pox==0.3.0
ppft==1.6.6.4
prometheus-client==0.11.0
prompt-toolkit==3.0.21
protobuf==3.19.0
protobuf3-to-dict==0.1.5
psutil==5.7.2
ptyprocess==0.7.0
pure-sasl==0.6.2
pyarrow==5.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.8.0
pycparser==2.20
pyfunctional==1.4.3
Pygments==2.10.0
PyHive==0.6.4
pyinstrument==3.4.2
pyinstrument-cext==0.2.4
pykerberos==1.2.1
PyNaCl==1.4.0
pyparsing==2.4.7
pyrsistent==0.18.0
python-dateutil==2.8.1
pytz==2021.3
PyYAML==5.4.1
pyzmq==22.3.0
qtconsole==5.1.1
QtPy==1.11.2
requests==2.24.0
requests-kerberos==0.12.0
requests-oauthlib==1.3.0
retrying==1.3.3
rsa==4.7.2
s3fs==0.4.2
s3transfer==0.5.0
sagemaker==2.65.0
sagemaker-experiments==0.1.35
sagemaker-studio-analytics-extension==0.0.2
sagemaker-studio-sparkmagic-lib==0.1.3
sagemaker-tensorflow==2.6.0.1.11.0
sagemaker-tensorflow-training==20.3.0
sagemaker-training==4.0.0
sasl==0.3.1
scikit-learn==0.24.2
scipy==1.7.0
seaborn==0.11.2
Send2Trash==1.8.0
shap==0.40.0
six==1.15.0
sklearn==0.0
slicer==0.0.7
smclarify==0.2
smdebug==1.0.12
smdebug-rulesconfig==1.0.1
soupsieve==2.3.1
sparkmagic==0.19.1
tabulate==0.8.9
tenacity==8.0.1
tensorboard==2.7.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
tensorflow==2.6.0
tensorflow-cpu @ https://aws-tensorflow-binaries.s3-us-west-2.amazonaws.com/tensorflow/r2.6_aws/20210816_235124/cpu/py38/tensorflow_cpu-2.6.0-cp38-cp38-manylinux2010_x86_64.whl
tensorflow-estimator==2.6.0
tensorflow-io==0.21.0
tensorflow-io-gcs-filesystem==0.21.0
termcolor==1.1.0
terminado==0.12.1
testpath==0.5.0
threadpoolctl==3.0.0
thrift==0.15.0
thrift-sasl==0.4.3
tornado==6.1
tqdm==4.62.3
traitlets==5.1.0
typing-extensions==3.7.4.3
urllib3==1.25.11
wcwidth==0.2.5
webencodings==0.5.1
Werkzeug==2.0.2
widgetsnbextension==3.5.1
wrapt==1.12.1
zipp==3.6.0
zope.event==4.5.0
zope.interface==5.4.0
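As an alternative to shelling out to pip freeze, the installed packages can also be listed from Python itself. This is a sketch using importlib.metadata from the standard library (Python 3.8 or later); the function name installed_packages is hypothetical, and the output format can differ from pip freeze for packages installed from a URL.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_packages():
    """Return 'name==version' strings for every installed distribution."""
    lines = [
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
    ]
    return sorted(lines, key=str.lower)


print("\n".join(installed_packages()))
```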

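How to output artifacts from the training instance to S3

As briefly mentioned in the directory section, files placed in the directory given by the SM_MODEL_DIR environment variable are packed into model.tar.gz and transferred to S3 when training completes. However, model.tar.gz is a special file in SageMaker: it is also used by Amazon SageMaker Hosting (not covered in this series, Getting Started with Amazon SageMaker Training from an Engineer's Perspective), so you do not want anything other than the machine learning model in it. So what should you do when you want to output intermediate artifacts or something to inspect? Here is a brief introduction.

Transferring data after training completes

Files written under the directory given by the SM_OUTPUT_DATA_DIR environment variable (value: /opt/ml/output/data) are packed into output.tar.gz and transferred to S3 after training completes. The destination has the same prefix as model.tar.gz. An example follows.

Excerpt from ./src/2-1/check.py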

import os
output_data_dir = os.environ.get('SM_OUTPUT_DATA_DIR')
with open(os.path.join(output_data_dir,'data.txt'),'wt') as f:
    f.write(os.path.join(output_data_dir,'data.txt'))
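A slightly more practical variation is to save a small JSON summary next to the model. The sketch below assumes a hypothetical helper name save_summary and a local fallback directory ./output_data so that the script also runs outside SageMaker.

```python
import json
import os


def save_summary(summary, filename="summary.json"):
    """Write `summary` as JSON under SM_OUTPUT_DATA_DIR and return the path."""
    # Fall back to ./output_data when the variable is not set (local runs).
    out_dir = os.environ.get("SM_OUTPUT_DATA_DIR", "./output_data")
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename)
    with open(path, "w") as f:
        json.dump(summary, f, indent=2)
    return path
```

Calling save_summary({'result': 10}) at the end of the script would then place summary.json inside output.tar.gz.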

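Transferring data during training

Files written under the directory given by the SM_OUTPUT_INTERMEDIATE_DIR environment variable (value: /opt/ml/output/intermediate) are transferred to S3 while training is running. This is useful when you want to follow training in near real time, for example with TensorBoard.
As noted earlier, you must set an S3 URI under the sagemaker_s3_output key in the hyperparameters. An example follows.

Excerpt from ./src/2-1/check.py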

import os
output_intermediate_dir = os.environ.get('SM_OUTPUT_INTERMEDIATE_DIR')
with open(os.path.join(output_intermediate_dir,'1.txt'),'at') as f:
    f.write('a')

2_QA.ipynb

from sagemaker.tensorflow import TensorFlow
estimator = TensorFlow(
    entry_point='check.py',
    source_dir = './src/2-1',
    py_version='py38', 
    framework_version='2.6.0',
    instance_count=1,
    instance_type='ml.m5.xlarge',
    role=sagemaker.get_execution_role(),
    hyperparameters={ # these hyperparameters are dummies
        'first-num':5,
        'second-num':2,
        'operator':'m',
        'sagemaker_s3_output':f's3://{sagemaker.session.Session().default_bucket()}/intermediate'
    },
    volume_size=50, # request 50 GB
)
estimator.fit({
    'training':training_input_s3_uri,
    'validation':validation_input_s3_uri,
    'test': test_input_s3_uri
})
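To follow a job while it runs, one simple pattern is to append one record per step to a file in the intermediate directory. This is a sketch under assumptions: the helper name log_progress, the file name progress.jsonl, and the local fallback directory are all hypothetical.

```python
import json
import os
import time


def log_progress(step, value, filename="progress.jsonl"):
    """Append one JSON line per call under SM_OUTPUT_INTERMEDIATE_DIR."""
    # Fall back to ./intermediate when the variable is not set (local runs).
    out_dir = os.environ.get("SM_OUTPUT_INTERMEDIATE_DIR", "./intermediate")
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename)
    record = {"step": step, "value": value, "time": time.time()}
    with open(path, "a") as f:  # append mode keeps earlier records
        f.write(json.dumps(record) + "\n")
    return path
```

Because the file is closed after every call, each record is on disk before the next step starts, so whatever syncs the directory sees complete lines.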

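Integration with the development environment

Git integration

I believe most teams manage their code with Git, and I personally hope that is the norm.
So it is natural to want to run the code in a Git repository when launching training. Of course you can. Specify the repository with the git_config argument when creating the Estimator, and the code is pulled from the Git repository and run when training starts. An example follows.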

2_QA.ipynb

from sagemaker.tensorflow import TensorFlow
git_config = {'repo': 'https://github.com/kazuhitogo/sagemaker-training-tutorial'}
estimator = TensorFlow(
    entry_point='check.py',
    source_dir = './src/2-1',
    git_config=git_config,
    py_version='py38', 
    framework_version='2.6.0',
    instance_count=1,
    instance_type='ml.m5.xlarge',
    role=sagemaker.get_execution_role(),
    hyperparameters={ # these hyperparameters are dummies
        'first-num':5,
        'second-num':2,
        'operator':'m',
        'sagemaker_s3_output':f's3://{sagemaker.session.Session().default_bucket()}/intermediate'
    },
    volume_size=50, # request 50 GB
)
estimator.fit({
    'training':training_input_s3_uri,
    'validation':validation_input_s3_uri,
    'test': test_input_s3_uri
})

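To confirm that the Git repository was really used, run

$ rm -rf ./src/2-1/

beforehand (so that the source code does not exist locally).

git_config is a dictionary, and it can also hold a branch and a commit ID, so specify them explicitly when you want a particular branch or commit. See the linked documentation for details.
The code that was used (pulled from the Git repository) is also automatically packed into sourcedir.tar.gz on S3.

With this feature, a good workflow is to commit and push before each run, then launch training.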
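As a small illustration of pinning a branch or commit, the sketch below builds the git_config dictionary with the 'repo', 'branch' and 'commit' keys of the SageMaker Python SDK. The helper name make_git_config and the branch name main are assumptions; replace them with your own values.

```python
def make_git_config(repo, branch=None, commit=None):
    """Build a git_config dict, adding 'branch' and 'commit' only when given."""
    config = {"repo": repo}
    if branch is not None:
        config["branch"] = branch
    if commit is not None:
        config["commit"] = commit
    return config


git_config = make_git_config(
    "https://github.com/kazuhitogo/sagemaker-training-tutorial",
    branch="main",  # assumed branch name; replace with your own
)
print(git_config)
```

Pinning a commit ID as well makes it possible to tell later exactly which revision a given training job ran.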

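Specifying source code on S3

So far we have specified either a local source_dir or a Git repository, and in both cases the code is always packed into sourcedir.tar.gz and stored on S3. Once this is in regular operation, however, the same source code keeps being uploaded to S3 as sourcedir.tar.gz on every run. The size is small, but in production this is not always elegant (of course, transferring the source code to S3 every time does serve as evidence of what was executed, so it depends on your use case).
So how do you point a job at source code that is already on S3?
You can do it with the Estimator class. An example follows.
It reuses the code and image from the previous run.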

2_QA.ipynb

# Reuse the image and code from the previous run
image_uri = estimator.latest_training_job.describe()['AlgorithmSpecification']['TrainingImage']
source_tar_gz = estimator.latest_training_job.describe()['HyperParameters']['sagemaker_submit_directory']
estimator = sagemaker.estimator.Estimator(
    image_uri=image_uri,
    role=sagemaker.get_execution_role(),
    hyperparameters={
        'first-num':5,
        'second-num':2,
        'operator':'m',
        'sagemaker_s3_output':f's3://{sagemaker.session.Session().default_bucket()}/intermediate',
        'sagemaker_program' : 'check.py',
        'sagemaker_submit_directory' : source_tar_gz
    },
    instance_count=1,
    instance_type='ml.g4dn.xlarge',
)
estimator.fit({
    'training':training_input_s3_uri,
    'validation':validation_input_s3_uri,
    'test': test_input_s3_uri
})

By setting sagemaker_submit_directory in hyperparameters to the URI of the sourcedir.tar.gz on S3, and the entry script in sagemaker_program, you can run the job this way.
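Since the two keys are easy to mistype, a small helper that adds them to an existing hyperparameters dictionary can help. This is only a sketch: with_s3_source is a hypothetical name, the S3 URI is a placeholder, and the check on the URI prefix is my own addition.

```python
def with_s3_source(hyperparameters, program, submit_directory):
    """Return a copy of hyperparameters that points the job at code stored in S3."""
    if not submit_directory.startswith("s3://"):
        raise ValueError("submit_directory must be an S3 URI")
    merged = dict(hyperparameters)  # leave the caller's dict untouched
    merged["sagemaker_program"] = program
    merged["sagemaker_submit_directory"] = submit_directory
    return merged


hps = with_s3_source(
    {"first-num": 5, "second-num": 2, "operator": "m"},
    program="check.py",
    submit_directory="s3://example-bucket/example-prefix/sourcedir.tar.gz",  # placeholder URI
)
print(hps["sagemaker_program"])
```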

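Execution records

SageMaker has SageMaker Experiments to support experiment management, with a mechanism to collect and visualize training metrics, but even without going that far, a record of each training run is kept. We have already used the SageMaker SDK's describe() a few times; when you want to look across training jobs, boto3 makes it easy.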

2_QA.ipynb

import boto3
sm_client = boto3.client('sagemaker')

# Get the list of training jobs, newest first
jobs = sm_client.list_training_jobs(SortBy='CreationTime', SortOrder='Descending')
print(jobs)
# Get the details of the most recent training job
print(sm_client.describe_training_job(TrainingJobName=jobs['TrainingJobSummaries'][0]['TrainingJobName']))

The code above retrieves the list of training jobs and the details of the most recently run training job.
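When there are many jobs, a single call returns only one page of results. The sketch below uses the boto3 paginator for list_training_jobs and an optional status filter; the function name list_job_names and its limit argument are assumptions. The client is passed in as an argument so the function can be tried without AWS credentials.

```python
def list_job_names(sm_client, status=None, limit=20):
    """Return up to `limit` training job names, newest first."""
    params = {"SortBy": "CreationTime", "SortOrder": "Descending"}
    if status is not None:
        params["StatusEquals"] = status  # e.g. "Completed" or "Failed"
    names = []
    paginator = sm_client.get_paginator("list_training_jobs")
    for page in paginator.paginate(**params):
        for summary in page["TrainingJobSummaries"]:
            names.append(summary["TrainingJobName"])
            if len(names) >= limit:
                return names
    return names


# Example (needs AWS credentials and a default region):
#   import boto3
#   print(list_job_names(boto3.client("sagemaker"), status="Completed", limit=5))
```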
