目的
SageMakerのEstimator APIの検証のために WindowsのWSL環境で「CIFAR-10分類のための機械学習トレーニングスクリプト (PyTorch Lightning使用)」を試しに実行してみた。検証メモです。
環境
- WSL(Ubuntu)
- Linux DESKTOP-C0A2DE7 5.15.153.1-microsoft-standard-WSL2 #1 SMP Fri Mar 29 23:14:13 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
- Jupyter notebook(WSLのUbuntuに入れて実行)
主な機能:
- SageMaker Estimatorのセットアップ。
- PyTorch Lightningスクリプトを指定してトレーニングを実行。
- ログとモデルアーティファクトをS3に保存。
トレーニングジョブのコンテナ環境
コンテナイメージに使うDockerfileは下記になります。
FROM nvidia/cuda:11.6.1-cudnn8-devel-ubuntu20.04
ENV TZ=Asia/Tokyo
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
# 必要なパッケージをインストール
RUN apt-get update && \
apt-get install -yq --no-install-recommends \
software-properties-common && \
add-apt-repository ppa:deadsnakes/ppa && \
apt-get update && \
apt-get install -yq --no-install-recommends python3.9 python3.9-dev python3.9-distutils \
build-essential \
wget \
git \
unzip \
curl \
tzdata && \
apt-get clean
# シンボリックリンクを作成
RUN ln -sf /usr/bin/python3.9 /usr/bin/python
# pipのインストール
RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python
# 必要なPythonパッケージをインストール
RUN pip install --no-cache-dir torch torchvision --extra-index-url https://download.pytorch.org/whl/cu116
RUN pip install --no-cache-dir pytorch-lightning \
matplotlib \
scikit-learn \
timm \
sagemaker-training \
tensorboard
Dockerビルドしてイメージを作成します。sagemaker.gpuが作られましたね。
@DESKTOP-C0A2DE7:~/09_local_mode/docker$ docker build -t sagemaker.gpu -f Dockerfile_local_sagemaker_gpu .
@DESKTOP-C0A2DE7:~/09_local_mode/docker$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
sagemaker.gpu latest cce745b2c2e9 37 minutes ago 14.6GB
<none> <none> c881fdb09124 55 minutes ago 14.6GB
ubuntu22.04.python latest a8d5dd4a7c63 2 weeks ago 272MB
ubuntu22.04.vim latest 001a4fe504ab 2 weeks ago 196MB
public.ecr.aws/aws-cli/aws-cli latest cab764e2e45d 2 weeks ago 418MB
nginx latest f876bfc1cc63 7 weeks ago 192MB
255456539449.dkr.ecr.ap-northeast-1.amazonaws.com/test_container ubuntu.20.04 6013ae1a63c2 3 months ago 72.8MB
ubuntu 20.04 6013ae1a63c2 3 months ago 72.8MB
ubuntu 22.04 97271d29cb79 4 months ago 77.9MB
nvidia/cuda 11.0.3-base-ubuntu20.04 97dfa1ef5ee7 14 months ago 122MB
ジョブスクリプト
トレーニングジョブを呼びだすpythonスクリプトは下記の通り。
上記で作成したDockerイメージも
image_uri="sagemaker.gpu:latest"
で指定しています。
import boto3
import sagemaker
from sagemaker.estimator import Estimator
def main():
# Boto3セッションを初期化 (AWSのプロファイルを指定)
boto_session = boto3.Session(profile_name="develop")
# SageMakerローカルセッションを作成
sagemaker_session = sagemaker.LocalSession(boto_session=boto_session)
# 実行に必要なロールARNを取得
role = sagemaker_session.get_caller_identity_arn()
# Estimatorの設定
estimator = Estimator(
image_uri="sagemaker.gpu:latest", # 使用するDockerイメージ
entry_point="09_cifar.py", # 実行するスクリプト
role=role, # 実行権限
instance_count=1, # 使用するインスタンス数
instance_type="local", # ローカル実行 (CPUの場合)
output_path="s3://takuma-s3-sagemaker-kensho/sagemaker/logs", # 出力先S3バケット
code_location="s3://takuma-s3-sagemaker-kensho/sagemaker/logs", # スクリプト保存先
sagemaker_session=sagemaker_session, # SageMakerセッション
base_job_name="local-cifar", # ベースジョブ名
hyperparameters={ # ハイパーパラメータ
"ckpt_dir": "/opt/ml/model" # モデル保存先 (コンテナ内)
}
)
# トレーニングジョブの実行
estimator.fit()
if __name__ == "__main__":
# メイン関数の実行
main()
CIFAR-10分類のための機械学習トレーニングスクリプト (PyTorch Lightning使用)
'''
CIFAR-10分類のための機械学習トレーニングスクリプト (PyTorch Lightning使用)
このスクリプトは、CIFAR-10データセットを分類するための畳み込みニューラルネットワーク (ConvNet) モデルを実装しています。
PyTorch Lightningを使用して、トレーニングおよびバリデーションループの管理、ログ記録、チェックポイントの保存を行います。
主な機能:
- LightningDataModule (`CIFARModule`) を使用したデータ準備。
- PyTorch Lightningモジュールとして定義されたConvNetモデル (`ConvNet`)。
- TensorBoardLoggerを使用したトレーニングとバリデーションのメトリクス記録。
- バリデーション精度に基づくチェックポイント保存。
使用方法:
1. スクリプト内のハイパーパラメータやパスを調整するか、コマンドライン引数として指定してください。
2. 必要な依存関係がインストールされたPython環境でスクリプトを実行してください。
'''
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
import torchvision
import torchmetrics
import pytorch_lightning as pl
import argparse
class ConvNet(pl.LightningModule):
def __init__(self):
super().__init__()
# モデルの各層を定義するリスト
self.layers = nn.ModuleList()
for i in range(3): # 畳み込みブロックを3つ作成
for j in range(3): # 各ブロックに3層ずつ追加
if i == 0 and j == 0:
in_chs = 3 # CIFAR-10はRGB画像なので入力チャネル数は3
elif j == 0:
in_chs = 64 * (2**(i-1)) # ブロック間でチャネル数を増加
else:
in_chs = 64 * (2**i) # ブロック内では一定のチャネル数
out_chs = 64 * (2**i)
# 畳み込み層を追加
self.layers.append(nn.Conv2d(in_chs, out_chs, 3, padding=1))
# バッチ正規化層を追加
self.layers.append(nn.BatchNorm2d(out_chs))
# 活性化関数を追加(最終ブロックの最後には追加しない)
if i != 2 and j != 2:
self.layers.append(nn.ReLU(inplace=True))
# プーリング層を追加(最終ブロック以外の最後に追加)
if i != 2 and j == 2:
self.layers.append(nn.AvgPool2d(2))
# グローバルプーリング層
self.global_pool = nn.AdaptiveAvgPool2d((1, 1))
# 全結合層(10クラス分類用)
self.fc = nn.Linear(256, 10)
# トレーニングとバリデーション用の精度メトリクス
self.train_acc = torchmetrics.Accuracy(task="multiclass", num_classes=10)
self.val_acc = torchmetrics.Accuracy(task="multiclass", num_classes=10)
def forward(self, inputs):
# 順伝播の定義
x = inputs
for l in self.layers:
x = l(x)
# グローバルプーリング後に平坦化
x = self.global_pool(x).view(x.shape[0], 256)
# 全結合層を通過
x = self.fc(x)
return x
def configure_optimizers(self):
# オプティマイザ(SGD)と学習率スケジューラの設定
optimizer = torch.optim.SGD(self.parameters(), lr=1e-2, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, [60, 80], gamma=0.2)
return [optimizer], [scheduler]
def training_step(self, train_batch, batch_idx):
# トレーニングデータの1バッチを処理
x, y_true = train_batch # 入力データとラベル
y_pred = self.forward(x) # モデルの順伝播
loss = F.cross_entropy(y_pred, y_true) # クロスエントロピー損失
y_pred_label = torch.argmax(y_pred, dim=-1) # 予測ラベル
acc = self.train_acc(y_pred_label, y_true) # 精度を計算
# 損失と精度をログに記録
self.log("train_loss", loss, prog_bar=False, logger=True)
self.log("train_acc", acc, prog_bar=True, logger=True)
return loss
def validation_step(self, val_batch, batch_idx):
# バリデーションデータの1バッチを処理
x, y_true = val_batch # 入力データとラベル
y_pred = self.forward(x) # モデルの順伝播
loss = F.cross_entropy(y_pred, y_true) # クロスエントロピー損失
y_pred_label = torch.argmax(y_pred, dim=-1) # 予測ラベル
acc = self.val_acc(y_pred_label, y_true) # 精度を計算
# 損失と精度をログに記録
self.log("val_loss", loss, prog_bar=False, logger=True)
self.log("val_acc", acc, prog_bar=True, logger=True)
def on_validation_epoch_end(self):
# エポック終了時にバリデーション精度を表示し、メトリクスをリセット
print(f"\nEpoch {self.current_epoch:03} | ValidationAcc={self.val_acc.compute():.2%}")
self.val_acc.reset()
class CIFARModule(pl.LightningDataModule):
def __init__(self, opt):
super().__init__()
self.opt = opt
def prepare_data(self):
# CIFAR-10データセットの準備
self.train_dataset = torchvision.datasets.CIFAR10(
self.opt.data_dir, train=True, download=True,
transform=torchvision.transforms.Compose([
torchvision.transforms.RandomCrop((32, 32), padding=2), # ランダムクロップ
torchvision.transforms.RandomHorizontalFlip(), # ランダム水平反転
torchvision.transforms.ToTensor() # テンソルに変換
]))
self.val_dataset = torchvision.datasets.CIFAR10(
self.opt.data_dir, train=False, download=True,
transform=torchvision.transforms.ToTensor())
def train_dataloader(self):
# トレーニングデータのデータローダー
return DataLoader(self.train_dataset, batch_size=128, num_workers=4, shuffle=True)
def val_dataloader(self):
# バリデーションデータのデータローダー
return DataLoader(self.val_dataset, batch_size=128, num_workers=4, shuffle=False)
def main(opt):
# ConvNetモデルを初期化
model = ConvNet()
# CIFAR-10データモジュールを初期化
data = CIFARModule(opt)
# TensorBoardLoggerを設定
logger = pl.loggers.TensorBoardLogger(f"{opt.ckpt_dir}/logs")
# モデルのチェックポイント設定
ckpt = pl.callbacks.ModelCheckpoint(
monitor="val_acc", # 監視するメトリクス
dirpath=opt.ckpt_dir, # 保存先ディレクトリ
filename="cifar-10-{epoch:03d}-{val_acc:.4f}", # 保存ファイル名
save_top_k=3, # 上位3つのモデルを保存
mode="max" # 最大値を基準にする
)
# トレーナーの設定
trainer = pl.Trainer(
max_epochs=100, # 最大エポック数
callbacks=[ckpt], # コールバックの設定
accelerator="auto", # 使用可能なデバイスを自動検出
devices="auto", # 利用可能なGPUまたはCPUを使用
logger=logger # ロガーを指定
)
# トレーニングの実行
trainer.fit(model, data)
if __name__ == "__main__":
# コマンドライン引数の設定
parser = argparse.ArgumentParser(description="MNIST Training")
parser.add_argument("--data_dir", type=str, default="./data") # データセットの保存先
parser.add_argument("--ckpt_dir", type=str, default="./ckpt") # チェックポイントの保存先
opt = parser.parse_args()
# メイン関数の実行
main(opt)
実行結果
time="2025-01-19T22:43:48+09:00" level=warning msg="/tmp/tmpm1ue0tm7/docker-compose.yaml: the attribute version
is obsolete, it will be ignored, please remove it to avoid potential confusion"
time="2025-01-19T22:43:48+09:00" level=warning msg="a network with name sagemaker-local exists but was not created for project "tmpm1ue0tm7".\nSet external: true
to use an existing network"
Container d7n2uflacp-algo-1-wmrbd Creating
Container d7n2uflacp-algo-1-wmrbd Created
Attaching to d7n2uflacp-algo-1-wmrbd
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | ==========
d7n2uflacp-algo-1-wmrbd | == CUDA ==
d7n2uflacp-algo-1-wmrbd | ==========
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | CUDA Version 11.6.1
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
d7n2uflacp-algo-1-wmrbd | By pulling and using the container, you accept the terms and conditions of this license:
d7n2uflacp-algo-1-wmrbd | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
d7n2uflacp-algo-1-wmrbd | Use the NVIDIA Container Toolkit to start this container with GPU support; see
d7n2uflacp-algo-1-wmrbd | https://docs.nvidia.com/datacenter/cloud-native/ .
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | *************************
d7n2uflacp-algo-1-wmrbd | ** DEPRECATION NOTICE! **
d7n2uflacp-algo-1-wmrbd | *************************
d7n2uflacp-algo-1-wmrbd | THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
d7n2uflacp-algo-1-wmrbd | https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,443 botocore.credentials INFO Found credentials in environment variables.
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,652 botocore.httpchecksum INFO Skipping checksum validation. Response did not contain one of the following algorithms: ['crc32', 'sha1', 'sha256'].
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,656 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,656 sagemaker-training-toolkit INFO Failed to parse hyperparameter ckpt_dir value /opt/ml/model to Json.
d7n2uflacp-algo-1-wmrbd | Returning the value itself
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,658 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,663 sagemaker-training-toolkit INFO instance_groups entry not present in resource_config
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,669 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,669 sagemaker-training-toolkit INFO Failed to parse hyperparameter ckpt_dir value /opt/ml/model to Json.
d7n2uflacp-algo-1-wmrbd | Returning the value itself
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,670 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,677 sagemaker-training-toolkit INFO instance_groups entry not present in resource_config
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,682 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,682 sagemaker-training-toolkit INFO Failed to parse hyperparameter ckpt_dir value /opt/ml/model to Json.
d7n2uflacp-algo-1-wmrbd | Returning the value itself
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,683 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,688 sagemaker-training-toolkit INFO instance_groups entry not present in resource_config
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,692 sagemaker-training-toolkit INFO Invoking user script
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | Training Env:
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | {
d7n2uflacp-algo-1-wmrbd | "additional_framework_parameters": {},
d7n2uflacp-algo-1-wmrbd | "channel_input_dirs": {},
d7n2uflacp-algo-1-wmrbd | "current_host": "algo-1-wmrbd",
d7n2uflacp-algo-1-wmrbd | "current_instance_group": "homogeneousCluster",
d7n2uflacp-algo-1-wmrbd | "current_instance_group_hosts": [],
d7n2uflacp-algo-1-wmrbd | "current_instance_type": "local",
d7n2uflacp-algo-1-wmrbd | "distribution_hosts": [
d7n2uflacp-algo-1-wmrbd | "algo-1-wmrbd"
d7n2uflacp-algo-1-wmrbd | ],
d7n2uflacp-algo-1-wmrbd | "distribution_instance_groups": [],
d7n2uflacp-algo-1-wmrbd | "framework_module": null,
d7n2uflacp-algo-1-wmrbd | "hosts": [
d7n2uflacp-algo-1-wmrbd | "algo-1-wmrbd"
d7n2uflacp-algo-1-wmrbd | ],
d7n2uflacp-algo-1-wmrbd | "hyperparameters": {
d7n2uflacp-algo-1-wmrbd | "ckpt_dir": "/opt/ml/model"
d7n2uflacp-algo-1-wmrbd | },
d7n2uflacp-algo-1-wmrbd | "input_config_dir": "/opt/ml/input/config",
d7n2uflacp-algo-1-wmrbd | "input_data_config": {},
d7n2uflacp-algo-1-wmrbd | "input_dir": "/opt/ml/input",
d7n2uflacp-algo-1-wmrbd | "instance_groups": [],
d7n2uflacp-algo-1-wmrbd | "instance_groups_dict": {},
d7n2uflacp-algo-1-wmrbd | "is_hetero": false,
d7n2uflacp-algo-1-wmrbd | "is_master": true,
d7n2uflacp-algo-1-wmrbd | "is_modelparallel_enabled": null,
d7n2uflacp-algo-1-wmrbd | "is_smddpmprun_installed": false,
d7n2uflacp-algo-1-wmrbd | "is_smddprun_installed": false,
d7n2uflacp-algo-1-wmrbd | "job_name": "local-cifar-2025-01-19-13-43-47-954",
d7n2uflacp-algo-1-wmrbd | "log_level": 20,
d7n2uflacp-algo-1-wmrbd | "master_hostname": "algo-1-wmrbd",
d7n2uflacp-algo-1-wmrbd | "model_dir": "/opt/ml/model",
d7n2uflacp-algo-1-wmrbd | "module_dir": "s3://takuma-s3-sagemaker-kensho/sagemaker/logs/local-cifar-2025-01-19-13-43-47-954/source/sourcedir.tar.gz",
d7n2uflacp-algo-1-wmrbd | "module_name": "09_cifar",
d7n2uflacp-algo-1-wmrbd | "network_interface_name": "eth0",
d7n2uflacp-algo-1-wmrbd | "num_cpus": 16,
d7n2uflacp-algo-1-wmrbd | "num_gpus": 0,
d7n2uflacp-algo-1-wmrbd | "num_neurons": 0,
d7n2uflacp-algo-1-wmrbd | "output_data_dir": "/opt/ml/output/data",
d7n2uflacp-algo-1-wmrbd | "output_dir": "/opt/ml/output",
d7n2uflacp-algo-1-wmrbd | "output_intermediate_dir": "/opt/ml/output/intermediate",
d7n2uflacp-algo-1-wmrbd | "resource_config": {
d7n2uflacp-algo-1-wmrbd | "current_host": "algo-1-wmrbd",
d7n2uflacp-algo-1-wmrbd | "hosts": [
d7n2uflacp-algo-1-wmrbd | "algo-1-wmrbd"
d7n2uflacp-algo-1-wmrbd | ]
d7n2uflacp-algo-1-wmrbd | },
d7n2uflacp-algo-1-wmrbd | "user_entry_point": "09_cifar.py"
d7n2uflacp-algo-1-wmrbd | }
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | Environment variables:
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | SM_HOSTS=["algo-1-wmrbd"]
d7n2uflacp-algo-1-wmrbd | SM_NETWORK_INTERFACE_NAME=eth0
d7n2uflacp-algo-1-wmrbd | SM_HPS={"ckpt_dir":"/opt/ml/model"}
d7n2uflacp-algo-1-wmrbd | SM_USER_ENTRY_POINT=09_cifar.py
d7n2uflacp-algo-1-wmrbd | SM_FRAMEWORK_PARAMS={}
d7n2uflacp-algo-1-wmrbd | SM_RESOURCE_CONFIG={"current_host":"algo-1-wmrbd","hosts":["algo-1-wmrbd"]}
d7n2uflacp-algo-1-wmrbd | SM_INPUT_DATA_CONFIG={}
d7n2uflacp-algo-1-wmrbd | SM_OUTPUT_DATA_DIR=/opt/ml/output/data
d7n2uflacp-algo-1-wmrbd | SM_CHANNELS=[]
d7n2uflacp-algo-1-wmrbd | SM_CURRENT_HOST=algo-1-wmrbd
d7n2uflacp-algo-1-wmrbd | SM_CURRENT_INSTANCE_TYPE=local
d7n2uflacp-algo-1-wmrbd | SM_CURRENT_INSTANCE_GROUP=homogeneousCluster
d7n2uflacp-algo-1-wmrbd | SM_CURRENT_INSTANCE_GROUP_HOSTS=[]
d7n2uflacp-algo-1-wmrbd | SM_INSTANCE_GROUPS=[]
d7n2uflacp-algo-1-wmrbd | SM_INSTANCE_GROUPS_DICT={}
d7n2uflacp-algo-1-wmrbd | SM_DISTRIBUTION_INSTANCE_GROUPS=[]
d7n2uflacp-algo-1-wmrbd | SM_IS_HETERO=false
d7n2uflacp-algo-1-wmrbd | SM_MODULE_NAME=09_cifar
d7n2uflacp-algo-1-wmrbd | SM_LOG_LEVEL=20
d7n2uflacp-algo-1-wmrbd | SM_FRAMEWORK_MODULE=
d7n2uflacp-algo-1-wmrbd | SM_INPUT_DIR=/opt/ml/input
d7n2uflacp-algo-1-wmrbd | SM_INPUT_CONFIG_DIR=/opt/ml/input/config
d7n2uflacp-algo-1-wmrbd | SM_OUTPUT_DIR=/opt/ml/output
d7n2uflacp-algo-1-wmrbd | SM_NUM_CPUS=16
d7n2uflacp-algo-1-wmrbd | SM_NUM_GPUS=0
d7n2uflacp-algo-1-wmrbd | SM_NUM_NEURONS=0
d7n2uflacp-algo-1-wmrbd | SM_MODEL_DIR=/opt/ml/model
d7n2uflacp-algo-1-wmrbd | SM_MODULE_DIR=s3://takuma-s3-sagemaker-kensho/sagemaker/logs/local-cifar-2025-01-19-13-43-47-954/source/sourcedir.tar.gz
d7n2uflacp-algo-1-wmrbd | SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{},"current_host":"algo-1-wmrbd","current_instance_group":"homogeneousCluster","current_instance_group_hosts":[],"current_instance_type":"local","distribution_hosts":["algo-1-wmrbd"],"distribution_instance_groups":[],"framework_module":null,"hosts":["algo-1-wmrbd"],"hyperparameters":{"ckpt_dir":"/opt/ml/model"},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","instance_groups":[],"instance_groups_dict":{},"is_hetero":false,"is_master":true,"is_modelparallel_enabled":null,"is_smddpmprun_installed":false,"is_smddprun_installed":false,"job_name":"local-cifar-2025-01-19-13-43-47-954","log_level":20,"master_hostname":"algo-1-wmrbd","model_dir":"/opt/ml/model","module_dir":"s3://takuma-s3-sagemaker-kensho/sagemaker/logs/local-cifar-2025-01-19-13-43-47-954/source/sourcedir.tar.gz","module_name":"09_cifar","network_interface_name":"eth0","num_cpus":16,"num_gpus":0,"num_neurons":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1-wmrbd","hosts":["algo-1-wmrbd"]},"user_entry_point":"09_cifar.py"}
d7n2uflacp-algo-1-wmrbd | SM_USER_ARGS=["--ckpt_dir","/opt/ml/model"]
d7n2uflacp-algo-1-wmrbd | SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
d7n2uflacp-algo-1-wmrbd | SM_HP_CKPT_DIR=/opt/ml/model
d7n2uflacp-algo-1-wmrbd | PYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/lib/python39.zip:/usr/lib/python3.9:/usr/lib/python3.9/lib-dynload:/usr/local/lib/python3.9/dist-packages:/usr/lib/python3/dist-packages
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | Invoking script with the following command:
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | /usr/bin/python 09_cifar.py --ckpt_dir /opt/ml/model
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,692 sagemaker-training-toolkit INFO Exceptions not imported for SageMaker Debugger as it is not installed.
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,692 sagemaker-training-toolkit INFO Exceptions not imported for SageMaker TF as Tensorflow is not installed.
d7n2uflacp-algo-1-wmrbd | GPU available: False, used: False
d7n2uflacp-algo-1-wmrbd | TPU available: False, using: 0 TPU cores
d7n2uflacp-algo-1-wmrbd | HPU available: False, using: 0 HPUs
d7n2uflacp-algo-1-wmrbd | Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
100% 170M/170M [00:42<00:00, 4.03MB/s]
d7n2uflacp-algo-1-wmrbd | Extracting ./data/cifar-10-python.tar.gz to ./data
d7n2uflacp-algo-1-wmrbd | Files already downloaded and verified
d7n2uflacp-algo-1-wmrbd | /usr/local/lib/python3.9/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:654: Checkpoint directory /opt/ml/model exists and is not empty.
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | | Name | Type | Params | Mode
d7n2uflacp-algo-1-wmrbd | -----------------------------------------------------------
d7n2uflacp-algo-1-wmrbd | 0 | layers | ModuleList | 1.9 M | train
d7n2uflacp-algo-1-wmrbd | 1 | global_pool | AdaptiveAvgPool2d | 0 | train
d7n2uflacp-algo-1-wmrbd | 2 | fc | Linear | 2.6 K | train
d7n2uflacp-algo-1-wmrbd | 3 | train_acc | MulticlassAccuracy | 0 | train
d7n2uflacp-algo-1-wmrbd | 4 | val_acc | MulticlassAccuracy | 0 | train
d7n2uflacp-algo-1-wmrbd | -----------------------------------------------------------
d7n2uflacp-algo-1-wmrbd | 1.9 M Trainable params
d7n2uflacp-algo-1-wmrbd | 0 Non-trainable params
d7n2uflacp-algo-1-wmrbd | 1.9 M Total params
d7n2uflacp-algo-1-wmrbd | 7.701 Total estimated model params size (MB)
d7n2uflacp-algo-1-wmrbd | 29 Modules in train mode
d7n2uflacp-algo-1-wmrbd | 0 Modules in eval mode
Sanity Checking DataLoader 0: 100% 2/2 [00:00<00:00, 3.98it/s]
d7n2uflacp-algo-1-wmrbd | Epoch 000 | ValidationAcc=7.81%
Epoch 0: 100% 391/391 [04:25<00:00, 1.48it/s, v_num=0, train_acc=0.525]
Validation: | | 0/? [00:00<?, ?it/s]
Validation: 0% 0/79 [00:00<?, ?it/s]
Validation DataLoader 0: 0% 0/79 [00:00<?, ?it/s]
Validation DataLoader 0: 1% 1/79 [00:00<00:19, 3.90it/s]
Validation DataLoader 0: 3% 2/79 [00:00<00:17, 4.39it/s]
Validation DataLoader 0: 4% 3/79 [00:00<00:16, 4.57it/s]
Validation DataLoader 0: 5% 4/79 [00:00<00:15, 4.71it/s]
Validation DataLoader 0: 6% 5/79 [00:01<00:15, 4.74it/s]
Validation DataLoader 0: 8% 6/79 [00:01<00:15, 4.76it/s]
Validation DataLoader 0: 9% 7/79 [00:01<00:14, 4.80it/s]
Validation DataLoader 0: 10% 8/79 [00:01<00:14, 4.82it/s]
Validation DataLoader 0: 11% 9/79 [00:01<00:14, 4.84it/s]
Validation DataLoader 0: 13% 10/79 [00:02<00:14, 4.85it/s]
Validation DataLoader 0: 14% 11/79 [00:02<00:14, 4.83it/s]
Validation DataLoader 0: 15% 12/79 [00:02<00:13, 4.86it/s]
Validation DataLoader 0: 16% 13/79 [00:02<00:13, 4.87it/s]
Validation DataLoader 0: 18% 14/79 [00:02<00:13, 4.87it/s]
Validation DataLoader 0: 19% 15/79 [00:03<00:13, 4.86it/s]
Validation DataLoader 0: 20% 16/79 [00:03<00:12, 4.85it/s]
Validation DataLoader 0: 22% 17/79 [00:03<00:12, 4.85it/s]
Validation DataLoader 0: 23% 18/79 [00:03<00:12, 4.87it/s]
Validation DataLoader 0: 24% 19/79 [00:03<00:12, 4.88it/s]
Validation DataLoader 0: 25% 20/79 [00:04<00:12, 4.89it/s]
Validation DataLoader 0: 27% 21/79 [00:04<00:11, 4.88it/s]
Validation DataLoader 0: 28% 22/79 [00:04<00:11, 4.89it/s]
Validation DataLoader 0: 29% 23/79 [00:04<00:11, 4.89it/s]
Validation DataLoader 0: 30% 24/79 [00:04<00:11, 4.90it/s]
Validation DataLoader 0: 32% 25/79 [00:05<00:10, 4.92it/s]
Validation DataLoader 0: 33% 26/79 [00:05<00:10, 4.91it/s]
Validation DataLoader 0: 34% 27/79 [00:05<00:10, 4.92it/s]
Validation DataLoader 0: 35% 28/79 [00:05<00:10, 4.92it/s]
Validation DataLoader 0: 37% 29/79 [00:05<00:10, 4.92it/s]
Validation DataLoader 0: 38% 30/79 [00:06<00:09, 4.93it/s]
Validation DataLoader 0: 39% 31/79 [00:06<00:09, 4.93it/s]
Validation DataLoader 0: 41% 32/79 [00:06<00:09, 4.93it/s]
Validation DataLoader 0: 42% 33/79 [00:06<00:09, 4.93it/s]
Validation DataLoader 0: 43% 34/79 [00:06<00:09, 4.93it/s]
Validation DataLoader 0: 44% 35/79 [00:07<00:08, 4.93it/s]
Validation DataLoader 0: 46% 36/79 [00:07<00:08, 4.93it/s]
Validation DataLoader 0: 47% 37/79 [00:07<00:08, 4.93it/s]
Validation DataLoader 0: 48% 38/79 [00:07<00:08, 4.94it/s]
Validation DataLoader 0: 49% 39/79 [00:07<00:08, 4.94it/s]
Validation DataLoader 0: 51% 40/79 [00:08<00:07, 4.95it/s]
Validation DataLoader 0: 52% 41/79 [00:08<00:07, 4.94it/s]
Validation DataLoader 0: 53% 42/79 [00:08<00:07, 4.94it/s]
Validation DataLoader 0: 54% 43/79 [00:08<00:07, 4.93it/s]
Validation DataLoader 0: 56% 44/79 [00:08<00:07, 4.93it/s]
Validation DataLoader 0: 57% 45/79 [00:09<00:06, 4.92it/s]
Validation DataLoader 0: 58% 46/79 [00:09<00:06, 4.92it/s]
Validation DataLoader 0: 59% 47/79 [00:09<00:06, 4.93it/s]
Validation DataLoader 0: 61% 48/79 [00:09<00:06, 4.93it/s]
Validation DataLoader 0: 62% 49/79 [00:09<00:06, 4.93it/s]
Validation DataLoader 0: 63% 50/79 [00:10<00:05, 4.93it/s]
Validation DataLoader 0: 65% 51/79 [00:10<00:05, 4.93it/s]
Validation DataLoader 0: 66% 52/79 [00:10<00:05, 4.93it/s]
Validation DataLoader 0: 67% 53/79 [00:10<00:05, 4.93it/s]
Validation DataLoader 0: 68% 54/79 [00:10<00:05, 4.93it/s]
Validation DataLoader 0: 70% 55/79 [00:11<00:04, 4.92it/s]
Validation DataLoader 0: 71% 56/79 [00:11<00:04, 4.92it/s]
Validation DataLoader 0: 72% 57/79 [00:11<00:04, 4.92it/s]
Validation DataLoader 0: 73% 58/79 [00:11<00:04, 4.94it/s]
Validation DataLoader 0: 75% 59/79 [00:11<00:04, 4.94it/s]
Validation DataLoader 0: 76% 60/79 [00:12<00:03, 4.95it/s]
Validation DataLoader 0: 77% 61/79 [00:12<00:03, 4.95it/s]
Validation DataLoader 0: 78% 62/79 [00:12<00:03, 4.95it/s]
Validation DataLoader 0: 80% 63/79 [00:12<00:03, 4.95it/s]
Validation DataLoader 0: 81% 64/79 [00:12<00:03, 4.95it/s]
Validation DataLoader 0: 82% 65/79 [00:13<00:02, 4.95it/s]
Validation DataLoader 0: 84% 66/79 [00:13<00:02, 4.95it/s]
Validation DataLoader 0: 85% 67/79 [00:13<00:02, 4.95it/s]
Validation DataLoader 0: 86% 68/79 [00:13<00:02, 4.95it/s]
Validation DataLoader 0: 87% 69/79 [00:13<00:02, 4.95it/s]
Validation DataLoader 0: 89% 70/79 [00:14<00:01, 4.95it/s]
Validation DataLoader 0: 90% 71/79 [00:14<00:01, 4.95it/s]
Validation DataLoader 0: 91% 72/79 [00:14<00:01, 4.94it/s]
Validation DataLoader 0: 92% 73/79 [00:14<00:01, 4.95it/s]
Validation DataLoader 0: 94% 74/79 [00:14<00:01, 4.95it/s]
Validation DataLoader 0: 95% 75/79 [00:15<00:00, 4.94it/s]
Validation DataLoader 0: 96% 76/79 [00:15<00:00, 4.94it/s]
Validation DataLoader 0: 97% 77/79 [00:15<00:00, 4.94it/s]
Validation DataLoader 0: 99% 78/79 [00:15<00:00, 4.95it/s]
Validation DataLoader 0: 100% 79/79 [00:15<00:00, 5.00it/s]
d7n2uflacp-algo-1-wmrbd | Epoch 000 | ValidationAcc=51.52%
d7n2uflacp-algo-1-wmrbd |
Epoch 1: 100% 391/391 [04:23<00:00, 1.48it/s, v_num=0, train_acc=0.663, val_acc=0.515]
Validation: | | 0/? [00:00<?, ?it/s]
Validation: 0% 0/79 [00:00<?, ?it/s]
Validation DataLoader 0: 0% 0/79 [00:00<?, ?it/s]
Validation DataLoader 0: 1% 1/79 [00:00<00:20, 3.89it/s]
Validation DataLoader 0: 3% 2/79 [00:00<00:18, 4.16it/s]
Validation DataLoader 0: 4% 3/79 [00:00<00:17, 4.34it/s]
Validation DataLoader 0: 5% 4/79 [00:00<00:16, 4.46it/s]
Validation DataLoader 0: 6% 5/79 [00:01<00:16, 4.51it/s]
Validation DataLoader 0: 8% 6/79 [00:01<00:15, 4.60it/s]
Validation DataLoader 0: 9% 7/79 [00:01<00:15, 4.62it/s]
Validation DataLoader 0: 10% 8/79 [00:01<00:15, 4.64it/s]
Validation DataLoader 0: 11% 9/79 [00:01<00:15, 4.65it/s]
Validation DataLoader 0: 13% 10/79 [00:02<00:14, 4.66it/s]
Validation DataLoader 0: 14% 11/79 [00:02<00:14, 4.67it/s]
Validation DataLoader 0: 15% 12/79 [00:02<00:14, 4.70it/s]
Validation DataLoader 0: 16% 13/79 [00:02<00:14, 4.71it/s]
Validation DataLoader 0: 18% 14/79 [00:02<00:13, 4.72it/s]
Validation DataLoader 0: 19% 15/79 [00:03<00:13, 4.72it/s]
Validation DataLoader 0: 20% 16/79 [00:03<00:13, 4.73it/s]
Validation DataLoader 0: 22% 17/79 [00:03<00:13, 4.75it/s]
Validation DataLoader 0: 23% 18/79 [00:03<00:12, 4.76it/s]
Validation DataLoader 0: 24% 19/79 [00:03<00:12, 4.77it/s]
Validation DataLoader 0: 25% 20/79 [00:04<00:12, 4.77it/s]
Validation DataLoader 0: 27% 21/79 [00:04<00:12, 4.79it/s]
Validation DataLoader 0: 28% 22/79 [00:04<00:11, 4.79it/s]
Validation DataLoader 0: 29% 23/79 [00:04<00:11, 4.79it/s]
Validation DataLoader 0: 30% 24/79 [00:05<00:11, 4.79it/s]
Validation DataLoader 0: 32% 25/79 [00:05<00:11, 4.81it/s]
Validation DataLoader 0: 33% 26/79 [00:05<00:11, 4.81it/s]
Validation DataLoader 0: 34% 27/79 [00:05<00:10, 4.81it/s]
Validation DataLoader 0: 35% 28/79 [00:05<00:10, 4.82it/s]
Validation DataLoader 0: 37% 29/79 [00:06<00:10, 4.82it/s]
Validation DataLoader 0: 38% 30/79 [00:06<00:10, 4.82it/s]
Validation DataLoader 0: 39% 31/79 [00:06<00:09, 4.83it/s]
Validation DataLoader 0: 41% 32/79 [00:06<00:09, 4.83it/s]
Validation DataLoader 0: 42% 33/79 [00:06<00:09, 4.84it/s]
Validation DataLoader 0: 43% 34/79 [00:07<00:09, 4.84it/s]
Validation DataLoader 0: 44% 35/79 [00:07<00:09, 4.84it/s]
Validation DataLoader 0: 46% 36/79 [00:07<00:08, 4.85it/s]
Validation DataLoader 0: 47% 37/79 [00:07<00:08, 4.85it/s]
Validation DataLoader 0: 48% 38/79 [00:07<00:08, 4.85it/s]
Validation DataLoader 0: 49% 39/79 [00:08<00:08, 4.85it/s]
Validation DataLoader 0: 51% 40/79 [00:08<00:08, 4.85it/s]
Validation DataLoader 0: 52% 41/79 [00:08<00:07, 4.85it/s]
Validation DataLoader 0: 53% 42/79 [00:08<00:07, 4.86it/s]
Validation DataLoader 0: 54% 43/79 [00:08<00:07, 4.86it/s]
Validation DataLoader 0: 56% 44/79 [00:09<00:07, 4.86it/s]
Validation DataLoader 0: 57% 45/79 [00:09<00:06, 4.87it/s]
Validation DataLoader 0: 58% 46/79 [00:09<00:06, 4.87it/s]
Validation DataLoader 0: 59% 47/79 [00:09<00:06, 4.87it/s]
Validation DataLoader 0: 61% 48/79 [00:09<00:06, 4.87it/s]
Validation DataLoader 0: 62% 49/79 [00:10<00:06, 4.88it/s]
Validation DataLoader 0: 63% 50/79 [00:10<00:05, 4.87it/s]
Validation DataLoader 0: 65% 51/79 [00:10<00:05, 4.87it/s]
Validation DataLoader 0: 66% 52/79 [00:10<00:05, 4.88it/s]
Validation DataLoader 0: 67% 53/79 [00:10<00:05, 4.88it/s]
Validation DataLoader 0: 68% 54/79 [00:11<00:05, 4.88it/s]
Validation DataLoader 0: 70% 55/79 [00:11<00:04, 4.88it/s]
Validation DataLoader 0: 71% 56/79 [00:11<00:04, 4.88it/s]
Validation DataLoader 0: 72% 57/79 [00:11<00:04, 4.88it/s]
Validation DataLoader 0: 73% 58/79 [00:11<00:04, 4.88it/s]
Validation DataLoader 0: 75% 59/79 [00:12<00:04, 4.87it/s]
Validation DataLoader 0: 76% 60/79 [00:12<00:03, 4.88it/s]
Validation DataLoader 0: 77% 61/79 [00:12<00:03, 4.88it/s]
Validation DataLoader 0: 78% 62/79 [00:12<00:03, 4.88it/s]
Validation DataLoader 0: 80% 63/79 [00:12<00:03, 4.88it/s]
Validation DataLoader 0: 81% 64/79 [00:13<00:03, 4.88it/s]
Validation DataLoader 0: 82% 65/79 [00:13<00:02, 4.88it/s]
Validation DataLoader 0: 84% 66/79 [00:13<00:02, 4.88it/s]
Validation DataLoader 0: 85% 67/79 [00:13<00:02, 4.89it/s]
Validation DataLoader 0: 86% 68/79 [00:13<00:02, 4.89it/s]
Validation DataLoader 0: 87% 69/79 [00:14<00:02, 4.89it/s]
Validation DataLoader 0: 89% 70/79 [00:14<00:01, 4.88it/s]
Validation DataLoader 0: 90% 71/79 [00:14<00:01, 4.86it/s]
Validation DataLoader 0: 91% 72/79 [00:14<00:01, 4.86it/s]
Validation DataLoader 0: 92% 73/79 [00:15<00:01, 4.85it/s]
Validation DataLoader 0: 94% 74/79 [00:15<00:01, 4.85it/s]
Validation DataLoader 0: 95% 75/79 [00:15<00:00, 4.85it/s]
Validation DataLoader 0: 96% 76/79 [00:15<00:00, 4.84it/s]
Validation DataLoader 0: 97% 77/79 [00:15<00:00, 4.84it/s]
Validation DataLoader 0: 99% 78/79 [00:16<00:00, 4.84it/s]
Validation DataLoader 0: 100% 79/79 [00:16<00:00, 4.89it/s]
d7n2uflacp-algo-1-wmrbd | Epoch 001 | ValidationAcc=53.24%
d7n2uflacp-algo-1-wmrbd |
Epoch 2: 100% 391/391 [04:24<00:00, 1.48it/s, v_num=0, train_acc=0.663, val_acc=0.532]
Validation: | | 0/? [00:00<?, ?it/s]
Validation: 0% 0/79 [00:00<?, ?it/s]
Validation DataLoader 0: 0% 0/79 [00:00<?, ?it/s]
Validation DataLoader 0: 1% 1/79 [00:00<00:20, 3.88it/s]
Validation DataLoader 0: 3% 2/79 [00:00<00:18, 4.22it/s]
Validation DataLoader 0: 4% 3/79 [00:00<00:17, 4.36it/s]
Validation DataLoader 0: 5% 4/79 [00:00<00:17, 4.41it/s]
Validation DataLoader 0: 6% 5/79 [00:01<00:16, 4.43it/s]
以下省略
まとめ
ローカル環境で、トレーニングジョブを実行してみました。
ローカルのGPUではない環境では、トレーニングジョブの一部修正が必要でした。