[Machine Learning / Training] Running PyTorch Lightning in a local environment using SageMaker's Estimator API

Posted at 2025-01-19

Purpose

To verify SageMaker's Estimator API, I tried running a "machine learning training script for CIFAR-10 classification (using PyTorch Lightning)" in a WSL environment on Windows. These are my verification notes.

Environment

  • WSL(Ubuntu)
  • Linux DESKTOP-C0A2DE7 5.15.153.1-microsoft-standard-WSL2 #1 SMP Fri Mar 29 23:14:13 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Jupyter notebook (installed and run on Ubuntu in WSL; host-side prerequisites are checked in the sketch below)
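
Before launching local mode from the notebook, I found it useful to confirm that the sagemaker SDK, boto3, and Docker are all reachable from WSL. The snippet below is just a sanity check I would run, not something from the original setup.

import subprocess

import boto3
import sagemaker

# Host-side SDK versions used to drive SageMaker local mode
print("sagemaker:", sagemaker.__version__)
print("boto3:", boto3.__version__)

# Local mode drives Docker on the host, so the daemon must be reachable from WSL
subprocess.run(["docker", "--version"], check=True)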

Main features:

  • Set up a SageMaker Estimator.
  • Run training with a specified PyTorch Lightning script.
  • Save logs and model artifacts to S3.

Container environment for the training job

The Dockerfile used to build the container image is shown below.

FROM nvidia/cuda:11.6.1-cudnn8-devel-ubuntu20.04

ENV TZ=Asia/Tokyo
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

# Install required packages
RUN apt-get update && \
    apt-get install -yq --no-install-recommends \
        software-properties-common && \
    add-apt-repository ppa:deadsnakes/ppa && \
    apt-get update && \
    apt-get install -yq --no-install-recommends python3.9 python3.9-dev python3.9-distutils \
        build-essential \
        wget \
        git \
        unzip \
        curl \
        tzdata && \
    apt-get clean

# Create a symbolic link (python -> python3.9)
RUN ln -sf /usr/bin/python3.9 /usr/bin/python

# Install pip
RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python

# Install required Python packages
RUN pip install --no-cache-dir torch torchvision --extra-index-url https://download.pytorch.org/whl/cu116
RUN pip install --no-cache-dir pytorch-lightning \
        matplotlib \
        scikit-learn \
        timm \
        sagemaker-training \
        tensorboard

Build the image with docker build. You can see that sagemaker.gpu has been created.

@DESKTOP-C0A2DE7:~/09_local_mode/docker$ docker build -t sagemaker.gpu -f Dockerfile_local_sagemaker_gpu .

@DESKTOP-C0A2DE7:~/09_local_mode/docker$ docker images
REPOSITORY                                                         TAG                       IMAGE ID       CREATED          SIZE
sagemaker.gpu                                                      latest                    cce745b2c2e9   37 minutes ago   14.6GB
<none>                                                             <none>                    c881fdb09124   55 minutes ago   14.6GB
ubuntu22.04.python                                                 latest                    a8d5dd4a7c63   2 weeks ago      272MB
ubuntu22.04.vim                                                    latest                    001a4fe504ab   2 weeks ago      196MB
public.ecr.aws/aws-cli/aws-cli                                     latest                    cab764e2e45d   2 weeks ago      418MB
nginx                                                              latest                    f876bfc1cc63   7 weeks ago      192MB
255456539449.dkr.ecr.ap-northeast-1.amazonaws.com/test_container   ubuntu.20.04              6013ae1a63c2   3 months ago     72.8MB
ubuntu                                                             20.04                     6013ae1a63c2   3 months ago     72.8MB
ubuntu                                                             22.04                     97271d29cb79   4 months ago     77.9MB
nvidia/cuda                                                        11.0.3-base-ubuntu20.04   97dfa1ef5ee7   14 months ago    122MB
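
As a quick check that the image contains the expected stack, the following can be run with python inside the container (e.g. by starting the image interactively). This is just a sanity check, not part of the original article.

import torch
import pytorch_lightning

# Versions baked into the image by the Dockerfile above
print("torch:", torch.__version__, "built for CUDA", torch.version.cuda)
print("pytorch_lightning:", pytorch_lightning.__version__)
# On a host without the NVIDIA runtime this is expected to be False
print("cuda available:", torch.cuda.is_available())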

Job script

The Python script that launches the training job is shown below.
The Docker image built above is specified with
 image_uri="sagemaker.gpu:latest"
in the Estimator.

run_local_job.py
import boto3
import sagemaker
from sagemaker.estimator import Estimator

def main():
    # Initialize a Boto3 session (specify the AWS profile)
    boto_session = boto3.Session(profile_name="develop")

    # Create a SageMaker local session
    sagemaker_session = sagemaker.LocalSession(boto_session=boto_session)
    # Get the execution role ARN
    role = sagemaker_session.get_caller_identity_arn()

    # Configure the Estimator
    estimator = Estimator(
        image_uri="sagemaker.gpu:latest",  # Docker image to use
        entry_point="09_cifar.py",  # Script to run
        role=role,  # Execution role
        instance_count=1,  # Number of instances
        instance_type="local",  # Local execution (CPU)
        output_path="s3://takuma-s3-sagemaker-kensho/sagemaker/logs",  # Output S3 bucket
        code_location="s3://takuma-s3-sagemaker-kensho/sagemaker/logs",  # Where the script is uploaded
        sagemaker_session=sagemaker_session,  # SageMaker session
        base_job_name="local-cifar",  # Base job name
        hyperparameters={  # Hyperparameters
            "ckpt_dir": "/opt/ml/model"  # Model save directory (inside the container)
        }
    )
    # Run the training job
    estimator.fit()

if __name__ == "__main__":
    # Run the main function
    main()
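
When fit() returns, SageMaker packages the contents of /opt/ml/model and uploads the artifact under the configured output_path. As an extra check (not part of the original script), the upload location can be printed from estimator.model_data right after the fit() call.

    # Inside main(), after estimator.fit() completes:
    # prints roughly s3://takuma-s3-sagemaker-kensho/sagemaker/logs/<job-name>/output/model.tar.gz
    print(estimator.model_data)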

Machine learning training script for CIFAR-10 classification (using PyTorch Lightning)

09_cifar.py

'''
Machine learning training script for CIFAR-10 classification (using PyTorch Lightning)

This script implements a convolutional neural network (ConvNet) model for classifying the CIFAR-10 dataset.
PyTorch Lightning manages the training and validation loops, logging, and checkpoint saving.

Main features:
- Data preparation with a LightningDataModule (`CIFARModule`).
- A ConvNet model (`ConvNet`) defined as a PyTorch Lightning module.
- Training and validation metrics logged with TensorBoardLogger.
- Checkpoints saved based on validation accuracy.

Usage:
1. Adjust the hyperparameters and paths in the script, or pass them as command-line arguments.
2. Run the script in a Python environment with the required dependencies installed.
'''

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
import torchvision
import torchmetrics
import pytorch_lightning as pl
import argparse
 
class ConvNet(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # List holding the layers of the model
        self.layers = nn.ModuleList()
        for i in range(3):  # Three convolutional blocks
            for j in range(3):  # Three layers per block
                if i == 0 and j == 0:
                    in_chs = 3  # CIFAR-10 images are RGB, so 3 input channels
                elif j == 0:
                    in_chs = 64 * (2**(i-1))  # Double the channel count between blocks
                else:
                    in_chs = 64 * (2**i)  # Constant channel count within a block
                out_chs = 64 * (2**i)
                # Convolution layer
                self.layers.append(nn.Conv2d(in_chs, out_chs, 3, padding=1))
                # Batch normalization layer
                self.layers.append(nn.BatchNorm2d(out_chs))

                # Activation (skipped in the final block and after the last conv of each block)
                if i != 2 and j != 2:
                    self.layers.append(nn.ReLU(inplace=True))
                # Pooling at the end of each block except the final one
                if i != 2 and j == 2:
                    self.layers.append(nn.AvgPool2d(2))

        # Global pooling layer
        self.global_pool = nn.AdaptiveAvgPool2d((1, 1))
        # Fully connected layer (10-class classification)
        self.fc = nn.Linear(256, 10)

        # Accuracy metrics for training and validation
        self.train_acc = torchmetrics.Accuracy(task="multiclass", num_classes=10)
        self.val_acc = torchmetrics.Accuracy(task="multiclass", num_classes=10)

    def forward(self, inputs):
        # Forward pass
        x = inputs
        for l in self.layers:
            x = l(x)
        # Flatten after global pooling
        x = self.global_pool(x).view(x.shape[0], 256)
        # Fully connected layer
        x = self.fc(x)
        return x

    def configure_optimizers(self):
        # Optimizer (SGD) and learning-rate scheduler
        optimizer = torch.optim.SGD(self.parameters(), lr=1e-2, momentum=0.9)
        scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, [60, 80], gamma=0.2)
        return [optimizer], [scheduler]
 
    def training_step(self, train_batch, batch_idx):
        # Process one training batch
        x, y_true = train_batch  # Inputs and labels
        y_pred = self.forward(x)  # Forward pass
        loss = F.cross_entropy(y_pred, y_true)  # Cross-entropy loss
        y_pred_label = torch.argmax(y_pred, dim=-1)  # Predicted labels
        acc = self.train_acc(y_pred_label, y_true)  # Accuracy
        # Log the loss and accuracy
        self.log("train_loss", loss, prog_bar=False, logger=True)
        self.log("train_acc", acc, prog_bar=True, logger=True)
        return loss

    def validation_step(self, val_batch, batch_idx):
        # Process one validation batch
        x, y_true = val_batch  # Inputs and labels
        y_pred = self.forward(x)  # Forward pass
        loss = F.cross_entropy(y_pred, y_true)  # Cross-entropy loss
        y_pred_label = torch.argmax(y_pred, dim=-1)  # Predicted labels
        acc = self.val_acc(y_pred_label, y_true)  # Accuracy
        # Log the loss and accuracy
        self.log("val_loss", loss, prog_bar=False, logger=True)
        self.log("val_acc", acc, prog_bar=True, logger=True)

    def on_validation_epoch_end(self):
        # At the end of each epoch, print the validation accuracy and reset the metric
        print(f"\nEpoch {self.current_epoch:03} | ValidationAcc={self.val_acc.compute():.2%}")
        self.val_acc.reset()
        
class CIFARModule(pl.LightningDataModule):
    def __init__(self, opt):
        super().__init__()
        self.opt = opt

    def prepare_data(self):
        # Prepare the CIFAR-10 datasets
        self.train_dataset = torchvision.datasets.CIFAR10(
            self.opt.data_dir, train=True, download=True,
            transform=torchvision.transforms.Compose([
                torchvision.transforms.RandomCrop((32, 32), padding=2),  # Random crop
                torchvision.transforms.RandomHorizontalFlip(),  # Random horizontal flip
                torchvision.transforms.ToTensor()  # Convert to tensor
            ]))
        self.val_dataset = torchvision.datasets.CIFAR10(
            self.opt.data_dir, train=False, download=True,
            transform=torchvision.transforms.ToTensor())

    def train_dataloader(self):
        # DataLoader for training data
        return DataLoader(self.train_dataset, batch_size=128, num_workers=4, shuffle=True)

    def val_dataloader(self):
        # DataLoader for validation data
        return DataLoader(self.val_dataset, batch_size=128, num_workers=4, shuffle=False)
 
def main(opt):
    # Initialize the ConvNet model
    model = ConvNet()
    # Initialize the CIFAR-10 data module
    data = CIFARModule(opt)
    # Set up the TensorBoardLogger
    logger = pl.loggers.TensorBoardLogger(f"{opt.ckpt_dir}/logs")

    # Model checkpoint settings
    ckpt = pl.callbacks.ModelCheckpoint(
        monitor="val_acc",  # Metric to monitor
        dirpath=opt.ckpt_dir,  # Save directory
        filename="cifar-10-{epoch:03d}-{val_acc:.4f}",  # Checkpoint file name
        save_top_k=3,  # Keep the top 3 models
        mode="max"  # Higher is better
    )

    # Trainer settings
    trainer = pl.Trainer(
        max_epochs=100,  # Maximum number of epochs
        callbacks=[ckpt],  # Callbacks
        accelerator="auto",  # Automatically detect available devices
        devices="auto",  # Use available GPUs or CPUs
        logger=logger  # Logger
    )
    # Run training
    trainer.fit(model, data)

if __name__ == "__main__":
    # Command-line arguments
    parser = argparse.ArgumentParser(description="CIFAR-10 Training")
    parser.add_argument("--data_dir", type=str, default="./data")  # Where to store the dataset
    parser.add_argument("--ckpt_dir", type=str, default="./ckpt")  # Where to save checkpoints

    opt = parser.parse_args()

    # Run the main function
    main(opt)
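
For reference, a checkpoint saved by the ModelCheckpoint callback above can be loaded back for a quick inference check with PyTorch Lightning's load_from_checkpoint. The sketch below assumes ConvNet from the script above is in scope, and the checkpoint path is just a placeholder example.

import torch

# Load a saved checkpoint back into the model (path is a placeholder example)
model = ConvNet.load_from_checkpoint("./ckpt/cifar-10-epoch=001-val_acc=0.5324.ckpt")
model.eval()

# Run a single fake CIFAR-10-sized image through the network
with torch.no_grad():
    dummy = torch.randn(1, 3, 32, 32)
    logits = model(dummy)
    print("predicted class:", logits.argmax(dim=-1).item())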

Execution results

time="2025-01-19T22:43:48+09:00" level=warning msg="/tmp/tmpm1ue0tm7/docker-compose.yaml: the attribute version is obsolete, it will be ignored, please remove it to avoid potential confusion"
time="2025-01-19T22:43:48+09:00" level=warning msg="a network with name sagemaker-local exists but was not created for project "tmpm1ue0tm7".\nSet external: true to use an existing network"
Container d7n2uflacp-algo-1-wmrbd Creating
Container d7n2uflacp-algo-1-wmrbd Created
Attaching to d7n2uflacp-algo-1-wmrbd
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | ==========
d7n2uflacp-algo-1-wmrbd | == CUDA ==
d7n2uflacp-algo-1-wmrbd | ==========
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | CUDA Version 11.6.1
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
d7n2uflacp-algo-1-wmrbd | By pulling and using the container, you accept the terms and conditions of this license:
d7n2uflacp-algo-1-wmrbd | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
d7n2uflacp-algo-1-wmrbd | Use the NVIDIA Container Toolkit to start this container with GPU support; see
d7n2uflacp-algo-1-wmrbd | https://docs.nvidia.com/datacenter/cloud-native/ .
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | *************************
d7n2uflacp-algo-1-wmrbd | ** DEPRECATION NOTICE! **
d7n2uflacp-algo-1-wmrbd | *************************
d7n2uflacp-algo-1-wmrbd | THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
d7n2uflacp-algo-1-wmrbd | https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,443 botocore.credentials INFO Found credentials in environment variables.
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,652 botocore.httpchecksum INFO Skipping checksum validation. Response did not contain one of the following algorithms: ['crc32', 'sha1', 'sha256'].
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,656 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,656 sagemaker-training-toolkit INFO Failed to parse hyperparameter ckpt_dir value /opt/ml/model to Json.
d7n2uflacp-algo-1-wmrbd | Returning the value itself
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,658 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,663 sagemaker-training-toolkit INFO instance_groups entry not present in resource_config
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,669 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,669 sagemaker-training-toolkit INFO Failed to parse hyperparameter ckpt_dir value /opt/ml/model to Json.
d7n2uflacp-algo-1-wmrbd | Returning the value itself
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,670 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,677 sagemaker-training-toolkit INFO instance_groups entry not present in resource_config
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,682 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,682 sagemaker-training-toolkit INFO Failed to parse hyperparameter ckpt_dir value /opt/ml/model to Json.
d7n2uflacp-algo-1-wmrbd | Returning the value itself
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,683 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,688 sagemaker-training-toolkit INFO instance_groups entry not present in resource_config
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,692 sagemaker-training-toolkit INFO Invoking user script
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | Training Env:
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | {
d7n2uflacp-algo-1-wmrbd | "additional_framework_parameters": {},
d7n2uflacp-algo-1-wmrbd | "channel_input_dirs": {},
d7n2uflacp-algo-1-wmrbd | "current_host": "algo-1-wmrbd",
d7n2uflacp-algo-1-wmrbd | "current_instance_group": "homogeneousCluster",
d7n2uflacp-algo-1-wmrbd | "current_instance_group_hosts": [],
d7n2uflacp-algo-1-wmrbd | "current_instance_type": "local",
d7n2uflacp-algo-1-wmrbd | "distribution_hosts": [
d7n2uflacp-algo-1-wmrbd | "algo-1-wmrbd"
d7n2uflacp-algo-1-wmrbd | ],
d7n2uflacp-algo-1-wmrbd | "distribution_instance_groups": [],
d7n2uflacp-algo-1-wmrbd | "framework_module": null,
d7n2uflacp-algo-1-wmrbd | "hosts": [
d7n2uflacp-algo-1-wmrbd | "algo-1-wmrbd"
d7n2uflacp-algo-1-wmrbd | ],
d7n2uflacp-algo-1-wmrbd | "hyperparameters": {
d7n2uflacp-algo-1-wmrbd | "ckpt_dir": "/opt/ml/model"
d7n2uflacp-algo-1-wmrbd | },
d7n2uflacp-algo-1-wmrbd | "input_config_dir": "/opt/ml/input/config",
d7n2uflacp-algo-1-wmrbd | "input_data_config": {},
d7n2uflacp-algo-1-wmrbd | "input_dir": "/opt/ml/input",
d7n2uflacp-algo-1-wmrbd | "instance_groups": [],
d7n2uflacp-algo-1-wmrbd | "instance_groups_dict": {},
d7n2uflacp-algo-1-wmrbd | "is_hetero": false,
d7n2uflacp-algo-1-wmrbd | "is_master": true,
d7n2uflacp-algo-1-wmrbd | "is_modelparallel_enabled": null,
d7n2uflacp-algo-1-wmrbd | "is_smddpmprun_installed": false,
d7n2uflacp-algo-1-wmrbd | "is_smddprun_installed": false,
d7n2uflacp-algo-1-wmrbd | "job_name": "local-cifar-2025-01-19-13-43-47-954",
d7n2uflacp-algo-1-wmrbd | "log_level": 20,
d7n2uflacp-algo-1-wmrbd | "master_hostname": "algo-1-wmrbd",
d7n2uflacp-algo-1-wmrbd | "model_dir": "/opt/ml/model",
d7n2uflacp-algo-1-wmrbd | "module_dir": "s3://takuma-s3-sagemaker-kensho/sagemaker/logs/local-cifar-2025-01-19-13-43-47-954/source/sourcedir.tar.gz",
d7n2uflacp-algo-1-wmrbd | "module_name": "09_cifar",
d7n2uflacp-algo-1-wmrbd | "network_interface_name": "eth0",
d7n2uflacp-algo-1-wmrbd | "num_cpus": 16,
d7n2uflacp-algo-1-wmrbd | "num_gpus": 0,
d7n2uflacp-algo-1-wmrbd | "num_neurons": 0,
d7n2uflacp-algo-1-wmrbd | "output_data_dir": "/opt/ml/output/data",
d7n2uflacp-algo-1-wmrbd | "output_dir": "/opt/ml/output",
d7n2uflacp-algo-1-wmrbd | "output_intermediate_dir": "/opt/ml/output/intermediate",
d7n2uflacp-algo-1-wmrbd | "resource_config": {
d7n2uflacp-algo-1-wmrbd | "current_host": "algo-1-wmrbd",
d7n2uflacp-algo-1-wmrbd | "hosts": [
d7n2uflacp-algo-1-wmrbd | "algo-1-wmrbd"
d7n2uflacp-algo-1-wmrbd | ]
d7n2uflacp-algo-1-wmrbd | },
d7n2uflacp-algo-1-wmrbd | "user_entry_point": "09_cifar.py"
d7n2uflacp-algo-1-wmrbd | }
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | Environment variables:
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | SM_HOSTS=["algo-1-wmrbd"]
d7n2uflacp-algo-1-wmrbd | SM_NETWORK_INTERFACE_NAME=eth0
d7n2uflacp-algo-1-wmrbd | SM_HPS={"ckpt_dir":"/opt/ml/model"}
d7n2uflacp-algo-1-wmrbd | SM_USER_ENTRY_POINT=09_cifar.py
d7n2uflacp-algo-1-wmrbd | SM_FRAMEWORK_PARAMS={}
d7n2uflacp-algo-1-wmrbd | SM_RESOURCE_CONFIG={"current_host":"algo-1-wmrbd","hosts":["algo-1-wmrbd"]}
d7n2uflacp-algo-1-wmrbd | SM_INPUT_DATA_CONFIG={}
d7n2uflacp-algo-1-wmrbd | SM_OUTPUT_DATA_DIR=/opt/ml/output/data
d7n2uflacp-algo-1-wmrbd | SM_CHANNELS=[]
d7n2uflacp-algo-1-wmrbd | SM_CURRENT_HOST=algo-1-wmrbd
d7n2uflacp-algo-1-wmrbd | SM_CURRENT_INSTANCE_TYPE=local
d7n2uflacp-algo-1-wmrbd | SM_CURRENT_INSTANCE_GROUP=homogeneousCluster
d7n2uflacp-algo-1-wmrbd | SM_CURRENT_INSTANCE_GROUP_HOSTS=[]
d7n2uflacp-algo-1-wmrbd | SM_INSTANCE_GROUPS=[]
d7n2uflacp-algo-1-wmrbd | SM_INSTANCE_GROUPS_DICT={}
d7n2uflacp-algo-1-wmrbd | SM_DISTRIBUTION_INSTANCE_GROUPS=[]
d7n2uflacp-algo-1-wmrbd | SM_IS_HETERO=false
d7n2uflacp-algo-1-wmrbd | SM_MODULE_NAME=09_cifar
d7n2uflacp-algo-1-wmrbd | SM_LOG_LEVEL=20
d7n2uflacp-algo-1-wmrbd | SM_FRAMEWORK_MODULE=
d7n2uflacp-algo-1-wmrbd | SM_INPUT_DIR=/opt/ml/input
d7n2uflacp-algo-1-wmrbd | SM_INPUT_CONFIG_DIR=/opt/ml/input/config
d7n2uflacp-algo-1-wmrbd | SM_OUTPUT_DIR=/opt/ml/output
d7n2uflacp-algo-1-wmrbd | SM_NUM_CPUS=16
d7n2uflacp-algo-1-wmrbd | SM_NUM_GPUS=0
d7n2uflacp-algo-1-wmrbd | SM_NUM_NEURONS=0
d7n2uflacp-algo-1-wmrbd | SM_MODEL_DIR=/opt/ml/model
d7n2uflacp-algo-1-wmrbd | SM_MODULE_DIR=s3://takuma-s3-sagemaker-kensho/sagemaker/logs/local-cifar-2025-01-19-13-43-47-954/source/sourcedir.tar.gz
d7n2uflacp-algo-1-wmrbd | SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{},"current_host":"algo-1-wmrbd","current_instance_group":"homogeneousCluster","current_instance_group_hosts":[],"current_instance_type":"local","distribution_hosts":["algo-1-wmrbd"],"distribution_instance_groups":[],"framework_module":null,"hosts":["algo-1-wmrbd"],"hyperparameters":{"ckpt_dir":"/opt/ml/model"},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","instance_groups":[],"instance_groups_dict":{},"is_hetero":false,"is_master":true,"is_modelparallel_enabled":null,"is_smddpmprun_installed":false,"is_smddprun_installed":false,"job_name":"local-cifar-2025-01-19-13-43-47-954","log_level":20,"master_hostname":"algo-1-wmrbd","model_dir":"/opt/ml/model","module_dir":"s3://takuma-s3-sagemaker-kensho/sagemaker/logs/local-cifar-2025-01-19-13-43-47-954/source/sourcedir.tar.gz","module_name":"09_cifar","network_interface_name":"eth0","num_cpus":16,"num_gpus":0,"num_neurons":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1-wmrbd","hosts":["algo-1-wmrbd"]},"user_entry_point":"09_cifar.py"}
d7n2uflacp-algo-1-wmrbd | SM_USER_ARGS=["--ckpt_dir","/opt/ml/model"]
d7n2uflacp-algo-1-wmrbd | SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
d7n2uflacp-algo-1-wmrbd | SM_HP_CKPT_DIR=/opt/ml/model
d7n2uflacp-algo-1-wmrbd | PYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/lib/python39.zip:/usr/lib/python3.9:/usr/lib/python3.9/lib-dynload:/usr/local/lib/python3.9/dist-packages:/usr/lib/python3/dist-packages
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | Invoking script with the following command:
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | /usr/bin/python 09_cifar.py --ckpt_dir /opt/ml/model
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,692 sagemaker-training-toolkit INFO Exceptions not imported for SageMaker Debugger as it is not installed.
d7n2uflacp-algo-1-wmrbd | 2025-01-19 22:43:49,692 sagemaker-training-toolkit INFO Exceptions not imported for SageMaker TF as Tensorflow is not installed.
d7n2uflacp-algo-1-wmrbd | GPU available: False, used: False
d7n2uflacp-algo-1-wmrbd | TPU available: False, using: 0 TPU cores
d7n2uflacp-algo-1-wmrbd | HPU available: False, using: 0 HPUs
d7n2uflacp-algo-1-wmrbd | Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
100% 170M/170M [00:42<00:00, 4.03MB/s]
d7n2uflacp-algo-1-wmrbd | Extracting ./data/cifar-10-python.tar.gz to ./data
d7n2uflacp-algo-1-wmrbd | Files already downloaded and verified
d7n2uflacp-algo-1-wmrbd | /usr/local/lib/python3.9/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:654: Checkpoint directory /opt/ml/model exists and is not empty.
d7n2uflacp-algo-1-wmrbd |
d7n2uflacp-algo-1-wmrbd | | Name | Type | Params | Mode
d7n2uflacp-algo-1-wmrbd | -----------------------------------------------------------
d7n2uflacp-algo-1-wmrbd | 0 | layers | ModuleList | 1.9 M | train
d7n2uflacp-algo-1-wmrbd | 1 | global_pool | AdaptiveAvgPool2d | 0 | train
d7n2uflacp-algo-1-wmrbd | 2 | fc | Linear | 2.6 K | train
d7n2uflacp-algo-1-wmrbd | 3 | train_acc | MulticlassAccuracy | 0 | train
d7n2uflacp-algo-1-wmrbd | 4 | val_acc | MulticlassAccuracy | 0 | train
d7n2uflacp-algo-1-wmrbd | -----------------------------------------------------------
d7n2uflacp-algo-1-wmrbd | 1.9 M Trainable params
d7n2uflacp-algo-1-wmrbd | 0 Non-trainable params
d7n2uflacp-algo-1-wmrbd | 1.9 M Total params
d7n2uflacp-algo-1-wmrbd | 7.701 Total estimated model params size (MB)
d7n2uflacp-algo-1-wmrbd | 29 Modules in train mode
d7n2uflacp-algo-1-wmrbd | 0 Modules in eval mode
Sanity Checking DataLoader 0: 100% 2/2 [00:00<00:00, 3.98it/s]
d7n2uflacp-algo-1-wmrbd | Epoch 000 | ValidationAcc=7.81%
Epoch 0: 100% 391/391 [04:25<00:00, 1.48it/s, v_num=0, train_acc=0.525]
Validation: | | 0/? [00:00<?, ?it/s]
Validation: 0% 0/79 [00:00<?, ?it/s]
Validation DataLoader 0: 0% 0/79 [00:00<?, ?it/s]
Validation DataLoader 0: 1% 1/79 [00:00<00:19, 3.90it/s]
Validation DataLoader 0: 3% 2/79 [00:00<00:17, 4.39it/s]
Validation DataLoader 0: 4% 3/79 [00:00<00:16, 4.57it/s]
Validation DataLoader 0: 5% 4/79 [00:00<00:15, 4.71it/s]
Validation DataLoader 0: 6% 5/79 [00:01<00:15, 4.74it/s]
Validation DataLoader 0: 8% 6/79 [00:01<00:15, 4.76it/s]
Validation DataLoader 0: 9% 7/79 [00:01<00:14, 4.80it/s]
Validation DataLoader 0: 10% 8/79 [00:01<00:14, 4.82it/s]
Validation DataLoader 0: 11% 9/79 [00:01<00:14, 4.84it/s]
Validation DataLoader 0: 13% 10/79 [00:02<00:14, 4.85it/s]
Validation DataLoader 0: 14% 11/79 [00:02<00:14, 4.83it/s]
Validation DataLoader 0: 15% 12/79 [00:02<00:13, 4.86it/s]
Validation DataLoader 0: 16% 13/79 [00:02<00:13, 4.87it/s]
Validation DataLoader 0: 18% 14/79 [00:02<00:13, 4.87it/s]
Validation DataLoader 0: 19% 15/79 [00:03<00:13, 4.86it/s]
Validation DataLoader 0: 20% 16/79 [00:03<00:12, 4.85it/s]
Validation DataLoader 0: 22% 17/79 [00:03<00:12, 4.85it/s]
Validation DataLoader 0: 23% 18/79 [00:03<00:12, 4.87it/s]
Validation DataLoader 0: 24% 19/79 [00:03<00:12, 4.88it/s]
Validation DataLoader 0: 25% 20/79 [00:04<00:12, 4.89it/s]
Validation DataLoader 0: 27% 21/79 [00:04<00:11, 4.88it/s]
Validation DataLoader 0: 28% 22/79 [00:04<00:11, 4.89it/s]
Validation DataLoader 0: 29% 23/79 [00:04<00:11, 4.89it/s]
Validation DataLoader 0: 30% 24/79 [00:04<00:11, 4.90it/s]
Validation DataLoader 0: 32% 25/79 [00:05<00:10, 4.92it/s]
Validation DataLoader 0: 33% 26/79 [00:05<00:10, 4.91it/s]
Validation DataLoader 0: 34% 27/79 [00:05<00:10, 4.92it/s]
Validation DataLoader 0: 35% 28/79 [00:05<00:10, 4.92it/s]
Validation DataLoader 0: 37% 29/79 [00:05<00:10, 4.92it/s]
Validation DataLoader 0: 38% 30/79 [00:06<00:09, 4.93it/s]
Validation DataLoader 0: 39% 31/79 [00:06<00:09, 4.93it/s]
Validation DataLoader 0: 41% 32/79 [00:06<00:09, 4.93it/s]
Validation DataLoader 0: 42% 33/79 [00:06<00:09, 4.93it/s]
Validation DataLoader 0: 43% 34/79 [00:06<00:09, 4.93it/s]
Validation DataLoader 0: 44% 35/79 [00:07<00:08, 4.93it/s]
Validation DataLoader 0: 46% 36/79 [00:07<00:08, 4.93it/s]
Validation DataLoader 0: 47% 37/79 [00:07<00:08, 4.93it/s]
Validation DataLoader 0: 48% 38/79 [00:07<00:08, 4.94it/s]
Validation DataLoader 0: 49% 39/79 [00:07<00:08, 4.94it/s]
Validation DataLoader 0: 51% 40/79 [00:08<00:07, 4.95it/s]
Validation DataLoader 0: 52% 41/79 [00:08<00:07, 4.94it/s]
Validation DataLoader 0: 53% 42/79 [00:08<00:07, 4.94it/s]
Validation DataLoader 0: 54% 43/79 [00:08<00:07, 4.93it/s]
Validation DataLoader 0: 56% 44/79 [00:08<00:07, 4.93it/s]
Validation DataLoader 0: 57% 45/79 [00:09<00:06, 4.92it/s]
Validation DataLoader 0: 58% 46/79 [00:09<00:06, 4.92it/s]
Validation DataLoader 0: 59% 47/79 [00:09<00:06, 4.93it/s]
Validation DataLoader 0: 61% 48/79 [00:09<00:06, 4.93it/s]
Validation DataLoader 0: 62% 49/79 [00:09<00:06, 4.93it/s]
Validation DataLoader 0: 63% 50/79 [00:10<00:05, 4.93it/s]
Validation DataLoader 0: 65% 51/79 [00:10<00:05, 4.93it/s]
Validation DataLoader 0: 66% 52/79 [00:10<00:05, 4.93it/s]
Validation DataLoader 0: 67% 53/79 [00:10<00:05, 4.93it/s]
Validation DataLoader 0: 68% 54/79 [00:10<00:05, 4.93it/s]
Validation DataLoader 0: 70% 55/79 [00:11<00:04, 4.92it/s]
Validation DataLoader 0: 71% 56/79 [00:11<00:04, 4.92it/s]
Validation DataLoader 0: 72% 57/79 [00:11<00:04, 4.92it/s]
Validation DataLoader 0: 73% 58/79 [00:11<00:04, 4.94it/s]
Validation DataLoader 0: 75% 59/79 [00:11<00:04, 4.94it/s]
Validation DataLoader 0: 76% 60/79 [00:12<00:03, 4.95it/s]
Validation DataLoader 0: 77% 61/79 [00:12<00:03, 4.95it/s]
Validation DataLoader 0: 78% 62/79 [00:12<00:03, 4.95it/s]
Validation DataLoader 0: 80% 63/79 [00:12<00:03, 4.95it/s]
Validation DataLoader 0: 81% 64/79 [00:12<00:03, 4.95it/s]
Validation DataLoader 0: 82% 65/79 [00:13<00:02, 4.95it/s]
Validation DataLoader 0: 84% 66/79 [00:13<00:02, 4.95it/s]
Validation DataLoader 0: 85% 67/79 [00:13<00:02, 4.95it/s]
Validation DataLoader 0: 86% 68/79 [00:13<00:02, 4.95it/s]
Validation DataLoader 0: 87% 69/79 [00:13<00:02, 4.95it/s]
Validation DataLoader 0: 89% 70/79 [00:14<00:01, 4.95it/s]
Validation DataLoader 0: 90% 71/79 [00:14<00:01, 4.95it/s]
Validation DataLoader 0: 91% 72/79 [00:14<00:01, 4.94it/s]
Validation DataLoader 0: 92% 73/79 [00:14<00:01, 4.95it/s]
Validation DataLoader 0: 94% 74/79 [00:14<00:01, 4.95it/s]
Validation DataLoader 0: 95% 75/79 [00:15<00:00, 4.94it/s]
Validation DataLoader 0: 96% 76/79 [00:15<00:00, 4.94it/s]
Validation DataLoader 0: 97% 77/79 [00:15<00:00, 4.94it/s]
Validation DataLoader 0: 99% 78/79 [00:15<00:00, 4.95it/s]
Validation DataLoader 0: 100% 79/79 [00:15<00:00, 5.00it/s]
d7n2uflacp-algo-1-wmrbd | Epoch 000 | ValidationAcc=51.52%
d7n2uflacp-algo-1-wmrbd |
Epoch 1: 100% 391/391 [04:23<00:00, 1.48it/s, v_num=0, train_acc=0.663, val_acc=0.515]
Validation: | | 0/? [00:00<?, ?it/s]
Validation: 0% 0/79 [00:00<?, ?it/s]
Validation DataLoader 0: 0% 0/79 [00:00<?, ?it/s]
Validation DataLoader 0: 1% 1/79 [00:00<00:20, 3.89it/s]
Validation DataLoader 0: 3% 2/79 [00:00<00:18, 4.16it/s]
Validation DataLoader 0: 4% 3/79 [00:00<00:17, 4.34it/s]
Validation DataLoader 0: 5% 4/79 [00:00<00:16, 4.46it/s]
Validation DataLoader 0: 6% 5/79 [00:01<00:16, 4.51it/s]
Validation DataLoader 0: 8% 6/79 [00:01<00:15, 4.60it/s]
Validation DataLoader 0: 9% 7/79 [00:01<00:15, 4.62it/s]
Validation DataLoader 0: 10% 8/79 [00:01<00:15, 4.64it/s]
Validation DataLoader 0: 11% 9/79 [00:01<00:15, 4.65it/s]
Validation DataLoader 0: 13% 10/79 [00:02<00:14, 4.66it/s]
Validation DataLoader 0: 14% 11/79 [00:02<00:14, 4.67it/s]
Validation DataLoader 0: 15% 12/79 [00:02<00:14, 4.70it/s]
Validation DataLoader 0: 16% 13/79 [00:02<00:14, 4.71it/s]
Validation DataLoader 0: 18% 14/79 [00:02<00:13, 4.72it/s]
Validation DataLoader 0: 19% 15/79 [00:03<00:13, 4.72it/s]
Validation DataLoader 0: 20% 16/79 [00:03<00:13, 4.73it/s]
Validation DataLoader 0: 22% 17/79 [00:03<00:13, 4.75it/s]
Validation DataLoader 0: 23% 18/79 [00:03<00:12, 4.76it/s]
Validation DataLoader 0: 24% 19/79 [00:03<00:12, 4.77it/s]
Validation DataLoader 0: 25% 20/79 [00:04<00:12, 4.77it/s]
Validation DataLoader 0: 27% 21/79 [00:04<00:12, 4.79it/s]
Validation DataLoader 0: 28% 22/79 [00:04<00:11, 4.79it/s]
Validation DataLoader 0: 29% 23/79 [00:04<00:11, 4.79it/s]
Validation DataLoader 0: 30% 24/79 [00:05<00:11, 4.79it/s]
Validation DataLoader 0: 32% 25/79 [00:05<00:11, 4.81it/s]
Validation DataLoader 0: 33% 26/79 [00:05<00:11, 4.81it/s]
Validation DataLoader 0: 34% 27/79 [00:05<00:10, 4.81it/s]
Validation DataLoader 0: 35% 28/79 [00:05<00:10, 4.82it/s]
Validation DataLoader 0: 37% 29/79 [00:06<00:10, 4.82it/s]
Validation DataLoader 0: 38% 30/79 [00:06<00:10, 4.82it/s]
Validation DataLoader 0: 39% 31/79 [00:06<00:09, 4.83it/s]
Validation DataLoader 0: 41% 32/79 [00:06<00:09, 4.83it/s]
Validation DataLoader 0: 42% 33/79 [00:06<00:09, 4.84it/s]
Validation DataLoader 0: 43% 34/79 [00:07<00:09, 4.84it/s]
Validation DataLoader 0: 44% 35/79 [00:07<00:09, 4.84it/s]
Validation DataLoader 0: 46% 36/79 [00:07<00:08, 4.85it/s]
Validation DataLoader 0: 47% 37/79 [00:07<00:08, 4.85it/s]
Validation DataLoader 0: 48% 38/79 [00:07<00:08, 4.85it/s]
Validation DataLoader 0: 49% 39/79 [00:08<00:08, 4.85it/s]
Validation DataLoader 0: 51% 40/79 [00:08<00:08, 4.85it/s]
Validation DataLoader 0: 52% 41/79 [00:08<00:07, 4.85it/s]
Validation DataLoader 0: 53% 42/79 [00:08<00:07, 4.86it/s]
Validation DataLoader 0: 54% 43/79 [00:08<00:07, 4.86it/s]
Validation DataLoader 0: 56% 44/79 [00:09<00:07, 4.86it/s]
Validation DataLoader 0: 57% 45/79 [00:09<00:06, 4.87it/s]
Validation DataLoader 0: 58% 46/79 [00:09<00:06, 4.87it/s]
Validation DataLoader 0: 59% 47/79 [00:09<00:06, 4.87it/s]
Validation DataLoader 0: 61% 48/79 [00:09<00:06, 4.87it/s]
Validation DataLoader 0: 62% 49/79 [00:10<00:06, 4.88it/s]
Validation DataLoader 0: 63% 50/79 [00:10<00:05, 4.87it/s]
Validation DataLoader 0: 65% 51/79 [00:10<00:05, 4.87it/s]
Validation DataLoader 0: 66% 52/79 [00:10<00:05, 4.88it/s]
Validation DataLoader 0: 67% 53/79 [00:10<00:05, 4.88it/s]
Validation DataLoader 0: 68% 54/79 [00:11<00:05, 4.88it/s]
Validation DataLoader 0: 70% 55/79 [00:11<00:04, 4.88it/s]
Validation DataLoader 0: 71% 56/79 [00:11<00:04, 4.88it/s]
Validation DataLoader 0: 72% 57/79 [00:11<00:04, 4.88it/s]
Validation DataLoader 0: 73% 58/79 [00:11<00:04, 4.88it/s]
Validation DataLoader 0: 75% 59/79 [00:12<00:04, 4.87it/s]
Validation DataLoader 0: 76% 60/79 [00:12<00:03, 4.88it/s]
Validation DataLoader 0: 77% 61/79 [00:12<00:03, 4.88it/s]
Validation DataLoader 0: 78% 62/79 [00:12<00:03, 4.88it/s]
Validation DataLoader 0: 80% 63/79 [00:12<00:03, 4.88it/s]
Validation DataLoader 0: 81% 64/79 [00:13<00:03, 4.88it/s]
Validation DataLoader 0: 82% 65/79 [00:13<00:02, 4.88it/s]
Validation DataLoader 0: 84% 66/79 [00:13<00:02, 4.88it/s]
Validation DataLoader 0: 85% 67/79 [00:13<00:02, 4.89it/s]
Validation DataLoader 0: 86% 68/79 [00:13<00:02, 4.89it/s]
Validation DataLoader 0: 87% 69/79 [00:14<00:02, 4.89it/s]
Validation DataLoader 0: 89% 70/79 [00:14<00:01, 4.88it/s]
Validation DataLoader 0: 90% 71/79 [00:14<00:01, 4.86it/s]
Validation DataLoader 0: 91% 72/79 [00:14<00:01, 4.86it/s]
Validation DataLoader 0: 92% 73/79 [00:15<00:01, 4.85it/s]
Validation DataLoader 0: 94% 74/79 [00:15<00:01, 4.85it/s]
Validation DataLoader 0: 95% 75/79 [00:15<00:00, 4.85it/s]
Validation DataLoader 0: 96% 76/79 [00:15<00:00, 4.84it/s]
Validation DataLoader 0: 97% 77/79 [00:15<00:00, 4.84it/s]
Validation DataLoader 0: 99% 78/79 [00:16<00:00, 4.84it/s]
Validation DataLoader 0: 100% 79/79 [00:16<00:00, 4.89it/s]
d7n2uflacp-algo-1-wmrbd | Epoch 001 | ValidationAcc=53.24%
d7n2uflacp-algo-1-wmrbd |
Epoch 2: 100% 391/391 [04:24<00:00, 1.48it/s, v_num=0, train_acc=0.663, val_acc=0.532]
Validation: | | 0/? [00:00<?, ?it/s]
Validation: 0% 0/79 [00:00<?, ?it/s]
Validation DataLoader 0: 0% 0/79 [00:00<?, ?it/s]
Validation DataLoader 0: 1% 1/79 [00:00<00:20, 3.88it/s]
Validation DataLoader 0: 3% 2/79 [00:00<00:18, 4.22it/s]
Validation DataLoader 0: 4% 3/79 [00:00<00:17, 4.36it/s]
Validation DataLoader 0: 5% 4/79 [00:00<00:17, 4.41it/s]
Validation DataLoader 0: 6% 5/79 [00:01<00:16, 4.43it/s]
(remaining output omitted)

Summary

I ran a SageMaker training job in a local environment.
Because the local environment does not have a GPU, the training job needed some minor modifications.
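
For reference, if the host has a GPU and the NVIDIA Container Toolkit set up, SageMaker local mode can target it by changing only the instance type. This is a sketch I have not verified in this article's environment; role and sagemaker_session are the same as in run_local_job.py.

    # Same Estimator as run_local_job.py, but targeting the local GPU
    estimator = Estimator(
        image_uri="sagemaker.gpu:latest",
        entry_point="09_cifar.py",
        role=role,
        instance_count=1,
        instance_type="local_gpu",  # "local_gpu" instead of "local"
        sagemaker_session=sagemaker_session,
        base_job_name="local-cifar",
        hyperparameters={"ckpt_dir": "/opt/ml/model"}
    )
    estimator.fit()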
