Running SDXL from the Hugging Face Hub with SageMaker AI Asynchronous Inference

Last updated at 2025-08-30, posted at 2025-08-28

Introduction

I deployed Stability AI's Stable Diffusion XL (SDXL), published on the Hugging Face Hub, to a SageMaker AI asynchronous inference endpoint and tested it. This article shares the steps I followed.

Terminology

Overview of the Hugging Face Hub

The Hugging Face Hub is an open platform for sharing and using models, datasets, and more. In this article we deploy stabilityai/stable-diffusion-xl-base-1.0 (Stable Diffusion XL, SDXL below), which is published on the Hugging Face Hub.

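Overview of asynchronous inference endpoints

An asynchronous inference endpoint is one of the inference options (Async Inference) that SageMaker AI provides.

You send a request to the inference endpoint and then retrieve the model's inference result.

Asynchronous inference accepts payloads of up to 1 GB and inference times of up to one hour, so it suits use cases that do not need real-time responses.
(These are representative limits. Check the official documentation for the latest specifications.)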

Prerequisites

The AWS profile used in this test has admin permissions.
If you try this in your own environment, replace the profile name, S3 bucket name, account ID, and so on as appropriate.

Local directory structure

The local directory structure for this test ended up as follows.

Refer to it while creating the files in the later steps.

directory-structure.txt

inference/
├── requirements.txt                   # Dependencies
├── .python-version                    # Python version pin
├── out/                               # Output directory
├── my-handler/                        # SageMaker handler
│   ├── model.tar.gz                   # Compressed model archive
│   ├── sd_xl_base_1.0.safetensors     # Stable Diffusion XL model
│   └── code/
│       ├── requirements.txt           # Handler dependencies
│       └── inference.py               # Inference code
└── script/                            # Scripts to run
    ├── deploy.py                      # Deployment script
    ├── extract_image.py               # Image extraction
    ├── request.py                     # Request submission
    └── cleanup.py                     # Cleanup

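0. Preparation

Creating the S3 bucket

Create the S3 bucket used for asynchronous inference, leaving every setting except the name at its default.

Include "sagemaker" in the bucket name, because the IAM role created later restricts its S3 permissions by bucket name.

With asynchronous inference, the inference input, the inference output, and the error output are all stored in S3.
In this walkthrough, the model and code used for inference are also stored in this bucket.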
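
To keep the S3 locations used in the later steps in one place, the layout can be written down as a small helper. This is only a sketch: the class name AsyncBucketLayout is hypothetical, and the prefixes (artifact/, async/input/, async/output/, async/failed/) are the ones that appear in the scripts later in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsyncBucketLayout:
    """S3 locations used in this walkthrough, derived from one bucket name."""

    bucket: str

    def _uri(self, key: str) -> str:
        # Build an s3:// URI for a key or prefix inside the bucket.
        return f"s3://{self.bucket}/{key}"

    @property
    def model_artifact(self) -> str:
        return self._uri("artifact/model.tar.gz")

    @property
    def input_prefix(self) -> str:
        return self._uri("async/input/")

    @property
    def output_prefix(self) -> str:
        return self._uri("async/output/")

    @property
    def failure_prefix(self) -> str:
        return self._uri("async/failed/")


# Bucket name used throughout this article; replace it with your own.
layout = AsyncBucketLayout("sagemaker-async-inference-image-generation")
print(layout.model_artifact)
print(layout.input_prefix)
print(layout.output_prefix)
print(layout.failure_prefix)
```

Changing the bucket name in one place then updates every path, which helps when you adapt the scripts to your own environment.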

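Creating the SageMaker AI execution role (IAM role)

Create the IAM role to attach to the inference endpoint.

Specifying the use case

Trusted entity type: AWS service
Service or use case: SageMaker
Use case: SageMaker-Execution

Adding permissions

Confirm that AmazonSageMakerFullAccess has been added, then click Next.

This test uses AmazonSageMakerFullAccess, but in production narrow the permissions down to least privilege.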
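
As a rough illustration of what narrowing the permissions could look like, the sketch below builds only the S3 part of a tighter policy for the bucket layout used in this article. It is an assumption, not a policy verified against a running endpoint: the function name build_s3_policy is hypothetical, and a real execution role also needs permissions that are not shown here, such as pulling the container image and writing logs.

```python
import json

# Bucket name used in this article; replace it with your own.
BUCKET = "sagemaker-async-inference-image-generation"


def build_s3_policy(bucket: str) -> dict:
    """Return an IAM policy document covering only the S3 access in this walkthrough."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # The endpoint reads the model archive and the request payloads.
                "Sid": "ReadModelAndInputs",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}/artifact/*",
                    f"arn:aws:s3:::{bucket}/async/input/*",
                ],
            },
            {
                # The endpoint writes results and failure details.
                "Sid": "WriteResults",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}/async/output/*",
                    f"arn:aws:s3:::{bucket}/async/failed/*",
                ],
            },
            {
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
        ],
    }


print(json.dumps(build_s3_policy(BUCKET), indent=2))
```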

Naming and creating the IAM role

Specify an IAM role name and click Create.
For this test I named it image-generator-sagemaker-execution-role.

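1. Compress the model and inference code and upload them to S3

Upload an archive of the model and inference code to S3 and specify its location in the model configuration. The inference endpoint then extracts the model and inference code inside its own container and uses them.

Getting the model

Download the safetensors file for stable-diffusion-xl-base-1.0 from the Hugging Face Hub.
Place the downloaded file in the my-handler directory.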
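
If you prefer to script the download instead of using the browser, the huggingface_hub package provides hf_hub_download. The following is a minimal sketch, assuming huggingface_hub is installed locally and that the single-file checkpoint sits at the repository root under the same file name the rest of this article uses. The function name download_sdxl_checkpoint is hypothetical.

```python
from pathlib import Path

REPO_ID = "stabilityai/stable-diffusion-xl-base-1.0"
FILENAME = "sd_xl_base_1.0.safetensors"


def download_sdxl_checkpoint(target_dir: str = "my-handler") -> Path:
    """Download the single-file SDXL checkpoint into target_dir and return its path."""
    # Imported lazily so this module can be loaded even without the package installed.
    from huggingface_hub import hf_hub_download

    Path(target_dir).mkdir(parents=True, exist_ok=True)
    local_path = hf_hub_download(
        repo_id=REPO_ID,
        filename=FILENAME,
        local_dir=target_dir,  # place the file directly in my-handler/
    )
    return Path(local_path)
```

Calling download_sdxl_checkpoint() from the inference directory should leave the file at my-handler/sd_xl_base_1.0.safetensors. The checkpoint is several gigabytes, so the download takes a while.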

Creating the inference code and defining dependencies

Create inference.py, which runs on the inference endpoint.

Code: my-handler/code/inference.py

my-handler/code/inference.py

import base64
import io
import json
import os

import torch
from diffusers import StableDiffusionXLPipeline

# Load the model once at container startup and return it
def model_fn(model_dir):
    # The model file extracted from the S3 archive is required
    model_file_path = os.path.join(model_dir, "sd_xl_base_1.0.safetensors")
    
    if not os.path.exists(model_file_path):
        raise FileNotFoundError(f"Model file not found at: {model_file_path}")
    
    print(f"Loading model from: {model_file_path}")
    
    pipe = StableDiffusionXLPipeline.from_single_file(
        model_file_path, 
        torch_dtype=torch.float16, 
        safety_checker=None, 
        requires_safety_checker=False
    ).to("cuda" if torch.cuda.is_available() else "cpu")
    
    pipe.enable_attention_slicing()
    return pipe

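# Normalize the input JSON into the internal format
def input_fn(body, content_type):  # Assumes JSON
    data = json.loads(body if isinstance(body, str) else body.decode("utf-8"))

    # Hugging Face style payload: {"inputs": ..., "parameters": {...}}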
    if isinstance(data, dict) and "inputs" in data and "parameters" in data:
        params = data.get("parameters", {})
        return {
            "prompt": data.get("inputs", ""),
            "negative_prompt": params.get("negative_prompt"),
            "width": params.get("width"),
            "height": params.get("height"),
            "num_inference_steps": params.get("num_inference_steps"),
            "guidance_scale": params.get("guidance_scale"),
            "seed": params.get("seed"),
        }
    
    return data

# Run the pipeline with the normalized input
def predict_fn(data, pipe):
    g = None
    seed = data.get("seed")
    if seed is not None:
        g = torch.Generator(device=pipe.device).manual_seed(int(seed))
    
    image = pipe(
        prompt=data.get("prompt", "masterpiece, best quality"),
        negative_prompt=data.get("negative_prompt", "low quality, nsfw"),
        num_inference_steps=int(data.get("num_inference_steps", 28) or 28),
        guidance_scale=float(data.get("guidance_scale", 7.0) or 7.0),
        width=int(data.get("width", 1024) or 1024),
        height=int(data.get("height", 1024) or 1024),
        generator=g
    ).images[0]
    
    # Convert the generated image (PIL.Image) to a Base64-encoded PNG string so it can be returned as JSON
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return {"image_base64": base64.b64encode(buf.getvalue()).decode("utf-8")}

# Format the inference result as the response
def output_fn(prediction, accept):
    # SageMaker asynchronous inference needs a JSON or text response here
    return json.dumps(prediction), "application/json"
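
The normalization done by input_fn can be tried locally without a GPU or diffusers. The sketch below re-implements the same logic as a standalone function. The name normalize_payload is hypothetical and is not part of inference.py. It shows that parameters left out of the request come through as None, which is why predict_fn falls back with expressions such as `or 28`.

```python
import json

# Parameter names that input_fn copies out of the "parameters" object.
PARAMETER_KEYS = (
    "negative_prompt",
    "width",
    "height",
    "num_inference_steps",
    "guidance_scale",
    "seed",
)


def normalize_payload(body):
    """Mirror input_fn: accept str or bytes JSON and flatten the inputs/parameters form."""
    data = json.loads(body if isinstance(body, str) else body.decode("utf-8"))
    if isinstance(data, dict) and "inputs" in data and "parameters" in data:
        params = data.get("parameters", {})
        normalized = {"prompt": data.get("inputs", "")}
        # Missing parameters become None rather than being dropped.
        normalized.update({key: params.get(key) for key in PARAMETER_KEYS})
        return normalized
    # Any other shape is passed through unchanged.
    return data


request_body = json.dumps(
    {"inputs": "birthday cake on a table", "parameters": {"width": 1024, "seed": 12}}
)
normalized = normalize_payload(request_body)
print(normalized)
```

Note that a payload without a "parameters" key is returned as-is, so predict_fn would then look for "prompt" rather than "inputs".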

Also create the requirements.txt that goes with inference.py.

my-handler/code/requirements.txt

diffusers>=0.20.0
transformers>=4.35.0
accelerate>=0.21.0
safetensors
huggingface_hub>=0.19.0
omegaconf

Uploading the model and inference code

In the my-handler directory, run the following command to compress the model and inference code into model.tar.gz.

my-handler/

tar -czf model.tar.gz sd_xl_base_1.0.safetensors code/
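
The same archive can be built and inspected from Python with the standard tarfile module, which is handy for checking that the checkpoint and the code/ directory sit at the root of the archive. This is a sketch. The helper names build_model_archive and list_archive are hypothetical, and the demonstration uses tiny placeholder files instead of the real checkpoint.

```python
import tarfile
import tempfile
from pathlib import Path


def build_model_archive(handler_dir, archive_name="model.tar.gz"):
    """Pack the checkpoint and the code/ directory at the archive root."""
    root = Path(handler_dir)
    archive_path = root / archive_name
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(root / "sd_xl_base_1.0.safetensors", arcname="sd_xl_base_1.0.safetensors")
        tar.add(root / "code", arcname="code")  # added recursively
    return archive_path


def list_archive(archive_path):
    """Return the sorted member names of the archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "code").mkdir()
    (demo / "sd_xl_base_1.0.safetensors").write_bytes(b"placeholder")
    (demo / "code" / "inference.py").write_text("# placeholder\n")
    (demo / "code" / "requirements.txt").write_text("diffusers\n")
    members = list_archive(build_model_archive(tmp))
    print(members)
```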

Then upload model.tar.gz to S3 with the following command.

my-handler/

aws s3 cp model.tar.gz s3://sagemaker-async-inference-image-generation/artifact/model.tar.gz --profile your-profile

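2. Creating the inference endpoint

I created the asynchronous inference endpoint from my local environment using the SageMaker Python SDK.

The runtime environment and the code are as follows.

Runtime environment

Python version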

python-version.txt

Python 3.12.7

requirements.txt

sagemaker>=2.200.0
boto3>=1.34.0

Creating deploy.py

Creating and running deploy.py deploys the model and the asynchronous inference endpoint.

Replace the profile name, S3 paths, and IAM role with those of your own environment.

Code: script/deploy.py

script/deploy.py

import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.async_inference import AsyncInferenceConfig
from botocore.exceptions import ClientError

# Initialize the session for the local environment
boto_session = boto3.Session(profile_name='your-profile', region_name='ap-northeast-1')
sess = sagemaker.Session(boto_session=boto_session)
region = sess.boto_region_name

# Retrieve the Docker image for inference
image_uri = sagemaker.image_uris.retrieve(
    framework="huggingface", 
    region=region,
    version='4.49.0', 
    image_scope="inference", 
    base_framework_version="pytorch2.6.0",
    instance_type='ml.g4dn.xlarge' 
)

# Create the HuggingFaceModel object
hf_model = HuggingFaceModel(
    image_uri=image_uri,
    role='arn:aws:iam::{ACCOUNT_ID}:role/image-generator-sagemaker-execution-role',  # Replace with your actual role ARN
    sagemaker_session=sess,
    model_data='s3://sagemaker-async-inference-image-generation/artifact/model.tar.gz',
    name="sdxl-base-async-model"
)

# Async inference settings
async_cfg = AsyncInferenceConfig(
    output_path='s3://sagemaker-async-inference-image-generation/async/output/',
    failure_path='s3://sagemaker-async-inference-image-generation/async/failed/',
    max_concurrent_invocations_per_instance=2,
)

# Deploy the endpoint
predictor = hf_model.deploy(
    initial_instance_count=1,
    instance_type='ml.g4dn.xlarge',
    endpoint_name='stable-diffusion-xl-base-1-async',
    async_inference_config=async_cfg
)

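Details of deploy.py

Session setup

The following code specifies the session. Specifying a session lets you explicitly set the AWS profile and region to use.

Replace your-profile with the profile you actually use.

script/deploy.py (session setup excerpt)

# Initialize the AWS session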
boto_session = boto3.Session(profile_name='your-profile', region_name='ap-northeast-1')
sess = sagemaker.Session(boto_session=boto_session)
region = sess.boto_region_name

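Getting the container image URI used by the inference endpoint

The following code uses image_uris.retrieve() from the SageMaker Python SDK to get the URI of the container image for inference.

Because we use a Hugging Face container image, framework="huggingface" is specified.

version specifies the version of the Transformers library provided by Hugging Face.

script/deploy.py (image URI retrieval excerpt)

# Retrieve the Docker image for inference
image_uri = sagemaker.image_uris.retrieve(
    framework="huggingface", # Framework to use
    region=region, # AWS region
    version='4.49.0', # Transformers library version
    image_scope="inference", # "inference" because the image is used for inference
    base_framework_version="pytorch2.6.0", # PyTorch version the image is built on
    instance_type='ml.g4dn.xlarge' # Instance type to use
)

Check the official list of available Hugging Face container images for the Transformers versions you can specify.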

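Creating the asynchronous inference configuration

Specify the S3 locations where output and failures are saved, and set the maximum number of requests sent concurrently to the container on each instance to 2.

script/deploy.py (async configuration excerpt)

# Async inference settings
async_cfg = AsyncInferenceConfig(
    output_path='s3://sagemaker-async-inference-image-generation/async/output/', # Output path for inference results
    failure_path='s3://sagemaker-async-inference-image-generation/async/failed/', # Output path for inference errors
    max_concurrent_invocations_per_instance=2, # Maximum concurrent requests per instance
)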

Creating the model configuration

Create the model configuration by passing the Docker image URI, the execution role, and the location of the model archive.
For the execution role, specify the ARN of the role created earlier.

script/deploy.py (model configuration excerpt)

# Create the HuggingFaceModel object
hf_model = HuggingFaceModel(
    image_uri=image_uri,
    role='arn:aws:iam::{ACCOUNT_ID}:role/image-generator-sagemaker-execution-role',  # Replace with your actual role ARN
    sagemaker_session=sess,
    model_data='s3://sagemaker-async-inference-image-generation/artifact/model.tar.gz',
    name="sdxl-base-async-model"
)

Deploying the endpoint

The following code deploys the asynchronous endpoint.
It creates the model, the endpoint configuration, and the endpoint.

script/deploy.py (deployment excerpt)

# Deploy the endpoint
predictor = hf_model.deploy(
    initial_instance_count=1, # Number of instances
    instance_type='ml.g4dn.xlarge', # Instance type
    endpoint_name='stable-diffusion-xl-base-1-async', # Endpoint name
    async_inference_config=async_cfg # Async inference settings
)

You can check each created resource under "Inference" in the left menu of the SageMaker AI console.

Once the endpoint status becomes InService, you can send requests to the asynchronous inference endpoint.
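
Instead of watching the console, the status can also be polled from a script. The sketch below checks the EndpointStatus field returned by describe_endpoint. The function name wait_until_in_service and the fake client are hypothetical and exist only so the example runs offline. With a real client you would pass session.client('sagemaker'), and boto3 also offers a built-in endpoint_in_service waiter for the same purpose.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll describe_endpoint until the endpoint is InService, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} is still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)


class FakeSageMakerClient:
    """Stand-in for the boto3 SageMaker client so the sketch runs without AWS access."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_endpoint(self, EndpointName):
        # Return the next scripted status on each call.
        return {"EndpointName": EndpointName, "EndpointStatus": next(self._statuses)}


fake = FakeSageMakerClient(["Creating", "Creating", "InService"])
result = wait_until_in_service(fake, "stable-diffusion-xl-base-1-async", poll_seconds=0)
print(result)
```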

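3. Testing the asynchronous endpoint

Creating request.py and sending a request

Create request.py and send a request to the asynchronous inference endpoint.
Here too, replace the profile and S3 bucket name with your own.

The script uploads the input payload to S3 and then invokes asynchronous inference.

Code: script/request.py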

script/request.py

#!/usr/bin/env python3
import json
import boto3
import uuid

# Create the AWS clients
session = boto3.Session(profile_name='your-profile', region_name='ap-northeast-1')
s3 = session.client('s3')
sagemaker_runtime = session.client('sagemaker-runtime')

# Test payload (for Stable Diffusion XL)
payload = {
    "inputs": "birthday cake on a table",
    "parameters": {
        "negative_prompt": "low quality, nsfw",
        "width": 1024,
        "height": 1024,
        "num_inference_steps": 100,
        "guidance_scale": 7.0,
        "seed": 12
    }
}

# Generate a job ID
job_id = str(uuid.uuid4())

# Upload the input data to S3
input_key = f"async/input/{job_id}.json"
s3.put_object(
    Bucket='sagemaker-async-inference-image-generation',
    Key=input_key,
    Body=json.dumps(payload).encode('utf-8'),
    ContentType='application/json'
)
input_location = f"s3://sagemaker-async-inference-image-generation/{input_key}"

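# Invoke asynchronous inference
response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName='stable-diffusion-xl-base-1-async',
    InputLocation=input_location,
    ContentType='application/json'
)

# The response only acknowledges the request; the result is written to this S3 location later
print(f"Output location: {response['OutputLocation']}")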

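Downloading the inference result

When the request succeeds, a file containing the generated image data is created under async/output in the S3 bucket sagemaker-async-inference-image-generation.
Download it from the console and place it in the out directory.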
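
The download can also be scripted. The sketch below takes the OutputLocation value returned by invoke_endpoint_async, waits until the object appears in S3, and saves it locally. The names split_s3_uri and download_async_result are hypothetical, the fake client exists only so the example runs offline, and the failure location is not checked here. Saving with a .json extension is an assumption made so that extract_image.py, which looks for *.json, can find the file.

```python
import io
import tempfile
import time
from pathlib import Path
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split s3://bucket/key into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_async_result(s3_client, output_location, out_dir, poll_seconds=10, max_attempts=60):
    """Wait for the async output object and save it under out_dir as <name>.json."""
    bucket, key = split_s3_uri(output_location)
    for _ in range(max_attempts):
        listing = s3_client.list_objects_v2(Bucket=bucket, Prefix=key)
        if listing.get("KeyCount", 0) > 0:
            body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
            target = Path(out_dir) / (Path(key).stem + ".json")
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(body)
            return target
        time.sleep(poll_seconds)
    raise TimeoutError(f"No result at {output_location} after {max_attempts} attempts")


class FakeS3Client:
    """Stand-in for the boto3 S3 client; objects maps (bucket, key) to bytes."""

    def __init__(self, objects):
        self._objects = objects

    def list_objects_v2(self, Bucket, Prefix):
        keys = [k for (b, k) in self._objects if b == Bucket and k.startswith(Prefix)]
        return {"KeyCount": len(keys)}

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}


with tempfile.TemporaryDirectory() as tmp:
    fake = FakeS3Client({("demo-bucket", "async/output/abc.out"): b'{"image_base64": ""}'})
    saved = download_async_result(fake, "s3://demo-bucket/async/output/abc.out", tmp, poll_seconds=0)
    saved_name = saved.name
    saved_text = saved.read_text()
    print(saved_name)
```

With a real client you would pass session.client('s3'), response['OutputLocation'] from request.py, and the out directory.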

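Extracting the image from the inference result

After placing the file containing the generated image data in the out directory, run extract_image.py to extract the generated image from the downloaded file into the out directory. The script picks up the most recently modified *.json file in out, so give the downloaded file a .json extension if it does not already have one. It also uses the relative path ../out, so run it from the script directory.

Code: script/extract_image.py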

script/extract_image.py

#!/usr/bin/env python3
import json
import base64
import os
import glob
from datetime import datetime

def get_latest_json():
    out_dir = "../out"
    json_files = glob.glob(os.path.join(out_dir, "*.json"))
    if not json_files:
        return None
    return max(json_files, key=os.path.getmtime)

def extract_image_from_json(json_file_path):
    with open(json_file_path, 'r') as f:
        data = json.load(f)
    
    if isinstance(data, list) and len(data) > 0:
        inner_data = json.loads(data[0]) if isinstance(data[0], str) else data[0]
    else:
        inner_data = data
    
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_path = f"../out/generated_image_{timestamp}.png"
    
    image_data = base64.b64decode(inner_data['image_base64'])
    
    with open(output_path, 'wb') as f:
        f.write(image_data)
    
    return output_path

latest_json = get_latest_json()
if latest_json:
    output_path = extract_image_from_json(latest_json)
    print(f"Saved image to: {output_path}")
else:
    print("No JSON file found in ../out")

Output image

My birthday is coming up, so I had it bake me a cake.
It turned out nicely.

4. Deleting the created resources

Deleting the SageMaker AI resources

The following resources were created in SageMaker AI.

Model
Endpoint configuration
Endpoint

Create cleanup.py and delete the resources.

Code: script/cleanup.py

script/cleanup.py

#!/usr/bin/env python3
import boto3

# Settings
session = boto3.Session(profile_name='your-profile', region_name='ap-northeast-1')
sm = session.client('sagemaker')

ENDPOINT_NAME = 'stable-diffusion-xl-base-1-async'
MODEL_NAME = 'sdxl-base-async-model'

# Run the deletions
try:
    sm.delete_endpoint(EndpointName=ENDPOINT_NAME)
    print(f"Deleting endpoint {ENDPOINT_NAME}...")
except Exception as e:
    print(f"Failed to delete endpoint: {e}")

try:
    # Assumes the endpoint configuration shares the endpoint's name, as created by deploy.py
    sm.delete_endpoint_config(EndpointConfigName=ENDPOINT_NAME)
    print(f"Deleted endpoint configuration {ENDPOINT_NAME}")
except Exception as e:
    print(f"Failed to delete endpoint configuration: {e}")

try:
    sm.delete_model(ModelName=MODEL_NAME)
    print(f"Deleted model {MODEL_NAME}")
except Exception as e:
    print(f"Failed to delete model: {e}")

print("Cleanup finished")

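SageMaker AI endpoints keep incurring charges for as long as they exist, and GPU instances are expensive, so confirm from the console or by other means that the endpoint has definitely been deleted.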
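
The check can also be done from a script by listing endpoints and looking for the name. This is a sketch: endpoint_exists is a hypothetical helper and the fake client only makes the example runnable offline. With a real client you would pass session.client('sagemaker'). An endpoint that is still being deleted continues to appear in the list for a short time.

```python
ENDPOINT_NAME = "stable-diffusion-xl-base-1-async"


def endpoint_exists(sm_client, endpoint_name):
    """Return True if an endpoint with exactly this name is still listed."""
    response = sm_client.list_endpoints(NameContains=endpoint_name)
    return any(
        endpoint["EndpointName"] == endpoint_name
        for endpoint in response.get("Endpoints", [])
    )


class FakeSageMakerClient:
    """Stand-in for the boto3 SageMaker client, holding a fixed list of endpoint names."""

    def __init__(self, names):
        self._names = names

    def list_endpoints(self, NameContains):
        matches = [name for name in self._names if NameContains in name]
        return {"Endpoints": [{"EndpointName": name} for name in matches]}


still_there = endpoint_exists(FakeSageMakerClient([ENDPOINT_NAME]), ENDPOINT_NAME)
gone = endpoint_exists(FakeSageMakerClient([]), ENDPOINT_NAME)
print(still_there, gone)
```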

Deleting other resources

Delete the following resources as needed.

The S3 bucket you created
The IAM role you created
The endpoint's log group

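Summary

In this test I got hands-on with SageMaker AI asynchronous inference, which I had not used before, and walked through the basic flow end to end: S3-based input and output → endpoint invocation → result retrieval.

Because of time constraints I could not set up auto scaling or notifications on inference completion this time, so I would like to try configuring those next.

I hope this article is of some help.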
