
[Azure Machine Learning] Running machine learning inference in parallel with batch endpoints

Posted at 2022-11-12 (last updated at 2022-11-12)

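With Azure Machine Learning batch endpoints, you can parallelize machine learning inference with very little implementation effort.

This article walks through how to set one up.
That said, most of it follows Microsoft's official documentation.

I have added notes on the points where I personally got stuck.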

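Prerequisites

This article assumes the following are already in place.

An Azure subscription
Azure CLI with the ml extension installed
- Reference: https://learn.microsoft.com/ja-jp/azure/machine-learning/how-to-configure-cli?tabs=public
An Azure Machine Learning workspace has been created
A compute cluster has been created (with the maximum node count (max_node) set to 2 or more)

Because inference runs in parallel across multiple nodes, the compute cluster's maximum node count must be set to 2 or more.

The files created in the rest of this article are ./batch/batch-endpoint.yml, ./batch/mnist/code/batch_driver.py, ./batch/mnist/environment/conda.yml, and ./batch/mnist-torch-deployment.yml.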

Clone Microsoft's tutorial repository

Clone Microsoft's tutorial repository.
It contains template code, so it is convenient to start from that and customize it.

git clone https://github.com/Azure/azureml-examples
cd azureml-examples
cd cli/endpoints

Create the endpoint

Set the endpoint name.

ENDPOINT_NAME="mnist-batch"

Define the endpoint configuration in a YAML file.

./batch/batch-endpoint.yml

$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: mnist-batch
description: A batch endpoint for scoring images from the MNIST dataset.
auth_mode: aad_token

Run the following command to create the endpoint.

az ml batch-endpoint create --file batch/batch-endpoint.yml --name $ENDPOINT_NAME

On the Endpoints page of Azure ML Studio, confirm that an endpoint named "mnist-batch" has been created.

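Create the inference script

Create the script that performs inference.
In the tutorial, handwritten digit recognition (MNIST) inference is written in TensorFlow.
Change this part as needed to suit your own task.

init() runs once when a worker starts, before any mini-batch is processed, so it seems like a good place for expensive work such as loading the model.

run(mini_batch) performs inference one mini-batch at a time. mini_batch is a list of input file paths, and the size of each mini-batch is set with mini_batch_size in the deployment configuration described later.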

./batch/mnist/code/batch_driver.py

# Copyright (c) Microsoft. All rights reserved.
# Licensed under the MIT license.

import os
import numpy as np
import pandas as pd
import tensorflow as tf
from PIL import Image


def init():
    global g_tf_sess

    # AZUREML_MODEL_DIR is an environment variable created during deployment
    # It is the path to the model folder (./azureml-models)
    # Please provide your model's folder name if there's one
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model")

    # construct graph to execute
    tf.reset_default_graph()
    saver = tf.train.import_meta_graph(os.path.join(model_path, "mnist-tf.model.meta"))
    g_tf_sess = tf.Session(config=tf.ConfigProto(device_count={"GPU": 0}))
    saver.restore(g_tf_sess, os.path.join(model_path, "mnist-tf.model"))


def run(mini_batch):
    print(f"run method start: {__file__}, run({mini_batch})")
    resultList = []
    in_tensor = g_tf_sess.graph.get_tensor_by_name("network/X:0")
    output = g_tf_sess.graph.get_tensor_by_name("network/output/MatMul:0")

    for image in mini_batch:
        # prepare each image
        data = Image.open(image)
        np_im = np.array(data).reshape((1, 784))
        # perform inference
        inference_result = output.eval(feed_dict={in_tensor: np_im}, session=g_tf_sess)
        # find best probability, and add to result list
        best_result = np.argmax(inference_result)
        resultList.append([os.path.basename(image), best_result])

    df_result = pd.DataFrame(resultList)

    return df_result
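To make the init()/run() contract easier to picture, here is a minimal single-process sketch of how a list of input files could be cut into mini-batches and fed to a scoring script. The helper names (split_into_mini_batches, simulate_batch_job, fake_init, fake_run) are hypothetical and are not part of Azure Machine Learning; the real service distributes mini-batches across worker processes and nodes, and the real run() returns a DataFrame rather than a list.

```python
from typing import Callable, List, Sequence, Tuple


def split_into_mini_batches(files: Sequence[str], mini_batch_size: int) -> List[List[str]]:
    """Cut the file list into consecutive chunks of at most mini_batch_size files."""
    if mini_batch_size <= 0:
        raise ValueError("mini_batch_size must be positive")
    return [list(files[i:i + mini_batch_size]) for i in range(0, len(files), mini_batch_size)]


def simulate_batch_job(
    files: Sequence[str],
    mini_batch_size: int,
    init: Callable[[], None],
    run: Callable[[List[str]], List[Tuple[str, int]]],
) -> List[Tuple[str, int]]:
    """Call init() once, then run() once per mini-batch, collecting all rows."""
    init()  # one-time setup, e.g. loading the model
    rows: List[Tuple[str, int]] = []
    for mini_batch in split_into_mini_batches(files, mini_batch_size):
        rows.extend(run(mini_batch))
    return rows


# Hypothetical stand-ins for the real init()/run() in batch_driver.py.
state = {}


def fake_init() -> None:
    state["model"] = "loaded"


def fake_run(mini_batch: List[str]) -> List[Tuple[str, int]]:
    # Pretend the "prediction" is just the length of the file name.
    return [(path, len(path)) for path in mini_batch]


files = [f"img_{i}.png" for i in range(25)]
print([len(b) for b in split_into_mini_batches(files, 10)])  # [10, 10, 5]
print(len(simulate_batch_job(files, 10, fake_init, fake_run)))  # 25
```

With 25 files and a mini-batch size of 10, run() is called three times, and the last call receives the 5 leftover files.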

Define the deployment

Define the execution environment. The Python version and the required libraries are listed here.

./batch/mnist/environment/conda.yml

name: mnist-env
channels:
  - conda-forge
dependencies:
  - python=3.6.2
  - pip<22.0
  - pip:
    - tensorflow==1.15.2
    - numpy
    - pandas
    - pillow
    - azureml-core
    - azureml-dataset-runtime[fuse]

Define the deployment itself.

./batch/mnist-torch-deployment.yml

$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
name: mnist-tf-dpl-core2
description: A deployment using TensorFlow to solve the MNIST classification dataset.
endpoint_name: mnist-batch
model: 
  path: ./mnist/model/
code_configuration:
  code: ./mnist/code/
  scoring_script: batch_driver.py
environment:
  conda_file: ./mnist/environment/conda.yml
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
compute: azureml:cpu-cluster
resources:
  instance_count: 2
max_concurrency_per_instance: 2
mini_batch_size: 10
output_action: append_row
output_file_name: predictions.csv
retry_settings:
  max_retries: 3
  timeout: 30
error_threshold: -1
logging_level: info

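There are many parameters, but the ones that look most important for now are the following.

name: name of the deployment
endpoint_name: name of the endpoint
model.path: location of the model used for inference
code_configuration.code: location of the inference script
code_configuration.scoring_script: file name of the inference script
environment.conda_file: location of the execution environment definition
compute: name of the compute cluster used to run the job (specify one that has already been created)
resources.instance_count: number of nodes used for processing (specifying more than one should make the job run in parallel)
mini_batch_size: number of input files passed to each run() call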
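As a rough way to reason about how these settings interact, the sketch below estimates the number of mini-batches and the number of workers that could process them at the same time. It assumes mini-batches are spread evenly over instance_count × max_concurrency_per_instance workers, which is a simplification of the real scheduler, and the input of 1000 files is a hypothetical figure chosen for illustration (the size of the sample dataset is not stated here). plan_parallelism is a hypothetical helper, not an Azure API.

```python
import math
from typing import Dict


def plan_parallelism(
    num_files: int,
    mini_batch_size: int,
    instance_count: int,
    max_concurrency_per_instance: int,
) -> Dict[str, int]:
    """Estimate mini-batch count, worker count, and scheduling rounds."""
    mini_batches = math.ceil(num_files / mini_batch_size)
    workers = instance_count * max_concurrency_per_instance
    # Number of "waves" needed if every worker takes one mini-batch per wave.
    rounds = math.ceil(mini_batches / workers)
    return {"mini_batches": mini_batches, "workers": workers, "rounds": rounds}


# Values from the deployment YAML above, with a hypothetical 1000 input files.
print(plan_parallelism(1000, mini_batch_size=10, instance_count=2, max_concurrency_per_instance=2))
# {'mini_batches': 100, 'workers': 4, 'rounds': 25}
```

Under these assumptions, halving instance_count to 1 halves the worker count to 2 and doubles the number of rounds to 50.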

Create the deployment

Run the following command to create the deployment.
Pass the deployment definition created above (mnist-torch-deployment.yml) with --file.

az ml batch-deployment create --file batch/mnist-torch-deployment.yml --endpoint-name $ENDPOINT_NAME --set-default

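Invoke the batch endpoint

Run the following command. It invokes the endpoint and stores the name of the resulting job in JOB_NAME.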

JOB_NAME=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input https://pipelinedata.blob.core.windows.net/sampledata/mnist --input-type uri_folder --query name -o tsv)
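Since the invoke command returns as soon as the job is submitted, it can be handy to wait for the job to finish before comparing run times. The following is a minimal sketch that polls the job status through the Azure CLI. It assumes that `az ml job show` exposes a `status` field, that a default resource group and workspace are configured for the CLI, and that "Completed", "Failed", and "Canceled" are the terminal states; wait_for_job and az_job_status are hypothetical helper names.

```python
import subprocess
import time
from typing import Callable

# Assumed terminal states of an Azure ML job.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def az_job_status(job_name: str) -> str:
    """Ask the Azure CLI for the current status of a job (requires az and the ml extension)."""
    completed = subprocess.run(
        ["az", "ml", "job", "show", "--name", job_name, "--query", "status", "-o", "tsv"],
        check=True,
        capture_output=True,
        text=True,
    )
    return completed.stdout.strip()


def wait_for_job(
    job_name: str,
    get_status: Callable[[str], str] = az_job_status,
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal state and return that state."""
    while True:
        status = get_status(job_name)
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
```

A call such as `wait_for_job(os.environ["JOB_NAME"])` would block until the job ends, provided JOB_NAME has been exported from the shell. The get_status and sleep parameters are injectable so the loop can be exercised without touching Azure.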

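Comparing the results

I deployed and ran the endpoint with instance_count set to 1 and then to 2, and compared the processing times.

instance_count=1: 199 seconds

instance_count=2: 154 seconds

Going from 199 seconds to 154 seconds is a reduction of about 45 seconds.

Using two nodes did not make it twice as fast, but there does appear to be a measurable speedup.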
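To put numbers on that observation, the sketch below computes the speedup and parallel efficiency from the two measured times, and then uses Amdahl's law to back out what fraction of the job behaved as parallelizable. Treating the run as an Amdahl-style workload is an assumption: fixed costs such as node start-up and environment preparation are lumped into the serial part, and a single pair of measurements is only a rough indication.

```python
def speedup(t_baseline: float, t_parallel: float) -> float:
    """Ratio of baseline time to parallel time."""
    return t_baseline / t_parallel


def parallel_efficiency(s: float, n_nodes: int) -> float:
    """Speedup divided by node count (1.0 would be perfect scaling)."""
    return s / n_nodes


def amdahl_parallel_fraction(s: float, n_nodes: int) -> float:
    """Solve Amdahl's law S = 1 / ((1 - p) + p / n) for the parallel fraction p."""
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / n_nodes)


s = speedup(199.0, 154.0)
print(round(s, 2))                                # 1.29
print(round(parallel_efficiency(s, 2), 2))        # 0.65
print(round(amdahl_parallel_fraction(s, 2), 2))   # 0.45
```

Under these assumptions, only about 45% of the 199-second run scaled with the second node, which is consistent with a noticeable but far from twofold improvement.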

That's all.
