
Trying AutoML with the Azure ML Python SDK v2

Last updated 2022-06-24. Posted 2022-06-24.

Introduction

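I tried running AutoML with the Python SDK v2 for Azure Machine Learning (Azure ML), which entered preview at Build 2022.
This post is aimed at readers who have written or read SDK v1 code, and I hope it gives a quick feel for SDK v2.
The coding style has changed considerably from SDK v1, and some parts took trial and error, so I am sharing what I found here and would welcome your feedback!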

Prerequisites

An Azure ML workspace has already been deployed
The Azure ML configuration file (config.json) has been downloaded and placed in the working directory
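
As a quick sanity check before connecting, the sketch below verifies that config.json contains the three keys the workspace configuration file normally carries (subscription_id, resource_group, workspace_name). The helper name missing_config_keys is made up for this illustration and is not part of the SDK.

```python
import json
from pathlib import Path

# Keys that the downloaded workspace config.json is expected to contain.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(path="config.json"):
    """Return the required keys that are absent from the config file.

    Hypothetical helper for illustration; not part of azure-ai-ml.
    """
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    return [key for key in REQUIRED_KEYS if key not in config]
```

Calling missing_config_keys() in the working directory returns an empty list when all three keys are present.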

Setup

Libraries

Install the Python SDK v2.

$ pip install --pre azure-ai-ml

You can check the installed version and upgrade with the following commands, respectively.

$ pip show azure-ai-ml
$ pip install --pre --upgrade azure-ai-ml

Running the Python SDK

Getting workspace information

Placing the Azure ML workspace's config.json in the same directory lets the client read the workspace information from it.

#import required libraries
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work.
    # This opens a browser page for interactive login.
    credential = InteractiveBrowserCredential()

ml_client = MLClient.from_config(credential=credential)

workspace = ml_client.workspaces.get(name=ml_client.workspace_name)

output = {}
output["Workspace"] = ml_client.workspace_name
output["Subscription ID"] = ml_client.subscription_id
output["Resource Group"] = workspace.resource_group
output["Location"] = workspace.location
output["Storage Account"] = workspace.storage_account.split("/")[-1]
output
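
The split("/")[-1] call above suggests that workspace.storage_account holds a full Azure resource ID rather than a bare name. The sketch below shows what that extraction does, using a made-up resource ID; the helper name resource_name and the ID itself are illustrative assumptions.

```python
def resource_name(arm_resource_id: str) -> str:
    """Return the last path segment of a slash-separated resource ID."""
    return arm_resource_id.rstrip("/").split("/")[-1]


# Hypothetical ID, invented for illustration only.
example_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/my-rg/providers/Microsoft.Storage"
    "/storageAccounts/mystorageaccount"
)
print(resource_name(example_id))  # prints "mystorageaccount"
```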

Compute resources

Create the compute resource. It is created only if it does not already exist.

# Compute resources
from azure.ai.ml.entities import AmlCompute

# specify aml compute name.
cpu_compute_target = "cpu-cluster"

try:
    ml_client.compute.get(cpu_compute_target)
except Exception:
    print("Creating a new cpu compute target...")
    compute = AmlCompute(
        name=cpu_compute_target, 
        size="Standard_DS3_v2", 
        min_instances=0, 
        max_instances=4,
        tier="LowPriority",
        idle_time_before_scale_down=300
    )
    ml_client.compute.begin_create_or_update(compute)
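
The get-then-create pattern above also appears later for the data asset, so it can be factored into a small helper. This is only a sketch: get_or_create is a hypothetical name, and the demo uses a plain dictionary as a stand-in for the workspace because which exception type the SDK raises for a missing resource is not confirmed here.

```python
def get_or_create(get, create, not_found=Exception):
    """Return get(); if it raises not_found, fall back to create()."""
    try:
        return get()
    except not_found:
        return create()


# Stand-in for the workspace: a dictionary of existing compute targets.
existing = {"cpu-cluster": "existing compute"}

result = get_or_create(
    get=lambda: existing["gpu-cluster"],       # missing key raises KeyError
    create=lambda: "newly created compute",    # fallback when not found
    not_found=KeyError,
)
print(result)  # prints "newly created compute"
```

Passing a narrower exception type than Exception keeps unrelated failures, such as authentication errors, from silently triggering a create.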

Preparing the data

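The data is loaded using MLTable, which is newly introduced in v2.
Create a file named MLTable under ./data/training and describe the dataset's file path and format in it. (This part confused me at first because it was not obvious.)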

MLTable

paths: 
  - file: ./titanic_train.csv
transformations: 
  - read_delimited: 
      delimiter: ',' 
      encoding: 'ascii' 
      empty_as_string: false
      header: from_first_file
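
If you prefer to generate the file from a script, the sketch below writes the same MLTable content into the data directory. The helper name write_mltable is hypothetical, and it assumes titanic_train.csv is placed in the same directory separately.

```python
from pathlib import Path

# Same content as the MLTable file shown above.
MLTABLE_TEXT = """\
paths:
  - file: ./titanic_train.csv
transformations:
  - read_delimited:
      delimiter: ','
      encoding: 'ascii'
      empty_as_string: false
      header: from_first_file
"""


def write_mltable(data_dir="./data/training"):
    """Create data_dir if needed and write a file named MLTable into it."""
    folder = Path(data_dir)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / "MLTable"
    target.write_text(MLTABLE_TEXT, encoding="utf-8")
    return target
```

Note that the file is named exactly MLTable, with no extension.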

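After that, in Python SDK v2, specify the path up to the directory that contains the MLTable file, as shown below.
For now the code registers the dataset in Azure ML once and then reads it back by name, although I suspect there is a way to avoid reading it back by name.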

# Prepare data
from azure.ai.ml import Input
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Create MLTables for training dataset
try:
    registered_data_asset = ml_client.data.get(name='titanic-mltable-train', version="1")
    my_training_data_input = Input(
        type=AssetTypes.MLTABLE,
        path=registered_data_asset.id,
    )
except Exception:
    my_data = Data(
        type=AssetTypes.MLTABLE, 
        path="./data/training",
        name="titanic-mltable-train",
        description="Titanic data for training, created by SDK v2",
    )

    ml_client.data.create_or_update(my_data)

    registered_data_asset = ml_client.data.get(name='titanic-mltable-train')  # needs investigation
    my_training_data_input = Input(
        type=AssetTypes.MLTABLE,
        path=registered_data_asset.id,
    )
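
One possible way to skip the second lookup by name, offered as an unverified assumption rather than a confirmed approach: if ml_client.data.create_or_update returns the registered asset, its id could be used directly. The helper name register_and_get_id is hypothetical.

```python
def register_and_get_id(ml_client, data_asset):
    """Register the asset and return the id of whatever the call returns.

    Assumption: create_or_update returns the registered asset object,
    which exposes an `id` attribute. Verify this against the SDK version
    you have installed before relying on it.
    """
    registered = ml_client.data.create_or_update(data_asset)
    return registered.id


# Intended usage with the objects from the code above (not executed here):
#   asset_id = register_and_get_id(ml_client, my_data)
#   my_training_data_input = Input(type=AssetTypes.MLTABLE, path=asset_id)
```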

AutoML run configuration

The configuration is specified here. Anyone who has used SDK v1 should find it familiar right away.

# Configure AutoML classification job
from azure.ai.ml import automl

# General job parameters
compute_name = cpu_compute_target
max_trials = 5
exp_name = "dpv2_classification_titanic"

# Create the AutoML classification job with the related factory-function.

classification_job = automl.classification(
    compute=compute_name,
    experiment_name=exp_name,
    training_data=my_training_data_input,
    target_column_name="Survived",
    primary_metric="AUC_weighted",
    n_cross_validations=5,
    enable_model_explainability=True,
    tags={"dpv2": "SDKv2"},
)

# Limits are all optional
classification_job.set_limits(
    timeout_minutes=600,
    trial_timeout_minutes=20,
    max_trials=max_trials,
    # max_concurrent_trials = 4,
    # max_cores_per_trial=-1,
    enable_early_termination=True,
)

# # Training properties are optional (other options)
# classification_job.set_training(
#     blocked_training_algorithms=["LogisticRegression"],
#     enable_onnx_compatible_models=True,
# )
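
To see how the limits above relate to each other, here is a rough worst-case calculation. It is a sketch under stated assumptions: trials run one at a time unless max_concurrent_trials is set, every trial uses its full timeout, and overhead such as setup, featurization, and model explanation is ignored. The function name worst_case_trial_minutes is made up for this example.

```python
import math


def worst_case_trial_minutes(max_trials, trial_timeout_minutes, max_concurrent_trials=1):
    """Upper bound on minutes spent in trials if every trial hits its timeout."""
    waves = math.ceil(max_trials / max_concurrent_trials)
    return waves * trial_timeout_minutes


budget = worst_case_trial_minutes(max_trials=5, trial_timeout_minutes=20)
print(budget)         # prints 100
print(budget <= 600)  # prints True
```

With these values the trials need at most about 100 minutes, so the 600-minute overall timeout is unlikely to be the limit that ends the job.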

Submitting the job

This submits the AutoML job. Starting with v2, what used to be called Experiments is now called Jobs!

# Submit the AutoML job 
returned_job = ml_client.jobs.create_or_update(
    classification_job
)  # submit the job to the backend

print(f"Created job: {returned_job}")

# Get a URL for the status of the job
returned_job.services["Studio"].endpoint

Conclusion

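How was it? SDK v2 is still in preview, so the documentation is thin in places, and I often found myself groping around, printing the lists of object properties and methods and looking for something that seemed right. That said, information is being added all the time, so I trust things will be better documented by the time you read this article.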

References

Docs

Work with data using SDK v2 (preview) - Azure Machine Learning | Microsoft Docs

Notebooks

SDK v2 examples
Workspace configuration
Training
MLTable

SDK v2 references

azure.ai.ml package - Azure Machine Learning Python | Microsoft Docs
azure.ai.ml.entities.AmlCompute class - Azure Machine Learning Python | Microsoft Docs
