
Notes on Running Azure Machine Learning Locally

Posted at 2020-01-02, last updated at 2020-01-02

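Introduction

When I tried to use Azure Machine Learning on my local machine, I got stuck several times and it took a while to get things running, even though official documentation and other information exist. I'm writing these notes partly as a reminder to myself.

This article covers everything from setting up the Ubuntu environment that runs Azure ML to actually running Azure ML.

Preparing the environment

I set up an Ubuntu 18.04 environment using a Docker image on a Mac. I'll skip how to pull and run the Docker image.

Running Azure ML also requires an Azure account and a workspace. I'll skip those steps as well.

Setting up the Ubuntu image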

apt-get update
apt-get upgrade

The Docker image lacks several tools, so I installed what was needed, following this article ( https://qiita.com/manabuishiirb/items/26de8c9740a1d2c7cfdd ).

apt-get install -y iputils-ping net-tools wget curl vim build-essential

Installing Anaconda

I wanted to install from the command line this time, so I downloaded Anaconda as shown here, following this article ( https://www.virment.com/setup-anaconda-python-jupyter-ubuntu/ ).

wget https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh

Then run the installer as follows.


bash Anaconda3-2019.10-Linux-x86_64.sh

Run conda init so that the conda command becomes available.
I installed Anaconda under /root/, so I ran the following commands.

/root/anaconda3/bin/conda init
source /root/.bashrc

Installing the Azure Python SDK

Following the official documentation ( https://docs.microsoft.com/ja-jp/azure/machine-learning/service/how-to-configure-environment#local ), install the Azure ML SDK. First, create an Anaconda virtual environment.

conda create -n myenv python=3.6.5
conda activate myenv
conda install notebook ipykernel
ipython kernel install --user --name myenv --display-name "Python (myenv)"

Next, install the Azure CLI, which is needed for authentication and other tasks.
I followed this page ( https://docs.microsoft.com/ja-jp/cli/azure/install-azure-cli-apt?view=azure-cli-latest ).

curl -sL https://aka.ms/InstallAzureCLIDeb | bash

Finally, install the Azure ML SDK.

pip install "azureml-sdk[notebooks,automl]"

The following error appears partway through, but it caused no problems in my case.

ERROR: azureml-automl-runtime 1.0.81 has requirement azureml-automl-core==1.0.81, but you'll have azureml-automl-core 1.0.81.1 which is incompatible.
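
The message above says that azureml-automl-runtime 1.0.81 pins azureml-automl-core==1.0.81 while 1.0.81.1 was installed. To see which azureml packages actually ended up in the environment, a small helper like the one below may be useful. This is only a sketch: the function name installed_versions and the "azureml-" prefix filter are my own choices for illustration, not part of the Azure ML SDK.

```python
try:
    # Python 3.8+ ships importlib.metadata in the standard library
    from importlib.metadata import distributions

    def _installed():
        for dist in distributions():
            yield dist.metadata["Name"], dist.version
except ImportError:
    # Older interpreters (such as the 3.6 environment above) can fall back
    # to pkg_resources, which is provided by setuptools
    import pkg_resources

    def _installed():
        for dist in pkg_resources.working_set:
            yield dist.project_name, dist.version


def installed_versions(prefix="azureml-"):
    """Return {package name: version} for installed packages matching prefix."""
    found = {}
    for name, version in _installed():
        if name and name.lower().startswith(prefix):
            found[name] = version
    return dict(sorted(found.items()))


# Print the matching packages in a pip-like "name==version" form
for name, version in installed_versions().items():
    print(f"{name}=={version}")
```

Running pip check in the same environment reports the same kind of version conflicts without any extra code.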

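Running automated machine learning

Authenticating with `az login`

First, authenticate with the az login command. Open the URL shown after running the command in a web browser and enter the code.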

az login

To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code GPVMUVTKF to authenticate.
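
To confirm from Python that the login went through before running anything else, one option is to call az account show and inspect its JSON output. The helper below is a minimal sketch; the function name current_az_account is hypothetical, and it assumes that a non-zero exit code from az account show means there is no active login.

```python
import json
import shutil
import subprocess


def current_az_account(az_cmd="az"):
    """Return the active Azure CLI account as a dict, or None if unavailable."""
    if shutil.which(az_cmd) is None:
        return None  # the Azure CLI is not on PATH
    result = subprocess.run(
        [az_cmd, "account", "show", "--output", "json"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode != 0:
        return None  # assumed to mean that `az login` has not been completed
    return json.loads(result.stdout)


if __name__ == "__main__":
    account = current_az_account()
    if account is None:
        print("No active Azure CLI session; run `az login` first.")
    else:
        print("Signed in to subscription:", account.get("name"), account.get("id"))
```

The id value printed here is the subscription ID that the next step asks for.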

Creating the workspace connection file

Create a Python program (auth.py) that writes out the workspace information.

auth.py

from azureml.core import Workspace

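subscription_id = '<subscription id>'
resource_group  = '<resource group name>'
workspace_name  = '<workspace name>'

try:
    # Look up the existing workspace and save its details to .azureml/config.json
    ws = Workspace(subscription_id = subscription_id, resource_group = resource_group, workspace_name = workspace_name)
    ws.write_config()
    print('Library configuration succeeded')
except Exception as e:
    # The cause may be a wrong ID or name, but also an authentication or network failure
    print('Could not load the workspace:', e)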

Running it creates a config file for connecting to the workspace at .azureml/config.json under the current directory.
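
Before moving on, it can help to check that the generated file looks complete. The sketch below assumes the file is a JSON object with the keys subscription_id, resource_group, and workspace_name, matching the three values passed in auth.py; the helper names missing_keys and load_config are hypothetical.

```python
import json
from pathlib import Path

# Keys assumed to be present in the file written by ws.write_config()
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_keys(config):
    """Return the required keys that are absent or empty in a config dict."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


def load_config(path=".azureml/config.json"):
    """Read the workspace config file and return it as a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    config_path = Path(".azureml/config.json")
    if not config_path.exists():
        print("config.json not found; run auth.py first")
    else:
        missing = missing_keys(load_config(config_path))
        print("missing keys:", missing if missing else "none")
```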

Running

Create a Python program (run.py) to run the machine learning job.
The data is the breast cancer dataset that ships with scikit-learn. See here ( https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer ) for details on the dataset.

run.py

import logging

from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.experiment import Experiment

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# Load the workspace config
ws = Workspace.from_config()

# Load the data
data = load_breast_cancer()
df_X = pd.DataFrame(data.data, columns=data.feature_names)
df_y = pd.DataFrame(data.target, columns=['target'])
x_train, x_test, y_train, y_test = train_test_split(df_X, df_y, test_size=0.2, random_state=100)

# Automated ML settings
automl_settings = {
    "iteration_timeout_minutes": 2,
    "experiment_timeout_minutes": 20,
    "enable_early_stopping": True,
    "primary_metric": 'AUC_weighted',
    "featurization": 'auto',
    "verbosity": logging.INFO,
    "n_cross_validations": 5
}


automl_config = AutoMLConfig(task='classification',
                             debug_log='automated_ml_errors.log',
                             X=x_train.values,
                             y=y_train.values.flatten(),
                             **automl_settings)

# Run the experiment
experiment = Experiment(ws, "my-experiment")
local_run = experiment.submit(automl_config, show_output=True)

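The values in automl_settings depend on the data and the problem. This is a binary classification problem, so the metric to optimize (primary_metric) is set to AUC (AUC_weighted), and task in AutoMLConfig is set to classification.
See here ( https://docs.microsoft.com/ja-jp/azure/machine-learning/service/how-to-configure-auto-train ) for details.

When run, it performs some simple feature engineering, builds several models, and then ensembles them.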

python run.py 

(omitted)

Current status: DatasetFeaturization. Beginning to featurize the dataset.
Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturizationCompleted. Completed featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Classes are balanced in the training data.

TYPE:         Missing values imputation
STATUS:       PASSED
DESCRIPTION:  There were no missing values found in the training data.

TYPE:         High cardinality feature detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and no high cardinality features were detected.

****************************************************************************************************
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
****************************************************************************************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   StandardScalerWrapper SGD                      0:00:13       0.9940    0.9940
         1   StandardScalerWrapper SGD                      0:00:12       0.9958    0.9958
         2   MinMaxScaler LightGBM                          0:00:12       0.9888    0.9958
         3   StandardScalerWrapper SGD                      0:00:11       0.9936    0.9958
         4   StandardScalerWrapper ExtremeRandomTrees       0:00:14       0.9908    0.9958
         5   StandardScalerWrapper LightGBM                 0:00:11       0.9887    0.9958
         6   StandardScalerWrapper SGD                      0:00:11       0.9956    0.9958
         7   MinMaxScaler RandomForest                      0:00:13       0.9814    0.9958
         8   StandardScalerWrapper SGD                      0:00:11       0.9851    0.9958
         9   MinMaxScaler SGD                               0:00:11       0.9441    0.9958
        10   MinMaxScaler RandomForest                      0:00:11       0.9802    0.9958
        11   MaxAbsScaler LightGBM                          0:00:11       0.9780    0.9958
        12   MinMaxScaler LightGBM                          0:00:12       0.9886    0.9958
        13   MinMaxScaler ExtremeRandomTrees                0:00:11       0.9816    0.9958
        14   MinMaxScaler LightGBM                          0:00:11       0.9731    0.9958
        15   StandardScalerWrapper BernoulliNaiveBayes      0:00:11       0.9705    0.9958
        16   StandardScalerWrapper LogisticRegression       0:00:13       0.9959    0.9959
        17   MaxAbsScaler ExtremeRandomTrees                0:00:28       0.9906    0.9959
        18   RobustScaler LogisticRegression                0:00:13       0.9853    0.9959
        19   RobustScaler LightGBM                          0:00:12       0.9904    0.9959
        20   StandardScalerWrapper LogisticRegression       0:00:11       0.5000    0.9959
        21   MaxAbsScaler LinearSVM                         0:00:12       0.9871    0.9959
        22   StandardScalerWrapper SVM                      0:00:12       0.9873    0.9959
        23   RobustScaler LogisticRegression                0:00:14       0.9909    0.9959
        24   MaxAbsScaler LightGBM                          0:00:15       0.9901    0.9959
        25   RobustScaler LogisticRegression                0:00:29       0.9894    0.9959
        26   MaxAbsScaler LightGBM                          0:00:13       0.9897    0.9959
        27   MaxAbsScaler LightGBM                          0:00:15       0.9907    0.9959
        28   RobustScaler KNN                               0:00:12       0.9887    0.9959
        29   MaxAbsScaler LogisticRegression                0:00:13       0.9940    0.9959
        30   VotingEnsemble                                 0:00:31       0.9965    0.9965
        31   StackEnsemble                                  0:00:36       0.9960    0.9965
Stopping criteria reached at iteration 31. Ending experiment.
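
If the console output has been saved as text, the iteration table can be parsed to rank pipelines or to spot outliers such as iteration 20 (0.5000). The snippet below is a rough sketch that assumes the column layout shown above; the regular expression and the function name parse_iterations are my own and are not part of the Azure ML SDK. The sample rows are copied from the log.

```python
import re

# One table row: iteration number, pipeline description, duration, metric, best
ROW = re.compile(
    r"^\s*(?P<iteration>\d+)\s+(?P<pipeline>.+?)\s+"
    r"(?P<duration>\d+:\d{2}:\d{2})\s+(?P<metric>\d+\.\d+)\s+(?P<best>\d+\.\d+)\s*$"
)


def parse_iterations(log_text):
    """Extract iteration rows from AutoML console output as a list of dicts."""
    rows = []
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append({
                "iteration": int(match.group("iteration")),
                "pipeline": match.group("pipeline"),
                "duration": match.group("duration"),
                "metric": float(match.group("metric")),
            })
    return rows


sample = """\
 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
        16   StandardScalerWrapper LogisticRegression       0:00:13       0.9959    0.9959
        20   StandardScalerWrapper LogisticRegression       0:00:11       0.5000    0.9959
        30   VotingEnsemble                                 0:00:31       0.9965    0.9965
"""

rows = parse_iterations(sample)
best = max(rows, key=lambda row: row["metric"])
worst = min(rows, key=lambda row: row["metric"])
print("best: ", best["iteration"], best["pipeline"], best["metric"])
print("worst:", worst["iteration"], worst["pipeline"], worst["metric"])
```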

The AUC of about 0.99 is quite high, so something may be leaking, but I'll ignore that for now.
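
One way to sanity-check the leakage suspicion without Azure ML is to score a plain scikit-learn baseline on the same dataset. The sketch below is not what AutoML does internally; it is a swapped-in baseline (standard scaling plus logistic regression) evaluated with 5-fold cross-validation and ROC AUC, which I am assuming is comparable to AUC_weighted for a binary problem. The scaler sits inside the pipeline so that it is refit on each training fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same dataset as run.py, loaded directly as arrays
X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline, so each fold is scaled using
# statistics from its own training portion only
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validated ROC AUC
scores = cross_val_score(baseline, X, y, cv=5, scoring="roc_auc")
print("ROC AUC per fold:", scores.round(4))
print("Mean ROC AUC:", round(scores.mean(), 4))
```

If this baseline also lands close to the AutoML result, the high score more likely reflects how separable the dataset is than a leak in the pipeline.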

Summary

I've summarized the steps for running Azure ML in a local environment. Azure ML is convenient, but I do wish the official Azure documentation were organized a bit more clearly...
