[AWS SageMaker Studio] Trying out PDP report generation with SageMaker Clarify

Posted at 2024-07-21

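Background

SageMaker Clarify can automatically generate a report on the predictions of a trained model, so I gave it a try.

Environment

sagemaker: 2.219.0

What I tried (overview)

Using data from the Nishika competition site, I trained SageMaker's built-in XGBoost algorithm, and then generated a SageMaker Clarify report on the model's predictions.

What I tried (details)

1. Preparation

1.1. Prepare the data

Download train.zip from the competition site.

Unzipping it yields a train folder containing many CSV files. Combine them into a single CSV file with pandas and save it as train.csv.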

hogehoge.ipynb

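import glob

import pandas as pd
import sklearn.model_selection  # used further down for train_test_split

train_df_list = []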
for csv_file in glob.glob("data/Nishika_ApartmentPrice/train/*"):
    df = pd.read_csv(filepath_or_buffer=csv_file,
                     converters={"面積（㎡）": str},
                     encoding="utf-8")
    train_df_list.append(df)
train_df = pd.concat(objs=train_df_list,
                     axis=0,
                     ignore_index=True)
print(train_df.columns)
train_df.to_csv(path_or_buf="data/Nishika_ApartmentPrice/train.csv",
                header=True,
                index=False,
                encoding="utf-8")

Index(['ID', '種類', '地域', '市区町村コード', '都道府県名', '市区町村名', '地区名', '最寄駅：名称', '最寄駅：距離（分）', '間取り', '面積（㎡）', '土地の形状', '間口', '延床面積（㎡）', '建築年', '建物の構造', '用途', '今後の利用目的', '前面道路：方位', '前面道路：種類', '前面道路：幅員（ｍ）', '都市計画', '建ぺい率（％）', '容積率（％）', '取引時点', '改装', '取引の事情等', '取引価格（総額）_log'], dtype='object')

1.2. Preprocess the data

Load the train.csv created above as a DataFrame.

hogehoge.ipynb

train_df = pd.read_csv(filepath_or_buffer="data/Nishika_ApartmentPrice/train.csv",
                       converters={"面積（㎡）": str},
                       encoding="utf-8")

To keep things readable, I reduce the number of columns.
I apply the same preprocessing as in section 1.4 of an earlier article to train_df, producing simple_train_df.

hogehoge.ipynb

simple_train_df = train_df[["市区町村コード", "最寄駅：距離（分）", "面積（㎡）", "建ぺい率（％）", "容積率（％）", "取引価格（総額）_log"]].dropna(how="any").copy()
simple_train_df["面積（㎡）"] = simple_train_df["面積（㎡）"].apply(lambda x: "2000" if x == "2000㎡以上" else x)
simple_train_df["面積（㎡）"] = simple_train_df["面積（㎡）"].astype("int")
simple_train_df["最寄駅：距離（分）"] = simple_train_df["最寄駅：距離（分）"].apply(lambda x: "45" if x == "30分?60分" else "75" if x == "1H?1H30" else "105" if x == "1H30?2H" else "120" if x == "2H?" else x)
simple_train_df["最寄駅：距離（分）"] = simple_train_df["最寄駅：距離（分）"].astype("int")
simple_train_df["建ぺい率（％）"] = simple_train_df["建ぺい率（％）"].astype("int")
simple_train_df["容積率（％）"] = simple_train_df["容積率（％）"].astype("int")
print(simple_train_df)

        市区町村コード  最寄駅：距離（分）  面積（㎡）  建ぺい率（％）  容積率（％）  取引価格（総額）_log
0         30201         45     45       80     300      6.875061
1         30201          8     75       80     400      7.397940
2         30201          6     75       80     400      6.880814
3         30201         29     60       80     300      6.869232
4         30201          9     65       80     400      7.255273
...         ...        ...    ...      ...     ...           ...
637346    41201         24     70       60     200      6.944483
637347    41201          6     65       80     400      7.000000
637348    41201         26     35       80     400      6.643453
637349    41201         28     15       80     400      6.431364
637350    41201          8     75       60     200      7.146128

[600296 rows x 6 columns]
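
The nested conditional used for the station-distance column can be hard to scan. As a minimal sketch (not the code used in this article), the same mapping can be written as a dictionary lookup. `DISTANCE_BUCKETS` and `to_minutes` are hypothetical names introduced here, and the bucket labels are copied from the code above.

```python
# Hypothetical helper: maps the bucketed station-distance labels used above
# to representative minute values, and converts plain numeric strings as-is.
DISTANCE_BUCKETS = {
    "30分?60分": 45,
    "1H?1H30": 75,
    "1H30?2H": 105,
    "2H?": 120,
}


def to_minutes(value):
    """Return the distance in minutes for a raw station-distance value."""
    if value in DISTANCE_BUCKETS:
        return DISTANCE_BUCKETS[value]
    return int(value)


print([to_minutes(v) for v in ["8", "30分?60分", "2H?"]])
```

With pandas, a function like this could be applied to the column through `Series.map(to_minutes)` instead of the chained `apply` and `astype` calls.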

Set aside 20% of simple_train_df as simple_valid_df.

hogehoge.ipynb

simple_train_df, simple_valid_df = sklearn.model_selection.train_test_split(simple_train_df,
                                                                            test_size=0.2,
                                                                            random_state=42,
                                                                            shuffle=True)

hogehoge.ipynb

print(simple_train_df)

        市区町村コード  最寄駅：距離（分）  面積（㎡）  建ぺい率（％）  容積率（％）  取引価格（総額）_log
495218    28111         15     85       60     200      7.255273
618152    43105          5     75       60     200      6.740363
192482    13114          5     15       80     400      7.041393
280095    13102          4     20       80     500      7.204120
179815    13104          6     20       80     500      7.322219
...         ...        ...    ...      ...     ...           ...
114801    11219         18     65       60     200      6.959041
272349    13119          8     20       80     500      7.322219
386147    23106          1     20       80     600      6.568202
137545    34101         45     65       80     300      7.255273
127044     4102          5     55       80     500      7.602060

[480236 rows x 6 columns]

hogehoge.ipynb

print(simple_valid_df)

        市区町村コード  最寄駅：距離（分）  面積（㎡）  建ぺい率（％）  容積率（％）  取引価格（総額）_log
496579    28204          7     85       60     200      7.505150
99805     11243          6     75       60     200      7.146128
268888    13103          3     15       60     300      7.278754
323553    13114          5     35       80     500      7.414973
253899    13215         11     50       60     200      7.041393
...         ...        ...    ...      ...     ...           ...
322959    13122          8     60       60     400      7.544068
59887     27123          5     65       80     600      7.113943
77106     25201          1     85       80     600      7.322219
461758    14118         13     95       60     200      7.653213
22763     27109          4     85       80     600      7.477121

[120060 rows x 6 columns]

Set aside 0.5% of simple_valid_df as simple_valid_for_clarify_df.

hogehoge.ipynb

simple_valid_df, simple_valid_for_clarify_df = sklearn.model_selection.train_test_split(simple_valid_df,
                                                                                              test_size=0.005,
                                                                                              random_state=42,
                                                                                              shuffle=True)
print(simple_valid_for_clarify_df)

        取引価格（総額）_log  市区町村コード  最寄駅：距離（分）  面積（㎡）  建ぺい率（％）  容積率（％）
426497      7.255273    14133          5     45       80     400
625409      6.857332    26109          8     60       60     300
299436      7.913814    13105          7    115       60     150
104324      7.342423    11109         16     60       60     200
558387      6.832509    40109         14     65       60     200
...              ...      ...        ...    ...      ...     ...
167628      7.397940    13110         13     25       80     400
527812      7.342423    28111          7     80       60     200
574033      7.255273    29201         10     80       60     200
371022      7.301030    23104         17     70       60     200
311413      7.568202    13123          1     75       60     400

[601 rows x 6 columns]
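
The row counts in the printed outputs follow from how `train_test_split` sizes the test set. To my understanding, when `test_size` is a float, scikit-learn rounds the test count up and gives the remainder to the training side. The sketch below reproduces that arithmetic with a hypothetical `split_sizes` helper, without calling scikit-learn itself.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test count is rounded up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test


# First split: 20% of the 600,296 preprocessed rows.
n_train, n_valid = split_sizes(600296, 0.2)
# Second split: 0.5% of the validation rows are set aside for Clarify.
n_valid_rest, n_clarify = split_sizes(n_valid, 0.005)
print(n_train, n_valid, n_valid_rest, n_clarify)
```

Note that the 120,060-row simple_valid_df printed earlier is the state before the second split, so the validation set actually written to CSV is slightly smaller.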

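Save simple_train_df, simple_valid_df, and simple_valid_for_clarify_df as simple_train.csv, simple_valid.csv, and simple_valid_for_clarify.csv, then download them.
SageMaker's built-in XGBoost does not expect a header row in CSV input for training or prediction, so the column names are dropped at this point. (For training, the algorithm also reads the first column as the target variable.)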

hogehoge.ipynb

simple_train_df.to_csv(path_or_buf="data/Nishika_ApartmentPrice/simple_train.csv",
                       header=False,
                       index=False,
                       encoding="utf-8")
simple_valid_df.to_csv(path_or_buf="data/Nishika_ApartmentPrice/simple_valid.csv",
                       header=False,
                       index=False,
                       encoding="utf-8")
simple_valid_for_clarify_df.to_csv(path_or_buf="data/Nishika_ApartmentPrice/simple_valid_for_clarify.csv",
                                   header=False,
                                   index=False,
                                   encoding="utf-8")
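
SageMaker's built-in XGBoost treats the first CSV column as the target when training on headerless CSV. The `print(simple_valid_for_clarify_df)` output above shows the target column first, but the reordering step itself is not shown in this article, so the following is only a sketch of one way to do it. `label_first` is a hypothetical helper.

```python
def label_first(columns, label):
    """Return the column names with `label` moved to the front."""
    if label not in columns:
        raise ValueError(f"{label!r} is not one of the columns")
    return [label] + [c for c in columns if c != label]


columns = ["市区町村コード", "最寄駅：距離（分）", "面積（㎡）",
           "建ぺい率（％）", "容積率（％）", "取引価格（総額）_log"]
ordered = label_first(columns, "取引価格（総額）_log")
print(ordered)
# With pandas (not run here): simple_train_df = simple_train_df[ordered]
```

Applying the same ordering to all three DataFrames before `to_csv` would keep the training, validation, and Clarify files consistent with each other.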

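1.3. Save the data to S3

Create an S3 folder for this experiment and upload simple_train.csv, simple_valid.csv, and simple_valid_for_clarify.csv to it.

That completes the preparation.

2. Implementation

2.1. Open JupyterLab in SageMaker Studio

In SageMaker Studio, open JupyterLab from an existing instance or a newly created one.

2.2. Train SageMaker's built-in XGBoost algorithm

Import the libraries and set the initial and boilerplate variables.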

wakuwaku.ipynb

from datetime import datetime
import glob
import os
import numpy as np
import pandas as pd
import boto3
import sagemaker
import sklearn.model_selection

wakuwaku.ipynb

S3_PREFIX = "data-for-machine-learning/for_sagemaker_clarify"
TRAIN_DATA_PATH = "simple_train.csv"
VALID_DATA_PATH = "simple_valid.csv"

wakuwaku.ipynb

ROLE = sagemaker.get_execution_role()
SAGEMAKER_SESSION = sagemaker.Session()
S3_BUCKET = SAGEMAKER_SESSION.default_bucket()

Retrieve the container image for the built-in XGBoost algorithm, instantiate the Estimator class, and then set the XGBoost hyperparameters on that Estimator object.

wakuwaku.ipynb

xgboost_container = sagemaker.image_uris.retrieve(framework="xgboost",
                                                  region="ap-northeast-1",
                                                  version="1.5-1")

wakuwaku.ipynb

s3_train_output_path = "s3://" + os.path.join(S3_BUCKET, S3_PREFIX)
xgboost_model = sagemaker.estimator.Estimator(image_uri=xgboost_container,
                                              role=ROLE,
                                              instance_count=1,
                                              instance_type="ml.m5.large",
                                              output_path=s3_train_output_path,
                                              sagemaker_session=SAGEMAKER_SESSION)
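
The S3 URIs in this article are built with `os.path.join`, which works on the Linux-based JupyterLab environment because the path separator there is `/`. A minimal platform-independent sketch uses `posixpath.join` instead. `s3_uri` is a hypothetical helper and the bucket name below is a placeholder.

```python
import posixpath


def s3_uri(bucket, *parts):
    """Build an s3:// URI, always joining key parts with forward slashes."""
    return "s3://" + posixpath.join(bucket, *parts)


print(s3_uri("example-bucket",
             "data-for-machine-learning/for_sagemaker_clarify",
             "simple_train.csv"))
```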

wakuwaku.ipynb

xgboost_model.set_hyperparameters(num_round=100,
                                  early_stopping_rounds=20,
                                  eval_metric="rmse",
                                  objective="reg:squarederror")

Train the built-in XGBoost algorithm using simple_train.csv and simple_valid.csv stored in S3.

wakuwaku.ipynb

s3_train_data_file_path = "s3://" + os.path.join(S3_BUCKET, S3_PREFIX, TRAIN_DATA_PATH)
s3_valid_data_file_path = "s3://" + os.path.join(S3_BUCKET, S3_PREFIX, VALID_DATA_PATH)
s3_train_data = sagemaker.inputs.TrainingInput(s3_data=s3_train_data_file_path,
                                               content_type="text/csv")
s3_valid_data = sagemaker.inputs.TrainingInput(s3_data=s3_valid_data_file_path,
                                               content_type="text/csv")

wakuwaku.ipynb

now = datetime.now()
year = str(now.year)
month = str(now.month)
day = str(now.day)
hour = str(now.hour)
minute = str(now.minute)
second = str(now.second)
job_timestamp = year + month + day + hour + minute + second
training_job_name = "test-{a}".format(a=job_timestamp)
xgboost_model.fit({"train": s3_train_data, "validation": s3_valid_data},
                  job_name=training_job_name)

INFO:sagemaker:Creating training-job with name: test-2024714111453
2024-07-14 11:14:53 Starting - Starting the training job...

2024-07-14 11:18:08 Completed - Training job completed
Training seconds: 145
Billable seconds: 145
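
The job name above concatenates unpadded date fields, so different timestamps can produce the same prefix (for example, 11 January and 1 November both yield `2024111`). A sketch of a zero-padded alternative using `strftime` follows. `make_job_name` is a hypothetical helper, not part of the SageMaker SDK.

```python
from datetime import datetime


def make_job_name(prefix, now):
    """Return '<prefix>-YYYYMMDD-HHMMSS' with zero-padded fields."""
    return f"{prefix}-{now.strftime('%Y%m%d-%H%M%S')}"


print(make_job_name("test", datetime(2024, 7, 14, 11, 14, 53)))
```

The result contains only alphanumeric characters and hyphens, which is what SageMaker job names allow.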

2.3. Create a model for Clarify

Instantiate the Model class from the model.tar.gz file produced by training.
(Apologies for the confusion caused by reusing the variable name xgboost_model.)

wakuwaku.ipynb

training_job_name = "test-2024714111453"  # name of the training job run above
xgboost_model = sagemaker.model.Model(image_uri=xgboost_container,
                                      model_data="{a}/{b}/output/model.tar.gz".format(a=s3_train_output_path,
                                                                                      b=training_job_name),
                                      role=ROLE,
                                      # Predictor is the SDK v2 name; RealTimePredictor is the deprecated v1 name
                                      predictor_cls=sagemaker.predictor.Predictor)

Use the Model object to create the SageMaker model that Clarify will use.

wakuwaku.ipynb

model_name_for_clarify = "test-xgboost-model-for-clarify"  # name for the Clarify model
container_for_clarify = xgboost_model.prepare_container_def(instance_type="ml.m5.large")
SAGEMAKER_SESSION.create_model(name=model_name_for_clarify,
                               role=ROLE,
                               container_defs=container_for_clarify)

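The model for Clarify has now been created.
It can also be seen on the SageMaker page of the AWS Management Console.

2.4. Prepare to generate the Clarify report

Next, prepare to generate a Clarify report on the XGBoost model's predictions, using simple_valid_for_clarify.csv stored in S3.
First, set up the S3 path to simple_valid_for_clarify.csv.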

wakuwaku.ipynb

s3_clarify_pdp_report_output_path = "s3://" + os.path.join(S3_BUCKET, S3_PREFIX) + "/clarify_pdp_report"
VALID_FOR_CLARIFY_DATA_PATH = "simple_valid_for_clarify.csv"
s3_valid_for_clarify_data_file_path = "s3://" + os.path.join(S3_BUCKET, S3_PREFIX, VALID_FOR_CLARIFY_DATA_PATH)

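This time I generate a PDP (partial dependence plot) report with Clarify.
Generating a PDP report requires four configuration objects.
The first is DataConfig.
The headers argument takes the column names that were dropped when writing the CSV files. I use English names so that they are not garbled in the PDP report.
The label argument takes the column name of the target variable.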

wakuwaku.ipynb

clarify_pdp_report_data_config = sagemaker.clarify.DataConfig(s3_data_input_path=s3_valid_for_clarify_data_file_path,
                                                              s3_output_path=s3_clarify_pdp_report_output_path,
                                                              label="predicted_price",
                                                              headers=["predicted_price", "prefecture_code", "distance_from_station", "area", "built_area", "built_volumn"],
                                                              dataset_type="text/csv")

The second is ModelConfig.
The model_name argument takes the name of the Clarify model created earlier.
(Here, that is model_name_for_clarify = "test-xgboost-model-for-clarify".)

wakuwaku.ipynb

clarify_pdp_report_model_config = sagemaker.clarify.ModelConfig(model_name=model_name_for_clarify,
                                                                instance_count=1,
                                                                instance_type="ml.m5.large",
                                                                accept_type="text/csv",
                                                                content_type="text/csv")

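The third is ModelPredictedLabelConfig. Note that probability_threshold is intended for binary classification models that return a probability, so it probably has no effect on this regression model.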

wakuwaku.ipynb

clarify_pdp_report_predict_config = sagemaker.clarify.ModelPredictedLabelConfig(probability_threshold=0.8)

The fourth is PDPConfig.
The features argument takes the column names of the explanatory variables.
(In short: DataConfig's headers lists all column names, DataConfig's label is the target column, and PDPConfig's features lists the explanatory columns.)

wakuwaku.ipynb

clarify_pdp_report_config = sagemaker.clarify.PDPConfig(features=["prefecture_code", "distance_from_station", "area", "built_area", "built_volumn"],
                                                        grid_resolution=15)
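
Since features is simply headers minus label, the list can be derived rather than typed twice, which keeps the two from drifting apart. A small sketch reusing the column names from the configs above; `feature_names` is a hypothetical helper.

```python
def feature_names(headers, label):
    """Return every header except the label, preserving order."""
    if label not in headers:
        raise ValueError(f"label {label!r} not found in headers")
    return [h for h in headers if h != label]


HEADERS = ["predicted_price", "prefecture_code", "distance_from_station",
           "area", "built_area", "built_volumn"]
print(feature_names(HEADERS, "predicted_price"))
```

The returned list could then be passed as the features argument of PDPConfig, with HEADERS passed as the headers argument of DataConfig.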

2.5. Generate the report with Clarify

First, create a Processor.
Clarify reports are generated through this Processor.

wakuwaku.ipynb

clarify_processor = sagemaker.clarify.SageMakerClarifyProcessor(role=ROLE,
                                                                instance_count=1,
                                                                instance_type="ml.m5.large",
                                                                max_runtime_in_seconds=3600,
                                                                sagemaker_session=SAGEMAKER_SESSION)

Call the Processor's run_explainability method to generate the Clarify PDP report.
Pass in the four objects prepared above: DataConfig, ModelConfig, ModelPredictedLabelConfig, and PDPConfig.

wakuwaku.ipynb

clarify_processor.run_explainability(data_config=clarify_pdp_report_data_config,
                                     model_config=clarify_pdp_report_model_config,
                                     explainability_config=clarify_pdp_report_config,
                                     model_scores=clarify_pdp_report_predict_config,
                                     wait=True,
                                     logs=True)

INFO:sagemaker.clarify:Analysis Config: {'dataset_type': 'text/csv', 'headers': ['predicted_price', 'prefecture_code', 'distance_from_station', 'area', 'built_area', 'built_volumn'], 'label': 'predicted_price', 'predictor': {'model_name': 'test-xgboost-model-for-clarify', 'instance_type': 'ml.m5.large', 'initial_instance_count': 1, 'accept_type': 'text/csv', 'content_type': 'text/csv'}, 'probability_threshold': 0.8, 'methods': {'report': {'name': 'report', 'title': 'Analysis Report'}, 'pdp': {'grid_resolution': 15, 'top_k_features': 10, 'features': ['prefecture_code', 'distance_from_station', 'area', 'built_area', 'built_volumn']}}}
INFO:sagemaker:Creating processing-job with name Clarify-Explainability-2024-07-16-01-45-46-375

[NbConvertApp] Converting notebook /opt/ml/processing/output/report.ipynb to html
[NbConvertApp] Writing 561483 bytes to /opt/ml/processing/output/report.html
INFO:analyzer.utils.util:['wkhtmltopdf', '-q', '--enable-local-file-access', '/opt/ml/processing/output/report.html', '/opt/ml/processing/output/report.pdf']
INFO:analyzer.utils.system_util:exit_message: Completed: SageMaker XAI Analyzer ran successfully
INFO:py4j.clientserver:Closing down clientserver connection
---!

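Several files were created in the S3 folder specified by DataConfig's s3_output_path argument.

2.6. Check the report

Download the generated report.pdf and open it locally.

The larger the distance from the station in minutes (distance_from_station), the lower the predicted apartment price appears to be. There also seems to be a rough break in the predicted price around 30 and 60 minutes.
Conversely, the larger the area (area), the higher the predicted price appears to be.
Both match common intuition, which made me think "that makes sense" (lol).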
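
For intuition about what the PDP curves represent: a partial dependence value for one feature is, roughly, the model's average prediction over the dataset when that feature is forced to a fixed grid value. The sketch below illustrates this with a toy linear model and evenly spaced grid points. It is not Clarify's implementation, and `partial_dependence`, `toy_model`, and the sample rows are all made up for illustration.

```python
def partial_dependence(predict, rows, feature_index, grid_resolution):
    """Average prediction at evenly spaced values of one feature.

    predict: function taking a list of feature values and returning a float.
    rows: list of feature-value lists (the evaluation dataset).
    """
    if grid_resolution < 2:
        raise ValueError("grid_resolution must be at least 2")
    values = [row[feature_index] for row in rows]
    low, high = min(values), max(values)
    step = (high - low) / (grid_resolution - 1)
    curve = []
    for i in range(grid_resolution):
        grid_value = low + i * step
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature_index] = grid_value  # force the feature value
            total += predict(modified)
        curve.append((grid_value, total / len(rows)))
    return curve


def toy_model(features):
    """Made-up model: price falls with distance and rises with area."""
    distance, area = features
    return 7.0 - 0.01 * distance + 0.005 * area


rows = [[5, 45], [8, 60], [16, 60], [45, 75]]
curve = partial_dependence(toy_model, rows, 0, 3)
for grid_value, mean_prediction in curve:
    print(grid_value, round(mean_prediction, 4))
```

In this toy setup the averaged prediction decreases as the distance grid value increases, which is the same kind of downward trend described for distance_from_station above.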

That's all for this time.

Summary

I used SageMaker Clarify to generate a report on a model's predictions. Besides PDP reports, there also seem to be SHAP value reports, so next time I plan to try generating one of those.
