An Old Engineer Builds Machine Learning Models with AutoGluon on Google Colaboratory

Posted at 2023-05-12, last updated at 2023-05-15

1. First, just trying it: "Build a machine learning model in 5 lines"

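I heard that AutoGluon lets you build a machine learning model in 5 lines, so I gave it a try.
Keeping a development environment on a local machine is a thing of the past, so I obediently used Colab.
Before the 5 lines, a little incantation is apparently required: use "bang pip" (that is, !pip) to install whatever is missing.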


!pip install --upgrade pip
!pip install --upgrade setuptools
!pip install --upgrade "mxnet<2.0.0"
!pip install --pre autogluon

When this finishes, a RESTART RUNTIME button appears, so press it.

Once this ritual has been performed, the following import goes through.

from autogluon.tabular import TabularDataset, TabularPredictor

Then fetch the tutorial data, feed it in, and that is all it takes.

train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
predictor = TabularPredictor(label='class').fit(train_data, time_limit=60)  # Fit models for 60s
leaderboard = predictor.leaderboard(test_data)

Here is the output from running it. About 83 seconds after pressing the run button, all of this comes pouring out at once.

【No path was specified, so the models are saved to the following folder】

No path specified. Models will be saved in: "AutogluonModels/ag-20230512_140319/"

【Training settings, environment, module versions, etc.】

Beginning AutoGluon training ... Time limit = 60s
AutoGluon will save models to "AutogluonModels/ag-20230512_140319/"
AutoGluon Version:  0.7.1b20230511
Python Version:     3.10.11
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Sat Dec 10 16:00:40 UTC 2022
Disk Space Avail:   84.10 GB / 115.66 GB (72.7%)
Train Data Rows:    39073
Train Data Columns: 14
Label Column: class

【Data preprocessing.
   "This is binary classification, right? If not, specify multiclass or regression explicitly."】

Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' <=50K', ' >50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
	Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
	To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    12300.57 MB
	Train Data (Original)  Memory Usage: 22.92 MB (0.2% of available memory)

【Feature engineering.
   The data type of each feature is inferred from the column values】

	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
			Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Types of features in original data (raw dtype, special dtypes):
		('int', [])    : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
		('object', []) : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
	Types of features in processed data (raw dtype, special dtypes):
		('category', [])  : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
		('int', [])       : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
		('int', ['bool']) : 1 | ['sex']
	1.0s = Fit runtime
	14 features in original data used to generate 14 features in processed data.
	Train Data (Processed) Memory Usage: 2.19 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 1.31s ...

【Model building, with accuracy as the metric for predictive performance.
   The eval_metric setting switches to another metric】

AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
	To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.0639828014229775, Train Rows: 36573, Val Rows: 2500
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ... Training model for up to 58.69s of the 58.67s of remaining time.
	0.7752	 = Validation score   (accuracy)
	9.01s	 = Training   runtime
	0.1s	 = Validation runtime
Fitting model: KNeighborsDist ... Training model for up to 49.54s of the 49.53s of remaining time.
	0.766	 = Validation score   (accuracy)
	0.17s	 = Training   runtime
	0.09s	 = Validation runtime
Fitting model: LightGBMXT ... Training model for up to 49.25s of the 49.23s of remaining time.
	0.8792	 = Validation score   (accuracy)
	3.24s	 = Training   runtime
	0.11s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 45.82s of the 45.8s of remaining time.
	0.8824	 = Validation score   (accuracy)
	1.83s	 = Training   runtime
	0.07s	 = Validation runtime
Fitting model: RandomForestGini ... Training model for up to 43.88s of the 43.87s of remaining time.
	0.8612	 = Validation score   (accuracy)
	17.17s	 = Training   runtime
	0.23s	 = Validation runtime
Fitting model: RandomForestEntr ... Training model for up to 25.15s of the 25.14s of remaining time.
	0.8584	 = Validation score   (accuracy)
	17.58s	 = Training   runtime
	0.44s	 = Validation runtime
Fitting model: CatBoost ... Training model for up to 5.92s of the 5.9s of remaining time.
	Ran out of time, early stopping on iteration 67.
	0.8696	 = Validation score   (accuracy)
	5.84s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: ExtraTreesGini ... Training model for up to 0.05s of the 0.03s of remaining time.
	0.8528	 = Validation score   (accuracy)
	8.68s	 = Training   runtime
	0.49s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 58.69s of the -11.45s of remaining time.
	WARNING: Setting `self._oof_pred_proba` by predicting on train directly! This is probably a bug and should be investigated...
	0.8868	 = Validation score   (accuracy)
	1.92s	 = Training   runtime
	0.01s	 = Validation runtime

【Leaderboard.
   "WeightedEnsemble_L2" is at the top】

AutoGluon training complete, total runtime = 73.55s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230512_140319/")
                 model  score_test  score_val  pred_time_test  pred_time_val   fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0  WeightedEnsemble_L2    0.874910     0.8868        1.075271       0.287940   7.162890                 0.035590                0.010334           1.917196            2       True          9
1             LightGBM    0.873477     0.8824        0.322208       0.071207   1.828028                 0.322208                0.071207           1.828028            1       True          4
2           LightGBMXT    0.871430     0.8792        0.536249       0.111765   3.243746                 0.536249                0.111765           3.243746            1       True          3
3             CatBoost    0.866107     0.8696        0.063781       0.014976   5.838724                 0.063781                0.014976           5.838724            1       True          7
4     RandomForestGini    0.859351     0.8612        1.407241       0.231354  17.167245                 1.407241                0.231354          17.167245            1       True          5
5     RandomForestEntr    0.857611     0.8584        1.523726       0.443146  17.578542                 1.523726                0.443146          17.578542            1       True          6
6       ExtraTreesGini    0.853414     0.8528        1.882560       0.494139   8.678844                 1.882560                0.494139           8.678844            1       True          8
7       KNeighborsUnif    0.773467     0.7752        0.171709       0.103045   9.005079                 0.171709                0.103045           9.005079            1       True          1
8       KNeighborsDist    0.762719     0.7660        0.181224       0.094634   0.173920                 0.181224                0.094634           0.173920            1       True          2

That is the output of the code above.
To restate the code:

from autogluon.tabular import TabularDataset, TabularPredictor
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
predictor = TabularPredictor(label='class').fit(train_data, time_limit=60)  # Fit models for 60s
leaderboard = predictor.leaderboard(test_data)

A mere 5 lines. Not bad, AutoGluon! Still, this alone leaves something to be desired.
So I will keep adding code.
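
One thing worth reading off the leaderboard above before moving on: fit_time covers a model together with any base models it is stacked on, while fit_time_marginal covers that model alone. The sketch below checks this arithmetic with values copied from the leaderboard. Which base models WeightedEnsemble_L2 actually uses is an assumption here, inferred only because these three make the numbers add up; it is not read from the predictor.

```python
# Values copied from the first leaderboard in this article (seconds).
fit_time = {
    "WeightedEnsemble_L2": 7.162890,
    "LightGBM": 1.828028,
    "LightGBMXT": 3.243746,
    "KNeighborsDist": 0.173920,
}
ensemble_marginal = 1.917196  # fit_time_marginal of WeightedEnsemble_L2

# Assumption: the ensemble is built on these three base models.
# This is a guess that happens to match the numbers, not a fact
# obtained from the trained predictor.
assumed_base_models = ["LightGBM", "LightGBMXT", "KNeighborsDist"]

# End-to-end fit time = the ensemble's own time + its base models' times.
reconstructed = ensemble_marginal + sum(fit_time[m] for m in assumed_base_models)
gap = abs(reconstructed - fit_time["WeightedEnsemble_L2"])

print(f"reconstructed fit_time = {reconstructed:.6f} s")
print(f"difference from leaderboard = {gap:.6f} s")
```

The same reading applies to pred_time_test versus pred_time_test_marginal, which is why the ensemble looks slow at prediction time even though its own marginal cost is small.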

2. Putting the spurs to AutoGluon

2.1 Splitting the data into training and test sets (using scikit-learn)

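In the previous chapter I fed AutoGluon data that was already split into training and test sets. Here I load data that is not pre-split and divide it into training and test sets myself.
(In general, the training data is split once more, into data for fitting and data for validation, before training. AutoGluon sets aside validation data automatically, as a holdout split in the logs shown in this article and with k-fold bagging available as an option, so there is no need to make that split beforehand.)
I put the CSV file of the iris classification data on Google Drive, read it into a pandas DataFrame, and split it into training and test sets with scikit-learn's train_test_split.
Change the data location (dataPath), the name of the target column (targetLabel), and the fraction of all rows used for testing (testRate) to values that suit your environment.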

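import pandas as pd
from sklearn.model_selection import train_test_split
from autogluon.tabular import TabularPredictor
from google.colab import drive

# Mount Google Drive so the CSV is readable (Colab only; skip if already mounted)
drive.mount('/content/drive')

dataPath = '/content/drive/MyDrive/Colab Notebooks/Data/iris.csv'  # location of the CSV
targetLabel = 'Species'  # name of the target column
testRate = 0.3           # fraction of all rows held out for testing

df = pd.read_csv(dataPath)
# Fixed random_state so the split is reproducible
train_data, test_data = train_test_split(df, test_size=testRate, random_state=42)
predictor = TabularPredictor(label=targetLabel).fit(train_data, time_limit=60)  # Fit models for 60s
leaderboard = predictor.leaderboard(test_data)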

Running this pours out the following.

No path specified. Models will be saved in: "AutogluonModels/ag-20230515_104202/"
Beginning AutoGluon training ... Time limit = 60s
AutoGluon will save models to "AutogluonModels/ag-20230515_104202/"
AutoGluon Version:  0.7.1b20230515
Python Version:     3.10.11
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Sat Apr 29 09:15:28 UTC 2023
Disk Space Avail:   84.10 GB / 115.66 GB (72.7%)
Train Data Rows:    105
Train Data Columns: 4
Label Column: Species
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == object).
	3 unique label values:  ['Iris-versicolor', 'Iris-virginica', 'Iris-setosa']
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Train Data Class Count: 3
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    11936.16 MB
	Train Data (Original)  Memory Usage: 0.0 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Types of features in original data (raw dtype, special dtypes):
		('float', []) : 4 | ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']
	Types of features in processed data (raw dtype, special dtypes):
		('float', []) : 4 | ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']
	0.1s = Fit runtime
	4 features in original data used to generate 4 features in processed data.
	Train Data (Processed) Memory Usage: 0.0 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.17s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
	To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 84, Val Rows: 21
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ... Training model for up to 59.83s of the 59.82s of remaining time.
	0.8571	 = Validation score   (accuracy)
	7.73s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: KNeighborsDist ... Training model for up to 52.07s of the 52.06s of remaining time.
	0.8571	 = Validation score   (accuracy)
	0.01s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: NeuralNetFastAI ... Training model for up to 52.05s of the 52.04s of remaining time.
No improvement since epoch 3: early stopping
	0.9048	 = Validation score   (accuracy)
	1.68s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: LightGBMXT ... Training model for up to 50.32s of the 50.31s of remaining time.
	0.9524	 = Validation score   (accuracy)
	1.64s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 48.66s of the 48.65s of remaining time.
	0.8571	 = Validation score   (accuracy)
	0.34s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: RandomForestGini ... Training model for up to 48.3s of the 48.29s of remaining time.
	0.8571	 = Validation score   (accuracy)
	1.09s	 = Training   runtime
	0.08s	 = Validation runtime
Fitting model: RandomForestEntr ... Training model for up to 47.1s of the 47.09s of remaining time.
	0.8571	 = Validation score   (accuracy)
	0.73s	 = Training   runtime
	0.07s	 = Validation runtime
Fitting model: CatBoost ... Training model for up to 46.26s of the 46.26s of remaining time.
	0.9524	 = Validation score   (accuracy)
	0.64s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: ExtraTreesGini ... Training model for up to 45.61s of the 45.6s of remaining time.
	0.9048	 = Validation score   (accuracy)
	0.77s	 = Training   runtime
	0.07s	 = Validation runtime
Fitting model: ExtraTreesEntr ... Training model for up to 44.74s of the 44.73s of remaining time.
	0.9048	 = Validation score   (accuracy)
	0.75s	 = Training   runtime
	0.09s	 = Validation runtime
Fitting model: XGBoost ... Training model for up to 43.88s of the 43.87s of remaining time.
	0.9524	 = Validation score   (accuracy)
	0.19s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: NeuralNetTorch ... Training model for up to 43.65s of the 43.64s of remaining time.
	0.9524	 = Validation score   (accuracy)
	0.59s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: LightGBMLarge ... Training model for up to 43.03s of the 43.03s of remaining time.
	0.8571	 = Validation score   (accuracy)
	0.34s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 59.83s of the 42.65s of remaining time.
	WARNING: Setting `self._oof_pred_proba` by predicting on train directly! This is probably a bug and should be investigated...
	0.9524	 = Validation score   (accuracy)
	0.79s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 18.21s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230515_104202/")
                  model  score_test  score_val  pred_time_test  pred_time_val  fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0        KNeighborsDist    1.000000   0.857143        0.004992       0.007145  0.008269                 0.004992                0.007145           0.008269            1       True          2
1               XGBoost    1.000000   0.952381        0.024361       0.004678  0.188949                 0.024361                0.004678           0.188949            1       True         11
2      RandomForestEntr    1.000000   0.857143        0.053687       0.070775  0.733705                 0.053687                0.070775           0.733705            1       True          7
3        ExtraTreesEntr    1.000000   0.904762        0.056597       0.085614  0.748429                 0.056597                0.085614           0.748429            1       True         10
4      RandomForestGini    1.000000   0.857143        0.061609       0.084392  1.093779                 0.061609                0.084392           1.093779            1       True          6
5        ExtraTreesGini    1.000000   0.904762        0.076055       0.068997  0.772397                 0.076055                0.068997           0.772397            1       True          9
6         LightGBMLarge    0.977778   0.857143        0.001844       0.002012  0.344893                 0.001844                0.002012           0.344893            1       True         13
7        KNeighborsUnif    0.977778   0.857143        0.005777       0.016529  7.729195                 0.005777                0.016529           7.729195            1       True          1
8              LightGBM    0.955556   0.857143        0.001748       0.003089  0.343162                 0.001748                0.003089           0.343162            1       True          5
9            LightGBMXT    0.933333   0.952381        0.002483       0.003491  1.644389                 0.002483                0.003491           1.644389            1       True          4
10             CatBoost    0.933333   0.952381        0.002877       0.001789  0.642138                 0.002877                0.001789           0.642138            1       True          8
11  WeightedEnsemble_L2    0.933333   0.952381        0.004703       0.005915  2.438580                 0.002220                0.002424           0.794191            2       True         14
12       NeuralNetTorch    0.911111   0.952381        0.006964       0.006251  0.594723                 0.006964                0.006251           0.594723            1       True         12
13      NeuralNetFastAI    0.866667   0.904762        0.019233       0.016817  1.683288                 0.019233                0.016817           1.683288            1       True          3
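
The row counts in these logs can be checked by hand, and doing so shows how coarse the iris scores are. The sketch below is plain arithmetic on numbers copied from the output above. Two things are assumptions: the rounding in holdout_split is a guess at how the fraction maps to rows rather than AutoGluon's actual code, and the total of 150 rows in iris.csv is inferred from the 105 training rows and testRate = 0.3.

```python
def holdout_split(n_rows, holdout_frac):
    """Return (train_rows, val_rows) for a holdout fraction.

    Rounding to the nearest integer is an assumption for illustration.
    """
    n_val = round(n_rows * holdout_frac)
    return n_rows - n_val, n_val

# holdout_frac values and row counts copied from the two training logs.
adult_split = holdout_split(39073, 0.0639828014229775)
iris_split = holdout_split(105, 0.2)
print("income data:", adult_split)
print("iris data:  ", iris_split)

# Assumption: iris.csv has 150 rows, so the test set has 150 - 105 = 45 rows.
n_test = 150 - 105

# score_test values copied from the iris leaderboard.
scores = [1.000000, 0.977778, 0.955556, 0.933333, 0.911111, 0.866667]
# Convert each accuracy back into a count of misclassified test rows.
errors = {s: round((1 - s) * n_test) for s in scores}
for score, n_wrong in errors.items():
    print(f"score_test {score:.6f} -> {n_wrong} of {n_test} rows wrong")
```

With only 45 test rows and 21 validation rows, one misclassified flower moves the score by more than two points, so the ordering of models in this leaderboard should not be read too closely.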

2.2 Feature importance

To check the feature importance:

# Check the feature importance
predictor.feature_importance(train_data)

Running this gives the following.

Computing feature importance via permutation shuffling for 4 features using 105 rows with 5 shuffle sets...
	0.3s	= Expected runtime (0.06s per shuffle set)
	0.09s	= Actual runtime (Completed 5 of 5 shuffle sets)

index	importance	stddev	p_value	n	p99_high	p99_low
Petal.Width	0.2666666666666666	0.031586902765289505	2.3187467236626715e-05	5	0.3317045360377894	0.20162879729554384
Petal.Length	0.14476190476190476	0.03588846415507956	0.00041841415602088863	5	0.2186567484885019	0.07086706103530763
Sepal.Width	0.04190476190476189	0.012777531299998786	0.0009202540135577972	5	0.06821387545570984	0.015595648353813937
Sepal.Length	0.005714285714285694	0.010858813572372703	0.15227939234026738	5	0.0280727329445543	-0.016644161515982907
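
The log line says the importance is computed "via permutation shuffling". As a rough illustration of that idea, here is a minimal sketch in plain Python: shuffle one column, measure how much the accuracy drops, and average over several shuffles. The toy data, the toy model, and the function names are all made up for this example; this is not AutoGluon's implementation.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows the model classifies correctly."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, col, n_shuffles=5, seed=0):
    """Mean drop in accuracy when column `col` is shuffled across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_shuffles):
        values = [r[col] for r in rows]
        rng.shuffle(values)  # break the link between this column and the label
        permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, values)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / len(drops)

# Toy data: column 0 decides the label, column 1 is pure noise.
data_rng = random.Random(42)
rows = [(data_rng.random(), data_rng.random()) for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]

def toy_model(row):
    """A hypothetical 'trained' model that only looks at column 0."""
    return int(row[0] > 0.5)

imp_signal = permutation_importance(toy_model, rows, labels, col=0)
imp_noise = permutation_importance(toy_model, rows, labels, col=1)
print("importance of column 0 (signal):", imp_signal)
print("importance of column 1 (noise): ", imp_noise)
```

Read this way, the importance column in the table above is a drop in accuracy: shuffling Petal.Width costs about 0.27, while Sepal.Length is close to zero. AutoGluon's documentation advises passing data that was not used for training to feature_importance, so predictor.feature_importance(test_data) may give a less optimistic picture than train_data.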

(To be continued)
