
The Tale of an Old Engineer Building Machine Learning Models with AutoGluon on Google Colaboratory

Posted at 2023-05-12

1. First, I tried it: "build a machine learning model in 5 lines"

Word is that something called AutoGluon lets you build a machine learning model in just 5 lines, so I gave it a try.
The days of keeping a dev environment on a local machine are over, so I dutifully used Colab.
Before the 5 lines, some incantations are apparently required: use bang-pip (that is, !pip) to install what's missing.


!pip install --upgrade pip
!pip install --upgrade setuptools
!pip install --upgrade "mxnet<2.0.0"
!pip install --pre autogluon

When these finish, a RESTART RUNTIME button appears, so press it.

[Screenshot: Colab's RESTART RUNTIME button]

Once this ritual has been performed, the following import goes through.

from autogluon.tabular import TabularDataset, TabularPredictor
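
If you want to double-check what actually got installed, pip can report it (an optional sanity check):

!pip show autogluon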

Then grab the tutorial data, feed it in, and the model just gets built.

train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
predictor = TabularPredictor(label='class').fit(train_data, time_limit=60)  # Fit models for 60s
leaderboard = predictor.leaderboard(test_data)

Here is the output from running this: press the run button, and about 83 seconds later all of the following comes pouring out in one go.

[No path specified, so models are saved to the folder below]

No path specified. Models will be saved in: "AutogluonModels/ag-20230512_140319/"

[Training settings, environment, module versions, and so on]

Beginning AutoGluon training ... Time limit = 60s
AutoGluon will save models to "AutogluonModels/ag-20230512_140319/"
AutoGluon Version:  0.7.1b20230511
Python Version:     3.10.11
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Sat Dec 10 16:00:40 UTC 2022
Disk Space Avail:   84.10 GB / 115.66 GB (72.7%)
Train Data Rows:    39073
Train Data Columns: 14
Label Column: class

[Data preprocessing.
   Binary classification, right? If not, explicitly specify multiclass, regression, etc.]

Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' <=50K', ' >50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
	Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
	To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    12300.57 MB
	Train Data (Original)  Memory Usage: 22.92 MB (0.2% of available memory)

[Feature engineering.
   The data type of each feature is inferred from its column values]

	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
			Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Types of features in original data (raw dtype, special dtypes):
		('int', [])    : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
		('object', []) : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
	Types of features in processed data (raw dtype, special dtypes):
		('category', [])  : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
		('int', [])       : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
		('int', ['bool']) : 1 | ['sex']
	1.0s = Fit runtime
	14 features in original data used to generate 14 features in processed data.
	Train Data (Processed) Memory Usage: 2.19 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 1.31s ...

[Model building, tracking predictive performance with accuracy as the metric.
   You can switch to another metric via the eval_metric setting]

AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
	To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.0639828014229775, Train Rows: 36573, Val Rows: 2500
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ... Training model for up to 58.69s of the 58.67s of remaining time.
	0.7752	 = Validation score   (accuracy)
	9.01s	 = Training   runtime
	0.1s	 = Validation runtime
Fitting model: KNeighborsDist ... Training model for up to 49.54s of the 49.53s of remaining time.
	0.766	 = Validation score   (accuracy)
	0.17s	 = Training   runtime
	0.09s	 = Validation runtime
Fitting model: LightGBMXT ... Training model for up to 49.25s of the 49.23s of remaining time.
	0.8792	 = Validation score   (accuracy)
	3.24s	 = Training   runtime
	0.11s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 45.82s of the 45.8s of remaining time.
	0.8824	 = Validation score   (accuracy)
	1.83s	 = Training   runtime
	0.07s	 = Validation runtime
Fitting model: RandomForestGini ... Training model for up to 43.88s of the 43.87s of remaining time.
	0.8612	 = Validation score   (accuracy)
	17.17s	 = Training   runtime
	0.23s	 = Validation runtime
Fitting model: RandomForestEntr ... Training model for up to 25.15s of the 25.14s of remaining time.
	0.8584	 = Validation score   (accuracy)
	17.58s	 = Training   runtime
	0.44s	 = Validation runtime
Fitting model: CatBoost ... Training model for up to 5.92s of the 5.9s of remaining time.
	Ran out of time, early stopping on iteration 67.
	0.8696	 = Validation score   (accuracy)
	5.84s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: ExtraTreesGini ... Training model for up to 0.05s of the 0.03s of remaining time.
	0.8528	 = Validation score   (accuracy)
	8.68s	 = Training   runtime
	0.49s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 58.69s of the -11.45s of remaining time.
	WARNING: Setting `self._oof_pred_proba` by predicting on train directly! This is probably a bug and should be investigated...
	0.8868	 = Validation score   (accuracy)
	1.92s	 = Training   runtime
	0.01s	 = Validation runtime

[The leaderboard.
   "WeightedEnsemble_L2" takes the top spot]

AutoGluon training complete, total runtime = 73.55s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230512_140319/")
                 model  score_test  score_val  pred_time_test  pred_time_val   fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0  WeightedEnsemble_L2    0.874910     0.8868        1.075271       0.287940   7.162890                 0.035590                0.010334           1.917196            2       True          9
1             LightGBM    0.873477     0.8824        0.322208       0.071207   1.828028                 0.322208                0.071207           1.828028            1       True          4
2           LightGBMXT    0.871430     0.8792        0.536249       0.111765   3.243746                 0.536249                0.111765           3.243746            1       True          3
3             CatBoost    0.866107     0.8696        0.063781       0.014976   5.838724                 0.063781                0.014976           5.838724            1       True          7
4     RandomForestGini    0.859351     0.8612        1.407241       0.231354  17.167245                 1.407241                0.231354          17.167245            1       True          5
5     RandomForestEntr    0.857611     0.8584        1.523726       0.443146  17.578542                 1.523726                0.443146          17.578542            1       True          6
6       ExtraTreesGini    0.853414     0.8528        1.882560       0.494139   8.678844                 1.882560                0.494139           8.678844            1       True          8
7       KNeighborsUnif    0.773467     0.7752        0.171709       0.103045   9.005079                 0.171709                0.103045           9.005079            1       True          1
8       KNeighborsDist    0.762719     0.7660        0.181224       0.094634   0.173920                 0.181224                0.094634           0.173920            1       True          2

That is everything the code above printed.
To write the code out once more:

from autogluon.tabular import TabularDataset, TabularPredictor
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
predictor = TabularPredictor(label='class').fit(train_data, time_limit=60)  # Fit models for 60s
leaderboard = predictor.leaderboard(test_data)

Just five lines in all. Not bad at all, AutoGluon! Still, this alone feels a bit thin, so let's add some more code.
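
As a warm-up, note that the fitted predictor can already make predictions; here is a minimal sketch using TabularPredictor's predict and evaluate methods (the 'class' column name comes from the tutorial data above):

# Predict labels for the test rows (drop the answer column so nothing leaks)
y_pred = predictor.predict(test_data.drop(columns=['class']))
print(y_pred.head())

# Compare predictions against the true labels kept in test_data
print(predictor.evaluate(test_data))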

2. Cracking the whip on AutoGluon

2.1 Splitting the data into training and test sets (using scikit-learn)

In the previous chapter we fed AutoGluon data that had already been split into training and test sets. Here we load data that has not been split and divide it into training and test portions ourselves.
(In general you would further split the training data into a learning set and a validation set before training, but AutoGluon carves out a validation split automatically, holdout by default and k-fold bagging under certain presets, so there is no need to prepare a separate validation set in advance.)
I put a CSV of the iris classification data on Google Drive, read it into a pandas DataFrame, and split it into training and test sets with scikit-learn's train_test_split.
Change the data location (dataPath), the target column name (targetLabel), and the test set's share of the whole (testRate) to suit your own environment.
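
One caveat: the dataPath below points into Google Drive, which must be mounted in the Colab runtime before pandas can read from it. A minimal sketch using Colab's standard drive helper:

from google.colab import drive

# Mount Google Drive under /content/drive (Colab prompts for authorization on first run)
drive.mount('/content/drive')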

from autogluon.tabular import TabularPredictor
from sklearn.model_selection import train_test_split
import pandas as pd

dataPath = '/content/drive/MyDrive/Colab Notebooks/Data/iris.csv'
targetLabel = 'Species'
testRate = 0.3

df = pd.read_csv(dataPath)
# Hold out testRate of the rows as the test set; random_state makes the split reproducible
train_data, test_data = train_test_split(df, test_size=testRate, random_state=42)
predictor = TabularPredictor(label=targetLabel).fit(train_data, time_limit=60)  # Fit models for 60s
leaderboard = predictor.leaderboard(test_data)

Run this and, once again, a torrent like the following pours out.

No path specified. Models will be saved in: "AutogluonModels/ag-20230515_104202/"
Beginning AutoGluon training ... Time limit = 60s
AutoGluon will save models to "AutogluonModels/ag-20230515_104202/"
AutoGluon Version:  0.7.1b20230515
Python Version:     3.10.11
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Sat Apr 29 09:15:28 UTC 2023
Disk Space Avail:   84.10 GB / 115.66 GB (72.7%)
Train Data Rows:    105
Train Data Columns: 4
Label Column: Species
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == object).
	3 unique label values:  ['Iris-versicolor', 'Iris-virginica', 'Iris-setosa']
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Train Data Class Count: 3
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    11936.16 MB
	Train Data (Original)  Memory Usage: 0.0 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Types of features in original data (raw dtype, special dtypes):
		('float', []) : 4 | ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']
	Types of features in processed data (raw dtype, special dtypes):
		('float', []) : 4 | ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']
	0.1s = Fit runtime
	4 features in original data used to generate 4 features in processed data.
	Train Data (Processed) Memory Usage: 0.0 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.17s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
	To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 84, Val Rows: 21
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ... Training model for up to 59.83s of the 59.82s of remaining time.
	0.8571	 = Validation score   (accuracy)
	7.73s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: KNeighborsDist ... Training model for up to 52.07s of the 52.06s of remaining time.
	0.8571	 = Validation score   (accuracy)
	0.01s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: NeuralNetFastAI ... Training model for up to 52.05s of the 52.04s of remaining time.
No improvement since epoch 3: early stopping
	0.9048	 = Validation score   (accuracy)
	1.68s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: LightGBMXT ... Training model for up to 50.32s of the 50.31s of remaining time.
	0.9524	 = Validation score   (accuracy)
	1.64s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 48.66s of the 48.65s of remaining time.
	0.8571	 = Validation score   (accuracy)
	0.34s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: RandomForestGini ... Training model for up to 48.3s of the 48.29s of remaining time.
	0.8571	 = Validation score   (accuracy)
	1.09s	 = Training   runtime
	0.08s	 = Validation runtime
Fitting model: RandomForestEntr ... Training model for up to 47.1s of the 47.09s of remaining time.
	0.8571	 = Validation score   (accuracy)
	0.73s	 = Training   runtime
	0.07s	 = Validation runtime
Fitting model: CatBoost ... Training model for up to 46.26s of the 46.26s of remaining time.
	0.9524	 = Validation score   (accuracy)
	0.64s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: ExtraTreesGini ... Training model for up to 45.61s of the 45.6s of remaining time.
	0.9048	 = Validation score   (accuracy)
	0.77s	 = Training   runtime
	0.07s	 = Validation runtime
Fitting model: ExtraTreesEntr ... Training model for up to 44.74s of the 44.73s of remaining time.
	0.9048	 = Validation score   (accuracy)
	0.75s	 = Training   runtime
	0.09s	 = Validation runtime
Fitting model: XGBoost ... Training model for up to 43.88s of the 43.87s of remaining time.
	0.9524	 = Validation score   (accuracy)
	0.19s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: NeuralNetTorch ... Training model for up to 43.65s of the 43.64s of remaining time.
	0.9524	 = Validation score   (accuracy)
	0.59s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: LightGBMLarge ... Training model for up to 43.03s of the 43.03s of remaining time.
	0.8571	 = Validation score   (accuracy)
	0.34s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 59.83s of the 42.65s of remaining time.
	WARNING: Setting `self._oof_pred_proba` by predicting on train directly! This is probably a bug and should be investigated...
	0.9524	 = Validation score   (accuracy)
	0.79s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 18.21s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230515_104202/")
                  model  score_test  score_val  pred_time_test  pred_time_val  fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0        KNeighborsDist    1.000000   0.857143        0.004992       0.007145  0.008269                 0.004992                0.007145           0.008269            1       True          2
1               XGBoost    1.000000   0.952381        0.024361       0.004678  0.188949                 0.024361                0.004678           0.188949            1       True         11
2      RandomForestEntr    1.000000   0.857143        0.053687       0.070775  0.733705                 0.053687                0.070775           0.733705            1       True          7
3        ExtraTreesEntr    1.000000   0.904762        0.056597       0.085614  0.748429                 0.056597                0.085614           0.748429            1       True         10
4      RandomForestGini    1.000000   0.857143        0.061609       0.084392  1.093779                 0.061609                0.084392           1.093779            1       True          6
5        ExtraTreesGini    1.000000   0.904762        0.076055       0.068997  0.772397                 0.076055                0.068997           0.772397            1       True          9
6         LightGBMLarge    0.977778   0.857143        0.001844       0.002012  0.344893                 0.001844                0.002012           0.344893            1       True         13
7        KNeighborsUnif    0.977778   0.857143        0.005777       0.016529  7.729195                 0.005777                0.016529           7.729195            1       True          1
8              LightGBM    0.955556   0.857143        0.001748       0.003089  0.343162                 0.001748                0.003089           0.343162            1       True          5
9            LightGBMXT    0.933333   0.952381        0.002483       0.003491  1.644389                 0.002483                0.003491           1.644389            1       True          4
10             CatBoost    0.933333   0.952381        0.002877       0.001789  0.642138                 0.002877                0.001789           0.642138            1       True          8
11  WeightedEnsemble_L2    0.933333   0.952381        0.004703       0.005915  2.438580                 0.002220                0.002424           0.794191            2       True         14
12       NeuralNetTorch    0.911111   0.952381        0.006964       0.006251  0.594723                 0.006964                0.006251           0.594723            1       True         12
13      NeuralNetFastAI    0.866667   0.904762        0.019233       0.016817  1.683288                 0.019233                0.016817           1.683288            1       True          3
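
As the log shows, AutoGluon inferred 'multiclass' from the label column's dtype. If you would rather not rely on inference, the problem type and evaluation metric can be pinned when constructing the predictor; a minimal sketch (problem_type and eval_metric are TabularPredictor's documented parameters):

# Specify the problem type and metric explicitly instead of letting AutoGluon guess
predictor = TabularPredictor(
    label=targetLabel,
    problem_type='multiclass',
    eval_metric='accuracy',
).fit(train_data, time_limit=60)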

2.2 Feature importance

To check the importance of each feature, run:

# Check feature importance (permutation importance on the given data)
predictor.feature_importance(train_data)

Running this yields the output below.

Computing feature importance via permutation shuffling for 4 features using 105 rows with 5 shuffle sets...
	0.3s	= Expected runtime (0.06s per shuffle set)
	0.09s	= Actual runtime (Completed 5 of 5 shuffle sets)
              importance    stddev   p_value  n  p99_high   p99_low
Petal.Width     0.266667  0.031587  0.000023  5  0.331705  0.201629
Petal.Length    0.144762  0.035888  0.000418  5  0.218657  0.070867
Sepal.Width     0.041905  0.012778  0.000920  5  0.068214  0.015596
Sepal.Length    0.005714  0.010859  0.152279  5  0.028073 -0.016644
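
Note that permutation importance measured on the training data can look optimistic; passing the held-out test set to the same call is often a fairer check (feature_importance accepts any data with the same columns):

# Same call, but shuffle columns of the held-out test set instead
predictor.feature_importance(test_data)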

(To be continued)
