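1. First things first: I tried "building a machine learning model in 5 lines"
I heard that something called AutoGluon lets you build a machine learning model in five lines, so I gave it a try.
We're past the era of keeping a development environment on your own machine, so I meekly used Colab.
Apparently a little incantation is needed before the five lines: use bang-pip (that is, !pip) to install whatever is missing.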
!pip install --upgrade pip
!pip install --upgrade setuptools
!pip install --upgrade "mxnet<2.0.0"
!pip install --pre autogluon
This prints RESTART RUNTIME, so go ahead and press that button.
Once this ritual has been performed, the following import goes through.
from autogluon.tabular import TabularDataset, TabularPredictor
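If the import still fails after the restart, it can help to check what pip actually installed. Here's a minimal sketch using only the standard library. The helper name `installed_versions` is mine, not part of AutoGluon, and I'm assuming the installed distributions are named `autogluon` and `autogluon.tabular`.

```python
from importlib import metadata


def installed_versions(names=("autogluon", "autogluon.tabular")):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not visible to this Python environment.
            versions[name] = None
    return versions


print(installed_versions())
```

A `None` entry would suggest the install didn't land in the runtime you're currently connected to.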
Then grab the tutorial data, feed it in, and that's all it takes.
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
predictor = TabularPredictor(label='class').fit(train_data, time_limit=60) # Fit models for 60s
leaderboard = predictor.leaderboard(test_data)
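Here's the output from running this. About 83 seconds after I pressed the run button, all of this came pouring out at once. (The lines in 【 】 are my own notes.)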
【No path was given, so the models are saved to the following folder】
No path specified. Models will be saved in: "AutogluonModels/ag-20230512_140319/"
【Training settings, environment, module versions, and so on】
Beginning AutoGluon training ... Time limit = 60s
AutoGluon will save models to "AutogluonModels/ag-20230512_140319/"
AutoGluon Version: 0.7.1b20230511
Python Version: 3.10.11
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Sat Dec 10 16:00:40 UTC 2022
Disk Space Avail: 84.10 GB / 115.66 GB (72.7%)
Train Data Rows: 39073
Train Data Columns: 14
Label Column: class
【Data preprocessing.
"This is binary classification, right? If not, tell me explicitly that it's multiclass or regression."】
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [' <=50K', ' >50K']
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping: class 1 = >50K, class 0 = <=50K
Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 12300.57 MB
Train Data (Original) Memory Usage: 22.92 MB (0.2% of available memory)
【Feature engineering.
Infers the data type of each feature from the column values】
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('object', []) : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('int', ['bool']) : 1 | ['sex']
1.0s = Fit runtime
14 features in original data used to generate 14 features in processed data.
Train Data (Processed) Memory Usage: 2.19 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 1.31s ...
【Builds models while tracking predictive performance, with accuracy as the metric.
You can switch to another metric by setting eval_metric】
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.0639828014229775, Train Rows: 36573, Val Rows: 2500
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ... Training model for up to 58.69s of the 58.67s of remaining time.
0.7752 = Validation score (accuracy)
9.01s = Training runtime
0.1s = Validation runtime
Fitting model: KNeighborsDist ... Training model for up to 49.54s of the 49.53s of remaining time.
0.766 = Validation score (accuracy)
0.17s = Training runtime
0.09s = Validation runtime
Fitting model: LightGBMXT ... Training model for up to 49.25s of the 49.23s of remaining time.
0.8792 = Validation score (accuracy)
3.24s = Training runtime
0.11s = Validation runtime
Fitting model: LightGBM ... Training model for up to 45.82s of the 45.8s of remaining time.
0.8824 = Validation score (accuracy)
1.83s = Training runtime
0.07s = Validation runtime
Fitting model: RandomForestGini ... Training model for up to 43.88s of the 43.87s of remaining time.
0.8612 = Validation score (accuracy)
17.17s = Training runtime
0.23s = Validation runtime
Fitting model: RandomForestEntr ... Training model for up to 25.15s of the 25.14s of remaining time.
0.8584 = Validation score (accuracy)
17.58s = Training runtime
0.44s = Validation runtime
Fitting model: CatBoost ... Training model for up to 5.92s of the 5.9s of remaining time.
Ran out of time, early stopping on iteration 67.
0.8696 = Validation score (accuracy)
5.84s = Training runtime
0.01s = Validation runtime
Fitting model: ExtraTreesGini ... Training model for up to 0.05s of the 0.03s of remaining time.
0.8528 = Validation score (accuracy)
8.68s = Training runtime
0.49s = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 58.69s of the -11.45s of remaining time.
WARNING: Setting `self._oof_pred_proba` by predicting on train directly! This is probably a bug and should be investigated...
0.8868 = Validation score (accuracy)
1.92s = Training runtime
0.01s = Validation runtime
【Leaderboard.
"WeightedEnsemble_L2" comes out on top】
AutoGluon training complete, total runtime = 73.55s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230512_140319/")
index | model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | WeightedEnsemble_L2 | 0.874910 | 0.8868 | 1.075271 | 0.287940 | 7.162890 | 0.035590 | 0.010334 | 1.917196 | 2 | True | 9 |
1 | LightGBM | 0.873477 | 0.8824 | 0.322208 | 0.071207 | 1.828028 | 0.322208 | 0.071207 | 1.828028 | 1 | True | 4 |
2 | LightGBMXT | 0.871430 | 0.8792 | 0.536249 | 0.111765 | 3.243746 | 0.536249 | 0.111765 | 3.243746 | 1 | True | 3 |
3 | CatBoost | 0.866107 | 0.8696 | 0.063781 | 0.014976 | 5.838724 | 0.063781 | 0.014976 | 5.838724 | 1 | True | 7 |
4 | RandomForestGini | 0.859351 | 0.8612 | 1.407241 | 0.231354 | 17.167245 | 1.407241 | 0.231354 | 17.167245 | 1 | True | 5 |
5 | RandomForestEntr | 0.857611 | 0.8584 | 1.523726 | 0.443146 | 17.578542 | 1.523726 | 0.443146 | 17.578542 | 1 | True | 6 |
6 | ExtraTreesGini | 0.853414 | 0.8528 | 1.882560 | 0.494139 | 8.678844 | 1.882560 | 0.494139 | 8.678844 | 1 | True | 8 |
7 | KNeighborsUnif | 0.773467 | 0.7752 | 0.171709 | 0.103045 | 9.005079 | 0.171709 | 0.103045 | 9.005079 | 1 | True | 1 |
8 | KNeighborsDist | 0.762719 | 0.7660 | 0.181224 | 0.094634 | 0.173920 | 0.181224 | 0.094634 | 0.173920 | 1 | True | 2 |
That's the output from running the code above.
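One thing worth reading out of the leaderboard: fit_time appears to be cumulative (the model plus everything it is stacked on), while fit_time_marginal is the model's own share. The sketch below checks that reading against the logged numbers. That the ensemble is built from LightGBM, LightGBMXT, and KNeighborsDist is my inference from the arithmetic, not something the log states.

```python
# fit_time values (seconds) copied from the leaderboard above.
member_fit_time = {
    "LightGBM": 1.828028,
    "LightGBMXT": 3.243746,
    "KNeighborsDist": 0.173920,
}
ensemble_fit_time = 7.162890           # fit_time of WeightedEnsemble_L2
ensemble_fit_time_marginal = 1.917196  # fit_time_marginal of WeightedEnsemble_L2

# Time attributable to the models underneath the ensemble.
base_total = ensemble_fit_time - ensemble_fit_time_marginal
members_total = sum(member_fit_time.values())

print(f"cumulative - marginal = {base_total:.6f}")
print(f"sum of assumed members = {members_total:.6f}")
```

The same subtraction works on pred_time_test (1.075271 - 0.035590 matches the sum of those three models' values), which makes the guess a little more believable.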
To restate the code:
from autogluon.tabular import TabularDataset, TabularPredictor
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
predictor = TabularPredictor(label='class').fit(train_data, time_limit=60) # Fit models for 60s
leaderboard = predictor.leaderboard(test_data)
That's all: just five lines. Not bad, AutoGluon! Still, this alone leaves me wanting more.
So I'll keep adding code.
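A natural first addition is to pin down the settings the log says can be overridden (problem_type and eval_metric) and then actually use the model for prediction. Here's a minimal sketch, not an official recipe. `PREDICTOR_KWARGS` and `train_and_score` are names I made up, `roc_auc` is just one example metric, and AutoGluon is imported inside the function so the block can be pasted in before any training happens.

```python
# Settings made explicit instead of inferred (the values are illustrative).
PREDICTOR_KWARGS = {
    "label": "class",
    "problem_type": "binary",
    "eval_metric": "roc_auc",
}


def train_and_score(train_data, test_data, time_limit=60):
    """Fit a predictor, predict on the test features, and score it on the test set."""
    from autogluon.tabular import TabularPredictor  # needs AutoGluon installed

    predictor = TabularPredictor(**PREDICTOR_KWARGS).fit(train_data, time_limit=time_limit)
    # Drop the label column so the model only sees the features.
    features = test_data.drop(columns=[PREDICTOR_KWARGS["label"]])
    predictions = predictor.predict(features)
    scores = predictor.evaluate(test_data)  # metrics computed on the test set
    return predictor, predictions, scores
```

With the tutorial data already loaded, `train_and_score(train_data, test_data)` would be the whole call.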
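2. Cracking the whip on AutoGluon
2.1 Splitting the data into training and test sets (using scikit-learn)
In the previous section I fed AutoGluon data that had already been split into training and test sets. Here I load data that hasn't been split and divide it into training and test sets myself.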
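(Normally you would split the training data further into a learning set and a validation set before training, but AutoGluon sets aside validation data automatically, so there's no need to do that beforehand. Note that in the runs shown in this article it reports a single holdout split, "Automatically generating train/validation split with holdout_frac=...", rather than k-fold cross-validation.)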
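The training logs in this article report the automatic validation split as a holdout fraction plus row counts. The small sketch below just confirms that those logged numbers hang together. `split_sizes` is a helper of mine, and rounding to the nearest row is my assumption about how the counts come out, not AutoGluon's documented rule.

```python
def split_sizes(n_rows, holdout_frac):
    """Return (train_rows, val_rows) for a single holdout split."""
    val_rows = round(n_rows * holdout_frac)
    return n_rows - val_rows, val_rows


# Income data from section 1: 39073 rows, holdout_frac=0.0639828014229775
print(split_sizes(39073, 0.0639828014229775))  # (36573, 2500)

# Iris training data in this section: 105 rows, holdout_frac=0.2
print(split_sizes(105, 0.2))  # (84, 21)
```

Both results match the "Train Rows / Val Rows" values printed in the logs.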
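I put the iris classification CSV on Google Drive, read it into a pandas DataFrame, and split it into training and test sets with scikit-learn's train_test_split.
Change the data location (dataPath), the target column name (targetLabel), and the fraction of all data used for testing (testRate) to suit your own environment.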
from sklearn.model_selection import train_test_split
# from sklearn.datasets import load_iris
import pandas as pd
from autogluon.tabular import TabularPredictor
# Google Drive must already be mounted at /content/drive for this path to exist
dataPath = '/content/drive/MyDrive/Colab Notebooks/Data/iris.csv'
targetLabel = 'Species'
testRate = 0.3
df = pd.read_csv(dataPath)
# Hold out testRate of the rows as the test set; fixed seed for reproducibility
train_data, test_data = train_test_split(df, test_size=testRate, random_state=42)
predictor = TabularPredictor(label=targetLabel).fit(train_data, time_limit=60) # Fit models for 60s
leaderboard = predictor.leaderboard(test_data)
Running this spits out the following.
No path specified. Models will be saved in: "AutogluonModels/ag-20230515_104202/"
Beginning AutoGluon training ... Time limit = 60s
AutoGluon will save models to "AutogluonModels/ag-20230515_104202/"
AutoGluon Version: 0.7.1b20230515
Python Version: 3.10.11
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Sat Apr 29 09:15:28 UTC 2023
Disk Space Avail: 84.10 GB / 115.66 GB (72.7%)
Train Data Rows: 105
Train Data Columns: 4
Label Column: Species
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == object).
3 unique label values: ['Iris-versicolor', 'Iris-virginica', 'Iris-setosa']
If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Train Data Class Count: 3
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 11936.16 MB
Train Data (Original) Memory Usage: 0.0 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('float', []) : 4 | ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']
Types of features in processed data (raw dtype, special dtypes):
('float', []) : 4 | ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']
0.1s = Fit runtime
4 features in original data used to generate 4 features in processed data.
Train Data (Processed) Memory Usage: 0.0 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.17s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 84, Val Rows: 21
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ... Training model for up to 59.83s of the 59.82s of remaining time.
0.8571 = Validation score (accuracy)
7.73s = Training runtime
0.02s = Validation runtime
Fitting model: KNeighborsDist ... Training model for up to 52.07s of the 52.06s of remaining time.
0.8571 = Validation score (accuracy)
0.01s = Training runtime
0.01s = Validation runtime
Fitting model: NeuralNetFastAI ... Training model for up to 52.05s of the 52.04s of remaining time.
No improvement since epoch 3: early stopping
0.9048 = Validation score (accuracy)
1.68s = Training runtime
0.02s = Validation runtime
Fitting model: LightGBMXT ... Training model for up to 50.32s of the 50.31s of remaining time.
0.9524 = Validation score (accuracy)
1.64s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBM ... Training model for up to 48.66s of the 48.65s of remaining time.
0.8571 = Validation score (accuracy)
0.34s = Training runtime
0.0s = Validation runtime
Fitting model: RandomForestGini ... Training model for up to 48.3s of the 48.29s of remaining time.
0.8571 = Validation score (accuracy)
1.09s = Training runtime
0.08s = Validation runtime
Fitting model: RandomForestEntr ... Training model for up to 47.1s of the 47.09s of remaining time.
0.8571 = Validation score (accuracy)
0.73s = Training runtime
0.07s = Validation runtime
Fitting model: CatBoost ... Training model for up to 46.26s of the 46.26s of remaining time.
0.9524 = Validation score (accuracy)
0.64s = Training runtime
0.0s = Validation runtime
Fitting model: ExtraTreesGini ... Training model for up to 45.61s of the 45.6s of remaining time.
0.9048 = Validation score (accuracy)
0.77s = Training runtime
0.07s = Validation runtime
Fitting model: ExtraTreesEntr ... Training model for up to 44.74s of the 44.73s of remaining time.
0.9048 = Validation score (accuracy)
0.75s = Training runtime
0.09s = Validation runtime
Fitting model: XGBoost ... Training model for up to 43.88s of the 43.87s of remaining time.
0.9524 = Validation score (accuracy)
0.19s = Training runtime
0.0s = Validation runtime
Fitting model: NeuralNetTorch ... Training model for up to 43.65s of the 43.64s of remaining time.
0.9524 = Validation score (accuracy)
0.59s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBMLarge ... Training model for up to 43.03s of the 43.03s of remaining time.
0.8571 = Validation score (accuracy)
0.34s = Training runtime
0.0s = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 59.83s of the 42.65s of remaining time.
WARNING: Setting `self._oof_pred_proba` by predicting on train directly! This is probably a bug and should be investigated...
0.9524 = Validation score (accuracy)
0.79s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 18.21s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230515_104202/")
index | model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | KNeighborsDist | 1.000000 | 0.857143 | 0.004992 | 0.007145 | 0.008269 | 0.004992 | 0.007145 | 0.008269 | 1 | True | 2 |
1 | XGBoost | 1.000000 | 0.952381 | 0.024361 | 0.004678 | 0.188949 | 0.024361 | 0.004678 | 0.188949 | 1 | True | 11 |
2 | RandomForestEntr | 1.000000 | 0.857143 | 0.053687 | 0.070775 | 0.733705 | 0.053687 | 0.070775 | 0.733705 | 1 | True | 7 |
3 | ExtraTreesEntr | 1.000000 | 0.904762 | 0.056597 | 0.085614 | 0.748429 | 0.056597 | 0.085614 | 0.748429 | 1 | True | 10 |
4 | RandomForestGini | 1.000000 | 0.857143 | 0.061609 | 0.084392 | 1.093779 | 0.061609 | 0.084392 | 1.093779 | 1 | True | 6 |
5 | ExtraTreesGini | 1.000000 | 0.904762 | 0.076055 | 0.068997 | 0.772397 | 0.076055 | 0.068997 | 0.772397 | 1 | True | 9 |
6 | LightGBMLarge | 0.977778 | 0.857143 | 0.001844 | 0.002012 | 0.344893 | 0.001844 | 0.002012 | 0.344893 | 1 | True | 13 |
7 | KNeighborsUnif | 0.977778 | 0.857143 | 0.005777 | 0.016529 | 7.729195 | 0.005777 | 0.016529 | 7.729195 | 1 | True | 1 |
8 | LightGBM | 0.955556 | 0.857143 | 0.001748 | 0.003089 | 0.343162 | 0.001748 | 0.003089 | 0.343162 | 1 | True | 5 |
9 | LightGBMXT | 0.933333 | 0.952381 | 0.002483 | 0.003491 | 1.644389 | 0.002483 | 0.003491 | 1.644389 | 1 | True | 4 |
10 | CatBoost | 0.933333 | 0.952381 | 0.002877 | 0.001789 | 0.642138 | 0.002877 | 0.001789 | 0.642138 | 1 | True | 8 |
11 | WeightedEnsemble_L2 | 0.933333 | 0.952381 | 0.004703 | 0.005915 | 2.438580 | 0.002220 | 0.002424 | 0.794191 | 2 | True | 14 |
12 | NeuralNetTorch | 0.911111 | 0.952381 | 0.006964 | 0.006251 | 0.594723 | 0.006964 | 0.006251 | 0.594723 | 1 | True | 12 |
13 | NeuralNetFastAI | 0.866667 | 0.904762 | 0.019233 | 0.016817 | 1.683288 | 0.019233 | 0.016817 | 1.683288 | 1 | True | 3 |
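Before reading too much into this ranking, it's worth seeing how coarse these scores are. The sketch below turns each accuracy back into a count of correct rows. The 21 validation rows come from the log, while the 45 test rows are my assumption that the CSV is the usual 150-row iris set (150 minus the 105 training rows).

```python
n_test = 45  # assumption: 150-row iris file, 105 rows used for training
n_val = 21   # "Val Rows: 21" in the log

# Distinct score values that appear in the leaderboard above.
test_scores = [1.000000, 0.977778, 0.955556, 0.933333, 0.911111, 0.866667]
val_scores = [0.952381, 0.904762, 0.857143]


def correct_count(score, n_rows):
    """Convert an accuracy back into a number of correctly classified rows."""
    return round(score * n_rows)


for score in test_scores:
    print(f"score_test {score:.6f} = {correct_count(score, n_test)} of {n_test}")
for score in val_scores:
    print(f"score_val  {score:.6f} = {correct_count(score, n_val)} of {n_val}")
```

Neighboring leaderboard positions differ by a single flower, so the ordering on a dataset this small is mostly noise.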
2.2 Feature importance
When you want to check the feature importance:
# Check the feature importance
predictor.feature_importance(train_data)
Running this gives the following.
Computing feature importance via permutation shuffling for 4 features using 105 rows with 5 shuffle sets...
0.3s = Expected runtime (0.06s per shuffle set)
0.09s = Actual runtime (Completed 5 of 5 shuffle sets)
index | importance | stddev | p_value | n | p99_high | p99_low |
---|---|---|---|---|---|---|
Petal.Width | 0.2666666666666666 | 0.031586902765289505 | 2.3187467236626715e-05 | 5 | 0.3317045360377894 | 0.20162879729554384 |
Petal.Length | 0.14476190476190476 | 0.03588846415507956 | 0.00041841415602088863 | 5 | 0.2186567484885019 | 0.07086706103530763 |
Sepal.Width | 0.04190476190476189 | 0.012777531299998786 | 0.0009202540135577972 | 5 | 0.06821387545570984 | 0.015595648353813937 |
Sepal.Length | 0.005714285714285694 | 0.010858813572372703 | 0.15227939234026738 | 5 | 0.0280727329445543 | -0.016644161515982907 |
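The log says the importance is computed "via permutation shuffling", so here is a toy, pure-Python sketch of that idea: shuffle one column, see how much accuracy drops, and average over a few shuffles. This is not AutoGluon's implementation (which also reports stddev, p_value, and the p99 bounds). The data, the model, and all the names below are made up for illustration.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows the model labels correctly."""
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_features, n_shuffles=5, seed=0):
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_shuffles):
            column = [row[j] for row in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by its shuffled value.
            shuffled = [row[:j] + (value,) + row[j + 1:] for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_shuffles)
    return importances


# Toy data: feature 0 decides the label, feature 1 is pure noise.
data_rng = random.Random(1)
rows = [(i / 20, data_rng.random()) for i in range(20)]
labels = [int(row[0] >= 0.5) for row in rows]


def model(row):
    """Stand-in for a trained model; it only looks at feature 0."""
    return int(row[0] >= 0.5)


importances = permutation_importance(model, rows, labels, n_features=2)
print(importances)
```

The noise feature comes out at exactly zero because the toy model never looks at it, which is the same story the table tells about Sepal.Length being close to irrelevant. One caveat: importance measured on training rows can flatter features the model has memorized, so passing held-out data such as test_data is a common alternative.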
(To be continued)