Trying Out PyCaret [Wine Quality Classification, Part 3]

Posted at 2024-08-24

About this page

This time I'll use a library called PyCaret (pronounced "pie-caret") to classify wine quality.

What is PyCaret?

PyCaret is a handy tool that makes machine learning ridiculously easy.
With just a few lines of code, you can build, evaluate, and compare multiple AI models.

Code and walkthrough

The program is listed below, so let's dive right in.
As in the previous post, it predicts a wine's quality from its feature data.

# Import libraries
import pandas as pd
from pycaret.classification import *

# Load the dataset
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')

Load the libraries we need and the wine quality dataset.

# Replace spaces in the feature names with underscores
df.columns = df.columns.str.replace(' ', '_')

Replace the spaces in the feature (column) names with underscores so that later steps don't raise errors.
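As a quick illustration of the rename, here is a minimal sketch using plain Python strings. The column list is typed in by hand from the red wine CSV header (I'm assuming it matches the file you download), and the names columns and renamed are illustrative, not part of the original script.

```python
# Column names as they appear in the header of winequality-red.csv
# (typed in by hand here; assumed to match the downloaded file).
columns = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide",
    "density", "pH", "sulphates", "alcohol", "quality",
]

# Same transformation as df.columns.str.replace(' ', '_'), done per string.
renamed = [name.replace(" ", "_") for name in columns]

for before, after in zip(columns, renamed):
    print(f"{before!r} -> {after!r}")
```

Names without spaces, such as pH and quality, pass through unchanged.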

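# Set up PyCaret
clf = setup(df, target='quality', session_id=123)

Set up the dataset with PyCaret.
The target parameter names the column to predict (quality), and session_id sets the random seed.

PyCaret's setup function is pretty impressive: it handles the initial groundwork for model building in one go, including preprocessing (such as missing-value imputation) and the train/test split. Feature engineering and scaling are available too, but as far as I can tell they are opt-in through extra arguments (for example normalize=True) rather than applied by default.

Running this one line has PyCaret analyze the dataset and automatically make the preparations needed to build machine learning models, which makes the later model comparison, creation, and evaluation easy.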
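To make a couple of the values in setup's summary table less magical, here is a rough sketch in plain Python. It assumes setup's default train_size of 0.7 and that the test-set size is rounded up (which is how scikit-learn's train_test_split behaves); the variable names are mine. Under those assumptions it arrives at the same split sizes and target mapping that appear in the results section.

```python
import math

n_rows = 1599        # rows in the red wine dataset ("Original data shape")
train_size = 0.7     # assumed: PyCaret's default train_size

# Test rows are rounded up; the remainder becomes the training set.
n_test = math.ceil(n_rows * (1 - train_size))
n_train = n_rows - n_test
print(n_train, n_test)

# Class labels are encoded as 0..k-1 in sorted order,
# which is what the "Target mapping" row of the summary shows.
quality_labels = [3, 4, 5, 6, 7, 8]   # quality values present in this dataset
target_mapping = {label: idx for idx, label in enumerate(sorted(quality_labels))}
print(target_mapping)
```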

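# Compare models
best_model = compare_models()

Use PyCaret's compare_models function to compare multiple classification models and pick the best performer.
It apparently trains all sorts of algorithms, such as logistic regression, decision trees, and support vector machines, scores them with cross-validation, and picks out the best one (ranked by Accuracy by default).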
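Conceptually, the "pick the best one" step boils down to sorting the candidates by a metric and taking the top row. Here is a minimal sketch of that idea, using a handful of the cross-validated accuracies from the results table further down. The dictionary and variable names are illustrative, and this is not how PyCaret implements it internally.

```python
# Mean cross-validated accuracy per model ID, copied from the results table.
cv_accuracy = {
    "et": 0.6702,
    "xgboost": 0.6622,
    "rf": 0.6595,
    "lightgbm": 0.6497,
    "lr": 0.6032,
    "dummy": 0.4263,
}

# Rank models from best to worst by accuracy.
ranking = sorted(cv_accuracy.items(), key=lambda item: item[1], reverse=True)
best_id, best_score = ranking[0]

for model_id, score in ranking:
    print(f"{model_id:<10}{score:.4f}")
print("best:", best_id)
```

The dummy row is a useful sanity check: any model worth keeping should beat it comfortably.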

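# Create the model
model = create_model(best_model)

Create the best model that was selected.
Note: as far as I can tell, compare_models() already returns a trained model, so this line isn't strictly required. What create_model adds is that it retrains the model with cross-validation and prints the score for each fold (the Fold table in the results below).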
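The per-fold table that create_model prints ends with Mean and Std rows. As a sketch of how those summary rows relate to the fold scores, the snippet below recomputes them from the fold accuracies in the results section. I'm assuming the Std row is a population standard deviation (ddof=0), since that is what lines up with the table; the variable names are mine.

```python
from statistics import mean, pstdev

# Accuracy for folds 0-9, copied from the fold table in the results.
fold_accuracy = [0.7232, 0.6786, 0.6518, 0.7054, 0.7054,
                 0.6786, 0.6339, 0.6964, 0.5536, 0.6757]

cv_mean = mean(fold_accuracy)
cv_std = pstdev(fold_accuracy)   # population standard deviation (ddof=0)

print(f"Mean: {cv_mean:.4f}")
print(f"Std:  {cv_std:.4f}")
```

Both values land within rounding error of the table's Mean (0.6702) and Std (0.0463). The last digit can differ slightly because the fold scores used here are already rounded to four decimals.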

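# Evaluate the model
evaluate_model(model)

Evaluate the model's performance. PyCaret offers a variety of evaluation metrics and plots.
Note that evaluate_model is an interactive widget meant for notebooks; when run as a plain script, it only prints the widget's text representation (the long interactive(...) line in the results below).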

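# Retrieve the train/test split created by setup
train_data = get_config('X_train')
test_data = get_config('X_test')
train_labels = get_config('y_train')
test_labels = get_config('y_test')

This just pulls out the training and test data that setup already split; no new split happens here.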

# Predict with the model
train_predictions = predict_model(model, data=train_data)['prediction_label']
test_predictions = predict_model(model, data=test_data)['prediction_label']

Run predictions on the training and test data.
(predict_model returns a DataFrame, and the predicted labels are in its prediction_label column.)

# Compute accuracy
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(train_labels, train_predictions)
test_accuracy = accuracy_score(test_labels, test_predictions)

Compute the accuracy on the training and test data.
(accuracy_score compares the predicted labels against the actual labels.)
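In case it helps to see what accuracy means here, below is a minimal plain-Python sketch: the fraction of predictions that exactly match the true label. The accuracy helper and the toy labels are made up for illustration; the original script uses scikit-learn's accuracy_score instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up quality labels, purely for illustration.
true_labels = [5, 6, 5, 7, 6, 5, 4, 6]
pred_labels = [5, 6, 6, 7, 5, 5, 5, 6]

score = accuracy(true_labels, pred_labels)
print(f"{score:.3f}")   # 5 of 8 match -> 0.625
```

Note that predicting 6 for a true 5 counts as just as wrong as predicting 8, since accuracy ignores how far off a quality score is.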

# Display the results (正解率 = accuracy)
print(f'正解率(train): {train_accuracy:.3f}')
print(f'正解率(test): {test_accuracy:.3f}')

Print the accuracy for the training and test data.

Results

Here's what the output looks like.

C:\Python\Python実践AIモデル構築\pycaret>python wine_pycaret.py
                    Description                               Value
0                    Session id                                 123
1                        Target                             quality
2                   Target type                          Multiclass
3                Target mapping  3: 0, 4: 1, 5: 2, 6: 3, 7: 4, 8: 5
4           Original data shape                          (1599, 12)
5        Transformed data shape                          (1599, 12)
6   Transformed train set shape                          (1119, 12)
7    Transformed test set shape                           (480, 12)
8              Numeric features                                  11
9                    Preprocess                                True
10              Imputation type                              simple
11           Numeric imputation                                mean
12       Categorical imputation                                mode
13               Fold Generator                     StratifiedKFold
14                  Fold Number                                  10
15                     CPU Jobs                                  -1
16                      Use GPU                               False
17               Log Experiment                               False
18              Experiment Name                    clf-default-name
19                          USI                                4903
Processing:  82%|█████████████████████████████████████████████████████████             | 53/65 [00:12<00:04,  2.48it/s][LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004022 seconds.



(omitted)


                                    Model  Accuracy     AUC  Recall   Prec.  \
et                 Extra Trees Classifier    0.6702  0.5871  0.6702  0.6485
xgboost         Extreme Gradient Boosting    0.6622  0.5663  0.6622  0.6420
rf               Random Forest Classifier    0.6595  0.5761  0.6595  0.6323
lightgbm  Light Gradient Boosting Machine    0.6497  0.5725  0.6497  0.6277
gbc          Gradient Boosting Classifier    0.6336  0.5461  0.6336  0.6183
lr                    Logistic Regression    0.6032  0.5288  0.6032  0.5726
lda          Linear Discriminant Analysis    0.5987  0.5329  0.5987  0.5806
ridge                    Ridge Classifier    0.5880  0.0000  0.5880  0.4973
dt               Decision Tree Classifier    0.5872  0.4659  0.5872  0.5837
nb                            Naive Bayes    0.5558  0.5033  0.5558  0.5628
qda       Quadratic Discriminant Analysis    0.5478  0.4938  0.5478  0.5467
ada                  Ada Boost Classifier    0.5478  0.4347  0.5478  0.4596
knn                K Neighbors Classifier    0.4808  0.4497  0.4808  0.4551
dummy                    Dummy Classifier    0.4263  0.3500  0.4263  0.1817
svm                   SVM - Linear Kernel    0.4111  0.0000  0.4111  0.3872

              F1   Kappa     MCC  TT (Sec)
et        0.6494  0.4612  0.4667     0.039
xgboost   0.6479  0.4581  0.4607     0.373
rf        0.6388  0.4444  0.4500     0.044
lightgbm  0.6342  0.4365  0.4403     0.549
gbc       0.6223  0.4153  0.4174     0.162
lr        0.5733  0.3410  0.3475     0.300
lda       0.5845  0.3558  0.3586     0.005
ridge     0.5318  0.2989  0.3103     0.005
dt        0.5821  0.3556  0.3572     0.006
nb        0.5551  0.3141  0.3162     0.005
qda       0.5409  0.2872  0.2912     0.005
ada       0.4938  0.2362  0.2499     0.017
knn       0.4604  0.1502  0.1529     0.226
dummy     0.2548  0.0000  0.0000     0.127
svm       0.3232  0.1108  0.1618     0.008
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
Fold
0       0.7232  0.0000  0.7232  0.7028  0.7091  0.5497  0.5529
1       0.6786  0.0000  0.6786  0.6668  0.6586  0.4677  0.4733
2       0.6518  0.8484  0.6518  0.6312  0.6374  0.4377  0.4426
3       0.7054  0.8492  0.7054  0.6723  0.6765  0.5156  0.5228
4       0.7054  0.8360  0.7054  0.6798  0.6838  0.5169  0.5226
5       0.6786  0.8127  0.6786  0.6464  0.6619  0.4881  0.4892
6       0.6339  0.8441  0.6339  0.6026  0.6063  0.4046  0.4180
7       0.6964  0.8616  0.6964  0.6600  0.6717  0.5042  0.5087
8       0.5536  0.8191  0.5536  0.5746  0.5292  0.2554  0.2612
9       0.6757  0.0000  0.6757  0.6485  0.6592  0.4725  0.4757
Mean    0.6702  0.5871  0.6702  0.6485  0.6494  0.4612  0.4667
Std     0.0463  0.3846  0.0463  0.0358  0.0477  0.0792  0.0783
interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Pipeline Plot', 'pipeline'), ('Hyperparameters', 'parameter'), ('AUC', 'auc'), ('Confusion Matrix', 'confusion_matrix'), ('Threshold', 'threshold'), ('Precision Recall', 'pr'), ('Prediction Error', 'error'), ('Class Report', 'class_report'), ('Feature Selection', 'rfe'), ('Learning Curve', 'learning'), ('Manifold Learning', 'manifold'), ('Calibration Curve', 'calibration'), ('Validation Curve', 'vc'), ('Dimensions', 'dimension'), ('Feature Importance', 'feature'), ('Feature Importance (All)', 'feature_all'), ('Decision Boundary', 'boundary'), ('Lift Chart', 'lift'), ('Gain Chart', 'gain'), ('Decision Tree', 'tree'), ('KS Statistic Plot', 'ks')), value='pipeline'), Output()), _dom_classes=('widget-interact',))
正解率(train): 1.000
正解率(test): 0.692

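From the output above, I think we can say the following.

・Extra Trees Classifier has the highest accuracy (0.6702), but other models such as XGBoost (0.6622) and Random Forest (0.6595) are close behind.

・Training accuracy is 1.000 while test accuracy is only 0.692, so the model may be overfitting 😿 Since setup already uses 10-fold cross-validation, the next things to try are probably hyperparameter tuning or some form of regularization.

・Anyway, since Extra Trees Classifier performs best, it looks like a good base for further tuning.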
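To put rough numbers on the overfitting point, this small sketch computes the train/test gap and checks whether the test accuracy is in line with the cross-validated estimate. The values are copied from the output above, and the variable names are mine.

```python
train_accuracy = 1.000   # from the script output
test_accuracy = 0.692    # from the script output
cv_mean = 0.6702         # Extra Trees, Mean row of the fold table
cv_std = 0.0463          # Extra Trees, Std row of the fold table

# Gap between training and held-out performance.
train_test_gap = train_accuracy - test_accuracy

# How far the test score sits from the cross-validated estimate.
cv_test_diff = abs(test_accuracy - cv_mean)
within_one_std = cv_test_diff <= cv_std

print(f"train - test gap: {train_test_gap:.3f}")
print(f"|test - CV mean|: {cv_test_diff:.4f}")
print(f"within one CV std: {within_one_std}")
```

The large gap says the model fits the training set far better than unseen data, while the test score sitting within one standard deviation of the CV mean suggests the cross-validation estimate was a fair preview of test performance.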

Summary

This is really handy.
It feels like you're missing out if you don't know about it.
