
Predicting Titanic Survivors with PyCaret (+ MLflow)

Last updated at 2022-11-09, posted at 2022-10-23

Titanic survival prediction with PyCaret

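I looked into how to use PyCaret (+ MLflow), so I am leaving these notes for future reference.
As the subject, I predict the Titanic survivors, the task that is also used in Kaggle's tutorial.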

References

This post draws on the following articles.

Data

Download the data from the page below and place train.csv and test.csv in the same directory that you run the code from.
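Since every later step assumes both files sit next to the notebook, a quick check up front can save a confusing error later. The following is only a minimal sketch using the standard library; the helper name `missing_files` is made up for this illustration and is not part of PyCaret or pandas.

```python
from pathlib import Path

REQUIRED_FILES = ("train.csv", "test.csv")


def missing_files(directory, names=REQUIRED_FILES):
    """Return the names from `names` that are not regular files in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("Missing from the current directory:", ", ".join(missing))
    else:
        print("train.csv and test.csv are in place.")
```

An empty list means both files were found in the directory that was checked.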

Titanic - Machine Learning from Disaster

Installing the libraries

Install MLflow and PyCaret beforehand.

pip install mlflow
pip install pycaret

Importing the libraries

import pandas as pd
from pycaret.classification import *

Loading the data

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
train.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

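Setting up the experiment environment in PyCaret

In PyCaret you first have to run the setup function to configure the experiment environment.
Specifically, it does the following:

Specifies the target variable
Infers the type of each input variable (types can also be specified explicitly if needed)
Sets the random seed

It also performs simple preprocessing at this point.
Note: dropping unneeded columns and creating features must be done yourself before this step.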
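As an illustration of the kind of feature creation that has to happen before setup, the sketch below pulls the honorific ("Mr", "Mrs", "Miss", ...) out of the Name column. This is not something the original workflow in this post does; the helper `extract_title` and the idea of a Title feature are assumptions made purely for the example.

```python
import re

# Matches the text between the comma after the surname and the next period,
# e.g. "Mr" in "Braund, Mr. Owen Harris".
_TITLE_PATTERN = re.compile(r",\s*([^.,]+)\.")


def extract_title(name):
    """Return the honorific from a Titanic-style Name value, or "Unknown"."""
    match = _TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else "Unknown"


if __name__ == "__main__":
    for sample in ("Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"):
        print(sample, "->", extract_title(sample))
```

With pandas, a helper like this could be applied as train["Title"] = train["Name"].map(extract_title) before calling setup, and columns judged unnecessary could be removed with train.drop(columns=[...]) at the same stage.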

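Here I specify Survived as the target variable and 0 as the random seed.

In addition, specifying the following two arguments saves the experiment record to MLflow.

log_experiment: whether to save the experiment record
experiment_name: name of the experiment record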

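s = setup(
    data=train, # dataset
    target="Survived", # target variable
    log_experiment=True, # save the experiment record
    experiment_name="Titanic", # name of the experiment record
    categorical_features=[], # set this to specify categorical columns explicitly
    numeric_features=[], # set this to specify numeric columns explicitly
    session_id=0, # random seed (0 this time)
    silent=False, # PyCaret 2.x argument: if True, skips the interactive confirmation of the inferred types
)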

After running it, the following message and the automatically inferred type of each variable are displayed.

Following data types have been inferred automatically, if they are correct press enter to continue or type 'quit' otherwise.

If the automatically inferred types look fine, press Enter.
A long table of results like the following is then displayed.

	Description	Value
0	session_id	0
1	Target	Survived
2	Target Type	Binary
3	Label Encoded	None
4	Original Data	(891, 12)
5	Missing Values	True
6	Numeric Features	3
7	Categorical Features	8
8	Ordinal Features	False
9	High Cardinality Features	False
10	High Cardinality Method	None
11	Transformed Train Set	(623, 717)
12	Transformed Test Set	(268, 717)
13	Shuffle Train-Test	True
14	Stratify Train-Test	False
15	Fold Generator	StratifiedKFold
16	Fold Number	10
17	CPU Jobs	-1
18	Use GPU	False
19	Log Experiment	True
20	Experiment Name	Titanic
21	USI	805e
22	Imputation Type	simple
23	Iterative Imputation Iteration	None
24	Numeric Imputer	mean
25	Iterative Imputation Numeric Model	None
26	Categorical Imputer	constant
27	Iterative Imputation Categorical Model	None
28	Unknown Categoricals Handling	least_frequent
29	Normalize	False
30	Normalize Method	None
31	Transformation	False
32	Transformation Method	None
33	PCA	False
34	PCA Method	None
35	PCA Components	None
36	Ignore Low Variance	False
37	Combine Rare Levels	False
38	Rare Level Threshold	None
39	Numeric Binning	False
40	Remove Outliers	False
41	Outliers Threshold	None
42	Remove Multicollinearity	False
43	Multicollinearity Threshold	None
44	Remove Perfect Collinearity	True
45	Clustering	False
46	Clustering Iteration	None
47	Polynomial Features	False
48	Polynomial Degree	None
49	Trignometry Features	False
50	Polynomial Threshold	None
51	Group Features	False
52	Feature Selection	False
53	Feature Selection Method	classic
54	Features Selection Threshold	None
55	Feature Interaction	False
56	Feature Ratio	False
57	Interaction Threshold	None
58	Fix Imbalance	False
59	Fix Imbalance Method	SMOTE
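The row counts of the Transformed Train Set (623) and Transformed Test Set (268) can be reproduced with a little arithmetic. The sketch below assumes that setup used a train_size of 0.7, which I understand to be PyCaret's default, and that the training rows are rounded down; neither assumption is stated in the table itself.

```python
import math

n_rows = 891        # rows in the original data, from the "Original Data" row
train_size = 0.7    # assumed: PyCaret's default train_size

n_train = math.floor(n_rows * train_size)  # training rows, rounded down
n_test = n_rows - n_train                  # the remainder is held out

print(n_train, n_test)
```

The jump from 12 columns to 717 is presumably the result of one-hot encoding the eight columns that were inferred as categorical, which is one reason to drop unneeded columns before setup.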

Comparing machine learning algorithms

PyCaret can compare multiple machine learning algorithms in one go.
Use the compare_models function for the comparison.
It returns the model with the highest score.

model = compare_models(
    sort='Accuracy', # metric to sort the results by
    fold=4, # number of cross-validation folds
    exclude=[], # names of algorithms to exclude from the comparison
)

	Model	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC	TT (Sec)
ridge	Ridge Classifier	0.8235	0.0000	0.7356	0.7954	0.7635	0.6231	0.6251	0.0400
catboost	CatBoost Classifier	0.8138	0.8598	0.6525	0.8342	0.7305	0.5917	0.6034	1.3675
lightgbm	Light Gradient Boosting Machine	0.8122	0.8545	0.6981	0.7934	0.7424	0.5956	0.5989	0.0650
lr	Logistic Regression	0.8106	0.8614	0.7109	0.7830	0.7445	0.5947	0.5970	0.5450
gbc	Gradient Boosting Classifier	0.8042	0.8522	0.6279	0.8302	0.7132	0.5689	0.5831	0.1550
ada	Ada Boost Classifier	0.8010	0.8238	0.7069	0.7629	0.7336	0.5752	0.5765	0.1000
rf	Random Forest Classifier	0.7993	0.8515	0.6691	0.7822	0.7202	0.5655	0.5704	0.1850
dt	Decision Tree Classifier	0.7913	0.7691	0.6693	0.7642	0.7130	0.5503	0.5538	0.0400
et	Extra Trees Classifier	0.7913	0.8421	0.6733	0.7624	0.7136	0.5507	0.5544	0.1675
xgboost	Extreme Gradient Boosting	0.7882	0.8398	0.6942	0.7430	0.7177	0.5485	0.5494	0.7125
knn	K Neighbors Classifier	0.6855	0.7068	0.5416	0.6059	0.5714	0.3244	0.3260	0.0700
lda	Linear Discriminant Analysis	0.6487	0.6334	0.4919	0.5583	0.5214	0.2457	0.2480	0.0850
svm	SVM - Linear Kernel	0.6165	0.0000	0.5908	0.5187	0.5042	0.2095	0.2460	0.0500
dummy	Dummy Classifier	0.6116	0.5000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0325
qda	Quadratic Discriminant Analysis	0.5828	0.5474	0.3886	0.5156	0.4061	0.1050	0.1191	0.0650
nb	Naive Bayes	0.4591	0.5404	0.9050	0.4109	0.5651	0.0663	0.1122	0.0375

RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
                max_iter=None, normalize=False, random_state=0, solver='auto',
                tol=0.001)

In this comparison, Ridge Classifier had the highest Accuracy.
I will use Ridge Classifier and tune its hyperparameters.
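Which model comes out on top depends on the metric passed to sort. To make that concrete, the sketch below re-ranks a few rows copied from the comparison table above in plain Python. It does not call PyCaret, and the helper `best_model` is made up for this illustration.

```python
# A few rows copied from the comparison table: (id, Accuracy, AUC).
ROWS = [
    ("ridge", 0.8235, 0.0000),
    ("catboost", 0.8138, 0.8598),
    ("lightgbm", 0.8122, 0.8545),
    ("lr", 0.8106, 0.8614),
]
METRIC_COLUMN = {"Accuracy": 1, "AUC": 2}


def best_model(rows, metric):
    """Return the id of the row with the highest value for `metric`."""
    column = METRIC_COLUMN[metric]
    return max(rows, key=lambda row: row[column])[0]


print(best_model(ROWS, "Accuracy"))  # ridge
print(best_model(ROWS, "AUC"))       # lr
```

Ranked by AUC, Logistic Regression would be picked instead, since the table reports an AUC of 0.0000 for Ridge Classifier.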

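Creating the model

Above, the model was obtained as the result of comparing multiple algorithms, but
when the algorithm is already decided, you can create the model directly with the create_model function instead of compare_models.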

model = create_model(
    'ridge', # algorithm to use
    cross_validation=True, # whether to run cross-validation
    fold=4, # number of cross-validation folds
)

	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC
Fold
0	0.8141	0.0000	0.7213	0.7857	0.7521	0.6039	0.6053
1	0.8141	0.0000	0.7377	0.7759	0.7563	0.6062	0.6067
2	0.8205	0.0000	0.6833	0.8200	0.7455	0.6086	0.6146
3	0.8452	0.0000	0.8000	0.8000	0.8000	0.6737	0.6737
Mean	0.8235	0.0000	0.7356	0.7954	0.7635	0.6231	0.6251
Std	0.0128	0.0000	0.0421	0.0166	0.0214	0.0293	0.0283
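The Mean and Std rows can be checked by hand from the four fold values. The sketch below does this for the Accuracy column with the standard library. For these numbers the Std row appears to match the population standard deviation (dividing by the number of folds) rather than the sample one, though that is an observation from this table and not something I confirmed in PyCaret's documentation.

```python
from statistics import mean, pstdev

# Accuracy of folds 0-3, copied from the table above.
fold_accuracy = [0.8141, 0.8141, 0.8205, 0.8452]

print(f"Mean: {mean(fold_accuracy):.6f}")
print(f"Std:  {pstdev(fold_accuracy):.6f}")  # population standard deviation
```

Both values agree with the table to within rounding, bearing in mind that the fold values shown are themselves rounded to four decimals.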

Hyperparameter tuning

Next, tune the hyperparameters of the created model.
The tune_model function performs the tuning with a random grid search.

tuned_model = tune_model(
    model, # the model created above
    optimize="Accuracy", # metric to optimize
    fold=4, # number of cross-validation folds
    n_iter=30, # number of random grid search iterations (default is 10)
)

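I often see articles that leave n_iter unspecified, but with the default number of iterations the tuned model frequently fails to beat the untuned one.
Note, however, that a larger n_iter makes the processing time correspondingly longer.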
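To show why more iterations can help, here is a toy sketch of a random search: it samples n_iter parameter combinations and keeps the best one. It does not call PyCaret. The search space `GRID` and the scoring function `toy_score` are invented for the illustration and are not the grid or metric that tune_model actually uses for Ridge.

```python
import random

# Hypothetical search space; not the grid PyCaret actually uses for Ridge.
GRID = {
    "alpha": [round(0.01 * i, 2) for i in range(1, 1001)],
    "fit_intercept": [True, False],
}


def toy_score(params):
    """Stand-in for a cross-validated score; peaks at alpha=3.21 with an intercept."""
    penalty = abs(params["alpha"] - 3.21) / 10
    return 0.85 - penalty - (0.0 if params["fit_intercept"] else 0.02)


def random_search(n_iter, seed=0):
    """Score n_iter randomly sampled combinations and return the best (score, params)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in GRID.items()}
        candidate = (toy_score(params), params)
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best


for n_iter in (10, 30):
    score, params = random_search(n_iter)
    print(n_iter, round(score, 4), params)
```

Because both runs share a seed, the first 10 samples of the longer run are the same as those of the shorter one, so the 30-iteration result can never be worse here. The cost is three times as many evaluations, which mirrors the longer run time noted above.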

Inspecting the model

evaluate_model() lets you check a variety of diagnostics for the model, such as the learning curve.

evaluate_model(tuned_model)

Finalizing the model and predicting

Running finalize_model() on the tuned model completes it by training on all of the labeled data.
Then run predict_model() with the finalized model to predict on the test data.

# Finalize the model
final_model = finalize_model(
    tuned_model, # model after hyperparameter tuning
)

# Predict on the test data
result = predict_model(
    final_model, # model
    data=test, # test data
    raw_score=True, # show the scores behind the predicted labels
)
result

	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Label
0	892	3	Kelly, Mr. James	male	34.5	0	0	330911	7.8292	NaN	Q	0
1	893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	0	363272	7.0000	NaN	S	0
2	894	2	Myles, Mr. Thomas Francis	male	62.0	0	0	240276	9.6875	NaN	Q	0
3	895	3	Wirz, Mr. Albert	male	27.0	0	0	315154	8.6625	NaN	S	0
4	896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.0	1	1	3101298	12.2875	NaN	S	1
...	...	...	...	...	...	...	...	...	...	...	...	...
413	1305	3	Spector, Mr. Woolf	male	NaN	0	0	A.5. 3236	8.0500	NaN	S	0
414	1306	1	Oliva y Ocana, Dona. Fermina	female	39.0	0	0	PC 17758	108.9000	C105	C	1
415	1307	3	Saether, Mr. Simon Sivertsen	male	38.5	0	0	SOTON/O.Q. 3101262	7.2500	NaN	S	0
416	1308	3	Ware, Mr. Frederick	male	NaN	0	0	359309	8.0500	NaN	S	0
417	1309	3	Peter, Master. Michael J	male	NaN	1	1	2668	22.3583	NaN	C	0

predict_model() returns a DataFrame.
It is the given test data with a Label column holding the predictions added.

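Creating the submission file

Create the submission file from the predictions.

Incidentally, when I submitted it to Kaggle as a trial, the score was 0.78708.
Aiming for a high score would likely require feature engineering before setup.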

submit = result[['PassengerId', 'Label']]
submit = submit.rename(columns={'Label':'Survived'})
submit.to_csv("submission.csv", encoding='utf-8', index=False)
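Before uploading, it may be worth a quick sanity check of the written file. The sketch below uses only the standard library. The helper `check_submission` is made up for this illustration, and the expected row count of 418 is inferred from the index (0 to 417) of the prediction table shown earlier.

```python
import csv
from pathlib import Path


def check_submission(path, expected_rows=418):
    """Return a list of problems found in a Titanic submission CSV (empty if none)."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        rows = list(reader)

    problems = []
    if header != ["PassengerId", "Survived"]:
        problems.append(f"unexpected header: {header}")
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    if any(len(row) != 2 or row[1] not in {"0", "1"} for row in rows):
        problems.append("Survived must be 0 or 1 on every row")
    return problems


# Only runs the check when the file written above is present.
if Path("submission.csv").exists():
    print(check_submission("submission.csv") or "submission.csv looks fine")
```

An empty list means the header, the row count, and the Survived values all passed these basic checks.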

Checking MLflow

MLflow is started from the command line.
Run the following command and open the URL it prints (for example http://127.0.0.1:5000).

!mlflow ui

[2022-10-23 11:16:04 +0900] [559] [INFO] Starting gunicorn 20.1.0
[2022-10-23 11:16:04 +0900] [559] [INFO] Listening at: http://127.0.0.1:5000 (559)
[2022-10-23 11:16:04 +0900] [559] [INFO] Using worker: sync
[2022-10-23 11:16:04 +0900] [561] [INFO] Booting worker with pid: 561
^C
[2022-10-23 11:16:11 +0900] [559] [INFO] Handling signal: int
[2022-10-23 11:16:11 +0900] [561] [INFO] Worker exiting (pid: 561)

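Conclusion

What I like about PyCaret is that it removes the need to write a lot of boilerplate code, so you can get straight to data analysis and feature creation.
Each of these tools has many settings, and I would like to keep digging so that I can make full use of them.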
