
[Introduction to LightGBM] Playing around with house prices, tips, and Titanic survival prediction ♪

Posted at 2022-10-02

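This time, as a way of getting back into practice, I played around with LightGBM, using the sites below as references.
Reference ➀ covers Boston house prices, ➁ covers the seaborn titanic dataset, and ➂, also about seaborn datasets, is what I used to load the tips data.
Because of the issue described in reference ➃, I predict California house prices here instead of Boston house prices.
References
➀ Kaggler がよく使う「LightGBM」とは？【機械学習】 (What is LightGBM, the library Kagglers often use? [Machine learning])
➁ LightGBMを超わかりやすく解説(理論+実装)【機械学習入門33】 (LightGBM explained very simply (theory + implementation) [Introduction to machine learning 33])
➂ 学習用データセット – seaborn【Python】 (Datasets for learning: seaborn [Python])
➃ 機械学習の回帰データとしては、ボストン住宅価格データではなく、カリフォルニア住宅価格データを使おう (For regression data in machine learning, use the California housing data rather than the Boston housing data)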

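Environment

・In a Jupyter notebook in VS Code on Windows 11, running the following in the terminal was enough to make LightGBM available.

pip install lightgbm

Alternatively, add the following to a cell in the Jupyter notebook to install it.

!pip install lightgbm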

House price prediction

I simply copy the code from reference ➀ step by step.
First, these are the libraries used.

LightBGM_house.py

# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lightgbm as lgb
import time

from pandas import DataFrame
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

As reference ➃ explains, the Boston dataset is not available. So, following reference ➃, I load the California house prices instead.

.py

from sklearn.datasets import fetch_california_housing

california_housing = fetch_california_housing()
train_x = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
train_y = pd.Series(california_housing.target)

train_x.head()

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude
0	8.3252	41.0	6.984127	1.023810	322.0	2.555556	37.88	-122.23
1	8.3014	21.0	6.238137	0.971880	2401.0	2.109842	37.86	-122.22
2	7.2574	52.0	8.288136	1.073446	496.0	2.802260	37.85	-122.24
3	5.6431	52.0	5.817352	1.073059	558.0	2.547945	37.85	-122.25
4	3.8462	52.0	6.281853	1.081081	565.0	2.181467	37.85	-122.25

.py

train_y.head()

0    4.526
1    3.585
2    3.521
3    3.413
4    3.422
dtype: float64

To match the layout of the Boston data, I use train_x as the DataFrame and add train_y to it as the Price column, as follows.

.py

df = train_x
df["Price"] = train_y
df.head()

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude	Price
0	8.3252	41.0	6.984127	1.023810	322.0	2.555556	37.88	-122.23	4.526
1	8.3014	21.0	6.238137	0.971880	2401.0	2.109842	37.86	-122.22	3.585
2	7.2574	52.0	8.288136	1.073446	496.0	2.802260	37.85	-122.24	3.521
3	5.6431	52.0	5.817352	1.073059	558.0	2.547945	37.85	-122.25	3.413
4	3.8462	52.0	6.281853	1.081081	565.0	2.181467	37.85	-122.25	3.422

From here on, the code follows reference ➀.

.py

# Split into training data and test data
train_set, test_set = train_test_split(df, test_size = 0.2, random_state = 123)
# Split into explanatory variables and the target variable
x_train = train_set.drop('Price', axis = 1)
y_train = train_set['Price']
x_test = test_set.drop('Price', axis = 1)
y_test = test_set['Price']
# Wrap the data in LightGBM Dataset objects
lgb_train = lgb.Dataset(x_train, y_train)
lgb_test = lgb.Dataset(x_test, y_test)

.py

# Set the evaluation metric
params = {'metric' : 'rmse'}

# Build a regression model from the training data
gbm = lgb.train(params, lgb_train)

.py

# Check the prediction accuracy on the test data
test_predicted = gbm.predict(x_test)
predicted_df = pd.concat([y_test.reset_index(drop=True), pd.Series(test_predicted)], axis = 1)
predicted_df.columns = ['true', 'predicted']
predicted_df.head()

	true	predicted
0	1.516	2.218834
1	0.992	0.887906
2	1.345	1.560843
3	2.317	1.789321
4	4.629	4.411697

.py

# Define a function that plots predicted values against true values
def Prediction_accuracy(predicted_df):
    RMSE = np.sqrt(mean_squared_error(predicted_df['true'], predicted_df['predicted']))
    plt.figure(figsize = (7,7))
    ax = plt.subplot(111)
    ax.scatter('true', 'predicted', data = predicted_df)
    ax.set_xlabel('True Price', fontsize = 20)
    ax.set_ylabel('Predicted Price', fontsize = 20)
    plt.tick_params(labelsize = 15)
    x = np.linspace(0, 10)
    y = x
    ax.plot(x, y, 'r-')
    plt.text(0.1, 0.9, 'RMSE = {}'.format(str(round(RMSE,3))),transform = ax.transAxes, fontsize = 15)

.py

# Plot the predicted values
Prediction_accuracy(predicted_df)
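The RMSE shown on the plot is the square root of the mean squared difference between the true and predicted values. As a small worked example, the sketch below applies that formula by hand to the five rows of predicted_df shown earlier. It covers only those five rows, so the result differs from the RMSE over the whole test set.

```python
import math

# The five (true, predicted) pairs from the predicted_df.head() output
true_values = [1.516, 0.992, 1.345, 2.317, 4.629]
predicted_values = [2.218834, 0.887906, 1.560843, 1.789321, 4.411697]

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((true - predicted) ** 2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

rmse_first_five = rmse(true_values, predicted_values)
print(round(rmse_first_five, 3))
```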

.py

# Check the feature importances
lgb.plot_importance(gbm, height = 0.5, figsize = (8,16))

.py

# Visualize the splits of a decision tree
# The last argument is the index of the tree to draw
G =  lgb.create_tree_digraph(gbm, 1)
G

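Predicting tips

In fact, predicting tips uses almost the same code as predicting house prices.
The only differences are how the data is loaded and the name of the target column (tip instead of Price).
So let's get straight to it.

The libraries are the same as above, plus seaborn for loading the dataset, so I omit the rest.
Loading the data is as simple as the following.

The first rows of the loaded data are shown below the code.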

.py

import seaborn as sns

df = sns.load_dataset('tips')
df.head()

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

After this, the same code runs once the target column name 'Price' is replaced with 'tip'.
The results are as follows.

	true	predicted
0	4.00	5.851008
1	3.35	3.829987
2	2.00	4.606835
3	2.00	1.987755
4	2.50	2.888355

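Titanic survival prediction

The data is loaded in the same way as above.
Since the goal here is to classify survival, I load data in which the outcome is known and assign the survival flag to y.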

.py

df = sns.load_dataset('titanic')

X = df.loc[:, (df.columns!='survived') & (df.columns!='alive')]
X = pd.get_dummies(X, drop_first=True)
y = df['survived']

.py

pd.set_option('display.max_columns', 100)
df.head()

	survived	pclass	sex	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	0	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	0	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	0	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

.py

X.head()

	pclass	age	sibsp	parch	fare	adult_male	alone	sex_male	embarked_Q	embarked_S	class_Second	class_Third	who_man	who_woman	deck_B	deck_C	deck_D	deck_E	deck_F	deck_G	embark_town_Queenstown	embark_town_Southampton
0	3	22.0	1	0	7.2500	True	False	1	0	1	0	1	1	0	0	0	0	0	0	0	0	1
1	1	38.0	1	0	71.2833	False	False	0	0	0	0	0	0	1	0	1	0	0	0	0	0	0
2	3	26.0	0	0	7.9250	False	True	0	0	1	0	1	0	1	0	0	0	0	0	0	0	1
3	1	35.0	1	0	53.1000	False	False	0	0	1	0	0	0	1	0	1	0	0	0	0	0	1
4	3	35.0	0	0	8.0500	True	True	1	0	1	0	1	1	0	0	0	0	0	0	0	0	1

.py

y.head()

0    0
1    1
2    1
3    1
4    0
Name: survived, dtype: int64
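The X built above relies on pd.get_dummies with drop_first=True to turn string and categorical columns into 0/1 columns. The sketch below shows the behavior on a tiny hypothetical frame (the values are made up for illustration). Each encoded column loses its first category, and a missing value becomes all zeros, so with drop_first=True a missing deck is indistinguishable from the dropped first deck.

```python
import pandas as pd

# Tiny hypothetical frame: one numeric column, one string column,
# and one string column with a missing value
sample = pd.DataFrame({
    'age': [22.0, 38.0, 26.0],
    'sex': ['male', 'female', 'male'],
    'deck': [None, 'C', 'B'],
})

# Numeric columns pass through; each string column becomes indicator columns,
# with the first category (alphabetically) dropped
encoded = pd.get_dummies(sample, drop_first=True)
print(encoded)
```

Depending on the pandas version, the indicator columns are printed as 0/1 or as True/False.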

.py

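# Hold out 30% of the rows as test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Classifier using GOSS boosting, with the tree depth capped at 5
import lightgbm as lgb
model = lgb.LGBMClassifier(boosting_type='goss', max_depth=5, random_state=0)

# Stop when the evaluation metric has not improved for 10 rounds, and log every round
eval_set = [(X_test, y_test)]
callbacks = []
callbacks.append(lgb.early_stopping(stopping_rounds=10))
callbacks.append(lgb.log_evaluation())
model.fit(X_train, y_train, eval_set=eval_set, callbacks=callbacks)

# Log loss on the test data, computed from the predicted class probabilities
from sklearn import metrics
y_pred = model.predict_proba(X_test)
metrics.log_loss(y_test, y_pred)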
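For a binary target, the log loss computed above is the mean negative log-likelihood of the true labels under the predicted probabilities. The following pure-Python sketch spells out that formula. The labels and probabilities are hypothetical, and the function takes only the probability of class 1, which would correspond to the second column of the predict_proba output.

```python
import math

def binary_log_loss(y_true, prob_positive, eps=1e-15):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, prob_positive):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical labels and predicted probabilities of class 1
labels = [0, 1, 1, 0]
probs = [0.1, 0.8, 0.6, 0.3]
loss = binary_log_loss(labels, probs)
print(loss)
```

Confident and correct predictions push the loss toward 0, while confident and wrong predictions are penalized heavily.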

.py

lgb.plot_metric(model)

.py

lgb.plot_importance(model)

.py

# Visualize the splits of a decision tree
# The last argument is the index of the tree to draw
G =  lgb.create_tree_digraph(model,1)
G

Finally, let's compute the classification metrics.

.py

import numpy as np
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
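# Accuracy score: the fraction of samples classified correctly (1 predicted as 1, 0 predicted as 0)
# Precision score: of the samples predicted as 1, the fraction that are actually 1
# Recall score: of the samples that are actually 1, the fraction predicted as 1
# F1 score: the harmonic mean of precision and recall, between 0 and 1; higher is better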

preds = np.round(model.predict(X_test))
print('Accuracy score = \t {}'.format(accuracy_score(y_test, preds)))
print('Precision score = \t {}'.format(precision_score(y_test, preds)))
print('Recall score =   \t {}'.format(recall_score(y_test, preds)))
print('F1 score =      \t {}'.format(f1_score(y_test, preds)))

Accuracy score = 	 0.8134328358208955
Precision score = 	 0.7549019607843137
Recall score =   	 0.7549019607843137
F1 score =      	 0.7549019607843137
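Precision, recall, and F1 come out identical here, which happens whenever the number of false positives equals the number of false negatives. The sketch below recomputes the four scores from one confusion matrix that is consistent with the output above. The counts are inferred, not reported by the run: they assume a 268-row test set, which is what test_size=0.3 gives for the 891-row titanic data.

```python
# Confusion-matrix counts consistent with the scores above (inferred, an assumption)
tp, fp, fn, tn = 77, 25, 25, 141

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of the predicted 1s, how many are really 1
recall = tp / (tp + fn)      # of the real 1s, how many were predicted as 1
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Because fp equals fn, precision and recall share the same denominator, and the harmonic mean of two equal numbers is that same number.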

References
➄ LightGBMを試す (Trying out LightGBM)
➅ F値 (評価指標) (F-measure (evaluation metric))
➆ Classification metrics

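Summary

This time I tried both regression and classification, as described below, but I feel I have not played with LightGBM nearly enough yet.

➀ For predicting continuous values, I used

.py

# Build a regression model from the training data
gbm = lgb.train(params, lgb_train)

to predict house prices and tips.
This approach looks useful for predicting all sorts of quantities that depend on the input features.
Reference
lightgbm.train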

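➁ The other task was Titanic survival prediction, done with the following classification model.

.py

model = lgb.LGBMClassifier(boosting_type='goss', max_depth=5, random_state=0)

This should also be usable as-is for other binary classification problems.
Reference
lightgbm.LGBMClassifier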

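In fact, judging from the reference below, LightGBM is not limited to classification and regression. It also offers a ranker (LGBMRanker in the Python API), and above all, the parameter list suggests there is much more depth to explore.
Reading about how it works, it also appears to be fast and useful, so I plan to keep experimenting with it.

Reference
Python API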
