[R] Trying LightGBM with tidymodels (Part 3)


#1. Introduction

There are not many tidymodels articles on Qiita (and not many R articles in general), so I am posting this for my own future reference.
XGBoost is the best-known gradient boosting implementation, but LightGBM also seems to be widely used. In this post I run through a full machine learning workflow: fitting LightGBM inside the tidymodels framework and then finding its hyperparameters with Bayesian optimization.

*This post is based on the R-bloggers article "How to Use Lightgbm with Tidymodels".

#2. Environment

OS: Ubuntu 20.04.2 LTS
R: version 4.1.0
RStudio: version 1.4.1717

#3. Installing the packages

To use LightGBM with tidymodels you also need the treesnip package.
It is not on CRAN, so install it from GitHub.

```r
remotes::install_github("curso-r/treesnip")
```

```r
# data
library(AmesHousing)
# data cleaning
library(janitor)
# data prep
library(tidyverse)
# tidymodels
library(rsample)
library(recipes)
library(parsnip)
library(tune)
library(dials)
library(workflows)
library(yardstick)
library(treesnip)
# others
library(ParBayesianOptimization)
library(lightgbm)
```

##4.1 Data

We use the housing-price data available in the AmesHousing package.

```r
# set the random seed so we can reproduce any simulated results
set.seed(123)

# load the housing data and clean the column names
ames_data = make_ames() %>%
  janitor::clean_names()
```

clean_names() automatically cleans up the names of the explanatory variables (see the janitor package documentation).
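As a quick check of what clean_names() did, you can compare the column names before and after cleaning (the example names in the comments are illustrative, not verified output):

```r
# column names before cleaning: make_ames() uses mixed-case names such as "Sale_Price"
names(make_ames())[1:3]
# column names after cleaning: lower-case snake_case, e.g. "sale_price"
names(ames_data)[1:3]
```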

##4.2 Preprocessing

Many models require careful and extensive preprocessing of the variables to make accurate predictions.
Gradient boosted tree models such as XGBoost, LightGBM, and CatBoost are very robust to highly skewed or correlated data, so the amount of preprocessing required is minimal. In contrast to XGBoost, LightGBM and CatBoost can handle categorical variables (factors) directly, so there is no need to dummy (one-hot) encode them; a small sketch of the difference follows.
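For comparison, here is a minimal sketch of the extra step a recipe would need if we used XGBoost instead (the name xgb_recipe is mine and is not used anywhere below):

```r
# XGBoost cannot consume factor columns directly, so its recipe would need one-hot encoding
xgb_recipe = recipes::recipe(sale_price ~ ., data = ames_data) %>%
  recipes::step_dummy(all_nominal(), one_hot = TRUE)

# with LightGBM (via treesnip) the factor columns can be passed through as-is
```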

Now let's split the data into training and test sets.

```r
ames_split = rsample::initial_split(
  ames_data,
  prop = 0.8,
  strata = sale_price
)

train_data = training(ames_split)
test_data = testing(ames_split)
```

The preprocessing is defined with the recipe() function.
sale_price is set as the outcome variable.
step_other(all_nominal(), threshold = 0.01) automatically pools the rarely occurring levels of every categorical variable into an "other" level.
step_nzv(all_nominal()) removes categorical predictors whose levels have near-zero variance.
Example: for a categorical variable with 997 observations in level A and 3 in level B, almost everything falls into A (near-zero variance), so the predictor carries little information and is removed.

```r
preprocessing_recipe =
  recipes::recipe(sale_price ~ ., data = train_data) %>%
  # pool infrequent factor levels into an "other" level
  recipes::step_other(all_nominal(), threshold = 0.01) %>%
  # remove near-zero-variance categorical predictors that contribute little to prediction
  recipes::step_nzv(all_nominal())
```
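The workflow will prep() this recipe for us later, but you can inspect what it produces up front. A minimal sketch (the prepped and baked_train names are mine):

```r
# estimate the recipe on the training data and look at the processed result
prepped = recipes::prep(preprocessing_recipe, training = train_data)
baked_train = recipes::bake(prepped, new_data = NULL)  # the preprocessed training set
dplyr::glimpse(baked_train)
```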

#5. Searching for the best hyperparameters

The optimal hyperparameter values are found by Bayesian optimization based on Gaussian process regression.

parsnip lets you refer to hyperparameters by unified names, even though each algorithm names them differently.
LightGBM has the following main hyperparameters, of which three are the most important (a small sketch of how the parsnip names map onto LightGBM's own arguments follows the list).

- **num_leaves**
- feature_fraction (mtry)
- num_iterations (trees)
- **min_data_in_leaf** (min_n)
- **max_depth** (tree_depth)
- learning_rate (learn_rate)
- min_gain_to_split (loss_reduction)
- bagging_fraction (sample_size)

The parsnip unified name is given in parentheses; the most important hyperparameters are in bold.
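If you want to check this mapping yourself, parsnip::translate() prints the fit template for a given engine. A small sketch (the exact output depends on the parsnip/treesnip versions; the parameter values are arbitrary):

```r
# shows how the unified argument names are passed to the lightgbm engine
boost_tree(mode = "regression", mtry = 30, trees = 1000,
           min_n = 5, tree_depth = 6) %>%
  set_engine("lightgbm") %>%
  translate()
```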

First, we write a function that computes a score for a given set of hyperparameter values.
The parameters targeted by this optimization are mtry, min_n, and tree_depth;
trees is fixed at 1000.

The loss metric is negated so that maximizing the score corresponds to minimizing the loss (bayesOpt() maximizes its objective).

```r
score_function = function(mtry, min_n, tree_depth){
  # model specification with the candidate hyperparameters (trees fixed at 1000)
  lgm_model =
    boost_tree(mode = "regression",
               trees = 1000,
               mtry = mtry,
               min_n = min_n,
               tree_depth = tree_depth) %>%
    set_engine("lightgbm", verbose = -1)

  # bundle the preprocessing recipe and the model into a workflow
  lgm_workflow = workflow() %>%
    add_recipe(preprocessing_recipe) %>%
    add_model(lgm_model)

  # evaluate the workflow with 5-fold cross-validation on the training data
  cv_folds = vfold_cv(data = train_data, v = 5)
  result = collect_metrics(lgm_workflow %>% fit_resamples(cv_folds))

  # the first metric is the mean RMSE; negate it because bayesOpt() maximizes the score
  score = -result$mean[1]

  return(list(Score = score))
}
```
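As an optional sanity check, the function can be called once at an arbitrary point before handing it to the optimizer (the values below are purely illustrative, and a single call runs a full 5-fold cross-validation, so it takes a while):

```r
# one evaluation of the objective; Score is the negative cross-validated RMSE
score_function(mtry = 20, min_n = 10, tree_depth = 6)
```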

Next, define the search space for the hyperparameters.

```r
limites = list(
  # use the suffix L for integer values
  mtry = c(1L, 81L),
  min_n = c(1L, 200L),
  tree_depth = c(1L, 15L)
)
```
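As an aside, the same ranges could also be written as dials parameter objects (not used below, shown only for reference):

```r
# equivalent parameter ranges expressed with dials
dials::mtry(range = c(1L, 81L))
dials::min_n(range = c(1L, 200L))
dials::tree_depth(range = c(1L, 15L))
```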

Run the Bayesian optimization.

```r
set.seed(12345)
bayes_opt = bayesOpt(
  FUN = score_function,
  bounds = limites,
  initPoints = 4,
  iters.n = 8,
  iters.k = 1,
  otherHalting = list(timeLimit = 5*60)
)
```
output

```
Running initial scoring function 4 times in 1 thread(s)...  30.036 seconds

Starting Epoch 1 
  1) Fitting Gaussian Process...
  2) Running local optimum search...        1.403 seconds
  3) Running FUN 1 times in 1 thread(s)...  4.258 seconds

Starting Epoch 2 
  1) Fitting Gaussian Process...
  2) Running local optimum search...        3.259 seconds
  3) Running FUN 1 times in 1 thread(s)...  39.827 seconds

Starting Epoch 3 
  1) Fitting Gaussian Process...
  2) Running local optimum search...        4.154 seconds
  3) Running FUN 1 times in 1 thread(s)...  50.749 seconds

Starting Epoch 4 
  1) Fitting Gaussian Process...
  2) Running local optimum search...        3.206 seconds
  3) Running FUN 1 times in 1 thread(s)...  4.385 seconds

Starting Epoch 5 
  1) Fitting Gaussian Process...
  2) Running local optimum search...        5.359 seconds
  3) Running FUN 1 times in 1 thread(s)...  5.258 seconds

Starting Epoch 6 
  1) Fitting Gaussian Process...
  2) Running local optimum search...        8.461 seconds
  3) Running FUN 1 times in 1 thread(s)...  6.438 seconds

Starting Epoch 7 
  1) Fitting Gaussian Process...
  2) Running local optimum search...        5.291 seconds
  3) Running FUN 1 times in 1 thread(s)...  5.34 seconds

Starting Epoch 8 
  1) Fitting Gaussian Process...
  2) Running local optimum search...        9.041 seconds
  3) Running FUN 1 times in 1 thread(s)...  6.845 seconds
```
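The whole optimization history is kept in bayes_opt$scoreSummary, which is handy for a quick look beyond the plots below. A sketch (columns as provided by ParBayesianOptimization):

```r
# best evaluations first (Score = negative cross-validated RMSE)
bayes_opt$scoreSummary %>%
  as_tibble() %>%
  arrange(desc(Score)) %>%
  select(mtry, min_n, tree_depth, Score) %>%
  head()
```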

Let's visualize the results.

```r
plot(bayes_opt)
```

(figure: output of plot(bayes_opt))

```r
library(gganimate)
library(patchwork)

# Score by mtry vs. tree_depth
g1 = ggplot(data = bayes_opt$scoreSummary,
            aes(x = mtry, y = tree_depth, size = Score, color = Score)) +
  geom_point() +
  scale_color_viridis_c() +
  theme_bw()

# Score by mtry vs. min_n
g2 = ggplot(data = bayes_opt$scoreSummary,
            aes(x = mtry, y = min_n, size = Score, color = Score)) +
  geom_point() +
  scale_color_viridis_c() +
  theme_bw()

# Score by tree_depth vs. min_n
g3 = ggplot(data = bayes_opt$scoreSummary,
            aes(x = tree_depth, y = min_n, size = Score, color = Score)) +
  geom_point() +
  scale_color_viridis_c() +
  theme_bw()

# combine the three panels side by side (the | operator comes from patchwork)
g1 | g2 | g3
```

(figure: Score plotted against pairs of mtry, tree_depth, and min_n)

The best parameters found by the Bayesian optimization are as follows.

```r
getBestPars(bayes_opt)
```

output

```
$mtry
[1] 37

$min_n
[1] 1

$tree_depth
[1] 3
```

#6. Building the model and fitting it to the training data

We feed the optimized hyperparameters back into the model specification.

```r
lgm_model =
  boost_tree(mode = "regression",
             trees = 1000) %>%
  set_engine("lightgbm")

# overwrite the model arguments with the best hyperparameters found above
model_best = update(lgm_model, getBestPars(bayes_opt))
model_best
```
output

```
Boosted Tree Model Specification (regression)

Main Arguments:
  mtry = 37
  trees = 1000
  min_n = 1
  tree_depth = 3

Computational engine: lightgbm 
```

Drop it into a workflow.

```r
lgm_workflow = workflow() %>%
  add_recipe(preprocessing_recipe) %>%
  add_model(model_best)

lgm_workflow
```
output

```
══ Workflow ══════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_other()
• step_nzv()

── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (regression)

Main Arguments:
  mtry = 37
  trees = 1000
  min_n = 1
  tree_depth = 3

Computational engine: lightgbm 
```

Fit the workflow to the training data.

```r
lgm_workflow_fit = lgm_workflow %>% fit(train_data)
```

output

```
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000725 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3778
[LightGBM] [Info] Number of data points in the train set: 2342, number of used features: 67
[LightGBM] [Info] Start training from score 179795.180188
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
(the same warning is repeated for many of the remaining boosting iterations)
```
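If you want to see which variables drive the predictions, the underlying lgb.Booster can be pulled out of the fitted workflow and passed to lightgbm's importance function. A sketch, assuming workflows::extract_fit_engine() is available in your version (older versions use pull_workflow_fit()$fit) and that treesnip stores the booster as the engine fit:

```r
# pull the raw lightgbm booster out of the fitted workflow
lgb_booster = workflows::extract_fit_engine(lgm_workflow_fit)

# gain / cover / frequency importance per feature
lightgbm::lgb.importance(lgb_booster) %>% head(10)
```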

#7. Evaluation on the test data

Evaluate the trained workflow on the test data.


```r
pred = bind_cols(test_data %>% select(sale_price),
                 lgm_workflow_fit %>% predict(test_data))

pred %>% metrics(sale_price, .pred)
```
output

```
  .metric .estimator    .estimate
  <chr>   <chr>             <dbl>
1 rmse    standard   2.213027e+04
2 rsq     standard   9.353872e-01
3 mae     standard   1.347298e+04
```

Finally, plot the residuals.

```r
# residuals as a percentage of the predicted price
house_residual = pred %>% mutate(residual_pct = (sale_price - .pred) / .pred)

ggplot(house_residual, aes(x = .pred, y = residual_pct)) +
  geom_point() +
  xlab("Predicted Sale Price") +
  ylab("Residual (%)") +
  scale_x_continuous(labels = scales::dollar_format()) +
  scale_y_continuous(labels = scales::percent) +
  theme_bw()
```

(figure: residuals (%) vs. predicted sale price)

#8. References

[R-bloggers: How to Use Lightgbm with Tidymodels](https://www.r-bloggers.com/2020/08/how-to-use-lightgbm-with-tidymodels/)

Optimización bayesiana de hiperparámetros

#Enjoy!
