【R】tidymodelsとworkflowを中心に〜機械学習のフレームワーク

Last updated at 2024-11-18Posted at 2021-01-20

一部修正：2024.9.02

1.はじめに

tidyversベースの新しい機械学習のフレームワークとして、tidymodelsがあります。
tidymodelsとは、統計モデリング/機械学習をするためのメタパッケージで、主に次のパッケージ群から構成されています。
・rsample：訓練データ/テストデータの分割,リサンプリング
・recipe：特徴量エンジニアリング
・parsnip：異なるパッケージのアルゴリズムを同じシンタックスで扱えるようにするラッパー
・yardstic：モデルの精度評価
・workflow：前処理、モデリング、後処理を一括に処理できる

今回は、この中でもworkflowを中心に回帰モデルを実行してみたいと思います。

2.利用するパッケージ

library(tidyverse)
library(gamlss)
library(ggpubr)
library(skimr)
library(patchwork)
library(tidymodels)
library(doParallel)

3.データのインポートとデータ分析

gamlssパッケージからrentデータを利用します。
1993年4月にInfratest　Sozialforschungによって実施され、ミュンヘンで過去4年以内に新たに賃貸契約を結んでいるか、または家賃が値上げされている賃貸物件を無作為に抽出した賃貸価格の調査データです。
skim()はデータ概要を簡単にまとめてくれます。

dat = rent
dat = dat %>% dplyr:: select(R, Fl, A, H, loc ,B,L)
colnames(dat) = c("price", "floor", "building_year", 
                   "heating", "location","bathroom","kitchen")
skim(dat)

output

$>─ Data Summary ────────────
$>                           Values
$>Name                       dat   
$>Number of rows             1969  
$>Number of columns          7     
$>_______________________          
$>Column type frequency:           
$>  factor                   4     
$>  numeric                  3     
$>________________________         
$>Group variables            None  
$>
$>─ Variable type: factor ───────────────────────────────
$>  skim_variable n_missing complete_rate ordered n_unique top_counts             
$>1 heating               0             1 FALSE          2 0: 1580, 1: 389        
$>2 location              0             1 FALSE          3 2: 1247, 3: 550, 1: 172
$>3 bathroom              0             1 FALSE          2 0: 1925, 1: 44         
$>4 kitchen               0             1 FALSE          2 0: 1808, 1: 161        
$>
$>─ Variable type: numeric ───────────────────────────────
$>  skim_variable n_missing complete_rate   mean    sd    p0   p25   p50   p75  p100
$>1 price                 0             1  812.  379.   102.  544.  738.  1022  3000
$>2 floor                 0             1   67.7  20.9   30    52    67     82   120
$>3 building_year         0             1 1948.   29.0 1890  1934  1957   1972  1988
$>  hist 
$>1 ▇▇▂▁▁
$>2 ▆▇▇▅▂
$>3 ▃▁▃▇▇

price : 賃貸価格（目的変数）
floor : 家の広さ
building_year: 建築年
bathroom : バスルームがあるか,1, (1925 obs.) or not, 0, (44 obs.)
heating: 床暖房があるかどうか, 1, (1580 obs.) or not, 0, (389 obs.)
kitchen: 台所の装備が平均より上かどうか,1, (161 obs.) or not, 0, (1808 obs.)
location: 家の所在地を評価する指数,平均より下, 1, 平均, 2, 平均より上 3

最初に目的変数の分布を見てみます。

# hist
g1 = ggplot(dat, aes(x=price)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="white")+
 geom_density(alpha=.2, fill="#FF6666")+ 
  labs(x = "", y = "")
# boxplot
g2 = ggplot(dat, aes(y=price)) + 
 geom_boxplot(aes(x=""), colour="black", fill="white")+
  coord_flip()+ 
  labs(x = "", y = "")

g1 | g2   # パッケージ　patchwork でグラフを簡単に並べることができる

[output]

説明変数と目的変数との関係を調べます。

custom_corr_plot = function(variable1, variable2, df, alpha=0.3){
  p = df %>%
       mutate(
          title = paste(toupper(variable2), "vs", toupper(variable1))
       ) %>%
       ggplot(aes(x = !!sym(variable1), y = !!sym(variable2))) + 
       geom_point(alpha = alpha) +
       # 加法モデル(GAM)
       geom_smooth(se = FALSE, method = "gam", formula =  y ~ splines::bs(x, 2)) +
       # 線形モデル(LM)
       geom_smooth(se = FALSE, method = "lm", color = "firebrick") +
       facet_grid(. ~ title) +
       theme_bw() +
       theme(strip.text = element_text(colour = "black", size = 8, face = 2),
             axis.title = element_blank())
  return(p)
}

variables_continuas <- c("floor","building_year")

plots = map(
            .x = variables_continuas,
            .f = custom_corr_plot,
            variable2 = "price",
            df = dat
         )
p3 = ggplot(data = dat, aes(x = heating, y = price)) + 
      geom_violin() +
      geom_boxplot(width = 0.1, outlier.shape = NA) +
      labs(ttile = "price vs heating") +
      theme_bw()

p4 = ggplot(data = dat, aes(x = location, y =price)) + 
      geom_violin() +
      geom_boxplot(width = 0.1, outlier.shape = NA) +
      labs(ttile = "price vs location") +
      theme_bw()
p5 = ggplot(data = dat, aes(x = bathroom, y =price)) + 
      geom_violin() +
      geom_boxplot(width = 0.1, outlier.shape = NA) +
      labs(ttile = "price vs bathroom") +
      theme_bw()

p6 = ggplot(data = dat, aes(x = kitchen, y =price)) + 
      geom_violin() +
      geom_boxplot(width = 0.1, outlier.shape = NA) +
      labs(ttile = "price vs kitchen") +
      theme_bw()

plot2 = list(p3,p4,p5,p6)
plot1 = c(plots,plot2)
ggarrange(plotlist = plot1, ncol = 2, nrow = 3) #ggpubrパッケージを使って図を並べる

[output]

4.訓練データとテストデータの分割

priceで層化抽出をしつつ、訓練データとテストデータに分割します。

set.seed(12)
initial_split = initial_split(dat, strata = "price", p = 0.75)
data_train = training(initial_split)
data_test  = testing(initial_split)

5.recipeを使った特徴量エンジニアリング

パッケージrecipesには、データを前処理するための様々な機能が組み込まれており、それらを任意のデータセットに適用できます。
1.recipe()：目的変数と説明変数、訓練データを定義します。
2.step_()：適用するすべての変換（スケーリングなど）を定義します。
step_center():平均を0にするセンタリングします。all_numeric()で数値型変数に適用します。
step_scale():標準偏差で割ることによってスケールをそろえます。
　 step_dummy():factorをダミー変数化します。その際にすべて変数はnumeric型になります。　
3.prep():訓練データを用いて、これらの変換に必要なパラメータを学習します。
4.juice()、bake():juiceは現在のデータに変換を適用します。bakeは、新しいデータに変換を適用します。

recipe1 = recipe(price ~.,data=data_train) %>% 
  step_center(all_numeric(), -all_outcomes()) %>% 
  step_scale(all_numeric(), -all_outcomes()) %>%                                            
  step_dummy(all_nominal(),-all_outcomes())

workflowを使う場合には、recipe関数のprep()やjuice(),bake()は用いません。

6.モデルの定義

回帰木モデルを用いて、賃貸価格の予測にチャレンジします。
チューニングするハイパーパラメータを定義します。
　tree_depth：木の深さ
　min_n：最小の葉（ノード）の分割単位

チューニングするパラメータはtune()で指定します。　

model_tree <- decision_tree(
                 mode       = "regression",
                 tree_depth = tune(),
                 min_n      = tune()
               ) %>%
               set_engine(engine = "rpart")

7.workflowの定義

workflowを使って一連の流れを定義します。
データの前処理にrecipe1を適用し、モデルとしてmodel_treeを適用します。

rt_workflow = workflow() %>% add_recipe(recipe1) %>% add_model(model_tree)
rt_workflow

output

$>══ Workflow ═══════════════════════════════════════════════════════════════════════════
$>Preprocessor: Recipe
$>Model: decision_tree()
$>
$>─ Preprocessor ────────────────────────────────────
$>3 Recipe Steps
$>
$>● step_center()
$>● step_scale()
$>● step_dummy()
$>
$>─ Model ───────────────────────────────────────
$>Decision Tree Model Specification (regression)
$>
$>Main Arguments:
$>  tree_depth = tune()
$>  min_n = tune()
$>
$>Computational engine: rpart

8.ハイパーパラメータのチューニング

先程、定義したハイパーパラメータの最適値を探索します。
そのために、交差検証法を使います。

set.seed(123)
cv_folds = vfold_cv(
              data    = data_train,
              v       = 5,
              strata  = price
             )

vfols_cv()関数は、データをAnalysis（訓練データ）とAssess（検証データ）に自動的に振り分けてくれます。

cv_folds$splits

output

$>[[1]]
$><Analysis/Assess/Total>
$><1181/297/1478>
$>
$>[[2]]
$><Analysis/Assess/Total>
$><1181/297/1478>
$>
$>[[3]]
$><Analysis/Assess/Total>
$><1183/295/1478>
$>
$>[[4]]
$><Analysis/Assess/Total>
$><1183/295/1478>
$>
$>[[5]]
$><Analysis/Assess/Total>
$><1184/294/1478>

ハイパーパラメータの探索空間を定義します。
grid_space_filling()を用いてグリッドをランダムに探索空間域を最大限カバーするように定義します。
dialsパッケージを用います。

set.seed(123)
hiper_grid = grid_space_filling(
                  tree_depth(range = c(1, 15), trans = NULL),
                  min_n(range      = c(2, 100), trans = NULL),
                  size = 50
                )

定義したハイパーパラメータの探索空間を見てみます。

plot(hiper_grid)

[output]

rt_workflowにtune_gridを適用して、交差検証法によるパラメータ探索を行います。
tuneパッケージを用います。
結果は、workflowオブジェクトからlistオブジェクトに変わります。

registerDoParallel(cores = parallel::detectCores() - 1) #並列処理をする
rt_workflow_result = rt_workflow %>% tune_grid(
              resamples    = cv_folds,
              metrics      = metric_set(rmse),
              grid         = hiper_grid
)
stopImplicitCluster()

探索結果を図示化します。

autoplot(rt_workflow_result)

[output]

探索したパラメータ結果の上位3を表示します。

rt_workflow_result %>% show_best(metric="rmse",n=3)

9.ハイパーパラメータの探索結果をworkflowオブジェクトに反映させる

workflowオブジェクトにモデルのパラメータのベストを入れて更新します。

rt_workflow = rt_workflow %>% 
  finalize_workflow(parameters = select_best(rt_workflow_result,metric = "rmse"))
                    
rt_workflow

output

$>══ Workflow ═══════════════════════════════════════════════════════════════════════════
$>Preprocessor: Recipe
$>Model: decision_tree()
$>
$>─ Preprocessor ────────────────────────────────────
$>3 Recipe Steps
$>
$>● step_center()
$>● step_scale()
$>● step_dummy()
$>
$>─ Model ───────────────────────────────────────
$>Decision Tree Model Specification (regression)
$>
$>Main Arguments:
$>  tree_depth = 15
$>  min_n = 15
$>
$>Computational engine: rpart

10.モデルの訓練

モデルを訓練データにfitさせます。

rt_workflow_fit = rt_workflow %>% fit(data_train) 
rt_workflow_fit

output

$>══ Workflow [trained] ═════════════════════════════════════════════════════════════════
$>Preprocessor: Recipe
$>Model: decision_tree()
$>
$>─ Preprocessor ────────────────────────────────────
$>3 Recipe Steps
$>
$>● step_center()
$>● step_scale()
$>● step_dummy()
$>
$>─ Model ───────────────────────────────────────
$>n= 1478 
$>
$>node), split, n, deviance, yval
$>      * denotes terminal node
$>
$> 1) root 1478 206965000  810.1180  
$>   2) floor< 0.2118702 866  67975480  685.4267  
$>     4) heating_X1>=0.5 188   9618601  512.6782 *
$>     5) heating_X1< 0.5 678  51190920  733.3274  
$>      10) building_year< 0.8182134 573  35776430  696.1368  
$>        20) floor< -1.140178 151   4161538  574.7689 *
$>        21) floor>=-1.140178 422  28594760  739.5647 *
$>      11) building_year>=0.8182134 105  10296940  936.2819 *
$>   3) floor>=0.2118702 612 106472300  986.5603  
$>     6) heating_X1>=0.5 96   9467052  714.1677 *
$>     7) heating_X1< 0.5 516  88557020 1037.2380  
$>      14) floor< 1.032757 317  39553970  946.4211 *
$>      15) floor>=1.032757 199  42223690 1181.9060  
$>        30) building_year< 0.9907466 170  33859810 1126.9080  
$>          60) floor< 2.384806 161  29676900 1100.4560 *
$>          61) floor>=2.384806 9   2055055 1600.1000 *
$>        31) building_year>=0.9907466 29   4835272 1504.3100 *

Workflow に[訓練済み]が表示されます。

11.モデルの検証

新しいデータ（テストデータ）を検証します。

pred = bind_cols(data_test,rt_workflow_fit%>% predict(data_test))
rmse(pred, truth=price, estimate=.pred)

[output]

.metric	.estimator	.estimate
rmse	standard	313.5088

1 row

さらにもう一度、全データを10分割し、交差検証法で検証します。
vfold_cvで分割したデータ全てにfit＆predictを適用させるにはfit_resamplesを使います。

set.seed(123)
cv_folds10 = vfold_cv(
              data    = dat,　#全データを対象に10分割
              v       = 10,
              strata  = price
             )

collect_metrics(rt_workflow %>% fit_resamples(cv_folds10))

[output]

.metric	.estimator	mean	n	std_err	.estimator	mean	n	std_err	.config
rmse	standard	317.6397944	10	2.62537362	standard	317.6397944	10	2.62537362	Preprocessor1_Model1
rsq	standard	0.3000373	10	0.02077113	standard	0.3000373	10	0.02077113	Preprocessor1_Model1

上記の関数の結果をマニュアルで確認します。

keka=NULL
for(i in 1:10){
cv = cv_folds10$splits[[i]]  
fit = rt_workflow %>% fit(cv %>%  analysis()) #10分割の訓練データをfitさせる
pred = bind_cols(cv %>% assessment(),
                 fit %>% predict(cv %>% assessment()))　#10分割の検証データをpredictさせる
k=rmse(pred, truth=price, estimate=.pred) #rmseを算出する
keka=c(keka,k$.estimate)
}
mean(keka)

output

&>[1] 317.6398

fit_resamplesの結果と一致しました。

12.前処理の入れ替え

ここからが、workflowオブジェクトの出番です
前処理として正規化する過程を省いた影響を調べてみます。

recipe2 = recipe(price ~.,data=data_train) %>% 
  #step_center(all_numeric(), -all_outcomes()) %>% 
  #step_scale(all_numeric(), -all_outcomes()) %>%                                           
  step_dummy(all_nominal(),-all_outcomes()) 

#update_recipe()でworkflowオブジェクトのうち、recipeだけをアップデートさせる
rt_workflow = rt_workflow %>% update_recipe(recipe2) 
rt_workflow

output

$>Preprocessor: Recipe
$>Model: decision_tree()
$>
$>─ Preprocessor ────────────────────────────────────
$>1 Recipe Step
$>
$>● step_dummy()
$>
$>─ Model ───────────────────────────────────────
$>Decision Tree Model Specification (regression)
$>
$>Main Arguments:
$>  tree_depth = 15
$>  min_n = 15
$>
$>Computational engine: rpart

10分割交差検証法で結果を見ます。

collect_metrics(rt_workflow %>% fit_resamples(cv_folds10))

[output]

.metric	.estimator	mean	n	std_err	.estimator	mean	n	std_err	.config
rmse	standard	317.6397944	10	2.62537362	standard	317.6397944	10	2.62537362	Preprocessor1_Model1
rsq	standard	0.3000373	10	0.02077113	standard	0.3000373	10	0.02077113	Preprocessor1_Model1

正規化しなくても結果は、全く変わりませんでした。
一般的にTreeモデル系は、正規化することは必要とされていないことがわかります。
このように簡単にモデルを変更し、結果を検証することが可能になります。

13.参考

tidymodelsによるtidyな機械学習（その2：Cross Varidation）
Machine Learning con R y

enjoy!

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up