Objective
To explore Permutation Feature Importance (PFI) and run it with an R script. This article is based on 「機械学習を解釈する技術 ~Techniques for Interpreting Machine Learning~」 by Mitsunosuke Morishita. In the book, the author does not implement every method in R, so I decided to write a brief note with an R script.
Permutation Feature Importance
Permutation Feature Importance (PFI) is defined to be the decrease in a model score when a single feature value is randomly shuffled. This procedure breaks the relationship between the feature and the target, thus the drop in the model score is indicative of how much the model depends on the feature. (scikit-learn) Here are the five simple steps of PFI (a minimal sketch in R follows the list):
- Predict the target with ALL explanatory variables and calculate the prediction error; this is the baseline
- Pick one explanatory variable and permute/shuffle it in the dataset. Predict the target and calculate the prediction error
- Calculate the difference between the prediction errors from steps 1 and 2. This difference is the importance of the variable picked in step 2
- Repeat steps 2 and 3 for all explanatory variables
- Compare the importance of all variables and analyze
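Before moving to a real dataset, here is a minimal sketch of these five steps in plain R. The toy linear model, the built-in mtcars data, and the rmse helper below are illustrative assumptions only, not part of the workflow used later in this article.
set.seed(42)
# Toy setup for illustration: linear model on the built-in mtcars data
rmse = function(actual, predicted) sqrt(mean((actual - predicted)^2))
model = lm(mpg ~ ., data = mtcars)
# Step 1: baseline error with all explanatory variables intact
baseline = rmse(mtcars$mpg, predict(model, mtcars))
# Steps 2-4: shuffle one variable at a time and record the increase in error
features = setdiff(colnames(mtcars), "mpg")
importance = sapply(features, function(v) {
  shuffled = mtcars
  shuffled[[v]] = sample(shuffled[[v]])  # break the link between the feature and the target
  rmse(mtcars$mpg, predict(model, shuffled)) - baseline
})
# Step 5: compare the importance of all variables
sort(importance, decreasing = TRUE)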
Execution with Real Data
Now, let's see how to run PFI with an actual dataset.
Get Dataset
# Set up
library(mlbench)
library(tidymodels)
library(DALEX)
library(ranger)
library(Rcpp)
library(corrplot)
data("BostonHousing")
df = BostonHousing
`%notin%` <- Negate(`%in%`)
Overview of the Dataset
Here is an overview of the dataset.
head(df)
medv is our response variable; this is what we predict.
hist(df$medv, breaks = 20, main = 'Histogram of medv', xlab = 'medv ($ in 1,000)')
Build a Model
We won't go into the details of model building in this article. I used a random forest (via the ranger engine) for the model.
split = initial_split(df, 0.8)
train = training(split)
test = testing(split)
model = rand_forest(trees = 100, min_n = 1, mtry = 13) %>%
set_engine("ranger", seed = 25) %>%
set_mode("regression")
fit = model %>%
fit(medv ~., data=train)
fit
Predict medv
result = test %>%
select(medv) %>%
bind_cols(predict(fit, test))
metrics = metric_set(rmse, rsq)
result %>%
metrics(medv, .pred)
.metric | .estimator | .estimate |
---|---|---|
rmse | standard | 3.8857 |
rsq | standard | 0.8627 |
Interpret Feature Importance
Use the explain() function to create an explainer object that helps us interpret the model.
explainer = fit %>%
explain(
data = test %>% select(-medv),
y = test$medv
)
Use the model_parts() function to get PFI. Here you can see rm and lstat are the top two important variables for predicting medv. The dark blue boxplots show the distribution of the loss, since it is calculated multiple times.
- loss_function: evaluation metric (loss function) used to measure the drop in performance
- B: number of shuffles
- type: method of calculating importance; "difference" or "ratio" are applicable
pfi = explainer %>%
model_parts(
loss_function = loss_root_mean_square,
B = 10,
type = "difference"
)
plot(pfi)
FYI
Method | Function |
---|---|
Permutation Feature Importance(PFI) | model_parts() |
Partial Dependence(PD) | model_profile() |
Individual Conditional Expectation(ICE) | predict_profile() |
SHAP | predict_parts() |
Grouped Permutation Feature Importance (GPFI)
If some explanatory variables are correlated with each other, PFI won't work well. Let's say $X0$ and $X1$ are correlated. While calculating the importance of $X0$, the model can still use $X1$ for prediction, so the performance of the model does not decrease much. Thus, PFI will underestimate the importance of $X0$ (and, by the same argument, of $X1$). In the plot below, rad (index of accessibility to radial highways) and tax (full-value property-tax rate per $10,000) are strongly correlated. In a situation like this, we should shuffle both variables together. We should also use GPFI when variables are one-hot encoded, or when dealing with data such as latitudes and longitudes.
num_df <- df[, unlist(lapply(df, is.numeric))]
corrplot(cor(num_df), method = 'number', order = 'alphabet')
Rad and Tax
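Before running it with DALEX, here is a minimal sketch of the grouped-shuffle idea itself, reusing the fit and test objects from above; rmse_vec() comes from yardstick (loaded with tidymodels). This only illustrates the mechanism and is not the procedure model_parts() runs internally.
set.seed(25)
# Baseline error on the test set
baseline = rmse_vec(test$medv, predict(fit, test)$.pred)
# Shuffle tax and rad together, using the same random order for both columns
shuffled = test
idx = sample(nrow(shuffled))
shuffled[, c("tax", "rad")] = shuffled[idx, c("tax", "rad")]
# Importance of the {tax, rad} group = increase in error after the joint shuffle
rmse_vec(test$medv, predict(fit, shuffled)$.pred) - baseline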
So let's run GPFI on our dataset. The model_parts() function has a variable_groups argument, which takes a list object. So make a list that contains the names of the explanatory variables to shuffle together, in this case rad and tax1. The source code of feature_importance is here.
# Make list
paired_var = list(c("tax","rad"))
# Make a vector of explanatory variables. Do not forget to take out your response variable
all_vars = c(colnames(df)[colnames(df) %notin% c("tax","rad","medv")])
# Gather
var_list = c(all_vars, paired_var)
pfi = explainer %>%
model_parts(
loss_function = loss_root_mean_square,
B = 10,
type = "difference",
variable_groups = var_list
)
# Plot the grouped PFI
plot(pfi)
If you also keep tax and rad as individual variables in the plot, you can see that their importance is dispersed between them.
# Make list
paired_var = list(c("tax","rad"))
# Make a vector of explanatory variables. Do not forget to take out your response variable
all_vars = c(colnames(df)[colnames(df) %notin% c("medv")])
# Gather
var_list = c(all_vars, paired_var)
pfi = explainer %>%
model_parts(
loss_function = loss_root_mean_square,
B = 10,
type = "difference",
variable_groups = var_list
)
# Plot the PFI with the grouped and the individual variables
plot(pfi)
Conclusion
PFI and GPFI are efficient methods for calculating the importance of the explanatory variables in a model. On the other hand, PFI does not explain how each variable affects the model's predictions. That can be done with Partial Dependence (PD).
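As a pointer to that next step, the same explainer object can be reused with model_profile() from the FYI table above. A minimal sketch, assuming the explainer built earlier in this article:
pd = explainer %>%
model_profile(
variables = c("rm", "lstat")  # variables whose effect on the prediction we want to see
)
plot(pd)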
References
Methods of Interpreting Machine Learning (Qiita links)
It may not be appropriate to pair up the tax and rad variables without proper causal reasoning. ↩