[Review] MATLAB ~Selection of Predictors in Random Forest~

Posted at 2018-07-16

Link: https://jp.mathworks.com/examples/statistics/mw/stats-ex32080088-select-predictors-for-random-forests

Introduction

Recently I have been preparing for my master's course in the UK, which starts on 24 September. One of the programming languages they use is MATLAB, so I have written this article to review my knowledge of MATLAB.

Select Predictors for Random Forests

This example shows how to choose the appropriate split predictor selection technique for your data set when growing a random forest of regression trees. This example also shows how to decide which predictors are most important to include in the training data.

Load and Pre-process Data

Load the carbig data set. Consider a model that predicts the fuel economy of a car given its number of cylinders, engine displacement, horsepower, weight, acceleration, model year, and country of origin. Consider Cylinders, Model_Year, and Origin as categorical variables.

load carbig
Cylinders = categorical(Cylinders);
Model_Year = categorical(Model_Year);
Origin = categorical(cellstr(Origin));
X = table(Cylinders, Displacement, Horsepower, Weight, Acceleration, Model_Year, Origin, MPG);

Determine Levels in Predictors

The standard CART (Classification and Regression Trees) algorithm tends to split on predictors with many unique values (levels), e.g., continuous variables, over those with fewer levels, e.g., categorical variables. If your data is heterogeneous, or your predictor variables vary greatly in their number of levels, then consider using the curvature or interaction tests for split-predictor selection instead of standard CART.
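For reference, the selection technique is controlled through the 'PredictorSelection' name-value pair of templateTree, which this example uses later. A minimal sketch of the three documented options:

tStandard  = templateTree('PredictorSelection', 'allsplits');             % standard CART (default)
tCurvature = templateTree('PredictorSelection', 'curvature');             % curvature test
tInteract  = templateTree('PredictorSelection', 'interaction-curvature'); % interaction test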
For each predictor, determine the number of levels in the data. One way to do this is to define an anonymous function that:

  1. Converts all variables to the categorical data type using categorical
  2. Determines all unique categories, ignoring missing values, using categories
  3. Counts the categories using numel

Then, apply the function to each variable using varfun.

countLevels = @(x) numel(categories(categorical(x)));
numLevels = varfun(countLevels, X(:, 1:end-1), 'OutputFormat', 'uniform');

Compare the number of levels among the predictor variables.

figure
bar(numLevels)
title('Number of Levels Among Predictors')
xlabel('Predictor variable')
ylabel('Number of levels')
h = gca;
h.XTickLabel = X.Properties.VariableNames(1:end-1);
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';

[Figure: bar chart comparing the number of levels across the predictors]

The continuous variables have many more levels than the categorical variables. Because the number of levels among the predictors varies so much, using standard CART to select split predictors at each node of the trees in a random forest can yield inaccurate predictor importance estimates.

Grow Robust Random Forest

Grow a random forest of 200 regression trees. Specify sampling all variables at each node. Specify usage of the interaction test to select split predictors. Because there are missing values in the data, specify usage of surrogate splits to increase accuracy.

t = templateTree('NumVariablesToSample', 'all', ...
    'PredictorSelection', 'interaction-curvature', 'Surrogate', 'on');
rng(1); % For reproducibility
Mdl = fitrensemble(X, 'MPG', 'Method', 'bag', 'NumLearningCycles', 200, 'Learners', t);

Mdl is a RegressionBaggedEnsemble model.

Estimate the model $R^2$ using out-of-bag predictions.

yHat = oobPredict(Mdl);
R2 = corr(Mdl.Y, yHat)^2

R2 = 0.8739

Mdl explains 87.39% of the variability around the mean.
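As an optional cross-check (not part of the original example), the out-of-bag mean squared error can also be obtained directly from the ensemble:

oobMSE = oobLoss(Mdl) % out-of-bag mean squared error; lower is better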

Predictor Importance Estimation

Estimate predictor importance values by permuting out-of-bag observations among the trees.

impOOB = oobPermutedPredictorImportance(Mdl);

impOOB is a 1-by-7 vector of predictor importance estimates corresponding to the predictors in Mdl.PredictorNames. The estimates are not biased toward predictors containing many levels.

figure
bar(impOOB)
title('Unbiased Predictor Importance Estimates')
xlabel('Predictor variable')
ylabel('Importance')
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';

[Figure: bar chart of unbiased predictor importance estimates]

Greater importance estimates indicate more important predictors. The bar chart suggests that Model_Year is the most important predictor, followed by Weight. Model_Year has only 13 distinct levels, whereas Weight has over 300.
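To read this ranking off programmatically rather than from the bar chart, a small sketch (not in the original example) that sorts the estimates:

[~, idx] = sort(impOOB, 'descend');   % indices of predictors, most important first
ranking = Mdl.PredictorNames(idx)     % predictor names in order of OOB permuted importance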

Compare the predictor importance estimates obtained by permuting out-of-bag observations with the estimates obtained by summing gains in the mean squared error due to splits on each predictor. Also, obtain predictor association measures estimated by surrogate splits.

[impGain,predAssociation] = predictorImportance(Mdl);
figure;
plot(1:numel(Mdl.PredictorNames),[impOOB' impGain']);
title('Predictor Importance Estimation Comparison')
xlabel('Predictor variable');
ylabel('Importance');
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
legend('OOB permuted','MSE improvement')
grid on

[Figure: line plot comparing OOB permuted importance with MSE-improvement importance]

impGain is commensurate with impOOB. However, according to the values of impGain, Model_Year and Weight do not appear to be the most important predictors.

predAssociation is a 7-by-7 matrix of predictor association measures. Rows and columns correspond to the predictors in Mdl.PredictorNames. You can infer the strength of the relationship between pairs of predictors using the elements of predAssociation. Larger values indicate more highly correlated pairs of predictors.

figure;
imagesc(predAssociation);
title('Predictor Association Estimates');
colorbar;
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
h.YTickLabel = Mdl.PredictorNames;

[Figure: heat map of predictor association estimates]

predAssociation(1,2)

ans = 0.683

The largest association is between Cylinders and Displacement, but the value is not high enough to indicate a strong relationship between the two predictors.
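If you prefer to locate this pair programmatically instead of reading it off the heat map, a small sketch (using only standard MATLAB indexing and the names introduced above):

A = predAssociation;
A(logical(eye(size(A)))) = -Inf;   % mask the diagonal (self-association)
[maxVal, linIdx] = max(A(:));
[row, col] = ind2sub(size(A), linIdx);
fprintf('Largest association: %s vs %s = %.3f\n', ...
    Mdl.PredictorNames{row}, Mdl.PredictorNames{col}, maxVal)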

Grow Random Forest Using Reduced Predictor Set

Because prediction time increases with the number of predictors in random forests, it is good practice to create a model using as few predictors as possible.

Grow a random forest of 200 regression trees using the best two predictors only.

MdlReduced = fitrensemble(X(:, {'Model_Year' 'Weight' 'MPG'}), 'MPG', 'Method', 'bag', ...
    'NumLearningCycles', 200, 'Learners', t);

Compute the $R^2$ of the reduced model.

yHatReduced = oobPredict(MdlReduced);
r2Reduced = corr(Mdl.Y,yHatReduced)^2

r2Reduced = 0.8524

The $R^2$ for the reduced model is close to the $R^2$ of the full model.
This result suggests that the reduced model is sufficient for prediction.
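To quantify the prediction-time benefit mentioned above, a rough sketch using timeit (not part of the original example; timings are machine-dependent, and for table-trained models predict ignores table variables it does not need):

tFull    = timeit(@() predict(Mdl, X));        % full 7-predictor model
tReduced = timeit(@() predict(MdlReduced, X)); % 2-predictor model
fprintf('Full: %.4f s, Reduced: %.4f s\n', tFull, tReduced)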
