In this short note I would like to share something I myself learned only recently. Although this will be familiar to the expert ML practitioners among you, to me this fact was highly counter-intuitive and cost me a few hours of debugging recently. Therefore, I am writing this article mostly as a memo to myself.
So, here is the deal: column order changes the results of XGBoost predictions. Let me explain what I mean with an example (the notebook is available on Kaggle):
First, let us make ourselves a simple "dataset" filled with random values and an XGBoost model:
import xgboost as xgb
import pandas as pd
import numpy as np
import sklearn.metrics
import sklearn.model_selection
np.random.seed(42)  # reproducibility
X = pd.DataFrame(data=np.random.randn(1000, 10), columns=[f"f{i}" for i in range(10)])
y = np.random.randn(1000)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y)
model = xgb.XGBRegressor()
Next, we define a simple function to compute the mean squared error of a model on the validation set:
def grade(X_train, X_test, y_train, y_test, model):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return sklearn.metrics.mean_squared_error(y_test, y_pred), y_pred.sum()
Note that the function returns not only the error but also the sum of the predictions (we will use this later to determine whether the predictions changed).
Now, we establish a baseline score:
> grade(X_train, X_test, y_train, y_test, model)
(1.1874636567148573, -16.174873)
Next, we permute the rows of the training dataset and see whether this changes the results:
> permuted_rows = np.random.permutation(range(X_train.shape[0]))
> grade(X_train.iloc[permuted_rows, :], X_test, y_train[permuted_rows], y_test, model)
(1.1874636567148573, -16.174873)
So we see that permuting the rows of the training dataset does not change the result.
Similarly, permuting the rows of the validation dataset does not change the result:
> permuted_rows = np.random.permutation(range(X_test.shape[0]))
> grade(X_train, X_test.iloc[permuted_rows, :], y_train, y_test[permuted_rows], model)
(1.1874636567148573, -16.174873)
However, if we change the column order, the result changes!
> np.random.seed(42)
> permuted_col_names = np.random.permutation(list(X))
> grade(X_train[permuted_col_names], X_test[permuted_col_names], y_train, y_test, model)
(1.17516067774827, -16.930027)
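Out of curiosity, we can also confirm that the fitted models themselves differ, not just their predictions. One way (my own check, not part of the original notebook) is to compare the text dumps of the trained trees; my hedged guess at the cause is that the greedy split search considers features in column order, so exact ties in split gain and floating-point summation order can resolve differently:
# Fit one model on the original columns and one on the permuted columns,
# then compare the trained trees via XGBoost's text dump.
model_orig = xgb.XGBRegressor().fit(X_train, y_train)
model_perm = xgb.XGBRegressor().fit(X_train[permuted_col_names], y_train)
trees_orig = model_orig.get_booster().get_dump()  # one text description per tree
trees_perm = model_perm.get_booster().get_dump()
print(trees_orig == trees_perm)  # expected False: some splits are chosen differently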
As a final note, I would like to point out that the same thing happens in BigQuery ML's version of XGBoost (which is where I originally discovered this issue).
This was especially counter-intuitive for me, as I used to think that column order does not matter in tables.
However, as seen above, changing the column order of the table one submits to BigQuery ML's XGBoost changes the prediction results one gets.
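If you want to guard against this, a simple workaround is to fix a canonical column order and apply it before every fit and predict call. Here is a minimal sketch (canonicalize is a hypothetical helper of mine, and sorting the column names is just one possible convention):
def canonicalize(df):
    # Hypothetical helper: put columns into a fixed (sorted) order so that
    # training and prediction always see the same column layout.
    return df[sorted(df.columns)]
model.fit(canonicalize(X_train), y_train)
y_pred = model.predict(canonicalize(X_test))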