Simple function to export and download XGBoost BigQuery model

Today I would like to briefly share a simple function I assembled to export and download an XGBoost BigQuery model in a single shot. The procedure was suggested to me by my boss and colleague E. Torii; I just figured out how to do it completely in Python and assembled the pieces.

So, here goes the code:

from google.cloud import storage, bigquery
import xgboost as xgb
import json
import os
from os import path

def _list_blobs(storage_client, bucket_name, prefix="", include_prefix=False):
    blobs = storage_client.list_blobs(bucket_name)
    res = [blob.name[len(prefix):]
           for blob in blobs if blob.name.startswith(prefix)]
    if include_prefix:
        res = [prefix+x for x in res]
    return res

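def _download_blob(storage_client, bucket_name, source_blob_name, destination_file_name):
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    # Write the blob's contents to the given local path
    blob.download_to_filename(destination_file_name)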

def export_model(model_tn, bucket_name, storage_client=None, bq_client=None,
                 model_export_dir="for-models-export",
                 local_download_dir="/tmp/downloaded-models"):
    if storage_client is None:
        storage_client = storage.Client()
    if bq_client is None:
        bq_client = bigquery.Client()
    _MODEL_EXPORT_DIR = model_export_dir
    _DESTINATION_DIR = local_download_dir
    _BUCKET_NAME = bucket_name
    _FILES = {"model_metadata.json": "assets/model_metadata.json",
              "model.bst": "model.bst"}

    model_split = model_tn.split(".")
    assert len(model_split) == 3, (model_tn, model_split)
    model_name = model_split[-1]
    exported_models = {pn.split("/")[0] for pn in _list_blobs(
        storage_client, _BUCKET_NAME, prefix=f"{_MODEL_EXPORT_DIR}/")}
    if model_name not in exported_models:
        model_ref = bq_client.get_model(model_tn)
        # Export the model to the bucket and wait for the extract job to finish
        bq_client.extract_table(
            source=model_ref, destination_uris=f"gs://{_BUCKET_NAME}/{_MODEL_EXPORT_DIR}/{model_name}",).result()
    _dest_dir = path.join(_DESTINATION_DIR, model_name)
    os.makedirs(_dest_dir, exist_ok=True)
    for fn, remote_path in _FILES.items():
        _fn = (path.join(_dest_dir, fn))
        if not path.isfile(_fn):
            _download_blob(storage_client, _BUCKET_NAME,
                           f"{_MODEL_EXPORT_DIR}/{model_name}/{remote_path}", path.join(_fn))
    model = xgb.Booster(model_file=path.join(_dest_dir, "model.bst"))
    with open(path.join(_dest_dir, "model_metadata.json")) as f:
        metadata = json.load(f)
    return model, metadata

Suppose my model is called model and lives in dataset dataset of project project, and that I use my bucket bucket for the export. Assuming the code above is saved in a module named common, a sample call of export_model looks as follows:

model, metadata = common.export_model("project.dataset.model", "bucket")

As one can see, the function call returns two objects: model, which is an instance of the xgboost.Booster class, and metadata, which is the object parsed from model_metadata.json and describes the model's metadata.

More precisely, here is what will happen under the hood:

  1. model project.dataset.model is exported to folder for-models-export/model in bucket bucket (if such a folder already exists, the procedure assumes that this step is already done and skips it); note that the default directory name for-models-export can be overridden via the optional keyword argument model_export_dir of export_model;
  2. files model.bst and assets/model_metadata.json from folder for-models-export/model in bucket bucket are downloaded to local files /tmp/downloaded-models/model/model.bst and /tmp/downloaded-models/model/model_metadata.json respectively (again, any file that already exists locally is not downloaded again); the name /tmp/downloaded-models of the local directory can be overridden with the optional keyword argument local_download_dir;
  3. /tmp/downloaded-models/model/model.bst and /tmp/downloaded-models/model/model_metadata.json get parsed and returned as model and metadata respectively.
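
To make the path conventions in the steps above concrete, here is a minimal sketch that only computes the locations involved, without touching BigQuery or Cloud Storage. The helper name planned_paths is hypothetical (it is not part of the code above); its defaults are copied from export_model.

```python
import posixpath
from os import path


def planned_paths(model_tn, bucket_name,
                  model_export_dir="for-models-export",
                  local_download_dir="/tmp/downloaded-models"):
    """Hypothetical helper: list the locations that export_model works with."""
    # The model is addressed as project.dataset.model; only the last part
    # is used to name the remote and local folders.
    _project, _dataset, model_name = model_tn.split(".")
    remote_dir = posixpath.join(model_export_dir, model_name)
    local_dir = path.join(local_download_dir, model_name)
    return {
        # Step 1: destination of the BigQuery export
        "export_uri": f"gs://{bucket_name}/{remote_dir}",
        # Step 2: blobs that get downloaded from the bucket ...
        "remote_files": [
            posixpath.join(remote_dir, "model.bst"),
            posixpath.join(remote_dir, "assets/model_metadata.json"),
        ],
        # ... and the local files they are written to (parsed in step 3)
        "local_files": [
            path.join(local_dir, "model.bst"),
            path.join(local_dir, "model_metadata.json"),
        ],
    }


paths = planned_paths("project.dataset.model", "bucket")
for key, value in paths.items():
    print(key, value)
```

Passing model_export_dir or local_download_dir to this helper changes the corresponding paths in the same way the keyword arguments of export_model do.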

These objects can then be used, for example, to plot the gain of every feature used in model:

import xgboost as xgb
xgb.plot_importance(model, importance_type="gain")

or list the gain explicitly in a table:

import xgboost as xgb
import pandas as pd

_feature_names = {f"f{i}": fn for i, fn in enumerate(metadata["feature_names"])}
_score = model.get_score(importance_type="gain")
pd.DataFrame([
    {
        **{k: v for k, v in zip(["xgboost_name", "original_name"], t)},
        "gain": _score.get(t[0]),  # None if the feature was never used in a split
    }
    for t
    in _feature_names.items()
])
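
The name mapping used above can be tried out without a real model. The following sketch uses toy stand-ins for metadata["feature_names"] and for the dictionary returned by model.get_score; the feature names and gain values are made up for illustration. It relies on the same assumption as the code above, namely that the booster refers to features as f0, f1, and so on, in the order given in the metadata.

```python
# Toy stand-in for metadata["feature_names"] (hypothetical names)
feature_names = ["age", "income", "clicks"]

# Toy stand-in for model.get_score(importance_type="gain").
# get_score only reports features that were used in at least one split,
# so "f1" is deliberately missing here.
score = {"f0": 12.5, "f2": 3.0}

# One row per feature: XGBoost's internal name, the original name, the gain
rows = [
    {"xgboost_name": f"f{i}", "original_name": name, "gain": score.get(f"f{i}")}
    for i, name in enumerate(feature_names)
]

# Sort by gain, treating unused features (gain None) as zero
rows_sorted = sorted(rows, key=lambda r: r["gain"] or 0.0, reverse=True)

for row in rows_sorted:
    print(row)
```

Using score.get rather than score[...] keeps features that never appear in a split in the table instead of raising a KeyError.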

As a final remark, the code currently works only with XGBoost models, but it can easily be adapted to other model types as well.


