Today I would like to briefly share a simple function I have assembled in order to export and download an XGBoost BigQuery model in a single shot. I would like to mention that the procedure was suggested to me by my boss and colleague E. Torii; I just figured out how to do it completely in Python and assembled the pieces.
So, here goes the code:
from google.cloud import storage, bigquery
import xgboost as xgb
import json
import os
from os import path


def _list_blobs(storage_client, bucket_name, prefix="", include_prefix=False):
    # List the blob names in the bucket that start with the given prefix,
    # returned with the prefix stripped (or kept, if include_prefix=True).
    blobs = storage_client.list_blobs(bucket_name)
    res = [blob.name[len(prefix):]
           for blob in blobs if blob.name.startswith(prefix)]
    if include_prefix:
        res = [prefix + x for x in res]
    return res


def _download_blob(storage_client, bucket_name, source_blob_name, destination_file_name):
    # Download a single blob from the bucket to a local file.
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)


def export_model(model_tn, bucket_name, storage_client=None, bq_client=None,
                 model_export_dir="for-models-export",
                 local_download_dir="/tmp/downloaded-models"):
    if storage_client is None:
        storage_client = storage.Client()
    if bq_client is None:
        bq_client = bigquery.Client()
    _MODEL_EXPORT_DIR = model_export_dir
    _DESTINATION_DIR = local_download_dir
    _BUCKET_NAME = bucket_name
    # Local file name -> path of that file inside the exported model folder.
    _FILES = {"model_metadata.json": "assets/model_metadata.json",
              "model.bst": "model.bst"}
    # model_tn is expected to be a fully qualified name: project.dataset.model
    model_split = model_tn.split(".")
    assert len(model_split) == 3, (model_tn, model_split)
    model_name = model_split[-1]
    # Export the model from BigQuery to the bucket, unless it is already there.
    exported_models = {pn.split("/")[0] for pn in _list_blobs(
        storage_client, _BUCKET_NAME, prefix=f"{_MODEL_EXPORT_DIR}/")}
    if model_name not in exported_models:
        model_ref = bq_client.get_model(model_tn)
        bq_client.extract_table(
            source=model_ref,
            destination_uris=f"gs://{_BUCKET_NAME}/{_MODEL_EXPORT_DIR}/{model_name}",
        ).result()
    # Download model.bst and model_metadata.json locally, unless already present.
    _dest_dir = path.join(_DESTINATION_DIR, model_name)
    os.makedirs(_dest_dir, exist_ok=True)
    for fn, remote_path in _FILES.items():
        _fn = path.join(_dest_dir, fn)
        if not path.isfile(_fn):
            _download_blob(storage_client, _BUCKET_NAME,
                           f"{_MODEL_EXPORT_DIR}/{model_name}/{remote_path}", _fn)
    # Load the booster and parse the metadata.
    model = xgb.Booster(model_file=path.join(_dest_dir, "model.bst"))
    with open(path.join(_dest_dir, "model_metadata.json")) as f:
        metadata = json.load(f)
    return model, metadata
Suppose that my model is called model, is located in dataset dataset in project project, and I will use my bucket bucket for the export. Then a sample usage of the function export_model defined in the gist above looks as follows:
model, metadata = export_model("project.dataset.model", "bucket")
As one can see, the function call returns two objects: model, which is an instance of the xgboost.Booster class, and metadata, which is the object parsed from model_metadata.json and contains a description of the model's metadata.
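For a quick sanity check, one can inspect what came back; the snippet below only assumes that the metadata is a plain dictionary containing a feature_names entry, which is the key used later in this post:
# Quick look at the returned objects (feature_names is assumed to be present
# in the exported metadata, as it is used further below).
print(type(model))                     # xgboost.core.Booster
print(sorted(metadata.keys()))         # available metadata fields
print(metadata["feature_names"][:5])   # first few original feature names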
More precisely, here is what will happen under the hood:
- model project.dataset.model will be exported to folder for-models-export/model in bucket bucket (if such a folder already exists, the procedure assumes that this step is already done and skips it); note that the default directory name for-models-export can be overridden via the optional keyword argument model_export_dir of export_model;
- files model.bst and assets/model_metadata.json from folder for-models-export/model in bucket bucket are downloaded to local files /tmp/downloaded-models/model/model.bst and /tmp/downloaded-models/model/model_metadata.json respectively (again, if these files already exist, this step is skipped); the name /tmp/downloaded-models of the local directory can be overridden with the optional keyword argument local_download_dir (see the sketch after this list);
- /tmp/downloaded-models/model/model.bst and /tmp/downloaded-models/model/model_metadata.json get parsed and returned as model and metadata respectively.
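For instance, if one prefers to reuse already constructed clients and keep everything under custom directories, the optional keyword arguments can be passed explicitly (the directory names below are just placeholders):
from google.cloud import storage, bigquery

# Reuse pre-built clients and override both directories; the names here are
# placeholders, not values used anywhere else in this post.
storage_client = storage.Client()
bq_client = bigquery.Client()
model, metadata = export_model(
    "project.dataset.model",
    "bucket",
    storage_client=storage_client,
    bq_client=bq_client,
    model_export_dir="my-models-export",
    local_download_dir="/tmp/my-models",
)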
These objects can then be used, for example, to plot the gain of every feature used in the model:
import xgboost as xgb
xgb.plot_importance(model, importance_type="gain")
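One caveat: the booster loaded from model.bst only knows the generic feature names f0, f1, ..., so that is what the plot will show. Depending on the installed xgboost version, it may be possible to attach the original names from the metadata before plotting; this is a sketch, not something the export itself guarantees:
# Assumption: recent xgboost versions expose a settable feature_names property
# on Booster; if this raises, fall back to the mapping table shown below.
model.feature_names = metadata["feature_names"]
xgb.plot_importance(model, importance_type="gain")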
Alternatively, one can list the gain explicitly in a table:
import xgboost as xgb
import pandas as pd

# The booster refers to features by the generic names f0, f1, ...; map them
# back to the original names stored in the exported metadata.
_feature_names = {f"f{i}": fn for i, fn in enumerate(metadata["feature_names"])}
# Features that never appear in a split are absent from get_score(),
# hence the default of 0.0 below.
_score = model.get_score(importance_type="gain")
pd.DataFrame([
    {
        **{k: v for k, v in zip(["xgboost_name", "original_name"], t)},
        "score": _score.get(t[0], 0.0),
    }
    for t in _feature_names.items()
]).set_index("xgboost_name").sort_values(by="score", ascending=False)
As a final remark, the code currently works only with XGBoost models, but it can be easily adapted to other model types as well.
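A natural way to do that would be to parameterize the per-model-type details, namely which files to fetch from the export folder and how to load them, instead of hard-coding them. The registry below is a hypothetical sketch along those lines, with only the XGBoost entry filled in, since the export layout of other model types is not covered in this post:
# Hypothetical sketch: a registry of per-model-type export layouts and loaders.
# Only the XGBoost entry is filled in; entries for other BigQuery ML model
# types would need their actual export layout, which this post does not cover.
import json
import xgboost as xgb
from os import path

MODEL_LOADERS = {
    "xgboost": {
        # local file name -> path of that file inside the exported model folder
        "files": {"model_metadata.json": "assets/model_metadata.json",
                  "model.bst": "model.bst"},
        "load": lambda d: xgb.Booster(model_file=path.join(d, "model.bst")),
    },
    # "tensorflow": {...},  # placeholder for other model types
}


def load_exported_model(dest_dir, model_type="xgboost"):
    # dest_dir is the local folder populated by the download step above.
    spec = MODEL_LOADERS[model_type]
    model = spec["load"](dest_dir)
    with open(path.join(dest_dir, "model_metadata.json")) as f:
        metadata = json.load(f)
    return model, metadata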