Today I would like to briefly share a simple function I have assembled in order to export and download an XGBoost BigQuery model in a single shot. I would like to mention that the procedure was suggested to me by my boss and colleague E. Torii; I just figured out how to do it completely in Python and assembled the pieces.
So, here goes the code:
from google.cloud import storage, bigquery
import xgboost as xgb
import json
import os
from os import path


def _list_blobs(storage_client, bucket_name, prefix="", include_prefix=False):
    # List the blob names in the bucket that start with the given prefix,
    # returned with the prefix stripped (or kept, if include_prefix=True).
    blobs = storage_client.list_blobs(bucket_name)
    res = [blob.name[len(prefix):]
           for blob in blobs if blob.name.startswith(prefix)]
    if include_prefix:
        res = [prefix + x for x in res]
    return res


def _download_blob(storage_client, bucket_name, source_blob_name, destination_file_name):
    # Download a single blob from the bucket to a local file.
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)


def export_model(model_tn, bucket_name, storage_client=None, bq_client=None,
                 model_export_dir="for-models-export",
                 local_download_dir="/tmp/downloaded-models"):
    if storage_client is None:
        storage_client = storage.Client()
    if bq_client is None:
        bq_client = bigquery.Client()
    _MODEL_EXPORT_DIR = model_export_dir
    _DESTINATION_DIR = local_download_dir
    _BUCKET_NAME = bucket_name
    # Local file name -> path of that file inside the exported model folder.
    _FILES = {"model_metadata.json": "assets/model_metadata.json",
              "model.bst": "model.bst"}
    # model_tn is expected to be a fully qualified name: project.dataset.model
    model_split = model_tn.split(".")
    assert len(model_split) == 3, (model_tn, model_split)
    model_name = model_split[-1]
    # Export the model from BigQuery to the bucket, unless it is already there.
    exported_models = {pn.split("/")[0] for pn in _list_blobs(
        storage_client, _BUCKET_NAME, prefix=f"{_MODEL_EXPORT_DIR}/")}
    if model_name not in exported_models:
        model_ref = bq_client.get_model(model_tn)
        bq_client.extract_table(
            source=model_ref,
            destination_uris=f"gs://{_BUCKET_NAME}/{_MODEL_EXPORT_DIR}/{model_name}",
        ).result()
    # Download model.bst and model_metadata.json locally, unless already present.
    _dest_dir = path.join(_DESTINATION_DIR, model_name)
    os.makedirs(_dest_dir, exist_ok=True)
    for fn, remote_path in _FILES.items():
        _fn = path.join(_dest_dir, fn)
        if not path.isfile(_fn):
            _download_blob(storage_client, _BUCKET_NAME,
                           f"{_MODEL_EXPORT_DIR}/{model_name}/{remote_path}", _fn)
    # Load the booster and parse the metadata.
    model = xgb.Booster(model_file=path.join(_dest_dir, "model.bst"))
    with open(path.join(_dest_dir, "model_metadata.json")) as f:
        metadata = json.load(f)
    return model, metadata
Suppose that my model is called model, is located in dataset dataset in project project, and I will use my bucket bucket for the export. Then a sample usage of the function export_model defined in the gist above looks as follows:
model, metadata = export_model("project.dataset.model", "bucket")
As one can see, the function call returns two objects: model, which is an instance of the xgboost.Booster class, and metadata, which is the object parsed from model_metadata.json and contains a description of the model's metadata.
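For a quick sanity check, one can inspect what came back; the snippet below only assumes that the metadata is a plain dictionary containing a feature_names entry, which is the key used later in this post:
# Quick look at the returned objects (feature_names is assumed to be present
# in the exported metadata, as it is used further below).
print(type(model))                     # xgboost.core.Booster
print(sorted(metadata.keys()))         # available metadata fields
print(metadata["feature_names"][:5])   # first few original feature names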
More precisely, here is what will happen under the hood:
- model project.dataset.model will be exported to folder for-models-export/model in bucket bucket (if such a folder already exists, the procedure assumes that this step is already done and skips it); note that the default directory name for-models-export can be overridden via the optional keyword argument model_export_dir of export_model;
- files model.bst and assets/model_metadata.json from folder for-models-export/model in bucket bucket are downloaded to local files /tmp/downloaded-models/model/model.bst and /tmp/downloaded-models/model/model_metadata.json respectively (again, if these files already exist, this step is skipped); the name /tmp/downloaded-models of the local directory can be overridden with the optional keyword argument local_download_dir (see the sketch after this list);
- /tmp/downloaded-models/model/model.bst and /tmp/downloaded-models/model/model_metadata.json get parsed and returned as model and metadata respectively.
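For instance, if one prefers to reuse already constructed clients and keep everything under custom directories, the optional keyword arguments can be passed explicitly (the directory names below are just placeholders):
from google.cloud import storage, bigquery

# Reuse pre-built clients and override both directories; the names here are
# placeholders, not values used anywhere else in this post.
storage_client = storage.Client()
bq_client = bigquery.Client()
model, metadata = export_model(
    "project.dataset.model",
    "bucket",
    storage_client=storage_client,
    bq_client=bq_client,
    model_export_dir="my-models-export",
    local_download_dir="/tmp/my-models",
)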
These objects can then be used, for example, to plot the gain of every feature used in the model:
import xgboost as xgb
xgb.plot_importance(model, importance_type="gain")
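One caveat: the booster loaded from model.bst only knows the generic feature names f0, f1, ..., so that is what the plot will show. Depending on the installed xgboost version, it may be possible to attach the original names from the metadata before plotting; this is a sketch, not something the export itself guarantees:
# Assumption: recent xgboost versions expose a settable feature_names property
# on Booster; if this raises, fall back to the mapping table shown below.
model.feature_names = metadata["feature_names"]
xgb.plot_importance(model, importance_type="gain")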
Alternatively, one can list the gain explicitly in a table:
import xgboost as xgb
import pandas as pd

# The booster refers to features by the generic names f0, f1, ...; map them
# back to the original names stored in the exported metadata.
_feature_names = {f"f{i}": fn for i, fn in enumerate(metadata["feature_names"])}
# Features that never appear in a split are absent from get_score(),
# hence the default of 0.0 below.
_score = model.get_score(importance_type="gain")
pd.DataFrame([
    {
        **{k: v for k, v in zip(["xgboost_name", "original_name"], t)},
        "score": _score.get(t[0], 0.0),
    }
    for t in _feature_names.items()
]).set_index("xgboost_name").sort_values(by="score", ascending=False)
As a final remark, the code currently works only with XGBoost models, but it can be easily adapted to other model types as well.
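A natural way to do that would be to parameterize the per-model-type details, namely which files to fetch from the export folder and how to load them, instead of hard-coding them. The registry below is a hypothetical sketch along those lines, with only the XGBoost entry filled in, since the export layout of other model types is not covered in this post:
# Hypothetical sketch: a registry of per-model-type export layouts and loaders.
# Only the XGBoost entry is filled in; entries for other BigQuery ML model
# types would need their actual export layout, which this post does not cover.
import json
import xgboost as xgb
from os import path

MODEL_LOADERS = {
    "xgboost": {
        # local file name -> path of that file inside the exported model folder
        "files": {"model_metadata.json": "assets/model_metadata.json",
                  "model.bst": "model.bst"},
        "load": lambda d: xgb.Booster(model_file=path.join(d, "model.bst")),
    },
    # "tensorflow": {...},  # placeholder for other model types
}


def load_exported_model(dest_dir, model_type="xgboost"):
    # dest_dir is the local folder populated by the download step above.
    spec = MODEL_LOADERS[model_type]
    model = spec["load"](dest_dir)
    with open(path.join(dest_dir, "model_metadata.json")) as f:
        metadata = json.load(f)
    return model, metadata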