BigQuery → TensorFlow Dataset

Posted at 2023-12-07

This article is the day 8 entry in series 7 of the ZOZO Advent Calendar 2023.

Overview

This post introduces two ways to convert a BigQuery table into a TensorFlow Dataset.

1. Using the BigQuery reader

This is the method described in the official TensorFlow I/O BigQuery tutorial.

import tensorflow as tf
from tensorflow_io.bigquery import BigQueryClient
from google.cloud import bigquery

PROJECT_ID = "project_id"
DATASET_ID = "dataset_id"
CSV_SCHEMA = [
    bigquery.SchemaField("user_id", "INTEGER"),
    bigquery.SchemaField("item_id", "INTEGER"),
    bigquery.SchemaField("score", "INTEGER"),
]

def read_bigquery(table_name):
    tensorflow_io_bigquery_client = BigQueryClient()
    read_session = tensorflow_io_bigquery_client.read_session(
        "projects/" + PROJECT_ID,
        PROJECT_ID,
        table_name,
        DATASET_ID,
        # Column names to read, plus the TensorFlow dtype of each column
        # (all INTEGER in this schema, hence tf.int64).
        list(field.name for field in CSV_SCHEMA),
        list(tf.int64 for field in CSV_SCHEMA),
        requested_streams=2,
    )

    # Read the BigQuery Storage API streams in parallel; each element is a
    # dict mapping column names to scalar tensors, one element per row.
    dataset = read_session.parallel_read_rows()
    return dataset

train = (
    read_bigquery("table_name")
    .batch(256)
)
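
Each element of the resulting dataset is a dict keyed by column name. As a minimal sketch (assuming score is the column you want to predict; to_features_and_label is a hypothetical helper, not part of the tutorial), you can split each batch into a (features, label) pair before passing it to model.fit:

# Hypothetical helper: split each batched row dict into (features, label),
# assuming "score" is the label column.
def to_features_and_label(row):
    label = row.pop("score")
    return row, label

train_xy = train.map(to_features_and_label)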

2. Going through Cloud Storage

The following approach is useful when you also want to keep the intermediate data in Cloud Storage.

1. BigQuery → Cloud Storage

EXPORT DATA
  OPTIONS (
    uri = "gs://bucket_name/dataset/*.csv",
    format = "CSV",
    overwrite = true,
    header = true
  )
AS 
SELECT
  user_id,
  item_id,
  score,
FROM `project_id.dataset_id.table_name`
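
You can run this EXPORT DATA statement in the BigQuery console, or from Python via the google-cloud-bigquery client. A minimal sketch (the project, dataset, table, and bucket names are the placeholders used above):

from google.cloud import bigquery

client = bigquery.Client(project="project_id")

export_sql = """
EXPORT DATA
  OPTIONS (
    uri = "gs://bucket_name/dataset/*.csv",
    format = "CSV",
    overwrite = true,
    header = true
  )
AS
SELECT user_id, item_id, score
FROM `project_id.dataset_id.table_name`
"""

# EXPORT DATA runs as a regular query job; result() blocks until it finishes.
client.query(export_sql).result()

The * wildcard in the URI lets BigQuery shard the output into multiple files, which is required for exports larger than 1 GB.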

2. Cloud Storage → TensorFlow Dataset

import tensorflow as tf
train_data = tf.data.experimental.make_csv_dataset(
    file_pattern="gs://bucket_name/dataset/*.csv",
    batch_size=256,
)
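
make_csv_dataset can also separate the label column for you via its label_name argument. A minimal sketch (again assuming score is the label; note that by default the dataset repeats indefinitely, so num_epochs=1 is set here):

import tensorflow as tf

train_data = tf.data.experimental.make_csv_dataset(
    file_pattern="gs://bucket_name/dataset/*.csv",
    batch_size=256,
    label_name="score",  # yields (features_dict, label) pairs ready for model.fit
    num_epochs=1,        # read the files once per epoch instead of repeating forever
)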