More than 3 years have passed since last update.

BigQueryのclusterとpartition カルトクイズその２

BigQuery

Posted at 2021-04-09

その１では、単一の列に対してpartitionやclusterを適用する場合を扱った。
その２では、複数列への適用ケースやpartitionとclusterの併用について取り扱う。

複数列のcluster

taxi_tripsデータセットには、payment_typeと呼ばれる支払い方法を表す列がある。
支払い方法はCashやCreditCardといった11種類がある。

このデータセットに対し、

trip_start_timestamp => payment_typeの順にclusterを作成したcluster_time_payと、
payment_type -> trip_start_timestampの順にclusterを作成したcluster_pay_time
を作成した。

create or replace table taxi_sample.cluster_time_pay 
cluster by trip_start_timestamp, payment_type
as
SELECT 
  *
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`

Q. clusterの指定順の影響

3つのテーブルのアクセスデータ量はどのような順序関係になるか。

SELECT 
  *
FROM taxi_sample.cluster_taxi_trips or cluster_time_pay or cluster_pay_time
where DATE(trip_start_timestamp) = "2014-04-13"

A. 結果は以下の表のようになった。

時刻のみcluster	時刻, 支払いにcluster	支払い, 時刻にcluster
122.4MB	103.6MB	579.6MB

どのケースでも、clusterは活用されている。一方で、支払い, 時刻にclusterをしたケースでは明確にアクセス量が増加している。
これは、特定の時刻範囲のデータが複数のclusterにまたがって配置されてしまうからだろう。

Q. clusterの指定順の影響その２

今度は、payment_typeに条件を指定してみよう。それぞれで、アクセスデータ量はどの程度になるか。

SELECT 
  *
FROM taxi_sample.cluster_time_pay or cluster_pay_time
where payment_type="Mobile"

A. 結果は以下の表のようになった。

時刻, 支払いにcluster	支払い, 時刻にcluster
70.7GB (フルアクセス)	216.1MB

時刻, 支払いにclusterを作成した場合、フルアクセスになってしまう。
これは、時刻でclusteringした時点で十分に細いclusterが作成されており、それ以上分割する余地が無いためと思われる。
つまり、clusteringする際、一つ目の列に連続値のような値の種類が多い列を指定してしまうと、それ以降の列の指定は機能しない。

取る値の種類が少ない列へのclustering

trip_start_timestampはバリエーションがありすぎて、複数列のclusterの議論に向かないことがわかった。
以下のように、曜日と時間に対してclusterを設定する。
曜日 => 時間の順にclusterを設定したものをcluster_dow_hour, 時間 => 曜日の順にclusterを設定したものをcluster_hour_dowとする。

create table taxi_sample.cluster_hour_dow
cluster by hour,dow
as
SELECT 
  extract(dayofweek from trip_start_timestamp) as dow,
  extract(hour from trip_start_timestamp) as hour,
  *
FROM taxi_sample.taxi_trips

Q. 片方の要素でfilterした際の挙動

hourでフィルタした場合、2つのテーブルの間でアクセスするデータ量の大小関係はどうなるか？

select
  *
from taxi_sample.cluster_dow_hour or taxi_sample.cluster_hour_dow
where hour=12

A. 差はない (3.8GB vs 3.8GB)

これは、「Q. clusterの指定順の影響」において、支払い, 時刻でclusteringしたケースの結果と矛盾する。
前回の結果に従うのであれば、cluster_dow_hourのアクセスデータ量は大きくなるべきである。

この違いが生じるのは、dow, hourでデータを分割した場合、分割後の集合が十分大きいためと考えられる。
支払い, 時刻でデータを分割すると、一つのcluster内に複数の時刻を持ったデータが同居する。そのため、特定の時刻のデータにアクセスすると、合わせて必要のない時刻のデータにアクセスしてしまう。
一方で、dow, hourでデータを分割した場合は、分割後の集合が十分大きいため、一つのclusterに同じhourのデータしか含まれないと考えられる。

partitionとclusterの組み合わせ

partitionはclusterで使えない機能をいくつか持っているため、partitionとclusterを組み合わせて使われることは珍しくない。そこで、二つを組み合わせた場合の挙動を確認する。
partitioningはテーブルをサブテーブルに分割し、clusteringはそのサブテーブルをclusterに分割することを踏まえると少しだけわかりやすくなる。・・・が直感に反する挙動もある。

セッティング

以下のように日付にpartitionを作成し、payment_typeにclusterを作成する。

create table taxi_sample.partition_cluster
partition by DATE(trip_start_timestamp)
cluster by payment_type
as 
select
  *
from taxi_sample.taxi_trips

Q. cluster列だけfilterする

以下のようなクエリでアクセスした場合、アクセス量はどの程度になるだろうか。
フルアクセスは約70GB, payment_type単独にclusterを施した場合のアクセス量は216.1MBである。

select
  *
from taxi_sample.partition_cluster
where payment_type="Mobile"

A. 60.8GB
これは、日付で分割した後のサブテーブルが十分に小さく、clusterが十分に機能しないためである。

セッティングやり直し

partitionが細粒度すぎたので、partitionを月別に切り直す。

create table taxi_sample.month_partition_cluster
partition by TIMESTAMP_TRUNC(trip_start_timestamp, MONTH)
cluster by payment_type
as 
select
  *
from taxi_sample.taxi_trips

Q. cluster列だけでのfilter再試行

再びアクセス先テーブルだけ変えて、以下のクエリを発行するとアクセスデータ量はどの程度になるだろう？

select
  *
from taxi_sample.month_partition_cluster
where payment_type="Mobile"

A. 5.1GB

だいぶ減った。
これまでの実験から、複数のcluster列やpartitionとclusterの組み合わせのように、複数の列で分割する場合、1個目以外の列のみでアクセスするには慎重な設計がいることがわかる。

Q. 組み合わせたアクセス

partitionとclusterの双方に条件をつけた場合、partition_clusterとmonth_partition_clusterでアクセス量に差は出るか？出るとすればどの程度の差か？

select
  *
from taxi_sample.partition_cluster or month_partiton_cluster
where TIMESTAMP_TRUNC(trip_start_timestamp, MONTH)= "2014-04-01"
  and payment_type="Unknown"

A. 差は出る。

partition_clusterは867MBにアクセスし、month_partition_clusterは56MBにアクセスする(逆ではない！)。
最初に、粗い粒度でpartitionを作成した方が少ないデータアクセス量となる。
これは、複数列でclusterした際に、最初にバリエーションが多すぎる列でclusterすると生じる問題と同根である。

アクセス先を支払い,時刻の順にclusterを作成したcluster_pay_timeにすると、同様のクエリでアクセスデータ量は52MBとなる。

Q. 組み合わせ指定の分割

これまでは一つのwhere節で条件を指定していたが、with節で指定を分割するとどうなるだろう。
以下の2つのクエリはアクセス量に差が出るか？

with base as (
select
  *
from taxi_sample.month_partition_cluster
where TIMESTAMP_TRUNC(trip_start_timestamp, MONTH)= "2014-04-01"
)
select
  *
from base
where payment_type="Unknown"

with base as (
select
  *
from taxi_sample.month_partition_cluster
where payment_type="Unknown"
)
select
  *
from base
where TIMESTAMP_TRUNC(trip_start_timestamp, MONTH)= "2014-04-01"

A. 差は出ない。
両方とも56MBとなる。この程度であれば、Query Optimizerで同等のクエリになっている。

Q. partition列のみの指定

trip_end_timestamp に条件をつける場合、partitionとclusterを併用したmonth_partition_clusterとpartition_taxi_trips、どちらを用いる方がアクセスするデータ量が少ないか。

select
  *
from taxi_sample.month_partition_cluster or taxi_sample.partition_taxi_trips
where DATE(trip_end_timestamp) = "2014-04-13"

A. month_partition_clusterの方が小さい

month_partition_clusterは4.3GB, partition_taxi_tripsは70.7GBとなる。
これは全く自明ではない！
今回関係のないcluster列の設定がpartition列へ影響を与えているのである。
これは、clusterが設定されている場合、parition列へのアクセスであってもclusterと同じルールで運用され、動的な最適化が有効となることを意味する。

merge編

bigqueryではmerge文を使うことでデータの更新が可能である。
公式ドキュメントではパーティション分割テーブルに対してmergeをする際は、条件文を設定することでアクセスするデータ量を削減できるとある。clusteringでも同様の効果は発動するだろうか？

Q. mergeでの有効性

いきなり複雑なクエリになってしまうが、cluster_taxi_tripsの行を更新している。
ポイントは、trip_start_timestampが2014-04-13になるような行を更新するが、mergeの条件付けはtrip_end_timestampで行っている点である。この際、アクセス量は削減されるか？

merge 
  taxi_sample.cluster_taxi_trips as target
using (  
SELECT
  trip_end_timestamp,
  unique_key,
  trip_miles
FROM taxi_sample.cluster_taxi_trips
where DATE(trip_start_timestamp)="2014-04-13"
) as source
on target.trip_end_timestamp=source.trip_end_timestamp
   and target.unique_key=source.unique_key
when matched then update set target.trip_miles = source.trip_miles

A. 削減される

9.2GBに削減される。(なぜか必要データ量より異様に多い)
なお、cluster_taxi_tripsをpartition_taxi_tripsにした場合はフルアクセス(70.7GB)になる。

Q. 直接的な条件の指定

今度は条件を直接的にtrip_start_timestampにした。アクセスデータ量は削減されるだろうか？

merge 
  taxi_sample.cluster_taxi_trips as target
using (  
SELECT
  trip_start_timestamp,
  unique_key,
  trip_miles
FROM taxi_sample.cluster_taxi_trips
where DATE(trip_start_timestamp)="2014-04-13"
) as source
on target.trip_start_timestamp=source.trip_start_timestamp
   and target.unique_key=source.unique_key
when matched then update set target.trip_miles = source.trip_miles

A. データアクセス利用が減る
しかも、わずか157MBになる。
この9.2GBと157MBの差を説明するために、色々実験したが、mergeのON節でもJOIN時と同様に動的なアクセス最適化は走るようなのだが、なぜか静的な最適化に比べてアクセス量が増えてしまう。mergeのアクセス量最適化には静的な最適化を用いる方が良いだろう。

まとめ

複数列を用いたアクセスの最適化は、容量を意識した設計が必要である。
それを除けば、ここまで読んだ読者であれば、もはやpartitionもclusterも手に取るように挙動が分かるだろう。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up

BigQueryのclusterとpartition カルトクイズ その２