More than 3 years have passed since last update.

【Athena】既存データをPartition分割する

Last updated at 2020-09-25Posted at 2020-09-25

今使っているデータをPartition分割したい

Athenaで利用しているデータが拡大していき、費用削減・速度向上を狙ってPartitionでデータを分割したくなることがあると思います。
初期からデータを分けていた場合は問題がないのですが、何も考えていなかった場合、１ファイルまたは１ディレクトリ（S3にディレクトリはありませんが便宜上ディレクトリと表現します）に全てのデータが入っていることがあります。
この際、CTAS（CREATE TABLE AS SELECT）クエリで分割、というかデータを再構築します。

分割例

少し長いですが以下のようなデータを用意します。

id	name	team_id	year
1	佐藤	2	2019
2	高橋	2	2019
3	近藤	3	2019
4	田中	1	2019
5	山田	2	2019
6	野村	4	2019
7	河合	2	2019
8	五十嵐	1	2019
9	川端	4	2019
10	比屋根	3	2019
11	上田	2	2020
12	村上	4	2020
13	星	5	2020
14	歳内	3	2020
15	西浦	2	2020
16	西田	1	2020
17	中村	4	2020
18	畠山	5	2020
19	真中	3	2020
20	高津	2	2020

このデータを元にAthenaテーブルを作成します。

create.sql

CREATE EXTERNAL TABLE IF NOT EXISTS db_name.partition_test (
  `id` int,
  `name` string,
  `team_id` int,
  `year` int 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
) LOCATION 's3://example_bucket/partition_test/original/'
TBLPROPERTIES ('has_encrypted_data'='false');

これはCSVファイルをただ読み取るだけのcreate文です。
現在はs3://example_bucket/partition_test/original/ディレクトリに全テータを入れています。

このファイルを分割するCTAS文が

ctas.sql

CREATE TABLE db_name.partitioned_test 
WITH (
     format = 'PARQUET',  
     external_location = 's3://example_bucket/partition_test/partitioned/', 
     partitioned_by = ARRAY['team_id', 'year']
) 
AS SELECT id, name, team_id, year FROM partition_test;

これを実行することでlsコマンドでディレクトリ構造を見てみると…

$ aws s3 ls s3://example_bucket/partition_test/partitioned/
                           PRE team_id=1/
                           PRE team_id=2/
                           PRE team_id=3/
                           PRE team_id=4/
                           PRE team_id=5/

$ aws s3 ls s3://example_bucket/partition_test/partitioned/team_id=1/
                           PRE year=2019/
                           PRE year=2020/

このようにteam_idとyearでデータが分割されています。これでwhere句でteam_idかyearを絞るとアクセスするファイルを減らせるため速度・費用の節約を図ることができます。

しかし、今回のようにデータ量が少ない場合は、where句で条件を指定しない場合逆にS3へのリクエスト量が増えるため費用は上がってしまいます。
Athenaはどのような読み取り量も最低でも10MBとなる点、S3の費用も考えてデータを構成する必要があります。
ファイルサイズの最適化でも書きましたが、1ファイルサイズは128MB程度になるのが一番オーバーヘッドが少なくなります。もちろんこのような調整は簡単でないことも多いのですが、気に留めておく必要あります。

エラー

HIVE_COLUMN_ORDER_MISMATCH: Partition keys must be the last columns in the table and in the same order as the table properties: [team_id, year]. You may need to manually clean the data at location

CTAS文を実行した際にこのようなエラーが出る場合があります。これは読んでそのままなのですが、パーティションキーに指定したものはselect文の末尾に順番通り並べる必要があります。
なので今回の例でいうと

SELECT id, name, year, team_id FROM partition_test;

だと上記エラーが出ます。エラー文読めば問題なく対応できるのですが、直感的にこのようなエラーが出ることは予想できないので注意が必要です。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up