[Athena] Fixing Loads That Have Become Slow, Part 2: Trying Partition Projection

Posted at 2023-12-25

Introduction

This is a continuation of "[Athena] Fixing Partition Loading That Has Become Slow, Part 1: Simply Reducing the Logs to Be Loaded".

What I tried this time: introducing partition projection

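Starting from the original DDL, I rewrote TBLPROPERTIES as shown below.
Partition ranges can be written with "..": for example, '01,02,03,04' can be written as 01..04.
There did not seem to be a way to specify "from a given value onward", so for year I provisionally set 2021..2100.
With these settings, a partition load run every few days now completes in a few tens of seconds.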

CREATE EXTERNAL TABLE `app_prod_rails`(
  `date` string COMMENT 'from deserializer', 
  `container_name` string COMMENT 'from deserializer', 
  `source` string COMMENT 'from deserializer', 
  `log` string COMMENT 'from deserializer', 
  `container_id` string COMMENT 'from deserializer', 
  `ec2_instance_id` string COMMENT 'from deserializer', 
  `ecs_cluster` string COMMENT 'from deserializer', 
  `ecs_task_arn` string COMMENT 'from deserializer', 
  `ecs_task_definition` string COMMENT 'from deserializer')
PARTITIONED BY ( 
  `year` string, 
  `month` string, 
  `day` string, 
  `hour` string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
  's3://syslog-app/log/app-prod/rails'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'partitioning'='enabled', 
  'partitioning.day.range'='01..31', 
  'partitioning.day.type'='string', 
  'partitioning.hour.range'='00..23', 
  'partitioning.hour.type'='string', 
  'partitioning.month.range'='01..12', 
  'partitioning.month.type'='string', 
  'partitioning.year.range'='2021..2100', 
  'partitioning.year.type'='string', 
  'storage.location.template'='s3://syslog-app/log/app-prod/rails/year=${year}/month=${month}/day=${day}/hour=${hour}')
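
For comparison, the Athena documentation describes partition projection in terms of projection.* table properties: projection.enabled, then a type for each column (enum, integer, date, or injected), with integer ranges written as min,max and zero padding controlled by a digits property. The following is a minimal sketch of the same table under that scheme, not the configuration used above. The table name app_prod_rails_projected is hypothetical, and using the integer type against string partition columns is an assumption that should be checked against your own data.

```sql
-- Sketch only: same columns and S3 layout as the table above, expressed with
-- the projection.* property names from the Athena documentation.
-- The table name app_prod_rails_projected is hypothetical.
CREATE EXTERNAL TABLE `app_prod_rails_projected`(
  `date` string,
  `container_name` string,
  `source` string,
  `log` string,
  `container_id` string,
  `ec2_instance_id` string,
  `ecs_cluster` string,
  `ecs_task_arn` string,
  `ecs_task_definition` string)
PARTITIONED BY (
  `year` string,
  `month` string,
  `day` string,
  `hour` string)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
LOCATION
  's3://syslog-app/log/app-prod/rails/'
TBLPROPERTIES (
  'projection.enabled'='true',
  -- year: 2021 through 2100, the same provisional upper bound as above
  'projection.year.type'='integer',
  'projection.year.range'='2021,2100',
  -- month, day, hour: digits=2 pads values to 01, 02, ... to match the S3 keys
  'projection.month.type'='integer',
  'projection.month.range'='1,12',
  'projection.month.digits'='2',
  'projection.day.type'='integer',
  'projection.day.range'='1,31',
  'projection.day.digits'='2',
  'projection.hour.type'='integer',
  'projection.hour.range'='0,23',
  'projection.hour.digits'='2',
  'storage.location.template'='s3://syslog-app/log/app-prod/rails/year=${year}/month=${month}/day=${day}/hour=${hour}/');

-- Example query: filtering on every partition column keeps the scan to one hour of logs.
SELECT container_name, log
FROM app_prod_rails_projected
WHERE year = '2023' AND month = '12' AND day = '25' AND hour = '09'
LIMIT 100;
```

According to the Athena documentation, a table with projection enabled computes partition locations at query time instead of reading them from the catalog, so filtering on the partition columns as in the example query is what keeps the number of projected partitions small.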

Conclusion

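I'm glad that log investigation is now fast enough that the wait is barely noticeable.
However, it is still necessary to run MSCK REPAIR TABLE or ALTER TABLE ADD PARTITION at the start of an investigation. Having come this far, I would like the loading itself to happen automatically, so this may continue in Part 3.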
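
For reference, the two manual loading statements mentioned above look roughly like this for the app_prod_rails table. This is a sketch: the partition values (2023-12-25, hour 00) are illustrative, and the LOCATION simply follows the year=/month=/day=/hour= layout from the DDL.

```sql
-- Option 1: scan the table location and register every Hive-style partition found.
-- This gets slower as the number of partitions under the location grows.
MSCK REPAIR TABLE app_prod_rails;

-- Option 2: register only the partition needed for the investigation.
-- The partition values below are illustrative.
ALTER TABLE app_prod_rails ADD IF NOT EXISTS
  PARTITION (year = '2023', month = '12', day = '25', hour = '00')
  LOCATION 's3://syslog-app/log/app-prod/rails/year=2023/month=12/day=25/hour=00/';
```

Option 2 touches only the named partition, which is why it tends to be the quicker choice when the investigation covers a known time window.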
