More than 3 years have passed since last update.

Amazon Athenaでパーティショニングを設定して複数のALBのログを高速＆安価に検索する

Last updated at 2021-01-03Posted at 2020-04-17

【2021/1/3追記】
この記事の内容（特に後半のコード）はAthenaがPartition Projectionをサポートしたことによりほぼ意味のない内容となりました。昔の人(?)はこんな面倒なことをしていた苦労話として読んでください。パーティショニングについての説明などは参考にしていただいて大丈夫です。
https://dev.classmethod.jp/articles/20200627-amazon-athena-partition-projection/
【追記終わり】

この記事の内容

ALBのアクセスログに対してAthenaで検索ができるようにする
複数のALBのアクセスログを串刺しで検索ができるようにする
テーブルにパーティショニングを設定する。検索条件に日付を加えることで不要なスキャンを減らし、Athenaを安価に利用できるようにする

やり方

https://github.com/netebakari/athena_access_log を git clone する
alb.py を実行する
テーブルと alb_logs_2020 というビューができるので好きに検索できる

前置き

読み飛ばしていいので閉じておきます。

Athenaと料金について

Athenaってなに

https://aws.amazon.com/jp/athena/
S3に置いてあるテキストファイル（gzip圧縮可）に対して検索ができるサービス
あらかじめテキストファイルの書式に応じたテーブルを作成しておく必要がある
料金はスキャンしたデータの量に対してかかる

ALBのアクセスログは（事前に設定しておけば）S3にどんどん蓄積されていくので、このログに対して検索をかけることができます。

-- ALBが50xを返したログを検索する
SELECT * FROM alb_accesslogs
WHERE
  time BETWEEN '2020-04-15T03:00:00.000000Z' AND '2020-04-15T12:00:00.000000Z' AND
  elb_status_code LIKE '50%'
LIMIT 100

便利ですね！

パーティショニングってなに

Athenaの料金はスキャンしたデータの量に対して決まります。何も考えずに書いた上記のクエリではS3にある全てのアクセスログに対して検索が走るので、もし大量のログがたまっていた場合、1回のクエリで1000円かかってしまったなどということが起こりかねず大変危険です。

しかしALBのアクセスログは s3://YOUR-BUCKET/ALB-NAME/.../2020/04/15/xxxx.log.gz のように最初から日付ごとにグループ分けして保存されています。つまり検索条件に日付を指定したのであればその日のログだけをスキャンしてほしいわけです。この気持ちを伝える手段がパーティショニングです。

実際の手順

以下、上述のスクリプトの内容説明です。

アクセスログ保存設定

ALBのログはもちろんS3に保存しているでしょうからここは飛ばします。

Athenaでテーブル作成

公式ドキュメントからDDLを持ってきて実行するのですが、ちょっとだけ修正します。

テーブル作成DDL（ALBの場合）

https://docs.aws.amazon.com/ja_jp/athena/latest/ug/application-load-balancer-logs.html

CREATE EXTERNAL TABLE IF NOT EXISTS ALBNAME_accesslogs_2020 (
    type string,
    time string,
    elb string,
    client_ip string,
    client_port int,
    target_ip string,
    target_port int,
    request_processing_time double,
    target_processing_time double,
    response_processing_time double,
    elb_status_code string,
    target_status_code string,
    received_bytes bigint,
    sent_bytes bigint,
    request_verb string,
    request_url string,
    request_proto string,
    user_agent string,
    ssl_cipher string,
    ssl_protocol string,
    target_group_arn string,
    trace_id string,
    domain_name string,
    chosen_cert_arn string,
    matched_rule_priority string,
    request_creation_time string,
    actions_executed string,
    redirect_url string,
    lambda_error_reason string,
    new_field string
)
PARTITIONED BY ( 
    lbname string, 
    year int, 
    month int, 
    day int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\"($| \"[^ ]*\")(.*)')
LOCATION 's3://your-alb-logs-directory/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/'

テーブル作成DDL（Classic ELBの場合）

https://docs.aws.amazon.com/ja_jp/athena/latest/ug/elasticloadbalancer-classic-logs.html

CREATE EXTERNAL TABLE IF NOT EXISTS ALBNAME_accesslogs_2020 (
    timestamp string,
    elb_name string,
    request_ip string,
    request_port int,
    backend_ip string,
    backend_port int,
    request_processing_time double,
    backend_processing_time double,
    client_response_time double,
    elb_response_code string,
    backend_response_code string,
    received_bytes bigint,
    sent_bytes bigint,
    request_verb string,
    url string,
    protocol string,
    user_agent string,
    ssl_cipher string,
    ssl_protocol string
)
PARTITIONED BY ( 
    lbname string, 
    year int, 
    month int, 
    day int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'serialization.format' = '1',
    'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$' )
LOCATION 's3://your_log_bucket/prefix/AWSLogs/AWS_account_ID/elasticloadbalancing/';

テーブルはログ単位・年単位で分けているので、テーブル名の末尾に 2020 を付けています。
- 全てのテーブル・全てのログを1つのテーブルに突っ込むことも可能ではありますが、パーティショニングの数が多くなりすぎるとパフォーマンスの劣化が起きるのでこのようにしています。
- 必要に応じて2019年版も作成してください（2021年版も）。
途中に PARTITIONED BY... という行を追加しています。これがパーティショニングを指示します。
- この部分により、実際のデータには存在しない4つのカラムが追加されます。
- このカラムを検索条件に含めると、Athenaはそれに応じてS3の検索対象を絞り込みます。
必要なALBの数だけテーブルを個別に作成してください。
- 手でやるときはテーブル名、S3のパスの対応を間違えないように注意しましょう。

正しく実行できると、テーブル一覧に ALBNAME_accesslogs_2020 (Partitioned) というテーブルが追加されます。

パーティショニング指定

今このテーブルはデータがどこにあるのか何も知らない状態です。SELECTしても1行も結果が返ってきません。そこで、テーブル全て、1年分全てについて、S3のパスを教えてやる必要があります。

-- 2020/1/1の分
ALTER TABLE ALBNAME_accesslogs_2020 
ADD PARTITION (albname='ALBNAME', year=2020, month=1, day=1)
location 's3://YOUR-BUCKET-NAME/ALB-NAME/AWSLogs/999999999999/elasticloadbalancing/ap-northeast-1/2020/01/01/'

-- 2020/1/2の分
ALTER TABLE ALBNAME_accesslogs_2020 
ADD PARTITION (albname='ALBNAME', year=2020, month=1, day=2)
location 's3://YOUR-BUCKET-NAME/ALB-NAME/AWSLogs/999999999999/elasticloadbalancing/ap-northeast-1/2020/01/02/'

これを1年分繰り返します。レート制限に引っかかると問答無用でクエリは全部エラーになるので適当にウェイトを入れないといけません。

検索してみる

これで検索ができるようになったはずです。

-- 2020/4のログを検索
SELECT * FROM ALBNAME_accesslogs_2020 WHERE lbname = 'ALBNAME' AND year = 2020 AND month = 4 LIMIT 10
-- 2020/4/15のログを検索
SELECT * FROM ALBNAME_accesslogs_2020 WHERE lbname = 'ALBNAME' AND year = 2020 AND month = 4 AND day = 15 LIMIT 10

パーティショニングが効いているので、スキャンしたデータの量は2個目のクエリの方が少なくなります。
albname, year はテーブルによって一意に決まるので本当は指定する必要はありませんが一応書いています。
year, month, day に上下関係はありません。 WHERE year = 2020 AND day = 1 というクエリも有効です。（毎月1日のログが検索対象になります）

Classic ELBのためのビューを作る

Classic ELBのログはALBとは形式が異なります。このままでは串刺し検索をするときに不便なので、無理やりダミーの値を入れてALBのログと形式を揃えた結果を返すビューを作りましょう。

ビュー作成DDL（Classic ELBの場合）

CREATE OR REPLACE VIEW ALBNAME_accesslogs_2020_t AS 
SELECT
    '-' "type",
    "timestamp" "time",
    "elb_name" "elb",
    "request_ip" "client_ip",
    "request_port" "client_port",
    "backend_ip" "target_ip",
    "backend_port" "target_port",
    "request_processing_time" "request_processing_time",
    "backend_processing_time" "target_processing_time",
    "client_response_time" "response_processing_time",
    "elb_response_code" "elb_status_code",
    "backend_response_code" "target_status_code",
    "received_bytes" "received_bytes",
    "sent_bytes" "sent_bytes",
    "request_verb" "request_verb",
    "url" "request_url",
    "protocol" "request_proto",
    "user_agent" "user_agent",
    "ssl_cipher" "ssl_cipher",
    "ssl_protocol" "ssl_protocol",
    '-' "target_group_arn",
    '-' "trace_id",
    '-' "domain_name",
    '-' "chosen_cert_arn",
    '-' "matched_rule_priority",
    '-' "request_creation_time",
    '-' "actions_executed",
    '-' "redirect_url",
    '-' "lambda_error_reason",
    '-' "new_field",
    "lbname",
    "year",
    "month",
    "day"
FROM ALBNAME_accesslogs_2020

全部くっつけたビューを作る

頭を空っぽにして何も考えずに全部UNIONしたビューを作ります。難しいことはすべてAWSがやってくれます。

CREATE OR REPLACE VIEW alb_access_logs_2020 AS 
SELECT * FROM (
  SELECT * FROM alb1_accesslogs_2020
  UNION
  SELECT * FROM alb2_accesslogs_2020
  UNION
  SELECT * FROM alb3_accesslogs_2020
  UNION
  SELECT * FROM alb4_accesslogs_2020_t 
) X

クエリを実行する

これで串刺し検索ができるようになりました。

SELECT * FROM alb_access_logs_2020
WHERE
  lbname IN ('ALB1', 'ALB2') AND
  year = 2020 AND
  month = 4 AND
  day = 15 AND
  elb_status_code LIKE '50%'

不要なパスのログは見に行かないので、最低限 lbname, month あたりを指定すればスキャン対象のデータ量は相当削減されます。

補足

なぜこんな苦行が必要なのか？

https://docs.aws.amazon.com/ja_jp/athena/latest/ug/partitions.html
もしログがS3に s3://..../paramA=1111/paramB=2222/xxx.log のようなパスで入っていればAthenaは MSCK REPAIR TABLE 文一発でパーティショニングを終えてくれます。
しかしALBやCloudTrailのログはそうなっていないのでこのような作業を行うことになります。（そのうち改善されるでしょう……多分）

未来の日付を指定しているが大丈夫？

大丈夫です。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up