
[AWS] Filtering and retrieving ELB logs with Athena

Posted at 2020-10-01

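Introduction

Someone asked me out of the blue, "Could you check the ELB logs in Athena?", so I'm writing down the setup steps and the queries I like as a memo to myself.

Honestly, the official documentation is remarkably easy to read, so you should probably refer to that instead: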
https://docs.aws.amazon.com/ja_jp/athena/latest/ug/application-load-balancer-logs.html

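Creating the table

As the official page above says, running the CREATE query is all it takes to create the table.
There is a "Create table" option on the left side of the console, so you can also do this from the GUI. But the GUI just generates a query from your settings, and you end up copy-pasting the column definitions and the regex anyway, so I think running the query directly from the start is smoother.

For anyone who wants to use the GUI, here are the settings anyway: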

Data Format: Apache Web Logs
Regex: '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"'

If you're running the query directly, the one from the official docs below is all you need:

CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs (
            type string,
            time string,
            elb string,
            client_ip string,
            client_port int,
            target_ip string,
            target_port int,
            request_processing_time double,
            target_processing_time double,
            response_processing_time double,
            elb_status_code string,
            target_status_code string,
            received_bytes bigint,
            sent_bytes bigint,
            request_verb string,
            request_url string,
            request_proto string,
            user_agent string,
            ssl_cipher string,
            ssl_protocol string,
            target_group_arn string,
            trace_id string,
            domain_name string,
            chosen_cert_arn string,
            matched_rule_priority string,
            request_creation_time string,
            actions_executed string,
            redirect_url string,
            lambda_error_reason string,
            target_port_list string,
            target_status_code_list string,
            classification string,
            classification_reason string
            )
            ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
            WITH SERDEPROPERTIES (
            'serialization.format' = '1',
            'input.regex' = 
        '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"')
            LOCATION 's3://your-alb-logs-directory/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/';

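I'm enough of an AWS noob that I didn't understand LOCATION at first, but it turns out you just point it at the S3 path where the ELB logs are stored.

Apparently you can also narrow the path below <REGION>/ down to year/month/day when creating the table.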

I created a database named alb_logs and, under it, gave the tables names like alb_logs_env_api_2020_08_20 (environment name, API name, date).
Name yours however you like.
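A minimal sketch of that layout, in case it helps. The names prod and myapi are hypothetical stand-ins for an environment and an API, and the second statement assumes the table has already been created inside the alb_logs database with the CREATE EXTERNAL TABLE statement above (using the database-qualified table name).

```sql
-- Create the database that groups the per-day tables (name taken from this post).
CREATE DATABASE IF NOT EXISTS alb_logs;

-- Hypothetical table name: "prod" and "myapi" are placeholders.
-- Assumes the table was created as alb_logs.alb_logs_prod_myapi_2020_08_20
-- with the same columns and SerDe settings as the CREATE statement above.
SELECT count(*) AS total_requests
FROM alb_logs.alb_logs_prod_myapi_2020_08_20;
```

As far as I know, the Athena query editor runs one statement at a time, so run these separately.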

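Specifying where queries and results are saved

The official page didn't really cover this, but you can't run a query until you specify a location for saving queries and their results.
Maybe creating a directory like s3://your-alb-logs-directory/AWSLogs/query_result is the way to go? Or would it be fine to go deeper, like <REGION>/2020/08/20/result? One thing to watch with the deeper option: Athena reads every object under a table's LOCATION, so a result directory placed under that prefix would get picked up as log data.
I'm an AWS noob and have no idea what the best practice is here. Someone wiser, please teach me.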

Queries I use (notes to self)

SELECT * works too, but sometimes there are fields you want to see side by side.
So here is a query I like.
request_verb holds the HTTP method.

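-- List requests for which the load balancer returned a 5xx status.
SELECT
    request_verb, time, request_url, elb_status_code
FROM alb_logs  -- your table name (qualify it as database.table if needed)
WHERE regexp_like(elb_status_code, '^5[0-9][0-9]$')
;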

This time I was looking for errors, so I used regexp_like. It was my first time using it.
I also narrowed things down by request_url.
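Narrowing by request_url can look something like the sketch below. It assumes the alb_logs table from the CREATE statement above, with elb_status_code defined as a string; the path /api/v1/users is purely hypothetical, so swap in whatever you are investigating.

```sql
-- 5xx responses for one (hypothetical) path, oldest first.
-- time is an ISO 8601 string, so sorting it as text gives chronological order.
SELECT
    time,
    request_verb,
    request_url,
    elb_status_code,
    target_status_code
FROM alb_logs
WHERE regexp_like(elb_status_code, '^5[0-9][0-9]$')
  AND request_url LIKE '%/api/v1/users%'
ORDER BY time;

-- Quick overview of how many requests ended in each status code.
SELECT
    elb_status_code,
    count(*) AS requests
FROM alb_logs
GROUP BY elb_status_code
ORDER BY requests DESC;
```

Putting target_status_code next to elb_status_code makes it easier to tell whether the error came from the backend or from the load balancer itself.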

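Closing thoughts

A Japanese reference table explaining each column would be nice. I'd like to make one eventually if I have the energy (wishful thinking).
I had added a LIMIT, but when there are zero matching errors the query ends up scanning every record anyway, so it does nothing to keep costs down.
I ended up scanning something like 8 GB. Sad.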

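Still, it's apparently about $5 per TB scanned, so 8 GB works out to roughly 4 cents. Unless you go wild firing off query after query, there's probably not much to worry about.

Queries you've already run, along with their results, can be checked from the history, so make use of that and keep costs down properly.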
