More than 1 year has passed since last update.

ALBのログをAthenaでSQLを叩いて分析する

Last updated at 2023-03-23Posted at 2023-03-20

AthenaでALBログ分析

Athenaの開始方法は以下の公式参照して同じようにやっていけばセットアップ完了する
- https://docs.aws.amazon.com/ja_jp/athena/latest/ug/getting-started.html
自分自身がやった手順をさらに詳細に

データベースを作成

create database データベース名

クエリの結果を出力するS3バケット作成/ワークグループに紐付け

https://docs.aws.amazon.com/ja_jp/athena/latest/ug/querying.html
以前はデフォルトで勝手に出力用のS３バケットが作成されたが今は自分で作成する
ワークグループに紐付け

パーティションテーブルを作成

CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs_20221125 (
            type string,
            time string,
            elb string,
            client_ip string,
            client_port int,
            target_ip string,
            target_port int,
            request_processing_time double,
            target_processing_time double,
            response_processing_time double,
            elb_status_code int,
            target_status_code string,
            received_bytes bigint,
            sent_bytes bigint,
            request_verb string,
            request_url string,
            request_proto string,
            user_agent string,
            ssl_cipher string,
            ssl_protocol string,
            target_group_arn string,
            trace_id string,
            domain_name string,
            chosen_cert_arn string,
            matched_rule_priority string,
            request_creation_time string,
            actions_executed string,
            redirect_url string,
            lambda_error_reason string,
            target_port_list string,
            target_status_code_list string,
            classification string,
            classification_reason string
            )
            PARTITIONED BY
            (
             day STRING
            )
            ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
            WITH SERDEPROPERTIES (
            'serialization.format' = '1',
            'input.regex' = 
        '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) (.*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-_]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"')
            LOCATION 's3://hoge-production-log/alb/AWSLogs/{accountId}/elasticloadbalancing/ap-northeast-1/'
            TBLPROPERTIES
            (
             "projection.enabled" = "true",
             "projection.day.type" = "date",
             "projection.day.range" = "2022/01/01,NOW",
             "projection.day.format" = "yyyy/MM/dd",
             "projection.day.interval" = "1",
             "projection.day.interval.unit" = "DAYS",
             "storage.location.template" = "s3://hoge-production-log/alb/AWSLogs/{accounId}/elasticloadbalancing/ap-northeast-1/${day}"
            )

適宜、ALBログが格納されているS3バケットを指定する。

クエリを試しに叩く

SELECT COUNT(request_verb) AS
 count,
 request_verb,
 client_ip
FROM alb_logs
GROUP BY request_verb, client_ip
LIMIT 5;

tips関連

ALBのログをAthenaでクエリする場合そのままではタイムスタンプ(timeカラム)は文字列型

タイムスタンプのパース

parse_datetime(time, 'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z')

jstにパースしたい

parse_datetime(time, 'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z') AT TIME ZONE '+09:00'

クエリ例

特定の期間における ELB 5xx エラーのpathごとのカウント
- ELB 5xxエラーなので、target（アプリケーション等）からレスポンスがなかった場合（taget_status_codeカラムに「-」が入っていたパターン）も含まれる
- 該当のALBを絞る場合はarnの末尾を使う

arn:aws:elasticloadbalancing:ap-northeast-1:{accountId}:loadbalancer/hoge/huga/123456789

SELECT
    url_extract_host(request_url) as url_host,
    url_extract_path(request_url) as url_path,
    elb_status_code,
    target_status_code,
    count(*)
FROM "alb"."alb_logs_20221125"
WHERE
    elb = '/hoge/huga/123456789' --ALBのarnの末尾を指定
    AND time >= '2022-09-26T00:00' AND time <= '2022-10-09T23:59'  
    AND elb_status_code / 100 = 5 -- status:5xx
GROUP BY
    url_extract_host(request_url) , url_extract_path(request_url), elb_status_code, target_status_code
ORDER BY
    count(*) desc

特定の期間における HTTP 5xx エラーのカウント
- target_status_codeで絞ろうとすると、カラムに「-」が入っている場合があるため（targetからレスがなかったパターン）、クエリで絞れなかったので注意。いいやりかた知っている人がいればぜひ。
特定期間における特定pathの日毎のELB 5××カウント
- サブクエリでtimeカラムの内容をカットしたテーブルを作って対応した

SELECT
    url_path,
    elb_status_code,
    target_status_code,
    time,
    count(*)
FROM (SELECT
          url_extract_path(request_url) as url_path,
          elb_status_code,
          target_status_code,
          SUBSTRING(time, 1, 10) as time
      FROM 
          "alb"."alb_logs_20221125"
      WHERE
          elb = '/hoge/huga/123456789' --ALBのarnの末尾を指定
          AND time >= '2022-09-26T00:00' AND time <= '2022-10-09T23:59'  
          AND elb_status_code / 100 = 5
          AND url_extract_path(request_url) = '/api/v1/hoge/huga'
      )
GROUP BY 
    url_path, elb_status_code, target_status_code, time
ORDER BY
    count(*) asc

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up