More than 3 years have passed since last update.

【AWS】AthenaでCloudFrontのログを集計するときのSQL

Posted at 2020-09-14

調べてもズバリこれといったものが見つけられなかったので自分用の備忘録として。
あるいは同じことを求めてる人が他にいたら参考になれば幸いです。

前提条件

CloudFrontのログが格納されているS3のフォルダがyear=2020、month=09、day=12みたいな感じでパーティションされてること
実際に格納されているパスはこんな感じ↓
s3://mybucket/cloudfront_log/year=2020/month=09/day=12/EXXXXXXXXXXXXX.2020-09-12-00.f96e98c5.gz
- 「パスに"year=2020"とか、"="いれて大丈夫なのか？？」と最初思ったが全然大丈夫である。
Athenaがパーティティションテーブルとして作成されていること

当日のログを時間昇順で全部見る

select *
from (select cast(concat(request_date,' ',request_time) as timestamp) at time zone 'Asia/Tokyo' request_datetime_jst
            ,*
      from cloudfront_log
      )
where 0=0
  and date (concat(year , '-' , month , '-' , day)) between
      current_date - interval '1' day and
      current_date 
  and request_datetime_jst between
      date_trunc('day' , current_timestamp at time zone 'Asia/Tokyo') and
      current_timestamp at time zone 'Asia/Tokyo'
order by
   request_datetime_jst

CloudFrontのログはUTCなので、日本時間（JST）に合わせるとちょっと使いづらい。
なので一度副問い合わせ内でJSTの日時をつくっている。
JSTに合わせると、見るべきパーティションは当日と前日になる
（例：日本時間で2020/9/12の場合、year=2020/month=09/day=11とyear=2020/month=09/day=12の２つのパーティションを見る必要がある）

前日のログを集計する

select count(1) cnt_all
from (select *
            ,concat(request_date,' ',request_time) request_datetime
            ,cast(concat(request_date,' ',request_time) as timestamp) at time zone 'Asia/Tokyo' request_datetime_jst
      from cloudfront_log
      )
where 0=0
  and date (concat(year , '-' , month , '-' , day)) between
      current_date - interval '2' day and
      current_date - interval '1' day
  and request_datetime_jst between 
      date_trunc('day' , current_timestamp at time zone 'Asia/Tokyo' - interval '1' day) and
      date_trunc('day' , current_timestamp at time zone 'Asia/Tokyo' - interval '1' second)

条件の考え方は基本的に↑と同じ。
「前日」になったのに合わせて見るべきパーティションとrequest_date、request_timeの条件を変えただけ。
count(1)は単純に全レコードの件数カウントしてるだけなので、有象無象の全ログが対象になる。
実際にはもう少し具体的な条件、例えばsc_statusの値(=HTTPステータス)とか、cs_uri_stemの内容(=リクエスト先のパス)とかで分類した結果を集計したくなると思う。
例えばsum(case when sc_status = '200' thne 1 else 0 end) as cnt_successとするとHTTPステータスコード200のログの数だけ集計する。

任意の日付のログを集計する

select count(1) cnt_all
from (select *
            ,concat(request_date,' ',request_time) request_datetime
            ,cast(concat(request_date,' ',request_time) as timestamp) at time zone 'Asia/Tokyo' request_datetime_jst
      from cloudfront_log
      )
where 0=0
  -- (1)
  and concat(year , '-' , month , '-' , day) between '2020-08-31' and '2020-09-01'
  -- (2)
  and request_datetime_jst between timestamp '2020-08-31 15:00 UTC' at time zone 'Asia/Tokyo' and timestamp '2020-09-01 14:59 UTC' at time zone 'Asia/Tokyo'

やはり基本的な条件の考え方は↑と同じ。
「任意の日付」の部分だけ文字列で固定で指定してる形。（もっといい感じのやり方はあるのかなー）
(1)では見るべきパーティションを指定している。
見たい日付が2020/09/01なら2020-08-31と2020-09-01になる。
(2)では実際に検索対象にするリクエストの日時をJSTに変換したうえで指定している。
見たい日付が2020/9/1分ならUTCに直すと2020-08-31 15:00～2020-09-01 14:59までがJSTにおける「2020/09/01」になるのでそこを指定している。

おまけ：CloudFrontの移動処理(Lambda)

ランタイムはNode.js。

const AWS = require('aws-sdk');
const s3 = new AWS.S3({region: 'ap-northeast-1'});

const destS3Bucket =  'destbucket';
const destS3PathPrefix = 'cloudfront_log_moved';

exports.handler = async function(event) {
    var responseBody = [];
    for(e of event.Records) {
        var sourceBucket = e.s3.bucket.name; // mybucket
        var sourceKey = e.s3.object.key; // cloudfront_log/EXXXXXXXXXXXXX.2020-08-20-14.78e51f19.gz
        var sourceObject = `${sourceBucket}/${sourceKey}`;
        var originalObjectName = sourceKey.split('/')[1]; // EXXXXXXXXXXXXX.2020-08-20-14.78e51f19.gz
        var destPaths = sourceKey.split('.')[1].split('-'); // [2020,08,20,14]
        var destPathYYYY = `year=${destPaths[0]}`; //  year=2020/
        var destPathMM   = `month=${destPaths[1]}`; // month=08/
        var destPathDD   = `day=${destPaths[2]}`; // day=20/
        var destPath = `${destS3PathPrefix}/${destPathYYYY}/${destPathMM}/${destPathDD}`; // cloudfront_log_moved/year=2020/month=08/day=20/EXXXXXXXXXXXXX.2020-08-20-14.78e51f19.gz
        var destKey = `${destPath}/${originalObjectName}`; 

        console.log(`sourceObject:${sourceObject} , destPath:${destPath} , destKey:${destKey}`);

        var copyObjectParam = {
            Bucket: destS3Bucket,
            Key: destKey,
            CopySource: sourceObject
        };
        var copyResult = await s3.copyObject(copyObjectParam).promise();
        var deleteResult = await s3.deleteObject({Bucket: sourceBucket , Key: sourceKey}).promise();

        responseBody.push({
            copyResult: copyResult,
            deleteResult: deleteResult
        });
    }

    const response = {
        statusCode: 200,
        body: JSON.stringify(responseBody),
    };
    return response;
};

S3へのオブジェクトPUTのイベントはevent.Recordsという配列に含まれるので一応これを基準にループさせている。
でも色々調べてみると0番目を固定で取得している処理→event.Records[0]しか見つからないので、いちいちループさせる必要ないのかもしれない。
アクセス数が少なくて、たまにぽつんと単独でアクセスされる程度なら、多分それでも問題ない（移動漏れは起きない）。
短い時間で連続でPUTされるとまとめて配列になってくる…とかなのかな？
CloudFrontのログファイル名を相手にすることを前提にしているコードなので、それ以外のファイルのPUTを受けると落ちる。。
呼び出し条件として、元バケットでCloudFrontのログが格納されているパスのprefixでも指定しておく。
移動元バケットに対するGetObject+DeleteObject、移動先バケットに対するPutObjectのアクセス権が必要。
ただ、DeleteObjectのほうはS3のLifeCycleに任せる（1日で削除とか）なら設定不要＆↑のコードにおけるawait s3.deleteObject({Bucket: sourceBucket , Key: sourceKey}).promise();も不要。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up