More than 1 year has passed since last update.

AthenaでTGWのフローログをクエリする【備忘】

Posted at 2023-03-13

前提

既にTGWのフローログ設定や、Athenaの設定は完了していること
Parquet形式で格納していること

テーブル作成

Athenaで下記のSQLでクエリを行います。
★部分は環境に合わせて変更を行ってください。（★を忘れずに消してね）

CREATE EXTERNAL TABLE IF NOT EXISTS `tgw_vpc_flow_logs_parquet` (
  `version` int, 
  `resource_type` string, 
  `account_id` string, 
  `tgw_id` string, 
  `tgw_attachment_id` string, 
  `tgw_src_vpc_account_id` string, 
  `tgw_dst_vpc_account_id` string, 
  `tgw_src_vpc_id` string, 
  `tgw_dst_vpc_id` string, 
  `tgw_src_subnet_id` string, 
  `tgw_dst_subnet_id` string, 
  `tgw_src_eni` string, 
  `tgw_dst_eni` string, 
  `tgw_src_az_id` string, 
  `tgw_dst_az_id` string, 
  `tgw_pair_attachment_id` string, 
  `srcaddr` string, 
  `dstaddr` string, 
  `srcport` int, 
  `dstport` int, 
  `protocol` bigint, 
  `packets` bigint, 
  `bytes` bigint, 
  `start` bigint, 
  `end` bigint, 
  `log_status` string, 
  `type` string, 
  `packets_lost_no_route` bigint, 
  `packets_lost_blackhole` bigint, 
  `packets_lost_mtu_exceeded` bigint, 
  `packets_lost_ttl_expired` bigint, 
  `tcp_flags` int, 
  `flow_direction` string, 
  `pkt_src_aws_service` string, 
  `pkt_dst_aws_service` string
)
PARTITIONED BY (
  `aws-account-id` string, `region` string, `day` string
)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  ★'s3://<ログが格納されているバケット名>/AWSLogs/'
TBLPROPERTIES (
    "skip.header.line.count"="1",
    "projection.enabled" = "true",
    "projection.region.type" = "enum",
    "projection.region.values" = "ap-northeast-1",
    "projection.day.type" = "date",
    ★"projection.day.range" = "<取得したい日付の範囲、例：2021/10/01, NOW>", 
    "projection.day.format" = "yyyy/MM/dd",
    'projection.aws-account-id.type'='enum', 
    ★'projection.aws-account-id.values'='<アカウントID>',
    ★"storage.location.template" = "s3://<ログが格納されているバケット名>/AWSLogs/${aws-account-id}/vpcflowlogs/${region}/${day}"
)

終わり

後はSELECT文とか投げれば取れます！
フローログは容量が多いので、日付は短めにとって少しずつクエリすることをお勧めします。

Parquet形式でやってる方がぱっと見当たらなかったので、簡単に記事にしました。
よかったら使ってみてください～

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up