LoginSignup
0
1

More than 3 years have passed since last update.

TreasureData(HiveQL)でIPアドレスの範囲を展開する

Posted at

背景

  • IPアドレスのマスタデータ(レンジ)から、ログに含まれるIPアドレスを突合したかった
  • BETWEENなども試したがうまくいかなかった
    • 処理が重くて終わらない

利用するデータ

今回は以下のようなデータを利用しました。(簡単のため必要なデータに絞ってあります

IPマスタデータ

ip_from ip_to attr
1.33.205.128 1.33.205.135 A
1.179.141.224 1.179.141.239 B
223.29.57.192 223.29.57.255 C

テーブル名を ip_mast としておきます

ログデータ

user_id ipaddress
1 223.29.57.10
2 1.33.205.130
3 1.179.141.238
4 223.29.57.192

テーブル名を user_ip としておきます

やりたいこと

ip_mastuser_ip から以下のようなテーブルを作りたい

user_id ipaddress attr
1 223.29.57.10
2 1.33.205.130 A
3 1.179.141.238 B
4 223.29.57.192 C

処理できなかったクエリ

WITH ip AS (
  SELECT
    CAST(ip_from_array[0] AS DOUBLE) * 16777216 +
    CAST(ip_from_array[1] AS DOUBLE) * 65536 +
    CAST(ip_from_array[2] AS DOUBLE) * 256 +
    CAST(ip_from_array[3] AS DOUBLE) as ip_from_decimal,
    CAST(ip_to_array[0] AS DOUBLE) * 16777216 +
    CAST(ip_to_array[1] AS DOUBLE) * 65536 +
    CAST(ip_to_array[2] AS DOUBLE) * 256 +
    CAST(ip_to_array[3] AS DOUBLE) as ip_to_decimal,
    attr
  FROM
  (
    SELECT
      SPLIT(ip_from, '\\.') AS ip_from_array,
      SPLIT(ip_to, '\\.') AS ip_to_array,
      attr
    FROM ip_mast
  ) t0
)

SELECT
  user_ip.user_id,
  user_ip.ipaddress,
  ip.attr
FROM (
  SELECT
    user_id,
    ipaddress,
    CAST(ipaddress[0] AS DOUBLE) * 16777216 +
    CAST(ipaddress[1] AS DOUBLE) * 65536 +
    CAST(ipaddress[2] AS DOUBLE) * 256 +
    CAST(ipaddress[3] AS DOUBLE) as ipaddress_decimal
  FROM user_ip
) as user_ip JOIN ip
WHERE user_ip.ipaddress_decimal BETWEEN ip.ip_from_decimal AND ip.ip_to_decimal

FULL OUTER JOINして、数値にしたIPアドレスで絞り込む作戦
が、データ量が多いと全然処理が進まず無事死亡

処理できたクエリ

どうにかIPアドレスを範囲から実際のIPアドレスに展開できる方法はないか考えました
救世主、 TD_NUMERIC_RANGEの登場です。

数値にしたIPアドレスに対して、fromからtoまで1つずつインクリメントしたARRAYをLATERAL VIEWで展開します

WITH ip AS (
  SELECT
    ip_decimal,
    attr
  FROM (
    SELECT
      CAST(ip_from_array[0] AS DOUBLE) * 16777216 +
      CAST(ip_from_array[1] AS DOUBLE) * 65536 +
      CAST(ip_from_array[2] AS DOUBLE) * 256 +
      CAST(ip_from_array[3] AS DOUBLE) as ip_from_decimal,
      CAST(ip_to_array[0] AS DOUBLE) * 16777216 +
      CAST(ip_to_array[1] AS DOUBLE) * 65536 +
      CAST(ip_to_array[2] AS DOUBLE) * 256 +
      CAST(ip_to_array[3] AS DOUBLE) as ip_to_decimal,
      attr
    FROM
    (
      SELECT
        SPLIT(ip_from, '\\.') AS ip_from_array,
        SPLIT(ip_to, '\\.') AS ip_to_array,
        attr
      FROM ip_mast
    ) t0
  ) as t1 LATERAL VIEW
     TD_NUMERIC_RANGE(CAST(ip_from_decimal as int),CAST(ip_to_decimal+1 as int),1) ip as ip_decimal
)


SELECT
  user_ip.user_id,
  user_ip.ipaddress,
  ip.attr
FROM (
  SELECT
    user_id,
    ipaddress,
    CAST(ipaddress[0] AS DOUBLE) * 16777216 +
    CAST(ipaddress[1] AS DOUBLE) * 65536 +
    CAST(ipaddress[2] AS DOUBLE) * 256 +
    CAST(ipaddress[3] AS DOUBLE) as ipaddress_decimal
  FROM user_ip
) as user_ip LEFT JOIN ip
ON user_ip.ipaddress_decimal = ip.ip_decimal

それなりのデータ量がありましたが、無事に処理できました

参考

0
1
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
0
1