4
8

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

More than 5 years have passed since last update.

【TreasureData】URLからの部分抽出(HIVE,Presto)

Posted at

##はじめに
アクセスログの解析を行う上でURLから一部だけ抽出する処理ががよくあるので、
その場合に利用できるHIVE、Prestoの関数と利用例をまとめてみました

##HIVEとPrestoそれぞれで使える関数
HIVEは、parse_url
Prestoは、url_extract_***
が関数として用意されている

##HIVE
parse_url(string urlString, string partToExtract [, string keyToExtract])

第一引数にURL、第二引数に抽出したい箇所を入力する
第三引数は特定のクエリパラメータだけを抽出したいときに利用する

第二引数に指定できるのは、
HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO
の8つ

URL:http://user:password@example.com:8888/index.html?query1=aaa&query2=bbb#hash
に対して実行した場合

hive_sample.sql
SELECT
  t1.URL,
  parse_url(t1.URL,'HOST') as HOST,
  parse_url(t1.URL,'PATH') as PATH,
  parse_url(t1.URL,'QUERY') as QUERY,
  parse_url(t1.URL,'QUERY','query1') as QUERY1,
  parse_url(t1.URL,'REF') as REF,
  parse_url(t1.URL,'PROTOCOL') as PROTOCOL,
  parse_url(t1.URL,'AUTHORITY') as AUTHORITY,
  parse_url(t1.URL,'FILE') as FILE,
  parse_url(t1.URL,'USERINFO') as USERINFO
FROM
  (SELECT 'http://user:password@example.com:8888/index.html?query1=aaa&query2=bbb#hash' as URL) t1

↓ 実行結果

Column Value
URL http://user:password@example.com:8888/index.html?query1=aaa&query2=bbb#hash
HOST example.com
PATH /index.html
QUERY query1=aaa&query2=bbb
QUERY1 aaa
REF hash
PROTCOL http
AUTHORITY user:password@example.com:8888
FILE /index.html?query1=aaa&query2=bbb
USERINFO user:password

##Presto

url_extract_host(string urlString)
url_extract_fragment(url)
url_extract_path(url)
url_extract_port(url)
url_extract_protocol(url)
url_extract_query(url)
url_extract_parameter(url, name)

##例えば

サイト内検索クエリを抽出したい場合

■HIVE
parse_url('http://example.com/search?q=book&page=1','QUERY','q')
→ book
■Presto
url_extract_parameter('http://example.com/search?q=book&page=1','q')
→ book

##参考
LanguageManual UDF — Hive Operators and User-Defined Functions (UDFs)
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

6.18. URL Functions — Presto 0.187 Documentation
https://prestodb.io/docs/current/functions/url.html

4
8
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
4
8

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?