LoginSignup
2
0

More than 3 years have passed since last update.

Apache Hive の split 関数でカンマ区切りの文字列を処理する

Posted at

概要

  • Apache Hive の split 関数でカンマ区切りの文字列を処理するサンプルを書く
  • 検証環境: Apache Hadoop 3.3.0 + Apache Hive 3.1.2 + Java 8 (AdoptOpenJDK 1.8.0_265) + macOS Catalina

split 関数でカンマ区切りの文字列を処理する

以下のように foo_table テーブルの foo_data カラムにカンマ区切りのデータがある場合を想定。

hive> set hive.cli.print.header=true;
hive> select * from foo_table;
OK
foo_table.foo_id    foo_table.foo_data
1    aaa,bbb,ccc
2    111,222,333,444,555
Time taken: 0.179 seconds, Fetched: 2 row(s)

split 関数を使うと区切り文字列 (正規表現も指定可能) を指定して文字列の配列 ARRAY を取得することができる。

hive> select split(foo_data, ',') from foo_table;
OK
_c0
["aaa","bbb","ccc"]
["111","222","333","444","555"]
Time taken: 0.124 seconds, Fetched: 2 row(s)

配列の最初の要素には [0] でアクセスできる。

hive> select split(foo_data, ',')[0] from foo_table;
OK
_c0
aaa
111
Time taken: 0.149 seconds, Fetched: 2 row(s)

2つ目の要素を出力。

hive> select split(foo_data, ',')[1] from foo_table;
OK
_c0
bbb
222
Time taken: 0.277 seconds, Fetched: 2 row(s)

size 関数で配列の要素の数を取得できる。

hive> select size(split(foo_data, ',')) from foo_table;
OK
_c0
3
5
Time taken: 0.278 seconds, Fetched: 2 row(s)

size 関数で配列の要素数を取得できるので、これを利用して最後の要素を取得できる。

hive> select split(foo_data, ',')[size(split(foo_data, ',')) - 1] from foo_table;
OK
_c0
ccc
555
Time taken: 0.196 seconds, Fetched: 2 row(s)

concat_ws で区切り文字と配列を指定して文字列を生成できる。

hive> select concat_ws('|', split(foo_data, ',')) from foo_table;
OK
_c0
aaa|bbb|ccc
111|222|333|444|555
Time taken: 0.386 seconds, Fetched: 2 row(s)

explode 関数を使うと配列を行に変換できる。

hive> select explode(split(foo_data, ',')) from foo_table;
OK
col
aaa
bbb
ccc
111
222
333
444
555
Time taken: 0.089 seconds, Fetched: 8 row(s)

explode 関数と lateral view を組み合わせると行に変換した配列と元の入力行を結合できる。

hive> select foo_id, foo_splited_data, foo_data
    > from foo_table
    > lateral view
    > explode(split(foo_data, ',')) foo_table as foo_splited_data;
OK
foo_id    foo_splited_data    foo_data
1    aaa    aaa,bbb,ccc
1    bbb    aaa,bbb,ccc
1    ccc    aaa,bbb,ccc
2    111    111,222,333,444,555
2    222    111,222,333,444,555
2    333    111,222,333,444,555
2    444    111,222,333,444,555
2    555    111,222,333,444,555
Time taken: 0.294 seconds, Fetched: 8 row(s)

公式ドキュメントによる情報

split 関数

  • Return type: array
  • Name(Signature): split(string str, string pat)
  • Description: Splits str around pat (pat is a regular expression).

size 関数

  • Return type: int
  • Name(Signature): size(Map)
  • Description: Returns the number of elements in the map type.

explode 関数

  • Row-set column types: T
  • Name(Signature): explode(ARRAY a)
  • Description: Explodes an array to multiple rows. Returns a row-set with a single column (col), one row for each element from the array.

concat_ws 関数

  • Return type: concat_ws
  • Name(Signature): concat_ws(string SEP, array)
  • Description: Like concat_ws() above, but taking an array of strings. (as of Hive 0.9.0)

array

  • Operator: A[n]
  • Operand types: A is an Array and n is an int
  • Description: Returns the nth element in the array A. The first element has index 0. For example, if A is an array comprising of ['foo', 'bar'] then A[0] returns 'foo' and A[1] returns 'bar'.

lateral view

LanguageManual LateralView - Apache Hive - Apache Software Foundation

Lateral view is used in conjunction with user-defined table generating functions such as explode(). As mentioned in Built-in Table-Generating Functions, a UDTF generates zero or more output rows for each input row. A lateral view first applies the UDTF to each row of base table and then joins resulting output rows to the input rows to form a virtual table having the supplied table alias.

参考資料

2
0
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
2
0