ケース①データタイプが「Array型」のJSONをあつかう場合
次のようなテーブルがあります
0: jdbc:hive2://localhost:10000> desc array_json;
+------------------+----------------+----------+
| col_name | data_type | comment |
+------------------+----------------+----------+
| id | string | |
| group | string | |
| multi_line_json | array<string> | |
+------------------+----------------+----------+
0: jdbc:hive2://localhost:10000> SELECT * FROM array_json;
+----------------+-------------------+----------------------------------------------------+
| array_json.id | array_json.group | array_json.multi_line_json |
+----------------+-------------------+----------------------------------------------------+
| 1 | group1 | ["{\"name\":\"A-ichiro\",\"age\":\"20\"}","{\"name\":\"B-ichiro\",\"age\":\"30\"}","{\"name\":\"C-ichiro\",\"age\":\"40\"}"] |
| 2 | group2 | ["{\"name\":\"A-saku\",\"age\":\"40\"}","{\"name\":\"B-saku\",\"age\":\"20\"}","{\"name\":\"C-saku\",\"age\":\"30\"}"] |
| 3 | group3 | ["{\"name\":\"A-ta\",\"age\":\"30\"}","{\"name\":\"B-ta\",\"age\":\"40\"}","{\"name\":\"C-ta\",\"age\":\"20\"}"] |
+----------------+-------------------+----------------------------------------------------+
3 rows selected (0.147 seconds)
このテーブルに対して、
- (STEP1)
multi_line_json
を複数レコードに展開し - (STEP2) 年齢が30歳(
multi_line_json.age="30"
)のレコードのみ抽出
したいと思います。
つまり、以下のような抽出を行います。
+-----+---------+---------------------------------+
| id | group | single_json |
+-----+---------+---------------------------------+
| 1 | group1 | {"name":"B-ichiro","age":"30"} |
| 2 | group2 | {"name":"C-saku","age":"30"} |
| 3 | group3 | {"name":"A-ta","age":"30"} |
+-----+---------+---------------------------------+
(STEP1)multi_line_json
を複数レコードに展開
Array型のカラムを複数レコードに展開するには、Hive関数のexplode()を使います。
0: jdbc:hive2://localhost:10000> SELECT explode(multi_line_json) as single_json FROM array_json;
+---------------------------------+
| single_json |
+---------------------------------+
| {"name":"A-ichiro","age":"20"} |
| {"name":"B-ichiro","age":"30"} |
| {"name":"C-ichiro","age":"40"} |
| {"name":"A-saku","age":"40"} |
| {"name":"B-saku","age":"20"} |
| {"name":"C-saku","age":"30"} |
| {"name":"A-ta","age":"30"} |
| {"name":"B-ta","age":"40"} |
| {"name":"C-ta","age":"20"} |
+---------------------------------+
9 rows selected (0.452 seconds)
展開したカラム single_json
を元のテーブルのカラムをくっつけるには、LATERAL VIEWを使います
0: jdbc:hive2://localhost:10000> SELECT
. . . . . . . . . . . . . . . .> id
. . . . . . . . . . . . . . . .> ,`group`
. . . . . . . . . . . . . . . .> ,single_json
. . . . . . . . . . . . . . . .> FROM
. . . . . . . . . . . . . . . .> array_json
. . . . . . . . . . . . . . . .> LATERAL VIEW explode(multi_line_json) first_view as single_json;
+-----+---------+---------------------------------+
| id | group | single_json |
+-----+---------+---------------------------------+
| 1 | group1 | {"name":"A-ichiro","age":"20"} |
| 1 | group1 | {"name":"B-ichiro","age":"30"} |
| 1 | group1 | {"name":"C-ichiro","age":"40"} |
| 2 | group2 | {"name":"A-saku","age":"40"} |
| 2 | group2 | {"name":"B-saku","age":"20"} |
| 2 | group2 | {"name":"C-saku","age":"30"} |
| 3 | group3 | {"name":"A-ta","age":"30"} |
| 3 | group3 | {"name":"B-ta","age":"40"} |
| 3 | group3 | {"name":"C-ta","age":"20"} |
+-----+---------+---------------------------------+
9 rows selected (0.094 seconds)
(STEP2) 年齢が30歳(multi_line_json.age="30"
)のレコードのみ抽出
(STEP1)で複数レコードに展開した single_json
の各キー(name, age)の値を取得してみます。
まずは(STEP1)でやったようにLATERAL VIEWでmulti_line_json
を複数レコードに展開し、
LATERAL VIEW explode(multi_line_json) first_view as single_json
さらにもう一つのLATERAL VIEWで、展開されたカラムfirst_view.single_json
に対して、JSONを操作する関数 json_tuple()を使って、name
およびage
の値を取得しています
LATERAL VIEW json_tuple(first_view.single_json, 'name', 'age') second_view as name, age;
実際のSQLはこんな感じです
0: jdbc:hive2://localhost:10000> SELECT
. . . . . . . . . . . . . . . .> id
. . . . . . . . . . . . . . . .> ,`group`
. . . . . . . . . . . . . . . .> ,name
. . . . . . . . . . . . . . . .> ,age
. . . . . . . . . . . . . . . .> FROM
. . . . . . . . . . . . . . . .> array_json
. . . . . . . . . . . . . . . .> LATERAL VIEW explode(multi_line_json) first_view as single_json
. . . . . . . . . . . . . . . .> LATERAL VIEW json_tuple(first_view.single_json, 'name', 'age') second_view as name, age;
+-----+---------+-----------+------+
| id | group | name | age |
+-----+---------+-----------+------+
| 1 | group1 | A-ichiro | 20 |
| 1 | group1 | B-ichiro | 30 |
| 1 | group1 | C-ichiro | 40 |
| 2 | group2 | A-saku | 40 |
| 2 | group2 | B-saku | 20 |
| 2 | group2 | C-saku | 30 |
| 3 | group3 | A-ta | 30 |
| 3 | group3 | B-ta | 40 |
| 3 | group3 | C-ta | 20 |
+-----+---------+-----------+------+
9 rows selected (0.093 seconds)
これまでのやり方を応用して、最終的なSQLは以下の通りとなります
0: jdbc:hive2://localhost:10000> SELECT
. . . . . . . . . . . . . . . .> id
. . . . . . . . . . . . . . . .> ,`group`
. . . . . . . . . . . . . . . .> ,single_json
. . . . . . . . . . . . . . . .> FROM
. . . . . . . . . . . . . . . .> array_json
. . . . . . . . . . . . . . . .> LATERAL VIEW explode(multi_line_json) first_view as single_json
. . . . . . . . . . . . . . . .> LATERAL VIEW json_tuple(first_view.single_json, 'name', 'age') second_view as name, age
. . . . . . . . . . . . . . . .> WHERE
. . . . . . . . . . . . . . . .> age = '30';
+-----+---------+---------------------------------+
| id | group | single_json |
+-----+---------+---------------------------------+
| 1 | group1 | {"name":"B-ichiro","age":"30"} |
| 2 | group2 | {"name":"C-saku","age":"30"} |
| 3 | group3 | {"name":"A-ta","age":"30"} |
+-----+---------+---------------------------------+
3 rows selected (0.103 seconds)
想定どおり抽出できました。
ケース②データタイプが「String型」のJSONをあつかう場合
次のようなテーブルがあります
0: jdbc:hive2://localhost:10000> desc string_json;
+------------------+------------+----------+
| col_name | data_type | comment |
+------------------+------------+----------+
| id | string | |
| group | string | |
| multi_line_json | string | |
+------------------+------------+----------+
0: jdbc:hive2://localhost:10000> SELECT * FROM string_json;
+-----------------+--------------------+----------------------------------------------------+
| string_json.id | string_json.group | string_json.multi_line_json |
+-----------------+--------------------+----------------------------------------------------+
| 1 | group1 | [{"name":"A-ichiro","age":"20"},{"name":"B-ichiro","age":"30"},{"name":"C-ichiro","age":"40"}] |
| 2 | group2 | [{"name":"A-saku","age":"40"},{"name":"B-saku","age":"20"},{"name":"C-saku","age":"30"}] |
| 3 | group3 | [{"name":"A-ta","age":"30"},{"name":"B-ta","age":"40"},{"name":"C-ta","age":"20"}] |
+-----------------+--------------------+----------------------------------------------------+
3 rows selected (0.137 seconds)
このテーブルに対して、ケース①と同様に、
- (STEP1)
multi_line_json
を複数レコードに展開し - (STEP2) 年齢が30歳(
multi_line_json.age="30"
)のレコードのみ抽出
したいと思います。
つまり、以下のような抽出を行います。
+-----+---------+---------------------------------+
| id | group | single_json |
+-----+---------+---------------------------------+
| 1 | group1 | {"name":"B-ichiro","age":"30"} |
| 2 | group2 | {"name":"C-saku","age":"30"} |
| 3 | group3 | {"name":"A-ta","age":"30"} |
+-----+---------+---------------------------------+
(STEP1)multi_line_json
を複数レコードに展開
ケース①と同様、*explode()を試してみますが、やはり怒られました。
Array型では無いとexplode()*は使えないからです。
0: jdbc:hive2://localhost:10000> SELECT explode(multi_line_json) as single_json FROM string_json;
Error: Error while compiling statement: FAILED: UDFArgumentException explode() takes an array or a map as a parameter (state=42000,code=40000)
そこで、次の手順を新たに追加します
- (追加STEP1-1)正規表現で
[{...},{...}]
から{...},{...}
部分だけ抜き出す - (追加STEP1-2)抜き出した
{...},{...}
をカンマ区切りのArray型に変換する - あとはケース①と同様の手順をふむ
(追加STEP1-1)正規表現で[{...},{...}]
から{...},{...}
部分だけ抜き出す
Hiveで正規表現を行う場合は、regexp_extract()関数を使います。
0: jdbc:hive2://localhost:10000> SELECT regexp_extract(multi_line_json,'^\\[(.*)\\]$',1) FROM string_json;
+----------------------------------------------------+
| _c0 |
+----------------------------------------------------+
| {"name":"A-ichiro","age":"20"},{"name":"B-ichiro","age":"30"},{"name":"C-ichiro","age":"40"} |
| {"name":"A-saku","age":"40"},{"name":"B-saku","age":"20"},{"name":"C-saku","age":"30"} |
| {"name":"A-ta","age":"30"},{"name":"B-ta","age":"40"},{"name":"C-ta","age":"20"} |
+----------------------------------------------------+
3 rows selected (0.135 seconds)
(追加STEP1-2)抜き出した{...},{...}
をカンマ区切りのArray型に変換する
指定文字区切りのArray型に変換する場合は、split()を使用します。
*split()*の第2引数に(?<=},)
を指定して、},
区切りのArray型に変換します。
0: jdbc:hive2://localhost:10000> SELECT split(regexp_extract(multi_line_json,'^\\[(.*)\\]$',1), '(?<=},)') FROM string_json;
+----------------------------------------------------+
| _c0 |
+----------------------------------------------------+
| ["{\"name\":\"A-ichiro\",\"age\":\"20\"},","{\"name\":\"B-ichiro\",\"age\":\"30\"},","{\"name\":\"C-ichiro\",\"age\":\"40\"}"] |
| ["{\"name\":\"A-saku\",\"age\":\"40\"},","{\"name\":\"B-saku\",\"age\":\"20\"},","{\"name\":\"C-saku\",\"age\":\"30\"}"] |
| ["{\"name\":\"A-ta\",\"age\":\"30\"},","{\"name\":\"B-ta\",\"age\":\"40\"},","{\"name\":\"C-ta\",\"age\":\"20\"}"] |
+----------------------------------------------------+
3 rows selected (0.12 seconds)
あとはケース①と同様の手順をふむ
Array型にできればあとはケース①と同じです。
*explode()*で複数レコードに展開し、LATERAL VIEWを使って展開したカラム single_json
を元のテーブルのカラムをくっつけます
0: jdbc:hive2://localhost:10000> SELECT
. . . . . . . . . . . . . . . .> id
. . . . . . . . . . . . . . . .> ,`group`
. . . . . . . . . . . . . . . .> , single_json
. . . . . . . . . . . . . . . .> FROM
. . . . . . . . . . . . . . . .> string_json
. . . . . . . . . . . . . . . .> LATERAL VIEW explode(split(regexp_extract(multi_line_json,'^\\[(.*)\\]$',1), '(?<=},)')) first_view as single_json;
+-----+---------+----------------------------------+
| id | group | single_json |
+-----+---------+----------------------------------+
| 1 | group1 | {"name":"A-ichiro","age":"20"}, |
| 1 | group1 | {"name":"B-ichiro","age":"30"}, |
| 1 | group1 | {"name":"C-ichiro","age":"40"} |
| 2 | group2 | {"name":"A-saku","age":"40"}, |
| 2 | group2 | {"name":"B-saku","age":"20"}, |
| 2 | group2 | {"name":"C-saku","age":"30"} |
| 3 | group3 | {"name":"A-ta","age":"30"}, |
| 3 | group3 | {"name":"B-ta","age":"40"}, |
| 3 | group3 | {"name":"C-ta","age":"20"} |
+-----+---------+----------------------------------+
9 rows selected (0.067 seconds)
(STEP2) 年齢が30歳(multi_line_json.age="30"
)のレコードのみ抽出
(STEP2)もケース①と同じやり方で実現できます。
0: jdbc:hive2://localhost:10000> SELECT
. . . . . . . . . . . . . . . .> id
. . . . . . . . . . . . . . . .> ,`group`
. . . . . . . . . . . . . . . .> ,single_json
. . . . . . . . . . . . . . . .> FROM
. . . . . . . . . . . . . . . .> string_json
. . . . . . . . . . . . . . . .> LATERAL VIEW explode(split(regexp_extract(multi_line_json,'^\\[(.*)\\]$',1), '(?<=},)')) first_view as single_json
. . . . . . . . . . . . . . . .> LATERAL VIEW json_tuple(first_view.single_json, 'name', 'age') second_view as name, age
. . . . . . . . . . . . . . . .> WHERE
. . . . . . . . . . . . . . . .> age = '30';
+-----+---------+----------------------------------+
| id | group | single_json |
+-----+---------+----------------------------------+
| 1 | group1 | {"name":"B-ichiro","age":"30"}, |
| 2 | group2 | {"name":"C-saku","age":"30"} |
| 3 | group3 | {"name":"A-ta","age":"30"}, |
+-----+---------+----------------------------------+
3 rows selected (0.074 seconds)
おわりに
もっと簡単な方法はあるのだろうか
おまけ
今回の検証を行う上での準備を追記しておきます
Hive環境構築
検証するにあたり、まずはHive環境(Docker)を構築します。
ちなみに、Dokcerイメージはこちらを使います。
- gitより取得しコンテナを起動
$ git clone https://github.com/big-data-europe/docker-hive
$ cd docker-hive
$ docker-compose up -d
- コンテナに入ってちゃんと動いているか確認
$ docker-compose exec hive-server bash
$ /opt/hive/bin/beeline -u jdbc:hive2://localhost:10000
SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://localhost:10000
Connected to: Apache Hive (version 2.3.2)
Driver: Hive JDBC (version 2.3.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.2 by Apache Hive
0: jdbc:hive2://localhost:10000>
beelineからhiveserverに接続できました。
大丈夫そうですね。
では早速検証用のテーブルを作っていきます。
データの準備
- テーブル作成
0: jdbc:hive2://localhost:10000> CREATE TABLE array_json (id STRING, `group` STRING, multi_line_json ARRAY<STRING>);
0: jdbc:hive2://localhost:10000> CREATE TABLE string_json (id STRING, `group` STRING, multi_line_json STRING);
0: jdbc:hive2://localhost:10000> show tables;
+--------------+
| tab_name |
+--------------+
| array_json |
| string_json |
+--------------+
2 rows selected (0.032 seconds)
- データ作成
0: jdbc:hive2://localhost:10000> INSERT INTO array_json
. . . . . . . . . . . . . . . .> SELECT '1', 'group1', ARRAY('{"name":"A-ichiro","age":"20"}','{"name":"B-ichiro","age":"30"}','{"name":"C-ichiro","age":"40"}');
0: jdbc:hive2://localhost:10000> INSERT INTO array_json
. . . . . . . . . . . . . . . .> SELECT '2', 'group2', ARRAY('{"name":"A-saku","age":"40"}','{"name":"B-saku","age":"20"}','{"name":"C-saku","age":"30"}');
0: jdbc:hive2://localhost:10000> INSERT INTO array_json
. . . . . . . . . . . . . . . .> SELECT '3', 'group3', ARRAY('{"name":"A-ta","age":"30"}','{"name":"B-ta","age":"40"}','{"name":"C-ta","age":"20"}');
0: jdbc:hive2://localhost:10000> SELECT * FROM array_json;
+----------------+-------------------+----------------------------------------------------+
| array_json.id | array_json.group | array_json.multi_line_json |
+----------------+-------------------+----------------------------------------------------+
| 1 | group1 | ["{\"name\":\"A-ichiro\",\"age\":\"20\"}","{\"name\":\"B-ichiro\",\"age\":\"30\"}","{\"name\":\"C-ichiro\",\"age\":\"40\"}"] |
| 2 | group2 | ["{\"name\":\"A-saku\",\"age\":\"40\"}","{\"name\":\"B-saku\",\"age\":\"20\"}","{\"name\":\"C-saku\",\"age\":\"30\"}"] |
| 3 | group3 | ["{\"name\":\"A-ta\",\"age\":\"30\"}","{\"name\":\"B-ta\",\"age\":\"40\"}","{\"name\":\"C-ta\",\"age\":\"20\"}"] |
+----------------+-------------------+----------------------------------------------------+
3 rows selected (0.2 seconds)
0: jdbc:hive2://localhost:10000> INSERT INTO string_json
. . . . . . . . . . . . . . . .> VALUES
. . . . . . . . . . . . . . . .> ('1','group1','[{"name":"A-ichiro","age":"20"},{"name":"B-ichiro","age":"30"},{"name":"C-ichiro","age":"40"}]')
. . . . . . . . . . . . . . . .> ,('2','group2','[{"name":"A-saku","age":"40"},{"name":"B-saku","age":"20"},{"name":"C-saku","age":"30"}]')
. . . . . . . . . . . . . . . .> ,('3','group3','["name":"A-ta","age":"30"},{"name":"B-ta","age":"40"},{"name":"C-ta","age":"20"}]');
0: jdbc:hive2://localhost:10000> SELECT * FROM string_json;
+-----------------+--------------------+----------------------------------------------------+
| string_json.id | string_json.group | string_json.multi_line_json |
+-----------------+--------------------+----------------------------------------------------+
| 1 | group1 | [{"name":"A-ichiro","age":"20"},{"name":"B-ichiro","age":"30"},{"name":"C-ichiro","age":"40"}] |
| 2 | group2 | [{"name":"A-saku","age":"40"},{"name":"B-saku","age":"20"},{"name":"C-saku","age":"30"}] |
| 3 | group3 | ["name":"A-ta","age":"30"},{"name":"B-ta","age":"40"},{"name":"C-ta","age":"20"}] |
+-----------------+--------------------+----------------------------------------------------+
3 rows selected (0.164 seconds)