ケース①データタイプが「Array型」のJSONをあつかう場合
次のようなテーブルがあります
0: jdbc:hive2://localhost:10000> desc array_json;
+------------------+----------------+----------+
|     col_name     |   data_type    | comment  |
+------------------+----------------+----------+
| id               | string         |          |
| group            | string         |          |
| multi_line_json  | array<string>  |          |
+------------------+----------------+----------+
0: jdbc:hive2://localhost:10000> SELECT * FROM array_json;
+----------------+-------------------+----------------------------------------------------+
| array_json.id  | array_json.group  |             array_json.multi_line_json             |
+----------------+-------------------+----------------------------------------------------+
| 1              | group1            | ["{\"name\":\"A-ichiro\",\"age\":\"20\"}","{\"name\":\"B-ichiro\",\"age\":\"30\"}","{\"name\":\"C-ichiro\",\"age\":\"40\"}"] |
| 2              | group2            | ["{\"name\":\"A-saku\",\"age\":\"40\"}","{\"name\":\"B-saku\",\"age\":\"20\"}","{\"name\":\"C-saku\",\"age\":\"30\"}"] |
| 3              | group3            | ["{\"name\":\"A-ta\",\"age\":\"30\"}","{\"name\":\"B-ta\",\"age\":\"40\"}","{\"name\":\"C-ta\",\"age\":\"20\"}"] |
+----------------+-------------------+----------------------------------------------------+
3 rows selected (0.147 seconds)
このテーブルに対して、
- (STEP1) multi_line_jsonを複数レコードに展開し
- (STEP2) 年齢が30歳(multi_line_json.age="30")のレコードのみ抽出
したいと思います。
つまり、以下のような抽出を行います。
+-----+---------+---------------------------------+
| id  |  group  |           single_json           |
+-----+---------+---------------------------------+
| 1   | group1  | {"name":"B-ichiro","age":"30"}  |
| 2   | group2  | {"name":"C-saku","age":"30"}    |
| 3   | group3  | {"name":"A-ta","age":"30"}      |
+-----+---------+---------------------------------+
(STEP1)multi_line_jsonを複数レコードに展開
Array型のカラムを複数レコードに展開するには、Hive関数のexplode()を使います。
0: jdbc:hive2://localhost:10000> SELECT explode(multi_line_json) as single_json FROM array_json;
+---------------------------------+
|           single_json           |
+---------------------------------+
| {"name":"A-ichiro","age":"20"}  |
| {"name":"B-ichiro","age":"30"}  |
| {"name":"C-ichiro","age":"40"}  |
| {"name":"A-saku","age":"40"}    |
| {"name":"B-saku","age":"20"}    |
| {"name":"C-saku","age":"30"}    |
| {"name":"A-ta","age":"30"}      |
| {"name":"B-ta","age":"40"}      |
| {"name":"C-ta","age":"20"}      |
+---------------------------------+
9 rows selected (0.452 seconds)
展開したカラム single_json を元のテーブルのカラムをくっつけるには、LATERAL VIEWを使います
0: jdbc:hive2://localhost:10000> SELECT
. . . . . . . . . . . . . . . .>   id
. . . . . . . . . . . . . . . .>   ,`group`
. . . . . . . . . . . . . . . .>   ,single_json
. . . . . . . . . . . . . . . .> FROM
. . . . . . . . . . . . . . . .>   array_json
. . . . . . . . . . . . . . . .>   LATERAL VIEW explode(multi_line_json) first_view as single_json;
+-----+---------+---------------------------------+
| id  |  group  |           single_json           |
+-----+---------+---------------------------------+
| 1   | group1  | {"name":"A-ichiro","age":"20"}  |
| 1   | group1  | {"name":"B-ichiro","age":"30"}  |
| 1   | group1  | {"name":"C-ichiro","age":"40"}  |
| 2   | group2  | {"name":"A-saku","age":"40"}    |
| 2   | group2  | {"name":"B-saku","age":"20"}    |
| 2   | group2  | {"name":"C-saku","age":"30"}    |
| 3   | group3  | {"name":"A-ta","age":"30"}      |
| 3   | group3  | {"name":"B-ta","age":"40"}      |
| 3   | group3  | {"name":"C-ta","age":"20"}      |
+-----+---------+---------------------------------+
9 rows selected (0.094 seconds)
(STEP2) 年齢が30歳(multi_line_json.age="30")のレコードのみ抽出
(STEP1)で複数レコードに展開した single_json の各キー(name, age)の値を取得してみます。
まずは(STEP1)でやったようにLATERAL VIEWでmulti_line_jsonを複数レコードに展開し、
LATERAL VIEW explode(multi_line_json) first_view as single_json
さらにもう一つのLATERAL VIEWで、展開されたカラムfirst_view.single_jsonに対して、JSONを操作する関数 json_tuple()を使って、nameおよびageの値を取得しています
LATERAL VIEW json_tuple(first_view.single_json, 'name', 'age') second_view as name, age; 
実際のSQLはこんな感じです
0: jdbc:hive2://localhost:10000> SELECT
. . . . . . . . . . . . . . . .>   id
. . . . . . . . . . . . . . . .>   ,`group`
. . . . . . . . . . . . . . . .>   ,name
. . . . . . . . . . . . . . . .>   ,age
. . . . . . . . . . . . . . . .> FROM
. . . . . . . . . . . . . . . .>   array_json
. . . . . . . . . . . . . . . .>   LATERAL VIEW explode(multi_line_json) first_view as single_json
. . . . . . . . . . . . . . . .>   LATERAL VIEW json_tuple(first_view.single_json, 'name', 'age') second_view as name, age;
+-----+---------+-----------+------+
| id  |  group  |   name    | age  |
+-----+---------+-----------+------+
| 1   | group1  | A-ichiro  | 20   |
| 1   | group1  | B-ichiro  | 30   |
| 1   | group1  | C-ichiro  | 40   |
| 2   | group2  | A-saku    | 40   |
| 2   | group2  | B-saku    | 20   |
| 2   | group2  | C-saku    | 30   |
| 3   | group3  | A-ta      | 30   |
| 3   | group3  | B-ta      | 40   |
| 3   | group3  | C-ta      | 20   |
+-----+---------+-----------+------+
9 rows selected (0.093 seconds)
これまでのやり方を応用して、最終的なSQLは以下の通りとなります
0: jdbc:hive2://localhost:10000> SELECT
. . . . . . . . . . . . . . . .>   id
. . . . . . . . . . . . . . . .>   ,`group`
. . . . . . . . . . . . . . . .>   ,single_json
. . . . . . . . . . . . . . . .> FROM
. . . . . . . . . . . . . . . .>   array_json
. . . . . . . . . . . . . . . .>   LATERAL VIEW explode(multi_line_json) first_view as single_json
. . . . . . . . . . . . . . . .>   LATERAL VIEW json_tuple(first_view.single_json, 'name', 'age') second_view as name, age
. . . . . . . . . . . . . . . .> WHERE
. . . . . . . . . . . . . . . .>   age = '30';
+-----+---------+---------------------------------+
| id  |  group  |           single_json           |
+-----+---------+---------------------------------+
| 1   | group1  | {"name":"B-ichiro","age":"30"}  |
| 2   | group2  | {"name":"C-saku","age":"30"}    |
| 3   | group3  | {"name":"A-ta","age":"30"}      |
+-----+---------+---------------------------------+
3 rows selected (0.103 seconds)
想定どおり抽出できました。
ケース②データタイプが「String型」のJSONをあつかう場合
次のようなテーブルがあります
0: jdbc:hive2://localhost:10000> desc string_json;
+------------------+------------+----------+
|     col_name     | data_type  | comment  |
+------------------+------------+----------+
| id               | string     |          |
| group            | string     |          |
| multi_line_json  | string     |          |
+------------------+------------+----------+
0: jdbc:hive2://localhost:10000> SELECT * FROM string_json;
+-----------------+--------------------+----------------------------------------------------+
| string_json.id  | string_json.group  |            string_json.multi_line_json             |
+-----------------+--------------------+----------------------------------------------------+
| 1               | group1             | [{"name":"A-ichiro","age":"20"},{"name":"B-ichiro","age":"30"},{"name":"C-ichiro","age":"40"}] |
| 2               | group2             | [{"name":"A-saku","age":"40"},{"name":"B-saku","age":"20"},{"name":"C-saku","age":"30"}] |
| 3               | group3             | [{"name":"A-ta","age":"30"},{"name":"B-ta","age":"40"},{"name":"C-ta","age":"20"}] |
+-----------------+--------------------+----------------------------------------------------+
3 rows selected (0.137 seconds)
このテーブルに対して、ケース①と同様に、
- (STEP1) multi_line_jsonを複数レコードに展開し
- (STEP2) 年齢が30歳(multi_line_json.age="30")のレコードのみ抽出
したいと思います。
つまり、以下のような抽出を行います。
+-----+---------+---------------------------------+
| id  |  group  |           single_json           |
+-----+---------+---------------------------------+
| 1   | group1  | {"name":"B-ichiro","age":"30"}  |
| 2   | group2  | {"name":"C-saku","age":"30"}    |
| 3   | group3  | {"name":"A-ta","age":"30"}      |
+-----+---------+---------------------------------+
(STEP1)multi_line_jsonを複数レコードに展開
ケース①と同様、*explode()を試してみますが、やはり怒られました。
Array型では無いとexplode()*は使えないからです。
0: jdbc:hive2://localhost:10000> SELECT explode(multi_line_json) as single_json FROM string_json;
Error: Error while compiling statement: FAILED: UDFArgumentException explode() takes an array or a map as a parameter (state=42000,code=40000)
そこで、次の手順を新たに追加します
- (追加STEP1-1)正規表現で[{...},{...}]から{...},{...}部分だけ抜き出す
- (追加STEP1-2)抜き出した{...},{...}をカンマ区切りのArray型に変換する
- あとはケース①と同様の手順をふむ
(追加STEP1-1)正規表現で[{...},{...}]から{...},{...}部分だけ抜き出す
Hiveで正規表現を行う場合は、regexp_extract()関数を使います。
0: jdbc:hive2://localhost:10000> SELECT regexp_extract(multi_line_json,'^\\[(.*)\\]$',1) FROM string_json;
+----------------------------------------------------+
|                        _c0                         |
+----------------------------------------------------+
| {"name":"A-ichiro","age":"20"},{"name":"B-ichiro","age":"30"},{"name":"C-ichiro","age":"40"} |
| {"name":"A-saku","age":"40"},{"name":"B-saku","age":"20"},{"name":"C-saku","age":"30"} |
| {"name":"A-ta","age":"30"},{"name":"B-ta","age":"40"},{"name":"C-ta","age":"20"} |
+----------------------------------------------------+
3 rows selected (0.135 seconds)
(追加STEP1-2)抜き出した{...},{...}をカンマ区切りのArray型に変換する
指定文字区切りのArray型に変換する場合は、split()を使用します。
*split()*の第2引数に(?<=},)を指定して、},区切りのArray型に変換します。
0: jdbc:hive2://localhost:10000> SELECT split(regexp_extract(multi_line_json,'^\\[(.*)\\]$',1), '(?<=},)') FROM string_json;
+----------------------------------------------------+
|                        _c0                         |
+----------------------------------------------------+
| ["{\"name\":\"A-ichiro\",\"age\":\"20\"},","{\"name\":\"B-ichiro\",\"age\":\"30\"},","{\"name\":\"C-ichiro\",\"age\":\"40\"}"] |
| ["{\"name\":\"A-saku\",\"age\":\"40\"},","{\"name\":\"B-saku\",\"age\":\"20\"},","{\"name\":\"C-saku\",\"age\":\"30\"}"] |
| ["{\"name\":\"A-ta\",\"age\":\"30\"},","{\"name\":\"B-ta\",\"age\":\"40\"},","{\"name\":\"C-ta\",\"age\":\"20\"}"] |
+----------------------------------------------------+
3 rows selected (0.12 seconds)
あとはケース①と同様の手順をふむ
Array型にできればあとはケース①と同じです。
*explode()*で複数レコードに展開し、LATERAL VIEWを使って展開したカラム single_json を元のテーブルのカラムをくっつけます
0: jdbc:hive2://localhost:10000> SELECT
. . . . . . . . . . . . . . . .>   id
. . . . . . . . . . . . . . . .>   ,`group`
. . . . . . . . . . . . . . . .>   , single_json
. . . . . . . . . . . . . . . .> FROM
. . . . . . . . . . . . . . . .>   string_json
. . . . . . . . . . . . . . . .>   LATERAL VIEW explode(split(regexp_extract(multi_line_json,'^\\[(.*)\\]$',1), '(?<=},)')) first_view as single_json;
+-----+---------+----------------------------------+
| id  |  group  |           single_json            |
+-----+---------+----------------------------------+
| 1   | group1  | {"name":"A-ichiro","age":"20"},  |
| 1   | group1  | {"name":"B-ichiro","age":"30"},  |
| 1   | group1  | {"name":"C-ichiro","age":"40"}   |
| 2   | group2  | {"name":"A-saku","age":"40"},    |
| 2   | group2  | {"name":"B-saku","age":"20"},    |
| 2   | group2  | {"name":"C-saku","age":"30"}     |
| 3   | group3  | {"name":"A-ta","age":"30"},      |
| 3   | group3  | {"name":"B-ta","age":"40"},      |
| 3   | group3  | {"name":"C-ta","age":"20"}       |
+-----+---------+----------------------------------+
9 rows selected (0.067 seconds)
(STEP2) 年齢が30歳(multi_line_json.age="30")のレコードのみ抽出
(STEP2)もケース①と同じやり方で実現できます。
0: jdbc:hive2://localhost:10000> SELECT
. . . . . . . . . . . . . . . .>   id
. . . . . . . . . . . . . . . .>   ,`group`
. . . . . . . . . . . . . . . .>   ,single_json
. . . . . . . . . . . . . . . .> FROM
. . . . . . . . . . . . . . . .>   string_json
. . . . . . . . . . . . . . . .>   LATERAL VIEW explode(split(regexp_extract(multi_line_json,'^\\[(.*)\\]$',1), '(?<=},)')) first_view as single_json
. . . . . . . . . . . . . . . .>   LATERAL VIEW json_tuple(first_view.single_json, 'name', 'age') second_view as name, age
. . . . . . . . . . . . . . . .> WHERE
. . . . . . . . . . . . . . . .>   age = '30';
+-----+---------+----------------------------------+
| id  |  group  |           single_json            |
+-----+---------+----------------------------------+
| 1   | group1  | {"name":"B-ichiro","age":"30"},  |
| 2   | group2  | {"name":"C-saku","age":"30"}     |
| 3   | group3  | {"name":"A-ta","age":"30"},      |
+-----+---------+----------------------------------+
3 rows selected (0.074 seconds)
おわりに
もっと簡単な方法はあるのだろうか
おまけ
今回の検証を行う上での準備を追記しておきます
Hive環境構築
検証するにあたり、まずはHive環境(Docker)を構築します。
ちなみに、Dokcerイメージはこちらを使います。
- gitより取得しコンテナを起動
$ git clone https://github.com/big-data-europe/docker-hive
$ cd docker-hive
$ docker-compose up -d
- コンテナに入ってちゃんと動いているか確認
$ docker-compose exec hive-server bash
$ /opt/hive/bin/beeline -u jdbc:hive2://localhost:10000
SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://localhost:10000
Connected to: Apache Hive (version 2.3.2)
Driver: Hive JDBC (version 2.3.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.2 by Apache Hive
0: jdbc:hive2://localhost:10000> 
beelineからhiveserverに接続できました。
大丈夫そうですね。
では早速検証用のテーブルを作っていきます。
データの準備
- テーブル作成
0: jdbc:hive2://localhost:10000> CREATE TABLE array_json (id STRING, `group` STRING, multi_line_json ARRAY<STRING>);
0: jdbc:hive2://localhost:10000> CREATE TABLE string_json (id STRING, `group` STRING, multi_line_json STRING);
0: jdbc:hive2://localhost:10000> show tables;
+--------------+
|   tab_name   |
+--------------+
| array_json   |
| string_json  |
+--------------+
2 rows selected (0.032 seconds)
- データ作成
0: jdbc:hive2://localhost:10000> INSERT INTO array_json
. . . . . . . . . . . . . . . .> SELECT '1', 'group1', ARRAY('{"name":"A-ichiro","age":"20"}','{"name":"B-ichiro","age":"30"}','{"name":"C-ichiro","age":"40"}');
0: jdbc:hive2://localhost:10000> INSERT INTO array_json
. . . . . . . . . . . . . . . .> SELECT '2', 'group2', ARRAY('{"name":"A-saku","age":"40"}','{"name":"B-saku","age":"20"}','{"name":"C-saku","age":"30"}');
0: jdbc:hive2://localhost:10000> INSERT INTO array_json
. . . . . . . . . . . . . . . .> SELECT '3', 'group3', ARRAY('{"name":"A-ta","age":"30"}','{"name":"B-ta","age":"40"}','{"name":"C-ta","age":"20"}');
0: jdbc:hive2://localhost:10000> SELECT * FROM array_json;
+----------------+-------------------+----------------------------------------------------+
| array_json.id  | array_json.group  |             array_json.multi_line_json             |
+----------------+-------------------+----------------------------------------------------+
| 1              | group1            | ["{\"name\":\"A-ichiro\",\"age\":\"20\"}","{\"name\":\"B-ichiro\",\"age\":\"30\"}","{\"name\":\"C-ichiro\",\"age\":\"40\"}"] |
| 2              | group2            | ["{\"name\":\"A-saku\",\"age\":\"40\"}","{\"name\":\"B-saku\",\"age\":\"20\"}","{\"name\":\"C-saku\",\"age\":\"30\"}"] |
| 3              | group3            | ["{\"name\":\"A-ta\",\"age\":\"30\"}","{\"name\":\"B-ta\",\"age\":\"40\"}","{\"name\":\"C-ta\",\"age\":\"20\"}"] |
+----------------+-------------------+----------------------------------------------------+
3 rows selected (0.2 seconds)
0: jdbc:hive2://localhost:10000> INSERT INTO string_json
. . . . . . . . . . . . . . . .> VALUES
. . . . . . . . . . . . . . . .>  ('1','group1','[{"name":"A-ichiro","age":"20"},{"name":"B-ichiro","age":"30"},{"name":"C-ichiro","age":"40"}]')
. . . . . . . . . . . . . . . .> ,('2','group2','[{"name":"A-saku","age":"40"},{"name":"B-saku","age":"20"},{"name":"C-saku","age":"30"}]')
. . . . . . . . . . . . . . . .> ,('3','group3','["name":"A-ta","age":"30"},{"name":"B-ta","age":"40"},{"name":"C-ta","age":"20"}]');
0: jdbc:hive2://localhost:10000> SELECT * FROM string_json;
+-----------------+--------------------+----------------------------------------------------+
| string_json.id  | string_json.group  |            string_json.multi_line_json             |
+-----------------+--------------------+----------------------------------------------------+
| 1               | group1             | [{"name":"A-ichiro","age":"20"},{"name":"B-ichiro","age":"30"},{"name":"C-ichiro","age":"40"}] |
| 2               | group2             | [{"name":"A-saku","age":"40"},{"name":"B-saku","age":"20"},{"name":"C-saku","age":"30"}] |
| 3               | group3             | ["name":"A-ta","age":"30"},{"name":"B-ta","age":"40"},{"name":"C-ta","age":"20"}] |
+-----------------+--------------------+----------------------------------------------------+
3 rows selected (0.164 seconds)
