はじめに
Hiveとは、HiveqlというSQLライクなドメイン固有言語を提供するクエリエンジンで、Hadoopというデータ処理基盤で動作するものです。この記事では、Hiveqlを初めて触る人向けに10本ノックを作ってみました。
10本ノックの内容は、私が業務で初めてHiveQLを触った際にハマった内容を元に作成したひっかけ問題です。
(私はHive初心者ですので、間違いがあればご指摘下さい)
環境構築
こちらの記事を参考に環境構築すると、Dockerでカンタンにbeelineが利用できるようになります。
参考記事: https://qiita.com/letusfly85/items/b4e5fc67c9566765b794
(記事作成者の方、ありがとうございます!!)
参考記事からdocker-compose.ymlを入手して以下のコマンドを打つとHiveQLの実行環境が得られます。
$ docker-compose -f docker-compose.yml up -d
$ docker exec -it docker-compose_hiveserver2_1 /bin/bash
$ beeline -u jdbc:hive2://localhost:10000 hive hive org.apache.hive.jdbc.HiveDriver
※ この記事の10本ノックは上記の環境でクエリを実行することを前提に作りました。上記の環境以外でクエリを実行すると、10本ノックの解答例が記載通りに動かない場合があります。
1. INSERT文
それでは、第1問目です。
問題
次のようなテーブルが存在したとします。
CREATE TEMPORARY TABLE cols(
col3 INT,
col4 INT,
col2 INT,
col1 INT
);
このとき、SELECT文の結果が以下のようになるように、INSERT文の???
の部分を埋めてください。
0: jdbc:hive2://localhost:10000> SELECT * FROM cols;
+------------+------------+------------+------------+--+
| cols.col3 | cols.col4 | cols.col2 | cols.col1 |
+------------+------------+------------+------------+--+
| 3 | 4 | 2 | 1 |
+------------+------------+------------+------------+--+
INSERT INTO TABLE cols
SELECT
??? AS col1,
??? AS col2,
??? AS col3,
??? AS col4
FROM (SELECT 0) dummyTable;
誤答例
以下のように、順番通りに値を1, 2, 3, 4と指定してしまうと、正しい結果が得られません。
INSERT INTO TABLE cols
SELECT
1 AS col1,
2 AS col2,
3 AS col3,
4 AS col4
FROM (SELECT 0) dummyTable;
0: jdbc:hive2://localhost:10000> SELECT * FROM cols;
+------------+------------+------------+------------+--+
| cols.col3 | cols.col4 | cols.col2 | cols.col1 |
+------------+------------+------------+------------+--+
| 1 | 2 | 3 | 4 |
+------------+------------+------------+------------+--+
1 row selected (0.074 seconds)
解答例
HiveQLでINSERT...SELECT
する場合は、SELECTする順番に従って、INSERTが行われます。 AS
でカラム名を指定しても、無視されます。ですので、値を3, 4, 2, 1の順番でSELECTする必要があります。
INSERT INTO TABLE cols
SELECT
3 AS col1,
4 AS col2,
2 AS col3,
1 AS col4
FROM (SELECT 0) dummyTable;
2. NULLの比較
問題
次のようなtable1
と table2
があったとします。
CREATE TEMPORARY TABLE table1(
col1 INT,
col2 INT,
col3 INT
);
INSERT INTO table1 VALUES (1, 2, 3), (11, 22, NULL);
CREATE TEMPORARY TABLE table2(
col1 INT,
col2 INT,
col3 INT
);
INSERT INTO table2 VALUES (1, 2, -3), (11, 22, NULL);
SELECT * FROM table1;
SELECT * FROM table2;
-- SELECT文の実行結果
-- 0: jdbc:hive2://localhost:10000> SELECT * FROM table1;
-- +--------------+--------------+--------------+--+
-- | table1.col1 | table1.col2 | table1.col3 |
-- +--------------+--------------+--------------+--+
-- | 1 | 2 | 3 |
-- | 11 | 22 | NULL |
-- +--------------+--------------+--------------+--+
-- 2 rows selected (0.129 seconds)
-- 0: jdbc:hive2://localhost:10000> SELECT * FROM table2;
-- +--------------+--------------+--------------+--+
-- | table2.col1 | table2.col2 | table2.col3 |
-- +--------------+--------------+--------------+--+
-- | 1 | 2 | -3 |
-- | 11 | 22 | NULL |
-- +--------------+--------------+--------------+--+
-- 2 rows selected (0.041 seconds)
このとき、table1
とtable2
をJOINして、以下のような結果になるSELECT文を作成してください。ただし、JOINする際は、table1
とtable2
のcol1, col2, col3
の値が一致しているか事を比較してください。
+--------------+--------------+--------------+--------------+--------------+--------------+--+
| table1.col1 | table1.col2 | table1.col3 | table2.col1 | table2.col2 | table2.col3 |
+--------------+--------------+--------------+--------------+--------------+--------------+--+
| 11 | 22 | NULL | 11 | 22 | NULL |
+--------------+--------------+--------------+--------------+--------------+--------------+--+
ヒント: 作成するSELECT文の完成イメージは以下のようになります。 ???
の部分を穴埋めしてください。
SELECT
*
FROM table1
JOIN table2 ON
???
???
???
誤答例
NULLの値を含むcol3
を比較する際、 =
を比較に使用するとJOINに失敗します。
SELECT
*
FROM table1
JOIN table2 ON
table1.col1 = table2.col1 AND
table1.col2 = table2.col2 AND
table1.col3 = table2.col3
;
0: jdbc:hive2://localhost:10000> SELECT
0: jdbc:hive2://localhost:10000> *
0: jdbc:hive2://localhost:10000> FROM table1
0: jdbc:hive2://localhost:10000> JOIN table2 ON
0: jdbc:hive2://localhost:10000> table1.col1 = table2.col1 AND
0: jdbc:hive2://localhost:10000> table1.col2 = table2.col2 AND
0: jdbc:hive2://localhost:10000> table1.col3 = table2.col3
0: jdbc:hive2://localhost:10000> ;
INFO : Execution completed successfully
INFO : MapredLocal task succeeded
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO : Job running in-process (local Hadoop)
INFO : 2019-11-24 06:00:52,674 Stage-3 map = 100%, reduce = 0%
INFO : Ended Job = job_local671600389_0025
+--------------+--------------+--------------+--------------+--------------+--------------+--+
| table1.col1 | table1.col2 | table1.col3 | table2.col1 | table2.col2 | table2.col3 |
+--------------+--------------+--------------+--------------+--------------+--------------+--+
+--------------+--------------+--------------+--------------+--------------+--------------+--+
解答例
値がNULLの場合はISNULL関数またはIS NULL
で比較してください。
SELECT
*
FROM table1
JOIN table2 ON
table1.col1 = table2.col1 AND
table1.col2 = table2.col2
WHERE (table1.col3 = table2.col3) OR (ISNULL(table1.col3) AND ISNULL(table2.col3))
;
-- WHERE句は以下のようにしてもOK
-- WHERE (table1.col3 = table2.col3) OR (table1.col3 IS NULL AND table2.col3 IS NULL)
0: jdbc:hive2://localhost:10000> SELECT
0: jdbc:hive2://localhost:10000> *
0: jdbc:hive2://localhost:10000> FROM table1
0: jdbc:hive2://localhost:10000> JOIN table2 ON
0: jdbc:hive2://localhost:10000> table1.col1 = table2.col1 AND
0: jdbc:hive2://localhost:10000> table1.col2 = table2.col2
0: jdbc:hive2://localhost:10000> WHERE (table1.col3 = table2.col3) OR (ISNULL(table1.col3) AND ISNULL(table2.col3))
0: jdbc:hive2://localhost:10000> ;
INFO : Execution completed successfully
INFO : MapredLocal task succeeded
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO : Job running in-process (local Hadoop)
INFO : 2019-11-24 06:03:46,862 Stage-3 map = 100%, reduce = 0%
INFO : Ended Job = job_local1646480365_0026
+--------------+--------------+--------------+--------------+--------------+--------------+--+
| table1.col1 | table1.col2 | table1.col3 | table2.col1 | table2.col2 | table2.col3 |
+--------------+--------------+--------------+--------------+--------------+--------------+--+
| 11 | 22 | NULL | 11 | 22 | NULL |
+--------------+--------------+--------------+--------------+--------------+--------------+--+
1 row selected (4.912 seconds)
3. UNION ALL
問題
次のtable1
とtable2
をUNION ALL
で統合するSELECT文を作成してください。
CREATE TEMPORARY TABLE table1 AS SELECT 1, 2;
CREATE TEMPORARY TABLE table2 AS SELECT 3, 4;
SELECT文は、実行結果が以下のようになるように作成してください。
+-------------+-------------+--+
| table3._c0 | table3._c1 |
+-------------+-------------+--+
| 3 | 4 |
| 1 | 2 |
+-------------+-------------+--+
誤答例
UNION ALL
はサブクエリの中でしか使用できませんので注意です。
-- 誤答例1
(SELECT * FROM table1) UNION ALL (SELECT * FROM table2);
-- 誤答例2
SELECT
*
FROM
table1
UNION ALL
table2
;
0: jdbc:hive2://localhost:10000> (SELECT * FROM table1) UNION ALL (SELECT * FROM table2);
Error: Error while compiling statement: FAILED: ParseException line 1:0 cannot recognize input near '(' 'SELECT' '*' (state=42000,code=40000)
0: jdbc:hive2://localhost:10000> SELECT
0: jdbc:hive2://localhost:10000> *
0: jdbc:hive2://localhost:10000> FROM
0: jdbc:hive2://localhost:10000> table1
0: jdbc:hive2://localhost:10000> UNION ALL
0: jdbc:hive2://localhost:10000> table2
0: jdbc:hive2://localhost:10000> ;
Error: Error while compiling statement: FAILED: ParseException line 6:4 cannot recognize input near 'table2' '<EOF>' '<EOF>' in select clause (state=42000,code=40000)
解答例
SELECT
*
FROM(
SELECT * FROM table1
UNION ALL
SELECT * FROM table2
) table3;
0: jdbc:hive2://localhost:10000> SELECT
0: jdbc:hive2://localhost:10000> *
0: jdbc:hive2://localhost:10000> FROM(
0: jdbc:hive2://localhost:10000> SELECT * FROM table1
0: jdbc:hive2://localhost:10000> UNION ALL
0: jdbc:hive2://localhost:10000> SELECT * FROM table2
0: jdbc:hive2://localhost:10000> ) table3;
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO : Job running in-process (local Hadoop)
INFO : 2019-11-24 06:13:21,578 Stage-1 map = 100%, reduce = 0%
INFO : Ended Job = job_local26487554_0031
+-------------+-------------+--+
| table3._c0 | table3._c1 |
+-------------+-------------+--+
| 3 | 4 |
| 1 | 2 |
+-------------+-------------+--+
2 rows selected (1.299 seconds)
4. リテラルのINSERT
問題
次のtbl
というテーブルに18446744073709001000
という数値をINSERTしてください。
CREATE TEMPORARY TABLE tbl (col DECIMAL(38, 18));
ただし、INSERT文は以下の形式で作成してください。
INSERT INTO TABLE tbl SELECT ??? FROM (SELECT 0) dummyTable;
誤答例
18446744073709001000
をそのままINSERTすると、1000の位の数が0になります。
INSERT INTO TABLE tbl SELECT 18446744073709001000 FROM (SELECT 0) dummyTable;
0: jdbc:hive2://localhost:10000> INSERT INTO TABLE tbl SELECT 18446744073709001000 FROM (SELECT 0) dummyTable;
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO : Job running in-process (local Hadoop)
INFO : 2019-11-24 06:46:40,619 Stage-1 map = 100%, reduce = 0%
INFO : Ended Job = job_local1838873872_0037
INFO : Stage-4 is selected by condition resolver.
INFO : Stage-3 is filtered out by condition resolver.
INFO : Stage-5 is filtered out by condition resolver.
INFO : Moving data to: file:/tmp/hive/anonymous/cf5efc27-4200-4102-bec7-74e48029f052/_tmp_space.db/0fc85315-e03a-453a-981f-f68aed834083/.hive-staging_hive_2019-11-24_06-46-39_451_9085844297384824125-10/-ext-10000 from file:/tmp/hive/anonymous/cf5efc27-4200-4102-bec7-74e48029f052/_tmp_space.db/0fc85315-e03a-453a-981f-f68aed834083/.hive-staging_hive_2019-11-24_06-46-39_451_9085844297384824125-10/-ext-10002
INFO : Loading data to table default.tbl from file:/tmp/hive/anonymous/cf5efc27-4200-4102-bec7-74e48029f052/_tmp_space.db/0fc85315-e03a-453a-981f-f68aed834083/.hive-staging_hive_2019-11-24_06-46-39_451_9085844297384824125-10/-ext-10000
INFO : Table default.tbl stats: [numFiles=1, numRows=1, totalSize=21, rawDataSize=20]
No rows affected (1.223 seconds)
0: jdbc:hive2://localhost:10000>
0: jdbc:hive2://localhost:10000> select * from tbl;
+-----------------------+--+
| tbl.col |
+-----------------------+--+
| 18446744073709000000 |
+-----------------------+--+
1 row selected (0.136 seconds)
解答例
18446744073709001000
を挿入する際は末尾にBD
を着けて18446744073709001000BD
とする必要があります。
INSERT INTO TABLE tbl SELECT 18446744073709001000BD FROM (SELECT 0) dummyTable;
0: jdbc:hive2://localhost:10000> INSERT INTO TABLE tbl SELECT 18446744073709001000BD FROM (SELECT 0) dummyTable;
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO : Job running in-process (local Hadoop)
INFO : 2019-11-24 06:51:31,349 Stage-1 map = 100%, reduce = 0%
INFO : Ended Job = job_local1036843654_0039
INFO : Stage-4 is selected by condition resolver.
INFO : Stage-3 is filtered out by condition resolver.
INFO : Stage-5 is filtered out by condition resolver.
INFO : Moving data to: file:/tmp/hive/anonymous/cfb993aa-a328-481a-8e6c-be26cc623ec5/_tmp_space.db/fa20fbb0-8bfd-4b16-b137-142ed0a85316/.hive-staging_hive_2019-11-24_06-51-30_174_7370544646674941832-10/-ext-10000 from file:/tmp/hive/anonymous/cfb993aa-a328-481a-8e6c-be26cc623ec5/_tmp_space.db/fa20fbb0-8bfd-4b16-b137-142ed0a85316/.hive-staging_hive_2019-11-24_06-51-30_174_7370544646674941832-10/-ext-10002
INFO : Loading data to table default.tbl from file:/tmp/hive/anonymous/cfb993aa-a328-481a-8e6c-be26cc623ec5/_tmp_space.db/fa20fbb0-8bfd-4b16-b137-142ed0a85316/.hive-staging_hive_2019-11-24_06-51-30_174_7370544646674941832-10/-ext-10000
INFO : Table default.tbl stats: [numFiles=1, numRows=1, totalSize=21, rawDataSize=20]
No rows affected (1.243 seconds)
0: jdbc:hive2://localhost:10000> select * from tbl;
+-----------------------+--+
| tbl.col |
+-----------------------+--+
| 18446744073709001000 |
+-----------------------+--+
1 row selected (0.167 seconds)
因みにINSERT INTO tbl VALUES
の形式でINSERTしても、正しくINSERTされます。
INSERT INTO tbl VALUES(18446744073709001000);
0: jdbc:hive2://localhost:10000> INSERT INTO tbl VALUES(18446744073709001000);
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO : Job running in-process (local Hadoop)
INFO : 2019-11-24 06:48:56,695 Stage-1 map = 100%, reduce = 0%
INFO : Ended Job = job_local1313902307_0038
INFO : Stage-4 is selected by condition resolver.
INFO : Stage-3 is filtered out by condition resolver.
INFO : Stage-5 is filtered out by condition resolver.
INFO : Moving data to: file:/tmp/hive/anonymous/5d456b22-4585-4e92-9ae5-f2f362241157/_tmp_space.db/371e970a-13cb-4c61-90a6-1c52ee49182c/.hive-staging_hive_2019-11-24_06-48-55_528_1421126263335840065-10/-ext-10000 from file:/tmp/hive/anonymous/5d456b22-4585-4e92-9ae5-f2f362241157/_tmp_space.db/371e970a-13cb-4c61-90a6-1c52ee49182c/.hive-staging_hive_2019-11-24_06-48-55_528_1421126263335840065-10/-ext-10002
INFO : Loading data to table default.tbl from file:/tmp/hive/anonymous/5d456b22-4585-4e92-9ae5-f2f362241157/_tmp_space.db/371e970a-13cb-4c61-90a6-1c52ee49182c/.hive-staging_hive_2019-11-24_06-48-55_528_1421126263335840065-10/-ext-10000
INFO : Table default.tbl stats: [numFiles=1, numRows=1, totalSize=21, rawDataSize=20]
No rows affected (1.211 seconds)
0: jdbc:hive2://localhost:10000> select * from tbl;
+-----------------------+--+
| tbl.col |
+-----------------------+--+
| 18446744073709001000 |
+-----------------------+--+
1 row selected (0.145 seconds)
5. SELECT文で配列展開
問題
次のテーブルのarr
をEXPLODE
関数の引数として渡すと、配列の各要素がレコードとして展開されます。
CREATE TEMPORARY TABLE tbl AS SELECT ARRAY('a', 'b', 'c') AS arr;
0: jdbc:hive2://localhost:10000> SELECT EXPLODE(arr) AS element FROM tbl;
+----------+--+
| element |
+----------+--+
| a |
| b |
| c |
+----------+--+
3 rows selected (0.044 seconds)
このSELECT EXPLODE(arr) AS element FROM tbl;
というクエリを改修して、SELECT文の実行結果が以下のようになるようにしてください。
+----------+-------+--+
| element | hoge |
+----------+-------+--+
| a | hoge |
| b | hoge |
| c | hoge |
+----------+-------+--+
誤答例
SELECT句の中で使えるUDTF(この場合EXPLODE
関数)は一つのみです。ですので、以下のように、カンマ区切りでそのままEXPLODE
とhoge
を並べるとエラーが発生します。
SELECT EXPLODE(arr) AS element, 'hoge' AS hoge FROM tbl;
0: jdbc:hive2://localhost:10000> SELECT EXPLODE(arr) AS element, 'hoge' AS hoge FROM tbl;
Error: Error while compiling statement: FAILED: SemanticException 1:32 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'hoge' (state=42000,code=40000)
解答例
LATERAL VIEW
を使用すると、SELECT句の中に、配列を展開したものと他のカラムを記述することができます。
SELECT
element,
'hoge' AS hoge
FROM
tbl
LATERAL VIEW
EXPLODE(arr) arr AS element;
0: jdbc:hive2://localhost:10000> SELECT
0: jdbc:hive2://localhost:10000> element,
0: jdbc:hive2://localhost:10000> 'hoge' AS hoge
0: jdbc:hive2://localhost:10000> FROM
0: jdbc:hive2://localhost:10000> tbl
0: jdbc:hive2://localhost:10000> LATERAL VIEW
0: jdbc:hive2://localhost:10000> EXPLODE(arr) arr AS element;
+----------+-------+--+
| element | hoge |
+----------+-------+--+
| a | hoge |
| b | hoge |
| c | hoge |
+----------+-------+--+
終わりに
閲覧ありがとうございました!6~10問目は近日公開予定です。