Hive 0.14 on Tezを試す #hadoop

目的

最新バージョンの組み合わせ(Hive 0.14 + Tez 0.5.2 + Hadoop 2.6.0)でHive on Tezが動作することを確かめる。

前提

Hadoop 2.6.0と、Tez 0.5.2がインストールされていること。

Hiveのインストール

Hiveの最新リリースをダウンロード

$ wget http://www.us.apache.org/dist/hive/hive-0.14.0/apache-hive-0.14.0-src.tar.gz
$ tar xf apache-hive-0.14.0-src.tar.gz 
$ cd apache-hive-0.14.0-src

ビルド

せっかくなので、Hadoop 2.6.0とTez 0.5.2を使うよう設定

pom.xml

diff --git a/apache-hive-0.14.0-src.org/pom.xml b/apache-hive-0.14.0-src/pom.xml
index 501d547..d9a9fd1 100644
--- a/apache-hive-0.14.0-src.org/pom.xml
+++ b/apache-hive-0.14.0-src/pom.xml
@@ -115,7 +115,7 @@
     <groovy.version>2.1.6</groovy.version>
     <hadoop-20.version>0.20.2</hadoop-20.version>
     <hadoop-20S.version>1.2.1</hadoop-20S.version>
-    <hadoop-23.version>2.5.0</hadoop-23.version>
+    <hadoop-23.version>2.6.0</hadoop-23.version>
     <hadoop.bin.path>${basedir}/${hive.path.to.root}/testutils/hadoop</hadoop.bin.path>
     <hbase.hadoop1.version>0.98.3-hadoop1</hbase.hadoop1.version>
     <hbase.hadoop2.version>0.98.3-hadoop2</hbase.hadoop2.version>
@@ -152,7 +152,7 @@
     <stax.version>1.0.1</stax.version>
     <slf4j.version>1.7.5</slf4j.version>
     <ST4.version>4.0.4</ST4.version>
-    <tez.version>0.5.2-SNAPSHOT</tez.version>
+    <tez.version>0.5.2</tez.version>
     <super-csv.version>2.2.0</super-csv.version>
     <tempus-fugit.version>1.1</tempus-fugit.version>
     <snappy.version>0.2</snappy.version>
@@ -209,8 +209,19 @@
          <enabled>false</enabled>
        </snapshots>
      </repository>
+     <repository>
+       <id>org.apache.hadoop</id>
+       <url>https://repository.apache.org/content/repositories/orgapachehadoop-1012</url>
+     </repository>
   </repositories>

+  <pluginRepositories>
+    <pluginRepository>
+       <id>org.apache.hadoop</id>
+       <url>https://repository.apache.org/content/repositories/orgapachehadoop-1012</url>
+    </pluginRepository>
+  </pluginRepositories>
+
   <!-- Hadoop dependency management is done at the bottom under profiles -->
   <dependencyManagement>
     <dependencies>

Hadoop 2.6.0が正式リリースされたら、repository, pluginRepositoryの指定は不要

Mavenでビルド

$ mvn clean install -DskipTests -Phadoop-2,dist

Hiveの設定

$ cp -r packaging/target/apache-hive-0.14.0-bin/apache-hive-0.14.0-bin /usr/local/
# export文は、bashrcあたりに書いておくことを推奨
$ export HIVE_HOME=/usr/local/apache-hive-0.14.0-bin
$ export PATH=$PATH:$HIVE_HOME/bin
$ cd $HIVE_HOME/conf
$ cp hive-default.xml.template hive-default.xml

Hiveを使うためのHDFSの初期設定

必要なディレクトリを作成する

$ hdfs dfs -mkdir /tmp
$ hdfs dfs -mkdir -p /user/hive/warehouse
$ hdfs dfs -chmod g+w /tmp
$ hdfs dfs -chmod g+w /user/hive/warehouse

Hiveの実行

sinchiiさんのblog:とりあえずPig on Tez を動かしてみたのデータとクエリをHiveに移植してみることにします。

データロード

hive> CREATE TABLE flight (number STRING, dept INT, dest INT, equip STRING)                  
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
OK
Time taken: 0.916 seconds
hive> CREATE TABLE airport (id INT, name STRING)                                            
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
OK
Time taken: 0.093 seconds
hive> LOAD DATA LOCAL INPATH './flight.txt' OVERWRITE INTO TABLE flight;
Loading data to table default.flight
Table default.flight stats: [numFiles=1, numRows=0, totalSize=284, rawDataSize=0]
OK
Time taken: 0.903 seconds
hive> LOAD DATA LOCAL INPATH './airport.txt' OVERWRITE INTO TABLE airport;
Loading data to table default.airport
Table default.airport stats: [numFiles=1, numRows=0, totalSize=30, rawDataSize=0]
OK
Time taken: 0.595 seconds

Pig版だとカラム名がfrom, toに設定されていたが、Hiveだとfromが予約語のためそれぞれdept, destに変更している

クエリを書く

個人的なことだけど、変換に苦労した

特に、SUBSTRING(number, 0, 2)を2回書かされてしまったのがつらい (AS句がうまく動いてほしかった)

# tezで実行する設定
hive> set hive.execution.engine=tez;
hive> SELECT name, SUBSTRING(number, 0, 2), count(*) FROM flight       
    > JOIN airport ON flight.dept = airport.id WHERE equip != '777-300'
    > GROUP BY SUBSTRING(number, 0, 2), name;                          
Query ID = root_20141113002626_5b2ab149-c279-4ba8-8089-f51aedf76e76
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id application_1415780973920_0008)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Map 3 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 19.70 s    
--------------------------------------------------------------------------------
OK
hnd 7G  2
nrt AP  1
nrt GK  1
hnd JL  7
nrt JL  1
hnd NH  5
Time taken: 20.202 seconds, Fetched: 6 row(s)

クエリ実行中は以下の画像のようなProgress barが表示されて、リアルタイムで更新される

この機能は、Hive 0.14で追加されました! HIVE-8495

MapReduce版

# MapReduceで実行する設定
hive> set hive.execution.engine=mr;
# MapReduce on Tezを設定している場合は、元に戻す
hive> set mapreduce.framework.name=yarn;   
hive> SELECT name, SUBSTRING(number, 0, 2), count(*) FROM flight       
    > JOIN airport ON flight.dept = airport.id WHERE equip != '777-300'
    > GROUP BY SUBSTRING(number, 0, 2), name;                          
Query ID = root_20141113004545_9de60583-4809-4d0f-b139-efbec88a007b
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/apache-hive-0.14.0-bin/lib/hive-jdbc-0.14.0-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/11/13 00:45:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Execution log at: /tmp/root/root_20141113004545_9de60583-4809-4d0f-b139-efbec88a007b.log
2014-11-13 12:45:51 Starting to launch local task to process map join;  maximum memory = 477102080
2014-11-13 12:45:53 Dump the side-table for tag: 1 with group count: 5 into file: file:/tmp/root/d0dab1cf-a125-4722-b6c2-dd7214e13c9f/hive_2014-11-13_00-45-43_180_2968392168506954651-1/-local-10004/HashTable-Stage-2/MapJoin-mapfile61--.hashtable
2014-11-13 12:45:53 Uploaded 1 File to: file:/tmp/root/d0dab1cf-a125-4722-b6c2-dd7214e13c9f/hive_2014-11-13_00-45-43_180_2968392168506954651-1/-local-10004/HashTable-Stage-2/MapJoin-mapfile61--.hashtable (375 bytes)
2014-11-13 12:45:53 End of local task; Time Taken: 2.277 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1415780973920_0010, Tracking URL = http://localhost:8088/proxy/application_1415780973920_0010/
Kill Command = /usr/local/hadoop-2.6.0/bin/hadoop job  -kill job_1415780973920_0010
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2014-11-13 00:46:07,539 Stage-2 map = 0%,  reduce = 0%
2014-11-13 00:46:20,779 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 4.08 sec
2014-11-13 00:46:34,223 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 7.86 sec
MapReduce Total cumulative CPU time: 7 seconds 860 msec
Ended Job = job_1415780973920_0010
MapReduce Jobs Launched: 
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 7.86 sec   HDFS Read: 497 HDFS Write: 54 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 860 msec
OK
hnd 7G  2
nrt AP  1
nrt GK  1
hnd JL  7
nrt JL  1
hnd NH  5
Time taken: 53.31 seconds, Fetched: 6 row(s)

性能比較

特にチューニングはしてない状況ですが、

Tez: 20秒
MapReduce: 53秒

と差がつきました。Hive on Tez、いい感じだと思います。特にProgress barが。