More than 5 years have passed since last update.

hiveにscala（JDBC）で接続してDBUnitを回そうとして気づいたHiveQLの仕様とhiveのJDBCインターフェースの実装状況

Last updated at 2016-03-04Posted at 2015-12-05

0. 結論

回帰の単体テストについても技術的に「無理」みたいな話はなくハードルは低いという印象。
むしろ戦略的にテストして、がっつりと勝っていくべき。

1. やりたいこと

以前の投稿で、作成したhiveserverにscalaでJDBC接続をして、dbunitによる単体テストを実施してみたいと思いました。

2. 注意

本投稿では、検証目的のためにOSSの一部に手を入れていますが、十分な検証を経たものではありません。コピーペースト等をしていただいてご利用いただいたとしても、動作保証は出来ませんがご了承ください。

3. ざっくりとした手順

scalaでsbtプロジェクトを作成し、JDBC等の依存を設定したbuild.sbtを用意する
jdbcドライバ経由で、hiveに接続可能かどうかを動作確認する
build.sbtにdbunit, spec2の依存を追加してテストを動かしてみる
dbunitのソースコードを修正して、パッケージングする
改めて、dbunitを使ったspec2を実施してみる

4. 実施メモ

4-1. scalaでsbtプロジェクトを作成し、JDBC等の依存を設定したbuild.sbtを用意する

まずは、intellijを利用して、SBTプロジェクトを用意します。
build.sbtの内容は以下のとおりです。

name := "hive-jdbc-sample"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.hive"   % "hive-jdbc"          % "1.1.0",
  "org.apache.hive"   % "hive-common"        % "1.1.0",
  "org.apache.hive"   % "hive-exec"          % "1.1.0",
  "org.apache.hive"   % "hive-service"       % "1.1.0",
  "org.apache.hive"   % "hive-metastore"     % "1.1.0",
  "commons-logging"   % "commons-logging"    % "1.2",
  "org.apache.thrift" % "libthrift"          % "0.9.2",
  "org.apache.httpcomponents" % "httpclient" % "4.5.1",
  "org.apache.httpcomponents" % "httpcore"   % "4.4.4",
  "org.apache.hadoop" % "hadoop-core"        % "1.2.1",
  "org.apache.hadoop" % "hadoop-common"      % "2.7.1",
  "org.slf4j"         % "slf4j-api"          % "1.7.6",
  "org.slf4j"         % "slf4j-log4j12"      % "1.7.6",
  "com.google.guava"  % "guava"              % "18.0",
  "org.pentaho" % "pentaho-aggdesigner-algorithm"   % "5.1.5-jhyde",
  "commons-configuration" % "commons-configuration" % "1.10"
  )

resolvers += "Concurrent Maven Repo" at "http://conjars.org/repo"

めっちゃ依存多いですよね。。

ハマった点

jhydeの依存解決部分にまずはハマりました。なんでpentahoのライブラリが必要なんだろ、と思いましたが現状はそこまで深掘りできておらず。ぐぐってみると、（たとえば右記の）FAQサイトでpentahoのライブラリをpom.xmlやらbuild.sbtに通せばいいんだよ、と記載があり、いったんは依存解消のために通すことにしました。
hive-jdbcのjarのバージョンについてもはまりました。最初、1.2.0のjarを利用しようとしましたが、エラーが発生。。接続しに行くhiveserver2で利用されているhiveのjarとバージョンを合わせないと動かないようです。互換性はないみたいです。。

4-2. jdbcドライバ経由で、hiveに接続可能かどうかを動作確認する

公式サイトのjavaのサンプルコードをscalaになおしてみて、かつ、他のクエリも試してみました。

以下の様なコードを用意しました。

InsertSample.scala

package sample

import java.sql.{Statement, DriverManager, Connection}

class InsertSample {
  val driverName = "org.apache.hive.jdbc.HiveDriver"
  val databaseName = "erdrepo"

  def insert(): Unit = {
    Class.forName(driverName)
    val conn: Connection = DriverManager
      .getConnection(s"jdbc:hive2://*.*.*.*:10000/${databaseName}", "hive", "hive")

    val stmt: Statement = conn.createStatement()

    val tableName: String = "ms_tables"

    var sql = s"show tables '${tableName}'"
    val result = stmt.executeQuery(sql)
    while (result.next()) {
      println(result.getString(1))
    }

    //queryその１ カラムは指定しない。これはOK
    sql = s"insert into ${tableName} values ('1', '1', '1')"
    stmt.executeUpdate(sql)

    //queryその２　カラムを指定する。これはエラーになる
    sql = s"insert into ${tableName} (table_id) values ('1')"
    stmt.executeUpdate(sql)
  }
}

ハマった点

3. temp fileを作成することが出来ませんエラーが発生する。。

hive-execのライブラリの使用の様子でした。sbtの起動オプションに以下を追加しました。

-Dhive.exec.scratchdir=/path/to/temp

apache hive wiki AdminManual

4. Column指定だとクエリが通らない。。

Oracle, MySQL感覚でクエリを自分で書いてみてエラーが発生して怒られたことで、HiveQLの以下の仕様を知りました。

エラー内容

Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.parse.ParseException:line 1:22 cannot recognize input near '(' 'table_id' ')' in statement
	at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
	at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)

気づいたHiveQLの仕様

2015/12/5時点の安定版のHiveQLでは、insertクエリで、カラムを指定することが出来ない様子です。

これは、下記にも記載があって、0.14.0のバージョンからだと利用できるようになるよ、とのことです。

4-3. build.sbtにdbunit, spec2の依存を追加してテストを動かしてみる

上記のクエリがJDBCで通らないことで既に嫌な予感を覚えつつも、build.sbtにdbunit, spec2の依存を追加します。

build.sbt

name := "hive-jdbc-sample"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.hive"   % "hive-jdbc"          % "1.1.0",
  "org.apache.hive"   % "hive-common"        % "1.1.0",
  "org.apache.hive"   % "hive-exec"          % "1.1.0",
  "org.apache.hive"   % "hive-service"       % "1.1.0",
  "org.apache.hive"   % "hive-metastore"     % "1.1.0",
  "commons-logging"   % "commons-logging"    % "1.2",
  "org.apache.thrift" % "libthrift"          % "0.9.2",
  "org.apache.httpcomponents" % "httpclient" % "4.5.1",
  "org.apache.httpcomponents" % "httpcore"   % "4.4.4",
  "org.apache.hadoop" % "hadoop-core"        % "1.2.1",
  "org.apache.hadoop" % "hadoop-common"      % "2.7.1",
  "org.slf4j"         % "slf4j-api"          % "1.7.6",
  "org.slf4j"         % "slf4j-log4j12"      % "1.7.6",
  "com.google.guava"  % "guava"              % "18.0",
  "org.pentaho" % "pentaho-aggdesigner-algorithm"   % "5.1.5-jhyde",
  "commons-configuration" % "commons-configuration" % "1.10",
  "org.specs2"       %% "specs2-core"        % "2.4.9" % "test",
  "org.specs2"       %  "specs2-junit_2.10"  % "3.6.4" % "test",
  "junit"            %  "junit"              % "4.10",
  "org.dbunit"       %  "dbunit"             % "2.5.1",
  "org.apache.poi"   %  "poi"                % "3.10-FINAL" % "test",
  "org.apache.poi"   %  "poi-ooxml"          % "3.10-FINAL" % "test"
  )

resolvers += "Concurrent Maven Repo" at "http://conjars.org/repo"

上記のように設定することで、sbtのコンパイルは通りました。

ハマった点

5. poi-ooxmlがないと怒られる。。

ここは、hiveでもdbunitでもなくpoiの世界の話なので、素直にpoi-ooxmlを依存に追加しました。

stackoverflow - poi-ooxml for apache poi

6. hive-jdbcのインターフェースにサポートしていないメソッドがありdbunitが動かないことがある

次は、sbtコンソールからtestタスクを実施します。
spec2用のソースは以下のとおりです。

InsertSampleSpec.scala

import java.io.{File, FileInputStream}
import java.sql.{Connection, DriverManager}

import org.dbunit.database.{DatabaseConnection, IDatabaseConnection}
import org.dbunit.dataset.IDataSet
import org.dbunit.dataset.excel.XlsDataSet
import org.dbunit.operation.DatabaseOperation
import org.specs2.mutable.Specification


class InsertSampleSpec extends Specification {
  val driverName = "org.apache.hive.jdbc.HiveDriver"
  val databaseName = "erdrepo"

  Class.forName(driverName)
  val conn: Connection = DriverManager
    .getConnection(s"jdbc:hive2://*.*.*.*:10000/${databaseName}", "hive", "hive")

  val iConn: IDatabaseConnection  = new DatabaseConnection(conn, null, false)

  val url: String = "/spec/ms_tables.xlsx"
  val file: File = new File(getClass.getResource(url).toURI)
  val inputStream: FileInputStream = new FileInputStream(file)
  val partialDataSet: IDataSet = new XlsDataSet(inputStream)

  DatabaseOperation.TRUNCATE_TABLE.execute(iConn, partialDataSet)
  DatabaseOperation.INSERT.execute(iConn, partialDataSet)

  "ms_tables count should be 3" in {
    val stmt = conn.createStatement()
    val sql = "select count(1) from ms_tables"

    val result = stmt.executeQuery(sql)
    var cnt = 0
    while (result.next) {
      cnt = result.getInt(1)
    }

    cnt must_==(3)
  }

}

testタスク実行するとエラー。。

> test
[info] Compiling 1 Scala source to /path/to/hive-jdbc-sample/target/scala-2.10/test-classes...
15/12/05 10:22:39 INFO jdbc.Utils: Supplied authorities:*.*.*.*:10000
15/12/05 10:22:39 INFO jdbc.Utils: Resolved authority:*.*.*.*:10000
15/12/05 10:22:39 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://*.*.*.*:10000/erdrepo
[error] 
[error] cannot create an instance for class InsertSampleSpec
[error]   caused by org.dbunit.DatabaseUnitRuntimeException: Exception while trying to access database metadata
[error]   caused by java.sql.SQLException: Method not supported

...
...
...

[error]   CAUSED BY org.dbunit.DatabaseUnitRuntimeException: Exception while trying to access database metadata
[error]   org.dbunit.util.SQLHelper.correctCase(SQLHelper.java:606)
[error]   org.dbunit.util.SQLHelper.correctCase(SQLHelper.java:549)

databaseのmetastoreにアクセスしようとしてエラーが発生している様子です。

該当のdbunitのソースコードを見てみました。

SQLHelper.java

        try {
            String resultTableName = databaseIdentifier;
            String dbIdentifierQuoteString = databaseMetaData.getIdentifierQuoteString();
            if(!isEscaped(databaseIdentifier, dbIdentifierQuoteString)){
                if(databaseMetaData.storesLowerCaseIdentifiers())
                {
                    resultTableName = databaseIdentifier.toLowerCase(Locale.ENGLISH);
                }
                else if(databaseMetaData.storesUpperCaseIdentifiers())
                {
                    resultTableName = databaseIdentifier.toUpperCase(Locale.ENGLISH);
                }
                else
                {
                    logger.debug("Database does not store upperCase or lowerCase identifiers. " +
                            "Will not correct case of the table names.");
                }
            }
            else
            {
                if(logger.isDebugEnabled())
                    logger.debug("The tableName '{}' is escaped. Will not correct case.", databaseIdentifier);
                }
            return resultTableName;
        }
        catch (SQLException e)
        {
            throw new DatabaseUnitRuntimeException("Exception while trying to access database metadata", e);
        }

最後のcatch節でエラーが発生しています。
その原因となったdatabaseMetaData.storesLowerCaseIdentifiers()のJDBCのhive側のhive-jdbc 1.1.0の仕様を見てみると、

821   public boolean More ...storesLowerCaseIdentifiers() throws SQLException {
822     throw new SQLException("Method not supported");
823   }

これは、丸投げ。。1.2.1のapiでも同様のことが記載されていました。

4-4. dbunitのソースコードを修正して、パッケージングする

4-3.でやろうとしているのは、テーブル名を取得する際のlower case等のバリデーションやクオート許可なので、コーディング規約次第では「利用しない・該当しない・ソースコオードコミット時にgit hookでチェック済み」のものになりえるので、dbunitの改修をしてそもそも動くかを試してみました。

また、他にもdbunitをいじると動かない点が出てきたのですが細かい話なのでいったん割愛します。（時間見つけて後日追記するかもしれません）

ということで、dbunitのソースコードをgit cloneして、修正しました。

git clone http://git.code.sf.net/p/dbunit/code.git

なお、本投稿時での改修ポイントは以下の３つです。

対象	分類	修正内容
org.dbunit.util.SQLHelper.java	hive-jdbc	テーブル名返却処理でのバリデーションをスキップさせるように改修
org.dbunit.operation.InsertOperation.java	HiveQL	上述のHiveQLの仕様上、カラム指定だとinsertが通らないのでカラムを指定するコードをスキップするように改修
org.dbunit.dataset.datatype.BigIntegerDataType.java	HiveQL, hive-jdbc	Hiveにdbunit経由でINT型のデータを登録しようとすると、dbunit側でBigDecimal変換をしようとするのでスキップするように改修

上記の修正が終わったら、dbunitをパッケージングします。

なお、dbunitのビルドの注意によると、oracleのjarが必要になるので、Oracleの公式サイトで取得の上、ローカルのmavenリポジトリに手動でインストールしてから出ないとmavenコマンドは失敗に終わります。

cd /path/to/dbunit

mvn package -Dmaven.test.skip=true

パッケージングが成功したら、targetフォルダ以下のjarを、SBTプロジェクトのlibフォルダの下にcopyしました。

4-5. 改めて、dbunitを使ったspec2を実施してみる

SBTプロジェクトの直下に暫定的にlibフォルダをおいて、dbunitの改修版を格納しているので、build.sbtの依存を修正してdbunitをコメントアウトします。

build.sbt

name := "hive-jdbc-sample"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.hive"   % "hive-jdbc"          % "1.1.0",
  "org.apache.hive"   % "hive-common"        % "1.1.0",
  "org.apache.hive"   % "hive-exec"          % "1.1.0",
  "org.apache.hive"   % "hive-service"       % "1.1.0",
  "org.apache.hive"   % "hive-metastore"     % "1.1.0",
  "commons-logging"   % "commons-logging"    % "1.2",
  "org.apache.thrift" % "libthrift"          % "0.9.2",
  "org.apache.httpcomponents" % "httpclient" % "4.5.1",
  "org.apache.httpcomponents" % "httpcore"   % "4.4.4",
  "org.apache.hadoop" % "hadoop-core"        % "1.2.1",
  "org.apache.hadoop" % "hadoop-common"      % "2.7.1",
  "org.slf4j"         % "slf4j-api"          % "1.7.6",
  "org.slf4j"         % "slf4j-log4j12"      % "1.7.6",
  "com.google.guava"  % "guava"              % "18.0",
  "org.pentaho" % "pentaho-aggdesigner-algorithm"   % "5.1.5-jhyde",
  "commons-configuration" % "commons-configuration" % "1.10",
  "org.specs2"       %% "specs2-core"        % "2.4.9" % "test",
  "org.specs2"       %  "specs2-junit_2.10"  % "3.6.4" % "test",
  "junit"            %  "junit"              % "4.10",
  //"org.dbunit"       %  "dbunit"             % "2.5.1",
  "org.apache.poi"   %  "poi"                % "3.10-FINAL" % "test",
  "org.apache.poi"   %  "poi-ooxml"          % "3.10-FINAL" % "test"
  )

resolvers += "Concurrent Maven Repo" at "http://conjars.org/repo"

> test
[info] Compiling 1 Scala source to /path/to/hive-jdbc-sample/target/scala-2.10/test-classes...
15/12/05 10:54:52 INFO jdbc.Utils: Supplied authorities: *.*.*.*:10000
15/12/05 10:54:52 INFO jdbc.Utils: Resolved authority: *.*.*.*:10000
15/12/05 10:54:52 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://*.*.*.*:10000/erdrepo
15/12/05 10:54:53 WARN dataset.AbstractTableMetaData: Potential problem found: The configured data type factory 'class org.dbunit.dataset.datatype.DefaultDataTypeFactory' might cause problems with the current database 'Apache Hive' (e.g. some datatypes may not be supported properly). In rare cases you might see this message because the list of supported database products is incomplete (list=[derby]). If so please request a java-class update via the forums.If you are using your own IDataTypeFactory extending DefaultDataTypeFactory, ensure that you override getValidDbProducts() to specify the supported database products.
[info] InsertSampleSpec
[info] 
[info] + ms_tables count should be 3
[info] 
[info] Total for specification InsertSampleSpec
[info] Finished in 1 second, 218 ms
[info] 1 example, 0 failure, 0 error
[info] 
[info] Passed: Total 1, Failed 0, Errors 0, Passed 1
[success] Total time: 11 s, completed 2015/12/05 10:55:00
>

dbunitでデータ登録が出来ましたね！

5. 所感

もし本格的にやるとしたらhiveのテスト用のdbunitをブランチを切って、異なるバージョニングのもと運用する必要がある
hiveが提供するデータ型にArray, Struct等あるが、これらはそもそものdbunitにないから対応するとしたら大変だと思う（とはいえ、入れ子のデータモデリング設計をすることは定型化・標準化された業務ではあまりないと思うので心配はそんなにしていない）
そもそもテストツールとして、dbunitなのか。やりたいのは事前データ突っ込んで、クエリ流して、事後確認したいだけ。他のツール探すのがまずは先決。

もし数百人体制でSQL on Hadoopを実装するプロジェクトが発生したら必ず考えなければならないテーマには違いないと思います。

Oracleに比較してシステム的ランニングコストを圧倒的に下げることができるのであれば、DevOps的な観点でも圧倒的にコストは下げて、圧勝していきたいところです。

本日はここまでとなります。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up