Things to watch for when reusing queries between Presto and Hive: arrays #hive

A small gotcha I ran into while trying to reuse queries between Presto and Hive on Treasure Data.

Array indexes start at 1 in Presto and at 0 in Hive

The heading says it all, but here is a concrete example.

For example, suppose the agent column of a web access log holds values like

agent
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; BTRS122159;
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident/4.0;
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11

and you want to split them to cut out the browser string.

In Presto, array indexes start at 1, so

SELECT 
  agent_string_arry[1]
FROM (
    SELECT 
      SPLIT(
        agent,
        '/'
      ) AS agent_string_arry,
      agent
    FROM
      www_access
  )
LIMIT 5

returns

Mozilla
Mozilla
Mozilla
Mozilla
Mozilla

as expected. If you send the same query to Hive, however, you get

4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident
5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident
5.0 (compatible; Googlebot
4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident
5.0 (Windows NT 5.1) AppleWebKit

instead, because index 1 is the second element there.
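
A quick way to see which convention an engine uses is to subscript a split literal. This is only a small sketch and not one of the original queries: the literal 'Mozilla/5.0' and the alias first_or_second are made up for illustration, and it assumes an engine version that accepts SELECT without a FROM clause (Hive 0.13 or later should).

```sql
-- The same statement on both engines; only the meaning of [1] differs.
-- Presto: [1] is the first element  -> Mozilla
-- Hive:   [1] is the second element -> 5.0
SELECT SPLIT('Mozilla/5.0', '/')[1] AS first_or_second
```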

When sending the query to Hive, array indexes start at 0, so you need to write

SELECT 
  agent_string_arry[0]
FROM (
    SELECT 
      SPLIT(
        agent,
        '/'
      ) AS agent_string_arry,
      agent
    FROM
      www_access
  )
LIMIT 5

instead.
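
If the goal is one query text that runs unchanged on both engines, one option is to avoid array subscripts altogether. The following is a minimal sketch, not the approach used above: it assumes that regexp_extract is available on both Presto and Hive with a (string, pattern, group) argument order and Java-style regular expressions, and the alias agent_head is hypothetical.

```sql
-- Capture everything before the first '/' without building an array,
-- so no engine-specific array index appears in the query.
SELECT
  regexp_extract(agent, '^([^/]*)', 1) AS agent_head
FROM
  www_access
LIMIT 5
```

For the sample agents shown earlier this should yield Mozilla on either engine, since the pattern only anchors at the start of the string and stops at the first slash.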

Notes

The Presto documentation also lists this as a point to watch when migrating from Hive.

Arrays are indexed starting from 1, not from 0:

As that page also mentions, the ANSI SQL standard apparently specifies that array indexes start at 1.

4.11.1 Arrays
An array is a collection A in which each element is associated with exactly one ordinal position in A. If n is the cardinality of A, then the ordinal position p of an element is an integer in the range 1 (one) ≤ p ≤ n. If EDT is the element type of A, then A can thus be considered as a function of the integers in the range 1 (one) to n onto EDT.

PostgreSQL follows the ANSI SQL standard as well:

By default PostgreSQL uses a one-based numbering convention for arrays, that is, an array of n elements starts with array[1] and ends with array[n].
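
The same check can be written for PostgreSQL. This is an illustrative sketch only: the literal and the alias first_element are made up, and the parentheses around the function call are needed before a subscript can be applied to its result.

```sql
-- PostgreSQL: subscripts on an array value start at 1 by default.
SELECT (string_to_array('Mozilla/5.0', '/'))[1] AS first_element;  -- Mozilla
```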

In Hive's defense, array indexes start at 0 in C and Java, so HiveQL, which was after all implemented as a wrapper around MapReduce, may simply have been pulled in that direction.

That's all.