
It's Hadoop, but as long as we have variable-length integers it doesn't matter, right? (1) ~ Using variable-length integers with Writable ~

Posted at 2014-08-23, last updated at 2014-08-26

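Hadoop serializes data through the Writable interface, and writing integers in variable-length form is likely to make the serialized data smaller.
Data size has a large effect on performance and machine load, so if you write a lot of integers, variable-length output is worth trying.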

Note: it has been a year since I last worked with Hadoop hands-on, so please point out any mistakes.

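The key points, roughly...

Hadoop provides an API for writing variable-length integers.
A variable-length integer is an integer whose stored size changes with the magnitude of its value.
Variable-length integers are often smaller than fixed-length ones.
But you cannot know the result until you actually try it.

(Next time and later)

To get the most out of variable-length integers, represent the data with small values.
There are cases where Hadoop's variable-length integers are inefficient.
Implementing a special variable-length integer optimized for your data can be even more efficient.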

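What is a variable-length integer?

In Java, int and long are fixed-length integers: as an int, a small value such as 3 and a large value such as 12345678 both occupy the same 4 bytes.

By contrast, a variable-length integer changes size with the magnitude of the value: 3 takes 1 byte and 12345678 takes 4 bytes, for example.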

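The variable-length integers that Hadoop provides produce the following number of output bytes. Each row applies to values not already covered by the row above it.

Integer value	Output bytes
-112 to 127	1
-256 to 255	2
-65536 to 65535	3
-16777216 to 16777215	4
:	:
Long.MIN_VALUE to Long.MAX_VALUE	9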

As the table shows, the smaller the value (in absolute terms), the smaller a variable-length integer becomes.
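The table can be reproduced with a few lines of plain Java. The following is only a sketch under the layout the table implies (one byte for -112 to 127, otherwise one header byte plus the minimum number of magnitude bytes, with negative values measured by their one's complement). VLongSizeSketch and sizeOf are names made up for this illustration and are not part of Hadoop.

```java
// Sketch only: re-derives the byte counts in the table with plain Java.
// VLongSizeSketch and sizeOf are hypothetical names, not Hadoop APIs.
final class VLongSizeSketch {

    // Number of bytes a value would occupy under the assumed layout.
    static int sizeOf(long value) {
        if (value >= -112 && value <= 127) {
            return 1; // small values fit in a single byte
        }
        // Assumption: negative values are measured by their one's complement,
        // so -256 needs as many magnitude bytes as 255.
        long magnitude = value < 0 ? ~value : value;
        int bits = Long.SIZE - Long.numberOfLeadingZeros(magnitude);
        return (bits + 7) / 8 + 1; // magnitude bytes plus one header byte
    }

    public static void main(String[] args) {
        long[] samples = {3, 127, 128, 255, 256, -112, -113, -256, -257,
                12345678L, Long.MIN_VALUE, Long.MAX_VALUE};
        for (long v : samples) {
            System.out.println(v + " -> " + sizeOf(v) + " byte(s)");
        }
    }
}
```

Running main lists each sample value next to its size, which makes the row boundaries of the table easy to check.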

Incidentally, the maximum long value takes 9 bytes because, for values of 2 bytes or more, the first byte stores information about the number of bytes.
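To make the role of that first byte concrete, here is a minimal encoder and decoder sketch using only java.io. It assumes a header byte of -113 to -120 for positive values and -121 to -128 for negative ones, followed by the magnitude bytes in big-endian order. VLongCodecSketch and its methods are hypothetical names for this illustration; in real code, use WritableUtils rather than this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch only: a header-byte variable-length codec, not the Hadoop source.
final class VLongCodecSketch {

    static void write(DataOutputStream out, long value) throws IOException {
        if (value >= -112 && value <= 127) {
            out.writeByte((int) value); // the byte is the value itself
            return;
        }
        boolean negative = value < 0;
        long magnitude = negative ? ~value : value;
        int count = (Long.SIZE - Long.numberOfLeadingZeros(magnitude) + 7) / 8;
        // Header encodes both the sign and the number of bytes that follow.
        out.writeByte((negative ? -120 : -112) - count);
        for (int shift = (count - 1) * 8; shift >= 0; shift -= 8) {
            out.writeByte((int) (magnitude >>> shift));
        }
    }

    static long read(DataInputStream in) throws IOException {
        byte first = in.readByte();
        if (first >= -112) {
            return first; // single-byte value
        }
        boolean negative = first < -120;
        int count = negative ? -120 - first : -112 - first;
        long magnitude = 0;
        for (int i = 0; i < count; i++) {
            magnitude = (magnitude << 8) | (in.readByte() & 0xFF);
        }
        return negative ? ~magnitude : magnitude;
    }

    // Convenience wrappers that work on byte arrays.
    static byte[] encode(long value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            write(out, value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static long decode(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return read(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout, Long.MAX_VALUE needs 8 magnitude bytes plus the header byte, which is where the 9 bytes come from.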

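Usage

If you already write fixed-length integers, replace the methods and classes you use according to the table below. Note that the WritableUtils methods are static and take the DataOutput or DataInput as an argument.

Fixed-length integer	Variable-length integer
IntWritable class	VIntWritable class
LongWritable class	VLongWritable class
DataOutput.writeInt()	WritableUtils.writeVInt()
DataOutput.writeLong()	WritableUtils.writeVLong()
DataInput.readInt()	WritableUtils.readVInt()
DataInput.readLong()	WritableUtils.readVLong()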

Checking the size of a variable-length integer

WritableUtils.getVIntSize() tells you how many bytes the value passed as its argument would occupy when serialized in variable-length form.

System.out.println(WritableUtils.getVIntSize(12345678));

Sample code

Let's create a TimestampArray class that serializes an array of timestamps and compare the results.
The test data is 10 timestamps in chronological order.

Test class

TimestampArrayTest.java

import static org.hamcrest.CoreMatchers.*;
import static org.junit.Assert.*;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

import org.junit.Test;

public class TimestampArrayTest {

    @Test
    public void test() throws Exception {
        long[] timestamps = {
                timestamp("2014-08-01 00:00:00"),
                timestamp("2014-08-01 00:05:30"),
                timestamp("2014-08-01 00:10:42"),
                timestamp("2014-08-01 00:15:15"),
                timestamp("2014-08-01 00:20:31"),
                timestamp("2014-08-01 13:15:10"),
                timestamp("2014-08-01 13:20:16"),
                timestamp("2014-08-01 13:24:47"),
                timestamp("2014-08-01 13:31:31"),
                timestamp("2014-08-01 13:40:05"),
        };

        // Serialize
        TimestampArray timestampArray = new TimestampArray();
        timestampArray.setTimestamps(timestamps);
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        DataOutputStream dout = new DataOutputStream(bout);
        timestampArray.write(dout);

        byte[] bytes = bout.toByteArray();
        System.out.println("size=" + bytes.length);

        // Deserialize
        ByteArrayInputStream bin = new ByteArrayInputStream(bytes);
        DataInputStream din = new DataInputStream(bin);
        timestampArray = new TimestampArray();
        timestampArray.readFields(din);

        // Verify that the data after the serialize/deserialize round trip equals the original values
        assertThat(timestampArray.getTimestamps(), is(timestamps));
    }

    private long timestamp(String text) {
        return LocalDateTime.parse(text, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"))
                .toInstant(ZoneOffset.ofHours(9))
                .toEpochMilli();
    }
}

Implementation that writes fixed-length integers

TimestampArray.java

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class TimestampArray implements Writable {
    private long[] timestamps;
    public void setTimestamps(long[] timestamps) {
        this.timestamps = timestamps;
    }
    public long[] getTimestamps() {
        return timestamps;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        int len = in.readInt();
        timestamps = new long[len];
        for (int i = 0; i < len; i++) {
            timestamps[i] = in.readLong();
        }
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(timestamps.length);
        for (long timestamp : timestamps) {
            out.writeLong(timestamp);
        }
    }
}

Output

size=84

Changing the implementation to write variable-length integers

TimestampArray.java

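    // Only the two methods below change. They also require:
    // import org.apache.hadoop.io.WritableUtils;
    @Override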
    public void readFields(DataInput in) throws IOException {
        int len = WritableUtils.readVInt(in);
        timestamps = new long[len];
        for (int i = 0; i < len; i++) {
            timestamps[i] = WritableUtils.readVLong(in);
        }
    }

    @Override
    public void write(DataOutput out) throws IOException {
        WritableUtils.writeVInt(out, timestamps.length);
        for (long timestamp : timestamps) {
            WritableUtils.writeVLong(out, timestamp);
        }
    }

Output

size=71

The size went from 84 bytes to 71 bytes, a modest improvement.
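Both totals can be worked out by hand. The fixed-length form is a 4-byte count plus ten 8-byte longs (4 + 80 = 84). In the variable-length form the count 10 fits in 1 byte, and each epoch-millisecond timestamp from August 2014 needs 6 magnitude bytes plus a header byte (1 + 10 × 7 = 71). The sketch below repeats that arithmetic in plain Java, assuming the size rules from the table of output byte counts. SizeBreakdownSketch and its methods are hypothetical names.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

// Sketch only: SizeBreakdownSketch and its methods are hypothetical names.
final class SizeBreakdownSketch {

    // Size rule assumed from the table of output byte counts.
    static int vlongSize(long value) {
        if (value >= -112 && value <= 127) {
            return 1;
        }
        long magnitude = value < 0 ? ~value : value;
        return (Long.SIZE - Long.numberOfLeadingZeros(magnitude) + 7) / 8 + 1;
    }

    // 4-byte length followed by 8 bytes per timestamp.
    static int fixedTotal(long[] values) {
        return Integer.BYTES + values.length * Long.BYTES;
    }

    // Variable-length length followed by a variable-length entry per timestamp.
    static int variableTotal(long[] values) {
        int total = vlongSize(values.length);
        for (long v : values) {
            total += vlongSize(v);
        }
        return total;
    }

    // The ten timestamps from the test class, as {hour, minute, second} on 2014-08-01 (UTC+9).
    static long[] sampleTimestamps() {
        int[][] times = {
            {0, 0, 0}, {0, 5, 30}, {0, 10, 42}, {0, 15, 15}, {0, 20, 31},
            {13, 15, 10}, {13, 20, 16}, {13, 24, 47}, {13, 31, 31}, {13, 40, 5},
        };
        long[] result = new long[times.length];
        for (int i = 0; i < times.length; i++) {
            result[i] = LocalDateTime.of(2014, 8, 1, times[i][0], times[i][1], times[i][2])
                    .toInstant(ZoneOffset.ofHours(9))
                    .toEpochMilli();
        }
        return result;
    }

    public static void main(String[] args) {
        long[] timestamps = sampleTimestamps();
        System.out.println("fixed=" + fixedTotal(timestamps));       // 84
        System.out.println("variable=" + variableTotal(timestamps)); // 71
    }
}
```

Almost all of the saving comes from dropping one byte per timestamp, because the values themselves are large.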

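You cannot know until you actually try

So far this has read as if variable-length integers shrink everything and make everyone happy, but in practice several factors can leave the result almost unchanged.

The values in the data are large

The data in this sample code is exactly such a case: when the values are large, variable-length integers help less.
In the worst case the output can even be larger than the fixed-length form, since a long that needs all 8 magnitude bytes takes 9 bytes.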

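Integers make up a small share of the data

For example, when handling tweets from Twitter,

Tweet timestamp: 8 bytes as a long
Tweet text: up to 140 characters, 280 bytes (depending on the character encoding)

so even if the timestamp could be represented in 1 byte, the effect on the whole is small: 288 bytes would shrink to 281, a saving of about 2.4%.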

The Map/Reduce output is compressed

Hadoop can compress Map/Reduce output, and this is commonly enabled.
In that case, if the compression ratio is already high, variable-length integers add less benefit.
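One way to get a feel for this outside a cluster is to compress both encodings yourself and compare. The sketch below does that with the JDK's DeflaterOutputStream, which is only a stand-in chosen for this illustration; the codec configured on an actual Hadoop job may behave differently. The class name, the sample data (1,000 timestamps one second apart) and the header-byte layout are all assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;

// Sketch only: compares raw and deflated sizes of two encodings.
final class CompressionComparisonSketch {

    // Fixed-length encoding: 8 bytes per value.
    static byte[] fixed(long[] values) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (long v : values) {
                out.writeLong(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Variable-length encoding under the assumed header-byte layout.
    static byte[] variable(long[] values) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (long v : values) {
                if (v >= -112 && v <= 127) {
                    out.writeByte((int) v);
                    continue;
                }
                boolean negative = v < 0;
                long magnitude = negative ? ~v : v;
                int count = (Long.SIZE - Long.numberOfLeadingZeros(magnitude) + 7) / 8;
                out.writeByte((negative ? -120 : -112) - count);
                for (int shift = (count - 1) * 8; shift >= 0; shift -= 8) {
                    out.writeByte((int) (magnitude >>> shift));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static byte[] deflate(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Hypothetical test data: n timestamps one second apart.
    static long[] sample(int n) {
        long base = 1406818800000L; // an epoch-millisecond value from around 2014-08-01
        long[] values = new long[n];
        for (int i = 0; i < n; i++) {
            values[i] = base + i * 1000L;
        }
        return values;
    }

    public static void main(String[] args) {
        long[] values = sample(1000);
        byte[] f = fixed(values);
        byte[] v = variable(values);
        System.out.println("fixed:    raw=" + f.length + " deflated=" + deflate(f).length);
        System.out.println("variable: raw=" + v.length + " deflated=" + deflate(v).length);
    }
}
```

The raw sizes differ by one byte per value; how much of that gap survives compression depends on the data and the codec, so it is worth measuring with your own records.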

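Next time...

Results of this sample

Implementation	Size (bytes)
Fixed-length integers	84
Variable-length integers	71

You might think a difference this small is pointless.
But this result comes from simply swapping fixed-length integers for variable-length ones, with no further effort.

As the proverb goes, even fools and scissors are useful if handled well.
The same is true of Hadoop and variable-length integers.

Next time I will introduce various techniques that improve the result you get from variable-length integers.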

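Article list

(1) Using variable-length integers with Writable
(2) Processing the data to make variable-length integers more effective
(3) Serializing integers with custom logic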
