More than 5 years have passed since last update.

Hadoopだけど可変長整数さえあれば関係ないよねっ (3) ~ 独自ロジックで整数をシリアライズする ~

Posted at 2014-08-26

前回は、Hadoopの可変長整数を使って整数を効率的にシリアライズする方法を紹介したが、今回はHadoopの実装では効率が悪い場合に、独自ロジックで可変長整数をシリアライズする例を紹介する。

Hadoop可変長整数の特徴と苦手な部分

Hadoop可変長整数の特徴を把握する為に、各サイズにおける格納可能な値の範囲を固定長整数と比較してみた。

サイズ	Hadoop 可変長整数の範囲	固定長符号付整数の範囲	固定長符号なし整数の範囲
1	-112 ~127	-128 ~127	0~255
2	-256 ~255	-32,768 ~32,767	65,535
3	-65,536 ~65,535	-8,388,608 ~8,388,607	16,777,215
4	-16,777,216 ~16,777,215	-2,147,483,648 ~2,147,483,647	4,294,967,295

これを見ると、Hadoop可変長整数が以下のような特徴を持つ事がわかる。

1バイトの場合は固定長符号付整数とほぼ同じ範囲を格納できるので効率が良い。
2バイト以上になる場合には、固定長より1バイト余計に消費する。

Hadoop可変長整数は1バイトまでの整数に最適化した可変長整数といえる。

Hadoop可変長整数が苦手なサンプルデータを作ってみる

以上を踏まえた上で、Hadoopが苦手そうなサンプルデータとして、0~20,000の範囲でサイズ50のint配列を作ってみた。

IntArrayTest.java

import static org.hamcrest.CoreMatchers.*;
import static org.junit.Assert.*;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.junit.Test;

public class IntArrayTest {
    @Test
    public void test() throws IOException {
        int[] data = {
                10567, 16701,  8734, 11490, 19112,  1785, 12334,  7669, 10196, 12499,
                 9714,  3499, 13229,   982,  6779, 10034,  9888, 11378,  5541, 13914,
                17164,  9100, 13049, 10947, 17655,  3678, 12998,  6788,   119, 11245,
                11644, 14490,  8813, 14991,  5981, 15110, 10347,  9204,  5999, 19204,
                10004, 87992, 10742, 19047,  3489, 10477,  4779,  9880,  1077, 14779,
        };

        //シリアライズ
        IntArray intArray = new IntArray();
        intArray.setData(data);
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        DataOutputStream dout = new DataOutputStream(bout);
        intArray.write(dout);

        byte[] bytes = bout.toByteArray();
        System.out.println("size=" + bytes.length);

        //デシリアライズ
        ByteArrayInputStream bin = new ByteArrayInputStream(bytes);
        DataInputStream din = new DataInputStream(bin);
        intArray = new IntArray();
        intArray.readFields(din);

        //シリアライズ⇒デシリアライズ後のデータが元の値と同じであることを確認
        assertThat(intArray.getData(), is(data));
    }
}

固定長整数でシリアライズしてみる

IntArray.java

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class IntArray implements Writable {
    private int[] data;
    public void setData(int[] data) {
        this.data = data;
    }
    public int[] getData() {
        return data;
    }
    @Override
    public void readFields(DataInput in) throws IOException {
        int len = in.readInt();
        data = new int[len];
        for (int i = 0; i < len; i++) {
            data[i] = in.readInt();
        }
    }
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(data.length);
        for (int v : data) {
            out.writeInt(v);
        }
    }
}

結果

size=204

Hadoopの可変長整数でシリアライズ

IntArray.java

    @Override
    public void readFields(DataInput in) throws IOException {
        int len = WritableUtils.readVInt(in);
        data = new int[len];
        for (int i = 0; i < len; i++) {
            data[i] = WritableUtils.readVInt(in);
        }
    }
    @Override
    public void write(DataOutput out) throws IOException {
       WritableUtils.writeVInt(out, data.length);
        for (int v : data) {
            WritableUtils.writeVInt(out, v);
        }
    }

結果

size=150

符号なし2バイトまでの整数に最適化した可変長整数を作ってみた

Hdoopの実装を参考に、0~65,534の範囲を2バイトで出力できる可変長整数のシリアライズを作ってみた。

値が0~65,534(0xfffe)の場合は、2バイトで出力。
65,535(0xffff)以上の場合は、先頭2バイトに0xffff、3バイト目以降にint値を出力(合計6バイト)。

IntArray.java

    @Override
    public void readFields(DataInput in) throws IOException {
        int len = readSpecialVInt(in);
        data = new int[len];
        for (int i = 0; i < len; i++) {
            data[i] = readSpecialVInt(in);
        }
    }
    @Override
    public void write(DataOutput out) throws IOException {
       writeSpecialVInt(out, data.length);
        for (int v : data) {
            writeSpecialVInt(out, v);
        }
    }
    private int readSpecialVInt(DataInput in) throws IOException {
        int v = ((in.readByte() & 0xff) << 8) | (in.readByte() & 0xff);
        v = (0 <= v && v <= 0xfffe) ? v : in.readInt();
        return v;
    }
    private void writeSpecialVInt(DataOutput out, int v) throws IOException {
        System.out.println(Long.toHexString(v) + "->");
        if (0 <= v && v <= 0xfffe) {
            System.out.println("  2bytes");
            out.writeByte((byte) (0xff & (v >> 8)));
            out.writeByte((byte) (0xff & (v)));
        }
        else {
            System.out.println("  6bytes");
            out.writeByte(0xff);
            out.writeByte(0xff);
            out.writeInt(v);
        }
    }

結果

size=106

結果まとめ

シリアライズ方法	サイズ
固定長整数	204
Hadoopの可変長整数	150
2バイトに最適化した可変長整数	106

記事一覧

(1) Writableで可変長整数を使う
 (2) データを加工して可変長整数の効果を高める
(3) 独自ロジックで整数をシリアライズする

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up