More than 5 years have passed since last update.

Brainfuck -> Javaバイトコードコンパイラー

Last updated at 2018-11-01Posted at 2018-01-17

BrainfuckソースコードをJavaバイトコードにコンパイルするプログラムを作ってみたのでメモ。
詳細なJavaバイトコードの解説は行いません。
普通であればASMなどのライブラリーを使うところですが、せっかくなのでバイトを手書きしていきます。

The Java Virtual Machine Specification
クラスファイルについては4章、命令セットについては6章を参照されたし。

Brainfuckとは?

Wikipedia

命令がわずか8つしかないチューリング完全なプログラミング言語。実用性はないが面白い。
大雑把に説明すると、Brainfuck実行環境はbyteの配列とその要素を指すポインタを持ち、それを命令セットで操作していく感じ。
詳細はリンク先を参照のこと。

準備

byte配列をちまちま作るのはやっていられないので、まず、Byteをまとめて書き込むためのコンビニエンスメソッドを定義する。

/**
 * Helper class to create byte array fluently.
 */
public final class FluentByteWriter {

    private final OutputStream os;

    public FluentByteWriter(OutputStream os) {
        this.os = os;
    }

    public FluentByteWriter write(Object... bytes) {

        try {
            for (Object o : bytes) {
                // パターンマッチできない言語があるらしい
                if (o instanceof Integer) {
                    os.write((Integer) o);  // Note high-order bits are ignored
                } else if (o instanceof Byte) {
                    os.write((Byte) o);
                } else if (o instanceof byte[]) {
                    os.write((byte[]) o);
                } else if (o instanceof String) {
                    os.write(((String) o).getBytes(StandardCharsets.UTF_8));
                } else {
                    throw new UnsupportedOperationException("Unwritable class: " + o.getClass().getCanonicalName());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        return this;
    }

}

可変長引数でObjectを受け取って、その型に応じてOutputStreamに書き込んでいく。
OutputStream.write(int)は、引数のintの上位24ビットを無視して書き込むメソッド(Javaのintは4バイト)。
これがないと、Javaの16進リテラルは符号付のため、最上位の1ビットが有効な値を書き込めなくなってしまう
(例えば0xCAを書き込もうとしたとき、byteではなくint型とみなされ、byteを引数にとるメソッドでは利用できなくなる)。

次に、intを4バイト/2バイトのバイト配列に変換して書き込むヘルパーを定義。これも上位ビットは無視。

public static byte[] toByteArray4(int n) {
    return ByteBuffer.allocate(4).putInt(n).array();
}
public static byte[] toByteArray2(short n) {
    return ByteBuffer.allocate(2).putShort(n).array();
}

4バイト版だけならGuavaにInts.toByteArray()があるが、2バイト版も必要なので独自定義。
(今思うとShorts.toByteArray()でよかった)

コンパイル

出力されるバイトコードの仕様は以下の通り。

デフォルトパッケージのMainクラス
public static void main(String[] args)メソッドを持つので、javaコマンドで実行可能
Brainfuckソースコードは、mainメソッドの中身にコンパイルされる

要するに、java Mainで変換されたバイトコードを実行できるようにする。

準備

0xCA, 0xFE, 0xBA, 0xBE, // CAFEBABE
0x00, 0x00,  // miner version: 0
0x00, 0x31,  // major version: 49  // Version 49 doesn't require stack map
0x00, 0x20,  // constant pool count: 31 + 1
// constant pool
0x07, 0x00, 0x02,  // 1. class: Main
0x01, 0x00, 0x04,  // 2. utf8
"Main",
0x07, 0x00, 0x04,  // 3. class: java/lang/Object
0x01, 0x00, 0x10,  // 4. utf8
"java/lang/Object",
// System.out.print
0x09, 0x00, 0x06, 0x00, 0x08,  // 5. fieldref System.out
0x07, 0x00, 0x07,  // 6. class
0x01, 0x00, 0x10,  // 7. utf8
"java/lang/System",
0x0C, 0x00, 0x09, 0x00, 0x0A,  // 8. name and type
0x01, 0x00, 0x03,  // 9. utf8
"out",
0x01, 0x00, 0x15,  // 10. utf8
"Ljava/io/PrintStream;",
0x0A, 0x00, 0x0C, 0x00, 0x0E,  // 11. method PrintStream.print(int)
0x07, 0x00, 0x0D,  // 12. class
0x01, 0x00, 0x13,  // 13. utf8
"java/io/PrintStream",
0x0C, 0x00, 0x0F, 0x00, 0x10,  // 14. name and type
0x01, 0x00, 0x05,  // 15. utf8
"print",
0x01, 0x00, 0x04,  // 16. utf8
"(C)V",
// System.in.read(int)
0x09, 0x00, 0x06, 0x00, 0x12,  // 17. fieldref System.in
0x0C, 0x00, 0x13, 0x00, 0x14,  // 18. name and type
0x01, 0x00, 0x02,  // 19. utf8
"in",
0x01, 0x00, 0x15,  // 20. utf8
"Ljava/io/InputStream;",
0x0A, 0x00, 0x16, 0x00, 0x18,  // 21. method InputStream.read(int)
0x07, 0x00, 0x17,  // 22. class
0x01, 0x00, 0x13,  // 23. utf8
"java/io/InputStream",
0x0C, 0x00, 0x19, 0x00, 0x1A,  // 24. name and type
0x01, 0x00, 0x04,  // 25. utf8
"read",
0x01, 0x00, 0x3,  // 26. utf8
"()I",
// main
0x01, 0x00, 0x04,  // 27. utf8
"main",
0x01, 0x00, 0x16,  // 28. utf8
"([Ljava/lang/String;)V",
0x01, 0x00, 0x04,  // 29. utf8
"args",
0x01, 0x00, 0x13,  // 30. utf8
"[Ljava/lang/String;",
// "Code" for Attribute
0x01, 0x00, 0x04,  // 31. utf8
"Code",
0x00, 0x21,  // access_flags: ACC_SUPER ACC_PUBLIC
0x00, 0x01,  // this class
0x00, 0x03,  // super class
0x00, 0x00,  // interfaces count
// interfaces[]
//NOP
0x00, 0x00,  // fields count
// fields[]
// NOP
0x00, 0x01,  // method count

特筆すべきことはなし。普通のJavaバイトコードを書き出していく。

インスタンス化する必要がないので<init>は持たない。
バイトコードのバージョンが49なのは、50からスタックマップが必要になって面倒なため。
(スタックマップについてはここを参照)

        // methods[]
        // main
        0x00, 0x09,  // access flags: ACC_PUBLIC ACC_STATIC
        0x00, 0x1B,  // name index: main
        0x00, 0x1C,  // descriptor index
        0x00, 0x01,  // attributes count
        // attribute info
        0x00, 0x1F  // attribute name index: Code
);
byte[] code = compileCodes(is);
w.write(
        ByteUtils.toByteArray4(code.length + 12),  // attribute length
        // info
        0x00, 0x04,  // max stack
        0x00, 0x02,  // max locals
        // code length
        ByteUtils.toByteArray4(code.length),
        // code
        code,
        0x00, 0x00,  // exception table length
        // exception table
        // NOP
        0x00, 0x00,  // attribute count
        // attribute info[]
        // NOP
        // class attributes count
        0x00, 0x00
        // attributes
        // NOP

compileCodes()メソッドが、実際のBrainfuck-Javaバイトコード変換を行っている。

Brainfuck-Javaバイトコード変換

まず環境を準備する。


w.write(
        // creates data buffer
        0x11, 0x75, 0x30,  // sipush 30000
        0xBC, 0x0A,        // newarray int
        0x4B,              // astore_0  // ignore application arguments (String[] args)
        // creates instruction pointer
        0x03,              // iconst_0
        0x3C               // istore_1
);
w.write(
        compileCodeElements(is),
        0xB1  // return
);

最初にバッファーを作る。

sipush 30000
newarray int
astore_0

何を間違えたのかbyteではなくintの配列を作っていた。気にしない気にしない……。

続いてインストラクションポインターを作る。

iconst_0
istore_1

compileCodeElements()

いよいよ実際のコード変換。

while ((i = is.read()) >= 0) {
    switch (i) {
    case ('+'):
        // ...
        break;
    case ('-'):
        // ...
        break;
    /*
     * ....

レクサーもパーサーもいらないので楽。直接ストリームから一文字読み込み、switchでトークンを分岐していく。

+

0x2A,  // aload_0
0x1B,  // iload_1
0x5C,  // dup2
0x2E,  // iaload
0x04,  // iconst_1
0x60,  // iadd
0x4F   // iastore

dup2はスタックの上二つを複製する命令。
ialoadで配列の値を読み込み、それを保存。

やっていることは単に1を足しただけ。そのまんま。

-

0x2A,  // aload_0
0x1B,  // iload_1
0x5C,  // dup2
0x2E,  // iaload
0x02,  // iconst_m1
0x60,  // iadd
0x4F   // iastore

-1を足すようにしただけで、+とほぼ同じ。

>

0x84, 0x01, 0x01  // iinc 1 1

<

0x84, 0x01, 0xFF  // iinc 1 -1

ローカル変数はiinc命令でインクリメント/デクリメントできるので、わずか一命令で書ける。

.

0xB2, 0x00, 0x05,  // getstatic System.out
0x2A,              // aload_0
0x1B,              // iload_1
0x2E,              // iaload
0x92,              // i2c
0xB6, 0x00, 0x0B   // invokevirtual print(C)V

文字列として出力したいので、intをcharに変換してから、System.out.print(char)を呼び出す。

,

0x2A,              // aload_0
0x1B,              // iload_1
0xB2, 0x00, 0x11,  // getstatic System.in
0xB6, 0x00, 0x15,  // invokevirtual read()I
0x4F               // iastore

System.in.read()はintを返すので、そのまま配列に保存。

[

w.write((Object) compileLoop(is));

ループは別のメソッドを呼び出す。相互再帰的にcompileLoop()とcompileElement()を呼び出していけば、処理を綺麗に書けるという仕組み。

w.write(
        // load current pointer value
        0x2A,  // aload_0
        0x1B,  // iload_1
        0x2E   // iaload
);
byte[] insideLoop = compileCodeElements(is);
w.write(
        // if the current pointer indicates 0...
        // 6 = ifeq(3) + goto(3)
        0x99, ByteUtils.toByteArray2(insideLoop.length + 6),    // ifeq
        insideLoop,
        0xA7, ByteUtils.toByteArray2(-(insideLoop.length + 6))  // goto
);

まず現在のポインターが指すバッファーの値を読み込む。
ifeq命令で0かどうかをチェックし、0ならばループ終わりにジャンプし、そうでなければ処理を行う。
処理終わりにはgotoでループの最初に戻る。
コードの長さに6を足しているのは、ifeqおよびgoto命令の長さを除くため。

初めジャンプ先の指定座標が絶対だと思い込んでいたので、何故かうまく動かず悩んでいた。実際には現在地からの相対サイズを指定する。
そのほうがコード生成がずっと楽。よく考えられているなぁと感心しました(小並感)。

上で述べたように相互再帰を利用してループを処理するので、よく見かけるBFインタプリタ―のように、ループの数を数えておかなくてもジャンプ先がわかる。

]

case (']'):
    // exit loop
    return baos.toByteArray();

処理を終了してループを抜ける。
もしマッチする[がないまま呼ばれた場合、処理が終了することになってしまう。
マッチが見つからなかった場合の仕様を発見できなかったため、未定義とみなして楽な処理にした。
コンパイルエラーにするほうが親切かもしれない。

最適化

TODO

POSTDの記事を参考にいつかやる。

同じ命令の繰り返しは容易に最適化できる。例えば+命令では、iconst_1を命令の数に合わせて書き換えるだけでいい。
ループは初めから位置を固定している。
記事の「その3」([-]など)はバイトコード自身に難しいところはあまりないと思われる。どちらかといえば入力を操作するほうが面倒。

感想

思っていたより簡単だった
Javaバイトコードを出力するからと言ってJavaで書く必要はどこにもなかった。Javaは面倒。

おまけ

元記事 (本Qiita記事はこの記事のリファインです)
リポジトリ

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up