
Learning x86-64 Machine Code and Assembly by Implementing a Disassembler

Posted at 2024-05-15, last updated at 2024-05-15

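To pass some free time this spring, I implemented a disassembler (a tool that recovers the original assembly from a binary) using only the standard library while studying x86-64 machine code. These are the notes I took along the way.

The code is managed in the repository below. It can recover most basic instructions, but floating-point instructions and the like are only partially supported at present. The binary format is assumed to be ELF.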

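Opcodes and Operands

In assembly, the opcode identifies the kind of instruction, and the operands identify what the instruction acts on.

Example

opcode  operands
 add    eax, 0x01

Operands come in several kinds, such as those listed below. Each combination of operand kinds has an encoding, referred to as Op/En (operand encoding).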

Op/En	Operand 1	Operand 2	Operand 3	Operand 4
RM	r(8,16,32,64)	r/m(8,16,32,64)	N/A	N/A
MR	r/m(8,16,32,64)	r(8,16,32,64)	N/A	N/A
MI	r/m(8,16,32,64)	imm(8,16,32)	N/A	N/A
I	AL/AX/EAX/RAX	imm(8,16,32)	N/A	N/A

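Here, imm stands for an immediate value and r for a register, while r/m means that the addressing mode is specified by the ModR/M and SIB bytes described later.

Sites that summarize these opcode and operand combinations clearly include https://www.felixcloutier.com/x86/.

Instruction Format

In x86-64, a single instruction generally has the following format.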

|prefix | REX prefix  |      opcode      |   ModR/M    |   SIB       |   address offset    |      immediate         |
|------ |-------------|------------------|-------------|-------------|---------------------|------------------------|
|       | 0 or 1 byte | 1, 2, or 3 bytes | 0 or 1 byte | 0 or 1 byte | 0, 1, 2, or 4 bytes | 0, 1, 2, 4, or 8 bytes |

Prefix

In x86-64, variable-length prefixes can be attached to an instruction to extend its behavior or change the operand size. Examples of such prefixes follow.

Prefix Instruction

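A prefix instruction is placed before another instruction and changes the behavior of the instruction that follows it. Examples:

REP: a repeat prefix used with string instructions.
LOCK: guarantees that, when certain instructions such as ADD or XOR access memory, nothing else accesses that memory during the operation (the access is atomic).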
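As a minimal sketch, a decoder might recognize these prefix bytes as shown below. The byte values 0xF0 (LOCK), 0xF2 (REPNE) and 0xF3 (REP) are the standard encodings; the function name `prefixName` is hypothetical and is not taken from the repository.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: returns the name of a prefix instruction byte,
// or an empty string if the byte is not one of the three handled here.
std::string prefixName(std::uint8_t byte) {
    switch (byte) {
        case 0xF0: return "lock";
        case 0xF2: return "repne";  // read as "bnd" before control-flow instructions
        case 0xF3: return "rep";
        default:   return "";
    }
}
```

For instance, the byte sequence f3 a4 would be read as the rep prefix followed by movsb.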

Operand-size Prefix

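When the operand-size prefix (0x66) is present, the operand size switches from the default (32 bits for most instructions in 64-bit mode) to 16 bits.

Opcode

In x86-64, the mnemonic and the operand kinds can basically be determined from the prefix instruction, the opcode, the REX prefix described later, and the reg field of ModR/M.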

Example

   prefix    | opcode  |  reg |   instruction
-------------|---------|------|-----------------
     -       |  0x81   |  0   | ADD r/m32, imm32
     -       |  0x81   |  5   | SUB r/m32, imm32
   REX.W     |  0x81   |  0   | ADD r/m64, imm32

REX Prefix

The REX prefix is used to extend the ModR/M and SIB bytes described later, to extend the operand size, and to specify r8 to r15.

bit   |7|6|5|4|3|2|1|0|
------|-|-|-|-|-|-|-|-|
field |0|1|0|0|W|R|X|B|

Field | Bit position | Definition
------|--------------|--------------------------------------------------------------------------------
      |    7 - 4     | The upper 4 bits are 0100, which marks the byte as a REX prefix
   W  |      3       | 0 = default operand size, 1 = 64-bit operand size
   R  |      2       | Extension of the ModR/M reg field
   X  |      1       | Extension of the SIB index field
   B  |      0       | Extension of the ModR/M r/m field, the SIB base field, or the opcode reg field
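As a rough sketch, the W bit and the operand-size prefix (0x66) together determine the operand size. The helper name `operandSizeBits` is hypothetical, and the sketch assumes an instruction whose default operand size is 32 bits; pass 0 as `rexByte` when no REX prefix is present.

```cpp
#include <cstdint>

// Hypothetical helper: effective operand size in bits for an instruction
// whose default operand size is 32 bits in 64-bit mode.
int operandSizeBits(bool hasOperandSizePrefix, std::uint8_t rexByte) {
    bool isRex = (rexByte >> 4) == 0x4;     // upper nibble must be 0100
    bool rexW = isRex && (rexByte & 0x08);  // W is bit 3
    if (rexW) return 64;                    // REX.W takes precedence over 0x66
    if (hasOperandSizePrefix) return 16;
    return 32;
}
```

With this, 0x48 (REX.W) selects 64-bit operands, while 0x41 (only B set) leaves the default size unchanged.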

ModR/M (Mode Register Memory)

ModR/M specifies the register operand and the addressing mode.

 bit   |7|6|5|4|3|2|1|0|
-------|---|-----|-----|
 field |mod| reg | r/m |
-------|---|-----|-----|
 rex   |   |  r  |  b  |

The reg field specifies the register operand. When the R bit of the REX prefix is set, it specifies r8 to r15.

reg	Register (rex.r = 0)	Register (rex.r = 1)
000	RAX	R8
001	RCX	R9
010	RDX	R10
011	RBX	R11
100	RSP	R12
101	RBP	R13
110	RSI	R14
111	RDI	R15

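mod and r/m

The addressing mode is specified by the combination of the mod field, the r/m field, and the B bit of the REX prefix.

When mod is 11, the operand is a register itself, and the computation or transfer acts on that register. Otherwise, it acts on the data stored at the specified memory address.

In addition, when the r/m field is 101 and the mod field is 00, RIP-relative addressing is used, which makes PIC (position-independent code) possible.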

r/m (rex.b = 0)	mod = 00	mod = 01	mod = 10	mod = 11
000	[RAX]	[RAX + disp8]	[RAX + disp32]	RAX
001	[RCX]	[RCX + disp8]	[RCX + disp32]	RCX
010	[RDX]	[RDX + disp8]	[RDX + disp32]	RDX
011	[RBX]	[RBX + disp8]	[RBX + disp32]	RBX
100	[SIB]	[SIB + disp8]	[SIB + disp32]	RSP
101	[RIP + disp32]	[RBP + disp8]	[RBP + disp32]	RBP
110	[RSI]	[RSI + disp8]	[RSI + disp32]	RSI
111	[RDI]	[RDI + disp8]	[RDI + disp32]	RDI

r/m (rex.b = 1)	mod = 00	mod = 01	mod = 10	mod = 11
000	[R8]	[R8 + disp8]	[R8 + disp32]	R8
001	[R9]	[R9 + disp8]	[R9 + disp32]	R9
010	[R10]	[R10 + disp8]	[R10 + disp32]	R10
011	[R11]	[R11 + disp8]	[R11 + disp32]	R11
100	[SIB]	[SIB + disp8]	[SIB + disp32]	R12
101	[RIP + disp32]	[R13 + disp8]	[R13 + disp32]	R13
110	[R14]	[R14 + disp8]	[R14 + disp32]	R14
111	[R15]	[R15 + disp8]	[R15 + disp32]	R15

SIB (Scale Index Base)

|  bit  | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|-------|---|---|---|---|---|---|---|---|
| field | scale |   index   |   base    |
|-------|-------|-----------|-----------|

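When the ModR/M byte specifies SIB, a memory address can be given in the form base register + index register * scale + constant. Any general-purpose register can be used as the base register, any general-purpose register except RSP can be used as the index register, and the scale can be 1, 2, 4, or 8.

A typical use of this form is accessing an array element in a single instruction: the base register holds the start address of the array, the index register holds the element index, and the scale is the size of each array element.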
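The address computation itself can be sketched as follows. The function name `effectiveAddress` and its parameters are hypothetical; the register values are passed in as plain integers because this sketch models only the arithmetic, not the CPU state.

```cpp
#include <cstdint>

// Effective address for SIB addressing: base + index * scale + disp.
// `scaleBits` is the raw 2-bit scale field (00, 01, 10, 11 -> 1, 2, 4, 8).
std::uint64_t effectiveAddress(std::uint64_t base, std::uint64_t index,
                               unsigned scaleBits, std::int32_t disp) {
    std::uint64_t scale = 1ull << (scaleBits & 0x3);
    // The displacement is sign-extended to 64 bits before it is added.
    std::uint64_t signExtended =
        static_cast<std::uint64_t>(static_cast<std::int64_t>(disp));
    return base + index * scale + signExtended;
}
```

For an array of 4-byte elements starting at 0x1000, element 3 is reached with base = 0x1000, index = 3 and scale bits 10, which gives 0x100c.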

base

The base field specifies the base register. To specify an extended general-purpose register (r8 to r15), set the B bit of the REX prefix.

base (rex.b = 0)	mod	Register
000		RAX
001		RCX
010		RDX
011		RBX
100		RSP
101	00	disp32
101	01	RBP + disp8
101	10	RBP + disp32
110		RSI
111		RDI

base (rex.b = 1)	mod	Register
000		R8
001		R9
010		R10
011		R11
100		R12
101	00	disp32
101	01	R13 + disp8
101	10	R13 + disp32
110		R14
111		R15

index

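The index field specifies the index register. The extended general-purpose registers r8 to r15 can be specified by setting the X bit of the REX prefix.

index (rex.x = 0)	Register
000	RAX
001	RCX
010	RDX
011	RBX
100	none (RSP cannot be used as an index)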
101	RBP
110	RSI
111	RDI

index (rex.x = 1)	Register
000	R8
001	R9
010	R10
011	R11
100	R12
101	R13
110	R14
111	R15

scale

The scale field specifies the multiplier applied to the index register.

scale	Multiplier
00	1
01	2
10	4
11	8

Note that rsp is normally used to point to the top of the stack, so when the index field is 100 (rsp) and REX.X is 0, no index register is used.

Examples

0x8b, 0x88, 0x00, 0x01, 0x00, 0x00

1. 0x8b - Opcode for mov r32, r/m32.
   This operand type is RM, requiring the ModR/M byte
2. 0x88 - ModR/M byte:
      mod = 10 (memory with 32-bit displacement)
      reg = 001 (ecx)
      r/m = 000 (rax)
3. 0x00, 0x01, 0x00, 0x00 - 32-bit displacement 0x00000100.

Therefore, 0x8b, 0x88, 0x00, 0x01, 0x00, 0x00 decodes to mov ecx, [rax + 0x00000100].
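The field extraction used in this example can be sketched in code. `ModRMFields`, `splitModRM` and `readLE32` are hypothetical names introduced only for illustration; they are not the helpers used in the repository.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct ModRMFields {
    int mod;  // bits 7..6
    int reg;  // bits 5..3
    int rm;   // bits 2..0
};

// Splits a ModR/M byte into its three fields.
ModRMFields splitModRM(std::uint8_t b) {
    return {(b >> 6) & 0x3, (b >> 3) & 0x7, b & 0x7};
}

// Reads a 32-bit little-endian value starting at `pos`.
// The caller must guarantee that pos + 4 <= bytes.size().
std::uint32_t readLE32(const std::vector<std::uint8_t>& bytes,
                       std::size_t pos) {
    std::uint32_t v = 0;
    for (int i = 3; i >= 0; --i) {
        v = (v << 8) | bytes[pos + i];
    }
    return v;
}
```

Applied to the bytes above, `splitModRM(0x88)` yields mod = 2, reg = 1, r/m = 0, and `readLE32` on the last four bytes yields 0x100.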

0x41, 0x01, 0x04, 0x91

1. 0x41 - REX prefix
      W = 0
      R = 0
      X = 0
      B = 1
2. 0x01 - Opcode for add r/m32, r32.
3. 0x04 - ModR/M byte:
      mod = 00 (register indirect addressing, no displacement).
      reg = 000 (eax).
      r/m = 100 (indicates SIB follows).
4. 0x91 - SIB byte:
      scale = 10 (multiplier of 4).
      index = 010 (rdx).
      base = 001 (r9).

Therefore, 0x41, 0x01, 0x04, 0x91 decodes to add [r9 + rdx * 4], eax.
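A minimal sketch of how the REX and SIB bytes of this example combine into an address string is given below. `formatSIB` and `kReg64` are hypothetical names, and the sketch deliberately assumes mod = 00, a base other than 101, and an index other than 100, so it does not handle displacements or the no-index case.

```cpp
#include <cstdint>
#include <string>

// 64-bit register names indexed by the 4-bit number formed from
// a REX bit (high bit) and a 3-bit field (low bits).
static const char* const kReg64[16] = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
    "r8",  "r9",  "r10", "r11", "r12", "r13", "r14", "r15"};

// Formats "[base + index * scale]" from a REX byte and a SIB byte.
// Assumes mod = 00, base != 101 and index != 100 (see the lead-in).
std::string formatSIB(std::uint8_t rex, std::uint8_t sib) {
    int scale = 1 << ((sib >> 6) & 0x3);
    int index = ((sib >> 3) & 0x7) | ((rex & 0x2) ? 8 : 0);  // REX.X
    int base = (sib & 0x7) | ((rex & 0x1) ? 8 : 0);          // REX.B
    return std::string("[") + kReg64[base] + " + " + kReg64[index] +
           " * " + std::to_string(scale) + "]";
}
```

Calling `formatSIB(0x41, 0x91)` reproduces the memory operand of the example, `[r9 + rdx * 4]`.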

Disassembler

Decoding a single instruction

The disassembler implemented here decodes a single instruction one byte at a time, roughly following the steps below.

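1. If the next byte matches a prefix instruction -> a prefix instruction is present; advance 1 byte
2. If the next byte is 0x66 -> an operand-size prefix is present; advance 1 byte
3. If the upper 4 bits of the next byte are 0b0100 -> the byte is a REX prefix; advance 1 byte
4.a: If the next byte can start a multi-byte opcode -> advance 2 bytes if the next 2 bytes match a known opcode; otherwise advance 1 byte
4.b: If the next byte cannot start a multi-byte opcode -> the byte is the opcode; advance 1 byte
5. If the opcode needs the reg field of the ModR/M byte to determine the mnemonic (the kind of instruction), interpret the next byte as the ModR/M byte and determine the mnemonic.
6. If a prefix instruction was found, determine which one it is by taking the mnemonic into account (for example, if the prefix byte is 0xF2, it is bnd before a control-flow instruction and repne before a string instruction)
7. Determine the operand kinds from the prefix, the mnemonic, and the opcode
8. If the operands require ModR/M -> the next byte is the ModR/M byte; advance 1 byte
9. If ModR/M specifies a SIB byte -> the next byte is the SIB byte; advance 1 byte
10.a: If the ModR/M byte specifies an 8-bit displacement (disp8) -> the next byte is that displacement; advance 1 byte
10.b: If the ModR/M byte specifies a 32-bit displacement (disp32) -> the next 4 bytes are that displacement; advance 4 bytes
10.c: Decode the endianness, two's complement representation, and so on of the displacement obtained in 10.a or 10.b
11. Using the operand kinds, the ModR/M byte, the SIB byte, the displacement, and so on decoded so far, determine the concrete value of each operand one by one. If the operands include an immediate, advance by its size.

The process of decoding a single instruction is implemented with a struct like the following.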
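Step 10.c can be sketched as follows: the displacement bytes are assembled in little-endian order, interpreted as two's complement, and rendered with an explicit sign. The names `formatDisp`, `formatDisp8` and `formatDisp32` are hypothetical and only illustrate the idea; they are not the helpers used in the repository.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Renders a sign-extended displacement as " + 0x.." or " - 0x..".
std::string formatDisp(std::int64_t disp) {
    // Negate in unsigned arithmetic so every negative value is safe.
    std::uint64_t mag = disp < 0 ? (0 - static_cast<std::uint64_t>(disp))
                                 : static_cast<std::uint64_t>(disp);
    std::ostringstream ss;
    ss << (disp < 0 ? " - 0x" : " + 0x") << std::hex << mag;
    return ss.str();
}

// disp8 is a single byte in two's complement.
std::string formatDisp8(std::uint8_t byte) {
    return formatDisp(static_cast<std::int8_t>(byte));
}

// disp32 is four little-endian bytes in two's complement.
std::string formatDisp32(const std::uint8_t bytes[4]) {
    std::uint32_t raw = static_cast<std::uint32_t>(bytes[0]) |
                        (static_cast<std::uint32_t>(bytes[1]) << 8) |
                        (static_cast<std::uint32_t>(bytes[2]) << 16) |
                        (static_cast<std::uint32_t>(bytes[3]) << 24);
    return formatDisp(static_cast<std::int32_t>(raw));
}
```

A disp8 of 0xf8 is therefore shown as a subtraction of 8 rather than an addition of 0xf8, which matches how compilers address locals below rbp.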

/**
 * @struct State
 * @brief Represents the state of the disassembler.
 */
struct State {
    const std::vector<unsigned char>& objectSource; // byte array of the binary to disassemble
                    
                    .
                    .
                    .

    /**
     * @brief Executes a step in disassembling the instruction.
     * @param startAddr The starting address of the instruction.
     * @return The disassembled result.
     */
    DisassembledResult step(uint64_t startAddr) {
        // ############### Initialize ##############################
        curAddr = startAddr;

        // Process the prefixes
        parsePrefixInstructions();
        parseSegmentOverridePrefix();
        parsePrefix();

        // Process the REX prefix
        parseREX();

        // Process the opcode
        parseOpecode();

        // Process ModR/M
        parseModRM();

        // Process the SIB byte
        parseSIB();

        // Process the address offset (displacement) of the addressing mode
        parseAddressOffset();

        // Process the operands

                .
                .
                .
    }
};

First, the REX prefix is processed.

/**
 * @struct REX
 * @brief Represents the REX prefix byte in x86 instruction encoding.
 */
struct REX {
    bool rexB;
    bool rexX;
    bool rexR;
    bool rexW;

    /**
     * @brief Default constructor for REX.
     */
    REX() : rexB(false), rexX(false), rexR(false), rexW(false) {}

    /**
     * @brief Constructor for REX with byte parameter.
     * @param rexByte The REX prefix byte.
     */
    REX(unsigned char rexByte) {
        rexB = (rexByte & 0x1) == 0x1;
        rexX = (rexByte & 0x2) == 0x2;
        rexR = (rexByte & 0x4) == 0x4;
        rexW = (rexByte & 0x8) == 0x8;
    }
};

/**
    * @brief Parses the REX prefix.
    */
void parseREX() {
    // The format of REX prefix is 0100|W|R|X|B
    // (check the bounds first so an empty tail is never read)
    if (curAddr < objectSource.size() &&
        (objectSource[curAddr] >> 4) == 4) {
        hasREX = true;
        rex = REX(objectSource[curAddr]);
        instructionLen += 1;
        curAddr += 1;

        if (rex.rexW) {
            prefix = Prefix::REXW;
        } else {
            prefix = Prefix::REX;
        }
    }
}

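Next, the kind of instruction is determined. As mentioned above, the instruction is identified using the opcode, the REX prefix, and, when a ModR/M byte follows the opcode, its reg field. Once the instruction is identified, the prefix instruction placed before it can be determined as well.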

/**
    * @brief Parses the opcode byte.
    */
void State::parseOpecode() {
    // consume the first opcode byte
    opcodeByte = objectSource[curAddr];
    instructionLen += 1;
    curAddr += 1;

    // Candidate two-byte opcode; stays -1 when no byte is left to read.
    int potentialOpCodeByte = -1;
    if (curAddr < objectSource.size()) {
        potentialOpCodeByte = (opcodeByte << 8) + objectSource[curAddr];
    }

    if (potentialOpCodeByte >= 0 &&
        (TWO_BYTES_OPCODE_PREFIX.find(opcodeByte) !=
            TWO_BYTES_OPCODE_PREFIX.end()) &&
        ((OP_LOOKUP.find(std::make_pair(prefix, potentialOpCodeByte)) !=
            OP_LOOKUP.end()) ||
            (prefix == Prefix::REXW &&
            OP_LOOKUP.find(std::make_pair(
                Prefix::REX, potentialOpCodeByte)) != OP_LOOKUP.end()) ||
            (prefix == Prefix::REX &&
            OP_LOOKUP.find(std::make_pair(
                Prefix::NONE, potentialOpCodeByte)) != OP_LOOKUP.end()))) {
        opcodeByte = potentialOpCodeByte;
        instructionLen += 1;
        curAddr += 1;
    }

    // (prefix, opcode) -> (reg, mnemonic)
    std::unordered_map<int, Mnemonic> reg2mnem;
    if (OP_LOOKUP.find(std::make_pair(prefix, opcodeByte)) !=
        OP_LOOKUP.end()) {
        reg2mnem = OP_LOOKUP.at(std::make_pair(prefix, opcodeByte));
    } else if (prefix == Prefix::REXW &&
                OP_LOOKUP.find(std::make_pair(Prefix::REX, opcodeByte)) !=
                    OP_LOOKUP.end()) {
        reg2mnem = OP_LOOKUP.at(std::make_pair(Prefix::REX, opcodeByte));
        prefix = Prefix::REX;
    } else if (prefix == Prefix::REX &&
                OP_LOOKUP.find(std::make_pair(Prefix::NONE, opcodeByte)) !=
                    OP_LOOKUP.end()) {
        reg2mnem = OP_LOOKUP.at(std::make_pair(Prefix::NONE, opcodeByte));
        prefix = Prefix::NONE;
    } else {
        std::stringstream ss;
        ss << std::hex << opcodeByte;
        throw OPCODE_LOOKUP_ERROR(
            "Unknown combination of the prefix and the opcodeByte: (" +
            to_string(prefix) + ", " + ss.str() + ")");
    }

    // We sometimes need the reg field of ModR/M to determine the mnemonic
    // e.g. 83 /4 -> AND
    //      83 /1 -> OR
    if (curAddr < objectSource.size()) {
        modrmByte = objectSource[curAddr];
    }

    if (modrmByte >= 0) {
        int reg = (modrmByte >> 3) & 0x7;
        mnemonic = (reg2mnem.find(reg) != reg2mnem.end()) ? reg2mnem.at(reg)
                                                            : reg2mnem.at(-1);
    } else {
        mnemonic = reg2mnem.at(-1);
    }

    if (hasInstructionPrefix) {
        if (instructionPrefixByte == 0xF0) {
            disassembledInstruction.emplace_back("lock");
        } else if (instructionPrefixByte == 0xF2) {
            if (isControlFlowInstruction(mnemonic)) {
                disassembledInstruction.emplace_back("bnd");
            } else {
                disassembledInstruction.emplace_back("repne");
            }
        } else if (instructionPrefixByte == 0xF3) {
            disassembledInstruction.emplace_back("rep");
        } else if (instructionPrefixByte == 0x3E) {
            disassembledInstruction.emplace_back("notrack");
        }
    }

    disassembledInstruction.emplace_back(to_string(mnemonic));

    if (OPERAND_LOOKUP.find(std::make_tuple(
            prefix, mnemonic, opcodeByte)) != OPERAND_LOOKUP.end()) {
        std::tuple<OpEnc, std::vector<std::string>, std::vector<Operand>>
            res = OPERAND_LOOKUP.at(
                std::make_tuple(prefix, mnemonic, opcodeByte));
        opEnc = std::get<0>(res);
        remOps = std::get<1>(res);
        operands = std::get<2>(res);
    } else {
        std::stringstream ss;
        ss << std::hex << opcodeByte;
        throw OPERAND_LOOKUP_ERROR(
            "Unknown combination of prefix, mnemonic and opcodeByte: (" +
            to_string(prefix) + ", " + to_string(mnemonic) + ", " +
            ss.str() + ")");
    }
}

ModR/M can be processed as follows.

/**
 * @struct ModRM
 * @brief Represents the ModRM byte in x86 instruction encoding.
 */
struct ModRM {
    REX rex;
    int modByte;
    int regByte;
    int rmByte;

    bool hasDisp8;
    bool hasDisp32;
    bool hasSib;

    /**
     * @brief Default constructor for ModRM.
     */
    ModRM()
        : modByte(0),
          regByte(0),
          rmByte(0),
          hasDisp8(false),
          hasDisp32(false),
          hasSib(false) {}

    /**
     * @brief Constructor for ModRM with byte parameter.
     * @param modrmByte The ModRM byte.
     * @param rex The associated REX prefix.
     */
    ModRM(unsigned char modrmByte, REX rex) : rex(rex) {
        rmByte = modrmByte & 0x7;
        regByte = (modrmByte >> 3) & 0x7;
        modByte = (modrmByte >> 6) & 0x3;
        hasSib = false;

        if (modByte < 3 && rmByte == 4) {
            hasSib = true;
        }
        switch (modByte) {
            case 0: {
                hasDisp8 = false;
                hasDisp32 = false;
                break;
            }
            case 1: {
                hasDisp8 = true;
                hasDisp32 = false;
                break;
            }
            case 2: {
                hasDisp8 = false;
                hasDisp32 = true;
                break;
            }
            case 3: {
                hasDisp8 = false;
                hasDisp32 = false;
                break;
            }
        }

        if (modByte == 0 && rmByte == 5) {
            hasDisp8 = false;
            hasDisp32 = true;
        }
    }

    /**
     * @brief Gets the register name based on the operand type.
     * @param operand The operand type.
     * @return The register name.
     */
    std::string getReg(Operand operand) {
        if (operand == Operand::xmm) {
            return "xmm" + std::to_string(regByte + (rex.rexR ? 8 : 0));
        } else {
            return operand2register(operand)->at(regByte + (rex.rexR ? 8 : 0));
        }
    }

    /**
     * @brief Generates the addressing mode string.
     * @param operand The operand type.
     * @param disp8 The 8-bit displacement.
     * @param disp32 The 32-bit displacement.
     * @return The addressing mode string.
     */
    std::string getAddrMode(Operand operand, std::string disp8,
                            std::string disp32) {
        std::string addrBaseReg;
        if (modByte == 3) {
            if (operand == Operand::xm128) {
                addrBaseReg =
                    "xmm" + std::to_string(rmByte + (rex.rexB ? 8 : 0));
            } else {
                addrBaseReg =
                    operand2register(operand)->at(rmByte + (rex.rexB ? 8 : 0));
            }
        } else {
            if (operand == Operand::xm128) {
                addrBaseReg =
                    "xmm" + std::to_string(rmByte + (rex.rexB ? 8 : 0));
            } else {
                addrBaseReg = REGISTERS64.at(rmByte + (rex.rexB ? 8 : 0));
            }
        }

        if (modByte < 3 && rmByte == 4) {
            addrBaseReg = "SIB";
        }

        std::string addressingMode;
        switch (modByte) {
            case 0: {
                addressingMode = "[" + addrBaseReg + "]";
                break;
            }
            case 1: {
                addressingMode = "[" + addrBaseReg + disp8 + "]";
                break;
            }
            case 2: {
                addressingMode = "[" + addrBaseReg + disp32 + "]";
                break;
            }
            case 3: {
                addressingMode = addrBaseReg;
                break;
            }
        }

        if (modByte == 0 && rmByte == 5) {
            addressingMode = "[rip" + disp32 + "]";
        }

        return addressingMode;
    }
};

/**
    * @brief Parses the ModRM byte.
    */
void State::parseModRM() {
    if (hasModrm(opEnc)) {
        if (modrmByte < 0) {
            throw std::runtime_error(
                "Expected ModRM byte but there aren't any bytes left.");
        }
        instructionLen += 1;
        curAddr += 1;
        modrm = ModRM(modrmByte, rex);
    }
}

The SIB byte can be processed as follows.

/**
 * @struct SIB
 * @brief Represents the SIB (Scale-Index-Base) byte in x86 instruction
 * encoding.
 */
struct SIB {
    unsigned char scaleByte;
    unsigned char indexByte;
    unsigned char baseByte;
    unsigned char modByte;
    REX rex;

    std::string address;
    std::string addrBaseReg, indexReg;
    int scale;
    bool hasDisp8;
    bool hasDisp32;

    /**
     * @brief Default constructor for SIB.
     */
    SIB() : hasDisp8(false), hasDisp32(false) {}

    /**
     * @brief Constructor for SIB with byte parameters.
     * @param sibByte The SIB byte.
     * @param modByte The associated mod field from ModRM.
     * @param rex The associated REX prefix.
     */
    SIB(unsigned char sibByte, unsigned char modByte, REX rex)
        : modByte(modByte), rex(rex) {
        scaleByte = (sibByte >> 6) & 0x3;
        indexByte = (sibByte >> 3) & 0x7;
        baseByte = sibByte & 0x7;

        hasDisp8 = baseByte == 5 && modByte == 1;
        hasDisp32 = baseByte == 5 && modByte != 1;
    }

    /**
     * @brief Generates the address string.
     * @param operand The operand type.
     * @param disp8 The 8-bit displacement.
     * @param disp32 The 32-bit displacement.
     * @return The address string.
     */
    std::string getAddr(Operand operand, std::string disp8,
                        std::string disp32) {
        std::string offset = "";

        if (baseByte == 5) {
            switch (modByte) {
                case 0: {
                    if (disp32.size() > 2) {
                        if (disp32[1] == '+') {
                            addrBaseReg = disp32.substr(3, disp32.size() - 3);
                        } else {
                            addrBaseReg =
                                "-" + disp32.substr(3, disp32.size() - 3);
                        }
                    }
                    break;
                }
                case 1: {
                    addrBaseReg = rex.rexB ? "r13" : "rbp";
                    offset = disp8;
                    break;
                }
                case 2: {
                    addrBaseReg = rex.rexB ? "r13" : "rbp";
                    offset = disp32;
                    break;
                }
            }
        } else {
            addrBaseReg = REGISTERS64.at(baseByte + (rex.rexB ? 8 : 0));
            if (modByte == 1) {
                offset = disp8;
            } else if (modByte == 2) {
                offset = disp32;
            }
        }

        if (modByte == 0 && baseByte == 5 && indexByte == 4) {
            address = addrBaseReg;
        } else if (indexByte == 4 && (!rex.rexX)) {
            address = "[" + addrBaseReg + offset + "]";
        } else {
            indexReg = REGISTERS64.at(indexByte + (rex.rexX ? 8 : 0));
            scale = SCALE_FACTOR.at(scaleByte);
            address = "[" + addrBaseReg + " + " + indexReg + " * " +
                      std::to_string(scale) + offset + "]";
        }
        return address;
    }
};

/**
    * @brief Parses the SIB byte.
    */
void State::parseSIB() {
    if (hasModrm(opEnc) && modrm.hasSib) {
        // eat the sib (1 byte)
        if (curAddr < objectSource.size()) {
            sibByte = objectSource[curAddr];
        }
        if (sibByte < 0) {
            throw std::runtime_error(
                "Expected SIB byte but there aren't any bytes left.");
        }
        sib = SIB(sibByte, modrm.modByte, rex);
        instructionLen += 1;
        curAddr += 1;
    }
}

When the ModR/M byte or the SIB byte specifies a displacement, it is processed as follows.

/**
    * @brief Parses the address offset.
    */
void parseAddressOffset() {
    if ((hasModrm(opEnc) && modrm.hasDisp8) ||
        (hasModrm(opEnc) && modrm.hasSib && sib.hasDisp8) ||
        (hasModrm(opEnc) && modrm.hasSib && modrm.modByte == 1 &&
            sib.baseByte == 5)) {
        std::stringstream ss1;
        ss1 << std::hex << (int)objectSource[curAddr];
        disp8 = "0x" + ss1.str();

        long long decoded_disp8 = decodeOffset(disp8);
        if (decoded_disp8 < 0) {
            std::stringstream ss2;
            ss2 << std::hex << (-1 * decoded_disp8);
            disp8 = " - 0x" + ss2.str();
        } else {
            disp8 = " + " + disp8;
        }

        hasDisp8 = true;
        instructionLen += 1;
        curAddr += 1;
    }

    if ((hasModrm(opEnc) && modrm.hasDisp32) ||
        (hasModrm(opEnc) && modrm.hasSib && sib.hasDisp32) ||
        (hasModrm(opEnc) && modrm.hasSib &&
            (modrm.modByte == 0 || modrm.modByte == 2) && sib.baseByte == 5)) {
        std::vector<uint8_t> _disp32 =
            std::vector<uint8_t>(objectSource.begin() + curAddr,
                                    objectSource.begin() + curAddr + 4);
        std::reverse(_disp32.begin(), _disp32.end());
        std::stringstream ss;
        ss << "0x";
        for (unsigned char x : _disp32) {
            ss << std::hex << std::setw(2) << std::setfill('0')
                << static_cast<int>(x);
        }
        disp32 = ss.str();

        long long decoded_disp32 = decodeOffset(disp32);
        if (decoded_disp32 < 0) {
            std::stringstream ss;
            ss << std::hex << (-1 * decoded_disp32);
            disp32 = " - 0x" + ss.str();
        } else {
            disp32 = " + " + disp32;
        }

        hasDisp32 = true;
        instructionLen += 4;
        curAddr += 4;
    }
}

Each operand is processed as follows.

// ############### Process Operands ################
std::vector<uint8_t> imm;
for (Operand operand : operands) {
    std::string decodedTranslatedValue;

    if (isA_REG(operand) || operand == Operand::cl ||
        operand == Operand::dx) {
        decodedTranslatedValue = to_string(operand);
    } else if (operand == Operand::sti) {
        decodedTranslatedValue = "st(" + remOps[0] + ")";
    } else if (isRM(operand) || isREG(operand) || isM(operand)) {
        if (hasModrm(opEnc)) {
            if (isRM(operand) || isM(operand)) {
                decodedTranslatedValue =
                    modrm.getAddrMode(operand, disp8, disp32);
            } else {
                decodedTranslatedValue = modrm.getReg(operand);
            }
        } else {
            int regIdx = (hasREX && rex.rexB) ? std::stoi(remOps[0]) + 8
                                                : std::stoi(remOps[0]);
            if (is8Bit(operand)) {
                decodedTranslatedValue = REGISTERS8.at(regIdx);
            } else if (is16Bit(operand)) {
                decodedTranslatedValue = REGISTERS16.at(regIdx);
            } else if (is32Bit(operand)) {
                decodedTranslatedValue = REGISTERS32.at(regIdx);
            } else if (is64Bit(operand)) {
                decodedTranslatedValue = REGISTERS64.at(regIdx);
            } else if (operand == Operand::xm128) {
                decodedTranslatedValue = "xmm" + remOps[0];
            }
        }

        if ((isRM(operand) || isM(operand)) && hasModrm(opEnc) &&
            modrm.hasSib) {
            decodedTranslatedValue =
                sib.getAddr(operand, disp8, disp32);
        }
        if (hasSegmentOverridePrefix) {
            decodedTranslatedValue =
                prefixSegmentOverrideStr + ":" + decodedTranslatedValue;
        }
    } else if (isIMM(operand)) {
        int immSize = 0;
        if (operand == Operand::imm64 || operand == Operand::ymm) {
            immSize = 8;
        } else if (operand == Operand::imm32 ||
                    operand == Operand::xmm) {
            immSize = 4;
        } else if (operand == Operand::imm16) {
            immSize = 2;
        } else if (operand == Operand::imm8) {
            immSize = 1;
        }
        imm = std::vector<uint8_t>(
            objectSource.begin() + curAddr,
            objectSource.begin() + curAddr + immSize);
        std::reverse(imm.begin(), imm.end());
        instructionLen += immSize;
        curAddr += immSize;

        std::stringstream ss;
        ss << "0x";
        for (unsigned char x : imm) {
            ss << std::hex << std::setw(2) << std::setfill('0')
                << static_cast<int>(x);
        }
        decodedTranslatedValue = ss.str();
    }

    disassembledOperands.emplace_back(decodedTranslatedValue);
}

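For disassembling an entire binary, two strategies are typical: Linear Sweeping, which processes instructions sequentially from the start to the end of the code segment, and Recursive Descent, which follows the control flow according to the control-flow instructions.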

## Linear Sweeping Disassembler Pseudo Code

Initialize:
    instruction_pointer = start_address

While instruction_pointer < end_address:
    Fetch the instruction at instruction_pointer
    Decode the instruction
    Update instruction_pointer to point to the next instruction
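
The linear sweep loop can be sketched in C++ as follows. The instruction decoder is abstracted as a callback that returns the length of the instruction at a given offset; `LengthDecoder` and `linearSweep` are hypothetical names, and a real decoder would also produce the disassembled text.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical callback: length in bytes of the instruction at `offset`.
using LengthDecoder = std::function<std::size_t(
    const std::vector<std::uint8_t>&, std::size_t)>;

// Returns the start offset of every instruction found by a linear sweep.
std::vector<std::size_t> linearSweep(const std::vector<std::uint8_t>& code,
                                     const LengthDecoder& decodeLength) {
    std::vector<std::size_t> starts;
    std::size_t ip = 0;
    while (ip < code.size()) {
        starts.push_back(ip);
        std::size_t len = decodeLength(code, ip);
        if (len == 0) break;  // guard against a decoder that makes no progress
        ip += len;
    }
    return starts;
}
```

Because the sweep never looks at control flow, data embedded in the code segment is decoded as if it were instructions, which is the main weakness of this strategy.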

## Recursive Descent Disassembler Pseudo Code

Function disassemble(address):
    If address is in visited:
        Return

    Fetch the instruction at address
    Decode the instruction
    visited.add(address)

    If instruction is a return:
      Return
    Else if instruction is an unconditional jump:
      Add target address to worklist
    Else if instruction is a conditional branch or a call:
      Add target address to worklist
      Add next sequential address to worklist
    Else:
      Add next sequential address to worklist

Initialize:
    worklist = [start_address]
    visited = {}

While worklist is not empty:
    address = worklist.pop()
    disassemble(address)
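
A worklist-based recursive descent can be sketched in C++ as follows. The `Insn` summary and the `Decoder` callback are hypothetical stand-ins for a real instruction decoder, and the sketch assumes that jump and call targets are known statically (indirect branches are simply not followed).

```cpp
#include <cstdint>
#include <functional>
#include <set>
#include <vector>

// Hypothetical summary of one decoded instruction.
struct Insn {
    bool valid;            // false if nothing could be decoded at the address
    std::uint64_t length;  // instruction length in bytes
    bool fallsThrough;     // false for ret and unconditional jmp
    bool hasTarget;        // true for jmp, jcc and call with a known target
    std::uint64_t target;
};

using Decoder = std::function<Insn(std::uint64_t)>;

// Returns the set of instruction addresses reachable from `start`.
std::set<std::uint64_t> recursiveDescent(std::uint64_t start,
                                         const Decoder& decode) {
    std::set<std::uint64_t> visited;
    std::vector<std::uint64_t> worklist{start};
    while (!worklist.empty()) {
        std::uint64_t addr = worklist.back();
        worklist.pop_back();
        if (visited.count(addr)) continue;
        Insn insn = decode(addr);
        if (!insn.valid) continue;  // outside the code or undecodable
        visited.insert(addr);
        if (insn.hasTarget) worklist.push_back(insn.target);
        if (insn.fallsThrough) worklist.push_back(addr + insn.length);
    }
    return visited;
}
```

Bytes that no control-flow path reaches, such as padding after an unconditional jump, are never decoded, which is the main difference from the linear sweep.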

The actual code is omitted here because handling the various exceptions makes it fairly convoluted.
