An example where reading pointed-to data into temporary variables defeated optimization #C++

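A pattern where reading the data behind a pointer into temporary variables kept the optimizer from doing its job and made the code dramatically slower.
I only noticed after staring at the assembly for a while.
I had completely forgotten that this was even a thing.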

Source code

Slow version

Before.cpp

class BufferAnalyzeDirect
{
public:
    static const size_t msgOfs_ = offsetof(Header, messageByte);
    static const size_t optOfs_ = offsetof(Header, optionByte);
    void operator()(uint8_t* b, Result & r)
    {
        uint16_t msgByte = *reinterpret_cast<uint16_t*>(&b[msgOfs_]);
        uint16_t optByte = *reinterpret_cast<uint16_t*>(&b[optOfs_]);
        r.requestStatus = *reinterpret_cast<uint16_t*>(&b[4 + 4]);
        r.optionCode = *reinterpret_cast<uint16_t*>(&b[4 + 4 + msgByte]);
        r.count = *reinterpret_cast<uint16_t*>(&b[4 + 4 + msgByte + optByte]);
    }
};

I had declared the variables at the top of the scope, C-style.

Fast version

After.cpp

class BufferAnalyzeDirect
{
public:
    static const size_t msgOfs_ = offsetof(Header, messageByte);
    static const size_t optOfs_ = offsetof(Header, optionByte);
    void operator()(uint8_t* b, Result & r)
    {
        r.requestStatus = *reinterpret_cast<uint16_t*>(&b[4 + 4]);
        uint16_t msgByte = *reinterpret_cast<uint16_t*>(&b[msgOfs_]);
        r.optionCode = *reinterpret_cast<uint16_t*>(&b[4 + 4 + msgByte]);
        uint16_t optByte = *reinterpret_cast<uint16_t*>(&b[optOfs_]);
        r.count = *reinterpret_cast<uint16_t*>(&b[4 + 4 + msgByte + optByte]);
    }
};

The only difference is where the temporary variables are read.

Results

Results for 100,000,000 iterations:

Before:      0.198654sec
After:       0.096776sec
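
The post does not show the timing harness, so the following is only a minimal sketch of how such a measurement could be taken with `std::chrono`. Everything in it is an assumption made for illustration: the `Result` layout, the helper names `load16`, `store16`, `makeBuffer`, and `measure`, and the test data. The two length fields are set to 32 so that the reads land at offsets 40 and 72, the offsets that appear in the fast assembly listing. The real input is not given in the post. The loads use `std::memcpy` instead of `reinterpret_cast` to keep the sketch free of alignment and aliasing concerns.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed result layout: three consecutive 16-bit fields.
struct Result { uint16_t requestStatus; uint16_t optionCode; uint16_t count; };

// Read / write a 16-bit value at an arbitrary byte address.
static uint16_t load16(const uint8_t* p) { uint16_t v; std::memcpy(&v, p, sizeof v); return v; }
static void store16(uint8_t* p, uint16_t v) { std::memcpy(p, &v, sizeof v); }

// Same read order as the fast version: each length is read right before use.
void analyze(const uint8_t* b, Result& r) {
    r.requestStatus = load16(b + 8);
    uint16_t msgByte = load16(b + 0);   // messageByte, assumed at offset 0
    r.optionCode = load16(b + 8 + msgByte);
    uint16_t optByte = load16(b + 2);   // optionByte, assumed at offset 2
    r.count = load16(b + 8 + msgByte + optByte);
}

// Hypothetical 80-byte test buffer with both length fields set to 32.
std::vector<uint8_t> makeBuffer() {
    std::vector<uint8_t> buf(80, 0);
    store16(buf.data() + 0, 32);
    store16(buf.data() + 2, 32);
    store16(buf.data() + 8, 1);    // requestStatus
    store16(buf.data() + 40, 2);   // optionCode at 8 + 32
    store16(buf.data() + 72, 3);   // count at 8 + 32 + 32
    return buf;
}

// Returns the elapsed seconds for `iterations` calls.
// The checksum is accumulated so that the parsed result is actually used.
double measure(const std::vector<uint8_t>& buf, long iterations, uint32_t& checksum) {
    assert(buf.size() >= 74);
    Result r{};
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        analyze(buf.data(), r);
        checksum += r.count;
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```

Calling `measure(makeBuffer(), 100000000, checksum)` once per variant and printing the returned seconds would give numbers in the same form as above. An optimizer may still simplify a loop like this, so the absolute figures depend heavily on compiler and flags.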

Assembly

Slow version

Before.asm

push    ebp
mov ebp, esp
push    esi
mov esi, DWORD PTR _r$[ebp]
push    edi
mov edi, DWORD PTR _b$[ebp]
movzx   ecx, WORD PTR [edi]
movzx   eax, WORD PTR [edi+8]
movzx   edx, WORD PTR [edi+2]
mov WORD PTR [esi], ax
movzx   eax, WORD PTR [ecx+edi+8]
mov WORD PTR [esi+2], ax
lea eax, DWORD PTR [ecx+edx]
movzx   eax, WORD PTR [eax+edi+8]
pop edi
mov WORD PTR [esi+4], ax
pop esi
pop ebp
ret 8

Fast version

After.asm

push    ebp
mov ebp, esp
mov edx, DWORD PTR _b$[ebp]
mov ecx, DWORD PTR _r$[ebp]
movzx   eax, WORD PTR [edx+8]
mov WORD PTR [ecx], ax
movzx   eax, WORD PTR [edx+40]
mov WORD PTR [ecx+2], ax
movzx   eax, WORD PTR [edx+72]
mov WORD PTR [ecx+4], ax
pop ebp
ret 8

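Analysis

Because the data is fetched through a pointer, the compiler seems to be playing it safe: the data might change, so each value has to be loaded at exactly that point and held from then on.
That seems to be where the extra push, pop, and lea come from.
Loading each value right before it is used seems to have reduced the traffic to and from the stack, which made it faster.

Or so it seems?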
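
The "data might change" concern can be made concrete. The sketch below is not from the original code. It uses a hypothetical `Packet` type whose first bytes are also the `Result` object, so the stores to `r` overwrite the two length fields in the buffer. `parseEarly` follows the read order of the slow version and `parseLate` that of the fast version. With this deliberately overlapping input the two orders give different answers, which is why a compiler that cannot prove `b` and `r` are separate has to keep every load on its own side of every store. All names and the data layout here are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Result { uint16_t requestStatus; uint16_t optionCode; uint16_t count; };

// Hypothetical packet: the Result object occupies the first 6 bytes of the buffer.
struct Packet { Result r; uint8_t rest[32]; };

static uint16_t load16(const uint8_t* p) { uint16_t v; std::memcpy(&v, p, sizeof v); return v; }

// Slow-version order: both lengths are read before any store to r.
void parseEarly(const uint8_t* b, Result& r) {
    uint16_t msgByte = load16(b + 0);
    uint16_t optByte = load16(b + 2);
    r.requestStatus = load16(b + 8);
    r.optionCode = load16(b + 8 + msgByte);
    r.count = load16(b + 8 + msgByte + optByte);
}

// Fast-version order: each length is read right before it is used.
void parseLate(const uint8_t* b, Result& r) {
    r.requestStatus = load16(b + 8);
    uint16_t msgByte = load16(b + 0);
    r.optionCode = load16(b + 8 + msgByte);
    uint16_t optByte = load16(b + 2);
    r.count = load16(b + 8 + msgByte + optByte);
}

// Length fields are 2 and 4; every 16-bit word from offset 8 on holds its own offset.
Packet makePacket() {
    Packet p{};
    p.r.requestStatus = 2;  // overlaps the message length at buffer offset 0
    p.r.optionCode = 4;     // overlaps the option length at buffer offset 2
    uint8_t* bytes = reinterpret_cast<uint8_t*>(&p);
    for (std::size_t off = 8; off + 2 <= sizeof(Packet); off += 2) {
        uint16_t v = static_cast<uint16_t>(off);
        std::memcpy(bytes + off, &v, sizeof v);
    }
    assert(load16(bytes + 8) == 8);
    return p;
}

Result runEarlyAliased() {
    Packet p = makePacket();
    parseEarly(reinterpret_cast<const uint8_t*>(&p), p.r);
    return p.r;
}

Result runLateAliased() {
    Packet p = makePacket();
    parseLate(reinterpret_cast<const uint8_t*>(&p), p.r);
    return p.r;
}
```

`runEarlyAliased()` yields `{8, 10, 14}`, because the lengths 2 and 4 were captured before the stores. `runLateAliased()` yields `{8, 16, 32}`, because each store changed the length that was read next. When the buffer and the result never overlap, as in the original code, both orders produce the same values.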