
Tips for Handling Binary Data in Python

Last updated at 2017-03-18, posted at 2017-02-11

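These are tips for working with binary data in Python.

There are two ways to handle binary data in Python: the struct module and the ctypes.Structure class.
As a rule of thumb, use the struct module when you only need to handle a few bytes, and the ctypes.Structure class when you have more data than that or need to interoperate with C/C++.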

The `struct` module

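As an example, let's read the binary contents of a PNG file. The first 8 bytes of a PNG file are a fixed signature.
The 18 bytes starting at the 9th byte belong to the IHDR chunk (strictly speaking, only part of it) and hold the chunk length and type followed by the image width, height, bit depth, and color type.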

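import struct

# Read the whole file as bytes; close the file when done.
with open("sample.png", "rb") as f:
    png_data = f.read()

# Skip the 8-byte signature, then read (big-endian):
# chunk length, chunk type, width, height, bit depth, color type.
struct.unpack_from(">I4sIIBB", png_data, 8)
# (13, b'IHDR', 250, 156, 8, 2)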

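You could also read the data with struct.unpack, but it raises an error unless the buffer you pass is exactly the size the format requires.
When you only want to read part of the data, struct.unpack_from is more convenient.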
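
The difference between the two functions can be sketched as follows. The buffer here is made up for illustration (a 4-byte tag followed by two big-endian 32-bit integers); it is not taken from a real file format.

```python
import struct

# Hypothetical 12-byte buffer: a 4-byte tag plus two big-endian uint32 values.
buf = b"HEAD" + struct.pack(">II", 640, 480)

# struct.unpack needs len(buf) == struct.calcsize(fmt), so a partial read fails.
try:
    struct.unpack(">I", buf)
    unpack_failed = False
except struct.error:
    unpack_failed = True

# struct.unpack_from only needs enough bytes after the given offset.
(width,) = struct.unpack_from(">I", buf, 4)
(height,) = struct.unpack_from(">I", buf, 8)

print(unpack_failed, width, height)
```

With struct.unpack you would have to slice the buffer yourself (for example buf[4:8]) to get the same result.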

Use `x` for padding

When reading binary data you inevitably run into padding (filler bytes inserted for alignment).
The x format character skips over those bytes, which is convenient.

data = b'd\x00\xb0\x04'

# Not ideal: the padding byte has to be bound to a throwaway variable
kind, _, value = struct.unpack("BBH", data)

# Better: x skips the padding byte
kind, value = struct.unpack("BxH", data)
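
As a small supplementary sketch, x also works when packing (it emits a zero byte) and accepts a repeat count. The explicit < prefix is an assumption added here so that the byte values are the same on every platform.

```python
import struct

# When packing, "x" emits a zero byte; when unpacking, it consumes one byte.
packed = struct.pack("<BxH", 100, 1200)
kind, value = struct.unpack("<BxH", packed)

# A repeat count skips several bytes at once: "3x" ignores three bytes.
first, last = struct.unpack("<B3xB", b"\x01\xff\xff\xff\x02")

print(packed, kind, value, first, last)
```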

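The struct.Struct class

The struct.Struct class wraps a struct format string in a class.
The format is parsed when the instance is created, so when you pack/unpack repeatedly inside a loop it is faster to create the instance beforehand.
The name is easy to confuse with ctypes.Structure.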

point = struct.Struct("HH")

for x, y in zip(range(10), range(10)):
    point.pack(x, y)
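
A precompiled Struct can also be reused for reading. The following is a minimal sketch, assuming a buffer of back-to-back records; the sample data is made up, and the < prefix is added so that each record is exactly 4 bytes everywhere.

```python
import struct

point = struct.Struct("<HH")  # two little-endian uint16 values per record

# Three packed points back to back (sample data for this sketch only).
blob = b"".join(point.pack(x, x * 2) for x in range(3))

# .size is the byte length of one record.
record_size = point.size

# iter_unpack walks the buffer one record at a time.
points = list(point.iter_unpack(blob))

print(record_size, points)
```

iter_unpack requires the buffer length to be a multiple of the record size.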

Format characters

Character	C type	Standard size
x	pad byte	1
c	char	1
b	signed char	1
B	unsigned char, BYTE	1
?	_Bool	1
h	short	2
H	unsigned short, WORD	2
i	int	4
I	unsigned int, DWORD	4
l	long, LONG	4
L	unsigned long, ULONG	4
q	long long, LONGLONG	8
Q	unsigned long long, ULONGLONG	8
n	ssize_t (Python 3.3 and later)	native only
N	size_t (Python 3.3 and later)	native only
f	float	4
d	double	8
s	char[]	-
p	char[]	-
P	void *	native only

Format string example:

The BITMAPINFOHEADER structure

typedef struct tagBITMAPINFOHEADER {
    DWORD  biSize;
    LONG   biWidth;
    LONG   biHeight;
    WORD   biPlanes;
    WORD   biBitCount;
    DWORD  biCompression;
    DWORD  biSizeImage;
    LONG   biXPelsPerMeter;
    LONG   biYPelsPerMeter;
    DWORD  biClrUsed;
    DWORD  biClrImportant;
} BITMAPINFOHEADER;

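Format string for the BITMAPINFOHEADER structure (the < prefix selects standard sizes, so l is 4 bytes like the Windows LONG even where the native long is 8 bytes)

"<IllHHIIllII"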

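Byte order, size, and alignment prefixes

Character	Byte order	Size	Alignment
@	native	native	native
=	native	standard	none
<	little-endian	standard	none
>	big-endian	standard	none
!	big-endian (network)	standard	none

Note: if the prefix is omitted, @ is used.

Difference between @ and = (CPU: amd64, OS: 64-bit Ubuntu)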

struct.calcsize("BI")
# 8

struct.calcsize("=BI")
# 5

Note that explicitly specifying the byte order turns alignment off, so no padding bytes are inserted automatically.
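
One way to deal with this, sketched below, is to spell out the padding with x whenever a byte-order prefix is used. The sketch also checks the size of the BITMAPINFOHEADER format string in standard-size mode; the 3-byte padding assumes a C layout with 4-byte alignment for the 32-bit field.

```python
import struct

# Standard sizes, no alignment: 1 + 4 bytes.
packed_size = struct.calcsize("<BI")

# To mirror a C struct {uint8_t; uint32_t;} laid out with 4-byte alignment,
# write the three padding bytes explicitly.
padded_size = struct.calcsize("<B3xI")

# With a prefix, "l" is always 4 bytes, so BITMAPINFOHEADER is 40 bytes
# regardless of the platform.
bmp_size = struct.calcsize("<IllHHIIllII")

# The native size of "l" depends on the platform
# (commonly 8 on 64-bit Linux and 4 on Windows).
native_long = struct.calcsize("@l")

print(packed_size, padded_size, bmp_size, native_long)
```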

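The ctypes.Structure class

With the ctypes.Structure class you can work with C/C++ structs.
When you try to read a lot of data with the struct module, the format string starts to look like a magic incantation, so if you want to write clear code for reading large amounts of binary data, ctypes.Structure is the better choice.

Structure basics

Subclass ctypes.Structure and define the field types in _fields_.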

from ctypes import *

"""
typedef struct {
    char identity[4];
    uint16_t x;
    uint16_t y;
} TestStructure;
"""
class TestStructure(Structure):
    _fields_ = (
        ('identity', c_char * 4),
        ('x', c_uint16),
        ('y', c_uint16),
    )

Create an instance as follows.

t = TestStructure(b"TEST", 100, 100)

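Use fixed-size types

In C, the sizes of int and short vary by environment. Since C99 introduced fixed-size types such as int16_t and int32_t, you should use them wherever possible. Accordingly, on the Python side use fixed-size types such as ctypes.c_int16 rather than ctypes.c_int.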
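
The point can be checked with ctypes.sizeof. This is only a small sketch; the value for c_long is deliberately not hard-coded because it differs between platforms.

```python
import ctypes

# Fixed-width types have the same size on every platform.
fixed_sizes = [
    ctypes.sizeof(t)
    for t in (ctypes.c_int8, ctypes.c_int16, ctypes.c_int32, ctypes.c_int64)
]

# c_long follows the platform's C long
# (commonly 8 bytes on 64-bit Linux, 4 bytes on Windows).
long_size = ctypes.sizeof(ctypes.c_long)

print(fixed_sizes, long_size)
```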

Writing / reading

You can write a ctypes.Structure instance by passing it directly to the write method of an io stream or a file object.

import io

buffer = io.BytesIO()
buffer.write(TestStructure(b"TEST", 100, 100))

buffer.getvalue()
# b'TESTd\x00d\x00'

You can read by passing a ctypes.Structure instance directly to readinto.

buffer = io.BytesIO(b'TESTd\x00d\x00')

t = TestStructure()
buffer.readinto(t)

t.identity, t.x, t.y
# (b'TEST', 100, 100)
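
Building on readinto, a stream of consecutive records could be read like this. Record and read_records are hypothetical names introduced for this sketch; Record has the same layout as the structure above.

```python
import io
from ctypes import Structure, c_char, c_uint16, sizeof

class Record(Structure):  # hypothetical name, same layout as above
    _fields_ = (
        ('identity', c_char * 4),
        ('x', c_uint16),
        ('y', c_uint16),
    )

def read_records(stream):
    """Yield Record instances until no complete record is left."""
    while True:
        record = Record()
        # readinto returns the number of bytes actually read.
        if stream.readinto(record) != sizeof(Record):
            return
        yield record

# Two complete records followed by one stray byte that is ignored.
stream = io.BytesIO(
    bytes(Record(b"AAAA", 1, 2)) + bytes(Record(b"BBBB", 3, 4)) + b"\x00"
)
records = [(r.identity, r.x, r.y) for r in read_records(stream)]
print(records)
```

Checking the return value of readinto avoids treating a truncated trailing record as valid data.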

Getting member offsets

The offset of a struct member is available as the offset attribute of the field descriptor: ClassName.member_name.offset.

class Point(Structure):
    _fields_ = (
        ('x', c_uint16),
        ('y', c_uint16),
    )
    
Point.y.offset
# 2
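
Field descriptors also expose a size attribute, so the whole layout can be listed in one pass. Header is a hypothetical structure used only for this sketch, and the offsets in the comment assume the usual native alignment rules (2-byte alignment for uint16, 4-byte for uint32).

```python
from ctypes import Structure, c_uint8, c_uint16, c_uint32

class Header(Structure):  # hypothetical structure for illustration
    _fields_ = (
        ('kind', c_uint8),
        ('length', c_uint16),
        ('crc', c_uint32),
    )

# (name, offset, size) for each field, in declaration order.
layout = [
    (name, getattr(Header, name).offset, getattr(Header, name).size)
    for name, _ in Header._fields_
]
print(layout)
# On common platforms: kind at 0, length at 2 (after 1 padding byte), crc at 4.
```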

sizeof

ctypes.sizeof returns the size of a structure.

class TestStructure(Structure):
    _fields_ = (
        ('flags', c_ubyte),
        ('value', c_int32),
    )
    
sizeof(TestStructure)
# 8
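
The result is 8 rather than 5 because padding is inserted before the 32-bit member. If the data being read has no padding, the _pack_ attribute can be used, roughly like #pragma pack(1) in C. Padded and Packed are hypothetical names for this sketch.

```python
from ctypes import Structure, c_ubyte, c_int32, sizeof

class Padded(Structure):
    _fields_ = (
        ('flags', c_ubyte),
        ('value', c_int32),
    )

class Packed(Structure):
    _pack_ = 1  # no padding between members
    _fields_ = (
        ('flags', c_ubyte),
        ('value', c_int32),
    )

padded_size = sizeof(Padded)
packed_size = sizeof(Packed)
packed_value_offset = Packed.value.offset
print(padded_size, packed_size, packed_value_offset)
```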

memset / memmove

ctypes.memset and ctypes.memmove are the equivalents of C's memset and memmove.

c_array = (c_char * 12)()

memset(c_array, 0, sizeof(c_array))
memmove(c_array, b"test\x00", len(b"test\x00"))
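
The same functions work on structures when the instances are passed by reference. The following sketch copies one structure into another and then zeroes the source; Pair is a hypothetical structure used only here.

```python
from ctypes import Structure, byref, c_uint16, memmove, memset, sizeof

class Pair(Structure):  # hypothetical structure for illustration
    _fields_ = (
        ('x', c_uint16),
        ('y', c_uint16),
    )

src = Pair(10, 20)
dst = Pair()

# Copy the raw bytes of src into dst.
memmove(byref(dst), byref(src), sizeof(Pair))
copied = (dst.x, dst.y)

# Zero every byte of src; dst is an independent copy and is unaffected.
memset(byref(src), 0, sizeof(Pair))
cleared = (src.x, src.y)

print(copied, cleared, (dst.x, dst.y))
```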

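Mapping data through a pointer

As in C/C++, you can map data onto a structure by casting a pointer to a struct pointer.
Use ctypes.POINTER to build the struct pointer type and ctypes.cast to perform the cast; the value the pointer refers to is available through contents.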

class PointText(Structure):
    _fields_ = (
        ('x', c_uint16),
        ('y', c_uint16),
        ('text', c_char * 0),
    )
   
data = b'd\x00d\x00null terminate text\x00'
p_point = cast(data, POINTER(PointText))

p_point.contents.x, p_point.contents.y
# (100, 100)

# Read the NUL-terminated string that follows the fixed-size fields
string_at(addressof(p_point.contents) + PointText.text.offset)
# b'null terminate text'

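ctypes.string_at reads a NUL-terminated string; for wide-character (Unicode) strings use ctypes.wstring_at.
Be careful, though: pointer operations can crash the Python interpreter itself. If possible, avoid members of unspecified length such as char [].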
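
A pointer-free alternative can be sketched with from_buffer_copy, which copies the fixed-size part and raises ValueError when the buffer is too small, combined with ordinary bytes slicing for the trailing text. PointHeader and parse_point_text are hypothetical names, and little-endian data is assumed.

```python
from ctypes import LittleEndianStructure, c_uint16, sizeof

class PointHeader(LittleEndianStructure):  # hypothetical fixed-size part
    _fields_ = (
        ('x', c_uint16),
        ('y', c_uint16),
    )

def parse_point_text(data):
    """Return (x, y, text) without touching raw pointers."""
    if len(data) < sizeof(PointHeader):
        raise ValueError("buffer too short for PointHeader")
    header = PointHeader.from_buffer_copy(data)
    # Take the bytes after the header and cut at the first NUL, if any.
    text = data[sizeof(PointHeader):].split(b"\x00", 1)[0]
    return header.x, header.y, text

parsed = parse_point_text(b'd\x00d\x00null terminate text\x00')
print(parsed)
```

Because everything stays within the bounds of the bytes object, malformed input leads to an exception or a shorter string rather than an invalid memory access.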

Converting to bytes with memoryview

With memoryview you can convert a ctypes object to a bytes object.

p = Point(200, 120)

memoryview(p).tobytes()
# b'\xc8\x00x\x00'

Little endian and big endian

class BPoint(BigEndianStructure):
    _fields_ = (
        ('x', c_uint16),
        ('y', c_uint16),
    )

class LPoint(LittleEndianStructure):
    _fields_ = (
        ('x', c_uint16),
        ('y', c_uint16),
    )

bpoint = BPoint(0x0102, 0x0304)
lpoint = LPoint(0x0102, 0x0304)

memoryview(bpoint).tobytes()
# b'\x01\x02\x03\x04'

memoryview(lpoint).tobytes()
# b'\x02\x01\x04\x03'
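
The two approaches in this article can be cross-checked against each other: a BigEndianStructure should produce the same bytes as struct with the > prefix. The sketch below assumes the BPoint layout above and uses from_buffer_copy to go from bytes back to a structure.

```python
import struct
from ctypes import BigEndianStructure, c_uint16

class BPoint(BigEndianStructure):
    _fields_ = (
        ('x', c_uint16),
        ('y', c_uint16),
    )

# Big-endian bytes built with the struct module.
wire = struct.pack(">HH", 0x0102, 0x0304)

# The same bytes parsed through ctypes.
point = BPoint.from_buffer_copy(wire)

# And the ctypes structure serialized back to bytes.
same_bytes = bytes(BPoint(0x0102, 0x0304)) == wire

print(hex(point.x), hex(point.y), same_bytes)
```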

References

http://docs.python.jp/3.5/library/struct.html
http://docs.python.jp/3.5/library/ctypes.html
