[Python] Reading a binary file a little at a time

Posted at 2021-04-22

import numpy as np

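Summary

The count and offset options of np.fromfile let you read just part of a binary file.
For small files, reading everything at once without these options is faster.
For large files where some processing is applied to each chunk, reading in pieces can be faster.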

Problem

Reading an array (especially a large one) all at once with np.fromfile() is slow.
How can it be read a little at a time?

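Solution

Use the count and offset options of np.fromfile.
Note that the two use different units^[1].

count: number of items to read (the number of array elements you want, not a size in bytes)
offset: byte position in the file at which reading starts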

n0, n1 = 5, 24 # size of the test array
# dtype  = 'float' # 8 bytes
dtype  = 'float32' # 4 bytes; either one works.
path   = './test.bin' # path the file is written to / read from

np.random.seed(0)
orgarr = np.random.rand(n0, n1).astype(dtype)
orgarr.tofile(path)

# Read everything at once
arr0 = np.fromfile(path, dtype=dtype).reshape((n0, n1))

# Read an array of size n1, n0 times, and compare with the bulk read
bytesize = np.dtype(dtype).itemsize # size in bytes of one item of this dtype
for i in range(n0):
    # Each read fetches n1 items, so count=n1
    # offset is the number of bytes to skip
    _arr1 = np.fromfile(path, dtype=dtype, count=n1, offset=n1*i*bytesize)
    print((arr0[i] == _arr1).all(), _arr1.mean())
'''
Output:
True 0.6104534
True 0.48200977
True 0.40383717
True 0.4136633
True 0.5744956
'''

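Exception (?) handling

np.fromfile does not raise an error even when count and offset are inconsistent with the actual file.

If count runs past the end of the file, it returns an array holding only the data that actually existed.
If offset is already past the end of the file, the returned array has shape (0,).

Handling the end of the file (StopIteration and the like) therefore requires getting the actual file size beforehand.
(e.g., os.path.getsize(path) returns the file size in bytes)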

_arr1 = np.fromfile(path, dtype=dtype, count=n1+3, offset=n1*(n0-1)*bytesize)
print(_arr1.shape) # (24,)
print((arr0[-1] == _arr1).all()) # True
_arr1 = np.fromfile(path, dtype=dtype, count=n1, offset=(n1*(n0-1)+n1//2)*bytesize)
print(_arr1.shape) # (12,)
_arr1 = np.fromfile(path, dtype=dtype, count=n1, offset=n1*n0*bytesize)
print(_arr1.shape) # (0,)
_arr1 = np.fromfile(path, dtype=dtype, count=n1, offset=n1*(n0+1)*bytesize)
print(_arr1.shape) # (0,)
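
The end-of-file handling described above could be wrapped in a generator. The following is only a minimal sketch: the name iter_chunks and the small demo file are hypothetical and not part of the original code. It uses os.path.getsize to decide when to stop and advances the offset by the number of items actually read.

```python
import os
import tempfile

import numpy as np


def iter_chunks(path, dtype, count):
    """Yield successive arrays of up to `count` items from a raw binary file.

    Hypothetical helper: stops on its own once the byte offset reaches
    the file size, so the caller does not need to know the item total.
    """
    itemsize = np.dtype(dtype).itemsize
    filesize = os.path.getsize(path)  # total file size in bytes
    offset = 0
    while offset < filesize:
        chunk = np.fromfile(path, dtype=dtype, count=count, offset=offset)
        if chunk.size == 0:
            # Fewer bytes than one full item remain; nothing more to read.
            break
        yield chunk
        offset += chunk.size * itemsize


# Demo on a throwaway file of 10 float32 values, read 4 items at a time.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    np.arange(10, dtype="float32").tofile(demo_path)
    sizes = [c.size for c in iter_chunks(demo_path, "float32", count=4)]

print(sizes)  # [4, 4, 2]
```

The last chunk is shorter than count, which matches the behavior shown above: np.fromfile silently returns whatever data is left.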

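Execution speed: just reading the file in pieces is slow!

However, reading the file in pieces and then combining them ("chunked" below) is slower than reading it all at once ("bulk" below).
Compare the following two patterns.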

# Pattern 1 (bulk)
arr0 = np.fromfile(path, dtype=dtype).reshape((n0, n1))

# Pattern 2 (chunked)
bytesize = np.dtype(dtype).itemsize
arr1 = []
for i in range(n0):
    _arr1 = np.fromfile(path, dtype=dtype, count=n1, offset=n1*i*bytesize)
    arr1 += [_arr1]
arr1 = np.stack(arr1)

Results from %%timeit in JupyterLab:

# The (5, 24) float32 file written above
Bulk:    270 µs ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Chunked: 1.25 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# A different (365, 1036800=720*1440) float32 file
Bulk:    3.76 s ± 1.28 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Chunked: 6.33 s ± 1.95 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Chunked reading pays off when (1) the file is large and (2) some processing is done on each chunk.
For example, consider taking the maximum of each chunk, as follows:

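# Pattern 1 (bulk): after reading, reduce along each row
arr0 = arr0.max(axis=1)
# Pattern 2 (chunked): inside the loop, in place of arr1 += [_arr1]
arr1 += [_arr1.max()]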

For the small file, reading everything at once is still more efficient,
but for the large file the ordering reverses.

# The (5, 24) float32 file written above
Bulk:    329 µs ± 92.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Chunked: 1.71 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# A different (365, 720*1440) float32 file
Bulk:    5.32 s ± 3.03 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Chunked: 949 ms ± 196 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
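
The per-chunk maximum comparison is only shown as fragments above, so here is one way the two patterns might look when written out in full. This is a sketch under assumptions: the function names max_bulk and max_chunked are hypothetical, the result is stored in a preallocated array rather than a list, and the demo file is a small (5, 24) one, so it checks that both patterns agree but does not reproduce the timings.

```python
import os
import tempfile

import numpy as np


def max_bulk(path, dtype, n0, n1):
    """Pattern 1: read the whole file, then take the maximum of each row."""
    return np.fromfile(path, dtype=dtype).reshape((n0, n1)).max(axis=1)


def max_chunked(path, dtype, n0, n1):
    """Pattern 2: read one row at a time and keep only its maximum."""
    itemsize = np.dtype(dtype).itemsize
    result = np.empty(n0, dtype=dtype)  # assumption: preallocate instead of a list
    for i in range(n0):
        row = np.fromfile(path, dtype=dtype, count=n1, offset=n1 * i * itemsize)
        result[i] = row.max()
    return result


# Demo on a throwaway (5, 24) float32 file.
n0, n1, dtype = 5, 24, "float32"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "test.bin")
    np.random.seed(0)
    np.random.rand(n0, n1).astype(dtype).tofile(demo_path)
    bulk = max_bulk(demo_path, dtype, n0, n1)
    chunked = max_chunked(demo_path, dtype, n0, n1)

print(np.array_equal(bulk, chunked))  # True
```

In the chunked version only one row is held in memory at a time, which is one plausible reason it can come out ahead on the large file.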

Comparison with other approaches

Combining open('rb') with struct.unpack is also often suggested, but it is reportedly at a disadvantage in execution speed^[2].
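
For reference, the open('rb') plus struct.unpack approach might look like the following. This is a rough sketch and not code from the cited answer: the helper name read_row_struct and the demo file are hypothetical, and the "=" format prefix assumes the file was written by tofile in the machine's native byte order. It reads the same row both ways to confirm the values match; it makes no claim about speed.

```python
import os
import struct
import tempfile

import numpy as np


def read_row_struct(path, n1, i):
    """Read row i (n1 float32 values) with open('rb') and struct.unpack."""
    fmt = f"={n1}f"                # native byte order, 4-byte floats (assumed)
    nbytes = struct.calcsize(fmt)  # n1 * 4 bytes
    with open(path, "rb") as f:
        f.seek(i * nbytes)         # skip the preceding rows
        return struct.unpack(fmt, f.read(nbytes))


# Demo on a throwaway (3, 4) float32 file: read row 1 both ways.
n1 = 4
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    np.arange(12, dtype="float32").reshape(3, n1).tofile(demo_path)
    row_struct = read_row_struct(demo_path, n1, 1)
    row_numpy = np.fromfile(demo_path, dtype="float32", count=n1, offset=n1 * 4)

print(row_struct)  # (4.0, 5.0, 6.0, 7.0)
```

Note that struct.unpack returns a tuple of Python floats, so an extra conversion is needed to get a NumPy array, whereas np.fromfile returns one directly.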

References

[1] NumPy: numpy.fromfile (https://numpy.org/doc/stable/reference/generated/numpy.fromfile.html)
[2] Stack Overflow: Fastest way to read in and slice binary data files in Python (https://stackoverflow.com/questions/44169233/fastest-way-to-read-in-and-slice-binary-data-files-in-python?noredirect=1&lq=1)
