
GZipStream.Read() is trickier than it looks

Posted at 2023-12-14 (last updated at 2023-12-14)

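Introduction

Consider reading a file with the .gz extension via GZipStream.Read(). I fell into several traps along the way, so here are some tips, offered partly as a story to laugh at.

The behavior was confirmed with .NET 8.0 as the target framework.

Decompressing the file first and then working with it is the sensible answer (there were answers to that effect on StackOverflow and the like), but here I deliberately stick with GZipStream.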

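Code example

As an example, take the code that loads the MNIST image data commonly used for machine-learning practice (using TensorFlow.NET). One of the files involved is 'train-images-idx3-ubyte.gz', which is gzip-compressed.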

MNISTManager.cs

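using System.Buffers.Binary;   // BinaryPrimitives
using System.IO;               // Stream, File
using System.IO.Compression;   // GZipStream, CompressionMode
using Tensorflow.NumPy;        // NDArray, np (namespace assumed for recent TensorFlow.NET releases)

public static NDArray Load_Image(string image_path)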
{
    using Stream fs = File.OpenRead(image_path);
    using Stream zs = new GZipStream(fs, CompressionMode.Decompress);

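    // Parse the file header first, then read the image data.
    // The header is subject to the same short reads as the body, so use
    // ReadExactly (.NET 7+), which throws if fewer than 16 bytes are available.
    byte[] buf = new byte[16];
    zs.ReadExactly(buf, 0, 16);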
    long size = BinaryPrimitives.ReadInt32BigEndian(buf[4..8]);
    long rows = BinaryPrimitives.ReadInt32BigEndian(buf[8..12]);
    long cols = BinaryPrimitives.ReadInt32BigEndian(buf[12..16]);

    // Expected to be 60000 * 28 * 28
    int cnt = (int)size * (int)rows * (int)cols;

    // Trap 1
    // Even with a buffer large enough for everything, a single Read returns only part of the data
    //zs.Read(bufdata,0,cnt);

    int nread = 0;
    int offset = 0;
    int len = 1024 * 1024;  // 1M

    byte[] readbuf = new byte[len];
    byte[] bufdata = new byte[cnt + len];

    while (true)
    {        
        nread = zs.Read(readbuf,0,len);

        // Trap 2
        // Read can return fewer than len bytes even in the middle of the file,
        // so nread < len cannot be used as the termination test
        if (nread == 0) { break; }
        if (offset >= cnt) { break; }   // ignore residuals

        // Trap 3
        // When fewer than len bytes come back,
        // CopyTo would also copy bytes that were not read this time
        // readbuf.CopyTo(bufdata, offset); // this copies unread data...
        for (var i = 0; i < nread;i++)
        {
            bufdata[offset + i] = readbuf[i]; 
        }       

        offset += nread;
    }

    NDArray data = np.array(bufdata[0..cnt]).reshape((size, rows * cols));
    data = data.astype(np.float32);
    
    return data;
}

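Trap 1

Even if you prepare a buffer large enough to hold all the data and call
Read(buf,offset,length)
the data is not read in its entirety.

I had forgotten a basic rule of working with a Stream. A single Read returns only as much as the stream has ready (here, roughly what sits in its internal buffer), so Read must be called repeatedly for as long as data remains unread.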
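As a minimal sketch of the "call Read repeatedly" rule, the snippet below fills a buffer from a GZipStream by looping until the requested count has arrived. ReadFully is a hypothetical helper name introduced here for illustration, and the gzip payload is built in memory (random bytes, an assumption made only so the sketch runs without a file). On .NET 7 and later, Stream.ReadExactly and Stream.ReadAtLeast provide the same behavior out of the box.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper: keep calling Read until `count` bytes have arrived
// or the stream ends. Returns the number of bytes actually read.
static int ReadFully(Stream s, byte[] buffer, int offset, int count)
{
    int total = 0;
    while (total < count)
    {
        int n = s.Read(buffer, offset + total, count - total);
        if (n == 0) break;   // 0 means end of stream
        total += n;
    }
    return total;
}

// Build a gzip payload in memory so the sketch needs no file on disk.
byte[] original = new byte[300_000];
new Random(1).NextBytes(original);
using var compressed = new MemoryStream();
using (var gz = new GZipStream(compressed, CompressionMode.Compress, leaveOpen: true))
{
    gz.Write(original, 0, original.Length);
}
compressed.Position = 0;

// Decompress: one ReadFully call replaces the single, possibly short, Read.
using var zs = new GZipStream(compressed, CompressionMode.Decompress);
byte[] restored = new byte[original.Length];
int got = ReadFully(zs, restored, 0, restored.Length);
Console.WriteLine($"read {got} of {restored.Length} bytes");
```

Because the helper writes straight into the destination array at offset + total, no intermediate read buffer is needed.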

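Trap 2

Read returns the number of bytes it actually read. When I used "the return value is smaller than the requested size" as the termination condition, the loop ended partway through the file.

Even when more data than the read buffer can hold is still unread, Read may return fewer bytes than the buffer size. A different termination condition is needed; the code above stops when Read returns 0, which signals the end of the stream.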
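The following sketch, under the same in-memory assumption (random bytes compressed on the fly, not the MNIST file), counts how many Read calls come back shorter than the buffer while terminating only on a return value of 0. The number of short reads depends on the runtime and the data, so the sketch prints it rather than assuming a value.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Build a gzip payload in memory (3 MiB of random bytes, for illustration only).
byte[] original = new byte[3 * 1024 * 1024];
new Random(2).NextBytes(original);
using var compressed = new MemoryStream();
using (var gz = new GZipStream(compressed, CompressionMode.Compress, leaveOpen: true))
{
    gz.Write(original, 0, original.Length);
}
compressed.Position = 0;

using var zs = new GZipStream(compressed, CompressionMode.Decompress);
byte[] chunk = new byte[1024 * 1024];   // 1 MiB read unit, as in the article
long total = 0;
int calls = 0, shortReads = 0;

while (true)
{
    int n = zs.Read(chunk, 0, chunk.Length);
    if (n == 0) break;                     // the only reliable end-of-stream test
    if (n < chunk.Length) shortReads++;    // a short read is NOT the end of the data
    total += n;
    calls++;
}

Console.WriteLine($"{calls} reads, {shortReads} short, {total} bytes in total");
```

Had the loop stopped at the first short read, total could have ended up smaller than the original length.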

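Trap 3

This follows on from Trap 2. Because the read buffer is only partially filled, copying the whole array with CopyTo also copies garbage (stale bytes left over from an earlier read).

So, for lack of a better option, the code copies one element at a time with a for loop.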
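As an alternative to the element-by-element loop, a length-limited copy does the same job. The sketch below uses made-up buffer contents (an assumption for illustration) and shows two standard ways to copy only the first nread bytes: Array.Copy with an explicit length, and a Span slice.

```csharp
using System;

// Pretend the last Read filled only the first 3 bytes; the rest is stale data.
byte[] readbuf = { 1, 2, 3, 99, 99, 99, 99, 99 };
int nread = 3;
int offset = 5;

// Option A: Array.Copy with an explicit length.
byte[] bufdata = new byte[16];
Array.Copy(readbuf, 0, bufdata, offset, nread);

// Option B: slice both sides with Span and copy only the valid part.
byte[] bufdata2 = new byte[16];
readbuf.AsSpan(0, nread).CopyTo(bufdata2.AsSpan(offset));

Console.WriteLine(string.Join(",", bufdata));
Console.WriteLine(string.Join(",", bufdata2));
```

Another option is to skip the intermediate buffer entirely and read into the destination, for example zs.Read(bufdata, offset, cnt - offset), which avoids the copy altogether.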

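Conclusion

After writing the draft I looked into this again and found a related article. That author's implementation is more practical and should be a better reference.

That article probably covers everything that needs saying, but I am leaving this one up as a reminder to myself.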
