
Just compressing and saving string data with the Python implementation of HDF5

Posted at 2020-01-16, last updated at 2020-02-07

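What is this

When you want to compress and store strings in HDF5 from Python, use fixed-length bytes (for example dtype='S1024') rather than str. str and variable-length arrays may be stored uncompressed: HDF5 keeps variable-length data in a separate heap, and dataset filters such as gzip do not cover that heap.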

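Background

One day I decided to gather string data (really, the output files of some calculations) into HDF5 format.
While I was at it, I wanted to compress the data to save space.

The string lengths are not known in advance, so I planned to avoid fixed-length types.

Following a reference article, I wrote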

    import h5py

    test_string = "hoge"*5
    with h5py.File("test.hdf5", "w") as f:
        g = f.require_group("test_group")
        # dt = h5py.special_dtype(vlen=str)  # older spelling of the same dtype
        dt = h5py.string_dtype()
        ds = g.create_dataset("test_vstr", (1, ), dtype=dt, compression="gzip")
        ds[0] = test_string

as a first version. (As I found out through trial and error, dt corresponds to Python's str class whether you use special_dtype or string_dtype.)
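The equivalence of the two spellings can be checked without writing a file. The following is a minimal sketch, assuming h5py 2.10 or later, where `h5py.check_string_dtype` is available; the variable names are only for illustration.

```python
import h5py

# Both constructors describe a variable-length UTF-8 string dtype.
old_style = h5py.special_dtype(vlen=str)
new_style = h5py.string_dtype()  # encoding defaults to 'utf-8'

# check_string_dtype reports the encoding and length (None = variable length).
old_info = h5py.check_string_dtype(old_style)
new_info = h5py.check_string_dtype(new_style)
print(old_info)
print(new_info)
```

If both lines print the same encoding and a length of None, the two dtypes are interchangeable for this purpose.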

test.hdf5 was generated without trouble, and running

    with h5py.File("test.hdf5", "r") as f:
        g = f.require_group("test_group")
        data = g["test_vstr"][0]
        print(type(data) )
        print("Read data: ", data)

prints

<class 'str'>
Read data:  hogehogehogehogehoge

so the data could be read back. HDFView also opened the file and showed the string.

I thought I was done, but then I happened to peek at the raw contents of test.hdf5.

test.hdf5

�HDF

��������　（以下略）
O��

It looks like this; the third line is extremely long and seems to hold the data itself. Scanning along it, I found:

Line 3 of test.hdf5

・・・����hogehogehogehogehoge����・・・

It is not compressed at all.

That is where the trial and error began.

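Conclusion

    create_dataset(data_name, (1, ), dtype='S{}'.format(len(data_string)), compression="gzip")

Assign bytes to the dataset. Data held as str should be passed through encode(), and specifying the encoding explicitly is even better. Note that the length in the dtype is a byte count, so for non-ASCII text use the length of the encoded bytes rather than len() of the str.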
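As a minimal sketch of this recipe, the two helpers below wrap the write and the read. The names `write_text_fixed` and `read_text_fixed`, the temporary file path, and the choice of UTF-8 are assumptions made for illustration, not part of h5py or of the original experiments.

```python
import os
import tempfile

import h5py


def write_text_fixed(path, name, text, encoding="utf-8"):
    """Store `text` as one gzip-compressed, fixed-length bytes element."""
    payload = text.encode(encoding)
    # Size the dtype by the encoded byte length, not by len(text).
    dtype = "S{}".format(max(len(payload), 1))
    with h5py.File(path, "a") as f:
        ds = f.create_dataset(name, (1,), dtype=dtype, compression="gzip")
        ds[0] = payload


def read_text_fixed(path, name, encoding="utf-8"):
    """Read the element back and decode it to str."""
    with h5py.File(path, "r") as f:
        return f[name][0].decode(encoding)


demo_path = os.path.join(tempfile.mkdtemp(), "demo_fixed.hdf5")
write_text_fixed(demo_path, "test_fixbytes", "hoge" * 5)
restored = read_text_fixed(demo_path, "test_fixbytes")
print(restored)
```

One caveat of fixed-length bytes fields: trailing NUL bytes are treated as padding when reading, so text that itself ends in NUL characters would not survive the round trip.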

Environment

WSL
python 3.6.9
h5py 2.10.0

HDFView runs on the Windows side, so I suppose this also confirms that the files are portable.

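Attempts that did not work

All of these can be written and read from Python and inspected in HDFView. However, when the HDF5 file itself is viewed as text, the string is readable as is. In other words, it is very likely not compressed.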
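The "open the file and look for the string" check can be automated. The sketch below uses a hypothetical helper, `contains_plaintext`, and demonstrates it on two stand-in files written with the standard library (one raw, one zlib-compressed) rather than on real HDF5 files.

```python
import os
import tempfile
import zlib


def contains_plaintext(path, text, encoding="utf-8"):
    """Return True if the encoded text appears verbatim in the file."""
    with open(path, "rb") as fh:
        return text.encode(encoding) in fh.read()


sample = "hoge" * 50
workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, "raw.bin")
packed_path = os.path.join(workdir, "packed.bin")

# Stand-in for a file that stores the text uncompressed.
with open(raw_path, "wb") as fh:
    fh.write(b"header" + sample.encode("utf-8") + b"footer")
# Stand-in for a file that stores the text compressed.
with open(packed_path, "wb") as fh:
    fh.write(zlib.compress(sample.encode("utf-8")))

raw_hit = contains_plaintext(raw_path, sample)
packed_hit = contains_plaintext(packed_path, sample)
print(raw_hit, packed_hit)
```

Calling `contains_plaintext("test.hdf5", test_string)` on a real file gives the same kind of evidence as reading it in a text editor: a hit strongly suggests the data was stored uncompressed.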

Variable-length version: special_dtype

Unlike the str case, with bytes the value you get back on reading differs between special_dtype and string_dtype.

Writing:

    with h5py.File("test.hdf5", "a") as f:
        g = f.require_group("test_group")
        dt = h5py.special_dtype(vlen=bytes)
        ds = g.create_dataset("test_vbytes", (1, ), dtype=dt, compression="gzip")
        ds[0] = test_string.encode()

Reading:

    with h5py.File("test.hdf5", "r") as f:
        g = f["test_group"]  # reopen the group; the earlier handle belongs to a closed file
        data = g["test_vbytes"][0]
        print(type(data))
        print("Read data decoded: ", data.decode())

<class 'bytes'>
Read data decoded:  hogehogehogehogehoge

Variable-length version: string_dtype

Writing:

    import numpy as np

    # "test_vbytes" must not already exist in the file, or create_dataset raises
    with h5py.File("test.hdf5", "a") as f:
        g = f.require_group("test_group")
        dt = h5py.string_dtype(encoding='utf-8')
        ds = g.create_dataset("test_vbytes", (1, ), dtype=dt, compression="gzip")
        ds[0] = np.array(test_string.encode('utf-8'), dtype=dt)

Note that the assignment is suddenly wrapped in np.array. Assigning the raw bytes without the wrapper caused problems with multibyte characters.
Reading:

    with h5py.File("test.hdf5", "r") as f:
        g = f["test_group"]  # reopen the group inside this file handle
        data = g["test_vbytes"][0]
        print(type(data))
        print("Read data: ", data)

<class 'str'>
Read data:  hogehogehogehogehoge

Somewhere along the way it has turned into str.

Bonus

    with h5py.File("test.hdf5", "a") as f:
        g = f.require_group("test_group")
        dt = h5py.string_dtype(encoding='ascii')
        ds = g.create_dataset("test_vbytes", (1, ), dtype=dt, compression="gzip")
        ds[0] = test_string.encode('ascii', 'backslashreplace')

If the encoding is switched to ASCII like this, then reading with

    with h5py.File("test.hdf5", "r") as f:
        g = f["test_group"]  # reopen the group inside this file handle
        data = g["test_vbytes"][0]
        print(type(data))
        print("Read data decoded: ", data.decode('ascii'))

gives

<class 'bytes'>
Read data decoded:  hogehogehogehogehoge

that is, the value now comes back as bytes.
The 'backslashreplace' argument to encode is there to cope with multibyte characters.
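To make the effect of 'backslashreplace' concrete, here is a small standard-library sketch. The sample string is an assumption chosen for illustration; it mixes ASCII with one of the multibyte characters used later in this article.

```python
text = "hoge魑"

# Strict ASCII encoding fails on the non-ASCII character...
try:
    text.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...whereas 'backslashreplace' substitutes an escape sequence.
escaped = text.encode("ascii", "backslashreplace")
print(escaped)

# Decoding as ASCII yields the escape text, not the original character.
round_trip = escaped.decode("ascii")
print(round_trip == text)
```

So the error handler avoids the exception, but the stored bytes hold an escaped form of the character, which a plain ASCII decode does not turn back into the original text.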

The pattern that worked, for now

Fixed-length version (effectively variable-length)

    with h5py.File("test.hdf5", "a") as f:
        g = f.require_group("test_group")
        ds = g.create_dataset("test_fixbytes", (1, ), dtype='S{}'.format(len(test_string)), compression="gzip")
        ds[0] = test_string.encode()

It is read back as follows.

    with h5py.File("test.hdf5", "r") as f:
        g = f["test_group"]  # reopen the group inside this file handle
        data = g["test_fixbytes"][0]
        print(type(data))
        print("Read data decoded: ", data.decode())

<class 'numpy.bytes_'>
Read data decoded:  hogehogehogehogehoge

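It can be viewed in HDFView as well.

I call it effectively variable-length because the length can be chosen dynamically when create_dataset is called. The fixed length will probably become the sticking point when the contents of this dataset need to be rewritten later.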

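Is it actually compressed?

In a test with a set of real calculation output files, data that took a little under 7 MB when stored uncompressed as str came to about 2.7 MB when stored as bytes.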
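A size comparison of this kind can be reproduced on synthetic data. The sketch below is not the original experiment: the repetitive sample text, the temporary file names, and the use of file size as the measure are all assumptions for illustration, and the exact numbers will vary with the h5py and HDF5 versions.

```python
import os
import tempfile

import h5py

text = "hoge" * 25000  # 100,000 ASCII characters, highly repetitive
payload = text.encode("utf-8")
workdir = tempfile.mkdtemp()

# Variable-length str dataset with gzip requested.
vlen_path = os.path.join(workdir, "vlen.hdf5")
with h5py.File(vlen_path, "w") as f:
    ds = f.create_dataset("text", (1,), dtype=h5py.string_dtype(),
                          compression="gzip")
    ds[0] = text

# Fixed-length bytes dataset with gzip requested.
fixed_path = os.path.join(workdir, "fixed.hdf5")
with h5py.File(fixed_path, "w") as f:
    ds = f.create_dataset("text", (1,), dtype="S{}".format(len(payload)),
                          compression="gzip")
    ds[0] = payload

vlen_size = os.path.getsize(vlen_path)
fixed_size = os.path.getsize(fixed_path)
print("variable-length:", vlen_size, "bytes")
print("fixed-length:   ", fixed_size, "bytes")
```

With text this repetitive the fixed-length file should come out far smaller, while the variable-length file stays at least as large as the raw text.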

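Multibyte characters

The results also differ for multibyte strings.

    test_string = "魑魅魍魎"*5

Let us try saving this string, 魑魅魍魎 repeated five times.

The first pattern

With the first approach, the one that assigns str and does not compress, writing and then reading gives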

Read data:  魑魅魍魎魑魅魍魎魑魅魍魎魑魅魍魎魑魅魍魎

so it worked.

Fixed length

The fixed-length version that returns numpy.bytes_ gives

Read data:  b'\xe9\xad\x91\xe9\xad\x85\xe9\xad\x8d\xe9\xad\x8e\xe9\xad\x91\xe9\xad\x85\xe9\xad'

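and decode() raises an error.
Thinking back, the fixed-length version creates the dataset with dtype='S{}'.format(len(test_string)). len() counts characters, not bytes: the string has 20 characters but takes 60 bytes in UTF-8, so the S20 field keeps only the first 20 bytes and cuts a character in half. As a stopgap,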

        ds = g.create_dataset("test_fixbytes", (1, ), dtype='S{}'.format(len(test_string)*4), compression="gzip")

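allocating a generous length like this let me read back all five repetitions of 魑魅魍魎. UTF-8 uses at most 4 bytes per character, so a factor of 4 is always enough, though it over-allocates.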
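The mismatch between character count and byte count, and an exact alternative to multiplying by 4, can be shown with the standard library alone. This is a minimal sketch; the variable names are for illustration only.

```python
test_string = "魑魅魍魎" * 5
encoded = test_string.encode("utf-8")

char_count = len(test_string)  # number of characters
byte_count = len(encoded)      # number of UTF-8 bytes
print(char_count, byte_count)

# An 'S20' field would keep only the first 20 bytes,
# which ends in the middle of a character.
truncated = encoded[:char_count]
try:
    truncated.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

# Sizing the dtype by the encoded length gives an exact fit.
dtype_spec = "S{}".format(byte_count)
print(dtype_spec)
```

Passing a dtype built from the encoded length to create_dataset avoids both the truncation and the over-allocation.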

`special_dtype`

With special_dtype(vlen=bytes), on the other hand,

Read data:  b'\xe9\xad\x91\xe9\xad\x85\xe9\xad\x8d\xe9\xad\x8e\xe9\xad\x91\xe9\xad\x85\xe9\xad\x8d\xe9\xad\x8e\xe9\xad\x91\xe9\xad\x85\xe9\xad\x8d\xe9\xad\x8e\xe9\xad\x91\xe9\xad\x85\xe9\xad\x8d\xe9\xad\x8e\xe9\xad\x91\xe9\xad\x85\xe9\xad\x8d\xe9\xad\x8e'
Read data decoded:  魑魅魍魎魑魅魍魎魑魅魍魎魑魅魍魎魑魅魍魎

so here there is apparently no need to worry about the byte length of multibyte characters.

Encoded string

Finally, the version using string_dtype(encoding='utf-8') also gives

Read data:  魑魅魍魎魑魅魍魎魑魅魍魎魑魅魍魎魑魅魍魎

so reading and writing work without problems.
It could be viewed in HDFView too.

Recall that the assignment was wrapped, seemingly needlessly, in np.array: ds[0] = np.array(test_string.encode('utf-8'), dtype=dt). If ds[0] = test_string.encode('utf-8') is used instead, writing and reading from Python still work, but HDFView reports

Failed to read scalar dataset: Address overflowed

and the data cannot be viewed. My guess is that the string length is determined more accurately when numpy handles it than when it is left to the HDF5 internals.

Summary

dtype argument	Type on assignment	Type on read	Notes
`h5py.special_dtype(vlen=str)`	`str`	`str`	`compression` appears to be ignored
`h5py.string_dtype()`	`str`	`str`	Same as above
`h5py.special_dtype(vlen=bytes)`	`bytes`	`bytes`	Conversion is tedious
`h5py.string_dtype(encoding='utf-8')`	`numpy.ndarray`	`str`	Needs conversion on input
`h5py.string_dtype(encoding='ascii')`	`bytes`	`bytes`	Garbled in HDFView
`'S{nlen}'`	`bytes`	`numpy.bytes_`	Watch the byte length with multibyte characters

References

The documentation on special types is also worth reading.
http://docs.h5py.org/en/stable/special.html

http://docs.h5py.org/en/stable/strings.html#exceptions-for-python-3
says:

Most strings in the HDF5 world are stored in ASCII, which means they map to byte strings. But in Python 3, there’s a strict separation between data and text, which intentionally makes it painful to handle encoded strings directly.

As this suggests, the difference in how HDF5 and Python treat encodings seems to cause a fair amount of trouble.

A legacy from earlier versions

On a closer look, the issue in question dates from 2012, with this environment:

Python 2.7.1
HDF5 1.8.5 patch 1
h5py 1.3.1

Even so, I ran into the same problem today.

When the issue was closed, the following was written:

Thanks a lot for your information. I was searching around for details but could find nothing.

Can you give me some details about the storage of structured arrays? I guess the compression is also not working. What about the IO performance. I read from people complaining about that.

so perhaps it was never fixed and has simply been left as it is?
