Checking how Python reads Parquet files written by DuckDB

Last updated at 2025-02-08, posted at 2025-02-08

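(Change history)

I changed the title because the original wording wrongly suggested that this article explains the v2 format.

before: Checking how Python reads the Parquet format v2 written by DuckDB
after: Checking how Python reads Parquet files written by DuckDB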

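Motivation

An article published the other day announced that DuckDB 1.2.0 and later can read and write the Parquet V2 format.

As a Python user, I was curious how pandas (pyarrow) and polars would behave, so I looked into it. What follows is not a verification of the v2 format itself; it investigates how DuckDB types behave when they are mapped to Parquet types.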

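Conclusion

When reading a Parquet file written by DuckDB, watch out for the following data type behavior.

hugeint (int128) is not defined in Parquet, so it is converted to a double (float64). Values too large to be represented exactly as a float do not round-trip.
bit is stored as a string.
array (fixed-length) becomes a list (variable-length). When read with Pandas or Polars, each item becomes a dict carrying an extra "element" key.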

Prerequisites

duckdb 1.2.0 was released on 2025-02-05, so I use that version.

## macOS or ubuntu (WSL2)
brew install duckdb

duckdb --version
v1.2.0 5f5512b827

$ python -VV
Python 3.12.8 (main, Dec  3 2024, 18:42:41) [GCC 11.4.0]

uv add pandas polars pyarrow

uv pip list
# Package         Version
# --------------- -----------
# numpy           2.2.2
# pandas          2.2.3
# polars          1.21.0
# pyarrow         19.0.0
# python-dateutil 2.9.0.post0
# pytz            2025.1
# six             1.17.0
# tzdata          2025.1

# python console
import pandas as pd
import polars as pl
import pyarrow.parquet as pq

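Creating sample data with DuckDB

See the official page for the data types DuckDB supports.

I entered the following prompt into DeepSeek R1 on Perplexity Pro. What a convenient age we live in.

Write a query that creates one column for each duckdb Type and produces one row of sample data. For the duckdb Types, refer to this page: https://duckdb.org/docs/sql/data_types/overview.html

Using the SQL statement below as a reference, create sample data for every Type you looked up. After writing it, check that the SQL is syntactically correct before you output the answer.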

COPY (FROM VALUES
  (1::BIGINT, '10'::BIT, '\x0a1b'::BLOB, false) AS t(_bigint, _bit, _blob, _boolean)
)
TO 'all_types.zstd.v2.parquet' (COMPRESSION zstd, PARQUET_VERSION V2);

COPY (FROM VALUES (
    1::BIGINT,
    '10'::BIT,
    '\x0a1b'::BLOB,
    false::BOOLEAN,
    '2025-01-01'::DATE,
    '99.9999'::DECIMAL(10,6),
    1.23::DOUBLE,
    1.23::FLOAT,
    123::INTEGER,
    170141183460469231731687303715884105727::HUGEINT,
    1::SMALLINT,
    1::TINYINT,
    255::UTINYINT,
    65535::USMALLINT,
    4294967295::UINTEGER,
    18446744073709551615::UBIGINT,
    '12:00:00'::TIME,
    '2025-01-01 12:00:00'::TIMESTAMP,
    '2025-01-01 12:00:00+00'::TIMESTAMPTZ,
    INTERVAL '1 DAY',
    '123e4567-e89b-12d3-a456-426614174000'::UUID,
    '{"key": "value"}'::JSON,
    [1, 2, 3]::INTEGER[3],
    [4, 5, 6]::INTEGER[],
    map([1, 2], ['a', 'b'])::MAP(INTEGER, VARCHAR),
    {'i': 42, 's': 'text'}::STRUCT(i INTEGER, s VARCHAR),
    union_value(num := 2)::UNION(num INTEGER, str VARCHAR)
) AS t(
    _bigint, _bit, _blob, _boolean, _date, _decimal,
    _double, _float, _integer, _hugeint, _smallint,
    _tinyint, _utinyint, _usmallint, _uinteger, _ubigint,
    _time, _timestamp, _timestamptz, _interval, _uuid,
    _json, _array, _list, _map, _struct, _union
))
TO 'all_types.zstd.v2.parquet' (COMPRESSION zstd, PARQUET_VERSION V2);

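Checking the reads

First, I checked the round trip within DuckDB itself. .maxwidth 1000 stops the console from truncating wide output and forces every column to be shown.

The following types did not round-trip through a Parquet file even within DuckDB itself.

bit: converted to varchar
hugeint: the maximum value is converted to double and cannot be reproduced
array: the length specification is lost and it becomes a list; the element type is kept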

# duckdb
D .maxwidth 1000
D from read_parquet('all_types.zstd.v2.parquet');
┌─────────┬─────────┬────────┬──────────┬────────────┬───────────────┬─────────┬────────┬──────────┬────────────────────────┬───────────┬──────────┬───────────┬────────────┬────────────┬──────────────────────┬──────────┬─────────────────────┬──────────────────────────┬───────────┬──────────────────────────────────────┬──────────────────┬───────────┬───────────┬───────────────────────┬──────────────────────────────┬────────────────────────────────────┐
│ _bigint │  _bit   │ _blob  │ _boolean │   _date    │   _decimal    │ _double │ _float │ _integer │        _hugeint        │ _smallint │ _tinyint │ _utinyint │ _usmallint │ _uinteger  │       _ubigint       │  _time   │     _timestamp      │       _timestamptz       │ _interval │                _uuid                 │      _json       │  _array   │   _list   │         _map          │           _struct            │               _union               │
│  int64  │ varchar │  blob  │ boolean  │    date    │ decimal(10,6) │ double  │ float  │  int32   │         double         │   int16   │   int8   │   uint8   │   uint16   │   uint32   │        uint64        │   time   │      timestamp      │ timestamp with time zone │ interval  │                 uuid                 │       json       │  int32[]  │  int32[]  │ map(integer, varchar) │ struct(i integer, s varchar) │ struct(utinyint, integer, varchar) │
├─────────┼─────────┼────────┼──────────┼────────────┼───────────────┼─────────┼────────┼──────────┼────────────────────────┼───────────┼──────────┼───────────┼────────────┼────────────┼──────────────────────┼──────────┼─────────────────────┼──────────────────────────┼───────────┼──────────────────────────────────────┼──────────────────┼───────────┼───────────┼───────────────────────┼──────────────────────────────┼────────────────────────────────────┤
│    1    │ 10      │ \x0A1b │ false    │ 2025-01-01 │   99.999900   │  1.23   │  1.23  │   123    │ 1.7014118346046923e+38 │     1     │    1     │    255    │   65535    │ 4294967295 │ 18446744073709551615 │ 12:00:00 │ 2025-01-01 12:00:00 │ 2025-01-01 21:00:00+09   │ 1 day     │ 123e4567-e89b-12d3-a456-426614174000 │ {"key": "value"} │ [1, 2, 3] │ [4, 5, 6] │ {1=a, 2=b}            │ {'i': 42, 's': text}         │ (0, 2, NULL)                       │
└─────────┴─────────┴────────┴──────────┴────────────┴───────────────┴─────────┴────────┴──────────┴────────────────────────┴───────────┴──────────┴───────────┴────────────┴────────────┴──────────────────────┴──────────┴─────────────────────┴──────────────────────────┴───────────┴──────────────────────────────────────┴──────────────────┴───────────┴───────────┴───────────────────────┴──────────────────────────────┴────────────────────────────────────┘

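Note that hugeint is not an unbounded integer like Python's int: it is a signed 128-bit integer whose maximum is 2^127 - 1, the value used in the sample row. Also, Parquet has INT96 but no INT128.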
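The loss described above can be reproduced without any Parquet library, because it comes from the conversion to a 64-bit float. The following is a minimal sketch in plain Python; the names `HUGEINT_MAX` and `survives_double` are made up for this illustration.

```python
# Largest HUGEINT value, the one used in the sample row (2**127 - 1).
HUGEINT_MAX = 2**127 - 1


def survives_double(value: int) -> bool:
    """Return True if `value` is unchanged after a trip through a float64."""
    return int(float(value)) == value


# A float64 has a 53-bit significand, so 2**127 - 1 is rounded to 2**127.
as_double = float(HUGEINT_MAX)

print(as_double)
print(survives_double(HUGEINT_MAX))  # the maximum HUGEINT is not recoverable
print(survives_double(2**53))        # small enough to be stored exactly
```

Any hugeint that needs more than 53 significant bits is affected in the same way, not only the maximum value.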

Checking on the Python side

Let's check with the main libraries that handle Parquet files: Pandas, Polars, and PyArrow.

import pandas as pd
import polars as pl
import pyarrow.parquet as pq

Pandas

First, prepare a helper function.

def show_df_elem(df: pd.DataFrame) -> pd.DataFrame:
    cols = df.columns.tolist()
    dtypes = df.dtypes.tolist()
    r0_repr = [repr(x) for x in df.iloc[0].tolist()]
    r0_type = [type(x) for x in df.iloc[0].tolist()]
    out = pd.DataFrame({
      "name": cols,
      "dtypes": dtypes,
      "r0_repr": r0_repr,
      "r0_type": r0_type,
    })
    return out

Pandas comes first. Let's run the helper function defined above.

Anything other than a primitive value ends up in its Series column as object dtype.

>>> df_pd = pd.read_parquet("all_types.zstd.v2.parquet")

>>> show_df_elem(df_pd)
            name               dtypes                                            r0_repr                                            r0_type
0        _bigint                int64                                        np.int64(1)                              <class 'numpy.int64'>
1           _bit               object                                               '10'                                      <class 'str'>
2          _blob               object                                            b'\n1b'                                    <class 'bytes'>
3       _boolean                 bool                                          np.False_                               <class 'numpy.bool'>
4          _date               object                          datetime.date(2025, 1, 1)                            <class 'datetime.date'>
5       _decimal               object                               Decimal('99.999900')                          <class 'decimal.Decimal'>
6        _double              float64                                   np.float64(1.23)                            <class 'numpy.float64'>
7         _float              float32                                   np.float32(1.23)                            <class 'numpy.float32'>
8       _integer                int32                                      np.int32(123)                              <class 'numpy.int32'>
9       _hugeint              float64                 np.float64(1.7014118346046923e+38)                            <class 'numpy.float64'>
10     _smallint                int16                                        np.int16(1)                              <class 'numpy.int16'>
11      _tinyint                 int8                                         np.int8(1)                               <class 'numpy.int8'>
12     _utinyint                uint8                                      np.uint8(255)                              <class 'numpy.uint8'>
13    _usmallint               uint16                                   np.uint16(65535)                             <class 'numpy.uint16'>
14     _uinteger               uint32                              np.uint32(4294967295)                             <class 'numpy.uint32'>
15      _ubigint               uint64                    np.uint64(18446744073709551615)                             <class 'numpy.uint64'>
16         _time               object                               datetime.time(12, 0)                            <class 'datetime.time'>
17    _timestamp       datetime64[us]                   Timestamp('2025-01-01 12:00:00')  <class 'pandas._libs.tslibs.timestamps.Timesta...
18  _timestamptz  datetime64[us, UTC]    Timestamp('2025-01-01 12:00:00+0000', tz='UTC')  <class 'pandas._libs.tslibs.timestamps.Timesta...
19     _interval               object  b'\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00...                                    <class 'bytes'>
20         _uuid               object     b'\x12>Eg\xe8\x9b\x12\xd3\xa4VBf\x14\x17@\x00'                                    <class 'bytes'>
21         _json               object                                 '{"key": "value"}'                                      <class 'str'>
22        _array               object  array([{'element': 1}, {'element': 2}, {'eleme...                            <class 'numpy.ndarray'>
23         _list               object                      array([4, 5, 6], dtype=int32)                            <class 'numpy.ndarray'>
24          _map               object                               [(1, 'a'), (2, 'b')]                                     <class 'list'>
25       _struct               object                             {'i': 42, 's': 'text'}                                     <class 'dict'>
26        _union               object                     {'': 0, 'num': 2, 'str': None}                                     <class 'dict'>
>>>

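Here are notes on the r0_repr items that did not round-trip.

hugeint has become a floating-point float64. As with reading in DuckDB, huge values cannot be restored.

interval is stored as raw binary. I could not make sense of it at first, so I saved the following value to look closer.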

# duckdb
D COPY (SELECT '1 day 2 hours 3 minutes 4.5 seconds'::INTERVAL as _interval)
  TO '_interval.zstd.v2.parquet' (COMPRESSION zstd, PARQUET_VERSION V2);

>>> pd.read_parquet("_interval.zstd.v2.parquet")
                                          _interval
0  b'\x00\x00\x00\x00\x01\x00\x00\x00\xb4\xadp\x00'

The answer from DeepSeek R1 on Perplexity was as follows.

import struct
import pandas as pd

byte_string = pd.read_parquet("_interval.zstd.v2.parquet").iloc[0,0]
# b'\x00\x00\x00\x00\x01\x00\x00\x00\xb4\xadp\x00'

# DuckDB INTERVAL layout according to the answer (4-byte months, 4-byte days, 4-byte microseconds)
months, days, microseconds = struct.unpack('<iii', byte_string)

print(pd.Timedelta(days=days, microseconds=microseconds)) # treated as microseconds
# 1 days 00:00:07.384500

# added by the author
print(pd.Timedelta(days=days, milliseconds=microseconds)) # treated as milliseconds
# 1 days 02:03:04.500000

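In DuckDB's duckdb.h, micros is defined as an int64 as shown below, so it should take 8 bytes. Pandas, however, reads a 12-byte binary, which leaves only 4 bytes for the last field. On the Python side, reading those last 4 bytes and treating them as milliseconds gives the expected value. This agrees with the Parquet INTERVAL logical type, a 12-byte fixed-length value holding months, days, and milliseconds as three little-endian unsigned 32-bit integers, so the behavior appears to come from the file format. It still needs care, because precision below one millisecond cannot survive the round trip.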

typedef struct {
	int32_t months;
	int32_t days;
	int64_t micros;
} duckdb_interval;
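The decoding described above can be sketched with the standard library alone. This assumes the 12 bytes are three little-endian unsigned 32-bit integers in the order months, days, milliseconds, which is what the observation above suggests; `decode_parquet_interval` is a hypothetical helper name, not part of pandas or DuckDB.

```python
import struct
from datetime import timedelta


def decode_parquet_interval(raw: bytes) -> tuple[int, int, timedelta]:
    """Split a 12-byte INTERVAL value into (months, days, time part)."""
    if len(raw) != 12:
        raise ValueError(f"expected 12 bytes, got {len(raw)}")
    # Assumed layout: three little-endian unsigned 32-bit integers.
    months, days, millis = struct.unpack("<III", raw)
    # Months stay separate because a month has no fixed length in days.
    return months, days, timedelta(milliseconds=millis)


# The value read back from _interval.zstd.v2.parquet above.
raw = b"\x00\x00\x00\x00\x01\x00\x00\x00\xb4\xadp\x00"
months, days, time_part = decode_parquet_interval(raw)
print(months, days, time_part)
```

Returning months and days separately instead of folding everything into one timedelta avoids guessing how many days a month should count for.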

uuid is stored as raw binary as well, so it needs a conversion step.

>>> import uuid
>>> uuid.UUID(bytes=b'\x12>Eg\xe8\x9b\x12\xd3\xa4VBf\x14\x17@\x00')
UUID('123e4567-e89b-12d3-a456-426614174000')
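To apply the same conversion to a whole column, a small wrapper that passes missing values through may be handy. This is only a sketch; `bytes_to_uuid` is a name invented here, and the sample list stands in for the cells of the _uuid column.

```python
import uuid
from typing import Optional


def bytes_to_uuid(raw: Optional[bytes]) -> Optional[uuid.UUID]:
    """Convert one 16-byte cell to uuid.UUID, passing None through."""
    if raw is None:
        return None
    return uuid.UUID(bytes=raw)


# One real cell from the output above plus a missing value.
cells = [b"\x12>Eg\xe8\x9b\x12\xd3\xa4VBf\x14\x17@\x00", None]
decoded = [bytes_to_uuid(c) for c in cells]
print(decoded)
```

With Pandas, something like df_pd["_uuid"].map(bytes_to_uuid) should do the same for the object-dtype column, though that call was not part of the checks in this article.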

json is stored as a str holding the JSON text.

array should be a fixed-length array, but what comes back is a sequence of Python dicts such as {'element': 1}, ...
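If the plain values are wanted, the wrapper dicts can be stripped after reading. The following is a minimal sketch; `unwrap_elements` is a hypothetical helper, and it only unwraps the exact one-key shape seen in the output above.

```python
from typing import Any, Iterable


def unwrap_elements(items: Iterable[Any]) -> list:
    """Turn [{'element': 1}, {'element': 2}] into [1, 2]."""
    result = []
    for item in items:
        # Unwrap only dicts whose single key is "element"; keep anything else.
        if isinstance(item, dict) and set(item) == {"element"}:
            result.append(item["element"])
        else:
            result.append(item)
    return result


print(unwrap_elements([{"element": 1}, {"element": 2}, {"element": 3}]))
```

Because other values are returned unchanged, the same function can be applied to the _list column without harm.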

I don't understand union well, so I will skip it.
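For reference only, here is one possible reading of the union cell. It assumes that the unnamed first field is a tag holding the zero-based index of the active member, in the order the struct lists them. That is inferred from the single sample row (tag 0 with num set) and was not verified here; `decode_union` is a made-up name.

```python
def decode_union(cell: dict, tag_key: str = "") -> tuple:
    """Return (member_name, value) for a union stored as a tagged struct."""
    # Members are the remaining keys, in the order the struct lists them.
    members = [key for key in cell if key != tag_key]
    # Assumption: the tag is the index of the active member.
    name = members[cell[tag_key]]
    return name, cell[name]


print(decode_union({"": 0, "num": 2, "str": None}))
```

If the assumption does not hold for other rows, checking which member is non-null would be a safer fallback.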

Specifying `dtype_backend` in pd.read_parquet

Passing dtype_backend changes the dtypes of the Series. Considering the case of writing back to Parquet, I think dtype_backend="pyarrow" is the better choice.

>>> df_pd2 = pd.read_parquet("all_types.zstd.v2.parquet", dtype_backend="numpy_nullable")
>>> show_df_elem(df_pd2)
            name               dtypes                                            r0_repr                                            r0_type
0        _bigint                Int64                                        np.int64(1)                              <class 'numpy.int64'>
1           _bit       string[python]                                               '10'                                      <class 'str'>
2          _blob               object                                            b'\n1b'                                    <class 'bytes'>
3       _boolean              boolean                                          np.False_                               <class 'numpy.bool'>
4          _date               object                          datetime.date(2025, 1, 1)                            <class 'datetime.date'>
5       _decimal               object                               Decimal('99.999900')                          <class 'decimal.Decimal'>
6        _double              Float64                                   np.float64(1.23)                            <class 'numpy.float64'>
7         _float              Float32                                   np.float32(1.23)                            <class 'numpy.float32'>
8       _integer                Int32                                      np.int32(123)                              <class 'numpy.int32'>
9       _hugeint              Float64                 np.float64(1.7014118346046923e+38)                            <class 'numpy.float64'>
10     _smallint                Int16                                        np.int16(1)                              <class 'numpy.int16'>
11      _tinyint                 Int8                                         np.int8(1)                               <class 'numpy.int8'>
12     _utinyint                UInt8                                      np.uint8(255)                              <class 'numpy.uint8'>
13    _usmallint               UInt16                                   np.uint16(65535)                             <class 'numpy.uint16'>
14     _uinteger               UInt32                              np.uint32(4294967295)                             <class 'numpy.uint32'>
15      _ubigint               UInt64                    np.uint64(18446744073709551615)                             <class 'numpy.uint64'>
16         _time               object                               datetime.time(12, 0)                            <class 'datetime.time'>
17    _timestamp       datetime64[us]                   Timestamp('2025-01-01 12:00:00')  <class 'pandas._libs.tslibs.timestamps.Timesta...
18  _timestamptz  datetime64[us, UTC]    Timestamp('2025-01-01 12:00:00+0000', tz='UTC')  <class 'pandas._libs.tslibs.timestamps.Timesta...
19     _interval               object  b'\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00...                                    <class 'bytes'>
20         _uuid               object     b'\x12>Eg\xe8\x9b\x12\xd3\xa4VBf\x14\x17@\x00'                                    <class 'bytes'>
21         _json       string[python]                                 '{"key": "value"}'                                      <class 'str'>
22        _array               object  array([{'element': 1}, {'element': 2}, {'eleme...                            <class 'numpy.ndarray'>
23         _list               object                      array([4, 5, 6], dtype=int32)                            <class 'numpy.ndarray'>
24          _map               object                               [(1, 'a'), (2, 'b')]                                     <class 'list'>
25       _struct               object                             {'i': 42, 's': 'text'}                                     <class 'dict'>
26        _union               object                     {'': 0, 'num': 2, 'str': None}                                     <class 'dict'>

>>> df_pd3 = pd.read_parquet("all_types.zstd.v2.parquet", dtype_backend="pyarrow")
>>> show_df_elem(df_pd3)
            name                                             dtypes                                            r0_repr                                            r0_type
0        _bigint                                     int64[pyarrow]                                                  1                                      <class 'int'>
1           _bit                                    string[pyarrow]                                               '10'                                      <class 'str'>
2          _blob                                    binary[pyarrow]                                            b'\n1b'                                    <class 'bytes'>
3       _boolean                                      bool[pyarrow]                                              False                                     <class 'bool'>
4          _date                               date32[day][pyarrow]                          datetime.date(2025, 1, 1)                            <class 'datetime.date'>
5       _decimal                         decimal128(10, 6)[pyarrow]                               Decimal('99.999900')                          <class 'decimal.Decimal'>
6        _double                                    double[pyarrow]                                               1.23                                    <class 'float'>
7         _float                                     float[pyarrow]                                 1.2300000190734863                                    <class 'float'>
8       _integer                                     int32[pyarrow]                                                123                                      <class 'int'>
9       _hugeint                                    double[pyarrow]                             1.7014118346046923e+38                                    <class 'float'>
10     _smallint                                     int16[pyarrow]                                                  1                                      <class 'int'>
11      _tinyint                                      int8[pyarrow]                                                  1                                      <class 'int'>
12     _utinyint                                     uint8[pyarrow]                                                255                                      <class 'int'>
13    _usmallint                                    uint16[pyarrow]                                              65535                                      <class 'int'>
14     _uinteger                                    uint32[pyarrow]                                         4294967295                                      <class 'int'>
15      _ubigint                                    uint64[pyarrow]                               18446744073709551615                                      <class 'int'>
16         _time                                time64[us][pyarrow]                               datetime.time(12, 0)                            <class 'datetime.time'>
17    _timestamp                             timestamp[us][pyarrow]                   Timestamp('2025-01-01 12:00:00')  <class 'pandas._libs.tslibs.timestamps.Timesta...
18  _timestamptz                     timestamp[us, tz=UTC][pyarrow]    Timestamp('2025-01-01 12:00:00+0000', tz='UTC')  <class 'pandas._libs.tslibs.timestamps.Timesta...
19     _interval                     fixed_size_binary[12][pyarrow]  b'\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00...                                    <class 'bytes'>
20         _uuid                     fixed_size_binary[16][pyarrow]     b'\x12>Eg\xe8\x9b\x12\xd3\xa4VBf\x14\x17@\x00'                                    <class 'bytes'>
21         _json                                    string[pyarrow]                                 '{"key": "value"}'                                      <class 'str'>
22        _array  list<array: struct<element: int32> not null>[p...   [{'element': 1}, {'element': 2}, {'element': 3}]                                     <class 'list'>
23         _list                      list<element: int32>[pyarrow]                                          [4, 5, 6]                                     <class 'list'>
24          _map               map<int32, string ('_map')>[pyarrow]                               [(1, 'a'), (2, 'b')]                                     <class 'list'>
25       _struct               struct<i: int32, s: string>[pyarrow]                             {'i': 42, 's': 'text'}                                     <class 'dict'>
26        _union  struct<: uint8, num: int32, str: string>[pyarrow]                     {'': 0, 'num': 2, 'str': None}                                     <class 'dict'>

Polars

As shown below, the read fails with an error. The message says Arrow datatype Interval(DayTime) not supported by Polars., so the interval type cannot be read.

>>> pl.read_parquet("all_types.zstd.v2.parquet")

thread '<unnamed>' panicked at crates/polars-core/src/datatypes/field.rs:234:19:
Arrow datatype Interval(DayTime) not supported by Polars. You probably need to activate that data-type feature.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kmkm/_work/duckdb_parquet_v2/.venv/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kmkm/_work/duckdb_parquet_v2/.venv/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kmkm/_work/duckdb_parquet_v2/.venv/lib/python3.12/site-packages/polars/io/parquet/functions.py", line 241, in read_parquet
    return lf.collect()
           ^^^^^^^^^^^^
  File "/home/kmkm/_work/duckdb_parquet_v2/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 2056, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: Arrow datatype Interval(DayTime) not supported by Polars. You probably need to activate that data-type feature.

So I created a Parquet file without the interval type and read that instead, but the following error appeared. The map type does not work either.

pyo3_runtime.PanicException: Arrow datatype Map(Field { name: "key_value", dtype: LargeList(Field { name: "key_value", dtype: Struct([Field { name: "key", dtype: Int32, is_nullable: false, metadata: None }, Field { name: "value", dtype: Utf8View, is_nullable: true, metadata: None }]), is_nullable: true, metadata: None }), is_nullable: true, metadata: None }, false) not supported by Polars. You probably need to activate that data-type feature.

The following creates a Parquet file without the interval and map types.

COPY (FROM VALUES (
    1::BIGINT,
    '10'::BIT,
    '\x0a1b'::BLOB,
    false::BOOLEAN,
    '2025-01-01'::DATE,
    '99.9999'::DECIMAL(10,6),
    1.23::DOUBLE,
    1.23::FLOAT,
    123::INTEGER,
    170141183460469231731687303715884105727::HUGEINT,
    1::SMALLINT,
    1::TINYINT,
    255::UTINYINT,
    65535::USMALLINT,
    4294967295::UINTEGER,
    18446744073709551615::UBIGINT,
    '12:00:00'::TIME,
    '2025-01-01 12:00:00'::TIMESTAMP,
    '2025-01-01 12:00:00+00'::TIMESTAMPTZ,
    '123e4567-e89b-12d3-a456-426614174000'::UUID,
    '{"key": "value"}'::JSON,
    [1, 2, 3]::INTEGER[3],
    [4, 5, 6]::INTEGER[],
    {'i': 42, 's': 'text'}::STRUCT(i INTEGER, s VARCHAR),
    union_value(num := 2)::UNION(num INTEGER, str VARCHAR)
) AS t(
    _bigint, _bit, _blob, _boolean, _date, _decimal,
    _double, _float, _integer, _hugeint, _smallint,
    _tinyint, _utinyint, _usmallint, _uinteger, _ubigint,
    _time, _timestamp, _timestamptz, _uuid,
    _json, _array, _list, _struct, _union
))
TO 'all_types_for_polars.zstd.v2.parquet' (COMPRESSION zstd, PARQUET_VERSION V2);

Now the file can be read.

df_pl = pl.read_parquet("all_types_for_polars.zstd.v2.parquet")

Prepare a helper function in the same way.

def show_df_pl(df: pl.DataFrame) -> pd.DataFrame:
    cols = df.columns
    dtypes = df.dtypes
    r0_repr = [repr(v[0]) for _, v in df.to_dict().items()]
    r0_type = [type(v[0]) for _, v in df.to_dict().items()]
    out = pd.DataFrame({
      "name": cols,
      "dtypes": dtypes,
      "r0_repr": r0_repr,
      "r0_type": r0_type,
    })
    return out

As the run below shows, the overall tendency is the same as with Pandas. The difference is that json is read as Binary.

>>> df_pl = pl.read_parquet("all_types_for_polars.zstd.v2.parquet")
>>> show_df_pl(df_pl)
            name                                            dtypes                                            r0_repr                                r0_type
0        _bigint                                             Int64                                                  1                          <class 'int'>
1           _bit                                            String                                               '10'                          <class 'str'>
2          _blob                                            Binary                                            b'\n1b'                        <class 'bytes'>
3       _boolean                                           Boolean                                              False                         <class 'bool'>
4          _date                                              Date                          datetime.date(2025, 1, 1)                <class 'datetime.date'>
5       _decimal                    Decimal(precision=10, scale=6)                               Decimal('99.999900')              <class 'decimal.Decimal'>
6        _double                                           Float64                                               1.23                        <class 'float'>
7         _float                                           Float32                                 1.2300000190734863                        <class 'float'>
8       _integer                                             Int32                                                123                          <class 'int'>
9       _hugeint                                           Float64                             1.7014118346046923e+38                        <class 'float'>
10     _smallint                                             Int16                                                  1                          <class 'int'>
11      _tinyint                                              Int8                                                  1                          <class 'int'>
12     _utinyint                                             UInt8                                                255                          <class 'int'>
13    _usmallint                                            UInt16                                              65535                          <class 'int'>
14     _uinteger                                            UInt32                                         4294967295                          <class 'int'>
15      _ubigint                                            UInt64                               18446744073709551615                          <class 'int'>
16         _time                                              Time                               datetime.time(12, 0)                <class 'datetime.time'>
17    _timestamp          Datetime(time_unit='us', time_zone=None)               datetime.datetime(2025, 1, 1, 12, 0)            <class 'datetime.datetime'>
18  _timestamptz         Datetime(time_unit='us', time_zone='UTC')  datetime.datetime(2025, 1, 1, 12, 0, tzinfo=zo...            <class 'datetime.datetime'>
19         _uuid                                            Binary     b'\x12>Eg\xe8\x9b\x12\xd3\xa4VBf\x14\x17@\x00'                        <class 'bytes'>
20         _json                                            Binary                                b'{"key": "value"}'                        <class 'bytes'>
21        _array                  List(Struct({'element': Int32}))  shape: (3,)\nSeries: '' [struct[1]]\n[\n\t{1}\...  <class 'polars.series.series.Series'>
22         _list                                       List(Int32)  shape: (3,)\nSeries: '' [i32]\n[\n\t4\n\t5\n\t...  <class 'polars.series.series.Series'>
23       _struct                 Struct({'i': Int32, 's': String})                             {'i': 42, 's': 'text'}                         <class 'dict'>
24        _union  Struct({'': UInt8, 'num': Int32, 'str': String})                     {'': 0, 'num': 2, 'str': None}                         <class 'dict'>
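Because the json column arrives as str in Pandas and as bytes in Polars, code that handles both may want a single normalizing step. The following is a minimal sketch using only the standard library; `parse_json_cell` is a hypothetical name.

```python
import json
from typing import Any, Union


def parse_json_cell(cell: Union[str, bytes, None]) -> Any:
    """Parse a JSON cell that may arrive as str (Pandas) or bytes (Polars)."""
    if cell is None:
        return None
    # json.loads accepts both str and UTF-8 encoded bytes.
    return json.loads(cell)


print(parse_json_cell('{"key": "value"}'))
print(parse_json_cell(b'{"key": "value"}'))
```

Both calls yield the same Python dict, so downstream code does not need to know which library read the file.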

PyArrow

Reading with PyArrow gives the following.

>>> pq.read_table("all_types.zstd.v2.parquet")
pyarrow.Table
_bigint: int64
_bit: string
_blob: binary
_boolean: bool
_date: date32[day]
_decimal: decimal128(10, 6)
_double: double
_float: float
_integer: int32
_hugeint: double
_smallint: int16
_tinyint: int8
_utinyint: uint8
_usmallint: uint16
_uinteger: uint32
_ubigint: uint64
_time: time64[us]
_timestamp: timestamp[us]
_timestamptz: timestamp[us, tz=UTC]
_interval: fixed_size_binary[12]
_uuid: fixed_size_binary[16]
_json: string
_array: list<array: struct<element: int32> not null>
  child 0, array: struct<element: int32> not null
      child 0, element: int32
_list: list<element: int32>
  child 0, element: int32
_map: map<int32, string ('_map')>
  child 0, _map: struct<key: int32 not null, value: string> not null
      child 0, key: int32 not null
      child 1, value: string
_struct: struct<i: int32, s: string>
  child 0, i: int32
  child 1, s: string
_union: struct<: uint8, num: int32, str: string>
  child 0, : uint8
  child 1, num: int32
  child 2, str: string
----
_bigint: [[1]]
_bit: [["10"]]
_blob: [[0A3162]]
_boolean: [[false]]
_date: [[2025-01-01]]
_decimal: [[99.999900]]
_double: [[1.23]]
_float: [[1.23]]
_integer: [[123]]
_hugeint: [[1.7014118346046923e+38]]
...

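json was a str, the same as in Pandas.

The values below _hugeint were cut off in the display, so I checked them as follows. For array, the result does not look like a dict with an "element" key as it did in Pandas and Polars, but that may just be how repr displays it. Honestly, I do not fully understand this part.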

>>> [x for x in pq.read_table("all_types.zstd.v2.parquet")]

Result of the command above

[<pyarrow.lib.ChunkedArray object at 0x7eff073ec340>
[
  [
    1
  ]
], <pyarrow.lib.ChunkedArray object at 0x7eff073edcc0>
[
  [
    "10"
  ]
], <pyarrow.lib.ChunkedArray object at 0x7eff073edc60>
[
  [
    0A3162
  ]
], <pyarrow.lib.ChunkedArray object at 0x7eff073efd60>
[
  [
    false
  ]
], <pyarrow.lib.ChunkedArray object at 0x7eff073ec400>
[
  [
    2025-01-01
  ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c37c0>
[
  [
    99.999900
  ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c1540>
[
  [
    1.23
  ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c18a0>
[
  [
    1.23
  ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c2080>
[
  [
    123
  ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c1480>
[
  [
    1.7014118346046923e+38
  ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c1f00>
[
  [
    1
  ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c32e0>
[
  [
    1
  ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c3460>
[
  [
    255
  ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c0fa0>
[
  [
    65535
  ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c1180>
[
  [
    4294967295
  ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c13c0>
[
  [
    18446744073709551615
  ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c3fa0>
[
  [
    12:00:00.000000
  ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c2260>
[
  [
    2025-01-01 12:00:00.000000
  ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c3b20>
[
  [
    2025-01-01 12:00:00.000000Z
  ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c3e80>
[
  [
    000000000100000000000000
  ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c1120>
[
  [
    123E4567E89B12D3A456426614174000
  ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c3760>
[
  [
    "{"key": "value"}"
  ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c14e0>
[
  [
    -- is_valid: all not null
    -- child 0 type: int32
      [
        1,
        2,
        3
      ]
  ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c1360>
[
  [
    [
      4,
      5,
      6
    ]
  ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c11e0>
[
  [
    keys:
    [
      1,
      2
    ]
    values:
    [
      "a",
      "b"
    ]
  ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c12a0>
[
  -- is_valid: all not null
  -- child 0 type: int32
    [
      42
    ]
  -- child 1 type: string
    [
      "text"
    ]
], <pyarrow.lib.ChunkedArray object at 0x7efedd0c3d60>
[
  -- is_valid: all not null
  -- child 0 type: uint8
    [
      0
    ]
  -- child 1 type: int32
    [
      2
    ]
  -- child 2 type: string
    [
      null
    ]
]]

Note that loading data read by pyarrow into polars did not raise an error. So to handle a Parquet file containing interval and map types in polars, going through pyarrow once works.

>>> pl.DataFrame(pq.read_table("all_types.zstd.v2.parquet"))
shape: (1, 27)
┌─────────┬──────┬───────────┬──────────┬───┬───────────┬────────────────────┬─────────────┬────────────┐
│ _bigint ┆ _bit ┆ _blob     ┆ _boolean ┆ … ┆ _list     ┆ _map               ┆ _struct     ┆ _union     │
│ ---     ┆ ---  ┆ ---       ┆ ---      ┆   ┆ ---       ┆ ---                ┆ ---         ┆ ---        │
│ i64     ┆ str  ┆ binary    ┆ bool     ┆   ┆ list[i32] ┆ list[struct[2]]    ┆ struct[2]   ┆ struct[3]  │
╞═════════╪══════╪═══════════╪══════════╪═══╪═══════════╪════════════════════╪═════════════╪════════════╡
│ 1       ┆ 10   ┆ b"\x0a1b" ┆ false    ┆ … ┆ [4, 5, 6] ┆ [{1,"a"}, {2,"b"}] ┆ {42,"text"} ┆ {0,2,null} │
└─────────┴──────┴───────────┴──────────┴───┴───────────┴────────────────────┴─────────────┴────────────┘

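Aside

For the "zstd.v2.parquet" file created with DuckDB v1.2.0 5f5512b827, I could not confirm from the metadata that it is in v2 format. This has been reported in #16099, so I hope it will be fixed in the next minor version.

As for this Parquet version, according to this source,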

In the case of WriterVersion.PARQUET_2_0, PageHeaderV2 will be used for data pages. It also appears in the footer in the EncodingStats of each column chunks. You may use the parquet-cli tool to check this:
parquet-cli footer {parquet file} | grep usesV2Pages

it may be possible to obtain a value called usesV2Pages, but as the run below shows, encodingStats was null everywhere.

$ parquet footer all_types.zstd.v2.parquet | grep encodingStats
      "encodingStats" : null,
      "encodingStats" : null,
      "encodingStats" : null,
      "encodingStats" : null,
      ...(omitted)...

In addition,

In parquet-mr if you set WriterVersion.PARQUET_2_0, it might use the delta encodings (DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY) for the related types but dictionary encoding has precedence, so it might happen that such a file does not have any of these encodings.

as stated there, I was able to confirm that a DELTA_* encoding is set in encodings. (Only one row of data was written, so other encodings such as DELTA_BINARY_PACKED do not appear.)

$ parquet footer all_types.zstd.v2.parquet | grep DELTA_*
        "encodings" : [ "DELTA_LENGTH_BYTE_ARRAY" ]

That's all.

