Losing hours to pandas Arrow types (how date types differ)

Last updated at 2025-04-05, posted at 2025-04-04

Purpose of this article

pandas DataFrames can now use pyarrow types, but a careless dtype mismatch stopped my data pipeline yet again. This article shares what happened.

Disclaimer

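The content of this article is my personal opinion and view; it does not reflect the official position, policy, or opinion of the organization I belong to. The organization assumes no responsibility for the content of this article.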

What happened

A column representing dates in a Parquet file held datetime.date values (not pd.Timestamp).

It all started when I carelessly added dtype_backend="pyarrow" to the pd.read_parquet call that reads that Parquet file.
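One way to catch this kind of mismatch before it reaches downstream code is to check the dtypes right after reading. The following is only a minimal sketch: assert_dtypes is a hypothetical helper written for this illustration (it is not a pandas API), and the small DataFrame stands in for the result of pd.read_parquet.

```python
import datetime

import pandas as pd


def assert_dtypes(df: pd.DataFrame, expected: dict[str, str]) -> None:
    """Hypothetical helper: raise TypeError if a listed column's dtype string differs."""
    actual = {col: str(df[col].dtype) for col in expected}
    mismatched = {
        col: (actual[col], want)
        for col, want in expected.items()
        if actual[col] != want
    }
    if mismatched:
        raise TypeError(f"unexpected dtypes (actual, expected): {mismatched}")


# Stand-in for a frame read without dtype_backend: dates arrive as Python objects.
df = pd.DataFrame({"a": [datetime.date(2024, 12, 1)], "b": [1]})

# Passes silently because the dtypes match what the pipeline expects.
assert_dtypes(df, {"a": "object", "b": "int64"})
```

If the same check were run on a frame read with dtype_backend="pyarrow", the dtype strings would differ and the helper would raise immediately instead of failing somewhere later in the pipeline.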

Conclusion first

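(Environment: Python 3.12.9, pandas 2.2.3, pyarrow 19.0.1)
For an object-dtype Series holding datetime.date values, I converted the DataFrame with df.convert_dtypes(dtype_backend=...), wrote it to Parquet, and then read it back with pd.read_parquet using each dtype_backend option. The resulting dtypes are shown below. (Whether this is how the API ought to behave is a separate question.)

Rows: the convert_dtypes(dtype_backend=...) call before to_parquet
Columns: the read_parquet(dtype_backend=...) argument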

convert_dtypes \ read_parquet	not specified	numpy_nullable	pyarrow
not specified	object	object	date32[day][pyarrow]
numpy_nullable	object	object	date32[day][pyarrow]
pyarrow	object	object	date32[day][pyarrow]

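Incidentally, integer columns behave as follows. Note that int64 and Int64 differ only in capitalization but are different dtypes, so even more care is needed.

convert_dtypes \ read_parquet	not specified	numpy_nullable	pyarrow
not specified	int64	Int64	int64[pyarrow]
numpy_nullable	Int64	Int64	int64[pyarrow]
pyarrow	int64[pyarrow]	Int64	int64[pyarrow]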
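To make the int64 / Int64 distinction concrete, here is a small sketch that does not touch Parquet at all. The variable names are made up for this illustration; it only shows that the two dtypes have different names and treat missing values differently.

```python
import pandas as pd

numpy_ints = pd.Series([1, 2], dtype="int64")     # NumPy-backed integers
nullable_ints = pd.Series([1, 2], dtype="Int64")  # pandas nullable extension dtype

# Same values, but the dtype names differ only in the capital "I".
print(str(numpy_ints.dtype), str(nullable_ints.dtype))

# Missing values are where the difference shows up.
inferred = pd.Series([1, None])                        # no dtype given: becomes float
nullable_with_na = pd.Series([1, None], dtype="Int64")  # stays integer, holds pd.NA
print(inferred.dtype, nullable_with_na.dtype)
```

Code that compares dtype strings, or that assumes integers stay integers when a value is missing, can therefore behave differently depending on which of the two it receives.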

Reproduction

I use uv, so I start IPython with pinned versions as follows and then create the sample data.

$ uvx --with pandas==2.2.3,pyarrow==19.0.1 -p=3.12.9 ipython

Sample data

import pandas as pd
import datetime

df = pd.DataFrame({
    "a": ["2024-12-01", "2024-12-02"],
    "b": [1, 2]
}).assign(
    a=lambda df: pd.to_datetime(df["a"]).dt.date
)
df.to_parquet("sample1.parquet")
df.convert_dtypes(dtype_backend="numpy_nullable").to_parquet("sample2.parquet")
df.convert_dtypes(dtype_backend="pyarrow").to_parquet("sample3.parquet")

Checking the behavior

After df.convert_dtypes(dtype_backend=...), the column is still an object-dtype Series (holding datetime.date values); no conversion takes place.

df.dtypes
# a    object
# b     int64
# dtype: object

df.convert_dtypes(dtype_backend="numpy_nullable").dtypes
# a    object
# b     Int64
# dtype: object

df.convert_dtypes(dtype_backend="pyarrow").dtypes
# a            object
# b    int64[pyarrow]
# dtype: object

Reading sample1

pd.read_parquet("sample1.parquet").dtypes
# a    object
# b     int64
# dtype: object

pd.read_parquet("sample1.parquet", dtype_backend="numpy_nullable").dtypes
# a    object
# b     Int64
# dtype: object

pd.read_parquet("sample1.parquet", dtype_backend="pyarrow").dtypes
# a    date32[day][pyarrow]
# b          int64[pyarrow]
# dtype: object

Reading sample2

pd.read_parquet("sample2.parquet").dtypes
# a    object
# b     Int64
# dtype: object

pd.read_parquet("sample2.parquet", dtype_backend="numpy_nullable").dtypes
# a    object
# b     Int64
# dtype: object

pd.read_parquet("sample2.parquet", dtype_backend="pyarrow").dtypes
# a    date32[day][pyarrow]
# b          int64[pyarrow]
# dtype: object

Reading sample3

pd.read_parquet("sample3.parquet").dtypes
# a            object
# b    int64[pyarrow]
# dtype: object

pd.read_parquet("sample3.parquet", dtype_backend="numpy_nullable").dtypes
# a    object
# b     Int64
# dtype: object

pd.read_parquet("sample3.parquet", dtype_backend="pyarrow").dtypes
# a    date32[day][pyarrow]
# b          int64[pyarrow]
# dtype: object

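Closing

I would like to eradicate the existing code that relies on the pandas DataFrame index and migrate to duckdb (via ibis) or Polars as soon as possible.

N.B.: It looks like this area is being sorted out for pandas 3.0 in the following PR, but progress seems slow.
https://github.com/pandas-dev/pandas/pull/58455

Addendum: for datetime, pd.read_csv(dtype_backend=...) also needs care. I had actually forgotten that I opened an issue about it.
https://github.com/pandas-dev/pandas/issues/55422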
