Handling the pandas 2.2.0 FutureWarning "Downcasting object dtype arrays ..."

Posted at 2024-02-02 (last updated at 2024-02-02)

Environment

Python 3.12.1
pandas 2.2.0

What happened

Running the following Python script raised a FutureWarning at the fillna() call.

sample.py

import pandas

df1 = pandas.DataFrame(columns=["name","count"])
print("\n=== df1 ===")
print(df1)
print("\n=== df1.dtypes ===")
print(df1.dtypes)

df2 = pandas.DataFrame({"name":["Alice","Bob"], "birth":[1990,1985]})
print("\n=== df2 ===")
print(df2)
print("\n=== df2.dtypes ===")
print(df2.dtypes)

df3 = df2.merge(df1, on="name", how="left")
print("\n=== df3 ===")
print(df3)
print("\n=== df3.dtypes ===")
print(df3.dtypes)

df4 = df3.fillna({"count":0})
print("\n=== df4 ===")
print(df4)
print("\n=== df4.dtypes ===")
print(df4.dtypes)

$ python sample.py 

=== df1 ===
Empty DataFrame
Columns: [name, count]
Index: []

=== df1.dtypes ===
name     object
count    object
dtype: object

=== df2 ===
    name  birth
0  Alice   1990
1    Bob   1985

=== df2.dtypes ===
name     object
birth     int64
dtype: object

=== df3 ===
    name  birth count
0  Alice   1990   NaN
1    Bob   1985   NaN

=== df3.dtypes ===
name     object
birth     int64
count    object
dtype: object
/home/vagrant/Documents/study/20240202/sample.py:21: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df4 = df3.fillna({"count":0})

=== df4 ===
    name  birth  count
0  Alice   1990      0
1    Bob   1985      0

=== df4.dtypes ===
name     object
birth     int64
count     int64
dtype: object

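Cause

In df3, the count column has dtype object. Calling fillna() replaces every NaN with 0, and pandas then silently converts the column from object to int64. This implicit downcast is what triggers the FutureWarning.

pandas emits this warning starting with version 2.2.0. The pandas 2.2.0 release notes describe the deprecation as follows: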

Deprecated the automatic downcasting of object dtype results in a number of methods. These would silently change the dtype in a hard to predict manner since the behavior was value dependent. Additionally, pandas is moving away from silent dtype changes (GH 54710, GH 54261).
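
The deprecation can be observed in isolation with a small object-dtype Series. The following is a minimal sketch rather than part of the original script: the Series `s` is a made-up stand-in for df3's count column, and what gets printed depends on the installed pandas version. On 2.2.x the output should match the behavior described above; other versions may differ.

```python
import warnings

import pandas

# Hypothetical stand-in for df3["count"]: an object column holding only missing values.
s = pandas.Series([None, None], dtype="object")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning instead of printing it
    filled = s.fillna(0)

# Keep only the FutureWarning messages that were recorded during fillna().
future_warnings = [
    str(w.message) for w in caught if issubclass(w.category, FutureWarning)
]

print("pandas version:", pandas.__version__)
print("dtype after fillna:", filled.dtype)
print("FutureWarnings recorded:", len(future_warnings))
```

Printing the version next to the dtype makes it easy to compare the behavior before and after upgrading pandas.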

Fix

Setting the dtype of df1's count column to int or float made the warning go away.

df1 = pandas.DataFrame(columns=["name","count"])
df1 = df1.astype({"count":"int"})

$ python sample.py 

=== df1 ===
Empty DataFrame
Columns: [name, count]
Index: []

=== df1.dtypes ===
name     object
count     int64
dtype: object

=== df2 ===
    name  birth
0  Alice   1990
1    Bob   1985

=== df2.dtypes ===
name     object
birth     int64
dtype: object

=== df3 ===
    name  birth  count
0  Alice   1990    NaN
1    Bob   1985    NaN

=== df3.dtypes ===
name      object
birth      int64
count    float64
dtype: object

=== df4 ===
    name  birth  count
0  Alice   1990    0.0
1    Bob   1985    0.0

=== df4.dtypes ===
name      object
birth      int64
count    float64
dtype: object

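Here I set df1's count column to int64, yet df3's count column is float64. That is because df2.merge() left NaN in df3's count column, and an int64 column cannot hold NaN, so pandas upcasts the column to float64.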
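
If count should end up as an integer column, one option is to cast it back after the NaN values are filled. The snippet below is a sketch under that assumption; the trailing astype() call is my addition and is not part of the script above.

```python
import pandas

# Same frames as in the article, with df1's count column typed as int64 up front.
df1 = pandas.DataFrame(columns=["name", "count"]).astype({"count": "int64"})
df2 = pandas.DataFrame({"name": ["Alice", "Bob"], "birth": [1990, 1985]})

# The left merge finds no match in df1, so count is filled with NaN.
df3 = df2.merge(df1, on="name", how="left")
print(df3["count"].dtype)  # float64 in the run shown above

# Fill the NaN values first, then cast explicitly; the cast is safe once no NaN remains.
df4 = df3.fillna({"count": 0}).astype({"count": "int64"})
print(df4.dtypes)
```

The cast is explicit, so it does not rely on the silent downcasting that pandas is deprecating.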

Additional notes

Calling infer_objects(), which the warning message suggests, also makes the warning go away.

df3 = df2.merge(df1, on="name", how="left")
df3 = df3.infer_objects()

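However, in this example the root cause is that df1's count column is already object before the merge, so I think it is better to fix the dtype there than to rely on infer_objects().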
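
For reference, here is a self-contained sketch of the infer_objects() variant. It turns FutureWarning into an error around fillna(), so the call would raise if the warning were still emitted. The warnings filter is only there for verification and is my addition, not something the original script used.

```python
import warnings

import pandas

# df1's count column is left as object, exactly as in the original script.
df1 = pandas.DataFrame(columns=["name", "count"])
df2 = pandas.DataFrame({"name": ["Alice", "Bob"], "birth": [1990, 1985]})

# infer_objects() re-infers the all-NaN object column as a numeric dtype.
df3 = df2.merge(df1, on="name", how="left").infer_objects()
print(df3.dtypes)

with warnings.catch_warnings():
    # Escalate FutureWarning to an exception, only inside this block.
    warnings.simplefilter("error", FutureWarning)
    df4 = df3.fillna({"count": 0})

print(df4)
```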

Thoughts

When creating an empty DataFrame, as in pandas.DataFrame(columns=["name","count"]), I came away thinking it is better to set the dtypes properly as well.

Ideally I would like to specify a dtype per column, as in pandas.DataFrame(dtype={"count":"int64"}).¹ However, the dtype argument of pandas.DataFrame() accepts only a single dtype, so I used astype() to set the dtype per column.
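
Another way to get per-column dtypes on an empty DataFrame is to build it from empty Series objects, each created with its own dtype. The helper name `empty_frame` below is hypothetical and only illustrates the idea; it is an alternative to the astype() approach used above, not what the original script does.

```python
import pandas


def empty_frame(schema):
    """Hypothetical helper: build an empty DataFrame from a {column: dtype} mapping."""
    return pandas.DataFrame(
        {name: pandas.Series(dtype=dtype) for name, dtype in schema.items()}
    )


df1 = empty_frame({"name": "object", "count": "int64"})
print(df1.dtypes)
```

Declaring the schema in one mapping keeps the column names and their dtypes together, which makes it harder to forget a dtype when a column is added later.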

¹ The dtype argument of pandas.read_csv does accept per-column dtypes.
