A pitfall when aggregating pandas category columns

Python

Last updated at 2020-09-11, posted at 2020-09-10

When a column's dtype is category, a groupby aggregation can produce rows for values that do not actually appear in the data.

import pandas as pd  # version 1.1.2

# Define the DataFrame
df = pd.DataFrame({
    'col1': ['a', 'a', 'b', 'b', 'c', 'c'],
    'col2': [1, 2, 1, 2, 1, 2]
})
# Convert col1 to the category dtype
df['col1'] = df['col1'].astype('category')
# Copy the first 3 rows
df_sub = df.head(3).copy()
# Group by col1 and aggregate col2 (observed defaults to False in this version)
df_grp = df_sub.groupby('col1')
df_agg = df_grp.agg({'col2': 'mean'}).reset_index()
df_agg.columns = ['col1', 'mean_col2']

df_sub looks like this:

	col1	col2
0	a	1
1	a	2
2	b	1

df_agg looks like this:

	col1	mean_col2
0	a	1.5
1	b	1.0
2	c	NaN

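Problem

Although the aggregation was run on df_sub, the result contains a row where col1 is c. Inspecting df_grp.groups shows {'a': [0, 1], 'b': [2], 'c': []}. The category dtype keeps all three categories even after head(3) has dropped every c row, and groupby creates a group for each category by default, including the empty one.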
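To see where the empty group comes from, a minimal sketch like the following compares the values present in df_sub with the categories its dtype still carries. The names present, declared and sizes are introduced here only for illustration, and observed=False is passed explicitly so the behaviour does not depend on the default of the installed pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'a', 'b', 'b', 'c', 'c'],
    'col2': [1, 2, 1, 2, 1, 2],
})
df['col1'] = df['col1'].astype('category')
df_sub = df.head(3).copy()

# Values that actually occur in the subset
present = sorted(set(df_sub['col1']))
# Categories stored in the dtype; slicing rows does not shrink this list
declared = list(df_sub['col1'].cat.categories)

# With observed=False, groupby builds one group per declared category
sizes = df_sub.groupby('col1', observed=False).size()

print(present)
print(declared)
print(sizes)
```

The group for c has size 0, which is why its mean comes out as NaN.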

Fix

Define df_grp as follows, so that only categories actually observed in the data form groups:

df_grp = df_sub.groupby('col1', observed=True)
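As a rough end-to-end check, the sketch below reruns the aggregation with observed=True. It also shows one possible alternative that is not part of the original article: dropping unused categories with cat.remove_unused_categories() before grouping. The names df_agg_observed, df_trimmed and df_agg_trimmed are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'a', 'b', 'b', 'c', 'c'],
    'col2': [1, 2, 1, 2, 1, 2],
})
df['col1'] = df['col1'].astype('category')
df_sub = df.head(3).copy()

# Fix from the article: only group on categories that appear in the data
df_agg_observed = (
    df_sub.groupby('col1', observed=True)
    .agg({'col2': 'mean'})
    .reset_index()
)
df_agg_observed.columns = ['col1', 'mean_col2']

# Alternative (an assumption, not from the article): shrink the category
# list itself, so even observed=False has no empty group to report
df_trimmed = df_sub.copy()
df_trimmed['col1'] = df_trimmed['col1'].cat.remove_unused_categories()
df_agg_trimmed = (
    df_trimmed.groupby('col1', observed=False)
    .agg({'col2': 'mean'})
    .reset_index()
)
df_agg_trimmed.columns = ['col1', 'mean_col2']

print(df_agg_observed)
print(df_agg_trimmed)
```

Both results contain only the a and b rows. observed=True leaves the column's categories untouched, whereas remove_unused_categories() changes the dtype of the copy, which may or may not be desirable if the subset is later combined with other data.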

Note: I changed the fix after @nkay kindly pointed this out. Thank you.

groupby documentation
