
Why I think you should use reindex when extracting columns in Pandas

Posted at 2021-11-27

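Summary

Do not extract columns from a Pandas DataFrame without first making sure they exist.
When you extract columns from a DataFrame, do one of the following:

Check that the columns exist before extracting them
Extract them with the reindex function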
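
As a rough illustration of the two options, here is a minimal sketch on a toy frame. The names `wanted`, `missing`, and `selected` are made up for this example and do not come from the article.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 4, 8], "B": [1, 5, 9]}, index=["X", "Y", "Z"])
wanted = ["A", "P"]  # "P" is deliberately absent from df

# Option 1: check for missing labels before selecting.
missing = [col for col in wanted if col not in df.columns]
if missing:
    print(f"missing columns: {missing}")
else:
    print(df[wanted])

# Option 2: reindex does not raise for absent labels;
# it adds them as columns filled with NaN.
selected = df.reindex(columns=wanted)
print(selected)
```

With the check, you decide explicitly what happens when a column is missing. With reindex, the result always has the requested columns, and the missing ones show up as NaN.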

A typical example

import numpy as np
import pandas as pd

df = pd.DataFrame(
        np.arange(12).reshape(3, 4),
        index=["X", "Y", "Z"],
        columns=["A", "B", "C", "D"]
    )

print(df)
#    A  B   C   D
# X  0  1   2   3
# Y  4  5   6   7
# Z  8  9  10  11


# Extract only columns A and B
df_AB = df[["A", "B"]]

print(df_AB)
#    A  B
# X  0  1
# Y  4  5
# Z  8  9


# Trying to extract a column that does not exist raises an error
df_PD = df[["P", "D"]]
# KeyError: "['P'] not in index"

Raising an error for a nonexistent column is fine in itself. The trouble is that some pandas functions can drop a column without you noticing.

sample_data.csv

日付,店舗,売上
2019-12-31,A,100
2019-12-31,B,200
2019-12-31,C,300
2020-01-01,A,100
2020-01-01,C,300
2020-01-02,A,100
2020-01-02,C,300

import pandas as pd

# Example: per-store sales of a supermarket chain, read from a CSV file
# (日付 = date, 店舗 = store, 売上 = sales)
df = pd.read_csv(
        "sample_data.csv",
        index_col=[0, 1],
        parse_dates=[0,]
    )

print(df)
#                売上
# 日付       店舗
# 2019-12-31 A   100
#            B   200
#            C   300
# 2020-01-01 A   100
#            C   300
# 2020-01-02 A   100
#            C   300

# Use unstack to pivot the stores (店舗) into columns
df_unstack = df.unstack()
print(df_unstack)
#             売上
# 店舗         A      B      C
# 日付
# 2019-12-31  100.0  200.0  300.0
# 2020-01-01  100.0    NaN  300.0
# 2020-01-02  100.0    NaN  300.0

# Each store's sales are now laid out by day, which is easier to read


# Now do the same thing with only the 2020 data...
df_only_2020 = df.loc[df.index.get_level_values("日付").year == 2020]
print(df_only_2020)
#                売上
# 日付       店舗
# 2020-01-01 A   100
#            C   300
# 2020-01-02 A   100
#            C   300

df_only_2020_unstack = df_only_2020.unstack()
print(df_only_2020_unstack)
#              売上
# 店舗           A    C
# 日付
# 2020-01-01  100  300
# 2020-01-02  100  300

# Store B has no 2020 rows, so its column is gone.
# Trying to extract store B in this state raises an error.
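
To see the failure concretely, the following sketch rebuilds the 2020-only table and then asks for store B. The CSV rows are inlined through `io.StringIO` purely so the snippet runs without the file; that, and the `error_raised` flag, are assumptions of this sketch rather than part of the original workflow.

```python
import io

import pandas as pd

# The same rows as sample_data.csv, inlined so the sketch is self-contained.
csv_text = """日付,店舗,売上
2019-12-31,A,100
2019-12-31,B,200
2019-12-31,C,300
2020-01-01,A,100
2020-01-01,C,300
2020-01-02,A,100
2020-01-02,C,300
"""

df = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1], parse_dates=[0])
df_only_2020 = df.loc[df.index.get_level_values("日付").year == 2020]
df_only_2020_unstack = df_only_2020.unstack()

# After unstack the columns are a MultiIndex of (売上, 店舗) pairs,
# and the pair for store B is not among them.
print(("売上", "B") in df_only_2020_unstack.columns)  # False

error_raised = False
try:
    df_only_2020_unstack[("売上", "B")]
except KeyError as exc:
    error_raised = True
    print(f"KeyError: {exc}")
```

Nothing in the filtering or unstacking steps warns that store B disappeared. The problem only surfaces later, when the column is requested.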

So

When you extract columns from a DataFrame, either check that they exist or use reindex.
Personally I recommend reindex, so here is a quick look at what it gives you.

import pandas as pd

# Example: per-store sales of a supermarket chain, read from a CSV file
df = pd.read_csv(
        "sample_data.csv",
        index_col=[0, 1],
        parse_dates=[0,]
    )

print(df)
#                売上
# 日付       店舗
# 2019-12-31 A   100
#            B   200
#            C   300
# 2020-01-01 A   100
#            C   300
# 2020-01-02 A   100
#            C   300

# Extract only the 2020 data
df_only_2020 = df.loc[df.index.get_level_values("日付").year == 2020]
print(df_only_2020)
#                売上
# 日付       店舗
# 2020-01-01 A   100
#            C   300
# 2020-01-02 A   100
#            C   300

# Pivot the stores into columns
df_only_2020_unstack = df_only_2020.unstack()
print(df_only_2020_unstack)
#              売上
# 店舗           A    C
# 日付
# 2020-01-01  100  300
# 2020-01-02  100  300

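# Extract all three stores from the table above with reindex.
# The columns are a MultiIndex of (売上, 店舗) pairs, so the target
# passed to reindex is built as a MultiIndex as well.
columns_ABC = pd.MultiIndex.from_product(
        [["売上"], ["A", "B", "C"]],
        names=[None, "店舗"]
    )
df_only_2020_ABC = df_only_2020_unstack.reindex(columns=columns_ABC)
print(df_only_2020_ABC)
#              売上
# 店舗           A   B    C
# 日付
# 2020-01-01  100 NaN  300
# 2020-01-02  100 NaN  300

# The column for store B is filled in with NaN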
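
If NaN is not what you want in the added column, reindex also accepts a `fill_value` argument. The sketch below uses a small hand-built frame, `sales_2020`, as a hypothetical stand-in for the 2020 table with flat column labels. Whether 0 is an appropriate fill depends on what a missing store means in your data (no sales versus no record), so treat it as one possible choice.

```python
import pandas as pd

# Hypothetical stand-in for the 2020 per-store sales table (flat columns).
sales_2020 = pd.DataFrame(
    {"A": [100, 100], "C": [300, 300]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02"]),
)

# fill_value replaces the default NaN in columns that reindex has to add.
sales_abc = sales_2020.reindex(columns=["A", "B", "C"], fill_value=0)
print(sales_abc)
```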
