
A simple way to make any() nearly 100x faster in a DataFrame groupby

pandas

Last updated at 2018-02-18, posted at 2018-02-17

Overview

For some reason, any() on a groupby is slow.
However, it can be sped up with the approach shown here.

Method

Source data used for the measurements

import numpy as np
import pandas as pd
df = pd.DataFrame({
    "A": np.arange(100000) // 10,  # 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, ... i.e. 10,000 distinct values with 10 rows each
    "B": list(map(lambda x: x == 0, np.random.randint(0, 20, size=100000)))  # True with 5% probability
})

Conventional approach (slow)

grp = df.groupby('A')[['B']].any()  # True if any value of column B in the group is True
# takes about 830 ms

New approach (fast)

# Add a fast function named any2 to the groupby class
def any2(self):
    sums = self.sum()  # count the True values in each group
    return sums > 0  # True if there is at least one
pd.core.groupby.DataFrameGroupBy.any2 = any2

grp = df.groupby('A')[['B']].any2()
# finishes in 8 to 9 ms

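Going through sum() in this way makes the aggregation faster.
For some reason, boolean aggregation methods such as any() and all() are slow, so expressing them with numeric aggregations is faster.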
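To check this claim on your own machine without patching pandas, a minimal sketch along the following lines may help. The helper name `any_via_sum` and the fixed random seed are assumptions made for illustration and are not part of the original measurement. Timings depend heavily on the pandas version and hardware, so the figures quoted in this article may not reproduce.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape of data as in the article; the seed is an assumption for reproducibility.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": np.arange(100000) // 10,                # 10,000 groups of 10 rows each
    "B": rng.integers(0, 20, size=100000) == 0,  # True with roughly 5% probability
})


def any_via_sum(grouped):
    """Hypothetical helper: emulate any() by counting True values per group."""
    return grouped.sum() > 0


builtin = df.groupby("A")[["B"]].any()
via_sum = any_via_sum(df.groupby("A")[["B"]])

# Compare the raw values so the check does not depend on dtype details.
same_result = np.array_equal(builtin["B"].to_numpy(), via_sum["B"].to_numpy())
print("same result:", same_result)

# Average over a few runs; absolute numbers vary by environment.
runs = 5
t_builtin = timeit.timeit(lambda: df.groupby("A")[["B"]].any(), number=runs) / runs
t_via_sum = timeit.timeit(lambda: any_via_sum(df.groupby("A")[["B"]]), number=runs) / runs
print(f"any():       {t_builtin * 1000:.1f} ms")
print(f"sum() > 0:   {t_via_sum * 1000:.1f} ms")
```

Passing the grouped object to a plain function avoids modifying `DataFrameGroupBy`, which keeps the experiment isolated from the rest of a code base.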

Incidentally, all() can be sped up in the same way:

Speeding up all()

def all2(self):
    sums = self.sum()  # count the True values in each group
    counts = self.count()  # count the non-null values in each group
    return sums == counts  # True if every value is True
pd.core.groupby.DataFrameGroupBy.all2 = all2

grp = df.groupby('A')[['B']].all2()
# finishes in about 11 ms
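The sum-versus-count idea is easy to verify by hand on a tiny frame. The sketch below uses a hypothetical helper named `all_via_sum` and made-up data with no missing values. Note that `count()` only counts non-null entries, so behavior on columns containing NaN may differ from the built-in `all()`.

```python
import pandas as pd

# Three groups: all True, mixed, all False (illustrative data, not from the article).
small = pd.DataFrame({
    "A": [0, 0, 1, 1, 2, 2],
    "B": [True, True, True, False, False, False],
})


def all_via_sum(grouped):
    """Hypothetical helper: a group is all-True when its True count equals its size."""
    return grouped.sum() == grouped.count()


via_sum = all_via_sum(small.groupby("A")[["B"]])
builtin = small.groupby("A")[["B"]].all()

print(via_sum["B"].tolist())  # [True, False, False]
print(builtin["B"].tolist())  # [True, False, False]
```

Group 0 has a sum of 2 and a count of 2, so it is True; groups 1 and 2 have sums of 1 and 0 against a count of 2, so they are False.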
