Use numpy's where instead of pandas' apply

Posted at 2024-04-01

Goal

sample.csv

mean,std
-69.33333333333333,0
-61.785714285714285,4.516284468102667
-50.6875,2.4418230894149553
-61.5,0
-60.95,5.862234170833182
...

Read a 2,000-row CSV file like the sample above, and set std to 1 wherever it is 0.

Implementation

For a task like this, many people would probably write something like the following:

main.py

import pandas as pd

sample_df = pd.read_csv("sample.csv")

sample_df["std"] = sample_df["std"].apply(lambda x: 1 if x == 0 else x)

There is nothing wrong with this approach, but numpy.where can do the same job faster.

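What is numpy's where?

numpy.where is a function that, for each element of a numpy array, returns one value where a condition holds and another value where it does not.

Series.apply calls the given Python function once per element, one element at a time. numpy.where, by contrast, is a vectorized operation: the comparison and the selection each run over the whole array in compiled code rather than in a Python-level loop.

That is why it runs faster than apply.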
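To make the difference concrete, here is a minimal sketch that applies both approaches to a small hand-made Series and checks that they agree. The values are illustrative only and are not taken from sample.csv; the names `via_apply` and `via_where` are made up for this example.

```python
import numpy as np
import pandas as pd

# Illustrative values only; not taken from sample.csv.
std = pd.Series([0.0, 4.5, 2.4, 0.0, 5.8])

# Element-wise: the lambda is called from Python once per value.
via_apply = std.apply(lambda x: 1 if x == 0 else x)

# Vectorized: one comparison and one selection over the whole array.
via_where = pd.Series(np.where(std == 0, 1, std), index=std.index)

print(via_apply.tolist())
print(via_where.tolist())
```

Both lines print the same list, with each 0 replaced by 1, so the two approaches are interchangeable for this task and differ only in how the work is carried out.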

np.where(condition, x, y)

condition : the condition to test
x : the value returned where the condition is true
y : the value returned where the condition is false
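As a quick illustration of the three arguments, the sketch below uses a small made-up array (the numbers are assumptions chosen for the example). Here `x` is a scalar and `y` is the original array, so each element is either replaced by 1 or kept as it is.

```python
import numpy as np

# Made-up values for illustration.
values = np.array([0.0, 4.5, 0.0, 2.4])

# condition: a boolean array, True where the element equals 0.
condition = values == 0

# x = 1 is used where condition is True; y = values is used elsewhere.
result = np.where(condition, 1, values)

print(condition.tolist())  # [True, False, True, False]
print(result.tolist())     # [1.0, 4.5, 1.0, 2.4]
```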

Comparing execution speed

bench_apply.py

import time

import pandas as pd

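# Start the timer before reading the CSV, so the measurement includes file I/O
start_time = time.time()

sample_df = pd.read_csv("sample.csv")

# Replace every std of 0 with 1, one element at a time
sample_df["std"] = sample_df["std"].apply(lambda x: 1 if x == 0 else x)

print(time.time() - start_time) # 0.0034990310668945312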

bench_where.py

import time

import numpy as np
import pandas as pd

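# Start the timer before reading the CSV, so the measurement includes file I/O
start_time = time.time()

sample_df = pd.read_csv("sample.csv")

# Replace every std of 0 with 1 in a single vectorized call
sample_df["std"] = np.where(sample_df["std"] == 0, 1, sample_df["std"])

print(time.time() - start_time) # 0.0016324520111083984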

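In this measurement, which also includes the time spent reading the CSV, the numpy.where version was roughly twice as fast as the apply version (about 0.0016 s versus 0.0035 s).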
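A single run timed with time.time() is fairly noisy at this scale, and it mixes in the cost of read_csv. As a rough sketch of a more isolated comparison, the following uses timeit on an in-memory DataFrame. The data is synthetic (a stand-in for sample.csv, which is not reproduced here), and the helper names `with_apply` and `with_where` are made up for this example. The timings will vary by machine and library version, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for sample.csv: 2000 rows, with some std values forced to 0.
rng = np.random.default_rng(0)
std = rng.uniform(0.0, 6.0, size=2000)
std[rng.random(2000) < 0.3] = 0.0
df = pd.DataFrame({"mean": rng.uniform(-70.0, -50.0, size=2000), "std": std})


def with_apply(frame):
    # Element-wise replacement through a Python lambda.
    return frame["std"].apply(lambda x: 1 if x == 0 else x)


def with_where(frame):
    # Vectorized replacement over the whole column.
    return np.where(frame["std"] == 0, 1, frame["std"])


# Take the best of 5 repeats of 100 calls each, then convert to seconds per call.
apply_time = min(timeit.repeat(lambda: with_apply(df), number=100, repeat=5)) / 100
where_time = min(timeit.repeat(lambda: with_where(df), number=100, repeat=5)) / 100

print(f"apply: {apply_time:.6f} s per call")
print(f"where: {where_time:.6f} s per call")
```

Because the CSV is no longer read inside the timed region, the gap between the two approaches measured this way reflects only the replacement step itself.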
