More than 3 years have passed since last update.

[Speeding up pandas] If merge feels slow, use map


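#merge is slow!

Ever felt that way?
Especially in competitions with a lot of rows, it is seriously slow. How much time am I supposed to burn on feature engineering?!
In those cases, switching to map speeds things up.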

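#Example

The difference only shows up with a reasonable number of rows, so this time I'll use the Kaggle Titanic dataset.
It has a column called Age, and I want to count-encode it, i.e. replace each value with the number of rows that share that value.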
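
Before the real benchmark, here is a minimal sketch of what count encoding with map looks like on toy data. The `toy` frame and its values are made up for illustration and are not from the Titanic file.

```python
import pandas as pd

# Toy data standing in for the Age column (values are made up for illustration).
toy = pd.DataFrame({"Age": ["22", "38", "22", "nan", "22", "38"]})

# value_counts() returns a Series indexed by the distinct values.
counts = toy["Age"].value_counts()

# map() looks up each row's value in that index and returns the matching count.
toy["Age_count"] = toy["Age"].map(counts)

print(toy["Age_count"].tolist())  # [3, 2, 3, 1, 3, 2]
```

The point is that the Series returned by `value_counts()` already works as a lookup table, so no join is needed.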

merge_map.py
import pandas as pd
import time

df = pd.read_csv("train.csv")
# Treat Age as a string key
df["Age"] = df["Age"].astype("str")
t1 = time.time()

# pattern 1: merge
# rename_axis / reset_index(name=...) keeps the column names the same across pandas versions
age_count = df.Age.value_counts().rename_axis("index").reset_index(name="Age_count1")
df = pd.merge(df, age_count, left_on="Age", right_on="index", how="left")
t2 = time.time()

# pattern 2: map
df["Age_count2"] = df["Age"].map(df.Age.value_counts())
t3 = time.time()

print(t2 - t1)
print(t3 - t2)
#output
0.004603147506713867
0.0012080669403076172
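
A single `time.time()` measurement on a small file is noisy, so here is a rough sketch for repeating the comparison on a larger synthetic frame. The column name `key`, the row count, and the helper functions `count_with_merge` / `count_with_map` are assumptions chosen for illustration. It also checks that both approaches produce the same column.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for a larger dataset; size and column name are assumptions.
rng = np.random.default_rng(0)
big = pd.DataFrame({"key": rng.integers(0, 1000, size=200_000).astype(str)})


def count_with_merge(frame):
    # Build a two-column lookup table, then left-join it back onto the frame.
    counts = frame["key"].value_counts().rename_axis("key").reset_index(name="key_count")
    return frame.merge(counts, on="key", how="left")["key_count"]


def count_with_map(frame):
    # Use the value_counts() Series directly as a lookup table.
    return frame["key"].map(frame["key"].value_counts())


# Both approaches should yield the same counts, row for row.
merged = count_with_merge(big)
mapped = count_with_map(big)
same = bool((merged.to_numpy() == mapped.to_numpy()).all())
print("identical:", same)

# Best of a few repetitions; absolute numbers depend on the machine.
merge_time = min(timeit.repeat(lambda: count_with_merge(big), number=1, repeat=3))
map_time = min(timeit.repeat(lambda: count_with_map(big), number=1, repeat=3))
print(f"merge: {merge_time:.4f}s  map: {map_time:.4f}s")
```

No timings are listed here on purpose, since the ratio will vary with the data and the pandas version.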

In this case, map is about 3.8 times faster.
Things like target encoding can be done with map too.

# merge
age_target = df.groupby("Age").Survived.mean().reset_index().rename(columns={"Survived": "Age_target1"})
df = pd.merge(df, age_target, on="Age", how="left")
t4 = time.time()
# map
df["Age_target2"] = df["Age"].map(df.groupby("Age").Survived.mean())
t5 = time.time()
print(t4 - t3)
print(t5 - t4)
#output
0.005101919174194336
0.001428842544555664

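This one is also roughly 3.5 to 4 times faster. At this scale it hardly matters, but once it becomes 10 seconds versus 40 seconds, or 1 minute versus 4 minutes, the difference is big.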
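
To make the target-encoding version concrete, here is a small sketch on toy data. The `toy` frame and its values are invented for illustration; only the column names mirror the Titanic ones.

```python
import pandas as pd

# Toy data (made up); only the column names mirror the Titanic file.
toy = pd.DataFrame({
    "Age": ["22", "38", "22", "38", "22"],
    "Survived": [0, 1, 1, 1, 0],
})

# Mean of the target per category, indexed by the category value.
target_mean = toy.groupby("Age")["Survived"].mean()

# map() replaces each Age with the mean Survived of its group.
toy["Age_target"] = toy["Age"].map(target_mean)
print(toy)
```

Here the "22" rows get 1/3 and the "38" rows get 1.0, the same values a left merge on Age would attach.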

#When there are two keys
map only works with a single key.
In that case, force a single key into existence by combining the columns.

t5 = time.time()
# merge on two keys
sex_age_count = df.assign(sex_age_count=0).groupby(["Sex", "Age"])["sex_age_count"].count().reset_index()
df = pd.merge(df, sex_age_count, on=["Sex", "Age"], how="left")

t6 = time.time()
# force a single key by concatenating the two columns
df["Sex_Age"] = df["Sex"] + df["Age"]
t7 = time.time()
df["Sex_Age_count"] = df["Sex_Age"].map(df["Sex_Age"].value_counts())
t8 = time.time()

print(t6 - t5)
print(t8 - t7)  # map only; building the key (t7 - t6) is not included

This is also about 4 times faster.

#output
0.006415843963623047
0.0015180110931396484
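
One thing to watch with a concatenated key: in general, two different pairs can produce the same string. The sketch below uses made-up columns `a` and `b` to show the collision, and one possible guard, a separator that is assumed not to occur in the data.

```python
import pandas as pd

# Made-up values chosen to collide when concatenated without a separator.
toy = pd.DataFrame({
    "a": ["ab", "a", "ab"],
    "b": ["c", "bc", "c"],
})

# Plain concatenation: ("ab", "c") and ("a", "bc") both become "abc".
plain_key = toy["a"] + toy["b"]
plain_count = plain_key.map(plain_key.value_counts())

# With a separator that does not occur in the data, the pairs stay distinct.
safe_key = toy["a"] + "_" + toy["b"]
safe_count = safe_key.map(safe_key.value_counts())

print(plain_count.tolist())  # [3, 3, 3]
print(safe_count.tolist())   # [2, 1, 2]
```

Whether this matters depends on the columns being combined; a separator is a cheap guard when the values are free-form strings.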