Looping over a pandas DataFrame in Python is far too slow

Last updated at 2019-10-12, posted at 2019-10-11

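Looping over a pandas DataFrame row by row with df.iterrows() turned out to be far too slow.
Converting the columns to NumPy arrays first and then accessing them by index made it dramatically faster.
I got tripped up in a place I did not expect.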

I solved it as follows.

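def add_pred_in_df(df_input, df_unixepoch_pred):
    # Extract each column as a NumPy array once; indexing an array is much
    # cheaper than having iterrows() build a Series for every row.
    array_time_input = df_input["timestamp"].values
    array_time_pred = df_unixepoch_pred["unixepoch"].values
    array_label_pred = df_unixepoch_pred["pred"].values
    index_pred = 0
    last_pred = len(array_time_pred) - 1
    list_input_pred = []
    # Assumes both DataFrames are sorted by time in ascending order.
    for index in range(df_input.shape[0]):
        # Advance to the first interval whose end is >= this timestamp.
        # A while loop (not a single if) skips intervals that contain no rows,
        # and the bound check keeps late timestamps on the last interval.
        while index_pred < last_pred and array_time_input[index] > array_time_pred[index_pred]:
            index_pred += 1
        list_input_pred.append(array_label_pred[index_pred])
    df_input["prediction"] = list_input_pred
    return df_input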

(Sorry that the code is not very clean.)
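As a possible next step, the remaining Python-level loop could be replaced with a binary search. The following is a minimal sketch, not the method used in this article: it assumes the interval end times in df_unixepoch_pred are sorted in ascending order, and the function name add_pred_vectorized and the sample data are made up for illustration.

```python
import numpy as np
import pandas as pd

def add_pred_vectorized(df_input, df_unixepoch_pred):
    """Give each timestamp the label of the first interval whose end is >= it."""
    ends = df_unixepoch_pred["unixepoch"].to_numpy()  # interval ends, ascending
    labels = df_unixepoch_pred["pred"].to_numpy()
    # side="left" returns the first position i with ends[i] >= timestamp.
    positions = np.searchsorted(ends, df_input["timestamp"].to_numpy(), side="left")
    # Assumption: timestamps past the last interval reuse the last label.
    positions = np.minimum(positions, len(ends) - 1)
    result = df_input.copy()
    result["prediction"] = labels[positions]
    return result

# Hypothetical sample data for illustration only.
df_input = pd.DataFrame({"timestamp": [1, 5, 10, 11, 25, 30]})
df_unixepoch_pred = pd.DataFrame({"unixepoch": [10, 20, 30], "pred": ["a", "b", "c"]})
df_result = add_pred_vectorized(df_input, df_unixepoch_pred)
print(df_result["prediction"].tolist())
```

Unlike the loop version, this sketch returns a copy instead of modifying df_input in place, and it does not require df_input itself to be sorted.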

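Addendum 2019-10-12

What I wanted to do is the following.

I have a DataFrame, df_input, with a timestamp column (here an integer unixepoch), and I want to add a new column named prediction to it.
The values come from another DataFrame, df_unixepoch_pred, which stores the prediction value for each time interval.
Because the timestamps in df_input are not evenly spaced, each record of df_input has to be compared against the interval boundaries in df_unixepoch_pred to find the interval it falls into, and is then assigned that interval's prediction value.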
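The task described above looks close to what pandas calls an "as-of" join, so pd.merge_asof may be worth trying. The sketch below is an alternative under stated assumptions rather than the approach taken in this article: both frames are sorted by their time key, the key columns share the same integer dtype, and each unixepoch marks the end of its interval. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; both frames must be sorted by their time key.
df_input = pd.DataFrame({"timestamp": [1, 5, 10, 11, 25, 30]})
df_unixepoch_pred = pd.DataFrame({"unixepoch": [10, 20, 30], "pred": ["a", "b", "c"]})

# direction="forward" pairs each timestamp with the nearest unixepoch that is
# greater than or equal to it, i.e. the end of the interval it falls into.
df_merged = pd.merge_asof(
    df_input,
    df_unixepoch_pred,
    left_on="timestamp",
    right_on="unixepoch",
    direction="forward",
)
# Keep only the label, under the column name used in this article.
df_merged = df_merged.rename(columns={"pred": "prediction"}).drop(columns="unixepoch")
print(df_merged)
```

With this call, a timestamp later than the last unixepoch gets a missing value rather than the last label, which may or may not be the behavior you want.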

Incidentally

The code from when I was simply looping with iterrows() is below.

def add_pred_in_df(df_input, df_unixepoch_pred):
    prev_unixepoch = 0
    df_input["prediction"] = ""
    for index, items in df_unixepoch_pred.iterrows():
        # Rows whose timestamp falls in (prev_unixepoch, unixepoch]. Building
        # this mask scans all of df_input on every iteration.
        mask = (df_input["timestamp"] > prev_unixepoch) & (df_input["timestamp"] <= items["unixepoch"])
        # Assign through .loc; setting an attribute on the filtered frame
        # would only modify a temporary copy.
        df_input.loc[mask, "prediction"] = items["pred"]
        prev_unixepoch = items["unixepoch"]
    return df_input

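After that, I refactored it as follows in an attempt to speed it up.
I assumed it was slow because extracting rows from df_input with a comparison scans the whole frame on every loop iteration, so I changed it as shown below,
but it was still slow.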

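def add_pred_in_df(df_input, df_unixepoch_pred):
    index_pred = 0
    last_pred = len(df_unixepoch_pred) - 1
    df_input["prediction"] = ""
    for index, items in df_input.iterrows():
        # Walk forward through the intervals instead of rescanning df_input.
        while index_pred < last_pred and items["timestamp"] > df_unixepoch_pred["unixepoch"].iloc[index_pred]:
            index_pred += 1
        # .at writes a single cell in place; chained indexing such as
        # df_input.prediction[index] = ... may write to a copy.
        df_input.at[index, "prediction"] = df_unixepoch_pred["pred"].iloc[index_pred]
    return df_input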
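To check the difference on your own machine, a rough timing sketch like the one below may help. It is only an illustration: the 20,000-row frame and the helper names (total_with_iterrows, total_with_array, elapsed) are made up, and the actual numbers depend on the machine and the pandas version.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical sample data: 20,000 rows with an integer timestamp column.
df = pd.DataFrame({"timestamp": np.arange(20_000, dtype=np.int64)})

def total_with_iterrows(frame):
    total = 0
    for _, row in frame.iterrows():  # a Series object is built for every row
        total += row["timestamp"]
    return total

def total_with_array(frame):
    values = frame["timestamp"].to_numpy()  # one NumPy array, extracted once
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total

def elapsed(func, frame):
    """Run func(frame) once and return (result, seconds taken)."""
    start = time.perf_counter()
    result = func(frame)
    return result, time.perf_counter() - start

sum_rows, sec_rows = elapsed(total_with_iterrows, df)
sum_array, sec_array = elapsed(total_with_array, df)
print(f"iterrows: {sec_rows:.4f}s, array: {sec_array:.4f}s")
```

Both helpers compute the same sum, so any gap in the printed times comes from how the rows are accessed rather than from the work done per row.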
