
pandas.DataFrame loop speed comparison (iterrows vs apply)

Last updated at 2024-01-08, posted at 2024-01-08

Overview

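This is a memo of code that compares the speed of iterrows and apply for row-by-row processing of a pandas.DataFrame. It is often said that iterrows is slower, and I ran this to get a feel for how much slower it actually is. There is nothing novel here.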

Implementation

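The benchmark task is to split Wikipedia articles into sentences and build a new DataFrame with one row per sentence.
Some steps, such as renaming a column, are unnecessary here; they remain because the code is reused from another project.
The code was written with the help of GPT-4.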
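To make the task concrete, here is a minimal sketch on a tiny, made-up DataFrame. The `title` column and the sample sentences are assumptions for illustration only; the real input comes from the Wikipedia dataset loaded below. Each article is split on line breaks and then on the Japanese full stop "。", and every non-empty piece becomes its own row.

```python
import pandas as pd

# Hypothetical two-row input standing in for the Wikipedia articles.
df = pd.DataFrame(
    {
        "title": ["A", "B"],
        "surface": ["一文目。二文目。\n三文目。", "四文目。"],
    }
)

rows = []
for _, row in df.iterrows():
    for line in row["surface"].splitlines():
        for sentence in line.split("。"):
            if sentence:  # skip the empty string left after a trailing "。"
                rows.append({"title": row["title"], "surface": sentence})

# One row per sentence, with the other columns repeated.
sentences_df = pd.DataFrame(rows)
print(sentences_df)
```

The two input rows expand into four output rows, three for article "A" and one for article "B".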

First, download the Wikipedia articles from Hugging Face.

from datasets import load_dataset

# Hugging Face Hub ID of the dataset (a repository name, not a full URL)
dataset_url = "izumi-lab/wikipedia-ja-20230720"

# Load the dataset
dataset = load_dataset(dataset_url)

# To view basic information or manipulate the dataset, you can use:
print(dataset)

Here is the benchmark code.

import pandas as pd
import timeit

# Existing function that uses iterrows()
def split_surface_to_sentences_iterrows(df: pd.DataFrame) -> pd.DataFrame:
    new_rows = []
    for _, row in df.iterrows():
        for line in row["surface"].splitlines():
            for sentence in line.split("。"):
                if sentence:
                    new_row = row.to_dict()
                    new_row["surface"] = sentence
                    new_rows.append(new_row)
    return pd.DataFrame(new_rows)

# Function that uses apply()
def split_surface_to_sentences_apply(df: pd.DataFrame) -> pd.DataFrame:
    def split_sentences(row):
        sentences = [sentence for line in row['surface'].splitlines() for sentence in line.split('。') if sentence]
        return pd.DataFrame({col: [row[col]]*len(sentences) if col != 'surface' else sentences for col in df.columns})
    return pd.concat(df.apply(split_sentences, axis=1).tolist(), ignore_index=True)

# Build a test DataFrame from the first 10 articles
subset = dataset["train"][:10]
passage_df = pd.DataFrame(subset)
passage_df.rename(columns={"text": "surface"}, inplace=True)

# Measure the execution time of iterrows() (total for 100 runs)
iterrows_time = timeit.timeit('split_surface_to_sentences_iterrows(passage_df)', globals=globals(), number=100)

# Measure the execution time of apply() (total for 100 runs)
apply_time = timeit.timeit('split_surface_to_sentences_apply(passage_df)', globals=globals(), number=100)

print(f'iterrows() time: {iterrows_time}')
print(f'apply() time: {apply_time}')

iterrows() time: 1.5944697080121841
apply() time: 0.2716344580112491

Results

In this measurement, apply was nearly 6 times faster than iterrows.
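As a rough check of that ratio, the sketch below recomputes it from the two totals printed above. Since `timeit.timeit` returns the total time for all `number` runs, dividing by 100 gives an approximate per-call time. The names `speedup`, `iterrows_ms_per_call`, and `apply_ms_per_call` are introduced here for illustration only.

```python
# Totals printed by the benchmark above (seconds for 100 runs each).
iterrows_time = 1.5944697080121841
apply_time = 0.2716344580112491
number = 100  # same value that was passed to timeit.timeit

# Ratio of the two totals, and approximate time per single call.
speedup = iterrows_time / apply_time
iterrows_ms_per_call = iterrows_time / number * 1000
apply_ms_per_call = apply_time / number * 1000

print(f"speedup: {speedup:.2f}x")
print(f"iterrows: {iterrows_ms_per_call:.2f} ms per call")
print(f"apply: {apply_ms_per_call:.2f} ms per call")
```

These figures come from a single measurement on 10 articles, so the ratio is best read as a rough indication rather than a general result.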
