
How to quickly check whether a pandas DataFrame contains object columns

Posted at 2021-10-03

Purpose

Given a DataFrame, I want to check as quickly as possible whether any of its columns has the object dtype.

Experiment code

In short, the third implementation is the fastest.

import pandas as pd
import numpy as np


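# %timeit is an IPython magic, so run this in IPython or Jupyter.
def test(n_rows, n_cols):
    # One DataFrame with only numeric columns, one with only string (object) columns.
    X_only_num = pd.DataFrame(np.ones((n_rows, n_cols)))
    X_only_cat = pd.DataFrame([['a' for c in range(n_cols)] for r in range(n_rows)])

    # Method 1: select the object columns and check whether the result is empty.
    def has_object_columns1(X) -> bool:
        return not X.select_dtypes(include='object').empty

    # Method 2: check each column dtype in a Python loop.
    def has_object_columns2(X) -> bool:
        feature_types = X.dtypes
        for feat_type in feature_types:
            if pd.api.types.is_object_dtype(feat_type):
                return True
        return False

    # Method 3: membership test on the NumPy array of column dtypes.
    def has_object_columns3(X) -> bool:
        return np.dtype('O') in X.dtypes.values

    %timeit has_object_columns1(X_only_num)
    %timeit has_object_columns1(X_only_cat)

    %timeit has_object_columns2(X_only_num)
    %timeit has_object_columns2(X_only_cat)

    %timeit has_object_columns3(X_only_num)
    %timeit has_object_columns3(X_only_cat)


# Sizes used in the results below.
test(n_rows=100, n_cols=1000)
test(n_rows=100, n_cols=100000)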
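
The experiment above relies on the %timeit magic, so it only runs in IPython or Jupyter. As a rough sketch for running the same comparison as a plain Python script, the three functions can be moved to module level and timed with the standard timeit module. The benchmark helper, its number parameter, and the explicit dtype=object are assumptions added for this example and are not part of the original experiment. dtype=object is passed because, depending on the pandas version and options, string data may be inferred as a dedicated string dtype rather than object.

```python
import timeit

import numpy as np
import pandas as pd


def has_object_columns1(X: pd.DataFrame) -> bool:
    # Build a sub-DataFrame of object columns and check whether it is empty.
    return not X.select_dtypes(include="object").empty


def has_object_columns2(X: pd.DataFrame) -> bool:
    # Check each column dtype in a Python loop, stopping at the first match.
    for feat_type in X.dtypes:
        if pd.api.types.is_object_dtype(feat_type):
            return True
    return False


def has_object_columns3(X: pd.DataFrame) -> bool:
    # Membership test on the NumPy array of column dtypes.
    return np.dtype("O") in X.dtypes.values


def benchmark(n_rows: int, n_cols: int, number: int = 10) -> dict:
    """Return the mean time per call in ms, keyed by (method name, input name)."""
    frames = {
        "only_num": pd.DataFrame(np.ones((n_rows, n_cols))),
        # dtype=object is an assumption added here to force object columns.
        "only_cat": pd.DataFrame(
            [["a"] * n_cols for _ in range(n_rows)], dtype=object
        ),
    }
    methods = (has_object_columns1, has_object_columns2, has_object_columns3)
    results = {}
    for func in methods:
        for name, frame in frames.items():
            total = timeit.timeit(lambda: func(frame), number=number)
            results[(func.__name__, name)] = total / number * 1000.0
    return results


if __name__ == "__main__":
    for (method, data), ms in benchmark(100, 1000).items():
        print(f"{method:<20} {data:<10} {ms:.3f} ms")
```

Absolute timings from this script will differ from the %timeit figures, since they depend on the machine and on the pandas and NumPy versions; only the relative ordering is meant to be compared.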

Results

n_rows=100, n_cols=1000

Method	Only numerical (ms)	Only categorical (ms)
method 1	0.5	1.1
method 2	0.6	0.6
method 3	0.1	0.09

n_rows=100, n_cols=100000

Method	Only numerical (ms)	Only categorical (ms)
method 1	4.1	69.1
method 2	52	50
method 3	3.8	1.54

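Why Method 3 is the fastest

Method 1 builds a new DataFrame with select_dtypes and allocates memory for it just to check whether it is empty. Method 2 walks the dtypes one by one in a Python for loop and calls pd.api.types.is_object_dtype on each. Method 3 creates no new DataFrame and works only on X.dtypes.values, the array of column dtypes.
Using in on that NumPy array also avoids an explicit for loop, because the comparison runs inside NumPy, and this contributes to the speedup as well.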
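
To make the point about in more concrete, here is a small sketch showing that the membership test on the dtype array gives the same answer as an explicit elementwise comparison followed by any(), which is, as far as I understand, roughly what NumPy does for in on an array. The DataFrames X_mixed and X_num and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed DataFrame: two numeric columns and one object column.
X_mixed = pd.DataFrame({
    "a": np.ones(3),
    "b": np.arange(3),
    "c": pd.Series(["x", "y", "z"], dtype=object),
})
# Hypothetical numeric-only DataFrame for the negative case.
X_num = pd.DataFrame(np.ones((2, 2)))


def contains_object_via_in(X: pd.DataFrame) -> bool:
    # Same expression as Method 3.
    return np.dtype("O") in X.dtypes.values


def contains_object_via_any(X: pd.DataFrame) -> bool:
    # Explicit form: compare every column dtype, then reduce with any().
    return bool((X.dtypes.values == np.dtype("O")).any())


print(contains_object_via_in(X_mixed), contains_object_via_any(X_mixed))
print(contains_object_via_in(X_num), contains_object_via_any(X_num))
```

Both functions should agree on any DataFrame. A single object column among numeric ones is enough for the check to report True.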
