SMN a.i lab. Advent Calendar 2024

Performance Comparison of Pandas-like Libraries

Last updated at 2025-03-24, posted at 2024-11-01

There is a Part 2 of this article.
For various reasons, this article is kept as well.

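Introduction

This is day 1 of the SMN a.i lab. Advent Calendar 2024.
In machine learning, data preprocessing often consumes more computing resources than training itself.
Various software is used for preprocessing, and Pandas is well known as a popular library for handling tabular data.
In recent years, however, many highly optimized libraries with APIs close to that of Pandas have been released.
This article compares the performance of these Pandas-like libraries.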

Pandas-like libraries

Pandas

The most popular library for tabular data processing.

PyArrow

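The Python library for Apache Arrow.
It is not very Pandas-like, but it has a tabular data type.
Arrow itself is a columnar data format specification, and official libraries are distributed for many languages.
Pandas can also use some of PyArrow's functionality.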

PySpark

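The Python library for Apache Spark.
It is not very Pandas-like, but it has a tabular data type.
Spark itself is software designed mainly for distributed processing across multiple nodes, but it can also be used on a single node.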

Polars

It seems to have attracted attention in recent years as an alternative to Pandas.
Its API resembles that of Pandas to some extent.

Dask

It advertises a high degree of compatibility with Pandas.
It also supports distributed processing across multiple nodes and claims to be faster than Spark.

cuDF

A library for NVIDIA GPUs.
It claims to be fully compatible with Pandas when the "cuDF pandas Accelerator Mode" is used.

Notebook

Environment

I use Google Colaboratory with the runtime type set to T4 GPU.

CPU information

!lscpu

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   2
  On-line CPU(s) list:    0,1
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) CPU @ 2.00GHz
    CPU family:           6
    Model:                85
    Thread(s) per core:   2
    Core(s) per socket:   1
    Socket(s):            1
    Stepping:             3
    BogoMIPS:             4000.28
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 cl
                          flush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc re
                          p_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3
                           fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand
                           hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp 
                          fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f a
                          vx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveop
                          t xsavec xgetbv1 xsaves arat md_clear arch_capabilities
Virtualization features:  
  Hypervisor vendor:      KVM
  Virtualization type:    full
Caches (sum of all):      
  L1d:                    32 KiB (1 instance)
  L1i:                    32 KiB (1 instance)
  L2:                     1 MiB (1 instance)
  L3:                     38.5 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0,1
Vulnerabilities:          
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Mitigation; PTE Inversion
  Mds:                    Vulnerable; SMT Host state unknown
  Meltdown:               Vulnerable
  Mmio stale data:        Vulnerable
  Reg file data sampling: Not affected
  Retbleed:               Vulnerable
  Spec rstack overflow:   Not affected
  Spec store bypass:      Vulnerable
  Spectre v1:             Vulnerable: __user pointer sanitization and usercopy barriers only; no swa
                          pgs barriers
  Spectre v2:             Vulnerable; IBPB: disabled; STIBP: disabled; PBRSB-eIBRS: Not affected; BH
                          I: Vulnerable (Syscall hardening enabled)
  Srbds:                  Not affected
  Tsx async abort:        Vulnerable

Host memory information

!free -h

               total        used        free      shared  buff/cache   available
Mem:            12Gi       792Mi       4.0Gi       2.0Mi       7.9Gi        11Gi
Swap:             0B          0B          0B

GPU information

!nvidia-smi

Thu Sep 26 10:12:26 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

CUDA Toolkit version

!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

Installing and importing the libraries

!pip install pyarrow pyspark polars dask
!pip install --extra-index-url=https://pypi.nvidia.com cudf-cu12

import pandas as pd
import numpy as np

import pyarrow as pa
import pyarrow.compute as pc

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

import polars as pl

import dask.dataframe as dd

import cudf

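Data preparation

A table of 100 columns * 500,000 rows is filled with random 32-bit floating-point numbers, and a categorical column named group, taking the values A, B, C, and D, is added.
Be careful: if the table is made too large, a MemoryError occurs.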

np.random.seed(42)
n_rows = 500000
n_cols = 100

# Randomly generate 100 columns * 500,000 rows of data
# Use 32-bit floats, since NVIDIA GPUs probably perform better with them
data = np.random.randn(n_rows, n_cols).astype(np.float32)
df = pd.DataFrame(data, columns=[f'col_{i}' for i in range(n_cols)])

# Add an arbitrary categorical column to group by
df['group'] = np.random.choice(['A', 'B', 'C', 'D'], size=n_rows)
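As a rough guide to the MemoryError warning above, the size of the numeric part of the table can be estimated before allocating it. The following is a minimal sketch; the helper name `dense_bytes` is hypothetical and not part of the original notebook.

```python
import numpy as np

n_rows = 500_000
n_cols = 100


def dense_bytes(rows: int, cols: int, dtype) -> int:
    """Bytes needed for a dense rows x cols array of the given dtype."""
    return rows * cols * np.dtype(dtype).itemsize


size_f32 = dense_bytes(n_rows, n_cols, np.float32)
size_f64 = dense_bytes(n_rows, n_cols, np.float64)

print(f"float32: {size_f32 / 1e6:.0f} MB")
print(f"float64: {size_f64 / 1e6:.0f} MB")
```

Note that `np.random.randn` produces 64-bit floats first and `astype(np.float32)` then makes a converted copy, so the peak usage during data creation is higher than the final float32 size alone.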

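Task

The comparison uses a single operation: group by the group column and take the mean.
Each operation is run with Jupyter's timeit magic command, 10 loops * 7 runs, and the mean execution times are compared.
Copy speed is not the focus here, but the differences were so extreme that I measured it as well.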
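For readers running outside Jupyter, the same "7 runs of 10 loops" measurement can be approximated with the standard library's `timeit.repeat`. This is a minimal sketch, not the notebook's code: `work` is a hypothetical placeholder workload, and the use of the population standard deviation is an assumption about how the magic command summarizes runs.

```python
import statistics
import timeit


def work():
    # Hypothetical placeholder workload; substitute the operation to measure.
    return sum(i * i for i in range(1000))


n_loops = 10  # corresponds to -n 10
n_runs = 7    # corresponds to -r 7

# Each element is the total time for n_loops consecutive calls.
totals = timeit.repeat(work, repeat=n_runs, number=n_loops)
per_loop = [t / n_loops for t in totals]

mean = statistics.mean(per_loop)
std = statistics.pstdev(per_loop)
print(f"{mean * 1e3:.3f} ms ± {std * 1e3:.3f} ms per loop "
      f"(mean ± std. dev. of {n_runs} runs, {n_loops} loops each)")
```

The standard deviation is taken over the 7 per-run averages, not over the 70 individual calls, which is worth keeping in mind when reading the "±" figures.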

Pandas

%%time
# Copy, for timing comparison
df_copied = df.copy()

CPU times: user 479 ms, sys: 44 ms, total: 523 ms
Wall time: 1.21 s

%%timeit -r 7 -n 10
# Group by the group column and take the mean
grouped_mean = df.groupby('group').mean()

The slowest run took 4.02 times longer than the fastest. This could mean that an intermediate result is being cached.
349 ms ± 219 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
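To make the benchmarked operation concrete, here is a minimal sketch on a tiny hand-made table (the values are illustrative, not taken from the benchmark data). It checks that `groupby('group').mean()` matches a manual per-label column mean.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({
    "group": ["A", "B", "A", "B"],
    "col_0": np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32),
    "col_1": np.array([10.0, 20.0, 30.0, 40.0], dtype=np.float32),
})

# One row per distinct label, one column per numeric column.
result = small.groupby("group").mean()

# Manual equivalent: select the rows of each label and average column-wise.
manual = {
    label: small.loc[small["group"] == label, ["col_0", "col_1"]].to_numpy().mean(axis=0)
    for label in ["A", "B"]
}

for label, means in manual.items():
    assert np.allclose(result.loc[label].to_numpy(), means)

print(result)
```

Every library in this comparison is asked for this same result shape: one row per label, with the mean of each of the 100 numeric columns.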

PyArrow

%%time
arrow_table = pa.Table.from_pandas(df)

CPU times: user 1.76 s, sys: 151 ms, total: 1.91 s
Wall time: 1.05 s

%%timeit -r 7 -n 10
# Group by the group column and take the mean
grouped_mean_pa = arrow_table.group_by('group').aggregate(
  [
    (f'col_{i}', "mean") for i in range(n_cols)
  ]
)

129 ms ± 34.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is more than twice as fast as Pandas (about 2.7x), and the standard deviation is also smaller.

PySpark

# Create a Spark session
spark = SparkSession.builder.appName("tabular_data_processing_benchmark").getOrCreate()

%%time
df_spark = spark.createDataFrame(df)

CPU times: user 4min 39s, sys: 2.64 s, total: 4min 41s
Wall time: 4min 54s

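Copying is not the focus here, but the conversion to PySpark is abnormally slow.
A likely cause, not verified here, is that Arrow-based conversion (the `spark.sql.execution.arrow.pyspark.enabled` setting) is disabled by default, in which case `createDataFrame` converts the pandas data row by row.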

%%timeit -r 7 -n 10
# Group by the group column and take the mean
grouped_mean_spark = df_spark.groupBy("group").mean()

73 ms ± 32.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

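This is about 5x faster than Pandas, and the standard deviation is also smaller.
Note, however, that Spark transformations are lazy: the timed cell calls no action (such as `.collect()` or `.show()`), so this figure likely covers only building the query plan, not the aggregation itself.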

Polars

%%time
df_pl = pl.from_pandas(df)

CPU times: user 902 ms, sys: 138 ms, total: 1.04 s
Wall time: 661 ms

%%timeit -r 7 -n 10
# Group by the group column and take the mean
grouped_mean_pl = df_pl.group_by('group').mean()

80.3 ms ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is about the same as the PySpark figure.

Dask

%%time
df_dask = dd.from_pandas(df, npartitions=4)

CPU times: user 694 ms, sys: 47.9 ms, total: 742 ms
Wall time: 742 ms

%%timeit -r 7 -n 10
# Group by the group column and take the mean
grouped_mean_dask = df_dask.groupby('group').mean()

51.6 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

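Fast. This was the fastest figure in this comparison.
Note, however, that Dask DataFrame operations are lazy: the timed cell does not call `.compute()`, so this figure likely covers only building the task graph, not the aggregation itself.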

cuDF

%%time
df_cudf = cudf.from_pandas(df)

CPU times: user 526 ms, sys: 123 ms, total: 649 ms
Wall time: 658 ms

%%timeit -r 7 -n 10
# Group by the group column and take the mean
grouped_mean_cudf = df_cudf.groupby('group').mean()

54.6 ms ± 5.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is about as fast as the Dask figure.

Results

Library	Mean	Std. dev.
Pandas	349 ms	219 ms
PyArrow	129 ms	34.3 ms
PySpark	73 ms	32.7 ms
Polars	80.3 ms	17.5 ms
Dask	51.6 ms	2.12 ms
cuDF	54.6 ms	5.16 ms
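The ratios quoted in the sections above ("more than twice", "about 5x") can be reproduced from the table. This is a small sketch that takes the reported means at face value and divides the Pandas mean by each library's mean.

```python
# Mean time per loop in milliseconds, copied from the results table.
mean_ms = {
    "Pandas": 349.0,
    "PyArrow": 129.0,
    "PySpark": 73.0,
    "Polars": 80.3,
    "Dask": 51.6,
    "cuDF": 54.6,
}

baseline = mean_ms["Pandas"]
speedup = {name: baseline / t for name, t in mean_ms.items()}

# Print from largest to smallest ratio relative to Pandas.
for name, ratio in sorted(speedup.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:8s} {ratio:4.1f}x")
```

Given the large standard deviation of the Pandas measurement (219 ms on a 349 ms mean), these ratios are rough indications rather than precise factors.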

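As expected, Pandas was the slowest.
The fastest figure was Dask's, slightly ahead of cuDF.
I had expected the GPU-based cuDF to be the fastest, so this result was surprising.
These figures need careful reading, though: PySpark and Dask evaluate lazily, and the timed cells call no action such as `.collect()` or `.compute()`, so their numbers probably cover only query construction rather than the aggregation itself.
Results will also vary with the environment, the type of task, and the measurement method, so please try comparing under various conditions.
In particular, I am curious how PySpark and Dask compare when distributing the processing of data that does not fit in memory.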

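Remaining work

While trying these out, I kept finding more Pandas-like libraries one after another.
Other Pandas-like libraries such as the following also exist, and I plan to add them to the comparison.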

CPU
- Fireducks
- Vaex
- Modin
- Ibis
- Daft
GPU
- NVTabular
- cuStreamz
