Speed comparison of polars `.when().then().otherwise()` and `.replace()`

Last updated at 2024-06-18, posted at 2023-12-12

Background

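There are two ways to replace the values of a particular column: chain pl.when().then()...otherwise(), or use Expr.replace().

For example, suppose a column holds the integers 1 through 10 and we want to replace them with the letters a through j.

The first approach is to write out the .when().then() chain by hand.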

EXPR1 = (
    pl.when(column_0=1)
    .then(pl.lit("a"))
    .when(column_0=2)
    .then(pl.lit("b"))
    .when(column_0=3)
    .then(pl.lit("c"))
    .when(column_0=4)
    .then(pl.lit("d"))
    .when(column_0=5)
    .then(pl.lit("e"))
    .when(column_0=6)
    .then(pl.lit("f"))
    .when(column_0=7)
    .then(pl.lit("g"))
    .when(column_0=8)
    .then(pl.lit("h"))
    .when(column_0=9)
    .then(pl.lit("i"))
    .when(column_0=10)
    .then(pl.lit("j"))
    .otherwise(pl.lit("other"))
)

The second option, if the mapping is already available as a dictionary, is to use .replace().

MY_DICT = dict(enumerate("abcdefghij", 1))
EXPR2 = pl.col("column_0").replace(MY_DICT, default="other")

Which of the two runs faster?

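Comparison

Let's compare the two using the example above (polars version 0.19.19).

The inputs are two-column data frames ranging from 1 row to 10 million rows, with every cell holding an integer from 0 to 10. A new column is added in which the values 1 to 10 of the first column are converted to a through j. 0 is converted to "other".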

import benchit
import numpy as np
import polars as pl

# ... omitted (see above for the variables EXPR1 and EXPR2)

def when_then_otherwise(df: pl.DataFrame):
    return df.with_columns(new_column=EXPR1)


def replace_dict(df: pl.DataFrame):
    return df.with_columns(new_column=EXPR2)

rng = np.random.default_rng(0)
funcs = [when_then_otherwise, replace_dict]
# The upper bound of rng.integers is exclusive, so 11 yields the values 0 to 10 described above.
inputs = {n: pl.DataFrame(rng.integers(0, 11, (n, 2))) for n in 10 ** np.arange(8)}
t = benchit.timings(funcs, inputs)
t.plot(figsize=(8, 5), logx=True)

Judging from the resulting plot, .when().then()...otherwise() appears to be faster when the data frame has fewer than 10,000 rows, while .replace() becomes faster at 10,000 rows or more.

Comparison with fewer replacement patterns

The example above had 10 number-to-letter replacement patterns. Let's make a version with only 3, that is, replacing 1 to 3 with a to c.

EXPR3 = (
    pl.when(column_0=1)
    .then(pl.lit("a"))
    .when(column_0=2)
    .then(pl.lit("b"))
    .when(column_0=3)
    .then(pl.lit("c"))
    .otherwise(pl.lit("other"))
)

MY_DICT2 = dict(enumerate("abc", 1))
EXPR4 = pl.col("column_0").replace(MY_DICT2, default="other")

Let's add this 3-way replacement to the earlier 10-way one and compare them.

def when_then_otherwise_v3(df: pl.DataFrame):
    return df.with_columns(new_column=EXPR3)


def replace_dict_v3(df: pl.DataFrame):
    return df.with_columns(new_column=EXPR4)


funcs = [when_then_otherwise, replace_dict, when_then_otherwise_v3, replace_dict_v3]
inputs = {n: pl.DataFrame(rng.integers(0, 10, (n, 2))) for n in 10 ** np.arange(8)}
t = benchit.timings(funcs, inputs)
t.plot(figsize=(8, 5), logx=True)

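.when().then()...otherwise() got faster (its curve looks as if it had been shifted straight down), while .replace() barely changed (a slight improvement at 10,000 rows and above). Even so, beyond roughly 100,000 rows .replace() is still faster than .when().then()...otherwise().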
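One way to picture this result is with a rough mental model in plain Python. It is not polars' actual implementation: it simply assumes that every `.when()` branch costs one pass over the column, whereas a dictionary replacement costs one lookup per row. The function names and the comparison counter are hypothetical and exist only for this illustration.

```python
def chain_replace(values, mapping, default):
    """Model of a when/then chain: one pass over the data per pattern."""
    out = [default] * len(values)
    decided = [False] * len(values)
    comparisons = 0
    for key, new_value in mapping.items():
        for i, v in enumerate(values):
            comparisons += 1
            # The first matching branch wins, as in a when/then chain.
            if not decided[i] and v == key:
                out[i] = new_value
                decided[i] = True
    return out, comparisons


def dict_replace(values, mapping, default):
    """Model of a dictionary replacement: one hash lookup per row."""
    out = [mapping.get(v, default) for v in values]
    return out, len(values)


values = [0, 1, 2, 3, 10]
ten_patterns = dict(enumerate("abcdefghij", 1))
three_patterns = dict(enumerate("abc", 1))

for patterns in (ten_patterns, three_patterns):
    _, chain_cost = chain_replace(values, patterns, "other")
    _, lookup_cost = dict_replace(values, patterns, "other")
    print(len(patterns), "patterns:", chain_cost, "vs", lookup_cost)
```

Under these assumptions the chain's work grows with rows times patterns, while the lookup's work depends on the row count alone, which is consistent with the shape of the measurements above.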

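Addendum, 2024/06/18

Half a year has passed, so I upgraded from version 0.19.19 to 0.20.31 and ran the benchmark again.

Everything got faster overall, but for the purposes of this comparison, .when().then()...otherwise() appears to have gained a further advantage.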

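Passing a dictionary to `.when().then()...otherwise()`

This means that for data frames of up to a few tens of thousands of rows, even if the replacement patterns are already held in a dictionary, it is faster to use .when().then()...otherwise() in some way than to pass the dictionary directly to .replace().

However, feeding a dictionary into a .when().then()...otherwise() style expression looks like quite a hassle.

For example: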

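# Start from the pl module so that the first .when() call is pl.when();
# every later .when() is chained onto the previous .then().
expr_tmp = pl
for key, value in MY_DICT.items():
    expr_tmp = expr_tmp.when(column_0=key).then(pl.lit(value))
# Wrap the default in pl.lit(): a bare string would be parsed as a column name.
EXPR = expr_tmp.otherwise(pl.lit("other"))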

↑ It is hard to tell what this is doing.

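Conclusion

Up to a few tens of thousands of rows, .when().then()...otherwise() is usually faster than .replace().
The longer a .when().then().when().then()... chain gets (the more replacement patterns there are), the more rapidly it slows down.
The speed of .replace() is almost unaffected by the length of the dictionary (the number of replacement patterns).

That's all.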