
Checking Polars processing times

Last updated at 2023-01-22, posted at 2023-01-21

1. About Polars

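I'm late to this, but I recently learned that Polars, a new DataFrame library for Python, has been released.
I don't know the details yet, but from a quick read it is implemented in Rust and is reportedly fast. See the official website for details.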

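In Python, small datasets are usually handled with pandas, so I was curious how much faster Polars is and measured the processing time of a few operations. (This time I checked four: sampling, summary statistics, sorting, and horizontal concatenation.)
I'm likely to forget how large the differences were, so I'm noting them here.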

2. Library versions

The library versions used are as follows.

In[1]:from sklearn import datasets
      import pandas as pd
      import polars as pl
      print("pandas version",pd.__version__)
      print("Polars version",pl.__version__)

      # pandas version 1.5.2
      # Polars version 0.15.16

3. Creating the test data and sampling

Create a pandas.DataFrame and a polars.DataFrame from scikit-learn's iris data. The plain iris data has only 150 rows, which feels far from the size of real-world data, so I inflate it by sampling with replacement.

In[2]:iris_df = pd.concat([
                    datasets.load_iris(return_X_y=True,as_frame=True)[0],
                    datasets.load_iris(return_X_y=True,as_frame=True)[1]],
                  axis = 1)
      iris_pl = pl.DataFrame(iris_df)
      print(iris_pl.head())

      # shape: (5, 5)
      # ┌────────────────────┬─────────────┬──────────────┬───────────────────┬────────┐
      # │ sepal length (cm)  ┆ sepal width ┆ petal length ┆ petal width (cm)  ┆ target │
      # │ ---                ┆ (cm)        ┆ (cm)         ┆ ---               ┆ ---    │
      # │ f64                ┆ ---         ┆ ---          ┆ f64               ┆ i64    │
      # │                    ┆ f64         ┆ f64          ┆                   ┆        │
      # ╞════════════════════╪═════════════╪══════════════╪═══════════════════╪════════╡
      # │ 5.1                ┆ 3.5         ┆ 1.4          ┆ 0.2               ┆ 0      │
      # │ 4.9                ┆ 3.0         ┆ 1.4          ┆ 0.2               ┆ 0      │
      # │ 4.7                ┆ 3.2         ┆ 1.3          ┆ 0.2               ┆ 0      │
      # │ 4.6                ┆ 3.1         ┆ 1.5          ┆ 0.2               ┆ 0      │
      # │ 5.0                ┆ 3.6         ┆ 1.4          ┆ 0.2               ┆ 0      │
      # └────────────────────┴─────────────┴──────────────┴───────────────────┴────────┘

The processing times for sampling are as follows.

pandas

In[3]:%%time
      test_df = iris_df.sample(10000000,random_state = 12345,replace = True)
      print(test_df.shape)

      # (10000000, 5)
      # CPU times: user 239 ms, sys: 164 ms, total: 403 ms
      # Wall time: 401 ms

Polars

In[4]:%%time
      test_pl = iris_pl.sample(10000000,seed = 12345,with_replacement = True)
      print(test_pl.shape)

      # (10000000, 5)
      # CPU times: user 80.1 ms, sys: 205 ms, total: 285 ms
      # Wall time: 86 ms

Wall time drops to about 1/5 (401 ms to 86 ms). Fast!
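
A single `%%time` run can be affected by caching and by other processes, so the figures here are rough. As a minimal sketch (not what was used for the measurements in this article), the standard-library `timeit` module can repeat a measurement and keep the fastest wall time. The helper name `best_wall_time` and the stand-in workload are hypothetical; in the notebook you would pass something like `lambda: iris_pl.sample(10000000, seed=12345, with_replacement=True)` instead.

```python
import timeit


def best_wall_time(func, repeat=5):
    """Run func `repeat` times and return the fastest wall time in seconds.

    Hypothetical helper for illustration; not part of pandas or Polars.
    """
    # number=1 calls func once per measurement; repeat collects several measurements.
    timings = timeit.repeat(func, number=1, repeat=repeat)
    return min(timings)


# Stand-in workload so the sketch runs without pandas or Polars installed.
def workload():
    return sorted(range(100_000), reverse=True)


seconds = best_wall_time(workload)
print(f"best of 5: {seconds * 1000:.2f} ms")
```

Taking the minimum rather than the mean is one common choice, since slower runs usually reflect interference from outside the code being measured.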

4. Summary statistics

Next, I check the processing time of the commonly used summary statistics.

pandas

In[5]:%%time
      df_desc = test_df.describe()
      
      # CPU times: user 1.64 s, sys: 161 ms, total: 1.8 s
      # Wall time: 1.8 s

Polars

In[6]:%%time
      pl_desc = test_pl.describe()

      # CPU times: user 2.27 s, sys: 176 ms, total: 2.44 s
      # Wall time: 454 ms

Wall time is about 1/4 (1.8 s to 454 ms). Fast again.
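
`%%time` reports both CPU time and wall time. CPU time is summed over all threads of the process, so it can exceed wall time when the work runs in parallel, which would be consistent with the Polars figures above (2.44 s CPU, 454 ms wall). The sketch below shows how the two clocks can be read with the standard library. `measure` is a hypothetical helper, and the sleep-based workload is only a stand-in: it produces the opposite pattern (wall time far above CPU time) because sleeping uses almost no CPU.

```python
import time


def measure(func):
    """Return (wall_seconds, cpu_seconds) for a single call to func.

    Hypothetical helper; roughly the two kinds of numbers that %%time prints.
    """
    wall_start = time.perf_counter()  # wall-clock timer
    cpu_start = time.process_time()   # user + system CPU time of this process
    func()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


# Sleeping consumes wall time but almost no CPU time.
wall, cpu = measure(lambda: time.sleep(0.2))
print(f"wall: {wall * 1000:.0f} ms, cpu: {cpu * 1000:.0f} ms")
```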

5. Sort

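Next is sorting, which tends to be a slow operation. Polars does not appear to offer in-place operations, so the pandas version also assigns the result back instead of sorting in place.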

pandas

In[7]:%%time
      test_df = test_df.sort_values("sepal length (cm)",ascending = False)

      # CPU times: user 551 ms, sys: 163 ms, total: 714 ms
      # Wall time: 711 ms

Polars

In[8]:%%time
      test_pl = test_pl.sort("sepal length (cm)",reverse = False)

      # CPU times: user 2.76 s, sys: 293 ms, total: 3.06 s
      # Wall time: 559 ms

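There is not much of a difference here (711 ms vs 559 ms). Note that the pandas call sorts in descending order (ascending = False) while the Polars call sorts in ascending order (reverse = False), so the two are not exactly the same operation.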

6. Horizontal concatenation

Finally, from the data-joining operations that come up often in analysis, I check horizontal concatenation.

pandas

In[9]:# Polars does not seem to allow horizontal concatenation when column names are duplicated, so create a DataFrame with renamed columns
      test_df_re = test_df.copy()
      test_df_re.columns = ["1","2","3","4","5"]

In[10]:%%time
      test_df2 = pd.concat([test_df,test_df_re],axis = 1)
    
      # CPU times: user 309 ms, sys: 152 ms, total: 461 ms
      # Wall time: 458 ms

Polars

In[11]:# Polars does not seem to allow horizontal concatenation when column names are duplicated, so create a DataFrame with renamed columns
       test_pl_re = test_pl.clone()
       test_pl_re.columns = ["1","2","3","4","5"]

In[12]:%%time
       test_pl2 = pl.concat([test_pl,test_pl_re],how = "horizontal")

       # CPU times: user 45 µs, sys: 5 µs, total: 50 µs
       # Wall time: 54.4 µs

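Horizontal concatenation shows a very large gap (458 ms vs 54.4 µs). Extremely fast. A microsecond-scale result would be consistent with Polars combining the existing columns without copying the underlying data.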

7. Summary

The wall times measured above are summarized below.

	3. Sampling	4. Summary statistics	5. Sort	6. Horizontal concatenation
pandas	401 ms	1.8 s	711 ms	458 ms
Polars	86 ms	454 ms	559 ms	54.4 µs

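Sampling and summary statistics take roughly 1/4 to 1/5 of the pandas time, while horizontal concatenation takes about 1/8000 (458 ms vs 54.4 µs). (Note: 1 s = 1,000 ms, 1 ms = 1,000 µs.)
Sorting shows little difference.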
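
The ratios quoted above can be reproduced from the table with a few lines of Python. This is only arithmetic on the wall times listed in this article, converted to milliseconds; the dictionary names are made up for this sketch.

```python
# Wall times from the table above, converted to milliseconds.
pandas_ms = {"sampling": 401, "describe": 1800, "sort": 711, "hconcat": 458}
polars_ms = {"sampling": 86, "describe": 454, "sort": 559, "hconcat": 54.4 / 1000}

# How many times longer pandas took than Polars for each operation.
speedup = {name: pandas_ms[name] / polars_ms[name] for name in pandas_ms}

for name, factor in speedup.items():
    print(f"{name}: pandas took about {factor:.1f}x as long as Polars")

# sampling: pandas took about 4.7x as long as Polars
# describe: pandas took about 4.0x as long as Polars
# sort: pandas took about 1.3x as long as Polars
# hconcat: pandas took about 8419.1x as long as Polars
```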

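This only covers the four operations above, and many common operations such as data loading and key-based joins were not checked. Judging from these results alone, though, Polars was generally faster than pandas, and my impression is that there are probably few operations where it is slower than pandas.
I haven't worked out how much of what pandas can do is covered by Polars, but this speed is appealing.
I plan to keep an eye on its future development.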
