Is pandas slow? Try Polars

Posted at 2021-10-26, last updated at 2021-11-16

Introduction 🐍

Is your pandas DataFrame too slow? Do you wish it were faster?
Then why not give the Polars DataFrame a try? 🦀

GitHub: https://github.com/pola-rs/polars
User Guide: https://pola-rs.github.io/polars-book/user-guide/index.html
API reference: https://pola-rs.github.io/polars/py-polars/html/reference/index.html

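Purpose of this article

A quick tour of how to use Polars. The examples are lined up loosely rather than systematically, so the goal is simply to give you a feel for the library.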

Pros and cons of Polars

👍 Fast 🚀🚀
- Benchmarks:
  - https://pola-rs.github.io/polars-book/user-guide/introduction.html#performance-
  - https://h2oai.github.io/db-benchmark/
- I/O is fast too
- See also "How fast is it?" below
👍 Less prone to memory errors
- An escape from the nightmare of MemoryError: Unable to allocate
👍 Easy to use
- The API is close to pandas, so you get used to it quickly
👍 Enough features
- Can do most of what pandas can do (probably)
- Supports the equivalent of SQL's PARTITION BY
- Supports CROSS JOIN
- An int column can hold NULL
  - NaN and NULL are distinct
👎 Documentation is thin
👎 apply is not available for some dtypes

Installation

Just install it with pip

pip install polars

Note: the versions I tested with:

Python==3.9.7
polars==0.10.7
pyarrow==5.0.0

Usage

First, the import

import polars as pl

Basic usage

Simply replacing pd with pl is enough for a fair number of operations

Creating a DataFrame

import numpy as np

df = pl.DataFrame({
    'col_str': ['a', 'b', 'c', 'd', 'e'],
    'col_int': [1, None, 3, 4, 5],
    'col_float': [0.1, np.nan, 0.3, None, 0.5],
})
print(df)
# shape: (5, 3)
# ┌─────────┬─────────┬───────────┐
# │ col_str ┆ col_int ┆ col_float │
# │ ---     ┆ ---     ┆ ---       │
# │ str     ┆ i64     ┆ f64       │
# ╞═════════╪═════════╪═══════════╡
# │ a       ┆ 1       ┆ 0.1       │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
# │ b       ┆ null    ┆ NaN       │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
# │ c       ┆ 3       ┆ 0.3       │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
# │ d       ┆ 4       ┆ null      │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
# │ e       ┆ 5       ┆ 0.5       │
# └─────────┴─────────┴───────────┘

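Note: as shown above, the output is easy to read even with print. In Jupyter Notebook and similar environments, a pandas-like display is also available.
There is nothing like the pandas index.
read_csv, to_csv, and so on are also available.
pl.DataFrame() also accepts a pandas DataFrame.

shape and related attributes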

print(df.shape)
# (5, 3)
print(df.height)
# 5
print(df.width)
# 3

Selecting rows and columns

print(df.head(2))
# shape: (2, 3)
# ┌─────────┬─────────┬───────────┐
# │ col_str ┆ col_int ┆ col_float │
# │ ---     ┆ ---     ┆ ---       │
# │ str     ┆ i64     ┆ f64       │
# ╞═════════╪═════════╪═══════════╡
# │ a       ┆ 1       ┆ 0.1       │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
# │ b       ┆ null    ┆ NaN       │
# └─────────┴─────────┴───────────┘

print(df[2])
# shape: (1, 3)
# ┌─────────┬─────────┬───────────┐
# │ col_str ┆ col_int ┆ col_float │
# │ ---     ┆ ---     ┆ ---       │
# │ str     ┆ i64     ┆ f64       │
# ╞═════════╪═════════╪═══════════╡
# │ c       ┆ 3       ┆ 0.3       │
# └─────────┴─────────┴───────────┘

print(df[3:])
# shape: (2, 3)
# ┌─────────┬─────────┬───────────┐
# │ col_str ┆ col_int ┆ col_float │
# │ ---     ┆ ---     ┆ ---       │
# │ str     ┆ i64     ┆ f64       │
# ╞═════════╪═════════╪═══════════╡
# │ d       ┆ 4       ┆ null      │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
# │ e       ┆ 5       ┆ 0.5       │
# └─────────┴─────────┴───────────┘

print(df[[1, 3], 'col_str'])
# shape: (2, 1)
# ┌─────────┐
# │ col_str │
# │ ---     │
# │ str     │
# ╞═════════╡
# │ b       │
# ├╌╌╌╌╌╌╌╌╌┤
# │ d       │
# └─────────┘

print(df[[1, 3], [0, 2]])
# shape: (2, 2)
# ┌─────────┬───────────┐
# │ col_str ┆ col_float │
# │ ---     ┆ ---       │
# │ str     ┆ f64       │
# ╞═════════╪═══════════╡
# │ b       ┆ NaN       │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
# │ d       ┆ null      │
# └─────────┴───────────┘

print(df[['col_str', 'col_float']])
# shape: (5, 2)
# ┌─────────┬───────────┐
# │ col_str ┆ col_float │
# │ ---     ┆ ---       │
# │ str     ┆ f64       │
# ╞═════════╪═══════════╡
# │ a       ┆ 0.1       │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
# │ b       ┆ NaN       │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
# │ c       ┆ 0.3       │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
# │ d       ┆ null      │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
# │ e       ┆ 0.5       │
# └─────────┴───────────┘

print(df['col_int'])  # df.col_int also works
# shape: (5,)
# Series: 'col_int' [i64]
# [
#         1
#         null
#         3
#         4
#         5
# ]

print(df[-1, 'col_float'])
# 0.5

See also the User Guide: https://pola-rs.github.io/polars-book/user-guide/indexing.html

Incidentally, in Polars head also works on a Series.

Adding a column

df['col_bool'] = [True, True, False, False, True]
# The right-hand side can also be an np.array, a pd.Series, etc.

You can also use with_column or with_columns

# Same operation as above
df = df.with_column(
    pl.Series('col_bool', [True, True, False, False, True])
)

Renaming columns

df = df.rename({'col_float': 'col_flt'})

Casting dtypes

df['col_flt'] = df['col_flt'].cast(pl.Float32)

Available dtypes → https://pola-rs.github.io/polars-book/user-guide/datatypes.html

Dropping a column

df = df.drop('col_bool')

Converting to pandas or numpy

print(df.to_pandas())
#   col_str  col_int  col_flt
# 0       a        1      0.1
# 1       b        0      NaN
# 2       c        3      0.3
# 3       d        4      0.0
# 4       e        5      0.5

print(df.to_numpy())
# [['a' 1 0.10000000149011612]
#  ['b' 0 nan]
#  ['c' 3 0.30000001192092896]
#  ['d' 4 0.0]
#  ['e' 5 0.5]]

Config

Use this to adjust how many rows and columns print displays

pl.Config.set_tbl_rows(20)
pl.Config.set_tbl_cols(10)

map- and apply-like operations

df = pl.DataFrame({
    'col_str': ['a', 'b', 'c'],
    'col_int': [1, 2, None],
})

df['col_int_div_2'] = df['col_int'].apply(lambda x: x / 2)
df = df.with_columns([
    pl.col('col_int')
    .is_in([1, 2])
    .is_not()
    .alias('col_int_not_in_1_2'),
    # when / then / otherwise
    pl.when(pl.col('col_int_div_2') >= 1)
    .then(1)
    .otherwise(pl.Series([11, 12, 13]))
    .alias('wto'),
])
print(df)
# shape: (3, 5)
# ┌─────────┬─────────┬───────────────┬────────────────────┬─────┐
# │ col_str ┆ col_int ┆ col_int_div_2 ┆ col_int_not_in_1_2 ┆ wto │
# │ ---     ┆ ---     ┆ ---           ┆ ---                ┆ --- │
# │ str     ┆ i64     ┆ f64           ┆ bool               ┆ i64 │
# ╞═════════╪═════════╪═══════════════╪════════════════════╪═════╡
# │ a       ┆ 1       ┆ 0.5           ┆ false              ┆ 11  │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
# │ b       ┆ 2       ┆ 1             ┆ false              ┆ 1   │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
# │ c       ┆ null    ┆ null          ┆ true               ┆ 13  │
# └─────────┴─────────┴───────────────┴────────────────────┴─────┘

Joins

from datetime import datetime

df_2 = pl.DataFrame({
    'col_str': ['a', 'c', 'd'],
    'col_datetime': [
        datetime.strptime(
            f'2021-10-{i} 11:22:33 +0900',
            '%Y-%m-%d %H:%M:%S %z'
        ) for i in [12, 15, 17]
    ],
})
df_join = df[['col_str', 'col_int']].join(
    df_2, on='col_str', how='left')
print(df_join)
# shape: (3, 3)
# ┌─────────┬─────────┬─────────────────────┐
# │ col_str ┆ col_int ┆ col_datetime        │
# │ ---     ┆ ---     ┆ ---                 │
# │ str     ┆ i64     ┆ datetime            │
# ╞═════════╪═════════╪═════════════════════╡
# │ a       ┆ 1       ┆ 2021-10-12 02:22:33 │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ b       ┆ 2       ┆ null                │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ c       ┆ null    ┆ 2021-10-15 02:22:33 │
# └─────────┴─────────┴─────────────────────┘

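Note: Datetime is represented internally as UNIX time, so unless you handle the timezone properly the values shift like this (the +0900 inputs above are displayed nine hours earlier, i.e. as UTC).

You can also simply stack frames, as with pandas concat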
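
The nine-hour shift can be reproduced with the standard library alone. The sketch below is not Polars-specific: it only shows that the +0900 input and the displayed value are the same instant, and that stripping tzinfo before building the DataFrame is one possible way (an assumption, not something the article tested) to keep the local wall-clock time.

```python
from datetime import datetime, timezone

# Same input format as in the join example above.
local = datetime.strptime(
    '2021-10-12 11:22:33 +0900', '%Y-%m-%d %H:%M:%S %z')

# The same instant expressed in UTC: this matches the value in the table.
as_utc = local.astimezone(timezone.utc)
print(local.isoformat())   # 2021-10-12T11:22:33+09:00
print(as_utc.isoformat())  # 2021-10-12T02:22:33+00:00

# Possible workaround: drop the offset to keep the local wall-clock time.
naive_local = local.replace(tzinfo=None)
print(naive_local.isoformat())  # 2021-10-12T11:22:33
```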

df = df[['col_str', 'col_int']].vstack(
    pl.DataFrame({
        'col_str': ['x', 'y', 'z'],
        'col_int': [7, 8, 9],
    })
)
print(df)
# shape: (6, 2)
# ┌─────────┬─────────┐
# │ col_str ┆ col_int │
# │ ---     ┆ ---     │
# │ str     ┆ i64     │
# ╞═════════╪═════════╡
# │ a       ┆ 1       │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
# │ b       ┆ 2       │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
# │ c       ┆ null    │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
# │ x       ┆ 7       │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
# │ y       ┆ 8       │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
# │ z       ┆ 9       │
# └─────────┴─────────┘

Filtering and sorting

df = df.filter((pl.col('col_int') >= 1) & (pl.col('col_int') <= 7))
df = df.sort('col_int', reverse=True)
print(df)
# shape: (3, 2)
# ┌─────────┬─────────┐
# │ col_str ┆ col_int │
# │ ---     ┆ ---     │
# │ str     ┆ i64     │
# ╞═════════╪═════════╡
# │ x       ┆ 7       │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
# │ b       ┆ 2       │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
# │ a       ┆ 1       │
# └─────────┴─────────┘

Shift

df['col_int_shifted'] = df['col_int'].shift(1)
print(df)
# shape: (3, 3)
# ┌─────────┬─────────┬─────────────────┐
# │ col_str ┆ col_int ┆ col_int_shifted │
# │ ---     ┆ ---     ┆ ---             │
# │ str     ┆ i64     ┆ i64             │
# ╞═════════╪═════════╪═════════════════╡
# │ x       ┆ 7       ┆ null            │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ b       ┆ 2       ┆ 7               │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ a       ┆ 1       ┆ 2               │
# └─────────┴─────────┴─────────────────┘

Aggregation

Redefine df

df = pl.DataFrame({
    'col_str': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'a', 'c'],
    'col_int': [1, 3, 2, 6, 5, 3, 1, 4, 2, 1],
    'col_float': [.2, .4, .1, .5, .6, .8, .9, .1, .5, .2],
})

df is the following table

col_str	col_int	col_float
a	1	0.2
b	3	0.4
c	2	0.1
a	6	0.5
b	5	0.6
c	3	0.8
a	1	0.9
b	4	0.1
a	2	0.5
c	1	0.2

describe

print(df.describe())
# shape: (5, 4)
# ┌──────────┬─────────┬────────────────────┬─────────────────────┐
# │ describe ┆ col_str ┆ col_int            ┆ col_float           │
# │ ---      ┆ ---     ┆ ---                ┆ ---                 │
# │ str      ┆ str     ┆ f64                ┆ f64                 │
# ╞══════════╪═════════╪════════════════════╪═════════════════════╡
# │ mean     ┆ null    ┆ 2.8                ┆ 0.43000000000000005 │
# ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ std      ┆ null    ┆ 1.7511900715418263 ┆ 0.28303906287138375 │
# ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ min      ┆ null    ┆ 1                  ┆ 0.1                 │
# ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ max      ┆ null    ┆ 6                  ┆ 0.9                 │
# ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ median   ┆ null    ┆ 2.5                ┆ 0.45                │
# └──────────┴─────────┴────────────────────┴─────────────────────┘

groupby

print(df.groupby('col_str').max())
# shape: (3, 3)
# ┌─────────┬─────────────┬───────────────┐
# │ col_str ┆ col_int_max ┆ col_float_max │
# │ ---     ┆ ---         ┆ ---           │
# │ str     ┆ i64         ┆ f64           │
# ╞═════════╪═════════════╪═══════════════╡
# │ b       ┆ 5           ┆ 0.6           │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ a       ┆ 6           ┆ 0.9           │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ c       ┆ 3           ┆ 0.8           │
# └─────────┴─────────────┴───────────────┘

print(df.groupby('col_str').agg({'col_int': 'min'}))
# shape: (3, 2)
# ┌─────────┬─────────────┐
# │ col_str ┆ col_int_min │
# │ ---     ┆ ---         │
# │ str     ┆ i64         │
# ╞═════════╪═════════════╡
# │ c       ┆ 1           │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ b       ┆ 3           │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ a       ┆ 1           │
# └─────────┴─────────────┘

More complex aggregations

df_agg = df.groupby('col_str').agg([
    pl.col('col_float').sum(),
    pl.sum('col_int'),  # shorthand
    pl.sum('col_int').alias('int_sum'),  # you can name the column yourself
    pl.col('col_int').list(),  # can also collect into a list
    pl.col('col_int').first(),  # also count, mean, and so on
    (pl.col('col_int') > 2).sum().alias(
        'col_int_gt_2_count'),  # count the rows that satisfy a condition
])
print(df_agg)
# shape: (3, 7)
# ┌─────────┬────────────────┬─────────────┬─────────┬───────────────┬───────────────┬───────────────┐
# │ col_str ┆ col_float_sum  ┆ col_int_sum ┆ int_sum ┆ col_int_agg_l ┆ col_int_first ┆ col_int_gt_2_ │
# │ ---     ┆ ---            ┆ ---         ┆ ---     ┆ ist           ┆ ---           ┆ count         │
# │ str     ┆ f64            ┆ i64         ┆ i64     ┆ ---           ┆ i64           ┆ ---           │
# │         ┆                ┆             ┆         ┆ list [i64]    ┆               ┆ u32           │
# ╞═════════╪════════════════╪═════════════╪═════════╪═══════════════╪═══════════════╪═══════════════╡
# │ b       ┆ 1.1            ┆ 12          ┆ 12      ┆ [3, 5, 4]     ┆ 3             ┆ 3             │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ c       ┆ 1.1            ┆ 6           ┆ 6       ┆ [2, 3, 1]     ┆ 2             ┆ 1             │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ a       ┆ 2.1            ┆ 10          ┆ 10      ┆ [1, 6, ... 2] ┆ 1             ┆ 1             │
# └─────────┴────────────────┴─────────────┴─────────┴───────────────┴───────────────┴───────────────┘

Window functions

Max and mean within each window

df_window = df.select([
    # 'col_str',
    # 'col_int',
    # 'col_float',
    pl.all(),  # select all columns of the original df
    pl.col('col_int')
    .max()
    .over('col_str')
    .alias('max_int_by_str'),
    pl.col('col_float')
    .mean()
    .over('col_str')
    .alias('avg_float_by_str'),
])
print(df_window)
# shape: (10, 5)
# ┌─────────┬─────────┬───────────┬────────────────┬────────────────────┐
# │ col_str ┆ col_int ┆ col_float ┆ max_int_by_str ┆ avg_float_by_str   │
# │ ---     ┆ ---     ┆ ---       ┆ ---            ┆ ---                │
# │ str     ┆ i64     ┆ f64       ┆ i64            ┆ f64                │
# ╞═════════╪═════════╪═══════════╪════════════════╪════════════════════╡
# │ a       ┆ 1       ┆ 0.2       ┆ 6              ┆ 0.525              │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ b       ┆ 3       ┆ 0.4       ┆ 5              ┆ 0.3666666666666667 │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ c       ┆ 2       ┆ 0.1       ┆ 3              ┆ 0.3666666666666667 │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ a       ┆ 6       ┆ 0.5       ┆ 6              ┆ 0.525              │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ b       ┆ 5       ┆ 0.6       ┆ 5              ┆ 0.3666666666666667 │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ c       ┆ 3       ┆ 0.8       ┆ 3              ┆ 0.3666666666666667 │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ a       ┆ 1       ┆ 0.9       ┆ 6              ┆ 0.525              │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ b       ┆ 4       ┆ 0.1       ┆ 5              ┆ 0.3666666666666667 │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ a       ┆ 2       ┆ 0.5       ┆ 6              ┆ 0.525              │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ c       ┆ 1       ┆ 0.2       ┆ 3              ┆ 0.3666666666666667 │
# └─────────┴─────────┴───────────┴────────────────┴────────────────────┘

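Rank within each window

# This does not work properly unless you sort by the over() column first
df_window_sort = df.sort('col_str').select([
    pl.all(),
    # Without flatten, every row holds its group's whole rank list
    pl.col('col_int')
    .rank('min')
    .over('col_str')
    .alias('rank_int_list_by_str'),
    # With flatten, each row gets its own rank
    pl.col('col_int')
    .rank('min')
    .over('col_str')
    .flatten()
    .alias('rank_int_by_str'),
])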
print(df_window_sort)
# shape: (10, 5)
# ┌─────────┬─────────┬───────────┬───────────────────────┬──────────────────┐
# │ col_str ┆ col_int ┆ col_float ┆ rank_int_list_by_str  ┆ rank_int_by_str  │
# │ ---     ┆ ---     ┆ ---       ┆ ---                   ┆ ---              │
# │ str     ┆ i64     ┆ f64       ┆ list [u32]            ┆ u32              │
# ╞═════════╪═════════╪═══════════╪═══════════════════════╪══════════════════╡
# │ a       ┆ 1       ┆ 0.2       ┆ [1, 4, ... 3]         ┆ 1                │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ a       ┆ 6       ┆ 0.5       ┆ [1, 4, ... 3]         ┆ 4                │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ a       ┆ 1       ┆ 0.9       ┆ [1, 4, ... 3]         ┆ 1                │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ a       ┆ 2       ┆ 0.5       ┆ [1, 4, ... 3]         ┆ 3                │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ b       ┆ 3       ┆ 0.4       ┆ [1, 3, 2]             ┆ 1                │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ b       ┆ 5       ┆ 0.6       ┆ [1, 3, 2]             ┆ 3                │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ b       ┆ 4       ┆ 0.1       ┆ [1, 3, 2]             ┆ 2                │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ c       ┆ 2       ┆ 0.1       ┆ [2, 3, 1]             ┆ 2                │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ c       ┆ 3       ┆ 0.8       ┆ [2, 3, 1]             ┆ 3                │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ c       ┆ 1       ┆ 0.2       ┆ [2, 3, 1]             ┆ 1                │
# └─────────┴─────────┴───────────┴───────────────────────┴──────────────────┘

There does not seem to be a cumcount, so creating a column filled with 1 and taking its cumsum looks like a good workaround

pivot and melt

df = pl.DataFrame({
    'col_str': ['a', 'a', 'a', 'b', 'b'],
    'col_str_2': ['x', 'y', 'z', 'x', 'y'],
    'col_int': [1, 3, 1, 2, 5],
})

df_pivot = df.groupby('col_str').pivot(
    pivot_column='col_str_2',
    values_column='col_int'
).first()
print(df_pivot)
# shape: (2, 4)
# ┌─────────┬──────┬─────┬─────┐
# │ col_str ┆ z    ┆ x   ┆ y   │
# │ ---     ┆ ---  ┆ --- ┆ --- │
# │ str     ┆ i64  ┆ i64 ┆ i64 │
# ╞═════════╪══════╪═════╪═════╡
# │ a       ┆ 1    ┆ 1   ┆ 3   │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
# │ b       ┆ null ┆ 2   ┆ 5   │
# └─────────┴──────┴─────┴─────┘

df_melt = df_pivot.melt(
    id_vars='col_str',
    value_vars=['x', 'y'])
print(df_melt)
# shape: (4, 3)
# ┌─────────┬──────────┬───────┐
# │ col_str ┆ variable ┆ value │
# │ ---     ┆ ---      ┆ ---   │
# │ str     ┆ str      ┆ i64   │
# ╞═════════╪══════════╪═══════╡
# │ a       ┆ x        ┆ 1     │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
# │ b       ┆ x        ┆ 2     │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
# │ a       ┆ y        ┆ 3     │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
# │ b       ┆ y        ┆ 5     │
# └─────────┴──────────┴───────┘

NaN and NULL

Is NULL a NaN? → NULL
Is NaN a NULL? → False

df = pl.DataFrame(
    [1.0, np.inf, np.nan, None],
    columns=['col_sample']
).with_columns([
    pl.col('col_sample').is_nan().alias('is_nan'),
    pl.col('col_sample').is_not_nan().alias('is_not_nan'),
    pl.col('col_sample').is_null().alias('is_null'),
    pl.col('col_sample').is_not_null().alias('is_not_null'),
])
print(df)
# shape: (4, 5)
# ┌────────────┬────────┬────────────┬─────────┬─────────────┐
# │ col_sample ┆ is_nan ┆ is_not_nan ┆ is_null ┆ is_not_null │
# │ ---        ┆ ---    ┆ ---        ┆ ---     ┆ ---         │
# │ f64        ┆ bool   ┆ bool       ┆ bool    ┆ bool        │
# ╞════════════╪════════╪════════════╪═════════╪═════════════╡
# │ 1          ┆ false  ┆ true       ┆ false   ┆ true        │
# ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ inf        ┆ false  ┆ true       ┆ false   ┆ true        │
# ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ NaN        ┆ true   ┆ false      ┆ false   ┆ true        │
# ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ null       ┆ null   ┆ null       ┆ true    ┆ false       │
# └────────────┴────────┴────────────┴─────────┴─────────────┘

print(df.col_sample.value_counts())  # there is no dropna-like argument
# shape: (4, 2)
# ┌────────────┬────────┐
# │ col_sample ┆ counts │
# │ ---        ┆ ---    │
# │ f64        ┆ u32    │
# ╞════════════╪════════╡
# │ NaN        ┆ 1      │
# ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
# │ null       ┆ 1      │
# ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
# │ 1          ┆ 1      │
# ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
# │ inf        ┆ 1      │
# └────────────┴────────┘

print(df.col_sample.null_count())
# 1
print(df.col_sample.n_unique())
# 4

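There are also methods such as fill_nan, fill_null, and drop_nulls

How fast is it?

I measured how much faster operations such as groupby and join are compared with pandas.
Each case was timed only once, but the results should still convey the speed difference.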

from contextlib import contextmanager
import time
import numpy as np
import pandas as pd
import polars as pl


@contextmanager
def timer(name: str):
    t0 = time.time()
    yield
    print(f'{name}: {time.time() - t0:.1f} s')


np.random.seed(42)

N = 10**8
M = 10**4

df_dict = {
    'col_int': np.random.randint(0, M, N),
    'col_float': np.random.rand(N),
}
df_dict_2 = {
    'col_int': np.random.randint(0, 10**5, M),
    'col_float': np.random.rand(M),
}
df_pd = pd.DataFrame(df_dict)
df_pl = pl.DataFrame(df_dict)
df_pd_2 = pd.DataFrame(df_dict_2)
df_pl_2 = pl.DataFrame(df_dict_2)

with timer('pandas groupby'):
    df_pd.groupby('col_int').agg({'col_float': 'mean'})
with timer('polars groupby'):
    df_pl.groupby('col_int').agg({'col_float': 'mean'})

with timer('pandas join'):
    pd.merge(
        df_pd, df_pd_2, on='col_int',
        how='left', suffixes=['', '_2']
    )
with timer('polars join'):
    df_pl.join(
        df_pl_2, on='col_int',
        how='left', suffix='_2'
    )

with timer('pandas sort'):
    df_pd.sort_values('col_float')
with timer('polars sort'):
    df_pl.sort('col_float')

with timer('pandas filter'):
    df_pd.query('col_float < 0.5')
with timer('polars filter'):
    df_pl.filter(pl.col('col_float') < 0.5)

Results:

operation	pandas	polars
groupby	3.8 s	1.5 s
join	31.0 s	2.6 s
sort	45.8 s	8.9 s
filter	2.6 s	0.9 s
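
Since each case was timed only once, the numbers are noisy. Below is a minimal sketch of a repeat-and-summarize harness using only the standard library. The helper time_repeated and the stand-in workload are hypothetical and not part of the original benchmark.

```python
import statistics
import time


def time_repeated(func, repeats=5):
    """Call func() `repeats` times and return each run's elapsed seconds."""
    durations = []
    for _ in range(repeats):
        t0 = time.perf_counter()  # monotonic, high-resolution clock
        func()
        durations.append(time.perf_counter() - t0)
    return durations


# Stand-in workload; in the benchmark above this could be, for example,
# lambda: df_pl.sort('col_float')
durations = time_repeated(lambda: sorted(range(100_000), reverse=True))
print(f'min: {min(durations):.4f} s, '
      f'median: {statistics.median(durations):.4f} s')
```

Reporting the minimum or the median over several runs is less sensitive to one-off effects such as cache warm-up than a single measurement.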

Closing

That's all. Thank you for reading to the end.
Enjoy a comfortable Python life 🐍

GitHub: https://github.com/pola-rs/polars
User Guide: https://pola-rs.github.io/polars-book/user-guide/index.html
API reference: https://pola-rs.github.io/polars/py-polars/html/reference/index.html
