
[polars] Data analysis operations and a mental model of how they work

Posted at 2023-02-12 (last updated at 2023-02-12)

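What is polars?

polars is a DataFrame library.
Reference: 超高速…だけじゃない！Pandasに代えてPolarsを使いたい理由 ("Blazing fast, and more: why I want to use Polars instead of Pandas")
As the article above also says in the passage quoted below, polars is faster than pandas, and I think it is also an excellent library in terms of how easy it is to write and read.

"Polars is usually introduced on the strength of its speed, but my point was that it is also great in terms of ease of writing and ease of use."

This article mainly records, as snippets, the operations I use often in data analysis. For concepts and behavior specific to polars, such as Expressions, I also describe how (I imagine) they work, using simple examples.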

Dataset used to start with

We use the Pokémon dataset below, which also appears in the official user guide. (The data changes partway through the article.)
https://gist.github.com/ritchie46/cac6b337ea52281aa23c049250a4ff03

Setup

import pandas as pd
import polars as pl
from datetime import datetime

Reading and writing data

Reading

# read_parquet and similar readers also work.
pokemon_df = pl.read_csv(
    "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv",
    columns=['Name','Type 1','Type 2','HP','Attack','Defense','Legendary']
)
# The original column names are awkward to work with, so strip the spaces and lowercase them
pokemon_df.columns = [col.replace(" ", "").lower() for col in pokemon_df.columns]
pokemon_df

Result

output

shape: (163, 7)
┌───────────────────────┬─────────┬────────┬─────┬────────┬─────────┬───────────┐
│ name                  ┆ type1   ┆ type2  ┆ hp  ┆ attack ┆ defense ┆ legendary │
│ ---                   ┆ ---     ┆ ---    ┆ --- ┆ ---    ┆ ---     ┆ ---       │
│ str                   ┆ str     ┆ str    ┆ i64 ┆ i64    ┆ i64     ┆ bool      │
╞═══════════════════════╪═════════╪════════╪═════╪════════╪═════════╪═══════════╡
│ Bulbasaur             ┆ Grass   ┆ Poison ┆ 45  ┆ 49     ┆ 49      ┆ false     │
│ Ivysaur               ┆ Grass   ┆ Poison ┆ 60  ┆ 62     ┆ 63      ┆ false     │
│ Venusaur              ┆ Grass   ┆ Poison ┆ 80  ┆ 82     ┆ 83      ┆ false     │
│ VenusaurMega Venusaur ┆ Grass   ┆ Poison ┆ 80  ┆ 100    ┆ 123     ┆ false     │
│ ...                   ┆ ...     ┆ ...    ┆ ... ┆ ...    ┆ ...     ┆ ...       │
│ Dratini               ┆ Dragon  ┆ null   ┆ 41  ┆ 64     ┆ 45      ┆ false     │
│ Dragonair             ┆ Dragon  ┆ null   ┆ 61  ┆ 84     ┆ 65      ┆ false     │
│ Dragonite             ┆ Dragon  ┆ Flying ┆ 91  ┆ 134    ┆ 95      ┆ false     │
│ Mewtwo                ┆ Psychic ┆ null   ┆ 106 ┆ 110    ┆ 90      ┆ true      │
└───────────────────────┴─────────┴────────┴─────┴────────┴─────────┴───────────┘

Writing

# csv
pokemon_df.write_csv("tmp.csv")

# parquet
pokemon_df.write_parquet("tmp.parquet")

Converting polars -> pandas

pokemon_df.to_pandas()

Operations on a DataFrame

Checking column names and types

# Shows the column names, dtypes, and the first few values
print(pokemon_df.glimpse())

Result

output

Rows: 163
Columns: 7
$ Name       <str> Bulbasaur, Ivysaur, Venusaur, VenusaurMega Venusaur, Charmander, Charmeleon, Charizard, CharizardMega Charizard X, CharizardMega Charizard Y, Squirtle
$ Type 1     <str> Grass, Grass, Grass, Grass, Fire, Fire, Fire, Fire, Fire, Water
$ Type 2     <str> Poison, Poison, Poison, Poison, None, None, Flying, Dragon, Flying, None
$ HP         <i64> 45, 60, 80, 80, 39, 58, 78, 78, 78, 44
$ Attack     <i64> 49, 62, 82, 100, 52, 64, 84, 130, 104, 48
$ Defense    <i64> 49, 63, 83, 123, 43, 58, 78, 111, 78, 65
$ Legendary <bool> False, False, False, False, False, False, False, False, False, False

Number of rows and columns

# len, shape, etc. can be used just as in pandas
print(pokemon_df.height, pokemon_df.width) # number of rows, number of columns
# > 163 7

Computing summary statistics

pokemon_df.describe()

Result

output

shape: (7, 8)
┌────────────┬───────┬────────┬────────┬───────────┬───────────┬───────────┬───────────┐
│ describe   ┆ Name  ┆ Type 1 ┆ Type 2 ┆ HP        ┆ Attack    ┆ Defense   ┆ Legendary │
│ ---        ┆ ---   ┆ ---    ┆ ---    ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│ str        ┆ str   ┆ str    ┆ str    ┆ f64       ┆ f64       ┆ f64       ┆ f64       │
╞════════════╪═══════╪════════╪════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ count      ┆ 163   ┆ 163    ┆ 163    ┆ 163.0     ┆ 163.0     ┆ 163.0     ┆ 163.0     │
│ null_count ┆ 0     ┆ 0      ┆ 86     ┆ 0.0       ┆ 0.0       ┆ 0.0       ┆ 0.0       │
│ mean       ┆ null  ┆ null   ┆ null   ┆ 65.116564 ┆ 75.349693 ┆ 70.509202 ┆ 0.02454   │
│ std        ┆ null  ┆ null   ┆ null   ┆ 27.92282  ┆ 29.071545 ┆ 28.721173 ┆ 0.155195  │
│ min        ┆ Abra  ┆ Bug    ┆ Dark   ┆ 10.0      ┆ 5.0       ┆ 5.0       ┆ 0.0       │
│ max        ┆ Zubat ┆ Water  ┆ Water  ┆ 250.0     ┆ 155.0     ┆ 180.0     ┆ 1.0       │
│ median     ┆ null  ┆ null   ┆ null   ┆ 61.0      ┆ 73.0      ┆ 65.0      ┆ 0.0       │
└────────────┴───────┴────────┴────────┴───────────┴───────────┴───────────┴───────────┘

Filtering rows

# Numeric column
pokemon_df.filter(pl.col("hp") > 80)

# String column
pokemon_df.filter(pl.col("type1")=="Grass")

# Rows that contain / do not contain a given string (regular expressions are supported)
pokemon_df.filter(pl.col("name").str.contains(r"am$"))
pokemon_df.filter(~pl.col("name").str.contains(r"am$"))

# Multiple conditions (use "|" for an OR condition)
pokemon_df.filter(
    (pl.col("type1")=="Grass")&(pl.col("hp") > 80)
)

Sampling

# Parameter names differ slightly from pandas (e.g. pandas uses random_state)
df1 = pokemon_df.sample(50, seed=1)
df2 = pokemon_df.sample(50, seed=2)

Combining DataFrames

# Concatenate vertically
concat_df = pl.concat([df1, df2])

# Join on a column
join_df = df1.join(df2.select(["name"]), on="name")

Checking for and removing duplicates

# Show the duplicated rows
concat_df.filter(concat_df.is_duplicated()).sort("name")

Result

output

shape: (32, 7)
┌────────────┬───────┬─────────┬─────┬────────┬─────────┬───────────┐
│ name       ┆ type1 ┆ type2   ┆ hp  ┆ attack ┆ defense ┆ legendary │
│ ---        ┆ ---   ┆ ---     ┆ --- ┆ ---    ┆ ---     ┆ ---       │
│ str        ┆ cat   ┆ cat     ┆ i16 ┆ i16    ┆ i16     ┆ bool      │
╞════════════╪═══════╪═════════╪═════╪════════╪═════════╪═══════════╡
│ Bellsprout ┆ Grass ┆ Poison  ┆ 50  ┆ 75     ┆ 35      ┆ false     │
│ Bellsprout ┆ Grass ┆ Poison  ┆ 50  ┆ 75     ┆ 35      ┆ false     │
│ Charmeleon ┆ Fire  ┆ null    ┆ 58  ┆ 64     ┆ 58      ┆ false     │
│ Charmeleon ┆ Fire  ┆ null    ┆ 58  ┆ 64     ┆ 58      ┆ false     │
│ ...        ┆ ...   ┆ ...     ┆ ... ┆ ...    ┆ ...     ┆ ...       │
│ Starmie    ┆ Water ┆ Psychic ┆ 60  ┆ 75     ┆ 85      ┆ false     │
│ Starmie    ┆ Water ┆ Psychic ┆ 60  ┆ 75     ┆ 85      ┆ false     │
│ Victreebel ┆ Grass ┆ Poison  ┆ 80  ┆ 105    ┆ 65      ┆ false     │
│ Victreebel ┆ Grass ┆ Poison  ┆ 80  ┆ 105    ┆ 65      ┆ false     │
└────────────┴───────┴─────────┴─────┴────────┴─────────┴───────────┘

# Equivalent to drop_duplicates in pandas, or DISTINCT in SQL
concat_df.unique()

Result

output

shape: (84, 7)
┌───────────────────────┬──────────┬────────┬─────┬────────┬─────────┬───────────┐
│ name                  ┆ type1    ┆ type2  ┆ hp  ┆ attack ┆ defense ┆ legendary │
│ ---                   ┆ ---      ┆ ---    ┆ --- ┆ ---    ┆ ---     ┆ ---       │
│ str                   ┆ cat      ┆ cat    ┆ i16 ┆ i16    ┆ i16     ┆ bool      │
╞═══════════════════════╪══════════╪════════╪═════╪════════╪═════════╪═══════════╡
│ Jolteon               ┆ Electric ┆ null   ┆ 65  ┆ 65     ┆ 60      ┆ false     │
│ Shellder              ┆ Water    ┆ null   ┆ 30  ┆ 65     ┆ 100     ┆ false     │
│ Electrode             ┆ Electric ┆ null   ┆ 60  ┆ 50     ┆ 70      ┆ false     │
│ VenusaurMega Venusaur ┆ Grass    ┆ Poison ┆ 80  ┆ 100    ┆ 123     ┆ false     │
│ ...                   ┆ ...      ┆ ...    ┆ ... ┆ ...    ┆ ...     ┆ ...       │
│ Flareon               ┆ Fire     ┆ null   ┆ 65  ┆ 130    ┆ 60      ┆ false     │
│ GengarMega Gengar     ┆ Ghost    ┆ Poison ┆ 60  ┆ 65     ┆ 80      ┆ false     │
│ Golbat                ┆ Poison   ┆ Flying ┆ 75  ┆ 80     ┆ 70      ┆ false     │
│ Rapidash              ┆ Fire     ┆ null   ┆ 65  ┆ 100    ┆ 70      ┆ false     │
└───────────────────────┴──────────┴────────┴─────┴────────┴─────────┴───────────┘

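Creating new columns / updating existing ones

There are two main ways to create a new column or update an existing one.

df.select([..])
df.with_columns([..])

Inside [..] you write Expressions, which are operations chained one after another like beads on a string.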

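How Expressions work: a mental model

Let's build an intuition for Expressions from a concrete example.

Expression example

# Take column c, sort it, take the first 2 values, and rename the result to "cc"
pl.col("c").sort().head(2).alias("cc")

It is easiest to understand if you picture the output of each step (operation) in the Expression as a Series that is handed to the next step.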

Example of `df.select([..])`

Preparing sample data

d = {"a": ["x", "x", "x", "y"],"b": [30, 60, 10, 40],"c": [2, 4, 1, 3]}
sample_df = pl.DataFrame(d)

Result

shape: (4, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ x   ┆ 30  ┆ 2   │
│ x   ┆ 60  ┆ 4   │
│ x   ┆ 10  ┆ 1   │
│ y   ┆ 40  ┆ 3   │
└─────┴─────┴─────┘

df.select([..]) runs all the Expressions in the list together. The example below creates two columns at once.

select example 1

sample_df.select([
    pl.col("b").sort().head(2).alias("bb"), # sort "b", take the first 2 values, rename to "bb"
    pl.col("c").head(2).alias("cc"), # take the first 2 values of "c", rename to "cc"
])

Result

output

shape: (2, 2)
┌─────┬─────┐
│ bb  ┆ cc  │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 10  ┆ 2   │
│ 30  ┆ 4   │
└─────┴─────┘

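Each Expression is evaluated on its own into a Series, and the resulting Series are then assembled into the output DataFrame.

Note that the Series produced by the Expressions must all have the same length, or length 1, because columns of different lengths cannot be combined into a DataFrame. The output of an Expression with length 1 is broadcast to match the length of the other Expressions.

select example 2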

sample_df.select([
    pl.col("*"), # all columns, length -> 4
    pl.col("b").sort().alias("bb"), # length -> 4
    pl.col("b").max().alias("b_max"), # length -> 1
])

Result

output

shape: (4, 5)
┌─────┬─────┬─────┬─────┬───────┐
│ a   ┆ b   ┆ c   ┆ bb  ┆ b_max │
│ --- ┆ --- ┆ --- ┆ --- ┆ ---   │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64   │
╞═════╪═════╪═════╪═════╪═══════╡
│ x   ┆ 30  ┆ 2   ┆ 10  ┆ 60    │← the maximum of b (60) is broadcast to every row of b_max
│ x   ┆ 60  ┆ 4   ┆ 30  ┆ 60    │
│ x   ┆ 10  ┆ 1   ┆ 40  ┆ 60    │ 
│ y   ┆ 40  ┆ 3   ┆ 60  ┆ 60    │
└─────┴─────┴─────┴─────┴───────┘

Example of `with_columns([..])`

with_columns adds columns to an existing DataFrame. It is used the same way as select.

df.with_columns([..]) example

sample_df.with_columns([
    pl.col("b").sort(reverse=True).suffix("_desc_sort"),
    pl.col("c").sort().suffix("_asc_sort"),
])

Result

output

shape: (4, 5)
┌─────┬─────┬─────┬─────────────┬─────────────┐
│ a   ┆ b   ┆ c   ┆ b_desc_sort ┆ c_asc_sort  │
│ --- ┆ --- ┆ --- ┆ ---         ┆ ---         │
│ str ┆ i64 ┆ i64 ┆ i64         ┆ i64         │
╞═════╪═════╪═════╪═════════════╪═════════════╡
│ x   ┆ 30  ┆ 2   ┆ 60          ┆ 1           │
│ x   ┆ 60  ┆ 4   ┆ 40          ┆ 2           │
│ x   ┆ 10  ┆ 1   ┆ 30          ┆ 3           │
│ y   ┆ 40  ┆ 3   ┆ 10          ┆ 4           │
└─────┴─────┴─────┴─────────────┴─────────────┘

If no column name is given via alias, prefix, suffix, and so on, the result appears to overwrite the column named in col() within the Expression.

No column name specified

sample_df.with_columns([pl.col("b").sort(reverse=True)])

Result

output

shape: (4, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ x   ┆ 60  ┆ 2   │
│ x   ┆ 40  ┆ 4   │
│ x   ┆ 30  ┆ 1   │
│ y   ┆ 10  ┆ 3   │
└─────┴─────┴─────┘

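Things you can do with Expressions

Expressions support a wide variety of operations. Let's look at a few examples, again using the pokemon data.

Type conversion

The available data types are listed here: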
https://pola-rs.github.io/polars-book/user-guide/datatypes.html

print(f"before: {pokemon_df.dtypes}")
pokemon_df = pokemon_df.with_columns([
    pl.col("^type.$").cast(pl.Categorical),
    pl.col(["hp", "attack", "defense"]).cast(pl.Int16),
])
print(f"after: {pokemon_df.dtypes}")
# before: [Utf8, Utf8, Utf8, Int64, Int64, Int64, Boolean]
# after: [Utf8, Categorical, Categorical, Int16, Int16, Int16, Boolean]

Conditional branching (if / case statements)

pokemon_df.select([
    pl.col(["name", "hp"]),
    pl.when(pl.col("hp")>=100).then(pl.lit("high"))
        .when(pl.col("hp")>=50).then(pl.lit("mid"))
        .otherwise(pl.lit("low")).alias("rating_cat")
])

Result

output

┌───────────────────────┬─────┬────────────┐
│ name                  ┆ hp  ┆ rating_cat │
│ ---                   ┆ --- ┆ ---        │
│ str                   ┆ i16 ┆ str        │
╞═══════════════════════╪═════╪════════════╡
│ Bulbasaur             ┆ 45  ┆ low        │
│ Ivysaur               ┆ 60  ┆ mid        │
│ Venusaur              ┆ 80  ┆ mid        │
│ VenusaurMega Venusaur ┆ 80  ┆ mid        │
│ ...                   ┆ ... ┆ ...        │

shift, diff

pokemon_df.select([
    pl.col(["name", "hp", "type1"]),
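    # Shift down by `periods` rows (use a negative value to shift up)
    pl.col("hp").shift(periods=1).suffix("_shift"),
    # Difference from the value n rows earlier (n rows above); null where there is no earlier value
    pl.col("hp").diff(n=1, null_behavior="ignore").suffix("_diff"),
    # Combined with a window function, the same operations can be applied per group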
    pl.col("hp").shift(periods=1).over("type1").suffix("_shift_by_type1"),
    pl.col("hp").diff(n=1, null_behavior="ignore").over("type1").suffix("_diff_by_type1"),

]).head(8)

Result

output

shape: (8, 7)
┌────────────────────────┬─────┬───────┬──────────┬─────────┬───────────────────┬──────────────────┐
│ name                   ┆ hp  ┆ type1 ┆ hp_shift ┆ hp_diff ┆ hp_shift_by_type1 ┆ hp_diff_by_type1 │
│ ---                    ┆ --- ┆ ---   ┆ ---      ┆ ---     ┆ ---               ┆ ---              │
│ str                    ┆ i16 ┆ cat   ┆ i16      ┆ i16     ┆ i16               ┆ i16              │
╞════════════════════════╪═════╪═══════╪══════════╪═════════╪═══════════════════╪══════════════════╡
│ Bulbasaur              ┆ 45  ┆ Grass ┆ null     ┆ null    ┆ null              ┆ null             │
│ Ivysaur                ┆ 60  ┆ Grass ┆ 45       ┆ 15      ┆ 45                ┆ 15               │
│ Venusaur               ┆ 80  ┆ Grass ┆ 60       ┆ 20      ┆ 60                ┆ 20               │
│ VenusaurMega Venusaur  ┆ 80  ┆ Grass ┆ 80       ┆ 0       ┆ 80                ┆ 0                │
│ Charmander             ┆ 39  ┆ Fire  ┆ 80       ┆ -41     ┆ null              ┆ null             │
│ Charmeleon             ┆ 58  ┆ Fire  ┆ 39       ┆ 19      ┆ 39                ┆ 19               │
│ Charizard              ┆ 78  ┆ Fire  ┆ 58       ┆ 20      ┆ 58                ┆ 20               │
│ CharizardMega          ┆ 78  ┆ Fire  ┆ 78       ┆ 0       ┆ 78                ┆ 0                │
│ Charizard X            ┆     ┆       ┆          ┆         ┆                   ┆                  │
└────────────────────────┴─────┴───────┴──────────┴─────────┴───────────────────┴──────────────────┘

Arithmetic between columns / string concatenation

pokemon_df.select([
    pl.col(["name", "hp", "attack"]),
    # Arithmetic between numeric columns
    (pl.col("hp") + pl.col("attack")).alias("hp_attack_sum"),
    # Concatenating string columns
    pl.col("^type.$"),
    pl.concat_str([pl.col("type1"), pl.col("type2")], sep=" / ").alias("type"),
])

Result

output

┌───────────────────────┬─────┬────────┬───────────────┬─────────┬────────┬─────────────────┐
│ name                  ┆ hp  ┆ attack ┆ hp_attack_sum ┆ type1   ┆ type2  ┆ type            │
│ ---                   ┆ --- ┆ ---    ┆ ---           ┆ ---     ┆ ---    ┆ ---             │
│ str                   ┆ i16 ┆ i16    ┆ i16           ┆ cat     ┆ cat    ┆ str             │
╞═══════════════════════╪═════╪════════╪═══════════════╪═════════╪════════╪═════════════════╡
│ Bulbasaur             ┆ 45  ┆ 49     ┆ 94            ┆ Grass   ┆ Poison ┆ Grass / Poison  │
│ Ivysaur               ┆ 60  ┆ 62     ┆ 122           ┆ Grass   ┆ Poison ┆ Grass / Poison  │
│ Venusaur              ┆ 80  ┆ 82     ┆ 162           ┆ Grass   ┆ Poison ┆ Grass / Poison  │
│ VenusaurMega Venusaur ┆ 80  ┆ 100    ┆ 180           ┆ Grass   ┆ Poison ┆ Grass / Poison  │
│ ...                   ┆ ... ┆ ...    ┆ ...           ┆ ...     ┆ ...    ┆ ...             │

Window functions

pokemon_df.select([
    pl.col(["name", "type1", "hp"]),
    # Mean HP over all rows
    pl.col("hp").mean().alias("all_hp_mean"),
    # Mean HP per type1
    pl.col("hp").mean().over("type1").alias("type1_hp_mean"),
    # HP rank within each type1
    pl.col("hp").rank(method="dense", reverse=True).over("type1").alias("type1_hp_rank")
])

Result

output

┌───────────────────────┬─────────┬─────┬─────────────┬───────────────┬───────────────┐
│ name                  ┆ type1   ┆ hp  ┆ all_hp_mean ┆ type1_hp_mean ┆ type1_hp_rank │
│ ---                   ┆ ---     ┆ --- ┆ ---         ┆ ---           ┆ ---           │
│ str                   ┆ cat     ┆ i16 ┆ f64         ┆ f64           ┆ u32           │
╞═══════════════════════╪═════════╪═════╪═════════════╪═══════════════╪═══════════════╡
│ Bulbasaur             ┆ Grass   ┆ 45  ┆ 65.116564   ┆ 66.153846     ┆ 7             │
│ Ivysaur               ┆ Grass   ┆ 60  ┆ 65.116564   ┆ 66.153846     ┆ 5             │
│ Venusaur              ┆ Grass   ┆ 80  ┆ 65.116564   ┆ 66.153846     ┆ 2             │
│ VenusaurMega Venusaur ┆ Grass   ┆ 80  ┆ 65.116564   ┆ 66.153846     ┆ 2             │
│ ...                   ┆ ...     ┆ ... ┆ ...         ┆ ...           ┆ ...           │

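Aggregation with groupby

Aggregations are written in the form df.groupby(..).agg([..]).
Expressions can be used inside agg just as in select and with_columns. To explain groupby, we reuse sample_df, the DataFrame from the section on how Expressions work.

Contents of sample_df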

output

shape: (4, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ x   ┆ 30  ┆ 2   │
│ x   ┆ 60  ┆ 4   │
│ x   ┆ 10  ┆ 1   │
│ y   ┆ 40  ┆ 3   │
└─────┴─────┴─────┘

groupby example

sample_df.groupby("a").agg([
    pl.col("b").max().suffix("_max"),
    pl.col("c").mean().suffix("_mean"),
])

Result

output

shape: (2, 3)
┌─────┬───────┬──────────┐
│ a   ┆ b_max ┆ c_mean   │
│ --- ┆ ---   ┆ ---      │
│ str ┆ i64   ┆ f64      │
╞═════╪═══════╪══════════╡
│ x   ┆ 60    ┆ 2.333333 │
│ y   ┆ 40    ┆ 3.0      │
└─────┴───────┴──────────┘

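How groupby works: a mental model

Within each group specified in groupby, pl.col behaves as if it returns a list of that group's values (stored in the order in which they appear in the original DataFrame). If you picture the operations chained after pl.col as being applied to this list, it becomes easier to imagine what the processing does.

col only

sample_df.groupby("a").agg([pl.col("b")])

Result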

output

shape: (2, 2)
┌─────┬──────────────┐
│ a   ┆ b            │
│ --- ┆ ---          │
│ str ┆ list[i64]    │
╞═════╪══════════════╡ 
│ x   ┆ [30, 60, 10] │ ← a list of values is returned for each group
│ y   ┆ [40]         │ ← a list of values is returned for each group
└─────┴──────────────┘

Next, chaining first() onto pl.col("b") returns the first value of each group's list.

Chaining first onto col

sample_df.groupby("a").agg([pl.col("b").first()])

Result

output

shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ x   ┆ 30  │ ← original list: [30, 60, 10] → first value = 30
│ y   ┆ 40  │ ← original list: [40] → first value = 40
└─────┴─────┘

Finally, we insert sort() after pl.col("b") and then chain first(). In this case the values in each list are sorted before the first value is taken, so the output changes.

Chaining col → sort → first

sample_df.groupby("a").agg([pl.col("b").sort().first()])

Result

output

shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ x   ┆ 10  │ ← original list: [30, 60, 10] → sorted: [10, 30, 60] → first value = 10
│ y   ┆ 40  │ ← original list: [40] → sorted: [40] → first value = 40
└─────┴─────┘
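
The three results above can be mimicked in plain Python, which is a convenient way to hold the picture in mind. This is only a sketch of the mental model, using dictionaries and lists; it says nothing about how polars implements groupby internally.

```python
from collections import defaultdict

rows = [("x", 30), ("x", 60), ("x", 10), ("y", 40)]  # (a, b) pairs from sample_df

# Step 1: collect the b values of each group in their original order,
# which corresponds to agg([pl.col("b")]).
groups = defaultdict(list)
for a, b in rows:
    groups[a].append(b)

# Step 2: apply the chained operations to each group's list.
first = {a: values[0] for a, values in groups.items()}                # .first()
sorted_first = {a: sorted(values)[0] for a, values in groups.items()}  # .sort().first()

print(dict(groups))  # {'x': [30, 60, 10], 'y': [40]}
print(first)         # {'x': 30, 'y': 40}
print(sorted_first)  # {'x': 10, 'y': 40}
```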

Time series data

For time series data we use the Yosemite temperature data below. ds is the timestamp and y is the temperature.
https://github.com/facebook/prophet/blob/main/examples/example_yosemite_temps.csv

yosemite_df = pl.read_csv(
    'https://raw.githubusercontent.com/facebook/prophet/main/examples/example_yosemite_temps.csv',
)
yosemite_df.columns = ["dt", "temp"] # datetime, temperature

# Exclude missing values
yosemite_df = yosemite_df.filter(pl.col("temp").is_not_nan())

yosemite_df.head()

Result

output

shape: (5, 2)
┌─────────────────────┬──────┐
│ dt                  ┆ temp │
│ ---                 ┆ ---  │
│ str                 ┆ f64  │
╞═════════════════════╪══════╡
│ 2017-05-01 00:00:00 ┆ 27.8 │
│ 2017-05-01 00:05:00 ┆ 27.0 │
│ 2017-05-01 00:10:00 ┆ 26.8 │
│ 2017-05-01 00:15:00 ┆ 26.5 │
│ 2017-05-01 00:20:00 ┆ 25.6 │
└─────────────────────┴──────┘

Converting strings to date types

Use strptime. See the page below for the format syntax.
https://docs.rs/chrono/latest/chrono/format/strftime/index.html

# ts=timestamp
yosemite_df = yosemite_df.select([
    pl.col("dt").str.strptime(pl.Date, fmt="%Y-%m-%d %T").alias("date"),
    pl.col("dt").str.strptime(pl.Datetime, fmt="%Y-%m-%d %T").alias("ts"),
    pl.col("temp")
])
yosemite_df.head()

Result

output

┌────────────┬─────────────────────┬──────┐
│ date       ┆ ts                  ┆ temp │
│ ---        ┆ ---                 ┆ ---  │
│ date       ┆ datetime[μs]        ┆ f64  │
╞════════════╪═════════════════════╪══════╡
│ 2017-05-01 ┆ 2017-05-01 00:00:00 ┆ 27.8 │
│ 2017-05-01 ┆ 2017-05-01 00:05:00 ┆ 27.0 │
│ 2017-05-01 ┆ 2017-05-01 00:10:00 ┆ 26.8 │
│ 2017-05-01 ┆ 2017-05-01 00:15:00 ┆ 26.5 │
│ 2017-05-01 ┆ 2017-05-01 00:20:00 ┆ 25.6 │
└────────────┴─────────────────────┴──────┘

Extracting date and time components (month, hour, minute, etc.)

yosemite_df.select([
    pl.col("ts"),
    pl.col("ts").dt.year().alias("year"),
    pl.col("ts").dt.minute().alias("minute"),
    pl.col("ts").dt.weekday().alias("weekday"), # 1 is Monday, 7 is Sunday
]).head()

Result

output

shape: (5, 4)
┌─────────────────────┬──────┬────────┬─────────┐
│ ts                  ┆ year ┆ minute ┆ weekday │
│ ---                 ┆ ---  ┆ ---    ┆ ---     │
│ datetime[μs]        ┆ i32  ┆ u32    ┆ u32     │
╞═════════════════════╪══════╪════════╪═════════╡
│ 2017-05-01 00:00:00 ┆ 2017 ┆ 0      ┆ 1       │
│ 2017-05-01 00:05:00 ┆ 2017 ┆ 5      ┆ 1       │
│ 2017-05-01 00:10:00 ┆ 2017 ┆ 10     ┆ 1       │
│ 2017-05-01 00:15:00 ┆ 2017 ┆ 15     ┆ 1       │
│ 2017-05-01 00:20:00 ┆ 2017 ┆ 20     ┆ 1       │
└─────────────────────┴──────┴────────┴─────────┘

Filtering rows by date type

# A specific time
yosemite_df.filter(pl.col("ts")==datetime(2017, 5, 4, 3, 15, 0))

Result

output

shape: (1, 3)
┌────────────┬─────────────────────┬──────┐
│ date       ┆ ts                  ┆ temp │
│ ---        ┆ ---                 ┆ ---  │
│ date       ┆ datetime[μs]        ┆ f64  │
╞════════════╪═════════════════════╪══════╡
│ 2017-05-04 ┆ 2017-05-04 03:15:00 ┆ 13.6 │
└────────────┴─────────────────────┴──────┘

# A specific range
yosemite_df.filter(
    pl.col("date")
    .is_between(datetime(2017, 5, 2),datetime(2017, 5, 3))
)

Result

output

shape: (576, 3)
┌────────────┬─────────────────────┬──────┐
│ date       ┆ ts                  ┆ temp │
│ ---        ┆ ---                 ┆ ---  │
│ date       ┆ datetime[μs]        ┆ f64  │
╞════════════╪═════════════════════╪══════╡
│ 2017-05-02 ┆ 2017-05-02 00:00:00 ┆ 29.4 │
│ 2017-05-02 ┆ 2017-05-02 00:05:00 ┆ 28.9 │
│ 2017-05-02 ┆ 2017-05-02 00:10:00 ┆ 29.3 │
│ 2017-05-02 ┆ 2017-05-02 00:15:00 ┆ 29.1 │
│ ...        ┆ ...                 ┆ ...  │
│ 2017-05-03 ┆ 2017-05-03 23:40:00 ┆ 35.0 │
│ 2017-05-03 ┆ 2017-05-03 23:45:00 ┆ 34.7 │
│ 2017-05-03 ┆ 2017-05-03 23:50:00 ┆ 34.3 │
│ 2017-05-03 ┆ 2017-05-03 23:55:00 ┆ 33.7 │
└────────────┴─────────────────────┴──────┘

groupby_dynamic

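groupby_dynamic aggregates over windows of time.
Examples: the mean over every 3 hours, the maximum over every 5 days, the median over every month, etc.
Note that before using groupby_dynamic, the data must be sorted by index_column (the time column).
The arguments likely to be used most often are as follows.

index_column: the column used to form the groups. It does not have to be a time series
every: the interval between window start points
period: the length of each window (defaults to the same value as every)
offset: the initial range that is skipped before the first window
include_boundaries: whether to add each window's start and end points as columns. It affects performance because it is harder to parallelize, but it is worth enabling until you are used to the function
by: compute the window aggregation separately for each value of the given column
closed: which sides of the window interval are closed. The default is "left" (a ≤ x < b)

The difference between every, period, and offset is easiest to see in the examples that follow.
See the official documentation for how to write durations.
https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.groupby_dynamic.html#polars.DataFrame.groupby_dynamic

Example 1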

# Don't forget to sort
grp_dynamic = yosemite_df.sort("ts").groupby_dynamic(
    index_column="ts",
    every="15d", # 15days
    period="30d", # 30days
    offset="1mo", # 1month
    include_boundaries=True
)

grp_dynamic.agg([
    pl.col("temp").mean().suffix("_mean"),
    pl.col("temp").min().suffix("_min")
])

output

shape: (3, 5)
┌─────────────────────┬─────────────────────┬─────────────────────┬───────────┬──────────┐
│ _lower_boundary     ┆ _upper_boundary     ┆ ts                  ┆ temp_mean ┆ temp_min │
│ ---                 ┆ ---                 ┆ ---                 ┆ ---       ┆ ---      │
│ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ f64       ┆ f64      │
╞═════════════════════╪═════════════════════╪═════════════════════╪═══════════╪══════════╡
│ 2017-06-01 00:00:00 ┆ 2017-07-01 00:00:00 ┆ 2017-06-01 00:00:00 ┆ 20.996685 ┆ -8.4     │
│ 2017-06-16 00:00:00 ┆ 2017-07-16 00:00:00 ┆ 2017-06-16 00:00:00 ┆ 25.750247 ┆ 6.6      │
│ 2017-07-01 00:00:00 ┆ 2017-07-31 00:00:00 ┆ 2017-07-01 00:00:00 ┆ 24.828881 ┆ 9.0      │
└─────────────────────┴─────────────────────┴─────────────────────┴───────────┴──────────┘

The include_boundaries option generates _lower_boundary and _upper_boundary (the start and end of each window).
The data starts on 2017-5-1 and offset is set to "1mo", so the first window (row 1) starts on 2017-6-1 (one month after 2017-5-1).
every is set to "15d", so the second and third windows start on 2017-6-16 (15 days after 2017-6-1) and 2017-7-1 (15 days after 2017-6-16), respectively.
period is set to "30d", so each window ends 30 days after it starts.
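
The boundaries in the table above can be reproduced by hand with plain datetime arithmetic. This is a simplified sketch of the every / period / offset picture, not the algorithm polars actually uses. The first window start (data start plus one month) is written in directly, and the end of the data is assumed to be 2017-07-05 00:00, the last timestamp that appears in this article's outputs.

```python
from datetime import datetime, timedelta

first_start = datetime(2017, 6, 1)  # data start 2017-05-01 shifted by offset="1mo"
data_end = datetime(2017, 7, 5)     # assumed last timestamp in the data
every = timedelta(days=15)          # every="15d": distance between window starts
period = timedelta(days=30)         # period="30d": length of each window

windows = []
start = first_start
while start <= data_end:            # a window starting after the data ends would be empty
    windows.append((start, start + period))
    start += every

for lower, upper in windows:
    print(lower.date(), upper.date())
# 2017-06-01 2017-07-01
# 2017-06-16 2017-07-16
# 2017-07-01 2017-07-31
```

Because period is longer than every, consecutive windows overlap by 15 days, so each observation contributes to up to two rows of the result.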

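Example 2

With the by argument, the time-window aggregation can also be computed separately for each value of the given column.

# Add a holiday flag (1 for Saturday and Sunday)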
yosemite_df = yosemite_df.select([
    pl.col("*"),
    pl.when(pl.col("date").dt.weekday() >= 6).then(1)
        .otherwise(0).alias("holiday_flg").cast(pl.Utf8)
])

# Don't forget to sort
yosemite_df.sort("ts").groupby_dynamic("ts", every="1mo", include_boundaries=True, by="holiday_flg").agg([
    pl.col("temp").mean().alias("temp_mean")
]).head()

output

shape: (5, 5)
┌─────────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────────┐
│ holiday_flg ┆ _lower_boundary     ┆ _upper_boundary     ┆ ts                  ┆ temp_mean │
│ ---         ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---       │
│ str         ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ f64       │
╞═════════════╪═════════════════════╪═════════════════════╪═════════════════════╪═══════════╡
│ 0           ┆ 2017-05-01 00:00:00 ┆ 2017-06-01 00:00:00 ┆ 2017-05-01 00:00:00 ┆ 15.759481 │
│ 0           ┆ 2017-06-01 00:00:00 ┆ 2017-07-01 00:00:00 ┆ 2017-06-01 00:00:00 ┆ 20.992061 │
│ 0           ┆ 2017-07-01 00:00:00 ┆ 2017-08-01 00:00:00 ┆ 2017-07-01 00:00:00 ┆ 25.119757 │
│ 1           ┆ 2017-05-01 00:00:00 ┆ 2017-06-01 00:00:00 ┆ 2017-05-01 00:00:00 ┆ 13.451432 │
│ 1           ┆ 2017-06-01 00:00:00 ┆ 2017-07-01 00:00:00 ┆ 2017-06-01 00:00:00 ┆ 21.009468 │
└─────────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────────┘

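Because by is set, windows are created separately for each holiday_flg.
offset is not set, so the first window (row 1) of each holiday_flg starts at the start of the data, 2017-5-1.
every is set to "1mo", so the second and third windows start on 2017-6-1 (one month after 2017-5-1) and 2017-7-1 (one month after 2017-6-1), respectively.
period is not set, so it defaults to the same "1mo" as every, and each window ends one month after it starts.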

groupby_rolling

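groupby_rolling creates, for each data point, a window that reaches back from that point's time by the period argument, and aggregates over each window.
As with groupby_dynamic, the data must be sorted by index_column (the time column) beforehand.
The main arguments are as follows.

period: the length of the window
closed: which sides of the window interval are closed. The default is "right" (a < x ≤ b)
Note: the default for groupby_dynamic is "left", so be careful about the difference.

# Don't forget to sort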
rolling_df = yosemite_df.sort("ts").groupby_rolling("ts", period="15m").agg([
    pl.col("temp").sum().alias("rolling_sum")
])

# Join back to the original data to make the result easier to read
yosemite_df.drop("holiday_flg").join(rolling_df, on="ts")

output

shape: (18709, 4)
┌────────────┬─────────────────────┬──────┬─────────────┐
│ date       ┆ ts                  ┆ temp ┆ rolling_sum │
│ ---        ┆ ---                 ┆ ---  ┆ ---         │
│ date       ┆ datetime[μs]        ┆ f64  ┆ f64         │
╞════════════╪═════════════════════╪══════╪═════════════╡
│ 2017-05-01 ┆ 2017-05-01 00:00:00 ┆ 27.8 ┆ 27.8        │
│ 2017-05-01 ┆ 2017-05-01 00:05:00 ┆ 27.0 ┆ 54.8        │
│ 2017-05-01 ┆ 2017-05-01 00:10:00 ┆ 26.8 ┆ 81.6        │
│ 2017-05-01 ┆ 2017-05-01 00:15:00 ┆ 26.5 ┆ 80.3        │
│ ...        ┆ ...                 ┆ ...  ┆ ...         │
│ 2017-07-04 ┆ 2017-07-04 23:45:00 ┆ 43.0 ┆ 129.1       │
│ 2017-07-04 ┆ 2017-07-04 23:50:00 ┆ 42.1 ┆ 127.9       │
│ 2017-07-04 ┆ 2017-07-04 23:55:00 ┆ 42.1 ┆ 127.2       │
│ 2017-07-05 ┆ 2017-07-05 00:00:00 ┆ 41.4 ┆ 125.6       │
└────────────┴─────────────────────┴──────┴─────────────┘
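
The first four rolling_sum values can be reproduced in plain Python, which shows what the default closed="right" means in practice. This is a sketch of the window rule only, using the first four rows shown above. Sums are rounded to one decimal to avoid floating-point noise.

```python
from datetime import datetime, timedelta

# First four rows of the data shown above.
rows = [
    (datetime(2017, 5, 1, 0, 0), 27.8),
    (datetime(2017, 5, 1, 0, 5), 27.0),
    (datetime(2017, 5, 1, 0, 10), 26.8),
    (datetime(2017, 5, 1, 0, 15), 26.5),
]
period = timedelta(minutes=15)

rolling_sum = []
for t, _ in rows:
    # closed="right": keep points with t - period < ts <= t
    window = [temp for ts, temp in rows if t - period < ts <= t]
    rolling_sum.append(round(sum(window), 1))

print(rolling_sum)  # [27.8, 54.8, 81.6, 80.3]
```

At 00:15 the point at 00:00 lies exactly on the open left edge and drops out, which is why the fourth value is 80.3 rather than 108.1.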
