
[Python] Notes on 'Polars', the Rust-based Pandas

Last updated at 2023-07-21. Posted at 2023-04-28.

What is Polars?

A Python DataFrame library implemented in Rust and designed with Pandas strongly in mind.
Its main selling point is that it runs faster than Pandas.

API

Input/output

Series methods

DataFrame methods

LazyFrame methods

Expressions

Functions

SQL

The actual code

import os
import sys
import time
import glob
import re
import datetime
import csv
import numpy as np
import polars as pl

# Get the current time
now = datetime.datetime.now()
today = now.strftime('%Y-%m-%d %H:%M')

# Read the input file path from the command-line arguments
args = sys.argv
print(os.path.abspath(args[1]))
input_path = os.path.abspath(args[1])
basename = os.path.basename(input_path)
basename_wo_ext = os.path.splitext(os.path.basename(input_path))[0]
dirname = os.path.dirname(input_path)

# Create a DataFrame, method 1: from a NumPy array
rng = np.random.default_rng(0)
df = pl.DataFrame(rng.random((2, 3)), schema=["A", "B", "C"])

# Create a DataFrame, method 2: from a dict of columns
df = pl.DataFrame(
    {
        "Integer": [1, 2, 3, 4],
        "Float": np.array([1, 2, 3, 4], dtype=float),
        "Datetime": [datetime.datetime(2022, 4, 1)] * 4,
        "String": ["test", "train", "test", "train"],
    }
)

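# Read a CSV file
# (`dtypes` is the parameter name in the Polars version used here;
#  later releases rename it to `schema_overrides`)
files_file = input_path
df = pl.read_csv(
    files_file,
    has_header=True,
    skip_rows=0,
    columns=[0, 1, 2, 3, 4, 5],
    dtypes={'ID': str, 'X': float},
    separator=',',
    encoding='utf8',
)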
# Check the size: (rows, columns), number of rows, number of columns
print(df.shape)
print(df.height)
print(df.width)
# Replace null and NaN values with 0
df = df.fill_null(0)
df = df.fill_nan(0)
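# Extract rows by slicing
print(df[0:3])
# Extract a single column as a Series
print(df.get_column("ID"))
# Get all columns as a list of Series
print(df.get_columns())
# Get the 0-based position of a column, counting from the left
# (later Polars releases rename this method to `get_column_index`)
print(df.find_idx_by_name("ID"))
# Select a column (returns a DataFrame)
print(df.select("ID"))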
# Add a new column
new_series = (df.get_column("xcenter") * 2).alias("xcenter2")
df = df.with_columns(new_series)
# Sort by the specified column
df = df.sort("ID")
# Inspect the contents
print(df.glimpse())
print(df.describe())
print(df.columns)
# Check, row by row, whether a column's value matches any of the given strings
print(df.get_column("ID").is_in(['ea02e0be-b90b-48a6-a90']))
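# Add a row-number column (the closest equivalent of Pandas' reset_index;
# later Polars releases rename this method to `with_row_index`)
df = df.with_row_count()
# Filter rows: 1160 < xcenter <= 2000, or 750 < ycenter <= 1000
df = df.filter(
    ((2000 >= pl.col("xcenter")) & (pl.col("xcenter") > 1160))
    | ((1000 >= pl.col("ycenter")) & (pl.col("ycenter") > 750))
)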

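# Filter by how often each ID occurs (keep IDs with 35 < count < 250)
# (the count column is named "counts" in the Polars version used here;
#  later releases name it "count")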
df_counts = df["ID"].value_counts()
df_counts = df_counts.filter((35 < pl.col("counts")) & (250 > pl.col("counts")))
fill_list = df_counts.get_column("ID")
df = df.filter(df.get_column("ID").is_in(fill_list))

# Save the count distribution, sorted by count
df_counts = df["ID"].value_counts()
df_counts = df_counts.select(pl.col(df_counts.columns).sort_by("counts"))
df_counts.write_csv("./counts_by_ID.csv")

# Remove duplicates, keeping the first row for each ID
df = df.unique(subset=["ID"], keep="first")

print(df)
Number_of_employees = df["ID"].value_counts().sum().get_column("counts")[0]
print(Number_of_employees)
df.write_csv("./filtered.csv")

Summary

This article introduced Polars.
