
What happens when you filter a data frame in Julia

Posted 2023-01-28, last updated 2023-01-28

Following the article below, I tried to see how the same operations behave in Julia.
https://qiita.com/yossy_stu/items/9c943c2f01b5a941666d

Working environment

Item	Version
Mac mini
Chip	Apple M1
Memory	8 GB
macOS Ventura	13.1
Python	3.11.0
pandas	1.5.1
Julia	1.8.5
Query	1.0.0
jupyterlab	3.4.1

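The bottom line: direct indexing, df[df.foo .> x, :], is the fastest, but it is still about 10 times slower than Python. Note that the filter timings below are first calls, and the @time output attributes over 99% of each one to compilation.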

using DataFrames, CSV
@time df = CSV.read("../SN_d_tot_V2.0.csv", DataFrame,
    delim=";",
    header=["year", "month", "day", "date_frac", "num", "std", "obs spot", "certanty"]
    );

  3.139465 seconds (1.55 M allocations: 73.513 MiB, 0.55% gc time, 99.25% compilation time: 85% of which was recompilation)

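In Python, reading the file takes about 0.030 to 0.035 s. Julia is about 2 times faster, although the 3.14 s shown above is a first call in which 99.25% of the time is compilation.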
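Because `@time` on a first call includes JIT compilation, it can help to time the same call twice and compare. The following is only a minimal sketch in plain Julia, not the measurement used in this article: `count_at_least` and the synthetic `date_frac` vector are hypothetical stand-ins for the real workload.

```julia
# Hypothetical stand-in workload (not from the article):
# count the elements of `v` that are at least `threshold`.
count_at_least(v, threshold) = count(>=(threshold), v)

# Synthetic data standing in for the `date_frac` column.
date_frac = collect(range(1818.0, 2022.99, length=74_875))

# First call: the elapsed time includes compiling `count_at_least`
# for these argument types.
t_first = @elapsed count_at_least(date_frac, 2000.0)

# Second call: the compiled method is reused, so this is closer to
# the steady-state run time.
t_second = @elapsed count_at_least(date_frac, 2000.0)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

Comparing a Julia first call against an already-warm Python call mostly measures compilation, so the second number is usually the fairer one to quote.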

Reading only the column that is actually used does not make it noticeably faster.

@time df2 = CSV.read("../SN_d_tot_V2.0.csv", DataFrame,
    delim=";",
    header=["year", "month", "day", "date_frac", "num", "std", "obs spot", "certanty"],
    select=["num"]
    );

  0.034364 seconds (46.22 k allocations: 3.047 MiB, 52.56% compilation time)

using Query

# Select rows
@time df |>
    @filter(_.date_frac >= 2000) |>
    @take(3) |>
    DataFrame

  0.422586 seconds (2.46 M allocations: 128.434 MiB, 3.40% gc time, 99.25% compilation time)

3×8 DataFrame

Row	year	month	day	date_frac	num	std	obs spot	certanty
	Int64	Int64	Int64	Float64	Int64	Float64	Int64	Int64
1	2000	1	1	2000.0	71	3.8	14	1
2	2000	1	2	2000.0	75	4.1	10	1
3	2000	1	3	2000.01	80	3.9	13	1

This is very slow compared with Python's df_dSN.query('date_frac >= 2000').head(3), which takes about 0.003 s.

@time df[df.date_frac .>= 2000, :] |> (x -> first(x, 3))

  0.170744 seconds (474.33 k allocations: 23.493 MiB, 99.75% compilation time: 65% of which was recompilation)

3×8 DataFrame

Row	year	month	day	date_frac	num	std	obs spot	certanty
	Int64	Int64	Int64	Float64	Int64	Float64	Int64	Int64
1	2000	1	1	2000.0	71	3.8	14	1
2	2000	1	2	2000.0	75	4.1	10	1
3	2000	1	3	2000.01	80	3.9	13	1

This is very slow compared with Python's df_dSN[df_dSN['date_frac'] >= 2000].head(3), which takes about 0.002 s.
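The row selector in `df[df.date_frac .>= 2000, :]` is an ordinary Boolean vector produced by the broadcast comparison `.>=`. Here is a minimal sketch of that step, using plain vectors as stand-ins for two columns so that it runs without DataFrames.jl. The first two entries are made-up illustration values; the last three are the rows shown in the output above.

```julia
# Plain vectors standing in for the `date_frac` and `num` columns.
# The first two entries are made up; the last three come from the
# output shown above.
date_frac = [1999.99, 1999.996, 2000.0, 2000.0, 2000.01]
num       = [120, 98, 71, 75, 80]

# `.>=` compares element by element and returns a Boolean mask.
# Without the dot, `date_frac >= 2000` throws a MethodError, because
# `>=` is not defined between a Vector and a number.
mask = date_frac .>= 2000

# Indexing with the mask keeps only the entries where it is true.
selected = num[mask]

println(mask)
println(first(selected, 3))
```

With a data frame the same mask goes in the row position and `:` selects all columns.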

@time filter(:date_frac => x -> x >= 2000, df) |> (x -> first(x, 3))
size(df)

  0.168857 seconds (1.53 M allocations: 82.448 MiB, 9.10% gc time, 99.77% compilation time)

(74875, 8)

df2 = copy(df)
@time filter!(:date_frac => x -> x >= 2000, df2) |> (x -> first(x, 3))
size(df2)

  0.159689 seconds (799.11 k allocations: 41.806 MiB, 4.68% gc time, 99.54% compilation time)

(8401, 8)

Operating in place makes almost no difference in speed.
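What does differ between `filter` and `filter!` is what happens to the input, which is why `size(df)` stays at 74875 rows while `size(df2)` drops to 8401. The sketch below shows the same semantics with Base Julia vectors. The data and the `is_recent` helper are made up for illustration, and DataFrames.jl takes a `column => predicate` pair as the first argument instead of a bare predicate.

```julia
# Made-up values standing in for the `date_frac` column.
date_frac = [1998.5, 1999.5, 2000.0, 2000.5, 2001.0]

# Hypothetical predicate, equivalent to `x -> x >= 2000` in the article.
is_recent(x) = x >= 2000

# `filter` returns a new vector and leaves its input untouched.
kept = filter(is_recent, date_frac)
println(length(date_frac))   # still 5
println(length(kept))        # 3

# `filter!` removes the non-matching elements from its argument, so
# work on a copy when the original is still needed.
working = copy(date_frac)
filter!(is_recent, working)
println(length(working))     # 3
```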
