Filtering a data frame in Julia

Posted at 2023-01-28

Following the article below, I tried the same thing in Julia to see how it compares.
https://qiita.com/yossy_stu/items/9c943c2f01b5a941666d

Environment

Item Version
Mac mini
Chip Apple M1
Memory 8 GB
macOS Ventura 13.1
Python 3.11.0
pandas 1.5.1
Julia 1.8.5
Query 1.0.0
jupyterlab 3.4.1

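In short, direct indexing with df[df.foo .> x, :] is as fast as any of the Julia approaches tried here, but in the first-call @time measurements below it is still at least 10 times slower than Python.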

using DataFrames, CSV
@time df = CSV.read("../SN_d_tot_V2.0.csv", DataFrame,
    delim=";",
    header=["year", "month", "day", "date_frac", "num", "std", "obs spot", "certanty"]
    );

  3.139465 seconds (1.55 M allocations: 73.513 MiB, 0.55% gc time, 99.25% compilation time: 85% of which was recompilation)

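Reading the file takes about 0.030 to 0.035 s in Python. The 3.14 s reported above for Julia is 99.25% compilation time; with compilation excluded, Julia is about twice as fast.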
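As a rough check on the read timing, the non-compilation share of the `@time` result can be estimated from the two numbers printed above (3.139465 seconds, 99.25% compilation time). This is only back-of-the-envelope arithmetic on the reported figures, not a separate measurement; the variable names are chosen for illustration.

```julia
# Figures copied from the @time output of CSV.read above
total_seconds    = 3.139465
compile_fraction = 0.9925          # "99.25% compilation time"

# Time left once compilation is subtracted (an estimate, since the percentage is rounded)
run_seconds = total_seconds * (1 - compile_fraction)
println("estimated time excluding compilation: ", round(run_seconds, digits = 4), " s")
```

The estimate comes out at a few hundredths of a second, the same order as the 0.030 to 0.035 s quoted for Python.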

Reading only the column that is actually used does not make it noticeably faster.

@time df2 = CSV.read("../SN_d_tot_V2.0.csv", DataFrame,
    delim=";",
    header=["year", "month", "day", "date_frac", "num", "std", "obs spot", "certanty"],
    select=["num"]
    );
  0.034364 seconds (46.22 k allocations: 3.047 MiB, 52.56% compilation time)
using Query
# Select rows
@time df |>
    @filter(_.date_frac >= 2000) |>
    @take(3) |>
    DataFrame
  0.422586 seconds (2.46 M allocations: 128.434 MiB, 3.40% gc time, 99.25% compilation time)
3×8 DataFrame
 Row │ year   month  day    date_frac  num    std      obs spot  certanty
     │ Int64  Int64  Int64  Float64    Int64  Float64  Int64     Int64
─────┼────────────────────────────────────────────────────────────────────
   1 │  2000      1      1    2000.0      71      3.8        14         1
   2 │  2000      1      2    2000.0      75      4.1        10         1
   3 │  2000      1      3    2000.01     80      3.9        13         1

This is very slow compared with Python's df_dSN.query('date_frac >= 2000').head(3), which takes about 0.003 s.

@time df[df.date_frac .>= 2000, :] |> (x -> first(x, 3))
  0.170744 seconds (474.33 k allocations: 23.493 MiB, 99.75% compilation time: 65% of which was recompilation)
3×8 DataFrame
 Row │ year   month  day    date_frac  num    std      obs spot  certanty
     │ Int64  Int64  Int64  Float64    Int64  Float64  Int64     Int64
─────┼────────────────────────────────────────────────────────────────────
   1 │  2000      1      1    2000.0      71      3.8        14         1
   2 │  2000      1      2    2000.0      75      4.1        10         1
   3 │  2000      1      3    2000.01     80      3.9        13         1

This is also very slow compared with Python's df_dSN[df_dSN['date_frac'] >= 2000].head(3), which takes about 0.002 s.
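The direct-indexing form relies on a Boolean mask built with the broadcast comparison `.>=`. Without the dot, comparing a whole vector against a scalar is not defined in Julia. The following is a minimal sketch on a small made-up table; `toy` and its values are hypothetical and are not taken from the sunspot data.

```julia
using DataFrames

# Small hypothetical table for illustration only
toy = DataFrame(date_frac = [1999.5, 2000.0, 2000.5], num = [10, 71, 75])

mask   = toy.date_frac .>= 2000    # element-wise comparison yields one Bool per row
picked = toy[mask, :]              # keep rows where the mask is true, all columns

println(mask)
println(picked)
```

Applied to the same table, `filter(:date_frac => x -> x >= 2000, toy)` selects the same rows.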

@time filter(:date_frac => x -> x >= 2000, df) |> (x -> first(x, 3))
size(df)
  0.168857 seconds (1.53 M allocations: 82.448 MiB, 9.10% gc time, 99.77% compilation time)

(74875, 8)
df2 = copy(df)
@time filter!(:date_frac => x -> x >= 2000, df2) |> (x -> first(x, 3))
size(df2)
  0.159689 seconds (799.11 k allocations: 41.806 MiB, 4.68% gc time, 99.54% compilation time)

(8401, 8)

Filtering in place makes almost no difference in speed.
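Every filtering result above reports more than 99% compilation time, so the figures mostly reflect Julia's first-call JIT cost rather than the filtering itself. One way to separate the two, sketched here, is to wrap the operation in a function and time it twice. The table `df_demo` is a synthetic stand-in with a made-up `date_frac` range, not the CSV used in this article, and `select_recent` is a name introduced only for this sketch.

```julia
using DataFrames

# Synthetic stand-in: same row count as the article's table, hypothetical values
df_demo = DataFrame(date_frac = collect(range(1818.0, 2022.99, length = 74_875)))

# The direct-indexing filter from above, wrapped so it can be called repeatedly
select_recent(d) = first(d[d.date_frac .>= 2000, :], 3)

t_first  = @elapsed select_recent(df_demo)   # first call: includes compilation
t_second = @elapsed select_recent(df_demo)   # second call: already compiled

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

The second figure is the one to set against the Python timings, because pandas has no comparable per-session compilation step.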
