Parquet in Julia (Part 0)

Last updated at 2024-12-13. Posted at 2024-12-13.

To read and write Parquet files in Julia, the Parquet.jl package is the usual choice. It supports the Apache Parquet format and can read and write data frames.

Whether you use it or not is up to you...

Basic usage is shown below.

1. Installing the package

If needed, first install the Parquet.jl package.

# using Pkg
# Pkg.add("Parquet")
# Pkg.add("DataFrames")  # when used together with DataFrame

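2. Reading a Parquet file

Use read_parquet to read a Parquet file.
To convert the result to a data frame, pass it to the DataFrame constructor.

Note: if you already have a Parquet file, specify its file name and run the code below.
If you do not have one, first run Section 3 to create a temporary Parquet file ("sample.parquet"), then run the code below.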

using Parquet
using DataFrames

# Read the Parquet file and convert it to a data frame
parquet_file = "sample.parquet"
df = read_parquet(parquet_file) |> DataFrame

# Display the DataFrame
df |> println

5×2 DataFrame
 Row │ A       B
     │ Int64?  String?
─────┼─────────────────
   1 │      1  a
   2 │      2  b
   3 │      3  c
   4 │      4  d
   5 │      5  e
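
The `Int64?` and `String?` labels in the header are DataFrames shorthand for `Union{Missing, Int64}` and `Union{Missing, String}`: the columns read back from the file allow missing values even though none are present here. As a minimal sketch using only Base Julia (the vector `col` is a hypothetical stand-in for column A, not something returned by Parquet.jl), such a column can be narrowed to a strict element type once you know it holds no `missing`:

```julia
# Hypothetical stand-in for column A as read back from the Parquet file.
col = Union{Missing, Int64}[1, 2, 3, 4, 5]

# The element type admits `missing`; DataFrames displays this as `Int64?`.
println(eltype(col))

# Refuse to narrow if any value is actually missing.
if any(ismissing, col)
    error("column contains missing values")
end

# Narrow to a strict Int64 vector.
strict = convert(Vector{Int64}, col)
println(eltype(strict))
```

Whether narrowing is worth doing depends on what follows; many functions accept the `Union{Missing, T}` column as it is.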

3. Writing to a Parquet file

The write_parquet function writes data to a Parquet file.
The code below writes a data frame out as a Parquet file.

# Create a sample data frame
df = DataFrame(A = 1:5, B = ["a", "b", "c", "d", "e"])

# Write it to a Parquet file
output_file = "sample.parquet"
write_parquet(output_file, df)
println("Wrote Parquet file: $output_file")

Wrote Parquet file: sample.parquet

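4. Checking efficiency and execution speed

We build a table with the contents below, write it out both as a CSV file and as a Parquet file, and compare execution time and file size.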

using DataFrames
using CSV
using Parquet
using Random

n = 10_000_000  # 10 million rows
tbl = (
    int32 = Int32.(1:n),
    int64 = Int64.(1:n),
    float32 = Float32.(1:n),
    float64 = Float64.(1:n),
    bool = rand(Bool, n),
    string = [randstring(8) for i in 1:n],
    int32m = rand([missing, 1:100...], n),  # note: the non-missing elements are Int64 here, as in int64m
    int64m = rand([missing, 1:100...], n),
    float32m = rand([missing, Float32.(1:100)...], n),
    float64m = rand([missing, Float64.(1:100)...], n),
    boolm = rand([missing, true, false], n),
    stringm = rand([missing, "abc", "def", "ghi"], n)
);

@time write_parquet("test.parquet", tbl);

  4.841535 seconds (67.96 M allocations: 4.306 GiB, 7.74% gc time, 54.39% compilation time)

@time df = DataFrame(read_parquet("test.parquet"));

  3.513579 seconds (55.19 M allocations: 3.312 GiB, 18.68% gc time, 32.45% compilation time: 13% of which was recompilation)

@time read_parquet("test.parquet");

  0.008872 seconds (1.75 k allocations: 58.398 KiB)

@time CSV.write("test.csv",  df);

 33.501988 seconds (551.77 M allocations: 13.656 GiB, 5.82% gc time, 0.95% compilation time)

@time df = CSV.read("test.csv",  DataFrame);

  7.589468 seconds (1.40 M allocations: 1.537 GiB, 0.92% gc time, 16.50% compilation time: 73% of which was recompilation)
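
The `@time` outputs above include compilation time (more than half of the first `write_parquet` call). As a rough sketch of one way to separate that out, the helper below runs a writer once as a warm-up, times a second run, and reports the resulting file size. The name `measure_write` and the byte-array workload are assumptions made for illustration; they are not part of Parquet.jl or CSV.jl.

```julia
# Hypothetical helper: time a writer after one warm-up call, then report file size.
function measure_write(writer, path::AbstractString)
    writer(path)                      # warm-up run: triggers JIT compilation
    seconds = @elapsed writer(path)   # timed run: compilation is already done
    return (seconds = seconds, bytes = filesize(path))
end

# Stand-in workload that uses only Base. To measure the real thing, pass
# something like `path -> write_parquet(path, tbl)` instead.
data = rand(UInt8, 1_000_000)
result = mktempdir() do dir
    measure_write(path -> write(path, data), joinpath(dir, "demo.bin"))
end

println(result.seconds, " s, ", result.bytes, " bytes")
```

Because the timed run overwrites the file produced by the warm-up, the reported size is that of a single write.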

Item	Parquet	CSV	Parquet/CSV
Read time (s)	4.421574	7.663862	0.57694
Read time 2 (s)	0.001304	7.663862	0.00017
Write time (s)	3.835010	28.799825	0.13316
File size	358.5 MB	744.7 MB	0.48140

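"Read time 2" is the processing time when the result is not converted to a data frame. Such a small figure suggests that read_parquet alone does not load the column data until it is accessed. The table values differ from the @time outputs above, so they appear to come from a separate run.
Even without converting to a data frame, you can do quite a lot.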

tbl = read_parquet("test.parquet");

x = skipmissing(tbl[10])  # column 10 is float64m

skipmissing(Union{Missing, Float64}[1.0, 36.0, 4.0, 91.0, 63.0, 57.0, 87.0, 94.0, 52.0, 5.0  …  51.0, 26.0, 45.0, 12.0, 1.0, 32.0, 66.0, 67.0, 56.0, 59.0])

using Statistics
mean(x)

50.489708745041774

y = filter(z -> !isnan(z), x);  # materializes the non-missing values as a Vector; this data contains no NaN

length(y), mean(y)

(9901222, 50.489708745041774)
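
The same pattern can be tried on a small vector without any Parquet file. The sketch below uses only Base and the Statistics standard library; `col` is a hypothetical stand-in for a column such as `tbl[10]`.

```julia
using Statistics

# Hypothetical small column with missing entries, standing in for tbl[10].
col = Union{Missing, Float64}[1.0, missing, 4.0, missing, 7.0]

vals = skipmissing(col)            # lazy wrapper; no copy is made
n_valid = count(!ismissing, col)   # number of non-missing entries
avg = mean(vals)                   # mean over the non-missing entries only
kept = collect(vals)               # materialize as a Vector{Float64}

println((n_valid, avg, eltype(kept)))
```

`collect` is enough to obtain a plain vector here; a `filter` with `!isnan` is only needed when the data may also contain NaN.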
