
Working with partitioned Parquet files in PySpark

Spark
Pyspark
Parquet

Last updated at 2022-09-21, posted at 2022-09-18

Reading the data

path = "gs://bucket/table/year=*/month=*/day=*/location=*/*.parquet"
base_path = "gs://bucket/table/"

df = spark.read.option("basePath", base_path).parquet(path)
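The `basePath` option tells Spark where partition discovery starts, so the `key=value` directories below it are exposed as columns. The following is a minimal plain-Python sketch of that idea, not Spark's actual implementation. The helper name `partition_values`, the file name, and the partition values are hypothetical, and unlike Spark (which infers column types by default) it keeps every value as a string.

```python
def partition_values(file_path, base_path):
    """Collect key=value directory segments found between base_path and the file name."""
    if not file_path.startswith(base_path):
        raise ValueError("file_path is not under base_path")
    # Only the directories below base_path count as partition columns.
    relative = file_path[len(base_path):]
    directories = relative.split("/")[:-1]  # drop the file name itself
    values = {}
    for segment in directories:
        if "=" in segment:
            key, value = segment.split("=", 1)
            values[key] = value
    return values


# Hypothetical file under the layout used above.
file_path = "gs://bucket/table/year=2022/month=09/day=18/location=tokyo/part-0000.parquet"

example = partition_values(file_path, "gs://bucket/table/")
print(example)
```

If the base path is moved down to `gs://bucket/table/year=2022/`, the same helper no longer reports `year`, which mirrors why the base path is set to the table root here.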

Take a quick look at the data and check the row count.

df.head()
df.count()

Checking the schema

df.printSchema()

Adding a column

# Import only the function that is used rather than everything in the module
from pyspark.sql.functions import from_unixtime
df_new = df.withColumn('timestamp', from_unixtime('ts_unix'))
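`from_unixtime` takes seconds since the Unix epoch and returns a string, by default in the `yyyy-MM-dd HH:mm:ss` format and in the Spark session time zone. As a rough plain-Python equivalent, the sketch below assumes UTC and that `ts_unix` holds seconds rather than milliseconds. The helper name `from_unixtime_utc` is hypothetical.

```python
from datetime import datetime, timezone


def from_unixtime_utc(seconds, fmt="%Y-%m-%d %H:%M:%S"):
    """Format seconds since the Unix epoch as a string, assuming UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(fmt)


print(from_unixtime_utc(0))           # 1970-01-01 00:00:00
print(from_unixtime_utc(1663459200))  # 2022-09-18 00:00:00
```

If the source column is in milliseconds, it would need to be divided by 1000 before the conversion.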

Writing out

path = "gs://bucket/table/"
# The default save mode raises an error if the path already exists
df.write.option("compression", "snappy").partitionBy("partition_col1","partition_col2").parquet(path)
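`partitionBy` writes each combination of partition values into its own `column=value` directory, and the partition columns are stored in the directory names rather than inside the Parquet files. The sketch below illustrates that layout in plain Python. The helpers `partition_dir` and `file_columns` and the sample row are hypothetical, and details such as escaping of special characters and handling of nulls are ignored.

```python
def partition_dir(base_path, row, partition_cols):
    """Build the directory a row would land in for the given partition columns."""
    segments = [f"{col}={row[col]}" for col in partition_cols]
    return base_path.rstrip("/") + "/" + "/".join(segments) + "/"


def file_columns(row, partition_cols):
    """Columns that remain inside the data file once partition columns are moved to the path."""
    return [col for col in row if col not in partition_cols]


# Hypothetical row and partition columns matching the write call above.
row = {"partition_col1": "a", "partition_col2": 1, "value": 3.14}
cols = ["partition_col1", "partition_col2"]

out_dir = partition_dir("gs://bucket/table/", row, cols)
print(out_dir)                  # gs://bucket/table/partition_col1=a/partition_col2=1/
print(file_columns(row, cols))  # ['value']
```

The order of the columns passed to `partitionBy` determines the nesting order of the directories, so the first column becomes the top-level directory.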
