
Renaming a file on S3 with PySpark on Glue

Posted at 2018-11-02, last updated at 2018-11-20

Rename processing

After a fair amount of research, I ended up with the following:

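from pyspark import SparkContext

# SparkContext
sc = SparkContext()

# Java classes, reached through the Py4J gateway
URI        = sc._gateway.jvm.java.net.URI
Path       = sc._gateway.jvm.org.apache.hadoop.fs.Path
# FileSystem.get() is a static factory on the base class; it returns whichever
# implementation is registered for the URI scheme
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem

# Get the Hadoop FileSystem for the bucket
fs = FileSystem.get(URI("s3://{}".format('your.bucket.name')), sc._jsc.hadoopConfiguration())
# rename() returns a boolean, so False means the rename did not happen
fs.rename(
    Path("s3://your.bucket.name/BEFORE_RENAME.csv"),
    Path("s3://your.bucket.name/RENAMED.csv")
)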
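
Because `fs.rename()` reports failure through its boolean return value rather than an exception, a failed rename is easy to miss. The following is a minimal sketch of a wrapper that makes the failure explicit. The helper name `rename_s3_object` and its parameters are hypothetical and not part of Glue or Hadoop; it assumes `fs` and `path_cls` are the `fs` and `Path` objects obtained above, and it relies only on the Hadoop `FileSystem.exists()` and `FileSystem.rename()` methods.

```python
def rename_s3_object(fs, path_cls, src, dst):
    """Rename src to dst through a Hadoop FileSystem handle.

    fs       -- FileSystem object, e.g. the result of FileSystem.get(...)
    path_cls -- callable that builds a path object, e.g. the Py4J Path class
    src, dst -- full URIs as strings (hypothetical example values below)
    """
    src_path = path_cls(src)
    dst_path = path_cls(dst)

    # Fail early with a clear message if the source object is missing
    if not fs.exists(src_path):
        raise FileNotFoundError("source does not exist: {}".format(src))

    # rename() returns False instead of raising when it cannot rename
    if not fs.rename(src_path, dst_path):
        raise IOError("rename failed: {} -> {}".format(src, dst))

    return dst


# Usage inside a Glue job, with fs and Path defined as in the snippet above:
# rename_s3_object(
#     fs, Path,
#     "s3://your.bucket.name/BEFORE_RENAME.csv",
#     "s3://your.bucket.name/RENAMED.csv",
# )
```

Passing `fs` and `path_cls` in as arguments keeps the helper independent of how the SparkContext was created. Note that S3 has no native rename, so the operation is carried out as a copy followed by a delete and is not atomic.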

Running into a bug when using this on S3

HADOOP-13574 - Unnecessary file existence check causes problems with S3

When writing to S3 with overwrite specified and the file does not exist yet, a faulty conditional in the code apparently causes an Exception?
