[Tidbit] Using CRLF as the record delimiter for CSV output from AWS Glue/PySpark [with caveats]

Posted at 2023-09-05

Conclusion

Neither DynamicFrame nor the Spark DataFrame writer can produce CRLF, and the default is LF
Call toPandas() and specify the delimiter with lineterminator

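Why CRLF?

CSV files in the wild are quite chaotic. I want to produce something at least a little "standard".
If it is not hard to do, I would like to conform to the RFC properly.

So, what is the standard record delimiter?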

RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files

Each record is located on a separate line, delimited by a line break (CRLF).

That is what it says.
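As a small illustration of what CRLF-delimited records look like, here is a minimal sketch using Python's standard csv module, whose default dialect happens to write "\r\n" as the line terminator. The sample rows are hypothetical and are not taken from any Glue job.

```python
import csv
import io

# Hypothetical sample rows, used only for illustration.
rows = [["id", "name"], ["1", "alice"], ["2", "bob"]]

buffer = io.StringIO()
writer = csv.writer(buffer)  # default "excel" dialect uses lineterminator="\r\n"
writer.writerows(rows)

csv_text = buffer.getvalue()
print(repr(csv_text))  # every record, including the last, ends with \r\n
```

Printing with repr() makes the otherwise invisible \r\n visible.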

RFC 7111 - URI Fragment Identifiers for the text/csv Media Type

Encoding considerations: CSV MIME entities consist of binary data [RFC6838]. As per Section 4.1.1. of RFC 2046 [RFC2046], this media type uses CRLF to denote line breaks. However, implementers should be aware that some implementations may use other values.

This one also says CRLF, yet adds that it depends on the implementation? Hmm...

What about Glue?

Using the CSV format in AWS Glue - AWS Glue

AWS Glue supports using the comma-separated value (CSV) format. This format is a minimal, row-based data format. CSVs often don't strictly conform to a standard, but you can refer to RFC 4180 and RFC 7111 for more information.

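...What does "supports" mean here? Because these are separate sentences, it can be read as not actually claiming that Glue follows RFC 4180/7111.

So can it do CRLF or not? Which is it?

First, DynamicFrame

Looking further down the documentation page above...

There is no option for the record delimiter.

What about a Spark DataFrame...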

pyspark.sql.DataFrameWriter.csv — PySpark 3.1.2 documentation

lineSep : str, optional
defines the line separator that should be used for writing. If None is set, it uses the default value, \n. Maximum length is 1 character.

The default is LF. (Isn't that ignoring RFC 4180/7111?)
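To confirm which record separator an output file actually uses, the raw bytes can be inspected directly. A minimal sketch follows; the helper name count_line_endings is hypothetical, and the count is naive in that it also includes line breaks embedded inside quoted fields.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF sequences and bare LF bytes in raw file content."""
    crlf = data.count(b"\r\n")
    # Every CRLF contains one LF, so subtract those to get the bare LFs.
    lf_only = data.count(b"\n") - crlf
    return {"crlf": crlf, "lf_only": lf_only}


# Hypothetical byte strings standing in for the contents of an output file.
print(count_line_endings(b"id,name\n1,alice\n"))
print(count_line_endings(b"id,name\r\n1,alice\r\n"))
```

For a real file, the bytes would come from reading it in binary mode, for example open(path, "rb").read().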

So setting lineSep to CRLF looks like it should do the trick, but...

[SPARK-34529] spark.read.csv is throwing exception ,"lineSep' can contain only 1 character" when parsing windows line feed (CR LF) - ASF JIRA

Surprisingly, only a single character is allowed.
Specifying CRLF (\r\n) gives:

pyspark.sql.utils.IllegalArgumentException: requirement failed: 'lineSep' can contain only 1 character.

and it duly fails with an exception.

Then pandas it is

No choice, then...

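# Convert the DynamicFrame to a Spark DataFrame, then collect it into pandas on the driver
spark_df = source_dynamicframe.toDF()
pandas_df = spark_df.toPandas()
# lineterminator sets the record delimiter (it was named line_terminator before pandas 1.5)
pandas_df.to_csv(output_path, lineterminator="\r\n")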

This is what it comes to.
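The effect of lineterminator can be checked without a Glue job. The following is a minimal sketch, assuming pandas 1.5 or later (where the parameter is named lineterminator) and a hypothetical in-memory DataFrame standing in for the result of toPandas(). It writes to a StringIO buffer instead of output_path, and passes index=False to leave out the pandas row index, which is an added choice that the snippet above does not make.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for spark_df.toPandas().
pandas_df = pd.DataFrame({"id": [1, 2], "name": ["alice", "bob"]})

buffer = io.StringIO()
# index=False omits the row index column; lineterminator sets the record delimiter.
pandas_df.to_csv(buffer, index=False, lineterminator="\r\n")

csv_text = buffer.getvalue()
print(repr(csv_text))  # records are terminated by \r\n
```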

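Hold on?

If we end up in pandas anyway, the obvious question is why we are using Spark (PySpark) in the first place.
toPandas() collects all the data onto the driver, so the write is no longer parallel or distributed. Keep this in mind when writing large volumes of data to CSV. (I have not tested this yet.)

My impression is that systems that demand CRLF often also want everything in a single file, so in practice this may be an acceptable constraint.

What still bugs me

Why does the RFC specify CRLF in the first place...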