JSONファイルをsnowpark上から取り出して、テーブル形式にする方法についてまとめました。
join_table_function
によるLATERAL JOIN(元テーブルに対して一行ずつ適用しながらカラムを追加する)
方法やFlattenによるJSONの中身を行に展開する流れをめもしておきます
# 動作確認
# JSONデータ
json_data = [
{"customer": [{"name": "Joyce Ridgely", "phone": "16504378889"}]},
{"customer": [{"name": "Bradley Green", "phone": "12127593751"}]}
]
values = [(json.dumps(row),) for row in json_data]
df = session.create_dataframe(values, schema=["src"])
df = df.with_column("src", parse_json(col("src")))
df.show()
----------------------------------
|"SRC" |
----------------------------------
|{ |
| "customer": [ |
| { |
| "name": "Joyce Ridgely", |
| "phone": "16504378889" |
| } |
| ] |
|} |
|{ |
| "customer": [ |
| { |
| "name": "Bradley Green", |
| "phone": "12127593751" |
| } |
| ] |
|} |
----------------------------------
下記のようにjoin_table_function
を使用してflattenを使用すると、selectすると
JSON形式のファイルを取得することができる。
flattened_df2 = df.join_table_function("flatten", col("src")["customer"]).select(col("value")["name"].alias("name"), col("value")["phone"].alias("phone"))
flattened_df2.show()
-----------------------------------
|"NAME" |"PHONE" |
-----------------------------------
|"Joyce Ridgely" |"16504378889" |
|"Bradley Green" |"12127593751" |
-----------------------------------
順を追ってみていきたいと思います。下記のようにjoin_table_function
を使ってLATERAL JOINを実施するとsrc列のcustomerのキーの部分を展開して取り出すことができる。
flattened_df2 = df.join_table_function("flatten", col("src")["customer"])
flattened_df2.show()
---------------------------------------------------------------------------------------------------------------------------------
|"SRC" |"SEQ" |"KEY" |"PATH" |"INDEX" |"VALUE" |"THIS" |
---------------------------------------------------------------------------------------------------------------------------------
|{ |1 |NULL |[0] |0 |{ |[ |
| "customer": [ | | | | | "name": "Joyce Ridgely", | { |
| { | | | | | "phone": "16504378889" | "name": "Joyce Ridgely", |
| "name": "Joyce Ridgely", | | | | |} | "phone": "16504378889" |
| "phone": "16504378889" | | | | | | } |
| } | | | | | |] |
| ] | | | | | | |
|} | | | | | | |
|{ |2 |NULL |[0] |0 |{ |[ |
| "customer": [ | | | | | "name": "Bradley Green", | { |
| { | | | | | "phone": "12127593751" | "name": "Bradley Green", |
| "name": "Bradley Green", | | | | |} | "phone": "12127593751" |
| "phone": "12127593751" | | | | | | } |
| } | | | | | |] |
| ] | | | | | | |
|} | | | | | | |
---------------------------------------------------------------------------------------------------------------------------------
ここでselectメソッドを使用してみると、一般的なselectメソッドのように"VALUE"列を取り出すことができる。
flattened_df2 = df.join_table_function("flatten", col("src")["customer"]).select(col("value"))
flattened_df2.show()
------------------------------
|"VALUE" |
------------------------------
|{ |
| "name": "Joyce Ridgely", |
| "phone": "16504378889" |
|} |
|{ |
| "name": "Bradley Green", |
| "phone": "12127593751" |
|} |
------------------------------
ここで、nameのキーのみを取り出したいなと思う時は、col("value")["name"]
の形式で"name"のキーを指定することができる。
flattened_df2 = df.join_table_function("flatten", col("src")["customer"]).select(col("value")["name"].alias("name"))
flattened_df2.show()
-------------------
|"NAME" |
-------------------
|"Joyce Ridgely" |
|"Bradley Green" |
-------------------
上記のように取り出すことができました。
ここでphone列も取り出したいなという時は、下記のように取り出すことができる。
flattened_df2 = df.join_table_function("flatten", col("src")["customer"]).select(col("value")["name"].alias("name"),col("value")["phone"].alias("phone"))
flattened_df2.show()
-----------------------------------
|"NAME" |"PHONE" |
-----------------------------------
|"Joyce Ridgely" |"16504378889" |
|"Bradley Green" |"12127593751" |
-----------------------------------
なお、下記のようにcol("src")
までの指定する場合だと
flattened_df2 = df.join_table_function("flatten", col("src")])
flattened_df2.show()
------------------------------------------------------------------------------------------------------------------------------------------
|"SRC" |"SEQ" |"KEY" |"PATH" |"INDEX" |"VALUE" |"THIS" |
------------------------------------------------------------------------------------------------------------------------------------------
|{ |1 |customer |customer |NULL |[ |{ |
| "customer": [ | | | | | { | "customer": [ |
| { | | | | | "name": "Joyce Ridgely", | { |
| "name": "Joyce Ridgely", | | | | | "phone": "16504378889" | "name": "Joyce Ridgely", |
| "phone": "16504378889" | | | | | } | "phone": "16504378889" |
| } | | | | |] | } |
| ] | | | | | | ] |
|} | | | | | |} |
|{ |2 |customer |customer |NULL |[ |{ |
| "customer": [ | | | | | { | "customer": [ |
| { | | | | | "name": "Bradley Green", | { |
| "name": "Bradley Green", | | | | | "phone": "12127593751" | "name": "Bradley Green", |
| "phone": "12127593751" | | | | | } | "phone": "12127593751" |
| } | | | | |] | } |
| ] | | | | | | ] |
|} | | | | | |} |
------------------------------------------------------------------------------------------------------------------------------------------
下記のように[0]を指定して、リストから取り出す必要がありそう。
flattened_df2 = df.join_table_function("flatten", col("src")).select(col("value")[0]["name"])
flattened_df2.show()
--------------------------
|"""VALUE""[0]['NAME']" |
--------------------------
|"Joyce Ridgely" |
|"Bradley Green" |
--------------------------
おまけ flatten
下記でも同じ結果が得られそうでした。
df = df.with_column("src", parse_json(col("src")))
flattened_df = df.flatten(col("src")["customer"])
flattened_df.show()
---------------------------------------------------------------------------------------------------------------------------------
|"SRC" |"SEQ" |"KEY" |"PATH" |"INDEX" |"VALUE" |"THIS" |
---------------------------------------------------------------------------------------------------------------------------------
|{ |1 |NULL |[0] |0 |{ |[ |
| "customer": [ | | | | | "name": "Joyce Ridgely", | { |
| { | | | | | "phone": "16504378889" | "name": "Joyce Ridgely", |
| "name": "Joyce Ridgely", | | | | |} | "phone": "16504378889" |
| "phone": "16504378889" | | | | | | } |
| } | | | | | |] |
| ] | | | | | | |
|} | | | | | | |
|{ |2 |NULL |[0] |0 |{ |[ |
| "customer": [ | | | | | "name": "Bradley Green", | { |
| { | | | | | "phone": "12127593751" | "name": "Bradley Green", |
| "name": "Bradley Green", | | | | |} | "phone": "12127593751" |
| "phone": "12127593751" | | | | | | } |
| } | | | | | |] |
| ] | | | | | | |
|} | | | | | | |
---------------------------------------------------------------------------------------------------------------------------------
flattened_df2 = flattened_df.select(col("value")["name"], col("value")["phone"])
flattened_df2.show()
----------------------------------------------
|"""VALUE""['NAME']" |"""VALUE""['PHONE']" |
----------------------------------------------
|"Joyce Ridgely" |"16504378889" |
|"Bradley Green" |"12127593751" |
----------------------------------------------