0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

【snowpark/python】snowparkにおけるJSONファイルに対するLATERAL JOINについて

Posted at

JSONファイルをsnowpark上から取り出して、テーブル形式にする方法についてまとめました。
join_table_functionによるLATERAL JOIN(元テーブルに対して一行ずつ適用しながらカラムを追加する)
方法やFlattenによるJSONの中身を行に展開する流れをめもしておきます

python.py
# 動作確認
# JSONデータ
json_data = [
    {"customer": [{"name": "Joyce Ridgely", "phone": "16504378889"}]},
    {"customer": [{"name": "Bradley Green", "phone": "12127593751"}]}
]
values = [(json.dumps(row),) for row in json_data]
df = session.create_dataframe(values, schema=["src"])
df = df.with_column("src", parse_json(col("src")))
df.show()
----------------------------------
|"SRC"                           |
----------------------------------
|{                               |
|  "customer": [                 |
|    {                           |
|      "name": "Joyce Ridgely",  |
|      "phone": "16504378889"    |
|    }                           |
|  ]                             |
|}                               |
|{                               |
|  "customer": [                 |
|    {                           |
|      "name": "Bradley Green",  |
|      "phone": "12127593751"    |
|    }                           |
|  ]                             |
|}                               |
----------------------------------

下記のようにjoin_table_functionを使用してflattenを使用すると、selectすると
JSON形式のファイルを取得することができる。

python.py
flattened_df2 = df.join_table_function("flatten", col("src")["customer"]).select(col("value")["name"].alias("name"), col("value")["phone"].alias("phone"))
flattened_df2.show()
-----------------------------------
|"NAME"           |"PHONE"        |
-----------------------------------
|"Joyce Ridgely"  |"16504378889"  |
|"Bradley Green"  |"12127593751"  |
-----------------------------------

順を追ってみていきたいと思います。下記のようにjoin_table_functionを使ってLATERAL JOINを実施するとsrc列のcustomerのキーの部分を展開して取り出すことができる。

python.py
flattened_df2 = df.join_table_function("flatten", col("src")["customer"]) 
flattened_df2.show()
---------------------------------------------------------------------------------------------------------------------------------
|"SRC"                           |"SEQ"  |"KEY"  |"PATH"  |"INDEX"  |"VALUE"                     |"THIS"                        |
---------------------------------------------------------------------------------------------------------------------------------
|{                               |1      |NULL   |[0]     |0        |{                           |[                             |
|  "customer": [                 |       |       |        |         |  "name": "Joyce Ridgely",  |  {                           |
|    {                           |       |       |        |         |  "phone": "16504378889"    |    "name": "Joyce Ridgely",  |
|      "name": "Joyce Ridgely",  |       |       |        |         |}                           |    "phone": "16504378889"    |
|      "phone": "16504378889"    |       |       |        |         |                            |  }                           |
|    }                           |       |       |        |         |                            |]                             |
|  ]                             |       |       |        |         |                            |                              |
|}                               |       |       |        |         |                            |                              |
|{                               |2      |NULL   |[0]     |0        |{                           |[                             |
|  "customer": [                 |       |       |        |         |  "name": "Bradley Green",  |  {                           |
|    {                           |       |       |        |         |  "phone": "12127593751"    |    "name": "Bradley Green",  |
|      "name": "Bradley Green",  |       |       |        |         |}                           |    "phone": "12127593751"    |
|      "phone": "12127593751"    |       |       |        |         |                            |  }                           |
|    }                           |       |       |        |         |                            |]                             |
|  ]                             |       |       |        |         |                            |                              |
|}                               |       |       |        |         |                            |                              |
---------------------------------------------------------------------------------------------------------------------------------

ここでselectメソッドを使用してみると、一般的なselectメソッドのように"VALUE"列を取り出すことができる。

python.py
flattened_df2 = df.join_table_function("flatten", col("src")["customer"]).select(col("value"))
flattened_df2.show()
------------------------------
|"VALUE"                     |
------------------------------
|{                           |
|  "name": "Joyce Ridgely",  |
|  "phone": "16504378889"    |
|}                           |
|{                           |
|  "name": "Bradley Green",  |
|  "phone": "12127593751"    |
|}                           |
------------------------------

ここで、nameのキーのみを取り出したいなと思う時は、col("value")["name"]の形式で"name"のキーを指定することができる。

flattened_df2 = df.join_table_function("flatten", col("src")["customer"]).select(col("value")["name"].alias("name"))
flattened_df2.show()
-------------------
|"NAME"           |
-------------------
|"Joyce Ridgely"  |
|"Bradley Green"  |
-------------------

上記のように取り出すことができました。
ここでphone列も取り出したいなという時は、下記のように取り出すことができる。

python.py
flattened_df2 = df.join_table_function("flatten", col("src")["customer"]).select(col("value")["name"].alias("name"),col("value")["phone"].alias("phone"))
flattened_df2.show()
-----------------------------------
|"NAME"           |"PHONE"        |
-----------------------------------
|"Joyce Ridgely"  |"16504378889"  |
|"Bradley Green"  |"12127593751"  |
-----------------------------------

なお、下記のようにcol("src")までの指定する場合だと

python.py
flattened_df2 = df.join_table_function("flatten", col("src")]) 
flattened_df2.show()
------------------------------------------------------------------------------------------------------------------------------------------
|"SRC"                           |"SEQ"  |"KEY"     |"PATH"    |"INDEX"  |"VALUE"                       |"THIS"                          |
------------------------------------------------------------------------------------------------------------------------------------------
|{                               |1      |customer  |customer  |NULL     |[                             |{                               |
|  "customer": [                 |       |          |          |         |  {                           |  "customer": [                 |
|    {                           |       |          |          |         |    "name": "Joyce Ridgely",  |    {                           |
|      "name": "Joyce Ridgely",  |       |          |          |         |    "phone": "16504378889"    |      "name": "Joyce Ridgely",  |
|      "phone": "16504378889"    |       |          |          |         |  }                           |      "phone": "16504378889"    |
|    }                           |       |          |          |         |]                             |    }                           |
|  ]                             |       |          |          |         |                              |  ]                             |
|}                               |       |          |          |         |                              |}                               |
|{                               |2      |customer  |customer  |NULL     |[                             |{                               |
|  "customer": [                 |       |          |          |         |  {                           |  "customer": [                 |
|    {                           |       |          |          |         |    "name": "Bradley Green",  |    {                           |
|      "name": "Bradley Green",  |       |          |          |         |    "phone": "12127593751"    |      "name": "Bradley Green",  |
|      "phone": "12127593751"    |       |          |          |         |  }                           |      "phone": "12127593751"    |
|    }                           |       |          |          |         |]                             |    }                           |
|  ]                             |       |          |          |         |                              |  ]                             |
|}                               |       |          |          |         |                              |}                               |
------------------------------------------------------------------------------------------------------------------------------------------

下記のように[0]を指定して、リストから取り出す必要がありそう。

python.py
flattened_df2 = df.join_table_function("flatten", col("src")).select(col("value")[0]["name"])
flattened_df2.show()
--------------------------
|"""VALUE""[0]['NAME']"  |
--------------------------
|"Joyce Ridgely"         |
|"Bradley Green"         |
--------------------------

おまけ flatten

下記でも同じ結果が得られそうでした。

df = df.with_column("src", parse_json(col("src")))
flattened_df = df.flatten(col("src")["customer"])
flattened_df.show()
---------------------------------------------------------------------------------------------------------------------------------
|"SRC"                           |"SEQ"  |"KEY"  |"PATH"  |"INDEX"  |"VALUE"                     |"THIS"                        |
---------------------------------------------------------------------------------------------------------------------------------
|{                               |1      |NULL   |[0]     |0        |{                           |[                             |
|  "customer": [                 |       |       |        |         |  "name": "Joyce Ridgely",  |  {                           |
|    {                           |       |       |        |         |  "phone": "16504378889"    |    "name": "Joyce Ridgely",  |
|      "name": "Joyce Ridgely",  |       |       |        |         |}                           |    "phone": "16504378889"    |
|      "phone": "16504378889"    |       |       |        |         |                            |  }                           |
|    }                           |       |       |        |         |                            |]                             |
|  ]                             |       |       |        |         |                            |                              |
|}                               |       |       |        |         |                            |                              |
|{                               |2      |NULL   |[0]     |0        |{                           |[                             |
|  "customer": [                 |       |       |        |         |  "name": "Bradley Green",  |  {                           |
|    {                           |       |       |        |         |  "phone": "12127593751"    |    "name": "Bradley Green",  |
|      "name": "Bradley Green",  |       |       |        |         |}                           |    "phone": "12127593751"    |
|      "phone": "12127593751"    |       |       |        |         |                            |  }                           |
|    }                           |       |       |        |         |                            |]                             |
|  ]                             |       |       |        |         |                            |                              |
|}                               |       |       |        |         |                            |                              |
---------------------------------------------------------------------------------------------------------------------------------
flattened_df2 = flattened_df.select(col("value")["name"], col("value")["phone"])
flattened_df2.show()
----------------------------------------------
|"""VALUE""['NAME']"  |"""VALUE""['PHONE']"  |
----------------------------------------------
|"Joyce Ridgely"      |"16504378889"         |
|"Bradley Green"      |"12127593751"         |
----------------------------------------------
0
0
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?