[Python] Combining multiple JSON files on S3

Posted at 2021-09-01

Use awswrangler to combine multiple JSON files stored on S3 and write the result back to S3.

Overview

This uses AWS Data Wrangler (the awswrangler package).
The inputs are the following two JSONL files, gzip-compressed as sample1.json.gz and sample2.json.gz.

sample1.jsonl

{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}
{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}

sample2.jsonl

{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}
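To reproduce these inputs, the two files can be generated and gzip-compressed locally before uploading. The following is a minimal sketch using only the standard library. The helper name write_jsonl_gz and the local directory "samples" are assumptions for illustration, and the upload to S3 is not shown.

```python
import gzip
import json
from pathlib import Path

# Records copied from sample1.jsonl and sample2.jsonl above
SAMPLES = {
    "sample1.json.gz": [
        {"id": 1, "father": "Mark", "mother": "Charlotte", "children": ["Tom"]},
        {"id": 2, "father": "John", "mother": "Ann",
         "children": ["Jessika", "Antony", "Jack"]},
    ],
    "sample2.json.gz": [
        {"id": 3, "father": "Bob", "mother": "Monika", "children": ["Jerry", "Karol"]},
    ],
}


def write_jsonl_gz(records, path):
    """Write records as gzip-compressed JSON Lines (one JSON object per line)."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, separators=(",", ":")) + "\n")


out_dir = Path("samples")  # hypothetical local directory, not part of the article
out_dir.mkdir(exist_ok=True)
for name, records in SAMPLES.items():
    write_jsonl_gz(records, out_dir / name)
```

The resulting files can then be uploaded to the input prefix with any S3 client.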

Prerequisites

Install awswrangler beforehand:

$ pip install awswrangler

Code

import awswrangler as wr
from datetime import datetime, timezone

# Input: read_json accepts a list of S3 paths and returns one DataFrame;
# lines=True parses each file as JSON Lines (one object per line)
file_list = ["s3://testbucket/prefix/sample1.json.gz",
             "s3://testbucket/prefix/sample2.json.gz"]
df = wr.s3.read_json(path=file_list, lines=True)

# Output: write the combined rows as JSON Lines to a key named after
# the current UTC timestamp
timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
output_path = f"s3://testbucket/output/{timestamp}"
wr.s3.to_json(
    df=df,
    path=output_path,
    orient="records",
    lines=True,
)
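To check what the combined output should look like without touching S3, the same concatenation can be mimicked locally. This is a rough sketch that swaps awswrangler and pandas for the standard library, so it only illustrates the expected result for records that all share the same keys, as the samples here do. The function names combine_jsonl_gz and to_jsonl are hypothetical, and the in-memory byte strings stand in for the S3 objects.

```python
import gzip
import json


def combine_jsonl_gz(payloads):
    """Decode gzip-compressed JSON Lines payloads and return all records in input order."""
    records = []
    for payload in payloads:
        text = gzip.decompress(payload).decode("utf-8")
        records.extend(json.loads(line) for line in text.splitlines() if line.strip())
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one compact JSON object per line."""
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)


# In-memory stand-ins for sample1.json.gz and sample2.json.gz
sample1 = gzip.compress(
    b'{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}\n'
    b'{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}\n'
)
sample2 = gzip.compress(
    b'{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}\n'
)

combined = combine_jsonl_gz([sample1, sample2])
print(to_jsonl(combined), end="")
```

The printed text contains the three records in order, one per line, which is the shape the orient="records", lines=True output is expected to have.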
