
Converting a Python list to text, GZIP-compressing it, and uploading it to S3

Posted at 2021-11-20, last updated at 2021-11-20

This was only a rough test, but I am leaving it here as a memo.

Environment

Python 3.7
Execution time measured with Scalene

What I wanted to do

Fetch data from a DB with Python (omitted here)
Turn the fetched data into a text file
GZIP-compress that text and save it to S3
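
The text conversion and compression steps above can be sketched on their own, without the DB or S3 parts. This is a minimal sketch, assuming tab-separated rows as in the source code later in this article; `rows_to_gzip_bytes` and `sample_rows` are hypothetical names introduced only for illustration.

```python
import gzip
from io import BytesIO, TextIOWrapper


def rows_to_gzip_bytes(rows, compresslevel=5):
    """Serialize rows as tab-separated lines and return the gzip-compressed bytes."""
    buf = BytesIO()
    with gzip.GzipFile(mode="wb", fileobj=buf, compresslevel=compresslevel) as gz_file:
        with TextIOWrapper(gz_file, encoding="utf-8") as wrapper:
            for row in rows:
                wrapper.write("\t".join(map(str, row)) + "\n")
    # Both context managers have exited, so the gzip trailer has been written.
    return buf.getvalue()


# Hypothetical sample data standing in for rows fetched from a DB.
sample_rows = [[1, 11, 21], [2, 12, 22]]
payload = rows_to_gzip_bytes(sample_rows)

# Round-trip check: decompress and confirm the text is what was written.
restored = gzip.decompress(payload).decode("utf-8")
```

Decompressing the result locally like this is a cheap way to confirm the payload is a complete gzip stream before it is sent anywhere.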

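Hypothesis

Does writing everything into the in-memory buffer in one go slow down execution?
Processing in moderately sized chunks looks better than processing one line at a time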
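
To make "processing in chunks" concrete, here is a minimal sketch of grouping rows into text blocks of a fixed size. It uses `itertools.islice` rather than the modulo counter in the article's own code, and `iter_text_chunks` is a hypothetical helper name.

```python
from itertools import islice


def iter_text_chunks(rows, chunk_size):
    """Yield tab-separated text blocks of up to chunk_size rows each."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        yield "".join("\t".join(map(str, row)) + "\n" for row in chunk)


# Three rows in chunks of two: one full block, then the remainder.
chunks = list(iter_text_chunks([[1, 2], [3, 4], [5, 6]], chunk_size=2))
```

Each yielded block can be passed to a single `write()` call, and the final partial block is handled by the same loop instead of a separate branch.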

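Results

Execution slowed down as the number of rows converted per write increased
Writing one row at a time was faster than expected, so plain sequential processing looks fine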

Measurements

Rows per write	Execution time (s)
1	4.16
10	4.30
100	4.49
1000	5.30
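
For a rough comparison without a profiler, the same kind of measurement can be sketched with `time.perf_counter`. This is not how the figures above were obtained (those came from Scalene), and the data set here is deliberately smaller, so absolute numbers will differ. `gzip_rows` and `time_row_splits` are hypothetical names.

```python
import gzip
import time
from io import BytesIO, TextIOWrapper


def gzip_rows(rows, row_split):
    """Gzip rows as tab-separated text, writing row_split rows per write() call."""
    buf = BytesIO()
    with gzip.GzipFile(mode="wb", fileobj=buf, compresslevel=5) as gz_file:
        with TextIOWrapper(gz_file, encoding="utf-8") as wrapper:
            pending = []
            for row in rows:
                pending.append("\t".join(map(str, row)) + "\n")
                if len(pending) >= row_split:
                    wrapper.write("".join(pending))
                    pending.clear()
            if pending:
                # Flush the rows left over after the last full chunk.
                wrapper.write("".join(pending))
    return buf.getvalue()


def time_row_splits(rows, row_splits):
    """Return {row_split: elapsed seconds} measured with time.perf_counter."""
    timings = {}
    for row_split in row_splits:
        start = time.perf_counter()
        gzip_rows(rows, row_split)
        timings[row_split] = time.perf_counter() - start
    return timings


# Smaller than the article's data so the comparison finishes quickly.
rows = [[x, x + 10, x + 20, x + 30, x + 40] for x in range(1, 20001)]
timings = time_row_splits(rows, [1, 10, 100, 1000])
for row_split, seconds in timings.items():
    print(f"{row_split}\t{seconds:.3f}")
```

A single run per setting is noisy, so repeating each measurement several times would give a fairer picture.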

Source code

from io import BytesIO, TextIOWrapper
import gzip
from boto3 import Session

# Test data: 999,999 rows of five integers each
testlist: list = [[x, x+10, x+20, x+30, x+40] for x in range(1, 1000000)]
ROW_SPLIT = 10  # number of rows to accumulate before each write

buf = BytesIO()
with gzip.GzipFile(mode='wb', fileobj=buf, compresslevel=5) as gz_file:
    with TextIOWrapper(gz_file, encoding='utf-8') as wrapper:
        if ROW_SPLIT > 1:
            _line = ""
            for i, row in enumerate(testlist, start=1):
                _line += '\t'.join(map(str, row)) + '\n'
                if i % ROW_SPLIT == 0:
                    # print(f'Writing {ROW_SPLIT} rows\n{_line}')
                    wrapper.write(_line)
                    _line = ""
            # Write any rows left over after the last full chunk
            # print(f'Final write\n{_line}')
            wrapper.write(_line)
        else:
            # Write one row at a time
            for row in testlist:
                wrapper.write('\t'.join(map(str, row)) + '\n')

# Upload to S3 (credentials are read from a named profile here)
s3 = Session(profile_name='my_profile').client('s3')
s3.put_object(Bucket='xxx', Key='my_key/test.gz', Body=buf.getvalue())
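
The upload step at the end of the listing can also be isolated in a small function that receives the client as an argument, which makes it possible to exercise the logic without real credentials. This is only a sketch: `upload_gzip` is a hypothetical helper, and setting `ContentType` is an optional addition that the original code does not make.

```python
def upload_gzip(s3_client, bucket, key, payload):
    """Upload already-compressed bytes with put_object and return the response."""
    return s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=payload,
        ContentType="application/gzip",  # assumption: not set in the original code
    )


# Hypothetical usage, reusing the article's placeholder names:
#   from boto3 import Session
#   s3 = Session(profile_name="my_profile").client("s3")
#   upload_gzip(s3, "xxx", "my_key/test.gz", buf.getvalue())
```

Passing the client in means any object with a compatible `put_object` method can stand in for S3 during a local check.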

References

Writing a gzip-compressed text file
Scalene: a high-performance CPU, GPU and memory profiler for Python
