Copying large files between S3 buckets in the same region

Posted at 2020-07-23, last updated at 2020-07-23

Overview

I looked into how to efficiently copy a large file from an S3 bucket in the Tokyo region to another bucket in the same region.

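Requirements

The target files are 4K video files (.MOV or .MXF); the larger ones range from a few gigabytes to tens of gigabytes. At this size, Multipart Upload is a prerequisite, because a single-request S3 copy is limited to 5 GB.
I want to minimize both time and cost.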

Hypothesis

The boto3 S3 client provides a function called copy, and its documentation states:

This is a managed transfer which will perform a multipart copy in multiple threads if necessary.

so this looks like the right tool. However, I did not know what to set for "The transfer configuration" (the Config argument).
Reference: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.copy

What I tried

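I created two buckets in the Tokyo region with default settings.
The file used for the copy is a 5.09 GB (5,471,004,718 bytes) video file.
The Lambda function runtime is Python 3.8.
The Lambda function used in the experiment: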

lambda_function.py

import boto3

def copy(src_bucket, src_key, dest_bucket, dest_key):
    # Managed transfer: boto3 performs a multipart copy when the object is large enough.
    s3client = boto3.client('s3')
    s3client.copy(
        {"Bucket": src_bucket, "Key": src_key},
        dest_bucket, dest_key
    )

def lambda_handler(event, context):
    # The event carries the source and destination bucket/key pairs.
    src_bucket = event['src_bucket']
    src_key = event['src_key']
    dest_bucket = event['dest_bucket']
    dest_key = event['dest_key']
    copy(src_bucket, src_key, dest_bucket, dest_key)
    return "OK"

As the source shows, the Lambda function takes the source bucket and key and the destination bucket and key as input.

{
  "src_bucket": "source-bucket",
  "src_key": "test_4k.mov",
  "dest_bucket": "destination-bucket",
  "dest_key": "test_4k.mov"
}
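
Because the handler indexes the event directly, a missing key surfaces as a bare KeyError. The following is a minimal sketch of validating the event up front. The helper name parse_copy_event and its error message are illustrative assumptions and are not part of the function used in the experiment.

```python
REQUIRED_KEYS = ("src_bucket", "src_key", "dest_bucket", "dest_key")


def parse_copy_event(event):
    """Return (src_bucket, src_key, dest_bucket, dest_key) or raise ValueError.

    Hypothetical helper: the handler in this article reads the keys directly.
    """
    # Treat absent and empty values alike as missing.
    missing = [key for key in REQUIRED_KEYS if not event.get(key)]
    if missing:
        raise ValueError("missing event keys: " + ", ".join(missing))
    return tuple(event[key] for key in REQUIRED_KEYS)
```

The handler could then unpack the returned tuple and pass it straight to copy().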

Result with no Config specified

REPORT RequestId: d5f00753-aae0-41a5-8e1a-e6c18747b114	Duration: 36905.20 ms	
Billed Duration: 37000 ms	Memory Size: 128 MB	Max Memory Used: 91 MB

The billed duration was 37 seconds.
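
Since cost is one of the goals, it helps to see how Duration maps to a charge. The sketch below rounds Duration up to the 100 ms increment that the REPORT lines in this article reflect and multiplies by the memory size. The price per GB-second is an assumption for illustration; check the current Lambda pricing before relying on it.

```python
import math

BILLING_GRANULARITY_MS = 100  # increment seen in this article's REPORT lines
PRICE_PER_GB_SECOND = 0.0000166667  # assumption: verify against current pricing


def billed_duration_ms(duration_ms):
    """Round a Duration up to the next billing increment."""
    return math.ceil(duration_ms / BILLING_GRANULARITY_MS) * BILLING_GRANULARITY_MS


def estimated_cost_usd(duration_ms, memory_mb=128):
    """Estimate the compute charge for one invocation (request fee excluded)."""
    gb_seconds = (billed_duration_ms(duration_ms) / 1000) * (memory_mb / 1024)
    return gb_seconds * PRICE_PER_GB_SECOND


print(billed_duration_ms(36905.20))  # 37000
print(estimated_cost_usd(36905.20))
```

At 128 MB the charge per copy is tiny either way, so the billed duration mostly matters when many files are copied.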

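Result with Config{max_concurrency=20}

max_concurrency appears to be the number of threads. Wondering whether simply raising it would help, I changed it from the default of 10 to 20.
I changed the copy function as follows and ran it.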


from boto3.s3.transfer import TransferConfig

def copy(src_bucket, src_key, dest_bucket, dest_key):
    s3client = boto3.client('s3')
    # Every value is the default except max_concurrency (default 10).
    transfer_config = TransferConfig(
        multipart_threshold=8388608,
        max_concurrency=20,
        multipart_chunksize=8388608,
        num_download_attempts=5,
        max_io_queue=100,
        io_chunksize=262144,
        use_threads=True
    )
    s3client.copy(
        {"Bucket": src_bucket, "Key": src_key},
        dest_bucket, dest_key, Config=transfer_config
    )

Result

A large number of "[WARNING] Connection pool is full, discarding connection" messages were logged.

REPORT RequestId: 655f0859-0599-4231-923a-480cb47fcef8	Duration: 43544.81 ms	
Billed Duration: 43600 ms	Memory Size: 128 MB	Max Memory Used: 98 MB	Init Duration: 251.39 ms

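Perhaps the connection pool overflowed? The duration increased to 43 seconds.
20 threads appears to be unstable, so I fixed max_concurrency at the default of 10 and changed other parameters instead.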

Result with Config{multipart_chunksize=64MB}

Would a larger transfer block help? I raised the chunk size from the default of 8 MB to 64 MB.


from boto3.s3.transfer import TransferConfig

def copy(src_bucket, src_key, dest_bucket, dest_key):
    s3client = boto3.client('s3')
    # Threshold and chunk size raised from the 8 MB default to 64 MB.
    transfer_config = TransferConfig(
        multipart_threshold=64*1024*1024,
        max_concurrency=10,
        multipart_chunksize=64*1024*1024,
        num_download_attempts=5,
        max_io_queue=100,
        io_chunksize=262144,
        use_threads=True
    )
    s3client.copy(
        {"Bucket": src_bucket, "Key": src_key},
        dest_bucket, dest_key, Config=transfer_config
    )

Result

REPORT RequestId: 45c89f9b-47b9-4a26-a7f0-59149f9b3e6a	Duration: 13235.39 ms	
Billed Duration: 13300 ms	Memory Size: 128 MB	Max Memory Used: 87 MB	Init Duration: 254.25 ms

The duration dropped to 13 seconds. Max Memory Used also went down. This looks promising.
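
One plausible reason for the improvement is that a larger chunk size means far fewer parts, and therefore fewer requests. The sketch below counts the parts for the test file at each chunk size tried in this article. It assumes the managed copy issues one request per multipart_chunksize-sized part; that is my reading of the behavior, not something measured here.

```python
MIB = 1024 * 1024
OBJECT_SIZE = 5_471_004_718  # bytes, the test file in this article


def part_count(object_size, chunksize):
    """Number of parts when an object is split into chunksize-byte pieces."""
    # Integer ceiling division: the last part may be smaller than chunksize.
    return (object_size + chunksize - 1) // chunksize


for chunk_mib in (8, 64, 128, 256, 512):
    print(chunk_mib, "MiB ->", part_count(OBJECT_SIZE, chunk_mib * MIB), "parts")
```

Going from 8 MiB to 64 MiB cuts the part count from 653 to 82, while the later steps save only a few dozen parts each.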

Result with Config{multipart_chunksize=128MB}

REPORT RequestId: f6d1bec9-7c0f-4c2e-ae56-04ddd6f8d929	Duration: 11707.85 ms	
Billed Duration: 11800 ms	Memory Size: 128 MB	Max Memory Used: 86 MB	Init Duration: 255.65 ms

Just under 12 seconds. There still seems to be room for improvement.

Result with Config{multipart_chunksize=256MB}

REPORT RequestId: 1ba387d6-0da3-4f7f-8893-6a98b63c9368	Duration: 11088.25 ms	
Billed Duration: 11100 ms	Memory Size: 128 MB	Max Memory Used: 88 MB	Init Duration: 236.17 ms

11 seconds.

Result with Config{multipart_chunksize=512MB}

REPORT RequestId: b6eacb3d-fe1c-4abd-9bd6-eedd9ee69aa7	Duration: 11371.11 ms	
Billed Duration: 11400 ms	Memory Size: 128 MB	Max Memory Used: 87 MB	Init Duration: 257.59 ms

11.4 seconds. This is slower than with multipart_chunksize=256MB.
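
To compare the runs on one scale, the sketch below converts each Duration into an effective copy rate and a speedup over the run with no Config. The durations are copied from the REPORT lines above; the max_concurrency=20 run is left out because only the chunk size varies here.

```python
OBJECT_SIZE = 5_471_004_718  # bytes, the test file in this article

# Duration values (ms) from the REPORT lines, all with max_concurrency=10.
DURATION_MS = {
    "no Config (8 MiB)": 36905.20,
    "64 MiB": 13235.39,
    "128 MiB": 11707.85,
    "256 MiB": 11088.25,
    "512 MiB": 11371.11,
}


def throughput_mb_per_s(duration_ms):
    """Effective copy rate in MB/s (1 MB = 10**6 bytes)."""
    return OBJECT_SIZE / 1e6 / (duration_ms / 1000)


baseline = DURATION_MS["no Config (8 MiB)"]
for label, ms in DURATION_MS.items():
    print(f"{label:>18}: {throughput_mb_per_s(ms):6.1f} MB/s, {baseline / ms:.2f}x")

# Chunk size with the shortest Duration.
fastest = min(DURATION_MS, key=DURATION_MS.get)
print("fastest:", fastest)
```

Each configuration was run only once, so differences of a few hundred milliseconds between 128 MiB, 256 MiB, and 512 MiB may be within run-to-run noise.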

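Summary

Conclusion

Results may vary with timing and environment, but in this experiment, when copying a 5 GB file between buckets in the Tokyo region with the boto3 S3 client's copy function, max_concurrency=10 with multipart_chunksize=256*1024*1024 gave the shortest Billed Duration.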

from boto3.s3.transfer import TransferConfig

def copy(src_bucket, src_key, dest_bucket, dest_key):
    s3client = boto3.client('s3')
    # Best setting measured in this article: 256 MB chunks, 10 threads.
    transfer_config = TransferConfig(
        multipart_threshold=256*1024*1024,
        max_concurrency=10,
        multipart_chunksize=256*1024*1024,
        num_download_attempts=5,
        max_io_queue=100,
        io_chunksize=262144,
        use_threads=True
    )
    s3client.copy(
        {"Bucket": src_bucket, "Key": src_key},
        dest_bucket, dest_key, Config=transfer_config
    )

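Discussion

This evaluation used a file of roughly 5 GB, but I believe the results are also a useful reference for files larger than 5 GB. For files much smaller than 5 GB, however, a different setting may be optimal.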
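
When applying a chunk size to larger files, it is worth checking it against S3's multipart limits: at most 10,000 parts, each between 5 MiB and 5 GiB (the last part may be smaller). The sketch below is a simple check under those limits. The function name is hypothetical, and boto3's transfer manager may adjust the chunk size on its own, so treat this as a sanity check rather than a description of what boto3 does.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

MIN_PART_SIZE = 5 * MIB  # S3 minimum for every part except the last
MAX_PART_SIZE = 5 * GIB  # S3 maximum part size
MAX_PARTS = 10_000       # S3 maximum number of parts per multipart upload


def chunksize_is_valid(object_size, chunksize):
    """True if chunksize-byte parts fit within S3's multipart limits."""
    if not MIN_PART_SIZE <= chunksize <= MAX_PART_SIZE:
        return False
    parts = (object_size + chunksize - 1) // chunksize  # ceiling division
    return parts <= MAX_PARTS


print(chunksize_is_valid(50 * GIB, 256 * MIB))  # True: 200 parts
print(chunksize_is_valid(100 * GIB, 8 * MIB))   # False: 12,800 parts
```

For files of tens of gigabytes, 256 MiB chunks stay well inside these limits.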
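In this experiment, raising max_concurrency made the result worse, so I did not dig into it further. From the warning I assumed that S3 limits the connection pool on the write side, but I could not find any documentation describing this. The message is emitted by the client's HTTP library (urllib3), so the limit may instead be the client-side pool size. With a better understanding of S3's write characteristics, there may still be room for tuning.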
